
**Creating meaningful clusters**

No theoretical guideline exists on how to extract the optimal feature vector set from the input vector set for a specific clustering application. Owing to the limited intrinsic information within the feature vector set, it is difficult to design a clustering algorithm that will find clusters to match the desired cluster labels.

This constraint arises because a clustering algorithm tends to find clusters in the feature space regardless of whether any real clusters exist. It also underlies the observation that any two arbitrary patterns can be made to appear equally similar when a large enough number of dimensions of information in the feature space is evaluated, which results in a meaningless clustering function FC. Clustering is therefore subjective in nature and can be tailored to fit any particular application.
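The claim that arbitrary patterns become equally similar in high-dimensional feature spaces can be illustrated numerically. The sketch below is an illustrative aside (not taken from the thesis): it measures how the relative spread of pairwise Euclidean distances between uniformly random points collapses as the dimension grows, so that "nearest" and "farthest" neighbours become nearly indistinguishable.

```python
import numpy as np

def distance_contrast(dim, n=200, seed=0):
    """Relative spread (max - min) / min of the pairwise Euclidean distances
    between n points drawn uniformly at random in the unit hypercube [0, 1]^dim."""
    rng = np.random.default_rng(seed)
    X = rng.random((n, dim))
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d = D[np.triu_indices(n, k=1)]          # unique pairs only
    return (d.max() - d.min()) / d.min()

# The contrast shrinks dramatically as the dimension grows: in high dimensions
# all pairwise distances concentrate around the same value.
for dim in (2, 10, 100, 1000):
    print(dim, round(distance_contrast(dim), 2))
```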

The advantage of this versatility is that a clustering algorithm can be used as either an exploratory or a confirmatory analysis tool. As an exploratory analysis tool, clustering uncovers the underlying structure of the data; no predefined models or hypotheses are needed when exploring the data set. As a confirmatory analysis tool, clustering is used to confirm a set of hypotheses or assumptions. In certain applications clustering serves both roles: first to explore the underlying structure and form new hypotheses, and then to test those hypotheses by clustering the feature vector set. This makes clustering a data-driven learning algorithm, and any available domain knowledge can improve the forming of clusters.

Domain knowledge is used to reduce complexity by aiding processes such as feature selection and feature extraction. Proper domain knowledge leads to a good feature vector representation that yields exceptional performance with even the most common clustering algorithms, whereas incomplete domain knowledge leads to a poor feature vector representation that yields only acceptable performance with a complex clustering algorithm.

An aerial photo is used to illustrate the clustering of different land cover types in figure 4.1. In this image two land cover types are of interest: natural vegetation and human settlement.

**DETERMINING THE NUMBER OF CLUSTERS**

The most difficult design consideration is to determine the correct number of clusters that should be extracted from the data set. Hundreds of methods have been developed to determine the number of clusters within a data set. The choice of the number of clusters K is always ambiguous and is a distinct issue from the process of actually solving the unsupervised clustering problem.

If the number of clusters K is increased without penalty in the design phase (which defeats the purpose of clustering), the number of incorrect cluster assignments steadily decreases to zero. In the extreme case, each feature vector is assigned to its own cluster, which results in zero incorrect cluster allocations. Intuitively, the choice of the number of clusters is therefore a balance between maximum compression of the feature vectors into a single cluster and complete accuracy obtained by assigning each feature vector to its own cluster.

The silhouette value is used as a measure of how close each feature vector is to its own cluster when compared to feature vectors in neighbouring clusters.

The variable |#FC(~xq)| denotes the number of feature vectors within the kth cluster.

The silhouette value S(~xp,K) ranges from −1 to 1. A silhouette value S(~xp,K) → 1 indicates that the feature vector ~xp is very distant from the neighbouring clusters. A silhouette value S(~xp,K) → 0 indicates that the feature vector ~xp lies close to the decision boundary between two clusters. A silhouette value S(~xp,K) → −1 indicates that the feature vector ~xp is probably assigned to the wrong cluster.
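The silhouette value can be computed directly from its definition. The sketch below is a minimal NumPy illustration, assuming Euclidean distance as the dissimilarity measure; the function name `silhouette` and the toy data are illustrative assumptions, not the thesis's implementation.

```python
import numpy as np

def silhouette(X, labels):
    """Silhouette value for every feature vector in X.

    a(x): mean distance from x to the other members of its own cluster.
    b(x): smallest mean distance from x to the members of any other cluster.
    s(x) = (b(x) - a(x)) / max(a(x), b(x)), which lies in [-1, 1].
    """
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:                          # singleton cluster: s taken as 0
            continue
        a = D[i, own].sum() / (own.sum() - 1)       # exclude the point itself
        b = min(D[i, labels == k].mean()
                for k in np.unique(labels) if k != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Two tight, well-separated groups: every point sits deep inside its own
# cluster, so all silhouette values approach 1.
print(silhouette([[0, 0], [0, 1], [10, 0], [10, 1]], [0, 0, 1, 1]))
```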

A silhouette graph is a visual representation of the silhouette values and serves as a visual aid in determining the number of clusters. The x-axis denotes the silhouette values and the y-axis denotes the cluster labels. The silhouette graph shown in figure 4.5 was created from a larger set of segments defined in the land cover classification example (figure 4.3). In this silhouette graph, cluster 3 has high silhouette values, which implies that the feature vectors within cluster 3 are well separated from the other two clusters. Cluster 1 also has high silhouette values, but with a few feature vectors considered to be ill-positioned. Cluster 2 has significantly lower silhouette values, and most of its feature vectors lie close to the boundary between clusters. This might suggest that cluster 2 can be subdivided into two separate clusters.

An analytical method of deciding on the correct number of clusters K is to compute the average of the silhouette values over all feature vectors for each candidate K, and to prefer the value of K that yields the highest average.
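As a sketch of this procedure, the snippet below compares the average silhouette value of two candidate partitions of the same toy data set; the partition with the higher average is preferred. The helper `mean_silhouette` and the synthetic data are illustrative assumptions, not the thesis's implementation.

```python
import numpy as np

def mean_silhouette(X, labels):
    """Average silhouette value over all feature vectors (higher is better)."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    s = []
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:                          # singleton cluster: s taken as 0
            s.append(0.0)
            continue
        a = D[i, own].sum() / (own.sum() - 1)       # mean intra-cluster distance
        b = min(D[i, labels == k].mean()            # nearest neighbouring cluster
                for k in np.unique(labels) if k != labels[i])
        s.append((b - a) / max(a, b))
    return float(np.mean(s))

# Three well-separated 1-D groups centred near 0, 10 and 20.
X = np.array([[0.0], [0.5], [1.0], [10.0], [10.5], [11.0], [20.0], [20.5], [21.0]])
three = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # one cluster per group (K = 3)
two   = [0, 0, 0, 1, 1, 1, 1, 1, 1]   # last two groups merged (K = 2)

print(mean_silhouette(X, three))      # high: each group forms a tight cluster
print(mean_silhouette(X, two))        # lower: merging distinct groups is penalised
```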

**CHAPTER 1 – INTRODUCTION **

1.1 Problem statement

1.2 Objective of this thesis and proposed solution

1.3 Outline of Thesis

**CHAPTER 2 – REMOTE SENSING USED FOR LAND COVER CHANGE DETECTION **

2.1 Overview

2.2 Spontaneous Settlements

2.3 Overview of Remote Sensing

2.4 Electromagnetic radiation

2.5 Earth’s Energy Budget

2.6 MODerate resolution Imaging Spectroradiometer

2.7 Vegetation Indices

2.8 Land cover change detection methods

2.9 Summary

**CHAPTER 3 – SUPERVISED CLASSIFICATION **

3.1 Overview

3.2 Classification

3.3 Supervised Classification

3.4 Artificial Neural Networks

3.5 Other variants of Artificial Neural Networks used for Classification

3.6 Design consideration: Supervised classification

3.7 Summary

**CHAPTER 4 – UNSUPERVISED CLASSIFICATION **

4.1 Overview

4.2 Clustering

4.3 Similarity metric

4.4 Hierarchical clustering algorithms

4.5 Partitional clustering algorithms

4.6 Determining the number of clusters

4.7 Classification of cluster labels

4.8 Summary

**CHAPTER 5 – FEATURE EXTRACTION **

5.1 Overview

5.2 Time series representation

5.3 State-space representation

5.4 Kalman filter

5.5 Extended Kalman filter

5.6 Least squares model fitting

5.7 M-estimate model fitting

5.8 Fourier Transform

5.9 Summary

**CHAPTER 6 – SEASONAL FOURIER FEATURES **

6.1 Overview

6.2 Time series analysis

6.3 Meaningless analysis

6.4 Meaningful clustering

6.5 Change detection method using the seasonal Fourier features

6.6 Summary

**CHAPTER 7 – EXTENDED KALMAN FILTER FEATURES **

7.1 Overview

7.2 Change detection method: Extended Kalman Filter

7.3 Autocovariance Least Squares method

7.4 Summary

**CHAPTER 8 – RESULTS **

8.1 Overview

8.2 Ground truth data set

8.3 System outline

8.4 Experimental Plan

8.5 Parameter Exploration

8.6 Classification

8.7 Change detection

8.8 Change detection algorithm comparison

8.9 Provincial experiments

8.10 Computational complexity

8.11 Summary

**CHAPTER 9 – CONCLUSION**

9.1 Concluding remarks

9.2 Future Recommendations

**REFERENCES**