Expectation Maximization (EM) Cluster Analysis

Introduction

Cluster analysis (or clustering) is the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters or classes), so that the data in each subset (ideally) share some common trait - often proximity according to some defined distance measure. Data clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. The computational task of classifying the data set into k clusters is often referred to as k-clustering.

Algorithm

The EM algorithm implemented in SNAP can be regarded as a generalization of the k-means algorithm. The main differences are:

  1. Pixels are not assigned to clusters. The membership of each pixel to a cluster is defined by a (posterior) probability. For each pixel, there are as many (posterior) probability values as there are clusters and for each pixel the sum of (posterior) probability values is equal to unity.
  2. Clusters are defined by a prior probability, a cluster center, and a cluster covariance matrix. Cluster centers and covariance matrixes determine a Mahalanobis distance between a cluster center and a pixel.
  3. For each cluster a pixel likelihood function is defined as a normalized Gaussian function of the Mahalanobis distance between cluster center and pixels.
  4. Posterior cluster probabilities as well as cluster centers and covariance matrixes and are recalculated iteratively. In the E-step, for each cluster, the cluster prior and posterior probabilities are recalculated. In the M-step all cluster centers and covariance matrixes are recalculated from the updated posteriors, so that the resulting data likelihood function is maximized.
  5. When the iteration is completed, each pixel is assigned to the cluster where the posterior probability is maximal.

The algorithm is described in detail on the Wikipedia entry on Expectation maximization. Use this algorithm when you want to perform a cluster analysis of a small scene or region-of-interest and are not satisfied with the results obtained from the k-means algorithm.

The result of the cluster analysis is written to a band named class_indices. The values in this band indicate the class indices, where a value '0' refers to the first cluster, a value of '1' refers to the second cluster, etc. The class indices are sorted according to the prior probability associated with cluster, i.e. a class index of '0' refers to the cluster with the highest probability.

Note that an index coding is attached to the class_indices band, which can be edited in the Color Manipulation Window. It is possible to change the label and the color associated with a class index. The last columns of the color manipulation window lists the location of the cluster centers. Further information on the clusters is listed in the Cluster-Analysis group of the product metadata.

Color Manipulation Window

User Interface

The EM cluster analysis tool can be invoked in the Sentinel Toolbox from the tool menu by selecting Processing->Image Analysis->EM Cluster Analysis.... In the command line it is available by means of the Graph Processing Tool gpt which is located in the Sentinel Toolbox bin directory. Please type gpt EMClusterAnalysis -h for further information.

Selecting the EM Cluster Analysis command from the Sentinel Toolbox tool menu pops up the following dialog:

EM Cluster Analysis Dialog

Source Product Group

Name: Here the user specifies the name of the source product. The combo box presents a list of all products opened in the Sentinel Toolbox. The user may select one of these or, by clicking on the button next to the combo box, choose a product from the file system.

Target Product Group

Name: Used to specify the name of the target product.

Save as: Used to specify whether the target product should be saved to the file system. The combo box presents a list of file formats, currently BEAM-DIMAP, GeoTIFF, and HDF5. The text field allows to specify a target directory.

Open in SNAP: Used to specify whether the target product should be opened in the Sentinel Toolbox. When the target product is not saved, it is opened in the Sentinel Toolbox automatically.

K-Means Cluster Analysis Dialog

Processing Parameters Panel

Number of clusters: Use this field to specify the number of clusters. The default is 14 clusters.

Number of iterations: Use this field to specify the maximum number of iterations to be carried out. The default is 30 iterations. The cluster analysis stops when the maximum number of iterations is exceeded.

Random seed: The EM algorithm starts with a pseudo-random distribution of initial clusters. The random seed initializes the pseudo-random number generator, which is used to generate the initial clusters. By changing the random seed, you can start with different initial clusters. Any positive integral number will be a perfect random seed. The default seed is 31415.

Source band names: Use this field to specify the names of the source bands. Press the control key while selecting or deselecting individual bands.

ROI-Mask: Use this field to restrict the cluster analysis to a region-of-interest (ROI). The combo box allows to select the band which provides the ROI.

Include probability bands: Check this box if you want the cluster posterior probabilities to be included in the target product. The target then will contain a single probability band for each cluster.

Button Group

Run Creates the target product. The cluster analysis is actually deferred until its band data are accessed, either by writing the product to a file or by viewing its band data. When the Save as option is checked, the cluster analysis is triggered automatically.

Close Closes the dialog.

Help Displays this page in the Sentinel Toolbox Help.

Further information

A good starting point for obtaining further information on cluster analysis terms and algorithms is the Wikipedia entry on data clustering.