Function File: eva = evalclusters (x, clust, criterion)
Function File: eva = evalclusters (…, Name, Value)

Create a clustering evaluation object to find the optimal number of clusters.

evalclusters creates a clustering evaluation object that estimates the optimal number of clusters for the data x according to the evaluation criterion criterion. The input data x is an n-by-p matrix holding n observations of p variables. criterion is one of the following:

CalinskiHarabasz

to create a CalinskiHarabaszEvaluation object.

DaviesBouldin

to create a DaviesBouldinEvaluation object.

gap

to create a GapEvaluation object.

silhouette

to create a SilhouetteEvaluation object.

The clustering algorithm clust is one of the following:

kmeans

to cluster the data using kmeans with EmptyAction set to singleton and Replicates set to 5.

linkage

to cluster the data using clusterdata with linkage set to Ward.

gmdistribution

to cluster the data using fitgmdist with SharedCov set to true and Replicates set to 5.

If the criterion is CalinskiHarabasz, DaviesBouldin, or silhouette, clust can also take one of two other forms. First, it can be a function handle to a function of the form c = clust(x, k), where x is the input data, k is the number of clusters to evaluate, and c is the clustering result. The clustering result can be either a vector of length n containing k different integer values, or an n-by-k matrix holding a likelihood value for each of the n observations in each of the k clusters. In the latter case, each observation is assigned to the cluster with the highest value. Second, clust can be an n-by-k matrix, where k is now the number of proposed clustering solutions, so that each column of clust is one clustering solution.
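The rule for a likelihood matrix can be illustrated with a minimal sketch. The matrix P below is hypothetical data, and the use of max is only one way to express "assign each observation to the cluster with the largest value"; it is not taken from the package's source.

```octave
% Hypothetical likelihoods for n = 4 observations and k = 3 clusters.
P = [0.70 0.20 0.10;
     0.10 0.30 0.60;
     0.20 0.50 0.30;
     0.90 0.05 0.05];

% For each row, take the column index of the largest likelihood.
[~, c] = max (P, [], 2);

disp (c')   % prints 1 3 2 1
```

A function handle that already returns integer labels needs no such step. For example, a handle such as @(x, k) kmeans (x, k) has the required c = clust(x, k) form, since the first output of kmeans is a vector of cluster indices.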

In addition to the required inputs x, clust, and criterion, a number of optional arguments can be specified as Name, Value pairs. The known Name arguments are:

KList

a vector of positive integers giving the numbers of clusters to evaluate. This option is required unless clust is a matrix of proposed clustering solutions.

Distance

a distance metric as accepted by the chosen clust. It can be the name of the distance metric as a string, or a function handle. When criterion is silhouette, it can also be a vector as created by the function pdist. Valid distance metric strings are: sqEuclidean (default), Euclidean, cityblock, cosine, correlation, Hamming, Jaccard. Only used by silhouette and gap evaluation.

ClusterPriors

the prior probabilities of each cluster, which can be either empirical (default) or equal. When empirical, the silhouette value is the average of the silhouette values of all points; when equal, the silhouette value is the average of the per-cluster average silhouette values. Only used by silhouette evaluation.
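The difference between the two averaging schemes shows up as soon as clusters have unequal sizes. The following sketch uses hypothetical silhouette values s and cluster labels c; it illustrates the two averages described above and is not the package's internal computation.

```octave
% Hypothetical silhouette values and cluster labels for four points.
s = [0.8; 0.6; 0.7; 0.2];
c = [1; 1; 1; 2];

% empirical: plain average over all points.
val_empirical = mean (s);                     % 0.575

% equal: average each cluster first, then average the cluster means.
per_cluster = accumarray (c, s, [], @mean);   % [0.7; 0.2]
val_equal = mean (per_cluster);               % 0.45

printf ("empirical = %.3f, equal = %.3f\n", val_empirical, val_equal);
```

With equal priors, the single poorly fitting point in cluster 2 carries as much weight as the three points in cluster 1, so the overall value is lower.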

B

the number of reference datasets generated from the reference distribution. Only used by gap evaluation.

ReferenceDistribution

the reference distribution used to create the reference data. It can be PCA (default) for a distribution based on the principal components of x, or uniform for a uniform distribution based on the range of the observed data. The PCA option is currently not implemented. Only used by gap evaluation.
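As a rough illustration of what a uniform reference dataset based on the range of the observed data can look like, the sketch below draws one such dataset column by column. The data x and the variable names lo, hi and ref are hypothetical, and the package may generate its reference data differently.

```octave
% Hypothetical observed data: 4 observations of 2 variables.
x = [1 10; 3 20; 2 40; 5 30];

lo = min (x);   % per-variable minimum, here [1 10]
hi = max (x);   % per-variable maximum, here [5 40]

% One reference dataset of the same size as x, drawn uniformly
% within the observed range of each variable.
ref = lo + rand (size (x)) .* (hi - lo);
```

In a gap evaluation, B such datasets would be drawn and clustered in the same way as x.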

SearchMethod

the method for selecting the optimal number of clusters in a gap evaluation. It can be either globalMaxSE (default), which selects the smallest number of clusters whose gap value is within the standard error of the maximum gap value, or firstMaxSE, which selects the first number of clusters whose gap value is within the standard error of the gap value for the following number of clusters. Only used by gap evaluation.
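The two selection rules can give different answers on the same gap curve. The sketch below applies one reading of them, following the usual gap-statistic convention, to hypothetical gap values and standard errors. The variable names are illustrative and the code is not taken from the package.

```octave
% Hypothetical gap values and standard errors for k = 1..5.
klist = 1:5;
gap = [0.10 0.40 0.42 0.60 0.58];
se  = [0.02 0.03 0.04 0.03 0.04];

% globalMaxSE: smallest k with gap(k) >= max(gap) - se at the maximum.
[gmax, imax] = max (gap);
k_global = klist(find (gap >= gmax - se(imax), 1));

% firstMaxSE: first k with gap(k) >= gap(k+1) - se(k+1).
ok = gap(1:end-1) >= gap(2:end) - se(2:end);
k_first = klist(find (ok, 1));

printf ("globalMaxSE -> %d, firstMaxSE -> %d\n", k_global, k_first);
% prints: globalMaxSE -> 4, firstMaxSE -> 2
```

Here firstMaxSE stops at the first local plateau (k = 2), while globalMaxSE waits for the overall maximum of the curve (k = 4).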

Output eva is a clustering evaluation object.
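A minimal usage sketch follows, assuming the statistics package is installed. The data are randomly generated for illustration, and the property names InspectedK, CriterionValues and OptimalK are not described in this section; they are assumed to be provided by the evaluation classes listed under "See also".

```octave
pkg load statistics

% Two hypothetical, well-separated groups of 2-D points.
x = [randn(50, 2); randn(50, 2) + 6];

% Evaluate kmeans clusterings for 2 to 6 clusters.
eva = evalclusters (x, "kmeans", "CalinskiHarabasz", "KList", 2:6);

disp (eva.InspectedK)        % the numbers of clusters that were evaluated
disp (eva.CriterionValues)   % one criterion value per evaluated number
disp (eva.OptimalK)          % the number of clusters the criterion prefers
```

Because both the data and the kmeans starting points are random, the reported values vary from run to run.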

See also: CalinskiHarabaszEvaluation, DaviesBouldinEvaluation, GapEvaluation, SilhouetteEvaluation.

Package: statistics