Name, Value)
Create a clustering evaluation object to find the optimal number of clusters.
evalclusters creates a clustering evaluation object to evaluate the optimal number of clusters for data x, using criterion criterion. The input data x is a matrix with n observations of p variables.
The evaluation criterion criterion is one of the following:
CalinskiHarabasz
to create a CalinskiHarabaszEvaluation object.
DaviesBouldin
to create a DaviesBouldinEvaluation object.
gap
to create a GapEvaluation object.
silhouette
to create a SilhouetteEvaluation object.
The clustering algorithm clust is one of the following:
kmeans
to cluster the data using kmeans with EmptyAction set to singleton and Replicates set to 5.
linkage
to cluster the data using clusterdata with linkage set to Ward.
gmdistribution
to cluster the data using fitgmdist with SharedCov set to true and Replicates set to 5.
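A minimal usage sketch of the named algorithms and criteria described above. The synthetic data and the choice of candidate cluster counts are assumptions made for illustration. The OptimalK property name is taken from the statistics package's evaluation classes and should be checked against the installed version.

```octave
pkg load statistics

## Assumed example data: two well-separated groups,
## 50 observations of 2 variables each (n = 100, p = 2).
x = [randn(50, 2); randn(50, 2) + 6];

## Evaluate k-means solutions for 2 to 5 clusters with the
## Calinski-Harabasz criterion.
eva = evalclusters (x, "kmeans", "CalinskiHarabasz", "KList", 2:5);

## Number of clusters preferred by the criterion.
printf ("optimal number of clusters: %d\n", eva.OptimalK);
```

Because kmeans starts from random centers, the reported value can vary between runs.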
If the criterion is CalinskiHarabasz, DaviesBouldin, or silhouette, clust can also be a function handle to a function of the form c = clust(x, k), where x is the input data, k is the number of clusters to evaluate, and c is the clustering result. The clustering result can be either an array of size n with k different integer values, or a matrix of size n by k with a likelihood value assigned to each one of the n observations for each one of the k clusters. In the latter case, each observation is assigned to the cluster with the highest value.
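The function-handle form can be sketched as follows. The wrapper name my_clust and the synthetic data are hypothetical. The sketch covers only the first kind of result, a vector of n integer labels, by wrapping kmeans with settings chosen for illustration.

```octave
pkg load statistics

## Assumed example data: two groups of 50 observations, 2 variables.
x = [randn(50, 2); randn(50, 2) + 6];

## Hypothetical clustering function of the form c = clust (x, k).
## It returns an n-by-1 vector of integer labels.
my_clust = @(x, k) kmeans (x, k, "EmptyAction", "singleton", ...
                           "Replicates", 3);

## A function handle is accepted for the CalinskiHarabasz,
## DaviesBouldin and silhouette criteria.
eva = evalclusters (x, my_clust, "silhouette", "KList", 2:4);
disp (eva);
```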
If the criterion is CalinskiHarabasz, DaviesBouldin, or silhouette, clust can also be a matrix of size n by k, where k is the number of proposed clustering solutions, so that each column of clust is a clustering solution.
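A sketch of the matrix form, assuming the proposed solutions are produced beforehand with kmeans. The data, the variable names, and the candidate cluster counts are illustrative. How the returned object labels the individual columns is not stated here and should be checked in the object itself.

```octave
pkg load statistics

## Assumed example data: two groups of 50 observations, 2 variables.
x = [randn(50, 2); randn(50, 2) + 6];

## Build one clustering solution per column (n rows, 3 columns).
ks = 2:4;
sols = zeros (rows (x), numel (ks));
for i = 1:numel (ks)
  sols(:, i) = kmeans (x, ks(i), "EmptyAction", "singleton");
endfor

## With a matrix of proposed solutions no KList is given.
eva = evalclusters (x, sols, "DaviesBouldin");
disp (eva);
```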
In addition to the required x, clust, and criterion inputs, there are a number of optional arguments, specified as pairs of Name and Value options. The known Name arguments are:
KList
a vector of positive integers giving the numbers of clusters to evaluate. This option is required unless clust is a matrix of proposed clustering solutions.
Distance
a distance metric as accepted by the chosen clust. It can be the name of the distance metric as a string or a function handle. When criterion is silhouette, it can be a vector as created by function pdist. Valid distance metric strings are: sqEuclidean (default), Euclidean, cityblock, cosine, correlation, Hamming, Jaccard. Only used by silhouette and gap evaluation.
ClusterPriors
the prior probabilities of each cluster, which can be either empirical (default) or equal. When empirical, the silhouette value is the average of the silhouette values of all points; when equal, the silhouette value is the average of the per-cluster average silhouette values. Only used by silhouette evaluation.
B
the number of reference datasets generated from the reference distribution. Only used by gap evaluation.
ReferenceDistribution
the reference distribution used to create the reference data. It can be PCA (default) for a distribution based on the principal components of x, or uniform for a uniform distribution based on the range of the observed data. PCA is currently not implemented. Only used by gap evaluation.
SearchMethod
the method for selecting the optimal value with a gap evaluation. It can be either globalMaxSE (default) for selecting the smallest number of clusters that is within one standard error of the maximum gap value, or firstMaxSE for selecting the first number of clusters that is within one standard error of the gap value for the following number of clusters. Only used by gap evaluation.
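Two of the options above describe small computations, which can be illustrated with plain arithmetic. All numbers below are hypothetical, and the selection rules are one reading of the ClusterPriors and SearchMethod descriptions, not the package's actual code.

```octave
## ClusterPriors: two ways to average silhouette values.
## Hypothetical per-point silhouette values and cluster labels.
s   = [0.9 0.8 0.7 0.6 0.2 0.4];
idx = [1   1   1   1   2   2];

## empirical: plain mean over all points.
s_empirical = mean (s);

## equal: mean of the per-cluster means, so every cluster
## contributes the same weight regardless of its size.
cluster_means = accumarray (idx(:), s(:), [], @mean);
s_equal = mean (cluster_means);

## SearchMethod: two ways to pick the number of clusters.
## Hypothetical gap values and standard errors for k = 1..4.
klist = 1:4;
gap = [0.30 0.32 0.50 0.48];
se  = [0.03 0.04 0.06 0.05];

## globalMaxSE: smallest k whose gap lies within one standard
## error of the maximum gap value.
[gmax, imax] = max (gap);
k_global = klist(find (gap >= gmax - se(imax), 1));

## firstMaxSE: first k whose gap lies within one standard
## error of the gap for the following k.
within = gap(1:end-1) >= gap(2:end) - se(2:end);
k_first = klist(find (within, 1));

printf ("empirical: %.3f, equal: %.3f\n", s_empirical, s_equal);
printf ("globalMaxSE: %d, firstMaxSE: %d\n", k_global, k_first);
```

With these numbers the two averages are 0.600 and 0.525, and the two search rules select 3 and 1 clusters respectively, which shows that the choice of option can change the result.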
Output eva is a clustering evaluation object.
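As an illustration, the sketch below creates a gap evaluation object and reads a few of its fields. The data and option values are assumptions. The property names (CriterionName, InspectedK, CriterionValues, OptimalK) are those of the statistics package's evaluation classes as far as known and should be verified against the installed version. The uniform reference distribution is requested explicitly because PCA is documented as not implemented.

```octave
pkg load statistics

## Assumed example data: two groups of 50 observations, 2 variables.
x = [randn(50, 2); randn(50, 2) + 6];

## Gap evaluation with 20 reference datasets drawn from a uniform
## distribution over the range of the observed data.
eva = evalclusters (x, "kmeans", "gap", "KList", 1:4, "B", 20, ...
                    "ReferenceDistribution", "uniform", ...
                    "SearchMethod", "firstMaxSE");

printf ("criterion: %s\n", eva.CriterionName);
disp (eva.InspectedK);       # cluster counts that were evaluated
disp (eva.CriterionValues);  # one criterion value per cluster count
printf ("optimal number of clusters: %d\n", eva.OptimalK);
```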
See also: CalinskiHarabaszEvaluation, DaviesBouldinEvaluation, GapEvaluation, SilhouetteEvaluation.
Package: statistics