: [idx, centers, sumd, dist] = kmeans (data, k, param1, value1, …)

Perform a k-means clustering of the NxD table data. If the parameter Start is specified as a numeric array, then k may be empty, in which case k is set to the number of rows of Start.

The outputs are:

idx

An Nx1 vector whose ith element is the cluster to which row i of data is assigned.

centers

A kxD array whose ith row is the centroid of cluster i.

sumd

A kx1 vector whose ith entry is the sum of the distances from samples in cluster i to centroid i.

dist

An Nxk matrix whose ijth element is the distance from sample i to centroid j.
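
As a rough illustration of the output shapes described above, the following sketch clusters synthetic data and inspects the size of each output. The data set and the choice k = 2 are assumptions made only for this example.

```octave
pkg load statistics

## Synthetic data (an assumption for illustration): N = 100 rows, D = 2 columns
data = [randn(50, 2) + 3; randn(50, 2) - 3];
k = 2;

[idx, centers, sumd, dist] = kmeans (data, k);

size (idx)      # N x 1: cluster label of each row of data
size (centers)  # k x D: one centroid per row
size (sumd)     # k x 1: within-cluster sum of distances
size (dist)     # N x k: distance from every sample to every centroid
```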

The following parameters may be placed in any order. Each parameter must be followed by its value.

Start

The initialization method for the centroids.

plus

(Default) The k-means++ algorithm.

sample

A subset of k rows from data, sampled uniformly without replacement.

cluster

Perform a pilot clustering on 10% of the rows of data.

uniform

Each component of each centroid is drawn uniformly from the interval between the maximum and minimum values of that component within data. This performs poorly and is implemented only for Matlab compatibility.

A

A kxDxr numeric array of initial centroids, where r is the number of replicates.

Replicates

A positive integer specifying the number of independent clusterings to perform. The output values are the values for the best clustering, i.e., the one with the smallest value of sumd. If Start is numeric, then Replicates defaults to (and must equal) the size of the third dimension of Start. Otherwise it defaults to 1.

MaxIter

The maximum number of iterations to perform for each replicate. If the maximum change of any centroid is less than 0.001, then the replicate terminates even if MaxIter iterations have not occurred. The default is 100.

Distance

The distance measure used for partitioning and calculating centroids.

sqeuclidean

The squared Euclidean distance, i.e., the sum of the squares of the differences between corresponding components. In this case, the centroid is the arithmetic mean of all samples in its cluster. This is the only distance for which this algorithm is truly "k-means".

cityblock

The sum metric, or L1 distance, i.e., the sum of the absolute differences between corresponding components. In this case, the centroid is the median of all samples in its cluster. This gives the k-medians algorithm.

cosine

(Documentation incomplete.)

correlation

(Documentation incomplete.)

hamming

The number of components in which the sample and the centroid differ. In this case, the centroid is the median of all samples in its cluster. Unlike Matlab, Octave allows non-logical data.

EmptyAction

What to do when a centroid is not the closest to any data sample.

error

Throw an error.

singleton

(Default) Select the row of data that has the highest error and use that as the new centroid.

drop

Remove the centroid, and continue computation with one fewer centroid. The dimensions of the outputs centers and dist are unchanged, with values for omitted centroids replaced by NA.

Display

Display a text summary.

off

(Default) Display no summary.

final

Display a summary for each clustering operation.

iter

Display a summary for each iteration of a clustering operation.
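
The distance measures listed under Distance can be sketched with plain Octave arithmetic. This is only an illustration of the definitions, not the code used inside kmeans. The vectors x and c and the matrix members are hypothetical values, and taking the median component-wise is an assumption about how "the median of all samples" is meant.

```octave
## Hypothetical sample and centroid, chosen only for illustration
x = [1, 2, 3];
c = [2, 2, 5];

d_sqeuclidean = sum ((x - c) .^ 2);  # sum of squared differences: 1 + 0 + 4
d_cityblock   = sum (abs (x - c));   # sum of absolute differences: 1 + 0 + 2
d_hamming     = sum (x != c);        # number of components that differ

## Centroid of a small hypothetical cluster (rows are samples)
members  = [0, 0; 1, 4; 5, 2];
c_mean   = mean (members, 1);    # sqeuclidean: arithmetic mean of each column
c_median = median (members, 1);  # cityblock: median of each column (assumed component-wise)
```

Here d_sqeuclidean is 5, d_cityblock is 3, d_hamming is 2, c_mean is [2 2], and c_median is [1 2].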

Example:

[~,c] = kmeans (rand(10, 3), 2, "emptyaction", "singleton");
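
A further sketch, assuming synthetic data and a hypothetical initial guess init, shows a numeric Start with k left empty, followed by a run that combines Replicates and Distance. The parameter names are written in lower case, as in the example above.

```octave
pkg load statistics

## Synthetic two-cluster data (an assumption for illustration)
data = [randn(50, 2) + 3; randn(50, 2) - 3];

## Hypothetical 2x2 array of initial centroids: k = 2, D = 2, one replicate
init = [3, 3; -3, -3];

## k is empty, so it is taken from the number of rows of init
[idx, centers] = kmeans (data, [], "start", init);

## Five independent k-medians runs; the best clustering is returned
[idx2, centers2] = kmeans (data, 2, "replicates", 5, "distance", "cityblock");
```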

See also: linkage.

Demonstration 1

The following code

 ## Generate a two-cluster problem
 C1 = randn (100, 2) + 1;
 C2 = randn (100, 2) - 1;
 data = [C1; C2];

 ## Perform clustering
 [idx, centers] = kmeans (data, 2);

 ## Plot the result
 figure;
 plot (data (idx==1, 1), data (idx==1, 2), 'ro');
 hold on;
 plot (data (idx==2, 1), data (idx==2, 2), 'bs');
 plot (centers (:, 1), centers (:, 2), 'kv', 'markersize', 10);
 hold off;

Produces the following figure

Figure 1

Package: statistics