Function File: GMdist = fitgmdist (data, k, param1, value1, …)

Fit a Gaussian mixture model with k components to data. Each row of data is a data sample. Each column is a variable.
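
A minimal usage sketch of the call described above. The synthetic data, the seed and the variable names are assumptions made for illustration. The property names mu and ComponentProportion are taken from the gmdistribution object that the function is understood to return.

```octave
pkg load statistics   % fitgmdist is provided by the statistics package

% Synthetic data (assumed for illustration): 200 samples of 2 variables,
% drawn from two well-separated Gaussian clusters.
randn ("seed", 1);
data = [randn(100, 2); randn(100, 2) + 5];

% Fit a two-component mixture; each row of data is one sample.
GMdist = fitgmdist (data, 2);

disp (GMdist.mu)                    % component means, one row per component
disp (GMdist.ComponentProportion)   % mixing proportions of the components
```

With the default initialization the starting means are chosen at random, so the order of the fitted components can differ between runs.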

Optional parameters are:

  • ’start’: initialization conditions. Possible values are:
    • ’randSample’ (default) picks the initial means uniformly at random from the rows of data
    • ’plus’ uses k-means++ to initialize the means
    • ’cluster’ performs an initial clustering on 10% of the data
    • vector: a vector with one entry per row of data, whose values from 1 to k specify the component to which each row is initially allocated. The mean, variance and weight of each component are calculated from this allocation
    • structure: a structure with fields mu, Sigma and ComponentProportion

    For ’randSample’, ’plus’ and ’cluster’, the initial variance of each component is the variance of the entire data sample.

  • ’Replicates’ Number of random restarts to perform
  • ’RegularizationValue’ or ’Regularize’ A small number added to the diagonal entries of the covariance to prevent singular covariances
  • ’SharedCovariance’ or ’SharedCov’ (logical) True if all components must share the same covariance, to reduce the number of free parameters
  • ’CovarianceType’ or ’CovType’ (string). Possible values are:
    • ’full’ (default) Allow arbitrary covariance matrices
    • ’diagonal’ Force covariances to be diagonal, to reduce the number of free parameters.
  • ’Option’ A structure with all of the following fields:
    • ’MaxIter’ Maximum number of EM iterations (default 100)
    • ’TolFun’ Termination threshold: EM stops once the increase in likelihood falls below this value (default 1e-6)
    • ’Display’
      • ’off’ (default): display nothing
      • ’final’: display the number of iterations and likelihood once execution completes
      • ’iter’: display the above after each iteration
  • ’Weight’ A column vector or n-by-2 matrix. The first column consists of non-negative weights given to the samples. If these are all integers, this is equivalent to specifying weight(i) copies of row i of data, but potentially faster.

    If a row of data is used to represent samples that are similar but not identical, then the second column of weight indicates the variance of those original samples. Specifically, in the EM algorithm, the contribution of row i towards the variance is set to at least weight(i,2), to prevent spurious components with zero variance.
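
The optional parameters above can be combined as in the following sketch. The data, the seed and the variable names are assumptions made for illustration. The parameter names are copied from the list above, and the exact fitted values depend on the installed version of the package.

```octave
pkg load statistics   % fitgmdist is provided by the statistics package

% Synthetic data, assumed for illustration: two clusters of 50 samples each.
randn ("seed", 2);
data = [randn(50, 2); randn(50, 2) + 4];

% Vector form of 'start': entry i names the component that row i starts in.
init = [ones(50, 1); 2 * ones(50, 1)];

% Diagonal covariances, with a small value added to each diagonal entry.
GMdiag = fitgmdist (data, 2, ...
                    'start', init, ...
                    'CovarianceType', 'diagonal', ...
                    'RegularizationValue', 1e-6);

% k-means++ initialization with five random restarts.
GMplus = fitgmdist (data, 2, 'start', 'plus', 'Replicates', 5);

% Integer weights: a weight of 2 on every row stands in for duplicating it.
w = 2 * ones (rows (data), 1);
GMw = fitgmdist (data, 2, 'start', init, 'Weight', w);

disp (GMdiag.mu)   % means of the diagonal-covariance fit
disp (GMw.mu)      % means of the weighted fit
```

Starting from an explicit allocation vector makes the first and third fits independent of the random initialization, which is convenient when comparing settings.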

See also: gmdistribution, kmeans.

Package: statistics