17.7.3.6 Algorithms (K-Means Cluster Analysis)


K-Means Cluster Analysis uses minimum sum of squares to assign observations to groups.

An observation containing one or more missing values will be excluded before K-Means Cluster Analysis.

For a matrix X with n observations by p variables, initial cluster centers can be specified with a K-by-p matrix, or chosen from the matrix X with a predefined number of clusters.

  • Select Initial Cluster Centers from Observations
    1. The first K observations are assigned as cluster centers.
    2. Loop each remaining observation to see whether it can replace current cluster centers.
If the distance between the observation and its closest cluster center is greater than the distance between the two closest cluster centers(Cluster i and Cluster j), then the observation will replace the cluster center of Cluster i or j depending on which one is closer to the observation.
Otherwise if the distance between the observation and the second closest cluster center is greater than the closest distance between the closest cluster center and other cluster centers, then the observation will replace the closest cluster center.
  • Allocate Observations to the Closest Cluster
Each observation is allocated to the closest cluster, and the distance between an observation and a cluster is calculated from the Euclidean distance between the observation and the cluster center.
If the ith observation is allocated to the kth cluster, then the distance between the observation and the kth cluster is corrected as \bar{d}_{ik}=n_k/(n_k-1)d_{ik}, and the distance between the observation and other clusters is corrected as \bar{d}_{ij}=n_j/(n_j+1)d_{ij}, where n_k\ and n_j\ are the number of observations in the kth cluster and jth cluster. If the corrected distance between the observation and the lth cluster is minimum and l \ne k\ , then the observation will be allocated to the lth cluster instead of the kth cluster.
Each cluster center will then be updated as the mean for observations in each cluster.
The within-cluster sum of squares is:
\sum_{k=1}^K \sum_{i\in S_k} \sum_{j=1}^{p} (x_{ij}-\bar{x}_{kj})^2
where S_k\ is the set of observations in the kth cluster and \bar{x}_{kj}\ is the jth variable of the cluster center for the kth cluster.
Using updated cluster centers, repeat the process, allocate each observation, and update cluster centers. The iteration will stop when the maximum number of iterations is reached or the change of within-cluster sum of squares in two successive iterations is less than the threshold value. The updated cluster centers for the last iteration are called Final Cluster Centers.
Origin's default maximum number of iterations is 10.
  • Distance from Cluster
Euclidean distance is used to calculate the distance between the ith observation and the kth cluster.
d_{ik}=\sqrt{\sum_{j=1}^{p} (x_{ij}-\bar{x}_{kj})^2}