17.7.3.3 Algorithms (Hierarchical Cluster Analysis)


Hierarchical Cluster Analysis is used to build a hierarchical tree. It starts with n clusters, each containing a single object, and then, at each of n − 1 stages, merges two clusters to form a larger cluster, until all objects are in a single cluster. The process can be shown in a Dendrogram.

Objects to be clustered in Hierarchical Cluster Analysis can be observations or variables.

Distance Matrix

A distance, or dissimilarity, matrix is a symmetric matrix with zero diagonal elements. The ijth element represents how far or how dissimilar the ith and jth objects are. Methods to calculate the distance between two objects are different for clustering observations and clustering variables.
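
As an illustration only (this uses SciPy, not Origin code), the following builds such a matrix: pdist returns the condensed pairwise distances and squareform expands them into the full symmetric matrix with a zero diagonal.

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    X = np.array([[1.0, 2.0],    # observation 1
                  [2.0, 4.0],    # observation 2
                  [5.0, 1.0]])   # observation 3

    D = squareform(pdist(X, metric="euclidean"))  # full n-by-n distance matrix
    # D is symmetric with a zero diagonal; D[i, k] measures how dissimilar
    # the ith and kth observations are.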

Cluster Observations

Origin supports standardization of data before distance calculation for clustering observations. Observations containing one or more missing values will be excluded from the analysis.

  • Standardize Variables
    1. Normalize to N(0,1)
      For a variable x_j\ , it is normalized as: (x_j-\bar{x}_j)/\sigma_j, where \bar{x}_j\ and \sigma_j\ are the mean and standard deviation of the variable. The standardized variable will have zero mean and unit standard deviation.
    2. Normalize to (0,1)
      For a variable x_j\ , it is normalized as: (x_j-\mathrm{min}(x_j))/(\mathrm{max}(x_j)-\mathrm{min}(x_j))\ . The standardized variable will lie in the range 0 to 1.
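
A minimal NumPy sketch of the two standardizations above (the array names are illustrative, and the sample standard deviation is assumed for the N(0,1) scaling):

    import numpy as np

    X = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [4.0, 50.0]])   # rows = observations, columns = variables

    # 1. Normalize to N(0,1): zero mean and unit standard deviation per variable
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # ddof=1: sample std assumed

    # 2. Normalize to (0,1): rescale each variable to the range 0 to 1
    U = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))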

Origin supports these distance types:

  • Distance Type
    For a standardized matrix x_{ij}\ with n observations and p variables, the distance between the ith observation and the kth observation can be expressed as follows.
    1. Euclidean
      d_{ik}=\Big\{\sum_{j=1}^p (x_{ij}-x_{kj})^2 \Big\}^{\frac{1}{2}}
    2. Squared Euclidean
      d_{ik}=\sum_{j=1}^p (x_{ij}-x_{kj})^2
    3. Absolute (City block metric)
      d_{ik}=\sum_{j=1}^p |x_{ij}-x_{kj}|
    4. Cosine
      d_{ik}=1-\frac{\sum_{j=1}^p x_{ij}x_{kj}}{\sqrt{\sum_{j=1}^p x_{ij}^2}\sqrt{\sum_{j=1}^p x_{kj}^2}}
    5. Pearson correlation
      d_{ik}=1-\frac{\sum_{j=1}^p (x_{ij}- \bar x_i)(x_{kj}- \bar x_k)}{\sqrt{\sum_{j=1}^p (x_{ij}- \bar x_i)^2}\sqrt{\sum_{j=1}^p (x_{kj}- \bar x_k)^2}}
      where \bar x_i=\frac{1}{p}\sum_{j=1}^p x_{ij} and \bar x_k=\frac{1}{p}\sum_{j=1}^p x_{kj} are the means of the ith and kth observations over the p variables
    6. Jaccard
      d_{ik}=\frac{\sum_{j=1}^p I\big( x_{ij} \ne x_{kj} \;\wedge\; (x_{ij} \ne 0 \;\vee\; x_{kj} \ne 0) \big)}{\sum_{j=1}^p I\big( x_{ij} \ne 0 \;\vee\; x_{kj} \ne 0 \big)}
      where I(\cdot)\ equals 1 if the condition inside holds and 0 otherwise
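
The following sketch evaluates these six dissimilarities for two observation rows. It follows the formulas above directly and is for illustration only, not Origin's internal implementation.

    import numpy as np

    def observation_distances(xi, xk):
        """Dissimilarities between two (standardized) observation rows."""
        euclidean  = np.sqrt(np.sum((xi - xk) ** 2))
        sq_euclid  = np.sum((xi - xk) ** 2)
        city_block = np.sum(np.abs(xi - xk))
        cosine     = 1 - np.dot(xi, xk) / (np.linalg.norm(xi) * np.linalg.norm(xk))
        pearson    = 1 - np.corrcoef(xi, xk)[0, 1]
        # Jaccard: among positions where at least one value is nonzero,
        # the fraction of positions where the two rows disagree
        either_nonzero = (xi != 0) | (xk != 0)
        jaccard = np.sum((xi != xk) & either_nonzero) / np.sum(either_nonzero)
        return euclidean, sq_euclid, city_block, cosine, pearson, jaccard

    xi = np.array([1.0, 0.0, 2.0, 3.0])
    xk = np.array([1.0, 1.0, 0.0, 3.0])
    print(observation_distances(xi, xk))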

Cluster Variables

Origin supports two distance types for clustering variables. An observation is excluded from the calculation of the correlation between two variables if it has a missing value in either variable.

  • Distance Type
    For a matrix x_{ij}\ with n observations and p variables, the distance between the jth variable and the kth variable can be expressed as follows:
    1. Correlation
      d_{jk}=1-\mathrm{corr}(x_j, x_k)\ where \mathrm{corr}(x_j, x_k)\ is the correlation between the jth variable and the kth variable.
    2. Absolute correlation
      d_{jk}=1-|\mathrm{corr}(x_j, x_k)|\
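
A short sketch of both variable dissimilarities, computed from the jth and kth columns of a data matrix (illustrative code, not Origin's):

    import numpy as np

    def variable_distances(X, j, k):
        """Correlation and absolute-correlation distances between columns j and k."""
        r = np.corrcoef(X[:, j], X[:, k])[0, 1]
        return 1 - r, 1 - abs(r)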

Linkage Method

At each stage, the two clusters that are nearest are merged. Origin provides several methods to calculate the distance between the new cluster and the other clusters. Suppose Clusters j and k are merged into Cluster jk. Let n_i\ , n_j\ and n_k\ be the number of objects in Cluster i, Cluster j and Cluster k respectively, and let d_{ij}\ , d_{ik}\ and d_{jk}\ be the distances between Clusters i and j, Clusters i and k, and Clusters j and k respectively. The distance d_{i,jk}\ between Cluster jk and Cluster i can then be calculated in the following ways:

  1. Single link or Nearest neighbor
    d_{i,jk}=\mathrm{min}(d_{ij},d_{ik})\
  2. Complete link or Furthest neighbor
    d_{i,jk}=\mathrm{max}(d_{ij},d_{ik})\
  3. Group average
    d_{i,jk}=\frac{n_j}{n_j+n_k}d_{ij}+\frac{n_k}{n_j+n_k}d_{ik}\
  4. Centroid
    d_{i,jk}=\frac{n_j}{n_j+n_k}d_{ij}+\frac{n_k}{n_j+n_k}d_{ik}-\frac{n_jn_k}{(n_j+n_k)^2}d_{jk}\
  5. Median
    d_{i,jk}=\frac{1}{2}d_{ij}+\frac{1}{2}d_{ik}-\frac{1}{4}d_{jk}\
  6. Minimum variance or Ward
    d_{i,jk}=\frac{(n_i+n_j)d_{ij}+(n_i+n_k)d_{ik}-n_id_{jk}}{n_i+n_j+n_k}\
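
All six updates have the same structure, so a single merge step can be written as below. This is an illustrative implementation of the formulas above, not Origin's code; for the Centroid, Median and Ward methods these recurrences are usually stated for squared Euclidean distances, which is assumed here.

    def update_distance(method, d_ij, d_ik, d_jk, n_i, n_j, n_k):
        """Distance between Cluster i and the new cluster jk formed by merging j and k."""
        if method == "single":      # nearest neighbor
            return min(d_ij, d_ik)
        if method == "complete":    # furthest neighbor
            return max(d_ij, d_ik)
        if method == "average":     # group average
            return (n_j * d_ij + n_k * d_ik) / (n_j + n_k)
        if method == "centroid":
            return ((n_j * d_ij + n_k * d_ik) / (n_j + n_k)
                    - n_j * n_k * d_jk / (n_j + n_k) ** 2)
        if method == "median":
            return 0.5 * d_ij + 0.5 * d_ik - 0.25 * d_jk
        if method == "ward":        # minimum variance
            return ((n_i + n_j) * d_ij + (n_i + n_k) * d_ik - n_i * d_jk) / (n_i + n_j + n_k)
        raise ValueError("unknown linkage method: " + method)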

If Clusters j and k (j < k) merge, the new cluster will be referred to as Cluster j in the Cluster Stages table.

Dendrogram

The Dendrogram plot is a hierarchical tree that shows the distance at which two clusters merge. Each stage is represented as a unit in the Dendrogram; the top of the unit represents the new cluster formed by merging two clusters, and its height corresponds to the distance between the two merged clusters.

The end-points of the Dendrogram represent the n objects. The objects are sorted so that clusters that merge are adjacent. The first end-point in the Dendrogram always corresponds to the first object.
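
For comparison, SciPy can draw an analogous dendrogram from a condensed distance matrix. This uses SciPy's functions and plotting conventions, not Origin's, but the link heights have the same meaning: the distance at which two clusters merge.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.spatial.distance import pdist
    from scipy.cluster.hierarchy import linkage, dendrogram

    X = np.random.default_rng(0).normal(size=(10, 3))            # 10 observations, 3 variables
    Z = linkage(pdist(X, metric="euclidean"), method="average")  # the cluster stages
    dendrogram(Z)   # end-points are the 10 objects; link heights are merge distances
    plt.show()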

Group Objects

Membership of the n objects in a specified number of clusters k can be determined from the information in the Dendrogram plot or the Cluster Stages table. After the (n − k)th stage, k clusters remain, so the membership of each object can be read from the first n − k stages. Object 1 always belongs to Cluster 1.
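
Continuing the SciPy sketch above, cutting the tree into k clusters returns a membership label for each object (SciPy's labeling scheme differs from Origin's convention that Object 1 belongs to Cluster 1):

    from scipy.cluster.hierarchy import fcluster

    k = 3
    membership = fcluster(Z, t=k, criterion="maxclust")   # one cluster label per object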

Cluster centers, the distances between cluster centers, and the distances between observations and cluster centers are calculated when clustering observations. Note that observations are standardized in these calculations if standardization is chosen in the analysis.
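
As a sketch of these reported quantities (a hypothetical helper, not Origin's output routine), assuming X holds the data that were actually clustered, i.e. already standardized if standardization was chosen:

    import numpy as np

    def cluster_summaries(X, membership):
        """Cluster centers, center-to-center distances, and each observation's
        Euclidean distance to every cluster center."""
        labels = np.unique(membership)
        centers = np.array([X[membership == g].mean(axis=0) for g in labels])
        center_dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
        obs_to_center = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        return centers, center_dist, obs_to_center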