17.7.3.3 Algorithms (Hierarchical Cluster Analysis)
Hierarchical Cluster Analysis is used to build a hierarchical tree. It starts with n clusters, each containing a single object. At each of the n − 1 stages, it merges two clusters to form a larger cluster, until all objects are in a single cluster. The process can be shown in a Dendrogram.
Objects to be clustered in Hierarchical Cluster Analysis can be observations or variables.
Distance Matrix
A distance, or dissimilarity, matrix is a symmetric matrix with zero diagonal elements. The (i, j)th element represents how far apart, or how dissimilar, the ith and jth objects are. The distance between two objects is calculated differently for clustering observations and for clustering variables.
Cluster Observations
Origin supports standardization of data before distance calculation for clustering observations. Observations containing one or more missing values are excluded from the analysis.
 Standardize Variables
 Normalize to N(0,1)
For a variable $x_j$, each value is normalized as $z_{ij} = (x_{ij} - \bar{x}_j) / s_j$, where $\bar{x}_j$ and $s_j$ are the mean and standard deviation of the variable. The standardized variable will have zero mean and unit standard deviation.
 Normalize to (0,1)
For a variable $x_j$, each value is normalized as $z_{ij} = (x_{ij} - \min_i x_{ij}) / (\max_i x_{ij} - \min_i x_{ij})$. The standardized variable lies in the range from 0 to 1.
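The two standardization options can be illustrated with a minimal Python sketch. The function names are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text above does not say which denominator Origin uses.

```python
import math


def normalize_n01(values):
    """Scale one variable to zero mean and unit standard deviation."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: sample standard deviation (n - 1 denominator).
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def normalize_01(values):
    """Scale one variable linearly so its minimum maps to 0 and its maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]
```

Both functions act on a single variable (one column of the data) and would fail on a constant column, where the standard deviation or the range is zero.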
Origin supports these distance types:
 Distance Type
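For a standardized matrix $X = (x_{ij})$ with n observations and p variables, the distance $d_{ik}$ between the ith observation and the kth observation can be expressed as follows.
 Euclidean
$d_{ik} = \sqrt{\sum_{j=1}^{p} (x_{ij} - x_{kj})^2}$
 Squared Euclidean
$d_{ik} = \sum_{j=1}^{p} (x_{ij} - x_{kj})^2$
 Absolute (City block metric)
$d_{ik} = \sum_{j=1}^{p} |x_{ij} - x_{kj}|$
 Cosine
$d_{ik} = 1 - \frac{\sum_{j=1}^{p} x_{ij} x_{kj}}{\sqrt{\sum_{j=1}^{p} x_{ij}^2 \sum_{j=1}^{p} x_{kj}^2}}$
 Pearson correlation
$d_{ik} = 1 - \frac{\sum_{j=1}^{p} (x_{ij} - \bar{x}_i)(x_{kj} - \bar{x}_k)}{\sqrt{\sum_{j=1}^{p} (x_{ij} - \bar{x}_i)^2 \sum_{j=1}^{p} (x_{kj} - \bar{x}_k)^2}}$
where $\bar{x}_i = \frac{1}{p} \sum_{j=1}^{p} x_{ij}$ and $\bar{x}_k = \frac{1}{p} \sum_{j=1}^{p} x_{kj}$
 Jaccard
$d_{ik} = \frac{\#\{j : x_{ij} \neq x_{kj}\}}{\#\{j : x_{ij} \neq 0 \text{ or } x_{kj} \neq 0\}}$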
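As an illustration, the sketch below computes the first five distance types in plain Python and assembles a symmetric distance matrix with a zero diagonal. It assumes the usual textbook definitions, with Cosine and Pearson correlation taken as one minus the similarity. All function names are hypothetical and are not part of Origin.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def city_block(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    # Assumption: distance = 1 - cosine similarity.
    dot = sum(x * y for x, y in zip(a, b))
    return 1 - dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


def pearson_distance(a, b):
    # Assumption: distance = 1 - correlation, with means taken across the p variables.
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return 1 - num / den


def distance_matrix(rows, dist):
    """Symmetric n x n matrix with zero diagonal; rows are observations."""
    n = len(rows)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(i + 1, n):
            d[i][k] = d[k][i] = dist(rows[i], rows[k])
    return d
```

For example, `distance_matrix([[0, 0], [3, 4]], euclidean)` gives a 2 × 2 matrix whose off-diagonal entries are both 5.0.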
Cluster Variables
Origin supports two distance types for clustering variables. When the correlation between two variables is calculated, an observation is excluded if it has a missing value in either of the two variables.
 Distance Type
For a matrix with n observations and p variables, the distance $d_{jk}$ between the jth variable and the kth variable can be expressed as follows:
 Correlation
$d_{jk} = 1 - r_{jk}$
where $r_{jk}$ is the correlation between the jth variable and the kth variable.
 Absolute correlation
$d_{jk} = 1 - |r_{jk}|$
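The pairwise exclusion of missing values can be sketched as follows. This is an illustration only: missing values are represented by `None`, the distances are taken as one minus the (absolute) correlation, and the function names are hypothetical.

```python
import math


def correlation(x, y):
    """Pearson correlation using only observations present in both variables."""
    # Drop an observation if either variable is missing (None) for it.
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)


def correlation_distance(x, y, absolute=False):
    """1 - r, or 1 - |r| when absolute=True."""
    r = correlation(x, y)
    return 1 - abs(r) if absolute else 1 - r
```

With the absolute variant, two strongly negatively correlated variables are treated as close, whereas the plain correlation distance places them far apart.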
Linkage Method
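At each stage, the two nearest clusters are merged. Origin provides several methods to calculate the distance between the new cluster and the other clusters. Let clusters j and k be merged as cluster jk. Let $n_i$, $n_j$ and $n_k$ be the number of objects in Cluster i, Cluster j and Cluster k respectively, and let $d_{ij}$, $d_{ik}$ and $d_{jk}$ be the distances between the corresponding pairs of clusters. The distance $d_{jk,i}$ between Cluster jk and Cluster i can then be calculated in the following ways:
 Single link or Nearest neighbor
$d_{jk,i} = \min(d_{ij}, d_{ik})$
 Complete link or Furthest neighbor
$d_{jk,i} = \max(d_{ij}, d_{ik})$
 Group average
$d_{jk,i} = \frac{n_j}{n_j + n_k} d_{ij} + \frac{n_k}{n_j + n_k} d_{ik}$
 Centroid
$d_{jk,i} = \frac{n_j}{n_j + n_k} d_{ij} + \frac{n_k}{n_j + n_k} d_{ik} - \frac{n_j n_k}{(n_j + n_k)^2} d_{jk}$
 Median
$d_{jk,i} = \frac{1}{2} d_{ij} + \frac{1}{2} d_{ik} - \frac{1}{4} d_{jk}$
 Minimum variance or Ward
$d_{jk,i} = \frac{(n_i + n_j) d_{ij} + (n_i + n_k) d_{ik} - n_i d_{jk}}{n_i + n_j + n_k}$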
If clusters j and k (j < k) merge, the new cluster is referred to as Cluster j in the Cluster Stages table.
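The merging procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Origin's implementation: the update rules are the standard Lance-Williams forms of the six linkage methods, ties are broken in favor of the lowest cluster labels, and the names `update_distance` and `cluster_stages` are hypothetical. Each stage is recorded as a tuple (j, k, distance) with j < k, and the merged cluster keeps the label j.

```python
def update_distance(method, d_ij, d_ik, d_jk, n_i, n_j, n_k):
    """Distance between the merged cluster jk and another cluster i."""
    if method == "single":
        return min(d_ij, d_ik)
    if method == "complete":
        return max(d_ij, d_ik)
    if method == "average":
        return (n_j * d_ij + n_k * d_ik) / (n_j + n_k)
    if method == "centroid":
        n_jk = n_j + n_k
        return (n_j * d_ij + n_k * d_ik) / n_jk - n_j * n_k * d_jk / n_jk ** 2
    if method == "median":
        return 0.5 * d_ij + 0.5 * d_ik - 0.25 * d_jk
    if method == "ward":
        return ((n_i + n_j) * d_ij + (n_i + n_k) * d_ik - n_i * d_jk) / (n_i + n_j + n_k)
    raise ValueError(f"unknown linkage method: {method}")


def cluster_stages(dist, method="single"):
    """Run the n - 1 merge stages on an n x n distance matrix (list of lists)."""
    n = len(dist)
    # Pairwise distances keyed by (smaller label, larger label); labels are 1-based.
    d = {(a, b): dist[a - 1][b - 1] for a in range(1, n + 1) for b in range(a + 1, n + 1)}
    size = {c: 1 for c in range(1, n + 1)}
    stages = []
    while len(size) > 1:
        # Nearest pair of clusters; ties go to the lowest labels (an assumption).
        (j, k), d_jk = min(d.items(), key=lambda item: (item[1], item[0]))
        for i in size:
            if i in (j, k):
                continue
            d_ij = d[(min(i, j), max(i, j))]
            d_ik = d.pop((min(i, k), max(i, k)))
            d[(min(i, j), max(i, j))] = update_distance(
                method, d_ij, d_ik, d_jk, size[i], size[j], size[k]
            )
        del d[(j, k)]
        size[j] += size.pop(k)  # the merged cluster keeps the smaller label j
        stages.append((j, k, d_jk))
    return stages
```

For four points on a line at 0, 1, 5 and 6, `cluster_stages` with the absolute-difference matrix and `"single"` linkage merges objects 1 and 2, then 3 and 4, and finally joins the two pairs at distance 4. The centroid, median and Ward updates are commonly applied to squared Euclidean distances.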
Dendrogram
The Dendrogram plot is a hierarchical tree that shows the distance at which two clusters merge. Each stage is represented as a unit in the Dendrogram. The top of the unit for each stage represents the new cluster formed by merging two clusters. Its height corresponds to the distance between the two merged clusters.
The endpoints of the Dendrogram represent the n objects. The objects are sorted so that clusters that merge are adjacent. The first endpoint in the Dendrogram always corresponds to the first object.
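One ordering with these properties can be derived from the merge stages, as in the sketch below. It assumes a hypothetical stage format of (j, k, distance) tuples with j < k, where the merged cluster keeps the label j. Origin's actual ordering rule may differ in detail.

```python
def leaf_order(n, stages):
    """Order the n endpoints so that clusters that merge are adjacent."""
    order = {c: [c] for c in range(1, n + 1)}
    for j, k, _distance in stages:
        # Place the leaves of cluster k directly after those of cluster j.
        order[j] = order[j] + order.pop(k)
    # After all n - 1 stages only Cluster 1 remains, and its list starts with object 1.
    return order[1]
```

Because the merged cluster always keeps the smaller label and its own leaves come first, object 1 ends up as the first endpoint.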
Group Objects
The membership of the n objects for a specified number of clusters k can be determined from the information in the Dendrogram plot or the Cluster Stages table. k clusters exist after the (n − k)th stage, which means the membership of each object is known from the first n − k stages. Object 1 always belongs to Cluster 1.
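A minimal sketch of this lookup is shown below. It assumes a hypothetical stage format of (j, k, distance) tuples with j < k, where the merged cluster keeps the label j. Numbering the final clusters in order of first appearance is also an assumption, chosen so that Object 1 falls in Cluster 1.

```python
def membership(n, stages, k):
    """Cluster number of each of the n objects when k clusters remain."""
    label = list(range(1, n + 1))  # every object starts in its own cluster
    # Replay only the first n - k merges.
    for j, merged, _distance in stages[: n - k]:
        label = [j if c == merged else c for c in label]
    # Renumber clusters 1..k in order of first appearance.
    ids = {}
    return [ids.setdefault(c, len(ids) + 1) for c in label]
```

For example, with stages `[(1, 2, 1), (3, 4, 1), (1, 3, 4)]` and n = 4, asking for k = 2 replays the first two stages and returns `[1, 1, 2, 2]`.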
When clustering observations, Origin also calculates the cluster centers, the distances between cluster centers, and the distances between observations and clusters. Note that the standardized observations are used in these calculations if standardization is chosen in the analysis.
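The following sketch shows one plausible way to obtain these quantities. It assumes that a cluster center is the mean of the cluster's member observations and that distances are Euclidean. Neither assumption is stated in the text above, and the function names are hypothetical.

```python
import math


def cluster_centers(rows, labels):
    """Mean of the member observations of each cluster (assumed definition)."""
    groups = {}
    for row, c in zip(rows, labels):
        groups.setdefault(c, []).append(row)
    return {
        c: [sum(col) / len(members) for col in zip(*members)]
        for c, members in groups.items()
    }


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Given `centers = cluster_centers(rows, labels)`, the distance between two centers is `euclidean(centers[1], centers[2])`, and the distance from an observation to its cluster is `euclidean(rows[i], centers[labels[i]])`.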
