5.6.2 Cluster Analysis

We will perform cluster analysis on the mean temperatures of US cities over a 3-year period.

The starting point is a hierarchical cluster analysis with randomly selected data in order to find the best method for clustering. K-means analysis, a quick cluster method, is then performed on the entire original dataset.

Minimum Origin Version Required: Updated Origin 2020

Hierarchical Cluster Analysis

  1. Start with a new project or a new workbook. Import the data file \Samples\Graphing\US Mean Temperature.dat.
  2. Highlight Column D through Column O.
  3. Select Statistics: Multivariate Analysis: Hierarchical Cluster Analysis and open the dialog.
  4. On the Input tab, click the triangle button next to Variables, and then click Select Columns... in the context menu.
    Cluster ex2 hcluster dialog1.png
  5. In the lower panel of the Column Browser dialog, click the ... button. Set the data range from 1 to 100. Click OK.
    Cluster ex2 col browser.png
  6. Click the Settings tab, set Cluster to Observations, and set Number of Clusters to 1. For Cluster Method, select Furthest Neighbor, and then click OK.
    Hcluster ex2 dialog1.png
  7. Go to the Cluster 1 sheet. After examining the resulting dendrogram, we choose to cluster data into 5 groups.
  8. Click the lock icon in the dendrogram or the result tree, and then click Change Parameters in the context menu.
  9. Set Number of Clusters to 5 in the Settings tab and then select the Cluster Center check box in the Quantities tab. Click OK.
    Cluster ex2 hcluster dialog.png
    Cluster ex2 hcluster dialog01.png
  10. In the resulting dendrogram, we can clearly see how observations are clustered. Note that you can double-click the embedded dendrogram in the report sheet to open it in its own window. From there, you can customize the dendrogram (for instance, by adding text labels or arrows), then click the Close button in the upper-right corner of the graph window to put the changes back into the embedded graph in the report sheet.
    Hcluster ex2 dendrogram.png
  11. To focus on a particular subtree, click a node to select it, then right-click and choose Duplicate Branch to New Window. This opens the selected subtree in a new graph window.
    Dendrogram zoom1.PNG
Note that beginning with Origin 2019b, the Plot tab of the hcluster dialog has a radio button for displaying Similarity on the Y axis of your dendrogram (Distance is still the default).
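
For readers who want to see what the Furthest Neighbor method does, the following is a minimal Python sketch of agglomerative clustering with furthest-neighbor (complete) linkage, followed by the cluster-center calculation. It is not Origin's implementation: the Euclidean distance, the function names, and the toy data are all assumptions made for illustration, and the code does not reproduce Origin's dialog options.

```python
import math


def furthest_neighbor_clusters(rows, n_clusters):
    """Merge observations bottom-up until n_clusters groups remain.

    Furthest-neighbor (complete) linkage: the distance between two
    clusters is the largest distance between any pair of their members.
    Euclidean distance is an assumption of this sketch.
    """
    clusters = [[i] for i in range(len(rows))]  # start with one cluster per row
    while len(clusters) > n_clusters:
        best = None  # (linkage distance, index of cluster a, index of cluster b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(
                    math.dist(rows[p], rows[q])
                    for p in clusters[a]
                    for q in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return clusters


def cluster_centers(rows, clusters):
    """Return the per-variable mean of the rows in each cluster."""
    n_vars = len(rows[0])
    return [
        [sum(rows[i][k] for i in members) / len(members) for k in range(n_vars)]
        for members in clusters
    ]


# Hypothetical toy observations (not taken from US Mean Temperature.dat).
rows = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0], [50.0, 50.0]]
clusters = furthest_neighbor_clusters(rows, 3)
centers = cluster_centers(rows, clusters)
print(clusters)
print(centers)
```

Stopping the merging at a chosen number of clusters corresponds to cutting the dendrogram at that level. The centers computed at the end play the same role as the Cluster Center table requested in the Quantities tab.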

Analyzing Original Data with K-Means Cluster

  1. Right-click on Cluster Center and select Create Copy as New Sheet in the context menu. We are going to use the newly created Cluster Center as the Initial Cluster Centers in our k-means cluster analysis.
    Cluster ex2 cluster center.png
  2. Go back to the worksheet with the source data (US Mean Temperature), and highlight col(D) through col(O). Select Statistics: Multivariate Analysis: K-Means Cluster Analysis.
  3. Select the Specify Initial Cluster Centers check box in the Options tab. Click the interactive button next to Initial Cluster Centers. The dialog will "roll up".
  4. Go to Cluster Center and highlight Col(D) through Col(O). Click the button on the rolled-up dialog to restore the dialog.
  5. In the Plot tab, select Additional Group Graph. Click the interactive button next to X Range. The dialog will "roll up". Go back to the source worksheet US Mean Temperature, and highlight Col(B):Longitude. Click the button on the rolled-up dialog to restore it.
  6. Click the triangle button next to Y Range, and then select C(Y), Latitude. Click OK.
    Kmeans ex2 dialog.png
  7. Activate the worksheet K-Means Plot Data1. Observe that data has been clustered into 5 groups corresponding to the latitudes of the cities.
    Group graph.png
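
The idea behind seeding k-means with the hierarchical cluster centers can be sketched in a few lines of Python. This is an illustrative sketch only, not Origin's algorithm: the `kmeans` function, its `max_iter` parameter, the Euclidean distance, and the toy data are assumptions introduced here.

```python
import math


def kmeans(rows, initial_centers, max_iter=100):
    """K-means that starts from user-supplied centers.

    Each pass assigns every row to its nearest center, then moves each
    center to the mean of its assigned rows. Iteration stops when the
    assignments no longer change or max_iter is reached.
    """
    centers = [list(c) for c in initial_centers]  # copy so the input is not modified
    membership = None
    for _ in range(max_iter):
        new_membership = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in rows
        ]
        if new_membership == membership:
            break  # converged: no observation changed cluster
        membership = new_membership
        for k in range(len(centers)):
            members = [rows[i] for i, m in enumerate(membership) if m == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return membership, centers


# Hypothetical toy data and starting centers (not from the sample file).
rows = [[1.0, 1.0], [1.0, 2.0], [8.0, 8.0], [9.0, 8.0], [2.0, 1.0]]
initial_centers = [[0.0, 0.0], [10.0, 10.0]]
membership, centers = kmeans(rows, initial_centers)
print(membership)
print(centers)
```

Starting from centers found on a small subset makes the result reproducible and usually lets the iteration converge in a few passes, which is why the tutorial runs the slower hierarchical method on 100 rows first and the quick k-means method on the full dataset.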

You can also choose the output destination of the Cluster Membership column, for example next to the input data, for further operations if needed.

Cluster Membership.png