Cluster Analysis

The idea of grouping objects by shared characteristics is ancient, but the formal, algorithmic discipline of cluster analysis emerged in a specific intellectual context: mid-20th-century biology. In 1963, the entomologist Robert R. Sokal and the microbiologist Peter H.A. Sneath published Principles of Numerical Taxonomy [1], a book that proposed classifying organisms not by subjective expert judgment or assumed evolutionary lineage, but by measuring dozens of observable traits and letting mathematical algorithms group the organisms by overall similarity. The approach was controversial — taxonomists who relied on phylogenetic reasoning dismissed it — but the computational framework Sokal and Sneath developed turned out to be broadly useful far beyond biology. They formalized the ideas of distance matrices, linkage criteria, and dendrograms that remain the backbone of hierarchical clustering today.

The other foundational thread came from engineering and statistics. In 1957, Stuart Lloyd at Bell Labs devised an algorithm for optimal quantization in pulse-code modulation — essentially, how to represent a continuous signal with a finite set of discrete levels. His method iteratively assigned data points to the nearest representative and then recomputed the representatives. The paper circulated internally at Bell Labs for twenty-five years before being published in 1982 [2]. Meanwhile, in 1967, James MacQueen independently described the same procedure, gave it the name “k-means,” and provided the first mathematical analysis of its convergence properties [3]. Around the same time, Joe H. Ward Jr. had published his minimum-variance method for hierarchical clustering in 1963 [4], which became the most widely used agglomerative linkage criterion because it produces compact, roughly equal-sized clusters.

In analytical chemistry, cluster analysis found its natural home in the 1980s and 1990s as spectroscopic instruments became computerized and datasets grew. Chemists needed to classify olive oils by geographic origin, group pharmaceutical tablets by formulation, detect adulterated food products, and monitor industrial processes for drift — all problems where the goal is not prediction (regression) but discovery of structure. Today, clustering is a standard exploratory tool in chemometrics, typically applied after preprocessing and often after dimensionality reduction by PCA.

Why group samples?

In many chemical problems, you do not start with labeled categories. Instead, you have a collection of spectra or analytical measurements and a question: are there natural groups in this data?

Consider a few concrete scenarios:

  • Geographic origin classification. You have 200 olive oil samples from five Mediterranean regions. Before building any supervised classifier, you want to know whether the NIR spectra naturally separate into distinct clusters — or whether some regions overlap so heavily that classification will be difficult.

  • Process monitoring. A pharmaceutical production line generates a tablet spectrum every 30 seconds. Over a week, something drifts. Clustering the spectra by similarity can reveal when the drift began and whether it created a distinct “off-spec” group or a gradual shift.

  • Spectral library matching. You have measured Raman spectra of 500 unknown mineral samples. Rather than comparing each one against a library of 10,000 reference spectra, you first cluster the unknowns into groups, then match one representative from each cluster — reducing 500 comparisons to perhaps 15.

  • Outlier detection. If most of your samples fall into a few tight clusters and three samples sit far from any cluster, those three deserve closer inspection. They may be contaminated, mislabeled, or genuinely unusual.

In all these cases, clustering is exploratory: you are looking for structure, not testing a hypothesis.

Distance metrics

Before you can group samples, you need a way to quantify how similar (or dissimilar) two samples are. The choice of distance metric shapes the clusters you find.

Euclidean distance

The most common choice. For two samples x and y with p measured variables:

d(x, y) = \sqrt{\sum_{j=1}^{p} (x_j - y_j)^2}

Euclidean distance treats all variables equally and measures straight-line distance in the variable space. It works well when variables are on comparable scales and when clusters are roughly spherical.

Manhattan distance

Also called city-block or L1 distance:

d(x, y) = \sum_{j=1}^{p} |x_j - y_j|

Manhattan distance is less sensitive to outliers than Euclidean distance because it does not square the differences. It can be a better choice when a few variables have extreme values.

Mahalanobis distance

d_M(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}

where S is the covariance matrix. Mahalanobis distance accounts for correlations between variables and differences in variance. In spectroscopy, where neighboring wavelengths are highly correlated, this can be more meaningful than Euclidean distance — but it requires estimating the covariance matrix, which can be unstable when the number of variables exceeds the number of samples.

Spectral angle

For spectroscopic data, the angle between two spectra (treated as vectors) is sometimes more informative than the Euclidean distance:

\theta(x, y) = \arccos\left(\frac{x \cdot y}{\|x\| \, \|y\|}\right)

Spectral angle is insensitive to multiplicative scaling — two spectra with the same shape but different intensities (for example, because of different path lengths) will have a spectral angle of zero. This makes it useful when the spectral shape matters more than the absolute intensity.
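These metrics are straightforward to compute with NumPy and SciPy. A minimal sketch on two invented 5-point "spectra" (Mahalanobis is omitted here because it needs a covariance estimate from many samples):

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock

# Two toy "spectra" (hypothetical vectors, for illustration only)
x = np.array([0.2, 0.5, 0.9, 0.4, 0.1])
y = np.array([0.4, 1.0, 1.8, 0.8, 0.2])   # same shape as x, scaled by 2

d_euc = euclidean(x, y)    # straight-line distance
d_man = cityblock(x, y)    # sum of absolute differences

# Spectral angle: arccos of the normalized dot product
cos_theta = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))

print(f"Euclidean: {d_euc:.3f}, Manhattan: {d_man:.3f}, angle: {theta:.4f} rad")
# y is exactly 2*x, so the spectral angle is ~0 even though
# the Euclidean and Manhattan distances are large
```

Because y is a pure multiplicative scaling of x, the spectral angle is zero while both other distances are not — exactly the path-length insensitivity described above.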

Hierarchical clustering

Hierarchical clustering builds a tree of nested groups. The most common variant is agglomerative (bottom-up): start with each sample as its own cluster, then repeatedly merge the two closest clusters until everything belongs to one group.

  1. Start with N clusters

    Each of the N samples is its own singleton cluster.

  2. Compute the distance matrix

    Calculate the distance between every pair of clusters. For singletons, this is just the distance between samples.

  3. Merge the closest pair

    Find the two clusters with the smallest inter-cluster distance and merge them into a single cluster.

  4. Update the distance matrix

    Recalculate distances between the new merged cluster and all remaining clusters.

  5. Repeat

    Go back to step 3 until only one cluster remains.

The result is a dendrogram — a tree diagram that shows the sequence of merges and the distance at which each merge occurred. You read a dendrogram from bottom to top: samples that merge at low distances are very similar; those that merge only at large distances are quite different.

Linkage methods

The key question in step 4 is: when two clusters have been merged, how do you define the distance between the new combined cluster and some other cluster? Different answers give different linkage methods:

Single linkage (nearest neighbor): The distance between two clusters is the shortest distance between any member of one and any member of the other.

Single linkage tends to produce long, chain-like clusters. It can find elongated or irregularly shaped groups, but it is also prone to the “chaining effect” where a few intermediate points can cause two distinct groups to merge prematurely.

Complete linkage (farthest neighbor): The distance is the longest distance between any pair of members.

Complete linkage produces compact, roughly spherical clusters. It is more robust to chaining but can split genuinely large clusters.

Average linkage (UPGMA): The distance is the average of all pairwise distances between members.

A compromise between single and complete linkage. Widely used in practice.

Ward’s method [4]: Instead of measuring distances directly, Ward’s method merges the pair of clusters that causes the smallest increase in the total within-cluster sum of squares (variance). It tends to produce compact, equal-sized clusters and is the most popular choice in chemometrics.
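The four linkage criteria above are all available through SciPy's `linkage` function. A sketch on two invented, well-separated 2-D groups:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated toy 2-D groups (synthetic data, for illustration)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),
               rng.normal(3, 0.3, (20, 2))])

for method in ['single', 'complete', 'average', 'ward']:
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=2, criterion='maxclust')  # cut into 2 flat clusters
    print(method, np.bincount(labels)[1:])
```

On data this clean, every linkage recovers the same two groups of 20; the differences between the criteria only show up on elongated, noisy, or overlapping data, where single linkage chains and complete linkage splits.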

Cutting the dendrogram

A dendrogram by itself does not tell you how many clusters you have — it shows a hierarchy of possible groupings. To obtain a flat partition into clusters, you “cut” the dendrogram at a chosen height (distance threshold). All merges below the cut are kept; everything above is split.

The choice of where to cut can be guided by:

  • Visual inspection: Look for a large gap in merge distances (a long vertical line in the dendrogram). Cutting there separates groups that merged only at large distances.
  • Domain knowledge: If you know there should be three olive oil regions, cut to produce three clusters.
  • Validation metrics: Use silhouette scores or the gap statistic (discussed below) to compare different numbers of clusters.
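In SciPy, cutting is done with `fcluster`, either by requesting a number of clusters or by specifying the cut height. A sketch on invented 1-D data with three tight groups (the threshold 2.0 is chosen for this toy example):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Three tight toy groups along one variable (synthetic, for illustration)
X = np.concatenate([rng.normal(0, 0.2, 15),
                    rng.normal(5, 0.2, 15),
                    rng.normal(10, 0.2, 15)]).reshape(-1, 1)
Z = linkage(X, method='ward')

# Cut by requesting a fixed number of clusters ...
labels_k = fcluster(Z, t=3, criterion='maxclust')
# ... or by cutting at a distance threshold (the height of the cut)
labels_h = fcluster(Z, t=2.0, criterion='distance')

print(np.unique(labels_k).size, np.unique(labels_h).size)
```

Both criteria give three clusters here because all within-group merges happen far below the large gap at which the three groups finally join.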

k-means clustering

While hierarchical clustering builds a complete tree, k-means directly partitions the data into a specified number of clusters, k. It is faster than hierarchical methods for large datasets and is arguably the most widely used clustering algorithm in all of data science.

The algorithm

  1. Choose k and initialize centroids

    Select initial cluster centers. The simplest approach is to pick random samples from the dataset as starting centroids. A smarter approach (k-means++) selects initial centroids that are spread far apart.

  2. Assign each sample to the nearest centroid

    For every sample, compute its distance to each of the k centroids and assign it to the closest one.

  3. Update centroids

    Recalculate each centroid as the mean of all samples currently assigned to that cluster.

  4. Check for convergence

    If no sample changed its assignment (or the centroids moved less than a tolerance), stop. Otherwise, return to step 2.

The algorithm converges when assignments stabilize. In practice, this usually happens within 10—30 iterations.
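The four steps above fit in a few lines of NumPy. This is a minimal teaching sketch with random initialization (use `sklearn.cluster.KMeans` in practice; the toy two-group data is invented for illustration):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means sketch: random init, assign, update, repeat until stable."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Step 2: assign every sample to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once no assignment changes
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned samples
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated toy groups (synthetic data, for illustration only)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.3, (25, 2)), rng.normal(4, 0.3, (25, 2))])
labels, centroids = kmeans(X, k=2)
print(np.bincount(labels))
```

Production implementations add k-means++ initialization, multiple restarts, and an explicit tolerance on centroid movement, but the core loop is exactly this.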

What k-means minimizes

k-means minimizes the within-cluster sum of squares (WCSS), also called inertia:

WCSS = \sum_{i=1}^{k} \sum_{x \in C_i} \| x - \mu_i \|^2

where C_i is the set of samples in cluster i and \mu_i is its centroid. This is the total squared distance from every sample to its assigned centroid. Smaller WCSS means tighter clusters.
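The quantity scikit-learn reports as `inertia_` is exactly this sum, which can be verified by hand (toy two-group data, invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two toy 2-D groups (synthetic, for illustration)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(5, 0.5, (30, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# WCSS by hand: squared distance from every sample to its assigned centroid
wcss = sum(np.sum((X[km.labels_ == j] - km.cluster_centers_[j]) ** 2)
           for j in range(2))
print(np.isclose(wcss, km.inertia_))  # True: sklearn's inertia_ is the WCSS
```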

Convergence and local minima

k-means is guaranteed to decrease the WCSS at every step and to converge in a finite number of iterations. However, it converges to a local minimum, not necessarily the global one. The result depends on the initial centroids. The standard remedy is to run k-means multiple times with different random initializations and keep the solution with the lowest WCSS.

Assumptions and limitations of k-means

k-means assumes that clusters are:

  • Spherical (or at least roughly convex) in shape
  • Similar in size — it tends to split large clusters and merge small ones
  • Similar in density — dense and sparse clusters in the same dataset can confuse it

These assumptions are often approximately met after centering, scaling, and PCA reduction — which is why preprocessing matters so much for clustering.

Choosing the number of clusters

Neither hierarchical clustering (without a cut) nor k-means tells you the “right” number of clusters. You have to decide, and several quantitative criteria can help.

Elbow method

Run k-means for k = 1, 2, …, 10 and plot the within-cluster sum of squares against k. The WCSS always decreases as k increases (more clusters = less variance within each), but at some point the improvement becomes marginal. The “elbow” in the curve — where the rate of decrease sharply levels off — suggests a reasonable number of clusters.

The elbow method is intuitive but sometimes ambiguous: real data does not always produce a sharp elbow. Use it as a starting point, not a definitive answer.

Silhouette analysis

The silhouette coefficient for a single sample i measures how well it fits in its assigned cluster compared to the nearest alternative:

s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}

where:

  • a(i) is the average distance from sample i to all other samples in the same cluster (cohesion)
  • b(i) is the average distance from sample i to all samples in the nearest neighboring cluster (separation)

The silhouette coefficient ranges from −1 to +1:

  • Close to +1: the sample is well-matched to its own cluster and poorly matched to neighboring clusters
  • Close to 0: the sample is on the boundary between two clusters
  • Negative: the sample may have been assigned to the wrong cluster

The average silhouette width across all samples summarizes the overall quality of the clustering. Compare average silhouettes for different values of k and pick the one that maximizes it.
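With scikit-learn, this comparison is a short loop. A sketch on three invented, well-separated groups:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three well-separated toy groups (synthetic data, for illustration)
X = np.vstack([rng.normal(c, 0.4, (30, 2)) for c in (0, 4, 8)])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
print(best_k)  # 3: the average silhouette peaks at the true group count
```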

Gap statistic

The gap statistic [5] compares the observed WCSS to what you would expect under a null reference distribution with no cluster structure (typically, data uniformly distributed over the range of each variable). The optimal k is the smallest value for which the gap is within one standard error of the maximum.

The gap statistic is more principled than the elbow method but computationally more expensive, since it requires generating multiple null reference datasets.
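A bare-bones version of the procedure can be sketched as follows — uniform reference data, WCSS taken from k-means inertia, and the standard-error rule omitted for brevity; the two-group dataset is invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k_max=6, n_ref=10, seed=0):
    """Gap(k) = mean log WCSS on uniform reference data - log WCSS on the data."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps = []
    for k in range(1, k_max + 1):
        wcss = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        ref = [KMeans(n_clusters=k, n_init=10, random_state=seed)
               .fit(rng.uniform(lo, hi, X.shape)).inertia_
               for _ in range(n_ref)]
        gaps.append(np.mean(np.log(ref)) - np.log(wcss))
    return np.array(gaps)

rng = np.random.default_rng(1)
# Two tight toy groups (synthetic, for illustration)
X = np.vstack([rng.normal(c, 0.4, (25, 2)) for c in (0, 5)])
gaps = gap_statistic(X)
print(np.round(gaps, 2))  # the gap jumps sharply at k = 2
```

The repeated fits on reference data are what make the method expensive: each candidate k costs n_ref + 1 full k-means runs.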

Validating clusters

Finding clusters is easy — finding meaningful clusters is harder. Validation asks: are these clusters real, or artifacts of the algorithm?

Internal validation

Internal metrics assess cluster quality using only the data itself:

  • Silhouette plots. Plot the silhouette coefficient for every sample, sorted by cluster. A good clustering shows wide, uniform bars for each cluster with no negative values. Narrow bars or many negative values indicate poorly defined clusters.

  • Within-cluster sum of squares. Compact clusters have low WCSS. Compare to null models (shuffled data) to see if the structure is stronger than chance.

  • Davies-Bouldin index. Measures the average similarity between each cluster and its most similar cluster. Lower values indicate better separation. Defined as:

DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(c_i, c_j)}

where \sigma_i is the average distance of samples in cluster i to their centroid c_i, and d(c_i, c_j) is the distance between the centroids of clusters i and j.
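scikit-learn implements this index as `davies_bouldin_score`. A sketch comparing candidate k values on three invented groups:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(0)
# Three toy groups (synthetic data, for illustration)
X = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in (0, 3, 6)])

db = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    db[k] = davies_bouldin_score(X, labels)
print(min(db, key=db.get))  # 3: the index is lowest at the true group count
```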

External validation

When you have true labels (even for a subset of samples), you can compare the clustering to the known groups:

  • Adjusted Rand Index (ARI): Measures the agreement between the clustering and the true labels, corrected for chance. ARI = 1 means perfect agreement; ARI near 0 means no better than random.
  • Normalized Mutual Information (NMI): Quantifies how much information the clustering shares with the true labels. NMI = 1 is perfect; NMI = 0 is no shared information.

External validation is the strongest test, but it requires labeled data — which is not always available in exploratory analysis.
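Both metrics are one-liners in scikit-learn. Note that cluster labels are arbitrary names, so a clustering that matches the truth up to a relabeling still scores perfectly (toy labels invented for illustration):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

true_labels = np.repeat([0, 1, 2], 5)
# Same partition with permuted label names: still a perfect clustering
perfect = np.repeat([2, 0, 1], 5)
# A random assignment, for contrast
rng = np.random.default_rng(0)
random_labels = rng.integers(0, 3, size=15)

print(adjusted_rand_score(true_labels, perfect))           # 1.0 despite relabeling
print(normalized_mutual_info_score(true_labels, perfect))  # also perfect agreement
print(adjusted_rand_score(true_labels, random_labels))     # close to 0
```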

Clustering in chemistry

Spectral classification

One of the most common applications of clustering in analytical chemistry is grouping spectra by type. Given a set of NIR, MIR, or Raman spectra from unknown samples, hierarchical clustering with Ward’s linkage on the preprocessed spectra (or on PCA scores) can reveal natural groupings that correspond to different chemical compositions, origins, or processing conditions.

Example: Classifying olive oils by geographic origin using NIR spectra. After standard normal variate (SNV) preprocessing and PCA, a dendrogram computed with Ward’s method on the first five principal components typically separates oils from distinct regions into well-defined branches.

Process monitoring

In manufacturing, spectra collected over time can be clustered to detect process changes. When a process is stable, all spectra fall into one tight cluster. A new cluster appearing in real time signals a drift or fault. This approach is sometimes combined with PCA-based control charts: the PCA model defines “normal” operating space, and clustering helps characterize the nature of deviations.

Quality control and adulteration detection

Clustering can identify groups of samples that deviate from expected patterns. In food authenticity testing, for instance, genuine and adulterated honey samples often form distinct clusters when their mid-infrared spectra are analyzed. Samples that fall between the two clusters may indicate partial adulteration.

Geographic and varietal classification

Wine, coffee, tea, and pharmaceutical raw materials are routinely classified by origin or variety using spectroscopic clustering. The workflow is typically: measure spectra, preprocess (baseline correction, normalization), reduce dimensionality (PCA), cluster (Ward’s or k-means), and validate against known labels when available.

Preprocessing for clustering

Clustering results are highly sensitive to how the data is prepared. Raw spectroscopic data often contains baseline shifts, intensity variations, and noise that can dominate the distance calculations and mask the real structure.

Centering and scaling

Mean centering subtracts the average spectrum, so that clustering focuses on how samples differ from the average rather than on the common signal. Autoscaling (mean centering plus dividing by the standard deviation of each variable) gives all variables equal weight. Without scaling, variables with large numerical ranges dominate the distance calculations.

PCA reduction before clustering

When you have hundreds or thousands of variables (typical for spectroscopic data), clustering directly in the full variable space suffers from two problems:

  1. Curse of dimensionality. In high-dimensional spaces, distances between points become increasingly similar, which makes it harder to distinguish genuine clusters from noise.
  2. Noise amplification. Many spectroscopic variables carry mostly noise. Including them dilutes the signal from the informative variables.

A standard solution is to perform PCA first and then cluster the samples using the scores on the first few principal components. The PCA step removes noise (by discarding minor components) and compresses the information into a manageable number of dimensions. Typically, 2—10 principal components suffice for clustering.

Spectral preprocessing

Before centering and PCA, apply the usual spectroscopic corrections:

  • Baseline correction (e.g., AsLS, polynomial subtraction) to remove slowly varying background
  • Scatter correction (SNV or MSC) to remove multiplicative effects from particle size or path length
  • Smoothing (Savitzky-Golay) if noise is severe

The order of operations matters: correct the spectra first, then center/scale, then reduce with PCA, then cluster.
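As one example, SNV is simple enough to write by hand: each spectrum (row) is centered and scaled by its own mean and standard deviation. A sketch on invented spectra that share a shape but differ in multiplicative scatter (the helper name snv is ours):

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: center and scale each spectrum (row) individually."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std

# Toy spectra: identical shape, different multiplicative scatter (illustrative)
x_axis = np.linspace(0, 1, 100)
base = np.exp(-((x_axis - 0.5) ** 2) / 0.02)
spectra = np.vstack([1.0 * base, 1.7 * base, 0.6 * base])

corrected = snv(spectra)
# After SNV the three rows coincide: the multiplicative effect is removed
print(np.allclose(corrected[0], corrected[1]))  # True
```

Because SNV operates on each row independently, it belongs before mean centering and autoscaling, which operate on each column — consistent with the order of operations above.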

Code implementation

import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, silhouette_samples
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
# --- Generate synthetic spectroscopic data ---
# Three groups of samples with distinct spectral profiles
np.random.seed(42)
n_per_group = 40
n_wavelengths = 200
wavelengths = np.linspace(400, 2500, n_wavelengths)
# Group 1: strong peak at 800 nm
base1 = 0.8 * np.exp(-((wavelengths - 800)**2) / 5000)
# Group 2: strong peak at 1400 nm
base2 = 0.7 * np.exp(-((wavelengths - 1400)**2) / 6000)
# Group 3: peaks at both 800 and 1700 nm
base3 = (0.5 * np.exp(-((wavelengths - 800)**2) / 5000) +
         0.6 * np.exp(-((wavelengths - 1700)**2) / 4000))
group1 = base1 + np.random.normal(0, 0.05, (n_per_group, n_wavelengths))
group2 = base2 + np.random.normal(0, 0.05, (n_per_group, n_wavelengths))
group3 = base3 + np.random.normal(0, 0.05, (n_per_group, n_wavelengths))
X = np.vstack([group1, group2, group3])
true_labels = np.repeat([0, 1, 2], n_per_group)
# --- Preprocessing ---
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# PCA reduction
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X_scaled)
print(f"Variance explained by 5 PCs: {pca.explained_variance_ratio_.sum():.1%}")
# --- Hierarchical clustering with Ward's method ---
Z = linkage(X_pca, method='ward')
plt.figure(figsize=(12, 5))
dendrogram(Z, truncate_mode='lastp', p=30, leaf_rotation=90,
           leaf_font_size=8, color_threshold=10)
plt.title("Dendrogram (Ward's method on PCA scores)")
plt.xlabel('Sample index or (cluster size)')
plt.ylabel('Distance')
plt.tight_layout()
plt.show()
# --- k-means with elbow method ---
wcss = []
K_range = range(1, 11)
for k in K_range:
    km = KMeans(n_clusters=k, n_init=20, random_state=42)
    km.fit(X_pca)
    wcss.append(km.inertia_)
plt.figure(figsize=(8, 4))
plt.plot(list(K_range), wcss, 'o-', linewidth=2)
plt.xlabel('Number of clusters (k)')
plt.ylabel('Within-cluster sum of squares')
plt.title('Elbow method')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# --- k-means with k=3 ---
km = KMeans(n_clusters=3, n_init=20, random_state=42)
labels = km.fit_predict(X_pca)
sil_avg = silhouette_score(X_pca, labels)
print(f"Average silhouette score (k=3): {sil_avg:.3f}")
# --- PCA scatter colored by cluster ---
plt.figure(figsize=(8, 6))
for i in range(3):
    mask = labels == i
    plt.scatter(X_pca[mask, 0], X_pca[mask, 1], label=f'Cluster {i}',
                alpha=0.7, edgecolors='k', linewidths=0.3)
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
plt.title('k-means clusters on PCA scores')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

When to use clustering

Good for

  • Discovering natural groupings in unlabeled data
  • Classifying samples by spectral similarity (origin, type, quality)
  • Quality control: detecting off-spec batches or process drift
  • Reducing a large sample set to representative groups
  • Preliminary exploration before supervised classification

Not for

  • Predicting continuous values (use regression instead)
  • Datasets where classes overlap heavily (consider soft clustering or PLS-DA)
  • Very high-dimensional data without preprocessing (reduce with PCA first)
  • Situations requiring probabilistic class membership (consider Gaussian mixture models)

Advantages and limitations

Advantages

  • No labels required — works on unlabeled data, which is common in exploratory chemometrics
  • Intuitive results — dendrograms and scatter plots are easy to interpret and communicate
  • Flexible — many distance metrics and algorithms available for different data types
  • Scalable — k-means handles large datasets efficiently; hierarchical methods provide detailed structure
  • Reveals hidden structure — can uncover sample groupings that were not expected
  • Preprocessing-friendly — integrates naturally with PCA and standard spectroscopic preprocessing

Limitations

  • Number of clusters must be chosen — no algorithm reliably determines the “true” number of groups
  • Sensitive to preprocessing — scaling, centering, and dimensionality reduction strongly affect results
  • Assumes structure exists — will always find clusters, even in random data; validation is essential
  • k-means assumes spherical clusters — non-convex or elongated groups require different algorithms
  • Hierarchical clustering is slow for large datasets — agglomerative methods require O(N²) memory for the distance matrix and up to O(N³) time
  • Results can be unstable — small changes in data or initialization can change cluster assignments

Next steps

Cluster analysis is one piece of the exploratory chemometrics toolkit. From here, you can explore:

  • Principal Component Analysis (PCA) — the dimensionality reduction step that often precedes clustering. PCA scores plots give a visual preview of cluster structure before any formal grouping algorithm is applied.
  • SIMCA (Soft Independent Modeling of Class Analogy) — when you have labeled classes and want to build a classification model based on PCA models for each class.
  • Discriminant analysis (PLS-DA, LDA) — supervised methods that use known labels to find the directions in variable space that best separate classes, complementing the unsupervised exploration that clustering provides.

References

[1] Sokal, R. R., & Sneath, P. H. A. (1963). Principles of Numerical Taxonomy. W. H. Freeman and Company, San Francisco.

[2] Lloyd, S. (1982). “Least squares quantization in PCM.” IEEE Transactions on Information Theory, 28(2), 129—137. (Originally an unpublished Bell Labs technical report from 1957.)

[3] MacQueen, J. (1967). “Some methods for classification and analysis of multivariate observations.” In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281—297.

[4] Ward, J. H. Jr. (1963). “Hierarchical grouping to optimize an objective function.” Journal of the American Statistical Association, 58(301), 236—244.

[5] Tibshirani, R., Walther, G., & Hastie, T. (2001). “Estimating the number of clusters in a data set via the gap statistic.” Journal of the Royal Statistical Society: Series B, 63(2), 411–423.

[6] Brereton, R. G. (2003). Chemometrics: Data Analysis for the Laboratory and Chemical Plant. Wiley, Chichester. Chapter 8: Cluster Analysis.

[7] Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer, New York. Chapter 14: Unsupervised Learning.