The idea of grouping objects by shared characteristics is ancient, but the formal, algorithmic discipline of cluster analysis emerged in a specific intellectual context: mid-20th-century biology. In 1963, the entomologist Robert R. Sokal and the microbiologist Peter H.A. Sneath published Principles of Numerical Taxonomy[1], a book that proposed classifying organisms not by subjective expert judgment or assumed evolutionary lineage, but by measuring dozens of observable traits and letting mathematical algorithms group the organisms by overall similarity. The approach was controversial — taxonomists who relied on phylogenetic reasoning dismissed it — but the computational framework Sokal and Sneath developed turned out to be broadly useful far beyond biology. They formalized the ideas of distance matrices, linkage criteria, and dendrograms that remain the backbone of hierarchical clustering today.
The other foundational thread came from engineering and statistics. In 1957, Stuart Lloyd at Bell Labs devised an algorithm for optimal quantization in pulse-code modulation — essentially, how to represent a continuous signal with a finite set of discrete levels. His method iteratively assigned data points to the nearest representative and then recomputed the representatives. The paper circulated internally at Bell Labs for twenty-five years before being published in 1982 [2]. Meanwhile, in 1967, James MacQueen independently described the same procedure, gave it the name “k-means,” and provided the first mathematical analysis of its convergence properties [3]. Around the same time, Joe H. Ward Jr. had published his minimum-variance method for hierarchical clustering in 1963 [4], which became the most widely used agglomerative linkage criterion because it produces compact, roughly equal-sized clusters.
In analytical chemistry, cluster analysis found its natural home in the 1980s and 1990s as spectroscopic instruments became computerized and datasets grew. Chemists needed to classify olive oils by geographic origin, group pharmaceutical tablets by formulation, detect adulterated food products, and monitor industrial processes for drift — all problems where the goal is not prediction (regression) but discovery of structure. Today, clustering is a standard exploratory tool in chemometrics, typically applied after preprocessing and often after dimensionality reduction by PCA.
Why group samples?
In many chemical problems, you do not start with labeled categories. Instead, you have a collection of spectra or analytical measurements and a question: are there natural groups in this data?
Consider a few concrete scenarios:
Geographic origin classification. You have 200 olive oil samples from five Mediterranean regions. Before building any supervised classifier, you want to know whether the NIR spectra naturally separate into distinct clusters — or whether some regions overlap so heavily that classification will be difficult.
Process monitoring. A pharmaceutical production line generates a tablet spectrum every 30 seconds. Over a week, something drifts. Clustering the spectra by similarity can reveal when the drift began and whether it created a distinct “off-spec” group or a gradual shift.
Spectral library matching. You have measured Raman spectra of 500 unknown mineral samples. Rather than comparing each one against a library of 10,000 reference spectra, you first cluster the unknowns into groups, then match one representative from each cluster — reducing 500 comparisons to perhaps 15.
Outlier detection. If most of your samples fall into a few tight clusters and three samples sit far from any cluster, those three deserve closer inspection. They may be contaminated, mislabeled, or genuinely unusual.
In all these cases, clustering is exploratory: you are looking for structure, not testing a hypothesis.
Distance metrics
Before you can group samples, you need a way to quantify how similar (or dissimilar) two samples are. The choice of distance metric shapes the clusters you find.
Euclidean distance
The most common choice. For two samples x and y with p measured variables:
d_E(x, y) = \sqrt{\sum_{j=1}^{p} (x_j - y_j)^2}
Euclidean distance treats all variables equally and measures straight-line distance in the variable space. It works well when variables are on comparable scales and when clusters are roughly spherical.
Manhattan distance
Also called city-block or L1 distance:
d_M(x, y) = \sum_{j=1}^{p} |x_j - y_j|
Manhattan distance is less sensitive to outliers than Euclidean distance because it does not square the differences. It can be a better choice when a few variables have extreme values.
Mahalanobis distance
d_{Mah}(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}
where S is the covariance matrix. Mahalanobis distance accounts for correlations between variables and differences in variance. In spectroscopy, where neighboring wavelengths are highly correlated, this can be more meaningful than Euclidean distance — but it requires estimating the covariance matrix, which can be unstable when the number of variables exceeds the number of samples.
Spectral angle
For spectroscopic data, the angle between two spectra (treated as vectors) is sometimes more informative than the Euclidean distance:
\theta(x, y) = \arccos\left( \frac{x \cdot y}{\|x\| \, \|y\|} \right)
Spectral angle is insensitive to multiplicative scaling — two spectra with the same shape but different intensities (for example, because of different path lengths) will have a spectral angle of zero. This makes it useful when the spectral shape matters more than the absolute intensity.
Hierarchical clustering
Hierarchical clustering builds a tree of nested groups. The most common variant is agglomerative (bottom-up): start with each sample as its own cluster, then repeatedly merge the two closest clusters until everything belongs to one group.
Start with N clusters
Each of the N samples is its own singleton cluster.
Compute the distance matrix
Calculate the distance between every pair of clusters. For singletons, this is just the distance between samples.
Merge the closest pair
Find the two clusters with the smallest inter-cluster distance and merge them into a single cluster.
Update the distance matrix
Recalculate distances between the new merged cluster and all remaining clusters.
Repeat
Go back to step 3 until only one cluster remains.
The result is a dendrogram — a tree diagram that shows the sequence of merges and the distance at which each merge occurred. You read a dendrogram from bottom to top: samples that merge at low distances are very similar; those that merge only at large distances are quite different.
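The merge loop above can be sketched directly in stdlib Python. This naive version recomputes pairwise distances at every step and so runs in O(n³) time — fine for illustration; for real work, `scipy.cluster.hierarchy.linkage` and `dendrogram` are the standard tools:

```python
import math

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def agglomerative(points, linkage="single"):
    """Return the merge sequence [(cluster_a, cluster_b, height), ...].

    Clusters are frozensets of sample indices. Only single and complete
    linkage are sketched here; other criteria plug in the same way.
    """
    clusters = [frozenset([i]) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pair_d = [euclid(points[a], points[b])
                          for a in clusters[i] for b in clusters[j]]
                d = min(pair_d) if linkage == "single" else max(pair_d)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters = ([c for k, c in enumerate(clusters) if k not in (i, j)]
                    + [clusters[i] | clusters[j]])
    return merges

# Two obvious groups: samples 0-1 merge early, 2-3 merge early,
# and the two groups merge last at a large height.
for a, b, h in agglomerative([[0.0], [0.1], [5.0], [5.2]]):
    print(sorted(a), sorted(b), round(h, 3))
```

Reading the printed merge heights bottom-up reproduces the dendrogram: low heights for within-group merges, one large height for the final merge.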
Linkage methods
The key question in step 4 is: when two clusters have been merged, how do you define the distance between the new combined cluster and some other cluster? Different answers give different linkage methods:
Single linkage (nearest neighbor): The distance between two clusters is the shortest distance between any member of one and any member of the other.
d(A, B) = \min_{i \in A,\, j \in B} d(i, j)
Single linkage tends to produce long, chain-like clusters. It can find elongated or irregularly shaped groups, but it is also prone to the “chaining effect” where a few intermediate points can cause two distinct groups to merge prematurely.
Complete linkage (farthest neighbor): The distance is the longest distance between any pair of members.
d(A, B) = \max_{i \in A,\, j \in B} d(i, j)
Complete linkage produces compact, roughly spherical clusters. It is more robust to chaining but can split genuinely large clusters.
Average linkage (UPGMA): The distance is the average of all pairwise distances between members.
d(A, B) = \frac{1}{|A| \cdot |B|} \sum_{i \in A} \sum_{j \in B} d(i, j)
A compromise between single and complete linkage. Widely used in practice.
Ward’s method[4]: Instead of measuring distances directly, Ward’s method merges the pair of clusters that causes the smallest increase in the total within-cluster sum of squares (variance). It tends to produce compact, equal-sized clusters and is the most popular choice in chemometrics.
\Delta(A, B) = \frac{|A| \cdot |B|}{|A| + |B|} \, \|\bar{x}_A - \bar{x}_B\|^2
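For two concrete clusters, all four criteria are one-liners over the pairwise distances (Ward's increase uses the centroid formula above, which assumes squared Euclidean distance). A stdlib-only sketch:

```python
def linkage_distances(A, B, d):
    """Single, complete, average, and Ward distances between clusters A, B.

    A and B are lists of samples; d is a pairwise distance function.
    """
    pair = [d(a, b) for a in A for b in B]
    single = min(pair)
    complete = max(pair)
    average = sum(pair) / (len(A) * len(B))
    # Ward: Delta(A,B) = |A||B| / (|A|+|B|) * ||mean_A - mean_B||^2
    p = len(A[0])
    mean_A = [sum(a[j] for a in A) / len(A) for j in range(p)]
    mean_B = [sum(b[j] for b in B) / len(B) for j in range(p)]
    sq = sum((ma - mb) ** 2 for ma, mb in zip(mean_A, mean_B))
    ward = len(A) * len(B) / (len(A) + len(B)) * sq
    return single, complete, average, ward

d1 = lambda a, b: abs(a[0] - b[0])          # 1-D Euclidean distance
print(linkage_distances([[0.0], [1.0]], [[3.0], [5.0]], d1))
```

Running this on the two tiny 1-D clusters shows how much the criteria can disagree: single linkage sees the closest pair, complete linkage the farthest, and Ward weighs both cluster sizes and centroid separation.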
Cutting the dendrogram
A dendrogram by itself does not tell you how many clusters you have — it shows a hierarchy of possible groupings. To obtain a flat partition into k clusters, you “cut” the dendrogram at a chosen height (distance threshold). All merges below the cut are kept; everything above is split.
The choice of where to cut can be guided by:
Visual inspection: Look for a large gap in merge distances (a long vertical line in the dendrogram). Cutting there separates groups that merged only at large distances.
Domain knowledge: If you know there should be three olive oil regions, cut to produce three clusters.
Validation metrics: Use silhouette scores or the gap statistic (discussed below) to compare different numbers of clusters.
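A cut can be expressed as a union-find pass over the merge list: keep every merge at or below the threshold and read off the connected components (`scipy.cluster.hierarchy.fcluster` does this for real dendrograms, with `criterion='distance'` or `'maxclust'`). The merge format below is a simplified, hypothetical one — one representative sample pair per merge:

```python
def cut_dendrogram(n, merges, t):
    """Flat cluster labels for n samples after cutting at height t.

    `merges` is a list of (sample_a, sample_b, height) links; links with
    height <= t are kept, everything above the cut is split.
    """
    parent = list(range(n))

    def find(i):
        # Find the root of i with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b, h in merges:
        if h <= t:
            parent[find(a)] = find(b)
    roots = {find(i) for i in range(n)}
    label = {r: k for k, r in enumerate(sorted(roots))}
    return [label[find(i)] for i in range(n)]

merges = [(0, 1, 0.1), (2, 3, 0.2), (0, 2, 4.9)]
print(cut_dendrogram(4, merges, 1.0))  # two clusters: [0, 0, 1, 1]
print(cut_dendrogram(4, merges, 5.0))  # one cluster:  [0, 0, 0, 0]
```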
k-means clustering
While hierarchical clustering builds a complete tree, k-means directly partitions the data into a specified number of clusters, k. It is faster than hierarchical methods for large datasets and is arguably the most widely used clustering algorithm in all of data science.
The algorithm
Choose k and initialize centroids
Select k initial cluster centers. The simplest approach is to pick k random samples from the dataset as starting centroids. A smarter approach (k-means++) selects initial centroids that are spread far apart.
Assign each sample to the nearest centroid
For every sample, compute its distance to each of the k centroids and assign it to the closest one.
Update centroids
Recalculate each centroid as the mean of all samples currently assigned to that cluster.
\mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i
Check for convergence
If no sample changed its assignment (or the centroids moved less than a tolerance), stop. Otherwise, return to step 2.
The algorithm converges when assignments stabilize. In practice, this usually happens within 10–30 iterations.
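The four steps condense into a short, stdlib-only sketch of Lloyd's algorithm. For real work you would use `sklearn.cluster.KMeans`, whose `n_init` parameter also handles multiple random restarts:

```python
import random

def kmeans(X, k, max_iter=100, seed=0):
    """Lloyd's algorithm: assign samples to the nearest centroid, then
    recompute each centroid as the mean of its members, until stable."""
    rng = random.Random(seed)
    centroids = [list(x) for x in rng.sample(X, k)]  # random initial centers

    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [min(range(k), key=lambda c: sqdist(x, centroids[c]))
                      for x in X]
        if new_labels == labels:          # convergence: nothing changed
            break
        labels = new_labels
        # Update step: centroid = mean of its assigned samples.
        for c in range(k):
            members = [x for x, l in zip(X, labels) if l == c]
            if members:                   # guard against an empty cluster
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids

X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
     [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
labels, cents = kmeans(X, 2)
print(labels)   # the two tight blobs get two different labels
```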
What k-means minimizes
k-means minimizes the within-cluster sum of squares (WCSS), also called inertia:
W = \sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - \mu_k\|^2
This is the total squared distance from every sample to its assigned centroid. Smaller W means tighter clusters.
Convergence and local minima
k-means is guaranteed to decrease W at every step and to converge in a finite number of iterations. However, it converges to a local minimum, not necessarily the global one. The result depends on the initial centroids. The standard remedy is to run k-means multiple times with different random initializations and keep the solution with the lowest W.
Assumptions and limitations of k-means
k-means assumes that clusters are:
Spherical (or at least roughly convex) in shape
Similar in size — it tends to split large clusters and merge small ones
Similar in density — dense and sparse clusters in the same dataset can confuse it
These assumptions are often approximately met after centering, scaling, and PCA reduction — which is why preprocessing matters so much for clustering.
Choosing the number of clusters
Neither hierarchical clustering (without a cut) nor k-means tells you the “right” number of clusters. You have to decide, and several quantitative criteria can help.
Elbow method
Run k-means for k = 1, 2, 3, …, K_max and plot the within-cluster sum of squares W against k. The WCSS always decreases as k increases (more clusters = less variance within each), but at some point the improvement becomes marginal. The “elbow” in the curve — where the rate of decrease sharply levels off — suggests a reasonable number of clusters.
The elbow method is intuitive but sometimes ambiguous: real data does not always produce a sharp elbow. Use it as a starting point, not a definitive answer.
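Producing the elbow curve just means running k-means across a range of k and recording W. A deterministic stdlib sketch — farthest-point initialization stands in for k-means++ here so that the output is reproducible:

```python
def wcss_curve(X, k_max):
    """WCSS for k = 1..k_max, with deterministic farthest-point init."""
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    curve = []
    for k in range(1, k_max + 1):
        centroids = [X[0]]
        while len(centroids) < k:
            # Next centroid: the sample farthest from all chosen so far.
            centroids.append(max(X, key=lambda x:
                                 min(sqdist(x, c) for c in centroids)))
        labels = None
        for _ in range(100):  # Lloyd iterations
            new = [min(range(k), key=lambda c: sqdist(x, centroids[c]))
                   for x in X]
            if new == labels:
                break
            labels = new
            for c in range(k):
                m = [x for x, l in zip(X, labels) if l == c]
                if m:
                    centroids[c] = [sum(col) / len(m) for col in zip(*m)]
        curve.append(sum(sqdist(x, centroids[l]) for x, l in zip(X, labels)))
    return curve

# Three well-separated pairs: WCSS drops steeply until k = 3, then flattens.
X = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0],
     [5.2, 5.0], [10.0, 0.0], [10.2, 0.0]]
print([round(w, 2) for w in wcss_curve(X, 3)])
```

On this toy set the curve collapses almost to zero at k = 3 — the "elbow" is unambiguous, which real data rarely is.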
Silhouette analysis
The silhouette coefficient [5] for a single sample i measures how well it fits in its assigned cluster compared to the nearest alternative:
s(i) = \frac{b(i) - a(i)}{\max(a(i),\, b(i))}
where:
a(i) is the average distance from sample i to all other samples in the same cluster (cohesion)
b(i) is the average distance from sample i to all samples in the nearest neighboring cluster (separation)
The silhouette coefficient ranges from −1 to +1:
Close to +1: the sample is well-matched to its own cluster and poorly matched to neighboring clusters
Close to 0: the sample is on the boundary between two clusters
Negative: the sample may have been assigned to the wrong cluster
The average silhouette width across all samples summarizes the overall quality of the clustering. Compare average silhouettes for different values of k and pick the one that maximizes it.
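`sklearn.metrics.silhouette_samples` and `silhouette_score` compute this directly; a stdlib version makes the definitions of a(i) and b(i) concrete:

```python
def silhouette_samples(X, labels, d):
    """Silhouette coefficient s(i) for every sample.

    a(i): mean distance to the other members of its own cluster;
    b(i): lowest mean distance to the members of any other cluster.
    Singleton clusters get s(i) = 0 by the usual convention.
    """
    clusters = {}
    for idx, l in enumerate(labels):
        clusters.setdefault(l, []).append(idx)
    s = []
    for i, x in enumerate(X):
        own = [j for j in clusters[labels[i]] if j != i]
        if not own:
            s.append(0.0)
            continue
        a = sum(d(x, X[j]) for j in own) / len(own)
        b = min(sum(d(x, X[j]) for j in members) / len(members)
                for l, members in clusters.items() if l != labels[i])
        s.append((b - a) / max(a, b))
    return s

X = [[0.0], [0.1], [5.0], [5.1]]
d1 = lambda a, b: abs(a[0] - b[0])
print(silhouette_samples(X, [0, 0, 1, 1], d1))  # all near +1
print(silhouette_samples(X, [0, 1, 0, 1], d1))  # mislabeled: negatives
```

The second call deliberately scrambles the labels: the negative silhouettes flag exactly the kind of misassignment described above.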
Gap statistic
The gap statistic, introduced by Tibshirani and co-workers, compares the observed WCSS to what you would expect under a null reference distribution with no cluster structure (typically, data uniformly distributed over the range of each variable). The optimal k is the smallest value for which the gap is within one standard error of the maximum.
The gap statistic is more principled than the elbow method but computationally more expensive, since it requires generating multiple null reference datasets.
Validating clusters
Finding clusters is easy — finding meaningful clusters is harder. Validation asks: are these clusters real, or artifacts of the algorithm?
Internal validation
Internal metrics assess cluster quality using only the data itself:
Silhouette plots. Plot the silhouette coefficient for every sample, sorted by cluster. A good clustering shows wide, uniform bars for each cluster with no negative values. Narrow bars or many negative values indicate poorly defined clusters.
Within-cluster sum of squares. Compact clusters have low WCSS. Compare to null models (shuffled data) to see if the structure is stronger than chance.
Davies-Bouldin index. Measures the average similarity between each cluster and its most similar cluster. Lower values indicate better separation. Defined as:
DB = \frac{1}{K} \sum_{k=1}^{K} \max_{l \neq k} \frac{\sigma_k + \sigma_l}{d(\mu_k, \mu_l)}
where σk is the average distance of samples in cluster k to their centroid, and d(μk,μl) is the distance between centroids.
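`sklearn.metrics.davies_bouldin_score` computes this; a stdlib sketch with Euclidean distances follows the formula term by term:

```python
import math

def davies_bouldin(X, labels):
    """Davies-Bouldin index: mean over clusters of the worst-case
    (sigma_k + sigma_l) / d(mu_k, mu_l) ratio. Lower is better."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    ks = sorted(set(labels))
    p = len(X[0])
    mu, sigma = {}, {}
    for k in ks:
        members = [x for x, l in zip(X, labels) if l == k]
        mu[k] = [sum(x[j] for x in members) / len(members) for j in range(p)]
        # sigma_k: average distance of cluster members to their centroid.
        sigma[k] = sum(dist(x, mu[k]) for x in members) / len(members)
    # For each cluster, take its worst (largest) similarity ratio.
    return sum(max((sigma[k] + sigma[l]) / dist(mu[k], mu[l])
                   for l in ks if l != k) for k in ks) / len(ks)

tight = [[0.0, 0.0], [0.2, 0.0], [10.0, 0.0], [10.2, 0.0]]
print(davies_bouldin(tight, [0, 0, 1, 1]))  # tight, far apart: small DB
```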
External validation
When you have true labels (even for a subset of samples), you can compare the clustering to the known groups:
Adjusted Rand Index (ARI): Measures the agreement between the clustering and the true labels, corrected for chance. ARI = 1 means perfect agreement; ARI near 0 means no better than random.
Normalized Mutual Information (NMI): Quantifies how much information the clustering shares with the true labels. NMI = 1 is perfect; NMI = 0 is no shared information.
External validation is the strongest test, but it requires labeled data — which is not always available in exploratory analysis.
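Both indices are available as `sklearn.metrics.adjusted_rand_score` and `normalized_mutual_info_score`. The ARI itself is a short contingency-table computation, sketched here with the stdlib only:

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_true, labels_pred):
    """ARI from the contingency table: 1 = perfect, ~0 = chance level."""
    n = len(labels_true)
    contingency = Counter(zip(labels_true, labels_pred))
    a = Counter(labels_true)   # row sums (true-class sizes)
    b = Counter(labels_pred)   # column sums (cluster sizes)
    sum_ij = sum(comb(v, 2) for v in contingency.values())
    sum_a = sum(comb(v, 2) for v in a.values())
    sum_b = sum(comb(v, 2) for v in b.values())
    expected = sum_a * sum_b / comb(n, 2)      # chance agreement
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

# Cluster labels are arbitrary: a relabeled perfect clustering still scores 1.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # negative
```

Note that the ARI only cares about which samples are grouped together, not what the groups are called — a useful property, since cluster labels are arbitrary.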
Clustering in chemistry
Spectral classification
One of the most common applications of clustering in analytical chemistry is grouping spectra by type. Given a set of NIR, MIR, or Raman spectra from unknown samples, hierarchical clustering with Ward’s linkage on the preprocessed spectra (or on PCA scores) can reveal natural groupings that correspond to different chemical compositions, origins, or processing conditions.
Example: Classifying olive oils by geographic origin using NIR spectra. After standard normal variate (SNV) preprocessing and PCA, a dendrogram computed with Ward’s method on the first five principal components typically separates oils from distinct regions into well-defined branches.
Process monitoring
In manufacturing, spectra collected over time can be clustered to detect process changes. When a process is stable, all spectra fall into one tight cluster. A new cluster appearing in real time signals a drift or fault. This approach is sometimes combined with PCA-based control charts: the PCA model defines “normal” operating space, and clustering helps characterize the nature of deviations.
Quality control and adulteration detection
Clustering can identify groups of samples that deviate from expected patterns. In food authenticity testing, for instance, genuine and adulterated honey samples often form distinct clusters when their mid-infrared spectra are analyzed. Samples that fall between the two clusters may indicate partial adulteration.
Geographic and varietal classification
Wine, coffee, tea, and pharmaceutical raw materials are routinely classified by origin or variety using spectroscopic clustering. The workflow is typically: measure spectra, preprocess (baseline correction, normalization), reduce dimensionality (PCA), cluster (Ward’s or k-means), and validate against known labels when available.
Preprocessing for clustering
Clustering results are highly sensitive to how the data is prepared. Raw spectroscopic data often contains baseline shifts, intensity variations, and noise that can dominate the distance calculations and mask the real structure.
Centering and scaling
Mean centering subtracts the average spectrum, so that clustering focuses on how samples differ from the average rather than on the common signal. Autoscaling (mean centering plus dividing by the standard deviation of each variable) gives all variables equal weight. Without scaling, variables with large numerical ranges dominate the distance calculations.
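Autoscaling is two column-wise passes over the data matrix. A stdlib sketch using the sample standard deviation (n − 1 in the denominator; library implementations may divide by n instead):

```python
import math

def autoscale(X):
    """Mean-center each column, then divide by its standard deviation,
    so every variable contributes equally to distance calculations."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / (n - 1))
            for j in range(p)]
    return [[(row[j] - means[j]) / stds[j] for j in range(p)] for row in X]

# Two variables on wildly different scales end up comparable.
print(autoscale([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]))
```

After autoscaling, the second column no longer dominates Euclidean distances merely because its raw values are a hundred times larger.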
PCA reduction before clustering
When you have hundreds or thousands of variables (typical for spectroscopic data), clustering directly in the full variable space suffers from two problems:
Curse of dimensionality. In high-dimensional spaces, distances between points become increasingly similar, which makes it harder to distinguish genuine clusters from noise.
Noise amplification. Many spectroscopic variables carry mostly noise. Including them dilutes the signal from the informative variables.
A standard solution is to perform PCA first and then cluster the samples using the scores on the first few principal components. The PCA step removes noise (by discarding minor components) and compresses the information into a manageable number of dimensions. Typically, 2–10 principal components suffice for clustering.
Spectral preprocessing
Before centering and PCA, apply the usual spectroscopic corrections: baseline correction, normalization or scatter correction (for example SNV), and smoothing where noise warrants it.
When to consider alternatives
Clustering as presented here is a poor fit for:
Datasets where classes overlap heavily (consider soft clustering or PLS-DA)
Very high-dimensional data without preprocessing (reduce with PCA first)
Situations requiring probabilistic class membership (consider Gaussian mixture models)
Advantages and limitations
Advantages
No labels required — works on unlabeled data, which is common in exploratory chemometrics
Intuitive results — dendrograms and scatter plots are easy to interpret and communicate
Flexible — many distance metrics and algorithms available for different data types
Scalable — k-means handles large datasets efficiently; hierarchical methods provide detailed structure
Reveals hidden structure — can uncover sample groupings that were not expected
Preprocessing-friendly — integrates naturally with PCA and standard spectroscopic preprocessing
Limitations
Number of clusters must be chosen — no algorithm reliably determines the “true” number of groups
Sensitive to preprocessing — scaling, centering, and dimensionality reduction strongly affect results
Assumes structure exists — will always find clusters, even in random data; validation is essential
k-means assumes spherical clusters — non-convex or elongated groups require different algorithms
Hierarchical clustering is slow — O(n²) memory and O(n³) time for agglomerative methods
Results can be unstable — small changes in data or initialization can change cluster assignments
Next steps
Cluster analysis is one piece of the exploratory chemometrics toolkit. From here, you can explore:
Principal Component Analysis (PCA) — the dimensionality reduction step that often precedes clustering. PCA scores plots give a visual preview of cluster structure before any formal grouping algorithm is applied.
SIMCA (Soft Independent Modeling of Class Analogy) — when you have labeled classes and want to build a classification model based on PCA models for each class.
Discriminant analysis (PLS-DA, LDA) — supervised methods that use known labels to find the directions in variable space that best separate classes, complementing the unsupervised exploration that clustering provides.
References
[1] Sokal, R. R., & Sneath, P. H. A. (1963). Principles of Numerical Taxonomy. W. H. Freeman and Company, San Francisco.
[2] Lloyd, S. (1982). “Least squares quantization in PCM.” IEEE Transactions on Information Theory, 28(2), 129–137. (Originally an unpublished Bell Labs technical report from 1957.)
[3] MacQueen, J. (1967). “Some methods for classification and analysis of multivariate observations.” In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297.
[4] Ward, J. H. Jr. (1963). “Hierarchical grouping to optimize an objective function.” Journal of the American Statistical Association, 58(301), 236–244.
[5] Kaufman, L., & Rousseeuw, P. J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.
[6] Brereton, R. G. (2003). Chemometrics: Data Analysis for the Laboratory and Chemical Plant. Wiley, Chichester. Chapter 8: Cluster Analysis.
[7] Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer, New York. Chapter 14: Unsupervised Learning.