In 1901, the English mathematician Karl Pearson published a paper titled “On Lines and Planes of Closest Fit to Systems of Points in Space” in the Philosophical Magazine[1]. Pearson’s problem was geometric: given a cloud of points in many dimensions, how do you find the line (or plane, or hyperplane) that passes as close as possible to all the points? His solution was to find the direction in which the projected points had the greatest spread — the direction of maximum variance. Then find a second direction, perpendicular to the first, that captured the most remaining variance. And so on. The result was a coordinate system tailored to the data, with axes ranked by how much variation each one explained. Pearson was not thinking about chemistry. He was working on problems of biological inheritance, fitting lines and planes to measurements of skulls, limb lengths, and other anatomical features. But the mathematics he developed would prove universal.
Three decades later, the American psychologist Harold Hotelling independently arrived at the same mathematical structure from a completely different direction. In 1933, he published “Analysis of a Complex of Statistical Variables into Principal Components” in the Journal of Educational Psychology[2]. Hotelling’s motivation was psychological testing: he had matrices of test scores from students across many exams and wanted to find a small number of underlying “factors” (general intelligence, verbal ability, mathematical aptitude) that explained the correlations among the tests. He formalized the method, gave it the name principal components, and established the eigenvalue-eigenvector framework that is still used today. Where Pearson had approached the problem geometrically (closest-fitting subspaces), Hotelling approached it statistically (maximizing variance of linear combinations).
For the next three decades, PCA remained a tool for psychologists, biologists, and social scientists. Its application to large datasets was limited by computation — eigendecomposition of big matrices was impractical on the hardware of the time. That changed in 1966, when Herman Wold introduced the NIPALS algorithm (Nonlinear Iterative Partial Least Squares) [3], which extracted principal components one at a time through a simple iterative procedure. NIPALS made PCA feasible for datasets with hundreds of variables without ever computing the full covariance matrix. This was the bridge that carried PCA into chemometrics: Herman Wold’s son, Svante Wold, together with Paul Geladi, Bruce Kowalski, and others, made PCA the foundation of pattern recognition in spectroscopic data during the 1970s and 1980s [4]. Today, PCA is almost always the first thing a chemometrician does with a new dataset.
The dimensionality problem
A modern near-infrared (NIR) spectrometer records absorbance at hundreds or thousands of wavelengths. A single spectrum might contain 1050 data points — one for each wavelength between 900 and 2500 nm. If you measure 200 olive oil samples to check for adulteration, your dataset is a 200×1050 matrix: 200 rows (samples) and 1050 columns (wavelengths). Mathematically, each sample lives in a 1050-dimensional space.
But do you really need 1050 dimensions to describe the important differences among these olive oils? Almost certainly not. The absorbance at 1720 nm is closely related to the absorbance at 1721 nm, because they arise from similar molecular vibrations. The same is true across broad spectral regions: the C-H stretching overtones near 1200 nm, the O-H combinations near 1940 nm, and the C-H first-overtone features near 1730 nm tend to move together as the chemical composition changes. The 1050 wavelengths are not 1050 independent pieces of information. They are 1050 redundant measurements of a much smaller number of underlying sources of variation — perhaps the concentrations of oleic acid, linoleic acid, water, and free fatty acids, plus some physical effects like scattering.
The question PCA answers is: how many independent directions of variation does this dataset actually contain, and what do those directions look like? If the answer is “about 5,” then PCA compresses 1050 dimensions into 5, with minimal loss of information. This compression is not just convenient — it is essential for visualization (you can plot 2 or 3 dimensions but not 1050), for noise reduction (the discarded dimensions mostly contain noise), and for building predictive models (fewer variables means more stable regression).
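This claim can be checked numerically. The sketch below uses synthetic data whose dimensions mirror the olive-oil example (these are random numbers, not real NIR spectra): 200 "samples" at 1050 "wavelengths" generated from only 5 hidden sources of variation plus a little noise.

```python
import numpy as np

# Synthetic stand-in for the olive-oil example: 200 samples, 1050 wavelengths,
# but only 5 hidden sources of variation (plus a little measurement noise).
rng = np.random.default_rng(0)
n_samples, n_wavelengths, n_sources = 200, 1050, 5

hidden = rng.normal(size=(n_samples, n_sources))          # e.g. concentrations
profiles = rng.normal(size=(n_sources, n_wavelengths))    # e.g. pure-component profiles
X = hidden @ profiles + 0.01 * rng.normal(size=(n_samples, n_wavelengths))

# Fraction of total variance along each principal direction.
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)
explained = s**2 / (s**2).sum()

print(explained[:5].sum())   # close to 1: five directions carry nearly everything
```

Even though each sample formally lives in a 1050-dimensional space, essentially all of the variance is concentrated in the first five directions.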
The core idea
PCA searches for directions of maximum variance in the data and projects the samples onto those directions.
Imagine your data matrix X as a cloud of points in a high-dimensional space. Some directions through that cloud show a lot of spread (the points are scattered widely), while other directions show very little (the points are bunched together). The directions with the most spread carry the most information about how the samples differ from each other.
The first principal component (PC1) is the direction in the original variable space along which the data has the greatest variance. It is a linear combination of all the original variables:
$$\mathrm{PC1} = w_{11}x_1 + w_{12}x_2 + \dots + w_{1p}x_p$$

where $x_1, x_2, \dots, x_p$ are the original variables (wavelengths) and the weights $w_{11}, w_{12}, \dots, w_{1p}$ define the direction. PCA chooses these weights to maximize the variance of the projected data, subject to the constraint that the weight vector has unit length ($\sum_j w_{1j}^2 = 1$).
The second principal component (PC2) is the direction of maximum variance orthogonal to PC1. It captures the largest amount of variation that PC1 missed. The third is orthogonal to both PC1 and PC2, and so on.
The result is a new coordinate system where:
The first few axes capture most of the systematic variation (signal)
The later axes capture progressively less variation (usually noise)
All axes are mutually perpendicular (uncorrelated)
Covariance and correlation
Before computing PCA, you need to decide how to preprocess your data. This decision determines what kind of variation PCA will find.
Mean-centering
Mean-centering is almost always required. You subtract the column mean from each variable:
$$x_{ij}^{\text{centered}} = x_{ij} - \bar{x}_j$$
Without centering, the first principal component would point roughly toward the mean of the data, capturing the average signal rather than the differences among samples. This wastes a component on information that carries no discriminating power.
Scaling: covariance vs. correlation
After centering, you face a choice:
PCA on the covariance matrix (no scaling): Variables with larger absolute variance dominate the analysis. For spectroscopic data where all variables have the same units (absorbance), this is usually appropriate. Wavelength regions with strong absorption bands naturally contribute more to the PCA decomposition.
PCA on the correlation matrix (auto-scaling to unit variance): Each variable is divided by its standard deviation before PCA. All variables contribute equally regardless of their magnitude. This is essential when variables have different units (e.g., mixing spectral data with temperature, pH, and pressure measurements) but can be harmful for spectroscopic data because it amplifies noisy baseline regions.
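The two preprocessing choices can be written out directly. A minimal sketch with a toy array of three variables on very different scales (values and variable choices are illustrative only):

```python
import numpy as np

# Toy data: three variables on very different scales
# (think temperature, pH, pressure rather than absorbances).
rng = np.random.default_rng(9)
X = rng.normal(loc=[20.0, 7.0, 1000.0], scale=[2.0, 0.1, 50.0], size=(100, 3))

# Mean-centering only -> PCA on the covariance matrix.
X_centered = X - X.mean(axis=0)

# Auto-scaling (center, then divide by the standard deviation)
# -> PCA on the correlation matrix; every column now has unit variance.
X_auto = X_centered / X.std(axis=0, ddof=1)
```

Without auto-scaling, the third column (standard deviation 50) would dominate the covariance-based PCA of this toy set; after auto-scaling, all three columns contribute equally.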
Computing principal components
The mathematics of PCA reduces to an eigenvalue problem. Starting from the centered data matrix $X$ ($n \times p$, where $n$ is the number of samples and $p$ the number of variables), you compute the covariance matrix:

$$C = \frac{1}{n-1} X^T X$$
This p×p symmetric matrix summarizes how every pair of variables co-varies across the samples. The diagonal elements are the variances of individual variables; the off-diagonal elements are the covariances between pairs of variables.
Eigendecomposition
The eigendecomposition of the covariance matrix is:
$$C = P \Lambda P^T$$
where:
$P$ is a $p \times p$ orthogonal matrix whose columns are the eigenvectors (also called loadings)
$\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_p)$ is a diagonal matrix of eigenvalues, ordered so that $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0$
Each eigenvector $p_a$ defines a principal component direction. Its corresponding eigenvalue $\lambda_a$ equals the variance of the data when projected onto that direction. The eigenvectors are orthonormal: perpendicular to each other and of unit length.
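A short numeric sketch of this route (toy data, NumPy only):

```python
import numpy as np

# PCA via eigendecomposition of the covariance matrix (toy data).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)          # mean-center first
n = Xc.shape[0]

C = Xc.T @ Xc / (n - 1)          # p x p covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)     # eigh is made for symmetric matrices
order = np.argsort(eigvals)[::-1]        # eigh returns ascending; PCA wants descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum() * 100   # percent of variance per component
```

The columns of `eigvecs` are the loadings; note the reordering step, since `numpy.linalg.eigh` returns eigenvalues in ascending order.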
Singular value decomposition
In practice, PCA is more often computed via the singular value decomposition (SVD) of the centered data matrix itself:
$$X = U \Sigma V^T$$
where:
$U$ ($n \times n$) contains the left singular vectors
$\Sigma$ ($n \times p$) is diagonal, with singular values $\sigma_1 \ge \sigma_2 \ge \dots$
$V$ ($p \times p$) contains the right singular vectors
The connection to PCA is direct: the columns of $V$ are the loadings (eigenvectors of $X^T X$), and the eigenvalues are $\lambda_a = \sigma_a^2/(n-1)$. The SVD is numerically more stable than forming $X^T X$ explicitly and then diagonalizing it, which is why most software uses the SVD internally.
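The equivalence of the two routes is easy to verify numerically (toy data for illustration):

```python
import numpy as np

# The two routes to PCA agree: SVD of the centered data matrix versus
# eigendecomposition of the covariance matrix.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 6))
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# SVD route: the covariance matrix is never formed.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
eigvals_svd = s**2 / (n - 1)             # lambda_a = sigma_a^2 / (n - 1)

# Covariance route, for comparison (reversed into descending order).
eigvals_cov = np.linalg.eigh(Xc.T @ Xc / (n - 1))[0][::-1]
```

The eigenvalues agree to machine precision; the loadings also agree, up to the sign ambiguity inherent in eigenvectors.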
Variance explained
Each eigenvalue tells you how much variance its component captures. The proportion of variance explained by the $a$-th component is:

$$\text{Variance explained by PC}_a = \frac{\lambda_a}{\sum_{k=1}^{p} \lambda_k} \times 100\%$$

The cumulative variance explained by the first $A$ components is:

$$\text{Cumulative variance} = \frac{\sum_{a=1}^{A} \lambda_a}{\sum_{k=1}^{p} \lambda_k} \times 100\%$$

In typical spectroscopic datasets, the first 2 to 5 components often capture 95–99% of the total variance. The remaining hundreds of components collectively account for a few percent — mostly noise.
Scores and loadings
PCA decomposes the data matrix as:
$$X = T P^T + E$$

where:
$T$ ($n \times A$) is the scores matrix
$P$ ($p \times A$) is the loadings matrix
$E$ is the residual (what is left after $A$ components)
This decomposition is the heart of PCA interpretation.
Scores: the sample map
The scores are the coordinates of each sample in the new principal component space:
$$T = XP$$
Each row of T is a sample; each column is a principal component. Plotting the scores of PC1 against PC2 gives a scores plot — the most important visualization in exploratory chemometrics. In this plot:
Samples that are close together have similar spectral profiles
Samples that are far apart differ significantly
Clusters suggest groups with distinct chemical compositions
Outliers appear as isolated points far from the main group
Gradients (smooth transitions from one region to another) suggest a continuous change in composition (e.g., increasing concentration of an analyte)
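Building the coordinates for such a plot takes only a few lines; a minimal sketch on toy data (the plotting call itself is left as a comment):

```python
import numpy as np

# Compute PC1/PC2 scores for a scores plot (toy data).
rng = np.random.default_rng(3)
X = rng.normal(size=(40, 10))
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt.T[:, :2]          # loadings for PC1 and PC2 (p x 2)
T = Xc @ P               # scores (n x 2): T = XP

# The scores plot would then be, e.g. with matplotlib:
#   plt.scatter(T[:, 0], T[:, 1])
```

Equivalently, the scores fall straight out of the SVD as $U \Sigma$ restricted to the first two columns, which the test below exploits.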
Loadings: the variable map
The loadings tell you how each original variable contributes to each principal component. The loading vector $p_a$ has one value per original variable (wavelength). If the loading of PC1 at 1720 nm is large and positive, it means that samples with high PC1 scores have high absorbance at 1720 nm — a C-H stretching first-overtone region whose intensity tracks the fatty acid composition.
A loadings plot shows the loading values as a function of wavelength. It looks like a spectrum and often can be interpreted chemically:
Peaks in the loadings at specific wavelengths indicate which spectral features drive the variation captured by that component
The sign of the loading (positive or negative) indicates the direction of the contribution
Comparing loadings of different components reveals which wavelengths are important for different sources of variation
Biplots
A biplot superimposes scores and loadings on the same plot. Samples are shown as points and variables as vectors (arrows). This allows you to see simultaneously which samples are similar and which variables drive those similarities. Biplots are most useful when the number of variables is moderate (say, fewer than 50). With hundreds of wavelengths, the arrow display becomes cluttered and a separate loadings plot is more practical.
How many components?
One of the most important decisions in PCA is how many components to retain. Too few and you lose meaningful information; too many and you model noise.
Scree plot
The scree plot shows the eigenvalues (or percentage of variance explained) as a function of component number. It typically shows a steep drop for the first few components followed by a leveling off. The “elbow” where the curve transitions from steep to flat suggests the appropriate number of components. Components before the elbow capture systematic variation; components after it are mostly noise.
The name comes from the geological term scree — the pile of broken rocks at the base of a cliff. The flat tail of the eigenvalue plot resembles this debris field.
Cumulative variance
A related approach sets a cumulative variance threshold — for example, 95% or 99%. You retain however many components are needed to reach that threshold. This is simple but somewhat arbitrary: the “right” threshold depends on the noise level in the data.
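The threshold rule reduces to a cumulative sum; a small helper (the function name and the eigenvalue spectrum below are hypothetical):

```python
import numpy as np

def components_for_threshold(eigvals, threshold=0.95):
    """Smallest number of components whose cumulative variance
    fraction reaches `threshold` (hypothetical helper)."""
    fractions = np.asarray(eigvals, dtype=float)
    fractions = fractions / fractions.sum()
    return int(np.searchsorted(np.cumsum(fractions), threshold) + 1)

# Hypothetical eigenvalue spectrum: a few strong components, then a noise tail.
eigvals = [50.0, 30.0, 12.0, 5.0, 1.0, 0.5, 0.5, 0.5, 0.3, 0.2]
print(components_for_threshold(eigvals, 0.95))   # -> 4
```

With these eigenvalues the first four components reach 97% cumulative variance, so a 95% threshold retains four.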
Cross-validation
A more rigorous approach uses cross-validation. You systematically leave out portions of the data, reconstruct them from the PCA model, and measure the reconstruction error. When adding a component improves the reconstruction of held-out data, it is capturing signal. When it no longer helps (or makes reconstruction worse), it is capturing noise. This is the most reliable method but requires more computation.
Broken stick rule
The broken stick model compares the observed eigenvalues against the expected eigenvalues if the total variance were distributed randomly among the components (as if you broke a stick of unit length into $p$ pieces at random). A component is retained if its eigenvalue exceeds the corresponding broken stick value. The expected fractional length of the $a$-th largest piece of a stick broken at $p-1$ random points is:

$$b_a = \frac{1}{p} \sum_{k=a}^{p} \frac{1}{k}$$

Components whose eigenvalues exceed $b_a \times$ (total variance) are considered significant.
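A short sketch of the rule (the eigenvalues are hypothetical, chosen so that two components beat the stick):

```python
import numpy as np

def broken_stick(p):
    """Expected variance fractions b_a under the broken stick model, a = 1..p."""
    return np.array([sum(1.0 / k for k in range(a, p + 1)) / p
                     for a in range(1, p + 1)])

# Hypothetical eigenvalues; observed fractions are compared with the stick.
eigvals = np.array([6.0, 2.5, 0.8, 0.4, 0.2, 0.1])
observed = eigvals / eigvals.sum()
expected = broken_stick(len(eigvals))

# Retain the leading components whose observed fraction beats the stick
# (this simple form assumes at least one component eventually fails the test).
n_keep = int(np.argmin(observed > expected))
print(n_keep)   # -> 2
```

The expected fractions $b_a$ sum to one, as they must for a stick of unit length.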
Interpreting PCA results
Scores plots
The scores plot (PC1 vs. PC2) is the primary tool for exploring multivariate data. Common patterns include:
Distinct clusters: Suggest discrete sample groups (e.g., different olive oil cultivars, authentic vs. adulterated samples)
Continuous gradients: Suggest a quantitative property varying across samples (e.g., increasing analyte concentration)
Outliers: Single points far from the rest, suggesting unusual composition, measurement errors, or contaminated samples
Time trends: Ordered progression of samples over time, suggesting process drift or aging effects
You can color the points by known metadata (class labels, concentrations, production dates) to see whether the PCA structure correlates with chemical or physical properties.
Hotelling’s T-squared
To decide whether a sample is an outlier in the principal component space, you can use Hotelling’s $T^2$ statistic. For a sample with scores $t_i = [t_{i1}, t_{i2}, \dots, t_{iA}]$, the $T^2$ value is:

$$T_i^2 = \sum_{a=1}^{A} \frac{t_{ia}^2}{\lambda_a}$$

This is a multivariate generalization of the squared standardized distance. Samples with $T^2$ above a critical threshold (derived from the F-distribution) are flagged as outliers. The $T^2$ confidence ellipse drawn on a scores plot corresponds to this threshold and provides a visual boundary for “normal” samples.
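Computing the statistic per sample is a one-liner once the scores are available; a sketch on toy data (the F-based control limit is indicated only in a comment, since its exact form varies between texts):

```python
import numpy as np

# Hotelling's T^2 for every sample from an A-component model (toy data).
rng = np.random.default_rng(5)
n, p, A = 60, 8, 2
X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = Xc @ Vt.T[:, :A]            # scores, n x A
lam = s[:A]**2 / (n - 1)        # variance of each retained component

T2 = (T**2 / lam).sum(axis=1)   # T^2_i = sum_a t_ia^2 / lambda_a

# One common 95% control limit uses the F-distribution, e.g.
#   A * (n - 1) / (n - A) * scipy.stats.f.ppf(0.95, A, n - A)
```

Because the component variances are estimated from the same data, the average $T^2$ over the training samples equals $A(n-1)/n$ exactly, a handy sanity check.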
Q-residuals (SPE)
The $T^2$ statistic only measures how extreme a sample is within the PCA model. But what about variation that the model does not capture? The Q-residual (also called the Squared Prediction Error, SPE) measures how well the model reconstructs a sample:

$$Q_i = e_i^T e_i = \sum_{j=1}^{p} (x_{ij} - \hat{x}_{ij})^2$$

where $\hat{x}_{ij}$ is the value reconstructed from the PCA model. A high $Q$ means the sample contains variation that is not described by any of the retained components — it is “new” variation that the model has never seen.
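The Q-residual follows directly from the rank-$A$ reconstruction; a sketch on toy data:

```python
import numpy as np

# Q-residuals (SPE) from an A-component reconstruction (toy data).
rng = np.random.default_rng(6)
n, p, A = 60, 8, 2
X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt.T[:, :A]                 # retained loadings

X_hat = (Xc @ P) @ P.T          # reconstruction from A components
E = Xc - X_hat                  # residual matrix
Q = (E**2).sum(axis=1)          # Q_i = e_i^T e_i for each sample
```

As a check, the Q values summed over all samples equal the variance left in the discarded components, $\sum_{a>A} \sigma_a^2$.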
The combination of $T^2$ and $Q$ provides a complete diagnostic:
Normal sample: Low $T^2$, low $Q$ — within the model and well-described by it
Extreme but consistent: High $T^2$, low $Q$ — far from the center but the model structure still applies (e.g., a sample with an unusually high concentration)
New variation: Low $T^2$, high $Q$ — near the center of the PC space but containing a spectral feature the model has never seen (e.g., a contaminant)
Extreme and inconsistent: High $T^2$, high $Q$ — far from the center and not well-described by the model (a genuine outlier)
The NIPALS algorithm
While eigendecomposition and SVD are the most common ways to compute PCA today, the NIPALS algorithm introduced by Herman Wold [3] remains important for two reasons: it can handle missing data, and it extracts components one at a time, so you can stop after computing only as many as you need without decomposing the entire matrix.
Initialize
Start with the column of X that has the largest variance as the initial scores vector t .
Compute the loading vector
Project the data onto the current scores to get a loading vector, then normalize it to unit length:
$$p^T = \frac{t^T X}{t^T t}, \qquad p \leftarrow \frac{p}{\|p\|}$$
Update the scores
Project the data onto the new loading vector:
$$t = \frac{X p}{p^T p}$$
Check convergence
If the scores vector has not changed significantly since the last iteration (e.g., $\|t_{\text{new}} - t_{\text{old}}\| < \epsilon$), proceed to step 5. Otherwise, return to step 2.
Deflate the data matrix
Remove the variation captured by this component:
$$X \leftarrow X - t p^T$$
Repeat
Return to step 1 with the deflated matrix to extract the next component. Continue until the desired number of components has been extracted.
NIPALS typically converges in a few iterations per component. Its ability to skip missing values (by omitting them from the relevant inner products) made it the standard algorithm in early chemometrics software, and it is still the default in several packages today.
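The steps above can be sketched in a few lines of NumPy. This is a complete-data version only (the missing-value handling that made NIPALS popular is omitted), and the function name is my own:

```python
import numpy as np

def nipals_pca(X, n_components, tol=1e-10, max_iter=500):
    """PCA by NIPALS, extracting one component at a time (complete-data sketch)."""
    X = X - X.mean(axis=0)                          # mean-center
    n, p = X.shape
    T = np.zeros((n, n_components))                 # scores
    P = np.zeros((p, n_components))                 # loadings
    for a in range(n_components):
        t = X[:, np.argmax(X.var(axis=0))].copy()   # 1. start from highest-variance column
        for _ in range(max_iter):
            p_vec = X.T @ t / (t @ t)               # 2. loading from current scores
            p_vec /= np.linalg.norm(p_vec)          #    normalize to unit length
            t_new = X @ p_vec                       # 3. scores from new loading (p^T p = 1)
            if np.linalg.norm(t_new - t) < tol:     # 4. convergence check
                t = t_new
                break
            t = t_new
        T[:, a], P[:, a] = t, p_vec
        X = X - np.outer(t, p_vec)                  # 5. deflate, then repeat
    return T, P

# Toy data with strong low-rank structure so the components are well separated.
rng = np.random.default_rng(7)
signal = (rng.normal(size=(30, 2)) * [5.0, 2.0]) @ rng.normal(size=(2, 5))
X = signal + 0.01 * rng.normal(size=(30, 5))
T, P = nipals_pca(X, 2)
```

The inner loop is a power iteration in disguise, which is why NIPALS converges to the dominant remaining component and why convergence is fast when consecutive eigenvalues are well separated.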
PCA in chemistry
Spectral decomposition and classification
The most common application of PCA in chemistry is exploratory analysis of spectroscopic datasets. Given a collection of spectra, PCA reveals how many independent sources of variation exist and which samples are similar to which. This is invaluable in:
Food authenticity: Distinguishing extra-virgin olive oil from refined oil, identifying geographical origin of wines, detecting adulteration in honey
Pharmaceutical quality: Grouping tablets by formulation, identifying out-of-specification batches, verifying raw material identity
Environmental monitoring: Classifying water samples by pollution source, tracking seasonal changes in atmospheric composition
Forensics: Comparing paint samples, fiber types, or ink formulations
In all these cases, the workflow is the same: preprocess the spectra (scatter correction, smoothing, perhaps derivatives), perform PCA, and examine the scores plot for patterns. If samples cluster by class, PCA has found discriminating information. If they do not, the spectral differences between classes may be too subtle relative to other sources of variation, and you may need a supervised method.
Process monitoring
PCA is the foundation of multivariate statistical process control (MSPC). In a manufacturing process monitored by multiple sensors (NIR probes, temperature, pressure, flow rates), PCA builds a model of “normal operating conditions” from historical data collected under stable production. New observations are then projected onto this model, and their T2 and Q statistics are monitored in real time.
When a new measurement falls outside the T2 or Q control limits, the process is flagged as abnormal. The contribution plot — which shows how much each original variable contributes to the elevated statistic — helps operators diagnose the root cause. This approach was developed by researchers such as John F. MacGregor and Theodora Kourti at McMaster University in the 1990s [5, 9] and has become standard practice in pharmaceutical manufacturing under the FDA’s Process Analytical Technology (PAT) initiative.
Noise reduction and data compression
Because PCA concentrates information into the first few components and relegates noise to the later ones, reconstructing the data from only the first A components produces a denoised version of the dataset:
$$\hat{X} = T_A P_A^T$$

This reconstruction retains the systematic variation and discards the noise. The same principle enables data compression: instead of storing the full $n \times p$ matrix, you store only $T_A$ ($n \times A$) and $P_A$ ($p \times A$), which requires far less memory when $A \ll p$.
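Both effects are easy to demonstrate on synthetic data (a rank-3 "signal" plus noise stands in for a set of spectra):

```python
import numpy as np

# Rank-A reconstruction as denoising and compression (synthetic data).
rng = np.random.default_rng(8)
n, p, A = 100, 300, 3

signal = rng.normal(size=(n, A)) @ rng.normal(size=(A, p))   # rank-A "spectra"
X = signal + 0.1 * rng.normal(size=(n, p))                   # plus noise

mean = X.mean(axis=0)
U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
T_A = (X - mean) @ Vt.T[:, :A]        # scores, n x A
P_A = Vt.T[:, :A]                     # loadings, p x A

X_hat = T_A @ P_A.T + mean            # denoised reconstruction

# Denoising: X_hat is closer to the noise-free signal than X is.
err_raw = np.linalg.norm(X - signal)
err_hat = np.linalg.norm(X_hat - signal)

# Compression: n*A + p*A numbers stored instead of n*p.
```

Because the discarded components contain almost pure noise, the reconstruction error against the true signal drops substantially, while storage shrinks from 30,000 numbers to 1,200 (plus the mean vector).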
The right tool for
Exploratory data analysis: Visualize the structure of a new dataset and identify clusters and outliers before any modeling
Data compression: Reduce hundreds of correlated wavelengths to a few orthogonal scores for faster downstream analysis
Noise reduction: Reconstruct spectra from the first few components to filter out high-frequency noise
Process monitoring (MSPC): Build a model of normal operating conditions and detect deviations using the T-squared and Q statistics
Preprocessing for regression: Use PCA scores as inputs to PCR, or simply to inspect the data structure before building PLS models
Not the right tool for
Prediction and calibration: PCA ignores the response variable. Use PLS or PCR instead when you need to predict a property (e.g., concentration from spectra)
Class boundaries: PCA maximizes variance, not class separation. If classes overlap in the PCA scores plot, consider PLS-DA or LDA, which use class labels to find discriminant directions
Non-linear structure: PCA finds linear directions. If the data lies on a curved manifold, methods such as kernel PCA or t-SNE may reveal structure that linear PCA misses
Very small datasets: With fewer samples than variables and very few samples overall, PCA may overfit the noise structure. Cross-validation becomes essential
Advantages and limitations
Advantages
Unsupervised — requires no labels, concentrations, or prior knowledge about the samples
Dimensionality reduction — compresses high-dimensional data into a few interpretable components
Noise filtering — separates systematic variation from random noise
Visualization — provides 2D or 3D maps of sample relationships that are easy to interpret
Orthogonal components — the scores are uncorrelated, eliminating multicollinearity for downstream modeling
Fast and well-understood — efficient algorithms (SVD, NIPALS) and decades of theoretical development
Limitations
Linear method — can only find linear combinations of variables, misses non-linear relationships
Unsupervised — maximizes variance, not predictive power or class separation
Scale-dependent — results change depending on whether you center only, auto-scale, or apply other transformations
Sign ambiguity — PCA components can be reflected (sign-flipped) without changing the decomposition, as long as the corresponding scores and loadings are flipped together; only relative orientations matter
Interpretation requires domain knowledge — loadings show which variables matter, but connecting them to chemistry requires expertise in spectral interpretation
Next steps
PCA is the starting point for much of chemometrics. Once you understand how it decomposes data into scores and loadings, several powerful methods become natural extensions:
Factor analysis applies a rotation to the PCA components to make them more interpretable, often used when you expect the underlying factors to have a physical meaning (e.g., pure component spectra)
Clustering can be applied to PCA scores to assign samples to groups algorithmically, rather than relying on visual inspection of the scores plot
Principal Component Regression (PCR) takes the PCA scores and regresses a response variable on them — the simplest multivariate calibration method
Partial Least Squares (PLS) modifies PCA by using the response variable to guide component construction, producing components that are optimized for prediction rather than variance explanation. PLS is the workhorse of multivariate calibration in chemistry
References
[1] Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11), 559–572.
[2] Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6), 417–441.
[3] Wold, H. (1966). Estimation of principal components and related models by iterative least squares. In P. R. Krishnaiah (Ed.), Multivariate Analysis (pp. 391–420). Academic Press.
[4] Wold, S., Esbensen, K., & Geladi, P. (1987). Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1–3), 37–52.
[5] Wise, B. M., & Gallagher, N. B. (1996). The process chemometrics approach to process monitoring and fault detection. Journal of Process Control, 6(6), 329–348.
[6] Jackson, J. E. (1991). A User’s Guide to Principal Components. Wiley.
[7] Jolliffe, I. T. (2002). Principal Component Analysis (2nd ed.). Springer.
[8] Brereton, R. G. (2003). Chemometrics: Data Analysis for the Laboratory and Chemical Plant. Wiley.
[9] Kourti, T., & MacGregor, J. F. (1995). Process analysis, monitoring and diagnosis, using multivariate projection methods. Chemometrics and Intelligent Laboratory Systems, 28(1), 3–21.