
Principal Component Analysis (PCA)

In 1901, the English mathematician Karl Pearson published a paper titled “On Lines and Planes of Closest Fit to Systems of Points in Space” in the Philosophical Magazine [1]. Pearson’s problem was geometric: given a cloud of points in many dimensions, how do you find the line (or plane, or hyperplane) that passes as close as possible to all the points? His solution was to find the direction in which the projected points had the greatest spread — the direction of maximum variance. Then find a second direction, perpendicular to the first, that captured the most remaining variance. And so on. The result was a coordinate system tailored to the data, with axes ranked by how much variation each one explained. Pearson was not thinking about chemistry. He was working on problems of biological inheritance, fitting lines and planes to measurements of skulls, limb lengths, and other anatomical features. But the mathematics he developed would prove universal.

Three decades later, the American psychologist Harold Hotelling independently arrived at the same mathematical structure from a completely different direction. In 1933, he published “Analysis of a Complex of Statistical Variables into Principal Components” in the Journal of Educational Psychology [2]. Hotelling’s motivation was psychological testing: he had matrices of test scores from students across many exams and wanted to find a small number of underlying “factors” (general intelligence, verbal ability, mathematical aptitude) that explained the correlations among the tests. He formalized the method, gave it the name principal components, and established the eigenvalue-eigenvector framework that is still used today. Where Pearson had approached the problem geometrically (closest-fitting subspaces), Hotelling approached it statistically (maximizing variance of linear combinations).

For the next three decades, PCA remained a tool for psychologists, biologists, and social scientists. Its application to large datasets was limited by computation — eigendecomposition of big matrices was impractical on the hardware of the time. That changed in 1966, when Herman Wold introduced the NIPALS algorithm (Nonlinear Iterative Partial Least Squares) [3], which extracted principal components one at a time through a simple iterative procedure. NIPALS made PCA feasible for datasets with hundreds of variables without ever computing the full covariance matrix. This was the bridge that carried PCA into chemometrics: Herman Wold’s son, Svante Wold, together with Paul Geladi, Bruce Kowalski, and others, made PCA the foundation of pattern recognition in spectroscopic data during the 1970s and 1980s [4]. Today, PCA is almost always the first thing a chemometrician does with a new dataset.

The dimensionality problem

A modern near-infrared (NIR) spectrometer records absorbance at hundreds or thousands of wavelengths. A single spectrum might contain 1050 data points — one for each wavelength between 900 and 2500 nm. If you measure 200 olive oil samples to check for adulteration, your dataset is a matrix: 200 rows (samples) and 1050 columns (wavelengths). Mathematically, each sample lives in a 1050-dimensional space.

But do you really need 1050 dimensions to describe the important differences among these olive oils? Almost certainly not. The absorbance at 1720 nm is closely related to the absorbance at 1721 nm, because they arise from similar molecular vibrations. The same is true across broad spectral regions: the C-H stretching overtones near 1200 nm, the O-H combinations near 1940 nm, and the C=O features near 1730 nm tend to move together as the chemical composition changes. The 1050 wavelengths are not 1050 independent pieces of information. They are 1050 redundant measurements of a much smaller number of underlying sources of variation — perhaps the concentrations of oleic acid, linoleic acid, water, and free fatty acids, plus some physical effects like scattering.

The question PCA answers is: how many independent directions of variation does this dataset actually contain, and what do those directions look like? If the answer is “about 5,” then PCA compresses 1050 dimensions into 5, with minimal loss of information. This compression is not just convenient — it is essential for visualization (you can plot 2 or 3 dimensions but not 1050), for noise reduction (the discarded dimensions mostly contain noise), and for building predictive models (fewer variables means more stable regression).

The core idea

PCA searches for directions of maximum variance in the data and projects the samples onto those directions.

Imagine your data matrix as a cloud of points in a high-dimensional space. Some directions through that cloud show a lot of spread (the points are scattered widely), while other directions show very little (the points are bunched together). The directions with the most spread carry the most information about how the samples differ from each other.

The first principal component (PC1) is the direction in the original variable space along which the data has the greatest variance. It is a linear combination of all the original variables:

$$t_1 = w_1 x_1 + w_2 x_2 + \cdots + w_p x_p$$

where $x_1, \dots, x_p$ are the original variables (wavelengths) and the weights $w_1, \dots, w_p$ define the direction. PCA chooses these weights to maximize the variance of the projected data, subject to the constraint that the weight vector has unit length ($\|\mathbf{w}\| = 1$).

The second principal component (PC2) is the direction of maximum variance orthogonal to PC1. It captures the largest amount of variation that PC1 missed. The third is orthogonal to both PC1 and PC2, and so on.

The result is a new coordinate system where:

  • The first few axes capture most of the systematic variation (signal)
  • The later axes capture progressively less variation (usually noise)
  • All axes are mutually perpendicular (uncorrelated)
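The variance-maximizing idea can be checked numerically. This sketch (synthetic 2D data, hypothetical candidate angles) projects a point cloud onto several unit-length directions and compares the projected variances:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2D cloud: stretched along one axis, then rotated by 0.6 rad
X = rng.normal(size=(500, 2)) @ np.diag([3.0, 0.5])
R = np.array([[np.cos(0.6), -np.sin(0.6)],
              [np.sin(0.6),  np.cos(0.6)]])
X = (X @ R.T) - (X @ R.T).mean(axis=0)   # rotate, then mean-center

# Variance of the data projected onto a few candidate unit-length directions
variances = []
for angle in (0.0, 0.6, 1.2):
    w = np.array([np.cos(angle), np.sin(angle)])   # unit vector
    variances.append(np.var(X @ w))
    print(f"direction {angle:.1f} rad: projected variance {variances[-1]:.2f}")
# The 0.6 rad direction (the cloud's long axis) shows the largest spread
```

The direction aligned with the cloud's long axis wins; PC1 is exactly the direction that would win this comparison over all possible angles.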

Covariance and correlation

Before computing PCA, you need to decide how to preprocess your data. This decision determines what kind of variation PCA will find.

Mean-centering

Mean-centering is almost always required. You subtract the column mean from each variable:

$$x_{ij}^{\text{centered}} = x_{ij} - \bar{x}_j$$

Without centering, the first principal component would point roughly toward the mean of the data, capturing the average signal rather than the differences among samples. This wastes a component on information that carries no discriminating power.
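In NumPy, centering is a one-liner; here the matrix is random, standing in for a 200 × 1050 spectral dataset:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((200, 1050)) + 0.5        # stand-in for 200 spectra x 1050 wavelengths

X_centered = X - X.mean(axis=0)          # subtract each column (wavelength) mean

# Every column now has zero mean; differences among samples are preserved
print(np.abs(X_centered.mean(axis=0)).max())   # numerically ~0
```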

Scaling: covariance vs. correlation

After centering, you face a choice:

  • PCA on the covariance matrix (no scaling): Variables with larger absolute variance dominate the analysis. For spectroscopic data where all variables have the same units (absorbance), this is usually appropriate. Wavelength regions with strong absorption bands naturally contribute more to the PCA decomposition.

  • PCA on the correlation matrix (auto-scaling to unit variance): Each variable is divided by its standard deviation before PCA. All variables contribute equally regardless of their magnitude. This is essential when variables have different units (e.g., mixing spectral data with temperature, pH, and pressure measurements) but can be harmful for spectroscopic data because it amplifies noisy baseline regions.
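The difference between the two choices is easy to demonstrate with two synthetic variables on very different scales (the variable roles in the comments are hypothetical):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Two independent variables on very different scales
X = np.column_stack([rng.normal(0, 100.0, 300),   # e.g. a pressure-like variable
                     rng.normal(0, 1.0, 300)])    # e.g. a pH-like variable

# Covariance PCA: center only -- the large-variance variable dominates PC1
pca_cov = PCA(n_components=2).fit(X - X.mean(axis=0))
# Correlation PCA: auto-scale first -- both variables contribute equally
pca_corr = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

print(pca_cov.explained_variance_ratio_)   # first value near 1.0
print(pca_corr.explained_variance_ratio_)  # both values near 0.5
```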

Computing principal components

The mathematics of PCA reduces to an eigenvalue problem. Starting from the centered data matrix $\mathbf{X}$ ($n \times p$, where $n$ is samples and $p$ is variables), you compute the covariance matrix:

$$\mathbf{C} = \frac{1}{n-1} \mathbf{X}^T \mathbf{X}$$

This symmetric matrix summarizes how every pair of variables co-varies across the samples. The diagonal elements are the variances of individual variables; the off-diagonal elements are the covariances between pairs of variables.

Eigendecomposition

The eigendecomposition of the covariance matrix is:

$$\mathbf{C} = \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^T$$

where:

  • $\mathbf{V}$ is a $p \times p$ orthogonal matrix whose columns are the eigenvectors (also called loadings)
  • $\boldsymbol{\Lambda}$ is a diagonal matrix of eigenvalues, ordered so that $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0$

Each eigenvector defines a principal component direction. Its corresponding eigenvalue equals the variance of the data when projected onto that direction. The eigenvectors are orthonormal: perpendicular to each other and of unit length.
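These statements can be checked directly with NumPy's symmetric eigensolver; random data stands in for a centered spectral matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
X -= X.mean(axis=0)

C = X.T @ X / (X.shape[0] - 1)        # covariance matrix (p x p)
eigvals, eigvecs = np.linalg.eigh(C)  # eigh: for symmetric matrices, ascending order
order = np.argsort(eigvals)[::-1]     # re-sort descending, as PCA convention requires
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each eigenvalue equals the variance of the data projected onto its eigenvector
print(np.var(X @ eigvecs[:, 0], ddof=1), eigvals[0])  # the two numbers match
```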

Singular value decomposition

In practice, PCA is more often computed via the singular value decomposition (SVD) of the centered data matrix itself:

$$\mathbf{X} = \mathbf{U} \mathbf{S} \mathbf{V}^T$$

where:

  • $\mathbf{U}$ ($n \times p$) contains the left singular vectors
  • $\mathbf{S}$ ($p \times p$) is diagonal with singular values $s_1 \geq s_2 \geq \cdots \geq s_p \geq 0$
  • $\mathbf{V}$ ($p \times p$) contains the right singular vectors

The connection to PCA is direct: the columns of $\mathbf{V}$ are the loadings (eigenvectors of $\mathbf{C}$), and the eigenvalues are $\lambda_i = s_i^2 / (n-1)$. The SVD is numerically more stable than forming $\mathbf{C}$ explicitly and then diagonalizing it, which is why most software uses SVD internally.
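The equivalence of the two routes can be verified numerically; a sketch on random data:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
X -= X.mean(axis=0)

# Route 1: economy SVD of the centered data matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)
eigvals_svd = s**2 / (X.shape[0] - 1)          # eigenvalues from singular values

# Route 2: eigendecomposition of the covariance matrix
C = X.T @ X / (X.shape[0] - 1)
eigvals_direct = np.sort(np.linalg.eigvalsh(C))[::-1]

print(np.allclose(eigvals_svd, eigvals_direct))  # True
```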

Variance explained

Each eigenvalue tells you how much variance that component captures. The proportion of variance explained by the $i$-th component is:

$$\text{VE}_i = \frac{\lambda_i}{\sum_{j=1}^{p} \lambda_j}$$

The cumulative variance explained by the first $k$ components is:

$$\text{CVE}_k = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{j=1}^{p} \lambda_j}$$

In typical spectroscopic datasets, the first 2 to 5 components often capture 95-99% of the total variance. The remaining hundreds of components collectively account for a few percent, mostly noise.
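A quick numeric check of these proportions, using hypothetical eigenvalues:

```python
import numpy as np

eigvals = np.array([50.0, 30.0, 12.0, 5.0, 2.0, 1.0])  # hypothetical eigenvalues
explained = eigvals / eigvals.sum() * 100               # per-component percentage
cumulative = np.cumsum(explained)                       # running total

for k, (e, c) in enumerate(zip(explained, cumulative), start=1):
    print(f"PC{k}: {e:5.1f}%  cumulative {c:5.1f}%")
```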

Scores and loadings

PCA decomposes the data matrix as:

$$\mathbf{X} = \mathbf{T} \mathbf{P}^T + \mathbf{E}$$

where:

  • $\mathbf{T}$ ($n \times k$) is the scores matrix
  • $\mathbf{P}$ ($p \times k$) is the loadings matrix
  • $\mathbf{E}$ is the residual (what is left after $k$ components)

This decomposition is the heart of PCA interpretation.

Scores: the sample map

The scores are the coordinates of each sample in the new principal component space:

$$\mathbf{T} = \mathbf{X} \mathbf{P}$$

Each row of $\mathbf{T}$ is a sample; each column is a principal component. Plotting the scores of PC1 against PC2 gives a scores plot, the most important visualization in exploratory chemometrics. In this plot:

  • Samples that are close together have similar spectral profiles
  • Samples that are far apart differ significantly
  • Clusters suggest groups with distinct chemical compositions
  • Outliers appear as isolated points far from the main group
  • Gradients (smooth transitions from one region to another) suggest a continuous change in composition (e.g., increasing concentration of an analyte)

Loadings: the variable map

The loadings tell you how each original variable contributes to each principal component. The loading vector has one value per original variable (wavelength). If the loading of PC1 at 1720 nm is large and positive, it means that samples with high PC1 scores have high absorbance at 1720 nm — which you might associate with a C=O stretching vibration related to free fatty acid content.

A loadings plot shows the loading values as a function of wavelength. It looks like a spectrum and often can be interpreted chemically:

  • Peaks in the loadings at specific wavelengths indicate which spectral features drive the variation captured by that component
  • The sign of the loading (positive or negative) indicates the direction of the contribution
  • Comparing loadings of different components reveals which wavelengths are important for different sources of variation

Biplots

A biplot superimposes scores and loadings on the same plot. Samples are shown as points and variables as vectors (arrows). This allows you to see simultaneously which samples are similar and which variables drive those similarities. Biplots are most useful when the number of variables is moderate (say, fewer than 50). With hundreds of wavelengths, the arrow display becomes cluttered and a separate loadings plot is more practical.

How many components?

One of the most important decisions in PCA is how many components to retain. Too few and you lose meaningful information; too many and you model noise.

Scree plot

The scree plot shows the eigenvalues (or percentage of variance explained) as a function of component number. It typically shows a steep drop for the first few components followed by a leveling off. The “elbow” where the curve transitions from steep to flat suggests the appropriate number of components. Components before the elbow capture systematic variation; components after it are mostly noise.

The name comes from the geological term scree — the pile of broken rocks at the base of a cliff. The flat tail of the eigenvalue plot resembles this debris field.

Cumulative variance

A related approach sets a cumulative variance threshold — for example, 95% or 99%. You retain however many components are needed to reach that threshold. This is simple but somewhat arbitrary: the “right” threshold depends on the noise level in the data.

Cross-validation

A more rigorous approach uses cross-validation. You systematically leave out portions of the data, reconstruct them from the PCA model, and measure the reconstruction error. When adding a component improves the reconstruction of held-out data, it is capturing signal. When it no longer helps (or makes reconstruction worse), it is capturing noise. This is the most reliable method but requires more computation.

Broken stick rule

The broken stick model compares the observed eigenvalues against the expected eigenvalues if the total variance were distributed randomly among the components (as if you broke a stick of unit length into $p$ pieces at random). A component is retained if its eigenvalue exceeds the corresponding broken stick value. The expected length of the $k$-th largest piece of a stick broken at $p - 1$ random points is:

$$b_k = \frac{1}{p} \sum_{i=k}^{p} \frac{1}{i}$$

Components whose eigenvalues (expressed as proportions of total variance) exceed $b_k$ are considered significant.
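A sketch of the rule, assuming the eigenvalues have already been normalized to proportions of total variance (the numbers are hypothetical):

```python
import numpy as np

def broken_stick(p):
    """Expected lengths of the p pieces of a unit stick broken at random points."""
    return np.array([np.sum(1.0 / np.arange(k, p + 1)) / p for k in range(1, p + 1)])

# Hypothetical eigenvalues, expressed as proportions of total variance
props = np.array([0.55, 0.25, 0.10, 0.05, 0.03, 0.02])
bs = broken_stick(len(props))

retained = props > bs            # keep components above the broken stick line
print(retained)                  # here the first two components pass
```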

Interpreting PCA results

Scores plots

The scores plot (PC1 vs. PC2) is the primary tool for exploring multivariate data. Common patterns include:

  • Distinct clusters: Suggest discrete sample groups (e.g., different olive oil cultivars, authentic vs. adulterated samples)
  • Continuous gradients: Suggest a quantitative property varying across samples (e.g., increasing analyte concentration)
  • Outliers: Single points far from the rest, suggesting unusual composition, measurement errors, or contaminated samples
  • Time trends: Ordered progression of samples over time, suggesting process drift or aging effects

You can color the points by known metadata (class labels, concentrations, production dates) to see whether the PCA structure correlates with chemical or physical properties.

Hotelling’s T-squared

To decide whether a sample is an outlier in the principal component space, you can use Hotelling's $T^2$ statistic. For a sample with scores $t_1, t_2, \dots, t_k$, the value is:

$$T^2 = \sum_{a=1}^{k} \frac{t_a^2}{\lambda_a}$$

This is a multivariate generalization of the squared standardized distance. Samples with $T^2$ above a critical threshold (derived from the F-distribution) are flagged as outliers. The confidence ellipse drawn on a scores plot corresponds to this threshold and provides a visual boundary for “normal” samples.

Q-residuals (SPE)

The $T^2$ statistic only measures how extreme a sample is within the PCA model. But what about variation that the model does not capture? The Q-residual (also called Squared Prediction Error, SPE) measures how well the model reconstructs a sample:

$$Q_i = \sum_{j=1}^{p} \left( x_{ij} - \hat{x}_{ij} \right)^2$$

where $\hat{x}_{ij}$ is the reconstructed value from the PCA model. A high $Q$ means the sample contains variation that is not described by any of the retained components: it is “new” variation that the model has never seen.

The combination of $T^2$ and $Q$ provides a complete diagnostic:

  • Normal sample: low $T^2$, low $Q$. Within the model and well described by it
  • Extreme but consistent: high $T^2$, low $Q$. Far from the center, but the model structure still applies (e.g., a sample with an unusually high concentration)
  • New variation: low $T^2$, high $Q$. Near the center in the PC space, but contains a spectral feature the model has never seen (e.g., a contaminant)
  • Extreme and inconsistent: high $T^2$, high $Q$. Far from the center and not well described by the model (a genuine outlier)
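Both diagnostics are straightforward to compute from a fitted scikit-learn model. A sketch on random data; a real application would use spectra and proper F-distribution control limits:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 50))
Xc = X - X.mean(axis=0)

pca = PCA(n_components=3).fit(Xc)
scores = pca.transform(Xc)

# Hotelling's T2: distance within the model, scaled by each PC's variance
T2 = np.sum(scores**2 / pca.explained_variance_, axis=1)

# Q-residual (SPE): squared reconstruction error outside the model
X_hat = scores @ pca.components_
Q = np.sum((Xc - X_hat)**2, axis=1)

print(T2.shape, Q.shape)  # one T2 and one Q value per sample
```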

The NIPALS algorithm

While eigendecomposition and SVD are the most common ways to compute PCA today, the NIPALS algorithm introduced by Herman Wold [3] remains important for two reasons: it can handle missing data, and it extracts components one at a time, so you can stop after computing only as many as you need without decomposing the entire matrix.

  1. Initialize

    Start with the column of $\mathbf{X}$ that has the largest variance as the initial scores vector $\mathbf{t}$.

  2. Compute the loading vector

    Project the data onto the current scores to get a loading vector, then normalize it to unit length:

    $$\mathbf{p} = \frac{\mathbf{X}^T \mathbf{t}}{\mathbf{t}^T \mathbf{t}}, \qquad \mathbf{p} \leftarrow \frac{\mathbf{p}}{\|\mathbf{p}\|}$$

  3. Update the scores

    Project the data onto the new loading vector:

    $$\mathbf{t} = \mathbf{X} \mathbf{p}$$

  4. Check convergence

    If the scores vector has not changed significantly since the last iteration (e.g., $\|\mathbf{t}_{\text{new}} - \mathbf{t}_{\text{old}}\| < \epsilon$ for a small tolerance $\epsilon$ such as $10^{-6}$), proceed to step 5. Otherwise, return to step 2.

  5. Deflate the data matrix

    Remove the variation captured by this component:

    $$\mathbf{X} \leftarrow \mathbf{X} - \mathbf{t}\,\mathbf{p}^T$$

  6. Repeat

    Return to step 1 with the deflated matrix to extract the next component. Continue until the desired number of components has been extracted.

NIPALS typically converges in a few iterations per component. Its ability to skip missing values (by omitting them from the relevant inner products) made it the standard algorithm in early chemometrics software, and it is still the default in several packages today.
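The six steps fit in a few lines of NumPy. This is a minimal sketch without the missing-data handling that makes NIPALS special in practice:

```python
import numpy as np

def nipals_pca(X, n_components, tol=1e-8, max_iter=500):
    """Minimal NIPALS sketch: extracts components one at a time by deflation."""
    X = X - X.mean(axis=0)                       # mean-center
    T = np.zeros((X.shape[0], n_components))     # scores
    P = np.zeros((X.shape[1], n_components))     # loadings
    for a in range(n_components):
        t = X[:, np.argmax(X.var(axis=0))].copy()  # step 1: highest-variance column
        for _ in range(max_iter):
            p = X.T @ t / (t @ t)                # step 2: project onto scores
            p /= np.linalg.norm(p)               #         normalize loading
            t_new = X @ p                        # step 3: project onto loading
            if np.linalg.norm(t_new - t) < tol:  # step 4: convergence check
                t = t_new
                break
            t = t_new
        T[:, a], P[:, a] = t, p
        X = X - np.outer(t, p)                   # step 5: deflate
    return T, P                                  # step 6: loop handles repetition

rng = np.random.default_rng(6)
X = rng.normal(size=(60, 8))
T, P = nipals_pca(X, 2)
print(T.shape, P.shape)  # (60, 2) (8, 2)
```

Up to sign, the extracted scores and loadings match the variance-ordered SVD solution.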

PCA in chemistry

Spectral decomposition and classification

The most common application of PCA in chemistry is exploratory analysis of spectroscopic datasets. Given a collection of spectra, PCA reveals how many independent sources of variation exist and which samples are similar to which. This is invaluable in:

  • Food authenticity: Distinguishing extra-virgin olive oil from refined oil, identifying geographical origin of wines, detecting adulteration in honey
  • Pharmaceutical quality: Grouping tablets by formulation, identifying out-of-specification batches, verifying raw material identity
  • Environmental monitoring: Classifying water samples by pollution source, tracking seasonal changes in atmospheric composition
  • Forensics: Comparing paint samples, fiber types, or ink formulations

In all these cases, the workflow is the same: preprocess the spectra (scatter correction, smoothing, perhaps derivatives), perform PCA, and examine the scores plot for patterns. If samples cluster by class, PCA has found discriminating information. If they do not, the spectral differences between classes may be too subtle relative to other sources of variation, and you may need a supervised method.

Process monitoring

PCA is the foundation of multivariate statistical process control (MSPC). In a manufacturing process monitored by multiple sensors (NIR probes, temperature, pressure, flow rates), PCA builds a model of “normal operating conditions” from historical data collected under stable production. New observations are then projected onto this model, and their $T^2$ and $Q$ statistics are monitored in real time.

When a new measurement falls outside the $T^2$ or $Q$ control limits, the process is flagged as abnormal. The contribution plot, which shows how much each original variable contributes to the elevated statistic, helps operators diagnose the root cause. This approach was developed by researchers such as John F. MacGregor and Theodora Kourti at McMaster University in the 1990s [5,9] and has become standard practice in pharmaceutical manufacturing under the FDA’s Process Analytical Technology (PAT) initiative.

Noise reduction and data compression

Because PCA concentrates information into the first few components and relegates noise to the later ones, reconstructing the data from only the first $k$ components produces a denoised version of the dataset:

$$\hat{\mathbf{X}} = \mathbf{T}_k \mathbf{P}_k^T$$

This reconstruction retains the systematic variation and discards the noise. The same principle enables data compression: instead of storing the full $n \times p$ matrix, you store only $\mathbf{T}_k$ ($n \times k$) and $\mathbf{P}_k$ ($p \times k$), which requires far less memory when $k \ll p$.
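A small demonstration of PCA denoising, assuming a synthetic rank-1 signal buried in noise:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
t = np.linspace(0, 1, 300)
clean = np.outer(rng.random(100), np.sin(2 * np.pi * t))   # rank-1 "signal"
noisy = clean + rng.normal(scale=0.1, size=clean.shape)    # add measurement noise

pca = PCA(n_components=1).fit(noisy)
denoised = pca.inverse_transform(pca.transform(noisy))     # reconstruct from 1 PC

err_noisy = np.mean((noisy - clean)**2)
err_denoised = np.mean((denoised - clean)**2)
print(err_denoised < err_noisy)  # True: the reconstruction is closer to the clean signal
```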

Code implementation

import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Simulate NIR spectra of olive oil samples from 3 cultivars
np.random.seed(42)
n_per_class = 50
n_wavelengths = 500
wavelengths = np.linspace(900, 2500, n_wavelengths)

# Each cultivar has a characteristic spectral profile
# Cultivar A: higher oleic acid (strong C-H region)
profile_A = (0.8 * np.exp(-((wavelengths - 1200)**2) / 8000)
             + 0.4 * np.exp(-((wavelengths - 1730)**2) / 3000)
             + 0.3 * np.exp(-((wavelengths - 1940)**2) / 2000))
# Cultivar B: higher linoleic acid (shifted features)
profile_B = (0.6 * np.exp(-((wavelengths - 1210)**2) / 8000)
             + 0.6 * np.exp(-((wavelengths - 1725)**2) / 3000)
             + 0.35 * np.exp(-((wavelengths - 1935)**2) / 2000))
# Cultivar C: different composition
profile_C = (0.7 * np.exp(-((wavelengths - 1195)**2) / 8000)
             + 0.5 * np.exp(-((wavelengths - 1735)**2) / 3000)
             + 0.25 * np.exp(-((wavelengths - 1945)**2) / 2000))

# Generate samples with realistic measurement noise
X_A = profile_A + np.random.randn(n_per_class, n_wavelengths) * 0.02
X_B = profile_B + np.random.randn(n_per_class, n_wavelengths) * 0.02
X_C = profile_C + np.random.randn(n_per_class, n_wavelengths) * 0.02
X = np.vstack([X_A, X_B, X_C])
labels = (['Cultivar A'] * n_per_class
          + ['Cultivar B'] * n_per_class
          + ['Cultivar C'] * n_per_class)

# Mean-center (standard for spectroscopic data)
X_centered = X - X.mean(axis=0)

# Fit PCA
pca = PCA(n_components=10)
scores = pca.fit_transform(X_centered)
loadings = pca.components_.T                 # p x n_components
explained = pca.explained_variance_ratio_ * 100

# --- Plot 1: Scores plot (PC1 vs PC2) ---
plt.figure(figsize=(8, 6))
colors = {'Cultivar A': '#2563eb', 'Cultivar B': '#dc2626', 'Cultivar C': '#059669'}
for label in ['Cultivar A', 'Cultivar B', 'Cultivar C']:
    mask = np.array(labels) == label
    plt.scatter(scores[mask, 0], scores[mask, 1],
                c=colors[label], label=label, alpha=0.7,
                edgecolors='k', linewidths=0.5, s=50)
plt.xlabel(f'PC1 ({explained[0]:.1f}%)')
plt.ylabel(f'PC2 ({explained[1]:.1f}%)')
plt.title('PCA Scores Plot: Olive Oil Cultivars')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# --- Plot 2: Scree plot ---
plt.figure(figsize=(8, 5))
components = range(1, 11)
plt.bar(components, explained, color='steelblue', edgecolor='black')
plt.plot(components, np.cumsum(explained), 'ro-', linewidth=2, label='Cumulative')
plt.axhline(95, color='gray', linestyle='--', alpha=0.5, label='95% threshold')
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained (%)')
plt.title('Scree Plot')
plt.legend()
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

# --- Plot 3: Loadings plot ---
plt.figure(figsize=(10, 4))
plt.plot(wavelengths, loadings[:, 0], linewidth=1.5, label=f'PC1 ({explained[0]:.1f}%)')
plt.plot(wavelengths, loadings[:, 1], linewidth=1.5, label=f'PC2 ({explained[1]:.1f}%)')
plt.axhline(0, color='gray', linewidth=0.5)
plt.xlabel('Wavelength (nm)')
plt.ylabel('Loading value')
plt.title('PCA Loadings')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# --- Hotelling's T-squared and Q-residuals for a 2-PC model ---
T2 = np.sum(scores[:, :2]**2 / pca.explained_variance_[:2], axis=1)
X_reconstructed = scores[:, :2] @ pca.components_[:2, :] + X.mean(axis=0)
Q = np.sum((X - X_reconstructed)**2, axis=1)

print(f"Cumulative variance (2 PCs): {sum(explained[:2]):.1f}%")
print(f"T2 range: {T2.min():.2f} - {T2.max():.2f}")
print(f"Q range: {Q.min():.4f} - {Q.max():.4f}")

When to use PCA

Well-suited applications

Exploratory data analysis Visualize the structure of a new dataset, identify clusters and outliers before any modeling

Data compression Reduce hundreds of correlated wavelengths to a few orthogonal scores for faster downstream analysis

Noise reduction Reconstruct spectra from the first few components to filter out high-frequency noise

Process monitoring (MSPC) Build a model of normal conditions and detect deviations using T-squared and Q statistics

Preprocessing for regression Use PCA scores as inputs to PCR, or simply to inspect the data structure before building PLS models

Not the right tool for

Prediction and calibration PCA ignores the response variable. Use PLS or PCR instead when you need to predict a property (e.g., concentration from spectra)

Class boundaries PCA maximizes variance, not class separation. If classes overlap in the PCA scores plot, consider PLS-DA or LDA which use class labels to find discriminant directions

Non-linear structure PCA finds linear directions. If the data lies on a curved manifold, methods like kernel PCA or t-SNE may reveal structure that linear PCA misses

Very small datasets With fewer samples than variables and very few samples overall, PCA may overfit the noise structure. Cross-validation becomes essential

Advantages and limitations

Advantages

Unsupervised — requires no labels, concentrations, or prior knowledge about the samples

Dimensionality reduction — compresses high-dimensional data into a few interpretable components

Noise filtering — separates systematic variation from random noise

Visualization — provides 2D or 3D maps of sample relationships that are easy to interpret

Orthogonal components — the scores are uncorrelated, eliminating multicollinearity for downstream modeling

Fast and well-understood — efficient algorithms (SVD, NIPALS) and decades of theoretical development

Limitations

Linear method — can only find linear combinations of variables, misses non-linear relationships

Unsupervised — maximizes variance, not predictive power or class separation

Scale-dependent — results change depending on whether you center only, auto-scale, or apply other transformations

Rotation ambiguity — PCA components can be reflected (sign-flipped) without changing the decomposition; only relative orientations matter

Interpretation requires domain knowledge — loadings show which variables matter, but connecting them to chemistry requires expertise in spectral interpretation

Next steps

PCA is the starting point for much of chemometrics. Once you understand how it decomposes data into scores and loadings, several powerful methods become natural extensions:

  • Factor analysis applies a rotation to the PCA components to make them more interpretable, often used when you expect the underlying factors to have a physical meaning (e.g., pure component spectra)

  • Clustering can be applied to PCA scores to assign samples to groups algorithmically, rather than relying on visual inspection of the scores plot

  • Principal Component Regression (PCR) takes the PCA scores and regresses a response variable on them — the simplest multivariate calibration method

  • Partial Least Squares (PLS) modifies PCA by using the response variable to guide component construction, producing components that are optimized for prediction rather than variance explanation. PLS is the workhorse of multivariate calibration in chemistry

References

[1] Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11), 559–572.

[2] Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6), 417–441.

[3] Wold, H. (1966). Estimation of principal components and related models by iterative least squares. In P. R. Krishnaiah (Ed.), Multivariate Analysis (pp. 391–420). Academic Press.

[4] Wold, S., Esbensen, K., & Geladi, P. (1987). Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1–3), 37–52.

[5] Wise, B. M., & Gallagher, N. B. (1996). The process chemometrics approach to process monitoring and fault detection. Journal of Process Control, 6(6), 329–348.

[6] Jackson, J. E. (1991). A User’s Guide to Principal Components. Wiley.

[7] Jolliffe, I. T. (2002). Principal Component Analysis (2nd ed.). Springer.

[8] Brereton, R. G. (2003). Chemometrics: Data Analysis for the Laboratory and Chemical Plant. Wiley.

[9] Kourti, T., & MacGregor, J. F. (1995). Process analysis, monitoring and diagnosis, using multivariate projection methods. Chemometrics and Intelligent Laboratory Systems, 28(1), 3–21.