Few methods can claim to have shaped an entire scientific discipline. Partial Least Squares regression (PLS) is one of them. Its story begins not in chemistry but in econometrics, with the Swedish statistician Herman Wold. During the 1960s and 1970s, Wold developed a family of iterative algorithms for estimating path models with latent variables — systems of equations where some quantities cannot be measured directly. His approach was deliberately “soft”: instead of imposing strict distributional assumptions and solving everything simultaneously (as the maximum-likelihood school demanded), Wold’s algorithms worked by alternating simple regressions, converging step by step to a solution. He called the framework PLS path modeling, and it was designed for the messy, multicollinear data that economists actually had, not the tidy datasets that textbooks assumed.
The leap from economics to chemistry came through Herman’s son, Svante Wold, a chemist at Umeå University in Sweden, and Harald Martens, a food scientist working in Norway. By the early 1980s, near-infrared (NIR) spectroscopy was exploding as an analytical technique in the food and agricultural industries. A single NIR spectrum could contain hundreds or thousands of wavelength channels, all highly correlated, measured on perhaps only a few dozen samples. Ordinary least squares regression was useless here — trying to invert a 1000×1000 covariance matrix from 30 samples is a mathematical non-starter. Principal Component Regression (PCR) offered one way out, but it built its components from the spectra alone, ignoring the property being predicted. Wold and Martens recognized that Herman’s PLS framework could be adapted to extract latent variables that simultaneously captured the structure of the spectra X and their relationship to the property of interest y. Their key contribution, presented at a conference on matrix pencils and published in the 1983 Springer proceedings [1], laid out the PLS regression algorithm in a form chemists could use.
What truly brought PLS to the practicing analytical chemist was the 1986 tutorial by Paul Geladi and Bruce Kowalski in Analytica Chimica Acta [2]. Geladi, a Swede then working with Kowalski at the University of Washington, co-wrote what became one of the most cited papers in the history of chemometrics. The paper walked readers through the algorithm step by step, with numerical examples, geometric interpretations, and practical advice. It translated Wold’s notation into language that bench chemists could follow, and it appeared at precisely the moment when affordable personal computers were making multivariate methods accessible to any laboratory with a spectrometer. Within a few years, PLS became the default calibration method for spectroscopic data worldwide.
The name itself has caused some confusion. “Partial Least Squares” suggests some kind of modified least squares procedure, which is misleading. The algorithm is really about projecting high-dimensional data onto a low-dimensional subspace of latent structures. This is why Svante Wold later proposed the alternative expansion “Projection to Latent Structures” [5], which more accurately describes what the method does. Both names are used interchangeably in the literature, and both abbreviate to PLS.
Why chemometrists needed PLS
To understand why PLS became so important, consider the typical problem in spectroscopic calibration. You have a set of samples — pharmaceutical tablets, grain batches, petroleum fractions — and you measure two things for each: a spectrum (your X matrix) and a property of interest like protein content, moisture, or octane number (your y vector).
The spectrum might contain p = 1000 wavelength channels, but you only have n = 50 samples. This is the wide data problem: p ≫ n. Ordinary Least Squares (OLS) requires inverting $X^T X$, a p×p matrix, but when p > n this matrix is singular — it has no inverse. OLS simply cannot be computed.
Even when p < n, spectroscopic variables are massively multicollinear. Neighboring wavelengths carry almost identical information. This makes $X^T X$ nearly singular, so the OLS solution becomes wildly unstable: tiny changes in the data produce enormous swings in the regression coefficients.
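The singularity claim is easy to check directly. A minimal NumPy sketch, using the 30-samples-by-1000-channels scenario mentioned above (the random data is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 1000  # far more wavelength channels than samples
X = rng.normal(size=(n, p))

XtX = X.T @ X  # the p x p matrix that OLS would need to invert
# rank(X^T X) <= min(n, p) = 30, so this 1000 x 1000 matrix built
# from 30 samples is singular and has no inverse.
print(np.linalg.matrix_rank(XtX))
```

The rank can never exceed the number of samples, which is why no amount of numerical cleverness rescues plain OLS here.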
The chemometrist’s toolbox before PLS offered two main alternatives:
Feature selection: Pick a handful of wavelengths and run OLS. But which wavelengths? This discards most of the information in the spectrum and requires expert knowledge for each new application.
Principal Component Regression (PCR): Run PCA on X , keep the first few components, and regress y on those scores. This handles multicollinearity and the wide-data problem, but the PCA step knows nothing about y . The first principal components capture maximum variance in X , which may or may not be relevant for predicting y .
PLS solves both problems at once. It compresses X into a few latent variables, like PCR, but it builds those latent variables to be maximally relevant for predicting y .
The PLS idea
The core idea of PLS can be stated in one sentence:
Find directions in X-space that have maximum covariance with y.
Contrast this with the goals of related methods:
| Method | What it maximizes |
| --- | --- |
| PCA / PCR | Variance in X |
| CCA (Canonical Correlation) | Correlation between X and y |
| PLS | Covariance between X and y |
Covariance is the product of correlation and the standard deviations of both variables: cov(t, y) = corr(t, y) · sd(t) · sd(y). By maximizing covariance rather than just correlation, PLS simultaneously seeks directions that (a) explain variation in X and (b) correlate with y. This is the key balance that makes PLS so effective for calibration.
Formally, PLS decomposes X and y as:
$$X = TP^T + E, \qquad y = Tq + f$$
where T is the n×A score matrix (with A components), P is the p×A loading matrix for X , q contains the loadings for y , and E and f are residuals. The crucial difference from PCR is that the score vectors in T are computed using information from both X and y .
PLS vs PCR
This distinction deserves emphasis because it is the single most important concept for understanding PLS.
PCR (Principal Component Regression):
Decompose X alone via PCA → get scores T
Regress y on T
The PCA step is completely blind to y . If the variation in X that is most predictive of y happens to be a minor source of spectral variance, it will end up in a high-numbered PC and may be discarded.
PLS (Partial Least Squares):
Decompose X and y simultaneously → get scores T that are relevant for prediction
The regression is built into the decomposition
The practical consequence: PLS typically needs fewer components than PCR to achieve the same predictive performance, and it is less likely to miss predictive information hidden in minor spectral variation.
The NIPALS algorithm
The most widely taught algorithm for PLS is NIPALS (Nonlinear Iterative Partial Least Squares), originally developed by Herman Wold for PCA and adapted for PLS regression. Here we present the PLS1 version (single y variable).
Start with mean-centered X0=X and y0=y . For each component a=1,2,…,A :
Compute the weight vector
The weight vector $w_a$ defines the direction in X-space. It is proportional to the covariance between X and y:

$$w_a = \frac{X_{a-1}^T y_{a-1}}{\lVert X_{a-1}^T y_{a-1} \rVert}$$

This is the step where y information enters. The weight vector points in the X-direction that has maximum covariance with y.
Compute the scores
Project X onto this direction to get the score vector:

$$t_a = X_{a-1} w_a$$

Each element of $t_a$ is the “position” of one sample along this new latent direction.
Compute the X-loadings
The loading vector tells us how each original variable contributes to this component:

$$p_a = \frac{X_{a-1}^T t_a}{t_a^T t_a}$$
Compute the y-loading
The scalar y-loading captures how much of y this component explains:

$$q_a = \frac{y_{a-1}^T t_a}{t_a^T t_a}$$
Deflate X and y
Remove the information captured by this component:

$$X_a = X_{a-1} - t_a p_a^T, \qquad y_a = y_{a-1} - q_a t_a$$
Then return to step 1 for the next component.
After extracting A components, the final regression coefficients in terms of the original variables can be recovered as:

$$b_{\mathrm{PLS}} = W (P^T W)^{-1} q$$

where $W = [w_1, \ldots, w_A]$, $P = [p_1, \ldots, p_A]$, and $q = [q_1, \ldots, q_A]^T$.
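The five steps translate almost line-for-line into NumPy. A minimal PLS1 sketch (function name, return values, and the demo data are our own choices, not from any particular library):

```python
import numpy as np

def nipals_pls1(X, y, n_components):
    """PLS1 via NIPALS. X (n x p) and y (n,) are assumed mean-centered."""
    Xa, ya = X.copy(), y.copy()
    n, p = X.shape
    W = np.zeros((p, n_components))   # weight vectors w_a
    P = np.zeros((p, n_components))   # X-loadings p_a
    T = np.zeros((n, n_components))   # score vectors t_a
    q = np.zeros(n_components)        # y-loadings q_a
    for a in range(n_components):
        w = Xa.T @ ya                        # step 1: covariance direction
        w /= np.linalg.norm(w)
        t = Xa @ w                           # step 2: scores
        tt = t @ t
        P[:, a] = Xa.T @ t / tt              # step 3: X-loadings
        q[a] = ya @ t / tt                   # step 4: y-loading
        Xa = Xa - np.outer(t, P[:, a])       # step 5: deflate X
        ya = ya - q[a] * t                   #          deflate y
        W[:, a], T[:, a] = w, t
    b = W @ np.linalg.solve(P.T @ W, q)      # coefficients in original X
    return b, T, W, P, q

# Demo on centered synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20)); X -= X.mean(axis=0)
y = X @ rng.normal(size=20) + rng.normal(scale=0.1, size=60); y -= y.mean()
b, T, W, P, q = nipals_pls1(X, y, n_components=3)
```

Two textbook properties make good sanity checks: the fitted values satisfy $Xb = Tq$ exactly, and the score vectors come out mutually orthogonal.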
PLS1 vs PLS2
PLS comes in two flavors depending on the response variable:
PLS1 predicts a single y variable. This is by far the most common case in spectroscopic calibration: predicting protein content, moisture, fat, octane number, or any single property from a spectrum. The algorithm above is PLS1.
PLS2 predicts multiple y variables simultaneously. Instead of a vector y , you have a matrix Y . The algorithm is modified so that the weight vector maximizes the covariance between X and the entire Y matrix. This finds latent variables that are jointly predictive of all response variables at once.
When to use PLS2:
When you have multiple related responses (e.g., predicting protein, moisture, and fat simultaneously from NIR spectra)
When the response variables share common underlying structure
When you want a more parsimonious model (one model instead of several PLS1 models)
When to stick with PLS1:
When response variables are unrelated
When each property requires different preprocessing or a different number of components
When you want maximum predictive accuracy for a single property (PLS1 is often slightly better than PLS2 for any given individual response)
In practice, most chemometrists build separate PLS1 models for each property. PLS2 is more commonly used in process monitoring and multivariate quality control.
Choosing the number of components
The number of components A is the critical tuning parameter in PLS. Too few components and the model underfits — it misses real structure in the data. Too many and the model overfits — it starts fitting noise and performs poorly on new samples.
Cross-validation
The standard approach is cross-validation (CV), typically leave-one-out or k-fold:
Leave out one sample (or one group of samples)
Build a PLS model on the remaining data with A components
Predict the left-out sample and record the prediction error
Repeat for each sample (or group)
Compute RMSECV (Root Mean Squared Error of Cross-Validation):
$$\mathrm{RMSECV}(A) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_{i,\mathrm{CV}} \right)^2}$$
Plot RMSECV vs number of components and choose the number that minimizes it (or where the curve flattens)
Reading the RMSECV curve
The RMSECV curve typically shows three regions:
Rapid decrease (components 1–3): The model captures the main predictive structure. Each new component substantially reduces error.
Plateau (the optimal region): Adding more components gives diminishing returns. The minimum (or the first point where the curve levels off) indicates the optimal number.
Slow increase (overfitting): Beyond the optimum, additional components fit noise, and the cross-validated error creeps back up.
Three error metrics appear constantly in PLS modeling:
| Metric | Full name | What it measures |
| --- | --- | --- |
| RMSEC | Root Mean Square Error of Calibration | Fit to the training data (always optimistic) |
| RMSECV | Root Mean Square Error of Cross-Validation | Estimated prediction error via CV |
| RMSEP | Root Mean Square Error of Prediction | True prediction error on an independent test set |
A large gap between RMSEC and RMSECV/RMSEP is a classic sign of overfitting. Ideally, all three should be similar.
Interpreting PLS models
A well-built PLS model produces several diagnostic plots that help you understand what the model has learned and whether it is reliable.
Scores plots
The score vectors $t_1, t_2, \ldots$ summarize each sample’s position in the latent variable space. A plot of $t_1$ vs $t_2$ (the first two PLS components) is analogous to a PCA scores plot, but the axes are oriented toward prediction rather than variance.
Use scores plots to:
Detect outliers (samples far from the main group)
Identify clusters or groups in your data
Check for trends over time (process drift)
Verify that calibration and validation sets span similar regions
Loadings and weights
The weight vector $w$ shows which variables (wavelengths) the model uses to construct each component. Large absolute values indicate important wavelengths for prediction.
The loading vector $p$ describes how each component relates back to the original X-variables.
Plotting w or p against wavelength reveals which spectral regions drive the model. These should correspond to known absorption bands of the analyte — if they do not, the model may be relying on spurious correlations.
VIP scores
Variable Importance in Projection (VIP) [4] provides a single summary score for each variable across all components:

$$\mathrm{VIP}_j = \sqrt{p \cdot \frac{\sum_{a=1}^{A} SS_a \, w_{aj}^2}{\sum_{a=1}^{A} SS_a}}$$

where $SS_a$ is the sum of squares of y explained by component a, and $w_{aj}$ is the (unit-norm) weight of variable j in component a.
The rule of thumb: variables with VIP>1 are considered important. Variables with VIP<0.5 contribute little and could potentially be removed.
Regression coefficients
The final regression coefficient vector $b_{\mathrm{PLS}}$ can be plotted against wavelength to see the overall spectral “recipe” the model uses for prediction. This is the most compact summary of the model and can be compared directly against known spectral features of the analyte.
Predicted vs reference plot
Plot predicted values $\hat{y}$ against reference values y for both calibration and validation sets. An ideal model produces points along the 1:1 diagonal with minimal scatter. Systematic deviations (curvature, offset) indicate model problems.
When to use PLS
NIR spectroscopic calibration
Predicting protein, moisture, fat, octane, and dozens of other properties from near-infrared spectra
Mid-IR and Raman calibration
Same wide-data, correlated-variable scenario; PLS handles it naturally
Process analytical technology (PAT)
Real-time monitoring of manufacturing processes using inline spectroscopy
Any regression with more variables than samples
Whenever p>n or when variables are highly correlated, PLS is a strong starting point
Consider alternatives when
You have few, uncorrelated predictors
If p is small and variables are independent, ordinary least squares or ridge regression may be simpler and equally effective
You need strict variable selection
PLS uses all variables; if you need a sparse model, consider LASSO or sparse PLS
Non-linear relationships dominate
Standard PLS is a linear method. For strongly non-linear problems, consider kernel PLS or machine learning approaches
Classification is the goal
Use PLS-DA (Discriminant Analysis) instead, which adapts PLS for class membership prediction
Advantages and limitations
Advantages
Handles wide data
Works even when p≫n , where OLS fails entirely
Handles multicollinearity
Highly correlated predictors (neighboring wavelengths) pose no problem
Uses y-information
Builds more predictive components than PCR with fewer latent variables
Computationally efficient
The NIPALS algorithm is fast and works on very large datasets
Interpretable
Scores, loadings, VIP, and regression coefficients all provide insight
Robust in practice
Decades of successful use across industries; well-understood failure modes
Limitations
Linear method
Assumes a linear relationship between X and y; non-linear effects are missed
Component selection is critical
Too many components overfit; too few underfit. Cross-validation is mandatory
Sensitive to outliers
Extreme samples can distort the latent variable space. Always check for outliers before modeling
All variables retained
PLS does not perform variable selection; irrelevant regions of the spectrum add noise
Not a black box
Requires spectroscopic knowledge to validate that the model makes chemical sense
Next steps
PLS regression is the foundation of a large family of methods. Once you are comfortable with basic PLS, you can explore:
PLS-DA (Discriminant Analysis): Adapts PLS for classification by coding class membership as dummy y-variables. Widely used for authenticating food products, identifying counterfeit pharmaceuticals, and classifying materials by type.
Multi-block PLS: When your data comes from multiple sources (e.g., NIR + Raman + physical measurements), multi-block methods like MB-PLS or SO-PLS handle each block separately while finding common latent structures.
Kernel PLS: Extends PLS to non-linear relationships using the kernel trick, similar to kernel PCA or support vector machines.
Sparse PLS: Combines PLS with L1 regularization (LASSO-like penalties) to perform simultaneous regression and variable selection.
Orthogonal PLS (OPLS): Separates the systematic variation in X into a predictive part (correlated with y) and an orthogonal part (uncorrelated with y). Developed by Trygg and Wold in 2002 [7] for improved model interpretation.
References
[1] Wold, S., Martens, H., & Wold, H. (1983). The multivariate calibration problem in chemistry solved by the PLS method. In Proceedings of the Conference on Matrix Pencils, Lecture Notes in Mathematics, Vol. 973 (pp. 286–293). Springer, Heidelberg.
[2] Geladi, P., & Kowalski, B. R. (1986). Partial least-squares regression: A tutorial. Analytica Chimica Acta, 185, 1–17.
[3] de Jong, S. (1993). SIMPLS: An alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 18(3), 251–263.
[4] Wold, S., Sjöström, M., & Eriksson, L. (2001). PLS-regression: A basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems, 58(2), 109–130.
[5] Wold, S., Esbensen, K., & Geladi, P. (1987). Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1–3), 37–52.
[6] Martens, H., & Næs, T. (1989). Multivariate Calibration. John Wiley & Sons, Chichester.
[7] Trygg, J., & Wold, S. (2002). Orthogonal projections to latent structures (O-PLS). Journal of Chemometrics, 16(3), 119–128.