
Partial Least Squares Regression (PLS)

Few methods can claim to have shaped an entire scientific discipline. Partial Least Squares regression (PLS) is one of them. Its story begins not in chemistry but in econometrics, with the Swedish statistician Herman Wold. During the 1960s and 1970s, Wold developed a family of iterative algorithms for estimating path models with latent variables — systems of equations where some quantities cannot be measured directly. His approach was deliberately “soft”: instead of imposing strict distributional assumptions and solving everything simultaneously (as the maximum-likelihood school demanded), Wold’s algorithms worked by alternating simple regressions, converging step by step to a solution. He called the framework PLS path modeling, and it was designed for the messy, multicollinear data that economists actually had, not the tidy datasets that textbooks assumed.

The leap from economics to chemistry came through Herman’s son, Svante Wold, a chemist at Umeå University in Sweden, and Harald Martens, a food scientist working in Norway. By the early 1980s, near-infrared (NIR) spectroscopy was exploding as an analytical technique in the food and agricultural industries. A single NIR spectrum could contain hundreds or thousands of wavelength channels, all highly correlated, measured on perhaps only a few dozen samples. Ordinary least squares regression was useless here — trying to invert a covariance matrix from 30 samples is a mathematical non-starter. Principal Component Regression (PCR) offered one way out, but it built its components from the spectra alone, ignoring the property being predicted. Wold and Martens recognized that Herman’s PLS framework could be adapted to extract latent variables that simultaneously captured the structure of the spectra and their relationship to the property of interest. Their key contribution, presented at the 1983 Heidelberg conference on matrix pencils and later published in the proceedings [1], laid out the PLS regression algorithm in a form chemists could use.

What truly brought PLS to the practicing analytical chemist was the 1986 tutorial by Paul Geladi and Bruce Kowalski in Analytica Chimica Acta [2]. Geladi, a Swede working at the University of Washington with Kowalski, wrote one of the most cited papers in the history of chemometrics. The paper walked readers through the algorithm step by step, with numerical examples, geometric interpretations, and practical advice. It translated Wold’s notation into language that bench chemists could follow, and it appeared at precisely the moment when affordable personal computers were making multivariate methods accessible to any laboratory with a spectrometer. Within a few years, PLS became the default calibration method for spectroscopic data worldwide.

The name itself has caused some confusion. “Partial Least Squares” suggests some kind of modified least squares procedure, which is misleading. The algorithm is really about projecting high-dimensional data onto a low-dimensional subspace of latent structures. This is why Svante Wold later proposed the alternative expansion “Projection to Latent Structures” [5], which more accurately describes what the method does. Both names are used interchangeably in the literature, and both abbreviate to PLS.

Why chemometrists needed PLS

To understand why PLS became so important, consider the typical problem in spectroscopic calibration. You have a set of samples — pharmaceutical tablets, grain batches, petroleum fractions — and you measure two things for each: a spectrum (your matrix $X$) and a property of interest like protein content, moisture, or octane number (your vector $y$).

The spectrum might contain hundreds or thousands of wavelength channels ($p$), but you only have a few dozen samples ($n$). This is the wide data problem: $p \gg n$. Ordinary Least Squares (OLS) requires inverting $X^T X$, a $p \times p$ matrix, but when $p > n$ this matrix is singular — it has no inverse. OLS simply cannot be computed.

Even when $n > p$, spectroscopic variables are massively multicollinear. Neighboring wavelengths carry almost identical information. This makes $X^T X$ nearly singular, so the OLS solution becomes wildly unstable: tiny changes in the data produce enormous swings in the regression coefficients.
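To make the singularity concrete, here is a small NumPy sketch (the dimensions are made up for illustration): with more variables than samples, $X^T X$ can never reach full rank.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 500                      # a typical "wide" scenario: few samples, many channels
X = rng.normal(size=(n, p))

XtX = X.T @ X                       # the p x p matrix that OLS needs to invert
rank = np.linalg.matrix_rank(XtX)   # rank is capped at n = 30, far below p = 500
print(f"X'X is {XtX.shape[0]}x{XtX.shape[1]} with rank {rank}")
```

Since the rank can be at most $n$, the $500 \times 500$ matrix is singular and OLS has no unique solution.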

The chemometrist’s toolbox before PLS offered two main alternatives:

  • Feature selection: Pick a handful of wavelengths and run OLS. But which wavelengths? This discards most of the information in the spectrum and requires expert knowledge for each new application.
  • Principal Component Regression (PCR): Run PCA on $X$, keep the first few components, and regress $y$ on those scores. This handles multicollinearity and the wide-data problem, but the PCA step knows nothing about $y$. The first principal components capture maximum variance in $X$, which may or may not be relevant for predicting $y$.

PLS solves both problems at once. It compresses $X$ into a few latent variables, like PCR, but it builds those latent variables to be maximally relevant for predicting $y$.

The PLS idea

The core idea of PLS can be stated in one sentence:

Find directions in X-space that have maximum covariance with y.

Contrast this with the goals of related methods:

| Method | What it maximizes |
| --- | --- |
| PCA / PCR | Variance in $X$ |
| CCA (Canonical Correlation) | Correlation between $X$ and $y$ |
| PLS | Covariance between $X$ and $y$ |

Covariance is the product of correlation and the standard deviations of both variables: $\mathrm{cov}(t, y) = \mathrm{corr}(t, y)\,\sigma_t\,\sigma_y$. By maximizing covariance rather than just correlation, PLS simultaneously seeks directions that (a) explain variation in $X$ and (b) correlate with $y$. This is the key balance that makes PLS so effective for calibration.

Formally, PLS decomposes $X$ and $y$ as:

$$X = T P^T + E, \qquad y = T q + f$$

where $T$ is the score matrix (with $A$ components), $P$ is the loading matrix for $X$, $q$ contains the loadings for $y$, and $E$ and $f$ are residuals. The crucial difference from PCR is that the score vectors in $T$ are computed using information from both $X$ and $y$.

PLS vs PCR

This distinction deserves emphasis because it is the single most important concept for understanding PLS.

PCR (Principal Component Regression):

  1. Decompose $X$ alone via PCA → get scores $T$
  2. Regress $y$ on $T$

The PCA step is completely blind to $y$. If the variation in $X$ that is most predictive of $y$ happens to be a minor source of spectral variance, it will end up in a high-numbered PC and may be discarded.

PLS (Partial Least Squares):

  1. Decompose $X$ and $y$ simultaneously → get scores $T$ that are relevant for prediction
  2. The regression is built into the decomposition

The practical consequence: PLS typically needs fewer components than PCR to achieve the same predictive performance, and it is less likely to miss predictive information hidden in minor spectral variation.

The NIPALS algorithm

The most widely taught algorithm for PLS is NIPALS (Nonlinear Iterative Partial Least Squares), originally developed by Herman Wold for PCA and adapted for PLS regression. Here we present the PLS1 version (single $y$ variable).

Start with mean-centered $X_0 = X$ and $y_0 = y$. For each component $a = 1, 2, \dots, A$:

  1. Compute the weight vector

    The weight vector $w_a$ defines the direction in X-space. It is proportional to the covariance between $X$ and $y$:

    $$w_a = \frac{X_{a-1}^T y_{a-1}}{\lVert X_{a-1}^T y_{a-1} \rVert}$$

    This is the step where $y$ information enters. The weight vector points in the X-direction that has maximum covariance with $y$.

  2. Compute the scores

    Project $X$ onto this direction to get the score vector:

    $$t_a = X_{a-1} w_a$$

    Each element of $t_a$ is the “position” of one sample along this new latent direction.

  3. Compute the X-loadings

    The loading vector $p_a$ tells us how each original variable contributes to this component:

    $$p_a = \frac{X_{a-1}^T t_a}{t_a^T t_a}$$

  4. Compute the y-loading

    The scalar y-loading $q_a$ captures how much of $y$ this component explains:

    $$q_a = \frac{y_{a-1}^T t_a}{t_a^T t_a}$$

  5. Deflate X and y

    Remove the information captured by this component:

    $$X_a = X_{a-1} - t_a p_a^T, \qquad y_a = y_{a-1} - q_a t_a$$

    Then return to step 1 for the next component.

After extracting $A$ components, the final regression coefficients in terms of the original variables can be recovered as:

$$b = W (P^T W)^{-1} q$$

where $W = [w_1, \dots, w_A]$, $P = [p_1, \dots, p_A]$, and $q = (q_1, \dots, q_A)^T$.

PLS1 vs PLS2

PLS comes in two flavors depending on the response variable:

PLS1 predicts a single $y$ variable. This is by far the most common case in spectroscopic calibration: predicting protein content, moisture, fat, octane number, or any single property from a spectrum. The algorithm above is PLS1.

PLS2 predicts multiple $y$ variables simultaneously. Instead of a vector $y$, you have a matrix $Y$ with one column per response. The algorithm is modified so that the weight vector maximizes the covariance between $X$ and the entire $Y$ matrix. This finds latent variables that are jointly predictive of all response variables at once.

When to use PLS2:

  • When you have multiple related responses (e.g., predicting protein, moisture, and fat simultaneously from NIR spectra)
  • When the response variables share common underlying structure
  • When you want a more parsimonious model (one model instead of several PLS1 models)

When to stick with PLS1:

  • When response variables are unrelated
  • When each property requires different preprocessing or a different number of components
  • When you want maximum predictive accuracy for a single property (PLS1 is often slightly better than PLS2 for any given individual response)

In practice, most chemometrists build separate PLS1 models for each property. PLS2 is more commonly used in process monitoring and multivariate quality control.

Choosing the number of components

The number of components is the critical tuning parameter in PLS. Too few components and the model underfits — it misses real structure in the data. Too many and the model overfits — it starts fitting noise and performs poorly on new samples.

Cross-validation

The standard approach is cross-validation (CV), typically leave-one-out or k-fold:

  1. Leave out one sample (or one group of samples)

  2. Build a PLS model on the remaining data with $A$ components

  3. Predict the left-out sample and record the prediction error

  4. Repeat for each sample (or group)

  5. Compute RMSECV (Root Mean Squared Error of Cross-Validation):

    $$\mathrm{RMSECV} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_{(i)} \right)^2}$$

    where $\hat{y}_{(i)}$ is the prediction for sample $i$ from a model built without it

  6. Plot RMSECV vs number of components and choose the number that minimizes it (or where the curve flattens)

Reading the RMSECV curve

The RMSECV curve typically shows three regions:

  1. Rapid decrease (components 1-3): The model captures the main predictive structure. Each new component substantially reduces error.
  2. Plateau (the optimal region): Adding more components gives diminishing returns. The minimum (or the first point where the curve levels off) indicates the optimal number.
  3. Increase (overfitting): Adding more components starts fitting noise. CV error rises again.

RMSEC vs RMSECV vs RMSEP

Three error metrics appear constantly in PLS modeling:

| Metric | Full name | What it measures |
| --- | --- | --- |
| RMSEC | Root Mean Square Error of Calibration | Fit to the training data (always optimistic) |
| RMSECV | Root Mean Square Error of Cross-Validation | Estimated prediction error via CV |
| RMSEP | Root Mean Square Error of Prediction | True prediction error on an independent test set |

A large gap between RMSEC and RMSECV/RMSEP is a classic sign of overfitting. Ideally, all three should be similar.

Interpreting PLS models

A well-built PLS model produces several diagnostic plots that help you understand what the model has learned and whether it is reliable.

Scores plots

The score vectors $t_a$ summarize each sample’s position in the latent variable space. A plot of $t_1$ vs $t_2$ (the first two PLS components) is analogous to a PCA scores plot, but the axes are oriented toward prediction rather than variance.

Use scores plots to:

  • Detect outliers (samples far from the main group)
  • Identify clusters or groups in your data
  • Check for trends over time (process drift)
  • Verify that calibration and validation sets span similar regions

Loadings and weights

The weight vector $w_a$ shows which variables (wavelengths) the model uses to construct each component. Large absolute values indicate important wavelengths for prediction.

The loading vector $p_a$ describes how each component relates back to the original X-variables.

Plotting $w_a$ or $p_a$ against wavelength reveals which spectral regions drive the model. These should correspond to known absorption bands of the analyte — if they do not, the model may be relying on spurious correlations.

VIP scores

Variable Importance in Projection (VIP) [4] provides a single summary score for each variable across all $A$ components:

$$\mathrm{VIP}_j = \sqrt{\frac{p \sum_{a=1}^{A} \mathrm{SS}_a \, w_{ja}^2}{\sum_{a=1}^{A} \mathrm{SS}_a}}$$

where $\mathrm{SS}_a = q_a^2 \, t_a^T t_a$ is the sum of squares explained by component $a$, and $w_{ja}$ is the (unit-normalized) weight of variable $j$ in component $a$.

The rule of thumb: variables with $\mathrm{VIP} > 1$ are considered important. Variables with $\mathrm{VIP} < 0.8$ (a commonly used cut-off) contribute little and could potentially be removed.

Regression coefficients

The final regression coefficient vector $b$ can be plotted against wavelength to see the overall spectral “recipe” the model uses for prediction. This is the most compact summary of the model and can be compared directly against known spectral features of the analyte.

Predicted vs reference plot

Plot predicted values against reference values for both calibration and validation sets. An ideal model produces points along the 1:1 diagonal with minimal scatter. Systematic deviations (curvature, offset) indicate model problems.

Code examples

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# Simulate NIR calibration data
np.random.seed(42)
n_samples = 100
n_wavelengths = 200
wavelengths = np.linspace(1000, 2500, n_wavelengths)

# True underlying spectra for 3 components
pure_spectra = np.array([
    np.exp(-((wavelengths - 1400)**2) / 10000),  # Analyte
    np.exp(-((wavelengths - 1900)**2) / 20000),  # Interferent 1
    np.exp(-((wavelengths - 1700)**2) / 15000),  # Interferent 2
])

# Random concentrations
concentrations = np.random.uniform(0, 10, (n_samples, 3))
y = concentrations[:, 0]  # Predict analyte concentration

# Build spectra: X = concentrations * pure_spectra + noise
X = concentrations @ pure_spectra + np.random.normal(0, 0.05, (n_samples, n_wavelengths))

# Split into calibration and validation
X_cal, X_val = X[:70], X[70:]
y_cal, y_val = y[:70], y[70:]

# --- Choose number of components via cross-validation ---
max_components = 15
rmsecv = []
for n_comp in range(1, max_components + 1):
    pls = PLSRegression(n_components=n_comp)
    y_cv = cross_val_predict(pls, X_cal, y_cal,
                             cv=KFold(10, shuffle=True, random_state=42))
    rmsecv.append(np.sqrt(mean_squared_error(y_cal, y_cv)))

plt.figure(figsize=(8, 4))
plt.plot(range(1, max_components + 1), rmsecv, 'o-')
plt.xlabel('Number of PLS Components')
plt.ylabel('RMSECV')
plt.title('Cross-Validation for Component Selection')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# --- Build final model ---
n_opt = np.argmin(rmsecv) + 1
print(f"Optimal components: {n_opt}")

pls = PLSRegression(n_components=n_opt)
pls.fit(X_cal, y_cal)

# Predictions
y_cal_pred = pls.predict(X_cal).ravel()
y_val_pred = pls.predict(X_val).ravel()
rmsec = np.sqrt(mean_squared_error(y_cal, y_cal_pred))
rmsep = np.sqrt(mean_squared_error(y_val, y_val_pred))
print(f"RMSEC = {rmsec:.3f}")
print(f"RMSEP = {rmsep:.3f}")

# --- Predicted vs Reference plot ---
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].scatter(y_cal, y_cal_pred, alpha=0.7, label=f'Cal (RMSEC={rmsec:.2f})')
axes[0].scatter(y_val, y_val_pred, alpha=0.7, label=f'Val (RMSEP={rmsep:.2f})')
axes[0].plot([0, 10], [0, 10], 'k--', alpha=0.5)
axes[0].set_xlabel('Reference')
axes[0].set_ylabel('Predicted')
axes[0].set_title('Predicted vs Reference')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Regression coefficients
axes[1].plot(wavelengths, pls.coef_.ravel())
axes[1].set_xlabel('Wavelength (nm)')
axes[1].set_ylabel('Regression Coefficient')
axes[1].set_title('PLS Regression Coefficients')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# --- VIP scores ---
def vip_scores(pls_model):
    """Calculate VIP scores for a fitted PLSRegression model."""
    t = pls_model.x_scores_    # (n, A) score vectors
    w = pls_model.x_weights_   # (p, A) weight vectors (unit norm)
    q = pls_model.y_loadings_  # (1, A) y-loadings
    p, A = w.shape
    # Sum of squares explained by each component: SS_a = q_a^2 * t_a' t_a
    ss = np.array([q[0, a]**2 * (t[:, a] @ t[:, a]) for a in range(A)])
    return np.sqrt(p * np.sum(ss * w**2, axis=1) / np.sum(ss))

vip = vip_scores(pls)
plt.figure(figsize=(10, 4))
plt.plot(wavelengths, vip)
plt.axhline(y=1, color='r', linestyle='--', alpha=0.5, label='VIP = 1 threshold')
plt.xlabel('Wavelength (nm)')
plt.ylabel('VIP Score')
plt.title('Variable Importance in Projection')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

When to use PLS

The default choice for:

  • NIR spectroscopic calibration: predicting protein, moisture, fat, octane, and dozens of other properties from near-infrared spectra
  • Mid-IR and Raman calibration: the same wide-data, correlated-variable scenario; PLS handles it naturally
  • Process analytical technology (PAT): real-time monitoring of manufacturing processes using inline spectroscopy
  • Any regression with more variables than samples: whenever $p > n$, or when variables are highly correlated, PLS is a strong starting point

Consider alternatives when:

  • You have few, uncorrelated predictors: if $p$ is small and the variables are independent, ordinary least squares or ridge regression may be simpler and equally effective
  • You need strict variable selection: PLS uses all variables; if you need a sparse model, consider LASSO or sparse PLS
  • Non-linear relationships dominate: standard PLS is a linear method. For strongly non-linear problems, consider kernel PLS or machine learning approaches
  • Classification is the goal: use PLS-DA (Discriminant Analysis) instead, which adapts PLS for class membership prediction

Advantages and limitations

Advantages

  • Handles wide data: works even when $p \gg n$, where OLS fails entirely
  • Handles multicollinearity: highly correlated predictors (neighboring wavelengths) pose no problem
  • Uses y-information: builds more predictive components than PCR with fewer latent variables
  • Computationally efficient: the NIPALS algorithm is fast and works on very large datasets
  • Interpretable: scores, loadings, VIP, and regression coefficients all provide insight
  • Robust in practice: decades of successful use across industries; well-understood failure modes

Limitations

  • Linear method: assumes a linear relationship between $X$ and $y$; non-linear effects are missed
  • Component selection is critical: too many components overfit; too few underfit. Cross-validation is mandatory
  • Sensitive to outliers: extreme samples can distort the latent variable space. Always check for outliers before modeling
  • All variables retained: PLS does not perform variable selection; irrelevant regions of the spectrum add noise
  • Not a black box: requires spectroscopic knowledge to validate that the model makes chemical sense

Next steps

PLS regression is the foundation of a large family of methods. Once you are comfortable with basic PLS, you can explore:

PLS-DA (Discriminant Analysis): Adapts PLS for classification by coding class membership as dummy y-variables. Widely used for authenticating food products, identifying counterfeit pharmaceuticals, and classifying materials by type.

Multi-block PLS: When your data comes from multiple sources (e.g., NIR + Raman + physical measurements), multi-block methods like MB-PLS or SO-PLS handle each block separately while finding common latent structures.

Kernel PLS: Extends PLS to non-linear relationships using the kernel trick, similar to kernel PCA or support vector machines.

Sparse PLS: Combines PLS with L1 regularization (LASSO-like penalties) to perform simultaneous regression and variable selection.

Orthogonal PLS (OPLS): Separates the systematic variation in $X$ into a predictive part (correlated with $y$) and an orthogonal part (uncorrelated with $y$). Developed by Trygg and Wold in 2002 for improved model interpretation [7].

References

[1] Wold, S., Martens, H., & Wold, H. (1983). The multivariate calibration problem in chemistry solved by the PLS method. In Proceedings of the Conference on Matrix Pencils, Lecture Notes in Mathematics, Vol. 973 (pp. 286–293). Springer, Heidelberg.

[2] Geladi, P., & Kowalski, B. R. (1986). Partial least-squares regression: A tutorial. Analytica Chimica Acta, 185, 1–17.

[3] de Jong, S. (1993). SIMPLS: An alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 18(3), 251–263.

[4] Wold, S., Sjöström, M., & Eriksson, L. (2001). PLS-regression: A basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems, 58(2), 109–130.

[5] Wold, S., Esbensen, K., & Geladi, P. (1987). Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3), 37–52.

[6] Martens, H., & Næs, T. (1989). Multivariate Calibration. John Wiley & Sons, Chichester.

[7] Trygg, J., & Wold, S. (2002). Orthogonal projections to latent structures (O-PLS). Journal of Chemometrics, 16(3), 119–128.