
Partial Least Squares Regression (PLS)

Few methods can claim to have shaped an entire scientific discipline. Partial Least Squares regression (PLS) is one of them. Its story begins not in chemistry but in econometrics, with the Swedish statistician Herman Wold. During the 1960s and 1970s, Wold developed a family of iterative algorithms for estimating path models with latent variables — systems of equations where some quantities cannot be measured directly. His approach was deliberately “soft”: instead of imposing strict distributional assumptions and solving everything simultaneously (as the maximum-likelihood school demanded), Wold’s algorithms worked by alternating simple regressions, converging step by step to a solution. He called the framework PLS path modeling, and it was designed for the messy, multicollinear data that economists actually had, not the tidy datasets that textbooks assumed.

The leap from economics to chemistry came through Herman’s son, Svante Wold, a chemist at Umeå University in Sweden, and Harald Martens, a food scientist working in Norway. By the early 1980s, near-infrared (NIR) spectroscopy was exploding as an analytical technique in the food and agricultural industries. A single NIR spectrum could contain hundreds or thousands of wavelength channels, all highly correlated, measured on perhaps only a few dozen samples. Ordinary least squares regression was useless here — trying to invert a covariance matrix from 30 samples is a mathematical non-starter. Principal Component Regression (PCR) offered one way out, but it built its components from the spectra alone, ignoring the property being predicted. Wold and Martens recognized that Herman’s PLS framework could be adapted to extract latent variables that simultaneously captured the structure of the spectra and their relationship to the property of interest. Their key contribution, presented at the 1983 Heidelberg conference on matrix pencils and later published in the proceedings [1], laid out the PLS regression algorithm in a form chemists could use.

What truly brought PLS to the practicing analytical chemist was the 1986 tutorial by Paul Geladi and Bruce Kowalski in Analytica Chimica Acta [2]. Geladi, a Swede working at the University of Washington with Kowalski, wrote one of the most cited papers in the history of chemometrics. The paper walked readers through the algorithm step by step, with numerical examples, geometric interpretations, and practical advice. It translated Wold’s notation into language that bench chemists could follow, and it appeared at precisely the moment when affordable personal computers were making multivariate methods accessible to any laboratory with a spectrometer. Within a few years, PLS became the default calibration method for spectroscopic data worldwide.

The name itself has caused some confusion. “Partial Least Squares” suggests some kind of modified least squares procedure, which is misleading. The algorithm is really about projecting high-dimensional data onto a low-dimensional subspace of latent structures. This is why Svante Wold later proposed the alternative expansion “Projection to Latent Structures” [5], which more accurately describes what the method does. Both names are used interchangeably in the literature, and both abbreviate to PLS.

Why chemometrists needed PLS

To understand why PLS became so important, consider the typical problem in spectroscopic calibration. You have a set of samples — pharmaceutical tablets, grain batches, petroleum fractions — and you measure two things for each: a spectrum (your matrix $X$) and a property of interest like protein content, moisture, or octane number (your vector $y$).

The spectrum might contain hundreds or thousands of wavelength channels ($p$), but you only have a few dozen samples ($n$). This is the wide data problem: $p \gg n$. Ordinary Least Squares (OLS) requires inverting $X^T X$, a $p \times p$ matrix, but when $p > n$ this matrix is singular — it has no inverse. OLS simply cannot be computed.

Even when $n > p$, spectroscopic variables are massively multicollinear. Neighboring wavelengths carry almost identical information. This makes $X^T X$ nearly singular, so the OLS solution becomes wildly unstable: tiny changes in the data produce enormous swings in the regression coefficients.
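To make the singularity concrete, here is a small NumPy sketch (the dimensions are made up for illustration): with more variables than samples, $X^T X$ can never reach full rank.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 500                      # a typical "wide" scenario: few samples, many channels
X = rng.normal(size=(n, p))

XtX = X.T @ X                       # the p x p matrix that OLS needs to invert
rank = np.linalg.matrix_rank(XtX)   # rank is capped at n = 30, far below p = 500
print(f"X'X is {XtX.shape[0]}x{XtX.shape[1]} with rank {rank}")
```

Since the rank can be at most $n$, the $500 \times 500$ matrix is singular and OLS has no unique solution.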

The chemometrist’s toolbox before PLS offered two main alternatives:

  • Feature selection: Pick a handful of wavelengths and run OLS. But which wavelengths? This discards most of the information in the spectrum and requires expert knowledge for each new application.
  • Principal Component Regression (PCR): Run PCA on $X$, keep the first few components, and regress $y$ on those scores. This handles multicollinearity and the wide-data problem, but the PCA step knows nothing about $y$. The first principal components capture maximum variance in $X$, which may or may not be relevant for predicting $y$.

PLS solves both problems at once. It compresses $X$ into a few latent variables, like PCR, but it builds those latent variables to be maximally relevant for predicting $y$.

The PLS idea

The core idea of PLS can be stated in one sentence:

Find directions in X-space that have maximum covariance with y.

Contrast this with the goals of related methods:

| Method | What it maximizes |
| --- | --- |
| PCA / PCR | Variance in $X$ |
| CCA (Canonical Correlation) | Correlation between $X$ and $y$ |
| PLS | Covariance between $X$ and $y$ |

Covariance is the product of correlation and the standard deviations of both variables: $\mathrm{cov}(t, y) = \mathrm{corr}(t, y)\,\sigma_t\,\sigma_y$. By maximizing covariance rather than just correlation, PLS simultaneously seeks directions that (a) explain variation in $X$ and (b) correlate with $y$. This is the key balance that makes PLS so effective for calibration.

Formally, PLS decomposes $X$ and $y$ as:

$$X = T P^T + E, \qquad y = T q + f$$

where $T$ is the score matrix (with $A$ components), $P$ is the loading matrix for $X$, $q$ contains the loadings for $y$, and $E$ and $f$ are residuals. The crucial difference from PCR is that the score vectors in $T$ are computed using information from both $X$ and $y$.

PLS vs PCR

This distinction deserves emphasis because it is the single most important concept for understanding PLS.

PCR (Principal Component Regression):

  1. Decompose $X$ alone via PCA → get scores $T$
  2. Regress $y$ on $T$

The PCA step is completely blind to $y$. If the variation in $X$ that is most predictive of $y$ happens to be a minor source of spectral variance, it will end up in a high-numbered PC and may be discarded.

PLS (Partial Least Squares):

  1. Decompose $X$ and $y$ simultaneously → get scores $T$ that are relevant for prediction
  2. The regression is built into the decomposition

The practical consequence: PLS typically needs fewer components than PCR to achieve the same predictive performance, and it is less likely to miss predictive information hidden in minor spectral variation.

The NIPALS algorithm

The most widely taught algorithm for PLS is NIPALS (Nonlinear Iterative Partial Least Squares), originally developed by Herman Wold for PCA and adapted for PLS regression. Here we present the PLS1 version (single $y$ variable).

Start with mean-centered $X_0 = X$ and $y_0 = y$. For each component $a = 1, 2, \dots, A$:

  1. Compute the weight vector

    The weight vector $w_a$ defines the direction in X-space. It is proportional to the covariance between $X$ and $y$:

    $$w_a = \frac{X_{a-1}^T y_{a-1}}{\lVert X_{a-1}^T y_{a-1} \rVert}$$

    This is the step where $y$ information enters. The weight vector points in the X-direction that has maximum covariance with $y$.

  2. Compute the scores

    Project $X$ onto this direction to get the score vector:

    $$t_a = X_{a-1} w_a$$

    Each element of $t_a$ is the “position” of one sample along this new latent direction.

  3. Compute the X-loadings

    The loading vector $p_a$ tells us how each original variable contributes to this component:

    $$p_a = \frac{X_{a-1}^T t_a}{t_a^T t_a}$$

  4. Compute the y-loading

    The scalar y-loading $q_a$ captures how much of $y$ this component explains:

    $$q_a = \frac{y_{a-1}^T t_a}{t_a^T t_a}$$

  5. Deflate X and y

    Remove the information captured by this component:

    $$X_a = X_{a-1} - t_a p_a^T, \qquad y_a = y_{a-1} - q_a t_a$$

    Then return to step 1 for the next component.

After extracting $A$ components, the final regression coefficients in terms of the original variables can be recovered as:

$$b = W (P^T W)^{-1} q$$

where $W = [w_1, \dots, w_A]$, $P = [p_1, \dots, p_A]$, and $q = (q_1, \dots, q_A)^T$.

PLS1 vs PLS2

PLS comes in two flavors depending on the response variable:

PLS1 predicts a single $y$ variable. This is by far the most common case in spectroscopic calibration: predicting protein content, moisture, fat, octane number, or any single property from a spectrum. The algorithm above is PLS1.

PLS2 predicts multiple $y$ variables simultaneously. Instead of a vector $y$, you have a matrix $Y$ with one column per response. The algorithm is modified so that the weight vector maximizes the covariance between $X$ and the entire $Y$ matrix. This finds latent variables that are jointly predictive of all response variables at once.

When to use PLS2:

  • When you have multiple related responses (e.g., predicting protein, moisture, and fat simultaneously from NIR spectra)
  • When the response variables share common underlying structure
  • When you want a more parsimonious model (one model instead of several PLS1 models)

When to stick with PLS1:

  • When response variables are unrelated
  • When each property requires different preprocessing or a different number of components
  • When you want maximum predictive accuracy for a single property (PLS1 is often slightly better than PLS2 for any given individual response)

In practice, most chemometrists build separate PLS1 models for each property. PLS2 is more commonly used in process monitoring and multivariate quality control.

Choosing the number of components

The number of components is the critical tuning parameter in PLS. Too few components and the model underfits — it misses real structure in the data. Too many and the model overfits — it starts fitting noise and performs poorly on new samples.

Cross-validation

The standard approach is cross-validation (CV), typically leave-one-out or k-fold:

  1. Leave out one sample (or one group of samples)

  2. Build a PLS model on the remaining data with $A$ components

  3. Predict the left-out sample and record the prediction error

  4. Repeat for each sample (or group)

  5. Compute RMSECV (Root Mean Squared Error of Cross-Validation):

    $$\mathrm{RMSECV} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_{(i)} \right)^2}$$

    where $\hat{y}_{(i)}$ is the prediction for sample $i$ from a model built without it

  6. Plot RMSECV vs number of components and choose the number that minimizes it (or where the curve flattens)

Reading the RMSECV curve

The RMSECV curve typically shows three regions:

  1. Rapid decrease (components 1-3): The model captures the main predictive structure. Each new component substantially reduces error.
  2. Plateau (the optimal region): Adding more components gives diminishing returns. The minimum (or the first point where the curve levels off) indicates the optimal number.
  3. Increase (overfitting): Adding more components starts fitting noise. CV error rises again.

RMSEC vs RMSECV vs RMSEP

Three error metrics appear constantly in PLS modeling:

| Metric | Full name | What it measures |
| --- | --- | --- |
| RMSEC | Root Mean Square Error of Calibration | Fit to the training data (always optimistic) |
| RMSECV | Root Mean Square Error of Cross-Validation | Estimated prediction error via CV |
| RMSEP | Root Mean Square Error of Prediction | True prediction error on an independent test set |

A large gap between RMSEC and RMSECV/RMSEP is a classic sign of overfitting. Ideally, all three should be similar.

Interpreting PLS models

A well-built PLS model produces several diagnostic plots that help you understand what the model has learned and whether it is reliable.

Scores plots

The score vectors $t_a$ summarize each sample’s position in the latent variable space. A plot of $t_1$ vs $t_2$ (the first two PLS components) is analogous to a PCA scores plot, but the axes are oriented toward prediction rather than variance.

Use scores plots to:

  • Detect outliers (samples far from the main group)
  • Identify clusters or groups in your data
  • Check for trends over time (process drift)
  • Verify that calibration and validation sets span similar regions

Loadings and weights

The weight vector $w_a$ shows which variables (wavelengths) the model uses to construct each component. Large absolute values indicate important wavelengths for prediction.

The loading vector $p_a$ describes how each component relates back to the original X-variables.

Plotting $w_a$ or $p_a$ against wavelength reveals which spectral regions drive the model. These should correspond to known absorption bands of the analyte — if they do not, the model may be relying on spurious correlations.

VIP scores

Variable Importance in Projection (VIP) [4] provides a single summary score for each variable across all $A$ components:

$$\mathrm{VIP}_j = \sqrt{\frac{p \sum_{a=1}^{A} \mathrm{SS}_a \, w_{ja}^2}{\sum_{a=1}^{A} \mathrm{SS}_a}}$$

where $\mathrm{SS}_a = q_a^2 \, t_a^T t_a$ is the sum of squares explained by component $a$, and $w_{ja}$ is the (unit-normalized) weight of variable $j$ in component $a$.

The rule of thumb: variables with $\mathrm{VIP} > 1$ are considered important. Variables with $\mathrm{VIP} < 0.8$ (a commonly used cut-off) contribute little and could potentially be removed.

Regression coefficients

The final regression coefficient vector $b$ can be plotted against wavelength to see the overall spectral “recipe” the model uses for prediction. This is the most compact summary of the model and can be compared directly against known spectral features of the analyte.

Predicted vs reference plot

Plot predicted values against reference values for both calibration and validation sets. An ideal model produces points along the 1:1 diagonal with minimal scatter. Systematic deviations (curvature, offset) indicate model problems.

Code examples

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# Simulate NIR calibration data
np.random.seed(42)
n_samples = 100
n_wavelengths = 200
wavelengths = np.linspace(1000, 2500, n_wavelengths)

# True underlying spectra for 3 components
pure_spectra = np.array([
    np.exp(-((wavelengths - 1400)**2) / 10000),  # Analyte
    np.exp(-((wavelengths - 1900)**2) / 20000),  # Interferent 1
    np.exp(-((wavelengths - 1700)**2) / 15000),  # Interferent 2
])

# Random concentrations
concentrations = np.random.uniform(0, 10, (n_samples, 3))
y = concentrations[:, 0]  # Predict analyte concentration

# Build spectra: X = concentrations * pure_spectra + noise
X = concentrations @ pure_spectra + np.random.normal(0, 0.05, (n_samples, n_wavelengths))

# Split into calibration and validation
X_cal, X_val = X[:70], X[70:]
y_cal, y_val = y[:70], y[70:]

# --- Choose number of components via cross-validation ---
max_components = 15
rmsecv = []
for n_comp in range(1, max_components + 1):
    pls = PLSRegression(n_components=n_comp)
    y_cv = cross_val_predict(pls, X_cal, y_cal,
                             cv=KFold(10, shuffle=True, random_state=42))
    rmsecv.append(np.sqrt(mean_squared_error(y_cal, y_cv)))

plt.figure(figsize=(8, 4))
plt.plot(range(1, max_components + 1), rmsecv, 'o-')
plt.xlabel('Number of PLS Components')
plt.ylabel('RMSECV')
plt.title('Cross-Validation for Component Selection')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# --- Build final model ---
n_opt = np.argmin(rmsecv) + 1
print(f"Optimal components: {n_opt}")

pls = PLSRegression(n_components=n_opt)
pls.fit(X_cal, y_cal)

# Predictions
y_cal_pred = pls.predict(X_cal).ravel()
y_val_pred = pls.predict(X_val).ravel()
rmsec = np.sqrt(mean_squared_error(y_cal, y_cal_pred))
rmsep = np.sqrt(mean_squared_error(y_val, y_val_pred))
print(f"RMSEC = {rmsec:.3f}")
print(f"RMSEP = {rmsep:.3f}")

# --- Predicted vs Reference plot ---
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].scatter(y_cal, y_cal_pred, alpha=0.7, label=f'Cal (RMSEC={rmsec:.2f})')
axes[0].scatter(y_val, y_val_pred, alpha=0.7, label=f'Val (RMSEP={rmsep:.2f})')
axes[0].plot([0, 10], [0, 10], 'k--', alpha=0.5)
axes[0].set_xlabel('Reference')
axes[0].set_ylabel('Predicted')
axes[0].set_title('Predicted vs Reference')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Regression coefficients
axes[1].plot(wavelengths, pls.coef_.ravel())
axes[1].set_xlabel('Wavelength (nm)')
axes[1].set_ylabel('Regression Coefficient')
axes[1].set_title('PLS Regression Coefficients')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# --- VIP scores ---
def vip_scores(pls_model):
    """Calculate VIP scores for a fitted PLSRegression model."""
    t = pls_model.x_scores_    # (n, A) score vectors
    w = pls_model.x_weights_   # (p, A) weight vectors (unit norm)
    q = pls_model.y_loadings_  # (1, A) y-loadings
    p, A = w.shape
    # Sum of squares explained by each component: SS_a = q_a^2 * t_a' t_a
    ss = np.array([q[0, a]**2 * (t[:, a] @ t[:, a]) for a in range(A)])
    return np.sqrt(p * np.sum(ss * w**2, axis=1) / np.sum(ss))

vip = vip_scores(pls)
plt.figure(figsize=(10, 4))
plt.plot(wavelengths, vip)
plt.axhline(y=1, color='r', linestyle='--', alpha=0.5, label='VIP = 1 threshold')
plt.xlabel('Wavelength (nm)')
plt.ylabel('VIP Score')
plt.title('Variable Importance in Projection')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

When to use PLS

The default choice for:

  • NIR spectroscopic calibration: predicting protein, moisture, fat, octane, and dozens of other properties from near-infrared spectra
  • Mid-IR and Raman calibration: the same wide-data, correlated-variable scenario; PLS handles it naturally
  • Process analytical technology (PAT): real-time monitoring of manufacturing processes using inline spectroscopy
  • Any regression with more variables than samples: whenever $p > n$, or when variables are highly correlated, PLS is a strong starting point

Consider alternatives when:

  • You have few, uncorrelated predictors: if $p$ is small and the variables are independent, ordinary least squares or ridge regression may be simpler and equally effective
  • You need strict variable selection: PLS uses all variables; if you need a sparse model, consider LASSO or sparse PLS
  • Non-linear relationships dominate: standard PLS is a linear method. For strongly non-linear problems, consider kernel PLS or machine learning approaches
  • Classification is the goal: use PLS-DA (Discriminant Analysis) instead, which adapts PLS for class membership prediction

Advantages and limitations

Advantages

  • Handles wide data: works even when $p \gg n$, where OLS fails entirely
  • Handles multicollinearity: highly correlated predictors (neighboring wavelengths) pose no problem
  • Uses y-information: builds more predictive components than PCR with fewer latent variables
  • Computationally efficient: the NIPALS algorithm is fast and works on very large datasets
  • Interpretable: scores, loadings, VIP, and regression coefficients all provide insight
  • Robust in practice: decades of successful use across industries; well-understood failure modes

Limitations

  • Linear method: assumes a linear relationship between $X$ and $y$; non-linear effects are missed
  • Component selection is critical: too many components overfit; too few underfit. Cross-validation is mandatory
  • Sensitive to outliers: extreme samples can distort the latent variable space. Always check for outliers before modeling
  • All variables retained: PLS does not perform variable selection; irrelevant regions of the spectrum add noise
  • Not a black box: requires spectroscopic knowledge to validate that the model makes chemical sense

Next steps

PLS regression is the foundation of a large family of methods. Once you are comfortable with basic PLS, you can explore:

PLS-DA (Discriminant Analysis): Adapts PLS for classification by coding class membership as dummy y-variables. Widely used for authenticating food products, identifying counterfeit pharmaceuticals, and classifying materials by type.

Multi-block PLS: When your data comes from multiple sources (e.g., NIR + Raman + physical measurements), multi-block methods like MB-PLS or SO-PLS handle each block separately while finding common latent structures.

Kernel PLS: Extends PLS to non-linear relationships using the kernel trick, similar to kernel PCA or support vector machines.

Sparse PLS: Combines PLS with L1 regularization (LASSO-like penalties) to perform simultaneous regression and variable selection.

Orthogonal PLS (OPLS): Separates the systematic variation in $X$ into a predictive part (correlated with $y$) and an orthogonal part (uncorrelated with $y$). Developed by Trygg and Wold in 2002 for improved model interpretation [7].

References

[1] Wold, S., Martens, H., & Wold, H. (1983). The multivariate calibration problem in chemistry solved by the PLS method. In Proceedings of the Conference on Matrix Pencils, Lecture Notes in Mathematics, Vol. 973 (pp. 286–293). Springer, Heidelberg.

[2] Geladi, P., & Kowalski, B. R. (1986). Partial least-squares regression: A tutorial. Analytica Chimica Acta, 185, 1–17.

[3] de Jong, S. (1993). SIMPLS: An alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 18(3), 251–263.

[4] Wold, S., Sjöström, M., & Eriksson, L. (2001). PLS-regression: A basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems, 58(2), 109–130.

[5] Wold, S., Esbensen, K., & Geladi, P. (1987). Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3), 37–52.

[6] Martens, H., & Næs, T. (1989). Multivariate Calibration. John Wiley & Sons, Chichester.

[7] Trygg, J., & Wold, S. (2002). Orthogonal projections to latent structures (O-PLS). Journal of Chemometrics, 16(3), 119–128.