Data Reduction

The idea that not all measured variables carry useful information has deep roots in information theory. In 1948, Claude Shannon published “A Mathematical Theory of Communication,” establishing the fundamental limits on how much information can be transmitted through a noisy channel. Shannon’s framework introduced the concept of entropy as a measure of information content, and with it the implication that redundant or uninformative signals can be discarded without loss. While Shannon was thinking about telegraph wires and radio transmissions, the principle applies equally well to a near-infrared spectrum with 2000 wavelength channels: not all of those channels carry chemically relevant information. Some record only noise, some are redundant with their neighbors, and some are dominated by physical artifacts (scattering, water absorption) rather than the chemistry of interest.

Variable selection and data reduction became a distinct research topic in chemometrics during the 1990s, as instruments grew more powerful and datasets grew larger. An FTIR spectrometer might record 4000 data points per spectrum, and a hyperspectral imaging system might produce hundreds of thousands of variables per pixel. The chemometrics community recognized that throwing all these variables into a PLS or PCA model was not only computationally expensive but often counterproductive — noisy or irrelevant variables degraded model performance rather than improving it. Centner et al. (1996) proposed uninformative variable elimination (UVE), which used a randomization test to identify and remove variables that contributed no more than noise to a PLS model. Norgaard et al. (2000) introduced interval PLS (iPLS), a simple but powerful idea: divide the spectrum into intervals, build separate PLS models on each, and compare their predictive performance to find which spectral regions carry the most information. These methods, along with the earlier concept of Variable Importance in Projection (VIP) developed by Wold et al. (1993) for PLS models, established the modern toolkit for data reduction in chemometrics.

The motivation for data reduction extends beyond computational convenience. Reducing the number of variables can improve model interpretability (a model based on 20 selected wavelengths is easier to understand than one based on 2000), reduce overfitting (fewer variables means fewer degrees of freedom for the model to fit noise), and sometimes improve predictive performance outright. The principle is related to the curse of dimensionality, a term coined by Richard Bellman in 1961: as the number of variables grows relative to the number of samples, the data becomes increasingly sparse in the high-dimensional space, distances between points become less meaningful, and statistical models struggle to generalize. Data reduction is one of the primary defenses against this problem.

Why reduce data?

Before applying a multivariate model, it is worth asking: do I need all of these variables? In spectroscopy, the answer is usually no. A typical NIR spectrum has 1000-2000 wavelength channels, but the underlying chemistry may involve only a handful of absorbing species, each with a few characteristic bands. The rest of the spectrum is either noise, redundant information (neighboring wavelengths that are highly correlated), or regions dominated by physical effects unrelated to the analyte of interest.

There are five main reasons to reduce the number of variables before modeling:

Computational cost. PLS and PCA are fast for moderate datasets, but some applications — hyperspectral imaging, process monitoring with high-frequency acquisition, or multi-way data — produce matrices with hundreds of thousands of variables. Reducing dimensionality makes computation tractable.

Noise reduction. Variables that record mostly noise add noise to the model without adding signal. Including them is actively harmful: they degrade calibration statistics, inflate prediction errors, and make it harder for the model to find the real patterns. Removing noisy variables is a form of preprocessing that complements smoothing.

Model interpretability. A PLS model built on the full spectrum has loadings at every wavelength, making it hard to identify which chemical features drive the prediction. A model built on selected variables or intervals can be directly interpreted in terms of known absorption bands: “the model uses the C-H stretch region at 1720 nm and the O-H band at 1940 nm to predict fat content.” This kind of interpretability is important for regulatory acceptance, method validation, and scientific understanding.

Avoiding overfitting. When the number of variables $p$ far exceeds the number of samples $n$ (the $p \gg n$ problem), models have too many degrees of freedom and can fit the calibration data perfectly while failing catastrophically on new samples. Reducing $p$ closer to or below $n$ constrains the model and forces it to capture only the most robust patterns.

The curse of dimensionality. In high-dimensional spaces, data points tend to be roughly equidistant from each other, which undermines distance-based methods (kNN, clustering) and makes density estimation unreliable. The volume of the space grows exponentially with dimensionality, so the number of samples needed to adequately cover the space grows exponentially as well. Even for a modest number of variables, 10,000 samples may not be enough to populate the space. Reducing the dimensionality makes the geometry of the data more tractable.
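The distance-concentration effect is easy to verify numerically. The following sketch (illustrative, not from the original text) draws points uniformly in a hypercube and compares the spread of distances from one point to all others as the dimensionality grows:

```python
import numpy as np

def distance_contrast(n_dims, n_points=200):
    """Relative spread (max - min) / min of distances from one point
    to all others, for points uniform in the unit hypercube."""
    rng = np.random.default_rng(0)
    X = rng.random((n_points, n_dims))
    d = np.linalg.norm(X[1:] - X[0], axis=1)  # distances to first point
    return (d.max() - d.min()) / d.min()

for p in (2, 10, 100, 1000):
    print(p, round(distance_contrast(p), 3))
```

The contrast shrinks steadily with dimensionality: in 2D the nearest neighbor is far closer than the farthest, while in 1000D all points sit at nearly the same distance, which is exactly why kNN and clustering degrade.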

Variable selection

Variable selection means choosing a subset of the original variables (wavelengths, frequencies, mass channels) to include in the model, discarding the rest. The goal is to retain the variables that carry useful chemical information and remove those that contribute only noise or redundancy.

Forward and backward selection

The simplest variable selection strategies borrow from classical statistics:

Forward selection starts with an empty model and adds variables one at a time. At each step, the variable that most improves the model (e.g., the one that produces the largest decrease in cross-validated prediction error) is added. The process stops when no remaining variable improves the model.

Backward elimination starts with all variables included and removes them one at a time. At each step, the variable whose removal causes the least degradation (or the greatest improvement) in model performance is dropped. The process stops when further removal worsens the model.

Both approaches are greedy: they make the locally optimal choice at each step without considering the global picture. Forward selection may miss pairs of variables that are uninformative individually but powerful together. Backward elimination is computationally expensive for high-dimensional spectroscopic data (starting with 2000 variables and evaluating each removal is feasible, but not fast).
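To make the greedy behavior concrete, here is a toy forward-selection sketch using ordinary least squares and cross-validated R² as the improvement criterion; the `forward_select` helper and the simulated data are illustrative, not from any chemometrics package:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y, max_vars=5, cv=5):
    """Greedy forward selection: at each step add the variable that
    most improves cross-validated R^2; stop when nothing improves."""
    selected, best_score = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_vars:
        scores = [
            (cross_val_score(LinearRegression(),
                             X[:, selected + [j]], y, cv=cv).mean(), j)
            for j in remaining
        ]
        score, j = max(scores)
        if score <= best_score:
            break  # no candidate improves the model
        best_score = score
        selected.append(j)
        remaining.remove(j)
    return selected, best_score

# Tiny demo: y depends only on columns 0 and 3
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 8))
y = 2 * X[:, 0] - X[:, 3] + rng.normal(0, 0.1, 60)
sel, score = forward_select(X, y)
print(sel, round(score, 3))
```

On this toy data the procedure recovers the two informative columns, but because each step is locally optimal it would miss variable pairs that only help jointly.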

In practice, these classical approaches have been largely superseded by spectroscopy-specific methods like iPLS and VIP, which exploit the ordered, continuous nature of spectral data.

Interval PLS (iPLS)

Interval PLS, introduced by Norgaard et al. (2000), is one of the most widely used variable selection methods in spectroscopy. The idea is beautifully simple:

  1. Divide the spectrum into intervals

    Split the full spectral range into equally-sized, non-overlapping intervals. For a spectrum of 1000 wavelengths, you might use 20 intervals of 50 wavelengths each, or 40 intervals of 25 each.

  2. Build a PLS model on each interval

    For each interval, build a separate PLS model using only the wavelengths in that interval. Use cross-validation to determine the optimal number of latent variables and to estimate the prediction error (RMSECV).

  3. Compare performance

    Plot the RMSECV of each interval model alongside the RMSECV of the full-spectrum model. Intervals that outperform the full-spectrum model contain particularly informative wavelengths. Intervals with much worse performance contain noise or irrelevant information.

  4. Select and combine

    Combine the best-performing intervals into a single model. This reduced model uses only the informative spectral regions, typically achieving equal or better prediction than the full-spectrum model.

The power of iPLS lies in its interpretability. The bar plot of interval RMSECV values is a visual map of where the useful information resides in the spectrum. Analysts can relate the selected intervals to known absorption bands, providing chemical justification for the variable selection.

Synergy iPLS (siPLS) extends the idea by testing combinations of 2, 3, or 4 intervals simultaneously, searching for synergistic effects where two intervals together outperform either one alone. This is more computationally expensive but can discover complementary spectral regions that a single-interval analysis would miss.

Variable Importance in Projection (VIP)

VIP scores quantify the contribution of each variable to a PLS model. Originally described by Wold et al. (1993) and formalized by Chong and Jun (2005), VIP is computed from the PLS weights and the variance explained by each latent variable:

$$\mathrm{VIP}_j = \sqrt{\frac{p \sum_{a=1}^{A} \mathrm{SS}_a \, w_{ja}^2}{\sum_{a=1}^{A} \mathrm{SS}_a}}, \qquad \mathrm{SS}_a = q_a^2 \, \mathbf{t}_a^{\top}\mathbf{t}_a$$

where $p$ is the total number of variables, $A$ is the number of PLS components, $q_a$ is the regression weight of component $a$ on the response, $\mathbf{t}_a$ is the score vector of component $a$, and $w_{ja}$ is the (unit-normalized) weight of variable $j$ in component $a$.

The VIP score is normalized so that the average of the squared VIP values across all variables equals 1. The standard thresholds are:

  • VIP > 1: variable is considered important (above-average contribution)
  • VIP < 0.8: variable is a candidate for removal (well below average contribution)
  • 0.8 ≤ VIP ≤ 1: gray zone, depends on context

VIP scores are easy to compute from an existing PLS model and provide a per-variable importance measure that can be plotted against the wavelength axis. Peaks in the VIP plot correspond to wavelengths that drive the prediction, and these can often be matched to known absorption bands.

Competitive Adaptive Reweighted Sampling (CARS)

CARS, proposed by Li et al. (2009), takes a different approach: it runs PLS repeatedly, using Monte Carlo sampling to select subsets of samples and an adaptive reweighting scheme to progressively eliminate unimportant variables. At each iteration, variables with small PLS regression coefficients are down-weighted and eventually removed. The process mimics natural selection — variables “compete” for inclusion and only the fittest survive.

CARS is more computationally intensive than VIP but often produces sparser models (fewer selected variables) with comparable or better predictive performance. It is particularly popular in the NIR spectroscopy community for applications like food quality analysis and pharmaceutical monitoring.

Binning and averaging

Binning is the simplest form of dimensionality reduction: group adjacent data points and replace each group with a single value (typically the mean). A spectrum of 2000 wavelengths binned by a factor of 4 becomes a spectrum of 500 values.

The formula for binning with bin size $b$ is:

$$\tilde{x}_j = \frac{1}{b} \sum_{i=(j-1)b+1}^{jb} x_i, \qquad j = 1, \dots, \lfloor p/b \rfloor$$

where $p$ is the original number of variables and $x_i$ is the value at variable $i$.

Binning reduces dimensionality and suppresses high-frequency noise (each binned value is an average of $b$ adjacent points, reducing the noise standard deviation by a factor of $\sqrt{b}$). It is appropriate when:

  • The spectral resolution of the instrument is much higher than needed for the application (you are oversampled)
  • Spectral features are broad relative to the sampling interval
  • You want a quick, assumption-free way to reduce data size

Binning is not appropriate when:

  • Spectral features are sharp (Raman peaks, high-resolution IR) and binning would blur them
  • The spectral resolution is already matched to the feature widths
  • You need to preserve the exact positions and shapes of peaks
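The binning formula translates directly into a reshape-and-average in NumPy. This sketch (an illustration, not a library routine) assumes any trailing remainder that does not fill a complete bin is dropped:

```python
import numpy as np

def bin_spectrum(x, b):
    """Average adjacent points in groups of b; trailing remainder dropped."""
    p = (len(x) // b) * b
    return x[:p].reshape(-1, b).mean(axis=1)

x = np.arange(12, dtype=float)   # 12 "wavelength" values
print(bin_spectrum(x, 4))        # → [1.5 5.5 9.5]
```

Binning pure noise of standard deviation 1 by a factor of 4 yields values with standard deviation close to 0.5, consistent with the $\sqrt{b}$ noise reduction.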

PCA for compression

PCA is not just an exploratory tool — it can also serve as a data compression method. Instead of using the original $p$ variables, you use the first $A$ principal component scores as a compressed representation. If $A \ll p$, this is a dramatic reduction in dimensionality.

The original data matrix $\mathbf{X}$ (dimensions $n \times p$) is decomposed as:

$$\mathbf{X} = \mathbf{T}\mathbf{P}^{\top} + \mathbf{E}$$

where $\mathbf{T}$ ($n \times A$) is the score matrix, $\mathbf{P}$ ($p \times A$) is the loading matrix, and $\mathbf{E}$ is the residual matrix. Using only the scores $\mathbf{T}$ as input to subsequent models (regression, classification, clustering) reduces the dimensionality from $p$ to $A$.

Choosing the number of components

The number of components $A$ determines the compression ratio and the information retained. Too few components discard useful variance; too many retain noise. Common approaches for choosing $A$:

Explained variance. Plot cumulative explained variance as a function of the number of components. Choose $A$ where the curve levels off. For typical spectroscopic data, 3-10 components capture 95-99% of the variance.

Cross-validation. If the scores will be used in a regression model (PCA followed by regression is called principal component regression, or PCR), choose $A$ to minimize cross-validated prediction error. This directly optimizes the downstream task.

Scree plot. Plot the eigenvalues (or singular values) in descending order. Look for an “elbow” where the values drop from clearly meaningful to roughly flat. Components before the elbow represent signal; those after represent noise.

Reconstruction error

The quality of the PCA compression can be measured by the reconstruction error — how well the original data can be recovered from the compressed representation:

$$E_{\mathrm{rec}} = \frac{1}{np} \sum_{i=1}^{n} \sum_{j=1}^{p} (x_{ij} - \hat{x}_{ij})^2, \qquad \hat{\mathbf{X}} = \mathbf{T}\mathbf{P}^{\top}$$

where $\hat{x}_{ij}$ is the reconstructed value. A low reconstruction error means the compression preserves most of the information in the data. For spectroscopic data with clear chemical structure, 3-5 components often achieve reconstruction errors comparable to the instrumental noise level, meaning the discarded components contain mostly noise.
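A short sketch of PCA as compression on simulated rank-3 data (the data, the noise level 0.05, and all names are illustrative). scikit-learn's `PCA.inverse_transform` reconstructs the data from the scores, so the reconstruction error can be compared directly to the known noise level:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n, p, A = 60, 300, 3

# Rank-3 "chemical" structure plus instrumental noise
S = rng.random((n, 3))                  # simulated concentrations
K = rng.normal(size=(3, p))             # simulated pure-component profiles
X = S @ K + rng.normal(0, 0.05, (n, p))

pca = PCA(n_components=A)
T = pca.fit_transform(X)                # scores: the compressed representation
X_hat = pca.inverse_transform(T)        # reconstruction from A components

rmse = np.sqrt(np.mean((X - X_hat) ** 2))
print(T.shape, round(rmse, 3))          # scores are (60, 3); rmse ≈ noise level
```

Three components compress 300 variables to 3 scores while leaving a residual close to the injected noise, which is the behavior the text describes for well-structured spectra.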

Wavelength range selection

The simplest and often most impactful form of data reduction is to exclude spectral regions that are known to be uninformative. This requires domain knowledge but is straightforward and low-risk when the excluded regions are genuinely useless.

Common exclusions in NIR spectroscopy

NIR spectroscopy is the area where wavelength range selection is most routinely applied, because certain spectral regions are dominated by water absorption that overwhelms any analyte signal:

  • 1400-1500 nm: The first overtone of the O-H stretch. In aqueous samples or samples with significant moisture, this region saturates the detector and carries no analyte information.
  • 1900-2000 nm: The combination band of O-H stretch and bend. Again, water dominates this region, and the signal is often clipped or noisy.
  • Detector edges: The extremes of the wavelength range (typically below 900 nm and above 2500 nm for InGaAs detectors) often have lower sensitivity and higher noise. Excluding the noisiest 50-100 nm at each edge is common practice.

When to exclude regions

Wavelength range selection should be guided by:

Physical knowledge. Known absorbers that interfere with the analyte (water bands, CO2 bands in mid-IR, atmospheric interference in remote sensing) can be excluded a priori.

Visual inspection. Plot all spectra overlaid. Regions with very high noise, detector saturation (flat tops), or no variation between samples are candidates for exclusion.

Model diagnostics. Loadings or regression coefficients that are large in physically nonsensical regions suggest that the model is fitting noise or artifacts. Excluding those regions and rebuilding may improve robustness.
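In code, wavelength range exclusion is just a boolean mask over the wavelength axis. This sketch applies the NIR exclusions listed above (water bands and detector edges) to a simulated axis and data matrix; the axis and ranges are illustrative:

```python
import numpy as np

wavelengths = np.linspace(900, 2500, 1601)  # hypothetical NIR axis, 1 nm steps
X = np.random.default_rng(0).normal(size=(10, wavelengths.size))

# Exclude water-dominated regions and noisy detector edges
exclude = (
    ((wavelengths >= 1400) & (wavelengths <= 1500)) |   # O-H first overtone
    ((wavelengths >= 1900) & (wavelengths <= 2000)) |   # O-H combination band
    (wavelengths < 950) | (wavelengths > 2450)          # detector edges
)
keep = ~exclude
X_trim, wl_trim = X[:, keep], wavelengths[keep]
print(X.shape, "->", X_trim.shape)  # (10, 1601) -> (10, 1299)
```

Keeping the trimmed wavelength vector alongside the trimmed data matrix ensures that later loadings or VIP plots can still be labeled in nanometers.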

Code implementation

The following examples demonstrate two key data reduction techniques: iPLS-style interval selection and VIP calculation from a PLS model.

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict
import matplotlib.pyplot as plt


def interval_pls(X, y, n_intervals=20, max_components=10):
    """
    Interval PLS (iPLS): evaluate PLS on each spectral interval.
    Returns RMSECV per interval and for the full spectrum.
    """
    n_samples, n_vars = X.shape
    interval_size = n_vars // n_intervals
    edges, rmsecv_list = [], []
    for i in range(n_intervals):
        start = i * interval_size
        end = start + interval_size if i < n_intervals - 1 else n_vars
        edges.append((start, end))
        X_int = X[:, start:end]
        best_rmsecv = np.inf
        max_nc = min(max_components, X_int.shape[1], n_samples - 1)
        for nc in range(1, max_nc + 1):
            pls = PLSRegression(n_components=nc, scale=False)
            y_pred = cross_val_predict(pls, X_int, y, cv=10)
            rmsecv = np.sqrt(np.mean((y - y_pred.ravel()) ** 2))
            if rmsecv < best_rmsecv:
                best_rmsecv = rmsecv
        rmsecv_list.append(best_rmsecv)
    # Full-spectrum model
    best_full = np.inf
    for nc in range(1, min(max_components, n_vars, n_samples - 1) + 1):
        pls = PLSRegression(n_components=nc, scale=False)
        y_pred = cross_val_predict(pls, X, y, cv=10)
        rmsecv = np.sqrt(np.mean((y - y_pred.ravel()) ** 2))
        if rmsecv < best_full:
            best_full = rmsecv
    return rmsecv_list, best_full, edges


def compute_vip(X, y, n_components=5):
    """
    Compute VIP (Variable Importance in Projection) scores
    from a PLS model.
    """
    pls = PLSRegression(n_components=n_components, scale=False)
    pls.fit(X, y)
    W = pls.x_weights_   # (p, A)
    T = pls.x_scores_    # (n, A)
    Q = pls.y_loadings_  # (1, A)
    p, A = W.shape
    # Variance in y explained by each component
    ss_y = np.array([
        (Q[0, a] ** 2) * np.dot(T[:, a], T[:, a]) for a in range(A)
    ])
    vip = np.zeros(p)
    for j in range(p):
        vip[j] = np.sqrt(p * np.sum(ss_y * W[j, :] ** 2) / np.sum(ss_y))
    return vip


# --- Example: simulate NIR spectral data ---
np.random.seed(42)
n_samples, n_wl = 80, 500
wavelengths = np.linspace(1000, 2500, n_wl)

# Three chemical components with known absorption bands
comp1 = np.exp(-((wavelengths - 1200) ** 2) / 5000)
comp2 = np.exp(-((wavelengths - 1720) ** 2) / 3000)
comp3 = np.exp(-((wavelengths - 2100) ** 2) / 4000)
conc = np.random.rand(n_samples, 3)
conc[:, 1] = 0.5 * conc[:, 0] + 0.5 * np.random.rand(n_samples)
X = conc @ np.vstack([comp1, comp2, comp3])
X += np.random.normal(0, 0.02, X.shape)
y = conc[:, 0]

# iPLS analysis
rmsecv_int, rmsecv_full, edges = interval_pls(X, y, n_intervals=20)

# VIP analysis
vip_scores = compute_vip(X, y, n_components=4)

# Plot iPLS and VIP results
fig, axes = plt.subplots(2, 1, figsize=(12, 8))
centers = [(e[0] + e[1]) / 2 for e in edges]
widths = [e[1] - e[0] for e in edges]
colors = ['#2563eb' if r < rmsecv_full else '#94a3b8' for r in rmsecv_int]
axes[0].bar(centers, rmsecv_int, width=widths, color=colors,
            edgecolor='white', alpha=0.8)
axes[0].axhline(rmsecv_full, color='#dc2626', linestyle='--', linewidth=2,
                label=f'Full spectrum (RMSECV={rmsecv_full:.4f})')
axes[0].set_xlabel('Variable index'); axes[0].set_ylabel('RMSECV')
axes[0].set_title('iPLS: RMSECV per interval')
axes[0].legend(); axes[0].grid(True, alpha=0.3, axis='y')
axes[1].plot(wavelengths, vip_scores, 'k-', linewidth=1)
axes[1].fill_between(wavelengths, vip_scores, alpha=0.3, color='#2563eb')
axes[1].axhline(1.0, color='#dc2626', linestyle='--', linewidth=1.5,
                label='VIP = 1 threshold')
axes[1].set_xlabel('Wavelength (nm)'); axes[1].set_ylabel('VIP score')
axes[1].set_title('Variable Importance in Projection (VIP)')
axes[1].legend(); axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

When to use data reduction

When reduction helps

High-dimensional data with few samples. If you have 2000 wavelengths and 50 samples, reducing variables reduces the risk of overfitting dramatically.

Noisy spectral regions. Excluding or down-weighting regions dominated by noise improves model robustness.

Need for interpretability. Selected variables can be linked to specific chemical absorptions, making the model scientifically meaningful and easier to validate.

Computational constraints. Hyperspectral imaging, process monitoring at high frequency, and real-time applications benefit from smaller, faster models.

When to be cautious

Risk of discarding information. Aggressive variable selection may remove variables that carry subtle but real chemical information. Always validate on independent test data.

Overfitting the selection. If variable selection is optimized on the same data used for model calibration (without proper cross-validation), the selected variables overfit the training data. Use double cross-validation: an outer loop for model assessment and an inner loop for variable selection.

PLS already reduces dimensions. PLS extracts a small number of latent variables from many original variables. For well-conditioned datasets, PLS on the full spectrum may already give excellent results, and variable selection adds complexity without benefit.

Loss of transferability. A model based on selected variables may be more sensitive to instrument drift or sample matrix changes than a full-spectrum model, because it relies on fewer data points to anchor the prediction.

References

[1] Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379-423.

[2] Norgaard, L., Saudland, A., Wagner, J., Nielsen, J. P., Munck, L., & Engelsen, S. B. (2000). Interval partial least-squares regression (iPLS): A comparative chemometric study with an example from near-infrared spectroscopy. Applied Spectroscopy, 54(3), 413-419.

[3] Centner, V., Massart, D. L., de Noord, O. E., de Jong, S., Vandeginste, B. M., & Sterna, C. (1996). Elimination of uninformative variables for multivariate calibration. Analytical Chemistry, 68(21), 3851-3858.

[4] Chong, I.-G., & Jun, C.-H. (2005). Performance of some variable selection methods when multicollinearity is present. Chemometrics and Intelligent Laboratory Systems, 78(1-2), 103-112.

[5] Wold, S., Johansson, E., & Cocchi, M. (1993). PLS — Partial least squares projections to latent structures. In H. Kubinyi (Ed.), 3D QSAR in Drug Design: Theory, Methods and Applications (pp. 523-550). ESCOM.

[6] Li, H., Liang, Y., Xu, Q., & Cao, D. (2009). Key wavelengths screening using competitive adaptive reweighted sampling method for multivariate calibration. Analytica Chimica Acta, 648(1), 77-84.

[7] Bellman, R. (1961). Adaptive Control Processes: A Guided Tour. Princeton University Press.

[8] Mehmood, T., Liland, K. H., Snipen, L., & Saebo, S. (2012). A review of variable selection methods in partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 118, 62-69.

[9] Rinnan, A., van den Berg, F., & Engelsen, S. B. (2009). Review of the most common pre-processing techniques for near-infrared spectra. TrAC Trends in Analytical Chemistry, 28(10), 1201-1222.