The idea that not all measured variables carry useful information has deep roots in information theory. In 1948, Claude Shannon published “A Mathematical Theory of Communication,” establishing the fundamental limits on how much information can be transmitted through a noisy channel. Shannon’s framework introduced the concept of entropy as a measure of information content, and with it the implication that redundant or uninformative signals can be discarded without loss. While Shannon was thinking about telegraph wires and radio transmissions, the principle applies equally well to a near-infrared spectrum with 2000 wavelength channels: not all of those channels carry chemically relevant information. Some record only noise, some are redundant with their neighbors, and some are dominated by physical artifacts (scattering, water absorption) rather than the chemistry of interest.
Variable selection and data reduction became a distinct research topic in chemometrics during the 1990s, as instruments grew more powerful and datasets grew larger. An FTIR spectrometer might record 4000 data points per spectrum, and a hyperspectral imaging system might produce hundreds of thousands of variables per pixel. The chemometrics community recognized that throwing all these variables into a PLS or PCA model was not only computationally expensive but often counterproductive — noisy or irrelevant variables degraded model performance rather than improving it. Centner et al. (1996) proposed uninformative variable elimination (UVE), which used a randomization test to identify and remove variables that contributed no more than noise to a PLS model. Norgaard et al. (2000) introduced interval PLS (iPLS), a simple but powerful idea: divide the spectrum into intervals, build separate PLS models on each, and compare their predictive performance to find which spectral regions carry the most information. These methods, along with the earlier concept of Variable Importance in Projection (VIP) developed by Wold et al. (1993) for PLS models, established the modern toolkit for data reduction in chemometrics.
The motivation for data reduction extends beyond computational convenience. Reducing the number of variables can improve model interpretability (a model based on 20 selected wavelengths is easier to understand than one based on 2000), reduce overfitting (fewer variables means fewer degrees of freedom for the model to fit noise), and sometimes improve predictive performance outright. The principle is related to the curse of dimensionality, a term coined by Richard Bellman in 1961: as the number of variables grows relative to the number of samples, the data becomes increasingly sparse in the high-dimensional space, distances between points become less meaningful, and statistical models struggle to generalize. Data reduction is one of the primary defenses against this problem.
Why reduce data?
Before applying a multivariate model, it is worth asking: do I need all of these variables? In spectroscopy, the answer is usually no. A typical NIR spectrum has 1000-2000 wavelength channels, but the underlying chemistry may involve only a handful of absorbing species, each with a few characteristic bands. The rest of the spectrum is either noise, redundant information (neighboring wavelengths that are highly correlated), or regions dominated by physical effects unrelated to the analyte of interest.
There are five main reasons to reduce the number of variables before modeling:
Computational cost. PLS and PCA are fast for moderate datasets, but some applications — hyperspectral imaging, process monitoring with high-frequency acquisition, or multi-way data — produce matrices with hundreds of thousands of variables. Reducing dimensionality makes computation tractable.
Noise reduction. Variables that record mostly noise add noise to the model without adding signal. Including them rarely helps and often hurts: they degrade calibration statistics, inflate prediction errors, and make it harder for the model to find the real patterns. Removing noisy variables is a form of preprocessing that complements smoothing.
Model interpretability. A PLS model built on the full spectrum has loadings at every wavelength, making it hard to identify which chemical features drive the prediction. A model built on selected variables or intervals can be directly interpreted in terms of known absorption bands: “the model uses the C-H stretch region at 1720 nm and the O-H band at 1940 nm to predict fat content.” This kind of interpretability is important for regulatory acceptance, method validation, and scientific understanding.
Avoiding overfitting. When the number of variables far exceeds the number of samples (the p≫n problem), models have too many degrees of freedom and can fit the calibration data perfectly while failing catastrophically on new samples. Reducing p closer to or below n constrains the model and forces it to capture only the most robust patterns.
The curse of dimensionality. In high-dimensional spaces, data points tend to be roughly equidistant from each other, which undermines distance-based methods (kNN, clustering) and makes density estimation unreliable. The volume of the space grows exponentially with dimensionality, so the number of samples needed to adequately cover the space grows exponentially as well. For p=1000 variables, even 10,000 samples may not be enough to populate the space. Reducing p makes the geometry of the data more tractable.
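The distance-concentration effect behind the curse of dimensionality is easy to demonstrate numerically. The sketch below (NumPy, with an arbitrary seed and point count) draws 100 random points in spaces of increasing dimension and measures the relative spread of their pairwise distances:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100  # number of random points

def distance_contrast(p):
    """Relative spread (d_max - d_min) / d_min of pairwise distances
    for n uniform random points in the unit hypercube of dimension p."""
    X = rng.random((n, p))
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    d = np.sqrt(np.clip(d2, 0.0, None))
    d = d[np.triu_indices(n, k=1)]  # off-diagonal distances only
    return (d.max() - d.min()) / d.min()

contrasts = {p: distance_contrast(p) for p in (2, 10, 100, 1000)}
for p, c in contrasts.items():
    print(f"p = {p:5d}: contrast = {c:.2f}")
# As p grows the contrast collapses: all points become roughly
# equidistant, which is what breaks kNN and clustering.
```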
Variable selection
Variable selection means choosing a subset of the original variables (wavelengths, frequencies, mass channels) to include in the model, discarding the rest. The goal is to retain the variables that carry useful chemical information and remove those that contribute only noise or redundancy.
Forward and backward selection
The simplest variable selection strategies borrow from classical statistics:
Forward selection starts with an empty model and adds variables one at a time. At each step, the variable that most improves the model (e.g., the one that produces the largest decrease in cross-validated prediction error) is added. The process stops when no remaining variable improves the model.
Backward elimination starts with all variables included and removes them one at a time. At each step, the variable whose removal causes the least degradation (or the greatest improvement) in model performance is dropped. The process stops when further removal worsens the model.
Both approaches are greedy: they make the locally optimal choice at each step without considering the global picture. Forward selection may miss pairs of variables that are uninformative individually but powerful together. Backward elimination is computationally expensive for high-dimensional spectroscopic data (starting with 2000 variables and evaluating each removal is feasible, but not fast).
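As a concrete illustration, here is a minimal forward-selection loop on synthetic data, using ordinary least squares with leave-one-out cross-validation as the selection criterion (a sketch only: the informative variable indices and noise level are arbitrary choices of the demo, and a spectroscopic workflow would typically use PLS rather than OLS):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 10
X = rng.normal(size=(n, p))
# the response depends on variables 2 and 7 only (an assumption of this demo)
y = 3.0 * X[:, 2] - 2.0 * X[:, 7] + 0.1 * rng.normal(size=n)

def loocv_rmse(cols):
    """Leave-one-out RMSE of ordinary least squares on the chosen columns."""
    errs = []
    for i in range(n):
        train = np.delete(np.arange(n), i)
        A = np.c_[np.ones(len(train)), X[train][:, cols]]
        beta, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        pred = np.r_[1.0, X[i, cols]] @ beta
        errs.append((y[i] - pred) ** 2)
    return np.sqrt(np.mean(errs))

selected, best = [], np.inf
while len(selected) < p:
    scores = {j: loocv_rmse(selected + [j]) for j in range(p) if j not in selected}
    j_best = min(scores, key=scores.get)
    if scores[j_best] >= best:        # stop when no variable improves the model
        break
    selected.append(j_best)
    best = scores[j_best]
print("selected variables:", sorted(selected))
```

The greedy character is visible in the loop: each step commits to the single best variable given what is already selected, with no lookahead.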
In practice, these classical approaches have been largely superseded by spectroscopy-specific methods like iPLS and VIP, which exploit the ordered, continuous nature of spectral data.
Interval PLS (iPLS)
Interval PLS, introduced by Norgaard et al. (2000), is one of the most widely used variable selection methods in spectroscopy. The idea is beautifully simple:
Divide the spectrum into intervals
Split the full spectral range into N equally-sized, non-overlapping intervals. For a spectrum of 1000 wavelengths, you might use 20 intervals of 50 wavelengths each, or 40 intervals of 25 each.
Build a PLS model on each interval
For each interval, build a separate PLS model using only the wavelengths in that interval. Use cross-validation to determine the optimal number of latent variables and to estimate the prediction error (RMSECV).
Compare performance
Plot the RMSECV of each interval model alongside the RMSECV of the full-spectrum model. Intervals that outperform the full-spectrum model contain particularly informative wavelengths. Intervals with much worse performance contain noise or irrelevant information.
Select and combine
Combine the best-performing intervals into a single model. This reduced model uses only the informative spectral regions, typically achieving equal or better prediction than the full-spectrum model.
The power of iPLS lies in its interpretability. The bar plot of interval RMSECV values is a visual map of where the useful information resides in the spectrum. Analysts can relate the selected intervals to known absorption bands, providing chemical justification for the variable selection.
Synergy iPLS (siPLS) extends the idea by testing combinations of 2, 3, or 4 intervals simultaneously, searching for synergistic effects where two intervals together outperform either one alone. This is more computationally expensive but can discover complementary spectral regions that a single-interval analysis would miss.
Variable Importance in Projection (VIP)
VIP scores quantify the contribution of each variable to a PLS model. Originally described by Wold et al. (1993) and formalized by Chong and Jun (2005), VIP is computed from the PLS weights and the variance explained by each latent variable:
$$\mathrm{VIP}_j = \sqrt{\frac{p \sum_{a=1}^{A} q_a^2\, w_{aj}^2}{\sum_{a=1}^{A} q_a^2}}$$
where p is the total number of variables, A is the number of PLS components, qa is the regression weight of component a on the response, and waj is the weight of variable j in component a .
The VIP score is normalized so that the average of squared VIP values across all variables equals 1. The standard threshold is:
VIP_j > 1: Variable j is considered important (above-average contribution)
VIP_j < 0.5: Variable j is a candidate for removal (well below average contribution)
0.5 ≤ VIP_j ≤ 1: Gray zone, depends on context
VIP scores are easy to compute from an existing PLS model and provide a per-variable importance measure that can be plotted against the wavelength axis. Peaks in the VIP plot correspond to wavelengths that drive the prediction, and these can often be matched to known absorption bands.
Competitive Adaptive Reweighted Sampling (CARS)
CARS, proposed by Li et al. (2009), takes a different approach: it runs PLS repeatedly, using Monte Carlo sampling to select subsets of samples and an adaptive reweighting scheme to progressively eliminate unimportant variables. At each iteration, variables with small PLS regression coefficients are down-weighted and eventually removed. The process mimics natural selection — variables “compete” for inclusion and only the fittest survive.
CARS is more computationally intensive than VIP but often produces sparser models (fewer selected variables) with comparable or better predictive performance. It is particularly popular in the NIR spectroscopy community for applications like food quality analysis and pharmaceutical monitoring.
Binning and averaging
Binning is the simplest form of dimensionality reduction: group adjacent data points and replace each group with a single value (typically the mean). A spectrum of 2000 wavelengths binned by a factor of 4 becomes a spectrum of 500 values.
The formula for binning with bin size b is:
$$y_k^{\mathrm{binned}} = \frac{1}{b} \sum_{i=(k-1)b+1}^{kb} y_i, \qquad k = 1, 2, \ldots, \lfloor p/b \rfloor$$
where p is the original number of variables and y_i is the value at variable i.
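In code, binning is a reshape and a mean (NumPy sketch; trailing points that do not fill a complete bin are dropped, matching the floor in the upper limit):

```python
import numpy as np

def bin_spectrum(y, b):
    """Replace each group of b adjacent points with its mean."""
    k = len(y) // b                      # number of complete bins
    return y[:k * b].reshape(k, b).mean(axis=1)

y = np.arange(12, dtype=float)           # toy "spectrum", p = 12
print(bin_spectrum(y, 4))                # → [1.5 5.5 9.5]
```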
Binning reduces dimensionality and suppresses high-frequency noise (each binned value is an average of b adjacent points, which reduces the standard deviation of uncorrelated noise by a factor of √b). It is appropriate when:
The spectral resolution of the instrument is much higher than needed for the application (you are oversampled)
Spectral features are broad relative to the sampling interval
You want a quick, assumption-free way to reduce data size
Binning is not appropriate when:
Spectral features are sharp (Raman peaks, high-resolution IR) and binning would blur them
The spectral resolution is already matched to the feature widths
You need to preserve the exact positions and shapes of peaks
PCA for compression
PCA is not just an exploratory tool — it can also serve as a data compression method. Instead of using the original p variables, you use the first A principal component scores as a compressed representation. If A≪p , this is a dramatic reduction in dimensionality.
The original data matrix X (dimensions n×p ) is decomposed as:
X=TP⊤+E
where T is the n×A score matrix, P is the p×A loading matrix, and E is the residual matrix. Using only the scores T as input to subsequent models (regression, classification, clustering) reduces the dimensionality from p to A .
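A minimal sketch of PCA-as-compression via the SVD, on synthetic rank-3 data (the sizes, rank, and noise level are assumptions of the demo):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, A = 50, 200, 3
# synthetic data: 3 underlying components plus a little noise
X = rng.normal(size=(n, 3)) @ rng.normal(size=(3, p)) \
    + 0.01 * rng.normal(size=(n, p))

Xc = X - X.mean(axis=0)               # column-centre before PCA
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U[:, :A] * s[:A]                  # n x A score matrix (the compressed data)
P = Vt[:A].T                          # p x A loading matrix
print(T.shape, P.shape)               # 200 variables reduced to 3 scores
```

Downstream models (regression, classification, clustering) then take T instead of X as input.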
Choosing the number of components
The number of components A determines the compression ratio and the information retained. Too few components discard useful variance; too many retain noise. Common approaches for choosing A :
Explained variance. Plot cumulative explained variance as a function of the number of components. Choose A where the curve levels off. For typical spectroscopic data, 3-10 components capture 95-99% of the variance.
Cross-validation. If the scores will be used in a regression model (PCA followed by regression is called principal component regression, or PCR), choose A to minimize cross-validated prediction error. This directly optimizes the downstream task.
Scree plot. Plot the eigenvalues (or singular values) in descending order. Look for an “elbow” where the values drop from clearly meaningful to roughly flat. Components before the elbow represent signal; those after represent noise.
Reconstruction error
The quality of the PCA compression can be measured by the reconstruction error — how well the original data can be recovered from the compressed representation:
$$\mathrm{RMSE}_{\mathrm{recon}} = \sqrt{\frac{1}{np} \sum_{i=1}^{n} \sum_{j=1}^{p} \left( x_{ij} - \hat{x}_{ij} \right)^2}$$
where $\hat{x}_{ij} = \sum_{a=1}^{A} t_{ia} p_{ja}$ is the reconstructed value. A low reconstruction error means the compression preserves most of the information in the data. For spectroscopic data with clear chemical structure, 3-5 components often achieve reconstruction errors comparable to the instrumental noise level, meaning the discarded components contain mostly noise.
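A sketch of the reconstruction-error calculation on synthetic rank-4 data (sizes, rank, and the 0.02 noise level are assumptions of the demo); the error levels off near the noise level once A reaches the true rank:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, noise_level = 40, 150, 0.02
X = rng.normal(size=(n, 4)) @ rng.normal(size=(4, p)) \
    + noise_level * rng.normal(size=(n, p))

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

def rmse_recon(A):
    """RMSE between the (centred) data and its rank-A reconstruction T P^T."""
    Xhat = (U[:, :A] * s[:A]) @ Vt[:A]
    return float(np.sqrt(np.mean((Xc - Xhat) ** 2)))

for A in (1, 2, 4, 8):
    print(f"A = {A}: RMSE_recon = {rmse_recon(A):.4f}")
```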
Wavelength range selection
The simplest and often most impactful form of data reduction is to exclude spectral regions that are known to be uninformative. This requires domain knowledge but is straightforward and risk-free when the excluded regions are genuinely useless.
Common exclusions in NIR spectroscopy
NIR spectroscopy is the area where wavelength range selection is most routinely applied, because certain spectral regions are dominated by water absorption that overwhelms any analyte signal:
1400-1500 nm: The first overtone of the O-H stretch. In aqueous samples or samples with significant moisture, this region saturates the detector and carries no analyte information.
1900-2000 nm: The combination band of O-H stretch and bend. Again, water dominates this region, and the signal is often clipped or noisy.
Detector edges: The extremes of the wavelength range (typically below 900 nm and above 2500 nm for InGaAs detectors) often have lower sensitivity and higher noise. Excluding the noisiest 50-100 nm at each edge is common practice.
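Excluding such regions amounts to a boolean mask over the wavelength axis. The sketch below assumes a hypothetical 900-2500 nm axis at 2 nm resolution and random placeholder spectra; the cut-offs are the ones listed above:

```python
import numpy as np

wl = np.arange(900.0, 2500.0, 2.0)                       # hypothetical wavelength axis
X = np.random.default_rng(5).normal(size=(10, wl.size))  # placeholder spectra

keep = np.ones(wl.size, dtype=bool)
keep &= ~((wl >= 1400) & (wl <= 1500))   # O-H first overtone (water)
keep &= ~((wl >= 1900) & (wl <= 2000))   # O-H combination band (water)
keep &= (wl >= 950) & (wl <= 2450)       # trim noisy detector edges

X_sel, wl_sel = X[:, keep], wl[keep]
print(X.shape, "->", X_sel.shape)
```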
When to exclude regions
Wavelength range selection should be guided by:
Physical knowledge. Known absorbers that interfere with the analyte (water bands, CO2 bands in mid-IR, atmospheric interference in remote sensing) can be excluded a priori.
Visual inspection. Plot all spectra overlaid. Regions with very high noise, detector saturation (flat tops), or no variation between samples are candidates for exclusion.
Model diagnostics. Loadings or regression coefficients that are large in physically nonsensical regions suggest that the model is fitting noise or artifacts. Excluding those regions and rebuilding may improve robustness.
Code implementation
The following examples demonstrate two key data reduction techniques: iPLS-style interval selection and VIP calculation from a PLS model.
When to use data reduction
High-dimensional data with few samples. If you have 2000 wavelengths and 50 samples, reducing variables dramatically reduces the risk of overfitting.
Noisy spectral regions. Excluding or down-weighting regions dominated by noise improves model robustness.
Need for interpretability. Selected variables can be linked to specific chemical absorptions, making the model scientifically meaningful and easier to validate.
Computational constraints. Hyperspectral imaging, process monitoring at high frequency, and real-time applications benefit from smaller, faster models.
When to be cautious
Risk of discarding information. Aggressive variable selection may remove variables that carry subtle but real chemical information. Always validate on independent test data.
Overfitting the selection. If variable selection is optimized on the same data used for model calibration (without proper cross-validation), the selected variables overfit the training data. Use double cross-validation: an outer loop for model assessment and an inner loop for variable selection.
PLS already reduces dimensions. PLS extracts a small number of latent variables from many original variables. For well-conditioned datasets, PLS on the full spectrum may already give excellent results, and variable selection adds complexity without benefit.
Loss of transferability. A model based on selected variables may be more sensitive to instrument drift or sample matrix changes than a full-spectrum model, because it relies on fewer data points to anchor the prediction.
References
[1] Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379-423.
[2] Norgaard, L., Saudland, A., Wagner, J., Nielsen, J. P., Munck, L., & Engelsen, S. B. (2000). Interval partial least-squares regression (iPLS): A comparative chemometric study with an example from near-infrared spectroscopy. Applied Spectroscopy, 54(3), 413-419.
[3] Centner, V., Massart, D. L., de Noord, O. E., de Jong, S., Vandeginste, B. M., & Sterna, C. (1996). Elimination of uninformative variables for multivariate calibration. Analytical Chemistry, 68(21), 3851-3858.
[4] Chong, I.-G., & Jun, C.-H. (2005). Performance of some variable selection methods when multicollinearity is present. Chemometrics and Intelligent Laboratory Systems, 78(1-2), 103-112.
[5] Wold, S., Johansson, E., & Cocchi, M. (1993). PLS — Partial least squares projections to latent structures. In H. Kubinyi (Ed.), 3D QSAR in Drug Design: Theory, Methods and Applications (pp. 523-550). ESCOM.
[6] Li, H., Liang, Y., Xu, Q., & Cao, D. (2009). Key wavelengths screening using competitive adaptive reweighted sampling method for multivariate calibration. Analytica Chimica Acta, 648(1), 77-84.
[7] Bellman, R. (1961). Adaptive Control Processes: A Guided Tour. Princeton University Press.
[8] Mehmood, T., Liland, K. H., Snipen, L., & Saebo, S. (2012). A review of variable selection methods in partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 118, 62-69.
[9] Rinnan, A., van den Berg, F., & Engelsen, S. B. (2009). Review of the most common pre-processing techniques for near-infrared spectra. TrAC Trends in Analytical Chemistry, 28(10), 1201-1222.