In 1996, Robert Tibshirani, then a statistician at the University of Toronto, published a paper in the Journal of the Royal Statistical Society, Series B that introduced what he called the Least Absolute Shrinkage and Selection Operator — or LASSO [1]. The acronym was deliberately playful, evoking a cowboy’s lasso that ropes in only the variables that matter and lets the rest fall away. But the mathematics behind it was deeply serious, and it would transform how scientists build regression models in high-dimensional settings.
Tibshirani did not work in isolation. He was profoundly influenced by the Stanford statistics group — particularly Bradley Efron, who pioneered the bootstrap, and Trevor Hastie, who had developed generalized additive models. In fact, Tibshirani had been Efron’s doctoral student before moving to Toronto, and would later return to Stanford as a professor. The intellectual environment at Stanford during the 1990s was uniquely fertile for this kind of work: statisticians there were grappling with datasets where the number of variables far exceeded the number of observations, a situation that classical regression simply could not handle.
The key insight behind LASSO was deceptively simple. Ridge regression, which Hoerl and Kennard had proposed in 1970 [2], already showed that adding a penalty to the regression coefficients could stabilize ill-conditioned problems. But ridge used an L2 penalty (sum of squared coefficients), which shrank coefficients toward zero without ever reaching it. Tibshirani’s innovation was to replace the L2 penalty with an L1 penalty (sum of absolute values). This one change had a dramatic consequence: the L1 penalty not only shrinks coefficients, it forces some of them to exactly zero. The model performs automatic variable selection — it decides which predictors matter and which do not, all within the fitting process itself.
For chemometrics, this was a revelation. A near-infrared spectrum might contain 2000 wavelengths, but only a handful carry real chemical information about the analyte of interest. LASSO could, in principle, identify exactly which wavelengths matter and discard the rest, producing sparse, interpretable models that traditional methods like PLS could not.
The variable selection problem
Consider a typical NIR calibration scenario. You measure the spectrum of a pharmaceutical tablet across 1000 wavelengths and want to predict the concentration of the active ingredient. You have 200 calibration samples. This gives you a data matrix X with 200 rows and 1000 columns, and a response vector y with 200 concentration values.
Ordinary least squares (OLS) regression would try to find a coefficient for every wavelength:
ŷ = Xβ
But with 1000 variables and only 200 samples, the system is underdetermined. There are infinitely many solutions, and OLS cannot single out a unique one. Even if you had more samples than variables, many of those 1000 wavelengths are highly correlated (neighboring wavelengths in a spectrum measure nearly the same thing), making XᵀX nearly singular.
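A quick numerical check of this claim, using synthetic data in place of real spectra (a sketch assuming NumPy; the shapes mirror the scenario above):

```python
import numpy as np

# Synthetic stand-in for the NIR scenario: 200 samples, 1000 wavelengths.
rng = np.random.default_rng(0)
n, p = 200, 1000
X = rng.normal(size=(n, p))

# rank(X^T X) = rank(X) <= min(n, p) = 200, so the 1000 x 1000 normal-equation
# matrix X^T X cannot have full rank and OLS has no unique solution.
rank = np.linalg.matrix_rank(X)
print(rank)   # 200 — far short of the 1000 needed for X^T X to be invertible
```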
Ridge regression handles this by adding an L2 penalty that stabilizes the solution. PLS and PCR handle it by reducing dimensionality first. But none of these methods answers a question that chemists often care about:
Which specific wavelengths are actually relevant for predicting this analyte?
Ridge keeps all 1000 wavelengths in the model, just with smaller coefficients. PLS uses linear combinations of all wavelengths. Neither gives you a short list of “these 15 wavelengths are the ones that matter.”
This is where LASSO enters the picture.
The LASSO idea
LASSO modifies the ordinary least squares objective by adding a penalty on the absolute values of the coefficients. The optimization problem is:

min_β  (1/(2n)) ‖y − Xβ‖₂² + λ‖β‖₁

Data fidelity term: (1/(2n)) ‖y − Xβ‖₂², the usual sum of squared residuals, which wants the model to fit the data as closely as possible.
L1 penalty: λ‖β‖₁ = λ ∑ⱼ₌₁ᵖ |βⱼ|, the sum of the absolute values of all coefficients, which wants the coefficients to be small.
The parameter λ ≥ 0 controls the balance. When λ = 0, there is no penalty and you recover OLS. As λ increases, the penalty becomes more aggressive: coefficients shrink, and eventually some are forced to exactly zero. At a sufficiently large λ, all coefficients are zero and the model predicts the mean of y for every sample.
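The objective can be transcribed directly. The function below is an illustrative sketch for evaluating it, not a fitting routine (names are ours):

```python
import numpy as np

def lasso_objective(beta, X, y, lam):
    """(1/(2n)) * ||y - X beta||_2^2  +  lam * ||beta||_1"""
    n = len(y)
    resid = y - X @ beta
    return resid @ resid / (2 * n) + lam * np.sum(np.abs(beta))

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))
beta = rng.normal(size=5)
y = X @ beta                     # noise-free, so beta fits perfectly

# lam = 0 recovers the pure least-squares criterion (zero here, perfect fit);
# increasing lam adds lam * ||beta||_1 on top of the same fit.
assert lasso_objective(beta, X, y, 0.0) == 0.0
assert np.isclose(lasso_objective(beta, X, y, 0.5), 0.5 * np.abs(beta).sum())
```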
Why L1 gives sparsity
This is the central question: why does replacing ∑ βⱼ² (L2, ridge) with ∑ |βⱼ| (L1, LASSO) produce coefficients that are exactly zero? The answer has an elegant geometric explanation.
The constraint region perspective
The penalized problem can be rewritten as a constrained optimization:
min_β ‖y − Xβ‖₂²   subject to   ∑ⱼ₌₁ᵖ |βⱼ| ≤ t

for some threshold t that corresponds to the value of λ. This is mathematically equivalent to the penalized form; the Lagrangian connects them.
Now consider the geometry in two dimensions (two coefficients, β₁ and β₂):

The residual sum of squares ‖y − Xβ‖₂² forms elliptical contours centered at the OLS solution. These are the familiar error ellipses: each contour represents all (β₁, β₂) pairs that give the same residual sum of squares.
For ridge (L2), the constraint β₁² + β₂² ≤ t defines a circle (a sphere in higher dimensions).
For LASSO (L1), the constraint |β₁| + |β₂| ≤ t defines a diamond (a cross-polytope in higher dimensions).
The solution is the point where the expanding ellipses first touch the constraint region. Here is the critical difference:
A circle is smooth everywhere. The ellipses almost always touch it at a point where both β₁ and β₂ are nonzero.
A diamond has corners on the axes. The ellipses are much more likely to first touch the diamond at a corner, where one coefficient is exactly zero.
In higher dimensions, the L1 cross-polytope has corners on every coordinate axis, plus edges and faces along which several coefficients are zero at once, so the chance that the solution lands on one of these features, setting one or more coefficients exactly to zero, grows. This is why LASSO produces sparse solutions: the geometry of the L1 constraint naturally pushes the solution toward the axes.
The subgradient perspective
Another way to see why L1 gives exact zeros comes from calculus. The L2 penalty βⱼ² has derivative 2βⱼ, which shrinks in proportion to βⱼ itself. The penalty therefore pushes coefficients toward zero with ever-decreasing force; it never quite gets them there.
The L1 penalty |βⱼ| has a constant derivative of ±1 everywhere except at zero, where it is not differentiable. The penalty therefore pushes with constant force regardless of how small the coefficient is. When the data’s pull toward a nonzero value is weaker than λ, the L1 penalty wins and the coefficient snaps to exactly zero. This is the soft-thresholding behavior of LASSO: coefficients below the threshold become exactly zero, and the survivors are shrunk toward zero by a constant amount.
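In the simplest (orthogonal-design) case this push-and-snap behavior is exactly the soft-thresholding operator. A minimal sketch contrasting it with L2-style rescaling (names are illustrative):

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink z toward zero by lam, snapping anything with |z| <= lam to exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([-2.0, -0.3, 0.1, 0.8, 3.0])

# L1: entries weaker than the threshold become exactly zero,
# the rest move toward zero by a constant lam = 0.5.
l1 = soft_threshold(z, 0.5)

# L2-style shrinkage only rescales; no entry ever reaches exact zero.
l2 = z / (1 + 0.5)

assert np.count_nonzero(l1) == 3   # two entries snapped to zero
assert np.count_nonzero(l2) == 5   # ridge keeps everything
```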
The solution path
As λ decreases from a large value toward zero, the LASSO solution traces a path through coefficient space. This regularization path is one of the most informative diagnostics available.
Start with large λ. All coefficients are zero; the model predicts the mean of y.

Decrease λ slightly. The first variable enters the model, the one most strongly correlated with y. Its coefficient grows from zero.

Continue decreasing λ. More variables enter the model one at a time (or occasionally in small groups). Existing coefficients adjust, and some may shrink back to zero and drop out.

Reach λ = 0. All variables have potentially nonzero coefficients. If n > p, this is the OLS solution.
A key property of the LASSO path is that it is piecewise linear [3]: the coefficients change as linear functions of λ between breakpoints where variables enter or leave the model. The LARS (Least Angle Regression) algorithm by Efron, Hastie, Johnstone, and Tibshirani exploits this property to compute the entire path efficiently, at roughly the computational cost of a single OLS fit.
Plotting the regularization path is a standard diagnostic. You plot each coefficient as a function of λ (or equivalently, the L1 norm of the coefficient vector). Variables that enter the model early and maintain large coefficients are the most important predictors. Variables that enter late or with small coefficients are marginal.
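For illustration, the path can be traced with a bare-bones cyclic coordinate descent on synthetic data. This is a sketch; in practice one would use glmnet or scikit-learn, which add standardization, warm starts, and convergence checks:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/(2n))||y - Xb||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n      # (1/n) x_j^T x_j for each column
    r = y.astype(float).copy()             # running residual y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            # partial correlation of x_j with the residual, with beta_j added back
            rho = X[:, j] @ r / n + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r += X[:, j] * (beta[j] - new)  # keep the residual in sync
            beta[j] = new
    return beta

# Hypothetical sparse ground truth: only 3 of 30 variables matter.
rng = np.random.default_rng(42)
n, p = 100, 30
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[[2, 7, 19]] = [3.0, -2.0, 1.5]
y = X @ true_beta + 0.1 * rng.normal(size=n)

# Trace the path over a decreasing grid of lambdas and count active variables.
lams = np.geomspace(3.0, 0.01, 20)
path = np.array([lasso_cd(X, y, lam) for lam in lams])
n_active = (np.abs(path) > 1e-8).sum(axis=1)
print(n_active)   # grows as lambda decreases: variables enter the model
```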
Choosing lambda
The regularization parameter λ controls the sparsity of the model. Too large, and the model is too simple (underfitting). Too small, and the model includes too many variables (overfitting). The standard approach for selecting λ is cross-validation.
K-fold cross-validation
Split the data into K folds (typically K = 5 or K = 10)
For each candidate lambda, fit LASSO on K−1 folds and predict on the held-out fold
Compute the prediction error (usually mean squared error, MSE) for each fold
Average the error across all folds for each λ
Choose lambda that minimizes the cross-validated error
Most software computes the cross-validated error curve automatically. A typical plot shows MSE on the y-axis and log(λ) on the x-axis. The curve is U-shaped: high error at very large λ (too few variables), decreasing as useful variables enter, then increasing again at very small λ (overfitting).
Two common choices are reported:
λ_min: the λ that minimizes the cross-validated error
λ_1SE: the largest λ whose error is within one standard error of the minimum. This gives a sparser model with nearly the same predictive performance, and is often preferred in practice.
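Given the per-fold error matrix that a CV routine produces, both choices reduce to a few lines. The numbers below are a hand-made stand-in for a real CV run (rows are folds, columns follow a decreasing λ grid):

```python
import numpy as np

# Hand-made stand-in for a 5-fold CV error matrix on a grid of 8 lambdas.
lams = np.geomspace(1.0, 0.001, 8)                             # decreasing grid
base = np.array([2.0, 1.2, 0.8, 0.61, 0.60, 0.63, 0.7, 0.9])   # U-shaped curve
fold_shift = np.array([-0.06, -0.03, 0.0, 0.03, 0.06])         # per-fold offsets
cv_err = base[None, :] + fold_shift[:, None]                   # shape (5, 8)

mean = cv_err.mean(axis=0)
se = cv_err.std(axis=0, ddof=1) / np.sqrt(cv_err.shape[0])     # standard error

i_min = int(np.argmin(mean))
lam_min = lams[i_min]

# One-SE rule: the largest lambda whose mean error is within one SE of the
# minimum. The grid is decreasing, so that is the first index meeting it.
i_1se = int(np.argmax(mean <= mean[i_min] + se[i_min]))
lam_1se = lams[i_1se]
print(lam_min, lam_1se)   # lam_1se >= lam_min: a sparser model, similar error
```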
Comparison with ridge regression
LASSO and ridge regression both add a penalty to OLS, but the penalties produce fundamentally different behavior.
| Property | Ridge (L2) | LASSO (L1) |
| --- | --- | --- |
| Penalty | Sum of squared coefficients | Sum of absolute values of coefficients |
| Constraint shape | Circle / sphere | Diamond / cross-polytope |
| Sparsity | No: all coefficients nonzero | Yes: many coefficients exactly zero |
| Variable selection | No | Yes (automatic) |
| Correlated variables | Distributes weight evenly | Picks one, zeros the others |
| Number of selected variables | Always p | At most min(n, p) |
| Solution uniqueness | Always unique | May not be unique when p > n |
| Computation | Closed-form solution | Requires an iterative algorithm |
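The closed-form versus iterative distinction is easy to verify for ridge. The sketch below uses one common scaling convention for the penalty (conventions differ across packages), on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:2] = [2.0, -1.0]
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Ridge closed form: (X^T X + n*lam*I)^{-1} X^T y  (one common scaling).
lam = 0.5
beta_ridge = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

# Matching the table: every coefficient is shrunk but none is exactly zero.
print(np.count_nonzero(beta_ridge))   # 10 — ridge never zeroes exactly
```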
When ridge wins
Ridge regression tends to outperform LASSO when:
Most variables are relevant. If many wavelengths carry small but real information, ridge’s strategy of keeping all of them with small coefficients is more appropriate than LASSO’s strategy of discarding most.
Variables are highly correlated. When a group of wavelengths is correlated (the norm in spectroscopy), LASSO arbitrarily selects one from the group and zeros the rest. This can be unstable: a small change in the data can switch which variable is selected. Ridge distributes the coefficient weight across the group, producing more stable predictions.
When LASSO wins
LASSO tends to outperform ridge when:
Only a few variables are truly relevant. If the signal is concentrated in a handful of wavelengths, LASSO’s ability to identify and keep only those variables gives it better prediction and interpretability.
Interpretability matters. A model with 15 nonzero coefficients is easier to understand and validate than one with 1000.
You need a compact model. For deployment on simple hardware (e.g., a handheld sensor with limited wavelengths), LASSO identifies which wavelengths to measure.
The elastic net compromise
When neither pure ridge nor pure LASSO is ideal — for example, when you have groups of correlated variables and want sparsity — the elastic net combines both penalties [4]:
min_β { (1/(2n)) ‖y − Xβ‖₂² + λ [ α‖β‖₁ + ((1 − α)/2) ‖β‖₂² ] }

The mixing parameter α ∈ [0, 1] controls the blend: α = 1 is pure LASSO, α = 0 is pure ridge. A typical choice is α = 0.5, which gives the “grouped selection” property: correlated variables tend to enter or leave the model together.
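Under this objective, the coordinate-wise update has a closed form: soft-threshold by λα, then shrink by the ridge factor. A minimal sketch (function name is ours):

```python
import numpy as np

def enet_prox(z, lam, alpha):
    """Closed-form minimiser of (1/2)(b - z)^2 + lam*(alpha*|b| + ((1-alpha)/2)*b^2):
    soft-threshold by lam*alpha, then divide by the ridge shrinkage factor."""
    shrunk = np.sign(z) * np.maximum(np.abs(z) - lam * alpha, 0.0)
    return shrunk / (1.0 + lam * (1.0 - alpha))

z = np.array([-1.0, 0.2, 2.0])
out = enet_prox(z, lam=1.0, alpha=0.5)   # threshold 0.5, then divide by 1.5
print(out)

# The two limits recover the pure penalties:
assert np.allclose(enet_prox(np.array([2.0]), 1.0, 1.0), [1.0])   # alpha=1: LASSO
assert np.allclose(enet_prox(np.array([2.0]), 1.0, 0.0), [1.0])   # alpha=0: ridge, 2/(1+1)
```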
When to use LASSO

Wavelength selection for NIR/MIR calibration: identify which spectral regions carry analyte information.
Compact sensor design: select a small number of wavelengths for dedicated filter instruments.
Biomarker discovery: find which metabolites (from LC-MS or NMR) predict a clinical outcome.
Sparse models from large variable sets: any situation where you believe only a few variables are truly relevant.
Consider alternatives

Highly correlated spectra (most NIR/IR): LASSO picks one wavelength from a correlated group and ignores the rest, producing unstable selections. Use elastic net or PLS instead.
When all variables contribute: if the signal is spread across the entire spectrum, LASSO’s enforced sparsity hurts. Ridge or PLS is more appropriate.
Small datasets: LASSO’s cross-validation needs enough data for reliable error estimation. With very few samples, consider PLS with leave-one-out CV.
When you need the best prediction, not interpretation: PLS or ridge often predicts as well or better, especially in spectroscopy, where sparsity is not the true data-generating mechanism.
Practical tips
Preprocessing matters. LASSO does not eliminate the need for spectral preprocessing. Apply standard corrections (baseline, scatter, smoothing) before fitting. LASSO selects variables, not preprocessing steps.
Check the path, not just the final model. The regularization path reveals which variables enter first and how stable the selection is. A variable that enters the model early, at large λ, and stays is more trustworthy than one that appears only at small λ.
Stability selection. Run LASSO on many bootstrap subsamples and track how often each variable is selected [5]. Variables selected in more than 70–80% of runs are robust; those selected sporadically are likely noise.
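A bare-bones version of the procedure, as an illustrative sketch: a simple proximal-gradient (ISTA) LASSO refit on random half-samples of synthetic data, counting how often each variable survives:

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=300):
    """Minimise (1/(2n))||y - Xb||^2 + lam*||b||_1 by proximal gradient (ISTA)."""
    n, p = X.shape
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)   # 1 / Lipschitz constant
    beta = np.zeros(p)
    for _ in range(n_iter):
        z = beta + step * X.T @ (y - X @ beta) / n  # gradient step
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # prox step
    return beta

# Hypothetical setup: 3 of 25 variables carry signal.
rng = np.random.default_rng(7)
n, p = 120, 25
X = rng.normal(size=(n, p))
true = np.zeros(p)
true[[1, 5, 11]] = [2.5, -2.0, 1.5]
y = X @ true + 0.2 * rng.normal(size=n)

# Refit on random half-samples and count selections per variable.
B = 50
counts = np.zeros(p)
for _ in range(B):
    idx = rng.choice(n, size=n // 2, replace=False)
    b = lasso_ista(X[idx], y[idx], lam=0.2)
    counts += np.abs(b) > 1e-8
freq = counts / B
stable = np.where(freq >= 0.7)[0]
print(stable)   # the consistently selected variables
```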
Beware the grouping effect. When predictors are correlated (neighboring wavelengths always are), LASSO tends to select one from the group arbitrarily. Elastic net (α ≈ 0.5) handles this better by selecting groups together.
Compare against PLS. In spectroscopy, PLS regression remains the workhorse. Before reporting LASSO results, compare prediction performance against PLS with the optimal number of components. LASSO is most valuable when it matches PLS in predictive accuracy while also providing interpretable variable selection.
Next steps
LASSO opens the door to a family of penalized regression methods:
Elastic net combines L1 and L2 penalties, gaining the grouped selection property that LASSO lacks. It is often the preferred choice in spectroscopy [4].
Group LASSO extends the idea to predefined groups of variables (e.g., spectral regions), selecting or discarding entire groups at once [6].
Adaptive LASSO uses data-dependent weights on the L1 penalty to achieve consistent variable selection under milder conditions [7].
Sparse PLS combines the dimensionality reduction of PLS with the variable selection of LASSO, directly addressing the chemometric setting [8].
For most spectroscopic calibration tasks, the practical recommendation is to start with PLS (the chemometric standard), use LASSO or elastic net when variable selection is the primary goal, and compare prediction performance across methods on an independent test set.
References
[1] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1), 267–288.
[2] Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1), 55–67.
[3] Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32(2), 407–499.
[4] Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2), 301–320.
[5] Meinshausen, N., & Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society, Series B, 72(4), 417–473.
[6] Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1), 49–67.
[7] Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429.
[8] Chun, H., & Keleş, S. (2010). Sparse partial least squares regression for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society, Series B, 72(1), 3–25.