
LASSO Regression

In 1996, Robert Tibshirani, then a statistician at the University of Toronto, published a paper in the Journal of the Royal Statistical Society, Series B that introduced what he called the Least Absolute Shrinkage and Selection Operator — or LASSO [1]. The acronym was deliberately playful, evoking a cowboy’s lasso that ropes in only the variables that matter and lets the rest fall away. But the mathematics behind it was deeply serious, and it would transform how scientists build regression models in high-dimensional settings.

Tibshirani did not work in isolation. He was profoundly influenced by the Stanford statistics group — particularly Bradley Efron, who pioneered the bootstrap, and Trevor Hastie, who had developed generalized additive models. In fact, Tibshirani had been Efron’s doctoral student before moving to Toronto, and would later return to Stanford as a professor. The intellectual environment at Stanford during the 1990s was uniquely fertile for this kind of work: statisticians there were grappling with datasets where the number of variables far exceeded the number of observations, a situation that classical regression simply could not handle.

The key insight behind LASSO was deceptively simple. Ridge regression, which Hoerl and Kennard had proposed in 1970 [2], already showed that adding a penalty to the regression coefficients could stabilize ill-conditioned problems. But ridge used an L2 penalty (sum of squared coefficients), which shrank coefficients toward zero without ever reaching it. Tibshirani’s innovation was to replace the L2 penalty with an L1 penalty (sum of absolute values). This one change had a dramatic consequence: the L1 penalty not only shrinks coefficients, it forces some of them to exactly zero. The model performs automatic variable selection — it decides which predictors matter and which do not, all within the fitting process itself.

For chemometrics, this was a revelation. A near-infrared spectrum might contain 2000 wavelengths, but only a handful carry real chemical information about the analyte of interest. LASSO could, in principle, identify exactly which wavelengths matter and discard the rest, producing sparse, interpretable models that traditional methods like PLS could not.

The variable selection problem

Consider a typical NIR calibration scenario. You measure the spectrum of a pharmaceutical tablet across 1000 wavelengths and want to predict the concentration of the active ingredient. You have 200 calibration samples. This gives you a data matrix with 200 rows and 1000 columns, and a response vector with 200 concentration values.

Ordinary least squares (OLS) regression would try to find a coefficient for every wavelength:

    ŷ = β₀ + β₁x₁ + β₂x₂ + ⋯ + β₁₀₀₀x₁₀₀₀

But with 1000 variables and only 200 samples, the system is underdetermined. There are infinitely many solutions, and OLS cannot find a unique one. Even if you had more samples than variables, many of those 1000 wavelengths are highly correlated (neighboring wavelengths in a spectrum measure nearly the same thing), making XᵀX nearly singular.

Ridge regression handles this by adding an L2 penalty that stabilizes the solution. PLS and PCR handle it by reducing dimensionality first. But none of these methods answers a question that chemists often care about:

Which specific wavelengths are actually relevant for predicting this analyte?

Ridge keeps all 1000 wavelengths in the model, just with smaller coefficients. PLS uses linear combinations of all wavelengths. Neither gives you a short list of “these 15 wavelengths are the ones that matter.”

This is where LASSO enters the picture.

The LASSO idea

LASSO modifies the ordinary least squares objective by adding a penalty on the absolute values of the coefficients. The optimization problem is:

    β̂ = argmin_β Σᵢ (yᵢ − β₀ − Σⱼ xᵢⱼβⱼ)² + λ Σⱼ |βⱼ|

Or equivalently, in matrix notation:

    β̂ = argmin_β ‖y − Xβ‖₂² + λ‖β‖₁

The objective has two competing parts:

  • Data fidelity term: ‖y − Xβ‖₂² — the usual sum of squared residuals, which wants the model to fit the data as closely as possible
  • L1 penalty: λ‖β‖₁ = λ Σⱼ |βⱼ| — the sum of the absolute values of all coefficients, which wants the coefficients to be small

The parameter λ controls the balance. When λ = 0, there is no penalty and you recover OLS. As λ increases, the penalty becomes more aggressive: coefficients shrink, and eventually some are forced to exactly zero. At a sufficiently large λ, all coefficients are zero and the model predicts the mean of y for every sample.
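Both extremes are easy to verify with scikit-learn (which calls the penalty parameter alpha rather than lambda). A minimal sketch on simulated data:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(100)

# Very large penalty: every coefficient is forced to zero,
# and the model falls back to predicting the mean of y
big = Lasso(alpha=100.0).fit(X, y)
print(big.coef_)                               # all zeros
print(np.allclose(big.predict(X), y.mean()))   # True

# Tiny penalty: the fit approaches ordinary least squares
small = Lasso(alpha=1e-6, max_iter=100_000).fit(X, y)
ols = LinearRegression().fit(X, y)
print(np.allclose(small.coef_, ols.coef_, atol=1e-3))  # True
```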

Why L1 gives sparsity

This is the central question: why does replacing Σⱼ βⱼ² (L2, ridge) with Σⱼ |βⱼ| (L1, LASSO) produce coefficients that are exactly zero? The answer has an elegant geometric explanation.

The constraint region perspective

The penalized problem can be rewritten as a constrained optimization:

    minimize ‖y − Xβ‖₂²  subject to  Σⱼ |βⱼ| ≤ t

for some threshold t that corresponds to the value of λ. This is mathematically equivalent to the penalized form — the Lagrangian connects them.

Now consider the geometry in two dimensions (two coefficients, β₁ and β₂):

  • The residual sum of squares forms elliptical contours centered at the OLS solution. These are the familiar error ellipses — each contour represents all pairs (β₁, β₂) that give the same residual sum of squares.

  • For ridge (L2), the constraint β₁² + β₂² ≤ t defines a circle (or sphere in higher dimensions).

  • For LASSO (L1), the constraint |β₁| + |β₂| ≤ t defines a diamond (or cross-polytope in higher dimensions).

The solution is the point where the expanding ellipses first touch the constraint region. Here is the critical difference:

  • A circle is smooth everywhere. The ellipses almost always touch it at a point where both β₁ and β₂ are nonzero.

  • A diamond has corners on the axes. The ellipses are much more likely to first touch the diamond at a corner, where one coefficient is exactly zero.

In higher dimensions, the L1 cross-polytope has even more corners (two on each coordinate axis) as well as low-dimensional edges and faces, and the probability that the solution lands on one of them — setting one or more coefficients to zero — increases. This is why LASSO produces sparse solutions: the geometry of the L1 constraint naturally pushes the solution toward the axes.

The subgradient perspective

Another way to see why L1 gives exact zeros comes from calculus. The L2 penalty λβⱼ² has derivative 2λβⱼ, which is zero only when βⱼ = 0. This means the penalty always pushes coefficients toward zero but with decreasing force — it never quite gets there.

The L1 penalty λ|βⱼ| has a constant derivative of ±λ everywhere except at zero, where it is not differentiable. This means the penalty pushes with constant force regardless of how small the coefficient is. When the data's push toward a nonzero value is weaker than λ, the L1 penalty wins and the coefficient snaps to exactly zero. This is soft thresholding: coefficients below the threshold are set to exactly zero, and the surviving ones are shrunk toward zero by a fixed amount.
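This snap-to-zero update is the soft-thresholding operator, which coordinate-descent LASSO solvers apply to one coefficient at a time. A minimal sketch contrasting it with ridge-style proportional shrinkage (illustrative scalar updates, not a full solver):

```python
import numpy as np

def soft_threshold(b, t):
    """LASSO-style update: shrink by t, snap to zero when |b| <= t."""
    return np.sign(b) * np.maximum(np.abs(b) - t, 0.0)

def ridge_shrink(b, t):
    """Ridge-style update: proportional shrinkage, never exactly zero."""
    return b / (1.0 + t)

b = np.array([-2.0, -0.3, 0.1, 0.5, 3.0])
print(soft_threshold(b, 0.5))  # [-1.5, 0, 0, 0, 2.5] -- exact zeros appear
print(ridge_shrink(b, 0.5))    # every entry merely scaled, none exactly zero
```

Small coefficients are eliminated outright; large ones survive but are pulled toward zero by the threshold amount.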

The solution path

As λ decreases from a large value toward zero, the LASSO solution traces a path through coefficient space. This regularization path is one of the most informative diagnostics available.

  1. Start with large lambda

    All coefficients are zero. The model predicts the mean of y.

  2. Decrease lambda slightly

    The first variable enters the model — the one with the strongest correlation with y. Its coefficient grows from zero.

  3. Continue decreasing lambda

    More variables enter the model one at a time (or occasionally in small groups). Existing coefficients adjust. Some may shrink back to zero and drop out.

  4. Reach lambda = 0

    All variables have potentially nonzero coefficients. If p ≤ n, this is the OLS solution.

A key property of the LASSO path is that it is piecewise linear [3]. This means the coefficients change as linear functions of λ between breakpoints where variables enter or leave the model. The LARS (Least Angle Regression) algorithm by Efron, Hastie, Johnstone, and Tibshirani exploits this property to compute the entire path efficiently, at roughly the same computational cost as a single OLS fit.

Plotting the regularization path is a standard diagnostic. You plot each coefficient as a function of λ (or equivalently, of the L1 norm of the coefficient vector). Variables that enter the model early and maintain large coefficients are the most important predictors. Variables that enter late or with small coefficients are marginal.
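scikit-learn exposes the LARS-based path directly through lars_path. A minimal sketch on simulated data, recovering the order in which variables enter:

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
# Only variables 2 and 5 truly matter, with variable 2 strongest
y = 3.0 * X[:, 2] + 1.5 * X[:, 5] + 0.1 * rng.standard_normal(100)

# method='lasso' gives the LASSO path (piecewise linear between breakpoints)
alphas, active, coefs = lars_path(X, y, method='lasso')

print("entry order:", list(active))     # strongest predictor enters first
print("breakpoints:", np.round(alphas, 3))
```

The `active` list records the order in which variables join the model, which is exactly the information the path plot conveys visually.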

Choosing lambda

The regularization parameter λ controls the sparsity of the model. Too large, and the model is too simple (underfitting). Too small, and the model includes too many variables (overfitting). The standard approach for selecting λ is cross-validation.

K-fold cross-validation

  1. Split the data into K folds (typically K = 5 or K = 10)

  2. For each candidate lambda, fit LASSO on K − 1 folds and predict on the held-out fold

  3. Compute the prediction error (usually mean squared error, MSE) for each fold

  4. Average the error across all folds for each λ

  5. Choose lambda that minimizes the cross-validated error

Most software computes the cross-validated error curve automatically. A typical plot shows MSE on the y-axis and λ on the x-axis. The curve is U-shaped: high error at very large λ (too few variables), decreasing as useful variables enter, then increasing again at very small λ (overfitting).

Two common choices are reported:

  • λ_min — the λ that minimizes cross-validated error
  • λ_1se — the largest λ whose error is within one standard error of the minimum. This gives a sparser model with nearly the same predictive performance, and is often preferred in practice.
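scikit-learn's LassoCV reports only the minimizing value (its alpha_ attribute), but the one-standard-error choice can be computed from the stored per-fold errors. A sketch on simulated data:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 50))
beta = np.zeros(50)
beta[:5] = [2, -1.5, 1, -1, 0.8]
y = X @ beta + 0.5 * rng.standard_normal(150)

cv = LassoCV(cv=10, random_state=0).fit(X, y)

# mse_path_ has shape (n_alphas, n_folds), aligned with alphas_
mean_mse = cv.mse_path_.mean(axis=1)
se_mse = cv.mse_path_.std(axis=1) / np.sqrt(cv.mse_path_.shape[1])

i_min = np.argmin(mean_mse)
threshold = mean_mse[i_min] + se_mse[i_min]
# Largest alpha (sparsest model) whose CV error stays under the threshold
alpha_1se = cv.alphas_[mean_mse <= threshold].max()

print(f"lambda_min = {cv.alpha_:.4f}, lambda_1se = {alpha_1se:.4f}")
```

By construction lambda_1se is at least as large as lambda_min, so the resulting model is at least as sparse.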

Comparison with ridge regression

LASSO and ridge regression both add a penalty to OLS, but the penalties produce fundamentally different behavior.

| Property | Ridge (L2) | LASSO (L1) |
|---|---|---|
| Penalty | Sum of squared coefficients | Sum of absolute values of coefficients |
| Constraint shape | Circle/sphere | Diamond/cross-polytope |
| Sparsity | No — all coefficients nonzero | Yes — many coefficients exactly zero |
| Variable selection | No | Yes (automatic) |
| Correlated variables | Distributes weight evenly | Picks one, zeros the others |
| Number of selected variables | Always p | At most min(n, p) |
| Solution uniqueness | Always unique | May not be unique when p > n |
| Computation | Closed-form solution | Requires iterative algorithm |

When ridge wins

Ridge regression tends to outperform LASSO when:

  • Most variables are relevant. If many wavelengths carry small but real information, ridge’s strategy of keeping all of them with small coefficients is more appropriate than LASSO’s strategy of discarding most.
  • Variables are highly correlated. When a group of wavelengths are correlated (which is the norm in spectroscopy), LASSO arbitrarily selects one from the group and zeros the rest. This can be unstable — a small change in the data can switch which variable is selected. Ridge distributes the coefficient evenly across the group, producing more stable predictions.
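The instability on correlated predictors is easy to reproduce with two nearly identical columns. A sketch on synthetic data (the qualitative behavior, not the exact values, is the point):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
z = rng.standard_normal(200)
# Two almost identical predictors plus one unrelated predictor
X = np.column_stack([z + 0.01 * rng.standard_normal(200),
                     z + 0.01 * rng.standard_normal(200),
                     rng.standard_normal(200)])
y = 2.0 * z + 0.1 * rng.standard_normal(200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge:", np.round(ridge.coef_, 2))  # weight split across the twins
print("lasso:", np.round(lasso.coef_, 2))  # weight concentrated on one twin
```

Ridge splits the coefficient roughly evenly between the twins; LASSO piles the weight onto one of them, and which one it picks can flip with small perturbations of the data.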

When LASSO wins

LASSO tends to outperform ridge when:

  • Only a few variables are truly relevant. If the signal is concentrated in a handful of wavelengths, LASSO’s ability to identify and keep only those variables gives it better prediction and interpretability.
  • Interpretability matters. A model with 15 nonzero coefficients is easier to understand and validate than one with 1000.
  • You need a compact model. For deployment on simple hardware (e.g., a handheld sensor with limited wavelengths), LASSO identifies which wavelengths to measure.

The elastic net compromise

When neither pure ridge nor pure LASSO is ideal — for example, when you have groups of correlated variables and want sparsity — the elastic net combines both penalties [4]:

    β̂ = argmin_β ‖y − Xβ‖₂² + λ [ α‖β‖₁ + (1 − α)‖β‖₂² ]

The mixing parameter α ∈ [0, 1] controls the blend: α = 1 is pure LASSO, α = 0 is pure ridge. An intermediate value such as α = 0.5 gives the “grouped selection” property: correlated variables tend to enter or leave the model together.
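In scikit-learn the mixing parameter is called l1_ratio, and alpha is the overall penalty strength. A minimal sketch on synthetic data with a correlated pair of informative predictors:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
z = rng.standard_normal(200)
# A correlated pair of informative predictors plus 8 noise predictors
X = np.column_stack([z + 0.05 * rng.standard_normal(200),
                     z + 0.05 * rng.standard_normal(200),
                     rng.standard_normal((200, 8))])
y = 2.0 * z + 0.1 * rng.standard_normal(200)

# l1_ratio=1 is pure LASSO, l1_ratio=0 is pure ridge
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(np.round(enet.coef_, 2))  # the correlated pair tends to enter together
```

Unlike pure LASSO, the L2 component of the penalty keeps both correlated predictors in the model with shared weight.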

Code implementation

import numpy as np
from sklearn.linear_model import Lasso, LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
# Simulate NIR spectroscopy data
# 200 samples, 500 wavelengths, only 10 are truly relevant
np.random.seed(42)
n_samples, n_wavelengths = 200, 500
n_relevant = 10
# Generate spectra with correlated wavelengths (realistic)
X = np.random.randn(n_samples, n_wavelengths)
# Add correlation between neighboring wavelengths
for j in range(1, n_wavelengths):
    X[:, j] = 0.7 * X[:, j-1] + 0.3 * X[:, j]
# True coefficients: only 10 wavelengths matter
true_beta = np.zeros(n_wavelengths)
relevant_idx = [45, 102, 150, 203, 255, 310, 348, 400, 425, 470]
true_beta[relevant_idx] = np.random.randn(n_relevant) * 2
# Response with noise
y = X @ true_beta + np.random.randn(n_samples) * 0.5
# Always standardize before LASSO
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# --- Method 1: LASSO with cross-validation ---
lasso_cv = LassoCV(cv=10, alphas=None, max_iter=10000, random_state=42)
lasso_cv.fit(X_scaled, y)
print(f"Optimal lambda: {lasso_cv.alpha_:.4f}")
print(f"Nonzero coefficients: {np.sum(lasso_cv.coef_ != 0)} out of {n_wavelengths}")
print(f"R² (on training): {lasso_cv.score(X_scaled, y):.4f}")
# Which wavelengths were selected?
selected = np.where(lasso_cv.coef_ != 0)[0]
print(f"Selected wavelengths: {selected}")
print(f"True relevant wavelengths: {relevant_idx}")
# --- Plot 1: Regularization path ---
from sklearn.linear_model import lasso_path
alphas, coefs, _ = lasso_path(X_scaled, y, alphas=np.logspace(-4, 1, 100))
plt.figure(figsize=(10, 5))
plt.semilogx(alphas, coefs.T, alpha=0.5)
plt.axvline(lasso_cv.alpha_, color='black', linestyle='--',
            label=f'CV optimal (lambda={lasso_cv.alpha_:.3f})')
plt.xlabel('Lambda (regularization strength)')
plt.ylabel('Coefficient value')
plt.title('LASSO Regularization Path')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# --- Plot 2: Cross-validation curve ---
plt.figure(figsize=(8, 5))
plt.semilogx(lasso_cv.alphas_, lasso_cv.mse_path_.mean(axis=1),
             'b-', linewidth=2, label='Mean CV error')
plt.fill_between(lasso_cv.alphas_,
                 lasso_cv.mse_path_.mean(axis=1) - lasso_cv.mse_path_.std(axis=1),
                 lasso_cv.mse_path_.mean(axis=1) + lasso_cv.mse_path_.std(axis=1),
                 alpha=0.2)
plt.axvline(lasso_cv.alpha_, color='red', linestyle='--',
            label=f'lambda_min = {lasso_cv.alpha_:.4f}')
plt.xlabel('Lambda')
plt.ylabel('Mean Squared Error')
plt.title('LASSO Cross-Validation Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# --- Plot 3: True vs estimated coefficients ---
fig, axes = plt.subplots(2, 1, figsize=(12, 6), sharex=True)
axes[0].stem(range(n_wavelengths), true_beta, markerfmt='ro', linefmt='r-',
             basefmt='gray', label='True coefficients')
axes[0].set_ylabel('Coefficient')
axes[0].set_title('True coefficients (only 10 nonzero)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
axes[1].stem(range(n_wavelengths), lasso_cv.coef_, markerfmt='bo', linefmt='b-',
             basefmt='gray', label='LASSO estimates')
axes[1].set_xlabel('Wavelength index')
axes[1].set_ylabel('Coefficient')
axes[1].set_title(f'LASSO estimates ({np.sum(lasso_cv.coef_ != 0)} nonzero)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

When to use LASSO in chemistry

Good applications

  • Wavelength selection for NIR/MIR calibration: identify which spectral regions carry analyte information

  • Compact sensor design: select a small number of wavelengths for dedicated filter instruments

  • Biomarker discovery: find which metabolites (from LC-MS or NMR) predict a clinical outcome

  • Sparse models from large variable sets: any situation where you believe only a few variables are truly relevant

Consider alternatives

  • Highly correlated spectra (most NIR/IR): LASSO picks one wavelength from a correlated group and ignores the rest, producing unstable selections. Use elastic net or PLS instead.

  • When all variables contribute: if the signal is spread across the entire spectrum, LASSO’s enforced sparsity hurts. Ridge or PLS is more appropriate.

  • Small datasets: LASSO’s cross-validation needs enough data for reliable error estimation. With very few samples, consider PLS with leave-one-out CV.

  • When you need the best prediction, not interpretation: PLS or ridge often predict as well or better, especially in spectroscopy where sparsity is not the true data-generating mechanism.

Practical tips

Preprocessing matters. LASSO does not eliminate the need for spectral preprocessing. Apply standard corrections (baseline, scatter, smoothing) before fitting. LASSO selects variables, not preprocessing steps.

Check the path, not just the final model. The regularization path reveals which variables enter first and how stable the selection is. A variable that enters the model early at large λ and stays is more trustworthy than one that appears only at small λ.

Stability selection. Run LASSO on many bootstrap subsamples and track how often each variable is selected [5]. Variables selected in more than 70–80% of runs are robust; those selected sporadically are likely noise.
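A rough sketch of the idea (the half-subsampling scheme and the 70% cutoff here are illustrative choices, not the exact Meinshausen–Bühlmann procedure):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 120, 30
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[[3, 7, 12]] = [2.0, -1.5, 1.0]   # only three true predictors
y = X @ beta + 0.3 * rng.standard_normal(n)

n_boot = 100
counts = np.zeros(p)
for _ in range(n_boot):
    idx = rng.choice(n, size=n // 2, replace=False)   # half subsample
    m = Lasso(alpha=0.1, max_iter=10_000).fit(X[idx], y[idx])
    counts += (m.coef_ != 0)

freq = counts / n_boot
stable = np.where(freq >= 0.7)[0]     # illustrative 70% cutoff
print("selection frequency of true variables:", np.round(freq[[3, 7, 12]], 2))
print("stable variables:", stable)
```

Variables with real signal are selected in nearly every subsample; noise variables come and go.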

Beware the grouping effect. When predictors are correlated (neighboring wavelengths always are), LASSO tends to select one from the group arbitrarily. Elastic net (with 0 < α < 1) handles this better by selecting groups together.

Compare against PLS. In spectroscopy, PLS regression remains the workhorse. Before reporting LASSO results, compare prediction performance against PLS with the optimal number of components. LASSO is most valuable when it matches PLS in predictive accuracy while also providing interpretable variable selection.

Next steps

LASSO opens the door to a family of penalized regression methods:

  • Elastic net combines L1 and L2 penalties, gaining the grouped selection property that LASSO lacks. It is often the preferred choice in spectroscopy [4].
  • Group LASSO extends the idea to predefined groups of variables (e.g., spectral regions), selecting or discarding entire groups at once [6].
  • Adaptive LASSO uses data-dependent weights on the L1 penalty to achieve consistent variable selection under milder conditions [7].
  • Sparse PLS combines the dimensionality reduction of PLS with the variable selection of LASSO, directly addressing the chemometric setting [8].

For most spectroscopic calibration tasks, the practical recommendation is to start with PLS (the chemometric standard), use LASSO or elastic net when variable selection is the primary goal, and compare prediction performance across methods on an independent test set.

References

[1] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1), 267–288.

[2] Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1), 55–67.

[3] Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32(2), 407–499.

[4] Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2), 301–320.

[5] Meinshausen, N., & Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society, Series B, 72(4), 417–473.

[6] Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1), 49–67.

[7] Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429.

[8] Chun, H., & Keleş, S. (2010). Sparse partial least squares regression for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society, Series B, 72(1), 3–25.