In 1970, Arthur E. Hoerl and Robert W. Kennard, two statisticians at the University of Delaware, published a pair of papers in Technometrics that would fundamentally change how scientists handle correlated predictors. Their article “Ridge Regression: Biased Estimation for Nonorthogonal Problems” [1] proposed a deceptively simple idea: when the ordinary least squares (OLS) solution is unstable because predictors are correlated, add a small positive constant to the diagonal of XᵀX before inverting it. This makes the matrix “more invertible” and produces coefficients that are biased but far more stable. They called the technique ridge regression because a plot of the coefficients against the penalty parameter traces out curves that resemble mountain ridges.
What Hoerl and Kennard did not know at the time was that the same mathematical idea had appeared decades earlier in a completely different field. In 1943, the Soviet mathematician Andrey Nikolaevich Tikhonov published a method for solving ill-posed integral equations that is mathematically identical to ridge regression [2]. In the applied mathematics and physics communities, the technique is known as Tikhonov regularization. The convergence of these two independent discoveries underscores how fundamental the idea is: whenever you face an ill-conditioned inverse problem, adding a penalty term is the natural remedy.
For chemists, the relevance was immediate. Spectroscopic calibration data — where you try to predict analyte concentration from absorbance at many wavelengths — produces exactly the kind of ill-conditioned matrices that ridge regression was designed for. Neighboring wavelengths are highly correlated because they arise from the same broad molecular absorption bands. By the 1980s, ridge regression had become part of the chemometrician’s standard toolkit, alongside its descendants PCR and PLS [3].
The core contribution of Hoerl and Kennard’s work was philosophical as much as mathematical: they demonstrated that a biased estimator can outperform an unbiased one when prediction accuracy (mean squared error) is the goal. This bias-variance tradeoff is one of the most important ideas in all of statistical learning.
The instability problem
To understand why ridge regression exists, we need to see OLS fail. Consider a simple scenario: you want to predict analyte concentration from absorbance at three wavelengths that are nearly identical (say, 500, 501, and 502 nm).
When the three wavelengths are highly correlated, XᵀX looks something like:

$$X^TX \approx \begin{bmatrix} 100 & 99.8 & 99.5 \\ 99.8 & 100 & 99.8 \\ 99.5 & 99.8 & 100 \end{bmatrix}$$

The eigenvalues of this matrix are roughly 299.4, 0.5, and 0.1. The condition number is 299.4/0.1 ≈ 3000 — severe multicollinearity. When we invert this matrix, the small eigenvalues become large values in the inverse (1/0.1 = 10, 1/0.5 = 2), amplifying any noise in Xᵀy.
The practical result: OLS might produce coefficients like β̂₁ = +3847, β̂₂ = −7251, β̂₃ = +3415. These enormous values nearly cancel each other out. Add one new sample to the training set, and the coefficients might flip to +2100, −3850, +1760. The predictions on the training data barely change, but the coefficients are completely different — and predictions on new data will be unreliable.
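These numbers are easy to verify. A minimal NumPy check of the conditioning argument, using the illustrative XᵀX above:

```python
import numpy as np

# Illustrative Gram matrix for three nearly identical wavelengths (from the text)
XtX = np.array([[100.0,  99.8,  99.5],
                [ 99.8, 100.0,  99.8],
                [ 99.5,  99.8, 100.0]])

eigvals = np.linalg.eigvalsh(XtX)   # ascending: smallest first
cond = eigvals[-1] / eigvals[0]     # condition number of XtX

print(np.round(eigvals, 3))         # ≈ [0.1, 0.5, 299.4]
print(round(cond))                  # ≈ 3000
```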
The ridge solution
The fix is elegant. Instead of minimizing just the sum of squared residuals, we minimize:
$$S_{\text{ridge}}(\beta) = \|y - X\beta\|^2 + \lambda\|\beta\|^2$$
or equivalently:
$$S_{\text{ridge}}(\beta) = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda\sum_{j=1}^{p}\beta_j^2$$
The first term is the usual least squares objective — fit the data well. The second term is the L2 penalty (also called the ridge penalty, Tikhonov penalty, or weight decay in neural network parlance) — keep the coefficients small. The parameter λ≥0 controls the tradeoff between these two goals.
Taking the derivative and setting it to zero, exactly as we did for OLS:
$$\frac{\partial S_{\text{ridge}}}{\partial \beta} = -2X^Ty + 2X^TX\beta + 2\lambda\beta = 0$$
Rearranging:
$$(X^TX + \lambda I)\,\hat{\beta}_{\text{ridge}} = X^Ty$$
The solution is:
$$\hat{\beta}_{\text{ridge}} = (X^TX + \lambda I)^{-1}X^Ty$$
Compare this with OLS: the only difference is the addition of λI to XᵀX. This addition increases every eigenvalue by λ. If the smallest eigenvalue was 0.1 and we set λ = 1, it becomes 1.1 — the condition number drops from roughly 3000 to roughly 300/1.1 ≈ 273. The matrix is better conditioned, the inverse is better behaved, and the coefficients are smaller and more stable.
When λ=0 , ridge regression is identical to OLS. As λ→∞ , all coefficients shrink toward zero (the model predicts the mean of y for every input). The art of ridge regression lies in choosing λ somewhere in between.
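The closed-form solution lends itself to a direct implementation. The sketch below (synthetic data; the λ value is illustrative) solves the ridge normal equations with NumPy rather than forming the inverse explicitly:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X'X + lam*I) beta = X'y -- the ridge normal equations."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Three nearly collinear predictors, as in the wavelength example
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = np.column_stack([base + 0.01 * rng.normal(size=200) for _ in range(3)])
y = X @ np.array([1.0, 1.0, 1.0]) + 0.1 * rng.normal(size=200)

b_ols = ridge_fit(X, y, 0.0)    # lam = 0 reproduces OLS
b_ridge = ridge_fit(X, y, 1.0)  # lam = 1 shrinks and stabilizes

print(np.linalg.norm(b_ridge) <= np.linalg.norm(b_ols))  # True: ridge shrinks
```

Using `np.linalg.solve` instead of explicitly inverting XᵀX + λI is both faster and numerically safer.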
The SVD perspective
The singular value decomposition (SVD) X = UDVᵀ provides the clearest view of what ridge regression does. The OLS coefficients can be written as:
$$\hat{\beta}_{\text{OLS}} = \sum_{j=1}^{p} \frac{1}{d_j}\,(u_j^Ty)\,v_j$$
where the dⱼ are the singular values and uⱼ, vⱼ the corresponding left and right singular vectors. Small singular values produce large 1/dⱼ factors, amplifying noise.
Ridge regression modifies this to:
$$\hat{\beta}_{\text{ridge}} = \sum_{j=1}^{p} \frac{d_j}{d_j^2 + \lambda}\,(u_j^Ty)\,v_j$$
The factor dⱼ/(dⱼ² + λ) acts as a continuous shrinkage factor. When dⱼ² ≫ λ, the factor is approximately 1/dⱼ (no shrinkage, same as OLS). When dⱼ² ≪ λ, the factor is approximately dⱼ/λ ≈ 0 (heavy shrinkage). Ridge regression selectively dampens the directions in predictor space that have the least variance — precisely the directions where OLS is most unstable.
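The shrinkage factors can be tabulated directly; the singular values below are illustrative (roughly the square roots of the eigenvalues from the earlier example):

```python
import numpy as np

d = np.array([17.3, 0.7, 0.3])    # illustrative singular values of X
lam = 1.0

ols_gain = 1.0 / d                # OLS: amplifies small-d directions
ridge_gain = d / (d**2 + lam)     # ridge: damps them instead

print(np.round(ols_gain, 3))      # large 1/d for the small singular values
print(np.round(ridge_gain, 3))    # d/(d^2 + lam) stays bounded
```

The largest singular direction is barely touched, while the smallest is damped heavily — exactly the selective behavior described above.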
The bias-variance tradeoff
The Gauss-Markov theorem tells us that OLS is the best linear unbiased estimator. So by introducing bias (shrinking coefficients toward zero), aren’t we making things worse?
Not necessarily. The mean squared error (MSE) of any estimator can be decomposed as:
$$\text{MSE}(\hat{\beta}) = \text{Bias}^2(\hat{\beta}) + \text{Variance}(\hat{\beta})$$
OLS has zero bias but potentially enormous variance (when predictors are correlated). Ridge regression introduces some bias but dramatically reduces variance. For multicollinear data, the variance reduction far outweighs the bias increase, producing lower total MSE.
Hoerl and Kennard proved that there always exists a value of λ>0 for which the ridge estimator has lower MSE than OLS [1]. This is a remarkable result: you can always improve on OLS by adding at least a tiny amount of regularization when the true coefficients are finite.
Ridge trace
The ridge trace is the most intuitive way to visualize the bias-variance tradeoff. It plots each regression coefficient as a function of λ :
Fit ridge regression for a sequence of λ values (e.g., 0.001, 0.01, 0.1, 1, 10, 100)
For each λ , record all p coefficients
Plot each coefficient as a curve against log(λ)
At λ=0 (left side of the plot), the coefficients are the OLS values — potentially wild and erratic. As λ increases, the coefficients shrink and stabilize. Eventually, for very large λ , all coefficients approach zero.
The “right” λ is typically where the coefficients have stabilized (stopped fluctuating wildly) but have not yet been shrunk to insignificance. Hoerl and Kennard originally proposed choosing λ by visual inspection of the ridge trace — a subjective but informative approach.
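The steps above can be sketched as follows (synthetic collinear data; the plotting of each column of `trace` against log λ is omitted):

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge solution via the normal equations
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
base = rng.normal(size=100)
X = np.column_stack([base + 0.05 * rng.normal(size=100) for _ in range(3)])
y = X @ np.array([2.0, -1.0, 1.5]) + 0.1 * rng.normal(size=100)

lambdas = [0.001, 0.01, 0.1, 1, 10, 100]
trace = np.array([ridge_fit(X, y, lam) for lam in lambdas])
# Row i of `trace` holds the p coefficients at lambdas[i];
# plotting each column against log(lambda) gives the ridge trace.

norms = np.linalg.norm(trace, axis=1)
print(norms[0] > norms[-1])  # True: coefficients shrink as lambda grows
```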
Choosing lambda
While the ridge trace provides visual intuition, modern practice selects λ using quantitative criteria.
Cross-validation
The most widely used method. The idea:
Split the data into k folds (typically 5 or 10)
For each candidate λ , fit the model on k−1 folds and predict the held-out fold
Compute the cross-validated RMSE (or MSE) for each λ
Choose the λ that minimizes the cross-validated error
This approach is simple, general, and makes no distributional assumptions. Its only drawback is computational cost (fitting the model k × |grid| times), but for ridge regression the computation is fast enough that this is rarely a concern.
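A minimal k-fold implementation, assuming the rows are already in random order (otherwise shuffle first); the data and λ grid are synthetic:

```python
import numpy as np

def ridge_fit(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_rmse(X, y, lam, k=5):
    """k-fold cross-validated RMSE for a single lambda."""
    folds = np.array_split(np.arange(len(y)), k)
    sq_err = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        b = ridge_fit(X[train], y[train], lam)
        sq_err.extend((y[held_out] - X[held_out] @ b) ** 2)
    return float(np.sqrt(np.mean(sq_err)))

rng = np.random.default_rng(2)
base = rng.normal(size=120)
X = np.column_stack([base + 0.02 * rng.normal(size=120) for _ in range(3)])
y = X @ np.array([1.0, 1.0, 1.0]) + 0.2 * rng.normal(size=120)

lambdas = [0.001, 0.01, 0.1, 1, 10, 100]
scores = [cv_rmse(X, y, lam) for lam in lambdas]
best_lam = lambdas[int(np.argmin(scores))]
```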
Generalized cross-validation (GCV)
An efficient approximation to leave-one-out cross-validation, proposed by Golub, Heath, and Wahba (1979) [4]:
$$\text{GCV}(\lambda) = \frac{n^{-1}\,\|y - \hat{y}_\lambda\|^2}{\left(1 - n^{-1}\,\text{tr}(H_\lambda)\right)^2}$$

where H_λ = X(XᵀX + λI)⁻¹Xᵀ is the ridge hat matrix and tr(H_λ) is the effective number of parameters (degrees of freedom). GCV does not require actually holding out data — it computes the LOO error analytically from the full fit.
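GCV is a few lines of NumPy. This sketch forms the hat matrix explicitly, which is fine for small p but should go through the SVD for wide spectroscopic matrices; the data are synthetic:

```python
import numpy as np

def gcv(X, y, lam):
    """Generalized cross-validation score for ridge with penalty lam."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)  # hat matrix H_lambda
    resid = y - H @ y
    edf = np.trace(H)  # effective degrees of freedom, tr(H_lambda)
    return float((resid @ resid / n) / (1 - edf / n) ** 2)

rng = np.random.default_rng(3)
base = rng.normal(size=80)
X = np.column_stack([base + 0.05 * rng.normal(size=80) for _ in range(3)])
y = X @ np.array([1.0, -0.5, 0.5]) + 0.1 * rng.normal(size=80)

lambdas = np.logspace(-3, 2, 20)
gcv_scores = [gcv(X, y, lam) for lam in lambdas]
best_lam = lambdas[int(np.argmin(gcv_scores))]
```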
The L-curve
Plot log‖β_λ‖ (solution norm) against log‖y − Xβ_λ‖ (residual norm) for many values of λ. The resulting curve is typically L-shaped. The optimal λ is at the “corner” of the L, where you get the best compromise between fitting the data (small residual) and keeping the solution smooth (small norm).
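Computing the two norms over a λ grid is enough to draw the L-curve (plotting omitted; the data are synthetic):

```python
import numpy as np

def ridge_fit(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(5)
base = rng.normal(size=100)
X = np.column_stack([base + 0.05 * rng.normal(size=100) for _ in range(3)])
y = X @ np.array([1.0, 0.5, -0.5]) + 0.1 * rng.normal(size=100)

lambdas = np.logspace(-4, 3, 30)
sol_norms = [np.linalg.norm(ridge_fit(X, y, lam)) for lam in lambdas]
res_norms = [np.linalg.norm(y - X @ ridge_fit(X, y, lam)) for lam in lambdas]

# Larger lambda: smaller solution norm, larger residual norm.
# Plotting log(res_norms) against log(sol_norms) traces the L shape.
print(sol_norms[0] > sol_norms[-1], res_norms[0] < res_norms[-1])  # True True
```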
Geometric interpretation
Ridge regression has an elegant geometric interpretation as a constrained optimization problem. Minimizing ‖y − Xβ‖² + λ‖β‖² is equivalent to:

$$\min_{\beta}\ \|y - X\beta\|^2 \quad \text{subject to} \quad \|\beta\|^2 \le t$$
for some value t that depends on λ (smaller t corresponds to larger λ ).
In two dimensions, this is easy to visualize. The least squares objective defines elliptical contours centered at the OLS solution β̂_OLS. The constraint ‖β‖² ≤ t defines a circle (a sphere in higher dimensions) centered at the origin.
The ridge solution is the point where the smallest ellipse just touches the circle. Because the constraint region is a circle (smooth, no corners), the solution point can lie anywhere on the boundary. This is in contrast to LASSO regression, whose constraint region is a diamond (with corners on the axes), which tends to push solutions to the corners — producing exact zeros in the coefficients.
Connection to Bayesian regression
Ridge regression has an elegant Bayesian interpretation. If we place a Gaussian prior on the coefficients:
$$\beta \sim \mathcal{N}(0,\ \tau^2 I)$$
meaning we believe a priori that each coefficient is drawn from a normal distribution centered at zero with variance τ², then the posterior mean (the most likely coefficients given the data and our prior belief) is exactly the ridge estimator:

$$\hat{\beta}_{\text{ridge}} = \left(X^TX + \frac{\sigma^2}{\tau^2}\,I\right)^{-1} X^Ty$$
with λ = σ²/τ². This means that λ has a direct interpretation: it is the ratio of noise variance to prior coefficient variance. A large λ (strong regularization) corresponds to a strong prior belief that coefficients are small. A small λ corresponds to a vague prior that lets the data dominate.
This Bayesian view also explains why ridge regression works: even when the data alone is insufficient to determine the coefficients reliably (because of multicollinearity), the prior provides additional information that stabilizes the estimates. The more correlated the predictors, the more the data alone is ambiguous, and the more helpful the prior becomes.
Ridge helps when
Multicollinear predictors
The primary use case. When predictors are highly correlated (condition number > 30), ridge stabilizes the coefficients dramatically.
More predictors than observations
OLS has no unique solution when p > n. Ridge always has a unique solution because XᵀX + λI is invertible for any λ > 0.
Noisy predictors
When the signal-to-noise ratio in the predictor variables is low, OLS amplifies noise through large coefficients. Ridge tames them.
Improved generalization
Even when OLS fits the training data slightly better, ridge almost always predicts new data more accurately for spectroscopic calibration.
Ridge may not help when
Predictors are truly orthogonal
If predictors are uncorrelated (e.g., from a designed experiment), OLS is already optimal and ridge just adds unnecessary bias.
Variable selection is needed
Ridge shrinks all coefficients toward zero but never sets any exactly to zero. If you believe most predictors are irrelevant, use LASSO or elastic net.
The model is fundamentally wrong
Regularization cannot fix a misspecified model. If the true relationship is nonlinear and you fit a linear model, adding a penalty will not help.
Very few predictors, many observations
With n≫p and low correlation among predictors, OLS works fine and the bias from ridge is unnecessary.
Practical guidelines
Always standardize your predictors before applying ridge regression. The L2 penalty treats all coefficients equally, so predictors measured on different scales (e.g., absorbance vs. temperature) would be penalized unequally without standardization.
Start with cross-validation to select λ . Use 10-fold CV as the default. The ridge trace is useful for understanding the solution, but CV gives you the best predictive λ .
Plot the ridge trace to check for stability. If coefficients are still erratic at the CV-optimal λ , consider increasing the penalty or switching to PCR/PLS.
Compare against OLS on a held-out test set (or by cross-validation). If ridge does not improve RMSEP over OLS, multicollinearity may not be your main problem.
Consider the alternatives. Ridge is the simplest regularization method, but PCR and PLS are often more effective for spectroscopic data because they also perform dimensionality reduction.
Common pitfalls
A few mistakes come up repeatedly when practitioners first use ridge regression:
Forgetting to standardize. This is the most common error. If predictor x₁ is measured in absorbance units (range 0-2) and predictor x₂ is a temperature (range 20-300), the penalty λ∑βⱼ² treats their coefficients very differently: the absorbance coefficient must be numerically much larger to have the same predictive effect, so it dominates the penalty and is shrunk far more aggressively. Standardizing all predictors to zero mean and unit variance ensures equal treatment.
Using R-squared on training data to evaluate ridge. Ridge deliberately sacrifices training-set fit (by adding bias) to improve generalization. Comparing training R² between OLS and ridge will always favor OLS. The only fair comparison is on held-out data or via cross-validation.
Choosing lambda on the full dataset. If you use all your data to select λ by cross-validation and then report the cross-validated RMSE as your expected prediction error, you are being overly optimistic. Ideally, the entire model selection process (including lambda tuning) should be nested inside an outer validation loop.
Interpreting individual coefficients. Ridge coefficients are biased, and their magnitudes depend on λ . Do not use individual ridge coefficients to make claims about which predictors are “important.” If variable importance is the goal, use permutation importance or switch to LASSO.
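The first pitfall (forgetting to standardize) is easy to demonstrate. In this synthetic sketch, an absorbance-scale predictor and a temperature-scale predictor contribute comparably to the response; after standardization their ridge coefficients land on a comparable scale:

```python
import numpy as np

def ridge_fit(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(4)
absorbance = rng.uniform(0, 2, size=100)       # range ~0-2
temperature = rng.uniform(20, 300, size=100)   # range ~20-300
X_raw = np.column_stack([absorbance, temperature])
y = 1.0 * absorbance + 0.01 * temperature + 0.05 * rng.normal(size=100)

# Standardize predictors (zero mean, unit variance) and center y
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
y_c = y - y.mean()

b = ridge_fit(X_std, y_c, 1.0)
print(np.round(np.abs(b), 2))  # both coefficients now on a comparable scale
```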
Next steps
Ridge regression introduces the fundamental idea of regularization — trading bias for variance to improve prediction. The same idea takes different forms in other methods:
LASSO regression: Replaces the L2 penalty ‖β‖² with the L1 penalty ‖β‖₁ = ∑|βⱼ|. The geometry of the L1 constraint (a diamond with corners on the axes) drives some coefficients exactly to zero, performing automatic variable selection.
Elastic net: Combines L1 and L2 penalties: α‖β‖₁ + (1 − α)‖β‖². Gets the variable selection of LASSO with the stability of ridge.
Principal component regression (PCR): Instead of penalizing coefficients, reduces dimensionality by regressing on the top principal components. Achieves a similar effect to ridge (suppresses directions with small variance) but is more interpretable when a few components dominate.
Partial least squares (PLS): Like PCR but constructs components that account for covariance with the response, not just variance in the predictors. Generally outperforms ridge for spectroscopic calibration.
[1] Hoerl, A.E., & Kennard, R.W. (1970). “Ridge Regression: Biased Estimation for Nonorthogonal Problems.” Technometrics, 12(1), 55-67.
[2] Tikhonov, A.N. (1943). “On the stability of inverse problems.” Doklady Akademii Nauk SSSR, 39(5), 195-198.
[3] Naes, T., & Martens, H. (1988). “Principal component regression in NIR analysis: Viewpoints, background details and selection of components.” Journal of Chemometrics, 2(2), 155-167.
[4] Golub, G.H., Heath, M., & Wahba, G. (1979). “Generalized Cross-Validation as a Method for Choosing a Good Ridge Parameter.” Technometrics, 21(2), 215-223.
[5] Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. Chapter 3.
[6] Brereton, R.G. (2003). Chemometrics: Data Analysis for the Laboratory and Chemical Plant. John Wiley & Sons.
[7] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. Springer. Chapter 6.