In 1970, Arthur E. Hoerl and Robert W. Kennard, two statisticians at the University of Delaware, published a pair of papers in Technometrics that would fundamentally change how scientists handle correlated predictors. Their article “Ridge Regression: Biased Estimation for Nonorthogonal Problems” [1] proposed a deceptively simple idea: when the ordinary least squares (OLS) solution is unstable because predictors are correlated, add a small positive constant to the diagonal of XᵀX before inverting it. This makes the matrix “more invertible” and produces coefficients that are biased but far more stable. They called the technique ridge regression because a plot of the coefficients against the penalty parameter traces out curves that resemble mountain ridges.
What Hoerl and Kennard did not know at the time was that the same mathematical idea had appeared decades earlier in a completely different field. In 1943, the Soviet mathematician Andrey Nikolaevich Tikhonov published a method for solving ill-posed integral equations that is mathematically identical to ridge regression [2]. In the applied mathematics and physics communities, the technique is known as Tikhonov regularization. The convergence of these two independent discoveries underscores how fundamental the idea is: whenever you face an ill-conditioned inverse problem, adding a penalty term is the natural remedy.
For chemists, the relevance was immediate. Spectroscopic calibration data — where you try to predict analyte concentration from absorbance at many wavelengths — produces exactly the kind of ill-conditioned matrices that ridge regression was designed for. Neighboring wavelengths are highly correlated because they arise from the same broad molecular absorption bands. By the 1980s, ridge regression had become part of the chemometrician’s standard toolkit, alongside its descendants PCR and PLS [3].
The core contribution of Hoerl and Kennard’s work was philosophical as much as mathematical: they demonstrated that a biased estimator can outperform an unbiased one when prediction accuracy (mean squared error) is the goal. This bias-variance tradeoff is one of the most important ideas in all of statistical learning.
The instability problem
To understand why ridge regression exists, we need to see OLS fail. Consider a simple scenario: you want to predict analyte concentration from absorbance at three wavelengths that are nearly identical (say, 500, 501, and 502 nm).
When the three wavelengths are highly correlated, XᵀX looks something like:

$$X^TX \approx \begin{bmatrix} 100 & 99.8 & 99.5 \\ 99.8 & 100 & 99.8 \\ 99.5 & 99.8 & 100 \end{bmatrix}$$

The eigenvalues of this matrix are roughly 299.4, 0.5, and 0.1. The condition number is 299.4/0.1 ≈ 3000 — severe multicollinearity. When we invert this matrix, the small eigenvalues become large values in the inverse (1/0.1 = 10, 1/0.5 = 2), amplifying any noise in Xᵀy.
The practical result: OLS might produce coefficients like β̂₁ = +3847, β̂₂ = −7251, β̂₃ = +3415. These enormous values nearly cancel each other out. Add one new sample to the training set, and the coefficients might flip to +2100, −3850, +1760. The predictions on the training data barely change, but the coefficients are completely different — and predictions on new data will be unreliable.
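These numbers are easy to verify. A minimal NumPy check of the conditioning argument, using the illustrative XᵀX above:

```python
import numpy as np

# Illustrative Gram matrix for three nearly identical wavelengths (from the text)
XtX = np.array([[100.0,  99.8,  99.5],
                [ 99.8, 100.0,  99.8],
                [ 99.5,  99.8, 100.0]])

eigvals = np.linalg.eigvalsh(XtX)   # ascending: smallest first
cond = eigvals[-1] / eigvals[0]     # condition number of XtX

print(np.round(eigvals, 3))         # ≈ [0.1, 0.5, 299.4]
print(round(cond))                  # ≈ 3000
```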
The ridge solution
The fix is elegant. Instead of minimizing just the sum of squared residuals, we minimize:
$$S_{\text{ridge}}(\beta) = \|y - X\beta\|^2 + \lambda\|\beta\|^2$$
or equivalently:
$$S_{\text{ridge}}(\beta) = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda\sum_{j=1}^{p}\beta_j^2$$
The first term is the usual least squares objective — fit the data well. The second term is the L2 penalty (also called the ridge penalty, Tikhonov penalty, or weight decay in neural network parlance) — keep the coefficients small. The parameter λ≥0 controls the tradeoff between these two goals.
Taking the derivative and setting it to zero, exactly as we did for OLS:
$$\frac{\partial S_{\text{ridge}}}{\partial \beta} = -2X^Ty + 2X^TX\beta + 2\lambda\beta = 0$$
Rearranging:
$$(X^TX + \lambda I)\,\hat{\beta}_{\text{ridge}} = X^Ty$$
The solution is:
$$\hat{\beta}_{\text{ridge}} = (X^TX + \lambda I)^{-1}X^Ty$$
Compare this with OLS: the only difference is the addition of λI to XᵀX. This addition increases every eigenvalue by λ. If the smallest eigenvalue was 0.1 and we set λ = 1, it becomes 1.1 — the condition number drops from roughly 3000 to roughly 300/1.1 ≈ 273. The matrix is better conditioned, the inverse is better behaved, and the coefficients are smaller and more stable.
When λ=0 , ridge regression is identical to OLS. As λ→∞ , all coefficients shrink toward zero (the model predicts the mean of y for every input). The art of ridge regression lies in choosing λ somewhere in between.
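The closed-form solution lends itself to a direct implementation. The sketch below (synthetic data; the λ value is illustrative) solves the ridge normal equations with NumPy rather than forming the inverse explicitly:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X'X + lam*I) beta = X'y -- the ridge normal equations."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Three nearly collinear predictors, as in the wavelength example
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = np.column_stack([base + 0.01 * rng.normal(size=200) for _ in range(3)])
y = X @ np.array([1.0, 1.0, 1.0]) + 0.1 * rng.normal(size=200)

b_ols = ridge_fit(X, y, 0.0)    # lam = 0 reproduces OLS
b_ridge = ridge_fit(X, y, 1.0)  # lam = 1 shrinks and stabilizes

print(np.linalg.norm(b_ridge) <= np.linalg.norm(b_ols))  # True: ridge shrinks
```

Using `np.linalg.solve` instead of explicitly inverting XᵀX + λI is both faster and numerically safer.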
The SVD perspective
The singular value decomposition (SVD) X = UDVᵀ provides the clearest view of what ridge regression does. The OLS coefficients can be written as:
$$\hat{\beta}_{\text{OLS}} = \sum_{j=1}^{p} \frac{1}{d_j}\,(u_j^Ty)\,v_j$$
where the dⱼ are the singular values and uⱼ, vⱼ the corresponding left and right singular vectors. Small singular values produce large 1/dⱼ factors, amplifying noise.
Ridge regression modifies this to:
$$\hat{\beta}_{\text{ridge}} = \sum_{j=1}^{p} \frac{d_j}{d_j^2 + \lambda}\,(u_j^Ty)\,v_j$$
The factor dⱼ/(dⱼ² + λ) acts as a continuous shrinkage factor. When dⱼ² ≫ λ, the factor is approximately 1/dⱼ (no shrinkage, same as OLS). When dⱼ² ≪ λ, the factor is approximately dⱼ/λ ≈ 0 (heavy shrinkage). Ridge regression selectively dampens the directions in predictor space that have the least variance — precisely the directions where OLS is most unstable.
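The shrinkage factors can be tabulated directly; the singular values below are illustrative (roughly the square roots of the eigenvalues from the earlier example):

```python
import numpy as np

d = np.array([17.3, 0.7, 0.3])    # illustrative singular values of X
lam = 1.0

ols_gain = 1.0 / d                # OLS: amplifies small-d directions
ridge_gain = d / (d**2 + lam)     # ridge: damps them instead

print(np.round(ols_gain, 3))      # large 1/d for the small singular values
print(np.round(ridge_gain, 3))    # d/(d^2 + lam) stays bounded
```

The largest singular direction is barely touched, while the smallest is damped heavily — exactly the selective behavior described above.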
The bias-variance tradeoff
The Gauss-Markov theorem tells us that OLS is the best linear unbiased estimator. So by introducing bias (shrinking coefficients toward zero), aren’t we making things worse?
Not necessarily. The mean squared error (MSE) of any estimator can be decomposed as:
$$\text{MSE}(\hat{\beta}) = \text{Bias}^2(\hat{\beta}) + \text{Variance}(\hat{\beta})$$
OLS has zero bias but potentially enormous variance (when predictors are correlated). Ridge regression introduces some bias but dramatically reduces variance. For multicollinear data, the variance reduction far outweighs the bias increase, producing lower total MSE.
Hoerl and Kennard proved that there always exists a value of λ>0 for which the ridge estimator has lower MSE than OLS [1]. This is a remarkable result: you can always improve on OLS by adding at least a tiny amount of regularization when the true coefficients are finite.
Ridge trace
The ridge trace is the most intuitive way to visualize the bias-variance tradeoff. It plots each regression coefficient as a function of λ :
Fit ridge regression for a sequence of λ values (e.g., 0.001, 0.01, 0.1, 1, 10, 100)
For each λ , record all p coefficients
Plot each coefficient as a curve against log(λ)
At λ=0 (left side of the plot), the coefficients are the OLS values — potentially wild and erratic. As λ increases, the coefficients shrink and stabilize. Eventually, for very large λ , all coefficients approach zero.
The “right” λ is typically where the coefficients have stabilized (stopped fluctuating wildly) but have not yet been shrunk to insignificance. Hoerl and Kennard originally proposed choosing λ by visual inspection of the ridge trace — a subjective but informative approach.
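The steps above can be sketched as follows (synthetic collinear data; the plotting of each column of `trace` against log λ is omitted):

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge solution via the normal equations
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
base = rng.normal(size=100)
X = np.column_stack([base + 0.05 * rng.normal(size=100) for _ in range(3)])
y = X @ np.array([2.0, -1.0, 1.5]) + 0.1 * rng.normal(size=100)

lambdas = [0.001, 0.01, 0.1, 1, 10, 100]
trace = np.array([ridge_fit(X, y, lam) for lam in lambdas])
# Row i of `trace` holds the p coefficients at lambdas[i];
# plotting each column against log(lambda) gives the ridge trace.

norms = np.linalg.norm(trace, axis=1)
print(norms[0] > norms[-1])  # True: coefficients shrink as lambda grows
```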
Choosing lambda
While the ridge trace provides visual intuition, modern practice selects λ using quantitative criteria.
Cross-validation
The most widely used method. The idea:
Split the data into k folds (typically 5 or 10)
For each candidate λ , fit the model on k−1 folds and predict the held-out fold
Compute the cross-validated RMSE (or MSE) for each λ
Choose the λ that minimizes the cross-validated error
This approach is simple, general, and makes no distributional assumptions. Its only drawback is computational cost (fitting the model k × |grid| times), but for ridge regression the computation is fast enough that this is rarely a concern.
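A minimal k-fold implementation, assuming the rows are already in random order (otherwise shuffle first); the data and λ grid are synthetic:

```python
import numpy as np

def ridge_fit(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_rmse(X, y, lam, k=5):
    """k-fold cross-validated RMSE for a single lambda."""
    folds = np.array_split(np.arange(len(y)), k)
    sq_err = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        b = ridge_fit(X[train], y[train], lam)
        sq_err.extend((y[held_out] - X[held_out] @ b) ** 2)
    return float(np.sqrt(np.mean(sq_err)))

rng = np.random.default_rng(2)
base = rng.normal(size=120)
X = np.column_stack([base + 0.02 * rng.normal(size=120) for _ in range(3)])
y = X @ np.array([1.0, 1.0, 1.0]) + 0.2 * rng.normal(size=120)

lambdas = [0.001, 0.01, 0.1, 1, 10, 100]
scores = [cv_rmse(X, y, lam) for lam in lambdas]
best_lam = lambdas[int(np.argmin(scores))]
```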
Generalized cross-validation (GCV)
An efficient approximation to leave-one-out cross-validation, proposed by Golub, Heath, and Wahba (1979) [4]:
$$\text{GCV}(\lambda) = \frac{n^{-1}\,\|y - \hat{y}_\lambda\|^2}{\left(1 - n^{-1}\,\text{tr}(H_\lambda)\right)^2}$$

where H_λ = X(XᵀX + λI)⁻¹Xᵀ is the ridge hat matrix and tr(H_λ) is the effective number of parameters (degrees of freedom). GCV does not require actually holding out data — it computes the LOO error analytically from the full fit.
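GCV is a few lines of NumPy. This sketch forms the hat matrix explicitly, which is fine for small p but should go through the SVD for wide spectroscopic matrices; the data are synthetic:

```python
import numpy as np

def gcv(X, y, lam):
    """Generalized cross-validation score for ridge with penalty lam."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)  # hat matrix H_lambda
    resid = y - H @ y
    edf = np.trace(H)  # effective degrees of freedom, tr(H_lambda)
    return float((resid @ resid / n) / (1 - edf / n) ** 2)

rng = np.random.default_rng(3)
base = rng.normal(size=80)
X = np.column_stack([base + 0.05 * rng.normal(size=80) for _ in range(3)])
y = X @ np.array([1.0, -0.5, 0.5]) + 0.1 * rng.normal(size=80)

lambdas = np.logspace(-3, 2, 20)
gcv_scores = [gcv(X, y, lam) for lam in lambdas]
best_lam = lambdas[int(np.argmin(gcv_scores))]
```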
The L-curve
Plot log‖β_λ‖ (solution norm) against log‖y − Xβ_λ‖ (residual norm) for many values of λ. The resulting curve is typically L-shaped. The optimal λ is at the “corner” of the L, where you get the best compromise between fitting the data (small residual) and keeping the solution smooth (small norm).
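Computing the two norms over a λ grid is enough to draw the L-curve (plotting omitted; the data are synthetic):

```python
import numpy as np

def ridge_fit(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(5)
base = rng.normal(size=100)
X = np.column_stack([base + 0.05 * rng.normal(size=100) for _ in range(3)])
y = X @ np.array([1.0, 0.5, -0.5]) + 0.1 * rng.normal(size=100)

lambdas = np.logspace(-4, 3, 30)
sol_norms = [np.linalg.norm(ridge_fit(X, y, lam)) for lam in lambdas]
res_norms = [np.linalg.norm(y - X @ ridge_fit(X, y, lam)) for lam in lambdas]

# Larger lambda: smaller solution norm, larger residual norm.
# Plotting log(res_norms) against log(sol_norms) traces the L shape.
print(sol_norms[0] > sol_norms[-1], res_norms[0] < res_norms[-1])  # True True
```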
Geometric interpretation
Ridge regression has an elegant geometric interpretation as a constrained optimization problem. Minimizing ‖y − Xβ‖² + λ‖β‖² is equivalent to:

$$\min_{\beta}\ \|y - X\beta\|^2 \quad \text{subject to} \quad \|\beta\|^2 \le t$$
for some value t that depends on λ (smaller t corresponds to larger λ ).
In two dimensions, this is easy to visualize. The least squares objective defines elliptical contours centered at the OLS solution β̂_OLS. The constraint ‖β‖² ≤ t defines a circle (a sphere in higher dimensions) centered at the origin.
The ridge solution is the point where the smallest ellipse just touches the circle. Because the constraint region is a circle (smooth, no corners), the solution point can lie anywhere on the boundary. This is in contrast to LASSO regression, whose constraint region is a diamond (with corners on the axes), which tends to push solutions to the corners — producing exact zeros in the coefficients.
Connection to Bayesian regression
Ridge regression has an elegant Bayesian interpretation. If we place a Gaussian prior on the coefficients:
$$\beta \sim \mathcal{N}(0,\ \tau^2 I)$$
meaning we believe a priori that each coefficient is drawn from a normal distribution centered at zero with variance τ², then the posterior mean (the most likely coefficients given the data and our prior belief) is exactly the ridge estimator:

$$\hat{\beta}_{\text{ridge}} = \left(X^TX + \frac{\sigma^2}{\tau^2}\,I\right)^{-1} X^Ty$$
with λ = σ²/τ². This means that λ has a direct interpretation: it is the ratio of noise variance to prior coefficient variance. A large λ (strong regularization) corresponds to a strong prior belief that coefficients are small. A small λ corresponds to a vague prior that lets the data dominate.
This Bayesian view also explains why ridge regression works: even when the data alone is insufficient to determine the coefficients reliably (because of multicollinearity), the prior provides additional information that stabilizes the estimates. The more correlated the predictors, the more the data alone is ambiguous, and the more helpful the prior becomes.
Ridge helps when
Multicollinear predictors
The primary use case. When predictors are highly correlated (condition number > 30), ridge stabilizes the coefficients dramatically.
More predictors than observations
OLS has no unique solution when p > n. Ridge always has a unique solution because XᵀX + λI is invertible for any λ > 0.
Noisy predictors
When the signal-to-noise ratio in the predictor variables is low, OLS amplifies noise through large coefficients. Ridge tames them.
Improved generalization
Even when OLS fits the training data slightly better, ridge almost always predicts new data more accurately for spectroscopic calibration.
Ridge may not help when
Predictors are truly orthogonal
If predictors are uncorrelated (e.g., from a designed experiment), OLS is already optimal and ridge just adds unnecessary bias.
Variable selection is needed
Ridge shrinks all coefficients toward zero but never sets any exactly to zero. If you believe most predictors are irrelevant, use LASSO or elastic net.
The model is fundamentally wrong
Regularization cannot fix a misspecified model. If the true relationship is nonlinear and you fit a linear model, adding a penalty will not help.
Very few predictors, many observations
With n≫p and low correlation among predictors, OLS works fine and the bias from ridge is unnecessary.
Practical guidelines
Always standardize your predictors before applying ridge regression. The L2 penalty treats all coefficients equally, so predictors measured on different scales (e.g., absorbance vs. temperature) would be penalized unequally without standardization.
Start with cross-validation to select λ . Use 10-fold CV as the default. The ridge trace is useful for understanding the solution, but CV gives you the best predictive λ .
Plot the ridge trace to check for stability. If coefficients are still erratic at the CV-optimal λ , consider increasing the penalty or switching to PCR/PLS.
Compare against OLS on a held-out test set (or by cross-validation). If ridge does not improve RMSEP over OLS, multicollinearity may not be your main problem.
Consider the alternatives. Ridge is the simplest regularization method, but PCR and PLS are often more effective for spectroscopic data because they also perform dimensionality reduction.
Common pitfalls
A few mistakes come up repeatedly when practitioners first use ridge regression:
Forgetting to standardize. This is the most common error. If predictor x₁ is measured in absorbance units (range 0-2) and predictor x₂ is a temperature (range 20-300), the penalty λ∑βⱼ² treats their coefficients very differently: the absorbance coefficient must be numerically much larger to have the same predictive effect, so it dominates the penalty and is shrunk far more aggressively. Standardizing all predictors to zero mean and unit variance ensures equal treatment.
Using R-squared on training data to evaluate ridge. Ridge deliberately sacrifices training-set fit (by adding bias) to improve generalization. Comparing training R² between OLS and ridge will always favor OLS. The only fair comparison is on held-out data or via cross-validation.
Choosing lambda on the full dataset. If you use all your data to select λ by cross-validation and then report the cross-validated RMSE as your expected prediction error, you are being overly optimistic. Ideally, the entire model selection process (including lambda tuning) should be nested inside an outer validation loop.
Interpreting individual coefficients. Ridge coefficients are biased, and their magnitudes depend on λ . Do not use individual ridge coefficients to make claims about which predictors are “important.” If variable importance is the goal, use permutation importance or switch to LASSO.
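The first pitfall (forgetting to standardize) is easy to demonstrate. In this synthetic sketch, an absorbance-scale predictor and a temperature-scale predictor contribute comparably to the response; after standardization their ridge coefficients land on a comparable scale:

```python
import numpy as np

def ridge_fit(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(4)
absorbance = rng.uniform(0, 2, size=100)       # range ~0-2
temperature = rng.uniform(20, 300, size=100)   # range ~20-300
X_raw = np.column_stack([absorbance, temperature])
y = 1.0 * absorbance + 0.01 * temperature + 0.05 * rng.normal(size=100)

# Standardize predictors (zero mean, unit variance) and center y
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
y_c = y - y.mean()

b = ridge_fit(X_std, y_c, 1.0)
print(np.round(np.abs(b), 2))  # both coefficients now on a comparable scale
```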
Next steps
Ridge regression introduces the fundamental idea of regularization — trading bias for variance to improve prediction. The same idea takes different forms in other methods:
LASSO regression: Replaces the L2 penalty ‖β‖² with the L1 penalty ‖β‖₁ = ∑|βⱼ|. The geometry of the L1 constraint (a diamond with corners on the axes) drives some coefficients exactly to zero, performing automatic variable selection.
Elastic net: Combines L1 and L2 penalties: α‖β‖₁ + (1 − α)‖β‖². Gets the variable selection of LASSO with the stability of ridge.
Principal component regression (PCR): Instead of penalizing coefficients, reduces dimensionality by regressing on the top principal components. Achieves a similar effect to ridge (suppresses directions with small variance) but is more interpretable when a few components dominate.
Partial least squares (PLS): Like PCR but constructs components that account for covariance with the response, not just variance in the predictors. Generally outperforms ridge for spectroscopic calibration.
[1] Hoerl, A.E., & Kennard, R.W. (1970). “Ridge Regression: Biased Estimation for Nonorthogonal Problems.” Technometrics, 12(1), 55-67.
[2] Tikhonov, A.N. (1943). “On the stability of inverse problems.” Doklady Akademii Nauk SSSR, 39(5), 195-198.
[3] Naes, T., & Martens, H. (1988). “Principal component regression in NIR analysis: Viewpoints, background details and selection of components.” Journal of Chemometrics, 2(2), 155-167.
[4] Golub, G.H., Heath, M., & Wahba, G. (1979). “Generalized Cross-Validation as a Method for Choosing a Good Ridge Parameter.” Technometrics, 21(2), 215-223.
[5] Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. Chapter 3.
[6] Brereton, R.G. (2003). Chemometrics: Data Analysis for the Laboratory and Chemical Plant. John Wiley & Sons.
[7] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. Springer. Chapter 6.