
Multiple Linear Regression

The idea of fitting a straight line to data goes back centuries, but chemistry rarely gives us the luxury of a single predictor. When a spectrophotometer records absorbance at 200 wavelengths and we want to predict the concentration of an analyte, we need a model that can handle all 200 variables at once. That model is multiple linear regression (MLR).

The mathematical machinery behind MLR — the matrix formulation of least squares — was assembled piece by piece during the first half of the twentieth century. Carl Friedrich Gauss had already described the method of least squares around 1795, and Adrien-Marie Legendre published the first formal account in 1805. But it was the statisticians of the early 1900s who generalized the idea to multiple predictors. R.A. Fisher, working at the Rothamsted Experimental Station in the 1920s and 1930s, developed the analysis of variance and the general linear model framework that placed multiple regression on rigorous statistical footing. His 1925 book Statistical Methods for Research Workers gave scientists a practical toolbox for handling several variables simultaneously [1].

Fisher’s framework was not purely theoretical. He applied multiple regression to agricultural field trials at Rothamsted, where crop yield depended on rainfall, temperature, soil nitrogen, and other factors simultaneously. His insight was that you could estimate the effect of each factor while accounting for all the others — a problem that single-variable analysis could not solve. The mathematical tools he developed — matrix algebra applied to linear models, the F-distribution for testing groups of coefficients, the concept of degrees of freedom — became the foundation that chemometricians would later build upon [3].

In chemistry, the need for multiple regression became acute with the rise of instrumental analysis in the mid-twentieth century. A UV-Vis spectrum contains absorbance readings at dozens or hundreds of wavelengths, and each wavelength carries partial information about the analyte concentration. Using just one wavelength — simple linear regression — throws away most of the available data. Multivariate calibration, the practice of building models that use many wavelengths to predict concentration, became a cornerstone of chemometrics by the 1970s and 1980s [2]. MLR was the natural first attempt, and understanding where it succeeds and where it fails is essential for appreciating the more advanced methods (PCR, PLS) that came later.

Why one variable is not enough

In the simple linear regression article, we modeled a response $y$ as a function of a single predictor $x$. That works when one measurement clearly determines the outcome — for example, absorbance at a single wavelength predicting a single analyte via Beer’s Law.

But consider a more realistic scenario. You have a mixture of three dyes in solution, and you measure the UV-Vis spectrum at 50 wavelengths. Each wavelength responds to all three dyes, with different sensitivities. No single wavelength can disentangle the contributions. You need a model that combines information from multiple wavelengths to predict each dye’s concentration.

This is where multiple linear regression enters. Instead of one slope and one intercept, we have many slopes — one for each predictor variable — plus an intercept:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$$

Each coefficient $\beta_j$ tells us how much the response changes when predictor $x_j$ increases by one unit, holding all other predictors constant. That last phrase is critical. In simple regression, the slope has an unambiguous interpretation. In multiple regression, the interpretation of each coefficient depends on what other variables are in the model.

A concrete example: multi-analyte determination

Suppose you are analyzing pharmaceutical tablets that contain two active ingredients, A and B. You measure the UV-Vis spectrum at five wavelengths. Beer’s Law tells us that absorbance is additive:

$$A_\lambda = \varepsilon_{A,\lambda}\, b\, c_A + \varepsilon_{B,\lambda}\, b\, c_B$$

where $A_\lambda$ is the absorbance at wavelength $\lambda$, $\varepsilon_{A,\lambda}$ and $\varepsilon_{B,\lambda}$ are the molar absorptivities of A and B at that wavelength, $c_A$ and $c_B$ are the concentrations, and $b$ is the path length.

To determine $c_A$ from the spectrum, simple regression at one wavelength would fail if B also absorbs there. But with multiple wavelengths, we can set up an MLR model where the absorbances at all five wavelengths are the predictors and the known concentration of A (from reference analyses) is the response. The model learns a set of coefficients that, when applied to a new spectrum, disentangle the contributions of A and B and return the concentration of A. This is the essence of multivariate calibration.
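The two-analyte idea can be sketched numerically. The absorptivities, concentrations, and noise level below are made-up illustration values, not data from the text; the point is only that least squares on five wavelengths recovers $c_A$ even though B absorbs everywhere:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical molar absorptivities of A and B at five wavelengths
# (illustrative numbers; path length b = 1 cm)
eps_A = np.array([0.9, 0.7, 0.4, 0.2, 0.1])
eps_B = np.array([0.1, 0.3, 0.5, 0.8, 0.9])

# Calibration set: known concentrations in 20 standards
c_A = rng.uniform(0.1, 1.0, 20)
c_B = rng.uniform(0.1, 1.0, 20)

# Simulated spectra via additive Beer's Law plus small noise, shape (20, 5)
A = np.outer(c_A, eps_A) + np.outer(c_B, eps_B)
A += rng.normal(0, 0.002, A.shape)

# MLR: absorbances at all five wavelengths predict c_A
X = np.column_stack([np.ones(len(c_A)), A])  # intercept column
beta, *_ = np.linalg.lstsq(X, c_A, rcond=None)

# Predict c_A for a new mixture (c_A = 0.6, c_B = 0.4), noise-free spectrum
a_new = 0.6 * eps_A + 0.4 * eps_B
print(beta[0] + a_new @ beta[1:])  # close to 0.6 despite B absorbing
```

No single wavelength here isolates A, yet the fitted coefficient vector effectively cancels B's contribution — the multivariate-calibration idea in miniature.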

The model in matrix form

Writing out the model for every observation and every predictor quickly becomes tedious. Matrix notation compresses everything into a single equation.

Suppose we have $n$ observations and $p$ predictor variables. We arrange the data into:

  • A response vector $\mathbf{y}$ of size $n \times 1$, containing the measured values (e.g., concentrations)
  • A design matrix $\mathbf{X}$ of size $n \times (p+1)$, where each row is one observation. The first column is all ones (for the intercept), and the remaining columns hold the predictor values (e.g., absorbances at different wavelengths)
  • A coefficient vector $\boldsymbol{\beta}$ of size $(p+1) \times 1$, containing the intercept and all slopes
  • An error vector $\boldsymbol{\varepsilon}$ of size $n \times 1$, containing the noise

The model is:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

Written out, the design matrix looks like:

$$\mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}$$

where $x_{ij}$ is the value of predictor $j$ for observation $i$. The column of ones ensures that when we multiply $\mathbf{X}\boldsymbol{\beta}$, the first element of $\boldsymbol{\beta}$ acts as the intercept $\beta_0$.
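As a minimal sketch (toy numbers, not from the text), the design matrix above is just a column of ones prepended to the raw predictor matrix:

```python
import numpy as np

# Toy data: n = 4 observations, p = 2 predictors
X_raw = np.array([[0.2, 1.1],
                  [0.5, 0.9],
                  [0.8, 0.7],
                  [1.1, 0.4]])

n, p = X_raw.shape
X = np.column_stack([np.ones(n), X_raw])  # design matrix, shape (n, p + 1)

print(X.shape)  # (4, 3)
```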

The normal equations

Just as in simple regression, we want to minimize the sum of squared residuals:

$$\mathrm{RSS}(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$

To find the minimum, we take the derivative with respect to $\boldsymbol{\beta}$ and set it to zero. Expanding the expression:

$$\mathrm{RSS}(\boldsymbol{\beta}) = \mathbf{y}^\top\mathbf{y} - 2\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y} + \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}$$

Differentiating with respect to $\boldsymbol{\beta}$ and setting the gradient to zero:

$$\frac{\partial\,\mathrm{RSS}}{\partial\boldsymbol{\beta}} = -2\mathbf{X}^\top\mathbf{y} + 2\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} = \mathbf{0}$$

This gives us the normal equations:

$$\mathbf{X}^\top\mathbf{X}\,\boldsymbol{\beta} = \mathbf{X}^\top\mathbf{y}$$

If $\mathbf{X}^\top\mathbf{X}$ is invertible (we will see shortly when it is not), the solution is:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$$

This is the ordinary least squares (OLS) estimator. The matrix $(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$ is sometimes called the pseudo-inverse of $\mathbf{X}$ (denoted $\mathbf{X}^{+}$), and it appears throughout chemometrics.
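A quick numerical sanity check of the closed form, on simulated well-conditioned data (the coefficients and noise level are invented for the sketch): the normal-equation solution should match what a QR-based least-squares solver returns.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: 30 observations, 3 well-conditioned predictors
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -0.5, 0.7])
y = X @ beta_true + rng.normal(0, 0.1, n)

# Direct solve of the normal equations X'X b = X'y
# (fine here, but numerically risky when X'X is ill-conditioned)
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Safer route: QR-based least squares on X itself
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_ne, beta_lstsq))  # True
```

In practice, libraries solve the least-squares problem via QR or SVD rather than explicitly inverting $\mathbf{X}^\top\mathbf{X}$, since squaring the matrix squares its condition number.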

What the solution means

The OLS estimator has a clean interpretation: $\hat{\boldsymbol{\beta}}$ is the vector of coefficients that makes the predicted values $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$ as close as possible to the observed values $\mathbf{y}$, measured by Euclidean distance (sum of squared differences).

Under the standard assumptions (errors are independent, identically distributed with mean zero and constant variance $\sigma^2$), OLS has a remarkable property: the Gauss-Markov theorem guarantees that among all linear unbiased estimators, OLS has the smallest variance. In other words, no other unbiased linear method can do better. This result is what makes OLS the default starting point for regression.

Assumptions of MLR

The Gauss-Markov optimality result depends on several assumptions. When they hold, OLS is hard to beat. When they are violated, the consequences range from mild inefficiency to completely misleading results.

  1. Linearity: The true relationship between the response $y$ and the predictors is linear. If the relationship is curved, the residuals will show systematic patterns. In spectroscopy, Beer’s Law is linear only within a concentration range — at high concentrations, deviations from linearity appear.

  2. Full rank: The design matrix $\mathbf{X}$ has rank $p+1$ (no predictor is a perfect linear combination of the others). If this fails, $\mathbf{X}^\top\mathbf{X}$ is singular and cannot be inverted. Near-violations (multicollinearity) produce nearly singular matrices with all the associated instabilities.

  3. Homoscedasticity: The error variance is constant across all observations: $\operatorname{Var}(\varepsilon_i) = \sigma^2$ for all $i$. In analytical chemistry, measurement precision often decreases at low concentrations (higher relative noise), violating this assumption. Weighted least squares is the standard remedy.

  4. Independence: The errors are mutually independent. Time-series data (e.g., process monitoring) often violates this through autocorrelation. For calibration with independent samples, this assumption is usually reasonable.

  5. Normal errors (for inference): The Gauss-Markov theorem requires only assumptions 1-4 for OLS to be the best linear unbiased estimator. But to construct confidence intervals and perform hypothesis tests (F-test, t-tests), we additionally need $\varepsilon_i \sim N(0, \sigma^2)$. For large sample sizes, the central limit theorem provides approximate normality even when the errors are not exactly Gaussian.

The multicollinearity problem

Everything described so far works beautifully — as long as $\mathbf{X}^\top\mathbf{X}$ is invertible and well-conditioned. In many chemistry applications, it is neither.

Multicollinearity occurs when two or more predictor variables are highly correlated with each other. In spectroscopy, this is the rule rather than the exception: absorbance at 500 nm and absorbance at 501 nm are measuring nearly the same thing. They are almost perfectly correlated.

When predictors are highly correlated, $\mathbf{X}^\top\mathbf{X}$ becomes nearly singular. Its eigenvalues span many orders of magnitude, and the inverse amplifies small perturbations into enormous changes in the coefficients. The practical consequences are severe:

  • Wildly unstable coefficients: Adding or removing a single observation can flip the sign of a coefficient. One coefficient might be +5000 and its neighbor -4998, with the two nearly canceling out. The individual values are meaningless even though their combination might still predict reasonably.

  • Inflated standard errors: The variance of each coefficient estimate is proportional to the corresponding diagonal element of $(\mathbf{X}^\top\mathbf{X})^{-1}$. When the matrix is nearly singular, these diagonal elements are enormous, making every coefficient statistically insignificant even though the model as a whole may fit well.

  • Poor prediction on new data: A model with wild coefficients is essentially memorizing the noise in the training data. It will generalize poorly to new samples.
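A small simulation (illustrative values chosen for this sketch) makes the instability concrete: two nearly identical predictors receive essentially arbitrary individual slopes, yet their sum still recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 40

# One underlying signal, measured twice with tiny independent noise,
# so x1 and x2 are almost perfectly correlated
signal = rng.uniform(0, 1, n)
x1 = signal + rng.normal(0, 1e-4, n)
x2 = signal + rng.normal(0, 1e-4, n)
y = 3.0 * signal + rng.normal(0, 0.01, n)

X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# The split between the two slopes is essentially arbitrary (often
# large values of opposite sign), but their sum stays close to 3
print(beta[1], beta[2], beta[1] + beta[2])
```

This is exactly the +5000/−4998 pathology described above: the combination predicts fine, the individual coefficients mean nothing.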

Diagnosing multicollinearity

The condition number of $\mathbf{X}^\top\mathbf{X}$ is the primary diagnostic:

$$\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}$$

where $\lambda_{\max}$ and $\lambda_{\min}$ are the largest and smallest eigenvalues. Rules of thumb:

  • $\kappa < 100$: Multicollinearity is not a practical problem
  • $100 \le \kappa < 1000$: Moderate multicollinearity, coefficients becoming unreliable
  • $\kappa \ge 1000$: Severe multicollinearity, OLS should not be trusted

For a typical NIR spectrum with 1000 wavelengths and 50 samples, the condition number is far beyond these thresholds — with more wavelengths than samples, $\mathbf{X}^\top\mathbf{X}$ is in fact singular. In such cases, OLS is hopeless and we must turn to regularized methods.

Another useful diagnostic is the Variance Inflation Factor (VIF) for each predictor:

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$$

where $R_j^2$ is the $R^2$ obtained by regressing predictor $j$ on all other predictors. A VIF above 10 is traditionally considered problematic.
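The VIF definition translates directly into code. This is a sketch with a hypothetical helper `vif` and simulated data (two correlated columns, one independent), not part of the original text:

```python
import numpy as np

def vif(X):
    """VIF for each column of X: regress it on the other columns
    (with an intercept) and apply 1 / (1 - R^2)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        tss = (target - target.mean()) @ (target - target.mean())
        r2 = 1 - (resid @ resid) / tss
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(3)
a = rng.normal(size=100)
X = np.column_stack([a + rng.normal(0, 0.1, 100),  # correlated pair
                     a + rng.normal(0, 0.1, 100),
                     rng.normal(size=100)])        # independent column
print(vif(X))  # first two VIFs well above 10, third near 1
```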

Model evaluation

Once you have fitted an MLR model, you need to assess its quality. Several complementary metrics exist.

Adjusted R-squared

The ordinary $R^2$ measures the proportion of variance explained:

$$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$$

where $\mathrm{RSS} = \sum_i (y_i - \hat{y}_i)^2$ is the residual sum of squares and $\mathrm{TSS} = \sum_i (y_i - \bar{y})^2$ is the total sum of squares. But $R^2$ always increases (or stays the same) when you add more predictors, even useless ones. The adjusted $R^2$ penalizes for model complexity:

$$R^2_{\mathrm{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$$

where $n$ is the number of observations and $p$ is the number of predictors. Adding a useless variable increases $p$ without reducing $\mathrm{RSS}$ enough, so $R^2_{\mathrm{adj}}$ decreases — signaling that the variable did not help.

The F-test

The overall F-test asks: “Does this model explain significantly more variance than a model with no predictors (just the mean)?”

$$F = \frac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/(n - p - 1)}$$

A large F-value (relative to the $F_{p,\,n-p-1}$ distribution) indicates that at least some of the predictors are useful. In chemometrics, where we often have many correlated predictors, the overall F-test is almost always significant — it does not tell you which predictors matter.

Individual t-tests

To test whether a specific coefficient $\beta_j$ is significantly different from zero:

$$t_j = \frac{\hat{\beta}_j}{\mathrm{SE}(\hat{\beta}_j)}$$

where $\mathrm{SE}(\hat{\beta}_j) = \hat{\sigma}\sqrt{[(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj}}$. Under multicollinearity, the standard errors are inflated (as discussed above), so coefficients that are genuinely important may appear insignificant. This is another symptom of the disease, not a failure of the test itself.
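The standard-error formula can be computed by hand; a sketch on simulated data (the true coefficients here are invented, with the last predictor deliberately useless):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.8, 0.0]) + rng.normal(0, 0.2, n)  # x2 is useless

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta
sigma2_hat = resid @ resid / (n - p - 1)      # unbiased error variance
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))   # SE(beta_j)
t = beta / se

# |t| is large for the real predictor x1; the useless x2
# typically lands near zero
print(t)
```

Comparing each $t_j$ against the $t_{n-p-1}$ distribution then gives the usual p-values.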

RMSEP and cross-validation

For predictive models, the most honest assessment is the root mean square error of prediction (RMSEP) computed on data that was not used to build the model:

$$\mathrm{RMSEP} = \sqrt{\frac{1}{n_{\mathrm{test}}}\sum_{i=1}^{n_{\mathrm{test}}} (y_i - \hat{y}_i)^2}$$

Leave-one-out cross-validation (LOOCV) or k-fold cross-validation provides a practical estimate of RMSEP when a separate test set is not available. In LOOCV, each sample is left out once, the model is built on the remaining samples, and the left-out sample is predicted. The process repeats for all samples, yielding $n$ prediction errors. For MLR, there is an elegant shortcut: the LOOCV RMSE can be computed from a single model fit using the hat matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$:

$$\mathrm{RMSE}_{\mathrm{CV}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - h_{ii}}\right)^2}$$

where $h_{ii}$ is the $i$-th diagonal element of $\mathbf{H}$, known as the leverage of observation $i$. Samples with high leverage (close to 1) have a disproportionate influence on the model and produce large cross-validated residuals.
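The hat-matrix shortcut can be verified against brute-force leave-one-out refitting; a sketch on simulated data (sizes and noise are arbitrary choices for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ rng.normal(size=p + 1) + rng.normal(0, 0.3, n)

# Shortcut: one fit plus the hat-matrix diagonal (the leverages)
H = X @ np.linalg.inv(X.T @ X) @ X.T
resid = y - H @ y
loo_short = np.sqrt(np.mean((resid / (1 - np.diag(H))) ** 2))

# Brute force: refit n times, predicting each left-out sample
errs = []
for i in range(n):
    mask = np.arange(n) != i
    b, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    errs.append(y[i] - X[i] @ b)
loo_brute = np.sqrt(np.mean(np.square(errs)))

print(np.isclose(loo_short, loo_brute))  # True: the identity is exact
```

For OLS this identity is exact (the PRESS statistic), so LOOCV costs no more than a single fit.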

Code implementation

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score, mean_squared_error
import matplotlib.pyplot as plt

# --- Simulate spectroscopic calibration data ---
np.random.seed(42)
n_samples = 50
n_wavelengths = 20

# True concentrations
y = np.random.uniform(0.1, 5.0, n_samples)

# Generate spectra: each wavelength responds to concentration
# with some overlap (correlated predictors)
wavelengths = np.linspace(400, 700, n_wavelengths)
X = np.zeros((n_samples, n_wavelengths))
for j in range(n_wavelengths):
    # Each wavelength has a sensitivity that depends on position
    sensitivity = np.exp(-((wavelengths[j] - 550) ** 2) / 5000)
    X[:, j] = sensitivity * y + np.random.normal(0, 0.05, n_samples)

# --- Fit multiple linear regression ---
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)

# Training metrics
r2 = r2_score(y, y_pred)
rmse_train = np.sqrt(mean_squared_error(y, y_pred))
n, p = X.shape
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"R²: {r2:.4f}")
print(f"Adjusted R²: {r2_adj:.4f}")
print(f"RMSE (train): {rmse_train:.4f}")

# Cross-validated RMSE
cv_scores = cross_val_score(model, X, y,
                            cv=5, scoring='neg_mean_squared_error')
rmsecv = np.sqrt(-cv_scores.mean())
print(f"RMSECV: {rmsecv:.4f}")

# --- Diagnose multicollinearity ---
XtX = X.T @ X
eigenvalues = np.linalg.eigvalsh(XtX)  # returned in ascending order
condition_number = eigenvalues[-1] / eigenvalues[0]
print(f"\nCondition number: {condition_number:.1f}")

# Coefficient plot
plt.figure(figsize=(10, 4))
plt.bar(range(n_wavelengths), model.coef_)
plt.xlabel("Wavelength index")
plt.ylabel("Coefficient value")
plt.title("MLR Coefficients — note erratic values from multicollinearity")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```

Variable selection

One common strategy for making MLR work with spectroscopic data is to select a small subset of wavelengths before fitting the model. If we can identify, say, 5 wavelengths out of 200 that carry most of the information about the analyte, then MLR on those 5 wavelengths avoids multicollinearity entirely.

Several approaches exist:

Forward selection starts with no predictors and adds them one at a time, choosing at each step the variable that most improves the model (highest partial F-statistic or lowest AIC). The process stops when no remaining variable provides a statistically significant improvement.

Backward elimination starts with all predictors and removes them one at a time, dropping the least useful variable at each step.

Stepwise selection combines both: at each step, variables can be added or removed. This is the most common approach in practice, though it has well-known problems — the selected variables depend heavily on the order of addition and removal, and the resulting p-values are not valid because of the multiple testing inherent in the search.

Best subset selection evaluates all possible subsets of predictors and picks the one that optimizes some criterion (AIC, BIC, or cross-validated RMSE). This is optimal in principle but computationally infeasible when is large (with 200 wavelengths, there are more subsets than atoms in the universe).
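Forward selection with a cross-validated RMSE criterion can be sketched in a few lines. The helpers `cv_rmse` and `forward_select` and the toy data below are illustrative constructions, not from the original text:

```python
import numpy as np

def cv_rmse(X, y, k=5):
    """k-fold cross-validated RMSE of an OLS model with intercept."""
    n = len(y)
    idx = np.arange(n)
    errs = []
    for f in range(k):
        test = idx[f::k]
        train = np.setdiff1d(idx, test)
        Xtr = np.column_stack([np.ones(len(train)), X[train]])
        Xte = np.column_stack([np.ones(len(test)), X[test]])
        b, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
        errs.extend(y[test] - Xte @ b)
    return np.sqrt(np.mean(np.square(errs)))

def forward_select(X, y, max_vars=5):
    """Greedily add the column that lowers CV RMSE the most;
    stop when no remaining column improves the model."""
    selected, remaining = [], list(range(X.shape[1]))
    best = np.inf
    while remaining and len(selected) < max_vars:
        scores = {j: cv_rmse(X[:, selected + [j]], y) for j in remaining}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:
            break
        best = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected

# Toy data: only columns 0 and 3 carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(0, 0.1, 60)
print(forward_select(X, y))  # expect columns 0 and 3 to be chosen first
```

Using a cross-validated criterion rather than in-sample p-values sidesteps some (not all) of the multiple-testing problems noted above.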

When to use and limitations

MLR works well when

Few, uncorrelated predictors: When you have a small number of predictors that are not strongly correlated, OLS gives excellent, interpretable results.

More observations than predictors: $n \ge p + 1$ is required for OLS to have a unique solution, and $n \gg p$ is preferred for stable estimates.

Well-designed experiments: Factorial or response surface designs produce orthogonal predictors, eliminating multicollinearity by construction.

Variable selection has been done: If you have already selected a small set of informative wavelengths (e.g., via stepwise selection or expert knowledge), MLR can work well.

MLR struggles when

Spectroscopic data (many correlated wavelengths): This is the most common situation in chemometrics, and it is precisely where OLS fails. The condition number explodes and coefficients become meaningless.

More variables than observations ($p > n$): The system is underdetermined and $\mathbf{X}^\top\mathbf{X}$ is singular. OLS has infinitely many solutions.

Noisy predictors: When measurement noise in the predictors is comparable to the signal, OLS amplifies the noise through unstable coefficients.

No variable selection: Throwing all available wavelengths into an MLR model almost always produces poor results for spectroscopic data.

Next steps

Multiple linear regression is the starting point for understanding every regression method in chemometrics. Its failure modes — particularly multicollinearity and the $p > n$ problem — motivate every technique that follows:

  • Ridge regression: Adds a penalty term to stabilize the coefficients. The simplest fix for multicollinearity.
  • LASSO regression: Uses an L1 penalty that drives some coefficients exactly to zero, performing automatic variable selection.
  • Principal component regression (PCR): Replaces the original correlated predictors with a smaller set of uncorrelated principal components, then regresses on those.
  • Partial least squares (PLS): Like PCR, but constructs components that maximize covariance with the response, not just variance in the predictors. The workhorse of chemometric calibration.

The key insight to carry forward: OLS is optimal under the Gauss-Markov conditions, but those conditions are violated in almost every spectroscopic dataset. Recognizing this is what separates naive regression from chemometric modeling.

References

[1] Fisher, R.A. (1925). Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh.

[2] Martens, H., & Naes, T. (1989). Multivariate Calibration. John Wiley & Sons.

[3] Draper, N.R., & Smith, H. (1998). Applied Regression Analysis (3rd ed.). John Wiley & Sons.

[4] Wold, S., Sjostrom, M., & Eriksson, L. (2001). “PLS-regression: a basic tool of chemometrics.” Chemometrics and Intelligent Laboratory Systems, 58(2), 109-130.

[5] Brereton, R.G. (2003). Chemometrics: Data Analysis for the Laboratory and Chemical Plant. John Wiley & Sons.

[6] Montgomery, D.C., Peck, E.A., & Vining, G.G. (2012). Introduction to Linear Regression Analysis (5th ed.). John Wiley & Sons.