The idea of fitting a straight line to data goes back centuries, but chemistry rarely gives us the luxury of a single predictor. When a spectrophotometer records absorbance at 200 wavelengths and we want to predict the concentration of an analyte, we need a model that can handle all 200 variables at once. That model is multiple linear regression (MLR).
The mathematical machinery behind MLR — the matrix formulation of least squares — was assembled piece by piece during the first half of the twentieth century. Carl Friedrich Gauss had already described the method of least squares around 1795, and Adrien-Marie Legendre published the first formal account in 1805. But it was the statisticians of the early 1900s who generalized the idea to multiple predictors. R.A. Fisher, working at the Rothamsted Experimental Station in the 1920s and 1930s, developed the analysis of variance and the general linear model framework that placed multiple regression on rigorous statistical footing. His 1925 book Statistical Methods for Research Workers gave scientists a practical toolbox for handling several variables simultaneously [1].
Fisher’s framework was not purely theoretical. He applied multiple regression to agricultural field trials at Rothamsted, where crop yield depended on rainfall, temperature, soil nitrogen, and other factors simultaneously. His insight was that you could estimate the effect of each factor while accounting for all the others — a problem that single-variable analysis could not solve. The mathematical tools he developed — matrix algebra applied to linear models, the F-distribution for testing groups of coefficients, the concept of degrees of freedom — became the foundation that chemometricians would later build upon [3].
In chemistry, the need for multiple regression became acute with the rise of instrumental analysis in the mid-twentieth century. A UV-Vis spectrum contains absorbance readings at dozens or hundreds of wavelengths, and each wavelength carries partial information about the analyte concentration. Using just one wavelength — simple linear regression — throws away most of the available data. Multivariate calibration, the practice of building models that use many wavelengths to predict concentration, became a cornerstone of chemometrics by the 1970s and 1980s [2]. MLR was the natural first attempt, and understanding where it succeeds and where it fails is essential for appreciating the more advanced methods (PCR, PLS) that came later.
Why one variable is not enough
In the simple linear regression article, we modeled a response y as a function of a single predictor x. That works when one measurement clearly determines the outcome — for example, absorbance at a single wavelength predicting a single analyte via Beer’s Law.
But consider a more realistic scenario. You have a mixture of three dyes in solution, and you measure the UV-Vis spectrum at 50 wavelengths. Each wavelength responds to all three dyes, with different sensitivities. No single wavelength can disentangle the contributions. You need a model that combines information from multiple wavelengths to predict each dye’s concentration.
This is where multiple linear regression enters. Instead of one slope and one intercept, we have many slopes — one for each predictor variable — plus an intercept:
y = β0 + β1x1 + β2x2 + ⋯ + βpxp + ε
Each coefficient βj tells us how much the response changes when predictor xj increases by one unit, holding all other predictors constant. That last phrase is critical. In simple regression, the slope has an unambiguous interpretation. In multiple regression, the interpretation of each coefficient depends on what other variables are in the model.
A concrete example: multi-analyte determination
Suppose you are analyzing pharmaceutical tablets that contain two active ingredients, A and B. You measure the UV-Vis spectrum at five wavelengths. Beer’s Law tells us that absorbance is additive:
aj = εAj · cA · l + εBj · cB · l + noise
where aj is the absorbance at wavelength j, εAj and εBj are the molar absorptivities of A and B at that wavelength, cA and cB are the concentrations, and l is the path length.
To determine cA from the spectrum, simple regression at one wavelength would fail if B also absorbs there. But with multiple wavelengths, we can set up an MLR model where the absorbances at all five wavelengths are the predictors and the known concentration of A (from reference analyses) is the response. The model learns a set of coefficients that, when applied to a new spectrum, disentangle the contributions of A and B and return the concentration of A. This is the essence of multivariate calibration.
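A small simulated version of this experiment makes the idea concrete. The absorptivities, concentrations, and noise level below are invented for illustration; the point is that the MLR calibration recovers cA even though B absorbs at every wavelength:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical molar absorptivities of A and B at five wavelengths
# (invented numbers for illustration; path length l = 1 cm)
eps_A = np.array([0.9, 1.4, 0.7, 0.2, 0.1])
eps_B = np.array([0.1, 0.3, 0.8, 1.2, 0.9])

# Calibration set: 20 mixtures with known (reference) concentrations
c_A = rng.uniform(0.1, 1.0, 20)
c_B = rng.uniform(0.1, 1.0, 20)

# Beer's Law: absorbances are additive, plus a little instrumental noise
A = np.outer(c_A, eps_A) + np.outer(c_B, eps_B) + rng.normal(0, 0.005, (20, 5))

# MLR calibration: absorbances at all 5 wavelengths predict c_A
X = np.column_stack([np.ones(20), A])           # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, c_A, rcond=None)  # least-squares coefficients

# Apply the model to a new mixture (true c_A = 0.5, c_B = 0.7)
a_new = 0.5 * eps_A + 0.7 * eps_B
pred = beta[0] + a_new @ beta[1:]
print(pred)   # close to 0.5, even though B absorbs at every wavelength
```

The same code with a single wavelength as predictor would give a biased answer whenever cB varies, which is exactly the failure mode described above.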
The model in matrix form
Writing out y=β0+β1x1+β2x2+⋯ for every observation and every predictor quickly becomes tedious. Matrix notation compresses everything into a single equation.
Suppose we have n observations and p predictor variables. We arrange the data into:
A response vector y of size n×1, containing the measured values (e.g., concentrations)
A design matrix X of size n×(p+1), where each row is one observation. The first column is all ones (for the intercept), and the remaining p columns hold the predictor values (e.g., absorbances at different wavelengths)
A coefficient vector β of size (p+1)×1, containing the intercept and all slopes
An error vector ε of size n×1, containing the noise
The model then collapses to y = Xβ + ε, where xij is the value of predictor j for observation i. The column of ones ensures that when we multiply Xβ, the first element of β acts as the intercept β0.
The normal equations
Just as in simple regression, we want to minimize the sum of squared residuals:
S(β) = Σi (yi − ŷi)² = (y − Xβ)ᵀ(y − Xβ)
To find the minimum, we take the derivative with respect to β and set it to zero. Expanding the expression:
S(β) = yᵀy − 2βᵀXᵀy + βᵀXᵀXβ
Differentiating with respect to β and setting the gradient to zero:
∂S/∂β = −2Xᵀy + 2XᵀXβ = 0
This gives us the normal equations:
XᵀX β̂ = Xᵀy
If XᵀX is invertible (we will see shortly when it is not), the solution is:
β̂ = (XᵀX)⁻¹Xᵀy
This is the ordinary least squares (OLS) estimator. The matrix (XᵀX)⁻¹Xᵀ is sometimes called the pseudo-inverse of X (denoted X⁺), and it appears throughout chemometrics.
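In NumPy the normal-equations solution is a few lines. The data here are synthetic; note that in practice an SVD-based solver such as `np.linalg.lstsq` is preferred over explicitly forming (XᵀX)⁻¹:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, well-conditioned example: n = 30 observations, p = 3 predictors
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # column of ones first
beta_true = np.array([2.0, 1.0, -0.5, 3.0])                 # intercept, then slopes
y = X @ beta_true + rng.normal(0, 0.1, n)

# Normal equations: solve (X^T X) beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically safer alternative (SVD-based), preferred in practice
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_hat, beta_lstsq))  # True: same answer when X is well-conditioned
```

For a well-conditioned design the two routes agree to machine precision; they diverge only when XᵀX approaches singularity, which is the multicollinearity problem discussed below.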
What the solution means
The OLS estimator has a clean interpretation: β̂ is the vector of coefficients that makes the predicted values ŷ = Xβ̂ as close as possible to the observed values y, measured by Euclidean distance (sum of squared differences).
Under the standard assumptions (errors are independent, identically distributed with mean zero and constant variance σ²), OLS has a remarkable property: the Gauss-Markov theorem guarantees that among all linear unbiased estimators, OLS has the smallest variance. In other words, no other unbiased linear method can do better. This result is what makes OLS the default starting point for regression.
Assumptions of MLR
The Gauss-Markov optimality result depends on several assumptions. When they hold, OLS is hard to beat. When they are violated, the consequences range from mild inefficiency to completely misleading results.
Linearity: The true relationship between y and X is linear. If the relationship is curved, the residuals will show systematic patterns. In spectroscopy, Beer’s Law is linear only within a concentration range — at high concentrations, deviations from linearity appear.
Full rank: The design matrix X has rank p+1 (no predictor is a perfect linear combination of the others). If this fails, XᵀX is singular and cannot be inverted. Near-violations (multicollinearity) produce nearly singular matrices with all the associated instabilities.
Homoscedasticity: The error variance is constant across all observations: Var(εi) = σ² for all i. In analytical chemistry, measurement precision often decreases at low concentrations (higher relative noise), violating this assumption. Weighted least squares is the standard remedy.
Independence: The errors εi are mutually independent. Time-series data (e.g., process monitoring) often violates this through autocorrelation. For calibration with independent samples, this assumption is usually reasonable.
Normal errors (for inference): The Gauss-Markov theorem requires only the first four assumptions for OLS to be the best linear unbiased estimator. But to construct confidence intervals and perform hypothesis tests (F-test, t-tests), we additionally need εi ∼ N(0, σ²). For large sample sizes, the central limit theorem provides approximate normality even when the errors are not exactly Gaussian.
The multicollinearity problem
Everything described so far works beautifully — as long as XᵀX is invertible and well-conditioned. In many chemistry applications, it is neither.
Multicollinearity occurs when two or more predictor variables are highly correlated with each other. In spectroscopy, this is the rule rather than the exception: absorbance at 500 nm and absorbance at 501 nm are measuring nearly the same thing. They are almost perfectly correlated.
When predictors are highly correlated, XᵀX becomes nearly singular. Its eigenvalues span many orders of magnitude, and the inverse (XᵀX)⁻¹ amplifies small perturbations into enormous changes in the coefficients. The practical consequences are severe:
Wildly unstable coefficients: Adding or removing a single observation can flip the sign of a coefficient. One coefficient might be +5000 and its neighbor -4998, with the two nearly canceling out. The individual values are meaningless even though their combination might still predict reasonably.
Inflated standard errors: The variance of each coefficient estimate is proportional to the corresponding diagonal element of (XᵀX)⁻¹. When the matrix is nearly singular, these diagonal elements are enormous, making every coefficient statistically insignificant even though the model as a whole may fit well.
Poor prediction on new data: A model with wild coefficients is essentially memorizing the noise in the training data. It will generalize poorly to new samples.
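These instabilities are easy to provoke. The sketch below (simulated data) builds two predictors that differ only by tiny noise, mimicking absorbances at adjacent wavelengths, and shows that the individual slopes are meaningless while their sum remains well determined:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two predictors that differ only by tiny noise, like absorbances at 500 and 501 nm
n = 40
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 1e-4, n)        # almost perfectly correlated with x1
y = x1 + x2 + rng.normal(0, 0.1, n)     # true model: y = x1 + x2 + noise

def ols(X, y):
    """OLS coefficients with an intercept prepended."""
    X1 = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0]

b_all = ols(np.column_stack([x1, x2]), y)
b_drop = ols(np.column_stack([x1, x2])[1:], y[1:])   # same fit minus one observation

# The individual slopes are wild and jump when one sample is removed,
# but their sum (the only combination the data can pin down) stays near 2
print(b_all[1], b_all[2], b_all[1] + b_all[2])
print(b_drop[1], b_drop[2], b_drop[1] + b_drop[2])
```

The sum b1 + b2 is stable because predictions only depend on the combination x1 + x2; the split between the two coefficients is determined almost entirely by noise.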
Diagnosing multicollinearity
The condition number of XᵀX is the primary diagnostic:
κ(XᵀX) = λmax / λmin
where λmax and λmin are the largest and smallest eigenvalues. Rules of thumb:
κ<30 : Multicollinearity is not a practical problem
κ>300 : Severe multicollinearity, OLS should not be trusted
For a typical NIR spectrum with 1000 wavelengths and 50 samples, the condition number can easily exceed 10¹⁰. In such cases, OLS is hopeless and we must turn to regularized methods.
Another useful diagnostic is the Variance Inflation Factor (VIF) for each predictor:
VIFj = 1 / (1 − Rj²)
where Rj² is the R² obtained by regressing predictor j on all other predictors. A VIF above 10 is traditionally considered problematic.
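Both diagnostics can be computed directly from the data matrix. A sketch with simulated predictors, one harmless and one a near-duplicate:

```python
import numpy as np

rng = np.random.default_rng(3)

# Three predictors: x1 and x2 independent, x3 a near-duplicate of x1
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.01 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Condition number of X^T X: ratio of the largest to the smallest eigenvalue
eig = np.linalg.eigvalsh(X.T @ X)      # eigenvalues, sorted ascending
kappa = eig[-1] / eig[0]

def vif(X, j):
    """VIF_j = 1 / (1 - R_j^2), from regressing predictor j on the others."""
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    resid = X[:, j] - others @ coef
    r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return 1 / (1 - r2)

print(f"kappa = {kappa:.2e}")                   # far above the rule-of-thumb 300
print([round(vif(X, j), 1) for j in range(3)])  # huge for x1 and x3, near 1 for x2
```

Note that the VIF correctly singles out the offending pair: x2 is innocent even though the overall condition number is terrible.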
Model evaluation
Once you have fitted an MLR model, you need to assess its quality. Several complementary metrics exist.
Adjusted R-squared
The ordinary R² measures the proportion of variance explained:
R² = 1 − SSres/SStot
But R² never decreases when a predictor is added, even a useless one, so it rewards needless complexity. The adjusted version penalizes model size:
Radj² = 1 − (1 − R²)(n − 1)/(n − p − 1)
where n is the number of observations and p is the number of predictors. Adding a useless variable increases p without reducing SSres enough, so Radj² decreases, signaling that the variable did not help.
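A quick numerical check of this behavior, using simulated data and an extra predictor deliberately constructed (by orthogonalization, a device chosen here so the result is exact) to add no explanatory power at all:

```python
import numpy as np

rng = np.random.default_rng(4)

n = 30
x1 = rng.normal(size=n)
y = 2 * x1 + rng.normal(0, 0.5, n)

# Build a "useless" predictor that, by construction, explains nothing new:
# orthogonalize a random vector against the intercept, x1, and the residuals
A = np.column_stack([np.ones(n), x1])
e = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
raw = rng.normal(size=n)
B = np.column_stack([np.ones(n), x1, e])
junk = raw - B @ np.linalg.lstsq(B, raw, rcond=None)[0]

def r2_and_adj(cols, y):
    """Plain and adjusted R^2 for an OLS fit with intercept."""
    X = np.column_stack([np.ones(len(y))] + cols)
    p = X.shape[1] - 1
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    adj = 1 - (1 - r2) * (len(y) - 1) / (len(y) - p - 1)
    return r2, adj

r2_1, adj_1 = r2_and_adj([x1], y)
r2_2, adj_2 = r2_and_adj([x1, junk], y)

print(round(r2_1, 4), round(r2_2, 4))   # plain R^2 does not drop
print(adj_2 < adj_1)                    # True: adjusted R^2 drops
```

Plain R² is unchanged (the new variable explains nothing), yet adjusted R² falls because p increased, which is exactly the penalty described above.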
The F-test
The overall F-test asks: “Does this model explain significantly more variance than a model with no predictors (just the mean)?”
F = [(SStot − SSres)/p] / [SSres/(n − p − 1)] = MSreg/MSres
A large F-value (relative to the F(p, n−p−1) distribution) indicates that at least some of the predictors are useful. In chemometrics, where we often have many correlated predictors, the overall F-test is almost always significant — it does not tell you which predictors matter.
Individual t-tests
To test whether a specific coefficient βj is significantly different from zero:
tj = β̂j / SE(β̂j)
where SE(β̂j) = √(σ̂² [(XᵀX)⁻¹]jj). Under multicollinearity, the standard errors are inflated (as discussed above), so coefficients that are genuinely important may appear insignificant. This is another symptom of the disease, not a failure of the test itself.
RMSEP and cross-validation
For predictive models, the most honest assessment is the root mean square error of prediction (RMSEP) computed on data that was not used to build the model:
RMSEP = √( (1/ntest) Σi (yi − ŷi)² )
Leave-one-out cross-validation (LOOCV) or k-fold cross-validation provides a practical estimate of RMSEP when a separate test set is not available. In LOOCV, each sample is left out once, the model is built on the remaining n−1 samples, and the left-out sample is predicted. The process repeats for all n samples, yielding n prediction errors. For MLR, there is an elegant shortcut: the LOOCV RMSE can be computed from a single model fit using the hat matrix H = X(XᵀX)⁻¹Xᵀ:
RMSECV = √( (1/n) Σi [ (yi − ŷi) / (1 − hii) ]² )
where hii is the i-th diagonal element of H, known as the leverage of observation i. Samples with high leverage (close to 1) have a disproportionate influence on the model and produce large cross-validated residuals.
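The shortcut is exact for OLS, which is easy to verify against the brute-force refits on a small simulated dataset:

```python
import numpy as np

rng = np.random.default_rng(5)

n, p = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(0, 0.3, n)

# Single fit: hat matrix H = X (X^T X)^{-1} X^T, leverages on its diagonal
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
resid = y - H @ y
rmse_shortcut = np.sqrt(np.mean((resid / (1 - h)) ** 2))

# Brute force: refit n times, leaving out each sample in turn
errs = []
for i in range(n):
    mask = np.arange(n) != i
    b, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    errs.append(y[i] - X[i] @ b)
rmse_brute = np.sqrt(np.mean(np.square(errs)))

print(np.isclose(rmse_shortcut, rmse_brute))  # True: the identity is exact for OLS
```

One model fit replaces n refits, which matters when cross-validation sits inside a variable-selection loop.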
Wavelength selection
One common strategy for making MLR work with spectroscopic data is to select a small subset of wavelengths before fitting the model. If we can identify, say, 5 wavelengths out of 200 that carry most of the information about the analyte, then MLR on those 5 wavelengths avoids multicollinearity entirely.
Several approaches exist:
Forward selection starts with no predictors and adds them one at a time, choosing at each step the variable that most improves the model (highest partial F-statistic or lowest AIC). The process stops when no remaining variable provides a statistically significant improvement.
Backward elimination starts with all predictors and removes them one at a time, dropping the least useful variable at each step.
Stepwise selection combines both: at each step, variables can be added or removed. This is the most common approach in practice, though it has well-known problems — the selected variables depend heavily on the order of addition and removal, and the resulting p-values are not valid because of the multiple testing inherent in the search.
Best subset selection evaluates all 2ᵖ possible subsets of predictors and picks the one that optimizes some criterion (AIC, BIC, or cross-validated RMSE). This is optimal in principle but computationally infeasible when p is large (with 200 wavelengths, there are more subsets than atoms in the universe).
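A minimal forward-selection sketch ties these ideas together, using the hat-matrix LOOCV shortcut as the selection criterion. The data are simulated, with three informative "wavelengths" out of 100 (the cap of 10 variables is an arbitrary safeguard for this sketch):

```python
import numpy as np

rng = np.random.default_rng(6)

# 100 candidate "wavelengths", only columns 10, 40, 70 carry signal
n, p = 60, 100
X = rng.normal(size=(n, p))
y = 1.5 * X[:, 10] - 2.0 * X[:, 40] + 0.8 * X[:, 70] + rng.normal(0, 0.2, n)

def press(cols):
    """LOOCV sum of squared residuals via the hat-matrix shortcut."""
    A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    H = A @ np.linalg.solve(A.T @ A, A.T)
    r = (y - H @ y) / (1 - np.diag(H))
    return r @ r

# Greedy forward selection: add the column that most lowers the CV error,
# stop when nothing improves (capped at 10 variables for this sketch)
selected, best = [], press([])
while len(selected) < 10:
    scores = {c: press(selected + [c]) for c in range(p) if c not in selected}
    winner = min(scores, key=scores.get)
    if scores[winner] >= best:
        break
    selected.append(winner)
    best = scores[winner]

print(selected)   # the three informative columns are picked up first
```

Using a cross-validated criterion rather than the in-sample fit is what keeps the greedy search from simply absorbing noise variables, though (as noted above) stepwise procedures still invalidate the usual p-values.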
When to use and limitations
MLR works well when
Few, uncorrelated predictors
When you have a small number of predictors that are not strongly correlated, OLS gives excellent, interpretable results.
More observations than predictors
The condition n > p is required for OLS to have a unique solution, and n ≫ p is preferred for stable estimates.
Well-designed experiments
Factorial or response surface designs produce orthogonal predictors, eliminating multicollinearity by construction.
Variable selection has been done
If you have already selected a small set of informative wavelengths (e.g., via stepwise selection or expert knowledge), MLR can work well.
MLR struggles when
Spectroscopic data (many correlated wavelengths)
This is the most common situation in chemometrics, and it is precisely where OLS fails. The condition number explodes and coefficients become meaningless.
More variables than observations (p > n)
The system is underdetermined and XᵀX is singular. OLS has infinitely many solutions.
Noisy predictors
When measurement noise in the predictors is comparable to the signal, OLS amplifies the noise through unstable coefficients.
No variable selection
Throwing all available wavelengths into an MLR model almost always produces poor results for spectroscopic data.
Next steps
Multiple linear regression is the starting point for understanding every regression method in chemometrics. Its failure modes — particularly multicollinearity and the p>n problem — motivate every technique that follows:
Ridge regression: Adds a penalty term λ‖β‖² to stabilize the coefficients. The simplest fix for multicollinearity.
LASSO regression: Uses an L1 penalty λ‖β‖₁ that drives some coefficients exactly to zero, performing automatic variable selection.
Principal component regression (PCR): Replaces the original correlated predictors with a smaller set of uncorrelated principal components, then regresses on those.
Partial least squares (PLS): Like PCR, but constructs components that maximize covariance with the response, not just variance in the predictors. The workhorse of chemometric calibration.
The key insight to carry forward: OLS is optimal under the Gauss-Markov conditions, but those conditions are violated in almost every spectroscopic dataset. Recognizing this is what separates naive regression from chemometric modeling.
References
[1] Fisher, R.A. (1925). Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh.
[2] Martens, H., & Næs, T. (1989). Multivariate Calibration. John Wiley & Sons.
[3] Draper, N.R., & Smith, H. (1998). Applied Regression Analysis (3rd ed.). John Wiley & Sons.
[4] Wold, S., Sjöström, M., & Eriksson, L. (2001). “PLS-regression: a basic tool of chemometrics.” Chemometrics and Intelligent Laboratory Systems, 58(2), 109-130.
[5] Brereton, R.G. (2003). Chemometrics: Data Analysis for the Laboratory and Chemical Plant. John Wiley & Sons.
[6] Montgomery, D.C., Peck, E.A., & Vining, G.G. (2012). Introduction to Linear Regression Analysis (5th ed.). John Wiley & Sons.