The word “regression” entered statistics through an unlikely door: the study of human height. In 1886, the English polymath Francis Galton published a paper titled Regression towards Mediocrity in Hereditary Stature, in which he measured the heights of 928 adult children and their parents. He observed that unusually tall parents tended to have children who were tall, but not quite as tall as their parents — and unusually short parents tended to have children who were short, but not quite as short. The children’s heights “regressed” toward the population average. Galton drew a straight line through a scatter of parent-child height pairs, and in doing so created the first regression analysis.
Galton’s colleague Karl Pearson took the idea further. Over the following decade, Pearson developed the correlation coefficient $r$, formalized the method of fitting a straight line to bivariate data, and built the mathematical framework that turned Galton’s empirical observation into a general statistical tool. By 1903, Pearson and his students had applied the method to problems ranging from skull measurements to meteorological data.
But the mathematical engine underneath regression — the method of least squares — had been invented a full eighty years earlier, and in a completely different field. In 1805, the French mathematician Adrien-Marie Legendre published the first description of the least squares method in his work on determining the orbits of comets. He proposed minimizing the sum of squared residuals as the criterion for the “best” fit, without any probabilistic justification. Four years later, Carl Friedrich Gauss provided that justification: if measurement errors follow a normal (Gaussian) distribution, then the least squares solution is also the maximum likelihood estimate. Gauss claimed he had been using the method since 1795, though he published after Legendre. The priority dispute between them was bitter and never fully resolved.
What matters for us is how the method migrated from astronomy to chemistry. Throughout the 19th and 20th centuries, analytical chemists needed to convert instrument readings into concentrations. The standard procedure — still used in every analytical chemistry laboratory today — is the calibration curve: prepare a set of standards with known concentrations, measure each one on the instrument, plot signal versus concentration, and fit a straight line. That line is a linear regression model. When August Beer formalized his law of light absorption in 1852, the linear relationship between absorbance and concentration gave calibration curves a firm theoretical basis. Linear regression became the workhorse of quantitative chemical analysis.
The calibration problem in chemistry
Suppose you need to determine how much lead is in a set of drinking water samples. You have an atomic absorption spectrometer that measures absorbance, and you know from Beer’s Law that absorbance is proportional to concentration. But you don’t know the proportionality constant for your particular instrument, cuvette path length, and wavelength setting. You need a calibration model.
The procedure is straightforward:
1. **Prepare standards.** Make 5–8 solutions with known lead concentrations (say 0, 5, 10, 20, 50, 100 ppb).
2. **Measure each standard.** Run each solution through the spectrometer and record the absorbance.
3. **Fit a line.** Plot absorbance vs. concentration and find the straight line that best describes the relationship.
4. **Predict unknowns.** Measure the absorbance of your unknown samples and use the fitted line to read off their concentrations.
This is linear regression in its most natural chemical setting. The “line of best fit” is not drawn by eye (as was common before the 1970s) but computed by a precise mathematical criterion: least squares. The line that minimizes the total squared distance between the observed points and the line is, in a well-defined sense, the best one.
The mathematical model
The simplest linear regression model relates a response variable $y$ to a single predictor $x$:

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, 2, \ldots, n$$
Each symbol has a concrete chemical meaning:
- $y_i$ is the measured response for the $i$-th sample (e.g., absorbance at 283.3 nm)
- $x_i$ is the known predictor value (e.g., lead concentration in ppb)
- $\beta_0$ is the intercept — the expected response when $x = 0$ (ideally zero for a blank, but often slightly nonzero due to stray light, solvent absorption, or detector offset)
- $\beta_1$ is the slope — how much the response changes per unit change in $x$ (the sensitivity of the method)
- $\varepsilon_i$ is the error (or residual) — the difference between what we actually measured and what the model predicts, arising from random measurement noise
The model says that the data are generated by a deterministic linear relationship $\beta_0 + \beta_1 x$ plus random noise $\varepsilon$. We observe $x$ and $y$; we want to estimate $\beta_0$ and $\beta_1$.
Finding the best line
We have $n$ data points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ and we want the values of $\beta_0$ and $\beta_1$ that make the line fit the data as closely as possible. The standard criterion is ordinary least squares (OLS): minimize the sum of squared residuals.
Define the residual for the $i$-th observation as:

$$e_i = y_i - (\beta_0 + \beta_1 x_i)$$
The sum of squared residuals is:

$$S(\beta_0, \beta_1) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$
To minimize $S$, take partial derivatives with respect to $\beta_0$ and $\beta_1$, set them to zero, and solve:

$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ are the sample means.
The formula for $\hat\beta_1$ has an intuitive reading: the slope is the ratio of “how much $x$ and $y$ vary together” (the covariance) to “how much $x$ varies on its own” (the variance). If $x$ and $y$ increase together, the slope is positive. If one increases while the other decreases, the slope is negative.
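The summation formulas translate directly into code. Here is a small pure-Python sketch; the $x$ and $y$ values are invented for illustration, not taken from any real calibration:

```python
# Least squares slope and intercept from the summation formulas.
# Data values are made up for illustration.
x = [0.0, 5.0, 10.0, 20.0, 50.0]
y = [0.01, 0.26, 0.50, 1.01, 2.49]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Slope: covariance of x and y divided by the variance of x
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta1 = s_xy / s_xx

# Intercept: forces the line through the point of means (x_bar, y_bar)
beta0 = y_bar - beta1 * x_bar

print(beta0, beta1)
```

Notice that the fitted line always passes through the point of means, a direct consequence of the intercept formula.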
Measuring model quality
After fitting the line, you need to know: is it actually any good? Two complementary metrics answer this question.
Coefficient of determination (R-squared)
The coefficient of determination $R^2$ measures the fraction of the total variability in $y$ that is explained by the linear relationship with $x$:

$$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}$$

where:

- $SS_{\text{res}} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the residual sum of squares — the variation the model fails to explain
- $SS_{\text{tot}} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares — the total variation in the data
$R^2$ ranges from 0 to 1. An $R^2$ of 0.998 (common for well-behaved Beer’s Law calibrations) means the linear model explains 99.8% of the variance in absorbance. An $R^2$ of 0.75 means 25% of the variance is unexplained — the model captures the general trend but misses something important.
Root mean square error (RMSE)
While $R^2$ is dimensionless, the RMSE tells you the model’s prediction error in the same units as the response:

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
If you are calibrating a UV-Vis spectrometer for copper determination and your RMSE is 0.003 absorbance units, your predictions are typically within 0.003 AU of the true value. This is directly interpretable: you can convert it to concentration units using the slope, giving you the method’s practical detection capability.
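Both metrics drop straight out of the residuals. A short sketch, again with invented numbers and a pre-fitted line (the coefficients below are assumed, not computed here):

```python
import math

# Observed data and a line fitted beforehand; all numbers are invented.
x = [0.0, 5.0, 10.0, 20.0, 50.0]
y = [0.01, 0.26, 0.50, 1.01, 2.49]
beta0, beta1 = 0.0103, 0.0496

# Fitted values and residuals
y_hat = [beta0 + beta1 * xi for xi in x]
residuals = [yi - yhi for yi, yhi in zip(y, y_hat)]

# R^2 = 1 - SS_res / SS_tot
ss_res = sum(e ** 2 for e in residuals)
y_bar = sum(y) / len(y)
ss_tot = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

# RMSE: typical prediction error in response units
rmse = math.sqrt(ss_res / len(y))

print(r_squared, rmse)
```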
Residual analysis
Numbers alone are not enough. The single most informative diagnostic is the residual plot: plot the residuals $e_i = y_i - \hat{y}_i$ against the fitted values $\hat{y}_i$ (or against $x_i$).
A well-behaved regression produces residuals that look like random scatter around zero — no trends, no patterns, no funnels. If you see structure in the residuals, something is wrong:
- A curved pattern indicates the true relationship is nonlinear. Consider a quadratic term or a transformation.
- A funnel shape (variance increasing with $x$) indicates heteroscedasticity. Weighted regression or a variance-stabilizing transformation may help.
- Clusters or runs of positive/negative residuals suggest autocorrelation or a missing variable.
- One or two points far from zero flag potential outliers worth investigating.
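The value of this diagnostic can be seen even without a plot. The sketch below uses synthetic data with curvature built in deliberately (a response that rolls off at high concentration), fits a straight line, and prints the residual signs, which cluster negative at the ends and positive in the middle — the curved pattern described above:

```python
# Fit a straight line to deliberately curved synthetic data
# (y = 0.05x - 0.0004x^2, mimicking a calibration that saturates).
x = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0]
y = [0.05 * xi - 0.0004 * xi ** 2 for xi in x]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
beta0 = y_bar - beta1 * x_bar

# Residuals of the straight-line fit
residuals = [yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y)]
signs = ["+" if e > 0 else "-" for e in residuals]
print(signs)  # negative at the ends, positive in the middle: curvature
```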
From simple to matrix form
Simple linear regression handles one predictor. But in chemistry, we often have multiple predictors — for example, absorbances at several wavelengths, or concentrations of several analytes. The matrix formulation generalizes the model to any number of predictors.
Write the model for all n observations at once:
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$
where:
- $\mathbf{y}$ is the $n \times 1$ vector of responses
- $\mathbf{X}$ is the $n \times p$ design matrix (first column is all ones for the intercept, remaining columns are predictors)
- $\boldsymbol{\beta}$ is the $p \times 1$ vector of coefficients
- $\boldsymbol{\varepsilon}$ is the $n \times 1$ vector of errors
For simple linear regression with one predictor, the design matrix looks like:

$$\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \qquad \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}$$
The least squares solution in matrix form is the famous normal equation:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}$$
This single expression replaces all the summation formulas from the simple case. It works for any number of predictors $p$, which is why it forms the basis for multiple linear regression, ridge regression, and ultimately the entire family of linear methods in chemometrics.
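As a check, the normal equation can be written in a few lines of NumPy and compared against `np.polyfit` (data invented for illustration):

```python
import numpy as np

# Synthetic calibration data (invented for illustration)
x = np.array([0.0, 5.0, 10.0, 20.0, 50.0])
y = np.array([0.01, 0.26, 0.50, 1.01, 2.49])

# Design matrix: a column of ones for the intercept, then the predictor
X = np.column_stack([np.ones_like(x), x])

# Normal equation: beta_hat = (X^T X)^{-1} X^T y.
# Solving the linear system is numerically safer than forming the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's own fit (polyfit returns highest degree first)
slope, intercept = np.polyfit(x, y, 1)
print(beta_hat, intercept, slope)
```

The same two lines of linear algebra work unchanged for any number of predictor columns in `X`.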
Real chemistry examples
Linear regression appears throughout analytical chemistry. Here are three common scenarios.
Scenario: Determine copper concentration in water samples using UV-Vis spectroscopy at 810 nm.
You prepare six standards and measure their absorbances:
| Concentration (mg/L) | Absorbance |
| --- | --- |
| 0.0 | 0.002 |
| 2.0 | 0.098 |
| 5.0 | 0.243 |
| 10.0 | 0.491 |
| 20.0 | 0.985 |
| 50.0 | 2.461 |
Beer’s Law predicts a linear relationship: $A = \varepsilon \ell c$, where $\varepsilon$ is the molar absorptivity, $\ell$ is the path length, and $c$ is the concentration. Fitting $A = \beta_0 + \beta_1 c$ gives:

- $\hat\beta_0 = -0.0005$ (very close to zero, as expected for a good blank)
- $\hat\beta_1 = 0.0492$ AU per mg/L (the method sensitivity)
- $R^2 = 0.9999$

To predict an unknown sample with absorbance 0.370: $c = (0.370 - (-0.0005))/0.0492 \approx 7.53$ mg/L.
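These numbers can be reproduced in a few lines (the final digits depend on rounding):

```python
import numpy as np

# Copper calibration data from the table above
c = np.array([0.0, 2.0, 5.0, 10.0, 20.0, 50.0])          # mg/L
A = np.array([0.002, 0.098, 0.243, 0.491, 0.985, 2.461])  # absorbance

# Fit the calibration line A = intercept + slope * c
slope, intercept = np.polyfit(c, A, 1)

# Invert the line to convert a measured absorbance into a concentration
A_unknown = 0.370
c_unknown = (A_unknown - intercept) / slope
print(slope, intercept, c_unknown)
```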
Scenario: Quantify polycyclic aromatic hydrocarbons (PAHs) in river water using fluorescence spectroscopy.
Fluorescence intensity is proportional to concentration at low levels, but the relationship can deviate at higher concentrations due to inner filter effects. A calibration using standards at 0, 1, 5, 10, 25, and 50 ng/mL might give:
- $R^2 = 0.9994$ for the range 0–25 ng/mL (linear)
- $R^2 = 0.9823$ for the range 0–50 ng/mL (slight curvature appears)
This illustrates a critical point: always check your calibration range. Linear regression assumes a linear relationship. If you push beyond the linear range, the model breaks down and predictions become biased. The residual plot will show the curvature clearly.
Scenario: Validate a moisture analyzer by comparing its readings to reference oven-drying results.
You analyze 20 grain samples using both methods and fit:
$$\text{Moisture}_{\text{analyzer}} = \beta_0 + \beta_1 \cdot \text{Moisture}_{\text{oven}}$$
If the analyzer is perfectly accurate, you expect $\beta_0 = 0$ and $\beta_1 = 1$. Statistical tests on the intercept and slope (using their standard errors) tell you whether the analyzer has a systematic bias ($\beta_0 \neq 0$) or a proportional error ($\beta_1 \neq 1$). This is method comparison — one of the most common uses of linear regression in analytical chemistry beyond calibration.
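A sketch of such a test on simulated data, in which the analyzer is given a deliberate +0.5 bias (all numbers are invented; the standard-error formulas are the usual OLS ones):

```python
import numpy as np

# Simulated method-comparison data: the analyzer reads 0.5 units high
# relative to the reference oven method.
rng = np.random.default_rng(0)
oven = np.linspace(8.0, 16.0, 20)                # reference moisture, %
analyzer = 0.5 + 1.0 * oven + rng.normal(0.0, 0.05, oven.size)

n = oven.size
slope, intercept = np.polyfit(oven, analyzer, 1)

# Standard errors of the coefficients from the residual variance
resid = analyzer - (intercept + slope * oven)
s2 = np.sum(resid**2) / (n - 2)                  # residual variance
s_xx = np.sum((oven - oven.mean()) ** 2)
se_slope = np.sqrt(s2 / s_xx)
se_intercept = np.sqrt(s2 * (1.0 / n + oven.mean() ** 2 / s_xx))

# t statistics for H0: intercept = 0 and H0: slope = 1
t_intercept = (intercept - 0.0) / se_intercept
t_slope = (slope - 1.0) / se_slope
print(t_intercept, t_slope)  # |t| above ~2.1 (18 df) flags a real deviation
```

Here the built-in bias produces a large $t$ statistic for the intercept, while the slope test stays near zero, exactly the pattern expected for a constant offset.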
Common pitfalls
Linear regression is robust and well-understood, but it can fail silently if its assumptions are violated. These are the problems you are most likely to encounter in chemical data.
Nonlinearity. The most common failure. Beer’s Law is linear only over a limited concentration range; at high concentrations, deviations appear due to molecular interactions, stray light, or detector saturation. If your residual plot shows a curve, the fix is usually to narrow the calibration range, add a quadratic term, or use a nonlinear model.
Outliers. A single outlier can dramatically shift the fitted line, especially with small sample sizes. Outliers in calibration data may come from preparation errors (wrong dilution), instrument glitches, or transcription mistakes. Identify them through the residual plot, investigate them (don’t just delete them blindly), and consider robust regression methods if outliers are frequent.
Heteroscedasticity. In many spectroscopic measurements, noise increases with signal intensity (think shot noise, which scales as the square root of the signal). This means the variance of $\varepsilon$ is not constant across the calibration range — a violation of the standard OLS assumption. The remedy is weighted least squares, where each point is weighted inversely to its variance, giving less noisy points more influence.
Extrapolation. A calibration curve is only valid within the range of your standards. Predicting beyond this range (extrapolation) is dangerous because you have no data to confirm the relationship holds. A calibration built from 0–50 mg/L standards should not be used to predict a sample at 200 mg/L.
Collinearity (in multiple regression). When two or more predictors are highly correlated — as they almost always are in spectroscopic data — the matrix $\mathbf{X}^{T}\mathbf{X}$ becomes nearly singular. The estimated coefficients become unstable: large in magnitude, sensitive to small changes in the data, and difficult to interpret. This is the collinearity problem, and it is the principal motivation for regularized methods like ridge regression and latent-variable methods like PCR and PLS.
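The heteroscedasticity remedy above can be sketched directly from the weighted normal equations. In this synthetic example the highest-concentration point carries a deliberate error, mimicking noise that grows with signal; weighting by inverse variance pulls the slope back toward the true value:

```python
import numpy as np

# Synthetic calibration on a true line y = 0.05 x, with a deliberate
# +0.5 error at the highest concentration (where noise would be largest).
x = np.array([1.0, 2.0, 5.0, 10.0, 20.0, 50.0])
y = 0.05 * x
y[-1] += 0.5

sd = 0.01 + 0.02 * x              # assumed noise model: sd grows with signal
w = 1.0 / sd**2                   # weight each point by 1 / variance

X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares for comparison
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Weighted normal equations: beta_hat = (X^T W X)^{-1} X^T W y
W = np.diag(w)
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

print(beta_ols[1], beta_wls[1])   # WLS slope sits closer to the true 0.05
```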
When to use (and when not to)
Good applications
- **Calibration curves.** The primary use case in analytical chemistry: Beer’s Law, electrode calibrations, instrument validation.
- **Method comparison.** Comparing two measurement techniques (e.g., reference vs. rapid method).
- **Simple trend analysis.** Relating a response to a single continuous predictor with a roughly linear relationship.
- **Teaching.** The foundation for understanding all regression methods. Learn this well before moving to more complex techniques.
Better alternatives exist for
- **High-dimensional data (many wavelengths).** Spectroscopic data typically has hundreds of correlated predictors. Use PCR or PLS instead.
- **Nonlinear relationships.** If the calibration curve bends, consider polynomial regression, splines, or nonlinear models.
- **Noisy data with outliers.** Robust regression (e.g., iteratively reweighted least squares) handles outliers better than OLS.
- **Small sample sizes with many predictors.** When $p$ approaches or exceeds $n$, OLS becomes unstable or has no unique solution. Ridge, LASSO, or PLS are necessary.
Code implementation
Simple linear regression is short enough to implement from scratch, and doing so once makes the least squares machinery concrete.
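Here is a pure-Python sketch of the whole calibration workflow (fit, quality metrics, prediction). In practice `numpy.polyfit` or `scipy.stats.linregress` do the fitting; this version simply makes the formulas from this chapter explicit:

```python
import math

def fit_calibration(x, y):
    """Ordinary least squares fit of y = b0 + b1 * x.

    Returns (b0, b1, r_squared, rmse).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope and intercept from the summation formulas
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    # Quality metrics from the residuals
    y_hat = [b0 + b1 * xi for xi in x]
    ss_res = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    return b0, b1, r_squared, rmse

def predict_concentration(b0, b1, signal):
    """Invert the calibration line: signal -> concentration."""
    return (signal - b0) / b1

# Example with the copper calibration data used earlier in this chapter
conc = [0.0, 2.0, 5.0, 10.0, 20.0, 50.0]
absorb = [0.002, 0.098, 0.243, 0.491, 0.985, 2.461]
b0, b1, r2, rmse = fit_calibration(conc, absorb)
print(b1, r2, predict_concentration(b0, b1, 0.370))
```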
Simple linear regression is the foundation upon which almost every chemometric method is built. The concepts introduced here — least squares fitting, the normal equations, residual diagnostics, the matrix formulation — reappear in increasingly sophisticated forms throughout the field.
The natural next steps from here:
Multiple Linear Regression extends the model to several predictors ($y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots$). Essential when your response depends on more than one variable, but runs into trouble when predictors are correlated.
Ridge Regression adds a penalty term to the least squares criterion to stabilize coefficient estimates when predictors are collinear. The first regularization method most chemometricians encounter.
LASSO Regression uses a different penalty that can shrink some coefficients exactly to zero, performing variable selection. Useful when you suspect only a few wavelengths matter.
Principal Component Regression compresses the predictor matrix into a few orthogonal components before regression. A classic approach for spectroscopic calibration.
Partial Least Squares finds components that simultaneously explain variance in both X and y . The most widely used regression method in chemometrics, and the one you will encounter most often in spectroscopic applications.
[1] Galton, F. (1886). Regression towards mediocrity in hereditary stature. Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246–263.
[2] Legendre, A.-M. (1805). Nouvelles méthodes pour la détermination des orbites des comètes. Firmin Didot, Paris.
[3] Gauss, C. F. (1809). Theoria motus corporum coelestium. Perthes et Besser, Hamburg.
[4] Pearson, K. (1896). Mathematical contributions to the theory of evolution. III. Regression, heredity, and panmixia. Philosophical Transactions of the Royal Society A, 187, 253–318.
[5] Beer, A. (1852). Bestimmung der Absorption des rothen Lichts in farbigen Flüssigkeiten. Annalen der Physik und Chemie, 162(5), 78–88.
[6] Martens, H., & Næs, T. (1989). Multivariate Calibration. Wiley.
[7] Brereton, R. G. (2003). Chemometrics: Data Analysis for the Laboratory and Chemical Plant. Wiley.
[8] Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, 27(1), 17–21.
[9] Mark, H., & Workman, J. (2007). Chemometrics in Spectroscopy. Academic Press.
[10] Draper, N. R., & Smith, H. (1998). Applied Regression Analysis (3rd ed.). Wiley.