
Linear Regression

The word “regression” entered statistics through an unlikely door: the study of human height. In 1886, the English polymath Francis Galton published a paper titled Regression towards Mediocrity in Hereditary Stature, in which he measured the heights of 928 adult children and their parents. He observed that unusually tall parents tended to have children who were tall, but not quite as tall as their parents — and unusually short parents tended to have children who were short, but not quite as short. The children’s heights “regressed” toward the population average. Galton drew a straight line through a scatter of parent-child height pairs, and in doing so created the first regression analysis.

Galton’s colleague Karl Pearson took the idea further. Over the following decade, Pearson developed the correlation coefficient, formalized the method of fitting a straight line to bivariate data, and built the mathematical framework that turned Galton’s empirical observation into a general statistical tool. By 1903, Pearson and his students had applied the method to problems ranging from skull measurements to meteorological data.

But the mathematical engine underneath regression — the method of least squares — had been invented a full eighty years earlier, and in a completely different field. In 1805, the French mathematician Adrien-Marie Legendre published the first description of the least squares method in his work on determining the orbits of comets. He proposed minimizing the sum of squared residuals as the criterion for the “best” fit, without any probabilistic justification. Four years later, Carl Friedrich Gauss provided that justification: if measurement errors follow a normal (Gaussian) distribution, then the least squares solution is also the maximum likelihood estimate. Gauss claimed he had been using the method since 1795, though he published after Legendre. The priority dispute between them was bitter and never fully resolved.

What matters for us is how the method migrated from astronomy to chemistry. Throughout the 19th and 20th centuries, analytical chemists needed to convert instrument readings into concentrations. The standard procedure — still used in every analytical chemistry laboratory today — is the calibration curve: prepare a set of standards with known concentrations, measure each one on the instrument, plot signal versus concentration, and fit a straight line. That line is a linear regression model. When August Beer formalized his law of light absorption in 1852, the linear relationship between absorbance and concentration gave calibration curves a firm theoretical basis. Linear regression became the workhorse of quantitative chemical analysis.

The calibration problem in chemistry

Suppose you need to determine how much lead is in a set of drinking water samples. You have an atomic absorption spectrometer that measures absorbance, and you know from Beer’s Law that absorbance is proportional to concentration. But you don’t know the proportionality constant for your particular instrument, cuvette path length, and wavelength setting. You need a calibration model.

The procedure is straightforward:

  1. Prepare standards

    Make 5–8 solutions with known lead concentrations (say 0, 5, 10, 20, 50, 100 ppb).

  2. Measure each standard

    Run each solution through the spectrometer and record the absorbance.

  3. Fit a line

    Plot absorbance vs. concentration and find the straight line that best describes the relationship.

  4. Predict unknowns

    Measure the absorbance of your unknown samples and use the fitted line to read off their concentrations.

This is linear regression in its most natural chemical setting. The “line of best fit” is not drawn by eye (as was common before the 1970s) but computed by a precise mathematical criterion: least squares. The line that minimizes the total squared distance between the observed points and the line is, in a well-defined sense, the best one.
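The four-step procedure can be sketched in a few lines of NumPy. The concentrations and signals below are made-up illustrative numbers, not real measurements:

```python
import numpy as np

# Steps 1-2: standards with known lead concentrations (ppb) and their
# measured signals (the readings here are invented for illustration)
conc = np.array([0.0, 5.0, 10.0, 20.0, 50.0, 100.0])
signal = np.array([0.001, 0.051, 0.103, 0.198, 0.507, 1.012])

# Step 3: fit the straight line signal = b0 + b1 * conc
# (np.polyfit returns coefficients highest degree first)
b1, b0 = np.polyfit(conc, signal, deg=1)

# Step 4: invert the fitted line to predict an unknown sample
unknown_signal = 0.250
unknown_conc = (unknown_signal - b0) / b1
```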

The mathematical model

The simplest linear regression model relates a response variable $y$ to a single predictor $x$:

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

Each symbol has a concrete chemical meaning:

  • $y_i$ is the measured response for the $i$-th sample (e.g., absorbance at 283.3 nm)
  • $x_i$ is the known predictor value (e.g., lead concentration in ppb)
  • $\beta_0$ is the intercept — the expected response when $x = 0$ (ideally zero for a blank, but often slightly nonzero due to stray light, solvent absorption, or detector offset)
  • $\beta_1$ is the slope — how much the response changes per unit change in $x$ (the sensitivity of the method)
  • $\varepsilon_i$ is the error (or residual) — the difference between what we actually measured and what the model predicts, arising from random measurement noise

The model says that the data are generated by a deterministic linear relationship $\beta_0 + \beta_1 x$ plus random noise $\varepsilon$. We observe $x_i$ and $y_i$; we want to estimate $\beta_0$ and $\beta_1$.
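A quick way to internalize the generative model is to simulate it: pick "true" values for the intercept and slope (the numbers below are arbitrary), add Gaussian noise, and check that fitting approximately recovers them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed "true" parameters (arbitrary values chosen for the simulation)
beta_0, beta_1 = 0.002, 0.05

x = np.linspace(0, 50, 20)             # predictor values (e.g., concentrations)
noise = rng.normal(0.0, 0.01, x.size)  # random measurement error epsilon_i
y = beta_0 + beta_1 * x + noise        # observed responses y_i

# Fitting the simulated data should approximately recover beta_0 and beta_1
b1_hat, b0_hat = np.polyfit(x, y, deg=1)
```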

Finding the best line

We have $n$ data points $(x_i, y_i)$ and we want the values of $\beta_0$ and $\beta_1$ that make the line fit the data as closely as possible. The standard criterion is ordinary least squares (OLS): minimize the sum of squared residuals.

Define the residual for the $i$-th observation as:

$$e_i = y_i - (\beta_0 + \beta_1 x_i)$$

The sum of squared residuals is:

$$S(\beta_0, \beta_1) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2$$

To minimize $S$, take partial derivatives with respect to $\beta_0$ and $\beta_1$, set them to zero, and solve:

$$\frac{\partial S}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0, \qquad \frac{\partial S}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0$$

These are the normal equations. Solving them gives closed-form expressions for the optimal coefficients:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

where $\bar{x}$ and $\bar{y}$ are the sample means.

The formula for $\hat{\beta}_1$ has an intuitive reading: the slope is the ratio of “how much $x$ and $y$ vary together” (the covariance) to “how much $x$ varies on its own” (the variance). If $x$ and $y$ increase together, the slope is positive. If one increases while the other decreases, the slope is negative.
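As a sanity check, the closed-form slope really is a covariance-to-variance ratio. Here it is computed directly for the copper calibration data used later in this article:

```python
import numpy as np

# Copper calibration data used later in this article
x = np.array([0.0, 2.0, 5.0, 10.0, 20.0, 50.0])
y = np.array([0.002, 0.098, 0.243, 0.491, 0.985, 2.461])

# Slope = covariance(x, y) / variance(x), using the same 1/n convention in both
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
var_x = np.mean((x - x.mean()) ** 2)
beta_1 = cov_xy / var_x

# Intercept from the second normal equation
beta_0 = y.mean() - beta_1 * x.mean()
```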

Measuring model quality

After fitting the line, you need to know: is it actually any good? Two complementary metrics answer this question.

Coefficient of determination (R-squared)

The coefficient of determination $R^2$ measures the fraction of the total variability in $y$ that is explained by the linear relationship with $x$:

$$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}$$

where:

  • $SS_{\text{res}} = \sum_i (y_i - \hat{y}_i)^2$ is the residual sum of squares — the variation the model fails to explain
  • $SS_{\text{tot}} = \sum_i (y_i - \bar{y})^2$ is the total sum of squares — the total variation in the data

$R^2$ ranges from 0 to 1. An $R^2$ of 0.998 (common for well-behaved Beer’s Law calibrations) means the linear model explains 99.8% of the variance in absorbance. An $R^2$ of 0.75 means 25% of the variance is unexplained — the model captures the general trend but misses something important.

Root mean square error (RMSE)

While $R^2$ is dimensionless, the RMSE tells you the model’s prediction error in the same units as the response:

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

If you are calibrating a UV-Vis spectrometer for copper determination and your RMSE is 0.003 absorbance units, your predictions are typically within 0.003 AU of the true value. This is directly interpretable: you can convert it to concentration units using the slope, giving you the method’s practical detection capability.
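Both metrics take only a few lines to compute. This sketch uses the copper calibration data from the worked example later in this article:

```python
import numpy as np

# Copper calibration data from the worked example in this article
x = np.array([0.0, 2.0, 5.0, 10.0, 20.0, 50.0])
y = np.array([0.002, 0.098, 0.243, 0.491, 0.985, 2.461])

b1, b0 = np.polyfit(x, y, deg=1)
y_pred = b0 + b1 * x

ss_res = np.sum((y - y_pred) ** 2)    # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot       # dimensionless
rmse = np.sqrt(ss_res / len(y))       # same units as y (absorbance)
```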

Residual analysis

Numbers alone are not enough. The single most informative diagnostic is the residual plot: plot the residuals $e_i$ against the fitted values $\hat{y}_i$ (or against $x_i$).

A well-behaved regression produces residuals that look like random scatter around zero — no trends, no patterns, no funnels. If you see structure in the residuals, something is wrong:

  • A curved pattern indicates the true relationship is nonlinear. Consider a quadratic term or a transformation.
  • A funnel shape (variance increasing with the fitted value) indicates heteroscedasticity. Weighted regression or a variance-stabilizing transformation may help.
  • Clusters or runs of positive/negative residuals suggest autocorrelation or a missing variable.
  • One or two points far from zero flag potential outliers worth investigating.
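One simple, if informal, way to quantify a curved residual pattern is to correlate the residuals with a centered quadratic term; a strong correlation flags nonlinearity. A sketch on simulated, deliberately quadratic data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data with genuine curvature (a quadratic term plus noise)
x = np.linspace(0, 10, 30)
y = 0.5 * x + 0.05 * x**2 + rng.normal(0.0, 0.1, x.size)

# Fit a straight line anyway
b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)

# A curved residual pattern correlates strongly with a centered quadratic term;
# values near 0 suggest no curvature, values near 1 a clear bend
curvature = np.corrcoef(residuals, (x - x.mean()) ** 2)[0, 1]
```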

From simple to matrix form

Simple linear regression handles one predictor. But in chemistry, we often have multiple predictors — for example, absorbances at several wavelengths, or concentrations of several analytes. The matrix formulation generalizes the model to any number of predictors.

Write the model for all $n$ observations at once:

$$\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

where:

  • $\mathbf{y}$ is the $n \times 1$ vector of responses
  • $\mathbf{X}$ is the $n \times (p+1)$ design matrix (first column is all ones for the intercept, remaining columns are predictors)
  • $\boldsymbol{\beta}$ is the $(p+1) \times 1$ vector of coefficients
  • $\boldsymbol{\varepsilon}$ is the $n \times 1$ vector of errors

For simple linear regression with one predictor, the design matrix looks like:

$$\mathbf{X} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}$$

The least squares solution in matrix form is the famous normal equation:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$
This single expression replaces all the summation formulas from the simple case. It works for any number of predictors $p$, which is why it forms the basis for multiple linear regression, ridge regression, and ultimately the entire family of linear methods in chemometrics.
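The normal equation translates directly into NumPy. The sketch below checks it against `np.linalg.lstsq`, which solves the same problem via a more numerically stable route:

```python
import numpy as np

# Copper calibration data from the worked example in this article
x = np.array([0.0, 2.0, 5.0, 10.0, 20.0, 50.0])
y = np.array([0.002, 0.098, 0.243, 0.491, 0.985, 2.461])

# Design matrix: a column of ones (intercept) next to the predictor column
X = np.column_stack([np.ones_like(x), x])

# Normal equation: beta_hat = (X^T X)^{-1} X^T y
# (solve the linear system rather than forming the inverse explicitly)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```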

Real chemistry examples

Linear regression appears throughout analytical chemistry. Here is a common scenario.

Scenario: Determine copper concentration in water samples using UV-Vis spectroscopy at 810 nm.

You prepare six standards and measure their absorbances:

| Concentration (mg/L) | Absorbance |
|----------------------|------------|
| 0.0                  | 0.002      |
| 2.0                  | 0.098      |
| 5.0                  | 0.243      |
| 10.0                 | 0.491      |
| 20.0                 | 0.985      |
| 50.0                 | 2.461      |

Beer’s Law predicts a linear relationship: $A = \epsilon b c$, where $\epsilon$ is the molar absorptivity, $b$ is the path length, and $c$ is the concentration. Fitting gives:

  • $\hat{\beta}_0 \approx -0.0005$ (very close to zero, as expected for a good blank)
  • $\hat{\beta}_1 \approx 0.0492$ AU per mg/L (the method sensitivity)

To predict an unknown sample with absorbance 0.370: $c = (0.370 - \hat{\beta}_0) / \hat{\beta}_1 \approx 7.5$ mg/L.

Common pitfalls

Linear regression is robust and well-understood, but it can fail silently if its assumptions are violated. These are the problems you are most likely to encounter in chemical data.

Nonlinearity. The most common failure. Beer’s Law is linear only over a limited concentration range; at high concentrations, deviations appear due to molecular interactions, stray light, or detector saturation. If your residual plot shows a curve, the fix is usually to narrow the calibration range, add a quadratic term, or use a nonlinear model.

Outliers. A single outlier can dramatically shift the fitted line, especially with small sample sizes. Outliers in calibration data may come from preparation errors (wrong dilution), instrument glitches, or transcription mistakes. Identify them through the residual plot, investigate them (don’t just delete them blindly), and consider robust regression methods if outliers are frequent.

Heteroscedasticity. In many spectroscopic measurements, noise increases with signal intensity (think shot noise, which scales as the square root of the signal). This means the variance of $\varepsilon$ is not constant across the calibration range — a violation of the standard OLS assumption. The remedy is weighted least squares, where each point is weighted inversely to its variance, giving less noisy points more influence.
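A minimal weighted least squares sketch, assuming (purely for illustration) that the noise standard deviation grows linearly with the signal:

```python
import numpy as np

x = np.array([0.0, 2.0, 5.0, 10.0, 20.0, 50.0])
y = np.array([0.002, 0.098, 0.243, 0.491, 0.985, 2.461])

# Illustrative noise model: standard deviation grows with the signal
sigma = 0.001 + 0.01 * y
w = 1.0 / sigma**2                  # weight each point by its inverse variance

X = np.column_stack([np.ones_like(x), x])
W = np.diag(w)

# Weighted normal equations: beta = (X^T W X)^{-1} X^T W y
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
```

With well-behaved linear data like this, the weighted slope stays close to the OLS slope; the weights matter most when the high-signal points are genuinely much noisier.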

Extrapolation. A calibration curve is only valid within the range of your standards. Predicting beyond this range (extrapolation) is dangerous because you have no data to confirm the relationship holds. A calibration built from 0–50 mg/L standards should not be used to predict a sample at 200 mg/L.

Collinearity (in multiple regression). When two or more predictors are highly correlated — as they almost always are in spectroscopic data — the $\mathbf{X}^\top \mathbf{X}$ matrix becomes nearly singular. The estimated coefficients become unstable: large in magnitude, sensitive to small changes in the data, and difficult to interpret. This is the collinearity problem, and it is the principal motivation for regularized methods like ridge regression and latent-variable methods like PCR and PLS.
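A small simulation makes the instability concrete: with two nearly identical predictors, the OLS coefficients from the normal equations are erratic, while a ridge penalty $\lambda$ (an arbitrary 0.1 here) pulls them back to a stable, interpretable scale. This is a sketch, not a full ridge implementation — no centering, scaling, or penalty selection:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two nearly identical predictors, like absorbances at neighboring wavelengths
x1 = np.linspace(0.0, 1.0, 40)
x2 = x1 + rng.normal(0.0, 1e-4, x1.size)
y = 3.0 * x1 + rng.normal(0.0, 0.05, x1.size)

X = np.column_stack([x1, x2])       # intercept omitted to keep the sketch short

# OLS via the normal equations: unstable when X^T X is nearly singular
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: add lambda * I to the diagonal before solving
lam = 0.1
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
```

The ridge coefficients settle near equal values whose sum is close to the true combined effect of 3, which is exactly the stabilizing behavior the text describes.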

When to use (and when not to)

Good applications

Calibration curves The primary use case in analytical chemistry. Beer’s Law, electrode calibrations, instrument validation.

Method comparison Comparing two measurement techniques (e.g., reference vs. rapid method).

Simple trend analysis Relating a response to a single continuous predictor with a roughly linear relationship.

Teaching The foundation for understanding all regression methods. Learn this well before moving to more complex techniques.

Better alternatives exist for

High-dimensional data (many wavelengths) Spectroscopic data typically has hundreds of correlated predictors. Use PCR or PLS instead.

Nonlinear relationships If the calibration curve bends, consider polynomial regression, splines, or nonlinear models.

Noisy data with outliers Robust regression (e.g., iteratively reweighted least squares) handles outliers better than OLS.

Small sample sizes with many predictors When the number of predictors exceeds the number of samples ($p > n$), OLS has no unique solution. Ridge, LASSO, or PLS are necessary.

Code implementation

Here is a complete, working implementation of simple linear regression for calibration in Python.

```python
import numpy as np
import matplotlib.pyplot as plt


def linear_regression(x, y):
    """
    Fit a simple linear regression model y = beta_0 + beta_1 * x.

    Parameters
    ----------
    x : array-like
        Predictor values (e.g., known concentrations)
    y : array-like
        Response values (e.g., measured absorbances)

    Returns
    -------
    beta_0 : float
        Intercept
    beta_1 : float
        Slope
    r_squared : float
        Coefficient of determination
    rmse : float
        Root mean square error
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)

    # Sample means
    x_mean = np.mean(x)
    y_mean = np.mean(y)

    # Slope and intercept via normal equations
    beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean)**2)
    beta_0 = y_mean - beta_1 * x_mean

    # Predictions and residuals
    y_pred = beta_0 + beta_1 * x
    residuals = y - y_pred

    # Model quality metrics
    ss_res = np.sum(residuals**2)
    ss_tot = np.sum((y - y_mean)**2)
    r_squared = 1 - ss_res / ss_tot
    rmse = np.sqrt(ss_res / n)

    return beta_0, beta_1, r_squared, rmse


# --- Example: Beer's Law calibration for copper ---
# Known concentrations (mg/L) and measured absorbances
conc = np.array([0.0, 2.0, 5.0, 10.0, 20.0, 50.0])
absorbance = np.array([0.002, 0.098, 0.243, 0.491, 0.985, 2.461])

# Fit the calibration model
b0, b1, r2, rmse = linear_regression(conc, absorbance)
print(f"Intercept: {b0:.4f}")
print(f"Slope: {b1:.4f} AU per mg/L")
print(f"R-squared: {r2:.6f}")
print(f"RMSE: {rmse:.4f} AU")

# Predict unknown sample
unknown_abs = 0.370
unknown_conc = (unknown_abs - b0) / b1
print(f"\nUnknown sample: A = {unknown_abs} -> c = {unknown_conc:.2f} mg/L")

# --- Visualization ---
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Calibration plot
x_line = np.linspace(0, 55, 100)
y_line = b0 + b1 * x_line
axes[0].scatter(conc, absorbance, s=80, zorder=5, label='Standards')
axes[0].plot(x_line, y_line, 'r-', linewidth=2, label=f'Fit (R² = {r2:.4f})')
axes[0].scatter(unknown_conc, unknown_abs, s=100, marker='*',
                color='green', zorder=5, label=f'Unknown ({unknown_conc:.1f} mg/L)')
axes[0].set_xlabel('Concentration (mg/L)')
axes[0].set_ylabel('Absorbance (AU)')
axes[0].set_title("Beer's Law Calibration")
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Residual plot
y_pred = b0 + b1 * conc
residuals = absorbance - y_pred
axes[1].scatter(y_pred, residuals, s=80, zorder=5)
axes[1].axhline(y=0, color='red', linestyle='--', alpha=0.7)
axes[1].set_xlabel('Fitted values (AU)')
axes[1].set_ylabel('Residuals (AU)')
axes[1].set_title('Residual Plot')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```

Next steps

Simple linear regression is the foundation upon which almost every chemometric method is built. The concepts introduced here — least squares fitting, the normal equations, residual diagnostics, the matrix formulation — reappear in increasingly sophisticated forms throughout the field.

The natural next steps from here:

  • Multiple Linear Regression extends the model to several predictors ($x_1, x_2, \ldots, x_p$). Essential when your response depends on more than one variable, but runs into trouble when predictors are correlated.

  • Ridge Regression adds a penalty term to the least squares criterion to stabilize coefficient estimates when predictors are collinear. The first regularization method most chemometricians encounter.

  • LASSO Regression uses a different penalty that can shrink some coefficients exactly to zero, performing variable selection. Useful when you suspect only a few wavelengths matter.

  • Principal Component Regression compresses the predictor matrix into a few orthogonal components before regression. A classic approach for spectroscopic calibration.

  • Partial Least Squares finds components that simultaneously explain variance in both $\mathbf{X}$ and $\mathbf{y}$. The most widely used regression method in chemometrics, and the one you will encounter most often in spectroscopic applications.

For a deeper treatment of the least squares criterion and its probabilistic foundations, see Least Squares: Your First Step into Chemometrics.

References

[1] Galton, F. (1886). Regression towards mediocrity in hereditary stature. Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246–263.

[2] Legendre, A.-M. (1805). Nouvelles méthodes pour la détermination des orbites des comètes. Firmin Didot, Paris.

[3] Gauss, C. F. (1809). Theoria motus corporum coelestium. Perthes et Besser, Hamburg.

[4] Pearson, K. (1896). Mathematical contributions to the theory of evolution. III. Regression, heredity, and panmixia. Philosophical Transactions of the Royal Society A, 187, 253–318.

[5] Beer, A. (1852). Bestimmung der Absorption des rothen Lichts in farbigen Flüssigkeiten. Annalen der Physik und Chemie, 162(5), 78–88.

[6] Martens, H., & Naes, T. (1989). Multivariate Calibration. Wiley.

[7] Brereton, R. G. (2003). Chemometrics: Data Analysis for the Laboratory and Chemical Plant. Wiley.

[8] Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, 27(1), 17–21.

[9] Mark, H., & Workman, J. (2007). Chemometrics in Spectroscopy. Academic Press.

[10] Draper, N. R., & Smith, H. (1998). Applied Regression Analysis (3rd ed.). Wiley.