Least Squares: normal equations

The normal equations are often the first computational solution we meet for least squares. They feel almost too neat: write the model as a matrix product, do a little algebra, set a derivative to zero, and the optimal parameters appear in one compact expression.

This direct solution is possible when the model is linear in its parameters. That means the prediction can be written as:

f (X, β) = X β

In that case, the least squares problem becomes:

\min_{β} ∥ y - X β ∥^{2}

Do you remember the assumption of linearity in the parameters from the least squares foundation? This is where it becomes computationally powerful.

Building the matrix model

Imagine a simple analytical model where a measured response depends on concentration and temperature. We can write one row per experiment:

X = \begin{bmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ 1 & x_{31} & x_{32} \end{bmatrix}

The first column of ones is there for the intercept. It lets the model include a constant term:

y_{1} = β_{0} + x_{11} β_{1} + x_{12} β_{2}

The parameter vector contains the unknowns:

β = \begin{bmatrix} β_{0} \\ β_{1} \\ β_{2} \end{bmatrix}

Putting everything together:

y = X β

Of course, real data are noisy, so this equality is almost never exact. Least squares asks for the $β$ that makes $X β$ as close as possible to $y$.
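As a minimal sketch of how such a design matrix can be assembled in NumPy, the block below builds $X$ for three experiments. The concentration and temperature values and the parameter vector are invented purely for illustration; they do not come from a real calibration.

```python
import numpy as np

# Hypothetical values for three experiments (illustration only).
conc = np.array([0.10, 0.20, 0.30])   # concentration, first predictor
temp = np.array([25.0, 35.0, 30.0])   # temperature, second predictor

# One row per experiment: [1, x_i1, x_i2]. The ones column carries the intercept.
X = np.column_stack([np.ones_like(conc), conc, temp])

# An arbitrary parameter vector [beta_0, beta_1, beta_2], chosen only to show the product.
beta = np.array([0.5, 2.0, 0.01])

# Row i of X @ beta equals beta_0 + x_i1 * beta_1 + x_i2 * beta_2.
y_model = X @ beta

print(X.shape)   # (3, 3)
print(y_model)
```

Each entry of `y_model` is exactly the row-wise sum written out in the scalar equation above, which is all the matrix notation is doing.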

Why we need the normal equations

If $X$ were square and invertible, we could solve:

β = X^{- 1} y

But in chemometrics, that is rarely the situation. We usually have more samples than parameters, or sometimes many more variables than samples. A calibration with 30 standards and two predictors gives a tall matrix (30 rows, three columns once the intercept is included). A spectrum with thousands of wavelengths gives a very wide one. In both cases $X$ is not square, so $X^{- 1}$ does not exist and the ordinary inverse is not the right tool.

So instead of solving the system exactly, we look for the $β$ that minimizes the squared length of the residual vector:

S (β) = ∥ y - X β ∥^{2}

This is the sum of squared residuals written as a squared vector length.
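To make $S (β)$ concrete, here is a small sketch that evaluates it for two hand-picked candidate parameter vectors. It reuses the UV calibration numbers from the code listing at the end of this page; the helper name `sse` and the two candidates are assumptions made for the illustration.

```python
import numpy as np

# Calibration data taken from the code listing at the end of this page.
c = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
A = np.array([0.03, 0.49, 1.01, 1.48, 2.04, 2.51])
X = np.column_stack([np.ones_like(c), c])   # tall matrix: 6 rows, 2 columns


def sse(beta):
    """Sum of squared residuals S(beta) = ||y - X beta||^2."""
    r = A - X @ beta
    return float(r @ r)


# Two hand-picked candidates [intercept, slope] (hypothetical guesses).
s_poor = sse(np.array([0.0, 2.0]))
s_better = sse(np.array([0.0, 2.5]))

print(s_poor)     # larger S: the line is too shallow
print(s_better)   # smaller S: closer to the data
```

Least squares is the search for the candidate that makes this number as small as possible; the derivation that follows finds it without any trial and error.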

Deriving the solution

To find the minimum, we expand the squared norm:

S (β) = (y - X β)^{T} (y - X β)

Multiplying it out gives:

S (β) = y^{T} y - 2 β^{T} X^{T} y + β^{T} X^{T} X β

Now we use a familiar idea from calculus:

A smooth function reaches its minimum where its derivative is zero.

Here, the derivative is taken with respect to the whole parameter vector $β$. This gives the gradient:

\frac{\partial S}{\partial β} = - 2 X^{T} y + 2 X^{T} X β

At the minimum, the gradient is zero:

- 2 X^{T} y + 2 X^{T} X β = 0

Divide by 2 and rearrange:

X^{T} X β = X^{T} y

This is the key result: the normal equations [1],[2].

If $X^{T} X$ is invertible, we can isolate the parameter vector:

\hat{β} = (X^{T} X)^{- 1} X^{T} y

This is the classic closed-form solution.
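The derivation can be checked numerically. The sketch below solves the normal equations for the UV calibration data listed at the end of this page and then evaluates the gradient expression at the solution, which should vanish up to rounding error. The variable names are illustrative.

```python
import numpy as np

c = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
A = np.array([0.03, 0.49, 1.01, 1.48, 2.04, 2.51])
X = np.column_stack([np.ones_like(c), c])

# Solve X^T X beta = X^T y as a linear system, without forming the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ A)

# Gradient of S evaluated at beta_hat: -2 X^T y + 2 X^T X beta.
grad = -2 * X.T @ A + 2 * X.T @ X @ beta_hat

print(beta_hat)   # [intercept, slope]
print(grad)       # numerically zero
```

Using `np.linalg.solve` rather than an explicit inverse gives the same $\hat{β}$ here and is the more common habit in numerical code.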

Do we need to check the second derivative?

For ordinary linear least squares, the cost function is shaped like a bowl. More formally, it is a convex quadratic, so any point where the gradient vanishes is a global minimum. When $X$ has full column rank the bowl is strictly convex, so that minimum is unique and the parameter vector obtained from the normal equations is the global minimizer. No separate second-derivative check is needed.
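For readers who want to see the bowl shape confirmed, the second derivative of $S$ with respect to $β$ is the constant matrix $2 X^{T} X$. The sketch below, using the concentrations from the listing at the end of this page, checks that its eigenvalues are all positive, which is what strict convexity means here.

```python
import numpy as np

c = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
X = np.column_stack([np.ones_like(c), c])

# Hessian of S(beta): the constant matrix 2 X^T X (it does not depend on beta or y).
H = 2 * X.T @ X

# eigvalsh is intended for symmetric matrices and returns real eigenvalues.
eigvals = np.linalg.eigvalsh(H)

print(eigvals)                    # all positive: strictly convex
print(np.linalg.matrix_rank(X))   # 2, i.e. full column rank
```

If a column of $X$ were an exact copy of another, one eigenvalue would drop to zero and the bowl would flatten into a trough with many equally good solutions.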

The pseudoinverse view

When $X$ has full column rank, the expression:

X^{+} = (X^{T} X)^{- 1} X^{T}

is the Moore-Penrose pseudoinverse. It lets us write:

\hat{β} = X^{+} y

That notation is useful because it reminds us that we are not simply inverting $X$. We are building the best least squares solution for a system that is usually not exactly solvable. When $X$ is wide or rank deficient, the pseudoinverse still exists, but this particular formula is replaced by an SVD-based construction.
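A short sketch can illustrate both statements. For the tall calibration matrix, the explicit formula and NumPy's SVD-based `np.linalg.pinv` agree. For a wide matrix (the small `X_wide` below is a made-up example), the formula is not usable because $X^{T} X$ is singular, yet `np.linalg.pinv` still returns a pseudoinverse.

```python
import numpy as np

c = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
X = np.column_stack([np.ones_like(c), c])

# Full column rank: the closed-form expression and the SVD-based routine agree.
X_plus_formula = np.linalg.inv(X.T @ X) @ X.T   # explicit inverse, for illustration only
X_plus_svd = np.linalg.pinv(X)
print(np.allclose(X_plus_formula, X_plus_svd))

# Hypothetical wide matrix: 2 samples, 3 variables.
X_wide = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

# Rank 2 with 3 columns, so the 3x3 matrix X_wide^T X_wide cannot be inverted.
print(np.linalg.matrix_rank(X_wide))

# The SVD-based pseudoinverse exists regardless.
X_wide_plus = np.linalg.pinv(X_wide)
print(X_wide_plus.shape)
```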

When is this shortcut safe?

There is an important nuance here: the problem is not that the pseudoinverse suddenly becomes invalid when variables are related. It is perfectly normal to use a pseudoinverse in chemometrics.

For example, in a bilinear Beer-Lambert style model,

D = CS

if the pure spectra $S$ are known, estimating concentrations with

\hat{C} = D S^{+}

is a standard least-squares calculation. There is nothing wrong with that. In fact, the pseudoinverse exists precisely because many useful systems are rectangular, noisy, or not exactly solvable.
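As a sketch of that calculation, the block below simulates a noise-free two-component mixture and recovers the concentrations with the pseudoinverse. The pure spectra `S_pure`, the concentrations `C_true`, and the five-wavelength grid are all invented for the example.

```python
import numpy as np

# Hypothetical pure spectra: 2 components (rows) at 5 wavelengths (columns).
S_pure = np.array([[1.0, 0.8, 0.4, 0.1, 0.0],
                   [0.0, 0.2, 0.6, 0.9, 1.0]])

# Hypothetical concentrations: 3 samples (rows), 2 components (columns).
C_true = np.array([[0.5, 0.1],
                   [0.2, 0.4],
                   [0.3, 0.3]])

# Bilinear model D = C S (noise-free here, so recovery is exact).
D = C_true @ S_pure

# Least-squares estimate of the concentrations: C_hat = D S^+.
C_hat = D @ np.linalg.pinv(S_pure)

print(np.round(C_hat, 6))
```

With measurement noise added to `D`, `C_hat` would no longer match `C_true` exactly, but it would still be the least-squares estimate for the given spectra.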

The warning is about sensitivity, not permission.

If two columns of the design matrix contain almost the same information, the fitted predictions may still be good, but the individual coefficients become harder to interpret. A small amount of noise can move the solution noticeably because several different coefficient combinations produce almost the same fitted response.

In the $D = CS$ example, this would happen if two component spectra in $S$ were extremely similar. The pseudoinverse can still calculate a concentration estimate, but the estimate may have larger uncertainty because the data do not strongly distinguish one component from the other.

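The explicit normal-equation formula makes this sensitivity worse because it uses $X^{T} X$. This matrix is built from inner products between columns of $X$, and its condition number (in the 2-norm) is the square of the condition number of $X$. If the columns are already very similar, forming $X^{T} X$ therefore amplifies the numerical difficulty and can discard digits that were still present in $X$. SVD-based pseudoinverses are usually safer because they work on $X$ itself and expose these weak directions directly as small singular values.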
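The cost of forming $X^{T} X$ can be seen by comparing condition numbers. The sketch below uses two made-up, nearly identical predictor columns; the specific perturbation is an arbitrary choice for the illustration.

```python
import numpy as np

# Two hypothetical predictors that carry almost the same information.
x1 = np.linspace(0.0, 1.0, 20)
x2 = x1 + 1e-3 * np.sin(7 * x1)   # tiny deviation from x1
X = np.column_stack([x1, x2])

# 2-norm condition numbers (ratio of largest to smallest singular value).
kappa_X = np.linalg.cond(X)
kappa_XtX = np.linalg.cond(X.T @ X)

print(kappa_X)
print(kappa_XtX)
print(kappa_X ** 2)   # essentially the same as kappa_XtX
```

A large condition number means that relative errors in the data can be magnified by roughly that factor in the solution, so squaring it is a real loss of accuracy.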

[Interactive panel: When least-squares estimates become sensitive. Selected case: Independent predictors (predictor correlation 0.513, noise sensitivity: lower). The two predictors are not strongly redundant, so small noise is less likely to move the coefficients dramatically.]

So the practical rule is:

The pseudoinverse is allowed and often useful. The question is how stable the estimated coefficients are when the data contain redundant or nearly redundant information.

When the columns are strongly correlated, the fitted predictions may still look reasonable, but the coefficients can become unstable. A tiny change in the data can produce a large change in $β$ . That is the warning sign.
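The following sketch shows this warning sign on made-up data with two nearly identical predictors. To keep the example deterministic, the small change added to $y$ is deliberately chosen along the weakest direction of $X$ (the left singular vector of the smallest singular value), so it is a worst case rather than typical noise.

```python
import numpy as np

# Hypothetical, strongly correlated predictors.
x1 = np.linspace(0.0, 1.0, 20)
x2 = x1 + 1e-3 * np.sin(7 * x1)
X = np.column_stack([x1, x2])
y = x1 + x2                        # generated with beta = [1, 1]

# Worst-case perturbation of y with length 1e-3.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
delta = 1e-3 * U[:, -1]

beta_a = np.linalg.lstsq(X, y, rcond=None)[0]
beta_b = np.linalg.lstsq(X, y + delta, rcond=None)[0]

coef_shift = np.linalg.norm(beta_b - beta_a)         # change in the coefficients
fit_shift = np.linalg.norm(X @ beta_b - X @ beta_a)  # change in the fitted values

print(coef_shift)
print(fit_shift)   # stays at the size of the perturbation
```

The fitted values move by no more than the perturbation itself, while the coefficients move by that amount divided by the smallest singular value of $X$.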

What software usually does

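Although we write the solution using an inverse, good numerical software usually avoids computing that inverse directly. Libraries in Python, MATLAB, and R often use QR decomposition, Cholesky decomposition, or singular value decomposition because those approaches are more stable. For example, NumPy's numpy.linalg.lstsq relies on an SVD-based LAPACK routine, MATLAB's backslash operator uses QR for rectangular systems, and R's lm fits models through a QR decomposition.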
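As a sketch of what these alternatives look like in NumPy, the block below solves the same UV calibration problem (data from the listing at the end of this page) in three ways: with `np.linalg.lstsq`, with a reduced QR factorization, and with the textbook normal equations for comparison.

```python
import numpy as np

c = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
A = np.array([0.03, 0.49, 1.01, 1.48, 2.04, 2.51])
X = np.column_stack([np.ones_like(c), c])

# Library least-squares solver; it works on X directly and never forms X^T X.
beta_lstsq = np.linalg.lstsq(X, A, rcond=None)[0]

# QR route: X = Q R (reduced form), then solve the small system R beta = Q^T y.
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ A)

# Textbook normal equations, shown only for comparison.
beta_ne = np.linalg.solve(X.T @ X, X.T @ A)

print(beta_lstsq)
print(beta_qr)
print(beta_ne)
```

On a well-conditioned problem like this one all three agree to rounding error; the differences only matter when the columns of $X$ are nearly redundant.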

A chemometric example

Suppose you are building a UV calibration model:

A = β_{0} + β_{1} c

where $A$ is absorbance and $c$ is concentration. The design matrix is:

X = \begin{bmatrix} 1 & c_{1} \\ 1 & c_{2} \\ 1 & c_{3} \\ ⋮ & ⋮ \\ 1 & c_{n} \end{bmatrix}

and the parameter vector is:

β = \begin{bmatrix} β_{0} \\ β_{1} \end{bmatrix}

The normal equations give the intercept and slope that minimize the squared differences between measured and predicted absorbances.

Interactive example

The small panel below shows the direct nature of the method. The data are fixed, and pressing Solve makes the fitted line appear in one step. There is no path, no sequence of guesses, and no learning rate: the algebra jumps straight to the coefficients.

[Interactive panel: Normal equations: one direct jump to the fitted line. Dataset: UV calibration. Reported values: intercept, slope, SSE.]

Not only for straight lines

The straight line is the easiest drawing, but it is not the real boundary of the method. The real boundary is this:

Can the prediction be written as X β ?

If the answer is yes, the normal equations can be used. The graph may be a line, a curve, or a higher-dimensional surface. What matters is that the unknown coefficients appear as a linear combination of columns in the design matrix.

For example, this is a straight-line model:

A = β_{0} + β_{1} c

but this curved polynomial is also linear in the parameters:

y = β_{0} + β_{1} x + β_{2} x^{2}

The curve bends because one column of $X$ contains $x^{2}$ . The parameter $β_{2}$ still multiplies that column directly. The same idea works for multiple predictors:

A = β_{0} + β_{1} c + β_{2} T

where one column may represent concentration and another may represent temperature. More predictors simply mean more columns in $X$ .

[Interactive panel: Normal equations: the model can be bigger than a straight line. Selected model: Straight calibration (columns in X: 1, c). This is the familiar calibration line: two columns, two coefficients, one direct solve. Reported values: beta, SSE.]

This is why "linear regression" can sometimes look nonlinear on a plot. Polynomial regression is linear regression in an expanded design matrix. Multiple linear regression is still linear regression, just with several measured variables contributing at the same time.
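A minimal sketch of the polynomial case: the data below are generated without noise from a made-up quadratic, so the normal equations should return the generating coefficients exactly (up to rounding). Only the design matrix changes compared with the straight-line fit.

```python
import numpy as np

# Hypothetical noise-free data from y = 1 + 2 x + 3 x^2.
x = np.linspace(-1.0, 1.0, 7)
y = 1.0 + 2.0 * x + 3.0 * x ** 2

# Expanded design matrix: columns 1, x, x^2. The model is still X @ beta.
X = np.column_stack([np.ones_like(x), x, x ** 2])

# Same normal-equation solve as for the straight line.
beta = np.linalg.solve(X.T @ X, X.T @ y)

print(beta)   # approximately [1, 2, 3]
```

Adding a temperature column instead of, or in addition to, the squared column would follow the same pattern: one more column in `X`, one more entry in `beta`.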

The method stops being directly applicable when the parameters enter the model in a genuinely nonlinear way, for example:

y = a e^{- k x} + c

Here $k$ is inside the exponential, so the model cannot be solved in one normal-equation jump. That kind of problem belongs to nonlinear least-squares methods such as Gauss-Newton or Levenberg-Marquardt.

There is also a numerical limit. Even when the model is linear in the parameters, solving the normal equations directly can become unstable if $X^{T} X$ is singular, nearly singular, or very large. That is why QR and SVD are so important in practice.

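The straight-line UV calibration can be solved with the normal equations in a few lines of Python, MATLAB, or R.

Python:

import numpy as np

# Calibration data: concentrations c and measured absorbances A
c = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
A = np.array([0.03, 0.49, 1.01, 1.48, 2.04, 2.51])

# Design matrix: a column of ones (intercept) and the concentration column
X = np.column_stack([np.ones_like(c), c])
# Solve X^T X beta = X^T A as a linear system, without forming an inverse
beta = np.linalg.solve(X.T @ X, X.T @ A)

print(beta)  # [intercept, slope]

MATLAB:

% Calibration data as column vectors
c = [0.0; 0.2; 0.4; 0.6; 0.8; 1.0];
A = [0.03; 0.49; 1.01; 1.48; 2.04; 2.51];

% Design matrix: ones (intercept) and concentration
X = [ones(size(c)), c];
% Solve the normal equations with backslash instead of inv()
beta = (X' * X) \ (X' * A);

disp(beta)  % [intercept; slope]

R:

# Calibration data: concentrations and measured absorbances
conc <- c(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
A <- c(0.03, 0.49, 1.01, 1.48, 2.04, 2.51)

# Design matrix: ones (intercept) and concentration
X <- cbind(1, conc)
# solve(a, b) solves the linear system a %*% beta = b without forming an inverse
beta <- solve(t(X) %*% X, t(X) %*% A)

print(beta)  # intercept, then slope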

References

[1] Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press.

[2] Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra. SIAM.
