Least Squares: normal equations
The classic closed-form solution for linear least squares
The normal equations are often the first computational solution we meet for least squares. They feel almost too neat: write the model as a matrix product, do a little algebra, set a derivative to zero, and the optimal parameters appear in one compact expression.
This direct solution is possible when the model is linear in its parameters. That means the prediction can be written as:
In that case, the least squares problem becomes:
Do you remember the assumption of linearity in the parameters from the least squares foundation? This is where it becomes computationally powerful.
Building the matrix model
Imagine a simple analytical model where a measured response depends on concentration and temperature. We can write one row per experiment:
The first column of ones is there for the intercept. It lets the model include a constant term:
The parameter vector contains the unknowns:
Putting everything together:
Of course, real data are noisy, so this equality is almost never exact. Least squares asks for the that makes as close as possible to .
Why we need the normal equations
If were square and invertible, we could solve:
But in chemometrics, that is rarely the situation. We usually have more samples than parameters, or sometimes many more variables than samples. A calibration with 30 standards and two predictors gives a tall matrix. A spectrum with thousands of wavelengths gives a very wide one. In both cases, the ordinary inverse is not the right tool.
So instead of solving the system exactly, we solve the best approximate system:
This is the sum of squared residuals written as a vector length.
Deriving the solution
To find the minimum, we expand the squared norm:
Multiplying it out gives:
Now we use a familiar idea from calculus:
A smooth function reaches its minimum where its derivative is zero.
Here, the derivative is taken with respect to the whole parameter vector . This gives the gradient:
At the minimum, the gradient is zero:
Divide by 2 and rearrange:
This is the key result: the normal equations [1],[2].
If is invertible, we can isolate the parameter vector:
This is the classic closed-form solution.
Do we need to check the second derivative?
For ordinary linear least squares, the cost function is shaped like a bowl. More formally, it is a convex quadratic. That means there is only one lowest point, so the parameter vector obtained from the normal equations is the global minimum when the required rank conditions are satisfied.
The pseudoinverse view
When has full column rank, the expression:
is the Moore-Penrose pseudoinverse. It lets us write:
That notation is useful because it reminds us that we are not simply inverting . We are building the best least squares solution for a system that is usually not exactly solvable. When is wide or rank deficient, the pseudoinverse still exists, but this particular formula is replaced by an SVD-based construction.
When is this shortcut safe?
There is an important nuance here: the problem is not that the pseudoinverse suddenly becomes invalid when variables are related. It is perfectly normal to use a pseudoinverse in chemometrics.
For example, in a bilinear Beer-Lambert style model,
if the pure spectra are known, estimating concentrations with
is a standard least-squares calculation. There is nothing wrong with that. In fact, the pseudoinverse exists precisely because many useful systems are rectangular, noisy, or not exactly solvable.
The warning is about sensitivity, not permission.
If two columns of the design matrix contain almost the same information, the fitted predictions may still be good, but the individual coefficients become harder to interpret. A small amount of noise can move the solution noticeably because several different coefficient combinations produce almost the same fitted response.
In the example, this would happen if two component spectra in were extremely similar. The pseudoinverse can still calculate a concentration estimate, but the estimate may have larger uncertainty because the data do not strongly distinguish one component from the other.
The explicit normal-equation formula makes this sensitivity worse because it uses . This matrix is built from inner products between columns of . If the columns are already very similar, forming can amplify the numerical difficulty. SVD-based pseudoinverses are usually safer because they expose these weak directions directly.
When least-squares estimates become sensitive
The two predictors are not strongly redundant, so small noise is less likely to move the coefficients dramatically.
So the practical rule is:
The pseudoinverse is allowed and often useful. The question is how stable the estimated coefficients are when the data contain redundant or nearly redundant information.
When the columns are strongly correlated, the fitted predictions may still look reasonable, but the coefficients can become unstable. A tiny change in the data can produce a large change in . That is the warning sign.
What software usually does
Although we write the solution using an inverse, good numerical software usually avoids computing that inverse directly. Libraries in Python, MATLAB, and R often use QR decomposition, Cholesky decomposition, or singular value decomposition because those approaches are more stable.
A chemometric example
Suppose you are building a UV calibration model:
where is absorbance and is concentration. The design matrix is:
and the parameter vector is:
The normal equations give the intercept and slope that minimize the squared differences between measured and predicted absorbances.
Interactive example
The small panel below shows the direct nature of the method. The data are fixed, and pressing Solve makes the fitted line appear in one step. There is no path, no sequence of guesses, and no learning rate: the algebra jumps straight to the coefficients.
Normal equations: one direct jump to the fitted line
Not only for straight lines
The straight line is the easiest drawing, but it is not the real boundary of the method. The real boundary is this:
If the answer is yes, the normal equations can be used. The graph may be a line, a curve, or a higher-dimensional surface. What matters is that the unknown coefficients appear as a linear combination of columns in the design matrix.
For example, this is a straight-line model:
but this curved polynomial is also linear in the parameters:
The curve bends because one column of contains . The parameter still multiplies that column directly. The same idea works for multiple predictors:
where one column may represent concentration and another may represent temperature. More predictors simply mean more columns in .
Normal equations: the model can be bigger than a straight line
This is the familiar calibration line: two columns, two coefficients, one direct solve.
This is why "linear regression" can sometimes look nonlinear on a plot. Polynomial regression is linear regression in an expanded design matrix. Multiple linear regression is still linear regression, just with several measured variables contributing at the same time.
The method stops being directly applicable when the parameters enter the model in a genuinely nonlinear way, for example:
Here is inside the exponential, so the model cannot be solved in one normal-equation jump. That kind of problem belongs to nonlinear least-squares methods such as Gauss-Newton or Levenberg-Marquardt.
There is also a numerical limit. Even when the model is linear in the parameters, solving the normal equations directly can become unstable if is singular, nearly singular, or very large. That is why QR and SVD are so important in practice.
import numpy as np
c = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
A = np.array([0.03, 0.49, 1.01, 1.48, 2.04, 2.51])
X = np.column_stack([np.ones_like(c), c])
beta = np.linalg.solve(X.T @ X, X.T @ A)
print(beta)c = [0.0; 0.2; 0.4; 0.6; 0.8; 1.0];
A = [0.03; 0.49; 1.01; 1.48; 2.04; 2.51];
X = [ones(size(c)), c];
beta = (X' * X) \ (X' * A);
disp(beta)conc <- c(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
A <- c(0.03, 0.49, 1.01, 1.48, 2.04, 2.51)
X <- cbind(1, conc)
beta <- solve(t(X) %*% X, t(X) %*% A)
print(beta)