Least Squares: conjugate gradient

Gradient descent is easy to understand, but it can be inefficient. It always follows the steepest downhill direction at the current point, and that can lead to a zig-zag path, especially when the error surface is long and narrow [1],[2].

The conjugate gradient method tries to be smarter. Instead of choosing each new direction independently, it chooses directions that cooperate with the previous ones.

The problem it solves

For linear least squares, the normal equations are:

X^{T} X β = X^{T} y

This has the form:

A β = b

where:

A = X^{T} X, b = X^{T} y

If $A$ is small and well behaved, we can solve this directly. If it is huge, storing it or factorizing it can become expensive. When $A = X^{T} X$ is symmetric positive definite, conjugate gradient solves the same system iteratively, without needing a full matrix inverse [1],[2].

That condition matters. If columns of $X$ are redundant or nearly redundant, $X^{T} X$ can become singular or ill-conditioned. In that case, LSQR, CGNR, SVD, regularization, or preconditioning is usually safer than applying plain conjugate gradient to the normal equations.

Why steepest descent can be slow

Imagine the least squares error surface as a valley. Gradient descent always points in the locally steepest direction. That sounds ideal, but in a narrow valley the steepest direction often points across the valley rather than along it. The algorithm bounces from side to side.

Conjugate gradient remembers something about where it has already been. It chooses the next direction so that progress made in earlier directions is not undone later.

That is the meaning of conjugate here: the directions are independent with respect to the geometry defined by $A$ .

The core idea

Instead of moving only along the negative gradient, conjugate gradient builds a sequence of search directions:

p_{0}, p_{1}, p_{2}, \dots

Each direction is chosen so that it is conjugate to the previous directions:

p_{i}^{T} A p_{j} = 0 for i \neq = j

This condition means that each step solves a new part of the problem without disturbing what earlier steps already improved [2].

A useful compromise

Conjugate gradient sits between direct methods and simple gradient descent. It is still iterative, but it usually needs far fewer steps than ordinary gradient descent for large linear systems.

Interactive example

The visualization below shows the method in coefficient space. The ellipses represent equal-error contours. A plain steepest-descent path would tend to bounce across a narrow valley, while conjugate gradient chooses directions that make new progress without undoing the previous step.

Conjugate gradient: smarter directions in coefficient space

iteration

0 / 2

residual norm

2.14e+0

current beta

(-0.250, 2.200)

Why chemometricians care

Large chemometric datasets often contain many variables. A hyperspectral image, a dense chromatographic fingerprint, or a high-resolution spectrum can make $X^{T} X$ large enough that direct inversion becomes unattractive.

Conjugate gradient is useful when:

The least squares problem is linear.
$X^{T} X$ is positive definite, or the problem has been regularized or preconditioned appropriately.
The system is large.
Matrix-vector products are affordable.
You do not need to explicitly form or invert a huge matrix.

It is especially helpful when combined with preconditioning, where we transform the system so the algorithm sees a more rounded, easier-to-descend error surface [1],[2].

What to remember

Conjugate gradient is still trying to minimize the same least squares error. It does not change the statistical assumptions or the meaning of the fitted parameters. It changes the route to the answer.

Gradient descent asks, "Which way is downhill right now?"

Conjugate gradient asks, "Which downhill direction gives new progress without undoing the old progress?"

That small difference can matter a lot when the problem becomes large.

References

[1]Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer.

[2]Bjorck, A. (1996). Numerical Methods for Least Squares Problems. SIAM.