Least Squares: gradient descent

The normal equations give us a beautiful closed-form answer, but they are not always the most practical way to solve a least squares problem. When the dataset is huge, when the number of variables is very large, or when we want to update a model gradually as new data arrive, an iterative method can be more useful [1],[2].

Gradient descent is the simplest iterative idea: start somewhere, look at the direction in which the error increases most quickly, and move the other way.

In other words: walk downhill.

The error surface

For linear least squares, we want to minimize:

S (β) = ∥ y - X β ∥^{2}

If there is only one parameter, this error is a curve. If there are two parameters, it becomes a surface. With many parameters, it lives in a space we cannot draw, but the same idea remains: every possible set of parameters has an associated error.

The interactive plot below starts with the simplest possible case. The algorithm moves along a curve until it reaches the bottom:

Interactive gradient descent

Iteration: 0 | β: 4.5000 | Descent: -9.0000 | Cost: 20.2500

Learning Rate (α): 0.1

Now we can make the picture a little closer to a real linear regression, with slope and intercept as two parameters. The error becomes a bowl-shaped surface. Click on the plot to choose a starting point and watch how the path descends toward the minimum:

Current: β₁ = 0.1000, β₀ = 0.0000 (click plot to start)

Optimal: β₁ = 0.3090, β₀ = 0.5170

Learning Rate (α): 0.0025

This bowl shape is important. Ordinary linear least squares is convex, so there is one global minimum. Gradient descent does not have to worry about hidden local valleys in this case [2].

Why this matters for chemometrics

In spectroscopic calibration, a model can have hundreds or thousands of coefficients. We cannot visualize that error surface, but the algorithm still works with the same principle: follow the direction that reduces the residual error.

The gradient

The gradient is a vector of derivatives. It tells us which direction makes the function increase fastest.

For the least squares objective:

S (β) = ∥ y - X β ∥^{2}

the gradient is:

\nabla S (β) = 2 X^{T} X β - 2 X^{T} y

That vector points uphill. Since we want to minimize the error, we move in the opposite direction:

- \nabla S (β)

This is the whole intuition of gradient descent.

The update rule

We start with an initial guess:

β^{(0)}

Then we update the parameters step by step:

β^{(k + 1)} = β^{(k)} - α \nabla S (β^{(k)})

Here, $k$ is the iteration number and $α$ is the step size, often called the learning rate [1].

For linear least squares, substituting the gradient gives:

β^{(k + 1)} = β^{(k)} - 2 α X^{T} (X β^{(k)} - y)

This equation says: compute the residuals, translate them into a direction in parameter space, and take a step that reduces the error.

Choosing the step size

The step size $α$ controls how boldly the algorithm moves.

If $α$ is too small, convergence is painfully slow.
If $α$ is too large, the algorithm can jump over the minimum and oscillate.
If $α$ is well chosen, the path descends smoothly.

For simple least squares problems, it is sometimes possible to calculate a good step size analytically. In practice, many algorithms use line search or adaptive learning-rate rules.

The learning-rate feeling

If the error decreases very slowly, the step may be too cautious. If the error goes down and then suddenly explodes, the step is probably too aggressive. Watching the error curve over iterations is one of the simplest diagnostics.

When gradient descent is useful

Gradient descent is not the fastest method for a small calibration curve. If you have ten standards and one predictor, the normal equations will solve the problem immediately.

Gradient descent becomes more interesting when:

The dataset is too large to process comfortably in one matrix operation.
The number of variables is very high, as in full-spectrum modeling.
The objective includes extra penalty terms, such as in regularized regression.
New data arrive continuously and the model needs to update over time.

In modern machine learning, gradient descent is everywhere. But its roots are older and very close to the least squares problem itself [1],[2].

What it gives up

Gradient descent trades exact one-step algebra for flexibility. It needs a starting point, a step-size rule, and a stopping criterion. It may require many iterations. But in return, it can handle problems where direct methods are too expensive or inconvenient.

The key idea is modest but powerful: we do not need to solve everything at once. Sometimes we can get there by improving the answer, one step at a time.

References

[1]Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer.

[2]Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.