Least Squares: optimization
A roadmap for the main ways we compute least squares solutions
In the nineteenth century, astronomers were trying to determine orbits from limited and noisy observations. Instruments were improving, but observations still disagreed with one another. A comet or small planet might be visible only for a short time, and the available measurements were not enough to satisfy every equation exactly.
In 1805, Adrien-Marie Legendre presented least squares as a practical method for problems with more equations than unknowns. In that setting, the errors usually cannot all be made zero. His rule was to choose the unknown parameters that make the sum of squared errors as small as possible [1].
The calculation he derived is what we now recognize, in the linear case, as the normal-equation route. In modern notation, a linear least-squares problem leads to:
Gauss entered the same problem through the orbit of Ceres. The object was discovered in January 1801 and then lost as it moved too close to the Sun. From a short set of observations, Gauss predicted where it would be found again, and Ceres was recovered in December 1801. Gauss later claimed that he had used least squares since 1795, but he did not publish the method before Legendre [2],[3].
The interactive example below is a simplified two-dimensional Kepler model. It is not a reconstruction of Gauss's numerical tables, but it keeps the essential structure: fit an orbit to angular observations, project the orbit back onto the sky, and check the later predicted position.
Ceres: fitting the arc, then checking the prediction
Simplified visual analogy: least squares fits the available arc, and the later recovered position checks the prediction.
Fit from the observed arc
Each click adds one observation to the least-squares fit. The yellow point is held out as a later check, so it measures prediction quality rather than fitting error.
In orbit determination, the parameters enter the observations nonlinearly through the geometry of motion. In practice, Gauss worked with corrections to an approximate orbit: he linearized the observational equations, solved for a least-squares correction, and updated the parameters. In modern terms, that workflow sits in the same family as linearized least squares and later Gauss-Newton-type methods.

Legendre treated least squares mainly as an approximation method and showed how the criterion leads to equations for the unknown parameters. Gauss connected the same criterion to probability and observational error. Under normally distributed errors, minimizing squared residuals is equivalent to maximizing likelihood [4].
So the early story already contains a computational answer for linear problems: minimizing the sum of squares leads to equations for the unknown parameters. Today we call that route the normal equations.
But the normal equations are only one route. The same least-squares objective can also be solved by QR decomposition, SVD, gradient descent, conjugate gradient, Gauss-Newton, or Levenberg-Marquardt. The question of this series is therefore more specific:
Which computational route should we use to minimize the least-squares error?
If you have ever built a calibration curve, fitted a spectrum, or used a regression model in a software package, you have already used one of these routes. The software gives you coefficients, predicted values, residuals, and maybe confidence intervals. Underneath, it has chosen an algorithm for solving the least-squares problem.
This small series splits that computational story into separate methods. Each method gives a different way to reduce a squared-error criterion: sometimes by solving it exactly, sometimes by improving the parameters step by step, and sometimes by solving a local approximation.
The methods
There are many ways to solve least squares problems, but the methods in this series fall into two broad families [5],[6],[7].
| Family | How it works | Typical methods |
|---|---|---|
| Direct methods | Find the answer from algebra in one main step | Normal equations, QR, SVD |
| Iterative methods | Start with a guess and improve it step by step | Gradient descent, conjugate gradient, Gauss-Newton, Levenberg-Marquardt |
The articles below follow that path from the simplest closed-form solution to more flexible iterative methods.
Reading order
- Normal equations show the classic closed-form solution for linear least squares. This is the neat formula many of us first meet in regression.
- Gradient descent solves the same problem by walking downhill on the error surface. It is slower in small problems, but essential for understanding modern optimization.
- Conjugate gradient improves on simple gradient descent for large linear systems by choosing smarter search directions.
- Gauss-Newton moves into nonlinear least squares by repeatedly replacing a curved model with a local linear approximation.
- Levenberg-Marquardt adds damping to Gauss-Newton, making nonlinear fitting more stable when the starting guess is not perfect.
Choosing the right door
For a small calibration curve, the normal equations or a numerically stable variant such as QR decomposition is usually enough [5],[6]. For a large spectral dataset, iterative methods become attractive because they avoid forming and inverting very large matrices [7],[8]. For nonlinear models, such as exponential decay or Michaelis-Menten kinetics, direct linear least squares no longer applies and methods like Gauss-Newton or Levenberg-Marquardt become much more natural [9],[10].
The important thing is that the goal has not changed. All these methods are still trying to make the residuals small. They simply differ in how they travel toward that minimum.