Least Squares: optimization

In the nineteenth century, astronomers were trying to determine orbits from limited and noisy observations. Instruments were improving, but observations still disagreed with one another. A comet or small planet might be visible only for a short time, and the available measurements were not enough to satisfy every equation exactly.

In 1805, Adrien-Marie Legendre presented least squares as a practical method for problems with more equations than unknowns. In that setting, the errors usually cannot all be made zero. His rule was to choose the unknown parameters that make the sum of squared errors as small as possible [1].

The calculation he derived is what we now recognize, in the linear case, as the normal-equation route. In modern notation, a linear least-squares problem leads to:

X^{T} X β = X^{T} y

Gauss entered the same problem through the orbit of Ceres. The object was discovered in January 1801 and then lost as it moved too close to the Sun. From a short set of observations, Gauss predicted where it would be found again, and Ceres was recovered in December 1801. Gauss later claimed that he had used least squares since 1795, but he did not publish the method before Legendre [2],[3].

The interactive example below is a simplified two-dimensional Kepler model. It is not a reconstruction of Gauss's numerical tables, but it keeps the essential structure: fit an orbit to angular observations, project the orbit back onto the sky, and check the later predicted position.

Ceres: fitting the arc, then checking the prediction

Simplified visual analogy: least squares fits the available arc, and the later recovered position checks the prediction.

Observation 3 of 11

Fit from the observed arc

Each click adds one observation to the least-squares fit. The yellow point is held out as a later check, so it measures prediction quality rather than fitting error.

Fit score

MSE_{f i t} (θ) = \frac{1}{n} \sum_{i = 1}^{n} (y_{i} - \overset{y}{^}_{i} (θ))^{2}

In orbit determination, the parameters enter the observations nonlinearly through the geometry of motion. In practice, Gauss worked with corrections to an approximate orbit: he linearized the observational equations, solved for a least-squares correction, and updated the parameters. In modern terms, that workflow sits in the same family as linearized least squares and later Gauss-Newton-type methods.

Legendre and Gauss in the history of least squares

Legendre treated least squares mainly as an approximation method and showed how the criterion leads to equations for the unknown parameters. Gauss connected the same criterion to probability and observational error. Under normally distributed errors, minimizing squared residuals is equivalent to maximizing likelihood [4].

So the early story already contains a computational answer for linear problems: minimizing the sum of squares leads to equations for the unknown parameters. Today we call that route the normal equations.

But the normal equations are only one route. The same least-squares objective can also be solved by QR decomposition, SVD, gradient descent, conjugate gradient, Gauss-Newton, or Levenberg-Marquardt. The question of this series is therefore more specific:

Which computational route should we use to minimize the least-squares error?

If you have ever built a calibration curve, fitted a spectrum, or used a regression model in a software package, you have already used one of these routes. The software gives you coefficients, predicted values, residuals, and maybe confidence intervals. Underneath, it has chosen an algorithm for solving the least-squares problem.

This small series splits that computational story into separate methods. Each method gives a different way to reduce a squared-error criterion: sometimes by solving it exactly, sometimes by improving the parameters step by step, and sometimes by solving a local approximation.

The methods

There are many ways to solve least squares problems, but the methods in this series fall into two broad families [5],[6],[7].

Family	How it works	Typical methods
Direct methods	Find the answer from algebra in one main step	Normal equations, QR, SVD
Iterative methods	Start with a guess and improve it step by step	Gradient descent, conjugate gradient, Gauss-Newton, Levenberg-Marquardt

The articles below follow that path from the simplest closed-form solution to more flexible iterative methods.

Reading order

Normal equations show the classic closed-form solution for linear least squares. This is the neat formula many of us first meet in regression.
Gradient descent solves the same problem by walking downhill on the error surface. It is slower in small problems, but essential for understanding modern optimization.
Conjugate gradient improves on simple gradient descent for large linear systems by choosing smarter search directions.
Gauss-Newton moves into nonlinear least squares by repeatedly replacing a curved model with a local linear approximation.
Levenberg-Marquardt adds damping to Gauss-Newton, making nonlinear fitting more stable when the starting guess is not perfect.

Choosing the right door

For a small calibration curve, the normal equations or a numerically stable variant such as QR decomposition is usually enough [5],[6]. For a large spectral dataset, iterative methods become attractive because they avoid forming and inverting very large matrices [7],[8]. For nonlinear models, such as exponential decay or Michaelis-Menten kinetics, direct linear least squares no longer applies and methods like Gauss-Newton or Levenberg-Marquardt become much more natural [9],[10].

The important thing is that the goal has not changed. All these methods are still trying to make the residuals small. They simply differ in how they travel toward that minimum.

References

[1]Legendre, A.-M. (1805). Nouvelles methodes pour la determination des orbites des cometes. Firmin Didot.

[2]O'Connor, J. J., & Robertson, E. F. (2022). The Missing Planet. MacTutor History of Mathematics Archive, University of St Andrews.

[3]Gauss, C. F. (1809). Theoria motus corporum coelestium in sectionibus conicis solem ambientium. Perthes et Besser.

[4]Magnus, J. R. (2022). Gauss on least-squares and maximum-likelihood estimation. Archive for History of Exact Sciences, 76, 425-430. https://doi.org/10.1007/s00407-022-00291-w

[5]Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press.

[6]Bjorck, A. (1996). Numerical Methods for Least Squares Problems. SIAM.

[7]Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer.

[8]Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

[9]Marquardt, D. W. (1963). An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2), 431-441.

[10]Moré, J. J. (1978). The Levenberg-Marquardt algorithm: Implementation and theory. In Numerical Analysis (pp. 105-116). Springer.