
Least Squares: optimization

By the time mathematicians began asking how to minimize the sum of squared errors, the criterion itself was already well established. Adrien-Marie Legendre published the method of least squares in 1805 as an appendix to his Nouvelles méthodes pour la détermination des orbites des comètes, where he used it to fit parabolic orbits to comet observations. Carl Friedrich Gauss, who claimed to have used the approach privately since the mid-1790s, published his own treatment in 1809 in Theoria Motus Corporum Coelestium, notably providing a probabilistic justification by showing that least squares gives maximum likelihood estimates when errors are normally distributed. For small systems, the optimal parameters could be found by solving the resulting equations directly, and this was sufficient for the astronomical problems of the early nineteenth century.

But as problems grew larger and more complex, direct solution became impractical. On October 18, 1847, Augustin-Louis Cauchy presented a short memoir to the French Academy of Sciences titled Méthode générale pour la résolution des systèmes d’équations simultanées, published in the Comptes Rendus (volume 25, pages 536-538). In it, he described an iterative procedure for solving systems of equations: start from an initial guess, compute the partial derivatives of the objective function, and update the variables by stepping in the direction that decreases the function most rapidly. This was the first description of what we now call gradient descent, or the method of steepest descent.

Cauchy’s idea was conceptually simple but profoundly consequential. Where Legendre and Gauss had provided the what (minimize squared errors) and the why (it follows from probability theory), Cauchy provided a general-purpose how that did not require solving a system of equations in one step. His iterative approach could, in principle, handle problems too large or too complex for direct methods. It would take more than a century for this idea to find its full potential: gradient descent became the backbone of machine learning and deep learning, powering the backpropagation algorithm that trains modern neural networks. But its origins lie squarely in the same optimization problems that Legendre and Gauss had formalized decades earlier — the minimization of squared residuals between a model and observed data.

From probabilistic foundation to computational reality

In the last entry, we explored where the “square” in least squares actually comes from: discovering how it emerges naturally from probability theory and the physics of measurement. That understanding is powerful, but it leaves us with a practical question: now that we know what to minimize, how do we actually find those optimal parameters?

If you’ve ever built a calibration curve or fit a spectroscopic model, you’ve probably relied on software to magically produce those coefficients. But what’s happening behind the scenes? In this entry, we will explore how we can actually solve the least squares problem, from elegant closed-form solutions to powerful iterative algorithms.

There are many ways to solve the least squares problem, each with its own assumptions and strengths, and each suited to different situations. Even though the list is long, we can group them into two main types based on how they find the answer:

  • Direct methods: also called closed-form methods. These methods find an exact answer in one step, because a formula exists that provides the solution. They are fast and simple when the problem is small and the data is clean. Examples: Normal Equations, SVD.
  • Iterative methods: these methods find the answer step by step, getting closer to the best solution with each iteration. They are very useful for very large datasets, or when direct methods are too slow or need too much computer memory. Examples: Gradient Descent, Conjugate Gradient, Gauss-Newton, Levenberg-Marquardt.

Despite their differences, all least squares methods speak the same mathematical language: linear algebra. Whether you’re solving equations directly or updating parameters iteratively, everything is expressed and computed using matrices. These structures define the data, the models, the transformations, and the solutions.

Expressing least squares as matrices

Starting from the beginning, we need to get used to seeing the least squares problem in terms of matrices. If you’re not super comfortable with matrix notation yet, don’t worry: we’ll build it up piece by piece, and I promise it will make the computations much clearer once you see the pattern.

Previously, we viewed least squares from the perspective of individual data points:

$$S(\beta) = \sum_{i=1}^{n} \left( y_i - f(x_i, \beta) \right)^2$$

Here, $\beta$ represents the parameters we want to find, i.e., the ones that minimize the total squared error between the observed values $y_i$ and the predicted values $f(x_i, \beta)$.

To move toward a computational solution, we first need to reformulate the least squares objective using linear algebra. Instead of treating each data point individually, we stack all the observations into vectors and matrices. This lets us express the entire problem in one compact, efficient and general form. A common and more compact expression uses the Euclidean norm (also called the L² norm) to represent the sum of squared errors:

$$S(\beta) = \left\| y - f(X, \beta) \right\|_2^2$$

The notation $\|\cdot\|_2^2$ represents the squared Euclidean norm. In practical terms, this means you square each value in a vector and then add them up. In the context of least squares, this is the same as the sum of squared residuals.

If we define the residuals as:

$$r = y - f(X, \beta)$$

then the squared norm becomes:

$$\|r\|_2^2 = \sum_{i=1}^{n} r_i^2$$

It is a way to measure the total squared distance between the measured values and the model predictions. In other words, the squared length of the residual vector. This is exactly what least squares minimizes.

We will use this norm-based notation going forward because it allows us to express and solve the problem cleanly using linear algebra. It’s especially useful when working with large datasets or building general computational methods.
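As a quick numerical check (with a made-up residual vector, just for illustration), the squared Euclidean norm and the sum of squared residuals are the same number:

```python
import numpy as np

# A hypothetical residual vector (observed minus predicted)
r = np.array([0.2, -0.1, 0.05])

# Squared Euclidean norm: square each entry, then add them up
norm_sq = np.linalg.norm(r) ** 2
sum_sq = np.sum(r ** 2)

print(norm_sq, sum_sq)  # both ~ 0.0525
```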

On the other hand, the vector $y$ contains all your measured values. These are the experimental data you’re trying to analyze. For example, in a calibration experiment where there are absorbances vs concentrations, these might be absorbance readings. Think of this as a list of all your experimental results:

$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$

The matrix $X$ contains your experimental conditions. These are the things you controlled or measured that you think affect your results. In the calibration study, this includes the known concentrations of your standards. This matrix captures all the information about how you set up each measurement.

For example, if you’re modeling the effect of two variables (say, concentration and temperature), the matrix might look like this:

$$X = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} \\ 1 & x_{2,1} & x_{2,2} \\ \vdots & \vdots & \vdots \\ 1 & x_{n,1} & x_{n,2} \end{bmatrix}$$

Here, the first column of 1s represents the intercept term (a constant), and the remaining columns contain the actual input values for each observation. Each row corresponds to a single measurement or experiment.

Why the column of 1s? You might wonder why we need a column of 1s for the intercept. Here’s the key insight: when we multiply the first column (all 1s) by β₀, we get:

$$1 \cdot \beta_0 = \beta_0 \quad \text{for every row}$$

This gives us a constant term that’s added to every prediction! Let’s see this in action for the first observation:

$$\hat{y}_1 = 1 \cdot \beta_0 + x_{1,1} \beta_1 + x_{1,2} \beta_2 = \beta_0 + \beta_1 x_{1,1} + \beta_2 x_{1,2}$$

See how β₀ appears as a constant? That’s your intercept! Without this column of 1s, our model would be forced to pass through the origin (0,0), which is rarely what we want in real data.

The parameter vector $\beta$ contains the unknown numbers you’re trying to find. These might be the slope and intercept of a calibration line, or the coefficients that tell you how different factors contribute to your final result.

For example, if your model includes an intercept, concentration, and temperature, then the parameter vector might look like this:

$$\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}$$

Each entry in $\beta$ represents a weight or coefficient that tells the model how much that corresponding input (from $X$) contributes to the predicted result.

The function $f(X, \beta)$ represents your model. This is your mathematical guess about how the experimental conditions relate to your measurements. This function takes your experimental setup $X$ and some unknown parameters $\beta$, then produces predictions for what your results should be.

Notation Summary

  • $y$: vector of measured values (e.g., absorbance readings)
  • $X$: matrix of input variables or experimental conditions (e.g., concentrations, temperature)
  • $\beta$: vector of parameters to estimate (e.g., slopes, intercepts, coefficients)
  • $f(X, \beta)$: model output, predicted values computed from inputs and parameters
  • $r = y - f(X, \beta)$: residual vector, the difference between observed and predicted values
  • $\|r\|_2^2$: squared Euclidean norm, sum of squared residuals (i.e., what least squares minimizes)
  • $\min_{\beta} \|y - f(X, \beta)\|_2^2$: least squares objective, find the parameters that minimize the total squared error

Solving the Least Squares Problem

Now that we understand the structure of the least squares objective and its notation, it’s time to actually solve it.

As stated before, there are two main ways to do this, the direct (or analytical) and the iterative:

  • Analytical (closed-form): we use calculus and linear algebra to derive an exact formula for the optimal parameters. This gives us a direct answer.
  • Numerical (iterative): instead of solving it all at once, we start with a guess and improve it step by step. This is essential when the model is nonlinear or the dataset is huge.

Both methods aim to minimize the same quantity (the sum of squared errors) but they do so in very different ways.

Closed form solution

The closed-form solution is possible when the model is linear in its parameters. That means we can write the predicted values as a matrix multiplication:

$$f(X, \beta) = X\beta$$

In this case, the least squares problem becomes:

$$\min_{\beta} \| y - X\beta \|_2^2$$
Do you remember when we talked about the model being linear in the parameters in the last entry? That idea is important here: it’s what makes a closed-form solution possible.

Let’s go back to our earlier example, where we’re modeling the effect of concentration and temperature on a measured response (e.g., absorbance). And, for example, let’s suppose we have three observations, each with a concentration $c_i$, a temperature $T_i$, and a measured response $y_i$.

Therefore, the matrix $X$, which contains the experimental conditions, is:

$$X = \begin{bmatrix} 1 & c_1 & T_1 \\ 1 & c_2 & T_2 \\ 1 & c_3 & T_3 \end{bmatrix}$$

This matrix includes:

  • a column of 1s for the intercept term,
  • one column for concentration,
  • and one column for temperature.

The vector of unknown parameters is:

$$\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}$$

Putting it all together, the model:

$$\hat{y} = X\beta$$

means that:

$$\begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \hat{y}_3 \end{bmatrix} = \begin{bmatrix} 1 & c_1 & T_1 \\ 1 & c_2 & T_2 \\ 1 & c_3 & T_3 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}$$

Multiplying this out gives us the model equations for each observation:

$$\hat{y}_i = \beta_0 + \beta_1 c_i + \beta_2 T_i, \qquad i = 1, 2, 3$$
This structure, i.e., expressing the model as $\hat{y} = X\beta$, makes the predictions linear in the parameters. Each row of $X$ describes one experiment, and each row of the resulting product gives a predicted value.
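To make the matrix product concrete, here is a small NumPy sketch with made-up concentrations, temperatures, and coefficients (illustrative values, not from the text); each row of $X\beta$ is one prediction:

```python
import numpy as np

# Design matrix: intercept column, concentration, temperature (illustrative values)
X = np.array([
    [1.0, 0.1, 25.0],
    [1.0, 0.2, 30.0],
    [1.0, 0.3, 35.0],
])
beta = np.array([0.05, 2.0, 0.01])  # hypothetical beta0, beta1, beta2

y_hat = X @ beta  # each row computes beta0 + beta1*conc + beta2*temp
print(y_hat)
```

The first prediction, for example, is 0.05 + 2.0·0.1 + 0.01·25 = 0.5.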

Now, we want to find the parameters $\beta$ that minimize the total squared error:

$$S(\beta) = \| y - X\beta \|_2^2$$

If the matrix $X$ is square (same number of rows and columns) and invertible, we can solve this system exactly. In that case, we simply rearrange the equation:

$$y = X\beta$$

and apply the inverse of $X$ to both sides:

$$\beta = X^{-1} y$$

This gives us the exact solution for $\beta$.

However, this situation is very rare in real-world data. In chemometrics, we almost always have a different number of samples and variables. For example, 30 samples but only 2 or 3 experimental factors, or 15 samples and thousands of variables for spectroscopic data. That makes a tall matrix: more rows than columns.

Such a matrix is not invertible. Here’s why: a matrix can only have an inverse if it’s square (same number of rows and columns), and even then, it must be full rank (no redundant information). A tall matrix isn’t square, so it doesn’t have a standard inverse. You can’t “undo” a multiplication by a non-square matrix in the usual way.

So we cannot apply the same trick. Instead, we need to find the $\beta$ that gives the best approximate solution: the one that makes $X\beta$ as close as possible to $y$. This is where the magic of the normal equations comes in!

To do this, we expand the squared norm to make the structure of the problem more visible. This next part looks like a lot of algebra, but it’s really just careful bookkeeping: each step follows naturally from the one before.

$$S(\beta) = (y - X\beta)^{\top} (y - X\beta)$$

Now we multiply this expression out. The transpose of the difference is:

$$(y - X\beta)^{\top} = y^{\top} - \beta^{\top} X^{\top}$$

So the expanded form of the cost function is:

$$S(\beta) = y^{\top} y - y^{\top} X\beta - \beta^{\top} X^{\top} y + \beta^{\top} X^{\top} X \beta$$

The two middle terms are both scalars and equal to each other ($y^{\top} X\beta = \beta^{\top} X^{\top} y$), so we combine them:

$$S(\beta) = y^{\top} y - 2\beta^{\top} X^{\top} y + \beta^{\top} X^{\top} X \beta$$
To find the value of that minimizes this function, we use a key idea from calculus:

A function reaches its minimum (or maximum) where its derivative is zero.

So, we take the derivative of the cost function with respect to the parameter vector $\beta$. This tells us how the error changes when we change the parameters. At the minimum, the error doesn’t get smaller in any direction: the slope is zero.

Mathematically, we compute:

$$\frac{\partial S(\beta)}{\partial \beta}$$

and then set that derivative equal to zero. This gives us an equation that tells us where the error is minimized.

This step is exactly like finding the lowest point of a curve by solving $f'(x) = 0$, but in multiple dimensions (here, we’re doing it for a vector of parameters instead of a single variable).

Applying the standard matrix calculus rules ($\frac{\partial}{\partial \beta} \beta^{\top} a = a$, and $\frac{\partial}{\partial \beta} \beta^{\top} A \beta = 2A\beta$ for a symmetric matrix $A$) to our cost function $S(\beta) = y^{\top} y - 2\beta^{\top} X^{\top} y + \beta^{\top} X^{\top} X \beta$:

$$\nabla S(\beta) = -2X^{\top} y + 2X^{\top} X \beta$$

Setting the gradient to zero:

$$-2X^{\top} y + 2X^{\top} X \beta = 0$$

Divide both sides by 2 and rearrange terms:

$$X^{\top} X \beta = X^{\top} y$$

This is the key result: the normal equations [1], [2]. It gives a direct way to solve for the parameters.

If the matrix $X^{\top} X$ is invertible, we can isolate $\beta$:

$$\hat{\beta} = (X^{\top} X)^{-1} X^{\top} y$$

This is the famous closed-form solution to the linear least squares problem. All the data $y$ and the design matrix $X$ feed into this single formula. The result is the set of parameters that best fits the data in the least squares sense.

This expression also defines what we call the pseudoinverse of $X$. In this case, when $X^{\top} X$ is invertible, the pseudoinverse is written as:

$$X^{+} = (X^{\top} X)^{-1} X^{\top}$$

So we can write the solution compactly as:

$$\hat{\beta} = X^{+} y$$

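As a sanity check (assuming NumPy and synthetic data), `np.linalg.pinv` computes the pseudoinverse via the SVD and agrees with the explicit formula when $X$ has full column rank:

```python
import numpy as np

# Synthetic straight-line calibration data (illustrative)
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(20), rng.uniform(0, 1, 20)])
y = 0.1 + 2.5 * X[:, 1] + rng.normal(scale=0.05, size=20)

# Explicit normal-equations formula
beta_formula = np.linalg.inv(X.T @ X) @ (X.T @ y)

# SVD-based pseudoinverse
beta_pinv = np.linalg.pinv(X) @ y

print(beta_formula, beta_pinv)  # the two estimates coincide
```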
In the previous entry, we showed that the least squares expression arises from a set of statistical assumptions (homoscedasticity, independence of errors, and zero-mean noise). These assumptions justify why minimizing squared errors makes sense probabilistically. Now, it becomes clear why the assumption of linearity in the parameters is necessary:

All this story started by expressing the model as a matrix product of the form $X\beta$. This solution is only valid if the model is linear in its parameters.

Alongside this, we also need:

  • $X^{\top} X$ is invertible (or at least not too close to singular, i.e., no perfect multicollinearity).
  • Problem size is small or moderate, because computing the inverse becomes expensive for very large datasets.

When these conditions are met, the closed-form solution is fast, efficient, and exact (up to numerical precision).

Examples of Least Squares calculated by closed form

To make things concrete, let’s walk through a few practical examples where least squares shows up in real chemometric modeling. For each case, we’ll express the model in matrix form and briefly discuss how the parameters would be computed using the closed-form solution.

The Chemistry: You’re measuring copper concentrations in water using UV spectroscopy. You prepare standards with known copper concentrations and measure their absorbances.

The Model: Beer’s Law says absorbance should be proportional to concentration:

$$A = \varepsilon l c$$

Since your cuvette path length is fixed, and you might have some baseline absorbance, your practical model becomes:

$$A = \beta_0 + \beta_1 c$$

Matrix Form:

$$\begin{bmatrix} A_1 \\ A_2 \\ \vdots \\ A_n \end{bmatrix} = \begin{bmatrix} 1 & c_1 \\ 1 & c_2 \\ \vdots & \vdots \\ 1 & c_n \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}$$

Solving the System:

$$\hat{\beta} = (X^{\top} X)^{-1} X^{\top} y$$

Code examples for Beer’s law:

```matlab
% Beer–Lambert Law calibration in MATLAB
% Generate synthetic data
c = linspace(0, 1, 10)';          % concentrations from 0 to 1
epsilon = 2.5; l = 1;             % true molar absorptivity and path length
noise = 0.05 * randn(size(c));    % small Gaussian noise
A = epsilon * l * c + noise;      % absorbance measurements

% Fit by least squares
X = [ones(size(c)), c];
beta = (X' * X) \ (X' * A);

% Plot data and fit
figure;
scatter(c, A, 'o', 'DisplayName', 'Data'); hold on;
plot(c, X * beta, '-', 'LineWidth', 2, 'DisplayName', 'Fit');
xlabel('Concentration (c)');
ylabel('Absorbance (A)');
title('Beer–Lambert Calibration (MATLAB)');
legend('Location', 'NorthWest');
```

```python
# Beer–Lambert Law calibration in Python
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data
np.random.seed(0)
c = np.linspace(0, 1, 10)
epsilon, l = 2.5, 1.0
noise = np.random.normal(scale=0.05, size=c.shape)
A = epsilon * l * c + noise

# Fit by least squares
X = np.vstack([np.ones_like(c), c]).T
beta = np.linalg.inv(X.T @ X) @ (X.T @ A)

# Plot
plt.scatter(c, A, label='Data')
plt.plot(c, X @ beta, label='Fit', linewidth=2)
plt.xlabel('Concentration (c)')
plt.ylabel('Absorbance (A)')
plt.title("Beer–Lambert Calibration (Python)")
plt.legend()
plt.show()
```

```r
# Beer–Lambert Law calibration in R
set.seed(0)
c <- seq(0, 1, length.out = 10)
epsilon <- 2.5; l <- 1
noise <- rnorm(length(c), sd = 0.05)
A <- epsilon * l * c + noise

# Fit by least squares
X <- cbind(1, c)
beta <- solve(t(X) %*% X) %*% t(X) %*% A

# Plot
plot(c, A, pch = 19, xlab = "Concentration (c)", ylab = "Absorbance (A)",
     main = "Beer–Lambert Calibration (R)")
lines(c, X %*% beta, col = "blue", lwd = 2)
```

Gradient Descent (Iterative Solution)

While the normal equations give us the exact mathematical answer, they’re not always the best approach in practice. When we have many variables, large datasets, or computational constraints, iterative methods like gradient descent can be more efficient and stable [3], [4].

Gradient descent is based on a simple idea: if you want to find the minimum of a function, move in the direction that decreases the function most quickly. Think of it like walking downhill in fog. Even if you can’t see where the bottom is, you can feel which direction is steepest and take steps in that direction.

To start, let’s visualize gradient descent in the simplest case: a linear regression with just one parameter (β). The interactive plot below shows how the algorithm iteratively finds the minimum of the cost function J(β) = β²:

Now let’s extend this to a more realistic scenario: a linear regression with two parameters (slope β₁ and intercept β₀).

When we have two parameters, the cost function becomes a surface in 3D space: the horizontal plane represents the two parameters (β₁ and β₀), and the vertical axis represents the error (SSE). This is what we call the error surface. Every point on this surface tells us: “if you choose these parameter values, this is how much error you’ll get.”

The visualization below shows this error surface as a topographic contour map, where each contour line connects points with the same error value. The colors indicate the magnitude of the error: blue regions have low error (close to the optimal solution), while red regions have high error (far from optimal). The red path traces the gradient descent trajectory, and the blue marker shows the current position as it descends toward the optimal solution. Click anywhere on the plot to set a starting point, then watch gradient descent navigate to the global minimum:

Notice how no matter where you start, gradient descent always converges to the same solution at the bottom of the bowl. This is because the least squares error surface is convex: it has a single global minimum with no local minima to get stuck in. Every path leads to the same optimal parameters!

What about higher dimensions? In real chemometric applications, you often work with many variables: think of a full UV-Vis spectrum with hundreds of wavelengths, or a multivariate calibration with dozens of predictors. In those cases, the error surface exists in a space with many dimensions (one dimension for each parameter, plus one for the error). While we can’t visualize a 10-dimensional or 100-dimensional surface, the same principles apply: gradient descent still “walks downhill” in this high-dimensional space, following the steepest descent direction until it reaches the minimum. The mathematics and algorithms work the same way, regardless of how many parameters you have.

The Basic Gradient Descent Algorithm

Now let’s translate that intuitive “walking downhill in fog” idea into precise mathematics. Don’t worry if the equations look intimidating at first: we’re just formalizing the simple idea of taking steps toward lower error.

Let’s apply this to our least squares problem. We want to minimize:

$$S(\beta) = \| y - X\beta \|_2^2$$

From our earlier derivation, we know the gradient is:

$$\nabla S(\beta) = -2X^{\top} (y - X\beta)$$

What is the gradient? The gradient ∇S(β) is a vector that points in the direction where the function S increases most rapidly. Think of it as a compass that always points uphill. At each point on the error surface, the gradient tells you: “if you want to increase the error as fast as possible, go this way!”

Since we want to minimize the error (not maximize it), we do the opposite: we follow the negative gradient -∇S(β), which points downhill toward lower error. That’s why it’s called “gradient descent” — we descend in the direction opposite to the gradient!

The gradient descent algorithm works like this:

  1. Start with an initial guess for the parameters, $\beta^{(0)}$
  2. Calculate the gradient $\nabla S(\beta)$ at the current point
  3. Take a step in the opposite direction (since we want to minimize, not maximize)
  4. Repeat until convergence

Mathematically, the update rule is:

$$\beta^{(k+1)} = \beta^{(k)} - \alpha \, \nabla S(\beta^{(k)})$$

Where $\alpha$ is the step size (also called the learning rate) and $k$ is the iteration number.

Substituting our gradient:

$$\beta^{(k+1)} = \beta^{(k)} - \alpha \left( -2X^{\top} (y - X\beta^{(k)}) \right)$$

Simplifying:

$$\beta^{(k+1)} = \beta^{(k)} + 2\alpha X^{\top} (y - X\beta^{(k)})$$
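As a concrete sketch (synthetic data and a hand-picked step size, so the numbers are illustrative), the update rule can be implemented in a few lines of NumPy and checked against the normal-equations answer:

```python
import numpy as np

# Synthetic linear data: y = 1.0 + 2.0*x + noise
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.05, size=x.size)

beta = np.zeros(2)   # start from the origin
alpha = 0.01         # step size, chosen by hand for this problem
for _ in range(5000):
    grad = -2 * X.T @ (y - X @ beta)   # gradient of S(beta)
    beta = beta - alpha * grad         # step opposite the gradient

beta_exact = np.linalg.solve(X.T @ X, X.T @ y)  # normal equations
print(beta, beta_exact)  # the two agree after convergence
```

Because the least squares surface is convex, the iterates settle on the same coefficients the closed-form solution gives.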

Choosing the Step Size

The step size $\alpha$ (also called the learning rate) is crucial for gradient descent to work well. If it’s too small, the algorithm will take many iterations to converge. If it’s too large, the algorithm might overshoot the minimum and fail to converge at all. The “just right” learning rate converges smoothly and quickly, while too small creeps along painfully slowly, and too large oscillates or even diverges. Finding the right balance is critical!

For least squares problems, there’s actually an optimal step size that can be calculated analytically. But in practice, people often use simple rules like:

  • Start with a small value (commonly $\alpha = 0.01$) and adjust based on performance
  • Use line search methods to find a good step size at each iteration
  • Adaptive methods that adjust the step size automatically

When to Use Gradient Descent

Gradient descent becomes particularly useful in several situations:

Large Datasets: When you have millions of data points, forming and inverting $X^{\top} X$ becomes computationally expensive. Gradient descent can process the data in smaller batches.

High Dimensional Problems: When you have thousands of variables (common in chemometrics with full spectrum analysis), the $X^{\top} X$ matrix becomes huge. Gradient descent avoids forming this matrix explicitly.

Regularized Problems: When you add penalty terms to prevent overfitting (like Ridge or Lasso regression), the closed form solution may not exist. Gradient descent can handle these modified objective functions easily.

Online Learning: When new data arrives continuously, gradient descent can update the model incrementally without recomputing everything from scratch.

Advanced Iterative Methods

Beyond basic gradient descent, several more sophisticated iterative methods have been developed specifically for least squares and related problems. These methods often converge faster or are more robust than simple gradient descent.

Gauss-Newton Method

The Gauss-Newton method is specifically designed for nonlinear least squares problems, but understanding it helps illuminate the structure of the linear case. For nonlinear models, where $f(X, \beta)$ is not linear in $\beta$, we can’t use the normal equations directly.

The Gauss-Newton method approximates the nonlinear function with its linear Taylor expansion at each iteration:

$$f(X, \beta + \Delta) \approx f(X, \beta) + J\Delta$$

Where $J$ is the Jacobian matrix: a matrix of all first-order partial derivatives of the model with respect to each parameter. Think of it as a multi-dimensional generalization of the derivative. This gives us a linear problem at each step that we can solve with the normal equations, $(J^{\top} J)\,\Delta = J^{\top} r$, followed by the update $\beta \leftarrow \beta + \Delta$.
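A minimal Gauss-Newton sketch, assuming a hypothetical exponential-decay model $f(x; a, b) = a e^{-bx}$ and synthetic data (neither is from the text):

```python
import numpy as np

# Synthetic data from f(x; a, b) = a * exp(-b * x) with a=2.0, b=0.7
rng = np.random.default_rng(0)
x = np.linspace(0, 4, 40)
y = 2.0 * np.exp(-0.7 * x) + rng.normal(scale=0.01, size=x.size)

a, b = 1.0, 1.0  # initial guess
for _ in range(20):
    f = a * np.exp(-b * x)
    r = y - f                                       # current residuals
    J = np.column_stack([np.exp(-b * x),            # df/da
                         -a * x * np.exp(-b * x)])  # df/db
    # Linearized least squares step: solve (J^T J) delta = J^T r
    delta = np.linalg.solve(J.T @ J, J.T @ r)
    a, b = a + delta[0], b + delta[1]

print(a, b)  # close to the true (2.0, 0.7)
```

Each pass solves one small linear least squares problem, exactly as the text describes.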

Levenberg-Marquardt Method

The Levenberg-Marquardt method improves on Gauss-Newton by adding a regularization (damping) term that makes the algorithm more stable [5], [6]:

$$(J^{\top} J + \lambda I)\,\Delta = J^{\top} r$$

When $\lambda$ is large, this behaves like gradient descent (stable but slow). When $\lambda$ is small, it behaves like Gauss-Newton (fast but potentially unstable). The algorithm adjusts $\lambda$ automatically based on whether each step improves the fit.
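A rough Levenberg-Marquardt sketch, assuming a hypothetical exponential-decay model and synthetic data, with a simple halve/double rule for adjusting $\lambda$ (real implementations use more refined update strategies):

```python
import numpy as np

# Synthetic data from a hypothetical model f(x; a, b) = a * exp(-b * x)
rng = np.random.default_rng(0)
x = np.linspace(0, 4, 40)
y = 2.0 * np.exp(-0.7 * x) + rng.normal(scale=0.01, size=x.size)

def residuals(p):
    a, b = p
    return y - a * np.exp(-b * x)

def jacobian(p):
    a, b = p
    e = np.exp(-b * x)
    return np.column_stack([e, -a * x * e])  # df/da, df/db

p = np.array([1.0, 1.0])  # initial guess
lam = 1.0                 # damping parameter
for _ in range(50):
    r, J = residuals(p), jacobian(p)
    # Damped normal equations: (J^T J + lam*I) delta = J^T r
    delta = np.linalg.solve(J.T @ J + lam * np.eye(2), J.T @ r)
    p_new = p + delta
    if np.sum(residuals(p_new) ** 2) < np.sum(r ** 2):
        p, lam = p_new, lam * 0.5   # step helped: trust Gauss-Newton more
    else:
        lam = lam * 2.0             # step hurt: fall back toward gradient descent

print(p)  # close to the true (2.0, 0.7)
```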

Conjugate Gradient Method

For large linear systems, the conjugate gradient method provides an elegant compromise between direct methods and simple gradient descent [3] , [10] . Instead of always moving in the steepest descent direction, it chooses directions that are “conjugate” to previous search directions. This ensures that each step makes progress that won’t be undone by later steps.

The method is particularly powerful because it theoretically converges in at most $n$ steps for an $n \times n$ system, but often converges much faster in practice.
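A bare-bones conjugate gradient sketch for the normal equations $X^{\top} X \beta = X^{\top} y$ (synthetic data; production code would use a tested library routine):

```python
import numpy as np

# Synthetic least squares problem; A = X^T X is symmetric positive definite
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, -1.0]) + rng.normal(scale=0.1, size=200)
A, b = X.T @ X, X.T @ y

beta = np.zeros(5)
r = b - A @ beta          # residual of the linear system
p = r.copy()              # first search direction = steepest descent
for _ in range(5):        # at most n = 5 steps in exact arithmetic
    Ap = A @ p
    step = (r @ r) / (p @ Ap)
    beta = beta + step * p
    r_new = r - step * Ap
    p = r_new + ((r_new @ r_new) / (r @ r)) * p  # new conjugate direction
    r = r_new

print(beta, np.linalg.solve(A, b))  # CG matches the direct solve
```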

Choosing the Right Computational Approach

With all these methods available, you might be wondering: “Which one should I actually use?” The answer depends on the size and structure of your problem. Understanding when to use each method will save you time and help you avoid numerical headaches.

Here’s a practical guide based on problem characteristics:

Small Problems (fewer than 100 variables)

For most routine analytical chemistry work including calibration curves, simple regression, and polynomial fitting, use the normal equations. Modern computers handle these calculations easily and accurately. This covers most problems you’ll encounter in typical laboratory work.

Advantages:

  • Exact solution in one step
  • Well understood mathematically
  • Implemented in all statistical software

Implementation note: While we’ve written the solution as $\hat{\beta} = (X^{\top} X)^{-1} X^{\top} y$, modern numerical libraries (NumPy, MATLAB, R) typically use more stable methods instead of directly computing the matrix inverse [1], [2], [10]:

  • QR decomposition: Breaks X into an orthogonal matrix Q and upper triangular matrix R, avoiding the X’X multiplication that can amplify numerical errors
  • Cholesky decomposition: Factors X’X into a triangular matrix product, particularly efficient when X’X is positive definite (guaranteed for full-rank problems)
  • SVD (Singular Value Decomposition): The most robust method that works even when X is rank-deficient

These advanced factorization methods give the same mathematical answer but with better numerical stability.
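A small sketch, assuming NumPy and clean synthetic data, showing that the explicit inverse, QR, and SVD routes produce the same coefficients:

```python
import numpy as np

# Synthetic straight-line calibration (illustrative)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(30), rng.uniform(0, 1, 30)])
y = 0.2 + 1.5 * X[:, 1] + rng.normal(scale=0.05, size=30)

# Explicit inverse (illustration only; avoid in production code)
beta_inv = np.linalg.inv(X.T @ X) @ (X.T @ y)

# QR decomposition: X = QR, then solve the triangular system R beta = Q^T y
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

# SVD-based solver (np.linalg.lstsq uses an SVD under the hood)
beta_svd, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_inv, beta_qr, beta_svd)  # all three coincide on this clean problem
```

On ill-conditioned data the three can diverge, which is exactly why the factorization routes are preferred.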

When to use: Simple calibrations, polynomial fits, standard multiple regression

Medium Problems (100 to 1000 variables)

Here you might start seeing numerical problems, especially if your variables are highly correlated. The issue is that computing $X^{\top} X$ can make small numerical errors much worse.

Better approach: QR decomposition or SVD methods. These break down your matrix into simpler pieces that are easier to work with numerically, avoiding the problematic $X^{\top} X$ calculation.

When to use: Full spectrum analysis, chromatographic fingerprinting, moderate size chemometric problems

Large Problems (more than 1000 variables)

Direct matrix methods become slow because computational cost grows rapidly with the number of variables. Iterative methods like gradient descent or conjugate gradient work better.

Advantages of iterative methods:

  • Memory efficient (don’t need to store large matrices)
  • Can be stopped early when sufficient accuracy is reached
  • Handle regularization naturally
  • Can process data in batches

When to use: High resolution spectroscopy, large scale chemometric modeling, online analysis systems

Poorly Conditioned Problems

When your variables are highly correlated (extremely common in spectroscopy, where adjacent wavelengths provide nearly identical information), the $X^{\top} X$ matrix becomes nearly singular. Standard solutions become unreliable.

Solutions:

  • Regularization methods (Ridge, Lasso) that add penalty terms [7]
  • Principal component regression that works with transformed variables
  • Partial least squares that handles correlation explicitly [8], [9]

Nonlinear Models

When your model is nonlinear in the parameters (exponential decay, Michaelis Menten kinetics, complex chemical equilibria), closed form solutions don’t exist.

Use: Gauss Newton, Levenberg Marquardt, or general nonlinear optimization methods. These iteratively linearize the problem and solve sequences of linear least squares problems.

Practical Computing Considerations

Understanding these computational aspects helps you choose appropriate software and interpret results correctly. The fundamental least squares principle stays the same regardless of how you compute it, but the numerical method can significantly affect the reliability of your results.

Numerical Stability: Even mathematically equivalent methods can give different results due to rounding errors. Methods that avoid computing $X^{\top} X$ explicitly (like QR decomposition) are often more stable.

Convergence Criteria: For iterative methods, you need to decide when to stop. Common criteria include: change in parameters becomes small, change in objective function becomes small, or gradient becomes close to zero.

Initialization: Iterative methods need starting values. For least squares, simple methods like setting all parameters to zero often work well, but sometimes you need more sophisticated initialization based on the problem structure.

Regularization: When you have more variables than data points, or when variables are highly correlated, some form of regularization usually improves the usefulness of your model, even if it slightly increases the fitting error.

Validation and Diagnostics

Regardless of which computational method you use, you should always validate your results and check the assumptions underlying least squares.

Residual Analysis: Plot residuals against predicted values, input variables, and measurement order. Random scatter indicates good model fit; systematic patterns suggest model inadequacy or assumption violations.

Cross Validation: Split your data into training and testing sets. Fit the model on training data and evaluate on test data. This helps detect overfitting and gives realistic estimates of prediction accuracy.

Condition Number: For direct methods, check the condition number of $X^{\top} X$. The condition number measures how sensitive the solution is to small changes in the data: essentially, how close the matrix is to being singular (non-invertible). A condition number of 1 is perfect, while very large values (roughly $10^{10}$ or above) indicate numerical instability (the matrix is nearly singular, often due to highly correlated variables).
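A quick NumPy sketch (synthetic, illustrative data) of how near-duplicate columns blow up the condition number of $X^{\top} X$:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 1, 50)
x2 = x1 + rng.normal(scale=1e-6, size=50)  # nearly identical to x1

X_good = np.column_stack([np.ones(50), x1])       # well conditioned
X_bad = np.column_stack([np.ones(50), x1, x2])    # nearly collinear

print(np.linalg.cond(X_good.T @ X_good))  # modest
print(np.linalg.cond(X_bad.T @ X_bad))    # enormous: near-singular
```

The second number is many orders of magnitude larger, signaling that solutions based on `X_bad` will be unreliable without regularization.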

Convergence Monitoring: For iterative methods, plot the objective function versus iteration number. The curve should decrease smoothly to a stable minimum.

Conclusion

We’ve come a long way in these two entries. In the first, we discovered why least squares uses the sum of squared errors: how it emerges naturally from probability theory and the physics of measurement. Now, we’ve explored how to actually compute those optimal parameters, from elegant closed-form solutions to powerful iterative algorithms.

The key insight is that all these methods solve the same fundamental optimization problem. They just differ in computational efficiency, numerical stability, and when they’re practical to use. For most analytical chemistry applications, understanding when to reach for direct methods (normal equations) versus iterative methods (gradient descent and variants) will serve you well.

Remember that least squares is not just a mathematical technique you apply blindly. It’s a principled approach to finding the best model given your data and assumptions. Whether you’re fitting a simple calibration curve or developing complex chemometric models, the same fundamental principles apply. Understanding both the why (from the first entry) and the how (from this entry) puts you in a strong position to apply these methods effectively and troubleshoot when things go wrong.

The methods we’ve discussed form the foundation for more advanced techniques like partial least squares, principal component regression, and regularized methods that you’ll encounter in modern analytical chemistry. Once you understand how least squares optimization works, these extensions become much more accessible and meaningful.

References

[1]Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press.
[2]Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra. SIAM.
[3]Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer.
[4]Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
[5]Marquardt, D. W. (1963). An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2), 431-441.
[6]Moré, J. J. (1978). The Levenberg-Marquardt algorithm: Implementation and theory. In Numerical Analysis (pp. 105-116). Springer.
[7]Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer.
[8]Martens, H., & Næs, T. (1989). Multivariate Calibration. Wiley.
[9]Wold, S., Sjöström, M., & Eriksson, L. (2001). PLS-regression: a basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems, 58(2), 109-130.
[10]Björck, Å. (1996). Numerical Methods for Least Squares Problems. SIAM.