Least Squares: your first step into chemometrics
Least squares forms the mathematical backbone of modern analytical chemistry. If you work in an analytical lab, you’ve probably used least squares regression more times than you’ve counted. It’s there when you build calibration curves from spectroscopic data. It helps you quantify analytes in messy samples. In chromatography, it helps you get numbers from the peaks in your data. If you use near-infrared data, you may use it to build partial least squares models. Even if you don’t think about it, least squares is a big part of the work you do in the lab.
Take a look at the interactive panel below. I’m willing to bet you’ve used at least one of those fitting methods during a typical lab day:
We learn that least squares means minimizing the sum of squared errors,

$$\min \sum_{i} (y_i - \hat{y}_i)^2,$$

but we rarely pause to ask: where does the square actually come from? Why this expression and not another one? The answer is not trivial, and it is related to fundamental properties of the measurements.
In this entry, we will derive, step by step, where the square in least squares actually comes from. Along the way, we will explore each concept and mathematical step clearly, so we can understand why the squared error appears naturally in this type of optimization. More importantly, we will see how the core assumptions behind least squares relate directly to the physical reality of the measurements we make in the laboratory.
Ah, just a last thing before starting: even if some of the equations here might look like monsters at first glance, I promise they are very simple at heart. Don’t be afraid! We’re really just talking about sums and multiplications.
Why nature chooses squares
I still remember the moment my high school mathematics teacher first introduced least squares. He stood at the blackboard, drawing a scatter plot of data points, and explained how we could find the “best fit” line by minimizing something called the “sum of squared errors.”
The idea that we could mathematically determine the optimal line through messy points was fascinating. But there was one thing that absolutely bothered me, and I couldn’t let it go.
“Why do we square the differences?” I asked, raising my hand. “Why not raise the differences to the third power instead?”
My teacher gave the standard textbook answer: “Well, we square them to avoid cancellations. If we just added up the differences without squaring, positive and negative errors would cancel each other out.”
Reasonable. But that explanation left me even more confused. “But if we’re worried about cancellations, why not raise them to the power of 4? Or 6? Or 8? Or use the absolute value, $\sum_i |y_i - \hat{y}_i|$? That seems much more natural to me. Why specifically the power of 2?”
My teacher looked slightly uncomfortable and said something about squares being “mathematically convenient”. I didn’t (and I still don’t) like convenient things, so that question nagged at me for years.
It wasn’t until much later, when I discovered an incredible explanation by Artem Kirsanov in his YouTube video on the probabilistic foundations of least squares [1] (see also García-Portugués [2]), that I finally understood the beautiful mathematics behind that seemingly arbitrary choice: the answer emerges from key assumptions about probabilities and the nature of the measurement.
The reality of analytical measurements
Before we start digging into why the least squares expression has the shape it does, we need to understand something fundamental about analytical measurements: they’re never perfect, and it’s not your fault!
For instance, imagine you’re a quality control chemist in a pharmaceutical company, rigorously testing a batch of an analgesic to ensure each dose contains the correct amount of acetylsalicylic acid. You take a precisely diluted sample of the solution and place it in a spectrophotometer (try to use the interactive component below to take some measurements!).
So, let’s assume that the perfect measurement for this specific case is 1.20 AU. But when you measure, your results are a bit different, like 1.25 AU, 1.19 AU, 1.22 AU, or 1.17 AU, as you can observe in the interactive component above.
These small variations aren’t due to carelessness or poor technique. They represent the fundamental reality of measurement:
In spectroscopy, thermal noise causes random motion of electrons in detector circuits, creating small voltage fluctuations that add uncertainty to your signal. In optical measurements, photon noise arises from the quantum mechanical reality that light arrives at your detector in discrete packets, leading to statistical variations in the number of photons counted. In chromatography, you encounter variations in mobile phase flow rate, small temperature fluctuations, and the stochastic nature of molecular interactions between analytes and the stationary phase. Even in electrochemical measurements, thermal motion of ions creates random current fluctuations that appear as noise in your signal. That’s the nature of measurements: an intrinsic property that cannot be avoided.
This random fluctuation is what we call noise, $\epsilon$, and it’s present in every analytical technique you’ll ever use.
Let’s express this mathematically. When you make a measurement $y$, what you observe is:

$$y = \mu + \epsilon$$

where $\mu$ is the underlying value you’re trying to measure (1.20 AU), and $\epsilon$ represents one specific noise value drawn from the underlying noise distribution.
You can imagine a distribution as a bag containing all possible values of $\epsilon$, each one with its own probability of being picked. Every time you take a measurement, you blindly reach into the bag and pull out one random value $\epsilon$, which gets added to the true value. Small values (close to zero) are usually the most likely to be picked, so most of your readings land near 1.20 AU. Larger values are rarer, but they still show up from time to time. The shape of this probability bag, whether the values are clustered tightly or spread out, symmetric or skewed, is what defines the distribution. The interactive demonstration below lets you explore this: take enough measurements, and you’ll start to see the shape of the bag.
Let’s visualize how our measurements are distributed. The histogram above shows how often each absorbance value appears (its frequency). The horizontal axis groups measurements into small intervals, and the vertical axis counts how many readings fell into each one. Hit Play and watch: at first, the bars look scattered and messy. But as you collect more and more measurements, a pattern emerges. The bars start forming a familiar bell shape, revealing the underlying distribution that was there all along. The red line shows the true Gaussian distribution that our noise follows.
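The same experiment is easy to mimic in a few lines of code. The sketch below assumes, as above, a true value of 1.20 AU and, purely for illustration, Gaussian noise with a standard deviation of 0.03 AU; it draws repeated “measurements” and shows that their average hovers around the true value while their spread matches the noise level:

```python
import random
import statistics

random.seed(42)

TRUE_VALUE = 1.20   # the "perfect" absorbance in AU (assumed for illustration)
SIGMA = 0.03        # noise standard deviation in AU (assumed for illustration)

# Each reading is the true value plus one random draw from the noise "bag"
readings = [TRUE_VALUE + random.gauss(0.0, SIGMA) for _ in range(10_000)]

print(f"mean  = {statistics.mean(readings):.4f} AU")   # hovers around 1.20
print(f"stdev = {statistics.stdev(readings):.4f} AU")  # hovers around 0.03
```

Plotting a histogram of `readings` reproduces the bell shape that the interactive panel reveals as you accumulate measurements.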
The shape of this distribution depends on the dominant noise sources in your particular analytical system. It might be Gaussian, Poisson, uniform, or something else entirely.
Towards the least squares expression
Now, as we move from understanding the physical reality of analytical measurements to developing the mathematical framework of least squares, i.e., from reality to statistics, we need to make an important distinction in terminology.
The noise we’ve been discussing represents the physical sources of uncertainty in your measurement. These physical processes create what we mathematically call measurement errors: the difference between what you observe and the true value you’re trying to measure.
This distinction matters because the statistical assumptions we’re about to discuss (normality, independence, constant variance) apply specifically to these mathematical error terms $\epsilon_i$. But these mathematical assumptions are justified by understanding how the underlying physical noise sources actually behave in real analytical instruments. Therefore, from this point forward in our mathematical development, when we write $\epsilon_i$ and $\hat{y}_i$ in our equations, we’re referring to these mathematical error terms and our model values, respectively.
From measurement errors to probabilities
We’ve established that every measurement we take contains some random error. But how does this fact lead us to the least squares formula? It starts by asking a simple question:
Given the data points we measured, what is the most likely true value or model that produced them?
We are no longer just trying to draw a line through some points. We are looking for the physical truth that makes our specific measurements the most probable outcome. This way of thinking will lead us directly to the famous squares.
To get there, we need three ingredients:
- Assume a probability distribution for our errors: We already argued that Gaussian is a reasonable choice. Now we’ll write down the probability of a single measurement.
- Combine the probabilities of all our measurements: If we have many data points, we need a way to compute the probability of observing the entire dataset at once. This is where we shift our perspective from individual measurements to a global, probabilistic view.
- Assume how our data spread around the model: We’ll make one final assumption about how the dispersion of our measurements behaves, and that will simplify everything into the familiar least squares formula.
Let’s go step by step.
Step 1: The probability of one measurement
Let’s start by looking at just one measurement through a probabilistic lens. We assume we have our model prediction, which we denote as $\hat{y}$ and which should ideally coincide with the true underlying value. However, as stated before, when you perform the measurement, you observe the value $y$.

The difference between what you observe and our model is the measurement error, which we define as:

$$\epsilon = y - \hat{y}$$
This error is what we assume comes from a random distribution, reflecting the inherent noise in your measurement process.
Then: what is the probability of getting this specific error $\epsilon$, given an underlying value $\hat{y}$? To answer this, we need to make our first assumption about how these errors behave: we assume the errors follow a Gaussian (normal) distribution.

Why this distribution and not another one? This is not a random choice, and it connects directly to the noise sources we discussed earlier.
Think about what happens inside a spectrophotometer when you take a single reading. Thermal noise in the electronics, photon counting statistics, tiny mechanical vibrations, temperature fluctuations… all of these small, independent effects add up to produce the final error in your measurement. No single source dominates; it’s the combination of many little contributions.
This is exactly the scenario where a famous mathematical idea, the Central Limit Theorem [3], applies. It tells us that when you add up many small, independent random effects, their sum is well-approximated by a Gaussian distribution, regardless of how each individual effect is distributed. Even if photon noise alone follows a Poisson distribution (which is not Gaussian at all), once it combines with all the other noise sources and you accumulate enough counts, the overall error ends up looking Gaussian.
So the Gaussian assumption is very reasonable for most analytical instruments, as long as the noise sources are independent, none is overwhelmingly large, and each has a finite variance.
The interactive demonstration below illustrates this idea in a simplified way. Instead of combining different noise sources (which is hard to visualize), it shows the same principle from a different angle: what happens when you average multiple random values. The math behind both cases is the same. Whether you are adding up many different noise sources within a single measurement, or averaging many replicate measurements together, the Central Limit Theorem applies in both situations, and the result tends toward a Gaussian.
Move the slider to increase $n$, the number of random values being averaged. Watch how the histograms evolve: despite starting from a perfectly uniform or a heavily skewed distribution, the averages converge toward the same symmetric, bell-shaped Gaussian curve. This is the Central Limit Theorem in action, and the reason why assuming Gaussian errors is often reasonable in analytical work.
Here’s a concrete example from the lab. Think about a detector in a spectrophotometer counting photons. When very few photons arrive (a weak signal, low signal-to-noise ratio), the counts fluctuate in an asymmetric, skewed way: you might count 3, 5, 2, 7 photons in successive readings. That pattern of variation is not Gaussian at all. But your instrument doesn’t just report one raw count. It accumulates photons over an integration time, effectively averaging thousands or millions of individual detection events. By the time the instrument displays a number on the screen, it has already averaged so many counts together that the Central Limit Theorem has done its work: the final reading fluctuates symmetrically around the true value, following a Gaussian distribution. The same thing happens with electronic noise in the detector circuit, where the instrument averages many rapid voltage samples into a single stable reading. So in practice, the Gaussian assumption is not something we impose artificially. It emerges naturally from how instruments work.
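You can watch the Central Limit Theorem do its work with a short simulation. The numbers below are illustrative only: we draw from a flat (uniform) distribution, which is nothing like a bell curve, and compare single draws with averages of 30 draws:

```python
import random
import statistics

random.seed(0)

def draw_average(n):
    """Average n values drawn from a flat (uniform) distribution on [0, 1)."""
    return sum(random.random() for _ in range(n)) / n

singles  = [draw_average(1)  for _ in range(20_000)]   # raw uniform draws
averages = [draw_average(30) for _ in range(20_000)]   # averages of 30 draws

# Both are centered on 0.5, but the averages cluster far more tightly:
# theory predicts the spread shrinks from ~0.289 to ~0.289 / sqrt(30) ~ 0.053
print(statistics.stdev(singles), statistics.stdev(averages))
```

A histogram of `averages` is bell-shaped even though every individual draw came from a perfectly flat distribution, which is exactly what the slider demonstration above shows.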
The formula for this bell curve is:

$$p(\epsilon) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\epsilon^2}{2\sigma^2}\right)$$

That’s exactly the shape you saw on the interactive histogram above, and it tells us that small errors (when $\epsilon$ is close to zero) are very probable, and large errors are very improbable.

If we replace the general error with our specific measurement error, $\epsilon = y - \hat{y}$, we get the probability of observing our data point, $y$:

$$p(y \mid \hat{y}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - \hat{y})^2}{2\sigma^2}\right)$$

where $p(y \mid \hat{y})$ means “the probability of observing measurement $y$ given that what we expect is $\hat{y}$.” The vertical bar “|” means “given that” or “conditional on”.
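This formula translates directly into code. The sketch below (with an illustrative prediction of 1.20 AU and $\sigma$ = 0.03 AU, matching our running example) shows that a reading right on the model prediction has the highest probability density, while a reading three standard deviations away is far less likely:

```python
import math

def gaussian_pdf(y, y_hat, sigma):
    """Probability density of observing y when the model predicts y_hat."""
    eps = y - y_hat                       # the measurement error
    norm = 1.0 / math.sqrt(2.0 * math.pi * sigma**2)
    return norm * math.exp(-eps**2 / (2.0 * sigma**2))

# A reading right on the prediction is the most probable...
peak = gaussian_pdf(1.20, 1.20, 0.03)
# ...while a reading 3 sigma away (1.29 AU) is far less probable.
tail = gaussian_pdf(1.29, 1.20, 0.03)
print(peak, tail)
```

Note that these are probability *densities*, not probabilities, which is why the peak value can exceed 1.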
Step 2: The probability of the whole dataset
In a real experiment (for example, a calibration curve), you have more than one data point. You have a whole set of measurements: $y_1, y_2, \ldots, y_n$.
The next question we must ask ourselves is: what is the probability of observing the entire collection of measurements at once? To answer it, we need to understand how the errors of different measurements are related. This leads to our second assumption.
Think of it like flipping a coin. The result of the first flip does not change the 50/50 probability for the second flip. Our measurements should be the same.
If you rinse your equipment properly between samples and allow your instrument to stabilize, then each measurement is a new, independent event. The random noise that made your first measurement slightly high should not affect your second measurement.
Because the errors are independent, we can calculate the total probability of the whole dataset by multiplying the individual probabilities of each measurement:

$$p(y_1, y_2, \ldots, y_n \mid \hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n) = \prod_{i=1}^{n} p(y_i \mid \hat{y}_i)$$
This joint probability is usually called the likelihood. Now, let’s substitute the Gaussian probability formula we found in Step 1:

$$L = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{(y_i - \hat{y}_i)^2}{2\sigma_i^2}\right)$$

That equation is a monster! But don’t be afraid, it is just individual likelihood contributions being multiplied: the probability density of observing our entire dataset, expressed as a product of individual Gaussian terms.
Now, our goal is to find the model parameters that maximize this probability. The logic is beautifully simple: we want to find the model that makes our actual experimental data the most likely outcome. If you measured a series of absorbance values, you want to find the calibration curve parameters that make those specific measurements the most probable result you could have gotten. This approach is called “maximum likelihood estimation” and it’s based on a very reasonable principle: the best model is the one that makes your observed data most likely to occur.
However, working directly with this product of complex exponential expressions is mathematically challenging. When you multiply many small probabilities together, the product becomes so extraordinarily tiny that computers may treat the result as zero due to limited numerical precision. This “numerical underflow” can completely break your optimization algorithm.
Fortunately, there’s a trick that we can use. We can apply a logarithm transformation to this product of probabilities. The logarithm function is monotonic, i.e., it always increases in the same direction as its input. Think of it like this: if you have two probabilities where $P_1 > P_2$, then you’re guaranteed that $\log P_1 > \log P_2$. The logarithm preserves the relative ordering, which means that the parameter values that maximize our original probability will be exactly the same parameter values that maximize the logarithm of that probability.
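A tiny numerical experiment makes both the underflow problem and the logarithm fix concrete. The density value `1e-3` below is arbitrary; the point is only what happens when you multiply 200 such numbers versus summing their logarithms:

```python
import math

# Suppose each of 200 measurements contributes a likelihood density of 1e-3
densities = [1e-3] * 200

# Multiplying them directly underflows: the true product is 1e-600,
# far below the smallest positive double (~1e-308), so we get exactly 0.0
product = 1.0
for p in densities:
    product *= p
print(product)          # 0.0 -- the likelihood has vanished numerically

# Summing logarithms carries the same information without any underflow
log_likelihood = sum(math.log(p) for p in densities)
print(log_likelihood)   # a perfectly representable finite number
```

Any optimizer comparing candidate parameter sets through `product` would see only zeros; comparing them through `log_likelihood` works fine, and the monotonicity of the logarithm guarantees both comparisons would rank the candidates identically.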
Applying this transformation to our probability product, and using the fundamental property that $\log(a \cdot b) = \log a + \log b$, this becomes:

$$\log L = \sum_{i=1}^{n} \log p(y_i \mid \hat{y}_i)$$
The multiplication of probabilities now is a sum! Now, remember that each individual probability has the form:

$$p(y_i \mid \hat{y}_i) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{(y_i - \hat{y}_i)^2}{2\sigma_i^2}\right)$$
And our goal is to find the model parameters that maximize the total log-likelihood:

$$\hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{n} \log p(y_i \mid \hat{y}_i)$$
Here, $\theta$ represents the parameters we optimize to maximize the log-likelihood. For example, in a linear regression model, $\theta$ includes the slope and intercept. In an exponential decay model, $\theta$ consists of the pre-exponential factor and the decay rate. These parameters define the predictions $\hat{y}_i$, and the goal is to find the values of $\theta$ that make the observed data most likely.
When we substitute the Gaussian probability expression and take the logarithm, we apply the logarithm to both parts of the product inside each probability. Using the property that $\log(a \cdot b) = \log a + \log b$, this becomes:

$$\log L = \sum_{i=1}^{n} \log\left[\frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{(y_i - \hat{y}_i)^2}{2\sigma_i^2}\right)\right]$$

and then

$$\log L = \sum_{i=1}^{n} \left[\log\frac{1}{\sqrt{2\pi\sigma_i^2}} + \log \exp\left(-\frac{(y_i - \hat{y}_i)^2}{2\sigma_i^2}\right)\right]$$

When we take the logarithm of the exponential term, we use the fundamental property that $\log(e^{x}) = x$. This means:

$$\log \exp\left(-\frac{(y_i - \hat{y}_i)^2}{2\sigma_i^2}\right) = -\frac{(y_i - \hat{y}_i)^2}{2\sigma_i^2}$$

This simplification transforms our expression into:

$$\log L = \sum_{i=1}^{n} \left[\log\frac{1}{\sqrt{2\pi\sigma_i^2}} - \frac{(y_i - \hat{y}_i)^2}{2\sigma_i^2}\right]$$

Now we can separate this expression into two distinct parts to better understand what each component contributes to our optimization problem:

$$\log L = \sum_{i=1}^{n} \log\frac{1}{\sqrt{2\pi\sigma_i^2}} - \sum_{i=1}^{n} \frac{(y_i - \hat{y}_i)^2}{2\sigma_i^2}$$
This separation reveals something profound about the structure of our optimization problem. The first sum contains only normalization constants that depend on our measurement uncertainties $\sigma_i$ but not on our model parameters $\theta$. Since these terms remain constant regardless of which model parameters we choose, they don’t affect which parameters give us the maximum likelihood. For optimization purposes, we can treat this entire first sum as a constant.
Therefore, maximizing the full expression is mathematically equivalent to maximizing only the second part:

$$\max_{\theta} \left[-\sum_{i=1}^{n} \frac{(y_i - \hat{y}_i)^2}{2\sigma_i^2}\right]$$

But here’s a fundamental insight about optimization that connects our probabilistic reasoning to the familiar least squares method: maximizing a negative expression is exactly the same as minimizing the positive version of that expression. Since we want to find the maximum of the negative sum, we can equivalently find the minimum of the positive sum:

$$\min_{\theta} \sum_{i=1}^{n} \frac{(y_i - \hat{y}_i)^2}{2\sigma_i^2}$$

Finally, since the factor of $\frac{1}{2}$ doesn’t change which parameters minimize the expression (it just scales everything uniformly), we can remove it without affecting the optimization result:

$$\min_{\theta} \sum_{i=1}^{n} \frac{(y_i - \hat{y}_i)^2}{\sigma_i^2}$$

At this point, we’ve established that maximizing the likelihood of our data leads us to minimize the expression:

$$\sum_{i=1}^{n} \frac{(y_i - \hat{y}_i)^2}{\sigma_i^2}$$
This is called weighted least squares, and it’s actually the most general form of our optimization problem. Each measurement contributes to the sum with a weight $w_i = 1/\sigma_i^2$ that depends on its individual measurement uncertainty. Measurements with smaller uncertainties (smaller $\sigma_i$) get more weight in determining our final model, while noisier measurements (larger $\sigma_i$) get less influence. This makes intuitive sense: we should trust our most precise measurements more heavily when fitting our model.
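For a straight line, weighted least squares even has a closed-form solution. The sketch below is a minimal pure-Python implementation; the data are hypothetical, built so that four precise points lie exactly on the line $y = 0.10 + 0.50x$ while a fifth, much noisier point sits well above it:

```python
def weighted_linefit(x, y, sigma):
    """Fit y = a + b*x by minimizing sum((y_i - a - b*x_i)^2 / sigma_i^2)."""
    w = [1.0 / s**2 for s in sigma]               # weight = inverse variance
    W   = sum(w)
    Sx  = sum(wi * xi for wi, xi in zip(w, x))
    Sy  = sum(wi * yi for wi, yi in zip(w, y))
    Sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    b = (W * Sxy - Sx * Sy) / (W * Sxx - Sx * Sx)  # slope
    a = (Sy - b * Sx) / W                          # intercept
    return a, b

# Hypothetical calibration: true line y = 0.10 + 0.50*x, last point is noisy
x     = [1.0, 2.0, 3.0, 4.0, 5.0]
y     = [0.60, 1.10, 1.60, 2.10, 3.00]   # last point sits 0.40 above the line
sigma = [0.01, 0.01, 0.01, 0.01, 0.50]   # ...but we trust it far less

a, b = weighted_linefit(x, y, sigma)
print(a, b)  # close to 0.10 and 0.50: the noisy point barely matters
```

Because the outlying point carries a weight thousands of times smaller than the others, the fit is essentially the line through the four precise points. Set all $\sigma_i$ equal and the same function reduces to ordinary least squares.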
But here’s where we encounter a practical challenge that leads us to our third fundamental assumption. In real analytical work, how often do you actually know the individual uncertainty $\sigma_i$ for each and every measurement you make?
Step 3: The role of constant variance
In most laboratory situations, you don’t measure the uncertainty of each individual data point. Instead, you typically characterize your instrument’s overall precision through replicate measurements or method validation studies. You might determine that your HPLC system has a typical injection precision of 2% RSD, or that your spectrophotometer has a baseline noise level of 0.01 absorbance units. This practical reality leads us to a crucial simplifying assumption: homoscedasticity, the assumption that every measurement shares the same variance $\sigma^2$.
Let’s explore what this assumption really means in concrete terms. Imagine you’re running a calibration curve for quantifying caffeine using UV spectroscopy. Homoscedasticity means that when you measure a 1 mg/L standard solution, the measurement uncertainty is essentially the same as when you measure a 2 mg/L standard or a 3 mg/L standard. The absolute magnitude of the random error stays approximately constant across your entire calibration range.
This might seem counterintuitive at first. You might expect that higher concentrations, which give stronger signals, would be more precise. But remember that we’re talking about absolute error, not relative error. Your spectrophotometer’s detector noise might contribute ±0.01 absorbance units of uncertainty whether you’re measuring 0.2 AU or 1.0 AU. The absolute uncertainty stays roughly the same, even though the relative uncertainty (the percentage error) decreases for stronger signals.
The interactive component below demonstrates this concept using a linear calibration relationship. By default, it shows homoscedastic data where error bars maintain constant height across all concentrations. Toggle the “Heteroscedastic noise” option to see what happens when measurement uncertainty increases with signal strength, and notice how the error bars grow larger at higher concentrations, violating our constant variance assumption.
This comparison is crucial for understanding when our mathematical assumptions are valid. In the homoscedastic case, you can see that the scatter around the line remains roughly constant whether you’re looking at low or high concentrations. In the heteroscedastic case, the scatter clearly increases as concentration increases, which would require us to use weighted least squares rather than ordinary least squares.
Now let’s see how the homoscedasticity assumption transforms our mathematics. When every measurement has the same variance $\sigma^2$, we can factor this constant out of our summation:

$$\sum_{i=1}^{n} \frac{(y_i - \hat{y}_i)^2}{\sigma^2} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Here’s the key mathematical insight: since $\frac{1}{\sigma^2}$ is just a positive constant that doesn’t depend on our model parameters, it doesn’t affect which parameters minimize the expression. Multiplying an entire function by a positive constant doesn’t change the location of its minimum. It just rescales the function vertically.

Therefore, minimizing our expression is mathematically equivalent to minimizing:

$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
And there it is!🎉 We’ve finally arrived at the familiar ordinary least squares expression that you’ve probably encountered countless times in analytical chemistry. The squares that seemed so mysterious at the beginning of our journey have emerged naturally from three reasonable assumptions about how analytical measurements behave:
- Gaussian errors: Measurement errors follow a bell-shaped distribution, which reflects the Central Limit Theorem acting on many small, independent noise sources in your instrument.
- Independent measurements: The error in one measurement doesn’t predict the error in the next, which reflects good laboratory practice and proper instrument operation.
- Constant variance: All measurements have similar absolute precision, which reflects the typical behavior of most analytical instruments across their useful range.
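Putting the result to work takes only a few lines. The sketch below implements the textbook closed-form solution for an ordinary least squares straight-line fit and checks, numerically, that the fitted parameters really do minimize the sum of squared residuals (the calibration numbers are made up for illustration):

```python
def sse(params, x, y):
    """Sum of squared errors for the line y = a + b*x."""
    a, b = params
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

def ols_linefit(x, y):
    """Ordinary least squares for y = a + b*x (all points weighted equally)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
    a = ybar - b * xbar
    return a, b

# Hypothetical calibration data (concentration, absorbance)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.21, 0.39, 0.62, 0.81, 1.00]

a, b = ols_linefit(x, y)

# Any small perturbation of the fitted parameters increases the SSE,
# confirming that (a, b) sits at the minimum of the squared-error surface
assert sse((a, b), x, y) < sse((a + 0.01, b), x, y)
assert sse((a, b), x, y) < sse((a, b - 0.01), x, y)
print(a, b)
```

The same minimization can also be solved through the normal equations in matrix form, which is where the next entry picks up.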
Understanding these nuances helps you choose the right mathematical approach for your specific analytical situation. When homoscedasticity is reasonable, ordinary least squares provides an elegant and computationally efficient solution. When it’s not, weighted least squares offers a more sophisticated approach that properly accounts for varying measurement uncertainties.
This mathematical framework elegantly connects the physics of measurement (random errors with Gaussian distributions) to the statistics of parameter estimation (maximum likelihood) to the practical mathematics we use every day (least squares fitting). The beauty lies not just in the mathematical elegance, but in how these assumptions reflect the real behavior of analytical instruments and the practical constraints of laboratory work.
Some considerations
While our probabilistic derivation explains why least squares uses the sum of squared errors, there are additional assumptions and practical considerations that every analytical chemist should understand to apply this method effectively in real laboratory work. These considerations will help you recognize when least squares is appropriate, when it might fail, and how to adapt your approach accordingly.
Key Assumptions for Linear Models
Our derivation focused on three fundamental assumptions for the sake of simplifying the explanation, but the classical least squares framework, especially for linear models, rests on a few additional requirements, the most important being that the model is linear in its parameters and that the errors have zero mean.
The key to understanding “linear in parameters” is recognizing what makes a model linear or non-linear from a mathematical perspective. A model is linear in parameters when each parameter appears only to the first power and no parameters multiply each other.
Consider these examples from analytical chemistry:
Linear in Parameters:
- Simple calibration: $y = b_0 + b_1 x$ ✓ (parameters $b_0$ and $b_1$ each appear once, no multiplication between them)
- Polynomial calibration: $y = b_0 + b_1 x + b_2 x^2 + b_3 x^3$ ✓ (even though the relationship between $x$ and $y$ is curved, parameters $b_0$, $b_1$, $b_2$, $b_3$ are independent)
- Multiple regression: $y = b_0 + b_1 x_1 + b_2 x_2$ ✓ (each parameter has its own coefficient role)
Non-linear in Parameters:
- Exponential decay: $y = a\, e^{-kt}$ ✗ (parameter $k$ appears inside an exponential function)
- Michaelis-Menten kinetics: $v = \dfrac{V_{\max}[S]}{K_M + [S]}$ ✗ (parameters $V_{\max}$ and $K_M$ interact in complex ways)
- Interactive model: $y = b_0 + b_1 b_2 x$ ✗ (parameters $b_1$ and $b_2$ multiply each other)
Even when your model is not linear in the parameters, it is still possible to minimize the sum of squared errors using iterative methods like gradient descent or Levenberg–Marquardt. However, in this case you lose the ability to write the model in matrix form $\hat{\mathbf{y}} = \mathbf{X}\mathbf{b}$, which means you also lose access to some powerful techniques we will introduce in the next entry, such as matrix factorization and projections.
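To make “iterative methods” less abstract, here is a deliberately simple sketch: fitting the single decay rate $k$ of $y = e^{-kt}$ by plain gradient descent on the sum of squared errors. The data are synthetic and noise-free, and the learning rate is hand-picked for this toy problem; real nonlinear fitting would use a robust library routine (e.g., a Levenberg–Marquardt implementation) instead:

```python
import math

# Synthetic, noise-free decay data generated from y = exp(-k*t) with k = 0.5
K_TRUE = 0.5
t = [0.5 * i for i in range(11)]          # 0.0, 0.5, ..., 5.0
y = [math.exp(-K_TRUE * ti) for ti in t]

# Minimize SSE(k) = sum_i (y_i - exp(-k*t_i))^2 by plain gradient descent
k = 2.0            # deliberately poor starting guess
lr = 0.02          # learning rate (hand-tuned for this toy example)
for _ in range(10_000):
    # dSSE/dk = sum_i 2*(y_i - exp(-k*t_i)) * t_i * exp(-k*t_i)
    grad = sum(2.0 * (math.exp(-k * ti) - yi) * (-ti) * math.exp(-k * ti)
               for ti, yi in zip(t, y))
    k -= lr * grad

print(k)  # converges toward the true decay rate of 0.5
```

Unlike the linear case, there is no closed-form solution here: the algorithm must walk downhill on the squared-error surface step by step, and with noisy data or a bad starting guess it can stall in a local minimum.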
A related assumption is that the errors have zero mean. In practical terms, this means that if you could take an infinite number of measurements of the same sample under identical conditions, the average of all measurement errors would approach zero. Sometimes your instrument reads slightly high, sometimes slightly low, but there’s no consistent tendency in either direction. This assumption essentially requires that you’ve eliminated systematic errors through proper analytical technique.
When assumptions break down and what to do about it
Understanding when least squares assumptions fail helps you recognize situations where alternative approaches might be more appropriate. Each violation has characteristic symptoms and established solutions.
Heteroscedasticity occurs when measurement precision changes across your analytical range, as you saw in the interactive component above. You might observe this in trace analysis where measurements near the detection limit have proportionally larger uncertainties than measurements at higher concentrations. In Time-Correlated Single Photon Counting (TCSPC), heteroscedasticity can arise due to variations in photon count statistics: measurements with fewer photon events near the baseline tend to exhibit higher relative uncertainty compared to peak regions with more photon events. In chromatography, peak integration precision often depends on peak height, creating heteroscedastic errors. When you suspect heteroscedasticity, examine residual plots where errors are plotted against predicted values. If you see a funnel shape or other systematic pattern in the residuals, consider weighted least squares where each measurement is weighted by the inverse of its estimated variance [7], [8].
Violating the assumption of constant variance (homoscedasticity) does not usually bias the parameter estimates themselves in ordinary least squares, but it does bias the standard errors, which in turn affects confidence intervals and hypothesis tests. This can lead to incorrect inferences, such as overstating the significance of predictors. In cases where heteroscedasticity is substantial and ignored, the model may appear more precise than it truly is. For example, confidence intervals for regression coefficients (computed under constant-variance formulas) can be biased.
Non-Gaussian errors often show up in real lab work. For example, when you count photons at very low light levels, the random noise follows a Poisson curve and looks skewed to the right. A dirty cuvette or a brief detector glitch can throw in a few big outliers, making the tails of the error curve very fat. And when results fall below your detection limit, you report “non-detect,” which cuts off part of the distribution on the left.
Most textbooks and software assume a neat bell-shaped (Gaussian) error because it makes the algebra tidy. The good news is that ordinary least squares still gives the correct average answer as long as the errors keep the same spread, are not linked from one measurement to the next, and average out to zero in the long run.
Where non-Gaussian errors cause trouble is in the fine details: small-sample confidence limits and the usual $t$ and $F$ tests can be off. When that happens, you can switch to a robust fit, apply variance-stabilizing transformations (such as square root for Poisson count data), or pick a model that matches the real error distribution, such as Poisson regression.
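The square-root transform for counting noise is easy to verify numerically. The sketch below simulates Poisson photon counts at a weak and a strong signal level (means of 10 and 100 counts, chosen arbitrarily) using Knuth's classic sampling algorithm; raw counts are heteroscedastic, while the square-rooted counts have nearly the same variance at both levels:

```python
import math
import random

random.seed(7)

def poisson(lam):
    """Knuth's algorithm for Poisson-distributed counts (fine for modest lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        k += 1
        p *= random.random()
        if p <= L:
            return k - 1

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

low  = [poisson(10)  for _ in range(20_000)]   # weak signal
high = [poisson(100) for _ in range(20_000)]   # strong signal

# Raw counts are heteroscedastic: the variance grows with the mean (~lam)
var_low_raw,  var_high_raw  = variance(low), variance(high)
# After sqrt, both variances sit near 1/4, independent of the signal level
var_low_sqrt  = variance([math.sqrt(v) for v in low])
var_high_sqrt = variance([math.sqrt(v) for v in high])

print(var_low_raw, var_high_raw)     # roughly 10 vs 100
print(var_low_sqrt, var_high_sqrt)   # both roughly 0.25
```

After the transform, the constant-variance assumption behind ordinary least squares is approximately restored, which is exactly why this trick is recommended for count data.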
Autocorrelated errors violate the independence assumption and commonly occur in time-series measurements where drift or memory effects cause sequential measurements to be related. You might see this in continuous monitoring applications or when measuring samples in sequence without adequate equilibration time. Autocorrelation can be detected by plotting residuals against measurement order and looking for patterns. Solutions include incorporating time-trend terms into your model or using time-series regression methods.
Conclusion
We began with a simple question: why do we use the “sum of squares”? As we saw, the explanation from a probability viewpoint comes from a few key ideas about our measurements: that errors follow a bell curve (a Gaussian distribution), that they are independent, and that the measurement precision stays the same. This approach helps us understand the real-world reason why the method works.
To get a more complete picture, it is also useful to mention the Gauss-Markov theorem [4], [5], [6]. This theorem shows that, if (1) the model is linear in its parameters, (2) the errors have mean zero, (3) the errors share the same variance (homoscedasticity), and (4) the errors are independent (uncorrelated), the ordinary-least-squares (OLS) estimator is the BLUE: the Best Linear Unbiased Estimator. “Best” means it has the smallest possible variance among all estimators that are linear in the data and unbiased.
Unlike earlier steps in our story, the Gauss–Markov proof does not require the errors to follow a perfect bell curve (Normal distribution); the four conditions above are enough. That wider set of rules makes the least-squares method even more solid and trustworthy in everyday work.
If you understand these core ideas, it becomes much easier to learn more advanced methods in chemometrics, like Multivariate Curve Resolution or special algorithms for baseline correction. All these methods solve the same basic problem: finding the model parameters that best fit the data, based on good assumptions about measurement errors.
So, now that we understand why we use the least squares formula, the next natural question is how do we find the solution? In the next entry, we will look at the mathematical ways to calculate the best least squares fit, from the neat “normal equations” to modern computer-based methods.