Least Squares: your first step into chemometrics
Least squares forms the mathematical backbone of modern analytical chemistry. If you work in an analytical lab, you’ve probably used least squares regression more times than you’ve counted. It’s there when you build calibration curves from spectroscopic data. It helps you quantify analytes in messy samples. In chromatography, it helps you get numbers from the peaks in your data. If you use near-infrared data, you may use it to build partial least squares models. Even if you don’t think about it, least squares is a big part of the work you do in the lab.
Take a look at the interactive panel below. I’m willing to bet you’ve used at least one of those fitting methods during a typical lab day:
We learn that least squares means minimizing the sum of squared errors,

$$\min \sum_{i} (y_i - \hat{y}_i)^2,$$

but we rarely pause to ask: where does the square actually come from? Why this expression and not another one? The answer is not trivial, and it is related to fundamental properties of the measurements.
In this entry, we will derive, step by step, where the square in least squares actually comes from. Along the way, we will explore each concept and mathematical step clearly, so we can understand why the squared error appears naturally in this type of optimization. More importantly, we will see how the core assumptions behind least squares relate directly to the physical reality of the measurements we make in the laboratory.
Ah, just a last thing before starting: even if some of the equations here might look like monsters at first glance, I promise they are very simple at heart. Don’t be afraid! We’re really just talking about sums and multiplications.
Why nature chooses squares
I still remember the moment my high school mathematics teacher first introduced least squares. He stood at the blackboard, drawing a scatter plot of data points, and explained how we could find the “best fit” line by minimizing something called the “sum of squared errors.”
The idea that we could mathematically determine the optimal line through messy points was fascinating. But there was one thing that absolutely bothered me, and I couldn’t let it go.
“Why do we square the differences?” I asked, raising my hand. “Why not raise the differences to the third power instead?”
My teacher gave the standard textbook answer: “Well, we square them to avoid cancellations. If we just added up the differences without squaring, positive and negative errors would cancel each other out.”
Reasonable. But that explanation left me even more confused. “But if we’re worried about cancellations, why not raise them to the power of 4? Or 6? Or 8? Or use the absolute value, $\sum_i |y_i - \hat{y}_i|$? That seems much more natural to me. Why specifically the power of 2?”
My teacher looked slightly uncomfortable and said something about squares being “mathematically convenient”. I didn’t (and I still don’t) like convenient things, so that question nagged at me for years.
It wasn’t until much later, when I discovered an incredible explanation by Artem Kirsanov in his YouTube video on the probabilistic foundations of least squares [1] (see also García-Portugués [2]), that I finally understood the beautiful mathematics behind that seemingly arbitrary choice: the answer emerges from key assumptions about probabilities and the nature of the measurement.
The reality of analytical measurements
Before we start digging into why the least squares expression has the shape it does, we need to understand something fundamental about analytical measurements: they’re never perfect, and it’s not your fault!
For instance, imagine you’re a quality control chemist in a pharmaceutical company, rigorously testing a batch of an analgesic to ensure each dose contains the correct amount of acetylsalicylic acid. You take a precisely diluted sample of the solution and place it in a spectrophotometer (try to use the interactive component below to take some measurements!).
So, let’s assume that the perfect measurement for this specific case is 1.20 AU. But when you measure, your results are a bit different, like 1.25 AU, 1.19 AU, 1.22 AU, or 1.17 AU, as you can observe in the interactive component above.
These small variations aren’t due to carelessness or poor technique. They represent the fundamental reality of measurement:
In spectroscopy, thermal noise causes random motion of electrons in detector circuits, creating small voltage fluctuations that add uncertainty to your signal. In optical measurements, photon noise arises from the quantum mechanical reality that light arrives at your detector in discrete packets, leading to statistical variations in the number of photons counted. In chromatography, you encounter variations in mobile phase flow rate, small temperature fluctuations, and the stochastic nature of molecular interactions between analytes and the stationary phase. Even in electrochemical measurements, thermal motion of ions creates random current fluctuations that appear as noise in your signal. That’s the nature of measurements: an intrinsic property that cannot be avoided.
This random fluctuation is what we call noise, $\epsilon$, and it’s present in every analytical technique you’ll ever use.
Let’s express this mathematically. When you make a measurement $y$, what you observe is:

$$y = \mu + \epsilon$$

where $\mu$ is the underlying value you’re trying to measure (1.20 AU), and $\epsilon$ represents one specific noise value drawn from the underlying noise distribution.
You can imagine a distribution as a bag containing all possible values of $\epsilon$, each one with its own probability of being picked. Every time you take a measurement, you blindly reach into the bag and pull out one random value $\epsilon$, which gets added to the true value. Small values (close to zero) are usually the most likely to be picked, so most of your readings land near 1.20 AU. Larger values are rarer, but they still show up from time to time. The shape of this probability bag, whether the values are clustered tightly or spread out, symmetric or skewed, is what defines the distribution. The interactive demonstration below lets you explore this: take enough measurements, and you’ll start to see the shape of the bag.
Let’s visualize how our measurements are distributed. The histogram above shows how often each absorbance value appears (its frequency). The horizontal axis groups measurements into small intervals, and the vertical axis counts how many readings fell into each one. Hit Play and watch: at first, the bars look scattered and messy. But as you collect more and more measurements, a pattern emerges. The bars start forming a familiar bell shape, revealing the underlying distribution that was there all along. The red line shows the true Gaussian distribution that our noise follows.
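The same experiment is easy to mimic in a few lines of code. The sketch below assumes, as above, a true value of 1.20 AU and, purely for illustration, Gaussian noise with a standard deviation of 0.03 AU; it draws repeated “measurements” and shows that their average hovers around the true value while their spread matches the noise level:

```python
import random
import statistics

random.seed(42)

TRUE_VALUE = 1.20   # the "perfect" absorbance in AU (assumed for illustration)
SIGMA = 0.03        # noise standard deviation in AU (assumed for illustration)

# Each reading is the true value plus one random draw from the noise "bag"
readings = [TRUE_VALUE + random.gauss(0.0, SIGMA) for _ in range(10_000)]

print(f"mean  = {statistics.mean(readings):.4f} AU")   # hovers around 1.20
print(f"stdev = {statistics.stdev(readings):.4f} AU")  # hovers around 0.03
```

Plotting a histogram of `readings` reproduces the bell shape that the interactive panel reveals as you accumulate measurements.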
The shape of this distribution depends on the dominant noise sources in your particular analytical system. It might be Gaussian, Poisson, uniform, or something else entirely.
Towards the least squares expression
Now, as we move from understanding the physical reality of analytical measurements to developing the mathematical framework of least squares, i.e., from reality to statistics, we need to make an important distinction in terminology.
The noise we’ve been discussing represents the physical sources of uncertainty in your measurement. These physical processes create what we mathematically call measurement errors: the difference between what you observe and the true value you’re trying to measure.
This distinction matters because the statistical assumptions we’re about to discuss (normality, independence, constant variance) apply specifically to these mathematical error terms $\epsilon_i$. But these mathematical assumptions are justified by understanding how the underlying physical noise sources actually behave in real analytical instruments. Therefore, from this point forward in our mathematical development, when we write $\epsilon_i$ and $\hat{y}_i$ in our equations, we’re referring to these mathematical error terms and our model values, respectively.
From measurement errors to probabilities
We’ve established that every measurement we take contains some random error. But how does this fact lead us to the least squares formula? It starts by asking a simple question:
Given the data points we measured, what is the most likely true value or model that produced them?
We are no longer just trying to draw a line through some points. We are looking for the physical truth that makes our specific measurements the most probable outcome. This way of thinking will lead us directly to the famous squares.
To get there, we need three ingredients:
- Assume a probability distribution for our errors: We already argued that Gaussian is a reasonable choice. Now we’ll write down the probability of a single measurement.
- Combine the probabilities of all our measurements: If we have many data points, we need a way to compute the probability of observing the entire dataset at once. This is where we shift our perspective from individual measurements to a global, probabilistic view.
- Assume how our data spread around the model: We’ll make one final assumption about how the dispersion of our measurements behaves, and that will simplify everything into the familiar least squares formula.
Let’s go step by step.
Step 1: The probability of one measurement
Let’s start by looking at just one measurement through a probabilistic lens. We assume we have our model prediction, which we denote as $\hat{y}$ and which should ideally coincide with the true underlying value. However, as stated before, when you perform the measurement, you observe the value $y$.

The difference between what you observe and our model is the measurement error, which we define as:

$$\epsilon = y - \hat{y}$$
This error is what we assume comes from a random distribution, reflecting the inherent noise in your measurement process.
Then: what is the probability of getting this specific error $\epsilon$, given an underlying value $\hat{y}$? To answer this, we need to make our first assumption about how these errors behave: we assume the errors follow a Gaussian (normal) distribution.

Why this distribution and not another one? This is not a random choice, and it connects directly to the noise sources we discussed earlier.
Think about what happens inside a spectrophotometer when you take a single reading. Thermal noise in the electronics, photon counting statistics, tiny mechanical vibrations, temperature fluctuations… all of these small, independent effects add up to produce the final error in your measurement. No single source dominates; it’s the combination of many little contributions.
This is exactly the scenario where a famous mathematical idea, the Central Limit Theorem [3], applies. It tells us that when you add up many small, independent random effects, their sum is well-approximated by a Gaussian distribution, regardless of how each individual effect is distributed. Even if photon noise alone follows a Poisson distribution (which is not Gaussian at all), once it combines with all the other noise sources and you accumulate enough counts, the overall error ends up looking Gaussian.
So the Gaussian assumption is very reasonable for most analytical instruments, as long as the noise sources are independent, none is overwhelmingly large, and each has a finite variance.
The interactive demonstration below illustrates this idea in a simplified way. Instead of combining different noise sources (which is hard to visualize), it shows the same principle from a different angle: what happens when you average multiple random values. The math behind both cases is the same. Whether you are adding up many different noise sources within a single measurement, or averaging many replicate measurements together, the Central Limit Theorem applies in both situations, and the result tends toward a Gaussian.
Move the slider to increase $n$, the number of random values being averaged. Watch how the histograms evolve: despite starting from a perfectly uniform or a heavily skewed distribution, the averages converge toward the same symmetric, bell-shaped Gaussian curve. This is the Central Limit Theorem in action, and the reason why assuming Gaussian errors is often reasonable in analytical work.
Here’s a concrete example from the lab. Think about a detector in a spectrophotometer counting photons. When very few photons arrive (a weak signal, low signal-to-noise ratio), the counts fluctuate in an asymmetric, skewed way: you might count 3, 5, 2, 7 photons in successive readings. That pattern of variation is not Gaussian at all. But your instrument doesn’t just report one raw count. It accumulates photons over an integration time, effectively averaging thousands or millions of individual detection events. By the time the instrument displays a number on the screen, it has already averaged so many counts together that the Central Limit Theorem has done its work: the final reading fluctuates symmetrically around the true value, following a Gaussian distribution. The same thing happens with electronic noise in the detector circuit, where the instrument averages many rapid voltage samples into a single stable reading. So in practice, the Gaussian assumption is not something we impose artificially. It emerges naturally from how instruments work.
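You can watch the Central Limit Theorem do its work with a short simulation. The numbers below are illustrative only: we draw from a flat (uniform) distribution, which is nothing like a bell curve, and compare single draws with averages of 30 draws:

```python
import random
import statistics

random.seed(0)

def draw_average(n):
    """Average n values drawn from a flat (uniform) distribution on [0, 1)."""
    return sum(random.random() for _ in range(n)) / n

singles  = [draw_average(1)  for _ in range(20_000)]   # raw uniform draws
averages = [draw_average(30) for _ in range(20_000)]   # averages of 30 draws

# Both are centered on 0.5, but the averages cluster far more tightly:
# theory predicts the spread shrinks from ~0.289 to ~0.289 / sqrt(30) ~ 0.053
print(statistics.stdev(singles), statistics.stdev(averages))
```

A histogram of `averages` is bell-shaped even though every individual draw came from a perfectly flat distribution, which is exactly what the slider demonstration above shows.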
The formula for this bell curve is:

$$p(\epsilon) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\epsilon^2}{2\sigma^2}\right)$$

That’s exactly the shape you saw on the interactive histogram above, and it tells us that small errors (when $\epsilon$ is close to zero) are very probable, and large errors are very improbable.

If we replace the general error with our specific measurement error, $\epsilon = y - \hat{y}$, we get the probability of observing our data point, $y$:

$$p(y \mid \hat{y}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - \hat{y})^2}{2\sigma^2}\right)$$

where $p(y \mid \hat{y})$ means “the probability of observing measurement $y$ given that what we expect is $\hat{y}$.” The vertical bar “|” means “given that” or “conditional on”.
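This formula translates directly into code. The sketch below (with an illustrative prediction of 1.20 AU and $\sigma$ = 0.03 AU, matching our running example) shows that a reading right on the model prediction has the highest probability density, while a reading three standard deviations away is far less likely:

```python
import math

def gaussian_pdf(y, y_hat, sigma):
    """Probability density of observing y when the model predicts y_hat."""
    eps = y - y_hat                       # the measurement error
    norm = 1.0 / math.sqrt(2.0 * math.pi * sigma**2)
    return norm * math.exp(-eps**2 / (2.0 * sigma**2))

# A reading right on the prediction is the most probable...
peak = gaussian_pdf(1.20, 1.20, 0.03)
# ...while a reading 3 sigma away (1.29 AU) is far less probable.
tail = gaussian_pdf(1.29, 1.20, 0.03)
print(peak, tail)
```

Note that these are probability *densities*, not probabilities, which is why the peak value can exceed 1.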
Step 2: The probability of the whole dataset
In a real experiment (for example, a calibration curve), you have more than one data point. You have a whole set of measurements: $y_1, y_2, \ldots, y_n$.
The next question we must ask ourselves is: what is the probability of observing the entire collection of measurements at once? To answer it, we need to understand how the errors of different measurements are related. This leads to our second assumption.
Think of it like flipping a coin. The result of the first flip does not change the 50/50 probability for the second flip. Our measurements should be the same.
If you rinse your equipment properly between samples and allow your instrument to stabilize, then each measurement is a new, independent event. The random noise that made your first measurement slightly high should not affect your second measurement.
Because the errors are independent, we can calculate the total probability of the whole dataset by multiplying the individual probabilities of each measurement:

$$p(y_1, y_2, \ldots, y_n \mid \hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n) = \prod_{i=1}^{n} p(y_i \mid \hat{y}_i)$$
This joint probability is usually called the likelihood. Now, let’s substitute the Gaussian probability formula we found in Step 1:

$$L = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{(y_i - \hat{y}_i)^2}{2\sigma_i^2}\right)$$

That equation is a monster! But don’t be afraid, it is just individual likelihood contributions being multiplied: the probability density of observing our entire dataset, expressed as a product of individual Gaussian terms.
Now, our goal is to find the model parameters that maximize this probability. The logic is beautifully simple: we want to find the model that makes our actual experimental data the most likely outcome. If you measured a series of absorbance values, you want to find the calibration curve parameters that make those specific measurements the most probable result you could have gotten. This approach is called “maximum likelihood estimation” and it’s based on a very reasonable principle: the best model is the one that makes your observed data most likely to occur.
However, working directly with this product of complex exponential expressions is mathematically challenging. When you multiply many small probabilities together, the product becomes so extraordinarily tiny that computers may treat the result as zero due to limited numerical precision. This “numerical underflow” can completely break your optimization algorithm.
Fortunately, there’s a trick that we can use. We can apply a logarithm transformation to this product of probabilities. The logarithm function is monotonic, i.e., it always increases in the same direction as its input. Think of it like this: if you have two probabilities where $P_1 > P_2$, then you’re guaranteed that $\log P_1 > \log P_2$. The logarithm preserves the relative ordering, which means that the parameter values that maximize our original probability will be exactly the same parameter values that maximize the logarithm of that probability.
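A tiny numerical experiment makes both the underflow problem and the logarithm fix concrete. The density value `1e-3` below is arbitrary; the point is only what happens when you multiply 200 such numbers versus summing their logarithms:

```python
import math

# Suppose each of 200 measurements contributes a likelihood density of 1e-3
densities = [1e-3] * 200

# Multiplying them directly underflows: the true product is 1e-600,
# far below the smallest positive double (~1e-308), so we get exactly 0.0
product = 1.0
for p in densities:
    product *= p
print(product)          # 0.0 -- the likelihood has vanished numerically

# Summing logarithms carries the same information without any underflow
log_likelihood = sum(math.log(p) for p in densities)
print(log_likelihood)   # a perfectly representable finite number
```

Any optimizer comparing candidate parameter sets through `product` would see only zeros; comparing them through `log_likelihood` works fine, and the monotonicity of the logarithm guarantees both comparisons would rank the candidates identically.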
Applying this transformation to our probability product, and using the fundamental property that $\log(a \cdot b) = \log a + \log b$, this becomes:

$$\log L = \sum_{i=1}^{n} \log p(y_i \mid \hat{y}_i)$$
The multiplication of probabilities now is a sum! Now, remember that each individual probability has the form:

$$p(y_i \mid \hat{y}_i) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{(y_i - \hat{y}_i)^2}{2\sigma_i^2}\right)$$
And our goal is to find the model parameters that maximize the total log-likelihood:

$$\hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{n} \log p(y_i \mid \hat{y}_i)$$
Here, $\theta$ represents the parameters we optimize to maximize the log-likelihood. For example, in a linear regression model, $\theta$ includes the slope and intercept. In an exponential decay model, $\theta$ consists of the pre-exponential factor and the decay rate. These parameters define the predictions $\hat{y}_i$, and the goal is to find the values of $\theta$ that make the observed data most likely.
When we substitute the Gaussian probability expression and take the logarithm, we apply the logarithm to both parts of the product inside each probability. Using the property that $\log(a \cdot b) = \log a + \log b$, this becomes:

$$\log L = \sum_{i=1}^{n} \log\left[\frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{(y_i - \hat{y}_i)^2}{2\sigma_i^2}\right)\right]$$

and then

$$\log L = \sum_{i=1}^{n} \left[\log\frac{1}{\sqrt{2\pi\sigma_i^2}} + \log \exp\left(-\frac{(y_i - \hat{y}_i)^2}{2\sigma_i^2}\right)\right]$$

When we take the logarithm of the exponential term, we use the fundamental property that $\log(e^{x}) = x$. This means:

$$\log \exp\left(-\frac{(y_i - \hat{y}_i)^2}{2\sigma_i^2}\right) = -\frac{(y_i - \hat{y}_i)^2}{2\sigma_i^2}$$

This simplification transforms our expression into:

$$\log L = \sum_{i=1}^{n} \left[\log\frac{1}{\sqrt{2\pi\sigma_i^2}} - \frac{(y_i - \hat{y}_i)^2}{2\sigma_i^2}\right]$$

Now we can separate this expression into two distinct parts to better understand what each component contributes to our optimization problem:

$$\log L = \sum_{i=1}^{n} \log\frac{1}{\sqrt{2\pi\sigma_i^2}} - \sum_{i=1}^{n} \frac{(y_i - \hat{y}_i)^2}{2\sigma_i^2}$$
This separation reveals something profound about the structure of our optimization problem. The first sum contains only normalization constants that depend on our measurement uncertainties $\sigma_i$ but not on our model parameters $\theta$. Since these terms remain constant regardless of which model parameters we choose, they don’t affect which parameters give us the maximum likelihood. For optimization purposes, we can treat this entire first sum as a constant.
Therefore, maximizing the full expression is mathematically equivalent to maximizing only the second part:

$$\max_{\theta} \left[-\sum_{i=1}^{n} \frac{(y_i - \hat{y}_i)^2}{2\sigma_i^2}\right]$$

But here’s a fundamental insight about optimization that connects our probabilistic reasoning to the familiar least squares method: maximizing a negative expression is exactly the same as minimizing the positive version of that expression. Since we want to find the maximum of the negative sum, we can equivalently find the minimum of the positive sum:

$$\min_{\theta} \sum_{i=1}^{n} \frac{(y_i - \hat{y}_i)^2}{2\sigma_i^2}$$

Finally, since the factor of $\frac{1}{2}$ doesn’t change which parameters minimize the expression (it just scales everything uniformly), we can remove it without affecting the optimization result:

$$\min_{\theta} \sum_{i=1}^{n} \frac{(y_i - \hat{y}_i)^2}{\sigma_i^2}$$

At this point, we’ve established that maximizing the likelihood of our data leads us to minimize the expression:

$$\sum_{i=1}^{n} \frac{(y_i - \hat{y}_i)^2}{\sigma_i^2}$$
This is called weighted least squares, and it’s actually the most general form of our optimization problem. Each measurement contributes to the sum with a weight $w_i = 1/\sigma_i^2$ that depends on its individual measurement uncertainty. Measurements with smaller uncertainties (smaller $\sigma_i$) get more weight in determining our final model, while noisier measurements (larger $\sigma_i$) get less influence. This makes intuitive sense: we should trust our most precise measurements more heavily when fitting our model.
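For a straight line, weighted least squares even has a closed-form solution. The sketch below is a minimal pure-Python implementation; the data are hypothetical, built so that four precise points lie exactly on the line $y = 0.10 + 0.50x$ while a fifth, much noisier point sits well above it:

```python
def weighted_linefit(x, y, sigma):
    """Fit y = a + b*x by minimizing sum((y_i - a - b*x_i)^2 / sigma_i^2)."""
    w = [1.0 / s**2 for s in sigma]               # weight = inverse variance
    W   = sum(w)
    Sx  = sum(wi * xi for wi, xi in zip(w, x))
    Sy  = sum(wi * yi for wi, yi in zip(w, y))
    Sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    b = (W * Sxy - Sx * Sy) / (W * Sxx - Sx * Sx)  # slope
    a = (Sy - b * Sx) / W                          # intercept
    return a, b

# Hypothetical calibration: true line y = 0.10 + 0.50*x, last point is noisy
x     = [1.0, 2.0, 3.0, 4.0, 5.0]
y     = [0.60, 1.10, 1.60, 2.10, 3.00]   # last point sits 0.40 above the line
sigma = [0.01, 0.01, 0.01, 0.01, 0.50]   # ...but we trust it far less

a, b = weighted_linefit(x, y, sigma)
print(a, b)  # close to 0.10 and 0.50: the noisy point barely matters
```

Because the outlying point carries a weight thousands of times smaller than the others, the fit is essentially the line through the four precise points. Set all $\sigma_i$ equal and the same function reduces to ordinary least squares.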
But here’s where we encounter a practical challenge that leads us to our third fundamental assumption. In real analytical work, how often do you actually know the individual uncertainty $\sigma_i$ for each and every measurement you make?
Step 3: The role of constant variance
In most laboratory situations, you don’t measure the uncertainty of each individual data point. Instead, you typically characterize your instrument’s overall precision through replicate measurements or method validation studies. You might determine that your HPLC system has a typical injection precision of 2% RSD, or that your spectrophotometer has a baseline noise level of 0.01 absorbance units. This practical reality leads us to a crucial simplifying assumption: homoscedasticity, the assumption that every measurement shares the same variance $\sigma^2$.
Let’s explore what this assumption really means in concrete terms. Imagine you’re running a calibration curve for quantifying caffeine using UV spectroscopy. Homoscedasticity means that when you measure a 1 mg/L standard solution, the measurement uncertainty is essentially the same as when you measure a 2 mg/L standard or a 3 mg/L standard. The absolute magnitude of the random error stays approximately constant across your entire calibration range.
This might seem counterintuitive at first. You might expect that higher concentrations, which give stronger signals, would be more precise. But remember that we’re talking about absolute error, not relative error. Your spectrophotometer’s detector noise might contribute ±0.01 absorbance units of uncertainty whether you’re measuring 0.2 AU or 1.0 AU. The absolute uncertainty stays roughly the same, even though the relative uncertainty (the percentage error) decreases for stronger signals.
The interactive component below demonstrates this concept using a linear calibration relationship. By default, it shows homoscedastic data where error bars maintain constant height across all concentrations. Toggle the “Heteroscedastic noise” option to see what happens when measurement uncertainty increases with signal strength, and notice how the error bars grow larger at higher concentrations, violating our constant variance assumption.
This comparison is crucial for understanding when our mathematical assumptions are valid. In the homoscedastic case, you can see that the scatter around the line remains roughly constant whether you’re looking at low or high concentrations. In the heteroscedastic case, the scatter clearly increases as concentration increases, which would require us to use weighted least squares rather than ordinary least squares.
Now let’s see how the homoscedasticity assumption transforms our mathematics. When every measurement has the same variance $\sigma^2$, we can factor this constant out of our summation:

$$\sum_{i=1}^{n} \frac{(y_i - \hat{y}_i)^2}{\sigma^2} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Here’s the key mathematical insight: since $\frac{1}{\sigma^2}$ is just a positive constant that doesn’t depend on our model parameters, it doesn’t affect which parameters minimize the expression. Multiplying an entire function by a positive constant doesn’t change the location of its minimum. It just rescales the function vertically.

Therefore, minimizing our expression is mathematically equivalent to minimizing:

$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
And there it is!🎉 We’ve finally arrived at the familiar ordinary least squares expression that you’ve probably encountered countless times in analytical chemistry. The squares that seemed so mysterious at the beginning of our journey have emerged naturally from three reasonable assumptions about how analytical measurements behave:
- Gaussian errors: Measurement errors follow a bell-shaped distribution, which reflects the Central Limit Theorem acting on many small, independent noise sources in your instrument.
- Independent measurements: The error in one measurement doesn’t predict the error in the next, which reflects good laboratory practice and proper instrument operation.
- Constant variance: All measurements have similar absolute precision, which reflects the typical behavior of most analytical instruments across their useful range.
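Putting the result to work takes only a few lines. The sketch below implements the textbook closed-form solution for an ordinary least squares straight-line fit and checks, numerically, that the fitted parameters really do minimize the sum of squared residuals (the calibration numbers are made up for illustration):

```python
def sse(params, x, y):
    """Sum of squared errors for the line y = a + b*x."""
    a, b = params
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

def ols_linefit(x, y):
    """Ordinary least squares for y = a + b*x (all points weighted equally)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
    a = ybar - b * xbar
    return a, b

# Hypothetical calibration data (concentration, absorbance)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.21, 0.39, 0.62, 0.81, 1.00]

a, b = ols_linefit(x, y)

# Any small perturbation of the fitted parameters increases the SSE,
# confirming that (a, b) sits at the minimum of the squared-error surface
assert sse((a, b), x, y) < sse((a + 0.01, b), x, y)
assert sse((a, b), x, y) < sse((a, b - 0.01), x, y)
print(a, b)
```

The same minimization can also be solved through the normal equations in matrix form, which is where the next entry picks up.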
Understanding these nuances helps you choose the right mathematical approach for your specific analytical situation. When homoscedasticity is reasonable, ordinary least squares provides an elegant and computationally efficient solution. When it’s not, weighted least squares offers a more sophisticated approach that properly accounts for varying measurement uncertainties.
This mathematical framework elegantly connects the physics of measurement (random errors with Gaussian distributions) to the statistics of parameter estimation (maximum likelihood) to the practical mathematics we use every day (least squares fitting). The beauty lies not just in the mathematical elegance, but in how these assumptions reflect the real behavior of analytical instruments and the practical constraints of laboratory work.
Some considerations
While our probabilistic derivation explains why least squares uses the sum of squared errors, there are additional assumptions and practical considerations that every analytical chemist should understand to apply this method effectively in real laboratory work. These considerations will help you recognize when least squares is appropriate, when it might fail, and how to adapt your approach accordingly.
Key Assumptions for Linear Models
Our derivation focused on three fundamental assumptions for the sake of simplifying the explanation, but the classical least squares framework, especially for linear models, rests on a few additional requirements, the most important being that the model is linear in its parameters and that the errors have zero mean.
The key to understanding “linear in parameters” is recognizing what makes a model linear or non-linear from a mathematical perspective. A model is linear in parameters when each parameter appears only to the first power and no parameters multiply each other.
Consider these examples from analytical chemistry:
Linear in Parameters:
- Simple calibration: $y = b_0 + b_1 x$ ✓ (parameters $b_0$ and $b_1$ each appear once, no multiplication between them)
- Polynomial calibration: $y = b_0 + b_1 x + b_2 x^2 + b_3 x^3$ ✓ (even though the relationship between $x$ and $y$ is curved, parameters $b_0$, $b_1$, $b_2$, $b_3$ are independent)
- Multiple regression: $y = b_0 + b_1 x_1 + b_2 x_2$ ✓ (each parameter has its own coefficient role)
Non-linear in Parameters:
- Exponential decay: $y = a\, e^{-kt}$ ✗ (parameter $k$ appears inside an exponential function)
- Michaelis-Menten kinetics: $v = \dfrac{V_{\max}[S]}{K_M + [S]}$ ✗ (parameters $V_{\max}$ and $K_M$ interact in complex ways)
- Interactive model: $y = b_0 + b_1 b_2 x$ ✗ (parameters $b_1$ and $b_2$ multiply each other)
Even when your model is not linear in the parameters, it is still possible to minimize the sum of squared errors using iterative methods like gradient descent or Levenberg–Marquardt. However, in this case you lose the ability to write the model in matrix form $\hat{\mathbf{y}} = \mathbf{X}\mathbf{b}$, which means you also lose access to some powerful techniques we will introduce in the next entry, such as matrix factorization and projections.
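To make “iterative methods” less abstract, here is a deliberately simple sketch: fitting the single decay rate $k$ of $y = e^{-kt}$ by plain gradient descent on the sum of squared errors. The data are synthetic and noise-free, and the learning rate is hand-picked for this toy problem; real nonlinear fitting would use a robust library routine (e.g., a Levenberg–Marquardt implementation) instead:

```python
import math

# Synthetic, noise-free decay data generated from y = exp(-k*t) with k = 0.5
K_TRUE = 0.5
t = [0.5 * i for i in range(11)]          # 0.0, 0.5, ..., 5.0
y = [math.exp(-K_TRUE * ti) for ti in t]

# Minimize SSE(k) = sum_i (y_i - exp(-k*t_i))^2 by plain gradient descent
k = 2.0            # deliberately poor starting guess
lr = 0.02          # learning rate (hand-tuned for this toy example)
for _ in range(10_000):
    # dSSE/dk = sum_i 2*(y_i - exp(-k*t_i)) * t_i * exp(-k*t_i)
    grad = sum(2.0 * (math.exp(-k * ti) - yi) * (-ti) * math.exp(-k * ti)
               for ti, yi in zip(t, y))
    k -= lr * grad

print(k)  # converges toward the true decay rate of 0.5
```

Unlike the linear case, there is no closed-form solution here: the algorithm must walk downhill on the squared-error surface step by step, and with noisy data or a bad starting guess it can stall in a local minimum.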
A related assumption is that the errors have zero mean. In practical terms, this means that if you could take an infinite number of measurements of the same sample under identical conditions, the average of all measurement errors would approach zero. Sometimes your instrument reads slightly high, sometimes slightly low, but there’s no consistent tendency in either direction. This assumption essentially requires that you’ve eliminated systematic errors through proper analytical technique.
When assumptions break down and what to do about it
Understanding when least squares assumptions fail helps you recognize situations where alternative approaches might be more appropriate. Each violation has characteristic symptoms and established solutions.
Heteroscedasticity occurs when measurement precision changes across your analytical range, as you saw in the interactive component above. You might observe this in trace analysis where measurements near the detection limit have proportionally larger uncertainties than measurements at higher concentrations. In Time-Correlated Single Photon Counting (TCSPC), heteroscedasticity can arise due to variations in photon count statistics: measurements with fewer photon events near the baseline tend to exhibit higher relative uncertainty compared to peak regions with more photon events. In chromatography, peak integration precision often depends on peak height, creating heteroscedastic errors. When you suspect heteroscedasticity, examine residual plots where errors are plotted against predicted values. If you see a funnel shape or other systematic pattern in the residuals, consider weighted least squares where each measurement is weighted by the inverse of its estimated variance [7], [8].
Violating the assumption of constant variance (homoscedasticity) does not usually bias the parameter estimates themselves in ordinary least squares, but it does bias the standard errors, which in turn affects confidence intervals and hypothesis tests. This can lead to incorrect inferences, such as overstating the significance of predictors. In cases where heteroscedasticity is substantial and ignored, the model may appear more precise than it truly is. For example, confidence intervals for regression coefficients (computed under constant-variance formulas) can be biased.
Non-Gaussian errors often show up in real lab work. For example, when you count photons at very low light levels, the random noise follows a Poisson curve and looks skewed to the right. A dirty cuvette or a brief detector glitch can throw in a few big outliers, making the tails of the error curve very fat. And when results fall below your detection limit, you report “non-detect,” which cuts off part of the distribution on the left.
Most textbooks and software assume a neat bell-shaped (Gaussian) error because it makes the algebra tidy. The good news is that ordinary least squares still gives the correct average answer as long as the errors keep the same spread, are not linked from one measurement to the next, and average out to zero in the long run.
Where non-Gaussian errors cause trouble is in the fine details: small-sample confidence limits and the usual $t$ and $F$ tests can be off. When that happens, you can switch to a robust fit, apply variance-stabilizing transformations (such as square root for Poisson count data), or pick a model that matches the real error distribution, such as Poisson regression.
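The square-root transform for counting noise is easy to verify numerically. The sketch below simulates Poisson photon counts at a weak and a strong signal level (means of 10 and 100 counts, chosen arbitrarily) using Knuth's classic sampling algorithm; raw counts are heteroscedastic, while the square-rooted counts have nearly the same variance at both levels:

```python
import math
import random

random.seed(7)

def poisson(lam):
    """Knuth's algorithm for Poisson-distributed counts (fine for modest lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        k += 1
        p *= random.random()
        if p <= L:
            return k - 1

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

low  = [poisson(10)  for _ in range(20_000)]   # weak signal
high = [poisson(100) for _ in range(20_000)]   # strong signal

# Raw counts are heteroscedastic: the variance grows with the mean (~lam)
var_low_raw,  var_high_raw  = variance(low), variance(high)
# After sqrt, both variances sit near 1/4, independent of the signal level
var_low_sqrt  = variance([math.sqrt(v) for v in low])
var_high_sqrt = variance([math.sqrt(v) for v in high])

print(var_low_raw, var_high_raw)     # roughly 10 vs 100
print(var_low_sqrt, var_high_sqrt)   # both roughly 0.25
```

After the transform, the constant-variance assumption behind ordinary least squares is approximately restored, which is exactly why this trick is recommended for count data.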
Autocorrelated errors violate the independence assumption and commonly occur in time-series measurements where drift or memory effects cause sequential measurements to be related. You might see this in continuous monitoring applications or when measuring samples in sequence without adequate equilibration time. Autocorrelation can be detected by plotting residuals against measurement order and looking for patterns. Solutions include incorporating time-trend terms into your model or using time-series regression methods.
Conclusion
We began with a simple question: why do we use the “sum of squares”? As we saw, the explanation from a probability viewpoint comes from a few key ideas about our measurements: that errors follow a bell curve (a Gaussian distribution), that they are independent, and that the measurement precision stays the same. This approach helps us understand the real-world reason why the method works.
To get a more complete picture, it is also useful to mention the Gauss-Markov theorem [4], [5], [6]. This theorem shows that, if (1) the model is linear in its parameters, (2) the errors have mean zero, (3) the errors share the same variance (homoscedasticity), and (4) the errors are independent (uncorrelated), the ordinary-least-squares (OLS) estimator is the BLUE: the Best Linear Unbiased Estimator. “Best” means it has the smallest possible variance among all estimators that are linear in the data and unbiased.
Unlike earlier steps in our story, the Gauss–Markov proof does not require the errors to follow a perfect bell curve (Normal distribution); the four conditions above are enough. That wider set of rules makes the least-squares method even more solid and trustworthy in everyday work.
If you understand these core ideas, it becomes much easier to learn more advanced methods in chemometrics, like Multivariate Curve Resolution or special algorithms for baseline correction. All these methods solve the same basic problem: finding the model parameters that best fit the data, based on good assumptions about measurement errors.
So, now that we understand why we use the least squares formula, the next natural question is how do we find the solution? In the next entry, we will look at the mathematical ways to calculate the best least squares fit, from the neat “normal equations” to modern computer-based methods.