In 1863, the mathematician and astronomer William Chauvenet, one of the founders of the United States Naval Academy, published A Manual of Spherical and Practical Astronomy, in which he proposed a criterion for rejecting suspect observations. The problem was practical: astronomers made repeated measurements of stellar positions, and occasionally a reading would deviate far enough from the others to suggest that something had gone wrong: a misread instrument, a gust of wind disturbing the telescope, a momentary lapse of attention. Chauvenet’s criterion said that if the probability of obtaining a deviation as large as the one observed was less than 1/(2n) for a sample of size n, the observation should be rejected. The method was crude by modern standards, but it established a fundamental principle: you need a quantitative rule to decide when a measurement is too far from the rest to be trusted, because human judgment alone is unreliable and inconsistent.
A century later, Frank E. Grubbs, a statistician at the Aberdeen Proving Ground in Maryland, placed outlier detection on firmer statistical ground. In a 1950 paper in The Annals of Mathematical Statistics and a widely cited 1969 review in Technometrics, Grubbs developed formal hypothesis tests for detecting one or more outliers in a univariate sample. His test statistic — the maximum absolute deviation from the sample mean divided by the sample standard deviation — follows a known distribution under the null hypothesis that all observations come from the same normal population. Grubbs’ test gave experimentalists a rigorous way to flag suspect values, and variants of it remain in use today in quality control and standard test methods (ASTM E178, for instance, cites Grubbs directly).
The shift to multivariate thinking came in 1977, when R. Dennis Cook, then at the University of Minnesota, introduced what is now called Cook’s distance — a measure of how much a regression model’s fitted values change when a single observation is deleted. Cook’s work launched the field of influence diagnostics, which asks not just whether a point is unusual in isolation, but whether it is actually pulling the model in a distorted direction. In chemometrics, where models routinely involve hundreds or thousands of variables, the problem became more acute: a single contaminated spectrum in a PCA or PLS model can rotate principal components, shift score clusters, and degrade predictions for every other sample. The tools developed for multivariate outlier detection — Hotelling’s T², Q-residuals (SPE), leverage, and their combinations — are now part of the standard workflow for any serious chemometric analysis.
What is an outlier?
An outlier is an observation that does not follow the same pattern as the majority of the data. This definition is deliberately vague, because what counts as “not following the pattern” depends entirely on what pattern you expect.
In a univariate context, an outlier is simply a value that lies far from the center of the distribution. A pH reading of 14.3 in a set of measurements clustered around 7.0 is an obvious outlier. But in multivariate data, outliers can be much more subtle. A sample might have perfectly normal values for every individual variable yet be unusual in its combination of values — high absorbance at wavelength A combined with low absorbance at wavelength B in a way that no other sample exhibits. This is why univariate screening (checking each variable independently) misses multivariate outliers.
Outliers vs. influential points
These two concepts are related but distinct.
An outlier is a point that is distant from the rest of the data. It may or may not affect a model, depending on where it sits relative to the model’s structure.
An influential point is a point that, if removed, substantially changes the fitted model. A point can be influential without being an obvious outlier (it might sit at the edge of the predictor space where it has high leverage), and an outlier can be non-influential (if it lies far from the regression line but in the middle of the predictor space, removing it may barely change the slope).
In chemometrics, the distinction matters because PCA and PLS models are built from covariance structures. A single extreme sample can:
- Rotate principal components to accommodate itself, distorting the model for everyone else
- Inflate the apparent variance explained, giving a misleading picture of model quality
- Bias PLS regression coefficients, degrading predictions on normal samples
- Consume a principal component to explain its own peculiarity rather than genuine chemical variation
Why outliers matter
The core issue is that most chemometric methods are based on least-squares optimization, which minimizes the sum of squared residuals. Squaring means that large deviations get disproportionate weight. A single outlier with a residual 10 times the average has 100 times the influence of a typical point on the fitted model. This sensitivity to extreme values is a feature when all data is well-behaved (it gives efficient estimates) and a liability when it is not (one bad sample can wreck the model).
Univariate outlier detection
Before moving to multivariate methods, it is worth reviewing the classical univariate approaches. They are limited — they cannot detect multivariate outliers — but they provide useful context and are still applied for screening individual variables.
Z-score method
The simplest approach: compute the z-score for each observation and flag those beyond a threshold.
$$z_i = \frac{x_i - \bar{x}}{s}$$

where $\bar{x}$ is the sample mean and $s$ is the sample standard deviation. A common threshold is $|z_i| > 3$, based on the expectation that under normality, only about 0.3% of values should exceed three standard deviations.
Limitation: Both the mean and the standard deviation are themselves influenced by the outlier. If the outlier is extreme enough, it inflates $s$, which shrinks all z-scores, potentially making the outlier look less extreme than it really is. This is the masking effect: the outlier hides itself by distorting the very statistics used to detect it.
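A minimal sketch of the z-score rule and the masking effect in Python (the pH readings are invented for illustration):

```python
import numpy as np

# Nine readings near pH 7 plus one gross error at 14.3
x = np.array([6.9, 7.0, 7.1, 7.0, 6.8, 7.2, 7.0, 6.9, 7.1, 14.3])

z = (x - x.mean()) / x.std(ddof=1)   # classical z-scores

# The outlier inflates the standard deviation so much that even its
# own z-score stays below the usual |z| > 3 cutoff: masking in action.
flagged = np.abs(z) > 3
```

Here the obvious outlier escapes the threshold entirely, which is exactly the masking effect described above.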
IQR method (box plot rule)
A more robust alternative uses the interquartile range:
$$\mathrm{IQR} = Q_3 - Q_1$$

An observation is flagged if it falls below $Q_1 - 1.5\times\mathrm{IQR}$ or above $Q_3 + 1.5\times\mathrm{IQR}$. The factor 1.5 was chosen by John Tukey for his box plot, and it corresponds roughly to ±2.7σ under normality. Using 3.0 instead of 1.5 identifies “far outliers.”
The IQR method is robust because the quartiles are not strongly affected by extreme values. An outlier that inflates the mean and standard deviation barely changes $Q_1$ and $Q_3$.
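A quick sketch of the box plot rule in Python (invented pH readings; NumPy’s default linear interpolation for the quartiles):

```python
import numpy as np

x = np.array([6.9, 7.0, 7.1, 7.0, 6.8, 7.2, 7.0, 6.9, 7.1, 14.3])

q1, q3 = np.percentile(x, [25, 75])        # robust location statistics
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = x[(x < lower) | (x > upper)]    # the 14.3 reading is caught
```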
Grubbs’ test
Grubbs’ test formally tests the null hypothesis that all observations come from the same normal population against the alternative that there is exactly one outlier.
$$G = \frac{\max_i |x_i - \bar{x}|}{s}$$

The null hypothesis is rejected at significance level α if:

$$G > \frac{n-1}{\sqrt{n}} \sqrt{\frac{t^2_{\alpha/(2n),\,n-2}}{n-2+t^2_{\alpha/(2n),\,n-2}}}$$

where $t_{\alpha/(2n),\,n-2}$ is the critical value of the t-distribution with $n-2$ degrees of freedom.
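The test is a few lines of Python with SciPy supplying the t quantile (a sketch, not a validated implementation; the data are invented):

```python
import numpy as np
from scipy import stats

def grubbs_critical(n, alpha=0.05):
    # Critical value built from the t-distribution, per the formula above
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    return (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))

x = np.array([6.9, 7.0, 7.1, 7.0, 6.8, 7.2, 7.0, 6.9, 7.1, 14.3])
G = np.max(np.abs(x - x.mean())) / x.std(ddof=1)
reject = G > grubbs_critical(len(x))   # here the 14.3 reading is rejected
```

Note that Grubbs’ test succeeds where the plain z-score rule is masked: the test statistic is compared against a critical value that accounts for the sample size, not against a fixed cutoff of 3.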
Multivariate outlier detection
This is where chemometrics really needs dedicated tools. A sample can look perfectly normal in every individual variable yet be an outlier in the multivariate space, because the combination of its variable values is unusual. The methods below account for the correlation structure among variables.
Mahalanobis distance
The Mahalanobis distance generalizes the concept of “how many standard deviations away” to multiple correlated variables. Instead of dividing by the standard deviation of a single variable, it accounts for the full covariance structure:
$$D_i^2 = (\mathbf{x}_i - \bar{\mathbf{x}})^\top S^{-1} (\mathbf{x}_i - \bar{\mathbf{x}})$$

where $\bar{\mathbf{x}}$ is the multivariate mean and $S$ is the sample covariance matrix.
Geometrically, the Mahalanobis distance measures the distance from a point to the center of the data cloud, but it stretches and rotates the space so that the data cloud becomes spherical. A point that is far along a direction of high correlation (where you would expect elongated scatter) gets a smaller distance than a point equally far in a direction perpendicular to the correlation structure.
Under multivariate normality, $D_i^2$ follows approximately a χ² distribution with p degrees of freedom (where p is the number of variables). A common threshold is:

$$D_i^2 > \chi^2_{p,\,1-\alpha}$$

with α = 0.05 or α = 0.01.
Limitation in chemometrics: The Mahalanobis distance requires inverting the covariance matrix. When the number of variables exceeds the number of samples (the typical case in spectroscopy), S is singular and cannot be inverted directly. This is precisely why PCA-based diagnostics are preferred — they project the data into a lower-dimensional space where the covariance structure is well-defined.
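A sketch of the computation with NumPy and SciPy, on synthetic data where one sample is unremarkable in each variable but abnormal in their combination:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 100, 2
a = rng.normal(size=n)
X = np.column_stack([a, a + 0.1 * rng.normal(size=n)])  # strongly correlated pair
X[0] = [1.5, -1.5]   # normal per variable, but it breaks the correlation

diff = X - X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
D2 = np.einsum('ij,jk,ik->i', diff, S_inv, diff)   # squared Mahalanobis distances

flagged = D2 > stats.chi2.ppf(0.99, df=p)
```

Univariate screening would pass sample 0 without comment; the Mahalanobis distance flags it immediately because it lies far from the correlation structure.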
Hotelling’s T-squared
Hotelling’s T² is the Mahalanobis distance computed in the PCA score space rather than in the original variable space. After fitting a PCA model with A components, each sample has a score vector $\mathbf{t}_i = [t_{i1}, t_{i2}, \ldots, t_{iA}]$. The T² statistic measures how far each sample is from the center of the score space:

$$T_i^2 = \sum_{a=1}^{A} \frac{t_{ia}^2}{\lambda_a}$$

where $\lambda_a$ is the eigenvalue (variance) of the $a$-th principal component. Dividing each squared score by its corresponding eigenvalue normalizes by the expected spread in that direction: a large score on a high-variance component is less surprising than the same score on a low-variance component.
The threshold for T² at significance level α is based on the F-distribution:

$$T^2_{\mathrm{lim}} = \frac{A(n-1)}{n-A} F_{A,\,n-A,\,1-\alpha}$$

where n is the number of samples. Samples with $T_i^2 > T^2_{\mathrm{lim}}$ are flagged as potential outliers.
What high T-squared means: The sample is extreme within the PCA model. It has unusual scores — it sits far from the center in the space spanned by the principal components. This means the sample has an unusual combination of the patterns that the model captures. For example, if PC1 represents protein content and PC2 represents moisture, a sample with an extreme T² has an unusual protein-moisture combination relative to the other samples.
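A compact sketch in Python (PCA via SVD on mean-centered data; NumPy and SciPy assumed, the data synthetic):

```python
import numpy as np
from scipy import stats

def hotelling_t2(X, A):
    # PCA via SVD on mean-centered data; T^2 from the first A components
    Xc = X - X.mean(axis=0)
    U, svals, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :A] * svals[:A]                  # scores t_ia
    lam = svals[:A] ** 2 / (len(X) - 1)       # eigenvalues lambda_a
    return np.sum(T**2 / lam, axis=1)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))
X[0] *= 6.0                                   # one sample far from the center

n, A = X.shape[0], 2
T2 = hotelling_t2(X, A)
T2_lim = A * (n - 1) / (n - A) * stats.f.ppf(0.95, A, n - A)
```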
Q-residuals (Squared Prediction Error, SPE)
While T² asks whether a sample is extreme within the model, the Q-residual asks whether the sample fits the model at all. It measures the portion of a sample’s variance that the PCA model does not explain:

$$Q_i = \mathbf{e}_i^\top \mathbf{e}_i = \sum_{j=1}^{p} e_{ij}^2$$

where $\mathbf{e}_i = \mathbf{x}_i - \hat{\mathbf{x}}_i$ is the residual vector for sample i, and $\hat{\mathbf{x}}_i = \sum_{a=1}^{A} t_{ia}\mathbf{p}_a$ is the PCA reconstruction. In other words, $Q_i$ is the squared length of the residual vector: the part of the sample that the model cannot account for.
The threshold for Q is based on a χ² approximation proposed by Jackson and Mudholkar (1979):

$$Q_\alpha = \theta_1 \left[ \frac{c_\alpha \sqrt{2\theta_2 h_0^2}}{\theta_1} + 1 + \frac{\theta_2 h_0 (h_0 - 1)}{\theta_1^2} \right]^{1/h_0}$$

where $\theta_k = \sum_{a=A+1}^{p} \lambda_a^k$ for k = 1, 2, 3, $h_0 = 1 - 2\theta_1\theta_3/(3\theta_2^2)$, and $c_\alpha$ is the standard normal critical value. In practice, this is often simplified: a useful rule of thumb for the Q limit is to use the 95th or 99th percentile of the observed Q values in the calibration set, especially when the eigenvalues of the discarded components are not readily available.
What high Q means: The sample contains variation that the PCA model was not built to describe. This might be a different type of interference, a new chemical species, an instrument artifact, or a fundamentally different sample type. The model “knows” about the patterns captured by its A components; a high-Q sample has something else going on.
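A sketch of Q on synthetic two-component data, using the percentile rule of thumb for the limit (all names and noise levels are arbitrary):

```python
import numpy as np

def q_residuals(X, A):
    # Residual sum of squares after reconstructing X from A components
    Xc = X - X.mean(axis=0)
    U, svals, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:A].T                       # loadings p_a
    E = Xc - Xc @ P @ P.T              # part of each sample the model misses
    return np.sum(E**2, axis=1)

rng = np.random.default_rng(2)
scores = rng.normal(size=(60, 2))
loadings, _ = np.linalg.qr(rng.normal(size=(12, 2)))   # orthonormal 12x2 basis
X = scores @ loadings.T + 0.01 * rng.normal(size=(60, 12))
X[0] += 0.5 * rng.normal(size=12)      # variation outside the 2-component model

Q = q_residuals(X, A=2)
Q_lim = np.percentile(Q, 95)           # percentile rule of thumb
```

Sample 0 has ordinary scores but a residual the model cannot explain, so it stands out in Q even though its T² may be unremarkable.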
The T-squared vs Q diagnostic plot
The most informative single diagnostic in chemometric outlier detection is the plot of T² against Q (T² on the x-axis and Q on the y-axis, with a threshold line for each). This plot divides samples into four regions:
| Region | T² | Q-residual | Interpretation |
| --- | --- | --- | --- |
| Normal | Low | Low | The sample is well-modeled and not extreme. It sits in the main cluster. |
| Extreme but modeled | High | Low | The sample has unusual scores but fits the model well. It is an extreme version of known variation, for example a very high-concentration sample in a set of calibration standards. |
| New variation | Low | High | The sample has normal scores but does not fit the model. It contains variation the model was not trained on: a new interferent, a different instrument, or a sample type not in the calibration set. |
| Genuine outlier | High | High | The sample is both extreme and poorly modeled. It is unusual in every sense and should be investigated carefully. |
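The four regions translate directly into code. A tiny helper (the threshold values below are arbitrary placeholders, not recommendations):

```python
def classify(t2, q, t2_lim, q_lim):
    """Map a (T2, Q) pair to its region of the diagnostic plot."""
    if t2 <= t2_lim and q <= q_lim:
        return "normal"
    if q <= q_lim:
        return "extreme but modeled"   # high T2, low Q
    if t2 <= t2_lim:
        return "new variation"         # low T2, high Q
    return "genuine outlier"           # high T2, high Q

# One made-up sample per region, with illustrative limits T2_lim=10, Q_lim=0.05
labels = [classify(t2, q, 10.0, 0.05)
          for t2, q in [(3.2, 0.01), (14.0, 0.02), (4.0, 0.30), (18.0, 0.40)]]
```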
Influence in regression
When building calibration models (PLS regression, PCR, or even ordinary least squares), it is not enough to know whether a sample is an outlier in the predictor space or the response space. What matters is whether removing a sample substantially changes the model. This is the domain of influence diagnostics.
Leverage (hat matrix diagonal)
The leverage of sample i measures how much influence it has on the fitted model due to its position in the predictor space. In ordinary least squares regression, the fitted values are $\hat{\mathbf{y}} = H\mathbf{y}$, where the hat matrix is:

$$H = X(X^\top X)^{-1} X^\top$$

The leverage of sample i is the i-th diagonal element:

$$h_{ii} = \mathbf{x}_i^\top (X^\top X)^{-1} \mathbf{x}_i$$

Leverage values range from 1/n to 1. The average leverage is (p+1)/n, and a common rule of thumb flags samples with $h_{ii} > 2(p+1)/n$ or $h_{ii} > 3(p+1)/n$ as high-leverage points.
Key insight: Leverage depends only on the predictor values, not on the response. A high-leverage point sits in a sparsely populated region of the predictor space. It has the potential to be influential — whether it actually is depends on whether its response value is consistent with the model.
In PCA/PLS, leverage in the score space is directly related to Hotelling’s T²:

$$h_{ii} = \frac{T_i^2}{n-1} + \frac{1}{n}$$

This makes T² and leverage essentially interchangeable diagnostics in the chemometric context.
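Computing the hat diagonal directly is a one-liner in NumPy (a sketch; the extreme point and the dimensions are invented):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 3
X = rng.normal(size=(n, p))
X[0] = [5.0, 5.0, 5.0]                  # sits at the edge of the predictor space

X1 = np.column_stack([np.ones(n), X])   # intercept column
H = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
h = np.diag(H)                          # leverages h_ii

high_leverage = h > 2 * (p + 1) / n     # rule-of-thumb cutoff
```

A useful sanity check: the leverages always sum to the number of fitted parameters, here p + 1.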
Studentized residuals
The residual for sample i is $e_i = y_i - \hat{y}_i$. The internally studentized residual standardizes this by the residual standard error, adjusted for leverage:

$$r_i = \frac{e_i}{s\sqrt{1 - h_{ii}}}$$

where s is the estimated standard deviation of the residuals. The division by $\sqrt{1 - h_{ii}}$ accounts for the fact that high-leverage points tend to have smaller residuals (the model is “pulled” toward them).
The externally studentized residual (also called the deleted residual or jackknife residual) goes further: it recomputes the model without sample i and uses that model to predict $y_i$:

$$t_i = \frac{e_i}{s_{(i)}\sqrt{1 - h_{ii}}}$$

where $s_{(i)}$ is the residual standard error computed without sample i. Under the null hypothesis, $t_i$ follows a t-distribution with $n-p-2$ degrees of freedom. Values with $|t_i| > t_{n-p-2,\,\alpha/2}$ are flagged. A practical threshold is $|t_i| > 2.5$ or $|t_i| > 3$.
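A sketch for simple linear data with one response outlier, using the standard identity that converts internally to externally studentized residuals, so no explicit refitting is needed (the data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 1                            # n samples, p = 1 predictor
x = np.linspace(0.0, 1.0, n)
y = 2.0 + 3.0 * x + 0.1 * rng.normal(size=n)
y[10] += 2.0                            # response outlier at mid-range x

X1 = np.column_stack([np.ones(n), x])
H = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
h = np.diag(H)
e = y - H @ y                           # OLS residuals
s2 = (e**2).sum() / (n - p - 1)

r = e / np.sqrt(s2 * (1 - h))           # internally studentized
t = r * np.sqrt((n - p - 2) / (n - p - 1 - r**2))   # externally studentized
```

The leave-one-out identity in the last line avoids refitting the model n times; the externally studentized value grows much faster than the internal one for a genuine outlier, because the outlier no longer inflates its own error estimate.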
Cook’s distance
Cook’s distance combines leverage and residual information into a single measure of influence. It quantifies how much all the fitted values change when sample i is removed:
$$D_i = \frac{(\hat{\mathbf{y}} - \hat{\mathbf{y}}_{(i)})^\top (\hat{\mathbf{y}} - \hat{\mathbf{y}}_{(i)})}{p\, s^2}$$

which can be rewritten using leverage and residuals:

$$D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}$$

Cook’s distance is large when a sample has both a large residual (it does not fit the model) and high leverage (it is in a position to influence the model). A common guideline is to investigate samples with $D_i > 4/n$ or $D_i > 1$, though the more conservative threshold of $D_i > F_{p,\,n-p,\,0.50}$ (the 50th percentile of the F-distribution) is sometimes used.
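A sketch showing how a high-leverage, poorly fitting point dominates Cook's distance (the data are invented; here p counts the fitted parameters, intercept included, matching the formula above):

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0.0, 1.0, 30)
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=30)
x = np.append(x, 3.0)                    # far outside the 0..1 range: high leverage
y = np.append(y, 1.0)                    # and its response contradicts the trend
n, p = len(x), 2                         # p = number of fitted parameters

X1 = np.column_stack([np.ones(n), x])
H = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
h = np.diag(H)
e = y - H @ y
s2 = (e**2).sum() / (n - p)
r2 = e**2 / (s2 * (1 - h))               # squared studentized residuals

D = (r2 / p) * h / (1 - h)               # Cook's distance
```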
Williams plot
The Williams plot (leverage vs. studentized residuals) is the regression counterpart of the T² vs Q plot. It plots $h_{ii}$ on the x-axis against the studentized residual on the y-axis, with threshold lines for each:

- Vertical line at $h_{ii} = 3(p+1)/n$ (high leverage boundary)
- Horizontal lines at ±2.5 or ±3 (large residual boundary)

The four quadrants have interpretations analogous to the T² vs Q plot:

- Low leverage, small residual: Normal, well-modeled sample
- High leverage, small residual: Influential position but good fit; the sample pulls the model, but in a consistent direction
- Low leverage, large residual: Poor fit but low influence; an outlier in the response that does not distort the model much
- High leverage, large residual: Both influential and poorly fitted; the most dangerous case for model quality
What to do with outliers
Detecting outliers is only half the job. The harder question is what to do about them. The single most important principle is: investigate before you delete.
Identify the suspect samples
Use the T2 vs Q plot, Williams plot, or other diagnostics to flag candidates. Record which samples are flagged and by which criterion.
Examine the raw data
Go back to the original measurements. Is there an obvious problem? A saturated detector, a baseline shift, a missing region, an obviously corrupted spectrum? If so, the sample has a documented quality issue and removal is justified.
Check the metadata
Was the sample mislabeled? Was the instrument malfunctioning on that day? Was it a different sample type accidentally included in the batch? Laboratory records often explain outliers that statistics alone cannot.
Investigate the chemistry
If the sample has no apparent quality issue, ask whether it might represent genuinely different chemistry. A sample with high Q may contain a chemical species not present in the other samples. This is not an error — it is information. Removing it would mean ignoring a real phenomenon.
Assess the impact
Build the model with and without the suspect samples. Do the predictions change substantially? If removing a sample barely affects the model, the point is not influential regardless of whether it is an outlier. If removal changes the predictions for other samples, the point deserves careful scrutiny.
Document your decision
Whether you keep or remove the sample, record the reasoning. “Removed because the detector saturated above 2500 nm” is a defensible decision. “Removed because it was an outlier” is not — it tells the next analyst nothing about the actual cause.
Robust alternatives
The methods described above all share a vulnerability: they use the sample mean and covariance to define “normal,” but those statistics are themselves influenced by the outliers they are trying to detect. This circular problem motivates robust methods, which estimate location and scatter in ways that resist contamination by outlying observations.
Median Absolute Deviation (MAD)
In univariate settings, the median and MAD provide a robust alternative to the mean and standard deviation:
$$\mathrm{MAD} = \mathrm{median}\left(\left|x_i - \mathrm{median}(x)\right|\right)$$

The MAD can be converted to a robust estimate of the standard deviation by multiplying by 1.4826 (valid under normality). A robust z-score is then:

$$z_i^{\mathrm{robust}} = \frac{x_i - \mathrm{median}(x)}{1.4826 \times \mathrm{MAD}}$$
The MAD has a breakdown point of 50%, meaning that up to half the data can be contaminated before the MAD becomes unreliable. Compare this to the standard deviation, which has a breakdown point of 0% — a single extreme value can make it arbitrarily large.
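The robust z-score in Python, on invented pH readings where the outlier masks itself under the classical rule:

```python
import numpy as np

x = np.array([6.9, 7.0, 7.1, 7.0, 6.8, 7.2, 7.0, 6.9, 7.1, 14.3])

# Classical z-score: the outlier inflates s and hides itself (masking)
z = (x - x.mean()) / x.std(ddof=1)

# Robust version: the median and MAD are barely moved by the outlier
med = np.median(x)
mad = np.median(np.abs(x - med))
z_robust = (x - med) / (1.4826 * mad)
```

The classical rule flags nothing here, while the robust score for the bad reading is enormous: the outlier can no longer distort the statistics used to detect it.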
ROBPCA
For multivariate data, Hubert, Rousseeuw, and Vanden Branden (2005) developed ROBPCA, a robust version of PCA that combines projection pursuit with the Minimum Covariance Determinant (MCD) estimator. The key idea is to find the subset of observations that has the smallest covariance determinant (the tightest ellipsoid), use that subset to estimate the covariance structure, and then compute PCA from the robust covariance matrix.
ROBPCA produces robust scores, robust T2 values, and robust orthogonal distances (analogous to Q-residuals), all of which are less affected by the outliers themselves. It also classifies observations into four groups: regular observations, good leverage points (unusual position but consistent with the robust model), orthogonal outliers (high orthogonal distance), and bad leverage points (both unusual position and high orthogonal distance).
Least Trimmed Squares (LTS)
In regression, the Least Trimmed Squares estimator (introduced by Rousseeuw in 1984) fits the model by minimizing the sum of the h smallest squared residuals, where h is typically chosen as roughly $n/2 + (p+1)/2$. This means the LTS regression simply ignores the worst-fitting half of the data when estimating the model, making it highly resistant to outliers. Once the robust model is fitted, the full set of residuals can be examined to identify which observations were excluded and to investigate them.
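A simplified LTS sketch using random elemental starts refined by concentration steps (the full FAST-LTS algorithm adds several refinements; everything here, including the contamination scheme, is illustrative):

```python
import numpy as np

def lts_fit(X1, y, h, n_starts=50, seed=0):
    # Simplified LTS: many random starts, each refined by concentration steps
    rng = np.random.default_rng(seed)
    n, k = X1.shape
    best_obj, best_beta = np.inf, None
    for _ in range(n_starts):
        idx = rng.choice(n, size=k, replace=False)     # random elemental start
        for _ in range(20):                            # concentration (C-) steps
            beta, *_ = np.linalg.lstsq(X1[idx], y[idx], rcond=None)
            resid2 = (y - X1 @ beta) ** 2
            idx = np.argsort(resid2)[:h]               # keep the h best-fitting points
        obj = np.sort(resid2)[:h].sum()
        if obj < best_obj:
            best_obj, best_beta = obj, beta
    return best_beta

rng = np.random.default_rng(6)
n = 60
x = rng.uniform(0.0, 1.0, n)
y = 1.0 + 2.0 * x + 0.05 * rng.normal(size=n)
y[:10] += 5.0                                          # ~17% gross contamination
X1 = np.column_stack([np.ones(n), x])

h = n // 2 + (X1.shape[1] + 1) // 2                    # h = n/2 + (p+1)/2
beta_lts = lts_fit(X1, y, h)
beta_ols, *_ = np.linalg.lstsq(X1, y, rcond=None)      # for comparison
```

With a sixth of the responses shifted, ordinary least squares is pulled visibly off the true coefficients (intercept 1, slope 2), while the trimmed fit recovers them.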
In practice, whatever diagnostics are used, the flagging step itself is simple. In R, assuming a data frame `results` holding each sample's T² and Q values alongside their limits:

```r
# Flag every sample that exceeds the T² or Q limit, and report which criterion fired
for (i in seq_len(nrow(results))) {
  flags <- character(0)
  if (results$T2[i] > results$T2_lim) flags <- c(flags, sprintf("T²=%.2f", results$T2[i]))
  if (results$Q[i] > results$Q_lim) flags <- c(flags, sprintf("Q=%.6f", results$Q[i]))
  if (length(flags) > 0)
    cat(sprintf("  Sample %d: %s\n", i, paste(flags, collapse = ", ")))
}
```
When to investigate vs when to remove
Investigate further
High Q, low T-squared. The sample fits poorly but is not extreme in the model. It likely contains variation the model was not designed for — a new interferent, a different matrix, or an instrument change. This is new information, not noise. Understand it before discarding it.
Isolated case in a large dataset. If one sample out of 500 is flagged, look at the raw data. It could be a transcription error, a labeling mistake, or a genuine anomaly worth investigating.
The outlier is chemically plausible. If the flagged sample comes from a known edge case (extreme pH, high temperature, unusual matrix), it may represent the boundary of your model’s applicability domain. Keeping it improves the model’s range; removing it narrows it.
Consider removing
Documented instrument failure. If lab records confirm that the detector malfunctioned, the lamp was failing, or the sample path was blocked during acquisition, the measurement is not valid data. Remove it and document the reason.
Sample preparation error. A mislabeled vial, a contaminated cuvette, or a dilution mistake produces data that does not represent what you think it represents. If the error is confirmed, removal is justified.
Multiple diagnostics converge. If T-squared, the Q-residual, Cook’s distance, and visual inspection of the raw spectrum all point to a problem, the evidence for removal is strong. But still document the specific cause.
The sample breaks the model for everyone else. If removing one sample substantially improves predictions for the remaining samples (validated by cross-validation), it was likely exerting undue influence. Investigate why before removing.
References
[1] Chauvenet, W. (1863). A Manual of Spherical and Practical Astronomy (Vol. II, Appendix). J. B. Lippincott & Co.
[2] Grubbs, F. E. (1950). Sample criteria for testing outlying observations. The Annals of Mathematical Statistics, 21(1), 27-58.
[3] Grubbs, F. E. (1969). Procedures for detecting outlying observations in samples. Technometrics, 11(1), 1-21.
[4] Cook, R. D. (1977). Detection of influential observation in linear regression. Technometrics, 19(1), 15-18.
[5] Jackson, J. E., & Mudholkar, G. S. (1979). Control procedures for residuals associated with principal component analysis. Technometrics, 21(3), 341-349.
[6] Hubert, M., Rousseeuw, P. J., & Vanden Branden, K. (2005). ROBPCA: A new approach to robust principal component analysis. Technometrics, 47(1), 64-79.
[7] Rousseeuw, P. J. (1984). Least median of squares regression. Journal of the American Statistical Association, 79(388), 871-880.
[8] Brereton, R. G. (2003). Chemometrics: Data Analysis for the Laboratory and Chemical Plant. Wiley.
[9] Hotelling, H. (1931). The generalization of Student’s ratio. The Annals of Mathematical Statistics, 2(3), 360-378.
[10] Eriksson, L., Johansson, E., Kettaneh-Wold, N., Trygg, J., Wikstrom, C., & Wold, S. (2006). Multi- and Megavariate Data Analysis: Part I — Basic Principles and Applications (2nd ed.). Umetrics Academy.