In 1863, the mathematician and astronomer William Chauvenet, one of the founders of the United States Naval Academy, published A Manual of Spherical and Practical Astronomy, in which he proposed a criterion for rejecting suspect observations. The problem was practical: astronomers made repeated measurements of stellar positions, and occasionally a reading would deviate far enough from the others to suggest that something had gone wrong: a misread instrument, a gust of wind disturbing the telescope, a momentary lapse of attention. Chauvenet’s criterion said that if the probability of obtaining a deviation as large as the one observed was less than 1/(2n) for a sample of size n, the observation should be rejected. The method was crude by modern standards, but it established a fundamental principle: you need a quantitative rule to decide when a measurement is too far from the rest to be trusted, because human judgment alone is unreliable and inconsistent.
A century later, Frank E. Grubbs, a statistician at the Aberdeen Proving Ground in Maryland, placed outlier detection on firmer statistical ground. In a 1950 paper in The Annals of Mathematical Statistics and a widely cited 1969 review in Technometrics, Grubbs developed formal hypothesis tests for detecting one or more outliers in a univariate sample. His test statistic — the maximum absolute deviation from the sample mean divided by the sample standard deviation — follows a known distribution under the null hypothesis that all observations come from the same normal population. Grubbs’ test gave experimentalists a rigorous way to flag suspect values, and variants of it remain in use today in quality control and standard test methods (ASTM E178, for instance, cites Grubbs directly).
The shift to multivariate thinking came in 1977, when R. Dennis Cook, then at the University of Minnesota, introduced what is now called Cook’s distance — a measure of how much a regression model’s fitted values change when a single observation is deleted. Cook’s work launched the field of influence diagnostics, which asks not just whether a point is unusual in isolation, but whether it is actually pulling the model in a distorted direction. In chemometrics, where models routinely involve hundreds or thousands of variables, the problem became more acute: a single contaminated spectrum in a PCA or PLS model can rotate principal components, shift score clusters, and degrade predictions for every other sample. The tools developed for multivariate outlier detection — Hotelling’s T², Q-residuals (SPE), leverage, and their combinations — are now part of the standard workflow for any serious chemometric analysis.
What is an outlier?
An outlier is an observation that does not follow the same pattern as the majority of the data. This definition is deliberately vague, because what counts as “not following the pattern” depends entirely on what pattern you expect.
In a univariate context, an outlier is simply a value that lies far from the center of the distribution. A pH reading of 14.3 in a set of measurements clustered around 7.0 is an obvious outlier. But in multivariate data, outliers can be much more subtle. A sample might have perfectly normal values for every individual variable yet be unusual in its combination of values — high absorbance at wavelength A combined with low absorbance at wavelength B in a way that no other sample exhibits. This is why univariate screening (checking each variable independently) misses multivariate outliers.
Outliers vs. influential points
These two concepts are related but distinct.
An outlier is a point that is distant from the rest of the data. It may or may not affect a model, depending on where it sits relative to the model’s structure.
An influential point is a point that, if removed, substantially changes the fitted model. A point can be influential without being an obvious outlier (it might sit at the edge of the predictor space where it has high leverage), and an outlier can be non-influential (if it lies far from the regression line but in the middle of the predictor space, removing it may barely change the slope).
In chemometrics, the distinction matters because PCA and PLS models are built from covariance structures. A single extreme sample can:
- Rotate principal components to accommodate itself, distorting the model for everyone else
- Inflate the apparent variance explained, giving a misleading picture of model quality
- Bias PLS regression coefficients, degrading predictions on normal samples
- Consume a principal component to explain its own peculiarity rather than genuine chemical variation
Why outliers matter
The core issue is that most chemometric methods are based on least-squares optimization, which minimizes the sum of squared residuals. Squaring means that large deviations get disproportionate weight. A single outlier with a residual 10 times the average has 100 times the influence of a typical point on the fitted model. This sensitivity to extreme values is a feature when all data is well-behaved (it gives efficient estimates) and a liability when it is not (one bad sample can wreck the model).
Univariate outlier detection
Before moving to multivariate methods, it is worth reviewing the classical univariate approaches. They are limited — they cannot detect multivariate outliers — but they provide useful context and are still applied for screening individual variables.
Z-score method
The simplest approach: compute the z-score for each observation and flag those beyond a threshold.
$$z_i = \frac{x_i - \bar{x}}{s}$$

where $\bar{x}$ is the sample mean and $s$ is the sample standard deviation. A common threshold is $|z_i| > 3$, based on the expectation that under normality, only about 0.3% of values should exceed three standard deviations.
Limitation: Both the mean and the standard deviation are themselves influenced by the outlier. If the outlier is extreme enough, it inflates $s$, which shrinks all z-scores, potentially making the outlier look less extreme than it really is. This is the masking effect: the outlier hides itself by distorting the very statistics used to detect it.
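A minimal sketch of the z-score rule and the masking effect in Python (the pH readings are invented for illustration):

```python
import numpy as np

# Nine readings near pH 7 plus one gross error at 14.3
x = np.array([6.9, 7.0, 7.1, 7.0, 6.8, 7.2, 7.0, 6.9, 7.1, 14.3])

z = (x - x.mean()) / x.std(ddof=1)   # classical z-scores

# The outlier inflates the standard deviation so much that even its
# own z-score stays below the usual |z| > 3 cutoff: masking in action.
flagged = np.abs(z) > 3
```

Here the obvious outlier escapes the threshold entirely, which is exactly the masking effect described above.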
IQR method (box plot rule)
A more robust alternative uses the interquartile range:
$$\mathrm{IQR} = Q_3 - Q_1$$

An observation is flagged if it falls below $Q_1 - 1.5\times\mathrm{IQR}$ or above $Q_3 + 1.5\times\mathrm{IQR}$. The factor 1.5 was chosen by John Tukey for his box plot, and it corresponds roughly to ±2.7σ under normality. Using 3.0 instead of 1.5 identifies “far outliers.”
The IQR method is robust because the quartiles are not strongly affected by extreme values. An outlier that inflates the mean and standard deviation barely changes $Q_1$ and $Q_3$.
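A quick sketch of the box plot rule in Python (invented pH readings; NumPy’s default linear interpolation for the quartiles):

```python
import numpy as np

x = np.array([6.9, 7.0, 7.1, 7.0, 6.8, 7.2, 7.0, 6.9, 7.1, 14.3])

q1, q3 = np.percentile(x, [25, 75])        # robust location statistics
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = x[(x < lower) | (x > upper)]    # the 14.3 reading is caught
```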
Grubbs’ test
Grubbs’ test formally tests the null hypothesis that all observations come from the same normal population against the alternative that there is exactly one outlier.
$$G = \frac{\max_i |x_i - \bar{x}|}{s}$$

The null hypothesis is rejected at significance level α if:

$$G > \frac{n-1}{\sqrt{n}} \sqrt{\frac{t^2_{\alpha/(2n),\,n-2}}{n-2+t^2_{\alpha/(2n),\,n-2}}}$$

where $t_{\alpha/(2n),\,n-2}$ is the critical value of the t-distribution with $n-2$ degrees of freedom.
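The test is a few lines of Python with SciPy supplying the t quantile (a sketch, not a validated implementation; the data are invented):

```python
import numpy as np
from scipy import stats

def grubbs_critical(n, alpha=0.05):
    # Critical value built from the t-distribution, per the formula above
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    return (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))

x = np.array([6.9, 7.0, 7.1, 7.0, 6.8, 7.2, 7.0, 6.9, 7.1, 14.3])
G = np.max(np.abs(x - x.mean())) / x.std(ddof=1)
reject = G > grubbs_critical(len(x))   # here the 14.3 reading is rejected
```

Note that Grubbs’ test succeeds where the plain z-score rule is masked: the test statistic is compared against a critical value that accounts for the sample size, not against a fixed cutoff of 3.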
Multivariate outlier detection
This is where chemometrics really needs dedicated tools. A sample can look perfectly normal in every individual variable yet be an outlier in the multivariate space, because the combination of its variable values is unusual. The methods below account for the correlation structure among variables.
Mahalanobis distance
The Mahalanobis distance generalizes the concept of “how many standard deviations away” to multiple correlated variables. Instead of dividing by the standard deviation of a single variable, it accounts for the full covariance structure:
$$D_i^2 = (\mathbf{x}_i - \bar{\mathbf{x}})^\top S^{-1} (\mathbf{x}_i - \bar{\mathbf{x}})$$

where $\bar{\mathbf{x}}$ is the multivariate mean and $S$ is the sample covariance matrix.
Geometrically, the Mahalanobis distance measures the distance from a point to the center of the data cloud, but it stretches and rotates the space so that the data cloud becomes spherical. A point that is far along a direction of high correlation (where you would expect elongated scatter) gets a smaller distance than a point equally far in a direction perpendicular to the correlation structure.
Under multivariate normality, $D_i^2$ follows approximately a χ² distribution with p degrees of freedom (where p is the number of variables). A common threshold is:

$$D_i^2 > \chi^2_{p,\,1-\alpha}$$

with α = 0.05 or α = 0.01.
Limitation in chemometrics: The Mahalanobis distance requires inverting the covariance matrix. When the number of variables exceeds the number of samples (the typical case in spectroscopy), S is singular and cannot be inverted directly. This is precisely why PCA-based diagnostics are preferred — they project the data into a lower-dimensional space where the covariance structure is well-defined.
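A sketch of the computation with NumPy and SciPy, on synthetic data where one sample is unremarkable in each variable but abnormal in their combination:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 100, 2
a = rng.normal(size=n)
X = np.column_stack([a, a + 0.1 * rng.normal(size=n)])  # strongly correlated pair
X[0] = [1.5, -1.5]   # normal per variable, but it breaks the correlation

diff = X - X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
D2 = np.einsum('ij,jk,ik->i', diff, S_inv, diff)   # squared Mahalanobis distances

flagged = D2 > stats.chi2.ppf(0.99, df=p)
```

Univariate screening would pass sample 0 without comment; the Mahalanobis distance flags it immediately because it lies far from the correlation structure.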
Hotelling’s T-squared
Hotelling’s T² is the Mahalanobis distance computed in the PCA score space rather than in the original variable space. After fitting a PCA model with A components, each sample has a score vector $\mathbf{t}_i = [t_{i1}, t_{i2}, \ldots, t_{iA}]$. The T² statistic measures how far each sample is from the center of the score space:

$$T_i^2 = \sum_{a=1}^{A} \frac{t_{ia}^2}{\lambda_a}$$

where $\lambda_a$ is the eigenvalue (variance) of the $a$-th principal component. Dividing each squared score by its corresponding eigenvalue normalizes by the expected spread in that direction: a large score on a high-variance component is less surprising than the same score on a low-variance component.
The threshold for T² at significance level α is based on the F-distribution:

$$T^2_{\mathrm{lim}} = \frac{A(n-1)}{n-A} F_{A,\,n-A,\,1-\alpha}$$

where n is the number of samples. Samples with $T_i^2 > T^2_{\mathrm{lim}}$ are flagged as potential outliers.
What high T-squared means: The sample is extreme within the PCA model. It has unusual scores — it sits far from the center in the space spanned by the principal components. This means the sample has an unusual combination of the patterns that the model captures. For example, if PC1 represents protein content and PC2 represents moisture, a sample with an extreme T² has an unusual protein-moisture combination relative to the other samples.
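A compact sketch in Python (PCA via SVD on mean-centered data; NumPy and SciPy assumed, the data synthetic):

```python
import numpy as np
from scipy import stats

def hotelling_t2(X, A):
    # PCA via SVD on mean-centered data; T^2 from the first A components
    Xc = X - X.mean(axis=0)
    U, svals, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :A] * svals[:A]                  # scores t_ia
    lam = svals[:A] ** 2 / (len(X) - 1)       # eigenvalues lambda_a
    return np.sum(T**2 / lam, axis=1)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))
X[0] *= 6.0                                   # one sample far from the center

n, A = X.shape[0], 2
T2 = hotelling_t2(X, A)
T2_lim = A * (n - 1) / (n - A) * stats.f.ppf(0.95, A, n - A)
```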
Q-residuals (Squared Prediction Error, SPE)
While T² asks whether a sample is extreme within the model, the Q-residual asks whether the sample fits the model at all. It measures the portion of a sample’s variance that the PCA model does not explain:

$$Q_i = \mathbf{e}_i^\top \mathbf{e}_i = \sum_{j=1}^{p} e_{ij}^2$$

where $\mathbf{e}_i = \mathbf{x}_i - \hat{\mathbf{x}}_i$ is the residual vector for sample i, and $\hat{\mathbf{x}}_i = \sum_{a=1}^{A} t_{ia}\mathbf{p}_a$ is the PCA reconstruction. In other words, $Q_i$ is the squared length of the residual vector: the part of the sample that the model cannot account for.
The threshold for Q is based on a χ² approximation proposed by Jackson and Mudholkar (1979):

$$Q_\alpha = \theta_1 \left[ \frac{c_\alpha \sqrt{2\theta_2 h_0^2}}{\theta_1} + 1 + \frac{\theta_2 h_0 (h_0 - 1)}{\theta_1^2} \right]^{1/h_0}$$

where $\theta_k = \sum_{a=A+1}^{p} \lambda_a^k$ for k = 1, 2, 3, $h_0 = 1 - 2\theta_1\theta_3/(3\theta_2^2)$, and $c_\alpha$ is the standard normal critical value. In practice, this is often simplified: a useful rule of thumb for the Q limit is to use the 95th or 99th percentile of the observed Q values in the calibration set, especially when the eigenvalues of the discarded components are not readily available.
What high Q means: The sample contains variation that the PCA model was not built to describe. This might be a different type of interference, a new chemical species, an instrument artifact, or a fundamentally different sample type. The model “knows” about the patterns captured by its A components; a high-Q sample has something else going on.
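A sketch of Q on synthetic two-component data, using the percentile rule of thumb for the limit (all names and noise levels are arbitrary):

```python
import numpy as np

def q_residuals(X, A):
    # Residual sum of squares after reconstructing X from A components
    Xc = X - X.mean(axis=0)
    U, svals, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:A].T                       # loadings p_a
    E = Xc - Xc @ P @ P.T              # part of each sample the model misses
    return np.sum(E**2, axis=1)

rng = np.random.default_rng(2)
scores = rng.normal(size=(60, 2))
loadings, _ = np.linalg.qr(rng.normal(size=(12, 2)))   # orthonormal 12x2 basis
X = scores @ loadings.T + 0.01 * rng.normal(size=(60, 12))
X[0] += 0.5 * rng.normal(size=12)      # variation outside the 2-component model

Q = q_residuals(X, A=2)
Q_lim = np.percentile(Q, 95)           # percentile rule of thumb
```

Sample 0 has ordinary scores but a residual the model cannot explain, so it stands out in Q even though its T² may be unremarkable.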
The T-squared vs Q diagnostic plot
The most informative single diagnostic in chemometric outlier detection is the plot of T² against Q (T² on the x-axis and Q on the y-axis, with a threshold line for each). This plot divides samples into four regions:
| Region | T² | Q-residual | Interpretation |
| --- | --- | --- | --- |
| Normal | Low | Low | The sample is well-modeled and not extreme. It sits in the main cluster. |
| Extreme but modeled | High | Low | The sample has unusual scores but fits the model well. It is an extreme version of known variation, for example a very high-concentration sample in a set of calibration standards. |
| New variation | Low | High | The sample has normal scores but does not fit the model. It contains variation the model was not trained on: a new interferent, a different instrument, or a sample type not in the calibration set. |
| Genuine outlier | High | High | The sample is both extreme and poorly modeled. It is unusual in every sense and should be investigated carefully. |
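The four regions translate directly into code. A tiny helper (the threshold values below are arbitrary placeholders, not recommendations):

```python
def classify(t2, q, t2_lim, q_lim):
    """Map a (T2, Q) pair to its region of the diagnostic plot."""
    if t2 <= t2_lim and q <= q_lim:
        return "normal"
    if q <= q_lim:
        return "extreme but modeled"   # high T2, low Q
    if t2 <= t2_lim:
        return "new variation"         # low T2, high Q
    return "genuine outlier"           # high T2, high Q

# One made-up sample per region, with illustrative limits T2_lim=10, Q_lim=0.05
labels = [classify(t2, q, 10.0, 0.05)
          for t2, q in [(3.2, 0.01), (14.0, 0.02), (4.0, 0.30), (18.0, 0.40)]]
```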
Influence in regression
When building calibration models (PLS regression, PCR, or even ordinary least squares), it is not enough to know whether a sample is an outlier in the predictor space or the response space. What matters is whether removing a sample substantially changes the model. This is the domain of influence diagnostics.
Leverage (hat matrix diagonal)
The leverage of sample i measures how much influence it has on the fitted model due to its position in the predictor space. In ordinary least squares regression, the fitted values are $\hat{\mathbf{y}} = H\mathbf{y}$, where the hat matrix is:

$$H = X(X^\top X)^{-1} X^\top$$

The leverage of sample i is the i-th diagonal element:

$$h_{ii} = \mathbf{x}_i^\top (X^\top X)^{-1} \mathbf{x}_i$$

Leverage values range from 1/n to 1. The average leverage is (p+1)/n, and a common rule of thumb flags samples with $h_{ii} > 2(p+1)/n$ or $h_{ii} > 3(p+1)/n$ as high-leverage points.
Key insight: Leverage depends only on the predictor values, not on the response. A high-leverage point sits in a sparsely populated region of the predictor space. It has the potential to be influential — whether it actually is depends on whether its response value is consistent with the model.
In PCA/PLS, leverage in the score space is directly related to Hotelling’s T²:

$$h_{ii} = \frac{T_i^2}{n-1} + \frac{1}{n}$$

This makes T² and leverage essentially interchangeable diagnostics in the chemometric context.
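Computing the hat diagonal directly is a one-liner in NumPy (a sketch; the extreme point and the dimensions are invented):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 3
X = rng.normal(size=(n, p))
X[0] = [5.0, 5.0, 5.0]                  # sits at the edge of the predictor space

X1 = np.column_stack([np.ones(n), X])   # intercept column
H = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
h = np.diag(H)                          # leverages h_ii

high_leverage = h > 2 * (p + 1) / n     # rule-of-thumb cutoff
```

A useful sanity check: the leverages always sum to the number of fitted parameters, here p + 1.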
Studentized residuals
The residual for sample i is $e_i = y_i - \hat{y}_i$. The internally studentized residual standardizes this by the residual standard error, adjusted for leverage:

$$r_i = \frac{e_i}{s\sqrt{1 - h_{ii}}}$$

where s is the estimated standard deviation of the residuals. The division by $\sqrt{1 - h_{ii}}$ accounts for the fact that high-leverage points tend to have smaller residuals (the model is “pulled” toward them).
The externally studentized residual (also called the deleted residual or jackknife residual) goes further: it recomputes the model without sample i and uses that model to predict $y_i$:

$$t_i = \frac{e_i}{s_{(i)}\sqrt{1 - h_{ii}}}$$

where $s_{(i)}$ is the residual standard error computed without sample i. Under the null hypothesis, $t_i$ follows a t-distribution with $n-p-2$ degrees of freedom. Values with $|t_i| > t_{n-p-2,\,\alpha/2}$ are flagged. A practical threshold is $|t_i| > 2.5$ or $|t_i| > 3$.
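A sketch for simple linear data with one response outlier, using the standard identity that converts internally to externally studentized residuals, so no explicit refitting is needed (the data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 1                            # n samples, p = 1 predictor
x = np.linspace(0.0, 1.0, n)
y = 2.0 + 3.0 * x + 0.1 * rng.normal(size=n)
y[10] += 2.0                            # response outlier at mid-range x

X1 = np.column_stack([np.ones(n), x])
H = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
h = np.diag(H)
e = y - H @ y                           # OLS residuals
s2 = (e**2).sum() / (n - p - 1)

r = e / np.sqrt(s2 * (1 - h))           # internally studentized
t = r * np.sqrt((n - p - 2) / (n - p - 1 - r**2))   # externally studentized
```

The leave-one-out identity in the last line avoids refitting the model n times; the externally studentized value grows much faster than the internal one for a genuine outlier, because the outlier no longer inflates its own error estimate.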
Cook’s distance
Cook’s distance combines leverage and residual information into a single measure of influence. It quantifies how much all the fitted values change when sample i is removed:
$$D_i = \frac{(\hat{\mathbf{y}} - \hat{\mathbf{y}}_{(i)})^\top (\hat{\mathbf{y}} - \hat{\mathbf{y}}_{(i)})}{p\, s^2}$$

which can be rewritten using leverage and residuals:

$$D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}$$

Cook’s distance is large when a sample has both a large residual (it does not fit the model) and high leverage (it is in a position to influence the model). A common guideline is to investigate samples with $D_i > 4/n$ or $D_i > 1$, though the more conservative threshold of $D_i > F_{p,\,n-p,\,0.50}$ (the 50th percentile of the F-distribution) is sometimes used.
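A sketch showing how a high-leverage, poorly fitting point dominates Cook's distance (the data are invented; here p counts the fitted parameters, intercept included, matching the formula above):

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0.0, 1.0, 30)
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=30)
x = np.append(x, 3.0)                    # far outside the 0..1 range: high leverage
y = np.append(y, 1.0)                    # and its response contradicts the trend
n, p = len(x), 2                         # p = number of fitted parameters

X1 = np.column_stack([np.ones(n), x])
H = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
h = np.diag(H)
e = y - H @ y
s2 = (e**2).sum() / (n - p)
r2 = e**2 / (s2 * (1 - h))               # squared studentized residuals

D = (r2 / p) * h / (1 - h)               # Cook's distance
```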
Williams plot
The Williams plot (leverage vs. studentized residuals) is the regression counterpart of the T² vs Q plot. It plots $h_{ii}$ on the x-axis against the studentized residual on the y-axis, with threshold lines for each:

- Vertical line at $h_{ii} = 3(p+1)/n$ (high leverage boundary)
- Horizontal lines at ±2.5 or ±3 (large residual boundary)

The four quadrants have interpretations analogous to the T² vs Q plot:

- Low leverage, small residual: Normal, well-modeled sample
- High leverage, small residual: Influential position but good fit; the sample pulls the model, but in a consistent direction
- Low leverage, large residual: Poor fit but low influence; an outlier in the response that does not distort the model much
- High leverage, large residual: Both influential and poorly fitted; the most dangerous case for model quality
What to do with outliers
Detecting outliers is only half the job. The harder question is what to do about them. The single most important principle is: investigate before you delete.
Identify the suspect samples
Use the T2 vs Q plot, Williams plot, or other diagnostics to flag candidates. Record which samples are flagged and by which criterion.
Examine the raw data
Go back to the original measurements. Is there an obvious problem? A saturated detector, a baseline shift, a missing region, an obviously corrupted spectrum? If so, the sample has a documented quality issue and removal is justified.
Check the metadata
Was the sample mislabeled? Was the instrument malfunctioning on that day? Was it a different sample type accidentally included in the batch? Laboratory records often explain outliers that statistics alone cannot.
Investigate the chemistry
If the sample has no apparent quality issue, ask whether it might represent genuinely different chemistry. A sample with high Q may contain a chemical species not present in the other samples. This is not an error — it is information. Removing it would mean ignoring a real phenomenon.
Assess the impact
Build the model with and without the suspect samples. Do the predictions change substantially? If removing a sample barely affects the model, the point is not influential regardless of whether it is an outlier. If removal changes the predictions for other samples, the point deserves careful scrutiny.
Document your decision
Whether you keep or remove the sample, record the reasoning. “Removed because the detector saturated above 2500 nm” is a defensible decision. “Removed because it was an outlier” is not — it tells the next analyst nothing about the actual cause.
Robust alternatives
The methods described above all share a vulnerability: they use the sample mean and covariance to define “normal,” but those statistics are themselves influenced by the outliers they are trying to detect. This circular problem motivates robust methods, which estimate location and scatter in ways that resist contamination by outlying observations.
Median Absolute Deviation (MAD)
In univariate settings, the median and MAD provide a robust alternative to the mean and standard deviation:
$$\mathrm{MAD} = \mathrm{median}\left(\left|x_i - \mathrm{median}(x)\right|\right)$$

The MAD can be converted to a robust estimate of the standard deviation by multiplying by 1.4826 (valid under normality). A robust z-score is then:

$$z_i^{\mathrm{robust}} = \frac{x_i - \mathrm{median}(x)}{1.4826 \times \mathrm{MAD}}$$
The MAD has a breakdown point of 50%, meaning that up to half the data can be contaminated before the MAD becomes unreliable. Compare this to the standard deviation, which has a breakdown point of 0% — a single extreme value can make it arbitrarily large.
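The robust z-score in Python, on invented pH readings where the outlier masks itself under the classical rule:

```python
import numpy as np

x = np.array([6.9, 7.0, 7.1, 7.0, 6.8, 7.2, 7.0, 6.9, 7.1, 14.3])

# Classical z-score: the outlier inflates s and hides itself (masking)
z = (x - x.mean()) / x.std(ddof=1)

# Robust version: the median and MAD are barely moved by the outlier
med = np.median(x)
mad = np.median(np.abs(x - med))
z_robust = (x - med) / (1.4826 * mad)
```

The classical rule flags nothing here, while the robust score for the bad reading is enormous: the outlier can no longer distort the statistics used to detect it.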
ROBPCA
For multivariate data, Hubert, Rousseeuw, and Vanden Branden (2005) developed ROBPCA, a robust version of PCA that combines projection pursuit with the Minimum Covariance Determinant (MCD) estimator. The key idea is to find the subset of observations that has the smallest covariance determinant (the tightest ellipsoid), use that subset to estimate the covariance structure, and then compute PCA from the robust covariance matrix.
ROBPCA produces robust scores, robust T2 values, and robust orthogonal distances (analogous to Q-residuals), all of which are less affected by the outliers themselves. It also classifies observations into four groups: regular observations, good leverage points (unusual position but consistent with the robust model), orthogonal outliers (high orthogonal distance), and bad leverage points (both unusual position and high orthogonal distance).
Least Trimmed Squares (LTS)
In regression, the Least Trimmed Squares estimator (introduced by Rousseeuw in 1984) fits the model by minimizing the sum of the h smallest squared residuals, where h is typically chosen as roughly $n/2 + (p+1)/2$. This means the LTS regression simply ignores the worst-fitting half of the data when estimating the model, making it highly resistant to outliers. Once the robust model is fitted, the full set of residuals can be examined to identify which observations were excluded and to investigate them.
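A simplified LTS sketch using random elemental starts refined by concentration steps (the full FAST-LTS algorithm adds several refinements; everything here, including the contamination scheme, is illustrative):

```python
import numpy as np

def lts_fit(X1, y, h, n_starts=50, seed=0):
    # Simplified LTS: many random starts, each refined by concentration steps
    rng = np.random.default_rng(seed)
    n, k = X1.shape
    best_obj, best_beta = np.inf, None
    for _ in range(n_starts):
        idx = rng.choice(n, size=k, replace=False)     # random elemental start
        for _ in range(20):                            # concentration (C-) steps
            beta, *_ = np.linalg.lstsq(X1[idx], y[idx], rcond=None)
            resid2 = (y - X1 @ beta) ** 2
            idx = np.argsort(resid2)[:h]               # keep the h best-fitting points
        obj = np.sort(resid2)[:h].sum()
        if obj < best_obj:
            best_obj, best_beta = obj, beta
    return best_beta

rng = np.random.default_rng(6)
n = 60
x = rng.uniform(0.0, 1.0, n)
y = 1.0 + 2.0 * x + 0.05 * rng.normal(size=n)
y[:10] += 5.0                                          # ~17% gross contamination
X1 = np.column_stack([np.ones(n), x])

h = n // 2 + (X1.shape[1] + 1) // 2                    # h = n/2 + (p+1)/2
beta_lts = lts_fit(X1, y, h)
beta_ols, *_ = np.linalg.lstsq(X1, y, rcond=None)      # for comparison
```

With a sixth of the responses shifted, ordinary least squares is pulled visibly off the true coefficients (intercept 1, slope 2), while the trimmed fit recovers them.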
In practice, whatever diagnostics are used, the flagging step itself is simple. In R, assuming a data frame `results` holding each sample's T² and Q values alongside their limits:

```r
# Flag every sample that exceeds the T² or Q limit, and report which criterion fired
for (i in seq_len(nrow(results))) {
  flags <- character(0)
  if (results$T2[i] > results$T2_lim) flags <- c(flags, sprintf("T²=%.2f", results$T2[i]))
  if (results$Q[i] > results$Q_lim) flags <- c(flags, sprintf("Q=%.6f", results$Q[i]))
  if (length(flags) > 0)
    cat(sprintf("  Sample %d: %s\n", i, paste(flags, collapse = ", ")))
}
```
When to investigate vs when to remove
Investigate further
High Q, low T-squared. The sample fits poorly but is not extreme in the model. It likely contains variation the model was not designed for — a new interferent, a different matrix, or an instrument change. This is new information, not noise. Understand it before discarding it.
Isolated case in a large dataset. If one sample out of 500 is flagged, look at the raw data. It could be a transcription error, a labeling mistake, or a genuine anomaly worth investigating.
The outlier is chemically plausible. If the flagged sample comes from a known edge case (extreme pH, high temperature, unusual matrix), it may represent the boundary of your model’s applicability domain. Keeping it improves the model’s range; removing it narrows it.
Consider removing
Documented instrument failure. If lab records confirm that the detector malfunctioned, the lamp was failing, or the sample path was blocked during acquisition, the measurement is not valid data. Remove it and document the reason.
Sample preparation error. A mislabeled vial, a contaminated cuvette, or a dilution mistake produces data that does not represent what you think it represents. If the error is confirmed, removal is justified.
Multiple diagnostics converge. If T-squared, the Q-residual, Cook’s distance, and visual inspection of the raw spectrum all point to a problem, the evidence for removal is strong. But still document the specific cause.
The sample breaks the model for everyone else. If removing one sample substantially improves predictions for the remaining samples (validated by cross-validation), it was likely exerting undue influence. Investigate why before removing.
References
[1] Chauvenet, W. (1863). A Manual of Spherical and Practical Astronomy (Vol. II, Appendix). J. B. Lippincott & Co.
[2] Grubbs, F. E. (1950). Sample criteria for testing outlying observations. The Annals of Mathematical Statistics, 21(1), 27-58.
[3] Grubbs, F. E. (1969). Procedures for detecting outlying observations in samples. Technometrics, 11(1), 1-21.
[4] Cook, R. D. (1977). Detection of influential observation in linear regression. Technometrics, 19(1), 15-18.
[5] Jackson, J. E., & Mudholkar, G. S. (1979). Control procedures for residuals associated with principal component analysis. Technometrics, 21(3), 341-349.
[6] Hubert, M., Rousseeuw, P. J., & Vanden Branden, K. (2005). ROBPCA: A new approach to robust principal component analysis. Technometrics, 47(1), 64-79.
[7] Rousseeuw, P. J. (1984). Least median of squares regression. Journal of the American Statistical Association, 79(388), 871-880.
[8] Brereton, R. G. (2003). Chemometrics: Data Analysis for the Laboratory and Chemical Plant. Wiley.
[9] Hotelling, H. (1931). The generalization of Student’s ratio. The Annals of Mathematical Statistics, 2(3), 360-378.
[10] Eriksson, L., Johansson, E., Kettaneh-Wold, N., Trygg, J., Wikstrom, C., & Wold, S. (2006). Multi- and Megavariate Data Analysis: Part I — Basic Principles and Applications (2nd ed.). Umetrics Academy.