Missing Data

In 1976, the statistician Donald B. Rubin published a paper in Biometrika that would fundamentally change how scientists think about incomplete data. Before Rubin, the standard practice was pragmatic and crude: if a measurement was missing, you either deleted the sample or plugged in the column mean and moved on. Rubin’s contribution was to show that what you can do about missing data depends entirely on why it is missing. He introduced a formal classification — Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR) — that distinguished between benign gaps and dangerous ones. This classification made it possible to reason rigorously about when simple fixes are safe and when they introduce bias that no amount of modeling can correct.

A decade later, Rubin teamed up with Roderick J. A. Little to write Statistical Analysis with Missing Data (1987, second edition 2002), which became the definitive reference on the subject. Little and Rubin formalized multiple imputation — the idea of filling in missing values several times, each time drawing from a plausible distribution, then combining results — and provided the theoretical framework that justified it. Their book influenced fields from clinical trials to economics, and its principles remain the backbone of modern missing-data methodology.

In chemometrics, missing data has always been a practical reality rather than a theoretical curiosity. Spectrometers saturate. Detectors fail mid-scan. Concentrations fall below the limit of detection. Merged datasets span different wavelength ranges, leaving rectangular holes in the data matrix. But the field had an early advantage: the NIPALS algorithm (Nonlinear Iterative Partial Least Squares), introduced by Herman Wold in 1966 for computing principal components, naturally skips missing values during its iterative fitting. This meant that PCA-based workflows in chemometrics could tolerate gaps in the data long before mainstream statistics had settled on a general-purpose solution.

When data goes missing in chemistry

Missing data in analytical chemistry is not an abstract statistical concern — it arises from concrete physical and experimental situations. Understanding the source of the missingness is the first step toward handling it correctly.

Detector saturation. Every detector has a dynamic range. When a signal exceeds the upper limit (for instance, an absorbance above 3 AU in UV-Vis spectroscopy), the reading clips to a maximum value or is flagged as invalid. The true value exists but is unobservable. This is particularly common in NIR and mid-IR measurements of concentrated samples, where strong absorption bands push the detector beyond its linear range.

Failed measurements. Instruments malfunction. A lamp flickers during an acquisition. A cuvette leaks. A fiber-optic probe loses contact with the sample. The result is a spectrum with one or more corrupted channels, or an entire missing sample in a time series. In process analytical technology (PAT), where instruments run continuously for days or weeks, intermittent failures are routine.

Below the detection limit. When an analyte concentration is too low for the instrument to distinguish from noise, the result is often recorded as “< LOD” or simply left blank. This is censored data: you know the true value is somewhere between zero and the detection limit, but you don’t know where. Treating these as zero, as missing, or as LOD/2 each introduces a different kind of bias.

Merged datasets. When combining spectra from different instruments — for example, two NIR spectrometers covering slightly different wavelength ranges — the merged data matrix has missing blocks. Instrument A measured 1100-2500 nm; instrument B measured 800-2200 nm. The union spans 800-2500 nm, but each instrument contributes blanks in the range it didn’t cover. This situation produces structured, predictable patterns of missingness.

Sample-level losses. In designed experiments, some samples may be lost entirely: a vial breaks, a reaction fails, a sample degrades before measurement. The entire row of the data matrix is absent. This is different from having a few missing wavelengths within an otherwise complete spectrum.

Rubin’s classification

Not all missing data is created equal. Rubin’s 1976 framework classifies missingness by its relationship to the data, and this classification determines which imputation methods are valid.

Missing Completely At Random (MCAR)

The probability of a value being missing is unrelated to both the observed and unobserved data. The missingness is purely random — like flipping a coin to decide whether to record each measurement.

Chemistry example: A lab technician accidentally skips a few samples during a routine analysis because they were distracted, not because of anything about the samples themselves. The skipped samples are a random subset of the batch.

Consequence: MCAR is the most benign case. Any method works — even simple deletion — because the remaining data is still a random sample of the full dataset. No bias is introduced by ignoring the gaps.

How to test: Compare the means and distributions of other variables between samples with and without missing values. If they are statistically indistinguishable (e.g., by Little’s MCAR test), the data is plausibly MCAR.
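
As a toy illustration of that comparison (a numpy-only sketch of the group-comparison idea, not Little's full test; the simulated data and the 20% missing rate are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Two correlated variables; values in column 1 go missing completely at random
X = rng.normal(size=(n, 2))
X[:, 1] += 0.5 * X[:, 0]
miss = rng.random(n) < 0.2          # MCAR: missingness ignores the data itself
X[miss, 1] = np.nan

# Compare column 0 between rows where column 1 is missing vs observed
g_miss = X[np.isnan(X[:, 1]), 0]
g_obs = X[~np.isnan(X[:, 1]), 0]

# Welch t statistic, computed by hand
t = (g_miss.mean() - g_obs.mean()) / np.sqrt(
    g_miss.var(ddof=1) / g_miss.size + g_obs.var(ddof=1) / g_obs.size)
print(f"group means: {g_miss.mean():.3f} vs {g_obs.mean():.3f}, Welch t = {t:.2f}")
# Under MCAR, |t| should be small; a large |t| would argue against MCAR
```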

Missing At Random (MAR)

The probability of a value being missing depends on observed data but not on the missing value itself. This is more subtle: the missingness is not random, but the mechanism can be fully explained by information you already have.

Chemistry example: In a pharmaceutical stability study, tablets stored at high temperature degrade faster and are measured more frequently, while cold-stored tablets are measured less often. Whether a measurement exists depends on storage temperature (observed), not on the tablet’s actual potency at that time point (the value that might be missing).

Consequence: Simple deletion introduces bias (you would over-represent high-temperature samples), but methods that condition on the observed data — such as multiple imputation or maximum likelihood estimation — can correct for it. MAR is the assumption underlying most modern imputation methods.

Missing Not At Random (MNAR)

The probability of a value being missing depends on the missing value itself. This is the dangerous case: the data is missing because of what it would have been.

Chemistry example: Concentrations below the limit of detection are left blank. The missingness depends directly on the concentration being low — the very value that is missing. Similarly, detector saturation creates blanks precisely because the signal is too high. In both cases, the missing values are systematically different from the observed ones.

Consequence: No standard imputation method can fully correct for MNAR without making strong assumptions about the missing-data mechanism. You need to model the process that generates the missingness (e.g., a censored regression model for LOD data, or a truncation model for saturated detectors). Ignoring MNAR and treating it as MAR introduces bias.
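
The bias from each common LOD treatment can be seen in a small simulation (illustrative only; the lognormal concentration distribution and the LOD value are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
true_conc = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # true concentrations
lod = 0.5
censored = true_conc < lod                                   # reported as "< LOD"

# Three common treatments of the censored values
as_zero = np.where(censored, 0.0, true_conc)                 # substitute zero
as_half = np.where(censored, lod / 2, true_conc)             # substitute LOD/2
dropped = true_conc[~censored]                               # treat as missing, delete

print(f"true mean:        {true_conc.mean():.3f}")
print(f"substitute 0:     {as_zero.mean():.3f}  (biased low)")
print(f"substitute LOD/2: {as_half.mean():.3f}")
print(f"delete censored:  {dropped.mean():.3f}  (biased high)")
```

Each choice shifts the estimated mean in a different direction, which is exactly why MNAR data calls for an explicit censoring model rather than an ad hoc substitution.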

Simple approaches

When the amount of missing data is small and the mechanism is MCAR or close to it, simple methods can be sufficient. They are also useful as baselines against which to compare more sophisticated approaches.

Listwise deletion (complete cases)

Remove any sample (row) that has one or more missing values. Only complete cases are retained.

When it works: MCAR data with a small fraction of incomplete samples. If 5% of your 200 spectra are missing a few wavelengths, deleting those 10 spectra may be perfectly acceptable.

When it fails: If missingness is systematic (MAR or MNAR), deletion biases the remaining dataset. If many samples have at least one missing value, deletion can discard most of your data — a common problem when merging datasets with different wavelength ranges.
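
In code, listwise deletion is a one-liner (sketch with a made-up 4x3 matrix):

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0],
              [np.nan, 11.0, 12.0]])

complete = X[~np.isnan(X).any(axis=1)]   # keep only fully observed rows
print(complete)                          # rows 0 and 2 survive; half the data is gone
```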

Pairwise deletion

Instead of discarding entire samples, use all available pairs of variables when computing statistics. For example, when computing a covariance matrix, the covariance between variables j and k uses all samples where both j and k are observed, even if those samples have missing values in other variables.

When it works: Computing correlation or covariance matrices when different variables have missing values in different samples.

When it fails: The resulting covariance matrix may not be positive semidefinite (because different entries are computed from different subsets of samples), which causes problems for PCA and other methods that decompose the covariance matrix. Use with caution.
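
A minimal numpy sketch of pairwise covariance makes the mechanism explicit: each entry is computed from its own subset of rows, which is exactly why the assembled matrix can fail to be positive semidefinite (the data here is simulated for the example):

```python
import numpy as np

def pairwise_cov(X):
    """Covariance matrix where entry (j, k) uses all rows with both j and k observed."""
    n_vars = X.shape[1]
    C = np.empty((n_vars, n_vars))
    for j in range(n_vars):
        for k in range(n_vars):
            ok = ~np.isnan(X[:, j]) & ~np.isnan(X[:, k])
            xj, xk = X[ok, j], X[ok, k]
            # Note: the means are recomputed on each pair's own subset of rows
            C[j, k] = np.mean((xj - xj.mean()) * (xk - xk.mean()))
    return C

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
X[rng.random(X.shape) < 0.2] = np.nan
C = pairwise_cov(X)
print(np.round(C, 3))
```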

Mean or median imputation

Replace each missing value with the mean (or median) of the observed values in that variable (column).

When it works: A small number of missing values scattered randomly through the matrix. The bias introduced is minor if the fraction missing is below 5%.

When it fails: Mean imputation artificially reduces the variance of the imputed variable (every imputed value equals the mean, so scatter is suppressed). It also distorts correlations between variables. If 20% of a column is missing and you fill it with the mean, you have systematically pulled those values toward the center of the distribution, weakening any relationship that variable has with other variables. For PCA-based analyses, this can shrink scores and distort loadings.
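
The variance shrinkage is easy to demonstrate (simulated data; the 30% missing rate is chosen only to make the effect visible):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=10.0, scale=2.0, size=1000)
x_miss = x.copy()
x_miss[rng.random(x.size) < 0.3] = np.nan      # 30% missing, MCAR

# Fill every gap with the mean of the observed values
x_imp = np.where(np.isnan(x_miss), np.nanmean(x_miss), x_miss)
print(f"std of observed values:    {np.nanstd(x_miss):.3f}")
print(f"std after mean imputation: {np.std(x_imp):.3f}")   # noticeably smaller
```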

Better approaches

When the fraction of missing data is substantial, or when the mechanism is MAR, simple methods introduce unacceptable bias. The following approaches use the structure of the data to produce better estimates.

k-Nearest Neighbors (kNN) imputation

For each sample with missing values, find the k most similar complete (or more complete) samples, and fill in the gaps using the average of those neighbors’ values.

  1. Define a distance metric

    Use Euclidean distance (or another metric) computed over the variables that are observed in both the target sample and the candidate neighbor.

  2. Find the k nearest neighbors

    Among the samples with observed values for the missing variable, identify the k closest ones.

  3. Impute using neighbor values

    Replace the missing value with the (weighted or unweighted) mean of the corresponding values from the k neighbors:

        x̂_ij = Σ_c w_c x_cj / Σ_c w_c,  with, e.g., w_c = 1/d_c,

where the weights w_c weight closer neighbors more heavily.

Strengths: kNN imputation uses local structure — a missing value in a high-absorbance sample is estimated from other high-absorbance samples, not from the global mean. It does not assume linearity and can capture complex relationships.

Weaknesses: The method requires a meaningful distance metric, which can be problematic when many variables are missing simultaneously (the distance calculation uses fewer dimensions and becomes less reliable). It also scales poorly with very large datasets, since finding neighbors requires comparing every incomplete sample to every candidate.

Typical choice of k: Values between 3 and 10 are common. Small k captures local structure but is sensitive to noise in individual neighbors; large k smooths over more neighbors but may blend distinct subpopulations.
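
A minimal numpy-only sketch of the three steps (scikit-learn's KNNImputer is the robust choice in practice; this version just makes the procedure explicit, and the test data is invented for the example):

```python
import numpy as np

def knn_impute(X, k=5):
    """Minimal kNN imputation: distances over shared observed variables,
    imputed value = mean of the k nearest neighbors' observed values."""
    X = X.copy()
    nan = np.isnan(X)
    for i in np.where(nan.any(axis=1))[0]:          # each incomplete sample
        for j in np.where(nan[i])[0]:               # each missing variable
            # Candidates: samples where variable j is observed
            cand = np.where(~nan[:, j])[0]
            d = np.empty(cand.size)
            for c_idx, c in enumerate(cand):
                # Distance over variables observed in both sample i and candidate c
                shared = ~nan[i] & ~nan[c]
                diff = X[i, shared] - X[c, shared]
                d[c_idx] = np.sqrt(np.mean(diff**2)) if shared.any() else np.inf
            nearest = cand[np.argsort(d)[:k]]
            X[i, j] = X[nearest, j].mean()          # unweighted neighbor mean
    return X

rng = np.random.default_rng(7)
X_full = rng.normal(size=(20, 5))
Xm = X_full.copy()
Xm[rng.random(Xm.shape) < 0.1] = np.nan
X_imp = knn_impute(Xm, k=3)
print(f"remaining NaNs: {np.isnan(X_imp).sum()}")
```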

Iterative PCA imputation

PCA models capture the dominant patterns of covariation in a dataset. If a spectrum is missing a few wavelengths, the remaining wavelengths constrain where that spectrum sits in score space, and the PCA model can predict what the missing wavelengths should have been.

  1. Initialize missing values

    Fill missing entries with column means (or any reasonable starting values).

  2. Fit PCA with A components

    Decompose the completed matrix as X ≈ T Pᵀ, where T contains the scores and P the loadings.

  3. Predict missing values

    Use the PCA model to reconstruct the full matrix: X̂ = T Pᵀ (adding back the column means if the data was centered). Replace only the originally missing entries with their PCA-predicted values.

  4. Repeat until convergence

    Iterate steps 2-3 until the imputed values stabilize (i.e., the change between iterations falls below a tolerance).

The key decision is the number of components A. Too few components underfit (the model cannot reconstruct the missing values accurately). Too many components overfit (noise is modeled and propagated into the imputed values). Cross-validation — holding out known values, pretending they are missing, and checking prediction error — is the standard approach for choosing A.
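
The hold-out idea can be sketched as follows (simulated rank-3 data; em_pca_impute below is a compact stand-in for the iterative procedure described above, with a fixed iteration count for brevity):

```python
import numpy as np

def em_pca_impute(X, n_components, n_iter=50):
    """Compact EM-PCA: mean-initialize, truncated SVD, refill missing, repeat."""
    miss = np.isnan(X)
    Xf = np.where(miss, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        mu = Xf.mean(axis=0)
        U, s, Vt = np.linalg.svd(Xf - mu, full_matrices=False)
        recon = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components] + mu
        Xf = np.where(miss, recon, Xf)     # keep observed values untouched
    return Xf

# Rank-3 data; hide 10% of the entries as a hold-out set with known truth
rng = np.random.default_rng(4)
X_true = rng.normal(size=(60, 3)) @ rng.normal(size=(3, 40))
X_true += rng.normal(scale=0.05, size=X_true.shape)
hold = rng.random(X_true.shape) < 0.10
X_cv = np.where(hold, np.nan, X_true)

rmses = {}
for A in (1, 2, 3, 4, 6):
    Xf = em_pca_impute(X_cv, A)
    rmses[A] = np.sqrt(np.mean((Xf[hold] - X_true[hold])**2))
    print(f"A = {A}: hold-out RMSE = {rmses[A]:.4f}")
# The error should drop up to A = 3 (the true rank) and then flatten
```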

Multiple imputation

Rather than producing a single “best guess” for each missing value, multiple imputation generates M complete datasets, each with slightly different imputed values drawn from a plausible distribution. You then analyze each dataset separately and combine the results, obtaining estimates that properly reflect the uncertainty introduced by the missing data.

The variance of the combined estimate includes both the within-imputation variance (uncertainty in each completed dataset) and the between-imputation variance (uncertainty due to not knowing the true missing values):

    T = W̄ + (1 + 1/M) B,

where W̄ is the average within-imputation variance and B is the between-imputation variance.

Multiple imputation is the gold standard in clinical trials and social sciences but is less commonly used in chemometrics, where iterative PCA tends to be preferred because it directly leverages the spectral structure.
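
Rubin's combining rules are only a few lines of arithmetic (the estimates and variances below are invented numbers standing in for, say, a regression slope fitted on each of M = 5 imputed datasets):

```python
import numpy as np

# One estimate and one variance per imputed dataset (hypothetical values)
estimates = np.array([1.02, 0.97, 1.05, 0.99, 1.01])
variances = np.array([0.010, 0.012, 0.009, 0.011, 0.010])
M = len(estimates)

Q_bar = estimates.mean()          # pooled estimate
W_bar = variances.mean()          # average within-imputation variance
B = estimates.var(ddof=1)         # between-imputation variance
T = W_bar + (1 + 1 / M) * B       # total variance (Rubin's rules)
print(f"pooled estimate {Q_bar:.3f} +/- {np.sqrt(T):.3f}")
```

Note that T is always larger than W̄ alone: ignoring the between-imputation term understates the uncertainty caused by the missing data.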

PCA-based imputation in detail

PCA-based imputation deserves special attention in chemometrics because it exploits the fact that spectroscopic data is inherently low-rank: hundreds or thousands of wavelengths are driven by a handful of underlying chemical components. This makes PCA a natural fit for reconstructing missing values.

NIPALS and missing values

The NIPALS algorithm, introduced by Herman Wold in 1966, computes principal components one at a time through an iterative alternating regression. Its key property is that each step involves regressing one set of unknowns (scores or loadings) on the other, and these regressions can simply skip any missing entries.

For a single component, NIPALS alternates:

    p = Xᵀt / (tᵀt)   (regress each column on the scores t, then normalize p)
    t = X p / (pᵀp)   (regress each row on the loadings p)

Because each regression uses only the available values, NIPALS handles missing data without any explicit imputation step. The scores and loadings converge to values consistent with the observed data. Once the model is fitted, the missing entries can be reconstructed as x̂_ij = t_i p_j.
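
A single-component NIPALS with missing-value skipping fits in a few lines (numpy sketch on simulated rank-1 data; real implementations add centering, deflation for further components, and more careful convergence safeguards):

```python
import numpy as np

def nipals_1pc(X, n_iter=100, tol=1e-8):
    """One NIPALS component; missing entries are skipped in each regression."""
    obs = ~np.isnan(X)
    Xz = np.where(obs, X, 0.0)        # zeros contribute nothing where obs is False
    t = Xz[:, 0].copy()               # initialize scores from the first column
    for _ in range(n_iter):
        # Loadings: regress each column on t over its observed entries
        p = (Xz * t[:, None]).sum(axis=0) / (obs * t[:, None]**2).sum(axis=0)
        p /= np.linalg.norm(p)
        # Scores: regress each row on p over its observed entries
        t_new = (Xz * p).sum(axis=1) / (obs * p**2).sum(axis=1)
        if np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new):
            t = t_new
            break
        t = t_new
    return t, p

rng = np.random.default_rng(5)
t_true = rng.normal(size=30)
p_true = rng.normal(size=8)
p_true /= np.linalg.norm(p_true)
X = np.outer(t_true, p_true) + rng.normal(scale=0.01, size=(30, 8))
miss = rng.random(X.shape) < 0.15
Xm = np.where(miss, np.nan, X)

t, p = nipals_1pc(Xm)
X_hat = np.outer(t, p)               # reconstruct missing entries as t_i * p_j
err = np.sqrt(np.mean((X_hat[miss] - X[miss])**2))
print(f"RMSE at missing positions: {err:.4f}")
```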

Iterative SVD (EM-PCA)

A more modern variant uses the SVD (Singular Value Decomposition) in an expectation-maximization-like loop:

  1. Initialize missing values (column means).

  2. Compute truncated SVD of the filled matrix with A components.

  3. Reconstruct the matrix from the truncated SVD.

  4. Replace only the missing entries with the reconstructed values, keeping observed values unchanged.

  5. Check convergence. If the imputed values changed by less than a threshold (e.g., relative change ‖ΔX_miss‖ / ‖X_miss‖ < 10⁻⁶), stop. Otherwise, return to step 2.

This is equivalent to iterative PCA but uses the SVD, which is more numerically stable and available in every linear algebra library.

How much missing is too much?

There is no universal threshold, but some guidelines help frame the decision.

Less than 5% missing (scattered randomly): Almost any method works. Mean imputation, kNN, or iterative PCA all produce similar results. The choice is unlikely to affect downstream analyses materially.

5-20% missing: Simple methods (mean imputation, deletion) start to introduce noticeable bias. Use kNN or iterative PCA. The pattern of missingness matters more than the percentage — 15% missing concentrated in a few variables is more problematic than 15% scattered uniformly.

20-50% missing: Iterative PCA or multiple imputation is strongly recommended. Mean imputation distorts variance and correlations substantially. Check results by comparing analyses with and without imputation, or by varying the imputation method.

More than 50% missing: The imputed values are driven more by the model assumptions than by the observed data. Results should be interpreted with extreme caution. Consider whether the missing variables can be dropped entirely, or whether the analysis question can be reformulated to avoid them.

Visualizing missingness patterns

Before imputing, always examine where the data is missing. A simple heatmap of the missing-data indicator matrix (1 = observed, 0 = missing) can reveal:

  • Scattered pattern: Random dropouts, consistent with MCAR. Safe for most methods.
  • Block pattern: Entire variables or entire samples are missing. Common when merging datasets. Iterative PCA handles this well if the blocks overlap enough to constrain the model.
  • Monotone pattern: If you sort samples by completeness, the missing entries form a staircase. This arises in longitudinal studies where subjects drop out over time. Specialized monotone imputation methods exist.
  • Systematic pattern: Missing values cluster in specific variables for specific sample groups. Strong evidence for MAR or MNAR. Investigate before imputing.
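
A quick way to generate such a heatmap (matplotlib sketch; the simulated scattered-plus-block pattern is invented for the example):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                      # non-interactive backend for scripted use
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
X = rng.normal(size=(40, 120))
X[rng.random(X.shape) < 0.05] = np.nan     # scattered dropouts (MCAR-like)
X[:10, 80:] = np.nan                       # block: one instrument lacked these channels

observed = ~np.isnan(X)                    # indicator matrix: True = observed
plt.figure(figsize=(8, 3))
plt.imshow(observed, aspect="auto", cmap="gray", interpolation="nearest")
plt.xlabel("Variable (wavelength channel)")
plt.ylabel("Sample")
plt.title("Missingness pattern (white = observed, black = missing)")
plt.tight_layout()
plt.savefig("missingness.png")
```

Both the scattered dropouts and the rectangular block are immediately visible in the resulting image, which is hard to achieve from summary statistics alone.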

Code implementation

import numpy as np
from sklearn.impute import KNNImputer
import matplotlib.pyplot as plt


def iterative_pca_imputation(X, n_components=2, max_iter=100, tol=1e-6):
    """
    Impute missing values using iterative PCA (EM-PCA).

    Parameters
    ----------
    X : array-like, shape (n_samples, n_variables)
        Data matrix with np.nan for missing values
    n_components : int
        Number of PCA components for reconstruction
    max_iter : int
        Maximum number of iterations
    tol : float
        Convergence tolerance (relative change in imputed values)

    Returns
    -------
    X_filled : array
        Completed data matrix with imputed values
    n_iter : int
        Number of iterations until convergence
    """
    X_filled = X.copy()
    missing = np.isnan(X)

    # Initialize missing values with column means
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X_filled[missing[:, j], j] = col_means[j]

    for iteration in range(max_iter):
        old_imputed = X_filled[missing].copy()

        # Center the data
        means = X_filled.mean(axis=0)
        X_centered = X_filled - means

        # Truncated SVD (PCA)
        U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
        U = U[:, :n_components]
        s = s[:n_components]
        Vt = Vt[:n_components, :]

        # Reconstruct
        X_reconstructed = U @ np.diag(s) @ Vt + means

        # Replace only missing values
        X_filled[missing] = X_reconstructed[missing]

        # Check convergence
        new_imputed = X_filled[missing]
        change = np.sqrt(np.sum((new_imputed - old_imputed)**2))
        scale = np.sqrt(np.sum(new_imputed**2)) + 1e-10
        if change / scale < tol:
            return X_filled, iteration + 1

    return X_filled, max_iter


# Example: NIR spectra with missing values
np.random.seed(42)
n_samples = 50
n_wavelengths = 200
wavelengths = np.linspace(1000, 2500, n_wavelengths)

# Simulate low-rank spectral data (3 chemical components)
components = np.array([
    np.exp(-((wavelengths - 1400)**2) / 20000),
    np.exp(-((wavelengths - 1800)**2) / 30000),
    np.exp(-((wavelengths - 2100)**2) / 15000),
])
concentrations = np.random.rand(n_samples, 3)
X_true = concentrations @ components + np.random.normal(0, 0.01, (n_samples, n_wavelengths))

# Introduce 10% missing values (MCAR)
X_missing = X_true.copy()
mask = np.random.rand(*X_true.shape) < 0.10
X_missing[mask] = np.nan
print(f"Missing: {mask.sum()} values ({100 * mask.mean():.1f}%)")

# Method 1: Mean imputation
X_mean = X_missing.copy()
col_means = np.nanmean(X_missing, axis=0)
for j in range(n_wavelengths):
    X_mean[np.isnan(X_mean[:, j]), j] = col_means[j]

# Method 2: kNN imputation
knn = KNNImputer(n_neighbors=5)
X_knn = knn.fit_transform(X_missing)

# Method 3: Iterative PCA
X_pca, n_iter = iterative_pca_imputation(X_missing, n_components=3)
print(f"PCA converged in {n_iter} iterations")

# Compare reconstruction errors (only at missing positions)
rmse_mean = np.sqrt(np.mean((X_mean[mask] - X_true[mask])**2))
rmse_knn = np.sqrt(np.mean((X_knn[mask] - X_true[mask])**2))
rmse_pca = np.sqrt(np.mean((X_pca[mask] - X_true[mask])**2))
print(f"RMSE (mean imputation): {rmse_mean:.6f}")
print(f"RMSE (kNN, k=5): {rmse_knn:.6f}")
print(f"RMSE (iterative PCA): {rmse_pca:.6f}")

# Visualize one spectrum
idx = 0  # First sample
fig, axes = plt.subplots(2, 1, figsize=(12, 8))

axes[0].plot(wavelengths, X_true[idx], 'k-', linewidth=2, label='True')
axes[0].plot(wavelengths[mask[idx]], X_true[idx, mask[idx]], 'rx',
             markersize=8, label='Missing positions')
axes[0].set_title('Original spectrum with missing value locations')
axes[0].set_xlabel('Wavelength (nm)')
axes[0].set_ylabel('Absorbance')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

axes[1].plot(wavelengths, X_true[idx], 'k-', linewidth=2, alpha=0.3, label='True')
axes[1].plot(wavelengths, X_mean[idx], '--', label=f'Mean (RMSE={rmse_mean:.4f})')
axes[1].plot(wavelengths, X_knn[idx], '--', label=f'kNN (RMSE={rmse_knn:.4f})')
axes[1].plot(wavelengths, X_pca[idx], '--', label=f'PCA (RMSE={rmse_pca:.4f})')
axes[1].set_title('Imputation comparison')
axes[1].set_xlabel('Wavelength (nm)')
axes[1].set_ylabel('Absorbance')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Practical guidelines

Always investigate the mechanism

Before choosing an imputation method, determine why data is missing. Plot missingness patterns. Check whether missingness correlates with observed variables. The mechanism (MCAR, MAR, MNAR) dictates which methods are valid.

Don't just delete

Listwise deletion is tempting but wasteful. It throws away entire spectra because of a few missing wavelengths, and it introduces bias whenever missingness is not MCAR. Use it only as a last resort or when the fraction missing is negligible.

Leverage spectral structure

Spectroscopic data is inherently low-rank — hundreds of wavelengths are driven by a few chemical components. PCA-based imputation exploits this structure and consistently outperforms generic methods like mean imputation for spectral data.

Validate your imputation

Artificially remove known values, impute them, and compare to the truth. Report the RMSE of imputation. If downstream results (PCA scores, PLS predictions) change substantially when you switch imputation methods, your conclusions may depend on assumptions rather than data.

References

[1] Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581-592.

[2] Little, R. J. A., & Rubin, D. B. (2002). Statistical Analysis with Missing Data (2nd ed.). Wiley.

[3] Wold, H. (1966). Estimation of principal components and related models by iterative least squares. In P. R. Krishnaiah (Ed.), Multivariate Analysis (pp. 391-420). Academic Press.

[4] Nelson, P. R. C., Taylor, P. A., & MacGregor, J. F. (1996). Missing data methods in PCA and PLS: Score calculations with incomplete observations. Chemometrics and Intelligent Laboratory Systems, 35(1), 45-65.

[5] Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., & Altman, R. B. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6), 520-525.

[6] Walczak, B., & Massart, D. L. (2001). Dealing with missing data: Part I. Chemometrics and Intelligent Laboratory Systems, 58(1), 15-27.

[7] Walczak, B., & Massart, D. L. (2001). Dealing with missing data: Part II. Chemometrics and Intelligent Laboratory Systems, 58(1), 29-42.

[8] Grung, B., & Manne, R. (1998). Missing values in principal component analysis. Chemometrics and Intelligent Laboratory Systems, 42(1-2), 125-139.

[9] Brereton, R. G. (2003). Chemometrics: Data Analysis for the Laboratory and Chemical Plant. Wiley.