Extended Multiplicative Scatter Correction (EMSC)

In 1985, Paul Geladi, Douglas MacDougall, and Harald Martens introduced Multiplicative Scatter Correction (MSC) to separate chemical light absorption from physical light scatter in near-infrared reflectance spectra. The method was elegant and effective for many routine applications, but it assumed a simple linear relationship between each spectrum and a reference. By the early 1990s, Martens recognized that real-world spectroscopic data often contained more complex artifacts than basic MSC could handle. In 1991, he and Edward Stark published the first formal extension in the Journal of Pharmaceutical and Biomedical Analysis, adding polynomial baseline terms and spectral interference subtraction to the original MSC framework. This paper laid the theoretical groundwork for what would become EMSC.

The method found its most compelling applications through the work of Achim Kohler at the Norwegian University of Life Sciences (NMBU) in the early 2000s. Kohler was using Fourier transform infrared (FTIR) microspectroscopy to study biological samples, including tissue cross-sections and individual cells. These samples presented spectroscopic challenges that went far beyond simple scatter: complex baseline curvature from varying sample thickness, interference fringes, and strong scattering artifacts caused by the size and shape of cells interacting with mid-infrared light. Basic MSC was not enough. Kohler and Martens collaborated to refine and extend the EMSC framework for these demanding applications, publishing key work on separating physical and chemical information in FTIR microscopy images of biological tissue (2005) and on physics-based scatter correction approaches (2006).

Their collaboration continued to push the boundaries of the method. By the late 2000s, Kohler and colleagues had developed the resonant Mie scattering EMSC (RMieS-EMSC), which incorporated full Mie scattering theory to model the wavelength-dependent distortions caused by spherical and near-spherical biological cells. This variant, later accelerated with GPU computing, became a standard preprocessing tool in biomedical infrared spectroscopy. Nils Kristian Afseth and Kohler published an influential tutorial in Chemometrics and Intelligent Laboratory Systems in 2012 that made the method accessible to a broader audience. Today, EMSC and its variants are among the most widely used preprocessing techniques in vibrational spectroscopy, particularly for biological and biomedical applications where scatter effects are complex and physically meaningful.

Beyond basic MSC: handling complex scatter

Multiplicative Scatter Correction (MSC) works great when scatter is your main problem. But real spectroscopic data often throws additional challenges at you:

Polynomial baseline drift due to instrumental artifacts
Known interfering compounds that vary across samples
Temperature effects causing systematic baseline shifts
Physical effects that can't be captured by simple linear scaling

EMSC (Extended Multiplicative Scatter Correction) extends the basic MSC model to handle these complications. Instead of just fitting a line to a reference spectrum, EMSC fits a more complex model that includes polynomial baselines and optional interference terms.

The EMSC model: MSC plus extra terms

Basic MSC assumes:

y_{i} = a_{i} + b_{i} \cdot y_{ref}

EMSC extends this with additional terms:

y_{i} = a_{i} + b_{i} \cdot y_{ref} + k = 1 \sum K c_{i, k} \cdot z_{k} + j = 1 \sum J d_{i, j} \cdot λ^{j} + noise

Where:

a_i, b_i = same scatter coefficients as MSC (offset and slope)
z_k = known interference spectra (e.g., water, CO₂)
c_i,k = coefficients for each interference
λ^j = polynomial baseline terms (wavelength to power j)
d_i,j = polynomial coefficients

The correction removes all these effects:

y_{corrected} = \frac{y _{i} - a _{i} - \sum c _{i, k} z _{k} - \sum d _{i, j} λ ^{j}}{b _{i}}

This gives you a spectrum with scatter, interferences, and polynomial drift all removed.

When EMSC helps more than MSC

Use EMSC instead of basic MSC when you have:

Polynomial baseline drift:

Instrument baseline wander (common in FTIR)
Temperature-induced baseline shifts
Strong curvature that MSC can't capture
Diffuse reflectance with non-linear scatter effects

Known interferences:

Water vapor absorption in NIR
CO₂ bands in FTIR
Packaging material signals in food analysis
Solvent peaks in solution-state measurements

Complex sample matrices:

Biological tissues with varying water content
Pharmaceutical tablets with excipients
Foods with fat/moisture variations
Environmental samples with matrix effects

Stick with basic MSC when:

Scatter is linear and dominant
No systematic baseline drift
No known interfering compounds
Simpler is better for your application

The algorithm

EMSC uses multiple linear regression with the reference spectrum and additional terms as predictors.

Step 1: Calculate reference spectrum

Same as MSC—usually the mean:

y_{ref} = \frac{1}{N} i = 1 \sum N y_{i}

Step 2: Build the design matrix

For polynomial order J and K interferences:

X = [1, y_{ref}, z_{1}, \dots, z_{K}, λ, λ^{2}, \dots, λ^{J}]

Each column is a predictor:

Column 1: ones (for intercept a_i)
Column 2: reference spectrum (for slope b_i)
Columns 3 to K+2: interference spectra
Remaining columns: wavelength polynomials

Step 3: Fit regression for each spectrum

Solve the least-squares problem:

β_{i} = (X^{T} X)^{- 1} X^{T} y_{i}

This gives you all coefficients: [a_i, b_i, c_i,1, ..., c_i,K, d_i,1, ..., d_i,J]

Step 4: Correct the spectrum

Remove everything except the reference-scaled signal:

y_{corrected, i} = \frac{y _{i} - a _{i} - \sum _{k = 1}^{K} c _{i, k} z _{k} - \sum _{j = 1}^{J} d _{i, j} λ ^{j}}{b _{i}}

Code examples

import numpy as np
import matplotlib.pyplot as plt

def emsc(spectra, wavelengths, reference=None, poly_order=2, interferences=None):
    """
    Extended Multiplicative Scatter Correction.

    Parameters:
    -----------
    spectra : array-like, shape (n_samples, n_wavelengths)
        Input spectra to be corrected
    wavelengths : array-like, shape (n_wavelengths,)
        Wavelength values
    reference : array-like, shape (n_wavelengths,), optional
        Reference spectrum. If None, uses mean of all spectra.
    poly_order : int, default=2
        Order of polynomial baseline (0 = none, 1 = linear, 2 = quadratic, etc.)
    interferences : list of arrays, optional
        Known interference spectra to be removed

    Returns:
    --------
    corrected : array
        EMSC-corrected spectra, same shape as input
    """
    spectra = np.asarray(spectra)
    wavelengths = np.asarray(wavelengths)

    # Calculate reference spectrum
    if reference is None:
        reference = np.mean(spectra, axis=0)

    n_samples, n_wavelengths = spectra.shape

    # Normalize wavelengths to [-1, 1] for numerical stability
    wl_norm = 2 * (wavelengths - wavelengths.min()) / (wavelengths.max() - wavelengths.min()) - 1

    # Build design matrix
    # Start with intercept and reference
    X = [np.ones(n_wavelengths), reference]

    # Add interference spectra
    if interferences is not None:
        for interference in interferences:
            X.append(np.asarray(interference))

    # Add polynomial terms
    for j in range(1, poly_order + 1):
        X.append(wl_norm ** j)

    X = np.column_stack(X)

    # Correct each spectrum
    corrected = np.zeros_like(spectra)

    for i in range(n_samples):
        # Fit: spectrum = X @ beta
        beta = np.linalg.lstsq(X, spectra[i], rcond=None)[0]

        a = beta[0]  # Intercept
        b = beta[1]  # Reference scaling

        # Reconstruct interference and polynomial contributions
        n_interference = len(interferences) if interferences is not None else 0
        interference_contrib = np.sum([beta[2 + k] * X[:, 2 + k]
                                      for k in range(n_interference)], axis=0) if n_interference > 0 else 0
        poly_contrib = np.sum([beta[2 + n_interference + j - 1] * X[:, 2 + n_interference + j - 1]
                              for j in range(1, poly_order + 1)], axis=0) if poly_order > 0 else 0

        # Correct: remove offset, interferences, and polynomial baseline
        corrected[i] = (spectra[i] - a - interference_contrib - poly_contrib) / b

    return corrected


# Example: NIR spectra with scatter + polynomial baseline
np.random.seed(42)
n_samples = 20
n_wavelengths = 100
wavelengths = np.linspace(1000, 2500, n_wavelengths)

# Ideal spectrum (chemical signal)
ideal_spectrum = (0.5 * np.exp(-((wavelengths - 1500)**2) / 50000) +
                  0.3 * np.exp(-((wavelengths - 2000)**2) / 30000))

# Simulate complex effects
raw_spectra = []
for i in range(n_samples):
    # Random scatter (MSC part)
    a = np.random.uniform(-0.1, 0.1)
    b = np.random.uniform(0.8, 1.2)

    # Random polynomial baseline drift
    p0 = np.random.uniform(-0.05, 0.05)
    p1 = np.random.uniform(-0.00002, 0.00002)
    p2 = np.random.uniform(-5e-9, 5e-9)
    poly_baseline = p0 + p1 * wavelengths + p2 * wavelengths**2

    # Chemical variation
    chemical_var = ideal_spectrum + np.random.normal(0, 0.01, n_wavelengths)

    # Combine: scatter + baseline drift + noise
    spectrum = a + b * chemical_var + poly_baseline + np.random.normal(0, 0.01, n_wavelengths)
    raw_spectra.append(spectrum)

raw_spectra = np.array(raw_spectra)

# Apply EMSC with 2nd-order polynomial
corrected_spectra = emsc(raw_spectra, wavelengths, poly_order=2)

# Plot comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Before EMSC
for i in range(n_samples):
    ax1.plot(wavelengths, raw_spectra[i], alpha=0.6, linewidth=1)
ax1.set_xlabel('Wavelength (nm)')
ax1.set_ylabel('Absorbance')
ax1.set_title('Before EMSC (scatter + baseline drift)')
ax1.grid(True, alpha=0.3)

# After EMSC
for i in range(n_samples):
    ax2.plot(wavelengths, corrected_spectra[i], alpha=0.6, linewidth=1)
ax2.set_xlabel('Wavelength (nm)')
ax2.set_ylabel('Absorbance')
ax2.set_title('After EMSC (corrected)')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Compare variability
print(f"Std before EMSC: {np.std(raw_spectra, axis=0).mean():.4f}")
print(f"Std after EMSC:  {np.std(corrected_spectra, axis=0).mean():.4f}")

function corrected = emsc(spectra, wavelengths, varargin)
    % Extended Multiplicative Scatter Correction
    %
    % Parameters:
    %   spectra - Matrix of spectra (rows = samples, cols = wavelengths)
    %   wavelengths - Wavelength vector
    %   'Reference' - Optional reference spectrum (default: mean)
    %   'PolyOrder' - Polynomial order (default: 2)
    %   'Interferences' - Cell array of interference spectra (default: {})
    %
    % Returns:
    %   corrected - EMSC-corrected spectra

    % Parse inputs
    p = inputParser;
    addParameter(p, 'Reference', []);
    addParameter(p, 'PolyOrder', 2);
    addParameter(p, 'Interferences', {});
    parse(p, varargin{:});

    reference = p.Results.Reference;
    poly_order = p.Results.PolyOrder;
    interferences = p.Results.Interferences;

    [n_samples, n_wavelengths] = size(spectra);

    % Calculate reference
    if isempty(reference)
        reference = mean(spectra, 1);
    end

    % Normalize wavelengths to [-1, 1]
    wl_norm = 2 * (wavelengths - min(wavelengths)) / (max(wavelengths) - min(wavelengths)) - 1;

    % Build design matrix
    X = [ones(n_wavelengths, 1), reference(:)];

    % Add interference spectra
    for k = 1:length(interferences)
        X = [X, interferences{k}(:)];
    end

    % Add polynomial terms
    for j = 1:poly_order
        X = [X, wl_norm(:).^j];
    end

    % Correct each spectrum
    corrected = zeros(size(spectra));

    for i = 1:n_samples
        % Fit: spectrum = X * beta
        beta = X \ spectra(i, :)';

        a = beta(1);  % Intercept
        b = beta(2);  % Reference scaling

        % Reconstruct unwanted contributions
        n_interference = length(interferences);
        unwanted = a;

        for k = 1:n_interference
            unwanted = unwanted + beta(2 + k) * X(:, 2 + k)';
        end

        for j = 1:poly_order
            unwanted = unwanted + beta(2 + n_interference + j) * X(:, 2 + n_interference + j)';
        end

        % Correct
        corrected(i, :) = (spectra(i, :) - unwanted) / b;
    end
end

% Example usage
rng(42);
n_samples = 20;
n_wavelengths = 100;
wavelengths = linspace(1000, 2500, n_wavelengths);

% Ideal spectrum
ideal_spectrum = 0.5 * exp(-((wavelengths - 1500).^2) / 50000) + ...
                 0.3 * exp(-((wavelengths - 2000).^2) / 30000);

% Simulate scatter + polynomial baseline
raw_spectra = zeros(n_samples, n_wavelengths);
for i = 1:n_samples
    a = rand() * 0.2 - 0.1;
    b = rand() * 0.4 + 0.8;

    % Polynomial baseline
    p0 = rand() * 0.1 - 0.05;
    p1 = rand() * 0.00004 - 0.00002;
    p2 = rand() * 1e-8 - 5e-9;
    poly_baseline = p0 + p1 * wavelengths + p2 * wavelengths.^2;

    chemical_var = ideal_spectrum + randn(1, n_wavelengths) * 0.01;
    raw_spectra(i, :) = a + b * chemical_var + poly_baseline + randn(1, n_wavelengths) * 0.01;
end

% Apply EMSC
corrected_spectra = emsc(raw_spectra, wavelengths, 'PolyOrder', 2);

% Plot
figure;
subplot(1, 2, 1);
plot(wavelengths, raw_spectra', 'LineWidth', 1);
xlabel('Wavelength (nm)');
ylabel('Absorbance');
title('Before EMSC');
grid on;

subplot(1, 2, 2);
plot(wavelengths, corrected_spectra', 'LineWidth', 1);
xlabel('Wavelength (nm)');
ylabel('Absorbance');
title('After EMSC');
grid on;

fprintf('Std before EMSC: %.4f\n', mean(std(raw_spectra, 0, 1)));
fprintf('Std after EMSC:  %.4f\n', mean(std(corrected_spectra, 0, 1)));

emsc <- function(spectra, wavelengths, reference = NULL, poly_order = 2, interferences = NULL) {
  # Extended Multiplicative Scatter Correction
  #
  # Parameters:
  #   spectra - Matrix of spectra (rows = samples, cols = wavelengths)
  #   wavelengths - Wavelength vector
  #   reference - Optional reference spectrum (default: mean)
  #   poly_order - Polynomial order (default: 2)
  #   interferences - List of interference spectra (default: NULL)
  #
  # Returns:
  #   corrected - EMSC-corrected spectra

  # Calculate reference
  if (is.null(reference)) {
    reference <- colMeans(spectra)
  }

  n_samples <- nrow(spectra)
  n_wavelengths <- ncol(spectra)

  # Normalize wavelengths to [-1, 1]
  wl_norm <- 2 * (wavelengths - min(wavelengths)) / (max(wavelengths) - min(wavelengths)) - 1

  # Build design matrix
  X <- cbind(1, reference)

  # Add interference spectra
  if (!is.null(interferences)) {
    for (interference in interferences) {
      X <- cbind(X, interference)
    }
  }

  # Add polynomial terms
  if (poly_order > 0) {
    for (j in 1:poly_order) {
      X <- cbind(X, wl_norm^j)
    }
  }

  # Correct each spectrum
  corrected <- matrix(0, nrow = n_samples, ncol = n_wavelengths)

  for (i in 1:n_samples) {
    # Fit: spectrum = X * beta
    fit <- lm.fit(X, spectra[i, ])
    beta <- fit$coefficients

    a <- beta[1]  # Intercept
    b <- beta[2]  # Reference scaling

    # Reconstruct unwanted contributions
    n_interference <- if (!is.null(interferences)) length(interferences) else 0
    unwanted <- a

    if (n_interference > 0) {
      for (k in 1:n_interference) {
        unwanted <- unwanted + beta[2 + k] * X[, 2 + k]
      }
    }

    if (poly_order > 0) {
      for (j in 1:poly_order) {
        unwanted <- unwanted + beta[2 + n_interference + j] * X[, 2 + n_interference + j]
      }
    }

    # Correct
    corrected[i, ] <- (spectra[i, ] - unwanted) / b
  }

  return(corrected)
}

# Example usage
set.seed(42)
n_samples <- 20
n_wavelengths <- 100
wavelengths <- seq(1000, 2500, length.out = n_wavelengths)

# Ideal spectrum
ideal_spectrum <- 0.5 * exp(-((wavelengths - 1500)^2) / 50000) +
                  0.3 * exp(-((wavelengths - 2000)^2) / 30000)

# Simulate scatter + polynomial baseline
raw_spectra <- matrix(0, nrow = n_samples, ncol = n_wavelengths)
for (i in 1:n_samples) {
  a <- runif(1, -0.1, 0.1)
  b <- runif(1, 0.8, 1.2)

  # Polynomial baseline
  p0 <- runif(1, -0.05, 0.05)
  p1 <- runif(1, -0.00002, 0.00002)
  p2 <- runif(1, -5e-9, 5e-9)
  poly_baseline <- p0 + p1 * wavelengths + p2 * wavelengths^2

  chemical_var <- ideal_spectrum + rnorm(n_wavelengths, 0, 0.01)
  raw_spectra[i, ] <- a + b * chemical_var + poly_baseline + rnorm(n_wavelengths, 0, 0.01)
}

# Apply EMSC
corrected_spectra <- emsc(raw_spectra, wavelengths, poly_order = 2)

# Plot
par(mfrow = c(1, 2))

matplot(wavelengths, t(raw_spectra), type = 'l', lty = 1,
        xlab = 'Wavelength (nm)', ylab = 'Absorbance',
        main = 'Before EMSC')
grid()

matplot(wavelengths, t(corrected_spectra), type = 'l', lty = 1,
        xlab = 'Wavelength (nm)', ylab = 'Absorbance',
        main = 'After EMSC')
grid()

cat(sprintf("Std before EMSC: %.4f\n", mean(apply(raw_spectra, 2, sd))))
cat(sprintf("Std after EMSC:  %.4f\n", mean(apply(corrected_spectra, 2, sd))))

Choosing polynomial order

The polynomial order controls how complex the baseline correction can be:

Order 0 (none): Standard MSC—just offset and slope

Use when baseline is already flat

Order 1 (linear): MSC + linear baseline drift

Use for gradual baseline tilt
Common in transmission spectroscopy

Order 2 (quadratic): MSC + curved baseline

Use for parabolic baseline curvature
Most common choice for NIR diffuse reflectance

Order 3 (cubic): MSC + more complex curvature

Use for strong non-linear baselines
Common in biological samples

Higher orders (>3): Rarely needed

Risk overfitting and removing chemical information
Only use if you have strong evidence of complex drift

General advice: Start with order 2. Increase only if you see clear residual baseline structure after correction.

Adding interference spectra

If you know certain compounds interfere across all samples (but with varying amounts), include their pure spectra in the EMSC model.

Example: Water vapor in NIR

# You have a reference water spectrum
water_spectrum = load_pure_water_spectrum()

# Apply EMSC with water as interference
corrected = emsc(spectra, wavelengths, interferences=[water_spectrum])

Common interferences to model:

Water vapor (NIR, FTIR)
CO₂ absorption (FTIR)
Packaging material (transmission through containers)
Solvent peaks (solution-state NMR, IR)
Excipients (pharmaceutical tablets)

When to include interferences:

You have a pure reference spectrum of the interference
The interference varies in intensity across samples
You want to remove the interference, not just correct for scatter

When not to include:

The interference is constant across all samples (it's already in the reference)
You don't have a pure reference spectrum
The interference is part of what you want to measure

Advantages and limitations

Advantages

✅ Handles complex scatter — Polynomial baselines capture non-linear effects

✅ Removes known interferences — Explicitly models unwanted spectral contributions

✅ More flexible than MSC — Can adapt to more challenging datasets

✅ Still interpretable — Each term has clear physical meaning

✅ Improves model performance — Especially when baseline drift dominates

Limitations

❌ More parameters = more risk — Can overfit if polynomial order is too high

❌ Requires all spectra together — Can't apply to single spectra (same as MSC)

❌ Reference spectrum still matters — Poor reference gives poor correction

❌ Polynomial edges can be unstable — Correction may distort spectral extremes

❌ Slower than MSC — More complex regression, especially with interferences

❌ Need pure interference spectra — May not be available for complex mixtures

Practical tips

Choosing polynomial order:

Default: Start with order 2
Check residuals: If baseline remains after correction, increase order
Don't go above 3-4 unless you have a very good reason
Cross-validate to avoid overfitting

Using interference spectra:

Ensure interference spectra have same wavelength range as your data
Normalize interference spectra to similar intensity scale
Don't include too many interferences (risk overfitting)
Verify interferences actually improve cross-validation performance

EMSC vs MSC decision:

Try MSC first—it's simpler and often sufficient
Use EMSC if MSC leaves visible baseline curvature
Use EMSC if you have known, varying interferences
Compare cross-validation performance to justify the complexity

Computational efficiency:

EMSC is slower than MSC due to larger regression problem
For large datasets, consider parallelizing the loop over samples
Pre-compute the normal equations (X^TX)^-1X^T if reference doesn't change

Test set correction:

Use the same reference and same interference spectra as training data
Apply EMSC to test data with these fixed references
Never recompute the reference from test data (causes data leakage)

Common mistakes to avoid

❌ Using too high polynomial order → Overfitting removes chemical information

❌ Including interferences that aren't actually present → Introduces artifacts

❌ Recomputing reference for test data → Data leakage breaks validation

❌ Combining EMSC with SNV → They correct similar effects; pick one

❌ Using unnormalized wavelengths in polynomials → Numerical instability in regression

❌ Applying EMSC before data splitting → Apply separately to train/test

❌ Not checking if EMSC helps → Compare cross-validation performance vs simpler methods

EMSC variations

Standard EMSC: Polynomial baseline + optional interferences (what we've described)

Constituent EMSC: Models specific target analytes explicitly to preserve their signal

Adaptive EMSC: Automatically selects polynomial order for each spectrum

Orthogonal EMSC: Uses orthogonal polynomials (Legendre, Chebyshev) for numerical stability

Weighted EMSC: Downweights spectral regions with low signal-to-noise

Most of these are research extensions. Standard EMSC is what you'll use in practice.