The Library
Introduction to chemometricsPreprocessing

Extended Multiplicative Scatter Correction (EMSC)

Advanced scatter correction with polynomial baselines and interference handling

In 1985, Paul Geladi, Douglas MacDougall, and Harald Martens introduced Multiplicative Scatter Correction (MSC) to separate chemical light absorption from physical light scatter in near-infrared reflectance spectra. The method was elegant and effective for many routine applications, but it assumed a simple linear relationship between each spectrum and a reference. By the early 1990s, Martens recognized that real-world spectroscopic data often contained more complex artifacts than basic MSC could handle. In 1991, he and Edward Stark published the first formal extension in the Journal of Pharmaceutical and Biomedical Analysis, adding polynomial baseline terms and spectral interference subtraction to the original MSC framework. This paper laid the theoretical groundwork for what would become EMSC.

The method found its most compelling applications through the work of Achim Kohler at the Norwegian University of Life Sciences (NMBU) in the early 2000s. Kohler was using Fourier transform infrared (FTIR) microspectroscopy to study biological samples, including tissue cross-sections and individual cells. These samples presented spectroscopic challenges that went far beyond simple scatter: complex baseline curvature from varying sample thickness, interference fringes, and strong scattering artifacts caused by the size and shape of cells interacting with mid-infrared light. Basic MSC was not enough. Kohler and Martens collaborated to refine and extend the EMSC framework for these demanding applications, publishing key work on separating physical and chemical information in FTIR microscopy images of biological tissue (2005) and on physics-based scatter correction approaches (2006).

Their collaboration continued to push the boundaries of the method. By the late 2000s, Kohler and colleagues had developed the resonant Mie scattering EMSC (RMieS-EMSC), which incorporated full Mie scattering theory to model the wavelength-dependent distortions caused by spherical and near-spherical biological cells. This variant, later accelerated with GPU computing, became a standard preprocessing tool in biomedical infrared spectroscopy. Nils Kristian Afseth and Kohler published an influential tutorial in Chemometrics and Intelligent Laboratory Systems in 2012 that made the method accessible to a broader audience. Today, EMSC and its variants are among the most widely used preprocessing techniques in vibrational spectroscopy, particularly for biological and biomedical applications where scatter effects are complex and physically meaningful.

Beyond basic MSC: handling complex scatter

Multiplicative Scatter Correction (MSC) works great when scatter is your main problem. But real spectroscopic data often throws additional challenges at you:

  • Polynomial baseline drift due to instrumental artifacts
  • Known interfering compounds that vary across samples
  • Temperature effects causing systematic baseline shifts
  • Physical effects that can't be captured by simple linear scaling

EMSC (Extended Multiplicative Scatter Correction) extends the basic MSC model to handle these complications. Instead of just fitting a line to a reference spectrum, EMSC fits a more complex model that includes polynomial baselines and optional interference terms.

The EMSC model: MSC plus extra terms

Basic MSC assumes:

EMSC extends this with additional terms:

Where:

  • ai, bi = same scatter coefficients as MSC (offset and slope)
  • zk = known interference spectra (e.g., water, CO₂)
  • ci,k = coefficients for each interference
  • λj = polynomial baseline terms (wavelength to power j)
  • di,j = polynomial coefficients

The correction removes all these effects:

This gives you a spectrum with scatter, interferences, and polynomial drift all removed.

When EMSC helps more than MSC

Use EMSC instead of basic MSC when you have:

Polynomial baseline drift:

  • Instrument baseline wander (common in FTIR)
  • Temperature-induced baseline shifts
  • Strong curvature that MSC can't capture
  • Diffuse reflectance with non-linear scatter effects

Known interferences:

  • Water vapor absorption in NIR
  • CO₂ bands in FTIR
  • Packaging material signals in food analysis
  • Solvent peaks in solution-state measurements

Complex sample matrices:

  • Biological tissues with varying water content
  • Pharmaceutical tablets with excipients
  • Foods with fat/moisture variations
  • Environmental samples with matrix effects

Stick with basic MSC when:

  • Scatter is linear and dominant
  • No systematic baseline drift
  • No known interfering compounds
  • Simpler is better for your application

The algorithm

EMSC uses multiple linear regression with the reference spectrum and additional terms as predictors.

Step 1: Calculate reference spectrum

Same as MSC—usually the mean:

Step 2: Build the design matrix

For polynomial order J and K interferences:

Each column is a predictor:

  • Column 1: ones (for intercept ai)
  • Column 2: reference spectrum (for slope bi)
  • Columns 3 to K+2: interference spectra
  • Remaining columns: wavelength polynomials

Step 3: Fit regression for each spectrum

Solve the least-squares problem:

This gives you all coefficients: [ai, bi, ci,1, ..., ci,K, di,1, ..., di,J]

Step 4: Correct the spectrum

Remove everything except the reference-scaled signal:

Code examples

import numpy as np
import matplotlib.pyplot as plt

def emsc(spectra, wavelengths, reference=None, poly_order=2, interferences=None):
    """
    Extended Multiplicative Scatter Correction.

    Parameters:
    -----------
    spectra : array-like, shape (n_samples, n_wavelengths)
        Input spectra to be corrected
    wavelengths : array-like, shape (n_wavelengths,)
        Wavelength values
    reference : array-like, shape (n_wavelengths,), optional
        Reference spectrum. If None, uses mean of all spectra.
    poly_order : int, default=2
        Order of polynomial baseline (0 = none, 1 = linear, 2 = quadratic, etc.)
    interferences : list of arrays, optional
        Known interference spectra to be removed

    Returns:
    --------
    corrected : array
        EMSC-corrected spectra, same shape as input
    """
    spectra = np.asarray(spectra)
    wavelengths = np.asarray(wavelengths)

    # Calculate reference spectrum
    if reference is None:
        reference = np.mean(spectra, axis=0)

    n_samples, n_wavelengths = spectra.shape

    # Normalize wavelengths to [-1, 1] for numerical stability
    wl_norm = 2 * (wavelengths - wavelengths.min()) / (wavelengths.max() - wavelengths.min()) - 1

    # Build design matrix
    # Start with intercept and reference
    X = [np.ones(n_wavelengths), reference]

    # Add interference spectra
    if interferences is not None:
        for interference in interferences:
            X.append(np.asarray(interference))

    # Add polynomial terms
    for j in range(1, poly_order + 1):
        X.append(wl_norm ** j)

    X = np.column_stack(X)

    # Correct each spectrum
    corrected = np.zeros_like(spectra)

    for i in range(n_samples):
        # Fit: spectrum = X @ beta
        beta = np.linalg.lstsq(X, spectra[i], rcond=None)[0]

        a = beta[0]  # Intercept
        b = beta[1]  # Reference scaling

        # Reconstruct interference and polynomial contributions
        n_interference = len(interferences) if interferences is not None else 0
        interference_contrib = np.sum([beta[2 + k] * X[:, 2 + k]
                                      for k in range(n_interference)], axis=0) if n_interference > 0 else 0
        poly_contrib = np.sum([beta[2 + n_interference + j - 1] * X[:, 2 + n_interference + j - 1]
                              for j in range(1, poly_order + 1)], axis=0) if poly_order > 0 else 0

        # Correct: remove offset, interferences, and polynomial baseline
        corrected[i] = (spectra[i] - a - interference_contrib - poly_contrib) / b

    return corrected


# Example: NIR spectra with scatter + polynomial baseline
np.random.seed(42)
n_samples = 20
n_wavelengths = 100
wavelengths = np.linspace(1000, 2500, n_wavelengths)

# Ideal spectrum (chemical signal)
ideal_spectrum = (0.5 * np.exp(-((wavelengths - 1500)**2) / 50000) +
                  0.3 * np.exp(-((wavelengths - 2000)**2) / 30000))

# Simulate complex effects
raw_spectra = []
for i in range(n_samples):
    # Random scatter (MSC part)
    a = np.random.uniform(-0.1, 0.1)
    b = np.random.uniform(0.8, 1.2)

    # Random polynomial baseline drift
    p0 = np.random.uniform(-0.05, 0.05)
    p1 = np.random.uniform(-0.00002, 0.00002)
    p2 = np.random.uniform(-5e-9, 5e-9)
    poly_baseline = p0 + p1 * wavelengths + p2 * wavelengths**2

    # Chemical variation
    chemical_var = ideal_spectrum + np.random.normal(0, 0.01, n_wavelengths)

    # Combine: scatter + baseline drift + noise
    spectrum = a + b * chemical_var + poly_baseline + np.random.normal(0, 0.01, n_wavelengths)
    raw_spectra.append(spectrum)

raw_spectra = np.array(raw_spectra)

# Apply EMSC with 2nd-order polynomial
corrected_spectra = emsc(raw_spectra, wavelengths, poly_order=2)

# Plot comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Before EMSC
for i in range(n_samples):
    ax1.plot(wavelengths, raw_spectra[i], alpha=0.6, linewidth=1)
ax1.set_xlabel('Wavelength (nm)')
ax1.set_ylabel('Absorbance')
ax1.set_title('Before EMSC (scatter + baseline drift)')
ax1.grid(True, alpha=0.3)

# After EMSC
for i in range(n_samples):
    ax2.plot(wavelengths, corrected_spectra[i], alpha=0.6, linewidth=1)
ax2.set_xlabel('Wavelength (nm)')
ax2.set_ylabel('Absorbance')
ax2.set_title('After EMSC (corrected)')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Compare variability
print(f"Std before EMSC: {np.std(raw_spectra, axis=0).mean():.4f}")
print(f"Std after EMSC:  {np.std(corrected_spectra, axis=0).mean():.4f}")
function corrected = emsc(spectra, wavelengths, varargin)
    % Extended Multiplicative Scatter Correction
    %
    % Parameters:
    %   spectra - Matrix of spectra (rows = samples, cols = wavelengths)
    %   wavelengths - Wavelength vector
    %   'Reference' - Optional reference spectrum (default: mean)
    %   'PolyOrder' - Polynomial order (default: 2)
    %   'Interferences' - Cell array of interference spectra (default: {})
    %
    % Returns:
    %   corrected - EMSC-corrected spectra

    % Parse inputs
    p = inputParser;
    addParameter(p, 'Reference', []);
    addParameter(p, 'PolyOrder', 2);
    addParameter(p, 'Interferences', {});
    parse(p, varargin{:});

    reference = p.Results.Reference;
    poly_order = p.Results.PolyOrder;
    interferences = p.Results.Interferences;

    [n_samples, n_wavelengths] = size(spectra);

    % Calculate reference
    if isempty(reference)
        reference = mean(spectra, 1);
    end

    % Normalize wavelengths to [-1, 1]
    wl_norm = 2 * (wavelengths - min(wavelengths)) / (max(wavelengths) - min(wavelengths)) - 1;

    % Build design matrix
    X = [ones(n_wavelengths, 1), reference(:)];

    % Add interference spectra
    for k = 1:length(interferences)
        X = [X, interferences{k}(:)];
    end

    % Add polynomial terms
    for j = 1:poly_order
        X = [X, wl_norm(:).^j];
    end

    % Correct each spectrum
    corrected = zeros(size(spectra));

    for i = 1:n_samples
        % Fit: spectrum = X * beta
        beta = X \ spectra(i, :)';

        a = beta(1);  % Intercept
        b = beta(2);  % Reference scaling

        % Reconstruct unwanted contributions
        n_interference = length(interferences);
        unwanted = a;

        for k = 1:n_interference
            unwanted = unwanted + beta(2 + k) * X(:, 2 + k)';
        end

        for j = 1:poly_order
            unwanted = unwanted + beta(2 + n_interference + j) * X(:, 2 + n_interference + j)';
        end

        % Correct
        corrected(i, :) = (spectra(i, :) - unwanted) / b;
    end
end

% Example usage
rng(42);
n_samples = 20;
n_wavelengths = 100;
wavelengths = linspace(1000, 2500, n_wavelengths);

% Ideal spectrum
ideal_spectrum = 0.5 * exp(-((wavelengths - 1500).^2) / 50000) + ...
                 0.3 * exp(-((wavelengths - 2000).^2) / 30000);

% Simulate scatter + polynomial baseline
raw_spectra = zeros(n_samples, n_wavelengths);
for i = 1:n_samples
    a = rand() * 0.2 - 0.1;
    b = rand() * 0.4 + 0.8;

    % Polynomial baseline
    p0 = rand() * 0.1 - 0.05;
    p1 = rand() * 0.00004 - 0.00002;
    p2 = rand() * 1e-8 - 5e-9;
    poly_baseline = p0 + p1 * wavelengths + p2 * wavelengths.^2;

    chemical_var = ideal_spectrum + randn(1, n_wavelengths) * 0.01;
    raw_spectra(i, :) = a + b * chemical_var + poly_baseline + randn(1, n_wavelengths) * 0.01;
end

% Apply EMSC
corrected_spectra = emsc(raw_spectra, wavelengths, 'PolyOrder', 2);

% Plot
figure;
subplot(1, 2, 1);
plot(wavelengths, raw_spectra', 'LineWidth', 1);
xlabel('Wavelength (nm)');
ylabel('Absorbance');
title('Before EMSC');
grid on;

subplot(1, 2, 2);
plot(wavelengths, corrected_spectra', 'LineWidth', 1);
xlabel('Wavelength (nm)');
ylabel('Absorbance');
title('After EMSC');
grid on;

fprintf('Std before EMSC: %.4f\n', mean(std(raw_spectra, 0, 1)));
fprintf('Std after EMSC:  %.4f\n', mean(std(corrected_spectra, 0, 1)));
emsc <- function(spectra, wavelengths, reference = NULL, poly_order = 2, interferences = NULL) {
  # Extended Multiplicative Scatter Correction
  #
  # Parameters:
  #   spectra - Matrix of spectra (rows = samples, cols = wavelengths)
  #   wavelengths - Wavelength vector
  #   reference - Optional reference spectrum (default: mean)
  #   poly_order - Polynomial order (default: 2)
  #   interferences - List of interference spectra (default: NULL)
  #
  # Returns:
  #   corrected - EMSC-corrected spectra

  # Calculate reference
  if (is.null(reference)) {
    reference <- colMeans(spectra)
  }

  n_samples <- nrow(spectra)
  n_wavelengths <- ncol(spectra)

  # Normalize wavelengths to [-1, 1]
  wl_norm <- 2 * (wavelengths - min(wavelengths)) / (max(wavelengths) - min(wavelengths)) - 1

  # Build design matrix
  X <- cbind(1, reference)

  # Add interference spectra
  if (!is.null(interferences)) {
    for (interference in interferences) {
      X <- cbind(X, interference)
    }
  }

  # Add polynomial terms
  if (poly_order > 0) {
    for (j in 1:poly_order) {
      X <- cbind(X, wl_norm^j)
    }
  }

  # Correct each spectrum
  corrected <- matrix(0, nrow = n_samples, ncol = n_wavelengths)

  for (i in 1:n_samples) {
    # Fit: spectrum = X * beta
    fit <- lm.fit(X, spectra[i, ])
    beta <- fit$coefficients

    a <- beta[1]  # Intercept
    b <- beta[2]  # Reference scaling

    # Reconstruct unwanted contributions
    n_interference <- if (!is.null(interferences)) length(interferences) else 0
    unwanted <- a

    if (n_interference > 0) {
      for (k in 1:n_interference) {
        unwanted <- unwanted + beta[2 + k] * X[, 2 + k]
      }
    }

    if (poly_order > 0) {
      for (j in 1:poly_order) {
        unwanted <- unwanted + beta[2 + n_interference + j] * X[, 2 + n_interference + j]
      }
    }

    # Correct
    corrected[i, ] <- (spectra[i, ] - unwanted) / b
  }

  return(corrected)
}

# Example usage
set.seed(42)
n_samples <- 20
n_wavelengths <- 100
wavelengths <- seq(1000, 2500, length.out = n_wavelengths)

# Ideal spectrum
ideal_spectrum <- 0.5 * exp(-((wavelengths - 1500)^2) / 50000) +
                  0.3 * exp(-((wavelengths - 2000)^2) / 30000)

# Simulate scatter + polynomial baseline
raw_spectra <- matrix(0, nrow = n_samples, ncol = n_wavelengths)
for (i in 1:n_samples) {
  a <- runif(1, -0.1, 0.1)
  b <- runif(1, 0.8, 1.2)

  # Polynomial baseline
  p0 <- runif(1, -0.05, 0.05)
  p1 <- runif(1, -0.00002, 0.00002)
  p2 <- runif(1, -5e-9, 5e-9)
  poly_baseline <- p0 + p1 * wavelengths + p2 * wavelengths^2

  chemical_var <- ideal_spectrum + rnorm(n_wavelengths, 0, 0.01)
  raw_spectra[i, ] <- a + b * chemical_var + poly_baseline + rnorm(n_wavelengths, 0, 0.01)
}

# Apply EMSC
corrected_spectra <- emsc(raw_spectra, wavelengths, poly_order = 2)

# Plot
par(mfrow = c(1, 2))

matplot(wavelengths, t(raw_spectra), type = 'l', lty = 1,
        xlab = 'Wavelength (nm)', ylab = 'Absorbance',
        main = 'Before EMSC')
grid()

matplot(wavelengths, t(corrected_spectra), type = 'l', lty = 1,
        xlab = 'Wavelength (nm)', ylab = 'Absorbance',
        main = 'After EMSC')
grid()

cat(sprintf("Std before EMSC: %.4f\n", mean(apply(raw_spectra, 2, sd))))
cat(sprintf("Std after EMSC:  %.4f\n", mean(apply(corrected_spectra, 2, sd))))

Choosing polynomial order

The polynomial order controls how complex the baseline correction can be:

Order 0 (none): Standard MSC—just offset and slope

  • Use when baseline is already flat

Order 1 (linear): MSC + linear baseline drift

  • Use for gradual baseline tilt
  • Common in transmission spectroscopy

Order 2 (quadratic): MSC + curved baseline

  • Use for parabolic baseline curvature
  • Most common choice for NIR diffuse reflectance

Order 3 (cubic): MSC + more complex curvature

  • Use for strong non-linear baselines
  • Common in biological samples

Higher orders (>3): Rarely needed

  • Risk overfitting and removing chemical information
  • Only use if you have strong evidence of complex drift

General advice: Start with order 2. Increase only if you see clear residual baseline structure after correction.

Adding interference spectra

If you know certain compounds interfere across all samples (but with varying amounts), include their pure spectra in the EMSC model.

Example: Water vapor in NIR

# You have a reference water spectrum
water_spectrum = load_pure_water_spectrum()

# Apply EMSC with water as interference
corrected = emsc(spectra, wavelengths, interferences=[water_spectrum])

Common interferences to model:

  • Water vapor (NIR, FTIR)
  • CO₂ absorption (FTIR)
  • Packaging material (transmission through containers)
  • Solvent peaks (solution-state NMR, IR)
  • Excipients (pharmaceutical tablets)

When to include interferences:

  • You have a pure reference spectrum of the interference
  • The interference varies in intensity across samples
  • You want to remove the interference, not just correct for scatter

When not to include:

  • The interference is constant across all samples (it's already in the reference)
  • You don't have a pure reference spectrum
  • The interference is part of what you want to measure

Advantages and limitations

Advantages

Handles complex scatter — Polynomial baselines capture non-linear effects

Removes known interferences — Explicitly models unwanted spectral contributions

More flexible than MSC — Can adapt to more challenging datasets

Still interpretable — Each term has clear physical meaning

Improves model performance — Especially when baseline drift dominates

Limitations

More parameters = more risk — Can overfit if polynomial order is too high

Requires all spectra together — Can't apply to single spectra (same as MSC)

Reference spectrum still matters — Poor reference gives poor correction

Polynomial edges can be unstable — Correction may distort spectral extremes

Slower than MSC — More complex regression, especially with interferences

Need pure interference spectra — May not be available for complex mixtures

Practical tips

Choosing polynomial order:

  • Default: Start with order 2
  • Check residuals: If baseline remains after correction, increase order
  • Don't go above 3-4 unless you have a very good reason
  • Cross-validate to avoid overfitting

Using interference spectra:

  • Ensure interference spectra have same wavelength range as your data
  • Normalize interference spectra to similar intensity scale
  • Don't include too many interferences (risk overfitting)
  • Verify interferences actually improve cross-validation performance

EMSC vs MSC decision:

  • Try MSC first—it's simpler and often sufficient
  • Use EMSC if MSC leaves visible baseline curvature
  • Use EMSC if you have known, varying interferences
  • Compare cross-validation performance to justify the complexity

Computational efficiency:

  • EMSC is slower than MSC due to larger regression problem
  • For large datasets, consider parallelizing the loop over samples
  • Pre-compute the normal equations (XTX)-1XT if reference doesn't change

Test set correction:

  • Use the same reference and same interference spectra as training data
  • Apply EMSC to test data with these fixed references
  • Never recompute the reference from test data (causes data leakage)

Common mistakes to avoid

Using too high polynomial order → Overfitting removes chemical information

Including interferences that aren't actually present → Introduces artifacts

Recomputing reference for test data → Data leakage breaks validation

Combining EMSC with SNV → They correct similar effects; pick one

Using unnormalized wavelengths in polynomials → Numerical instability in regression

Applying EMSC before data splitting → Apply separately to train/test

Not checking if EMSC helps → Compare cross-validation performance vs simpler methods

EMSC variations

Standard EMSC: Polynomial baseline + optional interferences (what we've described)

Constituent EMSC: Models specific target analytes explicitly to preserve their signal

Adaptive EMSC: Automatically selects polynomial order for each spectrum

Orthogonal EMSC: Uses orthogonal polynomials (Legendre, Chebyshev) for numerical stability

Weighted EMSC: Downweights spectral regions with low signal-to-noise

Most of these are research extensions. Standard EMSC is what you'll use in practice.

Further reading

  • Martens, H., & Stark, E. (1991). "Extended multiplicative signal correction and spectral interference subtraction: new preprocessing methods for near infrared spectroscopy". Journal of Pharmaceutical and Biomedical Analysis, 9(8), 625-635.

    • Original EMSC paper
  • Afseth, N. K., & Kohler, A. (2012). "Extended multiplicative signal correction in vibrational spectroscopy, a tutorial". Chemometrics and Intelligent Laboratory Systems, 117, 92-99.

    • Excellent tutorial with practical examples
  • Rinnan, Å., van den Berg, F., & Engelsen, S. B. (2009). "Review of the most common pre-processing techniques for near-infrared spectra". TrAC Trends in Analytical Chemistry, 28(10), 1201-1222.

    • Comprehensive review comparing EMSC with other methods

EMSC is a powerful tool when basic MSC isn't enough. Use it when your spectra have complex baseline drift or known interferences, but always validate that the added complexity actually improves your results.

On this page