
Whittaker Smoothing

The mathematical foundations of this smoother trace back to Edmund Taylor Whittaker (1873—1956), a British mathematician and astronomer who spent most of his career at the University of Edinburgh. In 1923, Whittaker published “On a New Method of Graduation” in the Proceedings of the Edinburgh Mathematical Society (volume 41, pages 63—75). The word “graduation” came from actuarial science, where smoothing mortality tables to remove statistical fluctuations was called graduating the data. Whittaker’s idea was to frame smoothing as a penalized least squares problem: find the curve that balances fidelity to the observed values against a penalty on roughness, measured by the sum of squared finite differences. The formulation was elegant, but in 1923 there were no electronic computers, and solving the resulting linear systems by hand was impractical for all but the smallest datasets. The method was largely forgotten.

Eight decades later, Paul H.C. Eilers at Leiden University Medical Centre rediscovered Whittaker’s framework and showed it was ideally suited to modern computation. His 2003 paper “A Perfect Smoother” in Analytical Chemistry (volume 75, pages 3631—3636) demonstrated that Whittaker’s penalized least squares formulation, combined with sparse matrix techniques, produced a smoother that was fast, trivially easy to implement, and controlled by a single intuitive parameter, λ. Eilers’ key contribution was using difference penalties rather than derivative penalties, which reduced the entire algorithm to solving a banded linear system — just a few lines of code in MATLAB, R, or Python. This was not Eilers’ first encounter with penalized methods: together with Brian Marx, he had introduced P-splines (penalized B-splines) in 1996, a framework that also relied on difference penalties for flexible curve fitting. The Whittaker smoother can be understood as the limiting case of P-splines where every data point has its own basis function.

Eilers’ revival turned an obscure 1923 idea into one of the most practical smoothing tools in modern chemometrics. Today, Whittaker smoothing and its weighted variants underpin methods for baseline correction (Asymmetric Least Squares), signal denoising, and even two-dimensional smoothing of spectral images.

The optimization approach to smoothing

Moving average smoothing is simple and intuitive, but it has limitations. The main problem? It treats smoothing as a purely local operation, averaging nearby points without considering the global structure of the signal. This leads to peak broadening and loss of resolution.

What if instead of mechanically averaging neighbors, we posed smoothing as an optimization problem?

Find the smoothest curve that still fits the data well.

This is exactly what Whittaker smoothing does. As described above, Whittaker framed the problem as penalized least squares in 1923, and Eilers revived it in 2003 as the “perfect smoother” for modern applications.

The core idea: balancing two goals

Whittaker smoothing recognizes that smoothing involves a fundamental tradeoff between two competing objectives:

  1. Fidelity to data — The smoothed signal should be close to the observed (noisy) data
  2. Smoothness — The smoothed signal should be, well, smooth

These goals conflict. Perfect fidelity means no smoothing (keeping all the noise). Perfect smoothness means a straight line (losing all features). We need to balance them.

Mathematically, we minimize:

S = Σ_i (y_i − z_i)² + λ Σ_i (Δ^d z_i)²

Where:

  • y = your noisy observed data (n points)
  • z = the smoothed signal we’re trying to find
  • λ (lambda) = smoothing parameter that controls the tradeoff
  • Δ^d = d-th order difference operator
  • d = difference order (typically 2)

The fidelity term

Σ_i (y_i − z_i)²

This is the sum of squared differences between the observed data and the smooth signal. It’s small when z is close to y (good fit) and large when z differs from y (poor fit).

The smoothness penalty

This penalizes roughness in the smoothed signal:

λ Σ_i (Δ^d z_i)²

The difference operator Δ measures how much the signal changes:

  • First difference (d=1): Δz_i = z_{i+1} − z_i (measures slope)
  • Second difference (d=2): Δ²z_i = z_{i+2} − 2z_{i+1} + z_i (measures curvature)
  • Third difference (d=3): Δ³z_i = z_{i+3} − 3z_{i+2} + 3z_{i+1} − z_i (measures the change in curvature)
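These operators can be checked numerically with NumPy’s np.diff, which applies exactly these d-th order differences (a quick illustrative sketch):

```python
import numpy as np

# A quadratic sequence: z_i = 1 + i(i+1)/2
z = np.array([1.0, 2.0, 4.0, 7.0, 11.0, 16.0])

print(np.diff(z, n=1))  # first differences (slope):      [1. 2. 3. 4. 5.]
print(np.diff(z, n=2))  # second differences (curvature): [1. 1. 1. 1.]
print(np.diff(z, n=3))  # third differences:              [0. 0. 0.]
```

Because z here is quadratic in i, its second differences are constant and its third differences vanish; by the same token, a d=2 penalty is zero for any straight line, which is why λ → ∞ drives the solution toward a line.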

The λ parameter: the key to the tradeoff

The smoothing parameter λ controls the balance:

  • λ = 0 → No penalty for roughness → Perfect fit to data (no smoothing)
  • λ = ∞ → Infinite penalty for roughness → Perfectly smooth (straight line)
  • λ = “just right” → Balanced smoothing that reduces noise but preserves features

In practice, λ values typically range from 10^0 to 10^8, which is why it’s often specified on a log scale.

The solution: a simple linear system

Here’s the remarkable thing: even though this looks like a complex optimization problem, it has a closed-form solution. No iterative optimization needed.

Taking the derivative of S with respect to z and setting it to zero gives us:

(I + λDᵀD) z = y

Where:

  • I = identity matrix (n × n)
  • D = difference matrix (implements the Δ^d operator)
  • z = smoothed signal (what we want)
  • y = noisy observed data

This is just a linear system of the form Ax = b, with A = I + λDᵀD and b = y. We can solve it directly with a sparse solver.
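As a sketch of that direct solve using SciPy’s sparse machinery (variable names are illustrative; the full worked example appears in the code section below):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

rng = np.random.default_rng(0)
n = 200
y = np.sin(np.linspace(0, 3 * np.pi, n)) + rng.normal(0, 0.1, n)

# Second-order difference matrix D, shape (n-2, n): rows follow [1, -2, 1]
D = sparse.diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(n - 2, n), format='csc')

lam = 1e3
A = sparse.eye(n, format='csc') + lam * (D.T @ D)  # I + λ DᵀD (banded, sparse)
z = spsolve(A, y)  # one direct solve; no iteration needed
```

For d=2 the system matrix is pentadiagonal, so the solve costs O(n) time and memory.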

Interactive visualization

Let’s explore how the λ parameter and difference order affect smoothing. The visualization below lets you adjust both parameters and see the results in real time:

Try this:

  1. Start with log(λ) = 2 (λ = 100) and observe moderate smoothing
  2. Increase log(λ) to 6 and watch the signal become very smooth
  3. Decrease log(λ) to 0 and see it follow the noisy data closely
  4. Try different difference orders (d = 1, 2, 3) and notice how they affect the result
  5. Compare the smoothed signal with the true underlying signal (green line)

Notice how you can find a λ value that removes noise while preserving the peak shapes.

What does the difference matrix D look like?

The difference matrix D implements the difference operator. For second differences (d=2) and n=6 points, it looks like:

D = [ 1 -2  1  0  0  0 ]
    [ 0  1 -2  1  0  0 ]
    [ 0  0  1 -2  1  0 ]
    [ 0  0  0  1 -2  1 ]

Each row computes one second difference: Δ²z_i = z_i − 2z_{i+1} + z_{i+2}

For first differences (d=1):

D = [ 1 -1  0  0  0  0 ]
    [ 0  1 -1  0  0  0 ]
    [ 0  0  1 -1  0  0 ]
    [ 0  0  0  1 -1  0 ]
    [ 0  0  0  0  1 -1 ]

Notice how sparse these matrices are—mostly zeros with just a few -1, 1, -2 entries. This is why Whittaker is computationally efficient.
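A compact way to build D for any difference order d is to difference an identity matrix d times; this sketch (assuming SciPy) reproduces the second-difference matrix shown above. For odd d the row signs come out flipped relative to the display above, which is harmless because the penalty squares them:

```python
import numpy as np
from scipy import sparse

def diff_matrix(n, d):
    """Sparse d-th order difference matrix, shape (n - d, n)."""
    D = sparse.eye(n, format='csc')
    for _ in range(d):
        D = D[1:] - D[:-1]  # each pass raises the difference order by one
    return D

# rows follow the [1, -2, 1] pattern shown above
print(diff_matrix(6, 2).toarray())
```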

Choosing λ: how much to smooth?

The key question: how do I choose λ?

There’s no universal answer. The optimal λ depends on:

  • The noise level in your data
  • The feature size you want to preserve
  • Your application (exploratory vs. quantitative analysis)

Practical guidelines

Start with log(λ) in the range 0 to 6:

  • log(λ) = 0-2 (λ = 1-100): Light smoothing, preserves sharp features
  • log(λ) = 2-4 (λ = 100-10,000): Moderate smoothing, good for typical spectra
  • log(λ) = 4-6 (λ = 10,000-1,000,000): Heavy smoothing, for very noisy data
  • log(λ) > 6: Extreme smoothing, may over-smooth

Visual inspection:

  1. Plot the original and smoothed signals
  2. Increase λ gradually
  3. Stop when noise is reduced but peaks still look sharp

Cross-validation (for quantitative models):

  1. Smooth your data with different λ values
  2. Build a calibration model (e.g., PLS regression) on each
  3. Choose the λ that gives the best prediction error on test data

Residual analysis:

  1. Compute residuals: r = y - z (difference between data and smooth)
  2. Plot the residuals
  3. If they look like random noise → good smoothing
  4. If they show structure (peaks, trends) → you’ve over-smoothed
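The residual check can be partly automated with a lag-1 autocorrelation score. This is a heuristic sketch (the whittaker helper here is a minimal reimplementation, not a library call): white-noise residuals score near zero, while over-smoothing leaves signal structure in the residuals and pushes the score toward 1.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def whittaker(y, lam, d=2):
    n = len(y)
    D = sparse.eye(n, format='csc')
    for _ in range(d):
        D = D[1:] - D[:-1]
    return spsolve(sparse.eye(n, format='csc') + lam * (D.T @ D), y)

rng = np.random.default_rng(1)
x = np.linspace(0, 2 * np.pi, 300)
y = np.sin(x) + rng.normal(0, 0.1, x.size)

for lam in (1e2, 1e8):
    r = y - whittaker(y, lam)  # residuals
    # Lag-1 autocorrelation: near 0 when residuals look like white noise,
    # strongly positive when signal structure leaked into the residuals
    rho = np.corrcoef(r[:-1], r[1:])[0, 1]
    print(f"lambda={lam:.0e}  lag-1 autocorrelation={rho:+.2f}")
```

With λ = 1e2 the residuals are essentially noise; with λ = 1e8 the fit collapses toward a straight line, the sine ends up in the residuals, and the autocorrelation jumps toward 1.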

Code examples

import numpy as np
import matplotlib.pyplot as plt
from scipy import sparse
from scipy.sparse.linalg import spsolve

def whittaker_smooth(y, lam=1e4, d=2):
    """
    Apply Whittaker smoothing to a 1D array.

    Parameters
    ----------
    y : array-like
        Input signal to be smoothed
    lam : float
        Smoothing parameter (lambda). Larger = more smoothing.
        Typical range: 1e0 to 1e8
    d : int
        Order of the difference penalty (typically 2)

    Returns
    -------
    z : array
        Smoothed signal
    """
    n = len(y)
    # Build the d-th order difference matrix D (shape (n - d, n))
    # by differencing the identity matrix d times; this works for any d >= 1
    D = sparse.eye(n, format='csc')
    for _ in range(d):
        D = D[1:] - D[:-1]
    # Solve: (I + λD'D) z = y
    A = sparse.eye(n, format='csc') + lam * (D.T @ D)
    return spsolve(A, y)

# Example: generate a noisy spectrum
np.random.seed(42)
wavelength = np.linspace(400, 800, 400)  # 400-800 nm
true_signal = 2 * np.exp(-((wavelength - 500)**2) / 5000) + \
              1 * np.exp(-((wavelength - 600)**2) / 3000)
noise = np.random.normal(0, 0.1, len(wavelength))
noisy_spectrum = true_signal + noise

# Apply Whittaker smoothing with different λ values
smooth_lam1e2 = whittaker_smooth(noisy_spectrum, lam=1e2, d=2)
smooth_lam1e4 = whittaker_smooth(noisy_spectrum, lam=1e4, d=2)
smooth_lam1e6 = whittaker_smooth(noisy_spectrum, lam=1e6, d=2)

# Plot results
plt.figure(figsize=(10, 6))
plt.plot(wavelength, noisy_spectrum, alpha=0.5, label='Noisy', color='gray')
plt.plot(wavelength, smooth_lam1e2, label='λ=1e2 (light)', linewidth=2)
plt.plot(wavelength, smooth_lam1e4, label='λ=1e4 (moderate)', linewidth=2)
plt.plot(wavelength, smooth_lam1e6, label='λ=1e6 (heavy)', linewidth=2)
plt.plot(wavelength, true_signal, '--', label='True signal',
         color='green', linewidth=2, alpha=0.7)
plt.xlabel('Wavelength (nm)')
plt.ylabel('Absorbance')
plt.title('Whittaker Smoothing')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Advantages of Whittaker smoothing

Excellent peak preservation — Maintains sharp features better than moving average

Theoretically sound — Based on clear optimization principles

Flexible — λ parameter provides precise control over smoothing

Fast — Closed-form solution, no iteration required

Handles unequally spaced data — Works even when x-values aren’t uniform

Smooth derivatives — The smoothed signal has smooth derivatives (unlike moving average)

Well-studied — Extensive literature and proven track record

Disadvantages of Whittaker smoothing

Parameter choice — Requires choosing λ (though this also gives you control)

Less intuitive — Harder to explain than “average your neighbors”

Requires linear algebra — More complex to implement from scratch

Memory for large datasets — Sparse matrices needed for very large n (though still manageable)

Comparison with other methods

| Feature           | Moving Average | Savitzky-Golay     | Whittaker    | Gaussian         |
|-------------------|----------------|--------------------|--------------|------------------|
| Peak preservation | Poor           | Excellent          | Excellent    | Good             |
| Flexibility       | Low            | Moderate           | High         | Low              |
| Theoretical basis | None           | Polynomial fitting | Optimization | Weighted average |
| Parameters        | 1 (window)     | 2 (window, order)  | 1-2 (λ, d)   | 1 (σ)            |
| Speed             | Fastest        | Fast               | Fast         | Fast             |
| Unequal spacing   | No             | No                 | Yes          | No               |
| Best for          | Quick tests    | Spectroscopy       | General use  | Images           |

When to use Whittaker smoothing

Whittaker smoothing is an excellent choice when:

  • You need excellent peak preservation with controlled smoothing
  • You want a theoretically principled approach
  • You’re willing to tune one parameter (λ)
  • You need to smooth unequally spaced data
  • You want smooth derivatives for further analysis
  • You need reproducible, automated smoothing (once λ is chosen)

Avoid Whittaker when:

  • You need the simplest possible explanation (use moving average)
  • You want to compute derivatives directly (use Savitzky-Golay)
  • You have extremely large datasets and limited memory

Real-world applications

Whittaker smoothing is widely used in:

Spectroscopy: NIR, Raman, IR spectra preprocessing

Chromatography: Smoothing chromatographic peaks

Environmental science: Smoothing time series data

Medical imaging: Signal processing of physiological signals

Remote sensing: Smoothing satellite time series data

Extensions: weighted Whittaker

You can assign different weights to different points if some measurements are more reliable:

S = Σ_i w_i (y_i − z_i)² + λ Σ_i (Δ^d z_i)²

Minimizing this gives (W + λDᵀD) z = W y, where W is a diagonal matrix of weights.

This weighted variant is the foundation for more advanced techniques like Asymmetric Least Squares (AsLS) for baseline correction, which you can explore in the baseline correction section.
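A minimal sketch of the weighted smoother (whittaker_weighted is an illustrative name, assuming SciPy). Setting a point’s weight to zero makes the smoother ignore it entirely, so zero-weight runs are bridged by interpolation:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def whittaker_weighted(y, w, lam=1e4, d=2):
    """Weighted Whittaker smoother: solves (W + lam * D'D) z = W y."""
    n = len(y)
    D = sparse.eye(n, format='csc')
    for _ in range(d):
        D = D[1:] - D[:-1]
    W = sparse.diags(w, format='csc')
    return spsolve(W + lam * (D.T @ D), w * y)

rng = np.random.default_rng(7)
x = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.05, x.size)

# Zero weights mark points 80-119 as missing; the smoother bridges the gap
w = np.ones(x.size)
w[80:120] = 0.0
z = whittaker_weighted(y, w, lam=1e3)
```

In the zero-weight region the penalty alone determines z, so for d=2 the gap is filled by a smooth (locally cubic) continuation of the surrounding fit; this same mechanism is what AsLS exploits by iteratively downweighting peaks.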

Common mistakes to avoid

λ too small → Not enough smoothing, noise remains

λ too large → Over-smoothing, peaks disappear

Different λ for train/test → Inconsistent preprocessing

Using d=1 → Creates piecewise linear (jagged) results

Not using sparse matrices → Slow for large n

Forgetting to validate → Always check residuals

Next steps

Now that you understand Whittaker smoothing, you can explore related techniques such as Savitzky-Golay smoothing and baseline correction.

For most spectroscopic applications, Whittaker and Savitzky-Golay are the two workhorses of smoothing. Whittaker gives you precise control via λ, while Savitzky-Golay excels at computing smooth derivatives. Try both and see which works better for your data.

Further reading

  • Whittaker, E. T. (1923). “On a New Method of Graduation”. Proceedings of the Edinburgh Mathematical Society, 41, 63-75.
    • The original 1923 paper introducing penalized least squares for smoothing
  • Eilers, P. H. C., & Marx, B. D. (1996). “Flexible smoothing with B-splines and penalties”. Statistical Science, 11(2), 89-121.
    • The P-splines framework that laid the groundwork for Eilers’ revival of Whittaker smoothing
  • Eilers, P. H. C. (2003). “A perfect smoother”. Analytical Chemistry, 75(14), 3631-3636.
    • The modern reference on Whittaker smoothing
  • Eilers, P. H. C., & Boelens, H. F. M. (2005). “Baseline correction with asymmetric least squares smoothing”.
    • Extension to baseline correction (AsLS)