The idea of standardizing measurements to make them comparable has roots that reach back to the late 19th century. Karl Pearson, who introduced the term "standard deviation" in 1893, built on earlier ideas by Francis Galton about regression and deviation from the mean, and the standard score (later called the z-score) grew out of this line of work. The insight was straightforward: if you subtract the mean and divide by the standard deviation, any variable — regardless of its original units or scale — becomes a dimensionless quantity with zero mean and unit variance. This allowed statisticians to compare heights measured in inches with weights measured in pounds, or temperatures in Fahrenheit with pressures in atmospheres. The z-score became one of the most fundamental operations in all of statistics.
When chemometrics emerged as a discipline in the 1970s and 1980s, centering and scaling took on a new urgency. The multivariate methods at the heart of chemometrics — Principal Component Analysis (PCA), Partial Least Squares (PLS), and their many variants — are built on covariance and correlation structures. Svante Wold and Bruce Kowalski, two of the founding figures of chemometrics, emphasized from the beginning that mean centering was not optional for PCA: without it, the first principal component simply captures the mean spectrum rather than the variation between samples, which is the whole point of the analysis. Scaling, meanwhile, became necessary as analytical chemistry increasingly combined data from different instruments (NIR, Raman, GC, HPLC) or different types of measurements (concentrations, temperatures, pressures) into a single data matrix.
Over the decades, the chemometrics community developed several scaling approaches tailored to different data types and analytical goals. Autoscaling (unit variance scaling) became the standard for mixed-unit data. Pareto scaling, introduced by the metabolomics community in the early 2000s, offered a middle ground that reduced the dominance of large variables without amplifying noise as aggressively as autoscaling. Range scaling found its niche in process monitoring, where variables naturally have defined operating ranges. The choice of scaling method is not a minor technical detail — it fundamentally shapes what patterns a multivariate model can find.
Why preprocessing starts with centering and scaling
Before applying PCA, PLS, or any multivariate method, you need to ask a basic question: are your variables on comparable scales?
In analytical chemistry, the answer is almost always no. A typical data matrix might contain:
NIR absorbance values ranging from 0.01 to 2.5
Temperature readings from 20 to 180 degrees Celsius
Moisture content from 0.1% to 15%
Raman intensities from 100 to 50,000 counts
When these variables are combined into a single matrix, the ones with the largest absolute values and the largest variance dominate the analysis. PCA, for example, finds directions of maximum variance in the data. If Raman intensity varies over a range of 50,000 while moisture varies over a range of 15, the first principal component will almost entirely describe Raman intensity variations — not because Raman is more chemically important, but simply because its numbers are bigger.
This is the fundamental problem that centering and scaling solve. They bring variables to a common ground so that the multivariate analysis reflects chemical information rather than arbitrary measurement scales.
Mean centering
Mean centering is the most basic and universally applied preprocessing step. For each variable (column) in your data matrix, you subtract the mean of that variable across all samples:
$$x_{ij}^{\text{centered}} = x_{ij} - \bar{x}_j$$

where $\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$ is the mean of variable $j$ across all $n$ samples.
In matrix notation, if X is your n×p data matrix (n samples, p variables):
$$X_{\text{centered}} = X - \mathbf{1}\,\bar{\mathbf{x}}^{T}$$

where $\mathbf{1}$ is a column vector of ones and $\bar{\mathbf{x}}$ is the vector of column means.
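These two formulas take only a few lines of numpy. The data matrix below is a small made-up example; the point is that subtracting the vector of column means leaves every column with mean zero:

```python
import numpy as np

# Small made-up data matrix: 4 samples (rows) x 3 variables (columns)
X = np.array([
    [1.0, 100.0, 0.10],
    [2.0, 110.0, 0.20],
    [3.0, 120.0, 0.30],
    [4.0, 130.0, 0.40],
])

column_means = X.mean(axis=0)   # the vector of column means, x-bar
X_centered = X - column_means   # broadcasting subtracts each column's mean from every row

print(X_centered.mean(axis=0))  # every column mean is now (numerically) zero
```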
What centering does to your data
After centering, each variable has a mean of zero. Geometrically, you have moved the cloud of data points so that it is centered at the origin. This might seem trivial, but it has a profound effect on PCA.
Without centering, the first principal component points from the origin toward the center of the data cloud. It captures where the data is rather than how it varies. In spectroscopic terms, the first component is essentially the mean spectrum — a perfectly uninteresting piece of information that tells you nothing about differences between samples.
With centering, the first principal component captures the direction of greatest variation between samples. This is what you actually want: the most important pattern of differences in your data. Every subsequent component captures the next most important source of variation, and so on.
Example: spectroscopic data
Consider a set of NIR spectra measured on 50 samples. Each spectrum has 1000 wavelength points. The raw spectra all share a similar overall shape (the mean spectrum) with relatively small differences between samples.
Before centering, PCA finds that component 1 explains 99.5% of the variance — but it is just the mean spectrum. The interesting differences between samples are buried in the remaining 0.5%.
After centering, those differences become the entire focus of the analysis. Component 1 now captures the most important chemical variation (perhaps protein content), component 2 captures the next most important (perhaps moisture), and so on.
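This behavior is easy to reproduce with synthetic data. The sketch below builds made-up "spectra" as a shared mean spectrum plus small random variation, then runs PCA (via SVD) on the uncentered matrix; the first loading vector comes out almost perfectly correlated with the mean spectrum. All numbers are illustrative, not real NIR data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_wavelengths = 50, 200

# Synthetic spectra: one shared shape plus small sample-to-sample differences
mean_spectrum = 1.0 + 0.5 * np.sin(np.linspace(0, 3, n_wavelengths))
X = mean_spectrum + 0.01 * rng.standard_normal((n_samples, n_wavelengths))

# PCA on the UNcentered matrix: the first loading is essentially the mean spectrum
_, _, Vt = np.linalg.svd(X, full_matrices=False)
pc1_loading = Vt[0]

corr = np.corrcoef(pc1_loading, mean_spectrum)[0, 1]
print(abs(corr))  # very close to 1: PC1 just reproduces the mean spectrum
```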
Autoscaling (unit variance scaling)
Autoscaling combines mean centering with division by the standard deviation. It is Pearson’s z-score applied column-wise:
$$x_{ij}^{\text{auto}} = \frac{x_{ij} - \bar{x}_j}{s_j}$$

where $s_j = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_{ij} - \bar{x}_j\right)^{2}}$ is the standard deviation of variable $j$.
After autoscaling, every variable has zero mean and unit variance. Each variable contributes equally to the analysis regardless of its original scale or units.
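A minimal autoscaling helper in numpy (the function name `autoscale` is ours, not a library function) might look like the sketch below. Returning the means and standard deviations matters later, when new samples must be scaled with the calibration parameters:

```python
import numpy as np

def autoscale(X):
    """Mean-center each column, then divide by its sample standard deviation."""
    means = X.mean(axis=0)
    stds = X.std(axis=0, ddof=1)  # ddof=1 gives the n-1 denominator used above
    return (X - means) / stds, means, stds

# Made-up mixed-unit data: absorbance, temperature (deg C), moisture (%)
X = np.array([
    [0.12, 150.0,  2.0],
    [0.45, 165.0,  5.5],
    [0.80, 172.0,  9.0],
    [1.10, 180.0, 14.0],
])

X_scaled, means, stds = autoscale(X)
print(X_scaled.mean(axis=0))          # ~0 for every column
print(X_scaled.std(axis=0, ddof=1))   # 1 for every column
```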
When to use autoscaling
Autoscaling is the right choice when:
Variables have different units — combining spectral data with temperature, pressure, pH, etc.
Variables have very different magnitudes — one variable ranges 0-1 while another ranges 0-10,000
All variables are potentially important — you do not want any single variable to dominate simply because of its scale
You are doing exploratory analysis — autoscaling is a safe default that ensures every variable gets a fair hearing
The danger of autoscaling
Autoscaling has a well-known pitfall: it amplifies noise in low-variance variables. If a variable has very little real variation (perhaps it is nearly constant across all samples), its standard deviation will be small, and dividing by that small number magnifies whatever noise is present. A variable that was practically irrelevant in the raw data can become a major source of apparent variation after autoscaling.
This is particularly problematic in spectroscopy, where baseline regions with little chemical information can have low variance. After autoscaling, these noisy baseline regions get amplified to the same importance as the information-rich peak regions.
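A quick synthetic demonstration of this pitfall: one made-up variable carries a clear signal, the other is pure low-variance noise, yet after autoscaling the two have identical magnitude and the analysis can no longer tell them apart.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200

signal = 5.0 * np.sin(np.linspace(0, 6, n)) + 0.05 * rng.standard_normal(n)  # informative variable
baseline = 0.001 * rng.standard_normal(n)                                    # pure noise, tiny variance

X = np.column_stack([signal, baseline])
raw_stds = X.std(axis=0, ddof=1)
X_auto = (X - X.mean(axis=0)) / raw_stds

print(raw_stds)                      # raw: the baseline's variation is negligible
print(X_auto.std(axis=0, ddof=1))    # autoscaled: both columns now have unit variance
```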
Pareto scaling
Pareto scaling offers a compromise between no scaling and autoscaling. Instead of dividing by the standard deviation, you divide by the square root of the standard deviation:
$$x_{ij}^{\text{Pareto}} = \frac{x_{ij} - \bar{x}_j}{\sqrt{s_j}}$$
This reduces the dominance of large-variance variables without completely equalizing all variables. Large signals are scaled down, but they still contribute more than small signals — which is often desirable when the signal magnitude carries real information.
Why metabolomics loves Pareto scaling
Pareto scaling became particularly popular in metabolomics after van den Berg et al. (2006) published a systematic comparison of scaling methods for metabolomic data. Their key observation was that in metabolomics, peak intensity often correlates with biological importance — major metabolites like glucose and lactate produce large peaks because they are present in high concentrations. Autoscaling would give equal weight to a glucose peak and a minor metabolite detected at the noise floor, which may not be desirable. Pareto scaling keeps the large metabolites prominent while still reducing the dominance gap, striking a practical balance.
The mathematical intuition is simple. If variable A has a standard deviation 100 times larger than variable B:
No scaling: A dominates by a factor of 100
Pareto scaling: A dominates by a factor of $\sqrt{100} = 10$
Autoscaling: Both have equal weight
Pareto scaling reduces the dominance ratio from 100:1 to 10:1, which often better reflects the underlying importance of the variables.
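The 100:1 to 10:1 reduction can be checked numerically. In the sketch below (made-up data, helper name ours), variable A is generated with roughly 100 times the standard deviation of variable B; after Pareto scaling, the ratio of the scaled standard deviations comes out near 10:

```python
import numpy as np

def pareto_scale(X):
    """Mean-center each column, then divide by the SQUARE ROOT of its std dev."""
    means = X.mean(axis=0)
    stds = X.std(axis=0, ddof=1)
    return (X - means) / np.sqrt(stds)

rng = np.random.default_rng(1)
A = 100.0 * rng.standard_normal(500)  # large-variance variable
B = 1.0 * rng.standard_normal(500)    # small-variance variable (~100:1 std ratio)
X = np.column_stack([A, B])

scaled_stds = pareto_scale(X).std(axis=0, ddof=1)
print(scaled_stds[0] / scaled_stds[1])  # dominance ratio shrinks from ~100 to ~10
```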
Range scaling
Range scaling divides each centered variable by its range (maximum minus minimum):
$$x_{ij}^{\text{range}} = \frac{x_{ij} - \bar{x}_j}{x_{j,\max} - x_{j,\min}}$$
This maps each variable to a comparable scale determined by its observed range. It is conceptually similar to min-max normalization (which maps to [0, 1]) but preserves the centering at zero.
When range scaling makes sense
Range scaling is most useful in process monitoring and industrial settings where variables have well-defined operating ranges. A reactor temperature that operates between 150 and 200 degrees Celsius has a meaningful range of 50 degrees. A pressure sensor operating between 1 and 5 bar has a meaningful range of 4 bar. Range scaling normalizes each variable by its operational span, which often corresponds to the practical significance of that variable.
The main limitation is sensitivity to outliers. A single extreme value in one variable can inflate its range, pulling down the scaled values for all other samples. This makes range scaling less robust than autoscaling for datasets with outliers or non-standard samples.
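A range-scaling sketch in the same style as the helpers above (the function name and process values are invented for illustration):

```python
import numpy as np

def range_scale(X):
    """Mean-center each column, then divide by its observed range (max - min)."""
    means = X.mean(axis=0)
    ranges = X.max(axis=0) - X.min(axis=0)
    return (X - means) / ranges

# Made-up process data: reactor temperature (150-200 C), pressure (1-5 bar)
X = np.array([
    [150.0, 1.0],
    [170.0, 2.5],
    [185.0, 4.0],
    [200.0, 5.0],
])

X_scaled = range_scale(X)
spans = X_scaled.max(axis=0) - X_scaled.min(axis=0)
print(spans)  # each column now spans exactly 1.0, centered around zero
```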
Other scaling methods
Several additional scaling approaches have been proposed for specific applications:
VAST scaling (Variable Stability scaling): Autoscales each variable and then divides by its coefficient of variation (the standard deviation divided by the mean). This gives more weight to variables with small relative variation, which can be useful for identifying stable biomarkers.
Level scaling: Divides each centered variable by its mean value, making each variable a measure of relative deviation from the average. Useful when proportional changes are more meaningful than absolute changes.
Log transformation: Not strictly a scaling method, but often used in conjunction with scaling to handle multiplicative noise or data spanning several orders of magnitude. Common in metabolomics and gene expression analysis. Note that log transformation requires all values to be positive, and it changes the distributional properties of the data.
Power transformations (Box-Cox): A family of transformations parameterized by $\lambda$ that includes the log transformation ($\lambda = 0$) and the square root ($\lambda = 0.5$) as special cases. Used to stabilize variance or improve normality.
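Level scaling and the log-plus-centering combination are short numpy functions. The sketch below (made-up concentrations, helper names ours) also shows the positivity check that the log transform requires:

```python
import numpy as np

def level_scale(X):
    """Mean-center each column, then divide by its mean: relative deviations."""
    means = X.mean(axis=0)
    return (X - means) / means

def log_then_center(X):
    """Log-transform (values must be strictly positive), then mean-center."""
    if np.any(X <= 0):
        raise ValueError("log transformation requires all values to be positive")
    L = np.log(X)
    return L - L.mean(axis=0)

# Made-up concentrations spanning orders of magnitude
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [4.0, 9000.0]])

print(level_scale(X))      # deviations as fractions of each variable's mean
print(log_then_center(X))  # multiplicative structure becomes additive, then centered
```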
Choosing the right preprocessing
The choice between centering alone, autoscaling, Pareto scaling, or range scaling depends on the nature of your data and the goal of your analysis. There is no universal answer, but the following guidelines cover most practical situations.
The effect on PCA
The impact of centering and scaling on PCA results can be dramatic. Here is what typically happens with different preprocessing choices applied to a dataset of NIR spectra combined with process variables (temperature, pressure, flow rate).
No preprocessing (raw data):
PC1 captures the variable with the largest absolute values (often temperature or a dominant spectral region)
Scores plot is dominated by one axis
Chemical patterns are hidden
Mean centering only:
PC1 captures the direction of largest variance among all variables
For pure spectroscopic data, this often works well — variance is naturally meaningful
For mixed-unit data, variables with large variance still dominate
Autoscaling (centering + unit variance):
Each variable contributes equally to the analysis
Chemical patterns emerge clearly in the scores plot
Risk: noise in low-variance variables may obscure real patterns
Pareto scaling:
Large-variance variables contribute more than small-variance variables, but not overwhelmingly so
Good compromise for heterogeneous data
The scores plot comparison typically reveals the following pattern: without centering, PC1 vs PC2 shows little structure. With centering alone on mixed-unit data, one or two variables dominate. With autoscaling, clusters and trends that reflect real chemistry become visible. Pareto scaling often gives results intermediate between centering-only and autoscaling.
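The centering-only versus autoscaling contrast for mixed-unit data can be reproduced with a small simulation. Below, two correlated "chemical" variables on a small scale are combined with one independent variable on a huge scale; with centering only, PC1 loads almost entirely on the large variable, while after autoscaling PC1 picks up the correlated chemical pair instead. All data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
chem = rng.standard_normal(n)  # shared "chemical" factor

X = np.column_stack([
    chem + 0.1 * rng.standard_normal(n),  # e.g. a moisture-like variable
    chem + 0.1 * rng.standard_normal(n),  # correlated chemical signal
    50_000.0 * rng.standard_normal(n),    # e.g. raw counts with huge variance
])

def pc1_loading(X):
    """Absolute PC1 loadings of a mean-centered matrix, via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return np.abs(Vt[0])

load_centered = pc1_loading(X)  # centering only: approximately [0, 0, 1]
X_auto = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
load_auto = pc1_loading(X_auto)  # autoscaled: the chemical pair dominates PC1

print(load_centered)
print(load_auto)
```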
Best practices
Mean-center your data before PCA, PLS, or any covariance-based method. This is not optional — it is a mathematical requirement for meaningful results.
Store the preprocessing parameters (means, standard deviations, ranges) from your calibration set. When you preprocess new samples for prediction, you must apply the same means and standard deviations, not recalculate them from the new data.
Document your choices. Record which preprocessing you applied and why. Reproducibility requires knowing exactly how the data was transformed.
Watch out for
Autoscaling noisy variables. If a variable has very low variance, dividing by its standard deviation amplifies noise. Consider removing such variables or using Pareto scaling instead.
Inconsistent preprocessing. The calibration set, validation set, and any new samples must all be preprocessed with the same parameters (the means and standard deviations from the calibration set).
Scaling spectroscopic data unnecessarily. If all variables are in the same units (e.g., absorbance at different wavelengths), mean centering alone often gives better results than autoscaling, because the natural variance structure carries chemical information.
Quick reference: which method to use
| Situation | Recommended approach |
| --- | --- |
| Pure spectroscopic data (same units) | Mean centering only |
| Mixed-unit data (spectra + process variables) | Autoscaling |
| Metabolomics / peak intensity matters | Pareto scaling |
| Process monitoring with known operating ranges | Range scaling |
| Data spanning orders of magnitude | Log transform + centering or autoscaling |
| Exploratory analysis, uncertain which to use | Try autoscaling first, compare with centering only |
Applying preprocessing to new data
When you build a calibration model, you compute means and standard deviations from your training data. New samples must be preprocessed using those same parameters, not recalculated from the new data. This is a common source of errors.
```python
# During calibration
X_train_scaled, means, stds = autoscale(X_train)
model = build_pls_model(X_train_scaled, y_train)

# During prediction -- use the calibration means and stds
X_new_scaled = (X_new - means) / stds  # NOT autoscale(X_new)
y_predicted = model.predict(X_new_scaled)
```
If you recalculate means and standard deviations from new data, each prediction batch gets a different preprocessing, and the model’s coefficients no longer apply correctly.
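One robust way to enforce this discipline is the fit/transform pattern, where the calibration parameters are stored on an object and reused for every later batch. The class below is a minimal sketch of that pattern (scikit-learn's StandardScaler implements the same idea):

```python
import numpy as np

class Autoscaler:
    """Minimal fit/transform autoscaler: parameters come from the calibration
    data only and are reused for every later batch."""

    def fit(self, X):
        self.means_ = X.mean(axis=0)
        self.stds_ = X.std(axis=0, ddof=1)
        return self

    def transform(self, X):
        # Always uses the stored calibration means/stds, never those of X itself
        return (X - self.means_) / self.stds_

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_new = np.array([[2.0, 25.0]])

scaler = Autoscaler().fit(X_train)
X_new_scaled = scaler.transform(X_new)
print(X_new_scaled)  # [[0.0, 0.5]] with calibration means [2, 20] and stds [1, 10]
```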
References
[1] van den Berg, R. A., Hoefsloot, H. C. J., Westerhuis, J. A., Smilde, A. K., & van der Werf, M. J. (2006). Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics, 7, 142.
[2] Bro, R., & Smilde, A. K. (2014). Principal component analysis. Analytical Methods, 6(9), 2812-2831.
[3] Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11), 559-572.
[4] Wold, S., Esbensen, K., & Geladi, P. (1987). Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3), 37-52.
[5] Rinnan, Å., van den Berg, F., & Engelsen, S. B. (2009). Review of the most common pre-processing techniques for near-infrared spectra. TrAC Trends in Analytical Chemistry, 28(10), 1201-1222.
[6] Eriksson, L., Byrne, T., Johansson, E., Trygg, J., & Vikstrom, C. (2013). Multi- and Megavariate Data Analysis: Basic Principles and Applications. Umetrics Academy.
[7] Brereton, R. G. (2003). Chemometrics: Data Analysis for the Laboratory and Chemical Plant. Wiley.
[8] Martens, H., & Naes, T. (1989). Multivariate Calibration. Wiley.