
Principal Component Regression (PCR)

In 1965, William F. Massy, an economist at Stanford University, published a paper in the Journal of the American Statistical Association titled “Principal Components Regression in Exploratory Statistical Research” [1]. Massy’s problem was a familiar one in the social sciences: he had more predictor variables than he could handle, and many of them were correlated with each other. His solution was straightforward — first reduce the dimensionality of the predictor space using Principal Component Analysis, then regress the response on the resulting scores. The method combined two existing tools (PCA, dating back to Pearson in 1901 and Hotelling in 1933 [2], and ordinary least squares regression) into a single two-step procedure that elegantly sidestepped the multicollinearity problem.

The idea did not gain immediate traction in the natural sciences. It took nearly two decades, and the explosive growth of near-infrared (NIR) spectroscopy in the 1980s, for PCR to become a standard tool. NIR instruments produced spectra with hundreds or thousands of wavelengths — far more variables than samples — and the wavelengths were heavily correlated because neighboring regions of a spectrum carry overlapping chemical information. Classical regression was simply impossible. PCR, by compressing the spectral matrix into a handful of orthogonal scores before regressing, offered a clean path forward. Papers by Haaland and Thomas (1988) [3] and by Næs and Martens (1988) [4] established PCR as one of the two pillars of multivariate calibration, alongside the emerging Partial Least Squares (PLS) method.

The debate between PCR and PLS became one of the defining controversies in chemometrics during the late 1980s and 1990s. Ian Jolliffe pointed out in 1982 [5] that the principal components explaining the most variance in X are not necessarily the ones most relevant for predicting y, a limitation that PLS addresses by using y information when constructing its components. Yet PCR has never disappeared. Its simplicity, mathematical transparency, and the fact that PCA is already routinely applied to spectroscopic data for exploration make PCR a natural first step before PLS. Understanding PCR is essential for understanding PLS, and in many practical situations, PCR and PLS produce nearly identical results.

The curse of dimensionality

A NIR spectrum of a pharmaceutical tablet might be recorded at 1000 wavelengths. You want to predict the active ingredient concentration from the spectrum. You have 150 calibration samples. The spectral data matrix X is therefore 150 × 1000: far more columns than rows.

In this situation, ordinary least squares (OLS) regression fails completely. To find the regression coefficients b = (Xᵀ X)⁻¹ Xᵀ y, you need to invert the matrix Xᵀ X. But this 1000 × 1000 matrix has rank at most 150 (the number of samples), which means it is singular: it has no inverse. There are infinitely many coefficient vectors that fit the training data perfectly, and most of them are useless for prediction.

Even when you have more samples than wavelengths, spectroscopic data presents a severe problem: multicollinearity. Neighboring wavelengths in a spectrum are highly correlated because they arise from the same molecular vibrations. The matrix Xᵀ X is technically invertible but nearly singular, meaning tiny changes in the data produce wildly different coefficient estimates. The resulting model is unstable and generalizes poorly.

Three broad strategies have been developed to handle this:

  1. Regularization: Keep all variables but constrain the coefficients (ridge regression, LASSO)
  2. Latent variable methods: Compress the data into a few underlying factors before regressing (PCR, PLS)
  3. Variable selection: Choose a subset of variables and discard the rest (stepwise regression, LASSO)

PCR takes the second approach: compress first, then regress.
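Both failure modes are easy to reproduce in a few lines of NumPy (the dimensions mirror the tablet example above; the random matrices are stand-ins for real spectra):

```python
import numpy as np

rng = np.random.default_rng(0)

# Wide case: 150 samples, 1000 wavelengths -> X'X is 1000 x 1000
# but has rank at most 150, so it is singular
X_wide = rng.standard_normal((150, 1000))
rank = np.linalg.matrix_rank(X_wide.T @ X_wide)
print(f"rank of X'X: {rank} out of 1000")

# Collinear case: two nearly identical columns -> X'X is invertible
# in principle but its condition number is enormous
t = rng.standard_normal(200)
X_coll = np.column_stack([t, t + 1e-6 * rng.standard_normal(200)])
cond = np.linalg.cond(X_coll.T @ X_coll)
print(f"condition number of X'X: {cond:.2e}")
```

With a condition number this large, coefficient estimates move wildly under tiny perturbations of the data, which is exactly the instability described above.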

The PCR strategy

The idea behind PCR is remarkably simple: take your wide, correlated spectral matrix and replace it with a small set of uncorrelated scores from PCA, then perform ordinary regression on those scores.

  1. Center (and optionally scale) the data

    Subtract the column means from X so each variable has zero mean. For spectral data, mean-centering is almost always sufficient. Scaling to unit variance is more common when variables have different units.

  2. Perform PCA on X

    Decompose the centered X into scores and loadings:

        X = T Pᵀ + E

    where T (n × A) is the scores matrix, P (p × A) is the loadings matrix, A is the number of retained components, and E is the residual matrix.

  3. Regress y on the scores T

    Since the columns of T are orthogonal, OLS regression on the scores is trivially well-conditioned:

        q = (Tᵀ T)⁻¹ Tᵀ y

    Because Tᵀ T is diagonal (the scores are orthogonal), the inverse is just the reciprocal of each diagonal element, and each regression coefficient is computed independently:

        qₐ = (tₐᵀ y) / (tₐᵀ tₐ),   a = 1, …, A

  4. Back-transform to the original variables (if needed)

    The PCR model in the original variable space is:

        b = P q

    This gives you a coefficient vector in the original wavelength space, which you can plot and interpret.

The elegance of this approach is that step 3 is completely free of multicollinearity. The scores are orthogonal by construction, so Tᵀ T is always invertible regardless of how many wavelengths you started with or how correlated they were. All the ill-conditioning is handled in step 2, where PCA extracts the dominant patterns and discards the noise.
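The four steps can be sketched in a few lines of NumPy (pcr_fit and pcr_predict are invented names for this sketch, not library functions):

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Minimal PCR: center, PCA via SVD, regress y on the scores,
    then fold the coefficients back to the original variables."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Step 2: PCA via SVD (loadings = right singular vectors)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T            # loadings, p x A
    T = Xc @ P                         # scores,   n x A
    # Step 3: T has orthogonal columns, so T'T is diagonal and
    # each coefficient is computed independently
    q = (T.T @ yc) / np.sum(T ** 2, axis=0)
    # Step 4: back-transform to the original wavelength space
    return P @ q, x_mean, y_mean

def pcr_predict(X_new, b, x_mean, y_mean):
    return (X_new - x_mean) @ b + y_mean
```

When the number of components equals the rank of the centered X, this reduces to ordinary least squares; the dimension reduction only takes effect for smaller A.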

Choosing the number of components

The number of retained principal components, A, is the single most important tuning parameter in PCR. It controls the bias-variance tradeoff:

  • Too few components: The model discards important spectral information. The predictions are biased (systematic error) because the model is too simple to capture the full relationship between spectra and concentrations.
  • Too many components: The model includes components that describe noise or spectral variation unrelated to the analyte. The predictions overfit the calibration data and generalize poorly to new samples.

The standard approach is cross-validation, typically leave-one-out (LOO) or K-fold:

  1. For each candidate number of components (A = 1, 2, …, A_max)

  2. For each fold in the cross-validation

    Remove one sample (or group of samples), fit PCR on the remaining data with A components, and predict the held-out sample.

  3. Compute the Root Mean Squared Error of Cross-Validation (RMSECV)

        RMSECV(A) = √( (1/n) Σᵢ (yᵢ − ŷ(i))² )

    where ŷ(i) is the prediction for sample i when it was excluded from the calibration.

  4. Plot RMSECV vs. number of components and select the A that minimizes the error (or the smallest A after which the error stops decreasing meaningfully).

A typical RMSECV plot shows a rapid decrease for the first few components (as the model captures the main spectral-concentration relationship), reaches a minimum, and then either plateaus or increases slightly (as noise is incorporated). The optimal number of components is usually between 2 and 15 for typical spectroscopic calibrations.

Mathematical formulation

Let us lay out the full PCR formulation in matrix terms. Given:

  • X (n × p): centered spectral matrix, n samples, p wavelengths
  • y (n × 1): centered response vector

Step 1: PCA decomposition

Compute the eigendecomposition of the covariance matrix:

    C = (1/(n−1)) Xᵀ X = P Λ Pᵀ

where P contains the eigenvectors (loadings) and Λ contains the eigenvalues in decreasing order. Equivalently, compute the SVD X = U S Vᵀ, where P = V.

Step 2: Project onto the first A components

Retain only the first A columns of P, call them P_A:

    P_A = [p₁, p₂, …, p_A]   (p × A)

The scores are T = X P_A (n × A), with orthogonal columns.

Step 3: Regress y on the scores

Since Tᵀ T = (n−1) Λ_A is diagonal, the regression is trivial:

    q = (Tᵀ T)⁻¹ Tᵀ y

Step 4: Express in original variables

The regression coefficients in the original wavelength space are:

    b = P_A q

Predictions for new spectra are:

    ŷ_new = (x_new − x̄)ᵀ b + ȳ

where x̄ and ȳ are the calibration means used for centering.
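The formulation can be verified numerically (a small sketch with invented dimensions): the eigendecomposition and SVD routes give the same loadings up to column sign, and Tᵀ T comes out diagonal as claimed:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, A = 60, 20, 3
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                      # centered, as assumed above

# Route 1: eigenvectors of the covariance matrix, sorted descending
evals, evecs = np.linalg.eigh(X.T @ X / (n - 1))
P_A = evecs[:, np.argsort(evals)[::-1][:A]]

# Route 2: right singular vectors of X
V_A = np.linalg.svd(X, full_matrices=False)[2][:A].T

# The two loading sets agree up to sign: |cosine| per column is 1
agree = np.abs(np.sum(P_A * V_A, axis=0))
print(np.round(agree, 6))

# Scores have orthogonal columns, so T'T is diagonal
T = X @ P_A
off_diag = T.T @ T - np.diag(np.diag(T.T @ T))
print(np.max(np.abs(off_diag)))
```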

The PCR limitation

PCR has one fundamental weakness that has been debated since the method was first proposed: PCA ignores the response variable when constructing the components.

PCA finds directions of maximum variance in X. These are the directions along which the spectra vary the most. But there is no guarantee that the directions of maximum spectral variation are the same directions that are most correlated with the analyte concentration y.

Consider a concrete example. Suppose you are measuring the protein content of wheat samples by NIR. The first principal component might capture variation due to moisture (which causes large, broad spectral changes across many wavelengths). The second might capture variation due to particle size (scattering effects). The protein-related spectral changes, being subtle and localized, might not appear until the 5th or 6th component.

If you build a PCR model with only 2 components, you would be using moisture and particle size information to predict protein — which is nearly useless. You would need to include more components to reach the ones that actually carry protein information. But those later components also carry noise, degrading the model.

This is the scenario where PCR requires more components than PLS to achieve the same predictive performance. PLS avoids this by constructing components that maximize the covariance between X and y, ensuring that the most predictive directions are captured first.
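A small simulation makes the ordering problem concrete. The three factors, their variances, and the wavelength regions below are invented for illustration, loosely following the wheat example: the analyte ("protein") is the smallest source of spectral variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
n, p = 200, 100
dirs = np.zeros((3, p))
dirs[0, :50] = 1 / np.sqrt(50)    # "moisture": broad feature, large variance
dirs[1, 50:70] = 1 / np.sqrt(20)  # "particle size": medium variance
dirs[2, 70:75] = 1 / np.sqrt(5)   # "protein": narrow, subtle feature
scales = np.array([5.0, 2.0, 0.5])
factors = rng.standard_normal((n, 3)) * scales
X = factors @ dirs + 0.02 * rng.standard_normal((n, p))
y = factors[:, 2]                  # analyte concentration

scores = PCA(n_components=3).fit_transform(X)
corrs = np.abs([np.corrcoef(y, scores[:, a])[0, 1] for a in range(3)])
print("|corr(y, PC score)| per component:", np.round(corrs, 2))
```

The analyte should correlate strongly with the third score and essentially not at all with the first two, so a 2-component PCR model built on these data would predict nothing useful.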

When does PCR fail?

PCR performs poorly when:

  • The analyte of interest produces small spectral changes relative to other sources of variation (e.g., trace analysis in complex matrices)
  • The spectral variance is dominated by interferences that are uncorrelated with the analyte
  • An early, high-variance principal component explains a large amount of spectral variance but is irrelevant for prediction

When does PCR work well?

PCR works well when:

  • The directions of maximum X-variance also happen to be correlated with y (which is often the case in simple systems)
  • You use enough components to capture the predictive variance
  • The analyte is a major constituent whose spectral features dominate the PCA decomposition

In practice, for many routine spectroscopic calibrations, PCR and PLS give very similar results. The cases where PLS clearly wins tend to involve complex matrices with multiple interferences.

PCR vs PLS comparison

This is the central comparison in multivariate calibration. Both methods reduce dimensionality before regression, but they do it differently.

| Property | PCR | PLS |
|---|---|---|
| Component construction | Maximizes variance in X | Maximizes covariance between X and y |
| Uses y information? | No (PCA step ignores y) | Yes (y guides component construction) |
| Components needed | Often more | Often fewer |
| Interpretability | Components have clear geometric meaning (directions of max variance) | Components harder to interpret (compromise between X-variance and y-correlation) |
| Mathematical transparency | Very clear (PCA + OLS) | More complex algorithms (NIPALS, SIMPLS) |
| Orthogonal scores? | Yes (always) | Yes (in the X-block) |
| When X-variance aligns with y | Equivalent to PLS | Equivalent to PCR |
| When X-variance misaligns with y | Needs more components, may overfit | Finds predictive directions first |
| Computational cost | Cheap (a single SVD) | Iterative algorithms, slightly more expensive |

A useful mental model

Think of PCR and PLS as two strategies for navigating a high-dimensional space:

  • PCR says: “Let me find the most important directions in the spectral landscape first (PCA), and then check which of those directions are useful for predicting the analyte.”

  • PLS says: “Let me find directions in the spectral landscape that are simultaneously important for describing spectra AND for predicting the analyte.”

When the spectral landscape and the prediction landscape are aligned (the biggest source of spectral variation is also the analyte), both strategies find the same directions. When they are misaligned, PLS is more efficient because it goes directly toward the prediction-relevant directions.

Code implementation

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt

# Simulate NIR calibration data: 150 samples, 500 wavelengths
np.random.seed(42)
n_samples, n_wavelengths = 150, 500
wavelengths = np.linspace(900, 1700, n_wavelengths)  # NIR range (nm)

# Simulate spectral data with structure
# Factor 1: large variance, correlated with analyte
factor1 = np.random.randn(n_samples, 1)
loading1 = np.exp(-((wavelengths - 1200)**2) / 5000)
# Factor 2: medium variance, partially correlated with analyte
factor2 = np.random.randn(n_samples, 1)
loading2 = np.exp(-((wavelengths - 1400)**2) / 3000)
# Factor 3: small variance, uncorrelated with analyte
factor3 = np.random.randn(n_samples, 1) * 0.5
loading3 = np.exp(-((wavelengths - 1100)**2) / 8000)

X = (factor1 @ loading1.reshape(1, -1)
     + factor2 @ loading2.reshape(1, -1)
     + factor3 @ loading3.reshape(1, -1))
X += np.random.randn(n_samples, n_wavelengths) * 0.05  # Noise

# Response: depends on factors 1 and 2
y = 3.0 * factor1.ravel() + 1.5 * factor2.ravel() + np.random.randn(n_samples) * 0.3

# Mean-center
X_centered = X - X.mean(axis=0)
y_centered = y - y.mean()

# --- PCR with cross-validation to choose number of components ---
max_components = 30
rmsecv = []
kf = KFold(n_splits=10, shuffle=True, random_state=42)
for n_comp in range(1, max_components + 1):
    # Build PCR pipeline: PCA + Linear Regression
    pcr = Pipeline([
        ('pca', PCA(n_components=n_comp)),
        ('regression', LinearRegression())
    ])
    # Cross-validated predictions
    y_pred_cv = cross_val_predict(pcr, X_centered, y_centered, cv=kf)
    rmse = np.sqrt(np.mean((y_centered - y_pred_cv)**2))
    rmsecv.append(rmse)

optimal_ncomp = np.argmin(rmsecv) + 1
print(f"Optimal number of components: {optimal_ncomp}")
print(f"Minimum RMSECV: {min(rmsecv):.4f}")

# --- Plot 1: RMSECV vs number of components ---
plt.figure(figsize=(8, 5))
plt.plot(range(1, max_components + 1), rmsecv, 'bo-', linewidth=2)
plt.axvline(optimal_ncomp, color='red', linestyle='--',
            label=f'Optimal = {optimal_ncomp} components')
plt.xlabel('Number of Principal Components')
plt.ylabel('RMSECV')
plt.title('PCR: Cross-Validation for Component Selection')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# --- Fit final model ---
pca = PCA(n_components=optimal_ncomp)
T = pca.fit_transform(X_centered)
reg = LinearRegression()
reg.fit(T, y_centered)

# Back-transform to original wavelength space
beta_pcr = pca.components_.T @ reg.coef_
y_pred = X_centered @ beta_pcr + y.mean()

# --- Plot 2: Predicted vs actual ---
plt.figure(figsize=(6, 6))
plt.scatter(y, y_pred, alpha=0.6, edgecolors='k', linewidths=0.5)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', linewidth=2)
plt.xlabel('Actual concentration')
plt.ylabel('Predicted concentration')
r2 = 1 - np.sum((y - y_pred)**2) / np.sum((y - y.mean())**2)
plt.title(f'PCR Calibration (R² = {r2:.4f}, {optimal_ncomp} components)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# --- Plot 3: Regression coefficients in wavelength space ---
plt.figure(figsize=(10, 4))
plt.plot(wavelengths, beta_pcr, 'b-', linewidth=1.5)
plt.axhline(0, color='gray', linewidth=0.5)
plt.xlabel('Wavelength (nm)')
plt.ylabel('Regression coefficient')
plt.title('PCR Regression Coefficients')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# --- Plot 4: Explained variance by PCA ---
pca_full = PCA(n_components=20)
pca_full.fit(X_centered)
plt.figure(figsize=(8, 5))
plt.bar(range(1, 21), pca_full.explained_variance_ratio_ * 100,
        color='steelblue', edgecolor='black')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance (%)')
plt.title('PCA: Variance Explained by Each Component')
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

When to use PCR in chemistry

Good applications

NIR/MIR calibration of major constituents: when the analyte is a dominant source of spectral variation, PCR performs comparably to PLS.

Exploratory calibration: PCR lets you examine the PCA scores and loadings before building the calibration, giving insight into the spectral structure.

Quality control with stable models: for routine quality control where simplicity and interpretability are valued.

Teaching multivariate calibration: PCR is the clearest way to understand the dimension-reduction approach before learning PLS.

Consider alternatives

Trace analytes in complex matrices: when the analyte signal is buried under larger sources of spectral variation, PLS captures the predictive information more efficiently.

When minimizing components matters: PLS typically needs fewer components, reducing the risk of overfitting.

Process analytical technology (PAT): PLS is the de facto industry standard for real-time pharmaceutical and food applications.

When y-relevant variance is in later PCs: if the first few principal components describe interferences rather than the analyte, PCR needs many components and PLS is more efficient.

Practical tips

Mean-center X and y. This is not optional. Without centering, the first principal component captures the mean spectrum rather than variation, wasting a component.

Do not auto-scale spectral data unless you have a specific reason. Scaling each wavelength to unit variance gives equal weight to noisy baseline regions and informative peak regions. For spectral data, mean-centering alone is almost always the right choice. Auto-scaling is appropriate when variables have genuinely different units (e.g., mixing spectral and physical measurements).

Plot the loadings. The loadings of the retained components tell you which wavelength regions drive the model. If a loading vector shows features at wavelengths with known chemical meaning (e.g., O-H stretches, C-H overtones), the model is using chemically meaningful information. If the loadings look like noise, you may have too many components.

Plot the scores. A scores plot (PC1 vs. PC2) colored by analyte concentration should show a gradient if the first components are predictive. If the gradient only appears on PC5 vs. PC6, you should seriously consider PLS.

Validate on an independent test set. Cross-validation selects the number of components, but final model performance should be evaluated on data that was not involved in any part of the model building — not even in the CV loop. This is the RMSEP (Root Mean Squared Error of Prediction).
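As a sketch of this workflow (with simulated stand-in data and an arbitrary component count), the split happens before any model building:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Invented low-rank "spectra" standing in for a real calibration set
rng = np.random.default_rng(0)
factors = rng.standard_normal((150, 5))
X = factors @ rng.standard_normal((5, 200)) + 0.05 * rng.standard_normal((150, 200))
y = factors @ np.array([1.0, 0.5, -0.5, 0.3, 0.2]) + 0.05 * rng.standard_normal(150)

# Hold out the test set FIRST; component selection (fixed at 5 here
# for brevity) would use cross-validation inside the training set only
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
pcr = Pipeline([('pca', PCA(n_components=5)), ('ols', LinearRegression())])
pcr.fit(X_tr, y_tr)
rmsep = np.sqrt(np.mean((y_te - pcr.predict(X_te)) ** 2))
print(f"RMSEP on held-out test set: {rmsep:.3f}")
```

Because the test samples never touch the PCA, the regression, or the component selection, RMSEP is an honest estimate of performance on new samples.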

Preprocess before PCR. Apply scatter correction (SNV, MSC), smoothing (Savitzky-Golay), and/or derivative transformations before computing PCA. These preprocessing steps remove physical artifacts (scattering, baseline drift) that would otherwise dominate the first principal components and obscure the chemical information.
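A minimal preprocessing chain using SciPy might look like the following sketch (snv and savgol_derivative are invented helper names; the window length and polynomial order are typical but arbitrary choices):

```python
import numpy as np
from scipy.signal import savgol_filter

def snv(X):
    """Standard Normal Variate: center and scale each spectrum (row)."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

def savgol_derivative(X, window=11, polyorder=2, deriv=1):
    """Savitzky-Golay smoothing + derivative along the wavelength axis."""
    return savgol_filter(X, window_length=window, polyorder=polyorder,
                         deriv=deriv, axis=1)

# Typical order: scatter correction first, then derivative, then PCA/PCR
rng = np.random.default_rng(0)
spectra = rng.random((10, 100)) + np.linspace(0, 1, 100)  # baseline drift
pre = savgol_derivative(snv(spectra))
```

SNV removes multiplicative scatter effects per spectrum, and the first derivative removes additive baseline offsets, so the subsequent PCA spends its leading components on chemistry rather than physics.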

Next steps

PCR is the stepping stone to the most important regression method in chemometrics:

  • Partial Least Squares (PLS) regression modifies the PCR strategy by constructing components that maximize the covariance between X and y, rather than the variance of X alone. This single change makes PLS the workhorse of spectroscopic calibration and the most widely used chemometric method in industry [6].

Understanding PCR deeply — how it decomposes the spectral matrix, why component selection matters, and where its limitations lie — is the best preparation for understanding PLS. The conceptual framework is the same; PLS simply adds y-awareness to the component construction step.

References

[1] Massy, W. F. (1965). Principal components regression in exploratory statistical research. Journal of the American Statistical Association, 60(309), 234–256.

[2] Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11), 559–572; Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6), 417–441.

[3] Haaland, D. M., & Thomas, E. V. (1988). Partial least-squares methods for spectral analyses. 1. Relation to other quantitative calibration methods and the extraction of qualitative information. Analytical Chemistry, 60(11), 1193–1202.

[4] Næs, T., & Martens, H. (1988). Principal component regression in NIR analysis: viewpoints, background details and selection of components. Journal of Chemometrics, 2(2), 155–167.

[5] Jolliffe, I. T. (1982). A note on the use of principal components in regression. Journal of the Royal Statistical Society, Series C, 31(3), 300–303.

[6] Wold, S., Sjöström, M., & Eriksson, L. (2001). PLS-regression: a basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems, 58(2), 109–130.

[7] Martens, H., & Næs, T. (1989). Multivariate Calibration. Wiley.

[8] Brereton, R. G. (2003). Chemometrics: Data Analysis for the Laboratory and Chemical Plant. Wiley.