Cross-Validation


Lecture 18

March 6, 2026

Last Class

Bias vs. Variance

  • Bias is error from mismatches between the model predictions and the data (\(\text{Bias}[\hat{f}] = \mathbb{E}[\hat{f}] - y\)).
  • Variance is error from over-sensitivity to small fluctuations in training inputs \(D\) (\(\text{Variance} = \text{Var}_D[\hat{f}(x; D)]\)).

Commonly discussed in terms of a “bias-variance tradeoff,” but it is more useful to think of these as separate contributors to total error.
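For squared-error loss, these two contributions (plus irreducible noise variance \(\sigma^2\)) add up to the total expected error:

\[\mathbb{E}\left[(y - \hat{f}(x; D))^2\right] = \underbrace{\left(\mathbb{E}[\hat{f}] - y\right)^2}_{\text{Bias}^2} + \underbrace{\text{Var}_D[\hat{f}(x; D)]}_{\text{Variance}} + \sigma^2\]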

Overfitting and Underfitting

  • Underfitting: the model predicts individual data points poorly; high bias (think approximation error).
  • Overfitting: the model generalizes poorly; high variance (think estimation error).

Model Degrees of Freedom

Potential to overfit vs. underfit isn’t directly related to standard metrics of “model complexity” (number of parameters, etc).

Instead, think of degrees of freedom: how much flexibility is there given the model parameterization to “chase” reduced model error?

Regularization

Can reduce degrees of freedom with regularization (shrinkage of estimates, e.g. LASSO, relative to the “raw” MLE) or with dimensionality reduction (model simplification).
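As a sketch of how shrinkage constrains flexibility, here is ridge regression (chosen over LASSO because it has a closed form); the data-generating process and penalty `lam` are hypothetical choices for illustration:

```julia
using LinearAlgebra, Random

Random.seed!(1)
X = randn(50, 3)                          # hypothetical design matrix
y = X * [1.0, 0.0, -2.0] + 0.5 * randn(50)

b_ols = X \ y                             # "raw" least-squares estimate
lam = 5.0                                 # hypothetical penalty strength
b_ridge = (X'X + lam * I) \ (X'y)         # penalized estimate, shrunk toward zero

# shrinkage pulls coefficients toward zero, reducing effective degrees of freedom
shrunk = norm(b_ridge) < norm(b_ols)
```

For any \(\lambda > 0\), the ridge estimate always has smaller norm than the least-squares estimate, which is the sense in which it reduces degrees of freedom.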

Regularization Meme

Source: Richard McElreath

Cross-Validation

Can We Drive Model Error to Zero?

Effectively, no. Why?

  • Inherent noise: even a perfect model wouldn’t perfectly predict observations.
  • Model mis-specification (the cause of bias)
  • Model estimation is never “right” (the cause of variance)

Quantifying Generalization Error

The goal is then to minimize the generalization (expected) error:

\[\mathbb{E}\left[L(X, \theta)\right] = \int_X L(x, \theta) \pi(x)dx\]

where \(L(x, \theta)\) is an error function capturing the discrepancy between \(\hat{f}(x, \theta)\) and \(y\).
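When we can sample from \(\pi(x)\), this integral can be approximated by Monte Carlo. A minimal sketch, where the “true” process \(f\), the fitted model \(\hat{f}\), and the standard-normal \(\pi\) are all hypothetical choices for illustration:

```julia
using Random, Statistics

Random.seed!(1)
f(x) = sin(x)                 # hypothetical "true" process
fhat(x) = x - x^3 / 6         # hypothetical fitted model (cubic sketch)
L(x) = (f(x) - fhat(x))^2     # squared-error discrepancy

# Monte Carlo approximation of E[L(X)] with X ~ pi = N(0, 1)
xs = randn(100_000)
gen_error = mean(L.(xs))
```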

In-Sample Error

Since we don’t know the “true” distribution of \(y\), we usually try to approximate it using the training data:

\[\hat{L} = \min_{\theta \in \Theta} L(x_n, \theta)\]

But: this minimizes in-sample error and is likely to result in an optimistically biased score.

Held Out Data

Instead, let’s divide our data into a training dataset \(y_k\) and testing dataset \(\tilde{y}_l\).

  1. Fit the model to \(y_k\);
  2. Evaluate error on \(\tilde{y}_l\).
Figure 1: Data for CV example.
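A minimal sketch of this split with an ordinary least-squares fit; the data-generating process and the 70/30 split are assumptions for illustration:

```julia
using Random, Statistics

Random.seed!(1)
n = 30
x = 4 .* rand(n) .- 2             # hypothetical inputs on [-2, 2]
y = x.^2 .+ 0.5 .* randn(n)       # hypothetical quadratic signal + noise

idx = shuffle(1:n)                # random 70/30 train/test split
train, test = idx[1:21], idx[22:n]

A = [ones(length(train)) x[train]]  # 1. fit yhat = a + b*x on training set
a, b = A \ y[train]

mse(s) = mean((a .+ b .* x[s] .- y[s]).^2)
train_mse, test_mse = mse(train), mse(test)  # 2. evaluate on both sets
```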

Held Out Data

Figure 2: Data for CV example.

Training Data MSE: 19899.2

Testing Data MSE: 7498.4


\(k\)-Fold Cross-Validation

What if we repeated this procedure for multiple held-out sets?

  1. Randomly split data into \(k\) equally-sized subsets (\(m = n/k\) points each).
  2. For each \(i = 1, \ldots, k\), fit model to \(y_{-i}\) and test on \(y_i\).

If data are large, this is a good approximation.

Random Division of Data

Why do we want to randomly divide the data and test over all splits?

Consistency of C-V

Code
# load required packages
using Random, Distributions, Optim, Plots

# create data set
Random.seed!(1)
ntrain = 20
x = rand(Uniform(-2, 2), ntrain)
f(x) = x.^3 .- 5x.^2 .+ 1
y = f(x) + rand(Normal(0, 2), length(x))

# function to fit polynomial model
function polyfit(d, x, y)
    function m(d, θ, x)
        mout = zeros(length(x), d + 1)
        for j in eachindex(x)
            for i = 0:d
                mout[j, i + 1] = θ[i + 1] * x[j]^i
            end
        end
        return sum(mout; dims=2)
    end
    θ₀ = [zeros(d+1); 1.0]
    lb = [-10.0 .+ zeros(d+1); 0.01]
    ub = [10.0 .+ zeros(d+1); 20.0]
    optim_out = optimize(θ -> -sum(logpdf.(Normal.(m(d, θ[1:end-1], x), θ[end]), y)), lb, ub, θ₀)
    θmin = optim_out.minimizer
    mfit(x) = sum([θmin[i + 1] * x^i for i in 0:d])
    return (mfit, θmin[end])
end

# create out-of-sample data for testing generalizability
ntest = 20
xtest = rand(Uniform(-2, 2), ntest)
ytest = f(xtest) + rand(Normal(0, 2), length(xtest))

# calculate cross-validation fits and scores
function cv_poly(d, x, y, nfolds=5)
    # split data into folds
    labels = repeat(1:nfolds, outer=Int(length(x) / nfolds))
    fold_labels = shuffle(labels) # randomly permute the labels
    # do C-V
    cv_error = zeros(nfolds)
    for fold in 1:nfolds
        test_idx = findall(fold_labels .== fold)
        train_idx = setdiff(1:length(x), test_idx)
        m, σ = polyfit(d, x[train_idx], y[train_idx])
        cv_error[fold] = mean((m.(x[test_idx]) .- y[test_idx]).^2)
    end
    return cv_error
end

in_error = zeros(11)
out_error = zeros(11)
cv_error = zeros(11)
nfold = 4
#calculate sum of squared errors
for d = 0:10
    m, σ = polyfit(d, x, y)
    in_error[d+1] = mean((m.(x) .- y).^2)
    out_error[d+1] = mean((m.(xtest) .- ytest).^2)
    cv_error[d+1] = mean(cv_poly(d, x, y, nfold))
end

plot(0:10, in_error, marker=(:circle, :blue, 8), line=(:blue, 3), label="In-Sample Error", xlabel="Polynomial Degree", ylabel="Mean Squared Error", legend=:top)
plot!(0:10, out_error, marker=(:circle, :red, 8), line=(:red, 3), label="Out-of-Sample Error")
plot!(0:10, cv_error, marker=(:circle, :purple, 8), line=(:purple, 3), label="C-V Error")
plot!(yaxis=:log)
plot!(xticks=0:1:10)
plot!(size=(900, 450))

Figure 3: Impact of increasing model complexity on model fits

Leave-One-Out Cross-Validation (LOO-CV)

The problem with \(k\)-fold CV, when data is scarce, is withholding \(n/k\) points, which results in more biased estimates.

LOO-CV: Set \(k=n\)

The trouble: estimates of \(L\) are highly correlated, since any two training sets share \(n-2\) points.

The benefit: LOO-CV approximates seeing “the next datum”.
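In a Bayesian setting, the out-of-sample density being scored is the posterior predictive for the dropped point:

\[p(y_i | y_{-i}) = \int_{\theta} p(y_i | \theta) p(\theta | y_{-i}) d\theta.\]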

LOO-CV Algorithm

We focus here on the log score, but any error metric can be used.

  1. Drop one value \(y_i\).
  2. Refit model on rest of data \(y_{-i}\).
  3. Predict dropped point \(p(\hat{y}_i | y_{-i})\).
  4. Evaluate score on dropped point (\(-\log p(y_i | y_{-i})\)).
  5. Repeat on rest of data set.
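These steps can be sketched with a trivial Gaussian model, refitting the mean and standard deviation on each \(y_{-i}\); the data are hypothetical, and the Normal log-density is written out to avoid package dependencies:

```julia
using Random, Statistics

Random.seed!(1)
y = 2 .+ randn(25)      # hypothetical data
n = length(y)

# negative log-density of Normal(mu, sigma) at yi
neglogpdf(mu, sigma, yi) = 0.5 * log(2pi * sigma^2) + (yi - mu)^2 / (2 * sigma^2)

scores = zeros(n)
for i in 1:n
    y_rest = y[setdiff(1:n, i)]             # 1. drop y_i
    mu, sigma = mean(y_rest), std(y_rest)   # 2. refit on y_{-i}
    scores[i] = neglogpdf(mu, sigma, y[i])  # 3-4. score the dropped point
end
loo_score = mean(scores)                    # 5. average over all points
```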

LOO-CV Example

Model: \[D \rightarrow S \ {\color{purple}\leftarrow U}\] \[S = f(D, U)\]

Figure 4: Data for CV example.

LOO-CV Flow

  1. Drop one value \(y_i\).
  2. Refit model on \(y_{-i}\).
  3. Predict \(p(\hat{y}_i | y_{-i})\).
  4. Evaluate \(-\log p(y_i | y_{-i})\).
  5. Repeat on rest of data set.
Figure 5: Data for CV example.

LOO-CV Flow

  1. Drop one value \(y_i\).
  2. Refit model on \(y_{-i}\).
  3. Predict \(p(\hat{y}_i | y_{-i})\).
  4. Evaluate \(-\log p(y_i | y_{-i})\).
  5. Repeat on rest of data set.
Figure 6: Data for CV example.

LOO-CV Flow

  1. Drop one value \(y_i\).
  2. Refit model on \(y_{-i}\).
  3. Predict \(p(\hat{y}_i | y_{-i})\).
  4. Evaluate \(-\log p(y_i | y_{-i})\).
  5. Repeat on rest of data set.

Out of Sample:

\(-\log p(y_i | y_{-i})\) = 5.2

In Sample:

\(-\log p(y_{-i} | y_{-i})\) = 5.7

LOO-CV Flow

  1. Drop one value \(y_i\).
  2. Refit model on \(y_{-i}\).
  3. Predict \(p(\hat{y}_i | y_{-i})\).
  4. Evaluate \(-\log p(y_i | y_{-i})\).
  5. Repeat on rest of data set.

LOO-CV Score: 5.8

This is the average negative log-predictive density (log score) of the out-of-sample data.

Leave-\(k\)-Out Cross-Validation

Drop \(k\) values, refit model on rest of data, check for predictive skill.

As \(k \to n\), this reduces to the prior predictive distribution \[p(y^{\text{rep}}) = \int_{\theta} p(y^{\text{rep}} | \theta) p(\theta) d\theta.\]

Challenges with Cross-Validation

  • This can be very computationally expensive!
  • We often don’t have a lot of data for calibration, so holding some back can be a problem.
  • The best score can still be over-optimistic.
  • Can fail if you try to compare too many different models.

C-V and Feature Selection

Every step of model “tuning” must be subject to C-V: this includes any statistical feature selection.

In other words: construct held-out data sets, then do feature selection + model training + evaluation all at once for each fold.

Otherwise the C-V evaluation is biased by the “full knowledge” of which features to use.
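A sketch of this nesting, where the feature screen is recomputed inside each fold from training data only; the data and the correlation-based screen are hypothetical choices for illustration:

```julia
using Random, Statistics

Random.seed!(1)
n, p, k = 40, 5, 4
X = randn(n, p)                # hypothetical candidate features
y = 2 .* X[:, 3] .+ randn(n)   # only feature 3 is truly predictive

folds = shuffle(repeat(1:k, n ÷ k))
cv_err = zeros(k)
for fold in 1:k
    test = findall(folds .== fold)
    train = setdiff(1:n, test)
    # feature selection uses ONLY this fold's training data -- no leakage
    j = argmax([abs(cor(X[train, c], y[train])) for c in 1:p])
    A = [ones(length(train)) X[train, j]]
    a, b = A \ y[train]
    cv_err[fold] = mean((a .+ b .* X[test, j] .- y[test]).^2)
end
cv_score = mean(cv_err)
```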

Cross-Validation for Dependent Data

Challenges with C-V With Structured Data

  • Data structure creates dependencies: randomly held-out points remain correlated with neighboring training points, which leaks information.
  • Example: What if we randomly hold out data in the middle of a time series?
Code
using Distributions, Plots, StatsBase

μ = 0.01
ρ = 0.7
σ = 0.5
ar_var = sqrt(σ^2 / (1 - ρ^2)) # stationary standard deviation
T = 100
ts_stat = zeros(T)
for i = 1:T
    if i == 1
        ts_stat[i] = rand(Normal(0, ar_var))
    else
        ts_stat[i] = ρ * ts_stat[i-1] + rand(Normal(0, σ))
    end
end

p = plot(1:T, ts_stat, marker=:circle, markersize=8, color=:blue, label="Training Data", xlabel="Time", linewidth=3, size=(600, 500))
p1 = deepcopy(p)
test_idx1 = sample(1:T, 5, replace=false)
scatter!(p1, test_idx1, ts_stat[test_idx1], color=:red, markersize=8, label="Testing Data")
Figure 7

TS Approach: Leave-Future-Out C-V

Idea: Use model fit on data \(\{y_1, \ldots, y_i\}\) and predict \(y_{i+1}\)

Can also look at multi-step prediction: predict \(y_{i+h}\) for some forecast horizon \(h\).

LFO-CV Example

  1. Set minimum length \(m\).
  2. Refit model on \(\{y_1, \ldots, y_m\}\).
  3. Predict \(p(\hat{y}_{m+1} | y_{1:m})\).
  4. Evaluate \(-\log p(y_{m+1} | y_{1:m})\).
  5. Increment \(m\) by 1 and repeat until end is hit.
  6. Take mean of individual scores.
Code
function cv_lfo(dat, m)
    ts_pred = zeros(length(dat) - m, 10_000)
    log_p = zeros(length(dat) - m)
    function ar1_loglik_whitened(θ, dat)
        # might need to include mean or trend parameters as well
        # subtract trend from data to make this mean-zero in this case
        α, ρ, σ = θ
        T = length(dat)
        ll = 0 # initialize log-likelihood counter
        for i = 1:T
            if i == 1
                ll += logpdf(Normal(α / (1 - ρ), sqrt(σ^2 / (1 - ρ^2))), dat[i])
            else
                r = dat[i] - (α + ρ * dat[i-1])
                ll += logpdf(Normal(0, σ), r)
            end
        end
        return ll
    end

    lb = [-5.0, -0.99, 0.01]
    ub = [5.0, 0.99, 5]
    init = [0.0, 0.6, 0.3]

    for i = (m+1):length(dat)
        optim_whitened = Optim.optimize(θ -> -ar1_loglik_whitened(θ, dat[1:(i-1)]), lb, ub, init)
        θ = optim_whitened.minimizer
        # get log score of data
        log_p[i - m] = -logpdf(Normal(θ[1] .+ θ[2] * dat[i-1], θ[3]), dat[i])
        # simulate predictions
        ts_pred[i - m, :] = θ[1] .+ θ[2] * dat[i-1] .+ rand(Normal(0, θ[3]), 10_000)

    end
    return (log_p, ts_pred)
end

m = 10
log_p, ts_lfo = cv_lfo(ts_stat, m)
p = plot(1:m, ts_stat[1:m], marker=:circle, markersize=8, color=:blue, label="Training Data", xlabel="Time", linewidth=3, size=(600, 500))
pred_quantile = quantile(ts_lfo[1, :], [0.05, 0.95])
scatter!(p, [m+1], [ts_stat[m+1]], yerr = pred_quantile .- ts_stat[m+1], marker=:circle, markersize=8, color=:red, label="Testing Data")
Figure 8

LFO-CV Example

  1. Set minimum length \(m\).
  2. Refit model on \(\{y_1, \ldots, y_m\}\).
  3. Predict \(p(\hat{y}_{m+1} | y_{1:m})\).
  4. Evaluate \(-\log p(y_{m+1} | y_{1:m})\).
  5. Increment \(m\) by 1 and repeat until end is hit.
  6. Take mean of individual scores.
Code
i = m + 1
p = plot(1:i, ts_stat[1:i], marker=:circle, markersize=8, color=:blue, label="Training Data", xlabel="Time", linewidth=3, size=(600, 500))
pred_quantile = quantile(ts_lfo[i - m + 1, :], [0.05, 0.95])
scatter!(p, [i+1], [ts_stat[i+1]], yerr = pred_quantile .- ts_stat[i+1], marker=:circle, markersize=8, color=:red, label="Testing Data")
Figure 9

LFO-CV Example

  1. Set minimum length \(m\).
  2. Refit model on \(\{y_1, \ldots, y_m\}\).
  3. Predict \(p(\hat{y}_{m+1} | y_{1:m})\).
  4. Evaluate \(-\log p(y_{m+1} | y_{1:m})\).
  5. Increment \(m\) by 1 and repeat until end is hit.
  6. Take mean of individual scores.

LFO-CV score: 0.9

Key Points and Upcoming Schedule

Key Points (Cross-Validation)

  • Hold out data (one or more points) randomly, refit model, and quantify predictive skill.
  • LOO-CV maximizes use of data but can be computationally expensive.
  • Be thoughtful about “blocking” for structured data to avoid information leakage.
  • All parts of statistical modeling must be included in C-V workflow.

Next Classes

Monday: Hypothesis Testing

Assessments

HW3: Due today.

Quiz 2: Next Friday (3/16)
