Lecture 18
March 6, 2026
Commonly discussed as the “bias-variance tradeoff”: it is more valuable to think of bias and variance as contributors to total error than as quantities traded directly against each other.
Potential to overfit vs. underfit isn’t directly related to standard metrics of “model complexity” (number of parameters, etc).
Instead, think of degrees of freedom: how much flexibility is there given the model parameterization to “chase” reduced model error?
Can reduce degrees of freedom with regularization (shrinkage of estimates (e.g. LASSO) vs. “raw” MLE) or dimensionality reduction (model simplification).
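As a sketch of how shrinkage reduces effective flexibility, consider ridge-penalized least squares on a polynomial basis (ridge is used here instead of LASSO because it has a closed form; `poly_design` and `ridge_fit` are hypothetical helper names, and the toy data mirror the cubic example used later):

```julia
using LinearAlgebra, Random

# polynomial design matrix with columns x^0, ..., x^d
poly_design(x, d) = [xi^p for xi in x, p in 0:d]

# closed-form ridge estimate: θ̂ = (XᵀX + λI)⁻¹ Xᵀy
ridge_fit(X, y, λ) = (X' * X + λ * I) \ (X' * y)

Random.seed!(1)
x = 4 .* rand(20) .- 2                           # x ∈ [-2, 2]
y = x.^3 .- 5 .* x.^2 .+ 1 .+ 2 .* randn(20)     # cubic signal + noise

X = poly_design(x, 10)                # deliberately over-flexible basis
θ_mle = ridge_fit(X, y, 0.0)          # unpenalized least squares
θ_reg = ridge_fit(X, y, 10.0)         # shrunk toward zero

# shrinkage: the penalized coefficient vector has smaller norm,
# i.e. less room to "chase" reduced in-sample error
norm(θ_reg) < norm(θ_mle)
```

The penalty does not change the number of parameters, only how far they can move, which is the degrees-of-freedom point above.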

Source: Richard McElreath
Effectively, no. Why?
The goal is then to minimize the generalization (expected) error:
\[\mathbb{E}\left[L(X, \theta)\right] = \int_X L(x, \theta) \pi(x)dx\]
where \(L(x, \theta)\) is an error function capturing the discrepancy between \(\hat{f}(x, \theta)\) and \(y\).
Since we don’t know the “true” distribution of \(y\), we usually try to approximate it using the training data:
\[\hat{L} = \min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n L(x_i, \theta)\]
But: This is minimizing in-sample error and is likely to result in an optimistic score.
Instead, let’s divide our data into a training dataset \(y_k\) and testing dataset \(\tilde{y}_l\).
Training Data MSE: 19899.2
Testing Data MSE: 7498.4
What if we repeated this procedure for multiple held-out sets?
If data are large, this is a good approximation.
Why do we want to randomly divide the data and test over all splits?
# load packages used throughout: Distributions (Normal, Uniform), Optim (MLE), Plots, StatsBase (sample)
using Random, Statistics, Distributions, Optim, Plots, StatsBase

# create data set
Random.seed!(1)
ntrain = 20
x = rand(Uniform(-2, 2), ntrain)
f(x) = x.^3 .- 5x.^2 .+ 1
y = f(x) + rand(Normal(0, 2), length(x))

# function to fit a degree-d polynomial model by maximum likelihood
function polyfit(d, x, y)
    # evaluate the polynomial with coefficients θ at each element of x
    function m(d, θ, x)
        mout = zeros(length(x), d + 1)
        for j in eachindex(x)
            for i = 0:d
                mout[j, i + 1] = θ[i + 1] * x[j]^i
            end
        end
        return sum(mout; dims=2)
    end
    θ₀ = [zeros(d + 1); 1.0]             # initial coefficients, plus σ as the last entry
    lb = [-10.0 .+ zeros(d + 1); 0.01]   # lower bounds (σ > 0)
    ub = [10.0 .+ zeros(d + 1); 20.0]    # upper bounds
    optim_out = optimize(θ -> -sum(logpdf.(Normal.(m(d, θ[1:end-1], x), θ[end]), y)), lb, ub, θ₀)
    θmin = optim_out.minimizer
    mfit(x) = sum([θmin[i + 1] * x^i for i in 0:d])
    return (mfit, θmin[end])
end
# create out-of-sample data for testing generalizability
ntest = 20
xtest = rand(Uniform(-2, 2), ntest)
ytest = f(xtest) + rand(Normal(0, 2), length(xtest))
# calculate cross-validation fits and scores
function cv_poly(d, x, y, nfolds=5)
    # split data into folds (assumes length(x) is divisible by nfolds)
    labels = repeat(1:nfolds, outer=Int(length(x) / nfolds))
    fold_labels = shuffle(labels) # randomly permute the labels
    # do C-V
    cv_error = zeros(nfolds)
    for fold in 1:nfolds
        test_idx = findall(fold_labels .== fold)
        train_idx = setdiff(1:length(x), test_idx)
        m, σ = polyfit(d, x[train_idx], y[train_idx])
        cv_error[fold] = mean((m.(x[test_idx]) .- y[test_idx]).^2)
    end
    return cv_error
end
in_error = zeros(11)
out_error = zeros(11)
cv_error = zeros(11)
nfold = 4
# calculate mean squared errors in-sample, out-of-sample, and via C-V
for d = 0:10
    m, σ = polyfit(d, x, y)
    in_error[d+1] = mean((m.(x) .- y).^2)
    out_error[d+1] = mean((m.(xtest) .- ytest).^2)
    cv_error[d+1] = mean(cv_poly(d, x, y, nfold))
end
plot(0:10, in_error, marker=(:circle, :blue, 8), line=(:blue, 3), label="In-Sample Error", xlabel="Polynomial Degree", ylabel="Mean Squared Error", legend=:top)
plot!(0:10, out_error, marker=(:circle, :red, 8), line=(:red, 3), label="Out-of-Sample Error")
plot!(0:10, cv_error, marker=(:circle, :purple, 8), line=(:purple, 3), label="C-V Error")
plot!(yaxis=:log)
plot!(xticks=0:1:10)
plot!(size=(900, 450))

Figure 3: Impact of increasing model complexity on model fits
The problem with \(k\)-fold CV when data are scarce: withholding \(n/k\) points leaves less data for fitting, which results in more biased error estimates.
LOO-CV: Set \(k=n\)
The trouble: estimates of \(L\) are highly correlated since any two training sets share \(n-2\) points.
The benefit: LOO-CV approximates seeing “the next datum”.
We focus here on the log score, but any error metric can be used.
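For concreteness, here is a stdlib-only sketch of an average log score on held-out data under a Normal model fit by MLE (`normal_logpdf` is a hand-rolled stand-in for `logpdf` from Distributions; the data are illustrative):

```julia
using Random, Statistics

# log-density of Normal(μ, σ) at y, avoiding the Distributions dependency
normal_logpdf(y, μ, σ) = -0.5 * log(2π * σ^2) - (y - μ)^2 / (2σ^2)

Random.seed!(1)
ytrain = 2 .+ randn(50)
ytest  = 2 .+ randn(20)

# fit a Normal on the training set (std uses n-1; close to the MLE for moderate n)
μ̂, σ̂ = mean(ytrain), std(ytrain)

# average log score of held-out data (higher is better)
logscore = mean(normal_logpdf.(ytest, μ̂, σ̂))
```

The same held-out structure works with MSE or any other error metric; only the scoring function changes.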
Model: \[D \rightarrow S \ {\color{purple}\leftarrow U}\] \[S = f(D, U)\]
Out of Sample:
\(p(y_i | y_{-i})\) = 5.2
In Sample:
\(p(y_{-i} | y_{-i})\) = 5.7
LOO-CV Score: 5.8
This is the average log-likelihood of out-of-sample data.
Drop \(k\) values, refit model on rest of data, check for predictive skill.
As \(k \to n\), this reduces to the prior predictive distribution \[p(y^{\text{rep}}) = \int_{\theta} p(y^{\text{rep}} | \theta) p(\theta) d\theta.\]
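Sampling from that prior predictive integral is just a two-stage Monte Carlo. A minimal stdlib-only sketch, using a hypothetical toy model \(\mu \sim \text{Normal}(0, 5)\), \(y^{\text{rep}} \mid \mu \sim \text{Normal}(\mu, 1)\) (not a model from this lecture):

```julia
using Random, Statistics

Random.seed!(1)
n_draws = 10_000
μ_prior = 5 .* randn(n_draws)       # draws from the prior p(θ)
y_rep = μ_prior .+ randn(n_draws)   # draws from p(y_rep | θ), one per prior draw

# marginalizing analytically, Var(y_rep) = 5² + 1² = 26;
# the Monte Carlo variance should be close to that
var(y_rep)
```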
Every step of model “tuning” must be subject to C-V: this includes any statistical feature selection.
In other words: construct held-out data sets, then do feature selection + model training + evaluation all at once for each fold.
Otherwise the C-V evaluation is biased by the “full knowledge” of which features to use.
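A sketch of what "selection inside the fold" looks like (stdlib only; the toy data, the single-feature correlation-based selection rule, and the one-predictor fit are illustrative assumptions, not the lecture's pipeline):

```julia
using Random, Statistics

Random.seed!(1)
n, p = 40, 10
X = randn(n, p)
y = 2 .* X[:, 3] .+ randn(n)   # only feature 3 actually matters

nfolds = 4
fold = shuffle(repeat(1:nfolds, n ÷ nfolds))
cv_err = zeros(nfolds)
for k in 1:nfolds
    test = findall(fold .== k)
    train = setdiff(1:n, test)
    # feature selection uses ONLY the training fold, never the held-out data
    best = argmax([abs(cor(X[train, jj], y[train])) for jj in 1:p])
    # one-predictor least squares on the training fold
    β = sum(X[train, best] .* y[train]) / sum(X[train, best] .^ 2)
    cv_err[k] = mean((β .* X[test, best] .- y[test]) .^ 2)
end
mean(cv_err)   # honest estimate of generalization error
```

Selecting `best` on the full data before splitting would leak information from every test fold into the model, deflating the C-V error.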
μ = 0.01
ρ = 0.7
σ = 0.5
ar_sd = sqrt(σ^2 / (1 - ρ^2)) # stationary standard deviation of the AR(1) process
T = 100
ts_stat = zeros(T)
for i = 1:T
    if i == 1
        ts_stat[i] = rand(Normal(0, ar_sd)) # draw from the stationary distribution
    else
        ts_stat[i] = ρ * ts_stat[i-1] + rand(Normal(0, σ))
    end
end
p = plot(1:T, ts_stat, marker=:circle, markersize=8, color=:blue, label="Training Data", xlabel="Time", linewidth=3, size=(600, 500))
p1 = deepcopy(p)
test_idx1 = sample(1:T, 5, replace=false)
scatter!(p1, test_idx1, ts_stat[test_idx1], color=:red, markersize=8, label="Testing Data")

Idea: Use the model fit on data \(\{y_1, \ldots, y_i\}\) to predict \(y_{i+1}\).
Can also look at multi-step prediction: predict \(y_{i+h}\) for some forecast horizon \(h\).
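The \(h\)-step-ahead predictive distribution can be approximated by iterating the fitted AR(1) forward with fresh noise at each step (a stdlib-only sketch; `ar1_forecast` is a hypothetical helper, and the mean-zero parameters match the simulation above):

```julia
using Random, Statistics

# simulate the h-step-ahead predictive distribution of a mean-zero AR(1)
function ar1_forecast(y_t, h, ρ, σ; nsims=10_000)
    sims = fill(y_t, nsims)
    for _ in 1:h
        sims = ρ .* sims .+ σ .* randn(nsims)   # propagate one step with fresh noise
    end
    return sims
end

Random.seed!(1)
sims = ar1_forecast(1.0, 5, 0.7, 0.5)
# analytically, E[y_{t+h} | y_t] = ρ^h y_t = 0.7^5 ≈ 0.168
fcast_mean = mean(sims)
ci = quantile(sims, [0.05, 0.95])   # 90% predictive interval
```

Note the interval widens with \(h\) toward the stationary spread, which is why multi-step scores are a more demanding test than one-step-ahead.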
function cv_lfo(dat, m)
    ts_pred = zeros(length(dat) - m, 10_000)
    log_p = zeros(length(dat) - m)
    function ar1_loglik_whitened(θ, dat)
        # might need to include mean or trend parameters as well;
        # subtract trend from data to make this mean-zero in this case
        α, ρ, σ = θ
        T = length(dat)
        ll = 0 # initialize log-likelihood counter
        for i = 1:T
            if i == 1
                # first observation: stationary distribution
                ll += logpdf(Normal(α / (1 - ρ), sqrt(σ^2 / (1 - ρ^2))), dat[i])
            else
                # whitened residual given the previous observation
                r = dat[i] - (α + ρ * dat[i-1])
                ll += logpdf(Normal(0, σ), r)
            end
        end
        return ll
    end
    lb = [-5.0, -0.99, 0.01]
    ub = [5.0, 0.99, 5.0]
    init = [0.0, 0.6, 0.3]
    for i = (m+1):length(dat)
        # fit the AR(1) model to data up through time i-1
        optim_whitened = Optim.optimize(θ -> -ar1_loglik_whitened(θ, dat[1:(i-1)]), lb, ub, init)
        θ = optim_whitened.minimizer
        # get log score of the held-out datum at time i
        log_p[i - m] = -logpdf(Normal(θ[1] + θ[2] * dat[i-1], θ[3]), dat[i])
        # simulate predictions
        ts_pred[i - m, :] = θ[1] .+ θ[2] * dat[i-1] .+ rand(Normal(0, θ[3]), 10_000)
    end
    return (log_p, ts_pred)
end
m = 10
log_p, ts_lfo = cv_lfo(ts_stat, m)
p = plot(1:m, ts_stat[1:m], marker=:circle, markersize=8, color=:blue, label="Training Data", xlabel="Time", linewidth=3, size=(600, 500))
pred_quantile = quantile(ts_lfo[1, :], [0.05, 0.95])
scatter!(p, [m+1], [ts_stat[m+1]], yerr=(ts_stat[m+1] .- pred_quantile[1:1], pred_quantile[2:2] .- ts_stat[m+1]), marker=:circle, markersize=8, color=:red, label="Testing Data")

i = m + 1
p = plot(1:i, ts_stat[1:i], marker=:circle, markersize=8, color=:blue, label="Training Data", xlabel="Time", linewidth=3, size=(600, 500))
pred_quantile = quantile(ts_lfo[i - m + 1, :], [0.05, 0.95])
scatter!(p, [i+1], [ts_stat[i+1]], yerr=(ts_stat[i+1] .- pred_quantile[1:1], pred_quantile[2:2] .- ts_stat[i+1]), marker=:circle, markersize=8, color=:red, label="Testing Data")

LFO-CV score: 0.9
Monday: Hypothesis Testing
HW3: Due today.
Quiz 2: Next Friday (3/16)