Lecture 17
March 4, 2026
Figure 1: True data and the data-generating curve.
```julia
using Distributions, Optim

function polyfit(d, x, y)
    # Evaluate a degree-d polynomial with coefficients θ at each point in x
    function m(d, θ, x)
        mout = zeros(length(x), d + 1)
        for j in eachindex(x)
            for i = 0:d
                mout[j, i + 1] = θ[i + 1] * x[j]^i
            end
        end
        return sum(mout; dims=2)
    end
    # Parameters: d+1 polynomial coefficients plus the noise standard deviation
    θ₀ = [zeros(d + 1); 1.0]
    lb = [-10.0 .+ zeros(d + 1); 0.01]
    ub = [10.0 .+ zeros(d + 1); 20.0]
    # Maximize the Gaussian log-likelihood (minimize its negative) subject to the bounds
    optim_out = optimize(θ -> -sum(logpdf.(Normal.(m(d, θ[1:end-1], x), θ[end]), y)), lb, ub, θ₀)
    θmin = optim_out.minimizer
    mfit(x) = sum([θmin[i + 1] * x^i for i in 0:d])
    return (mfit, θmin[end])
end
```
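Because the likelihood above is Gaussian with a constant variance, the maximum-likelihood coefficients coincide with ordinary least squares, so a Vandermonde solve gives the same fit in closed form. A minimal cross-check sketch on synthetic data (the quadratic, noise level, and sample size here are illustrative assumptions, not the lecture's dataset):

```julia
using Random, Statistics
Random.seed!(1)

# Synthetic stand-ins for the lecture's x, y (an assumption): a noisy quadratic
d = 2
x = collect(range(-2, 2; length=40))
y = 1 .- 2 .* x .+ 0.5 .* x.^2 .+ 0.3 .* randn(length(x))

X = [xi^j for xi in x, j in 0:d]      # Vandermonde design matrix
θls = X \ y                           # least squares = Gaussian MLE of the coefficients
σmle = sqrt(mean((X * θls .- y).^2))  # MLE of the noise std (no degrees-of-freedom correction)
```

The optimizer-based `polyfit` should recover essentially the same coefficients and σ, up to numerical tolerance.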
```julia
using Plots, LaTeXStrings

function plot_polyfit(d, x, y)
    m, σ = polyfit(d, x, y)
    p = scatter(x, y, label="Data", markersize=5, ylabel=L"$y$", xlabel=L"$x$", title="Degree $d")
    # xrange is the plotting grid defined earlier in the lecture
    plot!(p, xrange, m.(xrange), ribbon=1.96 * σ, fillalpha=0.2, lw=3, label="Fit")
    ylims!(p, (-30, 15))
    plot!(p, size=(600, 450))
    return p
end
```
```julia
p1 = plot_polyfit(1, x, y)
p2 = plot_polyfit(2, x, y)
display(p1)
display(p2)
```

We can think of a model as a form of data compression.
Instead of storing coordinates of individual points, project onto parameters of functional form.
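The compression view can be made concrete: store a handful of coefficients rather than every observation, and reconstruct the data up to the noise level. A small sketch on synthetic data (the cubic signal and noise level are illustrative assumptions):

```julia
using Random
Random.seed!(2)

# 200 points (400 stored numbers) from a cubic signal plus noise
n = 200
x = collect(range(-2, 2; length=n))
y = 0.5 .* x.^3 .- x .+ 0.4 .* randn(n)

# "Compress" to the 4 coefficients of a degree-3 fit
X = [xi^j for xi in x, j in 0:3]
θ = X \ y

# Reconstructing from 4 numbers instead of 200 is accurate up to the noise level
recon_rmse = sqrt(sum((X * θ .- y).^2) / n)
```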
The degree to which we can “tune” the model by adjusting its parameters is called the model degrees of freedom (DOF), which is one measure of model complexity.
Higher DOF ⇒ more ability to represent complex patterns.
If DOF is too low, the model can’t capture meaningful data-generating signals (underfitting).
But if DOF is too high, the model will “learn” the noise rather than the signal, resulting in poor generalization (overfitting).
```julia
# Generate a held-out test set from the same data-generating process
ntest = 20
xtest = rand(Uniform(-2, 2), ntest)
ytest = f.(xtest) .+ rand(Normal(0, 2), length(xtest))

in_error = zeros(11)
out_error = zeros(11)
# Calculate the sum of squared errors for each polynomial degree
for d = 0:10
    m, σ = polyfit(d, x, y)
    in_error[d+1] = sum((m.(x) .- y).^2)
    out_error[d+1] = sum((m.(xtest) .- ytest).^2)
end
```
```julia
plot(0:10, in_error / length(y), marker=(:circle, :blue, 8), line=(:blue, 3),
     label="In-Sample Error", xlabel="Polynomial Degree",
     ylabel="Mean Squared Error", legend=:topleft)
plot!(0:10, out_error / length(ytest), marker=(:circle, :red, 8), line=(:red, 3),
     label="Out-of-Sample Error")
plot!(yaxis=:log)
plot!(xticks=0:1:10)
```

Figure 7: Impact of increasing model complexity on in-sample and out-of-sample error.
Example from The Signal and the Noise by Nate Silver:

Suppose we use \(\hat{f}(x)\) to predict data \(Y = f(x) + \varepsilon\) generated from a “true” regression model \(f(x)\):
\[ \begin{aligned} (Y - \hat{f}(x))^2 &= (Y - f(x) + f(x) - \hat{f}(x))^2 \\ &= (Y - f(x))^2 + 2(Y - f(x))(f(x) - \hat{f}(x)) + (f(x) - \hat{f}(x))^2. \end{aligned} \]
Taking expectations, the cross term vanishes because \(\mathbb{E}[Y - f(x)] = \mathbb{E}[\varepsilon] = 0\), leaving (for a fixed \(\hat{f}\))
\[ \begin{aligned} \text{MSE}(\hat{f}(x)) &= \mathbb{V}(\varepsilon) + \mathbb{E}\left[(f(x) - \hat{f}(x))^2\right] \\ &= \mathbb{V}(\varepsilon) + \text{Bias}(\hat{f})^2. \end{aligned} \]
But \(\hat{f}(x)\) is itself a realization of a random model \(\hat{M}_n(x)\) that depends on the training data.
\[ \begin{aligned} \text{MSE}(\hat{M}_n(x)) &= \mathbb{E}\left[(Y - \hat{M}_n(X))^2 \mid X = x\right] \\ &= \mathbb{E}\left[\mathbb{E}\left[\left(Y - \hat{M}_n(X)\right)^2 \mid X = x, \hat{M}_n = \hat{f} \right] \mid X = x\right] \\ &= \mathbb{E}\left[\mathbb{V}\left[\varepsilon\right] + \left(f(x) - \hat{M}_n(x)\right)^2 \mid X = x\right] \\ &= \mathbb{V}\left[\varepsilon\right] + \mathbb{E}\left[\left(f(x) - \hat{M}_n(x)\right)^2 \mid X = x \right] \end{aligned} \]
\[\begin{aligned} \text{MSE}(\hat{M}_n(x)) &= \mathbb{V}\left[\varepsilon\right] + \mathbb{E}\left[\left(f(x) - \hat{M}_n(x)\right)^2 \mid X = x\right] \\ &= \mathbb{V}\left[\varepsilon\right] + \\ & \qquad \mathbb{E}\left[\left(f(x) - \mathbb{E}\left[\hat{M}_n(x)\right] + \mathbb{E}\left[\hat{M}_n(x)\right] - \hat{M}_n(x)\right)^2\right] \\ &= \mathbb{V}\left[\varepsilon\right] + \left(f(x) - \mathbb{E}\left[\hat{M}_n(x)\right]\right)^2 + \mathbb{V}\left[\hat{M}_n(x)\right] \end{aligned}\]
Therefore the MSE consists of irreducible noise \(\mathbb{V}[\varepsilon]\), squared bias, and variance:
Bias is error from a systematic mismatch between the model’s average prediction and the true function (\(\text{Bias}[\hat{f}(x)] = \mathbb{E}[\hat{f}(x)] - f(x)\)).
Bias comes from the approximation error.
High bias indicates under-fitting: the model misses meaningful relationships between inputs and outputs.
Variance is error from over-sensitivity to small fluctuations in the training data \(D\) (\(\text{Variance} = \text{Var}_D(\hat{f}(x; D))\)).
Variance can come from over-fitting noise in the data.
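The decomposition above can be checked by simulation: repeatedly draw training sets, refit the model, and compare the directly estimated MSE at a point to noise + bias² + variance. A Monte Carlo sketch using a linear fit to a nonlinear truth (the function, noise level, and sample sizes are illustrative assumptions):

```julia
using Random, Statistics
Random.seed!(42)

f(x) = sin(2x)             # assumed "true" regression function (illustrative)
σε = 0.5                   # noise standard deviation
x0 = 0.7                   # evaluation point
nrep, n = 2000, 30         # Monte Carlo replications and training-set size

# Each replication: draw a training set, fit a line by least squares,
# and record the fitted model's prediction at x0
preds = map(1:nrep) do _
    x = 4 .* rand(n) .- 2
    y = f.(x) .+ σε .* randn(n)
    β = [ones(n) x] \ y
    β[1] + β[2] * x0
end

bias2 = (mean(preds) - f(x0))^2    # (f(x₀) - E[M̂ₙ(x₀)])²
var_m = var(preds)                 # V[M̂ₙ(x₀)]
mse_decomp = σε^2 + bias2 + var_m

# Direct estimate: prediction error against fresh draws of Y at x0
mse_direct = mean((f(x0) .+ σε .* randn(nrep) .- preds).^2)
```

Here the line badly under-fits \(\sin(2x)\), so the squared bias dominates, and `mse_direct` agrees with `mse_decomp` up to Monte Carlo error.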
This means that past a certain point, you can only reduce bias (by increasing model complexity) at the cost of increasing variance.
This is the so-called “bias-variance tradeoff.”
This tradeoff does not have to be one-to-one: sometimes adding bias can reduce total error, if it reduces variance by more than it adds approximation error.
This decomposition is for MSE, but the principle holds more generally.
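A classic instance of deliberately trading bias for variance is ridge regression, which shrinks coefficients toward zero: the shrinkage adds bias but can cut variance enough to lower total out-of-sample error. A sketch comparing least squares and ridge on an over-parameterized polynomial fit (the function, sample size, degree, and penalty λ are all illustrative assumptions):

```julia
using LinearAlgebra, Random, Statistics
Random.seed!(0)

f(x) = sin(2x)                             # assumed "true" function (illustrative)
σ, n, d, λ = 0.5, 12, 9, 0.1               # noise, sample size, degree, ridge penalty
poly(x, d) = [xi^j for xi in x, j in 0:d]  # Vandermonde features

xtest = collect(range(-1, 1; length=50))
Xtest = poly(xtest, d)

# For each simulated training set, compare test error of least squares vs ridge
errs = map(1:300) do _
    x = 2 .* rand(n) .- 1
    y = f.(x) .+ σ .* randn(n)
    X = poly(x, d)
    βols = X \ y                           # low bias, but huge variance (n ≈ #params)
    βrdg = (X' * X + λ * I) \ (X' * y)     # shrunk: biased, but far lower variance
    (mean((Xtest * βols .- f.(xtest)).^2), mean((Xtest * βrdg .- f.(xtest)).^2))
end
mse_ols   = mean(first.(errs))
mse_ridge = mean(last.(errs))
```

With only 12 observations for 10 coefficients, the unpenalized fit chases noise, and the biased ridge estimate wins on average test error.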

Source: Wikipedia