Model Comparison and Information Criteria


Lecture 20

March 11, 2026

Review of Last Class

Information and Uncertainty

Key Idea: “Information” as reduction in uncertainty from projection.

Entropy: \(H(p) = -\mathbb{E}[\log p] = -\sum_{i=1}^n p_i \log p_i\)

Kullback-Leibler Divergence

Measure of “distance” between “true” distribution \(p\) and surrogate/projection distribution \(q\):

\[D_{KL}(p, q) = \sum_i p_i (\log(p_i) - \log(q_i))\]

Intuitively: What is the surprise from using \(q\) to predict \(p\)?
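To make this concrete, here is a minimal Julia sketch of both quantities for discrete distributions; the helper names `H` and `D_KL` and the probability vectors are made-up for illustration.

Code
# entropy and K-L divergence for discrete distributions given as probability vectors
H(p) = -sum(pᵢ * log(pᵢ) for pᵢ in p if pᵢ > 0)
D_KL(p, q) = sum(pᵢ * (log(pᵢ) - log(qᵢ)) for (pᵢ, qᵢ) in zip(p, q) if pᵢ > 0)

p = [0.5, 0.3, 0.2] # "true" distribution
q = [0.4, 0.4, 0.2] # surrogate/projection
D_KL(p, q) # ≈ 0.025: little surprise, q is close to p
D_KL(p, p) # 0.0: no surprise from using p to predict itself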

Entropy and Log-Score

Suppose we have two projections \(q\) and \(r\) for a “true” data-generating process \(p\):

\[ D_{KL}(p, q) - D_{KL}(p, r) = \sum_i p_i (\log(r_i) - \log(q_i)) \]

The \(p_i \log(p_i)\) entropy terms cancel: comparing two projections only requires their expected log-predictive densities under \(p\), not the entropy of \(p\) itself.
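Using the sketch above, we can check this identity numerically (with `r` another made-up projection):

Code
r = [0.45, 0.35, 0.2] # a second projection
lhs = D_KL(p, q) - D_KL(p, r)
rhs = sum(pᵢ * (log(rᵢ) - log(qᵢ)) for (pᵢ, qᵢ, rᵢ) in zip(p, q, r))
lhs ≈ rhs # true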

Deviance Scale

The log-predictive density (lppd) is a score for which larger is better.

It is common to see this converted to deviance, which is \(-2\,\text{lppd}\):

  • the \(-1\) reorients the scale so smaller is better;
  • the factor of \(2\) is historical (it cancels the \(-1/2\) in the Gaussian log-likelihood).
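To see the cancellation, write out a single Gaussian log-density on the deviance scale:

\[-2 \log p(y \mid \mu, \sigma) = -2 \left(-\frac{(y - \mu)^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)\right) = \frac{(y - \mu)^2}{\sigma^2} + \log(2\pi\sigma^2),\]

so a Gaussian model's deviance is just its sum of squared standardized residuals plus a constant.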

Deviance and Degrees of Freedom

Code
using Distributions, LaTeXStrings, Optim, Plots, Statistics

function calculate_deviance(y_gen, x_gen, y_oos, x_oos; degree=2, reg=Inf)
    # box constraints and starting values: `degree` regression coefficients plus σ
    lb = [-5 .+ zeros(degree); 0.01]
    ub = [5 .+ zeros(degree); 10.0]
    p0 = [zeros(degree); 5.0]

    # Gaussian log-likelihood; parameter vectors whose mean coefficient
    # exceeds the regularization bound `reg` are rejected outright
    function model_loglik(p, y, x, reg)
        if mean(p[1:end-1]) <= reg
            ll = 0.0
            for i = 1:length(y)
                μ = sum(x[i, 1:degree] .* p[1:end-1])
                ll += logpdf(Normal(μ, p[end]), y[i])
            end
        else
            ll = -Inf
        end
        return ll
    end

    # fit the model by maximizing the log-likelihood
    result = optimize(p -> -model_loglik(p, y_gen, x_gen, reg), lb, ub, p0)
    θ = result.minimizer

    # deviance = -2 × log-predictive density at the fitted parameters
    function deviance(p, y_pred, x_pred)
        dev = 0.0
        for i = 1:length(y_pred)
            μ = sum(x_pred[i, 1:degree] .* p[1:end-1])
            dev += -2 * logpdf(Normal(μ, p[end]), y_pred[i])
        end
        return dev
    end

    dev_is = deviance(θ, y_gen, x_gen)   # in-sample
    dev_oos = deviance(θ, y_oos, x_oos)  # out-of-sample
    return (dev_is, dev_oos)
end


N_sim = 10_000

function simulate_deviance(N_sim, N_data; reg=Inf)
    # rows: polynomial degree 1–5; columns: mean/std of in-sample deviance,
    # mean/std of out-of-sample deviance
    dev_stats = zeros(5, 4)
    for d = 1:5
        dev_out = zeros(N_sim, 2)
        for n = 1:N_sim
            # data-generating process: only the first two predictors matter
            x_gen = rand(Uniform(-2, 2), (N_data, 5))
            μ_gen = [0.1 * x_gen[i, 1] - 0.3 * x_gen[i, 2] for i in 1:N_data]
            y_gen = rand.(Normal.(μ_gen, 1))

            # independent out-of-sample draw from the same process
            x_oos = rand(Uniform(-2, 2), (N_data, 5))
            μ_oos = [0.1 * x_oos[i, 1] - 0.3 * x_oos[i, 2] for i in 1:N_data]
            y_oos = rand.(Normal.(μ_oos, 1))

            dev_out[n, :] .= calculate_deviance(y_gen, x_gen, y_oos, x_oos, degree=d, reg=reg)
        end
        dev_stats[d, 1] = mean(dev_out[:, 1])
        dev_stats[d, 2] = std(dev_out[:, 1])
        dev_stats[d, 3] = mean(dev_out[:, 2])
        dev_stats[d, 4] = std(dev_out[:, 2])
    end
    return dev_stats
end

d_20 = simulate_deviance(N_sim, 20)
p1 = scatter(collect(1:5) .- 0.1, d_20[:, 1], yerr=d_20[:, 2], color=:blue,
    linecolor=:blue, lw=2, markersize=5, label="In Sample",
    xlabel="Number of Predictors", ylabel="Deviance", title=L"$N = 20$", grid=true)
scatter!(p1, collect(1:5) .+ 0.1, d_20[:, 3], yerr=d_20[:, 4], lw=2, markersize=5,
    color=:black, linecolor=:black, label="Out of Sample")
plot!(p1, size=(600, 550))

d_100 = simulate_deviance(N_sim, 100)
p2 = scatter(collect(1:5) .- 0.1, d_100[:, 1], yerr=d_100[:, 2], color=:blue,
    linecolor=:blue, lw=2, markersize=5, label="In Sample",
    xlabel="Number of Predictors", ylabel="Deviance", title=L"$N = 100$", grid=true)
scatter!(p2, collect(1:5) .+ 0.1, d_100[:, 3], yerr=d_100[:, 4], lw=2, markersize=5,
    color=:black, linecolor=:black, label="Out of Sample")
plot!(p2, size=(600, 550))

display(p1)
display(p2)
Figure 1: Simulated in-sample vs. out-of-sample deviance by number of predictors for (a) N = 20 and (b) N = 100

Cross-Validation and Over-/Under-Fitting

Cross-validation and related measures cannot detect overfitting when information leaks from the full dataset into the trained model (e.g. if preprocessing or predictor selection was done on the full dataset before splitting).

Information Criteria

Expected Out-Of-Sample Predictive Accuracy

We want to compute the expected out-of-sample log-predictive density (elpd).

What is the challenge?

We don’t know the distribution of new data \(p(\tilde{y})\)!

Expected Out-Of-Sample Predictive Accuracy

We need some measure of the error induced in estimating the elpd using an approximating distribution \(Q\) from some model:

\[\begin{align} Q(\tilde{y}) &= \mathbb{E}_Y \left[p(\tilde{y} \mid \hat{\theta}_\text{MLE})\right] \\ &= \int_Y p(\tilde{y} \mid \hat{\theta}_\text{MLE}) \, p(\tilde{y})\,d\tilde{y}. \end{align}\]

Information Criteria

“Information criteria” refers to a category of estimators of prediction error.

The idea: estimate predictive error using the fitted model.

Information Criteria Overview

There is a common framework for all of these:

\[\widehat{\text{elpd}} = \underbrace{\log p(y | \theta, \mathcal{M})}_{\text{in-sample log-predictive density}} - \underbrace{d(\mathcal{M})}_{\text{penalty for degrees of freedom}}\]

Akaike Information Criterion (AIC)

The “first” information criterion that most people see.

Uses a point estimate (the maximum-likelihood estimate \(\hat{\theta}_\text{MLE}\)) to compute the log-predictive density for the data, corrected by the number of parameters \(k\):

\[\widehat{\text{elpd}}_\text{AIC} = \log p(y | \hat{\theta}_\text{MLE}) - k.\]

AIC Formula

The AIC is defined as \(-2\widehat{\text{elpd}}_\text{AIC}\).

Due to this convention, lower AICs are better (they correspond to higher predictive skill).
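As a minimal sketch of the convention (the helper name `aic` is ours, not from a package):

Code
# AIC from a maximized log-likelihood and parameter count k:
# AIC = -2 * (loglik - k) = 2k - 2 * loglik
aic(loglik, k) = 2k - 2loglik

aic(-702.45, 4) # ≈ 1412.9; matches the time-nonstationary model in the example below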

AIC Correction Term

In the case of a model with a normal sampling distribution, uniform priors, and sample size \(N \gg k\), \(k\) is the asymptotically “correct” bias correction (there are modified corrections for small sample sizes).

However, with more informative priors and/or hierarchical models, the bias correction \(k\) is no longer quite right, as there is less “freedom” associated with each parameter.

AIC and Deviance Example

Code
# AIC = in-sample deviance + 2k, where k = degree + 1 (regression coefficients plus σ)
scatter!(p1, collect(1:5) .+ 0.2, d_20[:, 1] .+ 2 * (2:6), yerr=d_20[:, 2], lw=2,
    markersize=5, color=:red, linecolor=:black, label="AIC")

scatter!(p2, collect(1:5) .+ 0.2, d_100[:, 1] .+ 2 * (2:6), yerr=d_100[:, 2], lw=2,
    markersize=5, color=:red, linecolor=:black, label="AIC")

display(p1)
display(p2)
Figure 2: In-sample deviance, out-of-sample deviance, and AIC by number of predictors for (a) N = 20 and (b) N = 100

AIC: Storm Surge Example

Question: Do climate oscillations (e.g. the Pacific Decadal Oscillation) result in variations in tidal extremes at the San Francisco tide gauge station?

Code
# load SF tide gauge data
# read in data and get annual maxima
using CSV, DataFrames, DataFramesMeta, Dates

function load_data(fname)
    # This uses the DataFramesMeta.jl package, which makes it easy to string together commands to load and process data
    df = @chain fname begin
        CSV.read(DataFrame; header=false)
        rename("Column1" => "year", "Column2" => "month", "Column3" => "day", "Column4" => "hour", "Column5" => "gauge")
        # combine the separate date/time columns into a single DateTime
        @transform :datetime = DateTime.(:year, :month, :day, :hour)
        # replace the -99999 missing-data flag with missing
        @transform :gauge = ifelse.(abs.(:gauge) .>= 9999, missing, :gauge)
        select(:datetime, :gauge)
    end
    return df
end

dat = load_data("data/surge/h551.csv")

# detrend with a one-year moving average to remove the effects of sea-level rise and seasonal dynamics
ma_length = 366
ma_offset = Int(floor(ma_length / 2))
moving_average(series, n) = [mean(@view series[i-n:i+n]) for i in n+1:length(series)-n]
dat_ma = DataFrame(
    datetime=dat.datetime[ma_offset+1:end-ma_offset],
    residual=dat.gauge[ma_offset+1:end-ma_offset] .- moving_average(dat.gauge, ma_offset)
)

# group data by year and compute the annual maxima (in mm)
dat_ma = dropmissing(dat_ma) # drop missing data
dat_annmax = combine(dat_ma -> dat_ma[argmax(dat_ma.residual), :], groupby(DataFrames.transform(dat_ma, :datetime => x -> year.(x)), :datetime_function))
deleteat!(dat_annmax, nrow(dat_annmax)) # drop 2023; only a partial year of data
rename!(dat_annmax, :datetime_function => :Year)
select!(dat_annmax, [:Year, :residual])

# make plots
psurge = plot(
    dat_annmax.Year,
    dat_annmax.residual;
    xlabel="Year",
    ylabel="Annual Max Tide (mm)",
    label=false,
    marker=:circle,
    markersize=5
)

plot!(psurge, size=(600, 400))
Figure 3: San Francisco tide gauge data

AIC: Storm Surge Example

Models (GEV denotes the generalized extreme value distribution):

  1. Stationary (“null”) model, \(y_t \sim \text{GEV}(\mu, \sigma, \xi);\)
  2. Time nonstationary (“null-ish”) model, \(y_t \sim \text{GEV}(\mu_0 + \mu_1 t, \sigma, \xi);\)
  3. PDO nonstationary model, \(y_t \sim \text{GEV}(\mu_0 + \mu_1 \text{PDO}_t, \sigma, \xi)\)
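As a hedged sketch of how such a fit might look in Julia, here is maximum-likelihood estimation for the stationary model using `GeneralizedExtremeValue` from Distributions.jl and Optim.jl; the function and variable names are ours, and this is not the exact code used for the results below.

Code
using Distributions, Optim

# negative log-likelihood of the stationary GEV(μ, σ, ξ) model;
# σ is parameterized on the log scale to keep it positive
function gev_negloglik(p, y)
    μ, logσ, ξ = p
    dist = GeneralizedExtremeValue(μ, exp(logσ), ξ)
    return -sum(logpdf.(dist, y))
end

# with y = dat_annmax.residual from the tide gauge code above:
# result = optimize(p -> gev_negloglik(p, y), [mean(y), log(std(y)), 0.1])
# θ̂ = result.minimizer; loglik = -result.minimum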

AIC Example

Code
# maximum likelihood estimates and log-likelihoods for each model
stat_mle = [1258.71, 56.27, 0.017]
stat_ll = -707.67
nonstat_mle = [1231.58, 0.42, 52.07, 0.075]
nonstat_ll = -702.45
pdo_mle = [1255.87, -12.39, 54.73, 0.033]
pdo_ll = -705.24

# elpd estimates: log-likelihood minus the number of parameters k
stat_elpd = stat_ll - 3
nonstat_elpd = nonstat_ll - 4
pdo_elpd = pdo_ll - 4

# AIC = -2 * elpd estimate
model_aic = DataFrame(Model=["Stationary", "Time", "PDO"], LogLik=trunc.(Int64, round.([stat_ll, nonstat_ll, pdo_ll]; digits=0)), AIC=trunc.(Int64, round.(-2 * [stat_elpd, nonstat_elpd, pdo_elpd]; digits=0)))
3×3 DataFrame
 Row │ Model       LogLik  AIC
     │ String      Int64   Int64
─────┼──────────────────────────
   1 │ Stationary    -708   1421
   2 │ Time          -702   1413
   3 │ PDO           -705   1418

AIC is Efficient, not Consistent

  • Efficiency: the criterion asymptotically selects the model that minimizes K-L divergence, equivalent to LOO-CV (Stone, 1977);
  • Consistency: the criterion asymptotically selects the “true” model if it is included in the candidate set.

This means AIC selects for predictive performance, not “truth”.

AIC/LOO-CV can tend to overfit due to this prioritization of efficiency.

AIC for Small Samples

In small samples, AIC tends to overfit (more parameters ⇒ more variance) because the penalty term isn't strong enough to compensate. The small-sample correction is

\[\text{AIC}_c = \text{AIC} + \frac{2k^2 + 2k}{n - k - 1},\]

where

  • \(k = \text{dim}(\mathcal{M})\)
  • \(n\) is the sample size.
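In code, this follows directly from the formula (reusing the `aic` helper sketched earlier):

Code
# small-sample corrected AIC; the correction vanishes as n → ∞
aicc(loglik, k, n) = aic(loglik, k) + (2k^2 + 2k) / (n - k - 1)

Note that the correction blows up as \(n\) approaches \(k + 1\), heavily penalizing models with nearly as many parameters as observations.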

AIC Interpretation

Absolute AIC values are meaningless; only the differences \(\Delta_i = \text{AIC}_i - \text{AIC}_\text{min}\) carry information.

Some basic rules of thumb (from Burnham & Anderson (2004)):

  • \(\Delta_i < 2\) means the model has “strong” support across \(\mathcal{M}\);
  • \(4 < \Delta_i < 7\) suggests “less” support;
  • \(\Delta_i > 10\) suggests “weak” or “no” support.
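Applying these rules of thumb to the storm surge AICs computed above:

Code
aics = [1421, 1413, 1418] # Stationary, Time, PDO
Δ = aics .- minimum(aics) # [8, 0, 5]

The time-nonstationary model is best-supported; the PDO model (\(\Delta_i = 5\)) falls in the “less” support range, and the stationary model (\(\Delta_i = 8\)) sits between the “less” and “weak” ranges.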

Model Averaging vs. Selection

Model averaging can sometimes be preferable to model selection.

Model selection can introduce bias from the selection process itself (this is particularly acute for stepwise selection due to its path-dependence).

AIC and Model Evidence

\(\exp(-\Delta_i/2)\) can be thought of as a measure of the likelihood of the model given the data \(y\).

The ratio \[\exp(-\Delta_i/2) / \exp(-\Delta_j/2)\] can approximate the relative evidence for \(M_i\) versus \(M_j\).

AIC and Model Averaging

This gives rise to the idea of Akaike weights: \[w_i = \frac{\exp(-\Delta_i/2)}{\sum_{m=1}^M \exp(-\Delta_m/2)}.\]

Model projections can then be weighted based on \(w_i\), which can be interpreted as the probability that \(M_i\) is the best (in the sense of approximating the “true” predictive distribution) model in \(\mathcal{M}\).
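Continuing the storm surge example, the weights take one line to compute (using the \(\Delta_i\) values from above):

Code
Δ = [8, 0, 5] # AIC differences: Stationary, Time, PDO
w = exp.(-Δ ./ 2) ./ sum(exp.(-Δ ./ 2)) # ≈ [0.017, 0.909, 0.075]

The time-nonstationary model carries roughly 91% of the weight, with the stationary and PDO models retaining a combined weight of about 9%.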

Key Takeaways and Upcoming Schedule

Key Takeaways

  • LOO-CV is ideal for navigating the bias-variance tradeoff but can be computationally prohibitive.
  • Information criteria approximate LOO-CV by “correcting” the in-sample fit for model complexity.
  • The complexity penalty approximates out-of-sample predictive error, guarding against the potential to overfit.

Next Classes

Friday: Quiz 2, BIC

References


Burnham, K. P., & Anderson, D. R. (2004). Multimodel Inference: Understanding AIC and BIC in Model Selection. Sociol. Methods Res., 33, 261–304. https://doi.org/10.1177/0049124104268644
Stone, M. (1977). An asymptotic equivalence of choice of model by cross-validation and Akaike’s criterion. J. R. Stat. Soc. Series B Stat. Methodol., 39, 44–47. https://doi.org/10.1111/j.2517-6161.1977.tb01603.x