Bayesian Information Criterion


Lecture 21

March 13, 2026

Review of Last Class

Akaike Information Criterion (AIC)

The “first” information criterion that most people see.

\[\widehat{\text{elpd}}_\text{AIC} = \log p(y | \hat{\theta}_\text{MLE}) - k.\]

The AIC is defined as \(-2\widehat{\text{elpd}}_\text{AIC}\).

Due to this convention, lower AIC values are better: they correspond to higher estimated predictive skill.
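As a concrete (hypothetical) illustration, we can compute the AIC for a simple normal model fit by maximum likelihood; the data and model here are made up for demonstration only:

```python
import numpy as np
from scipy import stats

# Hypothetical example: fit Normal(mu, sigma) to data by MLE (k = 2 parameters).
rng = np.random.default_rng(42)
y = rng.normal(loc=5.0, scale=2.0, size=100)

mu_hat, sigma_hat = y.mean(), y.std()  # MLEs for the normal model
log_lik = stats.norm.logpdf(y, mu_hat, sigma_hat).sum()

k = 2                        # number of fitted parameters
elpd_aic = log_lik - k       # the AIC estimate of elpd
aic = -2 * elpd_aic          # conventional AIC scale (lower is better)
print(aic)
```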

AIC Interpretation

Absolute AIC values have no meaning, only the differences \(\Delta_i = \text{AIC}_i - \text{AIC}_\text{min}\).

Some basic rules of thumb (from Burnham & Anderson (2004)):

  • \(\Delta_i < 2\): the model has “substantial” support relative to the best model in \(\mathcal{M}\);
  • \(4 < \Delta_i < 7\): “considerably less” support;
  • \(\Delta_i > 10\): “essentially no” support.
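The rules of thumb above can be applied mechanically once the \(\Delta_i\) are computed; the AIC values below are hypothetical:

```python
import numpy as np

# Hypothetical AIC values for three candidate models.
aics = np.array([430.2, 431.5, 445.0])
deltas = aics - aics.min()   # Delta_i = AIC_i - AIC_min

for i, d in enumerate(deltas):
    if d < 2:
        label = "substantial support"
    elif 4 < d < 7:
        label = "considerably less support"
    elif d > 10:
        label = "essentially no support"
    else:
        label = "intermediate"
    print(f"Model {i}: Delta = {d:.1f} ({label})")
```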

AIC and Model Evidence

\(\exp(-\Delta_i/2)\) can be thought of as a measure of the likelihood of the model given the data \(y\).

The ratio \[\exp(-\Delta_i/2) / \exp(-\Delta_j/2)\] can approximate the relative evidence for \(M_i\) versus \(M_j\).

AIC and Model Averaging

This gives rise to the idea of Akaike weights: \[w_i = \frac{\exp(-\Delta_i/2)}{\sum_{m=1}^M \exp(-\Delta_m/2)}.\]

Model projections can then be weighted based on \(w_i\), which can be interpreted as the probability that \(M_i\) is the best (in the sense of approximating the “true” predictive distribution) model in \(\mathcal{M}\).
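A minimal sketch of Akaike weights and weighted model averaging; the AIC values and per-model predictions are hypothetical:

```python
import numpy as np

def akaike_weights(aics):
    """Compute Akaike weights w_i from a sequence of AIC values."""
    deltas = np.asarray(aics) - np.min(aics)
    rel_lik = np.exp(-deltas / 2)          # relative likelihood of each model
    return rel_lik / rel_lik.sum()         # normalize so the weights sum to 1

# Hypothetical AICs; w_i is read as P(M_i is the K-L best model in the set).
w = akaike_weights([430.2, 431.5, 445.0])

# Model-averaged projection: weight each model's (hypothetical) point prediction.
preds = np.array([3.1, 3.4, 2.8])
avg_pred = np.dot(w, preds)
```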

Bayesian Information Criterion

Bayesian Information Criterion (BIC)

\[\text{BIC} = -2 \log p(y | \hat{\theta}_\text{MLE}) + k\log n.\]

BIC is an approximation to leave-\(v\)-out cross-validation for the particular choice \[v = n \left(1 - \frac{1}{\log(n)-1}\right)\] (note that \(v\) here is a held-out-set size, distinct from the parameter count \(k\)).
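Comparing the two penalties directly: for \(n \geq 8\), \(\log n > 2\), so BIC penalizes each parameter more heavily than AIC. A sketch with a hypothetical normal model fit by MLE:

```python
import numpy as np
from scipy import stats

# Hypothetical example: normal model, k = 2 parameters, n = 100 observations.
rng = np.random.default_rng(42)
y = rng.normal(5.0, 2.0, size=100)
n, k = y.size, 2

log_lik = stats.norm.logpdf(y, y.mean(), y.std()).sum()
bic = -2 * log_lik + k * np.log(n)   # penalty: k * log(n)
aic = -2 * log_lik + 2 * k           # penalty: 2k

# The gap depends only on the penalties: bic - aic = k * (log(n) - 2).
print(bic - aic)
```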

BIC: Consistency vs. Efficiency

  • BIC is consistent but not efficient.

If the “true” model \(M\) is in the set of compared models \(\mathcal{M}\), as \(n \to \infty\) BIC will select \(M\) with probability 1.

But at “not large” \(n\), BIC selection can be quite biased and will typically favor models more parsimonious than the “true” one.

BIC vs. AIC

  • BIC tends to select more parsimonious models due to its stronger complexity penalty (\(k \log n\) vs. \(2k\) for \(n \geq 8\));
  • AIC will tend to overfit, BIC to underfit.

In other words, if the “true” model is among the candidates, AIC may not select it even as \(n \to \infty\), but BIC eventually will. On the other hand, AIC will tend to choose the model with the best predictive skill among the candidates.

The BIC vs. AIC choice is analogous to the tradeoff between causal and predictive analyses; it is generally not coherent to use both for the same problem.

Key Takeaways and Upcoming Schedule

Key Takeaways

  • Information criteria approximate LOO-CV by “correcting” the in-sample fit for model complexity.
  • The complexity penalty stands in for out-of-sample predictive error and the potential to overfit.
  • Some ICs approximate the K-L divergence/LOO-CV; others approximate the marginal likelihood. These have different implications for predictive vs. causal modeling.

Next Classes

Next Week: Hypothesis Testing and Intro to Simulation

References

References

Burnham, K. P., & Anderson, D. R. (2004). Multimodel Inference: Understanding AIC and BIC in Model Selection. Sociol. Methods Res., 33, 261–304. https://doi.org/10.1177/0049124104268644