Lecture 21
March 13, 2026
The “first” information criterion that most people see.
\[\widehat{\text{elpd}}_\text{AIC} = \log p(y | \hat{\theta}_\text{MLE}) - k.\]
The AIC is defined as \(-2\widehat{\text{elpd}}_\text{AIC}\).
Due to this convention, lower AIC values are better: they correspond to higher estimated predictive skill.
Absolute AIC values have no meaning, only the differences \(\Delta_i = \text{AIC}_i - \text{AIC}_\text{min}\).
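A minimal sketch of these definitions, using hypothetical maximized log-likelihoods and parameter counts (the numbers are illustrative, not from the lecture):

```python
import numpy as np

# Hypothetical maximized log-likelihoods log p(y | theta_MLE)
# and parameter counts k for three candidate models.
log_lik = np.array([-112.3, -108.9, -108.1])
k = np.array([2, 3, 5])

elpd_aic = log_lik - k   # elpd_AIC estimate for each model
aic = -2 * elpd_aic      # AIC = -2 * elpd_AIC; lower is better
delta = aic - aic.min()  # Delta_i: only differences are meaningful

print(aic)    # absolute values carry no meaning on their own
print(delta)  # the model with Delta_i = 0 is the AIC-preferred one
```

Note that adding any constant to every `log_lik` shifts all the AIC values but leaves `delta` unchanged, which is why only the differences are interpretable.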
Some basic rules of thumb (from Burnham & Anderson (2004)): \(\Delta_i \le 2\) indicates substantial support for \(M_i\), \(4 \le \Delta_i \le 7\) considerably less, and \(\Delta_i > 10\) essentially none.
\(\exp(-\Delta_i/2)\) can be thought of as a measure of the likelihood of the model given the data \(y\).
The ratio \[\exp(-\Delta_i/2) / \exp(-\Delta_j/2)\] can approximate the relative evidence for \(M_i\) versus \(M_j\).
This gives rise to the idea of Akaike weights: \[w_i = \frac{\exp(-\Delta_i/2)}{\sum_{m=1}^M \exp(-\Delta_m/2)}.\]
Model projections can then be weighted based on \(w_i\), which can be interpreted as the probability that \(M_i\) is the best (in the sense of approximating the “true” predictive distribution) model in \(\mathcal{M}\).
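The Akaike weights and a weighted model projection can be computed directly from the \(\Delta_i\). A small sketch with hypothetical \(\Delta_i\) values and per-model point predictions (both made up for illustration):

```python
import numpy as np

# Hypothetical AIC differences Delta_i for four candidate models.
delta = np.array([0.0, 1.2, 4.5, 9.8])

# Akaike weights: w_i = exp(-Delta_i/2) / sum_m exp(-Delta_m/2)
rel_lik = np.exp(-delta / 2)      # relative likelihood of each model
weights = rel_lik / rel_lik.sum()  # normalized to sum to 1

# Model-averaged projection: weight each model's (hypothetical)
# point prediction by its Akaike weight.
preds = np.array([3.1, 3.4, 2.8, 4.0])
averaged = np.sum(weights * preds)
```

The weights behave like a softmax over \(-\text{AIC}_i/2\): the best model (\(\Delta_i = 0\)) gets the largest weight, and models with \(\Delta_i > 10\) contribute almost nothing.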
\[\text{BIC} = -2 \log p(y | \hat{\theta}_\text{MLE}) + k\log n.\]
BIC is an approximation to \(k\)-fold cross-validation for a particular choice of \[k = n \left(1 - \frac{1}{\log(n)-1}\right).\]
If the “true” model \(M\) is in the set of compared models \(\mathcal{M}\), as \(n \to \infty\) BIC will select \(M\) with probability 1.
But BIC can be quite biased at “not large” \(n\), typically selecting a model more parsimonious than the “true” one.
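Because \(\log n > 2\) once \(n > e^2 \approx 7.4\), BIC penalizes parameters more heavily than AIC and the two criteria can disagree. A sketch with hypothetical log-likelihoods chosen so that the disagreement is visible:

```python
import numpy as np

# Hypothetical maximized log-likelihoods and parameter counts
# for three nested candidate models (illustrative numbers).
log_lik = np.array([-520.4, -512.0, -506.0])
k = np.array([1, 3, 8])
n = 200  # sample size; log(200) ~ 5.3, so BIC's penalty per parameter is steeper

aic = -2 * log_lik + 2 * k
bic = -2 * log_lik + k * np.log(n)

aic_choice = int(np.argmin(aic))  # AIC favors the richer, better-fitting model
bic_choice = int(np.argmin(bic))  # BIC favors the more parsimonious one
```

Here AIC picks the 8-parameter model while BIC picks the 3-parameter one, illustrating BIC's tendency toward parsimony at the cost of some predictive fit.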
In other words, if the “true” model is among the candidates, AIC may fail to select it even as \(n\) grows, while BIC eventually will. AIC instead selects the candidate with the best predictive skill.
The BIC vs. AIC choice is analogous to the tradeoff between causal and predictive analyses; it is generally not coherent to use both for the same problem.
Next Week: Hypothesis Testing and Intro to Simulation