Lecture 16
March 2, 2026
Extreme values are values with a very low probability of occurring; they are not necessarily high-impact events (which don't have to be rare!).
The \(T\)-year return level is the value expected to be observed on average once every \(T\) years.
The return period of an extreme value is the inverse of the exceedance probability.
Example: A 100-year return period corresponds to an exceedance probability of 1% in any given year, i.e. the 100-year return level is the 0.99 quantile.
Return levels are associated with the analogous return period.
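For example, if annual maxima are modeled with a generalized extreme value (GEV) distribution, the \(T\)-year return level is the \(1 - 1/T\) quantile. A minimal sketch in Julia (the GEV parameters here are hypothetical, not fit to any data, and the quantile formula is written out directly rather than using a package):

```julia
# T-year return level = (1 - 1/T) quantile of the annual-maximum distribution.
# GEV quantile function (ξ ≠ 0 case), implemented directly for illustration.
function gev_return_level(T; μ=3.0, σ=0.5, ξ=0.1)  # hypothetical parameters
    p = 1 - 1 / T                # non-exceedance probability
    return μ + (σ / ξ) * ((-log(p))^(-ξ) - 1)
end

level_100 = gev_return_level(100)  # 100-year return level (0.99 quantile)
level_500 = gev_return_level(500)  # rarer events have higher return levels
```

Note that the return level grows with \(T\): a 500-year event is more extreme than a 100-year event.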

Source: Richard McElreath
\[R^2 = \frac{\sigma^2_\text{model}}{\sigma^2_\text{data}}\]
SMBC: Take It Off
Anscombe’s Quartet (Anscombe, 1973) consists of datasets which have the same summary statistics (including \(R^2\)) but very different graphs.

Source: Wikipedia
In pretty much every case where \(R^2\) might be useful, (root) mean squared error ((R)MSE) is better.
More generally, we want to think about measures which capture the skill of a probabilistic prediction.
These are commonly called scoring rules (Gneiting & Katzfuss, 2014) (more on this next class).
Suppose we have a dataset \(\mathbf{y}_n = \{y_1, y_2, \ldots, y_n\}\) and various models \(\mathcal{M} = \{M_1, M_2, \ldots, M_k\}\) parameterized by \(\theta_1, \ldots, \theta_k\).
Then the prediction our model makes at a new predictor \(\mathbf{x}\) is \(M_i(\mathbf{x}; \theta_i)\).
Finally, we have a loss function \(L\) which tells us how to penalize errors, e.g. for MSE use the quadratic loss \[L(y, M, \theta) = \left(y - M(\mathbf{x}; \theta)\right)^2.\]
We fit the model by minimizing the average loss over the data: \[\hat{\theta}_n = \operatorname{argmin}_{\theta \in \Theta} \bar{L}(\mathbf{y}_n, M, \theta).\]
This means minimizing MSE/SSE or maximizing likelihood for the quadratic or log losses.
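As a concrete sketch (not from the lecture): for a linear model \(M(\mathbf{x}; \theta) = \mathbf{x}^\top \theta\), minimizing the average quadratic loss is ordinary least squares, which Julia's backslash operator solves directly. The data below are made up for illustration:

```julia
# Minimizing average quadratic loss for a linear model is ordinary least squares.
# Illustrative data: y depends linearly on x with known coefficients.
x = collect(0.0:0.1:5.0)
X = [ones(length(x)) x]     # design matrix: intercept and slope columns
θ_true = [1.0, 2.0]
y = X * θ_true              # noiseless for clarity; real data would add noise

θ̂ = X \ y                   # argmin over θ of Σᵢ (yᵢ - Xᵢθ)²
```

With noiseless data the fit recovers \(\theta\) exactly; with noise, \(\hat{\theta}_n\) converges to it as \(n\) grows.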
What do we want to see in a probabilistic projection \(F\)?
It is common to use the probability integral transform (PIT) to make these properties more concrete: \(Z_F = F(y)\).
The forecast is probabilistically calibrated if \(Z_F \sim \mathrm{Uniform}(0, 1)\).
The forecast is properly dispersed if \(\text{Var}(Z_F) = 1/12\), the variance of a \(\mathrm{Uniform}(0, 1)\) random variable.
Sharpness can be measured by the width of a particular prediction interval. A good forecast is as sharp as possible subject to calibration (Gneiting et al., 2007).
# "true" observation distribution is N(2, 0.5)
obs = rand(Normal(2, 0.5), 50)
# forecast according to the "correct" distribution and obtain PIT
pit_corr = cdf(Normal(2, 0.45), obs)
p_corr = histogram(pit_corr, bins=10, label=false, xlabel=L"$y$", ylabel="Count", size=(500, 500))
xrange = 0:0.01:5
p_cdf1 = plot(xrange, cdf.(Normal(2, 0.4), xrange), xlabel=L"$y$", ylabel="Cumulative Density", label="Forecast", size=(500, 500))
plot!(p_cdf1, xrange, cdf.(Normal(2, 0.5), xrange), label="Truth")
display(p_cdf1)
display(p_corr)

# forecast according to an underdispersed distribution and obtain PIT
pit_under = cdf.(Normal(2, 0.1), obs)
p_under = histogram(pit_under, bins=10, label=false, xlabel=L"$y$", ylabel="Count", size=(500, 500))
xrange = 0:0.01:5
p_cdf2 = plot(xrange, cdf.(Normal(2, 0.1), xrange), xlabel=L"$y$", ylabel="Cumulative Density", label="Forecast", size=(500, 500))
plot!(p_cdf2, xrange, cdf.(Normal(2, 0.5), xrange), label="Truth")
display(p_cdf2)
display(p_under)

# forecast according to an overdispersed distribution and obtain PIT
pit_over = cdf.(Normal(2, 1), obs)
p_over = histogram(pit_over, bins=10, label=false, xlabel=L"$y$", ylabel="Count", size=(500, 500))
xrange = 0:0.01:5
p_cdf3 = plot(xrange, cdf.(Normal(2, 1), xrange), xlabel=L"$y$", ylabel="Cumulative Density", label="Forecast", size=(500, 500))
plot!(p_cdf3, xrange, cdf.(Normal(2, 0.5), xrange), label="Truth")
display(p_cdf3)
display(p_over)

Scoring rules compare observations against an entire probabilistic forecast.
A scoring rule \(S(F, y)\) measures the “loss” of a predicted probability distribution \(F\) once an observation \(y\) is obtained.
Typically oriented so smaller = better.
Proper scoring rules are intended to encourage forecasters to provide their full (and honest) forecasts.
The expected score is minimized when the forecast distribution matches the distribution generating the observations:
\[\mathbb{E}_{Y \sim G}\left[S(G, Y)\right] \leq \mathbb{E}_{Y \sim G}\left[S(F, Y)\right] \qquad \forall F.\]
It is strictly proper if equality holds only if \(F = G\).
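A quick way to see propriety in action (a Monte Carlo sketch, not from the lecture): under the logarithmic score \(S(F, y) = -\log f(y)\), forecasts that are biased or overdispersed relative to the true distribution incur a larger expected score. Using only the standard library:

```julia
using Random  # standard library
Random.seed!(1)

# Log score for a Normal(μ, σ) forecast: S(F, y) = -log f(y)
logscore(μ, σ, y) = 0.5 * log(2π * σ^2) + (y - μ)^2 / (2σ^2)

# Observations truly come from N(0, 1)
y = randn(100_000)

# Monte Carlo estimates of the expected score for three forecasts
s_true   = sum(logscore.(0.0, 1.0, y)) / length(y)   # the true distribution
s_wide   = sum(logscore.(0.0, 2.0, y)) / length(y)   # overdispersed forecast
s_biased = sum(logscore.(1.0, 1.0, y)) / length(y)   # biased forecast
# s_true is the smallest of the three: propriety in action.
</imports>
```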
Most classification algorithms produce a probability (e.g. logistic regression) of different outcomes.
Common skill metrics for classification models are accuracy, sensitivity, and specificity: these take the predicted probabilities and apply some threshold to translate them into categorical predictions.
The problem: This translation is a decision problem, not a statistical problem. A probabilistic scoring rule over the predicted probabilities more accurately reflects the skill of the statistical model.
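For instance, the Brier score (a proper scoring rule for binary outcomes, not covered in this excerpt) scores the predicted probabilities directly, with no thresholding step. A rough sketch with made-up forecasts:

```julia
# Brier score: mean squared difference between predicted probability and outcome
brier(p, y) = sum((p .- y) .^ 2) / length(y)

y_obs  = [1, 0, 1, 1, 0]            # binary outcomes (illustrative)
p_good = [0.9, 0.1, 0.8, 0.7, 0.2]  # confident, well-calibrated probabilities
p_flat = fill(0.5, 5)               # uninformative probabilities

brier(p_good, y_obs)   # → 0.038 (smaller = better)
brier(p_flat, y_obs)   # → 0.25
```

Note that both forecasts could yield identical accuracy under a 0.5 threshold, yet the score distinguishes their skill.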
The logarithmic score \(S(F, y) = -\log f(y)\), where \(f\) is the predictive density, is (up to equivalence) the only local strictly proper scoring rule (locality ⇒ the score depends only on the forecast density at the observed value).
This is the negative log-probability: straightforward to compute from the likelihood (frequentist forecasts) or the posterior predictive distribution (Bayesian forecasts), and it generalizes MSE (the quadratic loss is the log score of a Gaussian predictive distribution, up to constants).
Wednesday: Bias and Variance and Tradeoffs
Friday: Cross-Validation
HW3: Due Friday
Quiz 2: Next Friday, 3/13