Modeling Extreme Values

Lecture 13

February 23, 2026

Review

Last Week: Time Series

Autocorrelation: \(y_t\) not independent of \(y_{t-k}\)
Common for measurements made repeatedly over time
Use partial autocorrelation functions to diagnose lag \(k\)

Autoregressive Models

AR(k) models capture dependence, most common is AR(1).
Stationary models: don’t blow up to infinity, \(|\rho| < 1\).
Not very valuable for explanation, very useful for prediction.
Can combine with explanatory models (statistical or numerical) to capture trends.

Extreme Values

What Are Some Examples of Extremes?

When An Event is Rare?
When A Variable Exceeds Some High Threshold?
When An Event Causes a Catastrophe?

Two Ways To Frame “Extreme” Values

These are values with a very low probability of occurring, not necessarily high-impact events (which don’t have to be rare!).

“Block” extremes, e.g. annual maxima (block maxima)?
Values which exceed a certain threshold (peaks over threshold)?

Example: Tide Gauge Data

Code

function load_data(fname)
    date_format = "yyyy-mm-dd HH:MM"
    # this uses the DataFramesMeta package -- it's pretty cool
    return @chain fname begin
        CSV.File(; dateformat=date_format)
        DataFrame
        rename(
            "Time (GMT)" => "time", "Predicted (m)" => "harmonic", "Verified (m)" => "gauge"
        )
        @transform :datetime = (Date.(:Date, "yyyy/mm/dd") + Time.(:time))
        select(:datetime, :gauge, :harmonic)
        @transform :weather = :gauge - :harmonic
        @transform :month = (month.(:datetime))
    end
end

dat = load_data("data/surge/norfolk-hourly-surge-2015.csv")

p1 = plot(dat.datetime, dat.gauge; ylabel="Gauge Measurement (m)", label="Observed", legend=:topleft, xlabel="Date/Time", bottom_margin=5mm, left_margin=5mm, right_margin=5mm)

Figure 1: 2015 tide gauge data from the Norfolk, VA tide gauge.

Example: Tide Gauge Data

Code

plot!(p1, dat.datetime, dat.harmonic, label="Predicted", alpha=0.7)

Figure 2: 2015 tide gauge data with predicted harmonics from the Norfolk, VA tide gauge.

Example: Detrended Data

Code

plot(dat.datetime, dat.weather; ylabel="Gauge Weather Variability (m)", label="Detrended Data", linewidth=3, legend=:topleft,  xlabel="Date/Time", bottom_margin=5mm, left_margin=5mm, right_margin=5mm)

Figure 3: 2015 detrended tide gauge data from the Norfolk, VA tide gauge.

Example: Block Maxima

Code

pbm = plot(dat.datetime, dat.weather; ylabel="Gauge Weather Variability (m)", label="Detrended Data", linewidth=2, legend=:topleft, xlabel="Date/Time", bottom_margin=5mm, left_margin=5mm, right_margin=5mm)
max_dat = combine(dat -> dat[argmax(dat.weather), :], groupby(transform(dat, :datetime => x->yearmonth.(x)), :datetime_function))
scatter!(max_dat.datetime, max_dat.weather, label="Monthly Maxima", markersize=5)
month_start = collect(Date(2015, 01, 01):Dates.Month(1):Date(2015, 12, 01))
vline!(DateTime.(month_start), color=:black, label=:false, linestyle=:dash)

p = histogram(
    max_dat.weather,
    normalize=:pdf,
    orientation=:horizontal,
    label=:false,
    xlabel="Count",
    bins=5,
    ylabel="",
    yticks=[]
)

l = @layout [a{0.7w} b{0.3w}]
plot(pbm, p; layout=l, link=:y, ylims=(-0.4, 1.4), bottom_margin=5mm, left_margin=5mm)
plot!(size=(1000, 450))

Figure 4: 2015 detrended tide gauge data from the Norfolk, VA tide gauge.

Example: Peaks Over Threshold

Code

thresh = 0.5
ppot = plot(dat.datetime, dat.weather; linewidth=2, ylabel="Gauge Weather Variability (m)", label="Observations", legend=:top, xlabel="Date/Time")
hline!([thresh], color=:red, linestyle=:dash, label="Threshold")
scatter!(dat.datetime[dat.weather .> thresh], dat.weather[dat.weather .> thresh], markershape=:x, color=:black, markersize=3, label="Exceedances")

p2 = histogram(
    dat.weather[dat.weather .> thresh],
    normalize=:pdf,
    orientation=:horizontal,
    label=:false,
    xlabel="Count",
    ylabel="",
    yticks=[]
)

l = @layout [a{0.7w} b{0.3w}]
plot(ppot, p2; layout=l, link=:y, ylims=(-0.4, 1.4), bottom_margin=5mm, left_margin=5mm)
plot!(size=(1000, 450))

Figure 5: 2015 detrended tide gauge data from the Norfolk, VA tide gauge.

Block Maxima

Given independent and identically-distributed random variables \(X_1, X_2, \ldots, X_{mk}\), what is the distribution of maxima of “blocks” of size \(m\):

\[\tilde{X}_i = \max_{(i-1)m < j \leq im} X_j,\]

for \(i = 1, 2, \ldots, k\)?

How Are Maxima Distributed?

Consider the CDF \(F_X\) of \(X\) and let \(Y_n = \max \{X_1, \ldots, X_n \}\).

\[ \begin{align*} F_{Y_n}(z) &= \mathbb{P}[Y_n \leq z] \\ &= \mathbb{P}(X_1 \leq z, \ldots, X_n \leq z) \\ &= \mathbb{P}(X_1 \leq z) \times \ldots \times \mathbb{P}(X_n \leq z) \\ &= \mathbb{P}(X \leq z)^n = \left[F_X(z)\right]^n \end{align*} \]

This means that errors in estimating \(F_X\) become exponentially worse for \(F_{Y_n}\).

Sum-Stable Distributions

If we have independent and identically-distributed variables \(X_1, X_2, \ldots, X_n\).

\[Y_1 = \sum_{i=1}^k X_i, \quad Y_2 = \sum_{i={k+1}}^{2k} X_i, \quad \ldots, \quad Y_m = \sum_{i={n-k+1}}^{n} X_i\]

Sum-Stability: \(Y \stackrel{d}{=} a_kX + b_k\) for some constants \(a, b \geq 0\).

Example: If \(X \sim N(\mu, \sigma^2)\), \(Y \sim N(k\mu, k\sigma^2)\)

Max Stability

\[ \begin{aligned} \max\{x_1, \ldots, &x_{2n}\}\\ &= \max\{\max\{x_1, \ldots, x_n\}, \max\{x_{n+1}, \ldots, x_{2n}\} \} \end{aligned} \]

Figure 6: 2015 detrended tide gauge data from the Norfolk, VA tide gauge.

Stability Postulate

If \(Z = \max\{X_1, \ldots, X_n\}\), then \(F_{Z}(z) = \left[F_X(z)\right]^n\).

By analogy with sum stability, postulate that for a max-stable process:

\[F^n(z) = F(a_n z + b_n)\]

for some constants \(a_n, b_n \geq 0\).

Idea: If possible, this rescaling ensures that as \(n \to \infty\), \(G^n\) doesn’t collapse to zero.

Extremal Types Theorem

Let \(X_1, \ldots, X_n\) be a sample of i.i.d. random variables.

If a limiting distribution for \(Y = \max\{X_1, \ldots, X_n\}\) exists, it can only by given as a Generalized Extreme Value (GEV) distribution:

\[H(y) = \exp\left\{-\left[1 + \xi\left(\frac{y-\mu}{\sigma}\right)\right]^{-1/\xi}\right\},\] defined for \(y\) such that \(1 + \xi(y-\mu)/\sigma > 0\).

GEV Distributions

GEV distributions have three parameters:

location \(\mu\);
scale \(\sigma > 0\);
shape \(\xi\).

GEV “Types”

\(\xi > 0\): Frèchet (heavy-tailed)
\(\xi = 0\): Gumbel (light-tailed)
\(\xi < 0\): Weibull (bounded)

Code

p1 = plot(-2:0.1:6, GeneralizedExtremeValue(0, 1, 0.5), linewidth=3, color=:red, label=L"$\xi = 1/2$", lw=3)
plot!(-4:0.1:6, GeneralizedExtremeValue(0, 1, 0), linewidth=3, color=:green, label=L"$\xi = 0$", lw=3)
plot!(-4:0.1:2, GeneralizedExtremeValue(0, 1, -0.5), linewidth=3, color=:blue, label=L"$\xi = -1/2$", lw=3)
scatter!((-2, 0), color=:red, label=:false)
scatter!((2, 0), color=:blue, label=:false)
ylabel!("Density")
xlabel!(L"$x$")
plot!(size=(600, 450))

Figure 7: Shape of the GEV distribution with different choices of \(\xi\).

Impact of Shape Parameter

Code

c1 = 1 .- cdf.(GeneralizedExtremeValue(0, 1, 0.5), -2:0.1:6)
c2 = 1 .- cdf.(GeneralizedExtremeValue(0, 1, 0), -2:0.1:6)
c3 = 1 .- cdf.(GeneralizedExtremeValue(0, 1, -0.5), -2:0.1:6)
p = plot(-2:0.1:6, c1, linewidth=3, color=:red, label=L"$\xi = 1/2$", lw=3, ylabel=L"$P(X > x)$", xlabel=L"$x$")
plot!(-2:0.1:6, c2, linewidth=3, color=:green, label=L"$\xi = 0$", lw=3)
plot!(-2:0.1:6, c3, linewidth=3, color=:blue, label=L"$\xi = -1/2$", lw=3)
ylabel!("Density")
xlabel!(L"$x$")
plot!(size=(600, 450))

Figure 8: Shape of the GEV distribution with different choices of \(\xi\).

Code

c1 = 1 .- cdf.(GeneralizedExtremeValue(0, 1, 0.5), -2:0.1:6)
c2 = 1 .- cdf.(GeneralizedExtremeValue(0, 1, 0), -2:0.1:6)
c3 = 1 .- cdf.(GeneralizedExtremeValue(0, 1, -0.5), -2:0.01:1.89)
p = plot(-2:0.1:6, c1, linewidth=3, color=:red, label=L"$\xi = 1/2$", lw=3, ylabel=L"$P(X > x)$", xlabel=L"$x$", yaxis=:log)
plot!(-2:0.1:6, c2, linewidth=3, color=:green, label=L"$\xi = 0$", lw=3)
plot!(-2:0.01:1.89, c3, linewidth=3, color=:blue, label=L"$\xi = -1/2$", lw=3)
ylabel!("Density")
xlabel!(L"$x$")
plot!(size=(600, 450))

Figure 9: Shape of the GEV distribution with different choices of \(\xi\).

GEV Types

\(\xi < 0\): extremes are bounded (the Weibull distribution comes up in the context of temperature and wind speed extremes).
\(\xi > 0\): tails are heavy, and there is no expectation if \(\xi > 1\). Common for streamflow, storm surge, precipitation.
The Gumbel distribution (\(\xi = 0\)) is common for extremes from normal distributions, doesn’t occur often in real-world data.

Return Levels

Return Levels are a central (if poorly named) concept in risk analysis.

The \(T\)-year return level is the value expected to be observed on average once every \(T\) years.

From a GEV fit to annual maxima: \(T\)-year return level is the \(1-1/T\) quantile.

Return Periods

The return period of an extreme value is the inverse of the exceedance probability.

Example: The 100-year return period has an exceedance probability of 1%, e.g. the 0.99 quantile of the GEV.

Return levels are associated with the analogous return period.

San Francisco Tide Gauge Data

Code

# load SF tide gauge data
# read in data and get annual maxima
function load_data(fname)
    date_format = DateFormat("yyyy-mm-dd HH:MM:SS")
    # This uses the DataFramesMeta.jl package, which makes it easy to string together commands to load and process data
    df = @chain fname begin
        CSV.read(DataFrame; header=false)
        rename("Column1" => "year", "Column2" => "month", "Column3" => "day", "Column4" => "hour", "Column5" => "gauge")
        # need to reformat the decimal date in the data file
        @transform :datetime = DateTime.(:year, :month, :day, :hour)
        # replace -99999 with missing
        @transform :gauge = ifelse.(abs.(:gauge) .>= 9999, missing, :gauge)
        select(:datetime, :gauge)
    end
    return df
end

d_sf = load_data("data/surge/h551.csv")

## plot data
p1 = plot(
    d_sf.datetime[year.(d_sf.datetime) .== 2000],
    d_sf.gauge[year.(d_sf.datetime) .== 2000] / 1000;
    xlabel="Date",
    ylabel="Tide Level (m)",
    label=false,
    marker=:circle,
    markersize=2
)

Figure 10: Annual surge data from the San Francisco, CA tide gauge.

Annual Maxima of Gauge Data

Code

# group data by year and compute the annual maxima
dat_ma = dropmissing(d_sf) # drop missing data
dat_annmax = combine(dat_ma -> dat_ma[argmax(dat_ma.gauge), :], groupby(transform(dat_ma, :datetime => x->year.(x)), :datetime_function))
delete!(dat_annmax, nrow(dat_annmax)) # delete 2023; haven't seen much of that year yet
rename!(dat_annmax, :datetime_function => :Year)
select!(dat_annmax, [:Year, :gauge])
dat_annmax.gauge = dat_annmax.gauge / 1000 # convert to m

# make plots
p1 = plot(
    dat_annmax.Year,
    dat_annmax.gauge;
    xlabel="Year",
    ylabel="Annual Max Tide Level (m)",
    label=false,
    marker=:circle,
    markersize=5
)
p2 = histogram(
    dat_annmax.gauge,
    normalize=:pdf,
    orientation=:horizontal,
    label=:false,
    xlabel="Count",
    ylabel="",
    yticks=[]
)

l = @layout [a{0.7w} b{0.3w}]
plot(p1, p2; layout=l, link=:y, ylims=(3.7, 4.6), bottom_margin=5mm, left_margin=5mm)
plot!(size=(1000, 450))

Figure 11: Annual maxima surge data from the San Francisco, CA tide gauge.

Annual Maxima of Detrended Data

Code

# detrend the data to remove the effects of sea-level rise and seasonal dynamics
ma_length = 366
ma_offset = Int(floor(ma_length/2))
moving_average(series,n) = [mean(@view series[i-n:i+n]) for i in n+1:length(series)-n]
dat_ma = DataFrame(datetime=d_sf.datetime[ma_offset+1:end-ma_offset], residual=d_sf.gauge[ma_offset+1:end-ma_offset] .- moving_average(d_sf.gauge, ma_offset))

# group data by year and compute the annual maxima
dat_ma = dropmissing(dat_ma) # drop missing data
dat_annmax = combine(dat_ma -> dat_ma[argmax(dat_ma.residual), :], groupby(transform(dat_ma, :datetime => x->year.(x)), :datetime_function))
delete!(dat_annmax, nrow(dat_annmax)) # delete 2023; haven't seen much of that year yet
rename!(dat_annmax, :datetime_function => :Year)
select!(dat_annmax, [:Year, :residual])
dat_annmax.residual = dat_annmax.residual / 1000 # convert to m

# make plots
p1 = plot(
    dat_annmax.Year,
    dat_annmax.residual;
    xlabel="Year",
    ylabel="Annual Max Tide Over MSL (m)",
    label=false,
    marker=:circle,
    markersize=5
)
p2 = histogram(
    dat_annmax.residual,
    normalize=:pdf,
    orientation=:horizontal,
    label=:false,
    xlabel="Count",
    ylabel="",
    yticks=[]
)

l = @layout [a{0.7w} b{0.3w}]
plot(p1, p2; layout=l, link=:y, ylims=(1, 1.7), bottom_margin=5mm, left_margin=5mm)
plot!(size=(1000, 450))

Figure 12: Annual maximasurge data from the San Francisco, CA tide gauge.

GEV Parameters

Code

# find GEV fit
# for most distributions we could use Distributions.fit(), but this isn't implemented in Distributions.jl for GEV
init_θ = [1.0, 1.0, 1.0]
gev_lik(θ) = -sum(logpdf(GeneralizedExtremeValue(θ[1], θ[2], θ[3]), dat_annmax.residual))
θ_mle = Optim.optimize(gev_lik, init_θ).minimizer

p = histogram(
    dat_annmax.residual,
    normalize=:pdf,
    label="Data",
    xlabel="Annual Maximum (m)",
    ylabel="PDF",
    yticks=[],
    left_margin=10mm,
    right_margin=10mm
)

plot!(GeneralizedExtremeValue(θ_mle[1], θ_mle[2], θ_mle[3]), linewidth=5, label="GEV Fit", color=:gold4)
xlims!((1, 1.75))
plot!(size=(800, 500))

Figure 13: GEV fit to annual maxima of San Francisco Tide Gauge Data

\(\mu =\) 1.26
\(\sigma =\) 0.06
\(\xi =\) 0.02

GEV Return Levels

100-year return level: 1.53m

Code

gevcdf(x) = cdf(GeneralizedExtremeValue(θ_mle[1], θ_mle[2], θ_mle[3]), x) # find quantiles of values
rl_100 = quantile(GeneralizedExtremeValue(θ_mle[1], θ_mle[2], θ_mle[3]), 0.99)

plot(1:0.01:1.75, 1 .- gevcdf.(1:0.01:1.75), yaxis=:log, yticks=10 .^ collect(-3.0:1.0:0.0), xlabel="Water Level (m)", ylabel="Exceedance Probability (1/yr)", lw=3)
plot!(1:0.01:rl_100, 0.01 .+ zeros(length(1:0.01:rl_100)), color=:red, lw=2)
plot!(rl_100 .+ zeros(length(-4:0.01:-2)), 10 .^collect(-4.0:0.01:-2.0), color=:red, lw=2)
scatter!([rl_100], [0.01], color=:red, markersize=5)
plot!(size=(800, 500))

Figure 14: GEV fit to annual maxima of San Francisco Tide Gauge Data

GEV vs LogNormal

Code

p = histogram(
    dat_annmax.residual,
    normalize=:pdf,
    label="Data",
    xlabel="Annual Maximum (m)",
    ylabel="PDF",
    yticks=[],
    left_margin=10mm,
    right_margin=10mm
)
plot!(GeneralizedExtremeValue(θ_mle[1], θ_mle[2], θ_mle[3]), linewidth=5, label="GEV Fit", color=:gold4)
plot!(fit(LogNormal, dat_annmax.residual), linewidth=5, label="LogNormal Fit", color=:darkred)
xlims!((1, 1.75))

Figure 15: GEV fit to annual maxima of San Francisco Tide Gauge Data

GEV Q-Q Plot

Code

p1 = qqplot(GeneralizedExtremeValue(θ_mle[1], θ_mle[2], θ_mle[3]), dat_annmax.residual, 
    linewidth=3, markersize=5,
    xlabel="Theoretical Quantile",
    ylabel="Empirical Quantile"
)
plot!(p1, size=(600, 450))

return_periods = 2:500
# get GEV return levels
return_levels = quantile.(GeneralizedExtremeValue(θ_mle[1], θ_mle[2], θ_mle[3]), 1 .- (1 ./ return_periods))
# fit lognormal to get return levels for comparison
lognormal_fit = fit(LogNormal, dat_annmax.residual)
return_levels_lognormal = quantile.(lognormal_fit, 1 .- (1 ./ return_periods))

# function to calculate exceedance probability and plot positions based on data quantile
function exceedance_plot_pos(y)
    N = length(y)
    ys = sort(y; rev=false) # sorted values of y
    nxp = xp = [r / (N + 1) for r in 1:N] # exceedance probability
    xp = 1 .- nxp
    return xp, ys
end
xp, ys = exceedance_plot_pos(dat_annmax.residual)

p2 = plot(return_periods, return_levels, linewidth=5, color=:gold4, label="GEV Model Fit", bottom_margin=5mm, left_margin=5mm, right_margin=10mm, legend=:bottomright)
plot!(p2, return_periods, return_levels_lognormal, linewidth=5, color=:darkred, label="LogNormal Model Fit")
scatter!(p2, 1 ./ xp, ys, label="Observations", color=:black, markersize=6)
xlabel!(p2, "Return Period (yrs)")
ylabel!(p2, "Return Level (m)")
xlims!(-1, 300)
plot!(p2, size=(600, 450))

display(p1)
display(p2)

(a) GEV fit to annual maxima of San Francisco Tide Gauge Data

Be Careful About The Shape Parameter!

\(\xi\) can be difficult to constrain;
Often lots of uncertainty (large standard errors);
Controls frequency of “extreme” extremes

Source: Zarekarizi et al. (2020)

Choosing a Block Size

Small block sizes: poor approximation by GEV, resulting in greater bias
Large block sizes: fewer data points, greater estimation variance
One year is common for practical reasons, but can go finer with sufficient data so long as block maxima can be treated as independent.

Key Points And Upcoming Schedule

Key Points

Two approaches to modeling extremes: block maxima and peaks over thresholds
Block maxima are modeled by Generalized Extreme Value (GEV) distributions
Three types of GEVs depending on shape \(\xi\): influences heaviness of taills (extreme eztremes)
Need to detrend data to ensure distribution over time is the same.
Be careful when choosing block sizes.

Upcoming Schedule

Wednesday: Peaks Over Thresholds

Friday: Nonstationary Extremes

Assessments

Friday: Project Proposal Due

3/06: HW3 Due

References

Zarekarizi, M., Srikrishnan, V., & Keller, K. (2020). Neglecting uncertainties biases house-elevation decisions to manage riverine flood risks. Nat. Commun., 11, 5361. https://doi.org/10.1038/s41467-020-19188-9