Peaks Over Thresholds


Lecture 14

February 25, 2026

Review

Extreme Values

Values with a very low probability of occurring; these are not necessarily the same as high-impact events (which don’t have to be rare!).

  1. “Block” extremes, e.g. annual maxima (block maxima)
  2. Values which exceed a certain threshold (peaks over threshold)

Generalized Extreme Values

Block maxima \(Y_n = \max \{X_1, \ldots, X_n \}\) are modeled using Generalized Extreme Value distributions: \(GEV(\mu, \sigma, \xi)\).

Code
p1 = plot(-2:0.1:6, GeneralizedExtremeValue(0, 1, 0.5), linewidth=3, color=:red, label=L"$\xi = 1/2$")
plot!(-4:0.1:6, GeneralizedExtremeValue(0, 1, 0), linewidth=3, color=:green, label=L"$\xi = 0$")
plot!(-4:0.1:2, GeneralizedExtremeValue(0, 1, -0.5), linewidth=3, color=:blue, label=L"$\xi = -1/2$")
scatter!((-2, 0), color=:red, label=:false)
scatter!((2, 0), color=:blue, label=:false)
ylabel!("Density")
xlabel!(L"$x$")
plot!(size=(600, 450))
Figure 1: Shape of the GEV distribution with different choices of \(\xi\).

Return Levels

Return Levels

Return Levels are a central (if poorly named) concept in risk analysis.

The \(T\)-year return level is the value expected to be observed on average once every \(T\) years.

From a GEV fit to annual maxima: \(T\)-year return level is the \(1-1/T\) quantile.
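The GEV quantile function has a closed form, so the return level can be computed directly. A minimal sketch in plain Python (a helper of my own, `gev_quantile`; the Julia code later in these slides uses `Distributions.quantile` instead):

```python
import math

def gev_quantile(p, mu, sigma, xi):
    """Closed-form GEV quantile (inverse CDF) at probability p."""
    if xi == 0:  # Gumbel case
        return mu - sigma * math.log(-math.log(p))
    return mu + (sigma / xi) * ((-math.log(p)) ** (-xi) - 1)

# T-year return level = (1 - 1/T) quantile of the annual-maximum distribution
T = 100
rl = gev_quantile(1 - 1 / T, 0.0, 1.0, 0.0)  # standard Gumbel for illustration
print(f"100-year return level (standard Gumbel): {rl:.2f}")  # → 4.60
```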

Return Periods

The return period of an extreme value is the inverse of the exceedance probability.

Example: An event with a 100-year return period has an annual exceedance probability of 1%, i.e. it corresponds to the 0.99 quantile.

Return levels are associated with the analogous return period.

San Francisco Tide Gauge Data

Code
# load SF tide gauge data
# read in data and get annual maxima
function load_data(fname)
    date_format = DateFormat("yyyy-mm-dd HH:MM:SS")
    # This uses the DataFramesMeta.jl package, which makes it easy to string together commands to load and process data
    df = @chain fname begin
        CSV.read(DataFrame; header=false)
        rename("Column1" => "year", "Column2" => "month", "Column3" => "day", "Column4" => "hour", "Column5" => "gauge")
        # need to reformat the decimal date in the data file
        @transform :datetime = DateTime.(:year, :month, :day, :hour)
        # replace -99999 with missing
        @transform :gauge = ifelse.(abs.(:gauge) .>= 9999, missing, :gauge)
        select(:datetime, :gauge)
    end
    return df
end

d_sf = load_data("data/surge/h551.csv")

# detrend the data to remove the effects of sea-level rise and seasonal dynamics
ma_length = 366
ma_offset = Int(floor(ma_length/2))
moving_average(series,n) = [mean(@view series[i-n:i+n]) for i in n+1:length(series)-n]
dat_ma = DataFrame(datetime=d_sf.datetime[ma_offset+1:end-ma_offset], residual=d_sf.gauge[ma_offset+1:end-ma_offset] .- moving_average(d_sf.gauge, ma_offset))

# group data by year and compute the annual maxima
dat_ma = dropmissing(dat_ma) # drop missing data
# extract the year, group by it, and keep the row with the largest residual in each year
# (the anonymous transform auto-names its output column :datetime_function)
dat_annmax = combine(dat_ma -> dat_ma[argmax(dat_ma.residual), :], groupby(transform(dat_ma, :datetime => x->year.(x)), :datetime_function))
delete!(dat_annmax, nrow(dat_annmax)) # delete 2023; haven't seen much of that year yet
rename!(dat_annmax, :datetime_function => :Year)
select!(dat_annmax, [:Year, :residual])
dat_annmax.residual = dat_annmax.residual / 1000 # convert to m

# make plots
p1 = plot(
    dat_annmax.Year,
    dat_annmax.residual;
    xlabel="Year",
    ylabel="Annual Max Tide Level (m)",
    label=false,
    marker=:circle,
    markersize=5
)
p2 = histogram(
    dat_annmax.residual,
    normalize=:pdf,
    orientation=:horizontal,
    label=:false,
    xlabel="Count",
    ylabel="",
    yticks=[]
)

l = @layout [a{0.7w} b{0.3w}]
plot(p1, p2; layout=l, link=:y, ylims=(1, 1.7), bottom_margin=5mm, left_margin=5mm)
plot!(size=(1000, 450))

# find GEV fit
# for most distributions we could use Distributions.fit(), but this isn't implemented in Distributions.jl for GEV
init_θ = [1.0, 1.0, 1.0]
gev_lik(θ) = -sum(logpdf(GeneralizedExtremeValue(θ[1], θ[2], θ[3]), dat_annmax.residual))
θ_mle = Optim.optimize(gev_lik, init_θ).minimizer
3-element Vector{Float64}:
 1.2587093828052343
 0.05626679672643659
 0.0171950735119165
Figure 2: Annual maxima surge data from the San Francisco, CA tide gauge.

GEV Parameters

  • \(\mu =\) 1.26
  • \(\sigma =\) 0.06
  • \(\xi =\) 0.02
Code
p = histogram(
    dat_annmax.residual,
    normalize=:pdf,
    label="Data",
    xlabel="Annual Maximum (m)",
    ylabel="PDF",
    yticks=[],
    left_margin=10mm,
    right_margin=10mm
)

plot!(GeneralizedExtremeValue(θ_mle[1], θ_mle[2], θ_mle[3]), linewidth=5, label="GEV Fit", color=:gold4)
xlims!((1, 1.75))
plot!(size=(800, 500))
Figure 3: GEV fit to annual maxima of San Francisco Tide Gauge Data

GEV Return Levels

100-year return level: 1.53m

Code
gevcdf(x) = cdf(GeneralizedExtremeValue(θ_mle[1], θ_mle[2], θ_mle[3]), x) # find quantiles of values
rl_100 = quantile(GeneralizedExtremeValue(θ_mle[1], θ_mle[2], θ_mle[3]), 0.99)

plot(1:0.01:1.75, 1 .- gevcdf.(1:0.01:1.75), yaxis=:log, yticks=10 .^ collect(-3.0:1.0:0.0), xlabel="Water Level (m)", ylabel="Exceedance Probability (1/yr)", lw=3)
plot!(1:0.01:rl_100, 0.01 .+ zeros(length(1:0.01:rl_100)), color=:red, lw=2)
plot!(rl_100 .+ zeros(length(-4:0.01:-2)), 10 .^collect(-4.0:0.01:-2.0), color=:red, lw=2)
scatter!([rl_100], [0.01], color=:red, markersize=5)
plot!(size=(800, 500))
Figure 4: GEV fit to annual maxima of San Francisco Tide Gauge Data

GEV vs LogNormal

Code
p = histogram(
    dat_annmax.residual,
    normalize=:pdf,
    label="Data",
    xlabel="Annual Maximum (m)",
    ylabel="PDF",
    yticks=[],
    left_margin=10mm,
    right_margin=10mm
)
plot!(GeneralizedExtremeValue(θ_mle[1], θ_mle[2], θ_mle[3]), linewidth=5, label="GEV Fit", color=:gold4)
plot!(fit(LogNormal, dat_annmax.residual), linewidth=5, label="LogNormal Fit", color=:darkred)
xlims!((1, 1.75))

Figure 5: GEV fit to annual maxima of San Francisco Tide Gauge Data

GEV Q-Q Plot

Code
p1 = qqplot(GeneralizedExtremeValue(θ_mle[1], θ_mle[2], θ_mle[3]), dat_annmax.residual, 
    linewidth=3, markersize=5,
    xlabel="Theoretical Quantile",
    ylabel="Empirical Quantile"
)
plot!(p1, size=(600, 450))

return_periods = 2:500
# get GEV return levels
return_levels = quantile.(GeneralizedExtremeValue(θ_mle[1], θ_mle[2], θ_mle[3]), 1 .- (1 ./ return_periods))
# fit lognormal to get return levels for comparison
lognormal_fit = fit(LogNormal, dat_annmax.residual)
return_levels_lognormal = quantile.(lognormal_fit, 1 .- (1 ./ return_periods))

# function to calculate exceedance-probability plotting positions for the observations
function exceedance_plot_pos(y)
    N = length(y)
    ys = sort(y) # sorted values of y
    nxp = [r / (N + 1) for r in 1:N] # non-exceedance plotting positions (Weibull)
    xp = 1 .- nxp # exceedance probabilities
    return xp, ys
end
xp, ys = exceedance_plot_pos(dat_annmax.residual)

p2 = plot(return_periods, return_levels, linewidth=5, color=:gold4, label="GEV Model Fit", bottom_margin=5mm, left_margin=5mm, right_margin=10mm, legend=:bottomright)
plot!(p2, return_periods, return_levels_lognormal, linewidth=5, color=:darkred, label="LogNormal Model Fit")
scatter!(p2, 1 ./ xp, ys, label="Observations", color=:black, markersize=6)
xlabel!(p2, "Return Period (yrs)")
ylabel!(p2, "Return Level (m)")
xlims!(-1, 300)
plot!(p2, size=(600, 450))

display(p1)
display(p2)
(a) GEV fit to annual maxima of San Francisco Tide Gauge Data
(b)
Figure 6

Be Careful About The Shape Parameter!

  • \(\xi\) can be difficult to constrain;
  • Often lots of uncertainty (large standard errors);
  • Controls frequency of “extreme” extremes

House flood risk sensitivity

Source: Zarekarizi et al. (2020)

Choosing a Block Size

  • Similar to the “bias-variance tradeoff”
  • Small block sizes: poor approximation by GEV, resulting in greater bias
  • Large block sizes: fewer data points, greater estimation variance
  • One year is common for practical reasons, but can go finer with sufficient data so long as block maxima can be treated as independent.

Peaks Over Thresholds

Drawbacks of Block Maxima

The block-maxima approach has two potential drawbacks:

  1. Uses a limited amount of data;
  2. Doesn’t capture the potential for multiple extremes within a block.

Peaks Over Thresholds

Consider the conditional excess distribution function

\[F_u(y) = \mathbb{P}(X - u \leq y \mid X > u)\]

Figure 7: Illustration of the Conditional Excess Function
Figure 8: Histogram of conditional excesses
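For intuition: when \(X\) is Exponential, the memoryless property means the conditional excess distribution is exactly the original distribution (a GPD with \(\xi = 0\)). A quick check in plain Python (an illustration of my own, not from the slides):

```python
import math

def exp_cdf(x, rate=1.0):
    """CDF of an Exponential(rate) random variable."""
    return 1 - math.exp(-rate * x)

u, y = 2.0, 0.7
# F_u(y) = P(X - u <= y | X > u) = (F(u + y) - F(u)) / (1 - F(u))
excess = (exp_cdf(u + y) - exp_cdf(u)) / (1 - exp_cdf(u))
print(excess, exp_cdf(y))  # the two agree: memorylessness
```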

Generalized Pareto Distribution (GPD)

For a wide class of underlying distributions of \(X\), as the threshold \(u\) increases, \(F_u(y)\) is well-approximated by a Generalized Pareto Distribution (GPD):

\[F_u(y) \to G(y) = 1 - \left[1 + \xi\left(\frac{y-\mu}{\sigma}\right)\right]^{-1/\xi},\] defined for \(y\) such that \(1 + \xi(y-\mu)/\sigma > 0\).

Generalized Pareto Distribution (GPD)

Similarly to the GEV distribution, the GPD distribution has three parameters:

  • location \(\mu\);
  • scale \(\sigma > 0\);
  • shape \(\xi\).

GPD Types

  • \(\xi > 0\): heavy-tailed
  • \(\xi = 0\): light-tailed (exponential)
  • \(\xi < 0\): bounded
Code
p1 = plot(-2:0.1:6, GeneralizedPareto(0, 1, 0.5), linewidth=3, color=:red, label=L"$\xi = 1/2$", left_margin=5mm, bottom_margin=10mm)
plot!(-4:0.1:6, GeneralizedPareto(0, 1, 0), linewidth=3, color=:green, label=L"$\xi = 0$")
plot!(-4:0.1:2, GeneralizedPareto(0, 1, -0.5), linewidth=3, color=:blue, label=L"$\xi = -1/2$")
scatter!((-2, 0), color=:red, label=:false)
scatter!((2, 0), color=:blue, label=:false)
ylabel!("Density")
xlabel!(L"$x$")
plot!(size=(600, 450))
Figure 9: Shape of the GPD distribution with different choices of \(\xi\).
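A minimal GPD CDF in plain Python (a helper of my own, `gpd_cdf`), showing the support constraint and how small \(|\xi|\) approaches the exponential (\(\xi = 0\)) case:

```python
import math

def gpd_cdf(y, mu, sigma, xi):
    """CDF of the Generalized Pareto Distribution."""
    z = (y - mu) / sigma
    if z <= 0:
        return 0.0  # support starts at the location mu
    if xi == 0:
        return 1 - math.exp(-z)  # exponential tail
    arg = 1 + xi * z
    if arg <= 0:  # only possible for xi < 0: above the upper bound mu - sigma/xi
        return 1.0
    return 1 - arg ** (-1 / xi)

# small |xi| is numerically close to the xi = 0 (exponential) case
print(gpd_cdf(1.5, 0, 1, 1e-8), gpd_cdf(1.5, 0, 1, 0))
```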

Declustering Exceedances

Exceedances Can Occur In Clusters

Code
# load SF tide gauge data
# read in data and get annual maxima
function load_data(fname)
    date_format = DateFormat("yyyy-mm-dd HH:MM:SS")
    # This uses the DataFramesMeta.jl package, which makes it easy to string together commands to load and process data
    df = @chain fname begin
        CSV.read(DataFrame; header=false)
        rename("Column1" => "year", "Column2" => "month", "Column3" => "day", "Column4" => "hour", "Column5" => "gauge")
        # need to reformat the decimal date in the data file
        @transform :datetime = DateTime.(:year, :month, :day, :hour)
        # replace -99999 with missing
        @transform :gauge = ifelse.(abs.(:gauge) .>= 9999, missing, :gauge)
        select(:datetime, :gauge)
    end
    return df
end

d_sf = load_data("data/surge/h551.csv")

# detrend the data to remove the effects of sea-level rise and seasonal dynamics
ma_length = 366
ma_offset = Int(floor(ma_length/2))
moving_average(series,n) = [mean(@view series[i-n:i+n]) for i in n+1:length(series)-n]
dat_ma = DataFrame(datetime=d_sf.datetime[ma_offset+1:end-ma_offset], residual=d_sf.gauge[ma_offset+1:end-ma_offset] .- moving_average(d_sf.gauge, ma_offset))

# group data by year and compute the annual maxima
dat_ma = dropmissing(dat_ma) # drop missing data


thresh = 1.0
dat_ma_plot = @subset(dat_ma, year.(:datetime) .> 2020)
dat_ma_plot.residual = dat_ma_plot.residual ./ 1000
p1 = plot(dat_ma_plot.datetime, dat_ma_plot.residual; linewidth=2, ylabel="Gauge Weather Variability (m)", label="Observations", legend=:bottom, xlabel="Date/Time", right_margin=10mm, left_margin=5mm, bottom_margin=5mm)
hline!([thresh], color=:red, linestyle=:dash, label="Threshold")
scatter!(dat_ma_plot.datetime[dat_ma_plot.residual .> thresh], dat_ma_plot.residual[dat_ma_plot.residual .> thresh], markershape=:x, color=:black, markersize=3, label="Exceedances")

Figure 10: Peaks Over Thresholds for the SF Tide Gauge Data

The Problem With Clusters

A GPD models conditional exceedances assuming they are independent realizations.

But events which result in clustered exceedances (heat waves, TCs) don’t result in independent exceedances.

Why is this not a problem with block maxima?

Declustering

Arns et al. (2013) note that there is no clear declustering time period to use: we need to rely on physical understanding of events and “typical” durations.

If we have prior knowledge about the duration of physical processes leading to clustered extremes (e.g. storm durations), can use this. Otherwise, need some way to estimate cluster duration from the data.

Extremal Index

The most common approach uses the extremal index \(\theta(u)\), which measures the degree of clustering among exceedances of a given threshold \(u\).

\[0 \leq \theta(u) \leq 1,\]

where \(\theta(u) = 1\) means independence and \(\theta(u) = 0\) means the entire dataset is one cluster.

Extremal Index

\(\theta(u)\) has two meanings:

  1. The “propensity to cluster”: \(\theta\) is the probability that the process has left a given exceedance cluster;
  2. The “reciprocal of the cluster duration”: \(1/\theta\) is the mean duration of a cluster (in units of the observation interval).

Computing the Extremal Index

This estimator is taken from Ferro & Segers (2003).

Let \(N = \sum_{i=1}^n \mathbb{I}(X_i > u)\) be the total number of exceedances.

Denote by \(1 \leq S_1 < \ldots < S_N \leq n\) the exceedance times.

Then the inter-exceedance times are \[T_i = S_{i+1} - S_i, \quad 1 \leq i \leq N-1.\]

Computing the Extremal Index

\[\hat{\theta}(u) = \frac{2\left(\sum_{i=1}^{N-1} T_i\right)^2}{(N-1)\sum_{i=1}^{N-1}T_i^2}\]

Code
# find total number of exceedances and exceedance times
dat_ma.residual = dat_ma.residual ./ 1000 # convert to m
S = findall(dat_ma.residual .> thresh)
N = length(S)
T = diff(S) # get difference between adjacent exceedances
θ = 2 * sum(T)^2 / ((N-1) * sum(T.^2)) # extremal index

For the SF tide gauge data and \(u=1.0 \text{m}\), we get an extremal index (\(\theta(u)\)) of 0.24 and a declustering time (\(1/\theta(u)\)) of 4.0 hours.
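The same estimator takes only a few lines of pure Python (a toy exceedance series of my own; the Julia code above does the equivalent on the tide gauge data):

```python
# exceedance times (indices) for a toy hourly series:
# two clusters of consecutive hours plus one isolated exceedance
S = [10, 11, 12, 50, 51, 90]
N = len(S)
T = [S[i + 1] - S[i] for i in range(N - 1)]  # inter-exceedance times

# Ferro & Segers (2003) estimator of the extremal index
theta = 2 * sum(T) ** 2 / ((N - 1) * sum(t ** 2 for t in T))
print(round(theta, 3), round(1 / theta, 2))  # extremal index, mean cluster duration
```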

Mapping Data To Clusters

# cluster data points which occur within period
function assign_cluster(dat, period)
    cluster_index = 1
    clusters = zeros(Int, length(dat))
    for i in 1:length(dat)
        if clusters[i] == 0
            clusters[findall(abs.(dat .- dat[i]) .<= period)] .= cluster_index
            cluster_index += 1
        end
    end
    return clusters
end

# cluster exceedances that occur within a four-hour window
# @transform is a macro from DataFramesMeta.jl which adds a new column based on a data transformation
dat_exceed = dat_ma[dat_ma.residual .> thresh, :]
dat_exceed = @transform dat_exceed :cluster = assign_cluster(:datetime, Dates.Hour(4))
# find maximum value within cluster
dat_decluster = combine(dat_exceed -> dat_exceed[argmax(dat_exceed.residual), :], 
    groupby(dat_exceed, :cluster))
dat_decluster
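The clustering step above can be sketched in pure Python (my own simplified version with integer timestamps in place of `DateTime`s; `period` plays the role of the four-hour window):

```python
def assign_cluster(times, period):
    """Assign each point to a cluster: points within `period` of an
    unassigned seed point share that seed's cluster."""
    clusters = [0] * len(times)
    cluster_index = 1
    for i, t in enumerate(times):
        if clusters[i] == 0:  # t becomes the seed of a new cluster
            for j, s in enumerate(times):
                if abs(s - t) <= period:
                    clusters[j] = cluster_index
            cluster_index += 1
    return clusters

# exceedances at hours 0-2 and 10-11 form clusters; hour 30 stands alone
print(assign_cluster([0, 1, 2, 10, 11, 30], 4))  # → [1, 1, 1, 2, 2, 3]
```

Declustering then keeps only the maximum value within each cluster, as in the Julia `combine` call above.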

Declustered Distribution

Code
p = histogram(dat_decluster.residual .- thresh,
    normalize = :pdf,
    label="Data",
    xlabel="Threshold Exceedance (m)",
    ylabel="PDF",
    yticks=[],
    left_margin=10mm, 
    right_margin=10mm,
    bottom_margin=5mm
    )

Figure 11: Histogram of clustered exceedances for SF tide gauge data.

GPD Fit

Code
# fit GPD
init_θ = [1.0, 1.0]
low_bds = [0.0, -Inf]
up_bds = [Inf, Inf]
gpd_lik(θ) = -sum(logpdf(GeneralizedPareto(0.0, θ[1], θ[2]), dat_decluster.residual .- thresh))
θ_mle = Optim.optimize(gpd_lik, low_bds, up_bds, init_θ).minimizer
p1 = plot!(p, GeneralizedPareto(0.0, θ_mle[1], θ_mle[2]), linewidth=3, label="GPD Fit")
plot!(size=(600, 450))

# Q-Q Plot
p2 = qqplot(GeneralizedPareto(0.0, θ_mle[1], θ_mle[2]), dat_decluster.residual .- thresh, 
    xlabel="Theoretical Quantile",
    ylabel="Empirical Quantile",
    linewidth=3,
    left_margin=5mm, 
    right_margin=10mm,
    bottom_margin=5mm)
plot!(size=(600, 450))

display(p1)
display(p2)
(a) GPD fit to tide gauge readings over 1m of San Francisco Tide Gauge Data
(b)
Figure 12

But What About Exceedance Frequency?

The GPD fit gives a distribution for how extreme threshold exceedances are when they occur.

But how often do they occur?

Code
# add column with years of occurrence
dat_decluster = @transform dat_decluster :year = Dates.year.(dat_decluster.datetime)
# group by year and add up occurrences
exceed_counts = combine(groupby(dat_decluster, :year), nrow => :count)
delete!(exceed_counts, nrow(exceed_counts)) # including 2023 will bias the count estimate
p = histogram(exceed_counts.count, legend=:false, 
    xlabel="Yearly Exceedances",
    ylabel="Count",
    left_margin=5mm,
    bottom_margin=10mm
)
plot!(size=(600, 400))
Figure 13: Histogram of the number of exceedances in each year.

Poisson - Generalized Pareto Process

Model the number of (declustered) exceedances each year with a Poisson distribution:

\[n \sim \text{Poisson}(\lambda_u),\]

The MLE for \(\lambda_u\) is the mean of the count data, in this case 49.5.

Then, for each \(i=1, \ldots, n\), sample \[X_i \sim \text{GeneralizedPareto}(u, \sigma, \xi).\]

Poisson - Generalized Pareto Process Return Levels

Then the return level for return period \(m\) years can be obtained by solving the quantile equation (see Coles (2001) for details):

\[\text{RL}_m = \begin{cases}u + \frac{\sigma}{\xi} \left((m\lambda_u)^\xi - 1\right) & \text{if}\ \xi \neq 0 \\ u + \sigma \log(m\lambda_u) & \text{if}\ \xi = 0.\end{cases}\]

Poisson-GPP Example

Question: What is the 100-year return level for the SF data?

Code
count_dist = fit(Poisson, exceed_counts.count)
λ = params(count_dist)[1]

p = histogram(exceed_counts.count, legend=:false, 
    xlabel="Yearly Exceedances",
    ylabel="Count",
    normalize=:pdf,
    left_margin=5mm,
    bottom_margin=10mm
)
plot!(size=(600, 500))
plot!(p, count_dist)
Figure 14: Histogram of number of exceedances in each year with Poisson fit

Poisson-GPP Example

Poisson: \(\lambda =\) 50.0

GPD parameters:

  • \(\mu = 0\) (fixed)
  • \(\sigma =\) 0.11
  • \(\xi =\) -0.17

\[ \begin{align*} \text{RL}_{100} &= u + \frac{\sigma}{\xi} \left((m\lambda)^\xi - 1\right) \\ &= 1 - \frac{0.11}{0.17}\left((100 \times 50)^{-0.17} - 1\right) \\ &\approx 1.5 \text{ m} \end{align*} \]
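The arithmetic above is easy to check in a few lines (plain Python, using the rounded parameter values from the slides):

```python
import math

def gpp_return_level(m, u, lam, sigma, xi):
    """Poisson-GP return level for return period m (years)."""
    if xi == 0:
        return u + sigma * math.log(m * lam)
    return u + (sigma / xi) * ((m * lam) ** xi - 1)

rl_100 = gpp_return_level(100, u=1.0, lam=50.0, sigma=0.11, xi=-0.17)
print(f"{rl_100:.2f} m")  # ≈ 1.5 m, consistent with the GEV-based estimate
```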

GPP-GEV Comparison

Pros and Cons

GEV: you must choose a block size; it throws away more data and captures only one extreme per block, but it is easier to fit and return levels are trivial to compute.

GPP: you must set a threshold and deal with clusters, but it models both frequency and magnitude and can capture multiple events per period; return levels are more cumbersome to compute.

But…

We should get similar return levels in either case. The difference comes down to:

  • whether you care about the possibility of multiple exceedances per year (GPP);
  • or only about the “worst case” event over a period of time (GEV).

Unfortunately, this choice is often made for reasons of simplicity or data length, which is always the wrong approach: let your question dictate your methods.

Key Points And Upcoming Schedule

Upcoming Schedule

Friday: Non-Stationary Extremes

Next Week: Model Evaluation and Comparison

Assessments

Project Proposal due Friday (2/27)

References

Arns, A., Wahl, T., Haigh, I. D., Jensen, J., & Pattiaratchi, C. (2013). Estimating extreme water level probabilities: A comparison of the direct methods and recommendations for best practise. Coast. Eng., 81, 51–66. https://doi.org/10.1016/j.coastaleng.2013.07.003
Coles, S. (2001). An Introduction to Statistical Modeling of Extreme Values. London: Springer-Verlag London.
Ferro, C. A. T., & Segers, J. (2003). Inference for clusters of extreme values. J. R. Stat. Soc. Series B Stat. Methodol., 65, 545–556. https://doi.org/10.1111/1467-9868.00401
Zarekarizi, M., Srikrishnan, V., & Keller, K. (2020). Neglecting uncertainties biases house-elevation decisions to manage riverine flood risks. Nat. Commun., 11, 5361. https://doi.org/10.1038/s41467-020-19188-9