import Pkg
Pkg.activate(@__DIR__)
Pkg.instantiate()
Homework 3 Solutions
BEE 4850/5850
Overview
Instructions
The goal of this homework assignment is to practice developing and working with probability models for data.
- Problem 1 asks you to diagnose autocorrelation and develop an AR model for the weather-based variability at the Sewell’s Point tide gauge.
- Problem 2 asks you to fit a sea-level rise model under both assumptions of Gaussian i.i.d. residuals and autocorrelated residuals.
- Problem 3 asks you to conduct an analysis of extreme temperatures in Ithaca under both stationary and non-stationary assumptions.
- Problem 4 asks you to model how often Ithaca temperatures exceed thresholds.
Load Environment
The following code loads the environment and makes sure all needed packages are installed. This should be at the start of most Julia scripts.
The following packages are included in the environment; the desired functionality for each package is commented next to it, which can also help you find similar packages in other languages. The code below loads these packages for use in the subsequent notebook.
using Random # random number generation and seed-setting
using DataFrames # tabular data structure
using DataFramesMeta
using CSV # reads/writes .csv files
using Distributions # interface to work with probability distributions
using Plots # plotting library
using StatsBase # statistical quantities like mean, median, etc
using StatsPlots # some additional statistical plotting tools
using Optim # optimization tools
using Dates # date-time interface
Random.seed!(1)
Problems
Scoring
- Problem 1 is worth 5 points.
- Problem 2 is worth 8 points.
- Problem 3 is worth 7 points.
- Problem 4 is worth 5 points.
Problem 1
Problem 1.1
To load and process the data:
tide_dat = CSV.read(joinpath("data", "norfolk-hourly-surge-2015.csv"), DataFrame)
surge_resids = tide_dat[:, 5] - tide_dat[:, 3]
plot(surge_resids; ylabel="Gauge Measurement (m)", label="Weather-Related Residual", legend=:topleft, xlabel="Hour Number")
Let’s look at the partial autocorrelation function of the residuals to determine what the autoregressive lag should be.
resid_acf = pacf(surge_resids, 1:5)
plot(resid_acf, marker=:circle, line=:stem, linewidth=3, markersize=5, legend=false, ylabel="Partial Autocorrelation", xlabel="Time Lag")
There is clearly a strong lag-1 autocorrelation in Figure 2, so we would need to use at least an AR(1) model. You could also argue for including a second lag, as the lag-2 partial autocorrelation is non-negligible, but this is less critical than the lag-1 term.
Problem 1.2
This AR(1) model is \[\begin{aligned} y_t &= \alpha + \rho y_{t-1} + \varepsilon_t \\ \varepsilon_t &\sim N(0, \sigma) \\ y_0 &\sim N\left(\frac{\alpha}{1 - \rho}, \frac{\sigma^2}{1-\rho^2}\right). \end{aligned}\]
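The initial-condition distribution follows from stationarity: taking the mean and variance of both sides of the recursion and requiring them to be constant in time gives
\[\begin{aligned} \mathbb{E}[y] = \alpha + \rho \mathbb{E}[y] \quad &\Rightarrow \quad \mathbb{E}[y] = \frac{\alpha}{1 - \rho}, \\ \text{Var}(y) = \rho^2 \text{Var}(y) + \sigma^2 \quad &\Rightarrow \quad \text{Var}(y) = \frac{\sigma^2}{1 - \rho^2}. \end{aligned}\]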
Let’s implement this model and fit it. We’ll specify the likelihood based on whitening the residuals and computing the individual normal likelihoods of the resulting terms, but you should get the same parameters if you used the joint likelihood of the entire time series.
function llik_ar(params, y)
a, p, s = params
ll = 0 # initialize log-likelihood counter
ll += logpdf(Normal(a / (1 - p), s/sqrt(1-p^2)), y[1]) # Normal() in Distributions.jl is parameterized with the standard deviation, not the variance
for i = 1:length(y)-1
residual_whitened = y[i+1] - (a + p * y[i])
ll += logpdf(Normal(0, s), residual_whitened)
end
return ll
end
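As a quick sanity check on the whitening idea (with made-up parameter values, not the fitted ones), simulating an AR(1) series and subtracting the conditional mean should recover i.i.d. \(N(0, \sigma)\) noise:

```julia
using Random, Statistics

Random.seed!(1)
# illustrative (not fitted) AR(1) parameters
α, ρ, σ = 0.5, 0.7, 0.3
T = 10_000
y = zeros(T)
y[1] = α / (1 - ρ) + (σ / sqrt(1 - ρ^2)) * randn()  # stationary initial value
for t in 2:T
    y[t] = α + ρ * y[t-1] + σ * randn()
end
# whiten: subtract the conditional mean given the previous value
ε = y[2:end] .- (α .+ ρ .* y[1:end-1])
println(round(std(ε); digits=2))  # should be close to σ = 0.3
```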
# set up lower and upper bounds for the parameters for the optimization
lbds = [0.0, 0.0, 0.1]
ubds = [5.0, 1.0, 5.0]
p0 = [1.0, 0.4, 0.5]
optim_out = Optim.optimize(p -> -llik_ar(p, surge_resids), lbds, ubds, p0)
p_tide_mle = optim_out.minimizer
@show round.(p_tide_mle; digits=2)
Note: the tuple assignment a, p, s = params unpacks a vector or tuple into named variables, which increases the readability of the subsequent code (versus only referring to e.g. p[1]).
Now we can simulate these residuals and add them back to the NOAA model predictions using the mapslices() function. The resulting histograms of the maximum and median values are shown in Figure 3.
# T is the length of the simulated series, N the number of simulations
function simulate_ar1(a, p, s, T, N)
y = zeros(T, N) # initialize storage
y[1, :] = rand(Normal(a / (1 - p), s/sqrt(1-p^2)), N) # sample initial value
# simulate the remaining residuals
for t = 2:T
y[t, :] = a .+ p * y[t-1, :] + rand(Normal(0, s), N)
end
return y
end
# simulate new observations
simulated_resids = simulate_ar1(p_tide_mle[1], p_tide_mle[2], p_tide_mle[3], length(surge_resids), 10_000)
simulated_obs = mapslices(col -> col + tide_dat[:, 3], simulated_resids, dims=1)
# plot histogram
max_obs = [maximum(simulated_obs[:, n]) for n in 1:size(simulated_obs, 2)]
med_obs = [median(simulated_obs[:, n]) for n in 1:size(simulated_obs, 2)]
p1 = histogram(max_obs, xlabel="Maximum Simulated Gauge Level (m)", ylabel="Count", label="Simulations")
vline!([maximum(tide_dat[:, 5])], color=:red, linestyle=:dash, label="Observation")
p2 = histogram(med_obs, xlabel="Median Simulated Gauge Level (m)", ylabel="Count", label="Simulations")
vline!([median(tide_dat[:, 5])], color=:red, linestyle=:dash, label="Observation")
plot(p1, p2, layout=(2, 1))
The median distribution fits the data well, but the observed maximum of 1.98 m is below all of the simulated maxima, which could reflect a structural problem with the model. For example, an AR(2) might have been a better fit.
Problem 2
Problem 2.1
First, load the data and normalize. We’ll plot the resulting series (though this wasn’t required).
norm_yrs = 1880:1900
sl_dat = DataFrame(CSV.File(joinpath("data", "CSIRO_Recons_gmsl_yr_2015.csv")))
rename!(sl_dat, [:Year, :GMSLR, :SD]) # rename to make columns easier to work with
sl_dat[!, :Year] .-= 0.5 # shift year to line up with years instead of being half-year
sl_dat[!, :GMSLR] .-= mean(filter(row -> row.Year ∈ norm_yrs, sl_dat)[!, :GMSLR]) # rescale to be relative to 1880-1900 mean for consistency with temperature anomaly
# load temperature data
temp_dat = DataFrame(CSV.File(joinpath("data", "HadCRUT.5.0.2.0.analysis.summary_series.global.annual.csv")))
rename!(temp_dat, [:Year, :Temp, :Lower, :Upper]) # rename to make columns easier to work with
filter!(row -> row.Year ∈ sl_dat[!, :Year], temp_dat) # reduce to the same years that we have SL data for
temp_normalize = mean(filter(row -> row.Year ∈ norm_yrs, temp_dat)[!, :Temp]) # get renormalization to rescale temperature to 1880-1900 mean
temp_dat[!, :Temp] .-= temp_normalize
sl_plot = scatter(sl_dat[!, :Year], sl_dat[!, :GMSLR], color=:black, label="Observations", ylabel="(mm)", xlabel="Year", title="Sea Level Anomaly")
temp_plot = scatter(temp_dat[!, :Year], temp_dat[!, :Temp], color=:black, label="Observations", ylabel="(°C)", xlabel="Year", title="Temperature")
plot(sl_plot, temp_plot, layout=(2, 1))
Problem 2.2
The discretized model is:
\[ \begin{aligned} S(t+1) &= S(t) + \Delta t \frac{S_\text{eq}(t) - S(t)}{\tau} \\ S_\text{eq}(t) &= a T(t) + b, \end{aligned} \]
where we take \(\Delta t = 1\). Then in Julia (as an example), this might look like the following function.
function grinsted_slr(params, temps; Δt=1)
a, b, l, S₀ = params
S = zeros(length(temps)) # initialize storage
Seq = a * temps .+ b
S[1] = S₀
for i = 2:length(S)
S[i] = S[i-1] + Δt * (Seq[i] - S[i-1]) / l
end
return S
end
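As a quick sanity check on the discretization (with made-up parameters, not the fitted ones), holding temperature constant should relax \(S\) toward the equilibrium \(S_\text{eq} = aT + b\):

```julia
# lightly simplified copy of grinsted_slr so this sketch runs standalone
function grinsted_slr(params, temps; Δt=1)
    a, b, l, S₀ = params
    S = zeros(length(temps))  # initialize storage
    Seq = a * temps .+ b
    S[1] = S₀
    for i = 2:length(S)
        S[i] = S[i-1] + Δt * (Seq[i] - S[i-1]) / l
    end
    return S
end

temps = fill(1.0, 5_000)  # hold temperature at 1 °C
S = grinsted_slr((500.0, 200.0, 100.0, 0.0), temps)
println(round(S[end]; digits=1))  # relaxes to a*T + b = 700.0
```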
The next function lets us calculate the log-likelihood if we assume i.i.d. Gaussian residuals.
function gaussian_iid_loglik(params, temp_dat, slr_obs, Δt=1.0)
a, b, l, S₀, s = params
slr_sim = grinsted_slr((a, b, l, S₀), temp_dat; Δt = Δt)
ll = sum(logpdf.(Normal.(slr_sim, s), slr_obs))
return ll
end
Note: the log-likelihood line makes heavy use of broadcasting; a loop would work just as well.
Now, let’s find the MLE by optimizing the gaussian_iid_loglik function.
low_bds = [100.0, 100.0, 0.1, -100.0, 0.0]
up_bds = [2000.0, 500.0, 2500.0, 100.0, 100.0]
p₀ = [1000.0, 200.0, 1500.0, 0.0, 10.0]
mle_optim = optimize(p -> -gaussian_iid_loglik(p, temp_dat.Temp, sl_dat.GMSLR), low_bds, up_bds, p₀)
p_mle_iid = mle_optim.minimizer
@show round.(p_mle_iid; digits=2)
round.(p_mle_iid; digits = 2) = [566.92, 222.66, 189.11, -10.36, 5.81]
The sensitivity of GMSLR to increases in GMT is \(a / \tau\):
\[S(t+1) = S(t) + \frac{\overbrace{(aT + b)}^{S_\text{eq}} - S(t)}{\tau}.\]
With the fitted values, this is \(566.92 / 189.11 \approx 3.0\) mm/(yr\(\cdot ^\circ\)C).
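A quick arithmetic check of this sensitivity using the MLE values printed above:

```julia
a, τ = 566.92, 189.11  # a and τ from the MLE printed above
println(round(a / τ; digits=1))  # 3.0
```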
Finally, the i.i.d. Gaussian model had two key assumptions: Gaussian-distributed residuals and independent residuals. To check the first, we can use a Gaussian Q-Q plot, and the second, a partial autocorrelation plot.
grinsted_proj_mle = grinsted_slr(p_mle_iid, temp_dat.Temp)
resids = sl_dat.GMSLR - grinsted_proj_mle
p1 = qqnorm(resids, ylabel="Theoretical Values", xlabel="Empirical Values")
pacf_resids = pacf(resids, 1:5)
p2 = plot(1:5, pacf_resids, marker=:circle, line=:stem, markersize=5, legend=false, ylabel="Partial Autocorrelation", xlabel="Time Lag")
display(p1)
display(p2)
We can see that while the residuals appear reasonably well described by a Gaussian distribution based on the Q-Q plot, they have a substantial lag-1 autocorrelation of about 0.52, so the independence assumption may be inappropriate.
Problem 2.3
This will look like the previous problem, only with an AR(1) likelihood for the residuals instead of the Gaussian residuals. We will assume that the residual model is mean-zero, so \(\alpha = 0\).
function gaussian_ar_loglik(params, temp_dat, slr_obs, Δt=1.0)
a, b, l, S₀, p, s = params
slr_sim = grinsted_slr((a, b, l, S₀), temp_dat; Δt = Δt)
slr_resids = slr_obs - slr_sim
ll = logpdf(Normal(0, s / sqrt(1 - p^2)), slr_resids[1])
for i = 2:length(slr_sim)
ll += logpdf(Normal(0, s), slr_resids[i] - p * slr_resids[i-1])
end
return ll
end
Now, let’s find the MLE by optimizing the gaussian_ar_loglik function.
low_bds = [100.0, 100.0, 0.1, -100.0, 0.0, 0.1]
up_bds = [2000.0, 500.0, 2500.0, 100.0, 1.0, 100.0]
p₀ = [1000.0, 200.0, 1500.0, 0.0, 0.5, 10.0]
mle_optim = optimize(p -> -gaussian_ar_loglik(p, temp_dat.Temp, sl_dat.GMSLR), low_bds, up_bds, p₀)
p_mle_ar = mle_optim.minimizer
@show round.(p_mle_ar; digits=2)
round.(p_mle_ar; digits = 2) = [557.57, 216.77, 183.68, -10.44, 0.52, 4.95]
Comparing these parameters to those from Problem 2.2, all of the parameters are a little smaller in the AR(1) residual model. The resulting sensitivity of GMSLR to increases in GMT:
\[S(t+1) = S(t) + \frac{\overbrace{(aT + b)}^{S_\text{eq}} - S(t)}{\tau}.\]
With the fitted values, this is \(557.57 / 183.68 \approx 3.0\) mm/(yr\(\cdot ^\circ\)C), just slightly larger than with the i.i.d. Gaussian residuals and unlikely to make a meaningful difference to SLR projections: even with 4\(^\circ\)C of warming, the change in sensitivity amounts to well under 1 mm/yr of additional GMSLR. This can matter more in other examples, however!
Problem 3
Problem 3.1
Let’s load the data from ‘data/gfdl-esm4-tempmax-ithaca.csv’.
dat = CSV.read(joinpath("data", "gfdl-esm4-tempmax-ithaca.csv"), DataFrame, delim=",")
| Row | Day | TempMax |
|---|---|---|
| | Date | Float64 |
| 1 | 1850-01-01 | 2.33492 |
| 2 | 1850-01-02 | 5.37603 |
| 3 | 1850-01-03 | 2.89517 |
| 4 | 1850-01-04 | -5.93915 |
| 5 | 1850-01-05 | -3.71647 |
| 6 | 1850-01-06 | -5.10975 |
| 7 | 1850-01-07 | -2.69068 |
| 8 | 1850-01-08 | 0.0958496 |
| 9 | 1850-01-09 | 0.242303 |
| 10 | 1850-01-10 | -0.239172 |
| 11 | 1850-01-11 | -1.82245 |
| 12 | 1850-01-12 | -4.21055 |
| 13 | 1850-01-13 | -3.57349 |
| ⋮ | ⋮ | ⋮ |
| 60214 | 2014-11-10 | 4.02496 |
| 60215 | 2014-11-11 | -2.068 |
| 60216 | 2014-11-12 | -1.11286 |
| 60217 | 2014-11-13 | 6.14102 |
| 60218 | 2014-11-14 | 11.2776 |
| 60219 | 2014-11-15 | -2.47153 |
| 60220 | 2014-11-16 | -5.6442 |
| 60221 | 2014-11-17 | -3.63846 |
| 60222 | 2014-11-18 | 8.89431 |
| 60223 | 2014-11-19 | -0.551337 |
| 60224 | 2014-11-20 | -0.311835 |
| 60225 | 2014-11-21 | 7.76135 |
First, we want to find the annual maxima.
dat.Year = year.(dat.Day)
dat_gp = @groupby(dat, :Year)
dat_max = @combine(dat_gp, :AnnMax=maximum(:TempMax))
plot(dat_max.Year, dat_max.AnnMax, lw=2, marker=:o, markersize=3, xlabel="Year", ylabel="Annual Maximum Temperature (°C)")
Problem 3.2
Now, let’s fit a stationary GEV model.
function gev_stat_loglik(p, x)
u, s, xi = p
ll = sum(logpdf.(GeneralizedExtremeValue(u, s, xi), x))
return ll
end
lb = [0.0, 0.1, -1.0]
ub = [50.0, 10.0, 1.0]
x₀ = [30.0, 5.0, 0.0]
optim_out = Optim.optimize(p -> -gev_stat_loglik(p, dat_max.AnnMax), lb, ub, x₀)
p_stat = optim_out.minimizer
@show round.(p_stat; digits=1);
round.(p_stat; digits = 1) = [27.8, 1.0, -0.2]
This is a Weibull-type GEV (the negative shape parameter \(\xi\) implies a bounded upper tail), and the 100-year return level is the 0.99 quantile, or 30.8\(^\circ C\).
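As a quick check, the GEV quantile function (for \(\xi \neq 0\)) gives the return level in closed form; a stdlib-only sketch using the fitted values:

```julia
# GEV quantile: x_p = μ + (σ/ξ) * ((-log(p))^(-ξ) - 1) for ξ ≠ 0
μ, σ, ξ = 27.8, 1.0, -0.2  # fitted stationary GEV parameters from above
p = 0.99                   # 100-year return level = 0.99 annual quantile
rl = μ + (σ / ξ) * ((-log(p))^(-ξ) - 1)
println(round(rl; digits=1))  # 30.8
```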
Let’s plot the fitted distribution and the data.
histogram(dat_max.AnnMax, normalize=:pdf, label="Data", xlabel="Annual Maximal Temperature (°C)", ylabel="PDF", color=:lightblue)
plot!(GeneralizedExtremeValue(p_stat[1], p_stat[2], p_stat[3]), color=:red, label="Fitted Stationary GEV", lw=2)
Problem 3.3
Now, let’s fit a nonstationary GEV, where the location parameter \(\mu\) is a function of time, e.g. \(\mu = \mu_0 + \mu_1 t\).
function gev_nonstat_loglik(p, x, yr)
u₀, u₁, s, xi = p
t = yr .- yr[1]
u = u₀ .+ u₁ .* t
ll = sum(logpdf.(GeneralizedExtremeValue.(u, s, xi), x))
return ll
end
lb = [0.0, 0.0, 0.1, -1.0]
ub = [40.0, 1.0, 10.0, 1.0]
x₀ = [10.0, 0.5, 5.0, 0.0]
optim_out = Optim.optimize(p -> -gev_nonstat_loglik(p, dat_max.AnnMax, dat_max.Year), lb, ub, x₀)
p_nonstat = optim_out.minimizer
@show round.(p_nonstat; digits=1);
round.(p_nonstat; digits = 1) = [27.7, 0.0, 1.0, -0.2]
We want to calculate and plot how the 100-year return level changes over time according to this model. We also want to compare to the stationary 100-year return level from Problem 3.2.
u = collect(p_nonstat[1] .+ p_nonstat[2] .* (0:nrow(dat_max)-1)) # use t = yr .- yr[1], matching the convention in the likelihood
nonstat_rl = quantile.(GeneralizedExtremeValue.(u, p_nonstat[3], p_nonstat[4]), 0.99)
plot(dat_max.Year, nonstat_rl, color=:black, lw=2, xlabel="Year", ylabel="100-Year Return Level", label="Nonstationary Model")
hline!([quantile(GeneralizedExtremeValue(p_stat[1], p_stat[2], p_stat[3]), 0.99)], color=:red, linestyle=:dash, lw=2, label="Stationary Model")
The non-stationary return levels increase from about 30.6\(^\circ C\) to closer to 31\(^\circ C\), while the stationary return level is about the average of that range.
Problem 4
Problem 4.1
We don’t need to re-load the dataset, so let’s find the exceedances over 28\(^\circ\) C.
temp_high = dat[dat.TempMax .> 28, :]
| Row | Day | TempMax | Year |
|---|---|---|---|
| | Date | Float64 | Int64 |
| 1 | 1853-08-13 | 28.734 | 1853 |
| 2 | 1854-07-12 | 29.5106 | 1854 |
| 3 | 1854-07-18 | 28.131 | 1854 |
| 4 | 1854-07-19 | 28.9641 | 1854 |
| 5 | 1854-07-31 | 28.1494 | 1854 |
| 6 | 1854-08-01 | 28.2747 | 1854 |
| 7 | 1854-08-02 | 28.6134 | 1854 |
| 8 | 1854-08-03 | 28.7721 | 1854 |
| 9 | 1857-07-07 | 28.0677 | 1857 |
| 10 | 1858-07-29 | 28.0767 | 1858 |
| 11 | 1860-07-01 | 28.5655 | 1860 |
| 12 | 1860-07-29 | 29.743 | 1860 |
| 13 | 1860-08-08 | 28.4803 | 1860 |
| ⋮ | ⋮ | ⋮ | ⋮ |
| 220 | 2013-07-08 | 28.2372 | 2013 |
| 221 | 2013-07-10 | 29.2171 | 2013 |
| 222 | 2013-07-11 | 29.7322 | 2013 |
| 223 | 2013-07-13 | 29.0477 | 2013 |
| 224 | 2014-05-23 | 28.3297 | 2014 |
| 225 | 2014-05-31 | 28.6978 | 2014 |
| 226 | 2014-06-17 | 28.5005 | 2014 |
| 227 | 2014-06-18 | 29.4382 | 2014 |
| 228 | 2014-06-19 | 30.1999 | 2014 |
| 229 | 2014-06-20 | 30.032 | 2014 |
| 230 | 2014-06-21 | 28.7772 | 2014 |
| 231 | 2014-07-09 | 28.6819 | 2014 |
To decluster, we calculate the extremal index using the estimator \[\hat{\theta}(u) = \frac{2\left(\sum_{i=1}^{N-1} T_i\right)^2}{(N-1)\sum_{i=1}^{N-1}T_i^2}.\]
S = findall(dat.TempMax .> 28)
N = length(S)
T = diff(S)
θ = 2 * sum(T)^2 / ((N-1) * sum(T.^2))
0.47832378879687915
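The declustering window comes from the reciprocal of the extremal index; a quick check using the value printed above:

```julia
θ = 0.47832378879687915  # extremal index computed above
println(round(1 / θ; digits=1))  # 2.1, i.e. roughly a 2-day window
```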
Since \(1/\hat{\theta} \approx 2.1\), a declustering time of about 2 days is appropriate. Now we want to assign observations that fall within that window to the appropriate clusters. Let’s write a function to do that and then take the maximum value of the exceedances within each cluster.
# cluster data points which occur within period
function assign_cluster(dat, period)
cluster_index = 1
clusters = zeros(Int, length(dat))
for i in 1:length(dat)
if clusters[i] == 0
clusters[findall(abs.(dat .- dat[i]) .<= period)] .= cluster_index
cluster_index += 1
end
end
return clusters
end
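To make the clustering logic concrete, here is a small self-contained check with hypothetical dates (the first three fall within two days of each other and should share a cluster):

```julia
using Dates

# copy of assign_cluster from above, repeated so this sketch runs standalone
function assign_cluster(dat, period)
    cluster_index = 1
    clusters = zeros(Int, length(dat))
    for i in 1:length(dat)
        if clusters[i] == 0
            clusters[findall(abs.(dat .- dat[i]) .<= period)] .= cluster_index
            cluster_index += 1
        end
    end
    return clusters
end

days = Date.(["2014-06-17", "2014-06-18", "2014-06-19", "2014-07-09"])
println(assign_cluster(days, Day(2)))  # [1, 1, 1, 2]
```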
# cluster exceedances that occur within a two-day window
# @transform is a macro from DataFramesMeta.jl which adds a new column based on a data transformation
temp_high = @transform temp_high :cluster = assign_cluster(:Day, Dates.Day(2))
# find maximum value within cluster
temp_decluster = combine(temp_high -> temp_high[argmax(temp_high.TempMax), :], groupby(temp_high, :cluster))
| Row | cluster | Day | TempMax | Year |
|---|---|---|---|---|
| | Int64 | Date | Float64 | Int64 |
| 1 | 1 | 1853-08-13 | 28.734 | 1853 |
| 2 | 2 | 1854-07-12 | 29.5106 | 1854 |
| 3 | 3 | 1854-07-19 | 28.9641 | 1854 |
| 4 | 4 | 1854-07-31 | 28.1494 | 1854 |
| 5 | 5 | 1854-08-03 | 28.7721 | 1854 |
| 6 | 6 | 1857-07-07 | 28.0677 | 1857 |
| 7 | 7 | 1858-07-29 | 28.0767 | 1858 |
| 8 | 8 | 1860-07-01 | 28.5655 | 1860 |
| 9 | 9 | 1860-07-29 | 29.743 | 1860 |
| 10 | 10 | 1860-08-08 | 28.4803 | 1860 |
| 11 | 11 | 1861-06-30 | 28.3882 | 1861 |
| 12 | 12 | 1861-07-22 | 29.5198 | 1861 |
| 13 | 13 | 1861-08-10 | 28.1314 | 1861 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 153 | 153 | 2011-06-20 | 29.1638 | 2011 |
| 154 | 154 | 2013-05-30 | 29.2171 | 2013 |
| 155 | 155 | 2013-06-06 | 28.669 | 2013 |
| 156 | 156 | 2013-06-07 | 30.2114 | 2013 |
| 157 | 157 | 2013-06-17 | 28.2765 | 2013 |
| 158 | 158 | 2013-07-08 | 28.2372 | 2013 |
| 159 | 159 | 2013-07-11 | 29.7322 | 2013 |
| 160 | 160 | 2014-05-23 | 28.3297 | 2014 |
| 161 | 161 | 2014-05-31 | 28.6978 | 2014 |
| 162 | 162 | 2014-06-17 | 28.5005 | 2014 |
| 163 | 163 | 2014-06-19 | 30.1999 | 2014 |
| 164 | 164 | 2014-07-09 | 28.6819 | 2014 |
Now we want to count the number of exceedances by year and plot how these occurrences are changing over time.
temp_decluster.Year = year.(temp_decluster.Day)
temp_count = combine(groupby(temp_decluster, :Year), nrow)
bar(temp_count.Year, temp_count.nrow, label="Number of Clusters")
Visually, it appears as though there has been an uptick in the number of clusters post-2000, with more years containing 3+ clusters. This would be hard to pull out without a model that depends on a broader warming signal, since most of the data record looks stationary, so it’s unclear whether this is actually a trend. As a result, we might use a stationary Poisson process for the occurrences, or we could try to fit one based on a covariate like global mean temperature.
Problem 4.2
Let’s now fit a stationary GPD to see how the distribution of conditional exceedances looks.
temp_exceedances = temp_high.TempMax .- 28
gpd_fit = Optim.optimize(p -> -sum(logpdf.(GeneralizedPareto(0.0, p[1], p[2]), temp_exceedances)), [0.0, -Inf], [Inf, Inf], [1.0, 1.0]).minimizer
2-element Vector{Float64}:
0.8510410229847296
-0.18775785233967718
Let’s visualize this distribution (Figure 7).
histogram(temp_decluster.TempMax, normalize=:pdf, label="Data", xlabel="Conditional Excess Distribution (°C)", ylabel="PDF", color=:lightblue)
plot!(GeneralizedPareto(28.0, gpd_fit[1], gpd_fit[2]), color=:red, label="Fitted Stationary GPD", lw=2)
Now, let’s fit a Poisson model for the number of exceedances in each year.
poisson_fit = Optim.optimize(p -> -sum(logpdf.(Poisson(p[1]), temp_count.nrow)), [0.1], [10.0], [2.0]).minimizer
@show round(poisson_fit[1]; digits=2);
round(poisson_fit[1]; digits = 2) = 1.73
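Incidentally, the Poisson MLE has a closed form, the sample mean of the counts, so the optimizer output can be checked directly (hypothetical counts shown here, not the real data):

```julia
using Statistics

counts = [2, 1, 0, 3, 2, 1, 3, 2]  # hypothetical annual exceedance counts
λ̂ = mean(counts)                   # Poisson MLE = sample mean
println(round(λ̂; digits=2))  # 1.75
```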
The 100-year return level is given by the solution to the quantile equation,
\[ \begin{aligned} \text{RL}_{100} &= u + \frac{\sigma}{\xi}\left((100\lambda)^\xi - 1\right) \\ &= 28 - \frac{0.85}{0.19}\left((100 \cdot 1.73)^{-0.19} - 1\right) \\ &= 30.8^\circ \text{C} \end{aligned} \]
This is effectively the same as the result from the stationary GEV analysis in Problem 3 (the difference is down to the second decimal place).
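Plugging the fitted values into the return-level formula gives a quick stdlib-only numeric check:

```julia
u, σ, ξ, λ = 28.0, 0.85, -0.19, 1.73  # threshold, GPD scale/shape, Poisson rate
rl = u + (σ / ξ) * ((100λ)^ξ - 1)     # 100-year return level
println(round(rl; digits=1))  # 30.8
```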