Lecture 11
February 18, 2026
Many environmental datasets involve time series, i.e. repeated observations over time:
\[X = \{X_t, X_{t+h}, X_{t+2h}, \ldots, X_{t+nh}\}\]
More often, the sampling interval is dropped from the notation:
\[X = \{X_1, X_2, X_3, \ldots, X_n\}\]
Often time series have periodic components. Can recover these using regression: \[y_t = A \cos(\omega t) + B \sin(\omega t) + C + \varepsilon_t, \qquad \varepsilon \sim N(0, \sigma^2).\]
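Given a known \(\omega\), the coefficients \(A\), \(B\), \(C\) can be recovered by ordinary least squares on a design matrix with cosine, sine, and intercept columns. A minimal sketch in base Julia (the function name `fit_harmonic` is illustrative, not from the lecture code):

```julia
# Fit y_t = A cos(ωt) + B sin(ωt) + C + ε_t by ordinary least squares.
# Assumes ω is known; returns (A, B, C).
function fit_harmonic(y, ω)
    t = 1:length(y)
    X = [cos.(ω .* t) sin.(ω .* t) ones(length(y))]  # design matrix
    coef = X \ y                                      # least-squares solve
    return coef[1], coef[2], coef[3]
end
```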
```julia
# load and tidy the tide-gauge CSV; assumes CSV, DataFrames, DataFramesMeta, and Dates are loaded
function load_data(fname)
    date_format = "yyyy-mm-dd HH:MM"
    # this uses the DataFramesMeta package -- it's pretty cool
    return @chain fname begin
        CSV.File(; dateformat=date_format)
        DataFrame
        DataFrames.rename(
            "Time (GMT)" => "time", "Predicted (m)" => "harmonic", "Verified (m)" => "gauge"
        )
        @transform :datetime = (Date.(:Date, "yyyy/mm/dd") + Time.(:time))
        select(:datetime, :gauge, :harmonic)
        @transform :weather = :gauge - :harmonic
        @transform :month = (month.(:datetime))
    end
end
```
```julia
dat = load_data("data/surge/norfolk-hourly-surge-2015.csv")
p1 = plot(dat.datetime, dat.gauge; ylabel="Gauge Measurement (m)", label="Observed",
    legend=:topleft, xlabel="Date/Time", bottom_margin=5mm, left_margin=5mm,
    right_margin=5mm, size=(1100, 500))
```

Figure 2: 2015 tide gauge data from the Norfolk, VA tide gauge.
If we want to look at the tide gauge data (which has an hourly sampling period and 8760 samples) with a monthly (30-day) period, then \[\begin{gathered} 2\pi / \omega = 30 \times 24 = 720 \\ \omega = 2\pi / 720 \end{gathered}\]
and we get \([A, B, C, \sigma] = [-0.012, -0.022, 0.516, 0.34]\).
Spectral analysis: Based on Fourier decomposition to switch between time domain and frequency domain.
Let \(\omega_j = 2\pi j / T\). Then:
\[\begin{aligned} \hat{A} &= \frac{2}{T} \sum_t y_t \cos(\omega_j t) \\ \hat{B} &= \frac{2}{T} \sum_t y_t \sin(\omega_j t) \\ \hat{C} &= \bar{Y} = \frac{1}{T} \sum_t y_t. \end{aligned}\]
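These estimators can be computed directly from the three sums above at the Fourier frequency \(\omega_j = 2\pi j / T\). A minimal base-Julia sketch (the function name is illustrative):

```julia
# Harmonic coefficient estimates at the Fourier frequency ω_j = 2πj/T:
#   Â = (2/T) Σ y_t cos(ω_j t),  B̂ = (2/T) Σ y_t sin(ω_j t),  Ĉ = ȳ
function harmonic_coefs(y, j)
    T = length(y)
    ω = 2π * j / T
    t = 1:T
    Â = (2 / T) * sum(y .* cos.(ω .* t))
    B̂ = (2 / T) * sum(y .* sin.(ω .* t))
    Ĉ = sum(y) / T
    return Â, B̂, Ĉ
end
```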
Can then define the periodogram \[I_T(\omega) = \frac{1}{2\pi T} \left|\sum_{t=1}^T y_t e^{i\omega t}\right|^2.\]
Larger values of \(I_T\) mean those frequencies contribute more to the overall signal.
The biggest spikes are at frequency 0 (no periodicity; this is just a drift without harmonic behavior) and at a period of 711 hours (roughly 30 days; aligns with the lunar cycle).
Sometimes periodograms are computed with respect to frequency rather than period: recover the period by taking the reciprocal of the frequency (scaled by the sampling interval).
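The periodogram can be evaluated directly from its definition; an FFT is the standard fast implementation, but the \(O(T^2)\) direct sum below makes the formula explicit (a sketch; the function name is illustrative):

```julia
# Periodogram I_T(ω) = |Σ_t y_t e^{iωt}|² / (2πT), evaluated at the
# Fourier frequencies ω_j = 2πj/T for j = 0, …, T÷2.
function periodogram(y)
    T = length(y)
    ωs = [2π * j / T for j in 0:(T ÷ 2)]
    I = [abs2(sum(y[t] * cis(ω * t) for t in 1:T)) / (2π * T) for ω in ωs]
    return ωs, I
end
```

A pure cosine with five cycles in the record produces a single spike at \(\omega = 2\pi \cdot 5 / T\).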
Dependence: History or sequencing of the data matters
\[p(y_t) = f(y_1, \ldots, y_{t-1})\]
Serial dependence captured by autocorrelation:
\[\varsigma(i) = \rho(y_t, y_{t-i}) = \frac{\text{Cov}[y_t, y_{t-i}]}{\mathbb{V}[y_t]} \]
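The sample version replaces the covariance and variance with their plug-in estimates. A minimal base-Julia sketch (function name illustrative):

```julia
# Sample autocorrelation at lag i:
#   ς̂(i) = Σ_{t>i} (y_t − ȳ)(y_{t−i} − ȳ) / Σ_t (y_t − ȳ)²
function sample_autocor(y, i)
    T = length(y)
    ȳ = sum(y) / T
    num = sum((y[t] - ȳ) * (y[t - i] - ȳ) for t in (i + 1):T)
    den = sum((y[t] - ȳ)^2 for t in 1:T)
    return num / den
end
```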
I.I.D. Data: All data drawn from the same distribution, \(y_i \sim \mathcal{D}\).
The equivalent notion for a time series \(\mathbf{y} = \{y_t\}\) is stationarity.
```julia
# load the TDS dataset; assumes CSV and DataFrames are loaded
tds = let
    fname = "data/tds/cuyaTDS.csv" # CHANGE THIS!
    tds = DataFrame(CSV.File(fname))
    tds[!, [:date, :discharge_cms, :tds_mgL]]
end

# tds_riverflow_loglik: function to compute the log-likelihood for the TDS model
# θ: vector of model parameters (coefficients β₀ and β₁ and stdev σ)
# tds, flow: vectors of data
function tds_riverflow_loglik(θ, tds, flow)
    β₀, β₁, σ = θ                         # unpack parameter vector
    μ = β₀ .+ β₁ * log.(flow)             # find mean
    ll = sum(logpdf.(Normal.(μ, σ), tds)) # compute log-likelihood (logpdf from Distributions)
    return ll
end
```
```julia
lb = [0.0, -1000.0, 1.0]     # lower bounds for (β₀, β₁, σ)
ub = [1000.0, 1000.0, 100.0] # upper bounds
θ₀ = [500.0, 0.0, 50.0]      # initial guess
optim_out = Optim.optimize(
    θ -> -tds_riverflow_loglik(θ, tds.tds_mgL, tds.discharge_cms), lb, ub, θ₀
)
θ_mle = round.(optim_out.minimizer; digits=0)
```
```julia
x = 1:0.1:60
μ = θ_mle[1] .+ θ_mle[2] * log.(x)
p1 = scatter(tds.discharge_cms, tds.tds_mgL; markersize=3,
    xlabel=L"Discharge (m$^3$/s)", ylabel="Total dissolved solids (mg/L)",
    label="Data", size=(500, 450))
plot!(p1, x, μ; linewidth=3, color=:red, label="Regression")
pred = θ_mle[1] .+ θ_mle[2] * log.(tds.discharge_cms)
resids = pred - tds.tds_mgL
p2 = scatter(log.(tds.discharge_cms), resids; markersize=3,
    ylabel=L"$r$ (mg/L)", xlabel=L"Log-Discharge (log-m$^3$/s)", size=(500, 450))
p3 = scatter(resids[1:end-1], resids[2:end]; markersize=3,
    ylabel=L"$r_{t+1}$ (mg/L)", xlabel=L"$r_t$ (mg/L)", size=(500, 450))
T = nrow(tds)
rlr = lm([ones(T-1) resids[1:end-1]], resids[2:end]) # lag-1 regression of the residuals
rpred = predict(rlr, [ones(T-1) resids[1:end-1]])
plot!(p3, resids[1:end-1], rpred; linewidth=3, color=:red)
display(p1)
display(p2)
display(p3)
```

The residuals have a lag-1 autocorrelation of 0.13.
AR(p) (autoregressive of order \(p\)); for \(p = 1\):
\[ \begin{align*} y_t &= \alpha + \rho y_{t-1} + \varepsilon_t \\ \varepsilon &\sim \mathcal{D}(\theta) \\ y_0 &\sim \mathcal{G}(\psi) \end{align*} \]
\(\mathbb{E}[\varepsilon] = 0\), \(\text{cor}(\varepsilon_i, \varepsilon_j) = 0\), \(\text{cor}(\varepsilon_i, y_0) = 0\)
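The AR(1) recursion is straightforward to simulate; a minimal sketch assuming Gaussian innovations \(\varepsilon_t \sim N(0, \sigma^2)\) and a fixed initial value (the function name is illustrative):

```julia
# Simulate y_t = α + ρ y_{t−1} + ε_t with ε_t ~ N(0, σ²), starting from y0.
function simulate_ar1(α, ρ, σ, T; y0 = 0.0)
    y = Vector{Float64}(undef, T)
    yprev = y0
    for t in 1:T
        y[t] = α + ρ * yprev + σ * randn()
        yprev = y[t]
    end
    return y
end
```

With \(\sigma = 0\) the recursion is deterministic and converges to \(\alpha / (1 - \rho)\) for \(|\rho| < 1\).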
AR models are commonly used for prediction: bond yields, prices, electricity demand, short-run weather.
But may have little explanatory power: what causes the autocorrelation?
Plot \(\varsigma(i)\) over a series of lags.
Data generated by an AR(1) with \(\rho = 0.7\).
Notice: the model has no explicit higher-order autocorrelation, but \(\varsigma(2) > 0\).
\[\begin{aligned} y_t &= \alpha + \rho y_{t-1} + \varepsilon_t \\ &= \alpha + \rho \left(\alpha + \rho y_{t-2} + \varepsilon_{t-1}\right) + \varepsilon_t \\ &= \alpha + \rho \alpha + \rho^2 y_{t-2} + \rho \varepsilon_{t-1} + \varepsilon_t \\ &= \ldots \\ &= \alpha \sum_{k=0}^{t-1} \rho^k + \rho^t y_0 + \sum_{k=0}^{t-1} \varepsilon_{t-k} \rho^k \end{aligned}\]
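Taking expectations of the last line (using \(\mathbb{E}[\varepsilon] = 0\)) shows that the mean stabilizes when \(|\rho| < 1\): \[\mathbb{E}[y_t] = \alpha \sum_{k=0}^{t-1} \rho^k + \rho^t\, \mathbb{E}[y_0] = \alpha \frac{1 - \rho^t}{1 - \rho} + \rho^t\, \mathbb{E}[y_0] \;\longrightarrow\; \frac{\alpha}{1 - \rho} \quad \text{as } t \to \infty.\] For the stationary process the autocorrelation is \(\varsigma(i) = \rho^i\), which is why \(\varsigma(2) = \rho^2 > 0\) even though the model has only a lag-1 term.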
Can isolate \(\varsigma(i)\) from the lower-order lags through the partial autocorrelation. Typically estimated through regression, \[y_t = \varphi_0 + \sum_{k=1}^i \varphi_k y_{t-k} + \varepsilon_t,\] where the lag-\(i\) partial autocorrelation is the fitted coefficient \(\varphi_i\).
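In that regression view, the lag-\(p\) partial autocorrelation is the coefficient on \(y_{t-p}\) when \(y_t\) is regressed on its first \(p\) lags. A minimal base-Julia sketch (function name illustrative):

```julia
# Partial autocorrelation at lag p: the last OLS coefficient from regressing
# y_t on an intercept and y_{t−1}, …, y_{t−p}.
function pacf_at(y, p)
    T = length(y)
    Y = y[(p + 1):T]
    X = hcat(ones(T - p), (y[(p + 1 - k):(T - k)] for k in 1:p)...)
    β = X \ Y      # least-squares solve
    return β[end]  # coefficient on lag p
end
```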
Friday: AR(1) Inference
Next Week: Extreme Values (Theory and Models)
Next Unit: Hypothesis Testing, Model Evaluation, and Comparison
HW2: Due Friday at 9pm.
Exercises: Available after class (will include GLMs + time series).
HW3: Assigned on 3/2, due 3/13.