Lecture 07
February 6, 2026
\[ y_i = \underbrace{\sum_j \beta_j x^j_i}_{\mathbb{E}[y_i \mid \mathbf{x}_i]} + \underbrace{\varepsilon_i}_{\text{noise}}, \quad \varepsilon_i \sim N(0, \sigma^2) \]
The distributional assumption of independent Gaussian residuals is strong. Why would we make it?
Two main reasons to use Gaussian noise:

If
\[\begin{aligned} \sqrt{n}(\bar{X}_n - \mu) &\xrightarrow{d} \mathcal{N}(0, \sigma^2) \quad \text{as } n \to \infty \\ \Rightarrow \quad \bar{X}_n &\overset{\text{approx}}{\sim} \mathcal{N}(\mu, \sigma^2/n) \end{aligned}\]
For a large enough number of samples:
The sampling distribution of a sum or mean of random variables is approximately normal, even if the random variables themselves are not.
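A minimal sketch of this in Julia (not from the original slides; the Exponential(2) distribution and sample sizes are arbitrary illustrative choices):
using Distributions, Plots, Statistics
# draw 10,000 sample means, each from n = 100 Exponential(2) draws (a skewed distribution)
n = 100
sample_means = [mean(rand(Exponential(2), n)) for _ in 1:10_000]
# the histogram of the means looks approximately Gaussian, centered near the true mean of 2
histogram(sample_means, xlabel="Sample Mean", ylabel="Count", label=false)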

Without explicitly modeling sources of measurement uncertainty, linear regression cannot separate noise in the system state (due to, e.g., omitted variables) from measurement error (more on this later).
\[ \begin{array}{l} X \sim N(\mu_1, \sigma_1^2) \\ Y \sim N(\mu_2, \sigma_2^2) \\ Z = X + Y \end{array} \qquad \Rightarrow \qquad Z \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2) \quad (X, Y \text{ independent}) \]
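A quick Monte Carlo sanity check of this property (a sketch; the means and standard deviations below are arbitrary illustrative values):
using Distributions, Statistics
X = rand(Normal(1.0, 2.0), 100_000)
Y = rand(Normal(-3.0, 1.5), 100_000)
Z = X .+ Y
@show mean(Z) var(Z) # should be close to 1 + (-3) = -2 and 2² + 1.5² = 6.25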
\[ \begin{align*} S &= \beta_0 + \beta_1 \log(D) + U\\ U &\sim \mathcal{N}(0, \sigma^2) \end{align*} \]
How do we find \(\beta_i\) and \(\sigma\)?
Likelihood of the data having come from the distribution \(f(\mathbf{x} | \theta)\): \[\mathcal{L}(\theta | \mathbf{x}) = \underbrace{f(\mathbf{x} | \theta)}_{\text{PDF}}\]
Here the randomness comes from \(U\): \[S \sim \mathcal{N}(\beta_0 + \beta_1 \log(D), \sigma^2)\]
\[y_i \sim \mathcal{N}(F(x_i), \sigma^2)\]
\[\mathcal{L}(\theta | \mathbf{y}; F) = \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y_i - F(x_i))^2}{2\sigma^2}\right)\]
\[\log \mathcal{L}(\theta | \mathbf{y}; F) = \sum_{i=1}^n \left[\log \frac{1}{\sigma\sqrt{2\pi}} - \frac{1}{2\sigma^2}(y_i - F(x_i))^2 \right]\]
\[ \begin{align} \log \mathcal{L}(\theta | \mathbf{y}, F) &= \sum_{i=1}^n \left[\log \frac{1}{\sigma\sqrt{2\pi}} - \frac{1}{2\sigma^2}(y_i - F(x_i))^2 \right] \\ &= n \log \frac{1}{\sigma\sqrt{2\pi}} - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - F(x_i))^2 \end{align} \]
Ignoring constants (including terms that involve only \(\sigma\)):
\[\log \mathcal{L}(\theta | \mathbf{y}, F) \propto -\sum_{i=1}^n (y_i - F(x_i))^2.\]
Maximizing \(f(x)\) is equivalent to minimizing \(-f(x)\):
\[ -\log \mathcal{L}(\theta | \mathbf{y}, F) \propto \sum_{i=1}^n (y_i - F(x_i))^2 \propto \text{MSE} \]
using Distributions, Optim # Normal/logpdf and optimize

# tds_riverflow_loglik: compute the log-likelihood for the TDS model
# θ: vector of model parameters (coefficients β₀ and β₁ and stdev σ)
# tds, flow: vectors of data
function tds_riverflow_loglik(θ, tds, flow)
    β₀, β₁, σ = θ # unpack parameter vector
    μ = β₀ .+ β₁ * log.(flow) # mean of the TDS distribution given flow
    ll = sum(logpdf.(Normal.(μ, σ), tds)) # sum of log-densities = log-likelihood
    return ll
end
lb = [0.0, -1000.0, 1.0] # lower bounds for β₀, β₁, σ
ub = [1000.0, 1000.0, 100.0] # upper bounds for β₀, β₁, σ
θ₀ = [500.0, 0.0, 50.0] # initial guess
# maximize the log-likelihood by minimizing its negative, subject to the bounds
optim_out = Optim.optimize(θ -> -tds_riverflow_loglik(θ, tds.tds_mgL, tds.discharge_cms), lb, ub, θ₀)
θ_mle = optim_out.minimizer
@show θ_mle;

θ_mle = [609.5487268719755, -111.63106888212424, 71.52458711451465]
Note: These results should really be rounded to reflect significant digits; they are left unrounded here purely for comparison with OLS.
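For that comparison, a minimal sketch of the OLS fit with GLM.jl (assuming the same tds data frame and column names as above):
using GLM
ols_fit = lm(@formula(tds_mgL ~ log(discharge_cms)), tds)
@show coef(ols_fit) # should be close to θ_mle[1:2]: with Gaussian noise, MLE of the coefficients ≡ least squares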
It is important to check whether the assumptions of the model held: are the residuals seemingly uncorrelated and normally distributed?
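A minimal sketch of such a check, assuming the θ_mle and tds objects from above (the plotting choices are illustrative):
using StatsPlots # provides histogram and qqnorm (re-exports Plots)
resids = tds.tds_mgL .- (θ_mle[1] .+ θ_mle[2] * log.(tds.discharge_cms)) # model residuals
histogram(resids, xlabel="Residual (mg/L)", ylabel="Count", label=false) # roughly bell-shaped?
qqnorm(resids) # points near a straight line suggest approximate normality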
The Gaussian-noise assumption is strong. Why make it?
Now we have conditional distributions for each \(y_i\), \(p(y_i | X = \mathbf{x}_i)\).
This lets us quantify uncertainty: in the parameter estimates (via sampling distributions and confidence intervals) and in predictions of new data (via prediction intervals).
Uncertainty can come from the randomness of sampling. How does that affect our estimates?
The resulting distribution of estimates is called the sampling distribution.
using GLM, DataFrames, Distributions # lm/coef/dispersion, DataFrame, Normal/Uniform

# simulate data from the linear model and (optionally) re-fit it
# x: predictor values; β₀, β₁: true coefficients; σ: noise standard deviation
function sim_lrfit(x, β₀, β₁, σ; coefficients=true)
    n = length(x)
    y = β₀ .+ β₁ * x .+ rand(Normal(0, σ), n) # add Gaussian noise to the regression line
    if coefficients
        mod = lm([ones(n) x], y) # re-fit by least squares
        return [coef(mod); dispersion(mod)] # estimated coefficients and residual standard deviation
    else
        return DataFrame(X=x, Y=y)
    end
end
x = rand(Uniform(-5, 5), 42)
# reduce(hcat, v)' turns vector of vectors v into Matrix by concatenating horizontally then transposing
sim_coefs = reduce(hcat, [sim_lrfit(x, 5, -2, 0.1, coefficients=true) for i in 1:10_000])'
p1 = histogram(sim_coefs[:, 1], xlabel=L"β$_0$", ylabel="Count", label="Simulated Fits")
vline!([5], color=:red, linewidth=3, label="True Value")
p2 = histogram(sim_coefs[:, 2], xlabel=L"β$_1$", ylabel="Count", label="Simulated Fits")
vline!([-2], color=:red, linewidth=3, label="True Value", legend=:topleft)
plot(p1, p2, layout=(2, 1), size=(600, 550))

What if we ramp up the observation variability?
x = rand(Uniform(-5, 5), 42)
# reduce(hcat, v)' turns vector of vectors v into Matrix by concatenating horizontally then transposing
sim_coefs = reduce(hcat, [sim_lrfit(x, 5, -2, 0.5, coefficients=true) for i in 1:10_000])'
p1 = histogram(sim_coefs[:, 1], xlabel=L"β$_0$", ylabel="Count", label="Simulated Fits")
vline!([5], color=:red, linewidth=3, label="True Value")
p2 = histogram(sim_coefs[:, 2], xlabel=L"β$_1$", ylabel="Count", label="Simulated Fits")
vline!([-2], color=:red, linewidth=3, label="True Value")
plot(p1, p2, layout=(2, 1), size=(600, 550))

If the linear model is “true”:
\[\begin{aligned} \hat{\beta}_1 &\sim N\left(\beta_1, \frac{\sigma^2}{ns_X^2}\right) \\ \hat{\beta}_0 &\sim N\left(\beta_0, \frac{\sigma^2}{n}\left(1 + \frac{\bar{x}^2}{s_X^2}\right)\right) \\ \frac{n\hat{\sigma}^2}{\sigma^2} &\sim \chi^2_{n-2} \end{aligned} \]
These closed-form distributions allow direct calculation of confidence intervals, and therefore let us address questions like “How improbable is it to see these estimates assuming an alternative model?”
But they only work if the model is true. If the model is false, all of these calculations are wrong, hence the importance of diagnostics.
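A rough check against the simulation above (a sketch assuming the sim_coefs and x from the σ = 0.5 run; s_X here is the population standard deviation of x):
using Statistics
n = length(x)
s_X = std(x, corrected=false) # population standard deviation of the predictor
se_theory = 0.5 / (sqrt(n) * s_X) # σ / (√n s_X), the theoretical sd of β̂₁ from the formula above
se_simulated = std(sim_coefs[:, 2]) # spread of the 10,000 simulated β̂₁ estimates
@show se_theory se_simulated # these should be close if the model assumptions hold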
Frequentist estimates have confidence intervals, which will contain the “true” parameter value in \(\alpha\)% of repeated data samples.
No guarantee that an individual CI contains the true value (with any “probability”)!

# set up distribution
mean_true = 0.4
n_cis = 100 # number of CIs to compute
dist = Normal(mean_true, 2)
# use sample size of 100
samples = rand(dist, (100, n_cis))
# mapslices broadcasts over a matrix dimension, could also use a loop
sample_means = mapslices(mean, samples; dims=1)
sample_sd = mapslices(std, samples; dims=1)
mc_sd = 1.96 * sample_sd / sqrt(100) # 95% CI half-width: 1.96 standard errors
mc_ci = zeros(n_cis, 2) # preallocate
for i = 1:n_cis
mc_ci[i, 1] = sample_means[i] - mc_sd[i]
mc_ci[i, 2] = sample_means[i] + mc_sd[i]
end
# find which CIs contain the true value
ci_true = (mc_ci[:, 1] .< mean_true) .&& (mc_ci[:, 2] .> mean_true)
# compute percentage of CIs which contain the true value
ci_frac1 = 100 * sum(ci_true) ./ n_cis
# plot CIs
p1 = plot([mc_ci[1, :]], [1, 1], linewidth=3, color=:deepskyblue, label="95% Confidence Interval", title="Sample Size 100", yticks=:false, legend=:false)
for i = 2:n_cis
if ci_true[i]
plot!(p1, [mc_ci[i, :]], [i, i], linewidth=2, color=:deepskyblue, label=:false)
else
plot!(p1, [mc_ci[i, :]], [i, i], linewidth=2, color=:red, label=:false)
end
end
vline!(p1, [mean_true], color=:black, linewidth=2, linestyle=:dash, label="True Value") # plot true value as a vertical line
xaxis!(p1, "Estimate")
plot!(p1, size=(500, 350)) # resize to fit slide
# use sample size of 1000
samples = rand(dist, (1000, n_cis))
# mapslices broadcasts over a matrix dimension, could also use a loop
sample_means = mapslices(mean, samples; dims=1)
sample_sd = mapslices(std, samples; dims=1)
mc_sd = 1.96 * sample_sd / sqrt(1000) # 95% CI half-width: 1.96 standard errors
mc_ci = zeros(n_cis, 2) # preallocate
for i = 1:n_cis
mc_ci[i, 1] = sample_means[i] - mc_sd[i]
mc_ci[i, 2] = sample_means[i] + mc_sd[i]
end
# find which CIs contain the true value
ci_true = (mc_ci[:, 1] .< mean_true) .&& (mc_ci[:, 2] .> mean_true)
# compute percentage of CIs which contain the true value
ci_frac2 = 100 * sum(ci_true) ./ n_cis
# plot CIs
p2 = plot([mc_ci[1, :]], [1, 1], linewidth=3, color=:deepskyblue, label="95% Confidence Interval", title="Sample Size 1,000", yticks=:false, legend=:false)
for i = 2:n_cis
if ci_true[i]
plot!(p2, [mc_ci[i, :]], [i, i], linewidth=2, color=:deepskyblue, label=:false)
else
plot!(p2, [mc_ci[i, :]], [i, i], linewidth=2, color=:red, label=:false)
end
end
vline!(p2, [mean_true], color=:black, linewidth=2, linestyle=:dash, label="True Value") # plot true value as a vertical line
xaxis!(p2, "Estimate")
plot!(p2, size=(500, 350)) # resize to fit slide
display(p1)
display(p2)

98% of the CIs contain the true value (left) vs. 95% (right).
The key quantity is the standard error of a parameter, e.g. \(\text{se}(\hat{\beta}_1)\).
The standard error is the standard deviation of the sampling distribution.
We call it the standard “error” to reflect that it quantifies the error in estimating the parameters.
Of course, we also need to estimate the standard error since we don’t know the “true” values:
\[\text{se}(\hat{\beta}_1) = \frac{\sigma}{\sqrt{n}\,s_X} \approx \frac{\hat{\sigma}}{\sqrt{n-2}\,s_X} = \hat{\text{se}}(\hat{\beta}_1)\]
since \(\frac{n}{n-2}\hat{\sigma}^2 \approx \sigma^2\).
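A minimal sketch of this estimate for a single simulated dataset (assuming sim_lrfit and x from above; the σ = 0.5 value is illustrative), compared against GLM's reported standard error:
using GLM, DataFrames, Statistics
df = sim_lrfit(x, 5, -2, 0.5; coefficients=false) # one simulated dataset
mod = lm(@formula(Y ~ X), df)
n = nrow(df)
σ̂ = sqrt(sum(residuals(mod) .^ 2) / n) # MLE of σ (divides by n)
s_X = std(df.X, corrected=false) # population standard deviation of X
se_formula = σ̂ / (sqrt(n - 2) * s_X) # σ̂ / (√(n-2) s_X) from the formula above
@show se_formula stderror(mod)[2] # GLM's standard error for β̂₁; these should agree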
Generate these prediction intervals using the conditional distributions from the probability model. An \(\alpha\)% interval should contain \(\alpha\)% of new observations.
For example: find a 90% interval by simulating samples from the model and taking the 5% and 95% quantiles.
# simulate 10,000 predictions
x = 1:0.1:60
μ = θ_mle[1] .+ θ_mle[2] * log.(x)
y_pred = zeros(length(x), 10_000)
for i = 1:length(x)
y_pred[i, :] = rand(Normal(μ[i], θ_mle[3]), 10_000)
end
# take quantiles to find prediction intervals
y_q = mapslices(v -> quantile(v, [0.05, 0.5, 0.95]), y_pred; dims=2)
p = scatter(
tds.discharge_cms,
tds.tds_mgL,
xlabel=L"Discharge (m$^3$/s)",
ylabel="Total dissolved solids (mg/L)",
markersize=5,
label="Observations",
size = (600, 550)
)
plot!(x,
y_q[:, 2],
ribbon = (y_q[:, 2] - y_q[:, 1], y_q[:, 3] - y_q[:, 2]),
linewidth=3,
fillalpha=0.2,
label="Best Fit",
size=(600, 550))

Next Week: More general probability models (including generalized linear models)
Homework 1: Due tonight.
Exercises: Due before class Monday.
Reading: Shmueli (2010).
Quiz 1: Next Friday (2/13). Will include Monday’s content.