Generalized Linear Models


Lecture 09

February 10, 2026

Review

Logistic Regression

  • Used to model probabilities for classification.
  • Two-class (e.g., success or failure):

\[\begin{aligned} y_i &\sim \text{Bernoulli}(p_i) \\ \text{logit}(p_i) &= \sum_j \beta_j x^i_j \end{aligned}\]

where \(\text{logit}(p) = \log\left(p / (1 - p)\right)\); its inverse, \(\text{logit}^{-1}(\eta) = e^\eta / (1 + e^\eta)\), maps the linear predictor back to a probability.

Multi-Class Logistic Regression

Treat one class as a reference. Each other class \(c\) has its own intercept \(\beta_0^c\) and coefficients \(\beta^c\).

\[p(Y = c | X = \mathbf{x}) = \frac{\exp\left(\beta_0^c + \mathbf{x} \cdot \beta^c \right)}{\sum_{c'} \exp\left(\beta_0^{c'} + \mathbf{x} \cdot \beta^{c'}\right) }\]
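A minimal sketch of this computation in Julia (the function name and argument layout are illustrative, not from the lecture code): with the reference class's intercept and coefficients fixed to zero, the class probabilities are a softmax over the per-class linear predictors.

# β0: vector of per-class intercepts; B: coefficient matrix with one column per class
# (the reference class's entry of β0 and column of B are fixed to zero)
function multiclass_probs(x, β0, B)
    η = β0 .+ B' * x    # linear predictor for each class
    p = exp.(η)
    return p ./ sum(p)  # softmax: normalize so the probabilities sum to 1
end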

Interpretation of Coefficients

\(\exp(\beta^c_j)\) gives the relative change in the odds of class \(c\) (versus the reference class) from a unit change in \(x_j\).

Logistic Regression Example

Forecasting Precipitation

Code
using CSV, DataFrames, Plots

# read in data; each year is a row and each column is a day
dat = CSV.read("data/weather/snoqualmie_falls.txt", DataFrame; delim=" ", ignorerepeated=true, silencewarnings=true, skipto=2)
years = 1948:1983
# the last "day" of each non-leap year is NA, so need to skip over these
days_per_year = repeat([366, 365, 365, 365], outer=Int(ceil(length(years) / 4)))
snoq = zeros(sum(days_per_year))
for i = 1:length(years)
    # need to use values() to get the vector of elements in the row; this is a quirk of DataFrames.jl
    snoq[1+sum(days_per_year[1:i-1]):sum(days_per_year[1:i])] .= values(dat[i, 1:days_per_year[i]])
end
# want to arrange dataframe with today's precipitation (predictor) and tomorrow's (prediction)
snoq_dat = DataFrame(today=snoq[1:end-1], tomorrow=snoq[2:end])
p1 = histogram(snoq_dat.today, xlabel="Precipitation (1/100 inch)", ylabel="Count", label=false)
p2 = scatter(snoq_dat.today, snoq_dat.tomorrow, xlabel="Precip Today (1/100 inch)", ylabel="Precip Tomorrow (1/100 inch)", markersize=2, alpha=0.2, label=false)
plot(p1, p2, layout=(1, 2), size=(1200, 500))

Figure 1: Snoqualmie Falls, WA precipitation dataset.

Forecasting Model

Let’s try to forecast when there will be non-zero precipitation (\(Y=1\)) vs. zero precipitation (\(Y=0\)) from the previous day’s precipitation.

\[\text{logit}(p_i) = \text{logit}(p(Y_i = 1)) = \beta_0 + \beta_1 x_i\]

Maximum Likelihood Estimation

using Distributions, Optim

invlogit(x) = exp(x) / (1 + exp(x)) # define inverse logit
function logreg_loglik(params, y, x)
    b0, b1 = params
    p = invlogit.(b0 .+ b1 * x) # logistic regression: probability from linear predictor
    ll = sum(logpdf.(Bernoulli.(p), y)) # Bernoulli log-pdf gives the log-likelihood
    return ll
end

lb = [-1.0, -1.0]
ub = [1.0, 1.0]
p0 = [0.0, 0.0]
optim_out = Optim.optimize(p -> -logreg_loglik(p, snoq_dat.tomorrow .> 0, snoq_dat.today), lb, ub, p0)
mle = round.(optim_out.minimizer; digits=2)
2-element Vector{Float64}:
 -0.44
  0.05

Visualizing Results

Code
x_pred = 0:0.1:maximum(snoq_dat.today)
p_pred = invlogit.(mle[1] .+ mle[2] * x_pred)
plot(x_pred, p_pred, linewidth=3, label="Modeled Probability", xlabel="Precip Today (1/100 inch)", ylabel="Positive Precip Tomorrow?", size=(1000, 400))
scatter!(snoq_dat.today, snoq_dat.tomorrow .> 0, alpha=0.2, label="Data")

Figure 2: Modeled probability of precipitation tomorrow as a function of precipitation today.

Calibration

Does the predicted CDF \(F(y)\) align with the “true” distribution of observations \(y\)? \[\mathbb{P}(y \leq F^{-1}(\tau)) = \tau \qquad \forall \tau \in [0, 1]\]

In this case: does the model produce the “right” probabilities of events?

Example: Calibration

We have a total of 6,919 days with precipitation. To see the expected number from the model, sum up the probabilities:

p_fit = invlogit.(mle[1] .+ mle[2] * snoq_dat.today)
round(sum(p_fit); digits=0)
7007.0

We’re off by about 1.3%. Is that good?

Example: Calibration

How well calibrated is the prediction for a no-precipitation day followed by a precipitation day?

The actual fraction:

emp_pnp = mean(snoq_dat[snoq_dat.today .== 0, :tomorrow] .> 0)
round(emp_pnp; digits=3)
0.287

The modeled fraction:

mod_pnp = mean(p_fit[snoq_dat.today .== 0])
round(mod_pnp; digits=3)
0.392

Potential Model Improvements

What might we try to improve this forecast?

Generalized Linear Models

Linear vs. Logistic Regression

Linear Regression: Model the conditional expectation function \[\mu(x) = \mathbb{E}[Y | X=\mathbf{x}] \approx \beta_0 + \sum_i \beta_i x_i.\]

Logistic Regression:

\[\mu(x) = \mathbb{E}[Y | X=\mathbf{x}] = \mathbb{P}[Y = 1 | X = \mathbf{x}].\]

We transform the conditional expectation to be linear using a link function \(g\) (here \(g(x) = \text{logit}(x)\)):

\[g(\mu(x)) = \eta(x) \approx \beta_0 + \sum_i \beta_i x_i.\]

Choice of Distribution

Many possible distributions depending on the problem at hand. Common ones:

  • Binomial: integer count from 0 to some upper limit \(n\): \[y_i \sim \text{Binomial}(n, p_i)\]
  • Poisson: counts with no fixed upper limit, but a low base probability of “success”: \[y_i \sim \text{Poisson}(\lambda_i)\]
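For reference (a general fact about GLMs, not stated on this slide), the canonical link for each of these distributions:

\[g(\mu) = \text{logit}(\mu) \;\; \text{(Binomial)}, \qquad g(\mu) = \log(\mu) \;\; \text{(Poisson)}.\]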

Poisson Regression Example

Modeling Counts

Code
fish = CSV.File("data/ecology/Fish.csv") |> DataFrame
fish[1:3, :]
3×6 DataFrame
 Row │ fish_caught  livebait  camper  persons  child  hours
     │ Int64        Int64     Int64   Int64    Int64  Float64
─────┼───────────────────────────────────────────────────────
   1 │           0         0       0        1      0   21.124
   2 │           0         1       1        1      0    5.732
   3 │           0         1       0        1      0    1.323

What Distribution?

Count data can be modeled using a Poisson or negative binomial distribution.

Figure 4: Poisson
Figure 5: Negative Binomial

What Distribution?

Let’s start with a Poisson.

One property of \(y \sim \text{Poisson}(\lambda)\): \(\mathbb{E}(y) = \text{Var}(y) = \lambda\).

  • Mean: 3.3
  • Variance: 135.4
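These values can be checked directly (a quick sketch, assuming the fish DataFrame loaded above):

using Statistics

# sample mean and variance of the catch counts
mean(fish.fish_caught), var(fish.fish_caught) # ≈ (3.3, 135.4)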

What Does This Mean?

  • The data are overdispersed relative to a Poisson distribution.
  • But a Poisson GLM may still give rise to overdispersed data, since every observation has a different \(\lambda_i\) (and hence a different mean and variance).

Figure 6: Fishing Data

Be Cautious…

  • Over-relying on plots like histograms of the raw data can lead to mis-specifying the GLM.
  • Think of linear regression: the question is not whether the data are Gaussian, but whether the residuals are.

Brain on Regression Meme

Source: Richard McElreath

Model Specification

We might hypothesize that the number of fish caught is influenced by the number of children (noisy, distracting, impatient) and whether the party brought a camper.

\[\begin{align*} y_i &\sim \text{Poisson}(\lambda_i) \\ f(\lambda_i) &= \beta_0 + \beta_1 \text{child}_i + \beta_2 \text{camper}_i \end{align*} \]

\(\lambda_i\): positive “rate”

\(f(\cdot)\): link function mapping positive reals (rate scale) to all reals (linear model scale); here \(f = \log\)

Fitting the Model

function fish_model(params, child, camper, fish_caught)
    β₀, β₁, β₂ = params
    λ = exp.(β₀ .+ β₁ * child + β₂ * camper) # log link: exp maps the linear predictor to a positive rate
    return sum(logpdf.(Poisson.(λ), fish_caught)) # Poisson log-likelihood
end

lb = [-100.0, -100.0, -100.0]
ub = [100.0, 100.0, 100.0]
init = [0.0, 0.0, 0.0]

optim_out = optimize(θ -> -fish_model(θ, fish.child, fish.camper, fish.fish_caught), lb, ub, init)
θ_mle = optim_out.minimizer
@show round.(θ_mle; digits=2);
@show round.(exp.(θ_mle); digits=2);
round.(θ_mle; digits = 2) = [0.91, -1.23, 1.05]
round.(exp.(θ_mle); digits = 2) = [2.48, 0.29, 2.87]

Interpretation of Coefficients

Since \(\log(\lambda_i) = \sum_j \beta_j x^i_j\):

\(\exp(\beta_j)\) says how much a unit increase in \(x_j\) multiplies the rate (hence expected value).
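To see why, compare the rates at \(x_j\) and \(x_j + 1\) with the other predictors held fixed:

\[\frac{\lambda(x_j + 1)}{\lambda(x_j)} = \frac{\exp\left(\beta_j (x_j + 1) + \sum_{k \neq j} \beta_k x_k\right)}{\exp\left(\beta_j x_j + \sum_{k \neq j} \beta_k x_k\right)} = \exp(\beta_j).\]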

So \(\beta_1 \approx -1.2\) says that each extra child multiplies the expected number of fish caught by \(\exp(-1.23) \approx 0.3\), i.e., a reduction of about 70%.

Predictive Checks To Evaluate Model

  • Linear regression: model checking was straightforward; the residuals should be (approximately) Gaussian.
  • GLMs: less clear how to check distributional assumptions.

Strategy: check the predictive distribution of relevant summary statistic(s) by simulation; does it capture the empirical statistic?

Predictive Checks

In other words, for some statistic \(T\):

  • Estimate \(\hat{T}\) from the data;
  • Simulate synthetic datasets using your model and calculate \(\tilde{T}_n\) for each one;
  • Examine the distribution of the \(\tilde{T}_n\) to see where \(\hat{T}\) falls.

Teaser: This is closely related to the idea of a \(p\)-value…

Checking Proportion of Zero Fish

Figure 7: Predictive distribution for fitted fish model.
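A minimal sketch of this check (assuming θ_mle and the fish DataFrame from above; the number of simulations is illustrative, not the exact lecture code):

using Distributions, Statistics

# rates implied by the fitted coefficients
λ_hat = exp.(θ_mle[1] .+ θ_mle[2] * fish.child + θ_mle[3] * fish.camper)

# simulate 1,000 synthetic datasets; record the fraction of zero-fish trips in each
zero_frac = [mean(rand.(Poisson.(λ_hat)) .== 0) for _ in 1:1_000]

# compare the empirical fraction of zeros to the simulated distribution
mean(fish.fish_caught .== 0), quantile(zero_frac, [0.05, 0.95])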

What Was The Problem?

  • The model produces far fewer zero-fish counts than the data contain.
  • This is typical of environmental count data, which often have far more zeros than a Poisson regression would predict.
  • Alternative: use a zero-inflated model (sketched below).
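As an illustration (a sketch of the zero-inflated Poisson log-likelihood, not a model fit in lecture): with probability \(p_0\) an observation is a “structural” zero, and otherwise it is drawn from the Poisson.

using Distributions

# zero-inflated Poisson log-likelihood (illustrative sketch)
function zip_loglik(y, λ, p₀)
    ll = 0.0
    for (yᵢ, λᵢ) in zip(y, λ)
        if yᵢ == 0
            # a zero can come from the zero-inflation component or the Poisson
            ll += log(p₀ + (1 - p₀) * pdf(Poisson(λᵢ), 0))
        else
            ll += log(1 - p₀) + logpdf(Poisson(λᵢ), yᵢ)
        end
    end
    return ll
end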

Key Points

GLMs

  • Generalized linear models: replace the Gaussian with a distribution suited to the data, with parameters that vary with the predictors.
  • Require a choice of distribution, a link function, and a linear model.
  • Lots of different choices of distributions — think about structure of the data and generative properties of the distributions.
  • Simulation is valuable for checking model fit.

Upcoming Schedule


Friday: Quiz 1 for first 25 minutes.

Monday: No class!

Next Week: Modeling extreme values.