Logistic Regression


Lecture 08

February 9, 2026

Review

Linear Regression

Linear regression as a probability model: \[ y_i = \sum_j \beta_j x^i_j + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2) \]

  • Maximum likelihood gives same parameters as OLS
  • Allows us to determine conditional predictive probability \(p(Y | X=\mathbf{x})\)
  • Can get other insights: sampling distributions, confidence intervals, etc.

More on Predictive Intervals

\(1-\alpha\)-predictive intervals:

  • Contain \(100(1-\alpha)\%\) of predicted values
  • Typically conditional \(p(Y | X=\mathbf{x})\)
  • Can be unconditional if we know \(p(X)\) (a Monte Carlo sketch follows): \[p(Y) = \int p(Y | X=\mathbf{x}) p(\mathbf{x}) \, d\mathbf{x}\]
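A minimal Monte Carlo sketch of this integral, assuming a hypothetical predictor distribution \(p(X)\) and hypothetical fitted coefficients (all names and values below are illustrative):

using Distributions, Random, Statistics

Random.seed!(1)
β₀, β₁, σ = 2.0, 0.5, 1.0 # hypothetical fitted coefficients
p_X = Normal(10.0, 2.0)   # assumed distribution of the predictor

# sample x ~ p(X), then y ~ p(Y | X = x); the pooled samples approximate p(Y)
x_samp = rand(p_X, 100_000)
y_samp = rand.(Normal.(β₀ .+ β₁ .* x_samp, σ))

quantile(y_samp, [0.05, 0.95]) # unconditional 90% predictive interval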

Generating Predictive Intervals for LR

  • \(p(Y | X=\mathbf{x}) = N(\sum_i \beta_i x_i, \sigma^2)\)
  • Due to homoskedasticity, get \(\alpha/2\) and \(1 - \alpha/2\) quantiles for error distribution \(N(0, \sigma^2)\)
  • Center this interval on \(\mathbf{E}(Y | \mathbf{x}) = \sum_i \beta_i x_i\)
  • Can also generate by simulation (more general approach).

Example: Simulating Prediction Intervals

# packages used in this example
using CSV, DataFrames, Distributions, Optim, Plots, LaTeXStrings, Statistics

# load data
tds = let
    fname = "data/tds/cuyaTDS.csv" # CHANGE THIS!
    tds = DataFrame(CSV.File(fname))
    tds[!, [:date, :discharge_cms, :tds_mgL]]
end

# fit model
function tds_riverflow_loglik(θ, tds, flow)
    β₀, β₁, σ = θ # unpack parameter vector
    μ = β₀ .+ β₁ * log.(flow) # find mean
    ll = sum(logpdf.(Normal.(μ, σ), tds)) # compute log-likelihood
    return ll
end

lb = [0.0, -1000.0, 1.0]
ub = [1000.0, 1000.0, 100.0]
θ₀ = [500.0, 0.0, 50.0]
optim_out = Optim.optimize(θ -> -tds_riverflow_loglik(θ, tds.tds_mgL, tds.discharge_cms), lb, ub, θ₀)
θ_mle = round.(optim_out.minimizer; digits=0)

# simulate 10,000 predictions
x = 1:0.1:60
μ = θ_mle[1] .+ θ_mle[2] * log.(x)

y_pred = zeros(length(x), 10_000)
for i = 1:length(x)
    y_pred[i, :] = rand(Normal(μ[i], θ_mle[3]), 10_000)
end
# take quantiles to find prediction intervals
y_q = mapslices(v -> quantile(v, [0.05, 0.5, 0.95]), y_pred; dims=2)

Resulting Prediction Intervals

p = scatter(
    tds.discharge_cms,
    tds.tds_mgL,
    xlabel=L"Discharge (m$^3$/s)",
    ylabel="Total dissolved solids (mg/L)",
    markersize=5,
    label="Observations",
    size = (1100, 500)
)
plot!(x,
    μ,
    ribbon = (y_q[:, 2] - y_q[:, 1], y_q[:, 3] - y_q[:, 2]),
    linewidth=3, 
    fillalpha=0.2, 
    label="Best Fit")

Figure 1: Prediction intervals for the TDS-riverflow model.

Checking Assumptions

Always check assumptions of probability model.

In this case: are the residuals independently and normally distributed with constant variance?

If not:

  1. Relationship might not be linear.
  2. Might need a different distribution for the errors.

Checking Assumptions

Figure 2: (a) Residuals for the TDS-Riverflow model.

Logistic Regression

Models for Classification

Suppose instead of predicting values, we want to predict membership in a class.

Examples:

  • Will it snow in Ithaca tomorrow?
  • Will we see a bird in a location?
  • Will this person get heart disease in the next five years?

Probability Model for Classification

Would like conditional distribution of the probabilities, \(p(Y | X)\).

Should we treat a prediction with a 51% probability of class membership the same as one with 90%?

Indicator Variables

For simplicity, assume our response variable \(Y\) is binary: 0 or 1.

Such a variable is called an indicator variable (indicator variables are also sometimes used as predictors in linear regression).

Would like to model \(\mathbb{E}[Y] = p(Y=1)\) or \(\mathbb{E}[Y | X] = p(Y=1 | X)\).

Approach to Modeling

We could model \(Y\) as a linear regression. Would this work?

Instead, what distribution might we use to model classification outcomes?

Bernoulli Distribution

“Recall” that a Bernoulli distribution models the outcome of a random indicator variable.

Consider a sequence of Bernoulli trials with constant probability \(p\) of “success” (\(Y=1\)):

\[\mathcal{L}(p | Y = y_i, X = x_i) = \prod_{i=1}^n p^{y_i}(1-p)^{1-y_i}\]
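For a concrete (made-up) example, the log of this likelihood is easy to evaluate in Julia:

using Distributions

y = [1, 0, 1, 1, 0, 1] # hypothetical indicator observations
bernoulli_loglik(p, y) = sum(logpdf.(Bernoulli(p), y))

bernoulli_loglik(0.5, y) # log-likelihood at p = 0.5
bernoulli_loglik(2/3, y) # higher: the sample proportion is the MLE here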

Inhomogeneous Probabilities

We often want to know how covariates/predictors influence the classification probability.

This means we want to model:

\[\mathcal{L}(p_i | Y = y_i, X = \mathbf{x}_i) = \prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i}\]

where \(p_i = g(\mathbf{x}_i)\).

How To Model Probabilities?

Some approaches:

  1. Model \(p_i\) as a linear regression in \(x\). Would this work?
  2. Use a transformation \(f\) of \(p\) with domain \([0, 1]\) and unbounded range, \[f: [0, 1] \to \mathbb{R}.\]

Logit Function

Simplest choice: the logit (log-odds) function

\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right)\]

p = 0.01:0.01:0.99 # avoid p = 0 and p = 1, where the logit is infinite
logit(p) = log(p / (1 - p))
plot(p, logit.(p), xlabel=L"$p$", ylabel=L"$\mathrm{logit}(p)$", linewidth=3, legend=false, size=(600, 450))
Figure 3: Logit Function

Logistic Regression

This gives us the logistic regression model:

\[\text{logit}(p_i) = \sum_j \beta_j x^i_j\]

Solving for \(p\):

\[p_i = \frac{\exp\left(\sum_j \beta_j x^i_j\right)}{1 + \exp\left(\sum_j \beta_j x^i_j\right)}.\]

Then we can predict \(Y = 1\) when \(p_i > p_\text{thresh}\) (usually 0.5), or just work with the probabilities directly.
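A minimal sketch of this mapping with made-up coefficients (the names and values below are illustrative):

# inverse-logit (logistic) function maps the linear predictor back to [0, 1]
invlogit(z) = 1 / (1 + exp(-z))

β₀, β₁ = -2.0, 0.8 # hypothetical fitted coefficients
x = 1.5

p = invlogit(β₀ + β₁ * x) # predicted probability that Y = 1
ŷ = p > 0.5 ? 1 : 0       # predicted class with p_thresh = 0.5

Writing the inverse logit as \(1/(1 + \exp(-z))\) avoids overflow when the linear predictor is large and positive.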

Assumptions for Logistic Regression

  1. There is a linear decision boundary, \(\sum_j \beta_j x_j = \text{logit}(p_\text{thresh})\) (sketched after this list).
  2. Class probabilities change as we move away from the boundary based on \(\|\beta\|\).
  3. The larger \(\|\beta\|\), the smaller the change in \(x\) required to move to the extremes.
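A minimal sketch of where that decision boundary falls for a single hypothetical predictor:

logit(p) = log(p / (1 - p))

β₀, β₁ = -2.0, 0.8 # hypothetical fitted coefficients
p_thresh = 0.5

# boundary: β₀ + β₁x = logit(p_thresh), which is 0 for p_thresh = 0.5
x_boundary = (logit(p_thresh) - β₀) / β₁ # = 2.5 here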

Interpretation of \(\beta\)

\[\log\left(\frac{p}{1-p}\right)\] is called the log-odds.

\(\exp(\beta_i)\) is the factor by which the odds of \(Y=1\) are multiplied when \(x_i\) increases by one unit (holding the other predictors fixed). For example, \(\beta_i = 0.7\) gives \(\exp(0.7) \approx 2\), roughly doubling the odds.
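A small sketch of that interpretation (hypothetical numbers): the coefficient multiplies the odds, not the probability.

βᵢ = 0.7 # hypothetical coefficient
odds(p) = p / (1 - p)

p_ref = 0.30                      # probability at some reference value of xᵢ
odds_new = odds(p_ref) * exp(βᵢ)  # odds after increasing xᵢ by one unit
p_new = odds_new / (1 + odds_new) # back to a probability (≈ 0.46, not 0.60)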

Logistic Regression Likelihood

\[\begin{aligned} \mathcal{L}(\beta_0, \beta) &= \prod_{i=1}^n p(x_i)^{y_i} (1-p(x_i))^{1-y_i} \\ \ell(\beta_0, \beta) &= \sum_{i=1}^n y_i \log p(x_i) + (1-y_i) \log (1-p(x_i)) \\ &= \sum_{i=1}^n \log (1-p(x_i)) + \sum_{i=1}^n y_i \log\left(\frac{p(x_i)}{1-p(x_i)}\right) \end{aligned}\]

Logistic Regression Log-Likelihood

\[\begin{aligned} \ell(\beta_0, \beta) &= \sum_{i=1}^n \log(1-p(x_i)) + \sum_{i=1}^n y_i(\beta_0 + \mathbf{x}_i \cdot \beta) \\ &= \sum_{i=1}^n -\log(1+\exp(\beta_0 + \mathbf{x}_i \cdot \beta)) + \sum_{i=1}^n y_i(\beta_0 + \mathbf{x}_i \cdot \beta) \end{aligned}\]

Maximum Likelihood

Now we can differentiate:

\[\begin{aligned} \frac{\partial \ell}{\partial \beta_j} &= -\sum_{i=1}^n \frac{\exp(\beta_0 + \mathbf{x}_i \cdot \beta)}{1 + \exp(\beta_0 + \mathbf{x}_i \cdot \beta)} x^i_j + \sum_{i=1}^n y_i x^i_j \\ &= \sum_{i=1}^n (y_i - p(x_i | \beta_0, \beta)) x^i_j \end{aligned}\]

We need to use numerical optimization to find the zeroes of this!
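A minimal sketch of that numerical optimization, following the same pattern as the TDS example (simulated data with one predictor; everything here is illustrative, not part of the earlier example):

using Distributions, Optim, Random

Random.seed!(1)
# simulate data from a known logistic model
n = 500
x = randn(n)
y = rand.(Bernoulli.(1 ./ (1 .+ exp.(-(0.5 .+ 2.0 .* x)))))

# log-likelihood from the previous slide, θ = (β₀, β₁)
function logistic_loglik(θ, y, x)
    η = θ[1] .+ θ[2] .* x                    # linear predictor
    return sum(y .* η .- log.(1 .+ exp.(η))) # Σᵢ [yᵢηᵢ - log(1 + exp(ηᵢ))]
end

# maximize by minimizing the negative log-likelihood (Nelder-Mead by default)
optim_out = Optim.optimize(θ -> -logistic_loglik(θ, y, x), [0.0, 0.0])
θ_mle = optim_out.minimizer # should be close to (0.5, 2.0)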

Multi-Class Logistic Regression

What if \(Y\) can take on more than just two values?

We can still use logistic regression; we just need to modify the setup.

Now each class \(c\) will have its own intercept \(\beta_0^c\) and coefficients \(\beta^c\).

Multi-Class Logistic Regression

Predicted probabilities:

\[p(Y = c | X = \mathbf{x}) = \frac{\exp\left(\beta_0^c + \mathbf{x} \cdot \beta^c \right)}{\sum_{c'} \exp\left(\beta_0^{c'} + \mathbf{x} \cdot \beta^{c'}\right) }\]

Maximizing the likelihood works the same way, but each observation's class is encoded as one of the possible outcomes and the likelihood uses a Categorical distribution instead of a Bernoulli.
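A minimal sketch of those predicted probabilities for three classes and two predictors, with made-up coefficients:

β₀ = [0.1, -0.3, 0.2] # hypothetical intercepts, one per class
B = [0.5 -1.0;        # hypothetical coefficients, one row per class
     0.2  0.8;
    -0.4  0.1]
x = [1.0, 2.0]        # a single observation

# softmax of the per-class linear predictors gives p(Y = c | X = x)
η = β₀ .+ B * x
p = exp.(η) ./ sum(exp.(η)) # probabilities sum to 1

In practice one class's coefficients are often fixed to zero as a reference, since the probabilities are unchanged if the same constant is added to every class's linear predictor.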

Key Points

Logistic Regression

  • Approach for classification problems.
  • Models probability of class occurrence.
  • Assumes a linear decision boundary.

Upcoming Schedule

Lectures

Wednesday: Generalized Linear Models

Friday: Quiz 1, wrap up week

Assessments

HW2 Released: Due 2/20

Quiz 1: Friday, on material through today.
