Models for Classification
Suppose instead of predicting values, we want to predict membership in a class.
Examples:
- Will it snow in Ithaca tomorrow?
- Will we see a bird in a location?
- Will this person get heart disease in the next five years?
Probability Model for Classification
Would like the conditional distribution of the class probabilities, \(p(Y | X)\).
Should we treat a prediction with a 51% probability of a class the same as one with 90%?
Indicator Variables
For simplicity, assume our response variable \(Y\) is binary: 0 or 1.
Such a variable is called an indicator variable (also sometimes used as predictors in linear regression).
Would like to model \(\mathbb{E}[Y] = p(Y=1)\) or \(\mathbb{E}[Y | X] = p(Y=1 | X)\).
Approach to Modeling
We could model \(Y\) directly with a linear regression. Would this work? (A linear model's predictions are unbounded, so they can't be interpreted as probabilities.)
Instead, what distribution might we use to model classification outcomes?
Bernoulli Distribution
“Recall” that a Bernoulli distribution models the outcome of a random indicator variable.
Consider a sequence of Bernoulli trials with constant probability \(p\) of “success” (\(Y=1\)):
\[\mathcal{L}(p | Y = y_i, X = x_i) = \prod_{i=1}^n p^{y_i}(1-p)^{1-y_i}\]
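With a constant \(p\), the maximum-likelihood estimate is just the sample proportion of successes. A minimal sketch with hypothetical data:

```julia
# Hypothetical data: eight Bernoulli trials (1 = "success")
y = [1, 0, 1, 1, 0, 1, 0, 1]

# Log-likelihood of a constant success probability p
loglik(p, y) = sum(@. y * log(p) + (1 - y) * log(1 - p))

# The MLE of a constant p is the sample proportion of successes
p̂ = sum(y) / length(y)
loglik(p̂, y)
```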
Inhomogeneous Probabilities
We often want to know how covariates/predictors influence the classification probability.
This means we want to model:
\[\mathcal{L}(p_i | Y = y_i, X = \mathbf{x}_i) = \prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i}\]
where \(p_i = g(\mathbf{x}_i)\).
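The same log-likelihood in code, now with observation-specific probabilities (a sketch; `g` is a placeholder for the link we choose below):

```julia
# Bernoulli log-likelihood with observation-specific pᵢ = g(xᵢ);
# `g` stands in for whatever function of the predictors we choose
loglik(g, x, y) = sum(@. y * log(g(x)) + (1 - y) * log(1 - g(x)))
```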
How To Model Probabilities?
Some approaches:
- Model \(p_i\) as a linear regression in \(x\). Would this work?
- Use a transformation \(f\) of \(p\) with domain \([0, 1]\) and unbounded range, \[f: [0, 1] \to \mathbb{R}.\]
Logit Function
Simplest function: the logit (or log-odds) function
\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right)\]
```julia
using Plots, LaTeXStrings

# Avoid the endpoints p = 0 and p = 1, where logit(p) is ±Inf
p = 0.001:0.001:0.999
logit(p) = log(p / (1 - p))
plot(p, logit.(p), xlabel=L"$p$", ylabel=L"$\mathrm{logit}(p)$",
     linewidth=3, legend=false, size=(600, 450))
```
Logistic Regression
This gives us the logistic regression model:
\[\text{logit}(p_i) = \sum_j \beta_j x^i_j\]
Solving for \(p\):
\[p_i = \frac{\exp\left(\sum_j \beta_j x^i_j\right)}{1 + \exp\left(\sum_j \beta_j x^i_j\right)}.\]
Then we can predict class 1 when \(p > p_\text{thresh}\) (usually 0.5), or just work with the probabilities directly.
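A minimal sketch of turning a fitted model into a prediction; the coefficients here are made up for illustration:

```julia
using LinearAlgebra

# Inverse of the logit: maps the linear predictor back to (0, 1)
logistic(z) = 1 / (1 + exp(-z))

# Hypothetical fitted coefficients (illustration only)
β₀ = 0.5
β = [1.2, -0.8]

x = [0.3, 1.1]                  # one observation's predictors
p = logistic(β₀ + dot(x, β))    # p(Y = 1 | X = x)
ŷ = p > 0.5 ? 1 : 0             # classify with threshold 0.5
```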
Assumptions for Logistic Regression
- There is a linear decision boundary, \(\sum_j x_j \beta_j = \text{logit}(p_\text{thresh})\).
- Class probabilities change as we move away from the boundary based on \(\|\beta\|\).
- The larger \(\|\beta\|\), the smaller the change in \(x\) required to move to the extremes.
Interpretation of \(\beta\)
\[\log\left(\frac{p}{1-p}\right)\] is called the log-odds.
\(\exp(\beta_j)\) is the multiplicative change in the odds of \(Y=1\) when the covariate \(x_j\) increases from a reference value \(x_j\) to \(x_j + 1\).
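For example, a coefficient \(\beta_j = 0.7\) means a one-unit increase in \(x_j\) roughly doubles the odds:
\[\frac{p'/(1-p')}{p/(1-p)} = \exp(\beta_j) = \exp(0.7) \approx 2.01,\]
where \(p'\) is the probability at \(x_j + 1\).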
Logistic Regression Likelihood
\[\begin{aligned}
\mathcal{L}(\beta_0, \beta) &= \prod_{i=1}^n p(x_i)^{y_i} (1-p(x_i))^{1-y_i} \\
\ell(\beta_0, \beta) &= \sum_{i=1}^n y_i \log p(x_i) + (1-y_i) \log (1-p(x_i)) \\
&= \sum_{i=1}^n \log (1-p(x_i)) + \sum_{i=1}^n y_i \log\left(\frac{p(x_i)}{1-p(x_i)}\right)
\end{aligned}\]
Logistic Regression Log-Likelihood
\[\begin{aligned}
\ell(\beta_0, \beta) &= \sum_{i=1}^n \log(1-p(x_i)) + \sum_{i=1}^n y_i(\beta_0 + \mathbf{x}_i \cdot \beta) \\
&= \sum_{i=1}^n -\log(1+\exp(\beta_0 + \mathbf{x}_i \cdot \beta)) + \sum_{i=1}^n y_i(\beta_0 + \mathbf{x}_i \cdot \beta)
\end{aligned}\]
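A direct transcription of the last line into code (a sketch; \(X\) is assumed to be an \(n \times k\) predictor matrix and \(y\) a 0/1 vector):

```julia
# Log-likelihood in the form above: Σᵢ [-log(1 + exp(ηᵢ)) + yᵢηᵢ],
# where ηᵢ = β₀ + xᵢ ⋅ β is the linear predictor
function loglik(β₀, β, X, y)
    η = β₀ .+ X * β
    return sum(@. -log(1 + exp(η)) + y * η)
end
```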
Maximum Likelihood
Now we can differentiate:
\[\begin{aligned}
\frac{\partial \ell}{\partial \beta_j} &= -\sum_{i=1}^n \frac{\exp(\beta_0 + \mathbf{x}_i \cdot \beta)}{1 + \exp(\beta_0 + \mathbf{x}_i \cdot \beta)} x^i_j + \sum_{i=1}^n y_i x^i_j \\
&= \sum_{i=1}^n (y_i - p(x_i | \beta_0, \beta)) x^i_j
\end{aligned}\]
We need to use numerical optimization to find the zeroes of this!
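One way to do this (a sketch using the Optim.jl package, reusing the hypothetical `loglik`, `X`, and `y` from the previous sketch):

```julia
using Optim

# Minimize the negative log-likelihood over θ = [β₀; β]
negloglik(θ) = -loglik(θ[1], θ[2:end], X, y)

# Analytic gradient from the derivation above: ∂ℓ/∂βⱼ = Σᵢ (yᵢ - pᵢ)xᵢⱼ
function negloglik_grad!(g, θ)
    η = θ[1] .+ X * θ[2:end]
    p = 1 ./ (1 .+ exp.(-η))
    g[1] = -sum(y .- p)           # negative of ∂ℓ/∂β₀
    g[2:end] .= -(X' * (y .- p))  # negative of ∂ℓ/∂βⱼ
    return g
end

result = optimize(negloglik, negloglik_grad!, zeros(size(X, 2) + 1), BFGS())
θ̂ = Optim.minimizer(result)
```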
Multi-Class Logistic Regression
What if \(Y\) can take on more than just two values?
We can still use logistic regression, just need to modify the setup.
Now each class \(c\) will have its own intercept \(\beta_0^c\) and coefficients \(\beta^c\).
Multi-Class Logistic Regression
Predicted probabilities:
\[p(Y = c | X = \mathbf{x}) = \frac{\exp\left(\beta_0^c + \mathbf{x} \cdot \beta^c \right)}{\sum_{c'} \exp\left(\beta_0^{c'} + \mathbf{x} \cdot \beta^{c'}\right)}\]
Maximizing the likelihood is similar: we encode each class as a distinct outcome and replace the Bernoulli distribution with a Categorical distribution.
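A sketch of these softmax probabilities, with made-up coefficients (one column of `B` per class):

```julia
# Hypothetical per-class intercepts β₀ᶜ and coefficients βᶜ
β₀ = [0.1, -0.3, 0.2]              # one intercept per class
B  = [1.0 -0.5  0.2;               # one column per class
      0.3  0.8 -1.1]

# Exponentiate each class's linear predictor and normalize
function class_probs(x, β₀, B)
    η = β₀ .+ B' * x               # one linear predictor per class
    e = exp.(η .- maximum(η))      # subtract the max for numerical stability
    return e ./ sum(e)
end

class_probs([0.5, 1.0], β₀, B)     # probabilities sum to 1 across classes
```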