Lecture 05
February 2, 2026
What can we use a probability model for?
Let’s say that we want to predict the value of a variable \(y \sim Y\). We need a criterion to define the “best” point prediction.
Reasonable starting point:
\[\text{MSE}(m) = \mathbb{E}\left[(Y - m)^2\right]\]
\[\begin{aligned} \mathbb{V}(Y) &= \mathbb{E}\left[(Y - \mathbb{E}[Y])^2\right] \\ &= \mathbb{E}\left[Y^2 - 2Y\mathbb{E}[Y] + \mathbb{E}[Y]^2\right] \\ &= \mathbb{E}[Y^2] - 2\mathbb{E}[Y]^2 + \mathbb{E}[Y]^2 = \mathbb{E}[Y^2] - \mathbb{E}[Y]^2 . \end{aligned}\]
\[\Rightarrow \text{MSE}(m) = \mathbb{E}\left[(Y - m)^2\right] = \mathbb{E}\left[(Y-m)\right]^2 + \mathbb{V}(Y - m).\]
Then:
\[\begin{aligned} \text{MSE}(m) &= \mathbb{E}\left[(Y-m)\right]^2 + \mathbb{V}(Y - m) \\ &= \mathbb{E}\left[(Y-m)\right]^2 + \mathbb{V}(Y) \\ &= \left(\mathbb{E}[Y] - m\right)^2 + \mathbb{V}(Y). \end{aligned}\]
This is the source of the so-called “bias-variance tradeoff” (more on this later).
We want to find the minimum value of \(\text{MSE}(m)\) (denote the optimal prediction by \(\mu\)):
\[ \begin{aligned} \frac{d\text{MSE}}{dm} &= -2(\mathbb{E}[Y] - m) + 0 \\ 0 = \left.\frac{d\text{MSE}}{dm}\right|_{m = \mu} &= -2(\mathbb{E}[Y] - \mu) \end{aligned}\]
\[\Rightarrow \mu = \mathbb{E}[Y].\]
In other words, the best point prediction of a random variable (under squared-error loss) is its expectation.
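As a quick numerical check (my own sketch, with an arbitrary made-up distribution for \(Y\)), the decomposition above holds empirically and the MSE-minimizing constant prediction is the sample mean:

```julia
using Random, Statistics

Random.seed!(1)
y = 5 .+ 2 .* randn(100_000)            # hypothetical samples from some Y

mse(m) = mean((y .- m) .^ 2)            # empirical MSE of a constant prediction m

# Decomposition check: MSE(m) ≈ (E[Y] - m)² + V(Y)
m = 3.0
println(mse(m), " ≈ ", (mean(y) - m)^2 + var(y))

# The minimizer of the empirical MSE is (approximately) the sample mean
ms = range(0, 10; length = 1001)
println("argmin ≈ ", ms[argmin(mse.(ms))], "; mean(y) ≈ ", mean(y))
```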
But: we usually want our prediction to depend on other variables (predictors), i.e. a model for \(\mathbb{E}[Y \mid X]\).
The “simplest” such model is linear in the predictors.
Why might we generally think of linear models as reasonable?
Question: Does river flow affect the concentration of total dissolved solids?
Data: Cuyahoga River (1969 – 1973), from Helsel et al. (2020, Chapter 9).
Model: \[D \rightarrow S \ {\color{purple}\leftarrow U}\] \[S = f(D, U)\] where \(S\) is the TDS concentration, \(D\) is the river discharge (flow), and \(U\) represents other (unobserved) factors.
We can address the non-linear relationship in this case with a transformation.
It is generally better to transform the predictor \(D\) than the response \(S\): \(\mathbb{E}[f(Y)] \neq f(\mathbb{E}[Y])\) (and these may be very different), so transforming the response makes predictions on the original scale awkward (see the small illustration below).
\[ \begin{align*} S &= \beta_0 + \beta_1 \log(D) + U \end{align*} \]
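To see the issue with transforming the response, here is a small illustration (my own sketch, with made-up lognormal data) of \(\mathbb{E}[f(Y)] \neq f(\mathbb{E}[Y])\):

```julia
using Random, Statistics

Random.seed!(1)
y = exp.(randn(100_000))                 # hypothetical lognormal "response"

println("mean(y)            = ", mean(y))               # ≈ exp(1/2) ≈ 1.65
println("exp(mean(log.(y))) = ", exp(mean(log.(y))))    # ≈ exp(0)   = 1.00
```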
Idea (from Gauss and others): the system of equations \(S_i = \beta_0 + \beta_1 \log(D_i)\), \(i = 1, \dots, n\), is overdetermined (more equations than unknowns).
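Concretely, stacking the \(n\) observations gives \(n\) equations in only two unknowns:

\[\begin{pmatrix} S_1 \\ S_2 \\ \vdots \\ S_n \end{pmatrix} = \begin{pmatrix} 1 & \log(D_1) \\ 1 & \log(D_2) \\ \vdots & \vdots \\ 1 & \log(D_n) \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix},\]

which generically has no exact solution once \(n > 2\).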
Since we can’t solve it exactly, use minimization of the sum of squared residuals (equivalently, the mean squared error) as a heuristic for a “good fit”.
\[MSE(b) = \frac{1}{n}\sum_{i=1}^n (Y_i - X_i b)^2\]
To be more concrete, let’s assume one predictor, \[y = \beta_0 + \beta_1 x.\]
Then we want to minimize \[MSE(b_0, b_1) = \frac{1}{n} \sum_{i=1}^n \left(y_i - (b_0 + b_1 x_i)\right)^2.\]
\[ \begin{gathered} \frac{\partial MSE}{\partial b_0} = 0 \\ -\frac{2}{n} \sum_{i=1}^n \left(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right) = 0\\ \bar{y} - \hat{\beta}_0 - \hat{\beta}_1 \bar{x} = 0 \\ \Rightarrow \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \end{gathered} \]
\[ \begin{gathered} \frac{\partial MSE}{\partial b_1} = 0 \\ -\frac{2}{n} \sum_{i=1}^n \left(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right)x_i = 0\\ \bar{xy} - \hat{\beta}_0\bar{x} - \hat{\beta}_1 \bar{x^2} = 0 \\ \end{gathered} \]
\[ \begin{gathered} \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \\ \bar{xy} - \hat{\beta}_0\bar{x} - \hat{\beta}_1 \bar{x^2} = 0 \\ \Rightarrow \bar{xy} - \bar{y}\bar{x} + \hat{\beta}_1 \bar{x}\bar{x} - \hat{\beta}_1 \bar{x^2} = 0\\ \text{Cov}(X,Y) - \hat{\beta}_1 \text{Var}(X) = 0 \\ \Rightarrow \hat{\beta}_1 = \text{Cov}(X,Y) / \text{Var}(X). \end{gathered} \]
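A minimal implementation sketch of these closed-form estimates (mine, not the course’s code; `x` and `y` are generic data vectors):

```julia
using Statistics

# Simple least-squares coefficients from the closed-form solution above:
#   slope     = Cov(x, y) / Var(x)
#   intercept = ȳ - slope * x̄
function ols_fit(x::AbstractVector, y::AbstractVector)
    b1 = cov(x, y) / var(x)
    b0 = mean(y) - b1 * mean(x)
    return b0, b1
end
```

As a sanity check, this should match Julia’s generic least-squares solve `[ones(length(x)) x] \ y`.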
To get an estimate \(\hat{\sigma}^2\) of the error variance, a natural first step is to use the MSE of the fitted model:
\[\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n \left(y_i - \hat{m}(x_i)\right)^2,\] where \(\hat{m}(x_i) = \hat{\beta}_0 + \hat{\beta}_1 x_i\) is the fitted value.
This is actually a slightly biased estimator, but it can be made unbiased by inflating it to account for the two estimated coefficients:
\[s^2 = \frac{n}{n-2}\hat{\sigma}^2\]
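A corresponding sketch (again mine) of the two variance estimates, given fitted coefficients `b0` and `b1`:

```julia
using Statistics

# Residual variance estimates for a fitted simple linear regression ŷᵢ = b0 + b1*xᵢ.
function error_variance(x, y, b0, b1)
    resid = y .- (b0 .+ b1 .* x)
    sigma2_hat = mean(resid .^ 2)                     # MSE of the fit (slightly biased)
    s2 = length(y) / (length(y) - 2) * sigma2_hat     # bias-corrected version
    return sigma2_hat, s2
end
```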
For our River Flow-TDS example:
β = [609.5487272461863, -111.63106902057183]
s² = 5266.230284046525
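For reference, a sketch of how numbers like these could be reproduced, assuming the Cuyahoga data are in a CSV file named `cuyahoga_tds.csv` with columns `flow` and `tds` (both the file name and column names are hypothetical):

```julia
using CSV, DataFrames, Statistics

# Hypothetical file and column names -- adjust to the actual dataset.
df = CSV.read("cuyahoga_tds.csv", DataFrame)
X = [ones(nrow(df)) log.(df.flow)]        # design matrix: intercept and log-flow
y = df.tds

β = X \ y                                 # least-squares solve for (β₀, β₁)
s² = sum((y .- X * β) .^ 2) / (length(y) - 2)
println("β = ", β, ", s² = ", s²)
```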
Wednesday: More on linear regression (including multiple regression).
Friday: Maximum likelihood and uncertainty.
Homework 1 Due Friday, 2/6.
Exercises: Will be available later this week.
Reading: Shmueli (2010).
Quiz 1: Next Friday (2/13).