Linear Regression


Lecture 05

February 2, 2026

Review

EDA and Data Visualization

  • Is my data fit for purpose?
  • What are the features of the data?
  • EDA methods are valuable even once we start to fit models, to check appropriateness of assumptions.

Probability Models

Uses of Probability Models

What can we use a probability model for?

  1. Summaries of data: store \(y = f(\mathbf{x})\) instead of all data points \((y, x_1, \ldots, x_n)\)
  2. Predict new data (interpolation/extrapolation): \(\hat{y} = f(\hat{\mathbf{x}})\)
  3. Infer relationships between variables (interpret coefficients of \(f\))

Why Do We Need Models For Data?

  • Data are imperfect: Data \(X\) are only one realization of the data that could have been observed and/or we can only measure indirect proxies of what we care about.
  • Over time, statisticians learned to treat these imperfections as the results of random processes (for some of this history, see Hacking (1990)), requiring probability models.
  • Data are incomplete: We often want to predict values for unobserved data (interpolation or extrapolation).

Predicting Random Variables

Let’s say that we want to predict the value of a variable \(y \sim Y\). We need a criterion to define the “best” point prediction.

Reasonable starting point:

\[\text{MSE}(m) = \mathbb{E}\left[(Y - m)^2\right]\]

MSE As (Squared) Bias + Variance

\[\begin{aligned} \mathbb{V}(Y) &= \mathbb{E}\left[(Y - \mathbb{E}[Y])^2\right] \\ &= \mathbb{E}\left[Y^2 - 2Y\mathbb{E}[Y] + \mathbb{E}[Y]^2\right] \\ &= \mathbb{E}[Y^2] - 2\mathbb{E}[Y]^2 + \mathbb{E}[Y]^2 = \mathbb{E}[Y^2] - \mathbb{E}[Y]^2 . \end{aligned}\]

Applying this identity to \(Z = Y - m\), so that \(\mathbb{E}[Z^2] = \mathbb{E}[Z]^2 + \mathbb{V}(Z)\):

\[\Rightarrow \text{MSE}(m) = \mathbb{E}\left[(Y - m)^2\right] = \mathbb{E}\left[(Y-m)\right]^2 + \mathbb{V}(Y - m).\]

Bias-Variance Decomposition of MSE

Then:

\[\begin{aligned} \text{MSE}(m) &= \mathbb{E}\left[(Y-m)\right]^2 + \mathbb{V}(Y - m) \\ &= \mathbb{E}\left[(Y-m)\right]^2 + \mathbb{V}(Y) \\ &= \left(\mathbb{E}[Y] - m\right)^2 + \mathbb{V}(Y). \end{aligned}\]

This is the source of the so-called “bias-variance tradeoff” (more on this later).
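
As a quick numeric check of this decomposition (toy data, not from the lecture), the empirical MSE at any \(m\) equals the squared bias plus the variance computed with a \(1/n\) denominator:

using Statistics

y = [2.1, 3.5, 4.0, 5.2, 6.3]
m = 3.0
lhs = mean((y .- m) .^ 2)                          # empirical MSE(m)
rhs = (mean(y) - m)^2 + mean((y .- mean(y)) .^ 2)  # squared bias + variance
@show lhs ≈ rhs                                    # true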

Optimizing…

We want to find the minimum value of \(\text{MSE}(m)\) (denote the optimal prediction by \(\mu\)):

\[ \begin{aligned} \frac{d\text{MSE}}{dm} &= -2(\mathbb{E}[Y] - m) + 0 \\ 0 = \left.\frac{d\text{MSE}}{dm}\right|_{m = \mu} &= -2(\mathbb{E}[Y] - \mu) \end{aligned}\]

\[\Rightarrow \mu = \mathbb{E}[Y].\]
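
A quick numeric illustration (same toy data as above): a grid search over candidate predictions \(m\) recovers the sample mean as the minimizer of the empirical MSE.

using Statistics

y = [2.1, 3.5, 4.0, 5.2, 6.3]
mse(m) = mean((y .- m) .^ 2)
grid = 0:0.01:10
@show grid[argmin(mse.(grid))]  # ≈ 4.22
@show mean(y)                   # 4.22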

Expected Value As Best Prediction

In other words, the best predicted value of a random variable is its expectation.

But:

  1. We usually need a model to estimate \(\mathbb{E}[Y | X=x]\).
  2. In many applications, we don’t just want a point estimate; we also need some estimate of the range of values we might observe.

“Simple” Linear Regression

Linear Regression

The “simplest” model is linear:

  1. The distribution of \(X\) is arbitrary.
  2. If \(X = \mathbf{x}\), then \(Y = \beta_0 + \sum_{j=1}^p \beta_j x_j + \varepsilon\).
  3. \(\mathbb{E}[\varepsilon \mid X = \mathbf{x}] = 0\), \(\mathbb{V}[\varepsilon \mid X = \mathbf{x}] = \sigma^2\).
  4. \(\varepsilon\) is uncorrelated across observations (see the simulation sketch below).
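
A minimal simulation sketch of these assumptions for a single predictor (the names and numeric values are illustrative, not from the lecture):

using Random

Random.seed!(1)
n = 100
x = rand(n)             # 1. the distribution of X is arbitrary
β0, β1, σ = 1.0, 2.0, 0.5
ε = σ .* randn(n)       # 3./4. mean-zero, constant-variance, uncorrelated errors
y = β0 .+ β1 .* x .+ ε  # 2. Y is linear in x given X = x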

Why might we generally think of linear models as reasonable?

Example: How Does River Flow Affect TDS?

Question: Does river flow affect the concentration of total dissolved solids?

Data: Cuyahoga River (1969 – 1973), from Helsel et al. (2020, Chapter 9).

How Does River Flow Affect TDS?

Question: Does river flow affect the concentration of total dissolved solids?

Model: \[D \rightarrow S \ {\color{purple}\leftarrow U}\] \[S = f(D, U)\]

Transforming Data

Can address the non-linear relationship in this case with a transformation.

Better in general to transform the predictor \(D\) rather than the response \(S\): \(\mathbb{E}[f(Y)] \neq f(\mathbb{E}[Y])\) (and these may be very different).

\[ \begin{align*} S &= \beta_0 + \beta_1 \log(D) + U \end{align*} \]
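
A small illustration of why transforming the response and back-transforming is risky (toy numbers, not the river data): the mean of a log-transformed variable is not the log of its mean.

using Statistics

y = [1.0, 10.0, 100.0]
@show mean(log.(y))  # ≈ 2.30
@show log(mean(y))   # ≈ 3.61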

Ordinary Least Squares

Idea (from Gauss and others): \(S_i = \beta_0 + \beta_1 \log(D_i)\), \(i = 1, \ldots, n\), is an overdetermined system of equations.

Since we can’t solve it exactly, we minimize the sum of squared residuals as a heuristic for a “good fit”.

\[MSE(b) = \frac{1}{n}\sum_{i=1}^n (Y_i - X_i b)^2\]

Estimating Equations

To be more concrete, let’s assume one predictor, \[y = \beta_0 + \beta_1 x.\]

Then we want to minimize \[MSE(b_0, b_1) = \frac{1}{n} \sum_{i=1}^n \left(y_i - (b_0 + b_1 x_i)\right)^2.\]
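
A minimal sketch of this objective in code (function and variable names are our own, not from the lecture):

# empirical MSE of the line b0 + b1 * x against observations (x, y)
mse(b0, b1, x, y) = sum((y .- (b0 .+ b1 .* x)) .^ 2) / length(y)

mse(0.0, 1.0, [1.0, 2.0, 3.0], [1.1, 2.3, 2.9])  # ≈ 0.037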

OLS Solution: \(\beta_0\)

\[ \begin{gathered} \frac{\partial MSE}{\partial b_0} = 0 \\ -\frac{2}{n} \sum_{i=1}^n \left(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right) = 0\\ \bar{y} - \hat{\beta}_0 - \hat{\beta}_1 \bar{x} = 0 \\ \Rightarrow \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \end{gathered} \]

OLS Solution: \(\beta_1\)

\[ \begin{gathered} \frac{\partial MSE}{\partial b_1} = 0 \\ -\frac{2}{n} \sum_{i=1}^n \left(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right)x_i = 0\\ \overline{xy} - \hat{\beta}_0\bar{x} - \hat{\beta}_1 \overline{x^2} = 0 \end{gathered} \]

Solving the Estimating Equations

\[ \begin{gathered} \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \\ \overline{xy} - \hat{\beta}_0\bar{x} - \hat{\beta}_1 \overline{x^2} = 0 \\ \Rightarrow \overline{xy} - \bar{y}\bar{x} + \hat{\beta}_1 \bar{x}\bar{x} - \hat{\beta}_1 \overline{x^2} = 0\\ \text{Cov}(X,Y) - \hat{\beta}_1 \text{Var}(X) = 0 \\ \Rightarrow \hat{\beta}_1 = \text{Cov}(X,Y) / \text{Var}(X). \end{gathered} \]

Variance of Errors

To get an estimate of the error variance \(\hat{\sigma}^2\), a natural first step is using the MSE:

\[\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n \left(y_i - \hat{m}(x_i)\right)^2.\]

This is actually a slightly biased estimator, but can be made unbiased by inflating the variance:

\[s^2 = \frac{n}{n-2}\hat{\sigma}^2\]
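
A Monte Carlo sketch of why this inflation factor helps (our own check with made-up names and numbers, not part of the lecture): with \(n = 20\) and true \(\sigma^2 = 4\), the naive estimator averages about \(\frac{n-2}{n}\sigma^2 = 3.6\), while \(s^2\) averages about \(4\).

using Random, Statistics

Random.seed!(1)
n, σ², n_trials = 20, 4.0, 10_000
naive = zeros(n_trials)     # MSE-based estimates σ̂²
unbiased = zeros(n_trials)  # inflated estimates s²
for t in 1:n_trials
    x = randn(n)
    y = 2.0 .+ 3.0 .* x .+ sqrt(σ²) .* randn(n)  # data from a known linear model
    b1 = cov(x, y) / var(x)
    b0 = mean(y) - b1 * mean(x)
    resid = y .- (b0 .+ b1 .* x)
    naive[t] = sum(abs2, resid) / n
    unbiased[t] = n / (n - 2) * naive[t]
end
@show mean(naive)     # ≈ (n - 2)/n * σ² ≈ 3.6
@show mean(unbiased)  # ≈ σ² = 4.0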

Linear Model Fitting: OLS

For our River Flow-TDS example:

using DataFrames, Statistics  # nrow, cov, var, mean

# tds: DataFrame of the Cuyahoga River discharge and TDS observations
X = log.(tds.discharge_cms)  # predictors (log-transformed discharge)
Y = tds.tds_mgL  # predictands (TDS concentration, mg/L)
β = zeros(2)
β[2] = cov(X, Y) / var(X)  # slope: Cov(X, Y) / Var(X)
β[1] = mean(Y) - β[2] * mean(X)  # intercept: ȳ - β̂₁x̄
@show β;
ε = Y - [ones(length(X)) X] * β  # residuals
s² = (ε' * ε) / (nrow(tds) - 2)  # error variance estimate
@show s²;
β = [609.5487272461863, -111.63106902057183]
s² = 5266.230284046525
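
As a sanity check (our addition, not part of the lecture code), Julia’s built-in least-squares solve via the backslash operator should reproduce these coefficients:

β_ls = [ones(length(X)) X] \ Y  # least-squares solution of the overdetermined system
@show β_ls                      # expect ≈ [609.55, -111.63]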

Key Points

Key Points

  • Models let us make predictive statements about data.
  • Best point prediction: \(\mathbb{E}[Y]\), but this is a useless statement without some model.
  • Linear regression is a first example.
  • Classical approach: solve using ordinary least squares.

Upcoming Schedule

Next Classes

Wednesday: More on linear regression (including multiple regression).

Friday: Maximum likelihood and uncertainty

Assessments

Homework 1 Due Friday, 2/6.

Exercises: Will be available later this week.

Reading: Shmueli (2010).

Quiz 1: Next Friday (2/13).

References

References

Hacking, I. (1990). The Taming of Chance. Cambridge, England: Cambridge University Press. https://doi.org/10.1017/CBO9780511819766
Helsel, D. R., Hirsch, R. M., Ryberg, K. R., Archfield, S. A., & Gilroy, E. J. (2020). Statistical methods in water resources (research report No. 4-A3) (p. 484). Reston, VA: U.S. Geological Survey. Retrieved from http://pubs.er.usgs.gov/publication/tm4A3
Shmueli, G. (2010). To explain or to predict? Statistical Science, 25(3), 289–310.