Linear Regression


Lecture 05

February 2, 2026

Review

EDA and Data Visualization

  • Is my data fit for purpose?
  • What are the features of the data?
  • EDA methods are valuable even once we start to fit models, to check appropriateness of assumptions.

Probability Models

Uses of Probability Models

What can we use a probability model for?

  1. Summaries of data: store \(y = f(\mathbf{x})\) instead of all data points \((y, x_1, \ldots, x_n)\)
  2. Predict new data (interpolation/extrapolation): \(\hat{y} = f(\hat{\mathbf{x}})\)
  3. Infer relationships between variables (interpret coefficients of \(f\))

Why Do We Need Models For Data?

  • Data are imperfect: Data \(X\) are only one realization of the data that could have been observed and/or we can only measure indirect proxies of what we care about.
  • Over time, statisticians learned to treat these imperfections as the results of random processes (for some of this history, see Hacking (1990)), requiring probability models.
  • Data are incomplete: We often want to predict values for unobserved data (interpolation or extrapolation).

Predicting Random Variables

Let’s say that we want to predict the value of a variable \(y \sim Y\). We need a criterion to define the “best” point prediction.

Reasonable starting point:

\[\text{MSE}(m) = \mathbb{E}\left[(Y - m)^2\right]\]

MSE As (Squared) Bias + Variance

\[\begin{aligned} \mathbb{V}(Y) &= \mathbb{E}\left[(Y - \mathbb{E}[Y])^2\right] \\ &= \mathbb{E}\left[Y^2 - 2Y\mathbb{E}[Y] + \mathbb{E}[Y]^2\right] \\ &= \mathbb{E}[Y^2] - 2\mathbb{E}[Y]^2 + \mathbb{E}[Y]^2 = \mathbb{E}[Y^2] - \mathbb{E}[Y]^2 . \end{aligned}\]

Applying this identity to \(Z = Y - m\), so that \(\mathbb{E}[Z^2] = \mathbb{E}[Z]^2 + \mathbb{V}(Z)\):

\[\Rightarrow \text{MSE}(m) = \mathbb{E}\left[(Y - m)^2\right] = \mathbb{E}\left[(Y-m)\right]^2 + \mathbb{V}(Y - m).\]

Bias-Variance Decomposition of MSE

Then:

\[\begin{aligned} \text{MSE}(m) &= \mathbb{E}\left[(Y-m)\right]^2 + \mathbb{V}(Y - m) \\ &= \mathbb{E}\left[(Y-m)\right]^2 + \mathbb{V}(Y) \\ &= \left(\mathbb{E}[Y] - m\right)^2 + \mathbb{V}(Y). \end{aligned}\]

This is the source of the so-called “bias-variance tradeoff” (more on this later).
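
As a quick numeric check of this decomposition (toy data, not from the lecture), the empirical MSE at any \(m\) equals the squared bias plus the variance computed with a \(1/n\) denominator:

using Statistics

y = [2.1, 3.5, 4.0, 5.2, 6.3]
m = 3.0
lhs = mean((y .- m) .^ 2)                          # empirical MSE(m)
rhs = (mean(y) - m)^2 + mean((y .- mean(y)) .^ 2)  # squared bias + variance
@show lhs ≈ rhs                                    # true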

Optimizing…

We want to find the minimum value of \(\text{MSE}(m)\) (denote the optimal prediction by \(\mu\)):

\[ \begin{aligned} \frac{d\text{MSE}}{dm} &= -2(\mathbb{E}[Y] - m) + 0 \\ 0 = \left.\frac{d\text{MSE}}{dm}\right|_{m = \mu} &= -2(\mathbb{E}[Y] - \mu) \end{aligned}\]

\[\Rightarrow \mu = \mathbb{E}[Y].\]
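
A quick numeric illustration (same toy data as above): a grid search over candidate predictions \(m\) recovers the sample mean as the minimizer of the empirical MSE.

using Statistics

y = [2.1, 3.5, 4.0, 5.2, 6.3]
mse(m) = mean((y .- m) .^ 2)
grid = 0:0.01:10
@show grid[argmin(mse.(grid))]  # ≈ 4.22
@show mean(y)                   # 4.22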

Expected Value As Best Prediction

In other words, the best predicted value of a random variable is its expectation.

But:

  1. We usually need a model to estimate \(\mathbb{E}[Y | X=x]\).
  2. In many applications, we don’t just want a point estimate; we also need some estimate of the range of values we might observe.

“Simple” Linear Regression

Linear Regression

The “simplest” model is linear:

  1. The distribution of \(X\) is arbitrary.
  2. If \(X = \mathbf{x}\), then \(Y = \beta_0 + \sum_{j=1}^p \beta_j x_j + \varepsilon\).
  3. \(\mathbb{E}[\varepsilon \mid X = \mathbf{x}] = 0\), \(\mathbb{V}[\varepsilon \mid X = \mathbf{x}] = \sigma^2\).
  4. \(\varepsilon\) is uncorrelated across observations (see the simulation sketch below).
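
A minimal simulation sketch of these assumptions for a single predictor (the names and numeric values are illustrative, not from the lecture):

using Random

Random.seed!(1)
n = 100
x = rand(n)             # 1. the distribution of X is arbitrary
β0, β1, σ = 1.0, 2.0, 0.5
ε = σ .* randn(n)       # 3./4. mean-zero, constant-variance, uncorrelated errors
y = β0 .+ β1 .* x .+ ε  # 2. Y is linear in x given X = x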

Why might we generally think of linear models as reasonable?

Example: How Does River Flow Affect TDS?

Question: Does river flow affect the concentration of total dissolved solids?

Data: Cuyahoga River (1969 – 1973), from Helsel et al. (2020, Chapter 9).

How Does River Flow Affect TDS?

Question: Does river flow affect the concentration of total dissolved solids?

Model: \[D \rightarrow S \ {\color{purple}\leftarrow U}\] \[S = f(D, U)\]

Transforming Data

Can address the non-linear relationship in this case with a transformation.

Better in general to transform the predictor \(D\) rather than the response \(S\): \(\mathbb{E}[f(Y)] \neq f(\mathbb{E}[Y])\) (and these may be very different).

\[ \begin{align*} S &= \beta_0 + \beta_1 \log(D) + U \end{align*} \]
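
A small illustration of why transforming the response and back-transforming is risky (toy numbers, not the river data): the mean of a log-transformed variable is not the log of its mean.

using Statistics

y = [1.0, 10.0, 100.0]
@show mean(log.(y))  # ≈ 2.30
@show log(mean(y))   # ≈ 3.61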

Ordinary Least Squares

Idea (from Gauss and others): \(S_i = \beta_0 + \beta_1 \log(D_i)\), \(i = 1, \ldots, n\), is an overdetermined system of equations.

Since we can’t solve it exactly, we minimize the sum of squared residuals as a heuristic for a “good fit”.

\[MSE(b) = \frac{1}{n}\sum_{i=1}^n (Y_i - X_i b)^2\]

Estimating Equations

To be more concrete, let’s assume one predictor, \[y = \beta_0 + \beta_1 x.\]

Then we want to minimize \[MSE(b_0, b_1) = \frac{1}{n} \sum_{i=1}^n \left(y_i - (b_0 + b_1 x_i)\right)^2.\]
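
A minimal sketch of this objective in code (function and variable names are our own, not from the lecture):

# empirical MSE of the line b0 + b1 * x against observations (x, y)
mse(b0, b1, x, y) = sum((y .- (b0 .+ b1 .* x)) .^ 2) / length(y)

mse(0.0, 1.0, [1.0, 2.0, 3.0], [1.1, 2.3, 2.9])  # ≈ 0.037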

OLS Solution: \(\beta_0\)

\[ \begin{gathered} \frac{\partial MSE}{\partial b_0} = 0 \\ -\frac{2}{n} \sum_{i=1}^n \left(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right) = 0\\ \bar{y} - \hat{\beta}_0 - \hat{\beta}_1 \bar{x} = 0 \\ \Rightarrow \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \end{gathered} \]

OLS Solution: \(\beta_1\)

\[ \begin{gathered} \frac{\partial MSE}{\partial b_1} = 0 \\ -\frac{2}{n} \sum_{i=1}^n \left(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right)x_i = 0\\ \overline{xy} - \hat{\beta}_0\bar{x} - \hat{\beta}_1 \overline{x^2} = 0 \end{gathered} \]

Solving the Estimating Equations

\[ \begin{gathered} \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \\ \overline{xy} - \hat{\beta}_0\bar{x} - \hat{\beta}_1 \overline{x^2} = 0 \\ \Rightarrow \overline{xy} - \bar{y}\bar{x} + \hat{\beta}_1 \bar{x}\bar{x} - \hat{\beta}_1 \overline{x^2} = 0\\ \text{Cov}(X,Y) - \hat{\beta}_1 \text{Var}(X) = 0 \\ \Rightarrow \hat{\beta}_1 = \text{Cov}(X,Y) / \text{Var}(X). \end{gathered} \]

Variance of Errors

To get an estimate of the error variance \(\hat{\sigma}^2\), a natural first step is using the MSE:

\[\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n \left(y_i - \hat{m}(x_i)\right)^2.\]

This is actually a slightly biased estimator, but can be made unbiased by inflating the variance:

\[s^2 = \frac{n}{n-2}\hat{\sigma}^2\]
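
A Monte Carlo sketch of why this inflation factor helps (our own check with made-up names and numbers, not part of the lecture): with \(n = 20\) and true \(\sigma^2 = 4\), the naive estimator averages about \(\frac{n-2}{n}\sigma^2 = 3.6\), while \(s^2\) averages about \(4\).

using Random, Statistics

Random.seed!(1)
n, σ², n_trials = 20, 4.0, 10_000
naive = zeros(n_trials)     # MSE-based estimates σ̂²
unbiased = zeros(n_trials)  # inflated estimates s²
for t in 1:n_trials
    x = randn(n)
    y = 2.0 .+ 3.0 .* x .+ sqrt(σ²) .* randn(n)  # data from a known linear model
    b1 = cov(x, y) / var(x)
    b0 = mean(y) - b1 * mean(x)
    resid = y .- (b0 .+ b1 .* x)
    naive[t] = sum(abs2, resid) / n
    unbiased[t] = n / (n - 2) * naive[t]
end
@show mean(naive)     # ≈ (n - 2)/n * σ² ≈ 3.6
@show mean(unbiased)  # ≈ σ² = 4.0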

Linear Model Fitting: OLS

For our River Flow-TDS example:

using DataFrames, Statistics  # nrow, cov, var, mean

# tds: DataFrame of the Cuyahoga River discharge and TDS observations
X = log.(tds.discharge_cms)  # predictors (log-transformed discharge)
Y = tds.tds_mgL  # predictands (TDS concentration, mg/L)
β = zeros(2)
β[2] = cov(X, Y) / var(X)  # slope: Cov(X, Y) / Var(X)
β[1] = mean(Y) - β[2] * mean(X)  # intercept: ȳ - β̂₁x̄
@show β;
ε = Y - [ones(length(X)) X] * β  # residuals
s² = (ε' * ε) / (nrow(tds) - 2)  # error variance estimate
@show s²;
β = [609.5487272461863, -111.63106902057183]
s² = 5266.230284046525
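
As a sanity check (our addition, not part of the lecture code), Julia’s built-in least-squares solve via the backslash operator should reproduce these coefficients:

β_ls = [ones(length(X)) X] \ Y  # least-squares solution of the overdetermined system
@show β_ls                      # expect ≈ [609.55, -111.63]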

Key Points

Key Points

  • Models let us make predictive statements about data.
  • Best point prediction: \(\mathbb{E}[Y]\), but this is a useless statement without some model.
  • Linear regression is a first example.
  • Classical approach: solve using ordinary least squares.

Upcoming Schedule

Next Classes

Wednesday: More on linear regression (including multiple regression).

Friday: Maximum likelihood and uncertainty

Assessments

Homework 1 Due Friday, 2/6.

Exercises: Will be available later this week.

Reading: Shmueli (2010).

Quiz 1: Next Friday (2/13).

References

References

Hacking, I. (1990). The Taming of Chance. Cambridge, England: Cambridge University Press. https://doi.org/10.1017/CBO9780511819766
Helsel, D. R., Hirsch, R. M., Ryberg, K. R., Archfield, S. A., & Gilroy, E. J. (2020). Statistical methods in water resources (research report No. 4-A3) (p. 484). Reston, VA: U.S. Geological Survey. Retrieved from http://pubs.er.usgs.gov/publication/tm4A3
Shmueli, G. (2010). To explain or to predict? Statistical Science, 25(3), 289–310.