(Null) Hypothesis Testing


Lecture 22

March 16, 2026

Review

Model Assessment

  • Metrics are based on loss functions which penalize differences between projections and observations.
  • Scoring rules are the extension to probabilistic forecasts.
  • Cross-validation for separating training and testing data.
  • Information criteria as “full data” approximation to cross-validation.
  • Never read too much into in-sample error metrics.

Statistics and Decision-Making

Science as Decision-Making Under Uncertainty

Goal is to draw insights:

  • About causes and effects;
  • About interventions.

XKCD 2440

Source: XKCD 2440

Questions We Might Like To Answer

  • Are high water levels influenced by environmental change?
  • Does some environmental condition have an effect on water quality/etc?
  • Does a drug or treatment have some effect?

Each of these is a hypothesis about causation or influence.

Null Hypothesis Testing

Onus probandi incumbit ei qui dicit, non ei qui negat (“the burden of proof lies on the one who asserts, not on the one who denies”)

Core assumption: Burden of proof is on someone claiming an effect (or a similar hypothesis).

Null Hypothesis Meme

Null Hypothesis Significance Testing

  • Check if the data is consistent with a “null” model;
  • If the data is unlikely under the null model (to some level of significance), this is evidence for the alternative;
  • If the data is consistent with the null, there is no need for an alternative hypothesis.

Alternative Hypothesis Meme

From Null Hypothesis to Null Model

the null hypothesis must be exact, that is free of vagueness and ambiguity, because it must supply the basis of the ‘problem of distribution,’ of which the test of significance is the solution.

— R. A. Fisher, The Design of Experiments, 1935.

Example: High Water Nonstationarity

Code
# load SF tide gauge data
# read in data and get annual maxima
function load_data(fname)
    # This uses the DataFramesMeta.jl package, which makes it easy to string together commands to load and process data
    df = @chain fname begin
        CSV.read(DataFrame; header=false)
        rename("Column1" => "year", "Column2" => "month", "Column3" => "day", "Column4" => "hour", "Column5" => "gauge")
        # combine the separate date/time columns into a single DateTime
        @transform :datetime = DateTime.(:year, :month, :day, :hour)
        # flag sentinel values (e.g., -99999) as missing
        @transform :gauge = ifelse.(abs.(:gauge) .>= 9999, missing, :gauge)
        select(:datetime, :gauge)
    end
    return df
end

dat = load_data("data/surge/h551.csv")

# detrend the data to remove the effects of sea-level rise and seasonal dynamics
ma_length = 366
ma_offset = Int(floor(ma_length/2))
moving_average(series,n) = [mean(@view series[i-n:i+n]) for i in n+1:length(series)-n]
dat_ma = DataFrame(datetime=dat.datetime[ma_offset+1:end-ma_offset], residual=dat.gauge[ma_offset+1:end-ma_offset] .- moving_average(dat.gauge, ma_offset))

# group data by year and compute the annual maxima
dat_ma = dropmissing(dat_ma) # drop missing data
dat_annmax = combine(dat_ma -> dat_ma[argmax(dat_ma.residual), :], groupby(transform(dat_ma, :datetime => x->year.(x)), :datetime_function))
delete!(dat_annmax, nrow(dat_annmax)) # delete 2023; haven't seen much of that year yet
rename!(dat_annmax, :datetime_function => :Year)
select!(dat_annmax, [:Year, :residual])
dat_annmax.residual = dat_annmax.residual / 1000 # convert to m

# make plots
p1 = plot(
    dat_annmax.Year,
    dat_annmax.residual;
    xlabel="Year",
    ylabel="Annual Max Tide Level (m)",
    label=false,
    marker=:circle,
    markersize=5,
    tickfontsize=16,
    guidefontsize=18,
    left_margin=5mm, 
    bottom_margin=5mm
)

n = nrow(dat_annmax)
linfit = lm(@formula(residual ~ Year), dat_annmax)
pred = coef(linfit)[1] .+ coef(linfit)[2] * dat_annmax.Year

plot!(p1, dat_annmax.Year, pred, linewidth=3, label="Linear Trend")

Figure 1: Annual maxima surge data from the San Francisco, CA tide gauge.

The Null: Is The Trend Real?

\(\mathcal{H}_0\) (Null Hypothesis):

  • The “trend” is just due to chance; there is no “true” long-term trend in the data.
  • Statistically:

\[y = \underbrace{b}_{\text{constant}} + \underbrace{\varepsilon}_{\text{residuals}}, \qquad \varepsilon \underbrace{\sim}_{\substack{\text{distributed} \\ {\text{according to}}}} \mathcal{N}(0, \sigma^2) \]

An Alternative Hypothesis

\(\mathcal{H}\):

  • There is a non-zero long-term trend over time.
  • Statistically:

\[y = \alpha + \beta \times t + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2) \]
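Both hypotheses correspond to models we can fit. A minimal sketch using GLM.jl (as in the tide-gauge code above); the toy data, true slope of 0.01, and column names here are purely illustrative assumptions:

```julia
using DataFrames, GLM, Random

Random.seed!(1)
# toy data with an assumed true slope of 0.01 (illustrative only)
df = DataFrame(t = 1:50)
df.y = 1.0 .+ 0.01 .* df.t .+ 0.1 .* randn(50)

null_fit = lm(@formula(y ~ 1), df)  # null: y = b + ε (intercept only)
alt_fit  = lm(@formula(y ~ t), df)  # alternative: y = α + βt + ε
coef(alt_fit)[2]                    # fitted slope
```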

Null Test

Comparing \(\mathcal{H}\) with \(\mathcal{H}_0\):

  • \(\mathcal{H}\): \(\beta \neq 0\)
  • \(\mathcal{H}_0\): \(\beta = 0\)

In this example, our null is an example of a point-null hypothesis.

Computing the Test Statistic

For this type of null hypothesis test, our test statistic is the slope of the linear fit. OLS estimate: \[\hat{\beta} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}.\]

The idea is that, even assuming the null hypothesis, we could obtain many different datasets, some of which will have a non-zero slope by chance.
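This can be checked by simulation. A minimal sketch (sample size, intercept, and noise level are all assumed for illustration): generate many datasets from the null model and look at the spread of fitted slopes:

```julia
using Random, Statistics

Random.seed!(1)
n, b, σ = 100, 1.0, 0.2   # sample size, intercept, noise sd (all assumed)
t = collect(1.0:n)

# OLS slope estimate, following the formula above
ols_slope(x, y) = sum((x .- mean(x)) .* (y .- mean(y))) / sum((x .- mean(x)) .^ 2)

# 10,000 datasets generated under the null (true slope is zero)
slopes = [ols_slope(t, b .+ σ .* randn(n)) for _ in 1:10_000]
(mean(slopes), std(slopes))  # slopes scatter around zero but are rarely exactly zero
```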

Sampling Distribution of Test Statistic

Standard Result: \[\frac{\hat{\beta} - \beta}{\text{se}\left[\hat{\beta}\right]} \sim t_{n-2},\]

where \(t_{n-2}\) is the t-distribution with \(n-2\) degrees of freedom.

This is why the standard “test” for linear regression coefficients is called a t-test.
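A sketch of this test done by hand with Distributions.jl and GLM.jl; the data are synthetic, with a true slope of 0.05 assumed purely for illustration:

```julia
using DataFrames, Distributions, GLM, Random

Random.seed!(1)
n = 40
df = DataFrame(t = 1:n)
df.y = 2.0 .+ 0.05 .* df.t .+ 0.5 .* randn(n)   # assumed data-generating process

fit  = lm(@formula(y ~ t), df)
βhat = coef(fit)[2]
se   = stderror(fit)[2]

tstat = βhat / se                            # under the null, β = 0
pval  = 2 * ccdf(TDist(n - 2), abs(tstat))   # two-tailed p-value
```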

Statistical Significance

Is the value of the test statistic consistent with the null hypothesis?

More formally, could the test statistic have been reasonably observed from a random sample given the null hypothesis?

p-Values: Quantification of “Surprise”

One-Tailed Test:

Figure 2: Illustration of a p-value

Two-Tailed Test:

Figure 3: Illustration of a two-tailed p-value
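For a given observed statistic, the two tail conventions give different p-values. A sketch with assumed values (t = 2.0 with 28 degrees of freedom), chosen so the one-tailed p-value falls below 0.05 while the two-tailed one does not:

```julia
using Distributions

tstat, dof = 2.0, 28   # assumed observed t statistic and degrees of freedom

p_one = ccdf(TDist(dof), tstat)           # P(T ≥ t): one-tailed
p_two = 2 * ccdf(TDist(dof), abs(tstat))  # P(|T| ≥ |t|): two-tailed
```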

Statistical Significance

Error Types

                            Null Hypothesis Is
Decision About Null         True                                       False
Don't reject                True negative (probability \(1-\alpha\))   Type II error (probability \(\beta\))
Reject                      Type I error (probability \(\alpha\))      True positive (probability \(1-\beta\))
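The Type I error rate can be checked by simulation: if we repeatedly test null-generated data at level α, we should reject about α of the time. A minimal sketch using a one-sample t-test (all settings assumed for illustration):

```julia
using Distributions, Random, Statistics

Random.seed!(1)
α, n, trials = 0.05, 30, 10_000
tcrit = quantile(TDist(n - 1), 1 - α / 2)   # two-tailed critical value

# one-sample t-test on data truly generated under the null (mean zero)
function reject_null()
    x = randn(n)
    abs(mean(x) / (std(x) / sqrt(n))) > tcrit
end

rate = mean(reject_null() for _ in 1:trials)   # should be close to α
```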

Key Points

Hypothesis Testing

  • Classical framework: Compare a null hypothesis (no effect) to an alternative (some effect)
  • \(p\)-value: the probability (under \(\mathcal{H}_0\)) of a test statistic at least as extreme as the one observed.
  • “Significant” if \(p\)-value is below a significance level reflecting acceptable Type I error rate.

Problems with NHST framework

  • Real “null” hypotheses are often more nuanced than in typical tests (which were often developed for controlled experiments or for computational convenience).
  • Decisions are often not binary (“significant/not significant”).
  • \(p\)-values are often over-interpreted and often incorrectly calculated, with negative consequences!
  • Important: “Big” data can make things worse, as NHST is highly sensitive to small but practically negligible effects.
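The last point can be illustrated with a sketch (effect size and sample sizes assumed): a practically negligible mean shift is invisible in a small sample but yields a tiny p-value in a very large one:

```julia
using Distributions, Random, Statistics

Random.seed!(1)
μ = 0.01   # tiny, practically negligible effect (assumed for illustration)

# two-tailed one-sample t-test p-value for the mean being zero
pval(x) = 2 * ccdf(TDist(length(x) - 1), abs(mean(x) / (std(x) / sqrt(length(x)))))

p_small = pval(μ .+ randn(100))         # small sample: effect undetectable
p_big   = pval(μ .+ randn(1_000_000))   # huge sample: same effect, "significant"
(p_small, p_big)
```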

Upcoming Schedule

Next Classes

Wednesday: Introduction to Simulation and Random Sampling

Friday: Sampling and Monte Carlo

Next Week: The Bootstrap

Assessments

HW4 assigned, due 3/27.

References
