Homework 6: The Bootstrap and Missing Data

BEE 4850/5850, Spring 2026

Important: Due Date

Friday, 5/01/26, 9:00pm

To do this assignment in Julia, you can find a Jupyter notebook with an appropriate environment in the homework’s GitHub repository. Otherwise, you are responsible for setting up an appropriate package environment in the language of your choosing. Make sure to include your name and NetID on your solution.

Overview

Instructions

The goal of this homework assignment is to practice using the bootstrap to quantify uncertainty in estimates and reasoning about the mechanisms by which data can be missing.

  • Problem 1 asks you to use the parametric bootstrap to estimate uncertainty in a Poisson regression model.
  • Problem 2 asks you to use a moving block bootstrap to estimate the sampling distribution of the median of extreme water level data.
  • Problem 3 asks you to analyze the mechanism by which data is missing in a dataset.

Load Environment

The following code loads the environment and makes sure all needed packages are installed. This should be at the start of most Julia scripts.

import Pkg
Pkg.activate(@__DIR__)
Pkg.instantiate()

The environment includes the following packages. The comment next to each one describes what it is used for, which should help you find equivalent packages in other languages. The code below loads them for use in the rest of the notebook.

using Random # random number generation and seed-setting
using DataFrames # tabular data structure
using DataFramesMeta # API which can simplify chains of DataFrames transformations
using CSV # reads/writes .csv files
using Distributions # interface to work with probability distributions
using Plots # plotting library
using StatsBase # statistical quantities like mean, median, etc
using StatsPlots # some additional statistical plotting tools
using Optim # optimization package for model fitting
using Dates # API for date-time data structures

Problems

Scoring

  • Problem 1 is worth 10 points;
  • Problem 2 is worth 10 points;
  • Problem 3 is worth 5 points.

Problem 1

Revisit the salamander model from Homework 2, using percent ground cover as a predictor in the Poisson regression. Use the parametric bootstrap to estimate bias and confidence intervals for the model parameters.

Problem 1.1

Load the data from data/salamanders.csv and fit a Poisson regression model for salamander counts using the percentage of ground cover.

Problem 1.2

Use 1,000 parametric bootstrap samples to obtain estimates of bias and the 90% confidence interval for the intercept and coefficient in the Poisson regression.
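As a rough illustration of the mechanics (not a solution), the sketch below runs a parametric bootstrap for a Poisson regression on synthetic data. It assumes a log link, a single predictor scaled to [0, 1], and percentile intervals; the function names poisson_nll, fit_poisson, and parametric_bootstrap are hypothetical helpers, and the simulated x and y are stand-ins rather than the salamander data. Other interval constructions may be more appropriate for your analysis.

```julia
using Random, Statistics, Optim, Distributions

# Negative log-likelihood of y_i ~ Poisson(exp(a + b * x_i)),
# dropping the log(y_i!) term, which does not depend on the parameters.
function poisson_nll(theta, x, y)
    eta = theta[1] .+ theta[2] .* x
    return sum(exp.(eta) .- y .* eta)
end

# Maximum-likelihood fit; returns [intercept, slope].
function fit_poisson(x, y)
    start = [log(max(mean(y), 0.1)), 0.0]  # intercept-only starting guess
    result = optimize(theta -> poisson_nll(theta, x, y), start, LBFGS())
    return Optim.minimizer(result)
end

# Simulate new counts from the fitted model and refit, n_boot times.
function parametric_bootstrap(rng, x, theta_hat, n_boot)
    lambda = exp.(theta_hat[1] .+ theta_hat[2] .* x)
    boot = zeros(n_boot, 2)
    for i in 1:n_boot
        y_sim = [rand(rng, Poisson(l)) for l in lambda]
        boot[i, :] = fit_poisson(x, y_sim)
    end
    return boot
end

# Synthetic stand-in data (assumption: NOT the salamander data).
rng = Xoshiro(1)
x = rand(rng, 60)
y = [rand(rng, Poisson(exp(0.5 + 1.5 * xi))) for xi in x]

theta_hat = fit_poisson(x, y)
boot = parametric_bootstrap(rng, x, theta_hat, 1_000)

# Bootstrap bias estimate: mean of the replicates minus the original estimate.
bias = vec(mean(boot; dims=1)) .- theta_hat
# 90% percentile intervals, one [lower, upper] pair per parameter.
ci = [quantile(boot[:, j], [0.05, 0.95]) for j in 1:2]
```

The key feature that makes this parametric is that each replicate is simulated from the fitted Poisson model rather than resampled from the observed rows.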

Problem 1.3

What can you conclude about the level of uncertainty about the influence of ground cover on the expected number of salamanders?

Problem 2

Let’s revisit the 2015 Sewell’s Point tide gauge data, which consists of hourly observed water levels and the sea levels predicted by NOAA’s harmonic model.

function load_data(fname)
    date_format = "yyyy-mm-dd HH:MM"
    # this uses the DataFramesMeta package -- it's pretty cool
    return @chain fname begin
        CSV.File(; dateformat=date_format)
        DataFrame
        rename(
            "Time (GMT)" => "time", "Predicted (m)" => "harmonic", "Verified (m)" => "gauge"
        )
        @transform :datetime = (Date.(:Date, "yyyy/mm/dd") + Time.(:time))
        select(:datetime, :gauge, :harmonic)
        @transform :weather = :gauge - :harmonic
        @transform :month = (month.(:datetime))
    end
end

dat = load_data("data/norfolk-hourly-surge-2015.csv")

plot(dat.datetime, dat.gauge; ylabel="Gauge Measurement (m)", label="Observed", legend=:topleft, xlabel="Date/Time", color=:blue)
plot!(dat.datetime, dat.harmonic, label="Prediction", color=:orange)

We detrend the data to isolate the weather-induced variability by subtracting the predictions from the observations; the result is stored in the weather column (dat[:, :weather] in the Julia code).

plot(dat.datetime, dat.weather; ylabel="Gauge Weather Variability (m)", label="Detrended Data", linewidth=1, legend=:topleft, xlabel="Date/Time")

We would like to understand the uncertainty in an estimate of the median sea level.

Problem 2.1

Construct 1,000 bootstrap replicates by adding a moving block bootstrap replicate from the weather-induced variability series (with block length 20) to the harmonic prediction. Use these replicates to compute a 90% confidence interval for the median sea level. What is the bias of the estimator?
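To make the resampling scheme concrete, here is a minimal sketch of a moving block bootstrap on synthetic data. It assumes overlapping blocks drawn uniformly with replacement and trimmed to the original series length, and it uses a percentile interval; moving_block_replicate is a hypothetical helper, and the sinusoidal "harmonic" and AR(1) "weather" series are made-up stand-ins for the tide gauge data.

```julia
using Random, Statistics

# One moving block bootstrap replicate of the series x: sample overlapping
# blocks of length block_len with replacement, concatenate them, and trim
# the result to the original length.
function moving_block_replicate(rng, x, block_len)
    n = length(x)
    n_blocks = cld(n, block_len)  # enough blocks to cover n points
    starts = rand(rng, 1:(n - block_len + 1), n_blocks)
    rep = reduce(vcat, [x[s:(s + block_len - 1)] for s in starts])
    return rep[1:n]
end

# Synthetic stand-in series (assumption: NOT the Sewell's Point data).
rng = Xoshiro(2)
n = 2_000
harmonic = sin.(2π .* (1:n) ./ 12.42)  # deterministic "tide"
weather = zeros(n)                     # autocorrelated AR(1) residual
for i in 2:n
    weather[i] = 0.9 * weather[i - 1] + 0.05 * randn(rng)
end

# Estimate from the original series, then from each bootstrap replicate.
estimate = median(harmonic .+ weather)
boot_medians = [
    median(harmonic .+ moving_block_replicate(rng, weather, 20)) for _ in 1:1_000
]

bias = mean(boot_medians) - estimate        # bootstrap bias estimate
ci = quantile(boot_medians, [0.05, 0.95])   # 90% percentile interval
```

Only the residual series is resampled; the deterministic component is added back unchanged so that each replicate keeps the tidal structure while preserving short-range dependence within blocks.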

Problem 2.2

Repeat the analysis with block length 50. How does this affect the confidence intervals and estimate of bias?

Problem 2.3

Why do you think using different block lengths produced the results that they did?

Problem 3

Let’s revisit the Chicago deaths dataset (data/chicago.csv).

Problem 3.1

Which variables have missing entries? How many are missing?
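One way to tabulate missing entries per column in a DataFrame is sketched below on a small made-up table. The column names a, b, and c are placeholders and are not the columns of data/chicago.csv.

```julia
using DataFrames

# Toy table with missing entries (placeholder columns, not the Chicago data).
df = DataFrame(
    a = [1, missing, 3, missing],
    b = [1.0, 2.0, 3.0, 4.0],
    c = ["x", "y", missing, "z"],
)

# Count missing values in every column.
missing_counts = DataFrame(
    variable = names(df),
    nmissing = [count(ismissing, col) for col in eachcol(df)],
)

# Keep only the columns that actually have missing entries.
with_missing = missing_counts[missing_counts.nmissing .> 0, :]
```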

Problem 3.2

Let’s focus on pm10median. Based on an exploratory analysis, do you think these data are missing completely at random, missing at random, or missing not at random?
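A common exploratory check is to compare the distribution of an observed variable between rows where the variable of interest is missing and rows where it is present. The sketch below does this on synthetic data in which missingness depends on temperature by construction; the columns temp and pm and the missingness probabilities are assumptions chosen for illustration, not properties of the Chicago dataset.

```julia
using DataFrames, Random, Statistics

# Synthetic data (assumption: hypothetical columns, not data/chicago.csv).
rng = Xoshiro(3)
n = 500
temp = 10 .+ 10 .* randn(rng, n)
pm = Vector{Union{Missing, Float64}}(20 .+ 5 .* randn(rng, n))

# By construction, pm is more likely to be missing on cold days, so
# missingness depends on an observed variable.
for i in 1:n
    if rand(rng) < (temp[i] < 5 ? 0.4 : 0.05)
        pm[i] = missing
    end
end

df = DataFrame(temp = temp, pm = pm)
df.pm_missing = ismissing.(df.pm)

# Compare the observed covariate across missing and non-missing rows.
by_status = combine(
    groupby(df, :pm_missing),
    :temp => mean => :mean_temp,
    nrow => :n,
)
```

A clear difference between the groups suggests the data are not missing completely at random. Keep in mind that no check based only on observed values can rule out dependence on the unobserved values themselves.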
