Homework 4 Solutions

BEE 4850/5850

Overview

Instructions

The goal of this homework assignment is to practice model evaluation and using cross-validation and information criteria to distinguish between models.

Problem 1 asks you to fit a model for cherry blossom bloom dates and use cross-validation to draw conclusions about generalizability.
Problem 2 asks you to compare several linear regression models for environmental influences on non-accidental death data.

Load Environment

The following code loads the environment and makes sure all needed packages are installed. This should be at the start of most Julia scripts.

import Pkg
Pkg.activate(@__DIR__)
Pkg.instantiate()

The following packages are included in the environment (to help you find other similar packages in other languages). The code below loads these packages for use in the subsequent notebook (the desired functionality for each package is commented next to the package).

using Random # random number generation and seed-setting
using DataFrames # tabular data structure
using DataFramesMeta # API which can simplify chains of DataFrames transformations
using CSV# reads/writes .csv files
using Distributions # interface to work with probability distributions
using Plots # plotting library
using StatsBase # statistical quantities like mean, median, etc
using StatsPlots # some additional statistical plotting tools
using Optim
using LaTeXStrings
using Dates

Random.seed!(1)

Problems

Problem 1

Problem 1.1

Let’s load the data from ‘data/HadCRUT5.1Analysis_gl.txt’ and ‘data/kyoto.csv’ and merge the two datasets along common years.

temp_dat = CSV.read(joinpath("data", "HadCRUT5.1Analysis_gl.txt"), DataFrame, delim=" ", ignorerepeated=true, header=false)
temp_dat = temp_dat[1:2:nrow(temp_dat), [1, 4]]
rename!(temp_dat, ["Year", "March"])

kyoto_dat = CSV.read(joinpath("data", "kyoto.csv"), DataFrame, missingstring="NA")
kyoto_dat = kyoto_dat[:, 1:2]
rename!(kyoto_dat, [:Year, :DOY])

dat = innerjoin(kyoto_dat, temp_dat, on = :Year)
dat  = dat[completecases(dat), :]
disallowmissing!(dat)

1: The data is space-delimited and the number of spaces between columns differs in the temperature (odd) and number of station (even) rows, so we use ignorerepeated=true to cause the parser to not interpret multiple consecutive spaces as delimiting multiple columns.
2: We need to only keep the odd rows (which are the temperatures) and the year/March anomaly columns.
3: The periods in the column names in kyoto.csv are annoying for later DataFrame operations, so we’ll just rename the columns to get rid of them.
4: An innerjoin is a type of database join that merges two datasets based on the entries of a common ID column which are present in both datasets.
5: There are still some years with missing DOYs, so we use completecases to remove those and disallowmissing! to change the column datatype.

┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 2. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 4. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 6. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 8. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 10. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 12. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 14. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 16. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 18. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 20. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 22. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 24. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 26. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 28. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 30. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 32. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 34. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 36. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 38. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 40. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 42. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 44. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 46. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 48. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 50. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 52. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 54. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 56. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 58. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 60. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 62. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 64. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 66. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 68. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 70. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 72. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 74. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 76. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 78. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 80. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 82. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 84. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 86. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 88. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 90. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 92. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 94. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 96. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 98. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 100. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 102. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 104. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 106. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 108. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 110. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 112. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 114. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 116. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 118. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 120. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 122. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 124. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 126. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 128. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 130. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 132. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 134. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 136. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 138. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 140. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 142. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 144. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 146. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 148. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 150. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 152. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 154. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 156. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 158. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 160. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 162. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 164. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 166. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 168. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 170. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 172. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 174. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 176. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 178. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 180. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 182. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 184. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 186. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 188. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 190. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 192. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 194. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 196. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 198. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 200. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1 warning: only found 13 / 14 columns around data row: 202. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:592
┌ Warning: thread = 1: too many warnings, silencing any further warnings
└ @ CSV ~/.julia/packages/CSV/LiiJM/src/file.jl:597

161×3 DataFrame

136 rows omitted

Row	Year	DOY	March
	Int64	Int64	Float64
1	1850	109	-0.627
2	1851	102	-0.641
3	1852	113	-0.567
4	1853	102	-0.273
5	1854	102	-0.247
6	1855	101	-0.345
7	1856	111	-0.371
8	1857	108	-0.488
9	1858	104	-0.566
10	1859	110	-0.322
11	1860	113	-0.789
12	1861	100	-0.422
13	1862	117	-0.538
⋮	⋮	⋮	⋮
150	2004	92	0.518
151	2005	99	0.64
152	2006	98	0.531
153	2007	97	0.583
154	2008	95	0.603
155	2009	95	0.439
156	2010	95	0.851
157	2011	99	0.549
158	2012	101	0.416
159	2013	93	0.576
160	2014	94	0.681
161	2015	93	0.787

Now let’s fit the model.

function cherry_loglik(p, doy, temp)
    b0, b1, s = p
    m = b0 .+ b1 * temp
    ll = sum(logpdf.(Normal.(m, s), doy))
    return ll
end

lb = [-200., -100., 0.01]
ub = [200., 100., 10.]
p0 = [100., 5., 5.]
optim_out = optimize(p -> -cherry_loglik(p, dat.DOY, dat.March), lb, ub, p0)
p_mle = optim_out.minimizer
@show p_mle;

p_mle = [100.6710269629213, -7.84224390609888, 4.717152358910784]

This suggests that every degree of increased average March temperature makes the cherry blossoms flower around 8 days earlier.

Problem 1.2

We can get the log-loss metric directly from the optimization, as the negative log-likelihood was the optimization objective.

ll = optim_out.minimum
@show ll;

ll = 478.19315773357795

For the mean-square error, we need to compute the predictions.

pred = p_mle[1] .+ p_mle[2] * dat.March
mse = mean((pred .- dat.DOY).^2)
@show mse;

mse = 22.251526373148707

Problem 1.3

To conduct a 5-fold cross-validation, we need to split the data and then repeat the earlier workflow and store the metrics. Since we’re modeling this using a linear regression, which assumes that the observations are independent, we can do an independent split by random partitioning of the data rows. Let’s write a function to do this.

function cherry_cv(dat, nfolds = 5)
    # split the data
    labels = repeat(1:nfolds, outer = Int(floor(nrow(dat) / nfolds)))
    labels = [labels; 1]
    fold_labels = shuffle(labels) # randomly permite the labels
    # create storage for error metrics
    cv_ll = zeros(nfolds)
    cv_mse = zeros(nfolds)
    # set optimization parameters
    lb = [-200., -100., 0.01]
    ub = [200., 100., 10.]
    p0 = [100., 5., 5.]
    for fold in 1:nfolds
        test_idx = findall(fold_labels .== fold)
        train_idx = setdiff(1:nrow(dat), test_idx)
        # fit model to the fold
        optim_out = optimize(p -> -cherry_loglik(p, dat[train_idx, :DOY], dat[train_idx, :March]), lb, ub, p0)
        p_mle = optim_out.minimizer
        pred = p_mle[1] .+ p_mle[2] * dat[test_idx, :March]
        cv_ll[fold] = -mean(logpdf.(Normal.(pred, p_mle[3]), dat[test_idx, :DOY]))
        cv_mse[fold] = mean((pred .- dat[test_idx, :DOY]).^2)
    end
    return cv_ll, cv_mse
end
cv_ll, cv_mse = cherry_cv(dat, 5)

1: The number of data points is not divisible evenly by 5, so we just create as close to an even set of splits as we can and then add the last data point to one of the remaining folds (in this case the first one).
2: We get the mean log score here because the number of data points differs between the CV folds and the full-data.

([2.9726714436496695, 2.878466671848684, 2.866129236793706, 3.275913504259708, 3.025694464239982], [22.364186980103344, 17.917198626382362, 17.317145915863687, 34.037171588895646, 24.67543045145178])

Now let’s compare to the total-data/in-sample tests.

@show ll / nrow(dat);
@show mean(cv_ll);

ll / nrow(dat) = 2.9701438368545214
mean(cv_ll) = 3.00377506415835

@show mse;
@show mean(cv_mse);

mse = 22.251526373148707
mean(cv_mse) = 23.262226712539366

We can see that in general, the model’s in-sample performance is not too different than its average CV performance, suggesting that there is no systematic overfitting occurring.

Problem 1.4

Since there is no evidence of overfitting from the cross-validation test, we might conclude that the model generalizes well.

Problem 2

Problem 2.1

Loading and plotting the deaths data from data/chicago.csv:

chicago_dat = CSV.read(joinpath("data", "chicago.csv"), DataFrame, delim=",", missingstring="NA", header=true)
day_zero = Date("1993-12-31")
chicago_dat.Date = day_zero .+ Day.(chicago_dat.time .+ 0.5)

UndefVarError: `Date` not defined in `Main.Notebook`
Suggestion: check for spelling errors or missing imports.
Hint: a global variable of this name may be made accessible by importing Dates in the current active module Main
Stacktrace:
 [1] top-level scope
   @ ~/Teaching/BEE4850/spring2026/solutions/hw04/hw04.qmd:192

Figure 1

To ensure that we use the same observations for the purposes of model comparison, let’s remove the cases corresponding to missing PM2.5 data.

chicago_dat = chicago_dat[.!(ismissing).(chicago_dat.pm25median), :]

p_deaths = plot(chicago_dat.Date, chicago_dat.death, lw=2, xlabel="Date", ylabel="Non-Accidental Deaths)", label="Data", color=:gray)

ArgumentError: column name "Date" not found in the data frame; existing most similar names are: "death"
Stacktrace:
 [1] lookupname
   @ ~/.julia/packages/DataFrames/b4w9K/src/other/index.jl:434 [inlined]
 [2] getindex
   @ ~/.julia/packages/DataFrames/b4w9K/src/other/index.jl:440 [inlined]
 [3] getindex(df::DataFrame, ::typeof(!), col_ind::Symbol)
   @ DataFrames ~/.julia/packages/DataFrames/b4w9K/src/dataframe/dataframe.jl:557
 [4] getproperty(df::DataFrame, col_ind::Symbol)
   @ DataFrames ~/.julia/packages/DataFrames/b4w9K/src/abstractdataframe/abstractdataframe.jl:448
 [5] top-level scope
   @ ~/Teaching/BEE4850/spring2026/solutions/hw04/hw04.qmd:207

Figure 2

We could fit the regression models in a few different ways (in Julia, we could also use the GLM.jl package), but for consistency’s sake we’ll do this with numerical optimization. First, let’s fit a model for the impact of time.

# general log-likelihood function for a linear regression with covariate x
function timereg_loglik(p, y, t)
    μ₀, μ₁, σ = p
    μ = μ₀ .+ μ₁ * t
    ll = sum(logpdf.(Normal.(μ, σ), y))
    return ll
end

lb = [0.0, -100.0, 0.1]
ub = [200.0, 100.0, 20.0]
x₀ = [100.0, 0.0, 5.0]

optim_out = Optim.optimize(p -> -timereg_loglik(p, chicago_dat.death, chicago_dat.time), lb, ub, x₀)
ll_time = -optim_out.minimum
θ_time = optim_out.minimizer
@show round.(θ_time; digits=1);

1: Since we minimized the negative log-likelihood, to get the log-likelihood we need to take the negative of the minimum value again.

round.(θ_time; digits = 1) = [101.7, 0.0, 14.1]

Plotting this regression over the data:

plot!(p_deaths, chicago_dat.Date, θ_time[1] .+ θ_time[2] * chicago_dat.time, lw=2, linestyle=:dash, color=:red, label="Time Regression")

UndefVarError: `p_deaths` not defined in `Main.Notebook`
Suggestion: check for spelling errors or missing imports.
Stacktrace:
 [1] top-level scope
   @ ~/Teaching/BEE4850/spring2026/solutions/hw04/hw04.qmd:244

Figure 3

We can see that even though the time coefficient is close to zero, suggesting a minimal time-dependent trend, over the course of the dataset there is a slight increase over this time (seen in Figure 3).

Problem 2.2

Now let’s look at the impact of temperature.

# general log-likelihood function for a linear regression with covariate x
function tempreg_loglik(p, y, temp)
    μ₀, μ₁, σ = p
    μ = μ₀ .+ μ₁ * temp
    ll = sum(logpdf.(Normal.(μ, σ), y))
    return ll
end

lb = [0.0, -100.0, 0.1]
ub = [200.0, 100.0, 20.0]
x₀ = [100.0, 0.0, 5.0]

optim_out = Optim.optimize(p -> -tempreg_loglik(p, chicago_dat.death, chicago_dat.tmpd), lb, ub, x₀)
ll_temp = -optim_out.minimum
θ_temp = optim_out.minimizer
@show round.(θ_temp; digits=1);

round.(θ_temp; digits = 1) = [129.2, -0.4, 12.4]

Adding this regression line to the plot:

plot!(p_deaths, chicago_dat.Date, θ_temp[1] .+ θ_temp[2] * chicago_dat.tmpd, lw=2, linestyle=:dash, color=:blue, label="Temperature Regression")

UndefVarError: `p_deaths` not defined in `Main.Notebook`
Suggestion: check for spelling errors or missing imports.
Stacktrace:
 [1] top-level scope
   @ ~/Teaching/BEE4850/spring2026/solutions/hw04/hw04.qmd:284

Figure 4

Figure 4 shows that the temperature regression results in some oscillations due to the seasonal impact of temperature, which visually appear to align with the oscillations in the deaths data.

Problem 2.3

Our last model adds PM2.5 concentrations to the previous temperature regression. Here we need to drop the entries corresponding to missing data.

# general log-likelihood function for a linear regression with covariate x
function multireg_loglik(p, y, temp, pm25)
    μ₀, μ₁, μ₂, σ = p
    μ = μ₀ .+ μ₁ * temp .+ μ₂ * pm25
    ll = sum(logpdf.(Normal.(μ, σ), y))
    return ll
end

lb = [0.0, -100.0, -100.0, 0.1]
ub = [200.0, 100.0, 100.0, 20.0]
x₀ = [100.0, 0.0, 0.0, 5.0]

optim_out = Optim.optimize(p -> -multireg_loglik(p, chicago_dat.death, chicago_dat.tmpd, chicago_dat.pm25median), lb, ub, x₀)
ll_multi = -optim_out.minimum
θ_multi = optim_out.minimizer
@show round.(θ_multi; digits=1);

round.(θ_multi; digits = 1) = [129.3, -0.4, 0.2, 12.3]

We add this regression to the plot:

plot!(p_deaths, chicago_dat.Date, θ_multi[1] .+ θ_multi[2] * chicago_dat.tmpd .+ θ_multi[3] * chicago_dat.pm25median, lw=2, linestyle=:dash, color=:violet, label="Temperature and PM2.5 Regression")

UndefVarError: `p_deaths` not defined in `Main.Notebook`
Suggestion: check for spelling errors or missing imports.
Stacktrace:
 [1] top-level scope
   @ ~/Teaching/BEE4850/spring2026/solutions/hw04/hw04.qmd:323

Figure 5

Problem 2.4

Now we want to conduct a 5-fold cross-validation to compare each model’s performance. As set up, we don’t need to account for any particular structure, as we aren’t accounting for any underlying seasonal effects that aren’t captured by the covariates.

We will write a function to conduct these for each of our models.

function time_cv(dat, nfolds=5)
    # split the data
    labels = repeat(1:nfolds, outer = Int(floor(nrow(dat) / nfolds)))
    labels = [labels; 1; 2]
    fold_labels = shuffle(labels) # randomly permite the labels
    # create storage for error metrics
    cv_ll = zeros(nfolds)
    cv_mse = zeros(nfolds)
    # set optimization parameters
    lb = [0.0, -100.0, 0.1]
    ub = [200.0, 100.0, 20.0]
    p0 = [100.0, 0.0, 5.0]
    for fold in 1:nfolds
        test_idx = findall(fold_labels .== fold)
        train_idx = setdiff(1:nrow(dat), test_idx)
        # fit model to the fold
        optim_out = optimize(p -> -timereg_loglik(p, dat[train_idx, :death], dat[train_idx, :time]), lb, ub, p0)
        p_mle = optim_out.minimizer
        pred = p_mle[1] .+ p_mle[2] * dat[test_idx, :time]
        cv_ll[fold] = -mean(logpdf.(Normal.(pred, p_mle[3]), dat[test_idx, :death]))
        cv_mse[fold] = mean((pred .- dat[test_idx, :death]).^2)
    end
    return cv_ll, cv_mse
end

function temp_cv(dat, nfolds=5)
    # split the data
    labels = repeat(1:nfolds, outer = Int(floor(nrow(dat) / nfolds)))
    labels = [labels; 1; 2]
    fold_labels = shuffle(labels) # randomly permite the labels
    # create storage for error metrics
    cv_ll = zeros(nfolds)
    cv_mse = zeros(nfolds)
    # set optimization parameters
    lb = [0.0, -100.0, 0.1]
    ub = [200.0, 100.0, 20.0]
    p0 = [100.0, 0.0, 5.0]
    for fold in 1:nfolds
        test_idx = findall(fold_labels .== fold)
        train_idx = setdiff(1:nrow(dat), test_idx)
        # fit model to the fold
        optim_out = optimize(p -> -tempreg_loglik(p, dat[train_idx, :death], dat[train_idx, :tmpd]), lb, ub, p0)
        p_mle = optim_out.minimizer
        pred = p_mle[1] .+ p_mle[2] * dat[test_idx, :tmpd]
        cv_ll[fold] = -mean(logpdf.(Normal.(pred, p_mle[3]), dat[test_idx, :death]))
        cv_mse[fold] = mean((pred .- dat[test_idx, :death]).^2)
    end
    return cv_ll, cv_mse
end

function temp_pm_cv(dat, nfolds=5)
    # split the data
    labels = repeat(1:nfolds, outer = Int(floor(nrow(dat) / nfolds)))
    labels = [labels; 1; 2]
    fold_labels = shuffle(labels) # randomly permite the labels
    # create storage for error metrics
    cv_ll = zeros(nfolds)
    cv_mse = zeros(nfolds)
    # set optimization parameters
    lb = [0.0, -100.0, -100.0, 0.1]
    ub = [200.0, 100.0, 100.0, 20.0]
    p0 = [100.0, 0.0, 0.0, 5.0]

    for fold in 1:nfolds
        test_idx = findall(fold_labels .== fold)
        train_idx = setdiff(1:nrow(dat), test_idx)
        # fit model to the fold
        optim_out = optimize(p -> -multireg_loglik(p, dat[train_idx, :death], dat[train_idx, :tmpd], dat[train_idx, :pm25median]), lb, ub, p0)
        p_mle = optim_out.minimizer
        pred = p_mle[1] .+ p_mle[2] * dat[test_idx, :tmpd] .+ p_mle[3] * dat[test_idx, :pm25median]
        cv_ll[fold] = -mean(logpdf.(Normal.(pred, p_mle[4]), dat[test_idx, :death]))
        cv_mse[fold] = mean((pred .- dat[test_idx, :death]).^2)
    end
    return cv_ll, cv_mse

end

temp_pm_cv (generic function with 2 methods)

Now let’s compare the results for each model.

time_cv_ll, time_cv_mse = time_cv(chicago_dat, 5)
temp_cv_ll, temp_cv_mse = temp_cv(chicago_dat, 5)
multi_cv_ll, multi_cv_mse = temp_pm_cv(chicago_dat, 5)

cv_models = ["Time", "Temperature", "Temperature + PM 2.5"]
cv_ll_all = [mean(time_cv_ll), mean(temp_cv_ll), mean(multi_cv_ll)]
cv_mse_all = [mean(time_cv_mse), mean(temp_cv_mse), mean(multi_cv_mse)]

cv_results = DataFrame(Model=cv_models, LogScore=cv_ll_all, MSE=cv_mse_all)

3×3 DataFrame

Row	Model	LogScore	MSE
	String	Float64	Float64
1	Time	4.07116	200.729
2	Temperature	3.94222	154.477
3	Temperature + PM 2.5	3.93293	151.389

By these scores, the temperature + PM 2.5 model scores best for both the log score and the MSE, so we might conclude that it would generalize the best out-of-sample. The temperature-only model performs similarly, and the time model, as we saw in Problem 2.1, does not perform particularly well.

Problem 2.5

To compute the AIC values we will use the log-likelihoods we computed when we fit each model.

aic_time = -2 * (ll_time - 3)
aic_temp = -2 * (ll_temp - 3)
aic_multi = -2 * (ll_multi - 4)
aic_df = DataFrame(Models=["Time", "Temperature", "Temperature and PM2.5"], AIC=round.([aic_time, aic_temp, aic_multi]; digits=0))

3×2 DataFrame

Row	Models	AIC
	String	Float64
1	Time	5918.0
2	Temperature	5729.0
3	Temperature and PM2.5	5714.0

We can see that the Temperature and PM2.5 model minimizes AIC, and the \(\Delta \text{AIC}\) for the temperature-only model is around 15. These differences suggest that there is relatively strong evidence for the influence of both temperature and PM2.5 compared to a temperature-only model, as the predictive skill outweighs the potential overfitting from adding the PM2.5 term. This is similar in direction to what we saw from the cross-validation as well.

When comparing this quantitative result to the plots, drawing this conclusion from Figure 5 itself is challenging, even if we might be able to squint and see that the fluctuations from the PM2.5 model align slightly better with the temperature-only model.