Homework 2 Solutions

BEE 4850/5850

Important: Due Date

Friday, 2/20/26, 9:00pm

To do this assignment in Julia, you can find a Jupyter notebook with an appropriate environment in the homework’s GitHub repository. Otherwise, you will be responsible for setting up an appropriate package environment in the language of your choosing. Make sure to include your name and NetID on your solution.

Load Environment

The following code loads the environment and makes sure all needed packages are installed. This should be at the start of most Julia scripts.

import Pkg
Pkg.activate(@__DIR__)
Pkg.instantiate()

The following packages are included in the environment (to help you find other similar packages in other languages). The code below loads these packages for use in the subsequent notebook (the desired functionality for each package is commented next to the package).

using Random # random number generation and seed-setting
using DataFrames # tabular data structure
using CSV # reads/writes .csv files
using Distributions # interface to work with probability distributions
using Plots # plotting library
using StatsBase # statistical quantities like mean, median, etc
using StatsPlots # some additional statistical plotting tools
using Optim # optimization tools
using LaTeXStrings # latex formatting for plot strings

Problems

Problem 1

Let’s revisit the chicago dataset from HW1 (found in data/chicago.csv). We will look at the relationship of the potential predictor variables pm25median (the median density anomaly of smaller pollutant particles, in particles/m\(^3\)), o3median (the median concentration anomaly of O\(_3\), in ppb), so2median (the median concentration anomaly of SO\(_2\), in ppb), and tmpd (mean daily temperature, in degrees Fahrenheit) with the variable we would like to predict, death (the number of non-accidental deaths on that day).

Problem 1.1

As always, first, load the data:

chicago_dat = CSV.read("data/chicago.csv", DataFrame) # load data into DataFrame
5114×8 DataFrame (5089 rows omitted)
  Row   Column1  death  pm10median    pm25median    o3median    so2median     time     tmpd
        Int64    Int64  String15      String15      Float64     String15      Float64  Float64
     1        1    130  -7.433544304  NA            -19.5923     1.9280425967  -2556.5  31.5
     2        2    150  NA            NA            -19.0386    -0.985563116   -2555.5  33.0
     3        3    101  -0.826530612  NA            -20.2173    -1.891416086   -2554.5  33.0
     4        4    135  5.5664556962  NA            -19.6757     6.1393412739  -2553.5  29.0
     5        5    126  NA            NA            -19.2173     2.2784648713  -2552.5  32.0
     6        6    130  6.5664556962  NA            -17.634      9.8585839137  -2551.5  40.0
     7        7    129  -0.433544304  NA            -15.3744    -5.818992059   -2550.5  34.5
     8        8    109  -5.433544304  NA            -12.1705    -5.107941432   -2549.5  29.0
     9        9    125  -0.571428571  NA            -20.0923     0.1822373332  -2548.5  26.5
    10       10    153  NA            NA            -18.5803    -2.046929333   -2547.5  32.5
    11       11    124  -19.4335443   NA            -5.71219    -1.60099878    -2546.5  29.5
    12       12    111  -15.4335443   NA            -15.6289     2.937930625   -2545.5  34.5
    13       13    104  11.566455696  NA            -17.0455     3.6418000592  -2544.5  34.0
     ⋮        ⋮      ⋮  ⋮             ⋮             ⋮            ⋮                   ⋮     ⋮
  5103     5103    118  10.956209987  7.1845454545  -15.057      5.6127330601   2545.5   6.0
  5104     5104     97  -18.12394366  3.0386147417  -7.86229    -0.569948292    2546.5   6.0
  5105     5105    124  -17.12394366  0.4386147417  -12.057     -3.214996751    2547.5  -1.0
  5106     5106    144  -1.123943662  6.3000356125  -13.9529     13.116773603   2548.5   8.0
  5107     5107    128  -9.123943662  5.4386147417  -9.6713     -2.731705421    2549.5   5.0
  5108     5108    131  6.876056338   1.9886147417  -13.5779     0.493336582    2550.5   5.0
  5109     5109    122  -0.283333333  3.0452023927  -12.0362     7.4917736026   2551.5  11.0
  5110     5110    119  -9.123943662  10.988614742  -13.2745     1.6733391207   2552.5  12.0
  5111     5111    125  -12.12394366  5.2886147417  -11.1798     2.618336582    2553.5  13.0
  5112     5112    132  -3.123943662  0.525         -6.0362      4.9319909939   2554.5  21.0
  5113     5113    133  -26.12394366  -10.06138526  -0.598701   -1.45631644     2555.5  27.0
  5114     5114    145  -25.12394366  -1.811385258  -4.89037     0.0900057874   2556.5  16.0

Now we make the scatterplots. I’ll make these into a single plot with different panels, but making these as separate plots is also fine. Putting these all on a single plot with different symbols might be a bit dense (and you have to worry about scales for the predictors).

p1 = @df chicago_dat scatter(:pm25median, :death, title="PM 2.5 Daily Median Anomaly", ylabel="Non-Accidental Daily Deaths", xlabel=L"particles/m$^3$", markersize=3, titlefontsize=8, guidefontsize=7, tickfontsize=5, legend=false)
p2 = @df chicago_dat scatter(:o3median, :death, title=L"O$_3$ Daily Median Anomaly", ylabel="Non-Accidental Daily Deaths", xlabel="ppb", markersize=3, titlefontsize=8, guidefontsize=7, tickfontsize=5, legend=false)
p3 = @df chicago_dat scatter(:so2median, :death, title=L"SO$_2$ Daily Median Anomaly", ylabel="Non-Accidental Daily Deaths", xlabel="ppb", markersize=3, titlefontsize=8, guidefontsize=7, tickfontsize=5, legend=false)
p4 = @df chicago_dat scatter(:tmpd, :death, title="Daily Mean Temperature", ylabel="Non-Accidental Daily Deaths", xlabel="°F", markersize=3, titlefontsize=8, guidefontsize=7, tickfontsize=5, legend=false)
plot(p1, p2, p3, p4, layout=(2, 2))
1. This uses the DataFrame plotting macro from StatsPlots.jl, which reduces having to write chicago_dat for every variable.
2. This arranges all of the plots as panels of a single plot with the specified layout.
Figure 1: Scatterplots of bivariate relationships between non-accidental daily deaths in Chicago and environmental predictors.

We can see from Figure 1 that there are some large outliers which might be limiting our ability to see if there is an overall linear trend between any of these variables. Let’s change the y-axis limits to let us see the bulk of the data more clearly.

yaxis!(p1, (50, 200))
yaxis!(p2, (50, 200))
yaxis!(p3, (50, 200))
yaxis!(p4, (50, 200))
plot(p1, p2, p3, p4, layout=(2, 2))
Figure 2: Rescaled scatterplots of bivariate relationships between non-accidental daily deaths in Chicago and environmental predictors.

Neither Figure 1 nor Figure 2 suggests much of a linear relationship between the air pollution predictors and the number of deaths. However, temperature appears to be a reasonable linear predictor, and visually the data could have roughly constant variance until we get to the very large outliers at high temperatures.

Problem 1.2

Letting \(y\) be the deaths and \(x\) the mean temperature, this linear model is \[y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2).\] As a result, the likelihood is based on the probability model for the errors, \[\varepsilon_i = y_i - \left(\beta_0 + \beta_1 x_i\right) \sim N(0, \sigma^2).\] Equivalently, we could set this up as \[y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2);\] both formulations yield the same likelihood, though the code might look a little different.

Now, let’s code the log-likelihood function and optimize it.

function gauss_loglik(p, y, x)
    b0, b1, s = p
    pred = b0 .+ b1 * x # calculate predicted y
    ll = sum(logpdf.(Normal.(pred, s), y)) # compute log likelihood
    return ll
end

lb = [-1000.0, -100.0, 0.1]
ub = [1000.0, 100.0, 50.0]
p0 = [0.0, 0.0, 10.0]
optim_out = Optim.optimize(p -> -gauss_loglik(p, chicago_dat.death, chicago_dat.tmpd), lb, ub, p0)
p_mle = round.(optim_out.minimizer; digits=1) # round to a reasonable number of sig figs
1. p is a vector of the model parameters b0, b1, s; optimization packages often want to optimize over a vector, so here I unpack this vector within the function. This pattern also keeps the function definition a bit cleaner, particularly as the number of parameters increases. However, you can do the opposite, writing your function with individual parameters and “unpacking” into the optimization function.
2. The lower bound is set to 0.1 as the standard deviation cannot be 0 or negative.
3-element Vector{Float64}:
 130.0
  -0.3
  14.2

How do we interpret the coefficients?

  • \(\hat{\beta}_0 = 130\) deaths suggests that, at a temperature of 0\(^\circ\) F, we would expect 130 non-accidental deaths.
  • \(\hat{\beta}_1 = -0.3\) suggests that, on two days with a difference in mean temperature of 1\(^\circ\) F, we would expect to see 0.3 fewer deaths on the warmer day. Note that this does not mean an increase of 1\(^\circ\) F “causes” this reduction.
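As a quick sanity check on these interpretations, we can evaluate the fitted mean at a couple of temperatures (a small sketch using the p_mle vector computed above):

```julia
b0, b1, s = p_mle # unpack the rounded MLE from above
pred(T) = b0 + b1 * T # fitted mean deaths at temperature T (°F)
pred(0.0) # 130.0: expected deaths at 0°F, i.e. the intercept
pred(31.0) - pred(30.0) # -0.3: expected difference between days 1°F apart, i.e. the slope
```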

Problem 1.3

Add your fitted regression line and a 90% prediction interval to your scatterplot of tmpd and death. Using this and any other diagnostics, do you agree with the Gaussian-error linear model assumption? Why or why not?

# get regression line
b0, b1, s = p_mle
x_pred = -20:100
y_pred = b0 .+ b1 * x_pred
# simulate conditional distribution to plot prediction interval
err = rand(Normal(0, s), 10_000)
err_quantile = quantile(err, [0.05, 0.95])

# make plot
p = @df chicago_dat scatter(:tmpd, :death, title="Daily Mean Temperature", xlabel="°F", ylabel="Non-Accidental Daily Deaths", alpha=0.4, label="Observations") # reduce alpha to make rest of plot easier to see
plot!(p, x_pred, y_pred, ribbon=(abs(err_quantile[1]), err_quantile[2]), fillalpha=0.3, color=:red, linewidth=2, label="Regression")
1. Since the error distribution is the same around the regression estimate, regardless of \(x\), we can just simulate from a single distribution and add the quantiles back to the regression line to get the prediction interval. For some other cases (such as heteroskedastic errors, or other distributions), we might have to simulate at every prediction point.
2. The ribbon keyword in Plots.plot can take either an array of values or a tuple of (positive) values if the same distance is to be used from the main plotted line.
Figure 3: Fitted relationship between mean temperature and non-accidental deaths in Chicago. The shaded region is the 90% prediction interval from the linear regression model.
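As an aside on the heteroskedastic case mentioned in the annotations: if the error standard deviation varied with \(x\), we would need to compute the interval bounds point-by-point instead of shifting the regression line by fixed quantiles. A minimal sketch, using a made-up \(\sigma(x)\) purely for illustration (this is not part of the fitted model):

```julia
# hypothetical standard deviation that grows with |x|; the 0.01 factor is invented
σx = s .* (1 .+ 0.01 * abs.(x_pred))
# quantiles of the conditional distribution at each prediction point
lower = [quantile(Normal(y_pred[i], σx[i]), 0.05) for i in eachindex(x_pred)]
upper = [quantile(Normal(y_pred[i], σx[i]), 0.95) for i in eachindex(x_pred)]
plot!(p, x_pred, y_pred, ribbon=(y_pred .- lower, upper .- y_pred), fillalpha=0.2, label="Heteroskedastic PI")
```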

The general trend in Figure 3 looks reasonable. We can calculate the surprise index by finding the fraction of observations which are outside of the 90% prediction interval.

resids = chicago_dat.death - (b0 .+ b1 * chicago_dat.tmpd)
si = sum((resids .< err_quantile[1]) + (resids .> err_quantile[2])) / nrow(chicago_dat)
round(si; digits=2)
1. The sum term here adds together two vectors: the first is 1 when the residuals are below the lower bound of the interval, and the second is 1 when they’re above the upper bound.
0.08

The surprise index is about 8%, which is not ideal, but also isn’t so far from the expected 10% as to indicate a major problem. Let’s look at the distribution of residuals and a Q-Q plot to see if the residuals appear normal.

resid_hist = histogram(resids, xlabel="Residuals", ylabel="Count", legend=false)
resid_qq = qqplot(resids, Normal(), xlabel="Theoretical Quantiles", ylabel="Empirical Quantiles")
p = plot(resid_hist, resid_qq, layout=(2, 1))
Figure 4: Diagnostics to check Gaussian assumptions for the residuals.

Figure 4 suggests that while the bulk of the residuals appear to follow a Gaussian distribution, there is a long upper tail which throws off the Q-Q plot. From Figure 3, these large outliers occur at very high temperatures; the simple regression model cannot account for this change in trend, which could be due to extreme heat exposure or some other factor which is correlated with high temperatures. We might trust this model for temperatures below 80\(^\circ\) F, or we would want to rethink the model completely with a better representation of factors which contribute to deaths.

Problem 2 (6 points)

Problem 2.1

Denoting the occurrence of freezing in year \(i\) as \(y_i\), the probability of this occurring as \(p_i\), and the DJF temperature as \(T_i\), the logistic regression model can be formulated as:

\[\begin{aligned} y_i &\sim \text{Bernoulli}(p_i) \\ \text{logit}(p_i) &= \beta_0 + \beta_1 T_i. \end{aligned}\]

Let’s load the data and calculate the winter temperatures. I decided to write a function which would calculate the DJF mean for a given base year, which would allow me to loop over all of the years and apply the function; there are many other possible approaches one could take.

temp_dat = CSV.read("data/HadCRUT5.1Analysis_gl.txt", delim=" ", ignorerepeated=true, header=false, silencewarnings=true, DataFrame)
temp_dat = temp_dat[1:2:nrow(temp_dat), :] # only keep even rows

function djf_mean(temps, year)
    idx = findfirst(temps[:, 1] .== year)
    return mean([temps[idx - 1, 13], temps[idx, 2], temps[idx, 3]])
end

djf_means = [djf_mean(temp_dat, yr) for yr in 1851:2025]
1. We need silencewarnings=true because CSV.read will complain about the even rows not having the same number of columns.
2. findfirst (or the equivalent) is the fastest solution since it will terminate after finding a match, but if you used a different approach which searched over the entire DataFrame, it wouldn’t matter too much in this case since the data is small.
3. temps[idx - 1, 13] is the previous December since the first column is the year, and similarly for that year’s January and February.
4. This is a straightforward use of a comprehension to apply a function in a loop. Comprehensions are quite memory efficient. There are other strategies, such as mapping the function, that one could use.
175-element Vector{Float64}:
 -0.35166666666666674
 -0.3633333333333333
 -0.17966666666666667
 -0.37366666666666665
 -0.31233333333333335
 -0.2793333333333333
 -0.453
 -0.371
 -0.4036666666666667
 -0.393
  ⋮
  0.9269999999999999
  0.7629999999999999
  0.8213333333333334
  1.0646666666666667
  0.6546666666666666
  0.7643333333333334
  0.8046666666666668
  1.2286666666666666
  1.167

Now we want to create a vector of 1s and 0s for the freezing occurrences and non-occurrences (as a reminder, 1796 and 1816 are outside of the temperature data, so we will not include those). We’ll put all of the data together in a DataFrame to make it easy to assign the relevant 1s.

freezes = [1856, 1875, 1884, 1904, 1912, 1934, 1961, 1979, 2015]
dat = DataFrame(year=1851:2025, temp=djf_means, freeze=zeros(length(djf_means)))
for yr in freezes
    idx = findfirst(dat.year .== yr)
    dat[idx, :freeze] = 1
end

Now let’s maximize the likelihood of the model. We need to be able to compute the inverse logit of the linear part of the model, which is given by \[\text{logit}^{-1}(x) = \frac{\exp(x)}{\exp(x) + 1}.\]

logit(p) = log(p / (1 - p))
invlogit(x) = exp(x) / (exp(x) + 1)
function freeze_loglik(params, dat)
    b0, b1 = params
    p = invlogit.(b0 .+ b1 * dat.temp)
    ll = sum(logpdf.(Bernoulli.(p), dat.freeze))
    return ll
end

lb = [-20.0, -20.0]
ub = [20.0, 20.0]
p0 = [0.25, 0.25]
optim_out = Optim.optimize(v -> -freeze_loglik(v, dat), lb, ub, p0)
v_mle = round.(optim_out.minimizer; digits=1)
2-element Vector{Float64}:
 -3.0
 -0.7

We can interpret these coefficients as follows:

  • \(\hat{\beta}_0 = -3\) gives us the probability of freezing when the temperature anomaly is the same as the reference period mean, namely \(\text{logit}^{-1}(-3) = 4.7\%\).
  • \(\hat{\beta}_1 = -0.7\) gives the reduction in freezing log-odds associated with a temperature anomaly increase of \(1^\circ\) C. Log-odds are not very intuitive on their own, but the sign is informative: increased temperatures decrease the probability of freezing, which is what we would expect.
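A more interpretable transformation of the slope is the odds ratio: exponentiating \(\hat{\beta}_1\) gives the multiplicative change in the odds of freezing per \(1^\circ\) C of warming.

```julia
exp(v_mle[2]) # exp(-0.7) ≈ 0.50, so each 1°C of warming roughly halves the odds of freezing
```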

Problem 2.2

Let’s plot how the probability of freezing changes with respect to temperature and time.

temp_pred = -1:0.1:2
p1 = plot(temp_pred, invlogit.(v_mle[1] .+ v_mle[2] * temp_pred), xlabel="DJF Temperature Anomaly (°C)", ylabel="Probability Cayuga Lake Freezes", legend=false)
p2 = plot(dat.year, invlogit.(v_mle[1] .+ v_mle[2] * dat.temp), xlabel="Year", ylabel="Probability Cayuga Lake Freezes")
plot(p1, p2, layout=(1, 2), legend=false)
Figure 5: Modeled probability of Cayuga Lake freezing versus the DJF temperature anomaly and over time.

We can see from Figure 5 that the probability has declined precipitously over time as the temperature anomaly has increased, from around a 6-8% probability between 1851 and 1925 down to around 2% today.

How can we determine if this model is reasonable? Since the probabilities of occurrence are so low, we can’t rely on a raw mis-classification rate. However, we can assess how well calibrated the model is: does it generally predict the right number of freezes? We have observed 9 years in which Cayuga Lake has frozen (that align with our temperature data); let’s see if the expected number of freezes is similar by adding up the modeled probabilities.

p_fit = invlogit.(v_mle[1] .+ v_mle[2] * dat.temp)
sum(p_fit)
9.116926466107916

This is almost exactly right, though it is a bit generous to the model because it’s an in-sample estimate. We can also look at how well the model captures probabilities of freezing at different ranges of predicted probabilities. Let’s look at what happens between probabilities over 5% and under 5%.

sum(p_fit[p_fit .> 0.05])
6.511824596882027
sum(dat.freeze[p_fit .> 0.05])
6.0

That’s off by a little, but not much.

Similarly, for probabilities below 5%:

sum(p_fit[p_fit .< 0.05])
2.605101869225891
sum(dat.freeze[p_fit .< 0.05])
3.0

Try out other ranges to see where the model performs well and where it doesn’t: if you see ranges where the model substantially over- or under-predicts, that’s a signal of mis-specification.
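For example, a sketch of looping over a few (arbitrary) probability bins, using p_fit and dat from above:

```julia
bins = 0.0:0.02:0.10 # arbitrary bin edges spanning the fitted probabilities
for i in 1:(length(bins) - 1)
    in_bin = (p_fit .>= bins[i]) .& (p_fit .< bins[i + 1])
    sum(in_bin) == 0 && continue # skip empty bins
    println("p ∈ [$(bins[i]), $(bins[i+1])): expected $(round(sum(p_fit[in_bin]); digits=1)), observed $(Int(sum(dat.freeze[in_bin])))")
end
```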

Beyond this, diagnosing logistic regression after the fact can be difficult because there is no residual distribution to check; you need to convince yourself that the log-odds ought to have a linear relationship with the predictor(s), which is hard to assess visually.

Problem 2.3

To find the temperature required for less than a 1% probability of Cayuga Lake freezing, we need to find \(\text{logit}(0.01)\) and solve for the temperature based on the logistic regression.

freeze_thresh_prob = logit(0.01)
freeze_thresh_temp = (freeze_thresh_prob - v_mle[1]) / v_mle[2]
2.278742643049414

So if the winter temperature anomaly rises above about 2.3\(^\circ\) C, we would expect the probability of Cayuga Lake freezing to drop below 1%.

Problem 3 (6 points)

The file data/salamanders.csv contains counts of salamanders from 47 different plots of the same area in California, as well as the percentage of ground cover and age of the forest in the plot. You would like to see if you can use these data to predict the salamander counts with a Poisson regression.

Problem 3.1

Loading the data (note the delimiter is now a semi-colon):

dat = CSV.read(joinpath("data", "salamanders.csv"), DataFrame, delim=";")
47×4 DataFrame (22 rows omitted)
 Row  SITE   SALAMAN  PCTCOVER  FORESTAGE
      Int64  Int64    Int64     Int64
   1      1       13        85        316
   2      2       11        86         88
   3      3       11        90        548
   4      4        9        88         64
   5      5        8        89         43
   6      6        7        83        368
   7      7        6        83        200
   8      8        6        91         71
   9      9        5        88         42
  10     10        5        90        551
  11     11        4        87        675
  12     12        3        83        217
  13     13        3        87        212
   ⋮      ⋮        ⋮         ⋮          ⋮
  36     36        0        11          6
  37     37        0        14         49
  38     38        0        17         29
  39     39        0        24         57
  40     40        0        44         59
  41     41        0        52         78
  42     42        0        77         50
  43     43        0        78        320
  44     44        0        80        411
  45     45        0        86        133
  46     46        0        89         60
  47     47        0        91        187

The first model, using PCTCOVER as a predictor, can be specified as the following (using the standard log link function for Poisson regression):

\[ \begin{aligned} S &\sim \text{Poisson}(\lambda) \\ \log(\lambda) &= a PC + b \end{aligned} \]

As noted in the problem, we will need to standardize the PCTCOVER variable, since its scale is much larger than that of the salamander counts, which can make the optimization numerically finicky.

function salamander_pcover(p, counts, pctcover) 
    a, b = p
    λ = exp.(a * pctcover .+ b)
    ll = sum(logpdf.(Poisson.(λ), counts))
    return ll
end

lb = [-10.0, -50.0]
ub = [10.0, 50.0]
p0 = [0.0, 0.0]

# function to make this more convenient
stdz(x) = (x .- mean(x)) / std(x)

result = optimize(p -> -salamander_pcover(p, dat.SALAMAN, stdz(dat.PCTCOVER)), lb, ub, p0)
pcover_mle = round.(result.minimizer; digits=1)
2-element Vector{Float64}:
 1.2
 0.4

Problem 3.2

To translate these estimates to expected values and uncertainty intervals, we now simulate observations of salamanders based on this model.

function sim_salamanders_pcover(params, pctcover, n)
    a, b = params
    λ = exp.(a * pctcover .+ b)
    sal_pred = zeros(n, length(pctcover))
    for i = 1:length(pctcover)
        sal_pred[:, i] = rand(Poisson(λ[i]), n)
    end
    return sal_pred
end

sim_pcover = sim_salamanders_pcover(pcover_mle, -2:0.01:2, 10_000)

pcover_q = mapslices(col -> quantile(col, [0.05, 0.5, 0.95]), sim_pcover; dims=1) 
plot(-2:0.01:2, pcover_q[2, :], linewidth=3, ribbon=(pcover_q[2, :] - pcover_q[1, :], pcover_q[3, :] - pcover_q[2, :]), fillalpha=0.5, xlabel="Standardized Percent Ground Cover", ylabel="Salamander Count", label="Simulations")
scatter!(stdz(dat.PCTCOVER), dat.SALAMAN, label="Observations")
1. Note that our predictions are along the standardized predictor range, not the raw PCTCOVER range, since we standardized the inputs when we fit the regression.
Figure 6: Predictive interval of salamander counts using the percent cover model.

The model seems to do OK for lower levels of percent cover, but does not capture the level of dispersion at higher cover; this suggests that a Poisson model is not ideal, and we could try a negative binomial regression instead.
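For reference, here is a minimal sketch of how the log-likelihood could be modified for a negative binomial model (not fit here). The extra parameter r controls dispersion; I use the parameterization where NegativeBinomial(r, r / (r + λ)) has mean λ and variance λ + λ²/r, so smaller r means more overdispersion.

```julia
# sketch only: a negative binomial alternative to salamander_pcover
function salamander_pcover_nb(p, counts, pctcover)
    a, b, r = p # r > 0 is the dispersion parameter
    λ = exp.(a * pctcover .+ b) # same log link as the Poisson model
    ll = sum(logpdf.(NegativeBinomial.(r, r ./ (r .+ λ)), counts))
    return ll
end
```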

Problem 3.3

Let’s see if we can improve the model using the FORESTAGE predictor. We’ll try to add this in as a linear predictor.

function salamander_fage(p, counts, pctcover, forestage) 
    a1, a2, b = p
    λ = exp.(a1 * pctcover + a2 * forestage .+ b)
    ll = sum(logpdf.(Poisson.(λ), counts))
    return ll
end

lb = [-10.0, -10.0, -50.0]
ub = [10.0, 10.0, 50.0]
p0 = [0.0, 0.0, 0.0]

result = optimize(p -> -salamander_fage(p, dat.SALAMAN, stdz(dat.PCTCOVER), stdz(dat.FORESTAGE)), lb, ub, p0)
fage_mle = round.(result.minimizer; digits=1)
3-element Vector{Float64}:
  1.2
 -0.0
  0.4

We can see that the FORESTAGE predictor variable is effectively zero, which means it does not add much to the prediction. Why is this? The correlation between PCTCOVER and FORESTAGE is quite high: 0.63, which means most of the information in FORESTAGE has already been included in the PCTCOVER model. It makes sense that PCTCOVER is valuable: salamanders are likely to seek areas with high levels of ground cover, while the age of the forest is only meaningful to the extent that older forests may have more ground cover (hence the high correlation).
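The quoted correlation can be checked directly (cor is available via the statistics packages loaded at the top of the notebook):

```julia
cor(dat.PCTCOVER, dat.FORESTAGE) # ≈ 0.63
```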

Problem 4 (7 points)

Problem 4.1

Loading the data (once again, the file is delimited by semi-colons):

dat = CSV.read(joinpath("data", "Hurricanes.csv"), DataFrame, delim=";")
92×8 DataFrame (67 rows omitted)
 Row  name      year   deaths  category  min_pressure  damage_norm  female  femininity
      String15  Int64  Int64   Int64     Int64         Int64        Int64   Float64
   1  Easy      1950        2         3           960         1590       1     6.77778
   2  King      1950        4         3           955         5350       0     1.38889
   3  Able      1952        3         1           985          150       0     3.83333
   4  Barbara   1953        1         1           987           58       1     9.83333
   5  Florence  1953        0         1           985           15       1     8.33333
   6  Carol     1954       60         3           960        19321       1     8.11111
   7  Edna      1954       20         3           954         3230       1     8.55556
   8  Hazel     1954       20         4           938        24260       1     9.44444
   9  Connie    1955        0         3           962         2030       1     8.5
  10  Diane     1955      200         1           987        14730       1     9.88889
  11  Ione      1955        7         3           960         6200       0     5.94444
  12  Flossy    1956       15         2           975         1540       1     7.0
  13  Helene    1958        1         3           946          540       1     9.88889
   ⋮  ⋮         ⋮           ⋮         ⋮             ⋮            ⋮       ⋮           ⋮
  81  Cindy     2005        1         1           991          350       1     9.94444
  82  Dennis    2005       15         3           946         2650       0     2.44444
  83  Ophelia   2005        1         1           982           91       1     9.16667
  84  Rita      2005       62         3           937        10690       1     9.5
  85  Wilma     2005        5         3           950        25960       1     8.61111
  86  Humberto  2007        1         1           985           51       0     2.38889
  87  Dolly     2008        1         1           967         1110       1     9.83333
  88  Gustav    2008       52         2           954         4360       0     1.72222
  89  Ike       2008       84         2           950        20370       0     1.88889
  90  Irene     2011       41         1           952         7110       1     9.27778
  91  Isaac     2012        5         1           966        24000       0     1.94444
  92  Sandy     2012      159         2           942        75000       1     9.0

Our interpretation of the hypothesis results in the following model specification:

\[ \begin{aligned} D &\sim \text{Poisson}(\lambda) \\ \log(\lambda) &= a_1 FN + a_2 MP + b \end{aligned} \]

We need to standardize the pressure predictor.

function hurricane_pressure(p, counts, fem, pressure) 
    a1, a2, b = p
    m = exp.(a1 * fem + a2 * pressure .+ b)
    ll = sum(logpdf.(Poisson.(m), counts))
    return ll
end

lb = [-10.0, -10.0, -50.0]
ub = [10.0, 10.0, 50.0]
p0 = [5.0, 5.0, 0.0]

# function to make this more convenient
stdz(x) = (x .- mean(x)) / std(x)

result = optimize(p -> -hurricane_pressure(p, dat.deaths, dat.female, stdz(dat.min_pressure)), lb, ub, p0)
p_pressure = result.minimizer
3-element Vector{Float64}:
  0.47211173924356337
 -0.7195324699559839
  2.414667236694508

Problem 4.2

function sim_hurricanes(p, fem, pressure, n)
    a1, a2, b = p
    λ = exp.(a1 * fem + a2 * pressure .+ b)
    hur_pred = zeros(n, length(fem))
    for i = 1:length(fem)
        hur_pred[:, i] = rand(Poisson(λ[i]), n)
    end
    return hur_pred
end

press_range = -3:0.01:2 # these are standardized pressures
sim_m = sim_hurricanes(p_pressure, repeat([0], length(press_range)), press_range, 10_000)
sim_f = sim_hurricanes(p_pressure, repeat([1], length(press_range)), press_range, 10_000)

simm_q = mapslices(col -> quantile(col, [0.05, 0.5, 0.95]), sim_m; dims=1)
simf_q = mapslices(col -> quantile(col, [0.05, 0.5, 0.95]), sim_f; dims=1)

std_pressure = stdz(dat.min_pressure)
plot(-3:0.01:2, simm_q[2, :], linewidth=3, ribbon=(simm_q[2, :] - simm_q[1, :], simm_q[3, :] - simm_q[2, :]), alpha=0.5, label="Masculine Storms (Simulation)", color=:blue, xlabel="Standardized Minimum Pressure", ylabel="Deaths")
plot!(-3:0.01:2, simf_q[2, :], linewidth=3, ribbon=(simf_q[2, :] - simf_q[1, :], simf_q[3, :] - simf_q[2, :]), alpha=0.5, label="Feminine Storms (Simulation)", color=:red)
scatter!(std_pressure[dat.female .== 0.0], dat.deaths[dat.female .== 0.0], color=:blue, label="Masculine Storms (Observations)")
scatter!(std_pressure[dat.female .== 1.0], dat.deaths[dat.female .== 1.0], color=:red, label="Feminine Storms (Observations)")
1. Here we want to fix one of the predictors (female) to generate the relevant conditional simulations. The repeat() function lets us repeat the same value of the female predictor across the pressure range.
Figure 7: Simulated hurricane deaths based on the gender of the storm name and minimum hurricane pressure, with separate predictions for male-named and female-named storms.

The model fails to predict the storms with the most and least deaths for both genders; this isn’t entirely shocking, since most of the data consists of storms with few deaths, and a Poisson model’s variance equals its mean. We would need a model with more dispersion (such as the negative binomial, which was used in the original paper) to potentially capture these. However, the model does clearly predict an effect from the gender of the storm name, as we can see in Figure 7.

Problem 4.3

The effect size shown in Figure 7 is quite strong, and does not in general seem supported by the data. While there are female-named storms with more deaths than the male-named storms with similar minimum pressures, we can also see many cases where they are quite similar. This effect seems to be driven by the large outlying storms, which tend to have female names, but aside from these few events, the data do not appear to suggest a systematic pattern of greater deaths from female storms.

Problem 4.4

We can now stop taking this hypothesis seriously. To what extent do you think the finding could be an artifact of the dataset (e.g. there is no actual effect, but there are coincidental features of the data that produce the result that female-named storms are more deadly than male-named storms)? Justify this conclusion with specific reference to an exploratory analysis of the data.

As noted in the solution to Problem 4.3, the model results seem to be driven entirely by the most deadly four storms, all of which had female names. We can see how likely this is from random variation. Let’s look at how many more storms in the dataset have female names than male names.

sum(dat.female) / nrow(dat)
0.6739130434782609

So about two-thirds of the storms in the dataset have female names. As a result, there is roughly a \((2/3)^4 \approx 20\%\) probability that the four deadliest storms would happen to have female names just by chance, which is hardly negligible.
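This back-of-the-envelope probability can be computed directly, under the assumption that the deadliest storms’ names are drawn independently from the overall mix of names in the dataset:

```julia
p_female = sum(dat.female) / nrow(dat) # ≈ 0.674
p_female^4 # ≈ 0.21: chance all four deadliest storms have female names by luck alone
```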

Extra (I don’t expect you to have looked this up):

In fact, this isn’t random: the dataset includes storms between 1953–1978, when all tropical storms were given female names. And in fact, other than Sandy (which is classified as having a female name, but it is generally considered unisex), the four most deadly storms were from this period:

sort(dat, :deaths, rev=true)
92×8 DataFrame (67 rows omitted)
 Row  name      year   deaths  category  min_pressure  damage_norm  female  femininity
      String15  Int64  Int64   Int64     Int64         Int64        Int64   Float64
   1  Camille   1969      256         5           909        23040       1     9.05556
   2  Diane     1955      200         1           987        14730       1     9.88889
   3  Sandy     2012      159         2           942        75000       1     9.0
   4  Agnes     1972      117         1           980        20430       1     8.66667
   5  Ike       2008       84         2           950        20370       0     1.88889
   6  Betsy     1965       75         3           948        20000       1     8.33333
   7  Andrew    1992       62         5           922        66730       0     2.22222
   8  Rita      2005       62         3           937        10690       1     9.5
   9  Carol     1954       60         3           960        19321       1     8.11111
  10  Floyd     1999       56         2           956         8130       0     1.83333
  11  Gustav    2008       52         2           954         4360       0     1.72222
  12  Isabel    2003       51         2           957         4980       1     9.38889
  13  Donna     1960       50         4           930        53270       1     9.27778
   ⋮  ⋮         ⋮           ⋮         ⋮             ⋮            ⋮       ⋮           ⋮
  81  Humberto  2007        1         1           985           51       0     2.38889
  82  Dolly     2008        1         1           967         1110       1     9.83333
  83  Florence  1953        0         1           985           15       1     8.33333
  84  Connie    1955        0         3           962         2030       1     8.5
  85  Debra     1959        0         1           984          430       1     9.88889
  86  Ethel     1960        0         1           981           35       1     8.72222
  87  Edith     1971        0         2           978          300       1     8.5
  88  Ginger    1971        0         1           995          200       1    10.0
  89  Babe      1977        0         1           995           66       1     6.88889
  90  Bob       1985        0         1          1003          130       0     1.66667
  91  Floyd     1987        0         1           993            1       0     1.83333
  92  Bret      1999        0         3           951           94       0     2.33333

Katrina and Audrey, which were left out of the dataset, would also have been among the deadliest storms; Hurricane Audrey occurred in 1957, within the female-only naming period. Setting aside the storms from that period, the deadliest storms are Katrina, Sandy, and Ike: hardly signs of a clear female bias. Moreover, we know that many of the deaths from Katrina resulted from engineering failures rather than residents “not taking storms with female names seriously.”

For a thorough analysis of this paper’s statistical failings (which go beyond what we have considered here), see this paper by Gary Smith. As a sign of how much of a laughingstock this paper is among statisticians, you can see how often it’s mockingly referenced on Andrew Gelman’s blog. The lesson: just because you can fit a model and get “statistically significant” results, it doesn’t mean those results make sense. You should always think carefully about what your data-generating mechanism actually implies, and how fit for purpose your dataset is for examining your hypothesis (in this case, the inclusion of data between 1953 and 1978 clearly biases the results in favor of the authors’ hypothesis).
