Lecture 03
January 28, 2026
But not all models are theoretically justifiable.




Source: Richard McElreath
Source: XKCD 2620
Developing suitable models often starts with both exploratory data analysis (EDA) and theoretical reasoning.
EDA is about generating questions and identifying if the data quality is sufficient to address them.
Common questions to guide EDA:
| Row | rownames | Ozone | Solar | Wind | Temp | Month | Day |
|---|---|---|---|---|---|---|---|
| | Int64 | Int64? | Int64? | Float64 | Int64 | Int64 | Int64 |
| 1 | 1 | 41 | 190 | 7.4 | 67 | 5 | 1 |
| 2 | 2 | 36 | 118 | 8.0 | 72 | 5 | 2 |
| 3 | 3 | 12 | 149 | 12.6 | 74 | 5 | 3 |
| 4 | 4 | 18 | 313 | 11.5 | 62 | 5 | 4 |
| 5 | 5 | missing | missing | 14.3 | 56 | 5 | 5 |
| Row | variable | min | median | mean | max | missings |
|---|---|---|---|---|---|---|
| | String | Float64 | Float64 | Float64 | Float64 | Int64 |
| 1 | Ozone | 1.0 | 31.5 | 42.1293 | 168.0 | 37 |
| 2 | Solar | 7.0 | 205.0 | 185.932 | 334.0 | 7 |
| 3 | Wind | 1.7 | 9.7 | 9.95752 | 20.7 | 0 |
| 4 | Temp | 56.0 | 79.0 | 77.8824 | 97.0 | 0 |
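A summary like the one above can also be computed by hand, which makes the treatment of missing values explicit. The following is a minimal sketch using only Julia's `Statistics` standard library. The helper name `summarize` is hypothetical, and the input is just the first five `Ozone` entries from the data table, not the full column.

```julia
using Statistics

# Hypothetical helper: summary statistics that ignore missing entries
function summarize(x)
    vals = collect(skipmissing(x))   # drop missing values before computing
    return (min = minimum(vals), median = median(vals), mean = mean(vals),
            max = maximum(vals), missings = count(ismissing, x))
end

# First five Ozone values from the data table (not the full column)
ozone_head = [41, 36, 12, 18, missing]
s = summarize(ozone_head)
println(s)
```

Counting the missing entries separately, rather than silently dropping them, helps judge whether the data quality is sufficient for the question at hand.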
Four datasets (Anscombe's quartet), all with nearly identical means, variances, correlations, and regression lines.
Shows the importance of visualization!
Source: Wikipedia
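To make the point concrete, here is a small sketch that assumes the four datasets referred to are Anscombe's quartet and compares the first two of them. The helper `ols` is a hypothetical name for a simple least-squares line. Only Julia's `Statistics` standard library is used.

```julia
using Statistics

# First two datasets of Anscombe's quartet (they share the same x values)
x  = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

# Hypothetical helper: slope and intercept of a simple least-squares line
function ols(x, y)
    slope = cov(x, y) / var(x)
    return (slope = slope, intercept = mean(y) - slope * mean(x))
end

for (name, y) in (("I", y1), ("II", y2))
    line = ols(x, y)
    println("Dataset ", name,
            ": mean = ", round(mean(y), digits = 2),
            ", var = ", round(var(y), digits = 2),
            ", cor = ", round(cor(x, y), digits = 2),
            ", slope = ", round(line.slope, digits = 2),
            ", intercept = ", round(line.intercept, digits = 2))
end
```

The summaries agree to about two decimal places even though the first dataset is roughly linear and the second follows a curve, which only a plot reveals.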
Boxplots: Show quantiles (usually the median and quartiles, with whiskers extending up to 1.5 \(\times\) IQR beyond the quartiles).
Histograms: Show the distribution of the data.
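The quantities a boxplot draws can be computed directly. The sketch below assumes the common 1.5 \(\times\) IQR whisker convention and uses Julia's default `quantile` definition. The helper name `boxplot_stats` and the data are made up for illustration.

```julia
using Statistics

# Hypothetical helper: the quantities a boxplot typically displays
function boxplot_stats(x)
    q1, med, q3 = quantile(x, [0.25, 0.5, 0.75])
    iqr = q3 - q1
    # Fences: whiskers reach the most extreme data points inside these limits
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = filter(v -> v < lower || v > upper, x)
    return (q1 = q1, median = med, q3 = q3, iqr = iqr,
            lower = lower, upper = upper, outliers = outliers)
end

# Made-up illustrative data with one extreme value
x = [1, 2, 3, 4, 5, 6, 7, 8, 100]
b = boxplot_stats(x)
println(b)
```

Points beyond the fences are usually drawn individually, which is why a single extreme value stands out in a boxplot even when it barely moves the median.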
# this uses the dataframe plotting syntax from StatsPlots.jl
using Plots, StatsPlots    # StatsPlots provides the @df macro
using Plots.PlotMeasures   # provides the mm unit used for margins
# aq is the air quality DataFrame summarized above
p1 = @df aq scatter(:Ozone, :Solar, markersize=5, color=:blue, xlabel="Ozone (ppb)", ylabel="Solar Radiation (lang)", leftmargin=10mm)
p2 = @df aq scatter(:Ozone, :Wind, markersize=5, color=:blue, xlabel="Ozone (ppb)", ylabel="Wind Speed (mph)", leftmargin=10mm)
p3 = @df aq scatter(:Ozone, :Temp, markersize=5, color=:blue, xlabel="Ozone (ppb)", ylabel="Max Temperature (°F)", leftmargin=10mm)
p = plot(p1, p2, p3, layout=(1, 3), size=(1100, 400))

Figure 2: Paired plots of air quality dataset
If we have a candidate distribution, we can try to fit it to the data and visually compare to a histogram.
Here we might try a Log-Normal distribution.
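As a rough sketch of what fitting means here: for a Log-Normal, the maximum-likelihood estimates are the mean and the uncorrected standard deviation of the logged data. The helpers `fit_lognormal` and `lognormal_pdf` are hypothetical names and the data are made up. Packages such as Distributions.jl provide equivalent fitting routines, so this is an illustration rather than the lecture's own code.

```julia
using Statistics

# Hypothetical helper: maximum-likelihood estimates for a Log-Normal,
# i.e. the mean and uncorrected standard deviation of log(x)
function fit_lognormal(x)
    logx = log.(collect(skipmissing(x)))   # ignore missing observations
    μ = mean(logx)
    σ = std(logx; corrected = false)       # the MLE divides by n, not n - 1
    return (μ = μ, σ = σ)
end

# Log-Normal density, e.g. for overlaying on a density-normalized histogram
lognormal_pdf(x, μ, σ) = exp(-(log(x) - μ)^2 / (2σ^2)) / (x * σ * sqrt(2π))

# Made-up illustrative data (not the ozone column), with one missing value
data = [exp(1.0), exp(2.0), exp(3.0), missing]
est = fit_lognormal(data)
println(est)
println(lognormal_pdf(exp(est.μ), est.μ, est.σ))   # density at the fitted median
```

Evaluating `lognormal_pdf` on a grid and drawing it over the histogram gives the visual comparison described above.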
Figures 6 to 9: Diagnosing Problems With Q-Q Plots
# passing the LogNormal type lets qqplot fit the distribution to the non-missing ozone data
p1 = qqplot(LogNormal, collect(skipmissing(aq[!, :Ozone])), linewidth=3, markersize=6, title="Log-Normal Q-Q Plot", leftmargin=5mm)
xlabel!(p1, "Theoretical Values")
ylabel!(p1, "Empirical Values")
# pozone and ln_samps are assumed to be defined earlier (not shown here)
plot!(pozone, xlims=(0, 300))
# 1:116 matches the number of non-missing ozone values so both columns have equal length
p3 = boxplot([ln_samps[1:116] collect(skipmissing(aq[!, :Ozone]))], xlabel="Ozone (ppb)")
xticks!(p3, [1, 2], ["LogNormal", "Ozone Data"])
plot(p1, pozone, p3, layout=(1, 3), size=(1250, 525), rightmargin=5mm, leftmargin=5mm)

Figure 11: Q-Q plot comparing ozone data with Log-Normal distribution
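To show what a Q-Q plot compares, the sketch below pairs the sorted data with the matching quantiles of a candidate distribution. It assumes Distributions.jl is available. The helper name `qq_points`, the plotting positions \((i - 0.5)/n\) (one common convention), and the distribution parameters are all illustrative choices rather than what `qqplot` necessarily does internally.

```julia
using Distributions

# Hypothetical helper: pair sorted data with the candidate distribution's quantiles
function qq_points(d, x)
    n = length(x)
    p = ((1:n) .- 0.5) ./ n         # plotting positions (one common convention)
    theoretical = quantile.(d, p)   # quantiles of the candidate distribution
    empirical = sort(x)             # empirical quantiles are the sorted data
    return theoretical, empirical
end

d = LogNormal(3.4, 0.9)   # made-up parameters, not fitted to the ozone data
# Made-up data lying exactly on the distribution's quantiles, in shuffled order
x = quantile.(d, [0.9, 0.1, 0.5, 0.3, 0.7])
theo, emp = qq_points(d, x)
println(collect(zip(theo, emp)))
```

When the data follow the candidate distribution, the pairs fall along the line \(y = x\); systematic departures in the tails suggest the candidate distribution fits poorly there.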
Friday: Data Visualization
Next Week: Linear Regression and Fitting Models
Homework 1 Due Friday, 2/6.
Exercises: This week's set involves some simple computations.
Reading: Franconeri et al. (2024): long review, recommend starting early!