Missing Data Mechanisms

Lecture 33

April 20, 2026

Missing Data

Statistics and Missing Data

…Statistics is a missing data problem.

– Little (2013)

Missing vs. Latent Variables

Missing Data: Variables which are inconsistently observed.

Latent Variables: Unobserved variables which influence the data-generating process.

In both cases, we would like to understand the complete-data likelihood (data-generating process including the missing/latent data).

Common (But Flawed) Approach: Complete-Case Analysis

Complete-case Analysis: Only consider data for which all variables are available.

Can result in bias if the missing values follow a systematic pattern.
Could result in discarding a large amount of data.
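
As a minimal sketch of what complete-case analysis does mechanically, the snippet below drops every case with an unobserved value and reports how much data is lost. The vectors `x` and `y` are hypothetical toy data, not the lecture's example, and `missing` marks an unobserved value.

```julia
# Hypothetical toy data: `missing` marks an unobserved value.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, missing, 6.2, missing, 9.8]

# A case is complete only if every variable is observed.
complete = (.!ismissing.(x)) .& (.!ismissing.(y))

# Keep the complete cases and drop the Missing type from y.
x_cc = x[complete]
y_cc = Float64.(y[complete])

# Fraction of the data set thrown away by deletion.
frac_discarded = count(.!complete) / length(complete)
```

Here two of the five cases are discarded. Whether the remaining three are representative depends entirely on why the other two are missing.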

Importance of Assumptions

Because we don’t observe the missing data, all approaches require assumptions about how the missing data might have looked.

Examples:

Whether data is missing is entirely random;
Data can be linearly inter-/extrapolated.

Missing Data Example

Code

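using Distributions, GLM, StatsBase  # Uniform/Normal/Binomial, lm/predict, sample
using Plots, LaTeXStrings            # scatter/plot and the L"..." labels

n = 50
x = rand(Uniform(0, 100), n)
logit(x) = log(x / (1 - x))
invlogit(x) = exp(x) / (1 + exp(x))
# y lies in (0, 1): a logistic trend in x with Gaussian noise on the logit scale
f(x) = invlogit(0.05 * (x - 50) + rand(Normal(0, 1)))
y = f.(x)

# the probability that y is missing increases with y itself
m(y) = invlogit(0.75 * logit(y))
prob_missing_y = m.(y)
missing_y = Bool.(rand.(Binomial.(1, prob_missing_y)))
xobs = x[.!(missing_y)]
yobs = y[.!(missing_y)]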

p_alldat = scatter(xobs, yobs, xlabel=L"$x$", ylabel=L"$y$", label="Observations", markersize=5, size=(600, 500), ylims=(-0.05, 1))
scatter!(x[missing_y], zeros(sum(missing_y)), markershape=:x, markersize=3, label="Missing Observations")

Code

# impute the missing y values with a linear regression fitted to the observed cases
linfit = lm([ones(length(xobs)) xobs], yobs)
linpred = predict(linfit, [ones(sum(missing_y)) x[missing_y]])
p1 = deepcopy(p_alldat)
scatter!(p1, x[missing_y], linpred, label="Imputed Values", markersize=5, markershape=:diamond, legend=false)

Missing Data Example

Code

p_alldat

Code

# impute by resampling the observed y values with replacement (ignores x)
repeatpred = sample(yobs, sum(missing_y), replace=true)
p2 = deepcopy(p_alldat)
scatter!(p2, x[missing_y], repeatpred, label="Imputed Values", markersize=5, markershape=:diamond, legend=false)

Missing Data Example

Code

p_alldat

Code

# for reference: plot the true values of the missing observations
p3 = deepcopy(p_alldat)
scatter!(p3, x[missing_y], y[missing_y], label="True Values", markersize=5, markershape=:diamond, legend=false)

Comparison of Imputations
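
The imputations in the example (regression, resampling of observed values) can also be compared numerically against the true values, which is only possible in a simulation where the truth is known. The sketch below is self-contained and makes two simplifying assumptions that differ from the example above: the true relationship is linear and the missingness is completely random. All names (`imp_reg`, `imp_resample`, `rmse`) are illustrative.

```julia
using Random, Statistics

rng = MersenneTwister(42)
n = 2_000
x = 100 .* rand(rng, n)
y = 0.25 .* x .+ 2 .+ 7 .* randn(rng, n)   # assumed linear truth, not the lecture's f
miss = rand(rng, n) .< 0.3                  # assumed: missingness independent of x and y

xobs, yobs = x[.!miss], y[.!miss]

# (1) regression imputation: fit OLS on the observed cases, predict the missing ones
beta = [ones(length(xobs)) xobs] \ yobs
imp_reg = beta[1] .+ beta[2] .* x[miss]

# (2) resampling imputation: draw observed y values with replacement, ignoring x
imp_resample = rand(rng, yobs, count(miss))

# root-mean-square error of each imputation against the (normally unknown) truth
rmse(a, b) = sqrt(mean((a .- b) .^ 2))
rmse_reg = rmse(imp_reg, y[miss])
rmse_resample = rmse(imp_resample, y[miss])
```

In this setup the regression imputation has the smaller error because it uses `x`, but it places every imputed value exactly on the fitted line and so understates the spread of the data; resampling preserves the marginal spread of `y` while ignoring its dependence on `x`.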

Notation

Let \(M_Y\) be the indicator function for whether \(Y\) is missing and let \(\pi(x) = \mathbb{P}(M_Y = 0 | X = x)\) be the inclusion probability.

Goal: Understand the complete-data distribution \(\mathbb{P}(Y=y | X=x)\).

But we only have the observed distribution \(\mathbb{P}(Y = y | X=x, M_Y = 0) \pi(x)\). We are missing \(\mathbb{P}(Y = y | X=x, M_Y = 1) (1-\pi(x))\).
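
These two pieces combine through the law of total probability, which makes explicit which term the data cannot identify:

\[ \begin{align*} \mathbb{P}(Y=y | X=x) &= \underbrace{\mathbb{P}(Y=y | X=x, M_Y=0)\, \pi(x)}_{\text{observed}} \\ &\quad + \underbrace{\mathbb{P}(Y=y | X=x, M_Y=1)\, (1-\pi(x))}_{\text{missing}} \end{align*} \]

Any approach to missing data amounts to an assumption about the second term.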

Categories of Missingness

Missing Completely At Random (MCAR)

MCAR: \(M_Y\) is independent of \(X=x\) and \(Y=y\).

Complete cases are fully representative of the complete data:

\[\mathbb{P}(Y=y) = \mathbb{P}(Y=y | M_Y=0)\]

Code

flin(x) = 0.25 * x + 2 + rand(Normal(0, 7))
y = flin.(x)
xpred = collect(0:0.1:100)

lm_all = lm([ones(length(x)) x], y)
y_lm_all = predict(lm_all, [ones(length(xpred)) xpred])

missing_y = Bool.(rand(Binomial(1, 0.25), length(y)))
xobs = x[.!(missing_y)]
yobs = y[.!(missing_y)]
lm_mcar = lm([ones(n - sum(missing_y)) xobs], yobs)
y_lm_mcar = predict(lm_mcar, [ones(length(xpred)) xpred])


p1 = scatter(xobs, yobs, xlabel=L"$x$", ylabel=L"$y$", label=false, markersize=5, size=(600, 500), color=:blue)
scatter!(x[missing_y], y[missing_y], alpha=0.9, color=:lightgrey, label=false, markersize=5)
plot!(xpred, y_lm_all, color=:red, lw=3, label="Complete-Data Inference", ribbon=GLM.dispersion(lm_all), fillalpha=0.2)
plot!(xpred, y_lm_mcar, color=:blue, lw=3, linestyle=:dot, label="Observed-Data Inference", ribbon=GLM.dispersion(lm_mcar), fillalpha=0.2)

Missing At Random (MAR)

MAR: \(M_Y\) is independent of \(Y=y\) conditional on \(X=x\).

Also called ignorable or uninformative missingness.

\[ \begin{align*} \mathbb{P}&(Y=y | X=x) \\ &= \mathbb{P}(Y=y | X=x, M_Y=0) \end{align*} \]
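
To see why this equality holds, apply Bayes' rule to the observed-data distribution. The only ingredients are the MAR condition, written as \(\mathbb{P}(M_Y=0 | X=x, Y=y) = \pi(x)\), and the assumption that \(\pi(x) > 0\):

\[ \begin{align*} \mathbb{P}(Y=y | X=x, M_Y=0) &= \frac{\mathbb{P}(M_Y=0 | X=x, Y=y)\, \mathbb{P}(Y=y | X=x)}{\mathbb{P}(M_Y=0 | X=x)} \\ &= \frac{\pi(x)\, \mathbb{P}(Y=y | X=x)}{\pi(x)} \\ &= \mathbb{P}(Y=y | X=x) \end{align*} \]

If the inclusion probability also depends on \(y\), the ratio in the first line no longer cancels and the observed conditional distribution is a reweighted version of the complete-data one.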

Code

missing_y = Bool.(rand.(Binomial.(1,  invlogit.(0.1 * (x .- 75)))))
xobs = x[.!(missing_y)]
yobs = y[.!(missing_y)]
lm_mar = lm([ones(n - sum(missing_y)) xobs], yobs)
y_lm_mar = predict(lm_mar, [ones(length(xpred)) xpred])

p2 = scatter(xobs, yobs, xlabel=L"$x$", ylabel=L"$y$", label=false, markersize=5, size=(600, 500), color=:blue)
scatter!(x[missing_y], y[missing_y], alpha=0.9, color=:lightgrey, label=false, markersize=5)
plot!(xpred, y_lm_all, color=:red, lw=3, label="Complete-Data Inference", ribbon=GLM.dispersion(lm_all), fillalpha=0.2)
plot!(xpred, y_lm_mar, color=:blue, lw=3, linestyle=:dot, label="Observed-Data Inference", ribbon=GLM.dispersion(lm_mar), fillalpha=0.2)

Missing Not At Random (MNAR)

MNAR: \(M_Y\) is dependent on \(Y=y\) (and/or unmodeled variables).

Also called non-ignorable or informative missingness.

\[ \begin{align*} \mathbb{P}&(Y=y | X=x) \\ &\neq \mathbb{P}(Y=y | X=x, M_Y=0) \end{align*} \]

Code

missing_y = Bool.(rand.(Binomial.(1,  invlogit.(0.9 * (y .- 15)))))
xobs = x[.!(missing_y)]
yobs = y[.!(missing_y)]
lm_mnar = lm([ones(n - sum(missing_y)) xobs], yobs)
y_lm_mnar = predict(lm_mnar, [ones(length(xpred)) xpred])

p3 = scatter(xobs, yobs, xlabel=L"$x$", ylabel=L"$y$", label=false, markersize=5, size=(600, 500), color=:blue)
scatter!(x[missing_y], y[missing_y], alpha=0.9, color=:lightgrey, label=false, markersize=5)
plot!(xpred, y_lm_all, color=:red, lw=3, label="Complete-Data Inference", ribbon=GLM.dispersion(lm_all), fillalpha=0.2)
plot!(xpred, y_lm_mnar, color=:blue, lw=3, linestyle=:dot, label="Observed-Data Inference", ribbon=GLM.dispersion(lm_mnar), fillalpha=0.2)

Implications of Missingness Mechanism

MCAR: A strong assumption that is generally implausible. When it holds, complete-case analysis is unbiased because the observed data is fully representative of the complete data.
MAR: More plausible than MCAR. Complete-case analysis can still be justified for models that condition on \(X\), as the observed conditional distribution of \(Y\) given \(X\) matches the complete-data one; marginal summaries of \(Y\) can still be biased.
MNAR: Deletion is a bad idea, as the observed data does not follow the same conditional distribution as the complete data. The missingness itself can be informative, so try to model the missingness mechanism.
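
These implications can be checked with a rough simulation. The sketch below reuses the linear model and the three missingness mechanisms from the examples above, but with a larger sample, a fixed seed, and base-Julia least squares in place of `lm`; those choices, and the helper names `ols_slope` and `draw_missing`, are assumptions made for illustration.

```julia
using Random, Statistics

invlogit(z) = 1 / (1 + exp(-z))

# Ordinary least squares slope of y on x (with an intercept).
ols_slope(x, y) = ([ones(length(x)) x] \ y)[2]

rng = MersenneTwister(1)
n = 20_000
x = 100 .* rand(rng, n)
y = 0.25 .* x .+ 2 .+ 7 .* randn(rng, n)   # true slope is 0.25

# Draw a missingness indicator from per-observation probabilities p.
draw_missing(p) = rand(rng, length(p)) .< p

miss_mcar = draw_missing(fill(0.25, n))                  # constant probability
miss_mar  = draw_missing(invlogit.(0.1 .* (x .- 75)))    # depends on x only
miss_mnar = draw_missing(invlogit.(0.9 .* (y .- 15)))    # depends on y itself

# Complete-case slope estimates under each mechanism.
slope_mcar = ols_slope(x[.!miss_mcar], y[.!miss_mcar])
slope_mar  = ols_slope(x[.!miss_mar],  y[.!miss_mar])
slope_mnar = ols_slope(x[.!miss_mnar], y[.!miss_mnar])
```

Under MCAR and MAR the complete-case slope stays close to the true value of 0.25, while under MNAR it is pulled well below it because large values of `y` are preferentially removed.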

Key Points and Upcoming Schedule

Key Points

Missing data is very common in environmental contexts.
Whether inferences are unbiased depends on the missingness mechanism: MCAR, MAR, or MNAR (informative missingness).
Best approach to missing data is to not have any.

Upcoming Schedule

Wednesday: More on Missing Data

Friday: Finishing Missing Data

Assessments

HW6 released, due 5/1.

Quiz 4: Due 5/1

References

Little, R. J. (2013). In praise of simplicity not mathematistry! Ten simple powerful ideas for the statistical scientist. J. Am. Stat. Assoc., 108, 359–369. https://doi.org/10.1080/01621459.2013.787932