Missing Data Mechanisms


Lecture 33

April 20, 2026

Missing Data

Statistics and Missing Data

…Statistics is a missing data problem.

Little (2013)

Missing vs. Latent Variables

Missing Data: Variables which are inconsistently observed.

Latent Variables: Unobserved variables which influence the data-generating process.

In both cases, we would like to understand the complete-data likelihood (the data-generating process including the missing/latent data).

Common (But Flawed) Approach: Complete-Case Analysis

Complete-case Analysis: Only consider data for which all variables are available.

  • Can result in bias if the missing values follow a systematic pattern.
  • Could result in discarding a large amount of data.
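
Both bullets can be seen in a minimal sketch. The toy vectors below are hypothetical (they are not the lecture's data) and assume the three largest values of y went unrecorded, so the missingness has a systematic pattern.

```julia
using Statistics

# Hypothetical toy data: the full vector is normally unknown; it is shown
# here only so the complete-case estimate can be checked against it.
y_full = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y_rec = Union{Missing,Float64}[1.0, 2.0, 3.0, 4.0, 5.0, missing, missing, missing]

# Complete-case analysis: keep only the entries that were observed.
complete = collect(skipmissing(y_rec))

mean_full = mean(y_full)                                  # mean of the complete data
mean_cc = mean(complete)                                  # complete-case mean
frac_dropped = count(ismissing, y_rec) / length(y_rec)    # share of data discarded

println("complete-data mean = ", mean_full, ", complete-case mean = ", mean_cc)
println("fraction discarded = ", frac_dropped)
```

Because the missing entries are exactly the large ones, the complete-case mean understates the complete-data mean, and more than a third of the sample is thrown away.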

Importance of Assumptions

Because we don’t observe the missing data, all approaches require assumptions about how the missing data might have looked.

Examples:

  • Whether data is missing is entirely random;
  • Data can be linearly inter-/extrapolated.
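
As an illustration of the second assumption, here is a minimal sketch of linear interpolation across a gap. The function name `fill_linear` and the toy series are hypothetical; the sketch handles interior gaps only and deliberately refuses to extrapolate.

```julia
# Fill missing entries of y by linear interpolation between the nearest
# observed neighbors on either side (interior gaps only).
function fill_linear(t, y)
    filled = Vector{Float64}(undef, length(y))
    obs = findall(!ismissing, y)              # indices of observed values
    for i in eachindex(y)
        if !ismissing(y[i])
            filled[i] = y[i]
            continue
        end
        lo = findlast(<(i), obs)              # nearest observed index below i
        hi = findfirst(>(i), obs)             # nearest observed index above i
        (lo === nothing || hi === nothing) && error("gap at the boundary: no extrapolation in this sketch")
        i0, i1 = obs[lo], obs[hi]
        w = (t[i] - t[i0]) / (t[i1] - t[i0])  # relative position within the gap
        filled[i] = y[i0] + w * (y[i1] - y[i0])
    end
    return filled
end

t = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, missing, missing, 7.0, 9.0]
y_filled = fill_linear(t, y)
println(y_filled)
```

The filled values are only as good as the assumption that y changes linearly between observed points; nothing in the observed data can confirm that.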

Missing Data Example

Code
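using Distributions, GLM, LaTeXStrings, Plots, StatsBase

n = 50
x = rand(Uniform(0, 100), n)
logit(x) = log(x / (1 - x))
invlogit(x) = exp(x) / (1 + exp(x))
# y lies in (0, 1): a logistic trend in x with Normal(0, 1) noise on the logit scale
f(x) = invlogit(0.05 * (x - 50) + rand(Normal(0, 1)))
y = f.(x)

# the probability that y is missing increases with y itself
m(y) = invlogit(0.75 * logit(y))
prob_missing_y = m.(y)
missing_y = Bool.(rand.(Binomial.(1, prob_missing_y)))
xobs = x[.!(missing_y)]
yobs = y[.!(missing_y)]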

p_alldat = scatter(xobs, yobs, xlabel=L"$x$", ylabel=L"$y$", label="Observations", markersize=5, size=(600, 500), ylims=(-0.05, 1))
scatter!(x[missing_y], zeros(sum(missing_y)), markershape=:x, markersize=3, label="Missing Observations")
Figure 1: Missing Data Running Example
Code
linfit = lm([ones(length(xobs)) xobs], yobs)
linpred = predict(linfit, [ones(sum(missing_y)) x[missing_y]])
p1 = deepcopy(p_alldat)
scatter!(p1, x[missing_y], linpred, label="Imputed Values", markersize=5, markershape=:diamond, legend=false)
Figure 2: Missing Data Running Example

Missing Data Example

Code
p_alldat
Figure 3: Missing Data Running Example
Code
repeatpred = sample(yobs, sum(missing_y), replace=true)
p2 = deepcopy(p_alldat)
scatter!(p2, x[missing_y], repeatpred, label="Imputed Values", markersize=5, markershape=:diamond, legend=false)
Figure 4: Missing Data Running Example

Missing Data Example

Code
p_alldat
Figure 5: Missing Data Running Example
Code
p3 = deepcopy(p_alldat)
scatter!(p3, x[missing_y], y[missing_y], label="Imputed Values", markersize=5, markershape=:diamond, legend=false)
Figure 6: Missing Data Running Example

Comparison of Imputations
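
One way to put numbers on the comparison is to score each imputation against the true values, which is only possible because the data are simulated. The sketch below is an assumption-laden stand-in for the running example: it uses the same generating and missingness rules, but a larger sample (n = 2000), an arbitrary seed, a hand-rolled least-squares fit instead of `lm`, and root-mean-square error as the (hypothetical) yardstick.

```julia
using LinearAlgebra, Random, Statistics

logit(p) = log(p / (1 - p))
invlogit(z) = 1 / (1 + exp(-z))
rmse(a, b) = sqrt(mean((a .- b) .^ 2))

rng = MersenneTwister(33)       # arbitrary seed, for reproducibility only
n = 2_000                       # assumption: larger than the lecture's n = 50
x = 100 .* rand(rng, n)
y = invlogit.(0.05 .* (x .- 50) .+ randn(rng, n))

# Same missingness rule as the running example: larger y is more likely missing.
is_missing = rand(rng, n) .< invlogit.(0.75 .* logit.(y))
xobs, yobs = x[.!is_missing], y[.!is_missing]
k = count(is_missing)

# (1) Regression imputation: fit a line to the observed pairs, predict at missing x.
coefs = [ones(length(xobs)) xobs] \ yobs
imputed_reg = coefs[1] .+ coefs[2] .* x[is_missing]

# (2) Resampling imputation: draw observed y values with replacement, ignoring x.
imputed_resample = rand(rng, yobs, k)

rmse_reg = rmse(imputed_reg, y[is_missing])
rmse_resample = rmse(imputed_resample, y[is_missing])
println("RMSE, regression imputation: ", round(rmse_reg, digits=3))
println("RMSE, resampling imputation: ", round(rmse_resample, digits=3))
```

Resampling ignores the relationship with x, so it should score worse here, but neither method recovers the true values: both rely only on observed cases, which under-represent large y.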

Notation

Let \(M_Y\) be the indicator function for whether \(Y\) is missing and let \(\pi(x) = \mathbb{P}(M_Y = 0 | X = x)\) be the inclusion probability.

Goal: Understand the complete-data distribution \(\mathbb{P}(Y=y | X=x)\).

But we only have the observed distribution \(\mathbb{P}(Y = y | X=x, M_Y = 0) \pi(x)\). We are missing \(\mathbb{P}(Y = y | X=x, M_Y = 1) (1-\pi(x))\).
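
These two pieces are exactly the terms of the law of total probability, which is why the missing piece cannot be recovered from the data alone:

\[ \begin{align*} \mathbb{P}(Y=y | X=x) &= \mathbb{P}(Y = y | X=x, M_Y = 0)\, \pi(x) \\ &\quad + \mathbb{P}(Y = y | X=x, M_Y = 1)\, (1-\pi(x)) \end{align*} \]

Any value of the second term is compatible with the observed data, so it has to be pinned down by an assumption about the missingness mechanism.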

Categories of Missingness

Missing Completely At Random (MCAR)

MCAR: \(M_Y\) is independent of \(X=x\) and \(Y=y\).

Complete cases are fully representative of the complete data:

\[\mathbb{P}(Y=y) = \mathbb{P}(Y=y | M_Y=0)\]

Code
flin(x) = 0.25 * x + 2 + rand(Normal(0, 7))
y = flin.(x)
xpred = collect(0:0.1:100)

lm_all = lm([ones(length(x)) x], y)
y_lm_all = predict(lm_all, [ones(length(xpred)) xpred])

missing_y = Bool.(rand(Binomial(1, 0.25), length(y)))
xobs = x[.!(missing_y)]
yobs = y[.!(missing_y)]
lm_mcar = lm([ones(n - sum(missing_y)) xobs], yobs)
y_lm_mcar = predict(lm_mcar, [ones(length(xpred)) xpred])


p1 = scatter(xobs, yobs, xlabel=L"$x$", ylabel=L"$y$", label=false, markersize=5, size=(600, 500), color=:blue)
scatter!(x[missing_y], y[missing_y], alpha=0.9, color=:lightgrey, label=false, markersize=5)
plot!(xpred, y_lm_all, color=:red, lw=3, label="Complete-Data Inference", ribbon=GLM.dispersion(lm_all), fillalpha=0.2)
plot!(xpred, y_lm_mcar, color=:blue, lw=3, linestyle=:dot, label="Observed-Data Inference", ribbon=GLM.dispersion(lm_mcar), fillalpha=0.2)
Figure 7: Illustration of MCAR Data

Missing At Random (MAR)

MAR: \(M_Y\) is independent of \(Y=y\) conditional on \(X=x\).

Also called ignorable or uninformative missingness.

\[ \begin{align*} \mathbb{P}&(Y=y | X=x) \\ &= \mathbb{P}(Y=y | X=x, M_Y=0) \end{align*} \]
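
A short derivation with Bayes' rule shows why the equality holds. Under MAR, \(\mathbb{P}(M_Y=0 | X=x, Y=y) = \pi(x)\) does not depend on \(y\), so (assuming \(\pi(x) > 0\))

\[ \begin{align*} \mathbb{P}(Y=y | X=x, M_Y=0) &= \frac{\mathbb{P}(M_Y=0 | X=x, Y=y)\, \mathbb{P}(Y=y | X=x)}{\mathbb{P}(M_Y=0 | X=x)} \\ &= \frac{\pi(x)\, \mathbb{P}(Y=y | X=x)}{\pi(x)} \\ &= \mathbb{P}(Y=y | X=x) \end{align*} \]

The equality is conditional on \(X=x\). The marginal distribution of \(Y\) among complete cases can still differ from \(\mathbb{P}(Y=y)\), because the observed \(x\) values are themselves a selected sample.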

Code
missing_y = Bool.(rand.(Binomial.(1,  invlogit.(0.1 * (x .- 75)))))
xobs = x[.!(missing_y)]
yobs = y[.!(missing_y)]
lm_mar = lm([ones(n - sum(missing_y)) xobs], yobs)
y_lm_mar = predict(lm_mar, [ones(length(xpred)) xpred])

p2 = scatter(xobs, yobs, xlabel=L"$x$", ylabel=L"$y$", label=false, markersize=5, size=(600, 500), color=:blue)
scatter!(x[missing_y], y[missing_y], alpha=0.9, color=:lightgrey, label=false, markersize=5)
plot!(xpred, y_lm_all, color=:red, lw=3, label="Complete-Data Inference", ribbon=GLM.dispersion(lm_all), fillalpha=0.2)
plot!(xpred, y_lm_mar, color=:blue, lw=3, linestyle=:dot, label="Observed-Data Inference", ribbon=GLM.dispersion(lm_mar), fillalpha=0.2)
Figure 8: Illustration of MAR Data

Missing Not At Random (MNAR)

MNAR: \(M_Y\) is dependent on \(Y=y\) (and/or unmodeled variables).

Also called non-ignorable or informative missingness.

\[ \begin{align*} \mathbb{P}&(Y=y | X=x) \\ &\neq \mathbb{P}(Y=y | X=x, M_Y=0) \end{align*} \]
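
Expanding the observed conditional distribution with Bayes' rule shows where the distortion comes from. Write \(\pi(x, y) = \mathbb{P}(M_Y=0 | X=x, Y=y)\) for the inclusion probability given both variables (notation introduced here for illustration), and assume \(\pi(x) > 0\):

\[ \mathbb{P}(Y=y | X=x, M_Y=0) = \frac{\pi(x, y)}{\pi(x)}\, \mathbb{P}(Y=y | X=x) \]

When inclusion depends on \(y\), the ratio \(\pi(x, y) / \pi(x)\) is not identically 1, so values of \(y\) with low inclusion probability are under-represented among the complete cases.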

Code
missing_y = Bool.(rand.(Binomial.(1,  invlogit.(0.9 * (y .- 15)))))
xobs = x[.!(missing_y)]
yobs = y[.!(missing_y)]
lm_mnar = lm([ones(n - sum(missing_y)) xobs], yobs)
y_lm_mnar = predict(lm_mnar, [ones(length(xpred)) xpred])

p2 = scatter(xobs, yobs, xlabel=L"$x$", ylabel=L"$y$", label=false, markersize=5, size=(600, 500), color=:blue)
scatter!(x[missing_y], y[missing_y], alpha=0.9, color=:lightgrey, label=false, markersize=5)
plot!(xpred, y_lm_all, color=:red, lw=3, label="Complete-Data Inference", ribbon=GLM.dispersion(lm_all), fillalpha=0.2)
plot!(xpred, y_lm_mnar, color=:blue, lw=3, linestyle=:dot, label="Observed-Data Inference", ribbon=GLM.dispersion(lm_mnar), fillalpha=0.2)
Figure 9: Illustration of MNAR Data

Implications of Missingness Mechanism

  1. MCAR: A strong assumption that is generally implausible. If it holds, complete-case analysis is justified because the observed data is fully representative of the complete data.
  2. MAR: More plausible than MCAR. Complete-case analysis can still be justified for inference conditional on \(X\), since the conditional observed distributions match the conditional complete-data distributions; marginal summaries of \(Y\) can still be biased.
  3. MNAR: Deletion is a bad idea because the observed data does not follow the same conditional distribution as the complete data. The missingness is itself informative, so try to model the missingness mechanism.
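
These implications can be checked numerically. The sketch below is a simplified variant of the lecture's simulations, with some assumptions of its own: it reuses the linear model and the three missingness rules from the illustrations above, but with a much larger sample (n = 5000), an arbitrary seed, and a hand-rolled least-squares slope (`ols_slope`, a hypothetical helper) in place of `lm`.

```julia
using LinearAlgebra, Random, Statistics

invlogit(z) = 1 / (1 + exp(-z))

# Least-squares slope of y on x (with intercept), solved with the backslash operator.
ols_slope(x, y) = ([ones(length(x)) x] \ y)[2]

rng = MersenneTwister(1)        # arbitrary seed, for reproducibility only
n = 5_000                       # assumption: much larger than the lecture's n = 50
x = 100 .* rand(rng, n)         # x ~ Uniform(0, 100)
y = 0.25 .* x .+ 2 .+ 7 .* randn(rng, n)

# Probability that each y is missing under the three mechanisms.
p_miss = (
    MCAR = fill(0.25, n),                   # constant
    MAR = invlogit.(0.1 .* (x .- 75)),      # depends on x only
    MNAR = invlogit.(0.9 .* (y .- 15)),     # depends on y itself
)

results = Dict{Symbol,NamedTuple}()
for (name, p) in pairs(p_miss)
    observed = rand(rng, n) .>= p           # y is missing when the draw falls below p
    results[name] = (slope = ols_slope(x[observed], y[observed]),
                     mean_y = mean(y[observed]))
end

println("complete: slope = ", round(ols_slope(x, y), digits=3),
        ", mean(y) = ", round(mean(y), digits=2))
for name in (:MCAR, :MAR, :MNAR)
    r = results[name]
    println(name, ": slope = ", round(r.slope, digits=3),
            ", mean(y) = ", round(r.mean_y, digits=2))
end
```

Under these assumptions, the complete-case slopes for MCAR and MAR should land near the true value of 0.25, while the MNAR slope is attenuated. The complete-case mean of y drifts under MAR as well as MNAR, which reflects that MAR protects only inference conditional on x.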

Key Points and Upcoming Schedule

Key Points

  • Missing data is very common in environmental contexts.
  • The ability to draw unbiased inferences depends on whether the missingness is MCAR, MAR, or MNAR (informative).
  • Best approach to missing data is to not have any.

Upcoming Schedule

Wednesday: More on Missing Data

Friday: Finishing Missing Data

Assessments

HW6 released, due 5/1.

Quiz 4: Due 5/1

References

Little, R. J. (2013). In praise of simplicity not mathematistry! Ten simple powerful ideas for the statistical scientist. J. Am. Stat. Assoc., 108, 359–369. https://doi.org/10.1080/01621459.2013.787932