More Missing Data


Lecture 34

April 22, 2026

Review

Missing Data Notation

Let \(M_Y\) be the indicator function for whether \(Y\) is missing and let \(\pi(x) = \mathbb{P}(M_Y = 0 | X = x)\) be the inclusion probability.

Goal: Understand the complete-data distribution \(\mathbb{P}(Y=y | X=x)\).

But we only have the observed distribution \(\mathbb{P}(Y = y | X=x, M_Y = 0) \pi(x)\). We are missing \(\mathbb{P}(Y = y | X=x, M_Y = 1) (1-\pi(x))\).
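
By the law of total probability, these two pieces sum to the complete-data distribution. Writing it out makes explicit which term the data cannot inform (the "observed" and "missing" labels are added here for illustration):

\[ \mathbb{P}(Y=y | X=x) = \underbrace{\mathbb{P}(Y=y | X=x, M_Y=0)\,\pi(x)}_{\text{observed}} + \underbrace{\mathbb{P}(Y=y | X=x, M_Y=1)\,(1-\pi(x))}_{\text{missing}} \]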

Missingness Completely At Random (MCAR)

MCAR: \(M_Y\) is independent of \(X=x\) and \(Y=y\).

Complete cases are fully representative of the complete data:

\[\mathbb{P}(Y=y) = \mathbb{P}(Y=y | M_Y=0)\]

Code
using Distributions, GLM, Plots, LaTeXStrings

n = 50  # sample size
x = rand(Uniform(0, 100), n)
logit(x) = log(x / (1 - x))
invlogit(x) = exp(x) / (1 + exp(x))  # maps log-odds to probabilities
f(x) = invlogit(0.05 * (x - 50) + rand(Normal(0, 1)))

flin(x) = 0.25 * x + 2 + rand(Normal(0, 7))
y = flin.(x)
xpred = collect(0:0.1:100)

lm_all = lm([ones(length(x)) x], y)
y_lm_all = predict(lm_all, [ones(length(xpred)) xpred])

# MCAR: each y is missing with probability 0.25, independent of x and y
missing_y = rand(Bernoulli(0.25), n)
xobs = x[.!(missing_y)]
yobs = y[.!(missing_y)]
lm_mcar = lm([ones(n - sum(missing_y)) xobs], yobs)
y_lm_mcar = predict(lm_mcar, [ones(length(xpred)) xpred])


p1 = scatter(xobs, yobs, xlabel=L"$x$", ylabel=L"$y$", label=false, markersize=5, size=(600, 500), color=:blue)
scatter!(x[missing_y], y[missing_y], alpha=0.9, color=:lightgrey, label=false, markersize=5)
plot!(xpred, y_lm_all, color=:red, lw=3, label="Complete-Data Inference", ribbon=GLM.dispersion(lm_all), fillalpha=0.2)
plot!(xpred, y_lm_mcar, color=:blue, lw=3, linestyle=:dot, label="Observed-Data Inference", ribbon=GLM.dispersion(lm_mcar), fillalpha=0.2)
Figure 1: Illustration of MCAR Data

Missingness At Random (MAR)

MAR: \(M_Y\) is independent of \(Y=y\) conditional on \(X=x\).

Also called ignorable or uninformative missingness.

\[ \begin{align*} \mathbb{P}&(Y=y | X=x) \\ &= \mathbb{P}(Y=y | X=x, M_Y=0) \end{align*} \]
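
One way to see why this equality follows from the MAR assumption is Bayes' rule. Under MAR, \(\mathbb{P}(M_Y=0 | X=x, Y=y) = \pi(x)\), so (assuming \(\pi(x) > 0\)):

\[ \begin{align*} \mathbb{P}(Y=y | X=x, M_Y=0) &= \frac{\mathbb{P}(M_Y=0 | X=x, Y=y)\,\mathbb{P}(Y=y | X=x)}{\mathbb{P}(M_Y=0 | X=x)} \\ &= \frac{\pi(x)\,\mathbb{P}(Y=y | X=x)}{\pi(x)} = \mathbb{P}(Y=y | X=x). \end{align*} \]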

Code
missing_y = Bool.(rand.(Binomial.(1,  invlogit.(0.1 * (x .- 75)))))
xobs = x[.!(missing_y)]
yobs = y[.!(missing_y)]
lm_mar = lm([ones(n - sum(missing_y)) xobs], yobs)
y_lm_mar = predict(lm_mar, [ones(length(xpred)) xpred])

p2 = scatter(xobs, yobs, xlabel=L"$x$", ylabel=L"$y$", label=false, markersize=5, size=(600, 500), color=:blue)
scatter!(x[missing_y], y[missing_y], alpha=0.9, color=:lightgrey, label=false, markersize=5)
plot!(xpred, y_lm_all, color=:red, lw=3, label="Complete-Data Inference", ribbon=GLM.dispersion(lm_all), fillalpha=0.2)
plot!(xpred, y_lm_mar, color=:blue, lw=3, linestyle=:dot, label="Observed-Data Inference", ribbon=GLM.dispersion(lm_mar), fillalpha=0.2)
Figure 2: Illustration of MAR Data

Missingness Not-At-Random (MNAR)

MNAR: \(M_Y\) is dependent on \(Y=y\) (and/or unmodeled variables).

Also called non-ignorable or informative missingness.

\[ \begin{align*} \mathbb{P}&(Y=y | X=x) \\ &\neq \mathbb{P}(Y=y | X=x, M_Y=0) \end{align*} \]

Code
missing_y = Bool.(rand.(Binomial.(1,  invlogit.(0.9 * (y .- 15)))))
xobs = x[.!(missing_y)]
yobs = y[.!(missing_y)]
lm_mnar = lm([ones(n - sum(missing_y)) xobs], yobs)
y_lm_mnar = predict(lm_mnar, [ones(length(xpred)) xpred])

p3 = scatter(xobs, yobs, xlabel=L"$x$", ylabel=L"$y$", label=false, markersize=5, size=(600, 500), color=:blue)
scatter!(x[missing_y], y[missing_y], alpha=0.9, color=:lightgrey, label=false, markersize=5)
plot!(xpred, y_lm_all, color=:red, lw=3, label="Complete-Data Inference", ribbon=GLM.dispersion(lm_all), fillalpha=0.2)
plot!(xpred, y_lm_mnar, color=:blue, lw=3, linestyle=:dot, label="Observed-Data Inference", ribbon=GLM.dispersion(lm_mnar), fillalpha=0.2)
Figure 3: Illustration of MNAR Data

Implications of Missingness Mechanism

  1. MCAR: A strong assumption that is generally implausible. If it holds, analyzing only the complete cases is valid, since the observed data are fully representative.
  2. MAR: More plausible than MCAR. Can use multiple imputation, and sometimes complete cases when enough data are available.
  3. MNAR: Deletion is a bad idea. The observed data does not follow the same conditional distribution. Missingness can be informative: need to model the missingness mechanism when possible.

Checking Assumptions About Missingness

Checking MCAR

In general, we can’t know for sure if missingness \(M_Y\) is informative about \(Y\) (since we can’t see the missing values!).

But we can check if \(M_Y\) is independent of \(X\): if not, reject MCAR.

Can we conclude MCAR if, in our dataset, \(M_Y\) appears independent of \(X\)?
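
A minimal sketch of one such check, using only Julia's standard library: a permutation test on the difference in mean \(x\) between missing and observed cases. The function name `permutation_pvalue`, the test statistic, the seed, and the simulated data are illustrative assumptions, not part of the lecture code; the missingness mechanisms mirror the MCAR and MAR examples above.

```julia
using Random, Statistics

# Permutation test for dependence between a missingness indicator and x.
# Statistic: absolute difference in mean x between missing and observed cases.
function permutation_pvalue(x, missing_y; nperm=2000, rng=Random.default_rng())
    stat(m) = abs(mean(x[m]) - mean(x[.!m]))
    observed = stat(missing_y)
    hits = 0
    for _ in 1:nperm
        # Shuffling the indicator breaks any link between missingness and x
        hits += stat(shuffle(rng, missing_y)) >= observed
    end
    return (hits + 1) / (nperm + 1)
end

rng = MersenneTwister(1)
n = 500
x = 100 .* rand(rng, n)
invlogit(z) = 1 / (1 + exp(-z))

m_mcar = rand(rng, n) .< 0.25                        # missingness ignores x
m_mar = rand(rng, n) .< invlogit.(0.1 .* (x .- 75))  # missingness depends on x

p_mcar = permutation_pvalue(x, m_mcar; rng=rng)
p_mar = permutation_pvalue(x, m_mar; rng=rng)
println("p-value (MCAR mechanism): ", p_mcar)
println("p-value (MAR mechanism):  ", p_mar)
```

A small p-value is evidence against MCAR. A large one would not establish MCAR, since this check only looks at differences in mean \(x\) and says nothing about dependence on the unobserved \(Y\).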

Distinguishing MAR from MNAR

Can’t do this statistically!

MAR: \(\mathbb{P}(Y=y | X=x, M_Y=1) = \mathbb{P}(Y=y | X=x, M_Y=0)\)

But the data tells us nothing about \(\mathbb{P}(Y=y | X=x, M_Y=1)\). Need to bring to bear understanding of data-collection process.

Instead, try a few different models reflecting different assumptions about missingness: do your conclusions change?

Methods for Dealing with Missing Data

  1. Imputation: substitute values for missing data before analysis;
  2. Averaging: find expected values over all possible values of the missing variables.

Imputation

Imputation does not create “new” information; it reuses existing information to allow the use of standard procedures.

Example: Missing observations in a time series, want to insert values to fit AR(1) model or estimate autocorrelation using “simple” estimators.

Imputation is therefore convenient, but it can create systematic distortions.

Imputation Under MCAR

  • Impute from the marginal distribution (parametrically or non-parametrically), \[p(Y_\text{miss}) = p(Y_\text{obs}).\] This can create distortions if meaningful relationships are neglected.
  • Impute using a regression model (such as linear imputation). This generalizes relationships but requires missingness being uninformative about \(Y\).
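
The two approaches above might be sketched as follows, with missingness generated completely at random for simplicity. The simulated data reuse the linear model from the earlier examples; the seed, the sample size, and the names `y_marg` and `y_reg` are illustrative assumptions.

```julia
using Random, Statistics

rng = MersenneTwister(3)
n = 1000
x = 100 .* rand(rng, n)
y = 0.25 .* x .+ 2 .+ 7 .* randn(rng, n)

miss = rand(rng, n) .< 0.25   # each y missing with probability 0.25
obs = .!miss
yobs = y[obs]

# Non-parametric marginal imputation: resample observed y values, ignoring x
y_marg = copy(y)
y_marg[miss] .= rand(rng, yobs, count(miss))

# Regression imputation: predict from a least-squares line fit to complete cases
β = [ones(count(obs)) x[obs]] \ yobs
y_reg = copy(y)
y_reg[miss] .= β[1] .+ β[2] .* x[miss]

cor_obs = cor(x[obs], yobs)
cor_marg = cor(x, y_marg)
cor_reg = cor(x, y_reg)
println("cor(x, y), complete cases:      ", cor_obs)
println("cor(x, y), marginal imputation: ", cor_marg)
println("cor(x, y), regression imputation: ", cor_reg)
```

In this sketch the marginal draws ignore \(x\), so the correlation between \(x\) and \(y\) is attenuated. Deterministic regression imputation places every imputed point exactly on the fitted line, which tends to overstate the strength of the relationship.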

Imputation Under MAR

  • Impute from the conditional distribution, \[p(Y_\text{miss} | X = x) = p(Y_\text{obs} | X = x).\] Can be done parametrically or non-parametrically.
  • Impute using matching: find the observed case with the closest predictor value and copy its value of \(Y\). Can work okay or be a terrible idea.
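
A rough sketch of both ideas, assuming the MAR mechanism and linear model from the earlier examples. The parametric version draws from a fitted normal linear model; the matching version copies \(Y\) from the nearest observed \(x\). The seed and the names `y_cond`, `y_match`, and `nearest` are hypothetical.

```julia
using Random, Statistics

rng = MersenneTwister(4)
n = 300
x = 100 .* rand(rng, n)
y = 0.25 .* x .+ 2 .+ 7 .* randn(rng, n)
invlogit(z) = 1 / (1 + exp(-z))

miss = rand(rng, n) .< invlogit.(0.1 .* (x .- 75))  # MAR: depends on x only
obs = .!miss
xo, yo = x[obs], y[obs]

# Fit y ~ x on complete cases and estimate the residual standard deviation
X = [ones(length(xo)) xo]
β = X \ yo
resid = yo .- X * β
σ = sqrt(sum(abs2, resid) / (length(xo) - 2))

# Parametric conditional imputation: draw from N(β₁ + β₂x, σ²)
y_cond = copy(y)
y_cond[miss] .= β[1] .+ β[2] .* x[miss] .+ σ .* randn(rng, count(miss))

# Matching: copy y from the observed case with the nearest x
nearest(x0) = yo[argmin(abs.(xo .- x0))]
y_match = copy(y)
y_match[miss] .= nearest.(x[miss])

println("mean(y), conditional draws: ", mean(y_cond))
println("mean(y), matching:          ", mean(y_match))
```

Because missingness here is concentrated at large \(x\), both versions extrapolate: the fitted line is extended beyond most of the observed data, and matching reuses the few observed cases with large \(x\) many times. This is one way matching can go wrong.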

Imputation Under MNAR

Need to model the missingness mechanism (censoring, etc.).

Often need to make assumptions about how the relationship extrapolates.

  • Model relationship between predictors and missing data;
  • Add unknown constant to imputed data to reflect biases.

Ultimately, MNAR requires a sensitivity analysis.
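
As a hedged illustration of the "unknown constant" idea, the sketch below imputes from a complete-case regression, shifts the imputed values by an offset \(\delta\), and tracks how the estimated mean of \(Y\) responds. The MNAR mechanism matches the earlier example; the function name `delta_adjusted_mean`, the seed, and the range of \(\delta\) are assumptions chosen for illustration.

```julia
using Random, Statistics

rng = MersenneTwister(2)
n = 200
x = 100 .* rand(rng, n)
y = 0.25 .* x .+ 2 .+ 7 .* randn(rng, n)
invlogit(z) = 1 / (1 + exp(-z))

miss = rand(rng, n) .< invlogit.(0.9 .* (y .- 15))  # MNAR: large y tends to be missing
obs = .!miss

# Complete-case regression (biased under MNAR, hence the adjustment below)
β = [ones(count(obs)) x[obs]] \ y[obs]

# Impute missing y from the regression line, shifted by an assumed bias δ
function delta_adjusted_mean(δ, x, y, miss, β)
    yimp = copy(y)   # entries at `miss` are overwritten; in practice they are unknown
    yimp[miss] .= β[1] .+ β[2] .* x[miss] .+ δ
    return mean(yimp)
end

deltas = 0:2:10
means = [delta_adjusted_mean(δ, x, y, miss, β) for δ in deltas]
for (δ, m) in zip(deltas, means)
    println("δ = ", δ, "  estimated mean of y = ", m)
end
```

If the substantive conclusion changes across plausible values of \(\delta\), the analysis is sensitive to the MNAR assumption, and that sensitivity should be reported alongside the results.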

Key Points and Upcoming Schedule

Key Points

  • Best approach to missing data is to not have any.
  • Otherwise, try multiple imputation based on understanding/theories of the missingness mechanism.
  • Use as much information as possible when conducting multiple imputation.
  • Incorporate as much uncertainty as possible to avoid biasing downstream results: we don’t know what the missing data looks like!

Upcoming Schedule

Friday: Finish Missing Data. Class in 205 Riley-Robb

Next Week: Complete Poll on Ed.

Assessments

HW6 released, due 5/1.

Quiz 4: Due 5/1
