More Missing Data

Lecture 34

April 22, 2026

Review

Missing Data Notation

Let \(M_Y\) be the indicator function for whether \(Y\) is missing and let \(\pi(x) = \mathbb{P}(M_Y = 0 | X = x)\) be the inclusion probability.

Goal: Understand the complete-data distribution \(\mathbb{P}(Y=y | X=x)\).

But we only have the observed distribution \(\mathbb{P}(Y = y | X=x, M_Y = 0) \pi(x)\). We are missing \(\mathbb{P}(Y = y | X=x, M_Y = 1) (1-\pi(x))\).
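
Written out with the law of total probability, using only the symbols defined above, the complete-data distribution is the sum of the observed piece and the missing piece:

\[ \begin{align*} \mathbb{P}(Y=y | X=x) &= \mathbb{P}(Y=y | X=x, M_Y=0)\, \pi(x) \\ &\quad + \mathbb{P}(Y=y | X=x, M_Y=1)\, (1-\pi(x)) \end{align*} \]

The data identify the first term and \(\pi(x)\); any statement about the second term has to come from an assumption about the missingness mechanism.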

Missingness Completely At Random (MCAR)

MCAR: \(M_Y\) is independent of \(X=x\) and \(Y=y\).

Complete cases are fully representative of the complete data:

\[\mathbb{P}(Y=y) = \mathbb{P}(Y=y | M_Y=0)\]

Code

using Distributions, GLM, Plots, LaTeXStrings

n = 50
x = rand(Uniform(0, 100), n)
logit(x) = log(x / (1 - x))
invlogit(x) = exp(x) / (1 + exp(x))
f(x) = invlogit(0.05 * (x - 50) + rand(Normal(0, 1)))

# true relationship: linear trend plus Gaussian noise
flin(x) = 0.25 * x + 2 + rand(Normal(0, 7))
y = flin.(x)
xpred = collect(0:0.1:100)

# regression on the complete data (before any values go missing)
lm_all = lm([ones(length(x)) x], y)
y_lm_all = predict(lm_all, [ones(length(xpred)) xpred])

# MCAR: each y is missing with probability 0.25, independent of x and y
missing_y = rand(Bernoulli(0.25), length(y))
xobs = x[.!(missing_y)]
yobs = y[.!(missing_y)]
lm_mcar = lm([ones(n - sum(missing_y)) xobs], yobs)
y_lm_mcar = predict(lm_mcar, [ones(length(xpred)) xpred])


p1 = scatter(xobs, yobs, xlabel=L"$x$", ylabel=L"$y$", label=false, markersize=5, size=(600, 500), color=:blue)
scatter!(x[missing_y], y[missing_y], alpha=0.9, color=:lightgrey, label=false, markersize=5)
plot!(xpred, y_lm_all, color=:red, lw=3, label="Complete-Data Inference", ribbon=GLM.dispersion(lm_all), fillalpha=0.2)
plot!(xpred, y_lm_mcar, color=:blue, lw=3, linestyle=:dot, label="Observed-Data Inference", ribbon=GLM.dispersion(lm_mcar), fillalpha=0.2)

Missingness At Random (MAR)

MAR: \(M_Y\) is independent of \(Y=y\) conditional on \(X=x\).

Also called ignorable or uninformative missingness.

\[ \begin{align*} \mathbb{P}&(Y=y | X=x) \\ &= \mathbb{P}(Y=y | X=x, M_Y=0) \end{align*} \]

Code

# MAR: probability that y is missing depends only on x (rises sharply above x = 75)
missing_y = rand.(Bernoulli.(invlogit.(0.1 .* (x .- 75))))
xobs = x[.!(missing_y)]
yobs = y[.!(missing_y)]
lm_mar = lm([ones(n - sum(missing_y)) xobs], yobs)
y_lm_mar = predict(lm_mar, [ones(length(xpred)) xpred])

p2 = scatter(xobs, yobs, xlabel=L"$x$", ylabel=L"$y$", label=false, markersize=5, size=(600, 500), color=:blue)
scatter!(x[missing_y], y[missing_y], alpha=0.9, color=:lightgrey, label=false, markersize=5)
plot!(xpred, y_lm_all, color=:red, lw=3, label="Complete-Data Inference", ribbon=GLM.dispersion(lm_all), fillalpha=0.2)
plot!(xpred, y_lm_mar, color=:blue, lw=3, linestyle=:dot, label="Observed-Data Inference", ribbon=GLM.dispersion(lm_mar), fillalpha=0.2)

Missingness Not-At-Random (MNAR)

MNAR: \(M_Y\) is dependent on \(Y=y\) (and/or unmodeled variables).

Also called non-ignorable or informative missingness.

\[ \begin{align*} \mathbb{P}&(Y=y | X=x) \\ &\neq \mathbb{P}(Y=y | X=x, M_Y=0) \end{align*} \]

Code

# MNAR: probability that y is missing depends on y itself (rises sharply above y = 15)
missing_y = rand.(Bernoulli.(invlogit.(0.9 .* (y .- 15))))
xobs = x[.!(missing_y)]
yobs = y[.!(missing_y)]
lm_mnar = lm([ones(n - sum(missing_y)) xobs], yobs)
y_lm_mnar = predict(lm_mnar, [ones(length(xpred)) xpred])

p3 = scatter(xobs, yobs, xlabel=L"$x$", ylabel=L"$y$", label=false, markersize=5, size=(600, 500), color=:blue)
scatter!(x[missing_y], y[missing_y], alpha=0.9, color=:lightgrey, label=false, markersize=5)
plot!(xpred, y_lm_all, color=:red, lw=3, label="Complete-Data Inference", ribbon=GLM.dispersion(lm_all), fillalpha=0.2)
plot!(xpred, y_lm_mnar, color=:blue, lw=3, linestyle=:dot, label="Observed-Data Inference", ribbon=GLM.dispersion(lm_mnar), fillalpha=0.2)

Implications of Missingness Mechanism

MCAR: A strong assumption that is generally implausible. If it holds, the complete cases alone can be used, since the observed data is fully representative.
MAR: More plausible than MCAR. Can use multiple imputation, and sometimes complete cases when enough data is available.
MNAR: Deletion is a bad idea, since the observed data does not follow the same conditional distribution as the missing data. Missingness can be informative: model the missingness mechanism when possible.
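
To make these implications concrete, here is a minimal sketch that fits a complete-case regression under each mechanism. It reuses the linear model and missingness probabilities from the earlier examples, but the larger sample size, the seed, and the helper `ols` are assumptions made for illustration, and it uses only the Julia standard library rather than GLM.

```julia
using Random, Statistics

Random.seed!(1)  # assumed seed, only for reproducibility

invlogit(z) = 1 / (1 + exp(-z))

# Ordinary least squares for y = a + b * x; returns (intercept, slope).
function ols(x, y)
    β = [ones(length(x)) x] \ y
    return β[1], β[2]
end

n = 5_000  # larger than the n = 50 used for plotting, to reduce sampling noise
x = 100 .* rand(n)
y = 0.25 .* x .+ 2 .+ 7 .* randn(n)

# Missingness indicators (true = y is missing) under each mechanism.
masks = Dict(
    "MCAR" => rand(n) .< 0.25,
    "MAR"  => rand(n) .< invlogit.(0.1 .* (x .- 75)),
    "MNAR" => rand(n) .< invlogit.(0.9 .* (y .- 15)),
)

# Complete-case slope and mean of y under each mechanism.
results = Dict{String, NamedTuple}()
for (name, m) in masks
    obs = .!m
    a, b = ols(x[obs], y[obs])
    results[name] = (slope = b, mean_y = mean(y[obs]), frac_missing = mean(m))
end

println("Full  slope = ", round(ols(x, y)[2], digits=3),
        "  mean(y) = ", round(mean(y), digits=2))
for name in ("MCAR", "MAR", "MNAR")
    r = results[name]
    println(rpad(name, 5), " slope = ", round(r.slope, digits=3),
            "  mean(y_obs) = ", round(r.mean_y, digits=2),
            "  missing = ", round(r.frac_missing, digits=2))
end
```

In this setup the complete-case slope should stay near 0.25 under MCAR and MAR, while the complete-case mean of \(y\) drifts under MAR (high-\(x\) cases are dropped) and both the slope and the mean are distorted under MNAR.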

Checking Assumptions About Missingness

Checking MCAR

In general, we can’t know for sure if missingness \(M_Y\) is informative about \(Y\) (since we can’t see it!).

But we can check if \(M_Y\) is independent of \(X\): if not, reject MCAR.

Can we conclude MCAR if, in our dataset, \(M_Y\) appears independent of \(X\)?
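
One simple way to check whether \(M_Y\) depends on \(X\) is to compare the distribution of \(x\) between the missing and observed groups. The sketch below uses a Welch two-sample t statistic on the group means; the choice of statistic, the helper `welch_t`, the sample size, and the seed are all assumptions for illustration (a logistic regression of \(M_Y\) on \(X\) would be another option).

```julia
using Random, Statistics

Random.seed!(2)  # assumed seed, only for reproducibility

invlogit(z) = 1 / (1 + exp(-z))

# Welch two-sample t statistic for the difference in means of a and b.
function welch_t(a, b)
    se = sqrt(var(a) / length(a) + var(b) / length(b))
    return (mean(a) - mean(b)) / se
end

n = 2_000
x = 100 .* rand(n)

# Missingness masks (true = y is missing).
m_mcar = rand(n) .< 0.25                           # independent of x
m_mar  = rand(n) .< invlogit.(0.1 .* (x .- 75))    # depends on x

# Compare x between cases with missing y and cases with observed y.
t_mcar = welch_t(x[m_mcar], x[.!m_mcar])
t_mar  = welch_t(x[m_mar], x[.!m_mar])

println("MCAR mask: t = ", round(t_mcar, digits=2))
println("MAR mask:  t = ", round(t_mar, digits=2))
```

A large statistic is evidence against MCAR. A small one only fails to reject it: the missingness could still depend on \(Y\), which this check cannot see.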

Distinguishing MAR from MNAR

Can’t do this statistically!

MAR: \(\mathbb{P}(Y=y | X=x, M_Y=1) = \mathbb{P}(Y=y | X=x, M_Y=0)\)

But the data tells us nothing about \(\mathbb{P}(Y=y | X=x, M_Y=1)\). We need to bring our understanding of the data-collection process to bear.

Instead, try a few different models reflecting different assumptions about missingness: do your conclusions change?

Methods for Dealing with Missing Data

Imputation: substitute values for missing data before analysis;
Averaging: find expected values over all possible values of the missing variables.

Imputation

Imputation does not create “new” information, it reuses existing information to allow the use of standard procedures.

Example: Missing observations in a time series, want to insert values to fit AR(1) model or estimate autocorrelation using “simple” estimators.

As a result, it’s convenient but can create systematic distortions.

Imputation Under MCAR

Impute from the marginal distribution (parametrically or non-parametrically), \[p(Y_\text{miss}) = p(Y_\text{obs}).\] This can create distortions if meaningful relationships are neglected.
Impute using a regression model (such as linear imputation). This generalizes relationships but requires missingness to be uninformative about \(Y\).
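
The distortion from marginal imputation can be seen in a small simulation. This is a sketch under assumed settings (sample size, 40% missingness, seed): missing values of \(y\) are filled by resampling the observed \(y\) values, which ignores \(x\) entirely.

```julia
using Random, Statistics

Random.seed!(3)  # assumed seed, only for reproducibility

n = 2_000
x = 100 .* rand(n)
y = 0.25 .* x .+ 2 .+ 7 .* randn(n)

miss = rand(n) .< 0.4      # MCAR mask, true = y is missing
yobs = y[.!miss]

# Non-parametric marginal imputation: resample observed y with replacement.
y_imp = copy(y)
y_imp[miss] = rand(yobs, count(miss))

cor_full = cor(x, y)       # correlation in the complete data
cor_imp  = cor(x, y_imp)   # correlation after marginal imputation

println("mean(y): full = ", round(mean(y), digits=2),
        ", imputed = ", round(mean(y_imp), digits=2))
println("cor(x, y): full = ", round(cor_full, digits=3),
        ", imputed = ", round(cor_imp, digits=3))
```

The marginal mean is roughly preserved, but the correlation between \(x\) and \(y\) is attenuated because the imputed values carry no information about \(x\).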

Imputation Under MAR

Impute from the conditional distribution, \[p(Y_\text{miss} | X = x) = p(Y_\text{obs} | X = x).\] Can be done parametrically or non-parametrically.
Impute using matching: find the observed case with the closest predictor value and copy its value of \(Y\). Can work okay or be a terrible idea.
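
Both approaches can be sketched on the simulated MAR example. The code below is illustrative only: the parametric version assumes a linear model with Gaussian noise fitted to the observed cases, the matching version uses the single nearest neighbor in \(x\), and the sample size and seed are assumptions.

```julia
using Random, Statistics

Random.seed!(4)  # assumed seed, only for reproducibility

invlogit(z) = 1 / (1 + exp(-z))

n = 2_000
x = 100 .* rand(n)
y = 0.25 .* x .+ 2 .+ 7 .* randn(n)
miss = rand(n) .< invlogit.(0.1 .* (x .- 75))   # MAR mask, true = y is missing
obs = .!miss

# Parametric conditional imputation: fit y ~ x on the observed cases,
# then draw each missing y from the fitted conditional distribution.
X = [ones(count(obs)) x[obs]]
β = X \ y[obs]
resid = y[obs] .- X * β
σ = sqrt(sum(abs2, resid) / (count(obs) - 2))   # residual standard deviation

y_reg = copy(y)
y_reg[miss] = β[1] .+ β[2] .* x[miss] .+ σ .* randn(count(miss))

# Matching: copy y from the observed case with the closest x.
xo, yo = x[obs], y[obs]
y_match = copy(y)
y_match[miss] = [yo[argmin(abs.(xo .- xi))] for xi in x[miss]]

println("mean(y): full = ", round(mean(y), digits=2),
        ", complete cases = ", round(mean(y[obs]), digits=2),
        ", regression = ", round(mean(y_reg), digits=2),
        ", matching = ", round(mean(y_match), digits=2))
```

Drawing from the conditional distribution, rather than plugging in the fitted mean, keeps the imputed values from being artificially smooth. Matching reuses a small number of donors where observed cases are sparse (here, at large \(x\)), which is one way it can go wrong.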

Imputation Under MNAR

Need to model the missingness mechanism (censoring, etc.).

Often need to make assumptions about how the relationship extrapolates.

Model relationship between predictors and missing data;
Add unknown constant to imputed data to reflect biases.

Ultimately, MNAR requires a sensitivity analysis.
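
One way to set up such a sensitivity analysis is to impute under a regression fitted to the observed cases and then add an unknown constant to the imputed values, repeating the analysis over a range of offsets. In this sketch the offset `δ`, the grid of values, the helper `imputed_mean`, the sample size, and the seed are all assumptions for illustration.

```julia
using Random, Statistics

Random.seed!(5)  # assumed seed, only for reproducibility

invlogit(z) = 1 / (1 + exp(-z))

n = 2_000
x = 100 .* rand(n)
y = 0.25 .* x .+ 2 .+ 7 .* randn(n)
miss = rand(n) .< invlogit.(0.9 .* (y .- 15))   # MNAR mask, true = y is missing
obs = .!miss

# Regression fitted to the observed cases only.
X = [ones(count(obs)) x[obs]]
β = X \ y[obs]
pred = β[1] .+ β[2] .* x

# Estimated mean of y when imputed values are shifted by the offset δ.
imputed_mean(δ) = mean(ifelse.(miss, pred .+ δ, y))

deltas = 0:5:20
means = [imputed_mean(δ) for δ in deltas]

for (δ, m) in zip(deltas, means)
    println("δ = ", δ, "  estimated mean(y) = ", round(m, digits=2))
end
println("true mean(y) = ", round(mean(y), digits=2), " (unknown in practice)")
```

If the conclusion of interest changes materially across plausible offsets, the analysis is sensitive to the MNAR assumption and the offset has to be justified from knowledge of the data-collection process.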

Key Points and Upcoming Schedule

Key Points

Best approach to missing data is to not have any.
Otherwise, try multiple imputation based on an understanding (or theories) of the missingness mechanism.
Use as much information as possible when conducting multiple imputation.
Incorporate as much uncertainty as possible to avoid biasing downstream results: we don’t know what the missing data looks like!

Upcoming Schedule

Friday: Finish Missing Data. Class in 205 Riley-Robb.

Next Week: Complete Poll on Ed.

Assessments

HW6 released, due 5/1.

Quiz 4: Due 5/1
