Mixture Models


Lecture 36

April 27, 2026

Class Logistics

Changes To Grading

  • No literature critique for 5850;
  • Project presentations + peer reviews for everyone to help get feedback prior to report.

Project Presentation

  1. Upload recording of max 10 minute presentation to Canvas by 5/6;
  2. Submit peer reviews to Canvas by 5/8 (will be auto-assigned).
  3. You will be graded on the quality and constructiveness of the peer reviews you provide (templates on website), but not the reviews you receive.

Bimodel Distributions

Mixture Models

What happens when data comes from multiple data-generating processes?

  • Streamflow: drought vs. wet conditions
  • Solar radiation: cloud regimes
  • Precipitation: synoptic-scale weather regimes

Example: Colorado River Streamflow

Code
annual_q = CSV.read("data/streamflow/uc_historical.csv", DataFrame)
annual_q.Year = 1909:2013
annual_q.Upper_Colorado_norm = annual_q.Upper_Colorado / 1e6
p = plot(annual_q.Year, annual_q.Upper_Colorado_norm , marker=:circle, markersize=3, label="Annual Flow", color=:lightblue, ylabel=L"million ft$^3$/yr", xlabel="Year", leftmargin=10mm)
plot!(p, size=(1200, 450))

# compute rolling mean
ll = 11
roll = zeros(nrow(annual_q) - ll)
for i = 1:(nrow(annual_q) - ll)
    wind = i:(i+ll-1)
    roll[i] = mean(annual_q[wind, :Upper_Colorado_norm])
end
# plot rolling mean
plot!(p, annual_q[Int(floor((ll+1)/2)):(end-Int(floor((ll+1)/2))), :Year], roll, color=:black, label="11-Year Rolling Mean")

Figure 1: Streamflow data from the Upper Colorado

Defining “Drought”

Ault et al. (2014): Define a decadal “drought” as when the 11-year mean falls below 1/2 the standard deviation from the overall mean.

Figure 2: Droughts in the Upper Colorado

Mixture of Two Distributions

“Droughts” vs “Normal” Streamflow can be thought of as two different distributions occurring with different probabilities.

This is a mixture model:

\[\begin{aligned} Z &\sim \text{Mult}(\pi_1, \ldots, \pi_K) \\ Y | Z &\sim \mathcal{D}_Z = f_Z \end{aligned}\]

Very common to use Gaussians for \(\mathcal{D}_k\) but this isn’t necessary.

Mixture Models and Clustering

Mixture models are common for model-based clustering (a parametric alternative to e.g. \(k\)-means).

This is:

  • Good! A mixture model is predictive, might explain the data, and can be checked with e.g. cross-validation.
  • Bad! Highly sensitive to specification of distributions.

Example Based On Ault (2014) Definition

Code
annual_q.Upper_Colorado_log = log.(annual_q.Upper_Colorado)
histogram(annual_q.Upper_Colorado_log, xlabel=L"log-ft$^3$/yr", ylabel="PDF", norm=:pdf, fillcolor=:gray, alpha=0.5, label="Observations")
all_dist = fit(Normal, annual_q.Upper_Colorado_log)
plot!(all_dist, color=:black, label="Combined Distribution")
dry_flow = annual_q[dry_yrs, :Upper_Colorado_log]
wet_flow = annual_q[.!(dry_yrs), :Upper_Colorado_log]
dry_dist = fit(Normal, dry_flow)
wet_dist = fit(Normal, wet_flow)
plot!(dry_dist, color=:red, label="Dry Distribution")
plot!(wet_dist, color=:blue, label="Wet Distribution")
plot!(size=(1200, 450), legend=:outerright)

Figure 3: Mixture Fits for Drought Data

Fitting Mixture Models

The likelihood of a Gaussian mixture model: \[L(\theta | \mathbf{y}) = \prod_{i=1}^n \sum_{k=1}^K \pi_k N(y_i; \mu_k, \sigma_k^2)\]

and so the log-likelihood: \[l(\theta | \mathbf{y}) = \sum_{i=1}^n \log \left(\sum_{k=1}^K \pi_k N(y_i; \mu_k, \sigma_k^2)\right).\]

Fitting Mixture Models

\[\begin{aligned} \frac{\partial l}{\partial \theta_k} &= \sum_{i=1}^n \frac{1}{\sum_{j=1}^K \pi_j f_j(x_i; \theta)} \pi_k \frac{\partial f_k(y_i; \theta)}{\partial \theta_k} \\ &= \sum_{i=1}^n \frac{\pi_k f_k(x_i; \theta)}{\sum_{j=1}^K \pi_j f_j(x_i; \theta)} \frac{1}{f_j(x_i, \theta)} \frac{\partial f_k(y_i; \theta)}{\partial \theta_k} \\ &= \sum_{i=1}^n \frac{\pi_k f_k(x_i; \theta)}{\sum_{j=1}^K \pi_j f_j(x_i; \theta)} \frac{\partial \log f_j(x_i; \theta)}{\partial \theta_k} \end{aligned}\]

The Problem…

Maximizing this log-likelihood is a weighted likelihood maximization based on the component membership probabilities

\[w_{ik} = \frac{\pi_k f_k(x_i; \theta)}{\sum_{j=1}^K \pi_j f_k(x_i; \theta)} = p(Z_i = j | X_i = x_i; \theta).\]

But to calculate \(w_{ik}\) we need to know the parameters…

The E-M Algorithm

The E-M (Expectation-Maximization) Algorithm is an iterative approach to solving this problem:

  1. Initialize the parameters for the mixture components, call this \(\theta^{(0)}\).

The E-M Algorithm

The E-M (Expectation-Maximization) Algorithm is an iterative approach to solving this problem:

  1. E-step: Find the expected complete data log-likelihood \[\begin{aligned} Q(\theta, \theta^(t)) &= E_{Z | Y, \theta^{(t)}}\left[\log p(Y, Z| \theta)\right] \\ &= \sum_{i=1}^n \sum_{k=1}^K p(Z_i = k | y_i, \theta^{(t)}) \log p(y_i | Z_i = k, \theta). \end{aligned}\]

The E-M Algorithm

The E-M (Expectation-Maximization) Algorithm is an iterative approach to solving this problem:

  1. M-step: Find new parameter values by maximizing \(Q\): \[\theta^{(t+1)} = \arg\max_\theta Q(\theta, \theta^{(t)}).\]

The E-M Algorithm

The E-M (Expectation-Maximization) Algorithm is an iterative approach to solving this problem:

  1. Check for convergence of \(Q\) (usually by checking if log-likelihood has stabilized within some threshold).

The E-Step

Note

\[\begin{aligned} p(Z_i = k | y_i, \theta^{(t)}) &= \frac{p(Z_i = k, Y_i = y_i | \theta^{(t)}}{p(Y_i = y_i | \theta^{(t)})} \\ &= \frac{p(Y_i = y_i | Z_i = k, \theta^{(t)})p(Z_i = k | \theta^{(t)})}{\sum_{j=1}^K p(Y_i = y_i | Z_i = j, \theta^{(t)})p(Z_i = j | \theta^{(t)})} \\ &= \frac{\pi_k^{(t)}p(y_i | Z_i = k, \theta^{(t)})}{\sum_{j=1}^K \pi_j^{(t)}p(y_i | Z_i = j, \theta^{(t)})} \end{aligned}\]

Call the outputs of this calculation \(w_k^i\).

The M-Step

In the Gaussian case, can explicitly calculate the updates (otherwise use numerical optimization on the E-Step output).

The M-Step

Define \(N_k = \sum_{i=1}^n w_k^i\), and then:

\[\begin{aligned} \pi_k^{(t+1)} &= \frac{N_k}{N} \\ \mu_k^{(t+1)} &= \frac{1}{N_k} \sum_{i=1}^n w_k^i x_i \\ (\sigma_k^2)^{(t+1)} &= \frac{1}{N_k} \sum_{i=1}^n (x_i - \mu_k^{(t+1)})^2 w_k^i \end{aligned}\]

The E-Step

function e_step(y, p, mu, sigma)
    N = length(y)
    K = length(p)
    w = zeros(N, K)
    # compute weights for each point
    for k in 1:K
        w[:, k] = p[k] * pdf.(Normal(mu[k], sigma[k]), y)
    end
    w_tot = sum(w, dims=2) # normalize weights
    return w ./ w_tot
end

The M-Step

function m_step(y, w)
    N = length(y)
    K = size(w, 2)
    Nk = sum(w, dims=1) # soft count of mixture membership

    # update coefficients
    p = Nk / N
    mu = sum(w .* y; dims=1) ./ Nk
    sigma = sqrt.(sum(w .* (y .- mu).^2; dims=1) ./ Nk)
    return (p, mu, sigma)
end

The Full Algorithm

# compute log-likelihood for tracking
function log_likelihood(y, p, mu, sigma)
    N = length(y)
    K = length(p)
    ll_comp = zeros(N, K)
    for k = 1:K
        ll_comp[:, k] = p[k] * pdf.(Normal(mu[k], sigma[k]), y)
    end
    ll = sum(log.(sum(ll_comp, dims=2)))
    return ll
end

function em_algorithm(y, p_init, mu_init, sigma_init; max_iter = 300, thresh=1e-4)
    N = length(y)
    K = length(p_init)

    # initialize storage and add initial guesses
    p = zeros(max_iter + 1, K)
    p[1, :] = p_init
    mu = zeros(max_iter + 1, K)
    mu[1, :] = mu_init
    sigma = zeros(max_iter + 1, K)
    sigma[1, :] = sigma_init

    # compute initial log-likelihood
    ll = zeros(max_iter + 1)
    ll[1] = log_likelihood(y, p[1, :], mu[1, :], sigma[1, :])

    iter = 1
    for t = 1:max_iter
        w = e_step(y, p[t, :], mu[t, :], sigma[t, :])
        p[t+1, :], mu[t+1, :], sigma[t+1, :] = m_step(y, w)
        ll[t+1] = log_likelihood(y, p[t+1, :], mu[t+1, :], sigma[t+1, :])
        if abs(ll[t+1] - ll[t]) < thresh
            break
        end
        iter += 1
    end

    return (p[1:iter+1, :], mu[1:iter+1, :], sigma[1:iter+1, :], ll[1:iter+1, :])
end

E-M Algorithm Convergence

Code
mu_init = zeros(2)
sigma_init = zeros(2)
mu_init[1] = dry_dist.μ
sigma_init[1] = dry_dist.σ
mu_init[2] = wet_dist.μ
sigma_init[2] = wet_dist.σ
p_init = [0.5, 0.5]

p, mu, sigma, ll = em_algorithm(annual_q.Upper_Colorado_log, p_init, mu_init, sigma_init, max_iter = 1000, thresh=1e-5)

p1 = plot(ll, marker=:circle, markersize=2, xlabel="E-M Iteration", ylabel="Log-Likelihood", label=false)
p2 = plot(mu, xlabel="E-M Iteration", ylabel="Mean", label=["Dry" "Wet"])
p3 = plot(sigma.^2, xlabel="E-M Iteration", ylabel="Variance", label=["Dry" "Wet"])
plot(p1, p2, p3, layout=(1, 3), size=(1200, 450))

Example of E-M Algorithm

Code
histogram(annual_q.Upper_Colorado_log, xlabel=L"log-ft$^3$/yr", ylabel="PDF", norm=:pdf, fillcolor=:gray, alpha=0.5, label="Observations", legend=:outerright)
plot!(dry_dist, color=:red, linestyle=:dash, label="Dry (Ault)", xlabel=L"log-ft$^3$/yr", ylabel="PDF")
plot!(wet_dist, color=:blue, linestyle=:dash, label="Wet (Ault)")
plot!(Normal(mu[end, 1], sigma[end, 1]), color=:red, label="Dry (E-M)")
plot!(Normal(mu[end, 2], sigma[end, 2]), color=:blue, label="Wet (E-M)")

Figure 4: E-M Initial Steps

Fitted Mixture Fit

Code
mix_dist = MixtureModel(Normal, [(mu[end, 1], sigma[end, 1]), (mu[end, 2], sigma[end, 2])], p[end, :])
mix_samp = rand(mix_dist, 10_000)
qqplot(annual_q.Upper_Colorado_log, mix_samp, xlabel="Observations", ylabel="Mixture Model")

Figure 5: E-M Fit

Probabilities of Observations

Code
# compute component probabilities
N = nrow(annual_q)
K = 2 # number of mixture components
ll_comp = zeros(N, K)
for k = 1:K
    ll_comp[:, k] = p[end, k] * pdf.(Normal(mu[end, k], sigma[end, k]), annual_q.Upper_Colorado_log)
end
ll = ll_comp ./ sum(ll_comp, dims=2) # normalize to get probabilities
annual_q.dry_prob = ll[:, 1]
p = plot(annual_q.Year, annual_q.Upper_Colorado_log, label="Annual Flow", color=:gray, ylabel=L"log-ft$^3$/yr", xlabel="Year", leftmargin=10mm)
scatter!(p, annual_q.Year, annual_q.Upper_Colorado_log, markersize=7, marker_z = annual_q.dry_prob, c = cgrad(:bluesreds), colorbar_title = "Dry State Probability", colorbar_titlefontsize=16, label=false)
plot!(size=(1200, 450))

Figure 6: E-M Probabilities

What Don’t Mixture Models Tell Us

Mixture models are useful for clustering or generating bimodel distributions. But they are limited for understanding intertemporal dynamics.

For example: we don’t see the persistence of wet/dry states, with frequent switches between them. Is this reasonable?

To address this, we’ll need a generative model for state switching: a Hidden Markov Model.

Markov Chains

Hidden Markov Models

Instead, we might want to model the process that gives rise to a mixture distribution.

One way to do this for time series is with a Hidden Markov Model where the generative model at time \(t\) probability model “switches” probabilistically based on a Markov chain.

What Is A Markov Chain?

Consider a stochastic process \(\{X_t\}_{t \in \mathcal{T}}\), where

  • \(X_t \in \mathcal{S}\) is the state at time \(t\), and
  • \(\mathcal{T}\) is a time-index set (can be discrete or continuous)
  • \(\mathbb{P}(s_i \to s_j) = p_{ij}\).

Markov State Space

Markovian Property

This stochastic process is a Markov chain if it satisfies the Markovian (or memoryless) property: \[\begin{align*} \mathbb{P}(X_{T+1} = s_i &| X_1=x_1, \ldots, X_T=x_T) = \\ &\qquad\mathbb{P}(X_{T+1} = s_i| X_T=x_T) \end{align*} \]

Example: “Drunkard’s Walk”

:img Random Walk, 80%
  • How can we model the unconditional probability \(\mathbb{P}(X_T = s_i)\)?
  • How about the conditional probability \(\mathbb{P}(X_T = s_i | X_{T-1} = x_{T-1})\)?

Example: Weather

Suppose the weather can be foggy, sunny, or rainy.

Based on past experience, we know that:

  1. There are never two sunny days in a row;
  2. Even chance of two foggy or two rainy days in a row;
  3. A sunny day occurs 1/4 of the time after a foggy or rainy day.

Aside: Higher Order Markov Chains

Suppose that today’s weather depends on the prior two days.

  1. Can we write this as a Markov chain?
  2. What are the states?

Weather Transition Matrix

We can summarize these probabilities in a transition matrix \(P\): \[ P = \begin{array}{cc} \begin{array}{ccc} \phantom{i}\color{red}{F}\phantom{i} & \phantom{i}\color{red}{S}\phantom{i} & \phantom{i}\color{red}{R}\phantom{i} \end{array} \\ \begin{pmatrix} 1/2 & 1/4 & 1/4 \\ 1/2 & 0 & 1/2 \\ 1/4 & 1/4 & 1/2 \end{pmatrix} & \begin{array}{ccc} \color{red}F \\ \color{red}S \\ \color{red}R \end{array} \end{array} \]

Rows are the current state, columns are the next step, so \(\sum_i p_{ij} = 1\).

Weather Example: State Probabilities

Denote by \(\lambda^t\) a probability distribution over the states at time \(t\).

Then \(\lambda^t = \lambda^{t-1}P\):

\[\begin{pmatrix}\lambda^t_F & \lambda^t_S & \lambda^t_R \end{pmatrix} = \begin{pmatrix}\lambda^{t-1}_F & \lambda^{t-1}_S & \lambda^{t-1}_R \end{pmatrix} \begin{pmatrix} 1/2 & 1/4 & 1/4 \\ 1/2 & 0 & 1/2 \\ 1/4 & 1/4 & 1/2 \end{pmatrix} \]

Multi-Transition Probabilities

Notice that \[\lambda^{t+i} = \lambda^t P^i,\] so multiple transition probabilities are \(P\)-exponentials.

\[P^3 = \begin{array}{cc} \begin{array}{ccc} \phantom{iii}\color{red}{F}\phantom{ii} & \phantom{iii}\color{red}{S}\phantom{iii} & \phantom{ii}\color{red}{R}\phantom{iii} \end{array} \\ \begin{pmatrix} 26/64 & 13/64 & 25/64 \\ 26/64 & 12/64 & 26/64 \\ 26/64 & 13/64 & 26/64 \end{pmatrix} & \begin{array}{ccc} \color{red}F \\ \color{red}S \\ \color{red}R \end{array} \end{array} \]

Long Run Probabilities

What happens if we let the system run for a while starting from an initial sunny day?

Code
current = [1.0, 0.0, 0.0]
P = [1/2 1/4 1/4
    1/2 0 1/2
    1/4 1/4 1/2]   

T = 21

state_probs = zeros(T, 3)
state_probs[1,:] = current
for t=1:T-1
    state_probs[t+1, :] = state_probs[t:t, :] * P
end


p = plot(0:T-1, state_probs, label=["Foggy" "Sunny" "Rainy"], palette=:mk_8, linewidth=3)
xlabel!("Time")
ylabel!("State Probability")
plot!(p, size=(1000, 350))

Figure 7: State probabilities for the weather examples.

Key Points and Upcoming Schedule

Key Points

  • Mixture Models: produce multi-modal distributions by mixing different distributions together
  • Fit mixture models using Expectation-Maximization (E-M) algorithm
  • Markov chains to represent “switching” between different latent states.

Upcoming Schedule

Wednesday: More on Markov chains and Hidden Markov Models

Friday: Quiz 6, finish HMM example

References

References (Scroll for Full List)

Ault, T. R., Cole, J. E., Overpeck, J. T., Pederson, G. T., & Meko, D. M. (2014). Assessing the risk of persistent drought using climate model simulations and paleoclimate data. J. Clim., 27, 7529–7549. https://doi.org/10.1175/jcli-d-12-00282.1