Sufficiently Normal

Oops I’ve been obsessed with sufficiency lately
Author: Mark Rieke
Published: February 19, 2026

Code
library(tidyverse)
library(cmdstanr)
library(ggdist)
library(riekelib)

# Set up a color palette to be used for groups A/B
pal <- NatParksPalettes::NatParksPalettes$Acadia[[1]][c(2,8)]

As a somewhat unexpected follow-up to my previous post on the sufficient gamma, I’ve decided to take a similarly motivated look at implementing the sufficient normal. I’d previously implemented the sufficient normal as part of my March Madness predictions last year, but if I’m being honest, I got it working without ever developing an intuition for why it worked. That’s bugged me for the better part of a year, so here we are.

1 My current employer is a subsidiary of LV Sands, so it’s unlikely that I’ll put together a model for March Madness this year.

Like last time, if you want to skip the derivation and just get right to implementation, here’s a sufficient normal density function that you can drop into the functions block at the top of your Stan model:

real sufficient_normal_lpdf(
  real Xsum,    // sum of the observations, sum(x)
  real Xsumsq,  // sum of the squared observations, sum(x^2)
  real n,       // number of observations summarized
  real mu,      // mean parameter
  real sigma    // standard deviation parameter
) {
  real lp = 0.0;
  real s2 = sigma^2;
  real d = (2 * s2)^(-1);
  lp += -n * 0.5 * log(2 * pi() * s2);  // normalizing constant, once per observation
  lp += -d * Xsumsq;                    // quadratic term, evaluated via sum(x^2)
  lp += 2 * d * mu * Xsum;              // cross term, evaluated via sum(x)
  lp += -n * d * mu^2;                  // mu^2 term, once per observation
  return lp;
}
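
If you want to sanity-check the math before wiring it into a model, here’s a quick R translation of the same function (my own port, not code from this post’s models) that you can compare against base R’s dnorm():

Code
# my own R port of the Stan function above, for testing outside of Stan
sufficient_normal_lpdf_r <- function(Xsum, Xsumsq, n, mu, sigma) {
  s2 <- sigma^2
  d <- 1 / (2 * s2)
  -n * 0.5 * log(2 * pi * s2) - d * Xsumsq + 2 * d * mu * Xsum - n * d * mu^2
}

# the sufficient version should match the sum of individual log densities
y <- rnorm(100, mean = 1, sd = 2)
sufficient_normal_lpdf_r(sum(y), sum(y^2), length(y), mu = 1, sigma = 2)
sum(dnorm(y, mean = 1, sd = 2, log = TRUE))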

Notably Normal

I’ll simulate 5,000 observations from each of two normally distributed groups for a total of 10,000 observations in the dataset. I’ll then fit two models to these 10,000 observations — one making use of Stan’s built-in normal likelihood and another with our custom sufficient normal density function.

Code
set.seed(2026)
normal_data <-
  tibble(group = c("A", "B"),
         mu = c(0, 2),
         sigma = c(1, 0.5)) %>%
  mutate(y = pmap(list(mu, sigma), ~rnorm(5000, ..1, ..2)))

normal_data %>%
  unnest(y) %>%
  ggplot(aes(x = y,
             y = group,
             fill = group)) +
  stat_histinterval(breaks = 60,
                    slab_alpha = 0.75) +
  scale_fill_manual(values = pal) +
  theme_rieke()

We’ll fit a simple normal likelihood to these observations where each group, $g$, has a separate mean, $\mu_g$, and standard deviation, $\sigma_g$. Unsurprisingly, this model fits with no issues.

$$
\begin{aligned}
y_i &\sim \mathcal{N}(\mu_g, \sigma_g) \\
\mu_g &\sim \mathcal{N}(0, 5) \\
\sigma_g &\sim \text{Exponential}(1)
\end{aligned}
$$

Stan Model
data {
  // Dimensions
  int<lower=0> N;                          // Number of observations
  int G;                                   // Number of groups

  // Observations
  vector[N] Y;                             // Outcome per observation
  array[N] int<lower=1, upper=G> gid;      // Group membership per observation
  
  // Normal priors
  real mu_mu;
  real<lower=0> sigma_mu;
  real<lower=0> lambda_sigma;
}

parameters {
  vector[G] mu;
  vector<lower=0>[G] sigma;
}

model {
  target += normal_lpdf(mu | mu_mu, sigma_mu);
  target += exponential_lpdf(sigma | lambda_sigma);
  for (n in 1:N) {
    target += normal_lpdf(Y[n] | mu[gid[n]], sigma[gid[n]]);
  }
}
Code
individual_model <- cmdstan_model("individual-normal.stan")

model_data <-
  normal_data %>%
  mutate(gid = rank(group)) %>%
  unnest(y)

stan_data <-
  list(
    N = nrow(model_data),
    Y = model_data$y,
    G = max(model_data$gid),
    gid = model_data$gid,
    mu_mu = 0,
    sigma_mu = 5,
    lambda_sigma = 1
  )

start_time <- Sys.time()
individual_fit <-
  individual_model$sample(
    data = stan_data,
    seed = 1234,
    iter_warmup = 1000,
    iter_sampling = 1000,
    chains = 10,
    parallel_chains = 10
  )
run_time_individual <- as.numeric(Sys.time() - start_time)

plot_posterior <- function(fit) {
  
  # Get the true parameters used in simulation
  truth <-
    normal_data %>%
    select(group, mu, sigma) %>%
    pivot_longer(-group,
                 names_to = "parameter",
                 values_to = "draw") %>%
    mutate(parameter = if_else(parameter == "mu", "\u03bc", "\u03c3"))
  
  fit$draws(c("mu", "sigma"), format = "df") %>%
    as_tibble() %>%
    select(starts_with("mu"), starts_with("sigma")) %>%
    pivot_longer(everything(),
                 names_to = "parameter",
                 values_to = "draw") %>%
    nest(data = -parameter) %>%
    separate(parameter, c("parameter", "group"), "\\[") %>%
    mutate(group = str_remove(group, "\\]"),
           group = if_else(group == "1", "A", "B"),
           parameter = if_else(parameter == "mu", "\u03bc", "\u03c3")) %>%
    unnest(data) %>%
    ggplot(aes(x = draw,
               y = parameter,
               fill = group)) +
    stat_histinterval(breaks = 60,
                      slab_alpha = 0.75) +
    geom_point(data = truth,
               color = "red") +
    scale_fill_manual(values = pal) +
    theme_rieke()
  
}

plot_posterior(individual_fit) +
  labs(title = "**Fit against individual observations**",
       subtitle = "Can be used as a baseline for evaluating the sufficient formulation",
       x = NULL,
       y = NULL)

While it only takes 4.1 seconds to fit this model, we have some room for improvement by summarizing the dataset with a sufficient formulation.

2 As an aside: wow, Stan’s implementation of the normal likelihood is incredibly fast.

Density Derivations

When fitting a model, we repeatedly multiply the probability densities of individual observations to get a total probability density for the entire set of observations. Exploiting sufficiency lets us calculate this product in one step, rather than evaluating each observation’s density individually.

$$
\begin{aligned}
\Pr(y_1, y_2, \dots, y_n) &= \Pr(y_1) \cdot \Pr(y_2) \cdots \Pr(y_n) \\
\Pr(y) &= \prod_i^N \Pr(y_i)
\end{aligned}
$$
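
As a quick numerical illustration (with toy values of my own, unrelated to the simulation above), multiplying the individual densities and summing on the log scale give the same joint density:

Code
# joint density of independent observations = product of individual densities
y <- c(-0.5, 0.3, 1.2)
prod(dnorm(y, mean = 0, sd = 1))
exp(sum(dnorm(y, mean = 0, sd = 1, log = TRUE)))  # identical via the log scale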

The probability density of the normal distribution with mean $\mu$ and standard deviation $\sigma$ for a single observation $y_i$ is:

$$
\Pr(y_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2\sigma^2}(y_i - \mu)^2}
$$

Our goal is to find a variant of this density function that allows us to evaluate $\Pr(y)$ given summary information from $y$, rather than individual observations. Bayesian models evaluate log probabilities and it’s generally easier to work on the log scale — the log of the normal density function for a single observation is:

$$
\log(\Pr(y_i)) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(y_i - \mu)^2}{2\sigma^2}
$$

We want to set up $y_i$ in such a way that we can easily replace it with summary statistics when we move from a single observation to multiple observations. Here, that means we need to expand $(y_i - \mu)^2$.

$$
\begin{aligned}
\log(\Pr(y_i)) &= -\frac{1}{2}\log(2\pi\sigma^2) - \frac{y_i^2 - 2\mu y_i + \mu^2}{2\sigma^2} \\
&= -\frac{1}{2}\log(2\pi\sigma^2) - (2\sigma^2)^{-1}\left(y_i^2 - 2\mu y_i + \mu^2\right)
\end{aligned}
$$

The last step of this expansion is to multiply $(2\sigma^2)^{-1}$ by each of the quadratic terms. For legibility’s sake (and to save some keystrokes), I set $d = (2\sigma^2)^{-1}$. We end up with the following equation for the log probability of a single observation, $y_i$.

$$
\log(\Pr(y_i)) = -\frac{1}{2}\log(2\pi\sigma^2) - d y_i^2 + 2d\mu y_i - d\mu^2
$$
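
As a spot check that the expansion hasn’t gone sideways, this term-by-term version matches dnorm(log = TRUE) for a single observation (arbitrary values of my own choosing):

Code
# expanded log density of a single observation vs. base R
yi <- 0.7; mu <- 0.25; sigma <- 1.5
d <- 1 / (2 * sigma^2)
-0.5 * log(2 * pi * sigma^2) - d * yi^2 + 2 * d * mu * yi - d * mu^2
dnorm(yi, mean = mu, sd = sigma, log = TRUE)  # same value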

The simplest case beyond a single observation is to consider two observations. Because we’re on the log scale, we can simply add the two together to get the log of the product.

$$
\begin{aligned}
\log(\Pr(y_1, y_2)) = &-\frac{1}{2}\log(2\pi\sigma^2) - d y_1^2 + 2d\mu y_1 - d\mu^2 \\
&-\frac{1}{2}\log(2\pi\sigma^2) - d y_2^2 + 2d\mu y_2 - d\mu^2
\end{aligned}
$$

This framing makes it clear how the product expands to $n$ observations. Each observation contributes another $-\frac{1}{2}\log(2\pi\sigma^2)$ and another $-d\mu^2$ term, so we can simply multiply these terms by $n$. Each $y_i^2$ term is multiplied by $-d$, so we can replace $-d(y_1^2 + y_2^2 + \dots + y_n^2)$ with $-d\sum y^2$ (and the same operation gets us to $2d\mu\sum y$). Altogether, we arrive at a new log probability function based on the count, sum, and sum of squares of $y$.

$$
\log(\Pr(y)) = -\frac{n}{2}\log(2\pi\sigma^2) - d\sum y^2 + 2d\mu\sum y - nd\mu^2
$$
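
The same check scales up: evaluating this formula with just the count, sum, and sum of squares recovers the total log probability of the full vector (again, toy values of my own):

Code
# sufficient formulation vs. summing over individual observations
y <- rnorm(1000, mean = 2, sd = 0.5)
n <- length(y); mu <- 2; sigma <- 0.5
d <- 1 / (2 * sigma^2)
-n / 2 * log(2 * pi * sigma^2) - d * sum(y^2) + 2 * d * mu * sum(y) - n * d * mu^2
sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))  # matches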

The log probability is what Stan uses during sampling and is how I defined the function in the intro of this article. For completeness’ sake, we can unlog to show that the probability density of a normally distributed set of observations, $y$, has a sufficient formulation given the count, sum, and sum of squares of $y$.

$$
\Pr\left(\sum y^2, \sum y, n \,\middle|\, \mu, \sigma\right) = \frac{1}{(2\pi\sigma^2)^{n/2}} \, e^{-(2\sigma^2)^{-1}\left(\sum y^2 - 2\mu\sum y + n\mu^2\right)}
$$

Sufficient Speedups

With this sufficient formulation, we can collapse our original 10,000-row dataset into just two rows — one per group.

Code
sufficient_model <- cmdstan_model("sufficient-normal.stan")

model_data <-
  normal_data %>%
  mutate(gid = rank(group),
         n = map_int(y, length),
         Ysum = map_dbl(y, sum),
         Ysumsq = map_dbl(y, ~sum(.x^2)))

model_data
#> # A tibble: 2 × 8
#>   group    mu sigma y               gid     n    Ysum Ysumsq
#>   <chr> <dbl> <dbl> <list>        <dbl> <int>   <dbl>  <dbl>
#> 1 A         0   1   <dbl [5,000]>     1  5000   -40.3  4927.
#> 2 B         2   0.5 <dbl [5,000]>     2  5000 10039.  21429.

We can keep the same priors and just swap out our original likelihood function for our new sufficient likelihood.

$$
\begin{aligned}
\textstyle\sum y^2,\ \sum y,\ n &\sim \text{Sufficient-Normal}(\mu_g, \sigma_g) \\
\mu_g &\sim \mathcal{N}(0, 5) \\
\sigma_g &\sim \text{Exponential}(1)
\end{aligned}
$$

Stan Model
functions {
  real sufficient_normal_lpdf(
    real Xsum,    // sum of the observations, sum(x)
    real Xsumsq,  // sum of the squared observations, sum(x^2)
    real n,       // number of observations summarized
    real mu,      // mean parameter
    real sigma    // standard deviation parameter
  ) {
    real lp = 0.0;
    real s2 = sigma^2;
    real d = (2 * s2)^(-1);
    lp += -n * 0.5 * log(2 * pi() * s2);  // normalizing constant, once per observation
    lp += -d * Xsumsq;                    // quadratic term, evaluated via sum(x^2)
    lp += 2 * d * mu * Xsum;              // cross term, evaluated via sum(x)
    lp += -n * d * mu^2;                  // mu^2 term, once per observation
    return lp;
  }
}

data {
  // Dimensions
  int<lower=0> N;                          // Number of summary rows (one per group)
  int G;                                   // Number of groups

  // Observations
  vector[N] Ysum;                          // Sum of a group of observations
  vector[N] Ysumsq;                        // Sum(y^2) of a group of observations
  vector[N] n_obs;                         // Number of observations per group
  array[N] int<lower=1, upper=G> gid;      // Group membership
  
  // Normal priors
  real mu_mu;
  real<lower=0> sigma_mu;
  real<lower=0> lambda_sigma;
}

parameters {
  vector[G] mu;
  vector<lower=0>[G] sigma;
}

model {
  target += normal_lpdf(mu | mu_mu, sigma_mu);
  target += exponential_lpdf(sigma | lambda_sigma);
  for (n in 1:N) {
    target += sufficient_normal_lpdf(Ysum[n] | Ysumsq[n], n_obs[n], mu[gid[n]], sigma[gid[n]]);
  }
}
Code
stan_data <-
  list(
    N = nrow(model_data),
    Ysum = model_data$Ysum,
    Ysumsq = model_data$Ysumsq,
    n_obs = model_data$n,
    G = max(model_data$gid),
    gid = model_data$gid,
    mu_mu = 0,
    sigma_mu = 5,
    lambda_sigma = 1
  )

start_time <- Sys.time()
sufficient_fit <-
  sufficient_model$sample(
    data = stan_data,
    seed = 1234,
    iter_warmup = 1000,
    iter_sampling = 1000,
    chains = 10,
    parallel_chains = 10
  )
run_time_sufficient <- as.numeric(Sys.time() - start_time)

Fitting the sufficient formulation to the same dataset only takes 0.7 seconds, a 5.6x speedup compared to the fit against individual observations. And we recover the same posterior that we see in the fit against individual observations!

3 Because the normal likelihood in Stan is already so optimized, this isn’t as large a speedup as we saw with the sufficient gamma. But in both cases, the speedup from sufficiency is largely determined by the original dataset’s size. For example, fitting the original model to 200,000 individual observations takes ~90s, whereas the sufficient model still only takes ~0.7s.

Code
plot_posterior(sufficient_fit) +
  labs(title = "**Fit using the sufficient normal**",
       subtitle = "Recovers the same posterior from the fit to individual observations!",
       x = NULL,
       y = NULL)

Once again: exploit sufficiency, go zoom zoom. I simply cannot be stopped.

Citation

BibTeX citation:
@online{rieke2026,
  author = {Rieke, Mark},
  title = {Sufficiently {Normal}},
  date = {2026-02-19},
  url = {https://www.thedatadiary.net/posts/2026-02-19-sufficient-normal/},
  langid = {en}
}
For attribution, please cite this work as:
Rieke, Mark. 2026. “Sufficiently Normal.” February 19, 2026. https://www.thedatadiary.net/posts/2026-02-19-sufficient-normal/.
