Estimate your uncertainty

Modeling patient satisfaction data with empirical Bayesian methods

I recently picked up David Robinson’s book, Introduction to Empirical Bayes. It’s available online for a price of your own choosing (operating under a “pay-what-you-want” model), so you can technically pick it up for free, but it’s well worth the suggested price of $9.95. The book has a particular focus on practical steps for implementing Bayesian methods with code, which I appreciate. I’ve made it through Part I (of four), which makes for a good stopping point to practice what I’ve read.

The first section is highly focused on modeling the probability of success/failure of some binary outcome using a beta distribution. This is highly relevant to my work as an analyst, where whether or not a patient responded positively to a particular question on a survey can be modeled with this method. Thus far, however, I’ve taken the frequentist approach to analyses, which assumes we know nothing about what the data ought to look like prior to analyzing it. This is largely because I didn’t know of a robust way to estimate a prior for a Bayesian analysis.

Thankfully, however, the book walks through examples of exactly how to do this! We can use a maximum likelihood estimator to estimate a reasonable prior given the current data. That’s quite a bit of statistical mumbo-jumbo — in this post I’ll walk through an example that spells it out a bit more clearly using fake hospital satisfaction data (N.B.; this is largely a recreation of the steps taken in the book — practice makes perfect!).

Setting up the data

First, let’s simulate responses to patient satisfaction surveys. I tend to look at patient satisfaction scores across individual hospital units (e.g., ED, ICU, IMU, etc.). Units can have varying numbers of discharges, so we’ll use a log-normal distribution to estimate the number of responses for each unit.

# simulate 1,500 hospital units with an average of 150 survey returns per unit
survey_data <- 
  rlnorm(1500, log(150), 1.5) %>%
  as_tibble() %>%
  rename(n = value)

The spectrum of responses is incredibly broad — some units have a massive number of returns (in the tens of thousands!) while others have just a handful. This is fairly consistent with the real-world data that I’ve seen (though the units on the high-side are a bit over-represented here).

Next, let’s assume that there is some true satisfaction rate that is associated with each unit. If each unit had an infinite number of survey returns, the satisfaction rate from the survey returns would approach this true value. In this case, we’ll set the true satisfaction for each unit randomly but have it hover around 66%.

# set the true satisfaction to be different for each unit, but hover around 66%
survey_data <- 
  survey_data %>%
  rowwise() %>%
  mutate(true_satisfaction = rbeta(1, 66, 34))

Although there is a true satisfaction associated with each unit, we wouldn’t expect that the reported survey scores would match this exactly. This is especially true when there are few responses — if a unit has a true satisfaction rate of 75% but only 3 responses, it’s impossible for the reported score to match the underlying true rate!

We can simulate the number of patients who responded positively (in survey terms, the number of “topbox” responses) by generating n responses for each unit using a binomial distribution.

# simulate the number of patients responding with the topbox value
# we *know* the true value, but the actual score may vary!
survey_data <-
  survey_data %>%
  mutate(n = round(n),
         topbox = rbinom(1, n, true_satisfaction)) %>%
  ungroup() %>%
  # name each unit
  rowid_to_column() %>%
  mutate(unit = paste("Unit", rowid)) %>%
  relocate(unit) %>%
  # remove the true satisfaction so we don't know what it is!
  select(-rowid, -true_satisfaction)

# find patient satisfaction scores
survey_data <- 
  survey_data %>%
  mutate(score = topbox/n)

As expected, most of our simulated data hovers around a score of 66%. However, there are a few scores at the extremes of 0% and 100% — given how we simulated the data, it is unlikely that these units are really performing so poorly/so well and it’s likelier that they just have few returns.

# which units have the highest scores?
survey_data %>%
  arrange(desc(score)) %>%
  slice_head(n = 10) %>%
unit n topbox score
Unit 26 12 12 1.0000000
Unit 591 2 2 1.0000000
Unit 616 3 3 1.0000000
Unit 811 3 3 1.0000000
Unit 943 12 12 1.0000000
Unit 1217 6 6 1.0000000
Unit 1435 3 3 1.0000000
Unit 1437 6 6 1.0000000
Unit 863 19 18 0.9473684
Unit 372 13 12 0.9230769
# which units have the lowest scores?
survey_data %>%
  arrange(score) %>%
  slice_head(n = 10) %>%
unit n topbox score
Unit 1092 4 0 0.0000000
Unit 248 20 5 0.2500000
Unit 1120 7 2 0.2857143
Unit 416 3 1 0.3333333
Unit 456 3 1 0.3333333
Unit 972 6 2 0.3333333
Unit 113 13 5 0.3846154
Unit 260 15 6 0.4000000
Unit 695 15 6 0.4000000
Unit 1352 17 7 0.4117647

As expected, the units on either end of the spectrum aren’t necessarily outperforming/underperforming — they simply don’t have a lot of survey responses! We can use Bayesian inference to estimate the true satisfaction rate by specifying and updating a prior!

Generating a prior distribution

When looking at the entire dataset, the distribution of scores is thrown off a bit by the units with few responses. If we restrict the dataset to only the units that have more than 30 responses (which, as I’ve written about before, isn’t necessarily a data-driven cutoff for analysis) we can get a clearer idea of the distribution of the scores.

survey_data_filtered <-
  survey_data %>%
  filter(n > 30)

Alternatively, we can represent this distribution with a density plot:

This looks suspiciously like a beta distribution! A beta distribution’s shape can be defined by two parameters — alpha and beta. Varying these parameters lets us adjust the center and width to match any possible beta distribution.

What may make sense would be to use this distribution as our prior. I.e., if we have no responses for a unit, we can probably guess that their score would be somewhere around 66% with some healthy room on either side for variability. To do so, we need to estimate an appropriate alpha and beta — rather than guess the values using trial and error we can pass the work off to our computer to find parameters that maximize the likelihood that our estimated distribution matches the true distribution (hence the name, maximum likelihood estimator).


# log-likelihood function
log_likelihood <- function(alpha, beta) {
  -sum(dbeta(survey_data_filtered$score, alpha, beta, log = TRUE))

# pass various alphas & betas to `log_likelihood` 
# to find combination that maximizes the likelihood!
params <- 
    start = list(alpha = 50, beta = 50),
    lower = c(1, 1)

# extract alpha & beta
params <- coef(params)
alpha0 <- params[1]
beta0 <- params[2]

print(paste("alpha:", round(alpha0, 1), "beta:", round(beta0, 1)))
## [1] "alpha: 39.7 beta: 20.5"

How well does a beta distribution defined by these parameters match our actual data?

This is a pretty good representation of our initial data! When we have no survey responses, we can use a beta distribution with the initial parameters as specified by the maximum likelihood estimation. As a unit gets more responses, we can update our estimation to rely more heavily on the data rather than the prior:

# update alpha & beta as new responses come in!
alpha_new <- alpha0 + n_topbox
beta_new <- beta0 + n - n_topbox

Updating our priors

With a prior distribution defined by alpha0 and beta0, we can upgrade our frequentest estimation of each unit’s score to a Bayesian estimation!

# empirical bayes estimation of satisfaction score
survey_eb <-
  survey_data %>%
  mutate(eb_estimate = (topbox + alpha0) / (n + alpha0 + beta0))

What are the top and bottom performing units by this new Bayesian estimation?

# which units have the highest estimated scores?
survey_eb %>%
  arrange(desc(eb_estimate)) %>%
  slice_head(n = 10) %>%
unit n topbox score eb_estimate
Unit 133 160 133 0.8312500 0.7841640
Unit 1004 123 103 0.8373984 0.7787827
Unit 172 165 133 0.8060606 0.7667547
Unit 1042 372 291 0.7822581 0.7650930
Unit 1294 1409 1083 0.7686302 0.7641391
Unit 892 349 273 0.7822350 0.7641085
Unit 306 247 195 0.7894737 0.7639102
Unit 1249 1234 943 0.7641815 0.7592901
Unit 427 5469 4151 0.7590053 0.7579168
Unit 920 1637 1243 0.7593158 0.7557585
# which units have the lowest estimated scores?
survey_eb %>%
  arrange(eb_estimate) %>%
  slice_head(n = 10) %>%
unit n topbox score eb_estimate
Unit 613 1886 932 0.4941676 0.4992689
Unit 760 112 49 0.4375000 0.5149645
Unit 363 226 112 0.4955752 0.5299674
Unit 316 431 224 0.5197216 0.5368008
Unit 1032 235 119 0.5063830 0.5375222
Unit 1093 354 183 0.5169492 0.5376064
Unit 749 5286 2839 0.5370791 0.5384528
Unit 291 865 460 0.5317919 0.5400741
Unit 515 60 26 0.4333333 0.5463929
Unit 622 242 127 0.5247934 0.5515432

There are a few things that are worth noting with these estimates:

  • The estimated score is not the same as the actual reported score! As more responses come in, however, the estimated score converges to the actual.
  • The prior pulls estimated scores towards the prior mean — low scores are pulled up a bit and high scores are pulled down a bit.
  • The top (and bottom) performing units are no longer dominated by units with few returns!

We can also estimate the uncertainty around the estimated score with a credible interval. Credible intervals are the Bayesian counterpart to a frequentist’s confidence interval — both estimate the region that the true value could fall in given a certain probability — credible intervals, however, are informed by the prior distribution.

Because credible intervals are informed in part by the prior, they are tighter than their confidence interval counterparts. Like with the estimated score, however, as n-size increases, the Bayesian and frequentist interval estimations converge. In the absence of larger swathes of data, Bayesian methods can offer additional insight into our data by means of a prior distribution.

Some closing thoughts

Overall, this has been a fairly glowing review of the methods laid out in the first section of Introduction to Empirical Bayes. That being said, Bayesian methods of inference are not inherently better than frequentist methods — while they can offer additional context via a prior, there are situations where frequentist methods are preferred. From a math perspective, the prior provides diminishing returns as sample size increases, so it may be better forgoe Bayesian analysis when sample sizes are large. From an organizational perspective, Bayesian inference may be difficult to explain. In my own work, it’s highly unlikely that I’ll use Bayesian inference in any critical projects any time soon — I can imagine a lengthy uphill battle trying to explain the difference between the reported score and the estimated score informed by a prior.

Finally, there are a few things in this toy analysis that I am hoping to improve upon as I progress further through the book:

  • As I mentioned above and in previous writings, using n = 30 is a relatively arbitrary cutoff point for analysis. In this case, the prior distribution is fairly sensitive to the cutoff point selected — I am hoping that later sections in the book highilight more robust ways of partitioning data for setting priors.
  • In the above analysis we’re only examining one variable (univariate analysis) — I am looking forward to extending these methods to multivariate analyses and regressions.
  • The beta distribution is appropriate for modeling the probability distribution of binary outcomes. In this example, where the outcome is simply the proportion of patients that responded favorably to the survey, modeling the outcome with a beta distribution is appropriate (responses can either be in the “topbox” or not). When there are more than two possible outcomes — for example, when trying to model Net Promoter Score as the proportion of “promoters,” “passives,” and “detractors” — the more general Dirichlet distribution is more appropriate.
  • I’m hoping also that the book covers methods for dealing with time-dependent data. For example, we’d expect that concerted efforts (or lack thereof) by the hospital units could significantly impact the underlying “true satisfaction” that we’re attempting to estimate via surveying. We expect that more recent survey responses should be more impactful in informing our posterior estimation, but I’ve yet to find any robust literature on how to weight the recency of responses. In the past, I’ve used exponentional decay to reduce the weight of old responses, but this feels a bit arbitrary.

Overall, this has been a long way of saying that I’m happy with the book so far and I’m excited to see what comes next as I continue reading!

Mark Rieke
Senior CX Analyst

I’m a mechanical engineer by education, data analyst by practice. I love machine learning and communicating complex topics clearly with simple and beautiful charts.