```{r, echo = F, cache = F}
knitr::opts_chunk$set(fig.retina = 2.5)
knitr::opts_chunk$set(fig.align = "center")
options(width = 110)
```
# Monsters and Mixtures
> Many monsters are hybrids. Many statistical models are too. This chapter is about constructing likelihood and link functions by piecing together the simpler components of previous chapters. Like legendary monsters, these hybrid likelihoods contain pieces of other model types. Endowed with some properties of each piece, they help us model outcome variables with inconvenient, but common, properties....
>
> We'll consider three common and useful examples. The first are models for handling **over-dispersion**. These models extend the binomial and Poisson models of the previous chapter to cope a bit with unmeasured sources of variation. The second type is a family of **zero-inflated** and **zero-augmented** models, each of which mixes a binary event with an ordinary GLM likelihood like a Poisson or binomial. The third type is the **ordered categorical** model, useful for categorical outcomes with a fixed ordering. This model is built by merging a categorical likelihood function with a special kind of link function, usually a **cumulative link**. We'll also learn how to construct ordered categorical predictors.
>
> These model types help us transform our modeling to cope with the inconvenient realities of measurement, rather than transforming measurements to cope with the constraints of our models. [@mcelreathStatisticalRethinkingBayesian2020, p. 369, **emphasis** in the original]
## Over-dispersed counts
> When counts arise from a mixture of different processes, then there may be more variation--thicker tails--than a pure count model expects. This can again lead to overly excited models. When counts are more variable than a pure process, they exhibit **over-dispersion**. The variance of a variable is sometimes called its **dispersion**. For a counting process like a binomial, the variance is a function of the same parameters as the expected value. For example, the expected value of a binomial is $Np$ and its variance is $Np(1 - p)$. When the observed variance exceeds this amount--after conditioning on all the predictor variables--this implies that some omitted variable is producing additional dispersion in the observed counts.
>
> That isn't necessarily bad. Such a model could still produce perfectly good inferences. But ignoring over-dispersion can also lead to all of the same problems as ignoring any predictor variable. Heterogeneity in counts can be a confound, hiding effects of interest or producing spurious inferences. (p. 370, **emphasis** in the original)
In this chapter we'll cope with the problem using continuous mixture models, "models in which a linear model is attached not to the observations themselves but rather to a distribution of observations" (p. 370).
### Beta-binomial.
> A **beta-binomial** model is a mixture of binomial distributions. It assumes that each binomial count observation has its own probability of success. We estimate the *distribution* of probabilities of success instead of a single probability of success. Any predictor variables describe the shape of this distribution. (p. 370, **emphasis** in the original)
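To get a feel for what that buys us, here's a quick simulation sketch. This isn't in the text, and the particular `n_trials` and beta values below are arbitrary choices for illustration. The point is that once each observation gets its own probability of success, the counts end up more variable than a pure binomial with the same mean would predict.
```{r}
set.seed(12)
n_trials <- 20
# pure binomial: every observation shares the same probability of success
pure <- rbinom(1e5, size = n_trials, prob = 0.6)
# beta-binomial: each observation gets its own probability of success,
# drawn from a beta distribution with a mean of 0.6
p_i   <- rbeta(1e5, shape1 = 6, shape2 = 4)
mixed <- rbinom(1e5, size = n_trials, prob = p_i)
# similar means, but the mixture is over-dispersed relative to N * p * (1 - p)
c(mean_pure = mean(pure), mean_mixed = mean(mixed))
c(var_pure = var(pure), var_mixed = var(mixed),
  var_binomial_theory = n_trials * 0.6 * (1 - 0.6))
```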
Unfortunately, we need to digress. As it turns out, there are multiple ways to parameterize the beta distribution and we've run square into two. In the text, McElreath described the beta distribution with two parameters, an average probability $\bar p$ and a shape parameter $\theta$. In his **R** code 12.1, which we'll reproduce in a bit, he demonstrated that parameterization with the `rethinking::dbeta2()` function. The nice thing about this parameterization is how intuitive the `pbar` parameter is. If you want a beta with an average of .2, you set `pbar = .2`. If you want the distribution to be more or less certain, make the `theta` argument more or less large, respectively.
However, the beta density is often defined in terms of $\alpha$ and $\beta$. If you denote the data as $y$, this follows the form
$$\operatorname{Beta}(y | \alpha, \beta) = \frac{y^{\alpha - 1} (1 - y)^{\beta - 1}}{\text B (\alpha, \beta)},$$
which you can verify in the [*Continuous distributions on [0, 1]* section of the *Stan functions reference*](https://mc-stan.org/docs/functions-reference/continuous-distributions-on-0-1.html) [@standevelopmentteamStanFunctionsReference2022]. In the formula, $\text B$ stands for the Beta function, a normalizing constant you can learn about in the [*Mathematical functions* chapter](https://mc-stan.org/docs/functions-reference/math-functions.html) of the Stan functions reference. If you look at the base **R** `dbeta()` function, you'll learn it takes two parameters, `shape1` and `shape2`. Those uncreatively-named parameters are the same $\alpha$ and $\beta$ from the density, above. They do not correspond to the `pbar` and `theta` parameters of McElreath's `rethinking::dbeta2()` function.
McElreath had good reason for using `dbeta2()`. Beta's typical $\alpha$ and $\beta$ parameters aren't the most intuitive to use; the parameters in McElreath's `dbeta2()` are much nicer. If you dive a little deeper, it turns out you can find the mean of a beta distribution in terms of $\alpha$ and $\beta$ like this:
$$\mu = \frac{\alpha}{\alpha + \beta}.$$
We can talk about the spread of the distribution, sometimes called $\kappa$, in terms of $\alpha$ and $\beta$ like this:
$$\kappa = \alpha + \beta.$$
With $\mu$ and $\kappa$ in hand, we can even find the $\textit{SD}$ of a beta distribution with the formula
$$\sigma = \sqrt{\mu (1 - \mu) / (\kappa + 1)}.$$
I explicate all this because McElreath's `pbar` is $\mu = \frac{\alpha}{\alpha + \beta}$ and his `theta` is $\kappa = \alpha + \beta$, which is great news because it means that we can understand what McElreath did with his `dbeta2()` function in terms of the base **R** `dbeta()` function. This also sets us up to understand the distribution of the beta parameters used in `brms::brm()`. To demonstrate, let's walk through McElreath's **R** code 12.1.
Before we get to **R** code 12.1 and our version of the resulting plot, we should discuss themes. In this chapter we'll use theme settings and a color palette from the [**ggthemes** package](https://cran.r-project.org/package=ggthemes).
```{r, warning = F, message = F}
library(ggthemes)
```
We'll take our basic theme settings from the `theme_hc()` function. We'll use the `Green fields` color palette, which we can inspect with the `canva_pal()` function and a little help from `scales::show_col()`.
```{r, fig.height = 2}
scales::show_col(canva_pal("Green fields")(4))
canva_pal("Green fields")(4)
canva_pal("Green fields")(4)[3]
```
Now we finally get to **R** code 12.1.
```{r, fig.width = 4, fig.height = 3, warning = F, message = F}
library(tidyverse)
theme_set(
theme_hc() +
theme(axis.ticks.y = element_blank(),
plot.background = element_rect(fill = "grey92"))
)
pbar <- .5
theta <- 5
ggplot(data = tibble(x = seq(from = 0, to = 1, by = .01)),
aes(x = x, y = rethinking::dbeta2(x, prob = pbar, theta = theta))) +
geom_area(fill = canva_pal("Green fields")(4)[1]) +
scale_x_continuous("probability space", breaks = c(0, .5, 1)) +
scale_y_continuous(NULL, breaks = NULL) +
ggtitle(expression(The~beta~distribution),
subtitle = expression("Defined in terms of "*mu*" (i.e., pbar) and "*kappa*" (i.e., theta)"))
```
In his [-@kruschkeDoingBayesianData2015] text, [*Doing Bayesian data analysis*](https://sites.google.com/site/doingbayesiandataanalysis/), Kruschke provided code for a convenience function that takes `pbar` and `theta` as inputs and returns the corresponding $\alpha$ and $\beta$ values. Here's the function:
```{r}
betaABfromMeanKappa <- function(mean, kappa) {
if (mean <= 0 | mean >= 1) stop("must have 0 < mean < 1")
if (kappa <= 0) stop("kappa must be > 0")
a <- mean * kappa
b <- (1.0 - mean) * kappa
return(list(a = a, b = b))
}
```
Now we can use Kruschke's `betaABfromMeanKappa()` to find the $\alpha$ and $\beta$ values corresponding to `pbar` and `theta`.
```{r}
betaABfromMeanKappa(mean = pbar, kappa = theta)
```
And finally, we can double check that all of this works. Here's the same distribution but defined in terms of $\alpha$ and $\beta$.
```{r, fig.width = 4, fig.height = 3}
ggplot(data = tibble(x = seq(from = 0, to = 1, by = .01)),
aes(x = x, y = dbeta(x, shape1 = 2.5, shape2 = 2.5))) +
geom_area(fill = canva_pal("Green fields")(4)[4]) +
scale_x_continuous("probability space", breaks = c(0, .5, 1)) +
scale_y_continuous(NULL, breaks = NULL) +
ggtitle(expression(The~beta~distribution),
subtitle = expression("This time defined in terms of "*alpha*" and "*beta))
```
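As a quick numeric sanity check (this bit isn't in the text), the mean and $\textit{SD}$ formulas from above line up nicely with a large number of random draws from that same $\operatorname{Beta}(2.5, 2.5)$ distribution.
```{r}
set.seed(12)
draws <- rbeta(1e6, shape1 = 2.5, shape2 = 2.5)
# mu = alpha / (alpha + beta) and kappa = alpha + beta
mu    <- 2.5 / (2.5 + 2.5)
kappa <- 2.5 + 2.5
c(mean_formula = mu,
  mean_draws   = mean(draws),
  sd_formula   = sqrt(mu * (1 - mu) / (kappa + 1)),
  sd_draws     = sd(draws))
```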
McElreath encouraged us to "explore different values for `pbar` and `theta`" (p. 371). Here's a grid of plots with `pbar = c(.25, .5, .75)` and `theta = c(5, 15, 30)`.
```{r, fig.width = 5, fig.height = 5}
# data
crossing(pbar = c(.25, .5, .75),
theta = c(5, 15, 30)) %>%
expand_grid(x = seq(from = 0, to = 1, length.out = 100)) %>%
mutate(density = rethinking::dbeta2(x, prob = pbar, theta = theta),
mu = str_c("mu == ", pbar %>% str_remove(., "0")),
kappa = factor(str_c("kappa == ", theta),
levels = c("kappa == 30", "kappa == 15", "kappa == 5"))) %>%
# plot
ggplot(aes(x = x, y = density)) +
geom_area(fill = canva_pal("Green fields")(4)[4]) +
scale_x_continuous("probability space",
breaks = c(0, .5, 1), labels = c("0", ".5", "1")) +
scale_y_continuous(NULL, labels = NULL) +
theme(axis.ticks.y = element_blank()) +
facet_grid(kappa ~ mu, labeller = label_parsed)
```
If you'd like to see how to make a similar plot in terms of $\alpha$ and $\beta$, see [Chapter 6](https://bookdown.org/content/3686/inferring-a-binomial-probability-via-exact-mathematical-analysis.html) of my [-@kurzDoingBayesianDataAnalysis2023] [ebook](https://bookdown.org/content/3686/) wherein I translated Kruschke's text into **tidyverse** and **brms** code.
But remember, we're not fitting a beta model. We're using the beta-binomial. "We're going to bind our linear model to $\bar p$, so that changes in predictor variables change the central tendency of the distribution" (p. 371). The statistical model we'll be fitting follows the form
\begin{align*}
\text{admit}_i & \sim \operatorname{BetaBinomial}(n_i, \bar p_i, \phi)\\
\operatorname{logit}(\bar p_i) & = \alpha_{\text{gid}[i]} \\
\alpha_j & \sim \operatorname{Normal}(0, 1.5) \\
\phi & \sim \operatorname{Exponential}(1).
\end{align*}
Here the size, $n$, is defined in the `applications` column in the data we'll load in just a moment. In case you're confused, yes, our statistical model is not quite the same as the one McElreath presented on page 371 in the text. If you look closely, we dropped all mention of $\theta$ and jumped directly to $\phi$. Instead of implementing McElreath's $\theta = \phi + 2$ trick, we're going to set the lower bound for $\phi$ directly. Which brings us to the next issue:
The beta-binomial has historically not been officially supported by **brms** (see [GitHub issue #144](https://github.com/paul-buerkner/brms/issues/144))[^5]. However, **brms** versions 2.2.0 and above allow users to define custom distributions. To get all the details, you might check out Bürkner's [-@Bürkner2022Define] vignette, [*Define custom response distributions with brms*](https://CRAN.R-project.org/package=brms/vignettes/brms_customfamilies.html). Happily, Bürkner even used the [beta-binomial distribution as the exemplar](https://cran.r-project.org/package=brms/vignettes/brms_customfamilies.html#the-beta-binomial-distribution) in the vignette.
Before we get carried away, let's load the data and **brms**.
```{r, warning = F, message = F}
data(UCBadmit, package = "rethinking")
d <-
UCBadmit %>%
mutate(gid = ifelse(applicant.gender == "male", "1", "2"))
rm(UCBadmit)
library(brms)
```
I'm not going to go into great detail explaining the ins and outs of making custom distributions for `brm()`. You've got Bürkner's vignette for that. For our purposes, we need a few preparatory steps. First, we need to use the `custom_family()` function to define the name and parameters of the beta-binomial distribution for use in `brm()`. Second, we have to define some functions for Stan which are not defined in Stan itself. We'll save them as `stan_funs`. Third, we'll make a `stanvar()` statement which will allow us to pass our `stan_funs` to `brm()`.
```{r}
beta_binomial2 <- custom_family(
"beta_binomial2", dpars = c("mu", "phi"),
links = c("logit", "log"),
lb = c(0, 2), ub = c(1, NA),
type = "int", vars = "vint1[n]"
)
stan_funs <- "
real beta_binomial2_lpmf(int y, real mu, real phi, int T) {
return beta_binomial_lpmf(y | T, mu * phi, (1 - mu) * phi);
}
int beta_binomial2_rng(real mu, real phi, int T) {
return beta_binomial_rng(T, mu * phi, (1 - mu) * phi);
}
"
stanvars <- stanvar(scode = stan_funs, block = "functions")
```
Did you notice the `lb = c(0, 2)` portion of the code defining `beta_binomial2()`? In Bürkner's vignette, he set the lower bound of `phi` to zero. Since McElreath wanted the lower bound for $\phi$ to be 2, we just set that as the default in the likelihood. We should clarify two more points:
First, what McElreath referred to as the shape parameter, $\theta$, Bürkner called the precision parameter, $\phi$. In our exposition, above, we followed Kruschke's convention and called it $\kappa$. Don't let the notation throw you: $\theta$, $\phi$, and $\kappa$ are **all the same thing**. Perhaps less confusingly, what McElreath called the `pbar` parameter, $\bar p$, Bürkner simply refers to as $\mu$.
Second, we've become accustomed to using the `y | trials() ~ ...` syntax when defining our `formula` arguments for binomial models. Here we are replacing `trials()` with `vint()`. From Bürkner's *Define custom response distributions with brms* vignette, we read:
> To provide information about the number of trials (an integer variable), we are going to use the addition argument `vint()`, which can only be used in custom families. Similarly, if we needed to include additional vectors of real data, we would use `vreal()`. Actually, for this particular example, we could more elegantly apply the addition argument `trials()` instead of `vint()` as in the basic binomial model. However, since the present vignette is meant to give a general overview of the topic, we will go with the more general method.
>
> We now have all components together to fit our custom beta-binomial model:
```{r b12.1}
b12.1 <-
brm(data = d,
family = beta_binomial2, # here's our custom likelihood
admit | vint(applications) ~ 0 + gid,
prior = c(prior(normal(0, 1.5), class = b),
prior(exponential(1), class = phi)),
iter = 2000, warmup = 1000, cores = 4, chains = 4,
stanvars = stanvars, # note our `stanvars`
seed = 12,
file = "fits/b12.01")
```
Success, our results look a lot like those in the text!
```{r}
print(b12.1)
```
Just remember that, perhaps confusingly, what McElreath's output called `theta`, our **brms** output is calling `phi`. I know; this section is a lot. Keep your chin up! Here's what the corresponding `as_draws_df()` data object looks like.
```{r}
post <- as_draws_df(b12.1)
head(post)
```
Now we can compute and summarize a contrast between the two genders, what McElreath called `da`.
```{r, warning = F, message = F}
library(tidybayes)
post %>%
transmute(da = b_gid1 - b_gid2) %>%
mean_qi(.width = .89) %>%
mutate_if(is.double, round, digits = 3)
```
Much like in the text, the difference between genders on admission rates is near zero with wide uncertainty intervals spanning in either direction.
To stay within the **tidyverse** while making the many thin lines in Figure 12.1.a, we're going to need to do a bit of data processing. First, we'll want a variable to index the rows in `post` (i.e., to index the posterior draws). And we'll want to convert `b_gid2` to the $\bar p$ metric with the `inv_logit_scaled()` function. Then we'll use `slice_sample()` to randomly draw a subset of the posterior draws. With the `expand_grid()` function, we'll insert a dense sequence of `x` values ranging between 0 and 1--the parameter space of the beta distribution. Finally, we'll use `pmap_dbl()` to compute the density values for the `rethinking::dbeta2` distribution corresponding to the unique combination of `x`, `p_bar`, and `phi` values in each row.
```{r, warning = F}
set.seed(12)
lines <-
post %>%
mutate(p_bar = inv_logit_scaled(b_gid2)) %>%
slice_sample(n = 100) %>%
select(.draw, p_bar, phi) %>%
expand_grid(x = seq(from = 0, to = 1, by = .005)) %>%
mutate(density = pmap_dbl(list(x, p_bar, phi), rethinking::dbeta2))
str(lines)
```
All that was just for the thin lines. To make the thicker line for the posterior mean, we'll get tricky with `stat_function()`.
```{r, fig.width = 4, fig.height = 3.5}
lines %>%
ggplot(aes(x = x, y = density)) +
stat_function(fun = rethinking::dbeta2,
args = list(prob = mean(inv_logit_scaled(post %>% pull(b_gid2))),
theta = mean(post %>% pull(phi))),
linewidth = 1.5, color = canva_pal("Green fields")(4)[4]) +
geom_line(aes(group = .draw),
alpha = .2, color = canva_pal("Green fields")(4)[4]) +
scale_y_continuous(NULL, breaks = NULL, limits = c(0, 3)) +
labs(subtitle = "distribution of female admission rates",
x = "probability admit")
```
There are other ways to do this. For ideas, check out my blog post, [*Make rotated Gaussians, Kruschke style*](https://solomonkurz.netlify.app/blog/2018-12-20-make-rotated-gaussians-kruschke-style/).
Before we can do our variant of Figure 12.1.b, we'll need to define a few more custom functions. The `log_lik_beta_binomial2()` and `posterior_predict_beta_binomial2()` functions are required for `brms::predict()` to work with our `family = beta_binomial2` brmsfit object. Similarly, `posterior_epred_beta_binomial2()` is required for `brms::fitted()` to work properly. And before all that, we need to throw in a line with the `expose_functions()` function. Just go with it.
```{r, results = 'hide', warning = F, message = F}
expose_functions(b12.1, vectorize = TRUE)
# required to use `predict()`
log_lik_beta_binomial2 <- function(i, prep) {
mu <- brms::get_dpar(prep, "mu", i = i)
phi <- brms::get_dpar(prep, "phi", i = i)
trials <- prep$data$vint1[i]
y <- prep$data$Y[i]
beta_binomial2_lpmf(y, mu, phi, trials)
}
posterior_predict_beta_binomial2 <- function(i, prep, ...) {
mu <- brms::get_dpar(prep, "mu", i = i)
phi <- brms::get_dpar(prep, "phi", i = i)
trials <- prep$data$vint1[i]
beta_binomial2_rng(mu, phi, trials)
}
# required to use `fitted()`
posterior_epred_beta_binomial2 <- function(prep) {
mu <- brms::get_dpar(prep, "mu")
trials <- prep$data$vint1
trials <- matrix(trials, nrow = nrow(mu), ncol = ncol(mu), byrow = TRUE)
mu * trials
}
```
With those intermediary steps out of the way, we're ready to make Figure 12.1.b.
```{r, fig.width = 4, fig.height = 3.5}
# the prediction intervals
predict(b12.1) %>%
data.frame() %>%
transmute(ll = Q2.5,
ul = Q97.5) %>%
bind_cols(
# the fitted intervals
fitted(b12.1) %>% data.frame(),
# the original data used to fit the model
b12.1$data
) %>%
mutate(case = 1:12) %>%
# plot!
ggplot(aes(x = case)) +
geom_linerange(aes(ymin = ll / applications,
ymax = ul / applications),
color = canva_pal("Green fields")(4)[1],
linewidth = 2.5, alpha = 1/4) +
geom_pointrange(aes(ymin = Q2.5 / applications,
ymax = Q97.5 / applications,
y = Estimate/applications),
color = canva_pal("Green fields")(4)[4],
linewidth = 1/2, shape = 1) +
geom_point(aes(y = admit/applications),
color = canva_pal("Green fields")(4)[2],
size = 2) +
scale_x_continuous(breaks = 1:12) +
scale_y_continuous(breaks = 0:5 / 5, limits = c(0, 1)) +
labs(subtitle = "Posterior validation check",
caption = expression(italic(Note.)*" A = admittance probability"),
y = "A") +
theme(axis.ticks.x = element_blank(),
legend.position = "none")
```
As in the text, the raw data are consistent with the prediction intervals. But those intervals are so incredibly wide, they're hardly an endorsement of the model. Once we learn about hierarchical models, we'll be able to do much better.
### Negative-binomial or gamma-Poisson.
Recall from the last chapter how the Poisson distribution presumes $\sigma^2$ scales with $\mu$. The negative binomial distribution relaxes this assumption and presumes "each Poisson count observation has its own rate. It estimates the shape of a gamma distribution to describe the Poisson rates across cases" (p. 373).
If you execute `?dgamma`, you'll see that base **R** will allow you to define the gamma distribution with either the `shape` and `rate` or the `shape` and `scale`. If we define gamma in terms of shape and rate, it follows the formula
$$\operatorname{Gamma}(y | \alpha, \beta) = \frac{\beta^\alpha y^{\alpha - 1} e^{-\beta y}}{\Gamma (\alpha)},$$
where $\alpha$ is the shape, $\beta$ is the rate, $e$ is base of the natural logarithm, and $\Gamma$ is the [gamma function](https://en.wikipedia.org/wiki/Gamma_function). It turns out the rate and scale parameters are the reciprocals of each other. Thus if you'd like to define a gamma distribution in terms of shape and scale, it would follow the formula
$$\operatorname{Gamma}(y | \alpha, \theta) = \frac{y^{\alpha - 1} e^{-y /\theta}}{\theta^\alpha \Gamma (\alpha)},$$
where $\alpha$, $e$, and $\Gamma$ are all as they were before and $\theta$ is the scale parameter. If that all wasn't complicated enough, it turns out there's one more way to define a gamma distribution. You can use the mean and shape. This would follow the formula
$$\operatorname{Gamma}(y | \mu, \alpha) = \frac{ \big (\frac{\alpha}{\mu} \big)^\alpha}{\Gamma (\alpha)} y^{\alpha - 1} \exp (- \frac{\alpha y}{\mu}),$$
where $\alpha$ and $\Gamma$ are still the shape and gamma function, respectively, and $\mu$ is the mean. I know, this is a lot and it probably all seems really abstract, right now. Think of this section as a reference. As you'll see after we fit our model, you may well need it. Returning to the content in the text, we might express the gamma-Poisson (negative binomial) as
$$y_i \sim \operatorname{Gamma-Poisson}(\mu, \alpha),$$
where $\mu$ is the mean or rate, taking the place of $\lambda$ from the Poisson distribution, and $\alpha$ is the shape. I should warn you that this notation is a little different from the notation McElreath used in the text (p. 374), where he used $\lambda$ in place of our $\mu$ and $\phi$ in place of our $\alpha$. The meaning is really the same. The reason I have diverged from McElreath's notation is to emphasize the connection with the walk-out, above. If you want to add injury to insult, compare both of these notations with the notation Bürkner used in his [-@Bürkner2022Parameterization] vignette, [*Parameterization of response distributions in brms*](https://cran.r-project.org/package=brms/vignettes/brms_families.html#ordinal-and-categorical-models).
Since this is all already a pedagogical nightmare, let's throw in one more technical tidbit before we start fitting models. I don't believe McElreath plainly explained this in the text, but when we talk about the gamma-Poisson model as having two parameters, the $\lambda$ parameter (a.k.a. $\mu$) is doing double duty. As with conventional Poisson models, $\lambda$ is the mean for the criterion. What might not be as clear is that $\lambda$ is also the mean of the gamma mixture distribution. So when you want to get a sense of the overall shape of the gamma-Poisson model-implied distribution of $\lambda$ parameters--one for each case in the data set--the distribution is gamma with a mean of $\lambda$ and shape of $\phi$ (a.k.a. $\alpha$). For a more detailed walk out of this, see Section 8.2 in @hilbeNegativeBinomialRegression2011.
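If you'd like to see that double duty in action, here's a small simulation sketch. This isn't in the text, and the `mu` and `alpha` values are arbitrary. Each observation gets its own $\lambda_i$ drawn from a gamma distribution with mean `mu` and shape `alpha`--which, given the shape-scale parameterization above, means a scale of `mu / alpha`--and the resulting counts are noticeably more variable than a pure Poisson with the same mean.
```{r}
set.seed(12)
mu    <- 30  # mean of the gamma mixture distribution
alpha <- 2   # shape; smaller values imply more rate variation
# each observation gets its own Poisson rate, then its own count
lambda_i <- rgamma(1e5, shape = alpha, scale = mu / alpha)
y_mixed  <- rpois(1e5, lambda = lambda_i)
# a pure Poisson with the same mean, for reference
y_pure <- rpois(1e5, lambda = mu)
# the means match, but the gamma-Poisson counts are over-dispersed
c(mean_pure = mean(y_pure), mean_mixed = mean(y_mixed),
  var_pure  = var(y_pure),  var_mixed  = var(y_mixed))
```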
Okay, that's enough technical background. Let's load the `Kline` data and bring this down to Earth with some models.
```{r, warning = F, message = F}
data(Kline, package = "rethinking")
d <-
Kline %>%
mutate(p = rethinking::standardize(log(population)),
contact_id = ifelse(contact == "high", 2L, 1L),
cid = contact)
rm(Kline)
print(d)
```
If you take a look at McElreath's `m12.2`, you'll see it's a gamma-Poisson version of the non-linear model he fit in the last chapter, `m11.11`. You might also recall that we had to employ somewhat complicated non-linear syntax to translate that model into **brms**. Instead of jumping straight into a similarly complicated gamma-Poisson version of that model, I'm going to warm us up with a simple intercept-only model of the data. The formula will be
\begin{align*}
\text{total_tools}_i & \sim \operatorname{Gamma-Poisson} (\mu, \alpha) \\
\text{log}(\mu) & = \beta_0 \\
\beta_0 & \sim \operatorname{Normal}(3, 0.5) \\
\alpha & \sim \operatorname{Gamma}(0.01, 0.01),
\end{align*}
where we have deviated from McElreath's convention of using $\alpha$ for the model intercept in favor of $\beta_0$. This is because **brms** parameterizes the gamma likelihood in terms of $\mu$ and shape and, as we discussed above, shape is typically denoted as $\alpha$. I mean, technically we could refer to the shape parameter as $\psi$ or $\xi$ or whatever, but then we'd just be abandoning one convention for another. There's no way to win, here. *Sigh*. The prior for $\beta_0$ is the same one we used for the intercept way back for model `b11.9`. We have assigned a gamma prior for our troublesome new $\alpha$ (shape) parameter. Here's where that prior came from.
```{r}
get_prior(data = d,
family = negbinomial,
total_tools ~ 1)
```
`gamma(0.01, 0.01)` is the **brms** default for the `shape` parameter in this model. Within **brms**, priors using the gamma distribution are based on the shape-rate ($\alpha$-$\beta$) parameterization. This is what $\operatorname{Gamma}(0.01, 0.01)$ looks like.
```{r, fig.width = 3.5, fig.height = 3, warning = F}
ggplot(data = tibble(x = seq(from = 0, to = 60, by = .1)),
aes(x = x, y = dgamma(x, shape = 0.01, rate = 0.01))) +
geom_area(color = "transparent",
fill = canva_pal("Green fields")(4)[2]) +
scale_x_continuous(NULL) +
scale_y_continuous(NULL, breaks = NULL) +
coord_cartesian(xlim = c(0, 50)) +
ggtitle(expression(brms~default~gamma(0.01*", "*0.01)~shape~prior))
```
Let's fit the model.
```{r b12.2a}
b12.2a <-
brm(data = d,
family = negbinomial,
total_tools ~ 1,
prior = c(prior(normal(3, 0.5), class = Intercept), # beta_0
prior(gamma(0.01, 0.01), class = shape)), # alpha
iter = 2000, warmup = 1000, cores = 4, chains = 4,
seed = 12,
file = "fits/b12.02a")
```
Notice how you use the language of `family = negbinomial` to fit these models with **brms**. Here's the summary.
```{r}
print(b12.2a)
```
The intercept is our estimate of $\log \mu$, similar to $\log \lambda$ from a simple Poisson model. The `shape` is our estimate of, well, the shape ($\alpha$). To help us get a sense of what this model is, let's use the `brms::predict()` function to return random samples from the posterior predictive distribution. Because we want random samples instead of summary values, we will specify `summary = F`. Let's take a look at what this returns.
```{r}
p <-
predict(b12.2a,
summary = F)
p %>%
str()
```
Because we have 4,000 posterior iterations, we also get back 4,000 rows. We have 10 columns, which correspond to the 10 rows (i.e., cultures) in the original data. In the next block, we'll convert that output to a data frame and wrangle a little before plotting the results.
```{r, fig.width = 7, fig.height = 3}
p %>%
data.frame() %>%
set_names(d$culture) %>%
pivot_longer(everything(),
names_to = "culture",
values_to = "lambda") %>%
ggplot(aes(x = lambda)) +
geom_density(color = "transparent", fill = canva_pal("Green fields")(4)[2]) +
scale_x_continuous(expression(lambda["[culture]"]), breaks = 0:2 * 100) +
scale_y_continuous(NULL, breaks = NULL) +
coord_cartesian(xlim = c(0, 210)) +
facet_wrap(~ culture, nrow = 2)
```
Because this model had no predictors, we have similar posterior-predictive distributions for each case in the data. It's important, however, to be very clear about what these posterior-predictive distributions are of. They are not of the data, per se. Let's look back at the text:
> A **negative-binomial** model, more usefully called a **gamma-Poisson** model, assumes that each Poisson count observation has its own rate. It estimates the shape of the gamma distribution to describe the Poisson rates across cases. (p. 373, **emphasis** in the original)
As a reminder, the "rate" for the Poisson distribution is just another word for the mean, also called $\lambda$. So unlike a simple Poisson model where we use the individual cases to estimate one overall $\lambda$, here we're presuming each case has its own $\lambda_i$. There are 10 $\lambda_i$ values that generated our data and if we look at those $\lambda_i$ values on the whole, their distribution can be described with a gamma distribution. And again, this is not a gamma distribution for our data. This is a gamma distribution of the $\lambda_i$ values from the 10 separate Poisson distributions that presumably made our data.
After exponentiating the intercept parameter $(\log \mu)$, here are the posterior distributions for those two gamma parameters.
```{r, fig.width = 5.5, fig.height = 2.5, warning = F}
post <- as_draws_df(b12.2a)
post %>%
mutate(mu = exp(b_Intercept),
alpha = shape) %>%
pivot_longer(mu:alpha, names_to = "parameter") %>%
ggplot(aes(x = value)) +
geom_density(color = "transparent", fill = canva_pal("Green fields")(4)[2]) +
scale_y_continuous(NULL, breaks = NULL) +
labs(title = "Behold our gamma parameters!",
x = "posterior") +
facet_wrap(~ parameter, scales = "free", labeller = label_parsed)
```
We might want to use these parameter estimates to visualize the model-implied gamma distribution of $\lambda$ parameters. But recall that the base **R** `dgamma()` function doesn't take the mean. It is based on either the `shape` and `rate` or the `shape` and `scale`. Since we already have the shape ($\alpha$), we need a way to compute the scale or rate. Happily, we can define the scale in terms of the mean and the shape with the equation
$$\theta = \frac{\mu}{\alpha}.$$
Behold $\theta$ in a plot.
```{r, fig.width = 3, fig.height = 2.5}
post %>%
mutate(mu = exp(b_Intercept),
alpha = shape) %>%
mutate(theta = mu / alpha) %>%
ggplot(aes(x = theta)) +
geom_density(color = "transparent", fill = canva_pal("Green fields")(4)[2]) +
scale_y_continuous(NULL, breaks = NULL) +
labs(subtitle = expression(We~define~the~scale~as~theta==mu/alpha),
x = "posterior") +
coord_cartesian(xlim = c(0, 40))
```
Now that we know how to get both $\alpha$ and $\theta$ from the model, we can pump them into `dgamma()` to get a sense of the model-implied gamma distribution, the presumed underlying distribution of $\lambda$ values that generated the `total_tools` data.
```{r, fig.width = 3.5, fig.height = 3, warning = F}
set.seed(12)
# wrangle to get 200 draws
post %>%
mutate(alpha = shape,
theta = exp(b_Intercept) / shape) %>%
slice_sample(n = 200) %>%
expand_grid(x = 0:250) %>%
mutate(density = dgamma(x, shape = alpha, scale = theta)) %>%
# plot
ggplot(aes(x = x, y = density)) +
geom_line(aes(group = .draw),
alpha = .1, color = canva_pal("Green fields")(4)[4]) +
scale_y_continuous(NULL, breaks = NULL) +
labs(subtitle = expression("200 credible gamma densities for "*lambda),
x = expression(lambda)) +
coord_cartesian(xlim = c(0, 170),
ylim = c(0, 0.045))
```
Now that we've warmed up with an intercept-only gamma-Poisson, it's time to fit a **brms** version of McElreath's `m12.2`. Our model formula will be
\begin{align*}
\text{total_tools}_i & \sim \operatorname{Gamma-Poisson} (\mu_i, \alpha) \\
\mu_i & = \exp (\beta_{0,\text{cid}[i]}) \text{population}_i^{\beta_{1,\text{cid}[i]}} / \gamma \\
\beta_{0,j} & \sim \operatorname{Normal}(1, 1) \\
\beta_{1,j} & \sim \operatorname{Exponential}(1) \\
\gamma & \sim \operatorname{Exponential}(1) \\
\alpha & \sim \operatorname{Exponential}(1),
\end{align*}
where $\mu$ and $\alpha$ are the mean and shape of the gamma distribution for the case-specific $\lambda$ parameters. Here's how we might fit that model with **brms**.
```{r b12.2b}
b12.2b <-
brm(data = d,
family = negbinomial(link = "identity"),
bf(total_tools ~ exp(b0) * population^b1 / g,
b0 + b1 ~ 0 + cid,
g ~ 1,
nl = TRUE),
prior = c(prior(normal(1, 1), nlpar = b0),
prior(exponential(1), nlpar = b1, lb = 0),
prior(exponential(1), nlpar = g, lb = 0),
prior(exponential(1), class = shape)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 12,
file = "fits/b12.02b")
```
Here is the model summary.
```{r}
print(b12.2b)
```
Compute and check the PSIS-LOO estimates along with their diagnostic Pareto $k$ values.
```{r, message = F}
b12.2b <- add_criterion(b12.2b, "loo")
loo(b12.2b)
```
One of those Pareto $k$ values is still on the high side. Can you guess which one that is?
```{r}
d %>%
mutate(k = b12.2b$criteria$loo$diagnostics$pareto_k) %>%
filter(k > .5) %>%
select(culture, k)
```
Before we can make our version of Figure 12.2, we'll need to reload `b11.11` from the last chapter. One way is with the `readRDS()` function.
```{r}
b11.11 <- readRDS("fits/b11.11.rds")
```
Here we make the left panel of the figure.
```{r}
# the new data
nd <-
distinct(d, cid) %>%
expand_grid(population = seq(from = 0, to = 300000, length.out = 100))
p1 <-
# compute the expected trajectories
fitted(b11.11,
newdata = nd,
probs = c(.055, .945)) %>%
data.frame() %>%
bind_cols(nd) %>%
# plot
ggplot(aes(x = population, group = cid, color = cid)) +
geom_smooth(aes(y = Estimate, ymin = Q5.5, ymax = Q94.5, fill = cid),
stat = "identity",
alpha = 1/4, linewidth = 1/2) +
geom_point(data = bind_cols(d, b11.11$criteria$loo$diagnostics),
aes(y = total_tools, size = pareto_k),
alpha = 4/5) +
labs(subtitle = "pure Poisson model",
y = "total tools")
```
Now make the right panel.
```{r, fig.width = 3.5, fig.height = 3.25}
# for the annotation
text <-
distinct(d, cid) %>%
mutate(population = c(150000, 110000),
total_tools = c(57, 69),
label = str_c(cid, " contact"))
p2 <-
fitted(b12.2b,
newdata = nd,
probs = c(.055, .945)) %>%
data.frame() %>%
bind_cols(nd) %>%
ggplot(aes(x = population, group = cid, color = cid)) +
geom_smooth(aes(y = Estimate, ymin = Q5.5, ymax = Q94.5, fill = cid),
stat = "identity",
alpha = 1/4, linewidth = 1/2) +
geom_point(data = bind_cols(d, b12.2b$criteria$loo$diagnostics),
aes(y = total_tools, size = pareto_k),
alpha = 4/5) +
geom_text(data = text,
aes(y = total_tools, label = label)) +
scale_y_continuous(NULL, labels = NULL) +
labs(subtitle = "gamma-Poisson model")
```
Combine the two ggplots with **patchwork** syntax to make the full version of Figure 12.2.
```{r, fig.width = 7.25, fig.height = 3.25}
library(patchwork)
(p1 | p2) &
scale_fill_manual(values = canva_pal("Green fields")(4)[c(4, 1)]) &
scale_color_manual(values = canva_pal("Green fields")(4)[c(4, 1)]) &
scale_size(range = c(2, 5)) &
scale_x_continuous("population", breaks = c(0, 50000, 150000, 250000)) &
coord_cartesian(xlim = range(d$population),
ylim = range(d$total_tools)) &
theme(axis.ticks = element_blank(),
legend.position = "none")
```
Oh man!
> Recall that Hawaii was a highly influential point in the pure Poisson model. It does all the work of pulling the low-contact trend down. In this new model, Hawaii is still influential, but it exerts a lot less influence on the trends. Now the high and low contact trends are much more similar, very hard to reliably distinguish. This is because the gamma-Poisson model expects rate variation, and the estimated amount of variation is quite large. Population is still strongly related to the total tools, but the influence of contact rate has greatly diminished. (p. 374)
Before we move on, let's use `predict()` to generate posterior predictive distributions for each of our 10 cultures.
```{r, fig.width = 7, fig.height = 3, warning = F, message = F}
predict(b12.2b,
summary = F) %>%
data.frame() %>%
set_names(d$culture) %>%
pivot_longer(everything(),
names_to = "culture",
values_to = "lambda") %>%
left_join(d, by = "culture") %>%
ggplot(aes(x = lambda, y = 0)) +
stat_halfeye(point_interval = mean_qi, .width = .5,
fill = canva_pal("Green fields")(4)[2],
color = canva_pal("Green fields")(4)[1]) +
geom_vline(aes(xintercept = total_tools),
color = canva_pal("Green fields")(4)[3]) +
scale_x_continuous(expression(lambda["[culture]"]), breaks = 0:2 * 100) +
scale_y_continuous(NULL, breaks = NULL) +
coord_cartesian(xlim = c(0, 210)) +
facet_wrap(~ culture, nrow = 2)
```
Because we used predictors in the model, this time the posterior predictive distributions differ across the cultures. The mean and interquartile range of each distribution are marked off by the light-green dot and horizontal line below each. The vertical lines in the foreground mark off the corresponding `total_tools` values from the data. Recall that these distributions are not based on the `total_tools` values themselves, but rather are estimates of the $\lambda$ values from the underlying Poisson distributions that might have generated such `total_tools` values.
### Over-dispersion, entropy, and information criteria.
> In terms of model comparison using information criteria, a beta-binomial model is a binomial model, and a gamma-Poisson (negative-binomial) is a Poisson model.
>
> You should not use WAIC and PSIS with these models, however, unless you are very sure of what you are doing. The reason is that while ordinary binomial and Poisson models can be aggregated and disaggregated across rows in the data, without changing any causal assumptions, the same is not true of beta-binomial and gamma-Poisson models. The reason is that a beta-binomial or gamma-Poisson likelihood applies an unobserved parameter to each row in the data. When we then go to calculate log-likelihoods, how the data are structured will determine how the beta-distributed or gamma-distributed variation enters the model. (pp. 374--375)
## Zero-inflated outcomes
> Very often, the things we can measure are not emissions from any pure process. Instead, they are *mixtures* of multiple processes. Whenever there are different causes for the same observation, then a **mixture model** may be useful. A mixture model uses more than one simple probability distribution to model a mixture of causes. In effect, these models use more than one likelihood for the same outcome variable.
>
> Count variables are especially prone to needing a mixture treatment. The reason is that a count of zero can often arise more than one way. A "zero" means that nothing happened, and nothing can happen either because the rate of events is low or rather because the process that generates events failed to get started. (p. 376, **emphasis** in the original)
#### Rethinking: Breaking the law.
McElreath discussed how advances in computing have made it possible for working scientists to define their own data generating models. If you'd like to dive deeper into the topic, check out Bürkner's [-@Bürkner2022Define] vignette, [*Define custom response distributions with brms*](https://CRAN.R-project.org/package=brms/vignettes/brms_customfamilies.html).
### Example: Zero-inflated Poisson.
Do you remember the monk data from back in [Chapter 11][Example: Exposure and the offset.]? Here we simulate some more. This time we'll work in a little alcohol.
```{r}
# define parameters
prob_drink <- 0.2 # 20% of days
rate_work <- 1 # average 1 manuscript per day
# sample one year of production
n <- 365
# simulate days monks drink
set.seed(365)
drink <- rbinom(n, size = 1, prob = prob_drink)
# simulate manuscripts completed
y <- (1 - drink) * rpois(n, lambda = rate_work)
```
We'll put those data in a tidy tibble before plotting.
```{r, fig.width = 4.5, fig.height = 3}
d <-
tibble(drink = factor(drink, levels = 1:0),
y = y)
d %>%
ggplot(aes(x = y)) +
geom_histogram(aes(fill = drink),
binwidth = 1, linewidth = 1/10, color = "grey92") +
scale_fill_manual(values = canva_pal("Green fields")(4)[1:2]) +
xlab("Manuscripts completed") +
theme(legend.position = "none")
```
With these data, the likelihood of observing a zero on `y` (i.e., the likelihood that zero manuscripts were completed on a given occasion) is
\begin{align*}
\operatorname{Pr}(0 | p, \lambda) & = \operatorname{Pr}(\text{drink} | p) + \operatorname{Pr}(\text{work} | p) \times \operatorname{Pr}(0 | \lambda) \\
& = p + (1 - p) \exp (- \lambda).
\end{align*}
And
> since the Poisson likelihood of $y$ is $\operatorname{Pr}(y | \lambda) = \lambda^y \exp (- \lambda) / y!$, the likelihood of $y = 0$ is just $\exp (- \lambda)$. The above is just the mathematics for:
>
>> *The probability of observing a zero is the probability that the monks didn't drink OR ($+$) the probability that the monks worked AND ($\times$) failed to finish anything*.
>
> And the likelihood of a non-zero value $y$ is:
>
> $$\operatorname{Pr}(y | y > 0, p, \lambda) = \operatorname{Pr}(\text{drink} | p) (0) + \operatorname{Pr}(\text{work} | p) \operatorname{Pr}(y | \lambda) = (1 - p) \frac {\lambda^y \exp (- \lambda)}{y!}$$
>
> Since drinking monks never produce $y > 0$, the expression above is just the chance the monks both work $1 - p$, and finish $y$ manuscripts. (pp. 377--378, *emphasis* in the original)
So letting $p$ be the probability $y$ is zero and $\lambda$ be the mean of the underlying Poisson distribution, the zero-inflated Poisson ($\operatorname{ZIPoisson}$) regression model might take the basic form
\begin{align*}
y_i & \sim \operatorname{ZIPoisson}(\color{#5a5f37}{p_i}, \color{#524a3a}{\lambda_i})\\
\color{#5a5f37}{\operatorname{logit}(p_i)} & \color{#5a5f37}= \color{#5a5f37}{\alpha_p + \beta_p x_i} \\
\color{#524a3a}{\log (\lambda_i)} & \color{#524a3a}= \color{#524a3a}{\alpha_\lambda + \beta_\lambda x_i},
\end{align*}
where both parameters in the likelihood, $p_i$ and $\lambda_i$, might get their own statistical model, making this a special case of what Bürkner [-@Bürkner2022Distributional] calls distributional models. One last thing to note is that in **brms**, $p_i$ is denoted `zi`. To fit a zero-inflated Poisson model with **brms**, make sure to specify the correct likelihood with `family = zero_inflated_poisson`. To use a non-default prior for `zi`, make sure to indicate `class = zi` within the `prior()` function.
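Before fitting, we might also check that math from above against our simulated monks. Here's a hand-rolled ZIP probability mass function--just a sketch for building intuition, not how **brms** computes things internally--compared with the empirical proportions in our simulated data `d`.
```{r}
# the probability of a zero is p + (1 - p) * exp(-lambda);
# the probability of a nonzero y is (1 - p) times the ordinary Poisson term
dzip <- function(y, p, lambda) {
  ifelse(y == 0,
         p + (1 - p) * exp(-lambda),
         (1 - p) * dpois(y, lambda = lambda))
}
tibble(y = 0:4) %>% 
  mutate(zip_probability     = dzip(y, p = prob_drink, lambda = rate_work),
         observed_proportion = map_dbl(y, ~ mean(d$y == .x)))
```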
```{r b12.3}
b12.3 <-
brm(data = d,
family = zero_inflated_poisson,
y ~ 1,
prior = c(prior(normal(1, 0.5), class = Intercept),
prior(beta(2, 6), class = zi)), # the brms default is beta(1, 1)
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 12,
file = "fits/b12.03")
```
```{r}
print(b12.3)
```
If you look at the [*Zero-inflated and hurdle models* section](https://cran.r-project.org/package=brms/vignettes/brms_families.html#zero-inflated-and-hurdle-models) of Bürkner's [-@Bürkner2022Parameterization] [*Parameterization of response distributions in brms*](https://cran.r-project.org/package=brms/vignettes/brms_families.html) document, you'll see the zero-inflated Poisson is set up a little differently in **brms** than it is in **rethinking**. The difference did not influence the estimate for the intercept, $\log \lambda$. Both here and in the text, that intercept was about zero. However, it did influence the summary of `zi`. Note how McElreath's `mean( inv_logit( post$ap ) )` returned 0.2241255, which seems rather close to our `zi` estimate of `r round(posterior_summary(b12.3)[2, 1], 3)`. Hopefully it's clear that `zi` in **brms** is already in the probability metric. There's no need to convert it. You can further confirm this by looking at the second line from the `print()` output, `Links: mu = log; zi = identity`. When there are no predictors for `zi`, the **brms** default is to use the identity link. In the text, however, McElreath used the logit link for `p`.
In the `prior` argument, we used `beta(2, 6)` for `zi` and also mentioned in the margin that the **brms** default is `beta(1, 1)`. The beta distribution ranges from 0 to 1, making it a natural distribution to use for priors on probabilities when using the identity link. To give you a sense of what those two versions of the beta look like, let's plot.
A typical way to plot a beta distribution would be to use the base **R** `dbeta()` function. Let's try a different approach, instead. The **tidybayes** package includes a `parse_dist()` function which takes the kind of string specifications you would usually include in the `brms::prior()` function and converts them into a format one can plot with. For example, here's what the preparatory work would be in our case.
```{r}
priors <-
c(prior(beta(1, 1)),
prior(beta(2, 6)))
priors %>%
parse_dist(prior)
```
The first several columns look a lot like the kind of output we'd get from the `brms::get_prior()` function. The `parse_dist()` function added those last two columns. Here we put them to work by feeding them into a ggplot.
```{r, fig.width = 3, fig.height = 2.5}
priors %>%
parse_dist(prior) %>%
ggplot(aes(y = prior, dist = .dist, args = .args, fill = prior)) +
stat_dist_halfeye(.width = .95) +
scale_fill_manual(values = canva_pal("Green fields")(4)[c(4, 1)]) +
scale_x_continuous("zi", breaks = c(0, .5, 1)) +
scale_y_discrete(expand = expansion(add = 0.1)) +
ylab(NULL) +
theme(legend.position = "none")
```
```{r, echo = F, eval = F}
# here's McElreath's prior for `ap`
tibble(n = rnorm(1e5, -1.5, 1)) %>%
mutate(p = inv_logit_scaled(n)) %>%
ggplot(aes(x = p)) +
geom_density(fill = canva_pal("Green fields")(4)[1])
```
Whereas the **brms** default is flat, our prior guided the posterior a bit toward 0. In case you were curious, we might write our statistical model for `b12.3` as
\begin{align*}
y_i & \sim \operatorname{ZIPoisson} (p, \lambda) \\
p & = \alpha_p \\
\log \lambda & = \alpha_\lambda \\
\alpha_p & \sim \operatorname{Beta}(2, 6) \\
\alpha_\lambda & \sim \operatorname{Normal}(1, 0.5).
\end{align*}
Anyway, here's that exponentiated $\alpha_\lambda$.
```{r}
fixef(b12.3)[1, ] %>%
exp()
```
#### Overthinking: Zero-inflated Poisson calculations in Stan.
If you're curious, here's the Stan code underlying our **brms** fit, `b12.3`.
```{r}
b12.3$model
```
## Ordered categorical outcomes
> It is very common in the social sciences, and occasional in the natural sciences, to have an outcome variable that is discrete, like a count, but in which the values merely indicate different ordered levels along some dimension. For example, if I were to ask you how much you like to eat fish, on a scale from 1 to 7, you might say 5. If I were to ask 100 people the same question, I'd end up with 100 values between 1 and 7. In modeling each outcome value, I'd have to keep in mind that these values are *ordered*, because 7 is greater than 6, which is greater than 5, and so on. The result is a set of **ordered categories**. Unlike a count, the differences in value are not necessarily equal....
>
> In principle, an ordered categorical variable is just a multinomial prediction problem (page 359). But the constraint that the categories be ordered demands special treatment....
>
> The conventional solution is to use a **cumulative link** function. The cumulative probability of a value is the probability of that value *or any smaller value*. In the context of ordered categories, the cumulative probability of 3 is the sum of the probabilities of 3, 2, and 1. Ordered categories by convention begin at 1, so a result less than 1 has no probability at all. By linking a linear model to cumulative probability, it is possible to guarantee the ordering of the outcomes. (p. 380, *emphasis* in the original)
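Here's a tiny toy example of the idea (it's not from the text): take some made-up probabilities for a 1-to-7 scale, convert them to cumulative probabilities with `cumsum()`, and then convert those to the log-cumulative-odds scale. The ordering of the categories is baked right into the monotonic increase.
```{r}
# made-up probabilities for outcome values 1 through 7
p_k <- c(.10, .15, .20, .25, .15, .10, .05)
# the cumulative probability of k is the probability of k or any smaller value
cum_p_k <- cumsum(p_k)
cum_p_k
# log-cumulative-odds; the largest value is always +Inf, so it needs no parameter
log(cum_p_k / (1 - cum_p_k))
```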
### Example: Moral intuition.
Let's get the `Trolley` data from **rethinking** [see @cushmanRoleConsciousReasoning2006].
```{r, message = F}
data(Trolley, package = "rethinking")
d <- Trolley
rm(Trolley)
```
Use `dplyr::glimpse()` to get a sense of the dimensions of the data.
```{r}
glimpse(d)
```
Though we have 9,930 rows, we only have 331 unique individuals.
```{r}
d %>%
distinct(id) %>%
count()
```
### Describing an ordered distribution with intercepts.
Make our version of the simple Figure 12.4 histogram of our primary variable, `response`.
```{r, fig.width = 2.5, fig.height = 3}
p1 <-
d %>%
ggplot(aes(x = response, fill = after_stat(x))) +
geom_histogram(binwidth = 1/4, linewidth = 0) +
scale_fill_gradient(low = canva_pal("Green fields")(4)[4],
high = canva_pal("Green fields")(4)[1]) +
scale_x_continuous(breaks = 1:7) +
theme(axis.ticks = element_blank(),
axis.title.y = element_text(angle = 90),
legend.position = "none")
p1
```
Our cumulative proportion plot, Figure 12.4.b, will require some pre-plot wrangling.
```{r, fig.width = 2.5, fig.height = 3}
p2 <-
d %>%
count(response) %>%
mutate(pr_k = n / nrow(d),
cum_pr_k = cumsum(pr_k)) %>%
ggplot(aes(x = response, y = cum_pr_k,
fill = response)) +
geom_line(color = canva_pal("Green fields")(4)[2]) +
geom_point(shape = 21, color = "grey92",
size = 2.5, stroke = 1) +
scale_fill_gradient(low = canva_pal("Green fields")(4)[4],
high = canva_pal("Green fields")(4)[1]) +
scale_x_continuous(breaks = 1:7) +
scale_y_continuous("cumulative proportion",
breaks = c(0, .5, 1), limits = c(0, 1)) +
theme(axis.ticks = element_blank(),
axis.title.y = element_text(angle = 90),
legend.position = "none")
p2
```
> Then to re-describe the histogram as log-cumulative odds, we'll need a series of intercept parameters. Each intercept will be on the log-cumulative-odds scale and stand in for the cumulative probability of each outcome. So this is just the application of the link function. The log-cumulative-odds that a response value $y_i$ is equal-to-or-less-than some possible outcome value $k$ is:
>
> $$\log \frac{\operatorname{Pr}(y_i \leq k)}{1 - \operatorname{Pr}(y_i \leq k)} = \alpha_k$$
>
> where $\alpha_k$ is an "intercept" unique to each possible outcome value $k$. (p. 383)
We can compute the $\alpha_k$ estimates directly with a little help from McElreath's custom `logit()` function.
```{r}
logit <- function(x) log(x / (1 - x)) # convenience function
d %>%
count(response) %>%
mutate(pr_k = n / nrow(d),
cum_pr_k = cumsum(n / nrow(d))) %>%
mutate(alpha = logit(cum_pr_k) %>% round(digits = 2))
```
Now we plot those joints to make our version of Figure 12.4.c.
```{r, fig.width = 2.5, fig.height = 3}
p3 <-
d %>%
count(response) %>%
mutate(cum_pr_k = cumsum(n / nrow(d))) %>%
filter(response < 7) %>%
# we can do the `logit()` conversion right in `ggplot()`
ggplot(aes(x = response, y = logit(cum_pr_k), fill = response)) +
geom_line(color = canva_pal("Green fields")(4)[2]) +
geom_point(shape = 21, colour = "grey92",
size = 2.5, stroke = 1) +
scale_fill_gradient(low = canva_pal("Green fields")(4)[4],
high = canva_pal("Green fields")(4)[1]) +
scale_x_continuous(breaks = 1:7, limits = c(1, 7)) +
ylab("log-cumulative-odds") +
theme(axis.ticks = element_blank(),
axis.title.y = element_text(angle = 90),
legend.position = "none")
p3
```
Why not combine the three subplots with patchwork?
```{r, fig.width = 7, fig.height = 3}
(p1 | p2 | p3) +
plot_annotation(title = "Re-describing a discrete distribution using log-cumulative-odds.")
```
The code for Figure 12.5 is itself something of a monster.
```{r, fig.width = 3.5, fig.height = 3}
# primary data
d_plot <-
d %>%
count(response) %>%
mutate(pr_k = n / nrow(d),
cum_pr_k = cumsum(n / nrow(d))) %>%
mutate(discrete_probability = ifelse(response == 1, cum_pr_k, cum_pr_k - pr_k))
# annotation
text <-
tibble(text = 1:7,
response = seq(from = 1.25, to = 7.25, by = 1),
cum_pr_k = d_plot$cum_pr_k - .065)
d_plot %>%
ggplot(aes(x = response, y = cum_pr_k,
color = cum_pr_k, fill = cum_pr_k)) +
geom_line(color = canva_pal("Green fields")(4)[1]) +
geom_point(shape = 21, colour = "grey92",
size = 2.5, stroke = 1) +
geom_linerange(aes(ymin = 0, ymax = cum_pr_k),
alpha = 1/2, color = canva_pal("Green fields")(4)[1]) +
geom_linerange(aes(x = response + .025,
ymin = ifelse(response == 1, 0, discrete_probability),
ymax = cum_pr_k),
color = "black") +