```{r, echo = F, cache = F}
knitr::opts_chunk$set(fig.retina = 2.5)
knitr::opts_chunk$set(fig.align = "center")
options(width = 100)
```
# Treating Time More Flexibly
> All the illustrative longitudinal data sets in previous chapters share two structural features that simplify analysis. Each is: (1) balanced--everyone is assessed on the identical number of occasions; and (2) time-structured--each set of occasions is identical across individuals. Our analyses have also been limited in that we have used only: (1) time-invariant predictors that describe immutable characteristics of individuals or their environment (except for *TIME* itself); and (2) a representation of *TIME* that forces the level-1 individual growth parameters to represent "initial status" and "rate of change."
>
> The multilevel model for change is far more flexible than these examples suggest. With little or no adjustment, you can use the same strategies to analyze more complex data sets. Not only can the waves of data be irregularly spaced, their number and spacing can vary across participants. Each individual can have his or her own data collection schedule and number of waves can vary without limit from person to person. So, too, predictors of change can be time-invariant or time-varying, and the level-1 submodel can be parameterized in a variety of interesting ways. [@singerAppliedLongitudinalData2003, p. 138, *emphasis* in the original]
## Variably spaced measurement occasions
> Many researchers design their studies with the goal of assessing each individual on an identical set of occasions...
>
> Yet sometimes, despite a valiant attempt to collect time-structured data, actual measurement occasions will differ. Variation often results from the realities of fieldwork and data collection...
>
> So, too, many researchers design their studies knowing full well that the measurement occasions may differ across participants. This is certainly true, for example, of those who use an *accelerated cohort* design in which an age-heterogeneous cohort of individuals is followed for a constant period of time. Because respondents initially vary in age, and age, not *wave*, is usually the appropriate metric for analyses (see the discussion of time metrics in section 1.3.2), observed measurement occasions will differ across individuals. (p. 139, *emphasis* in the original)
### The structure of variably spaced data sets.
You can find the PIAT data from the CNLSY study in the `reading_pp.csv` file.
```{r, warning = F, message = F}
library(tidyverse)
reading_pp <- read_csv("data/reading_pp.csv")
head(reading_pp)
```
On pages 141 and 142, Singer and Willett discussed the phenomenon of *occasion creep*, which is when "the temporal separation of occasions widens as the actual ages exceed design projections". Here's what that might look like.
```{r, fig.width = 6, fig.height = 2.5}
reading_pp %>%
ggplot(aes(x = age, y = wave)) +
geom_vline(xintercept = c(6.5, 8.5, 10.5), color = "white") +
geom_jitter(alpha = .5, height = .33, width = 0) +
scale_x_continuous(breaks = c(6.5, 8.5, 10.5)) +
scale_y_continuous(breaks = 1:3) +
ggtitle("This is what occasion creep looks like.",
subtitle = "As the waves go by, the variation of the ages widens and their central tendency\ncreeps away from the ideal point.") +
theme(panel.grid = element_blank())
```
Here's how we might make our version of Figure 5.1.
```{r, fig.width = 5.5, fig.height = 5, message = F}
set.seed(5)
# wrangle
reading_pp %>%
nest(data = c(wave, agegrp, age, piat)) %>%
sample_n(size = 9) %>%
unnest(data) %>%
# this will help format and order the facets
mutate(id = ifelse(id < 10, str_c("0", id), id) %>% str_c("id = ", .)) %>%
pivot_longer(contains("age")) %>%
# plot
ggplot(aes(x = value, y = piat, color = name)) +
geom_point(alpha = 2/3) +
stat_smooth(method = "lm", se = F, linewidth = 1/2) +
scale_color_viridis_d(NULL, option = "B", end = .5, direction = -1) +
xlab("measure of age") +
coord_cartesian(xlim = c(5, 12),
ylim = c(0, 80)) +
theme(panel.grid = element_blank()) +
facet_wrap(~ id)
```
Since it wasn't clear which `id` values the authors used in the text, we took a random sample of nine. Change the seed to view different samples.
### Postulating and fitting multilevel models with variably spaced waves of data.
The composite formula for our first model is
$$
\begin{align*}
\text{piat}_{ij} & = \gamma_{00} + \gamma_{10} (\text{agegrp}_{ij} - 6.5) + \zeta_{0i} + \zeta_{1i} (\text{agegrp}_{ij} - 6.5) + \epsilon_{ij} \\
\epsilon_{ij} & \sim \operatorname{Normal} (0, \sigma_\epsilon) \\
\begin{bmatrix}
\zeta_{0i} \\ \zeta_{1i}
\end{bmatrix} & \sim \operatorname{Normal}
\begin{pmatrix}
\begin{bmatrix} 0 \\ 0 \end{bmatrix},
\mathbf \Sigma
\end{pmatrix}, \text{where} \\
\mathbf \Sigma & = \mathbf D \mathbf\Omega \mathbf D', \text{where} \\
\mathbf D & = \begin{bmatrix} \sigma_0 & 0 \\ 0 & \sigma_1 \end{bmatrix} \text{and} \\
\mathbf \Omega & = \begin{bmatrix} 1 & \rho_{01} \\ \rho_{01} & 1 \end{bmatrix}
\end{align*}
$$
It's the same for the twin model using `age` rather than `agegrp`. Notice how we've switched from Singer and Willett's $\sigma^2$ parameterization to the $\sigma$ parameterization typical of **brms**.
```{r}
reading_pp <-
reading_pp %>%
mutate(agegrp_c = agegrp - 6.5,
age_c = age - 6.5)
head(reading_pp)
```
In the last chapter, we began familiarizing ourselves with `brms::brm()` default priors. It's time to level up. Another approach is to use domain knowledge to set weakly-informative priors. Let's start with the PIAT. The Peabody Individual Achievement Test is a standardized individual test of scholastic achievement. It yields several subtest scores. The reading subtest is the one we're focusing on, here. As is typical for such tests, the PIAT scores are normed to yield a population mean of 100 and a standard deviation of 15.
With that information alone, even a PIAT novice should have an idea about how to specify the priors. Since our sole predictor variables are versions of age centered at 6.5, we know that the model intercept is interpreted as the expected value on the PIAT when the children are 6.5 years old. If you knew nothing else, you'd guess the mean score would be 100 with a standard deviation around 15. One way to use a weakly-informative prior on the intercept would be to multiply that $SD$ by a number like 2.
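Just to get a sense of what that implies, here's a quick check of the span containing the middle 95% of a $\operatorname{Normal}(100, 30)$ prior.
```{r}
# the middle 95% of a normal(100, 30) prior
qnorm(c(.025, .975), mean = 100, sd = 30)
```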
Next we need a prior for the time variables, `age_c` and `agegrp_c`. A one-unit increase in either of these is the expected increase in the PIAT with one year's passage of age. Bringing in a little domain knowledge, IQ and achievement tests tend to be rather stable over time. However, we expect children to improve as they age, and we don't know exactly how these data have been adjusted for the children's ages. It's also important to know that it's typical within the Bayesian world to place Normal priors on $\beta$ parameters. So one approach would be to center the Normal prior on 0 and put something like twice the PIAT's standard deviation on the prior's $\sigma$. If we were PIAT researchers, we could do much better. But with minimal knowledge of the test, this approach certainly beats the defaults.
Next we have the variance parameters. Recall that `brms::brm()` defaults are Student's $t$-distributions with $\nu = 3$ and $\mu = 0$. Let's start there. Now we just need to put values on $\sigma$. Since the PIAT has a standard deviation of 15 in the population, why not just use 15? If you felt insecure about this, multiply it by a factor of 2 or so. Also recall that when a Student's $t$-distribution has $\nu = 3$, the tails are quite fat. Within the context of Bayesian priors, those fat tails make it easy for the likelihood to dominate the prior even when it's a good way into the tail.
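If you'd like a quick sense of just how fat those tails are, compare the probability mass more than three scale units from the mean under a Student's $t$ with $\nu = 3$ versus a Normal.
```{r}
# p(more than 3 scale units from the mean) under each distribution
2 * pt(-3, df = 3) # Student-t with nu = 3
2 * pnorm(-3)      # the Normal
```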
Finally, we have the correlation among the group-level variance parameters, $\sigma_0$ and $\sigma_1$. Recall that last chapter we learned the `brms::brm()` default was `lkj(1)`. To get a sense of what the LKJ does, we'll simulate from it. McElreath's **rethinking** package contains a handy `rlkjcorr()` function, which will allow us to simulate `n` draws from a `K` by `K` correlation matrix for which $\eta$ is defined by `eta`. Let's take `n <- 1e6` draws from two LKJ prior distributions, one with $\eta = 1$ and the other with $\eta = 4$.
```{r, warning = F, message = F}
library(rethinking)
n <- 1e6
set.seed(5)
lkj <-
tibble(eta = c(1, 4)) %>%
mutate(draws = purrr::map(eta, ~ rlkjcorr(n, K = 2, eta = .)[, 2, 1])) %>%
unnest(draws)
glimpse(lkj)
```
Now plot that `lkj`.
```{r, fig.width = 4, fig.height = 2.5}
lkj %>%
mutate(eta = factor(eta)) %>%
ggplot(aes(x = draws, fill = eta, color = eta)) +
geom_density(size = 0, alpha = 2/3) +
geom_text(data = tibble(
draws = c(.75, .35),
y = c(.6, 1.05),
label = c("eta = 1", "eta = 4"),
eta = c(1, 4) %>% as.factor()),
aes(y = y, label = label)) +
scale_fill_viridis_d(option = "A", end = .5) +
scale_color_viridis_d(option = "A", end = .5) +
scale_y_continuous(NULL, breaks = NULL) +
xlab(expression(rho)) +
theme(panel.grid = element_blank(),
legend.position = "none")
```
When we use `lkj(1)`, the prior is flat over the parameter space. However, setting `lkj(4)` is tantamount to a prior with a probability mass concentrated a bit towards zero. It's a prior that's skeptical of extremely large or small correlations. Within the context of our multilevel model $\rho$ parameters, this will be our weakly-regularizing prior.
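One way to quantify that skepticism is to compute how much prior mass each $\eta$ setting puts on extreme correlations, say beyond $\pm .8$.
```{r}
lkj %>% 
  group_by(eta) %>% 
  summarise(`p(|rho| > .8)` = mean(abs(draws) > .8))
```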
Let's prepare to fit our models and load **brms**.
```{r, warning = F, message = F}
detach(package:rethinking, unload = T)
library(brms)
```
Fit the models. They follow the same form, differing in that the first uses `agegrp_c` and the second uses `age_c`.
```{r fit5.1}
fit5.1 <-
brm(data = reading_pp,
family = gaussian,
piat ~ 0 + Intercept + agegrp_c + (1 + agegrp_c | id),
prior = c(prior(normal(100, 30), class = b, coef = Intercept),
prior(normal(0, 30), class = b, coef = agegrp_c),
prior(student_t(3, 0, 15), class = sd),
prior(student_t(3, 0, 15), class = sigma),
prior(lkj(4), class = cor)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 5,
file = "fits/fit05.01")
fit5.2 <-
brm(data = reading_pp,
family = gaussian,
piat ~ 0 + Intercept + age_c + (1 + age_c | id),
prior = c(prior(normal(100, 30), class = b, coef = Intercept),
prior(normal(0, 30), class = b, coef = age_c),
prior(student_t(3, 0, 15), class = sd),
prior(student_t(3, 0, 15), class = sigma),
prior(lkj(4), class = cor)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 5,
file = "fits/fit05.02")
```
Focusing first on `fit5.1`, our analogue to the $(\text{AGEGRP} - 6.5)$ model displayed in Table 5.2, here is our model summary.
```{r}
print(fit5.1, digits = 3)
```
Here's the `age_c` model.
```{r}
print(fit5.2, digits = 3)
```
For a more focused look, we can use `fixef()` to compare our $\gamma$'s with each other and with those in the text.
```{r}
fixef(fit5.1) %>% round(digits = 3)
fixef(fit5.2) %>% round(digits = 3)
```
Here are our $\sigma_\epsilon$ summaries.
```{r}
VarCorr(fit5.1)$residual$sd %>% round(digits = 3)
VarCorr(fit5.2)$residual$sd %>% round(digits = 3)
```
From a quick glance, you can see they are about the square roots of the $\sigma_\epsilon^2$ estimates in the text.
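For example, squaring the posterior mean for $\sigma_\epsilon$ from `fit5.1` puts it back into the $\sigma_\epsilon^2$ metric used in the text.
```{r}
# square the posterior mean for sigma_epsilon
VarCorr(fit5.1)$residual$sd[, "Estimate"]^2
```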
Let's go ahead and compute the LOO and WAIC.
```{r, warning = F, message = F}
fit5.1 <- add_criterion(fit5.1, criterion = c("loo", "waic"))
fit5.2 <- add_criterion(fit5.2, criterion = c("loo", "waic"))
```
Compare the models with a WAIC difference.
```{r}
loo_compare(fit5.1, fit5.2, criterion = "waic") %>%
print(simplify = F)
```
The WAIC difference between the two isn't that large relative to its standard error. The LOO tells a similar story.
```{r}
loo_compare(fit5.1, fit5.2, criterion = "loo") %>%
print(simplify = F)
```
The uncertainty in our WAIC and LOO estimates and their differences provides information that was not available for the AIC and the BIC comparisons in the text. We can also compare the WAIC and the LOO with model weights. Given the WAIC, from @mcelreathStatisticalRethinkingBayesian2015 we learn
> A total weight of 1 is partitioned among the considered models, making it easier to compare their relative predictive accuracy. The weight for a model $i$ in a set of $m$ models is given by:
>
> $$w_i = \frac{\exp(-\frac{1}{2} \text{dWAIC}_i)}{\sum_{j = 1}^m \exp(-\frac{1}{2} \text{dWAIC}_j)}$$
>
> where dWAIC is the `dWAIC` in the `compare` table output. This example uses WAIC but the formula is the same for any other information criterion, since they are all on the deviance scale. (p. 199)
The `compare()` function McElreath referenced is from his [-@R-rethinking] [**rethinking** package](https://xcelab.net/rm/software/), which is meant to accompany his [texts](https://xcelab.net/rm/statistical-rethinking/). We don't have that function with **brms**. A rough analogue to the `rethinking::compare()` function is `loo_compare()`. We don't quite have a `dWAIC` column from `loo_compare()`. Remember how last chapter we discussed how Aki Vehtari isn't a fan of converting information criteria to the $\chi^2$ difference metric with that last $-2 \times ...$ step? That's why we have an `elpd_diff` instead of a `dWAIC`. But to get the corresponding value, you just multiply those values by -2. And yet if you look closely at the formula for $w_i$, you'll see that each time the dWAIC term appears, it's multiplied by $-\frac{1}{2}$. So we don't really need that `dWAIC` value anyway. As it turns out, we're good to go with our `elpd_diff`. Thus the above equation simplifies to
$$
w_i = \frac{\exp(\text{elpd\_diff}_i)}{\sum_{j = 1}^m \exp(\text{elpd\_diff}_j)}
$$
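If you'd like to check that by hand, you could pull the `elpd_diff` column from the `loo_compare()` output and plug it into the simplified formula. The results should line up with the `model_weights()` output below.
```{r}
# weights by hand from the elpd differences
elpd_diff <- loo_compare(fit5.1, fit5.2, criterion = "waic")[, "elpd_diff"]
exp(elpd_diff) / sum(exp(elpd_diff))
```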
But recall you don't have to do any of this by hand. We have the `brms::model_weights()` function, which we can use to compute weights with the WAIC or the LOO.
```{r}
model_weights(fit5.1, fit5.2, weights = "waic") %>% round(digits = 3)
model_weights(fit5.1, fit5.2, weights = "loo") %>% round(digits = 3)
```
Both put the lion's share of the weight on the `age_c` model. Back to McElreath [-@mcelreathStatisticalRethinkingBayesian2015]:
> But what do these weights mean? There actually isn't a consensus about that. But here's Akaike’s interpretation, which is [common](https://www.springer.com/us/book/9780387953649).
>
>> *A model's weight is an estimate of the probability that the model will make the best predictions on new data, conditional on the set of models considered*.
>
> Here's the heuristic explanation. First, regard WAIC as the expected deviance of a model on future data. That is to say that WAIC gives us an estimate of $\text{E} (D_\text{test})$. Akaike weights convert these deviance values, which are log-likelihoods, to plain likelihoods and then standardize them all. This is just like Bayes' theorem uses a sum in the denominator to standardize the product of the likelihood and prior. Therefore the Akaike weights are analogous to posterior probabilities of models, conditional on expected future data. (p. 199, *emphasis* in the original)
## Varying numbers of measurement occasions
As Singer and Willett pointed out,
> once you allow the spacing of waves to vary across individuals, it is a small leap to allow their *number* to vary as well. Statisticians say that such data sets are *unbalanced.* As you would expect, balance facilitates analysis: models can be parameterized more easily, random effects can be estimated more precisely, and computer algorithms will converge more rapidly.
>
> Yet a major advantage of the multilevel model for change is that it is easily fit to unbalanced data. (p. 146, *emphasis* in the original)
```{r, echo = F}
# to free up memory, we'll drop these two
rm(fit5.1, fit5.2)
```
### Analyzing data sets in which the number of waves per person varies.
Here we load the `wages_pp.csv` data.
```{r, warning = F, message = F}
wages_pp <- read_csv("data/wages_pp.csv")
glimpse(wages_pp)
```
Here's a more focused look along the lines of Table 5.3.
```{r}
wages_pp %>%
select(id, exper, lnw, black, hgc, uerate) %>%
filter(id %in% c(206, 332, 1028))
```
To get a sense of the diversity in the number of occasions per `id`, use `group_by()` and `count()`.
```{r, fig.width = 6, fig.height = 3}
wages_pp %>%
count(id) %>%
ggplot(aes(y = n)) +
geom_bar() +
scale_y_continuous("# measurement occasions", breaks = 1:13) +
xlab("count of cases") +
theme(panel.grid = element_blank())
```
The spacing of the measurement occasions also differs a lot across cases. Recall that `exper` "identifies the specific moment--to the nearest day--in each man's labor force history associated with each observed value of" `lnw` (p. 147). Here's a sense of what that looks like.
```{r, fig.width = 6, fig.height = 2.25}
wages_pp %>%
filter(id %in% c(206, 332, 1028)) %>%
mutate(id = factor(id)) %>%
ggplot(aes(x = exper, y = lnw, color = id)) +
geom_point() +
geom_line() +
scale_color_viridis_d(option = "B", begin = .35, end = .8) +
theme(panel.grid = element_blank())
```
Uneven for dayz.
Here's the **brms** version of the composite formula for Model A, the unconditional growth model for `lnw`.
$$
\begin{align*}
\text{lnw}_{ij} & = \gamma_{00} + \gamma_{10} \text{exper}_{ij} + \zeta_{0i} + \zeta_{1i} \text{exper}_{ij} + \epsilon_{ij} \\
\epsilon_{ij} & \sim \operatorname{Normal}(0, \sigma_\epsilon) \\
\begin{bmatrix}
\zeta_{0i} \\ \zeta_{1i}
\end{bmatrix} & \sim \operatorname{Normal}
\begin{pmatrix}
\begin{bmatrix} 0 \\ 0 \end{bmatrix},
\mathbf\Sigma
\end{pmatrix}, \text{where} \\
\mathbf\Sigma & = \mathbf D \mathbf \Omega \mathbf D', \text{where} \\
\mathbf D & = \begin{bmatrix} \sigma_0 & 0 \\ 0 & \sigma_1 \end{bmatrix}, \text{and} \\
\mathbf\Omega & = \begin{bmatrix} 1 & \rho_{01} \\ \rho_{01} & 1 \end{bmatrix}
\end{align*}
$$
To attempt setting priors for this, we need to review what `lnw` is. From the text: "To adjust for inflation, each hourly wage is expressed in constant 1990 dollars. To address the skewness commonly found in wage data and to linearize the individual wage trajectories, we analyze the natural logarithm of wages, *LNW*" (p. 147). So it's the log of participant wages in 1990 dollars. From the official [US Social Security website](https://www.ssa.gov/oact/cola/central.html), we learn the average yearly wage in 1990 was $20,172.11. Here's the natural log of that.
```{r}
log(20172.11)
```
However, that's the yearly wage. In the text, this is conceptualized as rate per hour. If we presume a 40 hour week for 52 weeks, this translates to a little less than $10 per hour.
```{r}
20172.11 / (40 * 52)
```
Here's what that looks like in a log metric.
```{r}
log(20172.11 / (40 * 52))
```
But keep in mind that "to track wages on a common temporal scale, Murnane and colleagues decided to clock time from each respondent's first day of work" (p. 147). So the wages at one's initial point in the study were often entry-level wages. From the official website for the [US Department of Labor](https://www.dol.gov/whd/minwage/chart.htm), we learn the national US minimum wage in 1990 was $3.80 per hour. Here's what that looks like on the log scale.
```{r}
log(3.80)
```
So perhaps this is a better figure to center our prior for the model intercept on. If we stay with a conventional Gaussian prior and put $\mu = 1.335$, what value should we use for the standard deviation? Well, if 1.335 is the log minimum and 2.27 is the log mean, the two are less than 1 apart on the log scale. If we'd like to continue our practice of weakly-regularizing priors, a value of 1 or even 0.5 on the log scale would seem reasonable. For simplicity, we'll use `normal(1.335, 1)`.
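Here's the range of hourly wages, in 1990 dollars, implied by the middle 95% of that prior.
```{r}
# back on the dollar scale
exp(qnorm(c(.025, .975), mean = 1.335, sd = 1))
```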
Next we need a prior for the expected increase over a single year's employment. A conservative default might be to center it on zero--no change from year to year. Since, as we've established, a difference of 1 on the log scale is more than the gap between the minimum and average hourly wages in 1990 dollars, we might just use `normal(0, 0.5)` as a starting point.
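On the natural-dollar scale, that prior allows a single year of experience to multiply one's wage by a factor anywhere in this 95% range, which is plenty permissive.
```{r}
# multiplicative change in hourly wage per year of experience
exp(qnorm(c(.025, .975), mean = 0, sd = 0.5))
```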
So then what about our variance parameters? Given these are all entry-level workers and given how little we'd expect them to increase from year to year, a `student_t(3, 0, 1)` on the log scale would seem pretty permissive.
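Keep in mind **brms** restricts these $\sigma$ parameters to be nonnegative, so `student_t(3, 0, 1)` acts like a half-Student-$t$. By symmetry, 95% of that prior's mass sits below the .975 quantile of the unrestricted distribution.
```{r}
# the .95 quantile of a half-student-t(3, 0, 1)
qt(.975, df = 3)
```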
So then here's how we might formally specify our model priors:
$$
\begin{align*}
\gamma_{00} & \sim \operatorname{Normal}(1.335, 1) \\
\gamma_{10} & \sim \operatorname{Normal}(0, 0.5) \\
\sigma_\epsilon & \sim \operatorname{Student-t}(3, 0, 1) \\
\sigma_0 & \sim \operatorname{Student-t}(3, 0, 1) \\
\sigma_1 & \sim \operatorname{Student-t}(3, 0, 1) \\
\rho_{01} & \sim \operatorname{LKJ} (4)
\end{align*}
$$
For a point of comparison, here are the `brms::brm()` default priors.
```{r}
get_prior(data = wages_pp,
family = gaussian,
lnw ~ 0 + Intercept + exper + (1 + exper | id))
```
Even though our priors are still quite permissive on the scale of the data, they're much more informative than the defaults. If we had formal backgrounds in the entry-level economy of the US in the early 1990s, we'd be able to specify even better priors. But hopefully this walk-through gives a sense of how to start thinking about model priors.
Let's fit the model. To keep the size of the `fits/fit05.03.rds` file below the 100MB GitHub limit, we'll set `chains = 3` and compensate by upping `iter` a little.
```{r fit5.3}
fit5.3 <-
brm(data = wages_pp,
family = gaussian,
lnw ~ 0 + Intercept + exper + (1 + exper | id),
prior = c(prior(normal(1.335, 1), class = b, coef = Intercept),
prior(normal(0, 0.5), class = b, coef = exper),
prior(student_t(3, 0, 1), class = sd),
prior(student_t(3, 0, 1), class = sigma),
prior(lkj(4), class = cor)),
iter = 2500, warmup = 1000, chains = 3, cores = 3,
seed = 5,
file = "fits/fit05.03")
```
Here are the results.
```{r}
print(fit5.3, digits = 3)
```
Since the criterion `lnw` is on the log scale, Singer and Willett pointed out our estimate for $\gamma_{10}$ indicates a nonlinear growth rate on the natural dollar scale. They further explicated that "if an outcome in a linear relationship, $Y$, is expressed as a natural logarithm and $\hat \gamma_{01}$ is the regression coefficient for a predictor $X$, then $100(e^{\hat{\gamma}_{01}} - 1)$ is the *percentage change* in $Y$ per unit difference in $X$" (p. 148, *emphasis* in the original). Here's how to do that conversion with our **brms** output.
```{r, warning = F}
draws <-
as_draws_df(fit5.3) %>%
transmute(percent_change = 100 * (exp(b_exper) - 1))
head(draws)
```
For our plot, let's break out [Matthew Kay](https://twitter.com/mjskay)'s handy [**tidybayes** package](https://mjskay.github.io/tidybayes/) [@R-tidybayes]. With the `tidybayes::stat_halfeye()` function, it's easy to put horizontal point intervals beneath our parameter densities. Here we'll use 95% intervals.
```{r, fig.width = 3.5, fig.height = 2.5, warning = F, message = F}
library(tidybayes)
draws %>%
ggplot(aes(x = percent_change, y = 0)) +
stat_halfeye(.width = .95) +
scale_y_continuous(NULL, breaks = NULL) +
labs(title = "Percent change",
x = expression(100*(italic(e)^(hat(gamma)[1][0])-1))) +
theme(panel.grid = element_blank())
```
The **tidybayes** package also has a group of functions that make it easy to summarize posterior parameters with measures of central tendency (i.e., mean, median, mode) and intervals (i.e., percentile based, highest posterior density intervals). Here we’ll use `median_qi()` to get the posterior median and percentile-based 95% intervals.
```{r}
draws %>%
median_qi(percent_change)
```
For our next model, Model B in Table 5.4, we add two time-invariant covariates. In the data, these are listed as `black` and `hgc.9`. Before we proceed, let's rename `hgc.9` to be more consistent with [**tidyverse** style](https://style.tidyverse.org/syntax.html#object-names).
```{r}
wages_pp <-
wages_pp %>%
rename(hgc_9 = hgc.9)
```
There we go. Let's take a look at the distributions of our covariates.
```{r, fig.width = 6, fig.height = 2.5}
wages_pp %>%
pivot_longer(c(black, hgc_9)) %>%
ggplot(aes(x = value)) +
geom_bar() +
theme(panel.grid = element_blank()) +
facet_wrap(~ name, scales = "free")
```
We see `black` is a dummy variable coded "Black" = 1, "Non-black" = 0. `hgc_9` is a roughly Gaussian ordinal variable centered around zero. For context, it might also help to check its standard deviation.
```{r}
sd(wages_pp$hgc_9)
```
With a mean near 0 and an $SD$ near 1, `hgc_9` is almost in a standardized metric. If we wanted to keep with our weakly-regularizing approach, `normal(0, 1)` or even `normal(0, 0.5)` would be pretty permissive for both these variables. Recall that we're predicting wage on the log scale. A $\gamma$ value of 1 or even 0.5 would be humongous for the social sciences. Since we already have the $\gamma$ for `exper` set to `normal(0, 0.5)`, let's just keep with that. Here's how we might describe our model in statistical terms:
$$
\begin{align*}
\text{lnw}_{ij} & = \gamma_{00} + \gamma_{01} (\text{hgc}_{i} - 9) + \gamma_{02} \text{black}_{i} \\
& \;\;\; + \gamma_{10} \text{exper}_{ij} + \gamma_{11} \text{exper}_{ij} \times (\text{hgc}_{i} - 9) + \gamma_{12} \text{exper}_{ij} \times \text{black}_{i} \\
& \;\;\; + \zeta_{0i} + \zeta_{1i} \text{exper}_{ij} + \epsilon_{ij} \\
\epsilon_{ij} & \sim \operatorname{Normal} (0, \sigma_\epsilon) \\
\begin{bmatrix}
\zeta_{0i} \\ \zeta_{1i}
\end{bmatrix} & \sim \operatorname{Normal}
\begin{pmatrix}
\begin{bmatrix} 0 \\ 0 \end{bmatrix},
\mathbf D \mathbf\Omega \mathbf D'
\end{pmatrix} \\
\mathbf D & = \begin{bmatrix} \sigma_0 & 0 \\ 0 & \sigma_1 \end{bmatrix} \\
\mathbf\Omega & = \begin{bmatrix} 1 & \rho_{01} \\ \rho_{01} & 1 \end{bmatrix} \\
\gamma_{00} & \sim \operatorname{Normal}(1.335, 1) \\
\gamma_{01}, \dots, \gamma_{12} & \sim \operatorname{Normal}(0, 0.5) \\
\sigma_\epsilon, \sigma_0, \text{ and } \sigma_1 & \sim \operatorname{Student-t} (3, 0, 1) \\
\rho_{01} & \sim \operatorname{LKJ} (4).
\end{align*}
$$
The top portion up through the $\mathbf\Omega$ line is the likelihood. Starting with $\gamma_{00} \sim \text{Normal}(1.335, 1)$ on down, we've listed our priors. Here's how to fit the model with **brms**.
```{r fit5.4}
fit5.4 <-
brm(data = wages_pp,
family = gaussian,
lnw ~ 0 + Intercept + hgc_9 + black + exper + exper:hgc_9 + exper:black + (1 + exper | id),
prior = c(prior(normal(1.335, 1), class = b, coef = Intercept),
prior(normal(0, 0.5), class = b),
prior(student_t(3, 0, 1), class = sd),
prior(student_t(3, 0, 1), class = sigma),
prior(lkj(4), class = cor)),
iter = 2500, warmup = 1000, chains = 3, cores = 3,
seed = 5,
file = "fits/fit05.04")
```
Let's take a look at the results.
```{r}
print(fit5.4, digits = 3)
```
The $\gamma$'s are on par with those in the text. When we convert the $\sigma$ parameters to the $\sigma^2$ metric, here's what they look like.
```{r, fig.width = 8, fig.height = 2.25, warning = F}
draws <- as_draws_df(fit5.4)
draws %>%
transmute(`sigma[0]^2` = sd_id__Intercept^2,
`sigma[1]^2` = sd_id__exper^2,
`sigma[epsilon]^2` = sigma^2) %>%
pivot_longer(everything()) %>%
ggplot(aes(x = value, y = name)) +
stat_halfeye(.width = .95, normalize = "xy") +
scale_y_discrete(NULL, labels = ggplot2:::parse_safe) +
coord_cartesian(ylim = c(1.4, 3.4)) +
theme(axis.ticks.y = element_blank(),
panel.grid = element_blank())
```
We might plot our $\gamma$'s, too. Here we'll use `tidybayes::stat_pointinterval()` to just focus on the points and intervals.
```{r, fig.width = 8, fig.height = 2, warning = F}
draws %>%
select(b_Intercept:`b_black:exper`) %>%
set_names(str_c("gamma", c("[0][0]", "[0][1]", "[0][2]", "[1][0]", "[1][1]", "[1][2]"))) %>%
pivot_longer(everything()) %>%
ggplot(aes(x = value, y = name)) +
geom_vline(xintercept = 0, color = "white") +
stat_pointinterval(.width = .95, size = 1/2) +
scale_y_discrete(NULL, labels = ggplot2:::parse_safe) +
theme(axis.ticks.y = element_blank(),
panel.grid = element_blank())
```
As in the text, our $\gamma_{02}$ and $\gamma_{11}$ parameters hovered around zero. For our next model, Model C in Table 5.4, we'll drop those parameters.
```{r fit5.5}
fit5.5 <-
brm(data = wages_pp,
family = gaussian,
lnw ~ 0 + Intercept + hgc_9 + exper + exper:black + (1 + exper | id),
prior = c(prior(normal(1.335, 1), class = b, coef = Intercept),
prior(normal(0, 0.5), class = b),
prior(student_t(3, 0, 1), class = sd),
prior(student_t(3, 0, 1), class = sigma),
prior(lkj(4), class = cor)),
iter = 2500, warmup = 1000, chains = 3, cores = 3,
seed = 5,
file = "fits/fit05.05")
```
Let's take a look at the results.
```{r}
print(fit5.5, digits = 3)
```
Perhaps unsurprisingly, the parameter estimates for `fit5.5` ended up quite similar to those from `fit5.4`. Happily, they're also similar to those in the text. Let's compute the WAIC estimates.
```{r, warning = F, message = F}
fit5.3 <- add_criterion(fit5.3, criterion = "waic")
fit5.4 <- add_criterion(fit5.4, criterion = "waic")
fit5.5 <- add_criterion(fit5.5, criterion = "waic")
```
Compare their WAIC estimates using $\text{elpd}$ difference scores.
```{r}
loo_compare(fit5.3, fit5.4, fit5.5, criterion = "waic") %>%
print(simplify = F)
```
The differences are subtle. Here are the WAIC weights.
```{r}
model_weights(fit5.3, fit5.4, fit5.5, weights = "waic") %>%
round(digits = 3)
```
When we use weights, almost all goes to `fit5.4` and `fit5.5`. Focusing on the trimmed model, `fit5.5`, let's get ready to make our version of Figure 5.2. We'll start with `fitted()` work.
```{r}
nd <-
crossing(black = 0:1,
hgc_9 = c(0, 3)) %>%
expand_grid(exper = seq(from = 0, to = 11, length.out = 30))
f <-
fitted(fit5.5,
newdata = nd,
re_formula = NA) %>%
data.frame() %>%
bind_cols(nd)
head(f)
```
Here it is, our two-panel version of Figure 5.2.
```{r, fig.width = 6, fig.height = 3}
f %>%
mutate(black = factor(black,
labels = c("Latinos and Whites", "Blacks")),
hgc_9 = factor(hgc_9,
labels = c("9th grade dropouts", "12th grade dropouts"))) %>%
ggplot(aes(x = exper,
color = black, fill = black)) +
geom_ribbon(aes(ymin = Q2.5, ymax = Q97.5),
size = 0, alpha = 1/4) +
geom_line(aes(y = Estimate)) +
scale_fill_viridis_d(NULL, option = "C", begin = .25, end = .75) +
scale_color_viridis_d(NULL, option = "C", begin = .25, end = .75) +
ylab("lnw") +
coord_cartesian(ylim = c(1.6, 2.4)) +
theme(panel.grid = element_blank()) +
facet_wrap(~ hgc_9)
```
This leads in nicely to a brief discussion of posterior predictive checks (PPC). The basic idea is that good models should be able to retrodict the data used to produce them. Table 5.3 in the text introduced the data set by highlighting three participants and we went ahead and looked at their data in a plot. One way to do a PPC might be to plot their original data atop their model estimates. The `fitted()` function will help us with the preparatory work.
```{r}
nd <-
wages_pp %>%
filter(id %in% c(206, 332, 1028))
f <-
fitted(fit5.5,
newdata = nd) %>%
data.frame() %>%
bind_cols(nd)
head(f)
```
Here's the plot.
```{r, fig.width = 8, fig.height = 2.5}
f %>%
mutate(id = str_c("id = ", id)) %>%
ggplot(aes(x = exper)) +
geom_pointrange(aes(y = Estimate, ymin = Q2.5, ymax = Q97.5,
color = id)) +
geom_point(aes(y = lnw)) +
scale_color_viridis_d(option = "B", begin = .35, end = .8) +
labs(subtitle = "The black dots are the original data. The colored points and vertical lines are the participant-specific posterior\nmeans and 95% intervals.") +
theme(legend.position = "none",
panel.grid = element_blank()) +
facet_wrap(~ id)
```
Although each participant got their own intercept and slope, the estimates all fall in straight lines. Since we're only working with time-invariant covariates, that's about the best we can do. Though our models can express gross trends over time, they're unable to speak to variation from occasion to occasion. A little later in this chapter, we'll learn how to do better.
### Practical problems that may arise when analyzing unbalanced data sets.
With HMC, the issues with non-convergence aren't quite the same as with maximum likelihood estimation. However, the basic issue still remains:
> Estimation of variance components requires that enough people have sufficient data to allow quantification of within-person residual variation--variation in the residuals over and above the fixed effects. If too many people have too little data, you will ~~be unable to quantify~~ [have difficulty quantifying] this residual variability. (p. 152)
The big difference is that as Bayesians, our priors add additional information that will help us define the posterior distributions of our variance components. Thus our challenge will be choosing sensible priors for our $\sigma$'s.
#### Boundary constraints.
Unlike with the frequentist multilevel software discussed in the text, **brms** will not yield negative values on the $\sigma$ parameters. This is because the **brms** default is to set a lower limit of zero on those parameters. For example, see what happens when we execute `fit5.3$model`.
```{r}
fit5.3$model
```
That returned the Stan code corresponding to our `brms::brm()` code, above. Notice the second and third lines in the `parameters` block. Both contain `<lower=0>`, which indicates the lower bound for those parameters is zero. See? Stan has you covered.
Let's load the `wages_small_pp.csv` data.
```{r, warning = F, message = F}
wages_small_pp <- read_csv("data/wages_small_pp.csv") %>%
rename(hgc_9 = hcg.9)
glimpse(wages_small_pp)
```
Here's the distribution of the number of measurement occasions for our small data set.
```{r, fig.width = 6, fig.height = 3}
wages_small_pp %>%
count(id) %>%
ggplot(aes(y = n)) +
geom_bar() +
scale_y_continuous("# measurement occasions", breaks = 1:13, limits = c(.5, 13)) +
xlab("count of cases") +
theme(panel.grid = element_blank())
```
Our `brm()` code is the same as that for `fit5.5`, above, with just a slightly different `data` argument. If we wanted to, we could be hasty and just use `update()`, instead. But since we're still practicing setting our priors and such, here we'll be exhaustive.
```{r fit5.6}
fit5.6 <-
brm(data = wages_small_pp,
family = gaussian,
lnw ~ 0 + Intercept + hgc_9 + exper + exper:black + (1 + exper | id),
prior = c(prior(normal(1.335, 1), class = b, coef = Intercept),
prior(normal(0, 0.5), class = b),
prior(student_t(3, 0, 1), class = sd),
prior(student_t(3, 0, 1), class = sigma),
prior(lkj(4), class = cor)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 5,
file = "fits/fit05.06")
```
```{r}
print(fit5.6)
```
Let's walk through this slowly.
You may have noticed that warning message about *divergent transitions*. We'll get to that in a bit. First focus on the parameter estimates for `sd(exper)`. Unlike in the text, our posterior mean is not 0.000. But do remember that our posterior is parameterized in the $\sigma$ metric. Let's do a little converting and look at it in a plot.
```{r, warning = F}
draws <- as_draws_df(fit5.6)
v <-
draws %>%
transmute(sigma_1 = sd_id__exper) %>%
mutate(sigma_2_1 = sigma_1^2) %>%
set_names("sigma[1]", "sigma[1]^2") %>%
pivot_longer(everything())
```
Plot.
```{r, fig.width = 8, fig.height = 2.5}
v %>%
ggplot(aes(x = value, y = name)) +
stat_halfeye(.width = .95, normalize = "xy") +
scale_y_discrete(NULL, labels = parse(text = c("sigma[1]", "sigma[1]^2"))) +
theme(axis.ticks.y = element_blank(),
panel.grid = element_blank())
```
In the $\sigma$ metric, the posterior is bunched up a little on the boundary, but much of its mass is a gently right-skewed mound concentrated in the 0 to 0.1 range. When we convert the posterior to the $\sigma^2$ metric, the parameter appears much more bunched up against the boundary. Because we typically summarize our posteriors with means or medians, the point estimate still moves away from zero.
```{r}
v %>%
group_by(name) %>%
mean_qi() %>%
mutate_if(is.double, round, digits = 4)
v %>%
group_by(name) %>%
median_qi() %>%
mutate_if(is.double, round, digits = 4)
```
But it really does start to shoot to zero if we attempt to summarize the central tendency with the mode, as within the maximum likelihood paradigm.
```{r}
v %>%
group_by(name) %>%
mode_qi() %>%
mutate_if(is.double, round, digits = 4)
```
Backing up to that warning message, we were informed that "Increasing adapt_delta above 0.8 may help." The `adapt_delta` parameter ranges from 0 to 1. The `brm()` default is .8. In my experience, increasing to .9 or .99 is often a good place to start. For this model, .9 wasn't quite enough, but .99 worked. Here's how to do it.
```{r fit5.7}
fit5.7 <-
brm(data = wages_small_pp,
family = gaussian,
lnw ~ 0 + Intercept + hgc_9 + exper + exper:black + (1 + exper | id),
prior = c(prior(normal(1.335, 1), class = b, coef = Intercept),
prior(normal(0, 0.5), class = b),
prior(student_t(3, 0, 1), class = sd),
prior(student_t(3, 0, 1), class = sigma),
prior(lkj(4), class = cor)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 5,
control = list(adapt_delta = .99),
file = "fits/fit05.07")
```
Now look at the summary.
```{r}
print(fit5.7)
```
Our estimates were pretty much the same as before. Happily, this time we got our summary without any warning signs. It won't always be that way, so make sure to take `adapt_delta` warnings seriously.
Now do note that for both `fit5.6` and `fit5.7`, our effective sample sizes for $\sigma_0$ and $\sigma_1$ aren't terribly large relative to the total number of post-warmup draws, 4,000. If it were really important that you had high-quality summary statistics for these parameters, you might need to refit the model with something like `iter = 20000, warmup = 2000`.
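Here's a sketch of what that refit might look like with `update()`, which recycles the compiled model so only the sampling settings change. We're not evaluating it here and the exact settings are just for illustration.
```{r, eval = F}
# not run: refit with many more post-warmup draws
fit5.7_long <-
  update(fit5.7,
         iter = 20000, warmup = 2000, chains = 4, cores = 4,
         seed = 5)
```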
In Model B in Table 5.5, Singer and Willett gave the results of a model with the boundary constraints on the $\sigma^2$ parameters removed. I am not going to attempt something like that with **brms**. If you're interested, you're on your own.
But we will fit a version of their Model C where we've removed the $\sigma_1$ parameter. Notice that this results in our removal of the LKJ prior for $\rho_{01}$, too. Without a $\sigma_1$, there's no other parameter with which our lonely $\sigma_0$ might covary.
```{r fit5.8}
fit5.8 <-
brm(data = wages_small_pp,
family = gaussian,
lnw ~ 0 + Intercept + hgc_9 + exper + exper:black + (1 | id),
prior = c(prior(normal(1.335, 1), class = b, coef = Intercept),
prior(normal(0, 0.5), class = b),
prior(student_t(3, 0, 1), class = sd),
prior(student_t(3, 0, 1), class = sigma)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 5,
file = "fits/fit05.08")
```
Here is the basic model summary.
```{r}
print(fit5.8)
```
No warning messages and our effective samples for $\sigma_0$ improved a bit. Compute the WAIC for both models.
```{r}
fit5.7 <- add_criterion(fit5.7, criterion = "waic")
fit5.8 <- add_criterion(fit5.8, criterion = "waic")
```
Compare.
```{r}
loo_compare(fit5.7, fit5.8, criterion = "waic") %>%
print(simplify = F, digits = 3)
```
Yep. Those WAIC estimates are quite similar and when you compare them with formal $\text{elpd}$ difference scores, the standard error is about the same size as the difference itself.
Though we're stepping away from the text a bit, we should explore more alternatives for this boundary issue. The Stan team has put together a *Prior Choice Recommendations* wiki at [https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations](https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations). In the [*Boundary-avoiding priors for modal estimation (posterior mode, MAP, marginal posterior mode, marginal maximum likelihood, MML)*](https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations#boundary-avoiding-priors-for-modal-estimation-posterior-mode-map-marginal-posterior-mode-marginal-maximum-likelihood-mml) section, we read:
> * These are for parameters such as group-level scale parameters, group-level correlations, group-level covariance matrix
> * What all these parameters have in common is that (a) they're defined on a space with a boundary, and (b) the likelihood, or marginal likelihood, can have a mode on the boundary. Most famous example is the group-level scale parameter tau for the 8-schools hierarchical model.
> * With full Bayes the boundary shouldn't be a problem (as long as you have any proper prior).
> * But with modal estimation, the estimate can be on the boundary, which can create problems in posterior predictions. For example, consider a varying-intercept varying-slope multilevel model which has an intercept and slope for each group. Suppose you fit marginal maximum likelihood and get a modal estimate of 1 for the group-level correlation. Then in your predictions the intercept and slope will be perfectly correlated, which in general will be unrealistic.
> * For a one-dimensional parameter restricted to be positive (e.g., the scale parameter in a hierarchical model), we recommend Gamma(2,0) prior (that is, p(tau) proportional to tau) which will keep the mode away from 0 but still allows it to be arbitrarily close to the data if that is what the likelihood wants. For details see this paper by Chung et al.: http://www.stat.columbia.edu/~gelman/research/published/chung_etal_Pmetrika2013.pdf
> + Gamma(2,0) biases the estimate upward. When number of groups is small, try Gamma(2,1/A), where A is a scale parameter representing how high tau can be.
We should walk those Gamma priors out a bit. The paper by @chungNondegeneratePenalizedLikelihood2013 is quite helpful. We'll first let them give us a little more background on the topic:
> Zero group-level variance estimates can cause several problems. Zero variance can go against prior knowledge of researchers and results in underestimation of uncertainty in fixed coefficient estimates. Inferences for groups are often of interest to researchers, but when the group-level variance is estimated as zero, the resulting predictions of the group-level errors will all be zero, so one fails to find unexplained differences between groups. In addition, uncertainty in predictions for new and existing groups is also understated. (p. 686)
They expounded further on page 687.
> When a variance parameter is estimated as zero, there is typically a large amount of uncertainty about this variance. One possibility is to declare in such situations that not enough information is available to estimate a multilevel model. However, the available alternatives can be unappealing since, as noted in the introduction, discarding a variance component or setting the variance to zero understates the uncertainty. In particular, standard errors for coefficients of covariates that vary between groups will be too low as we will see in Section 2.2. The other extreme is to fit a regression with indicators for groups (a fixed-effects model), but this will overcorrect for group effects (it is mathematically equivalent to a mixed-effects model with variance set to infinity), and also does not allow predictions for new groups.
>
> Degenerate variance estimates lead to complete shrinkage of predictions for new and existing groups and yield estimated prediction standard errors that understate uncertainty. This problem has been pointed out by Li and Lahiri [-@liAdjustedMaximumLikelihood2010] and Morris and Tang [-@morrisEstimatingRandomEffects2011] in small area estimation....
>
> If zero variance is not a null hypothesis of interest, a boundary estimate, and the corresponding zero likelihood ratio test statistic, should not necessarily lead us to accept the null hypothesis and to proceed as if the true variance is zero.
In their paper, they covered both penalized maximum likelihood and full Bayesian estimation. We're just going to focus on Bayes, but some of the quotes will contain ML talk. Further, we read:
> We recommend a class of log-gamma penalties (or gamma priors) that in our default setting (the log-gamma(2, $\lambda$) penalty with $\lambda \rightarrow 0$) produce maximum penalized likelihood (MPL) estimates (or Bayes modal estimates) approximately one standard error away from zero when the maximum likelihood estimate is at zero. We consider these priors to be weakly informative in the sense that they supply some direction but still allow inference to be driven by the data. The penalty has little influence when the number of groups is large or when the data are informative about the variance, and the asymptotic mean squared error of the proposed estimator is the same as that of the maximum likelihood estimator. (p. 686)
In the upper left panel of Figure 3, Chung and colleagues gave an example of what they mean by $\lambda \rightarrow 0$: $\lambda = 0.1$. Here's an example of what $\operatorname{Gamma}(2, 0.1)$ looks like across the parameter space of 0 to 100.
```{r, fig.width = 4, fig.height = 2.25, message = F}
library(ggdist)
prior(gamma(2, 0.1)) %>%
parse_dist() %>%
ggplot(aes(xdist = .dist_obj, y = prior)) +
stat_halfeye(.width = .95, p_limits = c(.0001, .9999)) +
scale_y_discrete(NULL, breaks = NULL, expand = expansion(add = 0.1)) +
labs(title = "Gamma(2, 0.1)",
x = "parameter space") +
coord_cartesian(xlim = c(0, 100)) +
theme(panel.grid = element_blank())
```
Given we're working with data on the log scale, that's a massively permissive prior. Let's zoom in and see what it means for the parameter space of possible values for our data.
```{r, fig.width = 4, fig.height = 2.25, message = F}
prior(gamma(2, 0.1)) %>%
parse_dist() %>%
ggplot(aes(xdist = .dist_obj, y = prior)) +
stat_halfeye(.width = .95, p_limits = c(.000001, .99)) +
scale_y_discrete(NULL, breaks = NULL, expand = expansion(add = 0.1)) +
labs(title = "Gamma(2, 0.1)",
x = "parameter space (zoomed in)") +
coord_cartesian(xlim = c(0, 2)) +
theme(panel.grid = element_blank())
```
Now keep that picture in mind as we read further along in the paper:
> In addition, with $\lambda \rightarrow 0$, the gamma density function has a positive constant derivative at zero, which allows the likelihood to dominate if it is strongly curved near zero. The positive constant derivative implies that the prior is linear at zero so that there is no dead zone near zero. The top-left panel of Figure 3 shows that the gamma(2,0.1) density increases linearly from zero with a gentle slope. The shape will be even flatter with a smaller rate parameter. (p. 691)
In case you're not familiar with the gamma distribution, the rate parameter is what we've been calling $\lambda$. Let's test this baby out with our model. Here's how to specify it in **brms**.
```{r fit5.9}
fit5.9 <-
brm(data = wages_small_pp,
family = gaussian,
lnw ~ 0 + Intercept + hgc_9 + exper + exper:black + (1 + exper | id),
prior = c(prior(normal(1.335, 1), class = b, coef = Intercept),
prior(normal(0, 0.5), class = b),
prior(student_t(3, 0, 1), class = sd),
prior(gamma(2, 0.1), class = sd, group = id, coef = exper),
prior(student_t(3, 0, 1), class = sigma),
prior(lkj(4), class = cor)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 5,
control = list(adapt_delta = .85),
file = "fits/fit05.09")
```
Notice how we had to increase `adapt_delta` a bit. Here are the results.
```{r}
print(fit5.9, digits = 3)
```
Our `sd(exper)` is still quite close to zero. But notice how the lower level of the 95% interval is now above zero. Here's what it looks like in both $\sigma$ and $\sigma^2$ metrics.
```{r, fig.width = 8, fig.height = 2.5, warning = F}
as_draws_df(fit5.9) %>%
transmute(sigma_1 = sd_id__exper) %>%
mutate(sigma_2_1 = sigma_1^2) %>%
set_names("sigma[1]", "sigma[1]^2") %>%
pivot_longer(everything()) %>%
ggplot(aes(x = value, y = name)) +
stat_halfeye(.width = .95, normalize = "xy") +
scale_y_discrete(NULL, labels = parse(text = c("sigma[1]", "sigma[1]^2"))) +
theme(axis.ticks.y = element_blank(),
panel.grid = element_blank())
```
Let's zoom in on the leftmost part of the plot.
```{r, fig.width = 8, fig.height = 2.5, warning = F}
as_draws_df(fit5.9) %>%
transmute(sigma_1 = sd_id__exper) %>%
mutate(sigma_2_1 = sigma_1^2) %>%
set_names("sigma[1]", "sigma[1]^2") %>%
pivot_longer(everything()) %>%
ggplot(aes(x = value, y = name)) +
geom_vline(xintercept = 0, color = "white") +
stat_halfeye(.width = .95, normalize = "xy") +
scale_y_discrete(NULL, labels = parse(text = c("sigma[1]", "sigma[1]^2"))) +
coord_cartesian(xlim = c(0, 0.01)) +
theme(panel.grid = element_blank())
```
Although we are still brushing up on the boundary with $\sigma_1^2$, the mode is no longer at zero. In the discussion, Chung and colleagues pointed out "sometimes weak prior information is available about a variance parameter. When $\alpha = 2$, the gamma density has its mode at $1 / \lambda$, and so one can use the $\operatorname{gamma}(\alpha, \lambda)$ prior with $1 / \lambda$ set to the prior estimate of $\sigma_\theta$" (p. 703). Let's say we only had our `wages_small_pp`, but the results of something like the `wages_pp` data were published by some earlier group of researchers. In this case, we do have good prior data; we have the point estimate from the model of the `wages_pp` data! Here's what that was in terms of the median.
```{r, fig.width = 8, fig.height = 2.5}
as_draws_df(fit5.3) %>%
median_qi(sd_id__exper)