---
title: "*Applied longitudinal data analysis* in brms and the tidyverse"
subtitle: "version 0.0.2"
author: "A Solomon Kurz"
date: "`r Sys.Date()`"
site: bookdown::bookdown_site
output:
  bookdown::gitbook:
    split_bib: yes
documentclass: book
bibliography: bib.bib
biblio-style: apalike
csl: apa.csl
link-citations: yes
geometry:
  margin = 0.5in
urlcolor: blue
highlight: tango
header-includes:
  \usepackage{underscore}
  \usepackage[T1]{fontenc}
github-repo: ASKurz/Applied-Longitudinal-Data-Analysis-with-brms-and-the-tidyverse
twitter-handle: SolomonKurz
description: "This project is a reworking of Singer and Willett's classic (2003) text within a contemporary Bayesian framework, with emphasis on the brms and tidyverse packages within the R computational framework."
---
# What and why {-}
This project is based on Singer and Willett's classic [-@singerAppliedLongitudinalData2003] text, [*Applied longitudinal data analysis: Modeling change and event occurrence*](https://www.oxfordscholarship.com/view/10.1093/acprof:oso/9780195152968.001.0001/acprof-9780195152968). You can download the data used in the text at [http://www.bristol.ac.uk/cmm/learning/support/singer-willett.html](https://www.bristol.ac.uk/cmm/learning/support/singer-willett.html) and find a wealth of ideas on how to fit the models in the text at [https://stats.idre.ucla.edu/other/examples/alda/](https://stats.idre.ucla.edu/other/examples/alda/). My contributions show how to fit these models and others like them within a Bayesian framework. I make extensive use of Paul Bürkner's [**brms** package](https://github.com/paul-buerkner/brms) [@R-brms; @burknerBrmsPackageBayesian2017; @burknerAdvancedBayesianMultilevel2018], which makes it easy to fit Bayesian regression models in **R** [@R-base] using Hamiltonian Monte Carlo (HMC) via the [Stan](https://mc-stan.org) probabilistic programming language [@carpenterStanProbabilisticProgramming2017]. Much of the data wrangling and plotting code is done with packages connected to the [**tidyverse**](http://style.tidyverse.org) [@R-tidyverse; @wickhamWelcomeTidyverse2019].
## Caution: Work in progress {-}
This release contains drafts of Chapters 1 through 6 and 9 through 13. Chapters 1 through 6 provide the motivation and foundational principles for fitting longitudinal multilevel models. Chapters 9 through 13 provide the motivation and foundational principles for fitting discrete-time survival analyses.
In addition to fleshing out more of the chapters, I plan to add more goodies like introductions to multivariate longitudinal models and mixed-effect location and scale models. But there is no timetable for this project. To keep up with the latest changes, check in at the GitHub repository, [https://github.com/ASKurz/Applied-Longitudinal-Data-Analysis-with-brms-and-the-tidyverse](https://github.com/ASKurz/Applied-Longitudinal-Data-Analysis-with-brms-and-the-tidyverse), or follow my announcements on Twitter at [https://twitter.com/SolomonKurz](https://twitter.com/SolomonKurz).
## **R** setup {-}
To get the full benefit from this ebook, you'll need some software. Happily, everything will be free (provided you have access to a decent personal computer and a good internet connection).
First, you'll need to install **R**, which you can learn about at [https://cran.r-project.org/](https://cran.r-project.org/).
Though not necessary, your **R** experience might be more enjoyable if done through the free RStudio interface, which you can learn about at [https://rstudio.com/products/rstudio/](https://rstudio.com/products/rstudio/).
Once you have installed **R**, execute the following to install the bulk of the add-on packages. This will probably take a few minutes to finish. Go make yourself a coffee.
```{r, eval = F}
packages <- c("bayesplot", "brms", "broom", "devtools", "flextable", "GGally",
              "ggmcmc", "ggrepel", "gtools", "loo", "patchwork", "psych",
              "Rcpp", "remotes", "rstan", "StanHeaders", "survival",
              "tidybayes", "tidyverse")
install.packages(packages, dependencies = T)
```
A few of the other packages are not officially available via the Comprehensive R Archive Network (CRAN; https://cran.r-project.org/). You can download them directly from GitHub by executing the following.
```{r, eval = F}
devtools::install_github("stan-dev/cmdstanr")
remotes::install_github("stan-dev/posterior")
devtools::install_github("rmcelreath/rethinking")
```
It's possible you'll have problems installing some of these packages. Here are some likely suspects and where you can find help:
* for difficulties installing **brms**, go to [https://github.com/paul-buerkner/brms#how-do-i-install-brms](https://github.com/paul-buerkner/brms#how-do-i-install-brms) or search around in the [**brms** section of the Stan forums](https://discourse.mc-stan.org/c/interfaces/brms/36);
* for difficulties installing **cmdstanr**, go to [https://mc-stan.org/cmdstanr/articles/cmdstanr.html](https://mc-stan.org/cmdstanr/articles/cmdstanr.html);
* for difficulties installing **rethinking**, go to [https://github.com/rmcelreath/rethinking#quick-installation](https://github.com/rmcelreath/rethinking#quick-installation); and
* for difficulties installing **rstan**, go to [https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started](https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started).
## License and citation {-}
This book is licensed under the Creative Commons Zero v1.0 Universal license. You can learn the details [here](https://github.com/ASKurz/Applied-Longitudinal-Data-Analysis-with-brms-and-the-tidyverse/blob/master/LICENSE). In short, you can use my work. Just please give me the appropriate credit, the same way you would for any other scholarly resource. Here's the citation information:
```{r, eval = F}
@book{kurzAppliedLongitudinalDataAnalysis2021,
title = {Applied longitudinal data analysis in brms and the tidyverse},
author = {Kurz, A. Solomon},
year = {2021},
month = {4},
edition = {version 0.0.2},
url = {https://bookdown.org/content/4253/}
}
```
<!--chapter:end:index.Rmd-->
# A Framework for Investigating Change over Time
> It is possible to measure change, and to do it well, *if you have longitudinal data* [@rogosaGrowthCurveApproach1982; @willettResultsReliabilityLongitudinal1989]. Cross-sectional data--so easy to collect and so widely available--will not suffice. In this chapter, we describe why longitudinal data are necessary for studying change. [@singerAppliedLongitudinalData2003, p. 3, *emphasis* in the original]
## When might you study change over time?
Perhaps a better question is: *When wouldn't you?*
## Distinguishing between two types of questions about change
On page 8, Singer and Willett proposed there are two fundamental questions for longitudinal data analysis:
1. "How does the outcome change over time?" and
2. "Can we predict differences in these changes?"
Within the hierarchical framework, we often speak about two levels of change. We address within-individual change at *level-1*.
> The goal of a level-1 analysis is to describe the *shape* of each person's individual growth trajectory.
>
> In the second stage of an analysis of change, known as *level-2*, we ask about *interindividual differences in change*... The goal of a level-2 analysis is to detect heterogeneity in change across individuals and to determine the *relationship* between predictors and the *shape* of each person's individual growth trajectory. (p. 8, *emphasis* in the original)
## Three important features of a study of change
* Three or more waves of data
* An outcome whose values change systematically over time
* A sensible metric for clocking time
### Multiple waves of data.
Singer and Willett criticized two-wave data on two grounds.
> First, it cannot tell us about the *shape* of each person's individual growth trajectory, the focus of our level-1 question. Did all the change occur immediately after the first assessment? Was progress steady or delayed? Second, it cannot distinguish true change from measurement error. If measurement error renders pretest scores too low and posttest scores too high, you might conclude erroneously that scores increase over time when a longer temporal view would suggest the opposite. In statistical terms, two-wave studies cannot describe individual trajectories of change and they confound true change with measurement error [see @rogosaGrowthCurveApproach1982]. (p. 10, *emphasis* in the original)
I am not a fan of this 'true change/measurement error' way of speaking and would rather speak in terms of systemic and [seemingly] un-systemic changes among means and variances. Otherwise put, I'd rather speak in terms of trait and state. Two waves of data do not allow us to disentangle systemic mean differences from stable means and substantial variances for those means. Two waves of data do not allow us to disentangle changes in traits from stable traits but important differences in states. For an introduction to this way of thinking, check out Nezlek's [-@nezlekMultilevelFrameworkUnderstanding2007] [*A multilevel framework for understanding relationships among traits, states, situations and behaviors*](https://www.researchgate.net/publication/228079300_A_Multilevel_Framework_for_Understanding_Relationships_Among_Traits_States_Situations_and_Behaviours).
### A sensible metric for time.
> Choice of a time metric affects several interrelated decisions about the number and spacing of data collection waves....
>
> Our overarching point is that there is no single answer to the seemingly simple question about the most sensible metric for time. You should adopt whatever scale makes most sense for your outcomes and your research question....
>
> Our point is simple: choose a metric for time that reflects the cadence you expect to be most useful for your outcome. (p. 11)
Data collection waves can be evenly spaced or not. E.g., if you anticipate a time period of rapid nonlinear change, it might be helpful to increase the density of assessments during that period. Participants also do not all need to follow the same assessment schedule. If all are assessed on the same schedule, we describe the data as *time-structured*. When assessment schedules vary across participants, the data are termed *time-unstructured*. The data are *balanced* if all participants have the same number of waves. Issues like attrition and so on lead to *unbalanced* data. Though they may have some pedagogical use, I have not found these terms useful in practice.
### A continuous outcome that changes systematically over time.
To my eye, the most interesting part of this section is the discussion of measurement validity over time. E.g.,
> when we say the metric in which the outcome is measured must be preserved across time, we mean that the outcome scores must be equatable over time--a given value of the outcome on any occasion must represent the same "amount" of the outcome on every occasion. Outcome equatability is easiest to ensure when you use the identical instrument for measurement repeatedly over time. (p. 13)
This isn't as simple as it sounds. Though it's beyond the scope of this project, you might learn more about this from a study of the longitudinal measurement invariance literature. To dive in, see the first couple chapters in Newsom's [-@newsom2015longitudinal] text, [*Longitudinal structural equation modeling: A comprehensive introduction*](http://www.longitudinalsem.com/).
## Session info {-}
```{r}
sessionInfo()
```
<!--chapter:end:01.Rmd-->
```{r, echo = F, cache = F}
knitr::opts_chunk$set(fig.retina = 2.5)
knitr::opts_chunk$set(fig.align = "center")
options(width = 120)
```
# Exploring Longitudinal Data on Change
> Wise researchers conduct descriptive exploratory analyses of their data before fitting statistical models. As when working with cross-sectional data, exploratory analyses of longitudinal data can reveal general patterns, provide insight into functional form, and identify individuals whose data do not conform to the general pattern. The exploratory analyses presented in this chapter are based on numerical and graphical strategies already familiar from cross-sectional work. Owing to the nature of longitudinal data, however, they are inevitably more complex in this new setting. [@singerAppliedLongitudinalData2003, p. 16]
## Creating a longitudinal data set
> In longitudinal work, data-set organization is less straightforward because you can use two very different arrangements:
>
>>* *A person-level data set*, in which each person has one record and multiple variables contain the data from each measurement occasion
>
>>* *A person-period data set*, in which each person has multiple records—one for each measurement occasion (p. 17, *emphasis* in the original)
These are also sometimes referred to as the wide and long data formats, respectively.
As you will see, we will use two primary functions from the **tidyverse** to convert data from one format to another.
### The person-level data set.
Here we load the person-level data from [this UCLA web site](https://stats.idre.ucla.edu/r/examples/alda/r-applied-longitudinal-data-analysis-ch-2/). These are the NYS data [see @raudenbushGrowthCurveAnalysis2016] shown in the top of Figure 2.1.
```{r, warning = F, message = F}
library(tidyverse)
tolerance <- read_csv("https://stats.idre.ucla.edu/wp-content/uploads/2016/02/tolerance1.txt", col_names = T)
head(tolerance, n = 16)
```
With person-level data, each participant has a single row. In these data, participants are indexed by their `id` number. To see how many participants are in these data, just `count()` the rows.
```{r}
tolerance %>%
count()
```
The `nrow()` function will work, too.
```{r}
tolerance %>%
nrow()
```
With the base **R** `cor()` function, you can get the Pearson's correlation matrix shown in Table 2.1.
```{r}
cor(tolerance[ , 2:6]) %>%
round(digits = 2)
```
We used the `round()` function to limit the number of decimal places in the output. Leave it off and you'll see `cor()` returns up to seven decimal places instead. It can be hard to see the patterns within a matrix of numerals. It might be easier in a plot.
```{r, fig.width = 3.75, fig.height = 2.25}
cor(tolerance[ , 2:6]) %>%
data.frame() %>%
rownames_to_column("row") %>%
pivot_longer(-row,
names_to = "column",
values_to = "correlation") %>%
mutate(row = factor(row) %>% fct_rev(.)) %>%
ggplot(aes(x = column, y = row)) +
geom_raster(aes(fill = correlation)) +
geom_text(aes(label = round(correlation, digits = 2)),
size = 3.5) +
scale_fill_gradient(low = "white", high = "red4", limits = c(0, 1)) +
scale_x_discrete(NULL, position = "top", expand = c(0, 0)) +
scale_y_discrete(NULL, expand = c(0, 0)) +
theme(axis.ticks = element_blank())
```
If all you wanted was the lower diagonal, you could use the `lowerCor()` function from the [**psych** package](https://CRAN.R-project.org/package=psych) [@R-psych].
```{r}
psych::lowerCor(tolerance[ , 2:6])
```
For more ways to compute, organize, and visualize correlations within the **tidyverse** paradigm, check out the [**corrr** package](https://tidymodels.github.io/corrr/) [@R-corrr].
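If you'd like a quick sense of that workflow, here's a minimal sketch using three **corrr** functions, `correlate()`, `shave()`, and `fashion()` (this assumes you have installed **corrr**, which is not in the package list above).
```{r, eval = F}
library(corrr)

tolerance[ , 2:6] %>%
  correlate() %>%           # tidy data frame of correlations
  shave() %>%               # drop the redundant upper triangle
  fashion(decimals = 2)     # print-friendly formatting
```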
### The person-period data set.
Here are the person-period data (i.e., those shown in the bottom of Figure 2.1).
```{r, warning = F, message = F}
tolerance_pp <- read_csv("https://stats.idre.ucla.edu/wp-content/uploads/2016/02/tolerance1_pp.txt",
col_names = T)
tolerance_pp %>%
slice(c(1:9, 76:80))
```
With data like these, the simple use of `count()` or `nrow()` won't help us discover how many participants there are in the `tolerance_pp` data. One quick way is to `count()` the number of `distinct()` `id` values.
```{r}
tolerance_pp %>%
distinct(id) %>%
count()
```
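As a slightly more compact alternative, here's a sketch using `dplyr::n_distinct()`.
```{r, eval = F}
tolerance_pp %>%
  summarise(n = n_distinct(id))
```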
A fundamental skill is knowing how to convert longitudinal data in one format to the other. If you're using packages within the **tidyverse**, the `pivot_longer()` function will get you from the person-level format to the person-period format.
```{r}
tolerance %>%
# this is the main event
pivot_longer(-c(id, male, exposure),
names_to = "age",
values_to = "tolerance") %>%
# here we remove the `tol` prefix from the `age` values and then save the numbers as integers
mutate(age = str_remove(age, "tol") %>% as.integer()) %>%
# these last two lines just make the results look more like those in the last code chunk
arrange(id, age) %>%
slice(c(1:9, 76:80))
```
You can learn more about the `pivot_longer()` function [here](https://tidyr.tidyverse.org/reference/pivot_longer.html) and [here](https://tidyr.tidyverse.org/articles/pivot.html).
As hinted at in the above hyperlinks, the opposite of the `pivot_longer()` function is `pivot_wider()`. We can use `pivot_wider()` to convert the person-period `tolerance_pp` data to the same format as the person-level `tolerance` data.
```{r}
tolerance_pp %>%
# we'll want to add that `tol` prefix back to the `age` values
mutate(age = str_c("tol", age)) %>%
# this variable is just in the way. we'll drop it
select(-time) %>%
# here's the main action
pivot_wider(names_from = age, values_from = tolerance)
```
## Descriptive analysis of individual change over time
The following "descriptive analyses [are intended to] reveal the nature and idiosyncrasies of each person's temporal pattern of growth, addressing the question: How does each person change over time" (p. 23)?
### Empirical growth plots.
*Empirical growth plots* show the individual-level sequence of a variable of interest over time. We'll put `age` on the $x$-axis, `tolerance` on the $y$-axis, and make our variant of Figure 2.2 with `geom_point()`. It's the `facet_wrap()` part of the code that splits the plot up by `id`.
```{r, fig.width = 4.5, fig.height = 5}
tolerance_pp %>%
ggplot(aes(x = age, y = tolerance)) +
geom_point() +
coord_cartesian(ylim = c(1, 4)) +
theme(panel.grid = element_blank()) +
facet_wrap(~id)
```
By default, **ggplot2** sets the scales of the $x$- and $y$-axes to the same values across subpanels. If you'd like to free that constraint, play around with the `scales` argument within `facet_wrap()`.
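Here's a quick sketch of what that might look like if you let the $y$-axis range vary across panels.
```{r, eval = F}
tolerance_pp %>%
  ggplot(aes(x = age, y = tolerance)) +
  geom_point() +
  theme(panel.grid = element_blank()) +
  facet_wrap(~id, scales = "free_y")
```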
### Using a trajectory to summarize each person's empirical growth record.
If we wanted to connect the dots, we might just add a `geom_line()` line.
```{r, fig.width = 4.5, fig.height = 5}
tolerance_pp %>%
ggplot(aes(x = age, y = tolerance)) +
geom_point() +
geom_line() +
coord_cartesian(ylim = c(1, 4)) +
theme(panel.grid = element_blank()) +
facet_wrap(~id)
```
However, Singer and Willett recommend two other approaches:
* nonparametric smoothing
* parametric functions
#### Smoothing the empirical growth trajectory nonparametrically.
For our version of Figure 2.3, we'll use a loess smoother. When using the `stat_smooth()` function in **ggplot2**, you can control how smooth or wiggly the line is with the `span` argument.
```{r, fig.width = 4.5, fig.height = 5, message = F, warning = F}
tolerance_pp %>%
ggplot(aes(x = age, y = tolerance)) +
geom_point() +
stat_smooth(method = "loess", se = F, span = .9) +
coord_cartesian(ylim = c(1, 4)) +
theme(panel.grid = element_blank()) +
facet_wrap(~id)
```
#### Smoothing the empirical growth trajectory using ~~OLS~~ single-level Bayesian regression.
Although "fitting person-specific regression models, one individual at a time, is hardly the most efficient use of longitudinal data" (p. 28), we may as well play along with the text. It'll have pedagogical utility. You'll see.
For this section, we'll take a [cue from Hadley Wickham](https://www.youtube.com/watch?v=rz3_FDVt9eg&t=3458s) and use `group_by()` and `nest()` to make a tibble composed of tibbles (i.e., a nested tibble).
```{r}
by_id <-
tolerance_pp %>%
group_by(id) %>%
nest()
```
You can get a sense of what we did with `head()`.
```{r}
by_id %>% head()
```
As indexed by `id`, each participant now has their own data set stored in the `data` column. To get a better sense, we'll use our double-bracket subsetting skills to open up the first data set, the one for `id == 9`. If you're not familiar with this skill, you can learn more from [Chapter 9](https://bookdown.org/rdpeng/rprogdatascience/subsetting-r-objects.html) of [Roger Peng](https://twitter.com/rdpeng?lang=en)'s great [-@pengProgrammingDataScience2019] online book, [*R programming for data science*](https://bookdown.org/rdpeng/rprogdatascience/), or [Jenny Bryan](https://twitter.com/JennyBryan)'s fun and useful talk, [*Behind every great plot there's a great deal of wrangling*](https://www.youtube.com/watch?v=4MfUCX_KpdE).
```{r}
by_id$data[[1]]
```
Our `by_id` data object has many data sets stored in a higher-level data set. The code we used is verbose, but that's what made it human-readable. Now that we have our nested tibble, we can make a function that will fit the simple linear model `tolerance ~ 1 + time` to each id-level data set. *Why use `time` as the predictor?* you ask. On page 29 in the text, Singer and Willett clarified they fit their individual models with $(\text{age} - 11)$ in order to have the model intercepts centered at 11 years old rather than 0. If we wanted to, we could make an $(\text{age} - 11)$ variable like so.
```{r}
by_id$data[[1]] %>%
mutate(age_minus_11 = age - 11)
```
Did you notice how our `age_minus_11` variable is the same as the `time` variable already in the data set? Yep, that's why we'll be using `time` in the model. In our data, $(\text{age} - 11)$ is encoded as `time`.
Singer and Willett used OLS to fit their exploratory models. We could do that, too, with the `lm()` function, and we will do a little of that in this project. But let's get frisky and fit the models as Bayesians, instead. Our primary statistical package for fitting Bayesian models will be [Paul Bürkner](https://twitter.com/paulbuerkner?lang=en)'s [**brms**](https://github.com/paul-buerkner/brms). Let's open it up.
```{r, warning = F, message = F}
library(brms)
```
Since this is our first Bayesian model, we should start slow. The primary model-fitting function in **brms** is `brm()`. The function is astonishingly general and includes numerous arguments, most of which have sensible defaults. The primary two arguments are `data` and `formula`. I'm guessing they're self-explanatory. I'm not going to go into detail on the three arguments at the bottom of the code. We'll go over them later. For simple models like these, I would have omitted them entirely, but given the sparsity of the data (i.e., 5 data points per model), I wanted to make sure we gave the algorithm a good chance to arrive at reasonable estimates.
```{r fit2.1}
fit2.1 <-
brm(data = by_id$data[[1]],
formula = tolerance ~ 1 + time,
prior = prior(normal(0, 2), class = b),
iter = 4000, chains = 4, cores = 4,
seed = 2,
file = "fits/fit02.01")
```
We just fit a single-level Bayesian regression model for our first participant. We saved the results as an object named `fit2.1`. We can return a useful summary of `fit2.1` with either `print()` or `summary()`. Since it's less typing, we'll use `print()`.
```{r}
print(fit2.1)
```
The 'Intercept' and 'time' coefficients are the primary regression parameters. Also notice 'sigma', which is our variant of the residual standard error you might get from an OLS output (e.g., from base **R** `lm()`). Since we're Bayesians, the output summaries do not contain $p$-values. But we do get posterior standard deviations (i.e., the 'Est.Error' column) and the upper- and lower-levels of the percentile-based 95% intervals.
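If you'd like intervals other than 95%, the `summary()` and `print()` methods take a `prob` argument. Here's a quick sketch.
```{r, eval = F}
# switch to 80% percentile-based intervals
print(fit2.1, prob = .8)
```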
You probably heard somewhere that Bayesian statistics require priors. We can see what those were by pulling them out of our `fit2.1` object.
```{r}
fit2.1$prior
```
The prior in the top line, `normal(0, 2)`, is for all parameters of `class = b`. We actually specified this in our `brm()` code, above, with the code snippet `prior = prior(normal(0, 2), class = b)`. At this stage in the project, my initial impulse was to leave this line blank and save the discussion of how to set priors by hand for later. However, the difficulty is that the first several models we're fitting are all of $n = 5$. Bayesian statistics handle small-$n$ models just fine. However, when your $n$ gets small, the algorithms we use to implement our Bayesian models benefit from priors that are at least modestly informative. As it turns out, the **brms** default priors are flat for parameters of `class = b`. They offer no information beyond that contained in the likelihood. To stave off algorithm problems with our extremely-small-$n$ data subsets, we used `normal(0, 2)` instead. In our model, the only parameter of `class = b` is the regression slope for `time`. On the scale of the data, `normal(0, 2)` is a very permissive prior for our `time` slope.
In addition to our `time` slope parameter, our model contained an intercept and a residual variance. From the `fit2.1$prior` output, we can see those were `student_t(3, 2.1, 2.5)` and `student_t(3, 0, 2.5)`, respectively. **brms** default priors are designed to be weakly informative. Given the data and the model, these priors have a minimal influence on the results. We'll focus more on priors later in the project. For now just recognize that even if you don't specify your priors, you can't escape using some priors when using `brm()`. This is a good thing.
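If you'd like to see the default priors before committing to a fit, the `get_prior()` function takes the same `formula` and `data` arguments as `brm()`. Here's a sketch with the data from our first participant.
```{r, eval = F}
get_prior(data = by_id$data[[1]],
          formula = tolerance ~ 1 + time)
```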
Okay, so that was the model for just one participant. We want to do that for all 16. Instead of repeating that code 15 times, we can work in bulk. With **brms**, you can reuse a model with the `update()` function. Here's how to do that with the data from our second participant.
```{r fit2.2}
fit2.2 <-
update(fit2.1,
newdata = by_id$data[[2]],
control = list(adapt_delta = .9),
file = "fits/fit02.02")
```
Peek at the results.
```{r}
print(fit2.2)
```
Different participants yield different model results.
Looking ahead a bit, we'll need to know how to get the $R^2$ for a single-level Gaussian model. With **brms**, you do that with the `bayes_R2()` function.
```{r}
bayes_R2(fit2.2)
```
Though the default spits out summary statistics, you can get the full posterior distribution for the $R^2$ by specifying `summary = F`.
```{r}
bayes_R2(fit2.2, summary = F) %>%
str()
```
Our code returned a numeric vector. If you'd like to plot the results with **ggplot2**, you'll need to convert the vector to a data frame.
```{r, fit.width = 4, fig.height = 2}
bayes_R2(fit2.2, summary = F) %>%
data.frame() %>%
ggplot(aes(x = R2)) +
geom_density(fill = "black") +
scale_x_continuous(expression(italic(R)[Bayesian]^2), limits = c(0, 1)) +
scale_y_continuous(NULL, breaks = NULL) +
theme(panel.grid = element_blank())
```
You'll note how non-Gaussian the Bayesian $R^2$ can be. Also, with the combination of default minimally-informative priors and only 5 data points, there's massive uncertainty in the shape. As such, the value of central tendency will vary widely based on which statistic you use.
```{r}
bayes_R2(fit2.2, summary = F) %>%
data.frame() %>%
summarise(mean = mean(R2),
median = median(R2),
mode = tidybayes::Mode(R2))
```
By default, `bayes_R2()` returns the mean. You can get the median with the `robust = TRUE` argument. To pull the mode, you'll need to use `summary = F` and feed the results into a mode function, like `tidybayes::Mode()`.
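For example, here's the posterior-median version (a quick sketch).
```{r, eval = F}
bayes_R2(fit2.2, robust = T)
```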
I should also point out the **brms** package did not get these $R^2$ values by the traditional method used in, say, OLS estimation. To learn more about how the Bayesian $R^2$ sausage is made, check out Gelman, Goodrich, Gabry, and Vehtari's [-@gelmanRsquaredBayesianRegression2019] paper, [*R-squared for Bayesian regression models*](https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1549100).
With a little tricky programming, we can use the `purrr::map()` function to serially fit this model to each of our participant-level data sets. We'll save the results as `fits`.
```{r, eval = F}
fits <-
by_id %>%
mutate(model = map(data, ~update(fit2.1, newdata = ., seed = 2)))
```
```{r fits, echo = F}
# save a little time
# save(fits, file = "fits/fits02.rda")
load(file = "fits/fits02.rda")
```
Let's walk through what we did. The `map()` function takes two primary arguments, `.x` and `.f`, respectively. We set `.x = data`, which meant we wanted to iterate over the contents in our `data` vector. Recall that each row of `data` itself contained an entire data set--one for each of the 16 participants. It's with the second argument `.f` that we indicated what we wanted to do with our rows of `data`. We set that to `.f = ~update(fit2.1, newdata = ., seed = 2)`. With the `~` syntax, we entered in a formula, which was `update(fit2.1, newdata = ., seed = 2)`. Just like we did with `fit2.2`, above, we reused the model formula and other technical specs from `fit2.1`. Now notice the middle part of the formula, `newdata = .`. That little `.` refers to the element we specified in the `.x` argument. What this combination means is that for each of the 16 rows of our nested `by_id` tibble, we plugged in the `id`-specific data set into `update(fit, newdata[[i]])` where `i` is simply meant as a row index. The new column, `model`, contains the output of each of the 16 iterations.
```{r}
print(fits)
```
Next, we'll want to extract the necessary summary information from our `fits` to remake our version of Table 2.2. There's a lot of info in that table, so let's take it step by step. First, we'll extract the posterior means (i.e., "Estimate") and standard deviations (i.e., "se") for the initial status and rate of change of each model. We'll also do the same for sigma (i.e., the square root of the "Residual variance").
```{r, message = F}
mean_structure <-
fits %>%
mutate(coefs = map(model, ~ posterior_summary(.)[1:2, 1:2] %>%
data.frame() %>%
rownames_to_column("coefficients"))) %>%
unnest(coefs) %>%
select(-data, -model) %>%
unite(temp, Estimate, Est.Error) %>%
pivot_wider(names_from = coefficients,
values_from = temp) %>%
separate(b_Intercept, into = c("init_stat_est", "init_stat_sd"), sep = "_") %>%
separate(b_time, into = c("rate_change_est", "rate_change_sd"), sep = "_") %>%
mutate_if(is.character, ~ as.double(.) %>% round(digits = 2)) %>%
ungroup()
head(mean_structure)
```
It's simpler to extract the residual variance. Recall that because **brms** gives that in the standard deviation metric (i.e., $\sigma$), you need to square it to return it in a variance metric (i.e., $\sigma^2$).
```{r, message = F}
residual_variance <-
fits %>%
mutate(residual_variance = map_dbl(model, ~ posterior_summary(.)[3, 1])^2) %>%
mutate_if(is.double, round, digits = 2) %>%
select(id, residual_variance)
head(residual_variance)
```
We'll extract our Bayesian $R^2$ summaries, next. Given how nonnormal these are, we'll use the posterior median rather than the mean. We get that by using the `robust = T` argument within the `bayes_R2()` function.
```{r, message = F}
r2 <-
fits %>%
mutate(r2 = map_dbl(model, ~ bayes_R2(., robust = T)[1])) %>%
mutate_if(is.double, round, digits = 2) %>%
select(id, r2)
head(r2)
```
Here we combine all the components with a series of `left_join()` statements and present it in a [**flextable**](https://cran.r-project.org/web/packages/flextable/index.html)-type table.
```{r}
table <-
fits %>%
unnest(data) %>%
group_by(id) %>%
slice(1) %>%
select(id, male, exposure) %>%
left_join(mean_structure, by = "id") %>%
left_join(residual_variance, by = "id") %>%
left_join(r2, by = "id") %>%
rename(residual_var = residual_variance) %>%
select(id, init_stat_est:r2, everything()) %>%
ungroup()
table %>%
flextable::flextable()
```
We can make the four stem-and-leaf plots of Figure 2.4 with serial combinations of `pull()` and `stem()`.
```{r}
# fitted initial status
table %>%
pull(init_stat_est) %>%
stem(scale = 2)
# fitted rate of change
table %>%
pull(rate_change_est) %>%
stem(scale = 2)
# residual variance
table %>%
pull(residual_var) %>%
stem(scale = 2)
# r2 statistic
table %>%
pull(r2) %>%
stem(scale = 2)
```
To make Figure 2.5, we'll combine information from the original data and the 'Estimates' (i.e., posterior means) from our Bayesian models we've encoded in `mean_structure`.
```{r, fig.width = 4.5, fig.height = 5, message = F, warning = F}
by_id %>%
unnest(data) %>%
ggplot(aes(x = time, y = tolerance, group = id)) +
geom_point() +
geom_abline(data = mean_structure,
aes(intercept = init_stat_est,
slope = rate_change_est, group = id),
color = "blue") +
scale_x_continuous(breaks = 0:4, labels = 0:4 + 11) +
coord_cartesian(ylim = c(0, 4)) +
theme(panel.grid = element_blank()) +
facet_wrap(~id)
```
## Exploring differences in change across people
"Having summarized how each individual changes over time, we now examine similarities and differences in these changes across people" (p. 33).
### Examining the entire set of smooth trajectories.
The key to making our version of the left-hand side of Figure 2.6 is two `stat_smooth()` lines. The first one will produce the overall smooth. The second one, the one including the `aes(group = id)` argument, will give the `id`-specific smooths.
```{r, fig.width = 2.5, fig.height = 3.25, message = F, warning = F}
tolerance_pp %>%
ggplot(aes(x = age, y = tolerance)) +
stat_smooth(method = "loess", se = F, span = .9, size = 2) +
stat_smooth(aes(group = id),
method = "loess", se = F, span = .9, size = 1/4) +
coord_cartesian(ylim = c(0, 4)) +
theme(panel.grid = element_blank())
```
To get the linear OLS trajectories, just switch `method = "loess"` to `method = "lm"`.
```{r, fig.width = 2.5, fig.height = 3.25, message = F, warning = F}
tolerance_pp %>%
ggplot(aes(x = age, y = tolerance)) +
stat_smooth(method = "lm", se = F, span = .9, size = 2) +
stat_smooth(aes(group = id),
method = "lm", se = F, span = .9, size = 1/4) +
coord_cartesian(ylim = c(0, 4)) +
theme(panel.grid = element_blank())
```
But we wanted to be Bayesians. We already have the `id`-specific trajectories. All we need now is one based on all the data.
```{r fit2.3}
fit2.3 <-
update(fit2.1,
newdata = tolerance_pp,
file = "fits/fit02.03")
```
Here's the model summary.
```{r}
summary(fit2.3)
```
Before, we used `posterior_summary()` to isolate the posterior means and $SD$s. We can also use the `fixef()` function for that.
```{r}
fixef(fit2.3)
```
With a little subsetting, we can extract just the means from each.
```{r}
fixef(fit2.3)[1, 1]
fixef(fit2.3)[2, 1]
```
For this plot, we'll work more directly with the model formulas to plot the trajectories. We can use `init_stat_est` and `rate_change_est` from the `mean_structure` object as stand-ins for $\beta_{0i}$ and $\beta_{1i}$ from our model equation,
$$\text{tolerance}_{ij} = \beta_{0i} + \beta_{1i} \cdot \text{time}_{ij} + \epsilon_{ij},$$
where $i$ indexes children and $j$ indexes time points. All we need to do is plug in the appropriate values for `time` and we'll have the fitted `tolerance` values for each level of `id`. After a little wrangling, the data will be in good shape for plotting.
```{r}
tol_fitted <-
mean_structure %>%
mutate(`11` = init_stat_est + rate_change_est * 0,
`15` = init_stat_est + rate_change_est * 4) %>%
select(id, `11`, `15`) %>%
pivot_longer(-id,
names_to = "age",
values_to = "tolerance") %>%
mutate(age = as.integer(age))
head(tol_fitted)
```
We'll plot the `id`-level trajectories with those values and `geom_line()`. To get the overall trajectory, we'll get tricky with `fixef(fit2.3)` and `geom_abline()`.
```{r, fig.width = 2.5, fig.height = 3.25}
tol_fitted %>%
ggplot(aes(x = age, y = tolerance, group = id)) +
geom_line(color = "blue", size = 1/4) +
geom_abline(intercept = fixef(fit2.3)[1, 1] + fixef(fit2.3)[2, 1] * -11,
slope = fixef(fit2.3)[2, 1],
color = "blue", size = 2) +
coord_cartesian(ylim = c(0, 4)) +
theme(panel.grid = element_blank())
```
### Using the results of model fitting to frame questions about change.
If you're new to the multilevel model, the ideas in this section are foundational.
> To learn about the observed *average* pattern of change, we examine the sample averages of the fitted intercepts and slopes; these tell us about the average initial status and the average annual rate of change in the sample as a whole. To learn about the observed *individual differences* in change, we examine the sample *variances* and *standard deviations* of the intercepts and slopes; these tell us about the observed variability in initial status. And to learn about the observed relationship between initial status and the rate of change, we can examine the sample *covariance* or *correlation* between intercepts and slopes.
>
> Formal answers to these questions require the multilevel model for change of chapter 3. But we can presage this work by conducting simple descriptive analyses of the estimated intercepts and slopes. (p. 36, *emphasis* in the original)
Here are the means and standard deviations presented in Table 2.3.
```{r, message = F}
mean_structure %>%
pivot_longer(ends_with("est")) %>%
group_by(name) %>%
summarise(mean = mean(value),
sd = sd(value)) %>%
mutate_if(is.double, round, digits = 2)
```
Here's how to get the Pearson's correlation coefficient.
```{r}
mean_structure %>%
select(init_stat_est, rate_change_est) %>%
cor() %>%
round(digits = 2)
```
### Exploring the relationship between change and time-invariant predictors.
"Evaluating the impact of predictors helps you uncover systematic patterns in the individual change trajectories corresponding to interindividual variation in personal characteristics" (p. 37).
#### Graphically examining groups of smoothed individual growth trajectories.
If we'd like Bayesian estimates differing by `male`, we'll need to fit an interaction model.
```{r fit2.4}
fit2.4 <-
update(fit2.1,
newdata = tolerance_pp,
tolerance ~ 1 + time + male + time:male,
file = "fits/fit02.04")
```
Check the model summary.
```{r}
print(fit2.4)
```
Here's how to use `fixef()` and the model equation to get fitted values for `tolerance` based on specific values for `time` and `male`.
```{r}
tol_fitted_male <-
tibble(male = rep(0:1, each = 2),
age = rep(c(11, 15), times = 2)) %>%
mutate(time = age - 11) %>%
mutate(tolerance = fixef(fit2.4)[1, 1] +
fixef(fit2.4)[2, 1] * time +
fixef(fit2.4)[3, 1] * male +
fixef(fit2.4)[4, 1] * time * male)
tol_fitted_male
```
Now we're ready to make our Bayesian version of the top panels of Figure 2.7.
```{r, fig.width = 5, fig.height = 3.25}
tol_fitted %>%
# we need to add `male` values to `tol_fitted`
left_join(tolerance_pp %>% select(id, male),
by = "id") %>%
ggplot(aes(x = age, y = tolerance, color = factor(male))) +
geom_line(aes(group = id),
size = 1/4) +
geom_line(data = tol_fitted_male,
size = 2) +
scale_color_viridis_d(end = .75) +
coord_cartesian(ylim = c(0, 4)) +
theme(legend.position = "none",
panel.grid = element_blank()) +
facet_wrap(~male)
```
Before we can do the same thing with `exposure`, we'll need to dichotomize it by its median. A simple way is with a conditional statement within the `if_else()` function.
```{r}
tolerance_pp <-
tolerance_pp %>%
mutate(exposure_01 = if_else(exposure > median(exposure), 1, 0))
```
Now fit the second interaction model.
```{r fit2.5}
fit2.5 <-
update(fit2.4,
newdata = tolerance_pp,
tolerance ~ 1 + time + exposure_01 + time:exposure_01,
file = "fits/fit02.05")
```
Here's the summary.
```{r}
print(fit2.5)
```
Now use `fixef()` and the model equation to get fitted values for `tolerance` based on specific values for `time` and `exposure_01`.
```{r}
tol_fitted_exposure <-
crossing(exposure_01 = 0:1,
age = c(11, 15)) %>%
mutate(time = age - 11) %>%
mutate(tolerance = fixef(fit2.5)[1, 1] +
fixef(fit2.5)[2, 1] * time +
fixef(fit2.5)[3, 1] * exposure_01 +
fixef(fit2.5)[4, 1] * time * exposure_01,
exposure = if_else(exposure_01 == 1, "high exposure", "low exposure") %>%
factor(., levels = c("low exposure", "high exposure")))
tol_fitted_exposure
```
Did you notice in the last lines in the second `mutate()` how we made a version of `exposure` that is a factor? That will come in handy for labeling and ordering the subplots. Now make our Bayesian version of the bottom panels of Figure 2.7.
```{r, fig.width = 5, fig.height = 3.25}
tol_fitted %>%
# we need to add `exposure_01` values to `tol_fitted`
left_join(tolerance_pp %>% select(id, exposure_01),
by = "id") %>%
mutate(exposure = if_else(exposure_01 == 1, "high exposure", "low exposure") %>%
factor(., levels = c("low exposure", "high exposure"))) %>%
ggplot(aes(x = age, y = tolerance, color = exposure)) +
geom_line(aes(group = id),
size = 1/4) +
geom_line(data = tol_fitted_exposure,
size = 2) +
scale_color_viridis_d(option = "A", end = .75) +
coord_cartesian(ylim = c(0, 4)) +
theme(legend.position = "none",
panel.grid = element_blank()) +
facet_wrap(~exposure)
```
#### The relationship between ~~OLS-Estimated~~ single-level Bayesian trajectories and substantive predictors
"To investigate whether fitted trajectories vary systematically with predictors, we can treat the estimated intercepts and slopes as outcomes and explore the relationship between them and predictors" (p. 39). Here are the left panels of Figure 2.8.
```{r, fig.width = 2.5, fig.height = 5}
p1 <-
mean_structure %>%
pivot_longer(ends_with("est")) %>%
mutate(name = factor(name, labels = c("Fitted initial status", "Fitted rate of change"))) %>%
# we need to add `male` values to `tol_fitted`
left_join(tolerance_pp %>% select(id, male),
by = "id") %>%
ggplot(aes(x = factor(male), y = value, color = name)) +
geom_point(alpha = 1/2) +
scale_color_viridis_d(option = "B", begin = .2, end = .7) +
labs(x = "male",
y = NULL) +
theme(legend.position = "none",
panel.grid = element_blank()) +
facet_wrap(~name, scale = "free_y", ncol = 1)
p1
```
Here are the right panels.
```{r, fig.width = 2.5, fig.height = 5}
p2 <-
mean_structure %>%
pivot_longer(ends_with("est")) %>%
mutate(name = factor(name, labels = c("Fitted initial status", "Fitted rate of change"))) %>%
# we need to add `male` values to `tol_fitted`
left_join(tolerance_pp %>% select(id, exposure),
by = "id") %>%
ggplot(aes(x = exposure, y = value, color = name)) +
geom_point(alpha = 1/2) +
scale_color_viridis_d(option = "B", begin = .2, end = .7) +
scale_x_continuous(breaks = 0:2,
limits = c(0, 2.4)) +
labs(y = NULL) +
theme(legend.position = "none",
panel.grid = element_blank()) +
facet_wrap(~name, scale = "free_y", ncol = 1)
p2
```
Did you notice how we saved those last two plots as `p1` and `p2`? We can use syntax from the [**patchwork** package](https://patchwork.data-imaginist.com/) [@R-patchwork] to combine them into one compound plot.
```{r, fig.width = 5, fig.height = 5}
library(patchwork)
p1 + p2 + scale_y_continuous(breaks = NULL)
```
As interesting as these plots are, do remember that "the need for ad hoc correlations has been effectively replaced by the widespread availability of computer software for fitting the multilevel model for change directly" (pp. 41--42). As you'll see, Bürkner's **brms** package is one of the foremost in that regard.
## Improving the precision and reliability of ~~OLS~~ single-level-Bayesian-estimated rates of change: Lessons for research design
> Statisticians assess the precision of a parameter estimate in terms of its *sampling variation*, a measure of the variability that would be found across infinite resamplings from the same population. The most common measure of sampling variability is an estimate's *standard error*, the square root of its estimated sampling variance. Precision and standard error have an inverse relationship; the smaller the standard error, the more precise the estimate. (p. 41, *emphasis* in the original)
So here's the deal: When Singer and Willett wrote "Statisticians assess..." a more complete expression would have been 'Frequentist statisticians assess...' Bayesian statistics are not based on asymptotic theory. They do not presume an idealized infinite distribution of replications. Rather, Bayesian statistics use Bayes' theorem to estimate the probability of the parameters given the data. That probability has a distribution. Analogous to frequentist statistics, we often summarize that distribution (i.e., the posterior distribution) in terms of central tendency (e.g., posterior mean, posterior median, posterior mode) and spread. *Spread?* you say. We typically express spread in one or both of two ways. One typical expression of spread is the 95% intervals. In the Bayesian world, these are often called credible or probability intervals. The other typical expression of spread is the *posterior standard deviation*. In **brms**, this is typically summarized in the 'Est.Error' column of the output of functions like `print()` and `posterior_summary()` and so on. The posterior standard deviation is analogous to the frequentist standard error. Philosophically and mechanically, they are *not* the same. But in practice, they are often quite similar.
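To make that connection concrete, here's a small sketch showing that the 'Est.Error' value for a parameter is just the standard deviation of its posterior draws. It assumes a **brms** version recent enough to support `as_draws_df()`.
```{r, eval = F}
# the SD of the posterior draws for the `time` slope...
as_draws_df(fit2.3) %>%
  summarise(posterior_sd = sd(b_time))

# ...matches the 'Est.Error' column here
posterior_summary(fit2.3)["b_time", ]
```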
Later we read:
> Unlike precision which describes how well an individual slope estimate measures that person's true rate of change, reliability describes how much the rate of change varies across people. Precision has meaning for the individual; reliability has meaning for the group. (p. 42)
I have to protest. True, if we were working within a Classical Test Theory paradigm, this would be correct. But this places reliability within the context of a group-based cross-sectional design. Though this is a popular design, it is not the whole story (i.e., see this book!). For introductions to more expansive and person-specific notions of reliability, check out [Lee Cronbach](https://en.wikipedia.org/wiki/Lee_Cronbach)'s Generalizability Theory [@cronbachDependabilityBehavioralMeasurements1972; @brennanGeneralizabilityTheory2001; also @cranfordProcedureEvaluatingSensitivity2006; @lopilatoUpdatingGeneralizabilityTheory2015; @shroutPsychometrics2012].
## Session info {-}
```{r}
sessionInfo()
```
```{r, echo = F, message = F}
# here we'll remove our objects
rm(tolerance, tolerance_pp, by_id, fit2.1, fit2.2, fits, mean_structure, residual_variance, r2, table, fit2.3, tol_fitted, fit2.4, tol_fitted_male, fit2.5, tol_fitted_exposure, p1, p2)
theme_set(theme_grey())
pacman::p_unload(pacman::p_loaded(), character.only = TRUE)
```
<!--chapter:end:02.Rmd-->
```{r, echo = F, cache = F}
knitr::opts_chunk$set(fig.retina = 2.5)
knitr::opts_chunk$set(fig.align = "center")
options(width = 110)
```
# Introducing the Multilevel Model for Change
> In this chapter [Singer and Willett introduced] the multilevel model for change, demonstrating how it allows us to address within-person and between-person questions about change simultaneously. Although there are several ways of writing the statistical model, here we adopt a simple and common approach that has much substantive appeal. We specify the multilevel model for change by simultaneously postulating a pair of subsidiary models—a level-1 submodel that describes how each person changes over time, and a level-2 model that describes how these changes differ across people [@bryk1987application; @rogosaUnderstandingCorrelatesChange1985]. [@singerAppliedLongitudinalData2003, p. 3]
## What is the purpose of the multilevel model for change?
Unfortunately, we do not have access to the full data set Singer and Willett used in this chapter. For details, go [here](https://stats.idre.ucla.edu/r/examples/alda/r-applied-longitudinal-data-analysis-ch-3/). However, I was able to use the data provided in Table 3.1 and the model results in Table 3.3 to simulate data with characteristics similar to those of the original. To see how I did it, look at the section at the end of the chapter.
Anyway, here are the data in Table 3.1.
```{r, warning = F, message = F}
library(tidyverse)
early_int <-
tibble(id = rep(c(68, 70:72, 902, 904, 906, 908), each = 3),
age = rep(c(1, 1.5, 2), times = 8),
cog = c(103, 119, 96, 106, 107, 96, 112, 86, 73, 100, 93, 87,
119, 93, 99, 112, 98, 79, 89, 66, 81, 117, 90, 76),
program = rep(1:0, each = 12))
print(early_int)
```
Later on, we also fit models using $age - 1$. Here we'll compute that and save it as `age_c`.
```{r}
early_int <-
early_int %>%
mutate(age_c = age - 1)
head(early_int)
```
Here we'll load our simulation of the full $n = 103$ data set.
```{r}
load("data/early_int_sim.rda")
```
## The level-1 submodel for individual change
This part of the model is also called the *individual growth model*. Remember how in the last chapter we fit a series of participant-specific models? That's the essence of this part of the model.
Here's our version of Figure 3.1. Note that here we're being lazy and just using OLS estimates.
```{r, fig.width = 6.5, fig.height = 4, message = F, warning = F}
early_int %>%
ggplot(aes(x = age, y = cog)) +
stat_smooth(method = "lm", se = F) +
geom_point() +
scale_x_continuous(breaks = c(1, 1.5, 2)) +
ylim(50, 150) +
theme(panel.grid = element_blank()) +
facet_wrap(~id, ncol = 4)
```
Based on these data, we postulate our level-1 submodel to be
$$
\text{cog}_{ij} = [ \pi_{0i} + \pi_{1i} (\text{age}_{ij} - 1) ] + [\epsilon_{ij}].
$$
### The structural part of the level-1 submodel.
As far as I can tell, the data for Figure 3.2 are something like this.
```{r}
d <-
tibble(id = "i",
age = c(1, 1.5, 2),
cog = c(95, 100, 135))
d
```
To add in the horizontal dashed lines in Figure 3.2, we'll need to fit a model. Let's be lazy and use OLS. Don't worry, we'll use Bayes in a bit.
```{r}