```{r, echo = FALSE, cache = FALSE}
knitr::opts_chunk$set(fig.retina = 2.5)
options(width = 100)
```
# Multilevel Models
> Multilevel models... remember features of each cluster in the data as they learn about all of the clusters. Depending upon the variation among clusters, which is learned from the data as well, the model pools information across clusters. This pooling tends to improve estimates about each cluster. This improved estimation leads to several, more pragmatic sounding, benefits of the multilevel approach. [@mcelreathStatisticalRethinkingBayesian2015, p. 356]
These benefits include:
* improved estimates for repeated sampling (i.e., in longitudinal data)
* improved estimates when there are imbalances among subsamples
* estimates of the variation across subsamples
* avoiding simplistic averaging by retaining variation across subsamples
> All of these benefits flow out of the same strategy and model structure. You learn one basic design and you get all of this for free.
>
> When it comes to regression, multilevel regression deserves to be the default approach. There are certainly contexts in which it would be better to use an old-fashioned single-level model. But the contexts in which multilevel models are superior are much more numerous. It is better to begin to build a multilevel analysis, and then realize it's unnecessary, than to overlook it. And once you grasp the basic multilevel strategy, it becomes much easier to incorporate related tricks such as allowing for measurement error in the data and even model missing data itself ([Chapter 14][Missing Data and Other Opportunities]). (p. 356)
I'm totally on board with this. After learning about the multilevel model, I see it everywhere. For more on the sentiment that it should be the default, check out McElreath's blog post, [*Multilevel regression as default*](https://elevanth.org/blog/2017/08/24/multilevel-regression-as-default/).
## Example: Multilevel tadpoles
Let's load the `reedfrogs` data [see @voneshCompensatoryLarvalResponses2005].
```{r, message = F, warning = F}
library(rethinking)
data(reedfrogs)
d <- reedfrogs
```
Detach rethinking and load brms.
```{r, message = F}
rm(reedfrogs)
detach(package:rethinking, unload = T)
library(brms)
```
Go ahead and acquaint yourself with the `reedfrogs`.
```{r, message = F, warning = F}
library(tidyverse)
d %>%
glimpse()
```
Making the `tank` cluster variable is easy.
```{r}
d <-
d %>%
mutate(tank = 1:nrow(d))
```
Here's the formula for the un-pooled model in which each `tank` gets its own intercept:
\begin{align*}
\text{surv}_i & \sim \operatorname{Binomial}(n_i, p_i) \\
\operatorname{logit}(p_i) & = \alpha_{\text{tank}_i} \\
\alpha_{\text{tank}} & \sim \operatorname{Normal}(0, 5),
\end{align*}
where $n_i$ is indexed by the `density` column. Its values are distributed like so:
```{r}
d %>%
count(density)
```
Now fit this simple aggregated binomial model much like we practiced in [Chapter 10][Aggregated binomial: Chimpanzees again, condensed.].
```{r b12.1}
b12.1 <-
brm(data = d,
family = binomial,
surv | trials(density) ~ 0 + factor(tank),
prior(normal(0, 5), class = b),
iter = 2000, warmup = 500, chains = 4, cores = 4,
seed = 12,
file = "fits/b12.01")
```
We don't need a `depth=2` argument to discover we have 48 different intercepts. Just good old `print()` will do.
```{r}
print(b12.1)
```
Do you remember the dummy variable models from back in [Chapter 5][Binary categories.]? This model is like one of those on steroids. It'll be instructive to take a look at their distributions in density plots. We'll plot them in both their log-odds and probability metrics.
For kicks and giggles, let's use a [FiveThirtyEight-like theme](https://github.com/alex23lemm/theme_fivethirtyeight) for this chapter's plots. An easy way to do so is with help from the [ggthemes package](https://cran.r-project.org/package=ggthemes).
```{r, fig.width = 7, fig.height = 3.5, warning = F, message = F}
library(ggthemes)
tibble(estimate = fixef(b12.1)[, 1]) %>%
mutate(p = inv_logit_scaled(estimate)) %>%
gather() %>%
mutate(key = if_else(key == "p", "expected survival probability", "expected survival log-odds")) %>%
ggplot(aes(x = value, fill = key)) +
geom_density(linewidth = 0) +
scale_fill_manual(values = c("orange1", "orange4")) +
scale_y_continuous(breaks = NULL) +
labs(title = "Tank-level intercepts from the no-pooling model",
subtitle = "Notice how inspecting the distributions of the posterior means can offer insights you\nmight not get if you looked at them one at a time.") +
theme_fivethirtyeight() +
theme(legend.position = "none",
panel.grid.major = element_blank()) +
facet_wrap(~key, scales = "free")
```
Even though it seems like we can derive important insights from how the `tank`-level intercepts are distributed, that information is not explicitly encoded in the statistical model. Keep that in mind as we now consider the multilevel alternative. Its formula is
\begin{align*}
\text{surv}_i & \sim \operatorname{Binomial}(n_i, p_i) \\
\operatorname{logit}(p_i) & = \alpha_{\text{tank}_i} \\
\alpha_\text{tank} & \sim \operatorname{Normal}(\color{blue}\alpha, \color{blue}\sigma) \\
\color{blue}\alpha & \color{blue}\sim \color{blue}{\operatorname{Normal}(0, 1)} \\
\color{blue}\sigma & \color{blue}\sim \color{blue}{\operatorname{HalfCauchy}(0, 1)}.
\end{align*}
> The Gaussian distribution with mean $\alpha$ and standard deviation $\sigma$ is the prior for each tank's intercept. But that prior itself has priors for $\alpha$ and $\sigma$. So there are two *levels* in the model, each resembling a simpler model. (p. 359, *emphasis* in the original)
You specify the corresponding multilevel model like this.
```{r b12.2}
b12.2 <-
brm(data = d,
family = binomial,
surv | trials(density) ~ 1 + (1 | tank),
prior = c(prior(normal(0, 1), class = Intercept),
prior(cauchy(0, 1), class = sd)),
iter = 4000, warmup = 1000, chains = 4, cores = 4,
seed = 12,
file = "fits/b12.02")
```
The syntax for the varying effects follows the [lme4 style](https://cran.r-project.org/package=brms/vignettes/brms_overview.pdf), `( <varying parameter(s)> | <grouping variable(s)> )`. In this case `(1 | tank)` indicates only the intercept, `1`, varies by `tank`. The extent to which parameters vary is controlled by the prior, `prior(cauchy(0, 1), class = sd)`, which is <u>parameterized in the standard deviation metric</u>. Do note that last part. It's common in multilevel software to model in the variance metric, instead. For technical reasons we won't really get into until [Chapter 13][Adventures in Covariance], Stan parameterizes this as a standard deviation.
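If you're ever unsure which parameter classes accept priors in a given model, `brms::get_prior()` will list them before you fit anything. Here's what that looks like for our multilevel formula; note the `sd` class rows for the varying intercepts.
```{r, message = F}
# survey the available prior classes (e.g., `Intercept` and `sd`) before fitting
get_prior(data = d,
          family = binomial,
          surv | trials(density) ~ 1 + (1 | tank))
```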
Let's perform the WAIC comparisons.
```{r, message = F}
b12.1 <- add_criterion(b12.1, "waic")
b12.2 <- add_criterion(b12.2, "waic")
w <- loo_compare(b12.1, b12.2, criterion = "waic")
print(w, simplify = F)
```
The `se_diff` is large relative to the `elpd_diff`. If we convert the $\text{elpd}$ difference to the WAIC metric, the message stays the same.
```{r}
cbind(waic_diff = w[, 1] * -2,
se = w[, 2] * 2)
```
Here are the WAIC weights.
```{r}
model_weights(b12.1, b12.2, weights = "waic") %>%
round(digits = 2)
```
Let's check the model summary.
```{r}
print(b12.2)
```
This time we don't get a list of 48 separate `tank`-level parameters. However, we do get a description of their distribution in terms of a mean (i.e., `Intercept`) and standard deviation (i.e., `sd(Intercept)`). If you'd like the actual `tank`-level parameters, don't worry; they're coming in Figure 12.1. We'll need to do a little prep work, though.
```{r}
post <- as_draws_df(b12.2)
post_mdn <-
tibble(Estimate = coef(b12.2, robust = T)$tank[, "Estimate", ]) %>%
mutate(post_mdn = inv_logit_scaled(Estimate)) %>%
bind_cols(d)
post_mdn
```
Here's the ggplot2 code to reproduce Figure 12.1.
```{r, fig.width = 5, fig.height = 4}
post_mdn %>%
ggplot(aes(x = tank)) +
geom_hline(yintercept = inv_logit_scaled(median(post$b_Intercept)), linetype = 2, linewidth = 1/4,
color = ggthemes_data[["fivethirtyeight"]][1, 2] %>% pull()) +
geom_vline(xintercept = c(16.5, 32.5), linewidth = 1,
color = ggthemes_data[["fivethirtyeight"]][2, 2] %>% pull()) +
geom_point(aes(y = propsurv), color = "orange2") +
geom_point(aes(y = post_mdn), shape = 1) +
annotate(geom = "text", x = c(8, 16 + 8, 32 + 8), y = 0,
label = c("small tanks", "medium tanks", "large tanks")) +
scale_x_continuous(breaks = c(1, 16, 32, 48)) +
labs(title = "Multilevel shrinkage!",
subtitle = "The empirical proportions are in orange while the model-\nimplied proportions are the black circles. The dashed line is\nthe model-implied average survival proportion.") +
coord_cartesian(ylim = c(0, 1)) +
theme_fivethirtyeight() +
theme(panel.grid.major = element_blank())
```
Here is the code for our version of Figure 12.2.a, where we visualize the model-implied population distribution of log-odds survival (i.e., the population distribution yielding all the `tank`-level intercepts).
```{r}
# this makes the output of `slice_sample()` reproducible
set.seed(12)
p1 <-
post %>%
slice_sample(n = 100) %>%
expand_grid(x = seq(from = -4, to = 5, length.out = 100)) %>%
ggplot(aes(x = x, group = .draw)) +
geom_line(aes(y = dnorm(x, b_Intercept, sd_tank__Intercept)),
alpha = .2, color = "orange2") +
scale_y_continuous(NULL, breaks = NULL) +
labs(title = "Population survival distribution",
subtitle = "log-odds scale") +
coord_cartesian(xlim = c(-3, 4))
```
Now we make our Figure 12.2.b and then bind the two subplots with patchwork.
```{r, fig.width = 6, fig.height = 3.25}
p2 <-
ggplot(data = post,
aes(x = rnorm(n = nrow(post),
mean = b_Intercept,
sd = sd_tank__Intercept) %>%
inv_logit_scaled())) +
geom_density(linewidth = 0, fill = "orange2") +
scale_y_continuous(NULL, breaks = NULL) +
labs(title = "Probability of survival",
subtitle = "transformed by the inverse-logit function")
library(patchwork)
(p1 + p2) &
theme_fivethirtyeight() &
theme(plot.title = element_text(size = 12),
plot.subtitle = element_text(size = 10))
```
In the left plot, notice the uncertainty in terms of both location $\alpha$ and scale $\sigma$. For both plots, note how we sampled 12,000 imaginary `tanks` rather than McElreath's 8,000. This is because we had 12,000 post-warmup HMC draws (i.e., execute `nrow(post)`).
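Here's that check.
```{r}
nrow(post)
```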
The `aes()` code in the density plot above was a bit much. To get a sense of how it worked, consider this.
```{r}
set.seed(12)
rnorm(n = 1,
mean = post$b_Intercept,
sd = post$sd_tank__Intercept) %>%
inv_logit_scaled()
```
First, we took one random draw from a normal distribution with a mean of the first row in `post$b_Intercept` and a standard deviation of the value from the first row in `post$sd_tank__Intercept`, and passed it through the `inv_logit_scaled()` function. By replacing the `1` with `nrow(post)`, we do this `nrow(post)` times (i.e., 12,000). Our orange density, then, is the summary of that process.
#### Rethinking: Varying intercepts as over-dispersion.
> In the previous chapter ([page 346][Over-dispersed outcomes]), the beta-binomial and gamma-Poisson models were presented as ways for coping with **over-dispersion** of count data. Varying intercepts accomplish the same thing, allowing count outcomes to be over-dispersed. They accomplish this, because when each observed count gets its own unique intercept, but these intercepts are pooled through a common distribution, the predictions expect over-dispersion just like a beta-binomial or gamma-Poisson model would. Compared to a beta-binomial or gamma-Poisson model, a binomial or Poisson model with a varying intercept on every observed outcome will often be easier to estimate and easier to extend. (p. 363, **emphasis** in the original)
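To make the connection concrete, here's a hypothetical sketch of a Poisson model with an observation-level varying intercept. We don't fit this model; the `my_count_data` object, its `y` column, and the priors are all made up for illustration.
```{r, eval = F}
# hypothetical sketch: each observed count gets its own intercept, and the
# intercepts are pooled through a common distribution, which allows the
# predictions to be over-dispersed much like a gamma-Poisson model would
my_count_data <- my_count_data %>%
  mutate(obs = 1:n())  # one grouping level per observed count

brm(data = my_count_data,
    family = poisson,
    y ~ 1 + (1 | obs),
    prior = c(prior(normal(0, 10), class = Intercept),
              prior(cauchy(0, 1), class = sd)))
```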
#### Overthinking: Prior for variance components.
Yep, you can use the exponential distribution for your priors in brms, too. Here it is for model `b12.2`.
```{r b12.2b, message = F}
b12.2b <-
update(b12.2,
prior = c(prior(normal(0, 1), class = Intercept),
prior(exponential(1), class = sd)),
seed = 12,
file = "fits/b12.02b")
```
The model summary:
```{r}
print(b12.2b)
```
If you're curious how the exponential prior compares to the posterior, you might just plot.
```{r, fig.width = 3.75, fig.height = 3.25}
tibble(x = seq(from = 0, to = 6, by = 0.01)) %>%
ggplot() +
# the prior
geom_area(aes(x = x, y = dexp(x, rate = 1)),
fill = "orange2", alpha = 1/3) +
# the posterior
geom_density(data = as_draws_df(b12.2b),
aes(x = sd_tank__Intercept),
fill = "orange2", linewidth = 0) +
scale_y_continuous(NULL, breaks = NULL) +
labs(title = "Bonus prior/posterior plot\nfor sd_tank__Intercept",
subtitle = "The prior is the semitransparent ramp in the\nbackground. The posterior is the solid orange\nmound.") +
coord_cartesian(xlim = c(0, 5)) +
theme_fivethirtyeight()
```
## Varying effects and the underfitting/overfitting trade-off
> Varying intercepts are just regularized estimates, but adaptively regularized by estimating how diverse the clusters are while estimating the features of each cluster. This fact is not easy to grasp....
>
> A major benefit of using varying effects estimates, instead of the empirical raw estimates, is that they provide more accurate estimates of the individual cluster (tank) intercepts. On average, the varying effects actually provide a better estimate of the individual tank (cluster) means. The reason that the varying intercepts provides better estimates is that they do a better job trading off underfitting and overfitting. (p. 364)
In this section, we explicate this by contrasting three perspectives:
* Complete pooling (i.e., a single-$\alpha$ model; see the sketch after this list)
* No pooling (i.e., the single-level $\alpha_{\text{tank}_i}$ model)
* Partial pooling (i.e., the multilevel model for which $\alpha_\text{tank} \sim \operatorname{Normal}(\alpha, \sigma)$)
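Of these three, the complete-pooling model is the only one we haven't expressed in `brm()` code. As a minimal sketch, and assuming a simple placeholder prior, it might look like this for the `reedfrogs` data; we don't fit it in this chapter.
```{r, eval = F}
# a sketch of the complete-pooling model: a single alpha shared by all tanks
brm(data = d,
    family = binomial,
    surv | trials(density) ~ 1,
    prior = prior(normal(0, 5), class = Intercept))
```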
> To demonstrate [the magic of the multilevel model], we'll simulate some tadpole data. That way, we'll know the true per-pond survival probabilities. Then we can compare the no-pooling estimates to the partial pooling estimates, by computing how close each gets to the true values they are trying to estimate. The rest of this section shows how to do such a simulation. (p. 365)
### The model.
The simulation formula should look familiar.
\begin{align*}
\text{surv}_i & \sim \operatorname{Binomial}(n_i, p_i) \\
\operatorname{logit}(p_i) & = \alpha_{\text{pond}_i} \\
\alpha_{\text{pond}} & \sim \operatorname{Normal}(\alpha, \sigma) \\
\alpha & \sim \operatorname{Normal}(0, 1) \\
\sigma & \sim \operatorname{HalfCauchy}(0, 1)
\end{align*}
### Assign values to the parameters.
Here we follow along with McElreath and "assign specific values representative of the actual tadpole data" (p. 366). However, our values will differ a little from his. Because he did not show a `set.seed()` line, we don't know what seed was used to generate pseudorandom draws from the `rnorm()` function.
```{r}
a <- 1.4
sigma <- 1.5
n_ponds <- 60
set.seed(12)
dsim <-
tibble(pond = 1:n_ponds,
ni = rep(c(5, 10, 25, 35), each = n_ponds / 4) %>% as.integer(),
true_a = rnorm(n = n_ponds, mean = a, sd = sigma))
head(dsim)
```
McElreath twice urged us to inspect the contents of the simulation. In addition to looking at the data with `head()`, we might well plot.
```{r, fig.width = 5, fig.height = 3, warning = F, message = F}
library(tidybayes)
dsim %>%
mutate(ni = factor(ni)) %>%
ggplot(aes(x = true_a, y = ni)) +
stat_halfeye(.width = .5, fill = "orange2") +
ggtitle("Log-odds varying by # tadpoles per pond") +
theme_fivethirtyeight() +
theme(plot.title = element_text(size = 14))
```
### Simulate survivors.
> Each pond $i$ has $n_i$ potential survivors, and nature flips each tadpole's coin, so to speak, with probability of survival $p_i$. This probability $p_i$ is implied by the model definition, and is equal to:
>
> $$p_i = \frac{\exp (\alpha_i)}{1 + \exp (\alpha_i)}$$
>
> The model uses a logit link, and so the probability is defined by the [`inv_logit_scaled()`] function. (p. 367)
```{r}
set.seed(12)
(
dsim <-
dsim %>%
mutate(si = rbinom(n = n(), prob = inv_logit_scaled(true_a), size = ni))
)
```
### Compute the no-pooling estimates.
The no-pooling estimates (i.e., $\alpha_{\text{pond}_i}$) are the results of simple algebra.
```{r}
(
dsim <-
dsim %>%
mutate(p_nopool = si / ni)
)
```
"These are the same no-pooling estimates you'd get by fitting a model with a dummy variable for each pond and flat priors that induce no regularization" (p. 367). That is, these are the same kinds of estimates we got back when we fit `b12.1`.
### Compute the partial-pooling estimates.
To follow along with McElreath, set `chains = 1, cores = 1` to fit with one chain.
```{r b12.3}
b12.3 <-
brm(data = dsim,
family = binomial,
si | trials(ni) ~ 1 + (1 | pond),
prior = c(prior(normal(0, 1), class = Intercept),
prior(cauchy(0, 1), class = sd)),
iter = 10000, warmup = 1000, chains = 1, cores = 1,
seed = 12,
file = "fits/b12.03")
```
Here's our standard brms summary.
```{r}
print(b12.3)
```
I'm not aware that you can use McElreath's `depth=2` trick in brms for `summary()` or `print()`. But you can get most of that information, and more, with the Stan-like summary using the `$fit` syntax.
```{r}
b12.3$fit
```
As an aside, notice how this summary still reports the old-style `n_eff` values, rather than the updated `Bulk_ESS` and `Tail_ESS` values. I suspect this will change sometime soon. In the meantime, here's a [thread on the Stan Forums](https://discourse.mc-stan.org/t/new-r-hat-and-ess/8165) featuring members of the Stan team discussing the topic.
Let's get ready for the diagnostic plot of Figure 12.3. First we add the partially-pooled estimates, as summarized by their posterior means, to the `dsim` data. Then we compute error values.
```{r}
# we could have included this step in the block of code below, if we wanted to
p_partpool <-
tibble(Estimate = coef(b12.3)$pond[, "Estimate", ]) %>%
transmute(p_partpool = inv_logit_scaled(Estimate))
dsim <-
dsim %>%
bind_cols(p_partpool) %>%
mutate(p_true = inv_logit_scaled(true_a)) %>%
mutate(nopool_error = abs(p_nopool - p_true),
partpool_error = abs(p_partpool - p_true))
dsim %>%
glimpse()
```
Here is our code for Figure 12.3. The extra data processing for `dfline` is how we get the values necessary for the horizontal summary lines.
```{r, fig.width = 5, fig.height = 5, message = F, warning = F}
dfline <-
dsim %>%
select(ni, nopool_error:partpool_error) %>%
gather(key, value, -ni) %>%
group_by(key, ni) %>%
summarise(mean_error = mean(value)) %>%
mutate(x = c( 1, 16, 31, 46),
xend = c(15, 30, 45, 60))
dsim %>%
ggplot(aes(x = pond)) +
geom_vline(xintercept = c(15.5, 30.5, 45.5),
color = "white", linewidth = 2/3) +
geom_point(aes(y = nopool_error), color = "orange2") +
geom_point(aes(y = partpool_error), shape = 1) +
geom_segment(data = dfline,
aes(x = x, xend = xend,
y = mean_error, yend = mean_error),
color = rep(c("orange2", "black"), each = 4),
linetype = rep(1:2, each = 4)) +
annotate("text", x = c(15 - 7.5, 30 - 7.5, 45 - 7.5, 60 - 7.5), y = .45,
label = c("tiny (5)", "small (10)", "medium (25)", "large (35)")) +
scale_x_continuous(breaks = c(1, 10, 20, 30, 40, 50, 60)) +
labs(title = "Estimate error by model type",
subtitle = "The horizontal axis displays pond number. The vertical axis measures\nthe absolute error in the predicted proportion of survivors, compared to\nthe true value used in the simulation. The higher the point, the worse\nthe estimate. No-pooling shown in orange. Partial pooling shown in black.\nThe orange and dashed black lines show the average error for each kind\nof estimate, across each initial density of tadpoles (pond size). Smaller\nponds produce more error, but the partial pooling estimates are better\non average, especially in smaller ponds.",
y = "absolute error") +
theme_fivethirtyeight() +
theme(panel.grid.major = element_blank(),
plot.subtitle = element_text(size = 10))
```
If you wanted to quantify the difference in simple summaries, you might do something like this.
```{r, message = F, warning = F}
dsim %>%
select(ni, nopool_error:partpool_error) %>%
gather(key, value, -ni) %>%
group_by(key) %>%
summarise(mean_error = mean(value) %>% round(digits = 3),
median_error = median(value) %>% round(digits = 3))
```
I originally learned about the multilevel model in order to work with [longitudinal data](https://gseacademic.harvard.edu/alda/). In that context, I found the basic principles of a multilevel structure quite intuitive. The concept of partial pooling, however, took me some time to wrap my head around. If you're struggling with this, be patient and keep chipping away.
When McElreath [lectured on this topic in 2015](https://youtu.be/82TaniPgzQc?t=2048), he traced partial pooling to statistician [Charles M. Stein](https://imstat.org/2017/05/15/obituary-charles-m-stein-1920-2016/). In 1977, Efron and Morris wrote the now classic [-@efronSteinParadoxStatistics1977] paper, [*Stein's paradox in statistics*](https://statweb.stanford.edu/~ckirby/brad/other/Article1977.pdf), which does a nice job breaking down why partial pooling can be so powerful. One of the primary examples they used in the paper was of 1970 batting-average data. If you'd like more practice seeing how partial pooling works--or if you just like baseball--check out my blog post, [*Stein's paradox and what partial pooling can do for you*](https://solomonkurz.netlify.app/blog/2019-02-23-stein-s-paradox-and-what-partial-pooling-can-do-for-you/).
#### Overthinking: Repeating the pond simulation.
Within the brms workflow, we can reuse a compiled model with `update()`. But first, we'll simulate new data.
```{r}
a <- 1.4
sigma <- 1.5
n_ponds <- 60
set.seed(1999) # for new data, set a new seed
new_dsim <-
tibble(pond = 1:n_ponds,
ni = rep(c(5, 10, 25, 35), each = n_ponds / 4) %>% as.integer(),
true_a = rnorm(n = n_ponds, mean = a, sd = sigma)) %>%
mutate(si = rbinom(n = n(), prob = inv_logit_scaled(true_a), size = ni)) %>%
mutate(p_nopool = si / ni)
glimpse(new_dsim)
```
Fit the new model.
```{r b12.3_new}
b12.3_new <-
update(b12.3,
newdata = new_dsim,
seed = 12,
file = "fits/b12.03_new")
```
```{r}
print(b12.3_new)
```
Why not plot the first simulation versus the second one?
```{r, fig.width = 8, fig.height = 4.5}
bind_rows(as_draws_df(b12.3),
as_draws_df(b12.3_new)) %>%
mutate(model = rep(c("b12.3", "b12.3_new"), each = n() / 2)) %>%
ggplot(aes(x = b_Intercept, y = sd_pond__Intercept)) +
stat_density_2d(geom = "raster",
aes(fill = after_stat(density)),
contour = F, n = 200) +
geom_vline(xintercept = a, color = "orange3", linetype = 3) +
geom_hline(yintercept = sigma, color = "orange3", linetype = 3) +
scale_fill_gradient(low = "grey25", high = "orange3") +
ggtitle("Our simulation posteriors contrast a bit",
subtitle = expression(alpha*" is on the x and "*sigma*" is on the y, both in log-odds. The dotted lines intersect at the true values.")) +
coord_cartesian(xlim = c(.7, 2),
ylim = c(.8, 1.9)) +
theme_fivethirtyeight() +
theme(legend.position = "none",
panel.grid.major = element_blank()) +
facet_wrap(~model, ncol = 2)
```
If you'd like the `stanfit` portion of your `brm()` object, subset with `$fit`. Take `b12.3`, for example. You might check out its structure via `b12.3$fit %>% str()`. Here's the actual Stan code.
```{r}
b12.3$fit@stanmodel
```
And you can get the data of a given `brm()` fit object like so.
```{r}
b12.3$data %>%
head()
```
## More than one type of cluster
"We can use and often should use more than one type of cluster in the same model" (p. 370).
#### Rethinking: Cross-classification and hierarchy.
> The kind of data structure in `data(chimpanzees)` is usually called a **cross-classified** multilevel model. It is cross-classified, because actors are not nested within unique blocks. If each chimpanzee had instead done all of his or her pulls on a single day, within a single block, then the data structure would instead be *hierarchical.* (p. 371, **emphasis** in the original)
### Multilevel chimpanzees.
The initial multilevel update from model `b10.4` from [Chapter 10][Logistic regression: Prosocial chimpanzees.] follows the statistical formula
\begin{align*}
\text{left_pull}_i & \sim \operatorname{Binomial}(n_i = 1, p_i) \\
\operatorname{logit}(p_i) & = \alpha + \color{blue}{\alpha_{\text{actor}_i}} + (\beta_1 + \beta_2 \text{condition}_i) \text{prosoc_left}_i \\
\color{blue}{\alpha_\text{actor}} & \color{blue}\sim \color{blue}{\operatorname{Normal}(0, \sigma_\text{actor})} \\
\alpha & \sim \operatorname{Normal}(0, 10) \\
\beta_1 & \sim \operatorname{Normal}(0, 10) \\
\beta_2 & \sim \operatorname{Normal}(0, 10) \\
\color{blue}{\sigma_\text{actor}} & \color{blue}\sim \color{blue}{\operatorname{HalfCauchy}(0, 1)}.
\end{align*}
> Notice that $\alpha$ is inside the linear model, not inside the Gaussian prior for $\alpha_\text{actor}$. This is mathematically equivalent to what [we] did with the tadpoles earlier in the chapter. You can always take the mean out of a Gaussian distribution and treat that distribution as a constant plus a Gaussian distribution centered on zero.
>
> This might seem a little weird at first, so it might help train your intuition by experimenting in R. (p. 371)
Behold our two identical Gaussians in a tidy tibble.
```{r}
set.seed(12)
two_gaussians <-
tibble(y1 = rnorm(n = 1e4, mean = 10, sd = 1),
y2 = 10 + rnorm(n = 1e4, mean = 0, sd = 1))
```
Let's follow McElreath's advice to make sure they are the same by superimposing the density of one on the other.
```{r, fig.width = 3.5, fig.height = 3}
two_gaussians %>%
ggplot() +
geom_density(aes(x = y1),
linewidth = 0, fill = "orange1", alpha = 1/3) +
geom_density(aes(x = y2),
linewidth = 0, fill = "orange4", alpha = 1/3) +
scale_y_continuous(NULL, breaks = NULL) +
ggtitle("Our simulated Gaussians") +
theme_fivethirtyeight()
```
Yep, those Gaussians look about the same.
Let's get the `chimpanzees` data from rethinking.
```{r, message = F}
library(rethinking)
data(chimpanzees)
d <- chimpanzees
```
Detach rethinking and reload brms.
```{r, message = F}
rm(chimpanzees)
detach(package:rethinking, unload = T)
library(brms)
```
For our brms model with varying intercepts for `actor` but not `block`, we employ the `pulled_left ~ 1 + ... + (1 | actor)` syntax, specifically omitting a `(1 | block)` section.
```{r b12.4}
b12.4 <-
brm(data = d,
family = binomial,
pulled_left | trials(1) ~ 1 + prosoc_left + prosoc_left:condition + (1 | actor),
prior = c(prior(normal(0, 10), class = Intercept),
prior(normal(0, 10), class = b),
prior(cauchy(0, 1), class = sd)),
# I'm using 4 cores, instead of the `cores=3` in McElreath's code
iter = 5000, warmup = 1000, chains = 4, cores = 4,
seed = 12,
file = "fits/b12.04")
```
McElreath encouraged us to inspect the trace plots. Here they are.
```{r, message = F, warning = F, fig.width = 8, fig.height = 4}
library(bayesplot)
color_scheme_set("orange")
post <- as_draws_df(b12.4)
post %>%
mcmc_trace(pars = vars(b_Intercept:`r_actor[7,Intercept]`),
facet_args = list(ncol = 4)) +
scale_x_continuous(breaks = 0:2 * 2000) +
theme_fivethirtyeight() +
theme(legend.direction = "vertical",
legend.key.size = unit(0.5, "cm"),
legend.position = c(.96, .1))
```
They look great. Here's the posterior distribution for $\sigma_\text{actor}$.
```{r, fig.width = 4, fig.height = 2.5}
post %>%
ggplot(aes(x = sd_actor__Intercept, y = 0)) +
stat_halfeye(.width = .95, fill = "orange2") +
scale_y_continuous(NULL, breaks = NULL) +
ggtitle(expression(sigma[actor])) +
theme_fivethirtyeight()
```
We can inspect the $\widehat R$ values for all parameters with the `bayesplot::rhat()` function. Here we'll put it in a data frame for easy viewing.
```{r}
data.frame(rhat = rhat(b12.4))
```
I'm not aware that brms or bayesplot offer a convenient way to compute the `Bulk_ESS` or `Tail_ESS` values for arbitrary combinations of parameters in a data frame. However, you can do so with the `posterior::summarise_draws()` function.
```{r, warning = F, message = F}
library(posterior)
summarise_draws(b12.4)
```
Note how the last three columns are the `rhat`, the `ess_bulk`, and the `ess_tail`. Here we summarize those two effective sample size columns with histograms.
```{r, fig.width = 5, fig.height = 2.75, warning = F}
summarise_draws(b12.4) %>%
select(starts_with("ess")) %>%
gather() %>%
ggplot(aes(x = value)) +
geom_histogram(binwidth = 100, fill = "blue") +
xlim(0, NA) +
theme_fivethirtyeight() +
facet_wrap(~key)
```
McElreath pointed out it's important to understand that the `actor`-level parameters, as in the summary above, are *deviations* from the grand mean. Here's one way to add them together.
```{r, message = F, warning = F}
post %>%
select(starts_with("r_actor")) %>%
gather() %>%
# this is how we might add the grand mean to the actor-level deviations
mutate(value = value + post$b_Intercept) %>%
group_by(key) %>%
summarise(mean = mean(value) %>% round(digits = 2))
```
Here's another way to get at the same information, this time using `coef()` and a little formatting help from the `stringr::str_c()` function. Just for kicks, we'll throw in the 95% intervals, too.
```{r, message = F}
coef(b12.4)$actor[, c(1, 3:4), 1] %>%
as_tibble() %>%
round(digits = 2) %>%
# here we put the credible intervals in an APA-6-style format
mutate(`95% CIs` = str_c("[", Q2.5, ", ", Q97.5, "]"),
actor = str_c("chimp #", 1:7)) %>%
rename(mean = Estimate) %>%
select(actor, mean, `95% CIs`) %>%
knitr::kable()
```
If you prefer the posterior median to the mean, just add a `robust = T` argument inside the `coef()` function.
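Here's what that might look like with the code from above; the output is omitted.
```{r, eval = F}
# the same table, but summarized by posterior medians and percentile intervals
coef(b12.4, robust = T)$actor[, c(1, 3:4), 1] %>%
  as_tibble() %>%
  round(digits = 2)
```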
### Two types of cluster.
The full cross-classified statistical model follows the form
\begin{align*}
\text{left_pull}_i & \sim \operatorname{Binomial}(n_i = 1, p_i) \\
\operatorname{logit}(p_i) & = \alpha + \alpha_{\text{actor}_i} + \color{blue}{\alpha_{\text{block}_i}} + (\beta_1 + \beta_2 \text{condition}_i) \text{prosoc_left}_i \\
\alpha_{\text{actor}} & \sim \operatorname{Normal}(0, \sigma_{\text{actor}}) \\
\color{blue}{\alpha_{\text{block}}} & \color{blue}\sim \color{blue}{\operatorname{Normal}(0, \sigma_{\text{block}})} \\
\alpha & \sim \operatorname{Normal}(0, 10) \\
\beta_1 & \sim \operatorname{Normal}(0, 10) \\
\beta_2 & \sim \operatorname{Normal}(0, 10) \\
\sigma_{\text{actor}} & \sim \operatorname{HalfCauchy}(0, 1) \\
\color{blue}{\sigma_{\text{block}}} & \color{blue}\sim \color{blue}{\operatorname{HalfCauchy}(0, 1)}.
\end{align*}
Our brms model with varying intercepts for both `actor` and `block` now employs the `... (1 | actor) + (1 | block)` syntax.
```{r b12.5, message = F}
b12.5 <-
update(b12.4,
newdata = d,
formula = pulled_left | trials(1) ~ 1 + prosoc_left + prosoc_left:condition +
(1 | actor) + (1 | block),
iter = 6000, warmup = 1000, cores = 4, chains = 4,
control = list(adapt_delta = .85),
seed = 12,
file = "fits/b12.05")
```
This time we increased the `adapt_delta` parameter to `.85` to avoid divergent transitions. We can look at the primary coefficients with `print()`. McElreath urged us again to inspect the trace plots.
```{r, fig.width = 8, fig.height = 7, message = F, warning = F}
post <- as_draws_df(b12.5)
post %>%
mcmc_trace(pars = vars(b_Intercept:`r_block[6,Intercept]`),
facet_args = list(ncol = 4)) +
scale_x_continuous(breaks = 0:2 * 2500) +
theme_fivethirtyeight() +
theme(legend.position = c(.75, .06))
```
The trace plots look great. Here are the other diagnostics along with the parameter summaries, once again using `summarise_draws()`.
```{r}
summarise_draws(b12.5) %>%
select(-median, -mad) %>%
mutate_if(is.double, round, digits = 2)
```
All our $\widehat R$ values are excellent, and the values in both our `ess_.` columns are generally quite a bit higher than the `n_eff` values McElreath presented in the text.
If we'd like to restrict our focus to the parameter summaries, we can also use `brms::ranef()` to get those `depth=2`-type estimates. With `ranef()`, you get the group-specific estimates in a deviance metric. The `coef()` function, in contrast, yields the group-specific estimates in what you might call the natural metric. We'll get more language for this in the [next chapter][Adventures in Covariance].
```{r}
ranef(b12.5)$actor[, , "Intercept"] %>%
round(digits = 2)
ranef(b12.5)$block[, , "Intercept"] %>%
round(digits = 2)
```
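As a quick check on the distinction, the `coef()` point estimates should equal the `fixef()` intercept plus the `ranef()` deviations, since posterior means are linear.
```{r}
# the actor-level intercepts in the natural metric, computed two ways
fixef(b12.5)["Intercept", "Estimate"] + ranef(b12.5)$actor[, "Estimate", "Intercept"]
coef(b12.5)$actor[, "Estimate", "Intercept"]
```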
We might make the coefficient plot of Figure 12.4.a with `mcmc_plot()`.
```{r, fig.width = 3.5, fig.height = 3.5}
mcmc_plot(b12.5, variable = c("^r_", "^b_", "^sd_"), regex = T) +
theme_fivethirtyeight() +
theme(axis.text.y = element_text(hjust = 0))
```
Once we get the posterior samples, it's easy to compare the random variances as in Figure 12.4.b.
```{r, fig.width = 3, fig.height = 3}
post %>%
ggplot(aes(x = sd_actor__Intercept)) +
geom_density(linewidth = 0, fill = "orange1", alpha = 3/4) +
geom_density(aes(x = sd_block__Intercept),
linewidth = 0, fill = "orange4", alpha = 3/4) +
annotate(geom = "text", x = 2/3, y = 2, label = "block", color = "orange4") +
annotate(geom = "text", x = 2, y = 3/4, label = "actor", color = "orange1") +
scale_y_continuous(NULL, breaks = NULL) +
ggtitle(expression(sigma["<group>"])) +
coord_cartesian(xlim = c(0, 4)) +
theme_fivethirtyeight()
```
We might compare our models by their PSIS-LOO values.
```{r, message = F}
b12.4 <- add_criterion(b12.4, "loo")
b12.5 <- add_criterion(b12.5, "loo")
loo_compare(b12.4, b12.5) %>%
print(simplify = F)
model_weights(b12.4, b12.5, weights = "loo") %>%
round(digits = 2)
```
The two models yield nearly equivalent information criteria values. Yet recall what McElreath wrote: "There is nothing to gain here by selecting either model. The comparison of the two models tells a richer story" (p. 367).
### Even more clusters.
> Adding more types of clusters proceeds the same way. At some point the model may become too complex to reliably fit to data. But Hamiltonian Monte Carlo is very capable with varying effects. It can easily handle tens of thousands of varying effect parameters. Sampling will be slow in such cases, but it will work.
>
> So don't be shy--if you have a good theoretical reason to include a cluster variable, then you also have a good theoretical reason to partially pool its parameters. (p. 376)
This section just hints at a historical software difficulty. In short, it's not uncommon to have a theory-based model that includes multiple sources of clustering (i.e., requiring many `( <varying parameter(s)> | <grouping variable(s)> )` parts in the model `formula`). This can make for all kinds of computational difficulties and result in software error messages, inadmissible solutions, and so on. One of the practical solutions to difficulties like these has been to simplify the statistical models by removing some of the clustering terms. Even though such simpler models were not the theory-based ones, at least they yielded solutions. Nowadays, Stan (via brms or otherwise) is making it easier to fit the full theoretically-based model. To learn more about this topic, check out this nice blog post by [Michael Frank](https://web.stanford.edu/~mcfrank/), [*Mixed effects models: Is it time to go Bayesian by default?*](http://babieslearninglanguage.blogspot.com/2018/02/mixed-effects-models-is-it-time-to-go.html). Make sure to check out the discussion in the comments section, which includes all-stars like Bürkner and [Douglas Bates](http://pages.stat.wisc.edu/~bates/). You can get more context for the issue from @barrRandomEffectsStructure2013, [*Random effects structure for confirmatory hypothesis testing: Keep it maximal*](https://doi.org/10.1016/j.jml.2012.11.001).
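For a sense of what such a theory-based `formula` might look like, here is a hypothetical example with three clustering variables at once. The `my_data` object and the `subject`, `item`, and `lab` grouping variables are made up for illustration.
```{r, eval = F}
# a hypothetical cross-classified model; all names here are placeholders
brm(data = my_data,
    family = gaussian,
    y ~ 1 + x + (1 + x | subject) + (1 | item) + (1 | lab))
```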
## Multilevel posterior predictions
> Producing implied predictions from a fit model is very helpful for understanding what the model means. Every model is a merger of sense and nonsense. When we understand a model, we can find its sense and control its nonsense. But as models get more complex, it is very difficult to impossible to understand them just by inspecting tables of posterior means and intervals. Exploring implied posterior predictions helps much more...
>
> ...The introduction of varying effects does introduce nuance, however.
>
> First, we should no longer expect the model to exactly retrodict the sample, because adaptive regularization has as its goal to trade off poorer fit in sample for better inference and hopefully better fit out of sample. This is what shrinkage does for us...
>
> Second, "prediction" in the context of a multilevel model requires additional choices. If we wish to validate a model against the specific clusters used to fit the model, that is one thing. But if we instead wish to compute predictions for new clusters, other than the one observed in the sample, that is quite another. We'll consider each of these in turn, continuing to use the chimpanzees model from the previous section. (p. 376)
### Posterior prediction for same clusters.
Like McElreath did in the text, we'll do this two ways. Recall we use `brms::fitted()` in place of `rethinking::link()`.
```{r, message = F, warning = F}
chimp <- 2
nd <-
tibble(prosoc_left = c(0, 1, 0, 1),
condition = c(0, 0, 1, 1),
actor = chimp)
(
chimp_2_fitted <-
fitted(b12.4,
newdata = nd) %>%
as_tibble() %>%
mutate(condition = factor(c("0/0", "1/0", "0/1", "1/1"),
levels = c("0/0", "1/0", "0/1", "1/1")))
)
```
```{r, message = F}
(
chimp_2_d <-
d %>%
filter(actor == chimp) %>%
group_by(prosoc_left, condition) %>%
summarise(prob = mean(pulled_left)) %>%
ungroup() %>%
mutate(condition = str_c(prosoc_left, "/", condition)) %>%
mutate(condition = factor(condition, levels = c("0/0", "1/0", "0/1", "1/1")))
)
```
McElreath didn't show the corresponding plot in the text. It might look like this.
```{r, fig.width = 2.5, fig.height = 4}
chimp_2_fitted %>%
# if you want to use `geom_line()` or `geom_ribbon()` with a factor on the x axis,
# you need to code something like `group = 1` in `aes()`
ggplot(aes(x = condition, y = Estimate, group = 1)) +
geom_ribbon(aes(ymin = Q2.5, ymax = Q97.5), fill = "orange1") +
geom_line(color = "blue") +
geom_point(data = chimp_2_d,
aes(y = prob),
color = "grey25") +
ggtitle("Chimp #2",
subtitle = "The posterior mean and 95%\nintervals are the blue line\nand orange band, respectively.\nThe empirical means are\nthe charcoal dots.") +
coord_cartesian(ylim = c(.75, 1)) +
theme_fivethirtyeight() +
theme(plot.subtitle = element_text(size = 10))
```
Do note how severely we've restricted the $y$-axis range. But okay, now let's do things by hand. We'll need to extract the posterior samples and look at the structure of the data.
```{r}
post <- as_draws_df(b12.4)
glimpse(post)
```
McElreath didn't show what his R code 12.29 `dens( post$a_actor[,5] )` would look like. But here's our analogue.
```{r, fig.width = 4, fig.height = 2.5}
post %>%
mutate(actor_5 = `r_actor[5,Intercept]`) %>%
ggplot(aes(x = actor_5)) +
geom_density(linewidth = 0, fill = "blue") +
scale_y_continuous(breaks = NULL) +
ggtitle("Chimp #5's density") +
theme_fivethirtyeight()
```
And because we made the density only using the `r_actor[5,Intercept]` values (i.e., we didn't add `b_Intercept` to them), the density is in a deviance-score metric.
McElreath built his own `link()` function in R code 12.30. Here we'll build an alternative to `fitted()`.
```{r, warning = F, message = F}
# our hand-made `brms::fitted()` alternative
my_fitted <- function(prosoc_left, condition) {
post %>%
mutate(fitted = (b_Intercept +
`r_actor[5,Intercept]` +
b_prosoc_left * prosoc_left +
`b_prosoc_left:condition` * prosoc_left * condition) %>%
inv_logit_scaled()) %>%
select(fitted)
}
# the posterior summaries
(
chimp_5_my_fitted <-
tibble(prosoc_left = c(0, 1, 0, 1),
condition = c(0, 0, 1, 1)) %>%
mutate(post = map2(prosoc_left, condition, my_fitted)) %>%
unnest(post) %>%
mutate(condition = factor(str_c(prosoc_left, "/", condition),
levels = c("0/0", "1/0", "0/1", "1/1"))) %>%
group_by(condition) %>%
tidybayes::mean_qi(fitted)
)
# the empirical summaries
chimp <- 5
(
chimp_5_d <-
d %>%
filter(actor == chimp) %>%
group_by(prosoc_left, condition) %>%
summarise(prob = mean(pulled_left)) %>%
ungroup() %>%
mutate(condition = factor(str_c(prosoc_left, "/", condition),
levels = c("0/0", "1/0", "0/1", "1/1")))
)
```
Okay, let's see how good we are at retrodicting the `pulled_left` probabilities for `actor == 5`.
```{r, fig.width = 2.5, fig.height = 3.75}
chimp_5_my_fitted %>%
ggplot(aes(x = condition, y = fitted, group = 1)) +
geom_ribbon(aes(ymin = .lower, ymax = .upper), fill = "orange1") +
geom_line(color = "blue") +
geom_point(data = chimp_5_d,
aes(y = prob),
color = "grey25") +
ggtitle("Chimp #5",
subtitle = "This plot is like the last except\nwe did more by hand.") +
coord_cartesian(ylim = c(0, 1)) +
theme_fivethirtyeight() +
theme(plot.subtitle = element_text(size = 10))
```
Not bad.
### Posterior prediction for new clusters.
By average actor, McElreath referred to a chimp with an intercept exactly at the population mean $\alpha$. So this time we'll only be working with the population parameters, or what are also sometimes called the fixed effects. When using `brms::as_draws_df()` output, this would mean working with columns beginning with the `b_` prefix (i.e., `b_Intercept`, `b_prosoc_left`, and `b_prosoc_left:condition`).
```{r, message = F, warning = F}
post_average_actor <-