-
Notifications
You must be signed in to change notification settings - Fork 94
/
Copy path11.Rmd
1561 lines (1234 loc) · 71.4 KB
/
11.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
```{r, echo = FALSE, cache = FALSE}
knitr::opts_chunk$set(fig.retina = 2.5)
options(width = 110)
```
# Monsters and Mixtures
> [Of these majestic creatures], we'll consider two common and useful examples. The first type is the **ordered categorical** model, useful for categorical outcomes with a fixed ordering. This model is built by merging a categorical likelihood function with a special kind of link function, usually a **cumulative link**. The second type is a family of **zero-inflated** and **zero-augmented** models, each of which mixes a binary event within an ordinary GLM likelihood like a Poisson or binomial.
>
> Both types of models help us transform our modeling to cope with the inconvenient realities of measurement, rather than transforming measurements to cope with the constraints of our models. [@mcelreathStatisticalRethinkingBayesian2015, p. 331, **emphasis** in the original]
## Ordered categorical outcomes
> It is very common in the social sciences, and occasional in the natural sciences, to have an outcome variable that is discrete, like a count, but in which the values merely indicate different ordered *levels* along some dimension. For example, if I were to ask you how much you like to eat fish, on a scale from 1 to 7, you might say 5. If I were to ask 100 people the same question, I'd end up with 100 values between 1 and 7. In modeling each outcome value, I’d have to keep in mind that these values are *ordered* because 7 is greater than 6, which is greater than 5, and so on. But unlike a count, the differences in values are not necessarily equal.
>
> In principle, an ordered categorical variable is just a multinomial prediction problem (page 323). But the constraint that the categories be ordered demands special treatment...
>
> The conventional solution is to use a cumulative link function. The cumulative probability of a value is the probability of that *value or any smaller value*. (pp. 331--332, *emphasis* in the original)
### Example: Moral intuition.
Let's get the `Trolley` data from rethinking [see @cushmanRoleConsciousReasoning2006].
```{r, warning = F, message = F}
library(rethinking)
data(Trolley)
d <- Trolley
```
Unload rethinking and load brms.
```{r, warning = F, message = F}
rm(Trolley)
detach(package:rethinking, unload = T)
library(brms)
```
Use the `dplyr::glimpse()` to get a sense of the dimensions of the data.
```{r, warning = F, message = F}
library(tidyverse)
glimpse(d)
```
Though we have 9,930 rows, we only have 331 unique individuals.
```{r}
d %>%
distinct(id) %>%
count()
```
### Describing an ordered distribution with intercepts.
Before we get to plotting, in this chapter we'll use theme settings and a color palette from the [ggthemes package](https://cran.r-project.org/package=ggthemes).
```{r, warning = F, message = F}
library(ggthemes)
```
We'll take our basic theme settings from the `theme_hc()` function. We'll use the `Green fields` color palette, which we can inspect with the `canva_pal()` function and a little help from `scales::show_col()`.
```{r, fig.height = 2.25}
scales::show_col(canva_pal("Green fields")(4))
canva_pal("Green fields")(4)
canva_pal("Green fields")(4)[3]
```
Now we're ready to make our version of the simple Figure 11.1 histogram of our primary variable, `response`.
```{r, fig.width = 2.5, fig.height = 3}
p1 <-
ggplot(data = d, aes(x = response, fill = after_stat(x))) +
geom_histogram(binwidth = 1/4, linewidth = 0) +
scale_x_continuous(breaks = 1:7) +
scale_fill_gradient(low = canva_pal("Green fields")(4)[4],
high = canva_pal("Green fields")(4)[1]) +
theme_hc() +
theme(axis.ticks = element_blank(),
legend.position = "none",
plot.background = element_rect(fill = "grey92"))
p1
```
Our cumulative proportion plot, Figure 11.1.b, will require some pre-plot wrangling.
```{r, fig.width = 2.5, fig.height = 3}
p2 <-
d %>%
count(response) %>%
mutate(pr_k = n / nrow(d),
cum_pr_k = cumsum(pr_k)) %>%
ggplot(aes(x = response, y = cum_pr_k,
fill = response)) +
geom_line(color = canva_pal("Green fields")(4)[2]) +
geom_point(shape = 21, color = "grey92",
size = 2.5, stroke = 1) +
scale_x_continuous(breaks = 1:7) +
scale_y_continuous("cumulative proportion", breaks = c(0, .5, 1)) +
scale_fill_gradient(low = canva_pal("Green fields")(4)[4],
high = canva_pal("Green fields")(4)[1]) +
coord_cartesian(ylim = c(0, 1)) +
theme_hc() +
theme(axis.ticks = element_blank(),
legend.position = "none",
plot.background = element_rect(fill = "grey92"))
p2
```
In order to make the next plot, we'll need McElreath's `logit()` function. Here it is, the logarithm of cumulative odds plot, Figure 11.1.c.
```{r, fig.width = 2.5, fig.height = 3}
# McElreath's convenience function from page 335
logit <- function(x) log(x / (1 - x))
p3 <-
d %>%
count(response) %>%
mutate(cum_pr_k = cumsum(n / nrow(d))) %>%
filter(response < 7) %>%
# we can do the `logit()` conversion right in ggplot2
ggplot(aes(x = response, y = logit(cum_pr_k),
fill = response)) +
geom_line(color = canva_pal("Green fields")(4)[2]) +
geom_point(shape = 21, colour = "grey92",
size = 2.5, stroke = 1) +
scale_x_continuous(breaks = 1:7) +
scale_fill_gradient(low = canva_pal("Green fields")(4)[4],
high = canva_pal("Green fields")(4)[1]) +
coord_cartesian(xlim = c(1, 7)) +
ylab("log-cumulative-odds") +
theme_hc() +
theme(axis.ticks = element_blank(),
legend.position = "none",
plot.background = element_rect(fill = "grey92"))
p3
```
Why not combine the three subplots with patchwork?
```{r, fig.width = 7, fig.height = 3}
library(patchwork)
p1 | p2 | p3
```
The code for Figure 11.2 is itself something of a monster.
```{r, fig.width = 3.5, fig.height = 3}
# primary data
d_plot <-
d %>%
count(response) %>%
mutate(pr_k = n / nrow(d),
cum_pr_k = cumsum(n / nrow(d))) %>%
mutate(discrete_probability = ifelse(response == 1, cum_pr_k, cum_pr_k - pr_k))
# annotation
text <-
tibble(text = 1:7,
response = seq(from = 1.25, to = 7.25, by = 1),
cum_pr_k = d_plot$cum_pr_k - .065)
ggplot(data = d_plot,
aes(x = response, y = cum_pr_k,
color = cum_pr_k, fill = cum_pr_k)) +
geom_line(color = canva_pal("Green fields")(4)[1]) +
geom_point(shape = 21, colour = "grey92",
size = 2.5, stroke = 1) +
geom_linerange(aes(ymin = 0, ymax = cum_pr_k),
alpha = 1/2, color = canva_pal("Green fields")(4)[1]) +
# there must be more elegant ways to do this part
geom_linerange(aes(x = response + .025,
ymin = ifelse(response == 1, 0, discrete_probability),
ymax = cum_pr_k),
color = "black") +
# number annotation
geom_text(data = text,
aes(label = text),
size = 4) +
scale_x_continuous(breaks = 1:7) +
scale_y_continuous("cumulative proportion", breaks = c(0, .5, 1), limits = c(0, 1)) +
scale_fill_gradient(low = canva_pal("Green fields")(4)[4],
high = canva_pal("Green fields")(4)[1]) +
scale_color_gradient(low = canva_pal("Green fields")(4)[4],
high = canva_pal("Green fields")(4)[1]) +
theme_hc() +
theme(axis.ticks = element_blank(),
legend.position = "none",
plot.background = element_rect(fill = "grey92"))
```
McElreath's convention for this first type of statistical model is
\begin{align*}
R_i & \sim \operatorname{Ordered}(\mathbf p) \\
\operatorname{logit}(p_k) & = \alpha_k \\
\alpha_k & \sim \operatorname{Normal} (0, 10).
\end{align*}
> The Ordered distribution is really just a categorical distribution that takes a vector $\mathbf p = {p_1, p_2, p_3, p_4, p_5, p_6}$ of probabilities of each response value below the maximum response (7 in this example). Each response value $k$ in this vector is defined by its link to an intercept parameter, $\alpha_k$. Finally, some weakly regularizing priors are placed on these intercepts. (p. 335)
Whereas in `rethinking::map()` you indicate the likelihood by `<criterion> ~ dordlogit(phi , c(<the thresholds>)`, in `brms::brm()` you code `family = cumulative`. Here's how to fit the intercepts-only model.
```{r b11.1}
# define the start values
inits <- list(`Intercept[1]` = -2,
`Intercept[2]` = -1,
`Intercept[3]` = 0,
`Intercept[4]` = 1,
`Intercept[5]` = 2,
`Intercept[6]` = 2.5)
inits_list <- list(inits, inits)
b11.1 <-
brm(data = d,
family = cumulative,
response ~ 1,
prior(normal(0, 10), class = Intercept),
iter = 2000, warmup = 1000, cores = 2, chains = 2,
init = inits_list, # here we add our start values
seed = 11,
file = "fits/b11.01")
```
McElreath needed to include the `depth=2` argument in the `rethinking::precis()` function to show the threshold parameters from his `m11.1stan` model (R code 11.8). With a `brm()` fit, we just use `print()` or `summary()` as usual.
```{r}
print(b11.1)
```
What McElreath's `m11.1stan` summary termed `cutpoints[k]`, our brms summary termed `Intercept[k]`. In both cases, these are the $\alpha_k$ parameters from the equations, above (i.e., the thresholds). The summaries look like those in the text, the $\widehat R$ values are great, and both measures of effective sample size are reasonably high. The model looks good.
Recall we use the `brms::inv_logit_scaled()` function in place of McElreath's `logistic()` function to get these into the probability metric.
```{r}
fixef(b11.1)[, -2] %>%
inv_logit_scaled()
```
But recall that the posterior $\textit{SD}$ (i.e., the 'Est.Error' values) are not valid using that approach. If you really care about them, you'll need to work with the `as_draws_df()` function.
```{r, message = F}
as_draws_df(b11.1) %>%
select(starts_with("b_")) %>%
mutate_all(inv_logit_scaled) %>%
gather() %>%
group_by(key) %>%
summarise(mean = mean(value),
sd = sd(value),
ll = quantile(value, probs = .025),
ul = quantile(value, probs = .975))
```
And just to confirm, those posterior means are centered right around the `cum_pr_k` we computed for Figure 11.2.
```{r}
d_plot %>%
select(response, cum_pr_k)
```
### Adding predictor variables.
Now we define the linear model as $\phi_i = \beta x_i$. Accordingly, the formula for our cumulative logit model becomes
\begin{align*}
\log \frac{\Pr(y_i \leq k)}{1 - \Pr(y_i \leq k)} & = \alpha_k - \phi_i \\
\phi_i & = \beta x_i.
\end{align*}
I'm not aware that brms has an equivalent to the `rethinking::dordlogit()` function. So here we'll make it by hand. The code comes from McElreath's [GitHub page](https://github.com/rmcelreath/rethinking/blob/a309712d904d1db7af1e08a76c521ab994006fd5/R/distributions.r).
```{r}
logistic <- function(x) {
p <- 1 / (1 + exp(-x))
p <- ifelse(x == Inf, 1, p)
p
}
# now we get down to it
dordlogit <-
function(x, phi, a, log = FALSE) {
a <- c(as.numeric(a), Inf)
p <- logistic(a[x] - phi)
na <- c(-Inf, a)
np <- logistic(na[x] - phi)
p <- p - np
if (log == TRUE) p <- log(p)
p
}
```
The `dordlogit()` function works like this:
```{r}
(pk <- dordlogit(1:7, 0, fixef(b11.1)[, 1]))
```
Note the slight difference in how we used `dordlogit()` with a `brm()` fit summarized by `fixef()` than the way McElreath did with a `map2stan()` fit summarized by `coef()`. McElreath just put `coef(m11.1)` into `dordlogit()`. We, however, more specifically placed `fixef(b11.1)[, 1]` into the function. With the `[, 1]` part, we specified that we were working with the posterior means (i.e., `Estimate`) and neglecting the other summaries (i.e., the posterior *SD*s and 95% intervals). If you forget to subset, chaos ensues.
Next, as McElreath further noted on page 338, "these probabilities imply an average outcome of:"
```{r}
sum(pk * (1:7))
```
I found that a bit abstract. Here's the thing in a more elaborate tibble format.
```{r}
(
explicit_example <-
tibble(probability_of_a_response = pk) %>%
mutate(the_response = 1:7) %>%
mutate(their_product = probability_of_a_response * the_response)
)
explicit_example %>%
summarise(average_outcome_value = sum(their_product))
```
**Aside**
This made me wonder how this would compare if we were lazy and ignored the categorical nature of the `response`. Here we refit the model with the typical Gaussian likelihood.
```{r b11.1b}
b11.1b <-
brm(data = d,
family = gaussian,
response ~ 1,
# in this case, 4 (i.e., the middle response) seems to be the conservative place to put the mean
prior = c(prior(normal(4, 10), class = Intercept),
prior(cauchy(0, 1), class = sigma)),
iter = 2000, warmup = 1000, cores = 4, chains = 4,
seed = 11,
file = "fits/b11.01b")
```
Check the summary.
```{r}
print(b11.1b)
```
This yielded a posterior mean of 4.2, much like our `average_outcome_value`, above. However, the lazy Gaussian model now has a $\sigma$ parameter, whereas $\sigma$ is a function of the probability estimate in the more-appropriate `b11.2` model (see [Section 11.3][Over-dispersed outcomes]). If you only care about mean estimates, this won't matter much if you have a lot of data and the mean is far from the boundary. But when your posterior means get close to the upper or lower boundary, the lazy Gaussian model will yield silly estimates for the 95% intervals. Beware the lazy Gaussian model with data like this.
**End aside**
Now we'll try it by subtracting .5 from each.
```{r}
# the probabilities of a given response
(pk <- dordlogit(1:7, 0, fixef(b11.1)[, 1] - .5))
# the average rating
sum(pk * (1:7))
```
So the rule is we *subtract the linear model from each intercept*. "This way, a positive $\beta$ value indicates that an increase in the predictor variable $x$ results in an increase in the average response" (p. 338). Let's fit our multivariable models.
```{r b11.2}
# start values for b11.2
inits <- list(`Intercept[1]` = -1.9,
`Intercept[2]` = -1.2,
`Intercept[3]` = -0.7,
`Intercept[4]` = 0.2,
`Intercept[5]` = 0.9,
`Intercept[6]` = 1.8,
action = 0,
intention = 0,
contact = 0)
b11.2 <-
brm(data = d,
family = cumulative,
response ~ 1 + action + intention + contact,
prior = c(prior(normal(0, 10), class = Intercept),
prior(normal(0, 10), class = b)),
iter = 2000, warmup = 1000, cores = 2, chains = 2,
init = list(inits, inits),
seed = 11,
file = "fits/b11.02")
# start values for b11.3
inits <- list(`Intercept[1]` = -1.9,
`Intercept[2]` = -1.2,
`Intercept[3]` = -0.7,
`Intercept[4]` = 0.2,
`Intercept[5]` = 0.9,
`Intercept[6]` = 1.8,
action = 0,
intention = 0,
contact = 0,
`action:intention` = 0,
`contact:intention` = 0)
b11.3 <-
update(b11.2,
formula = response ~ 1 + action + intention + contact + action:intention + contact:intention,
iter = 2000, warmup = 1000, cores = 2, chains = 2,
init = list(inits, inits),
seed = 11,
file = "fits/b11.03")
```
We don't have a `coeftab()` function in brms like for rethinking. But [as we did for Chapter 6][Comparing estimates.], we can reproduce it with help from the `posterior::summarise_draws()` function and a bit of data wrangling.
```{r}
# make a custom helper function
get_summary <- function(fit) {
fit %>%
posterior::summarise_draws(mean) %>%
filter(variable != "lprior") %>%
filter(variable != "lp__")
}
# wrangle
tibble(model = str_c("b11.", 1:3)) %>%
mutate(fit = purrr::map(model, get)) %>%
mutate(mean = purrr::map(fit, get_summary)) %>%
select(-fit) %>%
unnest(mean) %>%
select(variable, mean, model) %>%
spread(key = model, value = mean) %>%
mutate_if(is.double, round, digits = 2) %>%
# this last step isn't necessary, but it orders the rows to match the text
slice(c(6:11, 1, 4, 3, 2, 5))
```
If you really wanted that last `nobs` row at the bottom, you could elaborate on this code: `b11.1$data %>% count()`. Also, if you want a proper `coeftab()` function for brms, McElreath's code lives [here](https://github.com/rmcelreath/rethinking/blob/a309712d904d1db7af1e08a76c521ab994006fd5/R/coeftab.r). Give it a whirl.
Here we compute the WAIC estimates.
```{r, warning = F, message = F}
b11.1 <- add_criterion(b11.1, "waic")
b11.2 <- add_criterion(b11.2, "waic")
b11.3 <- add_criterion(b11.3, "waic")
```
Now compare the models.
```{r}
loo_compare(b11.1, b11.2, b11.3, criterion = "waic") %>%
print(simplify = F)
```
Here are the WAIC weights.
```{r model_weights_b11.1_through_b11.3, cache = T}
model_weights(b11.1, b11.2, b11.3, weights = "waic") %>%
round(digits = 3)
```
McElreath made Figure 11.3 by extracting the samples of his `m11.3`, saving them as `post`, and working some hairy base R `plot()` code. We'll take a different route and use `brms::fitted()`. This will take substantial data wrangling, but hopefully it'll be instructive. Let's first take a look at the initial `fitted()` output for the beginnings of Figure 11.3.a.
```{r, warning = F, message = F}
nd <-
tibble(action = 0,
contact = 0,
intention = 0:1)
max_iter <- 100
fitted(b11.3,
newdata = nd,
draw_ids = 1:max_iter,
summary = F) %>%
as_tibble() %>%
glimpse()
```
Hopefully by now it's clear why we needed the `nd` tibble, which we made use of in the `newdata = nd` argument. Because we set `summary = F`, we get draws from the posterior instead of summaries. With `max_iter`, we controlled how many of those posterior draws we wanted. McElreath used 100, which he indicated at the top of page 341, so we followed suit. It took me a minute to wrap my head around the meaning of the 14 vectors, which were named by `brms::fitted()` default. Notice how each column is named by two numerals, separated by a period. That first numeral indicates which if the two `intention` values the draw is based on (i.e., 1 stands for `intention == 0`, 2, stands for `intention == 1`). The numbers on the right of the decimals are the seven response options for `response`. For each posterior draw, you get one of those for each value of `intention`. Finally, it might not be immediately apparent, but the values are in the probability scale, just like `pk` on page 338.
Now we know what we have in hand, it's just a matter of careful wrangling to get those probabilities into a more useful format to feed into ggplot2. I've extensively annotated the code, below. If you lose track of what happens in a given step, just run the code up till that point. Go step by step.
```{r, fig.width = 2.5, fig.height = 3.5}
nd <-
tibble(action = 0,
contact = 0,
intention = 0:1)
max_iter <- 100
fitted(b11.3,
newdata = nd,
draw_ids = 1:max_iter,
summary = F) %>%
as_tibble() %>%
# we need an variable to index which posterior iteration we're working with
mutate(iter = 1:max_iter) %>%
# convert the data to the long format
gather(key, pk, -iter) %>%
# extract the `intention` and `response` information out of the `key` vector and
# spread it into two vectors.
separate(key, into = c("intention", "rating")) %>%
# that step produced two character vectors. they’ll be more useful as numbers
mutate(intention = intention %>% as.double(),
rating = rating %>% as.double()) %>%
# here we convert `intention` into its proper 0:1 metric
mutate(intention = intention -1) %>%
# this step is based on McElreath's R code 11.10 on page 338
mutate(`pk:rating` = pk * rating) %>%
# I'm not sure how to succinctly explain this. you’re just going to have to trust me
group_by(iter, intention) %>%
# this is very important for the next step.
arrange(iter, intention, rating) %>%
# here we take our `pk` values and make cumulative sums. why? take a long hard look at Figure 11.2.
mutate(probability = cumsum(pk)) %>%
# `rating == 7` is unnecessary. these `probability` values are by definition 1
filter(rating < 7) %>%
ggplot(aes(x = intention, y = probability,
color = probability)) +
geom_line(aes(group = interaction(iter, rating)),
alpha = 1/10) +
# note how we made a new data object for `geom_text()`
geom_text(data = tibble(text = 1:7,
intention = seq(from = .9, to = .1, length.out = 7),
probability = c(.05, .12, .20, .35, .53, .71, .87)),
aes(label = text),
size = 3) +
scale_x_continuous("intention", breaks = 0:1) +
scale_y_continuous(breaks = c(0, .5, 1), limits = c(0, 1)) +
scale_color_gradient(low = canva_pal("Green fields")(4)[4],
high = canva_pal("Green fields")(4)[1]) +
labs(subtitle = "action = 0,\ncontact = 0") +
theme_hc() +
theme(axis.ticks = element_blank(),
legend.position = "none",
plot.background = element_rect(fill = "grey92"))
```
Boom!
Okay, that pile of code is a bit of a mess and you're not going to want to repeatedly cut and paste all that. Let's condense it into a homemade function, `make_Figure_11.3_data()`.
```{r}
make_Figure_11.3_data <- function(action, contact, max_iter) {
nd <-
tibble(action = action,
contact = contact,
intention = 0:1)
max_iter <- max_iter
fitted(b11.3,
newdata = nd,
draw_ids = 1:max_iter,
summary = F) %>%
as_tibble() %>%
mutate(iter = 1:max_iter) %>%
gather(key, pk, -iter) %>%
separate(key, into = c("intention", "rating")) %>%
mutate(intention = intention %>% as.double(),
rating = rating %>% as.double()) %>%
mutate(intention = intention -1) %>%
mutate(`pk:rating` = pk * rating) %>%
group_by(iter, intention) %>%
arrange(iter, intention, rating) %>%
mutate(probability = cumsum(pk)) %>%
filter(rating < 7)
}
```
Now we'll use our sweet homemade function to make our plots.
```{r, fig.width = 7, fig.height = 4, warning = F, message = F}
# Figure 11.3.a
p1 <-
make_Figure_11.3_data(action = 0,
contact = 0,
max_iter = 100) %>%
ggplot(aes(x = intention, y = probability,
color = probability)) +
geom_line(aes(group = interaction(iter, rating)),
alpha = 1/10) +
geom_text(data = tibble(text = 1:7,
intention = seq(from = .9, to = .1, length.out = 7),
probability = c(.05, .12, .20, .35, .53, .71, .87)),
aes(label = text),
size = 3) +
scale_x_continuous("intention", breaks = 0:1) +
scale_y_continuous(breaks = c(0, .5, 1), limits = c(0, 1)) +
scale_color_gradient(low = canva_pal("Green fields")(4)[4],
high = canva_pal("Green fields")(4)[1]) +
labs(subtitle = "action = 0,\ncontact = 0") +
theme_hc() +
theme(axis.ticks = element_blank(),
legend.position = "none",
plot.background = element_rect(fill = "grey92"))
# Figure 11.3.b
p2 <-
make_Figure_11.3_data(action = 1,
contact = 0,
max_iter = 100) %>%
ggplot(aes(x = intention, y = probability,
color = probability)) +
geom_line(aes(group = interaction(iter, rating)),
alpha = 1/10) +
geom_text(data = tibble(text = 1:7,
intention = seq(from = .9, to = .1, length.out = 7),
probability = c(.12, .24, .35, .50, .68, .80, .92)),
aes(label = text),
size = 3) +
scale_x_continuous("intention", breaks = 0:1) +
scale_y_continuous(breaks = c(0, .5, 1), limits = c(0, 1)) +
scale_color_gradient(low = canva_pal("Green fields")(4)[4],
high = canva_pal("Green fields")(4)[1]) +
labs(subtitle = "action = 1,\ncontact = 0") +
theme_hc() +
theme(axis.ticks = element_blank(),
legend.position = "none",
plot.background = element_rect(fill = "grey92"))
# Figure 11.3.c
p3 <-
make_Figure_11.3_data(action = 0,
contact = 1,
max_iter = 100) %>%
ggplot(aes(x = intention, y = probability,
color = probability)) +
geom_line(aes(group = interaction(iter, rating)),
alpha = 1/10) +
geom_text(data = tibble(text = 1:7,
intention = seq(from = .9, to = .1, length.out = 7),
probability = c(.15, .34, .44, .56, .695, .8, .92)),
aes(label = text),
size = 3) +
scale_x_continuous("intention", breaks = 0:1) +
scale_y_continuous(breaks = c(0, .5, 1), limits = c(0, 1)) +
scale_color_gradient(low = canva_pal("Green fields")(4)[4],
high = canva_pal("Green fields")(4)[1]) +
labs(subtitle = "action = 0,\ncontact = 1") +
theme_hc() +
theme(axis.ticks = element_blank(),
legend.position = "none",
plot.background = element_rect(fill = "grey92"))
# here we combine them with patchwork
p1 + p2 + p3 +
plot_annotation(title = "Posterior predictions of the interaction model:",
subtitle = "Here we're focusing on the parameters.",
theme = theme(plot.background = element_rect(fill = "grey92")))
```
If you'd like to learn more about using cumulative probabilities to model ordinal data in brms, check out Bürkner and Vuorre's [-@burknerOrdinalRegressionModels2019] [*Ordinal regression models in psychology: A tutorial*](https://psyarxiv.com/x8swp/) and its [repository](https://osf.io/cu8jv/) on the Open Science Framework. Also check out [Chapter 23](https://bookdown.org/content/3686/ordinal-predicted-variable.html) of my [-@kurzDoingBayesianDataAnalysis2023] sister project, [*Doing Bayesian data analysis in brms and the tidyverse*](https://bookdown.org/content/3686/), were we model ordinal data with a series of cumulative probit models.
### Bonus: Figure 11.3 alternative.
I have a lot of respect for McElreath. But man, Figure 11.3 is the worst. I'm in clinical psychology and there's no way a working therapist is going to look at a figure like that and have any sense of what's going on. Happily, we can go further. Look back at McElreath's R code 11.10 on page 338. See how he multiplied the elements of `pk` by their respective `response` values and then just summed them up to get an average outcome value? With just a little amendment to our custom `make_Figure_11.3_data()` function, we can wrangle our `fitted()` output to express average `response` values for each of our conditions of interest. Here's the adjusted function:
```{r}
make_alternative_data <- function(action, contact, max_iter) {
nd <-
tibble(action = action,
contact = contact,
intention = 0:1)
max_iter <- max_iter
fitted(b11.3,
newdata = nd,
draw_ids = 1:max_iter,
summary = F) %>%
as_tibble() %>%
mutate(iter = 1:max_iter) %>%
gather(key, pk, -iter) %>%
separate(key, into = c("intention", "rating")) %>%
mutate(intention = intention %>% as.double(),
rating = rating %>% as.double()) %>%
mutate(intention = intention -1) %>%
mutate(`pk:rating` = pk * rating) %>%
group_by(iter, intention) %>%
# everything above this point is identical to the previous custom function.
# all we do is replace the last few lines with this one line of code.
summarise(mean_rating = sum(`pk:rating`))
}
```
Our handy homemade `make_alternative_data()` function works very much like its predecessor. Before we put it to work, we might simplify our upcoming ggplot2 code. Did you notice how the last three plots, those three subplots for Figure 11.3, were all basically the same with slightly different input? Instead of copy/pasting all that plot code, we can wrap the bulk of it into a custom plotting function we'll call `geom_figure11.3()`.
```{r}
geom_figure11.3 <- function(subtitle, ...) {
list(
geom_line(alpha = 1/10, color = canva_pal("Green fields")(4)[1]),
scale_x_continuous("intention", breaks = 0:1),
scale_y_continuous("response", breaks = 1:7, limits = c(1, 7)),
labs(subtitle = subtitle),
theme_hc(),
theme(axis.ticks = element_blank(),
legend.position = "none",
plot.background = element_rect(fill = "grey92"))
)
}
```
Now we'll use our two custom functions, `make_alternative_data()` to make the data and `geom_figure11.3()` to specify most of the ggplot2 components, to plot our alternative to Figure 11.3.
```{r, fig.width = 7, fig.height = 4, message = F}
# alternative to Figure 11.3.a
p1 <-
make_alternative_data(action = 0,
contact = 0,
max_iter = 100) %>%
ggplot(aes(x = intention, y = mean_rating, group = iter)) +
geom_figure11.3(subtitle = "action = 0,\ncontact = 0")
# alternative to Figure 11.3.b
p2 <-
make_alternative_data(action = 1,
contact = 0,
max_iter = 100) %>%
ggplot(aes(x = intention, y = mean_rating, group = iter)) +
geom_figure11.3(subtitle = "action = 1,\ncontact = 0")
# alternative to Figure 11.3.c
p3 <-
make_alternative_data(action = 0,
contact = 1,
max_iter = 100) %>%
ggplot(aes(x = intention, y = mean_rating, group = iter)) +
geom_figure11.3(subtitle = "action = 0,\ncontact = 1")
# here we combine them with patchwork
p1 + p2 + p3 +
plot_annotation(title = "Posterior predictions of the interaction model:",
subtitle = "This time we'll focus on the means.",
theme = theme(plot.background = element_rect(fill = "grey92")))
```
Finally; now those are plots I can sell in a clinical psychology journal! We did the right thing and used a sophisticated statistical model to treat the data appropriately, but we summarized the results in a way a substantive audience might understand them in.
If you're new to making functions with ggplot2 components, check out [Chapter 19](https://ggplot2-book.org/programming.html) of Wickham's [-@wickhamGgplot2ElegantGraphics2016] [*ggplot2: Elegant graphics for data analysis*](https://ggplot2-book.org/index.html). He'll walk you through all the steps.
#### Rethinking: Staring into the abyss.
"The plotting code for ordered logistic models is complicated, compared to that of models from previous chapters. But as models become more monstrous, so too does the code needed to compute predictions and display them" (p. 342). I'll just add that I have found models and plots like this get easier with time. Just keep chipping away. You'll get it!
## Zero-inflated outcomes
> Very often, the things we can measure are not emissions from any pure process. Instead, they are *mixtures* of multiple processes. Whenever there are different causes for the same observation, then a mixture model may be useful. A mixture model uses more than one simple probability distribution to model a mixture of causes. In effect, these models use more than one likelihood for the same outcome variable.
>
> Count variables are especially prone to needing a mixture treatment. The reason is that a count of zero can often arise more than one way. A "zero" means that nothing happened, and nothing can happen either because the rate of events is low or rather because the process that generates events failed to get started. (p. 342, *emphasis* in the original)
#### Rethinking: Breaking the law.
McElreath discussed how advances in computing have made it possible for working scientists to define their own data generating models. If you'd like to dive deeper into the topic, check out Bürkner's [-@Bürkner2022Define] vignette, [*Define custom response distributions with brms*](https://CRAN.R-project.org/package=brms/vignettes/brms_customfamilies.html). We'll even make use of it a little further down.
### Example: Zero-inflated Poisson.
Do you remember the monk data from back in [Chapter 10][Example: Exposure and the offset.]? Here we simulate some more. This time we'll work in a little alcohol.
```{r}
# define parameters
prob_drink <- 0.2 # 20% of days
rate_work <- 1 # average 1 manuscript per day
# sample one year of production
n <- 365
# simulate days monks drink
set.seed(11)
drink <- rbinom(n, 1, prob_drink)
# simulate manuscripts completed
y <- (1 - drink) * rpois(n, rate_work)
```
We'll put those data in a tidy tibble before plotting.
```{r, fig.width = 5, fig.height = 3}
d <-
tibble(Y = y) %>%
arrange(Y) %>%
mutate(zeros = c(rep("zeros_drink", times = sum(drink)),
rep("zeros_work", times = sum(y == 0 & drink == 0)),
rep("nope", times = n - sum(y == 0)))
)
ggplot(data = d, aes(x = Y)) +
geom_histogram(aes(fill = zeros),
binwidth = 1, linewidth = 1/10, color = "grey92") +
scale_fill_manual(values = c(canva_pal("Green fields")(4)[1],
canva_pal("Green fields")(4)[2],
canva_pal("Green fields")(4)[1])) +
xlab("Manuscripts completed") +
theme_hc() +
theme(legend.position = "none",
plot.background = element_rect(fill = "grey92"))
```
With these data, the likelihood of observing zero on `y`, (i.e., the likelihood zero manuscripts were completed on a given occasion) is
\begin{align*}
\Pr(0 | p, \lambda) & = \Pr(\text{drink} | p) + \Pr(\text{work} | p) \times \Pr(0 | \lambda) \\
& = p + (1 - p) \; \exp (- \lambda).
\end{align*}
And
> since the Poisson likelihood of $y$ is $\Pr(y | \lambda) = \lambda^y \exp (- \lambda) / y!$, the likelihood of $y = 0$ is just $\exp(- \lambda)$. The above is just the mathematics for:
>
>> *The probability of observing a zero is the probability that the monks didn't drink OR ($+$) the probability that the monks worked AND ($\times$) failed to finish anything*.
>
> And the likelihood of a non-zero value $y$ is:
>
> \begin{align*}
> \Pr(y | p, \lambda) & = \Pr(\text{drink} | p) (0) + \Pr(\text{work} | p) \Pr(y | \lambda) \\
> & = (1 - p) \frac {\lambda^y \; \exp (- \lambda)}{y!}
> \end{align*}
>
> Since drinking monks never produce $y > 0$, the expression above is just the chance the monks both work $1 - p$, and finish $y$ manuscripts. (p. 344, *emphasis* in the original)
So letting $p$ be the probability $y$ is zero and $\lambda$ be the shape of the distribution, the zero-inflated Poisson (ZIPoisson) regression model takes the basic form
\begin{align*}
y_i & \sim \operatorname{ZIPoisson}(p_i, \lambda_i)\\
\operatorname{logit}(p_i) & = \alpha_p + \beta_p x_i \\
\log (\lambda_i) & = \alpha_\lambda + \beta_\lambda x_i.
\end{align*}
One last thing to note is that in brms, $p_i$ is denoted `zi`. To fit a zero-inflated Poisson model with brms, make sure to specify the correct likelihood with `family = zero_inflated_poisson`. To use a non-default prior for `zi`, make sure to indicate `class = zi` within the `prior()` function.
```{r b11.4}
b11.4 <-
brm(data = d,
family = zero_inflated_poisson,
Y ~ 1,
prior = c(prior(normal(0, 10), class = Intercept),
prior(beta(2, 2), class = zi)), # the brms default is beta(1, 1)
cores = 4,
seed = 11,
file = "fits/b11.04")
```
```{r}
print(b11.4)
```
The zero-inflated Poisson is [parameterized in brms](https://cran.r-project.org/package=brms/vignettes/brms_families.html#zero-inflated-and-hurdle-models) a little differently than it is in rethinking. The different parameterization did not influence the estimate for the Intercept, $\lambda$. In both here and in the text, $\lambda$ was about zero. However, it did influence the summary of `zi`. Note how McElreath's `logistic(-1.39)` yielded 0.1994078. Seems rather close to our `zi` estimate of `r round(posterior_summary(b11.4)[2, 1], 3)`. First off, because he didn't set his seed in the text before simulating, we couldn't exactly reproduce his simulated drunk monk data. So our results will vary a little due to that alone. But after accounting for simulation variance, hopefully it's clear that `zi` in brms is already in the probability metric. There's no need to convert it.
In the `prior` argument, we used `beta(2, 2)` for `zi` and also mentioned in the margin that the brms default is `beta(1, 1)`. The beta distribution ranges from 0 to 1, making it a natural distribution to use for priors on probabilities. To give you a sense of what those two versions of the beta look like, let's plot them.
```{r, fig.width = 6, fig.height = 2.5}
tibble(`zi prior`= seq(from = 0, to = 1, length.out = 50)) %>%
mutate(`beta(1, 1)` = dbeta(`zi prior`, 1, 1),
`beta(2, 2)` = dbeta(`zi prior`, 2, 2)) %>%
gather(prior, density, -`zi prior`) %>%
ggplot(aes(x = `zi prior`, ymin = 0, ymax = density)) +
geom_ribbon(aes(fill = prior)) +
scale_fill_manual(values = c(canva_pal("Green fields")(4)[4],
canva_pal("Green fields")(4)[2])) +
scale_x_continuous("prior for zi", breaks = c(0, .5, 1)) +
scale_y_continuous(NULL, breaks = NULL) +
theme_hc() +
theme(legend.position = "none",
plot.background = element_rect(fill = "grey92")) +
facet_wrap(~prior)
```
Hopefully this clarifies that the brms default is flat, whereas our prior regularized a bit toward .5. Anyway, here's that exponentiated $\lambda$.
```{r}
fixef(b11.4)[1, ] %>%
exp()
```
#### Overthinking: Zero-inflated Poisson distribution function.
Define the `dzip()` function.
```{r}
dzip <- function(x, p, lambda, log = TRUE) {
ll <- ifelse(
x == 0,
p + (1 - p) * exp(-lambda),
(1 - p) * dpois(x, lambda, log = FALSE)
)
if (log == TRUE) ll <- log(ll)
return(ll)
}
```
We can use McElreath's `dzip()` to do a posterior predictive check for our model. To work with our estimates for $p$ and $\lambda$ directly, we'll set `log = F`.
```{r, fig.width = 5, fig.height = 3}
p_b11.4 <- posterior_summary(b11.4)[2, 1]
lambda_b11.4 <- posterior_summary(b11.4)[1, 1] %>% exp()
tibble(x = 0:4) %>%
mutate(density = dzip(x = x,
p = p_b11.4,
lambda = lambda_b11.4,
log = F)) %>%
ggplot(aes(x = x, y = density)) +
geom_col(fill = canva_pal("Green fields")(4)[4]) +
xlab("Manuscripts completed") +
theme_hc() +
theme(axis.ticks = element_blank(),
plot.background = element_rect(fill = "grey92"))
```
If you look up to the histogram we made at the beginning of this section, you'll see this isn't a terrible approximation.
We can do something similar with the `brms::pp_check()` function. By setting `type = bars`, we'll get back a series of model-based simulations summarized by mean and error bars superimposed atop a histogram of the original data. With the `nsamples` argument, we indicated we wanted those mean and error bars to be based on 100 simulations.
```{r, fig.width = 6, fig.height = 3, message = F}
# this helps us set our custom color scheme
bayesplot::color_scheme_set(canva_pal("Green fields")(4)[c(1, 1, 1, 1, 4, 1)])
set.seed(11)
pp_check(b11.4, type = "bars", ndraws = 100) +
scale_x_continuous(breaks = 0:7) +
theme_hc() +
theme(axis.ticks = element_blank(),
legend.position = c(.91, .842),
legend.background = element_rect(fill = "transparent"),
plot.background = element_rect(fill = "grey92"))
```
Those mean and error bars suggest the model did a good job simulating data that resemble the original data.
## Over-dispersed outcomes
> All statistical models omit something. The question is only whether that something is necessary for making useful inferences. One symptom that something important has been omitted from a count model is over-dispersion. The variance of a variable is sometimes called its *dispersion*. For a counting process like a binomial, the variance is a function of the same parameters as the expected value. For example, the expected value of a binomial is $np$ and its variance is $np (1 - p)$. When the observed variance exceeds this amount--after conditioning on all the predictor variables--this implies that some omitted variable is producing additional dispersion in the observed counts.
>
> What could go wrong, if we ignore the over-dispersion? Ignoring it can lead to all of the same problems as ignoring any predictor variable. Heterogeneity in counts can be a confound, hiding effects of interest or producing spurious inferences. (p, 346, *emphasis* in the original)
In this chapter we'll cope with the problem using continuous mixture models--first the beta-binomial and then the gamma-Poisson (a.k.a. negative binomial).
### Beta-binomial.
> A beta-binomial model assumes that each binomial count observation has its own probability of success. The model estimates the *distribution* of probabilities of success across cases, instead of a single probability of success. And predictor variables change the shape of this distribution, instead of directly determining the probability of each success. (p, 347, *emphasis* in the original)
Unfortunately, we need to digress. As it turns out, there are multiple ways to parameterize the beta distribution and we've run square into two. In the text, McElreath wrote the beta distribution has two parameters, an average probability $\bar p$ and a shape parameter $\theta$. In his R code 11.24, which we'll reproduce in a bit, he demonstrated that parameterization with the `rethinking::dbeta2()` function. The nice thing about this parameterization is how intuitive the `pbar` parameter is. If you want a beta with an average of .2, you set `pbar = .2`. If you want the distribution to be more or less certain, make the `theta` argument more or less large, respectively.
However, the beta density is often defined in terms of $\alpha$ and $\beta$. If you denote the data as $y$, this follows the form
$$\operatorname{Beta}(y | \alpha, \beta) = \frac{y^{\alpha - 1} (1 - y)^{\beta - 1}}{\text B (\alpha, \beta)},$$
which you can verify in the [*Continuous Distributions on [0, 1]* section](https://mc-stan.org/docs/functions-reference/continuous-distributions-on-0-1.html) of the [Stan Functions Reference](https://mc-stan.org/docs/functions-reference/) [@standevelopmentteamStanFunctionsReference2022]. In the formula, $\text B$ stands for the Beta function, which computes a normalizing constant, which you can learn about in the [*Mathematical Functions* chapter](https://mc-stan.org/docs/functions-reference/math-functions.html) of the Stan Functions Reference. This is all important to be aware of because when we defined that beta prior for `zi` in the last model, it was using this parameterization. Also, if you look at the base R `dbeta()` function, you'll learn it takes two parameters, `shape1` and `shape2`. Those uncreatively-named parameters are the same $\alpha$ and $\beta$ from the density, above. They do not correspond to the `pbar` and `theta` parameters of McEreath's `rethinking::dbeta2()` function.
McElreath had good reason for using `dbeta2()`. Beta's typical $\alpha$ and $\beta$ parameters aren't the most intuitive to use; the parameters in McElreath's `dbeta2()` are much nicer. But if you're willing to dive deeper, it turns out you can find the mean of a beta distribution in terms of $\alpha$ and $\beta$ like this
$$\mu = \frac{\alpha}{\alpha + \beta}$$
We can talk about the spread of the distribution, sometimes called $\kappa$, in terms $\alpha$ and $\beta$ like this
$$\kappa = \alpha + \beta$$
With $\mu$ and $\kappa$ in hand, we can even find the $\textit{SD}$ of a beta distribution with
$$\sigma = \sqrt{\mu (1 - \mu) / (\kappa + 1)}$$
I explicate all this because McElreath's `pbar` is $\mu = \frac{\alpha}{\alpha + \beta}$ and his `theta` is $\kappa = \alpha + \beta$. This is great news because it means that we can understand what McElreath did with his `beta2()` function in terms of base R's `dbeta()` function. Which also means that we can understand the distribution of the beta parameters used in `brms::brm()`. To demonstrate, let's walk through McElreath's R code 11.25.
```{r, fig.width = 4, fig.height = 3}
pbar <- 0.5
theta <- 5
ggplot(data = tibble(x = seq(from = 0, to = 1, by = .01)),
aes(x = x, ymin = 0, ymax = rethinking::dbeta2(x, pbar, theta))) +
geom_ribbon(fill = canva_pal("Green fields")(4)[1]) +
scale_x_continuous("probability space", breaks = c(0, .5, 1)) +
scale_y_continuous(NULL, breaks = NULL) +
ggtitle(expression(The~beta~distribution),
subtitle = expression("Defined in terms of "*mu*" (i.e., pbar) and "*kappa*" (i.e., theta)")) +
theme_hc() +
theme(plot.background = element_rect(fill = "grey92"))
```