```{r, echo = F, cache = F}
knitr::opts_chunk$set(fig.retina = 2.5)
knitr::opts_chunk$set(fig.align = "center")
options(width = 100)
```
# Treating Time More Flexibly
> All the illustrative longitudinal data sets in previous chapters share two structural features that simplify analysis. Each is: (1) balanced--everyone is assessed on the identical number of occasions; and (2) time-structured--each set of occasions is identical across individuals. Our analyses have also been limited in that we have used only: (1) time-invariant predictors that describe immutable characteristics of individuals or their environment (except for *TIME* itself); and (2) a representation of *TIME* that forces the level-1 individual growth parameters to represent "initial status" and "rate of change."
>
> The multilevel model for change is far more flexible than these examples suggest. With little or no adjustment, you can use the same strategies to analyze more complex data sets. Not only can the waves of data be irregularly spaced, their number and spacing can vary across participants. Each individual can have his or her own data collection schedule and number of waves can vary without limit from person to person. So, too, predictors of change can be time-invariant or time-varying, and the level-1 submodel can be parameterized in a variety of interesting ways. [@singerAppliedLongitudinalData2003, p. 138, *emphasis* in the original]
## Variably spaced measurement occasions
> Many researchers design their studies with the goal of assessing each individual on an identical set of occasions...
>
> Yet sometimes, despite a valiant attempt to collect time-structured data, actual measurement occasions will differ. Variation often results from the realities of fieldwork and data collection...
>
> So, too, many researchers design their studies knowing full well that the measurement occasions may differ across participants. This is certainly true, for example, of those who use an *accelerated cohort* design in which an age-heterogeneous cohort of individuals is followed for a constant period of time. Because respondents initially vary in age, and age, not *wave*, is usually the appropriate metric for analyses (see the discussion of time metrics in section 1.3.2), observed measurement occasions will differ across individuals. (p. 139, *emphasis* in the original)
### The structure of variably spaced data sets.
You can find the PIAT data from the CNLSY study in the `reading_pp.csv` file.
```{r, warning = F, message = F}
library(tidyverse)
reading_pp <- read_csv("data/reading_pp.csv")
head(reading_pp)
```
On pages 141 and 142, Singer and Willett discussed the phenomenon of *occasion creep*, which is when "the temporal separation of occasions widens as the actual ages exceed design projections". Here's what that might look like.
```{r, fig.width = 6, fig.height = 2.5}
reading_pp %>%
ggplot(aes(x = age, y = wave)) +
geom_vline(xintercept = c(6.5, 8.5, 10.5), color = "white") +
geom_jitter(alpha = .5, height = .33, width = 0) +
scale_x_continuous(breaks = c(6.5, 8.5, 10.5)) +
scale_y_continuous(breaks = 1:3) +
ggtitle("This is what occasion creep looks like.",
subtitle = "As the waves go by, the variation of the ages widens and their central tendency\ncreeps away from the ideal point.") +
theme(panel.grid = element_blank())
```
Here's how we might make our version of Figure 5.1.
```{r, fig.width = 5.5, fig.height = 5, message = F}
set.seed(5)
# wrangle
reading_pp %>%
nest(data = c(wave, agegrp, age, piat)) %>%
sample_n(size = 9) %>%
unnest(data) %>%
# this will help format and order the facets
mutate(id = ifelse(id < 10, str_c("0", id), id) %>% str_c("id = ", .)) %>%
pivot_longer(contains("age")) %>%
# plot
ggplot(aes(x = value, y = piat, color = name)) +
geom_point(alpha = 2/3) +
stat_smooth(method = "lm", se = F, linewidth = 1/2) +
scale_color_viridis_d(NULL, option = "B", end = .5, direction = -1) +
xlab("measure of age") +
coord_cartesian(xlim = c(5, 12),
ylim = c(0, 80)) +
theme(panel.grid = element_blank()) +
facet_wrap(~ id)
```
Since it wasn't clear which `id` values the authors used in the text, we took a random sample of nine. Change the seed to view different samples.
### Postulating and fitting multilevel models with variably spaced waves of data.
The composite formula for our first model is
$$
\begin{align*}
\text{piat}_{ij} & = \gamma_{00} + \gamma_{10} (\text{agegrp}_{ij} - 6.5) + \zeta_{0i} + \zeta_{1i} (\text{agegrp}_{ij} - 6.5) + \epsilon_{ij} \\
\epsilon_{ij} & \sim \operatorname{Normal} (0, \sigma_\epsilon) \\
\begin{bmatrix}
\zeta_{0i} \\ \zeta_{1i}
\end{bmatrix} & \sim \operatorname{Normal}
\begin{pmatrix}
\begin{bmatrix} 0 \\ 0 \end{bmatrix},
\mathbf \Sigma
\end{pmatrix}, \text{where} \\
\mathbf \Sigma & = \mathbf D \mathbf\Omega \mathbf D', \text{where} \\
\mathbf D & = \begin{bmatrix} \sigma_0 & 0 \\ 0 & \sigma_1 \end{bmatrix} \text{and} \\
\mathbf \Omega & = \begin{bmatrix} 1 & \rho_{01} \\ \rho_{01} & 1 \end{bmatrix}
\end{align*}
$$
It's the same for the twin model using `age` rather than `agegrp`. Notice how we've switched from Singer and Willett's $\sigma^2$ parameterization to the $\sigma$ parameterization typical of **brms**.
```{r}
reading_pp <-
reading_pp %>%
mutate(agegrp_c = agegrp - 6.5,
age_c = age - 6.5)
head(reading_pp)
```
In the last chapter, we began familiarizing ourselves with `brms::brm()` default priors. It's time to level up. Another approach is to use domain knowledge to set weakly-informative priors. Let's start with the PIAT. The Peabody Individual Achievement Test is a standardized individual test of scholastic achievement. It yields several subtest scores. The reading subtest is the one we're focusing on, here. As is typical for such tests, the PIAT scores are normed to yield a population mean of 100 and a standard deviation of 15.
With that information alone, even a PIAT novice should have an idea about how to specify the priors. Since our sole predictor variables are versions of age centered at 6.5, we know that the model intercept is interpreted as the expected value on the PIAT when the children are 6.5 years old. If you knew nothing else, you'd guess the mean score would be 100 with a standard deviation around 15. One way to use a weakly-informative prior on the intercept would be to multiply that $SD$ by a number like 2.
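Just to get a sense of what that implies, here's a quick check of the span containing the middle 95% of a $\operatorname{Normal}(100, 30)$ prior.
```{r}
# the middle 95% of a normal(100, 30) prior
qnorm(c(.025, .975), mean = 100, sd = 30)
```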
Next we need a prior for the time variables, `age_c` and `agegrp_c`. A one-unit increase in either of these is the expected increase in the PIAT with one year's passage of age. Bringing in a little domain knowledge, IQ and achievement tests tend to be rather stable over time. However, we expect children to improve as they age, and we don't know exactly how these data have been adjusted for the children's ages. It's also important to know that it's typical within the Bayesian world to place Normal priors on $\beta$ parameters. So one approach would be to center the Normal prior on 0 and put something like twice the PIAT's standard deviation on the prior's $\sigma$. If we were PIAT researchers, we could do much better. But with minimal knowledge of the test, this approach certainly beats the defaults.
Next we have the variance parameters. Recall that `brms::brm()` defaults are Student's $t$-distributions with $\nu = 3$ and $\mu = 0$. Let's start there. Now we just need to put values on $\sigma$. Since the PIAT has a standard deviation of 15 in the population, why not just use 15? If you felt insecure about this, multiply it by a factor of 2 or so. Also recall that when a Student's $t$-distribution has $\nu = 3$, the tails are quite fat. Within the context of Bayesian priors, those fat tails make it easy for the likelihood to dominate the prior even when it's a good way into the tail.
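If you'd like a quick sense of just how fat those tails are, compare the probability mass more than three scale units from the mean under a Student's $t$ with $\nu = 3$ versus a Normal.
```{r}
# p(more than 3 scale units from the mean) under each distribution
2 * pt(-3, df = 3) # Student-t with nu = 3
2 * pnorm(-3)      # the Normal
```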
Finally, we have the correlation among the group-level variance parameters, $\sigma_0$ and $\sigma_1$. Recall that last chapter we learned the `brms::brm()` default was `lkj(1)`. To get a sense of what the LKJ does, we'll simulate from it. McElreath's **rethinking** package contains a handy `rlkjcorr()` function, which will allow us to simulate `n` draws from a `K` by `K` correlation matrix for which $\eta$ is defined by `eta`. Let's take `n <- 1e6` draws from two LKJ prior distributions, one with $\eta = 1$ and the other with $\eta = 4$.
```{r, warning = F, message = F}
library(rethinking)
n <- 1e6
set.seed(5)
lkj <-
tibble(eta = c(1, 4)) %>%
mutate(draws = purrr::map(eta, ~ rlkjcorr(n, K = 2, eta = .)[, 2, 1])) %>%
unnest(draws)
glimpse(lkj)
```
Now plot that `lkj`.
```{r, fig.width = 4, fig.height = 2.5}
lkj %>%
mutate(eta = factor(eta)) %>%
ggplot(aes(x = draws, fill = eta, color = eta)) +
geom_density(size = 0, alpha = 2/3) +
geom_text(data = tibble(
draws = c(.75, .35),
y = c(.6, 1.05),
label = c("eta = 1", "eta = 4"),
eta = c(1, 4) %>% as.factor()),
aes(y = y, label = label)) +
scale_fill_viridis_d(option = "A", end = .5) +
scale_color_viridis_d(option = "A", end = .5) +
scale_y_continuous(NULL, breaks = NULL) +
xlab(expression(rho)) +
theme(panel.grid = element_blank(),
legend.position = "none")
```
When we use `lkj(1)`, the prior is flat over the parameter space. However, setting `lkj(4)` is tantamount to a prior with a probability mass concentrated a bit towards zero. It's a prior that's skeptical of extremely large or small correlations. Within the context of our multilevel model $\rho$ parameters, this will be our weakly-regularizing prior.
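One way to quantify that skepticism is to compute how much prior mass each $\eta$ setting puts on extreme correlations, say beyond $\pm .8$.
```{r}
lkj %>% 
  group_by(eta) %>% 
  summarise(`p(|rho| > .8)` = mean(abs(draws) > .8))
```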
Let's prepare to fit our models and load **brms**.
```{r, warning = F, message = F}
detach(package:rethinking, unload = T)
library(brms)
```
Fit the models. They follow the same form, differing in that the first uses `agegrp_c` and the second uses `age_c`.
```{r fit5.1}
fit5.1 <-
brm(data = reading_pp,
family = gaussian,
piat ~ 0 + Intercept + agegrp_c + (1 + agegrp_c | id),
prior = c(prior(normal(100, 30), class = b, coef = Intercept),
prior(normal(0, 30), class = b, coef = agegrp_c),
prior(student_t(3, 0, 15), class = sd),
prior(student_t(3, 0, 15), class = sigma),
prior(lkj(4), class = cor)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 5,
file = "fits/fit05.01")
fit5.2 <-
brm(data = reading_pp,
family = gaussian,
piat ~ 0 + Intercept + age_c + (1 + age_c | id),
prior = c(prior(normal(100, 30), class = b, coef = Intercept),
prior(normal(0, 30), class = b, coef = age_c),
prior(student_t(3, 0, 15), class = sd),
prior(student_t(3, 0, 15), class = sigma),
prior(lkj(4), class = cor)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 5,
file = "fits/fit05.02")
```
Focusing first on `fit5.1`, our analogue to the $(\text{AGEGRP} - 6.5)$ model displayed in Table 5.2, here is our model summary.
```{r}
print(fit5.1, digits = 3)
```
Here's the `age_c` model.
```{r}
print(fit5.2, digits = 3)
```
For a more focused look, we can use `fixef()` to compare our $\gamma$'s with each other and with those in the text.
```{r}
fixef(fit5.1) %>% round(digits = 3)
fixef(fit5.2) %>% round(digits = 3)
```
Here are our $\sigma_\epsilon$ summaries.
```{r}
VarCorr(fit5.1)$residual$sd %>% round(digits = 3)
VarCorr(fit5.2)$residual$sd %>% round(digits = 3)
```
From a quick glance, you can see they are about the square roots of the $\sigma_\epsilon^2$ estimates in the text.
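For example, squaring the posterior mean for $\sigma_\epsilon$ from `fit5.1` puts it back into the $\sigma_\epsilon^2$ metric used in the text.
```{r}
# square the posterior mean for sigma_epsilon
VarCorr(fit5.1)$residual$sd[, "Estimate"]^2
```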
Let's go ahead and compute the LOO and WAIC.
```{r, warning = F, message = F}
fit5.1 <- add_criterion(fit5.1, criterion = c("loo", "waic"))
fit5.2 <- add_criterion(fit5.2, criterion = c("loo", "waic"))
```
Compare the models with a WAIC difference.
```{r}
loo_compare(fit5.1, fit5.2, criterion = "waic") %>%
print(simplify = F)
```
The WAIC difference between the two isn't that large relative to its standard error. The LOO tells a similar story.
```{r}
loo_compare(fit5.1, fit5.2, criterion = "loo") %>%
print(simplify = F)
```
The uncertainty in our WAIC and LOO estimates and their differences provides information that was not available for the AIC and the BIC comparisons in the text. We can also compare the WAIC and the LOO with model weights. Given the WAIC, from @mcelreathStatisticalRethinkingBayesian2015 we learn
> A total weight of 1 is partitioned among the considered models, making it easier to compare their relative predictive accuracy. The weight for a model $i$ in a set of $m$ models is given by:
>
> $$w_i = \frac{\exp(-\frac{1}{2} \text{dWAIC}_i)}{\sum_{j = 1}^m \exp(-\frac{1}{2} \text{dWAIC}_j)}$$
>
> where dWAIC is the `dWAIC` in the `compare` table output. This example uses WAIC but the formula is the same for any other information criterion, since they are all on the deviance scale. (p. 199)
The `compare()` function McElreath referenced is from his [-@R-rethinking] [**rethinking** package](https://xcelab.net/rm/software/), which is meant to accompany his [texts](https://xcelab.net/rm/statistical-rethinking/). We don't have that function with **brms**. A rough analogue to the `rethinking::compare()` function is `loo_compare()`. We don't quite have a `dWAIC` column from `loo_compare()`. Remember how last chapter we discussed how Aki Vehtari isn't a fan of converting information criteria to the $\chi^2$ difference metric with that last $-2 \times ...$ step? That's why we have an `elpd_diff` instead of a `dWAIC`. But to get the corresponding value, you just multiply those values by -2. And yet if you look closely at the formula for $w_i$, you'll see that each time the dWAIC term appears, it's multiplied by $-\frac{1}{2}$. So we don't really need that `dWAIC` value anyway. As it turns out, we're good to go with our `elpd_diff`. Thus the above equation simplifies to
$$
w_i = \frac{\exp(\text{elpd\_diff}_i)}{\sum_{j = 1}^m \exp(\text{elpd\_diff}_j)}
$$
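If you'd like to check that by hand, you could pull the `elpd_diff` column from the `loo_compare()` output and plug it into the simplified formula. The results should line up with the `model_weights()` output below.
```{r}
# weights by hand from the elpd differences
elpd_diff <- loo_compare(fit5.1, fit5.2, criterion = "waic")[, "elpd_diff"]
exp(elpd_diff) / sum(exp(elpd_diff))
```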
But recall you don't have to do any of this by hand. We have the `brms::model_weights()` function, which we can use to compute weights with the WAIC or the LOO.
```{r}
model_weights(fit5.1, fit5.2, weights = "waic") %>% round(digits = 3)
model_weights(fit5.1, fit5.2, weights = "loo") %>% round(digits = 3)
```
Both put the lion's share of the weight on the `age_c` model. Back to McElreath [-@mcelreathStatisticalRethinkingBayesian2015]:
> But what do these weights mean? There actually isn't a consensus about that. But here's Akaike’s interpretation, which is [common](https://www.springer.com/us/book/9780387953649).
>
>> *A model's weight is an estimate of the probability that the model will make the best predictions on new data, conditional on the set of models considered*.
>
> Here's the heuristic explanation. First, regard WAIC as the expected deviance of a model on future data. That is to say that WAIC gives us an estimate of $\text{E} (D_\text{test})$. Akaike weights convert these deviance values, which are log-likelihoods, to plain likelihoods and then standardize them all. This is just like Bayes' theorem uses a sum in the denominator to standardize the product of the likelihood and prior. Therefore the Akaike weights are analogous to posterior probabilities of models, conditional on expected future data. (p. 199, *emphasis* in the original)
## Varying numbers of measurement occasions
As Singer and Willett pointed out,
> once you allow the spacing of waves to vary across individuals, it is a small leap to allow their *number* to vary as well. Statisticians say that such data sets are *unbalanced.* As you would expect, balance facilitates analysis: models can be parameterized more easily, random effects can be estimated more precisely, and computer algorithms will converge more rapidly.
>
> Yet a major advantage of the multilevel model for change is that it is easily fit to unbalanced data. (p. 146, *emphasis* in the original)
```{r, echo = F}
# to free up memory, we'll drop these two
rm(fit5.1, fit5.2)
```
### Analyzing data sets in which the number of waves per person varies.
Here we load the `wages_pp.csv` data.
```{r, warning = F, message = F}
wages_pp <- read_csv("data/wages_pp.csv")
glimpse(wages_pp)
```
Here's a more focused look along the lines of Table 5.3.
```{r}
wages_pp %>%
select(id, exper, lnw, black, hgc, uerate) %>%
filter(id %in% c(206, 332, 1028))
```
To get a sense of the diversity in the number of occasions per `id`, use `group_by()` and `count()`.
```{r, fig.width = 6, fig.height = 3}
wages_pp %>%
count(id) %>%
ggplot(aes(y = n)) +
geom_bar() +
scale_y_continuous("# measurement occasions", breaks = 1:13) +
xlab("count of cases") +
theme(panel.grid = element_blank())
```
The spacing of the measurement occasions also differs a lot across cases. Recall that `exper` "identifies the specific moment--to the nearest day--in each man's labor force history associated with each observed value of" `lnw` (p. 147). Here's a sense of what that looks like.
```{r, fig.width = 6, fig.height = 2.25}
wages_pp %>%
filter(id %in% c(206, 332, 1028)) %>%
mutate(id = factor(id)) %>%
ggplot(aes(x = exper, y = lnw, color = id)) +
geom_point() +
geom_line() +
scale_color_viridis_d(option = "B", begin = .35, end = .8) +
theme(panel.grid = element_blank())
```
Uneven for dayz.
Here's the **brms** version of the composite formula for Model A, the unconditional growth model for `lnw`.
$$
\begin{align*}
\text{lnw}_{ij} & = \gamma_{00} + \gamma_{10} \text{exper}_{ij} + \zeta_{0i} + \zeta_{1i} \text{exper}_{ij} + \epsilon_{ij} \\
\epsilon_{ij} & \sim \operatorname{Normal}(0, \sigma_\epsilon) \\
\begin{bmatrix}
\zeta_{0i} \\ \zeta_{1i}
\end{bmatrix} & \sim \operatorname{Normal}
\begin{pmatrix}
\begin{bmatrix} 0 \\ 0 \end{bmatrix},
\mathbf\Sigma
\end{pmatrix}, \text{where} \\
\mathbf\Sigma & = \mathbf D \mathbf \Omega \mathbf D', \text{where} \\
\mathbf D & = \begin{bmatrix} \sigma_0 & 0 \\ 0 & \sigma_1 \end{bmatrix}, \text{and} \\
\mathbf\Omega & = \begin{bmatrix} 1 & \rho_{01} \\ \rho_{01} & 1 \end{bmatrix}
\end{align*}
$$
To attempt setting priors for this, we need to review what `lnw` is. From the text: "To adjust for inflation, each hourly wage is expressed in constant 1990 dollars. To address the skewness commonly found in wage data and to linearize the individual wage trajectories, we analyze the natural logarithm of wages, *LNW*" (p. 147). So it's the log of participant wages in 1990 dollars. From the official [US Social Security website](https://www.ssa.gov/oact/cola/central.html), we learn the average yearly wage in 1990 was $20,172.11. Here's the natural log of that.
```{r}
log(20172.11)
```
However, that's the yearly wage. In the text, this is conceptualized as rate per hour. If we presume a 40 hour week for 52 weeks, this translates to a little less than $10 per hour.
```{r}
20172.11 / (40 * 52)
```
Here's what that looks like in a log metric.
```{r}
log(20172.11 / (40 * 52))
```
But keep in mind that "to track wages on a common temporal scale, Murnane and colleagues decided to clock time from each respondent's first day of work" (p. 147). So the wages at one's initial point in the study were often entry-level wages. From the official website for the [US Department of Labor](https://www.dol.gov/whd/minwage/chart.htm), we learn the national US minimum wage in 1990 was $3.80 per hour. Here's what that looks like on the log scale.
```{r}
log(3.80)
```
So perhaps this is a better figure to center our prior for the model intercept on. If we stay with a conventional Gaussian prior and put $\mu = 1.335$, what value should we use for the standard deviation? Well, if 1.335 is the log minimum and 2.27 is the log mean, the two are less than 1 apart on the log scale. If we'd like to continue our practice of weakly-regularizing priors, a value of 1 or even 0.5 on the log scale would seem reasonable. For simplicity, we'll use `normal(1.335, 1)`.
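Here's the range of hourly wages, in 1990 dollars, implied by the middle 95% of that prior.
```{r}
# back on the dollar scale
exp(qnorm(c(.025, .975), mean = 1.335, sd = 1))
```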
Next we need a prior for the expected increase over a single year's employment. A conservative default might be to center it on zero--no change from year to year. Since, as we've established, a difference of 1 on the log scale is more than the gap between the minimum and average hourly wages in 1990 dollars, we might just use `normal(0, 0.5)` as a starting point.
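On the natural-dollar scale, that prior allows a single year of experience to multiply one's wage by a factor anywhere in this 95% range, which is plenty permissive.
```{r}
# multiplicative change in hourly wage per year of experience
exp(qnorm(c(.025, .975), mean = 0, sd = 0.5))
```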
So then what about our variance parameters? Given these are all entry-level workers and given how little we'd expect them to increase from year to year, a `student_t(3, 0, 1)` on the log scale would seem pretty permissive.
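Keep in mind **brms** restricts these $\sigma$ parameters to be nonnegative, so `student_t(3, 0, 1)` acts like a half-Student-$t$. By symmetry, 95% of that prior's mass sits below the .975 quantile of the unrestricted distribution.
```{r}
# the .95 quantile of a half-student-t(3, 0, 1)
qt(.975, df = 3)
```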
So then here's how we might formally specify our model priors:
$$
\begin{align*}
\gamma_{00} & \sim \operatorname{Normal}(1.335, 1) \\
\gamma_{10} & \sim \operatorname{Normal}(0, 0.5) \\
\sigma_\epsilon & \sim \operatorname{Student-t}(3, 0, 1) \\
\sigma_0 & \sim \operatorname{Student-t}(3, 0, 1) \\
\sigma_1 & \sim \operatorname{Student-t}(3, 0, 1) \\
\rho_{01} & \sim \operatorname{LKJ} (4)
\end{align*}
$$
For a point of comparison, here are the `brms::brm()` default priors.
```{r}
get_prior(data = wages_pp,
family = gaussian,
lnw ~ 0 + Intercept + exper + (1 + exper | id))
```
Even though our priors are still quite permissive on the scale of the data, they're much more informative than the defaults. If we had formal backgrounds in the entry-level economy of the US in the early 1990s, we'd be able to specify even better priors. But hopefully this walk-through gives a sense of how to start thinking about model priors.
Let's fit the model. To keep the size of the `fits/fit05.03.rds` file below the 100MB GitHub limit, we'll set `chains = 3` and compensate by upping `iter` a little.
```{r fit5.3}
fit5.3 <-
brm(data = wages_pp,
family = gaussian,
lnw ~ 0 + Intercept + exper + (1 + exper | id),
prior = c(prior(normal(1.335, 1), class = b, coef = Intercept),
prior(normal(0, 0.5), class = b, coef = exper),
prior(student_t(3, 0, 1), class = sd),
prior(student_t(3, 0, 1), class = sigma),
prior(lkj(4), class = cor)),
iter = 2500, warmup = 1000, chains = 3, cores = 3,
seed = 5,
file = "fits/fit05.03")
```
Here are the results.
```{r}
print(fit5.3, digits = 3)
```
Since the criterion `lnw` is on the log scale, Singer and Willett pointed out our estimate for $\gamma_{10}$ indicates a nonlinear growth rate on the natural dollar scale. They further explicated that "if an outcome in a linear relationship, $Y$, is expressed as a natural logarithm and $\hat \gamma_{01}$ is the regression coefficient for a predictor $X$, then $100(e^{\hat{\gamma}_{01}} - 1)$ is the *percentage change* in $Y$ per unit difference in $X$" (p. 148, *emphasis* in the original). Here's how to do that conversion with our **brms** output.
```{r, warning = F}
draws <-
as_draws_df(fit5.3) %>%
transmute(percent_change = 100 * (exp(b_exper) - 1))
head(draws)
```
For our plot, let's break out [Matthew Kay](https://twitter.com/mjskay)'s handy [**tidybayes** package](https://mjskay.github.io/tidybayes/) [@R-tidybayes]. With the `tidybayes::stat_halfeye()` function, it's easy to put horizontal point intervals beneath our parameter densities. Here we'll use 95% intervals.
```{r, fig.width = 3.5, fig.height = 2.5, warning = F, message = F}
library(tidybayes)
draws %>%
ggplot(aes(x = percent_change, y = 0)) +
stat_halfeye(.width = .95) +
scale_y_continuous(NULL, breaks = NULL) +
labs(title = "Percent change",
x = expression(100*(italic(e)^(hat(gamma)[1][0])-1))) +
theme(panel.grid = element_blank())
```
The **tidybayes** package also has a group of functions that make it easy to summarize posterior parameters with measures of central tendency (i.e., mean, median, mode) and intervals (i.e., percentile based, highest posterior density intervals). Here we’ll use `median_qi()` to get the posterior median and percentile-based 95% intervals.
```{r}
draws %>%
median_qi(percent_change)
```
For our next model, Model B in Table 5.4, we add two time-invariant covariates. In the data, these are listed as `black` and `hgc.9`. Before we proceed, let's rename `hgc.9` to be more consistent with [**tidyverse** style](https://style.tidyverse.org/syntax.html#object-names).
```{r}
wages_pp <-
wages_pp %>%
rename(hgc_9 = hgc.9)
```
There we go. Let's take a look at the distributions of our covariates.
```{r, fig.width = 6, fig.height = 2.5}
wages_pp %>%
pivot_longer(c(black, hgc_9)) %>%
ggplot(aes(x = value)) +
geom_bar() +
theme(panel.grid = element_blank()) +
facet_wrap(~ name, scales = "free")
```
We see `black` is a dummy variable coded "Black" = 1, "Non-black" = 0. `hgc_9` is a roughly Gaussian ordinal variable centered around zero. For context, it might also help to check its standard deviation.
```{r}
sd(wages_pp$hgc_9)
```
With a mean near 0 and an $SD$ near 1, `hgc_9` is almost in a standardized metric. If we wanted to keep with our weakly-regularizing approach, `normal(0, 1)` or even `normal(0, 0.5)` would be pretty permissive for both these variables. Recall that we're predicting wage on the log scale. A $\gamma$ value of 1 or even 0.5 would be humongous for the social sciences. Since we already have the $\gamma$ for `exper` set to `normal(0, 0.5)`, let's just keep with that. Here's how we might describe our model in statistical terms:
$$
\begin{align*}
\text{lnw}_{ij} & = \gamma_{00} + \gamma_{01} (\text{hgc}_{i} - 9) + \gamma_{02} \text{black}_{i} \\
& \;\;\; + \gamma_{10} \text{exper}_{ij} + \gamma_{11} \text{exper}_{ij} \times (\text{hgc}_{i} - 9) + \gamma_{12} \text{exper}_{ij} \times \text{black}_{i} \\
& \;\;\; + \zeta_{0i} + \zeta_{1i} \text{exper}_{ij} + \epsilon_{ij} \\
\epsilon_{ij} & \sim \operatorname{Normal} (0, \sigma_\epsilon) \\
\begin{bmatrix}
\zeta_{0i} \\ \zeta_{1i}
\end{bmatrix} & \sim \operatorname{Normal}
\begin{pmatrix}
\begin{bmatrix} 0 \\ 0 \end{bmatrix},
\mathbf D \mathbf\Omega \mathbf D'
\end{pmatrix} \\
\mathbf D & = \begin{bmatrix} \sigma_0 & 0 \\ 0 & \sigma_1 \end{bmatrix} \\
\mathbf\Omega & = \begin{bmatrix} 1 & \rho_{01} \\ \rho_{01} & 1 \end{bmatrix} \\
\gamma_{00} & \sim \operatorname{Normal}(1.335, 1) \\
\gamma_{01}, \dots, \gamma_{12} & \sim \operatorname{Normal}(0, 0.5) \\
\sigma_\epsilon, \sigma_0, \text{ and } \sigma_1 & \sim \operatorname{Student-t} (3, 0, 1) \\
\rho_{01} & \sim \operatorname{LKJ} (4).
\end{align*}
$$
The top portion up through the $\mathbf\Omega$ line is the likelihood. Starting with $\gamma_{00} \sim \text{Normal}(1.335, 1)$ on down, we've listed our priors. Here's how to fit the model with **brms**.
```{r fit5.4}
fit5.4 <-
brm(data = wages_pp,
family = gaussian,
lnw ~ 0 + Intercept + hgc_9 + black + exper + exper:hgc_9 + exper:black + (1 + exper | id),
prior = c(prior(normal(1.335, 1), class = b, coef = Intercept),
prior(normal(0, 0.5), class = b),
prior(student_t(3, 0, 1), class = sd),
prior(student_t(3, 0, 1), class = sigma),
prior(lkj(4), class = cor)),
iter = 2500, warmup = 1000, chains = 3, cores = 3,
seed = 5,
file = "fits/fit05.04")
```
Let's take a look at the results.
```{r}
print(fit5.4, digits = 3)
```
The $\gamma$'s are on par with those in the text. When we convert the $\sigma$ parameters to the $\sigma^2$ metric, here's what they look like.
```{r, fig.width = 8, fig.height = 2.25, warning = F}
draws <- as_draws_df(fit5.4)
draws %>%
transmute(`sigma[0]^2` = sd_id__Intercept^2,
`sigma[1]^2` = sd_id__exper^2,
`sigma[epsilon]^2` = sigma^2) %>%
pivot_longer(everything()) %>%
ggplot(aes(x = value, y = name)) +
stat_halfeye(.width = .95, normalize = "xy") +
scale_y_discrete(NULL, labels = ggplot2:::parse_safe) +
coord_cartesian(ylim = c(1.4, 3.4)) +
theme(axis.ticks.y = element_blank(),
panel.grid = element_blank())
```
We might plot our $\gamma$'s, too. Here we'll use `tidybayes::stat_pointinterval()` to just focus on the points and intervals.
```{r, fig.width = 8, fig.height = 2, warning = F}
draws %>%
select(b_Intercept:`b_black:exper`) %>%
set_names(str_c("gamma", c("[0][0]", "[0][1]", "[0][2]", "[1][0]", "[1][1]", "[1][2]"))) %>%
pivot_longer(everything()) %>%
ggplot(aes(x = value, y = name)) +
geom_vline(xintercept = 0, color = "white") +
stat_pointinterval(.width = .95, size = 1/2) +
scale_y_discrete(NULL, labels = ggplot2:::parse_safe) +
theme(axis.ticks.y = element_blank(),
panel.grid = element_blank())
```
As in the text, our $\gamma_{02}$ and $\gamma_{11}$ parameters hovered around zero. For our next model, Model C in Table 5.4, we'll drop those parameters.
```{r fit5.5}
fit5.5 <-
brm(data = wages_pp,
family = gaussian,
lnw ~ 0 + Intercept + hgc_9 + exper + exper:black + (1 + exper | id),
prior = c(prior(normal(1.335, 1), class = b, coef = Intercept),
prior(normal(0, 0.5), class = b),
prior(student_t(3, 0, 1), class = sd),
prior(student_t(3, 0, 1), class = sigma),
prior(lkj(4), class = cor)),
iter = 2500, warmup = 1000, chains = 3, cores = 3,
seed = 5,
file = "fits/fit05.05")
```
Let's take a look at the results.
```{r}
print(fit5.5, digits = 3)
```
Perhaps unsurprisingly, the parameter estimates for `fit5.5` ended up quite similar to those from `fit5.4`. Happily, they're also similar to those in the text. Let's compute the WAIC estimates.
```{r, warning = F, message = F}
fit5.3 <- add_criterion(fit5.3, criterion = "waic")
fit5.4 <- add_criterion(fit5.4, criterion = "waic")
fit5.5 <- add_criterion(fit5.5, criterion = "waic")
```
Compare their WAIC estimates using $\text{elpd}$ difference scores.
```{r}
loo_compare(fit5.3, fit5.4, fit5.5, criterion = "waic") %>%
print(simplify = F)
```
The differences are subtle. Here are the WAIC weights.
```{r}
model_weights(fit5.3, fit5.4, fit5.5, weights = "waic") %>%
round(digits = 3)
```
When we use weights, almost all goes to `fit5.4` and `fit5.5`. Focusing on the trimmed model, `fit5.5`, let's get ready to make our version of Figure 5.2. We'll start with `fitted()` work.
```{r}
nd <-
crossing(black = 0:1,
hgc_9 = c(0, 3)) %>%
expand_grid(exper = seq(from = 0, to = 11, length.out = 30))
f <-
fitted(fit5.5,
newdata = nd,
re_formula = NA) %>%
data.frame() %>%
bind_cols(nd)
head(f)
```
Here it is, our two-panel version of Figure 5.2.
```{r, fig.width = 6, fig.height = 3}
f %>%
mutate(black = factor(black,
labels = c("Latinos and Whites", "Blacks")),
hgc_9 = factor(hgc_9,
labels = c("9th grade dropouts", "12th grade dropouts"))) %>%
ggplot(aes(x = exper,
color = black, fill = black)) +
geom_ribbon(aes(ymin = Q2.5, ymax = Q97.5),
size = 0, alpha = 1/4) +
geom_line(aes(y = Estimate)) +
scale_fill_viridis_d(NULL, option = "C", begin = .25, end = .75) +
scale_color_viridis_d(NULL, option = "C", begin = .25, end = .75) +
ylab("lnw") +
coord_cartesian(ylim = c(1.6, 2.4)) +
theme(panel.grid = element_blank()) +
facet_wrap(~ hgc_9)
```
This leads in nicely to a brief discussion of posterior predictive checks (PPC). The basic idea is that good models should be able to retrodict the data used to produce them. Table 5.3 in the text introduced the data set by highlighting three participants and we went ahead and looked at their data in a plot. One way to do a PPC might be to plot their original data atop their model estimates. The `fitted()` function will help us with the preparatory work.
```{r}
nd <-
wages_pp %>%
filter(id %in% c(206, 332, 1028))
f <-
fitted(fit5.5,
newdata = nd) %>%
data.frame() %>%
bind_cols(nd)
head(f)
```
Here's the plot.
```{r, fig.width = 8, fig.height = 2.5}
f %>%
mutate(id = str_c("id = ", id)) %>%
ggplot(aes(x = exper)) +
geom_pointrange(aes(y = Estimate, ymin = Q2.5, ymax = Q97.5,
color = id)) +
geom_point(aes(y = lnw)) +
scale_color_viridis_d(option = "B", begin = .35, end = .8) +
labs(subtitle = "The black dots are the original data. The colored points and vertical lines are the participant-specific posterior\nmeans and 95% intervals.") +
theme(legend.position = "none",
panel.grid = element_blank()) +
facet_wrap(~ id)
```
Although each participant got their own intercept and slope, the estimates all fall in straight lines. Since we're only working with time-invariant covariates, that's about the best we can do. Though our models can express gross trends over time, they're unable to speak to variation from occasion to occasion. A little later in this chapter, we'll learn how to do better.
### Practical problems that may arise when analyzing unbalanced data sets.
With HMC, the issues with non-convergence aren't quite the same as with maximum likelihood estimation. However, the basic issue still remains:
> Estimation of variance components requires that enough people have sufficient data to allow quantification of within-person residual variation--variation in the residuals over and above the fixed effects. If too many people have too little data, you will ~~be unable to quantify~~ [have difficulty quantifying] this residual variability. (p. 152)
The big difference is that as Bayesians, our priors add additional information that will help us define the posterior distributions of our variance components. Thus our challenge will be choosing sensible priors for our $\sigma$'s.
#### Boundary constraints.
Unlike with the frequentist multilevel software discussed in the text, **brms** will not yield negative values on the $\sigma$ parameters. This is because the **brms** default is to set a lower limit of zero on those parameters. For example, see what happens when we execute `fit5.3$model`.
```{r}
fit5.3$model
```
That returned the Stan code corresponding to our `brms::brm()` code, above. Notice the second and third lines in the `parameters` block. Both contain `<lower=0>`, which indicates the lower bound for those parameters is zero. See? Stan has you covered.
Let's load the `wages_small_pp.csv` data.
```{r, warning = F, message = F}
wages_small_pp <- read_csv("data/wages_small_pp.csv") %>%
rename(hgc_9 = hcg.9)
glimpse(wages_small_pp)
```
Here's the distribution of the number of measurement occasions for our small data set.
```{r, fig.width = 6, fig.height = 3}
wages_small_pp %>%
count(id) %>%
ggplot(aes(y = n)) +
geom_bar() +
scale_y_continuous("# measurement occasions", breaks = 1:13, limits = c(.5, 13)) +
xlab("count of cases") +
theme(panel.grid = element_blank())
```
Our `brm()` code is the same as that for `fit5.5`, above, with just a slightly different `data` argument. If we wanted to, we could be hasty and just use `update()`, instead. But since we're still practicing setting our priors and such, here we'll be exhaustive.
```{r fit5.6}
fit5.6 <-
brm(data = wages_small_pp,
family = gaussian,
lnw ~ 0 + Intercept + hgc_9 + exper + exper:black + (1 + exper | id),
prior = c(prior(normal(1.335, 1), class = b, coef = Intercept),
prior(normal(0, 0.5), class = b),
prior(student_t(3, 0, 1), class = sd),
prior(student_t(3, 0, 1), class = sigma),
prior(lkj(4), class = cor)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 5,
file = "fits/fit05.06")
```
```{r}
print(fit5.6)
```
Let's walk through this slowly.
You may have noticed that warning message about *divergent transitions*. We'll get to that in a bit. First focus on the parameter estimates for `sd(exper)`. Unlike in the text, our posterior mean is not 0.000. But do remember that our posterior is parameterized in the $\sigma$ metric. Let's do a little converting and look at it in a plot.
```{r, warning = F}
draws <- as_draws_df(fit5.6)
v <-
draws %>%
transmute(sigma_1 = sd_id__exper) %>%
mutate(sigma_2_1 = sigma_1^2) %>%
set_names("sigma[1]", "sigma[1]^2") %>%
pivot_longer(everything())
```
Plot.
```{r, fig.width = 8, fig.height = 2.5}
v %>%
ggplot(aes(x = value, y = name)) +
stat_halfeye(.width = .95, normalize = "xy") +
scale_y_discrete(NULL, labels = parse(text = c("sigma[1]", "sigma[1]^2"))) +
theme(axis.ticks.y = element_blank(),
panel.grid = element_blank())
```
In the $\sigma$ metric, the posterior is bunched up a little on the boundary, but much of its mass is a gently right-skewed mound concentrated in the 0 to 0.1 range. When we convert the posterior to the $\sigma^2$ metric, the parameter appears much more bunched up against the boundary. Because we typically summarize our posteriors with means or medians, the point estimate still moves away from zero.
```{r}
v %>%
group_by(name) %>%
mean_qi() %>%
mutate_if(is.double, round, digits = 4)
v %>%
group_by(name) %>%
median_qi() %>%
mutate_if(is.double, round, digits = 4)
```
But it really does start to shoot to zero if we attempt to summarize the central tendency with the mode, as within the maximum likelihood paradigm.
```{r}
v %>%
group_by(name) %>%
mode_qi() %>%
mutate_if(is.double, round, digits = 4)
```
Backing up to that warning message, we were informed that "Increasing adapt_delta above 0.8 may help." The `adapt_delta` parameter ranges from 0 to 1. The `brm()` default is .8. In my experience, increasing to .9 or .99 is often a good place to start. For this model, .9 wasn't quite enough, but .99 worked. Here's how to do it.
```{r fit5.7}
fit5.7 <-
brm(data = wages_small_pp,
family = gaussian,
lnw ~ 0 + Intercept + hgc_9 + exper + exper:black + (1 + exper | id),
prior = c(prior(normal(1.335, 1), class = b, coef = Intercept),
prior(normal(0, 0.5), class = b),
prior(student_t(3, 0, 1), class = sd),
prior(student_t(3, 0, 1), class = sigma),
prior(lkj(4), class = cor)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 5,
control = list(adapt_delta = .99),
file = "fits/fit05.07")
```
Now look at the summary.
```{r}
print(fit5.7)
```
Our estimates were pretty much the same as before. Happily, this time we got our summary without any warning signs. It won't always be that way, so make sure to take `adapt_delta` warnings seriously.
Now do note that for both `fit5.6` and `fit5.7`, our effective sample sizes for $\sigma_0$ and $\sigma_1$ aren't terribly large relative to the total number of post-warmup draws, 4,000. If it were really important that you had high-quality summary statistics for these parameters, you might need to refit the model with something like `iter = 20000, warmup = 2000`.
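Here's a sketch of what that refit might look like with `update()`, which recycles the compiled model so only the sampling settings change. We're not evaluating it here and the exact settings are just for illustration.
```{r, eval = F}
# not run: refit with many more post-warmup draws
fit5.7_long <-
  update(fit5.7,
         iter = 20000, warmup = 2000, chains = 4, cores = 4,
         seed = 5)
```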
In Model B in Table 5.5, Singer and Willett gave the results of a model with the boundary constraints on the $\sigma^2$ parameters removed. I am not going to attempt something like that with **brms**. If you're interested, you're on your own.
But we will fit a version of their Model C where we've removed the $\sigma_1$ parameter. Notice that this results in our removal of the LKJ prior for $\rho_{01}$, too. Without a $\sigma_1$, there's no other parameter with which our lonely $\sigma_0$ might covary.
```{r fit5.8}
fit5.8 <-
brm(data = wages_small_pp,
family = gaussian,
lnw ~ 0 + Intercept + hgc_9 + exper + exper:black + (1 | id),
prior = c(prior(normal(1.335, 1), class = b, coef = Intercept),
prior(normal(0, 0.5), class = b),
prior(student_t(3, 0, 1), class = sd),
prior(student_t(3, 0, 1), class = sigma)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 5,
file = "fits/fit05.08")
```
Here is the basic model summary.
```{r}
print(fit5.8)
```
No warning messages and our effective samples for $\sigma_0$ improved a bit. Compute the WAIC for both models.
```{r}
fit5.7 <- add_criterion(fit5.7, criterion = "waic")
fit5.8 <- add_criterion(fit5.8, criterion = "waic")
```
Compare.
```{r}
loo_compare(fit5.7, fit5.8, criterion = "waic") %>%
print(simplify = F, digits = 3)
```
Yep. Those WAIC estimates are quite similar and when you compare them with formal $\text{elpd}$ difference scores, the standard error is about the same size as the difference itself.
Though we're stepping away from the text a bit, we should explore more alternatives for this boundary issue. The Stan team has put together a *Prior Choice Recommendations* wiki at [https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations](https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations). In the [*Boundary-avoiding priors for modal estimation (posterior mode, MAP, marginal posterior mode, marginal maximum likelihood, MML)*](https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations#boundary-avoiding-priors-for-modal-estimation-posterior-mode-map-marginal-posterior-mode-marginal-maximum-likelihood-mml) section, we read:
> * These are for parameters such as group-level scale parameters, group-level correlations, group-level covariance matrix
> * What all these parameters have in common is that (a) they're defined on a space with a boundary, and (b) the likelihood, or marginal likelihood, can have a mode on the boundary. Most famous example is the group-level scale parameter tau for the 8-schools hierarchical model.
> * With full Bayes the boundary shouldn't be a problem (as long as you have any proper prior).
> * But with modal estimation, the estimate can be on the boundary, which can create problems in posterior predictions. For example, consider a varying-intercept varying-slope multilevel model which has an intercept and slope for each group. Suppose you fit marginal maximum likelihood and get a modal estimate of 1 for the group-level correlation. Then in your predictions the intercept and slope will be perfectly correlated, which in general will be unrealistic.
> * For a one-dimensional parameter restricted to be positive (e.g., the scale parameter in a hierarchical model), we recommend Gamma(2,0) prior (that is, p(tau) proportional to tau) which will keep the mode away from 0 but still allows it to be arbitrarily close to the data if that is what the likelihood wants. For details see this paper by Chung et al.: http://www.stat.columbia.edu/~gelman/research/published/chung_etal_Pmetrika2013.pdf
> + Gamma(2,0) biases the estimate upward. When number of groups is small, try Gamma(2,1/A), where A is a scale parameter representing how high tau can be.
We should walk those Gamma priors out a bit. The paper by @chungNondegeneratePenalizedLikelihood2013 is quite helpful. We'll first let them give us a little more background on the topic:
> Zero group-level variance estimates can cause several problems. Zero variance can go against prior knowledge of researchers and results in underestimation of uncertainty in fixed coefficient estimates. Inferences for groups are often of interest to researchers, but when the group-level variance is estimated as zero, the resulting predictions of the group-level errors will all be zero, so one fails to find unexplained differences between groups. In addition, uncertainty in predictions for new and existing groups is also understated. (p. 686)
They expounded further on page 687.
> When a variance parameter is estimated as zero, there is typically a large amount of uncertainty about this variance. One possibility is to declare in such situations that not enough information is available to estimate a multilevel model. However, the available alternatives can be unappealing since, as noted in the introduction, discarding a variance component or setting the variance to zero understates the uncertainty. In particular, standard errors for coefficients of covariates that vary between groups will be too low as we will see in Section 2.2. The other extreme is to fit a regression with indicators for groups (a fixed-effects model), but this will overcorrect for group effects (it is mathematically equivalent to a mixed-effects model with variance set to infinity), and also does not allow predictions for new groups.
>
> Degenerate variance estimates lead to complete shrinkage of predictions for new and existing groups and yield estimated prediction standard errors that understate uncertainty. This problem has been pointed out by Li and Lahiri [-@liAdjustedMaximumLikelihood2010] and Morris and Tang [-@morrisEstimatingRandomEffects2011] in small area estimation....
>
> If zero variance is not a null hypothesis of interest, a boundary estimate, and the corresponding zero likelihood ratio test statistic, should not necessarily lead us to accept the null hypothesis and to proceed as if the true variance is zero.
In their paper, they covered both penalized maximum likelihood and full Bayesian estimation. We're just going to focus on Bayes, but some of the quotes will contain ML talk. Further, we read:
> We recommend a class of log-gamma penalties (or gamma priors) that in our default setting (the log-gamma(2, $\lambda$) penalty with $\lambda \rightarrow 0$) produce maximum penalized likelihood (MPL) estimates (or Bayes modal estimates) approximately one standard error away from zero when the maximum likelihood estimate is at zero. We consider these priors to be weakly informative in the sense that they supply some direction but still allow inference to be driven by the data. The penalty has little influence when the number of groups is large or when the data are informative about the variance, and the asymptotic mean squared error of the proposed estimator is the same as that of the maximum likelihood estimator. (p. 686)
In the upper left panel of Figure 3, Chung and colleagues gave an example of what they mean by $\lambda \rightarrow 0$: $\lambda = 0.1$. Here's an example of what $\operatorname{Gamma}(2, 0.1)$ looks like across the parameter space of 0 to 100.
```{r, fig.width = 4, fig.height = 2.25, message = F}
library(ggdist)
prior(gamma(2, 0.1)) %>%
parse_dist() %>%
ggplot(aes(xdist = .dist_obj, y = prior)) +
stat_halfeye(.width = .95, p_limits = c(.0001, .9999)) +
scale_y_discrete(NULL, breaks = NULL, expand = expansion(add = 0.1)) +
labs(title = "Gamma(2, 0.1)",
x = "parameter space") +
coord_cartesian(xlim = c(0, 100)) +
theme(panel.grid = element_blank())
```
Given we're working with data on the log scale, that's a massively permissive prior. Let's zoom in and see what it means for the parameter space of possible values for our data.
```{r, fig.width = 4, fig.height = 2.25, message = F}
prior(gamma(2, 0.1)) %>%
parse_dist() %>%
ggplot(aes(xdist = .dist_obj, y = prior)) +
stat_halfeye(.width = .95, p_limits = c(.000001, .99)) +
scale_y_discrete(NULL, breaks = NULL, expand = expansion(add = 0.1)) +
labs(title = "Gamma(2, 0.1)",
x = "parameter space (zoomed in)") +
coord_cartesian(xlim = c(0, 2)) +
theme(panel.grid = element_blank())
```
Now keep that picture in mind as we read further along in the paper:
> In addition, with $\lambda \rightarrow 0$, the gamma density function has a positive constant derivative at zero, which allows the likelihood to dominate if it is strongly curved near zero. The positive constant derivative implies that the prior is linear at zero so that there is no dead zone near zero. The top-left panel of Figure 3 shows that the gamma(2,0.1) density increases linearly from zero with a gentle slope. The shape will be even flatter with a smaller rate parameter. (p. 691)
In case you're not familiar with the gamma distribution, the rate parameter is what we've been calling $\lambda$. Let's test this baby out with our model. Here's how to specify it in **brms**.
```{r fit5.9}
fit5.9 <-
brm(data = wages_small_pp,
family = gaussian,
lnw ~ 0 + Intercept + hgc_9 + exper + exper:black + (1 + exper | id),
prior = c(prior(normal(1.335, 1), class = b, coef = Intercept),
prior(normal(0, 0.5), class = b),
prior(student_t(3, 0, 1), class = sd),
prior(gamma(2, 0.1), class = sd, group = id, coef = exper),
prior(student_t(3, 0, 1), class = sigma),
prior(lkj(4), class = cor)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 5,
control = list(adapt_delta = .85),
file = "fits/fit05.09")
```
Notice how we had to increase `adapt_delta` a bit. Here are the results.
```{r}
print(fit5.9, digits = 3)
```
Our `sd(exper)` is still quite close to zero. But notice how the lower level of the 95% interval is now above zero. Here's what it looks like in both $\sigma$ and $\sigma^2$ metrics.
```{r, fig.width = 8, fig.height = 2.5, warning = F}
as_draws_df(fit5.9) %>%
transmute(sigma_1 = sd_id__exper) %>%
mutate(sigma_2_1 = sigma_1^2) %>%
set_names("sigma[1]", "sigma[1]^2") %>%
pivot_longer(everything()) %>%
ggplot(aes(x = value, y = name)) +
stat_halfeye(.width = .95, normalize = "xy") +
scale_y_discrete(NULL, labels = parse(text = c("sigma[1]", "sigma[1]^2"))) +
theme(axis.ticks.y = element_blank(),
panel.grid = element_blank())
```
Let's zoom in on the leftmost part of the plot.
```{r, fig.width = 8, fig.height = 2.5, warning = F}
as_draws_df(fit5.9) %>%
transmute(sigma_1 = sd_id__exper) %>%
mutate(sigma_2_1 = sigma_1^2) %>%
set_names("sigma[1]", "sigma[1]^2") %>%
pivot_longer(everything()) %>%
ggplot(aes(x = value, y = name)) +
geom_vline(xintercept = 0, color = "white") +
stat_halfeye(.width = .95, normalize = "xy") +
scale_y_discrete(NULL, labels = parse(text = c("sigma[1]", "sigma[1]^2"))) +
coord_cartesian(xlim = c(0, 0.01)) +
theme(panel.grid = element_blank())
```
Although we are still brushing up on the boundary with $\sigma_1^2$, the mode is no longer at zero. In the discussion, Chung and colleagues pointed out "sometimes weak prior information is available about a variance parameter. When $\alpha = 2$, the gamma density has its mode at $1 / \lambda$, and so one can use the $\operatorname{gamma}(\alpha, \lambda)$ prior with $1 / \lambda$ set to the prior estimate of $\sigma_\theta$" (p. 703). Let's say we only had our `wages_small_pp`, but the results of something like the `wages_pp` data were published by some earlier group of researchers. In this case, we do have good prior data; we have the point estimate from the model of the `wages_pp` data! Here's what that was in terms of the median.
```{r, fig.width = 8, fig.height = 2.5}
as_draws_df(fit5.3) %>%
median_qi(sd_id__exper)