-
Notifications
You must be signed in to change notification settings - Fork 39
/
Copy path10.Rmd
850 lines (663 loc) · 41.9 KB
/
10.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
```{r, echo = F, cache = F}
knitr::opts_chunk$set(fig.retina = 2.5)
knitr::opts_chunk$set(fig.align = "center")
options(width = 100)
```
# Big Entropy and the Generalized Linear Model
> Statistical models force many choices upon us. Some of these choices are distributions that represent uncertainty. We must choose, for each parameter, a prior distribution. And we must choose a likelihood function, which serves as a distribution of data. There are conventional choices, such as wide Gaussian priors and the Gaussian likelihood of linear regression. These conventional choices work unreasonably well in many circumstances. But very often the conventional choices are not the best choices. Inference can be more powerful when we use all of the information, and doing so usually requires going beyond convention.
>
> To go beyond convention, it helps to have some principles to guide choice. When an engineer wants to make an unconventional bridge, engineering principles help guide choice. When a researcher wants to build an unconventional model, entropy provides one useful principle to guide choice of probability distributions: Bet on the distribution with the biggest entropy. [@mcelreathStatisticalRethinkingBayesian2020, p. 299]
#### Rethinking: Bayesian updating is entropy maximization.
> Another kind of probability distribution, the posterior distribution deduced by Bayesian updating, is also a case of maximizing entropy. The posterior distribution has the greatest entropy relative to the prior (the smallest cross entropy) among all distributions consistent with the assumed constraints and the observed data. (p. 300)
## Maximum entropy
> In [Chapter 7][Ulysses' Compass], you met the basics of information theory. In brief, we seek a measure of uncertainty that satisfies three criteria: (1) the measure should be continuous; (2) it should increase as the number of possible events increases; and (3) it should be additive. The resulting unique measure of the uncertainty of a probability distribution $p$ with probabilities $p_i$ for each possible event $i$ turns out to be just the average log-probability:
>
> $$H(p) = - \sum_i p_i \log p_i$$
>
> This function is known as *information entropy*.
>
> The principle of maximum entropy applies this measure of uncertainty to the problem of choosing among probability distributions. Perhaps the simplest way to state the maximum entropy principle is:
>
>> The distribution that can happen the most ways is also the distribution with the biggest information entropy. The distribution with the biggest entropy is the most conservative distribution that obeys its constraints.
>
> There's nothing intuitive about this idea, so if it seems weird, you are normal. (pp. 300--301, *emphasis* in the original)
Let's execute the code for the pebbles-in-buckets example.
```{r, message = F, warning = F}
library(tidyverse)
d <-
tibble(a = c(0, 0, 10, 0, 0),
b = c(0, 1, 8, 1, 0),
c = c(0, 2, 6, 2, 0),
d = c(1, 2, 4, 2, 1),
e = 2)
# this is our analogue to McElreath's `lapply()` code
d %>%
mutate_all(~ . / sum(.)) %>%
# the next few lines constitute our analogue to his `sapply()` code
pivot_longer(everything(), names_to = "plot") %>%
group_by(plot) %>%
summarise(h = -sum(ifelse(value == 0, 0, value * log(value))))
```
For more on the formula syntax we used within `mutate_all()`, you might check out [this](https://dplyr.tidyverse.org/reference/mutate_all.html) or [this](https://purrr.tidyverse.org/reference/map.html).
Anyway, we're almost ready to plot, which brings us to color. For the plots in this chapter, we'll be taking our color palettes from the [**ghibli** package](https://github.com/ewenme/ghibli) [@R-ghibli], which provides palettes based on scenes from anime films by the Studio Ghibli.
```{r, message = F, warning = F}
library(ghibli)
```
The main function is `ghibli_palette()` which you can use to both preview the palettes before using them and also index in order to use specific colors. For example, we'll play with "MarnieMedium1", first.
```{r, fig.height = 1, fig.width = 4.5}
ghibli_palette("MarnieMedium1")
ghibli_palette("MarnieMedium1")[1:7]
```
Now we're ready to plot five of the six panels of Figure 10.1.
```{r, fig.width = 6, fig.height = 5}
d %>%
mutate(bucket = 1:5) %>%
pivot_longer(-bucket,
names_to = "letter",
values_to = "pebbles") %>%
ggplot(aes(x = bucket, y = pebbles)) +
geom_col(width = 1/5, fill = ghibli_palette("MarnieMedium1")[2]) +
geom_text(aes(y = pebbles + 1, label = pebbles)) +
geom_text(data = tibble(
letter = letters[1:5],
bucket = 5.5,
pebbles = 10.5,
label = str_c(c(1, 90, 1260, 37800, 113400),
rep(c(" way", " ways"), times = c(1, 4)))),
aes(label = label),
hjust = 1) +
scale_y_continuous(breaks = c(0, 5, 10), limits = c(0, 12)) +
theme(panel.background = element_rect(fill = ghibli_palette("MarnieMedium1")[6]),
panel.grid = element_blank(),
strip.background = element_rect(fill = ghibli_palette("MarnieMedium1")[7])) +
facet_wrap(~ letter, ncol = 2)
```
We might plot our version of the final panel like so.
```{r, fig.width = 3, fig.height = 2.75, message = F}
d %>%
# the next four lines are the same from above
mutate_all(~ . / sum(.)) %>%
pivot_longer(everything()) %>%
group_by(name) %>%
summarise(h = -sum(ifelse(value == 0, 0, value * log(value)))) %>%
# here's the R code 9.4 stuff
mutate(n_ways = c(1, 90, 1260, 37800, 113400)) %>%
group_by(name) %>%
mutate(log_ways = log(n_ways) / 10,
text_y = ifelse(name < "c", h + .15, h - .15)) %>%
# plot
ggplot(aes(x = log_ways, y = h)) +
geom_abline(intercept = 0, slope = 1.37, color = "white") +
geom_point(size = 2.5, color = ghibli_palette("MarnieMedium1")[7]) +
geom_text(aes(y = text_y, label = name)) +
labs(x = "log(ways) per pebble",
y = "entropy") +
theme(panel.background = element_rect(fill = ghibli_palette("MarnieMedium1")[6]),
panel.grid = element_blank())
```
"The distribution that can happen the greatest number of ways is the most plausible distribution. Call this distribution the **maximum entropy distribution**" (p. 303, **emphasis** in the original). Among the pebbles, the maximum entropy distribution was `e` (i.e., the uniform).
#### Rethinking: What good is intuition?
"Like many aspects of information theory, maximum entropy is not very intuitive. But note that intuition is just a guide to developing methods. When a method works, it hardly matters whether our intuition agrees" (p. 303).
### Gaussian.
Behold the probability density for the generalized normal distribution:
$$\Pr (y | \mu, \alpha, \beta) = \frac{\beta}{2 \alpha \Gamma \left (\frac{1}{\beta} \right )} e ^ {- \left (\frac{|y - \mu|}{\alpha} \right ) ^ {\beta}},$$
where $\alpha =$ the scale, $\beta =$ the shape, $\mu =$ the location, and $\Gamma =$ the [gamma function](https://en.wikipedia.org/wiki/Gamma_function). If you read closely in the text, you'll discover that the densities in the right panel of Figure 10.2 were all created with the constraint $\sigma^2 = 1$. But $\sigma^2 \neq \alpha$ and there's no $\sigma$ in the equations in the text. However, it appears the variance for the generalized normal distribution follows the form
$$\sigma^2 = \frac{\alpha^2 \Gamma (3/\beta)}{\Gamma (1/\beta)}.$$
So if you do the algebra, you'll see that you can compute $\alpha$ for a given $\sigma^2$ and $\beta$ with the equation
$$\alpha = \sqrt{ \frac{\sigma^2 \Gamma (1/\beta)}{\Gamma (3/\beta)} }.$$
I got the formula from [Wikipedia.com](https://en.wikipedia.org/wiki/Generalized_normal_distribution). Don't judge. We can wrap that formula in a custom function, `alpha_per_beta()`, use it to solve for the desired $\beta$ values, and plot. But one more thing: McElreath didn't tell us exactly which $\beta$ values the left panel of Figure 10.2 was based on. So the plot below is my best guess.
```{r, fig.width = 3.5, fig.height = 3}
alpha_per_beta <- function(beta, variance = 1) {
sqrt((variance * gamma(1 / beta)) / gamma(3 / beta))
}
crossing(value = seq(from = -5, to = 5, by = .1),
# I arrived at these values by trial and error
beta = c(1, 1.5, 2, 4)) %>%
mutate(mu = 0,
alpha = alpha_per_beta(beta)) %>%
# behold the formula for the generalized normal distribution in code!
mutate(density = (beta / (2 * alpha * gamma(1 / beta))) *
exp(1) ^ (-1 * (abs(value - mu) / alpha) ^ beta)) %>%
# plot
ggplot(aes(x = value, y = density, group = beta)) +
geom_line(aes(color = beta == 2, size = beta == 2)) +
scale_color_manual(values = c(ghibli_palette("MarnieMedium2")[c(2, 4)])) +
scale_size_manual(values = c(1/4, 1.25)) +
labs(subtitle = "Guess which color denotes the Gaussian.") +
coord_cartesian(xlim = c(-4, 4)) +
theme(legend.position = "none",
panel.background = element_rect(fill = ghibli_palette("MarnieMedium2")[7]),
panel.grid = element_blank())
```
Once you have $\alpha$ and $\beta$, the entropy equation for the generalized normal distribution is
$$
\text{entropy} = \frac{1}{\beta} -\log\left[\frac{\beta}{2\alpha\Gamma(1/\beta)}\right].
$$
Here's how we can use that equation to make our version of right panel of Figure 10.2.
```{r, fig.width = 3.5, fig.height = 3, warning = F, message = F}
tibble(beta = seq(from = 1, to = 4, length.out = 100)) %>%
mutate(alpha = alpha_per_beta(beta)) %>%
mutate(entropy = (1 / beta) - log((beta) / (2 * alpha * gamma(1 / beta)))) %>%
ggplot(aes(x = beta, y = entropy)) +
geom_vline(xintercept = 2, color = "white") +
geom_line(linewidth = 2, color = ghibli_palette("MarnieMedium2")[6]) +
xlab(expression(beta~(i.e.*", "*shape))) +
theme(panel.background = element_rect(fill = ghibli_palette("MarnieMedium2")[7]),
panel.grid = element_blank())
```
I should note the solution to this plot was initially beyond me. However, fellow enthusiast [Hamed Bastan-Hagh](https://github.com/hamedbh) generously shared the correct solution on [GitHub](https://github.com/ASKurz/Statistical_Rethinking_with_brms_ggplot2_and_the_tidyverse_2_ed/issues/39).
But getting back on track:
> The take-home lesson from all of this is that, if all we are willing to assume about a collection of measurements is that they have a finite variance, then the Gaussian distribution represents the most conservative probability distribution to assign to those measurements. But very often we are comfortable assuming something more. And in those cases, provided our assumptions are good ones, the principle of maximum entropy leads to distributions other than the Gaussian. (p. 306)
### Binomial.
The binomial likelihood entails
> counting the numbers of ways that a given observation could arise, according to our assumptions. The resulting distribution is known as the **binomial distribution**. If only two things can happen (blue or white marble, for example), and there's a constant chance $p$ of each across $n$ trials, then the probability of observing $y$ events of type 1 and $n - y$ events of type 2 is:
>
> $$\Pr (y | n, p) = \frac{n!}{y! (n - y)!} p^y (1 - p)^{n - y}$$
>
> It may help to note that the fraction with the factorials is just saying how many different ordered sequences of $n$ outcomes have a count of $y$. (p. 307, **emphasis** in the original)
For me, that last sentence made more sense when I walked it out in an example. To do so, let's wrap that fraction of factorials into a function.
```{r}
count_ways <- function(n, y) {
# n = the total number of trials (i.e., the number of rows in your vector)
# y = the total number of 1s (i.e., successes) in your vector
(factorial(n) / (factorial(y) * factorial(n - y)))
}
```
Now consider three sequences:
* 0, 0, 0, 0 (i.e., $n = 4$ and $y = 0$)
* 1, 0, 0, 0 (i.e., $n = 4$ and $y = 1$)
* 1, 1, 0, 0 (i.e., $n = 4$ and $y = 2$)
We can organize that information in a little tibble and then demo our `count_ways()` function.
```{r}
tibble(sequence = 1:3,
n = 4,
y = c(0, 1, 2)) %>%
mutate(n_ways = count_ways(n = n, y = y))
```
Here's the pre-Figure 10.3 data McElreath presented on page 308.
```{r}
# data
d <-
tibble(distribution = letters[1:4],
ww = c(1/4, 2/6, 1/6, 1/8),
bw = c(1/4, 1/6, 2/6, 4/8),
wb = c(1/4, 1/6, 2/6, 2/8),
bb = c(1/4, 2/6, 1/6, 1/8))
# table
d %>%
mutate_if(is.numeric, ~MASS::fractions(.) %>% as.character()) %>%
flextable::flextable()
```
Those data take just a tiny bit of wrangling before they're ready to plot in our version of Figure 10.3.
```{r, fig.width = 4, fig.height = 3.5}
d <-
d %>%
pivot_longer(-distribution,
names_to = "sequence",
values_to = "probability") %>%
mutate(sequence = factor(sequence, levels = c("ww", "bw", "wb", "bb")))
d %>%
ggplot(aes(x = sequence, y = probability, group = 1)) +
geom_point(size = 2, color = ghibli_palette("PonyoMedium")[4]) +
geom_line(color = ghibli_palette("PonyoMedium")[5]) +
labs(x = NULL, y = NULL) +
coord_cartesian(ylim = 0:1) +
theme(axis.ticks.x = element_blank(),
panel.background = element_rect(fill = ghibli_palette("PonyoMedium")[2]),
panel.grid = element_blank(),
strip.background = element_rect(fill = ghibli_palette("PonyoMedium")[6])) +
facet_wrap(~ distribution)
```
If we go step by step, we might count the expected value for each `distribution` like follows.
```{r, message = F}
d %>%
# `str_count()` will count the number of times "b" occurs within a given row of `sequence`
mutate(n_b = str_count(sequence, "b")) %>%
mutate(product = probability * n_b) %>%
group_by(distribution) %>%
summarise(expected_value = sum(product))
```
We can use the same `group_by()` strategy on the way to computing the entropy values.
```{r, message = F}
d %>%
group_by(distribution) %>%
summarise(entropy = -sum(probability * log(probability)))
```
Like in the text, `distribution == "a"` had the largest `entropy` of the four. In the next example, the $\text{expected value} = 1.4$ and $p = .7$.
```{r}
p <- 0.7
(
a <-
c((1 - p)^2,
p * (1 - p),
(1 - p) * p,
p^2)
)
```
Here's the entropy for our distribution `a`.
```{r}
-sum(a * log(a))
```
I'm going to alter McElreath's simulation function from **R** code 10.9 to take a `seed` argument. In addition, I altered the names of the objects within the function and changed the output to a tibble that will also include the conditions "ww", "bw", "wb", and "bb".
```{r}
sim_p <- function(seed, g = 1.4) {
set.seed(seed)
x_123 <- runif(3)
x_4 <- ((g) * sum(x_123) - x_123[2] - x_123[3]) / (2 - g)
z <- sum(c(x_123, x_4))
p <- c(x_123, x_4) / z
tibble(h = -sum(p * log(p)),
p = p,
key = factor(c("ww", "bw", "wb", "bb"), levels = c("ww", "bw", "wb", "bb")))
}
```
For a given `seed` and `g` value, our augmented `sim_p()` function returns a $4 \times 3$ tibble.
```{r}
sim_p(seed = 9.9, g = 1.4)
```
So the next step is to determine how many replications we'd like, create a tibble with seed values ranging from 1 to that number, and then feed those `seed` values into `sim_p()` via `purrr::map2()`, which will return a nested tibble. We'll then `unnest()` and take a peek.
```{r, echo = F}
# the sim took the better part of 10 minutes to run
# save(list = c("n_rep", "d"), file = "sims/10.sim_1.rda")
load("sims/10.sim_1.rda")
```
```{r, eval = F}
# how many replications would you like?
n_rep <- 1e5
d <-
tibble(seed = 1:n_rep) %>%
mutate(sim = map2(seed, 1.4, sim_p)) %>%
unnest(sim)
```
Take a look.
```{r}
head(d)
```
In order to intelligently choose which four replications we want to highlight in Figure 10.4, we'll want to rank order them by entropy, `h`.
```{r}
ranked_d <-
d %>%
group_by(seed) %>%
arrange(desc(h)) %>%
ungroup() %>%
# here's the rank order step
mutate(rank = rep(1:n_rep, each = 4))
head(ranked_d)
```
And we'll also want a subset of the data to correspond to McElreath's "A" through "D" distributions.
```{r}
subset_d <-
ranked_d %>%
# I arrived at these `rank` values by trial and error
filter(rank %in% c(1, 87373, n_rep - 1500, n_rep - 10)) %>%
# I arrived at the `height` values by trial and error, too
mutate(height = rep(c(8, 2.25, .75, .5), each = 4),
distribution = rep(letters[1:4], each = 4))
head(subset_d)
```
We're finally ready to make our version of the left panel of Figure 10.4.
```{r}
p1 <-
d %>%
ggplot(aes(x = h)) +
geom_density(linewidth = 0, fill = ghibli_palette("LaputaMedium")[3],
adjust = 1/4) +
# note the data statements for the next two geoms
geom_linerange(data = subset_d %>% group_by(seed) %>% slice(1),
aes(ymin = 0, ymax = height),
color = ghibli_palette("LaputaMedium")[5]) +
geom_text(data = subset_d %>% group_by(seed) %>% slice(1),
aes(y = height + .5, label = distribution)) +
scale_x_continuous("Entropy", breaks = seq(from = .7, to = 1.2, by = .1)) +
theme(panel.background = element_rect(fill = ghibli_palette("LaputaMedium")[7]),
panel.grid = element_blank())
```
Did you notice how our `adjust = 1/4` with `geom_density()` served a similar function to the `adj=0.1` in McElreath's `dens()` code? Anyways, here we make the right panel and combine the two with **patchwork**.
```{r, fig.width = 7, fig.height = 3.5}
p2 <-
ranked_d %>%
filter(rank %in% c(1, 87373, n_rep - 1500, n_rep - 10)) %>%
mutate(distribution = rep(letters[1:4], each = 4)) %>%
ggplot(aes(x = key, y = p, group = 1)) +
geom_line(color = ghibli_palette("LaputaMedium")[5]) +
geom_point(size = 2, color = ghibli_palette("LaputaMedium")[4]) +
scale_y_continuous(NULL, breaks = NULL, limits = c(0, .75)) +
xlab(NULL) +
theme(axis.ticks.x = element_blank(),
panel.background = element_rect(fill = ghibli_palette("LaputaMedium")[7]),
panel.grid = element_blank(),
strip.background = element_rect(fill = ghibli_palette("LaputaMedium")[6])) +
facet_wrap(~ distribution)
# combine and plot
library(patchwork)
p1 | p2
```
Because we simulated, our values won't match up identically with those in the text. We got pretty close, eh?
Since we saved our `sim_p()` output in a nested tibble, which we then `unnested()`, there's no need to separate the entropy values from the distributional values the way McElreath did in his **R** code 10.11. If we wanted to determine our highest entropy value--and the corresponding `seed` and `p` values, while we're at it--, we might execute something like this.
```{r}
ranked_d %>%
group_by(key) %>%
arrange(desc(h)) %>%
slice(1)
```
That maximum `h` value matched up nicely with the one in the text. If you look at the `p` column, you'll see our values approximated McElreath's `distribution` values, too. In both cases, they're real close to the `a` values we computed, above.
```{r}
a
```
"All four of these distributions really do have expected value 1.4. But among the infinite distributions that satisfy this constraint, it is only the most even distribution, the exact one nominated by the binomial distribution, that has greatest entropy" (p. 310).
## Generalized linear models
> For an outcome variable that is continuous and far from any theoretical maximum or minimum, [a simple] Gaussian model has maximum entropy. But when the outcome variable is either discrete or bounded, a Gaussian likelihood is not the most powerful choice. (p. 312)
I winged the values for our Figure 10.5.
```{r, fig.width = 3.25, fig.height = 3}
tibble(x = seq(from = -1, to = 3, by = .01)) %>%
mutate(probability = .35 + x * .5) %>%
ggplot(aes(x = x, y = probability)) +
geom_rect(xmin = -1, xmax = 3,
ymin = 0, ymax = 1,
fill = ghibli_palette("MononokeMedium")[5]) +
geom_hline(yintercept = 0:1, linetype = 2, color = ghibli_palette("MononokeMedium")[7]) +
geom_line(aes(linetype = probability > 1, color = probability > 1),
linewidth = 1) +
geom_segment(x = 1.3, xend = 3,
y = 1, yend = 1,
linewidth = 2/3, color = ghibli_palette("MononokeMedium")[3]) +
annotate(geom = "text",
x = 1.28, y = 1.04, hjust = 1,
label = "This is why we need link functions",
color = ghibli_palette("MononokeMedium")[4], size = 2.6) +
scale_color_manual(values = c(ghibli_palette("MononokeMedium")[3:4])) +
scale_y_continuous(breaks = c(0, .5, 1)) +
coord_cartesian(xlim = c(0, 2),
ylim = c(0, 1.2)) +
theme(legend.position = "none",
panel.background = element_rect(fill = ghibli_palette("MononokeMedium")[1]),
panel.grid = element_blank())
```
> Luckily, it's easy to do better. By using all of our prior knowledge about the outcome variable, usually in the form of constraints on the possible values it can take, we can appeal to maximum entropy for the choice of distribution. Then all we have to do is generalize the linear regression strategy--replace a parameter describing the shape of the likelihood with a linear model--to probability distributions other than the Gaussian. (p. 313)
As we will see, doing better will often involve using link functions.
#### Rethinking: The scourge of Histomancy.
> One strategy for choosing an outcome distribution is to plot the histogram of the outcome variable and, by gazing into its soul, decide what sort of distribution function to use. Call this strategy **Histomancy**, the ancient art of divining likelihood functions from empirical histograms. This sorcery is used, for example, when testing for normality before deciding whether or not to use a non-parametric procedure. Histomancy is a false god. (p. 314, **emphasis** in the original)
Stop worshiping at alter of this false god. Use domain knowledge and principles maximum entropy to pick your likelihoods.
### Meet the family.
> The most common distributions used in statistical modeling are members of a family known as the **exponential family**. Every member of this family is a maximum entropy distribution, for some set of constraints. And conveniently, just about every other statistical modeling tradition employs the exact same distributions, even though they arrive at them via justifications other than maximum entropy. (p. 314, **emphasis** in the original)
Here are the Gamma and Exponential panels for Figure 10.6.
```{r, fig.width = 6, fig.height = 2.25}
length_out <- 100
tibble(x = seq(from = 0, to = 5, length.out = length_out)) %>%
mutate(Gamma = dgamma(x, 2, 2),
Exponential = dexp(x)) %>%
pivot_longer(-x, values_to = "density") %>%
mutate(label = ifelse(name == "Gamma", "y %~% Gamma(lambda, kappa)", "y %~% Exponential(lambda)")) %>%
ggplot(aes(x = x, y = density)) +
geom_area(fill = ghibli_palette("SpiritedMedium")[3]) +
scale_x_continuous(NULL, breaks = NULL) +
scale_y_continuous(NULL, breaks = NULL) +
coord_cartesian(xlim = c(0, 4)) +
theme(panel.background = element_rect(fill = ghibli_palette("SpiritedMedium")[5]),
panel.grid = element_blank(),
strip.background = element_rect(fill = ghibli_palette("SpiritedMedium")[7])) +
facet_wrap(~ label, scales = "free_y", labeller = label_parsed)
```
The Gaussian:
```{r, fig.width = 3, fig.height = 2.25}
tibble(x = seq(from = -5, to = 5, length.out = length_out)) %>%
mutate(density = dnorm(x),
strip = "y %~% Normal(mu, sigma)") %>%
ggplot(aes(x = x, y = density)) +
geom_area(fill = ghibli_palette("SpiritedMedium")[3]) +
scale_x_continuous(NULL, breaks = NULL) +
scale_y_continuous(NULL, breaks = NULL) +
coord_cartesian(xlim = c(-4, 4)) +
theme(panel.background = element_rect(fill = ghibli_palette("SpiritedMedium")[5]),
panel.grid = element_blank(),
strip.background = element_rect(fill = ghibli_palette("SpiritedMedium")[7])) +
facet_wrap(~ strip, labeller = label_parsed)
```
Here is the Poisson.
```{r, fig.width = 3, fig.height = 2.25}
tibble(x = 0:20) %>%
mutate(density = dpois(x, lambda = 2.5),
strip = "y %~% Poisson(lambda)") %>%
ggplot(aes(x = x, y = density)) +
geom_col(fill = ghibli_palette("SpiritedMedium")[2], width = 1/2) +
scale_x_continuous(NULL, breaks = NULL) +
scale_y_continuous(NULL, breaks = NULL) +
coord_cartesian(xlim = c(0, 10)) +
theme(panel.background = element_rect(fill = ghibli_palette("SpiritedMedium")[5]),
panel.grid = element_blank(),
strip.background = element_rect(fill = ghibli_palette("SpiritedMedium")[7])) +
facet_wrap(~ strip, labeller = label_parsed)
```
Finally, the Binomial:
```{r, fig.width = 3, fig.height = 2.25}
tibble(x = 0:10) %>%
mutate(density = dbinom(x, size = 10, prob = .85),
strip = "y %~% Binomial(n, p)") %>%
ggplot(aes(x = x, y = density)) +
geom_col(fill = ghibli_palette("SpiritedMedium")[2], width = 1/2) +
scale_x_continuous(NULL, breaks = NULL) +
scale_y_continuous(NULL, breaks = NULL) +
coord_cartesian(xlim = c(0, 10)) +
theme(panel.background = element_rect(fill = ghibli_palette("SpiritedMedium")[5]),
panel.grid = element_blank(),
strip.background = element_rect(fill = ghibli_palette("SpiritedMedium")[7])) +
facet_wrap(~ strip, labeller = label_parsed)
```
#### Rethinking: A likelihood is a prior.
> In traditional statistics, likelihood functions are "objective" and prior distributions "subjective." In Bayesian statistics, likelihoods are deeply related to prior probability distributions: They are priors for the data, conditional on the parameters. And just like with other priors, there is no correct likelihood. But there are better and worse likelihoods, depending upon the context. (p. 316)
For a little more in this, check out McElreath's great lecture, [Bayesian statistics without frequentist language](https://youtu.be/yakg94HyWdE). This subsection also reminds me of the title of one of Gelman's blog posts, [*"It is perhaps merely an accident of history that skeptics and subjectivists alike strain on the gnat of the prior distribution while swallowing the camel that is the likelihood"*](https://statmodeling.stat.columbia.edu/2015/01/27/perhaps-merely-accident-history-skeptics-subjectivists-alike-strain-gnat-prior-distribution-swallowing-camel-likelihood/). The title, which itself is a quote, comes from one of his papers, which he linked to in the blog, along with several related papers. It's taken some time for the weight of that quote to sink in with me, and indeed it's still sinking. Perhaps you'll benefit from it, too.
### Linking linear models to distributions.
> To build a regression model from any of the exponential family distributions is just a matter of attaching one or more linear models to one or more of the parameters that describe the distribution's shape. But as hinted at earlier, usually we require a **link function** to prevent mathematical accidents like negative distances or probability masses that exceed 1. (p. 316, **emphasis** in the original)
These models generally follow the form
\begin{align*}
y_i & \sim \color{#4D6D93}{\operatorname{Some distribution}} (\theta_i, \phi) \\
\color{#4D6D93}{f(\theta_i)} & = \alpha + \beta (x_i - \bar x),
\end{align*}
where $\theta_i$ is a parameter of central interest (e.g., the probability of 1 in a Binomial distribution) and $\phi$ is a placeholder for any other parameters necessary for the likelihood but not typically of primary substantive interest (e.g., $\sigma$ in conventional Gaussian models). The $f(\cdot)$ portion is the link function.
Speaking about links,
> the **logit link** maps a parameter that is defined as a probability mass, and therefore constrained to lie between zero and one, onto a linear model that can take on any real value. This link is extremely common when working with binomial GLMs. In the context of a model definition, it looks like this:
>
> \begin{align*}
> y_i & \sim \color{#4D6D93}{\operatorname{Binomial}}(n, p_i) \\
> \color{#4D6D93}{\operatorname{logit}}(p_i) & = \alpha + \beta x_i
> \end{align*}
>
> And the logit function itself is defined as the *log-odds:*
>
> $$\operatorname{logit}(p_i) = \log \frac{p_i}{1 - p_i}$$
>
> The "odds" of an event are just the probability it happens divided by the probability it does not happen. So really all that is being stated here is:
>
> $$\log \frac{p_i}{1 - p_i} = \alpha + \beta x_i$$
If we do the final algebraic manipulation on page 317, we can solve for $p_i$ in terms of the linear model
$$p_i = \frac{\exp (\alpha + \beta x_i)}{1 + \exp (\alpha + \beta x_i)}.$$
As we'll see later, we will make great use of this formula via the `brms::inv_logit_scaled()` when making sense of logistic regression models. Now we have that last formula in hand, we can make the data necessary for Figure 10.7.
```{r, fig.width = 6, fig.height = 2.67, message = F, warning = F}
# first, we'll make data for the horizontal lines
alpha <- 0
beta <- 4
lines <-
tibble(x = seq(from = -1, to = 1, by = .25)) %>%
mutate(`log-odds` = alpha + x * beta,
probability = exp(alpha + x * beta) / (1 + exp(alpha + x * beta)))
# now we're ready to make the primary data
beta <- 2
d <-
tibble(x = seq(from = -1.5, to = 1.5, length.out = 50)) %>%
mutate(`log-odds` = alpha + x * beta,
probability = exp(alpha + x * beta) / (1 + exp(alpha + x * beta)))
# now we make the individual plots
p1 <-
d %>%
ggplot(aes(x = x, y = `log-odds`)) +
geom_hline(data = lines,
aes(yintercept = `log-odds`),
color = ghibli_palette("YesterdayMedium")[6]) +
geom_line(linewidth = 1.5, color = ghibli_palette("YesterdayMedium")[3]) +
coord_cartesian(xlim = c(-1, 1)) +
theme(panel.background = element_rect(fill = ghibli_palette("YesterdayMedium")[5]),
panel.grid = element_blank())
p2 <-
d %>%
ggplot(aes(x = x, y = probability)) +
geom_hline(data = lines,
aes(yintercept = probability),
color = ghibli_palette("YesterdayMedium")[6]) +
geom_line(linewidth = 1.5, color = ghibli_palette("YesterdayMedium")[3]) +
coord_cartesian(xlim = c(-1, 1)) +
theme(panel.background = element_rect(fill = ghibli_palette("YesterdayMedium")[7]),
panel.grid = element_blank())
# finally, we're ready to mash the plots together and behold their nerdy glory
(p1 | p2) +
plot_annotation(subtitle = "The logit link transforms a linear model (left) into a probability (right).")
```
> The key lesson for now is just that no regression coefficient, such as $\beta$, from a GLM ever produces a constant change on the outcome scale. Recall that we defined interaction ([Chapter 8][Conditional Manatees]) as a situation in which the effect of a predictor depends upon the value of another predictor. Well now every predictor essentially interacts with itself, because the impact of a change in a predictor depends upon the value of the predictor before the change....
>
> The second very common link function is the **log link**. This link function maps a parameter that is defined over only positive real values onto a linear model. For example, suppose we want to model the standard deviation $\sigma$ of a Gaussian distribution so it is a function of a predictor variable $x$. The parameter $\sigma$ must be positive, because a standard deviation cannot be negative nor can it be zero. The model might look like:
>
> \begin{align*}
> y_i & \sim \operatorname{Normal}(\mu, \color{#0E84B4}{\sigma_i}) \\
> \color{#0E84B4}{\log (\sigma_i)} & = \alpha + \beta x_i
> \end{align*}
>
> In this model, the mean $\mu$ is constant, but the standard deviation scales with the value $x_i$. (p. 318, **emphasis** in the original)
This kind of model is trivial in the **brms** framework, which you can learn more about in Bürkner's [-@Bürkner2022Distributional] vignette, [*Estimating distributional models with brms*](https://cran.r-project.org/package=brms/vignettes/brms_distreg.html). Before moving on with the text, let's detour and see how we might fit such a model. First, we'll simulate some continuous data `y` for which the $\textit{SD}$ is affected by a dummy variable `x`.
```{r}
set.seed(10)
(
d <-
tibble(x = rep(0:1, each = 100)) %>%
mutate(y = rnorm(n = n(), mean = 100, sd = 10 + x * 10))
)
```
These data are based on IQ data. In psychology, general intelligence is often operationalized by and measured with standardized intelligence tests. The results from these tests are often summarized with an a single intelligence quotient (IQ) score, often called the full-scale IQ score. For many years now, the convention within among IQ test developers is to scale full-scale IQ scores so they have a population mean of 100 and a standard deviation of 15. One of the old and continuing controversies in the literature is whether men and women differ not in their means--they don't--but in their standard deviations [e.g., @johnsonSexDifferencesVariability2008]. To give a sense of how one might explore such a controversy, we simulated data where the `y` variables have a mean of 100 and standard deviations of either 10 or 20, depending on one's status on `x`. We can view what data like these look like with aid from `tidybayes::stat_halfeye()`.
```{r, fig.width = 3.5, fig.height = 3, message = F, warning = F}
library(tidybayes)
d %>%
mutate(x = x %>% as.character()) %>%
ggplot(aes(x = y, y = x, fill = x)) +
stat_halfeye(point_interval = mean_qi, .width = .68,
color = ghibli_palette("KikiMedium")[2]) +
scale_fill_manual(values = c(ghibli_palette("KikiMedium")[c(4, 6)])) +
coord_cartesian(ylim = c(1.5, 2)) +
theme(axis.ticks.y = element_blank(),
legend.position = "none",
panel.background = element_rect(fill = ghibli_palette("KikiMedium")[7]),
panel.grid = element_blank())
```
Even though the means of `y` are the same for both levels of the `x` dummy, the variance for `x == 1` is substantially larger than that for `x == 0`. Let's open **brms**.
```{r, warning = F, message = F}
library(brms)
```
For such a model, we have two formulas: one for $\mu$ and one for $\sigma$. We wrap both within the `bf()` function.
```{r b10.1}
b10.1 <-
brm(data = d,
family = gaussian,
bf(y ~ 1, sigma ~ 1 + x),
prior = c(prior(normal(100, 5), class = Intercept),
prior(normal(2.70805, 0.5), class = Intercept, dpar = sigma),
prior(normal(0, 0.5), class = b, dpar = sigma)),
seed = 10,
file = "fits/b10.01")
```
Do note our use of the `dpar` arguments in the `prior()` functions Here's the summary.
```{r}
print(b10.1)
```
Now we get an intercept for both $\mu$ and $\sigma$, with the intercept for sigma labeled as `sigma_Intercept`. And note the $\beta$ coefficient for $\sigma$ was named `sigma_x`. Also notice the scale the `sigma_i` coefficients are on. These are not in the original metric, but rather based on a logarithmic transformation of $\sigma$. You can confirm that by the second line of the `print()` output: `Links: mu = identity; sigma = log`. So if you want to get a sense of the effects of `x` on the $\sigma$ for `y`, you have to exponentiate the formula. Here we'll do so with the output from `as_draws_df()`.
```{r}
post <- as_draws_df(b10.1)
head(post)
```
With the samples in hand, we’ll use the model formula to compute the model-implied standard deviations of `y` based on the `x` dummy and then examine them in a plot.
```{r, fig.width = 3.75, fig.height = 2.5, warning = F}
post %>%
mutate(`x == 0` = exp(b_sigma_Intercept + b_sigma_x * 0),
`x == 1` = exp(b_sigma_Intercept + b_sigma_x * 1)) %>%
pivot_longer(contains("==")) %>%
ggplot(aes(x = value, y = name, fill = name)) +
stat_halfeye(point_interval = median_qi, .width = .95,
color = ghibli_palette("KikiMedium")[2]) +
scale_fill_manual(values = c(ghibli_palette("KikiMedium")[c(4, 6)])) +
labs(subtitle = "Model-implied standard deviations by group",
x = expression(sigma[x]),
y = NULL) +
coord_cartesian(ylim = c(1.5, 2)) +
theme(axis.ticks.y = element_blank(),
legend.position = "none",
panel.background = element_rect(fill = ghibli_palette("KikiMedium")[7]),
panel.grid = element_blank())
```
If we looked back at the data, those $\textit{SD}$ estimates are right about what we'd expect.
```{r, message = F}
d %>%
group_by(x) %>%
summarise(sd = sd(y) %>% round(digits = 1))
```
For more on models like this, check out Christakis's blog post, [*2014: What scientific idea is ready for retirement?*](https://www.edge.org/response-detail/25437), or his paper with Subramanian and Kim, [*The "average" treatment effect: A construct ripe for retirement. A commentary on Deaton and Cartwright*](http://humannaturelab.net/publications/the-average-treatment-effect-a-construct-ripe-for-retirement-a-commentary-on-deaton-and-cartwright), [@subramanianAverageTreatmentEffect2018a]. Kruschke covered modeling $\sigma$ a bit in his [-@kruschkeDoingBayesianData2015] [*Doing Bayesian data analysis, second edition: A tutorial with R, JAGS, and Stan*](https://sites.google.com/site/doingbayesiandataanalysis/), my [-@kurzDoingBayesianDataAnalysis2023] translation for which lives [here](https://bookdown.org/content/3686/). Finally, this is foreshadowing a bit because it requires the multilevel model (see Chapters [13][Models With Memory] and [14][Adventures in Covariance]), but you might also check out the [-@williams2020BayesianMulitivariate] paper by Williams, Martin, Liu, and Rast, [*Bayesian multivariate mixed-effects location scale modeling of longitudinal relations among affective traits, states, and physical activity*](https://europepmc.org/articles/pmc8580300/bin/nihms1744407-supplement-supplementary_material.pdf) or Williams's blog post, [*A defining feature of cognitive interference tasks: Heterogeneous within-person variance*](https://rstudio-pubs-static.s3.amazonaws.com/593838_838639e42b1643df9886a26cb922ab05.html).
But getting back to the text,
> What the log link effectively assumes is that the parameter's value is the exponentiation of the linear model. Solving $\log (\sigma_i) = \alpha + \beta x_i$ for $\sigma_i$ yields the inverse link:
>
> $$\sigma_i = \exp (\alpha + \beta x_i)$$
>
> The impact of this assumption can be seen in [our version of] Figure 10.8. (pp. 318--319)
```{r, fig.width = 6, fig.height = 2.5, message = F, warning = F}
# first, we'll make data that'll be make the horizontal lines
alpha <- 0
beta <- 2
lines <-
tibble(`log-measurement` = -3:3,
`original measurement` = exp(-3:3))
# now we're ready to make the primary data
d <-
tibble(x = seq(from = -1.5, to = 1.5, length.out = 50)) %>%
mutate(`log-measurement` = alpha + x * beta,
`original measurement` = exp(alpha + x * beta))
# now we make the individual plots
p1 <-
d %>%
ggplot(aes(x = x, y = `log-measurement`)) +
geom_hline(data = lines,
aes(yintercept = `log-measurement`),
color = ghibli_palette("YesterdayMedium")[6]) +
geom_line(linewidth = 1.5, color = ghibli_palette("YesterdayMedium")[3]) +
coord_cartesian(xlim = c(-1, 1)) +
theme(panel.background = element_rect(fill = ghibli_palette("YesterdayMedium")[5]),
panel.grid = element_blank())
p2 <-
d %>%
ggplot(aes(x = x, y = `original measurement`)) +
geom_hline(data = lines,
aes(yintercept = `original measurement`),
color = ghibli_palette("YesterdayMedium")[6]) +
geom_line(linewidth = 1.5, color = ghibli_palette("YesterdayMedium")[3]) +
scale_y_continuous(position = "right", limits = c(0, 10)) +
coord_cartesian(xlim = c(-1, 1)) +
theme(panel.background = element_rect(fill = ghibli_palette("YesterdayMedium")[7]),
panel.grid = element_blank())
# combine the ggplots
p1 | p2
```
> Using a log link for a linear model (left) implies an exponential scaling of the outcome with the predictor variable (right). Another way to think of this relationship is to remember that logarithms are *magnitudes*. An increase of one unit on the log scale means an increase of an order of magnitude on the untransformed scale. And this fact is reflected in the widening intervals between the horizontal lines in the right-hand plot of Figure 10.8. (p. 319, *emphasis* in the original)
#### Rethinking: When in doubt, play with assumptions.
> Link functions are assumptions. And like all assumptions, they are useful in different contexts. The conventional logit and log links are widely useful, but they can sometimes distort inference. If you ever have doubts, and want to reassure yourself that your conclusions are not sensitive to choice of link function, then you can use **sensitivity analysis**. A sensitivity analysis explores how changes in assumptions influence inference. (p. 319, **emphasis** in the original)
As an example, a common alternative to the logit link is the probit. Both are available with **brms**.
### Omitted variable bias again.
> Back in Chapters [5][The Many Variables & The Spurious Waffles] and [6][The Haunted DAG & The Causal Terror], you saw some examples of omitted variable bias, where leaving a causally important variable out of a model leads to biased inference. The same thing can of course happen in GLMs. But it can be worse in GLMs, because even a variable that isn't technically a confounder can bias inference, once we have a link function. The reason is that the ceiling and floor effects described above can distort estimates by suppressing the causal influence of a variable. (p. 320)
### Absolute and relative differences.
Within the context of GLMs with non-identity link functions,
> parameter estimates do not by themselves tell you the importance of a predictor on the outcome. The reason is that each parameter represents a *relative* difference on the scale of the linear model, ignoring other parameters, while we are really interested in *absolute* differences in outcomes that must incorporate all parameters. (p. 320, *emphasis* in the original)
This will make more sense after we start playing around with logistic regression, count regression, and so on. For now, just file it away.
## Session info {-}
```{r}
sessionInfo()
```
```{r, echo = F}
rm(d, alpha_per_beta, count_ways, p, sim_p, n_rep, p1, p2, ranked_d, subset_d, a, length_out, alpha, beta, lines, b10.1, post)
```
```{r, echo = F, message = F, warning = F, results = "hide"}
ggplot2::theme_set(ggplot2::theme_grey())
bayesplot::color_scheme_set("blue")
# pacman::p_unload(pacman::p_loaded(), character.only = TRUE)
```