```{r, echo = F, cache = F}
knitr::opts_chunk$set(fig.retina = 2.5)
knitr::opts_chunk$set(fig.align = "center")
options(width = 100)
```
# Describing Discrete-Time Event Occurrence Data
> In this chapter, [Singer and Willett presented] a framework for describing discrete-time event occurrence data.... As we will [see], the conceptual linchpin for all subsequent survival methods is to approach the analysis on a period-by-period basis. This allows you to examine event occurrence sequentially among those individuals eligible to experience the event at each discrete point in time. [@singerAppliedLongitudinalData2003, p. 325]
## The life table
> The fundamental tool for summarizing the sample distribution of event occurrence is the *life table*. As befits its name, a life table tracks the event histories (the "lives") of a sample of individuals from the beginning of time (when no one has yet experienced the target event) through the end of data collection. (p. 326, *emphasis* in the original)
To make a life table as presented in Table 10.1, we need to load the `teachers.csv` data.
```{r, warning = F, message = F}
library(tidyverse)
teachers <- read_csv("data/teachers.csv")
glimpse(teachers)
```
Perhaps the easiest way to make a life table as presented in Table 10.1 is with help from the [**survival** package](https://CRAN.R-project.org/package=survival) [@R-survival; @therneau2000ModelingSurvivalData].
```{r, warning = F, message = F}
# install.packages("survival", dependencies = T)
library(survival)
```
Here we'll use the `survfit()` function to compute survival curves. Within the `survfit()` function, we'll use the `Surv()` function to make a survival object, which will become the criterion within the model formula. It takes two basic arguments, `time` and `event`. With the `teachers` data, `t` is time in years. In the data, events are encoded in `censor`. However, it's important to understand how the `event` argument expects the data to be coded. From the [**survival** reference manual](https://CRAN.R-project.org/package=survival/survival.pdf) [@survival2021RM], we read that `event` is "the status indicator, normally `0=alive`, `1=dead`. Other choices are `TRUE/FALSE` (`TRUE = death`) or `1/2` (`2=death`)." Note that whereas within our data `censor` is coded `0 = event`, `1 = censored`, the `event` argument expects the opposite. A quick way to solve that is to enter `1 - censor`.
```{r fit10.1}
fit10.1 <- survfit(data = teachers,
Surv(t, 1 - censor) ~ 1)
```
Use the `str()` function to survey the results.
```{r}
fit10.1 %>% str()
```
We can retrieve the values for the "Year" column from `fit10.1$time`. The values in the "Time interval" column are a simple transformation from there.
```{r}
fit10.1$time
```
We can find the values in the "Employed at the beginning of the year" column in `fit10.1$n.risk` and those in the "Who left during the year" column in `fit10.1$n.event`.
```{r}
fit10.1$n.risk
fit10.1$n.event
```
We'll have to work a little harder to compute the values in the "Censored at the end of the year" column. Here we'll walk it through in a data frame format.
```{r}
data.frame(n_risk = fit10.1$n.risk,
n_event = fit10.1$n.event) %>%
mutate(n_risk_1 = lead(n_risk, default = 0)) %>%
mutate(n_censored = n_risk - n_event - n_risk_1)
```
That is, to get the number of those censored at the end of a given year, you take the number employed at the beginning of that year, subtract the number of those who left (i.e., the number who experienced the "event"), and then subtract the number of those employed at the beginning of the next year. Notice our use of the `dplyr::lead()` function to get the number employed in the next year (learn more about that function [here](https://dplyr.tidyverse.org/reference/lead-lag.html)).
To get the values in the "Teachers at the beginning of the year who left during the year" column, which is in a proportion metric, we use division.
```{r}
fit10.1$n.event / fit10.1$n.risk
```
Finally, to pull the values in the "All teachers still employed at the end of the year" column, we just execute `fit10.1$surv`.
```{r}
fit10.1$surv
```
Let's put that all together in a tibble.
```{r}
most_rows <-
tibble(year = fit10.1$time) %>%
mutate(time_int = str_c("[", year, ", ", year + 1, ")"),
n_employed = fit10.1$n.risk,
n_left = fit10.1$n.event) %>%
mutate(n_censored = n_employed - n_left - lead(n_employed, default = 0),
hazard_fun = n_left / n_employed,
survivor_fun = fit10.1$surv)
most_rows
```
The only thing missing from our version of Table 10.1 is a row for Year 0. Here's a quick and dirty way to manually insert those values.
```{r}
row_1 <-
tibble(year = 0,
time_int = "[0, 1)",
n_employed = fit10.1$n.risk[1],
n_left = NA,
n_censored = NA,
hazard_fun = NA,
survivor_fun = 1)
d <-
bind_rows(row_1,
most_rows)
d
```
We might walk out the notation in our `time_int` column a bit. Those intervals
> reflect a standard partition of time, in which each interval *includes* the initial time and *excludes* the concluding time. Adopting common mathematical notation, [brackets] denote inclusions and (parentheses) denote exclusions. Thus, we bracket each interval's initial time and place a parenthesis around its concluding time. (p. 328, *emphasis* in the original)
The values in the `n_employed` column constitute the *risk set*, those who are "*eligible* to experience the event during that interval" (p. 329, *emphasis* in the original).
## A framework for characterizing the distribution of discrete-time event occurrence data
> The fundamental quantity used to assess the risk of event occurrence in each discrete time period is known as *hazard.* Denoted by $h(t_{ij})$, discrete-time hazard is the *conditional probability that individual* $i$ *will experience the event in time period* $j$, *given that he or she did not experience it in any earlier time period*. Because hazard represents the risk of event occurrence in each discrete time period among those people eligible to experience the event (those in the risk set), hazard tells us precisely what we want to know: whether and when events occur. (p. 330, *emphasis* in the original)
If we let $T_i$ stand for the discrete value in time person $i$ experiences the event, we can express the conditional probability the event might occur in the $j^\text{th}$ interval as
$$h(t_{ij}) = \Pr[T_i = j \mid T \geq j].$$
That last part, $T \geq j$, clarifies the event can only occur once and, therefore, cannot have occurred in any of the prior levels of $j$. More plainly put, imagine the event is death and person $i$ died during period $j = 20$. In such a case, it's nonsensical to speak of that $i^\text{th}$ person's hazard for period $j = 25$. They're already dead.
Also, "the [set of] discrete-time hazard probabilities expressed as a function of time--labeled $h(t_{ij})$--is known as the population *discrete-time hazard function*" (p. 330, *emphasis* in the original). That was expressed in the 6^th^ column in Table 10.1, which we called `hazard_fun` in our `d` tibble.
```{r}
d %>%
select(year, hazard_fun)
```
You might notice $h(t_{ij})$ is in a proportion metric and it is not cumulative. If you look above in the code, you'll see we computed that by `hazard_fun = n_left / n_employed`. More formally and generally, this is an operationalization of
$$\hat h(t_{j}) = \frac{n \text{ events}_j}{n \text{ at risk}_j},$$
where $n \text{ events}_j$ is the number of individuals who experienced the event in the $j^\text{th}$ period and $n \text{ at risk}_j$ is the number within the period who have neither (a) already experienced the event nor (b) been censored. Also note that by $\hat h(t_{j})$, we're indicating we're talking about the maximum likelihood estimate for $h(t_{j})$. Because no one is at risk during the initial time point, $h(t_0)$ is undefined (i.e., `NA`). Here we mimic the top panel of Figure 10.1 and plot our $\hat h(t_{j})$ over time.
```{r, fig.width = 6, fig.height = 3, warning = F}
d %>%
ggplot(aes(x = year, y = hazard_fun)) +
geom_line() +
scale_x_continuous("years in teaching", breaks = 0:13, limits = c(0, 13)) +
scale_y_continuous(expression("estimated hazard probability, "*hat(italic(h))(italic(t[j]))),
breaks = c(0, .05, .1, .15), limits = c(0, .15)) +
theme(panel.grid = element_blank())
```
### Survivor function.
The survivor function provides another way of describing the distribution of event occurrence over time. Unlike the hazard function, which assesses the unique risk associated with each time period, the survivor function cumulates these period-by-period risks of event occurrence (or more properly, nonoccurrence) together to assess the probability that a randomly selected individual will survive (i.e., will not experience the event).
We can formally define the survivor function, $S(t_{ij})$, as
$$S(t_{ij}) = \Pr[T > j],$$
where $S$ is survival as a function of time, $t$. But since our data are finite, we can only have an estimate of the "true" survivor function, which we call $\hat S(t_{ij})$. Here it is in a plot, our version of the bottom panel of Figure 10.1.
```{r, fig.width = 6, fig.height = 3}
d %>%
ggplot(aes(x = year, y = survivor_fun)) +
geom_hline(yintercept = .5, color = "white", linetype = 2) +
geom_line() +
scale_x_continuous("years in teaching", breaks = 0:13, limits = c(0, 13)) +
scale_y_continuous(expression("estimated survival probability, "*hat(italic(S))(italic(t[j]))),
breaks = c(0, .5, 1), limits = c(0, 1)) +
theme(panel.grid = element_blank())
```
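As a quick check (this bit isn't in the text), the survivor function is just the running product of the complements of the hazards, so we can recover `survivor_fun` directly from `hazard_fun` in our `d` tibble. The `replace_na()` step handles the undefined Year-0 hazard.
```{r}
d %>%
  # treat the undefined Year-0 hazard as 0 so the running product starts at 1
  mutate(survivor_check = cumprod(1 - replace_na(hazard_fun, 0))) %>%
  select(year, hazard_fun, survivor_fun, survivor_check)
```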
### Median lifetime.
> Having characterized the *distribution* of event times using the hazard and survivor functions, we often want to identify the distribution's center. Were there no censoring, all event times would be known, and we could compute a sample mean. But because of censoring, another estimate of central tendency is preferred: the *median lifetime*.
>
> *The estimated median lifetime identifies that value for* $T$ *for which the value of the estimated survivor function is .5*. It is the point in time by which we estimate that half of the sample has experienced the target event, half has not. (p. 337, *emphasis* in the original)
If we use `filter()`, we'll see our median lifetime rests between years 6 and 7.
```{r}
d %>%
filter(year %in% c(6, 7)) %>%
# this just simplifies the output
select(year, time_int, survivor_fun)
```
Using a simple descriptive approach, we'd just say the median lifetime was between years 6 and 7. We could also follow @miller1981SurvivalAnalysis and linearly interpolate between the two values of $S(t_j)$ bracketing .5. If we let $m$ be the time interval just before the median lifetime, $\hat S(t_m)$ be the value of the survivor function in that $m^\text{th}$ interval, and $\hat S(t_{m + 1})$ be the survival value in the next interval, we can write
$$\text{Estimated median lifetime} = m + \Bigg [\frac{\hat S(t_m) - .5}{\hat S(t_m) - \hat S(t_{m + 1})} \Bigg ] \big ((m + 1) - m \big).$$
We can compute that by hand like so.
```{r}
m <- 6
m_plus_1 <- 7
stm <-
d %>%
filter(year == m) %>%
pull(survivor_fun)
stm_plus_1 <-
d %>%
filter(year == m_plus_1) %>%
pull(survivor_fun)
# compute the interpolated median lifetime and save it as `iml`
iml <- m + ((stm - .5) / (stm - stm_plus_1)) * ((m + 1) - m)
iml
```
Now that we have the `iml` value, we can add that information to our version of the lower panel of Figure 10.1.
```{r, fig.width = 6, fig.height = 3}
line <-
tibble(year = c(0, iml, iml),
survivor_fun = c(.5, .5, 0))
d %>%
ggplot(aes(x = year, y = survivor_fun)) +
geom_path(data = line,
color = "white", linetype = 2) +
geom_line() +
annotate(geom = "text",
x = iml, y = .55,
label = "All teachers (6.6 years)",
hjust = 0) +
scale_x_continuous("years in teaching", breaks = 0:13, limits = c(0, 13)) +
scale_y_continuous(expression("estimated survival probability, "*hat(italic(S))(italic(t[j]))),
breaks = c(0, .5, 1), limits = c(0, 1)) +
theme(panel.grid = element_blank())
```
We can compute the estimates for the 5- and 10-year survival rates as a direct algebraic transformation of the survival function from those years.
```{r}
d %>%
filter(year %in% c(5, 10)) %>%
select(year, survivor_fun) %>%
mutate(`survival rate (%)` = (100 * survivor_fun) %>% round(digits = 0))
```
## Developing intuition about hazard functions, survivor functions, and median lifetimes
> Developing intuition about these sample statistics requires exposure to estimates computed from a wide range of studies. To jump-start this process, we review results from four studies that differ across three salient dimensions--the type of event investigated, the metric used to record discrete time, and most important, the underlying profile of risk--and discuss how we would examine, and describe, the estimated hazard functions, survivor functions, and median lifetimes. (p. 339)
Here we load the four relevant data sets.
```{r, warning = F, message = F}
cocaine <- read_csv("data/cocaine_relapse.csv")
sex <- read_csv("data/firstsex.csv")
suicide <- read_csv("data/suicide.csv")
congress <- read_csv("data/congress.csv")
# glimpse(cocaine)
# glimpse(sex)
# glimpse(suicide)
# glimpse(congress)
```
We have a lot of leg work in front of us before we can recreate Figure 10.2. First, we'll feed each of the four data sets into the `survfit()` function.
```{r fit10.2}
fit10.2 <-
survfit(data = cocaine,
Surv(week, 1 - censor) ~ 1)
fit10.3 <-
survfit(data = sex,
Surv(time, 1 - censor) ~ 1)
fit10.4 <-
survfit(data = suicide,
Surv(time, 1 - censor) ~ 1)
fit10.5 <-
survfit(data = congress,
Surv(time, 1 - censor) ~ 1)
```
Given the four fits all follow the same basic form and given our end point is to make the same basic plots for each, we can substantially streamline our code by making a series of custom functions. For our first custom function, `make_lt()`, we'll save the general steps for making life tables for each data set.
```{r}
make_lt <- function(fit) {
# arrange the lt data for all rows but the first
most_rows <-
tibble(time = fit$time) %>%
mutate(time_int = str_c("[", time, ", ", time + 1, ")"),
n_risk = fit$n.risk,
n_event = fit$n.event) %>%
mutate(hazard_fun = n_event / n_risk,
survivor_fun = fit$surv)
# define the first observed time point (time_1) and the one just before it (time_0)
time_1 <- fit$time[1]
time_0 <- time_1 - 1
# define the values for the initial row (where time = time_0)
row_1 <-
tibble(time = time_0,
time_int = str_c("[", time_0, ", ", time_1, ")"),
n_risk = fit$n.risk[1],
n_event = NA,
hazard_fun = NA,
survivor_fun = 1)
# make the full life table
lt <-
bind_rows(row_1,
most_rows)
lt
}
```
Use `make_lt()` to make the four life tables.
```{r}
lt_cocaine <- make_lt(fit10.2)
lt_sex <- make_lt(fit10.3)
lt_suicide <- make_lt(fit10.4)
lt_congress <- make_lt(fit10.5)
```
You'll note that the four survival-curve plots in Figure 10.2 all show the median lifetime using the interpolation method. Here we'll save the necessary steps to compute that for each model as the `make_iml()` function.
```{r}
make_iml <- function(lt) {
# lt is a generic name for a life table of the
# kind we made with our `make_lt()` function
# determine the mth row
lt_m <-
lt %>%
filter(survivor_fun > .5) %>%
slice(n())
# determine the row for m + 1
lt_m1 <-
lt %>%
filter(survivor_fun < .5) %>%
slice(1)
# pull the value for m
m <- pull(lt_m, time)
# pull the two survival function values
stm <- pull(lt_m, survivor_fun)
stm1 <- pull(lt_m1, survivor_fun)
# plug the values into Equation 10.6 (page 338)
iml <- m + ((stm - .5) / (stm - stm1)) * ((m + 1) - m)
iml
}
```
If you want, you can use `make_iml()` directly like this.
```{r}
make_iml(lt_cocaine)
```
However, our approach will be to wrap it in another function, `line_tbl()`, with which we will compute the coordinates necessary for marking off the median lifetimes and save them in a tibble.
```{r}
line_tbl <- function(lt) {
iml <- make_iml(lt)
tibble(time = c(lt[1, 1] %>% pull(), iml, iml),
survivor_fun = c(.5, .5, 0))
}
```
It works like this.
```{r}
line_tbl(lt_cocaine)
```
If you look closely at the hazard function plots in the left column of Figure 10.2, you'll note they share many common settings (e.g., the basic shape, the label of the $y$-axis). But there are a few settings we'll need to customize for each panel. To my eye, those are:
* the data;
* the $x$-axis label, break points, and limits; and
* the $y$-axis break points and limits.
With our custom `h_plot()` function, we'll leave those parameters free while keeping all the other **ggplot2** parameters the same.
```{r}
h_plot <- function(data = data,
xlab = xlab, xbreaks = xbreaks, xlimits = xlimits,
ybreaks = ybreaks, ylimits = ylimits) {
ggplot(data = data,
mapping = aes(x = time, y = hazard_fun)) +
geom_line() +
scale_x_continuous(xlab, breaks = xbreaks, limits = xlimits) +
scale_y_continuous(expression(widehat(italic(h(t)))),
breaks = ybreaks, limits = ylimits) +
theme(panel.grid = element_blank())
}
```
Now we'll make a similar custom plotting function, `s_plot()`, for the survivor function plots in the right column of Figure 10.2.
```{r}
s_plot <- function(data = data, xlab = xlab, xbreaks = xbreaks, xlimits = xlimits) {
# compute the interpolated median life value
iml <- make_iml(data)
# make the iml line values
line <-
data %>%
line_tbl()
ggplot(data = data,
mapping = aes(x = time, y = survivor_fun)) +
geom_path(data = line,
color = "white", linetype = 2) +
geom_line() +
annotate(geom = "text",
x = iml, y = .6,
label = str_c("widehat(ML)==", iml %>% round(1)),
size = 3, hjust = 0, parse = T) +
scale_x_continuous(xlab, breaks = xbreaks, limits = xlimits) +
scale_y_continuous(expression(widehat(italic(S(t)))),
breaks = c(0, .5, 1), limits = c(0, 1)) +
theme(panel.grid = element_blank())
}
```
Now we make the eight subplots in bulk, naming them `p1`, `p2`, and so on.
```{r, fig.width = 6, fig.height = 3, warning = F}
# cocaine
p1 <-
lt_cocaine %>%
h_plot(xlab = "Weeks after release",
xbreaks = 0:12, xlimits = c(0, 12),
ybreaks = c(0, .05, .1, .15), ylimits = c(0, .15))
p2 <-
lt_cocaine %>%
s_plot(xlab = "Weeks after release",
xbreaks = 0:12, xlimits = c(0, 12))
# sex
p3 <-
lt_sex %>%
h_plot(xlab = "Grade",
xbreaks = 6:12, xlimits = c(6, 12),
ybreaks = 0:3 / 10, ylimits = c(0, .325))
p4 <-
lt_sex %>%
s_plot(xlab = "Grade",
xbreaks = 6:12, xlimits = c(6, 12))
# suicide
p5 <-
lt_suicide %>%
h_plot(xlab = "Age",
xbreaks = 1:9 * 2 + 3, xlimits = c(5, 22),
ybreaks = c(0, .05, .1, .15), ylimits = c(0, .16))
p6 <-
lt_suicide %>%
s_plot(xlab = "Age",
xbreaks = 1:9 * 2 + 3, xlimits = c(5, 22))
# congress
p7 <-
lt_congress %>%
h_plot(xlab = "Terms in office",
xbreaks = 0:8, xlimits = c(0, 8),
ybreaks = 0:3 / 10, ylimits = c(0, .3))
p8 <-
lt_congress %>%
s_plot(xlab = "Terms in office",
xbreaks = 0:8, xlimits = c(0, 8))
```
Now we'll use some functions and syntax from the **patchwork** package to combine the subplots and make Figure 10.2.
```{r, fig.width = 7, fig.height = 8, warning = F}
library(patchwork)
p12 <- (p1 + p2) + plot_annotation(title = "A") & theme(plot.margin = margin(0, 5.5, 0, 5.5))
p34 <- (p3 + p4) + plot_annotation(title = "B") & theme(plot.margin = margin(0, 5.5, 0, 5.5))
p56 <- (p5 + p6) + plot_annotation(title = "C") & theme(plot.margin = margin(0, 5.5, 0, 5.5))
p78 <- (p7 + p8) + plot_annotation(title = "D") & theme(plot.margin = margin(0, 5.5, 0, 5.5))
(wrap_elements(p12) /
wrap_elements(p34) /
wrap_elements(p56) /
wrap_elements(p78))
```
Boom! Looks like a dream.
### Identifying periods of high and low risk using hazard functions.
It can be useful to evaluate hazard functions based on whether they are *monotonic* (i.e., consistently rise or consistently fall over time) or *nonmonotonic* (i.e., have multiple distinctive peaks or troughs). Globally speaking, the hazard functions for rows `A` and `B` are monotonic and the remaining two are nonmonotonic.
Singer and Willett remarked "monotonically increasing hazard functions are common when studying events that are ultimately inevitable (or near universal).... [However,] nonmonotonic hazard functions, like those in Panels C and D, generally arise in studies of long duration" (p. 342).
However, when risk is constant over time, hazard functions will not have peaks or troughs.
### Survivor functions as a context for evaluating the magnitude of hazard.
Unlike hazard functions, all survivor functions decrease or stay constant over time. They are monotonic (i.e., they never switch direction; they never increase). From the text (p. 344), we learn three ways hazard functions relate to survivor functions:
* *When hazard is high, the survivor function drops rapidly*.
* *When hazard is low, the survivor function drops slowly*.
* *When hazard is zero, the survivor function remains unchanged*.
### Strengths and limitations of estimated median lifetimes.
> When examining a median lifetime, we find it helpful to remember three important limitations on its interpretation. First, it identifies only an "average" event time; it tells us little about the distribution of event times and is relatively insensitive to extreme values. Second, the median lifetime is not necessarily a moment when the target event is *especially* likely to occur.... Third, the median lifetime reveals little about the distribution of risk over time; identical median lifetimes can result from dramatically different survivor and hazard functions. (pp. 345--346, *emphasis* in the original)
Without access to Singer and Willett's hypothetical data, we're not in a good position to recreate their Figure 10.3. Even the [good folks at IDRE](https://stats.idre.ucla.edu/r/examples/alda/r-applied-longitudinal-data-analysis-ch-10/) gave up on that one.
## Quantifying the effects of sampling variation
We can quantify the uncertainty in the estimates with standard errors.
### The standard error of the estimated hazard probabilities.
The formula for the frequentist standard errors for the hazard probabilities follows the form
$$se \left (\hat h(t_j) \right) = \sqrt{\frac{\hat h(t_j) \left (1 - \hat h(t_j) \right)}{n \text{ at risk}_j}}.$$
We can express that equation in **R** code to recreate the first four columns of Table 10.2. We'll be pulling much of the information from `fit10.1`. But to show our work within a tibble format, we'll be adding a column after $n_j$. Our additional `n_event` column will contain the information pulled from `fit10.1$n.event`, which we'll use to compute $\hat h(t_j)$.
```{r}
se_h_hat <-
tibble(year = fit10.1$time,
n_j = fit10.1$n.risk,
n_event = fit10.1$n.event) %>%
mutate(h_hat = n_event / n_j) %>%
mutate(se_h_hat = sqrt((h_hat * (1 - h_hat)) / n_j))
se_h_hat
```
As in the text, our standard errors are pretty small. To get a better sense, here they are in a rug plot.
```{r, fig.width = 6, fig.height = 1}
se_h_hat %>%
ggplot(aes(x = se_h_hat)) +
geom_rug(length = unit(0.25, "in")) +
scale_x_continuous(expression(italic(se)(hat(italic(h))(italic(t[j])))), limits = c(.004, .007)) +
theme(text = element_text(family = "Times"),
panel.grid = element_blank())
```
Standard errors for discrete hazard probabilities share a property with those for other probabilities: they are less certain (i.e., larger) for probability values near .5 and increasingly certain (i.e., smaller) for probability values approaching 0 and 1. To give a sense of that, here are the corresponding $se \big (\hat h(t_j) \big)$ values for a series of $\hat h(t_j)$ values ranging from 0 to 1, with $n_j$ held constant at 1,000.
```{r, fig.width = 4, fig.height = 3}
tibble(n_j = 1000,
h_hat = seq(from = 0, to = 1, by = .01)) %>%
mutate(se_h_hat = sqrt((h_hat * (1 - h_hat)) / n_j)) %>%
ggplot(aes(x = h_hat, y = se_h_hat)) +
geom_point() +
labs(x = expression(hat(italic(h))(italic(t[j]))),
y = expression(italic(se)(hat(italic(h))(italic(t[j]))))) +
theme(text = element_text(family = "Times"),
panel.grid = element_blank())
```
Also, the size of the risk set, $n_j$, influences the standard errors in the typical way. All things equal, a larger $n$ will make for a smaller $se$. To give a sense, here's the same basic plot from above, but this time with $n_j = 100, 1{,}000, \text{ and } 10{,}000$.
```{r, fig.width = 8, fig.height = 2.75}
crossing(n_j = c(100, 1000, 10000),
h_hat = seq(from = 0, to = 1, by = .01)) %>%
mutate(se_h_hat = sqrt((h_hat * (1 - h_hat)) / n_j),
n_j = str_c("italic(n[j])==", n_j)) %>%
ggplot(aes(x = h_hat, y = se_h_hat)) +
geom_point() +
labs(x = expression(hat(italic(h))(italic(t[j]))),
y = expression(italic(se)(hat(italic(h))(italic(t[j]))))) +
theme(text = element_text(family = "Times"),
panel.grid = element_blank()) +
facet_wrap(~ n_j, nrow = 1, labeller = label_parsed)
```
### Standard error of the estimated survival probabilities.
Computing the frequentist standard errors for the estimated survival probabilities is more difficult because each survival probability is the product of $(1 - \text{hazard})$ terms for the current and all previous periods. Computing them directly is such a pain that Singer and Willett recommend you rely on Greenwood's [-@greenwood1926NaturalDuration] approximation. This follows the form
$$se \big (\hat S(t_j) \big) = \hat S(t_j) \sqrt{\frac{\hat h(t_1)}{n_1 \big (1 - \hat h(t_1) \big)} + \frac{\hat h(t_2)}{n_2 \big (1 - \hat h(t_2) \big)} + \cdots + \frac{\hat h(t_j)}{n_j \big (1 - \hat h(t_j) \big)}}.$$
Here we put the formula to work and finish our version of Table 10.2. For the sake of sanity, we're simply calling our "Term under the square root sign" column `term`. Note our use of the `cumsum()` function.
```{r}
# suspend scientific notation
options(scipen = 999)
tibble(year = fit10.1$time,
n_j = fit10.1$n.risk,
n_event = fit10.1$n.event) %>%
mutate(h_hat = n_event / n_j) %>%
mutate(se_h_hat = sqrt((h_hat * (1 - h_hat)) / n_j),
s_hat = fit10.1$surv,
term = cumsum(h_hat / (n_j * (1 - h_hat)))) %>%
mutate(se_s_hat = s_hat * sqrt(term),
std.err = fit10.1$std.err)
```
For comparison, we also added the $se \big (\hat S(t_j) \big)$ values computed by the **survival** package in the final column, `std.err`.
```{r, fig.width = 6, fig.height = 3}
tibble(year = fit10.1$time,
n_j = fit10.1$n.risk,
n_event = fit10.1$n.event) %>%
mutate(h_hat = n_event / n_j) %>%
mutate(se_h_hat = sqrt((h_hat * (1 - h_hat)) / n_j),
s_hat = fit10.1$surv,
term = cumsum(h_hat / (n_j * (1 - h_hat)))) %>%
mutate(Greenwood = s_hat * sqrt(term),
`survival package` = fit10.1$std.err) %>%
pivot_longer(Greenwood:`survival package`) %>%
mutate(name = factor(name,
levels = c("survival package", "Greenwood"))) %>%
ggplot(aes(x = year, y = value, color = name)) +
geom_point(size = 4) +
scale_color_viridis_d(NULL, option = "A", end = .55) +
scale_x_continuous(breaks = 1:12) +
scale_y_continuous(expression(italic(se)), limits = c(0, 0.025)) +
theme(legend.background = element_rect(fill = "transparent"),
legend.key = element_rect(color = "grey92"),
legend.position = c(.125, .9),
panel.grid = element_blank())
```
It's outside my expertise to comment on which we should trust more. But Singer and Willett did note that "as an approximation, Greenwood's formula is accurate only asymptotically" (p. 351).
```{r}
# turn scientific notation back on (R default)
options(scipen = 0)
```
## A simple and useful strategy for constructing the life table
> How can you construct a life table for *your* data set? For preliminary analyses, it is easy to use the prepackaged routines available in the major statistical packages. If you choose this approach, be sure to check whether your package allows you to: (1) select the partition of time; and (2) ignore any actuarial corrections invoked due to continuous-time assumptions (that do not hold in discrete time). When event times have been measured using a discrete-time scale, actuarial corrections (discussed in chapter 13) are inappropriate. Although most packages clearly document the algorithm being used, we suggest that you double-check by comparing results with one or two estimates computed by hand. (p. 351, *emphasis* in the original)
For more information about the methods we've been using via the **survival** package, browse through the documentation listed on the CRAN page, [https://CRAN.R-project.org/package=survival](https://CRAN.R-project.org/package=survival), with a particular emphasis on the [reference manual](https://CRAN.R-project.org/package=survival/survival.pdf) and [*A package for survival analysis in R*](https://CRAN.R-project.org/package=survival/vignettes/survival.pdf) [@therneau2021Package4Survival]. But back to the text:
> Despite the simplicity of preprogrammed algorithms, [Singer and Willett] prefer an alternative approach for life table construction. This approach requires construction of a person-period data set, much like the person-period data set used for growth modeling. Once you create the person-period data set, you can compute descriptive statistics using any standard cross-tabulation routine. (p. 351)
### The person-period data set.
Here is the person-level data set displayed in Figure 10.4; it's just a subset of the `teachers` data.
```{r}
teachers %>%
filter(id %in% c(20, 126, 129))
```
You can transform the person-level survival data set into the person-period variant shown on the right panel of Figure 10.4 with a workflow like this.
```{r}
teachers_pp <-
teachers %>%
uncount(weights = t) %>%
group_by(id) %>%
mutate(period = 1:n()) %>%
mutate(event = if_else(period == max(period) & censor == 0, 1, 0)) %>%
select(-censor) %>%
ungroup()
teachers_pp %>%
filter(id %in% c(20, 126, 129))
```
You don't necessarily need to use `ungroup()` at the end, but it's probably a good idea. Anyway, note how the information previously contained in the `censor` column has been transformed to the `event` column, which is coded 0 = no event, 1 = event. With this coding, we know a participant has been censored when `event == 0` on their `max(period)` row.
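As a quick sanity check (a sketch of my own, not from the text), we can recover each teacher's final status from the person-period data by isolating their last row: `event == 1` means they left, and `event == 0` means they were censored.
```{r}
teachers_pp %>%
  group_by(id) %>%
  # keep each teacher's final row
  filter(period == max(period)) %>%
  ungroup() %>%
  count(event)
```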
We can count the number of teachers in the sample like this.
```{r}
teachers %>%
distinct(id) %>%
count()
```
To get a sense of the difference in the data structures, here are the number of rows for the original person-level `teachers` data and for our person-period transformation.
```{r}
# person-level
teachers %>%
count()
# person-period
teachers_pp %>%
count()
```
Here is the breakdown of the number of rows in the person-period `teachers` data for which `event == 1` or `event == 0`.
```{r}
teachers_pp %>%
count(event)
```
### Using the person-period data set to construct the life table.
"All the life table's essential elements can be computed through cross-tabulation of *PERIOD* and *EVENT* in the person-period data set" (p. 354, *emphasis* in the original). Here's how we might use the **tidyverse** do that with `teachers_pp`.
```{r}
teachers_lt <-
teachers_pp %>%
# change the coding for `event` in anticipation of the final format
mutate(event = str_c("event = ", event)) %>%
group_by(period) %>%
count(event) %>%
ungroup() %>%
pivot_wider(names_from = event,
values_from = n) %>%
mutate(total = `event = 0` + `event = 1`) %>%
mutate(prop_e_1 = (`event = 1` / total) %>% round(digits = 4))
teachers_lt
```
Here are the totals Singer and Willett displayed in the bottom row.
```{r}
teachers_lt %>%
pivot_longer(`event = 0`:total) %>%
group_by(name) %>%
summarise(total = sum(value)) %>%
pivot_wider(names_from = name,
values_from = total)
```
> The ability to construct a life table using the person-period data set provides a simple strategy for conducting the descriptive analyses outlined in this chapter. This strategy yields appropriate statistics regardless of the amount, or pattern, of censoring. Perhaps even more important, the person-period data set is the fundamental tool for fitting discrete-time hazard models to data, using methods that we describe in the next chapter. (p. 356)
## Bonus: Fit the discrete-time hazard models with **brms**
The frequentists aren't the only ones who can fit discrete-time hazard models. Bayesians can get in on the fun, too. The first step is to decide on an appropriate likelihood function. Why? Because Bayes' formula requires that we define the likelihood and the priors. As all the parameters in the model are seen through the lens of the likelihood, it's important we consider it with care.
Happily, the exercises in the last section did a great job preparing us for the task. In Table 10.3, Singer and Willett hand-computed the discrete hazards (i.e., the values in the "Proportion *EVENT* = 1" column) by dividing the values in the "*EVENT* = 1" column by those in the "Total" column. Discrete hazards are proportions. Proportions have two important characteristics: they are continuous and they necessarily range between 0 and 1. You know what else has those characteristics? Probabilities.
So far in this text, we have primarily focused on models using the Gaussian likelihood. Though it's a workhorse, the Gaussian is inappropriate for modeling proportions/probabilities. Good old Gauss is great at modeling unbounded continuous data, but it can fail miserably when working with bounded data, and our proportions/probabilities are most certainly bounded. The binomial likelihood, however, is well suited for handling probabilities. Imagine you have data that can take on values of 0 and 1, such as failures/successes, no's/yeses, fails/passes, and no-events/events. If you sum up all the 1's and divide by the total number of cases, you get a proportion/probability. The simple binomial model takes just that kind of data--the number of 1's and the total number of cases. The formula for the binomial likelihood is
$$\Pr (z \mid n, p) = \frac{n!}{z!(n - z)!} p^z (1 - p)^{n - z},$$
where $z$ is the number of cases for which the value is 1, $n$ is the total number of cases, and $p$ is the probability of a 1 across cases. As the data provide us with $z$ and $n$, we end up estimating $p$. If we're willing to use what's called a *link function*, we can estimate $p$ using a linear model with any number of predictors. When fitting binomial regression models, you can take your choice among several link functions, the most popular of which is the logit. This will be our approach. As you may have guessed, using the logit link to fit a binomial model is often termed *logistic regression*. Welcome to the generalized linear model.
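Before we turn to **brms**, here's a minimal sketch of the logit link and its inverse as hand-rolled helper functions. The names `logit()` and `inv_logit()` are my own; below we'll lean on the analogous `brms::inv_logit_scaled()`.
```{r}
# the logit link maps probabilities in (0, 1) onto the whole real line;
# its inverse maps any real number back into (0, 1)
logit     <- function(p) log(p / (1 - p))
inv_logit <- function(x) 1 / (1 + exp(-x))

logit(c(.1, .5, .9))
inv_logit(logit(.117))  # a round trip returns the original probability
```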
Let's fire up **brms**.
```{r, warning = F, message = F}
library(brms)
```
Before we fit the model, it will make our lives easier if we redefine `period` as a factor and rename our `event = 1` column as `event`. We define `period` as a factor because we want to fit a model with discrete time. Renaming the `event = 1` column as `event` just makes it easier for the `brm()` function to read the variable.
```{r}
teachers_lt <-
teachers_lt %>%
mutate(period = factor(period),
event = `event = 1`)
```
With this formulation, `event` is our $z$ term and `total` is our $n$ term. We're estimating $p$. Behold the model syntax.
```{r fit10.6}
fit10.6 <-
brm(data = teachers_lt,
family = binomial,
event | trials(total) ~ 0 + period,
prior(normal(0, 4), class = b),
chains = 4, cores = 1, iter = 2000, warmup = 1000,
seed = 10,
file = "fits/fit10.06")
```
Check the summary.
```{r}
print(fit10.6)
```
Because we used the `0 + period` syntax with the factor predictor `period`, we suppressed the default intercept. Instead, we have separate "intercepts" for each of the 12 levels of `period`. Because we used the conventional logit link, the parameters are all on the log-odds scale. Happily, we can use the `brms::inv_logit_scaled()` function to convert them back into the probability metric. Here's a quick and dirty conversion using the output from `fixef()`.
```{r}
fixef(fit10.6) %>% inv_logit_scaled()
```
Compare these `Estimate` values with the values in the "Estimated hazard probability" column from Table 10.2 in the text (p. 349). They are very close. We can go further and look at these hazard estimates in a plot. We'll use the `tidybayes::stat_lineribbon()` function to plot their posterior means atop their 50% and 95% intervals.
```{r, fig.width = 6, fig.height = 3, warning = F, message = F}
library(tidybayes)
as_draws_df(fit10.6) %>%
select(starts_with("b_")) %>%
mutate_all(inv_logit_scaled) %>%
set_names(1:12) %>%
pivot_longer(everything(),
names_to = "period",
values_to = "hazard") %>%
mutate(period = period %>% as.double()) %>%
ggplot(aes(x = period, y = hazard)) +
stat_lineribbon(.width = c(.5, .95), size = 1/3) +
# add the hazard estimates from `survival::survfit()`
geom_point(data = tibble(period = fit10.1$time,
hazard = fit10.1$n.event / fit10.1$n.risk),
aes(y = hazard),
size = 2, color = "violetred1") +
scale_fill_manual("CI", values = c("grey75", "grey60")) +
scale_x_continuous(breaks = 1:12) +
theme(legend.background = element_rect(fill = "transparent"),
legend.key = element_rect(color = "grey92"),
legend.position = c(.925, .825),
panel.grid = element_blank())
```
For comparison's sake, those violet dots in the foreground are the hazard estimates from the frequentist `survival::survfit()` function. It turns out our Bayesian results are very similar to the frequentist ones. Hopefully this isn't a surprise. There was a lot of data and we used fairly weak priors. For simple models under those conditions, frequentist and Bayesian results should be pretty close.
Remember that $se \big (\hat h(t_j) \big)$ formula from back in section 10.4.1? That doesn't quite apply to the posterior standard deviations from our Bayesian model. Even so, our posterior $\textit{SD}$s will be very similar to the ML standard errors. Let's compare those in a plot, too.
```{r, fig.width = 6, fig.height = 3, warning = F}
as_draws_df(fit10.6) %>%
select(starts_with("b_")) %>%
mutate_all(inv_logit_scaled) %>%
set_names(1:12) %>%
pivot_longer(everything(),
names_to = "period",
values_to = "hazard") %>%
mutate(period = period %>% as.double()) %>%
group_by(period) %>%
summarise(sd = sd(hazard)) %>%
bind_cols(se_h_hat %>% select(se_h_hat)) %>%
pivot_longer(-period) %>%
mutate(name = factor(name,
levels = c("sd", "se_h_hat"),
labels = c("Bayesian", "ML"))) %>%
ggplot(aes(x = period, y = value, color = name)) +
geom_point(size = 3, position = position_dodge(width = .25)) +
scale_color_viridis_d(NULL, option = "A", end = .55) +
scale_x_continuous(breaks = 1:12) +
scale_y_continuous(expression(italic(se)), limits = c(0, 0.01)) +
theme(legend.background = element_rect(fill = "transparent"),
legend.key = element_rect(color = "grey92"),
legend.position = c(.09, .9),
panel.grid = element_blank())
```
Man those are close.
We can also use our **brms** output to depict the survivor function. On page 337 in the text (Equation 10.5), Singer and Willett demonstrated how to define the survivor function in terms of the hazard function. It follows the form
$$\hat S (t_j) = [1 - \hat h (t_j)][1 - \hat h (t_{j - 1})][1 - \hat h (t_{j - 2})] \dots [1 - \hat h (t_1)].$$
In words, "each year's estimated survival probability is the successive product of the complement of the estimated hazard function probabilities across *this* and *all previous* years" (p. 337, *emphasis* in the original). Here's how you might do that with the output from `as_draws_df()`.
```{r, warning = F}
draws <-
as_draws_df(fit10.6) %>%
select(starts_with("b_")) %>%
# transform the hazards from the log-odds metric to probabilities
mutate_all(inv_logit_scaled) %>%
set_names(str_c("h", 1:12)) %>%
# take the "complement" of each hazard
mutate_all(~1 - .) %>%
# apply Equation 10.5
transmute(s0 = 1,
s1 = h1,
s2 = h1 * h2,
s3 = h1 * h2 * h3,
s4 = h1 * h2 * h3 * h4,
s5 = h1 * h2 * h3 * h4 * h5,
s6 = h1 * h2 * h3 * h4 * h5 * h6,
s7 = h1 * h2 * h3 * h4 * h5 * h6 * h7,
s8 = h1 * h2 * h3 * h4 * h5 * h6 * h7 * h8,
s9 = h1 * h2 * h3 * h4 * h5 * h6 * h7 * h8 * h9,
s10 = h1 * h2 * h3 * h4 * h5 * h6 * h7 * h8 * h9 * h10,
s11 = h1 * h2 * h3 * h4 * h5 * h6 * h7 * h8 * h9 * h10 * h11,
s12 = h1 * h2 * h3 * h4 * h5 * h6 * h7 * h8 * h9 * h10 * h11 * h12)
glimpse(draws)
```
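Before moving on, here's a minimal sketch (the long-format reshaping strategy is my own, not from the text) of how those running products can be collapsed with the `cumprod()` function.
```{r}
as_draws_df(fit10.6) %>%
  select(starts_with("b_")) %>%
  mutate_all(inv_logit_scaled) %>%
  set_names(str_c("h", 1:12)) %>%
  # index the posterior draws so we can group by them after reshaping
  mutate(draw = 1:n()) %>%
  pivot_longer(-draw, names_to = "period", values_to = "hazard") %>%
  # within each draw, survival is the running product of (1 - hazard)
  group_by(draw) %>%
  mutate(survival = cumprod(1 - hazard)) %>%
  ungroup()
```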
We'll put the `cumprod()` function to more systematic use in the next chapter. Now that we have our survival estimates, we can make our Bayesian version of the lower panel of Figure 10.1.
```{r, fig.width = 6, fig.height = 3}
# this will help us depict the median lifetime
line <-
tibble(period = c(0, iml, iml),
survival = c(.5, .5, 0))
# wrangle
draws %>%
pivot_longer(everything(), values_to = "survival") %>%
mutate(period = str_remove(name, "s") %>% as.double()) %>%
# plot!
ggplot(aes(x = period, y = survival)) +
geom_path(data = line,
color = "white", linetype = 2) +
stat_lineribbon(.width = .95, size = 1/3, fill = "grey75") +
# add the survival estimates from `survival::survfit()`
geom_point(data = tibble(period = fit10.1$time,
survival = fit10.1$surv),
size = 2, color = "violetred1") +
scale_x_continuous("years in teaching", breaks = 0:13, limits = c(0, 13)) +
scale_y_continuous(expression("estimated survival probability, "*hat(italic(S))(italic(t[j]))),
breaks = c(0, .5, 1), limits = c(0, 1)) +
theme(panel.grid = element_blank())
```
Like we did with our hazard plot above, we superimposed the frequentist estimates as violet dots atop the Bayesian posterior means and intervals. Because of how narrow the posteriors were, we only showed the 95% intervals, here. As with the hazard estimates, our Bayesian survival estimates are very close to the ones we computed with ML.
## Session info {-}
```{r}
sessionInfo()
```
```{r, echo = F, message = F}
# here we'll remove our objects
rm(teachers, fit10.1, most_rows, row_1, d, m, m_plus_1, stm, stm_plus_1, iml, line, cocaine, sex, suicide, congress, fit10.2, fit10.3, fit10.4, fit10.5, make_lt, lt_cocaine, lt_sex, lt_suicide, lt_congress, make_iml, line_tbl, h_plot, s_plot, p1, p2, p3, p4, p5, p6, p7, p8, p12, p34, p56, p78, se_h_hat, teachers_pp, teachers_lt, fit10.6, draws)
pacman::p_unload(pacman::p_loaded(), character.only = TRUE)
```