# Lymphoma example
A walkthrough of the steps taken to analyse some lymphoma data.
The broad aim was to look at relapse and survival for different types of lymphoma in relation to the stage at initial diagnosis.
We started with two data extractions, from Caboodle and Clarity respectively:

1. Cancer staging form data.
2. Oncology History (a series of events).
## cancer staging data
The cancer staging form data also contains fields from the Patient and Problem List tables, joined in by the original SQL extraction.
We wanted to get to one row per patient with the stage at initial diagnosis.
The data were read in from an Excel file:
```{r, eval=FALSE}
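# packages used in the code chunks of this chapter
library(readxl)   # read_excel()
library(dplyr)    # %>%, group_by(), mutate(), filter(), arrange()
library(tidyr)    # pivot_wider()
library(stringr)  # str_replace_all()
dfstaging <- read_excel(filename_staging) #, col_types=c("text"))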
names(dfstaging)
# [1] "DurableKey" "PrimaryMrn" "Problem List Diagnosis" "DiagnosisKey"
# [5] "StageDateKey" "Classification" "StageGroup" "StageDescription"
# [9] "DateKey" "SmartDataElementEpicId" "AttributeType" "StringValue"
# [13] "NumericValue" "DateValue"
```
There were multiple forms per patient and each has a date in `StageDateKey`. We added an index for the form number per patient so that we could filter to just the first form later.
```{r, eval=FALSE}
dfstaging <- dfstaging %>%
  group_by(DurableKey) %>%
  # dense_rank assigns 1,2,3 etc to each form date starting low
  mutate(patient_form_num = dense_rank(StageDateKey)) %>%
  ungroup() %>%
  arrange(DurableKey, StageDateKey)
```
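As a small illustration of what `dense_rank()` does within each patient, here is a sketch on made-up data. The values in `toy_staging` are hypothetical, and writing the date keys as yyyymmdd numbers is an assumption about the format rather than something confirmed by the extraction.

```r
library(dplyr)

# hypothetical toy data: patient A has two forms, the first with two attribute rows
toy_staging <- tibble(
  DurableKey   = c("A", "A", "A", "B"),
  StageDateKey = c(20200105, 20200105, 20210310, 20190620)
)

toy_ranked <- toy_staging %>%
  group_by(DurableKey) %>%
  # rows sharing a form date get the same index, with no gaps between indices
  mutate(patient_form_num = dense_rank(StageDateKey)) %>%
  ungroup()

toy_ranked$patient_form_num
```

Because ties share a rank, all rows belonging to a patient's earliest form get `patient_form_num == 1`, which is what the later filter relies on.
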
The staging data has variable names in `AttributeType` and values in either `StringValue` (almost entirely), `NumericValue` or `DateValue`.
The variables can be moved into their own columns using `pivot_wider()`, with `StageDateKey` as an extra identifier.
```{r, eval=FALSE}
# even when using StageDateKey there are a few multiple records
# due to 1 partially completed form
# & "symptoms at diagnosis (B symptoms)" that can have multiple records per form
# this code, modified from the pivot_wider warning, removes the duplicates
# removes 33 of 5000 rows
dfstaging2 <- dfstaging %>%
  dplyr::group_by(DurableKey, StageDateKey, AttributeType) %>%
  # count rows per form and attribute, then ungroup before filtering
  dplyr::mutate(n = dplyr::n()) %>%
  dplyr::ungroup() %>%
  dplyr::filter(n == 1L)

dfstagewide <- pivot_wider(dfstaging2,
                           names_from = AttributeType,
                           values_from = StringValue,
                           id_cols = c(PrimaryMrn, DurableKey, StageDateKey, patient_form_num))

# filter the first form for each patient
dfstagewide_form1 <- dfstagewide %>%
  filter(patient_form_num == 1)

# replace spaces with _ in variable names to make them easier to deal with
names(dfstagewide_form1) <- names(dfstagewide_form1) %>%
  str_replace_all("\\s", "_")

# now the data can be filtered by lymphoma_type
table(dfstagewide_form1$lymphoma_type)
# Diffuse large B-cell lymphoma           Follicular lymphoma
#                           141                           118
#              Hodgkin lymphoma          Mantle cell lymphoma
#                            76                            32
#        Marginal zone lymphoma    Peripheral T-cell lymphoma
#                            42                             7
#    Small lymphocytic leukemia                       Unknown
#                             1                            17
```
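The de-duplication and widening steps can be tried on a toy long-format table. This is only a sketch: the attribute names and values in `toy_long` are invented for illustration and are not taken from the real staging form.

```r
library(dplyr)
library(tidyr)

# hypothetical long-format rows; patient A has "b symptoms" recorded twice on one form
toy_long <- tibble(
  DurableKey    = c("A", "A", "A", "B", "B"),
  StageDateKey  = c(20200105, 20200105, 20200105, 20190620, 20190620),
  AttributeType = c("lymphoma type", "b symptoms", "b symptoms",
                    "lymphoma type", "b symptoms"),
  StringValue   = c("Hodgkin lymphoma", "Fever", "Weight loss",
                    "Follicular lymphoma", "None")
)

toy_wide <- toy_long %>%
  # count how often each attribute appears on each form
  group_by(DurableKey, StageDateKey, AttributeType) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  # keep only attributes recorded exactly once per form
  filter(n == 1L) %>%
  pivot_wider(id_cols = c(DurableKey, StageDateKey),
              names_from = AttributeType,
              values_from = StringValue)

toy_wide
```

Note that filtering on `n == 1` drops every copy of a repeated attribute rather than keeping one of them, so patient A ends up with `NA` for "b symptoms". That is acceptable when the repeated attribute is not needed downstream, but worth bearing in mind if it is.
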
## oncology history data
Oncology history has a series of events classified by Event_Category.
```{r eval=FALSE}
library(readr)  # read_csv()
filename_oh <- "data-raw//Lymphoma_Oncology_History_Full_Extract_2022-03-30.csv"
dfoncology <- read_csv(filename_oh)
names(dfoncology)
# [1] "PAT_MRN_ID" "Event_Category" "PRB_EVENT_STDATE_DT"
# [4] "PRB_EVENT_ENDATE_DT" "NOTE_CSN_ID" "NOTE_TEXT"
# [7] "PRB_EVENT_INDEX" "PROBLEM_EVENT_AUTO_UPDATE_YN"
unique(dfoncology$Event_Category)
# [1] "Initial Diagnosis" "Cancer Staged" "Chemotherapy"
# [4] "No evidence of disease" "Progression" "Relapse"
# [7] "Other" "Adverse Reaction" "Death"
# [10] "Radiotherapy" "Supportive Treatment" "Surgery"
# [13] "Research Study Participant" "Bone Marrow Transplant Event" "End of Therapy"
# [16] "Re-Staged" "Palliative Care" "Biopsy"
# [19] "Imaging" "Multi Disciplinary Meeting" "Immunotherapy"
# [22] "Targeted Therapy" "Previous External Chemotherapy"
```
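The later chunks use a data frame called `dfoncology_dlbcl` with a date column `oh_date_start`, but the code that creates them is not shown in this chapter. The sketch below is one possible way to derive them, not necessarily what was actually done. It assumes that the start dates arrive as "yyyy-mm-dd" text, that `PAT_MRN_ID` matches `PrimaryMrn` in the staging data, and that the subset is the patients staged as diffuse large B-cell lymphoma. All the toy data are hypothetical.

```r
library(dplyr)

# hypothetical stand-ins for the oncology history and the first staging forms
toy_oncology <- tibble(
  PAT_MRN_ID          = c("001", "001", "002"),
  Event_Category      = c("Initial Diagnosis", "Relapse", "Initial Diagnosis"),
  PRB_EVENT_STDATE_DT = c("2020-01-05", "2021-03-10", "2019-06-20")
)
toy_stage_form1 <- tibble(
  PrimaryMrn    = c("001", "002"),
  lymphoma_type = c("Diffuse large B-cell lymphoma", "Follicular lymphoma")
)

# MRNs of patients whose first staging form records DLBCL
mrn_dlbcl <- toy_stage_form1 %>%
  filter(lymphoma_type == "Diffuse large B-cell lymphoma") %>%
  pull(PrimaryMrn)

toy_oncology_dlbcl <- toy_oncology %>%
  # assumed date format; adjust to match the real extraction
  mutate(oh_date_start = as.Date(PRB_EVENT_STDATE_DT, format = "%Y-%m-%d")) %>%
  # keep only events for the DLBCL patients
  filter(PAT_MRN_ID %in% mrn_dlbcl)
```
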
We can arrange the events in chronological order by patient to look at the data, and count the events of the same type per patient to see that there can be repeats.
```{r eval=FALSE}
# dates are in oh_date_start, converted from "PRB_EVENT_STDATE_DT" in the oncology history
# first filter & order & look at the data
dfevent_order <- dfoncology_dlbcl %>%
  #filter(Event_Category=="No evidence of disease" | Event_Category=="Initial Diagnosis") %>%
  arrange(PAT_MRN_ID, oh_date_start)

# check whether there are multiple events
# yes, can be up to 3 relapse, progression or no evidence
dfcheck_events <- dfoncology_dlbcl %>%
  group_by(PAT_MRN_ID) %>%
  summarise(no_evidence = sum(Event_Category == "No evidence of disease"),
            initial_diagnosis = sum(Event_Category == "Initial Diagnosis"),
            relapse = sum(Event_Category == "Relapse"),
            progression = sum(Event_Category == "Progression"))
```
### Calculating the time to relapse from remission
The function below keeps only the events that indicate remission, relapse, progression or death, orders them for each patient and calculates the time from each event to the next one.
Then, for each remission event, it checks whether a relapse, progression or death event comes immediately after it. The outcome is coded in a way that can later be used in a Kaplan-Meier survival plot using the survminer package.
```{r eval=FALSE}
# this is time from no_evidence to either relapse, progression or death
# function requires oncology history dataframe with converted date columns
time_to_relapse <- function(df1)
{
  # last date in the extraction, taken before any filtering
  last_date <- max(df1$oh_date_start, na.rm = TRUE)

  df1 %>%
    # keep only the event types that indicate remission, relapse, progression or death
    filter(Event_Category %in% c("No evidence of disease", "Progression", "Relapse", "Death")) %>%
    # arrange events in date order
    arrange(PAT_MRN_ID, oh_date_start) %>%
    group_by(PAT_MRN_ID) %>%
    # calculate days to next event; NAs (no later event) are replaced lower down
    mutate(days_to_next = as.numeric(difftime(lead(oh_date_start), oh_date_start, units = "days")),
           next_cat = lead(Event_Category)) %>%
    ungroup() %>%
    # keep just the records starting with CMR (survival will have NA in next_cat)
    filter(Event_Category == "No evidence of disease") %>%
    # status for KM plots: 1 = no event recorded (censored), 2 = relapse, progression or death
    mutate(survival = if_else(next_cat %in% c("Progression", "Relapse", "Death"), 2, 1)) %>%
    # censored records have NA in days_to_next; use the time from remission to the last date in the extraction
    # later replace this with the last clinic appointment for patients who haven't had relapse
    mutate(days_to_next = if_else(is.na(days_to_next),
                                  as.numeric(difftime(last_date, oh_date_start, units = "days")),
                                  days_to_next))
}

# call function above
dftime_to_relapse <- time_to_relapse(dfoncology_dlbcl)
```
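To see how the `lead()` step pairs each remission with the event that follows it, here is a minimal sketch on a hypothetical two-patient history. It reproduces only the core of the logic above, and the MRNs and dates are made up.

```r
library(dplyr)

# hypothetical events: patient 001 relapses after remission, patient 002 has no later event
toy_events <- tibble(
  PAT_MRN_ID     = c("001", "001", "002"),
  Event_Category = c("No evidence of disease", "Relapse", "No evidence of disease"),
  oh_date_start  = as.Date(c("2020-01-01", "2020-07-01", "2021-01-01"))
)

toy_next <- toy_events %>%
  arrange(PAT_MRN_ID, oh_date_start) %>%
  group_by(PAT_MRN_ID) %>%
  # lead() looks one row ahead within the same patient
  mutate(next_cat     = lead(Event_Category),
         days_to_next = as.numeric(lead(oh_date_start) - oh_date_start)) %>%
  ungroup() %>%
  # keep one row per remission event
  filter(Event_Category == "No evidence of disease")

toy_next
```

Patient 001 gets `next_cat == "Relapse"` with 182 days to the next event, while patient 002 has `NA` in both columns because nothing follows the remission. Those `NA` rows are the ones treated as censored.
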
These data can be joined onto the widened staging data to look at survival by lymphoma type.
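
A minimal sketch of that join followed by a Kaplan-Meier fit is given below. It assumes that `PAT_MRN_ID` in the oncology history corresponds to `PrimaryMrn` in the staging data, which would need checking against the real extractions. The toy values are hypothetical, and the fit uses `survfit()` from the survival package.

```r
library(dplyr)
library(survival)

# hypothetical output of the time-to-relapse step
toy_relapse <- tibble(
  PAT_MRN_ID   = c("001", "002", "003", "004"),
  days_to_next = c(182, 400, 90, 365),
  survival     = c(2, 1, 2, 1)
)
# hypothetical first staging form per patient
toy_stage <- tibble(
  PrimaryMrn    = c("001", "002", "003", "004"),
  lymphoma_type = c("Hodgkin lymphoma", "Hodgkin lymphoma",
                    "Follicular lymphoma", "Follicular lymphoma")
)

# assumed key mapping between the two extractions
toy_joined <- toy_relapse %>%
  left_join(toy_stage, by = c("PAT_MRN_ID" = "PrimaryMrn"))

# with status coded 1/2, Surv() treats 2 as the event and 1 as censored
km_fit <- survfit(Surv(days_to_next, survival) ~ lymphoma_type, data = toy_joined)
km_fit
```

The fitted object could then be passed to `survminer::ggsurvplot(km_fit, data = toy_joined)` to draw one curve per lymphoma type.
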