-
Notifications
You must be signed in to change notification settings - Fork 13
/
Copy pathomitting_data.Rmd
132 lines (86 loc) · 4.21 KB
/
omitting_data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
---
title: "Omitting data"
output:
html_document:
toc: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(fig.align = 'center')
knitr::opts_chunk$set(fig.width = 12)
knitr::opts_chunk$set(fig.height = 12)
```
In the section on depth we learned how we can visualize variant depth, or any other numeric value provided in the gt portion of VCF data.
In the section on censoring data we learned how to rescore variants that were outside our acceptance thershold as missing.
And in the section on missing data we learned how to quantify and visualize missingness in our dataset.
Here we put all of these skills together in order to omit samples and variants that have been set as missing (NA).
```{r, results='hide', echo = TRUE}
library(vcfR)
vcf <- read.vcfR('TASSEL_GBS0077.vcf.gz')
dp <- extract.gt(vcf, element = "DP", as.numeric=TRUE)
```
```{r}
vcf
```
Because part of this exercise involves setting cells in our data matrix as NA we should begin by reminding ourselves of how abundant they are.
By using the show method we see that we have over 35 percent missing data.
We can now use what we learned previously to set variants that are outside our per sample inclusion threshold as NA.
```{r}
quants <- apply(dp, MARGIN=2, quantile, probs=c(0.1, 0.8), na.rm=TRUE)
dp2 <- sweep(dp, MARGIN=2, FUN = "-", quants[1,])
dp[dp2 < 0] <- NA
dp2 <- sweep(dp, MARGIN=2, FUN = "-", quants[2,])
dp[dp2 > 0] <- NA
dp[dp < 4] <- NA
vcf@gt[,-1][ is.na(dp) == TRUE ] <- NA
```
```{r}
vcf
```
We see that this censoring has increased the degree of missingness in our matrix to over 60 percent.
Ideally we should visualize the results of this action.
For brevity, we will not here.
But you can return to the section on depth and reuse the code presented there to visualize how this change has affected the distribution of the data.
```{r}
heatmap.bp(dp[1:1000,], rlabels = FALSE)
```
In this heatmap we see that we have samples in columns and variants in rows.
The color ramp on the right represents the depth of coverage for each variant where yellow is high coverage and purple is low coverage.
White is not part of our color ramp but it is a part of the heatmap.
This is because white represents cells in our matrix that having missing data (NA).
The marginal barplots sum the information accross columns and rows to tell us which samples (columns) or variants (rows) include a high or low amount of sequence coverage.
In general we would like to see our sequence coverage uniformly distributed among our samples and variants.
## Omitting samples
We can see that some samples have a high degree of missingness.
By omiting these samples we may reduce the overall missingness in the data set.
```{r}
myMiss <- apply(dp, MARGIN = 2, function(x){ sum( is.na(x) ) } )
myMiss <- myMiss / nrow(dp)
vcf@gt <- vcf@gt[, c(TRUE, myMiss < 0.7)]
vcf
```
```{r}
dp <- extract.gt(vcf, element = "DP", as.numeric=TRUE)
heatmap.bp(dp[1:1000,], rlabels = FALSE)
```
## Omitting variants
Previously we have seen how to quantify and visualize missingness for variants in our dataset.
We can use this information to omit variants that have a high degree of missingness.
```{r}
myMiss <- apply(dp, MARGIN = 1, function(x){ sum( is.na(x) ) } )
myMiss <- myMiss / ncol(dp)
vcf <- vcf[myMiss < 0.2, ]
vcf
```
```{r}
dp <- extract.gt(vcf, element = "DP", as.numeric=TRUE)
heatmap.bp(dp[1:1000,], rlabels = FALSE)
```
## Summary.
Through omitting samples and variants with a high degree of missingness we have taken a dataset that was over 35 percent missing data to a dataset that is now below ten percent missing data.
We've also reduced tha sample size from 61 samples to 41.
And we've reduced the number of variants from over 60 thousand to just below 20 thousand.
How important any particular sample or variant is will have to be determined base on the specifics of any particular project.
These actions have greatly improved the ratio of data to missing or low quality data in our dataset.
Through exploring thresholds that are different from those implemented here one may be able to improve on this more.
We can now proceed to downstream analyses of this dataset with greater confidence that the variants we are analyzing are of high quality.