-
Notifications
You must be signed in to change notification settings - Fork 13
/
Copy pathsequence_coverage.Rmd
205 lines (139 loc) · 7.8 KB
/
sequence_coverage.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
---
title: "Sequence coverage"
output:
html_document
---
Once data is received from a sequencing center, an initial question is whether the sequencing was successful?
A common attempt to address this question is by asking how much sequence coverage was attained.
For a diploid organism, one would require at least two reads per variant to infer a genotype.
Ideally, we would like a little more so that we have redundancy.
How much redundancy is necessary may be a question the researchers need to address themselves.
In this vignette we'll learn how to read in VCF data, extract sequence depth information from it and how to visualize this data.
## Reading VCF data
The topic of reading in VCF format data is covered in other vignettes.
We include it here for redundancy, and because every example involving VCF data needs to start with some data.
```{r}
library(vcfR)
vcf_file <- system.file("extdata", "pinf_sc50.vcf.gz", package = "pinfsc50")
vcf <- read.vcfR(vcf_file, verbose = FALSE)
```
Calling the object by name, with no options (the show method), provides a summary of the data.
```{r}
vcf
```
## Querying VCF information
Some, but not all, variant callers will provide per variant and per sample sequence depth information.
Some variant callers may provide this information only as an option (see relevant documentation).
We can use the 'head' method to explore what sort of information exists in our data.
```{r}
head(vcf)
```
This provides a summary of the information contained in the VCF data.
In the genotype section, the first column is titled 'FORMAT' and includes fields that describe the genotype data included with each genotype.
According to the VCF definition the first element must be the genotype.
All subsequent data is optional according to the VCF definition.
Each field is defined in the meta region.
We can use 'grep' to search the meta slot for the definition of these fields.
```{r}
#strwrap(grep('DP', vcf@meta, value = TRUE))
queryMETA(vcf, element = 'DP')
```
Here 'DP' is defined as a 'FORMAT' (the genotype region) as well as in the 'INFO' column.
We're interested in the former.
This tells us that data for depth (DP) exists, so we can extract the DP information from the genotype slot.
```{r}
dp <- extract.gt(vcf, element='DP', as.numeric=TRUE)
```
We can use this function to extract other fields by changing the value of the 'element' parameter.
Here, we've specified to convert the data to a numeric.
This makes sense because depth should be a number describing how many times a variant was sequenced.
Some fields, such as the genotype (GT), may not be easily converted to a numeric because they contain non-numeric characters.
For example, the genotype field will contain a pipe ('|') for phased data or a forward slash ('/') for unphased data.
If you select a field which is non-numeric, such as the genotype, and specify as.numeric to be TRUE, you will get a numeric matrix.
However, these numbers may not make much sense.
Make sure to only use as.numeric=TRUE on data that actually are numeric.
## Creating boxplots of sequence depth
Now that we have our data, we will want to visualize it.
A good choice for this type of data is a box and whisker plot.
```{r, fig.align='center', fig.width=7, fig.height=7}
par(mar=c(8,4,1,1))
#boxplot(dp, las=3, col=c("#C0C0C0", "#808080"), ylab="Depth", log='y', las=2)
boxplot(dp, las=3, col=c("#C0C0C0", "#808080"), ylab="Depth", las=2)
abline(h=seq(0,1e4, by=100), col="#C0C0C088")
par(mar=c(5,4,4,2))
```
This is a reasonable representation of the data.
The boxplots contain 50% of the data within each box (the first and third quartiles).
The majority of data beyond the first and third quartiles are contained in the whiskers.
Outlying data is represented by characters, here open circles.
A shortcoming of this plot is that there is a great amount of large values.
This compresses the plot so that the boxes are small and the plot is dominated by outliers.
## Creating violin plots of sequence depth
For quality control purposes, I'm frequently happy with using boxplots.
They are quick and easy to produce.
If I want something a little more fancy, say for publication, I may use violin plots.
This will require us to recast the data from a matrix to a data.frame, and then we can plot it with ggplot2.
```{r, fig.align='center', fig.width=7}
if( require(reshape2) & require(ggplot2) ){
dpf <- melt(dp, varnames=c('Index', 'Sample'), value.name = 'Depth', na.rm=TRUE)
dpf <- dpf[ dpf$Depth > 0,]
p <- ggplot(dpf, aes(x=Sample, y=Depth)) + geom_violin(fill="#C0C0C0", adjust=1.0,
scale = "count", trim=TRUE)
p <- p + theme_bw()
p <- p + theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 60, hjust = 1, size=12))
# p <- p + stat_summary(fun.data=mean_sdl, mult=1, geom="pointrange", color="black")
p <- p + scale_y_continuous(trans=scales::log2_trans(),
breaks=c(1, 10, 100, 800),
minor_breaks=c(1:10, 2:10*10, 2:8*100))
p <- p + theme(axis.title.y = element_text(size=12))
p <- p + theme( panel.grid.major.y=element_line(color = "#A9A9A9", size=0.6) )
p <- p + theme( panel.grid.minor.y=element_line(color = "#C0C0C0", size=0.2) )
p <- p + stat_summary(fun.y=median, geom="point", shape=23, size=2)
p
} else {
message("The packages reshape2 and ggplot2 are required for this example but do not appear
to be installed. Please use install.packages(c('reshape2', 'ggplot2', 'scales')) if you would
like to install them.")
}
```
The violin plot presents the distribution of the data throughout its range.
They also provide an easy way to apply a logarithmic transformation (here base 2) that effectively compresses large values and stretches out small values.
Here we observe a large density of per variant sequence depths that are greater than ten (for the most part).
This may be interpreted as having attained sufficient sequence depth for a diploid organism.
## Filtering on sequence depth.
In order to sequence a diploid genotype, we need at least two sequence reads.
Ideally, we would like some redundancy to build more confidence in our genotypes.
This means we may want a lower bound on coverage for whether we call a genotype.
You can also see from the plots that we have some samples that have exceptionally high sequence coverage (greater than 100).
These may be variants from repetitive portions of the genome, and therefore may be desireable to exclude.
Here we pull out summary statistics for each sample and use this information to filter our data.
The quantile function is used to build a 90% confidence interval for each sample.
We then use this information to set variants which are outside this region to missing data (NA).
Lastly, we apply a minimum threshold by setting variants with coverage less than four as missing.
```{r}
sums <- apply(dp, MARGIN=2, quantile, probs=c(0.05, 0.95), na.rm=TRUE)
dp2 <- sweep(dp, MARGIN=2, FUN = "-", sums[1,])
dp[dp2 < 0] <- NA
dp2 <- sweep(dp, MARGIN=2, FUN = "-", sums[2,])
dp[dp2 > 0] <- NA
dp[dp < 4] <- NA
```
Now that we've accomplished these manipulations, we can visualize the data using boxplots.
```{r, fig.align='center', fig.width=7, fig.height=7}
par(mar=c(8,4,1,1))
boxplot(dp, las=3, col=c("#C0C0C0", "#808080"), ylab="Depth")
abline(h=seq(0,200, by=20), col="#C0C0C088")
par(mar=c(5,4,4,2))
```
We can see that we've removed many of our outlier variants (circles beyond the whiskers).
And our data appear much better behaved.
We can now use this information to update our vcfR object.
```{r}
vcf
is.na( vcf@gt[,-1][ is.na(dp) ] ) <- TRUE
vcf
```
We can see that we've increased the amount of missing data in our vcfR object.
This validated the success of our manipulation.
However, we may want to use this information to further mitigate the data set.