-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathdispersion.Rmd
108 lines (87 loc) · 4 KB
/
dispersion.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
---
title: "Comparison of dispersion estimation methods"
output:
html_document: default
html_notebook: default
---
The notebook reproduces the comparison of dispersion fitting in the csDEX article (Stražar & Curk, 2017),
Supplementary Figure 4. For this experiment, csDEX and DEXSeq packages are required. A temporary directory is created in the current working directory.
```{r, message=FALSE, warning=FALSE}
require(csDEX)
require(DEXSeq)
data.dir = file.path(getwd(), "csdex-temp/")
```
The terminology csDEX package estimates precisions (inverse dispersions), to keep a consistent terminology for both count and Percent Spliced-in (PSI) models, based on negative binomial and Beta distributions, respectively. Since csDEX assumes `sum` as the default replicate aggregation function,
the estimated dispersion is multiplied by the number of replicates for correct comparison.
```{r}
dispersion.csdex <- function(obj, nreps=1){
# obj: a list returned by the csDEX::generate function
# nreps: number of replicates
cdx = obj$data
cdx = csDEX::estimateSizeFactors(cdx)
cdx = csDEX::estimatePrecisions(cdx)
disp = nreps / csDEX::rowData(cdx)$precision
return(disp)
}
```
Estimating dispersion with DEXSeq (Anders et. al, 2012) is also straightforward. The count files are loaded from disk and a standard workflow is used to first estimate the size factors (differences in sample means due to sequencing depth) and the using the Cox-Reid dispersion estimate.
```{r}
dispersion.dexseq <- function(obj){
# obj: a list returned by the csDEX::generate function
# nreps: number of replicates
sampleData = obj$design
countfiles = file.path(obj$data.dir, "data", paste0(sampleData$File.accession, ".txt"))
dex = DEXSeqDataSetFromHTSeq(
countfiles = countfiles,
sampleData = sampleData)
dex = DEXSeq::estimateSizeFactors(dex)
dex = DEXSeq::estimateDispersions(dex)
disp = dispersions(dex)
return(disp)
}
```
Generate a dataset with a fixed vector of `dispersions`. The values are the same for each of the `repeats`, enabling to compute mean predictions and standard deviations. The DEXSeq dispersion fitting can fail in the case of a large number of zero counts.
```{r, message=FALSE, warning=FALSE}
conditions = 10
exons = 20
repeats = 30
replicates = 3
genes = 3
dispersions = rev(sort(rgamma(exons*genes, 1, 2.5)))
est.cdx = matrix(0, nrow=repeats, ncol=exons*genes)
est.dex = matrix(0, nrow=repeats, ncol=exons*genes)
for (r in 1:repeats){
obj = generate(exons=exons, conditions=conditions,
interacting=0, replicates=replicates, genes=genes,
dispersions = dispersions,
type="count", data.dir=data.dir)
est.cdx[r,] = dispersion.csdex(obj, replicates)
est.dex[r,] = tryCatch(dispersion.dexseq(obj),
error=function(e) rep(NA, exons))
}
```
Compute means and stadard deviations. Finally, compare the fits on a plot and compute root mean square errors. Observe the error decreasing with a larger number of experimental conditions.
```{r}
x = 1:(exons*genes)
mean.cdx = colMeans(est.cdx)
std.cdx = apply(est.cdx, 2, sd)
down.cdx = mean.cdx + std.cdx
up.cdx = mean.cdx - std.cdx
rmse.cdx = sqrt(sum((mean.cdx - dispersions)^2))
est.dex[est.dex > 3] = NA
mean.dex = colMeans(est.dex, na.rm=TRUE)
std.dex = apply(est.dex, 2, sd, na.rm=TRUE)
down.dex = mean.dex + std.dex
up.dex = mean.dex - std.dex
rmse.dex = sqrt(sum((mean.dex - dispersions)^2))
plot(dispersions, type="l", col="black", xlab="Exon", ylab="Dispersion",
main = sprintf("Conditions: %d", conditions))
op = par(cex = 0.8)
lines(mean.cdx, col="orange")
lines(mean.dex, col="blue")
polygon(c(x, rev(x)), c(up.cdx, rev(down.cdx)), col=rgb(1,0,0,0.2), border=NA)
polygon(c(x, rev(x)), c(up.dex, rev(down.dex)), col=rgb(0,0,1,0.2), border=NA)
legend("topright", inset=.01,
c(sprintf("qCML (%.2f)", rmse.cdx), sprintf("Cox-Reid (%.2f)", rmse.dex)),
fill=c("orange", "blue"), bty="n")
```