-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathDepeters_RumenSampling2018_C.Rmd
3619 lines (2952 loc) · 203 KB
/
Depeters_RumenSampling2018_C.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Depeters Study 2018"
author: "Jill Hagey"
date: "Started: 10/5/18, Completed: 9/XX/2020"
output:
html_document:
theme: spacelab
toc: true
toc_depth: 2
toc_float: true
df_print: paged
highlight: espresso
---
#Research questions
The primary goal of this study was a comparison of grab sample, stomach tubing and feces to understand how different sampling methods will effect the microbial communities found in samples. The current "gold standard" for surveying the rumen microbiome is with a grab sample from the rument that contains both liquid and solid particles. On a commercial dairy, fecal sampling is easy to do. Stomach tube could be done with a little more time. If a fecal sample is not representative of the stomach tube, then there is no sense doing the fecal sampling as a monitor for rumen conditions. In reality, if the stomach tube and the fecal sample do not reflect the grab sample (gold standard) then neither would be used to monitor rumen microbial health (populations).
###We seek to answer the following questions:
* How are sample types different?
+ Alpha diversity (richness and evenness)
+ Beta diversity
+ Differentially abundant and differentially variable ASVs.
* What ASVs are shared between samples of the same type.
A secondary question is the decomposition of the grab sample (liquid strained & solid). The grab sample was separated into liquid strained and solid particulate by pressing the grab sample through cheese cloth to get liquid strained and solid particulate. We will have a closer look at what communites are in what parts of the grab sample.
###Other Questions of interest:
* Have a look at the feed microbiome from the two different kits (TMR_plant_kit and TMR_fecal_kit) to see if there is a relationship between the feed and sample type.
* Typically, the liquid unstrained is what we would collect from a rumen fistulated cow and then transfaunate using a stomach tube into a sick cow that is experiencing simple indigestion. How does the microbial population of the liquid unstrained compared with the grab sample, liquid strained, and solid? That is to say, when we transfaunate what mircrobial populations are we transfering.
* How constant is the rumen population over time in the same animal?
+ **This can't be tested as we only have one sample per day and thus can estimate variations on a day**
* There is one Jersey in the study is her microbiome different from the holstiens?
+ **We can't really answer this as we only have an n=1**
```{r setup, echo=FALSE, include=FALSE, warning=FALSE}
#Setting working directory. Pick One
#Use this one for lab computer
#setwd("C:/Users/Jill/OneDrive - UC Davis/Documents/collaboration/Depeters/DADA2_Out/Demultiplex_Redo/")
#Just a bit of house keeping to set the working directory
#knitr::opts_knit$set(root.dir = "C:/Users/Jill/OneDrive - UC Davis/Documents/collaboration/Depeters/DADA2_Out/Demultiplex_Redo/")
knitr::opts_chunk$set(echo = FALSE, warning=FALSE)
#Use thigs one for my own computer
setwd("C:/Users/Jill/OneDrive - UC Davis/Documents/collaboration/Depeters/DADA2_Out/Demultiplex_Redo/")
#Just a bit of house keeping to set the working directory
knitr::opts_knit$set(root.dir = "C:/Users/Jill/OneDrive - UC Davis/Documents/collaboration/Depeters/DADA2_Out/Demultiplex_Redo/")
#calling in custom alpha diversity plotting script
#read_chunk('C:/Users/Jill/OneDrive - UC Davis/Documents/collaboration/Depeters/DADA2_Out/plot_alpha_estimates_custom.R')
#Use thigs one for my own computer
#setwd("C:/Users/Jill/OneDrive - UC Davis/Documents/collaboration/Depeters/DADA2_Out/Demultiplex_Redo/")
#Just a bit of house keeping to set the working directory
#knitr::opts_knit$set(root.dir = "C:/Users/Jill/OneDrive - UC Davis/Documents/collaboration/Depeters/DADA2_Out/Demultiplex_Redo/")
getwd()
```
```{r loading packages, include=FALSE, error=FALSE, warning=FALSE}
#load the packages
library(dada2); packageVersion("dada2")
library(phyloseq); packageVersion("phyloseq")
library(breakaway); packageVersion("breakaway")
library(DivNet); packageVersion("DivNet")
library(corncob); packageVersion("corncob")
library(structSSI); packageVersion("structSSI")
library(ggplot2); packageVersion("ggplot2")
library(reshape2); packageVersion("reshape2")
library(plotly); packageVersion("plotly")
library(dplyr); packageVersion("dplyr")
library(tibble); packageVersion("tibble")
library(doSNOW); packageVersion("doSNOW")
library(knitr); packageVersion("knitr")
library(tidyr); packageVersion("tidyr")
library(kableExtra); packageVersion("kableExtra")
library(Biostrings); packageVersion("Biostrings")
library(ggrepel); packageVersion("ggrepel")
library(stringr); packageVersion("stringr")
library(magrittr); packageVersion("magrittr")
library(cowplot); packageVersion("cowplot")
library(xlsx); packageVersion("xlsx")
library(RColorBrewer); packageVersion("RColorBrewer")
#library(stargazer); packageVersion("stargazer")
```
Note that prior to running DADA2 sequences were cleaned with kneaddata and then demuliplexed and primers trimmed with cuteadapt. Code for this is available at my [GitHub Page](https://github.com/Jill/Depeters_RumenSampling_2018/blob/master/Clean_Up)
<!--- Calling in custom functions first --->
<!--- function to remove taxa from phyloseq object --->
```{r include=FALSE,eval=TRUE}
#running some functions I'll need for later.
#This function will return a phyloseq object with the taxa we want to keep
pop_taxa_keep = function(physeq, goodTaxa){
allTaxa = taxa_names(physeq)
myTaxa <- allTaxa[(allTaxa %in% goodTaxa)]
return(prune_taxa(myTaxa, physeq))
}
```
<!-- custom plotting of alpha diversity output --->
```{r include=FALSE,eval=TRUE}
plot_alpha_estimates_custom <- function(x, physeq = NULL, measure = NULL, facet.y=NULL, facet.x=NULL, shrink=NULL,
color = NULL, shape = NULL, title = NULL, trim_plot = FALSE, ...) {
if (!is.null(shrink)) { #checks to make sure x isn't a breakaway object
name_check <- x %>% lapply(function(x) x$name) %>% unlist %>% unique
if(name_check=="breakaway") {
stop("You can't shrink a plot with breakaway estimates")
}
}
if (is.null(measure)) {
all_measures <- x %>% lapply(function(x) x$name) %>% unlist %>% unique
measure <- all_measures[1]
}
df <- summary(x, physeq)
if (all(is.na(df$estimate))) {
stop("There are no estimates in this alpha_estimates object!")
}
if (!is.null(facet.x)) { #new
if (facet.x %in% phyloseq::sample_variables(physeq)) {
df[["facet.x"]] <- phyloseq::get_variable(physeq, facet.x)
} else if (length(facet.x) == nrow(df)) {
df[["facet.x"]] <- facet.x
} else {
stop("facet must either match a variable or be a custom vector of correct length!")
}
}
if (!is.null(facet.y)) { #new
if (facet.y %in% phyloseq::sample_variables(physeq)) {
df[["facet.y"]] <- phyloseq::get_variable(physeq, facet.y)
} else if (length(facet.y) == nrow(df)) {
df[["facet.y"]] <- facet.y
} else {
stop("facet.y must either match a variable or be a custom vector of correct length!")
}
}
if (!is.null(color)) {
if (color %in% phyloseq::sample_variables(physeq)) {
df[["color"]] <- phyloseq::get_variable(physeq, color)
} else if (length(color) == nrow(df)) {
df[["color"]] <- color
} else {
stop("color must either match a variable or be a custom vector of correct length!")
}
}
if (!is.null(shape)) {
if (shape %in% phyloseq::sample_variables(physeq)) {
df[["shape"]] <- phyloseq::get_variable(physeq, shape)
} else if (length(shape) == nrow(df)) {
df[["shape"]] <- shape
} else {
stop("shape must either match a variable or be a custom vector of correct length!")
}
}
yname1 <- measure
yname2 <- x[[1]]$estimand
if (is.null(physeq) & !is.null(rownames(df))) {
df$sample_names <- rownames(df)
}
if (!is.null(shrink)){ #new
warning(paste("Warning you should not shrink your graph unless you have used a covariate in the model.\nAdditionally, only shrink on the covariate"))
ps.data <- as.data.frame(sample_data(physeq)) #pull sample data from physeq object into dataframe
ps.data$sample_names <- sample_names(physeq) #make a column of sample names
df[,shrink] <- ps.data[,shrink][match(ps.data$sample_names, df$sample_names)] #add metadata column to divnet df
df <- df[!duplicated(df$estimate),] #remove duplicates
}
if (is.null(shape) & is.null(color)) {
my_gg <- ggplot2::ggplot(df)
} else if (is.null(shape) & !is.null(color)) {
aes_map <- ggplot2::aes_string(color = "color")
my_gg <- ggplot2::ggplot(df, aes_map)
} else if (!is.null(shape) & is.null(color)) {
aes_map <- ggplot2::aes_string(shape = "shape")
my_gg <- ggplot2::ggplot(df, aes_map)
} else if (!is.null(shape) & !is.null(color)) {
aes_map <- ggplot2::aes_string(color = "color", shape = "shape")
my_gg <- ggplot2::ggplot(df, aes_map)
}
if (!is.null(shrink)){ #new
my_gg <- my_gg +
ggplot2::geom_point(ggplot2::aes_string(x = shrink, y = "estimate"))+
ggplot2::xlab("")+
ggplot2::ylab(paste(yname1, "estimate of", yname2)) +
ggplot2::labs(title = title, color=color) +
ggplot2::theme_bw() +
ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, hjust = 1))
if (!(all(is.na(df$lower)) || all(is.na(df$upper)))) {
my_gg <- my_gg +
ggplot2::geom_segment(ggplot2::aes_string(x = shrink, xend = shrink, y = "lower", yend = "upper"))
}
} else if (is.null(shrink)) { #new
my_gg <- my_gg +
ggplot2::geom_point(ggplot2::aes_string(x = "sample_names", y = "estimate")) +
ggplot2::ylab(paste(yname1, "estimate of", yname2)) +
ggplot2::xlab("") +
ggplot2::labs(title = title, color=color) + #added color so it will have appropriate legend title
ggplot2::theme_bw() +
ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, hjust = 1))
}
if (!is.null(facet.y)) { #new
my_gg <- my_gg +
ggplot2::facet_grid( facet.y ~ ., scales="free", space="free_x")
}
if (!is.null(facet.x)) { #new
#This name changing is only for this project.
sam_names <- c(`Grab Sample` = "Grab Sample",`Feces` = "Feces",`Stomach Tube` = "Stomach Tube",`Solid` = "Solid", `Liquid Strained` = "Liquid\nStrained",`Liquid Unstrained` = "Liquid\nUnstrained")
my_gg <- my_gg +
ggplot2::facet_grid( . ~ facet.x, scales="free", space="free_x", labeller = as_labeller(sam_names))
}
if (is.null(shrink)) {
if (!(all(is.na(df$lower)) || all(is.na(df$upper)))) {
my_gg <- my_gg +
ggplot2::geom_segment(ggplot2::aes_string(x = "sample_names", xend = "sample_names", y = "lower", yend = "upper"))
}
}
if (!trim_plot) {
fiven <- stats::fivenum(df$upper, na.rm = TRUE)
iqr <- diff(fiven[c(2, 4)])
if (!is.na(iqr)) {
out <- df$upper < (fiven[2L] - 1.5 * iqr) | df$upper > (fiven[4L] + 1.5 * iqr)
ylower <- min(0, 0.95*min(df$upper[!out]), na.rm = TRUE)
yupper <- 1.05*max(df$upper[!out], na.rm = TRUE)
my_gg <- my_gg +
ggplot2::coord_cartesian(ylim = c(ylower,yupper))
}
}
my_gg
}
```
<!--- custom plotting of corncob output --->
```{r include=FALSE,eval=TRUE}
plot.differentialTest_custom <- function(x, level = NULL, cutoff=NULL, taxa_filter=NULL, ...) {
signif_taxa <- x$significant_taxa
if ("phyloseq" %in% class(x$data)) {
if (!is.null(x$data@tax_table)) {
signif_taxa <- otu_to_taxonomy(signif_taxa, x$data, level = level)
if (length(unique(signif_taxa)) != length(unique(x$significant_taxa))) {
# Make sure if repeated taxa add unique otu identifiers
signif_taxa <- paste0(signif_taxa, " (", x$significant_taxa, ")")
}
}
}
if (length(x$significant_models) != 0) {
var_per_mod <- length(x$restrictions_DA) + length(x$restrictions_DV)
total_var_count <- length(signif_taxa) * var_per_mod
df <- as.data.frame(matrix(NA, nrow = total_var_count, ncol = 5))
colnames(df) <- c("x", "xmin", "xmax", "taxa", "variable")
qval <- stats::qnorm(.975)
restricts_mu <- attr(x$restrictions_DA, "index")
restricts_phi <- attr(x$restrictions_DV, "index")
count <- 1
for (i in 1:length(x$significant_models)) {
# Below from print_summary_bbdml, just to get coefficient names
tmp <- x$significant_models[[i]]
coefs.mu <- tmp$coefficients[1:tmp$np.mu,, drop = FALSE]
rownames(coefs.mu) <- paste0(substring(rownames(coefs.mu), 4), "\nDifferential\nAbundance")
coefs.mu <- coefs.mu[restricts_mu,, drop = FALSE]
coefs.phi <- tmp$coefficients[(tmp$np.mu + 1):nrow(tmp$coefficients),, drop = FALSE]
rownames(coefs.phi) <- paste0(substring(rownames(coefs.phi), 5), "\nDifferential Variability")
coefs.phi <- coefs.phi[restricts_phi - tmp$np.mu,, drop = FALSE]
coefs <- rbind(coefs.mu, coefs.phi)
for (j in 1:var_per_mod) {
df[count, 1:3] <- c(coefs[j, 1], coefs[j, 1] - qval * coefs[j, 2],
coefs[j, 1] + qval * coefs[j, 2])
df[count, 4:5] <- c(signif_taxa[i], rownames(coefs)[j])
count <- count + 1
}
}
df$Phylum <- str_extract(df$taxa, ".+?(?<=_)")
df$Phylum <- gsub("_", "", df$Phylum)
df$variable <- gsub("Sample_Type", "", df$variable)
if (!is.null(taxa_filter)) {
df_filtered <- df %>% filter(Phylum == taxa_filter)
df_filtered$taxa <- gsub(paste(taxa_filter,"_", sep=""), "", df_filtered$taxa)
#need to check if all taxa have all the sample types with them?
#global variables warning suppression
taxa <- xmin <- xmax <- NULL
ggplot2::ggplot(df_filtered, ggplot2::aes(x = x, y = taxa)) +
ggplot2::geom_vline(xintercept = 0, color = "gray50", lty = "dashed", alpha = 0.75, lwd = 1) +
ggplot2::geom_point() +
ggplot2::geom_errorbarh(ggplot2::aes(xmin = xmin, xmax = xmax), height = .3) +
ggplot2::theme_bw() +
ggplot2::facet_wrap(~variable, scales = "free_x", nrow = 1) +
ggplot2::labs(title = "", x = "", y = "Taxa") +
ggplot2::scale_y_discrete(limits = rev(sort(unique(df_filtered$taxa)))) +
ggplot2::scale_x_continuous(breaks = scales::pretty_breaks(n = 5)) +
ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, hjust = 1))
#ggplot2::geom_tile(aes(fill=Phylum))
}
else if(is.null(taxa_filter)) {
# global variables warning suppression
taxa <- xmin <- xmax <- NULL
ggplot2::ggplot(df, ggplot2::aes(x = x, y = taxa)) +
ggplot2::geom_vline(xintercept = 0, color = "gray50", lty = "dashed", alpha = 0.75, lwd = 1) +
ggplot2::geom_point() +
ggplot2::geom_errorbarh(ggplot2::aes(xmin = xmin, xmax = xmax), height = .3) +
ggplot2::theme_bw() +
ggplot2::facet_wrap(~variable, scales = "free_x", nrow = 1) +
ggplot2::labs(title = "", x = "", y = "Taxa") +
ggplot2::scale_y_discrete(limits = rev(df$taxa)) +
ggplot2::scale_x_continuous(breaks = scales::pretty_breaks(n = 5)) +
ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, hjust = 1))
#ggplot2::geom_tile(aes(x=,fill=Phylum))
}
}
}
```
```{r include=FALSE,eval=TRUE}
plot.differentialTest_custom_color <- function(x, level = NULL, cutoff=NULL, taxa_filter=NULL, color=NULL, ...) {
signif_taxa <- x$significant_taxa
if ("phyloseq" %in% class(x$data)) {
if (!is.null(x$data@tax_table)) {
signif_taxa <- otu_to_taxonomy(signif_taxa, x$data, level = level)
if (length(unique(signif_taxa)) != length(unique(x$significant_taxa))) {
# Make sure if repeated taxa add unique otu identifiers
signif_taxa <- paste0(signif_taxa, " (", x$significant_taxa, ")")
}
}
}
if (length(x$significant_models) != 0) {
var_per_mod <- length(x$restrictions_DA) + length(x$restrictions_DV)
total_var_count <- length(signif_taxa) * var_per_mod
df <- as.data.frame(matrix(NA, nrow = total_var_count, ncol = 5))
colnames(df) <- c("x", "xmin", "xmax", "taxa", "variable")
qval <- stats::qnorm(.975)
restricts_mu <- attr(x$restrictions_DA, "index")
restricts_phi <- attr(x$restrictions_DV, "index")
count <- 1
for (i in 1:length(x$significant_models)) {
# Below from print_summary_bbdml, just to get coefficient names
tmp <- x$significant_models[[i]]
coefs.mu <- tmp$coefficients[1:tmp$np.mu,, drop = FALSE]
rownames(coefs.mu) <- paste0(substring(rownames(coefs.mu), 4), "\nDifferential\nAbundance")
coefs.mu <- coefs.mu[restricts_mu,, drop = FALSE]
coefs.phi <- tmp$coefficients[(tmp$np.mu + 1):nrow(tmp$coefficients),, drop = FALSE]
rownames(coefs.phi) <- paste0(substring(rownames(coefs.phi), 5), "\nDifferential Variability")
coefs.phi <- coefs.phi[restricts_phi - tmp$np.mu,, drop = FALSE]
coefs <- rbind(coefs.mu, coefs.phi)
for (j in 1:var_per_mod) {
df[count, 1:3] <- c(coefs[j, 1], coefs[j, 1] - qval * coefs[j, 2],
coefs[j, 1] + qval * coefs[j, 2])
df[count, 4:5] <- c(signif_taxa[i], rownames(coefs)[j])
count <- count + 1
}
}
df$Phylum <- str_extract(df$taxa, ".+?(?<=_)")
df$Phylum <- gsub("_", "", df$Phylum)
df$variable <- gsub("Sample_Type", "", df$variable)
if (is.null(color)) {
df$taxa <- str_replace(df$taxa, paste0(df$Phylum, "_", sep=""),"")
}
if (!is.null(taxa_filter)) {
df_filtered <- df %>% filter(Phylum == taxa_filter)
df_filtered$taxa <- gsub(paste(taxa_filter,"_", sep=""), "", df_filtered$taxa)
print(head(df_filtered))
#need to check if all taxa have all the sample types with them?
#global variables warning suppression
taxa <- xmin <- xmax <- NULL
ggplot2::ggplot(df_filtered, ggplot2::aes(x = x, y = taxa, color=color)) +
ggplot2::geom_vline(xintercept = 0, color = "gray50", lty = "dashed", alpha = 0.75, lwd = 1) +
ggplot2::geom_point() +
ggplot2::geom_errorbarh(ggplot2::aes(xmin = xmin, xmax = xmax), height = .3) +
ggplot2::theme_bw() +
ggplot2::facet_wrap(~variable, scales = "free_x", nrow = 1) +
ggplot2::labs(title = "", x = "", y = "Taxa") +
ggplot2::scale_y_discrete(limits = rev(sort(unique(df_filtered$taxa)))) +
ggplot2::scale_x_continuous(breaks = scales::pretty_breaks(n = 5)) +
ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, hjust = 1))
#ggplot2::geom_tile(aes(fill=Phylum))
}
else if(is.null(taxa_filter)) {
print(head(df))
# global variables warning suppression
taxa <- xmin <- xmax <- NULL
ggplot2::ggplot(df, ggplot2::aes(x = x, y = taxa, color=Phylum)) +
ggplot2::geom_vline(xintercept = 0, color = "gray50", lty = "dashed", alpha = 0.75, lwd = 1) +
ggplot2::geom_point() +
ggplot2::geom_errorbarh(ggplot2::aes(xmin = xmin, xmax = xmax), height = .3) +
ggplot2::theme_bw() +
ggplot2::facet_wrap(~variable, scales = "free_x", nrow = 1) +
ggplot2::labs(title = "", x = "", y = "Taxa") +
ggplot2::scale_y_discrete(limits = rev(df$taxa)) +
ggplot2::scale_x_continuous(breaks = scales::pretty_breaks(n = 5)) +
ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, hjust = 1))
#ggplot2::geom_tile(aes(x=,fill=Phylum))
}
}
}
```
<!--- getting df of corncob output --->
```{r include=FALSE,eval=TRUE}
get_data_CC <- function(x, taxa_filter=NULL, ...) {
signif_taxa <- x$significant_taxa
if ("phyloseq" %in% class(x$data)) {
if (!is.null(x$data@tax_table)) {
signif_taxa <- otu_to_taxonomy(signif_taxa, x$data)
signif_taxa <- paste0(signif_taxa, " (", x$significant_taxa, ")")
}
}
if (length(x$significant_models) != 0) {
var_per_mod <- length(x$restrictions_DA) + length(x$restrictions_DV)
total_var_count <- length(signif_taxa) * var_per_mod
df <- as.data.frame(matrix(NA, nrow = total_var_count, ncol = 5))
colnames(df) <- c("x", "xmin", "xmax", "taxa", "variable")
qval <- stats::qnorm(.975)
restricts_mu <- attr(x$restrictions_DA, "index")
restricts_phi <- attr(x$restrictions_DV, "index")
count <- 1
for (i in 1:length(x$significant_models)) {
# Below from print_summary_bbdml, just to get coefficient names
tmp <- x$significant_models[[i]]
coefs.mu <- tmp$coefficients[1:tmp$np.mu,, drop = FALSE]
rownames(coefs.mu) <- paste0(substring(rownames(coefs.mu), 4), "\nDifferential Abundance")
coefs.mu <- coefs.mu[restricts_mu,, drop = FALSE]
coefs.phi <- tmp$coefficients[(tmp$np.mu + 1):nrow(tmp$coefficients),, drop = FALSE]
rownames(coefs.phi) <- paste0(substring(rownames(coefs.phi), 5), "\nDifferential Variability")
coefs.phi <- coefs.phi[restricts_phi - tmp$np.mu,, drop = FALSE]
coefs <- rbind(coefs.mu, coefs.phi)
for (j in 1:var_per_mod) {
df[count, 1:3] <- c(coefs[j, 1], coefs[j, 1] - qval * coefs[j, 2],
coefs[j, 1] + qval * coefs[j, 2])
df[count, 4:5] <- c(signif_taxa[i], rownames(coefs)[j])
count <- count + 1
}
}
#df$Phylum <- str_extract(df$taxa, "_.+?(?<=_)")
#df$Phylum <- gsub("_", "", df$Phylum)
df$variable <- gsub("Sample_Type", "", df$variable)
df$ASV <- gsub(".*\\((.*)\\).*", "\\1", df$taxa)
#print(head(df))
#df$Family <- gsub("(?<=_)(.*?)(?=_)", "\\4", df$taxa)
#df$Family <- gsub("_", "", df$Family)
#print(head(df))
df$Phylum <- otu_to_taxonomy(df$ASV, x$data, level = "Phylum")
df$Family <- otu_to_taxonomy(df$ASV, x$data, level = "Family")
df$Genus <- otu_to_taxonomy(df$ASV, x$data, level = "Genus")
return(df)
}
}
```
<!--- getting model out of corncob data --->
```{r}
get_model_CC <- function(x, ASV, ...) {
models <- x$significant_models
models[[grep(ASV, x$significant_taxa)]]
}
```
<!--- function to compare two phyloseq objects --->
```{r}
compare_phyloseq_taxa = function(physeq1, physeq2, taxa_level){
long <- identical(get_taxa_unique(physeq1, taxa_level), get_taxa_unique(physeq2,taxa_level))
if (long == TRUE){
print("There are no taxa differences at this level")
}
if (long == FALSE) {
print("These taxa are found in both phyloseq objects")
print(get_taxa_unique(rumen_A,taxa_level)[get_taxa_unique(rumen_A,taxa_level) %in% get_taxa_unique(feces_A,taxa_level)])
print("These taxa are different between the phyloseq objects")
print(get_taxa_unique(rumen_A,taxa_level)[!(get_taxa_unique(rumen_A,taxa_level) %in% get_taxa_unique(feces_A,taxa_level))])
}
}
```
<!--- Get abundace and SEM --->
```{r}
#only works on phylum
ps_ave_abu_phy = function(physeq){
#calculating error bars to graph mean transformed abundance of major phyla
melted <- psmelt(physeq)
grouped <- dplyr::group_by(melted[!is.na(melted$Phylum),], Sample_Type, Phylum)
phyla_five <- as.data.frame(dplyr::summarise(grouped, mean=mean(Abundance), sd=sd(Abundance), sem = (sd(Abundance)/sqrt(length(Abundance)))))
}
#only works on family
ps_ave_abu_fam = function(physeq){
#calculating error bars to graph mean transformed abundance of major phyla
melted <- psmelt(physeq)
grouped <- dplyr::group_by(melted[!is.na(melted$Family),], Sample_Type, Family)
fam_five <- as.data.frame(dplyr::summarise(grouped, mean=mean(Abundance), sd=sd(Abundance), sem = (sd(Abundance)/sqrt(length(Abundance)))))
}
#only works on genus
ps_ave_abu_gen = function(physeq){
#calculating error bars to graph mean transformed abundance of major phyla
melted <- psmelt(physeq)
grouped <- dplyr::group_by(melted[!is.na(melted$Genus),], Sample_Type, Genus)
gen_five <- as.data.frame(dplyr::summarise(grouped, mean=mean(Abundance), sd=sd(Abundance), sem = (sd(Abundance)/sqrt(length(Abundance)))))
}
```
<!--- setting custom colors for plotting --->
```{r include=FALSE,eval=TRUE}
myColors <- brewer.pal(6, "Dark2")
names(myColors) <- c("Stomach Tube","Grab Sample","Liquid Strained","Feces","Liquid Unstrained","Solid")
myColors_DPCoA <- c("#666666", "#1B9E77","#D95F02", "#7570B3", "#E7298A", "#66A61E","#E6AB02")
names(myColors_DPCoA) <- c("Taxa","Stomach Tube","Grab Sample","Liquid Strained","Feces","Liquid Unstrained","Solid")
```
#Running DADA2 to get ASVs and assign taxonomy.
This program infers exact amplicon sequence variants (ASVs) from amplicon data, resolving biological differences of even 1 or 2 nucleotides. This algorithum is prefered as DADA2 reports fewer false positive sequence variants than other methods report false OTUs. Note that this is a computationally expensive so its run on a cluster and then the R objects are read in.
First we will read in the data and trim ends where there is poor quality.
```{r Running DADA2, eval=FALSE}
# CHANGE ME to the directory containing the fastq files after unzipping.
path <- "C:/Users/Jill/Desktop/Depeters/"
list.files(path)
# Forward and reverse fastq filenames have format: SAMPLENAME_R1_001.fastq and SAMPLENAME_R2_001.fastq
fnFs <- sort(list.files(path, pattern="_Trim_R1.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files(path, pattern="_Trim_R2.fastq.gz", full.names = TRUE))
# Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)
plotQualityProfile(fnFs[1:10])
plotQualityProfile(fnRs[1:10])
#Place filtered files in filtered/subdirectory
filtFs <- file.path(path, "filtered", paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(path, "filtered", paste0(sample.names, "_R_filt.fastq.gz"))
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(240,220),trimLeft=c(10,0),
maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE, minLen=150,
compress=TRUE, multithread=FALSE, verbose=TRUE)
head(out)
#check quality again after trimming
plotQualityProfile(filtFs[10:20])
plotQualityProfile(filtRs[10:20])
```
The next steps learn the error rates of the data and identifies unique sequences. These data are fed into the main dada2 algorithum that makes a table of ASVs. Reads are merged and chimerias removed prior to making the final ASV table. Taxaonomy was assined using the silva database.
```{r eval=FALSE}
#learn erros for DADA2 algorithm
errF <- learnErrors(filtFs, multithread=FALSE)
errR <- learnErrors(filtRs, multithread=FALSE)
plotErrors(errF, nominalQ=TRUE)
derepFs <- derepFastq(filtFs, verbose=TRUE)
derepRs <- derepFastq(filtRs, verbose=TRUE)
# Name the derep-class objects by the sample names
names(derepFs) <- sample.names
names(derepRs) <- sample.names
#run the dada2 algorithum
dadaFs <- dada(derepFs, err=errF, multithread=FALSE, pool=TRUE)
dadaRs <- dada(derepRs, err=errR, multithread=FALSE, pool=TRUE)
#checking output
dadaFs[[1]]
#Merging forward and Reverse Reads
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)
# Inspect the merger data.frame from the first sample
head(mergers[[1]])
#Construct Sequence Table
seqtab <- makeSequenceTable(mergers)
dim(seqtab)
#Removing chimeras
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=FALSE, verbose=TRUE)
dim(seqtab.nochim)
#
taxa_rdp <- assignTaxonomy(seqtab.nochim, "/share/tearlab/Maga/Jill/rdp_train_set_16.fa.gz", multithread=TRUE)
saveRDS(taxa_rdp, "/share/tearlab/Maga/Jill/16s_Milk_2016/DADA2/taxa_rdp.rds")
taxa.sp_rdp <- addSpecies(taxa_rdp, "/share/tearlab/Maga/Jill/rdp_species_assignment_16.fa.gz")
saveRDS(taxa.sp_rdp, "/share/tearlab/Maga/Jill/16s_Milk_2016/DADA2/taxa.sp_rdp.rds")
#
taxa_silva <- assignTaxonomy(seqtab.nochim, "/share/tearlab/Maga/Jill/silva_nr_v132_train_set.fa.gz", multithread=TRUE)
saveRDS(taxa_silva, "/share/tearlab/Maga/Jill/16s_Milk_2016/DADA2/taxa_silva.rds")
taxa.sp_silva <- addSpecies(taxa_silva, "/share/tearlab/Maga/Jill/silva_species_assignment_v132.fa.gz")
saveRDS(taxa.sp_silva, "/share/tearlab/Maga/Jill/16s_Milk_2016/DADA2/taxa.sp_silva.rds")
```
Getting information out of DADA2 Objects.
```{r Getting info out of DADA2, eval=FALSE, include=TRUE}
#making and writing out a fasta of our final ASV seqs:
#This fasta will also be used for making a tree...
asv_fasta <- c(rbind(asv_headers, asv_seqs))
write(asv_fasta, "ASVs.fa")
#count table:
asv_tab <- t(seqtab.nochim)
row.names(asv_tab) <- sub(">", "", asv_headers)
write.table(asv_tab, "ASVs_counts.txt", sep="\t", quote=F)
#tax table:
asv_tax <- sil_taxa.sp
row.names(asv_tax) <- sub(">", "", asv_headers)
write.table(asv_tax, "ASVs_taxonomy.txt", sep="\t", quote=F)
```
Let's check the sizes of the sequences as a way to determine contamination.
```{r}
setwd("C:/Users/Jill/OneDrive - UC Davis/Documents/collaboration/Depeters/DADA2_Out/Demultiplex_Redo/")
seqtab.nochim <- readRDS("seqtab.nochim.rds")
#Inspect distribution of sequence lengths
table(nchar(getSequences(seqtab.nochim)))
median(as.numeric(rownames(table(nchar(getSequences(seqtab.nochim))))))
```
These sequences have a median length of `r median(as.numeric(rownames(table(nchar(getSequences(seqtab.nochim))))))` with most are less than 390bp. The sequences longer sequences may be the result of non-specific priming. We will look at this again after specific and thoughtful filtering. If long sequences remain after filtering we will look at them closer to make sure they are infact from bacterial origin.
Now we check the number of chimeras in the dataset.
```{r DADA2 chimeria stats}
setwd("C:/Users/Jill/OneDrive - UC Davis/Documents/collaboration/Depeters/DADA2_Out/Demultiplex_Redo/")
seqtab <- readRDS("seqtab.rds")
#checking Frequency of chimeras
sum(seqtab.nochim)/sum(seqtab)
```
Here we see that 2.01% of the sequences were identified to be chimerias and were removed from the dataset. Next, we will have a look at the read stats.
```{r DADA2 stats, eval=TRUE}
##Examining the stats of read count to through the pipeline.
#I still need to add in sample names
dadaFs <- readRDS("dadaFs.rds")
dadaRs <- readRDS("dadaRs.rds")
mergers <- readRDS("mergers.rds")
out <- readRDS("out.rds")
#Tracking read count through pipeline
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
track
```
This shows the library sizes of the samples and how many reads were removed at each step. There is `r sum(track[,"input"])` cleaned reads that entered the DADA2 pipeline. We will now get read stats for the input ASVs.
```{r}
#Getting total read number
sum(track[,1])
#Get info on depth of sequecing for samples
data.frame("Min" = min(track[,"input"]),"Max" = max(track[,"input"]),"Mean" = mean(track[,"input"]),
"Range" = range(track[,"input"]), "median" = median(track[,"input"]))
```
We compare this to the read stats for the final libraries.
```{r}
#Get info on depth of sequecing for samples
data.frame("Min" = min(track[,"nonchim"]),"Max" = max(track[,"nonchim"]),"Mean" = mean(track[,"nonchim"]),
"Range" = range(track[,"nonchim"]), "Median" = median(track[,"nonchim"]))
```
##Making phyloseq object
```{r Making phyloseq object}
setwd("C:/Users/Jill/OneDrive - UC Davis/Documents/collaboration/Depeters/DADA2_Out/Demultiplex_Redo/")
asv_tab <- readRDS("asv_tab.rds")
asv_tax <- readRDS("asv_tax.rds")
#had the following taxa that rdp didn't Entotheonellaeota, Epsilonbacteraeota, Gemmatimonadetes, Kiritimatiellaeota, Patescibacteria, BRC1 it doesn't have SR1 or Candidatus_Saccharibacteria though
TREE <- read_tree("dada2_seqs.tre")
MAP <- import_qiime_sample_data("C:/Users/Jill/OneDrive - UC Davis/Documents/collaboration/Depeters/Mapping_File_MMDR.txt")
MAP$X.SampleID <- paste0("sample_", MAP$X.SampleID) #add sample_ to SampleID column
ps <- phyloseq(otu_table(asv_tab, taxa_are_rows=TRUE), sample_data(MAP), tax_table(asv_tax), phy_tree(TREE))
sample_data(ps)$Sample_Type <- gsub("_"," ",sample_data(ps)$Sample_Type)
sample_data(ps)$CowID <- paste0("Cow_", sample_data(ps)$CowID) #corncob doesn't like numbers for factors
#sample_names(ps) <- paste0("sample_", sample_names(ps)) #Divnet/DPCoA don't like numbers for samples
ps
```
#Cleaning data
Currently, we are starting with 5,607 ASVs from 70 samples
```{r}
ps_kit <- ps #saving copy for later
ps <- subset_samples(ps, Sample_Type != c("TMR fecal kit"))
ps <- subset_samples(ps, Sample_Type != c("TMR plant kit"))
ps
```
First, we remove the kit samples to bring us down to 68 samples. We will look at these again later. We will also remove ASVs that aren't present in any samples.
```{r Removing empty ASVs, include=FALSE}
#Checking for empty samples, samples with no taxa assoicated with them (should be "FALSE").
any(sample_sums(ps) == 0)
#Checking if there are ASVs that aren't present in any samples (should be "FALSE")
any(taxa_sums(ps) == 0)
#Determining how many ASVs there are that aren't present in any sample
sum(taxa_sums(ps) == 0)
#removing ASVs that aren't present in any samples
ps <- prune_taxa(taxa_sums(ps) > 0, ps)
ps
```
There was no empty samples or taxa which is what we want. Also, there was 16 ASVs that weren't in any sample and were removed.
#More cleaning of data
To start cleaning the data we will at the number of ASVs assigned to each phylum.
```{r clean 1}
#Create table, number of features for each phyla
table(tax_table(ps)[, "Phylum"], exclude = NULL)
```
Next we will count what samples have the ASVs that aren't assigned to a phylum.
```{r clean 2}
#checking to see what samples contain the NA phyla samples.
psNA <- subset_taxa(ps, is.na(Phylum))
psNA <- prune_taxa(taxa_sums(psNA) > 0, psNA)
psNA_tab <- melt(colSums(psNA@otu_table), value.name="ASVs")
psNA_tab[,"Sample_Type"] <- psNA@sam_data$Sample_Type
psNA_tab %>% group_by(Sample_Type) %>% summarise(sum(ASVs))
```
There are 94 ASVs that weren't able to be assigned to a phylum. These unassigned taxa are found in all sample types with most of the unassigned ASVs in solid samples. *NOTE that the sum column is reads not the number of ASVs!* We next made a fasta file from the phyloseq object with these unknown taxa so that we can blast it later.
```{r clean 3, include=FALSE}
taxa_sp <- readRDS("C:/Users/Jill/OneDrive - UC Davis/Documents/collaboration/Depeters/DADA2_Out/Demultiplex_Redo/sil_tax_sp_final.rds")
ps3 <- phyloseq(otu_table(seqtab.nochim, taxa_are_rows=FALSE), sample_data(MAP), tax_table(taxa_sp))
ps3 <- subset_samples(ps3, Sample_Type != c("TMR_fecal_kit"))
ps3 <- subset_samples(ps3, Sample_Type != c("TMR_plant_kit"))
#ps3 <- subset_taxa(ps3, !Order %in% "Chloroplast")
ps3 <- prune_taxa(taxa_sums(ps3) > 0, ps3)
ps3 <- subset_taxa(ps3, is.na(Phylum))
```
```{r Rechecking stats output}
#Getting our seqs out
asv_seqs2 <- colnames(otu_table(ps3))
#Making fasta file
#giving our seq headers more manageable names (ASV_1, ASV_2...)
asv_seqs <- colnames(otu_table(ps3))
asv_headers <- vector(dim(otu_table(ps3))[2], mode="character")
for (i in 1:dim(otu_table(ps3))[2]) {
asv_headers[i] <- paste(">Seq", i, sep="_")
}
#making and writing out a fasta of our final seqs:
asv_fasta <- c(rbind(asv_headers, asv_seqs))
write(asv_fasta, "ASVs_Unknowns.fa")
```
Getting back to our orginal phyloseq object: the 94 AVSs that weren't assigned to a phyla were removed for analysis. This leaves 5,497 ASVs.
```{r}
#Removing ambiguous phylum annotation
#This changes ASVs from 5,591 to 5,452
ps <- subset_taxa(ps, !is.na(Phylum) & !Phylum %in% c("", "uncharacterized"))
ps
```
Next we will compute the total and average prevalences of the ASVs in each phylum. We are defining prevalence as the number of samples in which a taxon appears at least once.
```{r}
#Compute prevalence of each feature, store as data.frame
#prevalence in the dataset we will define here as the number of samples in which a taxon appears at least once
prevdf = apply(X = otu_table(ps),
MARGIN = ifelse(taxa_are_rows(ps), yes = 1, no = 2),
FUN = function(x){sum(x > 0)})
#Add taxonomy and total read counts to this data.frame
prevdf = data.frame(Prevalence = prevdf, TotalAbundance = taxa_sums(ps), tax_table(ps))
#display table
plyr::ddply(prevdf, "Phylum", function(df1){cbind(mean(df1$Prevalence),sum(df1$Prevalence))}) %>%
#making table of phyla ASVs taxa
kable(caption="Prevelance of Phyla") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), font_size = 10) %>%
scroll_box(width = "100%", height = "300px")
```
Here we see that Deferribacteres and Gemmatimonadetes ASVs only has one feature so we'll just looking into this real quick.
```{r Explore phyla, warning=FALSE}
#Making phyloseq object with Gemmatimonadetes
ps_explore <- subset_taxa(ps, Phylum == c("Gemmatimonadetes"))
ps_explore <- prune_samples(sample_sums(ps_explore) > 0, ps_explore)
ps_explore@sam_data$Sample_Type
#which sample is it found in
otu_table(ps_explore)
#Making phyloseq object with Deferribacteres
ps_explore <- subset_taxa(ps, Phylum == c("Deferribacteres"))
ps_explore <- prune_samples(sample_sums(ps_explore) > 0, ps_explore)
ps_explore@sam_data$Sample_Type
#which sample is it found in
otu_table(ps_explore)
```
The phylum Deferribacteres are only in Fecal samples (2 reads) and and Gemmatimonadetes are only in Stomach Tube samples (3 reads). This suggest these groups might be important for comparing sample types, thus we will leave reads assigned to these phyla in the dataset despite their low prevelance.
Lastly, we'll check to see if chloroplasts and Mitochondria are in the data set and remove them.
```{r Remove chloroplast, include=FALSE}
#removing phyla that are assigned to chloroplasts
tax_table(subset_taxa(ps, Order == "Chloroplast"))
ps <- subset_taxa(ps, !Order %in% "Chloroplast")
tax_table(subset_taxa(ps, Family == "Mitochondria"))
ps <- subset_taxa(ps, !Family %in% "Mitochondria")
ps
```
After removing chloroplasts and mitochondria there is 5,485 ASVs left.
#Looking at metrics after filtering
```{r eval=FALSE}
#number of taxa present
ntaxa(ps)
#checking names of taxa present at specific rank
length(get_taxa_unique(ps, "Phylum"))
length(get_taxa_unique(ps, "Order"))
length(get_taxa_unique(ps, "Family"))
length(get_taxa_unique(ps, "Genus"))
length(get_taxa_unique(ps, "Species"))
#how taxa did not have species assigned
length(which(is.na(tax_table(ps)[,"Species"])))
#what percentage of taxa had species assigned
as.numeric(format((length(which(!is.na(tax_table(ps)[,"Species"])))/length(row.names(tax_table(ps))))*100, digits = 3))
```
As we have seen previously, there are `r ntaxa(ps)` ASVs in the dataset. This is composed of `r length(get_taxa_unique(ps, "Phylum"))` phyla, `r length(get_taxa_unique(ps, "Order"))` Orders, `r length(get_taxa_unique(ps, "Family"))` Families and `r length(get_taxa_unique(ps, "Genus"))` Genera.
`r length(which(is.na(tax_table(ps)[,"Species"])))` ASVs didn't have species assigned. Only `r as.numeric(format((length(which(!is.na(tax_table(ps)[,"Species"])))/length(row.names(tax_table(ps))))*100, digits = 3))`% of taxa had species assigned. For genera, `r length(which(is.na(tax_table(ps)[,"Genus"])))` ASVs didn't have a genera assigned. Only `r as.numeric(format((length(which(!is.na(tax_table(ps)[,"Genus"])))/length(row.names(tax_table(ps))))*100, digits = 3))`% of taxa had genera assigned.
```{r}
keep <- as.data.frame(table(tax_table(ps)[which(is.na(tax_table(ps)[,"Species"])),][,"Phylum"]))
keep$total <- cbind(table(tax_table(ps)[,"Phylum"]))
keep$percent <- (keep$Freq/keep$total)*100
colnames(keep) <- c("Phylum","#ASVs with no species assignment", "Total ASVs","Percent Unassigned")
kable(keep, caption="Unassigned Species") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), font_size = 10) %>%
scroll_box(width = "100%", height = "300px")
```
This table gives the frequencing and percent of ASVs not assigned to species and their phyla. This really speaks to the limitations of the methods used here to be able to give species level assigments.
```{r eval=FALSE}
keep <- as.data.frame(table(tax_table(ps)[which(is.na(tax_table(ps)[,"Genus"])),][,"Phylum"]))
keep <- merge(keep,as.data.frame(table(tax_table(ps)[,"Phylum"])),by="Var1",all=TRUE)
keep[is.na(keep)] <- 0 #change NAs to 0
keep$percent <- (keep$Freq.x/keep$Freq.y)*100
colnames(keep) <- c("Phylum","#ASVs with no genera assignment", "Total ASVs","Percent Unassigned")
kable(keep, caption="Unassigned Genera") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), font_size = 10) %>%
scroll_box(width = "100%", height = "300px")
```
This table gives the frequencing and percent of ASVs not assigned to genus and their phyla.
```{r Counting Singletons, eval=TRUE, include=FALSE}
#How many singletons are there? How many doubletons?
singletons <- sum(rowSums(ps@[email protected])==1) #number of singletons
doubletons <- sum(rowSums(ps@[email protected])==2) #number of doubletons
tripletons <- sum(rowSums(ps@[email protected])==3) #number of tripletons
sum(singletons,doubletons,tripletons)
```
There are `r sum(singletons,doubletons,tripletons)` singletons (`r sum(singletons)`), doubletons (`r sum(doubletons)`) or tripletons (`r sum(tripletons)`). This looks pretty good and indicates that filtering was not excessive nor was a large enough part of the data to be suspicious about. We will need these for diversity metrics.
For the last part of our cleaning process we will graph out the prevalance of ASVs assigned to each phylum.
```{r Graph phyla , fig.width=6}
#Subset to the remaining phyla
prevdf1 = subset(prevdf, Phylum %in% get_taxa_unique(ps, "Phylum"))
ggplot(prevdf1, aes(TotalAbundance, Prevalence / nsamples(ps),color=Phylum)) +
#Include a guess for parameter
geom_hline(yintercept = 0.05, alpha = 0.5, linetype = 2) + geom_point(size = 2, alpha = 0.7) +
scale_x_log10() + xlab("Total Abundance") + ylab("Prevalence [Frac. Samples]") +
facet_wrap(~Phylum) + theme(legend.position="none")
```
#Rechecking read stats
Before moving on we will look again at the read stats to check that we still don't have reads that are too long in the dataset.
```{r Rechecking stats, echo=FALSE}
taxa_sp <- readRDS("C:/Users/Jill/OneDrive - UC Davis/Documents/collaboration/Depeters/DADA2_Out/Demultiplex_Redo/sil_tax_sp_final.rds")
ps2 <- phyloseq(otu_table(seqtab.nochim, taxa_are_rows=FALSE), sample_data(MAP), tax_table(taxa_sp))
ps2 <- subset_samples(ps2, Sample_Type != c("TMR_fecal_kit"))
ps2 <- subset_samples(ps2, Sample_Type != c("TMR_plant_kit"))
ps2 <- subset_taxa(ps2, !Order %in% "Chloroplast")
ps2 <- subset_taxa(ps2, !Family %in% "Mitochondria")
ps2 <- subset_taxa(ps2, !is.na(Phylum) & !Phylum %in% c("", "uncharacterized"))
ps2 <- prune_taxa(taxa_sums(ps2) > 0, ps2)
#ps2 <- subset_taxa(ps2, !Phylum %in% filterPhyla)
ps2
```
```{r Rechecking stats output B}
#Getting our seqs out
asv_seqs2 <- colnames(otu_table(ps2))
#Inspect distribution of sequence lengths
table(nchar(getSequences(asv_seqs2)))
```
Looks like we now only have one sample that is greater than 300bp let's see what it is.
```{r}
#find the sequence that is greater than 300bp
large_taxa <- asv_seqs2[which(nchar(getSequences(asv_seqs2)) > 300)]
#find taxa that this sequence was assigned to
tax_table(ps2)[grep(as.character(large_taxa), row.names(tax_table(ps2))),]
```
As this large sequence is a **Methanobrevibacter** and this is a common rumen bacteria its expected to be here and will be left in the data set.
#Abundance of Phyla
As the first part of the exploratory analysis we will look the general relative abundances of phyla across sample types.
```{r warning=FALSE}
#combing by phyla and then making into relative abundance
ps_phyla <- tax_glom(ps, "Phylum")
#Making relative
ps_phyla_rel <- transform_sample_counts(ps_phyla, function(x) 100*(x/sum(x)))
#calculating error bars to graph mean transformed abundance of major phyla
phyla <- ps_ave_abu_phy(ps_phyla_rel)
#Ordering
phyla <- phyla[order(-phyla$mean),] #ordering by mean
#phyla <- phyla[order(-phyla$Phylum),] #change to order by name*
phyla[,3:5] <- format(phyla[,3:5], digits = 3, scientific=F)
kable(phyla, caption="Statistiscs for Abundance of Phyla") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), font_size = 10) %>%
scroll_box(width = "100%", height = "300px")
```
We will look at relative abundance of Phyla in certain sample types. First, grab samples.
```{r}
#getting the average grab sample phyla abundance
phyla[grep("Grab Sample",phyla$Sample_Type),]
```
Next, we examine the relative abundance of Phyla in only fecal samples.
```{r}
#getting the average grab sample phyla abundance
phyla[grep("Feces",phyla$Sample_Type),]
```
These are the phyla found in all samples sorted by decending order of mean relative abundance.
We will have a look to see if any phyla only present in only feces or only in stomach tube samples.
```{r include=FALSE}
ps_sub <- subset_samples(ps, Sample_Type != c("Feces"))
ps_sub <- subset_samples(ps, Sample_Type == c("Stomach Tube"))
ps_sub <- prune_taxa(taxa_sums(ps_sub) > 0, ps_sub)
Feces <- subset_samples(ps, Sample_Type == c("Feces"))
Feces <- prune_taxa(taxa_sums(Feces) > 0, Feces)
setdiff(get_taxa_unique(Feces, "Phylum"), get_taxa_unique(ps_sub, "Phylum"))
setdiff(get_taxa_unique(ps_sub, "Phylum"), get_taxa_unique(Feces, "Phylum"))
```
`r setdiff(get_taxa_unique(Feces, "Phylum"), get_taxa_unique(ps_sub, "Phylum"))` Was only found in fecal samples and `r setdiff(get_taxa_unique(ps_sub, "Phylum"), get_taxa_unique(Feces, "Phylum"))` was only found in stomach tube samples.
This confirms what we say earlier and we didn't identify any other phyla that are only present in these sample types.
Next we will graph out some of the different phyla based on their abundance ranges.
```{r warning=FALSE, fig.height=6, fig.width=4}
phyla$mean <- as.numeric(phyla$mean)
#Which phyla are present at greater than 3% relative abundance
lfive <- as.list(as.character(unique(phyla[which(phyla$mean > 3),]$Phylum)))
five <- subset_taxa(ps_phyla_rel, Phylum== lfive[[1]] | Phylum== lfive[[2]] | Phylum==lfive[[3]] | Phylum==lfive[[4]] | Phylum==lfive[[5]] | Phylum==lfive[[6]])
#calculating error bars to graph mean transformed abundance of major phyla
phyla_five <- ps_ave_abu_phy(five)
#Plotting relative abundance