Data processing for metagenomics assembly
- a. Tools
- b. PreProcess the Raw Data
- c. Assembly
- d. Abundance calculation
- e. ORF prediction and functional annotation
- f. Genome binning
- [BBtools]:https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/installation-guide/ for raw data preprocessing
- Assembly tools metaspades https://github.com/ablab/spades or megahit https://github.com/voutcn/megahit
- pullseq to filter sequences by a minimum length or maximum length https://github.com/bcthomas/pullseq
- reads mapping and calculation bowtie for read mappping https://github.com/BenLangmead/bowtie2 shrinksam to remove sequences that failed to map to the reference https://github.com/bcthomas/shrinksam add_read_count.rb https://github.com/bcthomas/misc_scripts/tree/master/add_read_count
- Prodigal for ORF prediction https://github.com/hyattpd/Prodigal
- KofamScan KEGG ko term annotation https://github.com/takaram/kofam_scan
- Diamond alignment https://github.com/bbuchfink/diamond
- Deseq Statistical Analysis ttps://bioconductor.org/packages/release/bioc/manuals/DESeq2/man/DESeq2.pdf
1. Trim adaptors
Besides trim adaptors, optionally, reads with Ns can be disregarded by adding “maxns=0” ; and reads with really low average quality can be disregarded with “maq=8”
$ bbduk.sh in=input.fastq out=output.fastq \
ref=${BBtools_path}/resources/adapters.fa \
ktrim=r k=23 mink=11 hdist=1 tpe tbo
2. Remove synthetic artifacts and spike-ins by kmer-matching
This will remove all reads that have a 31-mer match to PhiX, allowing one mismatch. The “outm” stream will catch reads that matched a reference kmers; “stats” will produce a report of which contaminant sequences were seen, and how many reads had them.
$ bbduk.sh in=trimed.fastq out=output_u.fastq outm=output_m.fastq \
ref=${BBTools_path}/resources/sequencing_artifacts.fa.gz \
ref=${BBTools_path}/resources/phix174_ill.ref.fa.gz \
k=31 hdist=1 stats=stats.txt
3. Error Correction
Low-depth reads can be discarded here with the”tossdepth”, or “tossuncorrectable” flags For very large datasets, “prefilter=1” or “prefilter=2” can be added to conserve memory
$ tadpole.sh -Xmx32g in= output_u.fastq out=output_tecc.fastq filtermemory=7g ordered prefilter=1
Metaspades require more memorial and longer time, but can get a higher quality assembly and scaffold directly Megahit require less memorial and less time, but with relatively lower quality and can only assembly the reads into contig.
1. metaspades
$ metaspades --12 output_tecc.fastq -o assembly_data.fasta
#--12 means the sequencing data is in interleaved form
The default memorial is 250G, if you need a larger one, you can set like -m 500
2. Megahit
$ megahit --12 output_tecc.fastq -o assembly_data_megahit.fasta
3. N50, N90 Calculation (BBtools)
$ stats.sh in=assembly_data.fasta out=assembly_data_stats
1. fillter scaffolds
#Select scaffolds that sequence length > 1000;
$ pullseq -i assembly_data.fasta -m 1000 > sequence_min1000.fasta
2. map reads into scaffolds with bowtie2
#Create bowtie2 index file
$ bowtie2 -build sequence_min1000.fasta sequence_min1000
#Deinterleave paired reads
$ reformat.sh in=output_tecc.fastq out1=read1.fastq out2=read2.fastq
#Sequence alignment
$ bowtie2 -x sequence_min1000 -1 read1.fastq -2 read2.fastq S alignment.sam -p 19
# 'alignment.sam' is also the file used for genome binning
# You can obtain the number of aligned reads in the output file
# and the number of total read can be obtained in the log file of assembly
# Mapping rate = the number of aligned reads/ the number of total readsRPKM Calculation (shrinksam)
3. Count Mapped Reads
# we use `shrinksam` to remove unmapped reads from bowtie2-created SAM files,
# it will generate a SAM file with only the reads that map to an assembled scaffold.
$ shrinksam -i alignment.sam > mapped.sam
# Then we count the reads numbers mapped to each scaffold through `add_read_count.rb`
$ add_read_count.rb mapped.sam sequence_min1000.fastq > mapped.fasta.counted
# we just filter the lines containing scaffold name in the output fasta files
$ grep -e ">" mapped.fasta.counted > mapped.counted.result
$ sed -i "s/>//g" mapped.counted.result
$ sed -i "s/read_count_//g" mapped.counted.result
# an example of `mapped.count.result`
NODE_8_length_295416_cov_86.298909 read_length_150 1805
NODE_9_length_294755_cov_77.048012 read_length_150 267
NODE_10_length_287310_cov_147.7697 read_length_150 2469
NODE_11_length_283217_cov_82.5382 read_length_150 423
NODE_12_length_273911_cov_216.2412 read_length_150 16
4. RPKM Calculation for each assembled scaffold
$ Python parse_rpkm.py mapped.counted.result total_reads_file srrID
# total_reads_file is a file that includes total reads number in each sample, an example:
10004483 SRR7768473
10005994 SRR7768567
10007320 SRR7768664
# srrID is just the srrID sample ID number (such as SRR7768473)
# to calculate the rpkm, scaffold length is all needed.
# MetaSpades predicted scaffold name is like 'NODE_8_length_295416_cov_86.298909'.
# the number 295416 is the length of this scaffold.
Important notes: Here we only got RPKM for each scaffold; For the proteins in the same scaffod, they should have the same RPKM with this scaffold.
- ORF prediction through
Prodigal
$ prodigal -i sequence_min1000.fasta -p meta -a trans_protein.fasta -f gff -o predicted.gff
# two output files:
trans_protein.fasta (protein translations file);
predicted.gff (inforamtion for each CDS, we will use the CDS length)
2.1 Map scaffold RPKM to each protein that located in this scaffold.
The principle is that for the proteins in the same scaffod, they should have the same RPKM with this scaffold. This is analysis for metagenome, which is not same with metatranscriptomics.
2.2 read counts for each protein
Some statistic tools, such as DEGseq2, they use the read counts (not the normalized RPKM) as the inputs
here we add a method to get protein read counts directly from the scaffold read counts in the above.
We could just divide the scaffold read counts into each protein according to their CDS length
.
1. KEGG annotation through KofamScan
$ kofam_scan/exec_annotation -o Coassembly_KO.txt trans_protein.fasta --tmp-dir tmp_KO --cpu 10
map the annotated KO term into pathways through KEGG Mapper
: https://www.genome.jp/kegg/tool/map_pathway.html
2. Metacyc reaction annotation by aligning database from Metacyc pathway database;
diamond
was used to do sequence alignment
$ diamond blastp --query trans_protein.fasta \
--db uniprot-seq-ids.fasta \
--out Metacyc_protein_top5hit.blst \
--outfmt 6 --evalue 0.001 --max-target-seqs 5 --sensitive
$ python creat_RXN_dictionary.py Metacyc_protein_top5hit.blst Metacyc_protein_RXN_key_sen
For more information: https://github.com/thepanlab/MetagenomicsTree/blob/master/metacyc/metacyc_annotation.md
1.sam to sorted bam for binning
$ samtools view -bS alignment.sam -@ 19 > alignment.bam
$ samtools sort alignment.bam -@ 19 -o alignment_sorted > alignment_sorted.bam
# alignment.sam is from Bowtie2 output result in read mapping process.
2. metabat binning
$ jgi_summarize_bam_contig_depths --outputDepth output_depth.txt *_sorted.bam
$ metabat -i sequence_min1000.fasta -o output_bin -a output_depth.txt -m 2000
3. qualtify and taxonomy annotation for the binned genomes
$ checkm lineage_wf -f CheckM.txt -t 10 -x fa --pplacer_threads 1 input_fold_includes_Bins checkm_wf_out