These scripts attempt to determine whether or not there is any support for a shared whole genome duplication between two taxa based on their transcriptomes.
- TransDecoder
- Used to determine likely open reading frames in a given transcriptome and its corresponding protein translation.
- Blat
- Used to determine likely paralogs.
- KaKs_Calculator
- Used to estimate the number of synonymous substitions per synonymous site (Ks) between paralogs.
- ProteinOrtho.pl
- Used to identify orthologous sequences shared across the given transcriptomes.
- Blastp
- Used by ProteinOrtho.pl to detect orthologs. Also used to reduce orthologous sequences to only their shared homologous sites.
- MUSCLE
- Used to align orthologous family sequences after reduction to homologous sites.
- RAxML
- Used to calculate best tree for a quartet as well as its support.
- R
- Plotting and statistics software used to generate the actual Ks plot.
Transcriptomes given to the script are first translated into their most likely protein using TransDecoder. An intraspecies all versus all blat search is then performed for for each input transcriptome in order to identify homologous protein pairs. Ks is then calculated for each protein pair using KaKs_Calculator. Pairs are then filtered to the Ks range specified by the user. Quartets are then identified by taking one sequence from each pair found in the specified Ks range in each transcriptome and running ProteinOrtho. Alignments are restricted to homologous sites only using blastp, homologous sites are then aligned using MUSCLE. If the final protein alignment meets the required length cutoff (100 amino acids/300 nucleotides), the sequence is reverse translated to its original nucleotide sequence. RAxML is then used to run a 100 replicate rapid bootstrap using the GTR-Gamma model. If the support of the quartet split is high enough (>= 70 by default), this quartet's topology is used as evidence for whether or not the given transcriptomes shared a whole genome duplication.
At the mininmum, the script requires two FASTA files specfied with -t or --transcriptomes to run, the desired Ks range over which to check for a shared WGD must also specified for each transcriptome using -k or --ks:
wgd-test.pl -t transcriptome1.fa transcriptome2.fa -k 0.2-0.6,0.3-0.7
If only one Ks range is given as in the following invocation, this range is used for both transcriptomes:
wgd-test.pl -t transcriptome1.fa transcriptome2.fa -k 0.2-0.6
For further fine-tuning of the script, the following options can also be specified:
Option Flag(s) | Option Descripton | Default |
---|---|---|
-k, --ks | range of Ks to search for each transcriptome | none |
-t, --transcriptomes | file names of at least two transcriptomes (in FASTA format) to use for analyses | none |
-m, --model | model used by KaKs_Calculator to determine Ks | YN |
-o, --output | name of the directory to store output files in | "wgd-test-" + Unix time of script invocation) |
-l, --min-length | the minimum alignment length of paralogous sequences | 300 nucleotides |
-p, --pid-cut | minimum pairwise percent identity allowed between two members of a quartet | 80% |
-b, --boot-cut | minimum RAxML bootstrap support required to use a quartet | 70 |
--pfam-cpus | the number of CPUs to let hmmscan using during pfam analysis | current number of free CPUs |
--pfam-search | full path to pfam binary for usage in pfam search | none |
-T, --n-threads | the number of families to analyze concurrently | current number of free CPUs |
-h, --help | display help and exit | N/A |
The following files can be found in the output directory upon successful completion of the script given that the user invoked the script with the following command:
wgd-test.pl -t trans1.fasta trans2.fasta -k 0.2-0.6
- The two most important output files, each containing a list of well-supported quartets
- wgd-support.txt: file containing a list of quartets which support a shared WGD
- wgd.against.txt: file containing a list of quartets which are against a shared WGD
- genes/: directory containing quartet alignments (-aligned.nex files) and corresponding RAxML output files for genes which met all thresholds required for analysis
- Note: there would be a corresponding trans2 prefix for each of these output files
- TransDecoder output for each input transcriptome:
- trans1.fasta.transdecoder.bed: bed output from TransDecoder
- trans1.fasta.transdecoder.cds: corresponding CDS of putative peptides identified
- trans1.fasta.transdecoder.gff3: gff3 output from TransDecoder
- trans1.fasta.transdecoder.mRNA: corresponding mRNA of putative peptides identified
- trans1.fasta.transdecoder.pep: putative peptides identified
- trans1.fasta.transdecoder.pep.pslx: blat output from trans1.fasta's all versus all self blat
- trans1.atx: atx formatted file containing homologous protein pairs used as KaKs_Calculator's input
- trans1.YN.kaks: KaKs_Calculator output (YN would be whichever model name you specified)
- trans1.fasta.YN.ks_0.2-0.6: FASTA file containing one sequence from each protein pair with a Ks between whatever range the user specified (YN would be whichever model name you specified, 0.2-0.6 will vary based on user specified Ks range)
- wgd-test.proteinortho: output by ProteinOrtho, contains final sequence clustering information of input files. This the file parsed by the toca.pl to determine clustering
- wgd-test.blast-graph: output by ProteinOrtho, contains similarity scores between contigs as determined by blastp
- wgd-test.proteinortho-graph: output by ProteinOrtho, contains similarity scores between contigs as determined by blastp
Transcriptomes given to the script are first translated into their most likely protein using TransDecoder. An intraspecies all versus all blat search is then performed for for each input transcriptome in order to identify homologous protein pairs. Ks is then calculated for each protein pair using KaKs_Calculator. Pairs are NOT filtered to a specific Ks range. Quartets are then identified by taking one sequence from each pair found in each transcriptome and running ProteinOrtho. Alignments are restricted to homologous sites only using blastp, homologous sites are then aligned using MUSCLE. If the final protein alignment meets the required length cutoff (100 amino acids/300 nucleotides), the sequence is reverse translated to its original nucleotide sequence. RAxML is then used to run a 100 replicate rapid bootstrap using the GTR-Gamma model. If the support of the quartet split is high enough (>= 70 by default), this quartet's topology is used as evidence for whether or not the given transcriptomes shared a whole genome duplication.
At the mininmum, the script requires two FASTA files specfied with -t or --transcriptomes to run:
wgd-plot.pl -t transcriptome1.fa transcriptome2.fa
After successful completion of the script, the files transcriptome1.csv and transcriptome2.csv would be located in the output directory. A sliding window plot can be generated for a particular csv file by running the following command:
wgd-plot.r transcriptome1.csv
For further fine-tuning of the script, the following options can also be specified:
Option Flag(s) | Option Descripton | Default |
---|---|---|
-t, --transcriptomes | file names of at least two transcriptomes (in FASTA format) to use for analyses | none |
-m, --model | model used by KaKs_Calculator to determine Ks | YN |
-o, --output | name of the directory to store output files in | "wgd-plot-" + Unix time of script invocation) |
-l, --min-length | the minimum alignment length of paralogous sequences | 300 nucleotides |
-p, --pid-cut | minimum pairwise percent identity allowed between two members of a quartet | 80% |
-b, --boot-cut | minimum RAxML bootstrap support required to use a quartet | 70 |
--pfam-cpus | the number of CPUs to let hmmscan using during pfam analysis | current number of free CPUs |
--pfam-search | full path to pfam binary for usage in pfam search | none |
-T, --n-threads | the number of families to analyze concurrently | current number of free CPUs |
-h, --help | display help and exit | N/A |
The following files can be found in the output directory upon successful completion of the script given that the user invoked the script with the following command:
wgd-plot.pl -t trans1.fasta trans2.fasta
- The two most important output files which are used to generate the plot. Each line corresponds to a single quartet, the first value is the Ks of the protein pair from the species's perspective, and the second value is boolean indicating whether or not the quartet supports a WGD (0 for against, 1 for support). These files are used as the input for wgd-plot.r:
- trans1.csv: file containing a list of quartets which support a shared WGD
- trans2.csv: file containing a list of quartets which are against a shared WGD
- Using the above files as input for wgd-plot.r would create trans1.pdf and trans2.pdf
- genes/: directory containing quartet alignments (-aligned.nex files) and corresponding RAxML output files for genes which met all thresholds required for analysis
- Note: there would be a corresponding trans2 prefix for each of these output files
- TransDecoder output for each input transcriptome:
- trans1.fasta.transdecoder.bed: bed output from TransDecoder
- trans1.fasta.transdecoder.cds: corresponding CDS of putative peptides identified
- trans1.fasta.transdecoder.gff3: gff3 output from TransDecoder
- trans1.fasta.transdecoder.mRNA: corresponding mRNA of putative peptides identified
- trans1.fasta.transdecoder.pep: putative peptides identified
- trans1.fasta.transdecoder.pep.pslx: blat output from trans1.fasta's all versus all self blat
- trans1.atx: atx formatted file containing homologous protein pairs used as KaKs_Calculator's input
- trans1.YN.kaks: KaKs_Calculator output (YN would be whichever model name you specified)
- trans1.fasta.YN.ks_0.2-0.6: FASTA file containing one sequence from each protein pair with a Ks between whatever range the user specified (YN would be whichever model name you specified, 0.2-0.6 will vary based on user specified Ks range)
- wgd-plot.proteinortho: output by ProteinOrtho, contains final sequence clustering information of input files. This the file parsed by the toca.pl to determine clustering
- wgd-plot.blast-graph: output by ProteinOrtho, contains similarity scores between contigs as determined by blastp
- wgd-plot.proteinortho-graph: output by ProteinOrtho, contains similarity scores between contigs as determined by blastp