
Admin_EnsemblReferenceFileGenerator.pl

AndyMenzies edited this page Jun 19, 2020 · 9 revisions

This is the main generation script and should be the only one needed under standard usage.

Example

Admin_EnsemblReferenceFileGenerator.pl -sp Human -as GRCh38 -d homo_sapiens_core_91_38 -f ftp://ftp.ensembl.org/pub/release-91/fasta/homo_sapiens/cdna/ -o /path/to/output/directory

Parameters

 Required Options:

    --output       (-o)     Output directory    

    --species      (-sp)    Species (e.g. human, mouse)

    --assembly     (-as)    Assembly version (e.g. GRCh37, GRCm38)

    --database     (-d)     Ensembl core database version number (e.g. homo_sapiens_core_74_37p)

  Dynamic Download:

    --ftp          (-f)     Ensembl ftp directory containing the cDNA fasta sequence files

  Or Local Files:

    --features     (-gf)    gff3 or gtf file containing transcript and gene information

    --cdna_fa      (-cf)    Fasta file containing protein coding cdna sequences

    --ncrna_fa     (-nf)    Fasta file containing non-coding cdna sequences 

  Optional:

    --help         (-h)     Brief documentation
    
    --ccds         (-c)     (Recommended) The CCDS2Sequence file from the relevant CCDS release, see http://www.ncbi.nlm.nih.gov/CCDS

    --fai          (-fai)   (Recommended) The samtools fasta index file (.fai) for your reference genome
                              This is the reference genome that your bam and vcf files will be mapped to

    --trans_list   (-tl)    List of pre-prepared transcript accessions; only these accessions will be included in the reference output

Details

There are 4 compulsory parameters:

  • output directory (--output)
  • species (--species)
  • genome build (--assembly)
  • Ensembl database version (--database)

After that you need to specify the actual Ensembl data you want to use to build the reference data set. You can do this in one of 2 ways: either give an Ensembl FTP URL for the cDNA fasta file directory (e.g. ftp://ftp.ensembl.org/pub/release-XX/fasta/XXX_XXX/cdna/), or give local paths for the 3 input files (features, cDNA fasta and ncRNA fasta) that you've already downloaded.
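As a rough illustration of the local-files route, the sketch below passes the three local inputs instead of an FTP URL. Every /path/to/... value is a placeholder rather than a real location, and the species, assembly and database values are simply copied from the example at the top of this page. The command is wrapped in a function that only prints it, so nothing is executed until you copy the line out and adjust the paths.

```shell
# Prints (does not run) a local-file invocation of the generator script.
# All /path/to/... values are placeholders; substitute your own downloads.
print_local_files_command() {
  echo "Admin_EnsemblReferenceFileGenerator.pl" \
    "-sp Human -as GRCh38 -d homo_sapiens_core_91_38" \
    "-gf /path/to/features.gff3" \
    "-cf /path/to/cdna.fa" \
    "-nf /path/to/ncrna.fa" \
    "-o /path/to/output/directory"
}

print_local_files_command
```

The -gf, -cf and -nf options take the place of -f here; the four required options stay the same in both modes.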

Finally there are a few optional parameters which can have a significant impact on the content of the reference data produced.

CCDS

For human and mouse configurations, this option lets you specify a CCDS input file from NCBI. You'll need to download the appropriate CCDS2Sequence file from NCBI's CCDS website (go to https://www.ncbi.nlm.nih.gov/projects/CCDS and follow the link to their FTP site).

FAI

Ensembl doesn't always use the same chromosome/contig names as the official genome release. Human GRCh38 is a prime example: in the official GRC genome build chromosomes start with a chr prefix, but in the Ensembl annotation they don't (e.g. chr1 vs 1). If you supply the fasta index file (.fa.fai) for the reference genome your data will be mapped to, the reference generation code will attempt to change the Ensembl sequence names to match the .fai file. Any sequence entry in the raw Ensembl data that can't be matched to a name in the .fai file will be ignored.
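Because unmatched sequences are dropped silently, it can be useful to see beforehand which Ensembl names have no counterpart in your .fai file. The helper below is a minimal sketch with a hypothetical function name: it compares column 1 of an uncompressed GFF3/GTF file against column 1 of the .fai and treats a name as matched if it appears as-is or with a chr prefix. That rule is an assumption made for illustration only; the script's own matching logic is not described on this page and may accept or reject more names than this check does.

```shell
# unmatched_seqnames FEATURES FAI
# Lists sequence names from a plain-text GFF3/GTF (column 1) that are absent
# from the .fai (column 1), both as-is and with a "chr" prefix.
# The chr-prefix rule is an illustrative assumption, not the script's logic.
unmatched_seqnames() {
  awk -F '\t' '
    NR == FNR { known[$1] = 1; next }   # first file: .fai, column 1 = name
    /^#/ || NF == 0 { next }            # skip comment and blank lines
    {
      name = $1
      alt = "chr" name
      if (!(name in known) && !(alt in known) && !(name in seen)) {
        seen[name] = 1                  # report each name only once
        print name
      }
    }
  ' "$2" "$1"
}
```

Usage would look like `unmatched_seqnames features.gff3 genome.fa.fai`, with both file names standing in for your own files. An empty result means every name passed this simple check.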

Trans_list

If you want to produce reference files that contain only a specific subset of transcripts, you can provide a file containing a list of Ensembl transcript IDs. The format is very simple: one ID per line with no whitespace. NOTE: if none of the transcript accessions are found in the Ensembl GFF3/GTF files, the script will produce an empty cache file and will not throw an error. If this happens unexpectedly, and the transcript list may have been moved between different operating systems, check your end-of-line characters.
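One way to check for the end-of-line problem mentioned above is to count lines ending in a carriage return and, if any turn up, write out a cleaned copy. The two helper functions below are hypothetical names invented for this sketch and rely only on standard awk; they are not part of the script.

```shell
# count_cr_lines FILE
# Prints how many lines end with a carriage return (Windows-style endings).
count_cr_lines() {
  awk '/\r$/ { n++ } END { print n + 0 }' "$1"
}

# clean_transcript_list IN OUT
# Removes carriage returns, spaces and tabs, drops empty lines, and writes
# the result to OUT so that each line holds exactly one ID.
clean_transcript_list() {
  awk '{ gsub(/[ \t\r]/, "") } length($0) > 0' "$1" > "$2"
}
```

For example, `count_cr_lines transcripts.txt` followed by `clean_transcript_list transcripts.txt transcripts.clean.txt`, then pass the cleaned file to -tl. Both file names are placeholders.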
