Skip to content

Detecting imprinting and X-chromosome inactivation status from RNA-seq

Notifications You must be signed in to change notification settings

Gal-Keshet/EpiTyping

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

69 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

EpiTyping

EpiTyping is a tool for detecting imprinting and X-chromosome inactivation status from RNA-seq scheme1_cropped

Pre-requisites

Nextflow (from version 22.10.4 onwards) and Singularity (or apptainer, the new name for the Singularity project; program works from Singularity version 3.8.1 onwards) are the only pre-requisites for the EpiTyping tool. Install both if needed and make sure they are properly running on your system.

If the following commands do not generate any error message you are good to go.

nextflow run hello  # test that nextflow is working

singularity help  # test that singularity is working

Installation

  1. Download the project directory:
git clone https://github.com/Gal-Keshet/EpiTyping.git  # clone the project using git
  1. Download docker images to run the software:
cd /path/to/EpiTyping

singularity pull docker://broadinstitute/gatk

singularity pull docker://galclbl/epityping
  1. Execute the script named installation.nf which is responsible for setting up all the reference data and will complete the installation (this might take a while).
nextflow run installation.nf [parameters]  # run the installation script
  1. To save space, delete the work directory (the folder nextflow creates during the run; should free ~130 GB of disk space).
rm -rf work/

Basic parameters

-profile: Choose the executor profile between a local workstation or usage on a SLURM cluster (standard/cluster, default: standard).

--cpu: The number of threads for multi-threading (int, default 8).

--mem: Memory limit for the run ([int].GB, default: 64.GB)

--readlength: The expected Illumina read length for optimal alignment by STAR (int, default 100).

Advanced parameters

  • Use with caution. Might lead to unexpected results if the alternative gtf/fasta files have a different format than the default files.

--humanFasta: url for a human fasta file (default: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_42/GRCh38.primary_assembly.genome.fa.gz)

--humanGTF: url for a human gtf file (default: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_42/gencode.v42.primary_assembly.annotation.gtf.gz)

--mouseFasta: url for a mouse fasta file (default: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M31/GRCm39.primary_assembly.genome.fa.gz)

--mouseGTF: url for a mouse gtf file (default: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M31/gencode.vM31.primary_assembly.annotation.gtf.gz)

--dbSNP: url for a dbSNP vcf file (default: https://ftp.ncbi.nih.gov/snp/archive/b155/VCF/GCF_000001405.39.gz)

--dbSNP_index: url for a dbSNP index file (default: https://ftp.ncbi.nih.gov/snp/archive/b155/VCF/GCF_000001405.39.gz.tbi)

--remap_file: Path to the NCBI chromosome remapping names (default: "$projectDir/genome_files/remapNCBI.txt").

Usage

bgzip /path/to/fastq_dir/* # Use if fastq files are not compressed.

nextflow run /path/to/EpiTyping/main.nf --fastq_folder /path/to/fastq_dir --single true [parameters] # for single-end reads (all files in fastq folder need to be single ended)

nextflow run /path/to/EpiTyping/main.nf --fastq_folder /path/to/fastq_dir --single false [parameters] # for paired-ends reads (default. All files in fastq folder need to be paired ended)
  • Important! All RNA (or DNA) fastq files need to be gz-compressed. Single ended fastq files need to have the following naming format: sample_name.fastq.gz, and paired ended fastq files need to have the following format: sample_name_1.fastq.gz/sample_name_2.fastq.gz
  • sample_name format: sample_name needs to start with an alphabetical character, which is followed by either alphabetical, numerical or underline characters. No hyphens are allowed.

Basic parameters

--fastq_folder: Path to a folder containing the fastq files (default: $projectDir/fastq). The folder can contain multiple fastq files but they all need to represent samples which were prepared with the same library composition (e.g all paired and unstranded; all single and unstranded etc.)

--multiple_samples: Whether there is more than one sample (true/false, default: false)

--single: Whether the RNA-seq library is single ended (true/false, default: false)

-profile: (standard/cluster, default: standard).

--cpu: (int, default 12).

--mem: Memory limit for the run ([int].GB, default: 64.GB)

--keepInter: Whether to keep intermediate alignment and VCF files (true/false, default: false).

--mouse_feeders: Whether to perform mouse contamination cleanup (true/false, default: true).

--strandness: Whether the library is unstranded or strand-specific (int, 0: unstranded; 1: forward strand; 2: reverse strand, default: 0)

--dna_reference: Name of DNA-seq sample without the .fastq.gz extension. Use if a DNA-seq for the examined cell line is available (default: "" for no integration).

--dna_fastq_folder: Path to dna fastq files if DNA-seq for the examined cell line is available. Used together with --dna_sample_name parameter (default: $projectDir/dna_fastq).

--dna_single: Whether the DNA-seq library is single ended (true/false, default: false)

--mouse_feeders_dna: whether to perform mouse contamination cleanup from DNA sample. (true/false, default: false).

  • If a DNA-seq sample is given for integration, make sure all the RNA-seq samples in the fastq folder are from the same cell line

--outdir: Existing or to-be-created path for output files (default: $PWD/output)

Advanced parameters (Use with caution. see note on advanced parameters for installation)

--common_adapters: Full path to a fasta file containing ilumina adapter sequences to be used with the Trimmomatic software. We provide such file in the genome_files directory (default: $projectDir/genome_files/CommonAdapters.fa).

--index: Full path to the human STAR index folder (default: $projectDir/genome_files/star_index)

--mouse_index: Full path to the mouse STAR index folder (default: $projectDir/genome_files/star_mouse_index)

--gtf: Full path to a human gtf file (default: $projectDir/genome_files/star_ref/gencode.v42.primary_assembly.annotation.gtf)

--ref: Full path to a human fasta file (default: $projectDir/genome_files/star_ref/GRCh38.primary_assembly.genome.fa)

--ref_dict: Full path to a human fasta dictionary file (created and to be used by gatk) (default: $projectDir/genome_files/star_ref/GRCh38.primary_assembly.genome.dict)

--dbsnp: Full path to a dbsnp vcf file (default: $projectDir/genome_files/dbSNP155Renamed.vcf.gz)

--exon_intervals: Full path to a bed file containing exon intervals (default: $projectDir/genome_files/GRCh38_exome.bed.gz)

--imprinted_intervals: Full path to a sorted bed file containing imprinted genes intervals (default: $projectDir/genome_files/imprinted_genes_sorted.bed)

--imprinted_genes: Full path to a csv file containing two columns (Gene, Location) with imprinted genes name and their cytogenetic band (default: $projectDir/genome_files/imprinted_genes_location.csv)

--headers: Full path to a vcf headers file for vcf annotation with gene names (default: $projectDir/genome_files/headers_for_annotation.txt)

--bwa_index: Full path to the BWA index files (default: $projectDir/genome_files/bwa_ref/*)

--bwa_mouse_index: Full path to the BWA mouse index files (default: $projectDir/genome_files/bwa_mouse_ref/*)

Examples

Let there be a folder in a path /User/gal/epigenetic_analysis/fastq within a slurm cluster which contains fastq files from 2 paired-end samples that are unstranded: ERR3466738_1.fastq.gz; ERR3466738_2.fastq.gz; ERR3466740_1.fastq.gz; ERR3466740_2.fastq.gz. Suppose these samples grew on MEFs.

  • To run the analysis for these samples and keep intermediate files, use the following command (after running the installation script):
nextflow run /path/to/EpiTyping/main.nf\
 --fastq_folder /User/gal/epigenetic_analysis/fastq\
 --multiple_samples true\
 -profile cluster\
 --keepInter true\
 --mouse_feeders true\
 --outdir /User/gal/epigenetic_analysis/output
  • Suppose you also have a folder /User/gal/epigenetic_analysis/dna_fastq which contains fastq files from whole genome sequencing of H1 ESC line: SRR2070629_1.fastq.gz; SRR2070629_2.fastq.gz. Considering that both ERR3466740 and ERR3466738 are samples taken from H1 ESC, you can integrate the DNA-seq with the RNA-seq to detect monoallelic as well as biallelic expression from the RNA-seq samples with the following command:
nextflow run /path/to/EpiTyping/main.nf\
 --fastq_folder /User/gal/epigenetic_analysis/fastq\
 --multiple_samples true\
 --dna_fastq_folder /User/gal/epigenetic_analysis/dna_fastq\
 --dna_reference SRR2070629\
 -profile cluster\
 --keepInter true\
 --mouse_feeders true\
 --outdir /User/gal/epigenetic_analysis/output

Output

Output files are saved to the outdir folder (default: $PWD/output)

Under the outdir folder, a results folder will also be created with two subdirectories: loi and xci.

  • If keepInter paremeter is true, subdirectories under the outdir folder will be created with the intermediate results.

Under loi, if there is no DNA integration, two files will be created:

  • LOI_matrix.csv: a table where each row corresponds to a gene that exists in the imprinted_genes_location.csv file under the genome_files folder, and each column corresponds to a sample. In this table, "Uninformative" corresponds to an uninformative gene, "Not expressed" corresponds to a gene that was not expressed and "Biallelic by X SNPs" corresponds to the number of biallelic SNPs detected in the given gene.

  • LOI_per_locus_matrix.csv: a table where each row corresponds to an imprinted region that exists in the imprinted_genes_location.csv file under the genome_files folder, and each column corresponds to a sample. In this table, 0 corresponds to an uninformative region and "Biallelic by X genes" corresponds to the number of biallelic genes detected in the given region.

If there is DNA integration, one file will be created in the loi folder:

  • LOI_matrix_with_dna_integration.csv: a table where each row corresponds to a gene that exists in the imprinted_genes_location.csv file under the genome_files folder, and that showed at least one heterozygote position in one of its exons. Each column corresponds to a sample. In this table, "Monoallelic by X SNPs" corresponds to a gene which showed only monoallelic expression in x heterozygote SNPs detected in the DNA-seq, "Not expressed" corresponds to a gene that was not expressed and "Biallelic by X SNPs" corresponds to the number of biallelic SNPs detected in the given gene (only SNPs which were detected in the DNA).

Under the xci folder, two files will appear:

  • XCI_status.csv: a table where each row corresponds to a sample and one column which will either output "XaXa" for no X-chromosome inactivation (XCI), "XaXe" for XCI erosion, "XaXi" for intact XCI or "Male" in case the RNA-seq sample belongs to a male cell line.

  • XCI_status_per_gene.csv: a table where each row corresponds to an X-chromosome gene and each column corresponds to a sample. In this table, the values will either be "Uninformative" if the gene is uninformative, "Unexpressed" if the gene is not expressed or "Biallelic by X SNPs", mentioning the number of biallelically expressed SNPs if the gene is biallelically expressed.

About

Detecting imprinting and X-chromosome inactivation status from RNA-seq

Resources

Stars

Watchers

Forks

Packages

No packages published