morbidostat-genome-analysis

Code for the analysis of morbidostat data. The analysis consists for now of the following steps:

map reads against assembled genome from the first timepoint
produce a pileup of the mapped reads, that can be explored with IGV
analyzes the variation over time of:
- coverage
- consensus frequency
- gaps
- insertions
- secondary/supplementary mappings
- clipped reads

It consists of three workflows:

pileup.nf: maps the reads against the reference genomes and produces pileups and other summary files (see notes/pipeline_description.md for a more detailed description).
extract_stats.nf: produces databases containing the variation of consensus and gap frequency over time.
plots.nf: produces different plots for the analysis (see notes/plots.md for a more detailed description).

test dataset

A test dataset is available to test the steps of the pipeline. This can be downloaded from scicore cluster using:

scp -r [email protected]:/scicore/home/neher/GROUP/data/2022_nanopore_sequencing/experiments/test_dataset .

Substituting username with the appropriate username.

The test dataset has been extracted from the morbidostat run labeled 2022-02-08-RT_test. It contains 2 vials (vial 2 and 3), each one having 3 timepoints (timepoint 1, 2 and 5). For timepoint 1 the annotated genome is included. For the rest of the timepoints, only fastq reads are provided in the reads.fastq.gz files. These files contain only a subset of all the reads (first 10'000 lines of the original fastq file). This corresponds to ~3% of all the reads, and these files are roughly 20-30 Mb each.

In general input data must be placed in an input directory with nested structure:

input_dir/vial_XX/time_YY/
├── assembled_genome
│   ├── assembly.fna
│   └── assembly.gbk
└── reads.fastq.gz

workflow: building a pileup of mapped reads

The workflow pileup.nf builds a pileup of the reads. For every vial, it maps reads for any timepoints to a reference genome, which is the one assembled from the initial timepoint for each vial. It generates the sorted bam file of mappings, together with a pileup matrix and a list of insertions and positions of clipped reads. Usage is as follows:

nextflow run pileup.nf \
    -profile cluster \
    --input_fld test_dataset \
    --time_beg 1 \
    --qual_min 15 \
    --clip_minL 100  \
    -resume

The input parameters are:

profile : if cluster is specified, then SLURM execution is activated. Otherwise local execution is used.
input_fld : folder containing the input reads. It must have the nested vial_n/time_n structure of archived morbidostat experiment data.
time_beg : the label corresponding to the first timepoint, whose assembled genome is used as reference (e.g. 1 for timepoint_1`)
qual_min : parameters of the script used to build the pileup. Only reads with quality higher than the threshold are used, and only insertions that have at least 70% of the sites meeting this threshold.
clip_minL : minimum length of clips that are added to the clip dictionary. Shorter clips are ignored. Nb: only clips from primary mappings are considered in the clip analysis.

results are saved in the results folder, with a structure that mirrors the vial_n/timepoint_n structure of the input folder. Each of these subfolders will contain the following files:

reads.sorted.bam and reads.sorted.bam.bai : sorted bam file (and corresponding index) containing the mapping of the reads against the reference genome.
pileup/allele_counts.npz : pileup of the reads. This is a numpy tensor with dimension (2,6,L) corresponding to (1) forward-reverse reads, (2) allele ["A", "C", "G", "T", "-", "N"] and (3) position.
pileup/insertions.pkl.gz : a nested dictionary of insertions, saved in pickle format and compressed with gzip. The structure is position -> sequence -> [n. forward reads, n. reverse reads].
pileup/clips.pkl.gz : a pickle file containing two dictionaries (accessible with the "count" and "seqs" tags). The first has the form {position -> [fwd clips, rev clips, tot fwd read ends, tot rev read ends]}. Contains the number of forward and reverse soft-clips ecountered in the reads at any given position, provided that the clipped part is longer than the threshold clip_minL. It also contains the total number of reads that end at that position, even if they do not meet the threshold. On a separate dictionary (under the "seqs" tag) for each position we save the clipped nucleotide sequence. This dictionary has the form {position -> {0 or 1 for fwd / rev -> seq }}.
pileup/unmapped.{csv,fastq.gz}: dataframe with list of unmapped reads. It contains the read-id, length, flag and average quality. The fastq.gz file contains these reads.
pileup/non_primary.csv: dataframe with list of mapping info for reads that have supplementary or secondary mappings. The primary one is included too. The structure of this file is described in notes/read_mapping.md.
ref_genome.fa and ref_genome.gbk : symlink to the reference genome used for mapping the reads (same vial, first timepoint), both in genbank and fasta format.

See also notes/pipeline_description.md for a detailed description of the commands/tools used and output files structure. In the current version of the pipeline this workflow also produces plots for secondary/supplementary mappings (see notes/plots.md).

workflow: extract statistics

The workflow extract_stats.nf extracts relevant statistics from the pileups and insertion lists. These include:

frequency of reads being equal to the reference genome
frequency of gaps

These are saved as stats/stats_table_{statname}.pkl.gz files containing pandas dataframes.

The workflow can be run by executing:

nextflow run extract_stats.nf \
    -profile cluster \
    --input_fld results/test_dataset \
    -resume

where --input_fld is the subfolder of results corresponding to the dataset of interest, containing the pileups for different vials and different timepoints.

workflow: summary plots with pileup analysis

The workflow plots.nf executes scripts to analyze the results and produce figures and summary csv files. The workflow can be launched with:

nextflow run plots.nf \
    -profile cluster \
    --input_fld results/test_dataset \
    -resume

The --input_fld flag is used to specify the folder in which the results of the previous workflow are saved. The output figures are saved in figures/subfolder where the subfolder has the name of the run, and they are separated by vial. The executed scripts are:

plot_coverage.py
plot_consensus_freq.py
plot_gaps.py
plot_insertions.py
plot_clips.py

For each of these scripts, usage is:

python3 scripts/pileupplots_consensus_frequency.py \
    --vial_fld results/2022-02-08_RT_test/vial_02 \
    --fig_fld figs/2022-02-08_RT_test/vial_02 \
    --verbose \
    --show

Where:

--vial_fld is the folder containing the results for a single vial (pileup, reference genome...)
--fig_fld is the output folder in which to save figures
--show if specified figures are visualized with plt.show(), and otherwise they are simply saved.

For more informations on the files and figures produced see notes/plots.md.

Name		Name	Last commit message	Last commit date
Latest commit History 132 Commits
conda_envs		conda_envs
notes		notes
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
extract_stats.nf		extract_stats.nf
nextflow.config		nextflow.config
pileup.nf		pileup.nf
plots.nf		plots.nf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

morbidostat-genome-analysis

test dataset

workflow: building a pileup of mapped reads

workflow: extract statistics

workflow: summary plots with pileup analysis

About

Releases 2

Packages

Contributors 2

Languages

License

mmolari/morbidostat-genome-analysis

Folders and files

Latest commit

History

Repository files navigation

morbidostat-genome-analysis

test dataset

workflow: building a pileup of mapped reads

workflow: extract statistics

workflow: summary plots with pileup analysis

About

Resources

License

Stars

Watchers

Forks

Releases 2

Packages 0

Contributors 2

Languages

Packages