Pangenome Graphs - Simulation

This repository contains a section of the original Pangenome Graphs Workshop that was removed to reduce the amount of content being covered.

It involves the use of the wgsim software to simulate sequencing a genome using short-read high throughput sequnecing technology. The short-read data can be used to investigate the difference in alignment percentages when aligning to a pangenome assembly, versus aligning to a single linear reference genome.

This could be run as a short interactive session, but there is still a bit of tidying up that would need to be done.

Genomic data

The data for this exercise can be download by cloning this repository:

git clone https://github.com/mikblack/Pangenome-Graphs-Simulation

The data file 4Sim.fa contains the genome sequence for four different genomes of Neisseria meningitidis (1 real, 3 with specific genomic feactures simulated), which need to be split into individual genomes.

We can see the genome names via:

grep '^>' 4Sim.fa

The above command searches the file 5NM.fa for lines that start ('^') with '>'.

We can split the file into the five separate genomes (using the '>' symbol to identify where each one starts) usign the following commands:

awk '/^>/ { gsub(">","",$1)
            FILE=$1 ".fa"
            print ">" $1 >> FILE
            next}
          { print >> FILE }' 4Sim.fa

We now have four separate .fa (FASTA) files, one per genome:

NC_neisseria.fa
Sim1_3k.fa
Sim2_4k.fa
Sim3_5k.fa

Use `wgsim` to simulate individual genomes and variation

Use wgsim to simulate 2x150bp NGS data, with an error rate of 0.005.

output_folder=simulated_fastq
mkdir -p $output_folder

for f in NC_neisseria.fa Sim1_3k.fa Sim2_4k.fa Sim3_5k.fa
do
    x=$(basename $f .fa)
    echo ${x}

    wgsim -N 1000000 -1 150 -2 150  -e 0.005 -r 0 -R 0 -X 0 ${x}.fa $output_folder/${x}.wgsim_er0.005.R1.fq $output_folder/${x}.wgsim_er0.005.R2.fq
    gzip $output_folder/${x}.wgsim_er0.005.R1.fq
    gzip $output_folder/${x}.wgsim_er0.005.R2.fq

done

wgsim parameters:

-N: number of read pairs (1000000)
-1: length of the first read (150bp)
-2: length of the second read (150bp)
-e: per base sequencing error rate (0.005)
-r: de novo mutation rate (0)
-R: fractio of the genome comprising InDels (0)
-X: probability an indel is extended (0)

The simulated data from each individual can now be aligned to either a linear reference or a pangenome graph.

Build index for graph

#convert the graph into 256 bp chunks, saving as vg format
vg mod -X 256 4Sim_1K96.gfa > 4Sim_1K96_256.vg

# Make temporary directory
mkdir TMP
#build index of xg and gcsa index
vg index -b TMP -t 48 -x 4Sim_1K96_256.xg -g 4Sim_1K96_256.gcsa -k 16 4Sim_1K96_256.vg

Small graph is ok without prunning, complex graph will need to prune first before generating index

Not run?

# pruning: use -M if pruning fails
# vg prune -u -m node-mapping.tmp -t 48 -k 24 ${x}_256.vg > ${x}_256_chopped.vg

# vg index ${x}_256_chopped.vg -x ${x}_256_chopped.xg
# gcsa index
# vg index -b $tem_dir -t 48  -g ${x}_256_chopped.gcsa  ${x}_256_chopped.vg

vg map NGS to graph

data=simulated_fastq/*R1.fq.gz
input_folder=simulated_fastq
output=graph_based_mapping

mkdir -p $output

index=4Sim_1K96_256.gcsa
basename=4Sim_1K96_256

for f in $data
do

    x=$(basename $f R1.fq.gz)
    echo ${x}

    read1=${x}R1.fq.gz
    read2=$(echo $read1|sed 's/R1.fq.gz/R2.fq.gz/')

    echo $read2

    #map paired reads using vg map
    vg map -t 20  -d $basename -g $index  -f $input_folder/$read1 -f $input_folder/$read2 -N $x  > $output/${x}vgmap_4Sim.gam

    #vg stats to check the mapping statistics
    vg stats -a  $output/${x}vgmap_4Sim.gam  >$output/${x}vgmap_4Sim_stats

done

genotying known variants

mkdir vgmap_12e_sim4_allR10S3_typing

# Genotyping
data_gam=graph_based_mapping/*.wgsim_er0.005.vgmap_4Sim.gam
input=graph_based_mapping
output=vgmap_12e_sim4_allR10S3_typing
graph_xg=4Sim_1K96_256.xg
snarls_file=4Sim_1K96_256.xg.snarls

# Generate snarls of graph
vg snarls $graph_xg >$snarls_file

for f in $data_gam
do

    x=$(basename $f .wgsim_er0.005.vgmap_4Sim.gam)
    echo ${x}


    #Calculate the surpport reads ingoring mapping and base quality <5
    #vg pack -t 48 -x $graph_xg -g $input/${x}.wgsim_er0.005.vgmap_4Sim.gam -Q 5 -o $output/${x}vgmap_Sim4_256_aln.pack

    #Calculate the surpport reads
    vg pack -t 12 -x $graph_xg -g $input/${x}.wgsim_er0.005.vgmap_4Sim.gam -o $output/${x}vgmap_sim4_256_aln.pack

    #call variant using the same coordinates and including reference calls (for following compare)
    vg call -t 12 -m 3,10 $graph_xg -k $output/${x}vgmap_sim4_256_aln.pack -r $snarls_file -a  >$output/${x}vgmap_sim4_256_aln.pack_allR10S3.vcf

done

Novel variant calling using graph reference

mkdir vgmap_5e_sim4_allR10S3_novelcalling 

data_gam=graph_based_mapping/*.wgsim_er0.005.vgmap_4Sim.gam
input=graph_based_mapping
output=vgmap_5e_sim4_allR10S3_novelcalling
graph_vg=4Sim_1K96_256.vg
graph_xg=4Sim_1K96_256.xg

#compute snarls (already done above)
#vg snarls $graph_xg >$output/${graph_xg}.snarls

for f in $data_gam
do

    x=$(basename $f .wgsim_er0.005.vgmap_4Sim.gam)
    echo ${x}


    #in order to also consider novel variants from the reads, use the augmented graph and gam (as created in the "Augmentation" example using vg augment -A)
    #Augment augment the graph with all variation from the GAM, saving to aug.vg
    # augment the graph with all variation from the GAM except
    # that implied by soft clips, saving to aug.vg
    # *aug-gam contains the same reads as aln.gam but mapped to aug.vg

    vg augment -t 12 $graph_vg $input/${x}.wgsim_er0.005.vgmap_4Sim.gam -A $output/${x}nofilt_aug.gam >$output/${x}nofilt_aug.vg

    #index the augmented graph
    vg index -t 12 $output/${x}nofilt_aug.vg -x $output/${x}nofilt_aug.xg

    # Compute the all read support from the augmented gam
    vg pack -t 12 -x $output/${x}nofilt_aug.xg -g $output/${x}nofilt_aug.gam  -o $output/${x}nofilt_aug_allR.pack


    #call variant
    vg call -t 12 -m 3,10 $output/${x}nofilt_aug.xg -k $output/${x}nofilt_aug_allR.pack >$output/${x}nofilt_aug_allR.pack.vcf

    #call variant snarl using the same coordinate
    #vg call -t 48 -m 3,10 $output/${x}nofilt_aug.xg -k $output/${x}nofilt_aug_allR.pack -a >$output/${x}nofilt_aug_allR.pack_snarls.vcf

done

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
4Sim.fa		4Sim.fa
4Sim_1k96.gfa		4Sim_1k96.gfa
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Pangenome Graphs - Simulation

Genomic data

Use `wgsim` to simulate individual genomes and variation

Build index for graph

vg map NGS to graph

genotying known variants

Novel variant calling using graph reference

About

Releases

Packages

License

GenomicsAotearoa/Pangenome-Graphs-Simulation

Folders and files

Latest commit

History

Repository files navigation

Pangenome Graphs - Simulation

Genomic data

Use wgsim to simulate individual genomes and variation

Build index for graph

vg map NGS to graph

genotying known variants

Novel variant calling using graph reference

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Use `wgsim` to simulate individual genomes and variation

Packages