-
Notifications
You must be signed in to change notification settings - Fork 1
How to quickly check the insert size distribution of a new sequencing library
After you get the reads, be they paired end or mate pairs, from the sequencing centre, you sometimes want to chek the distribution of the insert sizes. There are probably several ways to do this, but here is what we tend to use.
Needed: either the enitre read dataset, or a subset of it (see below) a reference genome, or assembly, with at least a certain fraction of contigs/scaffolds large than the expected insert size.
All programs are avaliable from the common bin
folder unless otherwise indicated. We'll be using:
-
seqtk
(optional) bwa
samtools
- your favorite plotting program
-- Step 1: - optional: creating subsets of the data
For example, to create files with 1 million randomly selected reads - still paired up:
seqtk sample ../read1.fastq.gz 1000000 >read1_1Mreads.fastq
seqtk sample ../read2.fastq.gz 1000000 >read2_1Mreads.fastq
To take 10% of the reads instead:
seqtk sample ../read1.fastq.gz 0.1 >read1_10%reads.fastq
seqtk sample ../read2.fastq.gz 0.1 >read2_10%reads.fastq
NOTE, seqtk
will happily accept gzipped fastq files.
Step 2: indexing your reference:
I prefer to make a new folder close to where the reference file is:
mkdir bwa_index
Create a symlink (or 'soft link', see this page to the reference (optional, but this makes the next commands more readable).
ln -s path/to/reference.fasta .
NOTE the period .
at the end!
Create the index:
bwa-0.6.2 index -a bwtsw reference.fasta >bwa-index.out 2>&1
(all messages from the program will go to the file bwa_index.out)
Step 3: aligning the reads:
Make a folder for the results and cd
into it. Make symlinks to the read files and reference index (optional, but this makes the next commands more readable).
ln -s path/to/read1_1Mreads.fastq .
ln -s path/to/read2_1Mreads.fastq .
ln -s path/to/bwa_index .
Align reads
bwa-0.6.2 aln -t 24 bwa_index/reference.fasta read1_1Mreads.fastq >read1_1Mreads.sai 2> read1_1Mreads.bwa.err
bwa-0.6.2 aln -t 24 bwa_index/reference.fasta read2_1Mreads.fastq >read2_1Mreads.sai 2> read2_1Mreads.bwa.err
NOTE older Illumina data may require you to add the -I
flag, check the documentation.
Generate sam file:
bwa-0.6.2 sampe bwa_index/reference.fasta \
read1_1Mreads.sai \
read2_1Mreads.sai \
read1_1Mreads.fastq \
read2_1Mreads.fastq \
1>read1+2_1Mreads.sam 2> read1+2_1Mreads.bwa.err
Some tools need a sorted, compressed bam file. You can generate this using:
module load samtools
cat read1+2_1Mreads.sam | samtools view -uS - | samtools sort - read1+2_1Mreads.sorted
The resulting file will be called read1+2_1Mreads.sorted.bam.
Alternatively, you can do it like this:
bwa-0.6.2 sampe bwa_index/reference.fasta \
read1_1Mreads.sai \
read2_1Mreads.sai \
read1_1Mreads.fastq \
read2_1Mreads.fastq \
2> read1+2_1Mreads.bwa.err | samtools view -uS - | samtools sort - read1+2_1Mreads.sorted
Now you could for example get some basic metrics on the mapping by running:
samtools flagstat 130301_R1+2_001_10Mreads.sorted.bam
Step 4: plotting.
To get the insert size distribution, extract the following from the sam file:
- the lines describing the mapped reads (these start with the name of the read)
- for these lines, the value of column 9, the insert size as determined by bwa, if it is positive and larger than 0
Back to the CEES-HPC wiki home page