This repository demonstrates a series of analyses that assess the impact of DNA accessibility on CRISPR-Cas9 cleavage efficiency. The analysis combines multiple open-access, genome-scale datasets, including GUIDE-Seq, CIRCLE-Seq, DNase-Seq and RNA-Seq, to systematically characterize key determinants of CRISPR-induced editing efficiency.
Highlighted results:
- The condensed chromatin conformation has the potential to abrogate the correlation between gRNA:target similarity and CRISPR-induced cleavage frequency.
- CRISPR-induced sequence editing is possible even in regions where the vast majority of endogenous genes are silent.
The full analysis and figure generation can be found in this Python notebook.
List of files required to run the full analysis:
The pipelines that generate all files required to run the full analysis described above take publicly available datasets from multiple resources:
Prepare the sorted BAM file for the calculation of read count per million mappable reads per base pair (RPM).
| Dataset | Assay | Link |
| --- | --- | --- |
| HEK293T DNA accessibility | DNase-Seq | ENCFF774HUB.bam |
| U2OS DNA accessibility | DNase-Seq | SRR4413990.fastq |
Preprocessing of DNase-Seq on HEK293T:
python code/ -i raw_data/HEK/ENCFF774HUB.bam -p 'bam' -r raw_data/HG19.fasta
Preprocessing of DNase-Seq on U2OS:
python code/ -i raw_data/U2OS/SRR4413990.fastq.gz -p 'fastq' -r raw_data/HG19.fasta
The HG19 reference genome can be downloaded from: GRCh37/hg19
and indexed by:
bwa index -a bwtsw raw_data/HG19.fasta
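The RPM value named above (read count per million mappable reads per base pair) can be illustrated with a small worked example. This is a minimal sketch, not the repository's implementation: the function name `rpm_per_bp` and its parameters are hypothetical, and it assumes the normalization divides the region's read count first by the library size in millions and then by the region length.

```python
def rpm_per_bp(reads_in_region: int, total_mapped_reads: int, region_length_bp: int) -> float:
    """Read count per million mapped reads, per base pair of the region.

    Hypothetical helper for illustration; the exact normalization used by
    the repository's scripts is an assumption.
    """
    if total_mapped_reads <= 0 or region_length_bp <= 0:
        raise ValueError("total_mapped_reads and region_length_bp must be positive")
    # Scale the raw count by library size (in millions of mapped reads).
    reads_per_million = reads_in_region / (total_mapped_reads / 1_000_000)
    # Normalize by region length so regions of different sizes are comparable.
    return reads_per_million / region_length_bp


# 50 reads in a 200 bp region, from a library of 25 million mapped reads.
example = rpm_per_bp(reads_in_region=50, total_mapped_reads=25_000_000, region_length_bp=200)
print(example)
```

Dividing by region length matters here because promoter regions and cleavage-site windows differ in size.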
| Dataset | Assay/Platform | Link |
| --- | --- | --- |
| GUIDE-Seq identified | GUIDE-Seq | GUIDEseq_allgRNAs_identified or Supplementary Table 2 |
| CIRCLE-Seq identified | CIRCLE-Seq | CIRCLEseq_allgRNAs_identified or Supplementary Table 2 |
| HG19 gene coordinates | NCBI RefSeq | UCSC Table Browser |
| HEK293T transcriptome | NextSeq 500 | SRR3997505 |
| U2OS transcriptome | HiSeq 2000 | ERR191523 |
For pre-defined promoter regions, use this file: hg19_allTSS_1000up_200down.bed.gz
For more details on the gene region file, see: rnaseq.ipynb
Add a DNase-Seq RPM column to the CRISPR-induced cleavage sites and gene promoters:
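The file name `hg19_allTSS_1000up_200down.bed.gz` suggests promoter regions spanning 1000 bp upstream to 200 bp downstream of each transcription start site (TSS). The sketch below shows how such a strand-aware window could be derived. The function `promoter_window` is hypothetical, and the coordinate convention is simplified; see rnaseq.ipynb for how the actual file was built.

```python
from typing import Tuple


def promoter_window(tss: int, strand: str, upstream: int = 1000, downstream: int = 200) -> Tuple[int, int]:
    """Return (start, end) of a promoter window around a TSS.

    Hypothetical helper: "upstream" is relative to the direction of
    transcription, so the window is mirrored for minus-strand genes.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    # Clamp at the chromosome start so coordinates never go negative.
    return max(start, 0), end


plus_window = promoter_window(5000, "+")
minus_window = promoter_window(5000, "-")
print(plus_window, minus_window)
```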
DNase-Seq RPM on HEK293T gene promoter:
python code/ -L raw_data/transcriptome/hg19_allTSS_1000up_200down.bed.gz -c "HEK293T" -b raw_data/HEK/HEK.se50.DNaseSeq.sorted.bam -o processed_data/HEK_TSS_1000up_200down_DNaseSeq.csv
DNase-Seq RPM on U2OS gene promoter:
python code/ -L raw_data/transcriptome/hg19_allTSS_1000up_200down.bed.gz -c "U2OS" -b raw_data/U2OS/SRR4413990_trimmed.sorted.bam -o processed_data/U2OS_TSS_1000up_200down_DNaseSeq.csv
DNase-Seq RPM on HEK293T GUIDE-Seq identified cleavage sites:
python code/ -L raw_data/GUIDEseq_allgRNAs_identified.csv -c "HEK293T" -b raw_data/HEK/HEK.se50.DNaseSeq.sorted.bam -w 100 -o processed_data/HEK_GUIDESeq_DNaseSeq.csv
DNase-Seq RPM on U2OS GUIDE-Seq identified cleavage sites:
python code/ -L raw_data/GUIDEseq_allgRNAs_identified.csv -c "U2OS" -b raw_data/U2OS/SRR4413990_trimmed.sorted.bam -w 100 -o processed_data/U2OS_GUIDESeq_DNaseSeq.csv
DNase-Seq RPM on HEK293T CIRCLE-Seq identified cleavage sites:
python code/ -L raw_data/CIRCLEseq_allgRNAs_identified_matched.csv -c "HEK293T" -b raw_data/HEK/HEK.se50.DNaseSeq.sorted.bam -w 100 -o processed_data/HEK_CIRCLESeq_DNaseSeq.csv
DNase-Seq RPM on U2OS CIRCLE-Seq identified cleavage sites:
python code/ -L raw_data/CIRCLEseq_allgRNAs_identified_matched.csv -c "U2OS" -b raw_data/U2OS/SRR4413990_trimmed.sorted.bam -w 100 -o processed_data/U2OS_CIRCLESeq_DNaseSeq.csv
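The cleavage-site commands above pass `-w 100`. As an illustration only, the sketch below assumes this means extending each site by a flank of 100 bp on both sides before counting DNase-Seq reads; the actual meaning of `-w` is defined by the repository's script and may differ. The helper names and the toy read intervals are hypothetical.

```python
from typing import Iterable, Tuple


def flanked_window(site_start: int, site_end: int, flank: int = 100) -> Tuple[int, int]:
    """Extend a site by `flank` bp on each side (assumed meaning of -w)."""
    return max(site_start - flank, 0), site_end + flank


def count_overlapping_reads(reads: Iterable[Tuple[int, int]], window_start: int, window_end: int) -> int:
    """Count reads whose half-open interval [start, end) overlaps the window."""
    return sum(1 for start, end in reads if start < window_end and end > window_start)


# Toy read alignments as (start, end) pairs on one chromosome.
reads = [(880, 930), (950, 1000), (1090, 1140), (1200, 1250)]
window = flanked_window(1000, 1023, flank=100)
n_reads = count_overlapping_reads(reads, *window)
print(window, n_reads)
```

The resulting count would then be normalized by library size and window length to give an RPM value.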
Open analysis.ipynb and replace the file names in the corresponding input DataFrames with the correct ones.