SCSilicon is a tool for synthetic single-cell DNA sequencing data generation.
- python3
- numpy>=1.16.1
- pandas>=0.23.4,<0.24
- tasklogger>=0.4.0
- wget>=3.2
- seaborn>=0.11.1
- matplotlib>=3.0.2
- SCSsim
All Python packages listed above are installed automatically with SCSilicon if they are not already present in your Python environment.
To install SCSsim, please refer to the SCSsim README.
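If you want to confirm that an existing environment already meets the version bounds listed above, something like the following can help. It is a minimal sketch that uses only the standard library; the helper names (version_tuple, satisfies, check_environment) are hypothetical and not part of SCSilicon, and the version parser is deliberately simple.

```python
from importlib import metadata

# Lower bounds copied from the requirements list; pandas also has an
# exclusive upper bound (<0.24).
MIN_VERSIONS = {
    "numpy": "1.16.1",
    "pandas": "0.23.4",
    "tasklogger": "0.4.0",
    "wget": "3.2",
    "seaborn": "0.11.1",
    "matplotlib": "3.0.2",
}
MAX_EXCLUSIVE = {"pandas": "0.24"}


def version_tuple(version):
    """Turn '1.16.1' into (1, 16, 1), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def satisfies(name, installed):
    """True if the installed version string is within the listed bounds."""
    found = version_tuple(installed)
    if found < version_tuple(MIN_VERSIONS[name]):
        return False
    upper = MAX_EXCLUSIVE.get(name)
    return upper is None or found < version_tuple(upper)


def check_environment():
    """Return {package: status} for every requirement in MIN_VERSIONS."""
    report = {}
    for name in MIN_VERSIONS:
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = satisfies(name, installed)
        report[name] = "ok" if ok else "unsupported version " + installed
    return report


if __name__ == "__main__":
    for package, status in check_environment().items():
        print(package, status)
```

SCSsim is not a Python package, so it is not covered by this check.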
To install with pip, run the following from a terminal:
pip install scsilicon
To clone the repository and install manually, run the following from a terminal:
git clone https://github.com/xikanfeng2/SCSilicon.git
cd SCSilicon
python setup.py install
The following code runs SCSilicon.
import scsilicon as scs
# create SCSiliconParams object
params = scs.SCSiliconParams()
# download all necessary reference files; run this once, then remove this line after the first run
scs.download_ref_data(params)
# simulate snp samples
snp_simulator = scs.SNPSimulator()
snp_simulator.sim_samples(params)
# simulate snv samples
snv_simulator = scs.SNVSimulator()
snv_simulator.sim_samples(params)
# simulate indel samples
indel_simulator = scs.IndelSimulator()
indel_simulator.sim_samples(params)
# simulate cnv samples
cnv_simulator = scs.CNVSimulator()
cnv_simulator.sim_samples(params)
All the general parameters for the SCSilicon simulation are stored in an SCSiliconParams object. Let's create a new one.
params = scs.SCSiliconParams()
- out_dir: string, optional, default: './'. The output directory path.
- ref: string, optional, default: hg19. The reference genome version: hg19 or hg38.
- chrom: string, optional, default: chr22. The chromosome used for read generation: all or a specific chromosome.
- layout: string, optional, default: 'SE'. The read layout: PE or SE (PE for paired-end and SE for single-end).
- coverage: int, optional, default: 5. The sequencing coverage.
- isize: int, optional, default: 260. The mean insert size for paired-end sequencing.
- threads: int, optional, default: 1. The number of threads to use for read generation.
- verbose: int or boolean, optional, default: 1. If True or > 0, print log messages.
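To get a feel for how coverage and layout translate into data volume, the usual relation is reads = coverage x region length / bases per fragment. The sketch below applies that arithmetic. The read length is an assumption (it is not one of the parameters listed above), the function name reads_needed is hypothetical, and the actual number of reads SCSilicon and SCSsim emit may differ.

```python
def reads_needed(coverage, region_len, read_len, layout="SE"):
    """Estimate how many reads (SE) or read pairs (PE) give the target coverage.

    read_len is an assumed value supplied by the caller; it is not an
    SCSilicon parameter.
    """
    if layout not in ("SE", "PE"):
        raise ValueError("layout must be 'SE' or 'PE'")
    if coverage <= 0 or region_len <= 0 or read_len <= 0:
        raise ValueError("coverage, region_len and read_len must be positive")
    total_bases = coverage * region_len
    # A paired-end fragment contributes two reads' worth of bases.
    bases_per_fragment = read_len * (2 if layout == "PE" else 1)
    # Ceiling division without floating point.
    return -(-total_bases // bases_per_fragment)


if __name__ == "__main__":
    # Toy numbers: 5x coverage of a 1,000,000 bp region with 100 bp reads.
    print(reads_needed(5, 1_000_000, 100, layout="SE"))
    print(reads_needed(5, 1_000_000, 100, layout="PE"))
```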
To inspect the current parameter values, we can use the get_params function:
params.get_params()
# console log: {'out_dir': './', 'ref': 'hg19', 'chrom': 'chr20', 'layout': 'PE', 'coverage': 5, 'isize': 260, 'threads': 10}
To give a parameter a new value, we can use the set_params function:
params.set_params(ref='hg38', chrom='chr22')
# console log: {'out_dir': './', 'ref': 'hg38', 'chrom': 'chr22', 'layout': 'PE', 'coverage': 5, 'isize': 260, 'threads': 10}
We can also set parameters directly when we create a new SCSiliconParams object:
params = scs.SCSiliconParams(ref='hg38', chrom='chr22')
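Because a long simulation can fail late on a mistyped value, it may be worth checking keyword arguments before handing them to SCSiliconParams. The following is a minimal sketch: check_general_params is a hypothetical helper, not part of SCSilicon; the allowed values for ref and layout come from the parameter descriptions, and the requirement that coverage, isize and threads be positive integers is an assumption.

```python
ALLOWED_VALUES = {
    "ref": {"hg19", "hg38"},
    "layout": {"PE", "SE"},
}
# Assumption: these three parameters are expected to be positive integers.
POSITIVE_INTS = ("coverage", "isize", "threads")


def check_general_params(**kwargs):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    for key, value in kwargs.items():
        if key in ALLOWED_VALUES:
            if value not in ALLOWED_VALUES[key]:
                choices = ", ".join(sorted(ALLOWED_VALUES[key]))
                problems.append(f"{key}={value!r} is not one of: {choices}")
        elif key in POSITIVE_INTS:
            # bool is a subclass of int, so exclude it explicitly.
            is_int = isinstance(value, int) and not isinstance(value, bool)
            if not is_int or value < 1:
                problems.append(f"{key}={value!r} should be a positive integer")
    return problems


if __name__ == "__main__":
    settings = {"ref": "hg38", "chrom": "chr22", "layout": "PE"}
    issues = check_general_params(**settings)
    if issues:
        print("\n".join(issues))
    else:
        print("settings look fine")
        # params = scs.SCSiliconParams(**settings)
```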
Once we have a set of parameters we are happy with, we can use SNPSimulator to simulate samples that contain SNPs.
snp_simulator = scs.SNPSimulator()
snp_simulator.sim_samples(params)
- cell_no: int, optional, default: 1. The number of cells in this simulation.
- snp_no: int, optional, default: 1000. The number of SNPs in each sample.
For each sample, SNPSimulator randomly selects SNPs from the dbSNP file; the snp_no parameter controls how many are selected.
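The idea of drawing a fixed number of SNPs at random can be illustrated with the standard library. This is only a sketch of the general technique, not SCSilicon's actual implementation: the record layout (chromosome, position, reference allele, alternate allele), the toy candidate list, and the function name pick_snps are all assumptions made for the example.

```python
import random


def pick_snps(candidates, snp_no, seed=None):
    """Draw snp_no distinct records from candidates, returned sorted by position."""
    if snp_no > len(candidates):
        raise ValueError("snp_no exceeds the number of candidate SNPs")
    rng = random.Random(seed)
    # sample() draws without replacement, so no SNP is picked twice.
    chosen = rng.sample(candidates, snp_no)
    return sorted(chosen, key=lambda record: record[1])


if __name__ == "__main__":
    # Toy stand-in for dbSNP records: (chrom, position, ref allele, alt allele).
    toy_dbsnp = [("chr22", 16_050_000 + 100 * i, "A", "G") for i in range(5000)]
    selected = pick_snps(toy_dbsnp, snp_no=1000, seed=1)
    print(len(selected), selected[0])
```

Passing a seed makes the draw reproducible, which is convenient when the selected SNPs serve as a ground truth.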
Similar to SCSiliconParams, SNPSimulator uses the functions get_params and set_params to get or set parameters.
The SNPSimulator object uses the function sim_samples to generate FASTQ files for each sample.
snp_simulator.sim_samples(params)
If you want to simulate multiple samples at once, you can use the cell_no parameter to control this.
snp_simulator.set_params(cell_no=10)
# or set the parameter when creating the object
snp_simulator = scs.SNPSimulator(cell_no=10)
# generating reads
snp_simulator.sim_samples(params)
The code above simulates 10 samples in FASTQ format in a single run.
The sim_samples function generates two output files for each sample in your output directory.
- sample{1}-snps.txt: the SNPs included in this sample. This file can be regarded as the ground truth for SNP detection software.
- sample{1}.fq: the read data of this sample in FASTQ format.
{1} is the sample number, as in sample1-snps.txt, sample2-snps.txt.
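After a run, a quick sanity check is to confirm that the expected files exist and to count the reads in each FASTQ file. The sketch below builds the file names from the pattern described above and counts records by assuming the common four-lines-per-record FASTQ layout (no wrapped sequence lines). The helper names are hypothetical and not part of SCSilicon.

```python
from pathlib import Path


def expected_snp_outputs(out_dir, cell_no):
    """Return (snp_file, fastq_file) path pairs for samples 1..cell_no."""
    out = Path(out_dir)
    return [
        (out / f"sample{i}-snps.txt", out / f"sample{i}.fq")
        for i in range(1, cell_no + 1)
    ]


def count_fastq_records(path):
    """Count records in a FASTQ file, assuming exactly four lines per record."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4


if __name__ == "__main__":
    for snp_file, fastq_file in expected_snp_outputs("./", cell_no=10):
        if fastq_file.exists():
            print(fastq_file.name, count_fastq_records(fastq_file), "reads")
        else:
            print(fastq_file.name, "not found")
```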
We can use CNVSimulator to simulate samples with CNVs.
cnv_simulator = scs.CNVSimulator()
cnv_simulator.sim_samples(params)
- cell_no: int, optional, default: 1. The number of cells in this simulation.
- bin_len: int, optional, default: 500000. The fixed bin length.
- seg_no: int, optional, default: 10. The number of segments for each chromosome.
- cluster_no: int, optional, default: 1. The number of cell clusters for multiple-sample simulation.
- normal_frac: float, optional, default: 0.4. The fraction of normal cells.
- noise_frac: float, optional, default: 0.1. The noise fraction for the CNV matrix.
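Two of these parameters lend themselves to quick back-of-the-envelope arithmetic: bin_len determines how many bins (matrix columns) a chromosome yields, and normal_frac determines roughly how many of the simulated cells are normal. The sketch below shows that arithmetic only. The function names are hypothetical, the chromosome length used in the example is the commonly cited hg19 chr22 length, and how SCSilicon itself handles the last partial bin or rounds the normal-cell count is an assumption.

```python
def n_bins(chrom_len, bin_len=500000):
    """Number of fixed-length bins needed to cover a chromosome.

    Assumes a trailing partial bin is kept (ceiling division).
    """
    if chrom_len <= 0 or bin_len <= 0:
        raise ValueError("chrom_len and bin_len must be positive")
    return -(-chrom_len // bin_len)


def n_normal_cells(cell_no, normal_frac=0.4):
    """Approximate count of normal cells; rounding to nearest is an assumption."""
    if not 0 <= normal_frac <= 1:
        raise ValueError("normal_frac must be between 0 and 1")
    return round(cell_no * normal_frac)


if __name__ == "__main__":
    HG19_CHR22_LEN = 51_304_566  # commonly cited length of chr22 in hg19
    print("bins:", n_bins(HG19_CHR22_LEN))
    print("normal cells out of 10:", n_normal_cells(10))
```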
Similar to SCSiliconParams, CNVSimulator uses the functions get_params and set_params to get or set parameters.
The CNVSimulator object also uses the function sim_samples to generate FASTQ files for each sample.
cnv_simulator.sim_samples(params)
The seg_no parameter controls the number of segments in each chromosome.
cnv_simulator.set_params(seg_no=8)
# or set the parameter when creating the object
cnv_simulator = scs.CNVSimulator(seg_no=8)
# generating reads
cnv_simulator.sim_samples(params)
The code above splits each chromosome into 8 segments, which is useful for segmentation experiments with single-cell CNV detection tools.
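To make the notion of segments concrete, the sketch below splits a run of bins into a given number of contiguous segments by drawing random breakpoints. It illustrates the general idea only: how SCSilicon actually places its breakpoints is not described here, and random_segments is a hypothetical helper.

```python
import random


def random_segments(n_bins, seg_no, seed=None):
    """Split bins 0..n_bins-1 into seg_no contiguous, non-empty segments.

    Returns a list of (start, end) bin indices with end exclusive.
    """
    if not 1 <= seg_no <= n_bins:
        raise ValueError("seg_no must be between 1 and n_bins")
    rng = random.Random(seed)
    # seg_no segments need seg_no - 1 distinct interior breakpoints.
    cuts = sorted(rng.sample(range(1, n_bins), seg_no - 1))
    edges = [0] + cuts + [n_bins]
    return list(zip(edges[:-1], edges[1:]))


if __name__ == "__main__":
    # 103 bins corresponds to a ~51 Mb chromosome at bin_len=500000.
    for start, end in random_segments(103, seg_no=8, seed=0):
        print(f"bins {start}..{end - 1} ({end - start} bins)")
```

Each detected breakpoint from a CNV caller can then be compared against these segment boundaries.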
If you want to simulate multiple samples at once, you can use the cell_no parameter to control this.
cnv_simulator.set_params(cell_no=10)
# or set the parameter when creating the object
cnv_simulator = scs.CNVSimulator(cell_no=10)
# generating reads
cnv_simulator.sim_samples(params)
The code above simulates 10 samples in FASTQ format in a single run.
For multiple-sample simulation, you can use the cluster_no parameter to separate the samples into several clusters.
cnv_simulator.set_params(cluster_no=5)
# or set the parameter when creating the object
cnv_simulator = scs.CNVSimulator(cluster_no=5)
# generating reads
cnv_simulator.sim_samples(params)
The sim_samples function generates the following output files in your output directory.
- cnv.csv: the CNV matrix with cells as rows and bins as columns. This file can be regarded as the ground truth for CNV detection software.
- segments.csv: the segment information for each chromosome. This file can be regarded as the ground truth for segmentation experiments.
- clusters.csv: the cluster information for each sample. This file can be regarded as the ground truth for cell clustering experiments.
- sample{1}.fq: the read data of this sample in FASTQ format.
{1} is the sample number, as in sample1.fq, sample2.fq.
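When cnv.csv is used as a ground truth, one simple metric is the fraction of bins where a caller's copy number matches it. The sketch below computes that per-bin accuracy with the standard csv module. The exact layout of cnv.csv is an assumption here (a header row of bin labels and a first column of cell names, with cells as rows and bins as columns as described above); adjust the reader if the real file differs. The function names and toy data are hypothetical.

```python
import csv
import io


def read_cnv_matrix(handle):
    """Read a cells-by-bins matrix into {cell_name: [copy numbers]}.

    Assumes the first row is a header and the first column holds cell names.
    """
    rows = list(csv.reader(handle))
    body = rows[1:]  # skip the header row of bin labels
    return {row[0]: [int(float(value)) for value in row[1:]] for row in body}


def bin_accuracy(truth, calls):
    """Fraction of (cell, bin) entries where the call equals the ground truth."""
    total = matched = 0
    for cell, true_row in truth.items():
        called_row = calls[cell]
        if len(called_row) != len(true_row):
            raise ValueError(f"{cell}: bin counts differ")
        total += len(true_row)
        matched += sum(t == c for t, c in zip(true_row, called_row))
    return matched / total


if __name__ == "__main__":
    # Toy data standing in for cnv.csv and a caller's output.
    truth_csv = ",bin1,bin2,bin3,bin4\nsample1,2,2,3,3\nsample2,2,1,1,2\n"
    calls_csv = ",bin1,bin2,bin3,bin4\nsample1,2,2,3,2\nsample2,2,1,1,2\n"
    truth = read_cnv_matrix(io.StringIO(truth_csv))
    calls = read_cnv_matrix(io.StringIO(calls_csv))
    print(bin_accuracy(truth, calls))
```

For real files, replace the io.StringIO objects with open("cnv.csv", newline="") and the path to the caller's output.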
The CNVSimulator object has the function visualize_cnv_matrix to draw a heatmap of the CNV matrix.
cnv_simulator.visualize_cnv_matrix(out_prefix)
This function saves the heatmap in PDF format to a file named out_prefix.pdf. An example CNV heatmap is shown below:
Once we have a set of parameters we are happy with, we can use SNVSimulator to simulate samples that contain SNVs.
snv_simulator = scs.SNVSimulator()
snv_simulator.sim_samples(params)
- cell_no: int, optional, default: 1. The number of cells in this simulation.
- snv_no: int, optional, default: 1000. The number of SNVs in each sample.
Once we have a set of parameters we are happy with, we can use IndelSimulator to simulate samples that contain indels.
indel_simulator = scs.IndelSimulator()
indel_simulator.sim_samples(params)
- cell_no: int, optional, default: 1. The number of cells in this simulation.
- in_no: int, optional, default: 1000. The number of insertions in each sample.
- del_no: int, optional, default: 1000. The number of deletions in each sample.
Feng, X., Chen, L. SCSilicon: a tool for synthetic single-cell DNA sequencing data generation. BMC Genomics 23, 359 (2022).
If you have any questions or require assistance using SCSilicon, please contact us at [email protected].