Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

better support for VCF files #14

Open
emmanuelparadis opened this issue Mar 10, 2015 · 8 comments
Open

better support for VCF files #14

emmanuelparadis opened this issue Mar 10, 2015 · 8 comments

Comments

@emmanuelparadis
Copy link

I already mentioned this in the wiki, but I can develop a bit more here. It seems that VCF will be the format of the future for pop gen. Several packages offer support for VCF files but, IMHO, they are limited and not well integrated with other packages (I exclude the package on BioConductor to read variant annotations which has another objective). pegas has read.vcf but it is limited in its performance. I propose to improve the situation. These last days, I've played with the 1000 Genomes data for chr. Y (62,042 loci). I tried to set code for scannning the whole file and finding which loci are "real" SNP (ie, loci with only 2 alleles different by a single substition): this takes ~3 sec. Then it'd be possible to read selectively some loci (read.vcf can do that already). The approach is to use only R code to avoid relying on external libraries (it'll be probably necessary to write some code in C at some stage). For instance, this code to find SNPs is ~ 20 lines.

An issue is the file size we should expect. Are VCF files from pop-gen studies as big as those output by the 1000 Genomes project? Currently, handling files with 10,000's loci is not a problem, and I believe it should be relatively easy to reach 100,000's loci. Should we expect to handle files/data with 10^6 loci?

@jgx65
Copy link

jgx65 commented Mar 11, 2015

Agreed with you Emmanuel, VCF is the way to go, but we will need parsers from and to other formats, including allele frequencies. I think we should work on being able to handle 10^6 loci, 10^4 is routine and 10^5 already quite common. Perhaps playing with chr1 of the 1000 project rather than ChrY would put us in the right ball park?

@thibautjombart
Copy link
Contributor

Sounds exciting!
I agree VCF seems to be the way to go. 10^5 SNPs x 100s of individuals also
sounds more like the kind of dataset I am used to. I think we can assume
that anything substantially bigger, e.g. large collections of full
eucaryote genomes, will be handled on computer clusters and pose a
different problem.
Cheers
Thibaut

On Wed, Mar 11, 2015 at 9:11 AM, Jerome Goudet [email protected]
wrote:

Agreed with you Emmanuel, VCF is the way to go, but we will need parsers
from and to other formats, including allele frequencies. I think we should
work on being able to handle 10^6 loci, 10^4 is routine and 10^5 already
quite common. Perhaps playing with chr1 of the 1000 project rather than
ChrY would put us in the right ball park?


Reply to this email directly or view it on GitHub
#14 (comment)
.

@grunwald
Copy link

I am totally on board. We need better vcf tools for larger data sets that allow data for haploid, diploid or polyploid SNPs, ability to do genome scans for Fst, pi, etc. and gene content obtained from gff.

@grunwald
Copy link

I am totally on board. We need good vcf tools for larger GBS data sets that allow data for haploid, diploid or polyploid SNPs, ability to do genome scans with varying window sizes for Fst, pi, linkage, etc. and gene content obtained from gff. I like the idea of using moderate data matrices rather than whole genomes. Will this be based on the genlight object? Ability to parse to loci and genind is important.

One issue that we are struggling with is that likelihood that each individual is a new multilocus genotype (MLG) requires tools for defining MLG boundaries. How many different SNPs does it take to be a different genotype? Once we scale to 1,000s of SNPs aan MLG is very different to what we are used to in traditional data such as SSR. This is important for my group as we need understand if there are clones. How many SNP differences make it another clone, etc.

I like the idea of starting with 10^6 loci using chr1 of 1000 genomes as a starting point.

Cheers,
Nik

On Mar 11, 2015, at 3:18 AM, Thibaut Jombart [email protected] wrote:

Sounds exciting!
I agree VCF seems to be the way to go. 10^5 SNPs x 100s of individuals also
sounds more like the kind of dataset I am used to. I think we can assume
that anything substantially bigger, e.g. large collections of full
eucaryote genomes, will be handled on computer clusters and pose a
different problem.
Cheers
Thibaut

On Wed, Mar 11, 2015 at 9:11 AM, Jerome Goudet [email protected]
wrote:

Agreed with you Emmanuel, VCF is the way to go, but we will need parsers
from and to other formats, including allele frequencies. I think we should
work on being able to handle 10^6 loci, 10^4 is routine and 10^5 already
quite common. Perhaps playing with chr1 of the 1000 project rather than
ChrY would put us in the right ball park?


Reply to this email directly or view it on GitHub
#14 (comment)
.


Reply to this email directly or view it on GitHub #14 (comment).

Niklaus J. Grünwald
Research Plant Pathologist | Horticultural Crops Research Laboratory | USDA ARS
Professor (courtesy) | Department of Botany and Plant Pathology | Oregon State University
3420 NW Orchard Ave. | Corvallis, OR 97330 | USA | Tel 541.738-4049 | Fax 541.738-4025
grunwaldlab.cgrb.oregonstate.edu http://grunwaldlab.cgrb.oregonstate.edu/ | phytophthora-id.org http://phytophthora-id.org/ | oregonstate.edu/instruct/dce/phytophthora http://oregonstate.edu/instruct/dce/phytophthora | phytophthora-smallrna-db.cgrb.oregonstate.edu http://phytophthora-smallrna-db.cgrb.oregonstate.edu/

@emmanuelparadis
Copy link
Author

I switched to chr22 of 1K Genomes: it has already 1,103,537 loci and is 214 Mb (10.5 Gb uncompressed but no need to uncompress it to read thank to R's binary connections... nice stuff btw). It's possible to read in chunks of 1 Gb, scan the chunks for the loci positions, and output them for the 1,103,537 loci. This takes ~1 min (and < 8 Gb of RAM).

The idea is to provide some functions to get similar info on the loci (their positions, their alleles, whether they are SNP, ...) so the user may select which ones to read.

pegas::read.vcf returns a "loci" object. Currently, it can read quite easily a few 1000s loci.

@zkamvar
Copy link

zkamvar commented Mar 11, 2015

One question I have is: are we considering VCF files pre or post filtering? I believe it would be worthwhile to consider pre-filtering as it would get away from the current black box sort of methods available now.

@emmanuelparadis
Copy link
Author

Other fields in the VCF file can be scanned in the same way than POS. read.vcf already returns QUAL and FILTER as additional attributes.

@knausb
Copy link

knausb commented Mar 12, 2015

I'm not attending this event, but colleagues in my lab who are planning to attend brought this issue to my attention. I've been working on an R package to visualize and filter vcf files. Originally I wrote it all in R, and then realized its performance wouldn't scale well. I've subsequently been learning Rcpp and have been working on improving its performance. It has a vignette with a full working example. It can be installed with:

devtools::install_github(repo="knausb/vcfR", build_vignettes=TRUE)

And you can browse its vignette with:

browseVignettes(package="vcfR")

Its under active development, so things should change in the future. I'd be interested in any feedback people may have.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants