better support for VCF files #14
Comments
Agreed with you, Emmanuel: VCF is the way to go, but we will need parsers from and to other formats, including allele frequencies. I think we should work on being able to handle 10^6 loci; 10^4 is routine and 10^5 is already quite common. Perhaps playing with chr1 of the 1000 Genomes project rather than chrY would put us in the right ballpark?
Sounds exciting! (replying by email to Jerome Goudet's message of Wed, Mar 11, 2015)
I am totally on board. We need good VCF tools for larger GBS data sets that handle haploid, diploid, or polyploid SNPs, the ability to do genome scans with varying window sizes for Fst, pi, linkage, etc., and gene content obtained from GFF. I like the idea of using moderate data matrices rather than whole genomes. Will this be based on the genlight object? The ability to parse to loci and genind objects is important. One issue we are struggling with is that the likelihood that each individual is a new multilocus genotype (MLG) requires tools for defining MLG boundaries. How many different SNPs does it take to be a different genotype? Once we scale to 1,000s of SNPs, an MLG is very different from what we are used to in traditional data such as SSRs. This is important for my group, as we need to understand whether there are clones: how many SNP differences make it another clone, etc. I like the idea of starting with 10^6 loci using chr1 of 1000 Genomes as a starting point. Cheers,
Niklaus J. Grünwald
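The MLG-boundary question above can be sketched as a pairwise-difference threshold. This is a minimal illustration with made-up genotypes and an arbitrary cutoff, not an implementation from any package; a real tool would also need to handle missing data and ploidy:

```r
# Hypothetical genotype matrix: 3 individuals x 5 SNP loci, allele counts coded 0/1/2.
geno <- rbind(ind1 = c(0, 1, 2, 0, 1),
              ind2 = c(0, 1, 2, 0, 0),   # differs from ind1 at a single locus
              ind3 = c(2, 0, 0, 1, 1))   # clearly distinct

# Count the loci at which each pair of individuals differs.
n <- nrow(geno)
n_diff <- matrix(0L, n, n, dimnames = list(rownames(geno), rownames(geno)))
for (i in seq_len(n))
  for (j in seq_len(n))
    n_diff[i, j] <- sum(geno[i, ] != geno[j, ])

# Collapse individuals into one MLG if they differ at <= 1 locus.
# (The threshold is exactly the open question; 1 is arbitrary here.)
mlg <- cutree(hclust(as.dist(n_diff), method = "single"), h = 1)
```

Here ind1 and ind2 collapse into one MLG while ind3 stands alone; the hard part, as noted above, is justifying the threshold.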
I switched to chr22 of 1000 Genomes: it already has 1,103,537 loci and is 214 MB (10.5 GB uncompressed, but there is no need to uncompress it to read it, thanks to R's binary connections... nice stuff, btw). It's possible to read in chunks of 1 GB, scan the chunks for the loci positions, and output the positions of all 1,103,537 loci. This takes ~1 min (and < 8 GB of RAM). The idea is to provide some functions to get similar info on the loci (their positions, their alleles, whether they are SNPs, ...) so the user may select which ones to read. pegas::read.vcf returns a "loci" object; currently, it can read a few thousand loci quite easily.
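The chunked scan described above can be sketched with a line-based reader. The actual approach uses raw 1 GB binary chunks for speed; this simplified sketch, with a made-up function name and an arbitrary chunk size, relies on the fact that `gzfile()` connections decompress on the fly, so a compressed VCF never needs to be unpacked on disk:

```r
# Sketch: collect the POS column of a (possibly gzipped) VCF without
# loading the whole file into memory. gzfile() also reads plain text.
scan_positions <- function(file, chunk_size = 1e5L) {
  con <- gzfile(file, open = "r")
  on.exit(close(con))
  pos <- integer(0)
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break
    lines <- lines[!startsWith(lines, "#")]   # drop meta/header lines
    if (length(lines))                        # POS is the 2nd tab-separated field
      pos <- c(pos, as.integer(sub("^[^\t]*\t([^\t]*)\t.*", "\\1", lines)))
  }
  pos
}
```

Growing `pos` with `c()` in a loop is fine for a sketch; a production version would preallocate or collect per-chunk results in a list.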
One question I have is: are we considering VCF files pre- or post-filtering? I believe it would be worthwhile to consider pre-filtering, as it would get us away from the black-box methods currently available.
Other fields in the VCF file can be scanned in the same way as POS. read.vcf already returns QUAL and FILTER as additional attributes.
I'm not attending this event, but colleagues in my lab who are planning to attend brought this issue to my attention. I've been working on an R package to visualize and filter VCF files. Originally I wrote it all in R, and then realized its performance wouldn't scale well. I've subsequently been learning Rcpp and have been working on improving its performance. It has a vignette with a full working example. It can be installed with:
And you can browse its vignette with:
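The actual commands did not survive in this copy of the thread; for a GitHub-hosted package the usual pattern looks like the following, where "username/packagename" is a placeholder, not the real repository:

```r
# Placeholder repository and package names; the real ones were not preserved above.
# install.packages("devtools")                  # if devtools is not yet installed
devtools::install_github("username/packagename")
browseVignettes("packagename")                  # lists the package's vignettes
```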
It's under active development, so things should change in the future. I'd be interested in any feedback people may have.
I already mentioned this in the wiki, but I can develop it a bit more here. It seems that VCF will be the format of the future for pop gen. Several packages offer support for VCF files but, IMHO, they are limited and not well integrated with other packages (I exclude the BioConductor package for reading variant annotations, which has another objective). pegas has read.vcf, but its performance is limited. I propose to improve the situation. Over the last few days, I've played with the 1000 Genomes data for chr. Y (62,042 loci). I wrote code to scan the whole file and find which loci are "real" SNPs (i.e., loci with only two alleles differing by a single substitution): this takes ~3 sec. Then it'd be possible to read some loci selectively (read.vcf can do that already). The approach is to use only R code to avoid relying on external libraries (it'll probably be necessary to write some code in C at some stage). For instance, the code to find SNPs is ~20 lines.
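The ~20-line SNP scan mentioned above was not included in this copy of the thread; a minimal sketch of the same idea, assuming standard tab-separated VCF body lines (the function names are mine, not from pegas), could look like:

```r
# A locus is a "real" SNP here if REF and ALT are each a single nucleotide,
# i.e. a biallelic site whose alleles differ by one substitution.
is_snp <- function(ref, alt) {
  nchar(ref) == 1L & nchar(alt) == 1L &
    ref %in% c("A", "C", "G", "T") & alt %in% c("A", "C", "G", "T")
}

# Given the lines of a VCF, return the indices (among body lines) of SNP loci.
find_snps <- function(vcf_lines) {
  body   <- vcf_lines[!startsWith(vcf_lines, "#")]
  fields <- strsplit(body, "\t", fixed = TRUE)
  ref <- vapply(fields, `[[`, character(1), 4L)   # REF is column 4
  alt <- vapply(fields, `[[`, character(1), 5L)   # ALT is column 5
  which(is_snp(ref, alt))
}
```

Multiallelic sites (comma-separated ALT) and indels fail the single-character test and are excluded; a fuller version would also handle symbolic alleles such as `<DEL>`.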
An issue is the file size we should expect. Are VCF files from pop-gen studies as big as those output by the 1000 Genomes project? Currently, handling files with 10^4 loci is not a problem, and I believe it should be relatively easy to reach 10^5. Should we expect to handle files with 10^6 loci?