RNA expression data structure is inefficient #832
From email chain with Rob:

The Treehouse compendium is ~11k samples by 30k genes/features. Stored on disk in an HDF5 file it takes about 1 GB and loads into a dataframe (R or Python) in < 600 ms. Differential expression analysis typically requires getting a subset of the 'columns' - say 500 expression vectors after subsetting by disease. This works out to ~500 columns by 30k feature rows. I suspect the emerging single-cell world will look much the same, with the addition of very sparse data that lends itself to compression. The current GA4GH reference server database schema, with a single row per expression level, gets none of the optimizations of either row or column orientation, significantly expands the data (a float becomes ~128 bytes), and can't be compressed. The protocol buffer schema has the same issue. I suggest this be reconsidered in favor of storing the levels in an array - either a single-row blob, a column-oriented database, or an external file with metadata in the existing relational database - so that it is usable for current clinical differential expression use cases as well as emerging single-cell research cases.
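For concreteness, a minimal sketch of the access pattern described above, assuming a hypothetical treehouse_compendium.h5 file laid out as a genes x samples matrix plus a separate clinical metadata table; the file names, HDF5 key, and disease column are illustrative only.

```python
import pandas as pd

# Hypothetical layout: the whole ~11k sample x 30k gene matrix lives in one
# HDF5 file, clinical annotations in a small side table keyed by sample_id.
expression = pd.read_hdf("treehouse_compendium.h5", key="expression")  # genes x samples
clinical = pd.read_csv("clinical_metadata.tsv", sep="\t", index_col="sample_id")

# Differential expression typically pulls a disease-specific slice of columns:
# ~500 samples x 30k genes out of the full matrix in a single operation.
samples_of_interest = clinical.index[clinical["disease"] == "neuroblastoma"]
subset = expression[expression.columns.intersection(samples_of_interest)]
print(subset.shape)  # roughly (30000, 500)
```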
@saupchurch We were going to bring up this use case on the call today.
To bootstrap the effort of modeling the Genomics domain the way that we have, we've picked up a lot of assumptions of the underlying file types. Our variant representation is weighed down by VCF, read alignments by BAM, etc. The community moves slowly, and the schemas are our opportunity to decouple the database/file layer from the interaction layer. As we evolve the API, in addition to adding the methods and fields that biologists find practical based on existing usage patterns, we should work to remove the legacy of the file representation and move more of that logic to standardized ETL pipelines, allowing biologists to spend more time reasoning about their domain.

An important concern Rob raises (thanks @rcurrie) is that we don't present useful methods for analysis. The same objection has been raised about the variants API: you can get everything back, and reason about single documents well, but you always get more than you need.

I can imagine an interesting experiment where we provide an alternative way of querying expressions. Instead of splitting across ~5 requests to get an expression level, you pass in a list of sample identifiers and feature names. Add an endpoint called "expressionlevels/select" that takes a Select message containing a list of sample_id and feature_id (or name).
It then returns a table of quantifications with the requested sample_ids as columns and the requested feature_ids as rows. Cells simply contain the expression level.
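Purely as an illustration of the proposal, a sketch of what such a request and response might look like as JSON; the field names, sample IDs, and values below are hypothetical and are not part of the current schemas.

```python
import json

# Hypothetical request body for an "expressionlevels/select" endpoint:
# a list of sample IDs and a list of feature IDs (or names).
select_request = {
    "sample_ids": ["TH01_0053_S01", "TH01_0061_S01"],
    "feature_ids": ["ENSG00000136997", "ENSG00000141510"],  # MYC, TP53
}

# Hypothetical response: rows keyed by feature, columns keyed by sample,
# cells holding the expression level (e.g. log2(TPM + 1)).
select_response = {
    "feature_ids": ["ENSG00000136997", "ENSG00000141510"],
    "sample_ids": ["TH01_0053_S01", "TH01_0061_S01"],
    "expression": [
        [7.1, 6.4],  # ENSG00000136997 across the two samples
        [5.2, 5.9],  # ENSG00000141510 across the two samples
    ],
}

print(json.dumps(select_request, indent=2))
```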
I believe this is tractable based on the data model we currently present and would demonstrate a valuable analysis use case. It reduces the transfer required for the most common access pattern (show me these samples against these genes) down to a minimum. The returned response is essentially a table where each row is tagged with the sample and the experiment it came from. Constructing a select request will require iterating over metadata to get the list of sample IDs one is interested in. With a list of genes and samples, one should be able to query arbitrarily large stores using different slicing techniques.

This same pattern could be used with the variants API, I believe, where a list of sample IDs and variant IDs could be used to assemble a list of call vectors.

One of the problems with this approach is that it assumes that the data can be easily queried in the way the method presents. In practice, it requires that API implementors keep their data in structures that can be joined and merged. This assumption was presented for the Reads API, which has a method for assembling alignments from multiple BAMs.

Part of the value of the API we present is that it aims to be as low cost as possible to implement over existing stores, tries to only present the methods required for interchange, and presents documents that can be reasoned about alone. We don't want everyone to need to have Hadoop, or all their RNA in one table, or to look up in documentation/external metadata to see what a value means. I believe a service could use the API as it presents itself to build such a select layer on top.

We should definitely work to make the API efficient, but I imagine you can think of this layer of the API like the FASTQ of read alignment. It is a low-level protocol that analysis applications will work on top of, and those analysis applications take the responsibility of filtering/removing keys and making sure only the data needed to drive inquiry is transferred to you, the analyst.
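A minimal sketch of how such a layered service could assemble the table by iterating over the existing per-object API. The client interface here is hypothetical: it is assumed to expose a search_expression_levels call returning objects with feature_id and expression fields, and the caller is assumed to have already walked the metadata services to map samples of interest to their RNA quantification IDs.

```python
import pandas as pd

def select_expression(client, sample_to_quantification, feature_ids):
    """Build a features x samples table from per-sample expression queries.

    `client.search_expression_levels(...)` is a stand-in for whatever the
    underlying API client provides; it is not an existing documented call.
    """
    table = {}
    for sample_id, rna_quantification_id in sample_to_quantification.items():
        levels = client.search_expression_levels(
            rna_quantification_id=rna_quantification_id,
            feature_ids=feature_ids,
        )
        table[sample_id] = {lvl.feature_id: lvl.expression for lvl in levels}
    # Columns are samples, rows are features; missing cells become NaN.
    return pd.DataFrame(table).reindex(feature_ids)
```

A caller would iterate the metadata to produce the sample-to-quantification mapping, then hand that mapping and a gene list to a helper like this to get back the same kind of table the proposed select endpoint would return directly.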
The idea of a select query is a good one - this would essentially recreate an earlier proposed pattern.
+1 on this idea. It would be great to have @david4096's approach implemented.