Biological data
Michael Lawrence
When analyzing genomic data, we primarily work with three types of data:
- Genomic vectors. Could be the genome sequence itself, a reference annotation like conservation, or an experimentally generated statistic, like coverage.
- Genomic features. Defined by a range on the genome, annotated with metadata, forming a natural table structure. Includes data like gene structures, read alignments, coverage peaks, sequence motif hits, etc. A feature may contain gaps (as in the exons of a gene structure), which requires a single level of nesting/grouping.
- Summarized statistics from an assay. Forms a natural feature-by-sample matrix, typically combined with metadata on the features and samples. Often there are several assays (matrices) per experiment.
The three types are interrelated. We can represent a genomic vector with a set of single position features that span the genome. The features measured by an assay typically correspond to genomic loci. The distinction depends mostly on one's perspective. Are the assay measurements just an annotation on the features? Or are the assay results of primary interest?
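As a rough illustration of the first point, the sketch below expands a run-length encoded vector into a table with one row per position. It is plain Python, not Bioconductor code; `PositionFeature`, `vector_to_features`, and the coverage values are hypothetical names and data chosen for the example.

```python
from typing import NamedTuple


class PositionFeature(NamedTuple):
    """One row of a hypothetical single-position feature table."""
    seqname: str
    pos: int    # 1-based genomic position
    score: int  # value of the genomic vector at that position


def vector_to_features(seqname, runs):
    """Expand run-length encoded (value, length) pairs into one row per position."""
    rows = []
    pos = 1
    for value, length in runs:
        for _ in range(length):
            rows.append(PositionFeature(seqname, pos, value))
            pos += 1
    return rows


# Made-up coverage vector: two positions at 0, three at 3, one at 1.
coverage_runs = [(0, 2), (3, 3), (1, 1)]
table = vector_to_features("chr1", coverage_runs)
```

The table form is convenient to filter and join, but it stores one row per base, which is why a compact vector representation remains attractive at genome scale.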
Biological data are complex, and the appropriate data model depends on the use case. However, tabular models are often more intuitive to users and integrate better with downstream processing. We should try to identify tabular models for each of the common genomic data types.
The recently introduced GPos class aims to represent genomic vectors using a tabular (GRanges) shape. However, it has some limitations. First, it lacks long vector support, so it is limited to 2^31 - 1 elements and cannot represent every position in the human genome, which has roughly 3 billion bases. Bioconductor is working to fix that. Another limitation is that disk-based mechanisms for storing genomic vectors, like BSgenome, are not modeled as vectors.
GRanges is the canonical tabular representation of genomic features. GAlignments is the analogous class for alignments. Currently, nested features are represented as a list, GRangesList, but it would probably be better to use some sort of grouped GRanges. TxDb is the canonical way to store transcript models. It is backed by a database, but a single table would be more intuitive.
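To make the list-versus-grouped distinction concrete, here is a minimal plain-Python sketch, not the GRanges or GRangesList API. It keeps nested features in one flat table with a grouping column and derives the nested view only when needed. The column names, the `to_nested` and `group_span` helpers, and the exon coordinates are all assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Flat table: one row per exon, with a hypothetical "group" column
# naming the transcript each exon belongs to.
exons = [
    {"group": "tx1", "seqname": "chr1", "start": 100, "end": 200},
    {"group": "tx1", "seqname": "chr1", "start": 300, "end": 400},
    {"group": "tx2", "seqname": "chr1", "start": 150, "end": 250},
]


def to_nested(rows):
    """Derive the list-of-tables view (one list of rows per group)."""
    ordered = sorted(rows, key=itemgetter("group"))
    return {key: list(grp) for key, grp in groupby(ordered, key=itemgetter("group"))}


def group_span(rows):
    """Summarize each group directly on the flat table: (min start, max end)."""
    spans = {}
    for row in rows:
        lo, hi = spans.get(row["group"], (row["start"], row["end"]))
        spans[row["group"]] = (min(lo, row["start"]), max(hi, row["end"]))
    return spans


nested = to_nested(exons)
spans = group_span(exons)
```

In this sketch the grouped table is the primary object and the nested list is a derived view, so ordinary table operations apply to all rows at once while per-group summaries remain available.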
SummarizedExperiment stores assay matrices along with their metadata. While it is rectangular, it is not tabular, nor is there one obvious way to convert it to a table. Its structure is fundamental to how we think about data from biological experiments, so we should encourage its adoption. However, we still need to provide transformations to commonly useful tables.
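One candidate transformation is a "long" table with one row per feature and sample pair, carrying the metadata of both along with every assay value. The sketch below shows that idea in plain Python under simplifying assumptions: it is not the SummarizedExperiment API, `to_long_table` is a hypothetical helper, the data are invented, and feature and sample metadata are assumed to use distinct column names.

```python
def to_long_table(assays, row_data, col_data):
    """Return one row per (feature, sample) pair with each assay as a column.

    assays:   dict of assay name -> matrix (list of rows, features x samples)
    row_data: list of per-feature metadata dicts, in matrix row order
    col_data: list of per-sample metadata dicts, in matrix column order
    """
    rows = []
    for i, feature in enumerate(row_data):
        for j, sample in enumerate(col_data):
            # Assumes feature and sample metadata keys do not collide.
            row = {**feature, **sample}
            for name, matrix in assays.items():
                row[name] = matrix[i][j]
            rows.append(row)
    return rows


# Made-up experiment: two features, two samples, one assay.
assays = {"counts": [[10, 0], [5, 7]]}
row_data = [{"feature": "geneA"}, {"feature": "geneB"}]
col_data = [
    {"sample": "s1", "treatment": "ctrl"},
    {"sample": "s2", "treatment": "drug"},
]
long_table = to_long_table(assays, row_data, col_data)
```

This is only one of several reasonable shapes. A wide table with one row per feature and one column per sample is equally plausible, which reflects the point that there is no single obvious conversion.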