mbrush edited this page Feb 2, 2015 · 1 revision

This page describes the primary requirements and use cases that GENO is built to support. Additional figures and examples to come soon.

1. Ontological definition of a 'genotype'

While this use case is not based on any practical need for data representation or analysis in the Monarch Initiative, a clear ontological characterization of the concept of a genotype is important to engage community understanding and feedback, facilitate alignment with existing ontologies, and support application of GENO by third-party developers. Many of the concepts described in GENO are abstract, complex, domain-specific, and constantly evolving as our understanding of biology continues to improve. As a result, the broader community lacks clear definitions of and distinctions between many of these concepts and the terms that reference them (e.g. 'genotype', 'genome', 'allele', 'gene'). This lack of a shared conceptual and terminological foundation creates a barrier to communication and aggregation of data within and across domains of research and scholarship. Accordingly, one goal for GENO is to provide comprehensive definitions of the basic concepts relevant to genotype-to-phenotype (G2P) research, and build a coherent information model that distinguishes and relates these concepts in clear and meaningful ways. Here, we will leverage and build upon a framework provided by prominent ontological and terminological efforts including the Basic Formal Ontology (BFO) and the Sequence Ontology (SO).

2. Defining the levels of sequence variation specified in genotypes

Genetic variation can be described at varying levels of granularity - from a complete genotype specifying sequence variation across an entire genome, down to single nucleotide changes within a gene. In different data sources and communities of research, phenotypes are associated with units of variation across this entire spectrum. For example, mouse (MGI) and zebrafish (ZFIN) model organism databases link phenotypes to complete genotypes, while worm (WormBase) and fly (FlyBase) databases link phenotypes to individual variant alleles, and many human genetics databases correlate phenotypes with specific single-nucleotide polymorphisms. Accordingly, GENO represents each of these levels of variation to support ingest and standardization of G2P data.

3. Support for aggregation and integrated analysis of G2P data

Defining the genetic elements linked to phenotypes suffices for a simple data representation use case, but integrated analysis of G2P data from different sources requires a unifying model that links these levels of variation into a single graph that can be operated on by advanced logic- and graph-based algorithms. GENO supports this use case by representing a genotype in a 'partonomy graph'. This graph decomposes a genotype representing variation across an entire genome into more fundamental units of variation that are relevant in biological and G2P research. This model is generic yet rich enough to apply to genotypes as captured across different model organism and human databases, and in doing so provides a standardized representation of the sequence content described in genotypes. A schematic of this core partonomy can be found in the readme file [here](https://github.com/monarch-initiative/GENO-ontology/blob/develop/README.md).
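As a rough illustration of the partonomy idea, the sketch below stores a genotype as a set of 'has part' edges and walks them to list every finer-grained unit of variation it contains. The node labels (`genotype:1`, `allele:a`, `alteration:a1`, ...), the `HAS_PART` dictionary, and the `parts_of` function are hypothetical names chosen for this example; they are not GENO classes, properties, or identifiers.

```python
from collections import deque

# Hypothetical node labels for illustration only; real GENO class names
# and IRIs may differ. Each key has the listed nodes as parts.
HAS_PART = {
    "genotype:1": ["allele:a", "allele:b"],
    "allele:a": ["alteration:a1"],
    "allele:b": ["alteration:b1", "alteration:b2"],
}


def parts_of(node, has_part=HAS_PART):
    """Return every node reachable from `node` via has-part edges.

    Nodes are returned in breadth-first order, so coarser units of
    variation come before the finer units they contain.
    """
    seen = []
    queue = deque(has_part.get(node, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.append(current)
        queue.extend(has_part.get(current, []))
    return seen


if __name__ == "__main__":
    print(parts_of("genotype:1"))
```

Because every source's genotype is reduced to the same kind of graph, one traversal of this sort could in principle serve data from any of the databases mentioned above, whichever level they annotate.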

4. Phenotype propagation

'Phenotype propagation' refers to the process of inferring associations between a phenotype annotated at one level of genetic variation and other levels in the genotype model. GENO supports phenotype propagation by defining property chains across the edges in its core genotype partonomy, which allow a reasoner to infer links between a phenotype and other nodes in the graph of a given genotype. This operation over the genotype graph is essential for integrated analysis of G2P data, where phenotype annotations are asserted at different levels of genetic variation across different data sources.
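The effect of such property chains can be approximated without a reasoner. The following is a minimal sketch, not GENO's actual axioms: it assumes a chain of the form "X has part Y, and Y has phenotype P, therefore X has phenotype P", so that a whole inherits phenotypes asserted on its parts. The direction of propagation, the `propagate` function, and all node and phenotype labels are assumptions made for this example.

```python
def propagate(annotations, inherits_from):
    """Infer phenotype links by following chain edges to a fixed point.

    annotations:   dict mapping a node to the set of phenotypes asserted on it
    inherits_from: dict mapping a node to the nodes whose phenotypes it
                   inherits (here: a whole inherits from its parts, which is
                   an assumed direction, not one stated by GENO)
    Returns a new dict of asserted plus inferred phenotypes per node.
    """
    # Copy so the asserted annotations are left untouched.
    inferred = {node: set(phenos) for node, phenos in annotations.items()}
    changed = True
    while changed:
        changed = False
        for node, sources in inherits_from.items():
            target = inferred.setdefault(node, set())
            for source in sources:
                new = inferred.get(source, set()) - target
                if new:
                    target |= new
                    changed = True
    return inferred


# Hypothetical example data.
HAS_PART = {
    "genotype:1": ["allele:a"],
    "allele:a": ["alteration:a1"],
}
ASSERTED = {"alteration:a1": {"small eye"}}

if __name__ == "__main__":
    result = propagate(ASSERTED, HAS_PART)
    for node in sorted(result):
        print(node, sorted(result[node]))
```

Passing a different edge map (for example, one where parts inherit from wholes) would propagate in the opposite direction; which chains are appropriate is a modeling decision that this sketch does not settle.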

5. Parallel support for operating on variation in gene expression alongside traditional genotype data

While genotypes as described above specify intrinsic variation in genomic sequence, we are also interested in capturing and operating on information about experimentally-targeted variation in gene expression. Examples include reduced expression following the application of RNAi, or transient overexpression following introduction of a DNA expression construct. While the majority of G2P associations are based on variation in genomic sequence, studies employing transient genetic manipulation are increasingly used as another way to link a gene to a phenotype. Parallel representation of these two forms of genetic variation is important for integrated description and analysis of data about any genetic contribution to an organismal or cellular phenotype.

GENO defines the notion of an 'experimental genotype' to describe an information artifact that summarizes variation in gene expression at the time of an experiment. A partonomy decomposes an experimental genotype into its component genes that are targeted for altered expression. Through this analogous representation of experimental variation in terms of the targeted genes, we facilitate integration of intrinsic and experimental G2P data, and support operations across the full spectrum of genetic variation in an organism that can be associated with phenotypes.
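One way to picture this structure is as a record whose parts are the genes targeted for altered expression. The sketch below is only an illustration under that reading; the class names `ExperimentalGenotype` and `TargetedGene`, their fields, and the example values are hypothetical and do not correspond to GENO terms or identifiers.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TargetedGene:
    """A gene whose expression is altered in an experiment (hypothetical shape)."""
    gene: str     # gene identifier or symbol
    reagent: str  # e.g. "RNAi" or "expression construct"
    effect: str   # e.g. "reduced" or "overexpressed"


@dataclass
class ExperimentalGenotype:
    """Summarizes variation in gene expression at the time of an experiment."""
    label: str
    parts: list = field(default_factory=list)

    def targeted_genes(self):
        """Return the distinct genes targeted, sorted for stable output."""
        return sorted({part.gene for part in self.parts})


if __name__ == "__main__":
    experiment = ExperimentalGenotype(
        "exp-1",
        [
            TargetedGene("geneA", "RNAi", "reduced"),
            TargetedGene("geneB", "expression construct", "overexpressed"),
        ],
    )
    print(experiment.targeted_genes())
```

Reducing the experimental genotype to its targeted genes gives it the same whole-to-part shape as a sequence-based genotype, which is what lets both kinds of data be queried together.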

6. Characterizing attributes of genotypes and genetic variants

The genetic entities represented in the core GENO partonomies exhibit many attributes that are important to describe in order to leverage G2P data toward novel analyses and inference of new knowledge. GENO will implement terms and design patterns to support the description of such attributes, including zygosity, genomic position, expression patterns, and dominance, as well as dependencies and consequences of variation. Where existing work in community ontologies supports description of such attributes, we will work to align with and/or re-use modeling where appropriate. Relevant ontologies here may include FALDO (genomic position), HPO (dominance), and VariO (functional consequence).

TO DO: outline specific Monarch use cases and requirements for each type of attribute (position, expression, dominance, etc)

7. Description of genotype-phenotype association data and its provenance

The representations of genetic variation in GENO and of phenotypic variation across existing phenotype ontologies were created in large part to support operations over G2P data that extract new knowledge. Numerous and diverse sources provide G2P data of varying structure and complexity, and GENO aims to provide a common framework of terms and design patterns to richly describe this data in a graph-based model. This includes representation of the genotype-phenotype association itself, contextualizing environmental or experimental information, and provenance data surrounding the assignment of the G2P link. This work is in the planning stages and is to be implemented soon.

8. Orthogonality to existing domain ontologies

While a key mission of GENO is to support the data integration and analysis use cases of the Monarch Initiative, it also aims to become a community standard for representation of G2P data in biomedical research. Toward this end, we will provide comprehensive definitions and documentation of design decisions and patterns, and consider use cases beyond those of the Monarch Initiative. We will also take care to situate GENO within the existing framework of related and orthogonal ontological models, which may include alignment with and/or re-use from the BFO, SO, HPO, RO, OBI, and FALDO. We believe that developing GENO in this way will be beneficial because it allows us to leverage existing work, improve modeling and requirements through community feedback, and produce data with better interoperability across the research landscape.
