General Information

Rodney Omukuti edited this page Jun 28, 2022 · 2 revisions

The pangenome

The pangenome models genomic variation in a population. Variation takes many forms, including single nucleotide polymorphisms (SNPs), insertions and deletions (indels), gene presence or absence, and structural variations. The population, in this case, could be a single tissue, a species, a subspecies, another taxonomic unit, or an ecological community at large. As defined in molecular biology and genetics, a pan-genome is the entire set of genes from all strains within a clade, or the union of all the genomes of a clade.

Types of pangenomes

There are three types of pangenome:

  1. Collection pangenome
  2. Graphical pangenome (Nodes and Edges)
  3. Presence/Absence pangenome

The choice among these types is determined by the research question and the type of sequence data available.

Collection pangenome

This involves collecting genomic sequences, mapping them against a reference genome, and identifying the differences.
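As a rough illustration of the "identifying the differences" step, the sketch below compares samples against a reference position by position. It assumes the sequences have already been aligned to the same length, with `-` marking a gap. The function name `find_differences` and the toy sequences are hypothetical and are not part of this project's code.

```python
def find_differences(reference, sample):
    """Return (position, ref_base, sample_base) for every mismatch.

    Assumes both sequences are already aligned and of equal length;
    '-' marks a gap. Positions are 1-based.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


# Toy data, invented for illustration only.
reference = "ACGTACGT"
samples = {
    "sample_a": "ACGTTCGT",  # one substitution
    "sample_b": "ACG-ACGA",  # one gap and one substitution
}

differences = {name: find_differences(reference, seq) for name, seq in samples.items()}
for name, diffs in differences.items():
    print(name, diffs)
```

In practice the mapping step would be done by a dedicated aligner; the sketch only covers the comparison that follows it.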

Presence/Absence pangenome

This type of pangenome depicts the presence or absence of genes within a population. It is typically visualized with Venn diagrams that focus on core and accessory genes. Core genes are mainly associated with survival and are found in all the organisms under study. Accessory genes, on the other hand, are found in some but not all of the organisms under investigation. They are associated with variation and evolutionary trajectories. Accessory genes link phenotypes to genotypes and are used in species delineation.
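The core/accessory split can be sketched with plain set operations. The example below is a minimal illustration, assuming each genome is represented as the set of gene names annotated in it. The genome and gene names are invented, and genes missing from at least one genome are treated as accessory.

```python
# Hypothetical presence/absence data: genome name -> genes found in it.
presence = {
    "genome_1": {"geneA", "geneB", "geneC"},
    "genome_2": {"geneA", "geneB"},
    "genome_3": {"geneA", "geneC", "geneD"},
}

# The pangenome is the union of all genes seen in any genome.
pangenome = set().union(*presence.values())

# Core genes are present in every genome.
core = set.intersection(*presence.values())

# Accessory genes are missing from at least one genome.
accessory = pangenome - core

print("pangenome:", sorted(pangenome))
print("core:", sorted(core))
print("accessory:", sorted(accessory))
```

With real data the gene sets would come from an annotation step, which is why poor annotation makes this type of pangenome hard to build.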

Graphical pangenome

These are characterized by nodes and edges.

Nodes are segments of genomic sequence.

Edges describe how the individual segments are joined together.
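The node-and-edge model can be illustrated with a very small hand-built graph. The sketch below is a toy data structure, not the format or API of any particular tool. The node IDs, sequences, and the helper `spell_path` are all hypothetical. Two genomes that differ by a single base share nodes 1 and 4 and diverge through node 2 or node 3.

```python
# Nodes: ID -> segment of sequence (toy values).
nodes = {1: "ACG", 2: "T", 3: "A", 4: "CCGT"}

# Edges: allowed joins between segments, as (from_node, to_node).
edges = {(1, 2), (1, 3), (2, 4), (3, 4)}


def spell_path(path):
    """Concatenate node sequences along a path, checking every join is an edge."""
    for src, dst in zip(path, path[1:]):
        if (src, dst) not in edges:
            raise ValueError(f"no edge from node {src} to node {dst}")
    return "".join(nodes[node_id] for node_id in path)


# Each genome is a walk through the graph.
genome_x = spell_path([1, 2, 4])
genome_y = spell_path([1, 3, 4])
print(genome_x)
print(genome_y)
```

Storing the shared segments once and recording each genome as a path is what lets a graph represent many genomes without choosing one of them as the reference.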

More information can be found here

Applications of graphical pangenome

  • Precision medicine
  • Structural variations within a population
  • Evolutionary studies of closely related species
  • An alternative to a linear reference genome

The project

This project's main goal was to mine and analyze arbovirus genomic data from East Africa and the rest of the world, as detailed here. Five arboviruses are common in East African countries, as shown in the table below.

Viruses         Countries
Chikungunya     Kenya
Dengue          Uganda
West Nile       Tanzania
Yellow fever    Rwanda
Zika            Burundi
                South Sudan
                DRC

The code used to fetch the metadata and the sequences from the database is available.

Pangenomics is an emerging field of genomics, so little work has been done in it so far, especially in viral pangenomics. Many challenges arose along the way because the available pangenomic tools are tailored toward bacterial genomes. Viruses have fewer genes in their genomes than bacteria, and they also lack core genes. Moreover, the genes were not well annotated, which ruled out presence/absence pangenomes, whose tools are publicly available.

Therefore, it was necessary to develop a working pipeline for building pangenomic variation graphs, one that is not limited to viruses.

These graphs can have numerous applications including:

  1. Reducing bias in genome construction. Genomes reconstructed with a reference appear more similar to the reference than they actually are. Pangenomic reference systems can reduce this bias by allowing new genomes to be related directly to all the genomes represented in the pangenome.

  2. Modeling variation at the base level. Standard pangenomics focuses on the presence/absence of genes and pays little attention to the variation between these sequences. Pangenomic graphs attempt to provide a precise model relating many genomes to each other at the base level.

References

  1. Vg
  2. Pangenome graphs