
Alignment Tree

Robert J. Gifford edited this page Oct 4, 2024 · 3 revisions

Constrained versus Unconstrained Alignments


In GLUE, multiple sequence alignments can be represented in two forms: constrained and unconstrained. In unconstrained alignments, no fixed numbering system is applied, and columns are inserted into the alignment as necessary to represent the homologies between sequences. By contrast, constrained alignments are anchored on a specific reference sequence, and use the coordinates of that reference to refer to homologies at specific nucleotide and amino acid positions.
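
To make the distinction concrete, the following is a minimal Python sketch (not GLUE code) that converts one gapped row pair from an unconstrained alignment into ungapped segments expressed in reference coordinates. The function name `constrain` and the tuple layout `(ref_start, ref_end, member_start, member_end)` are assumptions chosen for illustration, and coordinates are taken to be 1-based and inclusive. Columns where the reference has a gap have no reference coordinate, so they are dropped in the constrained form.

```python
def constrain(ref_row, member_row):
    """Turn a gapped (reference, member) row pair into ungapped segments.

    Each segment is (ref_start, ref_end, member_start, member_end),
    1-based and inclusive. Hypothetical layout, for illustration only.
    """
    if len(ref_row) != len(member_row):
        raise ValueError("rows of one alignment must have equal length")

    segments = []
    current = None            # segment being extended, or None
    ref_pos = member_pos = 0  # nucleotides consumed so far in each sequence

    for r, m in zip(ref_row, member_row):
        if r == "-" and m == "-":
            continue          # column used only by other sequences
        if r != "-":
            ref_pos += 1
        if m != "-":
            member_pos += 1
        if r != "-" and m != "-":
            if current is None:
                current = [ref_pos, ref_pos, member_pos, member_pos]
            else:
                current[1] = ref_pos
                current[3] = member_pos
        elif current is not None:
            # A gap in exactly one row ends the current ungapped segment.
            segments.append(tuple(current))
            current = None

    if current is not None:
        segments.append(tuple(current))
    return segments


# Reference has a G the member lacks; member has an extra T the reference lacks.
example = constrain("ACG-TAC", "AC-TTAC")
print(example)
```

In this example the member's inserted T falls between reference positions 3 and 4, so it does not appear in any segment.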

The Constrained Alignment Tree in GLUE


In GLUE, an alignment tree is used to organize and relate virus sequences based on their evolutionary history and DNA sequence similarity (homology). This structure allows us to link sequences in a way that reflects their evolutionary relationships while also managing how their nucleotide sequences match up across different viral clades.

A core rule in the GLUE framework, called the alignment tree invariant, ensures the tree remains logically consistent. This rule states that if an Alignment object (representing a clade) is a child of another, the ReferenceSequence that constrains the child Alignment must also be a member of the parent Alignment. In this way, every parent clade contains at least one representative sequence from each of its child clades.
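
The invariant can be expressed as a simple check. The sketch below is an illustrative Python model, not GLUE's implementation: the `Alignment` dataclass, its field names, and the alignment and sequence IDs are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Alignment:
    """Toy stand-in for a constrained Alignment (hypothetical fields)."""
    name: str
    reference: str                        # ID of the constraining reference
    parent: Optional["Alignment"] = None
    members: Set[str] = field(default_factory=set)


def invariant_violations(alignments):
    """Names of child alignments whose reference is not a member of the parent."""
    return [
        a.name
        for a in alignments
        if a.parent is not None and a.reference not in a.parent.members
    ]


root = Alignment("AL_ROOT", "REF_ROOT", members={"REF_ROOT", "REF_A"})
child_a = Alignment("AL_A", "REF_A", parent=root, members={"REF_A", "SEQ_1"})
# REF_B was never added to the root's members, so AL_B breaks the invariant.
child_b = Alignment("AL_B", "REF_B", parent=root, members={"REF_B"})

violations = invariant_violations([root, child_a, child_b])
print(violations)
```

Adding `REF_B` to `root.members` would restore the invariant for `AL_B`.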

Key Concepts


  1. Alignment Objects and Clades:

    • An alignment tree starts by grouping virus sequences into clades: groups of viruses that share a common ancestor.
    • Each clade is represented by an Alignment object, which acts as a container for all the sequences that belong to that clade.
    • When one clade is descended from another (like a parent-child relationship), a special link is made between the two corresponding Alignment objects.
  2. Assigning Sequences:

    • Virus sequences are assigned to clades by becoming AlignmentMembers of the appropriate Alignment object.
    • As we learn more about the evolutionary relationships between viruses, new clades can be added, and sequences may be reassigned to different clades as needed.
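
The two concepts above can be sketched as a small data structure. This is a simplified Python model under stated assumptions, not GLUE's schema: the class `AlignmentTree`, its methods, and the clade and sequence names are hypothetical.

```python
class AlignmentTree:
    """Toy model: alignments linked parent-to-child, each holding member IDs."""

    def __init__(self):
        self.parent = {}   # alignment name -> parent name (None for the root)
        self.members = {}  # alignment name -> set of sequence IDs

    def add_alignment(self, name, parent=None):
        if parent is not None and parent not in self.members:
            raise KeyError(f"unknown parent alignment: {parent}")
        self.parent[name] = parent
        self.members[name] = set()

    def assign(self, seq_id, alignment):
        """Make a sequence a member of the given alignment."""
        self.members[alignment].add(seq_id)

    def reassign(self, seq_id, old, new):
        """Move a sequence from one clade to another."""
        self.members[old].remove(seq_id)
        self.members[new].add(seq_id)

    def lineage(self, alignment):
        """Path from an alignment up to the root."""
        path = []
        while alignment is not None:
            path.append(alignment)
            alignment = self.parent[alignment]
        return path


tree = AlignmentTree()
tree.add_alignment("AL_ROOT")
tree.add_alignment("AL_GENOTYPE_1", parent="AL_ROOT")
tree.assign("SEQ_1", "AL_GENOTYPE_1")

# A new subclade is recognised later; the sequence moves into it.
tree.add_alignment("AL_SUBTYPE_1a", parent="AL_GENOTYPE_1")
tree.reassign("SEQ_1", "AL_GENOTYPE_1", "AL_SUBTYPE_1a")
print(tree.lineage("AL_SUBTYPE_1a"))
```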

Structure of the Tree


  • Tip Alignments: Most virus sequences are found at the tips of the tree. These represent modern sequences that are assigned to specific clades.
  • Internal Alignments: Internal nodes in the tree represent ancestral clades. They contain only key sequences that act as representatives for their descendant clades.
  • Insertions and New Alignments: If a virus sequence contains an insertion (extra DNA) that doesn't match its reference sequence, this may trigger the creation of a new child Alignment to account for this unique feature.

Advantages of an Alignment Tree


  • Preserving Evolutionary Relationships: The alignment tree "fixes" known evolutionary relationships, so they don't need to be recalculated every time we analyze the sequences. This saves time and effort in later analyses.
  • Combining Different Alignment Methods: GLUE supports the integration of different alignment techniques. For example, closely related sequences can be quickly aligned using simpler methods (e.g., BLAST+), while more distant sequences may require more complex alignments at the protein level. Regardless of the method, GLUE stores the resulting nucleotide homologies as AlignedSegments.
  • Handling Divergent Sequences: Sometimes virus sequences are so different that it's impossible to align their entire genomes. However, specific regions of the genome may still show strong similarity, and the alignment tree captures these similarities at internal nodes. The closer a node is to the tips, the larger the fraction of the genome that can be aligned, because the sequences there are more closely related.
  • Homology via Transitivity: GLUE can also infer homology between sequences using a concept called transitivity. If Sequence A matches part of Sequence B, and Sequence B matches part of Sequence C, GLUE can figure out how Sequence A relates to Sequence C by combining these homologies. This feature is used to find connections between sequences in different parts of the tree.
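
The transitivity idea can be illustrated by composing two sets of ungapped segments. The following Python sketch is an assumption-laden illustration rather than GLUE's algorithm: the function `compose` and the tuple layout `(src_start, src_end, dst_start, dst_end)` are hypothetical, and coordinates are 1-based and inclusive. Wherever an A-to-B segment and a B-to-C segment overlap in B's coordinates, the overlap is translated into a direct A-to-C segment.

```python
def compose(a_to_b, b_to_c):
    """Combine A->B and B->C segments into A->C segments.

    Each segment is (src_start, src_end, dst_start, dst_end) and is
    ungapped, so source and destination ranges have equal length.
    """
    result = []
    for a_start, _a_end, b1_start, b1_end in a_to_b:
        for b2_start, b2_end, c_start, _c_end in b_to_c:
            # Overlap of the two segments, in B coordinates.
            lo = max(b1_start, b2_start)
            hi = min(b1_end, b2_end)
            if lo <= hi:
                result.append((
                    a_start + (lo - b1_start),
                    a_start + (hi - b1_start),
                    c_start + (lo - b2_start),
                    c_start + (hi - b2_start),
                ))
    return sorted(result)


# A positions 1-100 match B positions 11-110;
# B positions 51-150 match C positions 1-100.
a_to_c = compose([(1, 100, 11, 110)], [(51, 150, 1, 100)])
print(a_to_c)
```

Only B positions 51 to 110 are covered by both inputs, so the inferred homology spans 60 positions: A 41 to 100 against C 1 to 60.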