Extract HGVS expressions and related attributes from the Variation.content field #58

Open
larrybabb opened this issue Jun 23, 2022 · 2 comments
Labels
clinvar Clinvar data exchange and reporting

Comments

@larrybabb

This is the first of several class content fields originating from the DSP ClinVar ingest stream that will need to be parsed and stored formally in the final transformed messages.

This first element is the array of HGVS expressions that are embedded in the Variation.content serialized json field.

Each Variation object will have zero, one or more HGVS elements in the stringified json content attribute.

The JSON path $.HGVSlist.HGVS may be either an array (when more than one expression exists) or a single element (when only one exists). I believe the $.HGVSlist.HGVS node is absent when a variation has no HGVS expressions, but it may instead be an empty array or an empty single node (I can't recall right now).
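Because the node can be an array, a single object, empty, or absent, it may help to normalize it to a list before any further parsing. The following is only a minimal sketch in Python: the function name hgvs_nodes is hypothetical, and it assumes the content string parses to a JSON object with an optional HGVSlist key, as described above.

```python
import json


def hgvs_nodes(content: str) -> list:
    """Return the HGVS nodes of a stringified Variation.content as a list.

    Hypothetical helper: covers the array, single-object, empty, and
    missing cases described in this issue.
    """
    doc = json.loads(content)
    if not isinstance(doc, dict):
        return []
    hgvs_list = doc.get("HGVSlist")
    if not isinstance(hgvs_list, dict):
        return []  # no HGVSlist node at all (or an empty/non-object one)
    node = hgvs_list.get("HGVS")
    if isinstance(node, list):
        return [n for n in node if n]  # drop empty entries
    if isinstance(node, dict) and node:
        return [node]  # a single element is wrapped in a list
    return []  # missing, null, or empty single node
```

With this in place, downstream code can always iterate over a list, whichever of the cases above actually occurs in the stream.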

Each HGVS node will need to be parsed into a structure with the following shape:

hgvs.assembly - $['@Assembly']
hgvs.type - $['@Type']
hgvs.nucleotideExpression - $['NucleotideExpression']['Expression']['$']
hgvs.nucleotideExpression.isManeSelect - $['NucleotideExpression']['@MANESelect'] (boolean TRUE/FALSE)
hgvs.proteinExpression - $['ProteinExpression']['Expression']['$']
hgvs.molecularConsequence.db - $['MolecularConsequence']['@DB']
hgvs.molecularConsequence.id - $['MolecularConsequence']['@ID']
hgvs.molecularConsequence.type - $['MolecularConsequence']['@Type']

Some general patterns that may be informative about how these fields are typically populated:

  • if the type is genomic then only the nucleotideXXX values will be included
  • if the type is transcript then isManeSelect may be TRUE; otherwise it defaults to FALSE
  • if the type is transcript and it's a protein-coding expression then the corresponding derived protein expression will likely be provided
  • if the type is protein only (or something like that) then only the protein expression will be provided (I think)
  • the molecularConsequence fields will optionally be available for many of the HGVS expressions that have a transcript nucleotide expression

We will need to finalize the destination structure for this data in our GeneGraph model. For general reference, these fields will ultimately end up in the VariationDescriptor class that is associated with the core VCV and SCV statements being transformed.

@larrybabb
Author

NOTE: eventually we will be extracting ALL the data from the various Class.content fields. In the initial MVP for the standardization of ClinVar into GeneGraph, we will be identifying fields from the ClinicalAssertionObservation.content JSON and possibly from GeneAssociation.content.

These are yet to be written up.

@theferrit32
Contributor
