Extract HGVS expressions and related attributes from the Variation.content field #58

larrybabb · 2022-06-23T14:38:21Z

This is the first of several class content fields originating from the dsp ClinVar ingest stream that will need to be parsed and stored formally in the final transformed messages.

This first element is the array of HGVS expressions that are embedded in the Variation.content serialized json field.

Each Variation object will have zero, one or more HGVS elements in the stringified json content attribute.

The JSON path $.HGVSlist.HGVS may be either an array (when more than one expression exists) or a single element (when only one exists). I believe the $.HGVSlist.HGVS node is absent when a variation has no HGVS expressions, but it may instead be an empty array or an empty single node (can't recall right now).
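
A minimal Python sketch of normalizing that path into a list, so downstream code never has to care which form it received. The function name `hgvs_nodes` is hypothetical, and the handling of the "no expressions" cases (missing node, empty array, empty single node) is an assumption covering all three possibilities mentioned above, not confirmed behavior of the stream.

```python
import json
from typing import Any


def hgvs_nodes(content: str) -> list[dict[str, Any]]:
    """Return the HGVS nodes in a Variation.content JSON string as a list.

    Hypothetical helper: treats a missing node, an empty array, and an
    empty single node all as "no HGVS expressions".
    """
    doc = json.loads(content) if content else {}
    if not isinstance(doc, dict):
        return []
    hgvs_list = doc.get("HGVSlist")
    if not isinstance(hgvs_list, dict):
        return []
    node = hgvs_list.get("HGVS")
    if isinstance(node, dict):
        # A single element is wrapped so callers always get a list.
        node = [node]
    if not isinstance(node, list):
        return []
    # Drop empty or non-object entries.
    return [n for n in node if isinstance(n, dict) and n]
```

Calling `hgvs_nodes(variation_content)` on the stringified content should then yield zero, one or more dicts regardless of how the source serialized them.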

Each HGVS node will need to be parsed into a structure with the following shape:

hgvs.assembly - $['@Assembly']
hgvs.type - $['@Type']
hgvs.nucleotideExpression - $['NucleotideExpression']['Expression']['$']
hgvs.nucleotideExpression.isManeSelect - $['NucleotideExpression']['@MANESelect'] -- boolean TRUE/FALSE
hgvs.proteinExpression - $['ProteinExpression']['Expression']['$']
hgvs.molecularConsequence.db - $['MolecularConsequence']['@DB']
hgvs.molecularConsequence.id - $['MolecularConsequence']['@ID']
hgvs.molecularConsequence.type - $['MolecularConsequence']['@Type']
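
As a rough illustration, the mapping could be applied to one HGVS node like this in Python. This is a sketch under assumptions: `parse_hgvs` and `_get` are hypothetical names, the nucleotide expression string is placed under an `expression` key (the listing gives `hgvs.nucleotideExpression` both a value and a sub-field, so some nesting choice is needed), `@MANESelect` is assumed to arrive as the text "TRUE"/"true" or a JSON boolean, and `MolecularConsequence` is assumed to be a single object rather than an array.

```python
from typing import Any, Optional


def _get(node: Any, *keys: str) -> Optional[Any]:
    """Walk nested dicts, returning None as soon as a key is missing."""
    cur = node
    for key in keys:
        if not isinstance(cur, dict):
            return None
        cur = cur.get(key)
    return cur


def parse_hgvs(node: dict) -> dict:
    """Map one raw HGVS node onto the target shape (hypothetical helper)."""
    nucleotide = None
    if isinstance(node.get("NucleotideExpression"), dict):
        mane = _get(node, "NucleotideExpression", "@MANESelect")
        nucleotide = {
            "expression": _get(node, "NucleotideExpression", "Expression", "$"),
            # Missing or unrecognized values default to False.
            "isManeSelect": str(mane).strip().lower() == "true",
        }

    consequence = None
    if isinstance(node.get("MolecularConsequence"), dict):
        consequence = {
            "db": _get(node, "MolecularConsequence", "@DB"),
            "id": _get(node, "MolecularConsequence", "@ID"),
            "type": _get(node, "MolecularConsequence", "@Type"),
        }

    return {
        "assembly": _get(node, "@Assembly"),
        "type": _get(node, "@Type"),
        "nucleotideExpression": nucleotide,
        "proteinExpression": _get(node, "ProteinExpression", "Expression", "$"),
        "molecularConsequence": consequence,
    }
```

Fields absent from the source node come back as `None`, which leaves the decision about omitting versus nulling them to whatever destination structure is finalized.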

Some general patterns in how these fields are typically populated:

If the type is genomic, then only the nucleotideExpression values will be included.
If the type is transcript, then isManeSelect may be TRUE; otherwise it defaults to FALSE.
If the type is transcript and it's a protein-coding expression, then the corresponding derived protein expression will likely be provided.
If the type is protein only (or something like that), then only the protein expression will be provided (I think).
The molecularConsequence fields will optionally be available for many of the HGVS expressions that have a transcript nucleotide expression.

We will need to finalize the destination structure for this data in our GeneGraph model. For general reference, these fields will ultimately end up in the VariationDescriptor class that is associated with the core VCV and SCV statements being transformed.

larrybabb · 2022-06-23T14:43:34Z

NOTE: eventually we will be extracting ALL the data from the various Class.content fields. In the initial MVP for the standardization of ClinVar into GeneGraph we will be identifying fields from the ClinicalAssertionObservation.content json and possibly from the GeneAssociation.content.

These are yet to be written up.

theferrit32 · 2022-08-30T01:04:45Z

See genegraph.transform.clinvar.variation for the current way this is done.

https://github.com/clingen-data-model/genegraph/blob/6c3a3a051af7d85b014f4f495549356193aea25d/src/genegraph/transform/clinvar/variation.clj#L47-L157

larrybabb added the clinvar Clinvar data exchange and reporting label Jun 23, 2022

larrybabb mentioned this issue Jun 23, 2022

False negative updates are originating in the clinvar_raw event stream #57

Closed
