
DaSH 2017 Progress Report

Martin Maiers edited this page Dec 13, 2017 · 2 revisions

Progress Report

Haplotype Frequency Curation Service

In 2017, the Haplotype Frequency Curation Service (https://github.com/nmdp-bioinformatics/service-haplotype-frequency-curation) was initially implemented. Endpoints for getting and posting Haplotype Frequencies and Populations have been implemented: http://phycus.b12x.org:8080/swagger-ui.html. Code has also been added to support a delete endpoint for Haplotype Frequencies.

Perl, Python and Java clients have been checked in with the service and all make use of Swagger code generation.

The Java client is multi-module and includes infrastructure for command-line tools. One basic tool is implemented so far: it pushes haplotype frequencies in a standard file format, along with some basic annotations, to the Frequency Curation Service.

/* More to be said about Perl and Python clients here - @mhalagan-nmdp, @hpeberhard */

Discussions regarding useful (pragmatic?) annotation of haplotype frequencies and populations are underway, and implementing and uploading some real-world examples will likely fuel the conversation further. Access control will likely need further discussion, and an implementation, before certain frequency sets can be uploaded. Clarity around the annotation of haplotype frequency sets and populations will help determine how to implement duplicate detection within the service.

/* Other additions? @fscheel, @sauter, @HofmannJ, @jbrelsf2-nmdp, @pbashyal-nmdp */

HL7 FHIR

IHIWS Follow Up

Primate MHC

I started with the goal of extending tools like feature-service, GFE, and ACT to non-human primate MHC.

After locating the NHP.dat file on the IPD website, I noticed that it fails to conform to the EMBL standard in several ways. Summary of the nhp.dat analysis:

  1. the file does not parse with the EMBL and IMGT BioPython parsers
  2. the RA (Reference Author) field contains non-UTF-8 characters (control characters)
  3. the ID field should have 7 fields separated by “;” but has only one
  4. the annotation does not carry over the GenBank annotation
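
Checks 2 and 3 in the list above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the parser used for the analysis: the function name `check_record_lines` and the sample lines are hypothetical, and it assumes the usual EMBL flat-file layout in which every line begins with a two-letter line code.

```python
import re

# Control characters other than tab, newline, and carriage return.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

# Number of ";"-separated fields expected on the ID line (item 3 above).
EXPECTED_ID_FIELDS = 7


def check_record_lines(lines):
    """Return (line_number, problem) tuples for the lines of one record."""
    problems = []
    for number, line in enumerate(lines, start=1):
        code = line[:2]
        if code == "ID":
            # Split the ID line body on ";" and ignore empty trailing pieces.
            fields = [f.strip() for f in line[2:].split(";") if f.strip()]
            if len(fields) != EXPECTED_ID_FIELDS:
                problems.append(
                    (number, f"ID line has {len(fields)} field(s), "
                             f"expected {EXPECTED_ID_FIELDS}")
                )
        elif code == "RA" and CONTROL_CHARS.search(line):
            problems.append((number, "RA line contains control characters"))
    return problems


# Hypothetical record fragment, made up for illustration only.
sample = [
    "ID   NHP00001",
    "RA   Smith J.\x01, Doe A.;",
    "//",
]
for number, problem in check_record_lines(sample):
    print(f"line {number}: {problem}")
```

A report of this kind makes it easier to decide whether a file can be repaired in place or needs a custom parser.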

I wrote my own parser and was able to parse out 9605 GenBank accession IDs for 6812 alleles at 374 loci in 53 non-human primate species.

Many of the alleles are defined based on cDNA, so there is no genomic annotation to be found. But even the alleles (great apes) with gDNA annotations in GenBank are apparently unannotated in IPD-NHP.

So I built a MySQL "BioSQL" database and loaded it with the 9605 GenBank entries, linked back to the corresponding 6812 alleles from IPD-NHP. From here it is now possible to use BioPython for feature-level analysis. (@mmaiers-nmdp)
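
The linkage described above, with several GenBank entries pointing back to one IPD-NHP allele, can be illustrated with a small relational sketch. It uses Python's built-in `sqlite3` rather than MySQL, and a simplified two-table layout rather than the actual BioSQL schema. The table names, column names, and sample identifiers are all hypothetical, and it assumes each accession belongs to exactly one allele.

```python
import sqlite3

# In-memory stand-in for the real database (assumption: not the BioSQL schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE allele (
    allele_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL UNIQUE,
    locus     TEXT NOT NULL,
    species   TEXT NOT NULL
);
CREATE TABLE genbank_entry (
    accession TEXT PRIMARY KEY,
    allele_id INTEGER NOT NULL REFERENCES allele(allele_id)
);
""")


def add_allele(conn, name, locus, species, accessions):
    """Insert one allele and link each of its GenBank accessions to it."""
    cur = conn.execute(
        "INSERT INTO allele (name, locus, species) VALUES (?, ?, ?)",
        (name, locus, species),
    )
    allele_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO genbank_entry (accession, allele_id) VALUES (?, ?)",
        [(acc, allele_id) for acc in accessions],
    )
    return allele_id


def accessions_per_locus(conn):
    """Count linked GenBank accessions for each locus."""
    rows = conn.execute("""
        SELECT a.locus, COUNT(g.accession)
        FROM allele a
        JOIN genbank_entry g ON g.allele_id = a.allele_id
        GROUP BY a.locus
        ORDER BY a.locus
    """)
    return dict(rows.fetchall())


# Placeholder identifiers, made up for illustration only.
add_allele(conn, "allele-1", "locus-A", "species-X", ["ACC0001", "ACC0002"])
add_allele(conn, "allele-2", "locus-B", "species-X", ["ACC0003"])
print(accessions_per_locus(conn))
```

Keeping the accession-to-allele link in its own table lets the accession count exceed the allele count, as in the 9605 versus 6812 figures reported here.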

Etc

DaSH
