Merge branch 'main' into make-public
glass-ships committed Nov 2, 2023
2 parents 4e6c7d3 + c47dc45 commit 51e9174
Showing 10 changed files with 329 additions and 259 deletions.
20 changes: 20 additions & 0 deletions Jenkinsfile-redo-solr
@@ -17,6 +17,16 @@ pipeline {
description: 'Re-run denormalization step',
name: 'RUN_CLOSURIZER'
),
booleanParam(
defaultValue: false,
description: 'Load Solr',
name: 'SOLR'
),
booleanParam(
defaultValue: false,
description: 'Load sqlite',
name: 'SQLITE'
),
booleanParam(
defaultValue: false,
description: 'Upload to bucket',
@@ -69,11 +79,21 @@ pipeline {
}
}
stage('solr') {
when {
expression {
return params.SOLR
}
}
steps {
sh 'poetry run ingest solr'
}
}
stage('sqlite') {
when {
expression {
return params.SQLITE
}
}
steps {
sh 'poetry run ingest sqlite'
}
33 changes: 33 additions & 0 deletions docs/Sources/Phenio.md
@@ -0,0 +1,33 @@
# PHENIO

PHENIO is an ontology for accessing and comparing knowledge concerning phenotypes across species and genetic backgrounds.

PHENIO provides the "semantic backbone" of the Monarch Knowledge Graph (KG).
Designed as an application ontology, PHENIO integrates a variety of ontological concepts, in particular
the "core entities" in the Monarch KG, including diseases, phenotypes, and anatomical entities.

Note that although PHENIO forms an integral part of the Monarch KG, it does not have a "Koza Ingest" configuration like the other sources;
it is instead ingested into the Monarch KG directly via an `OWL -> obographs -> KGX` transform.
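
As a rough illustration of the last step of that transform, the sketch below maps an obographs-style JSON graph onto KGX-style node and edge rows. It is not the code Monarch uses: the helper names (`to_curie`, `obograph_to_kgx`), the one-entry predicate map, and the `EX_` sample graph are assumptions made for this example, and the real transform handles far more fields.

```python
# Illustrative sketch only; the actual ingest relies on the KGX tooling, not this code.
OBO_PREFIX = "http://purl.obolibrary.org/obo/"

# Hypothetical, deliberately tiny predicate mapping (assumption for this example).
PREDICATE_MAP = {"is_a": "biolink:subclass_of"}


def to_curie(iri: str) -> str:
    """Contract an OBO PURL such as .../obo/EX_0000001 to EX:0000001 (simplified)."""
    if iri.startswith(OBO_PREFIX):
        prefix, _, local_id = iri[len(OBO_PREFIX):].partition("_")
        if local_id:
            return f"{prefix}:{local_id}"
    return iri  # leave anything unrecognised untouched


def obograph_to_kgx(graph: dict) -> tuple[list[dict], list[dict]]:
    """Map obographs nodes (id/lbl) and edges (sub/pred/obj) to KGX-style rows."""
    nodes = [
        {"id": to_curie(node["id"]), "name": node.get("lbl", "")}
        for node in graph.get("nodes", [])
    ]
    edges = [
        {
            "subject": to_curie(edge["sub"]),
            "predicate": PREDICATE_MAP.get(edge["pred"], edge["pred"]),
            "object": to_curie(edge["obj"]),
        }
        for edge in graph.get("edges", [])
    ]
    return nodes, edges


# Made-up two-node graph in the obographs JSON layout.
sample = {
    "nodes": [
        {"id": OBO_PREFIX + "EX_0000002", "lbl": "child term"},
        {"id": OBO_PREFIX + "EX_0000001", "lbl": "parent term"},
    ],
    "edges": [
        {"sub": OBO_PREFIX + "EX_0000002", "pred": "is_a", "obj": OBO_PREFIX + "EX_0000001"}
    ],
}
nodes, edges = obograph_to_kgx(sample)
```

Each resulting row uses the `id`/`name` and `subject`/`predicate`/`object` column names that KGX TSV files carry; everything else about the mapping is simplified.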

## Sources

PHENIO integrates several different types of hierarchical relationships from a variety of sources.

These include:
* Chemical entities and relationships from [CHEBI](https://www.ebi.ac.uk/chebi/)
* Disease entities and relationships from [MONDO](https://mondo.monarchinitiative.org/)
* Abnormal phenotypes of humans ([HPO](https://hpo.jax.org/app/)), mice and other mammalian species ([MPO](https://www.informatics.jax.org/vocab/mp_ontology)), the nematode worm Caenorhabditis elegans ([WBPhenotype](http://www.obofoundry.org/ontology/wbphenotype.html)), and zebrafish ([ZFA](http://www.obofoundry.org/ontology/zfa.html)).

[A full list of files used in the construction of PHENIO is available here.](https://monarch-initiative.github.io/phenio/odk-workflows/RepositoryFileStructure/)

## More Information
For more information, see:

- [NCATS Translator Phenio Overview](https://github.com/NCATSTranslator/Translator-All/wiki/phenio)
- [KGHub Phenio](https://github.com/Knowledge-Graph-Hub/kg-phenio)
- [Monarch Phenio](https://github.com/monarch-initiative/phenio)
- [Documentation](https://monarch-initiative.github.io/phenio/)

## Source Code

https://github.com/monarch-initiative/phenio
470 changes: 234 additions & 236 deletions poetry.lock

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion pyproject.toml
@@ -19,7 +19,7 @@ python = ">=3.10,<3.12"
kghub-downloader = "^0.3.2"
koza = "^0.3.0"
cat-merge = ">=0.2.0"
closurizer = "^0.3.2"
closurizer = "0.4.1"
kgx = ">=2.1"
multi-indexer = "0.0.5"
botocore = "^1.31"
2 changes: 1 addition & 1 deletion scripts/add_association_copyfields.sh
@@ -36,7 +36,7 @@ curl -X POST -H 'Content-type:application/json' --data-binary '{

# now add copyfields declarations for subject_label, subject_closure_label, object_label, object_closure_label

for field in subject_label subject_closure_label subject_taxon subject_taxon_label predicate object_label object_closure_label object_taxon object_taxon_label primary_knowledge_source qualifier_label onset_qualifier_label frequency_qualifier_label sex_qualifier_label
for field in subject_label subject_closure_label subject_taxon subject_taxon_label predicate object_label object_closure_label object_taxon object_taxon_label primary_knowledge_source qualifiers_label onset_qualifier_label frequency_qualifier_label sex_qualifier_label
do
curl -X POST -H 'Content-type:application/json' --data-binary "{
\"add-copy-field\": {
2 changes: 1 addition & 1 deletion scripts/after_download.sh
@@ -1,7 +1,7 @@
#!/bin/sh

# Make a simple text file of all the gene IDs in Alliance
zcat data/alliance/BGI_*.gz | jq '.data[].basicGeneticEntity.primaryId' | gzip > data/alliance/alliance_gene_ids.txt.gz
zcat data/alliance/BGI_*.gz | jq '.data[].basicGeneticEntity.primaryId' | pigz > data/alliance/alliance_gene_ids.txt.gz

# Make an id, name map of DDPHENO terms
sqlite3 -cmd ".mode tabs" -cmd ".headers on" data/dictybase/ddpheno.db "select subject as id, value as name from rdfs_label_statement where predicate = 'rdfs:label' and subject like 'DDPHENO:%'" > data/dictybase/ddpheno.tsv
24 changes: 19 additions & 5 deletions scripts/load_solr.sh
@@ -12,16 +12,18 @@ if test -f "output/monarch-kg-denormalized-edges.tsv.gz"; then
fi

echo "Download the schema from monarch-py"
# This replaces poetry run monarch schema > model.yaml
curl -O https://raw.githubusercontent.com/monarch-initiative/monarch-app/v0.15.8/backend/src/monarch_py/datamodels/model.yaml
curl -O https://raw.githubusercontent.com/monarch-initiative/monarch-app/v0.15.8/backend/src/monarch_py/datamodels/similarity.yaml
# This replaces poetry run monarch schema > model.yaml and just awkwardly pulls from a github raw link

# temporarily retrieve from a branch that has the sssom changes, they can't be merged until the new build runs
curl -O https://raw.githubusercontent.com/monarch-initiative/monarch-app/schema-sssom-and-grouping/backend/src/monarch_py/datamodels/model.yaml
curl -O https://raw.githubusercontent.com/monarch-initiative/monarch-app/schema-sssom-and-grouping/backend/src/monarch_py/datamodels/similarity.yaml

echo "Starting the server"
poetry run lsolr start-server
sleep 30

echo "Adding cores"
poetry run lsolr add-cores entity association
poetry run lsolr add-cores entity association sssom
sleep 10

# todo: ideally, this will live in linkml-solr
@@ -37,12 +39,24 @@ echo "Adding association schema"
poetry run lsolr create-schema -C association -s model.yaml -t Association
sleep 5

echo "Adding sssom schema"
poetry run lsolr create-schema -C sssom -s model.yaml -t Mapping
sleep 5

# todo: this also should live in linkml-solr, and copy-fields should be based on the schema
echo "Add dynamic fields and copy fields declarations"
scripts/add_entity_copyfields.sh
scripts/add_association_copyfields.sh
sleep 5

# todo: this should probably happen after associations, but putting it first for testing
echo "Loading SSSOM mappings"
grep -v "^#" data/monarch/mondo.sssom.tsv > headless.mondo.sssom.tsv
# todo: copy the mappings to output/mappings as part of an earlier step
poetry run lsolr bulkload -C sssom -s model.yaml headless.mondo.sssom.tsv
poetry run lsolr bulkload -C sssom -s model.yaml data/monarch/gene_mappings.tsv
poetry run lsolr bulkload -C sssom -s model.yaml data/monarch/chebi-mesh.biomappings.sssom.tsv

echo "Loading entities"
poetry run lsolr bulkload -C entity -s model.yaml output/monarch-kg_nodes.tsv

@@ -64,4 +78,4 @@ chmod -R a+rX solr-data

tar czf solr.tar.gz -C solr-data data
mv solr.tar.gz output/
gzip --force output/monarch-kg-denormalized-edges.tsv
pigz --force output/monarch-kg-denormalized-edges.tsv
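
The `grep -v "^#"` step in `load_solr.sh` drops the `#`-prefixed metadata lines that precede the column header in an SSSOM TSV file, so that the bulk loader sees a plain table. A minimal Python sketch of the same filtering follows; the function name `strip_comment_lines` and the tiny demo file are assumptions for illustration and are not part of the commit.

```python
import tempfile
from pathlib import Path


def strip_comment_lines(src: Path, dst: Path) -> int:
    """Copy src to dst, skipping lines that start with '#'; return the number of lines kept."""
    kept = 0
    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for line in fin:
            if not line.startswith("#"):
                fout.write(line)
                kept += 1
    return kept


# Demo on a made-up three-line file: one metadata line, the header row, one mapping row.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "example.sssom.tsv"
    dst = Path(tmp) / "headless.example.sssom.tsv"
    src.write_text(
        "#mapping_set_id: example\n"
        "subject_id\tpredicate_id\tobject_id\n"
        "EX:1\tskos:exactMatch\tEX:2\n",
        encoding="utf-8",
    )
    kept = strip_comment_lines(src, dst)
    result = dst.read_text(encoding="utf-8")
```

Like the `grep` call, this only removes lines whose first character is `#`; the column header row is kept.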
8 changes: 4 additions & 4 deletions scripts/load_sqlite.sh
@@ -27,14 +27,14 @@ sqlite3 output/monarch-kg.db "create index if not exists denormalized_edges_obje

echo "Cleaning up..."
rm output/monarch-kg_*.tsv
gzip --force output/qc/monarch-kg-dangling-edges.tsv
gzip --force output/monarch-kg-denormalized-edges.tsv
pigz --force output/qc/monarch-kg-dangling-edges.tsv
pigz --force output/monarch-kg-denormalized-edges.tsv

echo "Populate phenio db term_association..."
cp data/monarch/phenio.db.gz output/phenio.db.gz
gunzip output/phenio.db.gz
sqlite3 -cmd "attach 'monarch-kg.db' as monarch" phenio.db "insert into term_association (id, subject, predicate, object, evidence_type, publication, source) select id, subject, predicate, object, has_evidence as evidence_type, publications as publication, primary_knowledge_source as source from monarch.edges where predicate = 'biolink:has_phenotype' and negated <> 'True'"

echo "Compressing databases"
gzip --force output/phenio.db
gzip --force output/monarch-kg.db
pigz --force output/phenio.db
pigz --force output/monarch-kg.db
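
The scripts in this commit replace `gzip` with `pigz`, a parallel implementation that writes the same `.gz` format, and they call it unconditionally. If a machine without `pigz` had to run the same steps, one option would be to fall back to `gzip`. The sketch below shows that idea only; `compress_command` is a hypothetical helper and the fallback itself is an assumption, not something the commit does.

```python
import shutil


def compress_command(path: str, force: bool = True) -> list[str]:
    """Build a compression command, preferring pigz and falling back to gzip."""
    tool = "pigz" if shutil.which("pigz") else "gzip"
    cmd = [tool]
    if force:
        cmd.append("--force")  # both tools accept --force to overwrite an existing .gz
    cmd.append(path)
    return cmd


cmd = compress_command("output/phenio.db")
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` from the root of the repository.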
6 changes: 3 additions & 3 deletions scripts/update_latest_release.sh
@@ -3,12 +3,12 @@
# This script will push a local copy of the Solr, Sqlite and denormalized edge artifacts up to all
# all copies of the bucket for a given release. It needs to be run from the root of the repo

RELEASE=$(gsutil ls gs://data-public-monarchinitiative/monarch-kg-dev/latest/ | grep -Eo "(\d){4}-(\d){2}-(\d){2}")
export RELEASE=$(gsutil ls gs://data-public-monarchinitiative/monarch-kg-dev/latest/ | grep -o '[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}')
echo "Updating Solr, SQLite and denormalized edge files for $RELEASE"

gsutil cp output/monarch-kg.db.gz gs://monarch-archive/monarch-kg-dev/$RELEASE/
gsutil cp output/monarch-kg-denormalized-edges.tsv.gz gs://monarch-archive/monarch-kg-dev/$RELEASE/
gsutil cp output/solr.tar.gz gs://monarch-archive/monarch-kg-dev/$RELEASE/

gsutil cp -r "gs://monarch-archive/monarch-kg-dev/$RELEASE/*" gs://data-public-monarchinitiative/monarch-kg-dev/$RELEASE/
gsutil cp -r "gs://monarch-archive/monarch-kg-dev/$RELEASE/*" gs://monarch-archive/monarch-kg/latest/
gsutil cp "gs://monarch-archive/monarch-kg-dev/$RELEASE/*.gz" gs://data-public-monarchinitiative/monarch-kg-dev/$RELEASE/
gsutil cp "gs://monarch-archive/monarch-kg-dev/$RELEASE/*.gz" gs://data-public-monarchinitiative/monarch-kg-dev/latest/
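
The revised `RELEASE` line swaps `\d` for `[0-9]` bracket expressions, presumably because `\d` is not part of POSIX grep regular expression syntax and so may fail to match depending on the grep implementation. The Python sketch below performs the same extraction with the portable character class; `find_release` and the sample listing are assumptions made for illustration.

```python
import re

# Same shape as the grep pattern: four digits, two digits, two digits.
RELEASE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def find_release(listing: str) -> str:
    """Return the first YYYY-MM-DD date found in a bucket listing."""
    match = RELEASE_PATTERN.search(listing)
    if match is None:
        raise ValueError("no YYYY-MM-DD release date found in listing")
    return match.group(0)


# Hypothetical listing output; the real bucket contents are not shown in this commit.
release = find_release("gs://example-bucket/monarch-kg-dev/latest/2023-11-02/\n")
```

Unlike `grep -o`, which prints every match, this sketch returns only the first date and raises an error when none is present, so an empty `RELEASE` cannot slip into the later `gsutil cp` paths.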
21 changes: 13 additions & 8 deletions src/monarch_ingest/cli_utils.py
@@ -320,14 +320,19 @@ def apply_closure(
output_dir: str = OUTPUT_DIR,
):
output_file = f"{output_dir}/{name}-denormalized-edges.tsv"
add_closure(
kg_archive=f"{output_dir}/{name}.tar.gz",
closure_file=closure_file,
output_file=output_file,
fields=["subject", "object", "frequency_qualifier", "onset_qualifier", "sex_qualifier", "stage_qualifier"],
evidence_fields=["has_evidence", "publications"],
)
sh.gzip(output_file, force=True)
add_closure(kg_archive=f"{output_dir}/{name}.tar.gz",
closure_file=closure_file,
output_file=output_file,
fields=['subject',
'object',
'qualifiers',
'frequency_qualifier',
'onset_qualifier',
'sex_qualifier',
'stage_qualifier'],
evidence_fields=['has_evidence', 'publications'],
grouping_fields=['subject', 'negated', 'predicate', 'object'])
sh.pigz(output_file, force=True)


def load_sqlite():
