Develop new Koza API + general refactoring #163

Open: wants to merge 27 commits into base: main

Commits (27)
db4fc50
Refactor main koza configuration file
Dec 12, 2024
88f40e8
Refactor JSON{,L} readers
Dec 12, 2024
cdbc6f6
Refactor the CSV reader
Dec 12, 2024
0f0bd6b
Refactor writers to use WriterConfig rather than named parameters
Dec 12, 2024
535bb8a
Initial implementation of new KozaRunner API
Dec 12, 2024
4f95c13
Re-work `open_resource` and how it handles compression & archives
Dec 18, 2024
ce1a47e
Move `metadata` from transform config to KozaConfig
Jan 10, 2025
caa10f4
Add ability to override output format when creating a KozaRunner
Jan 10, 2025
cb3e0c6
Make CSV the default type of reader
Jan 10, 2025
0e69234
Generalize TransformConfig in order to remove MapTransformConfig
Jan 10, 2025
0ceb760
Add a PassthroughWriter that simply stores written data from a transform
Jan 10, 2025
73fa492
Add back missing `header_delimiter` option in CSVReaderConfig
Jan 10, 2025
c12442a
Provide defaults for all Config options
Jan 10, 2025
939769e
Add tests to CSVReader
Jan 10, 2025
78abbcf
Remove unused transform `mode` field, add ability to load a module
Jan 10, 2025
b5617a3
Add ability to override configuration fields from a transform YAML
Jan 10, 2025
6a13412
Add ability to load mappings to runner
Jan 10, 2025
0d98af1
Move transform integration tests to new API
Jan 10, 2025
df7baa2
Remove unused tests and test configuration using old API
Jan 13, 2025
2bcb4e2
Fix filter tests
Jan 13, 2025
fcf8314
Re-implement row_limit
Jan 13, 2025
cadffde
Fix testing for sources with multiple files
Jan 13, 2025
626979f
Remove unused MapDict class
Jan 13, 2025
645835c
Remove unused TranslationTable class
Jan 13, 2025
9cc6678
Break up model/config/source_config.py into smaller modules
Jan 13, 2025
1c2d8fb
Move CLI to new API
Jan 14, 2025
a8525fe
Add a bit more logging to the runner to match messages in cli_utils.py
Jan 14, 2025
21 changes: 6 additions & 15 deletions examples/maps/custom-entrez-2-string.py
@@ -1,16 +1,7 @@
-from koza.cli_utils import get_koza_app
+from koza.runner import KozaTransform

-source_name = 'custom-map-protein-links-detailed'
-map_name = 'custom-entrez-2-string'
-
-koza_app = get_koza_app(source_name)
-
-row = koza_app.get_row(map_name)
-
-map = koza_app.get_map(map_name)
-
-entry = dict()
-
-entry["entrez"] = row["entrez"]
-
-map[row["STRING"]] = entry
+def transform_record(koza: KozaTransform, record: dict):
+    koza.write({
+        "STRING": record['STRING'],
+        "entrez": record["entrez"],
+    })
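To make the new map-transform shape concrete, here is a minimal, self-contained sketch of how records written by `transform_record` could be folded into a lookup table keyed on `STRING`. `RecordingKoza` and `build_map` are hypothetical stand-ins and not part of Koza; only the `write` call shape comes from the diff above, and the sample rows are made-up placeholders.

```python
from typing import Any


class RecordingKoza:
    """Hypothetical stand-in for KozaTransform that only stores written records."""

    def __init__(self) -> None:
        self.written: list[dict[str, Any]] = []

    def write(self, *records: dict[str, Any]) -> None:
        self.written.extend(records)


def transform_record(koza: RecordingKoza, record: dict[str, Any]) -> None:
    # Same body as the new example: emit only the key and value columns.
    koza.write({
        "STRING": record["STRING"],
        "entrez": record["entrez"],
    })


def build_map(rows, key: str, values: list[str]) -> dict[str, dict[str, Any]]:
    """Assumed behaviour: index the written records by `key`, keeping `values`."""
    koza = RecordingKoza()
    for row in rows:
        transform_record(koza, row)
    return {rec[key]: {v: rec[v] for v in values} for rec in koza.written}


# Placeholder rows, not taken from the real data files.
rows = [
    {"NCBI taxid": "0", "entrez": "111", "STRING": "taxon.PROTEIN_A"},
    {"NCBI taxid": "0", "entrez": "222", "STRING": "taxon.PROTEIN_B"},
]
entrez_2_string = build_map(rows, key="STRING", values=["entrez"])
```

The `key` and `values` arguments mirror the `transform.key` and `transform.values` settings in the accompanying YAML; how the runner actually consumes them is not shown in this diff.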
28 changes: 15 additions & 13 deletions examples/maps/custom-entrez-2-string.yaml
@@ -3,23 +3,25 @@ name: 'custom-entrez-2-string'
 metadata:
   description: 'Mapping file provided by StringDB that contains entrez to protein ID mappings'

-delimiter: '\t'
-header_delimiter: '/'
+reader:
+  delimiter: '\t'
+  header_prefix: '#'
+  header_delimiter: '/'

-# Assumes that no identifiers are overlapping
-# otherwise these should go into separate configs
-files:
-  - './examples/data/entrez-2-string.tsv'
-  - './examples/data/additional-entrez-2-string.tsv'
+  # Assumes that no identifiers are overlapping
+  # otherwise these should go into separate configs
+  files:
+    - './examples/data/entrez-2-string.tsv'
+    - './examples/data/additional-entrez-2-string.tsv'

-header: 0
+  header_mode: 0

-columns:
+  columns:
   - 'NCBI taxid'
   - 'entrez'
   - 'STRING'

-key: 'STRING'
-
-values:
-  - 'entrez'
+transform:
+  key: 'STRING'
+  values:
+    - 'entrez'
36 changes: 19 additions & 17 deletions examples/maps/entrez-2-string.yaml
@@ -3,23 +3,25 @@ name: 'entrez-2-string'
 metadata:
   description: 'Mapping file provided by StringDB that contains entrez to protein ID mappings'

-delimiter: '\t'
-header_delimiter: '/'
-header: 0
-comment_char: '#'
+reader:
+  delimiter: '\t'
+  header_delimiter: '/'
+  header_mode: 0
+  header_prefix: '#'
+  comment_char: '#'

-# Assumes that no identifiers are overlapping
-# otherwise these should go into separate configs
-files:
-  - './examples/data/entrez-2-string.tsv'
-  - './examples/data/additional-entrez-2-string.tsv'
+  # Assumes that no identifiers are overlapping
+  # otherwise these should go into separate configs
+  files:
+    - './examples/data/entrez-2-string.tsv'
+    - './examples/data/additional-entrez-2-string.tsv'

-columns:
-  - 'NCBI taxid'
-  - 'entrez'
-  - 'STRING'
+  columns:
+    - 'NCBI taxid'
+    - 'entrez'
+    - 'STRING'

-key: 'STRING'
-
-values:
-  - 'entrez'
+transform:
+  key: 'STRING'
+  values:
+    - 'entrez'
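The reader options `header_prefix: '#'` and `header_delimiter: '/'` suggest a header line that carries a leading marker and uses a different separator from the data rows. The diff does not show how the CSV reader interprets them, so the following is only a sketch under that assumption: `parse_header` is a hypothetical helper, and the sample header line is made up for illustration.

```python
def parse_header(line: str, header_prefix: str = "#", header_delimiter: str = "/") -> list[str]:
    """Hypothetical helper: strip an optional prefix, then split on the header delimiter."""
    line = line.rstrip("\n")
    if header_prefix and line.startswith(header_prefix):
        # Drop the marker that distinguishes the header from data rows.
        line = line[len(header_prefix):]
    # Surrounding whitespace is trimmed so 'a / b' and 'a/b' give the same names.
    return [name.strip() for name in line.split(header_delimiter)]


# Made-up header line in the assumed format.
columns = parse_header("#NCBI taxid / entrez / STRING\n")
```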
4 changes: 4 additions & 0 deletions examples/minimal.py
@@ -0,0 +1,4 @@
+from koza.runner import KozaTransform
+
+def transform(koza: KozaTransform):
+    pass
30 changes: 14 additions & 16 deletions examples/string-declarative/declarative-protein-links-detailed.py
@@ -1,24 +1,22 @@
 import re
+from typing import Any
 import uuid

 from biolink_model.datamodel.pydanticmodel_v2 import PairwiseGeneToGeneInteraction, Protein

-from koza.cli_utils import get_koza_app
+from koza.runner import KozaTransform

-koza_app = get_koza_app("declarative-protein-links-detailed")
+def transform_record(koza: KozaTransform, record: dict[str, Any]):
+    protein_a = Protein(id="ENSEMBL:" + re.sub(r"\d+\.", "", record["protein1"]))
+    protein_b = Protein(id="ENSEMBL:" + re.sub(r"\d+\.", "", record["protein2"]))

-row = koza_app.get_row()
+    pairwise_gene_to_gene_interaction = PairwiseGeneToGeneInteraction(
+        id="uuid:" + str(uuid.uuid1()),
+        subject=protein_a.id,
+        object=protein_b.id,
+        predicate="biolink:interacts_with",
+        knowledge_level="not_provided",
+        agent_type="not_provided",
+    )

-protein_a = Protein(id="ENSEMBL:" + re.sub(r"\d+\.", "", row["protein1"]))
-protein_b = Protein(id="ENSEMBL:" + re.sub(r"\d+\.", "", row["protein2"]))
-
-pairwise_gene_to_gene_interaction = PairwiseGeneToGeneInteraction(
-    id="uuid:" + str(uuid.uuid1()),
-    subject=protein_a.id,
-    object=protein_b.id,
-    predicate="biolink:interacts_with",
-    knowledge_level="not_provided",
-    agent_type="not_provided",
-)
-
-koza_app.write(protein_a, protein_b, pairwise_gene_to_gene_interaction)
+    koza.write(protein_a, protein_b, pairwise_gene_to_gene_interaction)
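The ID construction in this transform relies on `re.sub(r"\d+\.", "", ...)` to drop a numeric prefix before prepending `ENSEMBL:`. The sketch below isolates that step so its behaviour is easy to check. The helper names and sample IDs are hypothetical; the anchored variant is a suggested alternative, not something this PR does.

```python
import re


def strip_taxon_prefix(string_id: str) -> str:
    # Same pattern as the transform: removes every run of digits followed by a dot.
    return re.sub(r"\d+\.", "", string_id)


def strip_leading_taxon_prefix(string_id: str) -> str:
    # Suggested alternative: anchor at the start so only the leading prefix is removed.
    return re.sub(r"^\d+\.", "", string_id)


def to_ensembl_curie(string_id: str) -> str:
    return "ENSEMBL:" + strip_taxon_prefix(string_id)
```

Because the original pattern is unanchored, an identifier that contains a later "digits then dot" sequence (for example a versioned ID) would lose that part too; the anchored form avoids this if such IDs can occur in the input.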
84 changes: 43 additions & 41 deletions examples/string-declarative/declarative-protein-links-detailed.yaml
@@ -1,49 +1,51 @@
 name: 'declarative-protein-links-detailed'

-delimiter: ' '
-
-files:
-  - './examples/data/string.tsv'
-  - './examples/data/string2.tsv'
-
 metadata:
   ingest_title: 'String DB'
   ingest_url: 'https://string-db.org'
   description: 'STRING: functional protein association networks'
   rights: 'https://string-db.org/cgi/access.pl?footer_active_subpage=licensing'

-global_table: './examples/translation_table.yaml'
-
-columns:
-  - 'protein1'
-  - 'protein2'
-  - 'neighborhood'
-  - 'fusion'
-  - 'cooccurence'
-  - 'coexpression'
-  - 'experimental'
-  - 'database'
-  - 'textmining'
-  - 'combined_score' : 'int'
-
-filters:
-  - inclusion: 'include'
-    column: 'combined_score'
-    filter_code: 'lt'
-    value: 700
-
-transform_mode: 'flat'
-
-node_properties:
-  - 'id'
-  - 'category'
-  - 'provided_by'
-
-edge_properties:
-  - 'id'
-  - 'subject'
-  - 'predicate'
-  - 'object'
-  - 'category'
-  - 'relation'
-  - 'provided_by'
+reader:
+  format: csv
+
+  delimiter: ' '
+
+  files:
+    - './examples/data/string.tsv'
+    - './examples/data/string2.tsv'
+
+  columns:
+    - 'protein1'
+    - 'protein2'
+    - 'neighborhood'
+    - 'fusion'
+    - 'cooccurence'
+    - 'coexpression'
+    - 'experimental'
+    - 'database'
+    - 'textmining'
+    - 'combined_score' : 'int'
+
+
+transform:
+  filters:
+    - inclusion: 'include'
+      column: 'combined_score'
+      filter_code: 'lt'
+      value: 700
+
+writer:
+  node_properties:
+    - 'id'
+    - 'category'
+    - 'provided_by'
+
+  edge_properties:
+    - 'id'
+    - 'subject'
+    - 'predicate'
+    - 'object'
+    - 'category'
+    - 'relation'
+    - 'provided_by'
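The `filters` entry now sits under `transform`, and reads as "include rows whose `combined_score` is less than 700". A rough sketch of that reading follows. `keep_row` and the `COMPARATORS` table are hypothetical; only the `'lt'` code and the `'include'` value appear in this diff, so the `'gt'` entry and the exclude branch are assumptions.

```python
import operator
from typing import Any, Callable

# Only 'lt' is taken from the YAML above; 'gt' is an assumed counterpart.
COMPARATORS: dict[str, Callable[[Any, Any], bool]] = {
    "lt": operator.lt,
    "gt": operator.gt,
}


def keep_row(row: dict[str, Any], flt: dict[str, Any]) -> bool:
    """Hypothetical helper: decide whether a row passes a single filter."""
    matched = COMPARATORS[flt["filter_code"]](row[flt["column"]], flt["value"])
    # 'include' keeps matching rows; anything else is treated as exclude (assumption).
    return matched if flt["inclusion"] == "include" else not matched


score_filter = {
    "inclusion": "include",
    "column": "combined_score",
    "filter_code": "lt",
    "value": 700,
}
```

The comparison only works if `combined_score` has already been cast to a number, which is presumably what the `'combined_score' : 'int'` column entry in the `reader` section is for.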
34 changes: 17 additions & 17 deletions examples/string-w-custom-map/custom-map-protein-links-detailed.py
@@ -2,23 +2,23 @@

 from biolink_model.datamodel.pydanticmodel_v2 import Gene, PairwiseGeneToGeneInteraction

-from koza.cli_utils import get_koza_app
+from koza.runner import KozaTransform

-source_name = "custom-map-protein-links-detailed"
-koza_app = get_koza_app(source_name)
-row = koza_app.get_row()
-entrez_2_string = koza_app.get_map("custom-entrez-2-string")
+def transform_record(koza: KozaTransform, record: dict):
+    a = record["protein1"]
+    b = record["protein2"]
+    mapped_a = koza.lookup(a, "entrez")
+    mapped_b = koza.lookup(b, "entrez")
+    gene_a = Gene(id="NCBIGene:" + mapped_a)
+    gene_b = Gene(id="NCBIGene:" + mapped_b)

-gene_a = Gene(id="NCBIGene:" + entrez_2_string[row["protein1"]]["entrez"])
-gene_b = Gene(id="NCBIGene:" + entrez_2_string[row["protein2"]]["entrez"])
+    pairwise_gene_to_gene_interaction = PairwiseGeneToGeneInteraction(
+        id="uuid:" + str(uuid.uuid1()),
+        subject=gene_a.id,
+        object=gene_b.id,
+        predicate="biolink:interacts_with",
+        knowledge_level="not_provided",
+        agent_type="not_provided",
+    )

-pairwise_gene_to_gene_interaction = PairwiseGeneToGeneInteraction(
-    id="uuid:" + str(uuid.uuid1()),
-    subject=gene_a.id,
-    object=gene_b.id,
-    predicate="biolink:interacts_with",
-    knowledge_level="not_provided",
-    agent_type="not_provided",
-)
-
-koza_app.write(gene_a, gene_b, pairwise_gene_to_gene_interaction)
+    koza.write(gene_a, gene_b, pairwise_gene_to_gene_interaction)
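The explicit `entrez_2_string[...]["entrez"]` indexing is replaced here by `koza.lookup(key, field)`. The call does not name a map, so the sketch below assumes a single merged table of the form `{key: {field: value}}`. `MappingKoza` is a hypothetical stand-in, and the fallback for missing keys is an assumption; the diff does not show what the real `lookup` does in that case.

```python
from typing import Any


class MappingKoza:
    """Hypothetical stand-in for KozaTransform, exposing only `lookup`."""

    def __init__(self, table: dict[str, dict[str, Any]]) -> None:
        self._table = table

    def lookup(self, key: str, field: str) -> Any:
        entry = self._table.get(key, {})
        # Assumed fallback: return the key unchanged when no mapping exists.
        return entry.get(field, key)


def gene_curie(koza: MappingKoza, protein_id: str) -> str:
    # Mirrors the transform: look up the entrez ID, then prefix it.
    return "NCBIGene:" + koza.lookup(protein_id, "entrez")


# Placeholder mapping, not taken from the real data files.
koza = MappingKoza({"taxon.PROTEIN_A": {"entrez": "111"}})
```

Whatever the real failure behaviour is, the string concatenation in the transform requires `lookup` to return a string, so a `None` result for an unmapped protein would raise a `TypeError` there.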
85 changes: 43 additions & 42 deletions examples/string-w-custom-map/custom-map-protein-links-detailed.yaml
@@ -1,46 +1,47 @@
 name: 'custom-map-protein-links-detailed'

-delimiter: ' '
-
-files:
-  - './examples/data/string.tsv'
-  - './examples/data/string2.tsv'
-
 metadata: !include './examples/string-w-custom-map/metadata.yaml'

-columns:
-  - 'protein1'
-  - 'protein2'
-  - 'neighborhood'
-  - 'fusion'
-  - 'cooccurence'
-  - 'coexpression'
-  - 'experimental'
-  - 'database'
-  - 'textmining'
-  - 'combined_score' : 'int'
-
-filters:
-  - inclusion: 'include'
-    column: 'combined_score'
-    filter_code: 'lt'
-    value: 700
-
-depends_on:
-  - 'examples/maps/custom-entrez-2-string.yaml'
-
-transform_mode: 'flat'
-
-node_properties:
-  - 'id'
-  - 'category'
-  - 'provided_by'
-
-edge_properties:
-  - 'id'
-  - 'subject'
-  - 'predicate'
-  - 'object'
-  - 'category'
-  - 'relation'
-  - 'provided_by'
+reader:
+  delimiter: ' '
+
+  files:
+    - './examples/data/string.tsv'
+    - './examples/data/string2.tsv'
+
+  columns:
+    - 'protein1'
+    - 'protein2'
+    - 'neighborhood'
+    - 'fusion'
+    - 'cooccurence'
+    - 'coexpression'
+    - 'experimental'
+    - 'database'
+    - 'textmining'
+    - 'combined_score' : 'int'
+
+transform:
+  filters:
+    - inclusion: 'include'
+      column: 'combined_score'
+      filter_code: 'lt'
+      value: 700
+
+  mappings:
+    - 'examples/maps/custom-entrez-2-string.yaml'
+
+writer:
+  node_properties:
+    - 'id'
+    - 'category'
+    - 'provided_by'
+
+  edge_properties:
+    - 'id'
+    - 'subject'
+    - 'predicate'
+    - 'object'
+    - 'category'
+    - 'relation'
+    - 'provided_by'