346 duckdb for benchmarking #348

Merged
merged 82 commits into from
Sep 10, 2024
bc72174
add duckdb dependency
yaseminbridges Jul 9, 2024
b2fc779
move methods to retrieve variants/genes/diseases from phenopackets
yaseminbridges Jul 9, 2024
c0ce36e
add `CorpusParser` class to parse phenopacket corpus and record known…
yaseminbridges Jul 9, 2024
8baef5a
change typing to `BenchmarkRunOutputGenerator`
yaseminbridges Jul 9, 2024
85e19d1
implement `CorpusParser().parse_corpus()` method
yaseminbridges Jul 9, 2024
07e3458
add `get_connection()` method to connect to the benchmarking db
yaseminbridges Jul 9, 2024
bf2b613
record matched results in db rather than using dictionaries
yaseminbridges Jul 9, 2024
d79d55d
replace `RankStats` and `RankStatsWriter` methods that focus on writi…
yaseminbridges Jul 9, 2024
5d2b001
add `phenopacket_dir` variable to `BenchmarkRunResults`
yaseminbridges Jul 9, 2024
cffd41b
calculate rank stats
yaseminbridges Jul 9, 2024
745d51b
refactor constant variable names
yaseminbridges Jul 9, 2024
4cd2b06
refactor structure to connect to DB once when benchmarking whole dire…
yaseminbridges Jul 9, 2024
6ba5580
refactor db connection
yaseminbridges Jul 9, 2024
81a8e49
refactor db connection
yaseminbridges Jul 9, 2024
74b29a3
refactor DBConnector
yaseminbridges Jul 9, 2024
6ab525d
refactor prioritisation type strings
yaseminbridges Jul 9, 2024
4c153ed
adding missing args to docstrings
yaseminbridges Jul 9, 2024
780b0f6
add rank stats to table rather than writing to file
yaseminbridges Jul 9, 2024
fa8ddb4
refactor method name for adding column
yaseminbridges Jul 29, 2024
60c9dd1
refactor variable name
yaseminbridges Jul 29, 2024
0b85056
removing `RankComparisonGenerator` class and replacing with methods t…
yaseminbridges Jul 29, 2024
23a184c
add method to drop table
yaseminbridges Jul 29, 2024
462f60c
add rank comparison suffix for table naming
yaseminbridges Jul 29, 2024
9b9a3ec
remove parameter for rank comparison dictionary
yaseminbridges Jul 29, 2024
154f5fb
remove parameter for rank comparison dictionary
yaseminbridges Jul 29, 2024
97d478a
fix table name
yaseminbridges Jul 29, 2024
720da4e
remove ranks parameter
yaseminbridges Jul 29, 2024
fe4fc87
refactor to use duckdb for benchmarking
yaseminbridges Jul 29, 2024
92a8385
format docstrings
yaseminbridges Jul 29, 2024
6b3a132
refactor column names
yaseminbridges Jul 29, 2024
0211fc8
tox lint
yaseminbridges Jul 30, 2024
c62eb9e
allow S608
yaseminbridges Jul 30, 2024
1ee8b14
remove unused methods
yaseminbridges Jul 30, 2024
aaed184
remove unused methods
yaseminbridges Jul 31, 2024
dbd05f0
remove unused methods
yaseminbridges Jul 31, 2024
47e1c1f
fix SQL statement
yaseminbridges Aug 1, 2024
54909f6
tox lint
yaseminbridges Aug 1, 2024
dd4fc79
update tests to utilise DuckDB
yaseminbridges Aug 1, 2024
873ac14
remove argument
yaseminbridges Aug 1, 2024
33d67c0
update `RankStats.add_ranks()` mocking duckdb connection
yaseminbridges Aug 1, 2024
278518d
add tests for comparison tables
yaseminbridges Aug 1, 2024
dccdd09
add google docstrings
yaseminbridges Aug 1, 2024
1bb68bf
alter codebase to process run configurations for benchmarking
yaseminbridges Aug 7, 2024
5cb9cbe
alter codebase to process run configurations for benchmarking
yaseminbridges Aug 7, 2024
78ea8e2
clear plot figure before generating to avoid overlapping plots
yaseminbridges Aug 7, 2024
403e5d9
parse TSV result to a duckdb table in place of using a pandas
yaseminbridges Aug 19, 2024
a5d4ee0
add custom `contains_entity_function` and function to parse a table i…
yaseminbridges Aug 19, 2024
d96751f
reformat tests to use duckdb table
yaseminbridges Aug 19, 2024
a73112f
remove unused methods
yaseminbridges Aug 19, 2024
195f7fa
remove unused import
yaseminbridges Aug 19, 2024
02c60eb
remove methods for benchmarking a single directory, add plot customis…
yaseminbridges Aug 22, 2024
5e18074
add plot customisation
yaseminbridges Aug 22, 2024
a250685
add benchmark name to access db
yaseminbridges Aug 22, 2024
7de4d6f
implement plot customisation
yaseminbridges Aug 22, 2024
cfb05f1
remove redundant method for generating single output
yaseminbridges Aug 22, 2024
5588f46
allow for naming of output db
yaseminbridges Aug 22, 2024
3a93781
add benchmark name
yaseminbridges Aug 22, 2024
edf6faa
add classes for parsing benchmarking yaml file for plot customisation
yaseminbridges Aug 22, 2024
209eafa
collapse benchmarking commands into a single command that can benchma…
yaseminbridges Aug 22, 2024
ee6915f
add missing benchmark_name parameter
yaseminbridges Aug 22, 2024
640e0e2
tox lint
yaseminbridges Aug 22, 2024
fb87c98
add function to check if table exists
yaseminbridges Aug 22, 2024
10e6c33
implement methods to gather benchmarking stats results from db
yaseminbridges Aug 22, 2024
7b23d35
implement methods to gather benchmarking stats results from db
yaseminbridges Aug 22, 2024
6e4b183
refactor `generate_plots_from_benchmark_summary_tsv` to `generate_plo…
yaseminbridges Aug 22, 2024
c2dcea1
add customisation of plot output file names
yaseminbridges Aug 22, 2024
7f05096
add missing argument
yaseminbridges Aug 22, 2024
c4420b6
refactor `benchmark` to `generate_benchmark_stats`
yaseminbridges Aug 22, 2024
053126a
remove threshold and score order parameters as these are now included…
yaseminbridges Aug 27, 2024
697561b
remove threshold and score order parameters
yaseminbridges Aug 27, 2024
a3608e2
add threshold and score order parameters with default values
yaseminbridges Aug 27, 2024
0a8d7c4
refactor `DBConnector` to `BenchmarkDBManager`
yaseminbridges Aug 27, 2024
a87176b
Add Executing a Benchmark
yaseminbridges Aug 27, 2024
f52c6ea
Add docs for executing a benchmark
yaseminbridges Aug 27, 2024
43b21b2
remove @classmethod decorator
yaseminbridges Aug 28, 2024
9f09cf8
remove constants.py
yaseminbridges Sep 5, 2024
dd79006
keeping strings local to their usage
yaseminbridges Sep 5, 2024
2835124
tox lint
yaseminbridges Sep 5, 2024
2441184
merge main into branch
yaseminbridges Sep 5, 2024
788f363
refactor benchmark command
yaseminbridges Sep 5, 2024
5a1e677
Revert "refactor benchmark command"
yaseminbridges Sep 5, 2024
dfe23e3
Refactoring the end-to-end test to align with the refactored pheval m…
souzadevinicius Sep 6, 2024
49 changes: 39 additions & 10 deletions Makefile
@@ -57,15 +57,6 @@ $(TMP_DATA)/semsim/%.sql:
wget $(SEMSIM_BASE_URL)/$*.sql -O $@


$(ROOT_DIR)/results/run_data.txt:
touch $@

$(ROOT_DIR)/results/gene_rank_stats.svg: $(ROOT_DIR)/results/run_data.txt
pheval-utils benchmark-comparison -r $< -o $(ROOT_DIR)/$(shell dirname $@)/results --gene-analysis -y bar_cumulative
mv $(ROOT_DIR)/gene_rank_stats.svg $@

.PHONY: pheval-report
pheval-report: $(ROOT_DIR)/results/gene_rank_stats.svg


$(ROOT_DIR)/results/template-1.0.0/results.yml: configurations/template-1.0.0/config.yaml corpora/lirical/default/corpus.yml
@@ -88,10 +79,48 @@ $(ROOT_DIR)/results/template-1.0.0/results.yml: configurations/template-1.0.0/co
--output-dir $(shell dirname $@)

touch $@
echo -e "$(ROOT_DIR)/corpora/lirical/default/phenopackets\t$(shell dirname $@)" >> results/run_data.txt

.PHONY: pheval-run
pheval-run: $(ROOT_DIR)/results/template-1.0.0/results.yml


$(ROOT_DIR)/results/template-1.0.0/run_data.yaml:
	printf '%s\n' \
	"benchmark_name: fake_predictor_benchmark" \
	"runs:" \
	"  - run_identifier: run_identifier_1" \
	"    results_dir: $(shell dirname $@)" \
	"    phenopacket_dir: $(ROOT_DIR)/corpora/lirical/default/phenopackets" \
	"    gene_analysis: True" \
	"    variant_analysis: False" \
	"    disease_analysis: False" \
	"    threshold:" \
	"    score_order: descending" \
	"plot_customisation:" \
	"  gene_plots:" \
	"    plot_type: bar_cumulative" \
	"    rank_plot_title:" \
	"    roc_curve_title: " \
	"    precision_recall_title: " \
	"  disease_plots:" \
	"    plot_type: bar_cumulative" \
	"    rank_plot_title:" \
	"    roc_curve_title: " \
	"    precision_recall_title: " \
	"  variant_plots:" \
	"    plot_type: bar_cumulative" \
	"    rank_plot_title: " \
	"    roc_curve_title: " \
	"    precision_recall_title: " \
	> $@

$(ROOT_DIR)/results/template-1.0.0/gene_rank_stats.svg: $(ROOT_DIR)/results/template-1.0.0/run_data.yaml
	pheval-utils generate-benchmark-stats -r $<

.PHONY: pheval-report
pheval-report: $(ROOT_DIR)/results/template-1.0.0/gene_rank_stats.svg


corpora/lirical/default/corpus.yml:
test -d $(ROOT_DIR)/corpora/lirical/default/ || mkdir -p $(ROOT_DIR)/corpora/lirical/default/

107 changes: 107 additions & 0 deletions docs/executing_a_benchmark.md
@@ -0,0 +1,107 @@
# Executing a Benchmark

PhEval is designed for benchmarking algorithms across various datasets. To execute a benchmark using PhEval, you need to:

1. Execute your runner, generating the PhEval standardised TSV outputs for gene/variant/disease prioritisation.
2. Configure the benchmarking parameters.
3. Run the benchmark.

PhEval will generate various performance reports, allowing you to easily compare the effectiveness of different algorithms.

## After the Runner Execution

After executing a run, you may be left with an output directory structure like so:

```tree
.
├── pheval_disease_results
│   ├── patient_1-pheval_disease_result.tsv
├── pheval_gene_results
│   ├── patient_1-pheval_gene_result.tsv
├── pheval_variant_results
│   ├── patient_1-pheval_variant_result.tsv
├── raw_results
│   ├── patient_1.json
├── results.yml
└── tool_input_commands
    └── tool_input_commands.txt
```
Whether the `pheval_disease_results`, `pheval_gene_results`, and `pheval_variant_results` directories are populated depends on what is specified in the `config.yaml` for the runner execution. The results in these directories are what the benchmarking consumes to produce the statistical comparison reports.
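
As a quick check before benchmarking, the sketch below reports which of the three standardised result directories actually contain TSV files. The function name `populated_result_dirs` is hypothetical and not part of PhEval; only the directory names are taken from the structure shown above.

```python
from pathlib import Path

# Directory names taken from the runner output structure shown above.
RESULT_DIRS = (
    "pheval_gene_results",
    "pheval_variant_results",
    "pheval_disease_results",
)


def populated_result_dirs(results_dir: Path) -> dict:
    """Map each standardised results directory to whether it holds any TSV file.

    Hypothetical helper for illustration only; it is not a PhEval function.
    """
    status = {}
    for name in RESULT_DIRS:
        directory = results_dir / name
        # A missing directory counts as "not populated".
        status[name] = directory.is_dir() and any(directory.glob("*.tsv"))
    return status
```

A directory reported as `False` here would suggest leaving the corresponding analysis switched off in the benchmarking configuration.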

## Benchmarking Configuration File

To configure the benchmarking parameters, a YAML configuration file should be created and supplied to the CLI command.

An outline of the configuration file structure follows below:

```yaml
benchmark_name: exomiser_14_benchmark
runs:
  - run_identifier: run_identifier_1
    results_dir: /path/to/results_dir_1
    phenopacket_dir: /path/to/phenopacket_dir
    gene_analysis: True
    variant_analysis: False
    disease_analysis: True
    threshold:
    score_order: descending
  - run_identifier: run_identifier_2
    results_dir: /path/to/results_dir_2
    phenopacket_dir: /path/to/phenopacket_dir
    gene_analysis: True
    variant_analysis: True
    disease_analysis: True
    threshold:
    score_order: descending
plot_customisation:
  gene_plots:
    plot_type: bar_cumulative
    rank_plot_title:
    roc_curve_title:
    precision_recall_title:
  disease_plots:
    plot_type: bar_cumulative
    rank_plot_title:
    roc_curve_title:
    precision_recall_title:
  variant_plots:
    plot_type: bar_cumulative
    rank_plot_title:
    roc_curve_title:
    precision_recall_title:

```

The `benchmark_name` is used to name the DuckDB database that will contain all the ranking and binary statistics, as well as the comparisons between runs. The name provided should not contain any whitespace or special characters.
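
A name can be checked before it goes into the configuration file. The sketch below is one possible reading of "no whitespace or special characters", namely letters, digits and underscores only. That rule and the function name `is_safe_benchmark_name` are assumptions for illustration; PhEval itself may accept a different set of characters.

```python
import re

# Assumption: "no whitespace or special characters" is interpreted here as
# ASCII letters, digits and underscores only. PhEval's own rule may differ.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def is_safe_benchmark_name(name: str) -> bool:
    """Return True if the whole name matches the assumed safe pattern."""
    return _NAME_PATTERN.fullmatch(name) is not None
```

Under this reading, `exomiser_14_benchmark` passes, while `exomiser 14` and `bench-mark!` are rejected.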

### Runs section

The `runs` section specifies which run configurations should be included in the benchmarking. For each run configuration you will need to populate the following parameters:

- `run_identifier`: The identifier associated with the run. This should be meaningful, as it is used to name tables and plots.
- `results_dir`: The full path to the root directory containing the `pheval_gene_results`/`pheval_variant_results`/`pheval_disease_results` directories.
- `phenopacket_dir`: The full path to the phenopacket directory used during the runner execution.
- `gene_analysis`: Boolean specifying whether to perform benchmarking for gene prioritisation analysis.
- `variant_analysis`: Boolean specifying whether to perform benchmarking for variant prioritisation analysis.
- `disease_analysis`: Boolean specifying whether to perform benchmarking for disease prioritisation analysis.
- `threshold`: OPTIONAL score threshold to consider for inclusion of results.
- `score_order`: Ordering of results for ranking. Either `ascending` or `descending`.
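
To illustrate how `threshold` and `score_order` might interact, here is a minimal sketch that filters and ranks a set of scores. It is not PhEval's implementation: the function name `rank_results`, the inclusive threshold comparison, and the simple ordinal ranking (ties keep their input order) are all assumptions made for the example.

```python
def rank_results(scores, score_order="descending", threshold=None):
    """Return (identifier, score, rank) tuples, best result first.

    Illustrative only. Assumptions: the threshold is inclusive, and ties
    receive consecutive ranks in their input order.
    """
    if score_order not in ("ascending", "descending"):
        raise ValueError("score_order must be 'ascending' or 'descending'")
    descending = score_order == "descending"

    items = list(scores.items())
    if threshold is not None:
        # Descending: higher is better, keep scores at or above the threshold.
        # Ascending: lower is better, keep scores at or below the threshold.
        items = [
            (identifier, score)
            for identifier, score in items
            if (score >= threshold if descending else score <= threshold)
        ]

    ordered = sorted(items, key=lambda item: item[1], reverse=descending)
    return [
        (identifier, score, rank)
        for rank, (identifier, score) in enumerate(ordered, start=1)
    ]
```

With `score_order="descending"`, the highest score takes rank 1; with `ascending` (for example when the score is a p-value), the lowest score does.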

### Plot customisation section

The `plot_customisation` section specifies any additional customisation of the plots output by the benchmarking. Here you can specify titles for all the plots, as well as the plot type used to display the summary ranking stats. The section is split into the plots produced by the gene, variant and disease prioritisation benchmarking. The parameters in this section do not need to be populated; any title left blank defaults to a generic title. The parameters are as follows:

- `plot_type`: The plot type output for the summary rank stats plot. This can be `bar_cumulative`, `bar_non_cumulative` or `bar_stacked`.
- `rank_plot_title`: The customised title for the summary rank stats plot.
- `roc_curve_title`: The customised title for the ROC curve plot.
- `precision_recall_title`: The customised title for the precision-recall curve plot.
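
The fallback behaviour for blank parameters can be pictured with the sketch below. The function `resolve_plot_settings`, the wording of the generic titles, and the fallback to `bar_cumulative` when `plot_type` is blank are hypothetical; the source only states that blank titles default to generic ones and lists the three valid plot types.

```python
# The three plot types listed above.
PLOT_TYPES = ("bar_cumulative", "bar_non_cumulative", "bar_stacked")
TITLE_KEYS = ("rank_plot_title", "roc_curve_title", "precision_recall_title")


def resolve_plot_settings(section, analysis):
    """Fill blank plot settings for one analysis type (gene, variant, disease).

    Hypothetical helper: the generic titles and the default plot type are
    placeholders, not the values PhEval actually uses.
    """
    section = section or {}
    # Assumption: a blank plot_type falls back to bar_cumulative.
    plot_type = section.get("plot_type") or "bar_cumulative"
    if plot_type not in PLOT_TYPES:
        raise ValueError(f"plot_type must be one of {PLOT_TYPES}")

    resolved = {"plot_type": plot_type}
    for key in TITLE_KEYS:
        # Placeholder generic title, e.g. "gene rank plot".
        generic = f"{analysis} {key[: -len('_title')].replace('_', ' ')}"
        resolved[key] = section.get(key) or generic
    return resolved
```

A blank YAML value such as `rank_plot_title:` parses to an empty value, which this sketch treats the same as a missing key.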

## Executing the benchmark

After configuring the benchmarking YAML, executing the benchmark takes a single command.

```bash
pheval-utils generate-benchmark-stats --run-yaml benchmarking_config.yaml
```
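
When the benchmark is launched from a Python script rather than a shell, the same command can be wrapped as in the sketch below. The helper names `build_benchmark_command` and `run_benchmark` are hypothetical; the command and its `--run-yaml` option are the ones shown above.

```python
import shutil
import subprocess
from pathlib import Path


def build_benchmark_command(run_yaml: Path) -> list:
    """Build the CLI invocation shown above as an argument list."""
    return ["pheval-utils", "generate-benchmark-stats", "--run-yaml", str(run_yaml)]


def run_benchmark(run_yaml: Path) -> None:
    """Run the benchmark, failing early if the CLI or config file is missing.

    Hypothetical wrapper for illustration; it is not part of PhEval.
    """
    if shutil.which("pheval-utils") is None:
        raise RuntimeError("pheval-utils was not found on PATH")
    if not run_yaml.is_file():
        raise FileNotFoundError(f"No such configuration file: {run_yaml}")
    # check=True raises CalledProcessError if the command exits non-zero.
    subprocess.run(build_benchmark_command(run_yaml), check=True)
```

Passing the arguments as a list avoids shell quoting problems when the configuration path contains spaces.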


1 change: 1 addition & 0 deletions mkdocs.yml
@@ -56,6 +56,7 @@ nav:
- "styleguide.md"
- "CODE_OF_CONDUCT.md"
- Plugins: "plugins.md"
- Executing a Benchmark: "executing_a_benchmark.md"
- "roadmap.md"


63 changes: 59 additions & 4 deletions poetry.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions pyproject.toml
@@ -27,6 +27,7 @@ matplotlib = "^3.7.0"
pyserde = "^0.9.8"
polars = "^0.19.15"
scikit-learn = "^1.4.0"
duckdb = "^1.0.0"

[tool.poetry.dev-dependencies]
pytest = "^7.2.0"
49 changes: 39 additions & 10 deletions resources/Makefile.j2
@@ -92,15 +92,6 @@ $(TMP_DATA)/semsim/%.sql:
wget $(SEMSIM_BASE_URL)/$*.sql -O $@


$(ROOT_DIR)/results/run_data.txt:
touch $@

$(ROOT_DIR)/results/gene_rank_stats.svg: $(ROOT_DIR)/results/run_data.txt
pheval-utils benchmark-comparison -r $< -o $(ROOT_DIR)/$(shell dirname $@)/results --gene-analysis -y bar_cumulative
mv $(ROOT_DIR)/gene_rank_stats.svg $@

.PHONY: pheval-report
pheval-report: $(ROOT_DIR)/results/gene_rank_stats.svg

{% for run in runs %}
$(ROOT_DIR)/results/{{ run.configuration }}/results.yml: configurations/{{ run.configuration }}/config.yaml corpora/{{ run.corpus }}/{{ run.corpusvariant }}/corpus.yml
@@ -125,10 +116,48 @@ $(ROOT_DIR)/results/{{ run.configuration }}/results.yml: configurations/{{ run.c
--output-dir $(shell dirname $@)

touch $@
echo -e "$(ROOT_DIR)/corpora/{{ run.corpus }}/default/phenopackets\t$(shell dirname $@)" >> results/run_data.txt

.PHONY: pheval-run
pheval-run: $(ROOT_DIR)/results/{{ run.configuration }}/results.yml


$(ROOT_DIR)/results/{{ run.configuration }}/run_data.yaml:
	printf '%s\n' \
	"benchmark_name: fake_predictor_benchmark" \
	"runs:" \
	"  - run_identifier: run_identifier_1" \
	"    results_dir: $(shell dirname $@)" \
	"    phenopacket_dir: $(ROOT_DIR)/corpora/lirical/default/phenopackets" \
	"    gene_analysis: True" \
	"    variant_analysis: False" \
	"    disease_analysis: False" \
	"    threshold:" \
	"    score_order: descending" \
	"plot_customisation:" \
	"  gene_plots:" \
	"    plot_type: bar_cumulative" \
	"    rank_plot_title:" \
	"    roc_curve_title: " \
	"    precision_recall_title: " \
	"  disease_plots:" \
	"    plot_type: bar_cumulative" \
	"    rank_plot_title:" \
	"    roc_curve_title: " \
	"    precision_recall_title: " \
	"  variant_plots:" \
	"    plot_type: bar_cumulative" \
	"    rank_plot_title: " \
	"    roc_curve_title: " \
	"    precision_recall_title: " \
	> $@

$(ROOT_DIR)/results/{{ run.configuration }}/gene_rank_stats.svg: $(ROOT_DIR)/results/{{ run.configuration }}/run_data.yaml
	pheval-utils generate-benchmark-stats -r $<

.PHONY: pheval-report
pheval-report: $(ROOT_DIR)/results/{{ run.configuration }}/gene_rank_stats.svg


{% endfor %}

