
346 duckdb for benchmarking #348

Merged: 82 commits merged into main from the 346-duckdb-for-benchmarking branch on Sep 10, 2024

Conversation

@yaseminbridges (Contributor) commented Jul 30, 2024

Unfortunately this is a very large PR; however, it is a much-needed and positive upgrade to the benchmarking code.

Before:

  • Lots and lots of opening and closing of phenopacket files. For each run, every phenopacket file would be opened and the same information-retrieval methods would be called on it. This means that, when running the benchmarking on the 4k phenopacket-store corpus for 4 different run configurations, the same batch of phenopackets would be opened, queried, and closed ~16k times.
  • Dictionaries and lists were used to store the benchmarking data, making the code hard to read.
  • Lots of files were output from the command, making it difficult at times to navigate to the data you want to access.
  • Limited customisation of the plots output by the command.

Now:

  • A unique corpus of phenopackets is read in once and the data retrieval happens once. All of the data is stored in a single DuckDB table for reference, so benchmarking the 4k corpus across 4 different run configurations now performs each operation only once (see the sketch after this list).
  • Eliminated dictionaries and most lists for storing data; data is now taken from a DuckDB table.
  • All outputs (apart from the SVG plots) are now contained in a single DB, making them easier to navigate.
  • The plot output can now be customised from the command: custom titles for all plots, and the runs in the plot keys are named after a specified run identifier.
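A minimal sketch of the single-pass corpus load, assuming a hypothetical benchmark.db database, a phenopackets/ directory, and illustrative table and column names (these are not the PR's actual identifiers, and the field extraction is heavily simplified):

    import json
    from pathlib import Path

    import duckdb

    con = duckdb.connect("benchmark.db")
    con.execute(
        "CREATE TABLE IF NOT EXISTS corpus (phenopacket VARCHAR, known_disease VARCHAR)"
    )
    # Each phenopacket is opened and parsed exactly once; the extracted
    # information is cached in the corpus table for all subsequent runs.
    for path in Path("phenopackets").glob("*.json"):
        with open(path) as handle:
            phenopacket = json.load(handle)
        diseases = phenopacket.get("diseases") or [{}]
        known_disease = diseases[0].get("term", {}).get("id", "")
        con.execute(
            "INSERT INTO corpus VALUES (?, ?)", [path.name, known_disease]
        )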

This change was necessary: the overall benchmarking process is now more efficient in terms of speed, data storage, and code readability (the codebase has also been cut down).

Finally, I have added some much-needed documentation for running a benchmark.

A general idea of how things work now:

  1. The run configuration YAML file is consumed and parsed, providing all the information required for the benchmarking run(s).
  2. The unique corpus of phenopackets is parsed and the relevant gene/variant/disease information (if specified) is stored in a DuckDB table (TABLE A).
  3. Depending on whether gene/variant/disease benchmarking is specified for that run configuration, the standardised TSV result is read in as a DuckDB table (TABLE B) and compared against what is known from the phenopacket (retrieved from TABLE A); the rank is then retrieved and added to TABLE A.
  4. Summary stats, plots, and comparisons between runs are generated from TABLE A (see the sketch below).
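To make steps 3 and 4 concrete, here is a minimal sketch assuming the corpus table from the previous sketch and a hypothetical result file run1_disease_results.tsv with phenopacket, disease_identifier and rank columns (the real PhEval column names may differ); for simplicity it materialises a derived table rather than adding the rank column to TABLE A in place:

    import duckdb

    con = duckdb.connect("benchmark.db")

    # Step 3: read the standardised TSV result for one run as TABLE B.
    con.execute(
        """
        CREATE OR REPLACE TABLE run1_results AS
        SELECT * FROM read_csv_auto('run1_disease_results.tsv')
        """
    )

    # Look up the rank of the known disease for each phenopacket;
    # a missing match is recorded as rank 0.
    con.execute(
        """
        CREATE OR REPLACE TABLE run1_disease_ranks AS
        SELECT c.phenopacket,
               c.known_disease,
               COALESCE(MIN(r."rank"), 0) AS run1_rank
        FROM corpus c
        LEFT JOIN run1_results r
          ON r.phenopacket = c.phenopacket
         AND r.disease_identifier = c.known_disease
        GROUP BY c.phenopacket, c.known_disease
        """
    )

    # Step 4: summary stats come straight out of the table with plain SQL.
    print(
        con.sql(
            """
            SELECT
                SUM(CASE WHEN run1_rank = 1 THEN 1 ELSE 0 END) AS top1,
                SUM(CASE WHEN run1_rank BETWEEN 1 AND 10 THEN 1 ELSE 0 END) AS top10,
                COUNT(*) AS total
            FROM run1_disease_ranks
            """
        )
    )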

@yaseminbridges marked this pull request as ready for review August 27, 2024 15:18
@yaseminbridges self-assigned this Aug 27, 2024
@julesjacobsen (Contributor) left a comment


Looks great. It's a bit tricky to properly digest all these changes, but the codebase looks generally clearer to comprehend. There is still a lot of essentially duplicated code in the variant/gene/disease_prioritisation_analysis, and having the LLM-generated docs be larger than the two-line function they document isn't actually all that helpful, as it adds a lot of noise. e.g.

    def _assess_disease_with_threshold_ascending_order(
        self,
        result_entry: RankedPhEvalDiseaseResult,
    ) -> int:
        """
        Record the disease prioritisation rank if it meets the ascending order threshold.

        This method checks if the disease prioritisation rank meets the ascending order threshold.
        If the score of the result entry is less than the threshold, it records the disease rank.

        Args:
            result_entry (RankedPhEvalDiseaseResult): Ranked PhEval disease result entry

        Returns:
            int: Recorded disease prioritisation rank
        """
        if float(self.threshold) > float(result_entry.score):
            return result_entry.rank
        else:
            return 0


    def _assess_disease_with_threshold(
        self,
        result_entry: RankedPhEvalDiseaseResult,
    ) -> int:
        """
        Record the disease prioritisation rank if it meets the score threshold.

        This method checks if the disease prioritisation rank meets the score threshold.
        If the score of the result entry is greater than the threshold, it records the disease rank.

        Args:
            result_entry (RankedPhEvalDiseaseResult): Ranked PhEval disease result entry

        Returns:
            int: Recorded disease prioritisation rank
        """
        if float(self.threshold) < float(result_entry.score):
            return result_entry.rank
        else:
            return 0


    def _record_matched_disease(
        self,
        standardised_disease_result: RankedPhEvalDiseaseResult,
    ) -> int:
        """
        Return the disease rank result - handling the specification of a threshold.

        This method determines and returns the disease rank result based on the specified threshold
        and score order. If the threshold is 0.0, it records the disease rank directly.
        Otherwise, it assesses the disease with the threshold based on the score order.

        Args:
            standardised_disease_result (RankedPhEvalDiseaseResult): Ranked PhEval disease result entry

        Returns:
            int: Recorded disease prioritisation rank
        """
        if float(self.threshold) == 0.0:
            return standardised_disease_result.rank
        else:
            return (
                self._assess_disease_with_threshold(standardised_disease_result)
                if self.score_order != "ascending"
                else self._assess_disease_with_threshold_ascending_order(
                    standardised_disease_result,
                )
            )

is essentially repeated in three files.
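One possible way to reduce that duplication (purely a sketch, not code in this PR; the helper name is hypothetical, and it assumes the gene/variant result types expose the same score and rank attributes as the disease results) would be a single shared method that handles both score orders:

    def _record_matched_rank(self, result_entry) -> int:
        """Return the rank, honouring the configured threshold and score order."""
        threshold = float(self.threshold)
        score = float(result_entry.score)
        if threshold == 0.0:
            return result_entry.rank
        if self.score_order == "ascending":
            # Lower scores are better: record the rank only when the score
            # falls below the threshold.
            return result_entry.rank if score < threshold else 0
        # Higher scores are better: record the rank only when the score
        # exceeds the threshold.
        return result_entry.rank if score > threshold else 0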

@yaseminbridges (Contributor, Author) commented:

This PR fails due to the Makefile pipeline test that executes the template runner and the benchmark command. The Makefile could be separated out of the main PhEval repo and moved into monarch-pheval; @souzadevinicius agreed with this in another PR. Ultimately this PR is blocked until the test is refactored in the Makefile pipeline OR it is moved to another repo to test the pipeline.

@souzadevinicius (Member) commented:


The pipeline test needed to be modified due to the pheval-utils methods refactor. Sorry for blocking this PR. @yaseminbridges, @julesjacobsen and @matentzn, wherever you decide to put the Makefile code, I agree 😅.

@yaseminbridges merged commit 3116bfa into main Sep 10, 2024
5 checks passed
@yaseminbridges deleted the 346-duckdb-for-benchmarking branch September 30, 2024 12:09
Merging this pull request closes the issue: DuckDB for benchmarking