
346 duckdb for benchmarking #348

Merged: 82 commits merged into main from the 346-duckdb-for-benchmarking branch on Sep 10, 2024

Conversation

@yaseminbridges (Contributor) commented Jul 30, 2024

Unfortunately this is a very large PR; however, it is a much-needed and positive upgrade to the benchmarking code.

Before:

  • Lots and lots of opening and closing of phenopacket files. For each run, every phenopacket file would be opened and the same information-retrieval methods would be called on it. This means that, when running the benchmarking on the 4k phenopacket-store corpus for 4 different run configurations, the same batch of phenopackets would be opened, queried, and closed ~16k times.
  • Dictionaries and lists were used to store the benchmarking data, making the code hard to read.
  • Lots of files were output from the command, making it difficult at times to navigate to the data you want to access.
  • Limited customisation of the plots output by the command.

Now:

  • A unique corpus of phenopackets is read in once and the data retrieval happens once. All of the data is stored in a single DuckDB table for reference, so benchmarking the 4k corpus across 4 different run configurations now performs each operation only once (see the sketch after this list).
  • Eliminated dictionaries and most lists for storing data; data is now taken from a DuckDB table.
  • All outputs (apart from the SVG plots) are now contained in a single DB, making them easier to navigate.
  • The plot output can now be customised from the command: custom titles for all plots, and the runs in the plot keys are named after a specified run identifier.
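A minimal sketch of the single-pass corpus load, assuming a hypothetical benchmark.db database, a phenopackets/ directory, and illustrative table and column names (these are not the PR's actual identifiers, and the field extraction is heavily simplified):

    import json
    from pathlib import Path

    import duckdb

    con = duckdb.connect("benchmark.db")
    con.execute(
        "CREATE TABLE IF NOT EXISTS corpus (phenopacket VARCHAR, known_disease VARCHAR)"
    )
    # Each phenopacket is opened and parsed exactly once; the extracted
    # information is cached in the corpus table for all subsequent runs.
    for path in Path("phenopackets").glob("*.json"):
        with open(path) as handle:
            phenopacket = json.load(handle)
        diseases = phenopacket.get("diseases") or [{}]
        known_disease = diseases[0].get("term", {}).get("id", "")
        con.execute(
            "INSERT INTO corpus VALUES (?, ?)", [path.name, known_disease]
        )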

This change was necessary: the overall benchmarking process is now more efficient in terms of speed, data storage, and code readability (the codebase has also been cut down).

Finally, I have added some much-needed documentation for running a benchmark.

A general idea of how things work now:

  1. The run configuration YAML file is consumed and parsed, providing all the information required for the benchmarking run(s).
  2. The unique corpus of phenopackets is parsed and the relevant gene/variant/disease information (if specified) is stored in a DuckDB table (TABLE A).
  3. Depending on whether gene/variant/disease benchmarking is specified for that run configuration, the standardised TSV result is read in as a DuckDB table (TABLE B) and compared against what is known from the phenopacket (retrieved from TABLE A); the rank is then retrieved and added to TABLE A.
  4. Summary stats, plots, and comparisons between runs are generated from TABLE A (see the sketch below).
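To make steps 3 and 4 concrete, here is a minimal sketch assuming the corpus table from the previous sketch and a hypothetical result file run1_disease_results.tsv with phenopacket, disease_identifier and rank columns (the real PhEval column names may differ); for simplicity it materialises a derived table rather than adding the rank column to TABLE A in place:

    import duckdb

    con = duckdb.connect("benchmark.db")

    # Step 3: read the standardised TSV result for one run as TABLE B.
    con.execute(
        """
        CREATE OR REPLACE TABLE run1_results AS
        SELECT * FROM read_csv_auto('run1_disease_results.tsv')
        """
    )

    # Look up the rank of the known disease for each phenopacket;
    # a missing match is recorded as rank 0.
    con.execute(
        """
        CREATE OR REPLACE TABLE run1_disease_ranks AS
        SELECT c.phenopacket,
               c.known_disease,
               COALESCE(MIN(r."rank"), 0) AS run1_rank
        FROM corpus c
        LEFT JOIN run1_results r
          ON r.phenopacket = c.phenopacket
         AND r.disease_identifier = c.known_disease
        GROUP BY c.phenopacket, c.known_disease
        """
    )

    # Step 4: summary stats come straight out of the table with plain SQL.
    print(
        con.sql(
            """
            SELECT
                SUM(CASE WHEN run1_rank = 1 THEN 1 ELSE 0 END) AS top1,
                SUM(CASE WHEN run1_rank BETWEEN 1 AND 10 THEN 1 ELSE 0 END) AS top10,
                COUNT(*) AS total
            FROM run1_disease_ranks
            """
        )
    )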

@yaseminbridges marked this pull request as ready for review August 27, 2024 15:18
@yaseminbridges self-assigned this Aug 27, 2024
@julesjacobsen (Contributor) left a comment


Looks great. It's a bit tricky to properly digest all these changes, but the codebase looks generally clearer to comprehend. There is still a lot of essentially duplicated code in the variant/gene/disease_prioritisation_analysis, and having the LLM-generated docs be larger than the two-line function they document isn't actually all that helpful, as it adds a lot of noise. e.g.

    def _assess_disease_with_threshold_ascending_order(
        self,
        result_entry: RankedPhEvalDiseaseResult,
    ) -> int:
        """
        Record the disease prioritisation rank if it meets the ascending order threshold.

        This method checks if the disease prioritisation rank meets the ascending order threshold.
        If the score of the result entry is less than the threshold, it records the disease rank.

        Args:
            result_entry (RankedPhEvalDiseaseResult): Ranked PhEval disease result entry

        Returns:
            int: Recorded disease prioritisation rank
        """
        if float(self.threshold) > float(result_entry.score):
            return result_entry.rank
        else:
            return 0


    def _assess_disease_with_threshold(
        self,
        result_entry: RankedPhEvalDiseaseResult,
    ) -> int:
        """
        Record the disease prioritisation rank if it meets the score threshold.

        This method checks if the disease prioritisation rank meets the score threshold.
        If the score of the result entry is greater than the threshold, it records the disease rank.

        Args:
            result_entry (RankedPhEvalDiseaseResult): Ranked PhEval disease result entry

        Returns:
            int: Recorded disease prioritisation rank
        """
        if float(self.threshold) < float(result_entry.score):
            return result_entry.rank
        else:
            return 0


    def _record_matched_disease(
        self,
        standardised_disease_result: RankedPhEvalDiseaseResult,
    ) -> int:
        """
        Return the disease rank result - handling the specification of a threshold.

        This method determines and returns the disease rank result based on the specified threshold
        and score order. If the threshold is 0.0, it records the disease rank directly.
        Otherwise, it assesses the disease with the threshold based on the score order.

        Args:
            standardised_disease_result (RankedPhEvalDiseaseResult): Ranked PhEval disease result entry

        Returns:
            int: Recorded disease prioritisation rank
        """
        if float(self.threshold) == 0.0:
            return standardised_disease_result.rank
        else:
            return (
                self._assess_disease_with_threshold(standardised_disease_result)
                if self.score_order != "ascending"
                else self._assess_disease_with_threshold_ascending_order(
                    standardised_disease_result,
                )
            )

is essentially repeated in three files.
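One possible way to reduce that duplication (purely a sketch, not code in this PR; the helper name is hypothetical, and it assumes the gene/variant result types expose the same score and rank attributes as the disease results) would be a single shared method that handles both score orders:

    def _record_matched_rank(self, result_entry) -> int:
        """Return the rank, honouring the configured threshold and score order."""
        threshold = float(self.threshold)
        score = float(result_entry.score)
        if threshold == 0.0:
            return result_entry.rank
        if self.score_order == "ascending":
            # Lower scores are better: record the rank only when the score
            # falls below the threshold.
            return result_entry.rank if score < threshold else 0
        # Higher scores are better: record the rank only when the score
        # exceeds the threshold.
        return result_entry.rank if score > threshold else 0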

@yaseminbridges (Contributor, Author) commented:

This PR fails due to the Makefile pipeline test that executes the template runner and the benchmark command. The Makefile could be separated out of the main PhEval repo and moved into monarch-pheval; @souzadevinicius agreed with this in another PR. Ultimately this PR is blocked until the test is refactored in the Makefile pipeline OR it is moved to another repo to test the pipeline.

@souzadevinicius (Member) commented:


The pipeline test needed to be modified due to the pheval-utils methods refactor. Sorry for blocking this PR. @yaseminbridges, @julesjacobsen and @matentzn, wherever you decide to put the Makefile code, I agree 😅.

@yaseminbridges merged commit 3116bfa into main Sep 10, 2024
5 checks passed
@yaseminbridges deleted the 346-duckdb-for-benchmarking branch September 30, 2024 12:09
Merging this pull request closes the issue: DuckDB for benchmarking