Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Commit Predictions File with Git Hash Link, Remove Artifact Upload #1350

Merged
merged 15 commits into from
Jan 20, 2025

Conversation

nagutm
Copy link
Collaborator

@nagutm nagutm commented Jan 10, 2025

This PR refactors the GitHub Actions workflow for the paper ranking script to ensure the predictions file is committed directly to the repository. The following changes were made:

  1. Removed Artifact Upload Step:
  • The actions/upload-artifact step in the workflow is currently used to store files generated during the workflow. These artifacts are then available for download in the GitHub Actions interface under the "Artifacts" section for that workflow run.
  • The addition of the commit and push step below makes this step unnecessary and was removed to avoid redundant storage.
  1. Added Commit and Push Step:
  • Introduced a new step to commit and push the predictions file directly to the exports/analyses/paper_ranking/ directory in the repository.
  • This ensures that the predictions file is saved in the repository for future use if necessary.

Testing via a forked repository successfully showed that the predictions file was saved to the correct directory once the workflow run was complete.

Copy link

codecov bot commented Jan 10, 2025

Codecov Report

Attention: Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

Project coverage is 46.27%. Comparing base (8950e70) to head (b495c04).
Report is 274 commits behind head on main.

Files with missing lines Patch % Lines
src/bioregistry/analysis/paper_ranking.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1350      +/-   ##
==========================================
+ Coverage   42.51%   46.27%   +3.75%     
==========================================
  Files         117      118       +1     
  Lines        8327     8301      -26     
  Branches     1963     1357     -606     
==========================================
+ Hits         3540     3841     +301     
+ Misses       4582     4275     -307     
+ Partials      205      185      -20     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@cthoyt
Copy link
Member

cthoyt commented Jan 11, 2025

I would suggest appending to the existing predictions file, rather than creating new ones each time. If you need to keep track of date provenance for this, then you could add a new column for the publication date

@bgyori
Copy link
Contributor

bgyori commented Jan 11, 2025

I would suggest appending to the existing predictions file, rather than creating new ones each time. If you need to keep track of date provenance for this, then you could add a new column for the publication date

Here is why we designed this process as it is currently:

  • The ranking is incremental in that it combines new papers with uncurated past papers to re-rank all papers. So it's not a simple append operation. This ensures that uncurated past papers that are highly ranked can still be surfaced.
  • In addition, we wanted to ensure that each month's comment posting new ranking results points to a specific file that is associated with that comment, even if newer comments/rankings have been posted since.

I do have an idea that could work and serve as a better solution:

  • We could have a single version controlled ranking file
  • Each time the ranking is done, the ranking file is overwritten with the new rankings in a github-actions commit
  • The comment posted on the issue doesn't simply add a link to this file from the main branch but uses the github-actions commit's hash (the one added in the previous step) in the direct link to the file. This ensures that the link will continue to exist later while also being reproducibly associated with the current issue comment.

Should we try this?

@cthoyt
Copy link
Member

cthoyt commented Jan 13, 2025

I think using a commit hash is a great idea.

The action that updates this file can also periodically clean out old predictions that have been either curated or incorporated into the bioregistry.json file

.github/workflows/paper_ranking.yml Outdated Show resolved Hide resolved
@nagutm nagutm changed the title Remove Artifact Upload Step and Add Predictions File Commit to Exports Directory Commit Predictions File with Git Hash Link, Remove Artifact Upload Jan 16, 2025
@bgyori
Copy link
Contributor

bgyori commented Jan 18, 2025

@nagutm before we wrap this up: is adding the file exports/analyses/paper_ranking/predictions_2024-12-11_to_2025-01-10.tsv necessary here? I imagine given the change to a single predictions.tsv we would want to end up with just that one file (and get rid of the date-specific ones) after this PR is merged but I might be missing something.

@nagutm
Copy link
Collaborator Author

nagutm commented Jan 18, 2025

@nagutm before we wrap this up: is adding the file exports/analyses/paper_ranking/predictions_2024-12-11_to_2025-01-10.tsv necessary here? I imagine given the change to a single predictions.tsv we would want to end up with just that one file (and get rid of the date-specific ones) after this PR is merged but I might be missing something.

The date specific prediction file was added before the suggestion to use just one predictions file. It can be safely removed.

@bgyori
Copy link
Contributor

bgyori commented Jan 20, 2025

Alright, this looks good. The code coverage warning is an artifact of how coverage rules are applied whereby for #1358 coverage wasn't run whereas here it was - choosing to ignore this.

@bgyori bgyori merged commit bfee1d4 into biopragmatics:main Jan 20, 2025
13 of 14 checks passed
bgyori added a commit that referenced this pull request Jan 20, 2025
#1367)

This pull request addresses the issue outlined in this comment:
#1350 (comment)

The `reldate` parameter was replaced by `mindate` and `maxdate` in the
functions responsible for searching relevant PubMed papers.

This ensures that the PubMed papers fetched within the desired date
range is consistent with the `start_date` and `end_date` parameters in
the main function.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Benjamin M. Gyori <ben.gyori@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants