fix: Allow aggregated tasks within benchmarks #1771
Conversation
- Updated task filtering, adding `exclusive_language_filter` and `hf_subset`
- Fixed a bug in MTEB where cross-lingual splits were included
- Added missing language filtering to MTEB(europe, beta) and MTEB(indic, beta)

The following code outlines the problems:

```py
import mteb
from mteb.benchmarks import MTEB_ENG_CLASSIC

task = [t for t in MTEB_ENG_CLASSIC.tasks if t.metadata.name == "STS22"][0]
# was eq. to:
task = mteb.get_task("STS22", languages=["eng"])

task.hf_subsets
# current filtering to English datasets:
# ['en', 'de-en', 'es-en', 'pl-en', 'zh-en']
# However, it should be:
# ['en']

# with the changes it is:
task = [t for t in MTEB_ENG_CLASSIC.tasks if t.metadata.name == "STS22"][0]
task.hf_subsets
# ['en']
# eq. to:
task = mteb.get_task("STS22", hf_subsets=["en"])
# which you can also obtain using the exclusive_language_filter
# (though not if there were multiple English splits):
task = mteb.get_task("STS22", languages=["eng"], exclusive_language_filter=True)
```
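For intuition, here is a minimal sketch of what an exclusive language filter amounts to: keep only the `hf_subsets` whose languages are all contained in the requested set, which is what drops cross-lingual pairs such as `de-en`. The `SUBSET_LANGS` mapping and `exclusive_filter` helper below are hypothetical illustrations, not mteb internals.

```py
# Hypothetical sketch of exclusive language filtering; SUBSET_LANGS and
# exclusive_filter are illustrative, not part of mteb's API.
SUBSET_LANGS = {
    "en": {"eng"},
    "de-en": {"deu", "eng"},
    "es-en": {"spa", "eng"},
    "pl-en": {"pol", "eng"},
    "zh-en": {"zho", "eng"},
}

def exclusive_filter(hf_subsets: list[str], languages: list[str]) -> list[str]:
    """Keep only subsets whose languages are ALL within the requested set."""
    requested = set(languages)
    return [s for s in hf_subsets if SUBSET_LANGS[s] <= requested]

print(exclusive_filter(list(SUBSET_LANGS), ["eng"]))  # ['en']
```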
Thanks for the good work! I left a couple of comments on some things that cause or might cause errors.
Leaderboard seemingly works - @Samoed is it possible that we did not scrape the CQADupstack scores (if so, can we do that?). I will also go over the results repo and manually create a merged dict for all models which have the scores.
No, I think we scrape all of them; e.g. e5-small has CQADupstackWordpressRetrieval in the external dir: https://github.com/embeddings-benchmark/results/blob/6f32ccc978e2d84b905f10b2848479881f43ab02/results/intfloat__e5-small/external/CQADupstackWordpressRetrieval.json#L4
Fixes #1231
Addresses #1763 (though we need a fix for the results repo)
Did quite a few refactors here. Not at all settled that this is the right representation, but it is at least much better than where we started.
We will still need to combine the CQADupstack scores on embeddings-benchmark/results. @x-tabdeveloping let me know if this works on the leaderboard end (I believe it should).
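For reference, a minimal sketch of how such a merged dict could be produced per model, assuming one `CQADupstack*Retrieval.json` file per sub-forum as in the results repo linked above; the JSON key path used for the main score is an assumption and would need to be checked against the actual schema.

```py
# Hypothetical sketch: average per-sub-forum CQADupstack scores into one
# merged entry. The file layout and JSON keys are assumptions, not the
# confirmed schema of embeddings-benchmark/results.
import json
from pathlib import Path
from statistics import mean

def merge_cqadupstack(model_dir: Path) -> dict:
    """Merge all CQADupstack*Retrieval.json files under model_dir."""
    scores = []
    for path in sorted(model_dir.glob("CQADupstack*Retrieval.json")):
        data = json.loads(path.read_text())
        # assumed location of the main score within each result file
        scores.append(data["scores"]["test"][0]["main_score"])
    return {"task_name": "CQADupstackRetrieval", "main_score": mean(scores)}
```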
Checklist
- `make test`
- `make lint`