fix: Allow aggregated tasks within benchmarks #1771

Open · wants to merge 26 commits into main
Conversation

@KennethEnevoldsen (Contributor) commented Jan 11, 2025

Fixes #1231
Addresses #1763 (though we still need a fix for the results repo)

  • added AbsTaskAggregated
  • added CQADupstackRetrieval as an aggregated task (see the sketch below)
  • updated MTEB(eng, classic) to use CQADupstackRetrieval instead of its subtasks
  • general refactoring
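
For illustration, a minimal sketch of how the aggregated task is meant to be consumed. The surrounding evaluation calls are standard mteb usage; that `get_task` resolves the new aggregated task name like any other task is an assumption about this PR's behaviour:

```py
# Rough sketch, assuming the aggregated task is fetched by name like any
# other mteb task once this PR is merged.
import mteb
from sentence_transformers import SentenceTransformer

# Fetch the aggregated task by name.
task = mteb.get_task("CQADupstackRetrieval")

# Run it; the aggregated task is expected to evaluate its subtasks and
# combine their scores into a single result.
model = SentenceTransformer("all-MiniLM-L6-v2")
evaluation = mteb.MTEB(tasks=[task])
results = evaluation.run(model)
```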

Did quite a few refactors here. I am not at all settled on this being the right representation, but it is at least much better than where we started.

We will still need to combine the CQADupstack scores in embeddings-benchmark/results.

@x-tabdeveloping let me know if this works on the leaderboard end (I believe it should)

Checklist

  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.

- Updated task filtering, adding exclusive_language_filter and hf_subsets arguments
- Fixed a bug in MTEB where cross-lingual splits were included
- Added missing language filtering to MTEB(europe, beta) and MTEB(indic, beta)

The following code outlines the problems:

```py
import mteb
from mteb.benchmarks import MTEB_ENG_CLASSIC

task = [t for t in MTEB_ENG_CLASSIC.tasks if t.metadata.name == "STS22"][0]
# previously this was equivalent to:
task = mteb.get_task("STS22", languages=["eng"])
task.hf_subsets
# current filtering returns every subset that contains English:
# ['en', 'de-en', 'es-en', 'pl-en', 'zh-en']
# however, it should be:
# ['en']

# with the changes it is:
task = [t for t in MTEB_ENG_CLASSIC.tasks if t.metadata.name == "STS22"][0]
task.hf_subsets
# ['en']
# equivalent to:
task = mteb.get_task("STS22", hf_subsets=["en"])
# which you can also obtain using the exclusive_language_filter (though not if there were multiple English splits):
task = mteb.get_task("STS22", languages=["eng"], exclusive_language_filter=True)
```
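
To make the two filtering modes concrete, here is a self-contained toy sketch. The subset-to-language mapping mirrors STS22, but the helper itself is hypothetical and not mteb's internal implementation:

```py
# Toy illustration of inclusive vs. exclusive language filtering.
# subset_languages maps each STS22 hf_subset to the languages it contains.
subset_languages = {
    "en": {"eng"},
    "de-en": {"deu", "eng"},
    "es-en": {"spa", "eng"},
    "pl-en": {"pol", "eng"},
    "zh-en": {"zho", "eng"},
}

def filter_subsets(requested: set[str], exclusive: bool) -> list[str]:
    if exclusive:
        # Keep only subsets whose languages are all in the requested set.
        return [s for s, langs in subset_languages.items() if langs <= requested]
    # Inclusive: keep subsets containing at least one requested language.
    return [s for s, langs in subset_languages.items() if langs & requested]

print(filter_subsets({"eng"}, exclusive=False))
# ['en', 'de-en', 'es-en', 'pl-en', 'zh-en']
print(filter_subsets({"eng"}, exclusive=True))
# ['en']
```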
@x-tabdeveloping (Collaborator) left a comment:

Thanks for the good work! I left a couple of comments on some things that cause or might cause errors.

@KennethEnevoldsen (Contributor, Author) commented:
Leaderboard seemingly works. @Samoed, is it possible that we did not scrape the CQADupstack scores? (If so, can we do that?)

I will also go over the results repo and manually create a merged dict for all models which have the scores.
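
A rough sketch of what that merge could look like for one model. The JSON field layout (`scores` → `test` → `main_score`) is assumed for illustration and may not match the results repo's files exactly:

```py
# Hypothetical sketch: average the CQADupstack subtask scores of one model
# into a single aggregated entry. Field names are assumptions.
import json
from pathlib import Path

model_dir = Path("results/intfloat__e5-small/external")  # example model dir
subtask_files = sorted(
    p for p in model_dir.glob("CQADupstack*Retrieval.json")
    if p.stem != "CQADupstackRetrieval"  # skip a previously written aggregate
)

main_scores = []
for path in subtask_files:
    result = json.loads(path.read_text())
    main_scores.append(result["scores"]["test"][0]["main_score"])

aggregated = {
    "task_name": "CQADupstackRetrieval",
    "scores": {"test": [{"main_score": sum(main_scores) / len(main_scores)}]},
}
(model_dir / "CQADupstackRetrieval.json").write_text(json.dumps(aggregated, indent=2))
```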

@Samoed (Collaborator) commented Jan 28, 2025

No, I think we scraped all of them; e.g. e5-small has CQADupstackWordpressRetrieval in the external dir: https://github.com/embeddings-benchmark/results/blob/6f32ccc978e2d84b905f10b2848479881f43ab02/results/intfloat__e5-small/external/CQADupstackWordpressRetrieval.json#L4
