Add balance_weights to weight balanced batches #1588

Draft
wants to merge 6 commits into
base: develop


FrenchKrab
Contributor

The balance option of the segmentation tasks accepts a list of ProtocolFile fields, e.g. ['database', 'foo']. When batches are sampled, the task looks at all existing combinations of values for these fields in the task protocol.

For example, if the files come from databases aishell and ami, and their foo field is either a or b, we compute the cartesian product [('aishell', 'a'), ('aishell', 'b'), ('ami', 'a'), ('ami', 'b')]. Batches are then created by randomly selecting one of these tuples and picking a sample from a matching file.

This PR makes it possible to weight the random choice from the cartesian product. For example, with
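The unweighted selection described above can be sketched roughly as follows. This is only an illustration, not the code from the task: the `field_values` dictionary and the `pick_combination` helper are hypothetical names, and the values are hard-coded here instead of being collected from the protocol files.

```python
import itertools
import random

# Hypothetical stand-in for the values observed in the protocol
# for each field listed in `balance`.
field_values = {
    "database": ["aishell", "ami"],
    "foo": ["a", "b"],
}
balance = ["database", "foo"]

# Cartesian product of the values, in the order given by `balance`.
combinations = list(itertools.product(*(field_values[f] for f in balance)))


def pick_combination(rng: random.Random) -> tuple:
    """Select one combination uniformly at random (hypothetical helper)."""
    return rng.choice(combinations)


rng = random.Random(0)
selected = pick_combination(rng)
# A training sample would then be drawn from a file whose
# (database, foo) values match `selected`.
```

Every tuple is equally likely here, whatever the number of files behind it.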

balance_weights = {
  ('aishell',): 2.0,  # one-element tuple: the trailing comma is required
  ('ami', 'b'): 4.0,
}

we will sample from the cartesian product using random.choices with these weights:

selected = random.choices(
    population=[('aishell', 'a'), ('aishell', 'b'), ('ami', 'a'), ('ami', 'b')],
    weights=[2.0, 2.0, 1.0, 4.0],
    k=1,
)[0]

That is, for each tuple of the cartesian product, we find the longest matching (tuple) prefix in balance_weights and use its weight. A tuple with no matching prefix, such as ('ami', 'a') in the example, gets a weight of 1.0.

I'm not sure this approach is flexible/clean enough to be PR-ready, and it's hard to make the docstring concise, but I think it could be really useful :)
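A minimal sketch of the longest-prefix lookup, to make the rule concrete. The `resolve_weight` helper and its `default=1.0` fallback are assumptions for illustration (the fallback is inferred from the 1.0 given to ('ami', 'a') in the example), not necessarily how the PR implements it. Keys are written as real tuples, so a single-field key needs a trailing comma.

```python
import random


def resolve_weight(combo, balance_weights, default=1.0):
    """Return the weight of the longest prefix of `combo` found in
    `balance_weights`, or `default` if no prefix matches (hypothetical helper)."""
    # Try the full tuple first, then progressively shorter prefixes.
    for length in range(len(combo), 0, -1):
        prefix = combo[:length]
        if prefix in balance_weights:
            return balance_weights[prefix]
    return default


population = [("aishell", "a"), ("aishell", "b"), ("ami", "a"), ("ami", "b")]
balance_weights = {
    ("aishell",): 2.0,  # applies to every tuple starting with 'aishell'
    ("ami", "b"): 4.0,  # applies to this exact tuple only
}

weights = [resolve_weight(combo, balance_weights) for combo in population]
# weights == [2.0, 2.0, 1.0, 4.0]

selected = random.choices(population=population, weights=weights, k=1)[0]
```

With these weights, ('ami', 'b') is drawn four times as often as ('ami', 'a'), and each aishell tuple twice as often.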

@FrenchKrab FrenchKrab marked this pull request as draft December 19, 2023 10:15
But before caching training metadata was introduced

Squashed commit of the following:

commit d41ce0a
Author: Hervé BREDIN <hbredin@users.noreply.github.com>
Date:   Thu Jan 11 13:04:18 2024 +0100

    doc: fix typo in README

commit 8f477fa
Author: Hervé BREDIN <hbredin@users.noreply.github.com>
Date:   Tue Jan 9 13:06:09 2024 +0100

    fix(task): fix random generators (pyannote#1594)

    Before this change, each worker would select the same files, resulting in less randomness than expected.

commit eda0c51
Author: Hervé BREDIN <hbredin@users.noreply.github.com>
Date:   Mon Jan 8 17:05:05 2024 +0100

    Delete .github/ISSUE_TEMPLATE/feature_request.md

commit eb2e813
Author: Hervé BREDIN <hbredin@users.noreply.github.com>
Date:   Mon Jan 8 17:04:22 2024 +0100

    github: update config.yml (pyannote#1607)

commit 27cd91f
Author: Hervé BREDIN <hbredin@users.noreply.github.com>
Date:   Mon Jan 8 17:02:40 2024 +0100

    github: create config.yml

commit 42ef141
Author: Hervé BREDIN <hbredin@users.noreply.github.com>
Date:   Mon Jan 8 16:53:52 2024 +0100

    github: add bug_report.yml template

commit 808b170
Author: Hervé BREDIN <hbredin@users.noreply.github.com>
Date:   Mon Jan 8 16:36:24 2024 +0100

    feat: add MRE template

commit e21e7bb
Author: Hervé BREDIN <hbredin@users.noreply.github.com>
Date:   Mon Jan 8 09:52:19 2024 +0100

    ci: deactivate FAQtory

commit 80634c9
Author: Clément Pagés <55240756+clement-pages@users.noreply.github.com>
Date:   Fri Dec 22 09:16:12 2023 +0100

    fix: update `isort` version to 5.12.0 in pre-commit-config (pyannote#1596)

    Co-authored-by: clement-pages <clement.pages@irit.fr>

commit 7bd88d5
Author: Hervé BREDIN <hbredin@users.noreply.github.com>
Date:   Wed Dec 20 21:26:42 2023 +0100

    feat(pipeline): add Waveform and SampleRate preprocessors (pyannote#1593)

commit 4d2d16b
Author: Hervé BREDIN <hbredin@users.noreply.github.com>
Date:   Wed Dec 20 16:03:13 2023 +0100

    doc: update benchmark section (pyannote#1592)

commit 66dd72b
Author: Hervé BREDIN <hbredin@users.noreply.github.com>
Date:   Fri Dec 15 16:10:51 2023 +0100

    feat(model): add `num_frames` and `receptive_field` to segmentation models

    Co-authored-by: Bilal Rahou <Bilal-Rahou@users.noreply.github.com>
(not tested in this branch)

stale bot commented Jul 23, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jul 23, 2024
@hbredin hbredin removed the wontfix label Jul 23, 2024

2 participants