Using pyannote-segmentation 3.0 for speaker-change-detection #1504

Closed
ashu5644 opened this issue Oct 17, 2023 · 7 comments

@ashu5644

Hi @hbredin, according to https://huggingface.co/pyannote/segmentation-3.0, the model ingests 10 seconds of audio. For detecting speaker changes at inference time within an audio chunk that is shorter or longer than 10 seconds, would it hurt the model's performance if we passed that chunk to the model as-is instead of padding or trimming it to 10 seconds? Would the permutation-invariant behaviour of the model also affect its outputs in this case? Thank you.

@github-actions

Thank you for your issue. You might want to check the FAQ if you haven't done so already.

Feel free to close this issue if you found an answer in the FAQ.

If your issue is a feature request, please read this first and update your request accordingly, if needed.

If your issue is a bug report, please provide a minimum reproducible example as a link to a self-contained Google Colab notebook containing everything needed to reproduce the bug:

  • installation
  • data preparation
  • model download
  • etc.

Providing an MRE will increase your chance of getting an answer from the community (either maintainers or other power users).

Companies relying on pyannote.audio in production may contact me via email regarding:

  • paid scientific consulting around speaker diarization and speech processing in general;
  • custom models and tailored features (via the local tech transfer office).

This is an automated reply, generated by FAQtory

@hbredin
Member

hbredin commented Oct 17, 2023

I am afraid I do not fully understand the question.
May I suggest you start by looking at this blog post and in particular the section dedicated to speaker change detection?
This should clarify how the Inference class works and how it can be used for speaker change detection.

@ashu5644
Author

@hbredin, thank you for sharing the blog post. Compared with the derivative-based method used with pyannote 2.x for speaker change detection, picking the most probable class should give the speaker class for each frame of an audio chunk in pyannote 3.0, since it takes a multi-class classification approach over 7 classes. Is my understanding correct? Once each frame in the chunk has a speaker class, those classes can be used to identify the frames where the speaker changes. Since https://huggingface.co/pyannote/segmentation-3.0 ingests audio chunks of 10 seconds, would it make sense to pass a single audio chunk longer than 10 seconds and use its outputs for speaker change detection by picking the most probable class at each frame (~16 ms)? Or should I trim the chunk down to 10 seconds for speaker change detection as well, to avoid side effects of permutation invariance? Thank you.
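
The frame-wise idea described in this comment can be sketched as follows. This is a minimal illustration in plain Python, not pyannote.audio code: `frame_classes` and `change_frames` are hypothetical helper names, and the toy `scores` stand in for the per-frame scores over 7 classes that the model would output for one chunk.

```python
from typing import List, Sequence


def frame_classes(scores: Sequence[Sequence[float]]) -> List[int]:
    """Return the index of the highest-scoring class for each frame."""
    return [max(range(len(frame)), key=frame.__getitem__) for frame in scores]


def change_frames(classes: Sequence[int]) -> List[int]:
    """Return the indices of frames whose class differs from the previous frame."""
    return [t for t in range(1, len(classes)) if classes[t] != classes[t - 1]]


# Toy input (assumption): 6 frames x 7 classes, one dominant class per frame.
toy_labels = [0, 1, 1, 4, 2, 2]
scores = [[1.0 if k == label else 0.0 for k in range(7)] for label in toy_labels]

classes = frame_classes(scores)   # most probable class per frame
changes = change_frames(classes)  # frames where the class switches
```

Note that a class switch is not always a speaker turn: it also fires when speech starts or stops and when an overlap begins or ends, so the detected frames would need filtering. The class indices are also only meaningful within a single chunk, which is where the permutation question raised above comes in.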

@hbredin
Member

hbredin commented Oct 19, 2023

The model has only ever seen 10s chunks during training, so I would not recommend using it with other durations (it might work but it might actually also behave in an unexpected manner).

Why not use the speaker diarization pipeline that solves the permutation problem by relying on speaker embedding? Is the extra time needed to extract speaker embedding the issue?
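
One way to respect the 10 s constraint on longer or shorter audio is to cut it into fixed-length, overlapping windows and zero-pad the last one. The sketch below is an assumption-laden illustration in plain Python: `sliding_chunks` is a hypothetical helper, the 5 s step is an arbitrary choice, and the blog post's Inference class may handle windowing differently.

```python
from typing import List, Sequence, Tuple


def sliding_chunks(
    samples: Sequence[float],
    sample_rate: int,
    duration: float = 10.0,  # window length in seconds (the training duration)
    step: float = 5.0,       # hop between windows in seconds (assumed value)
) -> List[Tuple[float, List[float]]]:
    """Split audio into fixed-length windows, zero-padding the final one.

    Returns a list of (start_time_in_seconds, window_samples) pairs.
    """
    size = int(round(duration * sample_rate))
    hop = int(round(step * sample_rate))
    chunks = []
    start = 0
    while True:
        window = list(samples[start:start + size])
        window.extend([0.0] * (size - len(window)))  # pad a short final window
        chunks.append((start / sample_rate, window))
        if start + size >= len(samples):
            break
        start += hop
    return chunks


# Toy usage: 22.5 s of "audio" at a made-up sample rate of 4 Hz.
toy_chunks = sliding_chunks([1.0] * 90, sample_rate=4)
```

Each window then has exactly the duration the model saw during training. Whether zero-padding a short chunk behaves well is not confirmed in this thread, so it is worth checking on real data.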

@ashu5644
Author

ashu5644 commented Oct 19, 2023

No, timing is not the issue. Currently I am using only the speaker-diarization-based method for speaker change detection (SCD). I was thinking of trying SCD with the base segmentation model as well, to evaluate its performance, since SCD can be done using the segmentation model alone, and adding more steps (speaker embedding + clustering) might also introduce more errors in SCD.

@hbredin
Member

hbredin commented Oct 24, 2023

You could actually try to stitch segmentation windows together by finding the optimal permutation between two consecutive (and overlapping) windows?
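
A minimal sketch of this stitching suggestion, assuming each window has already been turned into per-speaker activations (one value per local speaker per frame) and that the two inputs cover the same overlapping frames. The helper names are hypothetical, and the brute-force search is only practical for a handful of local speakers; for more speakers an assignment solver would be the usual replacement.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

Frames = Sequence[Sequence[float]]  # frames x local speakers


def best_permutation(ref: Frames, hyp: Frames) -> Tuple[int, ...]:
    """Find the speaker order of `hyp` that best matches `ref`.

    Returns `perm` such that hyp speaker `perm[i]` corresponds to ref speaker `i`,
    minimising the summed squared difference over the overlapping frames.
    """
    num_speakers = len(ref[0])

    def cost(perm: Tuple[int, ...]) -> float:
        return sum(
            (r[i] - h[perm[i]]) ** 2
            for r, h in zip(ref, hyp)
            for i in range(num_speakers)
        )

    return min(permutations(range(num_speakers)), key=cost)


def apply_permutation(frames: Frames, perm: Sequence[int]) -> List[List[float]]:
    """Reorder the speaker columns of `frames` according to `perm`."""
    return [[frame[p] for p in perm] for frame in frames]


# Toy overlap: the second window sees the same speech with speakers shuffled.
ref_overlap = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0]]
hyp_overlap = [[0, 0, 1], [0, 0, 1], [1, 0, 0], [1, 0, 0]]

perm = best_permutation(ref_overlap, hyp_overlap)
aligned = apply_permutation(hyp_overlap, perm)
```

Applying the same `perm` to the whole second window, not just the overlap, keeps speaker indices consistent from one window to the next, so class changes can be tracked across window boundaries.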

stale bot commented Apr 22, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Apr 22, 2024
@stale stale bot closed this as completed May 23, 2024