Using pyannote-segmentation 3.0 for speaker-change-detection #1504
Thank you for your issue. You might want to check the FAQ if you haven't done so already. Feel free to close this issue if you found an answer in the FAQ. If your issue is a feature request, please read this first and update your request accordingly, if needed. If your issue is a bug report, please provide a minimum reproducible example as a link to a self-contained Google Colab notebook containing everything needed to reproduce the bug:
Providing an MRE will increase your chance of getting an answer from the community (either maintainers or other power users). Companies relying on
I am afraid I do not fully understand the question.
@hbredin, thank you for sharing the blog post. Compared to the derivative-based method used with pyannote 2.x for speaker-change detection, my understanding is that in pyannote 3.0 picking the most probable class should give the speaker class for each frame of an audio chunk, since the model takes a multi-class classification approach over 7 classes. Is that correct? After getting the speaker class for each frame in the chunk, it can be used to identify frames with a speaker change. Since https://huggingface.co/pyannote/segmentation-3.0 ingests audio chunks of 10 seconds, if we pass a single audio chunk longer than 10 seconds and use its outputs for speaker-change detection by picking the most probable class at each frame (~16 ms), would that make sense, or should I also trim the chunk down to 10 seconds for speaker-change detection to avoid side effects of permutation invariance? Thank you.
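A minimal sketch of the per-frame argmax idea described above, assuming a 10 s, 16 kHz mono chunk and that the model returns a `(batch, num_frames, 7)` tensor of scores over the powerset classes (the file name and the exact output shape are assumptions, not confirmed API details):

```python
# Sketch: pick the most probable powerset class per frame from
# pyannote/segmentation-3.0 output, then flag frames where that class changes.
import torch
import torchaudio
from pyannote.audio import Model

model = Model.from_pretrained("pyannote/segmentation-3.0")  # may require use_auth_token=...
waveform, sample_rate = torchaudio.load("chunk_10s.wav")    # hypothetical 10 s file

with torch.no_grad():
    scores = model(waveform.unsqueeze(0))        # assumed shape: (1, num_frames, 7)

frame_class = scores[0].argmax(dim=-1)           # most probable class per frame
# frames whose winning class differs from the previous frame
change_frames = torch.nonzero(frame_class[1:] != frame_class[:-1]).squeeze(-1) + 1

frame_step = 10.0 / frame_class.shape[0]         # rough per-frame step in seconds
change_times = [float(i) * frame_step for i in change_frames]
print(change_times)
```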
The model has only ever seen 10 s chunks during training, so I would not recommend using it with other durations (it might work, but it might also behave in unexpected ways). Why not use the speaker diarization pipeline, which solves the permutation problem by relying on speaker embeddings? Is the extra time needed to extract speaker embeddings the issue?
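For reference, a sketch of deriving speaker-change points from the full diarization pipeline rather than the raw segmentation model; the checkpoint name and audio file are assumptions, substitute whatever you actually run:

```python
# Sketch: speaker-change points from the diarization pipeline output.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")  # may require use_auth_token=...
diarization = pipeline("audio.wav")

changes = []
previous = None
for segment, _, speaker in diarization.itertracks(yield_label=True):
    if previous is not None and speaker != previous:
        changes.append(segment.start)   # a new speaker starts speaking here
    previous = speaker
print(changes)
```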
No, timing is not the issue; currently I am using the speaker-diarization-based method for speaker-change detection (SCD). I was thinking of trying SCD with the base segmentation model as well, to evaluate performance, since SCD can be done with the segmentation model alone, and adding more steps (speaker embedding + clustering) might also introduce more errors into SCD.
You could actually try to stitch segmentation windows together by finding the optimal permutation between two consecutive (and overlapping) windows?
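One way this stitching could look, assuming per-speaker activation matrices for two consecutive windows (for segmentation-3.0 the powerset output would first have to be converted to per-speaker activations); the Hungarian solver from SciPy is used here, and the overlap length is an assumption:

```python
# Sketch: permute the speaker columns of the next window so they best match
# the previous window on the frames the two windows share.
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_windows(prev_scores: np.ndarray, next_scores: np.ndarray, n_overlap: int) -> np.ndarray:
    """Return next_scores with its speaker columns permuted to match prev_scores.

    prev_scores, next_scores: (num_frames, num_speakers) per-speaker activations
    n_overlap: number of frames shared by the two windows
    """
    prev_tail = prev_scores[-n_overlap:]   # end of the previous window
    next_head = next_scores[:n_overlap]    # start of the next window

    # cost[i, j] = mean absolute difference between speaker i (prev) and speaker j (next)
    cost = np.abs(prev_tail[:, :, None] - next_head[:, None, :]).mean(axis=0)
    _, col_ind = linear_sum_assignment(cost)   # optimal permutation
    return next_scores[:, col_ind]
```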
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Hi @hbredin, as per https://huggingface.co/pyannote/segmentation-3.0, the model ingests 10 seconds of audio. For detecting speaker change at inference time within a given audio chunk that is shorter or longer than 10 seconds, would it hurt the model's performance if we pass that chunk to the model as-is instead of padding/trimming it to 10 seconds? Would the permutation-invariant behaviour of the model also affect its outputs in this case? Thank you.
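If padding/trimming is the route taken, a minimal sketch of preparing a chunk of exactly 10 seconds before inference (file name and sample rate are assumptions):

```python
# Sketch: zero-pad or trim a waveform to exactly 10 s before passing it to the
# segmentation model, since the model was trained on 10 s chunks.
import torch
import torchaudio

waveform, sample_rate = torchaudio.load("chunk.wav")   # (channels, samples)
target = 10 * sample_rate

if waveform.shape[1] < target:
    # zero-pad on the right up to 10 seconds
    waveform = torch.nn.functional.pad(waveform, (0, target - waveform.shape[1]))
else:
    # keep only the first 10 seconds
    waveform = waveform[:, :target]
```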