Using pyannote-segmentation 3.0 for speaker-change-detection #1504
Thank you for your issue. You might want to check the FAQ if you haven't done so already. Feel free to close this issue if you found an answer in the FAQ. If your issue is a feature request, please read this first and update your request accordingly, if needed. If your issue is a bug report, please provide a minimum reproducible example as a link to a self-contained Google Colab notebook containing everything needed to reproduce the bug:
Providing an MRE will increase your chance of getting an answer from the community (either maintainers or other power users). Companies relying on
I am afraid I do not fully understand the question.
@hbredin, thank you for sharing the blog post. Compared to the derivative-based method used with pyannote 2.x for speaker-change detection, my understanding is that in pyannote 3.0 picking the most probable class should give the speaker class for each frame of an audio chunk, since the model takes a multi-class classification approach over 7 classes. Is that correct? After getting the speaker class for each frame in the chunk, it can be used to identify frames with a speaker change. Since https://huggingface.co/pyannote/segmentation-3.0 ingests audio chunks of 10 seconds, if we pass a single audio chunk longer than 10 seconds and use its outputs for speaker-change detection by picking the most probable class at each frame (~16 ms), would that make sense, or should I also trim the chunk down to 10 seconds for speaker-change detection to avoid side effects of permutation invariance? Thank you.
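A minimal sketch of the per-frame argmax idea described above, assuming a 10 s, 16 kHz mono chunk and that the model returns a `(batch, num_frames, 7)` tensor of scores over the powerset classes (the file name and the exact output shape are assumptions, not confirmed API details):

```python
# Sketch: pick the most probable powerset class per frame from
# pyannote/segmentation-3.0 output, then flag frames where that class changes.
import torch
import torchaudio
from pyannote.audio import Model

model = Model.from_pretrained("pyannote/segmentation-3.0")  # may require use_auth_token=...
waveform, sample_rate = torchaudio.load("chunk_10s.wav")    # hypothetical 10 s file

with torch.no_grad():
    scores = model(waveform.unsqueeze(0))        # assumed shape: (1, num_frames, 7)

frame_class = scores[0].argmax(dim=-1)           # most probable class per frame
# frames whose winning class differs from the previous frame
change_frames = torch.nonzero(frame_class[1:] != frame_class[:-1]).squeeze(-1) + 1

frame_step = 10.0 / frame_class.shape[0]         # rough per-frame step in seconds
change_times = [float(i) * frame_step for i in change_frames]
print(change_times)
```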
The model has only ever seen 10 s chunks during training, so I would not recommend using it with other durations (it might work, but it might also behave in unexpected ways). Why not use the speaker diarization pipeline, which solves the permutation problem by relying on speaker embeddings? Is the extra time needed to extract speaker embeddings the issue?
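For reference, a sketch of deriving speaker-change points from the full diarization pipeline rather than the raw segmentation model; the checkpoint name and audio file are assumptions, substitute whatever you actually run:

```python
# Sketch: speaker-change points from the diarization pipeline output.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")  # may require use_auth_token=...
diarization = pipeline("audio.wav")

changes = []
previous = None
for segment, _, speaker in diarization.itertracks(yield_label=True):
    if previous is not None and speaker != previous:
        changes.append(segment.start)   # a new speaker starts speaking here
    previous = speaker
print(changes)
```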
No, timing is not the issue; currently I am using the speaker-diarization-based method for speaker-change detection (SCD). I was thinking of trying SCD with the base segmentation model as well, to evaluate performance, since SCD can be done with the segmentation model alone, and adding more steps (speaker embedding + clustering) might also introduce more errors into SCD.
You could actually try to stitch segmentation windows together by finding the optimal permutation between two consecutive (and overlapping) windows?
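One way this stitching could look, assuming per-speaker activation matrices for two consecutive windows (for segmentation-3.0 the powerset output would first have to be converted to per-speaker activations); the Hungarian solver from SciPy is used here, and the overlap length is an assumption:

```python
# Sketch: permute the speaker columns of the next window so they best match
# the previous window on the frames the two windows share.
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_windows(prev_scores: np.ndarray, next_scores: np.ndarray, n_overlap: int) -> np.ndarray:
    """Return next_scores with its speaker columns permuted to match prev_scores.

    prev_scores, next_scores: (num_frames, num_speakers) per-speaker activations
    n_overlap: number of frames shared by the two windows
    """
    prev_tail = prev_scores[-n_overlap:]   # end of the previous window
    next_head = next_scores[:n_overlap]    # start of the next window

    # cost[i, j] = mean absolute difference between speaker i (prev) and speaker j (next)
    cost = np.abs(prev_tail[:, :, None] - next_head[:, None, :]).mean(axis=0)
    _, col_ind = linear_sum_assignment(cost)   # optimal permutation
    return next_scores[:, col_ind]
```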
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Hi @hbredin, as per https://huggingface.co/pyannote/segmentation-3.0, the model ingests 10 seconds of audio. For detecting speaker change at inference time within a given audio chunk that is shorter or longer than 10 seconds, would it hurt the model's performance if we pass that chunk to the model as-is instead of padding/trimming it to 10 seconds? Would the permutation-invariant behaviour of the model also affect its outputs in this case? Thank you.
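If padding/trimming is the route taken, a minimal sketch of preparing a chunk of exactly 10 seconds before inference (file name and sample rate are assumptions):

```python
# Sketch: zero-pad or trim a waveform to exactly 10 s before passing it to the
# segmentation model, since the model was trained on 10 s chunks.
import torch
import torchaudio

waveform, sample_rate = torchaudio.load("chunk.wav")   # (channels, samples)
target = 10 * sample_rate

if waveform.shape[1] < target:
    # zero-pad on the right up to 10 seconds
    waveform = torch.nn.functional.pad(waveform, (0, target - waveform.shape[1]))
else:
    # keep only the first 10 seconds
    waveform = waveform[:, :target]
```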