Online speaker segmentation #721
Shape is (num_chunks, num_frames, num_speakers): the 293 values along the middle dimension are frame-level predictions (roughly one every 17ms within the 5s chunk), not duplicates, so averaging over them throws away the temporal resolution. However, because of the 200ms step, there is indeed some kind of temporal redundancy.
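For reference, the frame grid can be read off the returned object directly. A minimal sketch using the `pyannote.audio` / `pyannote.core` API (the file name is a placeholder, and the exact shape of `.data` can differ across versions and step settings):

```python
from pyannote.audio import Inference

inference = Inference("pyannote/segmentation")
output = inference("chunk.wav")      # SlidingWindowFeature for a 5s excerpt
frames = output.sliding_window       # pyannote.core.SlidingWindow
print(frames.duration, frames.step)  # extent and step of one frame, in seconds
print(output.data.shape)             # frame-level speaker activations, not duplicates
```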
Yes, your understanding is correct: the model is trained to discriminate speakers within a 5s chunk, but has no knowledge of what happened before.
As long as a speaker never stops speaking for more than 5s (the length of your window), you should be able to track speakers over time like this. However, this assumption is a bit strong and will probably not hold in most cases. Using a speaker verification model is most likely the way to go to address this: for each locally active speaker in the current window, extract a speaker embedding, compare it to past speakers, and assign it to the most similar speaker. FWIW, pyannote.audio 2.0 also comes with a pretrained speaker embedding model. @juanmc2005 and I have been working in this direction -- should be able to answer with more confidence in the coming weeks...
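For a rough idea of what that could look like, here is a minimal sketch assuming the pretrained `pyannote/embedding` model and a simple cosine-similarity assignment to running centroids; the threshold and the bookkeeping are illustrative, not a tested recipe:

```python
import numpy as np
from pyannote.audio import Inference

# Pretrained speaker embedding model shipped with pyannote.audio 2.0;
# window="whole" returns one embedding for the whole excerpt.
embed = Inference("pyannote/embedding", window="whole")

centroids = []  # one running (unnormalized) centroid per known speaker

def assign_speaker(chunk_wav, threshold=0.6):
    """Map the speaker active in `chunk_wav` to a global speaker index."""
    e = np.asarray(embed(chunk_wav), dtype=float).reshape(-1)
    e /= np.linalg.norm(e)
    if centroids:
        sims = [float(np.dot(e, c / np.linalg.norm(c))) for c in centroids]
        best = int(np.argmax(sims))
        if sims[best] >= threshold:   # similar enough: same speaker as before
            centroids[best] += e
            return best
    centroids.append(e.copy())        # otherwise register a new speaker
    return len(centroids) - 1
```

In practice you would extract the embedding only over the frames where the local speaker is active (e.g. with `Inference.crop`) rather than over the whole chunk, and tune the threshold on held-out data.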
---
I'm trying to use the new pyannote.audio 2.0 segmentation model in an online setting. Briefly, here is what I do: I create an `Inference("pyannote/segmentation")` object and call it on successive 5-second audio chunks. Each call returns a `SlidingWindowFeature` with an odd `.data` structure that has a shape of the form (number of windows) x 293 x 4. In my case, with 5-second audio segments, I only get a prediction for a single window, so the shape is 1 x 293 x 4. Data along the 1-dimension (293) seems to be duplicated, so I average over it to get a 1 x 4 array of "speaker evidence" (roughly as in the sketch below).
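The setup looks like this (file handling simplified; the `ndim` check is there because the number of dimensions seems to vary):

```python
import numpy as np
from pyannote.audio import Inference

inference = Inference("pyannote/segmentation")

def speaker_evidence(chunk_wav):
    """One 5s chunk in, a length-4 array of per-speaker 'evidence' out."""
    output = inference(chunk_wav)   # SlidingWindowFeature
    data = output.data              # shape (1, 293, 4) in my case
    if data.ndim == 2:              # some versions return (num_frames, 4)
        data = data[np.newaxis]
    return data.mean(axis=1)[0]     # average over the 293 frames -> (4,)
```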
array of "speaker evidence".Now, I'm facing the issue that the 4 "speaker evidences" for successive calls to the inference object are not matched. The model is trained to have 4 possible speakers in 5s audio segments, however, a speaker's identity may not be the same in two different audio segments. For example, the speaker that corresponded to index 2 in my array of speaker evidences in audio segment 3 might not be the one that corresponded to index 1 in audio segment 4. So for every 5s audio segment, I need a post processing step, that determines a permutation of the speaker evidence such that the new speaker evidence is a good continuation of the speaker evidence time series that I accumulated so far.
I solve this problem by picking the permutation under which the accumulated time series of speaker evidences is continued most smoothly. I defined smoothness either as a local linear continuation or as agreement with a Kalman filter, with the latter working a tiny bit better. Both work OK, but far from perfect.
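In case it helps, a stripped-down version of the matching step (numpy only, using the local linear continuation; the Kalman variant would replace the one-step predictor):

```python
from itertools import permutations

import numpy as np

def match_speakers(history, current):
    """Reorder `current` (shape (4,)) to best continue `history` (shape (T, 4))."""
    if len(history) < 2:
        prediction = history[-1] if len(history) else current
    else:
        # local linear continuation: extrapolate one step ahead
        prediction = 2 * history[-1] - history[-2]
    best = min(
        permutations(range(len(current))),
        key=lambda p: np.linalg.norm(current[list(p)] - prediction),
    )
    return current[list(best)]
```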
Are other people facing a similar problem? How do you address it? And is my understanding of the `.data` property correct?