Online speaker segmentation #721

igordertigor · 2021-08-03T10:52:04Z

igordertigor
Aug 3, 2021

I'm trying to use the new pyannote-audio2.0 segmentation model in an online setting. Briefly, here is what I do:

I create an inference object by calling Inference("pyannote/segmentation").
I feed the inference object with 5 second audio segments every 200ms. So there is 4.8s overlap between two successive segments.
With this, I get a SlidingWindowFeature with an odd .data structure, that looks like it has a shape of the form Number of Windows X 293 X 4. I.e. in my case, with 5 second audio segments, I only get a prediction for a single window and the shape is 1X293X4. Data long the 1-dimension (293) seems to be duplicated, so I average over the 1-dimension to get a 1X4 array of "speaker evidence".

Now, I'm facing the issue that the 4 "speaker evidences" for successive calls to the inference object are not matched. The model is trained to have 4 possible speakers in 5s audio segments, however, a speaker's identity may not be the same in two different audio segments. For example, the speaker that corresponded to index 2 in my array of speaker evidences in audio segment 3 might not be the one that corresponded to index 1 in audio segment 4. So for every 5s audio segment, I need a post processing step, that determines a permutation of the speaker evidence such that the new speaker evidence is a good continuation of the speaker evidence time series that I accumulated so far.

I solve this problem by picking the permutation for which the already accumulated time series of speaker evidences are most smoothly continued. Here, I defined smoothness as either a local linear continuation or smoothing by a kalman filter, with the latter working a tiny bit better. Both work ok, but far from perfect.

Are other people facing a similar problem? How do you address this? Is my understanding of the .data property correct?

hbredin · 2021-08-03T12:46:31Z

hbredin
Aug 3, 2021
Maintainer

3. With this, I get a `SlidingWindowFeature` with an odd `.data` structure, that looks like it has a shape of the form Number of Windows X 293 X 4. I.e. in my case, with 5 second audio segments, I only get a prediction for a single window and the shape is `1X293X4`. Data long the 1-dimension (293) seems to be duplicated, so I average over the 1-dimension to get a `1X4` array of "speaker evidence".

Shape is num_chunks x num_frames x num_speakers where:

num_chunks is indeed what you call "number of windows"
num_speakers is the maximum number of speakers that the segmentation model can handle in a 5s chunk (corresponds to Kmax in the paper)
num_frames is the number of "frames" output by the segmentation model, given the 5s input buffer. Here, num_frames = 293 so it means the temporal resolution of the output of the model is approximately 5 / 293 = 17ms. Put otherwise, the model outputs one decision (a.k.a. speaker activation) every 17ms.

In the image below, num_speakers = 2 (red and green speakers), num_chunks = 3 and num_frames = 293 and it appears clearly that the num_frames dimension is not duplicated.

However, because of the 200ms step, there is indeed some kind of temporal redundancy.
For any given time step t, it is covered by 5.0s / 200ms = 25 different (shifted) windows.
Therefore, for any given time step t, you get 25 different decisions from the model -- which you should somehow aggregate to get the final decision.

Now, I'm facing the issue that the 4 "speaker evidences" for successive calls to the inference object are not matched. The model is trained to have 4 possible speakers in 5s audio segments, however, a speaker's identity may not be the same in two different audio segments. For example, the speaker that corresponded to index 2 in my array of speaker evidences in audio segment 3 might not be the one that corresponded to index 1 in audio segment 4. So for every 5s audio segment, I need a post processing step, that determines a permutation of the speaker evidence such that the new speaker evidence is a good continuation of the speaker evidence time series that I accumulated so far.

Yes. The model is trained to discriminate speakers within a 5s chunk but has no knowledge of what happened before.
Because of permutation-invariant training, the same speaker may activate two different dimensions in two different chunks.

I solve this problem by picking the permutation for which the already accumulated time series of speaker evidences are most smoothly continued. Here, I defined smoothness as either a local linear continuation or smoothing by a kalman filter, with the latter working a tiny bit better. Both work ok, but far from perfect. Are other people facing a similar problem? How do you address this?

As long as a speaker never stops speaking for more than 5s (the length of your window), you should be able to track speakers over time like this. However, this assumption is a bit strong and will probably not hold in most cases. Using a speaker verification model is most likely the way to go to address this. For each locally active speaker in the current window, extract a speaker embedding, compare it to past speakers, and assign to the most similar speaker. FWIW, pyannote.audio 2.0 also comes with a pretrained speaker embedding model.

@juanmc2005 and I have been working in this direction -- should be able to answer with more confidence in the coming weeks...

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Online speaker segmentation #721

{{title}}

Replies: 1 comment

{{title}}

Select a reply

Online speaker segmentation #721

igordertigor Aug 3, 2021

Replies: 1 comment

hbredin Aug 3, 2021 Maintainer

igordertigor
Aug 3, 2021

hbredin
Aug 3, 2021
Maintainer