SpeechBrain embedding compatibility #34
Hello there,

first, thank you for your great work!

I was trying to reproduce the result that I obtain from https://huggingface.co/spaces/pyannote/pretrained-pipelines, but using the diart modules to achieve the same in a real-time/streaming fashion. That pipeline uses the speechbrain embedding system. I tried to kinda reverse engineer it a bit, but without success :)

I used the example on the README for setting up a pipeline, creating a custom embedding module and then the other modules like the example. So basically I give those `weights` to the model's `__call__`, which triggers line 173 here, setting my embeddings to `nan`.

So surely I didn't understand what the weights coming from the `OverlappedSpeechPenalty` module mean, and whether they are useful for the speechbrain embedding module described here. Could you give me some hints? Maybe it is better to try to build something with Gradio as noted in #30?
Comments
Hi @nefastosaturo and thank you :) From a quick look, the overlapped speech penalty should still be useful because it happens before calling the embedding model and because the weights it computes can also work as a mask. Are you sure you're not having NaNs because of this line instead? In any case I would be extremely interested in adding a working version of this.
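To make the "weights as a mask" idea concrete, here is a minimal sketch of weighted statistics pooling. This is not diart's actual code; the clamps are added to show where all-zero weights would otherwise produce the NaNs discussed above:

```python
import torch

def weighted_stats_pooling(frames: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """Pool frame-level features into one vector, down-weighting masked frames.

    frames:  (batch, time, features) frame-level features
    weights: (batch, time) per-frame weights, e.g. from OverlappedSpeechPenalty
    """
    w = weights.unsqueeze(-1)
    # Normalizing by the weight sum is exactly where NaNs appear when a
    # speaker's weights are all zero, hence the clamp.
    w = w / w.sum(dim=1, keepdim=True).clamp(min=1e-8)
    mean = (w * frames).sum(dim=1)                            # weighted mean over time
    var = (w * (frames - mean.unsqueeze(1)) ** 2).sum(dim=1)  # weighted variance
    return torch.cat([mean, var.clamp(min=1e-8).sqrt()], dim=-1)  # (batch, 2 * features)
```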
I implemented a working version using speechbrain embeddings and it seems to work well with OSP weights as masks, even without normalization. I don't know if this is the same error you were getting (can you post a stack trace?), but I was also getting NaNs at first.

However, from my side (on CPU) speechbrain is way slower than pyannote/embedding. It could still be interesting to have compatibility with the speechbrain model anyway, for other applications where real-time latency is not critical.
Despite this huge slowdown, did you estimate the gain in accuracy?
I didn't have time or access to a gpu but I'll take a look at that when I have some free time |
Hello @juanmc2005, thank you for your answer.

@hbredin, using a single wav file as a test, the diart pipeline with the speechbrain embeddings runs quite fast on my GPU (audio file: ~24s, pipeline computation time with my laptop GPU, an Nvidia 1050: ~5s).

@juanmc2005, the thing about masks is that, as I read here, they are used to ignore padding inside a batch of waveforms. So do you think that by using the weights as masks you kinda "penalize" one waveform against another in a batch?
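For reference, this is what that padding mask looks like in speechbrain; a minimal sketch assuming the standard `EncoderClassifier` API, where `wav_lens` expresses each waveform's length relative to the longest one in the batch:

```python
import torch
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

# Two waveforms of different lengths, zero-padded to the same size
batch = torch.zeros(2, 32000)
batch[0] = torch.randn(32000)
batch[1, :16000] = torch.randn(16000)

# wav_lens tells the model to ignore the zero-padded tail of the second item
embeddings = classifier.encode_batch(batch, wav_lens=torch.tensor([1.0, 0.5]))
```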
Could you clarify these numbers? I'm not familiar with how the speechbrain model uses the mask internally.

If the only use of the mask is to ignore padding, then I agree that it may not be the smartest way to apply OSP. That said, "ignoring" frames with small weights is also kind of what we're aiming at, so I don't think it's the worst idea either. I guess we'll know for sure once we test it against pyannote/embedding.

Right now I'm working on a faster implementation of the pipeline that pre-calculates the segmentation and embeddings in batches before streaming. This should speed things up a lot for evaluation (issue #35). I also want to add an RxPY operator to benchmark real-time latency if I have the time.
yep, sorry I was typing in a hurry :)

I ran your example code from the README, adapted with this custom embedding class:

```python
import torch
from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding

class SpeechBrainEmbedding:
    def __init__(self):
        self.model = PretrainedSpeakerEmbedding(
            "speechbrain/spkrec-ecapa-voxceleb", device=torch.device("cuda:0")
        )

    def __call__(self, waveform, weights):
        with torch.no_grad():
            chunk = torch.from_numpy(waveform.data.T)
            inputs = chunk.unsqueeze(0)
            # no weights: repeat the same chunk for the 4 possible speakers
            inputs = inputs.repeat(4, 1, 1)
            output = self.model(inputs)
            return torch.from_numpy(output)
```

and the clustering step, taken from here:

```python
clustering = fn.OnlineSpeakerClustering(0.555, 0.422, 1.517)
aggregation = fn.DelayedAggregation(
    0.5, 0.5, strategy="hamming", stream_end=None
)
pipeline = rx.zip(segmentation_stream, embedding_stream).pipe(
    ops.starmap(clustering),
    # Buffer 'num_overlapping' sliding chunks with a step of 1 chunk
    myops.buffer_slide(aggregation.num_overlapping_windows),
    # Aggregate overlapping output windows
    ops.map(aggregation),
    # Binarize output
    ops.map(fn.Binarize(uri, 0.5)),
)
```

My audio file contains 3 non-overlapping speakers and is 24.473s long. The diarization pipeline above processes it in ~5.55 seconds. As I said, I ran that code on my laptop with a very basic Nvidia 1050.
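For completeness, a pipeline built this way is driven by subscribing to the resulting observable. This is standard RxPY usage rather than code quoted from the thread:

```python
pipeline.subscribe(
    on_next=lambda annotation: print(annotation),  # one aggregated output per chunk
    on_error=lambda e: print(f"error: {e}"),
)
```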
yes, that could be a nice move. I will try :) But right now I will focus on just obtaining the same result that I got from the offline Hugging Face pipeline on that test audio that I'm using. I think that the crucial step is to focus on the clustering hyper-parameters.

May I ask what range of values are valid for tau_active, rho_update and delta_new?
Ok, thanks for the info. So if I'm not mistaken this is about 140ms per chunk for the whole pipeline (with a 5s window and a 0.5s step, the 24.473s file yields roughly 40 chunks, and 5.55s / 40 ≈ 140ms), which is still good but a bit high. Have you benchmarked it with pyannote/embedding for comparison?
Awesome, please don't forget to contribute with a PR when you get a working version :)
It is indeed important to tune clustering hyper-parameters, but beware that since your implementation doesn't apply OSP, it's going to fail at tracking overlapping speakers because the 4 embeddings you extract from the chunk are identical. Right now that's not a problem for your 24s file from what I understand, but it's something to keep in mind.
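To make that concrete, here is a hedged sketch of forwarding the OSP weights so that the 4 embeddings differ, assuming the pyannote wrapper accepts a `masks` argument in `__call__` (with a speechbrain backend the mask may be reduced to relative lengths internally, so this only approximates OSP). The subclass name is made up for illustration:

```python
import torch

class SpeechBrainEmbeddingWithOSP(SpeechBrainEmbedding):
    def __call__(self, waveform, weights):
        with torch.no_grad():
            chunk = torch.from_numpy(waveform.data.T)         # (channel, sample)
            inputs = chunk.unsqueeze(0).repeat(4, 1, 1)       # one copy per speaker
            masks = torch.from_numpy(weights.data.T).float()  # (speaker, frame)
            # Forward the per-speaker OSP weights as the wrapper's mask input
            output = self.model(inputs, masks=masks)
            return torch.from_numpy(output)
```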
Sure:

```
0 <= tau_active <= 1  # threshold on speaker probabilities
0 <= rho_update <= 1  # ratio of speech in a chunk
0 <= delta_new  <= 2  # threshold on cosine distance
```
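Mapping the values from the clustering snippet above onto these names (the argument order is assumed from the positional call `fn.OnlineSpeakerClustering(0.555, 0.422, 1.517)`):

```python
clustering = fn.OnlineSpeakerClustering(
    tau_active=0.555,  # speakers below this probability are considered inactive
    rho_update=0.422,  # minimum ratio of speech needed to update a centroid
    delta_new=1.517,   # cosine distance beyond which a new speaker is created
)
```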
@nefastosaturo any news on this? Recently I've been working a lot on the possibility to add custom models (#43), to optimize thresholds (#53) and to run a faster batched inference (#35), which should all help integrate and tune speechbrain embeddings. Custom models and optimization should be good to go for version 0.4 (next release).
@zaouk we talked about this some days ago
SpeechBrain embedding compatibility in progress as part of #188