Add compatibility with pyannote 3.0 embedding wrappers #188
Conversation
Thank you for this PR! The core logic looks good. Changes are needed to simplify the API before merging.

@sorgfresser if that's ok with you, let's open a different PR for that.

Hey @juanmc2005
@sorgfresser Thank you for this revised version! It looks way better than before.
I just want to address a couple of concerns before merging:
- I want to support all pyannote embedding models and not just wespeaker (this can be done without much effort)
- There are some small errors here and there and potential improvements that would greatly benefit the diart API

Also, I noticed you forgot to replace `SegmentationModel` with the new API. Remember it also inherits from `LazyModel`, so we need to replace `forward` with `__call__` and make sure that type hints still match. This is very important.
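A rough sketch of the requested change (assuming `LazyModel` provides a `__call__` that lazily loads the wrapped model before delegating, as suggested below in this thread):

```python
import torch

class SegmentationModel(LazyModel):
    def __call__(self, waveform: torch.Tensor) -> torch.Tensor:
        # super().__call__ triggers the lazy load, so neither forward()
        # nor a manual self.load() is needed anymore.
        return super().__call__(waveform)
```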
Thank you again for this enormous contribution! I can't wait to get this merged, it will be huge for many people wanting to improve real-time diarization performance!
src/diart/models.py (Outdated)

```python
weights = weights.to("cpu")
# Move to cpu for numpy conversion
waveform = waveform.to("cpu")
return torch.from_numpy(self.model(waveform, weights))
```
Do `waveform` and `weights` have to be numpy? Have you checked that this works with `pyannote/embedding`? Also, please check that this way we can run both models on GPU. If you don't have a GPU, I can check this myself during testing. Please let me know.

Also, please use `super().__call__(waveform, weights)` here as well, so you don't have to call `self.load()` manually.
Yeah, sadly they have to be. It will be executed on the GPU, but this `.numpy()` requires the `waveform.to("cpu")`. Additionally, this requires the `weights.to("cpu")`, as we otherwise fail with:

```
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
```

I can verify that the GPU is utilized; you can see the ONNX provider set with our call to `.to("cuda")` because of this definition. Nonetheless, the tensors have to be converted to numpy for ONNX use, and as such the errors pointed out above occur if they are not moved to the CPU beforehand.
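To make the constraint concrete, here is a minimal sketch of the round-trip (names are illustrative, not the actual diart code): ONNX Runtime consumes numpy arrays, and `.numpy()` only works on CPU tensors, so both inputs must live on CPU even when the ONNX execution provider runs on CUDA.

```python
import torch

def embed(model, waveform: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    # numpy conversion requires CPU tensors, regardless of which
    # execution provider (CPU or CUDA) runs the ONNX session.
    waveform = waveform.to("cpu")
    weights = weights.to("cpu")
    # the pyannote wrapper converts to numpy internally and returns a numpy array
    return torch.from_numpy(model(waveform, weights))
```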
Do all types of embedding models use onnx? Maybe @hbredin can give us a hint here.

If we absolutely need this `to("cpu")`, we could create a wrapper around `PretrainedEmbeddingModel`, let's say `PipelineInputFormatter`. This way `PyannoteLoader` would return a `PipelineInputFormatter` instead of a `PretrainedEmbeddingModel`:

```python
import numpy as np

class PipelineInputFormatter:
    def __init__(self, model: PretrainedEmbeddingModel):
        self.model = model

    def __call__(self, audio, masks) -> np.ndarray:
        # Move inputs to CPU and convert to numpy before forwarding
        return self.model(audio.cpu().numpy(), masks.cpu().numpy())
```

I don't really like the name though, I'm open to suggestions :)
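For illustration, usage would then look something like this (assuming `PyannoteLoader` already holds a `PretrainedEmbeddingModel` instance named `pretrained_embedding_model`):

```python
formatter = PipelineInputFormatter(pretrained_embedding_model)
embeddings = formatter(audio, masks)  # numpy output; inputs are moved to CPU first
```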
It may also be a matter of needing to specify the device upon instantiation, contrary to `nn.Module`. In that case we may want to refactor diart models to work in the same way.
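For reference, pyannote 3.0's embedding wrapper already takes the device at construction time; a minimal sketch of that pattern (checkpoint name used for illustration):

```python
import torch
from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding

embedding = PretrainedSpeakerEmbedding(
    "pyannote/wespeaker-voxceleb-resnet34-LM",
    device=torch.device("cuda"),
)
```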
> Do all types of embedding models use onnx?

No. Only WeSpeaker ones.
Ok I just took a look at the links sent by @sorgfresser and it does look like a pyannote bug.

@hbredin I opened a PR to fix this: pyannote/pyannote-audio#1518
Could you take a look?
Also, if it's not too much to ask, it would be great to have a release with this fix so we don't have to do any weird workarounds here.

@sorgfresser I just released version 0.8. Please make sure to rebase your branch against develop so we're able to merge:

```shell
git checkout pyannote-3.0
git rebase <diart remote>/develop
# Once successful and without conflicts
git push --force origin pyannote-3.0
```
Force-pushed the branch from `68ca2ba` to `679dee4`.
Ok I think the code is pretty solid! I want to test this but I would prefer to wait for the pyannote fix to be merged and hopefully released by @hbredin 🙏🏻 If the pyannote fix takes a long time to get into a release, I would prefer to do the required changes here anyway. In this case, WeSpeaker embeddings wouldn't work (on GPU) temporarily, but I prefer the code to be clean if it's going to be part of the next release.
@sorgfresser could you please rebase on top of `develop`?
Thanks for rebasing @juanmc2005 and sorry for the late reply. I added the docstrings and removed the CPU moving. Is there anything else I can add / modify?
Hey @sorgfresser thanks for the new changes. I think we're good here. Now that the code is looking good, I'll pull the branch locally and do some tests to see if the feature is working correctly, in particular WeSpeaker and some other model like ECAPA-TDNN. If my tests look good I'll go ahead and merge. I'll probably wait for the pyannote fix to release v0.9 though.
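For context, both candidates can be loaded through the same diart wrapper; a plausible test setup (checkpoint identifiers assumed from the Hugging Face hub):

```python
from diart.models import EmbeddingModel

# WeSpeaker (ONNX-based) and ECAPA-TDNN (speechbrain) embedding models
wespeaker = EmbeddingModel.from_pyannote("pyannote/wespeaker-voxceleb-resnet34-LM")
ecapa = EmbeddingModel.from_pyannote("speechbrain/spkrec-ecapa-voxceleb")
```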
@sorgfresser quick update after some tests. It looks like normalizing weights does affect the embeddings. DER on AMI goes from 27.3 to 29.8, which is pretty bad. When I remove the normalization it goes down to 27.5, so there's something else affecting performance negatively. I think we should add a parameter somewhere to specify if weights should be normalized. I'll keep investigating and get back.
@hbredin looks like the difference between the 27.5 and 27.3 was because of pytorch 2.1.0. Maybe it's related to the automatic conversion of the segmentation and embedding models. I keep getting these warnings:
I don't think this is a deal-breaker so I won't change the requirements to force
@sorgfresser could you move the weight normalization code to `OverlappedSpeechPenalty`? Something like:

```python
def __call__(self, segmentation: TemporalFeatures) -> TemporalFeatures:
    weights = self.formatter.cast(segmentation)  # shape (batch, frames, speakers)
    with torch.no_grad():
        probs = torch.softmax(self.beta * weights, dim=-1)
        weights = torch.pow(weights, self.gamma) * torch.pow(probs, self.gamma)
        weights[weights < 1e-8] = 1e-8
        if self.normalize:
            min_values = weights.min(dim=1, keepdim=True).values
            max_values = weights.max(dim=1, keepdim=True).values
            weights = (weights - min_values) / (max_values - min_values)
            weights.nan_to_num_(1e-8)
    return self.formatter.restore_type(weights)
```

where `self.normalize` is a new boolean attribute of `OverlappedSpeechPenalty`. This way, users can decide whether they want to do this normalization or not directly in the pipeline config. I could do these changes myself but I'm not sure I have write access to your fork. I would also like to have this as a CLI argument.
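To illustrate, the block could then be instantiated with the flag directly (import path assumed; `gamma` and `beta` defaults taken from the existing class):

```python
from diart.blocks import OverlappedSpeechPenalty  # import path assumed

osp = OverlappedSpeechPenalty(gamma=3, beta=10, normalize=True)
weights = osp(segmentation)  # min-max normalizes weights per chunk when enabled
```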
Otherwise, I was able to run
I added the boolean to the CLI and moved the normalization to `OverlappedSpeechPenalty`. Is that the way you'd like the CLI to behave?
@sorgfresser thank you for the swift reply and commit! I'll re-run the tests as soon as I can and get back with updates.
I did some new tests and everything works well! My benchmark on AMI with WeSpeaker embeddings gave DER=28.9, but it may be a matter of tuning the hyper-parameters. Also it may be worth benchmarking without normalizing weights.
I can't seem to get ONNX to run on my GPU but I think it might be a problem with my CUDA drivers. @sorgfresser can you run them on GPU?
I will commit some suggestions here and there, and I just have a couple of formatting things that I'd like to improve. Once those are fixed I'm good to merge.
src/diart/models.py (Outdated)

```python
def __call__(self, waveform: torch.Tensor) -> torch.Tensor:
    """
    Call the forward pass of the segmentation model.

    Parameters
    ----------
    waveform: torch.Tensor, shape (batch, channels, samples)

    Returns
    -------
    speaker_segmentation: torch.Tensor, shape (batch, frames, speakers)
    """
    return super().__call__(waveform)
```
Move this to `SegmentationModel`.
Update: AMI benchmark with WeSpeaker embeddings and no weight normalization gives DER=30.8.
@sorgfresser huge thanks for this feature! Stay tuned for v0.9! I hope we can get it released as soon as possible.
* bump pyannote to 3.0
* add wespeaker inference
* add weights normalization, cpu for numpy conversion
* unify api
* remove try catch
* always normalize
* use PretrainedSpeakerEmbedding in Loader
* Fix min-max normalization equation
* fix: remove imports
* Change embedding model return type to Callable (Co-authored-by: Simon <80467011+sorgfresser@users.noreply.github.com>)
* fix: remove type checking
* remove from active if NaN embeddings
* Fix wrong typing of model in `LazyModel`
* add docstrings
* Simplify EmbeddingModel.__call__()
* Add numpy import
* add normalize boolean
* Update requirements.txt
* Update setup.cfg
* Apply suggestions from code review
* Fix wrong kwarg name
* add abstract __call__
* move __call__ to parent class

Co-authored-by: Juan Coria <juanmc2005@hotmail.com>
Adds initial support for the embedding model used in pyannote/speaker-diarization-3.0.
Usage:
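A minimal sketch of the intended usage (class names as in diart v0.9's API; checkpoint name assumed from the Hugging Face hub):

```python
from diart import SpeakerDiarization, SpeakerDiarizationConfig
from diart.models import EmbeddingModel, SegmentationModel

segmentation = SegmentationModel.from_pyannote("pyannote/segmentation")
embedding = EmbeddingModel.from_pyannote("pyannote/wespeaker-voxceleb-resnet34-LM")
config = SpeakerDiarizationConfig(segmentation=segmentation, embedding=embedding)
pipeline = SpeakerDiarization(config)
```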
I am still lacking support for `pyannote/segmentation-3.0` as of now, and I am not 100% sure why... I thought it should be a drop-in replacement for `pyannote/segmentation`, but it does not seem to work. Any hints here would be greatly appreciated.