Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add compatibility with pyannote 3.0 embedding wrappers #188

Merged
merged 23 commits into from
Nov 9, 2023

Conversation

sorgfresser
Copy link
Contributor

@sorgfresser sorgfresser commented Oct 18, 2023

Adds initial support for the embedding model using in pyannote/speaker-diarization-3.0

Usage:

embedding = EmbeddingModel.from_pyannote("hbredin/wespeaker-voxceleb-resnet34-LM")
embedding.to(device("cuda"))
config = SpeakerDiarizationConfig(embedding=embedding)
pipeline = SpeakerDiarization(config)
mic = MicrophoneAudioSource()
inference = StreamingInference(pipeline, mic, do_plot=True)
inference.attach_observers(RTTMWriter(mic.uri, "output/file.rttm"))
prediction = inference()

I am still lacking support for pyannote/segmentation-3.0 as of now and I am not 100% sure why... I thought it should be drop in replacement for pyannote/segmentation but it does not seem to work.

Any hints here would be greatly appreciated.

@hbredin
Copy link
Collaborator

hbredin commented Oct 18, 2023

Hint for pyannote/segmentation-3.0 support: use Powerset.to_multilabel conversion as illustrated here

Copy link
Owner

@juanmc2005 juanmc2005 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR! The core logic looks good. Changes are needed to simplify the API before merging

src/diart/models.py Outdated Show resolved Hide resolved
src/diart/models.py Outdated Show resolved Hide resolved
src/diart/models.py Outdated Show resolved Hide resolved
src/diart/models.py Outdated Show resolved Hide resolved
src/diart/models.py Outdated Show resolved Hide resolved
src/diart/models.py Outdated Show resolved Hide resolved
src/diart/blocks/diarization.py Outdated Show resolved Hide resolved
@juanmc2005
Copy link
Owner

@sorgfresser if that's ok with you, let's open a different PR for segmentation-3.0 so we can merge this one before that

@juanmc2005 juanmc2005 added feature New feature or request API Improvements to the API labels Oct 18, 2023
@juanmc2005 juanmc2005 changed the title Pyannote 3.0 Add compatibility with pyannote 3.0 embedding wrappers Oct 18, 2023
@juanmc2005 juanmc2005 added this to the Version 0.9 milestone Oct 19, 2023
@sorgfresser
Copy link
Contributor Author

Hey @juanmc2005
Thanks for your recommendations, I hopefully added them all now. Could you review again?
The usage has changed a bit, I've updated my original comment accordingly.

Copy link
Owner

@juanmc2005 juanmc2005 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sorgfresser Thank you for this revised version! It looks way better than before.
I just want to address a couple of concerns before merging:

  • I want to support all pyannote embedding models and not just wespeaker (this can be done without much effort)
  • There are some small errors here and there and potential improvements that would greatly benefit the diart API

Also, I noticed you forgot to replace SegmentationModel with the new API. Remember it also inherits from LazyModel so we need to replace forward with __call__ and make sure that type hints still match. This is very important.

Thank you again for this enormous contribution! I can't wait to get this merged, it will be huge for many people wanting to improve real-time diarization performance!

src/diart/mapping.py Outdated Show resolved Hide resolved
src/diart/models.py Outdated Show resolved Hide resolved
src/diart/models.py Outdated Show resolved Hide resolved
src/diart/models.py Outdated Show resolved Hide resolved
src/diart/models.py Outdated Show resolved Hide resolved
src/diart/models.py Outdated Show resolved Hide resolved
src/diart/models.py Outdated Show resolved Hide resolved
src/diart/models.py Outdated Show resolved Hide resolved
weights = weights.to("cpu")
# Move to cpu for numpy conversion
waveform = waveform.to("cpu")
return torch.from_numpy(self.model(waveform, weights))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do waveform and weights have to be numpy? have you checked that this works with pyannote/embedding? Also please check that this way we can run both models on GPU.

If you don't have a GPU, I can check this myself during testing. Please let me know.

Also, please use super().__call__(waveform, weights) here as well so you don't have to call self.load() manually

Copy link
Contributor Author

@sorgfresser sorgfresser Oct 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, sadly they have to be. It will be executed on the GPU, but this .numpy() requires the waveform.to("cpu")
Additionally this requires the weights.to("cpu") as we otherwise fail with

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

I can verify that the GPU is utilized, you can see the ONNX provider set with our call to .to("cuda") because of this definition

Nonetheless, the tensors have to be converted to numpy for onnx use and as such the errors pointed out above exist if they are not moved to the cpu beforehand.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do all types of embedding models use onnx? Maybe @hbredin can give us a hint here.

If we absolutely need this to("cpu"), we could create a wrapper around PretrainedEmbeddingModel, let's say PipelineInputFormatter.

This way PyannoteLoader would return a PipelineInputFormatter instead of a PretrainedEmbeddingModel:

class PipelineInputFormatter:
    def __init__(self, model: PretrainedEmbeddingModel):
        self.model = model

    def __call__(self, audio, masks) -> np.ndarray:
        return self.model(audio.cpu().numpy(), masks.cpu().numpy())

I don't really like the name though, I'm open to suggestions :)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may also be a matter of needing to specify the device upon instantiation, contrary to nn.Module. In that case we may want to refactor diart models to work in the same way

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do all types of embedding models use onnx?

No. Only WeSpeaker ones.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I just took a look at the links sent by @sorgfresser and it does look like a pyannote bug.
@hbredin I opened a PR to fix this: pyannote/pyannote-audio#1518
Could you take a look?

Also, if it's not too much to ask, it would be great to have a release with this fix so we don't have to do any weird workarounds here

src/diart/models.py Outdated Show resolved Hide resolved
src/diart/models.py Outdated Show resolved Hide resolved
src/diart/models.py Outdated Show resolved Hide resolved
src/diart/models.py Outdated Show resolved Hide resolved
src/diart/models.py Outdated Show resolved Hide resolved
src/diart/models.py Outdated Show resolved Hide resolved
@juanmc2005
Copy link
Owner

@sorgfresser I just released version 0.8. Please make sure to rebase your branch against develop so we're able to merge:

git checkout pyannote-3.0
git rebase <diart remote>/develop
# Once successful and without conflicts
git push --force origin pyannote-3.0

src/diart/models.py Outdated Show resolved Hide resolved
src/diart/models.py Outdated Show resolved Hide resolved
@juanmc2005
Copy link
Owner

Ok I think the code is pretty solid! I want to test this but I would prefer to wait for the pyannote fix to be merged and hopefully released by @hbredin 🙏🏻

If the pyannote fix takes a long time to get into a release, I would prefer to do the required changes here anyway. In this case, WeSpeaker embeddings wouldn't work (on GPU) temporarily, but I prefer the code to be clean if it's going to be part of the next release

src/diart/models.py Outdated Show resolved Hide resolved
src/diart/models.py Outdated Show resolved Hide resolved
@juanmc2005
Copy link
Owner

@sorgfresser could you please rebase on top of develop and force push again? I added some GitHub action checks that I would like to run on this PR to make sure nothing's broken

@sorgfresser
Copy link
Contributor Author

Thanks for rebasing @juanmc2005 and sorry for the late reply. I added the docstrings and removed the cpu moving - is there anything else I can add / modify?

src/diart/models.py Outdated Show resolved Hide resolved
@juanmc2005
Copy link
Owner

Hey @sorgfresser thanks for the new changes. I think we're good here. Now that the code is looking good I'll pull the branch locally and do some tests to see if the feature is working correctly. In particular wespeaker and some other model like ecapa tdnn.

If my tests look good I'll go ahead and merge. I'll probably wait for the pyannote fix to release v0.9 though.

@juanmc2005
Copy link
Owner

@sorgfresser quick update after some tests.

It looks like normalizing weights does affect the embeddings. DER on AMI goes from 27.3 to to 29.8, which is pretty bad. When I remove the normalization it goes down to 27.5, so there's something else affecting performance negatively.

I think we should add a parameter somewhere to specify if weights should be normalized.

I'll keep investigating and get back

@juanmc2005
Copy link
Owner

@hbredin looks like the difference between the 27.5 and 27.3 was because of pytorch 2.1.0.
Downgrading to 2.0.1 solves this. Have you observed any performance changes in pyannote too with this new pytorch version?

Maybe it's related to the automatic conversion of the segmentation and embedding models. I keep getting these warnings:

Model was trained with pyannote.audio 0.0.1, yours is 3.0.0. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.8.1+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.

I don't think this is a deal-breaker so I won't change the requirements to force torch<=2.1.0, but it may be worth checking if the model is being loaded badly with torch 2.1.0

@juanmc2005
Copy link
Owner

@sorgfresser could you move the weight normalization code to diart.blocks.OverlappedSpeechPenalty?
The new OverlappedSpeechPenalty.__call__() method should look like this:

    def __call__(self, segmentation: TemporalFeatures) -> TemporalFeatures:
        weights = self.formatter.cast(segmentation)  # shape (batch, frames, speakers)
        with torch.no_grad():
            probs = torch.softmax(self.beta * weights, dim=-1)
            weights = torch.pow(weights, self.gamma) * torch.pow(probs, self.gamma)
            weights[weights < 1e-8] = 1e-8
            if self.normalize:
                min_values = weights.min(dim=1, keepdim=True).values
                max_values = weights.max(dim=1, keepdim=True).values
                weights = (weights - min_values) / (max_values - min_values)
                weights.nan_to_num_(1e-8)
        return self.formatter.restore_type(weights)

Where self.normalize is a new constructor argument that defaults to False.
Also, we need to add a normalize_weights argument to OverlapAwareSpeakerEmbedding.
Finally, let's add a normalize_embedding_weights to DiarizationConfig, and then pass this value to OverlapAwareSpeakerEmbedding inside the constructor of diart.blocks.diarization.SpeakerDiarization.

This way, users can decide whether they want to do this normalization or not directly in the pipeline config.

I could do these changes myself but I'm not sure I have write access to your fork.

I would also like to have this as a CLI argument --normalize-embedding-weights in diart.stream, diart.benchmark, diart.tune and diart.serve, but I would merge the PR without this anyway. I leave it for you to decide if you want to implement that feature here.
In any case, pipeline configs are becoming pretty big so I'm thinking of converting them to a yaml file soon.

@juanmc2005
Copy link
Owner

Otherwise, I was able to run diart.stream with WeSpeaker embeddings! Amazing work @sorgfresser!

@hbredin
Copy link
Collaborator

hbredin commented Nov 6, 2023

@hbredin looks like the difference between the 27.5 and 27.3 was because of pytorch 2.1.0. Downgrading to 2.0.1 solves this. Have you observed any performance changes in pyannote too with this new pytorch version?

pyannote's CI is kind of non-existent so I don't actually know :)

@sorgfresser
Copy link
Contributor Author

I added the boolean to cli and moved the normalization to OverlappedSpeechPenalty. Is that the way you'd like the cli to behave?
Would be nice if you'd test it again. What models did you use for your benchmark to get the 29.8 on AMI?
I think btw you can edit the fork, feel free to do so but I would be willing to implement any other changes too if you prefer me to do it.

@juanmc2005
Copy link
Owner

@sorgfresser thank you for the swift reply and commit!
I ran the command on the reproducibility section, only changing --tau-active, --delta-new and --rho-update to the AMI values from the hyper-parameter table.
So the models I used were pyannote/segmentation@Interspeech2021 and pyannote/embedding.

I'll re-run the tests as soon as I can and get back with updates

Copy link
Owner

@juanmc2005 juanmc2005 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some new tests and everything works well! My benchmark on AMI with WeSpeaker embeddings gave DER=28.9, but it may be a matter of tuning the hyper-parameters. Also it may be worth benchmarking without normalizing weights.

I can't seem to get ONNX to run on my GPU but I think it might be a problem with my CUDA drivers. @sorgfresser can you run them on GPU?

I will commit some suggestions here and there and I just have a couple of formatting things that I'd like to improve. Once this is fixed I'm good to merge.

requirements.txt Outdated Show resolved Hide resolved
setup.cfg Outdated Show resolved Hide resolved
src/diart/argdoc.py Outdated Show resolved Hide resolved
src/diart/blocks/embedding.py Outdated Show resolved Hide resolved
src/diart/blocks/embedding.py Outdated Show resolved Hide resolved
src/diart/blocks/embedding.py Outdated Show resolved Hide resolved
src/diart/models.py Outdated Show resolved Hide resolved
src/diart/models.py Outdated Show resolved Hide resolved
Comment on lines 117 to 127
def __call__(self, waveform: torch.Tensor) -> torch.Tensor:
"""
Call the forward pass of the segmentation model.
Parameters
----------
waveform: torch.Tensor, shape (batch, channels, samples)
Returns
-------
speaker_segmentation: torch.Tensor, shape (batch, frames, speakers)
"""
return super().__call__(waveform)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this to SegmentationModel

src/diart/models.py Show resolved Hide resolved
@juanmc2005
Copy link
Owner

Update: AMI benchmark with WeSpeaker embeddings and not weight normalization gives DER=30.8

@juanmc2005 juanmc2005 merged commit 14910e1 into juanmc2005:develop Nov 9, 2023
@juanmc2005
Copy link
Owner

@sorgfresser huge thanks for this feature! Stay tuned for v0.9! I hope we can get it released as soon as possible.
If you liked contributing to diart I'd love to work with you on other issues in need of work 😃 Even better if it's on the list for the v0.9 milestone!

juanmc2005 added a commit that referenced this pull request Nov 11, 2023
* bump pyannote to 3.0

* add wespeaker inference

* add weights normalization, cpu for numpy conversion

* unify api

* remove try catch

* always normalize

* use PretrainedSpeakerEmbedding in Loader

* Fix min-max normalization equation

* fix: remove imports

* Change embedding model return type to Callable

Co-authored-by: Simon <80467011+sorgfresser@users.noreply.github.com>

* fix: remove type checking

* remove from active if NaN embeddings

* Fix wrong typing of model in `LazyModel`

* add docstrings

* Simplify EmbeddingModel.__call__()

* Add numpy import

* add normalize boolean

* Update requirements.txt

* Update setup.cfg

* Apply suggestions from code review

* Fix wrong kwarg name

* add abstract __call__

* move __call__ to parent class

---------

Co-authored-by: Juan Coria <juanmc2005@hotmail.com>
@juanmc2005 juanmc2005 mentioned this pull request Nov 18, 2023
juanmc2005 added a commit that referenced this pull request Nov 19, 2023
* bump pyannote to 3.0

* add wespeaker inference

* add weights normalization, cpu for numpy conversion

* unify api

* remove try catch

* always normalize

* use PretrainedSpeakerEmbedding in Loader

* Fix min-max normalization equation

* fix: remove imports

* Change embedding model return type to Callable

Co-authored-by: Simon <80467011+sorgfresser@users.noreply.github.com>

* fix: remove type checking

* remove from active if NaN embeddings

* Fix wrong typing of model in `LazyModel`

* add docstrings

* Simplify EmbeddingModel.__call__()

* Add numpy import

* add normalize boolean

* Update requirements.txt

* Update setup.cfg

* Apply suggestions from code review

* Fix wrong kwarg name

* add abstract __call__

* move __call__ to parent class

---------

Co-authored-by: Juan Coria <juanmc2005@hotmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Improvements to the API feature New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants