Merge pull request #100 from sensein/99-add-speaker-verification-note…
…book-and-documentation

✨Add speaker verification tutorial and documentation
fabiocat93 authored Sep 26, 2024
2 parents 01a5bdc + f9dca74 commit 466e7ec
Showing 2 changed files with 143 additions and 0 deletions.
60 changes: 60 additions & 0 deletions src/senselab/audio/tasks/speaker_verification/doc.md
@@ -0,0 +1,60 @@
# Speaker Verification
Last updated: 08/13/2024

<button class="tutorial-button" onclick="window.location.href='https://github.com/sensein/senselab/blob/main/tutorials/speaker_verification.ipynb'">Tutorial</button>

## Task Overview
Speaker verification authenticates a claimed identity from voice features: given two speech samples, the system decides whether they come from the same speaker.

This technology is widely used in various applications, including security systems, authentication processes, and personalized user experiences. The core concept revolves around comparing voice characteristics extracted from speech samples to verify the identity of the speaker.

SenseLab speaker verification extracts an embedding from each audio file, computes the cosine similarity between the two embeddings, and compares that score against a similarity threshold to decide whether the two files came from the same speaker.
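
As a rough illustration of this decision rule (not SenseLab's actual implementation), the sketch below compares two embedding vectors with cosine similarity and applies a threshold. The function names and the toy 3-dimensional vectors are assumptions made for illustration; the 0.25 threshold mirrors the default mentioned in the tutorial, and real model embeddings are much higher-dimensional.

```python
import math
from typing import Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Embeddings must have the same dimensionality.")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("Cosine similarity is undefined for a zero vector.")
    return dot / (norm_a * norm_b)


def same_speaker(
    a: Sequence[float], b: Sequence[float], threshold: float = 0.25
) -> Tuple[float, bool]:
    """Return (similarity score, whether the score exceeds the threshold)."""
    score = cosine_similarity(a, b)
    return score, score > threshold


# Toy embeddings (hypothetical values, not model output)
emb_a = [0.9, 0.1, 0.3]
emb_b = [0.8, 0.2, 0.4]
emb_c = [-0.5, 0.9, -0.1]

print(same_speaker(emb_a, emb_b))  # high similarity, above the threshold
print(same_speaker(emb_a, emb_c))  # negative similarity, below the threshold
```

Raising the threshold makes the rule stricter (fewer false acceptances, more false rejections), and lowering it does the opposite.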

## Models
SenseLab speaker verification extracts audio embeddings using ECAPA-TDNN (`speechbrain/spkrec-ecapa-voxceleb`). This model is part of [SpeechBrain](https://huggingface.co/speechbrain), "an open-source and all-in-one conversational AI toolkit based on PyTorch." It was trained on [VoxCeleb1](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html) and [VoxCeleb2](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html), two datasets of celebrity speech. Audio samples used for verification must have a sampling rate of 16 kHz, the rate on which `speechbrain/spkrec-ecapa-voxceleb` was trained.

- **ECAPA-TDNN**
- [spkrec-ecapa-voxceleb](https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb)
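
One way to guard against a sampling-rate mismatch before running verification is to inspect the file header. The sketch below uses only Python's standard `wave` module, so it applies to uncompressed WAV files only. The helper names (`wav_sampling_rate`, `is_16khz_wav`, `write_silent_wav`) and file names are hypothetical, and resampling itself is outside the scope of this example.

```python
import os
import tempfile
import wave

EXPECTED_RATE = 16000  # rate stated above for speechbrain/spkrec-ecapa-voxceleb


def wav_sampling_rate(path: str) -> int:
    """Read the sampling rate (frames per second) from a WAV header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def is_16khz_wav(path: str) -> bool:
    """Return True if the WAV file is sampled at the expected 16 kHz."""
    return wav_sampling_rate(path) == EXPECTED_RATE


def write_silent_wav(path: str, rate: int, n_frames: int = 160) -> None:
    """Write a short mono 16-bit WAV of silence (demo helper only)."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * n_frames)


# Demo on temporary files so the example does not depend on real recordings
with tempfile.TemporaryDirectory() as tmp_dir:
    ok_path = os.path.join(tmp_dir, "ok_16k.wav")
    bad_path = os.path.join(tmp_dir, "bad_44k.wav")
    write_silent_wav(ok_path, 16000)
    write_silent_wav(bad_path, 44100)
    print(is_16khz_wav(ok_path))
    print(is_16khz_wav(bad_path))
```

Files that fail this check would need to be resampled to 16 kHz with an audio library before their embeddings are compared.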


## Evaluation
### Metrics
The primary evaluation metric for speaker verification is the Equal Error Rate ([EER](https://www.sciencedirect.com/topics/computer-science/equal-error-rate)): the error rate at the decision threshold where the false acceptance rate (FAR) and the false rejection rate (FRR) are equal.
- **False Acceptance Rate (FAR)**: The probability that the system accepts an invalid input, such as a pair of recordings from different speakers. Also called the false match rate or false positive rate.
- **False Rejection Rate (FRR)**: The probability that the system rejects a valid input, such as a pair of recordings from the same speaker. Also called the false non-match rate or false negative rate.

Lower values on these metrics indicate better performance.
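
To make the relationship between FAR, FRR, and EER concrete, here is a minimal sketch that sweeps candidate thresholds over a handful of made-up similarity scores. The scores, labels, and function names are invented for illustration. The EER is approximated as the midpoint of FAR and FRR at the threshold where the two are closest; evaluation toolkits typically interpolate more carefully.

```python
from typing import List, Tuple


def far_frr(
    scores: List[float], labels: List[bool], threshold: float
) -> Tuple[float, float]:
    """Compute (FAR, FRR) when a pair is accepted if score > threshold.

    labels[i] is True for a same-speaker pair and False for a
    different-speaker pair.
    """
    genuine = [s for s, same in zip(scores, labels) if same]
    impostor = [s for s, same in zip(scores, labels) if not same]
    if not genuine or not impostor:
        raise ValueError("Need at least one genuine and one impostor pair.")
    far = sum(s > threshold for s in impostor) / len(impostor)
    frr = sum(s <= threshold for s in genuine) / len(genuine)
    return far, frr


def approximate_eer(scores: List[float], labels: List[bool]) -> Tuple[float, float]:
    """Return (approximate EER, threshold) where |FAR - FRR| is smallest."""
    best_gap = None
    best_eer = 0.0
    best_threshold = 0.0
    for threshold in sorted(set(scores)):
        far, frr = far_frr(scores, labels, threshold)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            best_eer = (far + frr) / 2
            best_threshold = threshold
    return best_eer, best_threshold


# Made-up similarity scores and ground-truth labels (True = same speaker)
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]
labels = [True, True, True, False, True, False, False, False]

eer, threshold = approximate_eer(scores, labels)
print(f"Approximate EER: {eer:.2f} at threshold {threshold}")
```

With these toy scores, one of four impostor pairs is accepted and one of four genuine pairs is rejected at a threshold of 0.4, so FAR and FRR meet at 0.25.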

### Datasets
Common datasets used for evaluating speaker verification models include:
- [VoxCeleb1](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html): A large-scale speaker verification dataset containing over 100,000 utterances from 1,251 celebrities.

Verification Split
| | Dev | Test |
|-----------------|------------|----------|
| # of speakers | 1,211 | 40 |
| # of videos | 21,819 | 677 |
| # of utterances | 148,642 | 4,874 |

Identification Split
| | Dev | Test |
|-----------------|------------|----------|
| # of speakers | 1,251 | 1,251 |
| # of videos | 21,245 | 1,251 |
| # of utterances | 145,265 | 8,251 |

- [VoxCeleb2](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html): An extension of VoxCeleb1, with more speakers and more utterances, used for training and evaluation.

| | Dev | Test |
|-----------------|------------|----------|
| # of speakers | 5,994 | 118 |
| # of videos | 145,569 | 4,911 |
| # of utterances | 1,092,009 | 36,237 |

For more details on these datasets and the evaluation process, refer to the [VoxCeleb paper](https://arxiv.org/abs/1706.08612).

### Benchmark
See the [Papers With Code leaderboard](https://paperswithcode.com/sota/speaker-verification-on-voxceleb) for rankings of speaker verification by EER on VoxCeleb.

## Notes
- Fine-tuning the model on data from the target domain (for example, recordings made under the same conditions as the audio to be verified) may further improve accuracy.
83 changes: 83 additions & 0 deletions tutorials/speaker_verification.ipynb
@@ -0,0 +1,83 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"#### Speaker Verification\n",
"Speaker verification is the process by which an audio system determines whether two speech samples come from the same speaker. This technology is widely used in applications such as security systems, authentication processes, and personalized user experiences. The core concept is to compare voice characteristics extracted from the speech samples in order to verify the identity of the speaker.\n",
"\n",
"Speaker verification can be done in SenseLab as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import necessary libraries\n",
"# NOTE: the Audio import path below is assumed from the senselab package layout\n",
"from senselab.audio.data_structures.audio import Audio\n",
"from senselab.audio.tasks.speaker_verification.speaker_verification import verify_speaker\n",
"\n",
"# Create two audio samples (dummy data for illustration purposes)\n",
"audio1 = Audio(signal=[0.1, 0.2, 0.3], sampling_rate=16000)\n",
"audio2 = Audio(signal=[0.1, 0.2, 0.3], sampling_rate=16000)\n",
"\n",
"# List of audio pairs to compare\n",
"audio_pairs = [(audio1, audio2)]\n",
"\n",
"# Verify if the audios are from the same speaker\n",
"results = verify_speaker(audio_pairs)\n",
"\n",
"# Print the results\n",
"for score, is_same_speaker in results:\n",
"    print(f\"Verification Score: {score}, Same Speaker: {is_same_speaker}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `verify_speaker` function performs speaker verification using a pre-trained model. It works as follows:\n",
"\n",
"Input Data: The function takes a list of tuples, where each tuple contains two audio samples to be compared. Each audio sample is represented by an Audio object which includes the signal data and sampling rate.\n",
"\n",
"Model and Device Setup: The function uses a pre-trained speaker verification model (SpeechBrainModel). It also selects the appropriate device (CPU or GPU) to run the model efficiently.\n",
"\n",
"Sampling Rate Check: The function ensures that the audio samples have a sampling rate of 16kHz, as this is the rate the model was trained on. If the sampling rate does not match, it raises an error.\n",
"\n",
"Embedding Extraction: For each pair of audio samples, the function extracts speaker embeddings using the SpeechBrainEmbeddings module. Embeddings are numerical representations that capture the unique characteristics of a speaker's voice.\n",
"\n",
"Cosine Similarity Calculation: The function calculates the cosine similarity between the embeddings of the two audio samples. Cosine similarity is a measure of similarity between two vectors, where a higher value indicates greater similarity.\n",
"\n",
"Threshold Comparison: The function compares the calculated similarity score against a predefined threshold (default is 0.25). If the score exceeds the threshold, it indicates that the two audio samples are likely from the same speaker.\n",
"\n",
"Output: The function returns a list of tuples, each containing the similarity score and a boolean indicating whether the two audio samples are from the same speaker."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "senselab-_dRIpWVy-py3.10",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.10.11"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
