From 31121421c21ea326f2551ea546bc7c66d135c860 Mon Sep 17 00:00:00 2001
From: ibevers
Date: Tue, 16 Jul 2024 11:58:08 -0400
Subject: [PATCH 1/2] Add first draft of speaker verification tutorial

---
 tutorials/speaker_verification.ipynb | 83 ++++++++++++++++++++++++++++
 1 file changed, 83 insertions(+)
 create mode 100644 tutorials/speaker_verification.ipynb

diff --git a/tutorials/speaker_verification.ipynb b/tutorials/speaker_verification.ipynb
new file mode 100644
index 00000000..2e6e6532
--- /dev/null
+++ b/tutorials/speaker_verification.ipynb
@@ -0,0 +1,83 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
+    "#### Speaker Verification\n",
+    "Speaker verification determines whether two speech samples come from the same speaker by comparing voice characteristics extracted from each sample. It is widely used in applications such as security systems, authentication workflows, and personalized user experiences.\n",
+    "\n",
+    "Speaker verification can be done in SenseLab as follows:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Import the verification task and the Audio data structure\n",
+    "# (the Audio import path is assumed here; adjust it if it differs in your senselab version)\n",
+    "from senselab.audio.data_structures.audio import Audio\n",
+    "from senselab.audio.tasks.speaker_verification.speaker_verification import verify_speaker\n",
+    "\n",
+    "# Create two audio samples (dummy data for illustration purposes;\n",
+    "# in practice, load real 16 kHz speech recordings)\n",
+    "audio1 = Audio(signal=[0.1, 0.2, 0.3], sampling_rate=16000)\n",
+    "audio2 = Audio(signal=[0.1, 0.2, 0.3], sampling_rate=16000)\n",
+    "\n",
+    "# List of audio pairs to compare\n",
+    "audio_pairs = [(audio1, audio2)]\n",
+    "\n",
+    "# Verify whether each pair of audios comes from the same speaker\n",
+    "results = verify_speaker(audio_pairs)\n",
+    "\n",
+    "# Print the results\n",
+    "for score, is_same_speaker in results:\n",
+    "    print(f\"Verification Score: {score}, Same Speaker: {is_same_speaker}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The `verify_speaker` function performs speaker verification with a pre-trained model. Here is a breakdown of how it works:\n",
+    "\n",
+    "- **Input data**: The function takes a list of tuples, where each tuple contains the two audio samples to compare. Each sample is an `Audio` object holding the signal data and its sampling rate.\n",
+    "- **Model and device setup**: The function uses a pre-trained speaker verification model (`SpeechBrainModel`) and selects an appropriate device (CPU or GPU) to run it on.\n",
+    "- **Sampling rate check**: The function requires a sampling rate of 16 kHz, the rate the model was trained on, and raises an error if a sample does not match.\n",
+    "- **Embedding extraction**: For each pair, the function extracts speaker embeddings with the `SpeechBrainEmbeddings` module. Embeddings are numerical representations that capture the distinctive characteristics of a speaker's voice.\n",
+    "- **Cosine similarity calculation**: The function computes the cosine similarity between the two embeddings; a higher value indicates greater similarity.\n",
+    "- **Threshold comparison**: The similarity score is compared against a predefined threshold (default 0.25). A score above the threshold indicates that the two samples likely come from the same speaker.\n",
+    "- **Output**: The function returns a list of tuples, each containing the similarity score and a boolean indicating whether the two samples come from the same speaker."
+   ]
+  },
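+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The cell below is an illustrative sketch of the similarity-and-threshold step only; it is not part of the `verify_speaker` API. Random tensors stand in for real speaker embeddings, and the 0.25 threshold mirrors the default described above."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Illustrative sketch only: the decision step of speaker verification,\n",
+    "# shown on random tensors standing in for speaker embeddings.\n",
+    "import torch\n",
+    "import torch.nn.functional as F\n",
+    "\n",
+    "# Dummy embeddings standing in for ECAPA-TDNN speaker embeddings (192-dimensional here)\n",
+    "embedding1 = torch.randn(1, 192)\n",
+    "embedding2 = torch.randn(1, 192)\n",
+    "\n",
+    "# Cosine similarity between the two embeddings (ranges from -1 to 1)\n",
+    "similarity = F.cosine_similarity(embedding1, embedding2).item()\n",
+    "\n",
+    "# Compare against the default decision threshold described above\n",
+    "threshold = 0.25\n",
+    "print(f\"Similarity: {similarity:.3f}, Same speaker: {similarity > threshold}\")"
+   ]
+  }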
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "senselab-_dRIpWVy-py3.10",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.10.11"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}

From 7c704968825f23e9410ddc9165ad62865a121bb1 Mon Sep 17 00:00:00 2001
From: ibevers
Date: Tue, 13 Aug 2024 14:58:59 -0400
Subject: [PATCH 2/2] Add doc.md

---
 .../audio/tasks/speaker_verification/doc.md | 60 +++++++++++++++++++
 1 file changed, 60 insertions(+)
 create mode 100644 src/senselab/audio/tasks/speaker_verification/doc.md

diff --git a/src/senselab/audio/tasks/speaker_verification/doc.md b/src/senselab/audio/tasks/speaker_verification/doc.md
new file mode 100644
index 00000000..b3c4eddb
--- /dev/null
+++ b/src/senselab/audio/tasks/speaker_verification/doc.md
@@ -0,0 +1,60 @@
+# Speaker Verification
+Last updated: 08/13/2024
+
+
+## Task Overview
+Speaker verification is identity authentication based on voice features.
+
+This technology is widely used in applications including security systems, authentication processes, and personalized user experiences. The core idea is to compare voice characteristics extracted from speech samples in order to verify the identity of the speaker.
+
+SenseLab speaker verification extracts audio embeddings, computes their cosine similarity, and applies a similarity threshold to decide whether two audio files come from the same speaker.
+
+## Models
+SenseLab speaker verification extracts audio embeddings using ECAPA-TDNN (`speechbrain/spkrec-ecapa-voxceleb`). The model is part of [SpeechBrain](https://huggingface.co/speechbrain), "an open-source and all-in-one conversational AI toolkit based on PyTorch," and was trained on [VoxCeleb1](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html) and [VoxCeleb2](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html), two celebrity voice datasets. Ensure that the audio samples used for verification have a sampling rate of 16 kHz, the rate that `speechbrain/spkrec-ecapa-voxceleb` was trained on. A minimal sketch of calling this model directly through SpeechBrain follows the list below.
+
+- **ECAPA-TDNN**
+  - [spkrec-ecapa-voxceleb](https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb)
+
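+The snippet below is a minimal sketch of using this model directly through SpeechBrain, following its Hugging Face model card; it is illustrative only, and SenseLab users should prefer the `verify_speaker` task, which wraps this model. The file paths are placeholders, and the import location may differ across SpeechBrain versions.
+
+```python
+from speechbrain.inference.speaker import SpeakerRecognition
+
+# Load the pre-trained ECAPA-TDNN verification model from the Hugging Face Hub
+verification = SpeakerRecognition.from_hparams(
+    source="speechbrain/spkrec-ecapa-voxceleb",
+    savedir="pretrained_models/spkrec-ecapa-voxceleb",
+)
+
+# Compare two 16 kHz recordings (placeholder paths)
+score, prediction = verification.verify_files("speaker1.wav", "speaker2.wav")
+print(score, prediction)
+```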
+
+## Evaluation
+### Metrics
+The primary evaluation metric for speaker verification is the Equal Error Rate ([EER](https://www.sciencedirect.com/topics/computer-science/equal-error-rate)), the error rate at the operating point where the false acceptance rate (FAR) and false rejection rate (FRR) are equal (a worked example of estimating EER appears at the end of this document).
+- **False Acceptance Rate (FAR)**: The probability that the verification system accepts an invalid input (an impostor). This metric also appears under other names in the literature.
+- **False Rejection Rate (FRR)**: The probability that the verification system rejects a valid input (a genuine speaker). This metric also appears under other names in the literature.
+
+Lower values on these metrics indicate better performance.
+
+### Datasets
+Common datasets used for evaluating speaker verification models include:
+- [VoxCeleb1](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html): A large-scale speaker verification dataset containing over 100,000 utterances from 1,251 celebrities.
+
+**Verification split**
+
+|                 | Dev        | Test     |
+|-----------------|------------|----------|
+| # of speakers   | 1,211      | 40       |
+| # of videos     | 21,819     | 677      |
+| # of utterances | 148,642    | 4,874    |
+
+**Identification split**
+
+|                 | Dev        | Test     |
+|-----------------|------------|----------|
+| # of speakers   | 1,251      | 1,251    |
+| # of videos     | 21,245     | 1,251    |
+| # of utterances | 145,265    | 8,251    |
+
+- [VoxCeleb2](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html): An extension of VoxCeleb1, with more speakers and more utterances, used for training and evaluation.
+
+|                 | Dev        | Test     |
+|-----------------|------------|----------|
+| # of speakers   | 5,994      | 118      |
+| # of videos     | 145,569    | 4,911    |
+| # of utterances | 1,092,009  | 36,237   |
+
+For more details on these datasets and the evaluation process, refer to the [VoxCeleb paper](https://arxiv.org/abs/1706.08612).
+
+### Benchmark
+See the [Papers With Code leaderboard](https://paperswithcode.com/sota/speaker-verification-on-voxceleb) for EER rankings of speaker verification models on VoxCeleb.
+
+## Notes
+- Fine-tuning the model on a specific dataset can further improve accuracy.
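+
+## Example: Estimating EER
+The sketch below illustrates how EER can be estimated from verification scores and ground-truth labels. It is a self-contained NumPy example, not part of SenseLab, and the scores and labels are made up for illustration.
+
+```python
+import numpy as np
+
+# Hypothetical similarity scores and ground-truth labels (1 = same speaker, 0 = different)
+scores = np.array([0.62, 0.21, 0.18, 0.55, 0.40, 0.33])
+labels = np.array([1, 1, 0, 1, 0, 0])
+
+# Sweep candidate thresholds and find the point where FAR and FRR are closest
+best_gap, eer = np.inf, None
+for threshold in np.sort(scores):
+    accepted = scores >= threshold
+    far = np.mean(accepted[labels == 0])   # impostor pairs accepted
+    frr = np.mean(~accepted[labels == 1])  # genuine pairs rejected
+    if abs(far - frr) < best_gap:
+        best_gap, eer = abs(far - frr), (far + frr) / 2
+
+print(f"Estimated EER: {eer:.2%}")
+```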