A curated list of speaker embedding/verification resources
- [01] Deep Speaker: an End-to-End Neural Speaker Embedding System, Baidu inc, 2017
- [02] Text-Independent Speaker Verification Using 3D Convolutional Neural Networks, 2017
- [03] Speaker Recognition from Raw Waveform with SincNet, Bengio team, raw waveform, 2018
- [04] VoxCeleb2: Deep Speaker Recognition VGG group, Interspeech 2018
- [05] Generalized End-to-End Loss for Speaker Verification, Google, ICASSP 2017
- [06] Voxceleb: Large-scale speaker verification in the wild,VGG group, 2019
- [07] Deep neural network embeddings for text-independent speaker verification, Interspeech 2017, original TDNN paper from Johns Hopkins , MFCC/frame-based/time-delay/multi-class, softmax + cross-entropy loss
- [08] X-vectors: Robust DNN Embeddings for Speaker Recognition, ICASSP 2018, the 'x-vector' paper from Johns Hopkins; based on the TDNN, improved by adding noise and reverberation for augmentation
- [09] Front-End Factor Analysis for Speaker Verification, IEEE TASLP 2011, the 'i-vector' paper by Dehak et al.
- [10] Time Delay Deep Neural Network-Based Universal Background Models for Speaker Recognition, 2015, the TDNN-UBM paper
- [11] Deep Neural Networks for Small Footprint Text-Dependent Speaker Verification, ICASSP 2014, the 'd-vector' paper from Google
- [12] Analysis of Score Normalization in Multilingual Speaker Recognition, Interspeech 2017, the S-norm paper, useful for score normalization
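The S-norm score normalization referenced in [12] is easy to sketch. Below is a minimal NumPy illustration (the function name and argument names are my own, not from the paper), assuming the raw trial score and the cohort scores for the enrollment and test sides have already been computed:

```python
import numpy as np

def s_norm(score, enroll_cohort_scores, test_cohort_scores):
    """Symmetric score normalization (S-norm): average of the raw score
    z-normalized against the enrollment-side cohort scores and against
    the test-side cohort scores."""
    e = np.asarray(enroll_cohort_scores, dtype=float)
    t = np.asarray(test_cohort_scores, dtype=float)
    z_norm = (score - e.mean()) / e.std()
    t_norm = (score - t.mean()) / t.std()
    return 0.5 * (z_norm + t_norm)
```

In practice the cohort scores come from scoring each utterance against a set of impostor utterances; adaptive variants keep only the top-scoring cohort members.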
Results reported (by the authors) on VoxCeleb1, VoxCeleb1-E and VoxCeleb1-H.
VoxCeleb1 public results (continuously updating...)
Name | feature, model, activation/loss | VoxCeleb1 | VoxCeleb1-E | VoxCeleb1-H | Link | Affiliation | Year |
---|---|---|---|---|---|---|---|
X205 | DPN68,Res2Net50 | 0.7712% | 0.8968% | 1.637% | report | AISpeech | 2020 |
Veridas | ResNet152 | 1.08% | - | - | report | das-nano | 2020 |
DKU-DukeECE | ResNet, ECAPA-TDNN | 0.888% | 1.133% | 2.008% | report | Duke University | 2020 |
IDLAB | ResNet, ECAPA-TDNN | - | - | - | report | Ghent University | 2020 |
speechbrain | ECAPA-TDNN | 0.69% | - | - | link | - | 2021 |
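The percentages in the table above are equal error rates (EER) on the VoxCeleb1 trial lists. A minimal NumPy sketch of how EER is computed from trial scores (the function name is illustrative, not from any listed toolkit):

```python
import numpy as np

def compute_eer(scores, labels):
    """Equal Error Rate: the operating point where the false-acceptance
    rate equals the false-rejection rate. `labels` is 1 for target
    (same-speaker) trials and 0 for non-target trials."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(scores)            # sweep thresholds in score order
    sorted_labels = labels[order]
    n_target = sorted_labels.sum()
    n_nontarget = len(sorted_labels) - n_target
    # Thresholding just above each sorted score: targets at or below the
    # threshold are false rejections; non-targets above it are false accepts.
    fr = np.cumsum(sorted_labels) / n_target
    fa = 1.0 - np.cumsum(1 - sorted_labels) / n_nontarget
    idx = np.argmin(np.abs(fa - fr))      # closest crossing of the two curves
    return 0.5 * (fa[idx] + fr[idx])
```

Toolkit implementations typically interpolate between the two nearest thresholds rather than taking the closest crossing, but the numbers agree closely on large trial lists.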
Commonly-used speaker datasets:
- TIMIT: a small dataset for speaker recognition and ASR, non-free
- Free ST: Mandarin speech corpus for speaker recognition and ASR, free
- NIST SRE: NIST Speaker Recognition Evaluation, non-free
- AIShell-1: Mandarin speech corpus, divided into train/dev/test, free.
- AIShell-2: free for education, non-free for commercial
- AIShell-3: free, for speaker, asr and tts
- AIShell-4: will be released soon
- HI-MIA: free, for far-field text-dependent speaker verification and keyword spotting
- SITW: Speakers in the Wild
- VoxCeleb 1&2: celebrity interview video/audio extracted from YouTube
- CN-Celeb 1&2: multi-genre speaker dataset in the wild; utterances are from Chinese celebrities
- VoxCeleb Speaker Recognition Challenge (VoxSRC 2019) report
- VoxCeleb Speaker Recognition Challenge (VoxSRC 2020)
- VoxCeleb Speaker Recognition Challenge (VoxSRC 2021)
- Short-duration Speaker Verification (SdSV) Challenge 2020
- Short-duration Speaker Verification (SdSV) Challenge 2021
- CTS Speaker Recognition Challenge 2020
- Far-Field Speaker Verification Challenge (FFSVC 2020)
- X-vectors: Neural Speech Embeddings for Speaker Recognition, Daniel Garcia-Romero, 2020
- 2020 Academic Symposium on Speaker Recognition Research and Applications (2020声纹识别研究与应用学术讨论会)
- VGGVox: the first baseline system for the VoxCeleb dataset, originally implemented in MATLAB.
- DeepSpeaker: an end-to-end neural speaker embedding system.
- SincNet, also available in SpeechBrain
- 3D CNN: TensorFlow implementation of 3D convolutional neural networks for speaker verification
- GE2E, an implementation is also available in TensorFlow
- asv-subtools: an open-source toolkit based on PyTorch and Kaldi for speaker recognition and language identification, from XMU Speech Lab.
- Resemblyzer, high-level representation of a voice through a deep learning model (referred to as the voice encoder).
- voxceleb audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube
- Triplet-loss Triplet Loss and Online Triplet Mining in TensorFlow.
- Res2Net: the Res2Net architecture commonly used in the VoxCeleb Speaker Recognition Challenge.
- voxceleb_trainer: a well-maintained speaker verification framework written in PyTorch, with pretrained models.
- Speechbrain Voxceleb recipe.
- kaldi Kaldi recipe for voxceleb.
- pytorch_xvectors PyTorch implementation of x-vectors.
- Attention back-end: compares PLDA and cosine scoring against a proposed attention-based back-end; models: TDNN, ResNet; data: CN-Celeb
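The cosine back-end that systems like the one above compare against is simply the cosine similarity between two fixed-length speaker embeddings. A minimal sketch (names are illustrative):

```python
import numpy as np

def cosine_score(enroll_emb, test_emb):
    """Cosine-similarity back-end: score a trial as the cosine of the
    angle between the enrollment and test speaker embeddings."""
    e = np.asarray(enroll_emb, dtype=float)
    t = np.asarray(test_emb, dtype=float)
    return float(e @ t / (np.linalg.norm(e) * np.linalg.norm(t)))
```

A threshold on this score (often after S-norm) then decides accept/reject; PLDA replaces this with a learned probabilistic scoring model.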
- Rank 1: FBank features, "r-vectors" using ResNet, AAM loss. From Brno University of Technology, REPORT
- Rank 2: 80-dim FBank features, E-TDNN/F-TDNN models, various classification losses including softmax/AM-softmax/PLDA-softmax. From Johns Hopkins University, REPORT
- Rank 3: FBank, ResNet + attentive pooling + phonetic attention, BLSTM + ResNet, loss unclear(?). From Microsoft, REPORT
- Rank 1: 60-dim log-FBank, ECAPA-TDNN/SE-ResNet34, S-norm, AAM-softmax. From IDLab, REPORT
- Rank 2: 40-dim mean-normalized FBank, no VAD, ResNet/Res2Net, S-norm, CM-softmax. From AISpeech, REPORT, Kaldi recipe for data augmentation
- Rank 3: Report not available
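Most of the TDNN/ECAPA-style systems above map variable-length frame features to a fixed-length utterance embedding via statistics pooling. A minimal sketch of that pooling step (illustrative only; real systems apply it to learned frame-level activations, often with attention weights):

```python
import numpy as np

def stats_pooling(frame_features):
    """Statistics pooling as used in x-vector/ECAPA-style systems:
    concatenate the per-dimension mean and standard deviation of the
    frame-level features into one utterance-level vector."""
    x = np.asarray(frame_features, dtype=float)   # shape: (frames, dims)
    return np.concatenate([x.mean(axis=0), x.std(axis=0)])
```

The pooled vector is then passed through fully connected layers, and an intermediate layer's activations are taken as the speaker embedding.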
Please let me know if your code/repo is not listed here (ranchlai at 163.com)