Query by example spoken term detection using bottleneck features and a convolutional neural network

This project is a modification of the CNN-based QbE-STD method proposed in Ram, Miculicich, and Bourlard (2018), with code available at https://github.com/idiap/CNN_QbE_STD. This implementation uses bottleneck features instead of the phone posteriors used in the original design.

Bottleneck features were extracted using the Shennong speech features library, which provides a Python API for the BUT/Phonexia bottleneck feature extractor trained on 17 languages from the IARPA Babel dataset (Cantonese, Pashto, Turkish, Tagalog, Vietnamese, Assamese, Bengali, Haitian Creole, Lao, Tamil, Zulu, Kurdish, Tok Pisin, Cebuano, Kazakh, Telugu, Lithuanian).

Usage

To run training or inference, supply a config file with the required information. Two sample configuration files are included in data/sws2013-sample (see Sample files below).

python src/run.py <config_file>

Requirements: Python 3.7, PyTorch 1.6.0, etc.* (*experiments were run on Genesis Cloud instances with Python, PyTorch, and CUDA pre-configured on launch).

Sample files

A set of sample files from the Spoken Web Search 2013 (SWS2013) corpus is included in the data/sws2013-sample directory. The full SWS2013 database (20+ hours of audio) is too big to include here. To run the full training process, get the original data from the BUT Speech@FIT website and extract features with the script provided in src/extract_bnf.py.

data/sws2013-sample
├── train_queries     <- Queries for training
├── test_queries      <- Queries for development/testing (same for sample data)
├── Audio             <- Corpus in which dev/test queries are searched for
├── train_labels.csv  <- Ground truth for train_queries
│                        (1 = query occurs in reference, 0 = otherwise)
├── dev_labels.csv    <- Ground truth for dev/test queries
├── train_config.yaml <- Config file to run training on sws2013-sample data
└── test_config.yaml  <- Config file to run inference on sws2013-sample data
                         using model saved at the 60th epoch from training process

Running python src/run.py data/sws2013-sample/train_config.yaml followed by python src/run.py data/sws2013-sample/test_config.yaml should produce output like the sample process shown below:

[Image: Sample process]

Evaluation

The primary outputs of run.py are CSV files like the one shown below, where the query, reference, and label columns are those given in the ground truth label files (e.g. train_labels.csv) and the prediction column is the output of the CNN. CSV files generated by the training process have an additional epoch column.

Query               Reference      Label  Prediction
sws2013_dev_221_07  sws2013_04169  0      0.68009675
sws2013_dev_391_06  sws2013_03545  1      0.83309245
...                 ...            ...    ...
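
As a rough illustration of how such a file might be consumed, the sketch below parses the two example rows and turns each prediction into a binary detection decision at a chosen threshold. The lowercase header names, the helper names load_predictions and decide, and the threshold of 0.7 are assumptions made for this example; check the header of the files that run.py actually writes before reusing it.

```python
import csv
import io

# Two rows copied from the example output; the header names are assumed.
SAMPLE = """query,reference,label,prediction
sws2013_dev_221_07,sws2013_04169,0,0.68009675
sws2013_dev_391_06,sws2013_03545,1,0.83309245
"""


def load_predictions(handle):
    """Read (query, reference, label, prediction) tuples from an open CSV file."""
    rows = []
    for rec in csv.DictReader(handle):
        rows.append(
            (rec["query"], rec["reference"], int(rec["label"]), float(rec["prediction"]))
        )
    return rows


def decide(rows, threshold):
    """Mark a query/reference pair as detected when its score reaches the threshold."""
    return [(q, r, label, score >= threshold) for q, r, label, score in rows]


rows = load_predictions(io.StringIO(SAMPLE))
decisions = decide(rows, threshold=0.7)  # 0.7 is an arbitrary example value
for query, reference, label, detected in decisions:
    print(query, reference, label, detected)
```

To read a real output file, pass an open file handle (for example open(path, newline="")) to load_predictions instead of the in-memory sample.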

Following the original SWS2013 evaluation metrics, which are also reported in Ram, Miculicich, and Bourlard (2018), the Maximum Term Weighted Value (MTWV) is used as the primary evaluation metric. To compute the Actual and Maximum Term Weighted Values, use the script provided in src/mtwv.R. The script writes CSV data to stdout listing the ATWV at each threshold, sorted from the largest ATWV (i.e. the MTWV) to the smallest. Average precision and average recall are included as secondary metrics.

Threshold  ATWV   Average Precision  Average Recall
0.70       0.683  0.0419             0.689
0.75       0.652  0.0449             0.657
...        ...    ...                ...
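
For readers unfamiliar with the metric, the following is a minimal Python sketch of the general term weighted value idea: for each query, the miss rate and the false alarm rate (weighted by a factor beta) are subtracted from 1, the result is averaged over queries, and the MTWV is the best ATWV over the thresholds tried. It is not a port of src/mtwv.R. The function names, the per-query definition of the false alarm rate (false alarms divided by the number of non-matching references), and the default beta of 1.0 are all assumptions for illustration; use the beta defined by the evaluation plan you are following.

```python
from collections import defaultdict


def atwv(rows, threshold, beta=1.0):
    """Term weighted value at one threshold.

    rows: iterable of (query, label, prediction) tuples, label in {0, 1}.
    beta: weight on false alarms (placeholder default, not the SWS2013 value).
    """
    stats = defaultdict(lambda: {"hit": 0, "true": 0, "fa": 0, "non": 0})
    for query, label, score in rows:
        s = stats[query]
        detected = score >= threshold
        if label == 1:
            s["true"] += 1
            s["hit"] += detected
        else:
            s["non"] += 1
            s["fa"] += detected

    losses = []
    for s in stats.values():
        if s["true"] == 0:
            continue  # miss rate is undefined for queries with no true occurrence
        p_miss = 1 - s["hit"] / s["true"]
        p_fa = s["fa"] / s["non"] if s["non"] else 0.0
        losses.append(p_miss + beta * p_fa)
    if not losses:
        raise ValueError("no query has a true occurrence")
    return 1 - sum(losses) / len(losses)


def mtwv(rows, thresholds, beta=1.0):
    """Return (best ATWV, threshold at which it is reached)."""
    rows = list(rows)
    return max((atwv(rows, t, beta), t) for t in thresholds)


# Toy data, invented for this example: (query, label, prediction)
toy = [
    ("q1", 1, 0.9), ("q1", 0, 0.2), ("q1", 0, 0.8),
    ("q2", 1, 0.4), ("q2", 0, 0.1),
]
best_value, best_threshold = mtwv(toy, thresholds=[0.3, 0.5, 0.85])
print(best_value, best_threshold)
```

Sweeping a finer grid of thresholds gives a table of the same shape as the one above, with one ATWV per threshold.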