This repository provides an audio data synthesis tool for the AudioDiffCaps dataset and its captions.
Please consider citing our paper if you find this repository useful in your work.
@inproceedings{takeuchi2023audiodiffcaps,
author = "Takeuchi, Daiki and Ohishi, Yasunori and Niizumi, Daisuke and Harada, Noboru and Kashino, Kunio”,
title = "Audio Difference Captioning Utilizing Similarity-Discrepancy Disentanglement",
booktitle = "Proceedings of the 8th Detection and Classification of Acoustic Scenes and Events 2023 Workshop (DCASE2023)”,
address = “Tampere, Finland”,
month = “September”,
year = "2023”,
}
The AudioDiffCaps dataset consists of (i) pairs of similar but slightly different audio clips and (ii) human-annotated descriptions of their differences. The pairs of audio clips were artificially synthesized by mixing foreground event sounds with background sounds taken from existing environmental sound datasets (FSD50K and ESC-50) using the Scaper library for soundscape synthesis and augmentation.
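For reference, Scaper synthesizes a single soundscape roughly as follows. This is an illustrative sketch with placeholder folders, labels, and parameter ranges, not the repository's actual synthesis code; the scripts described below generate pairs of such clips with small controlled differences.

import scaper

# Placeholder folders: each must contain one subdirectory per sound label.
sc = scaper.Scaper(duration=10.0, fg_path="foreground", bg_path="background")
sc.ref_db = -50

# Background sound (e.g., rain or traffic ambience).
sc.add_background(label=("const", "rain"),
                  source_file=("choose", []),
                  source_time=("const", 0))

# A foreground event mixed on top of the background at a sampled onset time and SNR.
sc.add_event(label=("choose", []),
             source_file=("choose", []),
             source_time=("const", 0),
             event_time=("uniform", 0, 8),
             event_duration=("const", 2),
             snr=("uniform", 0, 20),
             pitch_shift=None,
             time_stretch=None)

# Write the mixed audio and its annotation.
sc.generate("example.wav", "example.jams")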
Install the dependencies listed in requirements.txt; these are the modules required to run the tools in this repository.
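For example, using pip:

pip install -r requirements.txt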
You can download the FSD50K and ESC-50 datasets from the following URLs.
After downloading, update the two variables in utils.py (FSD50K and ESC50) so that they point to the dataset locations in your environment.
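For example, the variables can be set to the root directories of the extracted datasets (the paths below are placeholders; replace them with your own):

# in utils.py -- example paths, replace with your local dataset locations
FSD50K = "/path/to/FSD50K"
ESC50 = "/path/to/ESC-50"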
Run the following command to prepare the audio files for synthesis.
python preprocess_org_audio.py
There are two scenes (rain and traffic) and two splits (dev and eval) in this dataset. The audio files for each scene and split are generated with the following commands.
Rain_dev
python synthesize_audio.py -d datasets/adc_rain/dev
Rain_eval
python synthesize_audio.py -d datasets/adc_rain/eval
Traffic_dev
python synthesize_audio.py -d datasets/adc_traffic/dev
Traffic_eval
python synthesize_audio.py -d datasets/adc_traffic/eval
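To generate all four scene/split combinations in one go, a simple shell loop over the directories above can be used (a convenience sketch, not a script provided by this repository):

for scene in adc_rain adc_traffic; do
    for split in dev eval; do
        python synthesize_audio.py -d datasets/${scene}/${split}
    done
done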
Please check LICENSE.pdf for details.
- E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra, “FSD50K: an open dataset of human-labeled sound events,” arXiv preprint arXiv:2010.00475, 2020.
- K. J. Piczak, “ESC: Dataset for Environmental Sound Classification,” in Proc. 23rd Annual ACM Conf. Multimedia, 2015, pp. 1015–1018.
- J. Salamon, D. MacConnell, M. Cartwright, P. Li, and J. P. Bello, “Scaper: A library for soundscape synthesis and augmentation,” in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust. (WASPAA). IEEE, 2017, pp. 344–348.