AudioDiffCaps

This repository provides the audio synthesis tool and the captions for the AudioDiffCaps dataset. The dataset consists of (i) pairs of similar but slightly different audio clips and (ii) human-annotated descriptions of their differences.

Please consider citing our paper if you find this repository useful in your work.

@inproceedings{takeuchi2023audiodiffcaps,
    author = "Takeuchi, Daiki and Ohishi, Yasunori and Niizumi, Daisuke and Harada, Noboru and Kashino, Kunio",
    title = "Audio Difference Captioning Utilizing Similarity-Discrepancy Disentanglement",
    booktitle = "Proceedings of the 8th Detection and Classification of Acoustic Scenes and Events 2023 Workshop (DCASE2023)",
    address = "Tampere, Finland",
    month = "September",
    year = "2023",
}

What is the AudioDiffCaps dataset?

The AudioDiffCaps dataset consists of (i) pairs of similar but slightly different audio clips and (ii) human-annotated descriptions of their differences. The audio clip pairs were synthesized by mixing foreground event sounds with background sounds taken from existing environmental sound datasets (FSD50K and ESC-50), using the Scaper library for soundscape synthesis and augmentation.
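To make the idea of a "similar but slightly different" pair concrete, here is a minimal sketch that mixes the same foreground event into the same background at two different levels. It is not the repository's code and does not use Scaper; the function names, the toy sine-wave signals, and the choice of loudness as the difference are all assumptions made for illustration.

```python
import math


def rms(samples):
    """Root-mean-square level of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_event(background, event, onset, snr_db):
    """Return a copy of `background` with `event` added at sample index `onset`.

    The event is scaled so that its RMS level sits `snr_db` decibels
    above the RMS level of the whole background.
    """
    gain = rms(background) * 10 ** (snr_db / 20) / rms(event)
    mixed = list(background)
    for i, s in enumerate(event):
        if onset + i < len(mixed):
            mixed[onset + i] += gain * s
    return mixed


# Toy signals standing in for real recordings (hypothetical data).
background = [0.1 * math.sin(0.05 * n) for n in range(1000)]
event = [math.sin(0.5 * n) for n in range(200)]

# A pair sharing background and event, differing only in event loudness.
clip_a = mix_event(background, event, onset=300, snr_db=0.0)
clip_b = mix_event(background, event, onset=300, snr_db=6.0)
```

Outside the 200-sample event region the two clips are identical, so a caption for this toy pair would only need to describe the louder event in the second clip.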

Getting Started

Install the packages listed in requirements.txt. This installs the modules needed to run the tools in this repository.

Step 0: Download FSD50K and ESC-50.

You can download them from the following URLs

After downloading, edit the two variables in utils.py (FSD50K and ESC50) to match your environment.
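As a rough illustration of this step, the sketch below sets the two variables and checks that they point at existing directories before any processing starts. Only the variable names FSD50K and ESC50 come from this README; the placeholder paths, the `missing_dirs` helper, and the use of `pathlib.Path` are assumptions, and utils.py may expect plain strings instead.

```python
from pathlib import Path

# Placeholder locations (assumption); replace with where you extracted the datasets.
FSD50K = Path("/path/to/FSD50K")
ESC50 = Path("/path/to/ESC-50")


def missing_dirs(named_paths):
    """Return the names whose paths are not existing directories."""
    return [name for name, path in named_paths.items() if not Path(path).is_dir()]


missing = missing_dirs({"FSD50K": FSD50K, "ESC50": ESC50})
if missing:
    print("Dataset directories not found:", ", ".join(missing))
```

Running a check like this before Step 1 surfaces a wrong path immediately rather than partway through preprocessing.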

Step 1: Preprocess audio files

Run the following command to prepare the audio files for synthesis:

python preprocess_org_audio.py

Step 2: Synthesize audio files

This dataset has two scenes (rain and traffic) and two splits (dev and eval). Generate the audio files for each scene and split with the following commands.

Rain_dev

python synthesize_audio.py -d datasets/adc_rain/dev

Rain_eval

python synthesize_audio.py -d datasets/adc_rain/eval

Traffic_dev

python synthesize_audio.py -d datasets/adc_traffic/dev

Traffic_eval

python synthesize_audio.py -d datasets/adc_traffic/eval
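The four commands above differ only in scene and split, so they can be generated in a loop. The following is a small sketch of a driver script, not part of the repository. It assumes the synthesis script file is named synthesize_audio.py and is run from the repository root; `build_commands` and `run_all` are hypothetical helper names.

```python
import subprocess
import sys
from itertools import product

SCENES = ("rain", "traffic")
SPLITS = ("dev", "eval")
SCRIPT = "synthesize_audio.py"  # assumed file name of the synthesis script


def build_commands(script=SCRIPT):
    """Build one argv list per scene/split combination."""
    return [
        [sys.executable, script, "-d", f"datasets/adc_{scene}/{split}"]
        for scene, split in product(SCENES, SPLITS)
    ]


def run_all(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first failing synthesis run.
            subprocess.run(cmd, check=True)
```

Calling `run_all()` only prints the commands; `run_all(dry_run=False)` executes them in the order rain/dev, rain/eval, traffic/dev, traffic/eval.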

License

See LICENSE.pdf for details.

References

  • E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra, “FSD50K: an open dataset of human-labeled sound events,” arXiv preprint arXiv:2010.00475, 2020.
  • K. J. Piczak, “ESC: Dataset for Environmental Sound Classification,” in Proc. 23rd Annual ACM Conf. Multimedia, 2015, pp. 1015–1018.
  • J. Salamon, D. MacConnell, M. Cartwright, P. Li, and J. P. Bello, “Scaper: A library for soundscape synthesis and augmentation,” in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust. (WASPAA). IEEE, 2017, pp. 344–348.
