This repository contains the code for "Contrastive Semantic Alignment for Speech Referring Expression Comprehension (CSRef)".
- Download the speech referring expressions, the speech encoder weights from the Contrastive Semantic Alignment (CSA) stage, and the pre-processing annotations JSON file to the data folder, following the paths used in Google Drive
- Download and unzip the LibriSpeech ASR dataset for CSA pre-training to the data/audios/ folder
- Download and unzip the train2014 images from COCO to the data/images folder
- Download bert-base-uncased and wav2vec2-base from HuggingFace to the data/weights/ folder
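Before training, it can help to confirm that the download steps above left the folders where the code expects them. The following is a minimal sketch, not part of the repository: the helper name `missing_data_dirs` is hypothetical, and the only paths it checks are the three folder names listed above (the files inside each folder are not verified).

```python
import os

# Folder names taken from the download steps above; their contents are not checked.
EXPECTED_DIRS = ("data/audios", "data/images", "data/weights")


def missing_data_dirs(repo_root="."):
    """Return the expected data folders that do not exist under repo_root.

    Hypothetical helper for illustration only.
    """
    return [
        rel for rel in EXPECTED_DIRS
        if not os.path.isdir(os.path.join(repo_root, rel))
    ]


if __name__ == "__main__":
    missing = missing_data_dirs(".")
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("All expected data folders are present.")
```

Run it from the repository root; an empty result means all three folders exist.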
- Clone this repo
- Create a conda virtual environment and activate it
conda create -n csref python=3.7.16
- Install PyTorch
- Install the other packages listed in requirements.txt
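Once the environment is set up, a quick sanity check can confirm that the active interpreter matches the Python 3.7 version requested above and that PyTorch is importable. This is a hedged sketch using only the standard library; the function names `python_matches` and `is_installed` are hypothetical and do not come from the repository.

```python
import importlib.util
import sys


def python_matches(expected=(3, 7), version_info=None):
    """Return True if the interpreter's major.minor version equals `expected`."""
    info = sys.version_info if version_info is None else version_info
    return tuple(info[:2]) == tuple(expected)


def is_installed(module_name):
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("Python 3.7:", python_matches())
    print("torch installed:", is_installed("torch"))
```

Run it inside the activated csref environment; both lines should report True before you start training.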
CUDA_VISIBLE_DEVICES=1,2,3,4 PORT=23450 bash tools/train_CSA.sh configs/csref_CSA_librispeech.py 4
CUDA_VISIBLE_DEVICES=0 PORT=23450 bash tools/train_speech.sh configs/csref_refcoco+_speech.py 1
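The two commands above share one pattern: a list of visible GPUs, a port, a launch script, a config file, and a trailing number. The sketch below assembles that pattern programmatically. It assumes, based only on the two examples shown, that the trailing argument is the number of GPUs; the helper `build_launch` is hypothetical and not part of the repository.

```python
import os


def build_launch(script, config, gpu_ids, port=23450):
    """Return (env, argv) for a launch shaped like the commands above.

    Assumption: the trailing positional argument is the GPU count,
    as suggested by the two example commands.
    """
    if not gpu_ids:
        raise ValueError("gpu_ids must not be empty")
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    env["PORT"] = str(port)
    # Deriving the count from gpu_ids keeps it consistent with the visible devices.
    argv = ["bash", script, config, str(len(gpu_ids))]
    return env, argv


env, argv = build_launch(
    "tools/train_CSA.sh", "configs/csref_CSA_librispeech.py", [1, 2, 3, 4]
)
print(env["CUDA_VISIBLE_DEVICES"], " ".join(argv))
```

To actually start a run, the result could be passed to `subprocess.run(argv, env=env, check=True)` from the repository root. If two jobs run on the same machine at once, giving each a different `port` is a reasonable precaution.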
Thanks to the following repos for their great work: