Implementation of the NAACL 2021 paper: DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization, by Zineng Tang, Jie Lei, Mohit Bansal.
```
# Create Python environment (optional)
conda create -n decembert python=3.7

# Install Python dependencies
pip install -r requirements.txt
```
To speed up training, mixed-precision training via NVIDIA Apex is recommended:
```
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```
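Once Apex is installed, a training loop typically wraps the model and optimizer with `amp.initialize`. The sketch below is illustrative only, using a placeholder model and optimizer rather than this repo's actual training code:

```python
import torch
from apex import amp

# Placeholder model and optimizer; the repo's actual training loop differs.
model = torch.nn.Linear(768, 768).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# opt_level="O1" runs eligible ops in fp16 while keeping fp32 master weights.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

loss = model(torch.randn(8, 768).cuda()).mean()  # placeholder loss
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()  # backward on the loss, scaled to avoid fp16 underflow
optimizer.step()
```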
Run the pre-training command, passing the list of GPU ids to train on:

```
bash scripts/pretrain.sh 0,1,2,3
```
The feature extraction scripts are provided in the `feature_extractor` folder.
We extract 2D-level video features with ResNet-152 (GitHub link: torchvision); a sketch of this step follows below.
We extract 3D-level video features with 3D-ResNeXt (GitHub link: 3D-ResNeXt).
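As a rough illustration of the 2D branch (not the repo's exact script), per-frame features can be taken from torchvision's ResNet-152 with the classification head removed; the 3D-ResNeXt branch follows the same pattern with a video backbone. `extract_2d_features` is a hypothetical helper name:

```python
import torch
import torchvision.models as models

# Pretrained ResNet-152 with the final fc layer dropped, so the forward
# pass ends at global average pooling and yields 2048-d frame features.
resnet = models.resnet152(pretrained=True).eval().cuda()
backbone = torch.nn.Sequential(*list(resnet.children())[:-1])

@torch.no_grad()
def extract_2d_features(frames):
    """frames: (num_frames, 3, 224, 224) normalized RGB tensor."""
    feats = backbone(frames.cuda())  # (num_frames, 2048, 1, 1)
    return feats.flatten(1).cpu()    # (num_frames, 2048)
```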
Following the dense-caption-aided pre-training described in the paper, we pre-extract dense captions with the code from the original GitHub link: Dense Captioning with Joint Inference and Visual Context (PyTorch reproduction).
An important TODO is to adjust the frame-rate sampling in the code according to the video type; a sketch of such sampling follows.
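For instance, frame-rate sampling can be made configurable per video type along these lines; `sample_frames` and `target_fps` are hypothetical names, not identifiers from this repo:

```python
import cv2

def sample_frames(video_path, target_fps=1.0):
    """Keep roughly target_fps frames per second from a video (hypothetical helper)."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if metadata is missing
    step = max(1, round(native_fps / target_fps))   # keep every `step`-th frame
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)  # BGR uint8 frame
        idx += 1
    cap.release()
    return frames
```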
Downstream task: MSRVTT
(TODO: add downstream tasks)
```
@inproceedings{tang2021decembert,
  title={DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization},
  author={Tang, Zineng and Lei, Jie and Bansal, Mohit},
  booktitle={Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
  pages={2415--2426},
  year={2021}
}
```
Part of the code is built on Hugging Face Transformers, Facebook FAISS, and TVCaption.