This repo supports data pre-processing, training, and evaluation on both the Ego4D-NLQ and MAD benchmarks. TL;DR: CONE (see the overview below) tackles the emerging and challenging problem of video temporal grounding (VTG) in long-form videos. It is a plug-and-play framework on top of existing VTG models that handles long videos through a sliding-window mechanism, and a coarse-to-fine alignment framework built on a pipeline of {window slicing and selection, proposal generation and ranking}.
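A minimal, illustrative sketch of that pipeline is shown below. This is not the actual CONE code: the scoring and proposal-generation steps are simple stand-ins, assuming video_feats is a (num_clips, dim) feature array and query_feat a (dim,) query embedding.

```python
# Illustrative sketch of the coarse-to-fine pipeline; not the actual CONE code.
# The scoring and proposal steps are simple stand-ins that only show the control
# flow of {window slicing and selection, proposal generation and ranking}.
import numpy as np

def retrieve_moments(video_feats, query_feat, window_length, topk_window, num_queries):
    # 1. Window slicing: cut the long feature sequence into overlapping windows.
    stride = max(1, window_length // 2)
    starts = range(0, max(1, len(video_feats) - window_length + 1), stride)
    windows = [(s, video_feats[s:s + window_length]) for s in starts]

    # 2. Window selection (coarse): keep the top-k windows whose mean-pooled
    #    feature best matches the query, so the fine stage stays tractable.
    windows.sort(key=lambda w: float(w[1].mean(axis=0) @ query_feat), reverse=True)
    candidates = windows[:topk_window]

    # 3. Proposal generation (fine): each selected window yields a few candidate
    #    moments (stand-in: equal splits; CONE uses DETR-style moment queries).
    proposals = []
    for start, feats in candidates:
        edges = np.linspace(0, len(feats), num_queries + 1, dtype=int)
        for lo, hi in zip(edges[:-1], edges[1:]):
            if hi <= lo:
                continue
            score = float(feats[lo:hi].mean(axis=0) @ query_feat)
            proposals.append((start + lo, start + hi, score))

    # 4. Proposal ranking: return all candidate moments sorted by matching score.
    return sorted(proposals, key=lambda p: p[2], reverse=True)
```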
- [2023.5.30] We release the CONE codebase.
- [2022.10.24] The CONE model won 3rd place in the NLQ @ Ego4D Challenge 2022, hosted by ECCV 2022. Please refer to Ego4D-NLQ_ECCV_2022_workshop for detailed information.
- Follow INSTALL.md for installing necessary dependencies and compiling the code.
- Download full Ego4D-NLQ data Ego4D-NLQ (8.29GB).
- Download partial MAD data MAD (6.5GB). We CANNOT share the MAD visual features at this moment; please request access to the MAD dataset from the official resource MAD github.
- We provide the feature extraction and file pre-processing procedures for both benchmarks in detail; please refer to Feature_Extraction_MD.
- If you unzip the Ego4D-NLQ data, the extracted folder structure should look like
This folder
└───offline_extracted_features/
│ └───egovlp_video_feature_1.875fps.tar.gz
│ └───egovlp_text_cls_feature.tar.gz
│ └───egovlp_text_token_feature.tar.gz
│ └───...
└───offline_lmdb/
│ └───egovlp_video_feature_1.875fps/
│ └───egovlp_egovlp_text_features/
│ └───...
└───data/
│ └───ego4d_ori_data/
│ │ └───nlq_train.json
│ │ └───...
│ └───ego4d_data/
│ │ └───train.jsonl
│ │ └───...
└───one_training_sample/
│ └───tensorboard_log/
│ └───inference_ego4d_val_top20_nms_0.5_preds.txt
│ └───inference_ego4d_val_top20_nms_0.5_preds.json
│ └───...
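The offline_lmdb/ folders hold the pre-extracted features. Purely as an illustration (the key naming and value serialization below are assumptions; the repo's data-loading code is the reference), reading one video feature might look like:

```python
# Illustrative only: read one pre-extracted video feature from offline_lmdb/.
# The key naming (clip uid) and value serialization (np.save bytes) are
# assumptions; check the repo's data-loading code for the exact format.
import io
import lmdb
import numpy as np

env = lmdb.open("offline_lmdb/egovlp_video_feature_1.875fps", readonly=True, lock=False)
with env.begin() as txn:
    raw = txn.get("94cdabf3-c078-4ad4-a3a1-c42c8fc3f4ad".encode())  # assumed key: clip uid
    feats = np.load(io.BytesIO(raw), allow_pickle=True)             # assumed np.save serialization
print(feats.shape)  # roughly (num_clips at 1.875 fps, feature_dim)
```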
Note that our default base model is Moment-DETR. We also release the code with 2D-TAN as the base model; please refer to the 2D-TAN README.
Please refer to Ego4d-NLQ_ECCV_2022_workshop for detailed information about our submission for Ego4D ECCV 2022 Challenge.
Training can be launched by running the following command:
bash cone/scripts/train_ego4d.sh CUDA_DEVICE_ID NUM_QUERIES WINDOW_LENGTH ADAPTER
- CUDA_DEVICE_ID is the CUDA device id.
- NUM_QUERIES is the number of moment queries (default: 5).
- WINDOW_LENGTH is the number of visual features inside one video window.
- ADAPTER is the model type string for the visual adapter module, one of linear and none.
The checkpoints and other experiment log files will be written into cone_results. For training under different settings, you can append additional command-line flags to the command above.
For more configurable options, please check our config file cone/config.py.
The actual command used in the experiments is
bash cone/scripts/train_ego4d.sh 0 5 90 linear
In addition, we empirically find that performance increases when the textual token feature extractor is replaced by CLIP or RoBERTa, so we recommend using CLIP or RoBERTa token features via the following commands (an illustrative token-feature snippet follows them):
bash cone/scripts/train_ego4d_clip.sh 0 5 90 linear
bash cone/scripts/train_ego4d_roberta.sh 0 5 90 linear
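Purely as an illustration of what RoBERTa token-level features are (the repo's own extraction scripts, see Feature_Extraction_MD, are the reference), such features can be obtained with Hugging Face transformers roughly as follows:

```python
# Illustration of RoBERTa token-level features with Hugging Face transformers;
# the repo's own extraction pipeline (see Feature_Extraction_MD) may differ.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base").eval()

query = "Did I wash the green pepper?"
inputs = tokenizer(query, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_features = outputs.last_hidden_state  # (1, num_tokens, 768) per-token features
```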
Once the model is trained, you can use the following commands for inference:
bash cone/scripts/inference_ego4d.sh CUDA_DEVICE_ID CHECKPOINT_PATH EVAL_ID --nms_thd 0.5 --topk_window 20
bash cone/scripts/inference_ego4d_test.sh CUDA_DEVICE_ID CHECKPOINT_PATH EVAL_ID --nms_thd 0.5 --topk_window 20
where CUDA_DEVICE_ID is the CUDA device id, CHECKPOINT_PATH is the path to the saved checkpoint, and EVAL_ID is a name string for the evaluation run.
We adopt Non-Maximum Suppression (NMS) with a threshold of 0.5 and set the number of pre-filtered windows to 20.
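For reference, here is a generic sketch of what the --nms_thd 0.5 step does (the repo ships its own implementation; this is only illustrative):

```python
# Generic 1D temporal NMS sketch (the repo has its own implementation;
# this only illustrates what --nms_thd 0.5 does).
def temporal_iou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def temporal_nms(moments, scores, iou_thd=0.5):
    """Keep the highest-scoring moments, dropping any overlapping a kept one above iou_thd."""
    order = sorted(range(len(moments)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(temporal_iou(moments[i], moments[j]) < iou_thd for j in keep):
            keep.append(i)
    return keep
```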
- The results (Recall@K at IoU=0.3 or 0.5) on the val set should be similar to the numbers from the main paper shown in the table below.
Method \ Metric | R@1 IoU=0.3 | R@5 IoU=0.3 | R@1 IoU=0.5 | R@5 IoU=0.5 |
---|---|---|---|---|
CONE | 14.15 | 30.33 | 8.18 | 18.02 |
In addition, we provide our experiment log files Ego4D-NLQ-Training-Sample (24MB).
Note that we run inference on 3874 queries in the validation split, whereas NaQ removes zero-duration ground-truth queries and evaluates on 3529 queries. The performance of CONE would be higher (i.e., multiplied by 3874/3529 ≈ 1.098) if we used the same validation split as NaQ.
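A quick back-of-the-envelope conversion of, e.g., the R@1 IoU=0.3 number above onto NaQ's 3529-query split:

```python
factor = 3874 / 3529                 # ≈ 1.098, as noted above
r1_iou03 = 14.15                     # R@1 IoU=0.3 from the table above
print(round(r1_iou03 * factor, 2))   # ≈ 15.53 on the 3529-query split
```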
Training can be launched by running the following command:
bash cone/scripts/train_mad.sh CUDA_DEVICE_ID NUM_QUERIES WINDOW_LENGTH ADAPTER
- CUDA_DEVICE_ID is the CUDA device id.
- NUM_QUERIES is the number of moment queries (default: 5).
- WINDOW_LENGTH is the number of visual features inside one video window.
- ADAPTER is the model type string for the visual adapter module, one of linear and none.
The actual commands used in the experiments are
bash cone/scripts/train_mad.sh 0 5 125 linear
bash cone/scripts/train_mad.sh 0 5 125 none --no_adapter_loss
Once the model is trained, you can use the following commands for inference:
bash cone/scripts/inference_mad.sh CUDA_DEVICE_ID CHECKPOINT_PATH EVAL_ID --nms_thd 0.5 --topk_window 30
bash cone/scripts/inference_mad_test.sh CUDA_DEVICE_ID CHECKPOINT_PATH EVAL_ID --nms_thd 0.5 --topk_window 30
where CUDA_DEVICE_ID is the CUDA device id, CHECKPOINT_PATH is the path to the saved checkpoint, and EVAL_ID is a name string for the evaluation run.
We adopt Non-Maximum Suppression (NMS) with a threshold of 0.5 and set the number of pre-filtered windows to 30.
- The results (Recall@K at IoU=0.3) should be similar to the numbers from the main paper shown in the table below.
Method \ R@K | 1 | 5 | 10 | 50 |
---|---|---|---|---|
CONE (val) | 6.73 | 15.20 | 20.07 | 32.09 |
CONE (test) | 6.87 | 16.11 | 21.53 | 34.73 |
In addition, we provide the experiment log files MAD-Training-Sample (370MB).
You may also want to run the CONE model on your own videos and queries. Currently, it supports moment retrieval on first-person videos with the EgoVLP video feature extractor. For third-person videos, the video/text feature extractors should be replaced with CLIP for better performance (see the sketch after the example output below).
Create a ckpt folder and place the two weight files Egovlp.pth and model_best.ckpt into it.
Create an example folder and place one Ego4D example video with the uid "94cdabf3-c078-4ad4-a3a1-c42c8fc3f4ad" into it.
Install some additional dependencies
pip install transformers easydict decord
pip install einops timm
pip install pytorchvideo
Run the example provided in this repo:
python run_on_video/run.py
The output will look like the following:
Build models...
Loading feature extractors...
Loading EgoVLP models
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Loading trained Moment-DETR model...
Loading CONE models
Run prediction...
video_name: 94cdabf3-c078-4ad4-a3a1-c42c8fc3f4ad
text_query: Did I wash the green pepper?
-----------------------------prediction------------------------------------
Rank 1, moment boundary in seconds: 87.461 103.1118, score: 1.9370151082074316
Rank 2, moment boundary in seconds: 350.099 360.7614, score: 1.9304019422713785
Rank 3, moment boundary in seconds: 95.9942 101.9733, score: 1.9060271403367295
Rank 4, moment boundary in seconds: 275.3885 286.7189, score: 1.8871944230965596
Rank 5, moment boundary in seconds: 384.3145 393.3277, score: 1.701088363940821
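As noted above, third-person videos are better served by CLIP features. A minimal, illustrative sketch of extracting frame and text features with Hugging Face transformers is shown below; the frame path is a placeholder, and the repo's own CLIP extraction scripts (see Feature_Extraction_MD) may differ.

```python
# Illustrative CLIP frame/text feature extraction with Hugging Face transformers;
# the repo's own extraction scripts (see Feature_Extraction_MD) may differ.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frames = [Image.open("frame_0001.jpg")]  # frames sampled from your video (path is a placeholder)
inputs = processor(text=["Did I wash the green pepper?"],
                   images=frames, return_tensors="pt", padding=True)
with torch.no_grad():
    video_features = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_features = model.get_text_features(input_ids=inputs["input_ids"],
                                            attention_mask=inputs["attention_mask"])
print(video_features.shape, text_features.shape)  # (num_frames, 512), (1, 512)
```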
This repo is maintained by Zhijian Hou. Questions and discussions are welcome via zjhou3-c@my.cityu.edu.hk.
This code is based on Moment-DETR. We use resources from CLIP and EgoVLP to extract the features. We thank the authors for their awesome open-source contributions.