Rui Li, Shenglong Zhou, and Dong Liu

University of Science and Technology of China, Hefei, China
[Paper] / [Demo] / [Project page] / [Poster] / [Intro]
This is the official code for "Learning Fine-Grained Features for Pixel-wise Video Correspondences".
Without any fine-tuning, the proposed method can be directly applied to various correspondence-related tasks including long-term point tracking, video object segmentation, etc.
- 2023.07.14: Our paper "Learning Fine-Grained Features for Pixel-wise Video Correspondences" is accepted to ICCV 2023. The code for inference and training will be released as soon as possible.
- 2023.10.04: We have released the code and models of the paper.
If you find this repository useful for your research, please cite our paper:

```bibtex
@inproceedings{li2023learning,
  title={Learning Fine-Grained Features for Pixel-wise Video Correspondences},
  author={Li, Rui and Zhou, Shenglong and Liu, Dong},
  booktitle={ICCV},
  pages={9632--9641},
  year={2023}
}
```
Our other paper related to video correspondence learning (Spa-then-Temp):

```bibtex
@inproceedings{li2023spatial,
  title={Spatial-then-Temporal Self-Supervised Learning for Video Correspondence},
  author={Li, Rui and Liu, Dong},
  booktitle={CVPR},
  pages={2279--2288},
  year={2023}
}
```
- Python 3.8.8
- PyTorch 1.9.1
- mmcv-full == 1.5.2
- davis2017-evaluation
To get started, first clone the repository:

```shell
git clone https://github.com/qianduoduolr/FGVC
```
For convenience, we provide a Dockerfile. Alternatively, you can install all required packages manually. Our code is based on the mmcv framework and Spa-then-Temp; you can refer to those repositories for more information.
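For reference, a manual setup could look like the following (a sketch assuming a conda environment; the exact CUDA/package builds may differ on your machine):

```shell
# Create an environment matching the versions listed above
conda create -n fgvc python=3.8.8 -y
conda activate fgvc

# PyTorch 1.9.1 and mmcv-full 1.5.2 (pick the wheel matching your CUDA version)
pip install torch==1.9.1 torchvision==0.10.1
pip install mmcv-full==1.5.2

# DAVIS 2017 evaluation toolkit
git clone https://github.com/davisvideochallenge/davis2017-evaluation.git
pip install -e davis2017-evaluation
```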
Please refer to README.md for more details.
Method | BADJA (PCK@0.2) | JHMDB (PCK@0.1) | TAP-Vid-DAVIS (< δ_avg) | TAP-Vid-Kinetics (< δ_avg)
---|---|---|---|---
RAFT | 45.6 | 66.4 | 42.1 | 44.3
TAPNet | - | 62.3 | 48.6 | 54.4
PIPs | 62.3 | - | 55.3 | 48.2
Ours | 69.7 | 66.8 | 62.8 | 54.6
The evaluation is conducted on pixel-wise correspondence-related tasks, i.e., point tracking, on the TAP-Vid benchmark, JHMDB, and BADJA. The results are shown above.
We follow prior studies in leveraging label propagation for inference, which can be run with:

```shell
bash tools/dist_test.sh ${CONFIG} ${GPUS} ${TASK} ${CKPT}
```
Note that you need to download the pre-trained models with this link for `CKPT`. The `TASK` options are `davis` (for TAP-Vid-DAVIS), `kinetics` (for TAP-Vid-Kinetics), `jhmdb` (for human keypoint tracking), and `badja` (for animal keypoint tracking).
An example inference command:

```shell
# Testing for point tracking on TAP-Vid-DAVIS with 4 GPUs
bash tools/dist_test.sh configs/eval/res18_d1_eval.py 4 davis ckpt/res18_d1_fly_ytv_mixed_training.pth
```
The results will be saved to `eval/`. Please note that we run inference on 4 A100 GPUs, each with 80 GB of memory. The released inference code also supports GPUs with less memory, at the cost of longer inference time; we plan to release a more efficient version of the label-propagation inference code later. If you have enough memory, you can simply increase the `step` of `test_cfg` in `CONFIG` for faster inference in the current version.
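For intuition, label propagation over feature affinities can be sketched as follows. This is a minimal NumPy illustration of the general technique (propagating reference labels to query pixels via top-k softmax-weighted feature similarity); the function name and arguments are illustrative, not the repo's actual implementation:

```python
import numpy as np

def propagate_labels(ref_feat, ref_labels, query_feat, topk=3, temp=0.07):
    """Propagate per-pixel labels from reference frames to a query frame.

    ref_feat:   (N, C) features of N reference pixels
    ref_labels: (N, K) one-hot (or soft) labels of the reference pixels
    query_feat: (M, C) features of M query pixels
    returns:    (M, K) propagated soft labels
    """
    # Cosine affinity between every query pixel and every reference pixel
    ref = ref_feat / np.linalg.norm(ref_feat, axis=1, keepdims=True)
    qry = query_feat / np.linalg.norm(query_feat, axis=1, keepdims=True)
    aff = qry @ ref.T  # (M, N)

    # Keep only the top-k most similar reference pixels per query pixel
    idx = np.argsort(-aff, axis=1)[:, :topk]
    out = np.zeros((qry.shape[0], ref_labels.shape[1]))
    for i in range(qry.shape[0]):
        a = aff[i, idx[i]] / temp
        w = np.exp(a - a.max())
        w /= w.sum()                      # softmax over the top-k affinities
        out[i] = w @ ref_labels[idx[i]]   # weighted average of their labels
    return out
```

For video object segmentation or keypoint tracking, `ref_labels` would be the first-frame annotation (plus previously predicted frames), and the propagated soft labels are argmaxed into the final prediction.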
We perform training on FlyingThings and YouTube-VOS. Before training, you need to download the pre-trained 2D encoder from this link and modify `pretrained` in `model.teacher` in the config. You can also try stronger models pre-trained on large-scale image datasets, e.g., MoCo or DetCo, which may yield better results.
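The relevant part of the training config would then look roughly like this (a hypothetical fragment; apart from `model.teacher.pretrained`, which is the key mentioned above, the surrounding structure is illustrative, so check the actual config file):

```python
# Hypothetical fragment of a training config in configs/train/;
# only the `pretrained` key under `model.teacher` is the one described above.
model = dict(
    teacher=dict(
        pretrained='path/to/downloaded_2d_encoder.pth',  # downloaded encoder weights
    ),
)
```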
```shell
bash tools/dist_train.sh configs/train/mixed_train_res18_d1_l2_rec_ytv_fly.py 4
```
This work is licensed under the MIT License; see LICENSE for details.

The codebase is implemented based on MMCV, tapnet, pips, and VFS. Thanks to these excellent open-source repositories.