# Maybe awesome?
If you find a missing paper, please open an issue or a pull request (pull requests recommended).
## Models

- VIOLETv2 (EmpiricalMVM) [Paper][Code] @Microsoft
  An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling (CVPR 2023)
- LAVENDER [Paper][Code] @Microsoft (MLM sketch after this list)
  LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling (CVPR 2023)
- Flamingo [Paper] @DeepMind
  Flamingo: a Visual Language Model for Few-Shot Learning (NeurIPS 2022)
- ALPRO [Paper][Code] @Salesforce
  Align and Prompt: Video-and-Language Pre-training with Entity Prompts (CVPR 2022)
- VL-Adapter [Paper][Code] @UNC
  VL-ADAPTER: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks (CVPR 2022)
- ATP [Paper][Code][Website][Poster][Oral] @Stanford
  Revisiting the "Video" in Video-Language Understanding (CVPR 2022)
- InternVideo [Paper][Code] @OpenGVLab
  InternVideo: General Video Foundation Models via Generative and Discriminative Learning (arXiv 2022)
- VIOLET [Paper][Code] @Microsoft
  VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling (arXiv 2021)
- VidLanKD [Paper][Code] @UNC
  VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer (NeurIPS 2021)
- MCN [Paper][Code] @MIT-IBM
  Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos (ICCV 2021)
- HERO [Paper][Code] @Microsoft
  HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training (EMNLP 2020)
- UniVL [Paper][Code] @Microsoft
  UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation (arXiv 2020)
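Several models above (VIOLET, EmpiricalMVM, LAVENDER) train with BERT-style masked modeling over joint video-text token sequences. As a rough illustration of that objective, here is a self-contained PyTorch sketch; every name and size in it is made up, and it is not any paper's released code.

```python
# Illustrative only (not any paper's released code): BERT-style token masking
# and the cross-entropy loss computed on masked positions only.
import torch
import torch.nn.functional as F

def mask_tokens(token_ids, mask_id=103, mask_prob=0.15):
    """Replace ~15% of tokens with [MASK]; label all other positions -100."""
    masked = torch.rand(token_ids.shape) < mask_prob
    labels = token_ids.clone()
    labels[~masked] = -100                  # ignored by cross_entropy below
    corrupted = token_ids.clone()
    corrupted[masked] = mask_id
    return corrupted, labels

def mlm_loss(logits, labels):
    """Cross-entropy over masked positions only (ignore_index=-100)."""
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), ignore_index=-100)

# Toy usage with a stand-in "model" output over a 1000-token vocabulary.
token_ids = torch.randint(0, 1000, (2, 16))
inputs, labels = mask_tokens(token_ids)
logits = torch.randn(2, 16, 1000)           # stand-in for transformer output
print(mlm_loss(logits, labels))
```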
## Text-Video Retrieval

- FiT [Paper][Code][Website][Demo] @Oxford
  Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval (ICCV 2021)
- CLIP-Hitchhiker [Paper] @Oxford (baseline sketch after this list)
  A CLIP-Hitchhiker's Guide to Long Video Retrieval (arXiv 2022)
- CLIP2Video [Paper][Code] @Tencent
  CLIP2Video: Mastering Video-Text Retrieval via Image CLIP (arXiv 2021)
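CLIP-Hitchhiker and CLIP2Video both start from an image-CLIP baseline: embed sampled frames, pool over time, and rank captions by cosine similarity. Below is a minimal sketch of that mean-pooling baseline, assuming the Hugging Face transformers CLIP API (the API and checkpoint name are real; the helper functions are made up). It is not any paper's implementation.

```python
# Mean-pooled-frame CLIP retrieval baseline; illustrative sketch only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def video_embedding(frames):
    """frames: list of PIL images sampled from a clip -> one pooled vector."""
    pixel = processor(images=frames, return_tensors="pt").pixel_values
    with torch.no_grad():
        feats = model.get_image_features(pixel_values=pixel)  # (T, D)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.mean(dim=0)                                  # mean pool over time

def text_embedding(caption):
    tokens = processor(text=[caption], return_tensors="pt", padding=True)
    with torch.no_grad():
        feat = model.get_text_features(**tokens)[0]
    return feat / feat.norm()

frames = [Image.new("RGB", (224, 224)) for _ in range(8)]  # stand-in frames
v = video_embedding(frames)
score = torch.dot(v / v.norm(), text_embedding("a dog catches a frisbee"))
print(score.item())
```

Mean pooling is the simplest temporal aggregation; the papers above study stronger weighted and learned alternatives.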
## Video Question Answering

- FrozenBiLM [Paper][Code][Website][Poster][Slides] @Inria (sketch after this list)
  Zero-Shot Video Question Answering via Frozen Bidirectional Language Models (NeurIPS 2022)
- MERLOT Reserve [Paper][Code][Website][Demo] @AI2
  MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound (CVPR 2022)
- MERLOT [Paper][Code][Website] @AI2
  MERLOT: Multimodal Neural Script Knowledge Models (NeurIPS 2021)
- JustAsk [Paper/Journal][Code][Website][Demo][Poster][Slides][Oral] @Inria
  Just Ask: Learning to Answer Questions from Millions of Narrated Videos (ICCV 2021)
  Learning to Answer Visual Questions from Web Videos (TPAMI 2022)
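FrozenBiLM keeps a bidirectional masked LM frozen and reads zero-shot answers off the [MASK] position, with video injected through lightweight adapters. As a text-only illustration of that answering interface (the video conditioning, which does the real work in the paper, is omitted entirely here), a stock fill-mask pipeline behaves like this:

```python
# Text-only skeleton of mask-filling QA; FrozenBiLM itself additionally feeds
# video features into the frozen LM, which this sketch does not model.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
prompt = "Question: what animal is chasing the ball? Answer: [MASK]."
for cand in fill(prompt, top_k=5):
    print(cand["token_str"], round(cand["score"], 3))
```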
## Video Captioning & Summarization

- Video ChatCaptioner [Paper][Code] @KAUST
  Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions (arXiv 2023)
- Vid2Seq [Paper][Code][Website][Blog] @Google
  Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning (CVPR 2023)
- MV-GPT [Paper] @Google
  End-to-end Generative Pretraining for Multimodal Video Captioning (CVPR 2022)
- SwinBERT [Paper][Code] @Microsoft (loss sketch after this list)
  SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning (CVPR 2022)
- HMN [Paper][Code] @UCAS
  Hierarchical Modular Network for Video Captioning (CVPR 2022)
- VX2TEXT [Paper] @Facebook
  VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs (CVPR 2021)
- DeCEMBERT [Paper][Code][Oral] @UNC
  DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization (NAACL 2021)
- CLIP4Caption [Paper] @Tencent
  CLIP4Caption: CLIP for Video Caption (ACM MM 2021)
- ViTT [Paper][Oral] @Google
  Multimodal Pretraining for Dense Video Captioning (AACL 2020)
- GPT2MVS [Paper] @UvA
  GPT2MVS: Generative Pre-trained Transformer-2 for Multi-modal Video Summarization (ICMR 2021)
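Most captioners in this section (SwinBERT, MV-GPT, and others) come down to the same teacher-forced objective: visual features condition an autoregressive text decoder trained to predict the next caption token. A toy PyTorch sketch with made-up sizes, not any paper's architecture:

```python
# Illustrative toy captioner: per-frame features act as decoder memory; the
# decoder is trained with teacher forcing to predict the next caption token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCaptioner(nn.Module):
    def __init__(self, vocab=1000, d=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        layer = nn.TransformerDecoderLayer(d_model=d, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(d, vocab)

    def forward(self, frame_feats, token_ids):
        # frame_feats: (B, T, d) per-frame features; token_ids: (B, L)
        L = token_ids.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        out = self.decoder(self.embed(token_ids), frame_feats, tgt_mask=causal)
        return self.head(out)                        # (B, L, vocab)

model = TinyCaptioner()
frames = torch.randn(2, 8, 256)                      # stand-in visual features
tokens = torch.randint(0, 1000, (2, 12))             # stand-in caption tokens
logits = model(frames, tokens[:, :-1])               # shift: predict token t+1
loss = F.cross_entropy(logits.reshape(-1, 1000), tokens[:, 1:].reshape(-1))
print(loss.item())
```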
## Pretraining Datasets

- WebVid-10M [Paper][Code][Website] @Oxford
  Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval (ICCV 2021)
- HowTo100M [Paper][Code][Website] @Inria
  HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips (ICCV 2019)
## Downstream Datasets

- MAD-v2 [Paper][Code][Website] @Oxford
  AutoAD: Movie Description in Context (CVPR 2023)
- AGQA-Decomp [Paper][Code][Website] @UW
  Measuring Compositional Consistency for Video Question Answering (CVPR 2022)
- MAD [Paper][Code][PapersWithCode] @KAUST
  MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions (CVPR 2022)
- QVHighlights [Paper][Code][PapersWithCode] @UNC
  QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries (NeurIPS 2021)
- STAR [Paper][Code][Website][PapersWithCode] @MIT-IBM
  STAR: A Benchmark for Situated Reasoning in Real-World Videos (NeurIPS 2021)
- VidSitu [Paper][Code][Website][PapersWithCode] @AI2
  Visual Semantic Role Labeling for Video Understanding (CVPR 2021)
- AGQA [Paper][Code][Website][PapersWithCode] @Stanford
  AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning (CVPR 2021)
- LVU [Paper][Code][Website] @UT-Austin
  Towards Long-Form Video Understanding (CVPR 2021)
- DramaQA [Paper][Code][Website][PapersWithCode] @SNU
  DramaQA: Character-Centered Video Story Understanding with Hierarchical QA (AAAI 2021)
- How2QA [Paper][Code] @Microsoft
  HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training (EMNLP 2020)
- How2R [Paper][Code] @Microsoft
  HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training (EMNLP 2020)
- VLEP [Paper][Code][PapersWithCode] @UNC
  What is More Likely to Happen Next? Video-and-Language Future Event Prediction (EMNLP 2020)
- V2C [Paper][Code][Website][PapersWithCode] @ASU
  Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning (EMNLP 2020)
- CMD [Paper][Code][Website][PapersWithCode] @Oxford
  Condensed Movies: Story Based Retrieval with Contextual Embeddings (ACCV 2020)
- TVC [Paper][Code][Website] @UNC
  TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval (ECCV 2020)
- TVR [Paper][Code][Website][PapersWithCode] @UNC
  TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval (ECCV 2020)
- VIOLIN [Paper][Code][PapersWithCode] @Microsoft
  VIOLIN: A Large-Scale Dataset for Video-and-Language Inference (CVPR 2020)
- KnowITVQA [Paper][Code][Website][PapersWithCode] @Osaka
  KnowIT VQA: Answering Knowledge-Based Questions about Videos (AAAI 2020)
- TVQA+ [Paper][Code][Website][PapersWithCode] @UNC
  TVQA+: Spatio-Temporal Grounding for Video Question Answering (ACL 2020)
- TVQA [Paper][Code][Website][PapersWithCode] @UNC
  TVQA: Localized, Compositional Video Question Answering (EMNLP 2018)
- YouCook2 [Paper][Website][PapersWithCode] @UMich
  Towards Automatic Learning of Procedures from Web Instructional Videos (AAAI 2018)
- ActivityNet Captions [Paper][Code][Website][PapersWithCode] @Stanford
  Dense-Captioning Events in Videos (ICCV 2017)
- Charades-STA [Paper][Code][PapersWithCode] @USC
  TALL: Temporal Activity Localization via Language Query (ICCV 2017)
- DiDeMo [Paper][Code][PapersWithCode] @Adobe
  Localizing Moments in Video with Natural Language (ICCV 2017)
- MSVD [Paper][PapersWithCode] @Microsoft
  Collecting Highly Parallel Data for Paraphrase Evaluation (ACL 2011)
- TGIF-QA [Paper/Journal][Code][PapersWithCode] @SNU
  TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering (CVPR 2017)
  Video Question Answering with Spatio-Temporal Reasoning (IJCV 2019)
- PororoQA [Paper][Code] @SNU
  DeepStory: Video Story QA by Deep Embedded Memory Networks (IJCAI 2017)
- LSMDC [Paper][Website][PapersWithCode] @MPII
  Movie Description (IJCV 2017)
- MovieQA [Paper][Code][PapersWithCode] @UToronto
  MovieQA: Understanding Stories in Movies through Question-Answering (CVPR 2016)
- MSR-VTT [Paper][PapersWithCode] @Microsoft
  MSR-VTT: A Large Video Description Dataset for Bridging Video and Language (CVPR 2016)
- MPII-MD [Paper][Website][PapersWithCode] @MPII
  A Dataset for Movie Description (CVPR 2015)
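The retrieval-style benchmarks above (MSR-VTT, DiDeMo, LSMDC, TVR, CMD) report Recall@K. A minimal sketch of the metric, assuming a square text-to-video similarity matrix whose diagonal holds the ground-truth pairs:

```python
# Recall@K from a text-to-video similarity matrix; sketch only.
import torch

def recall_at_k(sim, ks=(1, 5, 10)):
    """sim: (N, N) similarities where sim[i, i] is the true match for query i."""
    gt = torch.arange(sim.size(0)).unsqueeze(1)          # (N, 1) true indices
    hits = (sim.argsort(dim=1, descending=True) == gt)   # True at the GT's rank
    pos = hits.float().argmax(dim=1)                     # rank of the true video
    return {f"R@{k}": (pos < k).float().mean().item() for k in ks}

sim = torch.randn(100, 100) + 5 * torch.eye(100)  # toy scores favoring diagonal
print(recall_at_k(sim))
```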