awesome-Video-Language-Understanding

Maybe awesome?

If you find a paper we missed, please open an issue or a pull request (recommended).

Table of Contents

  • Main
  • Datasets and SOTA

Main

Video Language Transformers

  • VIOLETv2 (EmpiricalMVM) [Paper][Code] @Microsoft
    An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling (CVPR 2023)

  • LAVENDER [Paper][Code] @Microsoft
    LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling (CVPR 2023)

  • Flamingo [Paper] @DeepMind
    Flamingo: a Visual Language Model for Few-Shot Learning (NeurIPS 2022)

  • ALPRO [Paper][Code] @Salesforce
    Align and Prompt: Video-and-Language Pre-training with Entity Prompts (CVPR 2022)

  • VL-Adapter [Paper][Code] @UNC
    VL-ADAPTER: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks (CVPR 2022)

  • ATP [Paper][Code][Website][Poster][Oral] @Stanford
    Revisiting the "Video" in Video-Language Understanding (CVPR 2022)

  • InternVideo [Paper][Code] @OpenGVLab
    InternVideo: General Video Foundation Models via Generative and Discriminative Learning (arXiv 2022)

  • VIOLET [Paper][Code] @Microsoft
    VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling (arXiv 2021)

  • VidLanKD [Paper][Code] @UNC
    VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer (NeurIPS 2021)

  • MCN [Paper][Code] @MIT-IBM
    Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos (ICCV 2021)

  • HERO [Paper][Code] @Microsoft
    HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training (EMNLP 2020)

  • UniVL [Paper][Code] @Microsoft
    UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation (arXiv 2020)

Video Retrieval

  • FiT [Paper][Code][Website][Demo] @Oxford
    Frozen in Time: A Joint Video and Image Encoder for End to End Retrieval (ICCV 2021)

  • CLIP-Hitchhiker [Paper] @Oxford
    A CLIP-Hitchhiker's Guide to Long Video Retrieval (arXiv 2022)

  • CLIP2Video [Paper][Code] @Tencent
    CLIP2Video: Mastering Video-Text Retrieval via Image CLIP (arXiv 2021)

Video Question Answering

  • FrozenBiLM [Paper][Code][Website][Poster][Slides] @Inria
    Zero-Shot Video Question Answering via Frozen Bidirectional Language Models (NeurIPS 2022)

  • MERLOT Reserve [Paper][Code][Website][Demo] @AI2
    MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound (CVPR 2022)

  • MERLOT [Paper][Code][Website] @AI2
    MERLOT: Multimodal Neural Script Knowledge Models (NeurIPS 2021)

  • JustAsk [Paper/Journal][Code][Website][Demo][Poster][Slides][Oral] @Inria
    Just Ask: Learning to Answer Questions from Millions of Narrated Videos (ICCV 2021)
    Learning to Answer Visual Questions from Web Videos (TPAMI 2022)

Video Captioning

  • Video ChatCaptioner [Paper][Code] @KAUST
    Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions (arXiv 2023)

  • Vid2Seq [Paper][Code][Website][Blog] @Google
    Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning (CVPR 2023)

  • MV-GPT [Paper] @Google
    End-to-end Generative Pretraining for Multimodal Video Captioning (CVPR 2022)

  • SwinBERT [Paper][Code] @Microsoft
    SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning (CVPR 2022)

  • HMN [Paper][Code] @UCAS
    Hierarchical Modular Network for Video Captioning (CVPR 2022)

  • VX2TEXT [Paper] @Facebook
    VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs (CVPR 2021)

  • DeCEMBERT [Paper][Code][Oral] @UNC
    DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization (NAACL 2021)

  • CLIP4Caption [Paper] @Tencent
    CLIP4Caption: CLIP for Video Caption (ACM MM 2021)

  • ViTT [Paper][Oral] @Google
    Multimodal Pretraining for Dense Video Captioning (AACL 2020)

Others

  • GPT2MVS [Paper] @UvA
    GPT2MVS: Generative Pre-trained Transformer-2 for Multi-modal Video Summarization (ICMR 2021)

Datasets and SOTA

Large-scale Video Language Dataset

  • WebVid-10M [Paper][Code][Website] @Oxford
    Frozen in Time: A Joint Video and Image Encoder for End to End Retrieval (ICCV 2021)

  • HowTo100M [Paper][Code][Website] @Inria
    HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips (ICCV 2019)

Downstream Tasks

  • MAD-v2 [Paper][Code][Website] @Oxford
    AutoAD: Movie Description in Context (CVPR 2023)

  • AGQA-Decomp [Paper][Code][Website] @UW
    Measuring Compositional Consistency for Video Question Answering (CVPR 2022)

  • MAD [Paper][Code][PapersWithCode] @KAUST
    MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions (CVPR 2022)

  • QVHighlights [Paper][Code][PapersWithCode] @UNC
    QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries (NeurIPS 2021)

  • STAR [Paper][Code][Website][PapersWithCode] @MIT-IBM
    STAR: A Benchmark for Situated Reasoning in Real-World Videos (NeurIPS 2021)

  • VidSitu [Paper][Code][Website][PapersWithCode] @AI2
    Visual Semantic Role Labeling for Video Understanding (CVPR 2021)

  • AGQA [Paper][Code][Website][PapersWithCode] @Stanford
    AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning (CVPR 2021)

  • LVU [Paper][Code][Website] @UTAustin
    Towards Long-Form Video Understanding (CVPR 2021)

  • DramaQA [Paper][Code][Website][PapersWithCode] @SNU
    DramaQA: Character-Centered Video Story Understanding with Hierarchical QA (AAAI 2021)

  • How2QA [Paper][Code] @Microsoft
    HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training (EMNLP 2020)

  • How2R [Paper][Code] @Microsoft
    HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training (EMNLP 2020)

  • VLEP [Paper][Code][PapersWithCode] @UNC
    What is More Likely to Happen Next? Video-and-Language Future Event Prediction (EMNLP 2020)

  • V2C [Paper][Code][Website][PapersWithCode] @ASU
    Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning (EMNLP 2020)

  • CMD [Paper][Code][Website][PapersWithCode] @Oxford
    Condensed Movies: Story Based Retrieval with Contextual Embeddings (ACCV 2020)

  • TVC [Paper][Code][Website] @UNC
    TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval (ECCV 2020)

  • TVR [Paper][Code][Website][PapersWithCode] @UNC
    TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval (ECCV 2020)

  • VIOLIN [Paper][Code][PapersWithCode] @Microsoft
    VIOLIN: A Large-Scale Dataset for Video-and-Language Inference (CVPR 2020)

  • KnowITVQA [Paper][Code][Website][PapersWithCode] @Osaka
    KnowIT VQA: Answering Knowledge-Based Questions about Videos (AAAI 2020)

  • TVQA+ [Paper][Code][Website][PapersWithCode] @UNC
    TVQA+: Spatio-Temporal Grounding for Video Question Answering (ACL 2020)

  • TVQA [Paper][Code][Website][PapersWithCode] @UNC
    TVQA: Localized, Compositional Video Question Answering (EMNLP 2018)

  • YouCook2 [Paper][Website][PapersWithCode] @UMich
    Towards Automatic Learning of Procedures from Web Instructional Videos (AAAI 2018)

  • ActivityNet Captions [Paper][Code][Website][PapersWithCode] @Stanford
    Dense-Captioning Events in Videos (ICCV 2017)

  • Charades-STA [Paper][Code][PapersWithCode] @USC
    TALL: Temporal Activity Localization via Language Query (ICCV 2017)

  • DiDeMo [Paper][Code][PapersWithCode] @Adobe
    Localizing Moments in Video with Natural Language (ICCV 2017)

  • MSVD [Paper][PapersWithCode] @Microsoft
    Collecting Highly Parallel Data for Paraphrase Evaluation (ACL 2011)

  • TGIF-QA [Paper/Journal][Code][PapersWithCode] @SNU
    TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering (CVPR 2017)
    Video Question Answering with Spatio-Temporal Reasoning (IJCV 2019)

  • PororoQA [Paper][Code] @SNU
    DeepStory: Video Story QA by Deep Embedded Memory Networks (IJCAI 2017)

  • LSMDC [Paper][Website][PapersWithCode] @MPII
    Movie Description (IJCV 2017)

  • MovieQA [Paper][Code][PapersWithCode] @UToronto
    MovieQA: Understanding Stories in Movies through Question-Answering (CVPR 2016)

  • MSR-VTT [Paper][PapersWithCode] @Microsoft
    MSR-VTT: A Large Video Description Dataset for Bridging Video and Language (CVPR 2016)

  • MPII-MD [Paper][Website][PapersWithCode] @MPII
    A Dataset for Movie Description (CVPR 2015)