
This is the official implementation of our paper "Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension"


Leon1207/Video-RAG-master


Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension


😮 Highlights


  • We integrate RAG into open-source LVLMs: Video-RAG incorporates three types of visually-aligned auxiliary texts (OCR, ASR, and object detection) processed by external tools and retrieved via RAG, enhancing the LVLM. It’s implemented using completely open-source tools, without the need for any commercial APIs.
  • We design a versatile plug-and-play RAG-based pipeline for any LVLM: Video-RAG offers a training-free solution for a wide range of LVLMs, delivering performance improvements with minimal additional resource requirements.
  • We achieve proprietary-level performance with open-source models: Applying Video-RAG to a 72B open-source model yields state-of-the-art performance on Video-MME, surpassing models such as Gemini-1.5-Pro.
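The retrieval idea in the first highlight can be illustrated with a small, self-contained sketch. It is not the repository's implementation: a toy bag-of-words cosine similarity stands in for whatever text encoder and index the pipeline uses, and the names `embed`, `cosine`, `retrieve`, and the `threshold` value are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real text encoder (assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, threshold=0.2):
    """Keep the chunks whose similarity to the query reaches the threshold."""
    q = embed(query)
    return [c for c in chunks if cosine(q, embed(c)) >= threshold]


# Hypothetical auxiliary texts, as OCR and ASR tools might produce them.
aux_texts = {
    "ocr": ["Final score 3 to 1", "Exit 24 airport"],
    "asr": ["home team scored the final goal", "thanks for watching"],
}
query = "What was the final score of the match?"

# Only the query-relevant chunks would be passed on to the LVLM.
selected = {source: retrieve(query, chunks) for source, chunks in aux_texts.items()}
print(selected)
```

Filtering each auxiliary source against the query keeps the extra context short, which is consistent with the "minimal additional resource requirements" claim above.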

🔨 Usage

This repo is built upon LLaVA-NeXT:

  • Step 1: Clone and build LLaVA-NeXT conda environment:
git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # Enable PEP 660 support.
pip install -e ".[train]"

Then install the following packages in the llava environment:

pip install spacy faiss-cpu easyocr ffmpeg-python
pip install torch==2.1.2 torchaudio numpy
python -m spacy download en_core_web_sm
# Optional: pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz
  • Step 2: Clone APE and build a separate conda environment for it:
git clone https://github.com/shenyunhang/APE
cd APE
pip3 install -r requirements.txt
python3 -m pip install -e .
  • Step 3: Copy all the files in vidrag_pipeline under the root dir of LLaVA-NeXT;

  • Step 4: Copy all the files in ape_tools under the demo dir of APE;

  • Step 5: Start the APE service by running the script located under APE/demo:

python demo/ape_service.py
  • Step 6: You can now run our pipeline, built upon LLaVA-Video-7B, with:
python vidrag_pipeline.py
  • Note that you can also use our pipeline with any LVLM by modifying the following parts of vidrag_pipeline.py:
1. The video-language model that is loaded (line #161).
2. The llava_inference() function: make sure your model supports inputs both with and without video (line #175).
3. The process_video() function, which may need adapting to your model (line #34).
4. The final prompt, which may need adapting to your model (line #366).
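As a rough illustration of item 2, the sketch below shows one way a wrapper could accept inputs both with and without video. It is a hypothetical helper, not the repository's `llava_inference()`: the function name `build_inputs`, the request dictionary layout, and the prompt format are all assumptions to be replaced with whatever your model's API expects.

```python
def build_inputs(question, aux_texts, video_frames=None):
    """Assemble a model-agnostic request (hypothetical layout).

    aux_texts maps a source name (e.g. "ocr", "asr") to its retrieved text.
    video_frames=None means a text-only call; otherwise the frames are attached.
    """
    # Skip sources that returned nothing so the prompt stays compact.
    context = "\n".join(f"{name}: {text}" for name, text in aux_texts.items() if text)
    prompt = f"{context}\n{question}" if context else question

    request = {"prompt": prompt}
    if video_frames is not None:
        request["video"] = video_frames
    return request


# Text-only call, with no video attached.
text_only = build_inputs("What happens?", {"ocr": "EXIT", "asr": ""})

# Call with (placeholder) video frames.
with_video = build_inputs("What happens?", {}, video_frames=[0, 1])
```

Keeping the video argument optional lets the same function serve both kinds of call, so only the model-specific part has to change when swapping in another LVLM.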

✏️ Citation

If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝:

@misc{luo2024videoragvisuallyalignedretrievalaugmentedlong,
      title={Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension}, 
      author={Yongdong Luo and Xiawu Zheng and Xiao Yang and Guilin Li and Haojia Lin and Jinfa Huang and Jiayi Ji and Fei Chao and Jiebo Luo and Rongrong Ji},
      year={2024},
      eprint={2411.13093},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.13093}, 
}