OLA-VLM

Framework: PyTorch HuggingFace space YouTube

Jitesh Jain*, Zhengyuan Yang, Humphrey Shi, Jianfeng Gao, Jianwei Yang

*Work done during an internship at Microsoft Research, Redmond    Equal Advising

[Project Page] | [arXiv] | [Model Checkpoints] | [Video] | [BibTeX]

This repo contains the code for our paper OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation.

We propose distilling target visual information into the intermediate representations of the LLM from a set of target encoders. We adopt a predictive embedding optimization approach at selected LLM layers during training to minimize the embedding losses along with the next token prediction (NTP) objective, resulting in a vision-centric approach to training the Multimodal Large Language Model.
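The objective described above can be sketched in a few lines. This is a minimal, framework-free illustration and not the code in this repo: the smooth-L1 form of the embedding loss, the `weight` parameter, the encoder label `"depth"`, and the layer indices are all assumptions made for the example. In practice these would be tensor operations inside the training loop.

```python
from typing import Dict, Sequence

Vector = Sequence[float]


def smooth_l1(pred: Vector, target: Vector, beta: float = 1.0) -> float:
    """Mean smooth-L1 distance between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        # Quadratic near zero, linear for large differences.
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total / len(pred)


def combined_loss(
    ntp_loss: float,
    predicted: Dict[str, Dict[int, Vector]],
    targets: Dict[str, Vector],
    weight: float = 0.5,  # assumed weighting, not taken from the paper
) -> float:
    """Next-token-prediction loss plus weighted embedding losses.

    predicted maps target-encoder name -> LLM layer index -> embedding
    predicted from that layer's hidden state (hypothetical layout).
    targets maps target-encoder name -> embedding from that encoder.
    """
    embedding_loss = 0.0
    for encoder_name, per_layer in predicted.items():
        for layer_prediction in per_layer.values():
            embedding_loss += smooth_l1(layer_prediction, targets[encoder_name])
    return ntp_loss + weight * embedding_loss


if __name__ == "__main__":
    targets = {"depth": [1.0, 0.0]}
    predicted = {"depth": {8: [1.0, 0.0], 16: [0.0, 0.0]}}
    print(combined_loss(2.0, predicted, targets))
```

The point of the sketch is only the structure: one NTP term, plus one embedding term per (target encoder, selected layer) pair, all minimized together.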

Contents

  1. Installation Instructions
  2. Demo
  3. Getting Started
  4. Results
  5. Citation

News

Installation Instructions

Note: We trained all our models on AMD MI300X GPUs. However, this repo provides instructions for NVIDIA GPUs, since they are more widely used.

  • Clone this repository.

    git lfs install
    git clone https://github.com/SHI-Labs/OLA-VLM
    cd OLA-VLM
  • Set up a conda environment with the base dependencies.

    conda create -n ola_vlm -y
    conda activate ola_vlm
    pip install -e .
    pip install flash-attn --no-build-isolation
    pip install scikit-learn icecream datasets pytorch-fid lpips opencv-python-headless
    pip install setuptools==61.0.0
    pip install -e lmms-eval/
    pip install huggingface_hub==0.24.7
    pip install transformers==4.41.1
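Because the last three commands pin exact versions, it can help to confirm that the environment actually ended up with them. The following is a small optional check using only the Python standard library. The helper name `check_pins` is ours and not part of this repo, and distribution-name matching (for example `huggingface_hub` versus `huggingface-hub`) may vary with the Python version.

```python
from importlib import metadata
from typing import Dict

# Pins copied from the install commands above.
PINS = {
    "setuptools": "61.0.0",
    "huggingface_hub": "0.24.7",
    "transformers": "4.41.1",
}


def check_pins(pins: Dict[str, str]) -> Dict[str, str]:
    """Return {package: problem} for each pin that is missing or mismatched."""
    problems = {}
    for name, wanted in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = f"not installed (want {wanted})"
            continue
        if found != wanted:
            problems[name] = f"found {found}, want {wanted}"
    return problems


if __name__ == "__main__":
    issues = check_pins(PINS)
    for name, problem in issues.items():
        print(f"{name}: {problem}")
    if not issues:
        print("all pinned versions match")
```

Run it inside the activated `ola_vlm` environment. An empty result means the pinned packages match.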

Demo

HuggingFace space

You can use the Gradio interface to interact with OLA-VLM locally. The demo also supports visualizing the representations from the selected intermediate LLM layers (embedding loss positions).

# install demo-specific libraries
pip install -e ".[demo]"

# start the demo
CUDA_VISIBLE_DEVICES=0 python demo.py --model-path shi-labs/pretrain_dsg_OLA-VLM-CLIP-ViT-Llama3-8b --PT-model-path shi-labs/pretrain_dsg_OLA-VLM-CLIP-ViT-Llama3-8b

Getting Started

Note: We provide a guide for integrating the embedding losses from OLA-VLM into any custom MLLM in Custom_MLLM.md.

Training

  • Please see Training.md for training commands and dataset preparation.
  • We train all our models on 16 AMD MI300X GPUs (192 GB each).

Evaluation

Please see Evaluation.md for evaluation commands.

Probing

Please see Probing.md for probing commands.

Results

| Method | Training Stages | LLM | Base Encoder | CV-Bench | MMStar | RWQA | OK-VQA | Checkpoint |
|---|---|---|---|---|---|---|---|---|
| OLA-VLM | PT + IFT | Phi3-4k-mini | CLIP-ViT-L | 62.5 | 36.0 | 58.0 | 56.4 | ckpt |
| OLA-VLM | PT + IFT | Phi3-4k-mini | CLIP-ConvNeXT-XXL | 63.9 | 38.4 | 58.4 | 56.5 | ckpt |
| OLA-VLM | PT + IFT | Llama3-8b | CLIP-ViT-L | 61.4 | 39.5 | 57.9 | 56.6 | ckpt |
| OLA-VLM | PT + IFT | Llama3-8b | CLIP-ConvNeXT-XXL | 61.5 | 38.5 | 55.0 | 59.0 | ckpt |
| OLA-VLM | PT + VPT + IFT | Llama3-8b | CLIP-ConvNeXT-XXL | 64.6 | 40.6 | 62.9 | 61.1 | ckpt |
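For a quick side-by-side comparison of the rows, the scores can be averaged per configuration. The snippet below is only a convenience sketch: the unweighted mean across the four benchmarks is our own summary choice, not a metric reported in this README, and the dictionary keys are just labels built from the table columns.

```python
from statistics import mean
from typing import Dict, List, Tuple

Config = Tuple[str, str, str]  # (training stages, LLM, base encoder)

# Scores copied from the results table, in column order:
# CV-Bench, MMStar, RWQA, OK-VQA.
ROWS: Dict[Config, List[float]] = {
    ("PT + IFT", "Phi3-4k-mini", "CLIP-ViT-L"): [62.5, 36.0, 58.0, 56.4],
    ("PT + IFT", "Phi3-4k-mini", "CLIP-ConvNeXT-XXL"): [63.9, 38.4, 58.4, 56.5],
    ("PT + IFT", "Llama3-8b", "CLIP-ViT-L"): [61.4, 39.5, 57.9, 56.6],
    ("PT + IFT", "Llama3-8b", "CLIP-ConvNeXT-XXL"): [61.5, 38.5, 55.0, 59.0],
    ("PT + VPT + IFT", "Llama3-8b", "CLIP-ConvNeXT-XXL"): [64.6, 40.6, 62.9, 61.1],
}


def average_scores(rows: Dict[Config, List[float]]) -> Dict[Config, float]:
    """Unweighted mean of the benchmark scores for each configuration."""
    return {config: mean(scores) for config, scores in rows.items()}


if __name__ == "__main__":
    averages = average_scores(ROWS)
    # Print configurations from highest to lowest average.
    for config, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{avg:5.1f}  {' | '.join(config)}")
```

With the numbers above, the PT + VPT + IFT row averages 57.3, compared with 53.5 for the same LLM and encoder trained with PT + IFT only.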

Citation

If you find OLA-VLM useful in your research, please consider starring ⭐ us on GitHub and citing 📚 our paper!

@article{jain2024ola_vlm,
      title={{OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation}},
      author={Jitesh Jain and Zhengyuan Yang and Humphrey Shi and Jianfeng Gao and Jianwei Yang},
      journal={arXiv},
      year={2024}
}

Acknowledgement

We thank the authors of LLaVA-1.5, OneFormer, Depth-Anything v2, and unCLIP-SD for open-sourcing their codebases and checkpoints. We are grateful to the authors of Cambrian and MMStar for releasing their code for CV-Bench and MMStar evaluation, respectively.