We provide the off-the-shelf scripts in the scripts folder.
Cache of pretrained weight | Baidu Yun | Google Cloud | Peking University Yun |
---|---|---|---|
Large | Link | Link | Link |
Huge | Link | - | Link |
For example, to train LanguageBind on Depth-Language with 8 GPUs (1 node x 8 GPUs):
- First, download the cache of pretrained weights above and specify `CACHE_DIR=path/to/LanguageBind`.
- Second, set the paths for `ANNOTATION` and `DATA` here according to the dataset preparation.
- Then you can run:
CACHE_DIR="/path/to/LanguageBind"
ANNOTATION="path/to/data"
cd /path/to/LanguageBind
TORCH_DISTRIBUTED_DEBUG=DETAIL HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 torchrun --nnodes=1 --nproc_per_node 8 \
-m main \
--train-data ${ANNOTATION} \
--train-num-samples 3020000 \
--clip-type "dl" --max-depth 10 \
--do_train \
--lock-text --lock-image --text-type "polish_mplug" \
--init-temp 0.07 --learn-temp \
--model "ViT-L-14" --cache-dir ${CACHE_DIR} \
--convert_to_lora --lora_r 2 \
--lr 5e-4 --coef-lr 1e-3 \
--beta1 0.9 --beta2 0.98 --wd 0.2 --eps 1e-6 \
--num-frames 1 --force-patch-dropout 0.5 \
--epochs 1 --batch-size 128 --accum-freq 1 --warmup 200 \
--precision "amp" --workers 10 --video-decode-backend "imgs" \
--save-frequency 1 --log-every-n-steps 20 --report-to "tensorboard" --resume "latest" \
--do_eval \
--val_d_cls_data "NYUV2"
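
As a rough sanity check on the schedule implied by the command above, the sketch below computes the global batch size and the approximate number of optimizer steps per epoch. It assumes that `--batch-size` is the per-GPU batch size, as in OpenCLIP-style training scripts; verify this against the argument parser in the repository. All variable names are illustrative.

```shell
#!/bin/sh
# Values copied from the training command above.
NUM_NODES=1           # --nnodes
GPUS_PER_NODE=8       # --nproc_per_node
BATCH_SIZE=128        # --batch-size (assumed to be per GPU)
ACCUM_FREQ=1          # --accum-freq
NUM_SAMPLES=3020000   # --train-num-samples

WORLD_SIZE=$(( NUM_NODES * GPUS_PER_NODE ))
GLOBAL_BATCH=$(( WORLD_SIZE * BATCH_SIZE * ACCUM_FREQ ))
# Integer division; the real data loader may round differently.
STEPS_PER_EPOCH=$(( NUM_SAMPLES / GLOBAL_BATCH ))

echo "world size:        ${WORLD_SIZE}"
echo "global batch size: ${GLOBAL_BATCH}"
echo "steps per epoch:   ${STEPS_PER_EPOCH}"
```

Under these assumptions the command trains with a global batch of 1024 samples for about 2949 steps, so `--warmup 200` covers roughly the first 7% of the epoch. If you change the number of GPUs, adjust `--batch-size` or `--accum-freq` to keep the global batch size comparable.
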
For example, to validate LanguageBind on Depth-Language with 1 GPU:
- First, specify `RESUME`.
- Second, prepare the downstream dataset.
- Then you can run:
CACHE_DIR="/path/to/LanguageBind"
RESUME="thermal_language.pt"
ANNOTATION="path/to/data"
cd /path/to/LanguageBind
TORCH_DISTRIBUTED_DEBUG=DETAIL HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 torchrun --nproc_per_node 1 \
-m main \
--train-data ${ANNOTATION} \
--train-num-samples 3020000 \
--clip-type "dl" --max-depth 10 \
--lock-text --lock-image --text-type "polish_mplug" \
--init-temp 0.07 --learn-temp \
--model "ViT-L-14" --cache-dir ${CACHE_DIR} \
--convert_to_lora --lora_r 2 \
--lr 5e-4 --coef-lr 1e-3 \
--beta1 0.9 --beta2 0.98 --wd 0.2 --eps 1e-6 \
--num-frames 1 --force-patch-dropout 0.5 \
--epochs 1 --batch-size 128 --accum-freq 1 --warmup 200 \
--precision "amp" --workers 10 --video-decode-backend "imgs" \
--save-frequency 1 --log-every-n-steps 20 --report-to "tensorboard" --resume ${RESUME} \
--do_eval \
--val_d_cls_data "NYUV2"
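
Because the command above sets `HF_DATASETS_OFFLINE=1` and `TRANSFORMERS_OFFLINE=1`, nothing is downloaded at run time, so it can help to confirm that the local inputs exist before launching. The following is a minimal sketch of such a check. `check_inputs` is a hypothetical helper that is not part of LanguageBind, and it assumes `RESUME` is a checkpoint file path resolved relative to the current directory.

```shell
#!/bin/sh
# check_inputs is a hypothetical helper, not part of LanguageBind.
# It returns 0 when the cache directory and the checkpoint file both exist,
# and prints the first missing path to stderr otherwise.
check_inputs() {
    local cache_dir="$1"
    local resume="$2"
    if [ ! -d "${cache_dir}" ]; then
        echo "missing cache directory: ${cache_dir}" >&2
        return 1
    fi
    if [ ! -f "${resume}" ]; then
        echo "missing checkpoint file: ${resume}" >&2
        return 1
    fi
    return 0
}

# Guard the launch: CACHE_DIR and RESUME are the variables set in the
# command above; they default to empty strings here if unset.
if check_inputs "${CACHE_DIR:-}" "${RESUME:-}"; then
    echo "inputs found, ready to launch"
else
    echo "fix the paths above before launching" >&2
fi
```

Run the check after `cd /path/to/LanguageBind` so that a relative `RESUME` path is resolved from the same directory as the `torchrun` command.
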
The NYU V2 dataset is downloaded from this repo, and we reformat it to conform to the standard ImageNet format. We also provide the data as follows. Change the `data_root` here.

Video datasets are downloaded from this repo, and the folder structure is shown below. Change the `data_root` here.

Audio datasets are downloaded from this repo, and Audioset from here. We reformat them to conform to the standard ImageNet format. Change the `data_root` here1 and here2.

We download LLVIP from the official website and FLIR from here. We reformat them to conform to the standard ImageNet format. Change the `data_root` here. We also provide the processed data as follows.
Datasets | Baidu Yun | Google Cloud | Peking University Yun |
---|---|---|---|
LLVIP | Link | Link | Link |
FLIR V1 | Link | Link | Link |
FLIR V2 | Link | Link | Link |
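
The "standard ImageNet format" mentioned above places each sample in a folder named after its class, for example `val/person/`. If you prepare a dataset yourself, the reorganization can be sketched as below. `to_class_folders` is a hypothetical helper that is not part of LanguageBind, and the label file layout (one `<file name>,<class name>` pair per line) is an assumption made for illustration; the original datasets ship their labels in other formats.

```shell
#!/bin/sh
# to_class_folders is a hypothetical helper, not part of LanguageBind.
# Usage: to_class_folders <labels.csv> <source_dir> <target_split_dir>
# Each line of labels.csv is assumed to be "<file name>,<class name>".
to_class_folders() {
    local labels="$1"
    local src="$2"
    local dst="$3"
    local file=""
    local class=""
    # The "|| [ -n ... ]" part also handles a last line without a newline.
    while IFS=, read -r file class || [ -n "${file}" ]; do
        # Skip blank lines.
        [ -n "${file}" ] || continue
        # One folder per class; quoting keeps names such as "other vehicle" intact.
        mkdir -p "${dst}/${class}"
        cp "${src}/${file}" "${dst}/${class}/"
    done < "${labels}"
}
```

For example, `to_class_folders labels.csv raw_images downstream_datasets/Thermal/llvip/val` would copy every listed image into `val/<class name>/`, where `labels.csv` and `raw_images` are placeholder names.
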
downstream_datasets
├── Audio
│   ├── audiocaps
│   │   └── audio
│   │       ├── test
│   │       ├── train
│   │       └── val
│   ├── audioset
│   │   ├── balanced_train_segments
│   │   ├── eval_segments
│   │   └── unbalanced_train_segments
│   │       ├── unbalanced_train_segments_part00
│   │       ├── unbalanced_train_segments_part01
│   │       ├── ...
│   │       └── unbalanced_train_segments_part40
│   ├── clotho
│   │   ├── CLOTHO_retrieval_dataset
│   │   └── evaluation
│   ├── esc50
│   │   └── test
│   │       ├── airplane
│   │       ├── breathing
│   │       ├── ...
│   │       └── wind
│   ├── laionaudio
│   │   ├── audios
│   │   ├── freesound_no_overlap
│   │   └── jsons
│   └── vggsound
│       └── test
│           ├── air\ conditioning\ noise
│           ├── air\ horn
│           ├── ...
│           └── zebra\ braying
├── Depth
│   ├── nyuv2
│   │   ├── data
│   │   │   └── val
│   │   │       ├── bathroom
│   │   │       ├── bedroom
│   │   │       ├── bookstore
│   │   │       ├── classroom
│   │   │       ├── dining_room
│   │   │       ├── home_office
│   │   │       ├── kitchen
│   │   │       ├── living_room
│   │   │       ├── office
│   │   │       └── others
├── Thermal
│   ├── flirv1
│   │   └── val
│   │       ├── bicycle
│   │       ├── car
│   │       ├── dog
│   │       └── person
│   ├── flirv2
│   │   └── val
│   │       ├── bike
│   │       ├── bus
│   │       ├── car
│   │       ├── hydrant
│   │       ├── light
│   │       ├── motor
│   │       ├── other\ vehicle
│   │       ├── person
│   │       ├── sign
│   │       ├── skateboard
│   │       ├── stroller
│   │       └── truck
│   ├── llvip
│   │   ├── train
│   │   │   ├── background
│   │   │   └── person
│   │   └── val
│   │       ├── background
│   │       └── person
└── VideoTextRetrieval
    ├── vtRetdata
    │   ├── ActivityNet
    │   │   └── Videos
    │   │       └── Activity_Videos
    │   ├── Didemo
    │   │   └── videos
    │   ├── MSRVTT
    │   │   └── MSRVTT_Videos
    │   └── MSVD
    │       └── MSVD_Videos
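
Before running validation, it can be useful to confirm that the folders in the tree above exist under your own `downstream_datasets` root. The following is a minimal sketch of such a check. `list_missing_dirs` is a hypothetical helper that is not part of LanguageBind; the relative paths in the usage comment are taken from the tree above.

```shell
#!/bin/sh
# list_missing_dirs is a hypothetical helper, not part of LanguageBind.
# Usage: list_missing_dirs <downstream_datasets root> <relative dir>...
# Prints every relative directory that does not exist under the root and
# returns 1 if at least one is missing, 0 otherwise.
list_missing_dirs() {
    local root="$1"
    shift
    local rel=""
    local status=0
    for rel in "$@"; do
        if [ ! -d "${root}/${rel}" ]; then
            echo "${rel}"
            status=1
        fi
    done
    return "${status}"
}

# Example call (replace the root with your own path):
# list_missing_dirs /path/to/downstream_datasets \
#     "Depth/nyuv2/data/val" \
#     "Thermal/flirv1/val" \
#     "Thermal/flirv2/val" \
#     "Thermal/llvip/val" \
#     "Audio/esc50/test" \
#     "VideoTextRetrieval/vtRetdata/MSRVTT/MSRVTT_Videos"
```

The function prints nothing when every folder is present, so its output can be used directly as a list of datasets that still need to be downloaded or reformatted.
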