diff --git a/README.md b/README.md index 30cd449..3ec4c71 100644 --- a/README.md +++ b/README.md @@ -1,15 +1,23 @@
-# 【CVPR'2023】Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models +

【CVPR'2023】🚴 BIKE: Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models

[![Conference](http://img.shields.io/badge/CVPR-2023-6790AC.svg)](https://cvpr.thecvf.com/) [![Paper](http://img.shields.io/badge/Paper-arxiv.2301.00182-b31b1b.svg)](https://arxiv.org/abs/2301.00182) -
-This is the official implementation of our **BIKE**: [Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models](https://arxiv.org/abs/2301.00182). + + +[Wenhao Wu](https://whwu95.github.io/)1,2, [Xiaohan Wang](https://scholar.google.com/citations?user=iGA10XoAAAAJ&hl=en)3, [Haipeng Luo]()4, [Jingdong Wang](https://jingdongwang2017.github.io/)2, [Yi Yang](https://scholar.google.com/citations?user=RMSuNFwAAAAJ&hl=en)3, [Wanli Ouyang](https://wlouyang.github.io/)4,1 + + +1[The University of Sydney](https://www.sydney.edu.au/), 2[Baidu](https://vis.baidu.com/#/), 3[ZJU](https://www.zju.edu.cn/english/), 4[UCAS](https://english.ucas.ac.cn/), 5[Shanghai AI Lab](https://www.shlab.org.cn/) + + + + + -I am currently traveling and may not be able to open-source the code until May. [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/bidirectional-cross-modal-knowledge/action-recognition-in-videos-on-ucf101)](https://paperswithcode.com/sota/action-recognition-in-videos-on-ucf101?p=bidirectional-cross-modal-knowledge) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/bidirectional-cross-modal-knowledge/action-recognition-in-videos-on-activitynet)](https://paperswithcode.com/sota/action-recognition-in-videos-on-activitynet?p=bidirectional-cross-modal-knowledge) @@ -22,21 +30,269 @@ I am currently traveling and may not be able to open-source the code until May. [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/bidirectional-cross-modal-knowledge/zero-shot-action-recognition-on-activitynet)](https://paperswithcode.com/sota/zero-shot-action-recognition-on-activitynet?p=bidirectional-cross-modal-knowledge) +This is the official implementation of our 🚴 **BIKE** (BIdirectional Knowledge Exploration), which leverages cross-modal bridge to enhance video recognition by exploring bidirectional knowledge. + + +
📣 I also have other cross-modal video projects that may interest you ✨.

+ + +> [**Revisiting Classifier: Transferring Vision-Language Models for Video Recognition**](https://arxiv.org/abs/2207.01297)
+> Accepted by AAAI 2023 | [[Text4Vis Code]](https://github.com/whwu95/Text4Vis)
+> Wenhao Wu, Zhun Sun, Wanli Ouyang + + +> [**Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?**](https://arxiv.org/abs/2301.00184)
+> Accepted by CVPR 2023 as 🌟Highlight🌟 | [[Cap4Video Code]](https://github.com/whwu95/Cap4Video)
+> Wenhao Wu, Haipeng Luo, Bo Fang, Jingdong Wang, Wanli Ouyang + +

+ +## News +- 🚀 **Efficient Training**: Almost all models can be trained using **8 NVIDIA 32G V100 GPUs**. Especially, we can train ViT-L/14 (336) backbone using **8 GPUs** and achieve **88.3%** Top-1 accuracy on Kinetics-400 dataset! +- `Apr 20, 2023`: Main training codes have been released, including single-node/multi-node multi-GPU distributed training. Thanks for your star 😝. +- `Feb 28, 2023`: 🎉Our **BIKE** has been accepted by **CVPR-2023**. + +## Overview +🚴**BIKE** explores bidirectional cross-modal knowledge from the pre-trained vision-language model (e.g., CLIP) to introduce auxiliary attributes and category-dependent temporal saliency for improved video recognition. + +![BIKE](docs/bike.png) + + +## Content +- [Prerequisites](#prerequisites) +- [Data Preparation](#data-preparation) +- [Model Zoo](#model-zoo) +- [Training](#training) +- [Testing](#testing) +- [BibTeX & Citation](#bibtex) +- [Acknowledgment](#acknowledgment) + + +## Prerequisites + +
The code is built with the following libraries.

+ +- [PyTorch](https://pytorch.org/) >= 1.8 +- RandAugment +- pprint +- tqdm +- dotmap +- yaml +- csv +- Optional: decord (for on-the-fly video training) +- Optional: torchnet (for mAP evaluation on ActivityNet) +

+
+
+
+## Data Preparation
+
+
+
+### Video Loader
+
+**(Recommended)** To train all of our models, we extract videos into frames for fast reading. Please refer to the [MVFNet](https://github.com/whwu95/MVFNet/blob/main/data_process/DATASETS.md) repo for a detailed guide on dataset processing.
+The annotation file is a plain-text file with one line per video; each line gives the directory containing the video's frames, the total number of frames, and the class label, separated by whitespace.
+
Example of annotation + +```sh +abseiling/-7kbO0v4hag_000107_000117 300 0 +abseiling/-bwYZwnwb8E_000013_000023 300 0 +``` +
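+To make the frame-list format concrete, the following is a minimal, hypothetical Python sketch (not the repo's actual `Video_dataset` loader) that parses one annotation line and uniformly samples `num_segments` frame paths, assuming frames are numbered from `img_00001.jpg` as in the `image_tmpl` used by the configs:
+
+```python
+import os
+import random
+
+def parse_annotation(line):
+    """Split one annotation line into (frame_dir, num_frames, label)."""
+    # rsplit tolerates spaces inside the directory name
+    frame_dir, num_frames, label = line.strip().rsplit(' ', 2)
+    return frame_dir, int(num_frames), int(label)
+
+def sample_frame_paths(root, line, num_segments=8, image_tmpl='img_{:05d}.jpg', random_shift=False):
+    """TSN-style sampling: pick one frame from each of num_segments equal temporal chunks."""
+    frame_dir, num_frames, label = parse_annotation(line)
+    seg_len = num_frames / num_segments
+    indices = []
+    for i in range(num_segments):
+        start = int(i * seg_len)
+        end = max(int((i + 1) * seg_len), start + 1)
+        idx = random.randrange(start, end) if random_shift else (start + end) // 2
+        indices.append(min(idx, num_frames - 1) + 1)   # assume frames are 1-indexed on disk
+    paths = [os.path.join(root, frame_dir, image_tmpl.format(i)) for i in indices]
+    return paths, label
+
+# Hypothetical usage with the example annotation above (root path is a placeholder)
+paths, label = sample_frame_paths('/path/to/k400/train_320_frames',
+                                  'abseiling/-7kbO0v4hag_000107_000117 300 0')
+```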
+
+(Optional) We can also decode the videos in an online fashion using [decord](https://github.com/dmlc/decord). This approach should work but has not been tested; all of the released models were trained on offline-extracted frames.
+
Example of annotation + +```sh + abseiling/-7kbO0v4hag_000107_000117.mp4 0 + abseiling/-bwYZwnwb8E_000013_000023.mp4 0 +``` +
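+If you do want on-the-fly decoding, a minimal sketch with [decord](https://github.com/dmlc/decord) could look like the following. It is illustrative only; `load_video_frames` is not part of this repo, and the sampling here is simple uniform sampling:
+
+```python
+import numpy as np
+from decord import VideoReader, cpu
+
+def load_video_frames(video_path, num_segments=8):
+    """Decode a video on the fly and return num_segments uniformly spaced RGB frames."""
+    vr = VideoReader(video_path, ctx=cpu(0))
+    total = len(vr)
+    # take the center frame of each of num_segments equal temporal chunks
+    indices = np.linspace(0, total - 1, num_segments * 2 + 1)[1::2].astype(int)
+    frames = vr.get_batch(indices.tolist()).asnumpy()   # (num_segments, H, W, 3), uint8
+    return frames
+
+frames = load_video_frames('abseiling/-7kbO0v4hag_000107_000117.mp4', num_segments=8)
+```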
+ + +### Annotation +Annotation information consists of two parts: video label, and category description. + +- Video Label: As mentioned above, this part is same as the traditional video recognition. Please refer to [lists/k400/kinetics_rgb_train_se320.txt](lists/k400/kinetics_rgb_train_se320.txt) for the format. +- Category Description: We also need a textual description for each video category. Please refer to [lists/k400/kinetics_400_labels.csv](lists/k400/kinetics_400_labels.csv) for the format. + + + + +## 📱 Model Zoo + +Here we provide some off-the-shelf pre-trained checkpoints of our models in the following tables. + +- *# Inference Views = # Temporal clips x # Spatial crops* +### Kinetics-400 + +| Architecture | Input | Views | Top-1(%) | checkpoint | Train log| config| +|:------------:|:-------------------:|:------------------:|:-----------------:|:--------------:|:--------------:|:--------------:| +| ViT-B/32 | 8x2242 | 4x3 |80.5 | [Github](https://github.com/whwu95/Text4Vis/releases/download/v1/k400-vitb-32-f16.pt) | [log](exps/k400/ViT-B/32/f16/log.txt) | [config](configs/k400/k400_train_rgb_vitb-32-f16.yaml) | +| ViT-B/16 | 8x2242 | 4x3 | 84.0 | [Github]() | [log](exps/k400/ViT-B/16/f8/log.txt) | [config](configs/k400/k400_train_rgb_vitb-16-f8.yaml) | +| ViT-L/14* | 8x2242 | 4x3| 87.4 | [OneDrive]() | [log](exps/k400/ViT-L/14/f8/log.txt) | [config](configs/k400/k400_train_rgb_vitl-14-f8.yaml) | +| ViT-L/14 | 16x2242 | 4x3| 88.1| [OneDrive]() | [log](exps/k400/ViT-L/14/f16/log.txt) | [config](configs/k400/k400_train_rgb_vitl-14-f16.yaml) | +| ViT-L/14 | 8x3362 | 4x3 | 88.3 | [OneDrive]() | [log](exps/k400/ViT-L/14-336px/f8/log.txt) | [config](configs/k400/k400_train_rgb_vitl-14-336-f8.yaml) | +| ViT-L/14 | 16x3362 | 4x3 | 88.7 | [OneDrive]() | [log](exps/k400/ViT-L/14-336px/f16/log.txt) | [config](configs/k400/k400_train_rgb_vitl-14-336-f16.yaml) | + + +*Note: * indicates that in the paper, a ViT-L model with 8 frames was used for zero-shot evaluation on UCF, HMDB, ActivityNet, and Kinetics-600.* + +### Untrimmed Video Recognition: ActivityNet +| Architecture | Input | Views | Top-1 (%) | mAP (%) | checkpoint | Train log| config| +|:------------:|:-------------------:|:------------------:|:-----------------:|:--------------:|:--------------:|:--------------:|:--------------:| +| ViT-L/14 | 16x2242 | 4x1| 94.4 | 96.3| [OneDrive]() | [log](exps/anet/ViT-L/14/f16/log.txt) | [config](configs/anet/anet_k400_finetune.yaml) | +| ViT-L/14 | 16x3362 | 4x1| 94.7 | 96.3 | [OneDrive]() | [log](exps/anet/ViT-L/14-336px/f16/log.txt) | [config](configs/anet/anet_k400_finetune_336.yaml) | + +### Multi-label Action Recognition: Charades +| Architecture | Input | Views | mAP (%) | checkpoint | Train log| config| +|:------------:|:-------------------:|:------------------:|:-----------------:|:--------------:|:--------------:|:--------------:| +| ViT-L/14 | 16x3362 | 4x1| 50.4 | [OneDrive]() | [log](exp_nce/charades/ViT-L/14-336px/16f/log.txt) | [config](configs/charades/charades_k400_finetune_336.yaml) | -## 📣 Updates -- [x] **[Feb 28, 2023]** 🎉Our **BIKE** has been accepted by **CVPR-2023**. 
+### UCF-101 +| Architecture | Input | Views | Top-1 (%) | checkpoint | Train log| config| +|:------------:|:-------------------:|:------------------:|:-----------------:|:--------------:|:--------------:|:--------------:| +| ViT-L/14 | 16x2242 | 1x1 | 98.8 | [OneDrive]() | [log](exps/ucf101/ViT-L/14/f16/log.txt) | [config](configs/ucf101/ucf_k400_finetune.yaml) | +| ViT-L/14 | 16x3362 | 1x1 | 98.6 | [OneDrive]()| [log](exps/ucf101/ViT-L/14-336px/f16/log.txt) | [config](configs/ucf101/ucf_k400_finetune_336.yaml) | -## 📌 Bibtex -If you find this repository useful, please star🌟 this repo and cite📑 our paper: +### HMDB-51 +| Architecture | Input | Views | Top-1 (%) | checkpoint | Train log| config| +|:------------:|:-------------------:|:------------------:|:-----------------:|:--------------:|:--------------:|:--------------:| +| ViT-L/14 | 16x2242 | 1x1 | 82.2 | [OneDrive]() | [log](exps/hmdb51/ViT-L/14/f16/log.txt) | [config](configs/hmdb51/hmdb_k400_finetune.yaml) | +| ViT-L/14 | 16x3362 | 1x1 | 83.1 | [OneDrive]()| [log](exps/hmdb51/ViT-L/14-336px/f16/log.txt) | [config](configs/hmdb51/hmdb_k400_finetune_336.yaml) | + + + + +## 🚀 Training +This implementation supports Multi-GPU `DistributedDataParallel` training, which is faster and simpler than `DataParallel` training. + +1. **Single Machine**: To train our model on Kinetics-400 with 8 GPUs in *Single Machine*, you can run: +```sh +# For example, we train the 8 Frames ViT-B/32 video model (i.e., video branch). +sh scripts/run_train.sh configs/k400/k400_train_rgb_vitb-32-f8.yaml + +# We train the attributes branch. +sh scripts/run_train_attributes.sh configs/k400/k400_train_rgb_vitb-32-f8.yaml +``` + +
2. **Multiple Machines**: We also provide the script to train larger models on multiple machines (e.g., 2 nodes with 16 GPUs in total).
+
+```sh
+# For example, we train the 8-frame ViT-L/14-336 with 2 machines as follows:
+# On the first machine, set --master_addr to the IP of the first machine and --nnodes to 2.
+# Compared with the single-machine training script, only the node_id argument is added.
+sh scripts/run_train_multinodes.sh configs/k400/k400_train_rgb_vitl-14-336-f8.yaml 0
+
+# On the second machine, --master_addr is still the IP of the first machine.
+sh scripts/run_train_multinodes.sh configs/k400/k400_train_rgb_vitl-14-336-f8.yaml 1
+```
+
+ + +
3. **Few-shot Recognition**: To train our model in the few-shot scenario, you only need to add one line to the general config file.
+
+```yaml
+# You can refer to configs/k400/k400_few_shot.yaml
+data:
+  ...        # general configurations
+  shot: 2    # i.e., 2-shot setting
+```
+
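+Conceptually, the k-shot subset keeps only `shot` videos per class from the full training list. The helper below is a hypothetical sketch of that idea, not the repository's actual sampling code (the output file name is made up):
+
+```python
+import random
+from collections import defaultdict
+
+def make_few_shot_list(train_list, out_list, shot=2, seed=1024):
+    """Keep only `shot` randomly chosen videos per class from a frame-annotation list."""
+    per_class = defaultdict(list)
+    with open(train_list) as f:
+        for line in f:
+            label = line.strip().rsplit(' ', 1)[-1]   # last field is the class label
+            per_class[label].append(line)
+    rng = random.Random(seed)
+    with open(out_list, 'w') as f:
+        for lines in per_class.values():
+            f.writelines(rng.sample(lines, min(shot, len(lines))))
+
+make_few_shot_list('lists/k400/kinetics_rgb_train_se320.txt',
+                   'lists/k400/k400_train_2shot.txt', shot=2)
+```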
+ + +## ⚡ Testing +We support single-view validation (default) and multi-view (4x3 views) validation. + +```sh +# The testing command for obtaining top-1/top-5 accuracy. +sh scripts/run_test.sh Your-Config.yaml Your-Trained-Model.pt + +# The command for zero-shot evaluation is similar. +sh scripts/run_test_zeroshot.sh Your-Config.yaml Your-Trained-Model.pt +``` + +We provide more examples of testing commands below. + + +
General / Few-shot Video Recognition
+
+```sh
+# Efficient Setting: Single-view evaluation.
+# E.g., 8-frame ViT-L/14 on Kinetics-400. You should get around 86.5% top-1 accuracy.
+sh scripts/run_test.sh configs/k400/k400_train_rgb_vitl-14-f8.yaml exps/k400/ViT-L/14/8f/model_best.pt
+
+# Accurate Setting: Multi-view evaluation (4 clips x 3 crops).
+# You should get around 87.4% top-1 accuracy.
+sh scripts/run_test.sh configs/k400/k400_train_rgb_vitl-14-f8.yaml exps/k400/ViT-L/14/8f/model_best.pt --test_crops 3 --test_clips 4
+
+# Test the Charades dataset using the mAP metric. You should achieve around 50.4 mAP.
+sh scripts/run_test_charades.sh configs/charades/charades_k400_finetune_336.yaml exps/charades/ViT-L/14-336px/16f/model_best.pt --test_crops 1 --test_clips 4
+
+# Test the ActivityNet dataset using the top-1 and mAP metrics. You should achieve around 96.1 mAP.
+sh scripts/run_test_anet.sh configs/anet/anet_k400_finetune_336.yaml exps/anet/ViT-L/14-336px/16f/model_best.pt
+```
+ + +
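+As a side note on how multi-view scores are fused at test time: `test.py` applies a softmax to each view's class scores and then averages them over the views (clips x crops). The NumPy sketch below illustrates that fusion; `aggregate_views` is an illustrative helper, not code from this repo, and the batch/view layout is assumed:
+
+```python
+import numpy as np
+
+def aggregate_views(logits, num_clips=4, num_crops=3):
+    """Average per-view softmax scores; logits: (batch * clips * crops, num_classes)."""
+    num_views = num_clips * num_crops
+    scores = np.exp(logits - logits.max(axis=-1, keepdims=True))
+    scores /= scores.sum(axis=-1, keepdims=True)               # softmax per view
+    scores = scores.reshape(-1, num_views, scores.shape[-1])   # (batch, views, classes)
+    return scores.mean(axis=1)                                 # (batch, classes)
+
+logits = np.random.randn(2 * 12, 400)        # e.g., 2 videos x 12 views x 400 classes
+predictions = aggregate_views(logits).argmax(axis=1)
+```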
Zero-shot Evaluation

+
+
+We use the Kinetics-400 pre-trained model (e.g., [ViT-L/14 with 8 frames](configs/k400/k400_train_rgb_vitl-14-f8.yaml)) to perform cross-dataset zero-shot evaluation on UCF101, HMDB51, ActivityNet, and Kinetics-600.
+
+
+- Half-classes Evaluation: A traditional protocol that randomly selects half of the test dataset's classes, repeats this selection ten times, and reports the mean accuracy with the standard deviation over the ten runs.
+
+
+- Full-classes Evaluation: Perform evaluation on the entire dataset.
+
+```sh
+# On ActivityNet: reporting the half-classes and full-classes results
+# Half-classes: 86.18 ± 1.05, Full-classes: 80.04
+sh scripts/run_test_zeroshot.sh configs/anet/anet_zero_shot.yaml exps/k400/ViT-L/14/f8/last_model.pt
+
+# On UCF101: reporting the half-classes and full-classes results
+# Half-classes: 86.63 ± 3.4, Full-classes: 80.83
+sh scripts/run_test_zeroshot.sh configs/ucf101/ucf_zero_shot.yaml exps/k400/ViT-L/14/f8/last_model.pt
+
+# On HMDB51: reporting the half-classes and full-classes results
+# Half-classes: 61.37 ± 3.68, Full-classes: 52.75
+sh scripts/run_test_zeroshot.sh configs/hmdb51/hmdb_zero_shot.yaml exps/k400/ViT-L/14/f8/last_model.pt
+
+# On Kinetics-600: manually compute the mean accuracy and standard deviation over the three splits.
+# Split1: 70.14, Split2: 68.31, Split3: 67.15
+# Average: 68.53 ± 1.23
+sh scripts/run_test.sh configs/k600/k600_zero_shot_split1.yaml exps/k400/ViT-L/14/f8/last_model.pt
+sh scripts/run_test.sh configs/k600/k600_zero_shot_split2.yaml exps/k400/ViT-L/14/f8/last_model.pt
+sh scripts/run_test.sh configs/k600/k600_zero_shot_split3.yaml exps/k400/ViT-L/14/f8/last_model.pt
+```
+
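+The half-classes numbers above are the mean ± standard deviation over ten random half-class subsets. The snippet below is a small, hypothetical sketch of how such a protocol can be scored offline from a per-video score matrix; `half_class_protocol` and the data layout are assumptions for illustration:
+
+```python
+import numpy as np
+
+def half_class_protocol(scores, labels, num_trials=10, seed=0):
+    """scores: (num_videos, num_classes) similarity matrix; labels: (num_videos,) ground truth."""
+    rng = np.random.default_rng(seed)
+    num_classes = scores.shape[1]
+    accs = []
+    for _ in range(num_trials):
+        subset = rng.choice(num_classes, num_classes // 2, replace=False)
+        keep = np.isin(labels, subset)               # keep videos whose class is in the subset
+        sub_scores = scores[keep][:, subset]         # restrict prediction to the sampled classes
+        pred = subset[sub_scores.argmax(axis=1)]
+        accs.append((pred == labels[keep]).mean() * 100)
+    return np.mean(accs), np.std(accs)
+
+# Toy example with random scores, 101 classes (UCF101-like)
+mean_acc, std_acc = half_class_protocol(np.random.rand(500, 101), np.random.randint(0, 101, 500))
+print(f'{mean_acc:.2f} ± {std_acc:.2f}')
+```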
+ + +## 📌 BibTeX & Citation + +If you use our code in your research or wish to refer to the baseline results, please use the following BibTeX entry. + + +```bibtex @inproceedings{bike, title={Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models}, author={Wu, Wenhao and Wang, Xiaohan and Luo, Haipeng and Wang, Jingdong and Yang, Yi and Ouyang, Wanli}, - booktitle=CVPR, + booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}, year={2023} } ``` + + +## 🎗️ Acknowledgement + +This repository is built based on [Text4Vis](https://github.com/whwu95/Text4Vis) and [CLIP](https://github.com/openai/CLIP). Sincere thanks to their wonderful works. + + +## 👫 Contact +For any question, please file an issue or contact [Wenhao Wu](https://whwu95.github.io/) diff --git a/configs/k400/k400_train_rgb_vitb-16-f8.yaml b/configs/k400/k400_train_rgb_vitb-16-f8.yaml new file mode 100644 index 0000000..053cef6 --- /dev/null +++ b/configs/k400/k400_train_rgb_vitb-16-f8.yaml @@ -0,0 +1,49 @@ +resume: +pretrain: +seed: 1024 +data: + dataset: k400 + modality: RGB + num_segments: 8 + seg_length: 1 + batch_size: 32 + workers: 4 + num_classes: 400 + image_tmpl: 'img_{:05d}.jpg' + train_root: '/bpfs/v2_mnt/VIS/test/k400/train_320_frames' + train_list: 'lists/k400/kinetics_rgb_train_se320.txt' + val_root: /bpfs/v2_mnt/VIS/test/k400/kinetics_400_val_320_opencv + val_list: lists/k400/kinetics_rgb_val_se320.txt + label_list: 'lists/k400/kinetics_400_labels.csv' + input_size: 224 + random_shift: True + output_path: exps +network: + arch: ViT-B/16 #ViT-B/32 ViT-B/16 + init: True + tm: False + drop_out: 0.0 + emb_dropout: 0.0 + type: clip_k400 + sim_header: Transf # Transf None + interaction: VCS # DP VCS + joint_st: False + drop: 0 + fix_text: True + fix_video: False +solver: + type: cosine + epochs: 30 + start_epoch: 0 + epoch_offset: 0 + optim: adamw + lr: 5.e-5 + lr_warmup_step: 5 + weight_decay: 0.2 + loss_type: NCE + evaluate: False + clip_ratio: 0.1 + grad_accumulation_steps: 1 +logging: + print_freq: 10 + eval_freq: 1 \ No newline at end of file diff --git a/configs/k400/k400_train_rgb_vitl-14-336-f16.yaml b/configs/k400/k400_train_rgb_vitl-14-336-f16.yaml new file mode 100644 index 0000000..076fdbe --- /dev/null +++ b/configs/k400/k400_train_rgb_vitl-14-336-f16.yaml @@ -0,0 +1,49 @@ +resume: +pretrain: +seed: 1024 +data: + dataset: k400 + modality: RGB + num_segments: 16 + seg_length: 1 + batch_size: 8 # mem: 26G + workers: 4 + num_classes: 400 + image_tmpl: 'img_{:05d}.jpg' + train_root: '/bpfs/v2_mnt/VIS/test/k400/train_320_frames' + train_list: 'lists/k400/kinetics_rgb_train_se320.txt' + val_root: /bpfs/v2_mnt/VIS/test/k400/kinetics_400_val_320_opencv + val_list: lists/k400/kinetics_rgb_val_se320.txt + label_list: 'lists/k400/kinetics_400_labels.csv' + input_size: 336 + random_shift: True + output_path: exps +network: + arch: ViT-L/14-336px #ViT-B/32 ViT-B/16 + init: True + tm: False + drop_out: 0.0 + emb_dropout: 0.0 + type: clip_k400 + sim_header: Transf # Transf None + interaction: VCS # DP VCS + joint_st: False + drop: 0 + fix_text: True + fix_video: False +solver: + type: cosine + epochs: 20 + start_epoch: 0 + epoch_offset: 0 + optim: adamw + lr: 5.e-5 + lr_warmup_step: 5 + weight_decay: 0.2 + loss_type: NCE + evaluate: False + clip_ratio: 0.1 + grad_accumulation_steps: 1 +logging: + print_freq: 10 + eval_freq: 1 \ No newline at end of file diff --git a/configs/k400/k400_train_rgb_vitl-14-336-f32.yaml 
b/configs/k400/k400_train_rgb_vitl-14-336-f32.yaml new file mode 100644 index 0000000..26f6004 --- /dev/null +++ b/configs/k400/k400_train_rgb_vitl-14-336-f32.yaml @@ -0,0 +1,49 @@ +resume: +pretrain: +seed: 1024 +data: + dataset: k400 + modality: RGB + num_segments: 32 + seg_length: 1 + batch_size: 5 # mem: 31G + workers: 4 + num_classes: 400 + image_tmpl: 'img_{:05d}.jpg' + train_root: '/bpfs/v2_mnt/VIS/test/k400/train_320_frames' + train_list: 'lists/k400/kinetics_rgb_train_se320.txt' + val_root: /bpfs/v2_mnt/VIS/test/k400/kinetics_400_val_320_opencv + val_list: lists/k400/kinetics_rgb_val_se320.txt + label_list: 'lists/k400/kinetics_400_labels.csv' + input_size: 336 + random_shift: True + output_path: exps +network: + arch: ViT-L/14-336px #ViT-B/32 ViT-B/16 + init: True + tm: False # False tsm tokent1d tokenshift + drop_out: 0.0 + emb_dropout: 0.0 + type: clip_k400 + sim_header: Transf # Transf None + interaction: VCS # DP VCS + joint_st: False + drop: 0 + fix_text: True + fix_video: False +solver: + type: cosine + epochs: 20 + start_epoch: 0 + epoch_offset: 0 + optim: adamw + lr: 5.e-5 + lr_warmup_step: 5 + weight_decay: 0.2 + loss_type: NCE + evaluate: False + clip_ratio: 0.1 + grad_accumulation_steps: 1 +logging: + print_freq: 10 + eval_freq: 1 \ No newline at end of file diff --git a/configs/k400/k400_train_rgb_vitl-14-336-f8.yaml b/configs/k400/k400_train_rgb_vitl-14-336-f8.yaml new file mode 100644 index 0000000..2d720cc --- /dev/null +++ b/configs/k400/k400_train_rgb_vitl-14-336-f8.yaml @@ -0,0 +1,49 @@ +resume: +pretrain: +seed: 1024 +data: + dataset: k400 + modality: RGB + num_segments: 8 + seg_length: 1 + batch_size: 16 + workers: 4 + num_classes: 400 + image_tmpl: 'img_{:05d}.jpg' + train_root: '/bpfs/v2_mnt/VIS/test/k400/train_320_frames' + train_list: 'lists/k400/kinetics_rgb_train_se320.txt' + val_root: /bpfs/v2_mnt/VIS/test/k400/kinetics_400_val_320_opencv + val_list: lists/k400/kinetics_rgb_val_se320.txt + label_list: 'lists/k400/kinetics_400_labels.csv' + input_size: 336 + random_shift: True + output_path: exps +network: + arch: ViT-L/14-336px #ViT-B/32 ViT-B/16 + init: True + tm: False + drop_out: 0.0 + emb_dropout: 0.0 + type: clip_k400 + sim_header: Transf # Transf None + interaction: VCS # DP VCS + joint_st: False + drop: 0 + fix_text: True + fix_video: False +solver: + type: cosine + epochs: 20 + start_epoch: 0 + epoch_offset: 0 + optim: adamw + lr: 5.e-5 + lr_warmup_step: 5 + weight_decay: 0.2 + loss_type: NCE + evaluate: False + clip_ratio: 0.1 + grad_accumulation_steps: 1 +logging: + print_freq: 10 + eval_freq: 1 \ No newline at end of file diff --git a/configs/k400/k400_train_rgb_vitl-14-f16.yaml b/configs/k400/k400_train_rgb_vitl-14-f16.yaml new file mode 100644 index 0000000..93eb3c7 --- /dev/null +++ b/configs/k400/k400_train_rgb_vitl-14-f16.yaml @@ -0,0 +1,49 @@ +resume: +pretrain: +seed: 1024 +data: + dataset: k400 + modality: RGB + num_segments: 16 + seg_length: 1 + batch_size: 32 + workers: 4 + num_classes: 400 + image_tmpl: 'img_{:05d}.jpg' + train_root: '/bpfs/v2_mnt/VIS/test/k400/train_320_frames' + train_list: 'lists/k400/kinetics_rgb_train_se320.txt' + val_root: /bpfs/v2_mnt/VIS/test/k400/kinetics_400_val_320_opencv + val_list: lists/k400/kinetics_rgb_val_se320.txt + label_list: 'lists/k400/kinetics_400_labels.csv' + input_size: 224 + random_shift: True + output_path: exps +network: + arch: ViT-L/14 #ViT-B/32 ViT-B/16 + init: True + tm: False + drop_out: 0.0 + emb_dropout: 0.0 + type: clip_k400 + sim_header: Transf # Transf None + interaction: VCS 
# DP VCS + joint_st: False + drop: 0 + fix_text: True + fix_video: False +solver: + type: cosine + epochs: 20 + start_epoch: 0 + epoch_offset: 0 + optim: adamw + lr: 5.e-5 + lr_warmup_step: 5 + weight_decay: 0.2 + loss_type: NCE + evaluate: False + clip_ratio: 0.1 + grad_accumulation_steps: 1 +logging: + print_freq: 10 + eval_freq: 1 \ No newline at end of file diff --git a/configs/k400/k400_train_rgb_vitl-14-f8.yaml b/configs/k400/k400_train_rgb_vitl-14-f8.yaml new file mode 100644 index 0000000..b5651fc --- /dev/null +++ b/configs/k400/k400_train_rgb_vitl-14-f8.yaml @@ -0,0 +1,49 @@ +resume: +pretrain: +seed: 1024 +data: + dataset: k400 + modality: RGB + num_segments: 8 + seg_length: 1 + batch_size: 32 + workers: 4 + num_classes: 400 + image_tmpl: 'img_{:05d}.jpg' + train_root: '/bpfs/v2_mnt/VIS/test/k400/train_320_frames' + train_list: 'lists/k400/kinetics_rgb_train_se320.txt' + val_root: /bpfs/v2_mnt/VIS/test/k400/kinetics_400_val_320_opencv + val_list: lists/k400/kinetics_rgb_val_se320.txt + label_list: 'lists/k400/kinetics_400_labels.csv' + input_size: 224 + random_shift: True + output_path: exps +network: + arch: ViT-L/14 #ViT-B/32 ViT-B/16 + init: True + tm: False + drop_out: 0.0 + emb_dropout: 0.0 + type: clip_k400 + sim_header: Transf # Transf None + interaction: VCS # DP VCS + joint_st: False + drop: 0 + fix_text: True + fix_video: False +solver: + type: cosine + epochs: 20 + start_epoch: 0 + epoch_offset: 0 + optim: adamw + lr: 5.e-5 + lr_warmup_step: 5 + weight_decay: 0.2 + loss_type: NCE + evaluate: False + clip_ratio: 0.1 + grad_accumulation_steps: 1 +logging: + print_freq: 10 + eval_freq: 1 \ No newline at end of file diff --git a/docs/bike.png b/docs/bike.png new file mode 100644 index 0000000..09b8ddb Binary files /dev/null and b/docs/bike.png differ diff --git a/scripts/run_test.sh b/scripts/run_test.sh new file mode 100644 index 0000000..faf1d58 --- /dev/null +++ b/scripts/run_test.sh @@ -0,0 +1,12 @@ +#!/usr/bin/env bash +if [ -f $1 ]; then + config=$1 +else + echo "need a config file" + exit +fi + +weight=$2 + +python -m torch.distributed.launch --master_port 1239 --nproc_per_node=8 \ + test.py --config ${config} --weights ${weight} ${@:3} \ No newline at end of file diff --git a/test.py b/test.py new file mode 100644 index 0000000..af019ac --- /dev/null +++ b/test.py @@ -0,0 +1,226 @@ +import os +import argparse + +import torch +import torch.nn as nn +from torch.utils.data import DataLoader +from torch.nn.parallel import DistributedDataParallel +import torch.distributed as dist +import torch.backends.cudnn as cudnn +import torchvision +import torch.nn.functional as F +import time +from utils.utils import init_distributed_mode, AverageMeter, reduce_tensor, accuracy +import clip + +import yaml +from dotmap import DotMap +from datasets.video import Video_dataset +from datasets.transforms import GroupScale, GroupCenterCrop, Stack, ToTorchFormatTensor, GroupNormalize, GroupOverSample, GroupFullResSample +from modules.video_clip import video_header +from modules.text_prompt import text_prompt + + +def get_parser(): + parser = argparse.ArgumentParser() + parser.add_argument('--config', type=str, help='global config file') + parser.add_argument('--weights', type=str, default=None) + parser.add_argument('--dist_url', default='env://', + help='url used to set up distributed training') + parser.add_argument('--world_size', default=1, type=int, + help='number of distributed processes') + parser.add_argument("--local_rank", type=int, + help='local rank for 
DistributedDataParallel') + parser.add_argument( + "--precision", + choices=["amp", "fp16", "fp32"], + default="amp", + help="Floating point precition." + ) + parser.add_argument('--test_crops', type=int, default=1) + parser.add_argument('--test_clips', type=int, default=1) + parser.add_argument('--dense', default=False, action="store_true", + help='use dense sample for test as in Non-local I3D') + args = parser.parse_args() + return args + +def update_dict(dict): + new_dict = {} + for k, v in dict.items(): + new_dict[k.replace('module.', '')] = v + return new_dict + + +def main(args): + init_distributed_mode(args) + + with open(args.config, 'r') as f: + config = yaml.load(f, Loader=yaml.FullLoader) + + config = DotMap(config) + + device = "cpu" + if torch.cuda.is_available(): + device = "cuda" + cudnn.benchmark = True + + # get fp16 model and weight + model, clip_state_dict = clip.load( + config.network.arch, + device='cpu', jit=False, + internal_modeling=config.network.tm, + T=config.data.num_segments, + dropout=config.network.drop_out, + emb_dropout=config.network.emb_dropout, + pretrain=config.network.init, + joint_st= config.network.joint_st) # Must set jit=False for training ViT-B/32 + + video_head = video_header( + config.network.sim_header, + config.network.interaction, + clip_state_dict) + + if args.precision == "amp" or args.precision == "fp32": + model = model.float() + + + input_mean = [0.48145466, 0.4578275, 0.40821073] + input_std = [0.26862954, 0.26130258, 0.27577711] + + # rescale size + if 'something' in config.data.dataset: + scale_size = (240, 320) + else: + scale_size = 256 if config.data.input_size == 224 else config.data.input_size + + # crop size + input_size = config.data.input_size + + # control the spatial crop + if args.test_crops == 1: # one crop + cropping = torchvision.transforms.Compose([ + GroupScale(scale_size), + GroupCenterCrop(input_size), + ]) + elif args.test_crops == 3: # do not flip, so only 3 crops (left right center) + cropping = torchvision.transforms.Compose([ + GroupFullResSample( + crop_size=input_size, + scale_size=scale_size, + flip=False) + ]) + elif args.test_crops == 5: # do not flip, so only 5 crops + cropping = torchvision.transforms.Compose([ + GroupOverSample( + crop_size=input_size, + scale_size=scale_size, + flip=False) + ]) + elif args.test_crops == 10: + cropping = torchvision.transforms.Compose([ + GroupOverSample( + crop_size=input_size, + scale_size=scale_size, + ) + ]) + else: + raise ValueError("Only 1, 3, 5, 10 crops are supported while we got {}".format(args.test_crops)) + + + val_data = Video_dataset( + config.data.val_root, config.data.val_list, config.data.label_list, + random_shift=False, num_segments=config.data.num_segments, + modality=config.data.modality, + image_tmpl=config.data.image_tmpl, + test_mode=True, + transform=torchvision.transforms.Compose([ + cropping, + Stack(roll=False), + ToTorchFormatTensor(div=True), + GroupNormalize(input_mean, input_std), + ]), + dense_sample=args.dense, + test_clips=args.test_clips) + + val_sampler = torch.utils.data.distributed.DistributedSampler(val_data) + val_loader = DataLoader(val_data, + batch_size=config.data.batch_size, num_workers=config.data.workers, + sampler=val_sampler, pin_memory=True, drop_last=False) + + if os.path.isfile(args.weights): + checkpoint = torch.load(args.weights, map_location='cpu') + if dist.get_rank() == 0: + print('load model: epoch {}'.format(checkpoint['epoch'])) + + model.load_state_dict(update_dict(checkpoint['model_state_dict'])) + 
video_head.load_state_dict(update_dict(checkpoint['fusion_model_state_dict'])) + del checkpoint + + if args.distributed: + model = DistributedDataParallel(model.cuda(), device_ids=[args.gpu], find_unused_parameters=True) + if config.network.sim_header != "None": + video_head = DistributedDataParallel(video_head.cuda(), device_ids=[args.gpu]) + + classes = text_prompt(val_data) + n_class = classes.size(0) + + prec1 = validate( + val_loader, classes, device, + model, video_head, config, n_class, args.test_crops, args.test_clips) + return + + +def validate(val_loader, classes, device, model, video_head, config, n_class, test_crops, test_clips): + top1 = AverageMeter() + top5 = AverageMeter() + model.eval() + video_head.eval() + proc_start_time = time.time() + + with torch.no_grad(): + text_inputs = classes.to(device) + cls_feature, text_features = model.module.encode_text(text_inputs, return_token=True) + for i, (image, class_id) in enumerate(val_loader): + batch_size = class_id.numel() + num_crop = test_crops + + num_crop *= test_clips # 4 clips for testing when using dense sample + + class_id = class_id.to(device) + n_seg = config.data.num_segments + image = image.view((-1, n_seg, 3) + image.size()[-2:]) + b, t, c, h, w = image.size() + image_input = image.to(device).view(-1, c, h, w) + image_features = model.module.encode_image(image_input).view(b, t, -1) + cnt_time = time.time() - proc_start_time + similarity = video_head(image_features, text_features, cls_feature) # b dim + similarity = F.softmax(similarity, -1) + similarity = similarity.reshape(batch_size, num_crop, -1).mean(1) # bs dim + + similarity = similarity.view(batch_size, -1, n_class).softmax(dim=-1) + similarity = similarity.mean(dim=1, keepdim=False) + + prec = accuracy(similarity, class_id, topk=(1, 5)) + prec1 = reduce_tensor(prec[0]) + prec5 = reduce_tensor(prec[1]) + + top1.update(prec1.item(), class_id.size(0)) + top5.update(prec5.item(), class_id.size(0)) + + if i % config.logging.print_freq == 0 and dist.get_rank() == 0: + runtime = float(cnt_time) / (i+1) / (batch_size * dist.get_world_size()) + print( + ('Test: [{0}/{1}], average {runtime:.4f} sec/video \t' + 'Prec@1 {top1.val:.3f} ({top1.avg:.3f})\t' + 'Prec@5 {top5.val:.3f} ({top5.avg:.3f})'.format( + i, len(val_loader), runtime=runtime, top1=top1, top5=top5))) + + if dist.get_rank() == 0: + print('-----Evaluation is finished------') + print('Overall Prec@1 {:.03f}% Prec@5 {:.03f}%'.format(top1.avg, top5.avg)) + return top1.avg + +if __name__ == '__main__': + args = get_parser() + main(args) +