This is an official implementation of our work, Select and Distill: Selective Dual-Teacher Knowledge Transfer for Continual Learning on Vision-Language Models, accepted to ECCV'24.
[2025/01/19] The model checkpoints have also been uploaded! Check here for more details.
[2025/01/19] The instruction page is ready! We plan to release our original checkpoints soon.
[2024/12/31] Our full codebase has been released! The introduction and installation instructions (including required packages) will be added soon.
Create a new Conda environment with Python 3.10.14:
conda create -n snd python==3.10.14
Activate the environment and install PyTorch with the specified version and CUDA support:
conda install pytorch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 pytorch-cuda=12.1 -c pytorch -c nvidia
Install additional dependencies using the provided requirements.txt file:
pip install -r requirements.txt
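After installation, you can optionally run a quick sanity check (not part of our scripts) to confirm that the expected versions and CUDA are visible from Python:

```python
# Optional sanity check for the environment; not part of the released scripts.
import torch
import torchvision

print(torch.__version__)          # expected: 2.2.0
print(torchvision.__version__)    # expected: 0.17.0
print(torch.cuda.is_available())  # should be True on a CUDA-capable machine
print(torch.cuda.device_count())  # number of visible GPUs
```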
To reproduce our experiments, download the following datasets according to the guidance provided here.
- FGVCAircraft
- DTD
- EuroSAT
- Flowers102
- Food101
- OxfordPets
- StanfordCars
- UCF101
- ImageNet
Organize each dataset in the following directory structure:
<DATASET_NAME>/
├── images/
│ ├── image data / folders
└── <DATASET_NAME>_annotations.json
The <DATASET_NAME>_annotations.json file contains the training, validation, and test splits, along with the class names. The files we used for all datasets are provided here. Download these files and place them in the appropriate paths as described above.
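If you want to double-check your layout, a small helper like the one below can be used. It is only a sketch: the folder names and the keys inside the annotation file are assumptions, so adjust them to match the downloaded files.

```python
# Hypothetical helper to verify the dataset layout described above.
# Folder names and JSON keys are assumptions; adjust to your setup.
import json
from pathlib import Path

def check_dataset(root: str, name: str) -> None:
    base = Path(root) / name
    assert (base / "images").is_dir(), f"missing {base / 'images'}"
    ann_path = base / f"{name}_annotations.json"
    assert ann_path.is_file(), f"missing {ann_path}"
    with open(ann_path) as f:
        annotations = json.load(f)
    # The annotation file holds the train/val/test splits and class names.
    print(name, "->", list(annotations.keys()))

check_dataset("/path/to/your/datasets", "dtd")
```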
We provide our original model checkpoints for public use. Due to limited storage space, only the last checkpoints for each training sequence are released.
Unfortunately, while reproducing our experiments, we observed a slight performance drop (0.08% in mean scores). This discrepancy may be attributed to differences in hardware or package versions. Despite this minor variation, our method still achieves state-of-the-art performance compared to previous works.
You can access the model checkpoints and the reproduced average accuracy scores here.
We provide several scripts to help you easily reproduce our experiments. Our experiments were conducted on 4x V100 GPUs in distributed parallel mode. Note that we have not tested our method outside of distributed mode; if you have only one GPU, still run the code in distributed mode with --nproc_per_node set to 1.
Before running the scripts, ensure that the root paths to your dataset folders are correctly configured in all files within the configs/ directory. Specifically, update the data.root attribute to point to your dataset's root directory.
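If you prefer not to edit each file by hand, a one-off snippet along these lines can rewrite the path in every config. It assumes the configs are plain YAML files with a top-level data section holding a root field; verify this against the released configs before running it.

```python
# One-off helper to point data.root at your dataset root in every config file.
# Assumes plain YAML configs with a top-level "data" section; adjust if needed.
import glob
import yaml  # PyYAML

DATA_ROOT = "/path/to/your/datasets"

for path in glob.glob("configs/*.yaml"):
    with open(path) as f:
        cfg = yaml.safe_load(f)
    if isinstance(cfg, dict) and "data" in cfg:
        cfg["data"]["root"] = DATA_ROOT
        with open(path, "w") as f:
            yaml.safe_dump(cfg, f, sort_keys=False)
        print(f"updated {path}")
```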
Other configuration attributes do not need modification, as our scripts will automatically adjust them during runtime. However, you may modify these attributes if you wish to experiment with different hyper-parameters.
The following script allows training on a single dataset (e.g., fgvc-aircraft) and evaluating on all datasets using 4 GPUs.
Run the command below to execute the script:
python -m scripts.train_and_eval --config_path configs/snd_config_4_gpus.yaml --dataset fgvc-aircraft --distributed --nproc_per_node 4
If you are using only one GPU, modify the command as follows:
python -m scripts.train_and_eval --config_path configs/snd_config_1_gpu.yaml --dataset fgvc-aircraft --distributed --nproc_per_node 1
To load a model trained on a specific dataset and continue training on another dataset, include the --pretrained_dataset argument:
python -m scripts.train_and_eval --config_path configs/snd_config_4_gpus.yaml --pretrained_dataset fgvc-aircraft --dataset dtd --distributed --nproc_per_node 4
- Our code has only been verified with 1 or 4 GPUs.
- Using more than 4 GPUs is not recommended, as we observed a slight performance drop.
- When training with 1–4 GPUs, ensure that the batch sizes for the training and reference data are adjusted to match the number of GPUs (see the sketch after this list).
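In distributed data parallel training, each process works on its own data shard, so the effective (global) batch size is the per-GPU batch size times the number of processes. The sketch below only illustrates the arithmetic; the actual attribute names in our configs may differ.

```python
# Illustration only: keep the effective batch size fixed across GPU counts.
# The real config keys in configs/ may be named differently.
effective_batch_size = 64       # global batch size the run should see
nproc_per_node = 4              # number of GPUs / processes

per_gpu_batch_size = effective_batch_size // nproc_per_node
print(per_gpu_batch_size)       # 16 with 4 GPUs, 64 with a single GPU
```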
We also provide a script to continually train and evaluate across an entire sequence of datasets (i.e., reproduce our Multi-Domain Task Incremental Learning setting):
python -m scripts.continually_train --config_path configs/snd_config_4_gpus.yaml --order 0 --distributed --nproc_per_node 4
- The --order argument specifies an offset that shifts the pre-defined dataset sequence (see the illustration after this list).
- For the detailed task order of each training sequence, refer to the supplementary materials.
- The whole process of training and evaluation on a single training sequence using 4x V100 GPUs takes approximately 150 minutes on our devices.
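One way to read the --order offset is as a cyclic shift of the base dataset sequence, as sketched below. The list here is only a placeholder; the actual sequence and the exact shifting logic are defined in the codebase.

```python
# Placeholder illustration of how an --order offset could shift the dataset sequence.
# The real sequence and shifting logic live in the codebase.
datasets = ["fgvc-aircraft", "dtd", "..."]  # placeholder, not the actual order

order = 1
shifted = datasets[order:] + datasets[:order]
print(shifted)  # with order=1, training would start from the second dataset
```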
We also provide a script for performing inference on all datasets used in our experiments.
Run the following command to execute the inference script using the model stored in outputs/order_0/checkpoint_latest.pth:
python -m scripts.inference --model_path outputs/order_0/checkpoint_latest.pth
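If you want to inspect a checkpoint before running inference, a plain torch.load is sufficient. The keys stored inside the file are not guaranteed; this is just a quick peek.

```python
# Quick peek inside a checkpoint; the stored keys may differ from this sketch.
import torch

ckpt = torch.load("outputs/order_0/checkpoint_latest.pth", map_location="cpu")
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))  # e.g. model weights, optimizer state, metadata
```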
If you find our work useful, please cite it using the following BibTeX entry:
@inproceedings{yu2025select,
title={Select and distill: Selective dual-teacher knowledge transfer for continual learning on vision-language models},
author={Yu, Yu-Chu and Huang, Chi-Pin and Chen, Jr-Jen and Chang, Kai-Po and Lai, Yung-Hsuan and Yang, Fu-En and Wang, Yu-Chiang Frank},
booktitle={European Conference on Computer Vision},
pages={219--236},
year={2025},
organization={Springer}
}