Training Vision Transformers for Image Retrieval
(Unofficial) PyTorch implementation of Training Vision Transformers for Image Retrieval (El-Nouby, Alaaeldin, et al. 2021).
I have not yet reproduced the exact results reported in the paper (in particular, differential entropy regularization has little effect on the In-shop and SOP datasets in my runs).
# Python 3.7
pip install -r requirements.txt
# CUB-200-2011
python main.py \
--model deit_small_distilled_patch16_224 \
--max-iter 2000 \
--dataset cub200 \
--data-path /data/CUB_200_2011 \
--rank 1 2 4 8 \
--lambda-reg 0.7
# Stanford Online Products
python main.py \
--model deit_small_distilled_patch16_224 \
--max-iter 35000 \
--dataset sop \
--m 2 \
--data-path /data/Stanford_Online_Products \
--rank 1 10 100 1000 \
--lambda-reg 0.7
# In-shop
python main.py \
--model deit_small_distilled_patch16_224 \
--max-iter 35000 \
--dataset inshop \
--data-path /data/In-shop \
--m 2 \
--rank 1 10 20 30 \
--memory-ratio 0.2 \
--device cuda:2 \
--encoder-momentum 0.999 \
--lambda-reg 0.7
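The In-shop command above passes --encoder-momentum and --memory-ratio, which suggest a momentum (EMA) copy of the encoder feeding a cross-batch memory of embeddings. The snippet below is only a minimal PyTorch sketch of such an EMA update, with the 0.999 value taken from the command above; the function name and usage are illustrative, not code from this repository.

```python
import copy
import torch

@torch.no_grad()
def ema_update(encoder: torch.nn.Module,
               momentum_encoder: torch.nn.Module,
               momentum: float = 0.999) -> None:
    """Exponential-moving-average update of a momentum encoder, the kind of
    update --encoder-momentum presumably controls (0.999 as in the In-shop
    command above)."""
    for p, p_m in zip(encoder.parameters(), momentum_encoder.parameters()):
        p_m.mul_(momentum).add_(p, alpha=1.0 - momentum)

# Hypothetical usage: the momentum encoder starts as a frozen copy of the
# online encoder and is refreshed after every optimizer step.
# momentum_encoder = copy.deepcopy(encoder)
# for p in momentum_encoder.parameters():
#     p.requires_grad_(False)
# ...
# ema_update(encoder, momentum_encoder, momentum=0.999)
```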
The three settings evaluated below are:
IRTO – off-the-shelf extraction of features from a ViT backbone pre-trained on ImageNet;
IRTL – fine-tuning the transformer with metric learning, in particular with a contrastive loss;
IRTR – additionally regularizing the output feature space to encourage uniformity (sketched below).
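The regularizer behind IRTR is not spelled out in this README, so the following is only a minimal PyTorch sketch of the idea, assuming a KoLeo-style differential-entropy term: each L2-normalized embedding is pushed away from its nearest neighbour in the batch, and the term is added to a plain pairwise contrastive loss with the weight passed as --lambda-reg. The function names and the margin value are illustrative, not taken from this repository.

```python
import torch
import torch.nn.functional as F

def koleo_regularizer(emb: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Differential-entropy (KoLeo-style) regularizer: negative mean log of the
    distance from each L2-normalized embedding to its nearest neighbour in the
    batch, which encourages embeddings to spread uniformly on the hypersphere."""
    emb = F.normalize(emb, dim=-1)
    dist = torch.cdist(emb, emb)              # (B, B) pairwise distances
    dist.fill_diagonal_(float("inf"))         # ignore self-distances
    nn_dist = dist.min(dim=1).values          # distance to nearest neighbour
    return -torch.log(nn_dist + eps).mean()

def contrastive_loss(emb: torch.Tensor, labels: torch.Tensor,
                     margin: float = 0.5) -> torch.Tensor:
    """Plain pairwise contrastive loss on L2-normalized embeddings: pull
    same-class pairs together, push different-class pairs apart up to a
    margin (margin value here is illustrative)."""
    emb = F.normalize(emb, dim=-1)
    dist = torch.cdist(emb, emb)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=emb.device)
    pos = dist[same & ~eye]
    neg = F.relu(margin - dist[~same])
    return pos.pow(2).mean() + neg.pow(2).mean()

def irt_r_objective(emb, labels, lambda_reg=0.7):
    # Total loss in the spirit of IRTR: contrastive term plus the entropy
    # regularizer weighted by --lambda-reg (0.7 in the commands above).
    return contrastive_loss(emb, labels) + lambda_reg * koleo_regularizer(emb)
```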
†: Model pre-trained with distillation from a convnet teacher trained on ImageNet-1k.
| Method | Backbone | SOP R@1 | SOP R@10 | SOP R@100 | SOP R@1000 | CUB-200 R@1 | CUB-200 R@2 | CUB-200 R@4 | CUB-200 R@8 | In-Shop R@1 | In-Shop R@10 | In-Shop R@20 | In-Shop R@30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IRTO | DeiT-S | 53.12 | 68.96 | 81.60 | 94.09 | 58.68 | 71.30 | 80.96 | 88.18 | 31.28 | 57.03 | 64.20 | 68.28 |
| IRTL | DeiT-S | 83.56 | 93.29 | 97.23 | 99.03 | 73.68 | 82.58 | 88.77 | 92.71 | 93.09 | 98.28 | 98.74 | 99.02 |
| IRTR | DeiT-S | 82.67 | 92.73 | 96.69 | 98.80 | 73.73 | 82.91 | 89.30 | 93.35 | 90.47 | 97.97 | 98.61 | 98.92 |
| IRTR | DeiT-S† | 82.70 | 92.85 | 96.92 | 98.86 | 76.55 | 85.26 | 90.92 | 94.65 | 90.66 | 98.16 | 98.68 | 98.99 |
El-Nouby, Alaaeldin, et al. "Training vision transformers for image retrieval." arXiv preprint arXiv:2102.05644 (2021).