- Introduction and Motivation
- Adversarial Training Details and Tricks
- Adversarial Robustness Comparison on the ImageNet
- Concluding Remarks
- Citation
- References
There is a recent trend of unifying methodologies across deep learning tasks: vision, natural language processing, speech, etc. Transformers have been introduced into the field of computer vision and have become a strong competitor to Convolutional Neural Networks (CNNs). With the performance of vision transformers on par with or even better than that of CNNs, several previous works [2][4][5][6] have looked into the adversarial robustness of vision transformers vs. CNNs. However, most of them compare models that differ in:
- model size
- pretrain dataset
- macro and micro architectural design
- training recipe
Some of them directly test adversarial robustness on cleanly trained models, whose robust accuracy quickly drops to 0. A few others compare defended models after adversarial training, but those models start from quite different clean accuracies. All of this makes it unclear whether transformer-type models are better than CNN-type models in terms of adversarial robustness.
Our main goal with this study is to provide a fair comparison of the adversarial robustness of some relatively recent vision transformers (PoolFormer, Swin Transformer, DeiT) and CNNs (ConvNext, ResNet50). This comparison can give us insight into the adversarial robustness implications of the two model types. To ensure fairness of comparison, we try to align:
- model size: we select model variants with similar # of parameters
- dataset: all models adversarially trained on ImageNet-1k from scratch
- macro and micro architectural design: ConvNext is designed to strictly follow Swin Transformer's macro and micro architectures.
- training recipe: we fully align the training recipe for Swin Transformer and ConvNext; all other models also use a training recipe close to ConvNext's.
For all models we compare, we first apply the same adversarial training with PGD on the ImageNet dataset without any pretraining and then test the robustness performance of the adversarially trained models. For a better comparison of models that start from different clean accuracies, we calculate the relative drop of top@1 robust accuracy with respect to top@1 clean accuracy (defined below).
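Concretely, the metric is just:

relative drop = (robust acc@1 − clean acc@1) / clean acc@1

so a relative drop of −29% means the model retains about 71% of its clean accuracy under attack.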
We adversarially train all models on the ImageNet-1k dataset from scratch. We first apply the PGD-1 attacker (eps=4/255, step=4/255) with random restart on the training set and then evaluate robustness with the PGD-5 attacker (eps=4/255, step=1/255) with random restart on the validation set.
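For reference, here is a minimal PyTorch sketch of the L∞ PGD attacker described above. This is an illustrative re-implementation, not our exact training code; it assumes inputs normalized to [0, 1] and a standard cross-entropy loss:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=4/255, step=1/255, n_iter=5, random_start=True):
    """L_inf PGD: maximize the loss within an eps-ball around the clean input x."""
    x_adv = x.clone().detach()
    if random_start:
        # "random restart": start from a uniformly sampled point inside the eps-ball
        x_adv = (x_adv + torch.empty_like(x_adv).uniform_(-eps, eps)).clamp(0.0, 1.0)
    for _ in range(n_iter):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # take a signed gradient ascent step, then project back into the eps-ball
        x_adv = x_adv.detach() + step * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv.detach()

# training uses PGD-1:   pgd_attack(model, x, y, eps=4/255, step=4/255, n_iter=1)
# evaluation uses PGD-5: pgd_attack(model, x, y, eps=4/255, step=1/255, n_iter=5)
```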
Adversarial training on ImageNet takes a long time and often causes Out of Memory (OOM) errors on our machines. We tried lowering the global batch size, which resulted in suboptimal training. To resolve the issue, we apply distributed mixed precision training, which greatly reduces the training time and GPU memory consumption. The difference in acc@1 between mixed precision and full precision training is negligible (<1%). For distributed training, we apply the "learning rate scaling" technique mentioned in [3]. For the half precision data type, we experimented with both IEEE's `float16` and Google Brain's `bfloat16`, which reach similar clean and robust acc@1 after training. Although `bfloat16` is slower than `float16`, we chose `bfloat16` since `float16` sometimes crashes training for some models.
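As a rough sketch (DDP setup omitted, and the model below is just a placeholder), bf16 autocasting and the linear learning rate scaling from [3] can be wired up in PyTorch like this:

```python
import torch
import torch.nn.functional as F

# linear learning rate scaling: scale the reference lr by the ratio of the
# actual global batch size to the reference batch size (bs=1024 in our tables)
reference_bs, global_bs, base_lr = 1024, 4096, 1e-3
lr = base_lr * global_bs / reference_bs

model = torch.nn.Linear(3 * 224 * 224, 1000).cuda()  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.05)

def train_step(x, y):
    optimizer.zero_grad(set_to_none=True)
    # bfloat16 autocast: unlike float16, no GradScaler is needed because bf16
    # keeps float32's exponent range, which helps avoid overflow-related crashes
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = F.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```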
Adversarial training on ImageNet turned out to be very sensitive to training hyperparameters, so we spent quite some time obtaining the best training results for each model. For all models except ResNet50, we apply the "data augmentation warmup" mentioned in [2] to gradually increase the augmentation strength for stable training.
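A rough sketch of what such a schedule could look like with timm-style RandAugment magnitude strings. The warmup length and the linear ramp here are assumptions for illustration; the recipe table below lists the actual start and end magnitudes (m1 to m9):

```python
def randaug_magnitude(epoch, warmup_epochs=10, start_m=1, end_m=9):
    """Linearly ramp the RandAugment magnitude over the first warmup_epochs."""
    if epoch >= warmup_epochs:
        return end_m
    return round(start_m + (end_m - start_m) * epoch / warmup_epochs)

# rebuild the training transform each epoch with the ramped magnitude,
# matching the "m1-mstd0.5 --> m9-mstd0.5" entries in the recipe table
for epoch in range(12):
    aug_str = f"rand-m{randaug_magnitude(epoch)}-mstd0.5"
    print(epoch, aug_str)
```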
For a fair comparison, we fully align the training recipe for ConvNext and Swin Transformer. For all other models, we closely follow the training recipe from the original paper, which turns out to be very close to ConvNext's recipe for every model except ResNet50. For all models, we select the variant with a similar model size in terms of the total number of parameters.
model | convnext-tiny | swin-tiny | poolformer-s36 | deit-small | resnet50-gelu* |
---|---|---|---|---|---|
activation | gelu | gelu | gelu | gelu | gelu |
norm layer | ln | ln | groupnorm | ln | bn |
epochs | 300 | 300 | 300 | 100 | 100 |
batch size | 4096 | 4096 | 4096 | 4096 | 512 |
optimizer | AdamW | AdamW | AdamW | AdamW | SGD, mom=0.9 |
init lr (bs1024) | 1.00E-03 | 1.00E-03 | 1.00E-03 | 1.00E-03 | 0.4 |
lr decay (bs1024) | cosine, minlr=2.5e-7 | cosine, minlr=2.5e-7 | cosine, minlr=2.5e-7 | cosine, minlr=1e-5 | step30e, minlr=1e-5 |
warmup | initlr=0, 10e linear | initlr=0, 10e linear | initlr=0, 5e linear | initlr=1e-6, 10e linear | initlr=0.1, 5e linear |
amp | bf16 | bf16 | bf16 | bf16 | no |
weight decay | 0.05 | 0.05 | 0.05 | 0.05 | 1.00E-04 |
label smoothing eps | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
dropout | no | no | no | no | no |
stochastic depth rate | 0.1 | 0.1 | 0.2 | 0 | 0 |
repeated aug | no | no | no | no | no |
gradient clip | no | no | no | no | no |
EMA | no | no | no | no | no |
rand aug | m1-mstd0.5-->m9-mstd0.5 | m1-mstd0.5-->m9-mstd0.5 | m1-mstd0.5-->m9-mstd0.5 | m1-mstd0.5-->m9-mstd0.5 | rand-m9-mstd0.5-inc1 |
mixup alpha | 0.8 | 0.8 | 0.8 | 0.8 | 0 |
cutmix alpha | 0-1.0 at 6th e | 0-1.0 at 6th e | 0-1.0 at 6th e | 0-1.0 at 6th e | 0 |
mixup/cutmix prob | 0.5-1 first 0-5e | 0.5-1 first 0-5e | 0.5-1 first 0-5e | 0.5-1 first 0-5e | 1 |
rand erasing prob | 0 | 0 | 0 | 0 | 0 |
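To make the schedule rows above concrete, here is a minimal sketch of cosine decay with linear warmup using the ConvNext/Swin values (bs=1024 reference lr, 10 warmup epochs); epoch-level stepping is an assumption here:

```python
import math

epochs, warmup_epochs = 300, 10
base_lr, min_lr = 1e-3, 2.5e-7   # bs=1024 reference values from the table

def lr_at_epoch(epoch):
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs                 # linear warmup from 0
    t = (epoch - warmup_epochs) / (epochs - warmup_epochs)     # progress in [0, 1]
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))

# applied once per epoch, e.g.:
#   for g in optimizer.param_groups:
#       g["lr"] = lr_at_epoch(epoch)
```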
Swin Transformer has recently emerged as one of the new SOTA vision transformer backbones. Compared with DeiT, it adds hierarchical feature maps and local self-attention with window shifting. ConvNext closely follows Swin's macro and micro architectural design, making it a good candidate for robustness comparison: if one model were noticeably more robust than the other, it would tell us whether the attention operation is superior to the convolution operation in terms of adversarial robustness. As our results show, after aligning the adversarial training recipe, they have very close robustness performance. We also experiment with the recently introduced PoolFormer model, which replaces the self-attention operation with simple average pooling while keeping the same "MetaFormer" [1] macro design as vision transformers. PoolFormer also has similar robustness after adversarial training.
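For intuition, here is a minimal sketch of the pooling token mixer idea from [1]. The pool size and other details are assumptions; in the MetaFormer block, the residual connection is applied outside this module:

```python
import torch.nn as nn

class PoolingTokenMixer(nn.Module):
    """Token mixing via average pooling instead of self-attention."""
    def __init__(self, pool_size=3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1,
                                 padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x):  # x: (B, C, H, W)
        # subtract the identity so the mixer contributes only the "mixing" term;
        # the surrounding block adds the residual back
        return self.pool(x) - x
```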
For ResNet50-GELU and DeiT-small, we use the results reported in [2]; these models were adversarially trained with full precision. We also perform `bf16` mixed precision training on the DeiT model and obtain very close performance: as shown in the table, the two DeiT runs have nearly the same robustness in terms of the relative drop from clean acc@1 to robust acc@1. The relative drops for ResNet50 and DeiT are larger than those of the more recent models. We think this is likely due to the difference in model size: it is known in the literature that adversarial training requires larger model capacity, so models with more parameters can be adversarially trained better and thus suffer a slightly smaller relative drop.
Model | Train Precision | Num Params | Clean Acc@1 | PGD-5 Acc@1 | Relative Drop |
---|---|---|---|---|---|
convnext-tiny | bf16 | 29M | 69.98% | 49.95% | -29% |
swin-tiny | bf16 | 28M | 67.24% | 46.66% | -31% |
poolformer-s36 | bf16 | 30.8M | 66.32% | 44.80% | -32% |
deit-small | bf16 | 22M | 66.48% | 44.60% | -33% |
deit-small* | full | 22M | 66.50% | 43.95% | -34% |
resnet50-gelu* | full | 25M | 67.38% | 44.01% | -35% |
model* denotes results reported in the previous work [2].
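As a quick sanity check of the relative-drop metric on the convnext-tiny row: (49.95% − 69.98%) / 69.98% ≈ −28.6%, which rounds to the −29% reported above.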
This study provides a fair comparison of the adversarial robustness of some recent vision transformers and CNNs. We summarize our findings below:
- vision transformers are as adversarially robust as CNNs after aligning model size, pretrain dataset, macro and micro architectural design, and training recipe.
- as far as the basic network operation is concerned, self-attention is as robust as convolution.
- for a fair evaluation of adversarial robustness, we should first apply basic adversarial training and then compare the robustness of different models. The relative drop rate is a better and more intuitive metric than absolute robust accuracy.
- adversarial training is very sensitive. Many factors, such as model size, number of epochs, initial learning rate, etc., can influence the training outcome. Careful training and evaluation procedures should be followed for a fair robustness comparison.
We hope this can provide some useful insight to researchers and practitioners in the field of adversarial machine learning and deep learning architecture.
If you find this study helpful, please consider citing:
@unpublished{xu2022AdvTransCNNs,
author = {Ke Xu and Ram Nevatia},
title = {Adversarial Robustness of SOTA Vision Transformers vs. CNNs},
month = {May},
year = {2022},
}