TorchVision 0.13, including the new Multi-weight support API, new pre-trained weights, and more
Highlights
Models
Multi-weight support API
TorchVision v0.13 offers a new Multi-weight support API for loading different weights with the existing model builder methods:
```python
from torchvision.models import *

# Old weights with accuracy 76.130%
resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)

# New weights with accuracy 80.858%
resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)

# Best available weights (currently alias for IMAGENET1K_V2)
# Note that these weights may change across versions
resnet50(weights=ResNet50_Weights.DEFAULT)

# Strings are also supported
resnet50(weights="IMAGENET1K_V2")

# No weights - random initialization
resnet50(weights=None)
```
The new API bundles important details along with the weights, such as the preprocessing transforms and metadata like the class labels. Here is how to make the most of it:
```python
from torchvision.io import read_image
from torchvision.models import resnet50, ResNet50_Weights

img = read_image("test/assets/encode_jpeg/grace_hopper_517x606.jpg")

# Step 1: Initialize model with the best available weights
weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights)
model.eval()

# Step 2: Initialize the inference transforms
preprocess = weights.transforms()

# Step 3: Apply inference preprocessing transforms
batch = preprocess(img).unsqueeze(0)

# Step 4: Use the model and print the predicted category
prediction = model(batch).squeeze(0).softmax(0)
class_id = prediction.argmax().item()
score = prediction[class_id].item()
category_name = weights.meta["categories"][class_id]
print(f"{category_name}: {100 * score:.1f}%")
```
You can read more about the new API in the docs. To provide your feedback, please use this dedicated GitHub issue.
New architectures and model variants
Classification
The Swin Transformer and EfficientNetV2 are two popular classification models which are often used for downstream vision tasks. This release includes 6 pre-trained weights for their classification variants. Here is how to use the new models:
```python
import torch
from torchvision.models import *

image = torch.rand(1, 3, 224, 224)
model = swin_t(weights="DEFAULT").eval()
prediction = model(image)

image = torch.rand(1, 3, 384, 384)
model = efficientnet_v2_s(weights="DEFAULT").eval()
prediction = model(image)
```
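Instead of hard-coding the input resolutions as above, the four-step pattern from the Multi-weight API section applies to the new models too. A minimal sketch, assuming an image path of your own:

```python
from torchvision.io import read_image
from torchvision.models import swin_t, Swin_T_Weights

weights = Swin_T_Weights.DEFAULT
model = swin_t(weights=weights).eval()

# The bundled transforms resize and normalize the image to whatever the
# weights expect, so no input size needs to be hard-coded.
img = read_image("path/to/image.jpg")  # hypothetical path
preprocess = weights.transforms()
batch = preprocess(img).unsqueeze(0)
prediction = model(batch)
```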
In addition to the above, we also provide new variants for existing architectures such as ShuffleNetV2, ResNeXt and MNASNet. The ImageNet-1K accuracies of all the new pre-trained models are listed below:
Model | Acc@1 | Acc@5 |
---|---|---|
swin_t | 81.474 | 95.776 |
swin_s | 83.196 | 96.36 |
swin_b | 83.582 | 96.64 |
efficientnet_v2_s | 84.228 | 96.878 |
efficientnet_v2_m | 85.112 | 97.156 |
efficientnet_v2_l | 85.808 | 97.788 |
resnext101_64x4d | 83.246 | 96.454 |
resnext101_64x4d (quantized) | 82.898 | 96.326 |
shufflenet_v2_x1_5 | 72.996 | 91.086 |
shufflenet_v2_x1_5 (quantized) | 72.052 | 90.700 |
shufflenet_v2_x2_0 | 76.230 | 93.006 |
shufflenet_v2_x2_0 (quantized) | 75.354 | 92.488 |
mnasnet0_75 | 71.180 | 90.496 |
mnasnet1_3 | 76.506 | 93.522 |
We would like to thank Hu Ye for contributing the Swin Transformer implementation to TorchVision.
[BETA] Object Detection and Instance Segmentation
We have introduced 3 new model variants for RetinaNet, FasterRCNN and MaskRCNN that include several post-paper architectural optimizations and improved training recipes. All models can be used similarly:
```python
import torch
from torchvision.models.detection import *

images = [torch.rand(3, 800, 600)]
model = retinanet_resnet50_fpn_v2(weights="DEFAULT")
# model = fasterrcnn_resnet50_fpn_v2(weights="DEFAULT")
# model = maskrcnn_resnet50_fpn_v2(weights="DEFAULT")
model.eval()

prediction = model(images)
```
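In eval mode, the detection models return one dictionary per input image with boxes, labels and scores entries (Mask R-CNN additionally returns masks). A minimal sketch of consuming the output of the snippet above; the 0.5 threshold is an arbitrary choice:

```python
from torchvision.models.detection import RetinaNet_ResNet50_FPN_V2_Weights

output = prediction[0]  # one dict per input image
keep = output["scores"] > 0.5  # arbitrary confidence threshold
boxes = output["boxes"][keep]  # (N, 4) boxes in (x1, y1, x2, y2) format
weights = RetinaNet_ResNet50_FPN_V2_Weights.DEFAULT
labels = [weights.meta["categories"][i] for i in output["labels"][keep].tolist()]
```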
Below we present the metrics of the new variants on COCO val2017. In parentheses we denote the improvement over the old variants:
Model | Box mAP | Mask mAP |
---|---|---|
retinanet_resnet50_fpn_v2 | 41.5 (+5.1) | - |
fasterrcnn_resnet50_fpn_v2 | 46.7 (+9.7) | - |
maskrcnn_resnet50_fpn_v2 | 47.4 (+9.5) | 41.8 (+7.2) |
We would like to thank Ross Girshick, Piotr Dollar, Vaibhav Aggarwal, Francisco Massa and Hu Ye for their past research and contributions to this work.
New pre-trained weights
SWAG weights
The ViT and RegNet model variants offer new pre-trained SWAG (Supervised Weakly from hashtAGs) weights. The biggest of these models achieves a whopping 88.6% accuracy on ImageNet-1K. We currently offer two versions of the weights: 1) fine-tuned end-to-end weights on ImageNet-1K (highest accuracy) and 2) frozen trunk weights with a linear classifier fit on ImageNet-1K (great for transfer learning). Below we see the detailed accuracies of each model variant:
Model Weights | Acc@1 | Acc@5 |
---|---|---|
RegNet_Y_16GF_Weights.IMAGENET1K_SWAG_E2E_V1 | 86.012 | 98.054 |
RegNet_Y_16GF_Weights.IMAGENET1K_SWAG_LINEAR_V1 | 83.976 | 97.244 |
RegNet_Y_32GF_Weights.IMAGENET1K_SWAG_E2E_V1 | 86.838 | 98.362 |
RegNet_Y_32GF_Weights.IMAGENET1K_SWAG_LINEAR_V1 | 84.622 | 97.48 |
RegNet_Y_128GF_Weights.IMAGENET1K_SWAG_E2E_V1 | 88.228 | 98.682 |
RegNet_Y_128GF_Weights.IMAGENET1K_SWAG_LINEAR_V1 | 86.068 | 97.844 |
ViT_B_16_Weights.IMAGENET1K_SWAG_E2E_V1 | 85.304 | 97.65 |
ViT_B_16_Weights.IMAGENET1K_SWAG_LINEAR_V1 | 81.886 | 96.18 |
ViT_L_16_Weights.IMAGENET1K_SWAG_E2E_V1 | 88.064 | 98.512 |
ViT_L_16_Weights.IMAGENET1K_SWAG_LINEAR_V1 | 85.146 | 97.422 |
ViT_H_14_Weights.IMAGENET1K_SWAG_E2E_V1 | 88.552 | 98.694 |
ViT_H_14_Weights.IMAGENET1K_SWAG_LINEAR_V1 | 85.708 | 97.73 |
The weights can be loaded normally as follows:
```python
from torchvision.models import *

model1 = vit_h_14(weights="IMAGENET1K_SWAG_E2E_V1")
model2 = vit_h_14(weights="IMAGENET1K_SWAG_LINEAR_V1")
```
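Note that the end-to-end SWAG weights were fine-tuned at higher resolutions than the usual 224x224, so it is safest to rely on the bundled presets rather than hard-coding an input size. A minimal sketch:

```python
from torchvision.models import vit_h_14, ViT_H_14_Weights

weights = ViT_H_14_Weights.IMAGENET1K_SWAG_E2E_V1
model = vit_h_14(weights=weights).eval()
# The preset resizes and crops to the resolution the weights were
# fine-tuned at, and applies the matching normalization.
preprocess = weights.transforms()
```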
The SWAG weights are released under the Attribution-NonCommercial 4.0 International license. We would like to thank Laura Gustafson, Mannat Singh and Aaron Adcock for their work and support in making the weights available to TorchVision.
Model Refresh
The release of the Multi-weight support API enabled us to refresh the most popular models and offer more accurate weights. We improved each model by ~3 points on average. The new recipe was learned on top of ResNet50 and its details were covered in a previous blog post.
Model | Old Acc@1 | New Acc@1 |
---|---|---|
efficientnet_b1 | 78.642 | 79.838 |
mobilenet_v2 | 71.878 | 72.154 |
mobilenet_v3_large | 74.042 | 75.274 |
regnet_y_400mf | 74.046 | 75.804 |
regnet_y_800mf | 76.42 | 78.828 |
regnet_y_1_6gf | 77.95 | 80.876 |
regnet_y_3_2gf | 78.948 | 81.982 |
regnet_y_8gf | 80.032 | 82.828 |
regnet_y_16gf | 80.424 | 82.886 |
regnet_y_32gf | 80.878 | 83.368 |
regnet_x_400mf | 72.834 | 74.864 |
regnet_x_800mf | 75.212 | 77.522 |
regnet_x_1_6gf | 77.04 | 79.668 |
regnet_x_3_2gf | 78.364 | 81.196 |
regnet_x_8gf | 79.344 | 81.682 |
regnet_x_16gf | 80.058 | 82.716 |
regnet_x_32gf | 80.622 | 83.014 |
resnet50 | 76.13 | 80.858 |
resnet50 (quantized) | 75.92 | 80.282 |
resnet101 | 77.374 | 81.886 |
resnet152 | 78.312 | 82.284 |
resnext50_32x4d | 77.618 | 81.198 |
resnext101_32x8d | 79.312 | 82.834 |
resnext101_32x8d (quantized) | 78.986 | 82.574 |
wide_resnet50_2 | 78.468 | 81.602 |
wide_resnet101_2 | 78.848 | 82.51 |
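The refreshed weights from the table above are opt-in: requesting IMAGENET1K_V2 (or DEFAULT) returns them, while the legacy V1 weights remain available for reproducibility. A minimal sketch:

```python
from torchvision.models import resnet101, ResNet101_Weights

model_old = resnet101(weights=ResNet101_Weights.IMAGENET1K_V1)  # 77.374 Acc@1
model_new = resnet101(weights=ResNet101_Weights.IMAGENET1K_V2)  # 81.886 Acc@1
```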
We would like to thank Piotr Dollar, Mannat Singh and Hugo Touvron for their past research and contributions to this work.
Ops and Transforms
New Augmentations, Layers and Losses
This release brings a number of new primitives which can be used to produce SOTA models. Some highlights include the addition of the AugMix data-augmentation method, the DropBlock layer, the cIoU/dIoU losses and many more. We would like to thank Aditya Oke, Abhijit Deo, Yassine Alouini and Hu Ye for contributing to the project and for helping us keep TorchVision relevant and fresh.
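Below is a brief sketch of these primitives, assuming their 0.13 locations (AugMix under torchvision.transforms; DropBlock and the IoU loss variants under torchvision.ops):

```python
import torch
from torchvision import ops, transforms

# AugMix operates on PIL images or uint8 tensors
augmix = transforms.AugMix()
augmented = augmix(torch.randint(0, 256, (3, 224, 224), dtype=torch.uint8))

# DropBlock2d zeroes out contiguous regions of conv feature maps
drop_block = ops.DropBlock2d(p=0.1, block_size=3)
features = drop_block(torch.rand(1, 64, 56, 56))

# dIoU/cIoU loss variants for boxes in (x1, y1, x2, y2) format
boxes1 = torch.tensor([[0.0, 0.0, 10.0, 10.0]])
boxes2 = torch.tensor([[1.0, 1.0, 11.0, 11.0]])
diou_loss = ops.distance_box_iou_loss(boxes1, boxes2)
ciou_loss = ops.complete_box_iou_loss(boxes1, boxes2)
```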
Documentation
We completely revamped our models documentation to make it easier to browse, and added key information such as supported image sizes and the image pre-processing steps of pre-trained weights. We now have a main model page with summary tables of the available weights, and each model has a dedicated page. Each model builder is also documented on its own page, with more details about the available weights, including accuracy, minimal image size, a link to the training recipe, and other valuable info. For comparison, our previous models docs are here. To provide feedback on the new documentation, please use the dedicated GitHub issue.
Backward-incompatible changes
The new Multi-weight support API replaced the legacy "pretrained" parameter of the model builders. Both solutions are currently supported to maintain backward compatibility, but we intend to remove the deprecated API in two releases. Migrating to the new API is straightforward. The following calls between the two APIs are all equivalent:
```python
from torchvision.models import resnet50, ResNet50_Weights

# Using pretrained weights:
resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
resnet50(weights="IMAGENET1K_V1")
resnet50(pretrained=True)  # deprecated
resnet50(True)  # deprecated

# Using no weights:
resnet50(weights=None)
resnet50()
resnet50(pretrained=False)  # deprecated
resnet50(False)  # deprecated
```
Deprecations
[models, models.quantization] Reinstate and deprecate `model_urls` and `quant_model_urls` (#5992)
[transforms] Deprecate int as interpolation argument type (#5974); see the migration sketch below
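For the interpolation deprecation above, the replacement for integer constants is the InterpolationMode enum. A minimal migration sketch:

```python
from torchvision.transforms import InterpolationMode, Resize

# Before (deprecated): Resize(256, interpolation=2)
resize = Resize(256, interpolation=InterpolationMode.BILINEAR)
```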
New Features
[models] New Multi-weight API support (#5618, #5859, #6047, #6026, #5848)
[models] Adding Swin Transformer architecture (#5491)
[models] Adding EfficientNetV2 architecture (#5450)
[models] Adding detection model improved weights: RetinaNet, MaskRCNN, FasterRCNN (#5756, #5773, #5763)
[models] Adding classification model weight: resnext101 64x4d, mnasnet0_75, mnasnet1_3 (#5935, #6019)
[models] Add SWAG model pretrained weights (#5714, #5722, #5732, #5793, #5721)
[ops] Adding IoU loss function variants: DIoU, CIoU (#5786, #5776)
[ops] Adding various ops and test for ops (#6053, #5416, #5792, #5783)
[transforms] Adding AugMix transforms implementation (#5411)
[reference scripts] Support custom weight decay setting in classification reference script (#5671)
[transforms, reference scripts] Improve detection reference script: Scale Jitter, RandomShortestSize, FixedSizeCrop (#5435, #5610, #5607)
[ci] Add M1 support (#6167)
[ci] Add Python-3.10 (build and test) (#5420)
Improvements
[documentation] Complete new revamp of models documentation (#5821, #5876, #5899, #6025, #5885, #5884, #5886, #5891, #6023, #6009, #5852, #5831, #5832, #6003, #6013, #5856, #6004, #6005, #5878, #6012, #5894, #6002, #5854, #5864, #5920, #5869, #5871, #6021, #6006, #6016, #5905, #6028, #5915, #5924, #5977, #5918, #5921, #5934, #5936, #5937, #5933, #5949, #5988, #5962, #5963, #5975, #5900, #5917, #5895, #5901, #6033, #6032, #6030, #5904, #5661, #6035, #6049, #6036, #5908, #5907, #6044, #6039, #5874, #6151)
[documentation] Various documentation improvements (#5695, #5930, #5814, #5799, #5827, #5796, #5923, #5599, #5554, #5995, #5457, #6163, #6031, #6000, #5847, #6024)
[documentation] Add warnings in docs to document Beta APIs (#6115)
[datasets] improve GDrive downloads (#5704, #5645)
[datasets] indicate md5 checksum is not used for security (#5717)
[models] Add shufflenetv2 1.5 and 2.0 weights (#5906)
[models] Reduce unnecessary cuda sync in anchor_utils.py (#5515)
[models] Adding improved MobileNetV2 weights (#5560)
[models] Remove `(N, T, H, W, C) => (N, T, C, H, W)` from presets (#6058)
[models] add swin_s and swin_b variants and improved swin_t (#6048)
[models] Update ShuffleNetV2 annotations for x1_5 and x2_0 variants (#6022)
[models] Better error message in ViT (#5820)
[models, ops] Add private support for ciou and diou (#5984, #5685, #5690)
[models, reference scripts] Various improvements to detection recipe and models (#5715, #5444)
[transforms, tests] add functional vertical flip tests on segmentation mask (#5860)
[transforms] make _max_value jit-scriptable (#5623)
[transforms] Make ScaleJitter proportional (#5559)
[transforms] add tensor kernels for normalize and erase (#5462)
[transforms] Update transforms following PIL deprecation (#5898)
[transforms, models, datasets…] Replace asserts with exceptions (#5587, #5659)
[utils] add warning if font is not set in draw_bounding_boxes (#5785)
[utils] Throw warning for empty masks or box tensors on draw_segmentation_masks and draw_bounding_boxes (#5857)
[video] Add output_format to video datasets and readers (#6061)
[video, io] Better compatibility with FFMPEG 5.0 (#5644)
[video, io] Allow cuda device to be passed without the index for GPU decoding (#5505)
[reference scripts] Simplify EMA to use PyTorch's update_parameters (#5469)
[reference scripts] Reduce variance of evaluation in reference (#5819)
[reference scripts] Various improvements to RAFT training reference (#5590)
[tests] Speed up Model tests by 20% (#5574)
[tests] Make test suite fail on unexpected test success (#5556)
[tests] Skip big model in test to reduce memory usage in CI (#5903, #5902)
[tests] Improve test of backbone utils (#5552)
[tests] Validate against expected files on videos (#6077)
[ci] Support for CUDA 11.6 (#5803, #5862)
[ci] pre-download model weights in CI docs build (#5625)
Bug Fixes
[transforms] remove option to pass fill as str in transforms (#5632)
[transforms] Better handling for Pad's fill argument (#5596)
[transforms] [FBcode->GH] Fix accimage tests (#5545)
[transforms] Update _pil_constants.py (#6154) (#6156)
[transforms] Fix resize transform when size == small_edge_size and max_size isn't None (#5409)
[transforms] Fixed rotate transform with expand inconsistency (#5677)
[transforms] Fixed upstream issue with padding (#5875)
[transforms] Fix functional.adjust_gamma (#5427)
[models] Respect `strict=False` when loading detection models (#5841)
[models] Fix resnet norm initialization (#6082) (#6085)
[models] Use frozen BN only if pre-trained for detection models. (#5443)
[models] fix fcos gtarea calculation (#5816)
[models, onnx] Add topk min function for trace and onnx (#5310)
[models, tests] fix mobilenet norm layer test (#5643)
[reference scripts] Fix regression on Detection training script (#5985)
[datasets] do not re-download from GDrive if file is already present (#5805)
[datasets] Fix datasets: kinetics, Flowers102, VOC_2009, INaturalist 2021_train, caltech (#5578, #5775, #5425, #5844, #5789)
[documentation] Fixes device mismatch issue while building docs (#5428)
[documentation] Fix Accuracy meta-data on shufflenetv2 (#5896)
[documentation] fix typo in docstrings of some transforms (#5609)
[video, documentation] Fix append of audio_pts (#5488)
[io, tests] More robust check in tests for 16 bits images (#5652)
[video, io] Fix shape mismatch error in video reader (#5489)
[io] Address nvjpeg leak on CUDA < 11.6 issue (#5713, #5482)
[ci] Fixing issue with setup_env.sh in docker: resolve "unsafe directory" error (#6106) (#6109)
[ci] fix documentation version problems when new release is tagged (#5583)
[ci] Replace jcenter and fix version for android (#6046)
[tests] Add .float() before .mean() in test_backbone_utils.py because .mean() doesn't accept integer dtype (#6090) (#6091)
[tests] Fix keypointrcnn_resnet50_fpn flaky test (#5911)
[tests] Disable `test_encode|write_jpeg_reference` tests (#5910)
[mobile] Bump up LibTorchvision version number for Podspec to release Cocoapods (#5624)
[feature extraction] Add default tracer args for model feature extraction function (#5637)
[build] Fix libtorchvision.so not able to encode images by adding *_FOUND macros to CMakeLists.txt (#5547)
Code Quality
[datasets, models] Better deprecation message for voc2007 and SqueezeExcitation (#5391)
[datasets, reference scripts] Use Kinetics instead of Kinetics400 in references (#5787) (#5952)
[models] CleanUp DenseNet code (#5966)
[models] Minor Swin Transformer fixes (#6054)
[models, onnx] Use onnx function only in tracing mode (#5468)
[models] Refactor swin transformer so we can later reuse components for the 3D version (#6088) (#6100)
[models, tests] Fix minor issues with model tests. (#5576)
[transforms] Remove `to_tensor()` and `ToTensor()` usages (#5553)
[transforms] Refactor Augmentation Space calls to speed up. (#5402)
[transforms] Recoded _max_value method using a dictionary (#5566)
[transforms] Replace get_image_size/num_channels with get_dimensions (#5487)
[ops] Replace usages of atomicAdd with gpuAtomicAdd (#5823)
[ops] Fix unused variable warning in ps_roi_align_kernel.cu (#5408)
[ops] Remove custom ops interpolation with antialiasing (#5329)
[ops] Move Permute layer to ops. (#6055)
[ops] Remove assertions for generalized_box_iou (#5691)
[utils] Moving `sequence_to_str` to `torchvision._utils` (#5604)
[utils] Clarify TypeError message in make_grid (#5997)
[video, io] replace distutils.spawn with shutil.which per PEP632 in setup script (#5849)
[video, io] Move VideoReader out of init (#5495)
[video, io] Remove unnecessary initialisation in GPUDecoder (#5507)
[video, io] Remove unused member variable and argument in GPUDecoder (#5499)
[video, io] Improve test_video_reader (#5498)
[video, io] Update private attribute name for readability (#5484)
[video, tests] Improve test_videoapi (#5497)
[reference scripts] Minor updates to optical flow ref for consistency (#5654)
[reference scripts] Add barrier() after init_process_group() (#5475)
[ci] Delete stale packaging scripts (#5433)
[ci] remove explicit install of Pillow throughout CI (#5950)
[ci, test] remove unnecessary pytest install (#5739)
[ci, tests] Remove unnecessary PYTORCH_TEST_WITH_SLOW env (#5631)
[ci] Add .git-blame-ignore-revs to ignore specific commits in git blame (#5696)
[ci] Remove CUDA 11.1 support (#5477, #5470, #5451, #5978)
[ci] Minor linting improvement (#5880)
[ci] Remove Bandit and CodeQL jobs (#5734)
[ci] Various type annotation fixes / issues (#5598, #5970, #5943)
Contributors
We're grateful for our community, which helps us improve torchvision by submitting issues and PRs and by providing feedback and suggestions. The following persons have contributed patches for this release:
Abhijit Deo, Aditya Oke, Andrey Talman, Anton Thomma, Behrooz, Bruno Korbar, Daniel Angelov, Dbhasin1, Drishti Bhasin, F-G Fernandez, Federico Pozzi, FG Fernandez, Georg Grab, Gouvernathor, Hu Ye, Jeffery (Zeyu) Zhao, Joao Gomes, kaijieshi, Kazuki Adachi, KyleCZH, kylematoba, LEGRAND Matthieu, Lezwon Castelino, Luming Tang, Matti Picus, Nicolas Hug, Nikita, Nikita Shulga, oxabz, Philip Meier, Prabhat Roy, puhuk, Richard Barnes, Sahil Goyal, satojkovic, Shijie, Shubham Bhokare, talregev, tcmyxc, Vasilis Vryniotis, vfdev, WuZhe, XiaobingZhang, Xu Zhao, Yassine Alouini, Yonghye Kwon, YosuaMichael, Yulv-git, Zhiqiang Wang