Thank you for your interest in contributing to our BDD100K models repository. Our goal is to provide a comprehensive list of models for each task to facilitate research on BDD100K. You can help us by contributing your models.
Contributing your models is as easy as making a pull request on our repository, which anyone can do!
- Fork and pull the latest version of BDD100K-Models.
- Create a new branch and add your models there (see what to add).
- Commit your changes.
- Create a pull request.
We provide a template for each task, which you should follow exactly.
- First, submit your model predictions on both the validation and test sets to our evaluation server (hosted on eval.ai) to obtain the official results.
- Next, provide all the necessary files shown in the template. See the general guidelines and the task-specific guidelines for the exact files to include.
- Once you have everything, submit a pull request to add your model to the README of the corresponding task, and we will verify your results based on the information you provide.
- If everything looks good, we will merge your PR!
For now, we only accept models that are or will be published in top-tier venues (e.g., CVPR, ICCV, ECCV).
Follow the general guidelines for every model contribution, regardless of task. Copy this checklist into the description of your PR and mark the box of each completed item with an X.
- [ ] Upload all your files to publicly available online storage services (e.g., Google Drive) so your files can be accessed indefinitely.
- Paper:
  - [ ] Include a link to your paper (preferably on arXiv) and the venue and year in which the paper is or will be published.
  - [ ] You can add a list of the paper's authors along with links to each person's website.
  - [ ] Put the abstract of your paper in the indicated part.
- Results:
  - [ ] You can include all variations of your method (e.g., different backbones/detectors), but not baselines.
  - [ ] Include links to evaluation results on both the validation and test sets with BDD100K metrics.
  - [ ] Include the model weights and their MD5 hash as a checksum.
  - [ ] Include model predictions and visualizations on the validation set.
- Code:
  - [ ] Include a link to your codebase on GitHub.
  - [ ] Include usage instructions so that we can easily verify your results and others can easily use your model.
  - [ ] Make sure your code and instructions are bug-free.
- [ ] Before making a pull request, make sure all the general guidelines and task-specific guidelines are met.
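The MD5 checksum requested for the model weights can be computed with any standard tool. As one possible approach, the sketch below uses only the Python standard library; the function names and the `.md5` side-file naming are illustrative assumptions, not a format this repository prescribes.

```python
import hashlib
from pathlib import Path


def md5_checksum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_md5_file(weights_path):
    """Write '<digest>  <file name>' next to the weights file (assumed naming)."""
    weights_path = Path(weights_path)
    md5_path = weights_path.parent / (weights_path.name + ".md5")
    md5_path.write_text(f"{md5_checksum(weights_path)}  {weights_path.name}\n")
    return md5_path
```

Calling `write_md5_file("model.pth")` (a hypothetical file name) would create `model.pth.md5` in the same directory, which can then be uploaded alongside the weights.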
We use the Apache 2.0 License for our repository. The BDD100K dataset and linked repositories are not subject to the same license.
Each task in BDD100K has its own template and guidelines. Click the links below to go to each template.
- Image Tagging
- Object Detection
- Instance Segmentation
- Semantic Segmentation
- Drivable Area
- Panoptic Segmentation
- Multiple Object Tracking (MOT)
- Multiple Object Tracking and Segmentation (MOTS)
- Pose Estimation
Image Tagging template and guidelines below:
Paper name [Venue and Year]
Authors: Author list
Abstract
Put your abstract here.

| Backbone | Input | Acc-val | Scores-val | Acc-test | Scores-test | Config | Weights | Preds | Visuals |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  |  |  | scores |  | scores | config | model \| MD5 | preds | visuals |
Other information.
- The scores file should be a JSON file with evaluation results for all the MMClassification metrics (top-1 and top-5 accuracy, precision, recall, and F1-score).
- The predictions should be a JSON file containing model predictions for the entire validation set.
- The visuals should be a zip file with tagging visualizations for the entire validation set.
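Before uploading, it can help to confirm that the three files listed above are at least well-formed. The following is a minimal sketch under stated assumptions: it only checks that the scores and predictions files parse as JSON and that the visuals archive is a readable, non-empty zip. It does not validate the content against any BDD100K schema, and the function name is hypothetical.

```python
import json
import zipfile


def check_artifacts(scores_path, preds_path, visuals_path):
    """Return a list of problems found; an empty list means the basic checks passed."""
    problems = []

    # Both the scores and the predictions are expected to be JSON files.
    for label, path in (("scores", scores_path), ("predictions", preds_path)):
        try:
            with open(path, "r", encoding="utf-8") as handle:
                json.load(handle)
        except (OSError, ValueError) as error:
            # json.JSONDecodeError is a subclass of ValueError.
            problems.append(f"{label}: {error}")

    # The visuals are expected to be a zip archive with at least one member.
    try:
        with zipfile.ZipFile(visuals_path) as archive:
            bad_member = archive.testzip()
            if bad_member is not None:
                problems.append(f"visuals: corrupt member {bad_member}")
            elif not archive.namelist():
                problems.append("visuals: archive is empty")
    except (OSError, zipfile.BadZipFile) as error:
        problems.append(f"visuals: {error}")

    return problems
```

An empty return value only means the files open correctly; the official numbers still come from the evaluation server.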
Example below:
Deep Layer Aggregation [CVPR 2018]
Authors: Fisher Yu, Dequan Wang, Evan Shelhamer, Trevor Darrell
Abstract
Visual recognition requires rich representations that span levels from low to high, scales from small to large, and resolutions from fine to coarse. Even with the depth of features in a convolutional network, a layer in isolation is not enough: compounding and aggregating these representations improves inference of what and where. Architectural efforts are exploring many dimensions for network backbones, designing deeper or wider architectures, but how to best aggregate layers and blocks across a network deserves further attention. Although skip connections have been incorporated to combine layers, these connections have been "shallow" themselves, and only fuse by simple, one-step operations. We augment standard architectures with deeper aggregation to better fuse information across layers. Our deep layer aggregation structures iteratively and hierarchically merge the feature hierarchy to make networks with better accuracy and fewer parameters. Experiments across architectures and tasks show that deep layer aggregation improves recognition and resolution compared to existing branching and merging schemes. The code is at [this https URL](https://github.com/ucbdrive/dla).

| Backbone | Input | Acc-val | Scores-val | Acc-test | Scores-test | Config | Weights | Preds | Visuals |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DLA-34 | 224 * 224 | 81.35 | scores | 81.24 | scores | config | model \| MD5 | preds | visuals |
| DLA-60 | 224 * 224 | 79.99 | scores | 79.65 | scores | config | model \| MD5 | preds | visuals |
| DLA-X-60 | 224 * 224 | 80.22 | scores | 79.80 | scores | config | model \| MD5 | preds | visuals |
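A row like the ones in the table above can also be assembled programmatically once the files are uploaded. The sketch below is one way to do it, assuming GitHub-flavored Markdown, where a literal pipe inside a cell is written as `\|`. The function names, the keys of the `urls` dictionary, and the `example.com` links are all hypothetical placeholders.

```python
def markdown_link(text, url):
    """Format a Markdown inline link."""
    return f"[{text}]({url})"


def tagging_row(backbone, input_size, acc_val, acc_test, urls):
    """Build one row for the image tagging results table.

    `urls` is a hypothetical dict; each value links to a file you have uploaded.
    """
    # The Weights cell holds two links separated by a pipe, which must be
    # escaped so that it does not start a new table cell.
    weights = (
        markdown_link("model", urls["model"])
        + " \\| "
        + markdown_link("MD5", urls["md5"])
    )
    cells = [
        backbone,
        input_size,
        f"{acc_val:.2f}",
        markdown_link("scores", urls["scores_val"]),
        f"{acc_test:.2f}",
        markdown_link("scores", urls["scores_test"]),
        markdown_link("config", urls["config"]),
        weights,
        markdown_link("preds", urls["preds"]),
        markdown_link("visuals", urls["visuals"]),
    ]
    return "| " + " | ".join(cells) + " |"


if __name__ == "__main__":
    keys = ("model", "md5", "scores_val", "scores_test", "config", "preds", "visuals")
    # Placeholder links only; replace them with your own storage URLs.
    placeholder_urls = {key: "https://example.com/" + key for key in keys}
    print(tagging_row("DLA-34", "224 * 224", 81.35, 81.24, placeholder_urls))
```

The other tasks use different columns, so the cell list would need to be adapted to the template of the task in question.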
Object Detection template and guidelines below:
Paper name [Venue and Year]
Authors: Author list
Abstract
Put your abstract here.

| Backbone | Lr schd | MS-train | Box AP-val | Scores-val | Box AP-test | Scores-test | Config | Weights | Preds | Visuals |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  |  |  |  | scores |  | scores | config | model \| MD5 | preds | visuals |
Other information.
- The scores file should be a JSON file with evaluation results for all the BDD100K detection metrics.
- The predictions should be a JSON file containing model predictions for the entire validation set.
- The visuals should be a zip file with bounding box visualizations on the validation set.
Example below:
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks [NeurIPS 2015]
Authors: Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun
Abstract
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features---using the recently popular terminology of neural networks with 'attention' mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.

| Backbone | Lr schd | MS-train | Box AP-val | Scores-val | Box AP-test | Scores-test | Config | Weights | Preds | Visuals |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R-50-FPN | 1x |  | 31.04 | scores | 29.78 | scores | config | model \| MD5 | preds | visuals |
| R-50-FPN | 3x | ✓ | 32.30 | scores | 31.45 | scores | config | model \| MD5 | preds | visuals |
| R-101-FPN | 3x | ✓ | 32.71 | scores | 31.96 | scores | config | model \| MD5 | preds | visuals |
Instance Segmentation template and guidelines below:
Paper name [Venue and Year]
Authors: Author list
Abstract
Put your abstract here.

| Backbone | Lr schd | MS-train | Mask AP-val | Box AP-val | Scores-val | Mask AP-test | Box AP-test | Scores-test | Config | Weights | Preds | Visuals |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  |  |  |  |  | scores |  |  | scores | config | model \| MD5 | preds | visuals |
Other information.
- The scores file should be a JSON file with evaluation results for all the BDD100K instance segmentation metrics.
- The predictions should be a JSON file containing model predictions for the entire validation set.
- The visuals should be a zip file with instance segmentation visualizations on the validation set.
Example below:
Mask R-CNN [ICCV 2017]
Authors: Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick
Abstract
We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without bells and whistles, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition. Code has been made available at: [this https URL](https://github.com/facebookresearch/detectron2).

| Backbone | Lr schd | MS-train | Mask AP-val | Box AP-val | Scores-val | Mask AP-test | Box AP-test | Scores-test | Config | Weights | Preds | Visuals |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R-50-FPN | 1x |  | 16.24 | 22.34 | scores | 14.86 | 19.59 | scores | config | model \| MD5 | preds | visuals |
| R-50-FPN | 3x | ✓ | 19.88 | 25.93 | scores | 17.46 | 22.32 | scores | config | model \| MD5 | preds | visuals |
| R-101-FPN | 3x | ✓ | 20.51 | 26.08 | scores | 17.88 | 22.01 | scores | config | model \| MD5 | preds | visuals |
Semantic Segmentation and Drivable Area template and guidelines below:
Paper name [Venue and Year]
Authors: Author list
Abstract
Put your abstract here.

| Backbone | Iters | Input | mIoU-val | Scores-val | mIoU-test | Scores-test | Config | Weights | Preds | Visuals |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  |  |  |  | scores |  | scores | config | model \| MD5 | preds | visuals |
Other information.
- The scores file should be a JSON file with evaluation results for all the BDD100K semantic segmentation metrics or drivable area metrics.
- The predictions should be a JSON file containing model predictions for the entire validation set.
- The visuals should be a zip file with segmentation visualizations on the validation set.
Example below:
Pyramid Scene Parsing Network [CVPR 2017]
Authors: Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, Jiaya Jia
Abstract
Scene parsing is challenging for unrestricted open vocabulary and diverse scenes. In this paper, we exploit the capability of global context information by different-region-based context aggregation through our pyramid pooling module together with the proposed pyramid scene parsing network (PSPNet). Our global prior representation is effective to produce good quality results on the scene parsing task, while PSPNet provides a superior framework for pixel-level prediction tasks. The proposed approach achieves state-of-the-art performance on various datasets. It came first in ImageNet scene parsing challenge 2016, PASCAL VOC 2012 benchmark and Cityscapes benchmark. A single PSPNet yields new record of mIoU accuracy 85.4% on PASCAL VOC 2012 and accuracy 80.2% on Cityscapes.

| Backbone | Iters | Input | mIoU-val | Scores-val | mIoU-test | Scores-test | Config | Weights | Preds | Visuals |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R-50-D8 | 40K | 512 * 1024 | 61.88 | scores | 54.50 | scores | config | model \| MD5 | preds | visuals |
| R-50-D8 | 80K | 512 * 1024 | 62.03 | scores | 54.99 | scores | config | model \| MD5 | preds | visuals |
| R-101-D8 | 80K | 512 * 1024 | 63.62 | scores | 56.32 | scores | config | model \| MD5 | preds | visuals |
Panoptic Segmentation template and guidelines below:
Paper name [Venue and Year]
Authors: Author list
Abstract
Put your abstract here.

| Backbone | Lr schd | MS-train | PQ-val | Scores-val | PQ-test | Scores-test | Config | Weights | Preds | Visuals |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  |  |  |  | scores |  | scores | config | model \| MD5 | preds | visuals |
Other information.
- The scores file should be a JSON file with evaluation results for all the BDD100K panoptic segmentation metrics.
- The predictions should be a JSON file containing model predictions for the entire validation set.
- The visuals should be a zip file with panoptic segmentation visualizations on the validation set.
Example below:
Panoptic Feature Pyramid Networks [CVPR 2019]
Authors: Alexander Kirillov, Ross Girshick, Kaiming He, Piotr Dollár
Abstract
The recently introduced panoptic segmentation task has renewed our community's interest in unifying the tasks of instance segmentation (for thing classes) and semantic segmentation (for stuff classes). However, current state-of-the-art methods for this joint task use separate and dissimilar networks for instance and semantic segmentation, without performing any shared computation. In this work, we aim to unify these methods at the architectural level, designing a single network for both tasks. Our approach is to endow Mask R-CNN, a popular instance segmentation method, with a semantic segmentation branch using a shared Feature Pyramid Network (FPN) backbone. Surprisingly, this simple baseline not only remains effective for instance segmentation, but also yields a lightweight, top-performing method for semantic segmentation. In this work, we perform a detailed study of this minimally extended version of Mask R-CNN with FPN, which we refer to as Panoptic FPN, and show it is a robust and accurate baseline for both tasks. Given its effectiveness and conceptual simplicity, we hope our method can serve as a strong baseline and aid future research in panoptic segmentation.

| Backbone | Lr schd | MS-train | PQ-val | Scores-val | PQ-test | Scores-test | Config | Weights | Preds | Visuals |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R-50-FPN | 1x |  | 21.52 | scores | 20.59 | scores | config | model \| MD5 | preds | visuals |
| R-50-FPN | 3x | ✓ | 22.39 | scores | 21.76 | scores | config | model \| MD5 | preds | visuals |
| R-101-FPN | 3x | ✓ | 22.61 | scores | 22.34 | scores | config | model \| MD5 | preds | visuals |
Multiple Object Tracking (MOT) template and guidelines below:
Paper name [Venue and Year]
Authors: Author list
Abstract
Put your abstract here.

| Detector | mMOTA-val | mIDF1-val | ID Sw.-val | Scores-val | mMOTA-test | mIDF1-test | ID Sw.-test | Scores-test | Config | Weights | Preds | Visuals |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  |  |  |  | scores |  |  |  | scores | config | model \| MD5 | preds | visuals |
Other information.
- The scores file should be a JSON file with evaluation results for all the BDD100K MOT metrics.
- The predictions should be a JSON file containing model predictions for the entire validation set.
- The visuals should be a zip file with bounding box tracking visualizations on the validation set. These can be images or videos.
Example below:
Quasi-Dense Similarity Learning for Multiple Object Tracking [CVPR 2021 Oral]
Authors: Jiangmiao Pang, Linlu Qiu, Xia Li, Haofeng Chen, Qi Li, Trevor Darrell, Fisher Yu
Abstract
Similarity learning has been recognized as a crucial step for object tracking. However, existing multiple object tracking methods only use sparse ground truth matching as the training objective, while ignoring the majority of the informative regions on the images. In this paper, we present Quasi-Dense Similarity Learning, which densely samples hundreds of region proposals on a pair of images for contrastive learning. We can naturally combine this similarity learning with existing detection methods to build Quasi-Dense Tracking (QDTrack) without turning to displacement regression or motion priors. We also find that the resulting distinctive feature space admits a simple nearest neighbor search at the inference time. Despite its simplicity, QDTrack outperforms all existing methods on MOT, BDD100K, Waymo, and TAO tracking benchmarks. It achieves 68.7 MOTA at 20.3 FPS on MOT17 without using external training data. Compared to methods with similar detectors, it boosts almost 10 points of MOTA and significantly decreases the number of ID switches on BDD100K and Waymo datasets.

| Detector | mMOTA-val | mIDF1-val | ID Sw.-val | Scores-val | mMOTA-test | mIDF1-test | ID Sw.-test | Scores-test | Config | Weights | Preds | Visuals |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet-50 | 36.6 | 51.6 | 6193 | scores | 35.7 | 52.3 | 10822 | scores | config | model \| MD5 | preds | visuals |
Multiple Object Tracking and Segmentation (MOTS) template and guidelines below:
Paper name [Venue and Year]
Authors: Author list
Abstract
Put your abstract here.

| Detector | mMOTSA-val | mIDF1-val | ID Sw.-val | Scores-val | mMOTSA-test | mIDF1-test | ID Sw.-test | Scores-test | Config | Weights | Preds | Visuals |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  |  |  |  | scores |  |  |  | scores | config | model \| MD5 | preds | visuals |
Other information.
- The scores file should be a JSON file with evaluation results for all the BDD100K MOTS metrics.
- The predictions should be a JSON file containing model predictions for the entire validation set.
- The visuals should be a zip file with segmentation tracking visualizations on the validation set. These can be images or videos.
Example below:
Prototypical Cross-Attention Networks (PCAN) for Multiple Object Tracking and Segmentation [NeurIPS 2021 Spotlight]
Authors: Lei Ke, Xia Li, Martin Danelljan, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu
Abstract
Multiple object tracking and segmentation requires detecting, tracking, and segmenting objects belonging to a set of given classes. Most approaches only exploit the temporal dimension to address the association problem, while relying on single frame predictions for the segmentation mask itself. We propose Prototypical Cross-Attention Network (PCAN), capable of leveraging rich spatio-temporal information for online multiple object tracking and segmentation. PCAN first distills a space-time memory into a set of prototypes and then employs cross-attention to retrieve rich information from the past frames. To segment each object, PCAN adopts a prototypical appearance module to learn a set of contrastive foreground and background prototypes, which are then propagated over time. Extensive experiments demonstrate that PCAN outperforms current video instance tracking and segmentation competition winners on both Youtube-VIS and BDD100K datasets, and shows efficacy to both one-stage and two-stage segmentation frameworks.

| Detector | mMOTSA-val | mIDF1-val | ID Sw.-val | Scores-val | mMOTSA-test | mIDF1-test | ID Sw.-test | Scores-test | Config | Weights | Preds | Visuals |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet-50 | 28.1 | 45.4 | 874 | scores | 31.9 | 50.4 | 845 | scores | config | model \| MD5 | preds | visuals |
Pose Estimation template and guidelines below:
Paper name [Venue and Year]
Authors: Author list
Abstract
Put your abstract here.

| Backbone | Input Size | Pose AP-val | Scores-val | Pose AP-test | Scores-test | Config | Weights | Preds | Visuals |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  |  |  | scores |  | scores | config | model \| MD5 | preds | visuals |
Other information.
- The scores file should be a JSON file with evaluation results for all the BDD100K pose estimation metrics.
- The predictions should be a JSON file containing model predictions for the entire validation set.
- The visuals should be a zip file with pose visualizations on the validation set.
Example below:
Deep High-Resolution Representation Learning for Visual Recognition [CVPR 2019 / TPAMI 2020]
Authors: Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, Bin Xiao
Abstract
High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. Existing state-of-the-art frameworks first encode the input image as a low-resolution representation through a subnetwork that is formed by connecting high-to-low resolution convolutions in series (e.g., ResNet, VGGNet), and then recover the high-resolution representation from the encoded low-resolution representation. Instead, our proposed network, named as High-Resolution Network (HRNet), maintains high-resolution representations through the whole process. There are two key characteristics: (i) Connect the high-to-low resolution convolution streams in parallel; (ii) Repeatedly exchange the information across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, suggesting that the HRNet is a stronger backbone for computer vision problems. All the codes are available at [this https URL](https://github.com/HRNet).

| Backbone | Input Size | Pose AP-val | Scores-val | Pose AP-test | Scores-test | Config | Weights | Preds | Visuals |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HRNet-w32 | 256 * 192 | 48.83 | scores | 46.13 | scores | config | model \| MD5 | preds | visuals |
| HRNet-w48 | 256 * 192 | 50.32 | scores | 47.36 | scores | config | model \| MD5 | preds | visuals |
| HRNet-w32 | 320 * 256 | 49.86 | scores | 46.90 | scores | config | model \| MD5 | preds | visuals |
| HRNet-w48 | 320 * 256 | 50.16 | scores | 47.32 | scores | config | model \| MD5 | preds | visuals |