TensorRT-OSS 8.2 GA release
Signed-off-by: Rajeev Rao <rajeevrao@nvidia.com>
ttyio authored and rajeevsrao committed Nov 24, 2021
1 parent 9ec6eb6 commit 6f38570
Showing 184 changed files with 5,739 additions and 1,152 deletions.
58 changes: 58 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,63 @@
# TensorRT OSS Release Changelog

## [8.2.1 GA](https://docs.nvidia.com/deeplearning/tensorrt/release-notes/tensorrt-8.html#rel-8-2-1) - 2021-11-24

TensorRT OSS release corresponding to TensorRT 8.2.1.8 GA release.
- Updates since [TensorRT 8.2.0 EA release](https://github.com/NVIDIA/TensorRT/releases/tag/8.2.0-EA).
- Please refer to the [TensorRT 8.2.1 GA release notes](https://docs.nvidia.com/deeplearning/tensorrt/release-notes/tensorrt-8.html#rel-8-2-1) for more information.

- ONNX parser [v8.2.1](https://github.com/onnx/onnx-tensorrt/releases/tag/release%2F8.2-GA)
  - Removed duplicate constant layer checks that caused some performance regressions
  - Fixed expand dynamic shape calculations
  - Added parser-side checks for `Scatter` layer support

- Sample updates
  - Added [TensorFlow Object Detection API converter samples](samples/python/tensorflow_object_detection_api), including Single Shot Detector, Faster R-CNN, and Mask R-CNN models
  - Multiple enhancements in HuggingFace transformer demos
    - Added multi-batch support
    - Fixed the resulting performance regression at batch size 1
    - Fixed T5 large/T5-3B accuracy issues
    - Added [notebooks](demo/HuggingFace/notebooks) for T5 and GPT-2
    - Added a CPU benchmarking option
  - Deprecated `kSTRICT_TYPES` (strict type constraints); equivalent behavior is now achieved by setting `PREFER_PRECISION_CONSTRAINTS`, `DIRECT_IO`, and `REJECT_EMPTY_ALGORITHMS` (see the sketch after this list)
  - Removed `sampleMovieLens`
  - Renamed `sampleReformatFreeIO` to `sampleIOFormats`
  - Added an `idleTime` option for samples to control QPS
  - Specified a default value for `precisionConstraints`
  - Fixed reporting of the TensorRT build version in trtexec
  - Fixed the `combineDescriptions` typo in trtexec/tracer.py
  - Fixed usages of `kDIRECT_IO`
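
A minimal sketch of the `kSTRICT_TYPES` replacement using TensorRT's Python bindings (assuming TensorRT 8.2; the builder/config setup here is illustrative and not taken from any particular sample):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# The three flags that together replace the deprecated kSTRICT_TYPES:
config.set_flag(trt.BuilderFlag.PREFER_PRECISION_CONSTRAINTS)  # honor requested layer precisions where possible
config.set_flag(trt.BuilderFlag.DIRECT_IO)                     # do not insert reformatting at network I/O tensors
config.set_flag(trt.BuilderFlag.REJECT_EMPTY_ALGORITHMS)       # fail if constraints leave no usable tactic
```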

- Plugin updates
  - Extended `EfficientNMS` plugin support to TF-TRT and to clang builds
  - Sanitized header definitions for the BERT fused MHA plugin
  - Separated C++ and .cu files in `splitPlugin` to avoid PTX generation (required for CUDA enhanced compatibility support)
  - Enabled C++14 builds for plugins

- ONNX tooling updates
  - [onnx-graphsurgeon](tools/onnx-graphsurgeon/CHANGELOG.md) upgraded to v0.3.14 (a brief usage sketch follows this list)
  - [Polygraphy](tools/Polygraphy/CHANGELOG.md) upgraded to v0.33.2
  - [pytorch-quantization](tools/pytorch-quantization) toolkit upgraded to v2.1.2
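
A brief usage sketch of the upgraded onnx-graphsurgeon package (the paths are placeholders; this only illustrates the import/cleanup/export round trip):

```python
import onnx
import onnx_graphsurgeon as gs

# "model.onnx" is a placeholder path for any ONNX model on disk.
graph = gs.import_onnx(onnx.load("model.onnx"))

# Inspect the ops present, prune dangling nodes/tensors, and re-export.
print(sorted({node.op for node in graph.nodes}))
graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "model.cleaned.onnx")
```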

- Build and container fixes
  - Added the `SM86` target to the default `GPU_ARCHS` for platforms with cuda-11.1+ (see the override example after this list)
  - Removed the deprecated `SM_35` target and added `SM_60` to the default `GPU_ARCHS`
  - Skipped CUB builds for cuda 11.0+ [#1455](https://github.com/NVIDIA/TensorRT/pull/1455)
  - Fixed cuda-10.2 container build failures on Ubuntu 20.04
  - Added a native ARM server build container
  - Installed devtoolset-8 for an updated g++ version on CentOS7
  - Added a note on supporting C++14 builds for CentOS7
  - Fixed docker builds for large UIDs [#1373](https://github.com/NVIDIA/TensorRT/issues/1373)
  - Updated README instructions for JetPack builds
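
The default `GPU_ARCHS` list can still be overridden at configure time; a sketch of a build restricted to Turing and Ampere (SM75/SM80/SM86), assuming `$TRT_OSSPATH` and `$TRT_LIBPATH` are set as in the README build instructions:

```bash
cd $TRT_OSSPATH
mkdir -p build && cd build
cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DGPU_ARCHS="75 80 86"
make -j$(nproc)
```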

- Demo enhancements
  - Updated Tacotron2 instructions and added CPU benchmarking
  - Fixed issues in the demoBERT Python notebook

- Documentation updates
  - Updated Python documentation for `add_reduce`, `add_top_k`, and `ISoftMaxLayer`
  - Renamed default GitHub branch to `main` and updated hyperlinks

## [8.2.0 EA](https://docs.nvidia.com/deeplearning/tensorrt/release-notes/tensorrt-8.html#rel-8-2-0-EA) - 2021-10-05
### Added
- [Demo applications](demo/HuggingFace) showcasing TensorRT inference of [HuggingFace Transformers](https://huggingface.co/transformers).
7 changes: 4 additions & 3 deletions CMakeLists.txt
@@ -141,8 +141,8 @@ if (DEFINED GPU_ARCHS)
separate_arguments(GPU_ARCHS)
else()
list(APPEND GPU_ARCHS
35
53
60
61
70
75
@@ -157,8 +157,9 @@ else()
if (CUDA_VERSION VERSION_GREATER_EQUAL 11.0)
# Ampere GPU (SM80) support is only available in CUDA versions > 11.0
list(APPEND GPU_ARCHS 80)
else()
message(WARNING "Detected CUDA version is < 11.0. SM80 not supported.")
endif()
if (CUDA_VERSION VERSION_GREATER_EQUAL 11.1)
list(APPEND GPU_ARCHS 86)
endif()

message(STATUS "GPU_ARCHS is not defined. Generating CUDA code for default SMs: ${GPU_ARCHS}")
39 changes: 31 additions & 8 deletions README.md
@@ -15,7 +15,7 @@ This repository contains the Open Source Software (OSS) components of NVIDIA Ten
To build the TensorRT-OSS components, you will first need the following software packages.

**TensorRT GA build**
* [TensorRT](https://developer.nvidia.com/nvidia-tensorrt-download) v8.2.0.6
* [TensorRT](https://developer.nvidia.com/nvidia-tensorrt-download) v8.2.1.8

**System Packages**
* [CUDA](https://developer.nvidia.com/cuda-toolkit)
@@ -70,16 +70,16 @@ To build the TensorRT-OSS components, you will first need the following software

```bash
cd ~/Downloads
tar -xvzf TensorRT-8.2.0.6.Linux.x86_64-gnu.cuda-11.4.cudnn8.2.tar.gz
export TRT_LIBPATH=`pwd`/TensorRT-8.2.0.6
tar -xvzf TensorRT-8.2.1.8.Linux.x86_64-gnu.cuda-11.4.cudnn8.2.tar.gz
export TRT_LIBPATH=`pwd`/TensorRT-8.2.1.8
```

**Example: Windows on x86-64 with cuda-11.4**

```powershell
cd ~\Downloads
Expand-Archive .\TensorRT-8.2.0.6.Windows10.x86_64.cuda-11.4.cudnn8.2.zip
$Env:TRT_LIBPATH = '$(Get-Location)\TensorRT-8.2.0.6'
Expand-Archive .\TensorRT-8.2.1.8.Windows10.x86_64.cuda-11.4.cudnn8.2.zip
$Env:TRT_LIBPATH = '$(Get-Location)\TensorRT-8.2.1.8'
$Env:PATH += 'C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\MSBuild\15.0\Bin\'
```

@@ -110,6 +110,10 @@ For Linux platforms, we recommend that you generate a docker container for build
```bash
./docker/build.sh --file docker/ubuntu-cross-aarch64.Dockerfile --tag tensorrt-jetpack-cuda10.2 --cuda 10.2
```
**Example: Ubuntu 20.04 on aarch64 with cuda-11.4.2**
```bash
./docker/build.sh --file docker/ubuntu-20.04-aarch64.Dockerfile --tag tensorrt-aarch64-ubuntu20.04-cuda11.4
```
2. #### Launch the TensorRT-OSS build container.
**Example: Ubuntu 18.04 build container**
@@ -132,6 +136,23 @@ For Linux platforms, we recommend that you generate a docker container for build
cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out
make -j$(nproc)
```
> NOTE: On CentOS7, the default g++ version does not support C++14. For native builds (not using the CentOS7 build container), first install devtoolset-8 to obtain the updated g++ toolchain as follows:
```bash
yum -y install centos-release-scl
yum-config-manager --enable rhel-server-rhscl-7-rpms
yum -y install devtoolset-8
export PATH="/opt/rh/devtoolset-8/root/bin:${PATH}"
```
**Example: Linux (aarch64) build with default cuda-11.4.2**
```bash
cd $TRT_OSSPATH
mkdir -p build && cd build
cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_aarch64-native.toolchain
make -j$(nproc)
```
**Example: Native build on Jetson (aarch64) with cuda-10.2**
```bash
cd $TRT_OSSPATH
@@ -141,13 +162,15 @@ For Linux platforms, we recommend that you generate a docker container for build
```
> NOTE: C compiler must be explicitly specified via `CC=` for native `aarch64` builds of protobuf.
**Example: Ubuntu 18.04 Cross-Compile for Jetson (arm64) with cuda-10.2 (JetPack)**
**Example: Ubuntu 18.04 Cross-Compile for Jetson (aarch64) with cuda-10.2 (JetPack)**
```bash
cd $TRT_OSSPATH
mkdir -p build && cd build
cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_aarch64.toolchain -DCUDA_VERSION=10.2
cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_aarch64.toolchain -DCUDA_VERSION=10.2 -DCUDNN_LIB=/pdk_files/cudnn/usr/lib/aarch64-linux-gnu/libcudnn.so -DCUBLAS_LIB=/usr/local/cuda-10.2/targets/aarch64-linux/lib/stubs/libcublas.so -DCUBLASLT_LIB=/usr/local/cuda-10.2/targets/aarch64-linux/lib/stubs/libcublasLt.so
make -j$(nproc)
```
> NOTE: The latest JetPack SDK v4.6 only supports TensorRT 8.0.1.
**Example: Windows (x86-64) build in Powershell**
```powershell
cd $Env:TRT_OSSPATH
@@ -191,4 +214,4 @@ For Linux platforms, we recommend that you generate a docker container for build
## Known Issues
* None
* Please refer to [TensorRT 8.2 Release Notes](https://docs.nvidia.com/deeplearning/tensorrt/release-notes/tensorrt-8.html#tensorrt-8)
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
8.2.0.6
8.2.1.8
37 changes: 37 additions & 0 deletions cmake/toolchains/cmake_aarch64-native.toolchain
@@ -0,0 +1,37 @@
#
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_SYSTEM_PROCESSOR aarch64)

set(TRT_PLATFORM_ID "aarch64")

set(CUDA_PLATFORM_ID "sbsa-linux")

set(CMAKE_C_COMPILER /usr/bin/aarch64-linux-gnu-gcc)
set(CMAKE_CXX_COMPILER /usr/bin/aarch64-linux-gnu-g++)

set(CMAKE_C_FLAGS "" CACHE STRING "" FORCE)
set(CMAKE_CXX_FLAGS "" CACHE STRING "" FORCE)

set(CMAKE_C_COMPILER_TARGET aarch64-linux-gnu)
set(CMAKE_CXX_COMPILER_TARGET aarch64-linux-gnu)

set(CMAKE_C_COMPILER_FORCED TRUE)
set(CMAKE_CXX_COMPILER_FORCED TRUE)

set(CUDA_TOOLKIT_ROOT_DIR /usr/local/cuda/targets/${CUDA_PLATFORM_ID} CACHE STRING "CUDA ROOT dir")
set(CUDA_INCLUDE_DIRS ${CUDA_TOOLKIT_ROOT_DIR}/include)
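
For reference, this new toolchain file is the one consumed by the native aarch64 build shown in the README above; a sketch of the invocation, assuming `$TRT_OSSPATH` and `$TRT_LIBPATH` are set as in the build instructions:

```bash
cd $TRT_OSSPATH
mkdir -p build && cd build
cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out \
      -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_aarch64-native.toolchain
make -j$(nproc)
```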
4 changes: 2 additions & 2 deletions demo/BERT/notebooks/BERT-TRT-FP16.ipynb
@@ -171,7 +171,7 @@
"outputs": [],
"source": [
"# Build BERT TensorRT FP16 model from NGC checkpoint\n",
"!python3 ../builder.py -m models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_384_v19.03.1/model.ckpt -w 40000 -o engines_$TRT_VERSION/bert_large_384.engine -b $BATCH_SIZE -s 384 --fp16 -c models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_384_v19.03.1"
"!python3 ../builder.py -m models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_384_v19.03.1/model.ckpt -w 40000 -o engines_$TRT_VERSION/bert_large_384.engine -b 1 -b $BATCH_SIZE -s 384 --fp16 -c models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_384_v19.03.1"
]
},
{
@@ -333,7 +333,7 @@
"metadata": {},
"outputs": [],
"source": [
"!python3 ../builder.py -m models/fine-tuned/bert_tf_ckpt_base_qa_squad2_amp_128_v19.03.1/model.ckpt -w 40000 -o engines_$TRT_VERSION/bert_base_128.engine -b $BATCH_SIZE -s 128 --fp16 -c models/fine-tuned/bert_tf_ckpt_base_qa_squad2_amp_128_v19.03.1"
"!python3 ../builder.py -m models/fine-tuned/bert_tf_ckpt_base_qa_squad2_amp_128_v19.03.1/model.ckpt -w 40000 -o engines_$TRT_VERSION/bert_base_128.engine -b 1 -b $BATCH_SIZE -s 128 --fp16 -c models/fine-tuned/bert_tf_ckpt_base_qa_squad2_amp_128_v19.03.1"
]
},
{
15 changes: 12 additions & 3 deletions demo/BERT/perf.py
@@ -79,11 +79,20 @@ def main():
bench_times = {}

stream = cuda.Stream()
for idx, batch_size in enumerate(sorted(args.batch_size)):
context.set_optimization_profile_async(idx, stream.handle)
for batch_size in sorted(args.batch_size):
# Select engine profile
selected_profile = -1
for idx in range(engine.num_optimization_profiles):
profile_shape = engine.get_profile_shape(profile_index = idx, binding = idx * num_binding_per_profile)
if profile_shape[0][0] <= batch_size and profile_shape[2][0] >= batch_size and profile_shape[0][1] <= args.sequence_length and profile_shape[2][1] >= args.sequence_length:
selected_profile = idx
break
if selected_profile == -1:
raise RuntimeError("None of the dynamic shape profiles meets the requirement batch = {} and sequence = {}.".format(batch_size, args.sequence_length))
context.set_optimization_profile_async(selected_profile, stream.handle)

# Each profile has unique bindings
binding_idx_offset = idx * num_binding_per_profile
binding_idx_offset = selected_profile * num_binding_per_profile
bindings = [0] * binding_idx_offset + [buf.binding() for buf in buffers]

shapes = {
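
A standalone restatement of the profile-selection logic added above, assuming TensorRT's Python bindings; `engine`, `num_binding_per_profile`, `batch_size`, and `seq_len` stand in for the objects used in perf.py:

```python
def select_profile(engine, num_binding_per_profile, batch_size, seq_len):
    """Return the index of the first optimization profile whose min/max
    shapes cover (batch_size, seq_len); raise if none does."""
    for idx in range(engine.num_optimization_profiles):
        # Query the first binding of this profile; shapes come back as (min, opt, max).
        min_shape, _, max_shape = engine.get_profile_shape(
            profile_index=idx, binding=idx * num_binding_per_profile)
        if (min_shape[0] <= batch_size <= max_shape[0]
                and min_shape[1] <= seq_len <= max_shape[1]):
            return idx
    raise RuntimeError(
        "No dynamic shape profile covers batch = {} and sequence = {}".format(
            batch_size, seq_len))
```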
12 changes: 6 additions & 6 deletions demo/HuggingFace/GPT2/export.py
@@ -78,10 +78,10 @@ def __init__(self, model, network_metadata):

# TRT Engine File Encoding #
class GPT2TRTEngine(TRTEngineFile):
def __init__(self, model, network_metadata):
super().__init__(model, GPT2Converter, network_metadata)
def __init__(self, model, network_metadata, batch_size = 1):
super().__init__(model, GPT2Converter, network_metadata, batch_size = batch_size)

def use_strict_types(self):
def use_obey_precision_constraints(self):
return self.network_metadata.precision.fp16

def get_dynamic_shape_profiles(self):
@@ -91,9 +91,9 @@ def get_dynamic_shape_profiles(self):
profile = Profile()
profile.add(
"input_ids",
min=(1, 1),
opt=(1, max_sequence_length // 2),
max=(1, max_sequence_length),
min=(self.batch_size, 1),
opt=(self.batch_size, max_sequence_length // 2),
max=(self.batch_size, max_sequence_length),
)
return [profile]

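
For context, `Profile` here is Polygraphy's optimization-profile helper; a sketch of how a batch-aware profile like the one above is typically handed to Polygraphy's TensorRT builder (the ONNX path and shape values are placeholders, not the demo's exact pipeline):

```python
from polygraphy.backend.trt import (CreateConfig, EngineFromNetwork,
                                    NetworkFromOnnxPath, Profile)

batch_size, max_sequence_length = 4, 64  # placeholder values
profile = Profile().add(
    "input_ids",
    min=(batch_size, 1),
    opt=(batch_size, max_sequence_length // 2),
    max=(batch_size, max_sequence_length),
)

# Build an engine whose single optimization profile covers the chosen batch size.
build_engine = EngineFromNetwork(
    NetworkFromOnnxPath("gpt2.onnx"),  # placeholder ONNX export of the model
    config=CreateConfig(fp16=True, profiles=[profile]),
)
engine = build_engine()
```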
28 changes: 20 additions & 8 deletions demo/HuggingFace/GPT2/frameworks.py
@@ -155,11 +155,17 @@ def execute_inference(
network_fpaths: NetworkModels,
inference_input: str,
timing_profile: TimingProfile,
use_cpu: bool,
batch_size: int = 1
) -> NetworkResult:

# Execute some tests
tokenizer = GPT2Tokenizer.from_pretrained(metadata.variant)
input_ids = tokenizer(inference_input, return_tensors="pt").input_ids

# GPT2 has no proper token set. Use custom token. Only "generate()" will auto
# replace with EOS token when using generating mode
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
input_ids = tokenizer([inference_input] * batch_size, padding=True, return_tensors="pt").input_ids

# By default, HuggingFace model structure is one giant file.
gpt2_torch_fpath = network_fpaths.torch[0].fpath
@@ -172,7 +178,7 @@

# get single decoder iteration inference timing profile
_, decoder_e2e_median_time = gpt2_inference(
gpt2_torch, input_ids, timing_profile
gpt2_torch, input_ids, timing_profile, use_cuda=(not use_cpu)
)

# get complete decoder inference result and its timing profile
@@ -181,13 +187,17 @@
input_ids,
timing_profile,
max_length=GPT2ModelTRTConfig.MAX_SEQUENCE_LENGTH[metadata.variant],
use_cuda=(not use_cpu),
batch_size=batch_size
)

semantic_outputs = []
for i, sample_output in enumerate(sample_output):
semantic_outputs.append(
tokenizer.decode(sample_output, skip_special_tokens=True)
)
# Remove the padding and end tokens.
semantic_outputs = tokenizer.decode(
sample_output[-1, :], skip_special_tokens=True
)

if isinstance(semantic_outputs, list):
semantic_outputs = " ".join(semantic_outputs).strip()

return NetworkResult(
input=inference_input,
@@ -214,6 +224,8 @@ def run_framework(
keep_onnx_model: bool,
keep_pytorch_model: bool,
timing_profile: TimingProfile,
use_cpu: bool = False,
batch_size: int = 1
) -> List[NetworkResult]:
"""
Main entry point of our function which compiles and generates our model data.
@@ -227,7 +239,7 @@
for ninput in network_input:
results.append(
self.execute_inference(
metadata, network_fpaths, ninput, timing_profile
metadata, network_fpaths, ninput, timing_profile, use_cpu
)
)
finally:
9 changes: 7 additions & 2 deletions demo/HuggingFace/GPT2/measurements.py
@@ -24,6 +24,7 @@
# TRT-HuggingFace
from NNDF.general_utils import measure_python_inference_code
from NNDF.torch_utils import use_cuda
from NNDF.tensorrt_utils import TRTNativeRunner


@use_cuda
@@ -37,9 +38,13 @@ def gpt2_inference(gpt2, input_ids, timing_profile, use_cuda=True):

# Code specifically for Pythonic inference measurement used across all GPT2 related scripts
@use_cuda
def full_inference_greedy(gpt2, input_ids, timing_profile, max_length, use_cuda=True):
def full_inference_greedy(gpt2, input_ids, timing_profile, max_length, use_cuda=True, batch_size=1):

if isinstance(gpt2, TRTNativeRunner):
gpt2.set_return_device("cuda" if use_cuda else "cpu")

def _e2e():
return gpt2.generate(input_ids, max_length=max_length) # greedy search
return gpt2.generate(input_ids, max_length=max_length, batch_size=batch_size) # greedy search

full_e2e_median_time = measure_python_inference_code(
_e2e,