TensorRT-OSS 8.2 GA release
Signed-off-by: Rajeev Rao <rajeevrao@nvidia.com>
ttyio authored and rajeevsrao committed Nov 24, 2021
1 parent 9ec6eb6 commit 6f38570
Showing 184 changed files with 5,739 additions and 1,152 deletions.
58 changes: 58 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,63 @@
# TensorRT OSS Release Changelog

## [8.2.1 GA](https://docs.nvidia.com/deeplearning/tensorrt/release-notes/tensorrt-8.html#rel-8-2-1) - 2021-11-24

TensorRT OSS release corresponding to TensorRT 8.2.1.8 GA release.
- Updates since [TensorRT 8.2.0 EA release](https://github.com/NVIDIA/TensorRT/releases/tag/8.2.0-EA).
- Please refer to the [TensorRT 8.2.1 GA release notes](https://docs.nvidia.com/deeplearning/tensorrt/release-notes/tensorrt-8.html#rel-8-2-1) for more information.

- ONNX parser [v8.2.1](https://github.com/onnx/onnx-tensorrt/releases/tag/release%2F8.2-GA)
  - Removed duplicate constant layer checks that caused some performance regressions
  - Fixed expand dynamic shape calculations
  - Added parser-side checks for `Scatter` layer support

- Sample updates
  - Added [TensorFlow Object Detection API converter samples](samples/python/tensorflow_object_detection_api), including Single Shot Detector, Faster R-CNN, and Mask R-CNN models
  - Multiple enhancements in HuggingFace transformer demos
    - Added multi-batch support
    - Fixed the resulting performance regression at batch size 1
    - Fixed T5 large/T5-3B accuracy issues
    - Added [notebooks](demo/HuggingFace/notebooks) for T5 and GPT-2
    - Added a CPU benchmarking option
  - Deprecated `kSTRICT_TYPES` (strict type constraints); equivalent behavior is now achieved by setting `PREFER_PRECISION_CONSTRAINTS`, `DIRECT_IO`, and `REJECT_EMPTY_ALGORITHMS` (see the sketch after this list)
  - Removed `sampleMovieLens`
  - Renamed `sampleReformatFreeIO` to `sampleIOFormats`
  - Added an `idleTime` option for samples to control QPS
  - Specified a default value for `precisionConstraints`
  - Fixed reporting of the TensorRT build version in trtexec
  - Fixed the `combineDescriptions` typo in trtexec/tracer.py
  - Fixed usages of `kDIRECT_IO`
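
A minimal sketch of the `kSTRICT_TYPES` replacement using TensorRT's Python bindings (assuming TensorRT 8.2; the builder/config setup here is illustrative and not taken from any particular sample):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# The three flags that together replace the deprecated kSTRICT_TYPES:
config.set_flag(trt.BuilderFlag.PREFER_PRECISION_CONSTRAINTS)  # honor requested layer precisions where possible
config.set_flag(trt.BuilderFlag.DIRECT_IO)                     # do not insert reformatting at network I/O tensors
config.set_flag(trt.BuilderFlag.REJECT_EMPTY_ALGORITHMS)       # fail if constraints leave no usable tactic
```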

- Plugin updates
  - Extended `EfficientNMS` plugin support to TF-TRT and to clang builds
  - Sanitized header definitions for the BERT fused MHA plugin
  - Separated C++ and .cu files in `splitPlugin` to avoid PTX generation (required for CUDA enhanced compatibility support)
  - Enabled C++14 builds for plugins

- ONNX tooling updates
  - [onnx-graphsurgeon](tools/onnx-graphsurgeon/CHANGELOG.md) upgraded to v0.3.14 (a brief usage sketch follows this list)
  - [Polygraphy](tools/Polygraphy/CHANGELOG.md) upgraded to v0.33.2
  - [pytorch-quantization](tools/pytorch-quantization) toolkit upgraded to v2.1.2
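
A brief usage sketch of the upgraded onnx-graphsurgeon package (the paths are placeholders; this only illustrates the import/cleanup/export round trip):

```python
import onnx
import onnx_graphsurgeon as gs

# "model.onnx" is a placeholder path for any ONNX model on disk.
graph = gs.import_onnx(onnx.load("model.onnx"))

# Inspect the ops present, prune dangling nodes/tensors, and re-export.
print(sorted({node.op for node in graph.nodes}))
graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "model.cleaned.onnx")
```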

- Build and container fixes
  - Added the `SM86` target to the default `GPU_ARCHS` for platforms with cuda-11.1+ (see the override example after this list)
  - Removed the deprecated `SM_35` target and added `SM_60` to the default `GPU_ARCHS`
  - Skipped CUB builds for cuda 11.0+ [#1455](https://github.com/NVIDIA/TensorRT/pull/1455)
  - Fixed cuda-10.2 container build failures on Ubuntu 20.04
  - Added a native ARM server build container
  - Installed devtoolset-8 for an updated g++ version on CentOS7
  - Added a note on supporting C++14 builds for CentOS7
  - Fixed docker builds for large UIDs [#1373](https://github.com/NVIDIA/TensorRT/issues/1373)
  - Updated README instructions for JetPack builds
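
The default `GPU_ARCHS` list can still be overridden at configure time; a sketch of a build restricted to Turing and Ampere (SM75/SM80/SM86), assuming `$TRT_OSSPATH` and `$TRT_LIBPATH` are set as in the README build instructions:

```bash
cd $TRT_OSSPATH
mkdir -p build && cd build
cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DGPU_ARCHS="75 80 86"
make -j$(nproc)
```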

- Demo enhancements
  - Updated Tacotron2 instructions and added CPU benchmarking
  - Fixed issues in the demoBERT Python notebook

- Documentation updates
  - Updated Python documentation for `add_reduce`, `add_top_k`, and `ISoftMaxLayer`
  - Renamed default GitHub branch to `main` and updated hyperlinks

## [8.2.0 EA](https://docs.nvidia.com/deeplearning/tensorrt/release-notes/tensorrt-8.html#rel-8-2-0-EA) - 2021-10-05
### Added
- [Demo applications](demo/HuggingFace) showcasing TensorRT inference of [HuggingFace Transformers](https://huggingface.co/transformers).
7 changes: 4 additions & 3 deletions CMakeLists.txt
@@ -141,8 +141,8 @@ if (DEFINED GPU_ARCHS)
separate_arguments(GPU_ARCHS)
else()
list(APPEND GPU_ARCHS
35
53
60
61
70
75
@@ -157,8 +157,9 @@ else()
if (CUDA_VERSION VERSION_GREATER_EQUAL 11.0)
# Ampere GPU (SM80) support is only available in CUDA versions > 11.0
list(APPEND GPU_ARCHS 80)
else()
message(WARNING "Detected CUDA version is < 11.0. SM80 not supported.")
endif()
if (CUDA_VERSION VERSION_GREATER_EQUAL 11.1)
list(APPEND GPU_ARCHS 86)
endif()

message(STATUS "GPU_ARCHS is not defined. Generating CUDA code for default SMs: ${GPU_ARCHS}")
39 changes: 31 additions & 8 deletions README.md
@@ -15,7 +15,7 @@ This repository contains the Open Source Software (OSS) components of NVIDIA Ten
To build the TensorRT-OSS components, you will first need the following software packages.

**TensorRT GA build**
* [TensorRT](https://developer.nvidia.com/nvidia-tensorrt-download) v8.2.0.6
* [TensorRT](https://developer.nvidia.com/nvidia-tensorrt-download) v8.2.1.8

**System Packages**
* [CUDA](https://developer.nvidia.com/cuda-toolkit)
@@ -70,16 +70,16 @@ To build the TensorRT-OSS components, you will first need the following software

```bash
cd ~/Downloads
tar -xvzf TensorRT-8.2.0.6.Linux.x86_64-gnu.cuda-11.4.cudnn8.2.tar.gz
export TRT_LIBPATH=`pwd`/TensorRT-8.2.0.6
tar -xvzf TensorRT-8.2.1.8.Linux.x86_64-gnu.cuda-11.4.cudnn8.2.tar.gz
export TRT_LIBPATH=`pwd`/TensorRT-8.2.1.8
```

**Example: Windows on x86-64 with cuda-11.4**

```powershell
cd ~\Downloads
Expand-Archive .\TensorRT-8.2.0.6.Windows10.x86_64.cuda-11.4.cudnn8.2.zip
$Env:TRT_LIBPATH = '$(Get-Location)\TensorRT-8.2.0.6'
Expand-Archive .\TensorRT-8.2.1.8.Windows10.x86_64.cuda-11.4.cudnn8.2.zip
$Env:TRT_LIBPATH = '$(Get-Location)\TensorRT-8.2.1.8'
$Env:PATH += 'C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\MSBuild\15.0\Bin\'
```

@@ -110,6 +110,10 @@ For Linux platforms, we recommend that you generate a docker container for build
```bash
./docker/build.sh --file docker/ubuntu-cross-aarch64.Dockerfile --tag tensorrt-jetpack-cuda10.2 --cuda 10.2
```
**Example: Ubuntu 20.04 on aarch64 with cuda-11.4.2**
```bash
./docker/build.sh --file docker/ubuntu-20.04-aarch64.Dockerfile --tag tensorrt-aarch64-ubuntu20.04-cuda11.4
```
2. #### Launch the TensorRT-OSS build container.
**Example: Ubuntu 18.04 build container**
@@ -132,6 +136,23 @@ For Linux platforms, we recommend that you generate a docker container for build
cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out
make -j$(nproc)
```
> NOTE: On CentOS7, the default g++ version does not support C++14. For native builds (not using the CentOS7 build container), first install devtoolset-8 to obtain the updated g++ toolchain as follows:
```bash
yum -y install centos-release-scl
yum-config-manager --enable rhel-server-rhscl-7-rpms
yum -y install devtoolset-8
export PATH="/opt/rh/devtoolset-8/root/bin:${PATH}"
```
**Example: Linux (aarch64) build with default cuda-11.4.2**
```bash
cd $TRT_OSSPATH
mkdir -p build && cd build
cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_aarch64-native.toolchain
make -j$(nproc)
```
**Example: Native build on Jetson (aarch64) with cuda-10.2**
```bash
cd $TRT_OSSPATH
@@ -141,13 +162,15 @@ For Linux platforms, we recommend that you generate a docker container for build
```
> NOTE: C compiler must be explicitly specified via `CC=` for native `aarch64` builds of protobuf.
**Example: Ubuntu 18.04 Cross-Compile for Jetson (arm64) with cuda-10.2 (JetPack)**
**Example: Ubuntu 18.04 Cross-Compile for Jetson (aarch64) with cuda-10.2 (JetPack)**
```bash
cd $TRT_OSSPATH
mkdir -p build && cd build
cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_aarch64.toolchain -DCUDA_VERSION=10.2
cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_aarch64.toolchain -DCUDA_VERSION=10.2 -DCUDNN_LIB=/pdk_files/cudnn/usr/lib/aarch64-linux-gnu/libcudnn.so -DCUBLAS_LIB=/usr/local/cuda-10.2/targets/aarch64-linux/lib/stubs/libcublas.so -DCUBLASLT_LIB=/usr/local/cuda-10.2/targets/aarch64-linux/lib/stubs/libcublasLt.so
make -j$(nproc)
```
> NOTE: The latest JetPack SDK v4.6 only supports TensorRT 8.0.1.
**Example: Windows (x86-64) build in Powershell**
```powershell
cd $Env:TRT_OSSPATH
@@ -191,4 +214,4 @@ For Linux platforms, we recommend that you generate a docker container for build
## Known Issues
* None
* Please refer to [TensorRT 8.2 Release Notes](https://docs.nvidia.com/deeplearning/tensorrt/release-notes/tensorrt-8.html#tensorrt-8)
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
8.2.0.6
8.2.1.8
37 changes: 37 additions & 0 deletions cmake/toolchains/cmake_aarch64-native.toolchain
@@ -0,0 +1,37 @@
#
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_SYSTEM_PROCESSOR aarch64)

set(TRT_PLATFORM_ID "aarch64")

set(CUDA_PLATFORM_ID "sbsa-linux")

set(CMAKE_C_COMPILER /usr/bin/aarch64-linux-gnu-gcc)
set(CMAKE_CXX_COMPILER /usr/bin/aarch64-linux-gnu-g++)

set(CMAKE_C_FLAGS "" CACHE STRING "" FORCE)
set(CMAKE_CXX_FLAGS "" CACHE STRING "" FORCE)

set(CMAKE_C_COMPILER_TARGET aarch64-linux-gnu)
set(CMAKE_CXX_COMPILER_TARGET aarch64-linux-gnu)

set(CMAKE_C_COMPILER_FORCED TRUE)
set(CMAKE_CXX_COMPILER_FORCED TRUE)

set(CUDA_TOOLKIT_ROOT_DIR /usr/local/cuda/targets/${CUDA_PLATFORM_ID} CACHE STRING "CUDA ROOT dir")
set(CUDA_INCLUDE_DIRS ${CUDA_TOOLKIT_ROOT_DIR}/include)
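
For reference, this new toolchain file is the one consumed by the native aarch64 build shown in the README above; a sketch of the invocation, assuming `$TRT_OSSPATH` and `$TRT_LIBPATH` are set as in the build instructions:

```bash
cd $TRT_OSSPATH
mkdir -p build && cd build
cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out \
      -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_aarch64-native.toolchain
make -j$(nproc)
```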
4 changes: 2 additions & 2 deletions demo/BERT/notebooks/BERT-TRT-FP16.ipynb
@@ -171,7 +171,7 @@
"outputs": [],
"source": [
"# Build BERT TensorRT FP16 model from NGC checkpoint\n",
"!python3 ../builder.py -m models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_384_v19.03.1/model.ckpt -w 40000 -o engines_$TRT_VERSION/bert_large_384.engine -b $BATCH_SIZE -s 384 --fp16 -c models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_384_v19.03.1"
"!python3 ../builder.py -m models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_384_v19.03.1/model.ckpt -w 40000 -o engines_$TRT_VERSION/bert_large_384.engine -b 1 -b $BATCH_SIZE -s 384 --fp16 -c models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_384_v19.03.1"
]
},
{
@@ -333,7 +333,7 @@
"metadata": {},
"outputs": [],
"source": [
"!python3 ../builder.py -m models/fine-tuned/bert_tf_ckpt_base_qa_squad2_amp_128_v19.03.1/model.ckpt -w 40000 -o engines_$TRT_VERSION/bert_base_128.engine -b $BATCH_SIZE -s 128 --fp16 -c models/fine-tuned/bert_tf_ckpt_base_qa_squad2_amp_128_v19.03.1"
"!python3 ../builder.py -m models/fine-tuned/bert_tf_ckpt_base_qa_squad2_amp_128_v19.03.1/model.ckpt -w 40000 -o engines_$TRT_VERSION/bert_base_128.engine -b 1 -b $BATCH_SIZE -s 128 --fp16 -c models/fine-tuned/bert_tf_ckpt_base_qa_squad2_amp_128_v19.03.1"
]
},
{
15 changes: 12 additions & 3 deletions demo/BERT/perf.py
@@ -79,11 +79,20 @@ def main():
bench_times = {}

stream = cuda.Stream()
for idx, batch_size in enumerate(sorted(args.batch_size)):
context.set_optimization_profile_async(idx, stream.handle)
for batch_size in sorted(args.batch_size):
# Select engine profile
selected_profile = -1
for idx in range(engine.num_optimization_profiles):
profile_shape = engine.get_profile_shape(profile_index = idx, binding = idx * num_binding_per_profile)
if profile_shape[0][0] <= batch_size and profile_shape[2][0] >= batch_size and profile_shape[0][1] <= args.sequence_length and profile_shape[2][1] >= args.sequence_length:
selected_profile = idx
break
if selected_profile == -1:
raise RuntimeError("None of the dynamic shape profiles meets the requirement batch = {} and sequence = {}.".format(batch_size, args.sequence_length))
context.set_optimization_profile_async(selected_profile, stream.handle)

# Each profile has unique bindings
binding_idx_offset = idx * num_binding_per_profile
binding_idx_offset = selected_profile * num_binding_per_profile
bindings = [0] * binding_idx_offset + [buf.binding() for buf in buffers]

shapes = {
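
A standalone restatement of the profile-selection logic added above, assuming TensorRT's Python bindings; `engine`, `num_binding_per_profile`, `batch_size`, and `seq_len` stand in for the objects used in perf.py:

```python
def select_profile(engine, num_binding_per_profile, batch_size, seq_len):
    """Return the index of the first optimization profile whose min/max
    shapes cover (batch_size, seq_len); raise if none does."""
    for idx in range(engine.num_optimization_profiles):
        # Query the first binding of this profile; shapes come back as (min, opt, max).
        min_shape, _, max_shape = engine.get_profile_shape(
            profile_index=idx, binding=idx * num_binding_per_profile)
        if (min_shape[0] <= batch_size <= max_shape[0]
                and min_shape[1] <= seq_len <= max_shape[1]):
            return idx
    raise RuntimeError(
        "No dynamic shape profile covers batch = {} and sequence = {}".format(
            batch_size, seq_len))
```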
12 changes: 6 additions & 6 deletions demo/HuggingFace/GPT2/export.py
@@ -78,10 +78,10 @@ def __init__(self, model, network_metadata):

# TRT Engine File Encoding #
class GPT2TRTEngine(TRTEngineFile):
def __init__(self, model, network_metadata):
super().__init__(model, GPT2Converter, network_metadata)
def __init__(self, model, network_metadata, batch_size = 1):
super().__init__(model, GPT2Converter, network_metadata, batch_size = batch_size)

def use_strict_types(self):
def use_obey_precision_constraints(self):
return self.network_metadata.precision.fp16

def get_dynamic_shape_profiles(self):
@@ -91,9 +91,9 @@ def get_dynamic_shape_profiles(self):
profile = Profile()
profile.add(
"input_ids",
min=(1, 1),
opt=(1, max_sequence_length // 2),
max=(1, max_sequence_length),
min=(self.batch_size, 1),
opt=(self.batch_size, max_sequence_length // 2),
max=(self.batch_size, max_sequence_length),
)
return [profile]

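
For context, `Profile` here is Polygraphy's optimization-profile helper; a sketch of how a batch-aware profile like the one above is typically handed to Polygraphy's TensorRT builder (the ONNX path and shape values are placeholders, not the demo's exact pipeline):

```python
from polygraphy.backend.trt import (CreateConfig, EngineFromNetwork,
                                    NetworkFromOnnxPath, Profile)

batch_size, max_sequence_length = 4, 64  # placeholder values
profile = Profile().add(
    "input_ids",
    min=(batch_size, 1),
    opt=(batch_size, max_sequence_length // 2),
    max=(batch_size, max_sequence_length),
)

# Build an engine whose single optimization profile covers the chosen batch size.
build_engine = EngineFromNetwork(
    NetworkFromOnnxPath("gpt2.onnx"),  # placeholder ONNX export of the model
    config=CreateConfig(fp16=True, profiles=[profile]),
)
engine = build_engine()
```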
28 changes: 20 additions & 8 deletions demo/HuggingFace/GPT2/frameworks.py
@@ -155,11 +155,17 @@ def execute_inference(
network_fpaths: NetworkModels,
inference_input: str,
timing_profile: TimingProfile,
use_cpu: bool,
batch_size: int = 1
) -> NetworkResult:

# Execute some tests
tokenizer = GPT2Tokenizer.from_pretrained(metadata.variant)
input_ids = tokenizer(inference_input, return_tensors="pt").input_ids

# GPT2 has no proper token set. Use custom token. Only "generate()" will auto
# replace with EOS token when using generating mode
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
input_ids = tokenizer([inference_input] * batch_size, padding=True, return_tensors="pt").input_ids

# By default, HuggingFace model structure is one giant file.
gpt2_torch_fpath = network_fpaths.torch[0].fpath
@@ -172,7 +178,7 @@

# get single decoder iteration inference timing profile
_, decoder_e2e_median_time = gpt2_inference(
gpt2_torch, input_ids, timing_profile
gpt2_torch, input_ids, timing_profile, use_cuda=(not use_cpu)
)

# get complete decoder inference result and its timing profile
@@ -181,13 +187,17 @@
input_ids,
timing_profile,
max_length=GPT2ModelTRTConfig.MAX_SEQUENCE_LENGTH[metadata.variant],
use_cuda=(not use_cpu),
batch_size=batch_size
)

semantic_outputs = []
for i, sample_output in enumerate(sample_output):
semantic_outputs.append(
tokenizer.decode(sample_output, skip_special_tokens=True)
)
# Remove the padding and end tokens.
semantic_outputs = tokenizer.decode(
sample_output[-1, :], skip_special_tokens=True
)

if isinstance(semantic_outputs, list):
semantic_outputs = " ".join(semantic_outputs).strip()

return NetworkResult(
input=inference_input,
@@ -214,6 +224,8 @@ def run_framework(
keep_onnx_model: bool,
keep_pytorch_model: bool,
timing_profile: TimingProfile,
use_cpu: bool = False,
batch_size: int = 1
) -> List[NetworkResult]:
"""
Main entry point of our function which compiles and generates our model data.
@@ -227,7 +239,7 @@
for ninput in network_input:
results.append(
self.execute_inference(
metadata, network_fpaths, ninput, timing_profile
metadata, network_fpaths, ninput, timing_profile, use_cpu
)
)
finally:
9 changes: 7 additions & 2 deletions demo/HuggingFace/GPT2/measurements.py
@@ -24,6 +24,7 @@
# TRT-HuggingFace
from NNDF.general_utils import measure_python_inference_code
from NNDF.torch_utils import use_cuda
from NNDF.tensorrt_utils import TRTNativeRunner


@use_cuda
@@ -37,9 +38,13 @@ def gpt2_inference(gpt2, input_ids, timing_profile, use_cuda=True):

# Code specifically for Pythonic inference measurement used across all GPT2 related scripts
@use_cuda
def full_inference_greedy(gpt2, input_ids, timing_profile, max_length, use_cuda=True):
def full_inference_greedy(gpt2, input_ids, timing_profile, max_length, use_cuda=True, batch_size=1):

if isinstance(gpt2, TRTNativeRunner):
gpt2.set_return_device("cuda" if use_cuda else "cpu")

def _e2e():
return gpt2.generate(input_ids, max_length=max_length) # greedy search
return gpt2.generate(input_ids, max_length=max_length, batch_size=batch_size) # greedy search

full_e2e_median_time = measure_python_inference_code(
_e2e,