TRTorch v0.4.0 #605
Support for PyTorch 1.9, TensorRT 8.0. Introducing INT8 Execution for QAT models, Module Based Partial Compilation, Auto Device Configuration, Input Class, Usability Improvements, New Converters, Bug Fixes
Target Platform Changes
This is the fourth beta release of TRTorch, targeting PyTorch 1.9, CUDA 11.1 (on x86_64; CUDA 10.2 on aarch64), cuDNN 8.2 and TensorRT 8.0, with backwards compatibility to TensorRT 7.1. On aarch64, TRTorch primarily targets JetPack 4.6, with backwards compatibility to JetPack 4.5. When building on Jetson, the flag `--platforms //toolchains:jetpack_4.x` must now be provided for C++ compilation so that the correct dependency paths are selected. For Python, the JetPack version is assumed to be 4.6 by default; to override this, add the `--jetpack-version 4.5` flag when building.
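For example, a Jetson build might look like the following sketch (the `//:libtrtorch` Bazel target and the `setup.py` invocation are assumptions based on the repository's usual layout):

```sh
# C++ (Bazel) build on Jetson, selecting the JetPack 4.6 toolchain
bazel build //:libtrtorch --platforms //toolchains:jetpack_4.6

# Python build, overriding the default assumption of JetPack 4.6
python3 setup.py install --jetpack-version 4.5
```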
TensorRT 8.0

This release adds support for compiling models trained with quantization aware training (QAT), allowing users of the TensorRT PyTorch Quantization Toolkit (https://github.com/NVIDIA/TensorRT/tree/master/tools/pytorch-quantization) to compile their models using TRTorch. For more information and a tutorial, refer to https://www.github.com/NVIDIA/TRTorch/tree/v0.4.0/examples/int8/qat. This release also adds support for sparsity via the `sparse_weights` flag in the compile spec, which allows TensorRT to utilize specialized hardware in Ampere GPUs to skip unnecessary computation and thereby increase computational efficiency.
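To sketch how these options fit together, the snippet below compiles a hypothetical QAT-trained TorchScript module with INT8 kernels and sparse weights enabled (the model path and input shape are placeholders):

```python
import torch
import trtorch

# Hypothetical QAT-trained model exported to TorchScript
qat_model = torch.jit.load("qat_resnet50.ts").eval().cuda()

compile_spec = {
    # Input specs use the new trtorch.Input class (see API Changes below)
    "inputs": [trtorch.Input((1, 3, 224, 224))],
    # For a QAT model the quantization scales come from the Q/DQ nodes
    # recorded during training, so no calibrator needs to be supplied
    "enabled_precisions": {torch.float32, torch.float16, torch.int8},
    # Let TensorRT use Ampere sparse tensor cores where the weights qualify
    "sparse_weights": True,
}

trt_model = trtorch.compile(qat_model, compile_spec)
```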
Partial Compilation

In v0.4.0, the partial compilation feature of TRTorch can be considered to have beta-level stability. New in this release is the ability to explicitly specify entire PyTorch modules to run in PyTorch as part of partial compilation (sketched below), which should let users easily isolate troublesome code when compiling. Again, feedback on this feature is greatly appreciated.
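A minimal sketch of module-based fallback, assuming a `torch_fallback` compile spec field with a `forced_fallback_modules` key (the model and the qualified module name here are illustrative):

```python
import torch
import torch.nn as nn
import trtorch

class Head(nn.Module):
    def forward(self, x):
        return x.softmax(dim=1)

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3)
        self.head = Head()

    def forward(self, x):
        return self.head(self.conv(x))

model = torch.jit.script(MyModel().eval().cuda())

compile_spec = {
    "inputs": [trtorch.Input((1, 3, 224, 224))],
    "enabled_precisions": {torch.float32},
    "torch_fallback": {
        "enabled": True,
        # Run every instance of the Head submodule in PyTorch;
        # TensorRT compiles the rest of the graph
        "forced_fallback_modules": ["__main__.Head"],
    },
}

trt_model = trtorch.compile(model, compile_spec)
```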
Automatic Device Configuration at Runtime
v0.4.0 also changes the "ABI" of TRTorch so that compiled programs now include information about their target device. Programs compiled with v0.4.0 will look for and select the most compatible available device at runtime. The rules are as follows: any valid device option must have the same SM capability as the device that built the engine; from there, TRTorch prefers the same device model (e.g., an engine built on an A100 prefers an A100 over an A30), and finally the same device ID. Users will be warned if the selected device is not the currently active device during execution, since overhead may be incurred in transferring input tensors from the current device to the target device; users can then modify their code to avoid this. Due to this ABI change, existing compiled TRTorch programs are incompatible with the TRTorch v0.4.0 runtime. From v0.4.0 onwards, an internal ABI version will check program compatibility; this version is only incremented on breaking changes to the ABI.
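In practice, avoiding that warning just means moving inputs to the device the program selected before calling it; a generic sketch using plain torch APIs (the device IDs are placeholders):

```python
import torch

# Suppose the compiled program selected cuda:1 as its target device
trt_model = torch.jit.load("trt_model.ts")

x = torch.randn(1, 3, 224, 224, device="cuda:0")
# Transfer the input up front so the runtime does not have to move it
# from the active device to the target device at execution time
y = trt_model(x.to("cuda:1"))
```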
API Changes (Input, enabled_precisions, Device)
TRTorch v0.4.0 changes the API for specifying input shapes and data types to give users more control over configuration. The new API uses the class `trtorch.Input`, which lets users set the shape (or shape range) as well as the memory layout and expected data type. These input specs are set in the `inputs` field of the `CompileSpec`. The legacy `input_shapes` field and the associated usage with lists of tuples/`InputRange`s should now be considered deprecated; they remain usable in v0.4.0 but will be removed in the next release. Similarly, the compile spec field `op_precision` is now deprecated in favor of `enabled_precisions`. `enabled_precisions` is a set containing the data types that kernels are allowed to use. Whereas setting `op_precision = torch.int8` would implicitly enable FP32 and FP16 kernels as well, `enabled_precisions` must now be set to `{torch.float32, torch.float16, torch.int8}` to achieve the same behavior. To maintain behavior similar to normal PyTorch, if FP16 is the lowest precision enabled but no explicit data type is set for the model's inputs, the inputs are expected to be FP16; for the other cases (FP32, INT8), FP32 is the default, as in PyTorch and previous versions of TRTorch.
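For illustration, a compile spec in the new style might look like the sketch below, assuming `trtorch.Input` accepts either a static shape positionally or a `min_shape`/`opt_shape`/`max_shape` range, plus `dtype` and `format` keywords:

```python
import torch
import trtorch

compile_spec = {
    "inputs": [
        # Static shape, FP16, channels-last memory layout
        trtorch.Input((1, 3, 224, 224), dtype=torch.half, format=torch.channels_last),
    ],
    # Replaces op_precision: every allowed kernel precision is listed explicitly
    "enabled_precisions": {torch.float32, torch.float16},
}

# A shape range for dynamic batch sizes would instead look like:
dynamic_input = trtorch.Input(
    min_shape=(1, 3, 224, 224),
    opt_shape=(8, 3, 224, 224),
    max_shape=(32, 3, 224, 224),
    dtype=torch.float32,
)
```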
Finally, in the Python API a class `trtorch.Device` has been added. While users can continue to use `torch.Device` or other torch APIs, `trtorch.Device` allows better control for the specific use cases of compiling with TRTorch (e.g., setting the DLA core and GPU fallback). The class is very similar to its C++ counterpart, with a couple of additions of syntactic sugar to make it easier and more familiar to use; for instance, `trtorch.Device` can be used in place of a dictionary in the compile spec if desired.
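For example (a sketch; the device string form and the `allow_gpu_fallback` keyword are assumptions):

```python
import torch
import trtorch

compile_spec = {
    "inputs": [trtorch.Input((1, 3, 224, 224))],
    "enabled_precisions": {torch.float32},
    # trtorch.Device standing in for the dictionary form of the device field:
    # target DLA core 0 and let unsupported layers fall back to the GPU
    "device": trtorch.Device("dla:0", allow_gpu_fallback=True),
}
```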
`trtorchc` has been updated to reflect these API changes. Users can set the shape, dtype and format of inputs from the command line using the format `"[(MIN_N,..,MIN_C,MIN_H,MIN_W);(OPT_N,..,OPT_C,OPT_H,OPT_W);(MAX_N,..,MAX_C,MAX_H,MAX_W)]@DTYPE%FORMAT"`, e.g. `(3, 3, 32,32)@f16%NHWC`. `-p` is now a repeatable flag so that multiple precisions can be enabled. Also added are the repeatable flags `--ffm` and `--ffo`, which mark specific modules and operators, respectively, for running in PyTorch; to use these two options, `--allow-torch-fallback` must be set. Options for embedding serialized engines (`--embed-engine`) and for sparsity (`--sparse-weights`) have been added as well.
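Putting these flags together, an invocation might look like the sketch below (the positional argument order of input module, output path and input spec, as well as the module and operator names, are assumptions):

```sh
trtorchc model.ts trt_model.ts "(3,3,32,32)@f16%NHWC" \
    -p f32 -p f16 \
    --sparse-weights \
    --allow-torch-fallback --ffm ClassifierHead --ffo "aten::gelu"
```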
Usability

Finally, TRTorch v0.4.0 now includes the ability to provide backtraces for locations in your model that TRTorch does not support. This can help in identifying the parts of the model that might need to change for TRTorch support, or the modules that should run fully in PyTorch via partial compilation.
Dependencies
0.4.0 (2021-08-24)
Bug Fixes
Features
BREAKING CHANGES
simplifying the implementation and formalizing the serialized format for CUDA devices. It also implements ABI versioning: the first entry in the serialized format of a TRTEngine now records the ABI that the engine was compiled with, defining expected compatibility with the TRTorch runtime. If the ABI version does not match, the runtime will error out, asking the user to recompile the program. The ABI version is a monotonically increasing integer and should be incremented every time the serialization format changes in some way. This commit also cleans up the CudaDevice class, implementing a number of constructors to replace the various utility functions that populate the struct; descriptive utility functions remain but solely call the relevant constructor.
TRTorch API. Starting in TRTorch v0.5.0, support for the "input_shapes" and "op_precision" compile spec keys will be removed. Users should port forward to the "inputs" key, which expects a list of trtorch.Input objects, and the "enabled_precisions" key, which expects a set of data type specifying enums.
fields "input_shapes", "op_precision" and associated contructors and
functions. These are replaced wtih Input, "inputs" and
"enabled_precisions" respectively. Deprecated components will be removed
in TRTorch v0.5.0
breaking changes to the to_backend API
a new field for the engine name is included in the serialized form of TRTEngine. This lets deserialized engines keep the same name they were serialized with.
Supported Operators in TRTorch v0.4.0
Operators Currently Supported Through Converters
Operators Currently Supported Through Evaluators