How To Benchmark Torch-TensorRT with TorchBench
We have added support for benchmarking Torch-TRT across IRs (torchscript, torch_compile, dynamo) in TorchBench, which provides a set of key models and makes it easy to add more.
First, it is key to set up a clean environment for benchmarking. We have two recommended ways to accomplish this.
- Set up a container based on the provided TorchBench Dockerfiles, then install torch_tensorrt in it.
- Set up a container based on the Torch-TRT Docker, then install torchbench in it.
With the environment set up, benchmarking Torch-TRT in TorchBench can be done in the following ways (from the root of the TorchBench clone).
# Prints metrics to stdout
python run.py {MODEL selected from TorchBench set} -d cuda -t eval --backend torch_trt --precision [fp32 OR fp16] [Torch-TRT specific options, see below]
# Saves metrics to .userbenchmark/torch_trt/metrics-*.json
python run_benchmark.py torch_trt --model {MODEL selected from TorchBench set} --precision [fp32 OR fp16] [Torch-TRT specific options, see below]
The Torch-TRT specific options are:
--truncate_long_and_double: Whether to automatically truncate long and double operations
--min_block_size: Minimum number of operations in an accelerated TRT block
--workspace_size: Size of workspace allotted to TensorRT
--ir: Which intermediate representation to use: {"ts", "torch_compile", "dynamo", ...}
# Benchmarks ResNet18 with Torch-TRT, using FP32 precision, truncate_long_and_double=True, and compiling via the TorchScript path
python run.py resnet18 -d cuda -t eval --backend torch_trt --precision fp32 --truncate_long_and_double --ir torchscript
# Benchmarks VGG16 with Torch-TRT, using FP16 precision, Batch Size 32, and compiling via the dynamo path
python run.py vgg16 -d cuda -t eval --backend torch_trt --precision fp16 --ir dynamo --bs 32
# Benchmarks BERT with Torch-TRT, using FP16 precision, truncate_long_and_double=True, and compiling via the torch compile path
python run.py BERT_pytorch -d cuda -t eval --backend torch_trt --precision fp16 --truncate_long_and_double --ir torch_compile
In each of the cases below, the metrics are saved to files at the path .userbenchmark/torch_trt/metrics-*.json, as per the TorchBench convention.
# Benchmarks ResNet18 with Torch-TRT, using FP32 precision, truncate_long_and_double=True, and compiling via the TorchScript path
python run_benchmark.py torch_trt --model resnet18 --precision fp32 --truncate_long_and_double --ir torchscript
# Benchmarks VGG16 with Torch-TRT, using FP16 precision, Batch Size 32, and compiling via the dynamo path
python run_benchmark.py torch_trt --model vgg16 --precision fp16 --ir dynamo --bs 32
# Benchmarks BERT with Torch-TRT, using FP16 precision, truncate_long_and_double=True, and compiling via the torch compile path
python run_benchmark.py torch_trt --model BERT_pytorch --precision fp16 --truncate_long_and_double --ir torch_compile
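To look at the saved results, the metrics files can be listed and pretty-printed from the root of the TorchBench clone. The following is a minimal sketch: the show_metrics function and the METRICS_DIR variable are hypothetical names introduced here, and the script assumes nothing about the fields inside each JSON file. It relies only on python -m json.tool from the Python standard library.

```bash
#!/usr/bin/env bash
# Sketch: pretty-print every metrics file written by run_benchmark.py.
# METRICS_DIR is a hypothetical variable; the default is the path named above.
METRICS_DIR="${METRICS_DIR:-.userbenchmark/torch_trt}"

# Print each metrics-*.json file in the given directory (hypothetical helper).
show_metrics() {
    local dir="$1" file found=0
    for file in "$dir"/metrics-*.json; do
        # When nothing matches, the glob pattern stays literal; skip it.
        [ -e "$file" ] || continue
        found=1
        echo "== ${file}"
        # Pretty-print the JSON without assuming anything about its schema.
        python -m json.tool "$file"
    done
    [ "$found" -eq 1 ] || echo "no metrics files found in ${dir}"
}

show_metrics "$METRICS_DIR"
```

Setting METRICS_DIR to another directory before running the script points it at a different set of results.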
In the future, we hope to enable:
# Benchmarks all TorchBench models with Torch-TRT, compiling via the torch compile path
python run_benchmark.py torch_trt --precision fp16 --ir torch_compile
This is still in development. For now, the recommended way to benchmark multiple models is to write a bash script that iterates over the desired models and runs each benchmark individually. See the discussion here for more details.
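The per-model loop described above could be sketched as follows. This is an illustrative script, not part of TorchBench: the model list, the run_one and run_all helpers, and the PRECISION, IR, and DRY_RUN variables are assumptions made for this example. By default it only prints the commands; set DRY_RUN=0 to actually run them.

```bash
#!/usr/bin/env bash
# Sketch: benchmark several TorchBench models with Torch-TRT, one run per model.
# Intended to be run from the root of the TorchBench clone.
set -u

# Hypothetical selection of models; replace with the models you need.
MODELS=(resnet18 vgg16 BERT_pytorch)
PRECISION="${PRECISION:-fp16}"
IR="${IR:-torch_compile}"
# Assumption for this sketch: only print the commands unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

# Print, and unless in dry-run mode execute, the benchmark for one model.
run_one() {
    local cmd=(python run_benchmark.py torch_trt --model "$1" --precision "$PRECISION" --ir "$IR")
    echo "+ ${cmd[*]}"
    if [ "$DRY_RUN" != "1" ]; then
        "${cmd[@]}"
    fi
}

# Run every model in MODELS; a failing model does not stop the remaining runs.
run_all() {
    local model failed=0
    for model in "${MODELS[@]}"; do
        run_one "$model" || failed=$((failed + 1))
    done
    echo "failed runs: ${failed}"
}

run_all
```

Saved under a name such as bench_all.sh (hypothetical) in the root of the TorchBench clone, it could be invoked as DRY_RUN=0 PRECISION=fp32 IR=dynamo bash bench_all.sh. Counting failures instead of exiting on the first one keeps a single unsupported model from aborting the whole sweep.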