Bug Description
I am trying to quantize already-trained FP16 models to INT8 precision using torch_tensorrt and accelerate inference with TensorRT engines. However, during this process I ran into several different issues, either inside torch_tensorrt or in TensorRT itself (I am not entirely sure which).
In most cases, the models fail the quantization and/or compilation process.
To Reproduce
1. Define several common models (MLP, CNN, Attention, LSTM, Transformer) in torch.
2. Randomly initialize the model weights.
3. Convert the models to FP16 precision and move them to the GPU.
4. Compile the models using torch_tensorrt:
   - Compile to an FP16 TensorRT engine.
   - Compile and quantize to an INT8 TensorRT engine.
5. Compare inference performance and accuracy between the original FP16 model, the FP16 TensorRT engine, and the INT8 TensorRT engine (a minimal sketch of steps 4 and 5 follows this list).
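For reference, here is a minimal sketch of what steps 4 and 5 look like in code. It assumes `ir="dynamo"` and the nvidia-modelopt post-training quantization workflow for the INT8 path; the MLP definition, tensor shapes, and calibration loop are illustrative placeholders rather than the exact reproduction script.

```python
# Minimal sketch of steps 4 and 5, assuming ir="dynamo" and the
# nvidia-modelopt PTQ workflow for INT8 (names/shapes are illustrative,
# not the exact script from this report).
import torch
import torch_tensorrt


class MLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(128, 256),
            torch.nn.ReLU(),
            torch.nn.Linear(256, 10),
        )

    def forward(self, x):
        return self.net(x)


model = MLP().half().cuda().eval()
example = torch.randn(32, 128, dtype=torch.half, device="cuda")

with torch.no_grad():
    ref = model(example)  # FP16 eager baseline for the accuracy comparison

# Step 4a: compile to an FP16 TensorRT engine.
trt_fp16 = torch_tensorrt.compile(
    model, ir="dynamo", inputs=[example], enabled_precisions={torch.half}
)

# Step 4b: post-training quantize to INT8 (here via modelopt), then compile.
import modelopt.torch.quantization as mtq


def calibrate(m):
    # A real calibration loop would feed representative data.
    with torch.no_grad():
        for _ in range(16):
            m(torch.randn(32, 128, dtype=torch.half, device="cuda"))


qmodel = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop=calibrate)
trt_int8 = torch_tensorrt.compile(
    qmodel, ir="dynamo", inputs=[example],
    enabled_precisions={torch.int8, torch.half},
)

# Step 5: compare outputs of the compiled variants against the FP16 baseline.
with torch.no_grad():
    print((trt_fp16(example) - ref).abs().max())
    print((trt_int8(example) - ref).abs().max())
```

The other rows in the table below swap the `ir` argument (`torch_compile` or `torchscript`); the exact INT8 calibration path may differ per IR.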
Expected Behavior
FP16 models compile successfully to INT8 TensorRT engines while maintaining reasonable inference accuracy and performance.
Actual Behavior
In most cases, the compilation fails or the resulting models cannot run correctly. Below is a summary of the results I tested:
| Model | IR | FP16 | FP16 + TRT | INT8 + TRT | Error Log |
| --- | --- | --- | --- | --- | --- |
| MLP | torch_compile | pass | pass | failed | see [1] |
| MLP | torchscript | pass | pass | pass | N/A |
| MLP | dynamo | pass | pass | failed | see [2] |
| CNN | torch_compile | pass | pass | failed | see [3] |
| CNN | torchscript | pass | pass | failed | see [4] |
| CNN | dynamo | pass | pass | failed | see [5] |
| Attention | torch_compile | pass | pass | pass | N/A |
| Attention | torchscript | pass | failed | failed | see [6] |
| Attention | dynamo | pass | failed | failed | see [7] [8] |
| Transformer | torch_compile | pass | pass | pass | N/A |
| Transformer | torchscript | pass | failed | failed | see [9] |
| Transformer | dynamo | pass | failed | failed | see [10] [11] |
| LSTM | torch_compile | pass | pass | pass | N/A |
| LSTM | torchscript | pass | failed | failed | see [12] |
| LSTM | dynamo | pass | failed | failed | see [13] |
The corresponding error logs are attached in error_log.txt (uploaded as a file due to the length limit).
Environment
Torch-TensorRT Version (e.g. 1.0.0): 2.6.0+cu124
PyTorch Version (e.g. 1.0): 2.6.0+cu124
CPU Architecture: x86_64
OS (e.g., Linux): Rocky Linux 8.7
How you installed PyTorch (conda, pip, libtorch, source): pip
Build command you used (if compiling from source): N/A
Are you using local sources or building from archives: N/A
Python version: 3.10.16
CUDA version: 12.5
GPU models and configuration: NVIDIA GeForce RTX 4090
Any other relevant information: N/A
Questions
Am I using torch_tensorrt incorrectly?
Are there any important documentation notes or best practices regarding compilation and quantization that I might have missed?
What is the correct (or officially recommended) way to do this task: given an FP16 model, build an INT8 quantized version and run inference with the TensorRT backend?
Any help would be greatly appreciated! Thank you in advance!
Additional context
N/A