TensorRT
The industry's most powerful deep learning inference optimizer. Transform trained models into high-performance engines that can run up to 10× faster on NVIDIA GPUs.
What is TensorRT?
A high-performance inference optimizer and runtime that transforms your trained models into production-ready engines.
From Training to Production
TensorRT bridges the gap between model training and deployment. Frameworks like PyTorch and TensorFlow excel at training, but they carry overhead that inference doesn't need. TensorRT takes your trained model and transforms it into a highly optimized engine tuned for your target GPU.
The result? Dramatically faster inference, reduced memory footprint, and lower latency, all without changing your model's architecture.
TensorRT supports all major frameworks through ONNX, making it framework-agnostic. Train in PyTorch, deploy with TensorRT.
Trained Model
PyTorch, TensorFlow, ONNX
TensorRT Optimizer
Parse, fuse, quantize, tune
Optimized Engine
Serialized .engine file
Production Deployment
Triton, vLLM, custom apps
Under the Hood
A deep dive into TensorRT's multi-layered optimization architecture
ONNX Parser
Imports models from the Open Neural Network Exchange format, supporting 150+ operators
UFF Parser
Legacy TensorFlow format support for older models and workflows (deprecated in recent TensorRT releases)
Caffe Parser
Legacy support for Caffe model definitions and pretrained weights (deprecated in recent TensorRT releases)
Network Definition API
Programmatically build networks layer-by-layer for custom architectures
Graph Optimizer
Eliminates redundant operations and performs constant folding and dead-code elimination
Layer Fusion
Combines Conv+BN+ReLU and similar patterns into single optimized kernels
Precision Calibrator
INT8/FP16 quantization with automatic scale factor computation
Kernel Auto-Tuner
Benchmarks multiple implementations per layer, selects fastest
Memory Optimizer
Tensor reuse, workspace allocation, memory pooling strategies
Timing Cache
Stores kernel timing results to accelerate future builds
Execution Context
Manages inference state, allows multiple concurrent executions
CUDA Streams
Asynchronous execution with stream-ordered memory allocation
Dynamic Shapes
Runtime tensor dimension changes with optimization profiles
DLA Support
Offload layers to Deep Learning Accelerator on Jetson/DRIVE
Engine Serialization
Save optimized engine as portable .plan/.engine file
Version Compatibility
Cross-version loading with backward compatibility checks
Triton Integration
Native backend for NVIDIA Triton Inference Server
Plugin System
Custom layer implementations for unsupported operations
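The dynamic-shapes feature above is worth a closer look: an engine is built against an optimization profile that declares minimum, optimum, and maximum dimensions, and any runtime shape inside that range is accepted. A minimal pure-Python sketch of the idea (illustrative names, not the actual TensorRT API):

```python
from dataclasses import dataclass

@dataclass
class OptimizationProfile:
    min_shape: tuple
    opt_shape: tuple   # the shape kernels are tuned for
    max_shape: tuple

    def accepts(self, shape: tuple) -> bool:
        # A runtime shape is valid if every dimension sits in [min, max].
        return all(lo <= d <= hi for lo, d, hi in
                   zip(self.min_shape, shape, self.max_shape))

# A profile for batched 224x224 RGB input: batch may vary from 1 to 32.
profile = OptimizationProfile(
    min_shape=(1, 3, 224, 224),
    opt_shape=(8, 3, 224, 224),
    max_shape=(32, 3, 224, 224),
)

print(profile.accepts((16, 3, 224, 224)))  # True: within [min, max]
print(profile.accepts((64, 3, 224, 224)))  # False: batch exceeds max
```

In real engines, kernels are tuned for the `opt` shape, so shapes far from it inside the range still run but may be slower.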
The Magic Inside
Four key optimization techniques that make TensorRT incredibly fast
Layer & Tensor Fusion
Combine operations, reduce overhead
TensorRT identifies patterns of operations that can be merged into single, optimized CUDA kernels. This eliminates memory transfers between layers and reduces kernel launch overhead.
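As a concrete illustration of why fusion is legal, batch-norm parameters can be folded algebraically into the preceding convolution's weight and bias, so Conv+BN collapses into a single operation. A toy single-channel sketch (illustrative numbers, not TensorRT code):

```python
import math

def conv(x, w, b):
    # 1-channel, 1-weight "convolution": just an affine map.
    return w * x + b

def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    return gamma * (y - mean) / math.sqrt(var + eps) + beta

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    # BN(conv(x)) == conv'(x) with rescaled weight and shifted bias.
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta

w, b = 2.0, 0.5
gamma, beta, mean, var = 1.5, -0.2, 0.3, 4.0

w_fused, b_fused = fold_bn_into_conv(w, b, gamma, beta, mean, var)
x = 1.25
original = batchnorm(conv(x, w, b), gamma, beta, mean, var)
fused = conv(x, w_fused, b_fused)
print(f"original: {original:.6f}  fused: {fused:.6f}")  # identical up to float error
```

Because the two layers are mathematically one affine map, the fused kernel reads and writes the intermediate tensor zero times.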
Precision Calibration
Quantize without losing accuracy
Convert FP32 weights to FP16 or INT8 for massive speedups. INT8 calibration uses a representative dataset to compute optimal scale factors that minimize accuracy loss.
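The core of calibration is computing a scale factor from representative data. A minimal sketch using the simplest scheme, symmetric max-abs calibration (TensorRT's built-in calibrators use more sophisticated entropy and percentile methods):

```python
def calibrate_scale(samples):
    # Map the largest observed magnitude onto the INT8 range [-127, 127].
    return max(abs(v) for v in samples) / 127.0

def quantize(v, scale):
    q = round(v / scale)
    return max(-127, min(127, q))   # clamp to INT8 range

def dequantize(q, scale):
    return q * scale

# "Representative dataset": activation values observed during calibration.
activations = [0.1, -2.5, 1.7, 0.03, 2.49]
scale = calibrate_scale(activations)

max_err = max(abs(dequantize(quantize(v, scale), scale) - v)
              for v in activations)
print(f"scale={scale:.5f}, max round-trip error={max_err:.5f}")
```

The round-trip error is bounded by half a quantization step, which is why a well-chosen scale keeps accuracy loss small.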
Kernel Auto-Tuning
Find the fastest implementation
For each layer, TensorRT benchmarks multiple CUDA kernel implementations and selects the fastest one for your specific GPU architecture. Results are hardware-specific.
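The selection loop itself is simple: time every candidate and keep the winner. A toy CPU-side sketch of the benchmark-and-pick pattern (real TensorRT tuning times CUDA kernel variants on the target GPU):

```python
import time

def impl_loop(xs):
    # Candidate 1: explicit loop.
    out = []
    for x in xs:
        out.append(x * x)
    return out

def impl_comprehension(xs):
    # Candidate 2: list comprehension.
    return [x * x for x in xs]

def autotune(candidates, xs, repeats=200):
    # Benchmark each implementation on representative input, keep fastest.
    timings = {}
    for name, fn in candidates.items():
        start = time.perf_counter()
        for _ in range(repeats):
            fn(xs)
        timings[name] = time.perf_counter() - start
    return min(timings, key=timings.get)

best = autotune({"loop": impl_loop, "listcomp": impl_comprehension},
                list(range(1000)))
print(f"fastest implementation on this machine: {best}")
```

As with TensorRT, the result depends on the hardware it was measured on, which is why engines are not portable across GPU architectures.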
Memory Optimization
Maximize GPU utilization
TensorRT analyzes tensor lifetimes and reuses memory where possible. Tensors that don't overlap in time share the same memory allocation, dramatically reducing footprint.
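The underlying idea is interval analysis: record each tensor's first and last use, then let tensors with disjoint lifetimes share a buffer. A greedy pure-Python sketch (tensor names and step numbers are illustrative):

```python
def assign_slots(lifetimes):
    """lifetimes: {tensor: (first_use_step, last_use_step)} -> {tensor: slot}."""
    slots = []          # slot index -> step at which that slot becomes free
    assignment = {}
    for name, (start, end) in sorted(lifetimes.items(),
                                     key=lambda kv: kv[1][0]):
        for i, free_at in enumerate(slots):
            if free_at < start:         # previous tenant is dead before we start
                slots[i] = end
                assignment[name] = i
                break
        else:
            slots.append(end)           # no reusable slot: allocate a new one
            assignment[name] = len(slots) - 1
    return assignment

lifetimes = {"conv1_out": (0, 1), "relu1_out": (1, 2),
             "conv2_out": (2, 3), "relu2_out": (3, 4)}
slots = assign_slots(lifetimes)
# Four tensors fit in two buffers: non-overlapping lifetimes share memory.
print(slots, "| buffers used:", len(set(slots.values())))
```

The strict `<` matters: a tensor consumed at step 1 still overlaps a tensor produced at step 1, so they must not share a buffer.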
The Optimization Pipeline
Step-by-step journey from trained model to optimized engine
Model Import
Parse ONNX/TensorFlow model. Build internal network representation. Validate operator support.
Graph Analysis
Identify fusion patterns. Detect quantizable layers. Map tensor dependencies.
Layer Fusion
Merge compatible operations. Eliminate redundant computations. Optimize data flow.
Precision Selection
Calibrate INT8 scales. Apply mixed precision. Balance accuracy vs speed.
Kernel Selection
Benchmark kernel variants. Auto-tune for target GPU. Cache timing results.
Engine Build
Serialize optimized engine. Generate .plan file. Ready for deployment.
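The six stages above can be sketched as a linear sequence of passes over a network description. Purely illustrative: the real builder interleaves and iterates these steps, and every dict key here is invented for the sketch:

```python
# Each pass takes a network description and returns an enriched copy.
def model_import(net):        return {**net, "parsed": True}
def graph_analysis(net):      return {**net, "fusion_candidates": ["conv+bn+relu"]}
def layer_fusion(net):        return {**net, "layers": net["layers"] - 2}
def precision_selection(net): return {**net, "precision": "int8"}
def kernel_selection(net):    return {**net, "kernels_tuned": True}
def engine_build(net):        return {**net, "artifact": "model.plan"}

passes = [model_import, graph_analysis, layer_fusion,
          precision_selection, kernel_selection, engine_build]

net = {"layers": 10}
for stage in passes:
    net = stage(net)
print(net["artifact"], "| layers after fusion:", net["layers"])
```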
Real-World Results
Inference latency comparison across optimization levels
Optimized For Your Workload
Computer Vision
Image classification, object detection, segmentation. Optimized kernels for ResNet, YOLO, EfficientNet.
Speech & Audio
ASR, TTS, speaker recognition. Real-time processing for voice assistants and transcription.
NLP & Transformers
BERT, GPT, T5 inference. Optimized attention mechanisms and sequence processing.
TensorRT-LLM
A specialized extension of TensorRT optimized for large language models. Powers production LLM inference at scale with state-of-the-art performance.
KV-Cache Optimization
Efficient key-value cache management for autoregressive generation
In-Flight Batching
Dynamic batching of requests at different generation stages
Flash Attention
Memory-efficient attention with fused softmax kernels
Tensor Parallelism
Scale across multiple GPUs with optimized communication
Quantization
INT8, INT4, FP8 quantization with minimal accuracy loss
Paged Attention
Virtual memory for KV cache, enabling larger batch sizes
Speculative Decoding
Draft model acceleration for faster token generation
Custom Plugins
Extensible architecture for custom model components
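To make the KV-cache idea concrete, here is a scalar-attention toy: keys and values for past tokens are cached once and reused at every step, so each new token adds a single cache entry instead of recomputing the whole prefix. Illustrative only; TensorRT-LLM keeps this on-GPU with paged storage:

```python
import math

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        # Called once per generated token: O(1) work.
        self.keys.append(k)
        self.values.append(v)

    def attend(self, query):
        # Softmax-weighted average of cached values (scalar attention).
        scores = [query * k for k in self.keys]
        m = max(scores)                       # subtract max for stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        return sum(w / total * v for w, v in zip(weights, self.values))

cache = KVCache()
for k, v in [(0.5, 1.0), (1.0, 2.0), (2.0, 3.0)]:
    cache.append(k, v)                # new token's key/value enter the cache
    out = cache.attend(query=1.0)     # attends over all cached tokens
print(f"attention output over 3 cached tokens: {out:.3f}")
```

Paged attention extends this by allocating the `keys`/`values` storage in fixed-size pages, so memory is committed on demand rather than reserved for the maximum sequence length.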
Containerized Deployment
NGC containers and production configurations for TensorRT inference
NGC Quick Start
Official NVIDIA TensorRT container
# Pull TensorRT container from NGC
docker pull nvcr.io/nvidia/tensorrt:24.01-py3
# Run with GPU access
docker run --gpus all -it \
    -v $(pwd):/workspace \
    nvcr.io/nvidia/tensorrt:24.01-py3
# Build TensorRT engine
trtexec --onnx=model.onnx --saveEngine=model.plan
Production Dockerfile
Multi-stage build with runtime-only image
# Build stage - compile TensorRT engine
FROM nvcr.io/nvidia/tensorrt:24.01-py3 AS builder
WORKDIR /build
COPY model.onnx .
RUN trtexec --onnx=model.onnx --saveEngine=model.plan --fp16
# Runtime stage - minimal footprint
FROM nvcr.io/nvidia/cuda:12.3.1-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y libnvinfer8 && \
    rm -rf /var/lib/apt/lists/*
COPY --from=builder /build/model.plan /app/
COPY inference_app /app/
WORKDIR /app
CMD ["./inference_app"]
Docker Compose
Triton + Prometheus + Grafana
version: '3.8'
services:
  triton:
    image: nvcr.io/nvidia/tritonserver:24.01-py3
    command: tritonserver --model-repository=/models
    ports:
      - "8000:8000"   # HTTP
      - "8001:8001"   # gRPC
      - "8002:8002"   # Metrics
    volumes:
      - ./models:/models
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
Environment Variables
Key configuration options
Kubernetes Deployment
GPU-aware scheduling and auto-scaling for production workloads
Triton Deployment
GPU-enabled pod specification
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-tensorrt
spec:
  replicas: 3
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          command: ["tritonserver"]
          args: ["--model-repository=/models"]
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "32Gi"
              cpu: "8"
Service & Ingress
Load balancing configuration
apiVersion: v1
kind: Service
metadata:
  name: triton-service
spec:
  type: ClusterIP
  selector:
    app: triton
  ports:
    - name: http
      port: 8000
      targetPort: 8000
    - name: grpc
      port: 8001
      targetPort: 8001
    - name: metrics
      port: 8002
      targetPort: 8002
Horizontal Pod Autoscaler
GPU-aware auto-scaling with custom metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-tensorrt
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: triton_queue_size
        target:
          type: AverageValue
          averageValue: "10"
Production Topologies
Architecture patterns for every scale and use case
Scenario 1: Single GPU Direct Inference
Direct TensorRT runtime with C++ or Python API. Lowest latency path.
Scenario 2: Triton Inference Server
Enterprise model serving with dynamic batching and model management
Scenario 3: Multi-GPU Inference
Scale inference across multiple GPUs for throughput
Scenario 4: Edge Deployment (Jetson)
Low-power, low-latency inference on NVIDIA Jetson platforms
Jetson Orin
Optimize
# Build with INT8 on DLA core 0, falling back to the GPU for unsupported layers
trtexec --onnx=model.onnx --saveEngine=model.plan --int8 --useDLACore=0 --allowGPUFallback
# Switch to the max-performance power model and lock clocks
sudo nvpmodel -m 0 && sudo jetson_clocks