NVIDIA Deep Learning SDK

TensorRT

The industry's most powerful deep learning inference optimizer. Transform trained models into high-performance engines that run up to 10× faster on NVIDIA GPUs.

10× Faster Inference · 75% Memory Reduction · Throughput Gain

What is TensorRT?

A high-performance inference optimizer and runtime that transforms your trained models into production-ready engines.

From Training to Production

TensorRT bridges the gap between model training and deployment. While frameworks like PyTorch and TensorFlow excel at training, they're not optimized for inference. TensorRT takes your trained model and transforms it into a highly optimized engine.

The result? Dramatically faster inference, reduced memory footprint, and lower latency — all without changing a single line of your model's architecture.

TensorRT supports all major frameworks through ONNX, making it framework-agnostic. Train in PyTorch, deploy with TensorRT.

🧠

Trained Model

PyTorch, TensorFlow, ONNX

⚙️

TensorRT Optimizer

Parse, fuse, quantize, tune

🚀

Optimized Engine

Serialized .engine file

Production Deployment

Triton, vLLM, custom apps

Under the Hood

A deep dive into TensorRT's multi-layered optimization architecture

📥
Model Import Layer
Input
ONNX Parser

Imports models from the Open Neural Network Exchange format, supporting 150+ operators

UFF Parser

Legacy TensorFlow format support for older models and workflows (deprecated in recent TensorRT releases)

Caffe Parser

Native support for Caffe model definitions and pretrained weights (deprecated alongside UFF)

Network Definition API

Programmatically build networks layer-by-layer for custom architectures

Optimization Engine
Core
Graph Optimizer

Removes redundant operations through constant folding and dead-code elimination

Layer Fusion

Combines Conv+BN+ReLU and similar patterns into single optimized kernels

Precision Calibrator

INT8/FP16 quantization with automatic scale factor computation

Kernel Auto-Tuner

Benchmarks multiple implementations per layer, selects fastest

Memory Optimizer

Tensor reuse, workspace allocation, memory pooling strategies

Timing Cache

Stores kernel timing results to accelerate future builds

🔧
Runtime Execution
Runtime
Execution Context

Manages inference state, allows multiple concurrent executions

CUDA Streams

Asynchronous execution with stream-ordered memory allocation

Dynamic Shapes

Runtime tensor dimension changes with optimization profiles

DLA Support

Offload layers to Deep Learning Accelerator on Jetson/DRIVE

📦
Serialization & Deployment
Output
Engine Serialization

Save optimized engine as portable .plan/.engine file

Version Compatibility

Cross-version loading with backward compatibility checks

Triton Integration

Native backend for NVIDIA Triton Inference Server

Plugin System

Custom layer implementations for unsupported operations

The Magic Inside

Four key optimization techniques that make TensorRT incredibly fast

🔗

Layer & Tensor Fusion

Combine operations, reduce overhead

TensorRT identifies patterns of operations that can be merged into single, optimized CUDA kernels. This eliminates memory transfers between layers and reduces kernel launch overhead.

Before fusion: Conv2D → BatchNorm → ReLU (three kernels, three memory round-trips)
After fusion: ConvBNReLU (fewer kernels, less memory I/O)
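As a conceptual sketch in plain Python (not TensorRT's actual internals), fusion is a pattern match over the graph: find each Conv2D → BatchNorm → ReLU run and replace it with one fused op.

```python
# A conceptual sketch of pattern-based fusion in plain Python: this is not
# TensorRT's internals, just the idea. Scan the op sequence and merge each
# Conv2D -> BatchNorm -> ReLU run into a single fused op.

FUSABLE = ("Conv2D", "BatchNorm", "ReLU")

def fuse(ops):
    """Replace each Conv2D->BatchNorm->ReLU run with one ConvBNReLU op."""
    fused, i = [], 0
    while i < len(ops):
        if tuple(ops[i:i + 3]) == FUSABLE:
            fused.append("ConvBNReLU")  # one kernel launch instead of three
            i += 3
        else:
            fused.append(ops[i])
            i += 1
    return fused

graph = ["Conv2D", "BatchNorm", "ReLU", "MaxPool", "Conv2D", "BatchNorm", "ReLU"]
print(fuse(graph))  # ['ConvBNReLU', 'MaxPool', 'ConvBNReLU']
```

The real optimizer works on a dataflow graph and a much larger pattern library, but the payoff is the same: fewer kernel launches and fewer trips through GPU memory.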
🎯

Precision Calibration

Quantize without losing accuracy

Convert FP32 weights to FP16 or INT8 for massive speedups. INT8 calibration uses a representative dataset to compute optimal scale factors that minimize accuracy loss.

FP32 (32 bits) → FP16 (16 bits) → INT8 (8 bits)
INT8 speedup with <1% accuracy loss
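The core idea can be illustrated with symmetric per-tensor quantization: pick a scale that maps the observed activation range onto [-127, 127]. TensorRT's entropy calibrator is more sophisticated (it minimizes information loss over a calibration dataset), but this minimal sketch shows where the scale factor comes from.

```python
# Symmetric per-tensor INT8 quantization, sketched in plain Python. TensorRT's
# entropy calibrator is more sophisticated (it minimizes information loss over
# a calibration dataset), but the scale-factor idea is the same.

def compute_scale(calibration_data):
    """Map the largest observed magnitude onto the INT8 limit 127."""
    return max(abs(x) for x in calibration_data) / 127.0

def quantize(x, scale):
    q = round(x / scale)
    return max(-127, min(127, q))  # clamp into the symmetric INT8 range

def dequantize(q, scale):
    return q * scale

calib = [-6.35, 0.02, 3.1, 5.9]   # a (tiny) representative batch
scale = compute_scale(calib)       # 6.35 / 127 = 0.05
q = quantize(3.1, scale)           # -> 62
print(scale, q, dequantize(q, scale))
```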

Kernel Auto-Tuning

Find the fastest implementation

For each layer, TensorRT benchmarks multiple CUDA kernel implementations and selects the fastest one for your specific GPU architecture. Results are hardware-specific.

implicit_gemm: 2.4ms · winograd: 1.8ms · fft_tiled: 0.9ms ✓ · direct_conv: 3.1ms
100+ kernel variants, GPU-specific tuning
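A toy auto-tuner captures the selection loop (the candidate names below are stand-ins, not real CUDA kernels): time each implementation on representative input and keep the fastest.

```python
# A toy auto-tuner: the candidates are stand-ins, not real CUDA kernels, but
# the selection loop mirrors what TensorRT does per layer: benchmark each
# implementation on the target hardware and keep the fastest.
import time

def benchmark(fn, arg, repeats=50):
    start = time.perf_counter()
    for _ in range(repeats):
        fn(arg)
    return (time.perf_counter() - start) / repeats  # mean seconds per call

def autotune(candidates, arg):
    timings = {name: benchmark(fn, arg) for name, fn in candidates.items()}
    best = min(timings, key=timings.get)
    return best, timings

# Two interchangeable "implementations" of the same sum of squares.
candidates = {
    "generator_sum": lambda xs: sum(x * x for x in xs),
    "map_sum": lambda xs: sum(map(lambda x: x * x, xs)),
}
best, timings = autotune(candidates, list(range(10_000)))
print("fastest:", best)
```

Because the winner depends on the hardware the benchmark ran on, TensorRT engines are built per GPU architecture, which is also why the timing cache exists.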
💾

Memory Optimization

Maximize GPU utilization

TensorRT analyzes tensor lifetimes and reuses memory where possible. Tensors that don't overlap in time share the same memory allocation, dramatically reducing footprint.

Naive allocation: Tensor A, B, C, D each get a private buffer (wasted space)
Optimized: A → C and B → D share reused allocations
75% memory saved, leaving room for larger batch sizes
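The reuse idea is a greedy pass over tensor lifetimes: assign each tensor to any buffer whose previous occupant is already dead, otherwise allocate a new one. A minimal sketch (illustrative, not TensorRT's allocator):

```python
# A greedy lifetime-based allocator sketch (illustrative, not TensorRT's
# allocator): walk tensors in order of their live ranges and reuse any buffer
# whose previous occupant is no longer alive.

def assign_buffers(lifetimes):
    """lifetimes: {tensor: (first_use, last_use)} -> {tensor: buffer_id}."""
    buffer_free_at = []  # for each buffer, the first step it is free again
    assignment = {}
    for name, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1]):
        for buf, free_at in enumerate(buffer_free_at):
            if free_at <= start:        # previous tenant is dead: reuse
                assignment[name] = buf
                buffer_free_at[buf] = end + 1
                break
        else:                           # every buffer still occupied: allocate
            assignment[name] = len(buffer_free_at)
            buffer_free_at.append(end + 1)
    return assignment

# A and B are live early; C and D come later and can reuse their storage.
plan = assign_buffers({"A": (0, 1), "B": (0, 2), "C": (2, 3), "D": (3, 4)})
print(plan)  # 2 buffers cover 4 tensors
```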

The Optimization Pipeline

Step-by-step journey from trained model to optimized engine

1
📄

Model Import

Parse ONNX/TensorFlow model. Build internal network representation. Validate operator support.

2
🔍

Graph Analysis

Identify fusion patterns. Detect quantizable layers. Map tensor dependencies.

3
🔗

Layer Fusion

Merge compatible operations. Eliminate redundant computations. Optimize data flow.

4
🎯

Precision Selection

Calibrate INT8 scales. Apply mixed precision. Balance accuracy vs speed.

5

Kernel Selection

Benchmark kernel variants. Auto-tune for target GPU. Cache timing results.

6
📦

Engine Build

Serialize optimized engine. Generate .plan file. Ready for deployment.

Real-World Results

Inference latency comparison across optimization levels

Configuration            Latency   Speedup
PyTorch FP32             100ms     Baseline
TensorRT FP16            50ms      2×
TensorRT INT8            28ms      3.5×
TensorRT INT8 + Fusion   10ms      10×

Optimized For Your Workload

🖼️

Computer Vision

Image classification, object detection, segmentation. Optimized kernels for ResNet, YOLO, EfficientNet.

Faster · 4K FPS
🗣️

Speech & Audio

ASR, TTS, speaker recognition. Real-time processing for voice assistants and transcription.

<10ms Latency · Real-Time
📝

NLP & Transformers

BERT, GPT, T5 inference. Optimized attention mechanisms and sequence processing.

Throughput · 50% Cost Down
For Large Language Models

TensorRT-LLM

A specialized extension of TensorRT optimized for large language models. Powers production LLM inference at scale with state-of-the-art performance.

🔑

KV-Cache Optimization

Efficient key-value cache management for autoregressive generation

📦

In-Flight Batching

Dynamic batching of requests at different generation stages

Flash Attention

Memory-efficient attention with fused softmax kernels

🔀

Tensor Parallelism

Scale across multiple GPUs with optimized communication

📊

Quantization

INT8, INT4, FP8 quantization with minimal accuracy loss

🧩

Paged Attention

Virtual memory for KV cache, enabling larger batch sizes

🎯

Speculative Decoding

Draft model acceleration for faster token generation

🔧

Custom Plugins

Extensible architecture for custom model components
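The bookkeeping behind paged attention can be sketched in a few lines: carve the KV cache into fixed-size blocks and give each sequence a block table instead of one contiguous slab. The block size and data structures below are assumptions for illustration, not TensorRT-LLM's implementation.

```python
# Paged KV-cache bookkeeping, sketched in plain Python. Block size and data
# structures here are assumptions for illustration, not TensorRT-LLM's actual
# implementation: the cache is carved into fixed-size blocks and each sequence
# keeps a block table instead of one contiguous slab.

BLOCK_SIZE = 4  # tokens per block (real systems use larger blocks)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> physical block ids, in order
        self.lengths = {}       # seq_id -> tokens cached so far

    def append_token(self, seq_id):
        """Account for one generated token, grabbing a new block on overflow."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:  # first token, or current block is full
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def release(self, seq_id):
        """Sequence finished: its blocks go straight back to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(6):                    # 6 tokens -> ceil(6 / 4) = 2 blocks
    cache.append_token("req-1")
print(len(cache.block_tables["req-1"]), len(cache.free_blocks))  # 2 6
```

Because memory is claimed block by block rather than reserved up front for the maximum sequence length, many more sequences fit in the same GPU memory, which is what enables the larger batch sizes mentioned above.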

Containerized Deployment

NGC containers and production configurations for TensorRT inference

🐳

NGC Quick Start

Official NVIDIA TensorRT container

bash ngc-quickstart.sh
# Pull TensorRT container from NGC
docker pull nvcr.io/nvidia/tensorrt:24.01-py3

# Run with GPU access
docker run --gpus all -it \
  -v $(pwd):/workspace \
  nvcr.io/nvidia/tensorrt:24.01-py3

# Build TensorRT engine
trtexec --onnx=model.onnx --saveEngine=model.plan
CUDA 12.3 included
cuDNN 8.9 included
Python 3 + TensorRT bindings
🏭

Production Dockerfile

Multi-stage build with runtime-only image

dockerfile Dockerfile
# Build stage - compile TensorRT engine
FROM nvcr.io/nvidia/tensorrt:24.01-py3 AS builder
WORKDIR /build
COPY model.onnx .
RUN trtexec --onnx=model.onnx --saveEngine=model.plan --fp16

# Runtime stage - minimal footprint
FROM nvcr.io/nvidia/cuda:12.3.1-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y libnvinfer8 && \
    rm -rf /var/lib/apt/lists/*
COPY --from=builder /build/model.plan /app/
COPY inference_app /app/
WORKDIR /app
CMD ["./inference_app"]
🔗

Docker Compose

Triton + Prometheus + Grafana

yaml docker-compose.yml
version: '3.8'
services:
  triton:
    image: nvcr.io/nvidia/tritonserver:24.01-py3
    command: tritonserver --model-repository=/models
    ports:
      - "8000:8000"  # HTTP
      - "8001:8001"  # gRPC
      - "8002:8002"  # Metrics
    volumes:
      - ./models:/models
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
⚙️

Environment Variables

Key configuration options

Variable               Default   Description
CUDA_VISIBLE_DEVICES   all       GPU indices (0,1,2...)
TRT_LOGGER_LEVEL       WARNING   Logging verbosity
TRT_ENGINE_CACHE       -         Engine cache directory
TRT_MAX_WORKSPACE      1GB       Builder workspace limit
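In application code, these knobs can be read with the stdlib, falling back to the defaults in the table. Note that CUDA_VISIBLE_DEVICES is a standard CUDA variable, while the TRT_* names here are a deployment convention rather than official TensorRT environment variables.

```python
import os

# Read the configuration knobs above with the table's defaults.
# CUDA_VISIBLE_DEVICES is a standard CUDA variable; the TRT_* names are a
# deployment convention, not official TensorRT environment variables.
def load_config(env=None):
    env = os.environ if env is None else env
    return {
        "devices": env.get("CUDA_VISIBLE_DEVICES", "all"),
        "log_level": env.get("TRT_LOGGER_LEVEL", "WARNING"),
        "engine_cache": env.get("TRT_ENGINE_CACHE"),   # None = no caching
        "max_workspace": env.get("TRT_MAX_WORKSPACE", "1GB"),
    }

print(load_config({"CUDA_VISIBLE_DEVICES": "0,1"}))
```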

Kubernetes Deployment

GPU-aware scheduling and auto-scaling for production workloads

☸️

Triton Deployment

GPU-enabled pod specification

yaml deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-tensorrt
spec:
  replicas: 3
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.01-py3
        command: ["tritonserver"]
        args: ["--model-repository=/models"]
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
            cpu: "8"
⚖️

Service & Ingress

Load balancing configuration

yaml service.yaml
apiVersion: v1
kind: Service
metadata:
  name: triton-service
spec:
  type: ClusterIP
  selector:
    app: triton
  ports:
  - name: http
    port: 8000
    targetPort: 8000
  - name: grpc
    port: 8001
    targetPort: 8001
  - name: metrics
    port: 8002
    targetPort: 8002
📈

Horizontal Pod Autoscaler

GPU-aware auto-scaling with custom metrics

yaml hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-tensorrt
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: triton_queue_size
      target:
        type: AverageValue
        averageValue: "10"
🔧 Requires NVIDIA GPU Operator
📊 Prometheus + Adapter required
🏷️ Use nodeSelector for GPU nodes

Production Topologies

Architecture patterns for every scale and use case

Development

Scenario 1: Single GPU Direct Inference

Direct TensorRT runtime with C++ or Python API. Lowest latency path.

📱
Application
C++/Python
TRT API
⚙️
TRT Engine
model.plan
Execute
🎮
GPU 0
A100-40GB

Performance

Latency <1ms
Throughput 10K+ IPS
Overhead Minimal

Tips

Buffers Pre-alloc
Streams Async
Production

Scenario 2: Triton Inference Server

Enterprise model serving with dynamic batching and model management

👥
Clients
HTTP/gRPC
:8000/:8001
🔺
Triton Inference Server
Dynamic Batching + Metrics
Backends
⚙️
TensorRT
.plan
📦
ONNX-RT
.onnx
🔥
PyTorch
.pt

Features

Batching Dynamic
Versioning Multi-ver
Metrics :8002

Batching

Gain 2-5×
Max Wait 100μs
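The batching policy can be simulated in a few lines. This is a toy model of the scheduling idea, not Triton's implementation: requests accumulate until the batch is full or the oldest request has waited past the configured delay.

```python
# A toy model of dynamic batching (the scheduling idea, not Triton's
# implementation): requests accumulate until the batch is full or the oldest
# one has waited longer than max_delay, then the batch is dispatched.

def schedule(arrivals, max_batch=4, max_delay=100e-6):
    """arrivals: sorted request arrival times in seconds -> list of batches."""
    batches, current = [], []
    for t in arrivals:
        # Dispatch before adding if the pending batch is full or too old.
        if current and (len(current) == max_batch or t - current[0] > max_delay):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches

# A tight burst of four requests, then a straggler 1 ms later.
arrivals = [0.0, 10e-6, 20e-6, 30e-6, 1030e-6]
print([len(b) for b in schedule(arrivals)])  # [4, 1]
```

The 2-5× gain quoted above comes from amortizing kernel launches and filling the GPU with one large batch instead of many singleton requests, at the cost of up to max_delay of added queueing latency.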
Enterprise

Scenario 3: Multi-GPU Inference

Scale inference across multiple GPUs for throughput

👥
Clients
High Volume
:8000
🔺
Triton (Instance Groups)
count: 4, kind: KIND_GPU
Schedule
🎮
GPU 0
Instance 0
🎮
GPU 1
Instance 1
🎮
GPU 2
Instance 2
🎮
GPU 3
Instance 3

Scaling

Scale Linear
Latency Unchanged

Config

Instances 4
GPUs [0,1,2,3]
Edge

Scenario 4: Edge Deployment (Jetson)

Low-power, low-latency inference on NVIDIA Jetson platforms

📷
Sensors
Video/Data
DMA
🤖
Jetson AGX Orin
Unified Memory Architecture
15W-60W TDP
⚙️
TRT Engine
INT8
🎬
DeepStream
Pipeline

Jetson Orin

GPU 2048 CUDA
AI Perf 275 TOPS
Power 15-60W

Optimize

Precision INT8
DLA Enable
Build: trtexec --onnx=model.onnx --saveEngine=model.plan --int8 --useDLACore=0 --allowGPUFallback
Power: sudo nvpmodel -m 0 && sudo jetson_clocks

Scenario Comparison

Scenario           Latency   Throughput   Scaling      Complexity   Best For
Direct Inference   <1ms      10K IPS      Manual       Low          Dev/Test
Triton Server      1-5ms     50K+ IPS     Multi-GPU    Medium       Production
Multi-GPU          1-5ms     200K+ IPS    Horizontal   Medium       Scale-Out
Edge (Jetson)      5-50ms    100-1K IPS   Devices      Medium       Edge/IoT