NVIDIA Deep Learning SDK

TensorRT

The industry's most powerful deep learning inference optimizer. Transform trained models into high-performance engines that run up to 10× faster on NVIDIA GPUs.

10× Faster Inference · 75% Memory Reduction · Throughput Gain

What is TensorRT?

A high-performance inference optimizer and runtime that transforms your trained models into production-ready engines.

From Training to Production

TensorRT bridges the gap between model training and deployment. While frameworks like PyTorch and TensorFlow excel at training, they're not optimized for inference. TensorRT takes your trained model and transforms it into a highly optimized engine.

The result? Dramatically faster inference, reduced memory footprint, and lower latency — all without changing a single line of your model's architecture.

TensorRT supports all major frameworks through ONNX, making it framework-agnostic. Train in PyTorch, deploy with TensorRT.

🧠

Trained Model

PyTorch, TensorFlow, ONNX

⚙️

TensorRT Optimizer

Parse, fuse, quantize, tune

🚀

Optimized Engine

Serialized .engine file

Production Deployment

Triton, vLLM, custom apps

Under the Hood

A deep dive into TensorRT's multi-layered optimization architecture

📥
Model Import Layer
Input
ONNX Parser

Imports models from the Open Neural Network Exchange format, supporting 150+ operators

UFF Parser

Legacy TensorFlow format support for older models and workflows (deprecated in recent TensorRT releases)

Caffe Parser

Native support for Caffe model definitions and pretrained weights (deprecated alongside UFF)

Network Definition API

Programmatically build networks layer-by-layer for custom architectures

Optimization Engine
Core
Graph Optimizer

Removes redundant operations through constant folding and dead-code elimination

Layer Fusion

Combines Conv+BN+ReLU and similar patterns into single optimized kernels

Precision Calibrator

INT8/FP16 quantization with automatic scale factor computation

Kernel Auto-Tuner

Benchmarks multiple implementations per layer, selects fastest

Memory Optimizer

Tensor reuse, workspace allocation, memory pooling strategies

Timing Cache

Stores kernel timing results to accelerate future builds

🔧
Runtime Execution
Runtime
Execution Context

Manages inference state, allows multiple concurrent executions

CUDA Streams

Asynchronous execution with stream-ordered memory allocation

Dynamic Shapes

Runtime tensor dimension changes with optimization profiles

DLA Support

Offload layers to Deep Learning Accelerator on Jetson/DRIVE

📦
Serialization & Deployment
Output
Engine Serialization

Save optimized engine as portable .plan/.engine file

Version Compatibility

Cross-version loading with backward compatibility checks

Triton Integration

Native backend for NVIDIA Triton Inference Server

Plugin System

Custom layer implementations for unsupported operations

The Magic Inside

Four key optimization techniques that make TensorRT incredibly fast

🔗

Layer & Tensor Fusion

Combine operations, reduce overhead

TensorRT identifies patterns of operations that can be merged into single, optimized CUDA kernels. This eliminates memory transfers between layers and reduces kernel launch overhead.

Before fusion: Conv2D → BatchNorm → ReLU (three kernels, three memory round-trips)
After fusion: ConvBNReLU (fewer kernels, less memory I/O)
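As a conceptual sketch in plain Python (not TensorRT's actual internals), fusion is a pattern match over the graph: find each Conv2D → BatchNorm → ReLU run and replace it with one fused op.

```python
# A conceptual sketch of pattern-based fusion in plain Python: this is not
# TensorRT's internals, just the idea. Scan the op sequence and merge each
# Conv2D -> BatchNorm -> ReLU run into a single fused op.

FUSABLE = ("Conv2D", "BatchNorm", "ReLU")

def fuse(ops):
    """Replace each Conv2D->BatchNorm->ReLU run with one ConvBNReLU op."""
    fused, i = [], 0
    while i < len(ops):
        if tuple(ops[i:i + 3]) == FUSABLE:
            fused.append("ConvBNReLU")  # one kernel launch instead of three
            i += 3
        else:
            fused.append(ops[i])
            i += 1
    return fused

graph = ["Conv2D", "BatchNorm", "ReLU", "MaxPool", "Conv2D", "BatchNorm", "ReLU"]
print(fuse(graph))  # ['ConvBNReLU', 'MaxPool', 'ConvBNReLU']
```

The real optimizer works on a dataflow graph and a much larger pattern library, but the payoff is the same: fewer kernel launches and fewer trips through GPU memory.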
🎯

Precision Calibration

Quantize without losing accuracy

Convert FP32 weights to FP16 or INT8 for massive speedups. INT8 calibration uses a representative dataset to compute optimal scale factors that minimize accuracy loss.

FP32 (32 bits) → FP16 (16 bits) → INT8 (8 bits)
INT8 speedup with <1% accuracy loss
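The core idea can be illustrated with symmetric per-tensor quantization: pick a scale that maps the observed activation range onto [-127, 127]. TensorRT's entropy calibrator is more sophisticated (it minimizes information loss over a calibration dataset), but this minimal sketch shows where the scale factor comes from.

```python
# Symmetric per-tensor INT8 quantization, sketched in plain Python. TensorRT's
# entropy calibrator is more sophisticated (it minimizes information loss over
# a calibration dataset), but the scale-factor idea is the same.

def compute_scale(calibration_data):
    """Map the largest observed magnitude onto the INT8 limit 127."""
    return max(abs(x) for x in calibration_data) / 127.0

def quantize(x, scale):
    q = round(x / scale)
    return max(-127, min(127, q))  # clamp into the symmetric INT8 range

def dequantize(q, scale):
    return q * scale

calib = [-6.35, 0.02, 3.1, 5.9]   # a (tiny) representative batch
scale = compute_scale(calib)       # 6.35 / 127 = 0.05
q = quantize(3.1, scale)           # -> 62
print(scale, q, dequantize(q, scale))
```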

Kernel Auto-Tuning

Find the fastest implementation

For each layer, TensorRT benchmarks multiple CUDA kernel implementations and selects the fastest one for your specific GPU architecture. Results are hardware-specific.

implicit_gemm: 2.4ms · winograd: 1.8ms · fft_tiled: 0.9ms ✓ · direct_conv: 3.1ms
100+ kernel variants, GPU-specific tuning
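A toy auto-tuner captures the selection loop (the candidate names below are stand-ins, not real CUDA kernels): time each implementation on representative input and keep the fastest.

```python
# A toy auto-tuner: the candidates are stand-ins, not real CUDA kernels, but
# the selection loop mirrors what TensorRT does per layer: benchmark each
# implementation on the target hardware and keep the fastest.
import time

def benchmark(fn, arg, repeats=50):
    start = time.perf_counter()
    for _ in range(repeats):
        fn(arg)
    return (time.perf_counter() - start) / repeats  # mean seconds per call

def autotune(candidates, arg):
    timings = {name: benchmark(fn, arg) for name, fn in candidates.items()}
    best = min(timings, key=timings.get)
    return best, timings

# Two interchangeable "implementations" of the same sum of squares.
candidates = {
    "generator_sum": lambda xs: sum(x * x for x in xs),
    "map_sum": lambda xs: sum(map(lambda x: x * x, xs)),
}
best, timings = autotune(candidates, list(range(10_000)))
print("fastest:", best)
```

Because the winner depends on the hardware the benchmark ran on, TensorRT engines are built per GPU architecture, which is also why the timing cache exists.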
💾

Memory Optimization

Maximize GPU utilization

TensorRT analyzes tensor lifetimes and reuses memory where possible. Tensors that don't overlap in time share the same memory allocation, dramatically reducing footprint.

Naive allocation: Tensor A, B, C, D each get a private buffer (wasted space)
Optimized: A → C and B → D share reused allocations
75% memory saved, leaving room for larger batch sizes
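The reuse idea is a greedy pass over tensor lifetimes: assign each tensor to any buffer whose previous occupant is already dead, otherwise allocate a new one. A minimal sketch (illustrative, not TensorRT's allocator):

```python
# A greedy lifetime-based allocator sketch (illustrative, not TensorRT's
# allocator): walk tensors in order of their live ranges and reuse any buffer
# whose previous occupant is no longer alive.

def assign_buffers(lifetimes):
    """lifetimes: {tensor: (first_use, last_use)} -> {tensor: buffer_id}."""
    buffer_free_at = []  # for each buffer, the first step it is free again
    assignment = {}
    for name, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1]):
        for buf, free_at in enumerate(buffer_free_at):
            if free_at <= start:        # previous tenant is dead: reuse
                assignment[name] = buf
                buffer_free_at[buf] = end + 1
                break
        else:                           # every buffer still occupied: allocate
            assignment[name] = len(buffer_free_at)
            buffer_free_at.append(end + 1)
    return assignment

# A and B are live early; C and D come later and can reuse their storage.
plan = assign_buffers({"A": (0, 1), "B": (0, 2), "C": (2, 3), "D": (3, 4)})
print(plan)  # 2 buffers cover 4 tensors
```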

The Optimization Pipeline

Step-by-step journey from trained model to optimized engine

1
📄

Model Import

Parse ONNX/TensorFlow model. Build internal network representation. Validate operator support.

2
🔍

Graph Analysis

Identify fusion patterns. Detect quantizable layers. Map tensor dependencies.

3
🔗

Layer Fusion

Merge compatible operations. Eliminate redundant computations. Optimize data flow.

4
🎯

Precision Selection

Calibrate INT8 scales. Apply mixed precision. Balance accuracy vs speed.

5

Kernel Selection

Benchmark kernel variants. Auto-tune for target GPU. Cache timing results.

6
📦

Engine Build

Serialize optimized engine. Generate .plan file. Ready for deployment.

Real-World Results

Inference latency comparison across optimization levels

Configuration            Latency   Speedup
PyTorch FP32             100ms     Baseline
TensorRT FP16            50ms      2×
TensorRT INT8            28ms      3.5×
TensorRT INT8 + Fusion   10ms      10×

Optimized For Your Workload

🖼️

Computer Vision

Image classification, object detection, segmentation. Optimized kernels for ResNet, YOLO, EfficientNet.

Faster · 4K FPS
🗣️

Speech & Audio

ASR, TTS, speaker recognition. Real-time processing for voice assistants and transcription.

<10ms Latency · Real-Time
📝

NLP & Transformers

BERT, GPT, T5 inference. Optimized attention mechanisms and sequence processing.

Throughput · 50% Cost Down
For Large Language Models

TensorRT-LLM

A specialized extension of TensorRT optimized for large language models. Powers production LLM inference at scale with state-of-the-art performance.

🔑

KV-Cache Optimization

Efficient key-value cache management for autoregressive generation

📦

In-Flight Batching

Dynamic batching of requests at different generation stages

Flash Attention

Memory-efficient attention with fused softmax kernels

🔀

Tensor Parallelism

Scale across multiple GPUs with optimized communication

📊

Quantization

INT8, INT4, FP8 quantization with minimal accuracy loss

🧩

Paged Attention

Virtual memory for KV cache, enabling larger batch sizes

🎯

Speculative Decoding

Draft model acceleration for faster token generation

🔧

Custom Plugins

Extensible architecture for custom model components
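The bookkeeping behind paged attention can be sketched in a few lines: carve the KV cache into fixed-size blocks and give each sequence a block table instead of one contiguous slab. The block size and data structures below are assumptions for illustration, not TensorRT-LLM's implementation.

```python
# Paged KV-cache bookkeeping, sketched in plain Python. Block size and data
# structures here are assumptions for illustration, not TensorRT-LLM's actual
# implementation: the cache is carved into fixed-size blocks and each sequence
# keeps a block table instead of one contiguous slab.

BLOCK_SIZE = 4  # tokens per block (real systems use larger blocks)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> physical block ids, in order
        self.lengths = {}       # seq_id -> tokens cached so far

    def append_token(self, seq_id):
        """Account for one generated token, grabbing a new block on overflow."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:  # first token, or current block is full
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def release(self, seq_id):
        """Sequence finished: its blocks go straight back to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(6):                    # 6 tokens -> ceil(6 / 4) = 2 blocks
    cache.append_token("req-1")
print(len(cache.block_tables["req-1"]), len(cache.free_blocks))  # 2 6
```

Because memory is claimed block by block rather than reserved up front for the maximum sequence length, many more sequences fit in the same GPU memory, which is what enables the larger batch sizes mentioned above.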

Containerized Deployment

NGC containers and production configurations for TensorRT inference

🐳

NGC Quick Start

Official NVIDIA TensorRT container

bash ngc-quickstart.sh
# Pull TensorRT container from NGC
docker pull nvcr.io/nvidia/tensorrt:24.01-py3

# Run with GPU access
docker run --gpus all -it \
  -v $(pwd):/workspace \
  nvcr.io/nvidia/tensorrt:24.01-py3

# Build TensorRT engine
trtexec --onnx=model.onnx --saveEngine=model.plan
CUDA 12.3 included
cuDNN 8.9 included
Python 3 + TensorRT bindings
🏭

Production Dockerfile

Multi-stage build with runtime-only image

dockerfile Dockerfile
# Build stage - compile TensorRT engine
FROM nvcr.io/nvidia/tensorrt:24.01-py3 AS builder
WORKDIR /build
COPY model.onnx .
RUN trtexec --onnx=model.onnx --saveEngine=model.plan --fp16

# Runtime stage - minimal footprint
FROM nvcr.io/nvidia/cuda:12.3.1-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y libnvinfer8 && \
    rm -rf /var/lib/apt/lists/*
COPY --from=builder /build/model.plan /app/
COPY inference_app /app/
WORKDIR /app
CMD ["./inference_app"]
🔗

Docker Compose

Triton + Prometheus + Grafana

yaml docker-compose.yml
version: '3.8'
services:
  triton:
    image: nvcr.io/nvidia/tritonserver:24.01-py3
    command: tritonserver --model-repository=/models
    ports:
      - "8000:8000"  # HTTP
      - "8001:8001"  # gRPC
      - "8002:8002"  # Metrics
    volumes:
      - ./models:/models
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
⚙️

Environment Variables

Key configuration options

Variable               Default   Description
CUDA_VISIBLE_DEVICES   all       GPU indices (0,1,2...)
TRT_LOGGER_LEVEL       WARNING   Logging verbosity
TRT_ENGINE_CACHE       -         Engine cache directory
TRT_MAX_WORKSPACE      1GB       Builder workspace limit
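In application code, these knobs can be read with the stdlib, falling back to the defaults in the table. Note that CUDA_VISIBLE_DEVICES is a standard CUDA variable, while the TRT_* names here are a deployment convention rather than official TensorRT environment variables.

```python
import os

# Read the configuration knobs above with the table's defaults.
# CUDA_VISIBLE_DEVICES is a standard CUDA variable; the TRT_* names are a
# deployment convention, not official TensorRT environment variables.
def load_config(env=None):
    env = os.environ if env is None else env
    return {
        "devices": env.get("CUDA_VISIBLE_DEVICES", "all"),
        "log_level": env.get("TRT_LOGGER_LEVEL", "WARNING"),
        "engine_cache": env.get("TRT_ENGINE_CACHE"),   # None = no caching
        "max_workspace": env.get("TRT_MAX_WORKSPACE", "1GB"),
    }

print(load_config({"CUDA_VISIBLE_DEVICES": "0,1"}))
```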

Kubernetes Deployment

GPU-aware scheduling and auto-scaling for production workloads

☸️

Triton Deployment

GPU-enabled pod specification

yaml deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-tensorrt
spec:
  replicas: 3
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.01-py3
        command: ["tritonserver"]
        args: ["--model-repository=/models"]
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
            cpu: "8"
⚖️

Service & Ingress

Load balancing configuration

yaml service.yaml
apiVersion: v1
kind: Service
metadata:
  name: triton-service
spec:
  type: ClusterIP
  selector:
    app: triton
  ports:
  - name: http
    port: 8000
    targetPort: 8000
  - name: grpc
    port: 8001
    targetPort: 8001
  - name: metrics
    port: 8002
    targetPort: 8002
📈

Horizontal Pod Autoscaler

GPU-aware auto-scaling with custom metrics

yaml hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-tensorrt
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: triton_queue_size
      target:
        type: AverageValue
        averageValue: "10"
🔧 Requires NVIDIA GPU Operator
📊 Prometheus + Adapter required
🏷️ Use nodeSelector for GPU nodes

Production Topologies

Architecture patterns for every scale and use case

Development

Scenario 1: Single GPU Direct Inference

Direct TensorRT runtime with C++ or Python API. Lowest latency path.

📱
Application
C++/Python
TRT API
⚙️
TRT Engine
model.plan
Execute
🎮
GPU 0
A100-40GB

Performance

Latency <1ms
Throughput 10K+ IPS
Overhead Minimal

Tips

Buffers Pre-alloc
Streams Async
Production

Scenario 2: Triton Inference Server

Enterprise model serving with dynamic batching and model management

👥
Clients
HTTP/gRPC
:8000/:8001
🔺
Triton Inference Server
Dynamic Batching + Metrics
Backends
⚙️
TensorRT
.plan
📦
ONNX-RT
.onnx
🔥
PyTorch
.pt

Features

Batching Dynamic
Versioning Multi-ver
Metrics :8002

Batching

Gain 2-5×
Max Wait 100μs
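The batching policy can be simulated in a few lines. This is a toy model of the scheduling idea, not Triton's implementation: requests accumulate until the batch is full or the oldest request has waited past the configured delay.

```python
# A toy model of dynamic batching (the scheduling idea, not Triton's
# implementation): requests accumulate until the batch is full or the oldest
# one has waited longer than max_delay, then the batch is dispatched.

def schedule(arrivals, max_batch=4, max_delay=100e-6):
    """arrivals: sorted request arrival times in seconds -> list of batches."""
    batches, current = [], []
    for t in arrivals:
        # Dispatch before adding if the pending batch is full or too old.
        if current and (len(current) == max_batch or t - current[0] > max_delay):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches

# A tight burst of four requests, then a straggler 1 ms later.
arrivals = [0.0, 10e-6, 20e-6, 30e-6, 1030e-6]
print([len(b) for b in schedule(arrivals)])  # [4, 1]
```

The 2-5× gain quoted above comes from amortizing kernel launches and filling the GPU with one large batch instead of many singleton requests, at the cost of up to max_delay of added queueing latency.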
Enterprise

Scenario 3: Multi-GPU Inference

Scale inference across multiple GPUs for throughput

👥
Clients
High Volume
:8000
🔺
Triton (Instance Groups)
count: 4, kind: KIND_GPU
Schedule
🎮
GPU 0
Instance 0
🎮
GPU 1
Instance 1
🎮
GPU 2
Instance 2
🎮
GPU 3
Instance 3

Scaling

Scale Linear
Latency Unchanged

Config

Instances 4
GPUs [0,1,2,3]
Edge

Scenario 4: Edge Deployment (Jetson)

Low-power, low-latency inference on NVIDIA Jetson platforms

📷
Sensors
Video/Data
DMA
🤖
Jetson AGX Orin
Unified Memory Architecture
15W-60W TDP
⚙️
TRT Engine
INT8
🎬
DeepStream
Pipeline

Jetson Orin

GPU 2048 CUDA
AI Perf 275 TOPS
Power 15-60W

Optimize

Precision INT8
DLA Enable
Build: trtexec --onnx=model.onnx --saveEngine=model.plan --int8 --useDLACore=0 --allowGPUFallback
Power: sudo nvpmodel -m 0 && sudo jetson_clocks

Scenario Comparison

Scenario           Latency   Throughput   Scaling      Complexity   Best For
Direct Inference   <1ms      10K IPS      Manual       Low          Dev/Test
Triton Server      1-5ms     50K+ IPS     Multi-GPU    Medium       Production
Multi-GPU          1-5ms     200K+ IPS    Horizontal   Medium       Scale-Out
Edge (Jetson)      5-50ms    100-1K IPS   Devices      Medium       Edge/IoT