Deep Dive Documentation Series

NVIDIA CUDA Platform

Comprehensive technical exploration of GPU computing architecture, from the PTX intermediate representation and CUDA binaries to kernel execution

CUDA 14.x | Rubin Architecture

Select Your Module

01
🔧

PTX & CUDA Binaries

Deep dive into the CUDA compilation pipeline. Understand PTX intermediate representation, SASS native code, cubin binaries, and the forward compatibility that enables your code to run on future GPU architectures.

PTX · SASS · cubin · fatbin · nvcc
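The pipeline stages this chapter covers can be sketched with a minimal kernel; the nvcc invocations in the comments use sm_90 / compute_90 as an illustrative target, and the file name is hypothetical:

```cuda
// pipeline.cu — minimal kernel used to illustrate the compilation pipeline.
//
// Typical nvcc invocations:
//   nvcc -ptx pipeline.cu                  # emit PTX intermediate representation
//   nvcc -cubin -arch=sm_90 pipeline.cu    # emit native SASS inside a cubin
//   nvcc -gencode arch=compute_90,code=sm_90 \
//        -gencode arch=compute_90,code=compute_90 pipeline.cu -o pipeline
// The last form embeds both SASS and PTX in a fatbin, so the driver can
// JIT-compile the PTX for future architectures (forward compatibility).
__global__ void scale(float *x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}
```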
02
🏗️

GPU Architecture Deep Dive

Evolution from Pascal to Rubin. Explore streaming multiprocessors, tensor cores, memory hierarchies, and the revolutionary architectural advances that power modern AI and HPC workloads.

Pascal · Volta · Ampere · Hopper · Blackwell · Rubin
03

CUDA API & Kernel Execution

Master the CUDA software stack. Compare Runtime vs Driver APIs, understand CUDA libraries, and learn how kernels are launched, scheduled, and executed across GPU streaming multiprocessors.

Runtime API · Driver API · cuBLAS · cuDNN · Kernels
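A minimal sketch of the Runtime API flow this chapter covers — allocate, launch, synchronize — with error handling abbreviated for brevity:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    // Unified memory keeps the sketch short; explicit cudaMalloc +
    // cudaMemcpy is the more common production pattern.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
    add<<<blocks, threads>>>(a, b, c, n);      // asynchronous launch
    cudaDeviceSynchronize();                   // wait for the kernel to finish

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The triple-chevron launch enqueues the kernel on a stream; the host continues until an explicit synchronization or a blocking call such as a device-to-host copy.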
04
💾

Memory Optimization

Master the GPU memory hierarchy. Learn coalescing patterns, shared memory usage, bank conflicts, and register optimization to maximize bandwidth and minimize latency.

Global Memory · Shared Memory · Coalescing · Bank Conflicts
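Two of this chapter's topics, shared-memory tiling and bank-conflict padding, come together in the classic tiled matrix transpose, sketched here for a square matrix and a 32×32 tile:

```cuda
#define TILE 32

__global__ void transpose(float *out, const float *in, int n) {
    // The +1 column of padding shifts each row into a different bank,
    // avoiding 32-way bank conflicts on the column-wise reads below.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced read
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;   // swap block indices so both
    y = blockIdx.x * TILE + threadIdx.y;   // global accesses stay coalesced
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```

Without the padding, a naive transpose either reads or writes global memory with a stride of n, wasting most of each memory transaction.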
05
📊

Profiling Deep Dive

Profile and optimize CUDA applications. Master Nsight Compute and Nsight Systems, identify bottlenecks, and apply targeted optimizations through metrics-driven analysis.

Nsight Compute · Nsight Systems · Metrics · Bottlenecks
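A common starting point for the workflow this chapter covers is annotating host code with NVTX ranges so kernels appear under named regions on the Nsight Systems timeline. A minimal sketch, assuming a CUDA toolkit with the header-only NVTX v3 API; the binary name in the comments is illustrative:

```cuda
#include <nvtx3/nvToolsExt.h>   // NVTX v3, shipped with the CUDA toolkit
#include <cuda_runtime.h>

__global__ void work(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * x[i];
}

void run_step(float *d_x, int n) {
    nvtxRangePushA("work_step");            // named region on the timeline
    work<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaDeviceSynchronize();
    nvtxRangePop();
}

// Typical profiler invocations:
//   nsys profile ./app      # timeline view showing the "work_step" range
//   ncu --set full ./app    # per-kernel metrics in Nsight Compute
```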
06
🔧

Binary Utilities

Understand the CUDA binary toolchain. Follow code from nvcc through PTX to SASS, inspect binaries with cuobjdump and nvdisasm, and debug with cuda-gdb and compute-sanitizer.

nvcc · PTX · SASS · cuobjdump
07
🚀

TensorRT Deep Dive

NVIDIA's inference optimization engine. Master layer fusion, precision calibration, kernel auto-tuning, and production deployment with Docker, Kubernetes, and Triton Inference Server.

Layer Fusion · INT8/FP16 · Triton · TensorRT-LLM
08

vLLM Deep Dive

High-throughput LLM serving with PagedAttention. Explore continuous batching, CUDA graphs, speculative decoding, and production deployment patterns for scale.

PagedAttention · Continuous Batching · CUDA Graphs · Multi-GPU