Deep Dive Documentation Series

NVIDIA CUDA Platform

Comprehensive technical exploration of GPU computing architecture, from the PTX intermediate representation and CUDA binaries to kernel execution

CUDA 14.x | Rubin Architecture

Select Your Module

01
🔧

PTX & CUDA Binaries

Deep dive into the CUDA compilation pipeline. Understand PTX intermediate representation, SASS native code, cubin binaries, and the forward compatibility that enables your code to run on future GPU architectures.

PTX · SASS · cubin · fatbin · nvcc
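The pipeline stages this chapter covers can be sketched with a minimal kernel; the nvcc invocations in the comments use sm_90 / compute_90 as an illustrative target, and the file name is hypothetical:

```cuda
// pipeline.cu — minimal kernel used to illustrate the compilation pipeline.
//
// Typical nvcc invocations:
//   nvcc -ptx pipeline.cu                  # emit PTX intermediate representation
//   nvcc -cubin -arch=sm_90 pipeline.cu    # emit native SASS inside a cubin
//   nvcc -gencode arch=compute_90,code=sm_90 \
//        -gencode arch=compute_90,code=compute_90 pipeline.cu -o pipeline
// The last form embeds both SASS and PTX in a fatbin, so the driver can
// JIT-compile the PTX for future architectures (forward compatibility).
__global__ void scale(float *x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}
```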
02
🏗️

GPU Architecture Deep Dive

Evolution from Pascal to Rubin. Explore streaming multiprocessors, tensor cores, memory hierarchies, and the revolutionary architectural advances that power modern AI and HPC workloads.

Pascal · Volta · Ampere · Hopper · Blackwell · Rubin
03

CUDA API & Kernel Execution

Master the CUDA software stack. Compare Runtime vs Driver APIs, understand CUDA libraries, and learn how kernels are launched, scheduled, and executed across GPU streaming multiprocessors.

Runtime API · Driver API · cuBLAS · cuDNN · Kernels
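A minimal sketch of the Runtime API flow this chapter covers — allocate, launch, synchronize — with error handling abbreviated for brevity:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    // Unified memory keeps the sketch short; explicit cudaMalloc +
    // cudaMemcpy is the more common production pattern.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
    add<<<blocks, threads>>>(a, b, c, n);      // asynchronous launch
    cudaDeviceSynchronize();                   // wait for the kernel to finish

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The triple-chevron launch enqueues the kernel on a stream; the host continues until an explicit synchronization or a blocking call such as a device-to-host copy.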
04
💾

Memory Optimization

Master the GPU memory hierarchy. Learn coalescing patterns, shared memory usage, bank conflicts, and register optimization to maximize bandwidth and minimize latency.

Global Memory · Shared Memory · Coalescing · Bank Conflicts
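Two of this chapter's topics, shared-memory tiling and bank-conflict padding, come together in the classic tiled matrix transpose, sketched here for a square matrix and a 32×32 tile:

```cuda
#define TILE 32

__global__ void transpose(float *out, const float *in, int n) {
    // The +1 column of padding shifts each row into a different bank,
    // avoiding 32-way bank conflicts on the column-wise reads below.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced read
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;   // swap block indices so both
    y = blockIdx.x * TILE + threadIdx.y;   // global accesses stay coalesced
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```

Without the padding, a naive transpose either reads or writes global memory with a stride of n, wasting most of each memory transaction.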
05
📊

Profiling Deep Dive

Profile and optimize CUDA applications. Master Nsight Compute and Nsight Systems, identify bottlenecks, and apply targeted optimizations through metrics-driven analysis.

Nsight Compute · Nsight Systems · Metrics · Bottlenecks
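A common starting point for the workflow this chapter covers is annotating host code with NVTX ranges so kernels appear under named regions on the Nsight Systems timeline. A minimal sketch, assuming a CUDA toolkit with the header-only NVTX v3 API; the binary name in the comments is illustrative:

```cuda
#include <nvtx3/nvToolsExt.h>   // NVTX v3, shipped with the CUDA toolkit
#include <cuda_runtime.h>

__global__ void work(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * x[i];
}

void run_step(float *d_x, int n) {
    nvtxRangePushA("work_step");            // named region on the timeline
    work<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaDeviceSynchronize();
    nvtxRangePop();
}

// Typical profiler invocations:
//   nsys profile ./app      # timeline view showing the "work_step" range
//   ncu --set full ./app    # per-kernel metrics in Nsight Compute
```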
06
🔧

Binary Utilities

Understand the CUDA binary toolchain. Follow code from nvcc through PTX to SASS, inspect binaries with cuobjdump and nvdisasm, and debug with cuda-gdb and compute-sanitizer.

nvcc · PTX · SASS · cuobjdump
07
🚀

TensorRT Deep Dive

NVIDIA's inference optimization engine. Master layer fusion, precision calibration, kernel auto-tuning, and production deployment with Docker, Kubernetes, and Triton Inference Server.

Layer Fusion · INT8/FP16 · Triton · TensorRT-LLM
08

vLLM Deep Dive

High-throughput LLM serving with PagedAttention. Explore continuous batching, CUDA graphs, speculative decoding, and production deployment patterns for scale.

PagedAttention · Continuous Batching · CUDA Graphs · Multi-GPU