cuDNN 9.x MIOpen 3.x

cuDNN vs MIOpen

GPU-Accelerated Deep Learning Primitives • 2026

Software Stack Architecture

How deep learning primitives fit into the GPU computing ecosystem

PyTorch / TensorFlow
Framework
cuDNN
DNN Primitives
CUDA / cuBLAS
Runtime
Tensor Cores
NVIDIA GPU
PyTorch / TensorFlow
Framework
MIOpen
DNN Primitives
HIP / rocBLAS
Runtime
Matrix Cores (MFMA)
AMD GPU

Kernel Execution Flow

How data moves through the deep learning primitive pipeline

📊
Input Tensor
NCHW / NHWC format
🔧
API Call
cudnnConvolutionForward
Kernel Selection
Auto-tune / Heuristic
🔲
Tensor Core Exec
MMA / MFMA instructions
Output Tensor
Result in GPU memory

Core Capabilities

Key features and optimizations in each library

NVIDIA cuDNN
CUDA Deep Neural Network library
  • 🔲
    Convolution
    Multiple algorithms with auto-tuning
    7+ algos
  • Tensor Cores
    FP8, FP16, BF16, TF32 acceleration
    4th Gen
  • 🎯
    Flash Attention
    Memory-efficient transformer attention
    FA-3
  • 📊
    Graph API
    Operation fusion and compilation
    cudnnGraph
  • 📐
    Normalization
    All variants with fusion support
    BN/LN/GN/IN
  • 🔄
    RNN/LSTM
    Persistent state in registers
    Persistent
AMD MIOpen
Machine Intelligence Open library
  • 🔲
    Convolution
    CDNA-optimized algorithms
    5+ algos
  • Matrix Cores
    MFMA FP8, BF16, FP16, INT8
    CDNA3
  • 🎯
    CK Attention
    Composable Kernel based attention
    CK-based
  • 📊
    Fusion API
    Operation fusion for common patterns
    Fusion
  • 📐
    Normalization
    Fused variants available
    BN/LN
  • 📖
    Open Source
    MIT license, full GitHub access
    MIT

Matrix Acceleration Units

Hardware-accelerated matrix multiplication at the heart of AI

NVIDIA Tensor Cores
8×8×4 MMA Operation
Architecture Hopper (H100)
Peak FP16 1,979 TFLOPS (2:4 sparsity; ~989 dense)
FP8 Support E4M3 / E5M2
Sparsity 2:4 Structured
VS
AMD Matrix Cores
MFMA 16×16×16 Operation
Architecture CDNA3 (MI300X)
Peak FP16 1,307 TFLOPS
FP8 Support E4M3 / E5M2
HBM3 5.3 TB/s

Feature-by-Feature Comparison

Detailed breakdown of capabilities across both libraries

cuDNN MIOpen
General
📜 License Proprietary (free) MIT Open Source
💻 Source Available No (closed) Yes (GitHub)
📦 Version 9.x (2024+) 3.x (2024+)
🔗 Framework Support Extensive Good (PyTorch, TF)
Precision & Data Types
🔢 FP32 ✓ ✓
FP16 ✓ Tensor Core ✓ MFMA
🧠 BF16 ✓ Tensor Core ✓ MFMA
🎯 TF32 ✓ (Ampere+) N/A
💎 FP8 ✓ (Hopper+) ✓ (MI300+)
📊 INT8 ✓ ✓
Convolution
🔲 Algorithms 7+ variants 5+ variants
🔧 Auto-Tune cudnnFind/Get miopenFind*
👥 Grouped Conv ✓ Full ✓ Full
📱 Depthwise Optimized Supported
Operations
🎯 Attention Flash Attention CK Attention
📐 BatchNorm Fused + Persistent Fused
📏 LayerNorm ✓ ✓
🏊 Pooling Max/Avg/Adaptive Max/Avg
🔄 RNN Persistent RNN Standard
Advanced Features
🌐 Op Graphs cudnnGraph (extensive) Fusion API
💾 Cache Location ~/.nv/ ~/.config/miopen/
🖥️ Multi-GPU + NCCL + RCCL

Algorithm Selection Logic

How cuDNN and MIOpen choose the optimal kernel for your workload

Convolution Request
Input shape, filter size, stride, padding, data type
Auto-Tune Enabled?
Yes → cudnnFind* / miopenFind*: benchmark all compatible algorithms
No → heuristic pick (e.g. cudnnGetConvolutionForwardAlgorithm_v7)
GEMM
im2col + matmul
FFT
frequency domain
Winograd
3×3 optimized
Implicit GEMM
direct compute
Cache Result
Store winning algorithm for this configuration

Relative Performance Characteristics

Conceptual comparison based on typical workloads (actual results vary by configuration)

ResNet-50 Training Throughput
cuDNN
~1.0x
MIOpen
~0.92x
Transformer (LLM) Inference
cuDNN
~1.0x
MIOpen
~0.92x
Flash Attention Performance
cuDNN
~1.0x
MIOpen
~0.84x
Memory Bandwidth Utilization
H100
3.35 TB/s
MI300X
5.3 TB/s

API Comparison

Side-by-side code examples for common operations

cuDNN - Convolution Forward
// Create handle and descriptors
cudnnHandle_t handle;
cudnnCreate(&handle);

cudnnTensorDescriptor_t xDesc, yDesc;
cudnnCreateTensorDescriptor(&xDesc);
cudnnCreateTensorDescriptor(&yDesc);
cudnnSetTensor4dDescriptor(xDesc,
    CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
    N, C, H, W);
// (set yDesc likewise from the output shape)

cudnnFilterDescriptor_t wDesc;
cudnnCreateFilterDescriptor(&wDesc);
cudnnSetFilter4dDescriptor(wDesc,
    CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
    K, C, R, S);

cudnnConvolutionDescriptor_t convDesc;
cudnnCreateConvolutionDescriptor(&convDesc);
cudnnSetConvolution2dDescriptor(convDesc,
    pad, pad, stride, stride, 1, 1,
    CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

// Heuristic query: rank candidate algorithms
// (use cudnnFindConvolutionForwardAlgorithm to benchmark instead)
int returnedCount;
cudnnConvolutionFwdAlgoPerf_t perfResults;
cudnnGetConvolutionForwardAlgorithm_v7(handle,
    xDesc, wDesc, convDesc, yDesc,
    1, &returnedCount, &perfResults);
cudnnConvolutionFwdAlgo_t algo = perfResults.algo;

// Execute convolution
cudnnConvolutionForward(handle,
    &alpha, xDesc, x, wDesc, w,
    convDesc, algo, workspace, wsSize,
    &beta, yDesc, y);
MIOpen - Convolution Forward
// Create handle and descriptors
miopenHandle_t handle;
miopenCreate(&handle);

miopenTensorDescriptor_t xDesc, yDesc, wDesc;
miopenCreateTensorDescriptor(&xDesc);
miopenCreateTensorDescriptor(&yDesc);
miopenCreateTensorDescriptor(&wDesc);
miopenSet4dTensorDescriptor(xDesc,
    miopenFloat, N, C, H, W);
miopenSet4dTensorDescriptor(wDesc,
    miopenFloat, K, C, R, S);
// (set yDesc likewise from the output shape)

miopenConvolutionDescriptor_t convDesc;
miopenCreateConvolutionDescriptor(&convDesc);
miopenInitConvolutionDescriptor(convDesc,
    miopenConvolution,
    pad, pad, stride, stride, 1, 1);

// Find: benchmark algorithms (buffers must be allocated)
int returnedCount;
miopenConvAlgoPerf_t perfResults;
miopenFindConvolutionForwardAlgorithm(handle,
    xDesc, x, wDesc, w, convDesc, yDesc, y,
    1, &returnedCount, &perfResults,
    workspace, wsSize, false);

// Execute convolution
miopenConvolutionForward(handle,
    &alpha, xDesc, x, wDesc, w,
    convDesc, perfResults.fwd_algo,
    &beta, yDesc, y,
    workspace, wsSize);

Evolution Timeline

Key milestones in GPU deep learning primitive development

2014
cuDNN 1.0 Launch
NVIDIA releases first version with basic convolution primitives
2017
MIOpen 1.0 Launch
AMD releases open-source alternative for ROCm ecosystem
2018
Tensor Core Support
cuDNN adds Volta Tensor Core acceleration for mixed precision
2020
CDNA Architecture
AMD introduces MFMA Matrix Cores with MI100
2022
Flash Attention Era
Both libraries integrate memory-efficient attention for LLMs
2023
FP8 Support
Hopper (H100) and MI300 bring 8-bit floating point acceleration
2024+
Graph Compilation
Advanced operation fusion with cudnnGraph and CK-based fusion