Subramaniyam Venkata Pooni

Research & Publications

Ongoing research contributions in AI, distributed systems, and compiler technologies

🔬 ACTIVE RESEARCH AREAS

Agentic AI Systems

Multi-agent orchestration, ReAct patterns, tool-augmented reasoning, and autonomous AI workflows using LangChain, LangGraph, and custom agent frameworks.

Multi-Agent RAG AP2 Protocol X-A2A

⚡ LLM Inference Optimization

High-performance inference pipelines, quantization strategies, KV-cache optimization, and hardware-aware compilation for NVIDIA and AMD accelerators.

TensorRT vLLM Triton

🔧 Compiler Technologies

MLIR-based compiler design, LLVM optimization passes, WebAssembly targets for edge AI, and domain-specific language development.

MLIR LLVM WASM

📚 PUBLICATIONS & PAPERS

🖥️ AI ACCELERATOR PLATFORMS

AI Accelerator Market Report 2026: The Platform Race

Subramaniyam V. Pooni

Industry Report · January 2026

NVIDIA AMD Intel HPC

Comprehensive analysis of the AI accelerator landscape in 2026, covering GPU, TPU, and custom ASIC architectures from NVIDIA, AMD, Intel, Google, and emerging players.

Read Full Report →

NVIDIA CUDA Platform: Deep Dive Documentation Series

Subramaniyam V. Pooni

Technical Documentation · 2026 · 6 Chapters · CUDA 14.x | Rubin Architecture

CUDA PTX Hopper Blackwell

Comprehensive technical exploration of GPU computing architecture, from PTX code to kernel execution. Covers the CUDA compilation pipeline and GPU architecture evolution (Pascal→Volta→Ampere→Hopper→Blackwell→Rubin).

Explore CUDA Deep Dive →

AMD ROCm Platform: Deep Dive Documentation Series

Subramaniyam V. Pooni

Technical Documentation · 2026 · 7 Chapters · ROCm 6.x | MI300X/MI350X

ROCm HIP CDNA MI300

Comprehensive technical exploration of AMD's ROCm platform and CDNA architecture. Covers HIP programming, AMDGPU compiler, Instinct accelerators (MI250→MI300→MI350), and datacenter GPU deployments.

Explore ROCm Deep Dive →

Google TPU & XLA Platform: Deep Dive Documentation Series

Subramaniyam V. Pooni

Technical Documentation · 2026 · 3 Chapters · TPU v7 Ironwood | JAX/XLA Stack

TPU XLA JAX HLO

Comprehensive technical exploration of Google's Tensor Processing Units, XLA compiler infrastructure, and JAX programming framework. Covers TPU architecture evolution (v1→v4→v5e→v5p→Trillium→Ironwood).

Explore TPU Deep Dive →

AWS Trainium & Neuron Platform: Deep Dive Documentation Series

Subramaniyam V. Pooni

Technical Documentation · 2026 · 7 Chapters · Trainium3 | Neuron SDK | NKI

Trainium Neuron SDK NKI NeuronCore

Comprehensive technical exploration of Amazon's Trainium AI accelerators, Neuron SDK compiler infrastructure, and NeuronCore architecture. Covers Trainium evolution (1→2→3) and NKI kernel programming.

Explore Trainium Deep Dive →

🌐 HIGH-PERFORMANCE NETWORKING & INTERCONNECTS

Ultra Ethernet vs RDMA + NVMe-oF Integration

Subramaniyam V. Pooni

Technical Reference · December 2025 · 8 Sections · UEC Spec 1.0 | NVMe-oF 1.1

UEC RDMA NVMe-oF RoCE v2

Comprehensive analysis of high-performance networking for AI/HPC and storage. Covers RDMA fundamentals, UEC architecture, memory operations, flow control, NVMe over Fabrics, and AI collectives.

1M+ UEC Endpoints · <2μs NVMe-oF Latency · 800G Link Speed

Read Full Reference →

CXL + UEC Integration: Bridging Internal Memory Fabric to External Network

Subramaniyam V. Pooni

LinkedIn Article · 2025 · Industry Analysis

CXL 3.0 UEC Memory Fabric

Analysis of how CXL and UEC technologies can be integrated to bridge internal memory fabric with external network fabric. Explores cache-coherent interconnects and memory pooling.

Read on LinkedIn →

💾 STORAGE & MEMORY SYSTEMS

Storage Implications of New Generation AI Applications

Subramaniyam V. Pooni

LinkedIn Article · 2025 · AI Infrastructure

AI Storage LLM Inference Data Pipeline

Analysis of storage architecture requirements for next-generation AI workloads. Explores checkpoint/restart patterns, model weight distribution, and evolving storage hierarchy.

Read on LinkedIn →

Storage is the Bottleneck: A GPU-NVMe Technical Deep Dive

Subramaniyam V. Pooni

Technical Documentation · 2025 · 4 Chapters, 26 Sections, 1MB+

NVMe GPUDirect CUDA HPC

Publication-quality technical documentation on GPU-storage integration challenges. Covers NVMe queue architecture, doorbell serialization, GPUDirect Storage, and CXL memory semantics.

Read Full Documentation →

Distributed Endpoint Architecture for KV-Cache Offloading in LLM Inference

Subramaniyam V. Pooni

Technical Reference v3.0 · December 2025 · 13 Chapters, 10 Appendices

KV-Cache CXL 3.0 LLM Inference

Memory-efficient LLM serving using CXL-based intelligent memory endpoints. Features per-head tracking, EMA-based attention scoring, RoPE-aware prefetch, and hardware-accelerated cache management.

6× Memory Expansion · 16× User Capacity · 97% HBM Hit Rate · 36% Cost Reduction

Read Full Reference →

🧠 AI/ML ARCHITECTURE & SYSTEMS

Graph Neural Networks & Large Language Models: A Visual Guide

Subramaniyam V. Pooni

LinkedIn Article · 2025 · AI/ML Architecture

GNN LLM Transformers

Visual guide exploring the intersection of Graph Neural Networks and Large Language Models. Covers message passing, attention mechanisms, graph transformers, and knowledge graph integration.

Read on LinkedIn →

Scalable Multi-Agent RAG Architectures for Enterprise LLM Deployments

Subramaniyam V. Pooni

Working Paper · 2025

Agentic AI RAG LLMOps

Proposes a scalable architecture for deploying multi-agent RAG systems in enterprise environments, addressing challenges in agent coordination, memory management, and inference optimization.

Agent Communication Protocols & Context Engineering

Subramaniyam V. Pooni

Technical Documentation · CS²B Technologies · 2025 · 16 Chapters

MCP A2A Context Engineering WSCI

Comprehensive research document exploring MCP, A2A, and emerging standards for multi-agent communication. Deep dives into context engineering frameworks including the WSCI methodology. Features 6 protocols, 4 frameworks, and 50+ code examples.

View Agent Protocols Research →

🔧 CPU MICROARCHITECTURE

Bridge Checkpoint Unit (BCU) Microarchitecture

Subramaniyam V. Pooni

LinkedIn Article · 2025 · CPU Microarchitecture

Microarchitecture Checkpoint OoO Execution

Deep dive into the Bridge Checkpoint Unit (BCU) microarchitecture for modern out-of-order processors. Explores checkpoint/rollback mechanisms, speculative execution recovery, and register renaming.

Read on LinkedIn →

⚙️ COMPILERS & DISTRIBUTED SYSTEMS

MLIR-Based Compiler Design for Edge AI Inference on WebAssembly

Subramaniyam V. Pooni

Technical Report · CSSQUAREDB Technologies · 2020

MLIR WASM Edge AI

Presents a novel MLIR-based compiler toolchain targeting WebAssembly for deploying lightweight AI models on edge devices with near-native performance.

Distributed Parameter Server with Raft Consensus for Federated Learning

Subramaniyam V. Pooni

Technical Report · CSSQUAREDB Technologies · 2020

Distributed ML Raft Federated

Introduces a fault-tolerant parameter server architecture using Raft consensus for distributed machine learning, achieving a 40%+ throughput improvement in federated learning scenarios.

Multi-Target Compiler Infrastructure: Deep Dive Research

Subramaniyam V. Pooni

Technical Documentation · CS²B Technologies · 2025 · 9 Phases

LLVM WebAssembly LALR Type Systems

Comprehensive multi-target compiler infrastructure covering lexical analysis, LALR parsing, type checking, stack-based IR design, WebAssembly binary encoding, LLVM code generation, and Python bytecode emission.

View Compiler Deep Dive →

PyRaft: Distributed Consensus Implementation & Documentation

Subramaniyam V. Pooni

Technical Documentation · CS²B Technologies · 2025 · Full Implementation

Raft Consensus Python Distributed

Complete Raft consensus algorithm implementation with leader election, log replication, cluster membership, and simulation framework. Includes comprehensive API reference and interactive visualizations.

View Raft Documentation →

🤖 GENERATIVE AI & EMERGENT PROPERTIES

Can AI Agents Go "Rogue" Because of Emergent Properties?

Subramaniyam V. Pooni

LinkedIn Article · AI Safety & Alignment

AI Safety Emergent

Emergent Properties in GenAI

Subramaniyam V. Pooni

LinkedIn Article · LLM Research

GenAI Emergence

Holy Grail of Zero-Shot Learning

Subramaniyam V. Pooni

LinkedIn Article · Transfer Learning

Zero-Shot LLM

Risks Associated with Emergent Properties

Subramaniyam V. Pooni

LinkedIn Article · AI Risk Assessment

AI Risk Safety

Size of Model at which Emergent Properties Occur

Subramaniyam V. Pooni

LinkedIn Article · Scaling Laws

Scaling Emergence

In-Depth Comparison: Auto-Regressive Models vs. Masked Language Models (MLMs)

Subramaniyam V. Pooni

LinkedIn Article · LLM Architecture

GPT BERT MLM

📡 AI-ENHANCED NETWORKING

Next Generation Networking Enhanced by AI

Subramaniyam V. Pooni

LinkedIn Article · Network Architecture

AI Networking SDN

Elimination of 5-Tuple Classification in Networking Using AI

Subramaniyam V. Pooni

LinkedIn Article · Traffic Classification

5-Tuple Flow

Introduction of New Traffic Flow Types Using AI Without Code Changes

Subramaniyam V. Pooni

LinkedIn Article · Adaptive Networks

Zero-Code Traffic

Tagging, Networks and AI

Subramaniyam V. Pooni

LinkedIn Article · Network Metadata

Tagging Metadata

Next-Generation NOC Powered by Generative AI

Subramaniyam V. Pooni

LinkedIn Article · Network Operations

NOC GenAI AIOps

Ethernet Scale-Up Fabrics: A Deep Dive

Subramaniyam V. Pooni

LinkedIn Article · Data Center Networking

Ethernet Scale-Up Fabric

💿 AI-ENHANCED STORAGE

Building Point-in-Time Filesystem Traversal

Subramaniyam V. Pooni

LinkedIn Article · Filesystem Architecture

Filesystem Snapshots Time Travel

Read on LinkedIn →

Next Generation Storage Enhanced by AI

Subramaniyam V. Pooni

LinkedIn Article · Intelligent Storage

AI Storage Smart I/O

Storage Retrieval Inspired by Ray Tracing

Subramaniyam V. Pooni

LinkedIn Article · Novel Architectures

Ray Tracing Retrieval

DLSS and Data Retrieval

Subramaniyam V. Pooni

LinkedIn Article · GPU-Inspired Storage

DLSS Upscaling

Reed-Solomon Coding and AI: Enhancing Error Correction and Data Reliability

Subramaniyam V. Pooni

LinkedIn Article · Erasure Coding

Reed-Solomon ECC

Dispersed Storage, AI and Lagrange's Interpolation

Subramaniyam V. Pooni

LinkedIn Article · Distributed Storage

Dispersed Lagrange

🔀 DEEP NEURAL NETWORKS & PARALLELISM

Parallelism in AI

Subramaniyam V. Pooni

LinkedIn Article · Distributed Training

Data Parallel Model Parallel

Model Sharding + Layer Parallelism = Model Parallelism

Subramaniyam V. Pooni

LinkedIn Article · Large Model Training

Sharding Pipeline

Neural Layer Parallelism (Deep Dive)

Subramaniyam V. Pooni

LinkedIn Article · Layer-wise Training

Layer Parallel DNN

Awesome World of Federated Learning in Terms of Global and Local Models/Sites

Subramaniyam V. Pooni

LinkedIn Article · Privacy-Preserving ML

Federated Privacy

Personalized Models - Combining Transfer Learning with Federated Learning

Subramaniyam V. Pooni

LinkedIn Article · Personalization

Transfer Federated

Deep Models, Shallow Models and Overparameterization

Subramaniyam V. Pooni

LinkedIn Article · Model Theory

Overparameterization

Over-Parameterization Does Not Lead to Poor Generalization

Subramaniyam V. Pooni

LinkedIn Article · Generalization Theory

Generalization Theory

📶 WIRELESS COMMUNICATION & DEEP LEARNING

DNDR: End-to-End Learning with Different Functionality Discovered by Gradient Descent

Subramaniyam V. Pooni

LinkedIn Article · Neural Communication

DNDR E2E Learning

DNDR: A Comprehensive Exploration of Perspectives in End-to-End Communication Learning

Subramaniyam V. Pooni

LinkedIn Article · Communication Theory

DNDR Autoencoder

DeepSig Autoencoders and Meta-Learning Systems like DNDR: A Deep Dive

Subramaniyam V. Pooni

LinkedIn Article · Signal Processing

DeepSig Meta-Learning

In Search of Equivalent of CNNs for Wireless Communication

Subramaniyam V. Pooni

LinkedIn Article · Neural Wireless

CNN Wireless

🔬 NEURAL NETWORK THEORY & TOOLS

Understanding Distillation in AI: How Models Can Be Extracted

Subramaniyam V. Pooni

LinkedIn Article · Knowledge Distillation

Distillation Model Compression

Read on LinkedIn →

Mysterious Latent Space - Math of the 21st Century

Subramaniyam V. Pooni

LinkedIn Article · Representation Learning

Latent Space Embeddings

Model Order Selection

Subramaniyam V. Pooni

LinkedIn Article · Model Selection

AIC BIC

Neural Studio

Subramaniyam V. Pooni

LinkedIn Article · Development Tools

IDE Neural

Workflow for Neural Layer Splitting

Subramaniyam V. Pooni

LinkedIn Article · Model Optimization

Layer Split Workflow

🔌 AI HARDWARE & INFRASTRUCTURE

Ray Tracing in Rust: Weekend Project with David Beazley

Subramaniyam V. Pooni

LinkedIn Post · Open Source Project · Rust vs C++ Performance

Rust Ray Tracing GPU/CUDA

Converted "Ray Tracing in One Weekend" from C++ to Rust, achieving faster performance than the C++ original. Explores multi-core parallelism, GPU acceleration with CUDA, and Rust optimization techniques.

Read on LinkedIn →

AI Control Center

Subramaniyam V. Pooni

LinkedIn Article · AI Operations

AI Ops Control Plane

Read on LinkedIn →

The Full Scope of FPGA, ASIC, and Hybrid Solutions in AI

Subramaniyam V. Pooni

LinkedIn Article · Hardware Accelerators

FPGA ASIC Hybrid

AI MicroClouds: A Deep Dive

Subramaniyam V. Pooni

LinkedIn Article · Edge Infrastructure

MicroCloud Edge AI

Emerging Trends in AI and Data Center Design: Examples

Subramaniyam V. Pooni

LinkedIn Article · Data Center Architecture

Data Center AI Infra

FaaS Platform Design

Subramaniyam V. Pooni

LinkedIn Article · Serverless Architecture

FaaS Serverless

🔒 TRADE SECRETS & PROPRIETARY RESEARCH

📡 SONiCS Platform (Futurewei)

12+ trade secrets filed for Self-Organizing Networks with intelligent Controllers. Includes FaaS architecture, MLaaS integration, and federated learning for wireless networks.

12+ Trade Secrets · 2016-2019

🧠 GAN-Based Wireless Receiver

Award-winning research on Generative Adversarial Networks for software-defined wireless receivers. Collaboration with Berkeley AI Research (BAIR) and Georgia Tech.

🏆 Top Innovation Award 2019

🚀 UPCOMING & IN-PROGRESS

▹ Agent Protocol Standardization (AP2, SLIM, X-A2A) — Defining interoperability standards for multi-agent systems
▹ Hardware-Aware LLM Compilation — Optimizing inference for heterogeneous GPU clusters (NVIDIA + AMD)
▹ Agentic RAG with Long-Context Memory — Scaling agent memory for enterprise knowledge bases
▹ Rust-Based Interpreter Design Patterns — Documenting learnings from Crusty Lox implementation
