GenAI Physics, Capacity Planning, and Throughput Estimation
New to GenAI Infra? Start Here
A 5-minute guided workflow, golden production defaults, and the 'What Matters' cheat sheet.
The Engineering Manual
Mental models, Workload Archetypes, the Batch vs. Latency tradeoff, and advanced hardware physics.
Pitfalls & Debugging
Failure mode debugging mapping, common misconceptions, and real-world benchmark data.
Throughput & Metrics
Custom batch/context modeling, ITL scaling, and Roofline charts.
Hardware Router
Find GPUs (PCIe & SXM) that physically fit your exact memory needs.
TCO & Cloud Cost
Monthly OpEx modeling across major cloud providers.
GPU Spec Sheets
Verified Dense TFLOPS, bandwidth, and architecture info (A100 to B300 Ultra).
The 5-Minute Quick Start
You only need to care about two user metrics:
1. TTFT (Time-To-First-Token): First token delay. Crucial for RAG/Agents.
2. ITL (Inter-Token Latency): Stream speed. Crucial for Chat APIs.
Everything else (MFU, MBU) is internal plumbing for the engineers.
Step-by-Step Execution Flow
Don't guess your hardware. Follow this deterministic workflow:
Identify workload: Is it Chat, RAG, or Batch? (Sets your latency goals).
Choose model + context: (Sets your theoretical VRAM footprint).
Check memory fit: Use the Hardware Router. Can it fit on a Single GPU?
Estimate throughput: Use the Throughput & Metrics tool.
Validate latency: Check TTFT / ITL under load. If it fails, tweak batch size.
Choose hardware: Need TP? Ensure NVLink. Otherwise, PCIe is fine.
Estimate cost: Use the TCO tool, factoring in the 20% autoscaling buffer.
Golden Defaults (Production Baselines)
If you don't know where to start, use these safe defaults in the calculators:
| Use case | Batch | Context | KV Precision | Goal |
| --- | --- | --- | --- | --- |
| Chat API | 8-16 | 4K | FP8 | Keep ITL low (human reading speed) |
| RAG / Agent | 16-32 | 32K | FP8 | Optimize TTFT (fast response) |
| Batch | 64-256 | 8K | INT4 / FP8 | Max throughput (ignore latency) |
The "What Matters for WHAT" Cheat Sheet
| Metric | Controlled By | Optimization Focus |
| --- | --- | --- |
| TTFT (First Token) | Prefill + Scheduler Queue | TFLOPS, Prefix Caching, Batch Sizing |
| ITL (Stream Speed) | Decode Loop | Memory Bandwidth |
| Throughput (Sys tok/s) | Batch Size | Total VRAM (to fit larger batches) |
| Cost ($ / 1M tok) | GPU Utilization | Autoscaling & Warm Pools |
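The "Cost ($ / 1M tok)" row can be made concrete with a small calculation. The sketch below is an illustration only: the function name, its parameters, and the way the 20% warm-pool overhead is applied are assumptions, not the formula used by the TCO tool.

```python
def cost_per_million_tokens(gpu_hourly_usd: float,
                            num_gpus: int,
                            system_tokens_per_sec: float,
                            utilization: float = 1.0,
                            warm_pool_overhead: float = 0.20) -> float:
    """Rough $ per 1M output tokens (hypothetical helper, not the TCO tool).

    utilization: fraction of each hour the GPUs serve real traffic (0..1].
    warm_pool_overhead: extra idle capacity paid for, e.g. 0.20 for 20%.
    """
    if system_tokens_per_sec <= 0:
        raise ValueError("system_tokens_per_sec must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    hourly_cost = gpu_hourly_usd * num_gpus * (1 + warm_pool_overhead)
    tokens_per_hour = system_tokens_per_sec * utilization * 3600
    return hourly_cost / tokens_per_hour * 1_000_000
```

Halving utilization doubles the cost per token in this model, which is why the cheat sheet lists GPU utilization as the controlling factor.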
The Infrastructure Engineering Manual
Model vs. Reality Disclaimer: These calculations provide first-order physics estimates. Real-world performance will vary based on kernel efficiency, scheduler queueing dynamics, and workload distribution.
What this manual assumes: Standard Transformer architectures (Llama/Mistral/Gemma), highly optimized serving kernels (vLLM/TRT-LLM), and no speculative decoding.
Layer 1: Core Mental Models
The Multi-GPU Decision Tree
1. If model fits in 1 GPU → ALWAYS use 1 GPU
2. If not → Quantize first (FP8 / INT4)
3. If still not → Use Tensor Parallelism (Only with NVLink / SXM)
NEVER use Tensor Parallelism over PCIe unless absolutely forced.
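The decision tree can be written as a short function. This is a minimal sketch under stated assumptions: the function name, its parameters, and the idea of reserving a fixed KV cache budget alongside the weights are illustrative choices, not part of the Hardware Router.

```python
def choose_deployment(weights_fp16_gb: float,
                      kv_budget_gb: float,
                      gpu_vram_gb: float,
                      has_nvlink: bool) -> str:
    """Walk the three-step tree: single GPU, then quantize, then TP.

    Assumes FP8 halves and INT4 quarters the FP16 weight footprint, and
    that kv_budget_gb is memory you want to keep free for the KV cache.
    """
    for precision, scale in (("FP16", 1.0), ("FP8", 0.5), ("INT4", 0.25)):
        if weights_fp16_gb * scale + kv_budget_gb <= gpu_vram_gb:
            return f"1 GPU, {precision} weights"
    if has_nvlink:
        return "Tensor Parallelism over NVLink"
    return "Tensor Parallelism over PCIe (last resort)"
```

For example, a 140 GB FP16 model with a 20 GB KV budget on a 141 GB card fails the FP16 check but passes at FP8, so the function stops at step 2 and never reaches Tensor Parallelism.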
Rule 1: Bandwidth > TFLOPS for Decode
Generating text is fundamentally memory-bound. A GPU with fewer TFLOPS but a wider data highway (HBM Bandwidth) will generate text faster.
Rule 2: KV Cache > Model Weights
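KV Cache is the memory used to remember the conversation. It grows linearly with (batch × context). At 128K context, a handful of concurrent requests can exceed the model weights themselves: four 128K requests on Llama-70B at FP16 need ~172 GB of KV cache versus 140 GB of weights.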
Visualizing the Bottleneck: Prefill vs. Decode
LLM generation happens in two distinct phases with entirely different physics:
Prefill Phase (Processing Prompt)
Compute-Bound (Limited by TFLOPS)
Decode Phase (Generating Text)
Memory-Bound (Limited by Bandwidth)
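A first-order way to put numbers on the two phases is sketched below. It assumes the common approximation of 2 FLOPs per parameter per token for prefill, and that each decode step reads all weights once; attention FLOPs and KV cache reads are ignored. The function names and the `mfu` / `mbu` efficiency knobs are hypothetical, not the formulas of the Throughput & Metrics tool.

```python
def prefill_seconds(params_billion: float, prompt_tokens: int,
                    dense_tflops: float, mfu: float = 1.0) -> float:
    """Compute-bound estimate: total FLOPs / achievable FLOPs per second."""
    flops = 2 * params_billion * 1e9 * prompt_tokens
    return flops / (dense_tflops * 1e12 * mfu)


def decode_step_seconds(weight_gb: float, bandwidth_tb_per_s: float,
                        mbu: float = 1.0) -> float:
    """Memory-bound estimate: bytes of weights read / achievable bandwidth."""
    return weight_gb * 1e9 / (bandwidth_tb_per_s * 1e12 * mbu)


# Example: 70 GB of FP8 weights on a 4.8 TB/s GPU gives a lower bound of
# roughly 14.6 ms per decode step, regardless of how many TFLOPS it has.
itl_floor = decode_step_seconds(70, 4.8)
```

The two functions depend on different hardware numbers, which is the point of the split: prefill time moves with TFLOPS, decode time moves with bandwidth.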
Layer 2: Intermediate Operational Realities
Expand these sections to understand dynamic memory systems, cost scaling, and deployment decision trees.
1. What is a FLOP & Datacenter Architectures
TFLOPS (Tera Floating-point Operations Per Second) measures the raw mathematical muscle of the GPU. It represents how many trillions of calculations the processor can do in a single second.
CUDA Cores vs. Tensor Cores
CUDA Cores: General-purpose scalar processors. They compute one mathematical operation at a time (e.g., A × B). Excellent for game rendering, but far too slow for deep learning.
Tensor Cores: Specialized matrix-math processors. A single Tensor Core performs a whole small-matrix Multiply-Accumulate (MMA) per clock cycle instead of one scalar operation. Since neural networks are fundamentally just giant matrices being multiplied together, Tensor Cores accelerate inference by orders of magnitude.
Datacenter Architectures
Ampere (A100): The workhorse of the previous generation. Limitation: Lacks native FP8 Tensor Cores. It processes math in FP16, missing out on massive throughput boosts.
Hopper (H100, H200): Introduced native FP8 Tensor Cores, doubling math throughput with negligible accuracy loss. H200 vs. H100: The H200 has the exact same compute chip as the H100, but upgrades the memory to 141GB HBM3e at 4.8 TB/s, drastically improving text generation speed.
Blackwell (B200, B300): Introduces a multi-die architecture and native FP4 Tensor Cores. Designed for multi-trillion parameter MoE models. The B300 Ultra pushes memory to 288GB, allowing massive 70B+ models to run on a single card without Tensor Parallelism.
2. Calculating the KV Cache & Paged Memory
The KV Cache is the temporary memory used to "remember" the conversation context. At large contexts (e.g., 128K), it can easily consume more memory than the model itself.
Example (Llama-70B at 128K Context, FP16):
2 (K and V) × 80 layers × 8 KV heads × 128 head dim × 2 bytes × 131,072 tokens = ~42.9 GB per request!
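The same arithmetic as a reusable function, so other models and precisions can be plugged in. This is a sketch; the function and parameter names are illustrative, and the model dimensions must come from the model's own config.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   bytes_per_elem: int, tokens: int, batch: int = 1) -> int:
    """KV cache size: 2 tensors (K and V) per layer, per KV head, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens * batch


# Llama-70B dimensions from the example above, FP16 KV cache, 128K context.
per_request = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                             bytes_per_elem=2, tokens=131_072)
print(f"{per_request / 1e9:.1f} GB per request")
```

Setting `bytes_per_elem=1` models an FP8 KV cache and halves the result, which is why the golden defaults use FP8 KV precision.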
KV Cache Reality: It's a Dynamic Memory System
Tokens ≠ Words: A token is typically ~0.75 words (or ~4 characters). A 10,000-word document is ~13,300 tokens. This variance dictates your exact KV cache sizing and TTFT.
Fragmentation & PagedAttention: Older systems allocated a contiguous block of memory for the max context length. Modern engines (vLLM) use PagedAttention, breaking the cache into small pages, eliminating fragmentation and allowing 2-4x more users per batch.
Prefix Caching: If multiple users send the same system prompt (e.g., an Agent instruction), engines will compute the KV cache once and store it in a radix tree, dropping TTFT to near-zero for future hits.
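The fragmentation point can be illustrated by comparing how many token slots each allocation strategy reserves. This is a simplified sketch, not vLLM's allocator: the function names are hypothetical and the 16-token page size is an assumption chosen for illustration.

```python
import math


def contiguous_reserved_tokens(request_lengths: list[int],
                               max_context: int) -> int:
    """Older scheme: every request reserves the full max context up front."""
    return len(request_lengths) * max_context


def paged_reserved_tokens(request_lengths: list[int],
                          page_size: int = 16) -> int:
    """Paged scheme: each request holds only the pages it has filled,
    so waste is at most one partially filled page per request."""
    return sum(math.ceil(n / page_size) * page_size for n in request_lengths)


lengths = [100, 1000, 5000]          # tokens actually used per request
contiguous = contiguous_reserved_tokens(lengths, max_context=8192)
paged = paged_reserved_tokens(lengths)
```

With these three requests, the contiguous scheme reserves 24,576 token slots while the paged scheme reserves 6,128; the freed memory is what lets more users share a batch.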
3. Quantization Techniques & Tradeoffs
Quantization compresses model weights from 16-bit (FP16) to 8-bit (FP8) or 4-bit (INT4). Dropping to FP8 doesn't just save memory; it doubles your memory bandwidth efficiency.
| Model Size | FP16 (2 bytes/param) | FP8 (1 byte/param) | INT4 (0.5 bytes/param) |
| --- | --- | --- | --- |
| 7-9B | 14-18 GB | 7-9 GB | 3.5-4.5 GB |
| 27-34B | 54-68 GB | 27-34 GB | 13.5-17 GB |
| 70-72B | 140-144 GB | 70-72 GB | 35-36 GB |
| 405B | 810 GB | 405 GB | 202 GB |
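Every cell in the table above is parameters × bytes per parameter. A minimal sketch follows; the names are illustrative, and real quantized checkpoints carry a little extra for scales and zero-points, which this ignores.

```python
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * BYTES_PER_PARAM[precision]


for size in (8, 70, 405):
    row = ", ".join(f"{p}: {weight_gb(size, p):g} GB" for p in BYTES_PER_PARAM)
    print(f"{size}B -> {row}")
```

This covers weights only; the KV cache and activation memory must be added on top before checking whether a model fits on a card.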
Available Techniques
Native FP8: Hopper (H100) and Blackwell (B200) have native FP8 Tensor Cores. Processing is direct with zero overhead.
AWQ (Activation-aware Weight Quantization): Fastest on vLLM. Warning: Requires kernel fusion and may introduce dequantization overhead depending on the serving engine.
GPTQ: Mature GPU option, with an extensive pre-quantized model library available on HuggingFace.
GGUF: Best for CPU/Ollama. Do not use for high-throughput datacenter GPU serving.
4. Inference Frameworks & Continuous Batching
Running LLMs via standard HuggingFace/Python is painfully slow. Production requires dedicated serving engines with custom CUDA kernels and continuous batching.
vLLM: Invented PagedAttention. Problem Solved: Memory fragmentation. Use Case: General LLMs, rapid prototyping, and best-in-class support for Vision Language Models (VLMs).
TensorRT-LLM (TRT-LLM): NVIDIA's own engine, built for NVIDIA hardware only. Problem Solved: Hardware underutilization. Use Case: Locked-in, massive scale production squeezing every last drop of MFU out of Hopper GPUs.
SGLang: Invented RadixAttention. Problem Solved: Redundant prompt processing. Use Case: Agentic workflows where a massive system prompt is sent repeatedly.
Layer 3: Advanced Engineering Notes
Deep-dives into mathematical physics, kernel mechanics, and infrastructure constraints. Expand only if building custom inference stacks.
[Advanced Note] KV Cache Eviction Algorithms
What happens when KV is full?
When VRAM runs out, the engine must pause requests or evict cached tokens. Relying on continuous eviction destroys your TTFT.
• LRU (Least Recently Used): Standard approach. Evicts the cached pages that were accessed least recently first.
• StreamingLLM / Heavy Hitter Oracles (H2O): StreamingLLM retains the initial "attention sink" tokens plus recent tokens and discards the middle; H2O instead keeps the tokens that have accumulated the most attention ("heavy hitters") plus recent ones. Both bound KV memory, enabling very long generation without OOM.
• Segment-based / Radix Eviction: Evicts based on prefix-tree usage frequencies to maximize cache hit rates for Agentic prompts.
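The LRU policy from the list above can be sketched with an ordered dictionary. This is a toy model of the idea, not any engine's actual eviction code; the class name and the notion of opaque page IDs are assumptions for illustration.

```python
from collections import OrderedDict
from typing import Hashable, Optional


class LRUPageCache:
    """Tracks resident KV pages and evicts the least recently used one."""

    def __init__(self, capacity_pages: int) -> None:
        if capacity_pages < 1:
            raise ValueError("capacity_pages must be >= 1")
        self.capacity = capacity_pages
        self._pages: "OrderedDict[Hashable, bool]" = OrderedDict()

    def touch(self, page_id: Hashable) -> Optional[Hashable]:
        """Record an access. Returns the evicted page ID, or None."""
        if page_id in self._pages:
            self._pages.move_to_end(page_id)   # now the most recently used
            return None
        evicted = None
        if len(self._pages) >= self.capacity:
            evicted, _ = self._pages.popitem(last=False)  # oldest access
        self._pages[page_id] = True
        return evicted

    def resident(self) -> list:
        """Page IDs ordered from least to most recently used."""
        return list(self._pages)
```

Every evicted page that is needed again must be recomputed through prefill, which is the mechanism behind the TTFT damage described above.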
[Advanced Note] Tensor Parallelism Math & Networking
Physics-Based TP Penalty Formula:
During Tensor Parallelism, GPUs must synchronize via an AllReduce operation. For N GPUs, the bandwidth term of a ring AllReduce is:
Comm_Time = 2 * ((N-1)/N) * (Activation_Size / Interconnect_Bandwidth)
If Interconnect_Bandwidth is low (e.g., PCIe Gen4 x16 at 64 GB/s bidirectional, roughly 32 GB/s per direction), Comm_Time dominates the loop. Real TP efficiency is often even lower than the pure bandwidth formula due to synchronization barriers and pipeline stalls.
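The penalty formula translates directly into code. A minimal sketch, with hypothetical function and parameter names; it models only the bandwidth term and none of the barrier or stall effects mentioned above.

```python
def allreduce_seconds(num_gpus: int, activation_bytes: float,
                      interconnect_gb_per_s: float) -> float:
    """Comm_Time = 2 * ((N-1)/N) * (Activation_Size / Interconnect_Bandwidth)."""
    if num_gpus < 2:
        return 0.0                      # single GPU: nothing to synchronize
    bandwidth_bytes_per_s = interconnect_gb_per_s * 1e9
    return 2 * ((num_gpus - 1) / num_gpus) * activation_bytes / bandwidth_bytes_per_s


# Same payload, two interconnects. 64 GB/s is the PCIe Gen4 figure used in
# the text; 900 GB/s is an assumed NVLink-class figure for comparison.
payload = 1e9  # 1 GB of activations, an illustrative size
pcie = allreduce_seconds(4, payload, 64)
nvlink = allreduce_seconds(4, payload, 900)
```

Because the payload and GPU count cancel out, the ratio `pcie / nvlink` equals the bandwidth ratio, about 14× here, and that cost is paid on every layer of every decode step.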
Why 128K Context is Physically Possible:
Standard attention mechanisms scale quadratically, $O(N^2)$ in sequence length $N$, in both compute and memory reads/writes. Materializing the attention scores for a 128K context means a 131,072 × 131,072 matrix, about 34 GB in FP16 for a single attention head, so a naive implementation would move hundreds of gigabytes through memory.
FlashAttention solves this via "Tiling": it computes the attention scores in small blocks that fit entirely within the GPU's ultra-fast SRAM, preventing the Tensor Cores from constantly waiting on the slow HBM VRAM.
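Tiling works because softmax can be computed incrementally with a running maximum and running sums (often called online softmax), so the full score row never has to exist at once. The sketch below shows that numerical idea for one query row with scalar values, in pure Python. It is not FlashAttention itself, which is a fused GPU kernel; the function names and the block size are illustrative.

```python
import math


def attention_row_full(scores: list[float], values: list[float]) -> float:
    """Reference: softmax over the whole row, then the weighted sum."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def attention_row_tiled(scores: list[float], values: list[float],
                        block: int = 4) -> float:
    """Same result, visiting the (non-empty) row one block at a time."""
    running_max = -math.inf
    denom = 0.0   # running sum of exp(score - running_max)
    acc = 0.0     # running sum of exp(score - running_max) * value
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        # Rescale what was accumulated under the old maximum.
        scale = math.exp(running_max - new_max)   # 0.0 on the first block
        denom *= scale
        acc *= scale
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc += w * v
        running_max = new_max
    return acc / denom
```

Only one block of scores is live at any time, so working memory stays constant as the context grows; on a GPU that block is what sits in SRAM.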
The Autoscaling Trap:
Autoscaling an LLM to zero is an infrastructure trap. Cold starts aren't just about Docker pulls. Loading 140GB of weights from disk into HBM is heavily bottlenecked by storage I/O. A network-attached EBS volume capped at 1 GB/s will take about 2.3 minutes (140 seconds) just to move the weights into memory, long after the container boots. Always use local NVMe SSDs for model caching (capable of 5-7 GB/s, which cuts the same load to 20-28 seconds).
Solutions:
• Predictive Autoscaling: Use KEDA to spin up nodes 15 minutes before anticipated traffic.
• Warm Pools: Maintain a baseline of idle instances (factor this 20% overhead into TCO calculations).
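The cold-start arithmetic behind the autoscaling trap is a single division, sketched here so other weight sizes and storage tiers can be compared. The function name is hypothetical, and the model ignores container boot, deserialization, and host-to-device copy time.

```python
def weight_load_seconds(weights_gb: float, storage_gb_per_s: float) -> float:
    """Lower bound on time to stream the weights from storage."""
    if storage_gb_per_s <= 0:
        raise ValueError("storage_gb_per_s must be positive")
    return weights_gb / storage_gb_per_s


network_volume = weight_load_seconds(140, 1.0)   # 1 GB/s network volume
local_nvme = weight_load_seconds(140, 7.0)       # 7 GB/s local NVMe
```

For the 140 GB example this gives 140 seconds on the network volume against 20 seconds on fast local NVMe, before any other startup cost is counted.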
The Production Debugger
If you see these failure modes in production, take these exact actions:
OOM (Out of Memory) Crash
↳ Reduce Batch Size OR Context Length limits.
↳ Switch KV Cache precision to FP8 (halves memory footprint).
↳ Verify jemalloc is injected to prevent glibc fragmentation.
High TTFT (>2 seconds)
↳ Reduce Batch Size (Decode contention is starving the prefill queue).
↳ Enable Prefix Caching (RadixAttention via SGLang) for Agentic prompts.
Low GPU Utilization (<40%)
↳ Increase Batch Size.
↳ Check if the engine is using Continuous (In-Flight) batching rather than Static batching.
Good Throughput but Bad Latency
↳ Scheduler issue. You are optimizing for batching at the expense of fairness. Tune engine max_batch constraints down.
The 4 Critical Pitfalls of LLM Capacity Planning
Pitfall 1: "More TFLOPS = faster LLM inference"
A GPU with 100 TFLOPS and 4 TB/s bandwidth will outperform one with 200 TFLOPS and 2 TB/s for most inference workloads. During decode, the GPU computes one token at a time, reading the entire model from memory each step. The H200 outperforms H100 by ~1.9× on Llama-70B despite identical compute. The gain comes from the memory system alone: higher bandwidth (4.8 vs. 3.35 TB/s) plus the larger capacity, which leaves room for bigger batches.
Pitfall 2: "The datasheet TFLOPS figure is what you get"
NVIDIA's Hopper datasheets quote "with sparsity" as the headline number, which is 2× the dense figure. Most models do not use 2:4 structured sparsity. Always use the dense number for LLM serving.
Pitfall 3: "Batch size doesn't matter for throughput planning"
The difference between batch=1 and batch=64 can be 64× in total throughput in the memory-bound regime, because weight-loading cost is amortized across all batch items. Any calculator that estimates throughput without asking about batch size and serving scenario will produce misleading results.
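The amortization argument can be checked with a two-term roofline model: each decode step takes the longer of "read all weights once" and "do the math for every sequence in the batch". This is a simplified sketch with hypothetical names; it ignores KV cache reads, which grow with batch size and pull real scaling below the ideal.

```python
def decode_throughput(batch: int, weight_gb: float, params_billion: float,
                      bandwidth_tb_per_s: float, dense_tflops: float) -> float:
    """System tokens/sec = batch / max(memory time, compute time) per step."""
    mem_time = weight_gb * 1e9 / (bandwidth_tb_per_s * 1e12)
    # Assumes ~2 FLOPs per parameter per generated token.
    compute_time = batch * 2 * params_billion * 1e9 / (dense_tflops * 1e12)
    return batch / max(mem_time, compute_time)


# 70B FP8 model on a 4.8 TB/s GPU; 1000 dense TFLOPS is an assumed figure.
t1 = decode_throughput(1, 70, 70, 4.8, 1000)
t64 = decode_throughput(64, 70, 70, 4.8, 1000)
t256 = decode_throughput(256, 70, 70, 4.8, 1000)
```

Under these assumptions batch=64 is still memory-bound, so `t64 / t1` is 64, while batch=256 has crossed into the compute-bound regime and gains far less than 256×.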
Pitfall 4: "Buy the GPU with the most memory"
Memory capacity is necessary but not sufficient. The RTX PRO 6000 has 96 GB GDDR7, enough for a 70B FP8 model with room to spare, but its 1.597 TB/s bandwidth (3× lower than H200) leaves decode severely bandwidth-limited for LLM serving. You'll get the model to fit, but throughput will disappoint. For serving, prioritize bandwidth over capacity beyond what your model needs.
Real-World Benchmarks (Production Data)
Benchmarks from highly optimized TensorRT-LLM and vLLM production environments. Benchmark Config: 1024 Input / 128 Output Tokens. Batch Size = 128 (where memory permits). Metrics measured in Output Tokens Per Second (System Decode Throughput).
| Model | Hardware Config | Precision | Tokens / Sec | Observation |
| --- | --- | --- | --- | --- |
| Llama 3 8B | 1x A100 80GB (PCIe) | BF16 | ~110 | Memory bandwidth bottlenecked (1.93 TB/s) |
| Llama 3 8B | 1x H100 (SXM) | FP8 | ~460 | 4x speedup due to FP8 bandwidth efficiency |
| Gemma 4 27B | 1x H200 (SXM) | FP8 | ~210 | Native FP8 support drastically increases throughput. |
| Llama 3 70B | 4x A100 80GB (PCIe) | BF16 | ~65 | Severe PCIe Tensor Parallelism network penalty |
| Llama 3 70B | 1x H200 (SXM) | FP8 | ~180 | Zero TP penalty. Huge bandwidth (4.8 TB/s) |
| Qwen2.5-VL 72B | 2x H100 (SXM) | FP8 | ~165 | Vision encoder memory limits batch size, but NVLink maintains fast decode. |
| Llama 3.1 405B | 8x B200 (HGX) | FP4 | ~380 | Over 3x speedup via FP4 native routing and 8 TB/s HBM |
Throughput, Math & Observability
Map hardware capabilities to mathematical formulas, and formulas to user-facing latency metrics. Visualizes constraints via dynamic Roofline plotting.
Hardware Router
Find the cheapest GPU configurations that will physically fit your exact memory requirements.
Cloud TCO & Operations Modeling
Datacenter GPU Specifications
Official specs verified for modern datacenter hardware. Crucial: All TFLOPS numbers here are Dense (what you actually get for LLM inference), not the artificially inflated Sparse numbers.
PCIe vs SXM: PCIe cards operate at lower power (TDP), resulting in lower clock speeds and TFLOPS compared to their SXM equivalents. More importantly, PCIe cards lack the NVLink/NVSwitch fabric of SXM systems (at best an optional NVLink bridge between a pair of cards), making them severe bottlenecks for Multi-GPU Tensor Parallelism.