🎯 Infrastructure Engineering Manual

GenAI Physics, Capacity Planning, and Throughput Estimation

🚀 New to GenAI Infra? Start Here
A 5-minute guided workflow, golden production defaults, and the 'What Matters' cheat sheet.
🧠 The Engineering Manual
Mental models, Workload Archetypes, the Batch vs. Latency tradeoff, and advanced hardware physics.
⚠️ Pitfalls & Debugging
Failure mode debugging mapping, common misconceptions, and real-world benchmark data.
⚡ Throughput & Metrics
Custom batch/context modeling, ITL scaling, and Roofline charts.
🧭 Hardware Router
Find GPUs (PCIe & SXM) that physically fit your exact memory needs.
💰 TCO & Cloud Cost
Monthly OpEx modeling across major cloud providers.
📋 GPU Spec Sheets
Verified Dense TFLOPS, bandwidth, and architecture info (A100 to B300 Ultra).

🚀 The 5-Minute Quick Start

You only need to care about two user metrics:
1. TTFT (Time-To-First-Token): First token delay. Crucial for RAG/Agents.
2. ITL (Inter-Token Latency): Stream speed. Crucial for Chat APIs.
Everything else, such as MFU (Model FLOPs Utilization) and MBU (Memory Bandwidth Utilization), is internal plumbing for the engineers.

Step-by-Step Execution Flow

Don't guess your hardware. Follow this deterministic workflow:

  1. Identify workload: Is it Chat, RAG, or Batch? (Sets your latency goals).
  2. Choose model + context: (Sets your theoretical VRAM footprint).
  3. Check memory fit: Use the Hardware Router. Can it fit on a Single GPU?
  4. Estimate throughput: Use the Throughput & Metrics tool.
  5. Validate latency: Check TTFT / ITL under load. If it fails, tweak batch size.
  6. Choose hardware: Need TP? Ensure NVLink. Otherwise, PCIe is fine.
  7. Estimate cost: Use the TCO tool, factoring in the 20% autoscaling buffer.

🏆 Golden Defaults (Production Baselines)

If you don't know where to start, use these safe defaults in the calculators:

| Use case | Batch | Context | KV Precision | Goal |
|---|---|---|---|---|
| 💬 Chat API | 8–16 | 4K | FP8 | Keep ITL low (Human reading speed) |
| 🤖 RAG / Agent | 16–32 | 32K | FP8 | Optimize TTFT (Fast response) |
| 📦 Batch | 64–256 | 8K | INT4 / FP8 | Max throughput (Ignore latency) |

The "What Matters for WHAT" Cheat Sheet

| Metric | Controlled By | Optimization Focus |
|---|---|---|
| TTFT (First Token) | Prefill + Scheduler Queue | TFLOPS, Prefix Caching, Batch Sizing |
| ITL (Stream Speed) | Decode Loop | Memory Bandwidth |
| Throughput (Sys tok/s) | Batch Size | Total VRAM (to fit larger batches) |
| Cost ($ / 1M tok) | GPU Utilization | Autoscaling & Warm Pools |
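
To make the cost row concrete, here is a minimal sketch of how GPU utilization feeds into cost per million tokens. The function name, the $4.00/hour price, and the 2,000 tok/s figure are hypothetical placeholders, not measured values; the 20% buffer mirrors the autoscaling buffer mentioned in the workflow.

```python
def cost_per_million_tokens(gpu_hourly_usd, tokens_per_sec, utilization=1.0, buffer=0.20):
    """Rough USD cost per 1M output tokens for one GPU (or one replica).

    utilization: fraction of each hour spent serving traffic, in (0, 1].
    buffer: extra warm capacity kept for autoscaling (0.20 = 20%).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    effective_tokens_per_hour = tokens_per_sec * 3600 * utilization
    hourly_cost_with_buffer = gpu_hourly_usd * (1 + buffer)
    return hourly_cost_with_buffer / effective_tokens_per_hour * 1_000_000


# Hypothetical inputs: a $4.00/hour GPU sustaining 2,000 tok/s.
full = cost_per_million_tokens(4.00, 2000, utilization=1.0)
half = cost_per_million_tokens(4.00, 2000, utilization=0.5)
print(f"100% utilized: ${full:.3f} per 1M tokens")
print(f" 50% utilized: ${half:.3f} per 1M tokens")
```

Halving utilization doubles the cost per token, which is why the table points at autoscaling and warm pools rather than raw hardware speed.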

🧠 The Infrastructure Engineering Manual

⚠️ Model vs. Reality Disclaimer: These calculations provide first-order physics estimates. Real-world performance will vary based on kernel efficiency, scheduler queueing dynamics, and workload distribution.
What this manual assumes: Standard Transformer architectures (Llama/Mistral/Gemma), highly optimized serving kernels (vLLM/TRT-LLM), and no speculative decoding.

Layer 1: Core Mental Models

The Multi-GPU Decision Tree
1. If model fits in 1 GPU → ALWAYS use 1 GPU
2. If not → Quantize first (FP8 / INT4)
3. If still not → Use Tensor Parallelism (Only with NVLink / SXM)
NEVER use Tensor Parallelism over PCIe unless absolutely forced.
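
The decision tree above can be sketched as a small function. This is an illustration under stated assumptions: `choose_deployment`, the 90% usable-VRAM headroom, and the fixed KV budget are all hypothetical, and weight size is approximated as parameters times bytes per parameter.

```python
def choose_deployment(params_billion, gpu_vram_gb, kv_budget_gb, has_nvlink, headroom=0.90):
    """Walk the decision tree: 1 GPU first, then quantize, then Tensor Parallelism.

    headroom (assumption): fraction of VRAM usable after runtime overhead.
    kv_budget_gb (assumption): memory reserved for KV cache and activations.
    """
    usable_gb = gpu_vram_gb * headroom
    # Try the least aggressive precision first, then quantize further.
    for precision, bytes_per_param in (("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)):
        weights_gb = params_billion * bytes_per_param
        if weights_gb + kv_budget_gb <= usable_gb:
            return f"1 GPU at {precision}"
    if has_nvlink:
        return "Tensor Parallelism over NVLink"
    return "Tensor Parallelism over PCIe (last resort)"


print(choose_deployment(8, 80, 10, has_nvlink=True))
print(choose_deployment(70, 80, 10, has_nvlink=True))
print(choose_deployment(405, 80, 10, has_nvlink=False))
```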

Rule 1: Bandwidth > TFLOPS for Decode

Generating text is fundamentally memory-bound. A GPU with fewer TFLOPS but a wider data highway (HBM Bandwidth) will generate text faster.

Rule 2: KV Cache > Model Weights

KV Cache is the memory used to remember the conversation. It grows linearly with batch × context. At 128K context, it can easily exceed the size of the model weights.

Visualizing the Bottleneck: Prefill vs. Decode

LLM generation happens in two distinct phases with entirely different physics:

Prefill Phase
(Processing Prompt)
Compute-Bound (Limited by TFLOPS)
Decode Phase
(Generating Text)
Memory-Bound (Limited by Bandwidth)
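
A first-order sketch of the two phases, assuming the common approximations of about 2 FLOPs per parameter per token for prefill and one full read of the weights per decode step at batch size 1. The 1,000 dense TFLOPS figure is a hypothetical GPU; the 4.8 TB/s bandwidth matches the H200 figure quoted in this manual. Real numbers will be worse because of kernel inefficiency, KV cache reads, and queueing.

```python
def prefill_seconds(params, prompt_tokens, dense_flops):
    """Compute-bound estimate: ~2 FLOPs per parameter per prompt token."""
    return 2 * params * prompt_tokens / dense_flops


def decode_step_seconds(weight_bytes, bandwidth_bytes_per_s):
    """Memory-bound estimate: every decode step reads all weights once."""
    return weight_bytes / bandwidth_bytes_per_s


params = 70e9                  # 70B-parameter model
weight_bytes = params * 1.0    # FP8: 1 byte per parameter
ttft_estimate = prefill_seconds(params, prompt_tokens=4096, dense_flops=1000e12)
itl_estimate = decode_step_seconds(weight_bytes, bandwidth_bytes_per_s=4.8e12)

print(f"Prefill (4096-token prompt): {ttft_estimate * 1000:.0f} ms")
print(f"Decode step (batch 1):       {itl_estimate * 1000:.1f} ms")
```

Prefill time scales with TFLOPS and prompt length, while the decode step depends only on bytes moved per step, which is the split the two boxes above describe.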

Layer 2: Intermediate Operational Realities

Expand these sections to understand dynamic memory systems, cost scaling, and deployment decision trees.

1. What is a FLOP & Datacenter Architectures

TFLOPS (Tera Floating-point Operations Per Second) measures the raw mathematical muscle of the GPU. It represents how many trillions of calculations the processor can do in a single second.

CUDA Cores vs. Tensor Cores

CUDA Cores: General-purpose scalar processors. They compute one mathematical operation at a time (e.g., A × B). Excellent for game rendering, but far too slow for deep learning.

Tensor Cores: Specialized matrix-math processors. Each Tensor Core performs a small-tile Matrix Multiply-Accumulate (MMA) operation per clock cycle, and a datacenter GPU contains hundreds of them. Since neural networks are fundamentally just giant matrices being multiplied together, Tensor Cores accelerate inference by orders of magnitude.

Datacenter Architectures

  • Ampere (A100): The workhorse of the previous generation. Limitation: Lacks native FP8 Tensor Cores. It processes math in FP16, missing out on massive throughput boosts.
  • Hopper (H100, H200): Introduced native FP8 Tensor Cores, doubling math throughput with negligible accuracy loss. H200 vs. H100: The H200 has the exact same compute chip as the H100, but upgrades the memory to 141GB HBM3e at 4.8 TB/s, drastically improving text generation speed.
  • Blackwell (B200, B300): Introduces a multi-die architecture and native FP4 Tensor Cores. Designed for multi-trillion parameter MoE models. The B300 Ultra pushes memory to 288GB, allowing massive 70B+ models to run on a single card without Tensor Parallelism.
2. Calculating the KV Cache & Paged Memory

The KV Cache is the temporary memory used to "remember" the conversation context. At large contexts (e.g., 128K), it can easily consume more memory than the model itself.

Formula: KV Cache Size (Bytes per Request)
KV_Cache = 2 (K & V) × Layers × KV_Heads × Head_Dimension × Bytes_per_Param × Context_Length

Example (Llama-70B at 128K Context, FP16):
2 × 80 × 8 × 128 × 2 bytes × 131,072 tokens = ~42.9 GB per request!
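
The formula above translates directly into code. This is a minimal sketch; the function name and the `batch` parameter are additions for illustration, and the inputs are the Llama-70B values from the example.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, bytes_per_value, context_len, batch=1):
    """KV cache size in bytes: 2 (K and V) x layers x KV heads x head dim x bytes x tokens."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * context_len * batch


# Llama-70B at 128K context, FP16 (2 bytes per value).
per_request = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                             bytes_per_value=2, context_len=131_072)
# Same request with an FP8 KV cache (1 byte per value).
per_request_fp8 = kv_cache_bytes(80, 8, 128, 1, 131_072)

print(f"FP16 KV cache: {per_request / 1e9:.1f} GB per request")
print(f"FP8  KV cache: {per_request_fp8 / 1e9:.1f} GB per request")
```

Multiplying by `batch` shows why long-context workloads run out of VRAM long before they run out of compute.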

KV Cache Reality: It's a Dynamic Memory System

  • Tokens ≠ Words: A token is typically ~0.75 words (or ~4 characters). A 10,000-word document is ~13,300 tokens. This variance dictates your exact KV cache sizing and TTFT.
  • Fragmentation & PagedAttention: Older systems allocated a contiguous block of memory for the max context length. Modern engines (vLLM) use PagedAttention, breaking the cache into small pages, which nearly eliminates fragmentation and allows roughly 2-4x more users per batch.
  • Prefix Caching: If multiple users send the same system prompt (e.g., an Agent instruction), engines will compute the KV cache once and store it in a radix tree, dropping TTFT to near-zero for future hits.
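
The fragmentation point can be illustrated by counting reserved token slots under both schemes. This is a simplified sketch, not vLLM's allocator: the request lengths are hypothetical and the 16-token page size is an assumption.

```python
import math


def contiguous_tokens_reserved(request_lengths, max_context):
    """Older scheme: every request reserves a full max-context slab up front."""
    return len(request_lengths) * max_context


def paged_tokens_reserved(request_lengths, block_size=16):
    """Paged scheme: each request reserves only whole pages it actually needs."""
    return sum(math.ceil(n / block_size) * block_size for n in request_lengths)


lengths = [300, 1200, 50, 4000]  # hypothetical current lengths of four requests
contiguous = contiguous_tokens_reserved(lengths, max_context=8192)
paged = paged_tokens_reserved(lengths, block_size=16)

print(f"Contiguous reservation: {contiguous} token slots")
print(f"Paged reservation:      {paged} token slots")
```

With paging, waste is limited to the unused tail of each request's last page, so the freed memory can hold additional concurrent requests.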
3. Quantization Techniques & Tradeoffs

Quantization compresses model weights from 16-bit (FP16) to 8-bit (FP8) or 4-bit (INT4). Dropping to FP8 doesn't just save memory; it doubles your memory bandwidth efficiency.

| Model Size | FP16 (2 bytes/param) | FP8 (1 byte/param) | INT4 (0.5 bytes/param) |
|---|---|---|---|
| 7-9B | 14-18 GB | 7-9 GB | 3.5-4.5 GB |
| 27-34B | 54-68 GB | 27-34 GB | 13.5-17 GB |
| 70-72B | 140-144 GB | 70-72 GB | 35-36 GB |
| 405B | 810 GB | 405 GB | 202 GB |
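
The table follows from one multiplication: parameter count times bytes per parameter. A minimal sketch (the names are illustrative, and it ignores per-group scale factors and other quantization metadata, which add a small overhead in practice):

```python
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}


def weight_gb(params_billion, precision):
    """Approximate weight footprint in GB (1B params at 1 byte each = 1 GB)."""
    return params_billion * BYTES_PER_PARAM[precision]


for size in (8, 27, 70, 405):
    row = ", ".join(f"{p}: {weight_gb(size, p):g} GB" for p in BYTES_PER_PARAM)
    print(f"{size}B -> {row}")
```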

Available Techniques

  • Native FP8: Hopper (H100) and Blackwell (B200) have native FP8 Tensor Cores. Processing is direct with zero overhead.
  • AWQ (Activation-aware Weight Quantization): Fastest on vLLM. Warning: Requires kernel fusion and may introduce dequantization overhead depending on the serving engine.
  • GPTQ: Mature GPU option, with an extensive pre-quantized model library available on HuggingFace.
  • GGUF: Best for CPU/Ollama. Do not use for high-throughput datacenter GPU serving.
4. Inference Frameworks & Continuous Batching

Running LLMs via standard HuggingFace/Python is painfully slow. Production requires dedicated serving engines with custom CUDA kernels.

  • vLLM: Invented PagedAttention. Problem Solved: Memory fragmentation. Use Case: General LLMs, rapid prototyping, and best-in-class support for Vision Language Models (VLMs).
  • TensorRT-LLM (TRT-LLM): NVIDIA's vendor-specific engine. Problem Solved: Hardware underutilization. Use Case: Locked-in, massive scale production squeezing every last drop of MFU out of Hopper GPUs.
  • SGLang: Invented RadixAttention. Problem Solved: Redundant prompt processing. Use Case: Agentic workflows where a massive system prompt is sent repeatedly.

Layer 3: Advanced Engineering Notes

Deep-dives into mathematical physics, kernel mechanics, and infrastructure constraints. Expand only if building custom inference stacks.

[Advanced Note] KV Cache Eviction Algorithms
What happens when KV is full?
When VRAM runs out, the engine must pause requests or evict cached tokens. Relying on continuous eviction destroys your TTFT.
• LRU (Least Recently Used): Standard approach. Evicts the least recently accessed cached pages first.
• StreamingLLM / Heavy-Hitter Oracle (H2O): StreamingLLM retains the initial "attention sink" tokens plus a window of recent tokens and discards the middle; H2O additionally keeps high-attention "heavy hitter" tokens. Both bound the cache size, so generation can continue indefinitely without OOM, at the cost of dropping some context.
• Segment-based / Radix Eviction: Evicts based on prefix-tree usage frequencies to maximize cache hit rates for Agentic prompts.
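
As an illustration of the LRU policy listed above, here is a minimal sketch built on `collections.OrderedDict`. The `LRUPageCache` class and its page ids are hypothetical; real engines track GPU blocks and reference counts, which this toy version omits.

```python
from collections import OrderedDict


class LRUPageCache:
    """Toy KV page cache that evicts the least recently used page when full."""

    def __init__(self, capacity_pages):
        if capacity_pages < 1:
            raise ValueError("capacity_pages must be at least 1")
        self.capacity = capacity_pages
        self.pages = OrderedDict()  # ordered from least to most recently used

    def access(self, page_id):
        """Touch a page. Returns the evicted page id, or None if nothing was evicted."""
        evicted = None
        if page_id in self.pages:
            self.pages.move_to_end(page_id)  # mark as most recently used
        else:
            if len(self.pages) >= self.capacity:
                evicted, _ = self.pages.popitem(last=False)  # drop the LRU page
            self.pages[page_id] = True
        return evicted


cache = LRUPageCache(capacity_pages=2)
cache.access("system-prompt")
cache.access("user-a")
cache.access("system-prompt")          # refreshes the shared prefix
print(cache.access("user-b"))          # evicts "user-a", not the shared prefix
```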
[Advanced Note] Tensor Parallelism Math & Networking
Physics-Based TP Penalty Formula:
During Tensor Parallelism, GPUs must synchronize via an AllReduce operation.
Comm_Time = 2 * ((N-1)/N) * (Activation_Size / Interconnect_Bandwidth)

If Interconnect_Bandwidth is low (e.g., PCIe Gen4 x16 at 64 GB/s bidirectional), Comm_Time dominates the loop. Real TP efficiency is often even lower than the pure bandwidth formula due to synchronization barriers and pipeline stalls.
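
Plugging numbers into the Comm_Time formula shows the gap. The inputs are assumptions for illustration: 4 GPUs, FP16 activations for batch 64 at hidden size 8192, two AllReduce calls per layer across 80 layers, and 900 GB/s as an NVLink figure. Per-call latency is ignored, so real results will be worse.

```python
def allreduce_seconds(num_gpus, activation_bytes, interconnect_bytes_per_s):
    """Bandwidth-only AllReduce estimate: 2 * ((N-1)/N) * (size / bandwidth)."""
    return 2 * ((num_gpus - 1) / num_gpus) * (activation_bytes / interconnect_bytes_per_s)


activation_bytes = 64 * 8192 * 2   # batch 64 x hidden 8192 x 2 bytes (FP16)
allreduces_per_step = 80 * 2       # assumed: 2 AllReduce calls per layer, 80 layers

pcie = allreduce_seconds(4, activation_bytes, 64e9)     # PCIe Gen4 figure from the text
nvlink = allreduce_seconds(4, activation_bytes, 900e9)  # assumed NVLink bandwidth

print(f"PCIe:   {pcie * allreduces_per_step * 1000:.2f} ms of comm per decode step")
print(f"NVLink: {nvlink * allreduces_per_step * 1000:.2f} ms of comm per decode step")
```

Under these assumptions, PCIe adds several milliseconds of pure communication to every decode step, which is comparable to a typical ITL budget.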

Networking Rule:
• NVLink → intra-node (TP)
• InfiniBand → inter-node (PP)
• Ethernet → worst case
[Advanced Note] FlashAttention & Kernel Mechanics
Why 128K Context is Physically Possible:
Standard attention mechanisms scale quadratically $O(N^2)$ in both compute and memory reads/writes. Generating attention scores for a 128K context would require hundreds of gigabytes of intermediate memory reads.

FlashAttention solves this via "tiling": it computes the attention scores in small blocks that fit entirely within the GPU's ultra-fast SRAM, preventing the Tensor Cores from constantly waiting on the slow HBM VRAM.
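
The tiling idea can be shown on the CPU in pure Python. This sketch is not the FlashAttention kernel; it only demonstrates the online-softmax trick that makes tiling possible, for a single query vector and without the usual 1/sqrt(d) scaling. The function names and tile size are illustrative.

```python
import math


def naive_attention(q, keys, values):
    """Reference: materialize every score, then apply softmax and weight the values."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def tiled_attention(q, keys, values, tile=2):
    """Process keys/values tile by tile, keeping only a running max, sum, and accumulator."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new max (0.0 on the first tile).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values, tile=2))
```

Both functions return the same result, but the tiled version never holds more than one tile of scores at a time, which is what lets the real kernel keep its working set in SRAM.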
[Advanced Note] Autoscaling, Storage I/O & True Cold Starts
The Autoscaling Trap:
Autoscaling an LLM to zero is an infrastructure trap. Cold starts aren't just about Docker pulls. Loading 140GB of weights from disk into HBM is heavily bottlenecked by storage I/O. A network-attached EBS volume capped at 1 GB/s will take about 2.3 minutes (140 seconds) just to move the weights into memory, long after the container boots. Always use local NVMe SSDs for model caching (capable of 5-7 GB/s).

Solutions:
• Predictive Autoscaling: Use KEDA to spin up nodes 15 minutes before anticipated traffic.
• Warm Pools: Maintain a baseline of idle instances (factor this 20% overhead into TCO calculations).
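
The cold-start arithmetic above is a single division of weight size by sustained storage throughput. A minimal sketch, assuming the storage rate is sustained for the whole transfer and ignoring container boot and engine initialization (the function name is illustrative):

```python
def weight_load_seconds(weights_gb, storage_gb_per_s):
    """Time to stream model weights from storage at a sustained rate."""
    return weights_gb / storage_gb_per_s


for label, rate in (("Network volume at 1 GB/s", 1.0),
                    ("Local NVMe at 5 GB/s", 5.0),
                    ("Local NVMe at 7 GB/s", 7.0)):
    seconds = weight_load_seconds(140, rate)
    print(f"{label}: {seconds:.0f} s ({seconds / 60:.1f} min)")
```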

🛠️ The Production Debugger

If you see these failure modes in production, take these exact actions:

❌ OOM (Out of Memory) Crash
↳ Reduce Batch Size OR Context Length limits.
↳ Switch KV Cache precision to FP8 (halves memory footprint).
↳ Verify jemalloc is injected to prevent glibc fragmentation.
❌ High TTFT (>2 seconds)
↳ Reduce Batch Size (Decode contention is starving the prefill queue).
↳ Enable Prefix Caching (RadixAttention via SGLang) for Agentic prompts.
❌ Low GPU Utilization (<40%)
↳ Increase Batch Size.
↳ Check if the engine is using Continuous (In-Flight) batching rather than Static batching.
❌ Good Throughput but Bad Latency
↳ Scheduler issue. You are optimizing for batching at the expense of fairness. Tune engine max_batch constraints down.

⚠️ The 4 Critical Pitfalls of LLM Capacity Planning

Pitfall 1: "More TFLOPS = faster LLM inference"

A GPU with 100 TFLOPS and 4 TB/s bandwidth will outperform one with 200 TFLOPS and 2 TB/s for most inference workloads. During decode, the GPU computes one token at a time per sequence, reading the entire model from memory each step. The H200 outperforms the H100 by ~1.9× on Llama-70B despite identical compute. Raw bandwidth accounts for about 1.4× of that (4.8 vs. 3.35 TB/s); the rest comes from the larger 141 GB memory, which allows bigger batches.

Pitfall 2: "Sparsity TFLOPS reflect real-world performance"

NVIDIA's Hopper datasheets quote "with sparsity" as the headline number, which is 2× the dense figure. Most models do not use 2:4 structured sparsity. Always use the dense number for LLM serving.

Pitfall 3: "Batch size doesn't matter for throughput planning"

The difference between batch=1 and batch=64 can be 64× in total throughput in the memory-bound regime, because weight-loading cost is amortized across all batch items. Any calculator that estimates throughput without asking about batch size and serving scenario will produce misleading results.
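
A roofline-style sketch of that amortization, under simplifying assumptions: weights are read once per decode step regardless of batch, compute costs about 2 FLOPs per parameter per token, and KV cache reads are ignored (they grow with batch and context and would reduce the gain). The GPU figures (1,000 dense TFLOPS, 4.8 TB/s) are hypothetical inputs.

```python
def decode_throughput(batch, params, bytes_per_param, bandwidth, dense_flops):
    """System tokens/sec for one decode step under a simple roofline model."""
    memory_time = params * bytes_per_param / bandwidth   # weights read once per step
    compute_time = 2 * params * batch / dense_flops      # grows with batch size
    step_time = max(memory_time, compute_time)
    return batch / step_time


gpu = dict(params=70e9, bytes_per_param=1.0, bandwidth=4.8e12, dense_flops=1000e12)
t1 = decode_throughput(1, **gpu)
t64 = decode_throughput(64, **gpu)
t256 = decode_throughput(256, **gpu)

print(f"batch=1:   {t1:8.0f} tok/s")
print(f"batch=64:  {t64:8.0f} tok/s  ({t64 / t1:.0f}x)")
print(f"batch=256: {t256:8.0f} tok/s  ({t256 / t1:.0f}x)")
```

In this model the speedup is linear until the compute time overtakes the weight-read time, after which extra batch size stops paying off.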

Pitfall 4: "Buy the GPU with the most memory"

Memory capacity is necessary but not sufficient. The RTX PRO 6000 has 96 GB GDDR7, enough for a 70B FP8 model with room to spare, but its 1.597 TB/s bandwidth (about one third of the H200's 4.8 TB/s) makes it severely memory-bound for LLM serving. You'll get the model to fit, but throughput will disappoint. For serving, prioritize bandwidth over capacity beyond what your model needs.

📊 Real-World Benchmarks (Production Data)

Benchmarks from highly optimized TensorRT-LLM and vLLM production environments. Benchmark Config: 1024 Input / 128 Output Tokens. Batch Size = 128 (where memory permits). Metrics measured in Output Tokens Per Second (System Decode Throughput).

| Model | Hardware Config | Precision | Tokens / Sec | Observation |
|---|---|---|---|---|
| Llama 3 8B | 1x A100 80GB (PCIe) | BF16 | ~110 | Memory bandwidth bottlenecked (1.93 TB/s) |
| Llama 3 8B | 1x H100 (SXM) | FP8 | ~460 | 4x speedup due to FP8 bandwidth efficiency |
| Gemma 4 27B | 1x H200 (SXM) | FP8 | ~210 | Native FP8 support drastically increases throughput. |
| Llama 3 70B | 4x A100 80GB (PCIe) | BF16 | ~65 | Severe PCIe Tensor Parallelism network penalty |
| Llama 3 70B | 1x H200 (SXM) | FP8 | ~180 | Zero TP penalty. Huge bandwidth (4.8 TB/s) |
| Qwen2.5-VL 72B | 2x H100 (SXM) | FP8 | ~165 | Vision encoder memory limits batch size, but NVLink maintains fast decode. |
| Llama 3.1 405B | 8x B200 (HGX) | FP4 | ~380 | Over 3x speedup via FP4 native routing and 8 TB/s HBM |

⚡ Throughput, Math & Observability

Map hardware capabilities to mathematical formulas, and formulas to user-facing latency metrics. Visualizes constraints via dynamic Roofline plotting.

🧭 Hardware Router

Find the cheapest GPU configurations that will physically fit your exact memory requirements.

💰 Cloud TCO & Operations Modeling

📋 Datacenter GPU Specifications

Official specs verified for modern datacenter hardware. Crucial: All TFLOPS numbers here are Dense (what you actually get for LLM inference), not the artificially inflated Sparse numbers.

PCIe vs SXM: PCIe cards operate at lower power (TDP) resulting in lower clock speeds and TFLOPS compared to their SXM equivalents. More importantly, PCIe cards lack high-speed NVLink fabrics, making them severe bottlenecks for Multi-GPU Tensor Parallelism.