Abstract
Modern LLM inference engines like vLLM, SGLang, and NVIDIA NIM promise high throughput through optimizations like PagedAttention, RadixAttention, and TensorRT-LLM. But how do they perform under realistic, sustained production load?
This comprehensive benchmark uses closed-loop methodology that maintains constant concurrency levels, measuring sustainable throughput rather than peak burst capacity. Testing on an NVIDIA H200 SXM with models ranging from 32B to 120B parameters, the results reveal significant performance differences across workload types, with implications for production deployments.
Key Findings
- vLLM dominates: Wins 72% of all comparisons for server throughput and 64% for lowest jitter
- SGLang falls short: Despite the promise of RadixAttention, it wins only 8% of comparisons
- NVIDIA NIM niche strengths: Outperforms at small batch sizes (1-8) for certain models
- Workload matters: Diverse long contexts (B2) are 2-3x harder than cached short prompts (A1)
- Batch scaling: Throughput scales up to batch sizes of 128-256, but at the cost of higher jitter and reduced stability
Methodology: Closed-Loop Benchmarking
Why Closed-Loop?
Traditional "open-loop" benchmarks send a burst of requests and measure completion time. This doesn't reflect production reality, where servers face continuous load.
The closed-loop approach maintains a constant number of in-flight requests. When one completes, another immediately starts. This measures true sustainable throughput.
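The closed-loop pattern can be sketched with a few lines of asyncio. This is an illustrative skeleton, not the benchmark's actual harness: `send_request` is a placeholder that a real client would replace with a streaming call to the engine's HTTP API.

```python
import asyncio
import time

async def send_request(worker_id: int) -> int:
    """Placeholder for one inference call; returns tokens generated.
    A real harness would stream tokens from the engine's API here."""
    await asyncio.sleep(0.01)  # simulate request latency
    return 128                 # simulated completion length in tokens

async def closed_loop(concurrency: int, duration_s: float) -> float:
    """Keep exactly `concurrency` requests in flight for `duration_s`
    seconds and return sustained server throughput (tokens/second)."""
    start = time.monotonic()
    total_tokens = 0

    async def worker(worker_id: int) -> None:
        nonlocal total_tokens
        # As soon as one request completes, the next starts immediately,
        # so the number of in-flight requests stays constant.
        while time.monotonic() - start < duration_s:
            total_tokens += await send_request(worker_id)

    await asyncio.gather(*(worker(i) for i in range(concurrency)))
    return total_tokens / (time.monotonic() - start)

throughput = asyncio.run(closed_loop(concurrency=8, duration_s=0.5))
print(f"{throughput:.0f} tokens/s")
```

Because each worker replaces its own completed request, measured throughput reflects a steady in-flight load rather than a one-off burst.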
Benchmark Phases
Metrics
| Metric | Formula | Description |
|---|---|---|
| Server Throughput | user_throughput × batch_size | Aggregate tokens/second (derived from per-request rate) |
| User Throughput | tokens / request_latency | Tokens/second per request (includes TTFT) |
| TTFT | t_first_token - t_request | Time to First Token - perceived "thinking" time |
| TPOT | (t_last - t_first) / (n - 1) | Time per Output Token - streaming speed |
| Jitter Ratio | ITL_p99 / ITL_p50 | Token timing stability (1.0 = perfect) |
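All of these metrics follow from two observables per request: the submission time and the arrival timestamp of each output token. A minimal sketch (the token timestamps below are synthetic, not measured data):

```python
import statistics

def compute_metrics(t_request: float, token_times: list[float]) -> dict:
    """Derive the per-request benchmark metrics from the request
    submission time and per-token arrival timestamps."""
    n = len(token_times)
    ttft = token_times[0] - t_request                    # Time to First Token
    tpot = (token_times[-1] - token_times[0]) / (n - 1)  # Time per Output Token
    latency = token_times[-1] - t_request
    user_tps = n / latency                               # includes TTFT
    # Inter-token latencies feed the jitter ratio (ITL p99 / p50)
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    q = statistics.quantiles(itl, n=100)                 # 99 cut points
    jitter = q[98] / q[49]                               # p99 / p50
    return {"ttft": ttft, "tpot": tpot, "user_tps": user_tps, "jitter": jitter}

# Synthetic request: first token after 0.5 s, then one token every 20 ms
times = [0.5 + 0.02 * i for i in range(200)]
m = compute_metrics(0.0, times)
print(m)  # perfectly regular stream, so jitter ratio is 1.0
```

Server throughput is then the per-request rate multiplied by the number of concurrent requests, per the table.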
Benchmark Scenarios
Six scenarios cover diverse production workloads:
| ID | Name | Input Tokens | Caching | Description |
|---|---|---|---|---|
| A1 | Same Short | ~50 | Maximum | Identical prompts - best case for prefix caching |
| A2 | Same Long | ~2,500 | Maximum | Large identical context with prefix caching |
| B1 | Diverse Short | ~50 | None | All different prompts - raw throughput test |
| B2 | Diverse Long | ~2,500 | None | Stress test: diverse + large context |
| C | RAG/Agent | ~2,000 + query | System prompt | Typical RAG pattern with reusable context |
| D | Multi-Turn | Growing | History | Chat with accumulating context |
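The scenario matrix can be captured in a small config structure; the field names here are illustrative, with values taken from the table (Multi-Turn has no fixed prompt length, since its context grows per turn):

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    id: str
    name: str
    input_tokens: int    # approximate prompt length (0 = grows per turn)
    shared_prefix: bool  # whether prefix caching can help
    diverse: bool        # whether requests differ from each other

SCENARIOS = [
    Scenario("A1", "Same Short", 50, shared_prefix=True, diverse=False),
    Scenario("A2", "Same Long", 2500, shared_prefix=True, diverse=False),
    Scenario("B1", "Diverse Short", 50, shared_prefix=False, diverse=True),
    Scenario("B2", "Diverse Long", 2500, shared_prefix=False, diverse=True),
    Scenario("C", "RAG/Agent", 2000, shared_prefix=True, diverse=True),
    Scenario("D", "Multi-Turn", 0, shared_prefix=True, diverse=True),
]

# Scenarios where prefix caching (e.g. RadixAttention) can pay off:
cacheable = [s.id for s in SCENARIOS if s.shared_prefix]
print(cacheable)  # ['A1', 'A2', 'C', 'D']
```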
Results: Server Throughput
Server throughput (tokens/second) across concurrency levels for each scenario:
Latency Analysis
Time to First Token (TTFT)
TTFT represents the perceived "thinking time" before the model starts responding:
Time per Output Token (TPOT)
TPOT determines the streaming speed - how fast text appears during generation:
Jitter Analysis
Jitter ratio (ITL P99/P50) shows token timing stability. Values close to 1.0 indicate smooth, consistent generation:
Workload Difficulty Analysis
How well do engines handle different workloads? Performance is normalized to scenario A1 (easiest) to show relative difficulty:
Throughput Heatmaps
Absolute throughput across all scenario × concurrency combinations:
Experimental Setup
Hardware (RunPod)
- GPU: NVIDIA H200 SXM (141 GB HBM3)
- CPU: Intel Xeon Platinum 8568Y+ (24 vCPU)
- Memory: 377 GB RAM
- Driver: 570.195.03
- Region: North America
Models Tested
- Llama 3.3 70B Instruct — FP8 (68 GB)
- DeepSeek-R1-Distill-Qwen-32B — BF16 (62 GB)
- Mixtral 8x7B Instruct — BF16 (87 GB)
- GPT-OSS-120B — MXFP4 (67 GB)
Inference Engines
| Engine | Container / Version | Backend | Configuration |
|---|---|---|---|
| vLLM | nvcr.io/nvidia/vllm:25.12.post1-py3 (vLLM v0.12.0) | vLLM native | --gpu-memory-utilization 0.9 |
| SGLang | nvcr.io/nvidia/sglang:25.12-py3 (SGLang v0.5.5) | SGLang native | --mem-fraction-static 0.9 |
| NIM (Llama 70B) | nvcr.io/nim/meta/llama-3.3-70b-instruct (NIM 1.15.4 · TRT-LLM 1.0.3) | TensorRT-LLM | FP8, auto 0.9 GPU mem |
| NIM (DeepSeek 32B) | nvcr.io/nim/deepseek-ai/deepseek-r1-distill-qwen-32b (NIM 1.8.0 · TRT-LLM 0.17.1) | TensorRT-LLM | BF16, throughput profile |
| NIM (Mixtral 8x7B) | nvcr.io/nim/mistralai/mixtral-8x7b-instruct-v0-1 (NIM 1.12.0 · TRT-LLM 0.20.1) | TensorRT-LLM | BF16, throughput profile |
| NIM (GPT-OSS-120B) | nvcr.io/nim/openai/gpt-oss-120b:1.12.4 (NIM 1.12.4 · vLLM 0.10.2) | vLLM (NIM wrapper) | MXFP4 |
Note: All engines configured with 90% GPU memory utilization. NIM containers use pre-optimized TensorRT-LLM engines where available, falling back to vLLM for unsupported model architectures.
Citation
If you use this benchmark framework or reference these results:
@article{inference-benchmark-2025,
  title   = {Comparing LLM Inference Engines Under Production Load},
  author  = {Gerst, Danny},
  journal = {iX Magazin},
  year    = {2025}
}