Abstract
Modern LLM inference engines like vLLM, SGLang, and NVIDIA NIM promise high throughput through optimizations like PagedAttention, RadixAttention, and TensorRT-LLM. But how do they perform under realistic, sustained production load?
This comprehensive benchmark uses closed-loop methodology that maintains constant concurrency levels, measuring sustainable throughput rather than peak burst capacity. Testing on an NVIDIA H200 SXM with models ranging from 32B to 120B parameters, the results reveal significant performance differences across workload types, with implications for production deployments.
Key Findings
- vLLM dominates: Wins 72% of all comparisons for server throughput and 64% for lowest jitter
- SGLang falls short: Despite the promise of RadixAttention, it wins only 8% of comparisons
- NVIDIA NIM niche strengths: Outperforms at small batch sizes (1-8) for certain models
- Workload matters: Diverse long contexts (B2) are 2-3x harder than cached short prompts (A1)
- Batch scaling: Throughput scales up to batch sizes of 128-256, but at the cost of higher jitter and reduced stability
Methodology: Closed-Loop Benchmarking
Why Closed-Loop?
Traditional "open-loop" benchmarks send a burst of requests and measure completion time. This doesn't reflect production reality, where servers face continuous load.
The closed-loop approach maintains a constant number of in-flight requests. When one completes, another immediately starts. This measures true sustainable throughput.
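The closed-loop pattern can be sketched with a few lines of asyncio. This is an illustrative skeleton, not the benchmark's actual harness: `send_request` is a placeholder that a real client would replace with a streaming call to the engine's HTTP API.

```python
import asyncio
import time

async def send_request(worker_id: int) -> int:
    """Placeholder for one inference call; returns tokens generated.
    A real harness would stream tokens from the engine's API here."""
    await asyncio.sleep(0.01)  # simulate request latency
    return 128                 # simulated completion length in tokens

async def closed_loop(concurrency: int, duration_s: float) -> float:
    """Keep exactly `concurrency` requests in flight for `duration_s`
    seconds and return sustained server throughput (tokens/second)."""
    start = time.monotonic()
    total_tokens = 0

    async def worker(worker_id: int) -> None:
        nonlocal total_tokens
        # As soon as one request completes, the next starts immediately,
        # so the number of in-flight requests stays constant.
        while time.monotonic() - start < duration_s:
            total_tokens += await send_request(worker_id)

    await asyncio.gather(*(worker(i) for i in range(concurrency)))
    return total_tokens / (time.monotonic() - start)

throughput = asyncio.run(closed_loop(concurrency=8, duration_s=0.5))
print(f"{throughput:.0f} tokens/s")
```

Because each worker replaces its own completed request, measured throughput reflects a steady in-flight load rather than a one-off burst.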
Benchmark Phases
Metrics
| Metric | Formula | Description |
|---|---|---|
| Server Throughput | user_throughput × batch_size | Aggregate tokens/second (derived from per-request rate) |
| User Throughput | tokens / request_latency | Tokens/second per request (includes TTFT) |
| TTFT | t_first_token - t_request | Time to First Token - perceived "thinking" time |
| TPOT | (t_last - t_first) / (n - 1) | Time per Output Token - streaming speed |
| Jitter Ratio | ITL_p99 / ITL_p50 | Token timing stability (1.0 = perfect) |
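All of these metrics follow from two observables per request: the submission time and the arrival timestamp of each output token. A minimal sketch (the token timestamps below are synthetic, not measured data):

```python
import statistics

def compute_metrics(t_request: float, token_times: list[float]) -> dict:
    """Derive the per-request benchmark metrics from the request
    submission time and per-token arrival timestamps."""
    n = len(token_times)
    ttft = token_times[0] - t_request                    # Time to First Token
    tpot = (token_times[-1] - token_times[0]) / (n - 1)  # Time per Output Token
    latency = token_times[-1] - t_request
    user_tps = n / latency                               # includes TTFT
    # Inter-token latencies feed the jitter ratio (ITL p99 / p50)
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    q = statistics.quantiles(itl, n=100)                 # 99 cut points
    jitter = q[98] / q[49]                               # p99 / p50
    return {"ttft": ttft, "tpot": tpot, "user_tps": user_tps, "jitter": jitter}

# Synthetic request: first token after 0.5 s, then one token every 20 ms
times = [0.5 + 0.02 * i for i in range(200)]
m = compute_metrics(0.0, times)
print(m)  # perfectly regular stream, so jitter ratio is 1.0
```

Server throughput is then the per-request rate multiplied by the number of concurrent requests, per the table.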
Benchmark Scenarios
Six scenarios cover diverse production workloads:
| ID | Name | Input Tokens | Caching | Description |
|---|---|---|---|---|
| A1 | Same Short | ~50 | Maximum | Identical prompts - best case for prefix caching |
| A2 | Same Long | ~2,500 | Maximum | Large identical context with prefix caching |
| B1 | Diverse Short | ~50 | None | All different prompts - raw throughput test |
| B2 | Diverse Long | ~2,500 | None | Stress test: diverse + large context |
| C | RAG/Agent | ~2,000 + query | System prompt | Typical RAG pattern with reusable context |
| D | Multi-Turn | Growing | History | Chat with accumulating context |
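The scenario matrix can be captured in a small config structure; the field names here are illustrative, with values taken from the table (Multi-Turn has no fixed prompt length, since its context grows per turn):

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    id: str
    name: str
    input_tokens: int    # approximate prompt length (0 = grows per turn)
    shared_prefix: bool  # whether prefix caching can help
    diverse: bool        # whether requests differ from each other

SCENARIOS = [
    Scenario("A1", "Same Short", 50, shared_prefix=True, diverse=False),
    Scenario("A2", "Same Long", 2500, shared_prefix=True, diverse=False),
    Scenario("B1", "Diverse Short", 50, shared_prefix=False, diverse=True),
    Scenario("B2", "Diverse Long", 2500, shared_prefix=False, diverse=True),
    Scenario("C", "RAG/Agent", 2000, shared_prefix=True, diverse=True),
    Scenario("D", "Multi-Turn", 0, shared_prefix=True, diverse=True),
]

# Scenarios where prefix caching (e.g. RadixAttention) can pay off:
cacheable = [s.id for s in SCENARIOS if s.shared_prefix]
print(cacheable)  # ['A1', 'A2', 'C', 'D']
```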
Results: Server Throughput
Server throughput (tokens/second) across concurrency levels for each scenario:
Latency Analysis
Time to First Token (TTFT)
TTFT represents the perceived "thinking time" before the model starts responding:
Time per Output Token (TPOT)
TPOT determines the streaming speed - how fast text appears during generation:
Jitter Analysis
Jitter ratio (ITL P99/P50) shows token timing stability. Values close to 1.0 indicate smooth, consistent generation:
Workload Difficulty Analysis
How well do engines handle different workloads? Performance is normalized to scenario A1 (easiest) to show relative difficulty:
Throughput Heatmaps
Absolute throughput across all scenario × concurrency combinations:
Experimental Setup
Hardware (RunPod)
- GPU: NVIDIA H200 SXM (141 GB HBM3)
- CPU: Intel Xeon Platinum 8568Y+ (24 vCPU)
- Memory: 377 GB RAM
- Driver: 570.195.03
- Region: North America
Models Tested
- Llama 3.3 70B Instruct — FP8 (68 GB)
- DeepSeek-R1-Distill-Qwen-32B — BF16 (62 GB)
- Mixtral 8x7B Instruct — BF16 (87 GB)
- GPT-OSS-120B — MXFP4 (67 GB)
Inference Engines
| Engine | Container / Version | Backend | Configuration |
|---|---|---|---|
| vLLM | nvcr.io/nvidia/vllm:25.12.post1-py3 (vLLM v0.12.0) | vLLM native | --gpu-memory-utilization 0.9 |
| SGLang | nvcr.io/nvidia/sglang:25.12-py3 (SGLang v0.5.5) | SGLang native | --mem-fraction-static 0.9 |
| NIM (Llama 70B) | nvcr.io/nim/meta/llama-3.3-70b-instruct (NIM 1.15.4 · TRT-LLM 1.0.3) | TensorRT-LLM | FP8, auto 0.9 GPU mem |
| NIM (DeepSeek 32B) | nvcr.io/nim/deepseek-ai/deepseek-r1-distill-qwen-32b (NIM 1.8.0 · TRT-LLM 0.17.1) | TensorRT-LLM | BF16, throughput profile |
| NIM (Mixtral 8x7B) | nvcr.io/nim/mistralai/mixtral-8x7b-instruct-v0-1 (NIM 1.12.0 · TRT-LLM 0.20.1) | TensorRT-LLM | BF16, throughput profile |
| NIM (GPT-OSS-120B) | nvcr.io/nim/openai/gpt-oss-120b:1.12.4 (NIM 1.12.4 · vLLM 0.10.2) | vLLM (NIM wrapper) | MXFP4 |
Note: All engines configured with 90% GPU memory utilization. NIM containers use pre-optimized TensorRT-LLM engines where available, falling back to vLLM for unsupported model architectures.
Citation
If you use this benchmark framework or reference these results:
@article{inference-benchmark-2025,
  title   = {Comparing LLM Inference Engines Under Production Load},
  author  = {Gerst, Danny},
  journal = {iX Magazin},
  year    = {2025}
}