AI Inference Benchmark

Comparing LLM Inference Engines Under Realistic Production Load

vLLM · SGLang · NVIDIA NIM

Abstract

Modern LLM inference engines like vLLM, SGLang, and NVIDIA NIM promise high throughput through optimizations like PagedAttention, RadixAttention, and TensorRT-LLM. But how do they perform under realistic, sustained production load?

This comprehensive benchmark uses closed-loop methodology that maintains constant concurrency levels, measuring sustainable throughput rather than peak burst capacity. Testing on an NVIDIA H200 SXM with models ranging from 32B to 120B parameters, the results reveal significant performance differences across workload types, with implications for production deployments.

Key Findings

  • vLLM dominates: Wins 72% of all comparisons for server throughput and 64% for lowest jitter
  • SGLang falls short: Despite RadixAttention's promise, it wins only 8% of comparisons
  • NVIDIA NIM niche strengths: Outperforms at small batch sizes (1-8) for certain models
  • Workload matters: Diverse long contexts (B2) are 2-3x harder than cached short prompts (A1)
  • Batch scaling: Throughput scales up to batch 128-256, but at the cost of higher jitter and reduced stability

Methodology: Closed-Loop Benchmarking

Why Closed-Loop?

Traditional "open-loop" benchmarks send a burst of requests and measure completion time. This doesn't reflect production reality, where servers face continuous load.

The closed-loop approach maintains a constant number of in-flight requests. When one completes, another immediately starts. This measures true sustainable throughput.
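The loop described above can be sketched in a few lines of asyncio. This is an illustrative sketch, not the benchmark's actual code; `send_request` stands in for one streaming completion call against the engine under test.

```python
import asyncio
import time

async def send_request() -> None:
    # Placeholder for a real streaming request; sleep simulates latency.
    await asyncio.sleep(0.01)

async def closed_loop(concurrency: int, duration_s: float) -> int:
    """Keep exactly `concurrency` requests in flight for `duration_s` seconds."""
    deadline = time.perf_counter() + duration_s
    completed = 0

    async def worker():
        nonlocal completed
        # Each worker immediately replaces its finished request, so the
        # number of in-flight requests stays constant until the deadline.
        while time.perf_counter() < deadline:
            await send_request()
            completed += 1

    await asyncio.gather(*(worker() for _ in range(concurrency)))
    return completed

total = asyncio.run(closed_loop(concurrency=8, duration_s=0.2))
```

Because a finished request is replaced at once rather than queued for a later burst, the measured rate is the sustainable steady-state throughput, not peak burst capacity.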

Benchmark Phases

  1. Fill: Gradually ramp up to the target concurrency
  2. Stabilize: Let the engine run while caches warm (60 s)
  3. Measure: Collect all metrics (120 s)
  4. Ramp-down: Let pending requests complete gracefully

Metrics

| Metric | Formula | Meaning |
|--------|---------|---------|
| Server Throughput | user_throughput × batch_size | Aggregate tokens/second (derived from the per-request rate) |
| User Throughput | tokens / request_latency | Tokens/second per request (includes TTFT) |
| TTFT | t_first_token - t_request | Time to First Token: perceived "thinking" time |
| TPOT | (t_last - t_first) / (n - 1) | Time per Output Token: streaming speed |
| Jitter Ratio | ITL_p99 / ITL_p50 | Token timing stability (1.0 = perfect) |
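These definitions can be checked against the per-token timestamps of a single request. The timing values below are made up for illustration; only the formulas come from the metric definitions above.

```python
from statistics import median, quantiles

# Hypothetical arrival times (seconds) of each output token of one request.
t_request = 0.0
token_times = [0.35, 0.40, 0.46, 0.50, 0.56, 0.61, 0.80, 0.85]

n = len(token_times)
ttft = token_times[0] - t_request                    # Time to First Token
tpot = (token_times[-1] - token_times[0]) / (n - 1)  # Time per Output Token
user_tps = n / (token_times[-1] - t_request)         # tokens/s, includes TTFT

# Server throughput is derived from the per-request rate.
batch_size = 64
server_tps = user_tps * batch_size

# Inter-token latencies (ITL) drive the jitter ratio.
itl = [b - a for a, b in zip(token_times, token_times[1:])]
p50 = median(itl)
p99 = quantiles(itl, n=100)[98]                      # 99th-percentile cut point
jitter_ratio = p99 / p50                             # 1.0 = perfectly smooth
```

Note how the one long 0.19 s gap in the sample pushes the jitter ratio well above 1.0 even though the median inter-token latency looks healthy; that is exactly the stall pattern the P99/P50 ratio is designed to expose.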

Benchmark Scenarios

Six scenarios cover diverse production workloads:

Scenario Overview
Overview of benchmark scenarios with input token counts and caching potential
| ID | Name | Input Tokens | Caching | Description |
|----|------|--------------|---------|-------------|
| A1 | Same Short | ~50 | Maximum | Identical prompts - best case for prefix caching |
| A2 | Same Long | ~2,500 | Maximum | Large identical context with prefix caching |
| B1 | Diverse Short | ~50 | None | All different prompts - raw throughput test |
| B2 | Diverse Long | ~2,500 | None | Stress test: diverse + large context |
| C | RAG/Agent | ~2,000 + query | System prompt | Typical RAG pattern with reusable context |
| D | Multi-Turn | Growing | History | Chat with accumulating context |
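The key difference between the A and B scenarios is whether any prompt prefix can be reused. A minimal sketch of how such prompt sets might be generated (illustrative names, not the benchmark's code):

```python
# A1 reuses one prompt, so the engine's prefix cache always hits;
# B1 makes every prompt unique, so nothing can be reused.
BASE = "Summarize the following topic in two sentences: "

def make_prompts(scenario: str, count: int) -> list[str]:
    if scenario == "A1":   # Same Short: identical prompts
        return [BASE + "solar power"] * count
    if scenario == "B1":   # Diverse Short: all-unique prompts
        return [f"{BASE}unique topic #{i}" for i in range(count)]
    raise ValueError(f"unknown scenario: {scenario}")

a1 = make_prompts("A1", 4)
b1 = make_prompts("B1", 4)
```

In B1 even the shared instruction prefix is followed by distinct content, so each request forces a fresh prefill; this is what makes the "no caching" scenarios a raw-throughput test.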

Results: Server Throughput

Server throughput (tokens/second) across concurrency levels for each scenario:

  • A1: Same Short - maximum prefix caching benefit
  • A2: Same Long - prefix caching with large context
  • B1: Diverse Short - raw throughput, no caching
  • B2: Diverse Long - most demanding workload
  • C: RAG/Agent - system prompt caching
  • D: Multi-Turn Dialog - growing context
  • Peak throughput comparison across all scenarios and engines

Latency Analysis

Time to First Token (TTFT)

TTFT represents the perceived "thinking time" before the model starts responding:

Figures: TTFT for A1 (Same Short), B1 (Diverse Short), C (RAG/Agent), and D (Multi-Turn).

Time per Output Token (TPOT)

TPOT determines the streaming speed - how fast text appears during generation:

Figures: TPOT for A1 (Same Short), B1 (Diverse Short), C (RAG/Agent), and D (Multi-Turn).

Jitter Analysis

Jitter ratio (ITL P99/P50) shows token timing stability. Values close to 1.0 indicate smooth, consistent generation:

Figures: jitter ratio for A1 (Same Short), B1 (Diverse Short), C (RAG/Agent), and D (Multi-Turn).

Workload Difficulty Analysis

How well do engines handle different workloads? Performance is normalized to scenario A1 (easiest) to show relative difficulty:
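The normalization is a simple per-engine division by the A1 baseline. A sketch with made-up numbers:

```python
# Hypothetical throughput (tokens/s) for one engine at one concurrency.
throughput = {"A1": 9000.0, "A2": 7200.0, "B1": 6300.0, "B2": 3600.0}

# Each scenario relative to the easiest one, A1.
relative = {s: tps / throughput["A1"] for s, tps in throughput.items()}
```

A relative value of 0.4 for B2 would mean the engine sustains only 40% of its A1 baseline on that workload, which is how the heatmaps express difficulty.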

  • Workload heatmaps at concurrency 64, 128, and 256
  • Relative performance (workload difficulty) at concurrency 64, 128, and 256

Throughput Heatmaps

Absolute throughput across all scenario × concurrency combinations:

  • Throughput heatmaps at concurrency 64, 128, and 256

Experimental Setup

Hardware (RunPod)

  • GPU: NVIDIA H200 SXM (141 GB HBM3)
  • CPU: Intel Xeon Platinum 8568Y+ (24 vCPU)
  • Memory: 377 GB RAM
  • Driver: 570.195.03
  • Region: North America

Models Tested

  • Llama 3.3 70B Instruct — FP8 (68 GB)
  • DeepSeek-R1-Distill-Qwen-32B — BF16 (62 GB)
  • Mixtral 8x7B Instruct — BF16 (87 GB)
  • GPT-OSS-120B — MXFP4 (67 GB)

Inference Engines

| Engine | Container / Version | Backend | Configuration |
|--------|---------------------|---------|---------------|
| vLLM | nvcr.io/nvidia/vllm:25.12.post1-py3 (v0.12.0) | vLLM native | --gpu-memory-utilization 0.9 |
| SGLang | nvcr.io/nvidia/sglang:25.12-py3 (v0.5.5) | SGLang native | --mem-fraction-static 0.9 |
| NIM (Llama 70B) | nvcr.io/nim/meta/llama-3.3-70b-instruct (NIM 1.15.4 · TRT-LLM 1.0.3) | TensorRT-LLM | FP8, auto 0.9 GPU mem |
| NIM (DeepSeek 32B) | nvcr.io/nim/deepseek-ai/deepseek-r1-distill-qwen-32b (NIM 1.8.0 · TRT-LLM 0.17.1) | TensorRT-LLM | BF16, throughput profile |
| NIM (Mixtral 8x7B) | nvcr.io/nim/mistralai/mixtral-8x7b-instruct-v0-1 (NIM 1.12.0 · TRT-LLM 0.20.1) | TensorRT-LLM | BF16, throughput profile |
| NIM (GPT-OSS-120B) | nvcr.io/nim/openai/gpt-oss-120b:1.12.4 (NIM 1.12.4 · vLLM 0.10.2) | vLLM (NIM wrapper) | MXFP4 |

Note: All engines configured with 90% GPU memory utilization. NIM containers use pre-optimized TensorRT-LLM engines where available, falling back to vLLM for unsupported model architectures.
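As a sketch, the two open-source engines could be launched along these lines. The entrypoints, model names, and port flags here are assumptions for illustration; only the image tags and the 90% memory flags come from the table above.

```shell
# Illustrative only: exact entrypoints inside the NVIDIA containers may differ.
docker run --gpus all -p 8000:8000 nvcr.io/nvidia/vllm:25.12.post1-py3 \
  vllm serve meta-llama/Llama-3.3-70B-Instruct \
    --gpu-memory-utilization 0.9

docker run --gpus all -p 30000:30000 nvcr.io/nvidia/sglang:25.12-py3 \
  python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.3-70B-Instruct \
    --mem-fraction-static 0.9
```

Both flags cap the fraction of GPU memory the engine claims up front, which is what keeps the comparison across engines apples-to-apples at 90%.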

Citation

If you use this benchmark framework or reference these results:

@article{inference-benchmark-2025,
  title={Comparing LLM Inference Engines Under Production Load},
  author={Gerst, Danny},
  journal={iX Magazin},
  year={2025}
}