Introduction

Survey of LLM inference systems

Basic computation graph of a transformer model

  • transformer
  • prefill, decode
  • kv cache

Inference engine:

  • vLLM: PagedAttention
  • SGLang: RadixAttention

P/D disaggregated inference:

  • Dynamo

KV Cache:

  • RadixAttention
  • LMCache
  • Mooncake

Model Parallelism:

  • Tensor Parallelism

Kernel:

  • MoE kernel
  • Attention kernel
  • Sparse attention

Engine

Overlap scheduling

  1. Paged KV Cache: The system manages GPU memory by dividing the Key-Value (KV) cache into fixed-size pages (blocks) rather than contiguous memory (a minimal page-table sketch follows this list).
       • Storage Layout: Memory is pre-allocated as one large tensor (e.g., [2, layers, pages, heads, dim]).
       • Eliminating Fragmentation: A page table maps non-contiguous physical memory pages to a logical sequence, so the server can handle variable-length requests without wasting space.
       • Vectorized Ops: Custom JIT CUDA kernels write new tokens into these pages using warp-level vectorized memory operations to achieve peak bandwidth.

  2. Radix Attention Tree (Prefix Caching): SGLang uses a radix tree to manage the KV cache and enable prefix caching (see the prefix-tree sketch after this list).
       • Prefix Matching: When a new request arrives (e.g., a prompt sharing a system message), the scheduler searches the radix tree for prefixes already stored in the KV cache.
       • Computation Reuse: On a match, the server reuses the cached KV tensors and skips the prefill phase for that part of the prompt, significantly reducing latency and GPU load.
       • LRU Eviction: When GPU memory is full, the system uses a Least Recently Used (LRU) policy to evict old tree nodes (and their corresponding physical pages).

  3. Scheduler for Overlapping Computation: The overlap scheduling mechanism hides CPU overhead by pipelining CPU and GPU work on dual CUDA streams (see the dual-stream sketch after this list).
       • The Problem: Normally, the GPU sits idle while the CPU schedules the next batch or processes the current result.
       • The Solution: SGLang overlaps these tasks. The engine stream executes the model forward pass for batch $N$, while the scheduler simultaneously prepares metadata and memory planning for batch $N+1$ and processes the sampling results of batch $N-1$.
       • Async Results: Token results are copied from GPU to CPU using pinned memory and non-blocking transfers, so the GPU never waits for the CPU to catch up.
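To make the page-table idea in item 1 concrete, here is a minimal sketch of a paged KV pool with a free-page list. PAGE_SIZE, PagedKVCache, and its methods are illustrative assumptions, not vLLM's or SGLang's actual API:

```python
# Minimal paged-KV-cache sketch; all names are assumed for illustration.
import torch

PAGE_SIZE = 16  # tokens per page (an assumed block size)

class PagedKVCache:
    def __init__(self, num_layers, num_pages, num_heads, head_dim):
        # One big pre-allocated pool: [2, layers, pages, page_size, heads, dim]
        # (K and V for every physical page).
        self.pool = torch.empty(2, num_layers, num_pages, PAGE_SIZE,
                                num_heads, head_dim, dtype=torch.float16)
        self.free_pages = list(range(num_pages))
        self.page_tables = {}  # seq_id -> list of physical page ids

    def slot_for_token(self, seq_id, pos):
        """Map logical token position `pos` to a (physical page, slot) pair."""
        table = self.page_tables.setdefault(seq_id, [])
        if pos % PAGE_SIZE == 0:                 # current logical block is full:
            table.append(self.free_pages.pop())  # map any free physical page
        return table[pos // PAGE_SIZE], pos % PAGE_SIZE

    def free(self, seq_id):
        # Pages simply return to the free list; because all pages are the same
        # size, no compaction is needed and external fragmentation is gone.
        self.free_pages.extend(self.page_tables.pop(seq_id, []))
```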
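Item 2's prefix matching can be sketched with a plain token-level trie; SGLang's actual RadixCache compresses runs of tokens into radix-tree edges, which this simplification skips, and all names here are illustrative:

```python
# Token-level prefix-trie sketch of prefix caching with LRU timestamps.
import time

class Node:
    def __init__(self):
        self.children = {}                   # token id -> Node
        self.page = None                     # physical KV page for this token
        self.last_access = time.monotonic()  # timestamp driving LRU eviction

class PrefixTree:
    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens):
        """Return how many leading tokens already have cached KV."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            node.last_access = time.monotonic()  # touch on the way down
            matched += 1
        return matched  # prefill only needs to start at position `matched`

    def insert(self, tokens, pages):
        """Record the KV pages for `tokens` after a prefill finishes."""
        node = self.root
        for t, page in zip(tokens, pages):
            node = node.children.setdefault(t, Node())
            node.page, node.last_access = page, time.monotonic()
```

An eviction pass would then collect leaf nodes, sort them by last_access, and free their pages until enough memory is reclaimed.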
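Item 3's pipelining can likewise be sketched with a dedicated CUDA stream plus events. `model`, `scheduler`, and their methods are assumed placeholders, greedy argmax stands in for real sampling, and a CUDA device is required:

```python
# Dual-stream overlap-scheduling sketch (illustrative, not SGLang's code).
import torch

compute_stream = torch.cuda.Stream()

def serve_loop(model, scheduler):
    prev = None  # (pinned result tensor, CUDA event) for batch N-1
    next_batch = scheduler.schedule_next_batch()       # assumed scheduler API
    while next_batch is not None:
        batch = next_batch
        with torch.cuda.stream(compute_stream):        # GPU: forward for batch N
            logits = model.forward(batch)
            tokens_gpu = logits.argmax(dim=-1)         # greedy "sampling"
            tokens_cpu = torch.empty(tokens_gpu.shape, dtype=tokens_gpu.dtype,
                                     device="cpu", pin_memory=True)
            tokens_cpu.copy_(tokens_gpu, non_blocking=True)  # async D2H copy
        done = torch.cuda.Event()
        done.record(compute_stream)
        # CPU: while the GPU is busy with batch N, plan batch N+1 ...
        next_batch = scheduler.schedule_next_batch()
        # ... and finish batch N-1, whose copy has had a full step to complete.
        if prev is not None:
            prev_tokens, prev_done = prev
            prev_done.synchronize()                    # waits only if copy lags
            scheduler.process_results(prev_tokens)
        prev = (tokens_cpu, done)
    if prev is not None:                               # flush the last batch
        prev[1].synchronize()
        scheduler.process_results(prev[0])
```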

A comprehensive analysis of inference framework internals: https://mp.weixin.qq.com/s/b_4YhTa96Zeklh23lv8qBw

SGLang's overall architecture is clean. From the perspective of request execution flow, it can roughly be divided into a serving interface layer, a scheduling layer, and an execution layer:

  • Serving Interface: The core components are the API Server, Tokenizer, and Detokenizer, responsible for receiving and dispatching requests and for handling simple encode/decode logic.

  • Scheduler: The core is the Scheduler class, which integrates the batching policy (Scheduler Policy), Disaggregation (prefill/decode separation), Overlap scheduling, and output processing (embedded as Mixins, each in its own code file); functional modules such as the Radix Cache (prefix KV cache management) and the Prefill Adder (request batching) are implemented separately.

  • Executor: TPWorker is the core entry point; it receives instructions from the scheduling layer and supports tensor parallelism (TP) and pipeline parallelism (PP). TPWorker manages the Model Runner, which integrates multiple compute backends, the Memory Pool, and NCCL Groups (communication groups) to jointly perform model inference.

The simplified request flow is shown in the figure below:

SGLang request flow
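A minimal, runnable sketch of that three-layer flow, with stand-in classes (none of these names match SGLang's real code, and the tokenizer and worker are trivial stubs):

```python
# Illustrative three-layer request flow: serving interface -> scheduler -> executor.

class Tokenizer:
    def encode(self, text):  return [ord(c) for c in text]   # stand-in tokenizer
    def decode(self, ids):   return "".join(chr(i) for i in ids)

class TPWorker:                      # execution layer: runs the model
    def run(self, batch):
        return batch                 # stand-in for model_runner.forward(batch)

class Scheduler:                     # scheduling layer: batching + prefix cache
    def __init__(self, worker):
        self.worker = worker
    def submit(self, token_ids):
        # Real code would match the Radix Cache, form a batch (Prefill Adder),
        # and enqueue it; here we dispatch directly to the executor.
        return self.worker.run(token_ids)

class APIServer:                     # serving interface layer
    def __init__(self, scheduler):
        self.tok, self.scheduler = Tokenizer(), scheduler
    def handle(self, prompt):
        ids = self.tok.encode(prompt)        # Tokenizer
        out = self.scheduler.submit(ids)     # Scheduler -> TPWorker
        return self.tok.decode(out)          # Detokenizer

server = APIServer(Scheduler(TPWorker()))
print(server.handle("hello"))                # echoes through all three layers
```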

Kernel

Attention: FlashAttention-3 (FA-3), FlashInfer

Kernel Fusion: just-in-time (JIT) compiled kernels for better runtime performance (a minimal illustration follows).
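As a generic illustration of JIT kernel fusion, here is torch.compile fusing a chain of elementwise and reduction ops into a single kernel; this is a stand-in for the hand-written JIT kernels these engines ship, not their actual code, and it assumes a CUDA device:

```python
import torch

def rmsnorm_scale(x, w, eps=1e-6):
    # Several elementwise/reduction ops that a JIT compiler can fuse into one
    # kernel, avoiding intermediate round trips through global memory.
    var = x.pow(2).mean(dim=-1, keepdim=True)
    return x * torch.rsqrt(var + eps) * w

fused = torch.compile(rmsnorm_scale)  # JIT-compiles a fused kernel on first call

x = torch.randn(8, 4096, device="cuda")
w = torch.ones(4096, device="cuda")
out = fused(x, w)                     # later calls reuse the compiled kernel
```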