TensorRT-LLM Overall Architecture Design
1. System Architecture Overview
TensorRT-LLM uses a layered architecture, organized from top to bottom into the following layers:
```mermaid
graph TB
    subgraph "User Interface Layer"
        A[LLM API] --> B[CLI Tools]
        A --> C[Python SDK]
        A --> D[Multimodal API]
        B --> E[trtllm-build]
        B --> F[trtllm-serve]
        B --> G[trtllm-bench]
        B --> H[trtllm-eval]
    end
    subgraph "Executor Layer"
        I[GenerationExecutor] --> J[ExecutorProxy]
        I --> K[ExecutorWorker]
        J --> L[Multi-process management]
        J --> M[IPC communication]
        K --> N[Inference engine]
        K --> O[Background task management]
    end
    subgraph "Runtime Layer"
        P[ModelRunner] --> Q[Session management]
        P --> R[KV cache management]
        P --> S[Memory management]
        P --> T[Batch scheduling]
        Q --> U[TensorRT Runtime]
        R --> V[Paged cache]
        S --> W[GPU memory pool]
    end
    subgraph "Builder Layer"
        X[Builder] --> Y[Network construction]
        X --> Z[Optimization config]
        X --> AA[Auto parallelism]
        Y --> BB[Model conversion]
        Y --> CC[Graph optimization]
        Z --> DD[Engine compilation]
        AA --> EE[Sharding strategy]
    end
    subgraph "Infrastructure Layer"
        FF[CUDA Kernels] --> GG[Custom ops]
        FF --> HH[Fused ops]
        FF --> II[FlashAttention]
        JJ[Quantization support] --> KK[FP8/INT4/FP4]
        JJ --> LL[Calibration tools]
        MM[Parallelism strategies] --> NN[TP/PP/EP]
        MM --> OO[Communication optimization]
    end
    subgraph "Configuration Management"
        PP[BuildConfig] --> QQ[Sequence length config]
        PP --> RR[Batch config]
        PP --> SS[Optimization config]
        TT[QuantConfig] --> UU[Quantization algorithms]
        TT --> VV[Calibration config]
        WW[ParallelConfig] --> XX[Parallelism degree config]
        WW --> YY[Communication config]
    end
    A --> I
    I --> P
    P --> X
    X --> FF
    X --> JJ
    X --> MM
    PP --> X
    TT --> JJ
    WW --> MM
```
1.1 Layer Responsibilities
User Interface Layer
- LLM API: unified high-level interface supporting synchronous and asynchronous generation (a minimal usage sketch follows this list)
- CLI tools: command-line entry points for building, serving, and benchmarking engines
- Python SDK: the complete Python development toolkit
- Multimodal API: support for image, audio, and other multimodal inputs
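To ground the list above, here is a minimal usage sketch of the LLM API. The `LLM` and `SamplingParams` entry points are part of TensorRT-LLM's Python API; the model ID and parameter values here are placeholders.

```python
from tensorrt_llm import LLM, SamplingParams

# The model can be a Hugging Face ID or a local checkpoint/engine directory.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# SamplingParams drives the decode loop (temperature, nucleus sampling, length).
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Synchronous generation; generate_async() is the non-blocking counterpart.
for output in llm.generate(["Hello, my name is"], params):
    print(output.outputs[0].text)
```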
Executor Layer
- GenerationExecutor: abstract executor interface defining the common generation protocol
- ExecutorProxy: multi-process proxy that manages distributed inference
- ExecutorWorker: worker process that executes the actual inference tasks
- IPC communication: inter-process communication channels for efficient data transfer
Runtime Layer
- ModelRunner: model runner that encapsulates the inference execution logic
- Session management: TensorRT session and context management
- KV cache management: paging and reuse of key/value caches
- Memory management: GPU memory pools and dynamic allocation
Builder Layer
- Builder: the TensorRT engine builder
- Network construction: builds the computation graph from the model definition
- Optimization config: graph optimization and operator fusion settings
- Auto parallelism: automatic sharding and parallelism strategy selection
Infrastructure Layer
- CUDA Kernels: high-performance CUDA operator implementations
- Quantization support: multi-precision quantization algorithms and tooling
- Parallelism strategies: tensor parallelism, pipeline parallelism, and expert parallelism
2. Core Component Architecture
2.1 LLM API Layer Architecture
```mermaid
classDiagram
    class BaseLLM {
        -_executor: GenerationExecutor
        -_tokenizer: TokenizerBase
        -mpi_session: MpiCommSession
        -input_processor: InputProcessor
        +generate(inputs, sampling_params)
        +generate_async(inputs, sampling_params)
        +shutdown()
        +_init_executor()
        +_build_model()
        +_try_load_tokenizer()
    }
    class LLM {
        +__init__(model, tokenizer, **kwargs)
        +save(engine_dir)
    }
    class _TorchLLM {
        +backend: "pytorch"
        +_validate_args_for_torch_backend()
        +_build_model()
    }
    class _TrtLLM {
        +backend: "tensorrt"
        +workspace: Path
        +_engine_dir: Path
        +save(engine_dir)
        +_build_model()
    }
    class MultimodalEncoder {
        +generate(inputs)
        +generate_async(inputs)
        +_validate_mm_args_for_torch_backend()
    }
    BaseLLM <|-- _TorchLLM
    BaseLLM <|-- _TrtLLM
    _TorchLLM <|-- LLM
    _TorchLLM <|-- MultimodalEncoder
    class GenerationExecutor {
        <<abstract>>
        +submit(request): GenerationResult
        +abort_request(request_id)
        +generate_async()
        +shutdown()
        +create(**kwargs)$
    }
    class GenerationExecutorProxy {
        -workers: List[Process]
        -request_queue: Queue
        -result_queue: Queue
        -mpi_session: MpiCommSession
        +submit(request)
        +_manage_workers()
        +_start_workers()
        +_shutdown_workers()
    }
    class GenerationExecutorWorker {
        -engine: Engine
        -session: Session
        -_results: Dict
        -await_response_thread: ManagedThread
        +submit(request)
        +await_response_task()
        +_handle_response()
        +setup_engine()
    }
    GenerationExecutor <|-- GenerationExecutorProxy
    GenerationExecutor <|-- GenerationExecutorWorker
    BaseLLM --> GenerationExecutor
    class InputProcessor {
        +tokenizer: TokenizerBase
        +process(inputs, sampling_params)
        +preprocess_inputs()
    }
    class RequestOutput {
        +request_id: int
        +prompt: str
        +outputs: List[CompletionOutput]
        +finished: bool
        +_from_generation_result()$
    }
    class GenerationResult {
        +request_id: int
        +prompt_token_ids: List[int]
        +add_output_tokens()
        +set_finished()
        +set_exception()
    }
    BaseLLM --> InputProcessor
    GenerationExecutor --> GenerationResult
    RequestOutput --> GenerationResult
```
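The executor contract in the diagram can be summarized with a small sketch. Method names follow the class diagram; the simplified signatures and the `InProcessWorker` stub are assumptions for illustration, not the actual TensorRT-LLM definitions.

```python
from abc import ABC, abstractmethod
from concurrent.futures import Future

class GenerationExecutor(ABC):
    """Executor contract from the class diagram; signatures are simplified."""

    @abstractmethod
    def submit(self, request) -> Future:
        """Enqueue a request and return a future-like result handle."""

    @abstractmethod
    def shutdown(self) -> None:
        """Stop background threads/processes and release the engine."""

class InProcessWorker(GenerationExecutor):
    """Toy stand-in for GenerationExecutorWorker: runs requests inline."""

    def __init__(self, engine_fn):
        self._engine_fn = engine_fn  # stands in for the loaded TensorRT engine

    def submit(self, request) -> Future:
        fut: Future = Future()
        # The real worker enqueues to the engine and resolves results from a
        # background await_response thread; the toy resolves synchronously.
        fut.set_result(self._engine_fn(request))
        return fut

    def shutdown(self) -> None:
        pass  # nothing to release in the toy version

worker = InProcessWorker(engine_fn=lambda req: f"generated for: {req}")
print(worker.submit("hello").result())
```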
2.2 Executor Layer Architecture in Detail
```mermaid
graph TB
    subgraph "Executor abstraction layer"
        A[GenerationExecutor] --> B[Abstract interface definition]
        A --> C[Common functionality]
        A --> D[Factory method]
    end
    subgraph "Proxy executor (multi-process)"
        E[GenerationExecutorProxy] --> F[Process management]
        E --> G[Request dispatch]
        E --> H[Result collection]
        F --> I[Worker process startup]
        F --> J[Process monitoring]
        F --> K[Fault recovery]
        G --> L[Load balancing]
        G --> M[Request queue]
        H --> N[Result aggregation]
        H --> O[Exception handling]
    end
    subgraph "Worker executor (single-process)"
        P[GenerationExecutorWorker] --> Q[Engine management]
        P --> R[Request handling]
        P --> S[Background tasks]
        Q --> T[Engine initialization]
        Q --> U[Session management]
        R --> V[Request validation]
        R --> W[Result mapping]
        S --> X[Response listening]
        S --> Y[Stats collection]
    end
    subgraph "Base worker"
        Z[BaseWorker] --> AA[Engine setup]
        Z --> BB[Postprocessing management]
        Z --> CC[Error handling]
        AA --> DD[Model loading]
        AA --> EE[Config validation]
        BB --> FF[PostprocWorker]
        BB --> GG[Tokenizer management]
    end
    subgraph "Supporting components"
        HH[IPC queues] --> II[Inter-process messaging]
        HH --> JJ[Async communication]
        KK[ManagedThread] --> LL[Background task management]
        KK --> MM[Thread monitoring]
        NN[IterationResultQueue] --> OO[Result queue management]
        NN --> PP[Event dispatch]
    end
    E -.->|inherits| A
    P -.->|inherits| A
    P -.->|inherits| Z
    E --> HH
    P --> KK
    P --> NN
```
2.3 Executor Layer Sequence Diagram
```mermaid
sequenceDiagram
    participant User
    participant LLM
    participant InputProcessor
    participant Executor
    participant Worker
    participant Engine
    participant BackgroundTask
    User->>LLM: generate(prompt, sampling_params)
    Note over LLM: Input preprocessing
    LLM->>InputProcessor: process(inputs)
    InputProcessor->>InputProcessor: tokenize & validate
    InputProcessor-->>LLM: processed_inputs
    Note over LLM: Request creation
    LLM->>LLM: create GenerationRequest
    LLM->>Executor: submit(GenerationRequest)
    alt Multi-process mode (ExecutorProxy)
        Note over Executor: Inter-process communication
        Executor->>Executor: select_worker()
        Executor->>Worker: send request via IPC
        Worker->>Worker: validate_request()
        Worker->>Engine: enqueue_request()
        Note over BackgroundTask: Background processing
        BackgroundTask->>Engine: await_responses()
        Engine-->>BackgroundTask: response
        BackgroundTask->>Worker: handle_response()
        Worker->>Executor: return result via IPC
    else Single-process mode (ExecutorWorker)
        Note over Executor: Direct handling
        Executor->>Executor: validate_request()
        Executor->>Engine: enqueue_request()
        Note over BackgroundTask: Background listening
        BackgroundTask->>Engine: await_responses()
        Engine-->>BackgroundTask: response
        BackgroundTask->>Executor: handle_response()
    end
    Note over LLM: Result handling
    Executor-->>LLM: GenerationResult
    LLM->>LLM: create RequestOutput
    LLM-->>User: RequestOutput
```
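The submit/await flow above surfaces to users through the asynchronous API. A hedged sketch, assuming the `generate_async` entry point and the `RequestOutput` fields shown in the class diagram; streamed outputs carry the text accumulated so far.

```python
import asyncio
from tensorrt_llm import LLM, SamplingParams

async def main() -> None:
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    params = SamplingParams(max_tokens=32)

    # Submits the request to the executor; tokens arrive from the
    # background response thread and are surfaced as async iterations.
    result = llm.generate_async("Explain the KV cache in one sentence.",
                                params, streaming=True)
    async for output in result:
        print(output.outputs[0].text)

asyncio.run(main())
```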
2.4 Builder Layer Architecture
```mermaid
flowchart TD
    subgraph "Input layer"
        A[PretrainedModel] --> A1[Model weights]
        A --> A2[Model config]
        B[BuildConfig] --> B1[Sequence length config]
        B --> B2[Batch config]
        B --> B3[Optimization config]
        C[QuantConfig] --> C1[Quantization algorithm]
        C --> C2[Calibration config]
    end
    subgraph "Build pipeline"
        D[build function] --> E[Config preprocessing]
        E --> F[Network construction phase]
        F --> G[Optimization phase]
        G --> H[Compilation phase]
        H --> I[Serialization phase]
    end
    subgraph "Network construction"
        F --> F1[Create Network]
        F1 --> F2[Set plugin config]
        F2 --> F3[Prepare input tensors]
        F3 --> F4[Model forward pass]
        F4 --> F5[Mark output tensors]
    end
    subgraph "Graph optimization"
        G --> G1[Operator fusion]
        G1 --> G2[Memory optimization]
        G2 --> G3[Graph simplification]
        G3 --> G4[Auto-parallel processing]
    end
    subgraph "Engine compilation"
        H --> H1[Create Builder]
        H1 --> H2[Set optimization profiles]
        H2 --> H3[Rename weights]
        H3 --> H4[TensorRT compilation]
        H4 --> H5[Timing cache]
    end
    subgraph "Output layer"
        I --> I1[Serialized engine]
        I1 --> I2[Engine metadata]
        I2 --> I3[Engine object]
    end
    A --> D
    B --> D
    C --> D
```
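In code, this flow corresponds roughly to the offline build path. `build`, `BuildConfig`, and the `from_hugging_face` loader exist in TensorRT-LLM's Python API, though the exact parameter set shown here is a trimmed-down assumption:

```python
from tensorrt_llm import BuildConfig, build
from tensorrt_llm.models import LLaMAForCausalLM

# Input layer: load checkpoint weights and model config.
model = LLaMAForCausalLM.from_hugging_face("meta-llama/Llama-2-7b-hf")

# BuildConfig carries the sequence-length / batch / optimization knobs.
config = BuildConfig(max_input_len=2048, max_seq_len=4096, max_batch_size=8)

# build() runs the phases above: network construction, graph optimization,
# TensorRT compilation, then serialization into an Engine object.
engine = build(model, config)
engine.save("./engine_dir")
```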
2.5 Builder Sequence Diagram
```mermaid
sequenceDiagram
    participant User
    participant BuildFunc
    participant Model
    participant Network
    participant Optimizer
    participant Builder
    participant TensorRT
    User->>BuildFunc: build(model, build_config)
    Note over BuildFunc: Config preprocessing
    BuildFunc->>BuildFunc: validate_config()
    BuildFunc->>BuildFunc: init_max_seq_len()
    BuildFunc->>BuildFunc: update_kv_cache_type()
    Note over BuildFunc: Network construction
    BuildFunc->>Network: create Network()
    BuildFunc->>Network: set plugin_config
    BuildFunc->>Model: prepare_inputs(**args)
    Model-->>BuildFunc: input tensors
    BuildFunc->>Network: net_guard(network)
    BuildFunc->>Model: forward(**inputs)
    Model->>Network: build computation graph
    alt Debug outputs enabled
        BuildFunc->>Network: mark debug outputs
    end
    Note over BuildFunc: Graph optimization
    alt Not a DecoderModel
        BuildFunc->>Optimizer: optimize(network)
        Optimizer->>Optimizer: operator fusion
        Optimizer->>Optimizer: memory optimization
        Optimizer->>Optimizer: graph simplification
        Optimizer-->>BuildFunc: optimized network
    end
    Note over BuildFunc: Auto parallelism
    alt Auto parallel enabled
        BuildFunc->>Optimizer: auto_parallel(network, config)
        Optimizer->>Optimizer: analyze parallelism
        Optimizer->>Optimizer: generate sharded networks
        Optimizer-->>BuildFunc: sharded_networks[rank]
        BuildFunc->>Model: update mapping config
    end
    Note over BuildFunc: Network visualization
    alt Visualization enabled
        BuildFunc->>Network: save_visualization()
    end
    Note over BuildFunc: Engine compilation
    BuildFunc->>Builder: create Builder()
    BuildFunc->>Builder: create BuilderConfig
    BuildFunc->>Builder: build_engine(network, config)
    Builder->>Builder: add_optimization_profile()
    Builder->>Builder: rename_weights()
    Builder->>TensorRT: build_serialized_network()
    TensorRT->>TensorRT: compile and optimize
    TensorRT-->>Builder: serialized engine
    Builder-->>BuildFunc: engine buffer
    Note over BuildFunc: Create engine object
    BuildFunc->>BuildFunc: create Engine(config, buffer)
    BuildFunc-->>User: Engine object
```
3. Data Flow Architecture
3.1 Inference Data Flow
```mermaid
graph LR
    subgraph "Input processing"
        A[Raw text] --> B[Tokenizer]
        B --> C[Token IDs]
        C --> D[Input tensors]
    end
    subgraph "Inference execution"
        D --> E[Attention computation]
        E --> F[FFN computation]
        F --> G[Output projection]
        G --> H[Logits]
    end
    subgraph "Sampling / decoding"
        H --> I[Sampling strategy]
        I --> J[Token selection]
        J --> K[New token]
    end
    subgraph "Output processing"
        K --> L[Token accumulation]
        L --> M[Detokenizer]
        M --> N[Generated text]
    end
    subgraph "KV cache management"
        E --> O[KV Cache]
        O --> P[Cache update]
        P --> E
    end
```
3.2 Memory Management Architecture
```mermaid
graph TB
    subgraph "GPU memory layout"
        A[Model weights] --> A1[Embedding layers]
        A --> A2[Transformer layers]
        A --> A3[Output layer]
        B[KV cache] --> B1[Key cache]
        B --> B2[Value cache]
        B1 --> B3[Paged management]
        B2 --> B3
        C[Activation memory] --> C1[Input tensors]
        C --> C2[Intermediate activations]
        C --> C3[Output tensors]
        D[Workspace memory] --> D1[Temporary buffers]
        D --> D2[Operator workspaces]
    end
    subgraph "Memory optimization strategies"
        E[Memory pool] --> F[Preallocation]
        E --> G[Dynamic allocation]
        H[Memory reuse] --> I[Activation checkpointing]
        H --> J[Gradient accumulation]
    end
    A --> E
    B --> E
    C --> H
```
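To make the paged KV cache concrete, here is a toy block allocator. The `PagedKVCache` class and its fields are hypothetical, illustrating only the page-table bookkeeping idea behind the "Paged management" node above:

```python
from dataclasses import dataclass, field

@dataclass
class PagedKVCache:
    """Toy paged KV-cache allocator: fixed-size blocks + per-sequence page tables."""
    num_blocks: int
    tokens_per_block: int = 64
    _free: list[int] = field(default_factory=list)
    _pages: dict[int, list[int]] = field(default_factory=dict)
    _lengths: dict[int, int] = field(default_factory=dict)

    def __post_init__(self) -> None:
        self._free = list(range(self.num_blocks))  # all blocks start free

    def append_token(self, seq_id: int) -> int:
        """Record one new token; allocate a fresh block on page boundaries."""
        n = self._lengths.get(seq_id, 0)
        if n % self.tokens_per_block == 0:  # current page full (or first token)
            if not self._free:
                raise MemoryError("KV cache exhausted: evict or preempt a sequence")
            self._pages.setdefault(seq_id, []).append(self._free.pop())
        self._lengths[seq_id] = n + 1
        return self._pages[seq_id][-1]      # block that holds the new token

    def release(self, seq_id: int) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self._free.extend(self._pages.pop(seq_id, []))
        self._lengths.pop(seq_id, None)
```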
4. Parallelism Strategy Architecture
4.1 Tensor Parallelism
```mermaid
graph TB
    subgraph "Single-layer tensor parallelism"
        A[Input tensor] --> B[Split]
        B --> C["GPU 0: weight shard 0"]
        B --> D["GPU 1: weight shard 1"]
        B --> E["GPU N: weight shard N"]
        C --> F[Partial compute 0]
        D --> G[Partial compute 1]
        E --> H[Partial compute N]
        F --> I[AllReduce communication]
        G --> I
        H --> I
        I --> J[Output tensor]
    end
    subgraph "Layer-to-layer cascading"
        J --> K[Next layer input]
        K --> L[Repeat the parallel pattern]
    end
```
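A row-parallel linear layer captures the split/compute/AllReduce pattern in a few lines. A minimal sketch using `torch.distributed`, assuming the process group is already initialized with one rank per GPU; it mirrors the diagram, not TensorRT-LLM's fused kernels:

```python
import torch
import torch.distributed as dist

def row_parallel_linear(x_shard: torch.Tensor, w_shard: torch.Tensor) -> torch.Tensor:
    """One tensor-parallel linear layer in the row-parallel layout.

    Each rank holds a slice of the weight's input dimension (w_shard: [out, in/tp])
    and the matching slice of the activations (x_shard: [batch, in/tp]).
    """
    partial = x_shard @ w_shard.t()                 # local partial compute on this GPU
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)  # sum partials across TP ranks
    return partial                                  # full output, replicated on all ranks
```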
4.2 Pipeline Parallelism
```mermaid
sequenceDiagram
    participant GPU0 as GPU 0 (layers 1-4)
    participant GPU1 as GPU 1 (layers 5-8)
    participant GPU2 as GPU 2 (layers 9-12)
    participant GPU3 as GPU 3 (layers 13-16)
    Note over GPU0,GPU3: Batch 1
    GPU0->>GPU0: forward pass (layers 1-4)
    GPU0->>GPU1: send activations
    GPU1->>GPU1: forward pass (layers 5-8)
    GPU1->>GPU2: send activations
    GPU2->>GPU2: forward pass (layers 9-12)
    GPU2->>GPU3: send activations
    GPU3->>GPU3: forward pass (layers 13-16)
    Note over GPU0,GPU3: Batch 2 (pipelined)
    GPU0->>GPU0: forward pass (layers 1-4)
    GPU0->>GPU1: send activations
    GPU1->>GPU1: forward pass (layers 5-8)
    GPU1->>GPU2: send activations
```
4.3 Expert Parallelism
```mermaid
graph TB
    subgraph "MoE layer structure"
        A[Input tokens] --> B[Gating network]
        B --> C[Expert selection]
        C --> D[Expert 0 - GPU 0]
        C --> E[Expert 1 - GPU 0]
        C --> F[Expert 2 - GPU 1]
        C --> G[Expert 3 - GPU 1]
        C --> H[Expert N - GPU M]
        D --> I[AllToAll communication]
        E --> I
        F --> I
        G --> I
        H --> I
        I --> J[Expert output aggregation]
        J --> K[Final output]
    end
    subgraph "Load balancing"
        L[Token distribution] --> M[Expert load monitoring]
        M --> N[Dynamic routing adjustment]
        N --> C
    end
```
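The gating step can be sketched as a generic top-2 router. `top2_gate` is a hypothetical helper showing expert selection and combine weights; in expert parallelism the selected tokens would then be exchanged between GPUs via AllToAll:

```python
import torch
import torch.nn.functional as F

def top2_gate(hidden: torch.Tensor, gate_w: torch.Tensor):
    """Toy top-2 MoE router: per-token expert ids and combine weights.

    hidden: [tokens, dim], gate_w: [dim, num_experts].
    """
    logits = hidden @ gate_w                               # gating network
    probs = F.softmax(logits, dim=-1)
    weights, expert_ids = probs.topk(2, dim=-1)            # expert selection (top-2)
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize combine weights
    return expert_ids, weights

ids, w = top2_gate(torch.randn(8, 512), torch.randn(512, 16))
print(ids.shape, w.shape)  # [8, 2] experts and weights per token
```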
5. Quantization Architecture
5.1 Quantization Strategy Hierarchy
```mermaid
graph TB
    subgraph "Quantization algorithms"
        A[Weight quantization] --> A1[INT4 AWQ]
        A --> A2[INT8 GPTQ]
        A --> A3[FP8]
        A --> A4[FP4]
        B[Activation quantization] --> B1[INT8 SmoothQuant]
        B --> B2[FP8 dynamic quantization]
        C[KV cache quantization] --> C1[INT8 KV Cache]
        C --> C2[FP8 KV Cache]
        C --> C3[FP4 KV Cache]
    end
    subgraph "Quantization granularity"
        D[Per-Tensor] --> E[Global scale factor]
        F[Per-Channel] --> G[Per-channel scaling]
        H[Per-Group] --> I[Grouped quantization]
        J[Per-Token] --> K[Dynamic quantization]
    end
    A1 --> H
    A2 --> F
    A3 --> D
    B1 --> F
    B2 --> J
```
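At the API level, these choices are typically expressed through a quantization config. A hedged sketch, assuming the `QuantConfig`/`QuantAlgo` names exported by TensorRT-LLM's Python API (import paths and field names may differ across versions):

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantAlgo, QuantConfig

# FP8 weights/activations plus an FP8 KV cache, per the hierarchy above.
quant_config = QuantConfig(
    quant_algo=QuantAlgo.FP8,
    kv_cache_quant_algo=QuantAlgo.FP8,
)

# The LLM API triggers calibration/quantization as needed during engine build.
llm = LLM(model="meta-llama/Llama-2-7b-hf", quant_config=quant_config)
```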
5.2 Quantization Execution Flow
```mermaid
sequenceDiagram
    participant Model
    participant QuantConfig
    participant Quantizer
    participant Calibrator
    participant Engine
    Model->>QuantConfig: create quantization config
    QuantConfig->>Quantizer: initialize quantizer
    alt Methods requiring calibration
        Quantizer->>Calibrator: create calibrator
        Calibrator->>Calibrator: collect activation statistics
        Calibrator->>Quantizer: return quantization parameters
    end
    Quantizer->>Model: quantize weights
    Quantizer->>Model: insert quantize/dequantize nodes
    Model->>Engine: build quantized engine
    Engine->>Engine: optimize quantized graph
```
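The core arithmetic behind weight quantization is simple. The sketch below shows symmetric per-channel INT8 quantization (one scale factor per output channel, matching the Per-Channel granularity in 5.1); calibration plays the analogous role of choosing scales for activations:

```python
import torch

def quantize_per_channel_int8(w: torch.Tensor):
    """Symmetric per-channel INT8 weight quantization.

    q = round(w / scale), w ≈ q * scale, with one scale per output row.
    """
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0            # per-channel scale
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

w = torch.randn(4096, 4096)
q, scale = quantize_per_channel_int8(w)
dequant = q.float() * scale                   # reconstruction for reference
print((w - dequant).abs().max())              # worst-case quantization error
```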
6. Optimization Strategy Architecture
6.1 Compute Optimization
```mermaid
graph TB
    subgraph "Operator fusion"
        A[LayerNorm + Linear] --> A1[Fused op]
        B[GELU + Linear] --> B1[Fused op]
        C[Attention computation] --> C1[FlashAttention]
        D[MoE routing] --> D1[Fused expert computation]
    end
    subgraph "Memory optimization"
        E[Activation recomputation] --> F[Lower memory footprint]
        G[Gradient checkpointing] --> H[Compute/memory trade-off]
        I[Paged KV cache] --> J[Dynamic memory management]
    end
    subgraph "Scheduling optimization"
        K[Batch scheduling] --> L[Dynamic batching]
        M[Request scheduling] --> N[Priority queues]
        O[Resource scheduling] --> P[GPU utilization optimization]
    end
```
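The batch-scheduling box deserves a concrete illustration. Below is one toy step of in-flight (continuous) batching; the names (`running`, `waiting`, `step_fn`) are hypothetical and only show why new requests can join the batch mid-flight instead of waiting for a full batch to drain:

```python
from collections import deque

def continuous_batching_step(running: list, waiting: deque, max_batch: int, step_fn):
    """One iteration of a toy in-flight (continuous) batching loop.

    `step_fn` advances every running sequence by one token and returns the
    sequences that just finished (e.g. emitted EOS).
    """
    while waiting and len(running) < max_batch:  # admit new requests into free slots
        running.append(waiting.popleft())
    finished = step_fn(running)                  # one forward pass for the whole batch
    for seq in finished:                         # retire finished sequences immediately,
        running.remove(seq)                      # freeing slots for the next iteration
    return finished
```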
6.2 Communication Optimization
```mermaid
graph LR
    subgraph "Communication patterns"
        A[AllReduce] --> B[Ring]
        A --> C[Tree]
        D[AllToAll] --> E[Expert-parallel communication]
        F[P2P] --> G[Pipeline communication]
    end
    subgraph "Communication optimizations"
        H[Communication overlap] --> I[Overlap compute and communication]
        J[Communication compression] --> K[Gradient compression]
        L[Communication scheduling] --> M[Bandwidth-aware scheduling]
    end
    subgraph "Network topology"
        N[NVLink] --> O[High-bandwidth interconnect]
        P[InfiniBand] --> Q[Inter-node communication]
        R[Ethernet] --> S[Standard networking]
    end
    B --> H
    C --> H
    E --> J
    G --> L
```
7. System Sequence Architecture
7.1 Initialization Sequence
```mermaid
sequenceDiagram
    participant User
    participant LLM
    participant MPI
    participant Executor
    participant Engine
    participant GPU
    User->>LLM: LLM(model_path)
    LLM->>LLM: parse arguments
    alt Multi-GPU mode
        LLM->>MPI: start MPI session
        MPI->>MPI: initialize process group
    end
    LLM->>Executor: create executor
    Executor->>Engine: load engine
    Engine->>GPU: allocate GPU memory
    GPU->>Engine: memory allocation complete
    Engine->>Executor: engine ready
    Executor->>LLM: executor ready
    LLM->>User: initialization complete
```
7.2 Inference Sequence
```mermaid
sequenceDiagram
    participant User
    participant LLM
    participant Executor
    participant Scheduler
    participant Engine
    participant KVCache
    User->>LLM: generate(prompt)
    LLM->>LLM: preprocess input
    LLM->>Executor: submit(request)
    Executor->>Scheduler: schedule request
    Scheduler->>KVCache: allocate cache blocks
    Scheduler->>Engine: run inference
    loop Autoregressive generation
        Engine->>Engine: forward pass
        Engine->>KVCache: update cache
        Engine->>Scheduler: return logits
        Scheduler->>Scheduler: sampling decision
        alt Not finished
            Scheduler->>Engine: continue generation
        else Finished
            Scheduler->>Executor: return result
        end
    end
    Executor->>LLM: GenerationResult
    LLM->>User: RequestOutput
```
8. Error Handling Architecture
8.1 Exception Handling Hierarchy
```mermaid
graph TB
    subgraph "User-level exceptions"
        A[Argument validation errors] --> A1[Config exceptions]
        A --> A2[Input format errors]
        A --> A3[Insufficient resources]
    end
    subgraph "Executor-level exceptions"
        B[Request handling errors] --> B1[Queue full]
        B --> B2[Timeouts]
        B --> B3[IPC errors]
    end
    subgraph "Engine-level exceptions"
        C[Inference execution errors] --> C1[CUDA errors]
        C --> C2[Out of memory]
        C --> C3[Compute errors]
    end
    subgraph "System-level exceptions"
        D[Hardware faults] --> D1[GPU failure]
        D --> D2[Network failure]
        D --> D3[Storage failure]
    end
    A1 --> E[Exception capture]
    B1 --> E
    C1 --> E
    D1 --> E
    E --> F[Error recovery]
    F --> G[User feedback]
```
8.2 Fault Tolerance Mechanisms
```mermaid
graph LR
    subgraph "Detection"
        A[Health checks] --> B[Periodic probing]
        C[Exception monitoring] --> D[Real-time monitoring]
        E[Performance monitoring] --> F[Metrics collection]
    end
    subgraph "Recovery strategies"
        G[Retry mechanism] --> H[Exponential backoff]
        I[Failover] --> J[Standby resources]
        K[Service degradation] --> L[Basic functionality only]
    end
    subgraph "Prevention"
        M[Resource reservation] --> N[Memory headroom]
        O[Load limiting] --> P[Request throttling]
        Q[Graceful shutdown] --> R[Resource cleanup]
    end
    B --> G
    D --> I
    F --> K
```
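Retry with exponential backoff, the first recovery strategy above, fits in a dozen lines. A generic sketch; the exception types are placeholders for whatever the caller treats as transient:

```python
import random
import time

def retry_with_backoff(fn, max_attempts: int = 5, base: float = 0.1, cap: float = 5.0):
    """Retry a transient-failure-prone call with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts - 1:
                raise                                     # give up: surface to caller
            delay = min(cap, base * 2 ** attempt)         # exponential growth, capped
            time.sleep(delay * random.uniform(0.5, 1.5))  # jitter avoids thundering herd
```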
9. Performance Monitoring Architecture
9.1 Metrics Collection System
```mermaid
graph TB
    subgraph "System metrics"
        A[GPU utilization] --> A1[Compute utilization]
        A --> A2[Memory utilization]
        A --> A3[Temperature monitoring]
        B[Network metrics] --> B1[Bandwidth usage]
        B --> B2[Latency monitoring]
        B --> B3[Packet loss rate]
    end
    subgraph "Service metrics"
        C[Throughput] --> C1[QPS]
        C --> C2[Tokens/s]
        D[Latency metrics] --> D1[End-to-end latency]
        D --> D2[Time to first token]
        D --> D3[Per-token generation latency]
        E[Quality metrics] --> E1[Accuracy]
        E --> E2[Consistency]
    end
    subgraph "Resource metrics"
        F[Memory usage] --> F1[Peak memory]
        F --> F2[Memory fragmentation]
        G[Cache efficiency] --> G1[KV cache hit rate]
        G --> G2[Cache utilization]
    end
    A1 --> H[Metrics aggregation]
    C1 --> H
    F1 --> H
    H --> I[Monitoring dashboard]
```
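Two of the key service metrics, time to first token (TTFT) and decode throughput, can be measured client-side against any streaming endpoint. A generic sketch; `stream` is any iterable yielding one token or chunk at a time:

```python
import time

def measure_streaming_latency(stream):
    """Measure TTFT and decode throughput over one streaming generation."""
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    for _ in stream:
        n_tokens += 1
        if ttft is None:
            ttft = time.perf_counter() - start       # time to first token
    total = time.perf_counter() - start
    # Decode throughput excludes the context phase that precedes the first token.
    decode_tps = (n_tokens - 1) / (total - ttft) if n_tokens > 1 else 0.0
    return {"ttft_s": ttft, "e2e_s": total, "decode_tok_per_s": decode_tps}
```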
This architecture document has covered TensorRT-LLM's system architecture, core components, data flow, parallelism strategies, quantization mechanisms, optimization strategies, sequencing, error handling, and performance monitoring, providing a comprehensive technical view of the framework's design rationale and implementation.