1. Overall Request Processing Sequence Diagram
1.1 Complete Inference Flow
```mermaid
sequenceDiagram
participant User as User Application
participant LLM as LLM Class
participant Proc as Processor
participant Engine as EngineCore
participant Sched as Scheduler
participant KVMgr as KVCacheManager
participant Exec as ModelExecutor
participant Worker as GPUWorker
participant Model as Model
participant Attn as AttentionBackend
Note over User,Attn: 1. Request initialization phase
User->>LLM: generate(prompts, sampling_params)
LLM->>LLM: validate_inputs()
LLM->>Proc: process_requests()
Note over Proc: Input preprocessing
Proc->>Proc: validate_params()
Proc->>Proc: tokenize_prompts()
Proc->>Proc: process_multimodal_data()
Proc->>Engine: add_requests()
Note over Engine,Attn: 2. Scheduling and execution loop
loop Inference loop [until all requests finish]
Engine->>Sched: schedule()
Note over Sched,KVMgr: Scheduling decision phase
Sched->>KVMgr: check_available_memory()
KVMgr-->>Sched: memory_status
Sched->>Sched: select_requests_to_run()
Sched->>KVMgr: allocate_kv_cache()
KVMgr-->>Sched: allocated_blocks
Note over Sched,Attn: Model execution phase
Sched->>Exec: execute_model_batch()
Exec->>Worker: prepare_inputs()
Worker->>Worker: build_input_tensors()
Worker->>Model: forward()
Model->>Attn: attention_forward()
Note over Attn: PagedAttention computation
Attn->>Attn: paged_attention_v2()
Attn-->>Model: attention_output
Model->>Model: feedforward()
Model-->>Worker: logits
Worker->>Worker: sample_tokens()
Worker-->>Exec: new_tokens
Exec-->>Sched: execution_results
Note over Sched: State update phase
Sched->>Sched: update_request_states()
Sched->>KVMgr: update_kv_cache()
Sched-->>Engine: scheduler_output
end
Note over Engine,LLM: 3. Result return phase
Engine->>Engine: collect_finished_requests()
Engine-->>LLM: request_outputs
LLM->>LLM: post_process_outputs()
LLM-->>User: generated_texts
```
Key phase notes:
Request initialization (typically milliseconds; multimodal inputs can add on the order of 100 ms, see the per-step timings in 2.1):
- Input validation and preprocessing
- Tokenization and multimodal data processing
- Request object creation
Scheduling and execution loop (where most of the time is spent):
- Each iteration performs one or more inference steps
- Covers scheduling decisions, memory management, and model execution
- Dynamically adjusts batch size and memory allocation
Result return (<1 s):
- Collect finished requests
- Post-process and format outputs
A minimal usage example follows.
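For orientation, here is the user-facing entry point that drives the whole diagram above: vLLM's offline `LLM` API. The model name is only an example.

```python
from vllm import LLM, SamplingParams

# Constructing LLM loads the model and initializes EngineCore,
# the scheduler, and the KV cache.
llm = LLM(model="facebook/opt-125m")  # any supported model name

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() drives the entire sequence diagram: preprocessing,
# the schedule/execute loop, and result collection.
outputs = llm.generate(["The capital of France is"], sampling_params)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```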
2. Detailed Sub-flow Analysis
2.1 Input Processing Sequence Diagram
```mermaid
sequenceDiagram
participant LLM as LLM Class
participant Proc as Processor
participant PreProc as InputPreprocessor
participant Tokenizer as Tokenizer
participant MMProc as MultiModalProcessor
Note over LLM,MMProc: Detailed input preprocessing flow
LLM->>Proc: process_request(prompt, params)
Proc->>Proc: _validate_params(params)
alt Parameter validation fails
Proc-->>LLM: ValidationError
else Parameter validation succeeds
Proc->>PreProc: preprocess(prompt, lora_request)
alt Text input
PreProc->>Tokenizer: encode(prompt)
Tokenizer-->>PreProc: token_ids
else Token input
PreProc->>PreProc: validate_token_ids()
end
alt Contains multimodal data
PreProc->>MMProc: process_multimodal_data()
MMProc->>MMProc: process_images()
MMProc->>MMProc: process_audio()
MMProc->>MMProc: extract_features()
MMProc-->>PreProc: mm_inputs
end
PreProc->>PreProc: combine_inputs()
PreProc-->>Proc: ProcessorInputs
Proc-->>LLM: processed_request
end
```
Processing steps in detail:
Parameter validation (~1 ms):
- Check sampling parameter ranges
- Validate structured-output configuration
- Confirm token ID validity
Text preprocessing (1-10 ms):
- Tokenization
- Special-token insertion
- Sequence length checks
Multimodal processing (10-100 ms):
- Image feature extraction
- Audio preprocessing
- Feature fusion
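A minimal sketch of this validate → tokenize → combine pipeline. All names are illustrative (`ProcessorInputs` here is a stand-in for vLLM's internal structures, and the multimodal step is stubbed out):

```python
from dataclasses import dataclass, field

@dataclass
class ProcessorInputs:
    """Stand-in for the ProcessorInputs object in the diagram."""
    token_ids: list
    mm_inputs: dict = field(default_factory=dict)

def preprocess(prompt, tokenizer, params, mm_data=None, max_len=4096):
    # 1. Parameter validation (~1 ms): reject out-of-range sampling params.
    if not 0.0 <= params.get("temperature", 1.0) <= 2.0:
        raise ValueError("temperature out of range")
    # 2. Text vs. pre-tokenized input (the alt branch in the diagram).
    if isinstance(prompt, str):
        token_ids = tokenizer.encode(prompt)            # 1-10 ms
    else:
        token_ids = list(prompt)
        if any(t < 0 or t >= tokenizer.vocab_size for t in token_ids):
            raise ValueError("invalid token id")
    if len(token_ids) > max_len:
        raise ValueError("prompt exceeds the maximum sequence length")
    # 3. Optional multimodal branch (10-100 ms); feature extraction elided.
    mm_inputs = {"raw": mm_data} if mm_data is not None else {}
    # 4. combine_inputs(): bundle everything for the engine.
    return ProcessorInputs(token_ids=token_ids, mm_inputs=mm_inputs)
```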
2.2 Detailed Scheduler Sequence Diagram
```mermaid
sequenceDiagram
participant Engine as EngineCore
participant Sched as Scheduler
participant ReqQueue as RequestQueue
participant KVMgr as KVCacheManager
participant BlockPool as BlockPool
participant PrefixCache as PrefixCache
Note over Engine,PrefixCache: Scheduler internals
Engine->>Sched: schedule()
Note over Sched,PrefixCache: 1. Collect schedulable requests
Sched->>ReqQueue: get_waiting_requests()
ReqQueue-->>Sched: waiting_requests
Sched->>ReqQueue: get_running_requests()
ReqQueue-->>Sched: running_requests
Note over Sched,PrefixCache: 2. Compute memory budget
Sched->>KVMgr: get_available_blocks()
KVMgr->>BlockPool: get_free_blocks()
BlockPool-->>KVMgr: free_count
KVMgr-->>Sched: available_memory
Note over Sched,PrefixCache: 3. Request selection and resource allocation
loop For each pending request
Sched->>Sched: estimate_memory_cost(request)
alt Enough memory
Sched->>KVMgr: allocate_blocks(request)
KVMgr->>PrefixCache: find_reusable_blocks()
PrefixCache-->>KVMgr: cached_blocks
KVMgr->>BlockPool: allocate_new_blocks()
BlockPool-->>KVMgr: new_blocks
KVMgr-->>Sched: allocated_blocks
Sched->>Sched: add_to_batch(request)
else Not enough memory
Sched->>Sched: try_preemption()
alt Preemption possible
Sched->>KVMgr: free_blocks(preempted_request)
Sched->>ReqQueue: move_to_waiting(preempted_request)
else Preemption not possible
Sched->>Sched: skip_request()
end
end
end
Note over Sched,PrefixCache: 4. Produce scheduling output
Sched->>Sched: create_scheduler_output()
Sched-->>Engine: SchedulerOutput
```
Scheduling strategy notes:
Memory budget management:
- Dynamically adjusts batch size based on available GPU memory
- Accounts for KV-cache and activation memory (inference computes no gradients)
- Reserves a safety margin to prevent OOM
Preemptive scheduling:
- Higher-priority requests can preempt resources
- Partially completed requests can be paused and resumed
- Minimizes recomputation overhead
Prefix-cache optimization:
- Detects requests that share a common prefix
- Shares KV cache blocks between them
- Eliminates redundant computation
A condensed code sketch of this strategy follows.
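The sketch below compresses the loop above into a single scheduling step. All names are illustrative; vLLM's real scheduler adds priorities, chunked prefill, and more:

```python
def schedule_step(waiting, running, kv_mgr, max_batch_tokens=8192):
    """One scheduling iteration: keep running requests while the KV block
    budget allows (preempting otherwise), then admit waiting prefills."""
    batch = []
    budget = kv_mgr.num_free_blocks()
    # Decode requests first: each needs at most one new block per step.
    for req in running:
        need = kv_mgr.blocks_needed(req)
        if need <= budget:
            budget -= need
            batch.append(req)
        else:
            # Not enough memory: preempt, reclaim blocks, recompute later.
            budget += kv_mgr.free_blocks(req)
            waiting.insert(0, req)
    # Then admit waiting requests under the remaining budget.
    batch_tokens = sum(r.num_tokens for r in batch)
    for req in list(waiting):
        need = kv_mgr.blocks_needed(req)
        if need <= budget and batch_tokens + req.num_tokens <= max_batch_tokens:
            budget -= need
            batch_tokens += req.num_tokens
            waiting.remove(req)
            batch.append(req)
    return batch
```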
2.3 KV Cache Management Sequence Diagram
```mermaid
sequenceDiagram
participant Sched as Scheduler
participant KVMgr as KVCacheManager
participant BlockPool as BlockPool
participant PrefixCache as PrefixCache
participant HashFunc as BlockHasher
participant GPU as GPU Memory
Note over Sched,GPU: KV cache allocation flow
Sched->>KVMgr: allocate_blocks(seq_id, num_tokens)
Note over KVMgr,GPU: 1. Compute resource requirements
KVMgr->>KVMgr: calculate_blocks_needed(num_tokens)
KVMgr->>HashFunc: compute_prefix_hash(seq_id)
HashFunc-->>KVMgr: prefix_hash
Note over KVMgr,GPU: 2. Check the prefix cache
KVMgr->>PrefixCache: get_cached_blocks(prefix_hash)
PrefixCache-->>KVMgr: cached_block_ids
Note over KVMgr,GPU: 3. Allocate new blocks
KVMgr->>BlockPool: allocate(num_needed_blocks)
loop For each required block
BlockPool->>BlockPool: get_free_block()
alt Free block available
BlockPool->>GPU: allocate_gpu_memory(block_id)
GPU-->>BlockPool: memory_ptr
BlockPool-->>KVMgr: block_id
else No free block
BlockPool->>BlockPool: trigger_eviction()
alt Eviction succeeds
BlockPool->>GPU: free_gpu_memory(evicted_block)
BlockPool->>GPU: allocate_gpu_memory(block_id)
GPU-->>BlockPool: memory_ptr
BlockPool-->>KVMgr: block_id
else Eviction fails
BlockPool-->>KVMgr: OutOfMemoryError
end
end
end
Note over KVMgr,GPU: 4. Update mappings
KVMgr->>KVMgr: update_seq_to_blocks_mapping()
KVMgr->>PrefixCache: update_cache(prefix_hash, blocks)
KVMgr-->>Sched: allocated_block_ids
```
Cache management strategies:
Paged memory management:
- Fixed-size memory pages (typically 16-64 tokens per block)
- Dynamic allocation and release
- Supports non-contiguous memory layouts
Prefix-cache sharing:
- Content-hash-based cache keys
- Reference-counted block ownership
- LRU eviction policy
Memory eviction:
- Evicts inactive cache blocks first
- Supports CPU-GPU memory swapping
- Minimizes recomputation cost
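A compact sketch of the reference-counting and LRU-eviction mechanics described above, using illustrative names rather than vLLM's actual `BlockPool` API:

```python
from collections import OrderedDict

class BlockPool:
    """Toy block pool: refcounted blocks, LRU eviction at zero refs."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}          # block_id -> number of sequences using it
        self.lru = OrderedDict()    # cached (refcount 0) blocks, oldest first

    def allocate(self):
        if not self.free and self.lru:
            evicted, _ = self.lru.popitem(last=False)   # evict the oldest
            self.free.append(evicted)
        if not self.free:
            raise MemoryError("no free KV cache blocks")  # the OOM path
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        """Prefix-cache hit: reuse a block by bumping its refcount."""
        self.lru.pop(block, None)
        self.refcount[block] = self.refcount.get(block, 0) + 1
        return block

    def release(self, block):
        """Drop a reference; keep the block cached (evictable) at zero refs."""
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.lru[block] = True
```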
2.4 Model Execution Sequence Diagram
```mermaid
sequenceDiagram
participant Sched as Scheduler
participant Exec as ModelExecutor
participant Worker as GPUWorker
participant Runner as ModelRunner
participant Model as Model
participant Attn as AttentionBackend
participant KVCache as KVCache
participant Sampler as Sampler
Note over Sched,Sampler: Detailed model execution flow
Sched->>Exec: execute_model(scheduler_output)
Note over Exec,Sampler: 1. Input preparation
Exec->>Worker: prepare_model_inputs()
Worker->>Worker: build_input_tensors()
Worker->>Worker: prepare_attention_metadata()
Worker->>KVCache: get_kv_cache_refs()
KVCache-->>Worker: cache_references
Note over Exec,Sampler: 2. Distributed coordination
alt Multi-GPU parallelism
Exec->>Exec: broadcast_inputs_to_workers()
Exec->>Exec: synchronize_workers()
end
Note over Exec,Sampler: 3. Model forward pass
Worker->>Runner: execute_model()
Runner->>Model: forward()
Note over Model,Sampler: 3.1 Embedding layer
Model->>Model: embedding_forward()
Note over Model,Sampler: 3.2 Transformer layer loop
loop For each Transformer layer
Model->>Attn: attention_forward()
Note over Attn,KVCache: PagedAttention computation
Attn->>KVCache: get_kv_blocks()
KVCache-->>Attn: key_cache, value_cache
Attn->>Attn: apply_rotary_embedding()
Attn->>Attn: paged_attention_kernel()
Note over Attn: the kernel performs the scaled dot-product attention
Attn-->>Model: attention_output
Model->>Model: feedforward()
Model->>Model: layer_norm()
end
Note over Model,Sampler: 3.3 Output layer
Model->>Model: output_projection()
Model-->>Runner: logits
Note over Runner,Sampler: 4. Sampling and post-processing
Runner->>Sampler: sample_tokens()
Sampler->>Sampler: apply_temperature()
Sampler->>Sampler: top_k_top_p_filtering()
Sampler->>Sampler: multinomial_sampling()
Sampler-->>Runner: sampled_tokens
Runner->>Runner: update_sequences()
Runner-->>Worker: model_output
Worker-->>Exec: execution_result
Note over Exec,Sampler: 5. Distributed aggregation
alt Multi-GPU parallelism
Exec->>Exec: gather_outputs_from_workers()
Exec->>Exec: aggregate_results()
end
Exec-->>Sched: ModelRunnerOutput
```
Key points of model execution:
Input preparation (1-5 ms):
- Build batched input tensors
- Prepare attention masks and metadata
- Fetch KV cache references
Forward pass (10-1000 ms):
- Embedding computation
- Multi-layer Transformer processing
- Efficient PagedAttention computation
Sampling and post-processing (1-10 ms):
- Temperature scaling
- Top-k/top-p filtering
- Multinomial sampling
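The sampling pipeline in step 4 is standard enough to show concretely. This is a reference implementation of temperature → top-k → top-p → multinomial sampling in PyTorch, not vLLM's fused `Sampler`:

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=50, top_p=0.95):
    """Sample one token per row of `logits` ([batch, vocab])."""
    if temperature != 1.0:
        logits = logits / temperature                    # temperature scaling
    if top_k > 0:
        # Top-k: mask out everything below the k-th largest logit.
        kth = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    # Top-p (nucleus): keep the smallest prefix of sorted tokens whose
    # cumulative probability reaches p; zero out the rest.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    nucleus_mask = cumulative - sorted_probs > top_p
    sorted_probs[nucleus_mask] = 0.0
    sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
    # Multinomial sampling in sorted space, mapped back to token ids.
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx.gather(-1, choice)                 # [batch, 1]
```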
2.5 Detailed PagedAttention Sequence Diagram
```mermaid
sequenceDiagram
participant Model as Transformer Layer
participant Attn as AttentionBackend
participant KVMgr as KVCacheManager
participant BlockTable as Block Table
participant GPU as GPU Kernel
Note over Model,GPU: PagedAttention core flow
Model->>Attn: attention_forward(query, key, value, metadata)
Note over Attn,GPU: 1. Cache management
Attn->>KVMgr: update_kv_cache(key, value)
KVMgr->>BlockTable: get_block_tables()
BlockTable-->>KVMgr: block_mappings
KVMgr->>KVMgr: write_to_cache_blocks()
KVMgr-->>Attn: cache_updated
Note over Attn,GPU: 2. Attention computation
Attn->>Attn: prepare_attention_inputs()
Attn->>GPU: launch_paged_attention_kernel()
Note over GPU: 2.1 Kernel execution details
GPU->>GPU: load_query_from_global_mem()
loop For each attention head
loop For each sequence
GPU->>GPU: load_keys_from_paged_cache()
GPU->>GPU: compute_qk_scores()
GPU->>GPU: apply_softmax()
GPU->>GPU: load_values_from_paged_cache()
GPU->>GPU: compute_weighted_values()
end
GPU->>GPU: reduce_across_sequences()
end
GPU->>GPU: store_output_to_global_mem()
GPU-->>Attn: attention_output
Attn-->>Model: output_tensor
```
PagedAttention optimizations:
Memory access:
- Coalesced global memory accesses
- Maximal shared memory utilization
- Reduced memory-bandwidth bottlenecks
Parallelization:
- Head parallelism: attention heads computed in parallel
- Sequence parallelism: long sequences split into parallel chunks
- Batch parallelism: multiple sequences processed simultaneously
Cache friendliness:
- Optimized page-table indexing
- Optimized intra-block data layout
- Prefetching
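To make the paged access pattern concrete, here is a readable PyTorch sketch of single-step decode attention over a block table. The real CUDA kernel fuses these steps and never materializes contiguous K/V; the shapes and names here are assumptions:

```python
import torch

def paged_attention_decode(q, key_cache, value_cache, block_table, seq_len,
                           block_size=16):
    """Decode-step attention for one sequence over paged KV blocks.
    q:           [num_heads, head_dim]
    key_cache:   [num_blocks, block_size, num_heads, head_dim] (same for V)
    block_table: LongTensor of physical block ids for this sequence.
    """
    num_blocks_used = (seq_len + block_size - 1) // block_size
    # Gather this sequence's K/V blocks via the block table (the "paged" part).
    k = key_cache[block_table[:num_blocks_used]]     # [n, bs, H, D]
    v = value_cache[block_table[:num_blocks_used]]
    k = k.reshape(-1, *k.shape[2:])[:seq_len]        # [seq_len, H, D]
    v = v.reshape(-1, *v.shape[2:])[:seq_len]
    scale = q.shape[-1] ** -0.5
    # Scaled dot-product attention, one score row per head.
    scores = torch.einsum("hd,shd->hs", q, k) * scale
    weights = torch.softmax(scores, dim=-1)
    return torch.einsum("hs,shd->hd", weights, v)    # [num_heads, head_dim]
```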
3. Asynchronous Processing Sequence Diagrams
3.1 Async LLM Engine
```mermaid
sequenceDiagram
participant User as User Application
participant AsyncLLM as AsyncLLMEngine
participant Engine as EngineCore
participant BgLoop as Background Loop
participant Queue as Request Queue
Note over User,Queue: Asynchronous processing architecture
User->>AsyncLLM: generate_async(prompt)
AsyncLLM->>Queue: add_request()
AsyncLLM-->>User: Future[RequestOutput]
Note over BgLoop,Queue: Background processing loop
loop Background execution loop
BgLoop->>Queue: get_pending_requests()
Queue-->>BgLoop: request_batch
BgLoop->>Engine: process_batch()
Engine-->>BgLoop: outputs
BgLoop->>BgLoop: complete_futures()
BgLoop->>User: notify_completion()
end
User->>User: await_results()
User->>AsyncLLM: get_results()
```
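The queue-plus-background-loop pattern above can be sketched with plain asyncio. This mirrors the structure of the diagram, not vLLM's actual `AsyncLLMEngine` (which streams `RequestOutput`s rather than resolving a single future):

```python
import asyncio

class AsyncEngine:
    """Toy future-based engine front end: enqueue, await, batch, resolve."""

    def __init__(self, engine_step):
        self.engine_step = engine_step      # callable: list[prompt] -> outputs
        self.queue = asyncio.Queue()

    async def generate_async(self, prompt):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, fut))
        return await fut                    # resolved by the background loop

    async def background_loop(self):
        while True:
            batch = [await self.queue.get()]
            while not self.queue.empty():   # drain whatever else is pending
                batch.append(self.queue.get_nowait())
            outputs = self.engine_step([p for p, _ in batch])
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)         # the complete_futures() step
```

The caller starts the loop once with `asyncio.create_task(engine.background_loop())` and can then `await engine.generate_async(...)` from many coroutines concurrently; batching happens implicitly whenever requests pile up between loop iterations.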
3.2 Streaming Output Sequence Diagram
```mermaid
sequenceDiagram
participant User as User Application
participant Stream as StreamingIterator
participant Engine as EngineCore
participant Buffer as Output Buffer
Note over User,Buffer: Streaming output flow
User->>Stream: generate_stream(prompt)
Stream->>Engine: start_generation()
loop Streaming generation loop
Engine->>Engine: generate_next_token()
Engine->>Buffer: append_token()
Buffer->>Stream: notify_new_token()
Stream-->>User: yield token
User->>User: process_partial_result()
end
Engine->>Buffer: mark_complete()
Buffer->>Stream: notify_completion()
Stream-->>User: final_result
```
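A generator-based sketch of the streaming loop, with assumed engine method names:

```python
def generate_stream(engine, prompt, max_tokens=256):
    """Yield each token as soon as the engine produces it."""
    state = engine.start_generation(prompt)
    for _ in range(max_tokens):
        token = engine.generate_next_token(state)
        if token is None:                # engine marked the request complete
            break
        yield token                      # caller can render partial output

# Consumer side: process partial results as they arrive.
# for tok in generate_stream(engine, "Tell me a story"):
#     print(tok, end="", flush=True)
```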
4. Error Handling and Recovery Sequence Diagrams
4.1 OOM Error Handling
```mermaid
sequenceDiagram
participant Sched as Scheduler
participant KVMgr as KVCacheManager
participant BlockPool as BlockPool
participant Preemption as Preemption Manager
Note over Sched,Preemption: OOM error handling flow
Sched->>KVMgr: allocate_blocks()
KVMgr->>BlockPool: allocate()
BlockPool-->>KVMgr: OutOfMemoryError
KVMgr->>Preemption: trigger_preemption()
Preemption->>Preemption: select_preemption_candidates()
Preemption->>KVMgr: free_blocks(preempted_requests)
KVMgr->>BlockPool: release_blocks()
BlockPool-->>KVMgr: blocks_freed
KVMgr->>BlockPool: retry_allocation()
alt Retry succeeds
BlockPool-->>KVMgr: allocated_blocks
KVMgr-->>Sched: allocation_success
else Retry fails
KVMgr-->>Sched: allocation_failed
Sched->>Sched: defer_request()
end
```
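The recovery path in this diagram reduces to "free, retry, otherwise defer". A hedged sketch with illustrative names:

```python
def allocate_with_preemption(kv_mgr, request, scheduler):
    """On allocation failure, preempt other requests and retry once."""
    try:
        return kv_mgr.allocate_blocks(request)
    except MemoryError:
        victims = scheduler.select_preemption_candidates()
        if not victims:
            scheduler.defer_request(request)    # nothing to preempt
            return None
        for victim in victims:
            kv_mgr.free_blocks(victim)
            scheduler.move_to_waiting(victim)   # recomputed when rescheduled
        try:
            return kv_mgr.allocate_blocks(request)   # the retry path
        except MemoryError:
            scheduler.defer_request(request)
            return None
```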
4.2 Request Timeout Handling
```mermaid
sequenceDiagram
participant Timer as Timeout Timer
participant Engine as EngineCore
participant Request as Request Object
participant Cleanup as Cleanup Service
Note over Timer,Cleanup: Timeout handling mechanism
Timer->>Timer: check_request_timeouts()
Timer->>Engine: get_running_requests()
Engine-->>Timer: request_list
loop For each request
Timer->>Request: check_timeout()
alt Request timed out
Timer->>Engine: abort_request()
Engine->>Cleanup: cleanup_request_resources()
Cleanup->>Cleanup: free_kv_cache()
Cleanup->>Cleanup: release_gpu_memory()
Cleanup-->>Engine: cleanup_complete
Engine-->>Timer: request_aborted
else Request within deadline
Timer->>Timer: continue_monitoring()
end
end
```
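A minimal sketch of the periodic timeout sweep, assuming a monotonic `arrival_time` attribute on each request object (names illustrative):

```python
import time

def check_request_timeouts(engine, timeout_s=300.0):
    """Abort any running request whose age exceeds the deadline."""
    now = time.monotonic()
    for req in engine.get_running_requests():
        if now - req.arrival_time > timeout_s:
            engine.abort_request(req.request_id)   # frees its KV cache blocks
        # otherwise keep monitoring; this sweep runs on a periodic timer
```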
5. Performance Monitoring Sequence Diagram
5.1 Metric Collection Flow
```mermaid
sequenceDiagram
participant Engine as EngineCore
participant Collector as MetricCollector
participant Prometheus as Prometheus
participant Dashboard as Dashboard
Note over Engine,Dashboard: Performance monitoring flow
loop Metric collection loop
Engine->>Collector: collect_metrics()
Collector->>Collector: aggregate_request_metrics()
Collector->>Collector: calculate_throughput()
Collector->>Collector: measure_latency()
Collector->>Prometheus: push_metrics()
Prometheus-->>Dashboard: update_dashboard
end
Dashboard->>Dashboard: display_realtime_metrics()
Dashboard->>Dashboard: trigger_alerts_if_needed()
```
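With the real `prometheus_client` library, a collector like this takes only a few lines. The metric names below are illustrative (vLLM exports its own `vllm:*` metrics when run behind its OpenAI-compatible server); note also that Prometheus normally pulls from an HTTP endpoint rather than being pushed to, so the diagram's `push_metrics()` would require a Pushgateway:

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Metric definitions (illustrative names).
TOKENS_GENERATED = Counter("tokens_generated_total", "Total generated tokens")
RUNNING_REQUESTS = Gauge("running_requests", "Requests currently running")
STEP_LATENCY = Histogram("engine_step_seconds", "Engine step latency")

def record_step(num_tokens, num_running, step_seconds):
    """Called once per engine step to update the three metrics."""
    TOKENS_GENERATED.inc(num_tokens)
    RUNNING_REQUESTS.set(num_running)
    STEP_LATENCY.observe(step_seconds)

# Expose /metrics on port 8000 for Prometheus to scrape.
start_http_server(8000)
```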
These sequence diagrams show how vLLM behaves in different scenarios and help developers understand the system's dynamic behavior and performance characteristics. They can be used to:
- Identify performance bottlenecks: find the most expensive stages
- Optimize execution order: adjust when and how components interact
- Design fault tolerance: understand how errors propagate
- Monitor system health: build an effective observability setup
The diagrams also annotate key timings and resource costs, providing a useful reference for system tuning.