279 Commits

Author SHA1 Message Date
Thomas Parnell
6bd1dd9d26
[Kernel] [V1] Improved performance for V1 Triton (ROCm) backend (#14152)
2025-03-06 07:39:16 -08:00
Lucas Wilkinson
f6bb18fd9a
[BugFix] MLA + V1, illegal memory access and accuracy issues (#14253)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-03-05 17:10:13 -08:00
Nick Hill
ac60dc7fe1
[V1][BugFix] Fix for mixed top_k batch (#14301)
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Ye Cao <caoye.cao@alibaba-inc.com>
2025-03-05 20:43:04 +00:00
Vincent
a4f1ee35d6
Deprecate best_of Sampling Parameter in anticipation for vLLM V1 (#13997)
Signed-off-by: vincent-4 <vincentzhongy+githubvincent4@gmail.com>
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-05 20:22:43 +00:00
Nick Hill
a32c8669ca
[V1][Minor] Remove obsolete FIXME comment (#14304)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-03-05 11:59:23 -08:00
Robert Shaw
257e200a25
[V1][Frontend] Add Testing For V1 Runtime Parameters (#14159)
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2025-03-05 14:18:55 +00:00
Lu Fang
8d6cd32b7b
[Bugfix][V1] Fix allowed_token_ids for v1 Sampler (#14169)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-03-05 08:49:44 +00:00
Roger Wang
ec79b67c77
[Misc][V1] Avoid using envs.VLLM_USE_V1 in mm processing (#14256)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-03-05 07:37:16 +00:00
Tyler Michael Smith
72c62eae5f
[V1] EP/TP MoE + DP Attention (#13931)
2025-03-04 21:27:26 -08:00
Cody Yu
ade3f7d988
[V1][Bugfix] Do not reset prefix caching metrics (#14235)
2025-03-05 04:39:13 +00:00
Michael Goin
fbfc3ee37e
[V1][TPU] TPU multimodal model support for ragged attention (#14158)
Signed-off-by: Michael Goin <mgoin64@gmail.com>
2025-03-04 19:58:48 -05:00
Siyuan Liu
beebf4742a
[TPU][Profiler] Support start_profile/stop_profile in TPU worker (#13988)
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-03-04 14:40:06 -05:00
Nick Hill
5db6b2c961
[V1][BugFix] Fix remaining sync engine client shutdown errors/hangs (#13869)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-03-04 15:06:47 +00:00
iefgnoix
79e4937c65
[v1] Add comments to the new ragged paged attention Pallas kernel (#14155)
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-03-03 23:00:55 +00:00
Mark McLoughlin
ae122b1cbd
[WIP][V1][Metrics] Implement max_num_generation_tokens, request_params_n, and request_params_max_tokens metrics (#14055)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-03-03 19:04:45 +00:00
Nick Hill
872db2be0e
[V1] Simplify stats logging (#14082)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-03-03 10:34:14 -08:00
Mark McLoughlin
4167252eaf
[V1] Refactor parallel sampling support (#13774)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-03-03 08:15:27 -08:00
Harry Mellor
cf069aa8aa
Update deprecated Python 3.8 typing (#13971)
2025-03-02 17:34:51 -08:00
Jun Duan
82fbeae92b
[Misc] Accurately capture the time of loading weights (#14063)
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
2025-03-01 17:20:30 -08:00
Chen Zhang
d54990da47
[v1] Add __repr__ to KVCacheBlock to avoid recursive print (#14081)
2025-03-01 20:46:02 +00:00
Chen Zhang
b9f1d4294e
[v1][Bugfix] Only cache blocks that are not in the prefix cache (#14073)
2025-03-01 08:25:54 +00:00
Sage Moore
b28246f6ff
[ROCm][V1][Bugfix] Add get_builder_cls method to the ROCmAttentionBackend class (#14065)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-03-01 07:18:32 +00:00
Li, Jiang
02296f420d
[Bugfix][V1][Minor] Fix shutting_down flag checking in V1 MultiprocExecutor (#14053)
2025-02-28 22:31:01 -08:00
Chen Zhang
28943d36ce
[v1] Move block pool operations to a separate class (#13973)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2025-02-28 20:53:31 +00:00
Chen Zhang
e7bd944e08
[v1] Cleanup the BlockTable in InputBatch (#13977)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-02-28 19:03:16 +00:00
iefgnoix
c3b6559a10
[V1][TPU] Integrate the new ragged paged attention kernel with vLLM v1 on TPU (#13379)
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-02-28 11:01:36 -07:00
Lucas Wilkinson
2e94b9cfbb
[Attention] Flash MLA for V1 (#13867)
Signed-off-by: Yang Chen <yangche@fb.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Yang Chen <yangche@fb.com>
2025-02-27 23:03:41 +00:00
Woosuk Kwon
cd813c6d4d
[V1][Minor] Minor cleanup for GPU Model Runner (#13983)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-27 13:11:40 -08:00
Yang Chen
58d1b2aa77
[Attention] MLA support for V1 (#13789)
Signed-off-by: Yang Chen <yangche@fb.com>
2025-02-27 13:14:17 -05:00
Mark McLoughlin
cd711c48b2
[V1][Metrics] Handle preemptions (#13169)
2025-02-26 20:04:59 -08:00
Lily Liu
5629f26df7
[V1][Spec Decode] Change Spec Decode Rejection Sampling API (#13729)
2025-02-25 18:14:48 -08:00
Varun Sundar Rabindranath
03f48b3db6
[Core] LoRA V1 - Add add/pin/list/remove_lora functions (#13705)
2025-02-25 00:18:02 -08:00
Mark McLoughlin
bc32bc73aa
[V1][Metrics] Implement vllm:lora_requests_info metric (#13504)
2025-02-24 20:01:33 -08:00
cjackal
51010a1807
[Misc] set single whitespace between log sentences (#13771)
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com>
2025-02-25 10:26:12 +08:00
Harry Mellor
cdc1fa12eb
Remove unused kwargs from model definitions (#13555)
2025-02-24 17:13:52 -08:00
Roger Wang
227578480d
Revert "[V1][Core] Fix memory issue with logits & sampling" (#13775)
2025-02-24 09:16:05 -08:00
afeldman-nm
befc402d34
[V1] V1 engine implements parallel sampling (AsyncLLM and LLMEngine) (#10980)
Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-02-24 08:29:41 -08:00
Roger Wang
437b76ff59
[V1][Core] Fix memory issue with logits & sampling (#13721)
2025-02-24 06:10:06 -08:00
Nick Hill
cbae7af552
[V1][BugFix] Fix engine core client shutdown hangs (#13298)
Even though ZMQ context.destroy() is meant to close open sockets before terminating the context, closing them explicitly appears to be necessary; otherwise shutdown can hang in the context.term() method.

Close zmq sockets explicitly before terminating the context, make shutdown of client resources more robust, and shut down the engine core process before terminating the zmq context.

Signed-off-by: Nick Hill <nhill@redhat.com>
2025-02-23 13:07:43 -08:00
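The shutdown ordering described in the commit message for #13298 can be sketched roughly as follows. This is a minimal illustration, not the code from that commit: the function name `shutdown_client`, its parameters, and the choice to stop the engine process before closing the sockets are assumptions made for the example. The objects are duck-typed, so the sketch expects a pyzmq-style context (`term()`), pyzmq-style sockets (`closed`, `close(linger=...)`), and a `multiprocessing.Process`-style handle (`is_alive()`, `terminate()`, `join()`).

```python
def shutdown_client(ctx, sockets, engine_proc=None, linger=0, join_timeout=5.0):
    """Hypothetical helper: tear down client resources in a fixed order.

    Assumed order (only the "before ctx.term()" parts come from the commit
    message; the relative order of steps 1 and 2 is an assumption):
      1. stop the engine core process,
      2. close every socket explicitly,
      3. terminate the context.
    """
    # 1. Stop the engine core process before the context is terminated.
    if engine_proc is not None and engine_proc.is_alive():
        engine_proc.terminate()
        engine_proc.join(timeout=join_timeout)

    # 2. Close sockets explicitly rather than relying on the context to do it.
    #    linger=0 discards unsent messages so close() does not wait on them.
    for sock in sockets:
        if not sock.closed:
            sock.close(linger=linger)

    # 3. With no open sockets left, term() has nothing to block on.
    ctx.term()
```

With pyzmq, `ctx` would be a `zmq.Context` and `sockets` the sockets created from it via `ctx.socket(...)`; `engine_proc` would be whatever process handle the client holds.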
youkaichao
eb24dc4a45
[v1] torchrun compatibility (#13642)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-23 22:47:24 +08:00
Sage Moore
558db8083c
[V1][Kernel] Refactor the prefix_prefill kernel so that the caller no longer has to pass in the context lengths (#13095)
2025-02-22 05:25:41 -08:00
youkaichao
2382ad29d1
[ci] fix linter (#13701)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-22 20:28:59 +08:00
youkaichao
3e472d882a
[core] set up data parallel communication (#13591)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-22 19:28:59 +08:00
Cyrus Leung
7f6bae561c
[CI/Build] Fix pre-commit errors (#13696)
2025-02-22 00:31:26 -08:00
Mark McLoughlin
2cb8c1540e
[Metrics] Add --show-hidden-metrics-for-version CLI arg (#13295)
2025-02-22 00:20:45 -08:00
Mark McLoughlin
1cd981da4f
[V1][Metrics] Support vllm:cache_config_info (#13299)
2025-02-22 00:20:00 -08:00
Jennifer Zhao
da31b5333e
[Bugfix] V1 Memory Profiling: V0 Sampler Integration without Rejection Sampler (#13594)
Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2025-02-22 00:08:29 -08:00
Lu Fang
bb78fb318e
[v1] Support allowed_token_ids in v1 Sampler (#13210)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-02-22 14:13:05 +08:00
Jun Duan
68d535ef44
[Misc] Capture and log the time of loading weights (#13666)
2025-02-21 22:06:34 -08:00
Lucas Wilkinson
288cc6c234
[Attention] MLA with chunked prefill (#12639)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Patrick Horn <patrick.horn@gmail.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-21 15:30:12 -08:00