Tyler Michael Smith
ab153be252
take 2
...
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com>
2025-07-11 14:42:44 +00:00
Or Ozeri
cc876d0f29
[KVConnector] Aggregate finished requests on the scheduler (#19555)
...
Signed-off-by: Or Ozeri <oro@il.ibm.com>
2025-07-10 09:22:18 +01:00
Chenyaaang
fdfd409f8f
[TPU][Core]Make load weight exceed hbm error more instructive for customers (#20644)
...
Signed-off-by: Chenyaaang <chenyangli@google.com>
2025-07-10 07:01:17 +00:00
Nick Hill
59389c927b
[BugFix][CPU] Fix CPU worker dependency on cumem_allocator (#20696)
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-07-10 14:24:20 +08:00
Woosuk Kwon
7c12a765aa
[Misc] Simplify the prefix caching logic on draft tokens (#20701)
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-07-09 14:48:35 -07:00
Tuan, Hoang-Trong
47043eb678
[Kernel] Triton implementation of causal-conv1d for Mamba-based models (#18218)
...
Signed-off-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com>
Co-authored-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-07-09 12:53:55 -07:00
Liangliang Ma
a3e4e85ece
[XPU][CI] enhance xpu test support (#20652)
...
Signed-off-by: Ma, Liangliang <liangliang.ma@intel.com>
Co-authored-by: zhenwei-intel <zhenweiliu@habana.ai>
2025-07-09 16:53:09 +00:00
B-201
6bbf1795b7
[Misc] Fix the size of batched_dummy_mm_inputs in profile_run (#20434)
...
Signed-off-by: bk-201 <joy25810@foxmail.com>
2025-07-08 20:15:44 -07:00
Akash kaothalkar
6db31e7a27
[Hardware][PPC64LE] Enable V1 for ppc64le and ARM (#20554)
...
Signed-off-by: Akash Kaothalkar <akash.kaothalkar@ibm.com>
Co-authored-by: Akash Kaothalkar <akash.kaothalkar@ibm.com>
Co-authored-by: Nikhil Gupta <nikhil.gupta2@arm.com>
2025-07-08 20:00:41 -07:00
Ratnam Parikh
c40784c794
[BugFix][Intel GPU] Use refactored API for dist_backend in V1 worker (#20596)
...
Signed-off-by: ratnampa <ratnam.parikh@intel.com>
2025-07-08 19:44:23 -07:00
Kunshang Ji
0b407479ef
[misc]refactor Platform.set_device method (#20262)
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2025-07-09 01:39:47 +00:00
Sanger Steel
72d14d0eed
[Frontend] [Core] Integrate Tensorizer in to S3 loading machinery, allow passing arbitrary arguments during save/load (#19619)
...
Signed-off-by: Sanger Steel <sangersteel@gmail.com>
Co-authored-by: Eta <esyra@coreweave.com>
2025-07-07 22:47:43 -07:00
Chenyaaang
e34d130c16
[TPU] Temporary fix vmem oom for long model len by reducing page size (#20278)
...
Signed-off-by: Chenyaaang <chenyangli@google.com>
2025-07-08 05:16:16 +00:00
Li, Jiang
7721ef1786
[CI/Build][CPU] Fix CPU CI and remove all CPU V0 files (#20560)
...
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-07-07 22:13:44 -07:00
Woosuk Kwon
31c5d0a1b7
[Optimize] Don't send token ids when kv connector is not used (#20586)
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-07-07 19:04:54 -07:00
Rui Qiao
a6d795d593
[DP] Copy environment variables to Ray DPEngineCoreActors (#20344)
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-07-07 10:14:22 -07:00
Peter Pan
edd270bc78
[Bugfix] Prevent IndexError for cached requests when pipeline parallelism is disabled (#20486)
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
2025-07-07 09:41:15 -07:00
Yang Yang
6e2c19ce22
[Refactor]Abstract Platform Interface for Distributed Backend and Add xccl Support for Intel XPU (#19410)
...
Signed-off-by: dbyoung18 <yang5.yang@intel.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2025-07-07 04:32:32 +00:00
Woosuk Kwon
9528e3a05e
[BugFix][Spec Decode] Fix spec token ids in model runner (#20530)
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-07-06 19:44:52 +00:00
Cyrus Leung
9fb52e523a
[V1] Support any head size for FlexAttention backend (#20467)
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-06 09:54:36 -07:00
Isotr0py
32c9be2200
[v1] Re-add fp32 support to v1 engine through FlexAttention (#19754)
...
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-07-05 09:41:10 +00:00
Thomas Parnell
2f35a022e6
Enable V1 for Hybrid SSM/Attention Models (#20016)
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Stanislaw Wozniak <stw@zurich.ibm.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
2025-07-04 17:46:53 +00:00
Jee Jee Li
1caca5a589
[Misc] Add SPDX-FileCopyrightText (#20428)
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-07-04 07:40:42 +00:00
Aaron Pham
4a98edff1f
[Structured Outputs][V1] Skipping with models doesn't contain tokenizers (#20365)
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-07-04 15:05:49 +08:00
Nicolò Lucchesi
8d775dd30a
[Misc] Fix Unable to detect current VLLM config. Defaulting to NHD kv cache layout warning (#20400)
...
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-07-03 14:56:09 -07:00
Nick Hill
657f2f301a
[DP] Support external DP Load Balancer mode (#19790)
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-07-02 10:21:52 -07:00
vllmellm
a1aafc827a
[ROCm][FEAT] Enable Full Graph Mode in AITER MLA V1 Attn Backend (Decode Phase only) (#20254)
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2025-07-02 16:25:46 +00:00
afeldman-nm
48fb076cbc
[V1] LogitsProcessor programming model (#16728)
...
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com>
Signed-off-by: Andrew Feldman <afeldman@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-07-02 09:10:42 -07:00
Chengji Yao
7da296be04
[TPU] kv cache update kernel supports dynamic grid (#20235)
...
Signed-off-by: Chengji Yao <chengjiyao@google.com>
2025-07-02 06:33:37 +00:00
Liangliang Ma
a0389e0554
[UT][intel GPU] use current_platform instead of device hardcode in v1 tests (#20169)
...
Signed-off-by: Ma, Liangliang <liangliang.ma@intel.com>
2025-07-02 09:06:04 +08:00
Woosuk Kwon
7f280d69c9
[Optimization] Cache sampled token ids in model runner (#20291)
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-07-01 11:01:31 -07:00
Woosuk Kwon
8acb4badee
[CUDA graphs] Enable full cuda graphs with FA3 AoT scheduling (#20301)
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-07-01 09:07:36 -07:00
Woosuk Kwon
0e96cc9b7e
[Misc] Minor refactoring for scheduler (#20299)
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-07-01 07:55:32 -07:00
TY-AMD
96453cfa83
[BugFix][V1][ROCm] Triton MLA uses V0 backend on V1 engine (#19067)
...
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>
2025-07-01 16:12:19 +08:00
Michael Goin
be250bbc67
[V1] Only print cudagraph tqdm on rank 0 with is_global_first_rank (#19516)
...
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-07-01 06:02:09 +00:00
Woosuk Kwon
2863befce3
[Optimization] Use Shared CachedRequestData Instance Across All Requests (#20232)
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-06-30 09:07:50 -07:00
Woosuk Kwon
2062c0723d
[Spec Decode] Refactor spec decoding into a separate function (#20238)
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-06-30 08:13:50 -07:00
Woosuk Kwon
19108ef311
[Misc] Fix import (#20233)
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-06-29 20:34:54 -07:00
Michael Goin
f719772281
[Bugfix] Properly reject requests with empty list guided_choice (#20195)
...
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-06-27 22:50:52 -07:00
Chales Xu
e53be6f00a
[Misc] Add type assertion of request_id for LLMEngine.add_request (#19700)
...
Signed-off-by: n2ptr <xuzhanchaomail@163.com>
2025-06-27 22:47:36 -07:00
Chendi.Xue
dec197e3e5
Quick Fix by adding conditional import for flash_attn_varlen_func in flash_attn (#20143)
...
Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
2025-06-27 05:48:13 +00:00
Bowen Wang
e9fd658a73
[Feature] Expert Parallelism Load Balancer (EPLB) (#18343)
...
Signed-off-by: Bowen Wang <abmfy@icloud.com>
2025-06-26 15:30:21 -07:00
Chengji Yao
04e1642e32
[TPU] add kv cache update kernel (#19928)
...
Signed-off-by: Chengji Yao <chengjiyao@google.com>
2025-06-26 10:01:37 -07:00
Kunshang Ji
b69781f107
[Hardware][Intel GPU] Add v1 Intel GPU support with Flash attention backend. (#19560)
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2025-06-26 09:27:18 -07:00
TJian
27c065df50
[Bugfix][V1][ROCm] Fix AITER Flash Attention Backend (Fix API Break and Local Attention Logic: affecting Llama4) (#19904)
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-06-26 12:42:31 +00:00
Chenyaaang
2d7620c3eb
[TPU] Add TPU specific var VLLM_TPU_MOST_MODEL_LEN (#19919)
...
Signed-off-by: Chenyaaang <chenyangli@google.com>
2025-06-25 15:51:02 -07:00
Chengji Yao
2cc2069970
[TPU][Bugfix] fix kv cache padding (#20048)
...
Signed-off-by: Chengji Yao <chengjiyao@google.com>
2025-06-25 21:24:10 +00:00
cjackal
8359f4c8d8
[V1][Speculative Decoding] Fix DeepSeek MTP (#20022)
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com>
2025-06-25 08:41:02 -07:00
Lucas Wilkinson
0f9e7354f5
[BugFix] Fix full-cuda-graph illegal memory access in FA3 (#20057)
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-06-25 08:39:04 +00:00
Nick Hill
8619e7158c
[BugFix] Fix multi-node offline data parallel (#19937)
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-06-24 12:45:20 -07:00