xinyun/vllm - vllm - 丝路新云 Code Repository

mirror of https://git.datalinker.icu/vllm-project/vllm.git synced 2026-04-29 19:07:15 +08:00

Author	SHA1	Message	Date
Nick Hill	919234fe17	[BugFix] Fix initial DP request load imbalance (#22910 ) Signed-off-by: Nick Hill <nhill@redhat.com>	2025-08-14 15:20:28 -07:00
Nick Hill	ebcce2cd36	[Core] Return final response for aborted requests from `AsyncLLM.generate` (#22283 ) Signed-off-by: Nick Hill <nhill@redhat.com>	2025-08-14 14:49:02 -07:00
nvjullin	279a5f31b3	[Kernel] Add nvfp4 gemm flashinfer backends (#22346 ) Signed-off-by: Julien Lin <jullin@nvidia.com> Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: mgoin <mgoin64@gmail.com>	2025-08-14 16:03:55 -04:00
Lucas Wilkinson	829b9a62d0	[Perf] Dont create unnecessary pooling params (#22876 ) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>	2025-08-14 05:28:09 -07:00
iAmir97	7655dc3e45	[Bugfix] Add reset prefix cache for online serving (#22726 ) Signed-off-by: iAmir97 <Amir.balwel@embeddedllm.com> Signed-off-by: iAmir97 <71513472+iAmir97@users.noreply.github.com> Co-authored-by: iAmir97 <Amir.balwel@embeddedllm.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-08-14 04:04:18 -07:00
Nick Hill	eb08487b18	[BugFix] Threadsafe close async zmq sockets (#22877 ) Signed-off-by: Nick Hill <nhill@redhat.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>	2025-08-14 03:44:29 -07:00
Jialin Ouyang	31a500c86f	[Core] [N-gram SD Optimization][1/n] Propose tokens with a single KMP (#22437 ) Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>	2025-08-13 14:44:06 -07:00
Cyrus Leung	19b927e52d	[Core] Use individual MM items in P0/P1 cache and model runner (#22570 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-08-13 07:18:07 -07:00
Chen Zhang	fceafaf582	[Bugfix][mamba] Fix type annotation of Mamba2Metadata (#22787 ) Signed-off-by: Chen Zhang <zhangch99@outlook.com>	2025-08-13 06:07:09 -07:00
Giancarlo Delfin	d94e3026de	[V1] Add tree drafting tests for eagle spec decoding (#22705 ) Signed-off-by: Giancarlo Delfin <gdelfin@meta.com>	2025-08-13 04:11:28 -07:00
Michael Goin	c6b928798e	Force TRTLLM attention for gpt-oss on SM100 (#22678 ) Signed-off-by: mgoin <mgoin64@gmail.com>	2025-08-12 21:22:16 -07:00
Xiaozhu Meng	6bd8ebf026	[Kernel][AMD] Avoid D2H copy and cumsum kernel (#22683 ) Signed-off-by: Xiaozhu <mxz297@gmail.com> Signed-off-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-08-12 12:53:36 -07:00
Rahul Tuli	5a4b4b3729	Add: `SupportsEagle3` interface for explicit EAGLE3 support (#22642 ) Signed-off-by: Rahul Tuli <rtuli@redhat.com>	2025-08-12 09:24:52 -07:00
wang.yuqi	6d729c43fb	[Bugfix] Fix ModernBert load & Enable sliding window attention for bidirectional attention. (#22637 ) Signed-off-by: wang.yuqi <noooop@126.com> Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Co-authored-by: Max de Bayser <mbayser@br.ibm.com>	2025-08-12 00:23:17 -07:00
wang.yuqi	84cf78acee	[Model] Pooling models default to using chunked prefill & prefix caching if supported. (#20930 ) Signed-off-by: wang.yuqi <noooop@126.com>	2025-08-11 09:41:37 -07:00
Maximilien de Bayser	39052dbca8	Support token_type_ids in V1 with less code changes (#21985 ) Signed-off-by: Max de Bayser <mbayser@br.ibm.com>	2025-08-10 22:54:59 -07:00
Nick Hill	5898b135ab	[BugFix] Fix KVConnectorOutput TPU breakage (#22598 ) Signed-off-by: Nick Hill <nhill@redhat.com>	2025-08-10 19:33:48 -07:00
Chengji Yao	2a84fb422f	[TPU] kv cache update kernel doesn't need to be padded slices to multiple of num_slices_per_block (#22394 ) Signed-off-by: Chengji Yao <chengjiyao@gmail.com> Co-authored-by: Chengji Yao <chengjiyao@gmail.com>	2025-08-09 20:49:04 -07:00
Thomas Parnell	61f67d8acd	[V1] [Hybrid] Enable Full CUDA Graph (decode-only) for Mamba layers (#21401 ) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>	2025-08-09 20:16:11 -07:00
Or Ozeri	7ad7adb67f	v1: Pass KVConnectorOutput to scheduler-side (#22157 ) Signed-off-by: Or Ozeri <oro@il.ibm.com>	2025-08-08 23:09:51 -07:00
Thomas Parnell	6ade99eafa	[V1] [Hybrid] Support Minimax-Text-01 in V1 (#22151 ) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>	2025-08-08 23:08:48 -07:00
Roger Wang	08b751ba74	Implicit language-model-only mode via limit-mm-per-prompt (#22299 ) Signed-off-by: Roger Wang <hey@rogerw.me> Signed-off-by: Andy Xie <andy.xning@gmail.com> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Signed-off-by: Andrew Sansom <andrew@protopia.ai> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com> Signed-off-by: Shu Wang <shuw@nvidia.com> Signed-off-by: Po-Han Huang <pohanh@nvidia.com> Signed-off-by: Shu Wang. <shuw@nvidia.com> Signed-off-by: XIn Li <xinli@nvidia.com> Signed-off-by: Junhao Li <junhao@ubicloud.com> Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com> Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com> Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com> Signed-off-by: zitian zhao <zitian.zhao@tencentmusic.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: iAmir97 <Amir.balwel@embeddedllm.com> Signed-off-by: iAmir97 <71513472+iAmir97@users.noreply.github.com> Signed-off-by: Linkun <github@lkchen.net> Co-authored-by: Ning Xie <andy.xning@gmail.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com> Co-authored-by: Andrew Sansom <andrew@protopia.ai> Co-authored-by: Zhiyu <zhiyuc@nvidia.com> Co-authored-by: Shu Wang <shuw@nvidia.com> Co-authored-by: XIn Li <xinli@nvidia.com> Co-authored-by: Junhao Li <streaver91@gmail.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> Co-authored-by: Yuxuan Zhang <2448370773@qq.com> Co-authored-by: ZiTian Zhao <zitian.zhao@tencentmusic.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Co-authored-by: Po-Han Huang (NVIDIA) <53919306+nvpohanh@users.noreply.github.com> Co-authored-by: iAmir97 <71513472+iAmir97@users.noreply.github.com> Co-authored-by: iAmir97 <Amir.balwel@embeddedllm.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Hong Hanh <hanh.usth@gmail.com> Co-authored-by: youkaichao <youkaichao@gmail.com> Co-authored-by: lkchen <github@lkchen.net>	2025-08-08 22:21:40 -07:00
Pradyun92	35afe1b30b	[BugFix] [P/D] Handle lookahead token count edge-case with Eagle Spec Decoding and P/D (#22317 ) Signed-off-by: Pradyun Ramadorai <pradyunr@amazon.com> Signed-off-by: Pradyun92 <142861237+Pradyun92@users.noreply.github.com> Co-authored-by: Pradyun Ramadorai <pradyunr@amazon.com> Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>	2025-08-08 17:04:15 -07:00
Kunshang Ji	81c57f60a2	[XPU] upgrade torch 2.8 on for XPU (#22300 ) Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>	2025-08-08 17:03:45 -07:00
Varun Sundar Rabindranath	f703b923f3	[Misc] DeepGEMM : Avoid JIT generation in the hot-path (#22215 ) Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com> Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>	2025-08-08 16:09:59 -07:00
Lucas Wilkinson	cd9b9de1fb	[BugFix] Fix IMA FlashMLA full cuda-graph and DP + Update FlashMLA (#21691 ) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: yewentao256 <zhyanwentao@126.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>	2025-08-08 16:09:42 -07:00
Nick Hill	ccdae737a0	[BugFix] Don't cancel asyncio tasks directly from destructors (#22476 ) Signed-off-by: Nick Hill <nhill@redhat.com>	2025-08-08 01:13:18 -07:00
Cyrus Leung	1712543df6	[CI/Build] Fix multimodal tests (#22491 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-08-08 00:31:19 -07:00
Po-Han Huang (NVIDIA)	af473f0a85	[bugfix] Fix Llama3/4 issues caused by FlashInfer 0.2.10 (#22426 ) Signed-off-by: Po-Han Huang <pohanh@nvidia.com>	2025-08-07 20:25:01 -07:00
TJian	1ee5ead5f8	[ROCm] [V1] [SpecDec] Enable Speculative Decoding on ROCm V1 Engine (#21496 ) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>	2025-08-07 19:13:17 -07:00
Cyrus Leung	139d155781	[Frontend] Use engine argument to control MM cache size (#22441 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-08-07 09:47:10 -07:00
Cyrus Leung	8c9da6be22	[Core] Simplify mm processing cache (#22457 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-08-07 09:47:07 -07:00
Cyrus Leung	766bc8162c	[Core] Store only the keys for multi-modal data in P0 (#22198 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-08-07 01:45:04 -07:00
Syed Muhammad Bin Asif	609b533cb6	[Bugfix] Add proper comparison for package versions (#22314 ) Signed-off-by: Syed Muhammad Bin Asif <syedmba7@connect.hku.hk>	2025-08-06 20:31:03 -07:00
Lucas Wilkinson	1dc8a70b6d	[Attention] Support multiple attention metadata builders per kv_cache_spec + proper local attention no hybrid kv cache fix (#21588 ) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>	2025-08-06 18:40:52 -07:00
Maximilien de Bayser	f825c6bd22	Support encoder_only attention for FlexAttention (#22273 ) Signed-off-by: Max de Bayser <mbayser@br.ibm.com>	2025-08-06 18:37:14 -07:00
Lain	9a3835aaa9	Fix trtllm-gen attention env and add attention sink (#22378 ) Signed-off-by: Siyuan Fu <siyuanf@nvidia.com> Signed-off-by: Lain <fusiyuan2000@hotmail.com> Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>	2025-08-06 18:07:41 -07:00
Asaf Joseph Gardin	46a13949d5	[v1] - Mamba1 Attention Metadata (#21249 ) Signed-off-by: asafg <asafg@ai21.com> Co-authored-by: asafg <asafg@ai21.com>	2025-08-06 17:03:42 -07:00
Yongye Zhu	31f5dc5b2a	[gpt-oss] Enhance error msg on attention sink init (#22335 ) Signed-off-by: simon-mo <xmo@berkeley.edu> Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> Co-authored-by: simon-mo <xmo@berkeley.edu>	2025-08-06 11:41:42 -07:00
Yongye Zhu	90ec006937	[gpt-oss] flashinfer attention sink init (#22330 ) Signed-off-by: simon-mo <xmo@berkeley.edu> Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com> Co-authored-by: simon-mo <xmo@berkeley.edu> Co-authored-by: Chen Zhang <zhangch99@outlook.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com> Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com>	2025-08-05 23:48:19 -07:00
Woosuk Kwon	6e20924350	Add attention sink in attention backends (#22320 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com> Co-authored-by: simon-mo <xmo@berkeley.edu> Co-authored-by: Chen Zhang <zhangch99@outlook.com> Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com> Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com> Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>	2025-08-05 22:37:21 -07:00
Rui Qiao	302962e806	[Bugfix] Skip dead and non-GPU nodes for Ray DP engine allocation (#22275 ) Signed-off-by: Rui Qiao <ruisearch42@gmail.com>	2025-08-05 20:35:32 -07:00
Benjamin Chislett	7e6544c797	[Perf] Parallelize fill_bitmask to accelerate high-throughput guided decoding (#21862 ) Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>	2025-08-05 19:57:49 -07:00
Giancarlo Delfin	469b3ffaaa	[V1] port xformers backend to v1 (#21342 ) Signed-off-by: Giancarlo Delfin <gdelfin@meta.com>	2025-08-05 10:04:46 -07:00
elvischenv	83156c7b89	[NVIDIA] Support Flashinfer TRT-LLM Prefill Attention Kernel (#22095 ) Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>	2025-08-05 02:45:34 -07:00
Cyrus Leung	811ac13d03	[Core] Factor out common logic for MM budget calculation (#22228 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-08-04 23:54:55 -07:00
Cyrus Leung	cdfd6871a5	[Bugfix] Misaligned params in TreeAttentionImpl (#22226 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-08-04 22:40:09 -07:00
lkchen	f4f4e7ef27	[V0 deprecation][P/D] Deprecate v0 `KVConnectorBase` code (1/2) (#21785 ) Signed-off-by: Linkun Chen <github@lkchen.net>	2025-08-04 19:11:33 -07:00
Woosuk Kwon	7175817637	Revert "[Bugfix] V1 Fix the cursor leakage issue during request scheduling." (#22223 )	2025-08-04 18:37:06 -07:00
PiteXChen	2dffac464c	[Bugfix] V1 Fix the cursor leakage issue during request scheduling. (#21173 ) Signed-off-by: CLFutureX <775523362@qq.com>	2025-08-04 18:34:10 -07:00


970 Commits