xinyun/vllm - vllm - 丝路新云 code repository

mirror of https://git.datalinker.icu/vllm-project/vllm.git synced 2026-04-25 12:27:03 +08:00

Author	SHA1	Message	Date
Michael Goin	4082338a25	Remove unneeded ROCm platform import when using CUDA (#22765 ) Signed-off-by: mgoin <mgoin64@gmail.com>	2025-08-12 21:26:38 -07:00
Yongye Zhu	007dd90859	[gpt-oss] Enable gpt-oss on ampere (#22714 ) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>	2025-08-12 03:21:44 -07:00
Harry Mellor	00976db0c3	[Docs] Fix warnings in docs build (#22588 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2025-08-10 05:49:51 -07:00
Chengji Yao	2a84fb422f	[TPU] kv cache update kernel doesn't need to be padded slices to multiple of num_slices_per_block (#22394 ) Signed-off-by: Chengji Yao <chengjiyao@gmail.com> Co-authored-by: Chengji Yao <chengjiyao@gmail.com>	2025-08-09 20:49:04 -07:00
Lucas Wilkinson	cd9b9de1fb	[BugFix] Fix IMA FlashMLA full cuda-graph and DP + Update FlashMLA (#21691 ) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: yewentao256 <zhyanwentao@126.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>	2025-08-08 16:09:42 -07:00
Syed Muhammad Bin Asif	609b533cb6	[Bugfix] Add proper comparison for package versions (#22314 ) Signed-off-by: Syed Muhammad Bin Asif <syedmba7@connect.hku.hk>	2025-08-06 20:31:03 -07:00
Lucas Wilkinson	1dc8a70b6d	[Attention] Support multiple attention metadata builders per kv_cache_spec + proper local attention no hybrid kv cache fix (#21588 ) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>	2025-08-06 18:40:52 -07:00
Gregory Shtrasberg	2435ea7ed5	[Bugfix] Make condition in triton kernel constexpr (#22370 ) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>	2025-08-06 10:00:58 -07:00
Lucas Wilkinson	4a6b72c2ab	[BugFix] Fix triton compile error in `kernel_unified_attention_2/3d` caused by attention sinks (#22368 ) Signed-off-by: LucasWilkinson <lwilkinson@neuralmagic.com>	2025-08-06 09:47:38 -07:00
Woosuk Kwon	6e20924350	Add attention sink in attention backends (#22320 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com> Co-authored-by: simon-mo <xmo@berkeley.edu> Co-authored-by: Chen Zhang <zhangch99@outlook.com> Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com> Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com> Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>	2025-08-05 22:37:21 -07:00
elvischenv	83156c7b89	[NVIDIA] Support Flashinfer TRT-LLM Prefill Attention Kernel (#22095 ) Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>	2025-08-05 02:45:34 -07:00
Michael Goin	e79a12fc3a	[UX] Fail if an invalid attention backend is specified (#22217 ) Signed-off-by: mgoin <michael@neuralmagic.com>	2025-08-04 23:54:52 -07:00
Giancarlo Delfin	aa7012eb6d	Add tree attention backend for v1 (part 1) (#20401 ) Signed-off-by: Giancarlo Delfin <gdelfin@meta.com>	2025-08-03 22:13:26 -07:00
Michael Goin	f81c1bb055	[Bugfix] Check NVIDIA artifactory is accessible before using flashinfer cubin kernels (#21893 )	2025-08-01 08:28:45 -04:00
elvischenv	58b11b24a6	[Bugfix] Fix workspace buffer None issue for Flashinfer TRTLLM Backend (#21525 ) Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>	2025-07-29 10:34:00 -04:00
weiliang	2dd72d23d9	update flashinfer to v0.2.9rc1 (#21485 ) Signed-off-by: Weiliang Liu <weiliangl@nvidia.com>	2025-07-24 14:06:11 -07:00
Yong Hoon Shin	78c13e30e1	[V1] Fix local chunked attention always disabled (#21419 ) Signed-off-by: Yong Hoon Shin <yhshin@meta.com>	2025-07-23 15:59:30 -07:00
Tao He	7c734ee09b	[Bugfix][Qwen][DCA] fixes bug in dual-chunk-flash-attn backend for qwen 1m models. (#21364 ) Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>	2025-07-23 06:34:37 -07:00
Lucas Wilkinson	304dce7ec0	[Attention] Clean up iRoPE in V1 (#21188 ) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Michael Goin <mgoin64@gmail.com>	2025-07-21 09:10:30 -07:00
Woosuk Kwon	752c6ade2e	[V0 Deprecation] Deprecate BlockSparse Attention & Phi3-Small (#21217 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2025-07-19 13:53:17 -07:00
Lucia Fang	9a9fda1423	[Core] Support Local Chunked Attention for Hybrid KV Cache (#19351 ) Signed-off-by: Lucia Fang <fanglu@fb.com> Signed-off-by: Lu Fang <fanglu@meta.com> Signed-off-by: Lu Fang <fanglu@fb.com> Co-authored-by: Lu Fang <fanglu@meta.com>	2025-07-18 20:48:38 -07:00
hax0r31337	5782581acf	[Bugfix] Voxtral on Blackwell GPUs (RTX 50 series) (#21077 ) Signed-off-by: hax0r31337 <liulihaocaiqwq@gmail.com>	2025-07-18 18:40:18 -04:00
Richard Zou	b2eb2b5ad7	[Kernel] Apply torch.Tag.needs_fixed_stride_order only for torch==2.6.0 (#19346 ) Signed-off-by: rzou <zou3519@gmail.com>	2025-07-18 14:10:21 -04:00
Woosuk Kwon	4de7146351	[V0 deprecation] Remove V0 HPU backend (#21131 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2025-07-17 16:37:36 -07:00
Peter Pan	1eb2b9c102	[CI] update typos config for CI pre-commit and fix some spells (#20919 ) Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>	2025-07-15 21:12:40 -07:00
Li Wang	20149d84d9	[MISC] Add init files for python package (#20908 ) Signed-off-by: wangli <wangli858794774@gmail.com>	2025-07-15 12:16:33 +00:00
Cyrus Leung	e8cc53af5e	[Misc] Log the reason for falling back to FlexAttention (#20699 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-07-14 04:16:51 -07:00
Wentao Ye	c1acd6d7d4	[Refactor] Change the way of import triton (#20774 ) Signed-off-by: yewentao256 <zhyanwentao@126.com>	2025-07-12 19:39:55 -07:00
Congcong Chen	2c11a738b3	[Model] New model support for microsoft/Phi-4-mini-flash-reasoning (#20702 ) Signed-off-by: Congcong Chen <congcongchen@microsoft.com>	2025-07-12 06:02:10 -07:00
Pavani Majety	7bd4c37ae7	[Core] Add Flashinfer TRTLLM Backend for Flashinfer decode path (SM100). (#19825 ) Signed-off-by: Pavani Majety <pmajety@nvidia.com> Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: shuw <shuw@nvidia.com> Co-authored-by: mgoin <mgoin64@gmail.com>	2025-07-11 09:23:23 +00:00
Luka Govedič	31d5c1797f	[Perf][fp8] Use CustomOp abstraction for fp8 quant for better perf (#19830 ) Signed-off-by: Luka Govedic <lgovedic@redhat.com> Co-authored-by: mgoin <mgoin64@gmail.com>	2025-07-11 04:56:28 +00:00
Alexander Matveev	5b032352cc	[Attention] MLA - Flashinfer Ragged Prefill (#20034 )	2025-07-10 20:17:47 -07:00
Li, Jiang	7721ef1786	[CI/Build][CPU] Fix CPU CI and remove all CPU V0 files (#20560 ) Signed-off-by: jiang1.li <jiang1.li@intel.com>	2025-07-07 22:13:44 -07:00
jvlunteren	22dd9c2730	[Kernel] Optimize Prefill Attention in Unified Triton Attention Kernel (#20308 ) Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com>	2025-07-07 19:08:12 +00:00
Cyrus Leung	9fb52e523a	[V1] Support any head size for FlexAttention backend (#20467 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-07-06 09:54:36 -07:00
Woosuk Kwon	e202dd2736	[V0 deprecation] Remove V0 CPU/XPU/TPU backends (#20412 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Signed-off-by: jiang1.li <jiang1.li@intel.com> Co-authored-by: Li, Jiang <jiang1.li@intel.com>	2025-07-06 08:48:13 -07:00
Chengji Yao	7da296be04	[TPU] kv cache update kernel supports dynamic grid (#20235 ) Signed-off-by: Chengji Yao <chengjiyao@google.com>	2025-07-02 06:33:37 +00:00
Chendi.Xue	dec197e3e5	Quick Fix by adding conditional import for flash_attn_varlen_func in flash_attn (#20143 ) Signed-off-by: Chendi.Xue <chendi.xue@intel.com>	2025-06-27 05:48:13 +00:00
Chengji Yao	04e1642e32	[TPU] add kv cache update kernel (#19928 ) Signed-off-by: Chengji Yao <chengjiyao@google.com>	2025-06-26 10:01:37 -07:00
Kunshang Ji	b69781f107	[Hardware][Intel GPU] Add v1 Intel GPU support with Flash attention backend. (#19560 ) Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>	2025-06-26 09:27:18 -07:00
TJian	27c065df50	[Bugfix][V1][ROCm] Fix AITER Flash Attention Backend (Fix API Break and Local Attention Logic: affecting Llama4) (#19904 ) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>	2025-06-26 12:42:31 +00:00
Wentao Ye	879f69bed3	[Refactor] Remove duplicate `ceil_div` (#20023 ) Signed-off-by: yewentao256 <zhyanwentao@126.com>	2025-06-25 05:19:09 +00:00
Ning Xie	71baf85ae1	[Kernel] mark TorchSDPABackend swap_blocks NotImplementedError (#19749 )	2025-06-20 18:18:11 +00:00
Ning Xie	71d1219545	[Kernel] correct cpu worker function parameter type (#19745 ) Signed-off-by: Andy Xie <andy.xning@gmail.com>	2025-06-20 10:50:13 +00:00
Woosuk Kwon	f04d604567	[Minor] Zero-initialize attn output buffer (#19784 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2025-06-18 06:59:27 +00:00
Ning Xie	c53711bd63	[MISC] correct copy_blocks src_to_dists param type (#19696 ) Signed-off-by: Andy Xie <andy.xning@gmail.com>	2025-06-17 17:21:06 -07:00
Nicolò Lucchesi	4c8f64faa7	[V1][Kernel] Flashinfer HND KV cache layout (#19280 ) Signed-off-by: NickLucche <nlucches@redhat.com>	2025-06-17 09:09:22 -04:00
jvlunteren	ccd7c05089	[Kernel] Add Split-KV Support to Unified Triton Attention Kernel (#19152 ) Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com>	2025-06-17 10:45:07 +00:00
22quinn	0b73736a0d	[Kernel] Raise verbose error and consolidate `num_heads/num_kv_heads` divisibility check (#19339 ) Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>	2025-06-15 13:43:48 +08:00
Luka Govedič	f98548b9da	[torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass (#16756 ) Signed-off-by: Luka Govedič <lgovedic@redhat.com> Co-authored-by: Sage Moore <sage@neuralmagic.com>	2025-06-12 08:31:04 -07:00

405 Commits