Asaf Joseph Gardin
34916ae37f
[Mamba] - Consolidate Mambas Attention Logic (#28133)
2025-12-23 21:57:00 +01:00
Patrick von Platen
3faa8bee57
adapt voxtral (#31095)
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
2025-12-23 05:31:55 -08:00
Pavani Majety
3e10262356
Revert "[SM100] Enable fp8 compute for prefill MLA (#30746)" (#31197)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
2025-12-22 18:15:33 -08:00
Benjamin Chislett
85aff45e24
[Perf] Remove blocking copy in GDN Attention (#31167)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
2025-12-22 14:25:22 -08:00
Wentao Ye
5312a7284e
[Bug] Fix 'CutlassMLAImpl' object has no attribute '_workspace_buffer' (#31173)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-12-22 14:24:27 -08:00
Lucas Wilkinson
de71747655
[SpecDecode] Simplified alternative padded-speculation acceptance rate fix (#29845)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-12-22 13:06:10 -08:00
Pavani Majety
b10f41c894
[SM100] Enable fp8 compute for prefill MLA (#30746)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
2025-12-22 19:15:57 +00:00
Benjamin Chislett
d6b3d39b6d
[Cleanup] Refactor FlashInferMetadataBuilder (#29128)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-12-18 14:45:30 -08:00
Yifan Qiao
11a89cf95c
[Fix][FlexAttention] return max logical block index to handle reused blocks (#30915)
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
2025-12-18 06:42:21 +00:00
Micah Williamson
fd8afdf38d
[ROCm][CI] Reduce Flakiness For test_async_scheduling Using ROCM_ATTN With FP32 (#30811)
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
2025-12-18 10:27:37 +08:00
Isotr0py
74a1ac38b0
[v1] Add PrefixLM support to TritonAttention backend (#30386)
2025-12-17 16:05:24 -08:00
Matthew Bonanni
7eb6cb6c18
[Attention] Update tests to remove deprecated env vars (#30563)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-12-17 09:49:59 -08:00
Lucas Wilkinson
9fec0e13d5
[Attention] Cache attention metadata builds across hybrid KV-cache groups (#29627)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Stanislaw Wozniak <stw@zurich.ibm.com>
2025-12-16 17:10:16 -05:00
Pleaplusone
9dbbc59b15
[ROCm][MTP] Support MTP for AITER MLA backend (#28624)
Signed-off-by: ganyi <ygan@amd.com>
2025-12-16 14:10:26 +00:00
jiangkuaixue123
b9ff4f2a8d
[feature] extend DBO to XBO (#30120)
Signed-off-by: jiangkuaixue123 <jiangxiaozhou111@163.com>
Co-authored-by: root <root@hk01dgx028.cm.cluster>
2025-12-16 00:04:01 -05:00
drslark
add1b9d3de
[main][BugFix] Fixed an accuracy bug of Qwen3-next-MTP when batched inferring (#30632)
Signed-off-by: drslark <slarksblood@qq.com>
2025-12-14 01:32:16 -08:00
Roberto L. Castro
4fa7ce46f3
[Feature] Add SM103 (Blackwell Ultra) Support to vLLM (#30484)
Signed-off-by: LopezCastroRoberto <robertol.c510@gmail.com>
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2025-12-12 19:34:23 -08:00
jvlunteren
9c0ee995a8
[Kernel] Support CUDA Graphs in 3D Triton Attention Kernel (#28306)
Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com>
Signed-off-by: jvlunteren <161835099+jvlunteren@users.noreply.github.com>
Co-authored-by: Thomas Parnell <tom.parnell@gmail.com>
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-12-12 16:55:40 +01:00
Lucas Wilkinson
3e41992fec
[Attention] Use sparse prefill kernel for fp8 kv-cache in DeepSeek-v3.2 (#27532)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-12-12 05:57:47 -08:00
Ming Yang
fba8906930
[perf] Use direct copy (broadcast) instead of cat for k_nope/k_pe in MLA prefill (#29710)
Signed-off-by: Ming Yang <minos.future@gmail.com>
2025-12-11 08:20:45 +00:00
Po-Han Huang (NVIDIA)
eea41804a4
[bug] Fix "Current vLLM config is not set." warnings when FlashInfer attention is used (#30241)
Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
2025-12-10 11:18:51 -08:00
Aditya Tewari
cebda2a4af
[CPU] Support for Whisper (#30062)
Signed-off-by: Aditya Tewari <aditya.tewari@arm.com>
2025-12-10 04:58:42 -08:00
Lucas Wilkinson
abe93bce59
[Attention] Make seq_lens_cpu optional in CommonAttentionMetadata to enable true async spec-decode (#29624)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com>
2025-12-09 17:18:10 -08:00
Jaya Yuan
67475a6e81
[DCP][Bugfix][CI] Fix accuracy issue of DCP when using FLASH_ATTN_MLA (#30309)
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
2025-12-09 08:22:14 +00:00
Lucas Wilkinson
aed846917f
[Attention] Make split_decodes_and_prefills(..., require_uniform=True) support padding (#29644)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com>
2025-12-09 07:24:01 +00:00
Lain
1fb632fdb6
[Perf] Improve fp8 quant in mla; replace ReduceSum with ReduceScatterSum (#29795)
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
2025-12-08 15:02:34 -08:00
Isotr0py
b952f4d3c3
[v1] Add PrefixLM support to FlexAttention backend (#27938)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-12-07 15:51:36 +00:00
Matthew Bonanni
66e674cdd5
[Attention][UX][1/N] Add AttentionConfig and change attention env vars to CLI arguments (#26315)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
2025-12-05 09:48:43 -08:00
Jingchun Gao
d698bb382d
[Bugfix] Correct num_q_heads on DCP for Flashinfer backends (#29487)
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
Signed-off-by: Jingchun Gao <63247409+gjc0824@users.noreply.github.com>
Co-authored-by: Jingchun Gao <gaojingchun1@huawei.com>
2025-12-05 05:54:31 +00:00
Andreas Karatzas
e96a6a6dca
[ROCm][CI][Bugfix] Fixing the Multi-Modal Models Test (Extended) 1 group (#30013)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2025-12-04 11:00:16 +00:00
Matthew Bonanni
1d93f11675
[Attention][CUDAGraph] Remove CG padding from attention backends (#29352)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-12-02 13:48:08 -05:00
Isotr0py
b95db244ee
[v1] Add real sliding window calculation to FlexAttention direct BlockMask building (#26015)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com>
Co-authored-by: baonudesifeizhai <baonudesifeizhai@gmail.com>
2025-12-01 13:12:51 +00:00
Pleaplusone
8c363ed666
[ROCm][Attention] Sliding window support for AiterFlashAttentionBackend (#29234)
Signed-off-by: ganyi <ygan@amd.com>
2025-11-30 11:31:50 +00:00
Huamin Li
82c795d6f2
Fix AttributeError about _use_fi_prefill (#29734)
Signed-off-by: Huamin Li <3ericli@gmail.com>
2025-11-30 06:04:55 +00:00
Lucas Wilkinson
e23f665d83
[BugFix] Fix DBO failing with TypeError: 'NoneType' object is not iterable (#29698)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-11-28 20:19:01 -08:00
Augusto Yao
9726e64530
bugfix: correct attn output with base 2 or e (#28840)
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
2025-11-29 07:52:12 +08:00
Lucas Wilkinson
be493e0b3c
[BugFix] Fix new nightly failures (#29578)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-11-27 13:45:38 -08:00
Andrii Skliar
a5345bf49d
[BugFix] Fix plan API Mismatch when using latest FlashInfer (#29426)
Signed-off-by: Andrii Skliar <askliar@askliar-mlt.client.nvidia.com>
Co-authored-by: Andrii Skliar <askliar@askliar-mlt.client.nvidia.com>
2025-11-27 11:34:59 -08:00
Matthew Bonanni
fc1d8be3dc
[Attention] Update attention imports (#29540)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-11-27 11:19:09 -05:00
Matthew Bonanni
77740191de
[Attention][Async] Eliminate seq_lens_cpu in FlashAttention metadata building with DCP > 1 (#29449)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-11-26 18:48:43 -08:00
Lucas Wilkinson
56539cddac
[Core] Refactor padding logic and pad for CUDA graphs before attention metadata building (#28579)
2025-11-26 14:07:13 -05:00
Matthew Bonanni
430dd4d9eb
[Attention] Remove imports from vllm/attention/__init__.py (#29342)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-11-26 10:53:15 -07:00
Pleaplusone
d9d342d214
[Performance][MLA][ROCm] Remove redundant D2D copy in deepseek (#27457)
Signed-off-by: ganyi <ygan@amd.com>
2025-11-26 12:45:28 +08:00
Nicolò Lucchesi
798e87db5c
[Core] Generalize Encoder-Decoder seq_lens computation to avoid Whisper hardcoded logic (#29268)
Signed-off-by: NickLucche <nlucches@redhat.com>
Co-authored-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
2025-11-25 11:32:11 +00:00
Jiangyun Zhu
81db702ed2
[Attention] add _cudagraph_support for linear attention (#28934)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
2025-11-25 12:25:20 +08:00
gbyu-amd
cb7214d8ea
[ROCm][MLA] enable fp8 MLA decode on ROCm (#28032)
Signed-off-by: guanbao <gyu@amd.com>
Signed-off-by: Guanbao Yu <gyu@amd.com>
Signed-off-by: gbyu-amd <Guanbao.Yu@amd.com>
Co-authored-by: guanbao <gyu@amd.com>
2025-11-25 10:15:02 +08:00
Pleaplusone
77e10c9cab
[Perf][Deepseek] optimize gather_and_maybe_dequant_cache kernel's perf for extremely long sequence (#28029)
Signed-off-by: ganyi <ygan@amd.com>
2025-11-24 19:05:46 -07:00
Roger Wang
0ff70821c9
[Core] Deprecate xformers (#29262)
Signed-off-by: Roger Wang <hey@rogerw.io>
2025-11-24 04:18:55 +00:00
tongqiu
5253f4276f
[ROCm] Support for Whisper v1 with Aiter Unified Attention and Aiter Flash Attention (#28376)
Signed-off-by: apinge <Tong.Qiu2@amd.com>
2025-11-24 03:26:00 +00:00
Fadi Arafeh
730bd35378
[perf][cpu] Accelerate paged attention GEMMs (QK, PV) on Arm CPUs with NEON (#29193)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
2025-11-22 09:04:36 -08:00