yurekami
79e0db60ee
Use --attention-backend flag instead of VLLM_ATTENTION_BACKEND env var
...
Per reviewer feedback, the VLLM_ATTENTION_BACKEND environment variable
is being deprecated in favor of the --attention-backend CLI flag.
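For illustration, the switch looks like this on the command line (the model name and backend value are examples, not taken from this change):

```shell
# Deprecated: selecting the backend via environment variable
VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve meta-llama/Llama-3.1-8B-Instruct

# Preferred: the CLI flag
vllm serve meta-llama/Llama-3.1-8B-Instruct --attention-backend FLASHINFER
```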
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: yurekami <yurekami@users.noreply.github.com>
2025-12-24 15:16:10 +09:00
yurekami
3c8358c328
[v1][CP] Improve DCP/PCP/MTP error messages with actionable guidance
...
Replace cryptic AssertionErrors with informative RuntimeErrors that:
- Explain what DCP (Decode Context Parallel) and PCP (Prefill Context
Parallel) are
- List compatible attention backends
- Provide environment variable instructions (VLLM_ATTENTION_BACKEND)
- Include documentation links
Fixes #28407
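The pattern described above can be sketched as follows; the helper, backend names, and message wording are hypothetical illustrations, not vLLM's actual code:

```python
# Minimal sketch: replace a bare assertion with a RuntimeError that
# explains the feature and how to fix the configuration.
SUPPORTED_DCP_BACKENDS = {"FLASH_ATTN_MLA", "FLASHMLA"}  # illustrative names

def check_dcp_backend(backend: str, dcp_size: int) -> None:
    """Raise an actionable error instead of a cryptic `assert`."""
    if dcp_size > 1 and backend not in SUPPORTED_DCP_BACKENDS:
        raise RuntimeError(
            "Decode Context Parallel (DCP) shards the KV cache across "
            f"GPUs during decode, but attention backend {backend!r} does "
            "not support it. Compatible backends: "
            f"{sorted(SUPPORTED_DCP_BACKENDS)}. Select one with "
            "VLLM_ATTENTION_BACKEND=<name>, or see the parallelism docs."
        )
```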
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: yurekami <249254018+yurekami@users.noreply.github.com>
2025-12-18 20:05:16 +09:00
wangxiyuan
a85724bd6e
[Platform] Let EPD work with non-CUDA platforms ( #30225 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-18 06:45:29 +00:00
Yifan Qiao
11a89cf95c
[Fix][FlexAttention] return max logical block index to handle reused blocks ( #30915 )
...
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
2025-12-18 06:42:21 +00:00
Li, Jiang
e3ab93c896
[CPU] Refactor CPU fused MOE ( #30531 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-12-18 14:36:49 +08:00
Nathan Price
fc2ae6d617
fix: add warmup for audio preprocessing ( #30706 )
...
Signed-off-by: Nathan Price <nathan@abridge.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-12-18 06:12:29 +00:00
Yihua Cheng
ec965569d9
[KV connector][LMCache] Only record the CUDA event when there are requests to store/load ( #30814 )
...
Signed-off-by: ApostaC <yihua98@uchicago.edu>
2025-12-18 05:31:34 +00:00
Vadim Gimpelson
717ac33d9c
[PERF] Qwen3-next. Add fp8 cutlass MoE tuned configs. chmod -x *MI308X.json ( #29553 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2025-12-18 13:16:04 +08:00
zzhxxx
b166ef20e1
[refactor] Add prefix support to embed_tokens in DeepSeek MTP ( #30788 )
...
Signed-off-by: zzhx1 <zzh_201018@outlook.com>
2025-12-18 04:45:56 +00:00
Zhengxu Chen
5f2f3fba1d
[compile] Fix CI for test_gpt2_cache_hit ( #30902 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com>
2025-12-17 20:22:23 -08:00
Matthew Bonanni
4a8412f773
[UX] Reduce DeepGEMM warmup log output to single progress bar ( #30903 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-12-17 20:21:51 -08:00
Bowen Bao
0c738b58bc
[Quantization] Support Quark int4-fp8 w4a8 for MoE ( #30071 )
...
Signed-off-by: Bowen Bao <bowenbao@amd.com>
2025-12-18 04:20:42 +00:00
gnovack
5a3adf581e
fused_moe_lora PDL improvements ( #30716 )
...
Signed-off-by: gnovack <gnovack@amazon.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-12-17 19:55:00 -08:00
Isotr0py
6fe5887652
[Chore] Remove v0 dead code for Qwen2.5-omni ( #30883 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-12-17 19:54:39 -08:00
Nicolò Lucchesi
bc3700e0cd
[NIXL] Support P tensor-parallel-size > D tensor-parallel-size ( #27274 )
...
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-12-18 11:53:30 +08:00
Micah Williamson
fd8afdf38d
[ROCm][CI] Reduce Flakiness For test_async_scheduling Using ROCM_ATTN With FP32 ( #30811 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
2025-12-18 10:27:37 +08:00
SungMinCho
a0b782f9cc
[Metrics] Model FLOPs Utilization estimation ( #30738 )
...
Signed-off-by: SungMinCho <tjdals4565@gmail.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
2025-12-18 01:40:51 +00:00
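Model FLOPs Utilization (MFU) is conventionally estimated as achieved FLOP/s divided by the hardware peak. A minimal sketch of that idea, using the standard ~2 FLOPs-per-parameter-per-token decode approximation (illustrative only, not vLLM's exact accounting):

```python
def estimate_mfu(num_params: float, tokens_per_s: float, peak_flops: float) -> float:
    # Forward pass costs roughly 2 FLOPs per parameter per generated token
    # (ignores attention-over-context terms).
    achieved_flops = 2.0 * num_params * tokens_per_s
    return achieved_flops / peak_flops

# e.g. an 8B-parameter model decoding 10,000 tok/s on a ~989 TFLOP/s (BF16) GPU
mfu = estimate_mfu(8e9, 10_000, 989e12)  # ≈ 0.162
```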
Isotr0py
74a1ac38b0
[v1] Add PrefixLM support to TritonAttention backend ( #30386 )
2025-12-17 16:05:24 -08:00
Nathan Price
05a83dc6ee
feat(api): Eager chat template warmup to eliminate first-request latency ( #30700 )
...
Signed-off-by: Nathan Price <nathan@abridge.com>
2025-12-18 00:01:29 +00:00
Varun Sundar Rabindranath
e3fc374a9a
[BugFix] Workspace allocation during profile run: DeepEPHighThroughput + DeepGEMM ( #30899 )
2025-12-17 15:00:59 -08:00
Andrey Talman
e06d0bf0aa
2.9.1 PyTorch release update ( #28495 )
2025-12-17 12:20:22 -08:00
Matthew Bonanni
7eb6cb6c18
[Attention] Update tests to remove deprecated env vars ( #30563 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-12-17 09:49:59 -08:00
Cyrus Leung
2497228ad4
[Chore] Factor out logic for requesting initial memory ( #30868 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-17 07:32:17 -08:00
KimHyemin
196cdc3224
[Model] Gemma3: Support untied word embeddings ( #30827 )
...
Signed-off-by: www-spam <panmahm@naver.com>
2025-12-17 07:11:18 -08:00
高鑫崧
b7b6a60aca
Support the legacy enable_thinking parameter in chat_template_kwargs ( #30852 )
...
Signed-off-by: xinsong.gao <1418762819@qq.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
2025-12-17 07:10:59 -08:00
Jialin Ouyang
6e9dbcc50e
[Fix] uniform decode batch check ( #30747 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-12-17 19:58:43 +08:00
Hank_
6482e3895b
chores: adjust the attn register param order ( #30688 )
...
Signed-off-by: Hank <hcc.mayday@gmail.com>
2025-12-17 19:58:16 +08:00
Harry Mellor
fb980eb2fd
Fix lazy import ( #30858 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-17 03:33:50 -08:00
baoqian426
84896fda22
[Bugfix] deepseek-V3.2 self.weights_proj has no bias ( #30841 )
...
Signed-off-by: baoqian <1354987947@qq.com>
Signed-off-by: baoqian426 <1354987947@qq.com>
2025-12-17 03:32:34 -08:00
Chauncey
9ad5b21710
[Refactor] [4/N] Move VLLM_SERVER_DEV endpoints into the serve directory ( #30749 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-12-17 02:27:30 -08:00
Wentao Ye
f284d7bd0c
[Bug] Fix AttributeError: 'ColumnParallelLinear' object has no attribute weight_scale_inv ( #30823 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-12-17 02:00:35 -08:00
Zhengxu Chen
53cd7f868b
[compile] Recompile graph module during Dynamo cache loading. ( #30743 )
...
Signed-off-by: Zhengxu Chen <zhxchen17@fb.com>
2025-12-17 02:00:12 -08:00
danielafrimi
7b966ae2ba
[Fix] Load kv-cache dtype from hf_quant_config.json automatically (fix for reverted PR) ( #30785 )
...
Signed-off-by: <>
Co-authored-by: root <root@gpu-937.slurm-workers-slurm.slurm.svc.cluster.local>
2025-12-17 01:56:38 -08:00
Zhengxu Chen
9db1db5949
[compile] Ignore VLLM_FORCE_AOT_LOAD from cache factors ( #30809 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com>
2025-12-17 01:56:24 -08:00
Zhengxu Chen
177c391db2
[compile] Disable aot when eager backend is used. ( #30810 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com>
2025-12-17 01:55:56 -08:00
Michael Goin
519ef9a911
[UX] Make vllm bench serve discover model by default and use --input-len ( #30816 )
...
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-12-17 01:55:30 -08:00
Ye (Charlotte) Qi
a100152288
[Kernels][FI] Skip trtllm attention when num_kv_heads=1 ( #30842 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
2025-12-17 01:54:21 -08:00
Andrew Xia
4c054d89aa
[Doc][ResponsesAPI] add documentation ( #30840 )
...
Signed-off-by: Andrew Xia <axia@fb.com>
Co-authored-by: Andrew Xia <axia@fb.com>
2025-12-17 01:53:02 -08:00
Xinyu Chen
3b1d440ede
CustomOp: grouped topk ( #29575 )
...
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
2025-12-17 17:43:00 +08:00
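Grouped top-k routing (as used in DeepSeek-style MoE) first ranks expert groups, keeps only the best groups, then takes the final top-k among the surviving experts. A pure-Python sketch of the idea; shapes and tie-breaking are illustrative, not the custom op's actual implementation:

```python
def grouped_topk(scores, num_groups, topk_groups, topk):
    # scores: flat router scores; experts are split evenly into `num_groups`.
    group_size = len(scores) // num_groups
    groups = [scores[g * group_size:(g + 1) * group_size] for g in range(num_groups)]
    # Rank groups by their best expert and keep the top `topk_groups` groups.
    kept = sorted(range(num_groups), key=lambda g: max(groups[g]), reverse=True)[:topk_groups]
    candidates = [i for g in kept for i in range(g * group_size, (g + 1) * group_size)]
    # Final top-k over experts in the surviving groups only.
    return sorted(candidates, key=lambda i: scores[i], reverse=True)[:topk]
```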
Asaf Joseph Gardin
a9e15c21ef
[Mamba] Removed disable cascade attn in MambaModelConfig ( #30712 )
...
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com>
2025-12-17 08:48:53 +00:00
Robin
20fda43151
[Bugfix][Frontend] Prevent IndexError in MiniMax M2 tool parser during streaming extraction ( #30555 )
...
Signed-off-by: WangErXiao <863579016@qq.com>
2025-12-17 16:37:57 +08:00
Yan Ma
4f735babb7
[XPU] fix broken fp8 online quantization for XPU platform ( #30831 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com>
2025-12-17 00:28:13 -08:00
Li, Jiang
0cd5353644
[Bugfix][CPU] Fix CPU backend ROPE dispatch for VL models ( #30829 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: Li, Jiang <bigpyj64@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-16 23:25:12 -08:00
Michael Goin
d4d2751732
Update note comment for flashinfer attention warmup ( #30711 )
...
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-12-16 21:29:03 -08:00
Grzegorz K. Karch
f5db6385a1
Fix nemotron_nas intermediate_size computation ( #30795 )
...
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
2025-12-17 01:06:28 +00:00
Nicolò Lucchesi
e087fbc393
[MM] Pass FA version in ViT Attn ( #30756 )
...
Signed-off-by: NickLucche <nlucches@redhat.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-12-17 07:54:45 +08:00
TJian
2410132bb1
[ROCm] [Bugfix] Fix torch sdpa hallucination ( #30789 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-12-16 15:32:43 -08:00
Jinzhen Lin
ce96857fdd
[Kernel][Quantization][MoE] add marlin kernel support for turing (sm75) ( #29901 )
...
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-12-16 14:35:28 -08:00
Roger Wang
f5f51e5931
[Core][MM] Optimize encoder cache manager by operating with embeddings only ( #30475 )
...
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Sun Kim <sunytokki@gmail.com>
2025-12-16 14:18:17 -08:00
Lucas Wilkinson
9fec0e13d5
[Attention] Cache attention metadata builds across hybrid KV-cache groups ( #29627 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Stanislaw Wozniak <stw@zurich.ibm.com>
2025-12-16 17:10:16 -05:00