yurekami
4b7df5710a
Add compatibility notes and docs links to MTP/PCP error messages
...
- Add documentation links to MTP and PCP error messages for consistency
with DCP error message
- Add notes indicating no backends currently support these features
- Remove suggestion to use --attention-backend for PCP since no
backends support it yet
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: yurekami <yurekami@users.noreply.github.com>
2025-12-25 04:10:07 +09:00
yurekami
79e0db60ee
Use --attention-backend flag instead of VLLM_ATTENTION_BACKEND env var
...
Per reviewer feedback, the VLLM_ATTENTION_BACKEND environment variable
is being deprecated in favor of the --attention-backend CLI flag.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: yurekami <yurekami@users.noreply.github.com>
2025-12-24 15:16:10 +09:00
yurekami
3c8358c328
[v1][CP] Improve DCP/PCP/MTP error messages with actionable guidance
...
Replace cryptic AssertionErrors with informative RuntimeErrors that:
- Explain what DCP (Decode Context Parallel) and PCP (Prefill Context
Parallel) are
- List compatible attention backends
- Provide environment variable instructions (VLLM_ATTENTION_BACKEND)
- Include documentation links
Fixes #28407
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: yurekami <249254018+yurekami@users.noreply.github.com>
2025-12-18 20:05:16 +09:00
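The pattern described in 3c8358c328 (replacing a bare assertion with a RuntimeError that explains the feature, lists compatible backends, and says how to switch) might look roughly like the sketch below. This is an illustration only, not the actual vLLM code: the function name `check_dcp_support`, the constant `SUPPORTED_DCP_BACKENDS`, and the backend names in it are hypothetical placeholders. The `--attention-backend` flag is the one named in the later commit 79e0db60ee.

```python
# Hypothetical sketch: function, constant, and backend names are placeholders,
# not vLLM's real API or its real list of compatible backends.
SUPPORTED_DCP_BACKENDS = ("BACKEND_A", "BACKEND_B")


def check_dcp_support(backend_name: str, dcp_size: int) -> None:
    """Raise an actionable error when DCP is requested on an unsupported backend."""
    # DCP is only in effect when the parallel size is greater than 1.
    if dcp_size <= 1 or backend_name in SUPPORTED_DCP_BACKENDS:
        return
    # Before: something like `assert backend_supports_dcp`, which fails
    # with no explanation. After: say what DCP is and what to do next.
    raise RuntimeError(
        "Decode Context Parallel (DCP) is not supported by the "
        f"'{backend_name}' attention backend. "
        f"Compatible backends: {', '.join(SUPPORTED_DCP_BACKENDS)}. "
        "Select one with the --attention-backend flag."
    )
```

Calling `check_dcp_support("OTHER", 2)` in this sketch raises a RuntimeError whose message names the offending backend and the flag to change it, whereas a failed assertion would give the user nothing to act on.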
wangxiyuan
a85724bd6e
[Platform] Let EPD work with non-cuda platform ( #30225 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-18 06:45:29 +00:00
Yifan Qiao
11a89cf95c
[Fix][FlexAttention] return max logical block index to handle reused blocks ( #30915 )
...
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
2025-12-18 06:42:21 +00:00
Li, Jiang
e3ab93c896
[CPU] Refactor CPU fused MOE ( #30531 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-12-18 14:36:49 +08:00
Nathan Price
fc2ae6d617
fix: add warmup for audio preprocessing ( #30706 )
...
Signed-off-by: Nathan Price <nathan@abridge.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-12-18 06:12:29 +00:00
Yihua Cheng
ec965569d9
[KV connector][LMCache] Only record the cuda event when there are request to store/load ( #30814 )
...
Signed-off-by: ApostaC <yihua98@uchicago.edu>
2025-12-18 05:31:34 +00:00
Divakar Verma
82dc338ad6
[AMD][CI] fix lm eval ci arg ( #30911 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
2025-12-18 13:18:26 +08:00
Vadim Gimpelson
717ac33d9c
[PERF] Qwen3-next. Add fp8 cutlass MoE tuned configs. chmod -x *MI308X.json ( #29553 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2025-12-18 13:16:04 +08:00
Li, Jiang
cfb7e55515
[Doc][CPU] Update CPU doc ( #30765 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: Li, Jiang <bigpyj64@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-18 04:59:09 +00:00
zzhxxx
b166ef20e1
[refactor] Add prefix support to embed_tokens in DeepSeek MTP ( #30788 )
...
Signed-off-by: zzhx1 <zzh_201018@outlook.com>
2025-12-18 04:45:56 +00:00
Zhengxu Chen
5f2f3fba1d
[compile] Fix CI for test_gpt2_cache_hit ( #30902 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com>
2025-12-17 20:22:23 -08:00
Matthew Bonanni
4a8412f773
[UX] Reduce DeepGEMM warmup log output to single progress bar ( #30903 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-12-17 20:21:51 -08:00
Bowen Bao
0c738b58bc
[Quantization] Support Quark int4-fp8 w4a8 for MoE ( #30071 )
...
Signed-off-by: Bowen Bao <bowenbao@amd.com>
2025-12-18 04:20:42 +00:00
gnovack
5a3adf581e
fused_moe_lora PDL improvements ( #30716 )
...
Signed-off-by: gnovack <gnovack@amazon.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-12-17 19:55:00 -08:00
Isotr0py
6fe5887652
[Chore] Remove v0 dead code for Qwen2.5-omni ( #30883 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-12-17 19:54:39 -08:00
Nicolò Lucchesi
bc3700e0cd
[NIXL] Support P tensor-parallel-size > D tensor-parallel-size ( #27274 )
...
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-12-18 11:53:30 +08:00
Micah Williamson
fd8afdf38d
[ROCm][CI] Reduce Flakiness For test_async_scheduling Using ROCM_ATTN With FP32 ( #30811 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
2025-12-18 10:27:37 +08:00
SungMinCho
a0b782f9cc
[Metrics] Model FLOPs Utilization estimation ( #30738 )
...
Signed-off-by: SungMinCho <tjdals4565@gmail.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
2025-12-18 01:40:51 +00:00
Rafael Vasquez
ed2897f336
[CI][Feature] Adds auto-rebase PR rule ( #30875 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
2025-12-18 00:46:44 +00:00
Isotr0py
74a1ac38b0
[v1] Add PrefixLM support to TritonAttention backend ( #30386 )
2025-12-17 16:05:24 -08:00
Nathan Price
05a83dc6ee
feat(api): Eager chat template warmup to eliminate first-request latency ( #30700 )
...
Signed-off-by: Nathan Price <nathan@abridge.com>
2025-12-18 00:01:29 +00:00
Varun Sundar Rabindranath
e3fc374a9a
[BugFix] Workspace allocation during profile run : DeepEPHighThroughput + DeepGEMM ( #30899 )
2025-12-17 15:00:59 -08:00
Andrey Talman
e06d0bf0aa
2.9.1 PyTorch release update ( #28495 )
2025-12-17 12:20:22 -08:00
Xunzhuo
e3a0f21e6c
[docs]: add ecosystem projects sr in docs/governance ( #30844 )
...
Signed-off-by: bitliu <bitliu@tencent.com>
2025-12-17 18:45:56 +00:00
Matthew Bonanni
7eb6cb6c18
[Attention] Update tests to remove deprecated env vars ( #30563 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-12-17 09:49:59 -08:00
Nicolò Lucchesi
9ca8cb38fd
[CI][Bugfix] Fix flaky tests/entrypoints/openai/test_audio.py::test_chat_streaming_audio ( #30878 )
...
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-12-17 18:49:56 +01:00
Cyrus Leung
2497228ad4
[Chore] Factor out logic for requesting initial memory ( #30868 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-17 07:32:17 -08:00
KimHyemin
196cdc3224
[Model] Gemma3: Support untied word embeddings ( #30827 )
...
Signed-off-by: www-spam <panmahm@naver.com>
2025-12-17 07:11:18 -08:00
高鑫崧
b7b6a60aca
Adapt the old parameter enable_thinking in chat_template_kwargs ( #30852 )
...
Signed-off-by: xinsong.gao <1418762819@qq.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
2025-12-17 07:10:59 -08:00
rongfu.leng
9e67c4ce98
[Docs] fix function name ( #30748 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
2025-12-17 12:14:45 +00:00
Jialin Ouyang
6e9dbcc50e
[Fix] uniform decode batch check ( #30747 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-12-17 19:58:43 +08:00
Hank_
6482e3895b
chores: adjust the attn register param order ( #30688 )
...
Signed-off-by: Hank <hcc.mayday@gmail.com>
2025-12-17 19:58:16 +08:00
Harry Mellor
fb980eb2fd
Fix lazy import ( #30858 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-17 03:33:50 -08:00
baoqian426
84896fda22
[Bugfix] deepseek-V3.2 self.weights_proj has no bias ( #30841 )
...
Signed-off-by: baoqian <1354987947@qq.com>
Signed-off-by: baoqian426 <1354987947@qq.com>
2025-12-17 03:32:34 -08:00
Kevin H. Luu
4bf6c23668
[ci] Sync test areas yaml file with test-pipeline ( #30862 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com>
2025-12-17 02:30:56 -08:00
Chauncey
9ad5b21710
[Refactor] [4/N] Move VLLM_SERVER_DEV endpoints into the serve directory ( #30749 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-12-17 02:27:30 -08:00
Wentao Ye
f284d7bd0c
[Bug] Fix AttributeError: 'ColumnParallelLinear' object has no attribute weight_scale_inv ( #30823 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-12-17 02:00:35 -08:00
Zhengxu Chen
53cd7f868b
[compile] Recompile graph module during Dynamo cache loading. ( #30743 )
...
Signed-off-by: Zhengxu Chen <zhxchen17@fb.com>
2025-12-17 02:00:12 -08:00
danielafrimi
7b966ae2ba
[Fix]Load kv-cache dtype from hf_quant_config.json automatically (fix for reverted PR) ( #30785 )
...
Signed-off-by: <>
Co-authored-by: root <root@gpu-937.slurm-workers-slurm.slurm.svc.cluster.local>
2025-12-17 01:56:38 -08:00
Zhengxu Chen
9db1db5949
[compile] Ignore VLLM_FORCE_AOT_LOAD from cache factors ( #30809 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com>
2025-12-17 01:56:24 -08:00
Zhengxu Chen
177c391db2
[compile] Disable aot when eager backend is used. ( #30810 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com>
2025-12-17 01:55:56 -08:00
Michael Goin
519ef9a911
[UX] Make vllm bench serve discover model by default and use --input-len ( #30816 )
...
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-12-17 01:55:30 -08:00
Ye (Charlotte) Qi
a100152288
[Kernels][FI] Skip trtllm attention when num_kv_heads=1 ( #30842 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
2025-12-17 01:54:21 -08:00
Andrew Xia
4c054d89aa
[Doc][ResponsesAPI] add documentation ( #30840 )
...
Signed-off-by: Andrew Xia <axia@fb.com>
Co-authored-by: Andrew Xia <axia@fb.com>
2025-12-17 01:53:02 -08:00
Sheng Lin
f4e884f222
[NIXL][Bugfix] Fix NIXL/RDMA registration failure over CuMemAllocator ( #29569 )
...
Signed-off-by: Somoku <linsh0@protonmail.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
2025-12-17 01:52:58 -08:00
Xinyu Chen
3b1d440ede
CustomOp: grouped topk ( #29575 )
...
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
2025-12-17 17:43:00 +08:00
Asaf Joseph Gardin
a9e15c21ef
[Mamba] Removed disable cascade attn in MambaModelConfig ( #30712 )
...
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com>
2025-12-17 08:48:53 +00:00
Robin
20fda43151
[Bugfix][Frontend] Prevent IndexError in MiniMax M2 tool parser during streaming extraction ( #30555 )
...
Signed-off-by: WangErXiao <863579016@qq.com>
2025-12-17 16:37:57 +08:00