12349 Commits

Author SHA1 Message Date
Isotr0py
74a1ac38b0
[v1] Add PrefixLM support to TritonAttention backend (#30386)
2025-12-17 16:05:24 -08:00
Nathan Price
05a83dc6ee
feat(api): Eager chat template warmup to eliminate first-request latency (#30700)
Signed-off-by: Nathan Price <nathan@abridge.com>
2025-12-18 00:01:29 +00:00
Varun Sundar Rabindranath
e3fc374a9a
[BugFix] Workspace allocation during profile run : DeepEPHighThroughput + DeepGEMM (#30899)
2025-12-17 15:00:59 -08:00
Andrey Talman
e06d0bf0aa
2.9.1 PyTorch release update (#28495)
2025-12-17 12:20:22 -08:00
Xunzhuo
e3a0f21e6c
[docs]: add ecosystem projects sr in docs/governance (#30844)
Signed-off-by: bitliu <bitliu@tencent.com>
2025-12-17 18:45:56 +00:00
Matthew Bonanni
7eb6cb6c18
[Attention] Update tests to remove deprecated env vars (#30563)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-12-17 09:49:59 -08:00
Nicolò Lucchesi
9ca8cb38fd
[CI][Bugfix] Fix flaky tests/entrypoints/openai/test_audio.py::test_chat_streaming_audio (#30878)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-12-17 18:49:56 +01:00
Cyrus Leung
2497228ad4
[Chore] Factor out logic for requesting initial memory (#30868)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-17 07:32:17 -08:00
KimHyemin
196cdc3224
[Model] Gemma3: Support untied word embeddings (#30827)
Signed-off-by: www-spam <panmahm@naver.com>
2025-12-17 07:11:18 -08:00
高鑫崧
b7b6a60aca
Adapt the old parameter enable_thinking in chat_template_kwargs (#30852)
Signed-off-by: xinsong.gao <1418762819@qq.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
2025-12-17 07:10:59 -08:00
rongfu.leng
9e67c4ce98
[Docs] fix function name (#30748)
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
2025-12-17 12:14:45 +00:00
Jialin Ouyang
6e9dbcc50e
[Fix] uniform decode batch check (#30747)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-12-17 19:58:43 +08:00
Hank_
6482e3895b
chores: adjust the attn register param order (#30688)
Signed-off-by: Hank <hcc.mayday@gmail.com>
2025-12-17 19:58:16 +08:00
Harry Mellor
fb980eb2fd
Fix lazy import (#30858)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-17 03:33:50 -08:00
baoqian426
84896fda22
[Bugfix] deepseek-V3.2 self.weights_proj has no bias (#30841)
Signed-off-by: baoqian <1354987947@qq.com>
Signed-off-by: baoqian426 <1354987947@qq.com>
2025-12-17 03:32:34 -08:00
Kevin H. Luu
4bf6c23668
[ci] Sync test areas yaml file with test-pipeline (#30862)
Signed-off-by: Kevin H. Luu <khluu000@gmail.com>
2025-12-17 02:30:56 -08:00
Chauncey
9ad5b21710
[Refactor] [4/N] Move VLLM_SERVER_DEV endpoints into the serve directory (#30749)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-12-17 02:27:30 -08:00
Wentao Ye
f284d7bd0c
[Bug] Fix AttributeError: 'ColumnParallelLinear' object has no attribute weight_scale_inv (#30823)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-12-17 02:00:35 -08:00
Zhengxu Chen
53cd7f868b
[compile] Recompile graph module during Dynamo cache loading. (#30743)
Signed-off-by: Zhengxu Chen <zhxchen17@fb.com>
2025-12-17 02:00:12 -08:00
danielafrimi
7b966ae2ba
[Fix]Load kv-cache dtype from hf_quant_config.json automatically (fix for reverted PR) (#30785)
Signed-off-by: <>
Co-authored-by: root <root@gpu-937.slurm-workers-slurm.slurm.svc.cluster.local>
2025-12-17 01:56:38 -08:00
Zhengxu Chen
9db1db5949
[compile] Ignore VLLM_FORCE_AOT_LOAD from cache factors (#30809)
Signed-off-by: zhxchen17 <zhxchen17@fb.com>
2025-12-17 01:56:24 -08:00
Zhengxu Chen
177c391db2
[compile] Disable aot when eager backend is used. (#30810)
Signed-off-by: zhxchen17 <zhxchen17@fb.com>
2025-12-17 01:55:56 -08:00
Michael Goin
519ef9a911
[UX] Make vllm bench serve discover model by default and use --input-len (#30816)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-12-17 01:55:30 -08:00
Ye (Charlotte) Qi
a100152288
[Kernels][FI] Skip trtllm attention when num_kv_heads=1 (#30842)
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
2025-12-17 01:54:21 -08:00
Andrew Xia
4c054d89aa
[Doc][ResponsesAPI] add documentation (#30840)
Signed-off-by: Andrew Xia <axia@fb.com>
Co-authored-by: Andrew Xia <axia@fb.com>
2025-12-17 01:53:02 -08:00
Sheng Lin
f4e884f222
[NIXL][Bugfix] Fix NIXL/RDMA registration failure over CuMemAllocator (#29569)
Signed-off-by: Somoku <linsh0@protonmail.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
2025-12-17 01:52:58 -08:00
Xinyu Chen
3b1d440ede
CustomOp: grouped topk (#29575)
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
2025-12-17 17:43:00 +08:00
Asaf Joseph Gardin
a9e15c21ef
[Mamba] Removed disable cascade attn in MambaModelConfig (#30712)
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com>
2025-12-17 08:48:53 +00:00
Robin
20fda43151
[Bugfix][Frontend] Prevent IndexError in MiniMax M2 tool parser during streaming extraction (#30555)
Signed-off-by: WangErXiao <863579016@qq.com>
2025-12-17 16:37:57 +08:00
Yan Ma
4f735babb7
[XPU] fix broken fp8 online quantization for XPU platform (#30831)
Signed-off-by: Yan Ma <yan.ma@intel.com>
2025-12-17 00:28:13 -08:00
Li, Jiang
0cd5353644
[Bugfix][CPU] Fix CPU backend ROPE dispatch for VL models (#30829)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: Li, Jiang <bigpyj64@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-16 23:25:12 -08:00
Michael Goin
d4d2751732
Update note comment for flashinfer attention warmup (#30711)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-12-16 21:29:03 -08:00
shanjiaz
009a773828
bump up compressed tensors version to 0.13.0 (#30799)
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
2025-12-16 21:01:04 -08:00
Cyrus Leung
44d3b1df3d
[CI/Build] Fix compatibility between #30244 and #30396 (#30787)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-16 20:21:19 -08:00
Fadi Arafeh
bb5ac1fe38
[CPU] Add action to automatically label CPU related PRs (#30678)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
2025-12-17 04:21:07 +00:00
Michael Goin
811cdf5197
Update model-hosting-container-standards to 0.1.10 (#30815)
Signed-off-by: Michael Goin <mgoin64@gmail.com>
2025-12-16 17:52:14 -08:00
Grzegorz K. Karch
f5db6385a1
Fix nemotron_nas intermediate_size computation (#30795)
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
2025-12-17 01:06:28 +00:00
Amr Mahdi
c0a88df7f7
[docker] Allow kv_connectors install to fail on arm64 (#30806)
Signed-off-by: Amr Mahdi <amrmahdi@meta.com>
2025-12-16 16:41:57 -08:00
Nicolò Lucchesi
e087fbc393
[MM] Pass FA version in ViT Attn (#30756)
Signed-off-by: NickLucche <nlucches@redhat.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-12-17 07:54:45 +08:00
Michael Goin
e80455ca8b
Replace deprecated enable_fusion with fuse_norm_quant in test_rms_group_quant (#30817)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-12-16 23:40:47 +00:00
TJian
2410132bb1
[ROCm] [Bugfix] Fix torch sdpa hallucination (#30789)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-12-16 15:32:43 -08:00
Michael Goin
0a1ab1e565
[Perf][Kernels] Vectorize csrc/activations_kernels.cu (#29512)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-12-16 14:56:02 -08:00
Wentao Ye
b6ec077e05
[CI] Skip ci failure test (#30804)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-12-16 22:47:53 +00:00
Jinzhen Lin
ce96857fdd
[Kernel][Quantization][MoE] add marlin kernel support for turing (sm75) (#29901)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-12-16 14:35:28 -08:00
Daniel Cámpora
eaa82a709a
[Bugfix][DSV32] Fix overflow in topk. (#30754)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-12-16 14:21:17 -08:00
Roger Wang
f5f51e5931
[Core][MM] Optimize encoder cache manager by operating with embeddings only (#30475)
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Sun Kim <sunytokki@gmail.com>
2025-12-16 14:18:17 -08:00
Lucas Wilkinson
9fec0e13d5
[Attention] Cache attention metadata builds across hybrid KV-cache groups (#29627)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Stanislaw Wozniak <stw@zurich.ibm.com>
2025-12-16 17:10:16 -05:00
jiahanc
254a7f8fd6
[Perf] Do FP4 quant before All gather on flashinfer trtllmgen MOE (#30014)
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
2025-12-16 13:01:48 -08:00
Wentao Ye
f21f5ea38c
[Refactor] Small refactor for group topk (#30562)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
2025-12-16 14:50:59 -05:00
Nicolò Lucchesi
ca702a14dc
[Frontend] Add max-completion-token option to transcription/translation endpoints (#30769)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-12-16 19:36:49 +00:00