Benjamin Chislett
85aff45e24
[Perf] Remove blocking copy in GDN Attention ( #31167 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
2025-12-22 14:25:22 -08:00
Wentao Ye
5312a7284e
[Bug] Fix 'CutlassMLAImpl' object has no attribute '_workspace_buffer' ( #31173 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-12-22 14:24:27 -08:00
Lucas Wilkinson
de71747655
[SpecDecode] Simplified alternative padded-speculation acceptance rate fix ( #29845 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-12-22 13:06:10 -08:00
Pavani Majety
b10f41c894
[SM100] Enable fp8 compute for prefill MLA ( #30746 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
2025-12-22 19:15:57 +00:00
Boyuan Feng
8dd0db687b
[UX] improve profiler error message ( #31125 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com>
2025-12-22 08:45:59 -08:00
dengyunyang
8f8f469b1b
[BugFix] skip language model in Encoder ( #30242 )
...
Signed-off-by: dengyunyang <584797741@qq.com>
2025-12-22 05:25:59 -08:00
Jeffrey Wang
1501a4070e
[Bugfix] Read truncate_prompt_tokens from pooling_params in AsyncLLM.encode() ( #31013 )
...
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
2025-12-20 10:29:31 +00:00
Lucas Wilkinson
5f6477d1d0
[BugFix] Fix TypeError: unhashable type: 'dict' when serving deepseek32 ( #30924 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-12-19 16:07:54 -05:00
Seiji Eicher
1ab5213531
Make engine core client handshake timeout configurable ( #27444 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
2025-12-19 20:38:30 +00:00
Nick Hill
2ac85a4544
[BugFix] Fix logprobs with spec decode and modified logits ( #30846 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-12-18 19:58:28 -08:00
Nick Hill
45c0526ac9
[BugFix] Handle errors when preprocessing added requests ( #30895 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-12-19 01:29:11 +00:00
Benjamin Chislett
d6b3d39b6d
[Cleanup] Refactor FlashInferMetadataBuilder ( #29128 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-12-18 14:45:30 -08:00
Nick Hill
b0b77c4655
[BugFix] Fix spec decode + structured outputs + preemption edge case ( #30916 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-12-18 12:59:55 -08:00
Chen Zhang
24b65eff0d
[BugFix] Spec decode with VLLM_ENABLE_V1_MULTIPROCESSING=0 ( #30319 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-12-18 19:47:56 +00:00
Alec
62be3670cb
[BugFix] Add sleep to fix tight loop and release GIL ( #29476 )
...
Signed-off-by: alec-flowers <aflowers@nvidia.com>
Signed-off-by: Alec <35311602+alec-flowers@users.noreply.github.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-12-18 09:52:55 -08:00
Nick Hill
686cbaac64
[Cleanup] Remove unused ModelRunner V1 InputBatch.num_tokens field ( #30218 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-12-18 09:17:00 -08:00
Andreas Karatzas
be2ad5f920
[ROCm][Bugfix] fix(structured_output): Skip guidance backend for schemas with patternProperties ( #30730 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2025-12-18 07:04:57 +00:00
Yifan Qiao
11a89cf95c
[Fix][FlexAttention] return max logical block index to handle reused blocks ( #30915 )
...
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
2025-12-18 06:42:21 +00:00
Micah Williamson
fd8afdf38d
[ROCm][CI] Reduce Flakiness For test_async_scheduling Using ROCM_ATTN With FP32 ( #30811 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
2025-12-18 10:27:37 +08:00
SungMinCho
a0b782f9cc
[Metrics] Model FLOPs Utilization estimation ( #30738 )
...
Signed-off-by: SungMinCho <tjdals4565@gmail.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
2025-12-18 01:40:51 +00:00
Isotr0py
74a1ac38b0
[v1] Add PrefixLM support to TritonAttention backend ( #30386 )
2025-12-17 16:05:24 -08:00
Matthew Bonanni
7eb6cb6c18
[Attention] Update tests to remove deprecated env vars ( #30563 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-12-17 09:49:59 -08:00
Cyrus Leung
2497228ad4
[Chore] Factor out logic for requesting initial memory ( #30868 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-17 07:32:17 -08:00
Jialin Ouyang
6e9dbcc50e
[Fix] uniform decode batch check ( #30747 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-12-17 19:58:43 +08:00
Harry Mellor
fb980eb2fd
Fix lazy import ( #30858 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-17 03:33:50 -08:00
Roger Wang
f5f51e5931
[Core][MM] Optimize encoder cache manager by operating with embeddings only ( #30475 )
...
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Sun Kim <sunytokki@gmail.com>
2025-12-16 14:18:17 -08:00
Lucas Wilkinson
9fec0e13d5
[Attention] Cache attention metadata builds across hybrid KV-cache groups ( #29627 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Stanislaw Wozniak <stw@zurich.ibm.com>
2025-12-16 17:10:16 -05:00
Harry Mellor
e1625498f4
Update where bytes_to_unicode is imported from ( #30771 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-16 08:05:01 -08:00
Lucas Wilkinson
00a8d7628c
[BugFix] Fix memory spike in workspace allocation ( #30744 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-12-16 06:46:22 -08:00
Nicolò Lucchesi
75eb302a2e
[Bugfix] Whisper fix number of allocated CrossAttn blocks per-request ( #30772 )
...
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-12-16 14:20:19 +00:00
Pleaplusone
9dbbc59b15
[ROCm][MTP] Support MTP for AITER MLA backend ( #28624 )
...
Signed-off-by: ganyi <ygan@amd.com>
2025-12-16 14:10:26 +00:00
Jee Jee Li
0e391e7570
[Bugfix] Fix RequestOutput miss lora_request ( #30636 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-12-16 01:36:35 -08:00
jiangkuaixue123
b9ff4f2a8d
[feature] extend DBO to XBO ( #30120 )
...
Signed-off-by: jiangkuaixue123 <jiangxiaozhou111@163.com>
Co-authored-by: root <root@hk01dgx028.cm.cluster>
2025-12-16 00:04:01 -05:00
Matthew Bonanni
60dbf7d8f1
Update batch invariant to use attention config ( #30704 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-15 15:24:16 -05:00
Jee Jee Li
a524d1ba0a
[Bugfix] Fix deepseek_v32 tokenizer_mode ( #30658 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-12-15 04:20:31 +00:00
Or Ozeri
174e39ead7
CPU KV Offloading: Use more CUDA streams ( #29013 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com>
2025-12-14 23:50:45 +00:00
Johannes F
060893654d
fix: Update json features supported by xGrammar ( #30390 )
...
Signed-off-by: Johannes Flommersfeld <johannes.flommersfeld@tngtech.com>
Signed-off-by: Johannes F <johannesflommersfeld@users.noreply.github.com>
Co-authored-by: Johannes Flommersfeld <johannes.flommersfeld@tngtech.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-14 02:16:06 -08:00
drslark
add1b9d3de
[main][BugFix] Fixed an accuracy bug of Qwen3-next-MTP when batched inferring ( #30632 )
...
Signed-off-by: drslark <slarksblood@qq.com>
2025-12-14 01:32:16 -08:00
Wentao Ye
6e78ed6ba7
[Logs] Optimize startup logs 4 ( #29903 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-13 16:12:53 -05:00
Isotr0py
7c16f3fbcc
[Doc] Add documents for multi-node distributed serving with MP backend ( #30509 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-12-13 18:02:29 +00:00
Cyrus Leung
39cefbdf17
[Refactor] TokenizerRegistry only uses lazy imports ( #30609 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-13 23:16:22 +08:00
Cyrus Leung
64251f48df
[Chore] Adjust tokenizer import to avoid circular imports ( #30601 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-13 04:42:39 -08:00
Nick Hill
1cec5b7ea9
[Scheduler] Simplify stop checking for pooling models ( #30591 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-12-13 09:45:26 +00:00
Cyrus Leung
b09806e28f
[Bugfix] Dictionary MM embeddings for online chat ( #30507 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-13 15:48:56 +08:00
Roberto L. Castro
4fa7ce46f3
[Feature] Add SM103 (Blackwell Ultra) Support to vLLM ( #30484 )
...
Signed-off-by: LopezCastroRoberto <robertol.c510@gmail.com>
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2025-12-12 19:34:23 -08:00
Wentao Ye
02a5880394
[CI] Fix mypy for vllm/v1/executor ( #30517 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-12-12 18:05:34 +00:00
realliujiaxu
d2c919dcc2
[bugfix] fix bug when top_logprobs=0 with spec decoding ( #30059 )
...
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2025-12-12 09:03:35 -08:00
jvlunteren
9c0ee995a8
[Kernel] Support CUDA Graphs in 3D Triton Attention Kernel ( #28306 )
...
Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com>
Signed-off-by: jvlunteren <161835099+jvlunteren@users.noreply.github.com>
Co-authored-by: Thomas Parnell <tom.parnell@gmail.com>
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-12-12 16:55:40 +01:00
Lucas Wilkinson
3e41992fec
[Attention] Use sparse prefill kernel for fp8 kv-cache in DeepSeek-v3.2 ( #27532 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-12-12 05:57:47 -08:00
Lucas Wilkinson
042da73244
[Core] Refactor _build_attention_metadata ( #29628 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-12-11 17:54:12 -08:00