Harry Mellor
8fcaaf6a16
Update Optional[x] -> x | None and Union[x, y] to x | y ( #26633 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-12 09:51:31 -07:00
wang.yuqi
76852017ea
[MISC] Rename the torch profiler filename as instance_id+rank_id for merging the Profiler results of each Rank ( #25867 )
...
Signed-off-by: wang.yuqi <noooop@126.com>
2025-10-12 09:29:08 +00:00
dsinghvi
727144bed1
[Refactor]: Use M-RoPE interface directly while defining model class instead of maintaining model specific M-RoPE implementation in mrope.py ( #24172 )
...
Signed-off-by: Divyansh Singhvi <divyanshsinghvi@gmail.com>
Signed-off-by: dsinghvi <divyanshsinghvi@gmail.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: wwl2755 <wangwenlong2755@gmail.com>
2025-10-11 07:21:04 +00:00
Roger Wang
ddaff2938e
[MM] Move Qwen3Omni MRoPE impl to model file ( #26608 )
...
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-10-10 22:17:24 -07:00
Nick Hill
5bc26c438d
[BugFix] Make penalties and bad_words work with async scheduling ( #26467 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-10-10 23:27:04 +00:00
Nick Hill
949cb0170d
[BugFix] Fix async scheduling + request preemption ( #26385 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-10-10 20:29:57 +00:00
Vadim Gimpelson
e94cfd51da
[BUG] Qwen3-next MTP. Fix attn metadata build bug ( #26564 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2025-10-10 14:59:03 -04:00
Mark McLoughlin
e519281920
[Metrics] Add test for multi-modal cache stats logging ( #26588 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-10-10 16:00:50 +00:00
Sage Moore
ae9d0e7da5
[Bugfix] Make DP padding optional in coordinate_batch_across_dp ( #26375 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-10-10 10:53:33 -04:00
Mark McLoughlin
784c231151
[NIXL] Ignore abort on already-finished request ( #25067 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-10-10 12:21:56 +02:00
Chen Zhang
606b00e80f
[bugfix][DCP] fix block_size of hash in DCP prefix caching ( #26296 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-10-10 03:02:49 -07:00
Cyrus Leung
ad430a67ca
[Metrics] Log multi-modal cache stats and fix reset ( #26285 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-10 01:45:55 -07:00
Chen Zhang
6f0f570c43
[deepseek] kernel block size for UniformTypeKVCacheSpecs ( #26559 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-10-10 16:40:41 +08:00
Lucas Wilkinson
29255cfc3b
[Spec-Decode] Support piecewise cudagraphs for Eagle head ( #25109 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com>
2025-10-10 01:20:31 -04:00
Nick Hill
aafb99a4d4
[Core] Small simplification in GPUModelRunner._update_states() ( #26508 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-10-10 10:53:58 +08:00
Rui Qiao
757fa4a4da
[DP][ray] Support different VLLM_RAY_DP_PACK_STRATEGY ( #23849 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-10-09 19:53:43 -07:00
Julien Denize
c6187f55f7
Refactor MistralTokenizer ( #26358 )
...
Signed-off-by: Julien Denize <julien.denize@mistral.ai>
2025-10-09 22:48:58 +00:00
Benjamin Chislett
6e783bc54b
[Bugfix] Fix CUDA graph selection bug in FlashInfer at high concurrency ( #26499 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
2025-10-09 17:12:34 -04:00
Nick Hill
2e54db4d2b
[Core] Remove unused prev_sampled_token_ids_invalid_indices input batch field ( #26514 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-10-09 20:22:14 +00:00
Ming Yang
3b736e1c38
[Attention][DCP] Support DCP with query length > 1 (MTP) with FA3 ( #25049 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com>
2025-10-09 08:06:29 -07:00
Cyrus Leung
4bdf7ac593
[Bugfix] Fix SHM cache initialization ( #26427 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-09 02:48:04 -07:00
Cyrus Leung
dc7976dd9f
[Misc] Upgrade more code to Python 3.10 ( #26463 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-09 10:43:53 +01:00
Nick Hill
ddcbc2f334
[Misc] Misc code simplifications ( #26450 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-10-09 02:10:06 -07:00
elvischenv
5e49c3e777
Bump Flashinfer to v0.4.0 ( #26326 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
2025-10-08 23:58:44 -07:00
Jee Jee Li
1b2c440cd6
[Core] Relax the LoRA max rank ( #26461 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-10-08 23:47:14 -07:00
Zhiyuan Li
d24cf322e1
[Hybrid]: Decouple Kernel Block Size from KV Page Size ( #24486 )
...
Signed-off-by: lizhiyuan <uniartisan2017@gmail.com>
Signed-off-by: Zhiyuan Li <uniartisan2017@gmail.com>
2025-10-08 23:43:39 -07:00
Qier Li
d17f0fbf30
[Core][KVConnector] Propagate all tokens on resumed preemptions ( #24926 )
...
Signed-off-by: Qier Li <kevin44036@gmail.com>
Co-authored-by: Qier Li <qier@fb.com>
2025-10-09 14:43:31 +08:00
Nick Hill
bb6d8c21f9
[Bugfix] Catch and log invalid token ids in detokenizer #2 ( #26445 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-10-08 21:20:25 -07:00
Naveenraj Kamalakannan
e614ab7806
Separate MLAAttention class from Attention ( #25103 )
...
Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-10-08 17:11:11 -07:00
Matthew Bonanni
2a03f93de9
[Attention] Register FLASHMLA_SPARSE ( #26441 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-10-08 22:28:52 +00:00
Elaine Zhao
f08919b7d1
[Bugfix] Respect min_tokens in scheduler stop check ( #26317 )
...
Signed-off-by: Elaine Zhao <elaineyz@amazon.com>
2025-10-08 14:08:24 -07:00
elvischenv
b82f4307c9
[Bugfix][Flashinfer] fix VLLM_USE_TRTLLM_ATTENTION issue for models with diff hyperparameters ( #25924 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
2025-10-08 19:54:48 +00:00
Harry Mellor
e09d1753ec
Remove Python 3.9 support ahead of PyTorch 2.9 in v0.11.1 ( #26416 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-08 10:40:42 -07:00
Harry Mellor
2f99f2f506
Tidy vllm/config/__init__.py to only add classes and functions ( #26405 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-08 07:10:00 -07:00
Utkarsh Sharma
335b28f7d1
[TPU] Rename tpu_commons to tpu_inference ( #26279 )
...
Signed-off-by: Utkarsh Sharma <utksharma@google.com>
Co-authored-by: Utkarsh Sharma <utksharma@google.com>
Co-authored-by: Chengji Yao <chengjiyao@google.com>
2025-10-07 23:30:52 -07:00
Ayush Satyam
5e65d6b2ad
fix[DP][v1]: Prevent hangs from mismatched worker configurations ( #26218 )
...
Signed-off-by: Ayush Satyam <ayushsatyam146@gmail.com>
2025-10-07 22:55:08 -07:00
Ayush Satyam
cd9890544b
fix(v1/kv_cache): resolve async KV transfer bug in cascade attention ( #23485 )
...
Signed-off-by: Ayush Satyam <ayushsatyam146@gmail.com>
2025-10-08 04:46:33 +00:00
Nick Hill
067da2d1df
[Core] Simplify setting new_token_ids in CachedRequestData ( #26388 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-10-08 03:32:37 +00:00
Lucas Wilkinson
f80e7866c0
[Misc] Clean up cruft from previous FlashMLA sparse implementation ( #26125 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-10-08 10:09:34 +08:00
Thomas Parnell
31a4b3e6c4
Revert #24446 and #26168 ( #26332 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-10-07 16:38:19 -06:00
Benjamin Chislett
3d1f67616d
[Spec Decode] Enable efficient speculative decoding with FlashInfer-MLA ( #25984 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
2025-10-07 16:05:59 -04:00
Sergei Skvortsov
6ebaf43ee4
[V1] Logit processors for rejection sampler ( #19482 )
...
Signed-off-by: southfreebird <yvorott@gmail.com>
Signed-off-by: Sergei Skvortsov <sergeyskv@nebius.com>
Signed-off-by: Sergei Skvortsov <yvorott@gmail.com>
Co-authored-by: Sergei Skvortsov <sergeyskv@nebius.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-10-07 13:02:49 -07:00
Pei-Lun Liao
eb577e4655
[Bugfix] Add missing sink tensor into flash attn cascade attn implementation ( #26325 )
2025-10-07 18:56:39 +00:00
Cyrus Leung
1e4ecca1d0
[V0 Deprecation] Remove VLLM_USE_V1 from tests ( #26341 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-07 15:42:31 +00:00
Grant Holmes (Ren)
d100d78eb3
Optimize KV cache distribution for asymmetric pipeline parallelism ( #25164 )
...
Signed-off-by: gholmes829 <g.holmes429@gmail.com>
2025-10-07 09:20:30 +00:00
Sage Moore
2111b4643c
[Core] Simplify the Dp padding/should ubatch coordination logic ( #25768 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-10-07 01:57:49 +00:00
Benjamin Chislett
f77df94647
[Perf] Add decode full-graph support to FlashInfer-MLA backend ( #26313 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
2025-10-06 23:03:49 +00:00
Gregory Shtrasberg
f231e5bc21
[ROCm] Split AITER unified attention into its own backend ( #25507 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-10-06 22:49:23 +00:00
Varun Sundar Rabindranath
f23b4c04fd
[BugFix] Pad input buffers in _dummy_run ( #26209 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-10-06 16:07:51 -04:00
7mile
b2ea5ba677
[Bugfix][Spec Decode] Fix wrong valid_mask for padded speculation when chunked prefill occurs ( #26231 )
...
Signed-off-by: seven-mile <i@7li.moe>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com>
2025-10-06 18:24:22 +00:00