Ronald
d8874c61a5
[Core] Async Scheduling X Spec Decoding Compatibility ( #24799 )
...
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com>
2025-11-17 12:16:20 -08:00
Zhewen Li
f8b19c0ffd
[Bugfix] Fix GPT-OSS on AMD after #28603 ( #28816 )
...
Signed-off-by: zhewenli <zhewenli@meta.com>
2025-11-17 13:15:26 -05:00
tiehexue
e42bd8c2e3
Cast return value to int64_t for cache size ( #28814 )
...
Signed-off-by: tiehexue <tiehexue@hotmail.com>
2025-11-17 16:02:32 +00:00
Roger Wang
7f064491f8
[Bugfix][Perf] Revert applying HF processor on text-only inputs for multimodal models ( #28858 )
...
Signed-off-by: Roger Wang <hey@rogerw.io>
2025-11-17 14:49:25 +00:00
Lucas Wilkinson
64e39d667c
[BugFix] Temporary fix for IMA with MTP = 2 and full-cg ( #28315 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-11-17 09:41:22 -05:00
Kunshang Ji
1b82fb0ad3
[XPU] work around for sp, avoid custom op import error ( #28822 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2025-11-17 13:16:44 +00:00
Jae-Won Chung
d4acf518d0
[Metrics] Fix KV cache usage percent metric multiproc ( #28792 )
...
The `vllm:kv_cache_usage_perc` Gauge metric is missing `multiprocess_mode="mostrecent"` and ends up returning one sample per process:
```
vllm:kv_cache_usage_perc{engine="0",model_name="Qwen/Qwen3-VL-8B-Instruct",pid="277"} 0.0
vllm:kv_cache_usage_perc{engine="0",model_name="Qwen/Qwen3-VL-8B-Instruct",pid="275"} 0.0
vllm:kv_cache_usage_perc{engine="0",model_name="Qwen/Qwen3-VL-8B-Instruct",pid="273"} 0.6530455880475035
...
```
The deprecated `vllm:gpu_cache_usage_perc` Gauge metric has `multiprocess_mode="mostrecent"`.
Signed-off-by: Jae-Won Chung <jwnchung@umich.edu>
2025-11-17 09:54:15 +00:00
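The per-process output quoted in #28792 above can be illustrated with a small, simplified model of how a multiprocess gauge is aggregated. This is only a sketch: the `Sample` type, the `aggregate` helper, and the timestamps are hypothetical and are not taken from vLLM or from `prometheus_client`. It assumes that `"all"` (the `prometheus_client` default for gauges) keeps one series per `pid`, while `"mostrecent"` keeps only the value written last.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    pid: int          # process that wrote the gauge
    value: float      # gauge value reported by that process
    timestamp: float  # hypothetical write time, used only for ordering


def aggregate(samples, mode):
    """Simplified model of multiprocess gauge aggregation (not library code)."""
    if mode == "all":
        # One series per process, distinguished by a pid label.
        return {f'pid="{s.pid}"': s.value for s in samples}
    if mode == "mostrecent":
        # A single series holding the most recently written value.
        latest = max(samples, key=lambda s: s.timestamp)
        return {"": latest.value}
    raise ValueError(f"unsupported mode: {mode}")


# Values copied from the commit message; timestamps are made up for illustration.
samples = [
    Sample(pid=277, value=0.0, timestamp=1.0),
    Sample(pid=275, value=0.0, timestamp=2.0),
    Sample(pid=273, value=0.6530455880475035, timestamp=3.0),
]

per_process = aggregate(samples, "all")
single = aggregate(samples, "mostrecent")
```

Under these assumptions, `per_process` holds three entries (two of them `0.0`), mirroring the quoted output, while `single` holds only the value from the process that wrote last.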
wuyaoxuehun
ab01cd14e5
[BugFix] Fix glm4_moe_mtp load weights bug ( #28805 )
...
Signed-off-by: wuyaoxuehun <798143193@qq.com>
2025-11-17 17:13:11 +08:00
Li, Jiang
577bb34fff
[CPU][Bugfix] Fix _to_list in CPU model runner ( #28824 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-11-17 07:47:24 +00:00
Jee Jee Li
3380ed5e11
[Doc] Add llama4 LoRA tag ( #28825 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-17 14:08:48 +08:00
Jay Caldwell
6f37419244
[Bugfix][Model] Prevent special token leakage in KimiK2ToolParser streaming mode ( #28543 )
...
Signed-off-by: Jscaldwell55 <jay.s.caldwell@gmail.com>
2025-11-17 13:54:46 +08:00
Xiake Sun
60e089f0b9
[ROCm][Qwen3-32B] Fix AITER MHA accuracy issue cause by #25763 ( #28670 )
...
Signed-off-by: Xiake Sun <xiake.sun@amd.com>
2025-11-16 20:52:11 -08:00
liuzhenwei
d64429bb36
[NIXL][XPU] update install script of NIXL ( #28778 )
...
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
2025-11-17 03:01:33 +00:00
jiahanc
561253b37f
[Performance][Fix] update nvfp4 code to support renorm routing ( #28569 )
...
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-11-16 18:02:42 -08:00
Nick Hill
80b6080ddc
[BugFix] Fix async scheduling + chunked prefill + preemption ( #28787 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-11-17 06:46:46 +08:00
amirkl94
03ee48111d
Feature: Support Relu2 in FusedMoE fp8 cutlass path ( #27261 )
2025-11-16 13:39:44 -05:00
Lukas Geiger
5a87076d6e
[Model][QwenVL] Optimize Qwen2_5_VisionAttention q,k preparation ( #28769 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-11-16 17:37:15 +00:00
Ning Xie
ac1daf3233
fix comment typo ( #28802 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-11-16 17:03:21 +00:00
Didier Durand
63fed55506
[Doc]: fix typos in various files ( #28811 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com>
2025-11-16 14:30:06 +00:00
Anna Shors
8d259fad6c
Fix gpt oss weight loading with EP + bf16 ( #28765 )
...
Signed-off-by: ashors1 <ashors@nvidia.com>
2025-11-16 13:12:45 +00:00
scottzh8
3bc1175798
[Bugfix] Fix host and port join for ipv6 in bench serve ( #28679 )
...
Signed-off-by: Scott Zhang <scottzh@fb.com>
Co-authored-by: Scott Zhang <scottzh@fb.com>
2025-11-16 10:20:57 +00:00
Dezhan
af02c40970
Fixed gpt-oss _load_weights_other() parameter position bug ( #28715 )
...
Co-authored-by: Dezhan Tu <dztu@meta.com>
2025-11-16 09:46:29 +00:00
Lucia Fang
b316ac6589
[V1] Support MP Executor for multi node distributed inference ( #23691 )
...
Signed-off-by: Lu Fang <fanglu@fb.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Signed-off-by: Lucia Fang <fanglu@fb.com>
Signed-off-by: Lucia Fang <116399278+luccafong@users.noreply.github.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-11-16 09:01:21 +00:00
wang.yuqi
a55b64635c
[Model] Allow users to control skip reading cache per request. ( #28194 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
2025-11-16 00:04:50 -08:00
ai-jz
d231876ce3
[Benchmark] Fix client seed synchronization in multi-turn benchmark ( #28512 )
...
Signed-off-by: ai-jz <aijz.xplr@gmail.com>
2025-11-16 15:04:32 +08:00
Bram Wasti
f849ee739c
Adding a benchmark for batch invariance ( #28161 )
...
Signed-off-by: Bram Wasti <bwasti@meta.com>
Signed-off-by: Bram Wasti <bwasti@fb.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-16 13:22:17 +08:00
Lucas Wilkinson
be263f7645
[BugFix] Fix AssertionError: DCP not support reorder_batch_threshold > 1 now. ( #28751 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-11-15 22:35:06 +00:00
Didier Durand
2bb4435cb7
[Doc]: fix typos in various files ( #28567 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com>
2025-11-15 19:27:50 +00:00
Lukas Geiger
07cadab27a
[Model][Qwen3VL] Cache positional embedding indices ( #28475 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
2025-11-15 19:03:09 +00:00
Nick Hill
637f292196
[CI] Fix broken pipeline ( #28781 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-11-15 08:44:14 -08:00
Eldar Kurtić
e439c784fa
Add support for Eagle with separate lm-head and embed_tokens layers ( #28549 )
...
Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>
2025-11-15 06:12:02 -08:00
hwhaokun
085a525332
[Model] Fix lmhead init bug of bailing_moe ( #28777 )
...
Signed-off-by: hwhaokun <haokun0405@163.com>
Co-authored-by: zhaozx-cn <zhaozx2116@163.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-15 05:44:12 -08:00
Cyrus Leung
89d3679221
[Doc] Fix failing doc build ( #28772 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-15 05:33:27 -08:00
tingtinggithub
cb15ee28db
Allow Gemma3 to take image embeddings ( #28483 )
...
Signed-off-by: tingtinggithub <streamttt@gmail.com>
2025-11-15 04:18:08 -08:00
Angela Yi
f36292dbee
[compile] Enable sequence parallelism matching w/o custom ops enabled ( #27126 )
...
Signed-off-by: angelayi <yiangela7@gmail.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Signed-off-by: ProExpertProg <lgovedic@redhat.com>
Co-authored-by: Luka Govedič <lgovedic@redhat.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Luka Govedič <luka.govedic@gmail.com>
2025-11-15 11:46:12 +00:00
Vadim Gimpelson
173b356abf
[PERF] Remove TRTLLM Gen attn kernel limitation max_seq_len <=131072 ( #28755 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2025-11-15 15:43:41 +05:30
Cyrus Leung
638e4196d1
[Misc] Make SchedulerConfig.max_model_len init-only ( #28733 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-15 01:59:31 -08:00
Zhewen Li
1ec978c209
[Kernel][Moe Configs] llama4 maverick fp8 moe config tp8 on mi325 ( #28709 )
...
Signed-off-by: Zhewen Li <zhewenli@meta.com>
2025-11-15 01:10:48 -08:00
Jane (Yuan) Xu
74b5267d3a
Use narrow over indexing in hadacore_transform to prep for ABI stable ( #28756 )
...
Signed-off-by: Jane Xu <janeyx@meta.com>
2025-11-15 01:10:15 -08:00
Zhuohan Li
dd6ac1c2bb
[RL] [V1] Remove unused device argument from reset_kv_cache ( #28766 )
...
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com>
2025-11-14 23:59:42 -08:00
Cyrus Leung
98b4d389ed
[Redo] #26368 ( #28771 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-11-14 22:47:41 -08:00
Varun Sundar Rabindranath
6965ef436f
[Performance][DeepGEMM] Estimate expected_m ( #28694 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-11-15 13:52:14 +08:00
Chendi.Xue
c9e665852a
[NIXL] heterogeneous block_size support ( #26759 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
2025-11-14 21:51:32 -08:00
Mohammad Othman
363aaeef0f
Fix IntermediateTensors initialization and add type hints ( #28743 )
...
Signed-off-by: Mohammad Othman <Mo@MohammadOthman.com>
Co-authored-by: Mohammad Othman <Mo@MohammadOthman.com>
2025-11-15 04:31:36 +00:00
Nick Hill
ac86bff8cb
Revert "[Core] Performance: Use list[np.ndarray] instead of list[list… ( #28773 )
2025-11-14 20:24:00 -08:00
Michael Goin
edfe498189
[Bugfix] Build hadacore kernels on >SM90 ( #28748 )
...
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-11-14 19:51:05 -08:00
Lukas Geiger
f05d474c8a
[Model][Qwen3VL] Use mm_position to compute mrope positions ( #28730 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-11-14 19:45:11 -08:00
QiliangCui
9fc81ec765
[TPU] Fix import error in tpu launch ( #28758 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com>
2025-11-15 00:58:32 +00:00
Jialin Ouyang
186352b270
[Core] Performance: Use list[np.ndarray] instead of list[list[int]] for output tokens for GC optimization ( #26368 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-11-14 16:04:04 -08:00
Nick Hill
58e61e56b7
[Test] Rework e2e async scheduling tests ( #28744 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-11-14 16:01:09 -08:00