Michael Goin
10f01d5a3a
[Bugfix] Adjust Marlin CUDA arch selection to 8.0+PTX;9.0+PTX ( #28294 )
2025-11-12 15:14:13 -08:00
QiliangCui
3eb0c2673e
[TPU] Support GCS path in VLLM_TORCH_PROFILER_DIR ( #28487 )
Signed-off-by: Qiliang Cui <derrhein@gmail.com>
2025-11-12 22:31:14 +00:00
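VLLM_TORCH_PROFILER_DIR tells vLLM where to write torch profiler traces; this change allows it to be a Google Cloud Storage (gs://) path on TPU. Below is a minimal sketch of the general pattern, staging traces locally when a GCS URI is configured; the helper name, staging path, and upload flow are illustrative assumptions rather than the code from #28487.

```python
# Illustrative sketch only: dispatch a profiler output dir between a local
# path and a GCS URI. The staging path and upload step are hypothetical;
# the actual change in #28487 may handle this differently.
import os


def resolve_profiler_dir() -> tuple[str, bool]:
    """Return (local_dir_to_write_traces, needs_gcs_upload)."""
    configured = os.environ.get("VLLM_TORCH_PROFILER_DIR", "")
    if configured.startswith("gs://"):
        # Write traces to a local staging dir first, then sync them to GCS
        # after profiling finishes.
        staging = "/tmp/vllm_profiler_staging"  # hypothetical staging path
        os.makedirs(staging, exist_ok=True)
        return staging, True
    return os.path.abspath(os.path.expanduser(configured)), False
```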
vllmellm
d8140b9833
[ROCm] Fix ROCm warnings, environment flag access, and GEMM kernel naming for consistency in _aiter_ops.py ( #28464 )
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2025-11-12 21:46:57 +00:00
Varun Sundar Rabindranath
74a9a9faad
[Performance][B200] Fix deepgemm prologue ( #27897 )
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-11-12 13:13:03 -08:00
Wei Wei
478ee511de
[Misc] Fix typo in llm_engine.py ( #28584 )
Signed-off-by: Wei Wei <wwei6@meta.com>
2025-11-12 12:59:43 -08:00
Andy Lo
58ce8d12b7
[BugFix] Priority scheduling and spec tokens preemption ( #28558 )
Signed-off-by: Andy Lo <andy@mistral.ai>
2025-11-12 20:29:21 +00:00
Yihua Cheng
94a9ebcf31
[KV connector][WIP] KV cache proxy based on LMCache multi-process mode ( #27902 )
Signed-off-by: ApostaC <yihua98@uchicago.edu>
2025-11-12 20:25:43 +00:00
Harry Mellor
a39dd7bb06
[CI] Skip "Multi-Modal Models Test (Extended) 3" test that's broken in current Transformers ( #28559 )
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-12 19:38:13 +00:00
Thomas Parnell
64d57c3be7
[Model] [Config] Correctly identify granite-4.0-micro as non-hybrid model ( #28563 )
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-11-12 18:17:55 +00:00
PerryZhang01
a1e7fa362a
[EPLB][ROCm] Support EPLB for ROCm backend ( #27731 )
Signed-off-by: Perry Zhang <perzhang@amd.com>
Co-authored-by: Perry Zhang <perzhang@amd.com>
2025-11-12 18:16:35 +00:00
alberto
bac904565f
Implement ARC KV cache eviction policy for CPU offloader ( #27039 )
Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
Signed-off-by: alberto <aperdomo@redhat.com>
Co-authored-by: Or Ozeri <or@ozery.com>
2025-11-12 09:51:39 -08:00
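ARC (Adaptive Replacement Cache) keeps two live LRU lists (keys seen once recently and keys seen at least twice) plus two "ghost" lists of recently evicted keys, and uses ghost hits to adapt how much space each live list gets, which makes it more scan-resistant than plain LRU. The snippet below is a compact, textbook-style Python sketch of the policy for orientation only; it is not the CPU-offloader code from #27039, and the class and method names are assumptions.

```python
# Textbook-style ARC sketch: t1/t2 are the live lists, b1/b2 the ghost lists,
# and p is the adaptive target size for t1. Illustrative only.
from collections import OrderedDict


class ARCCache:
    def __init__(self, capacity: int):
        self.c = capacity
        self.p = 0                # adaptive target size for t1
        self.t1 = OrderedDict()   # cached, seen once recently
        self.t2 = OrderedDict()   # cached, seen at least twice
        self.b1 = OrderedDict()   # ghost keys evicted from t1
        self.b2 = OrderedDict()   # ghost keys evicted from t2

    def _replace(self, hit_in_b2: bool) -> None:
        # Evict one entry from t1 or t2 into the matching ghost list.
        evict_t1 = bool(self.t1) and (
            len(self.t1) > self.p
            or (hit_in_b2 and len(self.t1) == self.p)
            or not self.t2
        )
        if evict_t1:
            key, _ = self.t1.popitem(last=False)
            self.b1[key] = None
        else:
            key, _ = self.t2.popitem(last=False)
            self.b2[key] = None

    def get(self, key):
        if key in self.t1:        # hit: promote to the frequent list
            value = self.t1.pop(key)
            self.t2[key] = value
            return value
        if key in self.t2:        # hit: refresh recency within t2
            self.t2.move_to_end(key)
            return self.t2[key]
        return None

    def put(self, key, value) -> None:
        if key in self.t1 or key in self.t2:   # update of a cached key
            self.t1.pop(key, None)
            self.t2.pop(key, None)
            self.t2[key] = value
            return
        if key in self.b1:        # ghost hit: recency is winning, grow p
            self.p = min(self.c, self.p + max(len(self.b2) // len(self.b1), 1))
            self._replace(hit_in_b2=False)
            del self.b1[key]
            self.t2[key] = value
            return
        if key in self.b2:        # ghost hit: frequency is winning, shrink p
            self.p = max(0, self.p - max(len(self.b1) // len(self.b2), 1))
            self._replace(hit_in_b2=True)
            del self.b2[key]
            self.t2[key] = value
            return
        # Entirely new key (case IV in the ARC paper).
        if len(self.t1) + len(self.b1) == self.c:
            if len(self.t1) < self.c:
                self.b1.popitem(last=False)
                self._replace(hit_in_b2=False)
            else:
                self.t1.popitem(last=False)
        elif len(self.t1) + len(self.t2) + len(self.b1) + len(self.b2) >= self.c:
            if len(self.t1) + len(self.t2) + len(self.b1) + len(self.b2) >= 2 * self.c:
                self.b2.popitem(last=False)
            self._replace(hit_in_b2=False)
        self.t1[key] = value
```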
Benjamin Chislett
304419576a
[Perf] Refactor cudagraph_support to enable full CUDA graphs for spec decoding with FlashInfer ( #28479 )
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
2025-11-13 01:56:40 +09:00
Harry Mellor
a742134cc5
Remove deprecated fields from CompilationConfig ( #27593 )
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-12 16:10:28 +00:00
Nicolò Lucchesi
728a9eb70e
[Misc] Refactor Attention kv transfer methods into decorator ( #27816 )
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
2025-11-12 16:05:44 +00:00
Canlin Guo
bc5bd45c7d
[Refactor] Remove redundant TP gather/split in split_qkv in QwenVL ( #28271 )
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
2025-11-12 15:56:47 +00:00
Alexander Matveev
f76e85c299
[Performance][Hopper] Avoid padding the M dim to 4x in most cases (due to CUDA graph padding) ( #28492 )
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
2025-11-12 10:51:43 -05:00
Harry Mellor
54aecd9ed5
Fix pre-commit (and XPU) on main ( #28556 )
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-12 06:13:41 -08:00
wangxiyuan
10138c92a5
[V0 deprecation] Deprecate use_v1 parameter ( #28112 )
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-12 14:03:52 +00:00
Jee Jee Li
a9d18b5107
[Bugfix] Fix gpt_oss packed_modules_mapping ( #28536 )
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-12 21:02:06 +08:00
TJian
edb59a9470
[ROCm] [Bugfix] Fix fused_qknorm_rope_kernel ROCm compatibility ( #28500 )
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-11-12 05:01:14 -08:00
ZhengHongming888
c5f10cc139
Add CPU option for P/D in nixl_connector ( #28356 )
Signed-off-by: Hongming Zheng <hongming.zheng@intel.com>
2025-11-12 11:53:08 +00:00
ziruiliu
d143152308
[KVConnector] Enable get_block_ids_with_load_errors() in LMCache connector ( #27978 )
Signed-off-by: Zirui Liu <ziliu@ddn.com>
Signed-off-by: ziruiliu <ziliu@ddn.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
2025-11-12 11:44:58 +01:00
Chaojun Zhang
a4730c1b4f
[XPU]Fix crash due to removed VLLM_USE_V1 attribute ( #28520 )
Signed-off-by: chaojun-zhang <chaojun.zhang@intel.com>
2025-11-12 10:20:55 +00:00
wuyaoxuehun
d3ade61e42
[Model] Fix glm4_moe_mtp weight loading with GLM-4.6 checkpoint ( #27597 )
Signed-off-by: wuao.scotty <wuao.scotty@bytedance.com>
Co-authored-by: wuao.scotty <wuao.scotty@bytedance.com>
2025-11-12 10:14:00 +00:00
yyzxw
1761dea1a8
[BugFix] Fix crash when using --enable-lora with granite-4.0-micro ( #27733 )
Signed-off-by: zxw <1020938856@qq.com>
2025-11-12 09:03:56 +00:00
Huamin Li
c748355e0d
[CI] Introduce autorun_on_main feature ( #27836 )
Signed-off-by: Huamin Li <3ericli@gmail.com>
2025-11-12 08:51:19 +00:00
Chenguang Zheng
91864b79b3
[CI/Build] Fix crash due to removed VLLM_USE_V1 attribute in EPD ( #28521 )
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Co-authored-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
2025-11-11 23:09:33 -08:00
Lukas Geiger
ac0bb2c307
[Core] Cache vllm_is_batch_invariant ( #28304 )
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
2025-11-12 05:03:01 +00:00
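Flags like this are usually derived from an environment variable and consulted on hot paths, so memoizing the lookup turns repeated parsing into a single cached call. Below is a minimal sketch of the pattern, assuming the flag comes from an env var named VLLM_BATCH_INVARIANT (the variable name and parsing are assumptions, not necessarily what #28304 does).

```python
# Sketch of memoizing an env-derived feature flag so it is parsed only once.
# The env var name is an assumption for illustration.
import os
from functools import cache


@cache
def vllm_is_batch_invariant() -> bool:
    return os.environ.get("VLLM_BATCH_INVARIANT", "0").lower() in ("1", "true")
```

The trade-off is that changes to the environment variable after the first call are ignored unless the cache is cleared (functools exposes cache_clear() on the wrapped function).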
ai-jz
f31419ed8b
[Benchmark] Add retry support to fix workload bias in multi-turn benchmark ( #28493 )
2025-11-12 05:00:45 +00:00
Fanli Lin
b9ce9a3013
[BugFix] Add fallback path in apply_rotary_pos_emb_flashattn for non-cuda platforms ( #28447 )
Signed-off-by: Lin, Fanli <fanli.lin@intel.com>
2025-11-12 03:13:21 +00:00
Chenguang Zheng
4ccffe561f
[Core] Encoder separation for Encode-Prefill-Decode Disaggregation ( #25233 )
Signed-off-by: n00909098 <nguyen.kha.long@huawei.com>
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Signed-off-by: herotai214 <herotai214@gmail.com>
Signed-off-by: Khuong Le <khuong.le.manh@huawei.com>
Signed-off-by: Khuong Le <lemanhkhuong2611@gmail.com>
Co-authored-by: n00909098 <nguyen.kha.long@huawei.com>
Co-authored-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Co-authored-by: herotai214 <herotai214@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Khuong Le <khuong.le.manh@huawei.com>
Co-authored-by: Khuong Le <lemanhkhuong2611@gmail.com>
2025-11-11 18:58:33 -08:00
Lukas Geiger
cbb799e314
[Model][Qwen3VL] Simplify get_mrope_input_positions using numpy ( #28302 )
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
2025-11-12 02:55:10 +00:00
Andreas Karatzas
9f0247cfa4
Deprecate the V0 VLLM_USE_TRITON_FLASH_ATTN variable ( #27611 )
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Andreas Karatzas <Andreas.Karatzas@amd.com>
2025-11-11 18:34:36 -08:00
Li, Jiang
7f829be7d3
[CPU] Refactor CPU attention backend ( #27954 )
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-11-12 09:43:06 +08:00
wangxiyuan
e1710393c4
[V0 deprecation] Remove VLLM_USE_V1 env ( #28204 )
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-11 18:22:16 -07:00
Isotr0py
3f770f4427
[Performance] Cache loaded custom logitsprocs to avoid overheads ( #28462 )
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-11-11 16:49:29 -08:00
Yanan Cao
48c879369f
[Frontend] Change CompilationMode to a proper Enum ( #28165 )
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com>
2025-11-11 19:46:18 -05:00
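Promoting a mode flag from loose int/str constants to a real Enum gives type-checked comparisons, a readable repr, and a single place to validate user input. A small sketch of the pattern follows; the member names and the legacy-coercion helper are illustrative assumptions, not necessarily the definitions from #28165.

```python
# Illustrative sketch of promoting a compilation-mode flag to a proper Enum.
# Member names are assumptions for the example.
import enum


class CompilationMode(enum.Enum):
    NONE = 0
    DYNAMO_TRACE_ONCE = 1
    FULL = 2

    @classmethod
    def from_any(cls, value: "int | str | CompilationMode") -> "CompilationMode":
        # Accept legacy int or string spellings during a deprecation window.
        if isinstance(value, cls):
            return value
        if isinstance(value, int):
            return cls(value)
        return cls[value.upper()]
```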
Ilya Markov
1788aa1efb
[BugFix] Graceful handling of torch symm mem errors. ( #27671 )
Signed-off-by: ilmarkov <markovilya197@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-11-11 17:41:54 -07:00
Adrian Abeyta
d23539549a
Use FLASHINFER MLA backend when testing fp8_kv_scale_compile ( #28491 )
Signed-off-by: adabeyta <aabeyta@redhat.com>
2025-11-12 00:34:58 +00:00
Max Hu
412e153df5
[Feature] Allow configuring FlashInfer workspace size ( #28269 )
Signed-off-by: Max Hu <hyoung2991@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-11 23:32:20 +00:00
Michael Goin
e5f599d4d1
[Bugfix] Disable shared expert overlap if Marlin MoE is used ( #28410 )
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-11-11 23:16:12 +00:00
Michael Goin
28534b92b9
Add Zurich vLLM Meetup ( #28488 )
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-11-11 14:53:59 -08:00
wangxiyuan
d4902ba56d
[Misc] Cleanup Executor interface ( #28441 )
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-11 22:28:07 +00:00
Kyuyeun Kim
df4d3a44a8
[TPU] Rename path to tpu platform ( #28452 )
Signed-off-by: Kyuyeun Kim <kyuyeunk@google.com>
2025-11-11 19:16:47 +00:00
Jee Jee Li
9d1c474704
[LoRA][1/N]Remove LoRA extra vocab ( #28382 )
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-11 11:06:21 -08:00
Jie Luo
8c32c6e4b4
[Misc] fix typo in DCP comment ( #28389 )
Signed-off-by: Livinfly <luojie3m@gmail.com>
2025-11-11 10:59:16 -08:00
Canlin Guo
de120bc94f
[V0 deprecation] Clean up num_prefill_tokens logic for V0 ( #28203 )
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
2025-11-11 10:57:12 -08:00
Jialin Ouyang
4228be7959
[Perf] Use np.ndarray instead of list[list[int]] to reduce GC overhead ( #28245 )
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-11-11 10:28:47 -08:00
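Nested Python lists allocate one heap object per row and per element, all of which the cyclic garbage collector must traverse, whereas a single NumPy array is one buffer the GC can ignore. A generic sketch of the idea (not the code from #28245): pack ragged rows of token IDs into one padded int32 array.

```python
# Sketch: store a batch of token-id rows as one padded np.ndarray instead of
# a list of Python lists, so the GC tracks one object instead of millions of
# ints and list cells.
import numpy as np


def pack_token_ids(rows: list[list[int]], pad: int = 0) -> np.ndarray:
    """Pack ragged rows into a (num_rows, max_len) int32 array, padded."""
    max_len = max((len(r) for r in rows), default=0)
    out = np.full((len(rows), max_len), pad, dtype=np.int32)
    for i, row in enumerate(rows):
        out[i, : len(row)] = row
    return out
```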
Lukas Geiger
76e4dcf225
[Misc] Remove unused attention prefix prefill ops functions ( #26971 )
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
2025-11-11 18:26:04 +00:00
Fanli Lin
d5edcb8678
[BugFix] Fix Siglip2Attention on XPU ( #28448 )
Signed-off-by: Lin, Fanli <fanli.lin@intel.com>
2025-11-11 18:18:02 +00:00