8406 Commits

Author SHA1 Message Date
Ming Yang
82216dc21f
[Misc] Support routing logic simulation (#21990)
Signed-off-by: Ming Yang <minos.future@gmail.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-06 23:06:20 -07:00
Moritz Sanft
370661856b
[Frontend] Update OpenAI error response to upstream format (#22099)
Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>
2025-08-06 23:06:00 -07:00
vllmellm
cbc8457b26
[Model] Switch to Fused RMS norm in Qwen2.5_VL model. (#22184)
Signed-off-by: kf <kuanfu.liu@embeddedllm.com>
Signed-off-by: tjtanaavllm <tunjian.tan@amd.com>
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Co-authored-by: kf <kuanfu.liu@embeddedllm.com>
2025-08-06 23:05:24 -07:00
lkchen
4d4297e8fe
[Bench] Split serve.py:main into async/async versions (#22405)
Signed-off-by: Linkun <github@lkchen.net>
2025-08-06 23:05:07 -07:00
wang.yuqi
2a4c825523
[CI] Skip the pooling models that do not support transformers v4.55 (#22411)
Signed-off-by: wang.yuqi <noooop@126.com>
2025-08-06 23:05:03 -07:00
WeiQing Chen
4be02a3776
[Bugfix] EPLB load statistics problem (#22167)
Signed-off-by: ycyaw66 <497410282@qq.com>
Signed-off-by: David Chen <530634352@qq.com>
Co-authored-by: ycyaw66 <497410282@qq.com>
2025-08-07 04:07:54 +00:00
Chen Zhang
f6278b6243
[gpt-oss] Convert user input to harmony format (#22402)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-06 20:56:02 -07:00
Lionel Villard
ad6c655dde
preload heavy modules when mp method is forkserver (#22214)
Signed-off-by: Lionel Villard <villard@us.ibm.com>
2025-08-06 20:33:24 -07:00
ZiTian.Zhao
14bcf93a6a
Optimize logger init performance by using module-level constants (#22373)
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
2025-08-06 20:32:19 -07:00
Harry Mellor
ecbea55ca2
Update hf_xet pin to resolve hangs (#22356)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-08-06 20:31:41 -07:00
Syed Muhammad Bin Asif
609b533cb6
[Bugfix] Add proper comparison for package versions (#22314)
Signed-off-by: Syed Muhammad Bin Asif <syedmba7@connect.hku.hk>
2025-08-06 20:31:03 -07:00
qscqesze
5e9455ae8f
[Bugfix]: Fix the streaming output for function calls in the minimax (#22015)
Signed-off-by: QscQ <qscqesze@gmail.com>
Signed-off-by: qingjun <qingjun@minimaxi.com>
2025-08-06 20:30:27 -07:00
Michael Goin
a00d8b236f
Use float32 for test_completion.py (#22385)
Signed-off-by: Michael Goin <mgoin64@gmail.com>
2025-08-07 11:07:47 +08:00
Cyrus Leung
04cf435d95
[Bugfix] Fix wrong method name in Intern-S1 image processor (#22417)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-06 20:05:20 -07:00
Tao He
7377131a2c
[Qwen3] Enable dual-chunk-attention support for Qwen3 models. (#21924)
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
2025-08-06 19:58:08 -07:00
Kunshang Ji
6b47ef24de
[XPU]Fix flash_attn_varlen_func interface on xpu (#22350)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2025-08-06 19:28:11 -07:00
Lucas Wilkinson
1dc8a70b6d
[Attention] Support multiple attention metadata builders per kv_cache_spec + proper local attention no hybrid kv cache fix (#21588)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-08-06 18:40:52 -07:00
Maximilien de Bayser
f825c6bd22
Support encoder_only attention for FlexAttention (#22273)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2025-08-06 18:37:14 -07:00
tc-mb
41b67f4263
[model] Support MiniCPM-V 4.0 (#22166)
Co-authored-by: imning3 <hbning@pku.edu.cn>
2025-08-06 18:35:46 -07:00
Michael Goin
e8961e963a
Update flashinfer-python==0.2.10 (#22389)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-08-06 18:10:24 -07:00
Lain
9a3835aaa9
Fix trtllm-gen attention env and add attention sink (#22378)
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
Signed-off-by: Lain <fusiyuan2000@hotmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>
2025-08-06 18:07:41 -07:00
Yongye Zhu
5c7cc33f4d
[gpt-oss] fix model config with hf_config (#22401)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
2025-08-06 18:04:04 -07:00
Chen Zhang
19c9365aa4
[gpt-oss] add demo tool server (#22393)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-08-06 17:47:14 -07:00
Wentao Ye
eec890c1c1
[Bug] Fix B200 DeepGEMM E8M0 Accuracy Issue (#22399)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-08-06 17:03:53 -07:00
Asaf Joseph Gardin
46a13949d5
[v1] - Mamba1 Attention Metadata (#21249)
Signed-off-by: asafg <asafg@ai21.com>
Co-authored-by: asafg <asafg@ai21.com>
2025-08-06 17:03:42 -07:00
Yongye Zhu
31f09c615f
[gpt-oss] flashinfer mxfp4 (#22339)
Signed-off-by: simon-mo <xmo@berkeley.edu>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
2025-08-06 12:37:27 -07:00
Yongye Zhu
31f5dc5b2a
[gpt-oss] Enhance error msg on attention sink init (#22335)
Signed-off-by: simon-mo <xmo@berkeley.edu>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
2025-08-06 11:41:42 -07:00
Woosuk Kwon
ec7cb19224
[gpt-oss] Add loop for built-in tool call (#22374)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>
2025-08-06 10:32:21 -07:00
Gregory Shtrasberg
2435ea7ed5
[Bugfix] Make condition in triton kernel constexpr (#22370)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-08-06 10:00:58 -07:00
Lucas Wilkinson
4a6b72c2ab
[BugFix] Fix triton compile error in kernel_unified_attention_2/3d caused by attention sinks (#22368)
Signed-off-by: LucasWilkinson <lwilkinson@neuralmagic.com>
2025-08-06 09:47:38 -07:00
Zhang Jason
b4b9813b5e
add the codes to check AMD Instinct GPU number (#22367)
Signed-off-by: Zhang Jason <ning.zhang2@amd.com>
2025-08-06 08:58:38 -07:00
Lucas Wilkinson
2cb6ef8996
[BugFix] Fix FA2 RuntimeError when sinks is provided (#22365)
Signed-off-by: LucasWilkinson <lwilkinson@neuralmagic.com>
2025-08-06 08:03:03 -07:00
Woosuk Kwon
9edd1db02b
[Minor] Fix type (#22347)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-06 02:22:03 -07:00
Woosuk Kwon
f263a4b53f
[gpt-oss] Support chat completion api (#22342) 2025-08-06 01:57:39 -07:00
Roger Wang
54991c548a
[gpt-oss] add model to supported models doc (#22336)
Signed-off-by: Roger Wang <hey@rogerw.me>
2025-08-06 01:49:44 -07:00
Woosuk Kwon
178d03fbd6
[gpt-oss] Add Tool/ConversationContext classes and harmony_utils (#22340)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>
2025-08-06 01:08:49 -07:00
Isotr0py
fa00c5d75b
[Misc] Clean up duplicated hf overrides (#22311)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-08-06 07:50:25 +00:00
Woosuk Kwon
134a8ee8fd
[gpt-oss] Add openai-harmony as default dependency (#22332)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>
2025-08-06 00:10:14 -07:00
Yongye Zhu
90ec006937
[gpt-oss] flashinfer attention sink init (#22330)
Signed-off-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com>
2025-08-05 23:48:19 -07:00
Chen Zhang
a47e6ffe93
[GptOss] Add GptOss reasoning parser to support structure output (#22322)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>
2025-08-05 23:39:13 -07:00
Woosuk Kwon
98a3a81024
[ROCm] Add attention sink to use_rocm_custom_paged_attention (#22329)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>
2025-08-05 23:30:38 -07:00
Woosuk Kwon
de98252f49
Add GPT-OSS model code and config [1/N] (#22327)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-05 23:26:00 -07:00
Harry Mellor
796bae07c5
Update transformers to v4.55 (#21931)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-05 22:56:14 -07:00
Woosuk Kwon
6e20924350
Add attention sink in attention backends (#22320)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>
2025-08-05 22:37:21 -07:00
Woosuk Kwon
dd16bdc798
Increase openai-python version (#22316)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-05 21:43:21 -07:00
Woosuk Kwon
e3c876dca3
Upgrade FA3 for attention sink (#22313)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-05 21:36:21 -07:00
Gregory Shtrasberg
5d5d419ca6
[Bugfix][CI/Build][ROCm] Make sure to use the headers from the build folder on ROCm (#22264)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-08-05 20:39:32 -07:00
Rui Qiao
302962e806
[Bugfix] Skip dead and non-GPU nodes for Ray DP engine allocation (#22275)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-08-05 20:35:32 -07:00
Benjamin Chislett
7e6544c797
[Perf] Parallelize fill_bitmask to accelerate high-throughput guided decoding (#21862)
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
2025-08-05 19:57:49 -07:00
Jee Jee Li
8e6c7e873f
[Bugfix] Fix MoE BNB version (#22260)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-08-05 19:56:22 -07:00