8345 Commits

Author SHA1 Message Date
qscqesze
5e9455ae8f
[Bugfix]: Fix the streaming output for function calls in the minimax (#22015)
Signed-off-by: QscQ <qscqesze@gmail.com>
Signed-off-by: qingjun <qingjun@minimaxi.com>
2025-08-06 20:30:27 -07:00
Michael Goin
a00d8b236f
Use float32 for test_completion.py (#22385)
Signed-off-by: Michael Goin <mgoin64@gmail.com>
2025-08-07 11:07:47 +08:00
Cyrus Leung
04cf435d95
[Bugfix] Fix wrong method name in Intern-S1 image processor (#22417)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-06 20:05:20 -07:00
Tao He
7377131a2c
[Qwen3] Enable dual-chunk-attention support for Qwen3 models. (#21924)
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
2025-08-06 19:58:08 -07:00
Kunshang Ji
6b47ef24de
[XPU]Fix flash_attn_varlen_func interface on xpu (#22350)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2025-08-06 19:28:11 -07:00
Lucas Wilkinson
1dc8a70b6d
[Attention] Support multiple attention metadata builders per kv_cache_spec + proper local attention no hybrid kv cache fix (#21588)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-08-06 18:40:52 -07:00
Maximilien de Bayser
f825c6bd22
Support encoder_only attention for FlexAttention (#22273)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2025-08-06 18:37:14 -07:00
tc-mb
41b67f4263
[model] Support MiniCPM-V 4.0 (#22166)
Co-authored-by: imning3 <hbning@pku.edu.cn>
2025-08-06 18:35:46 -07:00
Michael Goin
e8961e963a
Update flashinfer-python==0.2.10 (#22389)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-08-06 18:10:24 -07:00
Lain
9a3835aaa9
Fix trtllm-gen attention env and add attention sink (#22378)
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
Signed-off-by: Lain <fusiyuan2000@hotmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>
2025-08-06 18:07:41 -07:00
Yongye Zhu
5c7cc33f4d
[gpt-oss] fix model config with hf_config (#22401)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
2025-08-06 18:04:04 -07:00
Chen Zhang
19c9365aa4
[gpt-oss] add demo tool server (#22393)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-08-06 17:47:14 -07:00
Wentao Ye
eec890c1c1
[Bug] Fix B200 DeepGEMM E8M0 Accuracy Issue (#22399)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-08-06 17:03:53 -07:00
Asaf Joseph Gardin
46a13949d5
[v1] - Mamba1 Attention Metadata (#21249)
Signed-off-by: asafg <asafg@ai21.com>
Co-authored-by: asafg <asafg@ai21.com>
2025-08-06 17:03:42 -07:00
Yongye Zhu
31f09c615f
[gpt-oss] flashinfer mxfp4 (#22339)
Signed-off-by: simon-mo <xmo@berkeley.edu>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
2025-08-06 12:37:27 -07:00
Yongye Zhu
31f5dc5b2a
[gpt-oss] Enhance error msg on attention sink init (#22335)
Signed-off-by: simon-mo <xmo@berkeley.edu>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
2025-08-06 11:41:42 -07:00
Woosuk Kwon
ec7cb19224
[gpt-oss] Add loop for built-in tool call (#22374)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>
2025-08-06 10:32:21 -07:00
Gregory Shtrasberg
2435ea7ed5
[Bugfix] Make condition in triton kernel constexpr (#22370)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-08-06 10:00:58 -07:00
Lucas Wilkinson
4a6b72c2ab
[BugFix] Fix triton compile error in kernel_unified_attention_2/3d caused by attention sinks (#22368)
Signed-off-by: LucasWilkinson <lwilkinson@neuralmagic.com>
2025-08-06 09:47:38 -07:00
Zhang Jason
b4b9813b5e
add the codes to check AMD Instinct GPU number (#22367)
Signed-off-by: Zhang Jason <ning.zhang2@amd.com>
2025-08-06 08:58:38 -07:00
Lucas Wilkinson
2cb6ef8996
[BugFix] Fix FA2 RuntimeError when sinks is provided (#22365)
Signed-off-by: LucasWilkinson <lwilkinson@neuralmagic.com>
2025-08-06 08:03:03 -07:00
Woosuk Kwon
9edd1db02b
[Minor] Fix type (#22347)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-06 02:22:03 -07:00
Woosuk Kwon
f263a4b53f
[gpt-oss] Support chat completion api (#22342) 2025-08-06 01:57:39 -07:00
Roger Wang
54991c548a
[gpt-oss] add model to supported models doc (#22336)
Signed-off-by: Roger Wang <hey@rogerw.me>
2025-08-06 01:49:44 -07:00
Woosuk Kwon
178d03fbd6
[gpt-oss] Add Tool/ConversationContext classes and harmony_utils (#22340)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>
2025-08-06 01:08:49 -07:00
Isotr0py
fa00c5d75b
[Misc] Clean up duplicated hf overrides (#22311)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-08-06 07:50:25 +00:00
Woosuk Kwon
134a8ee8fd
[gpt-oss] Add openai-harmony as default dependency (#22332)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>
2025-08-06 00:10:14 -07:00
Yongye Zhu
90ec006937
[gpt-oss] flashinfer attention sink init (#22330)
Signed-off-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com>
2025-08-05 23:48:19 -07:00
Chen Zhang
a47e6ffe93
[GptOss] Add GptOss reasoning parser to support structure output (#22322)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>
2025-08-05 23:39:13 -07:00
Woosuk Kwon
98a3a81024
[ROCm] Add attention sink to use_rocm_custom_paged_attention (#22329)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>
2025-08-05 23:30:38 -07:00
Woosuk Kwon
de98252f49
Add GPT-OSS model code and config [1/N] (#22327)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-05 23:26:00 -07:00
Harry Mellor
796bae07c5
Update transformers to v4.55 (#21931)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-05 22:56:14 -07:00
Woosuk Kwon
6e20924350
Add attention sink in attention backends (#22320)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>
2025-08-05 22:37:21 -07:00
Woosuk Kwon
dd16bdc798
Increase openai-python version (#22316)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-05 21:43:21 -07:00
Woosuk Kwon
e3c876dca3
Upgrade FA3 for attention sink (#22313)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-05 21:36:21 -07:00
Gregory Shtrasberg
5d5d419ca6
[Bugfix][CI/Build][ROCm] Make sure to use the headers from the build folder on ROCm (#22264)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-08-05 20:39:32 -07:00
Rui Qiao
302962e806
[Bugfix] Skip dead and non-GPU nodes for Ray DP engine allocation (#22275)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-08-05 20:35:32 -07:00
Benjamin Chislett
7e6544c797
[Perf] Parallelize fill_bitmask to accelerate high-throughput guided decoding (#21862)
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
2025-08-05 19:57:49 -07:00
Jee Jee Li
8e6c7e873f
[Bugfix] Fix MoE BNB version (#22260)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-08-05 19:56:22 -07:00
Michael Goin
6a51530437
[Bugfix] Fix 3D input passed into cutlass_scaled_mm (#22278)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-08-06 10:35:20 +08:00
Michael Goin
35509fc5be
[Bugfix] Remove faulty test for oot attention backend (#22286)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-08-06 00:05:40 +00:00
Siyuan Liu
4b29d2784b
[CI][TPU] Fix docker clean up (#22271)
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
2025-08-05 23:54:56 +00:00
youkaichao
59a0b8554b
[bugfix] fix blackwell deepep installation (#22255) 2025-08-06 01:26:09 +08:00
Giancarlo Delfin
469b3ffaaa
[V1] port xformers backend to v1 (#21342)
Signed-off-by: Giancarlo Delfin <gdelfin@meta.com>
2025-08-05 10:04:46 -07:00
Wentao Ye
ae87ddd040
[Refactor] Remove Unused Environment Variable VLLM_NO_DEPRECATION_WARNING (#22199)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-08-05 09:40:23 -07:00
Michael Goin
a7cb6101ca
[CI/Build] Update flashinfer to 0.2.9 (#22233)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-08-05 09:39:38 -07:00
Michael Goin
c494f96fbc
Use UV_LINK_MODE=copy in Dockerfile to avoid hardlink fail (#22128)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-08-05 06:57:10 -07:00
Nicolò Lucchesi
0c275ad5ad
[V0 Deprecation][TPU] Remove V1 flag check from tests (#22248)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-08-05 06:53:23 -07:00
Ning Xie
74333ae2f6
[Misc] correct static type check for GroupCoordinator (#21946)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-08-05 03:17:46 -07:00
elvischenv
83156c7b89
[NVIDIA] Support Flashinfer TRT-LLM Prefill Attention Kernel (#22095)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
2025-08-05 02:45:34 -07:00