2553 Commits

Author SHA1 Message Date
Isotr0py
049c245143
[Misc] Replace flaky image urls in pixtral test (#22574)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-08-10 06:18:21 -07:00
Ning Xie
326976291b
[Misc] code clean duplicate set_current_vllm_config in _set_vllm_config (#22566)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-08-10 00:08:48 -07:00
Harry Mellor
c49848396d
Refactor sliding window configuration to Transformers best practice (#21927)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-08-09 20:50:48 -07:00
Chengji Yao
2a84fb422f
[TPU] kv cache update kernel doesn't need to be padded slices to multiple of num_slices_per_block (#22394)
Signed-off-by: Chengji Yao <chengjiyao@gmail.com>
Co-authored-by: Chengji Yao <chengjiyao@gmail.com>
2025-08-09 20:49:04 -07:00
Le Chen
3d7363e61c
[Config] add "qwen" as a native eagle3 target supported model (#22333)
Signed-off-by: lechen <lecself@163.com>
Signed-off-by: LeChen <lecself@163.com>
2025-08-09 20:21:05 -07:00
Thomas Parnell
61f67d8acd
[V1] [Hybrid] Enable Full CUDA Graph (decode-only) for Mamba layers (#21401)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-08-09 20:16:11 -07:00
TJian
42172ad18f
[FEAT] [Performance] Add triton mrope to replace the torch code path (#22375)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-08-09 11:50:03 -07:00
Nicolò Lucchesi
5a16fa614c
[Model] Gemma3n MM (#20495)
Signed-off-by: ShriKode <shrikode@gmail.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Roger Wang <hey@rogerw.me>
Co-authored-by: ShriKode <shrikode@gmail.com>
Co-authored-by: Roger Wang <hey@rogerw.me>
2025-08-09 09:56:25 -07:00
Thomas Parnell
1bf5e1f25b
[CI] [Hybrid] Speed up hybrid models test by removing large models (#22563)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-08-09 02:04:42 -07:00
Yuxuan Zhang
a6022e6fbc
GLM-4.5V with new class name at transformers (#22520)
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-08-09 00:50:21 -07:00
Jee Jee Li
0edc0cd52b
[Bugfix] Fix CI moe kernel failure (#22556)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-08-09 00:03:29 -07:00
Isotr0py
7920e9b1c5
[Bugfix] Fix failing GPT-OSS initialization test (#22557)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-08-09 00:03:26 -07:00
Kyuyeun Kim
9a0c5ded5a
[TPU] Add support for online w8a8 quantization (#22425)
Signed-off-by: Kyuyeun Kim <kyuyeunk@google.com>
2025-08-08 23:12:54 -07:00
Thomas Parnell
8a0ffd6285
Remove mamba_ssm from vLLM requirements; install inside test container using --no-build-isolation (#22541)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-08-08 23:05:32 -07:00
Roger Wang
08b751ba74
Implicit language-model-only mode via limit-mm-per-prompt (#22299)
Signed-off-by: Roger Wang <hey@rogerw.me>
Signed-off-by: Andy Xie <andy.xning@gmail.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: Andrew Sansom <andrew@protopia.ai>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Shu Wang <shuw@nvidia.com>
Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
Signed-off-by: Shu Wang. <shuw@nvidia.com>
Signed-off-by: XIn Li <xinli@nvidia.com>
Signed-off-by: Junhao Li <junhao@ubicloud.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com>
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
Signed-off-by: zitian zhao <zitian.zhao@tencentmusic.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: iAmir97 <Amir.balwel@embeddedllm.com>
Signed-off-by: iAmir97 <71513472+iAmir97@users.noreply.github.com>
Signed-off-by: Linkun <github@lkchen.net>
Co-authored-by: Ning Xie <andy.xning@gmail.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
Co-authored-by: Andrew Sansom <andrew@protopia.ai>
Co-authored-by: Zhiyu <zhiyuc@nvidia.com>
Co-authored-by: Shu Wang <shuw@nvidia.com>
Co-authored-by: XIn Li <xinli@nvidia.com>
Co-authored-by: Junhao Li <streaver91@gmail.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: Yuxuan Zhang <2448370773@qq.com>
Co-authored-by: ZiTian Zhao <zitian.zhao@tencentmusic.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Po-Han Huang (NVIDIA) <53919306+nvpohanh@users.noreply.github.com>
Co-authored-by: iAmir97 <71513472+iAmir97@users.noreply.github.com>
Co-authored-by: iAmir97 <Amir.balwel@embeddedllm.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Hong Hanh <hanh.usth@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: lkchen <github@lkchen.net>
2025-08-08 22:21:40 -07:00
Isotr0py
429e4e2d42
[Bugfix] Fix ModernBert cuda graph capturing in v1 (#21901)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-08-08 22:17:22 -07:00
Russell Bryant
311d875614
Drop flaky test_healthcheck_response_time (#22539)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-08-08 16:56:47 -07:00
Harry Mellor
e3edc0a7a8
Extract CompilationConfig from config.py (#22524)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-08-08 16:34:25 -07:00
yyweiss
baece8c3d2
[Frontend] Add unix domain socket support (#18097)
Signed-off-by: <yyweiss@gmail.com>
Signed-off-by: yyw <yyweiss@gmail.com>
2025-08-08 16:23:44 -07:00
Harry Mellor
41b9655751
Skip Qwen 1 in CI because remote code is no longer compatible with Transformers (#22536)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-08-08 16:20:58 -07:00
Yongye Zhu
e789cad6b8
[gpt-oss] triton kernel mxfp4 (#22421)
Signed-off-by: <zyy1102000@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
2025-08-08 08:24:07 -07:00
Cyrus Leung
43c4f3d77c
[Misc] Begin deprecation of get_tensor_model_*_group (#22494)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-08 01:11:54 -07:00
Chauncey
17eaaef595
[Bugfix] Fix RuntimeError: Index put requires the source and destination dtypes match (#22065)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-08-07 19:20:21 -07:00
TJian
1ee5ead5f8
[ROCm] [V1] [SpecDec] Enable Speculative Decoding on ROCm V1 Engine (#21496)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-08-07 19:13:17 -07:00
Ning Xie
acf8aeb79e
[Misc] normalize multiprocessing Queue usage (#22371)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-08-08 01:57:27 +00:00
Harry Mellor
7e3a8dc906
Remove from_dict from SpeculativeConfig (#22451)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-08-07 10:13:04 -07:00
Cyrus Leung
139d155781
[Frontend] Use engine argument to control MM cache size (#22441)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-07 09:47:10 -07:00
Chen Zhang
4815b00f54
[gpt-oss] Generate ResponseOutputItem from Harmony Message (#22410)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-08-07 08:33:25 -07:00
Cyrus Leung
766bc8162c
[Core] Store only the keys for multi-modal data in P0 (#22198)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-07 01:45:04 -07:00
Adrián García García
8e8e0b6af1
feat: Add --enable-log-outputs flag for logging model generations (#20707)
Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
2025-08-06 23:10:13 -07:00
Ming Yang
82216dc21f
[Misc] Support routing logic simulation (#21990)
Signed-off-by: Ming Yang <minos.future@gmail.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-06 23:06:20 -07:00
Moritz Sanft
370661856b
[Frontend] Update OpenAI error response to upstream format (#22099)
Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>
2025-08-06 23:06:00 -07:00
wang.yuqi
2a4c825523
[CI] Skip the pooling models that do not support transformers v4.55 (#22411)
Signed-off-by: wang.yuqi <noooop@126.com>
2025-08-06 23:05:03 -07:00
qscqesze
5e9455ae8f
[Bugfix]: Fix the streaming output for function calls in the minimax (#22015)
Signed-off-by: QscQ <qscqesze@gmail.com>
Signed-off-by: qingjun <qingjun@minimaxi.com>
2025-08-06 20:30:27 -07:00
Michael Goin
a00d8b236f
Use float32 for test_completion.py (#22385)
Signed-off-by: Michael Goin <mgoin64@gmail.com>
2025-08-07 11:07:47 +08:00
Lucas Wilkinson
1dc8a70b6d
[Attention] Support multiple attention metadata builders per kv_cache_spec + proper local attention no hybrid kv cache fix (#21588)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-08-06 18:40:52 -07:00
Maximilien de Bayser
f825c6bd22
Support encoder_only attention for FlexAttention (#22273)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2025-08-06 18:37:14 -07:00
tc-mb
41b67f4263
[model] Support MiniCPM-V 4.0 (#22166)
Co-authored-by: imning3 <hbning@pku.edu.cn>
2025-08-06 18:35:46 -07:00
Asaf Joseph Gardin
46a13949d5
[v1] - Mamba1 Attention Metadata (#21249)
Signed-off-by: asafg <asafg@ai21.com>
Co-authored-by: asafg <asafg@ai21.com>
2025-08-06 17:03:42 -07:00
Isotr0py
fa00c5d75b
[Misc] Clean up duplicated hf overrides (#22311)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-08-06 07:50:25 +00:00
Woosuk Kwon
de98252f49
Add GPT-OSS model code and config [1/N] (#22327)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-05 23:26:00 -07:00
Harry Mellor
796bae07c5
Update transformers to v4.55 (#21931)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-05 22:56:14 -07:00
Michael Goin
35509fc5be
[Bugfix] Remove faulty test for oot attention backend (#22286)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-08-06 00:05:40 +00:00
Giancarlo Delfin
469b3ffaaa
[V1] port xformers backend to v1 (#21342)
Signed-off-by: Giancarlo Delfin <gdelfin@meta.com>
2025-08-05 10:04:46 -07:00
Nicolò Lucchesi
0c275ad5ad
[V0 Deprecation][TPU] Remove V1 flag check from tests (#22248)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-08-05 06:53:23 -07:00
elvischenv
83156c7b89
[NVIDIA] Support Flashinfer TRT-LLM Prefill Attention Kernel (#22095)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
2025-08-05 02:45:34 -07:00
Wentao Ye
4771df7b2b
[Feature] Non-contiguous Support for FP8 Quantization (#21961)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-08-05 02:36:43 -07:00
wang.yuqi
586f286789
[Model] Pooling model activation supports per request control by PoolingParams (#20538)
Signed-off-by: wang.yuqi <noooop@126.com>
2025-08-05 00:37:00 -07:00
Michael Goin
e79a12fc3a
[UX] Fail if an invalid attention backend is specified (#22217)
Signed-off-by: mgoin <michael@neuralmagic.com>
2025-08-04 23:54:52 -07:00
Yuxuan Zhang
6fa41e0c32
self.gate dtype update for GLM-4.5 (#22203)
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com>
2025-08-04 19:12:38 -07:00