633WHU
0167efe20d
[Core] Optimize scheduler request removal for single completions ( #21917 )
...
Signed-off-by: chiliu <chiliu@paypal.com>
Signed-off-by: chiliu <cliu_whu@yeah.net>
Co-authored-by: chiliu <chiliu@paypal.com>
2025-08-19 18:25:59 -07:00
Kyle Sayers
c32e6ad1f6
[Quantization] Bump Compressed Tensors Version ( #23202 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-08-20 00:39:28 +00:00
Chenheli Hua
1630cc8d0f
[Benchmarks] Add video inputs to ShareGPTDataset. ( #23199 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com>
2025-08-19 23:42:31 +00:00
Lucas Wilkinson
14e2b0730b
[BugFix] fix CUTLASS MLA full cudagraph ( #23200 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-08-19 22:17:08 +00:00
Michael Goin
0f4f0191d8
[CI/Build] Replace lm-eval gsm8k tests with faster implementation ( #23002 )
...
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-08-19 15:07:30 -07:00
amirkl94
a38b8af4c3
[NVIDIA] Add SM100 Flashinfer Cutlass MoE fp8 backend ( #22357 )
...
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>
2025-08-19 18:01:53 -04:00
Michael Goin
21dce80ea9
[CI/Build] Add support for Python 3.13 ( #13164 )
...
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-08-19 13:49:34 -07:00
Woosuk Kwon
e61bac87ee
[Misc] Minor refactoring for FlashInfer backend ( #23147 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-19 13:11:51 -07:00
Marko Rosenmueller
80141bbf2f
fix: use cache_salt for gpt-oss ( #23186 )
...
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com>
2025-08-19 18:12:25 +00:00
bnellnm
b94faf9d50
[Bugfix] Fix accuracy issue when using flashinfer cutlass moe, TP=1 and modelopt. ( #23125 )
...
Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-08-19 14:00:51 -04:00
Woosuk Kwon
5b5f350d67
[Misc] Enable yapf for FlashInfer backend ( #23193 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-19 10:33:47 -07:00
22quinn
f7cf5b512e
[Frontend] Add /collective_rpc API endpoint ( #23075 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
2025-08-19 17:29:32 +00:00
Ruixiang Tan
03d4235fd2
[Misc] Fix the benchmark's README and improve the error messages for the benchmark's argument checks ( #22654 )
...
Signed-off-by: tanruixiang <tanruixiang0104@gmail.com>
2025-08-19 10:18:51 -07:00
Isotr0py
d6a1a20973
[CI/Build] Update transformers to v4.55.2 ( #23093 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-08-19 10:06:17 -07:00
Benji Beck
a70d0bd0a3
Migrate LlavaOnevisionMultiInputs to TensorSchema ( #21844 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com>
2025-08-19 17:02:02 +00:00
Yuge Zhang
24f4d1a224
Add return_token_ids parameter to OpenAI API endpoints ( #22587 )
...
Signed-off-by: Yuge Zhang <scottyugochang@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2025-08-19 09:48:31 -07:00
yiz-liu
4f510bc2a1
[Model] Removes redundant all-reduce operation in Qwen3MoeSparseMoeBlock ( #23169 )
...
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-08-19 16:18:41 +00:00
TJian
1298c67795
[FEAT] [Performance] Enable DP for ViT in Qwen2.5VL ( #22742 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-08-19 15:25:57 +00:00
Jee Jee Li
4d9c61993a
[Bugfix] Fix benchmark_moe.py ( #23177 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-08-19 13:39:40 +00:00
myselvess
b87cb97a53
[Model] support new model ovis2.5 ( #23084 )
...
Signed-off-by: myselvess <244285088@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-08-19 13:12:59 +00:00
wang.yuqi
f856c33ce9
[Model] Add multi_label_classification support ( #23173 )
...
Signed-off-by: wang.yuqi <noooop@126.com>
2025-08-19 12:54:30 +00:00
elvischenv
03752dba8f
[NVIDIA] Support Flashinfer TRTLLM FP8-q/kv/out Attention Kernel ( #21716 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-08-19 08:22:15 -04:00
Woosuk Kwon
40f26734b9
[Misc] Fix seq_lens for graph capture ( #23175 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-19 03:58:16 -07:00
Tialo
2c3f557f08
[Doc] use power of 2 ( #23172 )
2025-08-19 03:16:23 -07:00
Woosuk Kwon
21bcc8263f
[Misc] Avoid accessing req_ids inside a loop ( #23159 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-19 09:39:38 +00:00
qizixi
5bfe0dea7a
[bug fix] Fix llama4 spec decoding ( #22691 )
...
Signed-off-by: qizixi <qizixi@meta.com>
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>
2025-08-19 08:53:24 +00:00
Isotr0py
31fd3265c8
[Bugfix] Fix broken Minimax-01-VL model ( #22116 )
...
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-08-19 08:49:29 +00:00
hustxiayang
31436e8b4f
[Misc] Add request_id into benchmark_serve.py ( #23065 )
...
Signed-off-by: yangxia <yangxiast@gmail.com>
2025-08-19 08:32:18 +00:00
qizixi
4efd43e9b4
Fix GLM-4.5V-FP8 numerical issue ( #22949 )
...
Signed-off-by: qizixi <qizixi@meta.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-08-19 07:56:31 +00:00
Daniel Serebrenik
3c8a787247
[Benchmark] Add flag --served-model-name to benchmark_serving_multi_turn ( #22889 )
...
Signed-off-by: daniels <daniels@pliops.com>
2025-08-19 07:48:07 +00:00
Grace Ho
01a08739e0
[misc] split engine_model into json file for nsys profile tool ( #23117 )
...
Signed-off-by: Grace Ho <grho@nvidia.com>
Signed-off-by: Grace Ho <146482179+gracehonv@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-08-19 15:44:53 +08:00
Jiangyun Zhu
fda9537c5e
[Model] Support Pipeline Parallelism for moonshotai/Kimi-VL-A3B-Thinking-2506 ( #23114 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-08-19 14:24:31 +08:00
Wentao Ye
90bbe0a5ad
[Log] Warning Once for Cutlass MLA ( #23137 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-08-18 23:24:16 -07:00
Benji Beck
e75f342261
Migrate InternVLImagePixelInputs (in nemotron_vl.py) to TensorSchema ( #22023 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-08-19 13:48:26 +08:00
Nikhil Suryawanshi
78dba404ad
[Hardware][IBM Z]Enable v1 for s390x and s390x dockerfile fixes ( #22725 )
...
Signed-off-by: Nikhil Suryawanshi <suryawanshin74@gmail.com>
2025-08-19 04:40:37 +00:00
Chengji Yao
e9d6a3db69
[TPU] make ptxla not imported when using tpu_commons ( #23081 )
...
Signed-off-by: Chengji Yao <chengjiyao@gmail.com>
Signed-off-by: Chengji Yao <chengjiyao@google.com>
Co-authored-by: Chengji Yao <chengjiyao@gmail.com>
2025-08-19 11:46:42 +08:00
Xiao
a4454e9401
chore: disable enable_cpp_symbolic_shape_guards ( #23048 )
...
Signed-off-by: Xiao Liu <xiszishu@gmail.com>
2025-08-18 23:08:05 -04:00
Woosuk Kwon
14006840ea
[V0 Deprecation] Remove V0 FlashInfer attention backend ( #22776 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-18 19:54:16 -07:00
Robert Shaw
6603288736
[CI][V0 Deprecation] Removed V0 Only Chunked Prefill and Prefix Caching Tests ( #22871 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-18 17:39:01 -07:00
Thomas Parnell
95e3095136
[Misc] Add @tdoublep as a maintainer of hybrid model and Triton-attention related code ( #23122 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-08-19 08:31:38 +08:00
Woosuk Kwon
c9b38be8aa
[Spec Decode] Make propose_draft_token_ids non-blocking for lower TTFT ( #23041 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-18 17:20:38 -07:00
Woosuk Kwon
0dd3f4f5ab
[Misc] Minor refactoring for prepare_inputs ( #23116 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-18 16:58:05 -07:00
Xiang Xu
498259ccce
Install tpu_info==0.4.0 to fix core dump for TPU ( #23135 )
2025-08-18 16:23:33 -07:00
Michael Goin
6d25e3fd6e
Use Blackwell FlashInfer MXFP4 MoE by default if available ( #23008 )
...
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-08-18 15:25:49 -07:00
Breno Baldas Skuk
ac6eb49de3
fix: OpenAI SDK compat (ResponseTextConfig) ( #23126 )
...
Signed-off-by: breno.skuk <breno.skuk@hcompany.ai>
Signed-off-by: Breno Baldas Skuk <breno.skuk@hcompany.ai>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-08-18 15:22:59 -07:00
Michael Goin
bf756321c7
[CI Bugfix] Pin openai<1.100 to unblock CI ( #23118 )
...
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-08-18 12:14:01 -07:00
Raushan Turganbay
0e3bb543f0
[Bugfix] Support compile for Transformers multimodal ( #23095 )
...
Signed-off-by: raushan <raushan@huggingface.co>
2025-08-18 13:35:48 +00:00
杨朱 · Kiki
569aefd134
chore: remove unnecessary patch_padding_side for the chatglm model ( #23090 )
...
Signed-off-by: carlory <baofa.fan@daocloud.io>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-08-18 12:32:13 +00:00
Cyrus Leung
d3f71f1224
[Refactor] Get prompt updates earlier ( #23097 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-18 12:31:53 +00:00
Ning Xie
5a30bd10d8
[Bugfix] fix IntermediateTensors equal method ( #23027 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-08-18 02:58:11 -07:00