8712 Commits

Author SHA1 Message Date
633WHU
0167efe20d
[Core] Optimize scheduler request removal for single completions (#21917)
Signed-off-by: chiliu <chiliu@paypal.com>
Signed-off-by: chiliu <cliu_whu@yeah.net>
Co-authored-by: chiliu <chiliu@paypal.com>
2025-08-19 18:25:59 -07:00
Kyle Sayers
c32e6ad1f6
[Quantization] Bump Compressed Tensors Version (#23202)
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-08-20 00:39:28 +00:00
Chenheli Hua
1630cc8d0f
[Benchmarks] Add video inputs to ShareGPTDataset. (#23199)
Signed-off-by: Chenheli Hua <huachenheli@outlook.com>
2025-08-19 23:42:31 +00:00
Lucas Wilkinson
14e2b0730b
[BugFix] fix CUTLASS MLA full cudagraph (#23200)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-08-19 22:17:08 +00:00
Michael Goin
0f4f0191d8
[CI/Build] Replace lm-eval gsm8k tests with faster implementation (#23002)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-08-19 15:07:30 -07:00
amirkl94
a38b8af4c3
[NVIDIA] Add SM100 Flashinfer Cutlass MoE fp8 backend (#22357)
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>
2025-08-19 18:01:53 -04:00
Michael Goin
21dce80ea9
[CI/Build] Add support for Python 3.13 (#13164)
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-08-19 13:49:34 -07:00
Woosuk Kwon
e61bac87ee
[Misc] Minor refactoring for FlashInfer backend (#23147)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-19 13:11:51 -07:00
Marko Rosenmueller
80141bbf2f
fix: use cache_salt for gpt-oss (#23186)
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com>
2025-08-19 18:12:25 +00:00
bnellnm
b94faf9d50
[Bugfix] Fix accuracy issue when using flashinfer cutlass moe, TP=1 and modelopt. (#23125)
Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-08-19 14:00:51 -04:00
Woosuk Kwon
5b5f350d67
[Misc] Enable yapf for FlashInfer backend (#23193)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-19 10:33:47 -07:00
22quinn
f7cf5b512e
[Frontend] Add /collective_rpc API endpoint (#23075)
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
2025-08-19 17:29:32 +00:00
Ruixiang Tan
03d4235fd2
[Misc] Fix the benchmark's README and improve the error messages for the benchmark's argument checks (#22654)
Signed-off-by: tanruixiang <tanruixiang0104@gmail.com>
2025-08-19 10:18:51 -07:00
Isotr0py
d6a1a20973
[CI/Build] Update transformers to v4.55.2 (#23093)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-08-19 10:06:17 -07:00
Benji Beck
a70d0bd0a3
Migrate LlavaOnevisionMultiInputs to TensorSchema (#21844)
Signed-off-by: Benji Beck <benjibeck@meta.com>
2025-08-19 17:02:02 +00:00
Yuge Zhang
24f4d1a224
Add return_token_ids parameter to OpenAI API endpoints (#22587)
Signed-off-by: Yuge Zhang <scottyugochang@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2025-08-19 09:48:31 -07:00
yiz-liu
4f510bc2a1
[Model] Removes redundant all-reduce operation in Qwen3MoeSparseMoeBlock (#23169)
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-08-19 16:18:41 +00:00
TJian
1298c67795
[FEAT] [Performance] Enable DP for ViT in Qwen2.5VL (#22742)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-08-19 15:25:57 +00:00
Jee Jee Li
4d9c61993a
[Bugfix] Fix benchmark_moe.py (#23177)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-08-19 13:39:40 +00:00
myselvess
b87cb97a53
[Model] support new model ovis2.5 (#23084)
Signed-off-by: myselvess <244285088@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-08-19 13:12:59 +00:00
wang.yuqi
f856c33ce9
[Model] Add multi_label_classification support (#23173)
Signed-off-by: wang.yuqi <noooop@126.com>
2025-08-19 12:54:30 +00:00
elvischenv
03752dba8f
[NVIDIA] Support Flashinfer TRTLLM FP8-q/kv/out Attention Kernel (#21716)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-08-19 08:22:15 -04:00
Woosuk Kwon
40f26734b9
[Misc] Fix seq_lens for graph capture (#23175)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-19 03:58:16 -07:00
Tialo
2c3f557f08
[Doc] use power of 2 (#23172) 2025-08-19 03:16:23 -07:00
Woosuk Kwon
21bcc8263f
[Misc] Avoid accessing req_ids inside a loop (#23159)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-19 09:39:38 +00:00
qizixi
5bfe0dea7a
[bug fix] Fix llama4 spec decoding (#22691)
Signed-off-by: qizixi <qizixi@meta.com>
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>
2025-08-19 08:53:24 +00:00
Isotr0py
31fd3265c8
[Bugfix] Fix broken Minimax-01-VL model (#22116)
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-08-19 08:49:29 +00:00
hustxiayang
31436e8b4f
[Misc] Add request_id into benchmark_serve.py (#23065)
Signed-off-by: yangxia <yangxiast@gmail.com>
2025-08-19 08:32:18 +00:00
qizixi
4efd43e9b4
Fix GLM-4.5V-FP8 numerical issue (#22949)
Signed-off-by: qizixi <qizixi@meta.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-08-19 07:56:31 +00:00
Daniel Serebrenik
3c8a787247
[Benchmark] Add flag --served-model-name to benchmark_serving_multi_turn (#22889)
Signed-off-by: daniels <daniels@pliops.com>
2025-08-19 07:48:07 +00:00
Grace Ho
01a08739e0
[misc] split engine_model into json file for nsys profile tool (#23117)
Signed-off-by: Grace Ho <grho@nvidia.com>
Signed-off-by: Grace Ho <146482179+gracehonv@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-08-19 15:44:53 +08:00
Jiangyun Zhu
fda9537c5e
[Model] Support Pipeline Parallelism for moonshotai/Kimi-VL-A3B-Thinking-2506 (#23114)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-08-19 14:24:31 +08:00
Wentao Ye
90bbe0a5ad
[Log] Warning Once for Cutlass MLA (#23137)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-08-18 23:24:16 -07:00
Benji Beck
e75f342261
Migrate InternVLImagePixelInputs (in nemotron_vl.py) to TensorSchema (#22023)
Signed-off-by: Benji Beck <benjibeck@meta.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-08-19 13:48:26 +08:00
Nikhil Suryawanshi
78dba404ad
[Hardware][IBM Z]Enable v1 for s390x and s390x dockerfile fixes (#22725)
Signed-off-by: Nikhil Suryawanshi <suryawanshin74@gmail.com>
2025-08-19 04:40:37 +00:00
Chengji Yao
e9d6a3db69
[TPU] make ptxla not imported when using tpu_commons (#23081)
Signed-off-by: Chengji Yao <chengjiyao@gmail.com>
Signed-off-by: Chengji Yao <chengjiyao@google.com>
Co-authored-by: Chengji Yao <chengjiyao@gmail.com>
2025-08-19 11:46:42 +08:00
Xiao
a4454e9401
chore: disable enable_cpp_symbolic_shape_guards (#23048)
Signed-off-by: Xiao Liu <xiszishu@gmail.com>
2025-08-18 23:08:05 -04:00
Woosuk Kwon
14006840ea
[V0 Deprecation] Remove V0 FlashInfer attention backend (#22776)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-18 19:54:16 -07:00
Robert Shaw
6603288736
[CI][V0 Deprecation] Removed V0 Only Chunked Prefill and Prefix Caching Tests (#22871)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-18 17:39:01 -07:00
Thomas Parnell
95e3095136
[Misc] Add @tdoublep as a maintainer of hybrid model and Triton-attention related code (#23122)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-08-19 08:31:38 +08:00
Woosuk Kwon
c9b38be8aa
[Spec Decode] Make propose_draft_token_ids non-blocking for lower TTFT (#23041)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-18 17:20:38 -07:00
Woosuk Kwon
0dd3f4f5ab
[Misc] Minor refactoring for prepare_inputs (#23116)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-18 16:58:05 -07:00
Xiang Xu
498259ccce
Install tpu_info==0.4.0 to fix core dump for TPU (#23135) 2025-08-18 16:23:33 -07:00
Michael Goin
6d25e3fd6e
Use Blackwell FlashInfer MXFP4 MoE by default if available (#23008)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-08-18 15:25:49 -07:00
Breno Baldas Skuk
ac6eb49de3
fix: OpenAI SDK compat (ResponseTextConfig) (#23126)
Signed-off-by: breno.skuk <breno.skuk@hcompany.ai>
Signed-off-by: Breno Baldas Skuk <breno.skuk@hcompany.ai>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-08-18 15:22:59 -07:00
Michael Goin
bf756321c7
[CI Bugfix] Pin openai<1.100 to unblock CI (#23118)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-08-18 12:14:01 -07:00
Raushan Turganbay
0e3bb543f0
[Bugfix] Support compile for Transformers multimodal (#23095)
Signed-off-by: raushan <raushan@huggingface.co>
2025-08-18 13:35:48 +00:00
杨朱 · Kiki
569aefd134
chore: remove unnecessary patch_padding_side for the chatglm model (#23090)
Signed-off-by: carlory <baofa.fan@daocloud.io>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-08-18 12:32:13 +00:00
Cyrus Leung
d3f71f1224
[Refactor] Get prompt updates earlier (#23097)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-18 12:31:53 +00:00
Ning Xie
5a30bd10d8
[Bugfix] fix IntermediateTensors equal method (#23027)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-08-18 02:58:11 -07:00