8733 Commits

Author SHA1 Message Date
Cyrus Leung
5efd6905bc
[CLI][Doc] Formalize --mm-encoder-tp-mode (#23190)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-20 23:42:28 +08:00
shixianc
b17109beea
[Kernel] CUTLASS MoE FP8: Integrate cuda moe permute/unpermute (#23045)
Signed-off-by: Shixian Cui <shixian@amazon.com>
2025-08-20 10:35:26 -04:00
Cyrus Leung
4449235843
[Bugfix] Ensure correctness of HCXVision processing (#23254)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-20 14:19:30 +00:00
rongfu.leng
38217877aa
[Fix] fix offline env use local mode path (#22526)
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
2025-08-20 13:34:49 +00:00
Jee Jee Li
c6d80a7a96
[Model] Improve olmo and olmo2 (#23228)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-08-20 12:47:05 +00:00
xyxinyang
7cd17e22d7
[Model][V1] Support Ernie MTP (#22169)
Signed-off-by: zhouchong <zhouchong03@baidu.com>
Co-authored-by: zhouchong <zhouchong03@baidu.com>
2025-08-20 20:41:55 +08:00
Michael Goin
50df09fe13
Update to flashinfer-python==0.2.12 and disable AOT compile for non-release image (#23129)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-08-20 08:05:54 -04:00
Cyrus Leung
68fcd3fa73
[Bugfix] Ensure correctness of Cohere2Vision processing (#23245)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-20 11:09:18 +00:00
Xin Yang
83e69a09d6
[Model] Support deepseek with eagle (#21086)
Signed-off-by: Xin Yang <xyangx@amazon.com>
2025-08-20 19:01:31 +08:00
Shiming Zhang
3aa8c10038
Fix missing quotes (#23242)
Signed-off-by: Shiming Zhang <wzshiming@hotmail.com>
2025-08-20 10:46:59 +00:00
Calvin Chen
103f1ec8d3
[Model] use autoWeightsLoader for gptoss (#22446)
Signed-off-by: calvin chen <wen.chen@dynamia.ai>
2025-08-20 10:16:27 +00:00
who who who
d983769c41
fix cuda graph (#22721)
Signed-off-by: fsx950223 <fsx950223@outlook.com>
2025-08-20 06:24:37 +00:00
Nick Hill
8fd920924c
[BugFix] Fix stuck stats/metrics after requests are aborted (#22995)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-08-20 13:50:29 +08:00
Cyrus Leung
de7b67a023
[CI/Build] Sync multimodal tests (#23181)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-20 05:06:42 +00:00
Zhewen Li
f729023272
[CI/Build] Also check DP in benchmarks throughput script (#23038)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2025-08-20 04:09:27 +00:00
길재은
1a3079a15e
chore: support pytorch format in lora (#22790)
Signed-off-by: jaeeun.kil <rha3122@naver.com>
Signed-off-by: 길재은 <rha3122@naver.com>
2025-08-20 04:02:50 +00:00
Louie Tsai
941f56858a
Fix a performance comparison issue in Benchmark Suite (#23047)
Signed-off-by: Tsai, Louie <louie.tsai@intel.com>
Signed-off-by: Louie Tsai <louie.tsai@intel.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Li, Jiang <bigpyj64@gmail.com>
2025-08-20 03:14:32 +00:00
Zebing Lin
a634733f67
[Attention] Optimize make_local_attention_virtual_batches for Flash Attention (#23185)
Signed-off-by: linzebing <linzebing1995@gmail.com>
2025-08-20 02:57:47 +00:00
Cyrus Leung
64ab3c7253
[Doc] Update V1 status of various pooling models (#23189)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-20 10:33:41 +08:00
Chenheli Hua
e58c5a9768
[Core] Add torch profiler CPU traces for AsyncLLM. (#21794)
Signed-off-by: Chenheli Hua <huachenheli@outlook.com>
2025-08-20 02:32:47 +00:00
Michael Goin
d46d417b58
[CI Perf] Only test bfloat16 for tests/compile/test_fusion_all_reduce.py (#23132)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-08-19 20:18:52 -06:00
633WHU
0167efe20d
[Core] Optimize scheduler request removal for single completions (#21917)
Signed-off-by: chiliu <chiliu@paypal.com>
Signed-off-by: chiliu <cliu_whu@yeah.net>
Co-authored-by: chiliu <chiliu@paypal.com>
2025-08-19 18:25:59 -07:00
Kyle Sayers
c32e6ad1f6
[Quantization] Bump Compressed Tensors Version (#23202)
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-08-20 00:39:28 +00:00
Chenheli Hua
1630cc8d0f
[Benchmarks] Add video inputs to ShareGPTDataset. (#23199)
Signed-off-by: Chenheli Hua <huachenheli@outlook.com>
2025-08-19 23:42:31 +00:00
Lucas Wilkinson
14e2b0730b
[BugFix] fix CUTLASS MLA full cudagraph (#23200)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-08-19 22:17:08 +00:00
Michael Goin
0f4f0191d8
[CI/Build] Replace lm-eval gsm8k tests with faster implementation (#23002)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-08-19 15:07:30 -07:00
amirkl94
a38b8af4c3
[NVIDIA] Add SM100 Flashinfer Cutlass MoE fp8 backend (#22357)
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>
2025-08-19 18:01:53 -04:00
Michael Goin
21dce80ea9
[CI/Build] Add support for Python 3.13 (#13164)
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-08-19 13:49:34 -07:00
Woosuk Kwon
e61bac87ee
[Misc] Minor refactoring for FlashInfer backend (#23147)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-19 13:11:51 -07:00
Marko Rosenmueller
80141bbf2f
fix: use cache_salt for gpt-oss (#23186)
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com>
2025-08-19 18:12:25 +00:00
bnellnm
b94faf9d50
[Bugfix] Fix accuracy issue when using flashinfer cutlass moe, TP=1 and modelopt. (#23125)
Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-08-19 14:00:51 -04:00
Woosuk Kwon
5b5f350d67
[Misc] Enable yapf for FlashInfer backend (#23193)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-19 10:33:47 -07:00
22quinn
f7cf5b512e
[Frontend] Add /collective_rpc API endpoint (#23075)
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
2025-08-19 17:29:32 +00:00
Ruixiang Tan
03d4235fd2
[Misc] Fix the benchmark's README and improve the error messages for the benchmark's argument checks (#22654)
Signed-off-by: tanruixiang <tanruixiang0104@gmail.com>
2025-08-19 10:18:51 -07:00
Isotr0py
d6a1a20973
[CI/Build] Update transformers to v4.55.2 (#23093)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-08-19 10:06:17 -07:00
Benji Beck
a70d0bd0a3
Migrate LlavaOnevisionMultiInputs to TensorSchema (#21844)
Signed-off-by: Benji Beck <benjibeck@meta.com>
2025-08-19 17:02:02 +00:00
Yuge Zhang
24f4d1a224
Add return_token_ids parameter to OpenAI API endpoints (#22587)
Signed-off-by: Yuge Zhang <scottyugochang@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2025-08-19 09:48:31 -07:00
yiz-liu
4f510bc2a1
[Model] Removes redundant all-reduce operation in Qwen3MoeSparseMoeBlock (#23169)
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-08-19 16:18:41 +00:00
TJian
1298c67795
[FEAT] [Performance] Enable DP for ViT in Qwen2.5VL (#22742)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-08-19 15:25:57 +00:00
Jee Jee Li
4d9c61993a
[Bugfix] Fix benchmark_moe.py (#23177)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-08-19 13:39:40 +00:00
myselvess
b87cb97a53
[Model] support new model ovis2.5 (#23084)
Signed-off-by: myselvess <244285088@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-08-19 13:12:59 +00:00
wang.yuqi
f856c33ce9
[Model] Add multi_label_classification support (#23173)
Signed-off-by: wang.yuqi <noooop@126.com>
2025-08-19 12:54:30 +00:00
elvischenv
03752dba8f
[NVIDIA] Support Flashinfer TRTLLM FP8-q/kv/out Attention Kernel (#21716)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-08-19 08:22:15 -04:00
Woosuk Kwon
40f26734b9
[Misc] Fix seq_lens for graph capture (#23175)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-19 03:58:16 -07:00
Tialo
2c3f557f08
[Doc] use power of 2 (#23172) 2025-08-19 03:16:23 -07:00
Woosuk Kwon
21bcc8263f
[Misc] Avoid accessing req_ids inside a loop (#23159)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-19 09:39:38 +00:00
qizixi
5bfe0dea7a
[bug fix] Fix llama4 spec decoding (#22691)
Signed-off-by: qizixi <qizixi@meta.com>
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>
2025-08-19 08:53:24 +00:00
Isotr0py
31fd3265c8
[Bugfix] Fix broken Minimax-01-VL model (#22116)
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-08-19 08:49:29 +00:00
hustxiayang
31436e8b4f
[Misc] Add request_id into benchmark_serve.py (#23065)
Signed-off-by: yangxia <yangxiast@gmail.com>
2025-08-19 08:32:18 +00:00
qizixi
4efd43e9b4
Fix GLM-4.5V-FP8 numerical issue (#22949)
Signed-off-by: qizixi <qizixi@meta.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-08-19 07:56:31 +00:00