468 Commits

Author SHA1 Message Date
Ekagra Ranjan
e71b8e210d
[Spec Decode] Add Batch Parallel Ngram. Upto 8x lower overhead. (#24986)
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-09-25 15:22:03 -07:00
Cyrus Leung
2f17117606
[mypy] Fix wrong type annotations related to tuple (#25660)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-09-25 13:00:45 +00:00
Tyler Michael Smith
1260180c67
Revert "[Performance] Move apply_w8a8_block_fp8_linear to an op class… (#25607)
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
2025-09-25 08:05:21 +00:00
Saman A. Pour
90b139cfff
Enable Fbgemm NVFP4 on Dense models (#25609)
Signed-off-by: Saman Keon <samanamp@outlook.com>
2025-09-24 21:12:53 -07:00
Wentao Ye
1f29141258
[Refactor] Use DeepGEMM Col Major TMA Aligned Tensor (#25517)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-09-24 18:52:36 -04:00
Michael Goin
d83f3f7cb3
Fixes and updates to bench_per_token_quant_fp8 (#25591)
Signed-off-by: Michael Goin <mgoin64@gmail.com>
2025-09-24 08:30:15 -07:00
Russell Bryant
164299500b
[Benchmark] Fix regression in structured output benchmark (#25500)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-09-24 10:40:42 +00:00
Chenxi Yang
0d235b874a
Add CUTLASS FP8 MOE benchmark scripts and kernel config (#25302)
Signed-off-by: Chenxi Yang <cxyang@fb.com>
Co-authored-by: Chenxi Yang <cxyang@fb.com>
2025-09-23 18:07:42 -06:00
ElizaWszola
63400259d0
[Performance] Move apply_w8a8_block_fp8_linear to an op class (#24666)
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: ElizaWszola <elizaw.9289@gmail.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Luka Govedič <lgovedic@redhat.com>
2025-09-23 12:03:10 -07:00
Amir Samani
8c1c81a3de
[core] add nccl symmetric memory for all reduce (#24532)
Signed-off-by: Amir Samani <asamani@nvidia.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-09-23 14:33:06 -04:00
Weida Hong
24e8222745
[Misc] Reduce initialization time of auto_tune (#23682)
Signed-off-by: Weida Hong <wdhongtw@google.com>
2025-09-23 17:34:58 +00:00
Burkhard Ringlein
100b630a60
[V1][Kernel] Add triton implementation for reshape_and_cache_flash (#24503)
Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com>
Co-authored-by: Chih-Chieh Yang <chih.chieh.yang@ibm.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-09-23 12:52:40 -04:00
Cyrus Leung
6c117cff7d
[Frontend] Pass API server count to each process (#23717)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-09-20 01:15:19 +08:00
Aaron Pham
29283e8976
[Chore] Cleanup guided namespace, move to structured outputs config (#22772)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-09-18 09:20:27 +00:00
bnellnm
5963b98b46
[Kernel] Delegate construction of FusedMoEQuantConfig to FusedMoEMethodBase subclasses (#22537)
Signed-off-by: Bill Nell <bnell@redhat.com>
2025-09-17 17:43:31 -06:00
Karan Goel
2a4d6412e6
Add a batched auto tune script (#25076)
Signed-off-by: Karan Goel <karangoel@google.com>
Signed-off-by: Karan Goel <3261985+karan@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-17 22:41:18 +00:00
dolpm
1b962e2457
[fix] lora benchmarks pass no_lora_flag_cpu (#23774)
Signed-off-by: Dylan Maloy <34420038+dolpm@users.noreply.github.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-09-17 21:22:25 +08:00
Daniel Serebrenik
43a62c51be
Add more documentation and improve usability of lognormal dist (benchmark_serving_multi_turn) (#23255)
Signed-off-by: daniels <daniels@pliops.com>
2025-09-17 05:53:17 +00:00
Isotr0py
5a411ef6c4
[Benchmarks] Add MMVU video dataset support and clean up deprecated datasets (#24719)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-09-17 03:29:43 +00:00
Tahsin Tunan
cef32104b4
[FP8] Extend per-token-group quantization support to QuantFP8 (#24342)
Signed-off-by: Tahsin Tunan <tahsintunan@gmail.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Luka Govedič <lgovedic@redhat.com>
2025-09-16 18:31:06 -07:00
Ye (Charlotte) Qi
85e0df1392
[Docs] move benchmarks README to contributing guides (#24820)
2025-09-16 05:52:57 -07:00
Jee Jee Li
04ad0dc275
[benchmark] Add triton version in the moe tuned config (#24769)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-09-16 14:10:54 +08:00
Elvir Crnčević
98229db244
[Kernels][DP/EP] Optimize Silu Kernel for R1 (#24054)
Signed-off-by: elvircrn <elvircrn@gmail.com>
2025-09-13 00:17:27 -07:00
Didier Durand
bcb06d7baf
[Doc]: fix typos in various files (#24726)
Signed-off-by: Didier Durand <durand.didier@gmail.com>
2025-09-12 06:43:12 -07:00
Michael Goin
c3aea10dc8
[Perf] Use upstream CUTLASS for SM90 Block FP8 kernel (#23280)
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-09-11 15:43:14 -07:00
Ilya Markov
1fdd5c42d7
[Kernels] Enable Torch Symmetric Memory All-Reduce By Default (#24111)
Signed-off-by: ilmarkov <markovilya197@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-09-11 09:45:31 -07:00
Jee Jee Li
d11ec124a0
[Bench] Add qwen-next in benchmark_moe.py (#24661)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-09-11 21:29:43 +08:00
TaehyunKim
9bd831f501
[Model] New model support for Motif-1-Tiny (#23414)
Signed-off-by: ca1207 <ca1207zzz@gmail.com>
Signed-off-by: TaehyunKim <73943231+ca1207@users.noreply.github.com>
Co-authored-by: WyldeCat <skan1543@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-09-10 23:29:40 -07:00
Ekagra Ranjan
0dc9cbb527
[Benchmark] Update bench doc with mtbench, blazedit, spec bench (#24450)
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
2025-09-09 21:15:41 +00:00
Ye (Charlotte) Qi
6fb2788163
[CI/Build][Doc] Fully deprecate old bench scripts for serving / throughput / latency (#24411)
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
2025-09-09 10:02:35 +00:00
elvischenv
bba1042c6f
[Flashinfer] Support Flashinfer TRTLLM FP8-qkv BF16/FP16-out Attention Kernel (#23647)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
2025-09-08 20:53:07 -07:00
Jee Jee Li
62f66be1f7
[Bugfix] Fix Qwen3-coder moe tuned config (#24072)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-09-07 05:19:46 +00:00
Jiangyun Zhu
77aec83b8c
[Benchmark] add benchmark for custom activation op (#23908)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-09-06 20:12:05 -07:00
Didier Durand
83609ca91d
[Doc]: fix typos in Python comments (#24173)
Signed-off-by: Didier Durand <durand.didier@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-09-04 08:52:17 -07:00
anthonsu
04f3c35cff
Improve flexibility of auto_tune.sh execution. (#23766)
Signed-off-by: Anthony Su <50185138+anthonsu@users.noreply.github.com>
Signed-off-by: anthonsu <50185138+anthonsu@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-04 09:41:41 +00:00
Weida Hong
12e1e63cc5
[Misc] Enhance output readability of helper script (#24214)
Signed-off-by: Weida Hong <wdhongtw@google.com>
2025-09-04 06:38:26 +00:00
Peter Pan
b5ee1e3261
Remove deprecated PyNcclConnector (#24151)
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
2025-09-03 22:49:16 +00:00
Didier Durand
02d411fdb2
[Doc]: fix typos in Python comments (#24115)
Signed-off-by: Didier Durand <durand.didier@gmail.com>
2025-09-02 21:14:07 -07:00
co63oc
1bd007f234
fix some typos (#24071)
Signed-off-by: co63oc <co63oc@users.noreply.github.com>
2025-09-02 20:44:50 -07:00
Jiangyun Zhu
c83c4ff815
[Benchmark] Add support for local hf dataset path in benchmark (#23999)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
2025-09-02 17:49:16 +00:00
Michael Goin
b7adf94c4a
Tuned H100/H200 triton fp8 block configs for fused_qkv_a_proj (#23939)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-08-29 10:28:35 -07:00
YUQI.CHENG
66548f6603
[Bugfix] Fix benchmark_moe.py for blockwise fp8. (#23823)
Signed-off-by: crischeng <420985011@qq.com>
Co-authored-by: cris <grace@guisenbindeMacBook-Pro.local>
2025-08-28 21:44:09 +08:00
Michael Goin
a781e84ec2
[Perf] Tune configs for triton block fp8 gemm H100/H200 (#23748)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-08-28 11:12:53 +08:00
Chen Zhang
142ac08030
[Frontend] Optimize beam search performance by limiting concurrency (#23599)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-08-27 04:59:14 +00:00
Didier Durand
7c04779afa
[Doc]: fix various spelling issues in multiple files (#23636)
Signed-off-by: Didier Durand <durand.didier@gmail.com>
2025-08-26 14:05:29 +00:00
Breno Baldas Skuk
0cb7b065c3
Feature/benchmark/random mm data/images (#23119)
Signed-off-by: breno.skuk <breno.skuk@hcompany.ai>
2025-08-25 01:28:35 -07:00
Ming Yang
504d914314
[Perf] Add Triton config for DeepSeek V3 FP8 EP32 H200 (#23504)
Signed-off-by: Ming Yang <minos.future@gmail.com>
2025-08-24 18:06:35 -07:00
czhu-cohere
e76e233540
[kernel] Support W4A8 on Hopper (#23198)
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
2025-08-24 06:18:04 +00:00
elvischenv
24d0c9e6ed
[NVIDIA][torch.compile] Support Flashinfer TRTLLM FP8-q/kv NVFP4-out Attention Kernel (#22703)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-08-22 22:09:05 +00:00
Michael Goin
3bbe11cc13
[Perf] Small optimizations for silu_mul_fp8_quant_deep_gemm (#23265)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-08-21 17:56:15 -04:00