Wentao Ye | c1acd6d7d4 | [Refactor] Change the way of import triton (#20774); Signed-off-by: yewentao256 <zhyanwentao@126.com> | 2025-07-12 19:39:55 -07:00 |
ElizaWszola | 3b3b778d4a | [Bugfix] Fix a couple PPLX+CUTLASS MoE bugs (#20825); Signed-off-by: ElizaWszola <ewszola@redhat.com> | 2025-07-12 19:39:14 -07:00 |
Wentao Ye | 42d440c22b | [Perf] Use Triton instead of Torch for DeepGEMM Per Token Group Quant (#20841); Signed-off-by: yewentao256 <zhyanwentao@126.com> | 2025-07-12 19:38:45 -07:00 |
Woosuk Kwon | f45a332886 | [Sched] Enhance the logic to remove stopped requests from queues (#20739) | 2025-07-12 15:33:13 -07:00 |
Michael Goin | 6e2c176e1f | [Bugfix] Restrict Machete to only run on Hopper (#20830); Signed-off-by: mgoin <mgoin64@gmail.com> | 2025-07-12 17:34:40 +00:00 |
Reid | a86754a12b | [docs] convert supported configs to table (#20858); Signed-off-by: reidliu41 <reid201711@gmail.com> | 2025-07-12 06:54:50 -07:00 |
Alex Brooks | c2a2f19aba | [Bugfix] Fix Tensor Parallelism Padding Consistency in Granite Models (#20843); Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> | 2025-07-12 06:11:30 -07:00 |
Congcong Chen | 2c11a738b3 | [Model] New model support for microsoft/Phi-4-mini-flash-reasoning (#20702); Signed-off-by: Congcong Chen <congcongchen@microsoft.com> | 2025-07-12 06:02:10 -07:00 |
Michael Goin | b639327ad9 | Revert "Use NVCC --compress-mode to reduce binary size by 30% #20694" (#20853); Signed-off-by: mgoin <mgoin64@gmail.com> | 2025-07-11 23:07:35 -07:00 |
Zhiyu | 4afe687a82 | Enable ModelOpt Llama4 fp8 checkpoint deployment (#20419); Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com> | 2025-07-11 23:07:16 -07:00 |
Maximilien de Bayser | 5de8d9f111 | Remove extra tensor on CPU (#20693); Signed-off-by: Max de Bayser <mbayser@br.ibm.com> | 2025-07-12 14:06:34 +08:00 |
Boyuan Feng | c1c8ca57ff | [cold start time] add envs.VLLM_COMPILE_DEPYF to guard decompile (#20790); Signed-off-by: Boyuan Feng <boyuan@meta.com> | 2025-07-11 23:06:13 -07:00 |
Richard Zou | a3a5a47e48 | [Bugfix] Fix torch.compile x LoRA for PyTorch 2.8 (#20823); Signed-off-by: rzou <zou3519@gmail.com> | 2025-07-11 23:06:04 -07:00 |
Lucia Fang | fb25e95688 | [Docs] Update basic.md (#20846) | 2025-07-11 23:05:32 -07:00 |
Wentao Ye | 0d4891cd03 | [Bug] Fix DeepGemm for EP low latency case (#20833); Signed-off-by: yewentao256 <zhyanwentao@126.com> | 2025-07-11 23:05:12 -07:00 |
lkchen | f56d2996ca | [Misc] Respect no_use_tqdm_on_load flag while capturing CUDA graph (#20834); Signed-off-by: Linkun <github@lkchen.net> | 2025-07-11 23:04:45 -07:00 |
Isotr0py | 147afb448b | [Bugfix] Replace unavailable video url in multimodal test (#20854); Signed-off-by: Isotr0py <2037008807@qq.com> | 2025-07-12 05:25:39 +00:00 |
Nicolò Lucchesi | 3c7d942da8 | [Frontend] Abstract prompt and SpeechToTextConfig for transcriptions models (#20637); Signed-off-by: NickLucche <nlucches@redhat.com> | 2025-07-11 21:33:26 -07:00 |
Varun Sundar Rabindranath | 890323dc1b | [Bugfix] : Fix typo - logger.warn_once -> logger.warning_once (#20852) | 2025-07-11 20:56:24 -07:00 |
Isotr0py | 01cae37713 | [CI/Build] Ensure compatability with Transformers v4.53 (#20541); Signed-off-by: Isotr0py <2037008807@qq.com>; Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> | 2025-07-11 20:53:07 -07:00 |
yurhett | 11c0198615 | [Bugfix] Fix tensor parallel issue in Qwen3 reranker weight loading (#20682); Signed-off-by: Isotr0py <2037008807@qq.com>; Co-authored-by: Isotr0py <2037008807@qq.com> | 2025-07-11 20:52:43 -07:00 |
Li, Jiang | b1235c3e10 | [Bugfix] Lazy import fused_experts in BitsAndBytesMoEMethod to avoid break not-cuda-alike devices (#20822); Signed-off-by: jiang1.li <jiang1.li@intel.com> | 2025-07-11 20:52:05 -07:00 |
Jee Jee Li | 44d02f54db | [Misc] Restrict deep_gemm's log output (#20827); Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> | 2025-07-11 20:50:42 -07:00 |
Trevor Morris | a8593237c0 | Add pynccl all-gatherv and reducescatterv (#20154); Signed-off-by: Trevor Morris <tmorris@nvidia.com>; Signed-off-by: mgoin <mgoin64@gmail.com>; Co-authored-by: mgoin <mgoin64@gmail.com> | 2025-07-11 18:59:23 -07:00 |
Ilya Markov | fc0f41d10a | Integration SM100 FlashInfer fused allreduce RMSNorm (#20691); Signed-off-by: ilmarkov <imarkov@redhat.com>; Co-authored-by: ilmarkov <imarkov@redhat.com> | 2025-07-11 18:58:15 -07:00 |
Wentao Ye | 7b828e30d5 | [CI Bug] Fix Async Engine, Inputs, Utils, Worker Test: 'State' object has no attribute 'enable_server_load_tracking' (#20845); Signed-off-by: yewentao256 <zhyanwentao@126.com> | 2025-07-11 18:57:24 -07:00 |
bigmoyan | 5f0af36af5 | Update kimi-k2 tool calling docs, enable unit tests (#20821); Signed-off-by: wangzhengtao <wangzhengtao@moonshot.cn>; Co-authored-by: wangzhengtao <wangzhengtao@moonshot.cn>; Co-authored-by: wangzhengtao <wangzhengtao@msh.team> | 2025-07-11 20:16:14 +00:00 |
Isotr0py | 0d21b2664c | [Bugfix] Fix OOM in language generation test (#20814); Signed-off-by: Isotr0py <2037008807@qq.com>; Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> | 2025-07-11 11:21:52 -07:00 |
Nick Hill | 9907fc4494 | [Docs] Data Parallel deployment documentation (#20768); Signed-off-by: Nick Hill <nhill@redhat.com> | 2025-07-11 09:42:10 -07:00 |
Michael Goin | d47661f0cd | [Kernel] Basic tuned configs for NVFP4 CUTLASS dense GEMM (#20646); Signed-off-by: mgoin <mgoin64@gmail.com> | 2025-07-11 10:05:33 -06:00 |
Varun Sundar Rabindranath | 53fa457391 | [Misc] Add unit tests for MoE ModularKernel combinations + Profiling utility (#20449); Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>; Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com> | 2025-07-11 07:51:46 -07:00 |
Reid | 6fb162447b | [doc] fix ordered list issue (#20819); Signed-off-by: reidliu41 <reid201711@gmail.com> | 2025-07-11 06:49:46 -07:00 |
Li, Jiang | 66177189c5 | [Bugfix] Add missing field to TritonLanguagePlaceholder (#20812); Signed-off-by: jiang1.li <jiang1.li@intel.com> | 2025-07-11 05:25:11 -07:00 |
QiliangCui | b4f0b5f9aa | Temporarily suspend google/gemma-3-1b-it. (#20722); Signed-off-by: Qiliang Cui <derrhein@gmail.com> | 2025-07-11 11:21:26 +00:00 |
Cyrus Leung | cbd14ed561 | [Bugfix] Refactor /invocations to be task-agnostic (#20764); Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> | 2025-07-11 03:20:54 -07:00 |
Pavani Majety | 7bd4c37ae7 | [Core] Add Flashinfer TRTLLM Backend for Flashinfer decode path (SM100). (#19825); Signed-off-by: Pavani Majety <pmajety@nvidia.com>; Signed-off-by: mgoin <mgoin64@gmail.com>; Co-authored-by: shuw <shuw@nvidia.com>; Co-authored-by: mgoin <mgoin64@gmail.com> | 2025-07-11 09:23:23 +00:00 |
Jee Jee Li | 8020e98c9f | [Quantization][1/N] MoE support BNB-Inflight Quantization (#20061); Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> | 2025-07-11 08:01:13 +00:00 |
Luka Govedič | 762be26a8e | [Bugfix] Upgrade depyf to 0.19 and streamline custom pass logging (#20777); Signed-off-by: Luka Govedic <lgovedic@redhat.com>; Signed-off-by: luka <lgovedic@redhat.com> | 2025-07-11 00:15:22 -07:00 |
Reid | 6a9e6b2abf | [doc] fold long code block (#20795); Signed-off-by: reidliu41 <reid201711@gmail.com> | 2025-07-10 23:16:41 -07:00 |
nopperl | 5d09152ff1 | [V1] Enable Mamba2 layers other than MambaMixer2 in the v1 engine (#20660); Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com> | 2025-07-11 05:53:31 +00:00 |
Luka Govedič | 31d5c1797f | [Perf][fp8] Use CustomOp abstraction for fp8 quant for better perf (#19830); Signed-off-by: Luka Govedic <lgovedic@redhat.com>; Co-authored-by: mgoin <mgoin64@gmail.com> | 2025-07-11 04:56:28 +00:00 |
Ratnam Parikh | 35514b682a | [XPU] XCCL support enabled in torch 2.8.0.dev nightly builds (#20705); Signed-off-by: ratnampa <ratnam.parikh@intel.com> | 2025-07-10 20:39:52 -07:00 |
Wentao Ye | e2de455c34 | [Feature] Integrate SM100 DeepGEMM support (#20087) | 2025-07-10 20:18:05 -07:00 |
Alexander Matveev | 5b032352cc | [Attention] MLA - Flashinfer Ragged Prefill (#20034) | 2025-07-10 20:17:47 -07:00 |
Michael Goin | 922f316441 | [Model] Support HF format of minimax (#20211); Signed-off-by: mgoin <mgoin64@gmail.com> | 2025-07-11 02:55:21 +00:00 |
Duncan Moss | 5923ab9524 | [fix]: disable cutlass block scaled group gemm for EP (#20781); Signed-off-by: Duncan Moss <djm.moss@gmail.com> | 2025-07-11 02:39:18 +00:00 |
bigmoyan | 0cf893cae1 | Add kimi-k2 tool parser (#20789); Signed-off-by: wangzhengtao <wangzhengtao@moonshot.cn>; Co-authored-by: wangzhengtao <wangzhengtao@moonshot.cn>; Co-authored-by: wangzhengtao <wangzhengtao@msh.team> | 2025-07-11 10:36:23 +08:00 |
Michael Goin | cf75cd2098 | [CI Bugfix] Specify same TORCH_CUDA_ARCH_LIST for flashinfer aot and install (#20772); Signed-off-by: mgoin <mgoin64@gmail.com> | 2025-07-11 01:16:01 +00:00 |
Simon Mo | b854321ffe | [Docs] Lazy import gguf (#20785); Signed-off-by: simon-mo <simon.mo@hey.com> | 2025-07-10 16:06:37 -07:00 |
Kuntai Du | 5b6fe23d05 | [Bugfix][Benchmark] Make sure the output length > 0 when testing prefill workload. (#20786); Signed-off-by: KuntaiDu <kuntai@uchicago.edu>; Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> | 2025-07-10 14:52:46 -07:00 |