508 Commits

Author SHA1 Message Date
bnellnm
2902c34826
[Kernels] Remove BatchedTritonOrDeepGemmExperts and default fallback to Triton (#29929)
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: bnellnm <49004751+bnellnm@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-12-03 20:49:00 +00:00
Varun Sundar Rabindranath
19bee6d12d
[Performance][DP/EP] Add silu_mul_per_token_group_quant_fp8_colmajor kernel (#29470)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-12-03 18:04:59 +00:00
Julien Denize
d8c6210eea
Add Mistral Large 3 and Ministral 3 (#29757)
Signed-off-by: Julien Denize <julien.denize@mistral.ai>
Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com>
Signed-off-by: Mickael Seznec <mickael@mistral.ai>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Mickael Seznec <mickael@mistral.ai>
2025-12-02 10:29:00 +00:00
Shu Wang
f72a817bdf
[MoE] CuteDSL MoE with Nvfp4 DeepEP dispatch (#27141)
Signed-off-by: Shu Wang <shuw@nvidia.com>
Signed-off-by: Shu Wang. <shuw@nvidia.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: root <root@umbriel-b200-017.ipp4a1.colossus.nvidia.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-11-30 16:05:32 -08:00
Omer Ullman Argov
39d28108f4
[Feat] Support non-gated activations in NVFP4 modelopt path (#29004)
2025-11-30 11:02:40 -05:00
BowTen
9381b5cde0
[Doc]: Fix typo in fused_moe layer (#29731)
Signed-off-by: BowTen <bowten@qq.com>
2025-11-29 22:29:13 -08:00
Xin Yang
a491b0911b
[LoRA] Support FusedMoE LoRA Triton kernel for mxfp4 (#29708)
Signed-off-by: Xin Yang <xyangx@amazon.com>
Signed-off-by: Xin Yang <105740670+xyang16@users.noreply.github.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-30 10:37:25 +08:00
Jinzhen Lin
1656ad3704
[Kernel][Quantization] add w4a8 support for marlin kernel (#24722)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
2025-11-29 07:19:33 -08:00
Huamin Li
3fd1fb0b60
Revert "[LoRA] Support FusedMoE LoRA Triton kernel for mxfp4 (#28971)" (#29697)
Signed-off-by: Huamin Li <3ericli@gmail.com>
2025-11-28 15:26:52 -08:00
Xin Yang
745a3bae1a
[LoRA] Support FusedMoE LoRA Triton kernel for mxfp4 (#28971)
Signed-off-by: Xin Yang <xyangx@amazon.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-28 10:48:28 +08:00
Jinzhen Lin
a67dec7cba
[Bugfix] fix IMA issue in certain cases of the moe marlin kernel (#28619)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-11-26 19:02:21 -08:00
HDCharles
df01eda4dc
[Bugfix] Make compressed-tensors MoEs respect ignored layers (#28878)
Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
2025-11-26 21:35:13 -05:00
Huamin Li
70d5953f82
Revert "[Bugfix] Fix GPT-OSS AR+NORM fusion (#28841)" (#29483)
Signed-off-by: Huamin Li <3ericli@gmail.com>
2025-11-26 22:27:26 +08:00
Xin Yang
53d7f1f601
[Kernel] Use pre-allocated output buffer for triton kernel fused_experts (#29219)
Signed-off-by: Xin Yang <xyangx@amazon.com>
2025-11-26 10:21:00 +08:00
elvischenv
6330f9477d
[Bugfix] Fix GPT-OSS AR+NORM fusion (#28841)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
2025-11-25 07:59:40 +00:00
Fadi Arafeh
98caeadd54
[fix][cpu] Use a SwigluOAI impl which supports interleaved gate-up weights (#29273)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
2025-11-25 15:11:11 +08:00
bnellnm
8f066146c3
[MoE][Refactor] Make select_experts a non-static method (#29067)
Signed-off-by: Bill Nell <bnell@redhat.com>
2025-11-24 13:38:04 -05:00
Federico
f55c76c2b3
chore: add RTX_PRO_6000 GLM4.6-FP8 kernel tuning (#29240)
2025-11-22 08:42:48 -08:00
jinghanhu
988ee66b0d
Handle triton kernel import exception (#29062)
2025-11-22 10:07:50 +00:00
FlintyLemming
052950e5b3
Add fused MoE config for H200 E160 N192 fp8 (#29182)
Signed-off-by: FlintyLemming <admin@flinty.moe>
2025-11-21 17:37:51 -08:00
Varun Sundar Rabindranath
3137991f55
[BugFix] EPLB + B200 + DeepGEMM : Handle column-major scales tensor (#29162)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-11-21 14:28:17 -08:00
Lucas Wilkinson
1840c5cb18
[BugFix] Make sure to allocate worst case MoE workspace during profile run in the DP + EP case (#27426)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-11-21 11:41:52 -08:00
Wentao Ye
a42ab317ac
[Log] Optimize startup log (#28948)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-11-21 08:46:20 -08:00
Wentao Ye
df44df0143
[Feature] Shared Experts Overlap with FI deepgemm swap kernel, 2.2% throughput improvement and 3.6% TTFT improvement (#28879)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-11-20 18:41:49 -07:00
Wentao Ye
5031cd5d55
[Refactor] Optimize select_experts (#28069)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-11-19 18:53:15 -05:00
Max Hu
cb0a7b4bea
[Bugfix] Move flashinfer kernel check into `__init__` function of `FusedMoE` (#29018)
Signed-off-by: Max Hu <hyoung2991@gmail.com>
2025-11-19 21:54:15 +00:00
Shu Wang
613abb50d5
[MoE] Nvfp4 Masked Gemm: Add flashinfer grouped_gemm_nt_masked (#25990)
Signed-off-by: Shu Wang. <shuw@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-11-19 13:29:06 -08:00
Qiu
2fd893b4ce
[Feature] Prefill Context Parallel (PCP) basic support (#28718)
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Co-authored-by: FENP <yuanyongjie.yyj@antgroup.com>
Co-authored-by: LookAround <lixushi@huawei.com>
Co-authored-by: Jingchun Gao <gaojingchun1@huawei.com>
Co-authored-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Co-authored-by: Jingchun Gao <63247409+gjc0824@users.noreply.github.com>
2025-11-19 15:52:44 -05:00
杰兮
9d2d561257
[Bugfix] Fix precision corruption when shared_experts_stream=None (#28942)
Signed-off-by: zhyajie <yajizhan@amd.com>
Co-authored-by: zhyajie <yajizhan@amd.com>
2025-11-19 19:30:57 +00:00
Robert Shaw
fe69f331f8
[Kernels] Improve H200 Fused MoE Config (#28992)
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
2025-11-19 19:23:54 +00:00
Chen Bruce
da2f6800e0
[Feat][Perf] Enable deepep-low-latency with round-robin expert placement. (#28449)
Signed-off-by: bruceszchen <bruceszchen@tencent.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-19 13:46:24 +01:00
Li, Jiang
20852c8f4c
[CPU] Refactor CPU WNA16 (#28826)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-11-19 10:32:00 +08:00
amirkl94
03ee48111d
Feature: Support Relu2 in FusedMoE fp8 cutlass path (#27261)
2025-11-16 13:39:44 -05:00
Zhewen Li
1ec978c209
[Kernel][Moe Configs] llama4 maverick fp8 moe config tp8 on mi325 (#28709)
Signed-off-by: Zhewen Li <zhewenli@meta.com>
2025-11-15 01:10:48 -08:00
Varun Sundar Rabindranath
6965ef436f
[Performance][DeepGEMM] Estimate expected_m (#28694)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-11-15 13:52:14 +08:00
Alexander Matveev
e5c78956c0
[Bugfix] Fix incorrect use of hidden_states for shared_experts due to do_naive_dispatch_combine (#28740)
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
2025-11-14 14:13:46 -08:00
Andrey Khalyavin
fd4555089a
[BugFix] Fix misprint introduced by modular_kernel refactoring. (#28728)
Signed-off-by: Andrey Khalyavin <halyavin@yandex-team.ru>
2025-11-14 10:58:18 -08:00
Duncan Moss
3f8a874065
[Kernels] Enable FlashInfer FP8 Blockscale on SM90 (for TEP DSR1) (#27134)
Signed-off-by: Duncan Moss <djm.moss@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-11-14 08:02:44 -08:00
Varun Sundar Rabindranath
fe1cd7704d
[Performance][B200] silu_mul_quant: pack scales in int32 (#28358)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-11-13 10:16:55 -08:00
Lucia Fang
7e082bc14e
Support DeepEP for Kimi-k2-thinking through enabling gemm selection for compressed-tensor marlin wna16 (#28574)
Signed-off-by: Lu Fang <fanglu@fb.com>
2025-11-12 21:40:45 -08:00
Alexander Matveev
69d0e90313
[MoE][Kernel][Perf] Improve Shared Expert Stream Overlap (#28406)
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
2025-11-12 23:37:24 +00:00
Varun Sundar Rabindranath
74a9a9faad
[Performance][B200] Fix deepgemm prologue (#27897)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-11-12 13:13:03 -08:00
PerryZhang01
a1e7fa362a
[EPLB][ROCm]: support EPLB for ROCm backend (#27731)
Signed-off-by: Perry Zhang <perzhang@amd.com>
Co-authored-by: Perry Zhang <perzhang@amd.com>
2025-11-12 18:16:35 +00:00
Michael Goin
e5f599d4d1
[Bugfix] Disable shared expert overlap if Marlin MoE is used (#28410)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-11-11 23:16:12 +00:00
bnellnm
a1448b4b69
[Kernels] Split up fused_moe/layer.py, isolate more modular kernel code (#28064)
2025-11-11 07:29:02 -07:00
Robert Shaw
e605e8e323
[Bugfix] Fix Stream Sync for Shared Expert Overlap (#28430)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
Co-authored-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2025-11-11 05:59:08 +00:00
Ilya Markov
d17ecc6b19
[PERF] Allreduce fusion. Support torch native matching. Tuning of the thresholds (#24248)
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Signed-off-by: ilmarkov <markovilya197@gmail.com>
Co-authored-by: Luka Govedič <lgovedic@redhat.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
2025-11-10 18:33:11 -05:00
Lucas Wilkinson
6dec9f6109
[BugFix] Fix DeepGEMM over-allocating workspace (#28254)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-11-10 17:01:17 -05:00
Sage Moore
40d33264c6
[Bugfix][EPLB] Disabled shared expert overlap when EPLB is enabled (#28377)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
Signed-off-by: Sage Moore <sagemoore@utexas.edu>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-11-10 20:39:19 +00:00
jiahanc
34553b9d27
[Performance] Support FP8 flashinfer TRTLLM MOE on Qwen3 and Qwen-3next (#27492)
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
2025-11-10 12:34:57 -05:00