Li, Jiang
e2f56c309d
[CPU] Update torch to 2.9.1 for CPU backend ( #29664 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-11-28 13:37:54 +00:00
Jinzhen Lin
a67dec7cba
[Bugfix] Fix IMA issue in certain cases of the MoE Marlin kernel ( #28619 )
...
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-11-26 19:02:21 -08:00
Pleaplusone
d9d342d214
[Performance][MLA][ROCm] Remove redundant D2D copy in deepseek ( #27457 )
...
Signed-off-by: ganyi <ygan@amd.com>
2025-11-26 12:45:28 +08:00
Michael Goin
e502098643
[Kernel] Add NVFP4 MoE CUTLASS support for SM120 ( #29242 )
...
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
2025-11-25 06:59:07 -08:00
Pleaplusone
77e10c9cab
[Perf][Deepseek] Optimize gather_and_maybe_dequant_cache kernel performance for extremely long sequences ( #28029 )
...
Signed-off-by: ganyi <ygan@amd.com>
2025-11-24 19:05:46 -07:00
R3hankhan
4de87866a8
[CPU][IBM Z] Fix BF16 support and vectorize math operations for s390x ( #28926 )
...
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
2025-11-24 12:08:09 +00:00
Fadi Arafeh
730bd35378
[perf][cpu] Accelerate paged attention GEMMs (QK, PV) on Arm CPUs with NEON ( #29193 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
2025-11-22 09:04:36 -08:00
Jane (Yuan) Xu
e6309acdba
Simplify from_blob usage in get_cuda_view_from_cpu_tensor ( #29027 )
...
Signed-off-by: Jane Xu <janeyx@meta.com>
2025-11-22 10:35:32 +00:00
skaraban3807
f1805db1a6
[Perf] Enhance vLLM's NUMA functionality for systems with more than one NUMA node per socket ( #25559 )
...
Signed-off-by: Siddappa Karabannavar <siddappa.karabannavar@amd.com>
2025-11-21 14:13:52 +00:00
zhrrr
a982f5b5ea
[kernel][perf] Support non-contiguous input for rms_norm kernel ( #28103 )
...
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Signed-off-by: izhuhaoran <izhuhaoran@qq.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-11-20 19:39:09 -08:00
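The essence of handling non-contiguous input in a norm kernel is to step through rows via an explicit stride rather than assuming rows are packed end to end. Below is a minimal CPU-side sketch of that pattern, not the actual CUDA kernel from #28103; all names are assumptions and the weight multiply is omitted for brevity.

```cpp
#include <cmath>
#include <cstdint>

// Rows are addressed through `row_stride`, which may exceed `hidden` when the
// input is a non-contiguous view; output is written densely.
void rms_norm_strided(const float* input, float* out, int64_t num_rows,
                      int64_t hidden, int64_t row_stride, float eps) {
  for (int64_t r = 0; r < num_rows; ++r) {
    const float* row = input + r * row_stride;
    float sum_sq = 0.0f;
    for (int64_t c = 0; c < hidden; ++c) sum_sq += row[c] * row[c];
    const float scale = 1.0f / std::sqrt(sum_sq / hidden + eps);
    for (int64_t c = 0; c < hidden; ++c) out[r * hidden + c] = row[c] * scale;
  }
}
```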
Pleaplusone
06c20c9904
[ROCm] Add AMD GPU support for Deepseek v3.2 and SparseMLA ( #26670 )
...
Signed-off-by: ganyi <ygan@amd.com>
2025-11-20 02:54:01 -08:00
Vensen
fb8851f254
[Bugfix][cache_kernels] Fix OOB in cache_kernels.cu ( #28760 )
...
Signed-off-by: vensen <vensenmu@gmail.com>
Signed-off-by: Vensenmu <vensenmu@gmail.com>
2025-11-20 02:52:02 -08:00
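For context on the class of bug, a typical guard against an out-of-bounds write in a cache-copy loop looks like the host-side sketch below; the names (slot_mapping, dst) and layout are assumptions for illustration, not the kernel's actual code.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Copy per-token vectors into cache slots, skipping entries whose slot index
// falls outside the destination instead of writing out of bounds.
void copy_to_cache(const std::vector<float>& src,
                   const std::vector<int64_t>& slot_mapping,
                   std::vector<float>& dst) {
  const size_t n = std::min(src.size(), slot_mapping.size());
  for (size_t i = 0; i < n; ++i) {
    const int64_t slot = slot_mapping[i];
    if (slot < 0 || slot >= static_cast<int64_t>(dst.size())) continue;
    dst[static_cast<size_t>(slot)] = src[i];
  }
}
```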
Boyuan Feng
a903d59ffa
Clean up at::Tag::needs_fixed_stride_order ( #28974 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-11-20 02:51:36 -08:00
j20120307
bbc6c2f1e5
[CI/Build] Fix broken build on Apple M1 ( #28999 )
...
Signed-off-by: Kan Zhu <j20120307@gmail.com>
2025-11-19 11:07:22 +00:00
ihb2032
8151609583
refactor(cpu_types_scalar.hpp): Unify scalar loop implementations using unroll_loop ( #28847 )
...
Signed-off-by: ihb2032 <1355790728@qq.com>
Co-authored-by: lyd1992 <liuyudong@iscas.ac.cn>
2025-11-19 11:05:44 +00:00
Li, Jiang
20852c8f4c
[CPU] Refactor CPU WNA16 ( #28826 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-11-19 10:32:00 +08:00
tiehexue
e42bd8c2e3
Cast return value to int64_t for cache size ( #28814 )
...
Signed-off-by: tiehexue <tiehexue@hotmail.com>
2025-11-17 16:02:32 +00:00
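The title describes the standard fix for 32-bit overflow when a cache size is computed as a product of int dimensions: widen before multiplying so the whole product is evaluated in 64 bits. A hypothetical illustration of the pattern, not the patched function:

```cpp
#include <cstdint>

// Casting the first operand to int64_t promotes the entire product, so large
// caches no longer overflow a 32-bit intermediate before the return.
int64_t cache_size_bytes(int num_blocks, int block_size, int num_heads,
                         int head_size, int bytes_per_elem) {
  return static_cast<int64_t>(num_blocks) * block_size * num_heads *
         head_size * bytes_per_elem;
}
```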
Li, Jiang
577bb34fff
[CPU][Bugfix] Fix _to_list in CPU model runner ( #28824 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-11-17 07:47:24 +00:00
Jane (Yuan) Xu
74b5267d3a
Use narrow over indexing in hadacore_transform to prep for ABI stability ( #28756 )
...
Signed-off-by: Jane Xu <janeyx@meta.com>
2025-11-15 01:10:15 -08:00
Sage Moore
8977ffb5e6
[ROCm][Bugfix] Fix compilation errors with fused_qknorm_rope_kernel.cu ( #28682 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-11-14 11:06:01 -08:00
czhu-cohere
cdd7025961
[kernel] Improve FP8 PTPC on Hopper for larger shapes ( #28692 )
...
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
2025-11-14 09:59:11 -08:00
Michael Goin
622e6106a9
[CPU][Bugfix] Fix Apple Silicon M1 compilation failure ( #28681 )
...
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-11-14 09:49:55 +08:00
Varun Sundar Rabindranath
fe1cd7704d
[Performance][B200] silu_mul_quant: pack scales in int32 ( #28358 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-11-13 10:16:55 -08:00
Jane (Yuan) Xu
06c4873d95
Rewrite C++ meta funcs to Python ( #28595 )
...
Signed-off-by: Jane Xu <janeyx@meta.com>
2025-11-14 00:52:50 +08:00
Akash kaothalkar
86d15bfd8d
[Hardware][PowerPC] Fix fp16 compilation error for Power in CPU attention backend and bump oneDNN version ( #28535 )
...
Signed-off-by: Akash Kaothalkar <akash.kaothalkar@ibm.com>
Co-authored-by: Akash Kaothalkar <akash.kaothalkar@ibm.com>
2025-11-13 13:32:21 +00:00
usberkeley
4ab34f6ef1
Add NUMA node validation for CPU thread binding ( #28555 )
...
Signed-off-by: Bradley <bradley.b.pitt@gmail.com>
2025-11-13 07:03:52 +00:00
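A minimal sketch of what such validation amounts to: reject a binding request that names a NUMA node the host does not have, before attempting to bind. The available-node set is passed in explicitly here for illustration (in practice it would come from libnuma or sysfs), and all names are assumptions.

```cpp
#include <cstdio>
#include <set>
#include <vector>

// Returns false (with a diagnostic) if any requested node is not present,
// rather than failing later inside the thread-binding call.
bool validate_numa_nodes(const std::vector<int>& requested,
                         const std::set<int>& available) {
  for (int node : requested) {
    if (available.count(node) == 0) {
      std::fprintf(stderr, "NUMA node %d is not present on this host\n", node);
      return false;
    }
  }
  return true;
}
```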
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
4ca5cd5740
[Core][AMD] Migrate fully transparent sleep mode to ROCm platform ( #12695 )
...
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: kliuae <kuanfu.liu@embeddedllm.com>
2025-11-12 15:24:12 -08:00
TJian
edb59a9470
[ROCm][Bugfix] Fix fused_qknorm_rope_kernel ROCm compatibility ( #28500 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-11-12 05:01:14 -08:00
Li, Jiang
7f829be7d3
[CPU] Refactor CPU attention backend ( #27954 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-11-12 09:43:06 +08:00
Xin Yang
6c3c0f8235
[Kernel] Optimize rms_norm kernel ( #27931 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com>
2025-11-11 18:02:23 +00:00
zhrrr
68c09efc37
[Kernel][Perf] Fuse QK Norm and RoPE into one CUDA kernel for Qwen models ( #27165 )
...
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
2025-11-11 12:00:31 -05:00
Michael Goin
f9a4087182
Remove weight_scale.T special case for SM90 Block FP8 CUTLASS kernel ( #28431 )
...
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-11-11 11:46:04 -05:00
Varun Sundar Rabindranath
b039bfda8f
[Bugfix] Fix persistent_masked_m_silu_mul_quant tests ( #28366 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-11-10 09:21:52 -08:00
ElizaWszola
171133f929
[Bugfix] Fix fused quant layernorm tests ( #27865 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-11-08 14:31:33 -08:00
Michael Goin
0852527647
[Perf][DeepSeek] Add sigmoid+bias fusion to fused_grouped_topk from TRTLLM ( #28124 )
...
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-11-07 18:20:55 -08:00
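Conceptually, the fused step scores each expert as sigmoid(router_logit) plus a per-expert bias before the grouped top-k selection. A scalar sketch of that scoring, illustrative only (the actual change fuses it into the top-k kernel itself):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Score experts as sigmoid(logit) + bias; the fused kernel computes this
// inline during grouped top-k instead of as a separate elementwise pass.
std::vector<float> score_experts(const std::vector<float>& logits,
                                 const std::vector<float>& bias) {
  const size_t n = std::min(logits.size(), bias.size());
  std::vector<float> scores(n);
  for (size_t e = 0; e < n; ++e) {
    scores[e] = 1.0f / (1.0f + std::exp(-logits[e])) + bias[e];
  }
  return scores;
}
```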
Pavani Majety
72b1c2ae2c
[Bugfix] Use latency MoE backend as default for Flashinfer, plus other misc fixes ( #27439 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
2025-11-07 04:18:39 -08:00
courage17340
981cadb35c
[Bugfix][Kernel] fix merge attn states when both prefix and suffix are empty ( #28181 )
...
Signed-off-by: courage17340 <courage17340@163.com>
2025-11-06 17:52:13 +08:00
lyrisz
97e3dda84b
[Perf] SM100 - add swap AB optimization to CUTLASS FP8 GEMM ( #27284 )
...
Signed-off-by: Faqin Zhong <faqin.zhong@gmail.com>
Co-authored-by: Faqin Zhong <zhofaqin@amazon.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-11-04 07:49:25 -08:00
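"Swap AB" refers to the transpose identity below: the GEMM can compute the transposed product with the operands' roles exchanged, which lets the kernel put the more favorable problem dimension on the side it handles best. Stated as an identity (my summary, not wording from the PR):

$$C = AB \iff C^{\mathsf{T}} = B^{\mathsf{T}} A^{\mathsf{T}}$$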
xiangze-arm
f32cbc9a0c
[CPU] Improve dynamic 4-bit MoE performance ( #27240 )
...
Signed-off-by: Zhang Xiangze <Xiangze.Zhang@arm.com>
2025-11-04 06:33:23 +00:00
gnovack
294c805f1d
Early exit for MoE LoRA kernels ( #27131 )
...
Signed-off-by: gnovack <gnovack@amazon.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-03 20:22:17 +08:00
Asaf Joseph Gardin
00b31a36a2
[V1] [Hybrid] Mamba1 Automatic Prefix Caching ( #26377 )
...
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com>
2025-11-02 04:16:23 -08:00
Fadi Arafeh
2080b05099
[cpu][fix] Fix onednn_mm crash on consecutive matmuls with the same M, K, N but different dtypes ( #27472 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
2025-10-24 15:57:48 +00:00
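The symptom, identical M, K, N but a different dtype, suggests a matmul-primitive cache keyed only on shape, so the earlier primitive gets reused for the wrong dtype. A hedged sketch of keying that also folds in the dtype; the struct and field names are assumptions, not oneDNN's or vLLM's API.

```cpp
#include <cstdint>
#include <tuple>

// Cache key for a matmul primitive. Without `dtype_tag`, a primitive built
// for fp32 could be looked up and reused for a bf16 call with the same shape.
struct MatmulKey {
  int64_t M, K, N;
  int dtype_tag;  // e.g. 0 = fp32, 1 = bf16 (illustrative encoding)
  bool operator<(const MatmulKey& o) const {
    return std::tie(M, K, N, dtype_tag) < std::tie(o.M, o.K, o.N, o.dtype_tag);
  }
};
```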
Xiangyu Li
5cc6bddb6e
[Kernel] Add GPTQv2 format support for low-bit or asymmetric quantization by adapting gptq_gemm ( #26092 )
2025-10-23 23:26:13 -04:00
gnovack
8e4ca4d14e
Bugfix - pass 'max_num_tokens_padded' into 'moe_lora_align_block_size' ( #27311 )
...
Signed-off-by: gnovack <gnovack@amazon.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-10-22 12:23:57 +00:00
Lain
09a7e6f617
[Deepseek v3.2] Remove extra logic in indexer ( #26465 )
...
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Signed-off-by: Lain <siyuanf@nvidia.com>
Co-authored-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-10-21 23:34:03 +00:00
Daniel Cámpora
80e9452984
[Deepseek v3.2] Optimize top_k_per_row ( #26763 )
...
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-10-21 08:30:07 +00:00
Chen Wu
5f6cbf60d6
[Feature][Kernel] FusedMoE LoRA ( #21229 )
...
Signed-off-by: wuchen <cntryroa@gmail.com>
Signed-off-by: banjuede <lmklhc@163.com>
Signed-off-by: Chen Wu <cntryroa@gmail.com>
Signed-off-by: Danielle Robinson <dmmaddix@amazon.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: bk-201 <joy25810@foxmail.com>
Co-authored-by: wuchen <wuchen@zetyun.com>
Co-authored-by: Nathan Van Gheem <vangheem@gmail.com>
Co-authored-by: banjuede <lmklhc@163.com>
Co-authored-by: Danielle Robinson <dmmaddix@amazon.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: bk-201 <joy25810@foxmail.com>
2025-10-21 03:01:37 +00:00
Isotr0py
3125d79950
[Chore] Remove unused PolyNorm layer ( #27110 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-10-17 19:03:43 +00:00
rasmith
b10c64c834
[ROCm][Bugfix][Model] Fix illegal memory access when running qwen3_moe models with rms_norm (Qwen3-235B-A22B, Qwen3-30B-A3B, etc.) ( #26192 )
...
Signed-off-by: Randall Smith <ransmith@amd.com>
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
Signed-off-by: rasmith <Randall.Smith@amd.com>
Co-authored-by: Randall Smith <ransmith@amd.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-10-17 14:17:18 -04:00
Luka Govedič
bd7157a071
[torch.compile] Enable attention and allreduce fusion without custom ops enabled ( #24604 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-10-17 08:10:23 -06:00