yuantao
4ad3f75875
Refactor code, add FLASH_ATTN_DIFFKV backend
...
Signed-off-by: yuantao <2422264527@qq.com>
2025-12-22 22:45:11 +08:00
yuantao
a7430ab479
Fix typos
...
Signed-off-by: yuantao <2422264527@qq.com>
2025-12-13 17:59:28 +08:00
yuantao
b565203d92
Refactor code. Extend FLASH_ATTN to support different KV sizes and create a new StaticSinkAttention for sink token logic
...
Signed-off-by: yuantao <2422264527@qq.com>
2025-12-13 15:47:33 +08:00
yuantao
de538d3b8f
Merge branch 'main' into Add_support_for_openpangu_promoe_v2
...
Signed-off-by: yuantao <2422264527@qq.com>
2025-12-13 10:09:21 +08:00
rasmith
08f8a5627e
[CI/Build][Kernel][BugFix][AMD] Fix per_token_group_quant_fp8 to use correct fp8 min/max values and update atol/rtol in test_quantfp8_group_functionality ( #30292 )
...
Signed-off-by: Randall Smith <ransmith@amd.com>
Co-authored-by: Randall Smith <ransmith@amd.com>
2025-12-12 18:41:56 -05:00
danielafrimi
13618626df
[MoE-FP8-modelopt] Add FlashInfer alignment padding for intermediate dimensions ( #29748 )
...
Signed-off-by: Daniel Afrimi <dafrimi@pool0-00589.cm.cluster>
Signed-off-by: dafrimi <dafrimi@nvidia.com>
Co-authored-by: Daniel Afrimi <dafrimi@pool0-00589.cm.cluster>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-12-12 20:42:32 +00:00
Xin Yang
1f19d8f899
[Perf] Set split_k to 1 for triton_kernels ( #30528 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com>
2025-12-12 14:07:57 -05:00
shivampr
cd7740ac5c
[ROCm] Enable Triton ScaledMM fallback + kernel selection fix ( #26668 )
...
Signed-off-by: Shivam <shivampr.dev@gmail.com>
Signed-off-by: Shivam <shivamprasad91@gmail.com>
2025-12-12 13:28:20 -05:00
Christina Norman
dc13c99eed
fix(gguf): Disable bfloat16 for GGUF on Blackwell devices ( #30408 )
...
Signed-off-by: Christina <truffle@gmail.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Christina Norman <christina@example.com>
Co-authored-by: Isotr0py <isotr0py@users.noreply.github.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-12 23:10:12 +08:00
Lucas Wilkinson
3e41992fec
[Attention] Use sparse prefill kernel for fp8 kv-cache in DeepSeek-v3.2 ( #27532 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-12-12 05:57:47 -08:00
Jaehwang Jung
f90319d5d1
[Bugfix] Schedule failure due to wrong get_image_size_with_most_features ( #29692 )
2025-12-12 02:27:20 -08:00
Michael Goin
9f2fc16a69
[Bugfix][Model] Fix Afmoe rope_parameters issue ( #30505 )
...
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-12 02:53:57 +00:00
Bhanu Prakash Voutharoja
6a6fc41c79
gptq marlin quantization support for fused moe with lora ( #30254 )
...
Signed-off-by: Bhanu068 <voutharoja.bhanu06@gmail.com>
2025-12-12 02:27:22 +00:00
jiahanc
0ab23c2b2b
[fix] fix SM check for Flashinfer TRTLLM MOE ( #30314 )
...
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
2025-12-12 01:00:58 +00:00
Andrew Briand
a00d88973d
[EPLB] Support EPLB w/ NVFP4 ( #29804 )
...
Signed-off-by: Andrew Briand <abriand@nvidia.com>
Co-authored-by: Andrew Briand <abriand@nvidia.com>
2025-12-11 22:59:40 +00:00
Wentao Ye
c817b14151
[Perf] Optimize deepgemm experts initialization, 3.9% TTFT improvement ( #30494 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: li-jinpeng <3332126450@qq.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2025-12-11 17:28:34 -05:00
Nicolò Lucchesi
0efd9f867c
[Core] Whisper Enable Encoder Batching ( #29421 )
...
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-12-11 21:06:51 +00:00
Harry Mellor
cf3eacfe58
Standardise get_rope to use rope_parameters["partial_rotary_factor"], not rotary_dim ( #30389 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-11 20:45:23 +00:00
汪志鹏
0e71eaa644
[Feature] AWQ marlin quantization support for fused moe with lora ( #30442 )
...
Signed-off-by: princepride <wangzhipeng628@gmail.com>
2025-12-11 18:03:32 +00:00
Harry Mellor
8781cd6b88
Add Eagle and Eagle3 support to Transformers modeling backend ( #30340 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-11 17:02:10 +00:00
Harry Mellor
93db3256a4
Give pooling examples better names ( #30488 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-11 16:22:58 +00:00
Cyrus Leung
3a3b06ee70
[Misc] Improve error message for is_multimodal ( #30483 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-11 06:39:51 -08:00
Cyrus Leung
13d63b65e0
[Deprecation] Remove missed fallback for embed_input_ids ( #30469 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-11 10:06:36 +00:00
Cyrus Leung
979f50efd0
[Deprecation] Remove fallbacks for embed_input_ids and embed_multimodal ( #30458 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-11 06:58:23 +00:00
gh-wf
36c9ce2554
Ensure minimum frames for GLM 4.6V compatibility ( #30285 )
...
Signed-off-by: Wayne Ferguson <wayneferguson@gmail.com>
2025-12-11 05:26:49 +00:00
Divakar Verma
d1e1fb4363
[Bugfix] Fix grouped_topk pytorch impl when num_experts can't be grouped properly ( #29439 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
2025-12-10 19:47:18 -08:00
Anker
e8e8cd73e5
[Bugfix] Fix HunyuanOCR cross-image contamination in batch processing ( #30344 )
...
Signed-off-by: Lennart Brog <lennart.borg@list-ag.de>
Signed-off-by: Anker <20343812+anker-c2@users.noreply.github.com>
2025-12-10 18:09:31 +00:00
Lucas Wilkinson
aacf0abf8b
[BugFix] Fix AttributeError: 'MergedColumnParallelLinear' object has no attribute 'weight_scale' ( #30399 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-12-10 07:59:23 -08:00
Roger Young
d017bceb08
[BugFix] Fix minimax m2 model rotary_dim ( #30384 )
...
Signed-off-by: xuebi <xuebi@minimaxi.com>
Co-authored-by: xuebi <xuebi@minimaxi.com>
2025-12-10 04:58:50 -08:00
Wilson Wu
3bdd426636
Fix typos in comments across multiple files ( #30345 )
...
Signed-off-by: Wilson Wu <iwilsonwu@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-12-09 20:05:28 -08:00
haoyangli-amd
06462392e4
[bugfix][quantization] fix quark qwen3 kv_cache quantization ( #30308 )
...
Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com>
2025-12-10 03:24:12 +00:00
ElizaWszola
2e7035dd8c
[Bugfix] Fix fp8 DeepGemm compilation issues ( #30336 )
2025-12-09 20:17:25 -05:00
Charlie Fu
3c680f4a17
[Rocm][torch.compile] Adding layernorm + fp8 block quant and silu + fp8 block quant for Aiter ( #25693 )
...
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
Signed-off-by: Charlie Fu <Charlie.Fu@amd.com>
Co-authored-by: Micah Williamson <micah.williamson@amd.com>
Co-authored-by: wuhuikx <hattie.wu@amd.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
2025-12-09 22:39:26 +00:00
Kyle Sayers
fccd532587
[Quantization] FP8 Weight Reloading for Quantized RL Rollout ( #28480 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
2025-12-09 13:54:32 -08:00
bnellnm
00e5cbb967
[MoE][Refactor] Remove most arguments to FusedMoEMethodBase.apply ( #29066 )
...
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
2025-12-09 13:48:25 -08:00
Tsukasa OI
73a484caa1
[Model][Quantization] Fix / Add GGUF support for Qwen2 MoE models ( #30307 )
...
Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com>
2025-12-09 19:13:10 +00:00
quanliu
5dcd593baf
[Feature] Batch-Invariant Support for FA2 and LoRA ( #30018 )
...
Signed-off-by: quanliu <18646313696@163.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-12-09 10:01:38 -05:00
vllmellm
ee14644ba9
[ROCm] Aiter Quant Kernels ( #25552 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2025-12-09 14:27:37 +00:00
Dongjie Zou
1166c31cc7
[Bugfix]: Fix glm46 awq marlin moe wna16 compatibility ( #30210 )
...
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com>
2025-12-09 12:20:21 +00:00
wang.yuqi
9c32df6101
[Bugfix] Qwen 3 VL Embedding loading ( #30303 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-09 08:04:02 +00:00
Tsukasa OI
58d5b3f514
[Model][Quantization] Restore MoE + GGUF models support (incl. Qwen3 MoE) by allowing Sideload Parameters ( #30116 )
...
Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-12-09 05:30:05 +00:00
liangel-02
4b03b50211
update torchao safetensors impl ( #30155 )
...
Signed-off-by: Angel Li <liangel@meta.com>
2025-12-09 12:46:35 +08:00
Michael Goin
03b91f7262
[Bugfix] Fix compressed-tensors models failing to load with transformers backend ( #30287 )
...
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-08 20:44:28 -08:00
czhu-cohere
f6227c22ab
[Kernel]Support W4A8 Grouped GEMM on Hopper ( #29691 )
...
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
2025-12-08 19:29:06 -08:00
Zhewen Li
ae339b1a67
[Bugfix] Fix DeepGEMM after #29546 ( #30267 )
...
Signed-off-by: zhewenli <zhewenli@meta.com>
Signed-off-by: Zhewen Li <zhewenli@meta.com>
2025-12-09 01:05:27 +00:00
Wentao Ye
d9417096d1
[Feature] Batch invariant: Enable TRITON_MLA without prefix-caching ( #29125 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-12-08 19:31:57 -05:00
Ming Yang
9d6235ca9a
[moe] Allow disabling DP chunking ( #29936 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com>
2025-12-09 00:29:36 +00:00
roikoren755
ae0f69b16a
Add SpecDec support to selective_state_update ( #29488 )
...
Signed-off-by: Roi Koren <roik@nvidia.com>
2025-12-08 16:45:18 -05:00
Vasiliy Kuznetsov
0d402d2600
online fp8 quant with streaming weight post-processing ( #29196 )
...
Signed-off-by: vasiliy <vasiliy@fb.com>
2025-12-08 20:15:10 +00:00
shaharmor98
fcd5306f65
Add latent MoE support ( #30203 )
...
Signed-off-by: Shahar Mor <smor@nvidia.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-12-08 17:35:01 +00:00