vllmellm
fb72ec8218
add missing kernels for cuda dispatch
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2025-11-04 12:23:38 +00:00
vllmellm
f10171cb3d
correct minimum capability req for channelwise torch
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2025-11-04 12:22:49 +00:00
vllmellm
a76f7bb90c
rename flash_infer.py to flashinfer.py
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2025-11-04 12:13:02 +00:00
vllmellm
f5e6cd9695
prefer QuantKey over ScaledMMLinearQuantStrategy
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2025-11-04 12:11:13 +00:00
vllmellm
a8010c7b1c
flash_infer missing out dtype bug fix
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2025-11-03 08:02:45 +00:00
vllmellm
7794009661
add missing arg
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2025-11-03 07:09:52 +00:00
vllmellm
b13c4bb25c
remove FP8LinearOps
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2025-11-01 16:30:32 +00:00
vllmellm
52ff537459
update modelopt path
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2025-11-01 16:28:18 +00:00
vllmellm
dd5a70ec71
update unit tests to use ScaledMMLinearKernels
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2025-11-01 16:28:03 +00:00
vllmellm
4ce0ba2df4
format
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2025-11-01 10:01:13 +00:00
vllmellm
8e8218ebac
Merge remote-tracking branch 'origin/main' into refactor-fp8-linear
2025-11-01 09:59:43 +00:00
vllmellm
d92c23b446
fix types; reduce boilerplate for int8
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2025-11-01 09:59:00 +00:00
ai-jz
2c0c7c39bd
feat(benchmarks): support HF model names in multi-turn benchmark (#27850)
2025-11-01 08:04:52 +00:00
Yihua Cheng
e675118849
[Add] cmdline argument parsing for KV cache offloading modules (#27621)
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-01 07:17:07 +00:00
TJian
e2347dbf58
[Bugfix] [Model] Missing MRoPE function definition from KeyeForConditionalGeneration (#27895)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-11-01 13:45:23 +08:00
Cyrus Leung
879a06579e
[CI/Build] Bump transformers version (#27528)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-31 22:11:07 -07:00
yugong333
29de3cdee4
Adding SplitK in fused_moe_lora kernel (#27818)
Signed-off-by: Yu Gong <yu3.gong@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-01 12:55:46 +08:00
Yan Ma
7e2729b57e
[Multimodal][XPU]Enable vision attn backend for xpu platform (#27525)
Signed-off-by: Yan Ma <yan.ma@intel.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: Yejing Lai <yejing.lai@intel.com>
Co-authored-by: Guancheng Fu <110874468+gc-fu@users.noreply.github.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2025-11-01 04:45:02 +00:00
Jee Jee Li
3a5de7d2d6
[Bugfix] Fix KDA output (#27905)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-01 11:54:36 +08:00
Jee Jee Li
bc4486d609
[Kernel] Enable FusedMoEModularKernel support bias (#27754)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-01 02:05:12 +00:00
Nick Hill
0cdbe7b744
[Core] Async scheduling + structured outputs compatibility (#26866)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-11-01 00:35:04 +00:00
Chen Zhang
df334868ca
[Hybrid] A simpler algorithm to find kernel_block_size (#26476)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-10-31 21:30:28 +00:00
Bram Wasti
0e0a638c3b
Batch invariance doc (#27839)
Signed-off-by: Bram Wasti <bwasti@meta.com>
Signed-off-by: Bram Wasti <bwasti@fb.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-10-31 17:22:19 -04:00
Matthew Bonanni
f29aeb5a25
Add FLASHINFER_MLA to test_mla_backends and add B200 CI run (#27663)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-10-31 11:12:19 -07:00
Vinay R Damodaran
5e8862e9e0
[Feature] Pydantic validation for scheduler.py and structured_outputs.py (#26519)
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-31 18:05:50 +00:00
Nick Hill
9e5bd3076e
[Cleanup] Remove no-longer-used SpeculativeConfig.enable_chunked_prefill (#27826)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-10-31 10:57:45 -07:00
Shu Wang
fc16f1c477
Flashinfer_CUTLASS_MOE fuses quantization for TP (#27223)
Signed-off-by: Shu Wang. <shuw@nvidia.com>
2025-10-31 17:54:29 +00:00
ZiTian Zhao
bc306fe5e9
fix incorrect type annotation in KimiMLP (#27885)
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
2025-10-31 17:38:02 +00:00
Chenguang Zheng
103a468bbf
[bugfix] Missing cached item in beam search (#27874)
Signed-off-by: fake0fan <645327136@qq.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-10-31 17:34:27 +00:00
Rob Mulla
70bfbd7b16
Docs update tpu install instructions (#27824)
Signed-off-by: Rob Mulla <rob.mulla@gmail.com>
Signed-off-by: Rob Mulla <RobMulla@users.noreply.github.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-31 10:29:55 -07:00
GuanLuo
d6517be3cd
[Bugfix] Missing NIXL metadata for handshake initialization if instance spans multi-node (#26338)
Signed-off-by: Guan Luo <gluo@nvidia.com>
Signed-off-by: GuanLuo <41310872+GuanLuo@users.noreply.github.com>
Signed-off-by: Guan Luo <41310872+GuanLuo@users.noreply.github.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
2025-10-31 10:16:00 -07:00
Isotr0py
7e06c40e63
[Bugfix] Fix broken MRoPE for GLM-4.1V/GLM-4.5V (#27860)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-10-31 17:04:51 +00:00
Madeesh Kannan
675704ac01
[Bugfix] Allow 64-bit integer values for LoRA IDs to avoid overflow/truncation (#27876)
Signed-off-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2025-10-31 16:58:42 +00:00
vllmellm
e845035f4c
bug fix
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2025-10-31 16:38:26 +00:00
vllmellm
5fbe76bc0a
format; update fbgemm path
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2025-10-31 15:08:19 +00:00
vllmellm
1f65cd56e5
revert input scale upper bounds
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2025-10-31 15:06:51 +00:00
vllmellm
7d361487f7
update ptpc path; bug fixes
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2025-10-31 14:52:51 +00:00
vllmellm
dd001064c0
reduce kernel init boilerplate
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2025-10-31 14:21:49 +00:00
Jee Jee Li
0384aa7150
[CI/Build] Add gpt-oss LoRA test (#27870)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-10-31 22:17:21 +08:00
vllmellm
423e2a625e
reduce logging boilerplate; update fp8 path
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2025-10-31 14:07:09 +00:00
Jiangyun Zhu
3857eb8725
[Perf] Decouple torch op from GDA to leverage torch.compile (#27871)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
2025-10-31 21:35:52 +08:00
Huamin Li
933cdea440
[BugFix] Don’t compute reorder threshold when there are no attention groups (#27861)
2025-10-31 11:36:18 +00:00
Isotr0py
3933f18a5e
[Bugfix] Avoid too small block m/n for FlexAttention kernel option (#27853)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-10-31 19:33:12 +08:00
toncao
e5ef4dfc11
[Kimi-Linear] Correct prefixes and add compatibility to AWQ quants (#27834)
Signed-off-by: toncao <cpatonn@gmail.com>
Co-authored-by: toncao <cpatonn@gmail.com>
2025-10-31 17:36:37 +08:00
vllmellm
38825fce0f
Merge branch 'main' into refactor-fp8-linear
2025-10-31 15:58:25 +08:00
Akash kaothalkar
36960501d3
[Hardware][Powerpc] Fix VLLM_CPU_OMP_THREADS_BIND="auto" low CPU utilization for Power (#27734)
Signed-off-by: Akash Kaothalkar <akash.kaothalkar@ibm.com>
Co-authored-by: Akash Kaothalkar <akash.kaothalkar@ibm.com>
2025-10-31 07:45:26 +00:00
Seiji Eicher
b2e65cb4a7
[benchmark] Make request IDs unique across clients by default (#27723)
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
2025-10-30 17:40:35 -07:00
Wentao Ye
2bf0bcc1fc
[CI Test] Add Scheduled Integration Test (#27765)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-10-30 17:29:26 -07:00
Jakub Sochacki
697f507a8e
[CI/Build][Intel] Enable performance benchmarks for Intel Gaudi 3 (#26919)
Signed-off-by: jakub-sochacki <jakub.sochacki@wp.pl>
2025-10-31 07:57:22 +08:00
Matthew Bonanni
d5d2a0fe74
[Misc] Make all tool scripts executable (#27831)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-10-30 23:46:02 +00:00