Akshat Tripathi
c20ef40fd0
[Hardware][TPU][V1] Multi-LoRA implementation for the V1 TPU backend ( #14238 )
...
Signed-off-by: Akshat Tripathi <akshat@krai.ai>
Signed-off-by: Chengji Yao <chengjiyao@google.com>
Co-authored-by: Chengji Yao <chengjiyao@google.com>
2025-05-07 16:28:47 -04:00
Bowen Bao
db593aa67f
[Quantization] Quark MXFP4 format loading ( #16943 )
2025-05-07 15:05:05 -04:00
Isotr0py
f98e307588
[Bugfix] Fix missing lora name mapping for lora without prefix ( #17793 )
...
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-05-07 16:17:12 +00:00
Harry Mellor
646a31e51e
Fix and simplify deprecated=True CLI kwarg ( #17781 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-05-07 16:51:06 +01:00
Isotr0py
be8ff88e66
[Bugfix] Fix Video IO error for short video ( #17791 )
...
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-05-07 15:36:06 +00:00
Christian Heimes
1a6af1453d
Only depend on importlib-metadata for Python < 3.10 ( #17776 )
...
Signed-off-by: Christian Heimes <christian@python.org>
2025-05-07 07:51:06 -07:00
Gregory Shtrasberg
32aa74c09c
[ROCm][FP8][Kernel] FP8 quantization fused into Custom Paged Attention ( #17139 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-05-07 07:12:35 -07:00
Reid
7377dd0307
[doc] update the issue link ( #17782 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
2025-05-07 20:29:05 +08:00
Yong Hoon Shin
98c89e16ff
Make key optional for rotary embedding ( #17566 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
2025-05-07 00:11:46 -07:00
Yong Hoon Shin
324a3119b0
Fix test_memory_usage_no_spec ( #17754 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
2025-05-07 00:10:33 -07:00
Cyrus Leung
8a15c2603a
[Frontend] Add missing chat templates for various MLLMs ( #17758 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-07 00:10:01 -07:00
Satyajith Chilappagari
043e4c4955
Add NeuronxDistributedInference support, Speculative Decoding, Dynamic on-device sampling ( #16357 )
...
Signed-off-by: Satyajith Chilappagari <satchill@amazon.com>
Co-authored-by: Aaron Dou <yzdou@amazon.com>
Co-authored-by: Shashwat Srijan <sssrijan@amazon.com>
Co-authored-by: Chongming Ni <chongmni@amazon.com>
Co-authored-by: Amulya Ballakur <amulyaab@amazon.com>
Co-authored-by: Patrick Lange <patlange@amazon.com>
Co-authored-by: Elaine Zhao <elaineyz@amazon.com>
Co-authored-by: Lin Lin Pan <tailinpa@amazon.com>
Co-authored-by: Navyadhara Gogineni <navyadha@amazon.com>
Co-authored-by: Yishan McNabb <yishanm@amazon.com>
Co-authored-by: Mrinal Shukla <181322398+mrinalks@users.noreply.github.com>
2025-05-07 00:07:30 -07:00
Jee Jee Li
ba7703e659
[Misc] Remove qlora_adapter_name_or_path ( #17699 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-05-06 23:10:37 -07:00
Wanrui Dai
f80ae5bdcf
[Kernel] Use fused rmsnorm for some models like qwen3 series ( #17735 )
...
Signed-off-by: evian <eviantai@u.nus.edu>
Co-authored-by: evian <eviantai@u.nus.edu>
2025-05-06 23:10:02 -07:00
Szymon Ożóg
1a45a61387
[Kernel] GGUF MoeVec kernel ( #16780 )
...
Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com>
Signed-off-by: SzymonOzog <szymon.ozog@gmail.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2025-05-06 23:07:23 -07:00
Isotr0py
c3e9d5060e
[Misc] Use apply_rotary_emb from vllm_flash_attn for Qwen2-VL vision RoPE ( #17726 )
...
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-05-07 04:51:33 +00:00
Jee Jee Li
822de7fb94
[Misc] Split model loader ( #17712 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-05-07 12:42:26 +08:00
Woosuk Kwon
8d84d836d1
[BugFix][Spec Decode] Fix hidden size mismatch between target and eagle head ( #17740 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-05-06 19:51:26 -07:00
Michael Goin
950b71186f
Replace lm-eval bash script with pytest and use enforce_eager for faster CI ( #17717 )
...
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-05-06 18:00:10 -07:00
Michael Goin
e50a1f1a9c
[TPU] Add kernel test for moe_pallas ( #17496 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com>
2025-05-06 17:59:57 -07:00
Michael Goin
a17cef70ea
Removed unused marlin cuda code ( #17684 )
...
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-05-06 17:59:47 -07:00
Chih-Chieh Yang
18dd5e01f2
[Model] Mamba2 causal conv1d Refactor to Split Prefill and Decode Requests for Corresponding Kernels ( #17146 )
...
Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com>
2025-05-06 17:59:30 -07:00
Yang Wang
6de3e13413
Add logging for torch nightly version ( #17669 )
...
Signed-off-by: Yang Wang <elainewy@meta.com>
2025-05-07 00:45:51 +00:00
Hongxia Yang
ed3a1d2106
[ROCm] fix num_stages for default moe config to avoid triton OutOfResource error ( #17744 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
2025-05-07 00:39:48 +00:00
Harry Mellor
022afbeb4e
Fix doc build performance ( #17748 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-05-07 00:36:41 +00:00
Thomas Parnell
2f925e5777
[Kernel] Unified Triton kernel that doesn't distinguish between prefill + decode ( #16828 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-05-06 18:21:48 -04:00
Gregory Shtrasberg
de906b95f9
[Bugfix] Fix for the condition to accept empty encoder inputs for mllama ( #17732 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-05-06 19:59:06 +00:00
d.transposed
d456aea71f
[Misc] Add Next Edit Prediction (NEP) datasets support in benchmark_serving.py ( #16839 )
...
Signed-off-by: dtransposed <damian@damian-ml-machine.europe-west3-b.c.jetbrains-grazie.internal>
Signed-off-by: dtransposed <>
Co-authored-by: dtransposed <damian@damian-ml-machine.europe-west3-b.c.jetbrains-grazie.internal>
2025-05-06 15:38:45 -04:00
Jevin Jiang
621ca2c0ab
[TPU] Increase block size and reset block shapes ( #16458 )
2025-05-06 13:55:04 -04:00
Harry Mellor
6115b11582
Make right sidebar more readable in "Supported Models" ( #17723 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-05-06 16:48:26 +00:00
Cyrus Leung
5b8c390747
[Bugfix] Fix modality limits in vision language example ( #17721 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-06 16:12:28 +00:00
Reid
7525d5f3d5
[doc] Add RAG Integration example ( #17692 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
2025-05-06 16:10:23 +00:00
Chen Zhang
aabcd2cae3
[v1] Introduce KVCacheBlocks as interface between Scheduler and KVCacheManager ( #17479 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-05-06 08:50:34 -07:00
Michael Yao
0d115460a7
[Docs] Use gh-file to add links to tool_calling.md ( #17709 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
2025-05-06 15:27:19 +00:00
Aaron Pham
175bda67a1
[Feat] Add deprecated=True to CLI args ( #17426 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
2025-05-06 08:11:27 -07:00
Chen Zhang
cba31c47c4
[v1] AttentionMetadata for each layer ( #17394 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-05-06 07:58:37 -07:00
Li, Jiang
a6fed02068
[V1][PP] Support PP for MultiprocExecutor ( #14219 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: jiang.li <jiang1.li@intel.com>
2025-05-06 07:58:05 -07:00
Michael Goin
d419aa5dc4
[V1] Enable TPU V1 backend by default ( #17673 )
...
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-05-06 06:49:49 -07:00
Mengqing Cao
f9bc5a0693
[Bugfix] Fix triton import with local TritonPlaceholder ( #17446 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com>
2025-05-06 17:53:09 +08:00
Harry Mellor
05e1f96419
Fix dockerfilegraph pre-commit hook ( #17698 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-05-06 08:56:48 +00:00
Lucas Wilkinson
6eae34533a
[Misc] Fix ScalarType float4 naming ( #17690 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-05-06 01:07:15 -07:00
Cyrus Leung
63ced7b43f
[Doc] Update notes for H2O-VL and Gemma3 ( #17219 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-06 07:51:02 +00:00
Mikhail Podvitskii
dc47ba32f8
[Bugfix] Fixed prompt length for random dataset ( #17408 )
...
Signed-off-by: Mikhail Podvitskii <podvitskiymichael@gmail.com>
2025-05-06 07:00:08 +00:00
Richard Zou
edbf2d609e
[easy] Fix logspam on PiecewiseBackend errors ( #17138 )
...
Signed-off-by: rzou <zou3519@gmail.com>
2025-05-05 23:46:11 -07:00
Stan Wozniak
999328be0d
[Model] Add GraniteMoeHybrid 4.0 model ( #17497 )
...
Signed-off-by: Thomas Ortner <boh@zurich.ibm.com>
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>
Co-authored-by: Thomas Ortner <boh@zurich.ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
2025-05-06 12:00:31 +08:00
Michael Goin
98834fefaa
Update nm to rht in doc links + refine fp8 doc ( #17678 )
...
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-05-06 00:41:14 +00:00
Varun Sundar Rabindranath
90bd2ae172
[Bugfix] LoRA - Retire unused maxnreg LoRA kernel argument ( #17677 )
2025-05-05 17:34:29 -07:00
Nicolò Lucchesi
5941e0b7ea
[TPU][V1] Add support for top-logprobs ( #17072 )
...
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-05-05 14:20:15 -07:00
XiongfeiWei
9765940824
[TPU] Enable gemma3-27b with TP>1 on multi-chips. ( #17335 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
2025-05-05 14:19:58 -07:00
Nick Hill
5ea5c514da
[BugFix] Increase timeout for startup failure test ( #17642 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-05-05 20:53:19 +00:00