Woosuk Kwon | dd572c0ab3 | 2025-07-18 21:47:50 -07:00
[V0 Deprecation] Remove V0 Spec Decode workers (#21152)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

Varun Sundar Rabindranath | 9ffe905a41 | 2025-07-18 21:15:03 -07:00
[Bugfix][Model] Fix LoRA for Mistral-Small-3.1-24B-Instruct-2503 (#21183)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

Lucia Fang | 9a9fda1423 | 2025-07-18 20:48:38 -07:00
[Core] Support Local Chunked Attention for Hybrid KV Cache (#19351)
Signed-off-by: Lucia Fang <fanglu@fb.com>
Signed-off-by: Lu Fang <fanglu@meta.com>
Signed-off-by: Lu Fang <fanglu@fb.com>
Co-authored-by: Lu Fang <fanglu@meta.com>

Jee Jee Li | 466e878f2a | 2025-07-18 17:52:02 -07:00
[Quantization] Enable BNB support for more MoE models (#21100)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

Rui Qiao | 217937221b | 2025-07-18 17:46:09 -07:00
Elastic Expert Parallel Initial Support (#20775)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>

hax0r31337 | 5782581acf | 2025-07-18 18:40:18 -04:00
[Bugfix] Voxtral on Blackwell GPUs (RTX 50 series) (#21077)
Signed-off-by: hax0r31337 <liulihaocaiqwq@gmail.com>

JialinOuyang-Meta | 0f199f197b | 2025-07-18 12:34:40 -07:00
[Core] Avoid KVCacheBlock.__eq__ invocations in FreeKVCacheBlockQueue (#21005)
Signed-off-by: Jialin Ouyang <jialino@meta.com>

Richard Zou | b2eb2b5ad7 | 2025-07-18 14:10:21 -04:00
[Kernel] Apply torch.Tag.needs_fixed_stride_order only for torch==2.6.0 (#19346)
Signed-off-by: rzou <zou3519@gmail.com>

Richard Zou | 21274ab476 | 2025-07-18 06:51:12 -07:00
[CI] Update CODEOWNERS for vllm/compilation (#21185)
Signed-off-by: Richard Zou <zou3519@gmail.com>

Thomas Parnell | ed8cbfedf8 | 2025-07-18 05:52:52 -07:00
Let GraniteMoeAttention use YaRN (#21174)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

Cyrus Leung | 45badd05d0 | 2025-07-18 05:41:17 -07:00
[Core] Set pooling params based on task and model (#21128)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

ElizaWszola | 4adc66f64d | 2025-07-18 18:55:52 +08:00
[Bugfix] Allocate less memory in non-batched CUTLASS MoE (#21121)
Signed-off-by: ElizaWszola <ewszola@redhat.com>

Cyrus Leung | 55ad648715 | 2025-07-18 03:55:10 -07:00
[Doc] Fix typo in model name (#21178)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

wang.yuqi | 5895afd780 | 2025-07-18 09:10:47 +00:00
[Bugfix] The special_tokens in tokenizer should also be controlled by do_lower_case in encoder_config. (#20750)
Signed-off-by: wang.yuqi <noooop@126.com>

wang.yuqi | ca4eb82bcb | 2025-07-18 07:15:07 +00:00
[Model] Re-add the implicit conversion feature for as_seq_cls_model (#21103)
Signed-off-by: wang.yuqi <noooop@126.com>

Roger Wang | ba2dfbb0c2 | 2025-07-18 07:13:57 +00:00
[Misc] Make MM embedding merge interface explicit in model runner (#21147)
Signed-off-by: Roger Wang <hey@rogerw.me>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Jialin Ouyang | 1bf65138f6 | 2025-07-18 06:22:08 +00:00
[benchmark] Sending request strictly follows the random intervals (#21108)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>

Woosuk Kwon | 54cf1cae62 | 2025-07-17 21:57:02 -07:00
[Misc] Do not print async output warning for v1 (#21151)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

shixianc | 5780121c95 | 2025-07-18 04:34:43 +00:00
[Perf] Add swap_ab to SM90 FP8 non-block CUTLASS moe grouped gemm (#20911)
Signed-off-by: Shixian Cui <shixian@amazon.com>
Co-authored-by: Shixian Cui <shixian@amazon.com>

Shu Wang | c7d8724e78 | 2025-07-17 21:32:45 -07:00
[Core] FlashInfer CUTLASS fused MoE backend (NVFP4) (#20037)
Signed-off-by: shuw <shuw@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>

22quinn | b38baabcf9 | 2025-07-17 21:12:23 -07:00
[Doc] Add inplace weights loading example (#19640)
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>

Lucas Wilkinson | 89cab4d01f | 2025-07-18 00:10:42 -04:00
[Attention] Make local attention backend agnostic (#21093)

Lucia Fang | b9a21e9173 | 2025-07-17 20:12:13 -07:00
[Docs] Update supported models documentation with missing models (#20844)
Signed-off-by: Lu Fang <fanglu@fb.com>

Ricardo Decal | c4e3b12524 | 2025-07-17 20:09:19 -07:00
[Docs] Add minimal demo of Ray Data API usage (#21080)
Signed-off-by: Ricardo Decal <rdecal@anyscale.com>

elvischenv | 8dfb45ca33 | 2025-07-18 00:35:58 +00:00
[Bugfix] Fix the tensor non-contiguous issue for Flashinfer TRT-LLM backend attention kernel (#21133)

Wentao Ye | 8a8fc94639 | 2025-07-18 00:19:46 +00:00
[Log] Debugging Log with more Information (#20770)
Signed-off-by: yewentao256 <zhyanwentao@126.com>

Woosuk Kwon | 4de7146351 | 2025-07-17 16:37:36 -07:00
[V0 deprecation] Remove V0 HPU backend (#21131)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

Eric Curtin | ac9fb732a5 | 2025-07-17 18:52:17 +00:00
On environments where numa cannot be detected we get 0 (#21115)
Signed-off-by: Eric Curtin <ecurtin@redhat.com>

Jee Jee Li | a3a6c695f4 | 2025-07-17 18:32:52 +00:00
[Misc] Qwen MoE model supports LoRA (#20932)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

Cyrus Leung | 90bd2ab6e3 | 2025-07-17 16:05:40 +00:00
[Model] Update pooling model interface (#21058)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

ElizaWszola | 9fb2d22032 | 2025-07-17 09:56:44 -04:00
[Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE (#20762)
Signed-off-by: ElizaWszola <ewszola@redhat.com>

Harry Mellor | 2d6a38209b | 2025-07-17 06:12:29 -07:00
[Docs] Move code block out of admonition now that it's short (#21118)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

wangxiyuan | 89e3c4e9b4 | 2025-07-17 12:57:41 +00:00
[Misc] Avoid unnecessary import (#21106)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

Harry Mellor | fe8a2c544a | 2025-07-17 04:13:00 -07:00
[Docs] Improve docstring formatting for FusedMoEParallelConfig.make (#21117)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

kYLe | 4ef00b5cac | 2025-07-17 03:07:55 -07:00
[VLM] Add Nemotron-Nano-VL-8B-V1 support (#20349)
Signed-off-by: Kyle Huang <kylhuang@nvidia.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

Asher | 5a7fb3ab9e | 2025-07-17 09:10:09 +00:00
[Model] Add ToolParser and MoE Config for Hunyuan A13B (#20820)
Signed-off-by: Asher Zhang <asherszhang@tencent.com>

Varun Sundar Rabindranath | 11dfdf21bf | 2025-07-17 08:10:37 +00:00
[Kernel] DeepGemm MoE : Integrate triton permute / unpermute kernels (#20903)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>

Chauncey | fdc5b43d20 | 2025-07-17 00:29:09 -07:00
[Bugfix]: Fix final_res_batch list index out of range error (#21055)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

Jee Jee Li | c5b8b5953a | 2025-07-17 05:47:49 +00:00
[Misc] Fix PhiMoE expert mapping (#21085)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

David Ben-David | 4fcef49ec4 | 2025-07-17 13:29:45 +08:00
[V1] [KVConnector] Fix MultiprocExecutor worker output aggregation (#21048)
Signed-off-by: David Ben-David <davidb@pliops.com>
Co-authored-by: David Ben-David <davidb@pliops.com>

Zhonghua Deng | 8a4e5c5f3c | 2025-07-16 22:13:00 -07:00
[V1][P/D]Enhance Performance and code readability for P2pNcclConnector (#20906)
Signed-off-by: Abatom <abzhonghua@gmail.com>

Lucas Wilkinson | 76b494444f | 2025-07-17 04:44:25 +00:00
[Attention] Refactor attention metadata builder interface (#20466)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

Michael Goin | 28a6d5423d | 2025-07-16 19:54:45 -07:00
[Bugfix] Fix Machete zero point issue for GPTQ models on SM90 (#21066)
Signed-off-by: mgoin <mgoin64@gmail.com>

XiongfeiWei | 58760e12b1 | 2025-07-16 19:37:44 -07:00
[TPU] Start using python 3.12 (#21000)
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>

Michael Goin | a50d918225 | 2025-07-16 19:37:13 -07:00
[Docker] Allow FlashInfer to be built in the ARM CUDA Dockerfile (#21013)
Signed-off-by: mgoin <mgoin64@gmail.com>

Kevin_Xiong | c9ba8104ed | 2025-07-16 19:36:36 -07:00
[Bugfix] weight loading use correct tp_group with patch_tensor_parallel_group (#21024)
Signed-off-by: KevinXiong-C <kevin_xiong1997@outlook.com>

Michael Goin | 4e7dfbe7b4 | 2025-07-17 02:30:44 +00:00
Update PyTorch to torch==2.7.1 for CUDA (#21011)
Signed-off-by: mgoin <mgoin64@gmail.com>

QiliangCui | 72ad273582 | 2025-07-17 00:25:26 +00:00
Remove torch_xla.tpu.version() from pallas.py. (#21065)
Signed-off-by: Qiliang Cui <derrhein@gmail.com>

Nir David | 01513a334a | 2025-07-16 15:33:41 -04:00
Support FP8 Quantization and Inference Run on Intel Gaudi (HPU) using INC (Intel Neural Compressor) (#12010)
Signed-off-by: Nir David <ndavid@habana.ai>
Signed-off-by: Uri Livne <ulivne@habana.ai>
Co-authored-by: Uri Livne <ulivne@habana.ai>

Cyrus Leung | ac2bf41e53 | 2025-07-16 19:03:37 +00:00
[Model] Remove model sampler (#21059)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>