6026 Commits

Author SHA1 Message Date
youkaichao
124776ebd5
[ci] skip failed tests for flashinfer (#13352)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-16 22:09:15 +08:00
Roger Wang
b7d309860e
[V1] Update doc and examples for H2O-VL (#13349)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-02-16 10:35:54 +00:00
wchen61
dc0f7ccf8b
[BugFix] Enhance test_pos_encoding to support execution on multiple devices (#13187)
Signed-off-by: wchen61 <wchen61@foxmail.com>
2025-02-16 08:59:49 +00:00
Michael Goin
d3d547e057
[Bugfix] Pin xgrammar to 0.1.11 (#13338) 2025-02-15 19:42:25 -08:00
Kyle Sayers
12913d17ba
[Quant] Add SupportsQuant to phi3 and clip (#13104) 2025-02-15 19:28:33 -08:00
Lily Liu
80f63a3966
[V1][Spec Decode] Ngram Spec Decode (#12193)
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
2025-02-15 18:05:11 -08:00
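The n-gram speculative decoding change above (#12193) drafts tokens by matching recent context against earlier text instead of running a separate draft model. A minimal sketch of the general idea only, not vLLM's actual implementation; the function name, window size n, and draft length k are illustrative:

```python
from typing import List


def propose_ngram_draft(token_ids: List[int], n: int = 3, k: int = 5) -> List[int]:
    """Propose up to k draft tokens by n-gram lookup in the existing context.

    If the last n tokens also appeared earlier in the sequence, speculate that
    the tokens which followed that earlier occurrence will repeat.
    """
    if len(token_ids) < n + 1:
        return []
    suffix = token_ids[-n:]
    # Scan from the most recent earlier occurrence backwards.
    for start in range(len(token_ids) - n - 1, -1, -1):
        if token_ids[start:start + n] == suffix:
            draft = token_ids[start + n:start + n + k]
            if draft:
                return draft
    return []  # No match: fall back to normal decoding for this step.


# Example: the suffix [5, 9] occurred earlier, so [2, 7, ...] is proposed.
print(propose_ngram_draft([5, 9, 2, 7, 5, 9], n=2))
```

The drafted tokens are then verified by the target model in a single forward pass, which is where the speedup comes from.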
Cyrus Leung
367cb8ce8c
[Doc] [2/N] Add Fuyu E2E example for multimodal processor (#13331) 2025-02-15 07:06:23 -08:00
youkaichao
54ed913f34
[ci/build] update flashinfer (#13323) 2025-02-15 05:33:13 -08:00
Cody Yu
9206b3d7ec
[V1][PP] Run engine busy loop with batch queue (#13064) 2025-02-15 03:59:01 -08:00
rasmith
ed0de3e4b8
[AMD] [Model] DeepSeek tunings (#13199) 2025-02-15 03:58:09 -08:00
Mark McLoughlin
2ad1bc7afe
[V1][Metrics] Add iteration_tokens_total histogram from V0 (#13288) 2025-02-15 03:56:19 -08:00
Isotr0py
7fdaaf48ef
[Bugfix] Fix qwen2.5-vl image processor (#13286) 2025-02-15 03:00:11 -08:00
Xu Song
067fa2255b
[Bugfix] Fix search start_index of stop_checker (#13280) 2025-02-14 21:39:42 -08:00
Nick Hill
9076325677
[BugFix] Don't scan entire cache dir when loading model (#13302) 2025-02-14 21:33:31 -08:00
Tyler Michael Smith
97a3d6d995
[Bugfix] Massage MLA's usage of flash attn for ROCm (#13310) 2025-02-14 21:33:25 -08:00
Nicolò Lucchesi
579d7a63b2
[Bugfix][Docs] Fix offline Whisper (#13274) 2025-02-14 21:32:37 -08:00
Sage Moore
c9f9d5b397
[Bugfix][AMD] Update torch_bindings so that scaled_fp4_quant isn't built on ROCm (#13235) 2025-02-14 20:30:42 -08:00
Woosuk Kwon
0c73026844
[V1][PP] Fix memory profiling in PP (#13315)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-14 20:17:25 -08:00
Nick Hill
6a854c7a2b
[V1][Sampler] Don't apply temp for greedy-only (#13311)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-02-14 18:10:53 -08:00
Woosuk Kwon
e7eea5a520
[V1][CI] Fix failed v1-test because of min_p (#13316)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-14 17:29:51 -08:00
Aoyu
a12934d3ec
[V1][Core] min_p sampling support (#13191)
Signed-off-by: Aoyu <aoyuzhan@amazon.com>
Co-authored-by: Aoyu <aoyuzhan@amazon.com>
2025-02-14 15:50:05 -08:00
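min_p sampling (#13191 above) filters out tokens whose probability falls below a fraction of the most likely token's probability. A hedged, standalone sketch of the idea in PyTorch; names and shapes are illustrative and this is not vLLM's internal sampler code:

```python
import torch


def apply_min_p(logits: torch.Tensor, min_p: float) -> torch.Tensor:
    """Mask tokens whose probability is below min_p * (max probability).

    logits: [batch, vocab]. Returns masked logits ready for sampling.
    """
    probs = torch.softmax(logits, dim=-1)
    top_prob, _ = probs.max(dim=-1, keepdim=True)
    # Keep a token only if its probability clears the scaled threshold.
    keep = probs >= min_p * top_prob
    return logits.masked_fill(~keep, float("-inf"))


logits = torch.randn(1, 32000)
filtered = apply_min_p(logits, min_p=0.1)
next_token = torch.multinomial(torch.softmax(filtered, dim=-1), num_samples=1)
```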
Joe Runde
3bcb8c75da
[Core] Reduce TTFT with concurrent partial prefills (#10235)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2025-02-14 15:36:07 -08:00
Michael Goin
5e5c8e091e
[Quant][Perf] Use moe_wna16 kernel by default for MoEs with many experts (#13236)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-02-14 12:53:42 -08:00
Yu-Zhou
c9e2d644e7
[Hardware][Gaudi][Bugfix] Fix error for guided decoding (#12317) 2025-02-14 04:36:49 -08:00
Russell Bryant
7734e9a291
[Core] choice-based structured output with xgrammar (#12632) 2025-02-14 04:36:05 -08:00
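Choice-based structured output (#12632 above) constrains generation to one of a fixed set of strings, enforced here via xgrammar. A hedged request sketch against vLLM's OpenAI-compatible server; guided_choice is vLLM's extension field, and the base URL and model name are placeholders:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user",
               "content": "Is this review positive or negative? 'Great food!'"}],
    # vLLM-specific extension: restrict the output to one of these choices.
    extra_body={"guided_choice": ["positive", "negative"]},
)
print(completion.choices[0].message.content)
```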
Lu Fang
6224a9f620
Support logit_bias in v1 Sampler (#13079) 2025-02-14 04:34:59 -08:00
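logit_bias (#13079 above) follows the OpenAI convention of adding a per-token-id offset to the logits before sampling, so specific tokens can be encouraged or effectively banned. A minimal illustration of the mechanism, not the actual V1 Sampler code:

```python
import torch


def apply_logit_bias(logits: torch.Tensor, logit_bias: dict) -> torch.Tensor:
    """Add a bias to selected token ids; large negative values ban tokens.

    logits: [vocab]. logit_bias maps token id -> additive bias.
    """
    for token_id, bias in logit_bias.items():
        logits[token_id] += bias
    return logits


logits = torch.randn(32000)
# Strongly discourage token 50256 and nudge token 11 upward.
logits = apply_logit_bias(logits, {50256: -100.0, 11: 2.0})
```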
Nick Hill
085b7b2d6c
[V1] Simplify GPUModelRunner._update_states check (#13265) 2025-02-14 04:33:43 -08:00
Cyrus Leung
4da1f667e9
[VLM] Keep track of whether prompt replacements have been applied (#13215) 2025-02-14 04:20:46 -08:00
Jun Duan
556ef7f714
[Misc] Log time consumption of sleep and wake-up (#13115)
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
2025-02-14 20:10:21 +08:00
Xu Song
83481ceb49
[Bugfix] Fix missing parentheses (#13263) 2025-02-14 01:07:10 -08:00
Pooya Davoodi
185cc19f92
[Frontend] Optionally remove memory buffer used for uploading to URLs in run_batch (#12927)
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>
2025-02-14 08:22:42 +00:00
Alexander Matveev
45f90bcbba
[WIP] TPU V1 Support Refactored (#13049) 2025-02-14 00:21:53 -08:00
Kero Liang
b0ccfc565a
[Bugfix][V1] GPUModelRunner._update_states should return True when there is a finished request in the batch (#13126) 2025-02-13 22:39:20 -08:00
Sage Moore
ba59b78a9c
[ROCm][V1] Add initial ROCm support to V1 (#12790) 2025-02-13 22:21:50 -08:00
Varun Sundar Rabindranath
cbc40128eb
[V1] LoRA - Enable Serving Usecase (#12883)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-02-14 14:21:12 +08:00
Michael Goin
f0b2da72a8
Expand MLA to support most types of quantization (#13181) 2025-02-13 22:19:22 -08:00
Harry Mellor
f2b20fe491
Consolidate Llama model usage in tests (#13094) 2025-02-13 22:18:03 -08:00
Wang Ran (汪然)
40932d7a05
[Misc] Remove redundant statements in scheduler.py (#13229) 2025-02-13 22:07:25 -08:00
XiaobingZhang
84683fa271
[Bugfix] Offline example of disaggregated prefill (#13214) 2025-02-13 20:20:47 -08:00
Tyler Michael Smith
067678262a
[Bugfix][CI] Inherit codespell settings from pyproject.toml in the pre-commit-config (#13237) 2025-02-13 20:19:43 -08:00
Tyler Michael Smith
09545c0a94
[Bugfix/CI] Turn test_compressed_tensors_2of4_sparse back on (#13250) 2025-02-13 20:19:25 -08:00
Roger Wang
dd5ede4440
[V1] Consolidate MM cache size to vllm.envs (#13239) 2025-02-13 20:19:03 -08:00
Jinzhen Lin
8c32b08a86
[Kernel] Fix awq error when n is not divisible by 128 (#13227) 2025-02-13 20:07:05 -08:00
Gregory Shtrasberg
410886950a
[ROCm] Avoid using the default stream on ROCm (#13238)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-02-14 09:29:26 +08:00
Harry Mellor
e38be640e6
Revert "Add label if pre-commit passes" (#13242) 2025-02-13 16:12:32 -08:00
Tyler Michael Smith
c1e37bf71b
[Kernel][Bugfix] Refactor and Fix CUTLASS 2:4 Sparse Kernels (#13198)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-14 00:01:14 +00:00
Michael Goin
2344192a55
Optimize moe_align_block_size for deepseek_v3 (#12850)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-02-13 18:43:37 -05:00
Harry Mellor
bffddd9a05
Add label if pre-commit passes (#12527)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-02-13 20:51:30 +00:00
Nicolò Lucchesi
d84cef76eb
[Frontend] Add /v1/audio/transcriptions OpenAI API endpoint (#12909) 2025-02-13 07:23:45 -08:00
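The /v1/audio/transcriptions endpoint (#12909 above) mirrors OpenAI's Audio API, so an OpenAI-compatible client can point at a vLLM server. A hedged usage sketch; the base URL, model name, and audio file are placeholders and assume a Whisper-style model is being served:

```python
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server running locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",  # placeholder model name
        file=audio_file,
    )
print(transcription.text)
```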
Vaibhav Jain
37dfa60037
[Bugfix] Missing Content Type returns 500 Internal Server Error (#13193) 2025-02-13 06:52:22 -08:00