Nick Hill
|
6a854c7a2b
|
[V1][Sampler] Don't apply temp for greedy-only (#13311)
Signed-off-by: Nick Hill <nhill@redhat.com>
|
2025-02-14 18:10:53 -08:00 |
|
Woosuk Kwon
|
e7eea5a520
|
[V1][CI] Fix failed v1-test because of min_p (#13316)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-02-14 17:29:51 -08:00 |
|
Aoyu
|
a12934d3ec
|
[V1][Core] min_p sampling support (#13191)
Signed-off-by: Aoyu <aoyuzhan@amazon.com>
Co-authored-by: Aoyu <aoyuzhan@amazon.com>
|
2025-02-14 15:50:05 -08:00 |
|
Joe Runde
|
3bcb8c75da
|
[Core] Reduce TTFT with concurrent partial prefills (#10235)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
|
2025-02-14 15:36:07 -08:00 |
|
Michael Goin
|
5e5c8e091e
|
[Quant][Perf] Use moe_wna16 kernel by default for MoEs with many experts (#13236)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2025-02-14 12:53:42 -08:00 |
|
Yu-Zhou
|
c9e2d644e7
|
[Hardware][Gaudi][Bugfix] Fix error for guided decoding (#12317)
|
2025-02-14 04:36:49 -08:00 |
|
Russell Bryant
|
7734e9a291
|
[Core] choice-based structured output with xgrammar (#12632)
|
2025-02-14 04:36:05 -08:00 |
|
Lu Fang
|
6224a9f620
|
Support logit_bias in v1 Sampler (#13079)
|
2025-02-14 04:34:59 -08:00 |
|
Nick Hill
|
085b7b2d6c
|
[V1] Simplify GPUModelRunner._update_states check (#13265)
|
2025-02-14 04:33:43 -08:00 |
|
Cyrus Leung
|
4da1f667e9
|
[VLM] Keep track of whether prompt replacements have been applied (#13215)
|
2025-02-14 04:20:46 -08:00 |
|
Jun Duan
|
556ef7f714
|
[Misc] Log time consumption of sleep and wake-up (#13115)
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
|
2025-02-14 20:10:21 +08:00 |
|
Xu Song
|
83481ceb49
|
[Bugfix] Fix missing parentheses (#13263)
|
2025-02-14 01:07:10 -08:00 |
|
Pooya Davoodi
|
185cc19f92
|
[Frontend] Optionally remove memory buffer used for uploading to URLs in run_batch (#12927)
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>
|
2025-02-14 08:22:42 +00:00 |
|
Alexander Matveev
|
45f90bcbba
|
[WIP] TPU V1 Support Refactored (#13049)
|
2025-02-14 00:21:53 -08:00 |
|
Kero Liang
|
b0ccfc565a
|
[Bugfix][V1] GPUModelRunner._update_states should return True when there is a finished request in batch (#13126)
|
2025-02-13 22:39:20 -08:00 |
|
Sage Moore
|
ba59b78a9c
|
[ROCm][V1] Add initial ROCm support to V1 (#12790)
|
2025-02-13 22:21:50 -08:00 |
|
Varun Sundar Rabindranath
|
cbc40128eb
|
[V1] LoRA - Enable Serving Usecase (#12883)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
|
2025-02-14 14:21:12 +08:00 |
|
Michael Goin
|
f0b2da72a8
|
Expand MLA to support most types of quantization (#13181)
|
2025-02-13 22:19:22 -08:00 |
|
Harry Mellor
|
f2b20fe491
|
Consolidate Llama model usage in tests (#13094)
|
2025-02-13 22:18:03 -08:00 |
|
Wang Ran (汪然)
|
40932d7a05
|
[Misc] Remove redundant statements in scheduler.py (#13229)
|
2025-02-13 22:07:25 -08:00 |
|
XiaobingZhang
|
84683fa271
|
[Bugfix] Offline example of disaggregated prefill (#13214)
|
2025-02-13 20:20:47 -08:00 |
|
Tyler Michael Smith
|
067678262a
|
[Bugfix][CI] Inherit codespell settings from pyproject.toml in the pre-commit-config (#13237)
|
2025-02-13 20:19:43 -08:00 |
|
Tyler Michael Smith
|
09545c0a94
|
[Bugfix/CI] Turn test_compressed_tensors_2of4_sparse back on (#13250)
|
2025-02-13 20:19:25 -08:00 |
|
Roger Wang
|
dd5ede4440
|
[V1] Consolidate MM cache size to vllm.envs (#13239)
|
2025-02-13 20:19:03 -08:00 |
|
Jinzhen Lin
|
8c32b08a86
|
[Kernel] Fix awq error when n is not divisible by 128 (#13227)
|
2025-02-13 20:07:05 -08:00 |
|
Gregory Shtrasberg
|
410886950a
|
[ROCm] Avoid using the default stream on ROCm (#13238)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
|
2025-02-14 09:29:26 +08:00 |
|
Harry Mellor
|
e38be640e6
|
Revert "Add label if pre-commit passes" (#13242)
|
2025-02-13 16:12:32 -08:00 |
|
Tyler Michael Smith
|
c1e37bf71b
|
[Kernel][Bugfix] Refactor and Fix CUTLASS 2:4 Sparse Kernels (#13198)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
|
2025-02-14 00:01:14 +00:00 |
|
Michael Goin
|
2344192a55
|
Optimize moe_align_block_size for deepseek_v3 (#12850)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2025-02-13 18:43:37 -05:00 |
|
Harry Mellor
|
bffddd9a05
|
Add label if pre-commit passes (#12527)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-02-13 20:51:30 +00:00 |
|
Nicolò Lucchesi
|
d84cef76eb
|
[Frontend] Add /v1/audio/transcriptions OpenAI API endpoint (#12909)
|
2025-02-13 07:23:45 -08:00 |
|
Vaibhav Jain
|
37dfa60037
|
[Bugfix] Missing Content Type returns 500 Internal Server Error (#13193)
|
2025-02-13 06:52:22 -08:00 |
|
Cyrus Leung
|
1bc3b5e71b
|
[VLM] Separate text-only and vision variants of the same model architecture (#13157)
|
2025-02-13 06:19:15 -08:00 |
|
燃
|
02ed8a1fbe
|
[Misc] Qwen2.5-VL Optimization (#13155)
|
2025-02-13 06:17:57 -08:00 |
|
Aoyu
|
2092a6fa7d
|
[V1][Core] Add worker_base for v1 worker (#12816)
Signed-off-by: Aoyu <aoyuzhan@amazon.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Aoyu <aoyuzhan@amazon.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
|
2025-02-13 20:35:18 +08:00 |
|
Cyrus Leung
|
c9d3ecf016
|
[VLM] Merged multi-modal processor for Molmo (#12966)
|
2025-02-13 04:34:00 -08:00 |
|
Roger Wang
|
fdcf64d3c6
|
[V1] Clarify input processing and multimodal feature caching logic (#13211)
|
2025-02-13 03:43:24 -08:00 |
|
Russell Bryant
|
578087e56c
|
[Frontend] Pass pre-created socket to uvicorn (#13113)
|
2025-02-13 00:51:46 -08:00 |
|
Isotr0py
|
fa253f1a70
|
[VLM] Remove input processor from clip and siglip (#13165)
|
2025-02-13 00:31:37 -08:00 |
|
Rui Qiao
|
9605c1256e
|
[V1][core] Implement pipeline parallel on Ray (#12996)
|
2025-02-13 08:02:46 +00:00 |
|
Russell Bryant
|
0ccd8769fb
|
[CI/Build] Allow ruff to auto-fix some issues (#13180)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
|
2025-02-13 07:45:38 +00:00 |
|
Daniel Han
|
cb944d5818
|
Allow Unsloth Dynamic 4bit BnB quants to work (#12974)
|
2025-02-12 23:13:08 -08:00 |
|
Russell Bryant
|
d46d490c27
|
[Frontend] Move CLI code into vllm.cmd package (#12971)
|
2025-02-12 23:12:21 -08:00 |
|
LikeSundayLikeRain
|
04f50ad9d1
|
[Bugfix] deepseek_r1_reasoning_parser put reason content in wrong field in certain edge case (#13097)
|
2025-02-12 23:11:26 -08:00 |
|
Cody Yu
|
60c68df6d1
|
[Build] Automatically use the wheel of the base commit with Python-only build (#13178)
|
2025-02-12 23:10:28 -08:00 |
|
Lu Fang
|
009439caeb
|
Simplify logic of locating CUDART so file path (#13203)
Signed-off-by: Lu Fang <lufang@fb.com>
|
2025-02-13 13:52:41 +08:00 |
|
Isotr0py
|
bc55d13070
|
[VLM] Implement merged multimodal processor for Mllama (#11427)
|
2025-02-12 20:26:21 -08:00 |
|
Michael Goin
|
d88c8666a1
|
[Bugfix][Example] Fix GCed profiling server for TPU (#12792)
Signed-off-by: mgoin <michael@neuralmagic.com>
|
2025-02-13 11:52:11 +08:00 |
|
Kaixi Hou
|
4fc5c23bb6
|
[NVIDIA] Support nvfp4 quantization (#12784)
|
2025-02-12 19:51:51 -08:00 |
|
Kevin H. Luu
|
9f9704dca6
|
[perf-benchmark] cleanup unused Docker images and volumes in H100 benchmark instance (#12706)
|
2025-02-12 19:51:33 -08:00 |
|