Nick Hill
|
5db6b2c961
|
[V1][BugFix] Fix remaining sync engine client shutdown errors/hangs (#13869)
Signed-off-by: Nick Hill <nhill@redhat.com>
|
2025-03-04 15:06:47 +00:00 |
|
Michael Goin
|
6247bae6c6
|
[Bugfix] Restrict MacOS CPU detection (#14210)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2025-03-04 22:25:27 +08:00 |
|
youkaichao
|
3610fb4930
|
[doc] add "Failed to infer device type" to faq (#14200)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2025-03-04 20:47:06 +08:00 |
|
youkaichao
|
71c4b40562
|
[sleep mode] error out with expandable_segments (#14189)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2025-03-04 18:54:19 +08:00 |
|
youkaichao
|
ac65bc92df
|
[platform] add debug logging during inferring the device type (#14195)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2025-03-04 18:39:16 +08:00 |
|
Michael Goin
|
f78c0be80a
|
Fix benchmark_moe.py tuning for CUDA devices (#14164)
|
2025-03-03 21:11:03 -08:00 |
|
Zhanwen Chen
|
66233af7b6
|
Use math.prod instead of np.prod for trivial ops (#14142)
|
2025-03-03 21:09:22 -08:00 |
|
Rui Qiao
|
bf13d40972
|
[core] Pass all driver env vars to ray workers unless excluded (#14099)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
|
2025-03-04 11:44:17 +08:00 |
|
Cody Yu
|
989f4f430c
|
[Misc] Remove lru_cache in NvmlCudaPlatform (#14156)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
|
2025-03-04 11:09:34 +08:00 |
|
Divakar Verma
|
bb5b640359
|
[core] moe fp8 block quant tuning support (#14068)
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
|
2025-03-04 01:30:23 +00:00 |
|
Travis Johnson
|
c060b71408
|
[Model] Add support for GraniteMoeShared models (#13313)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
|
2025-03-04 08:04:52 +08:00 |
|
iefgnoix
|
79e4937c65
|
[v1] Add comments to the new ragged paged attention Pallas kernel (#14155)
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
|
2025-03-03 23:00:55 +00:00 |
|
Qubitium-ModelCloud
|
cd1d3c3df8
|
[Docs] Add GPTQModel (#14056)
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
|
2025-03-03 21:59:09 +00:00 |
|
Michael Goin
|
19d98e0c7d
|
[Kernel] Optimize moe intermediate_cache usage (#13625)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2025-03-03 16:29:53 -05:00 |
|
Michael Goin
|
2b04c209ee
|
[Bugfix] Allow shared_experts skip quantization for DeepSeekV2/V3 (#14100)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2025-03-03 14:20:24 -07:00 |
|
Mark McLoughlin
|
ae122b1cbd
|
[WIP][V1][Metrics] Implement max_num_generation_tokens, request_params_n, and request_params_max_tokens metrics (#14055)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
|
2025-03-03 19:04:45 +00:00 |
|
Nick Hill
|
872db2be0e
|
[V1] Simplify stats logging (#14082)
Signed-off-by: Nick Hill <nhill@redhat.com>
|
2025-03-03 10:34:14 -08:00 |
|
Mark McLoughlin
|
2dfdfed8a0
|
[V0][Metrics] Deprecate some KV/prefix cache metrics (#14136)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
|
2025-03-03 18:25:46 +00:00 |
|
Mark McLoughlin
|
c41d27156b
|
[V0][Metrics] Remove unimplemented vllm:tokens_total (#14134)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
|
2025-03-03 17:50:22 +00:00 |
|
Harry Mellor
|
91373a0d15
|
Fix head_dim not existing in all model configs (Transformers backend) (#14141)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-03-03 17:48:11 +00:00 |
|
TJian
|
848a6438ae
|
[ROCm] Faster Custom Paged Attention kernels (#12348)
|
2025-03-03 09:24:45 -08:00 |
|
Harry Mellor
|
98175b2816
|
Improve the docs for TransformersModel (#14147)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-03-03 17:03:05 +00:00 |
|
Mark McLoughlin
|
4167252eaf
|
[V1] Refactor parallel sampling support (#13774)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
|
2025-03-03 08:15:27 -08:00 |
|
Cody Yu
|
f35f8e2242
|
[Build] Make sure local main branch is synced when VLLM_USE_PRECOMPILED=1 (#13921)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
|
2025-03-03 16:43:14 +08:00 |
|
Mengqing Cao
|
b87c21fc89
|
[Misc][Platform] Move use allgather to platform (#14010)
Signed-off-by: Mengqing Cao <cmq0113@163.com>
|
2025-03-03 15:40:04 +08:00 |
|
wang.yuqi
|
e584b85afd
|
[Misc] duplicate code in deepseek_v2 (#14106)
|
2025-03-03 14:10:11 +08:00 |
|
Sheng Yao
|
09e56f9262
|
[Bugfix] Explicitly include "omp.h" for MacOS to avoid installation failure (#14051)
|
2025-03-02 17:35:01 -08:00 |
|
Harry Mellor
|
cf069aa8aa
|
Update deprecated Python 3.8 typing (#13971)
|
2025-03-02 17:34:51 -08:00 |
|
Ce Gao
|
bf33700ecd
|
[v0][structured output] Support reasoning output (#12955)
Signed-off-by: Ce Gao <cegao@tensorchord.ai>
|
2025-03-02 14:49:42 -05:00 |
|
qux-bbb
|
bc6ccb9878
|
[Doc] Source building add clone step (#14086)
Signed-off-by: qux-bbb <1147635419@qq.com>
|
2025-03-02 10:59:50 +00:00 |
|
Jun Duan
|
82fbeae92b
|
[Misc] Accurately capture the time of loading weights (#14063)
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
|
2025-03-01 17:20:30 -08:00 |
|
Jee Jee Li
|
cc5e8f6db8
|
[Model] Add LoRA support for TransformersModel (#13770)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
|
2025-03-02 09:17:34 +08:00 |
|
Chen Zhang
|
d54990da47
|
[v1] Add __repr__ to KVCacheBlock to avoid recursive print (#14081)
|
2025-03-01 20:46:02 +00:00 |
|
Chen Zhang
|
b9f1d4294e
|
[v1][Bugfix] Only cache blocks that are not in the prefix cache (#14073)
|
2025-03-01 08:25:54 +00:00 |
|
Sage Moore
|
b28246f6ff
|
[ROCm][V1][Bugfix] Add get_builder_cls method to the ROCmAttentionBackend class (#14065)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
|
2025-03-01 07:18:32 +00:00 |
|
Woosuk Kwon
|
3b5567a209
|
[V1][Minor] Do not print attn backend twice (#13985)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-03-01 07:09:14 +00:00 |
|
Isotr0py
|
fdcc405346
|
[Doc] Consolidate whisper and florence2 examples (#14050)
|
2025-02-28 22:49:15 -08:00 |
|
Kuntai Du
|
8994dabc22
|
[Documentation] Add more deployment guide for Kubernetes deployment (#13841)
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: Kuntai Du <kuntai@uchicago.edu>
|
2025-03-01 06:44:24 +00:00 |
|
Li, Jiang
|
02296f420d
|
[Bugfix][V1][Minor] Fix shutting_down flag checking in V1 MultiprocExecutor (#14053)
|
2025-02-28 22:31:01 -08:00 |
|
YajieWang
|
6a92ff93e1
|
[Misc][Kernel]: Add GPTQAllSpark Quantization (#12931)
|
2025-02-28 22:30:59 -08:00 |
|
Jee Jee Li
|
6a84164add
|
[Bugfix] Add file lock for ModelScope download (#14060)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
|
2025-03-01 06:10:28 +00:00 |
|
Brayden Zhong
|
f64ffa8c25
|
[Docs] Add pipeline_parallel_size to optimization docs (#14059)
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
|
2025-03-01 05:43:54 +00:00 |
|
Luka Govedič
|
bd56c983d6
|
[torch.compile] Fix RMSNorm + quant fusion in the non-cutlass-fp8 case, rename RedundantReshapesPass to NoopEliminationPass (#10902)
Signed-off-by: luka <luka@neuralmagic.com>
|
2025-02-28 16:20:11 -07:00 |
|
Rui Qiao
|
084bbac8cc
|
[core] Bump ray to 2.43 (#13994)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
|
2025-02-28 21:47:44 +00:00 |
|
Chen Zhang
|
28943d36ce
|
[v1] Move block pool operations to a separate class (#13973)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
|
2025-02-28 20:53:31 +00:00 |
|
Andrey Talman
|
b526ca6726
|
Add RELEASE.md (#13926)
Signed-off-by: atalman <atalman@fb.com>
|
2025-02-28 12:25:50 -08:00 |
|
Chen Zhang
|
e7bd944e08
|
[v1] Cleanup the BlockTable in InputBatch (#13977)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
|
2025-02-28 19:03:16 +00:00 |
|
iefgnoix
|
c3b6559a10
|
[V1][TPU] Integrate the new ragged paged attention kernel with vLLM v1 on TPU (#13379)
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
|
2025-02-28 11:01:36 -07:00 |
|
Harry Mellor
|
4be4b26cb7
|
Fix entrypoint tests for embedding models (#14052)
|
2025-02-28 08:56:44 -08:00 |
|
Brayden Zhong
|
2aed2c9fa7
|
[Doc] Fix ROCm documentation (#14041)
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
|
2025-02-28 16:42:07 +00:00 |
|