Cody Yu | 989f4f430c | 2025-03-04 11:09:34 +08:00
    [Misc] Remove lru_cache in NvmlCudaPlatform (#14156)
    Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>

Divakar Verma | bb5b640359 | 2025-03-04 01:30:23 +00:00
    [core] moe fp8 block quant tuning support (#14068)
    Signed-off-by: Divakar Verma <divakar.verma@amd.com>

Travis Johnson | c060b71408 | 2025-03-04 08:04:52 +08:00
    [Model] Add support for GraniteMoeShared models (#13313)
    Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
    Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

iefgnoix | 79e4937c65 | 2025-03-03 23:00:55 +00:00
    [v1] Add comments to the new ragged paged attention Pallas kernel (#14155)
    Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
    Co-authored-by: Michael Goin <mgoin64@gmail.com>

Qubitium-ModelCloud | cd1d3c3df8 | 2025-03-03 21:59:09 +00:00
    [Docs] Add GPTQModel (#14056)
    Signed-off-by: mgoin <mgoin64@gmail.com>
    Co-authored-by: mgoin <mgoin64@gmail.com>

Michael Goin | 19d98e0c7d | 2025-03-03 16:29:53 -05:00
    [Kernel] Optimize moe intermediate_cache usage (#13625)
    Signed-off-by: mgoin <mgoin64@gmail.com>

Michael Goin | 2b04c209ee | 2025-03-03 14:20:24 -07:00
    [Bugfix] Allow shared_experts skip quantization for DeepSeekV2/V3 (#14100)
    Signed-off-by: mgoin <mgoin64@gmail.com>

Mark McLoughlin | ae122b1cbd | 2025-03-03 19:04:45 +00:00
    [WIP][V1][Metrics] Implement max_num_generation_tokens, request_params_n, and request_params_max_tokens metrics (#14055)
    Signed-off-by: Mark McLoughlin <markmc@redhat.com>

Nick Hill | 872db2be0e | 2025-03-03 10:34:14 -08:00
    [V1] Simplify stats logging (#14082)
    Signed-off-by: Nick Hill <nhill@redhat.com>

Mark McLoughlin | 2dfdfed8a0 | 2025-03-03 18:25:46 +00:00
    [V0][Metrics] Deprecate some KV/prefix cache metrics (#14136)
    Signed-off-by: Mark McLoughlin <markmc@redhat.com>

Mark McLoughlin | c41d27156b | 2025-03-03 17:50:22 +00:00
    [V0][Metrics] Remove unimplemented vllm:tokens_total (#14134)
    Signed-off-by: Mark McLoughlin <markmc@redhat.com>

Harry Mellor | 91373a0d15 | 2025-03-03 17:48:11 +00:00
    Fix head_dim not existing in all model configs (Transformers backend) (#14141)
    Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

TJian | 848a6438ae | 2025-03-03 09:24:45 -08:00
    [ROCm] Faster Custom Paged Attention kernels (#12348)

Harry Mellor | 98175b2816 | 2025-03-03 17:03:05 +00:00
    Improve the docs for TransformersModel (#14147)
    Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

Mark McLoughlin | 4167252eaf | 2025-03-03 08:15:27 -08:00
    [V1] Refactor parallel sampling support (#13774)
    Signed-off-by: Mark McLoughlin <markmc@redhat.com>

Cody Yu | f35f8e2242 | 2025-03-03 16:43:14 +08:00
    [Build] Make sure local main branch is synced when VLLM_USE_PRECOMPILED=1 (#13921)
    Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>

Mengqing Cao | b87c21fc89 | 2025-03-03 15:40:04 +08:00
    [Misc][Platform] Move use allgather to platform (#14010)
    Signed-off-by: Mengqing Cao <cmq0113@163.com>

wang.yuqi | e584b85afd | 2025-03-03 14:10:11 +08:00
    [Misc] duplicate code in deepseek_v2 (#14106)

Sheng Yao | 09e56f9262 | 2025-03-02 17:35:01 -08:00
    [Bugfix] Explicitly include "omp.h" for MacOS to avoid installation failure (#14051)

Harry Mellor | cf069aa8aa | 2025-03-02 17:34:51 -08:00
    Update deprecated Python 3.8 typing (#13971)

Ce Gao | bf33700ecd | 2025-03-02 14:49:42 -05:00
    [v0][structured output] Support reasoning output (#12955)
    Signed-off-by: Ce Gao <cegao@tensorchord.ai>

qux-bbb | bc6ccb9878 | 2025-03-02 10:59:50 +00:00
    [Doc] Source building add clone step (#14086)
    Signed-off-by: qux-bbb <1147635419@qq.com>

Jun Duan | 82fbeae92b | 2025-03-01 17:20:30 -08:00
    [Misc] Accurately capture the time of loading weights (#14063)
    Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>

Jee Jee Li | cc5e8f6db8 | 2025-03-02 09:17:34 +08:00
    [Model] Add LoRA support for TransformersModel (#13770)
    Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

Chen Zhang | d54990da47 | 2025-03-01 20:46:02 +00:00
    [v1] Add __repr__ to KVCacheBlock to avoid recursive print (#14081)

Chen Zhang | b9f1d4294e | 2025-03-01 08:25:54 +00:00
    [v1][Bugfix] Only cache blocks that are not in the prefix cache (#14073)

Sage Moore | b28246f6ff | 2025-03-01 07:18:32 +00:00
    [ROCm][V1][Bugfix] Add get_builder_cls method to the ROCmAttentionBackend class (#14065)
    Signed-off-by: Sage Moore <sage@neuralmagic.com>

Woosuk Kwon | 3b5567a209 | 2025-03-01 07:09:14 +00:00
    [V1][Minor] Do not print attn backend twice (#13985)
    Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

Isotr0py | fdcc405346 | 2025-02-28 22:49:15 -08:00
    [Doc] Consolidate whisper and florence2 examples (#14050)

Kuntai Du | 8994dabc22 | 2025-03-01 06:44:24 +00:00
    [Documentation] Add more deployment guide for Kubernetes deployment (#13841)
    Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
    Signed-off-by: Kuntai Du <kuntai@uchicago.edu>

Li, Jiang | 02296f420d | 2025-02-28 22:31:01 -08:00
    [Bugfix][V1][Minor] Fix shutting_down flag checking in V1 MultiprocExecutor (#14053)

YajieWang | 6a92ff93e1 | 2025-02-28 22:30:59 -08:00
    [Misc][Kernel]: Add GPTQAllSpark Quantization (#12931)

Jee Jee Li | 6a84164add | 2025-03-01 06:10:28 +00:00
    [Bugfix] Add file lock for ModelScope download (#14060)
    Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

Brayden Zhong | f64ffa8c25 | 2025-03-01 05:43:54 +00:00
    [Docs] Add pipeline_parallel_size to optimization docs (#14059)
    Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>

Luka Govedič | bd56c983d6 | 2025-02-28 16:20:11 -07:00
    [torch.compile] Fix RMSNorm + quant fusion in the non-cutlass-fp8 case, rename RedundantReshapesPass to NoopEliminationPass (#10902)
    Signed-off-by: luka <luka@neuralmagic.com>

Rui Qiao | 084bbac8cc | 2025-02-28 21:47:44 +00:00
    [core] Bump ray to 2.43 (#13994)
    Signed-off-by: Rui Qiao <ruisearch42@gmail.com>

Chen Zhang | 28943d36ce | 2025-02-28 20:53:31 +00:00
    [v1] Move block pool operations to a separate class (#13973)
    Signed-off-by: Chen Zhang <zhangch99@outlook.com>
    Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>

Andrey Talman | b526ca6726 | 2025-02-28 12:25:50 -08:00
    Add RELEASE.md (#13926)
    Signed-off-by: atalman <atalman@fb.com>

Chen Zhang | e7bd944e08 | 2025-02-28 19:03:16 +00:00
    [v1] Cleanup the BlockTable in InputBatch (#13977)
    Signed-off-by: Chen Zhang <zhangch99@outlook.com>

iefgnoix | c3b6559a10 | 2025-02-28 11:01:36 -07:00
    [V1][TPU] Integrate the new ragged paged attention kernel with vLLM v1 on TPU (#13379)
    Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
    Signed-off-by: mgoin <mgoin64@gmail.com>
    Co-authored-by: mgoin <mgoin64@gmail.com>

Harry Mellor | 4be4b26cb7 | 2025-02-28 08:56:44 -08:00
    Fix entrypoint tests for embedding models (#14052)

Brayden Zhong | 2aed2c9fa7 | 2025-02-28 16:42:07 +00:00
    [Doc] Fix ROCm documentation (#14041)
    Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>

Yang Liu | 9b61dd41e7 | 2025-02-28 07:36:08 -08:00
    [Bugfix] Initialize attention bias on the same device as Query/Key/Value for QwenVL Series (#14031)

Cyrus Leung | f7bee5c815 | 2025-02-28 07:35:55 -08:00
    [VLM][Bugfix] Enable specifying prompt target via index (#14038)

Jee Jee Li | e0734387fb | 2025-02-28 15:22:42 +00:00
    [Bugfix] Fix MoeWNA16Method activation (#14024)

Harry Mellor | f58f8b5c96 | 2025-02-28 15:20:29 +00:00
    Update AutoAWQ docs (#14042)
    Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

Thibault Schueller | b3f7aaccd0 | 2025-02-28 00:52:25 -08:00
    [V1][Minor] Restore V1 compatibility with LLMEngine class (#13090)

Kacper Pietkun | b91660ddb8 | 2025-02-28 00:51:49 -08:00
    [Hardware][Intel-Gaudi] Regional compilation support (#13213)

Harry Mellor | 76c89fcadd | 2025-02-28 00:50:43 -08:00
    Use smaller embedding model when not testing model specifically (#13891)

Mathis Felardos | b9e41734c5 | 2025-02-28 07:53:45 +00:00
    [Bugfix][Disaggregated] patch the inflight batching on the decode node in SimpleConnector to avoid hangs in SimpleBuffer (nccl based) (#13987)
    Signed-off-by: Mathis Felardos <mathis@mistral.ai>