989 Commits

Author SHA1 Message Date
Benjamin Chislett
e858bfe051
[Cleanup] Refactor profiling env vars into a CLI config (#29912)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-09 13:29:33 -05:00
Cyrus Leung
e83b7e379c
Revert "[Renderer] Separate out RendererConfig from ModelConfig (#30145)" (#30199)
2025-12-07 00:00:22 -08:00
Cyrus Leung
27f4c2fd46
[Renderer] Separate out RendererConfig from ModelConfig (#30145)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-06 23:15:42 -08:00
Nick Hill
4026ae31e9
[Misc] Move disable_nccl_for_dp_synchronization init logic into VllmConfig (#30161)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-12-05 20:59:04 -08:00
Tova Movshovitz
adb315060c
[KVConnector][Feature] Support KV connector cache reset via /reset_prefix_cache (#27170)
Signed-off-by: tovam <tovam@pliops.com>
Signed-off-by: Tova Movshovitz <tovam@pliops.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-05 18:33:26 +00:00
Matthew Bonanni
66e674cdd5
[Attention][UX][1/N] Add AttentionConfig and change attention env vars to CLI arguments (#26315)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
2025-12-05 09:48:43 -08:00
Max Hu
c2894d3883
[Feature] Add Layer-wise NVTX Support (#29990)
Signed-off-by: Max Hu <hyoung2991@gmail.com>
Signed-off-by: Max Hu <maxhu@nvidia.com>
Co-authored-by: Max Hu <maxhu@nvidia.com>
2025-12-05 11:20:07 +00:00
Yong Hoon Shin
69520bc695
Add logging for cudagraph related info (#29825)
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
2025-12-03 01:01:48 -08:00
Sage Moore
e6f114ac25
[Bugfix][EPLB] Prevent user-provided EPLB config from being overwritten with defaults (#29911)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-12-02 13:20:22 -09:00
Isotr0py
63b1da76ba
[Chore]: Reorganize gguf utils functions under transformers_utils (#29891)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-12-02 17:33:23 +00:00
Zhuohan Li
d0cd728907
[Core] Support resetting all running requests' KV while calling reset_prefix_cache (#28827)
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-12-02 02:25:05 +00:00
shivampr
cabc77cc86
[Core][Observability] Add KV cache residency metrics (#27793)
Introduces three new Prometheus histograms for fine-grained observability of KV cache residency behavior:

vllm:kv_block_lifetime_seconds — total lifetime from allocation to free
vllm:kv_block_idle_before_evict_seconds — idle duration before eviction
vllm:kv_block_reuse_gap_seconds — time between consecutive reuses of the same block

These metrics help operators analyze KV cache efficiency, reuse patterns, and eviction timing beyond simple utilization rates.

The implementation uses monotonic timestamps for accuracy and 1% sampling to keep overhead low (~48 bytes/block). It is fully thread-safe and has zero runtime cost when disabled.

Two new runtime flags are introduced:

--kv-cache-metrics – enable KV cache residency metrics
--kv-cache-metrics-sample – control sampling ratio (default: 0.01)
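
As a rough illustration of the approach described above (monotonic timestamps, sampling at allocation time, and a lock for thread safety), here is a minimal Python sketch. It is not the code from this commit: the class name BlockResidencyTracker, its methods, and the plain lists standing in for the three Prometheus histograms are all hypothetical, and the sample_rate default of 0.01 simply mirrors the documented default of --kv-cache-metrics-sample.

```python
import random
import threading
import time


class BlockResidencyTracker:
    """Hypothetical sketch of sampled KV block residency tracking.

    Not vLLM's implementation: names and structure are assumptions.
    The three lists stand in for the three histograms in the commit message.
    """

    def __init__(self, sample_rate=0.01, clock=time.monotonic, rng=random.random):
        self.sample_rate = sample_rate
        self._clock = clock  # monotonic clock, immune to wall-clock jumps
        self._rng = rng
        self._lock = threading.Lock()
        self._tracked = {}  # block_id -> [alloc_time, last_access_time]
        self.lifetimes = []          # allocation -> free
        self.idle_before_evict = []  # last access -> eviction
        self.reuse_gaps = []         # gap between consecutive reuses

    def on_alloc(self, block_id):
        # Sample at allocation time; unsampled blocks cost nothing afterwards.
        if self._rng() >= self.sample_rate:
            return
        now = self._clock()
        with self._lock:
            self._tracked[block_id] = [now, now]

    def on_access(self, block_id):
        now = self._clock()
        with self._lock:
            entry = self._tracked.get(block_id)
            if entry is None:
                return  # block was not sampled
            self.reuse_gaps.append(now - entry[1])
            entry[1] = now

    def on_evict(self, block_id):
        now = self._clock()
        with self._lock:
            entry = self._tracked.pop(block_id, None)
            if entry is None:
                return  # block was not sampled
            self.lifetimes.append(now - entry[0])
            self.idle_before_evict.append(now - entry[1])
```

In this sketch the clock and random source are injectable so the behaviour can be checked deterministically; a real integration would presumably observe the values into Prometheus histograms rather than append to lists.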

Signed-off-by: Shivam <shivamprasad91@gmail.com>
2025-12-01 18:27:53 +00:00
Cyrus Leung
f0a28bf661
[Misc] Unify tokenizer registration (#29767)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-01 11:34:58 +00:00
Cyrus Leung
34a984274e
[Misc] Refactor tokenizer interface (#29693)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-29 04:02:21 -08:00
Tsukasa OI
762a4a6ca9
[Frontend] Perform offline path replacement to tokenizer (#29706)
Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com>
2025-11-28 18:32:08 -08:00
Yanan Cao
3461e7efd8
[Frontend] Remap -O to -cc commandline flag (#29557)
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
2025-11-28 21:51:12 +00:00
Cyrus Leung
8d9338fae4
[Chore] Rename Processor to InputProcessor (#29682)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-28 09:35:41 -08:00
Isotr0py
f946a8d743
[Chore]: Reorganize model repo operating functions in transformers_utils (#29680)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-11-28 08:46:51 -08:00
Cyrus Leung
9e6bcda3ac
[mypy] Enable type checking for more directories (#29674)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-28 08:39:27 -08:00
wang.yuqi
f4b76056ee
Improve enable chunked_prefill & prefix_caching logic. (#26623)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-11-27 22:05:48 -08:00
Cyrus Leung
ea228b4491
[Misc] Remove unused code from protocol.py (#29616)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-27 18:39:59 +00:00
Morrison Turnansky
0838b52e2e
[Frontend][torch.compile] CompilationConfig Overhaul (#20283): Set up -O infrastructure (#26847)
Signed-off-by: morrison-turnansky <mturnans@redhat.com>
Signed-off-by: adabeyta <aabeyta@redhat.com>
Signed-off-by: Morrison Turnansky <mturnans@redhat.com>
Co-authored-by: adabeyta <aabeyta@redhat.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-27 01:55:58 -08:00
Harry Mellor
a1f2676879
Scheduled removal of override_pooler_config and disable_log_requests (#29402)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-25 16:08:57 +00:00
Yifan Qiao
48ddb02b79
[Hybrid Allocator] Support KV cache groups with different block_size (#29143)
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
2025-11-25 10:30:57 -05:00
Injae Ryou
794029f012
[Feature]: Improve GGUF loading from HuggingFace user experience like repo_id:quant_type (#29137)
Signed-off-by: Injae Ryou <injaeryou@gmail.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-11-25 14:28:53 +00:00
Thomas Parnell
516c3f7847
[Bugfix] Fix logic for choosing default prefix caching setting (#29393)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-11-25 14:05:10 +00:00
wang.yuqi
de6889946b
[Misc] Suppress log outputs when constructing the default vllm config. (#29291)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-25 03:00:44 -08:00
zhrrr
f242cfcdd5
[Perf] use cpu all reduce to avoid sync when async_scheduling & dp > 1 (#29311)
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
2025-11-25 15:31:07 +08:00
Harry Mellor
316c8492bf
Scheduled removal of guided_* config fields (#29326)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-25 05:24:05 +00:00
Harry Mellor
a4ad43ad5a
Scheduled removal of ParallelConfig's direct child EPLB fields (#29324)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-25 01:58:58 +00:00
Fadi Arafeh
730bd35378
[perf][cpu] Accelerate paged attention GEMMs (QK, PV) on Arm CPUs with NEON (#29193)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
2025-11-22 09:04:36 -08:00
Cyrus Leung
5a4802588e
[Misc] Further clean up chunked prefill and prefix caching init (#29186)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-22 19:34:15 +08:00
Cyrus Leung
ceca060501
[Deprecation] Deprecate seed=None (#29185)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-21 18:19:25 +00:00
Cyrus Leung
d7219bcda3
[Misc] Move dynamic seed initialization to EngineArgs (#29165)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-21 15:27:44 +00:00
Jee Jee Li
9875be6431
[LoRA][2/2]Remove LoRA extra vocab (#28545)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-21 09:46:43 +08:00
Samit
371b1d4c61
[RL] Add Pause and Resume Generation for Asynchronous RL Training (#28037)
Signed-off-by: SamitHuang <285365963@qq.com>
Signed-off-by: Samit <285365963@qq.com>
Signed-off-by: samithuang <285365963@qq.com>
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com>
2025-11-20 03:01:03 -08:00
Cyrus Leung
20e4497be2
[V0 Deprecation] Remove num_lookahead_slots (#29000)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-11-20 06:39:10 +00:00
Qiu
2fd893b4ce
[Feature] Prefill Context Parallel (PCP) basic support (#28718)
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Co-authored-by: FENP <yuanyongjie.yyj@antgroup.com>
Co-authored-by: LookAround <lixushi@huawei.com>
Co-authored-by: Jingchun Gao <gaojingchun1@huawei.com>
Co-authored-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Co-authored-by: Jingchun Gao <63247409+gjc0824@users.noreply.github.com>
2025-11-19 15:52:44 -05:00
Didier Durand
7ed27f3cb5
[Doc]: fix typos in various files (#28945)
Signed-off-by: Didier Durand <durand.didier@gmail.com>
2025-11-18 22:52:30 -08:00
Cyrus Leung
bf9e1e8767
[Bugfix] Fix wrong CLI defaults for dynamic SchedulerConfig fields (#28872)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-17 20:30:29 -08:00
Lucia Fang
b316ac6589
[V1] Support MP Executor for multi node distributed inference (#23691)
Signed-off-by: Lu Fang <fanglu@fb.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Signed-off-by: Lucia Fang <fanglu@fb.com>
Signed-off-by: Lucia Fang <116399278+luccafong@users.noreply.github.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-11-16 09:01:21 +00:00
Zhuohan Li
dd6ac1c2bb
[RL] [V1] Remove unused device argument from reset_kv_cache (#28766)
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com>
2025-11-14 23:59:42 -08:00
Nicolò Lucchesi
6f1e7f7226
[DisaggEverything] Tokens in<>out /generate endpoint (#24261)
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-14 09:58:01 -07:00
Cyrus Leung
511a6b611d
[Config] Clean up SchedulerConfig initialization (#28665)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-14 22:41:02 +08:00
elvischenv
5d6ce2b960
[Perf] Support stream interval for reducing host overhead (#27869)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-11-13 13:21:25 -05:00
Nick Hill
327c0a9a23
[BugFix] Ensure EngineArgs.create_engine_config is idempotent (#28515)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-11-13 17:14:08 +00:00
Chenguang Zheng
4ccffe561f
[Core] Encoder separation for Encode-Prefill-Decode Disaggregation (#25233)
Signed-off-by: n00909098 <nguyen.kha.long@huawei.com>
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Signed-off-by: herotai214 <herotai214@gmail.com>
Signed-off-by: Khuong Le <khuong.le.manh@huawei.com>
Signed-off-by: Khuong Le <lemanhkhuong2611@gmail.com>
Co-authored-by: n00909098 <nguyen.kha.long@huawei.com>
Co-authored-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Co-authored-by: herotai214 <herotai214@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Khuong Le <khuong.le.manh@huawei.com>
Co-authored-by: Khuong Le <lemanhkhuong2611@gmail.com>
2025-11-11 18:58:33 -08:00
Li, Jiang
7f829be7d3
[CPU] Refactor CPU attention backend (#27954)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-11-12 09:43:06 +08:00
Matthew Bonanni
b30dfa03c5
[Attention] Refactor CUDA attention backend selection logic (#24794)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-11-11 07:40:44 -05:00
zhangsicheng5
2108a571d7
[DCP] Support dcp kv_cache interleave size > 1 (#26696)
Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
Signed-off-by: Qiu <qiuchunshuo@huawei.com>
Co-authored-by: QiuChunshuo <qiuchunshuo@huawei.com>
2025-11-09 04:45:27 +09:00