95 Commits

Author SHA1 Message Date
Wentao Ye
e6c22d2b2f [Perf] Apply torch.compile for per_block_cast_to_fp8 (#24611)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-10-03 13:35:54 -07:00
Luka Govedič
6dbbecd5b2 [torch.compile] Cleanup compilation tests and custom passes, add debug utils, fix DCE bug (#23091), fix test (#24376), and prep for custom op matching (#24604) (#24542)
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: luka <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-10-03 13:35:53 -07:00
Harry Mellor
44be2b7349 Make mypy behave like a proper pre-commit hook (#25313)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-10-03 13:35:53 -07:00
Cyrus Leung
ddf4e1f56f [Misc] Remove unused encoder-decoder error strings (#25374)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-10-03 13:35:53 -07:00
Woosuk Kwon
a815d820ee Remove V0 attention backends (#25351)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-10-03 13:35:53 -07:00
Wenlong Wang
dad5f4d16d [Docs] Fix warnings in mkdocs build (continued) (#25042)
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-10-03 13:35:53 -07:00
Nick Hill
d897924b45 [BugFix] Exclude self when checking for port collision (#25286)
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-10-03 13:35:53 -07:00
Andrew Sansom
9a4600e4dc
[CORE] Prompt Embeddings Support for v1 Engine (#24278)
Signed-off-by: Andrew Sansom <andrew@protopia.ai>
Signed-off-by: Andrew Sansom <qthequartermasterman@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-09-19 08:03:09 +08:00
elvischenv
e67a79db03
[Bugfix] Refactor Flashinfer TRTLLM attention kernel selection logic (#24600)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-09-17 15:36:29 -07:00
Wentao Ye
de2cc3d867
[Deprecation] Remove DeepGEMM Old Symbol Wrapper (#24902)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-09-15 20:03:29 -06:00
Lukas Geiger
1da0f1441d
[Core][Multimodal] Cache supports_kw (#24773)
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
2025-09-13 07:27:04 +00:00
dongluw
a5b84f1cbf
[Core] Shared memory based object store for Multimodal data caching and IPC (#20452)
Signed-off-by: donglu <donglu@cohere.com>
2025-09-12 07:54:17 -07:00
Xiaozhu Meng
e42af78b18
[flashinfer] [kernel] support for fp8 kv cache for trtllm prefill attention (#24197)
Signed-off-by: Xiaozhu <mxz297@gmail.com>
2025-09-11 14:20:09 -07:00
Boyuan Feng
94e6b2d55f
Allow users to specify kv cache memory size (#21489)
Signed-off-by: Boyuan Feng <boyuan@meta.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-09-11 13:41:07 +00:00
Charlie Fu
73e688cb79
[ROCm][Feature] Enable Pipeline Parallelism with Ray Compiled Graph on ROCm (#24275)
Signed-off-by: charlifu <charlifu@amd.com>
2025-09-09 23:27:35 +00:00
22quinn
0cdd213641
[Misc] Improve Worker process title and logging prefix (#22205)
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
2025-09-08 21:43:48 -07:00
Zebing Lin
82dfb12e52
[Core] Use sha256 bytes instead of BlockHash to reduce GC overhead (#23673)
Signed-off-by: linzebing <linzebing1995@gmail.com>
2025-09-08 21:34:37 -07:00
Nick Hill
752d2e1c36
[Minor] Fix some random typos in comments (#24009)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-08-31 16:42:17 -07:00
Gabriel Marinho
5b8077b8ac
Fix wrong truncate_prompt_tokens type hint (#22761)
Signed-off-by: Gabriel Marinho <gmarinho@ibm.com>
Signed-off-by: Gabriel Marinho <104592062+gmarinho2@users.noreply.github.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Max de Bayser <mbayser@br.ibm.com>
2025-08-30 20:39:38 +00:00
dubejf
5b31cb1781
[Bugfix] Fix --config arg expansion called from api_server.py (#23944)
Signed-off-by: Jean-Francois Dube <dubejf+gh@gmail.com>
Co-authored-by: Jean-Francois Dube <dubejf+gh@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-08-29 21:36:39 -07:00
Adit Chawdhary
4f7cde7272
Adds json_count_leaves utility function (#23899)
Signed-off-by: aditchawdhary <aditxy@hotmail.com>
2025-08-29 05:28:13 -07:00
Wentao Ye
d3d2aad5a2
[Log] Use Debug Once for DeepGEMM E8M0 When not Enabled (#23858) 2025-08-28 22:18:10 +00:00
Wentao Ye
3af47c3cc6
[Feature] Add Hopper DeepGEMM E8M0 for DeepSeekV3.1 scale_fmt (#23666)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2025-08-27 14:09:08 +00:00
rongfu.leng
8dbf6ed7be
[Bugfix] fix when config.yaml config value is list parse error (#23528)
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
2025-08-27 05:54:39 +00:00
nvjullin
f66673a39d
[Kernel] Added flashinfer fp8 per-tensor gemms (#22895)
Signed-off-by: Julien Lin <jullin@nvidia.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-08-26 06:54:04 -07:00
Wentao Ye
56dcf4e7e9
[Bug] Fix DeepGEMM Env Control (#23591)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-08-25 18:41:21 -07:00
Shiyan Deng
da65bec309
add an env var for path to pre-downloaded flashinfer cubin files (#22675) 2025-08-22 19:25:45 +00:00
Didier Durand
22cf679aad
[Doc]: fix various typos in multiple files (#23179)
Signed-off-by: Didier Durand <durand.didier@gmail.com>
2025-08-22 10:38:46 -07:00
Li, Jiang
88016c372a
[Bugfix] Fix pooling models on CPU backend (#23392)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-08-22 09:47:17 +00:00
Wentao Ye
394591e343
[Feature] Enable DeepGEMM Linear on B200; 1.5% E2E throughput improvement (#23351)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-08-21 21:01:08 -07:00
Ming Yang
10f535c086
[Bugfix] Fix port conflict by obtaining a list of open ports upfront (#21894)
Signed-off-by: Ming Yang <minos.future@gmail.com>
2025-08-21 10:22:18 -07:00
elvischenv
03752dba8f
[NVIDIA] Support Flashinfer TRTLLM FP8-q/kv/out Attention Kernel (#21716)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-08-19 08:22:15 -04:00
afeldman-nm
bf7f470b22
[V1] Logits processors extensibility (#19912)
Signed-off-by: Andrew Feldman <afeldman@redhat.com>
Signed-off-by: Andrew Feldman <afeld2012@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-08-16 12:59:17 -07:00
Nicolò Lucchesi
070da660c1
[Kernel] Simplify get_kv_cache_layout and cache use_trtllm_attention env-dependent bit (#22735)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-08-16 00:14:08 +00:00
Nick Hill
ad0297d113
[Misc] Support passing multiple request ids at once to AsyncLLM.abort() (#22944)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-08-15 17:00:36 -07:00
Yichen Yan
236b864e4f
[BugFix] Make run_once thread-safe (#22978)
Signed-off-by: <wenji.yyc@alibaba-inc.com>
Signed-off-by: Yichen Yan <wenji.yyc@alibaba-inc.com>
2025-08-15 16:56:17 -07:00
Or Ozeri
c280066f9d
[v1] Move block_hashes from KVCacheManager to Request.block_hashes (#19728)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
2025-08-15 16:52:52 -07:00
Thomas Parnell
75531a6c13
[V1] [Hybrid] Support using float32 for state in Hybrid Models (Mamba2, Mamba1, Minimax) (#22928)
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Daniel Afrimi <danielafrimi8@gmail.com>
Co-authored-by: Burkhard Ringlein <ngl@zurich.ibm.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
2025-08-15 12:57:06 +00:00
nvjullin
279a5f31b3
[Kernel] Add nvfp4 gemm flashinfer backends (#22346)
Signed-off-by: Julien Lin <jullin@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-08-14 16:03:55 -04:00
Nick Hill
eb08487b18
[BugFix] Threadsafe close async zmq sockets (#22877)
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-08-14 03:44:29 -07:00
HWH
9bd9294f0e
[Bugfix] Fix MiniCPMV Image input inference failed (#22813)
Signed-off-by: HWH <67449739+jio-H@users.noreply.github.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-08-13 09:41:41 -07:00
Michael Goin
c6b928798e
Force TRTLLM attention for gpt-oss on SM100 (#22678)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-08-12 21:22:16 -07:00
Wentao Ye
f7dcce7a4a
[Feature] Add VLLM_USE_DEEP_GEMM_E8M0 Env to Control E8M0 Scale (#21968)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-08-11 09:39:08 -07:00
Cyrus Leung
951b038298
[Misc] Move jsontree to utils (#22622)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-11 03:49:32 -07:00
Harry Mellor
bc1d02ac85
[Docs] Add comprehensive CLI reference for all large vllm subcommands (#22601)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-08-11 00:13:33 -07:00
Benji Beck
b4e2916721
Migrate LlavaNextImageInputs to TensorSchema (#21774)
Signed-off-by: Benji Beck <benjibeck@meta.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-10 09:05:21 -07:00
Harry Mellor
56186474f6
[Docs] Reduce noise in docs and --help from the JSON tip (#22567)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-08-09 08:31:32 -07:00
Wentao Ye
3157aebb63
[Log] Add Warning for Deprecation of DeepGEMM old version (#22194)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-08-08 23:07:48 -07:00
Yongye Zhu
e789cad6b8
[gpt-oss] triton kernel mxfp4 (#22421)
Signed-off-by: <zyy1102000@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
2025-08-08 08:24:07 -07:00
Harry Mellor
e5ebeeba53
Remove exception for Python 3.8 typing from linter (#22506)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-08-08 03:06:46 -07:00