7249 Commits

Author SHA1 Message Date
Sage Moore
57d404bbb8 misc
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-07-02 16:37:58 +00:00
Sage Moore
d833982e48 random push
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-06-30 17:08:51 +00:00
Sage Moore
4672c72f44 capture works replay does not
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-06-28 19:14:48 +00:00
Sage Moore
af68574e3d reintegrate full cudagraphs
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-06-26 03:57:48 +00:00
Sage Moore
78228a67ce refactor a bunch of misc parameters into a UbatchMetadata class
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-06-26 00:14:18 +00:00
Sage Moore
54deb61b87 delete any notion of dummy_ubatch
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-06-25 23:48:16 +00:00
Sage Moore
0e2b4bd546 more refactoring
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-06-25 23:43:49 +00:00
Sage Moore
e2ba707d64 factored out some of the context creation code along with misc commented infra
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-06-25 23:16:59 +00:00
Sage Moore
44a2b3494e add attention splitting to dummy runs
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-06-25 21:39:33 +00:00
Sage Moore
144b148de2 initial full cudagraphs support. normal runs are working. ubatching does not
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-06-25 19:14:31 +00:00
Sage Moore
97dbafaad6 fix correctness issue with full-cudagraphs + attn splitting
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-06-24 22:47:42 +00:00
Sage Moore
96c0c4ea66 added initial code for cuda graph capturing ubatches
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-06-24 22:19:24 +00:00
Sage Moore
930efd02ab yields now work with deepep_ll
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-06-24 21:53:54 +00:00
Sage Moore
a4def24c2c setup deepepll for ubatching
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-06-24 21:20:49 +00:00
Sage Moore
ff2dd13145 more fixes
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-06-18 13:58:40 +00:00
Sage Moore
0889f66297 Merge branch 'main' of https://github.com/neuralmagic/vllm into lwilkinson/attn-slicing
2025-06-18 13:56:24 +00:00
Sage Moore
1d112d90a5 misc changes
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-06-17 13:34:46 +00:00
Nicolò Lucchesi
4c8f64faa7
[V1][Kernel] Flashinfer HND KV cache layout (#19280)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-06-17 09:09:22 -04:00
David Xia
93aee29fdb
[doc] split "Other AI Accelerators" tabs (#19708)
2025-06-17 22:05:29 +09:00
Reid
154d063b9f
[doc][mkdocs] Add edit button to documentation (#19637)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
2025-06-17 11:10:31 +00:00
jvlunteren
ccd7c05089
[Kernel] Add Split-KV Support to Unified Triton Attention Kernel (#19152)
Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com>
2025-06-17 10:45:07 +00:00
Huy Do
c48c6c4008
Add a doc on how to update PyTorch version (#19705)
2025-06-17 18:10:37 +08:00
Isotr0py
aed8468642
[Doc] Add missing llava family multi-image examples (#19698)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-06-17 07:05:21 +00:00
quanliu
5c76b9cdaf
[Core] add remove_seq_from_computed_blocks_tracker to BlockSpaceManager (#19686)
Signed-off-by: 刘全 <quan.liu2@dbappsecurity.com.cn>
Co-authored-by: 刘全 <quan.liu2@dbappsecurity.com.cn>
2025-06-17 04:40:58 +00:00
Driss Guessous
ddfed314f9
Fixes IMA for TP w/ flex-attention (#19712)
Signed-off-by: drisspg <drisspguessous@gmail.com>
2025-06-17 04:01:50 +00:00
Di Liu
5b3ad5ecf2
[DOC] fix doc typos (#19600)
Signed-off-by: Di Liu <liu-di@sjtu.edu.cn>
2025-06-17 11:34:53 +08:00
nguyenhoangthuan99
ede5c4ebdf
[Frontend] add chunking audio for > 30s audio (#19597)
Signed-off-by: nguyenhoangthuan99 <thuanhppro12@gmail.com>
2025-06-17 11:34:00 +08:00
Lucas Wilkinson
07334959d8
[Wheel Size] Only build FA2 8.0+PTX (#19336)
2025-06-17 12:32:49 +09:00
David Xia
119f683949
[doc] add project flag to gcloud TPU command (#19664)
Signed-off-by: David Xia <david@davidxia.com>
2025-06-17 01:00:09 +00:00
Conroy Cheers
0860087aff
[Fix] Fall back to Gloo when NCCL backend is unavailable (#19641)
Signed-off-by: conroy-cheers <conroy@corncheese.org>
2025-06-17 08:42:14 +08:00
Dipika Sikka
6bc7b57315
[Quantization] Remove FP4 emulation; Fall-back to marlin for device < 100 (#19563)
2025-06-16 17:33:51 -04:00
Russell Bryant
90f9c2eb5c
[V1] Change return type on get_multimodal_embeddings() (#19446)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-06-16 13:32:15 -04:00
qscqesze
387bdf0ab9
[Model] Add support for MiniMaxM1ForCausalLM (shares architecture with MiniMaxText01ForCausalLM) (#19677)
Signed-off-by: QscQ <qscqesze@gmail.com>
2025-06-16 09:47:14 -07:00
bnellnm
5e5baa91aa
[Kernels] Use empty for modular MoE workspaces (#19667)
Signed-off-by: Bill Nell <bnell@redhat.com>
2025-06-16 14:58:01 +00:00
Chauncey
836d4ce140
[Bugfix] fix missing 'finish_reason': null in streaming chat (#19662)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-06-16 14:10:39 +00:00
Ning Xie
c3fec47bb7
[MISC] bump huggingface_hub pkg to 0.33.0 (#19547)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-06-16 05:22:28 -07:00
Isotr0py
1173804dca
[Bugfix] Fix TP inference for Flex attention backend (#19657)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-06-16 11:21:37 +00:00
Shawn Tan
4d5424029b
[Feature]:Allow for Granite MoE Hybrid models with _only_ shared experts. (#19652)
Signed-off-by: Shawn Tan <shawntan@ibm.com>
2025-06-16 11:14:18 +00:00
Navanit Dubey
3e7506975c
[DOC] Add reasoning capability to vLLM streamlit code (#19557)
2025-06-16 07:09:12 -04:00
Nick Hill
ee35e96ac3
[BugFix] Don't catch BaseException when dumping execute_model errors (#19626)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-06-16 11:01:08 +00:00
Szymon Ożóg
dec66d253b
[Kernel] GGUF MMVQ kernel for multiple input vectors (#18754)
Signed-off-by: SzymonOzog <szymon.ozog@gmail.com>
2025-06-16 17:33:26 +08:00
Russell Bryant
8d120701fd
[Docs] Move multiproc doc to v1 dir (#19651)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-06-16 09:10:12 +00:00
wang.yuqi
f40f763f12
[CI] Add mteb testing for rerank models (#19344)
2025-06-16 01:36:43 -07:00
Ning Xie
26bc46ef89
[MISC] typo fix (#19672)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-06-16 07:18:49 +00:00
Chengji Yao
a77aea59fd
[TPU] support attention head dim smaller than 128 (#19620)
Signed-off-by: Chengji Yao <chengjiyao@google.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-06-16 06:40:53 +00:00
Ye (Charlotte) Qi
b692e9cd07
[Misc] Fix skipped max-model-len validation when deriving max model length from tokenizer config (#19660)
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
2025-06-16 06:30:29 +00:00
Francesco Bertolotti
367871a469
[Misc][Frontend] passthrough bad_words (#19564)
Signed-off-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
Co-authored-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
Co-authored-by: Aaron Pham <Aaronpham0103@gmail.com>
2025-06-16 05:05:13 +00:00
quanliu
92183b41f3
[Bugfix][Core] Prefix caching causes incorrect outputs due to outdated ComputedBlocksTracker (#18957)
Signed-off-by: 刘全 <quan.liu2@dbappsecurity.com.cn>
Co-authored-by: 刘全 <quan.liu2@dbappsecurity.com.cn>
2025-06-15 21:56:37 -07:00
Lu Fang
c6703d1e0d
[MISC] Remove unused variables in C++ (#19609)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-06-15 20:05:28 -07:00
Isotr0py
a5e7242d5f
[Misc] Remove duplicate multiproc method setting for CPU platform (#19649)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-06-16 02:26:58 +00:00