269 Commits

Author SHA1 Message Date
Sage Moore
bfa828f399 format
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-07-08 17:13:49 +00:00
Sage Moore
1a0e7110dd _prepare_inputs cleanup
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-07-08 13:02:21 +00:00
Sage Moore
82ae694de6 comments cleanup etc
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-07-03 20:47:39 +00:00
Sage Moore
10ca263058 split some of the ubatching logic out of _run_model
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-07-03 20:26:56 +00:00
Sage Moore
908e9f8f54 cleanup
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-07-03 19:52:41 +00:00
Sage Moore
06cc133a63 cleanup
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-07-03 17:51:08 +00:00
Sage Moore
3a41a3dcff cleanup
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-07-03 17:23:30 +00:00
Sage Moore
bb0645c644 separate ubatch and normal runs
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-07-03 17:07:58 +00:00
Sage Moore
510e839429 more cleanup
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-07-03 16:35:52 +00:00
Sage Moore
f7b6e600b8 gpu_model_runner cleanup
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-07-03 16:23:11 +00:00
Sage Moore
0056be26f6 less ARs
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-07-03 14:33:53 +00:00
Sage Moore
7cc5a549ad cleanup some of the should_ubatch logic
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-07-03 14:22:53 +00:00
Sage Moore
1d75a029a9 remove cudagraph logic from flashmla.py
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-07-03 13:41:49 +00:00
Sage Moore
18f7bfb501 ubatching fix
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-07-02 22:22:41 +00:00
Sage Moore
0e499c4f4d first round of cleanups
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-07-02 21:11:28 +00:00
Sage Moore
c0efbbb5de misc changes
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-07-02 16:56:30 +00:00
Lucas Wilkinson
f7a3ee0ea1 Merge remote-tracking branch 'origin/main' into lwilkinson/attn-slicing
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-07-02 16:52:19 +00:00
Sage Moore
57d404bbb8 misc
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-07-02 16:37:58 +00:00
Sage Moore
d833982e48 random push
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-06-30 17:08:51 +00:00
Woosuk Kwon
2863befce3
[Optimization] Use Shared CachedRequestData Instance Across All Requests (#20232)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-06-30 09:07:50 -07:00
Woosuk Kwon
2062c0723d
[Spec Decode] Refactor spec decoding into a separate function (#20238)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-06-30 08:13:50 -07:00
Woosuk Kwon
19108ef311
[Misc] Fix import (#20233)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-06-29 20:34:54 -07:00
Sage Moore
4672c72f44 capture works replay does not
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-06-28 19:14:48 +00:00
Bowen Wang
e9fd658a73
[Feature] Expert Parallelism Load Balancer (EPLB) (#18343)
Signed-off-by: Bowen Wang <abmfy@icloud.com>
2025-06-26 15:30:21 -07:00
Sage Moore
af68574e3d reintegrate full cudagraphs
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-06-26 03:57:48 +00:00
Sage Moore
78228a67ce refactor a bunch of misc parameters into a UbatchMetadata class
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-06-26 00:14:18 +00:00
Sage Moore
54deb61b87 delete any notion of dummy_ubatch
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-06-25 23:48:16 +00:00
Sage Moore
0e2b4bd546 more refactoring
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-06-25 23:43:49 +00:00
Sage Moore
e2ba707d64 factored out some of the context creation code along with misc commeted infra
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-06-25 23:16:59 +00:00
Sage Moore
44a2b3494e add attention splitting to dummy runs
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-06-25 21:39:33 +00:00
Sage Moore
144b148de2 initial full cudagraphs support. normal runs are working. ubatching does not
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-06-25 19:14:31 +00:00
Sage Moore
96c0c4ea66 added initial code for cuda graph capturing ubatches
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-06-24 22:19:24 +00:00
Sage Moore
a4def24c2c setup deepepll for ubatching
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-06-24 21:20:49 +00:00
Vadim Gimpelson
9a3b88328f
[PERF] Speedup of MRoPE prepare inputs (#19939)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@centml.ai>
2025-06-23 23:01:26 -07:00
Isotr0py
61f4fc5dc6
[Bugfix][v1] Fix step pooler implementation and step pooling usage in v1 (#19956)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-06-23 18:38:06 +00:00
Vlad Tiberiu Mihailescu
2e3e3c86dc
Export NaNs in logits to scheduler_stats if output is corrupted (#18777)
Signed-off-by: Vlad Mihailescu <vtmihailescu@gmail.com>
2025-06-20 22:47:16 +08:00
Maximilien de Bayser
799397ee4f
Support embedding models in V1 (#16188)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com>
2025-06-18 21:36:33 -07:00
Richard Zou
ed33349738
[BugFix] Fix use_cudagraph=False (#19612)
Signed-off-by: Richard Zou <zou3519@gmail.com>
2025-06-19 08:23:12 +08:00
Chen Zhang
a89209b78d
[v1] Support mamba2 (#19327)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-06-18 20:34:15 +00:00
Sage Moore
0889f66297 Merge branch 'main' of https://github.com/neuralmagic/vllm into lwilkinson/attn-slicing
2025-06-18 13:56:24 +00:00
Sage Moore
1d112d90a5 misc changes
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-06-17 13:34:46 +00:00
Luka Govedič
3597b06a4f
[CUDA] Enable full cudagraph for FlashMLA (#18581)
Signed-off-by: luka <luka@neuralmagic.com>
2025-06-13 18:12:26 +00:00
汪志鹏
cefdb9962d
[Fix] The zip function in Python 3.9 does not have the strict argument (#19549)
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
2025-06-13 14:57:48 +08:00
Russell Bryant
c57bb199b3
[V1] Resolve failed concurrent structured output requests (#19565)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-06-12 23:30:09 +00:00
Sage Moore
b74c731342 more hacking
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-06-12 20:36:13 +00:00
Sage Moore
d682f5e1bd wip cudagraphs
2025-06-12 14:33:21 +00:00
Robert Shaw
97a9465bbc
[UX] Add Feedback During CUDAGraph Capture (#19501)
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
2025-06-11 21:09:05 +00:00
Lukas Geiger
319cb1e351
[Core] Batch multi modal input using pinned memory (#19169)
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
2025-06-10 13:44:59 +08:00
Varun Sundar Rabindranath
5cf2daea9a
[Misc] Fixes and Optimizations for DeepEP + DeepGEMM combination. (#19298)
Signed-off-by: Varun <vsundarr@redhat.com>
Co-authored-by: Varun <vsundarr@redhat.com>
2025-06-09 10:50:39 -04:00
Yinghai Lu
770e5dcdb8
[full_graph] Fix query_start_loc padding (#19321)
Signed-off-by: Yinghai Lu <yinghai@thinkingmachines.ai>
2025-06-09 21:32:56 +08:00