356 Commits

Author SHA1 Message Date
Hyesoo Yang
47195057e9
[V1][TPU] Speed up top-k on TPU by using torch.topk (#15242)
Signed-off-by: Hyesoo Yang <hyeygit@gmail.com>
2025-03-20 19:19:40 -07:00
Woosuk Kwon
0c6f5023c3
[V1] Scheduler Refactoring [1/N] - Add Scheduler Interface (#15250)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-03-20 17:50:43 -07:00
Woosuk Kwon
2b22290ce0
[V1] Add flag to disable cascade attention (#15243)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-20 15:24:16 -07:00
Jason
d8e82bc06d
[Bugfix] fix V1 Engine crash while handling requests with duplicate request id (#15043)
Signed-off-by: Jiahui Sun <jhsun2020@gmail.com>
2025-03-20 10:01:02 -07:00
Mickaël Seznec
a597a57595
[Attention] Flash Attention 3 - fp8 (#14570)
Signed-off-by: Mickael Seznec <mickael@mistral.ai>
2025-03-20 01:14:20 -04:00
Nicolò Lucchesi
d8c6d7d6b5
[V1][TPU] Support V1 Sampler for ragged attention (#14227)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-03-19 21:00:39 -07:00
Nick Hill
c47aafa37c
[BugFix] Lazily import XgrammarBackend to avoid early cuda init (#15171)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-03-20 01:30:43 +00:00
iefgnoix
b0e96aaebb
[V1][TPU] Change kv cache shape. (#15145)
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
2025-03-19 12:16:42 -07:00
Wang Ran (汪然)
8310e0b59b
simple bugfix: Update stats.py (#15139) 2025-03-19 18:26:27 +00:00
maobaolong
26dd972adb
[FEAT]Support reset prefix cache by specified device (#15003) 2025-03-19 10:54:41 -07:00
Woosuk Kwon
05ccd0aa35
[V1] Ensure using int64 for sampled token ids (#15065)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-18 23:52:19 -07:00
Chujie Zheng
027827cc1d
fix long dtype in topk sampling (#15049) 2025-03-18 15:57:31 -07:00
Woosuk Kwon
99abb8b650
[V1][Spec Decode] Optimize Rejection Sampler with Triton Kernels (#14930)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-18 14:31:54 -07:00
Russell Bryant
3a1e648158
[V1] Refactor Structured Output for multiple backends (#14694)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-18 19:49:15 +00:00
Nicolò Lucchesi
af35d3a3cc
[TPU][V1][Bugfix] Fix chunked prefill with padding (#15037)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-03-18 07:34:45 -07:00
Varun Sundar Rabindranath
400d483e87
[Kernels] LoRA - Retire SGMV and BGMV Kernels (#14685)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-03-18 09:47:53 +00:00
Aaron Pham
c0efdd655b
[Fix][Structured Output] using vocab_size to construct matcher (#14868)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
2025-03-17 11:42:45 -04:00
iefgnoix
b4ad56c1bd
[V1][TPU] Apply the ragged paged attention kernel fix and remove the padding. (#14846)
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
2025-03-17 01:48:28 -07:00
Cyrus Leung
b539222d4e
[V1] Remove input cache client (#14864)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2025-03-16 23:42:06 -07:00
Lily Liu
8d6cf89526
[V1] [Spec Decode] Support random sampling for spec decode (#13933)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-16 22:00:20 -07:00
Woosuk Kwon
7f6c5ee06c
[V1][Minor] Add __repr__ to ConstantList (#14907)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-16 20:20:15 -07:00
Woosuk Kwon
faa0275730
[V1] Optimize the overhead of rewinding (#14905)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-16 20:19:30 -07:00
Woosuk Kwon
31060b2757
[V1][BugFix] Detect interleaved sliding window attention (#14896)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-16 14:53:53 -07:00
Nick Hill
fc1f67715d
[BugFix][V1] Fix overhead related to bad_words sampling when not in use (#14894)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-03-16 14:53:34 -07:00
Jun Duan
74bc397b0a
[Core] Expose API endpoint /is_sleeping (#14312)
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
2025-03-15 06:28:14 -07:00
Cyrus Leung
3556a41434
[VLM] Limit multimodal input cache by memory (#14805)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-15 02:52:05 -07:00
Aaron Pham
4c7629cae9
[V1][Structured Output] calculate vocab_size eagerly (#14851)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
2025-03-14 22:09:51 -07:00
Robert Shaw
d4d93db2c5
[V1] V1 Enablement Oracle (#13726)
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2025-03-14 22:02:20 -07:00
DefTruth
acaea3bb07
[Bugfix][V1] Fix flashinfer sampling (#14815) 2025-03-14 20:42:38 -07:00
Russell Bryant
1140991a7b
[V1] Fix vocab size calculation for structured output (#14826)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-14 09:18:38 -07:00
Woosuk Kwon
c77620d22d
[V1][Minor] Minor code cleanup for scheduling metrics (#14800)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-14 08:21:28 +00:00
Lucas Wilkinson
9532c49836
[Attention] MLA get rid of materialization (#14770)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-03-13 23:39:02 -07:00
Woosuk Kwon
32ef4983cd
[V1] Temporarily disable FlashInfer Rejection Sampler (#14788)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-13 20:40:35 -07:00
Roger Wang
ad19c8a003
[V1] Move OOM check into sampler run (#14728)
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2025-03-13 20:40:23 -07:00
afeldman-nm
02fcaa3d0a
[V1] Detokenizer: Respect Stop Tokens + not include_stop_str_in_output (#14624)
Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com>
2025-03-13 19:07:34 +00:00
Aaron Pham
8a4a2efc6f
[V1][Core] using cached vocab_size for Structured Outputs (#14630)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
2025-03-13 11:39:28 -07:00
Woosuk Kwon
01b3fd0af7
[V1][Minor] Minor enhancements on scheduler (#14732)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-13 08:53:22 -07:00
Joe Runde
165290d357
[bugfix] fixup warning message for plugged schedulers for v1 (#14700)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2025-03-12 20:12:13 -07:00
Nick Hill
f5d3acd474
[BugFix][V1] Fix parallel sampling finishing/aborts (#14512)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-03-12 10:29:48 -07:00
Benjamin Chislett
5c538c37b2
[V1][Bugfix][Spec Decode] Fix incorrect outputs in V1 speculative decoding due to batch indexing (#14645)
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
2025-03-11 22:12:41 -07:00
Aaron Pham
77a318bd01
[V1][Core] Support MistralTokenizer for Structured Output (#14625)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
2025-03-12 10:40:09 +08:00
Joe Runde
47532cd9f4
[core][V1] pluggable scheduler (#14466)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2025-03-12 01:15:15 +00:00
Cody Yu
b706d898af
[Bugfix][V1][PP] Only warmup sampler at last PP rank (#14643)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2025-03-11 23:40:07 +00:00
iefgnoix
863d315c86
[V1][TPU] Pad the block_table.shape[1] so the ragged paged attention can handle correctly (#14597) 2025-03-11 19:12:26 -04:00
Russell Bryant
61a01b27a7
[V1] Delay all xgrammar usage until needed (#14616)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-11 20:21:33 +00:00
Russell Bryant
4cbf286794
[V1] Remove cache from StructuredOutputManager (#14622)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-11 10:36:07 -07:00
Russell Bryant
4bf82d4b90
[V1] Add regex structured output support with xgrammar (#14590)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-11 23:03:44 +08:00
Jeff Daily
a1c8f3796c
dynamic distpatch of fp8 kernels (#14245)
Signed-off-by: Jeff Daily <jeff.daily@amd.com>
2025-03-11 10:54:56 -04:00
Roger Wang
1fc973c0b5
[V1][Core] Fix memory issue with logits & sampling (#14508)
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Varun Sundar Rabindranath <3337719+varun-sundar-rabindranath@users.noreply.github.com>
2025-03-11 04:03:41 +00:00
Cody Yu
4290b704ff
[V1][PP] Do not block engine core when no requests to schedule (#14585)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2025-03-10 19:48:24 -07:00