Hyesoo Yang
|
47195057e9
|
[V1][TPU] Speed up top-k on TPU by using torch.topk (#15242)
Signed-off-by: Hyesoo Yang <hyeygit@gmail.com>
|
2025-03-20 19:19:40 -07:00 |
|
Woosuk Kwon
|
0c6f5023c3
|
[V1] Scheduler Refactoring [1/N] - Add Scheduler Interface (#15250)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
|
2025-03-20 17:50:43 -07:00 |
|
Woosuk Kwon
|
2b22290ce0
|
[V1] Add flag to disable cascade attention (#15243)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-03-20 15:24:16 -07:00 |
|
Jason
|
d8e82bc06d
|
[Bugfix] fix V1 Engine crash while handling requests with duplicate request id (#15043)
Signed-off-by: Jiahui Sun <jhsun2020@gmail.com>
|
2025-03-20 10:01:02 -07:00 |
|
Mickaël Seznec
|
a597a57595
|
[Attention] Flash Attention 3 - fp8 (#14570)
Signed-off-by: Mickael Seznec <mickael@mistral.ai>
|
2025-03-20 01:14:20 -04:00 |
|
Nicolò Lucchesi
|
d8c6d7d6b5
|
[V1][TPU] Support V1 Sampler for ragged attention (#14227)
Signed-off-by: NickLucche <nlucches@redhat.com>
|
2025-03-19 21:00:39 -07:00 |
|
Nick Hill
|
c47aafa37c
|
[BugFix] Lazily import XgrammarBackend to avoid early cuda init (#15171)
Signed-off-by: Nick Hill <nhill@redhat.com>
|
2025-03-20 01:30:43 +00:00 |
|
iefgnoix
|
b0e96aaebb
|
[V1][TPU] Change kv cache shape. (#15145)
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
|
2025-03-19 12:16:42 -07:00 |
|
Wang Ran (汪然)
|
8310e0b59b
|
simple bugfix: Update stats.py (#15139)
|
2025-03-19 18:26:27 +00:00 |
|
maobaolong
|
26dd972adb
|
[FEAT]Support reset prefix cache by specified device (#15003)
|
2025-03-19 10:54:41 -07:00 |
|
Woosuk Kwon
|
05ccd0aa35
|
[V1] Ensure using int64 for sampled token ids (#15065)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-03-18 23:52:19 -07:00 |
|
Chujie Zheng
|
027827cc1d
|
fix long dtype in topk sampling (#15049)
|
2025-03-18 15:57:31 -07:00 |
|
Woosuk Kwon
|
99abb8b650
|
[V1][Spec Decode] Optimize Rejection Sampler with Triton Kernels (#14930)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-03-18 14:31:54 -07:00 |
|
Russell Bryant
|
3a1e648158
|
[V1] Refactor Structured Output for multiple backends (#14694)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
|
2025-03-18 19:49:15 +00:00 |
|
Nicolò Lucchesi
|
af35d3a3cc
|
[TPU][V1][Bugfix] Fix chunked prefill with padding (#15037)
Signed-off-by: NickLucche <nlucches@redhat.com>
|
2025-03-18 07:34:45 -07:00 |
|
Varun Sundar Rabindranath
|
400d483e87
|
[Kernels] LoRA - Retire SGMV and BGMV Kernels (#14685)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
|
2025-03-18 09:47:53 +00:00 |
|
Aaron Pham
|
c0efdd655b
|
[Fix][Structured Output] using vocab_size to construct matcher (#14868)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
|
2025-03-17 11:42:45 -04:00 |
|
iefgnoix
|
b4ad56c1bd
|
[V1][TPU] Apply the ragged paged attention kernel fix and remove the padding. (#14846)
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
|
2025-03-17 01:48:28 -07:00 |
|
Cyrus Leung
|
b539222d4e
|
[V1] Remove input cache client (#14864)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
|
2025-03-16 23:42:06 -07:00 |
|
Lily Liu
|
8d6cf89526
|
[V1] [Spec Decode] Support random sampling for spec decode (#13933)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-03-16 22:00:20 -07:00 |
|
Woosuk Kwon
|
7f6c5ee06c
|
[V1][Minor] Add __repr__ to ConstantList (#14907)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-03-16 20:20:15 -07:00 |
|
Woosuk Kwon
|
faa0275730
|
[V1] Optimize the overhead of rewinding (#14905)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-03-16 20:19:30 -07:00 |
|
Woosuk Kwon
|
31060b2757
|
[V1][BugFix] Detect interleaved sliding window attention (#14896)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-03-16 14:53:53 -07:00 |
|
Nick Hill
|
fc1f67715d
|
[BugFix][V1] Fix overhead related to bad_words sampling when not in use (#14894)
Signed-off-by: Nick Hill <nhill@redhat.com>
|
2025-03-16 14:53:34 -07:00 |
|
Jun Duan
|
74bc397b0a
|
[Core] Expose API endpoint /is_sleeping (#14312)
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
|
2025-03-15 06:28:14 -07:00 |
|
Cyrus Leung
|
3556a41434
|
[VLM] Limit multimodal input cache by memory (#14805)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-03-15 02:52:05 -07:00 |
|
Aaron Pham
|
4c7629cae9
|
[V1][Structured Output] calculate vocab_size eagerly (#14851)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
|
2025-03-14 22:09:51 -07:00 |
|
Robert Shaw
|
d4d93db2c5
|
[V1] V1 Enablement Oracle (#13726)
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
|
2025-03-14 22:02:20 -07:00 |
|
DefTruth
|
acaea3bb07
|
[Bugfix][V1] Fix flashinfer sampling (#14815)
|
2025-03-14 20:42:38 -07:00 |
|
Russell Bryant
|
1140991a7b
|
[V1] Fix vocab size calculation for structured output (#14826)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
|
2025-03-14 09:18:38 -07:00 |
|
Woosuk Kwon
|
c77620d22d
|
[V1][Minor] Minor code cleanup for scheduling metrics (#14800)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-03-14 08:21:28 +00:00 |
|
Lucas Wilkinson
|
9532c49836
|
[Attention] MLA get rid of materialization (#14770)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
|
2025-03-13 23:39:02 -07:00 |
|
Woosuk Kwon
|
32ef4983cd
|
[V1] Temporarily disable FlashInfer Rejection Sampler (#14788)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-03-13 20:40:35 -07:00 |
|
Roger Wang
|
ad19c8a003
|
[V1] Move OOM check into sampler run (#14728)
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
|
2025-03-13 20:40:23 -07:00 |
|
afeldman-nm
|
02fcaa3d0a
|
[V1] Detokenizer: Respect Stop Tokens + not include_stop_str_in_output (#14624)
Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com>
|
2025-03-13 19:07:34 +00:00 |
|
Aaron Pham
|
8a4a2efc6f
|
[V1][Core] using cached vocab_size for Structured Outputs (#14630)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
|
2025-03-13 11:39:28 -07:00 |
|
Woosuk Kwon
|
01b3fd0af7
|
[V1][Minor] Minor enhancements on scheduler (#14732)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-03-13 08:53:22 -07:00 |
|
Joe Runde
|
165290d357
|
[bugfix] fixup warning message for plugged schedulers for v1 (#14700)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
|
2025-03-12 20:12:13 -07:00 |
|
Nick Hill
|
f5d3acd474
|
[BugFix][V1] Fix parallel sampling finishing/aborts (#14512)
Signed-off-by: Nick Hill <nhill@redhat.com>
|
2025-03-12 10:29:48 -07:00 |
|
Benjamin Chislett
|
5c538c37b2
|
[V1][Bugfix][Spec Decode] Fix incorrect outputs in V1 speculative decoding due to batch indexing (#14645)
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
|
2025-03-11 22:12:41 -07:00 |
|
Aaron Pham
|
77a318bd01
|
[V1][Core] Support MistralTokenizer for Structured Output (#14625)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
|
2025-03-12 10:40:09 +08:00 |
|
Joe Runde
|
47532cd9f4
|
[core][V1] pluggable scheduler (#14466)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
|
2025-03-12 01:15:15 +00:00 |
|
Cody Yu
|
b706d898af
|
[Bugfix][V1][PP] Only warmup sampler at last PP rank (#14643)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
|
2025-03-11 23:40:07 +00:00 |
|
iefgnoix
|
863d315c86
|
[V1][TPU] Pad the block_table.shape[1] so the ragged paged attention can handle correctly (#14597)
|
2025-03-11 19:12:26 -04:00 |
|
Russell Bryant
|
61a01b27a7
|
[V1] Delay all xgrammar usage until needed (#14616)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
|
2025-03-11 20:21:33 +00:00 |
|
Russell Bryant
|
4cbf286794
|
[V1] Remove cache from StructuredOutputManager (#14622)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
|
2025-03-11 10:36:07 -07:00 |
|
Russell Bryant
|
4bf82d4b90
|
[V1] Add regex structured output support with xgrammar (#14590)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
|
2025-03-11 23:03:44 +08:00 |
|
Jeff Daily
|
a1c8f3796c
|
dynamic distpatch of fp8 kernels (#14245)
Signed-off-by: Jeff Daily <jeff.daily@amd.com>
|
2025-03-11 10:54:56 -04:00 |
|
Roger Wang
|
1fc973c0b5
|
[V1][Core] Fix memory issue with logits & sampling (#14508)
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Varun Sundar Rabindranath <3337719+varun-sundar-rabindranath@users.noreply.github.com>
|
2025-03-11 04:03:41 +00:00 |
|
Cody Yu
|
4290b704ff
|
[V1][PP] Do not block engine core when no requests to schedule (#14585)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
|
2025-03-10 19:48:24 -07:00 |
|