Mengqing Cao
|
dd66fd2b01
|
[CI] fix pre-commit error (#12494)
Signed-off-by: Mengqing Cao <cmq0113@163.com>
|
2025-01-28 06:11:05 +00:00 |
|
Harry Mellor
|
823ab79633
|
Update pre-commit hooks (#12475)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-01-27 17:23:08 -07:00 |
|
Nicolò Lucchesi
|
6116ca8cd7
|
[Feature] [Spec decode]: Enable MLPSpeculator/Medusa and prompt_logprobs with ChunkedPrefill (#10132)
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: wallashss <wallashss@ibm.com>
Co-authored-by: wallashss <wallashss@ibm.com>
|
2025-01-27 13:38:35 -08:00 |
|
Cyrus Leung
|
59a0192fb9
|
[Core] Interface for accessing model from VllmRunner (#10353)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-01-20 15:00:59 +08:00 |
|
youkaichao
|
ad34c0df0f
|
[core] platform agnostic executor via collective_rpc (#11256)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2025-01-15 13:45:21 +08:00 |
|
Cyrus Leung
|
ee77fdb5de
|
[Doc][2/N] Reorganize Models and Usage sections (#11755)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-01-06 21:40:31 +08:00 |
|
youkaichao
|
b12e87f942
|
[platforms] enable platform plugins (#11602)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-12-30 20:24:45 +08:00 |
|
Rafael Vasquez
|
32aa2059ad
|
[Docs] Convert rST to MyST (Markdown) (#11145)
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
|
2024-12-23 22:35:38 +00:00 |
|
Cyrus Leung
|
c889d5888b
|
[Doc] Explicitly state that PP isn't compatible with speculative decoding yet (#10975)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2024-12-07 17:20:49 +00:00 |
|
Cyrus Leung
|
aa39a8e175
|
[Doc] Create a new "Usage" section (#10827)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2024-12-05 11:19:35 +08:00 |
|
Yang Zheng
|
f6084f6324
|
[Speculative Decoding] Move indices to device before filtering output (#10850)
Co-authored-by: Yang Zheng(SW)(Alex) <you@example.com>
|
2024-12-03 17:01:39 +08:00 |
|
jeongin601
|
1bf905ddaa
|
[Bugfix][SpecDecode] apply sampling parameters to target probabilities for consistency in rejection sampling. (#10198)
Signed-off-by: jeongin601 <0200angela@gmail.com>
Signed-off-by: jeong_in.bae <jeong_in.bae@navercorp.com>
|
2024-11-27 05:07:30 +00:00 |
|
Chendi.Xue
|
0a71900bc9
|
Remove hard-dependencies of Speculative decode to CUDA workers (#10587)
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
|
2024-11-26 17:57:11 -08:00 |
|
Murali Andoorveedu
|
db66e018ea
|
[Bugfix] Fix for Spec model TP + Chunked Prefill (#10232)
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com>
Signed-off-by: Sourashis Roy <sroy@roblox.com>
Co-authored-by: Sourashis Roy <sroy@roblox.com>
|
2024-11-26 09:11:16 -08:00 |
|
youkaichao
|
eebad39f26
|
[torch.compile] support all attention backends (#10558)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-11-22 14:04:42 -08:00 |
|
Sky Lee
|
2ec8827288
|
[Bugfix] Qwen-vl output is inconsistent in speculative decoding (#10350)
|
2024-11-15 05:40:10 +00:00 |
|
Cyrus Leung
|
e0191a95d8
|
[0/N] Rename MultiModalInputs to MultiModalKwargs (#10040)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2024-11-09 11:31:02 +08:00 |
|
Nicolò Lucchesi
|
9d43afcc53
|
[Feature] [Spec decode]: Combine chunked prefill with speculative decoding (#9291)
Signed-off-by: NickLucche <nlucches@redhat.com>
|
2024-11-07 08:15:14 -08:00 |
|
Sungjae Lee
|
0c63c34f72
|
[Bugfix][SpecDecode] kv corruption with bonus tokens in spec decode (#9730)
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
|
2024-11-06 01:45:45 +00:00 |
|
youkaichao
|
2094062b4e
|
[4.5/N] bugfix for quant config in speculative decode (#10007)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-11-04 15:11:59 -08:00 |
|
youkaichao
|
e893795443
|
[2/N] executor pass the complete config to worker/modelrunner (#9938)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
|
2024-11-02 07:35:05 -07:00 |
|
科英
|
67a6882da4
|
[Misc] SpecDecodeWorker supports profiling (#9719)
Signed-off-by: Abatom <abatom@163.com>
|
2024-10-27 04:18:03 +00:00 |
|
Thomas Parnell
|
496e991da8
|
[Doc] Consistent naming of attention backends (#9498)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
|
2024-10-21 22:29:57 +08:00 |
|
Lily Liu
|
8345045833
|
[Performance][Spec Decode] Optimize ngram lookup performance (#9333)
|
2024-10-16 13:37:45 -06:00 |
|
Lily Liu
|
89feb4c84d
|
[SpecDec] Remove Batch Expansion (2/3) (#9298)
|
2024-10-12 05:13:37 +00:00 |
|
Wallas Henrique
|
8baf85e4e9
|
[Doc] Compatibility matrix for mutual exclusive features (#8512)
Signed-off-by: Wallas Santos <wallashss@ibm.com>
|
2024-10-11 11:18:50 -07:00 |
|
TJian
|
23fea8714a
|
[Bugfix] Fix try-catch conditions to import correct Flash Attention Backend in Draft Model (#9101)
|
2024-10-06 13:00:04 +08:00 |
|
youkaichao
|
9aaf14c62e
|
[misc] add forward context for attention (#9029)
|
2024-10-03 12:09:42 -07:00 |
|
Lily Liu
|
1570203864
|
[Spec Decode] (1/2) Remove batch expansion (#8839)
|
2024-10-01 16:04:42 -07:00 |
|
Travis Johnson
|
01b6f9e1f0
|
[Core][Bugfix] Support prompt_logprobs returned with speculative decoding (#8047)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
|
2024-09-24 17:29:56 -07:00 |
|
Lily Liu
|
c6bd70d772
|
[SpecDec][Misc] Cleanup, remove bonus token logic. (#8701)
|
2024-09-22 12:34:14 -07:00 |
|
Aaron Pham
|
9d104b5beb
|
[CI/Build] Update Ruff version (#8469)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
|
2024-09-18 11:00:56 +00:00 |
|
Kevin Lin
|
5faedf1b62
|
[Spec Decode] Move ops.advance_step to flash attn advance_step (#8224)
|
2024-09-10 13:18:14 -07:00 |
|
Lily Liu
|
e6a26ed037
|
[SpecDecode][Kernel] Flashinfer Rejection Sampling (#7244)
|
2024-09-01 21:23:29 -07:00 |
|
afeldman-nm
|
428dd1445e
|
[Core] Logprobs support in Multi-step (#7652)
|
2024-08-29 19:19:08 -07:00 |
|
Jonas M. Kübler
|
f205c09854
|
[Bugfix] Unify rank computation across regular decoding and speculative decoding (#7899)
|
2024-08-28 22:18:13 -07:00 |
|
Nick Hill
|
1856aff4d6
|
[Spec Decoding] Streamline batch expansion tensor manipulation (#7851)
|
2024-08-25 15:45:14 -07:00 |
|
Travis Johnson
|
cc0eaf12b1
|
[Bugfix] spec decode handle None entries in topk args in create_sequence_group_output (#7232)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
|
2024-08-22 09:33:48 -04:00 |
|
Abhinav Goyal
|
a3fce56b88
|
[Speculative Decoding] EAGLE Implementation with Top-1 proposer (#6830)
|
2024-08-22 02:42:24 -07:00 |
|
Antoni Baum
|
3b682179dd
|
[Core] Add AttentionState abstraction (#7663)
|
2024-08-20 18:50:45 +00:00 |
|
Abhinav Goyal
|
312f761232
|
[Speculative Decoding] Fixing hidden states handling in batch expansion (#7508)
|
2024-08-19 17:58:14 -07:00 |
|
SangBin Cho
|
ff7ec82c4d
|
[Core] Optimize SPMD architecture with delta + serialization optimization (#7109)
|
2024-08-18 17:57:20 -07:00 |
|
Roger Wang
|
bbf55c4805
|
[VLM] Refactor MultiModalConfig initialization and profiling (#7530)
|
2024-08-17 13:30:55 -07:00 |
|
William Lin
|
f366f6339b
|
[spec decode] [4/N] Move update_flash_attn_metadata to attn backend (#7571)
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
|
2024-08-16 11:41:56 -07:00 |
|
Mahesh Keralapura
|
933790c209
|
[Core] Add span metrics for model_forward, scheduler and sampler time (#7089)
|
2024-08-09 13:55:13 -07:00 |
|
William Lin
|
57b7be0e1c
|
[Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace (#6971)
|
2024-08-09 05:42:45 +00:00 |
|
Bongwon Jang
|
e9630458c7
|
[SpecDecode] Support FlashInfer in DraftModelRunner (#6926)
|
2024-08-05 08:05:05 -07:00 |
|
Cade Daniel
|
82a1b1a82b
|
[Speculative decoding] Add periodic log with time spent in proposal/scoring/verification (#6963)
|
2024-08-05 08:46:44 +00:00 |
|
Cyrus Leung
|
f230cc2ca6
|
[Bugfix] Fix broadcasting logic for multi_modal_kwargs (#6836)
|
2024-07-31 10:38:45 +08:00 |
|
Nick Hill
|
5cf9254a9c
|
[BugFix] Fix use of per-request seed with pipeline parallel (#6698)
|
2024-07-30 10:40:08 -07:00 |
|