Author | Commit | Message | Date
Cyrus Leung | a9bcc7afb2 | [Doc] Use intersphinx and update entrypoints docs (#5125) | 2024-05-30 09:59:23 -07:00
Cyrus Leung | 5ae5ed1e60 | [Core] Consolidate prompt arguments to LLM engines (#4328); Co-authored-by: Roger Wang <ywang@roblox.com> | 2024-05-28 13:29:31 -07:00
Nick Hill | eb6d3c264d | [Core] Eliminate parallel worker per-step task scheduling overhead (#4894) | 2024-05-23 06:17:27 +09:00
sasha0552 | 9b9a10d6cb | [Frontend] Dynamic RoPE scaling (#4638) | 2024-05-22 01:32:35 -04:00
Nick Hill | 676a99982f | [Core] Add MultiprocessingGPUExecutor (#4539); Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com> | 2024-05-14 10:38:59 -07:00
SangBin Cho | e7c46b9527 | [Scheduler] Warning upon preemption and Swapping (#4647); Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com> | 2024-05-13 23:50:44 +09:00
Chang Su | e254497b66 | [Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734) | 2024-05-11 11:30:37 -07:00
DearPlanet | 4302987069 | [Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics (#3937) | 2024-05-04 15:39:34 -07:00
Cody Yu | bc8ad68455 | [Misc][Refactor] Introduce ExecuteModelData (#4540) | 2024-05-03 17:47:07 -07:00
DefTruth | ce3f1eedf8 | [Misc] remove chunk detected debug logs (#4571) | 2024-05-03 04:48:08 +00:00
Roy | 3a922c1e7e | [Bugfix][Core] Fix and refactor logging stats (#4336) | 2024-05-01 20:08:14 +00:00
harrywu | f458112e8a | [Misc][Typo] type annotation fix (#4495) | 2024-04-30 20:21:39 -07:00
Ronen Schaffer | bf480c5302 | Add more Prometheus metrics (#2764); Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>; Co-authored-by: Robert Shaw <rshaw@neuralmagic.com> | 2024-04-28 15:59:33 -07:00
DefTruth | 9c7306ac11 | [Misc] fix typo in llm_engine init logging (#4428) | 2024-04-28 18:58:30 +08:00
Nick Hill | 81661da7b2 | [BugFix] Fix min_tokens when eos_token_id is None (#4389); Co-authored-by: DefTruth <31974251+deftruth@users.noreply.github.com> | 2024-04-27 09:52:46 -07:00
Roy | 7134303cbb | [Bugfix][Core] Fix get decoding config from ray (#4335) | 2024-04-27 11:30:08 +00:00
SangBin Cho | 603ad84815 | [Core] Refactoring sampler and support prompt logprob for chunked prefill (#4309) | 2024-04-26 13:02:02 +00:00
SangBin Cho | a88081bf76 | [CI] Disable non-lazy string operation on logging (#4326); Co-authored-by: Danny Guinther <dguinther@neuralmagic.com> | 2024-04-26 00:16:58 -07:00
Nick Hill | 15e7c675b0 | [Core] Add shutdown() method to ExecutorBase (#4349) | 2024-04-25 16:32:48 -07:00
Nick Hill | 479d69fad0 | [Core] Move ray_utils.py from engine to executor package (#4347) | 2024-04-25 06:52:22 +00:00
Cade Daniel | 62b8aebc6f | [Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. (#3951) | 2024-04-23 08:02:36 +00:00
GeauxEric | a37d815b83 | Make initialization of tokenizer and detokenizer optional (#3748); Co-authored-by: Yun Ding <yunding@nvidia.com>; Co-authored-by: Roger Wang <ywang@roblox.com> | 2024-04-21 22:06:46 +00:00
Simon Mo | a134ef6f5e | Support eos_token_id from generation_config.json (#4182) | 2024-04-19 04:13:36 +00:00
Cade Daniel | e95cd87959 | [Speculative decoding 6/9] Integrate speculative decoding with LLMEngine (#3894) | 2024-04-16 13:09:21 -07:00
Antoni Baum | 69e1d2fb69 | [Core] Refactor model loading code (#4097) | 2024-04-16 11:34:39 -07:00
Noam Gat | 05434764cd | LM Format Enforcer Guided Decoding Support (#3868); Co-authored-by: Simon Mo <simon.mo@hey.com> | 2024-04-16 05:54:57 +00:00
Sanger Steel | 711a000255 | [Frontend] [Core] feat: Add model loading using tensorizer (#3476) | 2024-04-13 17:13:01 -07:00
Nick Hill | e46a60aa4c | [BugFix] Fix handling of stop strings and stop token ids (#3672) | 2024-04-11 15:34:12 -07:00
SangBin Cho | 67b4221a61 | [Core][5/N] Fully working chunked prefill e2e (#3884) | 2024-04-10 17:56:48 -07:00
Cade Daniel | e7c7067b45 | [Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" (#3837) | 2024-04-09 11:44:15 -07:00
SangBin Cho | 18de883489 | [Chunked Prefill][4/n] Chunked prefill scheduler. (#3853) | 2024-04-05 10:17:58 -07:00
Matthias Gerstgrasser | aabe8f40f2 | [Core] [Frontend] Make detokenization optional (#3749); Co-authored-by: Nick Hill <nickhill@us.ibm.com> | 2024-04-03 21:52:18 -07:00
Adrian Abeyta | 2ff767b513 | Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290); Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>; Co-authored-by: HaiShaw <hixiao@gmail.com>; Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>; Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>; Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>; Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>; Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>; Co-authored-by: guofangze <guofangze@kuaishou.com>; Co-authored-by: Michael Goin <mgoin64@gmail.com>; Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>; Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> | 2024-04-03 14:15:55 -07:00
SangBin Cho | 3dcb3e8b98 | [3/N] Refactor scheduler for chunked prefill scheduling (#3550) | 2024-04-03 14:13:49 -07:00
Cade Daniel | 5757d90e26 | [Speculative decoding] Adding configuration object for speculative decoding (#3706); Co-authored-by: Lily Liu <lilyliupku@gmail.com> | 2024-04-03 00:40:57 +00:00
leiwen83 | ad6eca408b | Fix early CUDA init via get_architecture_class_name import (#3770); Signed-off-by: Lei Wen <wenlei03@qiyi.com>; Co-authored-by: Lei Wen <wenlei03@qiyi.com> | 2024-04-02 11:56:26 -07:00
bigPYJ1151 | 0e3f06fe9c | [Hardware][Intel] Add CPU inference backend (#3634); Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>; Co-authored-by: Yuan Zhou <yuan.zhou@intel.com> | 2024-04-01 22:07:30 -07:00
Roy | 6110c39dc8 | [BugFix] Fix tokenizer out of vocab size (#3685) | 2024-03-29 08:18:59 -07:00
yhu422 | d8658c8cc1 | Usage Stats Collection (#2852) | 2024-03-28 22:16:12 -07:00
SangBin Cho | b51c1cc9d2 | [2/N] Chunked prefill data update (#3538) | 2024-03-28 10:06:01 -07:00
Cade Daniel | 14ccd94c89 | [Core][Bugfix]Refactor block manager for better testability (#3492) | 2024-03-27 23:59:28 -07:00
Nick Hill | dfeb2ecc3a | [Misc] Include matched stop string/token in responses (#2976); Co-authored-by: Sahil Suneja <sahilsuneja@gmail.com> | 2024-03-25 17:31:32 -07:00
xwjiang2010 | 64172a976c | [Feature] Add vision language model support. (#3042) | 2024-03-25 14:16:30 -07:00
Travis Johnson | c13ad1b7bd | feat: implement the min_tokens sampling parameter (#3124); Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>; Co-authored-by: Nick Hill <nickhill@us.ibm.com> | 2024-03-25 10:14:26 -07:00
SangBin Cho | 01bfb22b41 | [CI] Try introducing isort. (#3495) | 2024-03-25 07:59:47 -07:00
Antoni Baum | bfdb1ba5c3 | [Core] Improve detokenization performance for prefill (#3469); Co-authored-by: MeloYang <meloyang05@gmail.com> | 2024-03-22 13:44:12 -07:00
Zhuohan Li | e90fc21f2e | [Hardware][Neuron] Refactor neuron support (#3471) | 2024-03-22 01:22:17 +00:00
Roy | 865732342b | [Misc][Log] Add log for tokenizer length not equal to vocabulary size (#3500) | 2024-03-21 18:07:48 +08:00
SangBin Cho | 6e435de766 | [1/n][Chunked Prefill] Refactor input query shapes (#3236) | 2024-03-20 14:46:05 -07:00
Antoni Baum | fb96c1e98c | Asynchronous tokenization (#2879) | 2024-03-15 23:37:01 +00:00