Author | Commit | Subject | Date
Simon Mo | a134ef6f5e | Support eos_token_id from generation_config.json (#4182) | 2024-04-19 04:13:36 +00:00
Cade Daniel | e95cd87959 | [Speculative decoding 6/9] Integrate speculative decoding with LLMEngine (#3894) | 2024-04-16 13:09:21 -07:00
Antoni Baum | 69e1d2fb69 | [Core] Refactor model loading code (#4097) | 2024-04-16 11:34:39 -07:00
Noam Gat | 05434764cd | LM Format Enforcer Guided Decoding Support (#3868) | 2024-04-16 05:54:57 +00:00
    Co-authored-by: Simon Mo <simon.mo@hey.com>
Sanger Steel | 711a000255 | [Frontend] [Core] feat: Add model loading using tensorizer (#3476) | 2024-04-13 17:13:01 -07:00
Nick Hill | e46a60aa4c | [BugFix] Fix handling of stop strings and stop token ids (#3672) | 2024-04-11 15:34:12 -07:00
SangBin Cho | 67b4221a61 | [Core][5/N] Fully working chunked prefill e2e (#3884) | 2024-04-10 17:56:48 -07:00
Cade Daniel | e7c7067b45 | [Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" (#3837) | 2024-04-09 11:44:15 -07:00
SangBin Cho | 18de883489 | [Chunked Prefill][4/n] Chunked prefill scheduler. (#3853) | 2024-04-05 10:17:58 -07:00
Matthias Gerstgrasser | aabe8f40f2 | [Core] [Frontend] Make detokenization optional (#3749) | 2024-04-03 21:52:18 -07:00
    Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Adrian Abeyta | 2ff767b513 | Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290) | 2024-04-03 14:15:55 -07:00
    Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
    Co-authored-by: HaiShaw <hixiao@gmail.com>
    Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
    Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
    Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
    Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
    Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
    Co-authored-by: guofangze <guofangze@kuaishou.com>
    Co-authored-by: Michael Goin <mgoin64@gmail.com>
    Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
    Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
SangBin Cho | 3dcb3e8b98 | [3/N] Refactor scheduler for chunked prefill scheduling (#3550) | 2024-04-03 14:13:49 -07:00
Cade Daniel | 5757d90e26 | [Speculative decoding] Adding configuration object for speculative decoding (#3706) | 2024-04-03 00:40:57 +00:00
    Co-authored-by: Lily Liu <lilyliupku@gmail.com>
leiwen83 | ad6eca408b | Fix early CUDA init via get_architecture_class_name import (#3770) | 2024-04-02 11:56:26 -07:00
    Signed-off-by: Lei Wen <wenlei03@qiyi.com>
    Co-authored-by: Lei Wen <wenlei03@qiyi.com>
bigPYJ1151 | 0e3f06fe9c | [Hardware][Intel] Add CPU inference backend (#3634) | 2024-04-01 22:07:30 -07:00
    Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
    Co-authored-by: Yuan Zhou <yuan.zhou@intel.com>
Roy | 6110c39dc8 | [BugFix] Fix tokenizer out of vocab size (#3685) | 2024-03-29 08:18:59 -07:00
yhu422 | d8658c8cc1 | Usage Stats Collection (#2852) | 2024-03-28 22:16:12 -07:00
SangBin Cho | b51c1cc9d2 | [2/N] Chunked prefill data update (#3538) | 2024-03-28 10:06:01 -07:00
Cade Daniel | 14ccd94c89 | [Core][Bugfix] Refactor block manager for better testability (#3492) | 2024-03-27 23:59:28 -07:00
Nick Hill | dfeb2ecc3a | [Misc] Include matched stop string/token in responses (#2976) | 2024-03-25 17:31:32 -07:00
    Co-authored-by: Sahil Suneja <sahilsuneja@gmail.com>
xwjiang2010 | 64172a976c | [Feature] Add vision language model support. (#3042) | 2024-03-25 14:16:30 -07:00
Travis Johnson | c13ad1b7bd | feat: implement the min_tokens sampling parameter (#3124) | 2024-03-25 10:14:26 -07:00
    Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
    Co-authored-by: Nick Hill <nickhill@us.ibm.com>
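A minimal usage sketch of the min_tokens parameter from #3124, assuming a vLLM build that includes the PR; the model name and prompt are only examples:

```python
# Hedged sketch: min_tokens (added in #3124) suppresses EOS and stop
# conditions until at least that many tokens have been generated.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # example model
params = SamplingParams(
    min_tokens=16,   # no EOS/stop match counts before 16 generated tokens
    max_tokens=64,
    stop=["\n\n"],
)
outputs = llm.generate(["Write a haiku about GPUs:"], params)
print(outputs[0].outputs[0].text)
```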
SangBin Cho | 01bfb22b41 | [CI] Try introducing isort. (#3495) | 2024-03-25 07:59:47 -07:00
Antoni Baum | bfdb1ba5c3 | [Core] Improve detokenization performance for prefill (#3469) | 2024-03-22 13:44:12 -07:00
    Co-authored-by: MeloYang <meloyang05@gmail.com>
Zhuohan Li | e90fc21f2e | [Hardware][Neuron] Refactor neuron support (#3471) | 2024-03-22 01:22:17 +00:00
Roy | 865732342b | [Misc][Log] Add log for tokenizer length not equal to vocabulary size (#3500) | 2024-03-21 18:07:48 +08:00
SangBin Cho | 6e435de766 | [1/n][Chunked Prefill] Refactor input query shapes (#3236) | 2024-03-20 14:46:05 -07:00
Antoni Baum | fb96c1e98c | Asynchronous tokenization (#2879) | 2024-03-15 23:37:01 +00:00
Tao He | 14b8ae02e7 | Fixes the misuse/mixuse of time.time()/time.monotonic() (#3220) | 2024-03-15 18:25:43 +00:00
    Signed-off-by: Tao He <sighingnow@gmail.com>
    Co-authored-by: simon-mo <simon.mo@hey.com>
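A small sketch of the clock discipline behind #3220: durations should come from time.monotonic(), which never jumps backwards, while time.time() (wall clock, subject to NTP and DST adjustments) is reserved for timestamps. The do_work() function is a hypothetical stand-in:

```python
import time

def do_work() -> None:
    # Hypothetical stand-in for a unit of engine work.
    time.sleep(0.1)

start = time.monotonic()            # monotonic clock: safe for durations
do_work()
elapsed = time.monotonic() - start  # immune to wall-clock adjustments

arrival_time = time.time()          # wall clock: right for "when" metadata
print(f"elapsed={elapsed:.3f}s arrival={arrival_time:.0f}")
```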
Zhuohan Li | 4c922709b6 | Add distributed model executor abstraction (#3191) | 2024-03-11 11:03:45 -07:00
Zhuohan Li | 2f8844ba08 | Re-enable the 80 char line width limit (#3305) | 2024-03-10 19:49:14 -07:00
Roy | 9e8744a545 | [BugFix] Fix get tokenizer when using ray (#3301) | 2024-03-10 19:17:16 -07:00
Nick Hill | 2efce05dc3 | [Fix] Avoid pickling entire LLMEngine for Ray workers (#3207) | 2024-03-06 00:17:20 +00:00
    Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Nick Hill | 8999ec3c16 | Store eos_token_id in Sequence for easy access (#3166) | 2024-03-05 15:35:43 -08:00
Antoni Baum | ff578cae54 | Add health check, make async Engine more robust (#3015) | 2024-03-04 22:01:40 +00:00
    Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Antoni Baum | 22de45235c | Push logprob generation to LLMEngine (#3065) | 2024-03-04 19:54:06 +00:00
    Co-authored-by: Avnish Narayan <avnish@anyscale.com>
Philipp Moritz | 17c3103c56 | Make it easy to profile workers with nsight (#3162) | 2024-03-03 16:19:13 -08:00
    Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Jason Cox | d65fac2738 | Add vLLM version info to logs and openai API server (#3161) | 2024-03-02 21:00:29 -08:00
Sage Moore | ce4f5a29fb | Add Automatic Prefix Caching (#2762) | 2024-03-02 00:50:01 -08:00
    Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
    Co-authored-by: Michael Goin <michael@neuralmagic.com>
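A hedged usage sketch of automatic prefix caching from #2762, assuming the enable_prefix_caching engine flag the PR exposes; model and prompts are illustrative:

```python
# Requests sharing a long common prefix (e.g. a fixed system prompt) can
# reuse cached KV blocks instead of recomputing them per request.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
system = "You are a terse assistant. " * 50          # long shared prefix
questions = ["What is vLLM?", "What is a KV cache?"]
outputs = llm.generate([system + q for q in questions],
                       SamplingParams(max_tokens=32))
for out in outputs:
    print(out.outputs[0].text)
```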
Sherry | 54d3544784 | Fix: Output text is always truncated in some models (#3016) | 2024-03-01 07:52:22 +00:00
Nick Hill | 29a8d6a554 | [Fix] Don't deep-copy LogitsProcessors when copying SamplingParams (#3099) | 2024-02-29 19:20:42 +00:00
Allen.Dou | 9289e577ec | add cache_config's info to prometheus metrics. (#3100) | 2024-02-29 06:15:18 +00:00
Liangfu Chen | 3b7178cfa4 | [Neuron] Support inference with transformers-neuronx (#2569) | 2024-02-28 09:34:34 -08:00
Roy | c1c0d00b88 | Don't use cupy when enforce_eager=True (#3037) | 2024-02-26 17:33:38 -08:00
Harry Mellor | ef978fe411 | Port metrics from aioprometheus to prometheus_client (#2730) | 2024-02-25 11:54:00 -08:00
Ronen Schaffer | 4caf7044e0 | Include tokens from prompt phase in counter_generation_tokens (#2802) | 2024-02-22 14:00:12 -08:00
Antoni Baum | 017d9f1515 | Add metrics to RequestOutput (#2876) | 2024-02-20 21:55:57 -08:00
Ronen Schaffer | e433c115bc | Fix vllm:prompt_tokens_total metric calculation (#2869) | 2024-02-18 23:55:41 -08:00
Nick Hill | 185b2c29e2 | Defensively copy sampling_params (#2881) | 2024-02-17 11:18:04 -08:00
    If the SamplingParams object passed to LLMEngine.add_request() is mutated after it returns, it could affect the async sampling process for that request.
    Suggested by @Yard1: https://github.com/vllm-project/vllm/pull/2514#discussion_r1490106059
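A self-contained sketch of the hazard #2881 guards against, not vLLM's actual implementation; Engine and the simplified SamplingParams below are illustrative stand-ins:

```python
import copy
from dataclasses import dataclass, field

@dataclass
class SamplingParams:          # simplified stand-in, not vLLM's class
    temperature: float = 1.0
    stop: list = field(default_factory=list)

class Engine:                  # illustrative only
    def __init__(self) -> None:
        self._requests = {}

    def add_request(self, request_id: str, params: SamplingParams) -> None:
        # Defensive copy: mutations the caller makes after this call
        # cannot leak into the (possibly async) sampling for the request.
        self._requests[request_id] = copy.deepcopy(params)

engine = Engine()
params = SamplingParams(temperature=0.7)
engine.add_request("req-0", params)
params.temperature = 0.0       # caller mutates after add_request()
assert engine._requests["req-0"].temperature == 0.7   # request unaffected
```

Note that the later #3099 entry above refines the copy strategy so that LogitsProcessors attached to SamplingParams are not deep-copied along with the rest.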
Woosuk Kwon | a463c333dd | Use CuPy for CUDA graphs (#2811) | 2024-02-13 11:32:06 -08:00