335 Commits

Author SHA1 Message Date
Zhuohan Li
f04908cae7
[FIX] Minor bug fixes (#1035)
* [FIX] Minor bug fixes

* Address review comments
2023-09-13 16:38:12 -07:00
Jasmond L
ab019eea75
Add Model Revision Support (#1014)
Co-authored-by: Jasmond Loh <Jasmond.Loh@hotmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-09-13 15:20:02 -07:00
Antoni Baum
9841d48a10
Use TGI-like incremental detokenization (#984) 2023-09-13 13:38:01 -07:00
Ikko Eltociear Ashimine
3272d7a0b7
Fix typo in README.md (#1033) 2023-09-13 12:55:23 -07:00
Antoni Baum
0bb1e885a0
Make max_model_len configurable (#972) 2023-09-12 16:29:19 -07:00
leiwen83
d6545ad22e
add option to shorten prompt print in log (#991)
Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-09-12 15:10:14 -07:00
Woosuk Kwon
90eb3f43ca
Bump up the version to v0.1.7 (#1013) v0.1.7 2023-09-11 00:54:30 -07:00
Woosuk Kwon
e67b4f2c2a
Use FP32 in RoPE initialization (#1004)
Co-authored-by: One <imone@tuta.io>
2023-09-11 00:26:35 -07:00
Woosuk Kwon
d6770d1f23
Update setup.py (#1006) 2023-09-10 23:42:45 -07:00
Woosuk Kwon
b9cecc2635
[Docs] Update installation page (#1005) 2023-09-10 14:23:31 -07:00
Kyujin Cho
898285c9bf
fix: CUDA error when inferencing with Falcon-40B base model (#992) 2023-09-10 01:39:02 -07:00
Antoni Baum
a62de9ecfd
Fix wrong dtype in PagedAttentionWithALiBi bias (#996)
---------

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
2023-09-09 14:58:35 -07:00
Jingru
4042d192f5
fix "tansformers_module" ModuleNotFoundError when load model with trust_remote_code=True (#871) 2023-09-08 17:21:30 -07:00
Zhuohan Li
1117aa1411
Bump up the version to v0.1.6 (#989) v0.1.6 2023-09-08 00:07:46 -07:00
Antoni Baum
080438477f
Start background task in AsyncLLMEngine.generate (#988)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-09-08 00:03:39 -07:00
Robert Irvine
4b5bcf8906
faster startup of vLLM (#982)
* update

---------

Co-authored-by: Robert Irvine <robert@seamlessml.com>
2023-09-08 14:48:54 +09:00
Woosuk Kwon
852ef5b4f5
Bump up the version to v0.1.5 (#944) v0.1.5 2023-09-07 16:15:31 -07:00
Zhuohan Li
db09d4ad83
[FIX] Fix Alibi implementation in PagedAttention kernel (#945)
* [FIX] Fix Alibi implementation in PagedAttention kernel

* Fix test_attention

* Fix

---------

Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Oliver-ss <yuansongwx@outlook.com>
2023-09-07 15:53:14 -07:00
Zhuohan Li
c957c741d9
Enable safetensors loading for all models (#974) 2023-09-07 15:49:52 -07:00
Antoni Baum
c07ece5ca4
Make AsyncLLMEngine more robust & fix batched abort (#969)
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
2023-09-07 13:43:45 -07:00
Woosuk Kwon
7a9c20c715
Bum up transformers version (#976) 2023-09-07 13:15:53 -07:00
Antoni Baum
005ba458b5
Set torch default dtype in a context manager (#971)
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
2023-09-07 15:39:37 +09:00
Woosuk Kwon
320a622ec4
[BugFix] Implement RoPE for GPT-J (#941) 2023-09-06 11:54:33 +09:00
Antoni Baum
c9927c1a6a
Use queue for finished requests (#957) 2023-09-05 19:27:23 -07:00
Woosuk Kwon
fbd80ad409
Clean up kernel unit tests (#938) 2023-09-05 16:57:38 -07:00
Wen Sun
22379d5513
fix: typo (#948) 2023-09-04 23:22:30 -07:00
Antoni Baum
1696725879
Initialize AsyncLLMEngine bg loop correctly (#943) 2023-09-04 17:41:22 -07:00
Zhuohan Li
002800f081
Align vLLM's beam search implementation with HF generate (#857) 2023-09-04 17:29:42 -07:00
Nelson Liu
e15932bb60
Only emit warning about internal tokenizer if it isn't being used (#939) 2023-09-05 00:50:55 +09:00
Antoni Baum
ce741ba3e4
Refactor AsyncLLMEngine (#880) 2023-09-03 21:43:43 -07:00
Woosuk Kwon
bf87484efa
[BugFix] Fix NaN errors in paged attention kernel (#936) 2023-09-04 09:20:06 +09:00
Woosuk Kwon
8ce9c50d40
Avoid compiling kernels for double data type (#933) 2023-09-02 14:59:47 +09:00
Woosuk Kwon
32b6816e55
Add tests for models (#922) 2023-09-01 11:19:43 +09:00
Zhuohan Li
c128d69856
Fix README.md Link (#927) 2023-08-31 17:18:34 -07:00
Woosuk Kwon
55b28b1eee
[Docs] Minor fixes in supported models (#920)
* Minor fix in supported models

* Add another small fix for Aquila model

---------

Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-08-31 16:28:39 -07:00
Dong-Yong Lee
e11222333f
fix: bug fix when penalties are negative (#913)
Co-authored-by: dongyong-lee <dongyong.lee@navercorp.com>
2023-09-01 00:37:17 +09:00
Aman Gupta Karmani
28873a2799
Improve _prune_hidden_states micro-benchmark (#707) 2023-08-31 13:28:43 +09:00
Zhuohan Li
0080d8329d Add acknowledgement to a16z grant 2023-08-30 02:26:47 -07:00
JFDuan
0d93f15694
Accelerate LLaMA model loading (#234) 2023-08-30 01:00:13 -07:00
lplcor
becd7a56f1
Enable request body OpenAPI spec for OpenAI endpoints (#865) 2023-08-29 21:54:08 -07:00
Aman Gupta Karmani
75471386de
use flash-attn via xformers (#877) 2023-08-29 21:52:13 -07:00
Zhuohan Li
d2b2eed67c
[Fix] Fix a condition for ignored sequences (#867) 2023-08-27 23:00:56 -07:00
Antoni Baum
4b6f069b6f
Add support for CodeLlama (#854) 2023-08-25 12:44:07 -07:00
Woosuk Kwon
791d79de32
Bump up the version to v0.1.4 (#846) v0.1.4 2023-08-25 12:28:00 +09:00
Woosuk Kwon
94d2f59895
Set replacement=True in torch.multinomial (#858) 2023-08-25 12:22:01 +09:00
wenjun93
75c0ca9d43
Clean up code (#844) 2023-08-23 16:44:15 -07:00
Woosuk Kwon
2a4ec90854
Fix for breaking changes in xformers 0.0.21 (#834) 2023-08-23 17:44:21 +09:00
ldwang
85ebcda94d
Fix typo of Aquila in README.md (#836) 2023-08-22 20:48:36 -07:00
Woosuk Kwon
d64bf1646c
Implement approximate GELU kernels (#828) 2023-08-23 07:43:21 +09:00
Woosuk Kwon
a41c20435e
Add compute capability 8.9 to default targets (#829) 2023-08-23 07:28:38 +09:00