389 Commits

Author / SHA / Message / Date
Zhuohan Li
ba0bfd40e2
TP/quantization/weight loading refactor part 1 - Simplify parallel linear logic (#1181) 2023-10-02 15:36:09 -07:00
Woosuk Kwon
84e4e37d14
[Minor] Fix type annotations (#1238) 2023-10-02 15:28:31 -07:00
Zhuohan Li
a60b353005
Support sharding llama2-70b on more than 8 GPUs (#1209)
Co-authored-by: JiCheng <247153481@qq.com>
2023-10-02 15:26:33 -07:00
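A minimal usage sketch for the change above, assuming the `tensor_parallel_size` engine argument from vLLM's Python API of this period; the model id and GPU count are illustrative, and spanning more than one node requires a Ray cluster:

```python
from vllm import LLM

# Shard a Llama-2-70B checkpoint across 16 GPUs, i.e. more than the
# 8 previously supported. Model id and parallel degree are illustrative.
llm = LLM(model="meta-llama/Llama-2-70b-hf", tensor_parallel_size=16)
```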
Liang
ebe4d1db3a
Fix boundary check in paged attention kernel (#1241) 2023-10-01 11:35:06 -07:00
kg6-sleipnir
b5a10eb0ef
Added dtype arg to benchmarks (#1228) 2023-09-30 21:04:03 -07:00
Usama Ahmed
0967102c6d
Fix typo in tiiuae/falcon-rw-7b model name (#1226) 2023-09-29 13:40:25 -07:00
Woosuk Kwon
e2fb71ec9f
Bump up the version to v0.2.0 (#1212) v0.2.0 2023-09-28 15:30:38 -07:00
Woosuk Kwon
f936657eb6
Provide default max model length (#1224) 2023-09-28 14:44:02 -07:00
Woosuk Kwon
6f88f762bf
Fix OOM in attention kernel test (#1223) 2023-09-28 14:33:24 -07:00
Woosuk Kwon
202351d5bf
Add Mistral to supported model list (#1221) 2023-09-28 14:33:04 -07:00
Woosuk Kwon
2e8e49fce3
[Fix] Remove false assertion (#1222) 2023-09-28 10:52:38 -07:00
Woosuk Kwon
a8e98aee0c
Fix Mistral model (#1220) 2023-09-28 10:44:05 -07:00
Chris Bamford
bb1ba58f06
[Mistral] Mistral-7B-v0.1 support (#1196)
Co-authored-by: timlacroix <t@mistral.ai>
2023-09-28 10:41:03 -07:00
Qing
7bedab5748
Add rope_scaling to Qwen (#1210) 2023-09-28 00:49:23 -07:00
Dan Lord
20f7cc4cde
Add skip_special_tokens sampling params (#1186) 2023-09-27 19:21:42 -07:00
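A minimal sketch of the new `skip_special_tokens` sampling parameter from the change above; the model and prompt are illustrative:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # illustrative small model

# skip_special_tokens defaults to True; set it to False to keep
# special tokens such as the EOS marker in the decoded text.
params = SamplingParams(max_tokens=32, skip_special_tokens=False)
output = llm.generate(["Hello, my name is"], params)[0]
print(output.outputs[0].text)
```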
Danilo Peixoto
649aa730c5
Use standard extras for uvicorn (#1166) 2023-09-27 17:41:36 -07:00
Woosuk Kwon
a19bc5c628
Automatically configure max_num_batched_tokens (#1198) 2023-09-27 16:34:00 -07:00
Qing
28e616c4e3
Fix qwen-14b model (#1173) 2023-09-27 16:33:16 -07:00
Wang Ran (汪然)
30e775281d
Fix typo (#1184)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-09-27 16:22:45 -07:00
Lily Liu
21877b0d75
Support Longchat and RoPE scaling (#555)
Co-authored-by: Wing Lian <wing.lian@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-09-27 03:36:02 -07:00
Antoni Baum
cf5cb1e33e
Allocate more shared memory to attention kernel (#1154) 2023-09-26 22:27:13 -07:00
Woosuk Kwon
03ffd0a022
Add comments on RoPE initialization (#1176) 2023-09-26 10:48:33 -07:00
Woosuk Kwon
a425bd9a9a
[Setup] Enable TORCH_CUDA_ARCH_LIST for selecting target GPUs (#1074) 2023-09-26 10:21:08 -07:00
Wen Sun
bbbf86565f
Align max_tokens behavior with openai (#852) 2023-09-23 18:10:13 -07:00
Woosuk Kwon
9f6be8692e
Fix config for Falcon (#1164) 2023-09-23 17:38:43 -07:00
Zhuohan Li
f187877945
[FIX] Simplify sampler logic (#1156) 2023-09-23 17:21:56 -07:00
Zhuohan Li
947b794146
[Sampler] Vectorized sampling (simplified) (#1048)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2023-09-22 17:48:04 -07:00
Woosuk Kwon
8d926e91f1
Announce the First vLLM Meetup (#1148) 2023-09-22 11:37:14 -07:00
Nick Perez
4ee52bb169
Docs: Fix broken link to openai example (#1145)
Link to `openai_client.py` is no longer valid - updated to `openai_completion_client.py`
2023-09-22 11:36:09 -07:00
Woosuk Kwon
7d7e3b78a3
Use --ipc=host in docker run for distributed inference (#1125) 2023-09-21 18:26:47 -07:00
Ricardo Lu
f98b745a81
feat: support stop_token_ids parameter (#1097) 2023-09-21 15:34:02 -07:00
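A minimal sketch of the `stop_token_ids` parameter added above; the token id and model are illustrative:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # illustrative

# Stop generation as soon as any of the listed token ids is emitted;
# 2 is a common EOS id, but the right value is model-specific.
params = SamplingParams(max_tokens=64, stop_token_ids=[2])
print(llm.generate(["The capital of France is"], params)[0].outputs[0].text)
```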
Roy
2d1e86f1b1
Clean API code, remove redundant background task (#1102) 2023-09-21 13:25:05 -07:00
Woosuk Kwon
1ac4ccf73c
Add float16 and float32 (#1115) 2023-09-21 00:52:47 -07:00
Woosuk Kwon
2ac4d5e2bf
Replace DtypeTensor (#1123) 2023-09-21 00:51:47 -07:00
Antoni Baum
3302f0aef3
rope_theta and max_position_embeddings from config (#1096)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: wnma3mz <wnma3mz@gmail.com>
2023-09-20 13:35:11 -07:00
Tanmay Verma
6f2dd6c37e
Add documentation to Triton server tutorial (#983) 2023-09-20 10:32:40 -07:00
Woosuk Kwon
bc0644574c
Add gpu_memory_utilization and swap_space to LLM (#1090) 2023-09-19 22:16:04 -07:00
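A minimal sketch of the two `LLM` arguments added above; the values and model id are illustrative:

```python
from vllm import LLM

# Let the engine use up to 90% of GPU memory for weights and KV cache,
# and reserve 4 GiB of CPU swap space per GPU for preempted sequences.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # illustrative
    gpu_memory_utilization=0.90,
    swap_space=4,
)
```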
Woosuk Kwon
400b8289f7
Add pyarrow to dependencies & Print warning on Ray import error (#1094) 2023-09-18 22:36:17 -07:00
Zhuohan Li
c1026311b5
[Community] Add vLLM Discord server (#1086) 2023-09-18 12:23:35 -07:00
Woosuk Kwon
2b1c116b5a
Add minimum capability requirement for AWQ (#1064) 2023-09-18 12:02:01 -07:00
Woosuk Kwon
cc796b1358
Convert before transpose (#1073) 2023-09-18 11:51:48 -07:00
Zhuohan Li
f029ef94d7
Fix get_max_num_running_seqs for waiting and swapped seq groups (#1068) 2023-09-18 11:49:40 -07:00
Roy
95592fa00a
Align llm_engine and async_engine (#1081) 2023-09-18 11:49:10 -07:00
orellavie1212
fbe66e1d0b
Added support for quantization on the LLM module (#1080) 2023-09-18 11:04:21 -07:00
Zhuohan Li
90979c38f8
[FIX] Don't initialize parameter by default (#1067) 2023-09-17 17:15:38 -07:00
陈序
e21d7687a9
Fix hanging when prompt exceeds limit (#1029) 2023-09-17 01:48:56 -07:00
Antoni Baum
ff36139ffc
Remove AsyncLLMEngine busy loop, shield background task (#1059) 2023-09-17 00:29:08 -07:00
Woosuk Kwon
e3e79e9e8a
Implement AWQ quantization support for LLaMA (#1032)
Co-authored-by: Robert Irvine <robert@seamlessml.com>
Co-authored-by: root <rirv938@gmail.com>
Co-authored-by: Casper <casperbh.96@gmail.com>
Co-authored-by: julian-q <julianhquevedo@gmail.com>
2023-09-16 00:03:37 -07:00
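A minimal sketch of AWQ inference as enabled above, assuming a checkpoint already exported in AWQ format and a GPU meeting the minimum compute capability from #1064; the model id is illustrative:

```python
from vllm import LLM, SamplingParams

# Load pre-quantized AWQ weights via the quantization argument.
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")
print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```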
Jerry Yang
b9fe4616f9
Abort when coroutine is cancelled (#1020) 2023-09-14 17:40:18 -07:00
Woosuk Kwon
64ca424e75
Fix warning message on LLaMA FastTokenizer (#1037) 2023-09-14 17:33:32 -07:00