Woosuk Kwon
30fb0956df
[Minor] Add more detailed explanation on quantization argument ( #2145 )
2023-12-17 01:56:16 -08:00
Woosuk Kwon
c3372e87be
Remove dependency on CuPy ( #2152 )
2023-12-17 01:49:07 -08:00
Woosuk Kwon
37ca558103
Optimize model execution with CUDA graph ( #1926 )
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2023-12-16 21:12:08 -08:00
CHU Tianxiang
0fbfc4b81b
Add GPTQ support ( #916 )
2023-12-15 03:04:22 -08:00
Yunfeng Bai
c06170cc8e
Add a flag to include stop string in output text ( #1976 )
2023-12-15 00:45:58 -08:00
mezuzza
6774bd50b0
Fix typing in AsyncLLMEngine & add toml to requirements-dev ( #2100 )
2023-12-14 00:19:41 -08:00
TJian
6ccc0bfffb
Merge EmbeddedLLM/vllm-rocm into vLLM main ( #1836 )
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Amir Balwel <amoooori04@gmail.com>
Co-authored-by: root <kuanfu.liu@akirakan.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: kuanfu <kuanfu.liu@embeddedllm.com>
Co-authored-by: miloice <17350011+kliuae@users.noreply.github.com>
2023-12-07 23:16:52 -08:00
Woosuk Kwon
464dd985e3
Fix num_gpus when TP > 1 ( #1852 )
2023-12-03 12:24:30 -08:00
Simon Mo
5313c2cb8b
Add Production Metrics in Prometheus format ( #1890 )
2023-12-02 16:37:44 -08:00
Woosuk Kwon
27feead2f8
Refactor Worker & InputMetadata ( #1843 )
2023-11-29 22:16:37 -08:00
FlorianJoncour
0229c386c5
Better integration with Ray Serve ( #1821 )
Co-authored-by: FlorianJoncour <florian@zetta-sys.com>
2023-11-29 13:25:43 -08:00
Zhuohan Li
708e6c18b0
[FIX] Fix class naming ( #1803 )
2023-11-28 14:08:01 -08:00
Casper
a921d8be9d
[DOCS] Add engine args documentation ( #1741 )
2023-11-22 12:31:27 -08:00
boydfd
4bb6b67188
Fix RAM OOM when loading large models in tensor parallel mode ( #1395 )
Co-authored-by: ran_lin <rlin@thoughtworks.com>
2023-11-20 19:02:42 -08:00
Simon Mo
5ffc0d13a2
Migrate linter from pylint to ruff ( #1665 )
2023-11-20 11:58:01 -08:00
Simon Mo
cb08cd0d75
[Minor] Fix duplication of ignored seq group in engine step ( #1666 )
2023-11-16 13:11:41 -08:00
Zhuohan Li
7076fa1c9f
TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models ( #1622 )
Refactor the tensor-parallelism, quantization, and weight-loading code.
Summary of the new features enabled by this PR:
- **All models** can be quantized with AWQ and SqueezeLLM, and [soon GPTQ](https://github.com/vllm-project/vllm/pull/1580 ).
- Model loading code became much simpler.
- Model parallelism is supported for all MQA/GQA models even when the number of key/value heads is smaller than the tensor parallel size.
2023-11-15 22:50:41 -08:00
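The refactor above hinges on routing every model's linear layers through one quantized-linear abstraction chosen by the quantization method. A minimal, hypothetical sketch of such a method registry (class and function names are illustrative, not vLLM's actual API):

```python
from typing import Optional, Type

class LinearBase:
    """Stand-in for an unquantized (fp16) linear layer."""
    def describe(self) -> str:
        return "fp16 linear (no quantization)"

class AWQLinear(LinearBase):
    def describe(self) -> str:
        return "AWQ quantized linear"

class SqueezeLLMLinear(LinearBase):
    def describe(self) -> str:
        return "SqueezeLLM quantized linear"

# One registry shared by all models: adding a method here enables it everywhere.
_QUANT_REGISTRY = {
    None: LinearBase,
    "awq": AWQLinear,
    "squeezellm": SqueezeLLMLinear,
}

def get_linear_cls(quantization: Optional[str]) -> Type[LinearBase]:
    """Resolve the linear-layer class for a quantization method name."""
    try:
        return _QUANT_REGISTRY[quantization]
    except KeyError:
        raise ValueError(f"Unsupported quantization method: {quantization}")
```

Because model-building code asks the registry rather than hard-coding a layer type, a new quantization method (e.g. GPTQ in #916) only needs a new registry entry.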
Dominik Schwabe
1b290ace4f
Run default _AsyncLLMEngine._run_workers_async in threadpool ( #1628 )
2023-11-11 14:50:44 -08:00
ljss
5687d584fe
[BugFix] Set engine_use_ray=True when TP>1 ( #1531 )
2023-11-01 02:14:18 -07:00
Dan Lord
7013a80170
Add support for spaces_between_special_tokens
2023-10-30 16:52:56 -07:00
chooper1
1f24755bf8
Support SqueezeLLM ( #1326 )
Co-authored-by: squeeze-ai-lab <squeezeailab.bair@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-10-21 23:14:59 -07:00
Woosuk Kwon
c1376e0f82
Change scheduler & input tensor shape ( #1381 )
2023-10-16 17:48:42 -07:00
Zhuohan Li
9d9072a069
Implement prompt logprobs & Batched topk for computing logprobs ( #1328 )
Co-authored-by: Yunmo Chen <16273544+wanmok@users.noreply.github.com>
2023-10-16 10:56:50 -07:00
Antoni Baum
acbed3ef40
Use monotonic time where appropriate ( #1249 )
2023-10-02 19:22:05 -07:00
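Commits like #1249 replace wall-clock timestamps with a monotonic clock wherever elapsed time is measured. A self-contained illustration of the distinction using only Python's standard library:

```python
import time

def measure(fn):
    """Return (result, elapsed_seconds) for calling fn().

    time.monotonic() never goes backwards and is unaffected by system
    clock adjustments (NTP, DST), so it is the right clock for measuring
    durations; time.time() can jump and would skew latency metrics.
    """
    start = time.monotonic()
    result = fn()
    return result, time.monotonic() - start

result, elapsed = measure(lambda: sum(range(1000)))
```

The same rule applies to request timeouts and throughput stats in a serving engine: use wall-clock time only when a human-readable timestamp is actually needed.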
Federico Cassano
66d18a7fb0
add support for tokenizer revision ( #1163 )
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-10-02 19:19:46 -07:00
Woosuk Kwon
f936657eb6
Provide default max model length ( #1224 )
2023-09-28 14:44:02 -07:00
Chris Bamford
bb1ba58f06
[Mistral] Mistral-7B-v0.1 support ( #1196 )
Co-authored-by: timlacroix <t@mistral.ai>
2023-09-28 10:41:03 -07:00
Dan Lord
20f7cc4cde
Add skip_special_tokens sampling params ( #1186 )
2023-09-27 19:21:42 -07:00
Woosuk Kwon
a19bc5c628
Automatically configure max_num_batched_tokens ( #1198 )
2023-09-27 16:34:00 -07:00
Wang Ran (汪然)
30e775281d
fix typo ( #1184 )
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-09-27 16:22:45 -07:00
Ricardo Lu
f98b745a81
feat: support stop_token_ids parameter ( #1097 )
2023-09-21 15:34:02 -07:00
Woosuk Kwon
1ac4ccf73c
Add float16 and float32 ( #1115 )
2023-09-21 00:52:47 -07:00
Woosuk Kwon
400b8289f7
Add pyarrow to dependencies & Print warning on Ray import error ( #1094 )
2023-09-18 22:36:17 -07:00
Roy
95592fa00a
align llm_engine and async_engine. ( #1081 )
2023-09-18 11:49:10 -07:00
陈序
e21d7687a9
Fix hanging when prompt exceeds limit ( #1029 )
2023-09-17 01:48:56 -07:00
Antoni Baum
ff36139ffc
Remove AsyncLLMEngine busy loop, shield background task ( #1059 )
2023-09-17 00:29:08 -07:00
Woosuk Kwon
e3e79e9e8a
Implement AWQ quantization support for LLaMA ( #1032 )
Co-authored-by: Robert Irvine <robert@seamlessml.com>
Co-authored-by: root <rirv938@gmail.com>
Co-authored-by: Casper <casperbh.96@gmail.com>
Co-authored-by: julian-q <julianhquevedo@gmail.com>
2023-09-16 00:03:37 -07:00
Jerry Yang
b9fe4616f9
Abort when coroutine is cancelled ( #1020 )
2023-09-14 17:40:18 -07:00
Jasmond L
ab019eea75
Add Model Revision Support ( #1014 )
Co-authored-by: Jasmond Loh <Jasmond.Loh@hotmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-09-13 15:20:02 -07:00
Antoni Baum
9841d48a10
Use TGI-like incremental detokenization ( #984 )
2023-09-13 13:38:01 -07:00
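Incremental detokenization, as used in TGI and adopted in #984, decodes the growing token sequence each step but emits only the text that extends the previously emitted prefix, so partially decoded multi-byte characters are never streamed. A toy sketch with a stand-in decoder (not the actual vLLM/TGI code):

```python
def decode(tokens):
    """Stand-in for a real tokenizer's decode(); here tokens are strings."""
    return "".join(tokens)

class IncrementalDetokenizer:
    """Emit only the newly finalized text as tokens stream in."""
    def __init__(self):
        self.tokens = []
        self.emitted_len = 0  # length of text already sent to the client

    def push(self, token) -> str:
        self.tokens.append(token)
        text = decode(self.tokens)          # re-decode the full sequence
        new_text = text[self.emitted_len:]  # only the unseen suffix
        self.emitted_len = len(text)
        return new_text

d = IncrementalDetokenizer()
pieces = [d.push(t) for t in ["Hel", "lo", ", world"]]
```

A real implementation additionally holds back a suffix until the decode is stable (e.g. while a byte-fallback token is incomplete); this sketch omits that detail.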
Antoni Baum
0bb1e885a0
Make max_model_len configurable ( #972 )
2023-09-12 16:29:19 -07:00
leiwen83
d6545ad22e
add option to shorten prompt print in log ( #991 )
Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-09-12 15:10:14 -07:00
Jingru
4042d192f5
Fix "tansformers_module" ModuleNotFoundError when loading a model with trust_remote_code=True ( #871 )
2023-09-08 17:21:30 -07:00
Antoni Baum
080438477f
Start background task in AsyncLLMEngine.generate ( #988 )
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-09-08 00:03:39 -07:00
Zhuohan Li
c957c741d9
Enable safetensors loading for all models ( #974 )
2023-09-07 15:49:52 -07:00
Antoni Baum
c07ece5ca4
Make AsyncLLMEngine more robust & fix batched abort ( #969 )
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
2023-09-07 13:43:45 -07:00
Antoni Baum
c9927c1a6a
Use queue for finished requests ( #957 )
2023-09-05 19:27:23 -07:00
Wen Sun
22379d5513
fix: typo ( #948 )
2023-09-04 23:22:30 -07:00
Antoni Baum
1696725879
Initialize AsyncLLMEngine bg loop correctly ( #943 )
2023-09-04 17:41:22 -07:00
Zhuohan Li
002800f081
Align vLLM's beam search implementation with HF generate ( #857 )
2023-09-04 17:29:42 -07:00