78 Commits

Author SHA1 Message Date
Woosuk Kwon
30fb0956df
[Minor] Add more detailed explanation on quantization argument (#2145) 2023-12-17 01:56:16 -08:00
Woosuk Kwon
c3372e87be
Remove dependency on CuPy (#2152) 2023-12-17 01:49:07 -08:00
Woosuk Kwon
37ca558103
Optimize model execution with CUDA graph (#1926)
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2023-12-16 21:12:08 -08:00
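A minimal usage sketch for the CUDA graph change above (#1926): graph capture becomes the default execution path, with an `enforce_eager` engine argument to opt back into eager mode. The flag name and model ID are assumptions about the API of this release, not taken from this log.

```python
# Sketch: CUDA graph execution is the default after #1926;
# enforce_eager=True would fall back to eager-mode execution.
# The flag name and model ID are illustrative assumptions.
from vllm import LLM

llm = LLM(model="facebook/opt-125m", enforce_eager=False)  # graphs enabled
outputs = llm.generate(["Hello, my name is"])
print(outputs[0].outputs[0].text)
```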
CHU Tianxiang
0fbfc4b81b
Add GPTQ support (#916) 2023-12-15 03:04:22 -08:00
Yunfeng Bai
c06170cc8e
Add a flag to include stop string in output text (#1976) 2023-12-15 00:45:58 -08:00
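A hedged sketch of the stop-string flag above (#1976): `include_stop_str_in_output` keeps the matched stop string in the returned text. The parameter name follows the PR title; the model ID is illustrative.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # illustrative model
params = SamplingParams(stop=["\n"], include_stop_str_in_output=True)
out = llm.generate(["Q: What is 2 + 2?\nA:"], params)
print(out[0].outputs[0].text)  # the matched stop string is retained
```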
mezuzza
6774bd50b0
Fix typing in AsyncLLMEngine & add toml to requirements-dev (#2100) 2023-12-14 00:19:41 -08:00
TJian
6ccc0bfffb
Merge EmbeddedLLM/vllm-rocm into vLLM main (#1836)
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Amir Balwel <amoooori04@gmail.com>
Co-authored-by: root <kuanfu.liu@akirakan.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: kuanfu <kuanfu.liu@embeddedllm.com>
Co-authored-by: miloice <17350011+kliuae@users.noreply.github.com>
2023-12-07 23:16:52 -08:00
Woosuk Kwon
464dd985e3
Fix num_gpus when TP > 1 (#1852) 2023-12-03 12:24:30 -08:00
Simon Mo
5313c2cb8b
Add Production Metrics in Prometheus format (#1890) 2023-12-02 16:37:44 -08:00
Woosuk Kwon
27feead2f8
Refactor Worker & InputMetadata (#1843) 2023-11-29 22:16:37 -08:00
FlorianJoncour
0229c386c5
Better integration with Ray Serve (#1821)
Co-authored-by: FlorianJoncour <florian@zetta-sys.com>
2023-11-29 13:25:43 -08:00
Zhuohan Li
708e6c18b0
[FIX] Fix class naming (#1803) 2023-11-28 14:08:01 -08:00
Casper
a921d8be9d
[DOCS] Add engine args documentation (#1741) 2023-11-22 12:31:27 -08:00
boydfd
4bb6b67188
Fix RAM OOM when loading large models in tensor parallel mode (#1395)
Co-authored-by: ran_lin <rlin@thoughtworks.com>
2023-11-20 19:02:42 -08:00
Simon Mo
5ffc0d13a2
Migrate linter from pylint to ruff (#1665) 2023-11-20 11:58:01 -08:00
Simon Mo
cb08cd0d75
[Minor] Fix duplication of ignored seq group in engine step (#1666) 2023-11-16 13:11:41 -08:00
Zhuohan Li
7076fa1c9f
TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models (#1622)
Refactor the tensor parallelism, quantization, and weight-loading codes.

Summary of the new features enabled by this PR:
- **All models** can now be quantized with AWQ and SqueezeLLM, and [soon GPTQ](https://github.com/vllm-project/vllm/pull/1580).
- Model loading code became much simpler.
- Support model parallelism for all MQA/GQA models when the number of key/value heads is smaller than the tensor parallel size.
2023-11-15 22:50:41 -08:00
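Together with the GPTQ (#916), SqueezeLLM (#1326), and AWQ (#1032) entries in this log, the refactor above routes all quantized checkpoints through one loading path, selected by name. A minimal sketch (model ID is an illustrative assumption):

```python
from vllm import LLM

# Quantization method is chosen via a string argument; any supported
# architecture can load AWQ/SqueezeLLM (and, per #1580, GPTQ) weights.
awq_llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")
```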
Dominik Schwabe
1b290ace4f
Run default _AsyncLLMEngine._run_workers_async in threadpool (#1628) 2023-11-11 14:50:44 -08:00
ljss
5687d584fe
[BugFix] Set engine_use_ray=True when TP>1 (#1531) 2023-11-01 02:14:18 -07:00
Dan Lord
7013a80170
Add support for spaces_between_special_tokens 2023-10-30 16:52:56 -07:00
chooper1
1f24755bf8
Support SqueezeLLM (#1326)
Co-authored-by: squeeze-ai-lab <squeezeailab.bair@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-10-21 23:14:59 -07:00
Woosuk Kwon
c1376e0f82
Change scheduler & input tensor shape (#1381) 2023-10-16 17:48:42 -07:00
Zhuohan Li
9d9072a069
Implement prompt logprobs & Batched topk for computing logprobs (#1328)
Co-authored-by: Yunmo Chen <16273544+wanmok@users.noreply.github.com>
2023-10-16 10:56:50 -07:00
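A sketch of the logprobs feature above (#1328); parameter names are assumed to match the vLLM SamplingParams API of this period:

```python
from vllm import SamplingParams

# Request top-5 log-probabilities for generated tokens and,
# newly with this change, for prompt tokens as well.
params = SamplingParams(logprobs=5, prompt_logprobs=5)
```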
Antoni Baum
acbed3ef40
Use monotonic time where appropriate (#1249) 2023-10-02 19:22:05 -07:00
Federico Cassano
66d18a7fb0
add support for tokenizer revision (#1163)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-10-02 19:19:46 -07:00
Woosuk Kwon
f936657eb6
Provide default max model length (#1224) 2023-09-28 14:44:02 -07:00
Chris Bamford
bb1ba58f06
[Mistral] Mistral-7B-v0.1 support (#1196)
Co-authored-by: timlacroix <t@mistral.ai>
2023-09-28 10:41:03 -07:00
Dan Lord
20f7cc4cde
Add skip_special_tokens sampling params (#1186) 2023-09-27 19:21:42 -07:00
Woosuk Kwon
a19bc5c628
Automatically configure max_num_batched_tokens (#1198) 2023-09-27 16:34:00 -07:00
Wang Ran (汪然)
30e775281d
fix typo (#1184)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-09-27 16:22:45 -07:00
Ricardo Lu
f98b745a81
feat: support stop_token_ids parameter. (#1097) 2023-09-21 15:34:02 -07:00
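The stop_token_ids entry above, together with the skip_special_tokens (#1186) and spaces_between_special_tokens entries earlier in this log, all extend SamplingParams. A combined sketch (the token id is an illustrative assumption):

```python
from vllm import SamplingParams

params = SamplingParams(
    stop_token_ids=[2],               # stop on this token id (e.g. EOS); illustrative
    skip_special_tokens=True,         # strip special tokens from output text
    spaces_between_special_tokens=True,
)
```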
Woosuk Kwon
1ac4ccf73c
Add float16 and float32 (#1115) 2023-09-21 00:52:47 -07:00
Woosuk Kwon
400b8289f7
Add pyarrow to dependencies & Print warning on Ray import error (#1094) 2023-09-18 22:36:17 -07:00
Roy
95592fa00a
align llm_engine and async_engine. (#1081) 2023-09-18 11:49:10 -07:00
陈序
e21d7687a9
Fix hanging when prompt exceeds limit (#1029) 2023-09-17 01:48:56 -07:00
Antoni Baum
ff36139ffc
Remove AsyncLLMEngine busy loop, shield background task (#1059) 2023-09-17 00:29:08 -07:00
Woosuk Kwon
e3e79e9e8a
Implement AWQ quantization support for LLaMA (#1032)
Co-authored-by: Robert Irvine <robert@seamlessml.com>
Co-authored-by: root <rirv938@gmail.com>
Co-authored-by: Casper <casperbh.96@gmail.com>
Co-authored-by: julian-q <julianhquevedo@gmail.com>
2023-09-16 00:03:37 -07:00
Jerry Yang
b9fe4616f9
Abort when coroutine is cancelled (#1020) 2023-09-14 17:40:18 -07:00
Jasmond L
ab019eea75
Add Model Revision Support (#1014)
Co-authored-by: Jasmond Loh <Jasmond.Loh@hotmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-09-13 15:20:02 -07:00
Antoni Baum
9841d48a10
Use TGI-like incremental detokenization (#984) 2023-09-13 13:38:01 -07:00
Antoni Baum
0bb1e885a0
Make max_model_len configurable (#972) 2023-09-12 16:29:19 -07:00
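A sketch of the configurable context length above (#972); `max_model_len` is the argument name in later vLLM releases, assumed here to match this change:

```python
from vllm import LLM

# Cap the context window explicitly instead of taking the
# model config's default; model ID is illustrative.
llm = LLM(model="facebook/opt-125m", max_model_len=2048)
```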
leiwen83
d6545ad22e
Add option to shorten the prompt printed in logs (#991)
Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-09-12 15:10:14 -07:00
Jingru
4042d192f5
fix "tansformers_module" ModuleNotFoundError when load model with trust_remote_code=True (#871) 2023-09-08 17:21:30 -07:00
Antoni Baum
080438477f
Start background task in AsyncLLMEngine.generate (#988)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-09-08 00:03:39 -07:00
Zhuohan Li
c957c741d9
Enable safetensors loading for all models (#974) 2023-09-07 15:49:52 -07:00
Antoni Baum
c07ece5ca4
Make AsyncLLMEngine more robust & fix batched abort (#969)
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
2023-09-07 13:43:45 -07:00
Antoni Baum
c9927c1a6a
Use queue for finished requests (#957) 2023-09-05 19:27:23 -07:00
Wen Sun
22379d5513
fix: typo (#948) 2023-09-04 23:22:30 -07:00
Antoni Baum
1696725879
Initialize AsyncLLMEngine bg loop correctly (#943) 2023-09-04 17:41:22 -07:00
Zhuohan Li
002800f081
Align vLLM's beam search implementation with HF generate (#857) 2023-09-04 17:29:42 -07:00