Jason Cox
d65fac2738
Add vLLM version info to logs and OpenAI API server ( #3161 )
2024-03-02 21:00:29 -08:00
Sage Moore
ce4f5a29fb
Add Automatic Prefix Caching ( #2762 )
...
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-03-02 00:50:01 -08:00
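A minimal launch sketch for trying the feature; the `--enable-prefix-caching` flag name is an assumption based on this PR, so verify it against your vLLM version:
```terminal
# Assumed flag name; automatic prefix caching reuses KV-cache blocks
# across requests that share a common prompt prefix.
$ python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --enable-prefix-caching
```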
Allen.Dou
29e70e3e88
Allow user to choose log level via --log-level instead of fixed 'info' ( #3109 )
...
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-03-01 23:28:41 +00:00
Huarong
90fbf12540
fix relative import path of protocol.py ( #3134 )
...
Co-authored-by: huohuarong <huohuarong@zuoshouyisheng.com>
2024-03-01 19:42:06 +00:00
Seonghyeon
27ca23dc00
Remove exclude_unset in streaming response ( #3143 )
2024-03-01 09:59:06 -08:00
felixzhu555
703e42ee4b
Add guided decoding for OpenAI API server ( #2819 )
...
Co-authored-by: br3no <breno@veltefaria.de>
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-02-29 22:13:08 +00:00
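A hedged request sketch: `guided_choice` is one of the guided-decoding fields this PR is understood to add (alongside `guided_json` and `guided_regex`); treat the exact field names as assumptions:
```terminal
# Constrains the completion to one of the listed strings.
$ curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "meta-llama/Llama-2-7b-hf",
          "messages": [{"role": "user", "content": "Is water wet?"}],
          "guided_choice": ["yes", "no"]
        }'
```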
Jae-Won Chung
a6d471c759
Fix: AttributeError in OpenAI-compatible server ( #3018 )
2024-02-28 22:04:07 -08:00
Dylan Hawk
e0ade06d63
Support logit bias for OpenAI API ( #3027 )
2024-02-27 11:51:53 +08:00
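An example request following the OpenAI convention: `logit_bias` maps token IDs (as strings) to biases in [-100, 100]; the token ID below is a placeholder:
```terminal
# "1234" is a placeholder token ID; -100 effectively bans it.
$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "meta-llama/Llama-2-7b-hf",
          "prompt": "The capital of France is",
          "logit_bias": {"1234": -100}
        }'
```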
Jared Moore
70f3e8e3a1
Add LogProbs for Chat Completions in OpenAI ( #2918 )
2024-02-26 10:39:34 +08:00
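A request sketch using the OpenAI chat-completions logprobs fields; that this PR wires up exactly `logprobs`/`top_logprobs` is an assumption:
```terminal
# Ask for the top 5 alternatives per sampled token.
$ curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "meta-llama/Llama-2-7b-hf",
          "messages": [{"role": "user", "content": "Hi"}],
          "logprobs": true,
          "top_logprobs": 5
        }'
```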
Harry Mellor
ef978fe411
Port metrics from aioprometheus to prometheus_client ( #2730 )
2024-02-25 11:54:00 -08:00
Mustafa Eyceoz
5574081c49
Added early stopping to completion APIs ( #2939 )
2024-02-21 18:24:01 -08:00
Nick Hill
7d2dcce175
Support per-request seed ( #2514 )
2024-02-21 11:47:00 -08:00
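With per-request seeds, repeating a request with the same `seed` and a nonzero temperature should reproduce the same sample:
```terminal
# Run twice: identical seeds should yield identical completions.
$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "meta-llama/Llama-2-7b-hf",
          "prompt": "Once upon a time",
          "temperature": 1.0,
          "seed": 42
        }'
```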
Simon Mo
86fd8bb0ac
Add warning to prevent changes to benchmark api server ( #2858 )
2024-02-18 21:36:19 -08:00
jvmncs
8f36444c4f
multi-LoRA as extra models in OpenAI server ( #2775 )
...
How to serve the LoRAs (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)):
```terminal
$ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
$ python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-7b-hf \
--enable-lora \
--lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH
```
The above server will list 3 separate models when the user queries `/models`: one for the base model, and one for each of the specified LoRA modules (see the example queries after this entry). In this case sql-lora and sql-lora2 point to the same underlying LoRA, but this need not be the case. LoRA config values take the same values they do in EngineArgs.
No work has been done here to scope client permissions to specific models.
2024-02-17 12:00:48 -08:00
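Example queries against the server above; each LoRA module name is addressable as its own model:
```terminal
# List the base model plus each LoRA module:
$ curl http://localhost:8000/v1/models
# Target a specific LoRA by using its module name as the model:
$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "sql-lora", "prompt": "SELECT "}'
```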
dancingpipi
51cd22ce56
Set & get the LLM's internal tokenizer instead of the TokenizerGroup ( #2741 )
...
Co-authored-by: shujunhua1 <shujunhua1@jd.com>
2024-02-04 14:25:36 -08:00
Simon Mo
b9e96b17de
fix python 3.8 syntax ( #2716 )
2024-02-01 14:00:58 -08:00
Hanzhi Zhou
380170038e
Implement custom all reduce kernels ( #2192 )
2024-01-27 12:46:35 -08:00
Simon Mo
3a7dd7e367
Support Batch Completion in Server ( #2529 )
2024-01-24 17:11:07 -08:00
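Batch completion follows the OpenAI convention of accepting a list of prompts in one request:
```terminal
# One request; the response carries one choice per prompt.
$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "meta-llama/Llama-2-7b-hf",
          "prompt": ["Hello, my name is", "The capital of France is"]
        }'
```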
Federico Galatolo
f1f6cc10c7
Added include_stop_str_in_output and length_penalty parameters to OpenAI API ( #2562 )
2024-01-24 10:21:56 -08:00
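A request sketch; both fields mirror vLLM's SamplingParams (`length_penalty` only matters under beam search), and exposing them as top-level request fields is an assumption:
```terminal
# Keep the matched stop string in the returned text.
$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "meta-llama/Llama-2-7b-hf",
          "prompt": "List three colors:",
          "stop": ["\n\n"],
          "include_stop_str_in_output": true
        }'
```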
Antoni Baum
9b945daaf1
[Experimental] Add multi-LoRA support ( #1804 )
...
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-01-23 15:26:37 -08:00
Erfan Al-Hossami
9c1352eb57
[Feature] Simple API token authentication and pluggable middlewares ( #1106 )
2024-01-23 15:13:00 -08:00
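A launch-and-call sketch; `--api-key` is assumed to be the flag this PR introduces for simple bearer-token auth:
```terminal
$ python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --api-key s3cret-token
# Requests must now carry the matching bearer token:
$ curl http://localhost:8000/v1/models \
    -H "Authorization: Bearer s3cret-token"
```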
Jannis Schönleber
71d63ed72e
migrate pydantic from v1 to v2 ( #2531 )
2024-01-21 16:05:56 -08:00
Simon Mo
dd7e8f5f64
Refactor completion API for readability ( #2499 )
2024-01-18 16:45:14 -08:00
shiyi.c_98
d10f8e1d43
[Experimental] Prefix Caching Support ( #1669 )
...
Co-authored-by: DouHappy <2278958187@qq.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-17 16:32:10 -08:00
FlorianJoncour
14cc317ba4
OpenAI Server refactoring ( #2360 )
2024-01-16 21:33:14 -08:00
Chirag Jain
ce036244c9
Allow setting fastapi root_path argument ( #2341 )
2024-01-12 10:59:59 -08:00
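Useful when the server sits behind a path-prefixing reverse proxy; the argument is handed to FastAPI's `root_path`:
```terminal
# Serve the API under the /vllm prefix behind a proxy.
$ python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --root-path /vllm
```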
Iskren Ivov Chernev
d0215a58e7
Ensure metrics are logged regardless of requests ( #2347 )
2024-01-05 05:24:42 -08:00
Ronen Schaffer
74d8d77626
Remove unused const TIMEOUT_TO_PREVENT_DEADLOCK ( #2321 )
2024-01-03 15:49:07 -08:00
Harry Mellor
08133c4d1a
Add SSL arguments to API servers ( #2109 )
2023-12-18 10:56:23 +08:00
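A launch sketch, assuming the arguments map directly onto uvicorn's SSL options; the .pem paths are placeholders:
```terminal
$ python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --ssl-keyfile key.pem \
    --ssl-certfile cert.pem
```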
Woosuk Kwon
30fb0956df
[Minor] Add more detailed explanation on quantization argument ( #2145 )
2023-12-17 01:56:16 -08:00
Woosuk Kwon
37ca558103
Optimize model execution with CUDA graph ( #1926 )
...
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2023-12-16 21:12:08 -08:00
CHU Tianxiang
0fbfc4b81b
Add GPTQ support ( #916 )
2023-12-15 03:04:22 -08:00
Simon Mo
2e8fc0d4c3
Fix completion API echo and logprob combo ( #1992 )
2023-12-10 13:20:30 -08:00
Jin Shang
1aa1361510
Fix OpenAI server completion_tokens referenced before assignment ( #1996 )
2023-12-09 21:01:21 -08:00
Roy
60dc62dc9e
add custom server params ( #1868 )
2023-12-03 12:59:18 -08:00
Simon Mo
5313c2cb8b
Add Production Metrics in Prometheus format ( #1890 )
2023-12-02 16:37:44 -08:00
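The metrics are served from the API server itself in the Prometheus exposition format:
```terminal
# Point a Prometheus scrape job at this endpoint.
$ curl http://localhost:8000/metrics
```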
Adam Brusselback
66785cc05c
Support chat template and echo for chat API ( #1756 )
2023-11-30 16:43:13 -08:00
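A launch sketch; `--chat-template` is assumed to take a Jinja template file, for models whose tokenizer does not ship one:
```terminal
# template.jinja is a placeholder path to a chat template.
$ python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --chat-template ./template.jinja
```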
Michael McCulloch
c782195662
Make --disable-log-requests actually disable logging of requests ( #1779 )
...
Co-authored-by: Michael McCulloch <mjm.gitlab@fastmail.com>
2023-11-29 21:50:02 -08:00
Yunmo Chen
665cbcec4b
Added echo function to OpenAI API server ( #1504 )
2023-11-26 21:29:17 -08:00
Simon Mo
5ffc0d13a2
Migrate linter from pylint to ruff ( #1665 )
2023-11-20 11:58:01 -08:00
liuyhwangyh
edb305584b
Support downloading models from www.modelscope.cn ( #1588 )
2023-11-17 20:38:31 -08:00
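A usage sketch; the `VLLM_USE_MODELSCOPE` environment variable and the example model ID are assumptions to illustrate pulling weights from www.modelscope.cn instead of the Hugging Face Hub:
```terminal
# Assumed env var; the model ID is illustrative.
$ VLLM_USE_MODELSCOPE=True python -m vllm.entrypoints.openai.api_server \
    --model qwen/Qwen-7B-Chat --trust-remote-code
```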
Iskren Ivov Chernev
686f5e3210
Return usage for openai streaming requests ( #1663 )
2023-11-16 15:28:36 -08:00
Fluder-Paradyne
7e90a2d117
Add /health Endpoint for both Servers ( #1540 )
2023-11-01 10:29:44 -07:00
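The endpoint suits container liveness checks:
```terminal
# Returns HTTP 200 while the server is healthy.
$ curl -i http://localhost:8000/health
```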
Dan Lord
7013a80170
Add support for spaces_between_special_tokens
2023-10-30 16:52:56 -07:00
Yunfeng Bai
09ff7f106a
API server: support IPv4/IPv6 dual stack ( #1288 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-10-07 15:15:54 -07:00
Antoni Baum
acbed3ef40
Use monotonic time where appropriate ( #1249 )
2023-10-02 19:22:05 -07:00
Federico Cassano
66d18a7fb0
add support for tokenizer revision ( #1163 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-10-02 19:19:46 -07:00
Woosuk Kwon
f936657eb6
Provide default max model length ( #1224 )
2023-09-28 14:44:02 -07:00
Dan Lord
20f7cc4cde
Add skip_special_tokens sampling params ( #1186 )
2023-09-27 19:21:42 -07:00
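A request sketch; `skip_special_tokens` mirrors the new SamplingParams field, and exposing it as a request field is an assumption:
```terminal
# Keep special tokens (e.g. </s>) in the returned text.
$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "meta-llama/Llama-2-7b-hf",
          "prompt": "Hello",
          "skip_special_tokens": false
        }'
```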
Wen Sun
bbbf86565f
Align max_tokens behavior with OpenAI ( #852 )
2023-09-23 18:10:13 -07:00