776 Commits

Author SHA1 Message Date
Woosuk Kwon
fd5dcc5c81
Optimize GeGLU layer in Gemma (#2975) 2024-02-21 20:17:52 -08:00
Massimiliano Pronesti
93dc5a2870
chore(vllm): codespell for spell checking (#2820) 2024-02-21 18:56:01 -08:00
Woosuk Kwon
95529e3253
Use Llama RMSNorm custom op for Gemma (#2974) 2024-02-21 18:28:23 -08:00
Roy
344020c926
Migrate MistralForCausalLM to LlamaForCausalLM (#2868) 2024-02-21 18:25:05 -08:00
Mustafa Eyceoz
5574081c49
Added early stopping to completion APIs (#2939) 2024-02-21 18:24:01 -08:00
Ronen Schaffer
d7f396486e
Update comment (#2934) 2024-02-21 18:18:37 -08:00
Zhuohan Li
8fbd84bf78
Bump up version to v0.3.2 (#2968)
This version is mainly for more model support: it adds support for Gemma models (#2964) and OLMo models (#2832).
v0.3.2
2024-02-21 11:47:25 -08:00
Nick Hill
7d2dcce175
Support per-request seed (#2514) 2024-02-21 11:47:00 -08:00
Woosuk Kwon
dc903e70ac
[ROCm] Upgrade transformers to v4.38.0 (#2967) 2024-02-21 09:46:57 -08:00
Zhuohan Li
a9c8212895
[FIX] Add Gemma model to the doc (#2966) 2024-02-21 09:46:15 -08:00
Woosuk Kwon
c20ecb6a51
Upgrade transformers to v4.38.0 (#2965) 2024-02-21 09:38:03 -08:00
Xiang Xu
5253edaacb
Add Gemma model (#2964) 2024-02-21 09:34:30 -08:00
Antoni Baum
017d9f1515
Add metrics to RequestOutput (#2876) 2024-02-20 21:55:57 -08:00
Antoni Baum
181b27d881
Make vLLM logging formatting optional (#2877) 2024-02-20 14:38:55 -08:00
Zhuohan Li
63e2a6419d
[FIX] Fix beam search test (#2930) 2024-02-20 14:37:39 -08:00
James Whedbee
264017a2bf
[ROCm] include gfx908 as supported (#2792) 2024-02-19 17:58:59 -08:00
Ronen Schaffer
e433c115bc
Fix vllm:prompt_tokens_total metric calculation (#2869) 2024-02-18 23:55:41 -08:00
Simon Mo
86fd8bb0ac
Add warning to prevent changes to benchmark api server (#2858) 2024-02-18 21:36:19 -08:00
Isotr0py
ab3a5a8259
Support OLMo models. (#2832) 2024-02-18 21:05:15 -08:00
Zhuohan Li
a61f0521b8
[Test] Add basic correctness test (#2908) 2024-02-18 16:44:50 -08:00
Zhuohan Li
537c9755a7
[Minor] Small fix to make distributed init logic in worker look cleaner (#2905) 2024-02-18 14:39:00 -08:00
Mark Mozolewski
786b7f18a5
Add code-revision config argument for Hugging Face Hub (#2892) 2024-02-17 22:36:53 -08:00
jvmncs
8f36444c4f
Multi-LoRA as extra models in OpenAI server (#2775)
How to serve the LoRAs (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)):
```terminal
$ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
$ python -m vllm.entrypoints.openai.api_server \
 --model meta-llama/Llama-2-7b-hf \
 --enable-lora \
 --lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH
```
The server above will list 3 separate entries if the user queries `/models`: one for the base served model and one for each of the specified LoRA modules. In this case `sql-lora` and `sql-lora2` point to the same underlying LoRA, but this need not be the case. The LoRA config arguments take the same values they do in `EngineArgs`.
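
For illustration, a minimal client-side sketch (assuming the server above is reachable on the default `localhost:8000` and exposes the OpenAI-style routes under `/v1`; the prompt text and the use of `requests` are placeholders, not part of this change):

```python
import requests

# List the served models: expect the base model plus the two LoRA entries.
models = requests.get("http://localhost:8000/v1/models").json()
print([m["id"] for m in models["data"]])

# A completion can then target a LoRA adapter by its served name.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": "sql-lora", "prompt": "-- Write a SQL query:", "max_tokens": 32},
)
print(resp.json()["choices"][0]["text"])
```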

No work has been done here to scope client permissions to specific models.
2024-02-17 12:00:48 -08:00
Nick Hill
185b2c29e2
Defensively copy sampling_params (#2881)
If the SamplingParams object passed to LLMEngine.add_request() is mutated after it returns, it could affect the async sampling process for that request.
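
A minimal sketch of the hazard this guards against (hypothetical caller code, not the actual fix; the engine call is commented out and the deep copy merely illustrates the defensive-copy idea):

```python
import copy
from vllm import SamplingParams

# The caller builds one SamplingParams object and reuses it across requests.
params = SamplingParams(temperature=0.8, max_tokens=64)
# engine.add_request("req-1", prompt, params)  # sampling for "req-1" proceeds asynchronously

# A later in-place mutation by the caller could otherwise leak into the in-flight request:
params.temperature = 0.0

# Hence the defensive copy on the engine side, illustrated here with copy.deepcopy:
engine_private_params = copy.deepcopy(params)
```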

Suggested by @Yard1 https://github.com/vllm-project/vllm/pull/2514#discussion_r1490106059
2024-02-17 11:18:04 -08:00
Woosuk Kwon
5f08050d8d
Bump up to v0.3.1 (#2887) v0.3.1 2024-02-16 15:05:18 -08:00
shiyi.c_98
64da65b322
Prefix Caching - fix T4 Triton error (#2517) 2024-02-16 14:17:55 -08:00
Hongxia Yang
5255d99dc5
[ROCm] Dockerfile fix for flash-attention build (#2885) 2024-02-15 10:22:39 -08:00
Philipp Moritz
4f2ad11135
Fix DeciLM (#2883) 2024-02-14 22:29:57 -08:00
Woosuk Kwon
d7afab6d3a
[BugFix] Fix GC bug for LLM class (#2882) 2024-02-14 22:17:44 -08:00
Philipp Moritz
31348dff03
Align LoRA code between Mistral and Mixtral (fixes #2875) (#2880)
* Fix AttributeError: MixtralModel object has no attribute org_vocab_size.

* Make LoRA logic for Mistral and Mixtral the same

---------

Co-authored-by: Pernekhan Utemuratov <pernekhan@deepinfra.com>
2024-02-15 01:00:43 +01:00
Woosuk Kwon
25e86b6a61
Don't use cupy NCCL for AMD backends (#2855) 2024-02-14 12:30:44 -08:00
Roy
4efbac6d35
Migrate AquilaForCausalLM to LlamaForCausalLM (#2867) 2024-02-14 12:30:24 -08:00
Nikola Borisov
87069ccf68
Fix docker python version (#2845) 2024-02-14 10:17:57 -08:00
Woosuk Kwon
7e45107f51
[Fix] Fix memory profiling when GPU is used by multiple processes (#2863) 2024-02-13 19:52:34 -08:00
Philipp Moritz
0c48b37c31
Fix internlm after https://github.com/vllm-project/vllm/pull/2860 (#2861) 2024-02-13 18:01:15 -08:00
Philipp Moritz
7eacffd951
Migrate InternLMForCausalLM to LlamaForCausalLM (#2860)
Co-authored-by: Roy <jasonailu87@gmail.com>
2024-02-13 17:12:05 -08:00
Terry
2a543d6efe
Add LoRA support for Mixtral (#2831)
* add mixtral lora support

* formatting

* fix incorrectly ported logic

* polish tests

* minor fixes and refactoring

* minor fixes

* formatting

* rename and remove redundant logic

* refactoring

* refactoring

* minor fix

* minor refactoring

* fix code smell
2024-02-14 00:55:45 +01:00
Philipp Moritz
317b29de0f
Remove Yi model definition, please use LlamaForCausalLM instead (#2854)
Co-authored-by: Roy <jasonailu87@gmail.com>
2024-02-13 14:22:22 -08:00
Woosuk Kwon
a463c333dd
Use CuPy for CUDA graphs (#2811) 2024-02-13 11:32:06 -08:00
Philipp Moritz
ea356004d4
Revert "Refactor llama family models (#2637)" (#2851)
This reverts commit 5c976a7e1a1bec875bf6474824b7dff39e38de18.
2024-02-13 09:24:59 -08:00
Roy
5c976a7e1a
Refactor llama family models (#2637) 2024-02-13 00:09:23 -08:00
Simon Mo
f964493274
[CI] Ensure documentation build is checked in CI (#2842) 2024-02-12 22:53:07 -08:00
Roger Wang
a4211a4dc3
Serving Benchmark Refactoring (#2433) 2024-02-12 22:53:00 -08:00
Rex
563836496a
Refactor 2 awq gemm kernels into m16nXk32 (#2723)
Co-authored-by: Chunan Zeng <chunanzeng@Chunans-Air.attlocal.net>
2024-02-12 11:02:17 -08:00
Philipp Moritz
4ca2c358b1
Add documentation section about LoRA (#2834) 2024-02-12 17:24:45 +01:00
Hongxia Yang
0580aab02f
[ROCm] support Radeon™ 7900 series (gfx1100) without using flash-attention (#2768) 2024-02-10 23:14:37 -08:00
Woosuk Kwon
3711811b1d
Disable custom all reduce by default (#2808) 2024-02-08 09:58:03 -08:00
SangBin Cho
65b89d16ee
[Ray] Compiled DAG integration off by default (#2471) 2024-02-08 09:57:25 -08:00
Philipp Moritz
931746bc6d
Add documentation on how to do incremental builds (#2796) 2024-02-07 14:42:02 -08:00
Hongxia Yang
c81dddb45c
[ROCm] Fix build problem resulting from previous commit related to FP8 kv-cache support (#2790) 2024-02-06 22:36:59 -08:00