776 Commits

Author SHA1 Message Date
Woosuk Kwon
fd5dcc5c81
Optimize GeGLU layer in Gemma (#2975) 2024-02-21 20:17:52 -08:00
Massimiliano Pronesti
93dc5a2870
chore(vllm): codespell for spell checking (#2820) 2024-02-21 18:56:01 -08:00
Woosuk Kwon
95529e3253
Use Llama RMSNorm custom op for Gemma (#2974) 2024-02-21 18:28:23 -08:00
Roy
344020c926
Migrate MistralForCausalLM to LlamaForCausalLM (#2868) 2024-02-21 18:25:05 -08:00
Mustafa Eyceoz
5574081c49
Added early stopping to completion APIs (#2939) 2024-02-21 18:24:01 -08:00
Ronen Schaffer
d7f396486e
Update comment (#2934) 2024-02-21 18:18:37 -08:00
Zhuohan Li
8fbd84bf78
Bump up version to v0.3.2 (#2968)
This version is mainly for more model support: it adds support for Gemma models (#2964) and OLMo models (#2832).
v0.3.2
2024-02-21 11:47:25 -08:00
Nick Hill
7d2dcce175
Support per-request seed (#2514) 2024-02-21 11:47:00 -08:00
Woosuk Kwon
dc903e70ac
[ROCm] Upgrade transformers to v4.38.0 (#2967) 2024-02-21 09:46:57 -08:00
Zhuohan Li
a9c8212895
[FIX] Add Gemma model to the doc (#2966) 2024-02-21 09:46:15 -08:00
Woosuk Kwon
c20ecb6a51
Upgrade transformers to v4.38.0 (#2965) 2024-02-21 09:38:03 -08:00
Xiang Xu
5253edaacb
Add Gemma model (#2964) 2024-02-21 09:34:30 -08:00
Antoni Baum
017d9f1515
Add metrics to RequestOutput (#2876) 2024-02-20 21:55:57 -08:00
Antoni Baum
181b27d881
Make vLLM logging formatting optional (#2877) 2024-02-20 14:38:55 -08:00
Zhuohan Li
63e2a6419d
[FIX] Fix beam search test (#2930) 2024-02-20 14:37:39 -08:00
James Whedbee
264017a2bf
[ROCm] include gfx908 as supported (#2792) 2024-02-19 17:58:59 -08:00
Ronen Schaffer
e433c115bc
Fix vllm:prompt_tokens_total metric calculation (#2869) 2024-02-18 23:55:41 -08:00
Simon Mo
86fd8bb0ac
Add warning to prevent changes to benchmark api server (#2858) 2024-02-18 21:36:19 -08:00
Isotr0py
ab3a5a8259
Support OLMo models. (#2832) 2024-02-18 21:05:15 -08:00
Zhuohan Li
a61f0521b8
[Test] Add basic correctness test (#2908) 2024-02-18 16:44:50 -08:00
Zhuohan Li
537c9755a7
[Minor] Small fix to make distributed init logic in worker look cleaner (#2905) 2024-02-18 14:39:00 -08:00
Mark Mozolewski
786b7f18a5
Add code-revision config argument for Hugging Face Hub (#2892) 2024-02-17 22:36:53 -08:00
jvmncs
8f36444c4f
Multi-LoRA as extra models in OpenAI server (#2775)
How to serve the LoRAs (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)):
```terminal
$ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
$ python -m vllm.entrypoints.openai.api_server \
 --model meta-llama/Llama-2-7b-hf \
 --enable-lora \
 --lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH
```
The server above will list 3 separate entries if the user queries `/models`: one for the base served model and one for each of the specified LoRA modules. In this case `sql-lora` and `sql-lora2` point to the same underlying LoRA, but this need not be the case. The LoRA config arguments take the same values they do in `EngineArgs`.
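
For illustration, a minimal client-side sketch (assuming the server above is reachable on the default `localhost:8000` and exposes the OpenAI-style routes under `/v1`; the prompt text and the use of `requests` are placeholders, not part of this change):

```python
import requests

# List the served models: expect the base model plus the two LoRA entries.
models = requests.get("http://localhost:8000/v1/models").json()
print([m["id"] for m in models["data"]])

# A completion can then target a LoRA adapter by its served name.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": "sql-lora", "prompt": "-- Write a SQL query:", "max_tokens": 32},
)
print(resp.json()["choices"][0]["text"])
```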

No work has been done here to scope client permissions to specific models.
2024-02-17 12:00:48 -08:00
Nick Hill
185b2c29e2
Defensively copy sampling_params (#2881)
If the SamplingParams object passed to LLMEngine.add_request() is mutated after it returns, it could affect the async sampling process for that request.
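
A minimal sketch of the hazard this guards against (hypothetical caller code, not the actual fix; the engine call is commented out and the deep copy merely illustrates the defensive-copy idea):

```python
import copy
from vllm import SamplingParams

# The caller builds one SamplingParams object and reuses it across requests.
params = SamplingParams(temperature=0.8, max_tokens=64)
# engine.add_request("req-1", prompt, params)  # sampling for "req-1" proceeds asynchronously

# A later in-place mutation by the caller could otherwise leak into the in-flight request:
params.temperature = 0.0

# Hence the defensive copy on the engine side, illustrated here with copy.deepcopy:
engine_private_params = copy.deepcopy(params)
```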

Suggested by @Yard1 https://github.com/vllm-project/vllm/pull/2514#discussion_r1490106059
2024-02-17 11:18:04 -08:00
Woosuk Kwon
5f08050d8d
Bump up to v0.3.1 (#2887) v0.3.1 2024-02-16 15:05:18 -08:00
shiyi.c_98
64da65b322
Prefix Caching - fix T4 Triton error (#2517) 2024-02-16 14:17:55 -08:00
Hongxia Yang
5255d99dc5
[ROCm] Dockerfile fix for flash-attention build (#2885) 2024-02-15 10:22:39 -08:00
Philipp Moritz
4f2ad11135
Fix DeciLM (#2883) 2024-02-14 22:29:57 -08:00
Woosuk Kwon
d7afab6d3a
[BugFix] Fix GC bug for LLM class (#2882) 2024-02-14 22:17:44 -08:00
Philipp Moritz
31348dff03
Align LoRA code between Mistral and Mixtral (fixes #2875) (#2880)
* Fix AttributeError: MixtralModel object has no attribute org_vocab_size.

* Make LoRA logic for Mistral and Mixtral the same

---------

Co-authored-by: Pernekhan Utemuratov <pernekhan@deepinfra.com>
2024-02-15 01:00:43 +01:00
Woosuk Kwon
25e86b6a61
Don't use cupy NCCL for AMD backends (#2855) 2024-02-14 12:30:44 -08:00
Roy
4efbac6d35
Migrate AquilaForCausalLM to LlamaForCausalLM (#2867) 2024-02-14 12:30:24 -08:00
Nikola Borisov
87069ccf68
Fix docker python version (#2845) 2024-02-14 10:17:57 -08:00
Woosuk Kwon
7e45107f51
[Fix] Fix memory profiling when GPU is used by multiple processes (#2863) 2024-02-13 19:52:34 -08:00
Philipp Moritz
0c48b37c31
Fix internlm after https://github.com/vllm-project/vllm/pull/2860 (#2861) 2024-02-13 18:01:15 -08:00
Philipp Moritz
7eacffd951
Migrate InternLMForCausalLM to LlamaForCausalLM (#2860)
Co-authored-by: Roy <jasonailu87@gmail.com>
2024-02-13 17:12:05 -08:00
Terry
2a543d6efe
Add LoRA support for Mixtral (#2831)
* add mixtral lora support

* formatting

* fix incorrectly ported logic

* polish tests

* minor fixes and refactoring

* minor fixes

* formatting

* rename and remove redundant logic

* refactoring

* refactoring

* minor fix

* minor refactoring

* fix code smell
2024-02-14 00:55:45 +01:00
Philipp Moritz
317b29de0f
Remove Yi model definition, please use LlamaForCausalLM instead (#2854)
Co-authored-by: Roy <jasonailu87@gmail.com>
2024-02-13 14:22:22 -08:00
Woosuk Kwon
a463c333dd
Use CuPy for CUDA graphs (#2811) 2024-02-13 11:32:06 -08:00
Philipp Moritz
ea356004d4
Revert "Refactor llama family models (#2637)" (#2851)
This reverts commit 5c976a7e1a1bec875bf6474824b7dff39e38de18.
2024-02-13 09:24:59 -08:00
Roy
5c976a7e1a
Refactor llama family models (#2637) 2024-02-13 00:09:23 -08:00
Simon Mo
f964493274
[CI] Ensure documentation build is checked in CI (#2842) 2024-02-12 22:53:07 -08:00
Roger Wang
a4211a4dc3
Serving Benchmark Refactoring (#2433) 2024-02-12 22:53:00 -08:00
Rex
563836496a
Refactor 2 awq gemm kernels into m16nXk32 (#2723)
Co-authored-by: Chunan Zeng <chunanzeng@Chunans-Air.attlocal.net>
2024-02-12 11:02:17 -08:00
Philipp Moritz
4ca2c358b1
Add documentation section about LoRA (#2834) 2024-02-12 17:24:45 +01:00
Hongxia Yang
0580aab02f
[ROCm] support Radeon™ 7900 series (gfx1100) without using flash-attention (#2768) 2024-02-10 23:14:37 -08:00
Woosuk Kwon
3711811b1d
Disable custom all reduce by default (#2808) 2024-02-08 09:58:03 -08:00
SangBin Cho
65b89d16ee
[Ray] Compiled DAG integration off by default (#2471) 2024-02-08 09:57:25 -08:00
Philipp Moritz
931746bc6d
Add documentation on how to do incremental builds (#2796) 2024-02-07 14:42:02 -08:00
Hongxia Yang
c81dddb45c
[ROCm] Fix build problem resulting from previous commit related to FP8 kv-cache support (#2790) 2024-02-06 22:36:59 -08:00