Jialun Lyu | 27a7b070db | Add document for vllm paged attention kernel. (#2978) | 2024-03-04 09:23:34 -08:00
TianYu GUO | 901cf4c52b | [Minor Fix] Remove unused code in benchmark_prefix_caching.py (#3171) | 2024-03-03 22:48:27 -08:00
Liangfu Chen | d0fae88114 | [DOC] add setup document to support neuron backend (#2777) | 2024-03-04 01:03:51 +00:00
Philipp Moritz | 17c3103c56 | Make it easy to profile workers with nsight (#3162) | 2024-03-03 16:19:13 -08:00
    Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Zhuohan Li | 996d095c54 | [FIX] Fix styles in automatic prefix caching & add a automatic prefix caching benchmark (#3158) | 2024-03-03 14:37:18 -08:00
Jason Cox | d65fac2738 | Add vLLM version info to logs and openai API server (#3161) | 2024-03-02 21:00:29 -08:00
Sage Moore | ce4f5a29fb | Add Automatic Prefix Caching (#2762) | 2024-03-02 00:50:01 -08:00
    Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
    Co-authored-by: Michael Goin <michael@neuralmagic.com>
cloudhan | baee28c46c | Reorder kv dtype check to avoid nvcc not found error on AMD platform (#3104) | 2024-03-02 14:34:48 +08:00
Allen.Dou | 29e70e3e88 | allow user chose log level by --log-level instead of fixed 'info'. (#3109) | 2024-03-01 23:28:41 +00:00
    Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
    Co-authored-by: Simon Mo <simon.mo@hey.com>
Woosuk Kwon | 82091b864a | Bump up to v0.3.3 (#3129) | 2024-03-01 12:58:06 -08:00
    (tag: v0.3.3)
Robert Shaw | c0c2335ce0 | Integrate Marlin Kernels for Int4 GPTQ inference (#2497) | 2024-03-01 12:47:51 -08:00
    Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
    Co-authored-by: alexm <alexm@neuralmagic.com>
Huarong | 90fbf12540 | fix relative import path of protocol.py (#3134) | 2024-03-01 19:42:06 +00:00
    Co-authored-by: huohuarong <huohuarong@zuoshouyisheng.com>
Yuan Tang | 49d849b3ab | docs: Add tutorial on deploying vLLM model with KServe (#2586) | 2024-03-01 11:04:14 -08:00
    Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Seonghyeon | 27ca23dc00 | Remove exclude_unset in streaming response (#3143) | 2024-03-01 09:59:06 -08:00
Sherry | 54d3544784 | Fix: Output text is always truncated in some models (#3016) | 2024-03-01 07:52:22 +00:00
felixzhu555 | 703e42ee4b | Add guided decoding for OpenAI API server (#2819) | 2024-02-29 22:13:08 +00:00
    Co-authored-by: br3no <breno@veltefaria.de>
    Co-authored-by: simon-mo <simon.mo@hey.com>
Nick Hill | 29a8d6a554 | [Fix] Don't deep-copy LogitsProcessors when copying SamplingParams (#3099) | 2024-02-29 19:20:42 +00:00
Billy Cao | 2c08ff23c0 | Fix building from source on WSL (#3112) | 2024-02-29 11:13:58 -08:00
Seonghyeon | bfdcfa6a05 | Support starcoder2 architecture (#3089) | 2024-02-29 00:51:48 -08:00
Allen.Dou | 9289e577ec | add cache_config's info to prometheus metrics. (#3100) | 2024-02-29 06:15:18 +00:00
Jae-Won Chung | a6d471c759 | Fix: AttributeError in OpenAI-compatible server (#3018) | 2024-02-28 22:04:07 -08:00
CHU Tianxiang | 01a5d18a53 | Add Support for 2/3/8-bit GPTQ Quantization Models (#2330) | 2024-02-28 21:52:23 -08:00
Woosuk Kwon | 929b4f2973 | Add LoRA support for Gemma (#3050) | 2024-02-28 13:03:28 -08:00
Liangfu Chen | 3b7178cfa4 | [Neuron] Support inference with transformers-neuronx (#2569) | 2024-02-28 09:34:34 -08:00
Allen.Dou | e46fa5d52e | Restrict prometheus_client >= 0.18.0 to prevent errors when importing pkgs (#3070) | 2024-02-28 05:38:26 +00:00
Ganesh Jagadeesan | a8683102cc | multi-lora documentation fix (#3064) | 2024-02-27 21:26:15 -08:00
Tao He | 71bcaf99e2 | Enable GQA support in the prefix prefill kernels (#3007) | 2024-02-27 01:14:31 -08:00
    Signed-off-by: Tao He <sighingnow@gmail.com>
Woosuk Kwon | 8b430d7dea | [Minor] Fix StableLMEpochForCausalLM -> StableLmForCausalLM (#3046) | 2024-02-26 20:23:50 -08:00
Dylan Hawk | e0ade06d63 | Support logit bias for OpenAI API (#3027) | 2024-02-27 11:51:53 +08:00
Woosuk Kwon | 4bd18ec0c7 | [Minor] Fix type annotation in fused moe (#3045) | 2024-02-26 19:44:29 -08:00
Jingru | 2410e320b3 | fix get_ip error in pure ipv6 environment (#2931) | 2024-02-26 19:22:16 -08:00
张大成 | 48a8f4a7fd | Support Orion model (#2539) | 2024-02-26 19:17:06 -08:00
    Co-authored-by: zhangdacheng <zhangdacheng@ainirobot.com>
    Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Roy | 4dd6416faf | Fix stablelm (#3038) | 2024-02-26 18:31:10 -08:00
Roy | c1c0d00b88 | Don't use cupy when enforce_eager=True (#3037) | 2024-02-26 17:33:38 -08:00
Roy | d9f726c4d0 | [Minor] Remove unused config files (#3039) | 2024-02-26 17:25:22 -08:00
Woosuk Kwon | d6e4a130b0 | [Minor] Remove gather_cached_kv kernel (#3043) | 2024-02-26 15:00:54 -08:00
Philipp Moritz | cfc15a1031 | Optimize Triton MoE Kernel (#2979) | 2024-02-26 13:48:56 -08:00
    Co-authored-by: Cade Daniel <edacih@gmail.com>
Jared Moore | 70f3e8e3a1 | Add LogProbs for Chat Completions in OpenAI (#2918) | 2024-02-26 10:39:34 +08:00
Harry Mellor | ef978fe411 | Port metrics from aioprometheus to prometheus_client (#2730) | 2024-02-25 11:54:00 -08:00
Woosuk Kwon | f7c1234990 | [Fix] Fix assertion on YaRN model len (#2984) | 2024-02-23 12:57:48 -08:00
zhaoyang-star | 57f044945f | Fix nvcc not found in vlm-openai image (#2781) | 2024-02-22 14:25:07 -08:00
Ronen Schaffer | 4caf7044e0 | Include tokens from prompt phase in counter_generation_tokens (#2802) | 2024-02-22 14:00:12 -08:00
Woosuk Kwon | 6f32cddf1c | Remove Flash Attention in test env (#2982) | 2024-02-22 09:58:29 -08:00
44670 | c530e2cfe3 | [FIX] Fix a bug in initializing Yarn RoPE (#2983) | 2024-02-22 01:40:05 -08:00
Woosuk Kwon | fd5dcc5c81 | Optimize GeGLU layer in Gemma (#2975) | 2024-02-21 20:17:52 -08:00
Massimiliano Pronesti | 93dc5a2870 | chore(vllm): codespell for spell checking (#2820) | 2024-02-21 18:56:01 -08:00
Woosuk Kwon | 95529e3253 | Use Llama RMSNorm custom op for Gemma (#2974) | 2024-02-21 18:28:23 -08:00
Roy | 344020c926 | Migrate MistralForCausalLM to LlamaForCausalLM (#2868) | 2024-02-21 18:25:05 -08:00
Mustafa Eyceoz | 5574081c49 | Added early stopping to completion APIs (#2939) | 2024-02-21 18:24:01 -08:00
Ronen Schaffer | d7f396486e | Update comment (#2934) | 2024-02-21 18:18:37 -08:00