827 Commits

Author SHA1 Message Date
Nick Hill
8999ec3c16
Store eos_token_id in Sequence for easy access (#3166) 2024-03-05 15:35:43 -08:00
Hongxia Yang
05af6da8d9
[ROCm] enable cupy in order to enable cudagraph mode for AMD GPUs (#3123)
Co-authored-by: lcskrishna <lollachaitanya@gmail.com>
2024-03-04 18:14:53 -08:00
Chen Wang
9a4548bae7
Fix the openai benchmarking requests to work with latest OpenAI apis (#2992)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-03-04 15:51:56 -08:00
Antoni Baum
ff578cae54
Add health check, make async Engine more robust (#3015)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-03-04 22:01:40 +00:00
Antoni Baum
22de45235c
Push logprob generation to LLMEngine (#3065)
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-03-04 19:54:06 +00:00
ttbachyinsda
76e8a70476
[Minor fix] The domain dns.google may cause a socket.gaierror exception (#3176)
Co-authored-by: guofangze <guofangze@kuaishou.com>
2024-03-04 19:17:12 +00:00
Allen.Dou
9cbc7e5f3b
enable --gpu-memory-utilization in benchmark_throughput.py (#3175)
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
2024-03-04 10:37:58 -08:00
Jialun Lyu
27a7b070db
Add document for vllm paged attention kernel. (#2978) 2024-03-04 09:23:34 -08:00
TianYu GUO
901cf4c52b
[Minor Fix] Remove unused code in benchmark_prefix_caching.py (#3171) 2024-03-03 22:48:27 -08:00
Liangfu Chen
d0fae88114
[DOC] add setup document to support neuron backend (#2777) 2024-03-04 01:03:51 +00:00
Philipp Moritz
17c3103c56
Make it easy to profile workers with nsight (#3162)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-03-03 16:19:13 -08:00
Zhuohan Li
996d095c54
[FIX] Fix styles in automatic prefix caching & add a automatic prefix caching benchmark (#3158) 2024-03-03 14:37:18 -08:00
Jason Cox
d65fac2738
Add vLLM version info to logs and openai API server (#3161) 2024-03-02 21:00:29 -08:00
Sage Moore
ce4f5a29fb
Add Automatic Prefix Caching (#2762)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-03-02 00:50:01 -08:00
cloudhan
baee28c46c
Reorder kv dtype check to avoid nvcc not found error on AMD platform (#3104) 2024-03-02 14:34:48 +08:00
Allen.Dou
29e70e3e88
allow user chose log level by --log-level instead of fixed 'info'. (#3109)
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-03-01 23:28:41 +00:00
Woosuk Kwon
82091b864a
Bump up to v0.3.3 (#3129) v0.3.3 2024-03-01 12:58:06 -08:00
Robert Shaw
c0c2335ce0
Integrate Marlin Kernels for Int4 GPTQ inference (#2497)
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>
2024-03-01 12:47:51 -08:00
Huarong
90fbf12540
fix relative import path of protocol.py (#3134)
Co-authored-by: huohuarong <huohuarong@zuoshouyisheng.com>
2024-03-01 19:42:06 +00:00
Yuan Tang
49d849b3ab
docs: Add tutorial on deploying vLLM model with KServe (#2586)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2024-03-01 11:04:14 -08:00
Seonghyeon
27ca23dc00
Remove exclude_unset in streaming response (#3143) 2024-03-01 09:59:06 -08:00
Sherry
54d3544784
Fix: Output text is always truncated in some models (#3016) 2024-03-01 07:52:22 +00:00
felixzhu555
703e42ee4b
Add guided decoding for OpenAI API server (#2819)
Co-authored-by: br3no <breno@veltefaria.de>
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-02-29 22:13:08 +00:00
Nick Hill
29a8d6a554
[Fix] Don't deep-copy LogitsProcessors when copying SamplingParams (#3099) 2024-02-29 19:20:42 +00:00
Billy Cao
2c08ff23c0
Fix building from source on WSL (#3112) 2024-02-29 11:13:58 -08:00
Seonghyeon
bfdcfa6a05
Support starcoder2 architecture (#3089) 2024-02-29 00:51:48 -08:00
Allen.Dou
9289e577ec
add cache_config's info to prometheus metrics. (#3100) 2024-02-29 06:15:18 +00:00
Jae-Won Chung
a6d471c759
Fix: AttributeError in OpenAI-compatible server (#3018) 2024-02-28 22:04:07 -08:00
CHU Tianxiang
01a5d18a53
Add Support for 2/3/8-bit GPTQ Quantization Models (#2330) 2024-02-28 21:52:23 -08:00
Woosuk Kwon
929b4f2973
Add LoRA support for Gemma (#3050) 2024-02-28 13:03:28 -08:00
Liangfu Chen
3b7178cfa4
[Neuron] Support inference with transformers-neuronx (#2569) 2024-02-28 09:34:34 -08:00
Allen.Dou
e46fa5d52e
Restrict prometheus_client >= 0.18.0 to prevent errors when importing pkgs (#3070) 2024-02-28 05:38:26 +00:00
Ganesh Jagadeesan
a8683102cc
multi-lora documentation fix (#3064) 2024-02-27 21:26:15 -08:00
Tao He
71bcaf99e2
Enable GQA support in the prefix prefill kernels (#3007)
Signed-off-by: Tao He <sighingnow@gmail.com>
2024-02-27 01:14:31 -08:00
Woosuk Kwon
8b430d7dea
[Minor] Fix StableLMEpochForCausalLM -> StableLmForCausalLM (#3046) 2024-02-26 20:23:50 -08:00
Dylan Hawk
e0ade06d63
Support logit bias for OpenAI API (#3027) 2024-02-27 11:51:53 +08:00
Woosuk Kwon
4bd18ec0c7
[Minor] Fix type annotation in fused moe (#3045) 2024-02-26 19:44:29 -08:00
Jingru
2410e320b3
fix get_ip error in pure ipv6 environment (#2931) 2024-02-26 19:22:16 -08:00
张大成
48a8f4a7fd
Support Orion model (#2539)
Co-authored-by: zhangdacheng <zhangdacheng@ainirobot.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-02-26 19:17:06 -08:00
Roy
4dd6416faf
Fix stablelm (#3038) 2024-02-26 18:31:10 -08:00
Roy
c1c0d00b88
Don't use cupy when enforce_eager=True (#3037) 2024-02-26 17:33:38 -08:00
Roy
d9f726c4d0
[Minor] Remove unused config files (#3039) 2024-02-26 17:25:22 -08:00
Woosuk Kwon
d6e4a130b0
[Minor] Remove gather_cached_kv kernel (#3043) 2024-02-26 15:00:54 -08:00
Philipp Moritz
cfc15a1031
Optimize Triton MoE Kernel (#2979)
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-02-26 13:48:56 -08:00
Jared Moore
70f3e8e3a1
Add LogProbs for Chat Completions in OpenAI (#2918) 2024-02-26 10:39:34 +08:00
Harry Mellor
ef978fe411
Port metrics from aioprometheus to prometheus_client (#2730) 2024-02-25 11:54:00 -08:00
Woosuk Kwon
f7c1234990
[Fix] Fix assertion on YaRN model len (#2984) 2024-02-23 12:57:48 -08:00
zhaoyang-star
57f044945f
Fix nvcc not found in vllm-openai image (#2781) 2024-02-22 14:25:07 -08:00
Ronen Schaffer
4caf7044e0
Include tokens from prompt phase in counter_generation_tokens (#2802) 2024-02-22 14:00:12 -08:00
Woosuk Kwon
6f32cddf1c
Remove Flash Attention in test env (#2982) 2024-02-22 09:58:29 -08:00