Jialun Lyu | 27a7b070db | Add document for vllm paged attention kernel. (#2978) | 2024-03-04 09:23:34 -08:00
TianYu GUO | 901cf4c52b | [Minor Fix] Remove unused code in benchmark_prefix_caching.py (#3171) | 2024-03-03 22:48:27 -08:00
Liangfu Chen | d0fae88114 | [DOC] add setup document to support neuron backend (#2777) | 2024-03-04 01:03:51 +00:00
Philipp Moritz | 17c3103c56 | Make it easy to profile workers with nsight (#3162) | 2024-03-03 16:19:13 -08:00
    Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Zhuohan Li | 996d095c54 | [FIX] Fix styles in automatic prefix caching & add a automatic prefix caching benchmark (#3158) | 2024-03-03 14:37:18 -08:00
Jason Cox | d65fac2738 | Add vLLM version info to logs and openai API server (#3161) | 2024-03-02 21:00:29 -08:00
Sage Moore | ce4f5a29fb | Add Automatic Prefix Caching (#2762) | 2024-03-02 00:50:01 -08:00
    Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
    Co-authored-by: Michael Goin <michael@neuralmagic.com>
cloudhan | baee28c46c | Reorder kv dtype check to avoid nvcc not found error on AMD platform (#3104) | 2024-03-02 14:34:48 +08:00
Allen.Dou | 29e70e3e88 | allow user chose log level by --log-level instead of fixed 'info'. (#3109) | 2024-03-01 23:28:41 +00:00
    Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
    Co-authored-by: Simon Mo <simon.mo@hey.com>
Woosuk Kwon | 82091b864a | Bump up to v0.3.3 (#3129) | 2024-03-01 12:58:06 -08:00
    (tag: v0.3.3)
Robert Shaw | c0c2335ce0 | Integrate Marlin Kernels for Int4 GPTQ inference (#2497) | 2024-03-01 12:47:51 -08:00
    Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
    Co-authored-by: alexm <alexm@neuralmagic.com>
Huarong | 90fbf12540 | fix relative import path of protocol.py (#3134) | 2024-03-01 19:42:06 +00:00
    Co-authored-by: huohuarong <huohuarong@zuoshouyisheng.com>
Yuan Tang | 49d849b3ab | docs: Add tutorial on deploying vLLM model with KServe (#2586) | 2024-03-01 11:04:14 -08:00
    Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Seonghyeon | 27ca23dc00 | Remove exclude_unset in streaming response (#3143) | 2024-03-01 09:59:06 -08:00
Sherry | 54d3544784 | Fix: Output text is always truncated in some models (#3016) | 2024-03-01 07:52:22 +00:00
felixzhu555 | 703e42ee4b | Add guided decoding for OpenAI API server (#2819) | 2024-02-29 22:13:08 +00:00
    Co-authored-by: br3no <breno@veltefaria.de>
    Co-authored-by: simon-mo <simon.mo@hey.com>
Nick Hill | 29a8d6a554 | [Fix] Don't deep-copy LogitsProcessors when copying SamplingParams (#3099) | 2024-02-29 19:20:42 +00:00
Billy Cao | 2c08ff23c0 | Fix building from source on WSL (#3112) | 2024-02-29 11:13:58 -08:00
Seonghyeon | bfdcfa6a05 | Support starcoder2 architecture (#3089) | 2024-02-29 00:51:48 -08:00
Allen.Dou | 9289e577ec | add cache_config's info to prometheus metrics. (#3100) | 2024-02-29 06:15:18 +00:00
Jae-Won Chung | a6d471c759 | Fix: AttributeError in OpenAI-compatible server (#3018) | 2024-02-28 22:04:07 -08:00
CHU Tianxiang | 01a5d18a53 | Add Support for 2/3/8-bit GPTQ Quantization Models (#2330) | 2024-02-28 21:52:23 -08:00
Woosuk Kwon | 929b4f2973 | Add LoRA support for Gemma (#3050) | 2024-02-28 13:03:28 -08:00
Liangfu Chen | 3b7178cfa4 | [Neuron] Support inference with transformers-neuronx (#2569) | 2024-02-28 09:34:34 -08:00
Allen.Dou | e46fa5d52e | Restrict prometheus_client >= 0.18.0 to prevent errors when importing pkgs (#3070) | 2024-02-28 05:38:26 +00:00
Ganesh Jagadeesan | a8683102cc | multi-lora documentation fix (#3064) | 2024-02-27 21:26:15 -08:00
Tao He | 71bcaf99e2 | Enable GQA support in the prefix prefill kernels (#3007) | 2024-02-27 01:14:31 -08:00
    Signed-off-by: Tao He <sighingnow@gmail.com>
Woosuk Kwon | 8b430d7dea | [Minor] Fix StableLMEpochForCausalLM -> StableLmForCausalLM (#3046) | 2024-02-26 20:23:50 -08:00
Dylan Hawk | e0ade06d63 | Support logit bias for OpenAI API (#3027) | 2024-02-27 11:51:53 +08:00
Woosuk Kwon | 4bd18ec0c7 | [Minor] Fix type annotation in fused moe (#3045) | 2024-02-26 19:44:29 -08:00
Jingru | 2410e320b3 | fix get_ip error in pure ipv6 environment (#2931) | 2024-02-26 19:22:16 -08:00
张大成 | 48a8f4a7fd | Support Orion model (#2539) | 2024-02-26 19:17:06 -08:00
    Co-authored-by: zhangdacheng <zhangdacheng@ainirobot.com>
    Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Roy | 4dd6416faf | Fix stablelm (#3038) | 2024-02-26 18:31:10 -08:00
Roy | c1c0d00b88 | Don't use cupy when enforce_eager=True (#3037) | 2024-02-26 17:33:38 -08:00
Roy | d9f726c4d0 | [Minor] Remove unused config files (#3039) | 2024-02-26 17:25:22 -08:00
Woosuk Kwon | d6e4a130b0 | [Minor] Remove gather_cached_kv kernel (#3043) | 2024-02-26 15:00:54 -08:00
Philipp Moritz | cfc15a1031 | Optimize Triton MoE Kernel (#2979) | 2024-02-26 13:48:56 -08:00
    Co-authored-by: Cade Daniel <edacih@gmail.com>
Jared Moore | 70f3e8e3a1 | Add LogProbs for Chat Completions in OpenAI (#2918) | 2024-02-26 10:39:34 +08:00
Harry Mellor | ef978fe411 | Port metrics from aioprometheus to prometheus_client (#2730) | 2024-02-25 11:54:00 -08:00
Woosuk Kwon | f7c1234990 | [Fix] Fix assertion on YaRN model len (#2984) | 2024-02-23 12:57:48 -08:00
zhaoyang-star | 57f044945f | Fix nvcc not found in vlm-openai image (#2781) | 2024-02-22 14:25:07 -08:00
Ronen Schaffer | 4caf7044e0 | Include tokens from prompt phase in counter_generation_tokens (#2802) | 2024-02-22 14:00:12 -08:00
Woosuk Kwon | 6f32cddf1c | Remove Flash Attention in test env (#2982) | 2024-02-22 09:58:29 -08:00
44670 | c530e2cfe3 | [FIX] Fix a bug in initializing Yarn RoPE (#2983) | 2024-02-22 01:40:05 -08:00
Woosuk Kwon | fd5dcc5c81 | Optimize GeGLU layer in Gemma (#2975) | 2024-02-21 20:17:52 -08:00
Massimiliano Pronesti | 93dc5a2870 | chore(vllm): codespell for spell checking (#2820) | 2024-02-21 18:56:01 -08:00
Woosuk Kwon | 95529e3253 | Use Llama RMSNorm custom op for Gemma (#2974) | 2024-02-21 18:28:23 -08:00
Roy | 344020c926 | Migrate MistralForCausalLM to LlamaForCausalLM (#2868) | 2024-02-21 18:25:05 -08:00
Mustafa Eyceoz | 5574081c49 | Added early stopping to completion APIs (#2939) | 2024-02-21 18:24:01 -08:00
Ronen Schaffer | d7f396486e | Update comment (#2934) | 2024-02-21 18:18:37 -08:00