SangBin Cho
|
01bfb22b41
|
[CI] Try introducing isort. (#3495)
|
2024-03-25 07:59:47 -07:00 |
|
TianYu GUO
|
e67c295b0c
|
[Bugfix] fix automatic prefix args and add log info (#3608)
|
2024-03-25 05:35:22 -07:00 |
|
Thomas Parnell
|
cf2f084d56
|
Dynamic scheduler delay to improve ITL performance (#3279)
Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com>
|
2024-03-22 12:28:14 -07:00 |
|
SangBin Cho
|
6e435de766
|
[1/n][Chunked Prefill] Refactor input query shapes (#3236)
|
2024-03-20 14:46:05 -07:00 |
|
Antoni Baum
|
fb96c1e98c
|
Asynchronous tokenization (#2879)
|
2024-03-15 23:37:01 +00:00 |
|
Antoni Baum
|
22de45235c
|
Push logprob generation to LLMEngine (#3065)
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
|
2024-03-04 19:54:06 +00:00 |
|
Philipp Moritz
|
17c3103c56
|
Make it easy to profile workers with nsight (#3162)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
|
2024-03-03 16:19:13 -08:00 |
|
Sage Moore
|
ce4f5a29fb
|
Add Automatic Prefix Caching (#2762)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
|
2024-03-02 00:50:01 -08:00 |
|
Liangfu Chen
|
3b7178cfa4
|
[Neuron] Support inference with transformers-neuronx (#2569)
|
2024-02-28 09:34:34 -08:00 |
|
Nick Hill
|
7d2dcce175
|
Support per-request seed (#2514)
|
2024-02-21 11:47:00 -08:00 |
|
Mark Mozolewski
|
786b7f18a5
|
Add code-revision config argument for Hugging Face Hub (#2892)
|
2024-02-17 22:36:53 -08:00 |
|
Kunshang Ji
|
96b6f475dd
|
Remove hardcoded device="cuda" to support more devices (#2503)
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
|
2024-02-01 15:46:39 -08:00 |
|
zhaoyang-star
|
9090bf02e7
|
Support FP8-E5M2 KV Cache (#2279)
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
|
2024-01-28 16:43:54 -08:00 |
|
Hanzhi Zhou
|
380170038e
|
Implement custom all reduce kernels (#2192)
|
2024-01-27 12:46:35 -08:00 |
|
Antoni Baum
|
9b945daaf1
|
[Experimental] Add multi-LoRA support (#1804)
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
|
2024-01-23 15:26:37 -08:00 |
|
Suhong Moon
|
290e015c6c
|
Update Help Text for --gpu-memory-utilization Argument (#2183)
|
2023-12-18 11:33:24 -08:00 |
|
JohnSaxon
|
bbe4466fd9
|
[Minor] Fix typo (#2166)
Co-authored-by: John-Saxon <zhang.xiangxuan@oushu.com>
|
2023-12-17 23:28:49 -08:00 |
|
Woosuk Kwon
|
30fb0956df
|
[Minor] Add more detailed explanation on quantization argument (#2145)
|
2023-12-17 01:56:16 -08:00 |
|
Woosuk Kwon
|
37ca558103
|
Optimize model execution with CUDA graph (#1926)
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
|
2023-12-16 21:12:08 -08:00 |
|
CHU Tianxiang
|
0fbfc4b81b
|
Add GPTQ support (#916)
|
2023-12-15 03:04:22 -08:00 |
|
Woosuk Kwon
|
27feead2f8
|
Refactor Worker & InputMetadata (#1843)
|
2023-11-29 22:16:37 -08:00 |
|
Casper
|
a921d8be9d
|
[DOCS] Add engine args documentation (#1741)
|
2023-11-22 12:31:27 -08:00 |
|
boydfd
|
4bb6b67188
|
fix RAM OOM when load large models in tensor parallel mode. (#1395)
Co-authored-by: ran_lin <rlin@thoughtworks.com>
|
2023-11-20 19:02:42 -08:00 |
|
chooper1
|
1f24755bf8
|
Support SqueezeLLM (#1326)
Co-authored-by: squeeze-ai-lab <squeezeailab.bair@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2023-10-21 23:14:59 -07:00 |
|
Woosuk Kwon
|
c1376e0f82
|
Change scheduler & input tensor shape (#1381)
|
2023-10-16 17:48:42 -07:00 |
|
Federico Cassano
|
66d18a7fb0
|
add support for tokenizer revision (#1163)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
|
2023-10-02 19:19:46 -07:00 |
|
Woosuk Kwon
|
f936657eb6
|
Provide default max model length (#1224)
|
2023-09-28 14:44:02 -07:00 |
|
Chris Bamford
|
bb1ba58f06
|
[Mistral] Mistral-7B-v0.1 support (#1196)
Co-authored-by: timlacroix <t@mistral.ai>
|
2023-09-28 10:41:03 -07:00 |
|
Woosuk Kwon
|
a19bc5c628
|
Automatically configure max_num_batched_tokens (#1198)
|
2023-09-27 16:34:00 -07:00 |
|
Woosuk Kwon
|
1ac4ccf73c
|
Add float16 and float32 (#1115)
|
2023-09-21 00:52:47 -07:00 |
|
Woosuk Kwon
|
e3e79e9e8a
|
Implement AWQ quantization support for LLaMA (#1032)
Co-authored-by: Robert Irvine <robert@seamlessml.com>
Co-authored-by: root <rirv938@gmail.com>
Co-authored-by: Casper <casperbh.96@gmail.com>
Co-authored-by: julian-q <julianhquevedo@gmail.com>
|
2023-09-16 00:03:37 -07:00 |
|
Jasmond L
|
ab019eea75
|
Add Model Revision Support (#1014)
Co-authored-by: Jasmond Loh <Jasmond.Loh@hotmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
|
2023-09-13 15:20:02 -07:00 |
|
Antoni Baum
|
0bb1e885a0
|
Make max_model_len configurable (#972)
|
2023-09-12 16:29:19 -07:00 |
|
leiwen83
|
d6545ad22e
|
add option to shorten prompt print in log (#991)
Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
|
2023-09-12 15:10:14 -07:00 |
|
Zhuohan Li
|
c957c741d9
|
Enable safetensors loading for all models (#974)
|
2023-09-07 15:49:52 -07:00 |
|
Zhuohan Li
|
58a072be15
|
[Fix] Add model sequence length into model config (#575)
|
2023-07-25 23:46:30 -07:00 |
|
Lily Liu
|
b4b195b360
|
fix max seq len (#489)
|
2023-07-17 23:20:20 -07:00 |
|
codethazine
|
a945fcc2ae
|
Add trust-remote-code flag to handle remote tokenizers (#364)
|
2023-07-07 11:04:58 -07:00 |
|
Zhuohan Li
|
d6fa1be3a8
|
[Quality] Add code formatter and linter (#326)
|
2023-07-03 11:31:55 -07:00 |
|
Lily Liu
|
dafd924c1f
|
Raise error for long prompt (#273)
|
2023-06-30 18:48:49 -07:00 |
|
Woosuk Kwon
|
998d9d1509
|
[Tokenizer] Add tokenizer mode (#298)
|
2023-06-28 14:19:22 -07:00 |
|
Woosuk Kwon
|
4338cc4750
|
[Tokenizer] Add an option to specify tokenizer (#284)
|
2023-06-28 09:46:58 -07:00 |
|
Zhuohan Li
|
bf5f121c02
|
Reduce GPU memory utilization to make sure OOM doesn't happen (#153)
|
2023-06-18 17:33:50 +08:00 |
|
Woosuk Kwon
|
0b98ba15c7
|
Change the name to vLLM (#150)
|
2023-06-17 03:07:40 -07:00 |
|