Chang Su
e254497b66
[Model][Misc] Add e5-mistral-7b-instruct and Embedding API ( #3734 )
2024-05-11 11:30:37 -07:00
Mahmoud Ashraf
16bc0a098f
[Frontend] add tok/s speed metric to LLM class when using tqdm ( #4400 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-05-08 22:02:31 -07:00
SangBin Cho
3521ba4f25
[Core][Model runner refactoring 1/N] Refactor attn metadata term ( #4518 )
2024-05-03 10:20:12 -07:00
GeauxEric
a37d815b83
Make initialization of tokenizer and detokenizer optional ( #3748 )
...
Co-authored-by: Yun Ding <yunding@nvidia.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-04-21 22:06:46 +00:00
nunjunj
91528575ec
[Frontend] multiple sampling params support ( #3570 )
2024-04-20 00:11:57 -07:00
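A minimal usage sketch for the entry above, assuming #3570's list form pairs the i-th SamplingParams with the i-th prompt; the model name and parameter values are illustrative:

    from vllm import LLM, SamplingParams

    llm = LLM(model="facebook/opt-125m")  # illustrative model
    prompts = ["Hello, my name is", "The capital of France is"]
    # One SamplingParams per prompt: greedy decoding for the first,
    # nucleus sampling for the second.
    params = [
        SamplingParams(temperature=0.0, max_tokens=16),
        SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32),
    ]
    for out in llm.generate(prompts, params):
        print(out.outputs[0].text)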
Cody Yu
a22cdea371
[Kernel][FP8] Initial support with dynamic per-tensor scaling ( #4118 )
...
Provide initial support for FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726
This feature can be enabled with --quantization fp8 or -q fp8 when launching an engine.
Algorithm:
We still load the model checkpoint in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates the per-tensor scaling factor of the weights and quantizes them accordingly. The scaling factor is then stored for future use. Meanwhile, the per-tensor scaling factor for activations is calculated on every forward pass (see the sketch after this entry).
Initial Results:
Tested so far with Mistral-7B on 1x H100. With prompt length ~5 and decoding length 128:
BF16: 1.47s
FP8: 1.66s
I'll try larger models and look for further performance bottlenecks. Meanwhile, you're welcome to try this code.
2024-04-20 04:28:57 +00:00
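A minimal sketch of the per-tensor scheme described above, simulated in PyTorch (torch.float8_e4m3fn needs PyTorch >= 2.1); the class and helper names are illustrative, not vLLM's actual Fp8LinearMethod:

    import torch

    FP8_E4M3_MAX = 448.0  # largest finite magnitude in float8 e4m3

    def per_tensor_scale(t: torch.Tensor) -> torch.Tensor:
        # Map the tensor's max magnitude onto the FP8 dynamic range.
        return t.abs().max().clamp(min=1e-12) / FP8_E4M3_MAX

    class Fp8LinearSketch:
        """Weights are scaled and quantized once at load time;
        activations get a fresh per-tensor scale every forward pass."""
        def __init__(self, weight: torch.Tensor):
            self.w_scale = per_tensor_scale(weight)
            # Cast through float8 to simulate FP8 storage.
            self.w_q = (weight / self.w_scale).to(torch.float8_e4m3fn)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x_scale = per_tensor_scale(x)  # recomputed on every call
            x_q = (x / x_scale).to(torch.float8_e4m3fn)
            # Dequantize for the matmul; real FP8 kernels fuse this.
            y = x_q.to(x.dtype) @ self.w_q.to(x.dtype).t()
            return y * (x_scale * self.w_scale)

The weight scale is fixed at load time, as the entry describes, while the activation scale is data-dependent and recomputed per forward pass; passing --quantization fp8 (or -q fp8) at engine launch selects this path.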
youkaichao
fbb9d9eef4
[Core] fix custom allreduce default value ( #4040 )
2024-04-12 16:40:39 -07:00
SangBin Cho
09473ee41c
[mypy] Add mypy type annotations, part 1 ( #4006 )
2024-04-12 14:35:50 -07:00
yhu422
d8658c8cc1
Usage Stats Collection ( #2852 )
2024-03-28 22:16:12 -07:00
xwjiang2010
64172a976c
[Feature] Add vision language model support. ( #3042 )
2024-03-25 14:16:30 -07:00
SangBin Cho
01bfb22b41
[CI] Try introducing isort. ( #3495 )
2024-03-25 07:59:47 -07:00
Hanzhi Zhou
f721096d48
[BugFix] Some fixes for custom allreduce kernels ( #2760 )
2024-03-21 23:02:58 -07:00
Chujie Zheng
4cb3b924cd
Add tqdm dynamic_ncols=True ( #3242 )
2024-03-06 22:41:42 +00:00
Sage Moore
ce4f5a29fb
Add Automatic Prefix Caching ( #2762 )
...
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-03-02 00:50:01 -08:00
dancingpipi
51cd22ce56
Set & get the LLM's internal tokenizer instead of the TokenizerGroup ( #2741 )
...
Co-authored-by: shujunhua1 <shujunhua1@jd.com>
2024-02-04 14:25:36 -08:00
Hanzhi Zhou
380170038e
Implement custom all reduce kernels ( #2192 )
2024-01-27 12:46:35 -08:00
Antoni Baum
9b945daaf1
[Experimental] Add multi-LoRA support ( #1804 )
...
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-01-23 15:26:37 -08:00
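A hedged sketch of multi-LoRA serving as exposed in later vLLM releases (enable_lora plus a per-request LoRARequest); the adapter name and path are placeholders:

    from vllm import LLM, SamplingParams
    from vllm.lora.request import LoRARequest

    # enable_lora reserves capacity for adapter weights at engine start.
    llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
    # Each request can carry its own adapter: (name, unique id, path).
    outputs = llm.generate(
        "Translate to French: Hello",
        SamplingParams(max_tokens=32),
        lora_request=LoRARequest("my-adapter", 1, "/path/to/adapter"),
    )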
shiyi.c_98
d10f8e1d43
[Experimental] Prefix Caching Support ( #1669 )
...
Co-authored-by: DouHappy <2278958187@qq.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-17 16:32:10 -08:00
Woosuk Kwon
30fb0956df
[Minor] Add more detailed explanation on quantization argument ( #2145 )
2023-12-17 01:56:16 -08:00
Woosuk Kwon
37ca558103
Optimize model execution with CUDA graph ( #1926 )
...
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2023-12-16 21:12:08 -08:00
CHU Tianxiang
0fbfc4b81b
Add GPTQ support ( #916 )
2023-12-15 03:04:22 -08:00
Simon Mo
5ffc0d13a2
Migrate linter from pylint to ruff ( #1665 )
2023-11-20 11:58:01 -08:00
Federico Cassano
66d18a7fb0
Add support for tokenizer revision ( #1163 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-10-02 19:19:46 -07:00
Woosuk Kwon
bc0644574c
Add gpu_memory_utilization and swap_space to LLM ( #1090 )
2023-09-19 22:16:04 -07:00
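A short sketch of the two knobs from #1090: gpu_memory_utilization is the fraction of GPU memory the engine may claim for weights and KV cache, and swap_space is per-GPU CPU swap in GiB; the values below are illustrative:

    from vllm import LLM

    llm = LLM(
        model="facebook/opt-125m",   # illustrative model
        gpu_memory_utilization=0.8,  # claim at most 80% of GPU memory
        swap_space=8,                # 8 GiB CPU swap for preempted sequences
    )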
orellavie1212
fbe66e1d0b
Add support for quantization on the LLM module ( #1080 )
2023-09-18 11:04:21 -07:00
Jasmond L
ab019eea75
Add Model Revision Support ( #1014 )
...
Co-authored-by: Jasmond Loh <Jasmond.Loh@hotmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-09-13 15:20:02 -07:00
Woosuk Kwon
b6fbb9a565
Sort the outputs before returning ( #402 )
2023-07-08 14:48:18 -07:00
codethazine
a945fcc2ae
Add trust-remote-code flag to handle remote tokenizers ( #364 )
2023-07-07 11:04:58 -07:00
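A one-line sketch of the flag from #364; mosaicml/mpt-7b is an illustrative example of a checkpoint that ships custom code:

    from vllm import LLM

    # trust_remote_code lets the Hugging Face loader execute custom
    # tokenizer/model code bundled with the checkpoint; enable it only
    # for repositories you trust.
    llm = LLM(model="mosaicml/mpt-7b", trust_remote_code=True)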
Zhuohan Li
d6fa1be3a8
[Quality] Add code formatter and linter ( #326 )
2023-07-03 11:31:55 -07:00
Woosuk Kwon
998d9d1509
[Tokenizer] Add tokenizer mode ( #298 )
2023-06-28 14:19:22 -07:00
Woosuk Kwon
4338cc4750
[Tokenizer] Add an option to specify tokenizer ( #284 )
2023-06-28 09:46:58 -07:00
Jishnu Ray Chowdhury
bdd6b4c8bc
Add LLM.set_tokenizer ( #283 )
2023-06-28 00:28:29 -07:00
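A sketch tying together the two tokenizer entries above: #284's tokenizer= option at construction and #283's LLM.set_tokenizer afterwards; the model and tokenizer names are illustrative:

    from transformers import AutoTokenizer
    from vllm import LLM

    # Option 1: name a tokenizer repo separately from the model.
    llm = LLM(model="facebook/opt-125m", tokenizer="facebook/opt-125m")

    # Option 2: swap in a pre-configured tokenizer after construction.
    tok = AutoTokenizer.from_pretrained("facebook/opt-125m")
    llm.set_tokenizer(tok)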
Woosuk Kwon
14f0b39cda
[Bugfix] Fix a bug in RequestOutput.finished ( #202 )
2023-06-22 00:17:24 -07:00
Woosuk Kwon
0b98ba15c7
Change the name to vLLM ( #150 )
2023-06-17 03:07:40 -07:00