Itay Etelis
7c3604fb68
[Bugfix] logprobs is not compatible with the OpenAI spec #4795 ( #5031 )
2024-05-29 16:13:22 -07:00
Cyrus Leung
5ae5ed1e60
[Core] Consolidate prompt arguments to LLM engines ( #4328 )
...
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-05-28 13:29:31 -07:00
Eric Xihui Lin
8e192ff967
[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model ( #4799 )
...
Co-authored-by: beagleski <yunanzhang@microsoft.com>
Co-authored-by: bapatra <bapatra@microsoft.com>
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-05-24 22:00:52 -07:00
bofeng huang
0150a10630
[Frontend] OpenAI API server: Do not add bos token by default when encoding ( #4688 )
2024-05-16 18:47:22 -07:00
Pierre Dulac
9216b9cc38
[Bugfix] Bypass authorization API token for preflight requests ( #4862 )
2024-05-16 09:42:21 -07:00
Alex Wu
5e0391c040
[Frontend] Separate OpenAI Batch Runner usage from API Server ( #4851 )
2024-05-17 00:42:41 +09:00
Alex Wu
52f8107cf2
[Frontend] Support OpenAI batch file format ( #4794 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-05-15 19:13:36 -04:00
Cyrus Leung
fc0d9dfc3a
[Frontend] Re-enable custom roles in Chat Completions API ( #4758 )
2024-05-15 14:58:46 -07:00
Chang Su
e254497b66
[Model][Misc] Add e5-mistral-7b-instruct and Embedding API ( #3734 )
2024-05-11 11:30:37 -07:00
Cyrus Leung
f12b20decc
[Frontend] Move async logic outside of constructor ( #4674 )
2024-05-08 22:48:33 -07:00
Mahmoud Ashraf
16bc0a098f
[Frontend] add tok/s speed metric to llm class when using tqdm ( #4400 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-05-08 22:02:31 -07:00
Cyrus Leung
323f27b904
[Bugfix] Fix asyncio.Task not being subscriptable ( #4623 )
2024-05-06 09:31:05 -07:00
DearPlanet
4302987069
[Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics ( #3937 )
2024-05-04 15:39:34 -07:00
Sebastian Schoennenbeck
f8e7adda21
Fix/async chat serving ( #2727 )
2024-05-03 11:04:14 -07:00
Michael Goin
7e65477e5e
[Bugfix] Allow "None" or "" to be passed to CLI for string args that default to None ( #4586 )
2024-05-03 10:32:21 -07:00
SangBin Cho
3521ba4f25
[Core][Model runner refactoring 1/N] Refactor attn metadata term ( #4518 )
2024-05-03 10:20:12 -07:00
Yang, Bo
808632d3b4
[BugFix] Prevent the task of _force_log from being garbage collected ( #4567 )
2024-05-03 01:35:18 +00:00
youkaichao
5b8a7c1cb0
[Misc] centralize all usage of environment variables ( #4548 )
2024-05-02 11:13:25 -07:00
sasha0552
c47ba4aaa9
[Bugfix] Add validation for seed ( #4529 )
2024-05-01 19:31:22 +00:00
Robert Shaw
4dc8026d86
[Bugfix] Fix 307 Redirect for /metrics ( #4523 )
2024-05-01 09:14:13 -07:00
Robert Caulk
c3845d82dc
Allow user to define whitespace pattern for outlines ( #4305 )
2024-04-30 20:48:39 -07:00
Florian Greinacher
a494140433
[Frontend] Support complex message content for chat completions endpoint ( #3467 )
...
Co-authored-by: Lily Liu <lilyliupku@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-04-30 16:28:46 -07:00
Roy
7134303cbb
[Bugfix][Core] Fix get decoding config from ray ( #4335 )
2024-04-27 11:30:08 +00:00
Cyrus Leung
8947bc3c15
[Frontend][Bugfix] Disallow extra fields in OpenAI API ( #4355 )
2024-04-27 05:08:24 +00:00
SangBin Cho
a88081bf76
[CI] Disable non-lazy string operation on logging ( #4326 )
...
Co-authored-by: Danny Guinther <dguinther@neuralmagic.com>
2024-04-26 00:16:58 -07:00
Norman Mu
2f30e7c72f
[Frontend] Add --log-level option to api server ( #4377 )
2024-04-26 05:36:01 +00:00
Cyrus Leung
1e8f4252aa
[Bugfix][Frontend] Raise exception when file-like chat template fails to be opened ( #4292 )
2024-04-23 18:19:03 +00:00
Jack Gordley
d3c8180ac4
[Bugfix] Fixing max token error message for openai compatible server ( #4016 )
2024-04-23 19:06:29 +08:00
SangBin Cho
0ae11f78ab
[Mypy] Part 3 fix typing for nested directories for most of directory ( #4161 )
2024-04-22 21:32:44 -07:00
GeauxEric
a37d815b83
Make initialization of tokenizer and detokenizer optional ( #3748 )
...
Co-authored-by: Yun Ding <yunding@nvidia.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-04-21 22:06:46 +00:00
nunjunj
91528575ec
[Frontend] multiple sampling params support ( #3570 )
2024-04-20 00:11:57 -07:00
Cody Yu
a22cdea371
[Kernel][FP8] Initial support with dynamic per-tensor scaling ( #4118 )
...
Provide an initial support to FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726
This feature can be enabled with --quantization fp8 or -q fp8 when launching an engine.
Algorithm:
We still load a model checkpoint in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates the per-tensor scaling factor of weights and quantizes the weights accordingly. The scaling factor will then be stored for future use. Meanwhile, the per-tensor scaling factor for activations is calculated in every forward pass.
Initial Results:
Currently tested Mistral-7B on 1xH100. With prompt length ~5 and decoding length 128:
BF16: 1.47s
FP8: 1.66s
I'll try to use larger models and try to find more performance bottleneck. Meanwhile, you're welcome to try this code.
2024-04-20 04:28:57 +00:00
Chirag Jain
bc9df1571b
Pass tokenizer_revision when getting tokenizer in openai serving ( #4214 )
2024-04-19 17:13:56 -07:00
James Whedbee
e1bb2fd52d
[Bugfix] Support logprobs when using guided_json and other constrained decoding fields ( #4149 )
2024-04-18 21:12:55 +00:00
Harry Mellor
66ded03067
Allow model to be served under multiple names ( #2894 )
...
Co-authored-by: Alexandre Payot <alexandrep@graphcore.ai>
2024-04-18 00:16:26 -07:00
Noam Gat
05434764cd
LM Format Enforcer Guided Decoding Support ( #3868 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-04-16 05:54:57 +00:00
youkaichao
fbb9d9eef4
[Core] fix custom allreduce default value ( #4040 )
2024-04-12 16:40:39 -07:00
SangBin Cho
09473ee41c
[mypy] Add mypy type annotation part 1 ( #4006 )
2024-04-12 14:35:50 -07:00
Cyrus Leung
7fd3949a0b
[Frontend][Core] Move merge_async_iterators to utils ( #4026 )
2024-04-12 05:30:54 +00:00
Dylan Hawk
95e7d4a97c
Fix echo/logprob OpenAI completion bug ( #3441 )
...
Co-authored-by: Dylan Hawk <dylanwawk@gmail.com>
2024-04-11 22:15:50 +00:00
Thomas Parnell
1d7c940d74
Add option to completion API to truncate prompt tokens ( #3144 )
2024-04-05 10:15:42 -07:00
A-Mahla
0739b1947f
[Frontend][Bugfix] allow using the default middleware with a root path ( #3788 )
...
Co-authored-by: A-Mahla <>
2024-04-02 01:20:28 -07:00
Roy
f510395bbf
[BugFix][Frontend] Fix completion logprobs=0 error ( #3731 )
2024-03-29 09:38:21 -07:00
Roy
6110c39dc8
[BugFix] Fix tokenizer out of vocab size ( #3685 )
2024-03-29 08:18:59 -07:00
yhu422
d8658c8cc1
Usage Stats Collection ( #2852 )
2024-03-28 22:16:12 -07:00
Nick Hill
dfeb2ecc3a
[Misc] Include matched stop string/token in responses ( #2976 )
...
Co-authored-by: Sahil Suneja <sahilsuneja@gmail.com>
2024-03-25 17:31:32 -07:00
xwjiang2010
64172a976c
[Feature] Add vision language model support. ( #3042 )
2024-03-25 14:16:30 -07:00
Dylan Hawk
0b4997e05c
[Bugfix] API stream returning two stops ( #3450 )
...
Co-authored-by: Dylan Hawk <dylanwawk@gmail.com>
2024-03-25 10:14:34 -07:00
Travis Johnson
c13ad1b7bd
feat: implement the min_tokens sampling parameter ( #3124 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-03-25 10:14:26 -07:00
SangBin Cho
01bfb22b41
[CI] Try introducing isort. ( #3495 )
2024-03-25 07:59:47 -07:00