327 Commits

Author SHA1 Message Date
Cyrus Leung
051eaf6db3
[Model] Add user-configurable task for models that support both generation and embedding (#9424) 2024-10-18 11:31:58 -07:00
Joe Runde
de4008e2ab
[Bugfix][Core] Use torch.cuda.memory_stats() to profile peak memory usage (#9352)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-17 22:47:27 -04:00
Chang Su
ba30942240
[Bugfix] Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids (#9034)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-10-15 15:40:43 -07:00
Nick Hill
e9d517f276
[BugFix] Fix chat API continuous usage stats (#9357) 2024-10-14 23:19:48 -07:00
youkaichao
cbc2ef5529
[misc] hide best_of from engine (#9261)
Co-authored-by: Brendan Wong <bjwpokemon@gmail.com>
2024-10-10 21:30:44 -07:00
Daniele
9a94ca4a5d
[Bugfix] fix OpenAI API server startup with --disable-frontend-multiprocessing (#8537) 2024-10-08 09:38:40 -07:00
Alex Brooks
069d3bd8d0
[Frontend] Add Early Validation For Chat Template / Tool Call Parser (#9151)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-10-08 14:31:26 +00:00
Brendan Wong
8c746226c9
[Frontend] API support for beam search for MQLLMEngine (#9117) 2024-10-08 05:51:43 +00:00
Brendan Wong
168cab6bbf
[Frontend] API support for beam search (#9087)
Co-authored-by: youkaichao <youkaichao@126.com>
2024-10-05 23:39:03 -07:00
Flávia Béo
0dcc8cbe5a
Adds truncate_prompt_tokens param for embeddings creation (#8999)
Signed-off-by: Flavia Beo <flavia.beo@ibm.com>
2024-10-04 18:31:40 +00:00
Roger Wang
26aa325f4f
[Core][VLM] Test registration for OOT multimodal models (#8717)
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-10-04 10:38:25 -07:00
Joe Runde
062c89e7c9
[Frontend][Core] Move guided decoding params into sampling params (#8252)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-10-01 09:34:25 +08:00
danieljannai21
6c9ba48fde
[Frontend] Added support for HF's new continue_final_message parameter (#8942) 2024-09-29 17:59:47 +00:00
Cyrus Leung
3b00b9c26c
[Core] renamePromptInputs and inputs (#8876) 2024-09-26 20:35:15 -07:00
Nick Hill
4b377d6feb
[BugFix] Fix test breakages from transformers 4.45 upgrade (#8829) 2024-09-26 16:46:43 -07:00
Simon Mo
4f1ba0844b
Revert "rename PromptInputs and inputs with backward compatibility (#8760) (#8810) 2024-09-25 10:36:26 -07:00
Cyrus Leung
28e1299e60
rename PromptInputs and inputs with backward compatibility (#8760) 2024-09-25 09:36:47 -07:00
Andy
2529d09b5a
[Frontend] Batch inference for llm.chat() API (#8648)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-09-24 09:44:11 -07:00
Alexander Matveev
1a2aef3e59
Add output streaming support to multi-step + async while ensuring RequestOutput obj reuse (#8335) 2024-09-23 15:38:04 -07:00
Jiaxin Shan
260d40b5ea
[Core] Support Lora lineage and base model metadata management (#6315) 2024-09-20 06:20:56 +00:00
Alexander Matveev
7c7714d856
[Core][Bugfix][Perf] Introduce MQLLMEngine to avoid asyncio OH (#8157)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-09-18 13:56:58 +00:00
Joe Runde
f2e263b801
[Bugfix] Offline mode fix (#8376)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-09-12 11:11:57 -07:00
Pooya Davoodi
cea95dfb94
[Frontend] Create ErrorResponse instead of raising exceptions in run_batch (#8347) 2024-09-11 05:30:11 +00:00
Jiaxin Shan
db3bf7c991
[Core] Support load and unload LoRA in api server (#6566)
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2024-09-05 18:10:33 -07:00
Cyrus Leung
855c262a6b
[Frontend] Multimodal support in offline chat (#8098) 2024-09-04 05:22:17 +00:00
Roger Wang
5231f0898e
[Frontend][VLM] Add support for multiple multi-modal items (#8049) 2024-08-31 16:35:53 -07:00
Nick Hill
39178c7fbc
[Tests] Disable retries and use context manager for openai client (#7565) 2024-08-26 21:33:17 -07:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
0b769992ec
[Bugfix]: Use float32 for base64 embedding (#7855)
Signed-off-by: Hollow Man <hollowman@opensuse.org>
2024-08-26 03:16:38 +00:00
youkaichao
7d9ffa2ae1
[misc][core] lazy import outlines (#7831) 2024-08-24 00:51:38 -07:00
Tyler Rockwood
d81abefd2e
[Frontend] add json_schema support from OpenAI protocol (#7654) 2024-08-23 23:07:24 -07:00
Pooya Davoodi
8da48e4d95
[Frontend] Publish Prometheus metrics in run_batch API (#7641) 2024-08-23 23:04:22 -07:00
Maximilien de Bayser
e25fee57c2
[BugFix] Fix server crash on empty prompt (#7746)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2024-08-23 13:12:44 +00:00
Joe Runde
b903e1ba7f
[Frontend] error suppression cleanup (#7786)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-08-22 21:50:21 +00:00
Joe Runde
cde9183b40
[Bug][Frontend] Improve ZMQ client robustness (#7443)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-08-22 02:18:11 +00:00
Peter Salas
1ca0d4f86b
[Model] Add UltravoxModel and UltravoxConfig (#7615) 2024-08-21 22:49:39 +00:00
Robert Shaw
970dfdc01d
[Frontend] Improve Startup Failure UX (#7716) 2024-08-21 19:53:01 +00:00
Robert Shaw
f7e3b0c5aa
[Bugfix][Frontend] Fix Issues Under High Load With zeromq Frontend (#7394)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-21 13:34:14 -04:00
Cyrus Leung
baaedfdb2d
[mypy] Enable following imports for entrypoints (#7248)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Fei <dfdfcai4@gmail.com>
2024-08-20 23:28:21 -07:00
Robert Shaw
e3b318216d
[ Bugfix ] Fix Prometheus Metrics With zeromq Frontend (#7279)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-18 20:19:48 +00:00
Roger Wang
bbf55c4805
[VLM] Refactor MultiModalConfig initialization and profiling (#7530) 2024-08-17 13:30:55 -07:00
nunjunj
3b19e39dc5
Chat method for offline llm (#5049)
Co-authored-by: nunjunj <ray@g-3ff9f30f2ed650001.c.vllm-405802.internal>
Co-authored-by: nunjunj <ray@g-1df6075697c3f0001.c.vllm-405802.internal>
Co-authored-by: nunjunj <ray@g-c5a2c23abc49e0001.c.vllm-405802.internal>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-08-15 19:41:34 -07:00
Grant Pinkert
f878c8feb0
[Feature]: Add OpenAI server prompt_logprobs support #6508 (#7453) 2024-08-16 02:38:08 +00:00
youkaichao
16422ea76f
[misc][plugin] add plugin system implementation (#7426) 2024-08-13 16:24:17 -07:00
youkaichao
33e5d7e6b6
[frontend] spawn engine process from api server process (#7484) 2024-08-13 15:40:17 -07:00
Peter Salas
00c3d68e45
[Frontend][Core] Add plumbing to support audio language models (#7446) 2024-08-13 17:39:33 +00:00
Cyrus Leung
7025b11d94
[Bugfix] Fix weight loading for Chameleon when TP>1 (#7410) 2024-08-13 05:33:41 +00:00
Andrew Wang
97a6be95ba
[Misc] improve logits processors logging message (#7435) 2024-08-13 02:29:34 +00:00
Pooya Davoodi
249b88228d
[Frontend] Support embeddings in the run_batch API (#7132)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-08-09 09:48:21 -07:00
Cyrus Leung
7eb4a51c5f
[Core] Support serving encoder/decoder models (#7258) 2024-08-09 10:39:41 +08:00
Joe Runde
21b9c49aa3
[Frontend] Kill the server on engine death (#6594)
Signed-off-by: Joe Runde <joe@joerun.de>
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-08-08 09:47:48 -07:00