Alexander Matveev
9db93de20c
[Core] Add multi-step support to LLMEngine ( #7789 )
2024-08-23 12:45:53 -07:00
Maximilien de Bayser
e25fee57c2
[BugFix] Fix server crash on empty prompt ( #7746 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2024-08-23 13:12:44 +00:00
Michael Goin
15310b5101
[Bugfix] Use LoadFormat values for vllm serve --load-format ( #7784 )
2024-08-22 11:37:08 -07:00
William Lin
dd53c4b023
[misc] Add Torch profiler support ( #7451 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-08-21 15:39:26 -07:00
William Lin
91f4522cbf
[multi-step] Raise error if not using async engine ( #7703 )
2024-08-21 11:49:19 -07:00
Robert Shaw
f7e3b0c5aa
[Bugfix][Frontend] Fix Issues Under High Load With zeromq Frontend ( #7394 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-21 13:34:14 -04:00
Nick Hill
c75363fbc0
[BugFix] Avoid premature async generator exit and raise all exception variations ( #7698 )
2024-08-21 11:45:55 -04:00
Cyrus Leung
baaedfdb2d
[mypy] Enable following imports for entrypoints ( #7248 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Fei <dfdfcai4@gmail.com>
2024-08-20 23:28:21 -07:00
Travis Johnson
67e02fa8a4
[Bugfix] use StoreBoolean instead of type=bool for --disable-logprobs-during-spec-decoding ( #7665 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-08-20 00:43:09 +00:00
William Lin
47b65a5508
[core] Multi Step Scheduling ( #7000 )
...
Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com>
2024-08-19 13:52:13 -07:00
Ali Panahi
dad961ef5c
[Bugfix] fix lora_dtype value type in arg_utils.py - part 2 ( #5428 )
2024-08-19 20:47:00 +00:00
Cody Yu
3ac50b47d0
[MISC] Add prefix cache hit rate to metrics ( #7606 )
2024-08-19 11:52:07 -07:00
SangBin Cho
ff7ec82c4d
[Core] Optimize SPMD architecture with delta + serialization optimization ( #7109 )
2024-08-18 17:57:20 -07:00
Robert Shaw
e3b318216d
[ Bugfix ] Fix Prometheus Metrics With zeromq Frontend ( #7279 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-18 20:19:48 +00:00
Roger Wang
bbf55c4805
[VLM] Refactor MultiModalConfig initialization and profiling ( #7530 )
2024-08-17 13:30:55 -07:00
Mahesh Keralapura
93478b63d2
[Core] Fix tracking of model forward time in case of PP>1 ( #7440 )
...
[Core] Fix tracking of model forward time to the span traces in case of PP>1 (#7440 )
2024-08-16 13:46:01 -07:00
shangmingc
b67ae00cdb
[Misc] Add quantization config support for speculative model. ( #7343 )
2024-08-15 19:34:28 -07:00
omrishiv
9c1f78d5d6
[Bugfix] update neuron for version > 0.5.0 ( #7175 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-08-15 09:44:14 -07:00
William Lin
2ecf7b1757
[core] [3/N] multi-step args and sequence.py ( #7452 )
2024-08-14 12:32:45 -07:00
Cyrus Leung
3f674a49b5
[VLM][Core] Support profiling with multiple multi-modal inputs per prompt ( #7126 )
2024-08-14 17:55:42 +00:00
Wallas Henrique
70b746efcf
[Misc] Deprecation Warning when setting --engine-use-ray ( #7424 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: youkaichao <youkaichao@126.com>
2024-08-14 09:44:27 -07:00
youkaichao
16422ea76f
[misc][plugin] add plugin system implementation ( #7426 )
2024-08-13 16:24:17 -07:00
Rui Qiao
198d6a2898
[Core] Shut down aDAG workers with clean async llm engine exit ( #7224 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2024-08-12 17:57:16 -07:00
Cyrus Leung
4ddc4743d7
[Core] Consolidate GB constant and enable float GB arguments ( #7416 )
2024-08-12 14:14:14 -07:00
Cyrus Leung
24154f8618
[Frontend] Disallow passing model as both argument and option ( #7347 )
2024-08-12 12:58:34 +00:00
Mahesh Keralapura
933790c209
[Core] Add span metrics for model_forward, scheduler and sampler time ( #7089 )
2024-08-09 13:55:13 -07:00
Nick Hill
b4e9528f95
[Core] Streamline stream termination in AsyncLLMEngine ( #7336 )
2024-08-09 07:06:36 +00:00
Cyrus Leung
7eb4a51c5f
[Core] Support serving encoder/decoder models ( #7258 )
2024-08-09 10:39:41 +08:00
Joe Runde
21b9c49aa3
[Frontend] Kill the server on engine death ( #6594 )
...
Signed-off-by: Joe Runde <joe@joerun.de>
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-08-08 09:47:48 -07:00
Nick Hill
9a3f49ae07
[BugFix] Overhaul async request cancellation ( #7111 )
2024-08-07 13:21:41 +08:00
afeldman-nm
fd95e026e0
[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) ( #4942 )
...
Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-06 16:51:47 -04:00
Isotr0py
360bd67cf0
[Core] Support loading GGUF model ( #5191 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-05 17:54:23 -06:00
Cade Daniel
82a1b1a82b
[Speculative decoding] Add periodic log with time spent in proposal/scoring/verification ( #6963 )
2024-08-05 08:46:44 +00:00
youkaichao
83c644fe7e
[core][misc] simply output processing with shortcut code path ( #7117 )
2024-08-04 00:22:19 -07:00
Robert Shaw
ed812a73fa
[ Frontend ] Multiprocessing for OpenAI Server with zeromq ( #6883 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <joe@joerun.de>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-08-02 18:27:28 -07:00
Cyrus Leung
da1f7cc12a
[mypy] Enable following imports for some directories ( #6681 )
2024-07-31 10:38:03 +08:00
fzyzcjy
f058403683
[Doc] Super tiny fix doc typo ( #6949 )
2024-07-30 09:14:03 -07:00
Earthwalker
7f8d612d24
[TPU] Support tensor parallelism in async llm engine ( #6891 )
2024-07-29 12:42:21 -07:00
Woosuk Kwon
52f07e3dec
[Hardware][TPU] Implement tensor parallelism with Ray ( #5871 )
2024-07-26 20:54:27 -07:00
tomeras91
ed94e4f427
[Bugfix][Model] Jamba assertions and no chunked prefill by default for Jamba ( #6784 )
2024-07-26 20:45:31 -07:00
Li, Jiang
3bbb4936dc
[Hardware] [Intel] Enable Multiprocessing and tensor parallel in CPU backend and update documentation ( #6125 )
2024-07-26 13:50:10 -07:00
Allen.Dou
40468b13fa
[Bugfix] Miscalculated latency lead to time_to_first_token_seconds inaccurate. ( #6686 )
2024-07-24 08:58:42 -07:00
dongmao zhang
87525fab92
[bitsandbytes]: support read bnb pre-quantized model ( #5753 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-07-23 23:45:09 +00:00
Thomas Parnell
2f808e69ab
[Bugfix] StatLoggers: cache spec decode metrics when they get collected. ( #6645 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-07-23 23:05:05 +00:00
Simon Mo
3eda4ec780
support ignore patterns in model loader ( #6673 )
2024-07-22 23:59:42 -07:00
Woosuk Kwon
729171ae58
[Misc] Enable chunked prefill by default for long context models ( #6666 )
2024-07-22 20:03:13 -07:00
Cyrus Leung
739b61a348
[Frontend] Refactor prompt processing ( #4028 )
...
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-22 10:13:53 -07:00
sroy745
14f91fe67c
[Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. ( #6485 )
2024-07-20 23:58:58 -07:00
Travis Johnson
3f8d42c81f
Pipeline Parallel: Guard for KeyErrors at request abort ( #6587 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-07-19 19:18:19 -07:00
Antoni Baum
7bd82002ae
[Core] Allow specifying custom Executor ( #6557 )
2024-07-20 01:25:06 +00:00