William Lin
dd53c4b023
[misc] Add Torch profiler support ( #7451 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-08-21 15:39:26 -07:00
SangBin Cho
ff7ec82c4d
[Core] Optimize SPMD architecture with delta + serialization optimization ( #7109 )
2024-08-18 17:57:20 -07:00
Roger Wang
bbf55c4805
[VLM] Refactor MultiModalConfig initialization and profiling ( #7530 )
2024-08-17 13:30:55 -07:00
youkaichao
eed020f673
[misc] use nvml to get consistent device name ( #7582 )
2024-08-16 21:15:13 -07:00
William Lin
c08e2b3086
[core] [2/N] refactor worker_base input preparation for multi-step ( #7387 )
2024-08-11 08:50:08 -07:00
Mahesh Keralapura
933790c209
[Core] Add span metrics for model_forward, scheduler and sampler time ( #7089 )
2024-08-09 13:55:13 -07:00
Cyrus Leung
7eb4a51c5f
[Core] Support serving encoder/decoder models ( #7258 )
2024-08-09 10:39:41 +08:00
afeldman-nm
fd95e026e0
[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) ( #4942 )
...
Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-06 16:51:47 -04:00
youkaichao
806949514a
[ci] set timeout for test_oot_registration.py ( #7082 )
2024-08-02 10:03:24 -07:00
Woosuk Kwon
a9a2e74d21
[Misc] Use torch.Tensor for type annotation ( #6505 )
2024-07-17 13:01:10 +00:00
Abhinav Goyal
2416b26e11
[Speculative Decoding] Medusa Implementation with Top-1 proposer ( #4978 )
2024-07-09 18:34:02 -07:00
Swapnil Parekh
4d6ada947c
[CORE] Adding support for insertion of soft-tuned prompts ( #4645 )
...
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
Co-authored-by: Joe G <joseph.granados@h2o.ai>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-07-09 13:26:36 -07:00
youkaichao
3de6e6a30e
[core][distributed] support n layers % pp size != 0 ( #6115 )
2024-07-03 16:40:31 -07:00
xwjiang2010
d9e98f42e4
[vlm] Remove vision language config. ( #6089 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-03 22:14:16 +00:00
youkaichao
482045ee77
[hardware][misc] introduce platform abstraction ( #6080 )
2024-07-02 20:12:22 -07:00
Murali Andoorveedu
c5832d2ae9
[Core] Pipeline Parallel Support ( #4412 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-02 10:58:08 -07:00
youkaichao
614aa51203
[misc][cuda] use nvml to avoid accidentally cuda initialization ( #6007 )
2024-06-30 20:07:34 -07:00
Cody Yu
b2c620230a
[Spec Decode] Introduce DraftModelRunner ( #5799 )
2024-06-28 09:17:51 -07:00
Stephanie Wang
dda4811591
[Core] Refactor Worker and ModelRunner to consolidate control plane communication ( #5408 )
...
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Signed-off-by: Stephanie <swang@anyscale.com>
Co-authored-by: Stephanie <swang@anyscale.com>
2024-06-25 20:30:03 -07:00
rohithkrn
f5dda63eb5
[LoRA] Add support for pinning lora adapters in the LRU cache ( #5603 )
2024-06-21 15:42:46 -07:00
Joshua Rosenkranz
b12518d3cf
[Model] MLPSpeculator speculative decoding support ( #4947 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Davis Wertheimer <Davis.Wertheimer@ibm.com>
2024-06-20 20:23:12 -04:00
Kunshang Ji
728c4c8a06
[Hardware][Intel GPU] Add Intel GPU(XPU) inference backend ( #3814 )
...
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com>
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-06-17 11:01:25 -07:00
Travis Johnson
51602eefd3
[Frontend] [Core] Support for sharded tensorized models ( #4990 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Sanger Steel <sangersteel@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-06-12 14:13:52 -07:00
Nick Hill
eb6d3c264d
[Core] Eliminate parallel worker per-step task scheduling overhead ( #4894 )
2024-05-23 06:17:27 +09:00
Cody Yu
973617ae02
[Speculative decoding][Re-take] Enable TP>1 speculative decoding ( #4840 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cade Daniel <cade@anyscale.com>
2024-05-16 00:53:51 -07:00
Aurick Qiao
30e754390c
[Core] Implement sharded state loader ( #4690 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-05-15 22:11:54 -07:00
youkaichao
702bee461f
[Core][Distributed] refactor custom allreduce to support multiple tp groups ( #4754 )
2024-05-12 17:47:59 -07:00
Chang Su
e254497b66
[Model][Misc] Add e5-mistral-7b-instruct and Embedding API ( #3734 )
2024-05-11 11:30:37 -07:00
youkaichao
208b71bcc1
[Core][Distributed] refactor pynccl ( #4591 )
...
[Core][Distributed] refactor pynccl to hold multiple communicators (#4591 )
2024-05-09 19:48:43 -07:00
Woosuk Kwon
0ee535b294
[Misc] Set block size at initialization & Fix test_model_runner ( #4705 )
2024-05-09 09:04:59 -07:00
youkaichao
20cfcdec99
[Core][Optimization] change python dict to pytorch tensor for blocks to swap ( #4659 )
2024-05-08 12:07:05 -07:00
youkaichao
63575bc2e1
[Core][Optimization] change python dict to pytorch tensor ( #4607 )
2024-05-06 21:30:27 -07:00
Cody Yu
bc8ad68455
[Misc][Refactor] Introduce ExecuteModelData ( #4540 )
2024-05-03 17:47:07 -07:00
youkaichao
344a5d0c33
[Core][Distributed] enable allreduce for multiple tp groups ( #4566 )
2024-05-02 17:32:33 -07:00
youkaichao
91f50a6fe2
[Core][Distributed] use cpu/gloo to initialize pynccl ( #4248 )
2024-04-23 18:32:19 -07:00
SangBin Cho
533d2a1f39
[Typing] Mypy typing part 2 ( #4043 )
...
Co-authored-by: SangBin Cho <sangcho@sangcho-LT93GQWG9C.local>
2024-04-17 17:28:43 -07:00
youkaichao
8438e0569e
[Core] RayWorkerVllm --> WorkerWrapper to reduce duplication ( #4024 )
...
[Core] replace narrow-usage RayWorkerVllm to general WorkerWrapper to reduce code duplication (#4024 )
2024-04-17 08:34:33 +00:00
Cade Daniel
e95cd87959
[Speculative decoding 6/9] Integrate speculative decoding with LLMEngine ( #3894 )
2024-04-16 13:09:21 -07:00
Antoni Baum
69e1d2fb69
[Core] Refactor model loading code ( #4097 )
2024-04-16 11:34:39 -07:00
Sanger Steel
711a000255
[Frontend] [Core] feat: Add model loading using tensorizer ( #3476 )
2024-04-13 17:13:01 -07:00
youkaichao
63e7176f26
[Core][Refactor] move parallel_utils into vllm/distributed ( #3950 )
...
[WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950 )
2024-04-10 15:33:30 -07:00
Cade Daniel
e7c7067b45
[Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" ( #3837 )
2024-04-09 11:44:15 -07:00
youkaichao
6d592eb430
[Core] separate distributed_init from worker ( #3904 )
2024-04-09 08:49:02 +00:00
youkaichao
756b30a5f3
[Core][Test] move local_rank to the last arg with default value( #3711 )
...
[Core][Test] move local_rank to the last arg with default value to keep api compatible (#3711 )
2024-03-28 21:19:45 -07:00
Roy
515386ef3c
[Core] Support multi-node inference(eager and cuda graph) ( #3686 )
2024-03-28 15:01:55 -07:00
youkaichao
8f44facddd
[Core] remove cupy dependency ( #3625 )
2024-03-27 00:33:26 -07:00
xwjiang2010
64172a976c
[Feature] Add vision language model support. ( #3042 )
2024-03-25 14:16:30 -07:00
SangBin Cho
01bfb22b41
[CI] Try introducing isort. ( #3495 )
2024-03-25 07:59:47 -07:00
Woosuk Kwon
925f3332ca
[Core] Refactor Attention Take 2 ( #3462 )
2024-03-25 04:39:33 +00:00
Zhuohan Li
e90fc21f2e
[Hardware][Neuron] Refactor neuron support ( #3471 )
2024-03-22 01:22:17 +00:00