85 Commits

Author SHA1 Message Date
Antoni Baum
7bd82002ae
[Core] Allow specifying custom Executor (#6557) 2024-07-20 01:25:06 +00:00
Nick Hill
b5672a112c
[Core] Multiprocessing Pipeline Parallel support (#6130)
Co-authored-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-18 19:15:52 -07:00
Rui Qiao
61e592747c
[Core] Introduce SPMD worker execution using Ray accelerated DAG (#6032)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
2024-07-17 22:27:09 -07:00
Murali Andoorveedu
5fa6e9876e
[Bugfix] Fix for multinode crash on 4 PP (#6495)
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-17 08:25:10 +00:00
youkaichao
09c2eb85dd
[ci][distributed] add pipeline parallel correctness test (#6410) 2024-07-16 15:44:22 -07:00
Thomas Parnell
eaec4b9153
[Bugfix] Add custom Triton cache manager to resolve MoE MP issue (#6140)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Chih-Chieh-Yang <chih.chieh.yang@ibm.com>
2024-07-15 10:12:47 -07:00
youkaichao
41708e5034
[ci] try to add multi-node tests (#6280)
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
Co-authored-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-12 21:51:48 -07:00
Woosuk Kwon
997df46a32
[Bugfix][Neuron] Fix soft prompt method error in NeuronExecutor (#6313) 2024-07-10 16:39:02 -07:00
sangjune.park
44cc76610d
[Bugfix] Fix OpenVINOExecutor abstractmethod error (#6296)
Signed-off-by: sangjune.park <sangjune.park@navercorp.com>
2024-07-10 10:03:32 -07:00
Woosuk Kwon
5ed3505d82
[Bugfix][TPU] Add prompt adapter methods to TPUExecutor (#6279) 2024-07-09 19:30:56 -07:00
Swapnil Parekh
4d6ada947c
[CORE] Adding support for insertion of soft-tuned prompts (#4645)
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
Co-authored-by: Joe G <joseph.granados@h2o.ai>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-07-09 13:26:36 -07:00
youkaichao
70c232f85a
[core][distributed] fix ray worker rank assignment (#6235) 2024-07-08 21:31:44 -07:00
Murali Andoorveedu
0ed646b7aa
[Distributed][Core] Support Py39 and Py38 for PP (#6120)
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-03 17:52:29 -07:00
Travis Johnson
1dab9bc8a9
[Bugfix] set OMP_NUM_THREADS to 1 by default for multiprocessing (#6109)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-07-03 16:56:59 -07:00
xwjiang2010
d9e98f42e4
[vlm] Remove vision language config. (#6089)
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-03 22:14:16 +00:00
youkaichao
f666207161
[misc][distributed] error on invalid state (#6092) 2024-07-02 23:37:29 -07:00
Nick Hill
d830656a97
[BugFix] Avoid unnecessary Ray import warnings (#6079) 2024-07-03 14:09:40 +08:00
Murali Andoorveedu
c5832d2ae9
[Core] Pipeline Parallel Support (#4412)
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-02 10:58:08 -07:00
Ilya Lavrenov
57f09a419c
[Hardware][Intel] OpenVINO vLLM backend (#5379) 2024-06-28 13:50:16 +00:00
Stephanie Wang
dda4811591
[Core] Refactor Worker and ModelRunner to consolidate control plane communication (#5408)
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Signed-off-by: Stephanie <swang@anyscale.com>
Co-authored-by: Stephanie <swang@anyscale.com>
2024-06-25 20:30:03 -07:00
aws-patlange
82079729cc
[Bugfix] Fix assertion in NeuronExecutor (#5841) 2024-06-25 19:52:10 -07:00
Matt Wong
dd793d1de5
[Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes (#5422) 2024-06-25 15:56:15 -07:00
Woosuk Kwon
bc34937d68
[Hardware][TPU] Refactor TPU backend (#5831) 2024-06-25 15:25:52 -07:00
Woosuk Kwon
0cbc1d2b4f
[Bugfix] Fix pin_lora error in TPU executor (#5760) 2024-06-21 22:25:14 -07:00
rohithkrn
f5dda63eb5
[LoRA] Add support for pinning lora adapters in the LRU cache (#5603) 2024-06-21 15:42:46 -07:00
youkaichao
3eea74889f
[misc][distributed] use 127.0.0.1 for single-node (#5619) 2024-06-19 08:05:00 +00:00
youkaichao
1b44aaf4e3
[bugfix][distributed] fix 16 gpus local rank arrangement (#5604) 2024-06-17 21:35:04 +00:00
Kunshang Ji
728c4c8a06
[Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (#3814)
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com>
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-06-17 11:01:25 -07:00
Antoni Baum
50eed24d25
Add cuda_device_count_stateless (#5473) 2024-06-13 16:06:49 -07:00
Woosuk Kwon
1a8bfd92d5
[Hardware] Initial TPU integration (#5292) 2024-06-12 11:53:03 -07:00
Nick Hill
99dac099ab
[Core][Doc] Default to multiprocessing for single-node distributed case (#5230)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-06-11 11:10:41 -07:00
Junichi Sato
2e02311a1b
[Bugfix] Fix MultiprocessingGPUExecutor.check_health when world_size == 1 (#5254) 2024-06-11 10:38:07 -07:00
Antoni Baum
18a277b52d
Remove Ray health check (#4693) 2024-06-07 10:01:56 +00:00
zifeitong
a58f24e590
[Bugfix] Fix torch.compile() error when using MultiprocessingGPUExecutor (#5229) 2024-06-03 20:55:50 -07:00
Nick Hill
eb6d3c264d
[Core] Eliminate parallel worker per-step task scheduling overhead (#4894) 2024-05-23 06:17:27 +09:00
Cody Yu
973617ae02
[Speculative decoding][Re-take] Enable TP>1 speculative decoding (#4840)
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cade Daniel <cade@anyscale.com>
2024-05-16 00:53:51 -07:00
Aurick Qiao
30e754390c
[Core] Implement sharded state loader (#4690)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-05-15 22:11:54 -07:00
Nick Hill
676a99982f
[Core] Add MultiprocessingGPUExecutor (#4539)
Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com>
2024-05-14 10:38:59 -07:00
Chang Su
e254497b66
[Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734) 2024-05-11 11:30:37 -07:00
Cody Yu
f942efb5a3
[Dynamic Spec Decoding] Auto-disable by the running queue size (#4592)
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-05-08 21:44:00 +00:00
leiwen83
8344f7742b
[Bug fix][Core] fixup ngram not setup correctly (#4551)
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-05-07 11:40:18 -07:00
Cody Yu
bc8ad68455
[Misc][Refactor] Introduce ExecuteModelData (#4540) 2024-05-03 17:47:07 -07:00
youkaichao
5b8a7c1cb0
[Misc] centralize all usage of environment variables (#4548) 2024-05-02 11:13:25 -07:00
Nick Hill
a657bfc48a
[Core] Add multiproc_worker_utils for multiprocessing-based workers (#4357) 2024-05-01 18:41:59 +00:00
leiwen83
b38e42fbca
[Speculative decoding] Add ngram prompt lookup decoding (#4237)
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-05-01 11:13:03 -07:00
Nick Hill
2e240c69a9
[Core] Centralize GPU Worker construction (#4419) 2024-05-01 01:06:34 +00:00
leiwen83
4bb53e2dde
[BugFix] fix num_lookahead_slots missing in async executor (#4165)
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-04-30 10:12:59 -07:00
Nick Hill
ba4be44c32
[BugFix] Fix return type of executor execute_model methods (#4402) 2024-04-27 11:17:45 -07:00
Nick Hill
258a2c58d0
[Core] Introduce DistributedGPUExecutor abstract class (#4348) 2024-04-27 04:14:26 +00:00
SangBin Cho
a88081bf76
[CI] Disable non-lazy string operation on logging (#4326)
Co-authored-by: Danny Guinther <dguinther@neuralmagic.com>
2024-04-26 00:16:58 -07:00