youkaichao
eed020f673
[misc] use nvml to get consistent device name (#7582)
2024-08-16 21:15:13 -07:00
Mahesh Keralapura
93478b63d2
[Core] Fix tracking of model forward time in case of PP>1 (#7440)
...
[Core] Fix tracking of model forward time to the span traces in case of PP>1 (#7440)
2024-08-16 13:46:01 -07:00
omrishiv
9c1f78d5d6
[Bugfix] update neuron for version > 0.5.0 (#7175)
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-08-15 09:44:14 -07:00
Woosuk Kwon
951fdd66d3
[TPU] Set per-rank XLA cache (#7533)
2024-08-14 14:47:51 -07:00
Cyrus Leung
3f674a49b5
[VLM][Core] Support profiling with multiple multi-modal inputs per prompt (#7126)
2024-08-14 17:55:42 +00:00
youkaichao
16422ea76f
[misc][plugin] add plugin system implementation (#7426)
2024-08-13 16:24:17 -07:00
Peter Salas
00c3d68e45
[Frontend][Core] Add plumbing to support audio language models (#7446)
2024-08-13 17:39:33 +00:00
Cyrus Leung
4ddc4743d7
[Core] Consolidate GB constant and enable float GB arguments (#7416)
2024-08-12 14:14:14 -07:00
William Lin
c08e2b3086
[core] [2/N] refactor worker_base input preparation for multi-step (#7387)
2024-08-11 08:50:08 -07:00
Woosuk Kwon
90bab18f24
[TPU] Use mark_dynamic to reduce compilation time (#7340)
2024-08-10 18:12:22 -07:00
Mahesh Keralapura
933790c209
[Core] Add span metrics for model_forward, scheduler and sampler time (#7089)
2024-08-09 13:55:13 -07:00
Alexander Matveev
74af2bbd90
[Bugfix] Fix reinit procedure in ModelInputForGPUBuilder (#7360)
2024-08-09 16:35:49 +00:00
Mor Zusman
07ab160741
[Model][Jamba] Mamba cache single buffer (#6739)
...
Co-authored-by: Mor Zusman <morz@ai21.com>
2024-08-09 10:07:06 -04:00
Alexander Matveev
e02ac55617
[Performance] Optimize e2e overheads: Reduce python allocations (#7162)
2024-08-08 21:34:28 -07:00
Cyrus Leung
7eb4a51c5f
[Core] Support serving encoder/decoder models (#7258)
2024-08-09 10:39:41 +08:00
afeldman-nm
fd95e026e0
[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942)
...
Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-06 16:51:47 -04:00
Cody Yu
ef527be06c
[MISC] Use non-blocking transfer in prepare_input (#7172)
2024-08-05 23:41:27 +00:00
youkaichao
708989341e
[misc] add a flag to enable compile (#7092)
2024-08-02 16:18:45 -07:00
Rui Qiao
05308891e2
[Core] Pipeline parallel with Ray ADAG (#6837)
...
Support pipeline-parallelism with Ray accelerated DAG.
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2024-08-02 13:55:40 -07:00
youkaichao
806949514a
[ci] set timeout for test_oot_registration.py (#7082)
2024-08-02 10:03:24 -07:00
Aurick Qiao
0437492ea9
PP comm optimization: replace send with partial send + allgather (#6695)
...
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>
2024-07-31 20:15:42 -07:00
Jee Jee Li
7ecee34321
[Kernel][RFC] Refactor the punica kernel based on Triton (#5036)
2024-07-31 17:12:24 -07:00
Woosuk Kwon
533d1932d2
[Bugfix][TPU] Set readonly=True for non-root devices (#6980)
2024-07-31 00:19:28 -07:00
Cyrus Leung
f230cc2ca6
[Bugfix] Fix broadcasting logic for multi_modal_kwargs (#6836)
2024-07-31 10:38:45 +08:00
Nick Hill
5cf9254a9c
[BugFix] Fix use of per-request seed with pipeline parallel (#6698)
2024-07-30 10:40:08 -07:00
Woosuk Kwon
6e063ea35b
[TPU] Fix greedy decoding (#6933)
2024-07-30 02:06:29 -07:00
Woosuk Kwon
fad5576c58
[TPU] Reduce compilation time & Upgrade PyTorch XLA version (#6856)
2024-07-27 10:28:33 -07:00
Woosuk Kwon
52f07e3dec
[Hardware][TPU] Implement tensor parallelism with Ray (#5871)
2024-07-26 20:54:27 -07:00
Li, Jiang
3bbb4936dc
[Hardware] [Intel] Enable Multiprocessing and tensor parallel in CPU backend and update documentation (#6125)
2024-07-26 13:50:10 -07:00
Antoni Baum
5448f67635
[Core] Tweaks to model runner/input builder developer APIs (#6712)
2024-07-24 12:17:12 -07:00
William Lin
5e8ca973eb
[Bugfix] fix flashinfer cudagraph capture for PP (#6708)
2024-07-24 01:49:44 +00:00
youkaichao
7c2749a4fd
[misc] add start loading models for users information (#6670)
2024-07-22 20:08:02 -07:00
Cody Yu
e0c15758b8
[Core] Modulize prepare input and attention metadata builder (#6596)
2024-07-23 00:45:24 +00:00
Jiaxin Shan
42c7f66a38
[Core] Support dynamically loading Lora adapter from HuggingFace (#6234)
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-07-22 15:42:40 -07:00
Woosuk Kwon
42de2cefcb
[Misc] Add a wrapper for torch.inference_mode (#6618)
2024-07-21 18:43:11 -07:00
Cyrus Leung
9042d68362
[Misc] Consolidate and optimize logic for building padded tensors (#6541)
2024-07-20 04:17:24 +00:00
Antoni Baum
7bd82002ae
[Core] Allow specifying custom Executor (#6557)
2024-07-20 01:25:06 +00:00
Nick Hill
b5672a112c
[Core] Multiprocessing Pipeline Parallel support (#6130)
...
Co-authored-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-18 19:15:52 -07:00
Woosuk Kwon
4634c8728b
[TPU] Refactor TPU worker & model runner (#6506)
2024-07-18 01:34:16 -07:00
Rui Qiao
61e592747c
[Core] Introduce SPMD worker execution using Ray accelerated DAG (#6032)
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
2024-07-17 22:27:09 -07:00
youkaichao
1c27d25fb5
[core][model] yet another cpu offload implementation (#6496)
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-07-17 20:54:35 -07:00
Cody Yu
2fa4623d9e
[Core] Refactor _prepare_model_input_tensors - take 2 (#6164)
2024-07-17 09:37:16 -07:00
Woosuk Kwon
a9a2e74d21
[Misc] Use torch.Tensor for type annotation (#6505)
2024-07-17 13:01:10 +00:00
Woosuk Kwon
e09ce759aa
[TPU] Remove multi-modal args in TPU backend (#6504)
2024-07-17 04:02:53 -07:00
Woosuk Kwon
c467dff24f
[Hardware][TPU] Support MoE with Pallas GMM kernel (#6457)
2024-07-16 09:56:28 -07:00
Cyrus Leung
d97011512e
[CI/Build] vLLM cache directory for images (#6444)
2024-07-15 23:12:25 -07:00
Hongxia Yang
b6c16cf8ff
[ROCm][AMD] unify CUDA_VISIBLE_DEVICES usage in cuda/rocm (#6352)
2024-07-11 21:30:46 -07:00
aniaan
3963a5335b
[Misc] refactor(config): clean up unused code (#6320)
2024-07-11 09:39:07 +00:00
Abhinav Goyal
2416b26e11
[Speculative Decoding] Medusa Implementation with Top-1 proposer (#4978)
2024-07-09 18:34:02 -07:00
Swapnil Parekh
4d6ada947c
[CORE] Adding support for insertion of soft-tuned prompts (#4645)
...
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
Co-authored-by: Joe G <joseph.granados@h2o.ai>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-07-09 13:26:36 -07:00