xinyun/vllm - vllm - 丝路新云-代码仓

mirror of https://git.datalinker.icu/vllm-project/vllm.git synced 2026-04-12 12:47:09 +08:00

Author	SHA1	Message	Date
youkaichao	eed020f673	[misc] use nvml to get consistent device name (#7582 )	2024-08-16 21:15:13 -07:00
Mahesh Keralapura	93478b63d2	[Core] Fix tracking of model forward time in case of PP>1 (#7440 ) [Core] Fix tracking of model forward time to the span traces in case of PP>1 (#7440)	2024-08-16 13:46:01 -07:00
omrishiv	9c1f78d5d6	[Bugfix] update neuron for version > 0.5.0 (#7175 ) Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2024-08-15 09:44:14 -07:00
Woosuk Kwon	951fdd66d3	[TPU] Set per-rank XLA cache (#7533 )	2024-08-14 14:47:51 -07:00
Cyrus Leung	3f674a49b5	[VLM][Core] Support profiling with multiple multi-modal inputs per prompt (#7126 )	2024-08-14 17:55:42 +00:00
youkaichao	16422ea76f	[misc][plugin] add plugin system implementation (#7426 )	2024-08-13 16:24:17 -07:00
Peter Salas	00c3d68e45	[Frontend][Core] Add plumbing to support audio language models (#7446 )	2024-08-13 17:39:33 +00:00
Cyrus Leung	4ddc4743d7	[Core] Consolidate `GB` constant and enable float GB arguments (#7416 )	2024-08-12 14:14:14 -07:00
William Lin	c08e2b3086	[core] [2/N] refactor worker_base input preparation for multi-step (#7387 )	2024-08-11 08:50:08 -07:00
Woosuk Kwon	90bab18f24	[TPU] Use mark_dynamic to reduce compilation time (#7340 )	2024-08-10 18:12:22 -07:00
Mahesh Keralapura	933790c209	[Core] Add span metrics for model_forward, scheduler and sampler time (#7089 )	2024-08-09 13:55:13 -07:00
Alexander Matveev	74af2bbd90	[Bugfix] Fix reinit procedure in ModelInputForGPUBuilder (#7360 )	2024-08-09 16:35:49 +00:00
Mor Zusman	07ab160741	[Model][Jamba] Mamba cache single buffer (#6739 ) Co-authored-by: Mor Zusman <morz@ai21.com>	2024-08-09 10:07:06 -04:00
Alexander Matveev	e02ac55617	[Performance] Optimize e2e overheads: Reduce python allocations (#7162 )	2024-08-08 21:34:28 -07:00
Cyrus Leung	7eb4a51c5f	[Core] Support serving encoder/decoder models (#7258 )	2024-08-09 10:39:41 +08:00
afeldman-nm	fd95e026e0	[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942 ) Co-authored-by: Andrew Feldman <afeld2012@gmail.com> Co-authored-by: Nick Hill <nickhill@us.ibm.com>	2024-08-06 16:51:47 -04:00
Cody Yu	ef527be06c	[MISC] Use non-blocking transfer in prepare_input (#7172 )	2024-08-05 23:41:27 +00:00
youkaichao	708989341e	[misc] add a flag to enable compile (#7092 )	2024-08-02 16:18:45 -07:00
Rui Qiao	05308891e2	[Core] Pipeline parallel with Ray ADAG (#6837 ) Support pipeline-parallelism with Ray accelerated DAG. Signed-off-by: Rui Qiao <ruisearch42@gmail.com>	2024-08-02 13:55:40 -07:00
youkaichao	806949514a	[ci] set timeout for test_oot_registration.py (#7082 )	2024-08-02 10:03:24 -07:00
Aurick Qiao	0437492ea9	PP comm optimization: replace send with partial send + allgather (#6695 ) Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>	2024-07-31 20:15:42 -07:00
Jee Jee Li	7ecee34321	[Kernel][RFC] Refactor the punica kernel based on Triton (#5036 )	2024-07-31 17:12:24 -07:00
Woosuk Kwon	533d1932d2	[Bugfix][TPU] Set readonly=True for non-root devices (#6980 )	2024-07-31 00:19:28 -07:00
Cyrus Leung	f230cc2ca6	[Bugfix] Fix broadcasting logic for `multi_modal_kwargs` (#6836 )	2024-07-31 10:38:45 +08:00
Nick Hill	5cf9254a9c	[BugFix] Fix use of per-request seed with pipeline parallel (#6698 )	2024-07-30 10:40:08 -07:00
Woosuk Kwon	6e063ea35b	[TPU] Fix greedy decoding (#6933 )	2024-07-30 02:06:29 -07:00
Woosuk Kwon	fad5576c58	[TPU] Reduce compilation time & Upgrade PyTorch XLA version (#6856 )	2024-07-27 10:28:33 -07:00
Woosuk Kwon	52f07e3dec	[Hardware][TPU] Implement tensor parallelism with Ray (#5871 )	2024-07-26 20:54:27 -07:00
Li, Jiang	3bbb4936dc	[Hardware] [Intel] Enable Multiprocessing and tensor parallel in CPU backend and update documentation (#6125 )	2024-07-26 13:50:10 -07:00
Antoni Baum	5448f67635	[Core] Tweaks to model runner/input builder developer APIs (#6712 )	2024-07-24 12:17:12 -07:00
William Lin	5e8ca973eb	[Bugfix] fix flashinfer cudagraph capture for PP (#6708 )	2024-07-24 01:49:44 +00:00
youkaichao	7c2749a4fd	[misc] add start loading models for users information (#6670 )	2024-07-22 20:08:02 -07:00
Cody Yu	e0c15758b8	[Core] Modulize prepare input and attention metadata builder (#6596 )	2024-07-23 00:45:24 +00:00
Jiaxin Shan	42c7f66a38	[Core] Support dynamically loading Lora adapter from HuggingFace (#6234 ) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2024-07-22 15:42:40 -07:00
Woosuk Kwon	42de2cefcb	[Misc] Add a wrapper for torch.inference_mode (#6618 )	2024-07-21 18:43:11 -07:00
Cyrus Leung	9042d68362	[Misc] Consolidate and optimize logic for building padded tensors (#6541 )	2024-07-20 04:17:24 +00:00
Antoni Baum	7bd82002ae	[Core] Allow specifying custom Executor (#6557 )	2024-07-20 01:25:06 +00:00
Nick Hill	b5672a112c	[Core] Multiprocessing Pipeline Parallel support (#6130 ) Co-authored-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai>	2024-07-18 19:15:52 -07:00
Woosuk Kwon	4634c8728b	[TPU] Refactor TPU worker & model runner (#6506 )	2024-07-18 01:34:16 -07:00
Rui Qiao	61e592747c	[Core] Introduce SPMD worker execution using Ray accelerated DAG (#6032 ) Signed-off-by: Rui Qiao <ruisearch42@gmail.com> Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>	2024-07-17 22:27:09 -07:00
youkaichao	1c27d25fb5	[core][model] yet another cpu offload implementation (#6496 ) Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-07-17 20:54:35 -07:00
Cody Yu	2fa4623d9e	[Core] Refactor _prepare_model_input_tensors - take 2 (#6164 )	2024-07-17 09:37:16 -07:00
Woosuk Kwon	a9a2e74d21	[Misc] Use `torch.Tensor` for type annotation (#6505 )	2024-07-17 13:01:10 +00:00
Woosuk Kwon	e09ce759aa	[TPU] Remove multi-modal args in TPU backend (#6504 )	2024-07-17 04:02:53 -07:00
Woosuk Kwon	c467dff24f	[Hardware][TPU] Support MoE with Pallas GMM kernel (#6457 )	2024-07-16 09:56:28 -07:00
Cyrus Leung	d97011512e	[CI/Build] vLLM cache directory for images (#6444 )	2024-07-15 23:12:25 -07:00
Hongxia Yang	b6c16cf8ff	[ROCm][AMD] unify CUDA_VISIBLE_DEVICES usage in cuda/rocm (#6352 )	2024-07-11 21:30:46 -07:00
aniaan	3963a5335b	[Misc] refactor(config): clean up unused code (#6320 )	2024-07-11 09:39:07 +00:00
Abhinav Goyal	2416b26e11	[Speculative Decoding] Medusa Implementation with Top-1 proposer (#4978 )	2024-07-09 18:34:02 -07:00
Swapnil Parekh	4d6ada947c	[CORE] Adding support for insertion of soft-tuned prompts (#4645 ) Co-authored-by: Swapnil Parekh <swapnilp@ibm.com> Co-authored-by: Joe G <joseph.granados@h2o.ai> Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2024-07-09 13:26:36 -07:00

1 2 3 4 5

218 Commits