xinyun/vllm - vllm - 丝路新云-代码仓

mirror of https://git.datalinker.icu/vllm-project/vllm.git synced 2026-05-17 09:49:09 +08:00

Author	SHA1	Message	Date
Kevin Lin	295c4730a8	[Misc] Raise error when using encoder/decoder model with cpu backend (#8355)	2024-09-12 05:45:24 +00:00
Cody Yu	a65cb16067	[MISC] Dump model runner inputs when crashing (#8305)	2024-09-12 01:12:25 +00:00
bnellnm	73202dbe77	[Kernel][Misc] register ops to prevent graph breaks (#6917) Co-authored-by: Sage Moore <sage@neuralmagic.com>	2024-09-11 12:52:19 -07:00
Li, Jiang	0b952af458	[Hardware][Intel] Support compressed-tensor W8A8 for CPU backend (#7257)	2024-09-11 09:46:46 -07:00
Yang Fan	3b7fea770f	[Model][VLM] Add Qwen2-VL model support (#7905) Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-09-11 09:31:19 -07:00
Kevin Lin	5faedf1b62	[Spec Decode] Move ops.advance_step to flash attn advance_step (#8224)	2024-09-10 13:18:14 -07:00
Alexander Matveev	4ef41b8476	[Bugfix] Fix async postprocessor in case of preemption (#8267)	2024-09-07 21:01:51 -07:00
youkaichao	ce2702a923	[tpu][misc] fix typo (#8260)	2024-09-06 22:40:46 -07:00
Harsha vardhan manoj Bikki	008cf886c9	[Neuron] Adding support for adding/ overriding neuron configuration a… (#8062) Co-authored-by: Harsha Bikki <harbikh@amazon.com>	2024-09-04 16:33:43 -07:00
Woosuk Kwon	61f4a93d14	[TPU][Bugfix] Use XLA rank for persistent cache path (#8137)	2024-09-03 18:35:33 -07:00
Woosuk Kwon	0af3abe3d3	[TPU][Bugfix] Fix next_token_ids shape (#8128)	2024-09-03 13:29:24 -07:00
Alexander Matveev	6d646d08a2	[Core] Optimize Async + Multi-step (#8050)	2024-09-03 18:50:29 +00:00
Cyrus Leung	98cef6a227	[Core] Increase default `max_num_batched_tokens` for multimodal models (#8028)	2024-08-30 08:20:34 -07:00
Woosuk Kwon	80c7b089b1	[TPU] Async output processing for TPU (#8011)	2024-08-29 19:35:29 -07:00
afeldman-nm	428dd1445e	[Core] Logprobs support in Multi-step (#7652)	2024-08-29 19:19:08 -07:00
kushanam	c334b1898b	extend cuda graph size for H200 (#7894) Co-authored-by: youkaichao <youkaichao@126.com>	2024-08-29 12:15:04 -07:00
Alexander Matveev	3f60f2244e	[Core] Combine async postprocessor and multi-step (#7921)	2024-08-29 11:18:26 -07:00
youkaichao	a7f65c2be9	[torch.compile] remove reset (#7975)	2024-08-28 17:32:26 -07:00
youkaichao	ce6bf3a2cf	[torch.compile] avoid Dynamo guard evaluation overhead (#7898) Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2024-08-28 16:10:12 -07:00
Cody Yu	e3580537a4	[Performance] Enable chunked prefill and prefix caching together (#7753)	2024-08-28 00:36:31 -07:00
Alexander Matveev	f508e03e7f	[Core] Async_output_proc: Add virtual engine support (towards pipeline parallel) (#7911)	2024-08-28 00:02:30 -07:00
bnellnm	c166e7e43e	[Bugfix] Allow ScalarType to be compiled with pytorch 2.3 and add checks for registering FakeScalarType and dynamo support. (#7886)	2024-08-27 23:13:45 -04:00
Kunshang Ji	076169f603	[Hardware][Intel GPU] Add intel GPU pipeline parallel support. (#7810)	2024-08-27 10:07:02 -07:00
youkaichao	64cc644425	[core][torch.compile] discard the compile for profiling (#7796)	2024-08-26 21:33:58 -07:00
Megha Agarwal	2eedede875	[Core] Asynchronous Output Processor (#7049) Co-authored-by: Alexander Matveev <alexm@neuralmagic.com>	2024-08-26 20:53:20 -07:00
omrishiv	760e9f71a8	[Bugfix] neuron: enable tensor parallelism (#7562) Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com>	2024-08-26 15:13:13 -07:00
Jie Fu (傅杰)	faeddb565d	[misc] Add Torch profiler support for CPU-only devices (#7806)	2024-08-23 05:46:25 +00:00
Kunshang Ji	fc5ebbd1d3	[Hardware][Intel GPU] refactor xpu_model_runner for tp (#7712)	2024-08-22 20:06:54 -07:00
Abhinav Goyal	a3fce56b88	[Speculative Decoding] EAGLE Implementation with Top-1 proposer (#6830)	2024-08-22 02:42:24 -07:00
William Lin	dd53c4b023	[misc] Add Torch profiler support (#7451) Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>	2024-08-21 15:39:26 -07:00
Isotr0py	6925cdbeea	[Bugfix][Hardware][CPU] Fix `mm_limits` initialization for CPU backend (#7735)	2024-08-21 16:23:03 +00:00
Antoni Baum	3b682179dd	[Core] Add `AttentionState` abstraction (#7663)	2024-08-20 18:50:45 +00:00
Kunshang Ji	c42590f97a	[Hardware] [Intel GPU] refactor xpu worker/executor (#7686)	2024-08-20 09:54:10 -07:00
Woosuk Kwon	43735bf5e1	[TPU] Remove redundant input tensor cloning (#7660)	2024-08-19 15:55:04 -07:00
William Lin	47b65a5508	[core] Multi Step Scheduling (#7000) Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com>	2024-08-19 13:52:13 -07:00
SangBin Cho	ff7ec82c4d	[Core] Optimize SPMD architecture with delta + serialization optimization (#7109)	2024-08-18 17:57:20 -07:00
Woosuk Kwon	0c2fa50b84	[TPU] Use mark_dynamic only for dummy run (#7634)	2024-08-18 00:18:53 -07:00
Woosuk Kwon	ce143353c6	[TPU] Skip creating empty tensor (#7630)	2024-08-17 14:22:46 -07:00
Roger Wang	bbf55c4805	[VLM] Refactor `MultiModalConfig` initialization and profiling (#7530)	2024-08-17 13:30:55 -07:00
youkaichao	eed020f673	[misc] use nvml to get consistent device name (#7582)	2024-08-16 21:15:13 -07:00
Mahesh Keralapura	93478b63d2	[Core] Fix tracking of model forward time in case of PP>1 (#7440) [Core] Fix tracking of model forward time to the span traces in case of PP>1 (#7440)	2024-08-16 13:46:01 -07:00
omrishiv	9c1f78d5d6	[Bugfix] update neuron for version > 0.5.0 (#7175) Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2024-08-15 09:44:14 -07:00
Woosuk Kwon	951fdd66d3	[TPU] Set per-rank XLA cache (#7533)	2024-08-14 14:47:51 -07:00
Cyrus Leung	3f674a49b5	[VLM][Core] Support profiling with multiple multi-modal inputs per prompt (#7126)	2024-08-14 17:55:42 +00:00
youkaichao	16422ea76f	[misc][plugin] add plugin system implementation (#7426)	2024-08-13 16:24:17 -07:00
Peter Salas	00c3d68e45	[Frontend][Core] Add plumbing to support audio language models (#7446)	2024-08-13 17:39:33 +00:00
Cyrus Leung	4ddc4743d7	[Core] Consolidate `GB` constant and enable float GB arguments (#7416)	2024-08-12 14:14:14 -07:00
William Lin	c08e2b3086	[core] [2/N] refactor worker_base input preparation for multi-step (#7387)	2024-08-11 08:50:08 -07:00
Woosuk Kwon	90bab18f24	[TPU] Use mark_dynamic to reduce compilation time (#7340)	2024-08-10 18:12:22 -07:00
Mahesh Keralapura	933790c209	[Core] Add span metrics for model_forward, scheduler and sampler time (#7089)	2024-08-09 13:55:13 -07:00

257 Commits