| Kevin Lin | 295c4730a8 | [Misc] Raise error when using encoder/decoder model with cpu backend (#8355) | 2024-09-12 05:45:24 +00:00 |
| Cody Yu | a65cb16067 | [MISC] Dump model runner inputs when crashing (#8305) | 2024-09-12 01:12:25 +00:00 |
| bnellnm | 73202dbe77 | [Kernel][Misc] register ops to prevent graph breaks (#6917); Co-authored-by: Sage Moore <sage@neuralmagic.com> | 2024-09-11 12:52:19 -07:00 |
| Li, Jiang | 0b952af458 | [Hardware][Intel] Support compressed-tensor W8A8 for CPU backend (#7257) | 2024-09-11 09:46:46 -07:00 |
| Yang Fan | 3b7fea770f | [Model][VLM] Add Qwen2-VL model support (#7905); Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>; Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> | 2024-09-11 09:31:19 -07:00 |
| Kevin Lin | 5faedf1b62 | [Spec Decode] Move ops.advance_step to flash attn advance_step (#8224) | 2024-09-10 13:18:14 -07:00 |
| Alexander Matveev | 4ef41b8476 | [Bugfix] Fix async postprocessor in case of preemption (#8267) | 2024-09-07 21:01:51 -07:00 |
| youkaichao | ce2702a923 | [tpu][misc] fix typo (#8260) | 2024-09-06 22:40:46 -07:00 |
| Harsha vardhan manoj Bikki | 008cf886c9 | [Neuron] Adding support for adding/ overriding neuron configuration a… (#8062); Co-authored-by: Harsha Bikki <harbikh@amazon.com> | 2024-09-04 16:33:43 -07:00 |
| Woosuk Kwon | 61f4a93d14 | [TPU][Bugfix] Use XLA rank for persistent cache path (#8137) | 2024-09-03 18:35:33 -07:00 |
| Woosuk Kwon | 0af3abe3d3 | [TPU][Bugfix] Fix next_token_ids shape (#8128) | 2024-09-03 13:29:24 -07:00 |
| Alexander Matveev | 6d646d08a2 | [Core] Optimize Async + Multi-step (#8050) | 2024-09-03 18:50:29 +00:00 |
| Cyrus Leung | 98cef6a227 | [Core] Increase default max_num_batched_tokens for multimodal models (#8028) | 2024-08-30 08:20:34 -07:00 |
| Woosuk Kwon | 80c7b089b1 | [TPU] Async output processing for TPU (#8011) | 2024-08-29 19:35:29 -07:00 |
| afeldman-nm | 428dd1445e | [Core] Logprobs support in Multi-step (#7652) | 2024-08-29 19:19:08 -07:00 |
| kushanam | c334b1898b | extend cuda graph size for H200 (#7894); Co-authored-by: youkaichao <youkaichao@126.com> | 2024-08-29 12:15:04 -07:00 |
| Alexander Matveev | 3f60f2244e | [Core] Combine async postprocessor and multi-step (#7921) | 2024-08-29 11:18:26 -07:00 |
| youkaichao | a7f65c2be9 | [torch.compile] remove reset (#7975) | 2024-08-28 17:32:26 -07:00 |
| youkaichao | ce6bf3a2cf | [torch.compile] avoid Dynamo guard evaluation overhead (#7898); Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> | 2024-08-28 16:10:12 -07:00 |
| Cody Yu | e3580537a4 | [Performance] Enable chunked prefill and prefix caching together (#7753) | 2024-08-28 00:36:31 -07:00 |
| Alexander Matveev | f508e03e7f | [Core] Async_output_proc: Add virtual engine support (towards pipeline parallel) (#7911) | 2024-08-28 00:02:30 -07:00 |
| bnellnm | c166e7e43e | [Bugfix] Allow ScalarType to be compiled with pytorch 2.3 and add checks for registering FakeScalarType and dynamo support. (#7886) | 2024-08-27 23:13:45 -04:00 |
| Kunshang Ji | 076169f603 | [Hardware][Intel GPU] Add intel GPU pipeline parallel support. (#7810) | 2024-08-27 10:07:02 -07:00 |
| youkaichao | 64cc644425 | [core][torch.compile] discard the compile for profiling (#7796) | 2024-08-26 21:33:58 -07:00 |
| Megha Agarwal | 2eedede875 | [Core] Asynchronous Output Processor (#7049); Co-authored-by: Alexander Matveev <alexm@neuralmagic.com> | 2024-08-26 20:53:20 -07:00 |
| omrishiv | 760e9f71a8 | [Bugfix] neuron: enable tensor parallelism (#7562); Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com> | 2024-08-26 15:13:13 -07:00 |
| Jie Fu (傅杰) | faeddb565d | [misc] Add Torch profiler support for CPU-only devices (#7806) | 2024-08-23 05:46:25 +00:00 |
| Kunshang Ji | fc5ebbd1d3 | [Hardware][Intel GPU] refactor xpu_model_runner for tp (#7712) | 2024-08-22 20:06:54 -07:00 |
| Abhinav Goyal | a3fce56b88 | [Speculative Decoding] EAGLE Implementation with Top-1 proposer (#6830) | 2024-08-22 02:42:24 -07:00 |
| William Lin | dd53c4b023 | [misc] Add Torch profiler support (#7451); Co-authored-by: Cody Yu <hao.yu.cody@gmail.com> | 2024-08-21 15:39:26 -07:00 |
| Isotr0py | 6925cdbeea | [Bugfix][Hardware][CPU] Fix mm_limits initialization for CPU backend (#7735) | 2024-08-21 16:23:03 +00:00 |
| Antoni Baum | 3b682179dd | [Core] Add AttentionState abstraction (#7663) | 2024-08-20 18:50:45 +00:00 |
| Kunshang Ji | c42590f97a | [Hardware] [Intel GPU] refactor xpu worker/executor (#7686) | 2024-08-20 09:54:10 -07:00 |
| Woosuk Kwon | 43735bf5e1 | [TPU] Remove redundant input tensor cloning (#7660) | 2024-08-19 15:55:04 -07:00 |
| William Lin | 47b65a5508 | [core] Multi Step Scheduling (#7000); Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com> | 2024-08-19 13:52:13 -07:00 |
| SangBin Cho | ff7ec82c4d | [Core] Optimize SPMD architecture with delta + serialization optimization (#7109) | 2024-08-18 17:57:20 -07:00 |
| Woosuk Kwon | 0c2fa50b84 | [TPU] Use mark_dynamic only for dummy run (#7634) | 2024-08-18 00:18:53 -07:00 |
| Woosuk Kwon | ce143353c6 | [TPU] Skip creating empty tensor (#7630) | 2024-08-17 14:22:46 -07:00 |
| Roger Wang | bbf55c4805 | [VLM] Refactor MultiModalConfig initialization and profiling (#7530) | 2024-08-17 13:30:55 -07:00 |
| youkaichao | eed020f673 | [misc] use nvml to get consistent device name (#7582) | 2024-08-16 21:15:13 -07:00 |
| Mahesh Keralapura | 93478b63d2 | [Core] Fix tracking of model forward time in case of PP>1 (#7440); [Core] Fix tracking of model forward time to the span traces in case of PP>1 (#7440) | 2024-08-16 13:46:01 -07:00 |
| omrishiv | 9c1f78d5d6 | [Bugfix] update neuron for version > 0.5.0 (#7175); Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com>; Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> | 2024-08-15 09:44:14 -07:00 |
| Woosuk Kwon | 951fdd66d3 | [TPU] Set per-rank XLA cache (#7533) | 2024-08-14 14:47:51 -07:00 |
| Cyrus Leung | 3f674a49b5 | [VLM][Core] Support profiling with multiple multi-modal inputs per prompt (#7126) | 2024-08-14 17:55:42 +00:00 |
| youkaichao | 16422ea76f | [misc][plugin] add plugin system implementation (#7426) | 2024-08-13 16:24:17 -07:00 |
| Peter Salas | 00c3d68e45 | [Frontend][Core] Add plumbing to support audio language models (#7446) | 2024-08-13 17:39:33 +00:00 |
| Cyrus Leung | 4ddc4743d7 | [Core] Consolidate GB constant and enable float GB arguments (#7416) | 2024-08-12 14:14:14 -07:00 |
| William Lin | c08e2b3086 | [core] [2/N] refactor worker_base input preparation for multi-step (#7387) | 2024-08-11 08:50:08 -07:00 |
| Woosuk Kwon | 90bab18f24 | [TPU] Use mark_dynamic to reduce compilation time (#7340) | 2024-08-10 18:12:22 -07:00 |
| Mahesh Keralapura | 933790c209 | [Core] Add span metrics for model_forward, scheduler and sampler time (#7089) | 2024-08-09 13:55:13 -07:00 |