Cyrus Leung
db7db4aab9
[Misc] Consolidate ModelConfig code related to HF config ( #10104 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-07 06:00:21 +00:00
Woosuk Kwon
6a585a23d2
[Hotfix] Fix ruff errors ( #10073 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-06 01:24:28 -08:00
Konrad Zawora
a02a50e6e5
[Hardware][Intel-Gaudi] Add Intel Gaudi (HPU) inference backend ( #6143 )
...
Signed-off-by: yuwenzho <yuwen.zhou@intel.com>
Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
Signed-off-by: Bob Zhu <bob.zhu@intel.com>
Signed-off-by: zehao-intel <zehao.huang@intel.com>
Signed-off-by: Konrad Zawora <kzawora@habana.ai>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: Sanju C Sudhakaran <scsudhakaran@habana.ai>
Co-authored-by: Michal Adamczyk <madamczyk@habana.ai>
Co-authored-by: Marceli Fylcek <mfylcek@habana.ai>
Co-authored-by: Himangshu Lahkar <49579433+hlahkar@users.noreply.github.com>
Co-authored-by: Vivek Goel <vgoel@habana.ai>
Co-authored-by: yuwenzho <yuwen.zhou@intel.com>
Co-authored-by: Dominika Olszewska <dolszewska@habana.ai>
Co-authored-by: barak goldberg <149692267+bgoldberg-habana@users.noreply.github.com>
Co-authored-by: Michal Szutenberg <37601244+szutenberg@users.noreply.github.com>
Co-authored-by: Jan Kaniecki <jkaniecki@habana.ai>
Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyniewicz-habana@users.noreply.github.com>
Co-authored-by: Krzysztof Wisniewski <kwisniewski@habana.ai>
Co-authored-by: Dudi Lester <160421192+dudilester@users.noreply.github.com>
Co-authored-by: Ilia Taraban <tarabanil@gmail.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Co-authored-by: Michał Kuligowski <mkuligowski@habana.ai>
Co-authored-by: Jakub Maksymczuk <jmaksymczuk@habana.ai>
Co-authored-by: Tomasz Zielinski <85164140+tzielinski-habana@users.noreply.github.com>
Co-authored-by: Sun Choi <schoi@habana.ai>
Co-authored-by: Iryna Boiko <iboiko@habana.ai>
Co-authored-by: Bob Zhu <41610754+czhu15@users.noreply.github.com>
Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>
Co-authored-by: Zehao Huang <zehao.huang@intel.com>
Co-authored-by: Andrzej Kotłowski <Andrzej.Kotlowski@intel.com>
Co-authored-by: Yan Tomsinsky <73292515+Yantom1@users.noreply.github.com>
Co-authored-by: Nir David <ndavid@habana.ai>
Co-authored-by: Yu-Zhou <yu.zhou@intel.com>
Co-authored-by: Ruheena Suhani Shaik <rsshaik@habana.ai>
Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>
Co-authored-by: Marcin Swiniarski <mswiniarski@habana.ai>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Jacek Czaja <jacek.czaja@intel.com>
Co-authored-by: Jacek Czaja <jczaja@habana.ai>
Co-authored-by: Yuan <yuan.zhou@outlook.com>
2024-11-06 01:09:10 -08:00
Wallas Henrique
966e31697b
[Bugfix] Fix pickle of input when async output processing is on ( #9931 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com>
2024-11-06 00:39:26 +00:00
Chenghao (Alan) Yang
09d3550372
[Misc] Add logging for CUDA memory ( #10027 )
...
Signed-off-by: Chenghao Yang <yangalan1996@gmail.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Chenghao Yang <yangalan1996@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-11-05 09:50:50 -08:00
youkaichao
cea808f325
[3/N] model runner pass the whole config to model ( #9958 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-02 12:08:49 -07:00
youkaichao
e893795443
[2/N] executor pass the complete config to worker/modelrunner ( #9938 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2024-11-02 07:35:05 -07:00
sroy745
a78dd3303e
[Encoder Decoder] Add flash_attn kernel support for encoder-decoder models ( #9559 )
2024-11-01 23:22:49 -07:00
Peter Salas
6c0b7f548d
[Core][VLM] Add precise multi-modal placeholder tracking ( #8346 )
...
Signed-off-by: Peter Salas <peter@fixie.ai>
2024-11-01 16:21:10 -07:00
Roger Wang
3ea2dc2ec4
[Misc] Remove deprecated arg for cuda graph capture ( #9864 )
...
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-10-31 07:22:07 +00:00
Woosuk Kwon
211fe91aa8
[TPU] Correctly profile peak memory usage & Upgrade PyTorch XLA ( #9438 )
2024-10-30 09:41:38 +00:00
Yan Ma
04a3ae0aca
[Bugfix] Fix multi nodes TP+PP for XPU ( #8884 )
...
Signed-off-by: YiSheng5 <syhm@mail.ustc.edu.cn>
Signed-off-by: yan ma <yan.ma@intel.com>
Co-authored-by: YiSheng5 <syhm@mail.ustc.edu.cn>
2024-10-29 21:34:45 -07:00
wangshuai09
4e2d95e372
[Hardware][ROCM] using current_platform.is_rocm ( #9642 )
...
Signed-off-by: wangshuai09 <391746016@qq.com>
2024-10-28 04:07:00 +00:00
youkaichao
8549c82660
[core] cudagraph output with tensor weak reference ( #9724 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-10-27 00:19:28 -07:00
Mengqing Cao
5cbdccd151
[Hardware][openvino] is_openvino --> current_platform.is_openvino ( #9716 )
2024-10-26 10:59:06 +00:00
Mengqing Cao
2394962d70
[Hardware][XPU] using current_platform.is_xpu ( #9605 )
2024-10-23 08:28:21 +00:00
yudian0504
8ca8954841
[Bugfix][Misc]: fix graph capture for decoder ( #9549 )
2024-10-21 17:33:30 +00:00
Thomas Parnell
496e991da8
[Doc] Consistent naming of attention backends ( #9498 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-10-21 22:29:57 +08:00
Chen Zhang
4fa3e33349
[Kernel] Support sliding window in flash attention backend ( #9403 )
2024-10-20 10:57:52 -07:00
Joe Runde
380e18639f
🐛 fix torch memory profiling ( #9516 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-18 21:25:19 -04:00
Cyrus Leung
051eaf6db3
[Model] Add user-configurable task for models that support both generation and embedding ( #9424 )
2024-10-18 11:31:58 -07:00
Joe Runde
de4008e2ab
[Bugfix][Core] Use torch.cuda.memory_stats() to profile peak memory usage ( #9352 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-17 22:47:27 -04:00
Kuntai Du
81ede99ca4
[Core] Deprecating block manager v1 and make block manager v2 default ( #8704 )
...
Removing the block manager v1. This is the initial piece of prefix-caching-centric design. In order to achieve prefix-caching-centric design, we need to simplify the code path so that we only use v2 block manager (which has much higher performance on prefix caching).
2024-10-17 11:38:15 -05:00
Lucas Wilkinson
9d30a056e7
[misc] CUDA Time Layerwise Profiler ( #8337 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-10-17 10:36:09 -04:00
Cyrus Leung
7e7eae338d
[Misc] Standardize RoPE handling for Qwen2-VL ( #9250 )
2024-10-16 13:56:17 +08:00
Woosuk Kwon
473e7b3606
[TPU] Fix TPU SMEM OOM by Pallas paged attention kernel ( #9350 )
2024-10-14 15:02:06 -07:00
Kunshang Ji
4141608c6a
[Hardware][intel GPU] add async output process for xpu ( #8897 )
2024-10-14 12:23:33 -06:00
Tyler Michael Smith
16b24e7dcd
[Bugfix] Bandaid fix for speculative decoding tests ( #9327 )
2024-10-13 23:02:11 +00:00
Allen Wang
c6cf9295e1
[Bugfix] Sets is_first_step_output for TPUModelRunner ( #9202 )
2024-10-11 13:28:10 -07:00
Wallas Henrique
8baf85e4e9
[Doc] Compatibility matrix for mutual exclusive features ( #8512 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com>
2024-10-11 11:18:50 -07:00
Tyler Michael Smith
7342a7d7f8
[Model] Support Mamba ( #6484 )
2024-10-11 15:40:06 +00:00
youkaichao
cbc2ef5529
[misc] hide best_of from engine ( #9261 )
...
Co-authored-by: Brendan Wong <bjwpokemon@gmail.com>
2024-10-10 21:30:44 -07:00
youkaichao
e4d652ea3e
[torch.compile] integration with compilation control ( #9058 )
2024-10-10 12:39:36 -07:00
Li, Jiang
ca77dd7a44
[Hardware][CPU] Support AWQ for CPU backend ( #7515 )
2024-10-09 10:28:08 -06:00
Alex Brooks
a3691b6b5e
[Core][Frontend] Add Support for Inference Time mm_processor_kwargs ( #9131 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-10-08 14:12:56 +00:00
Kunshang Ji
80b57f00d5
[Intel GPU] Fix xpu decode input ( #9145 )
2024-10-08 03:51:14 +00:00
Isotr0py
4f95ffee6f
[Hardware][CPU] Cross-attention and Encoder-Decoder models support on CPU backend ( #9089 )
2024-10-07 06:50:35 +00:00
youkaichao
18b296fdb2
[core] remove beam search from the core ( #9105 )
2024-10-07 05:47:04 +00:00
Isotr0py
487678d046
[Bugfix][Hardware][CPU] Fix CPU model input for decode ( #9044 )
2024-10-06 19:14:27 -07:00
Cyrus Leung
b22b798471
[Model] PP support for embedding models and update docs ( #9090 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-10-06 16:35:27 +08:00
Chongming Ni
cc90419e89
[Hardware][Neuron] Add on-device sampling support for Neuron ( #8746 )
...
Co-authored-by: Ashraf Mahgoub <ashymahg@amazon.com>
2024-10-04 16:42:20 -07:00
Cyrus Leung
0e36fd4909
[Misc] Move registry to its own file ( #9064 )
2024-10-04 10:01:37 +00:00
youkaichao
9aaf14c62e
[misc] add forward context for attention ( #9029 )
2024-10-03 12:09:42 -07:00
Sergey Shlyapnikov
f58d4fccc9
[OpenVINO] Enable GPU support for OpenVINO vLLM backend ( #8192 )
2024-10-02 17:50:01 -04:00
Varun Sundar Rabindranath
afb050b29d
[Core] CUDA Graphs for Multi-Step + Chunked-Prefill ( #8645 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-10-02 19:44:39 +00:00
Lily Liu
1570203864
[Spec Decode] (1/2) Remove batch expansion ( #8839 )
2024-10-01 16:04:42 -07:00
youkaichao
7da2487591
[torch.compile] fix tensor alias ( #8982 )
2024-10-01 03:40:48 +00:00
Jee Jee Li
1cabfcefb6
[Misc] Adjust max_position_embeddings for LoRA compatibility ( #8957 )
2024-09-30 12:57:39 +00:00
Nick Hill
31f46a0d35
[BugFix] Fix seeded random sampling with encoder-decoder models ( #8870 )
...
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-09-29 09:43:14 +00:00
Jee Jee Li
3d49776bbb
[Model][LoRA]LoRA support added for MiniCPMV2.5 ( #7199 )
2024-09-29 06:59:45 +00:00