978b45f399 | 2025-01-23 06:45:48 -08:00 | Lucas Wilkinson
  [Kernel] Flash Attention 3 Support (#12093)
  Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

66818e5b63 | 2025-01-22 14:13:52 +08:00 | youkaichao
  [core] separate builder init and builder prepare for each batch (#12253)
  Signed-off-by: youkaichao <youkaichao@gmail.com>

86bfb6dba7 | 2025-01-20 23:25:28 +08:00 | wangxiyuan
  [Misc] Pass attention to impl backend (#12218)
  Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

69d765f5a5 | 2025-01-17 07:39:35 +00:00 | Chen Zhang
  [V1] Move more control of kv cache initialization from model_executor to EngineCore (#11960)
  Signed-off-by: Chen Zhang <zhangch99@outlook.com>
  Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>

3adf0ffda8 | 2025-01-15 10:14:15 +00:00 | wangxiyuan
  [Platform] Do not raise error if _Backend is not found (#12023)
  Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
  Signed-off-by: Mengqing Cao <cmq0113@163.com>
  Co-authored-by: Mengqing Cao <cmq0113@163.com>

0794e7446e | 2025-01-15 12:47:49 +08:00 | Elfie Guo
  [Misc] Add multipstep chunked-prefill support for FlashInfer (#10467)

a2d2acb4c8 | 2025-01-14 15:45:05 +00:00 | Chen Zhang
  [Bugfix][Kernel] Give unique name to BlockSparseFlashAttention (#12040)
  Signed-off-by: Chen Zhang <zhangch99@outlook.com>

2e0e017610 | 2025-01-14 13:27:04 +00:00 | wangxiyuan
  [Platform] Add output for Attention Backend (#11981)
  Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

1f18adb245 | 2025-01-14 20:59:32 +08:00 | Chen Zhang
  [Kernel] Revert the API change of Attention.forward (#12038)
  Signed-off-by: Chen Zhang <zhangch99@outlook.com>

0f8cafe2d1 | 2025-01-13 19:28:53 +08:00 | Chen Zhang
  [Kernel] unified_attention for Attention.forward (#11967)
  Signed-off-by: Chen Zhang <zhangch99@outlook.com>

9dd02d85ca | 2025-01-13 06:24:10 +00:00 | Siyuan Li
  [Bug] Fix usage of .transpose() and .view() consecutively. (#11979)

cf5f000d21 | 2025-01-10 13:14:42 +08:00 | Chen Zhang
  [torch.compile] Hide KV cache behind torch.compile boundary (#11677)
  Signed-off-by: Chen Zhang <zhangch99@outlook.com>

405eb8e396 | 2025-01-09 21:46:50 +08:00 | wangxiyuan
  [platform] Allow platform specify attention backend (#11609)
  Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
  Signed-off-by: Mengqing Cao <cmq0113@163.com>
  Co-authored-by: Mengqing Cao <cmq0113@163.com>

d848800e88 | 2025-01-09 12:48:12 +08:00 | Cyrus Leung
  [Misc] Move print_*_once from utils to logger (#11298)
  Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
  Signed-off-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>
  Co-authored-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>

e20c92bb61 | 2025-01-07 00:11:28 +08:00 | Chen Zhang
  [Kernel] Move attn_type to Attention.__init__() (#11690)
  Signed-off-by: Chen Zhang <zhangch99@outlook.com>

ee77fdb5de | 2025-01-06 21:40:31 +08:00 | Cyrus Leung
  [Doc][2/N] Reorganize Models and Usage sections (#11755)
  Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

5c7963249d | 2024-12-24 12:39:36 +00:00 | Mengqing Cao
  [attn][tiny fix] fix attn backend in MultiHeadAttention (#11463)
  Signed-off-by: Mengqing Cao <cmq0113@163.com>

32aa2059ad | 2024-12-23 22:35:38 +00:00 | Rafael Vasquez
  [Docs] Convert rST to MyST (Markdown) (#11145)
  Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>

f9ecbb18bf | 2024-12-17 00:37:04 -08:00 | Isotr0py
  [Misc] Allow passing logits_soft_cap for xformers backend (#11252)
  Signed-off-by: Isotr0py <2037008807@qq.com>

0a56bcc03d | 2024-12-13 18:00:40 +00:00 | Jani Monoses
  [Bugfix][Hardware][CPU] Enable Gemma2 with SDPA on CPU backend (#11169)

cad5c0a6ed | 2024-12-11 13:36:27 +00:00 | Cyrus Leung
  [Doc] Update docs to refer to pooling models (#11093)
  Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

9a93973708 | 2024-12-11 00:16:22 +00:00 | Tyler Michael Smith
  [Bugfix] Fix Mamba multistep (#11071)
  Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

cbcbdb1ceb | 2024-12-09 13:21:06 -08:00 | Konrad Zawora
  [Bugfix][Hardware][Gaudi] Bump vllm_hpu_extension version (#11028)
  Signed-off-by: Konrad Zawora <kzawora@habana.ai>

aa39a8e175 | 2024-12-05 11:19:35 +08:00 | Cyrus Leung
  [Doc] Create a new "Usage" section (#10827)
  Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

10398b4706 | 2024-12-04 18:11:08 +00:00 | Isotr0py
  [Model] Consolidate ViTs attention implementation without mask (#10893)
  Signed-off-by: Isotr0py <2037008807@qq.com>

a4c4daf364 | 2024-12-02 10:50:10 +00:00 | youkaichao
  [misc] use out argument for flash attention (#10822)
  Signed-off-by: youkaichao <youkaichao@gmail.com>

e25810ae29 | 2024-12-02 10:05:32 +08:00 | Maximilien de Bayser
  Fill TorchSDPAAttentionMetadata seq_lens_field for prefill (#10799)
  Signed-off-by: Max de Bayser <mbayser@br.ibm.com>

e85250b1d1 | 2024-11-26 22:49:40 -08:00 | Kunshang Ji
  [Hardware][Gaudi]add get_name method for HPUAttentionBackend (#10667)
  Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

c27df94e1f | 2024-11-25 12:23:32 -05:00 | Wallas Henrique
  [Bugfix] Fix chunked prefill with model dtype float32 on Turing Devices (#9850)
  Signed-off-by: Wallas Santos <wallashss@ibm.com>
  Co-authored-by: Michael Goin <michael@neuralmagic.com>

05d1f8c9c6 | 2024-11-25 09:27:30 +00:00 | youkaichao
  [misc] move functions to config.py (#10624)
  Signed-off-by: youkaichao <youkaichao@gmail.com>

04668ebe7a | 2024-11-23 18:12:20 +00:00 | Isotr0py
  [Bugfix] Avoid import AttentionMetadata explicitly in Mllama (#10593)
  Signed-off-by: Isotr0py <2037008807@qq.com>

4aba6e3d1a | 2024-11-22 20:13:54 -08:00 | youkaichao
  [core] gemma2 full context length support (#10584)
  Signed-off-by: youkaichao <youkaichao@gmail.com>

eebad39f26 | 2024-11-22 14:04:42 -08:00 | youkaichao
  [torch.compile] support all attention backends (#10558)
  Signed-off-by: youkaichao <youkaichao@gmail.com>

6c1208d083 | 2024-11-20 19:56:47 -08:00 | Pavani Majety
  [Core] Add Sliding Window Support with Flashinfer (#10462)
  Signed-off-by: Pavani Majety <pmajety@nvidia.com>

2f77b6cfec | 2024-11-20 13:54:15 -08:00 | Woosuk Kwon
  [TPU] Implement prefix caching for TPUs (#10307)
  Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

63f1fde277 | 2024-11-20 10:57:39 +00:00 | Li, Jiang
  [Hardware][CPU] Support chunked-prefill and prefix-caching on CPU (#10355)
  Signed-off-by: jiang1.li <jiang1.li@intel.com>

8c1fb50705 | 2024-11-19 11:22:26 +08:00 | Mengqing Cao
  [Platform][Refactor] Extract func get_default_attn_backend to Platform (#10358)
  Signed-off-by: Mengqing Cao <cmq0113@163.com>

c2170a5b39 | 2024-11-18 11:39:40 -08:00 | Angus Wang
  [Kernel] Explicitly specify other value in tl.load calls (#9014)
  Signed-off-by: Angus Wang <wangjadehao@gmail.com>

4a18fd14ba | 2024-11-14 21:23:29 +00:00 | Maximilien de Bayser
  Support Roberta embedding models (#9387)
  Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
  Signed-off-by: Flavia Beo <flavia.beo@ibm.com>
  Co-authored-by: Flavia Beo <flavia.beo@ibm.com>

58170d6503 | 2024-11-11 08:54:28 +00:00 | Isotr0py
  [Hardware][CPU] Add embedding models support for CPU backend (#10193)
  Signed-off-by: Isotr0py <2037008807@qq.com>

9d43afcc53 | 2024-11-07 08:15:14 -08:00 | Nicolò Lucchesi
  [Feature] [Spec decode]: Combine chunked prefill with speculative decoding (#9291)
  Signed-off-by: NickLucche <nlucches@redhat.com>

d3859f1891 | 2024-11-06 17:29:03 -08:00 | Yan Ma
  [Misc][XPU] Upgrade to Pytorch 2.5 for xpu backend (#9823)
  Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
  Signed-off-by: yan ma <yan.ma@intel.com>
  Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>

d58268c56a | 2024-11-06 11:57:35 -08:00 | Joe Runde
  [V1] Make v1 more testable (#9888)
  Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>

a02a50e6e5 | 2024-11-06 01:09:10 -08:00 | Konrad Zawora
  [Hardware][Intel-Gaudi] Add Intel Gaudi (HPU) inference backend (#6143)
  Signed-off-by: yuwenzho <yuwen.zhou@intel.com>
  Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
  Signed-off-by: Bob Zhu <bob.zhu@intel.com>
  Signed-off-by: zehao-intel <zehao.huang@intel.com>
  Signed-off-by: Konrad Zawora <kzawora@habana.ai>
  Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
  Co-authored-by: Sanju C Sudhakaran <scsudhakaran@habana.ai>
  Co-authored-by: Michal Adamczyk <madamczyk@habana.ai>
  Co-authored-by: Marceli Fylcek <mfylcek@habana.ai>
  Co-authored-by: Himangshu Lahkar <49579433+hlahkar@users.noreply.github.com>
  Co-authored-by: Vivek Goel <vgoel@habana.ai>
  Co-authored-by: yuwenzho <yuwen.zhou@intel.com>
  Co-authored-by: Dominika Olszewska <dolszewska@habana.ai>
  Co-authored-by: barak goldberg <149692267+bgoldberg-habana@users.noreply.github.com>
  Co-authored-by: Michal Szutenberg <37601244+szutenberg@users.noreply.github.com>
  Co-authored-by: Jan Kaniecki <jkaniecki@habana.ai>
  Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyniewicz-habana@users.noreply.github.com>
  Co-authored-by: Krzysztof Wisniewski <kwisniewski@habana.ai>
  Co-authored-by: Dudi Lester <160421192+dudilester@users.noreply.github.com>
  Co-authored-by: Ilia Taraban <tarabanil@gmail.com>
  Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
  Co-authored-by: Michał Kuligowski <mkuligowski@habana.ai>
  Co-authored-by: Jakub Maksymczuk <jmaksymczuk@habana.ai>
  Co-authored-by: Tomasz Zielinski <85164140+tzielinski-habana@users.noreply.github.com>
  Co-authored-by: Sun Choi <schoi@habana.ai>
  Co-authored-by: Iryna Boiko <iboiko@habana.ai>
  Co-authored-by: Bob Zhu <41610754+czhu15@users.noreply.github.com>
  Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>
  Co-authored-by: Zehao Huang <zehao.huang@intel.com>
  Co-authored-by: Andrzej Kotłowski <Andrzej.Kotlowski@intel.com>
  Co-authored-by: Yan Tomsinsky <73292515+Yantom1@users.noreply.github.com>
  Co-authored-by: Nir David <ndavid@habana.ai>
  Co-authored-by: Yu-Zhou <yu.zhou@intel.com>
  Co-authored-by: Ruheena Suhani Shaik <rsshaik@habana.ai>
  Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>
  Co-authored-by: Marcin Swiniarski <mswiniarski@habana.ai>
  Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
  Co-authored-by: Jacek Czaja <jacek.czaja@intel.com>
  Co-authored-by: Jacek Czaja <jczaja@habana.ai>
  Co-authored-by: Yuan <yuan.zhou@outlook.com>

21063c11c7 | 2024-11-06 07:11:55 +00:00 | Aaron Pham
  [CI/Build] drop support for Python 3.8 EOL (#8464)
  Signed-off-by: Aaron Pham <contact@aarnphm.xyz>

ffc0f2b47a | 2024-11-06 04:19:15 +00:00 | Peter Salas
  [Model][OpenVINO] Fix regressions from #8346 (#10045)
  Signed-off-by: Peter Salas <peter@fixie.ai>

4dbcbbeb09 | 2024-11-04 08:54:37 +00:00 | Yang Zheng
  [Misc] Compute query_start_loc/seq_start_loc on CPU (#9447)
  Co-authored-by: Yang Zheng(SW)(Alex) <you@example.com>

a78dd3303e | 2024-11-01 23:22:49 -07:00 | sroy745
  [Encoder Decoder] Add flash_attn kernel support for encoder-decoder models (#9559)

6c0b7f548d | 2024-11-01 16:21:10 -07:00 | Peter Salas
  [Core][VLM] Add precise multi-modal placeholder tracking (#8346)
  Signed-off-by: Peter Salas <peter@fixie.ai>

598b6d7b07 | 2024-11-01 12:15:05 -07:00 | Pavani Majety
  [Bugfix/Core] Flashinfer k_scale and v_scale (#9861)