SangBin Cho | 3521ba4f25 | 2024-05-03 10:20:12 -07:00
  [Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518)

Adrian Abeyta | 2ff767b513 | 2024-04-03 14:15:55 -07:00
  Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290)
  Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
  Co-authored-by: HaiShaw <hixiao@gmail.com>
  Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
  Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
  Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
  Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
  Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
  Co-authored-by: guofangze <guofangze@kuaishou.com>
  Co-authored-by: Michael Goin <mgoin64@gmail.com>
  Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
  Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

Douglas Lehr | e4a28e5316 | 2024-03-10 15:27:45 -07:00
  [ROCM] Fix blockReduceSum to use correct warp counts for ROCm and CUDA (#3262)

zhaoyang-star | 923797fea4 | 2024-02-01 09:35:09 -08:00
  Fix compile error when using rocm (#2648)

zhaoyang-star | 9090bf02e7 | 2024-01-28 16:43:54 -08:00
  Support FP8-E5M2 KV Cache (#2279)
  Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
  Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>

Jee Li | 77af974b40 | 2024-01-02 19:09:59 -08:00
  [FIX] Support non-zero CUDA devices in custom kernels (#1959)

wbn | dacaf5a400 | 2023-12-10 10:12:53 -08:00
  Replace head_mapping params with num_kv_heads to attention kernel. (#1997)
  Co-authored-by: wangguoya <wangguoya@baidu.com>
  Co-authored-by: Yang Zhao <zhaoyangstar@foxmail.com>

TJian | 6ccc0bfffb | 2023-12-07 23:16:52 -08:00
  Merge EmbeddedLLM/vllm-rocm into vLLM main (#1836)
  Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
  Co-authored-by: Amir Balwel <amoooori04@gmail.com>
  Co-authored-by: root <kuanfu.liu@akirakan.com>
  Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
  Co-authored-by: kuanfu <kuanfu.liu@embeddedllm.com>
  Co-authored-by: miloice <17350011+kliuae@users.noreply.github.com>

Woosuk Kwon | 0ce8647dc5 | 2023-10-31 15:19:30 -07:00
  Fix integer overflows in attention & cache ops (#1514)

Woosuk Kwon | 928de46888 | 2023-10-16 00:59:57 -07:00
  Implement PagedAttention V2 (#1348)

Liang | ebe4d1db3a | 2023-10-01 11:35:06 -07:00
  Fix boundary check in paged attention kernel (#1241)

Antoni Baum | cf5cb1e33e | 2023-09-26 22:27:13 -07:00
  Allocate more shared memory to attention kernel (#1154)

Zhuohan Li | db09d4ad83 | 2023-09-07 15:53:14 -07:00
  [FIX] Fix Alibi implementation in PagedAttention kernel (#945)
  * [FIX] Fix Alibi implementation in PagedAttention kernel
  * Fix test_attention
  * Fix
  Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
  Co-authored-by: Oliver-ss <yuansongwx@outlook.com>

Woosuk Kwon | bf87484efa | 2023-09-04 09:20:06 +09:00
  [BugFix] Fix NaN errors in paged attention kernel (#936)

Dean Leitersdorf | 79af7e96a0 | 2023-08-04 10:57:29 -07:00
  [OPTIMIZATION] Optimizes the single_query_cached_kv_attention kernel (#420)

Zhuohan Li | 96853af5a8 | 2023-07-14 20:06:40 -04:00
  Optimize MQA Kernel (#452)

Andre Slavescu | c894836108 | 2023-07-08 17:55:16 -07:00
  [Model] Add support for GPT-J (#226)
  Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

Woosuk Kwon | 404422f42e | 2023-07-03 16:47:53 -07:00
  [Model] Add support for MPT (#334)

Woosuk Kwon | e41f06702c | 2023-07-03 13:12:35 -07:00
  Add support for BLOOM (#331)

Woosuk Kwon | 0b98ba15c7 | 2023-06-17 03:07:40 -07:00
  Change the name to vLLM (#150)

Woosuk Kwon | e38074b1e6 | 2023-06-07 00:40:21 -07:00
  Support FP32 (#141)

Woosuk Kwon | d721168449 | 2023-05-27 00:59:32 -07:00
  Improve setup script & Add a guard for bfloat16 kernels (#130)

Woosuk Kwon | 667ba3995c | 2023-05-14 22:19:19 -07:00
  Add copyright headers to source files adapted from FT (#104)

Woosuk Kwon | 130d5fd8c7 | 2023-05-04 02:56:09 -07:00
  Fix a bug in attention kernel (#68)

Woosuk Kwon | e070829ae8 | 2023-05-03 14:09:44 -07:00
  Support bfloat16 data type (#54)

Woosuk Kwon | 436e523bf1 | 2023-05-03 13:40:13 -07:00
  Refactor attention kernels (#53)