Woosuk Kwon
|
89579a201f
|
[Misc] Use vllm-flash-attn instead of flash-attn (#4686)
|
2024-05-08 13:15:34 -07:00 |
|
youkaichao
|
20cfcdec99
|
[Core][Optimization] change python dict to pytorch tensor for blocks to swap (#4659)
|
2024-05-08 12:07:05 -07:00 |
|
Woosuk Kwon
|
5510cf0e8a
|
[Misc] Add get_name method to attention backends (#4685)
|
2024-05-08 09:59:31 -07:00 |
|
DefTruth
|
0f9a6e3d22
|
[Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi (#4573)
|
2024-05-08 09:19:58 -07:00 |
|
youkaichao
|
63575bc2e1
|
[Core][Optimization] change python dict to pytorch tensor (#4607)
|
2024-05-06 21:30:27 -07:00 |
|
Lily Liu
|
43c413ec57
|
[Kernel] Use flashinfer for decoding (#4353)
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
|
2024-05-03 15:51:27 -07:00 |
|
SangBin Cho
|
3521ba4f25
|
[Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518)
|
2024-05-03 10:20:12 -07:00 |
|
Michał Moskal
|
32881f3f31
|
[kernel] fix sliding window in prefix prefill Triton kernel (#4405)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
|
2024-05-02 11:23:37 -07:00 |
|
youkaichao
|
5b8a7c1cb0
|
[Misc] centralize all usage of environment variables (#4548)
|
2024-05-02 11:13:25 -07:00 |
|
Jee Li
|
d6f4bd7cdd
|
[Misc]Add customized information for models (#4132)
|
2024-04-30 21:18:14 -07:00 |
|
Hongxia Yang
|
18d23f642a
|
[ROCm][Hardware][AMD] Enable group query attention for triton FA (#4406)
|
2024-04-26 23:37:40 -07:00 |
|
Roy
|
b6dcb4d442
|
[Misc] Fix flash attention backend log (#4368)
|
2024-04-25 12:43:32 -07:00 |
|
SangBin Cho
|
0ae11f78ab
|
[Mypy] Part 3 fix typing for nested directories for most of directory (#4161)
|
2024-04-22 21:32:44 -07:00 |
|
Hongxia Yang
|
95e5b087cf
|
[AMD][Hardware][Misc][Bugfix] xformer cleanup and light navi logic and CI fixes and refactoring (#4129)
|
2024-04-21 21:57:24 -07:00 |
|
Michał Moskal
|
e8cc7967ff
|
[Bugfix][Kernel] allow non-power-of-two head sizes in prefix prefill (#4128)
|
2024-04-18 00:51:28 -07:00 |
|
Bellk17
|
d04973ad54
|
Fix triton compilation issue (#3984)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2024-04-12 16:41:26 -07:00 |
|
SangBin Cho
|
36729bac13
|
[Test] Test multiple attn backend for chunked prefill. (#4023)
|
2024-04-12 09:56:57 -07:00 |
|
bigPYJ1151
|
8afca50889
|
[Hardware][Intel] Isolate CPUModelRunner and ModelRunner for better maintenance (#3824)
|
2024-04-11 11:56:49 -07:00 |
|
Kunshang Ji
|
e9da5a40c6
|
[Misc] Add indirection layer for custom ops (#3913)
|
2024-04-10 20:26:07 -07:00 |
|
SangBin Cho
|
e42df7227d
|
[Test] Add xformer and flash attn tests (#3961)
Co-authored-by: Simon Mo <simon.mo@hey.com>
|
2024-04-11 03:09:50 +00:00 |
|
SangBin Cho
|
67b4221a61
|
[Core][5/N] Fully working chunked prefill e2e (#3884)
|
2024-04-10 17:56:48 -07:00 |
|
James Whedbee
|
8b317c6dd0
|
[Model][AMD] ROCm support for 256 head dims for Gemma (#3972)
|
2024-04-10 08:12:00 -07:00 |
|
Juan Villamizar
|
6c0b04515f
|
[ROCm][Hardware][AMD] Use Triton Kernel for default FA on ROCm (#3643)
Co-authored-by: jpvillam <jpvillam@amd.com>
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2024-04-09 15:10:47 -07:00 |
|
Woosuk Kwon
|
498eb5cfa3
|
[Bugfix] Add kv_scale input parameter to CPU backend (#3840)
|
2024-04-04 04:33:08 +00:00 |
|
Adrian Abeyta
|
2ff767b513
|
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290)
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2024-04-03 14:15:55 -07:00 |
|
bigPYJ1151
|
0e3f06fe9c
|
[Hardware][Intel] Add CPU inference backend (#3634)
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: Yuan Zhou <yuan.zhou@intel.com>
|
2024-04-01 22:07:30 -07:00 |
|
Hongxia Yang
|
9765b5c406
|
[ROCm][Bugfix] Fixed several bugs related to rccl path and attention selector logic (#3699)
|
2024-03-29 14:52:36 -07:00 |
|
Woosuk Kwon
|
395aa823ea
|
[Misc] Minor type annotation fix (#3716)
|
2024-03-28 21:12:24 -07:00 |
|
Simon Mo
|
4716a32dd4
|
fix logging msg for block manager (#3701)
|
2024-03-28 23:29:55 +00:00 |
|
SangBin Cho
|
01bfb22b41
|
[CI] Try introducing isort. (#3495)
|
2024-03-25 07:59:47 -07:00 |
|
Woosuk Kwon
|
925f3332ca
|
[Core] Refactor Attention Take 2 (#3462)
|
2024-03-25 04:39:33 +00:00 |
|