vllm/attention at 019877253be473bf0c12daaf2c29022150402052 - vllm - 丝路新云 Code Repository

xinyun/vllm

mirror of https://git.datalinker.icu/vllm-project/vllm.git synced 2026-06-29 05:57:12 +08:00

Alexander Matveev 019877253b

[Bugfix] multi-step + flashinfer: ensure cuda graph compatible (#8427)

2024-09-12 21:01:50 +00:00

[Bugfix] multi-step + flashinfer: ensure cuda graph compatible (#8427)

2024-09-12 21:01:50 +00:00

[Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel (#7208)

2024-08-12 22:47:41 +00:00

__init__.py

[Core] Add AttentionState abstraction (#7663)

2024-08-20 18:50:45 +00:00

layer.py

[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942)

2024-08-06 16:51:47 -04:00

selector.py

[Core][Kernels] Enable FP8 KV Cache with Flashinfer backend + BugFix for kv_cache_dtype=auto (#7985)

2024-08-29 14:53:11 -04:00