Mirror of https://git.datalinker.icu/ali-vilab/TeaCache, synced 2025-12-09 21:04:25 +08:00

Compare commits: 3818a366b6 ... 7c10efc470 (35 commits)
Commits: 7c10efc470, 2f5a990ee8, c730b01e42, 78d2f837d5, ff6a083896, 0a9b0358ca, 6a470cfade, 5670dc8e99, f7d676521a, c9e2d6454c, 845823eed4, 4588c2d970, 6a9d6e0c84, e945259c7d, ca1c215ee7, 3dd7c3ffa2, 9caba2ff26, f6325a5bb3, 1c96035d27, e1f6b3ea77, 2a85f3abe1, 6b36ef8168, fca6462a17, efbeb585ba, 8870cf27de, d680b3a2df, a312550104, 7c0aad1585, 73d9573763, 129a05d9c6, 36b6ed12c9, 0870af8a1d, 109add7c79, 2af6e6dc99, ac4302b15d
README.md (29 lines changed)
@@ -1,4 +1,4 @@
-# [CVPR 2025] Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model
+# [CVPR 2025 Highlight] Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model
 
 <div class="is-size-5 publication-authors", align="center",>
   <span class="author-block">
@@ -64,9 +64,14 @@ We introduce Timestep Embedding Aware Cache (TeaCache), a training-free caching
 
 ## 🔥 Latest News
 - **If you like our project, please give us a star ⭐ on GitHub for the latest update.**
+- [2025/06/08] 🔥 Update coefficients of [Lumina-Image-2.0](https://github.com/Alpha-VLLM/Lumina-Image-2.0). Thanks [@spawner1145](https://github.com/spawner1145).
+- [2025/05/26] 🔥 Support [Lumina-Image-2.0](https://github.com/Alpha-VLLM/Lumina-Image-2.0). Thanks [@spawner1145](https://github.com/spawner1145).
+- [2025/05/25] 🔥 Support [HiDream-I1](https://github.com/HiDream-ai/HiDream-I1). Thanks [@YunjieYu](https://github.com/YunjieYu).
+- [2025/04/14] 🔥 Update coefficients of [CogVideoX1.5](https://github.com/THUDM/CogVideo). Thanks [@zishen-ucap](https://github.com/zishen-ucap).
+- [2025/04/05] 🎉 Recommended as a **highlight** in CVPR 2025, top 16.8% of accepted papers and top 3.7% of all papers.
 - [2025/03/13] 🔥 Optimized TeaCache for [Wan2.1](https://github.com/Wan-Video/Wan2.1). Thanks [@zishen-ucap](https://github.com/zishen-ucap).
 - [2025/03/05] 🔥 Support [Wan2.1](https://github.com/Wan-Video/Wan2.1) for both T2V and I2V.
-- [2025/02/27] 🎉 Accepted in CVPR 2025.
+- [2025/02/27] 🎉 Accepted in **CVPR 2025**.
 - [2025/01/24] 🔥 Support [Cosmos](https://github.com/NVIDIA/Cosmos) for both T2V and I2V. Thanks [@zishen-ucap](https://github.com/zishen-ucap).
 - [2025/01/20] 🔥 Support [CogVideoX1.5-5B](https://github.com/THUDM/CogVideo) for both T2V and I2V. Thanks [@zishen-ucap](https://github.com/zishen-ucap).
 - [2025/01/07] 🔥 Support [TangoFlux](https://github.com/declare-lab/TangoFlux). TeaCache works well for Audio Diffusion Models!
@@ -82,17 +87,18 @@ We introduce Timestep Embedding Aware Cache (TeaCache), a training-free caching
 If you develop/use TeaCache in your projects and you would like more people to see it, please inform us. (liufeng20@mails.ucas.ac.cn)
 
 **Model**
+- [FramePack](https://github.com/lllyasviel/FramePack) supports TeaCache. Thanks [@lllyasviel](https://github.com/lllyasviel).
 - [FastVideo](https://github.com/hao-ai-lab/FastVideo) supports TeaCache. Thanks [@BrianChen1129](https://github.com/BrianChen1129) and [@jzhang38](https://github.com/jzhang38).
 - [EasyAnimate](https://github.com/aigc-apps/EasyAnimate) supports TeaCache. Thanks [@hkunzhe](https://github.com/hkunzhe) and [@bubbliiiing](https://github.com/bubbliiiing).
 - [Ruyi-Models](https://github.com/IamCreateAI/Ruyi-Models) supports TeaCache. Thanks [@cellzero](https://github.com/cellzero).
 - [ConsisID](https://github.com/PKU-YuanGroup/ConsisID) supports TeaCache. Thanks [@SHYuanBest](https://github.com/SHYuanBest).
 
 **ComfyUI**
+- [ComfyUI-TeaCache](https://github.com/welltop-cn/ComfyUI-TeaCache) for TeaCache. Thanks [@YunjieYu](https://github.com/YunjieYu).
 - [ComfyUI-WanVideoWrapper](https://github.com/kijai/ComfyUI-WanVideoWrapper) supports TeaCache4Wan2.1. Thanks [@kijai](https://github.com/kijai).
 - [ComfyUI-TangoFlux](https://github.com/LucipherDev/ComfyUI-TangoFlux) supports TeaCache. Thanks [@LucipherDev](https://github.com/LucipherDev).
 - [ComfyUI_Patches_ll](https://github.com/lldacing/ComfyUI_Patches_ll) supports TeaCache. Thanks [@lldacing](https://github.com/lldacing).
 - [Comfyui_TTP_Toolset](https://github.com/TTPlanetPig/Comfyui_TTP_Toolset) supports TeaCache. Thanks [@TTPlanetPig](https://github.com/TTPlanetPig).
-- [ComfyUI-TeaCache](https://github.com/welltop-cn/ComfyUI-TeaCache) for TeaCache. Thanks [@YunjieYu](https://github.com/YunjieYu).
 - [ComfyUI-TeaCacheHunyuanVideo](https://github.com/facok/ComfyUI-TeaCacheHunyuanVideo) for TeaCache4HunyuanVideo. Thanks [@facok](https://github.com/facok).
 - [ComfyUI-HunyuanVideoWrapper](https://github.com/kijai/ComfyUI-HunyuanVideoWrapper) supports TeaCache4HunyuanVideo. Thanks [@kijai](https://github.com/kijai), [ctf05](https://github.com/ctf05) and [DarioFT](https://github.com/DarioFT).
 
@@ -101,13 +107,13 @@ If you develop/use TeaCache in your projects and you would like more people to s
 - [Teacache-xDiT](https://github.com/MingXiangL/Teacache-xDiT) for multi-gpu inference. Thanks [@MingXiangL](https://github.com/MingXiangL).
 
 **Engine**
+- [SD.Next](https://github.com/vladmandic/sdnext) supports TeaCache. Thanks [@vladmandic](https://github.com/vladmandic).
 - [DiffSynth Studio](https://github.com/modelscope/DiffSynth-Studio) supports TeaCache. Thanks [@Artiprocher](https://github.com/Artiprocher).
 
 ## 🎉 Supported Models
 **Text to Video**
 - [TeaCache4Wan2.1](./TeaCache4Wan2.1/README.md)
 - [TeaCache4Cosmos](./eval/TeaCache4Cosmos/README.md)
-- EasyAnimate, see [here](https://github.com/aigc-apps/EasyAnimate).
 - [TeaCache4CogVideoX1.5](./TeaCache4CogVideoX1.5/README.md)
 - [TeaCache4LTX-Video](./TeaCache4LTX-Video/README.md)
 - [TeaCache4Mochi](./TeaCache4Mochi/README.md)
@@ -117,22 +123,15 @@ If you develop/use TeaCache in your projects and you would like more people to s
 - [TeaCache4Open-Sora-Plan](./eval/teacache/README.md)
 - [TeaCache4Latte](./eval/teacache/README.md)
 
-
-
-
 **Image to Video**
 - [TeaCache4Wan2.1](./TeaCache4Wan2.1/README.md)
 - [TeaCache4Cosmos](./eval/TeaCache4Cosmos/README.md)
-- EasyAnimate, see [here](https://github.com/aigc-apps/EasyAnimate).
-- Ruyi-Models. See [here](https://github.com/IamCreateAI/Ruyi-Models).
 - [TeaCache4CogVideoX1.5](./TeaCache4CogVideoX1.5/README.md)
 - [TeaCache4ConsisID](./TeaCache4ConsisID/README.md)
 
 
-**Video to Video**
-- EasyAnimate, see [here](https://github.com/aigc-apps/EasyAnimate).
-
-
 **Text to Image**
+- [TeaCache4Lumina2](./TeaCache4Lumina2/README.md)
+- [TeaCache4HiDream-I1](./TeaCache4HiDream-I1/README.md)
 - [TeaCache4FLUX](./TeaCache4FLUX/README.md)
 - [TeaCache4Lumina-T2X](./TeaCache4Lumina-T2X/README.md)
@@ -146,12 +145,12 @@ If you develop/use TeaCache in your projects and you would like more people to s
 
 ## 💐 Acknowledgement
 
-This repository is built based on [VideoSys](https://github.com/NUS-HPC-AI-Lab/VideoSys), [Diffusers](https://github.com/huggingface/diffusers), [Open-Sora](https://github.com/hpcaitech/Open-Sora), [Open-Sora-Plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan), [Latte](https://github.com/Vchitect/Latte), [CogVideoX](https://github.com/THUDM/CogVideo), [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), [ConsisID](https://github.com/PKU-YuanGroup/ConsisID), [FLUX](https://github.com/black-forest-labs/flux), [Mochi](https://github.com/genmoai/mochi), [LTX-Video](https://github.com/Lightricks/LTX-Video), [Lumina-T2X](https://github.com/Alpha-VLLM/Lumina-T2X), [TangoFlux](https://github.com/declare-lab/TangoFlux), [Cosmos](https://github.com/NVIDIA/Cosmos) and [Wan2.1](https://github.com/Wan-Video/Wan2.1). Thanks for their contributions!
+This repository is built based on [VideoSys](https://github.com/NUS-HPC-AI-Lab/VideoSys), [Diffusers](https://github.com/huggingface/diffusers), [Open-Sora](https://github.com/hpcaitech/Open-Sora), [Open-Sora-Plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan), [Latte](https://github.com/Vchitect/Latte), [CogVideoX](https://github.com/THUDM/CogVideo), [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), [ConsisID](https://github.com/PKU-YuanGroup/ConsisID), [FLUX](https://github.com/black-forest-labs/flux), [Mochi](https://github.com/genmoai/mochi), [LTX-Video](https://github.com/Lightricks/LTX-Video), [Lumina-T2X](https://github.com/Alpha-VLLM/Lumina-T2X), [TangoFlux](https://github.com/declare-lab/TangoFlux), [Cosmos](https://github.com/NVIDIA/Cosmos), [Wan2.1](https://github.com/Wan-Video/Wan2.1), [HiDream-I1](https://github.com/HiDream-ai/HiDream-I1) and [Lumina-Image-2.0](https://github.com/Alpha-VLLM/Lumina-Image-2.0). Thanks for their contributions!
 
 ## 🔒 License
 
 * The majority of this project is released under the Apache 2.0 license as found in the [LICENSE](./LICENSE) file.
-* For [VideoSys](https://github.com/NUS-HPC-AI-Lab/VideoSys), [Diffusers](https://github.com/huggingface/diffusers), [Open-Sora](https://github.com/hpcaitech/Open-Sora), [Open-Sora-Plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan), [Latte](https://github.com/Vchitect/Latte), [CogVideoX](https://github.com/THUDM/CogVideo), [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), [ConsisID](https://github.com/PKU-YuanGroup/ConsisID), [FLUX](https://github.com/black-forest-labs/flux), [Mochi](https://github.com/genmoai/mochi), [LTX-Video](https://github.com/Lightricks/LTX-Video), [Lumina-T2X](https://github.com/Alpha-VLLM/Lumina-T2X), [TangoFlux](https://github.com/declare-lab/TangoFlux), [Cosmos](https://github.com/NVIDIA/Cosmos) and [Wan2.1](https://github.com/Wan-Video/Wan2.1), please follow their LICENSE.
+* For [VideoSys](https://github.com/NUS-HPC-AI-Lab/VideoSys), [Diffusers](https://github.com/huggingface/diffusers), [Open-Sora](https://github.com/hpcaitech/Open-Sora), [Open-Sora-Plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan), [Latte](https://github.com/Vchitect/Latte), [CogVideoX](https://github.com/THUDM/CogVideo), [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), [ConsisID](https://github.com/PKU-YuanGroup/ConsisID), [FLUX](https://github.com/black-forest-labs/flux), [Mochi](https://github.com/genmoai/mochi), [LTX-Video](https://github.com/Lightricks/LTX-Video), [Lumina-T2X](https://github.com/Alpha-VLLM/Lumina-T2X), [TangoFlux](https://github.com/declare-lab/TangoFlux), [Cosmos](https://github.com/NVIDIA/Cosmos), [Wan2.1](https://github.com/Wan-Video/Wan2.1), [HiDream-I1](https://github.com/HiDream-ai/HiDream-I1) and [Lumina-Image-2.0](https://github.com/Alpha-VLLM/Lumina-Image-2.0), please follow their LICENSE.
 * The service is a research preview. Please contact us if you find any potential violations. (liufeng20@mails.ucas.ac.cn)
 
 ## 📖 Citation
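Every integration listed in the README above wires TeaCache into a Diffusers pipeline in the same way: the transformer's `forward` is replaced by a TeaCache-aware version and the cache state is attached as class attributes. The sketch below only illustrates that wiring pattern; `DummyTransformer`, `DummyPipeline` and `my_teacache_forward` are placeholders, while the attribute names match the real scripts further down this page.

```python
# Sketch of the monkey-patching pattern used by the TeaCache scripts in this compare.
# The Dummy* classes and my_teacache_forward are placeholders for illustration only.
class DummyTransformer:
    def forward(self, x):
        return x  # stands in for the real denoising transformer


class DummyPipeline:
    def __init__(self):
        self.transformer = DummyTransformer()


def my_teacache_forward(self, x):
    # A real implementation decides here whether to recompute the blocks or to
    # reuse self.previous_residual (see teacache_forward further down this page).
    return x


pipe = DummyPipeline()
num_inference_steps = 50

# TeaCache state lives on the transformer *class*, exactly as in the scripts below.
pipe.transformer.__class__.enable_teacache = True
pipe.transformer.__class__.cnt = 0
pipe.transformer.__class__.num_steps = num_inference_steps
pipe.transformer.__class__.rel_l1_thresh = 0.2  # larger value -> more caching -> faster but lossier
pipe.transformer.__class__.forward = my_teacache_forward
```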
TeaCache4CogVideoX1.5/README.md

@@ -3,19 +3,19 @@
 
 [TeaCache](https://github.com/LiewFeng/TeaCache) can speed up [CogVideoX1.5](https://github.com/THUDM/CogVideo) 1.8x without much visual quality degradation, in a training-free manner. The following video shows the results generated by TeaCache-CogVideoX1.5 with various `rel_l1_thresh` values: 0 (original), 0.1 (1.3x speedup), 0.2 (1.8x speedup), and 0.3 (2.1x speedup). Additionally, the image-to-video (i2v) results are also demonstrated, with the following speedups: 0.1 (1.5x speedup), 0.2 (2.2x speedup), and 0.3 (2.7x speedup).
 
-https://github.com/user-attachments/assets/c444b850-3252-4b37-ad4a-122d389218d9
+https://github.com/user-attachments/assets/21261b03-71c6-47bf-9769-2a81c8dc452f
 
-https://github.com/user-attachments/assets/5f181a57-d5e3-46db-b388-8591e50f98e2
+https://github.com/user-attachments/assets/5e98e646-4034-4ae7-9680-a65ecd88dac9
 
 ## 📈 Inference Latency Comparisons on a Single H100 GPU
 
 | CogVideoX1.5-t2v | TeaCache (0.1) | TeaCache (0.2) | TeaCache (0.3) |
 | :--------------: | :------------: | :------------: | :------------: |
-| ~465 s | ~372 s | ~261 s | ~223 s |
+| ~465 s | ~322 s | ~260 s | ~204 s |
 
 | CogVideoX1.5-i2v | TeaCache (0.1) | TeaCache (0.2) | TeaCache (0.3) |
 | :--------------: | :------------: | :------------: | :------------: |
-| ~475 s | ~323 s | ~218 s | ~171 s |
+| ~475 s | ~316 s | ~239 s | ~204 s |
 
 ## Installation
 
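The latency table above falls out of TeaCache's skipping rule: at each denoising step the relative L1 change of the timestep-modulated embedding is rescaled by a fitted polynomial and accumulated, and the transformer blocks are only recomputed once the accumulated value crosses `rel_l1_thresh`, so a larger threshold skips more steps. The sketch below isolates that rule; the random embedding is a stand-in, while the coefficients are the CogVideoX1.5-5B values from the script that follows.

```python
# Minimal, self-contained sketch of the TeaCache skip decision.
# Toy embeddings are made up; coefficients/threshold come from the script below.
import numpy as np
import torch

coefficients = [2.50210439e+02, -1.65061612e+02, 3.57804877e+01, -7.81551492e-01, 3.58559703e-02]
rescale_func = np.poly1d(coefficients)
rel_l1_thresh = 0.2          # higher threshold -> more skipped steps -> more speedup
accumulated = 0.0
previous_emb = None

for step in range(50):
    emb = torch.randn(1, 512)          # stand-in for the timestep/modulation embedding
    if previous_emb is None:
        should_calc = True             # always run the first step
        accumulated = 0.0
    else:
        rel_l1 = ((emb - previous_emb).abs().mean() / previous_emb.abs().mean()).item()
        accumulated += rescale_func(rel_l1)
        if accumulated < rel_l1_thresh:
            should_calc = False        # reuse the cached residual instead of running the blocks
        else:
            should_calc = True
            accumulated = 0.0
    previous_emb = emb
```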
@@ -6,6 +6,14 @@ from diffusers.models.modeling_outputs import Transformer2DModelOutput
 from diffusers.utils import USE_PEFT_BACKEND, is_torch_version, scale_lora_layers, unscale_lora_layers, export_to_video, load_image
 from diffusers import CogVideoXPipeline, CogVideoXImageToVideoPipeline
 
+coefficients_dict = {
+    "CogVideoX-2b": [-3.10658903e+01, 2.54732368e+01, -5.92380459e+00, 1.75769064e+00, -3.61568434e-03],
+    "CogVideoX-5b": [-1.53880483e+03, 8.43202495e+02, -1.34363087e+02, 7.97131516e+00, -5.23162339e-02],
+    "CogVideoX-5b-I2V": [-1.53880483e+03, 8.43202495e+02, -1.34363087e+02, 7.97131516e+00, -5.23162339e-02],
+    "CogVideoX1.5-5B": [2.50210439e+02, -1.65061612e+02, 3.57804877e+01, -7.81551492e-01, 3.58559703e-02],
+    "CogVideoX1.5-5B-I2V": [1.22842302e+02, -1.04088754e+02, 2.62981677e+01, -3.06009921e-01, 3.71213220e-02],
+}
+
 
 def teacache_forward(
     self,
@@ -64,13 +72,7 @@ def teacache_forward(
             should_calc = True
             self.accumulated_rel_l1_distance = 0
         else:
-            if not self.config.use_rotary_positional_embeddings:
-                # CogVideoX-2B
-                coefficients = [-3.10658903e+01, 2.54732368e+01, -5.92380459e+00, 1.75769064e+00, -3.61568434e-03]
-            else:
-                # CogVideoX-5B and CogvideoX1.5-5B
-                coefficients = [-1.53880483e+03, 8.43202495e+02, -1.34363087e+02, 7.97131516e+00, -5.23162339e-02]
-            rescale_func = np.poly1d(coefficients)
+            rescale_func = np.poly1d(self.coefficients)
             self.accumulated_rel_l1_distance += rescale_func(((emb-self.previous_modulated_input).abs().mean() / self.previous_modulated_input.abs().mean()).cpu().item())
             if self.accumulated_rel_l1_distance < self.rel_l1_thresh:
                 should_calc = False
@@ -196,6 +198,7 @@ def main(args):
     guidance_scale = args.guidance_scale
     fps = args.fps
     image_path = args.image_path
+    mode = ckpts_path.split("/")[-1]
 
     if generate_type == "t2v":
         pipe = CogVideoXPipeline.from_pretrained(ckpts_path, torch_dtype=torch.bfloat16)
@@ -212,6 +215,7 @@ def main(args):
     pipe.transformer.__class__.previous_residual_encoder = None
     pipe.transformer.__class__.num_steps = num_inference_steps
     pipe.transformer.__class__.cnt = 0
+    pipe.transformer.__class__.coefficients = coefficients_dict[mode]
     pipe.transformer.__class__.forward = teacache_forward
 
     pipe.to("cuda")
@@ -243,7 +247,7 @@ def main(args):
         generator=torch.Generator("cuda").manual_seed(seed),  # Set the seed for reproducibility
     ).frames[0]
     words = prompt.split()[:5]
-    video_path = f"{output_path}/teacache_cogvideox1.5-5B_{words}.mp4"
+    video_path = f"{output_path}/teacache_cogvideox1.5-5B_{words}_{rel_l1_thresh}.mp4"
     export_to_video(video, video_path, fps=fps)
 
 
@@ -263,7 +267,7 @@ if __name__ == "__main__":
     parser.add_argument("--height", type=int, default=768, help="Number of steps for the inference process")
     parser.add_argument("--num_frames", type=int, default=81, help="Number of steps for the inference process")
    parser.add_argument("--guidance_scale", type=float, default=6.0, help="The scale for classifier-free guidance")
-    parser.add_argument("--fps", type=int, default=16, help="Number of steps for the inference process")
+    parser.add_argument("--fps", type=int, default=16, help="Frame rate of video")
     args = parser.parse_args()
 
     main(args)
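The diff above replaces the hard-coded coefficient branches with a dictionary lookup keyed by the last component of the checkpoint path (`mode = ckpts_path.split("/")[-1]`), so the checkpoint folder name must match a key of `coefficients_dict`. A small illustration of that lookup, with hypothetical local paths:

```python
# Illustration of the new coefficient lookup (the /models paths are hypothetical).
coefficients_dict = {
    "CogVideoX-2b": [-3.10658903e+01, 2.54732368e+01, -5.92380459e+00, 1.75769064e+00, -3.61568434e-03],
    "CogVideoX1.5-5B": [2.50210439e+02, -1.65061612e+02, 3.57804877e+01, -7.81551492e-01, 3.58559703e-02],
    "CogVideoX1.5-5B-I2V": [1.22842302e+02, -1.04088754e+02, 2.62981677e+01, -3.06009921e-01, 3.71213220e-02],
}

for ckpts_path in ["/models/CogVideoX1.5-5B", "/models/CogVideoX1.5-5B-I2V"]:
    mode = ckpts_path.split("/")[-1]        # last path component selects the model
    coefficients = coefficients_dict[mode]  # raises KeyError if the folder is renamed
    print(mode, coefficients)
```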
TeaCache4HiDream-I1/README.md (new file, 43 lines)
<!-- ## **TeaCache4HiDream-I1** -->
# TeaCache4HiDream-I1

[TeaCache](https://github.com/LiewFeng/TeaCache) can speed up [HiDream-I1](https://github.com/HiDream-ai/HiDream-I1) 2x without much visual quality degradation, in a training-free manner. The following image shows the results generated by TeaCache-HiDream-I1-Full with various `rel_l1_thresh` values: 0 (original), 0.17 (1.5x speedup), 0.25 (1.7x speedup), 0.3 (2.0x speedup), and 0.45 (2.6x speedup).



## 📈 Inference Latency Comparisons on a Single A100

| HiDream-I1-Full | TeaCache (0.17) | TeaCache (0.25) | TeaCache (0.3) | TeaCache (0.45) |
|:---------------:|:---------------:|:---------------:|:--------------:|:---------------:|
| ~50 s | ~34 s | ~29 s | ~25 s | ~19 s |

## Installation

```shell
pip install git+https://github.com/huggingface/diffusers
pip install --upgrade transformers protobuf tiktoken tokenizers sentencepiece
```

## Usage

You can modify the `rel_l1_thresh` in line 297 to obtain your desired trade-off between latency and visual quality. For single-gpu inference, you can use the following command:

```bash
python teacache_hidream_i1.py
```

## Citation

If you find TeaCache useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.

```
@article{liu2024timestep,
  title={Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model},
  author={Liu, Feng and Zhang, Shiwei and Wang, Xiaofeng and Wei, Yujie and Qiu, Haonan and Zhao, Yuzhong and Zhang, Yingya and Ye, Qixiang and Wan, Fang},
  journal={arXiv preprint arXiv:2411.19108},
  year={2024}
}
```

## Acknowledgements

We would like to thank the contributors to [HiDream-I1](https://github.com/HiDream-ai/HiDream-I1) and [Diffusers](https://github.com/huggingface/diffusers).
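Besides `rel_l1_thresh`, the script below always recomputes an initial warm-up window: `ret_steps` is set to 10% of the inference steps, and the cache check is only applied once the call counter `cnt` has passed it. A minimal sketch of that gating, reusing the script's default numbers (the counting loop itself is only illustrative):

```python
# Warm-up gating used by teacache_hidream_i1.py (illustrative).
num_inference_steps = 50
ret_steps = num_inference_steps * 0.1  # = 5.0: the first forward calls are always computed

always_computed = sum(1 for cnt in range(num_inference_steps) if cnt < ret_steps)
print(f"{always_computed} of {num_inference_steps} calls bypass the cache check entirely")
```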
TeaCache4HiDream-I1/teacache_hidream_i1.py (new file, 307 lines)
from typing import Any, Dict, List, Optional, Tuple
from transformers import PreTrainedTokenizerFast, LlamaForCausalLM
from diffusers import HiDreamImagePipeline
from diffusers.models import HiDreamImageTransformer2DModel
from diffusers.models.modeling_outputs import Transformer2DModelOutput
from diffusers.utils import logging, deprecate, USE_PEFT_BACKEND, scale_lora_layers, unscale_lora_layers

import torch
import numpy as np


logger = logging.get_logger(__name__)  # pylint: disable=invalid-name


def teacache_forward(
    self,
    hidden_states: torch.Tensor,
    timesteps: torch.LongTensor = None,
    encoder_hidden_states_t5: torch.Tensor = None,
    encoder_hidden_states_llama3: torch.Tensor = None,
    pooled_embeds: torch.Tensor = None,
    img_ids: Optional[torch.Tensor] = None,
    img_sizes: Optional[List[Tuple[int, int]]] = None,
    hidden_states_masks: Optional[torch.Tensor] = None,
    attention_kwargs: Optional[Dict[str, Any]] = None,
    return_dict: bool = True,
    **kwargs,
):
    encoder_hidden_states = kwargs.get("encoder_hidden_states", None)

    if encoder_hidden_states is not None:
        deprecation_message = "The `encoder_hidden_states` argument is deprecated. Please use `encoder_hidden_states_t5` and `encoder_hidden_states_llama3` instead."
        deprecate("encoder_hidden_states", "0.35.0", deprecation_message)
        encoder_hidden_states_t5 = encoder_hidden_states[0]
        encoder_hidden_states_llama3 = encoder_hidden_states[1]

    if img_ids is not None and img_sizes is not None and hidden_states_masks is None:
        deprecation_message = (
            "Passing `img_ids` and `img_sizes` with unpachified `hidden_states` is deprecated and will be ignored."
        )
        deprecate("img_ids", "0.35.0", deprecation_message)

    if hidden_states_masks is not None and (img_ids is None or img_sizes is None):
        raise ValueError("if `hidden_states_masks` is passed, `img_ids` and `img_sizes` must also be passed.")
    elif hidden_states_masks is not None and hidden_states.ndim != 3:
        raise ValueError(
            "if `hidden_states_masks` is passed, `hidden_states` must be a 3D tensors with shape (batch_size, patch_height * patch_width, patch_size * patch_size * channels)"
        )

    if attention_kwargs is not None:
        attention_kwargs = attention_kwargs.copy()
        lora_scale = attention_kwargs.pop("scale", 1.0)
    else:
        lora_scale = 1.0

    if USE_PEFT_BACKEND:
        # weight the lora layers by setting `lora_scale` for each PEFT layer
        scale_lora_layers(self, lora_scale)
    else:
        if attention_kwargs is not None and attention_kwargs.get("scale", None) is not None:
            logger.warning(
                "Passing `scale` via `attention_kwargs` when not using the PEFT backend is ineffective."
            )

    # spatial forward
    batch_size = hidden_states.shape[0]
    hidden_states_type = hidden_states.dtype

    # Patchify the input
    if hidden_states_masks is None:
        hidden_states, hidden_states_masks, img_sizes, img_ids = self.patchify(hidden_states)

    # Embed the hidden states
    hidden_states = self.x_embedder(hidden_states)

    # 0. time
    timesteps = self.t_embedder(timesteps, hidden_states_type)
    p_embedder = self.p_embedder(pooled_embeds)
    temb = timesteps + p_embedder

    encoder_hidden_states = [encoder_hidden_states_llama3[k] for k in self.config.llama_layers]

    if self.caption_projection is not None:
        new_encoder_hidden_states = []
        for i, enc_hidden_state in enumerate(encoder_hidden_states):
            enc_hidden_state = self.caption_projection[i](enc_hidden_state)
            enc_hidden_state = enc_hidden_state.view(batch_size, -1, hidden_states.shape[-1])
            new_encoder_hidden_states.append(enc_hidden_state)
        encoder_hidden_states = new_encoder_hidden_states
        encoder_hidden_states_t5 = self.caption_projection[-1](encoder_hidden_states_t5)
        encoder_hidden_states_t5 = encoder_hidden_states_t5.view(batch_size, -1, hidden_states.shape[-1])
        encoder_hidden_states.append(encoder_hidden_states_t5)

    txt_ids = torch.zeros(
        batch_size,
        encoder_hidden_states[-1].shape[1]
        + encoder_hidden_states[-2].shape[1]
        + encoder_hidden_states[0].shape[1],
        3,
        device=img_ids.device,
        dtype=img_ids.dtype,
    )
    ids = torch.cat((img_ids, txt_ids), dim=1)
    image_rotary_emb = self.pe_embedder(ids)

    # 2. Blocks
    block_id = 0
    initial_encoder_hidden_states = torch.cat([encoder_hidden_states[-1], encoder_hidden_states[-2]], dim=1)
    initial_encoder_hidden_states_seq_len = initial_encoder_hidden_states.shape[1]

    if self.enable_teacache:
        # TeaCache: decide from the timestep embedding whether this call can reuse the cached residual
        modulated_inp = timesteps.clone()
        if self.cnt < self.ret_steps:
            should_calc = True
            self.accumulated_rel_l1_distance = 0
        else:
            rescale_func = np.poly1d(self.coefficients)
            self.accumulated_rel_l1_distance += rescale_func(((modulated_inp-self.previous_modulated_input).abs().mean() / self.previous_modulated_input.abs().mean()).cpu().item())
            if self.accumulated_rel_l1_distance < self.rel_l1_thresh:
                should_calc = False
            else:
                should_calc = True
                self.accumulated_rel_l1_distance = 0
        self.previous_modulated_input = modulated_inp
        self.cnt += 1
        if self.cnt == self.num_steps:
            self.cnt = 0

    if self.enable_teacache:
        if not should_calc:
            # Skip the transformer blocks and reuse the cached residual
            hidden_states += self.previous_residual
        else:
            # 2. Blocks
            ori_hidden_states = hidden_states.clone()
            for bid, block in enumerate(self.double_stream_blocks):
                cur_llama31_encoder_hidden_states = encoder_hidden_states[block_id]
                cur_encoder_hidden_states = torch.cat(
                    [initial_encoder_hidden_states, cur_llama31_encoder_hidden_states], dim=1
                )
                if torch.is_grad_enabled() and self.gradient_checkpointing:
                    hidden_states, initial_encoder_hidden_states = self._gradient_checkpointing_func(
                        block,
                        hidden_states,
                        hidden_states_masks,
                        cur_encoder_hidden_states,
                        temb,
                        image_rotary_emb,
                    )
                else:
                    hidden_states, initial_encoder_hidden_states = block(
                        hidden_states=hidden_states,
                        hidden_states_masks=hidden_states_masks,
                        encoder_hidden_states=cur_encoder_hidden_states,
                        temb=temb,
                        image_rotary_emb=image_rotary_emb,
                    )
                initial_encoder_hidden_states = initial_encoder_hidden_states[:, :initial_encoder_hidden_states_seq_len]
                block_id += 1

            image_tokens_seq_len = hidden_states.shape[1]
            hidden_states = torch.cat([hidden_states, initial_encoder_hidden_states], dim=1)
            hidden_states_seq_len = hidden_states.shape[1]
            if hidden_states_masks is not None:
                encoder_attention_mask_ones = torch.ones(
                    (batch_size, initial_encoder_hidden_states.shape[1] + cur_llama31_encoder_hidden_states.shape[1]),
                    device=hidden_states_masks.device,
                    dtype=hidden_states_masks.dtype,
                )
                hidden_states_masks = torch.cat([hidden_states_masks, encoder_attention_mask_ones], dim=1)

            for bid, block in enumerate(self.single_stream_blocks):
                cur_llama31_encoder_hidden_states = encoder_hidden_states[block_id]
                hidden_states = torch.cat([hidden_states, cur_llama31_encoder_hidden_states], dim=1)
                if torch.is_grad_enabled() and self.gradient_checkpointing:
                    hidden_states = self._gradient_checkpointing_func(
                        block,
                        hidden_states,
                        hidden_states_masks,
                        None,
                        temb,
                        image_rotary_emb,
                    )
                else:
                    hidden_states = block(
                        hidden_states=hidden_states,
                        hidden_states_masks=hidden_states_masks,
                        encoder_hidden_states=None,
                        temb=temb,
                        image_rotary_emb=image_rotary_emb,
                    )
                hidden_states = hidden_states[:, :hidden_states_seq_len]
                block_id += 1

            hidden_states = hidden_states[:, :image_tokens_seq_len, ...]
            self.previous_residual = hidden_states - ori_hidden_states
    else:
        for bid, block in enumerate(self.double_stream_blocks):
            cur_llama31_encoder_hidden_states = encoder_hidden_states[block_id]
            cur_encoder_hidden_states = torch.cat(
                [initial_encoder_hidden_states, cur_llama31_encoder_hidden_states], dim=1
            )
            if torch.is_grad_enabled() and self.gradient_checkpointing:
                hidden_states, initial_encoder_hidden_states = self._gradient_checkpointing_func(
                    block,
                    hidden_states,
                    hidden_states_masks,
                    cur_encoder_hidden_states,
                    temb,
                    image_rotary_emb,
                )
            else:
                hidden_states, initial_encoder_hidden_states = block(
                    hidden_states=hidden_states,
                    hidden_states_masks=hidden_states_masks,
                    encoder_hidden_states=cur_encoder_hidden_states,
                    temb=temb,
                    image_rotary_emb=image_rotary_emb,
                )
            initial_encoder_hidden_states = initial_encoder_hidden_states[:, :initial_encoder_hidden_states_seq_len]
            block_id += 1

        image_tokens_seq_len = hidden_states.shape[1]
        hidden_states = torch.cat([hidden_states, initial_encoder_hidden_states], dim=1)
        hidden_states_seq_len = hidden_states.shape[1]
        if hidden_states_masks is not None:
            encoder_attention_mask_ones = torch.ones(
                (batch_size, initial_encoder_hidden_states.shape[1] + cur_llama31_encoder_hidden_states.shape[1]),
                device=hidden_states_masks.device,
                dtype=hidden_states_masks.dtype,
            )
            hidden_states_masks = torch.cat([hidden_states_masks, encoder_attention_mask_ones], dim=1)

        for bid, block in enumerate(self.single_stream_blocks):
            cur_llama31_encoder_hidden_states = encoder_hidden_states[block_id]
            hidden_states = torch.cat([hidden_states, cur_llama31_encoder_hidden_states], dim=1)
            if torch.is_grad_enabled() and self.gradient_checkpointing:
                hidden_states = self._gradient_checkpointing_func(
                    block,
                    hidden_states,
                    hidden_states_masks,
                    None,
                    temb,
                    image_rotary_emb,
                )
            else:
                hidden_states = block(
                    hidden_states=hidden_states,
                    hidden_states_masks=hidden_states_masks,
                    encoder_hidden_states=None,
                    temb=temb,
                    image_rotary_emb=image_rotary_emb,
                )
            hidden_states = hidden_states[:, :hidden_states_seq_len]
            block_id += 1

        hidden_states = hidden_states[:, :image_tokens_seq_len, ...]

    output = self.final_layer(hidden_states, temb)
    output = self.unpatchify(output, img_sizes, self.training)
    if hidden_states_masks is not None:
        hidden_states_masks = hidden_states_masks[:, :image_tokens_seq_len]

    if USE_PEFT_BACKEND:
        # remove `lora_scale` from each PEFT layer
        unscale_lora_layers(self, lora_scale)

    if not return_dict:
        return (output,)
    return Transformer2DModelOutput(sample=output)


HiDreamImageTransformer2DModel.forward = teacache_forward

num_inference_steps = 50
seed = 42
prompt = 'A cat holding a sign that says "Hi-Dreams.ai".'

tokenizer_4 = PreTrainedTokenizerFast.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
text_encoder_4 = LlamaForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    output_hidden_states=True,
    output_attentions=True,
    torch_dtype=torch.bfloat16,
)

pipeline = HiDreamImagePipeline.from_pretrained(
    "HiDream-ai/HiDream-I1-Full",
    tokenizer_4=tokenizer_4,
    text_encoder_4=text_encoder_4,
    torch_dtype=torch.bfloat16,
)
# pipeline.enable_model_cpu_offload()  # save some VRAM by offloading the model to CPU. Remove this if you have enough GPU power

# TeaCache
pipeline.transformer.__class__.enable_teacache = True
pipeline.transformer.__class__.cnt = 0
pipeline.transformer.__class__.num_steps = num_inference_steps
pipeline.transformer.__class__.ret_steps = num_inference_steps * 0.1
pipeline.transformer.__class__.rel_l1_thresh = 0.3  # 0.17 for 1.5x speedup, 0.25 for 1.7x speedup, 0.3 for 2x speedup, 0.45 for 2.6x speedup
pipeline.transformer.__class__.coefficients = [-3.13605009e+04, -7.12425503e+02, 4.91363285e+01, 8.26515490e+00, 1.08053901e-01]

pipeline.to("cuda")
img = pipeline(
    prompt,
    guidance_scale=5.0,
    num_inference_steps=num_inference_steps,
    generator=torch.Generator("cuda").manual_seed(seed)
).images[0]
img.save("{}.png".format('TeaCache_' + prompt))
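To reproduce a quality/latency comparison like the table in the README above, one might sweep the threshold rather than hard-code 0.3. The snippet below is a hedged variation on the tail of the script, and assumes the `pipeline`, `prompt`, `seed` and `num_inference_steps` objects it defines; the thresholds are the values quoted in TeaCache4HiDream-I1/README.md.

```python
# Hypothetical sweep over the thresholds quoted in TeaCache4HiDream-I1/README.md.
import time
import torch

for thresh in [0.17, 0.25, 0.3, 0.45]:
    pipeline.transformer.__class__.rel_l1_thresh = thresh
    pipeline.transformer.__class__.cnt = 0                        # reset the per-run step counter
    pipeline.transformer.__class__.accumulated_rel_l1_distance = 0
    start = time.time()
    img = pipeline(
        prompt,
        guidance_scale=5.0,
        num_inference_steps=num_inference_steps,
        generator=torch.Generator("cuda").manual_seed(seed),
    ).images[0]
    img.save(f"TeaCache_thresh_{thresh}.png")
    print(f"rel_l1_thresh={thresh}: {time.time() - start:.1f} s")
```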
TeaCache4Lumina2/README.md (new file, 72 lines)
<!-- ## **TeaCache4LuminaT2X** -->
# TeaCache4Lumina2

[TeaCache](https://github.com/LiewFeng/TeaCache) can speed up [Lumina-Image-2.0](https://github.com/Alpha-VLLM/Lumina-Image-2.0) without much visual quality degradation, in a training-free manner. The images below show the experimental results of Lumina-Image-2.0 and TeaCache with two coefficient versions. For v1, the settings are 0 (original), 0.2 (1.25x speedup), 0.3 (1.5625x speedup), 0.4 (2.0833x speedup) and 0.5 (2.5x speedup). For v2, the settings are Lumina-Image-2.0 (~25 s), TeaCache (0.2) (~16.7 s, 1.5x speedup), TeaCache (0.3) (~15.6 s, 1.6x speedup), TeaCache (0.5) (~13.79 s, 1.8x speedup) and TeaCache (1.1) (~11.9 s, 2.1x speedup).

The v1 coefficients
`[393.76566581, -603.50993606, 209.10239044, -23.00726601, 0.86377344]`
exhibit poor quality at low L1 thresholds but perform better at higher settings, though with less speedup. The v2 coefficients
`[225.7042019806413, -608.8453716535591, 304.1869942338369, 124.21267720116742, -1.4089066892956552]`,
however, offer faster computation and better quality at low L1 thresholds but incur significant feature loss at high values.

You can change the value in line 72 to switch versions.

## v1
<p align="center">
  <img src="https://github.com/user-attachments/assets/d2c87b99-e4ac-4407-809a-caf9750f41ef" width="150" style="margin: 5px;">
  <img src="https://github.com/user-attachments/assets/411ff763-9c31-438d-8a9b-3ec5c88f6c27" width="150" style="margin: 5px;">
  <img src="https://github.com/user-attachments/assets/e57dfb60-a07f-4e17-837e-e46a69d8b9c0" width="150" style="margin: 5px;">
  <img src="https://github.com/user-attachments/assets/6e3184fe-e31a-452c-a447-48d4b74fcc10" width="150" style="margin: 5px;">
  <img src="https://github.com/user-attachments/assets/d6a52c4c-bd22-45c0-9f40-00a2daa85fc8" width="150" style="margin: 5px;">
</p>

## v2
<p align="center">
  <img src="https://github.com/user-attachments/assets/aea9907b-830e-497b-b968-aaeef463c7ef" width="150" style="margin: 5px;">
  <img src="https://github.com/user-attachments/assets/0e258295-eaaa-49ce-b16f-bba7f7ada6c1" width="150" style="margin: 5px;">
  <img src="https://github.com/user-attachments/assets/44600f22-3fd4-4bc4-ab00-29b0ed023d6d" width="150" style="margin: 5px;">
  <img src="https://github.com/user-attachments/assets/bcb926ab-95fd-4c83-8b46-f72581a3359e" width="150" style="margin: 5px;">
  <img src="https://github.com/user-attachments/assets/ec8db28e-0f9b-4d56-9096-fdc8b3c20f4b" width="150" style="margin: 5px;">
</p>

## 📈 Inference Latency Comparisons on a single 4090 (50 steps)
## v1
| Lumina-Image-2.0 | TeaCache (0.2) | TeaCache (0.3) | TeaCache (0.4) | TeaCache (0.5) |
|:----------------:|:--------------:|:--------------:|:--------------:|:--------------:|
| ~25 s | ~20 s | ~16 s | ~12 s | ~10 s |

## v2
| Lumina-Image-2.0 | TeaCache (0.2) | TeaCache (0.3) | TeaCache (0.5) | TeaCache (1.1) |
|:----------------:|:--------------:|:--------------:|:--------------:|:--------------:|
| ~25 s | ~16.7 s | ~15.6 s | ~13.79 s | ~11.9 s |

## Installation

```shell
pip install --upgrade diffusers[torch] transformers protobuf tokenizers sentencepiece
pip install flash-attn --no-build-isolation
```

## Usage

You can modify the threshold (`rel_l1_thresh`) in line 154 to obtain your desired trade-off between latency and visual quality. For single-gpu inference, you can use the following command:

```bash
python teacache_lumina2.py
```

## Citation

If you find TeaCache useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.

```
@article{liu2024timestep,
  title={Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model},
  author={Liu, Feng and Zhang, Shiwei and Wang, Xiaofeng and Wei, Yujie and Qiu, Haonan and Zhao, Yuzhong and Zhang, Yingya and Ye, Qixiang and Wan, Fang},
  journal={arXiv preprint arXiv:2411.19108},
  year={2024}
}
```

## Acknowledgements

We would like to thank the contributors to [Lumina-Image-2.0](https://github.com/Alpha-VLLM/Lumina-Image-2.0) and [Diffusers](https://github.com/huggingface/diffusers).
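The v1/v2 trade-off described above comes down to which coefficient list is handed to `np.poly1d` inside `teacache_lumina2.py` (the value in line 72 mentioned in the README). The sketch below only compares how the two fits rescale the same raw relative-L1 change; the sample input values are made up for illustration.

```python
import numpy as np

# Both coefficient sets appear in TeaCache4Lumina2: v1 is the script's default,
# v2 is the alternative listed in the comment next to it.
v1 = [393.76566581, -603.50993606, 209.10239044, -23.00726601, 0.86377344]
v2 = [225.7042019806413, -608.8453716535591, 304.1869942338369, 124.21267720116742, -1.4089066892956552]

rescale_v1 = np.poly1d(v1)
rescale_v2 = np.poly1d(v2)

for rel_l1_change in [0.01, 0.05, 0.1]:  # illustrative raw values
    print(f"raw={rel_l1_change}  v1={rescale_v1(rel_l1_change):.4f}  v2={rescale_v2(rel_l1_change):.4f}")
```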
TeaCache4Lumina2/teacache_lumina2.py (new file, 183 lines)
import torch
import torch.nn as nn
import numpy as np
from typing import Any, Dict, Optional, Tuple, Union, List

from diffusers import Lumina2Transformer2DModel, Lumina2Pipeline
from diffusers.models.modeling_outputs import Transformer2DModelOutput
from diffusers.utils import USE_PEFT_BACKEND, logging, scale_lora_layers, unscale_lora_layers

logger = logging.get_logger(__name__)  # pylint: disable=invalid-name


def teacache_forward_working(
    self,
    hidden_states: torch.Tensor,
    timestep: torch.Tensor,
    encoder_hidden_states: torch.Tensor,
    encoder_attention_mask: torch.Tensor,
    attention_kwargs: Optional[Dict[str, Any]] = None,
    return_dict: bool = True,
) -> Union[torch.Tensor, Transformer2DModelOutput]:
    if attention_kwargs is not None:
        attention_kwargs = attention_kwargs.copy()
        lora_scale = attention_kwargs.pop("scale", 1.0)
    else:
        lora_scale = 1.0
    if USE_PEFT_BACKEND:
        scale_lora_layers(self, lora_scale)

    batch_size, _, height, width = hidden_states.shape
    temb, encoder_hidden_states_processed = self.time_caption_embed(hidden_states, timestep, encoder_hidden_states)
    (image_patch_embeddings, context_rotary_emb, noise_rotary_emb, joint_rotary_emb,
     encoder_seq_lengths, seq_lengths) = self.rope_embedder(hidden_states, encoder_attention_mask)
    image_patch_embeddings = self.x_embedder(image_patch_embeddings)
    for layer in self.context_refiner:
        encoder_hidden_states_processed = layer(encoder_hidden_states_processed, encoder_attention_mask, context_rotary_emb)
    for layer in self.noise_refiner:
        image_patch_embeddings = layer(image_patch_embeddings, None, noise_rotary_emb, temb)

    max_seq_len = max(seq_lengths)
    input_to_main_loop = image_patch_embeddings.new_zeros(batch_size, max_seq_len, self.config.hidden_size)
    for i, (enc_len, seq_len_val) in enumerate(zip(encoder_seq_lengths, seq_lengths)):
        input_to_main_loop[i, :enc_len] = encoder_hidden_states_processed[i, :enc_len]
        input_to_main_loop[i, enc_len:seq_len_val] = image_patch_embeddings[i]

    use_mask = len(set(seq_lengths)) > 1
    attention_mask_for_main_loop_arg = None
    if use_mask:
        mask = input_to_main_loop.new_zeros(batch_size, max_seq_len, dtype=torch.bool)
        for i, (enc_len, seq_len_val) in enumerate(zip(encoder_seq_lengths, seq_lengths)):
            mask[i, :seq_len_val] = True
        attention_mask_for_main_loop_arg = mask

    should_calc = True
    if self.enable_teacache:
        # Cache entries are keyed by sequence length, so e.g. conditional and
        # unconditional prompts keep separate TeaCache state.
        cache_key = max_seq_len
        if cache_key not in self.cache:
            self.cache[cache_key] = {
                "accumulated_rel_l1_distance": 0.0,
                "previous_modulated_input": None,
                "previous_residual": None,
            }

        current_cache = self.cache[cache_key]
        modulated_inp, _, _, _ = self.layers[0].norm1(input_to_main_loop, temb)

        if self.cnt == 0 or self.cnt == self.num_steps - 1:
            should_calc = True
            current_cache["accumulated_rel_l1_distance"] = 0.0
        else:
            if current_cache["previous_modulated_input"] is not None:
                # v1 coefficients; switch to [225.7042019806413, -608.8453716535591, 304.1869942338369, 124.21267720116742, -1.4089066892956552] for v2
                coefficients = [393.76566581, -603.50993606, 209.10239044, -23.00726601, 0.86377344]
                rescale_func = np.poly1d(coefficients)
                prev_mod_input = current_cache["previous_modulated_input"]
                prev_mean = prev_mod_input.abs().mean()

                if prev_mean.item() > 1e-9:
                    rel_l1_change = ((modulated_inp - prev_mod_input).abs().mean() / prev_mean).cpu().item()
                else:
                    rel_l1_change = 0.0 if modulated_inp.abs().mean().item() < 1e-9 else float('inf')

                current_cache["accumulated_rel_l1_distance"] += rescale_func(rel_l1_change)

                if current_cache["accumulated_rel_l1_distance"] < self.rel_l1_thresh:
                    should_calc = False
                else:
                    should_calc = True
                    current_cache["accumulated_rel_l1_distance"] = 0.0
            else:
                should_calc = True
                current_cache["accumulated_rel_l1_distance"] = 0.0

        current_cache["previous_modulated_input"] = modulated_inp.clone()

        if self.uncond_seq_len is None:
            self.uncond_seq_len = cache_key
        if cache_key != self.uncond_seq_len:
            self.cnt += 1
            if self.cnt >= self.num_steps:
                self.cnt = 0

    if self.enable_teacache and not should_calc:
        if max_seq_len in self.cache and "previous_residual" in self.cache[max_seq_len] and self.cache[max_seq_len]["previous_residual"] is not None:
            processed_hidden_states = input_to_main_loop + self.cache[max_seq_len]["previous_residual"]
        else:
            should_calc = True
            current_processing_states = input_to_main_loop
            for layer in self.layers:
                current_processing_states = layer(current_processing_states, attention_mask_for_main_loop_arg, joint_rotary_emb, temb)
            processed_hidden_states = current_processing_states

    if not (self.enable_teacache and not should_calc):
        current_processing_states = input_to_main_loop
        for layer in self.layers:
            current_processing_states = layer(current_processing_states, attention_mask_for_main_loop_arg, joint_rotary_emb, temb)

        if self.enable_teacache:
            if max_seq_len in self.cache:
                self.cache[max_seq_len]["previous_residual"] = current_processing_states - input_to_main_loop
            else:
                logger.warning(f"TeaCache: Cache key {max_seq_len} not found when trying to save residual.")

        processed_hidden_states = current_processing_states

    output_after_norm = self.norm_out(processed_hidden_states, temb)
    p = self.config.patch_size
    final_output_list = []
    for i, (enc_len, seq_len_val) in enumerate(zip(encoder_seq_lengths, seq_lengths)):
        image_part = output_after_norm[i][enc_len:seq_len_val]
        h_p, w_p = height // p, width // p
        reconstructed_image = image_part.view(h_p, w_p, p, p, self.out_channels) \
            .permute(4, 0, 2, 1, 3) \
            .flatten(3, 4) \
            .flatten(1, 2)
        final_output_list.append(reconstructed_image)

    final_output_tensor = torch.stack(final_output_list, dim=0)

    if USE_PEFT_BACKEND:
        unscale_lora_layers(self, lora_scale)

    if not return_dict:
        return (final_output_tensor,)

    return Transformer2DModelOutput(sample=final_output_tensor)


Lumina2Transformer2DModel.forward = teacache_forward_working

ckpt_path = "NietaAniLumina_Alpha_full_round5_ep5_s182000.pth"
transformer = Lumina2Transformer2DModel.from_single_file(
    ckpt_path, torch_dtype=torch.bfloat16
)
pipeline = Lumina2Pipeline.from_pretrained(
    "Alpha-VLLM/Lumina-Image-2.0",
    transformer=transformer,
    torch_dtype=torch.bfloat16
).to("cuda")

num_inference_steps = 30
seed = 1024
prompt = "a cat holding a sign that says hello"
output_filename = f"teacache_lumina2_output.png"

# TeaCache
pipeline.transformer.__class__.enable_teacache = True
pipeline.transformer.__class__.cnt = 0
pipeline.transformer.__class__.num_steps = num_inference_steps
pipeline.transformer.__class__.rel_l1_thresh = 0.3
pipeline.transformer.__class__.cache = {}
pipeline.transformer.__class__.uncond_seq_len = None


pipeline.enable_model_cpu_offload()
image = pipeline(
    prompt=prompt,
    num_inference_steps=num_inference_steps,
    generator=torch.Generator("cuda").manual_seed(seed)
).images[0]

image.save(output_filename)
print(f"Image saved to {output_filename}")
assets/TeaCache4HiDream-I1.png (new binary file, 4.7 MiB; not shown)