Compare commits

...

35 Commits

| Author | SHA1 | Message | Date |
| :--- | :--- | :--- | :--- |
| Feng Liu | 7c10efc470 | Update README.md | 2025-06-08 22:29:09 +08:00 |
| Feng Liu | 2f5a990ee8 | Update README.md | 2025-06-08 22:27:39 +08:00 |
| Feng Liu | c730b01e42 | Merge pull request #76 from spawner1145/main (Update Coefficients for lumina2) | 2025-06-08 22:23:17 +08:00 |
| spawner | 78d2f837d5 | Update README.md | 2025-06-08 20:40:57 +08:00 |
| spawner | ff6a083896 | Update and rename teacache_lumina2_v1.py to teacache_lumina2.py | 2025-06-08 20:39:07 +08:00 |
| spawner | 0a9b0358ca | Delete TeaCache4Lumina2/teacache_lumina2_v2.py | 2025-06-08 20:29:47 +08:00 |
| spawner | 6a470cfade | Create teacache_lumina2_v1.py | 2025-06-08 17:49:04 +08:00 |
| spawner | 5670dc8e99 | Rename teacache_lumina2.py to teacache_lumina2_v2.py | 2025-06-08 17:47:55 +08:00 |
| spawner | f7d676521a | Update README.md | 2025-06-08 17:47:30 +08:00 |
| spawner | c9e2d6454c | Update README.md | 2025-06-08 15:45:52 +08:00 |
| spawner | 845823eed4 | Update README.md | 2025-06-08 15:14:21 +08:00 |
| spawner | 4588c2d970 | Update teacache_lumina2.py | 2025-06-07 16:47:53 +08:00 |
| Feng Liu | 6a9d6e0c84 | Merge pull request #75 from spawner1145/main (Optimize redundant code for lumina2) | 2025-06-04 23:43:36 +08:00 |
| spawner | e945259c7d | Optimize redundant code | 2025-06-04 21:15:31 +08:00 |
| spawner | ca1c215ee7 | Merge pull request #1 from ali-vilab/main (1) | 2025-05-26 07:54:57 +08:00 |
| Feng Liu | 3dd7c3ffa2 | Update README.md | 2025-05-26 00:08:30 +08:00 |
| Feng Liu | 9caba2ff26 | Merge pull request #70 from spawner1145/main (support for lumina2) | 2025-05-25 23:59:36 +08:00 |
| spawner | f6325a5bb3 | Update README.md | 2025-05-25 21:36:41 +08:00 |
| spawner | 1c96035d27 | Update README.md | 2025-05-25 17:50:47 +08:00 |
| spawner | e1f6b3ea77 | Update README.md | 2025-05-25 17:50:07 +08:00 |
| spawner | 2a85f3abe1 | Update README.md | 2025-05-25 17:44:42 +08:00 |
| spawner | 6b36ef8168 | Create README.md | 2025-05-25 17:41:47 +08:00 |
| Feng Liu | fca6462a17 | Merge pull request #69 from jiahy0825/rename-sample-file-name (fix typo in filename) | 2025-05-25 17:15:06 +08:00 |
| Feng Liu | efbeb585ba | Update README.md | 2025-05-25 17:00:18 +08:00 |
| Feng Liu | 8870cf27de | Merge pull request #71 from YunjieYu/TeaCache4HiDream-I1 (Add support for HiDream-I1) | 2025-05-25 16:41:01 +08:00 |
| YunjieYu | d680b3a2df | Add support for HiDream-I1 | 2025-05-24 23:19:18 +08:00 |
| spawner1145 | a312550104 | support for lumina2 | 2025-05-23 13:46:59 +08:00 |
| HongyuJia | 7c0aad1585 | Rename sample filename | 2025-05-22 16:05:35 +08:00 |
| Feng Liu | 73d9573763 | Update README.md | 2025-04-18 15:17:26 +08:00 |
| Feng Liu | 129a05d9c6 | Update README.md | 2025-04-14 11:29:32 +08:00 |
| Feng Liu | 36b6ed12c9 | Merge pull request #59 from zishen-ucap/feature-update (Update coefficients of CogVideoX1.5) | 2025-04-14 11:27:42 +08:00 |
| zishen-ucap | 0870af8a1d | Modified the coefficients of CogVideoX1.5 | 2025-04-14 10:58:46 +08:00 |
| Feng Liu | 109add7c79 | Update README.md | 2025-04-08 11:05:20 +08:00 |
| Feng Liu | 2af6e6dc99 | Update README.md | 2025-04-08 10:48:56 +08:00 |
| Feng Liu | ac4302b15d | Update README.md | 2025-04-08 10:45:45 +08:00 |

8 changed files with 636 additions and 28 deletions

View File

@ -1,4 +1,4 @@
# [CVPR 2025] Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model # [CVPR 2025 Highlight] Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model
<div class="is-size-5 publication-authors", align="center",> <div class="is-size-5 publication-authors", align="center",>
<span class="author-block"> <span class="author-block">
@ -64,9 +64,14 @@ We introduce Timestep Embedding Aware Cache (TeaCache), a training-free caching
## 🔥 Latest News ## 🔥 Latest News
- **If you like our project, please give us a star ⭐ on GitHub for the latest update.** - **If you like our project, please give us a star ⭐ on GitHub for the latest update.**
- [2025/06/08] 🔥 Update coefficients of [Lumina-Image-2.0](https://github.com/Alpha-VLLM/Lumina-Image-2.0). Thanks [@spawner1145](https://github.com/spawner1145).
- [2025/05/26] 🔥 Support [Lumina-Image-2.0](https://github.com/Alpha-VLLM/Lumina-Image-2.0). Thanks [@spawner1145](https://github.com/spawner1145).
- [2025/05/25] 🔥 Support [HiDream-I1](https://github.com/HiDream-ai/HiDream-I1). Thanks [@YunjieYu](https://github.com/YunjieYu).
- [2025/04/14] 🔥 Update coefficients of [CogVideoX1.5](https://github.com/THUDM/CogVideo). Thanks [@zishen-ucap](https://github.com/zishen-ucap).
- [2025/04/05] 🎉 Recommended as a **highlight** in CVPR 2025, top 16.8% in accepted papers and top 3.7% in all papers.
- [2025/03/13] 🔥 Optimized TeaCache for [Wan2.1](https://github.com/Wan-Video/Wan2.1). Thanks [@zishen-ucap](https://github.com/zishen-ucap). - [2025/03/13] 🔥 Optimized TeaCache for [Wan2.1](https://github.com/Wan-Video/Wan2.1). Thanks [@zishen-ucap](https://github.com/zishen-ucap).
- [2025/03/05] 🔥 Support [Wan2.1](https://github.com/Wan-Video/Wan2.1) for both T2V and I2V. - [2025/03/05] 🔥 Support [Wan2.1](https://github.com/Wan-Video/Wan2.1) for both T2V and I2V.
- [2025/02/27] 🎉 Accepted in CVPR 2025. - [2025/02/27] 🎉 Accepted in **CVPR 2025**.
- [2025/01/24] 🔥 Support [Cosmos](https://github.com/NVIDIA/Cosmos) for both T2V and I2V. Thanks [@zishen-ucap](https://github.com/zishen-ucap). - [2025/01/24] 🔥 Support [Cosmos](https://github.com/NVIDIA/Cosmos) for both T2V and I2V. Thanks [@zishen-ucap](https://github.com/zishen-ucap).
- [2025/01/20] 🔥 Support [CogVideoX1.5-5B](https://github.com/THUDM/CogVideo) for both T2V and I2V. Thanks [@zishen-ucap](https://github.com/zishen-ucap). - [2025/01/20] 🔥 Support [CogVideoX1.5-5B](https://github.com/THUDM/CogVideo) for both T2V and I2V. Thanks [@zishen-ucap](https://github.com/zishen-ucap).
- [2025/01/07] 🔥 Support [TangoFlux](https://github.com/declare-lab/TangoFlux). TeaCache works well for Audio Diffusion Models! - [2025/01/07] 🔥 Support [TangoFlux](https://github.com/declare-lab/TangoFlux). TeaCache works well for Audio Diffusion Models!
@ -82,17 +87,18 @@ We introduce Timestep Embedding Aware Cache (TeaCache), a training-free caching
If you develop/use TeaCache in your projects and you would like more people to see it, please inform us.(liufeng20@mails.ucas.ac.cn) If you develop/use TeaCache in your projects and you would like more people to see it, please inform us.(liufeng20@mails.ucas.ac.cn)
**Model** **Model**
- [FramePack](https://github.com/lllyasviel/FramePack) supports TeaCache. Thanks [@lllyasviel](https://github.com/lllyasviel).
- [FastVideo](https://github.com/hao-ai-lab/FastVideo) supports TeaCache. Thanks [@BrianChen1129](https://github.com/BrianChen1129) and [@jzhang38](https://github.com/jzhang38). - [FastVideo](https://github.com/hao-ai-lab/FastVideo) supports TeaCache. Thanks [@BrianChen1129](https://github.com/BrianChen1129) and [@jzhang38](https://github.com/jzhang38).
- [EasyAnimate](https://github.com/aigc-apps/EasyAnimate) supports TeaCache. Thanks [@hkunzhe](https://github.com/hkunzhe) and [@bubbliiiing](https://github.com/bubbliiiing). - [EasyAnimate](https://github.com/aigc-apps/EasyAnimate) supports TeaCache. Thanks [@hkunzhe](https://github.com/hkunzhe) and [@bubbliiiing](https://github.com/bubbliiiing).
- [Ruyi-Models](https://github.com/IamCreateAI/Ruyi-Models) supports TeaCache. Thanks [@cellzero](https://github.com/cellzero). - [Ruyi-Models](https://github.com/IamCreateAI/Ruyi-Models) supports TeaCache. Thanks [@cellzero](https://github.com/cellzero).
- [ConsisID](https://github.com/PKU-YuanGroup/ConsisID) supports TeaCache. Thanks [@SHYuanBest](https://github.com/SHYuanBest). - [ConsisID](https://github.com/PKU-YuanGroup/ConsisID) supports TeaCache. Thanks [@SHYuanBest](https://github.com/SHYuanBest).
**ComfyUI** **ComfyUI**
- [ComfyUI-TeaCache](https://github.com/welltop-cn/ComfyUI-TeaCache) for TeaCache. Thanks [@YunjieYu](https://github.com/YunjieYu).
- [ComfyUI-WanVideoWrapper](https://github.com/kijai/ComfyUI-WanVideoWrapper) supports TeaCache4Wan2.1. Thanks [@kijai](https://github.com/kijai). - [ComfyUI-WanVideoWrapper](https://github.com/kijai/ComfyUI-WanVideoWrapper) supports TeaCache4Wan2.1. Thanks [@kijai](https://github.com/kijai).
- [ComfyUI-TangoFlux](https://github.com/LucipherDev/ComfyUI-TangoFlux) supports TeaCache. Thanks [@LucipherDev](https://github.com/LucipherDev). - [ComfyUI-TangoFlux](https://github.com/LucipherDev/ComfyUI-TangoFlux) supports TeaCache. Thanks [@LucipherDev](https://github.com/LucipherDev).
- [ComfyUI_Patches_ll](https://github.com/lldacing/ComfyUI_Patches_ll) supports TeaCache. Thanks [@lldacing](https://github.com/lldacing). - [ComfyUI_Patches_ll](https://github.com/lldacing/ComfyUI_Patches_ll) supports TeaCache. Thanks [@lldacing](https://github.com/lldacing).
- [Comfyui_TTP_Toolset](https://github.com/TTPlanetPig/Comfyui_TTP_Toolset) supports TeaCache. Thanks [@TTPlanetPig](https://github.com/TTPlanetPig). - [Comfyui_TTP_Toolset](https://github.com/TTPlanetPig/Comfyui_TTP_Toolset) supports TeaCache. Thanks [@TTPlanetPig](https://github.com/TTPlanetPig).
- [ComfyUI-TeaCache](https://github.com/welltop-cn/ComfyUI-TeaCache) for TeaCache. Thanks [@YunjieYu](https://github.com/YunjieYu).
- [ComfyUI-TeaCacheHunyuanVideo](https://github.com/facok/ComfyUI-TeaCacheHunyuanVideo) for TeaCache4HunyuanVideo. Thanks [@facok](https://github.com/facok). - [ComfyUI-TeaCacheHunyuanVideo](https://github.com/facok/ComfyUI-TeaCacheHunyuanVideo) for TeaCache4HunyuanVideo. Thanks [@facok](https://github.com/facok).
- [ComfyUI-HunyuanVideoWrapper](https://github.com/kijai/ComfyUI-HunyuanVideoWrapper) supports TeaCache4HunyuanVideo. Thanks [@kijai](https://github.com/kijai), [ctf05](https://github.com/ctf05) and [DarioFT](https://github.com/DarioFT). - [ComfyUI-HunyuanVideoWrapper](https://github.com/kijai/ComfyUI-HunyuanVideoWrapper) supports TeaCache4HunyuanVideo. Thanks [@kijai](https://github.com/kijai), [ctf05](https://github.com/ctf05) and [DarioFT](https://github.com/DarioFT).
@ -101,13 +107,13 @@ If you develop/use TeaCache in your projects and you would like more people to s
- [Teacache-xDiT](https://github.com/MingXiangL/Teacache-xDiT) for multi-gpu inference. Thanks [@MingXiangL](https://github.com/MingXiangL). - [Teacache-xDiT](https://github.com/MingXiangL/Teacache-xDiT) for multi-gpu inference. Thanks [@MingXiangL](https://github.com/MingXiangL).
**Engine** **Engine**
- [SD.Next](https://github.com/vladmandic/sdnext) supports TeaCache. Thanks [@vladmandic](https://github.com/vladmandic).
- [DiffSynth Studio](https://github.com/modelscope/DiffSynth-Studio) supports TeaCache. Thanks [@Artiprocher](https://github.com/Artiprocher). - [DiffSynth Studio](https://github.com/modelscope/DiffSynth-Studio) supports TeaCache. Thanks [@Artiprocher](https://github.com/Artiprocher).
## 🎉 Supported Models ## 🎉 Supported Models
**Text to Video** **Text to Video**
- [TeaCache4Wan2.1](./TeaCache4Wan2.1/README.md) - [TeaCache4Wan2.1](./TeaCache4Wan2.1/README.md)
- [TeaCache4Cosmos](./eval/TeaCache4Cosmos/README.md) - [TeaCache4Cosmos](./eval/TeaCache4Cosmos/README.md)
- EasyAnimate, see [here](https://github.com/aigc-apps/EasyAnimate).
- [TeaCache4CogVideoX1.5](./TeaCache4CogVideoX1.5/README.md) - [TeaCache4CogVideoX1.5](./TeaCache4CogVideoX1.5/README.md)
- [TeaCache4LTX-Video](./TeaCache4LTX-Video/README.md) - [TeaCache4LTX-Video](./TeaCache4LTX-Video/README.md)
- [TeaCache4Mochi](./TeaCache4Mochi/README.md) - [TeaCache4Mochi](./TeaCache4Mochi/README.md)
@ -117,22 +123,15 @@ If you develop/use TeaCache in your projects and you would like more people to s
- [TeaCache4Open-Sora-Plan](./eval/teacache/README.md) - [TeaCache4Open-Sora-Plan](./eval/teacache/README.md)
- [TeaCache4Latte](./eval/teacache/README.md) - [TeaCache4Latte](./eval/teacache/README.md)
**Image to Video** **Image to Video**
- [TeaCache4Wan2.1](./TeaCache4Wan2.1/README.md) - [TeaCache4Wan2.1](./TeaCache4Wan2.1/README.md)
- [TeaCache4Cosmos](./eval/TeaCache4Cosmos/README.md) - [TeaCache4Cosmos](./eval/TeaCache4Cosmos/README.md)
- EasyAnimate, see [here](https://github.com/aigc-apps/EasyAnimate).
- Ruyi-Models. See [here](https://github.com/IamCreateAI/Ruyi-Models).
- [TeaCache4CogVideoX1.5](./TeaCache4CogVideoX1.5/README.md) - [TeaCache4CogVideoX1.5](./TeaCache4CogVideoX1.5/README.md)
- [TeaCache4ConsisID](./TeaCache4ConsisID/README.md) - [TeaCache4ConsisID](./TeaCache4ConsisID/README.md)
**Video to Video**
- EasyAnimate, see [here](https://github.com/aigc-apps/EasyAnimate).
**Text to Image** **Text to Image**
- [TeaCache4Lumina2](./TeaCache4Lumina2/README.md)
- [TeaCache4HiDream-I1](./TeaCache4HiDream-I1/README.md)
- [TeaCache4FLUX](./TeaCache4FLUX/README.md) - [TeaCache4FLUX](./TeaCache4FLUX/README.md)
- [TeaCache4Lumina-T2X](./TeaCache4Lumina-T2X/README.md) - [TeaCache4Lumina-T2X](./TeaCache4Lumina-T2X/README.md)
@ -146,12 +145,12 @@ If you develop/use TeaCache in your projects and you would like more people to s
## 💐 Acknowledgement ## 💐 Acknowledgement
This repository is built based on [VideoSys](https://github.com/NUS-HPC-AI-Lab/VideoSys), [Diffusers](https://github.com/huggingface/diffusers), [Open-Sora](https://github.com/hpcaitech/Open-Sora), [Open-Sora-Plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan), [Latte](https://github.com/Vchitect/Latte), [CogVideoX](https://github.com/THUDM/CogVideo), [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), [ConsisID](https://github.com/PKU-YuanGroup/ConsisID), [FLUX](https://github.com/black-forest-labs/flux), [Mochi](https://github.com/genmoai/mochi), [LTX-Video](https://github.com/Lightricks/LTX-Video), [Lumina-T2X](https://github.com/Alpha-VLLM/Lumina-T2X), [TangoFlux](https://github.com/declare-lab/TangoFlux), [Cosmos](https://github.com/NVIDIA/Cosmos) and [Wan2.1](https://github.com/Wan-Video/Wan2.1). Thanks for their contributions! This repository is built based on [VideoSys](https://github.com/NUS-HPC-AI-Lab/VideoSys), [Diffusers](https://github.com/huggingface/diffusers), [Open-Sora](https://github.com/hpcaitech/Open-Sora), [Open-Sora-Plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan), [Latte](https://github.com/Vchitect/Latte), [CogVideoX](https://github.com/THUDM/CogVideo), [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), [ConsisID](https://github.com/PKU-YuanGroup/ConsisID), [FLUX](https://github.com/black-forest-labs/flux), [Mochi](https://github.com/genmoai/mochi), [LTX-Video](https://github.com/Lightricks/LTX-Video), [Lumina-T2X](https://github.com/Alpha-VLLM/Lumina-T2X), [TangoFlux](https://github.com/declare-lab/TangoFlux), [Cosmos](https://github.com/NVIDIA/Cosmos), [Wan2.1](https://github.com/Wan-Video/Wan2.1), [HiDream-I1](https://github.com/HiDream-ai/HiDream-I1) and [Lumina-Image-2.0](https://github.com/Alpha-VLLM/Lumina-Image-2.0). Thanks for their contributions!
## 🔒 License ## 🔒 License
* The majority of this project is released under the Apache 2.0 license as found in the [LICENSE](./LICENSE) file. * The majority of this project is released under the Apache 2.0 license as found in the [LICENSE](./LICENSE) file.
* For [VideoSys](https://github.com/NUS-HPC-AI-Lab/VideoSys), [Diffusers](https://github.com/huggingface/diffusers), [Open-Sora](https://github.com/hpcaitech/Open-Sora), [Open-Sora-Plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan), [Latte](https://github.com/Vchitect/Latte), [CogVideoX](https://github.com/THUDM/CogVideo), [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), [ConsisID](https://github.com/PKU-YuanGroup/ConsisID), [FLUX](https://github.com/black-forest-labs/flux), [Mochi](https://github.com/genmoai/mochi), [LTX-Video](https://github.com/Lightricks/LTX-Video), [Lumina-T2X](https://github.com/Alpha-VLLM/Lumina-T2X), [TangoFlux](https://github.com/declare-lab/TangoFlux), [Cosmos](https://github.com/NVIDIA/Cosmos) and [Wan2.1](https://github.com/Wan-Video/Wan2.1), please follow their LICENSE. * For [VideoSys](https://github.com/NUS-HPC-AI-Lab/VideoSys), [Diffusers](https://github.com/huggingface/diffusers), [Open-Sora](https://github.com/hpcaitech/Open-Sora), [Open-Sora-Plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan), [Latte](https://github.com/Vchitect/Latte), [CogVideoX](https://github.com/THUDM/CogVideo), [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), [ConsisID](https://github.com/PKU-YuanGroup/ConsisID), [FLUX](https://github.com/black-forest-labs/flux), [Mochi](https://github.com/genmoai/mochi), [LTX-Video](https://github.com/Lightricks/LTX-Video), [Lumina-T2X](https://github.com/Alpha-VLLM/Lumina-T2X), [TangoFlux](https://github.com/declare-lab/TangoFlux), [Cosmos](https://github.com/NVIDIA/Cosmos), [Wan2.1](https://github.com/Wan-Video/Wan2.1), [HiDream-I1](https://github.com/HiDream-ai/HiDream-I1) and [Lumina-Image-2.0](https://github.com/Alpha-VLLM/Lumina-Image-2.0), please follow their LICENSE.
* The service is a research preview. Please contact us if you find any potential violations. (liufeng20@mails.ucas.ac.cn) * The service is a research preview. Please contact us if you find any potential violations. (liufeng20@mails.ucas.ac.cn)
## 📖 Citation ## 📖 Citation

View File

@ -3,19 +3,19 @@
[TeaCache](https://github.com/LiewFeng/TeaCache) can speedup [CogVideoX1.5](https://github.com/THUDM/CogVideo) 1.8x without much visual quality degradation, in a training-free manner. The following video shows the results generated by TeaCache-CogVideoX1.5 with various `rel_l1_thresh` values: 0 (original), 0.1 (1.3x speedup), 0.2 (1.8x speedup), and 0.3(2.1x speedup).Additionally, the image-to-video (i2v) results are also demonstrated, with the following speedups: 0.1 (1.5x speedup), 0.2 (2.2x speedup), and 0.3 (2.7x speedup). [TeaCache](https://github.com/LiewFeng/TeaCache) can speedup [CogVideoX1.5](https://github.com/THUDM/CogVideo) 1.8x without much visual quality degradation, in a training-free manner. The following video shows the results generated by TeaCache-CogVideoX1.5 with various `rel_l1_thresh` values: 0 (original), 0.1 (1.3x speedup), 0.2 (1.8x speedup), and 0.3(2.1x speedup).Additionally, the image-to-video (i2v) results are also demonstrated, with the following speedups: 0.1 (1.5x speedup), 0.2 (2.2x speedup), and 0.3 (2.7x speedup).
https://github.com/user-attachments/assets/c444b850-3252-4b37-ad4a-122d389218d9 https://github.com/user-attachments/assets/21261b03-71c6-47bf-9769-2a81c8dc452f
https://github.com/user-attachments/assets/5f181a57-d5e3-46db-b388-8591e50f98e2 https://github.com/user-attachments/assets/5e98e646-4034-4ae7-9680-a65ecd88dac9
## 📈 Inference Latency Comparisons on a Single H100 GPU ## 📈 Inference Latency Comparisons on a Single H100 GPU
| CogVideoX1.5-t2v | TeaCache (0.1) | TeaCache (0.2) | TeaCache (0.3) | | CogVideoX1.5-t2v | TeaCache (0.1) | TeaCache (0.2) | TeaCache (0.3) |
| :--------------: | :------------: | :------------: | :------------: | | :--------------: | :------------: | :------------: | :------------: |
| ~465 s | ~372 s | ~261 s | ~223 s | | ~465 s | ~322 s | ~260 s | ~204 s |
| CogVideoX1.5-i2v | TeaCache (0.1) | TeaCache (0.2) | TeaCache (0.3) | | CogVideoX1.5-i2v | TeaCache (0.1) | TeaCache (0.2) | TeaCache (0.3) |
| :--------------: | :------------: | :------------: | :------------: | | :--------------: | :------------: | :------------: | :------------: |
| ~475 s | ~323 s | ~218 s | ~171 s | | ~475 s | ~316 s | ~239 s | ~204 s |
## Installation ## Installation

View File

@ -6,6 +6,14 @@ from diffusers.models.modeling_outputs import Transformer2DModelOutput
from diffusers.utils import USE_PEFT_BACKEND, is_torch_version, scale_lora_layers, unscale_lora_layers, export_to_video, load_image from diffusers.utils import USE_PEFT_BACKEND, is_torch_version, scale_lora_layers, unscale_lora_layers, export_to_video, load_image
from diffusers import CogVideoXPipeline, CogVideoXImageToVideoPipeline from diffusers import CogVideoXPipeline, CogVideoXImageToVideoPipeline
coefficients_dict = {
"CogVideoX-2b":[-3.10658903e+01, 2.54732368e+01, -5.92380459e+00, 1.75769064e+00, -3.61568434e-03],
"CogVideoX-5b":[-1.53880483e+03, 8.43202495e+02, -1.34363087e+02, 7.97131516e+00, -5.23162339e-02],
"CogVideoX-5b-I2V":[-1.53880483e+03, 8.43202495e+02, -1.34363087e+02, 7.97131516e+00, -5.23162339e-02],
"CogVideoX1.5-5B":[ 2.50210439e+02, -1.65061612e+02, 3.57804877e+01, -7.81551492e-01, 3.58559703e-02],
"CogVideoX1.5-5B-I2V":[ 1.22842302e+02, -1.04088754e+02, 2.62981677e+01, -3.06009921e-01, 3.71213220e-02],
}
def teacache_forward( def teacache_forward(
self, self,
@ -64,13 +72,7 @@ def teacache_forward(
should_calc = True should_calc = True
self.accumulated_rel_l1_distance = 0 self.accumulated_rel_l1_distance = 0
else: else:
if not self.config.use_rotary_positional_embeddings: rescale_func = np.poly1d(self.coefficients)
# CogVideoX-2B
coefficients = [-3.10658903e+01, 2.54732368e+01, -5.92380459e+00, 1.75769064e+00, -3.61568434e-03]
else:
# CogVideoX-5B and CogvideoX1.5-5B
coefficients = [-1.53880483e+03, 8.43202495e+02, -1.34363087e+02, 7.97131516e+00, -5.23162339e-02]
rescale_func = np.poly1d(coefficients)
self.accumulated_rel_l1_distance += rescale_func(((emb-self.previous_modulated_input).abs().mean() / self.previous_modulated_input.abs().mean()).cpu().item()) self.accumulated_rel_l1_distance += rescale_func(((emb-self.previous_modulated_input).abs().mean() / self.previous_modulated_input.abs().mean()).cpu().item())
if self.accumulated_rel_l1_distance < self.rel_l1_thresh: if self.accumulated_rel_l1_distance < self.rel_l1_thresh:
should_calc = False should_calc = False
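This hunk is the heart of TeaCache: the relative L1 change of the timestep-modulated input is rescaled by a fitted polynomial and accumulated across steps, and the transformer blocks are only re-run once the accumulated value reaches `rel_l1_thresh`; otherwise the cached residual is reused. A self-contained sketch of that rule (dummy tensors; the CogVideoX-5b coefficients are copied from the dictionary added above):

```python
import numpy as np
import torch

# Illustrative sketch of the caching rule in the hunk above (not the PR's exact code path).
coefficients = [-1.53880483e+03, 8.43202495e+02, -1.34363087e+02, 7.97131516e+00, -5.23162339e-02]  # CogVideoX-5b
rescale_func = np.poly1d(coefficients)
rel_l1_thresh = 0.2
accumulated_rel_l1_distance = 0.0

previous_modulated_input = torch.randn(2, 512)                 # dummy timestep embedding from the previous step
emb = previous_modulated_input + 0.01 * torch.randn(2, 512)    # dummy embedding for the current step

rel_l1 = ((emb - previous_modulated_input).abs().mean() / previous_modulated_input.abs().mean()).item()
accumulated_rel_l1_distance += rescale_func(rel_l1)

if accumulated_rel_l1_distance < rel_l1_thresh:
    should_calc = False   # reuse the cached residual instead of running the transformer blocks
else:
    should_calc = True    # recompute and reset the accumulator
    accumulated_rel_l1_distance = 0.0
```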
@ -196,6 +198,7 @@ def main(args):
guidance_scale = args.guidance_scale guidance_scale = args.guidance_scale
fps = args.fps fps = args.fps
image_path = args.image_path image_path = args.image_path
mode = ckpts_path.split("/")[-1]
if generate_type == "t2v": if generate_type == "t2v":
pipe = CogVideoXPipeline.from_pretrained(ckpts_path, torch_dtype=torch.bfloat16) pipe = CogVideoXPipeline.from_pretrained(ckpts_path, torch_dtype=torch.bfloat16)
@ -212,6 +215,7 @@ def main(args):
pipe.transformer.__class__.previous_residual_encoder = None pipe.transformer.__class__.previous_residual_encoder = None
pipe.transformer.__class__.num_steps = num_inference_steps pipe.transformer.__class__.num_steps = num_inference_steps
pipe.transformer.__class__.cnt = 0 pipe.transformer.__class__.cnt = 0
pipe.transformer.__class__.coefficients = coefficients_dict[mode]
pipe.transformer.__class__.forward = teacache_forward pipe.transformer.__class__.forward = teacache_forward
pipe.to("cuda") pipe.to("cuda")
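Taken together, these hunks replace the hardcoded `use_rotary_positional_embeddings` branch with a per-model lookup: the last component of the checkpoint path selects the coefficient set, which is then attached to the transformer class. A small sketch of that selection, assuming a checkpoint path that ends with the model name (the path below is an assumed example, not taken from the PR):

```python
import numpy as np

# Sketch of the per-model coefficient lookup introduced above.
coefficients_dict = {
    "CogVideoX1.5-5B": [2.50210439e+02, -1.65061612e+02, 3.57804877e+01, -7.81551492e-01, 3.58559703e-02],
    "CogVideoX1.5-5B-I2V": [1.22842302e+02, -1.04088754e+02, 2.62981677e+01, -3.06009921e-01, 3.71213220e-02],
}

ckpts_path = "checkpoints/CogVideoX1.5-5B"      # assumed example path
mode = ckpts_path.split("/")[-1]                # -> "CogVideoX1.5-5B"
rescale_func = np.poly1d(coefficients_dict[mode])
print(rescale_func(0.05))                       # polynomial-rescaled relative L1 contribution
```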
@ -243,7 +247,7 @@ def main(args):
generator=torch.Generator("cuda").manual_seed(seed), # Set the seed for reproducibility generator=torch.Generator("cuda").manual_seed(seed), # Set the seed for reproducibility
).frames[0] ).frames[0]
words = prompt.split()[:5] words = prompt.split()[:5]
video_path = f"{output_path}/teacache_cogvideox1.5-5B_{words}.mp4" video_path = f"{output_path}/teacache_cogvideox1.5-5B_{words}_{rel_l1_thresh}.mp4"
export_to_video(video, video_path, fps=fps) export_to_video(video, video_path, fps=fps)
@ -263,7 +267,7 @@ if __name__ == "__main__":
parser.add_argument("--height", type=int, default=768, help="Number of steps for the inference process") parser.add_argument("--height", type=int, default=768, help="Number of steps for the inference process")
parser.add_argument("--num_frames", type=int, default=81, help="Number of steps for the inference process") parser.add_argument("--num_frames", type=int, default=81, help="Number of steps for the inference process")
parser.add_argument("--guidance_scale", type=float, default=6.0, help="The scale for classifier-free guidance") parser.add_argument("--guidance_scale", type=float, default=6.0, help="The scale for classifier-free guidance")
parser.add_argument("--fps", type=int, default=16, help="Number of steps for the inference process") parser.add_argument("--fps", type=int, default=16, help="Frame rate of video")
args = parser.parse_args() args = parser.parse_args()
main(args) main(args)

View File

@ -0,0 +1,43 @@
<!-- ## **TeaCache4HiDream-I1** -->
# TeaCache4HiDream-I1
[TeaCache](https://github.com/LiewFeng/TeaCache) can speed up [HiDream-I1](https://github.com/HiDream-ai/HiDream-I1) 2x without much visual quality degradation, in a training-free manner. The following image shows the results generated by TeaCache-HiDream-I1-Full with various `rel_l1_thresh` values: 0 (original), 0.17 (1.5x speedup), 0.25 (1.7x speedup), 0.3 (2.0x speedup), and 0.45 (2.6x speedup).
![visualization](../assets/TeaCache4HiDream-I1.png)
## 📈 Inference Latency Comparisons on a Single A100
| HiDream-I1-Full | TeaCache (0.17) | TeaCache (0.25) | TeaCache (0.3) | TeaCache (0.45) |
|:-----------------------:|:----------------------------:|:--------------------:|:---------------------:|:--------------------:|
| ~50 s | ~34 s | ~29 s | ~25 s | ~19 s |
## Installation
```shell
pip install git+https://github.com/huggingface/diffusers
pip install --upgrade transformers protobuf tiktoken tokenizers sentencepiece
```
## Usage
You can modify the `rel_l1_thresh` in line 297 to obtain your desired trade-off between latency and visual quality; the sketch after the command below shows the attributes involved. For single-gpu inference, you can use the following command:
```bash
python teacache_hidream_i1.py
```
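The script takes no command-line flags in this PR; everything, including the threshold, is patched onto the transformer class. A minimal sketch of those knobs, mirroring `teacache_hidream_i1.py` further down in this diff (a dummy pipeline object stands in for the real `HiDreamImagePipeline` so the snippet is self-contained):

```python
# Sketch only: the TeaCache knobs set on the transformer class in this PR.
class _Transformer: ...
class _Pipeline: transformer = _Transformer()

pipeline = _Pipeline()                # in practice: HiDreamImagePipeline.from_pretrained(...)
num_inference_steps = 50

pipeline.transformer.__class__.enable_teacache = True
pipeline.transformer.__class__.cnt = 0
pipeline.transformer.__class__.num_steps = num_inference_steps
pipeline.transformer.__class__.ret_steps = num_inference_steps * 0.1  # first 10% of steps always recompute
pipeline.transformer.__class__.rel_l1_thresh = 0.3                    # 0.17/0.25/0.3/0.45 -> ~1.5x/1.7x/2.0x/2.6x speedup
```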
## Citation
If you find TeaCache useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.
```
@article{liu2024timestep,
title={Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model},
author={Liu, Feng and Zhang, Shiwei and Wang, Xiaofeng and Wei, Yujie and Qiu, Haonan and Zhao, Yuzhong and Zhang, Yingya and Ye, Qixiang and Wan, Fang},
journal={arXiv preprint arXiv:2411.19108},
year={2024}
}
```
## Acknowledgements
We would like to thank the contributors to [HiDream-I1](https://github.com/HiDream-ai/HiDream-I1) and [Diffusers](https://github.com/huggingface/diffusers).

View File

@ -0,0 +1,307 @@
from typing import Any, Dict, List, Optional, Tuple
from transformers import PreTrainedTokenizerFast, LlamaForCausalLM
from diffusers import HiDreamImagePipeline
from diffusers.models import HiDreamImageTransformer2DModel
from diffusers.models.modeling_outputs import Transformer2DModelOutput
from diffusers.utils import logging, deprecate, USE_PEFT_BACKEND, scale_lora_layers, unscale_lora_layers
import torch
import numpy as np
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
def teacache_forward(
self,
hidden_states: torch.Tensor,
timesteps: torch.LongTensor = None,
encoder_hidden_states_t5: torch.Tensor = None,
encoder_hidden_states_llama3: torch.Tensor = None,
pooled_embeds: torch.Tensor = None,
img_ids: Optional[torch.Tensor] = None,
img_sizes: Optional[List[Tuple[int, int]]] = None,
hidden_states_masks: Optional[torch.Tensor] = None,
attention_kwargs: Optional[Dict[str, Any]] = None,
return_dict: bool = True,
**kwargs,
):
encoder_hidden_states = kwargs.get("encoder_hidden_states", None)
if encoder_hidden_states is not None:
deprecation_message = "The `encoder_hidden_states` argument is deprecated. Please use `encoder_hidden_states_t5` and `encoder_hidden_states_llama3` instead."
deprecate("encoder_hidden_states", "0.35.0", deprecation_message)
encoder_hidden_states_t5 = encoder_hidden_states[0]
encoder_hidden_states_llama3 = encoder_hidden_states[1]
if img_ids is not None and img_sizes is not None and hidden_states_masks is None:
deprecation_message = (
"Passing `img_ids` and `img_sizes` with unpachified `hidden_states` is deprecated and will be ignored."
)
deprecate("img_ids", "0.35.0", deprecation_message)
if hidden_states_masks is not None and (img_ids is None or img_sizes is None):
raise ValueError("if `hidden_states_masks` is passed, `img_ids` and `img_sizes` must also be passed.")
elif hidden_states_masks is not None and hidden_states.ndim != 3:
raise ValueError(
"if `hidden_states_masks` is passed, `hidden_states` must be a 3D tensors with shape (batch_size, patch_height * patch_width, patch_size * patch_size * channels)"
)
if attention_kwargs is not None:
attention_kwargs = attention_kwargs.copy()
lora_scale = attention_kwargs.pop("scale", 1.0)
else:
lora_scale = 1.0
if USE_PEFT_BACKEND:
# weight the lora layers by setting `lora_scale` for each PEFT layer
scale_lora_layers(self, lora_scale)
else:
if attention_kwargs is not None and attention_kwargs.get("scale", None) is not None:
logger.warning(
"Passing `scale` via `attention_kwargs` when not using the PEFT backend is ineffective."
)
# spatial forward
batch_size = hidden_states.shape[0]
hidden_states_type = hidden_states.dtype
# Patchify the input
if hidden_states_masks is None:
hidden_states, hidden_states_masks, img_sizes, img_ids = self.patchify(hidden_states)
# Embed the hidden states
hidden_states = self.x_embedder(hidden_states)
# 0. time
timesteps = self.t_embedder(timesteps, hidden_states_type)
p_embedder = self.p_embedder(pooled_embeds)
temb = timesteps + p_embedder
encoder_hidden_states = [encoder_hidden_states_llama3[k] for k in self.config.llama_layers]
if self.caption_projection is not None:
new_encoder_hidden_states = []
for i, enc_hidden_state in enumerate(encoder_hidden_states):
enc_hidden_state = self.caption_projection[i](enc_hidden_state)
enc_hidden_state = enc_hidden_state.view(batch_size, -1, hidden_states.shape[-1])
new_encoder_hidden_states.append(enc_hidden_state)
encoder_hidden_states = new_encoder_hidden_states
encoder_hidden_states_t5 = self.caption_projection[-1](encoder_hidden_states_t5)
encoder_hidden_states_t5 = encoder_hidden_states_t5.view(batch_size, -1, hidden_states.shape[-1])
encoder_hidden_states.append(encoder_hidden_states_t5)
txt_ids = torch.zeros(
batch_size,
encoder_hidden_states[-1].shape[1]
+ encoder_hidden_states[-2].shape[1]
+ encoder_hidden_states[0].shape[1],
3,
device=img_ids.device,
dtype=img_ids.dtype,
)
ids = torch.cat((img_ids, txt_ids), dim=1)
image_rotary_emb = self.pe_embedder(ids)
# 2. Blocks
block_id = 0
initial_encoder_hidden_states = torch.cat([encoder_hidden_states[-1], encoder_hidden_states[-2]], dim=1)
initial_encoder_hidden_states_seq_len = initial_encoder_hidden_states.shape[1]
if self.enable_teacache:
modulated_inp = timesteps.clone()
if self.cnt < self.ret_steps:
should_calc = True
self.accumulated_rel_l1_distance = 0
else:
rescale_func = np.poly1d(self.coefficients)
self.accumulated_rel_l1_distance += rescale_func(((modulated_inp-self.previous_modulated_input).abs().mean() / self.previous_modulated_input.abs().mean()).cpu().item())
if self.accumulated_rel_l1_distance < self.rel_l1_thresh:
should_calc = False
else:
should_calc = True
self.accumulated_rel_l1_distance = 0
self.previous_modulated_input = modulated_inp
self.cnt += 1
if self.cnt == self.num_steps:
self.cnt = 0
if self.enable_teacache:
if not should_calc:
hidden_states += self.previous_residual
else:
# 2. Blocks
ori_hidden_states = hidden_states.clone()
for bid, block in enumerate(self.double_stream_blocks):
cur_llama31_encoder_hidden_states = encoder_hidden_states[block_id]
cur_encoder_hidden_states = torch.cat(
[initial_encoder_hidden_states, cur_llama31_encoder_hidden_states], dim=1
)
if torch.is_grad_enabled() and self.gradient_checkpointing:
hidden_states, initial_encoder_hidden_states = self._gradient_checkpointing_func(
block,
hidden_states,
hidden_states_masks,
cur_encoder_hidden_states,
temb,
image_rotary_emb,
)
else:
hidden_states, initial_encoder_hidden_states = block(
hidden_states=hidden_states,
hidden_states_masks=hidden_states_masks,
encoder_hidden_states=cur_encoder_hidden_states,
temb=temb,
image_rotary_emb=image_rotary_emb,
)
initial_encoder_hidden_states = initial_encoder_hidden_states[:, :initial_encoder_hidden_states_seq_len]
block_id += 1
image_tokens_seq_len = hidden_states.shape[1]
hidden_states = torch.cat([hidden_states, initial_encoder_hidden_states], dim=1)
hidden_states_seq_len = hidden_states.shape[1]
if hidden_states_masks is not None:
encoder_attention_mask_ones = torch.ones(
(batch_size, initial_encoder_hidden_states.shape[1] + cur_llama31_encoder_hidden_states.shape[1]),
device=hidden_states_masks.device,
dtype=hidden_states_masks.dtype,
)
hidden_states_masks = torch.cat([hidden_states_masks, encoder_attention_mask_ones], dim=1)
for bid, block in enumerate(self.single_stream_blocks):
cur_llama31_encoder_hidden_states = encoder_hidden_states[block_id]
hidden_states = torch.cat([hidden_states, cur_llama31_encoder_hidden_states], dim=1)
if torch.is_grad_enabled() and self.gradient_checkpointing:
hidden_states = self._gradient_checkpointing_func(
block,
hidden_states,
hidden_states_masks,
None,
temb,
image_rotary_emb,
)
else:
hidden_states = block(
hidden_states=hidden_states,
hidden_states_masks=hidden_states_masks,
encoder_hidden_states=None,
temb=temb,
image_rotary_emb=image_rotary_emb,
)
hidden_states = hidden_states[:, :hidden_states_seq_len]
block_id += 1
hidden_states = hidden_states[:, :image_tokens_seq_len, ...]
self.previous_residual = hidden_states - ori_hidden_states
else:
for bid, block in enumerate(self.double_stream_blocks):
cur_llama31_encoder_hidden_states = encoder_hidden_states[block_id]
cur_encoder_hidden_states = torch.cat(
[initial_encoder_hidden_states, cur_llama31_encoder_hidden_states], dim=1
)
if torch.is_grad_enabled() and self.gradient_checkpointing:
hidden_states, initial_encoder_hidden_states = self._gradient_checkpointing_func(
block,
hidden_states,
hidden_states_masks,
cur_encoder_hidden_states,
temb,
image_rotary_emb,
)
else:
hidden_states, initial_encoder_hidden_states = block(
hidden_states=hidden_states,
hidden_states_masks=hidden_states_masks,
encoder_hidden_states=cur_encoder_hidden_states,
temb=temb,
image_rotary_emb=image_rotary_emb,
)
initial_encoder_hidden_states = initial_encoder_hidden_states[:, :initial_encoder_hidden_states_seq_len]
block_id += 1
image_tokens_seq_len = hidden_states.shape[1]
hidden_states = torch.cat([hidden_states, initial_encoder_hidden_states], dim=1)
hidden_states_seq_len = hidden_states.shape[1]
if hidden_states_masks is not None:
encoder_attention_mask_ones = torch.ones(
(batch_size, initial_encoder_hidden_states.shape[1] + cur_llama31_encoder_hidden_states.shape[1]),
device=hidden_states_masks.device,
dtype=hidden_states_masks.dtype,
)
hidden_states_masks = torch.cat([hidden_states_masks, encoder_attention_mask_ones], dim=1)
for bid, block in enumerate(self.single_stream_blocks):
cur_llama31_encoder_hidden_states = encoder_hidden_states[block_id]
hidden_states = torch.cat([hidden_states, cur_llama31_encoder_hidden_states], dim=1)
if torch.is_grad_enabled() and self.gradient_checkpointing:
hidden_states = self._gradient_checkpointing_func(
block,
hidden_states,
hidden_states_masks,
None,
temb,
image_rotary_emb,
)
else:
hidden_states = block(
hidden_states=hidden_states,
hidden_states_masks=hidden_states_masks,
encoder_hidden_states=None,
temb=temb,
image_rotary_emb=image_rotary_emb,
)
hidden_states = hidden_states[:, :hidden_states_seq_len]
block_id += 1
hidden_states = hidden_states[:, :image_tokens_seq_len, ...]
output = self.final_layer(hidden_states, temb)
output = self.unpatchify(output, img_sizes, self.training)
if hidden_states_masks is not None:
hidden_states_masks = hidden_states_masks[:, :image_tokens_seq_len]
if USE_PEFT_BACKEND:
# remove `lora_scale` from each PEFT layer
unscale_lora_layers(self, lora_scale)
if not return_dict:
return (output,)
return Transformer2DModelOutput(sample=output)
HiDreamImageTransformer2DModel.forward = teacache_forward
num_inference_steps = 50
seed = 42
prompt = 'A cat holding a sign that says "Hi-Dreams.ai".'
tokenizer_4 = PreTrainedTokenizerFast.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
text_encoder_4 = LlamaForCausalLM.from_pretrained(
"meta-llama/Meta-Llama-3.1-8B-Instruct",
output_hidden_states=True,
output_attentions=True,
torch_dtype=torch.bfloat16,
)
pipeline = HiDreamImagePipeline.from_pretrained(
"HiDream-ai/HiDream-I1-Full",
tokenizer_4=tokenizer_4,
text_encoder_4=text_encoder_4,
torch_dtype=torch.bfloat16,
)
# pipeline.enable_model_cpu_offload() # save some VRAM by offloading the model to CPU. Remove this if you have enough GPU power
# TeaCache
pipeline.transformer.__class__.enable_teacache = True
pipeline.transformer.__class__.cnt = 0
pipeline.transformer.__class__.num_steps = num_inference_steps
pipeline.transformer.__class__.ret_steps = num_inference_steps * 0.1
pipeline.transformer.__class__.rel_l1_thresh = 0.3 # 0.17 for 1.5x speedup, 0.25 for 1.7x speedup, 0.3 for 2x speedup, 0.45 for 2.6x speedup
pipeline.transformer.__class__.coefficients = [-3.13605009e+04, -7.12425503e+02, 4.91363285e+01, 8.26515490e+00, 1.08053901e-01]
pipeline.to("cuda")
img = pipeline(
prompt,
guidance_scale=5.0,
num_inference_steps=num_inference_steps,
generator=torch.Generator("cuda").manual_seed(seed)
).images[0]
img.save("{}.png".format('TeaCache_' + prompt))

View File

@ -0,0 +1,72 @@
<!-- ## **TeaCache4Lumina2** -->
# TeaCache4Lumina2
[TeaCache](https://github.com/LiewFeng/TeaCache) can speed up [Lumina-Image-2.0](https://github.com/Alpha-VLLM/Lumina-Image-2.0) without much visual quality degradation, in a training-free manner. The images below show results for the two coefficient versions: v1 with `rel_l1_thresh` values 0 (original), 0.2 (1.25x speedup), 0.3 (1.5625x speedup), 0.4 (2.0833x speedup), and 0.5 (2.5x speedup); and v2 with Lumina-Image-2.0 (~25 s), TeaCache (0.2) (~16.7 s, 1.5x speedup), TeaCache (0.3) (~15.6 s, 1.6x speedup), TeaCache (0.5) (~13.79 s, 1.8x speedup), and TeaCache (1.1) (~11.9 s, 2.1x speedup).
The v1 coefficients
`[393.76566581, -603.50993606, 209.10239044, -23.00726601, 0.86377344]`
exhibit poor quality at low `rel_l1_thresh` values but perform better at higher settings, though at a slower speed. The v2 coefficients
`[225.7042019806413, -608.8453716535591, 304.1869942338369, 124.21267720116742, -1.4089066892956552]`
offer faster computation and better quality at low thresholds but incur significant feature loss at high thresholds. You can change the coefficients on line 72 to switch versions; a sketch of the two sets follows.
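A minimal sketch of keeping both coefficient sets side by side and switching between them (the shipped `teacache_lumina2.py` hardcodes the v1 list inside the patched forward, so this layout is only an illustration):

```python
# Sketch only: the two coefficient sets from this README, keyed by version,
# so switching is a one-word change instead of editing the forward pass.
LUMINA2_COEFFICIENTS = {
    "v1": [393.76566581, -603.50993606, 209.10239044, -23.00726601, 0.86377344],
    "v2": [225.7042019806413, -608.8453716535591, 304.1869942338369, 124.21267720116742, -1.4089066892956552],
}

version = "v1"   # use "v2" for faster, better low-threshold behaviour (weaker at high thresholds)
coefficients = LUMINA2_COEFFICIENTS[version]
```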
## v1
<p align="center">
<img src="https://github.com/user-attachments/assets/d2c87b99-e4ac-4407-809a-caf9750f41ef" width="150" style="margin: 5px;">
<img src="https://github.com/user-attachments/assets/411ff763-9c31-438d-8a9b-3ec5c88f6c27" width="150" style="margin: 5px;">
<img src="https://github.com/user-attachments/assets/e57dfb60-a07f-4e17-837e-e46a69d8b9c0" width="150" style="margin: 5px;">
<img src="https://github.com/user-attachments/assets/6e3184fe-e31a-452c-a447-48d4b74fcc10" width="150" style="margin: 5px;">
<img src="https://github.com/user-attachments/assets/d6a52c4c-bd22-45c0-9f40-00a2daa85fc8" width="150" style="margin: 5px;">
</p>
## v2
<p align="center">
<img src="https://github.com/user-attachments/assets/aea9907b-830e-497b-b968-aaeef463c7ef" width="150" style="margin: 5px;">
<img src="https://github.com/user-attachments/assets/0e258295-eaaa-49ce-b16f-bba7f7ada6c1" width="150" style="margin: 5px;">
<img src="https://github.com/user-attachments/assets/44600f22-3fd4-4bc4-ab00-29b0ed023d6d" width="150" style="margin: 5px;">
<img src="https://github.com/user-attachments/assets/bcb926ab-95fd-4c83-8b46-f72581a3359e" width="150" style="margin: 5px;">
<img src="https://github.com/user-attachments/assets/ec8db28e-0f9b-4d56-9096-fdc8b3c20f4b" width="150" style="margin: 5px;">
</p>
## 📈 Inference Latency Comparisons on a Single 4090 (50 steps)
## v1
| Lumina-Image-2.0 | TeaCache (0.2) | TeaCache (0.3) | TeaCache (0.4) | TeaCache (0.5) |
|:-------------------------:|:---------------------------:|:--------------------:|:---------------------:|:---------------------:|
| ~25 s | ~20 s | ~16 s | ~12 s | ~10 s |
## v2
| Lumina-Image-2.0 | TeaCache (0.2) | TeaCache (0.3) | TeaCache (0.5) | TeaCache (1.1) |
|:-------------------------:|:---------------------------:|:--------------------:|:---------------------:|:---------------------:|
| ~25 s | ~16.7 s | ~15.6 s | ~13.79 s | ~11.9 s |
## Installation
```shell
pip install --upgrade diffusers[torch] transformers protobuf tokenizers sentencepiece
pip install flash-attn --no-build-isolation
```
## Usage
You can modify the `rel_l1_thresh` in line 154 to obtain your desired trade-off between latency and visual quality. For single-gpu inference, you can use the following command:
```bash
python teacache_lumina2.py
```
## Citation
If you find TeaCache useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.
```
@article{liu2024timestep,
title={Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model},
author={Liu, Feng and Zhang, Shiwei and Wang, Xiaofeng and Wei, Yujie and Qiu, Haonan and Zhao, Yuzhong and Zhang, Yingya and Ye, Qixiang and Wan, Fang},
journal={arXiv preprint arXiv:2411.19108},
year={2024}
}
```
## Acknowledgements
We would like to thank the contributors to [Lumina-Image-2.0](https://github.com/Alpha-VLLM/Lumina-Image-2.0) and [Diffusers](https://github.com/huggingface/diffusers).

View File

@ -0,0 +1,183 @@
import torch
import torch.nn as nn
import numpy as np
from typing import Any, Dict, Optional, Tuple, Union, List
from diffusers import Lumina2Transformer2DModel, Lumina2Pipeline
from diffusers.models.modeling_outputs import Transformer2DModelOutput
from diffusers.utils import USE_PEFT_BACKEND, logging, scale_lora_layers, unscale_lora_layers
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
def teacache_forward_working(
self,
hidden_states: torch.Tensor,
timestep: torch.Tensor,
encoder_hidden_states: torch.Tensor,
encoder_attention_mask: torch.Tensor,
attention_kwargs: Optional[Dict[str, Any]] = None,
return_dict: bool = True,
) -> Union[torch.Tensor, Transformer2DModelOutput]:
if attention_kwargs is not None:
attention_kwargs = attention_kwargs.copy()
lora_scale = attention_kwargs.pop("scale", 1.0)
else:
lora_scale = 1.0
if USE_PEFT_BACKEND:
scale_lora_layers(self, lora_scale)
batch_size, _, height, width = hidden_states.shape
temb, encoder_hidden_states_processed = self.time_caption_embed(hidden_states, timestep, encoder_hidden_states)
(image_patch_embeddings, context_rotary_emb, noise_rotary_emb, joint_rotary_emb,
encoder_seq_lengths, seq_lengths) = self.rope_embedder(hidden_states, encoder_attention_mask)
image_patch_embeddings = self.x_embedder(image_patch_embeddings)
for layer in self.context_refiner:
encoder_hidden_states_processed = layer(encoder_hidden_states_processed, encoder_attention_mask, context_rotary_emb)
for layer in self.noise_refiner:
image_patch_embeddings = layer(image_patch_embeddings, None, noise_rotary_emb, temb)
max_seq_len = max(seq_lengths)
input_to_main_loop = image_patch_embeddings.new_zeros(batch_size, max_seq_len, self.config.hidden_size)
for i, (enc_len, seq_len_val) in enumerate(zip(encoder_seq_lengths, seq_lengths)):
input_to_main_loop[i, :enc_len] = encoder_hidden_states_processed[i, :enc_len]
input_to_main_loop[i, enc_len:seq_len_val] = image_patch_embeddings[i]
use_mask = len(set(seq_lengths)) > 1
attention_mask_for_main_loop_arg = None
if use_mask:
mask = input_to_main_loop.new_zeros(batch_size, max_seq_len, dtype=torch.bool)
for i, (enc_len, seq_len_val) in enumerate(zip(encoder_seq_lengths, seq_lengths)):
mask[i, :seq_len_val] = True
attention_mask_for_main_loop_arg = mask
should_calc = True
if self.enable_teacache:
cache_key = max_seq_len
if cache_key not in self.cache:
self.cache[cache_key] = {
"accumulated_rel_l1_distance": 0.0,
"previous_modulated_input": None,
"previous_residual": None,
}
current_cache = self.cache[cache_key]
modulated_inp, _, _, _ = self.layers[0].norm1(input_to_main_loop, temb)
if self.cnt == 0 or self.cnt == self.num_steps - 1:
should_calc = True
current_cache["accumulated_rel_l1_distance"] = 0.0
else:
if current_cache["previous_modulated_input"] is not None:
# v1 coefficients; switch to [225.7042019806413, -608.8453716535591, 304.1869942338369, 124.21267720116742, -1.4089066892956552] for v2
coefficients = [393.76566581, -603.50993606, 209.10239044, -23.00726601, 0.86377344]
rescale_func = np.poly1d(coefficients)
prev_mod_input = current_cache["previous_modulated_input"]
prev_mean = prev_mod_input.abs().mean()
if prev_mean.item() > 1e-9:
rel_l1_change = ((modulated_inp - prev_mod_input).abs().mean() / prev_mean).cpu().item()
else:
rel_l1_change = 0.0 if modulated_inp.abs().mean().item() < 1e-9 else float('inf')
current_cache["accumulated_rel_l1_distance"] += rescale_func(rel_l1_change)
if current_cache["accumulated_rel_l1_distance"] < self.rel_l1_thresh:
should_calc = False
else:
should_calc = True
current_cache["accumulated_rel_l1_distance"] = 0.0
else:
should_calc = True
current_cache["accumulated_rel_l1_distance"] = 0.0
current_cache["previous_modulated_input"] = modulated_inp.clone()
if self.uncond_seq_len is None:
self.uncond_seq_len = cache_key
if cache_key != self.uncond_seq_len:
self.cnt += 1
if self.cnt >= self.num_steps:
self.cnt = 0
if self.enable_teacache and not should_calc:
if max_seq_len in self.cache and "previous_residual" in self.cache[max_seq_len] and self.cache[max_seq_len]["previous_residual"] is not None:
processed_hidden_states = input_to_main_loop + self.cache[max_seq_len]["previous_residual"]
else:
should_calc = True
current_processing_states = input_to_main_loop
for layer in self.layers:
current_processing_states = layer(current_processing_states, attention_mask_for_main_loop_arg, joint_rotary_emb, temb)
processed_hidden_states = current_processing_states
if not (self.enable_teacache and not should_calc) :
current_processing_states = input_to_main_loop
for layer in self.layers:
current_processing_states = layer(current_processing_states, attention_mask_for_main_loop_arg, joint_rotary_emb, temb)
if self.enable_teacache:
if max_seq_len in self.cache:
self.cache[max_seq_len]["previous_residual"] = current_processing_states - input_to_main_loop
else:
logger.warning(f"TeaCache: Cache key {max_seq_len} not found when trying to save residual.")
processed_hidden_states = current_processing_states
output_after_norm = self.norm_out(processed_hidden_states, temb)
p = self.config.patch_size
final_output_list = []
for i, (enc_len, seq_len_val) in enumerate(zip(encoder_seq_lengths, seq_lengths)):
image_part = output_after_norm[i][enc_len:seq_len_val]
h_p, w_p = height // p, width // p
reconstructed_image = image_part.view(h_p, w_p, p, p, self.out_channels) \
.permute(4, 0, 2, 1, 3) \
.flatten(3, 4) \
.flatten(1, 2)
final_output_list.append(reconstructed_image)
final_output_tensor = torch.stack(final_output_list, dim=0)
if USE_PEFT_BACKEND:
unscale_lora_layers(self, lora_scale)
if not return_dict:
return (final_output_tensor,)
return Transformer2DModelOutput(sample=final_output_tensor)
Lumina2Transformer2DModel.forward = teacache_forward_working
ckpt_path = "NietaAniLumina_Alpha_full_round5_ep5_s182000.pth"  # local single-file checkpoint used in this example; replace with your own .pth
transformer = Lumina2Transformer2DModel.from_single_file(
ckpt_path, torch_dtype=torch.bfloat16
)
pipeline = Lumina2Pipeline.from_pretrained(
"Alpha-VLLM/Lumina-Image-2.0",
transformer=transformer,
torch_dtype=torch.bfloat16
).to("cuda")
num_inference_steps = 30
seed = 1024
prompt = "a cat holding a sign that says hello"
output_filename = "teacache_lumina2_output.png"
# TeaCache
pipeline.transformer.__class__.enable_teacache = True
pipeline.transformer.__class__.cnt = 0
pipeline.transformer.__class__.num_steps = num_inference_steps
pipeline.transformer.__class__.rel_l1_thresh = 0.3
pipeline.transformer.__class__.cache = {}
pipeline.transformer.__class__.uncond_seq_len = None
pipeline.enable_model_cpu_offload()
image = pipeline(
prompt=prompt,
num_inference_steps=num_inference_steps,
generator=torch.Generator("cuda").manual_seed(seed)
).images[0]
image.save(output_filename)
print(f"Image saved to {output_filename}")

Binary file not shown (new image, 4.7 MiB).