Compare commits


8 Commits

| Author | SHA1 | Message | Date |
|---|---|---|---|
| Feng Liu | 2031aa0bd1 | Update README.md | 2025-03-13 21:28:53 +08:00 |
| Feng Liu | be76c09add | Update README.md | 2025-03-13 20:14:51 +08:00 |
| Feng Liu | bfd9e654d3 | Update README.md | 2025-03-13 20:07:35 +08:00 |
| Feng Liu | f9c183df5b | Update README.md | 2025-03-13 19:07:42 +08:00 |
| Feng Liu | a9489fbf78 | Merge pull request #51 from zishen-ucap/teacache4Wan2.1_v2: Add --use_ret_steps Mode to Accelerate Inference and Make Generated Results Closer to Wan2.1 | 2025-03-13 19:01:03 +08:00 |
| zishen-ucap | 8ae3503442 | Support parameter -- use_ret_stpes | 2025-03-13 17:16:51 +08:00 |
| zishen-ucap | cabe560cf6 | Support parameter -- use_ret_stpes | 2025-03-13 17:06:39 +08:00 |
| zishen-ucap | 2295d6a10c | Support parameter - use_det_stpes | 2025-03-13 16:18:28 +08:00 |
3 changed files with 438 additions and 340 deletions

README.md

@@ -60,10 +60,11 @@
![visualization](./assets/tisser.png)
## 🫖 Introduction
We introduce Timestep Embedding Aware Cache (TeaCache), a training-free caching approach that estimates and leverages the fluctuating differences among model outputs across timesteps, thereby accelerating the inference. TeaCache works well for Video Diffusion Models, Image Diffusion models and Audio Diffusion Models. For more details and results, please visit our [project page](https://github.com/LiewFeng/TeaCache).
We introduce Timestep Embedding Aware Cache (TeaCache), a training-free caching approach that estimates and leverages the fluctuating differences among model outputs across timesteps, thereby accelerating the inference. TeaCache works well for Video Diffusion Models, Image Diffusion models and Audio Diffusion Models. For more details and results, please visit our [project page](https://liewfeng.github.io/TeaCache/).
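The core caching rule behind this estimate can be sketched as follows. This is a minimal sketch with illustrative names, not the exact API of this repo; the per-model implementations (e.g. `teacache_forward` in the patch below) apply the same rule with model-specific rescaling coefficients.

```python
import numpy as np
import torch

def should_reuse_cache(modulated_inp: torch.Tensor, state: dict,
                       coefficients: list, thresh: float) -> bool:
    """Illustrative TeaCache decision rule (a sketch, not the repo's exact API).

    `modulated_inp` is the timestep-embedding-modulated input at the current step;
    `state` carries the previous modulated input and the accumulated rescaled distance.
    Returns True when the cached residual from the previous step can be reused.
    """
    prev = state.get("previous_modulated_input")
    if prev is None:  # first step: always compute and start accumulating
        state["previous_modulated_input"] = modulated_inp
        state["accumulated_rel_l1_distance"] = 0.0
        return False
    # relative L1 change of the modulated input between consecutive timesteps
    rel_l1 = ((modulated_inp - prev).abs().mean() / prev.abs().mean()).item()
    # rescale with a model-specific polynomial fit, then accumulate across steps
    state["accumulated_rel_l1_distance"] += float(np.poly1d(coefficients)(rel_l1))
    state["previous_modulated_input"] = modulated_inp
    if state["accumulated_rel_l1_distance"] < thresh:
        return True   # change is still small: skip the blocks and reuse the cached residual
    state["accumulated_rel_l1_distance"] = 0.0
    return False      # change is large: run the full forward pass and refresh the cache
```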
## 🔥 Latest News
- **If you like our project, please give us a star ⭐ on GitHub for the latest update.**
- [2025/03/13] 🔥 Optimized TeaCache for [Wan2.1](https://github.com/Wan-Video/Wan2.1). Thanks [@zishen-ucap](https://github.com/zishen-ucap).
- [2025/03/05] 🔥 Support [Wan2.1](https://github.com/Wan-Video/Wan2.1) for both T2V and I2V.
- [2025/02/27] 🎉 Accepted in CVPR 2025.
- [2025/01/24] 🔥 Support [Cosmos](https://github.com/NVIDIA/Cosmos) for both T2V and I2V. Thanks [@zishen-ucap](https://github.com/zishen-ucap).
@@ -78,20 +79,23 @@ We introduce Timestep Embedding Aware Cache (TeaCache), a training-free caching
- [2024/11/28] 🎉 Release the [paper](https://arxiv.org/abs/2411.19108) of TeaCache.
## 🧩 Community Contributions
If you develop/use TeaCache in your projects, welcome to let us know.
If you develop/use TeaCache in your projects and would like more people to see it, please let us know (liufeng20@mails.ucas.ac.cn).
**Model**
- [ConsisID](https://github.com/PKU-YuanGroup/ConsisID) supports TeaCache. Thanks [@SHYuanBest](https://github.com/SHYuanBest).
- [Ruyi-Models](https://github.com/IamCreateAI/Ruyi-Models) supports TeaCache. Thanks [@cellzero](https://github.com/cellzero).
- [FastVideo](https://github.com/hao-ai-lab/FastVideo) supports TeaCache. Thanks [@BrianChen1129](https://github.com/BrianChen1129) and [@jzhang38](https://github.com/jzhang38).
- [EasyAnimate](https://github.com/aigc-apps/EasyAnimate) supports TeaCache. Thanks [@hkunzhe](https://github.com/hkunzhe) and [@bubbliiiing](https://github.com/bubbliiiing).
- [Ruyi-Models](https://github.com/IamCreateAI/Ruyi-Models) supports TeaCache. Thanks [@cellzero](https://github.com/cellzero).
- [ConsisID](https://github.com/PKU-YuanGroup/ConsisID) supports TeaCache. Thanks [@SHYuanBest](https://github.com/SHYuanBest).
**ComfyUI**
- [ComfyUI-HunyuanVideoWrapper](https://github.com/kijai/ComfyUI-HunyuanVideoWrapper) supports TeaCache4HunyuanVideo. Thanks [@kijai](https://github.com/kijai), [ctf05](https://github.com/ctf05) and [DarioFT](https://github.com/DarioFT).
- [ComfyUI-TeaCacheHunyuanVideo](https://github.com/facok/ComfyUI-TeaCacheHunyuanVideo) for TeaCache4HunyuanVideo. Thanks [@facok](https://github.com/facok).
- [ComfyUI-TeaCache](https://github.com/welltop-cn/ComfyUI-TeaCache) for TeaCache. Thanks [@YunjieYu](https://github.com/YunjieYu).
- [Comfyui_TTP_Toolset](https://github.com/TTPlanetPig/Comfyui_TTP_Toolset) supports TeaCache. Thanks [@TTPlanetPig](https://github.com/TTPlanetPig).
- [ComfyUI_Patches_ll](https://github.com/lldacing/ComfyUI_Patches_ll) supports TeaCache. Thanks [@lldacing](https://github.com/lldacing).
- [ComfyUI-WanVideoWrapper](https://github.com/kijai/ComfyUI-WanVideoWrapper) supports TeaCache4Wan2.1. Thanks [@kijai](https://github.com/kijai).
- [ComfyUI-TangoFlux](https://github.com/LucipherDev/ComfyUI-TangoFlux) supports TeaCache. Thanks [@LucipherDev](https://github.com/LucipherDev).
- [ComfyUI_Patches_ll](https://github.com/lldacing/ComfyUI_Patches_ll) supports TeaCache. Thanks [@lldacing](https://github.com/lldacing).
- [Comfyui_TTP_Toolset](https://github.com/TTPlanetPig/Comfyui_TTP_Toolset) supports TeaCache. Thanks [@TTPlanetPig](https://github.com/TTPlanetPig).
- [ComfyUI-TeaCache](https://github.com/welltop-cn/ComfyUI-TeaCache) for TeaCache. Thanks [@YunjieYu](https://github.com/YunjieYu).
- [ComfyUI-TeaCacheHunyuanVideo](https://github.com/facok/ComfyUI-TeaCacheHunyuanVideo) for TeaCache4HunyuanVideo. Thanks [@facok](https://github.com/facok).
- [ComfyUI-HunyuanVideoWrapper](https://github.com/kijai/ComfyUI-HunyuanVideoWrapper) supports TeaCache4HunyuanVideo. Thanks [@kijai](https://github.com/kijai), [ctf05](https://github.com/ctf05) and [DarioFT](https://github.com/DarioFT).
**Parallelism**
- [Teacache-xDiT](https://github.com/MingXiangL/Teacache-xDiT) for multi-gpu inference. Thanks [@MingXiangL](https://github.com/MingXiangL).
@@ -101,25 +105,28 @@ If you develop/use TeaCache in your projects, welcome to let us know.
## 🎉 Supported Models
**Text to Video**
- [TeaCache4Wan2.1](./TeaCache4Wan2.1/README.md)
- [TeaCache4Cosmos](./eval/TeaCache4Cosmos/README.md)
- EasyAnimate, see [here](https://github.com/aigc-apps/EasyAnimate).
- [TeaCache4CogVideoX1.5](./TeaCache4CogVideoX1.5/README.md)
- [TeaCache4LTX-Video](./TeaCache4LTX-Video/README.md)
- [TeaCache4Mochi](./TeaCache4Mochi/README.md)
- [TeaCache4HunyuanVideo](./TeaCache4HunyuanVideo/README.md)
- [TeaCache4CogVideoX](./eval/teacache/README.md)
- [TeaCache4Open-Sora](./eval/teacache/README.md)
- [TeaCache4Open-Sora-Plan](./eval/teacache/README.md)
- [TeaCache4Latte](./eval/teacache/README.md)
- [TeaCache4CogVideoX](./eval/teacache/README.md)
- [TeaCache4HunyuanVideo](./TeaCache4HunyuanVideo/README.md)
- [TeaCache4Mochi](./TeaCache4Mochi/README.md)
- [TeaCache4LTX-Video](./TeaCache4LTX-Video/README.md)
- [TeaCache4CogVideoX1.5](./TeaCache4CogVideoX1.5/README.md)
- EasyAnimate, see [here](https://github.com/aigc-apps/EasyAnimate).
- [TeaCache4Cosmos](./eval/TeaCache4Cosmos/README.md)
**Image to Video**
- [TeaCache4Wan2.1](./TeaCache4Wan2.1/README.md)
- [TeaCache4ConsisID](./TeaCache4ConsisID/README.md)
- [TeaCache4CogVideoX1.5](./TeaCache4CogVideoX1.5/README.md)
- Ruyi-Models. See [here](https://github.com/IamCreateAI/Ruyi-Models).
- EasyAnimate, see [here](https://github.com/aigc-apps/EasyAnimate).
- [TeaCache4Wan2.1](./TeaCache4Wan2.1/README.md)
- [TeaCache4Cosmos](./eval/TeaCache4Cosmos/README.md)
- EasyAnimate, see [here](https://github.com/aigc-apps/EasyAnimate).
- Ruyi-Models. See [here](https://github.com/IamCreateAI/Ruyi-Models).
- [TeaCache4CogVideoX1.5](./TeaCache4CogVideoX1.5/README.md)
- [TeaCache4ConsisID](./TeaCache4ConsisID/README.md)
**Video to Video**
- EasyAnimate, see [here](https://github.com/aigc-apps/EasyAnimate).

TeaCache4Wan2.1/README.md

@@ -57,6 +57,49 @@ For I2V with 720P resolution, you can use the following command:
python teacache_generate.py --task i2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-I2V-14B-720P --image examples/i2v_input.JPG --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside." --base_seed 42 --offload_model True --t5_cpu --frame_num 61 --teacache_thresh 0.3
```
## Faster Video Generation Using the `use_ret_steps` Parameter
Using retention steps results in faster generation and better quality (except for t2v-1.3B).
https://github.com/user-attachments/assets/f241b5f5-1044-4223-b2a4-449dc6dc1ad7
https://github.com/user-attachments/assets/01db60f9-4aaf-43c4-8f1b-6e050cfa1180
https://github.com/user-attachments/assets/e03621f2-1085-4571-8eca-51889f47ce18
https://github.com/user-attachments/assets/d1340197-20c1-4f9e-a780-31f789af0893
| use_ret_steps | Wan2.1 t2v 1.3B (thresh) | Slow (thresh) | Fast (thresh) |
|:---:|:---:|:---:|:---:|
| False | ~97 s (0.00) | ~64 s (0.05) | ~49 s (0.08) |
| True | ~97 s (0.00) | ~61 s (0.05) | ~41 s (0.10) |

| use_ret_steps | Wan2.1 t2v 14B (thresh) | Slow (thresh) | Fast (thresh) |
|:---:|:---:|:---:|:---:|
| False | ~1829 s (0.00) | ~1234 s (0.14) | ~909 s (0.20) |
| True | ~1829 s (0.00) | ~915 s (0.10) | ~578 s (0.20) |

| use_ret_steps | Wan2.1 i2v 480p (thresh) | Slow (thresh) | Fast (thresh) |
|:---:|:---:|:---:|:---:|
| False | ~385 s (0.00) | ~241 s (0.13) | ~156 s (0.26) |
| True | ~385 s (0.00) | ~212 s (0.20) | ~164 s (0.30) |

| use_ret_steps | Wan2.1 i2v 720p (thresh) | Slow (thresh) | Fast (thresh) |
|:---:|:---:|:---:|:---:|
| False | ~903 s (0.00) | ~476 s (0.20) | ~363 s (0.30) |
| True | ~903 s (0.00) | ~430 s (0.20) | ~340 s (0.30) |
You can follow the earlier video generation instructions and use the `use_ret_steps` parameter to speed up generation while producing results closer to [Wan2.1](https://github.com/Wan-Video/Wan2.1): simply add `--use_ret_steps` to the original command and adjust `--teacache_thresh` as needed. Suitable `--teacache_thresh` values for each model and setting can be taken from the tables above.
### Example Command:
```bash
python teacache_generate.py --task t2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-T2V-14B --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." --base_seed 42 --offload_model True --t5_cpu --teacache_thresh 0.3 --use_ret_steps
```
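For reference, the step gating that `--use_ret_steps` switches can be sketched as follows. This is a minimal sketch with values taken from the patched `teacache_generate.py` later in this compare; each sampling step issues two forward calls, one conditional and one unconditional.

```python
def forces_full_compute(cnt: int, sample_steps: int, use_ret_steps: bool) -> bool:
    """Sketch of the gating in the patched teacache_forward (values from the diff below).

    `cnt` counts forward calls; even calls are the conditional pass and odd calls the
    unconditional pass of classifier-free guidance, so there are 2*sample_steps calls.
    """
    if use_ret_steps:
        ret_steps = 5 * 2                    # the first 5 sampling steps always recompute
        cutoff_steps = sample_steps * 2      # caching stays allowed up to the final step
    else:
        ret_steps = 1 * 2                    # only the first sampling step is forced
        cutoff_steps = sample_steps * 2 - 2  # the last sampling step is also forced
    return cnt < ret_steps or cnt >= cutoff_steps
```

With `--use_ret_steps`, the patch also measures the caching distance on the projected timestep embedding `e0` instead of `e` and uses a separate set of polynomial coefficients.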
## Acknowledgements
We would like to thank the contributors to [Wan2.1](https://github.com/Wan-Video/Wan2.1).

TeaCache4Wan2.1/teacache_generate.py

@@ -29,6 +29,7 @@ from wan.utils.fm_solvers import (FlowDPMSolverMultistepScheduler,
from wan.utils.fm_solvers_unipc import FlowUniPCMultistepScheduler
from tqdm import tqdm
EXAMPLE_PROMPT = {
"t2v-1.3B": {
"prompt": "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage.",
@@ -48,183 +49,6 @@ EXAMPLE_PROMPT = {
}
def _validate_args(args):
# Basic check
assert args.ckpt_dir is not None, "Please specify the checkpoint directory."
assert args.task in WAN_CONFIGS, f"Unsupport task: {args.task}"
assert args.task in EXAMPLE_PROMPT, f"Unsupport task: {args.task}"
# The default sampling steps are 40 for image-to-video tasks and 50 for text-to-video tasks.
if args.sample_steps is None:
args.sample_steps = 40 if "i2v" in args.task else 50
if args.sample_shift is None:
args.sample_shift = 5.0
if "i2v" in args.task and args.size in ["832*480", "480*832"]:
args.sample_shift = 3.0
# The default number of frames are 1 for text-to-image tasks and 81 for other tasks.
if args.frame_num is None:
args.frame_num = 1 if "t2i" in args.task else 81
# T2I frame_num check
if "t2i" in args.task:
assert args.frame_num == 1, f"Unsupport frame_num {args.frame_num} for task {args.task}"
args.base_seed = args.base_seed if args.base_seed >= 0 else random.randint(
0, sys.maxsize)
# Size check
assert args.size in SUPPORTED_SIZES[
args.
task], f"Unsupport size {args.size} for task {args.task}, supported sizes are: {', '.join(SUPPORTED_SIZES[args.task])}"
def _parse_args():
parser = argparse.ArgumentParser(
description="Generate a image or video from a text prompt or image using Wan"
)
parser.add_argument(
"--task",
type=str,
default="t2v-14B",
choices=list(WAN_CONFIGS.keys()),
help="The task to run.")
parser.add_argument(
"--size",
type=str,
default="1280*720",
choices=list(SIZE_CONFIGS.keys()),
help="The area (width*height) of the generated video. For the I2V task, the aspect ratio of the output video will follow that of the input image."
)
parser.add_argument(
"--frame_num",
type=int,
default=None,
help="How many frames to sample from a image or video. The number should be 4n+1"
)
parser.add_argument(
"--ckpt_dir",
type=str,
default=None,
help="The path to the checkpoint directory.")
parser.add_argument(
"--offload_model",
type=str2bool,
default=None,
help="Whether to offload the model to CPU after each model forward, reducing GPU memory usage."
)
parser.add_argument(
"--ulysses_size",
type=int,
default=1,
help="The size of the ulysses parallelism in DiT.")
parser.add_argument(
"--ring_size",
type=int,
default=1,
help="The size of the ring attention parallelism in DiT.")
parser.add_argument(
"--t5_fsdp",
action="store_true",
default=False,
help="Whether to use FSDP for T5.")
parser.add_argument(
"--t5_cpu",
action="store_true",
default=False,
help="Whether to place T5 model on CPU.")
parser.add_argument(
"--dit_fsdp",
action="store_true",
default=False,
help="Whether to use FSDP for DiT.")
parser.add_argument(
"--save_file",
type=str,
default=None,
help="The file to save the generated image or video to.")
parser.add_argument(
"--prompt",
type=str,
default=None,
help="The prompt to generate the image or video from.")
parser.add_argument(
"--use_prompt_extend",
action="store_true",
default=False,
help="Whether to use prompt extend.")
parser.add_argument(
"--prompt_extend_method",
type=str,
default="local_qwen",
choices=["dashscope", "local_qwen"],
help="The prompt extend method to use.")
parser.add_argument(
"--prompt_extend_model",
type=str,
default=None,
help="The prompt extend model to use.")
parser.add_argument(
"--prompt_extend_target_lang",
type=str,
default="ch",
choices=["ch", "en"],
help="The target language of prompt extend.")
parser.add_argument(
"--base_seed",
type=int,
default=-1,
help="The seed to use for generating the image or video.")
parser.add_argument(
"--image",
type=str,
default=None,
help="The image to generate the video from.")
parser.add_argument(
"--sample_solver",
type=str,
default='unipc',
choices=['unipc', 'dpm++'],
help="The solver used to sample.")
parser.add_argument(
"--sample_steps", type=int, default=None, help="The sampling steps.")
parser.add_argument(
"--sample_shift",
type=float,
default=None,
help="Sampling shift factor for flow matching schedulers.")
parser.add_argument(
"--sample_guide_scale",
type=float,
default=5.0,
help="Classifier free guidance scale.")
parser.add_argument(
"--teacache_thresh",
type=float,
default=0.05,
help="The size of the ulysses parallelism in DiT.")
args = parser.parse_args()
_validate_args(args)
return args
def _init_logging(rank):
# logging
if rank == 0:
# set format
logging.basicConfig(
level=logging.INFO,
format="[%(asctime)s] %(levelname)s: %(message)s",
handlers=[logging.StreamHandler(stream=sys.stdout)])
else:
logging.basicConfig(level=logging.ERROR)
# add a cond_flag
def t2v_generate(self,
input_prompt,
size=(1280, 720),
@@ -341,8 +165,8 @@ def t2v_generate(self,
# sample videos
latents = noise
arg_c = {'context': context, 'seq_len': seq_len, 'cond_flag': True}
arg_null = {'context': context_null, 'seq_len': seq_len, 'cond_flag': False}
arg_c = {'context': context, 'seq_len': seq_len}
arg_null = {'context': context_null, 'seq_len': seq_len}
for _, t in enumerate(tqdm(timesteps)):
latent_model_input = latents
@@ -385,7 +209,7 @@ def t2v_generate(self,
return videos[0] if self.rank == 0 else None
# add a cond_flag
def i2v_generate(self,
input_prompt,
img,
@@ -543,7 +367,7 @@ def i2v_generate(self,
'clip_fea': clip_context,
'seq_len': max_seq_len,
'y': [y],
'cond_flag': True,
# 'cond_flag': True,
}
arg_null = {
@@ -551,7 +375,7 @@ def i2v_generate(self,
'clip_fea': clip_context,
'seq_len': max_seq_len,
'y': [y],
'cond_flag': False,
# 'cond_flag': False,
}
if offload_model:
@@ -610,131 +434,332 @@ def i2v_generate(self,
return videos[0] if self.rank == 0 else None
def teacache_forward(
self,
x,
t,
context,
seq_len,
clip_fea=None,
y=None,
cond_flag=False,
):
r"""
Forward pass through the diffusion model
self,
x,
t,
context,
seq_len,
clip_fea=None,
y=None,
):
r"""
Forward pass through the diffusion model
Args:
x (List[Tensor]):
List of input video tensors, each with shape [C_in, F, H, W]
t (Tensor):
Diffusion timesteps tensor of shape [B]
context (List[Tensor]):
List of text embeddings each with shape [L, C]
seq_len (`int`):
Maximum sequence length for positional encoding
clip_fea (Tensor, *optional*):
CLIP image features for image-to-video mode
y (List[Tensor], *optional*):
Conditional video inputs for image-to-video mode, same shape as x
Args:
x (List[Tensor]):
List of input video tensors, each with shape [C_in, F, H, W]
t (Tensor):
Diffusion timesteps tensor of shape [B]
context (List[Tensor]):
List of text embeddings each with shape [L, C]
seq_len (`int`):
Maximum sequence length for positional encoding
clip_fea (Tensor, *optional*):
CLIP image features for image-to-video mode
y (List[Tensor], *optional*):
Conditional video inputs for image-to-video mode, same shape as x
Returns:
List[Tensor]:
List of denoised video tensors with original input shapes [C_out, F, H / 8, W / 8]
"""
if self.model_type == 'i2v':
assert clip_fea is not None and y is not None
# params
device = self.patch_embedding.weight.device
if self.freqs.device != device:
self.freqs = self.freqs.to(device)
Returns:
List[Tensor]:
List of denoised video tensors with original input shapes [C_out, F, H / 8, W / 8]
"""
if self.model_type == 'i2v':
assert clip_fea is not None and y is not None
# params
device = self.patch_embedding.weight.device
if self.freqs.device != device:
self.freqs = self.freqs.to(device)
if y is not None:
x = [torch.cat([u, v], dim=0) for u, v in zip(x, y)]
if y is not None:
x = [torch.cat([u, v], dim=0) for u, v in zip(x, y)]
# embeddings
x = [self.patch_embedding(u.unsqueeze(0)) for u in x]
grid_sizes = torch.stack(
[torch.tensor(u.shape[2:], dtype=torch.long) for u in x])
x = [u.flatten(2).transpose(1, 2) for u in x]
seq_lens = torch.tensor([u.size(1) for u in x], dtype=torch.long)
assert seq_lens.max() <= seq_len
x = torch.cat([
torch.cat([u, u.new_zeros(1, seq_len - u.size(1), u.size(2))],
dim=1) for u in x
])
# time embeddings
with amp.autocast(dtype=torch.float32):
e = self.time_embedding(
sinusoidal_embedding_1d(self.freq_dim, t).float())
e0 = self.time_projection(e).unflatten(1, (6, self.dim))
assert e.dtype == torch.float32 and e0.dtype == torch.float32
# embeddings
x = [self.patch_embedding(u.unsqueeze(0)) for u in x]
grid_sizes = torch.stack(
[torch.tensor(u.shape[2:], dtype=torch.long) for u in x])
x = [u.flatten(2).transpose(1, 2) for u in x]
seq_lens = torch.tensor([u.size(1) for u in x], dtype=torch.long)
assert seq_lens.max() <= seq_len
x = torch.cat([
torch.cat([u, u.new_zeros(1, seq_len - u.size(1), u.size(2))],
dim=1) for u in x
])
# context
context_lens = None
context = self.text_embedding(
torch.stack([
torch.cat(
[u, u.new_zeros(self.text_len - u.size(0), u.size(1))])
for u in context
]))
# time embeddings
with amp.autocast(dtype=torch.float32):
e = self.time_embedding(
sinusoidal_embedding_1d(self.freq_dim, t).float())
e0 = self.time_projection(e).unflatten(1, (6, self.dim))
assert e.dtype == torch.float32 and e0.dtype == torch.float32
if clip_fea is not None:
context_clip = self.img_emb(clip_fea) # bs x 257 x dim
context = torch.concat([context_clip, context], dim=1)
# context
context_lens = None
context = self.text_embedding(
torch.stack([
torch.cat(
[u, u.new_zeros(self.text_len - u.size(0), u.size(1))])
for u in context
]))
# arguments
kwargs = dict(
e=e0,
seq_lens=seq_lens,
grid_sizes=grid_sizes,
freqs=self.freqs,
context=context,
context_lens=context_lens)
if clip_fea is not None:
context_clip = self.img_emb(clip_fea) # bs x 257 x dim
context = torch.concat([context_clip, context], dim=1)
# arguments
kwargs = dict(
e=e0,
seq_lens=seq_lens,
grid_sizes=grid_sizes,
freqs=self.freqs,
context=context,
context_lens=context_lens)
if self.enable_teacache:
if cond_flag:
modulated_inp = e
if self.cnt == 0 or self.cnt == self.num_steps-1:
should_calc = True
self.accumulated_rel_l1_distance = 0
else:
rescale_func = np.poly1d(self.coefficients)
if cond_flag:
self.accumulated_rel_l1_distance += rescale_func(((modulated_inp-self.previous_modulated_input).abs().mean() / self.previous_modulated_input.abs().mean()).cpu().item())
if self.accumulated_rel_l1_distance < self.rel_l1_thresh:
should_calc = False
else:
should_calc = True
self.accumulated_rel_l1_distance = 0
self.previous_modulated_input = modulated_inp
self.cnt = 0 if self.cnt == self.num_steps-1 else self.cnt + 1
self.should_calc = should_calc
if self.enable_teacache:
modulated_inp = e0 if self.use_ref_steps else e
# teacache
if self.cnt%2==0: # even -> conditon
self.is_even = True
if self.cnt < self.ret_steps or self.cnt >= self.cutoff_steps:
should_calc_even = True
self.accumulated_rel_l1_distance_even = 0
else:
should_calc = self.should_calc
# if not cond_flag:
# self.cnt = 0 if self.cnt == self.num_steps-1 else self.cnt + 1
if self.enable_teacache:
if not should_calc:
x = x + self.previous_residual_cond if cond_flag else x + self.previous_residual_uncond
rescale_func = np.poly1d(self.coefficients)
self.accumulated_rel_l1_distance_even += rescale_func(((modulated_inp-self.previous_e0_even).abs().mean() / self.previous_e0_even.abs().mean()).cpu().item())
if self.accumulated_rel_l1_distance_even < self.teacache_thresh:
should_calc_even = False
else:
should_calc_even = True
self.accumulated_rel_l1_distance_even = 0
self.previous_e0_even = modulated_inp.clone()
else: # odd -> unconditon
self.is_even = False
if self.cnt < self.ret_steps or self.cnt >= self.cutoff_steps:
should_calc_odd = True
self.accumulated_rel_l1_distance_odd = 0
else:
rescale_func = np.poly1d(self.coefficients)
self.accumulated_rel_l1_distance_odd += rescale_func(((modulated_inp-self.previous_e0_odd).abs().mean() / self.previous_e0_odd.abs().mean()).cpu().item())
if self.accumulated_rel_l1_distance_odd < self.teacache_thresh:
should_calc_odd = False
else:
should_calc_odd = True
self.accumulated_rel_l1_distance_odd = 0
self.previous_e0_odd = modulated_inp.clone()
if self.enable_teacache:
if self.is_even:
if not should_calc_even:
x += self.previous_residual_even
else:
ori_x = x.clone()
for block in self.blocks:
x = block(x, **kwargs)
if cond_flag:
self.previous_residual_cond = x - ori_x
else:
self.previous_residual_uncond = x - ori_x
self.previous_residual_even = x - ori_x
else:
for block in self.blocks:
x = block(x, **kwargs)
if not should_calc_odd:
x += self.previous_residual_odd
else:
ori_x = x.clone()
for block in self.blocks:
x = block(x, **kwargs)
self.previous_residual_odd = x - ori_x
else:
for block in self.blocks:
x = block(x, **kwargs)
# head
x = self.head(x, e)
# head
x = self.head(x, e)
# unpatchify
x = self.unpatchify(x, grid_sizes)
return [u.float() for u in x]
# unpatchify
x = self.unpatchify(x, grid_sizes)
self.cnt += 1
if self.cnt >= self.num_steps:
self.cnt = 0
return [u.float() for u in x]
def _validate_args(args):
# Basic check
assert args.ckpt_dir is not None, "Please specify the checkpoint directory."
assert args.task in WAN_CONFIGS, f"Unsupport task: {args.task}"
assert args.task in EXAMPLE_PROMPT, f"Unsupport task: {args.task}"
# The default sampling steps are 40 for image-to-video tasks and 50 for text-to-video tasks.
if args.sample_steps is None:
args.sample_steps = 40 if "i2v" in args.task else 50
if args.sample_shift is None:
args.sample_shift = 5.0
if "i2v" in args.task and args.size in ["832*480", "480*832"]:
args.sample_shift = 3.0
# The default number of frames are 1 for text-to-image tasks and 81 for other tasks.
if args.frame_num is None:
args.frame_num = 1 if "t2i" in args.task else 81
# T2I frame_num check
if "t2i" in args.task:
assert args.frame_num == 1, f"Unsupport frame_num {args.frame_num} for task {args.task}"
args.base_seed = args.base_seed if args.base_seed >= 0 else random.randint(
0, sys.maxsize)
# Size check
assert args.size in SUPPORTED_SIZES[
args.
task], f"Unsupport size {args.size} for task {args.task}, supported sizes are: {', '.join(SUPPORTED_SIZES[args.task])}"
def _parse_args():
parser = argparse.ArgumentParser(
description="Generate a image or video from a text prompt or image using Wan"
)
parser.add_argument(
"--task",
type=str,
default="t2v-14B",
choices=list(WAN_CONFIGS.keys()),
help="The task to run.")
parser.add_argument(
"--size",
type=str,
default="1280*720",
choices=list(SIZE_CONFIGS.keys()),
help="The area (width*height) of the generated video. For the I2V task, the aspect ratio of the output video will follow that of the input image."
)
parser.add_argument(
"--frame_num",
type=int,
default=None,
help="How many frames to sample from a image or video. The number should be 4n+1"
)
parser.add_argument(
"--ckpt_dir",
type=str,
default=None,
help="The path to the checkpoint directory.")
parser.add_argument(
"--offload_model",
type=str2bool,
default=None,
help="Whether to offload the model to CPU after each model forward, reducing GPU memory usage."
)
parser.add_argument(
"--ulysses_size",
type=int,
default=1,
help="The size of the ulysses parallelism in DiT.")
parser.add_argument(
"--ring_size",
type=int,
default=1,
help="The size of the ring attention parallelism in DiT.")
parser.add_argument(
"--t5_fsdp",
action="store_true",
default=False,
help="Whether to use FSDP for T5.")
parser.add_argument(
"--t5_cpu",
action="store_true",
default=False,
help="Whether to place T5 model on CPU.")
parser.add_argument(
"--dit_fsdp",
action="store_true",
default=False,
help="Whether to use FSDP for DiT.")
parser.add_argument(
"--save_file",
type=str,
default=None,
help="The file to save the generated image or video to.")
parser.add_argument(
"--prompt",
type=str,
default=None,
help="The prompt to generate the image or video from.")
parser.add_argument(
"--use_prompt_extend",
action="store_true",
default=False,
help="Whether to use prompt extend.")
parser.add_argument(
"--prompt_extend_method",
type=str,
default="local_qwen",
choices=["dashscope", "local_qwen"],
help="The prompt extend method to use.")
parser.add_argument(
"--prompt_extend_model",
type=str,
default=None,
help="The prompt extend model to use.")
parser.add_argument(
"--prompt_extend_target_lang",
type=str,
default="ch",
choices=["ch", "en"],
help="The target language of prompt extend.")
parser.add_argument(
"--base_seed",
type=int,
default=-1,
help="The seed to use for generating the image or video.")
parser.add_argument(
"--image",
type=str,
default=None,
help="The image to generate the video from.")
parser.add_argument(
"--sample_solver",
type=str,
default='unipc',
choices=['unipc', 'dpm++'],
help="The solver used to sample.")
parser.add_argument(
"--sample_steps", type=int, default=None, help="The sampling steps.")
parser.add_argument(
"--sample_shift",
type=float,
default=None,
help="Sampling shift factor for flow matching schedulers.")
parser.add_argument(
"--sample_guide_scale",
type=float,
default=5.0,
help="Classifier free guidance scale.")
parser.add_argument(
"--teacache_thresh",
type=float,
default=0.2,
help="Higher speedup will cause to worse quality -- 0.1 for 2.0x speedup -- 0.2 for 3.0x speedup")
parser.add_argument(
"--use_ret_steps",
action="store_true",
default=False,
help="Using Retention Steps will result in faster generation speed and better generation quality.")
args = parser.parse_args()
_validate_args(args)
return args
def _init_logging(rank):
# logging
if rank == 0:
# set format
logging.basicConfig(
level=logging.INFO,
format="[%(asctime)s] %(levelname)s: %(message)s",
handlers=[logging.StreamHandler(stream=sys.stdout)])
else:
logging.basicConfig(level=logging.ERROR)
def generate(args):
@@ -841,21 +866,32 @@ def generate(args):
# TeaCache
wan_t2v.__class__.generate = t2v_generate
wan_t2v.model.__class__.cnt = 0
wan_t2v.model.__class__.enable_teacache = True
wan_t2v.model.__class__.num_steps = args.sample_steps if args.sample_steps is not None else 50
wan_t2v.model.__class__.rel_l1_thresh = args.teacache_thresh # 2min54s, 0.05: 1min 55s(1.5x), 0.1, 1min 24s(2.1x) 0.15, 1min 6s, 0.08: 1min 27s(2x), 0.07: 1min 48s(1.6x), 0.06: 1min 51s
wan_t2v.model.__class__.accumulated_rel_l1_distance = 0
wan_t2v.model.__class__.previous_modulated_input = None
wan_t2v.model.__class__.previous_residual = None
wan_t2v.model.__class__.previous_residual_uncond = None
wan_t2v.model.__class__.should_calc = True
if '1.3B' in args.ckpt_dir:
wan_t2v.model.__class__.coefficients = [2.39676752e+03, -1.31110545e+03, 2.01331979e+02, -8.29855975e+00, 1.37887774e-01]
if '14B' in args.ckpt_dir:
wan_t2v.model.__class__.coefficients = [-5784.54975374, 5449.50911966, -1811.16591783, 256.27178429, -13.02252404]
wan_t2v.model.__class__.forward = teacache_forward
wan_t2v.model.__class__.forward = teacache_forward
wan_t2v.model.__class__.cnt = 0
wan_t2v.model.__class__.num_steps = args.sample_steps*2
wan_t2v.model.__class__.teacache_thresh = args.teacache_thresh
wan_t2v.model.__class__.accumulated_rel_l1_distance_even = 0
wan_t2v.model.__class__.accumulated_rel_l1_distance_odd = 0
wan_t2v.model.__class__.previous_e0_even = None
wan_t2v.model.__class__.previous_e0_odd = None
wan_t2v.model.__class__.previous_residual_even = None
wan_t2v.model.__class__.previous_residual_odd = None
wan_t2v.model.__class__.use_ref_steps = args.use_ret_steps
if args.use_ret_steps:
if '1.3B' in args.ckpt_dir:
wan_t2v.model.__class__.coefficients = [-5.21862437e+04, 9.23041404e+03, -5.28275948e+02, 1.36987616e+01, -4.99875664e-02]
if '14B' in args.ckpt_dir:
wan_t2v.model.__class__.coefficients = [-3.03318725e+05, 4.90537029e+04, -2.65530556e+03, 5.87365115e+01, -3.15583525e-01]
wan_t2v.model.__class__.ret_steps = 5*2
wan_t2v.model.__class__.cutoff_steps = args.sample_steps*2
else:
if '1.3B' in args.ckpt_dir:
wan_t2v.model.__class__.coefficients = [2.39676752e+03, -1.31110545e+03, 2.01331979e+02, -8.29855975e+00, 1.37887774e-01]
if '14B' in args.ckpt_dir:
wan_t2v.model.__class__.coefficients = [-5784.54975374, 5449.50911966, -1811.16591783, 256.27178429, -13.02252404]
wan_t2v.model.__class__.ret_steps = 1*2
wan_t2v.model.__class__.cutoff_steps = args.sample_steps*2 - 2
logging.info(
f"Generating {'image' if 't2i' in args.task else 'video'} ...")
video = wan_t2v.generate(
@@ -912,23 +948,34 @@ def generate(args):
use_usp=(args.ulysses_size > 1 or args.ring_size > 1),
t5_cpu=args.t5_cpu,
)
# TeaCache
wan_i2v.__class__.generate = i2v_generate
wan_i2v.model.__class__.cnt = 0
wan_i2v.model.__class__.enable_teacache = True
wan_i2v.model.__class__.num_steps = args.sample_steps if args.sample_steps is not None else 40
wan_i2v.model.__class__.rel_l1_thresh = args.teacache_thresh # 12min 26s
wan_i2v.model.__class__.accumulated_rel_l1_distance = 0
wan_i2v.model.__class__.previous_modulated_input = None
wan_i2v.model.__class__.previous_residual_cond = None
wan_i2v.model.__class__.previous_residual_uncond = None
wan_i2v.model.__class__.should_calc = True
if '480P' in args.ckpt_dir:
wan_i2v.model.__class__.coefficients = [-3.02331670e+02, 2.23948934e+02, -5.25463970e+01, 5.87348440e+00, -2.01973289e-01]
if '720P' in args.ckpt_dir:
wan_i2v.model.__class__.coefficients = [-114.36346466, 65.26524496, -18.82220707, 4.91518089, -0.23412683]
wan_i2v.model.__class__.forward = teacache_forward
wan_i2v.model.__class__.cnt = 0
wan_i2v.model.__class__.num_steps = args.sample_steps*2
wan_i2v.model.__class__.teacache_thresh = args.teacache_thresh
wan_i2v.model.__class__.accumulated_rel_l1_distance_even = 0
wan_i2v.model.__class__.accumulated_rel_l1_distance_odd = 0
wan_i2v.model.__class__.previous_e0_even = None
wan_i2v.model.__class__.previous_e0_odd = None
wan_i2v.model.__class__.previous_residual_even = None
wan_i2v.model.__class__.previous_residual_odd = None
wan_i2v.model.__class__.use_ref_steps = args.use_ret_steps
if args.use_ret_steps:
if '480P' in args.ckpt_dir:
wan_i2v.model.__class__.coefficients = [ 2.57151496e+05, -3.54229917e+04, 1.40286849e+03, -1.35890334e+01, 1.32517977e-01]
if '720P' in args.ckpt_dir:
wan_i2v.model.__class__.coefficients = [ 8.10705460e+03, 2.13393892e+03, -3.72934672e+02, 1.66203073e+01, -4.17769401e-02]
wan_i2v.model.__class__.ret_steps = 5*2
wan_i2v.model.__class__.cutoff_steps = args.sample_steps*2
else:
if '480P' in args.ckpt_dir:
wan_i2v.model.__class__.coefficients = [-3.02331670e+02, 2.23948934e+02, -5.25463970e+01, 5.87348440e+00, -2.01973289e-01]
if '720P' in args.ckpt_dir:
wan_i2v.model.__class__.coefficients = [-114.36346466, 65.26524496, -18.82220707, 4.91518089, -0.23412683]
wan_i2v.model.__class__.ret_steps = 1*2
wan_i2v.model.__class__.cutoff_steps = args.sample_steps*2 - 2
logging.info("Generating video ...")
video = wan_i2v.generate(
@@ -968,8 +1015,9 @@ def generate(args):
nrow=1,
normalize=True,
value_range=(-1, 1))
logging.info("Finished.")
logging.info("Finished.")
if __name__ == "__main__":
args = _parse_args()