* Add Kandinsky5 model support
Lite and Pro T2V models tested and working
* Update kandinsky5.py
* Fix fp8
* Fix fp8_scaled text encoder
* Add transformer_options for attention
* Code cleanup, optimizations, use fp32 for all layers originally at fp32
* Add ImageToVideo node
* Fix I2V, add necessary latent post process nodes
* Support text to image model
* Support block replace patches (SLG mostly)
* Support official LoRAs
* Don't scale RoPE for lite model as that just doesn't work...
* Update supported_models.py
* Revert RoPE scaling to the simpler approach
* Fix typo
* Handle latent dim difference for image model in the VAE instead
* Add node to use different prompts for clip_l and qwen25_7b
* Reduce peak VRAM usage a bit
* Further reduce peak VRAM consumption by chunking ffn
* Update chunking
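A minimal sketch of the FFN chunking mentioned above, assuming a [B, T, D] hidden state and a generic feed-forward module; `chunked_ffn` and `chunk_size` are illustrative names, not the actual kandinsky5.py code:

```python
import torch

def chunked_ffn(x, ffn, chunk_size=4096):
    # Run the feed-forward network over slices of the token dimension so the
    # large intermediate activation only ever exists for one chunk at a time,
    # trading a little speed for a lower peak VRAM.
    if x.shape[1] <= chunk_size:
        return ffn(x)
    return torch.cat([ffn(chunk) for chunk in x.split(chunk_size, dim=1)], dim=1)
```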
* Update memory_usage_factor
* Code cleanup, don't force the fp32 layers as it has minimal effect
* Allow for stronger changes with first frames normalization
Default values are too weak for any meaningful change; these should probably be exposed as advanced node options once that's available.
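A rough sketch of the kind of normalization meant here, assuming a [B, C, T, H, W] latent; the function name, the `strength` parameter and the statistics used are illustrative, not the actual node implementation:

```python
import torch

def normalize_to_first_frames(latent, ref_frames=1, strength=1.0):
    # Pull each frame's per-channel mean/std toward the statistics of the
    # first `ref_frames` frames; strength=0 is a no-op, strength=1 matches
    # them fully, values > 1 allow the "stronger changes" mentioned above.
    ref = latent[:, :, :ref_frames]
    ref_mean = ref.mean(dim=(2, 3, 4), keepdim=True)
    ref_std = ref.std(dim=(2, 3, 4), keepdim=True)
    mean = latent.mean(dim=(3, 4), keepdim=True)
    std = latent.std(dim=(3, 4), keepdim=True)
    matched = (latent - mean) / (std + 1e-6) * ref_std + ref_mean
    return latent + strength * (matched - latent)
```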
* Add image model's own chat template, remove unused image2video template
* Remove hard error in the ReplaceVideoLatentFrames node
* Update kandinsky5.py
* Update supported_models.py
* Fix typos in prompt template
They have now been fixed in the original repository as well
* Update ReplaceVideoLatentFrames
Add tooltips
Make source optional
Better handle negative index
* Rename NormalizeVideoLatentFrames node
For a bit more clarity about what it does
* Fix NormalizeVideoLatentStart node output when it's a no-op
* Apply cond slice fix
* Add FreeNoise
* Update context_windows.py
* Add option to retain condition by indexes for each window
This allows, for example, Wan/HunyuanVideo image-to-video to "work" by reusing the initial start frame for each window; otherwise windows beyond the first would be pure T2V generations.
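Roughly, the idea is that when conditioning is sliced per window, selected frame indices can be pinned so every window still sees them. A hedged sketch, not the real context_windows.py API:

```python
def cond_indices_for_window(window_frames, retain_indexes=(0,)):
    # window_frames: latent frame indices covered by this context window.
    # Frames in retain_indexes (e.g. the I2V start frame) are prepended for
    # every window, so later windows stay image-conditioned instead of
    # degrading to pure T2V.
    pinned = [i for i in retain_indexes if i not in window_frames]
    return pinned + list(window_frames)
```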
* Update context_windows.py
* Allow splitting multiple conds into different windows
* Add handling for audio_embed
* whitespace
* Allow FreeNoise to work on other dims, handle 4D batched timesteps
Refactor the FreeNoise function and fix batch handling, as timesteps now appear to be expanded to the batch size.
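For reference, FreeNoise fills the temporal axis by re-shuffling frames of noise from an earlier window rather than sampling fresh noise, so overlapping context windows see correlated noise. A simplified sketch, with the axis made configurable as in the refactor above (names and defaults are illustrative):

```python
import torch

def free_noise(shape, context_length, context_overlap, dim=2, generator=None):
    noise = torch.randn(shape, generator=generator)
    length = shape[dim]
    stride = context_length - context_overlap
    for start in range(context_length, length, stride):
        # Shuffle the frames of the previous window (minus its overlap) and
        # copy them into the next stride, instead of drawing fresh noise.
        source = torch.arange(start - context_length, start - context_overlap)
        perm = source[torch.randperm(len(source), generator=generator)]
        end = min(start + stride, length)
        dst = torch.arange(start, end)
        noise.index_copy_(dim, dst, noise.index_select(dim, perm[: len(dst)]))
    return noise
```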
* Disable experimental options for now
So that FreeNoise and the bugfixes can be merged first
---------
Co-authored-by: Jedrzej Kosinski <kosinkadink1@gmail.com>
Co-authored-by: ozbayb <17261091+ozbayb@users.noreply.github.com>
I'm able to push VRAM above the estimate on partial unload. Bump the
estimate. This was experimentally determined with 720P and 480P
datapoints, calibrating for 24GB of total VRAM.
TIL that the WAN TE has a 2GB weight followed by 16MB as the next size
down. This means that 8GB-VRAM users would fully offload the TE in async
offload mode, as it just multiplied this giant size by the number of streams.
Do the more complex logic of summing up the upcoming to-load weight
sizes to avoid triple counting this massive weight.
Partial unload does the converse, recording the NUM_STREAMS most recent
unloads as they go.
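A hedged sketch of the accounting change, assuming the loader knows the module sizes in the order they will be loaded (names are illustrative, not the model management API):

```python
def async_reserve_bytes(module_sizes_in_load_order, index, num_streams):
    # Reserve VRAM for the weights that can actually be in flight on the
    # offload streams at once: the next `num_streams` entries in load order.
    # The old behaviour of `largest_weight * num_streams` triple-counted a
    # single huge weight such as the ~2GB WAN TE tensor.
    upcoming = module_sizes_in_load_order[index:index + num_streams]
    return sum(upcoming)
```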
* mp: only count the offload cost of math once
This was previously bundling the combined weight storage and computation
cost
* ops: put all post async transfer compute on the main stream
Some models have massive weights that need either complex
dequantization or LoRA patching. Don't do this patching on the offload
stream; instead do it on the main stream to synchronize the
potentially large VRAM spikes from these compute processes. This avoids
having to assume a worst-case scenario of multiple offload streams
all spiking VRAM in parallel with whatever the main stream is doing.
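A minimal sketch of the stream split described above, assuming CUDA and a caller-supplied patch function; names do not match the real comfy.ops API:

```python
import torch

def load_weight_async(weight_cpu, patch_fn, offload_stream, main_stream):
    # Copy the raw weight to the GPU on the offload stream, but run the
    # potentially VRAM-spiking dequantization / LoRA patching on the main
    # stream so those spikes serialize with the main stream's own work.
    with torch.cuda.stream(offload_stream):
        weight_gpu = weight_cpu.to("cuda", non_blocking=True)
    # The main stream must wait for the transfer before using the tensor.
    main_stream.wait_stream(offload_stream)
    with torch.cuda.stream(main_stream):
        return patch_fn(weight_gpu)
```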
* hunyuan upsampler: rework imports
Remove the transitive imports of VideoConv3d and Resnet and take these
from the actual implementation source.
* model: remove unused give_pre_end
According to git grep, this is not used now, and was not used in the
initial commit that introduced it (see below).
This semantic is difficult to implement for a temporal-roll VAE (and would
defeat its purpose). Rather than implement the complex conditional, just
delete the unused feature.
(venv) rattus@rattus-box2:~/ComfyUI$ git log --oneline
220afe33 (HEAD) Initial commit.
(venv) rattus@rattus-box2:~/ComfyUI$ git grep give_pre
comfy/ldm/modules/diffusionmodules/model.py: resolution, z_channels, give_pre_end=False, tanh_out=False, use_linear_attn=False,
comfy/ldm/modules/diffusionmodules/model.py: self.give_pre_end = give_pre_end
comfy/ldm/modules/diffusionmodules/model.py: if self.give_pre_end:
(venv) rattus@rattus-box2:~/ComfyUI$ git co origin/master
Previous HEAD position was 220afe33 Initial commit.
HEAD is now at 9d8a8179 Enable async offloading by default on Nvidia. (#10953)
(venv) rattus@rattus-box2:~/ComfyUI$ git grep give_pre
comfy/ldm/modules/diffusionmodules/model.py: resolution, z_channels, give_pre_end=False, tanh_out=False, use_linear_attn=False,
comfy/ldm/modules/diffusionmodules/model.py: self.give_pre_end = give_pre_end
comfy/ldm/modules/diffusionmodules/model.py: if self.give_pre_end:
* move refiner VAE temporal roller to core
Move the carrying conv op to the common VAE code and give it a better
name. Roll the carry implementation logic for Resnet into the base
class and scrap the Hunyuan-specific subclass.
* model: Add temporal roll to main VAE decoder
If there are no attention layers, it's a standard resnet and VideoConv3d
is asked for, so substitute in the temporal rolling VAE algorithm. This
reduces VAE memory usage by the temporal dimension (which can be a huge VRAM saving).
* model: Add temporal roll to main VAE encoder
If there are no attention layers, it's a standard resnet and VideoConv3d
is asked for, so substitute in the temporal rolling VAE algorithm. This
reduces VAE memory usage by the temporal dimension (which can be a huge VRAM saving).
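For illustration, the rolling idea amounts to a temporal convolution that carries its trailing input frames between chunks, so a long video can be processed a few latent frames at a time. A causal sketch under that assumption, not the actual VideoConv3d replacement:

```python
import torch
import torch.nn as nn

class RollingConv3d(nn.Module):
    # Temporal Conv3d that keeps a FIFO of the last (kernel_t - 1) input
    # frames between calls, so chunk-by-chunk processing matches a single
    # big convolution while only holding a couple of latent frames in VRAM.
    def __init__(self, in_ch, out_ch, kernel=3):
        super().__init__()
        self.kt = kernel
        self.conv = nn.Conv3d(in_ch, out_ch, kernel,
                              padding=(0, kernel // 2, kernel // 2))
        self.carry = None  # seam frames carried from the previous chunk

    def forward(self, x):  # x: [B, C, T, H, W]
        if self.carry is None:
            pad = x[:, :, :1].repeat(1, 1, self.kt - 1, 1, 1)  # replicate pad at t=0
        else:
            pad = self.carry
        self.carry = x[:, :, -(self.kt - 1):].detach()
        return self.conv(torch.cat([pad, x], dim=2))
```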
These are not actual controlnets, so put them in the models/model_patches
folder and use the ModelPatchLoader + QwenImageDiffsynthControlnet node to
use them.
* Support video tiny VAEs
* lighttaew scaling fix
* Also support video TAEs in previews
Only the first frame for now, as live preview playback is currently only available through VHS custom nodes.
* Support Wan 2.1 lightVAE
* Relocate elif block and set Wan VAE dim directly without using pruning rate for lightvae
* mm: default to 0 for NUM_STREAMS
Don't count the compute stream as an offload stream. This makes async
offload accounting easier.
* mm: remove 128MB minimum
This is from a previous offloading system requirement. Remove it to
make the behaviour of the loader and the partial unloader consistent.
* mp: order the module list by offload expense
Calculate an approximate temporary VRAM cost of offloading a
weight and primarily order the module load list by that. In the simple
case this is just the module weight size, but a weight with a LoRA
consumes considerably more VRAM to apply the LoRA on-the-fly.
This will slightly prioritize LoRA-patched weights, but is really for
proper VRAM offload accounting.
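A sketch of the ordering, assuming each entry knows its weight size and whether a LoRA patch applies (attribute names are hypothetical):

```python
def offload_cost(entry):
    # Approximate temporary VRAM needed to offload/reload this weight. A
    # plain weight costs its own size; a LoRA-patched weight needs extra
    # working space to apply the patch on-the-fly, so it costs more.
    cost = entry.weight_size
    if entry.has_lora:
        cost += entry.weight_size  # rough allowance for on-the-fly patching
    return cost

def order_load_list(entries):
    # Biggest offload cost first, so every later weight needs a smaller
    # VRAM reserve than the one already accounted for.
    return sorted(entries, key=offload_cost, reverse=True)
```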
* mp: Account for the VRAM cost of weight offloading
When checking the VRAM headroom, assume that the weight needs to be
offloaded, and only load if there is space for both the load and the offload
* the number of streams.
As the weights are ordered from largest to smallest by offload cost,
this is guaranteed to fit in VRAM (tm), as all weights that follow
will be smaller.
Make the partial unload aware of this system as well by saving the
budget for offload VRAM to the model state and accounting accordingly.
It's possible that partial unload increases the size of the largest
offloaded weights, and thus needs to unload a little bit more than
asked to accommodate the bigger temp buffers.
Honor the existing code's 128MB floor on model weight loading by
having the patcher honor it separately, without regard to offloading.
Otherwise, when MM specifies its 128MB minimum, MP will see the biggest
weights and budget that 128MB to only the offload buffer and load nothing,
which isn't the intent of these minimums. The same clamp applies in the
case of partial offload of the currently loading model.
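Put together, the headroom check looks roughly like this sketch, reusing offload_cost from the sketch above (illustrative only; the 128MB floor mirrors the existing minimum mentioned here):

```python
MIN_LOAD_BYTES = 128 * 1024 * 1024  # existing floor on model weight loading

def plan_loads(entries, free_vram, num_streams):
    # entries are ordered largest offload cost first (see the previous
    # commit), so once a weight fits with its offload reserve, every later
    # weight needs a smaller reserve and is guaranteed to fit too.
    loaded = 0
    for e in entries:
        reserve = offload_cost(e) * num_streams
        fits = free_vram - loaded - reserve >= e.weight_size
        # Honor the 128MB minimum regardless of the offload reserve, so the
        # minimum is spent on actual weights rather than offload buffers.
        if fits or loaded < MIN_LOAD_BYTES:
            loaded += e.weight_size
    return loaded
```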
* init
* update
* Update model.py
* Update model.py
* remove print
* Fix text encoding
* Prevent empty negative prompt
Really doesn't work otherwise
* fp16 works
* I2V
* Update model_base.py
* Update nodes_hunyuan.py
* Better latent rgb factors
* Use the correct sigclip output...
* Support HunyuanVideo1.5 SR model
* whitespaces...
* Proper latent channel count
* SR model fixes
This also still needs timestep scheduling based on the noise scale; it can already be used with two samplers.
* vae_refiner: roll the convolution through temporal
Work in progress.
Roll the convolution through time using 2-latent-frame chunks and a
FIFO queue for the convolution seams.
* Support HunyuanVideo15 latent resampler
* fix
* Some cleanup
Co-Authored-By: comfyanonymous <121283862+comfyanonymous@users.noreply.github.com>
* Proper hyvid15 I2V channels
Co-Authored-By: comfyanonymous <121283862+comfyanonymous@users.noreply.github.com>
* Fix TokenRefiner for fp16
Otherwise x.sum has infs. Just in case, only casting if the input is fp16; I don't know if that's necessary.
* Bugfix for the HunyuanVideo15 SR model
* vae_refiner: roll the convolution through temporal II
Roll the convolution through time using 2-latent-frame chunks and a
FIFO queue for the convolution seams.
Added support for the encoder, lowered to 1 latent frame to save more
VRAM, and made it work for Hunyuan Image 3.0 (as the code is shared).
Fixed names, cleaned up code.
* Allow any number of input frames in VAE.
* Better VAE encode mem estimation.
* Lowvram fix.
* Fix hunyuan image 2.1 refiner.
* Fix mistake.
* Name changes.
* Rename.
* Whitespace.
* Fix.
* Fix.
---------
Co-authored-by: kijai <40791699+kijai@users.noreply.github.com>
Co-authored-by: Rattus <rattus128@gmail.com>