1821 Commits

Author SHA1 Message Date
comfyanonymous
878db3a727
Implement the Ovis image model. (#11030) 2025-12-01 20:56:17 -05:00
comfyanonymous
2640acb31c
Update qwen tokenizer to add qwen 3 tokens. (#11029)
Doesn't actually change anything for current workflows because none of the
current models have a template with the think tokens.
2025-12-01 17:13:48 -05:00
comfyanonymous
0a6746898d
Make the ScaleRope node work on Z Image and Lumina. (#10994) 2025-11-29 18:00:55 -05:00
comfyanonymous
5151cff293
Add some missing z image lora layers. (#10980) 2025-11-28 23:55:00 -05:00
comfyanonymous
52a32e2b32
Support some z image lora formats. (#10978) 2025-11-28 21:12:42 -05:00
Jukka Seppänen
b907085709
Support video tiny VAEs (#10884)
* Support video tiny VAEs

* lighttaew scaling fix

* Also support video taes in previews

Only first frame for now as live preview playback is currently only available through VHS custom nodes.

* Support Wan 2.1 lightVAE

* Relocate elif block and set Wan VAE dim directly without using pruning rate for lightvae
2025-11-28 19:40:19 -05:00
rattus
0ff0457892
mm: wrap the raw stream in context manager (#10958)
The documentation of torch.foo.Stream being usable with with: suggests
it starts at version 2.7. Use the old API for backwards compatibility.
2025-11-28 16:38:12 -05:00
Urle Sistiana
6484ac89dc
fix QuantizedTensor.is_contiguous (#10956) (#10959) 2025-11-28 16:33:07 -05:00
comfyanonymous
f55c98a89f
Disable offload stream when torch compile. (#10961) 2025-11-28 16:16:46 -05:00
comfyanonymous
9d8a817985
Enable async offloading by default on Nvidia. (#10953)
Add --disable-async-offload to disable it.

If this causes OOMs that go away when you --disable-async-offload please
report it.
2025-11-27 17:46:12 -05:00
rattus
3f382a4f98
quant ops: Dequantize weight in-place (#10935)
In flux2 these weights are huge (200MB). As plain_tensor is a throw-away
deep copy, do this multiplication in-place to save VRAM.
2025-11-27 08:06:30 -08:00
rattus
f17251bec6
Account for the VRAM cost of weight offloading (#10733)
* mm: default to 0 for NUM_STREAMS

Don't count the compute stream as an offload stream. This makes async
offload accounting easier.

* mm: remove 128MB minimum

This is from a previous offloading system requirement. Remove it to
make behaviour of the loader and partial unloader consistent.

* mp: order the module list by offload expense

Calculate the approximate temporary VRAM cost of offloading a
weight and primarily order the module load list by that. In the simple
case this is just the same as the module weight, but with Loras, a
weight with a lora consumes considerably more VRAM to do the Lora
application on-the-fly.

This will slightly prioritize lora weights, but is really for
proper VRAM offload accounting.

* mp: Account for the VRAM cost of weight offloading

When checking the VRAM headroom, assume that the weight needs to be
offloaded, and only load if it has space for both the load and offload
 * the number of streams.

As the weights are ordered from largest to smallest by offload cost
this is guaranteed to fit in VRAM (tm), as all weights that follow
will be smaller.

Make the partial unload aware of this system as well by saving the
budget for offload VRAM to the model state and accounting accordingly.
It's possible that partial unload increases the size of the largest
offloaded weights, and thus needs to unload a little bit more than
asked to accommodate the bigger temp buffers.

Honor the existing code's floor on model weight loading of 128MB by
having the patcher honor this separately without regard to offloading.
Otherwise when MM specifies its 128MB minimum, MP will see the biggest
weights, and budget that 128MB to only offload buffer and load nothing
which isn't the intent of these minimums. The same clamp applies in
case of partial offload of the currently loading model.
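
A minimal sketch of the accounting described above, not the actual
ComfyUI code. The names `Module`, `plan_load`, `weight_bytes` and
`offload_cost_bytes` are hypothetical, and the sizes are made-up units.
It assumes modules are sorted by offload cost, largest first, and that a
module is only loaded when VRAM can hold its weight plus the offload
buffer (offload cost * number of streams) it would need if it were
offloaded instead.

```python
from dataclasses import dataclass


@dataclass
class Module:
    # Hypothetical record; not a ComfyUI type.
    name: str
    weight_bytes: int        # VRAM used if the weight stays resident
    offload_cost_bytes: int  # temporary VRAM needed to stream it (more with a lora)


def plan_load(modules, free_vram, num_streams):
    """Return (loaded, offloaded) module names under the VRAM budget."""
    # Largest offload cost first, so every later module needs a smaller buffer.
    ordered = sorted(modules, key=lambda m: m.offload_cost_bytes, reverse=True)
    used = 0     # VRAM committed to resident weights
    reserve = 0  # VRAM kept aside for offload buffers
    loaded, offloaded = [], []
    for m in ordered:
        # Assume this weight might need offloading: budget its buffer too.
        potential = max(reserve, m.offload_cost_bytes * num_streams)
        if used + m.weight_bytes + potential <= free_vram:
            loaded.append(m.name)
            used += m.weight_bytes
        else:
            offloaded.append(m.name)
            reserve = potential
    return loaded, offloaded
```

With `free_vram=200` and one stream, a 100-unit plain weight loads, while
a 30-unit weight whose lora raises its offload cost to 90 units does not,
because the weight and its buffer together would exceed the budget.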
2025-11-27 01:03:03 -05:00
Haoming
c38e7d6599
block info (#10841) 2025-11-26 20:28:44 -08:00
comfyanonymous
eaf68c9b5b
Make lora training work on Z Image and remove some redundant nodes. (#10927) 2025-11-26 19:25:32 -05:00
comfyanonymous
f16219e3aa
Add cheap latent preview for flux 2. (#10907)
Thank you to the person who calculated them. You saved me a percent of my
time.
2025-11-26 04:00:43 -05:00
comfyanonymous
58b8574661
Fix Flux2 reference image mem estimation. (#10905) 2025-11-26 02:36:19 -05:00
comfyanonymous
bdb10a583f
Fix loras not working on mixed fp8. (#10899) 2025-11-26 00:07:58 -05:00
comfyanonymous
0e24dbb19f
Adjustments to Z Image. (#10893) 2025-11-25 19:02:51 -05:00
comfyanonymous
e9aae31fa2
Z Image model. (#10892) 2025-11-25 18:41:45 -05:00
comfyanonymous
d196a905bb
Lower vram usage for flux 2 text encoder. (#10887) 2025-11-25 14:58:39 -05:00
comfyanonymous
dff996ca39
Fix crash. (#10885) 2025-11-25 14:30:24 -05:00
comfyanonymous
6b573ae0cb
Flux 2 (#10879) 2025-11-25 10:50:19 -05:00
comfyanonymous
015a0599d0
I found a case where this is needed (#10875) 2025-11-25 03:23:19 -05:00
comfyanonymous
acfaa5c4a1
Don't try fp8 matrix mult in quantized ops if not supported by hardware. (#10874) 2025-11-25 02:55:49 -05:00
comfyanonymous
b6805429b9
Allow pinning quantized tensors. (#10873) 2025-11-25 02:48:20 -05:00
comfyanonymous
25022e0b09
Cleanup and fix issues with text encoder quants. (#10872) 2025-11-25 01:48:53 -05:00
Haoming
b2ef58e2b1
block info (#10844) 2025-11-24 10:40:09 -08:00
Haoming
6a6d456c88
block info (#10842) 2025-11-24 10:38:38 -08:00
Haoming
3d1fdaf9f4
block info (#10843) 2025-11-24 10:30:40 -08:00
comfyanonymous
cbd68e3d58
Add better error message for common error. (#10846) 2025-11-23 04:55:22 -05:00
comfyanonymous
532938b16b
--disable-api-nodes now sets CSP header to force frontend offline. (#10829) 2025-11-21 17:51:55 -05:00
comfyanonymous
943b3b615d
HunyuanVideo 1.5 (#10819)
* init

* update

* Update model.py

* Update model.py

* remove print

* Fix text encoding

* Prevent empty negative prompt

Really doesn't work otherwise

* fp16 works

* I2V

* Update model_base.py

* Update nodes_hunyuan.py

* Better latent rgb factors

* Use the correct sigclip output...

* Support HunyuanVideo1.5 SR model

* whitespaces...

* Proper latent channel count

* SR model fixes

This still needs timestep scheduling based on the noise scale, but it can already be used with two samplers.

* vae_refiner: roll the convolution through temporal

Work in progress.

Roll the convolution through time using 2-latent-frame chunks and a
FIFO queue for the convolution seams.

* Support HunyuanVideo15 latent resampler

* fix

* Some cleanup

Co-Authored-By: comfyanonymous <121283862+comfyanonymous@users.noreply.github.com>

* Proper hyvid15 I2V channels

Co-Authored-By: comfyanonymous <121283862+comfyanonymous@users.noreply.github.com>

* Fix TokenRefiner for fp16

Otherwise x.sum has infs. Just in case, only cast if the input is fp16; I don't know if that is necessary.

* Bugfix for the HunyuanVideo15 SR model

* vae_refiner: roll the convolution through temporal II

Roll the convolution through time using 2-latent-frame chunks and a
FIFO queue for the convolution seams.

Added support for encoder, lowered to 1 latent frame to save more
VRAM, made work for Hunyuan Image 3.0 (as code shared).

Fixed names, cleaned up code.

* Allow any number of input frames in VAE.

* Better VAE encode mem estimation.

* Lowvram fix.

* Fix hunyuan image 2.1 refiner.

* Fix mistake.

* Name changes.

* Rename.

* Whitespace.

* Fix.

* Fix.

---------

Co-authored-by: kijai <40791699+kijai@users.noreply.github.com>
Co-authored-by: Rattus <rattus128@gmail.com>
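
The "roll the convolution through temporal" items above describe
processing frames in small chunks while a FIFO queue carries the frames
needed across chunk seams. The following is a simplified, hypothetical
illustration of that idea, not the vae_refiner code: each frame is a
single number, the convolution is a causal 1D one, and the function
names are made up for this sketch.

```python
from collections import deque


def causal_conv_full(frames, kernel):
    """Reference: causal 1D convolution over the whole sequence at once."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(frames)
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(frames))]


def causal_conv_chunked(frames, kernel, chunk=2):
    """Same result, but only `chunk` frames plus the seam are live at a time."""
    k = len(kernel)
    # FIFO holding the last k-1 frames seen; starts as zero padding.
    seam = deque([0.0] * (k - 1), maxlen=k - 1)
    out = []
    for start in range(0, len(frames), chunk):
        block = list(frames[start:start + chunk])
        window = list(seam) + block
        out.extend(sum(kernel[j] * window[i + j] for j in range(k))
                   for i in range(len(block)))
        seam.extend(block)  # oldest frames fall out of the queue
    return out
```

Both functions return the same values for any chunk size; the chunked
version trades one large intermediate buffer for a small rolling one.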
2025-11-20 22:44:43 -05:00
comfyanonymous
cb96d4d18c
Disable workaround on newer cudnn. (#10807) 2025-11-19 23:56:23 -05:00
comfyanonymous
17027f2a6a
Add a way to disable the final norm in the llama based TE models. (#10794) 2025-11-18 22:36:03 -05:00
comfyanonymous
d526974576
Fix hunyuan 3d 2.0 (#10792) 2025-11-18 16:46:19 -05:00
comfyanonymous
bd01d9f7fd
Add left padding support to tokenizers. (#10753) 2025-11-15 06:54:40 -05:00
comfyanonymous
443056c401
Fix custom nodes import error. (#10747)
This should fix the import errors but will break if the custom nodes actually try to use the class.
2025-11-14 03:26:05 -05:00
comfyanonymous
f60923590c
Use same code for chroma and flux blocks so that optimizations are shared. (#10746) 2025-11-14 01:28:05 -05:00
rattus
94c298f962
flux: reduce VRAM usage (#10737)
Clean up a bunch of stack tensors on Flux. This takes me from B=19 to B=22
for 1600x1600 on RTX5090.
2025-11-13 16:02:03 -08:00
contentis
3b3ef9a77a
Quantized Ops fixes (#10715)
* offload support, bug fixes, remove mixins

* add readme
2025-11-12 18:26:52 -05:00
rattus
1c7eaeca10
qwen: reduce VRAM usage (#10725)
Clean up a bunch of stacked and no-longer-needed tensors on the QWEN
VRAM peak (currently FFN).

With this I go from OOMing at B=37x1328x1328 to being able to
successfully run B=47 (RTX5090).
2025-11-12 16:20:53 -05:00
rattus
18e7d6dba5
mm/mp: always unload re-used but modified models (#10724)
The partial unloader path in model re-use flow skips straight to the
actual unload without any check of the patching UUID. This means that
if you do an upscale flow with a model patch on an existing model, it
will not apply your patchings.

Fix by delaying the partial_unload until after the uuid checks. This
is done by making partial_unload a mode of partial_load where extra_mem
is negative.
2025-11-12 16:19:53 -05:00
comfyanonymous
1199411747
Don't pin tensor if not a torch.nn.parameter.Parameter (#10718) 2025-11-11 19:33:30 -05:00
rattus
c350009236
ops: Put weight cast on the offload stream (#10697)
This needs to be on the offload stream. This reproduced a black screen
with low resolution images on a slow bus when using FP8.
2025-11-09 22:52:11 -05:00
comfyanonymous
dea899f221
Unload weights if vram usage goes up between runs. (#10690) 2025-11-09 18:51:33 -05:00
comfyanonymous
e632e5de28
Add logging for model unloading. (#10692) 2025-11-09 18:06:39 -05:00
comfyanonymous
2abd2b5c20
Make ScaleROPE node work on Flux. (#10686) 2025-11-08 15:52:02 -05:00
comfyanonymous
a1a70362ca
Only unpin tensor if it was pinned by ComfyUI (#10677) 2025-11-07 11:15:05 -05:00
rattus
cf97b033ee
mm: guard against double pin and unpin explicitly (#10672)
As commented, if you let cuda be the one to detect double pin/unpinning
it actually creates an async GPU error.
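
A minimal sketch of an explicit guard of this kind, assuming the caller
tracks which buffers it pinned itself. `PinTracker` is a hypothetical
name and not part of ComfyUI; the real pin and unpin calls are left as
comments.

```python
class PinTracker:
    """Remember which buffers we pinned so double pin/unpin never reaches the driver."""

    def __init__(self):
        self._pinned = set()  # ids of buffers pinned through this tracker

    def pin(self, buf):
        key = id(buf)
        if key in self._pinned:
            return False  # already pinned: skip instead of asking the backend
        # A real implementation would pin the host memory here.
        self._pinned.add(key)
        return True

    def unpin(self, buf):
        key = id(buf)
        if key not in self._pinned:
            return False  # never pinned by us: leave it alone
        # A real implementation would unpin the host memory here.
        self._pinned.remove(key)
        return True
```

Keying on `id(buf)` is only safe while the buffer stays alive; a real
tracker would need to handle object lifetime.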
2025-11-06 21:20:48 -05:00
comfyanonymous
09dc24c8a9
Pinned mem also seems to work on AMD. (#10658) 2025-11-05 19:11:15 -05:00