1875 Commits

rattus
ab7ab5be23
Fix race condition in --async-offload that can cause corruption (#10501)
* mm: factor out the current stream getter

Make this a reusable function.
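
A plausible shape for such a helper (hypothetical code, not the actual patch):

    import torch

    def get_current_stream(device):
        # Hypothetical helper: return the stream the given device is
        # currently executing on, or None for non-CUDA devices.
        if device is not None and device.type == "cuda":
            return torch.cuda.current_stream(device)
        return None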

* ops: sync the offload stream with the consumption of w&b

This sync is necessary because PyTorch queues async CUDA frees on the
same stream that created the tensor. With async offload, that is the
offload stream.

Weights and biases can go out of scope in Python, which triggers the
PyTorch garbage collector to queue the free operation on the offload
stream, possibly before the compute stream has used the weight. This
causes a use-after-free on the weight data, leading to total corruption
of some workflows.

So sync the offload stream with the compute stream after the weight has
been used, forcing the free to wait until the weight has been consumed.

cast_bias_weight is extended in a backwards-compatible way, with the
new behaviour opt-in via a defaulted parameter. This covers custom node
packs that call cast_bias_weight directly and disables async offload
for them (as they do not handle the race).

The pattern is now:

cast_bias_weight(..., offloadable=True)  # this might be offloaded
thing(weight, bias, ...)
uncast_bias_weight(...)
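
A minimal sketch of the synchronization this pattern implies, using
PyTorch stream primitives (function signatures here are illustrative,
not the actual ComfyUI API):

    import torch

    def cast_bias_weight(module, offload_stream, offloadable=True):
        # Load the weight on the offload stream so compute keeps running.
        with torch.cuda.stream(offload_stream):
            weight = module.weight.to("cuda", non_blocking=True)
        # Compute must wait for the copy before consuming the weight.
        torch.cuda.current_stream().wait_stream(offload_stream)
        return weight

    def uncast_bias_weight(offload_stream):
        # Make the offload stream wait until compute has consumed the
        # weight; any async free PyTorch queues on the offload stream
        # (e.g. when the Python reference dies) now runs after that use.
        offload_stream.wait_stream(torch.cuda.current_stream())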

* controlnet: adopt new cast_bias_weight synchronization scheme

This is necessary for safe async weight offloading.

* mm: sync the last stream in the queue, not the next

Currently this peeks ahead to sync the next stream in the queue of
streams with the compute stream. That doesn't allow much
parallelization: the end result is that you can only get one weight
load ahead, regardless of how many streams you have.

Rotate the loop logic here to synchronize the end of the queue before
returning the next stream, as sketched below. This allows weights to be
loaded ahead of the compute stream's position.
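
One way to picture the rotated logic (a hypothetical round-robin pool,
not the actual model-management code):

    import torch
    from collections import deque

    streams = deque(torch.cuda.Stream() for _ in range(4))

    def next_offload_stream():
        # Sync the stream at the end of the queue (the one about to be
        # reused) with the compute stream, instead of the next one to be
        # handed out; loads queued on the other streams keep running
        # ahead of the compute stream's position.
        torch.cuda.current_stream().wait_stream(streams[-1])
        streams.rotate(1)  # the synced stream moves to the front
        return streams[0]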
2025-10-29 17:17:46 -04:00
comfyanonymous
ec4fc2a09a
Fix case of weights not being unpinned. (#10533) 2025-10-29 15:48:06 -04:00
comfyanonymous
1a58087ac2
Reduce memory usage for fp8 scaled op. (#10531) 2025-10-29 15:43:51 -04:00
comfyanonymous
e525673f72
Fix issue. (#10527) 2025-10-29 00:37:00 -04:00
comfyanonymous
3fa7a5c04a
Speed up offloading using pinned memory. (#10526)
To enable this feature use: --fast pinned_memory
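
Pinned (page-locked) host memory lets CUDA overlap host-device copies
with compute, which is why it speeds up offloading. Roughly (a generic
sketch of the technique, not this flag's implementation):

    import torch

    # Pageable -> pinned host tensor; copies from pinned memory can be
    # truly asynchronous with respect to the GPU.
    cpu_weight = torch.randn(4096, 4096).pin_memory()

    copy_stream = torch.cuda.Stream()
    with torch.cuda.stream(copy_stream):
        gpu_weight = cpu_weight.to("cuda", non_blocking=True)
    # Compute waits only when it actually needs the weight.
    torch.cuda.current_stream().wait_stream(copy_stream)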
2025-10-29 00:21:01 -04:00
contentis
8817f8fc14
Mixed Precision Quantization System (#10498)
* Implement mixed precision operations with a registry design and metadata for the quant spec in the checkpoint (see the sketch after this list).

* Updated design using Tensor Subclasses

* Fix FP8 MM

* An actually functional POC

* Remove CK reference and ensure correct compute dtype

* Update unit tests

* ruff lint

* Fix missing keys

* Rename quant dtype parameter

* Fix unittests for CPU build
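
As referenced above, a minimal sketch of the registry idea (names and
spec format are hypothetical, not the merged design):

    import torch

    # Each quant format registers how to run a linear layer; the format
    # name would be read from the checkpoint's quant metadata.
    QUANT_OPS = {}

    def register(quant_dtype):
        def deco(fn):
            QUANT_OPS[quant_dtype] = fn
            return fn
        return deco

    @register("float8_e4m3fn")
    def linear_fp8(x, weight, bias, scale):
        # Dequantize for clarity; a real kernel would use scaled FP8 matmul.
        w = weight.to(x.dtype) * scale
        return torch.nn.functional.linear(x, w, bias)

    def quant_linear(x, weight, bias, spec):
        # spec comes from checkpoint metadata, e.g.
        # {"quant_dtype": "float8_e4m3fn", "scale": 0.02}
        return QUANT_OPS[spec["quant_dtype"]](x, weight, bias, spec["scale"])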
2025-10-28 16:20:53 -04:00
comfyanonymous
f6bbc1ac84
Fix mistake. (#10484) 2025-10-25 23:07:29 -04:00
comfyanonymous
098a352f13
Add warning for torch-directml usage (#10482)
Added a warning message about the state of torch-directml.
2025-10-25 20:05:22 -04:00
comfyanonymous
426cde37f1
Remove useless function (#10472) 2025-10-24 19:56:51 -04:00
comfyanonymous
1bcda6df98
WIP way to support multi-dimensional latents. (#10456) 2025-10-23 21:21:14 -04:00
strint
dc7c77e78c better partial unload 2025-10-23 18:09:47 +08:00
strint
c312733b8c refine log 2025-10-23 15:53:35 +08:00
strint
58d28edade no limit for offload size 2025-10-23 15:50:57 +08:00
strint
aab0e244f7 fix MMAP_MEM_THRESHOLD_GB default 2025-10-23 14:44:51 +08:00
strint
f3c673d086 Merge branch 'master' of https://github.com/siliconflow/ComfyUI into refine_offload 2025-10-22 21:15:28 +08:00
comfyanonymous
9cdc64998f
Only disable cudnn on newer AMD GPUs. (#10437) 2025-10-21 19:15:23 -04:00
strint
98ba311511 add env 2025-10-21 19:06:34 +08:00
strint
80383932ec lazy rm file 2025-10-21 18:00:31 +08:00
strint
08e094ed81 use native mmap 2025-10-21 17:00:56 +08:00
strint
fff56de63c fix format 2025-10-21 11:59:59 +08:00
strint
2d010f545c refine code 2025-10-21 11:54:56 +08:00
strint
2f0d56656e refine code 2025-10-21 11:38:17 +08:00
comfyanonymous
2c2aa409b0
Log message for cudnn disable on AMD. (#10418) 2025-10-20 15:43:24 -04:00
strint
05c2518c6d refactor mmap 2025-10-21 02:59:51 +08:00
strint
8aeebbf7ef fix to 2025-10-21 02:27:40 +08:00
strint
49561788cf fix log 2025-10-21 02:03:38 +08:00
strint
e9e1d2f0e8 add mmap tensor 2025-10-21 00:40:14 +08:00
strint
4ac827d564 unload partial 2025-10-20 18:27:38 +08:00
strint
21ebcada1d debug free mem 2025-10-20 16:22:50 +08:00
comfyanonymous
b4f30bd408
Pytorch is stupid. (#10398) 2025-10-19 01:25:35 -04:00
comfyanonymous
dad076aee6
Speed up chroma radiance. (#10395) 2025-10-18 23:19:52 -04:00
comfyanonymous
0cf33953a7
Fix batch size above 1 giving bad output in chroma radiance. (#10394) 2025-10-18 23:15:34 -04:00
comfyanonymous
5b80addafd
Turn off cuda malloc by default when --fast autotune is turned on. (#10393) 2025-10-18 22:35:46 -04:00
comfyanonymous
9da397ea2f
Disable torch compiler for cast_bias_weight function (#10384)
* Disable torch compiler for cast_bias_weight function

* Fix torch compile.
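
The gist of the change, as a sketch (torch.compiler.disable is the real
torch API; the function body here is illustrative):

    import torch

    @torch.compiler.disable
    def cast_bias_weight(s, input):
        # Kept in eager mode even inside a torch.compile'd model: the
        # device/stream juggling in here is opaque to the compiler and
        # breaks tracing.
        weight = s.weight.to(device=input.device, dtype=input.dtype)
        bias = None
        if s.bias is not None:
            bias = s.bias.to(device=input.device, dtype=input.dtype)
        return weight, bias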
2025-10-17 20:03:28 -04:00
strint
49597bfa3e load remains mmap 2025-10-17 21:43:49 +08:00
strint
6583cc0142 debug load mem 2025-10-17 18:28:25 +08:00
strint
5c3c6c02b2 add debug log of cpu load 2025-10-17 16:33:14 +08:00
comfyanonymous
b1293d50ef
workaround also works on cudnn 91200 (#10375) 2025-10-16 19:59:56 -04:00
comfyanonymous
19b466160c
Workaround for nvidia issue where VAE uses 3x more memory on torch 2.9 (#10373) 2025-10-16 18:16:03 -04:00
strint
e5ff6a1b53 refine log 2025-10-16 22:47:03 +08:00
strint
9352987e9b add log 2025-10-16 22:25:17 +08:00
strint
c1eac555c0 add debug log 2025-10-16 21:42:48 +08:00
strint
2b222962c3 add debug log 2025-10-16 21:42:02 +08:00
strint
fa19dd4620 debug offload 2025-10-16 17:00:47 +08:00
strint
6e33ee391a debug error 2025-10-16 16:45:08 +08:00
Faych
afa8a24fe1
refactor: Replace manual patches merging with merge_nested_dicts (#10360) 2025-10-15 17:16:09 -07:00
Jedrzej Kosinski
493b81e48f
Fix order of inputs nested merge_nested_dicts (#10362) 2025-10-15 16:47:26 -07:00
comfyanonymous
1c10b33f9b
gfx942 doesn't support fp8 operations. (#10348) 2025-10-15 00:21:11 -04:00
comfyanonymous
3374e900d0
Faster workflow cancelling. (#10301) 2025-10-13 23:43:53 -04:00
comfyanonymous
dfff7e5332
Better memory estimation for the SD/Flux VAE on AMD. (#10334) 2025-10-13 22:37:19 -04:00