ComfyUI

mirror of https://git.datalinker.icu/comfyanonymous/ComfyUI synced 2025-12-09 14:04:26 +08:00

Author	SHA1	Message	Date
rattus	0ff0457892	mm: wrap the raw stream in context manager (#10958 ) The documentation of torch.foo.Stream being usable with with: suggests it starts at version 2.7. Use the old API for backwards compatibility.	2025-11-28 16:38:12 -05:00
comfyanonymous	f55c98a89f	Disable offload stream when torch compile. (#10961 )	2025-11-28 16:16:46 -05:00
comfyanonymous	9d8a817985	Enable async offloading by default on Nvidia. (#10953 ) Add --disable-async-offload to disable it. If this causes OOMs that go away when you --disable-async-offload please report it.	2025-11-27 17:46:12 -05:00
rattus	f17251bec6	Account for the VRAM cost of weight offloading (#10733 ) * mm: default to 0 for NUM_STREAMS Don't count the compute stream as an offload stream. This makes async offload accounting easier. * mm: remove 128MB minimum This is from a previous offloading system requirement. Remove it to make behaviour of the loader and partial unloader consistent. * mp: order the module list by offload expense Calculate an approximate offloading temporary VRAM cost to offload a weight and primarily order the module load list by that. In the simple case this is just the same as the module weight, but with Loras, a weight with a lora consumes considerably more VRAM to do the Lora application on-the-fly. This will slightly prioritize lora weights, but is really for proper VRAM offload accounting. * mp: Account for the VRAM cost of weight offloading when checking the VRAM headroom, assume that the weight needs to be offloaded, and only load if it has space for both the load and offload * the number of streams. As the weights are ordered from largest to smallest by offload cost this is guaranteed to fit in VRAM (tm), as all weights that follow will be smaller. Make the partial unload aware of this system as well by saving the budget for offload VRAM to the model state and accounting accordingly. It's possible that partial unload increases the size of the largest offloaded weights, and thus needs to unload a little bit more than asked to accommodate the bigger temp buffers. Honor the existing code's floor on model weight loading of 128MB by having the patcher honor this separately without regard to offloading. Otherwise when MM specifies its 128MB minimum, MP will see the biggest weights, and budget that 128MB to only offload buffer and load nothing, which isn't the intent of these minimums. The same clamp applies in case of partial offload of the currently loading model.	2025-11-27 01:03:03 -05:00
comfyanonymous	b6805429b9	Allow pinning quantized tensors. (#10873 )	2025-11-25 02:48:20 -05:00
rattus	18e7d6dba5	mm/mp: always unload re-used but modified models (#10724 ) The partial unloader path in model re-use flow skips straight to the actual unload without any check of the patching UUID. This means that if you do an upscale flow with a model patch on an existing model, it will not apply your patchings. Fix by delaying the partial_unload until after the uuid checks. This is done by making partial_unload a model of partial_load where extra_mem is -ve.	2025-11-12 16:19:53 -05:00
comfyanonymous	1199411747	Don't pin tensor if not a torch.nn.parameter.Parameter (#10718 )	2025-11-11 19:33:30 -05:00
comfyanonymous	dea899f221	Unload weights if vram usage goes up between runs. (#10690 )	2025-11-09 18:51:33 -05:00
comfyanonymous	a1a70362ca	Only unpin tensor if it was pinned by ComfyUI (#10677 )	2025-11-07 11:15:05 -05:00
rattus	cf97b033ee	mm: guard against double pin and unpin explicitly (#10672 ) As commented, if you let cuda be the one to detect double pin/unpinning it actually creates an async GPU error.	2025-11-06 21:20:48 -05:00
comfyanonymous	09dc24c8a9	Pinned mem also seems to work on AMD. (#10658 )	2025-11-05 19:11:15 -05:00
comfyanonymous	1d69245981	Enable pinned memory by default on Nvidia. (#10656 ) Removed the --fast pinned_memory flag. You can use --disable-pinned-memory to disable it. Please report if it causes any issues.	2025-11-05 18:08:13 -05:00
comfyanonymous	7f3e4d486c	Limit amount of pinned memory on windows to prevent issues. (#10638 )	2025-11-04 17:37:50 -05:00
rattus	ab7ab5be23	Fix Race condition in --async-offload that can cause corruption (#10501 ) * mm: factor out the current stream getter Make this a reusable function. * ops: sync the offload stream with the consumption of w&b This sync is necessary as pytorch will queue cuda async frees on the same stream that created the tensor. In the case of async offload, this will be on the offload stream. Weights and biases can go out of scope in python which then triggers the pytorch garbage collector to queue the free operation on the offload stream, possibly before the compute stream has used the weight. This causes a use after free on weight data leading to total corruption of some workflows. So sync the offload stream with the compute stream after the weight has been used so the free has to wait for the weight to be used. The cast_bias_weight is extended in a backwards compatible way with the new behaviour opt-in on a defaulted parameter. This handles custom node packs calling cast_bias_weight and defeatures async-offload for them (as they do not handle the race). The pattern is now: cast_bias_weight(... , offloadable=True) #This might be offloaded thing(weight, bias, ...) uncast_bias_weight(...) * controlnet: adopt new cast_bias_weight synchronization scheme This is necessary for safe async weight offloading. * mm: sync the last stream in the queue, not the next Currently this peeks ahead to sync the next stream in the queue of streams with the compute stream. This doesn't allow a lot of parallelization, as the end result is you can only get one weight load ahead regardless of how many streams you have. Rotate the loop logic here to synchronize the end of the queue before returning the next stream. This allows weights to be loaded ahead of the compute stream's position.	2025-10-29 17:17:46 -04:00
comfyanonymous	3fa7a5c04a	Speed up offloading using pinned memory. (#10526 ) To enable this feature use: --fast pinned_memory	2025-10-29 00:21:01 -04:00
comfyanonymous	098a352f13	Add warning for torch-directml usage (#10482 ) Added a warning message about the state of torch-directml.	2025-10-25 20:05:22 -04:00
comfyanonymous	426cde37f1	Remove useless function (#10472 )	2025-10-24 19:56:51 -04:00
comfyanonymous	9cdc64998f	Only disable cudnn on newer AMD GPUs. (#10437 )	2025-10-21 19:15:23 -04:00
comfyanonymous	2c2aa409b0	Log message for cudnn disable on AMD. (#10418 )	2025-10-20 15:43:24 -04:00
comfyanonymous	5b80addafd	Turn off cuda malloc by default when --fast autotune is turned on. (#10393 )	2025-10-18 22:35:46 -04:00
comfyanonymous	1c10b33f9b	gfx942 doesn't support fp8 operations. (#10348 )	2025-10-15 00:21:11 -04:00
comfyanonymous	c8674bc6e9	Enable RDNA4 pytorch attention on ROCm 7.0 and up. (#10332 )	2025-10-13 21:19:03 -04:00
comfyanonymous	a125cd84b0	Improve AMD performance. (#10302 ) I honestly have no idea why this improves things but it does.	2025-10-12 00:28:01 -04:00
Guy Niv	c8d2117f02	Fix memory leak by properly detaching model finalizer (#9979 ) When unloading models in load_models_gpu(), the model finalizer was not being explicitly detached, leading to a memory leak. This caused linear memory consumption increase over time as models are repeatedly loaded and unloaded. This change prevents orphaned finalizer references from accumulating in memory during model switching operations.	2025-09-24 22:35:12 -04:00
DELUXA	8d6653fca6	Enable fp8 ops by default on gfx1200 (#9926 )	2025-09-18 19:50:37 -04:00
comfyanonymous	fb763d4333	Fix amd_min_version crash when cpu device. (#9754 )	2025-09-07 21:16:29 -04:00
comfyanonymous	bcbd7884e3	Don't enable pytorch attention on AMD if triton isn't available. (#9747 )	2025-09-07 00:29:38 -04:00
comfyanonymous	27a0fcccc3	Enable bf16 VAE on RDNA4. (#9746 )	2025-09-06 23:25:22 -04:00
comfyanonymous	0963493a9c	Support for Qwen Diffsynth Controlnets canny and depth. (#9465 ) These are not real controlnets but actually a patch on the model so they will be treated as such. Put them in the models/model_patches/ folder. Use the new ModelPatchLoader and QwenImageDiffsynthControlnet nodes.	2025-08-20 22:26:37 -04:00
Simon Lui	c991a5da65	Fix XPU iGPU regressions (#9322 ) * Change bf16 check and switch non-blocking to off default with option to force to regain speed on certain classes of iGPUs and refactor xpu check. * Turn non_blocking off by default for xpu. * Update README.md for Intel GPUs.	2025-08-13 19:13:35 -04:00
comfyanonymous	5828607ccf	Not sure if AMD actually supports fp16 acc but it doesn't crash. (#9258 )	2025-08-09 12:49:25 -04:00
comfyanonymous	735bb4bdb1	Users report gfx1201 is buggy on flux with pytorch attention. (#9244 )	2025-08-08 04:21:00 -04:00
comfyanonymous	7d593baf91	Extra reserved vram on large cards on windows. (#9093 )	2025-07-29 04:07:45 -04:00
comfyanonymous	69cb57b342	Print xpu device name. (#9035 )	2025-07-24 15:06:25 -04:00
honglyua	0ccc88b03f	Support Iluvatar CoreX (#8585 ) * Support Iluvatar CoreX Co-authored-by: mingjiang.li <mingjiang.li@iluvatar.com>	2025-07-24 13:57:36 -04:00
comfyanonymous	d3504e1778	Enable pytorch attention by default for gfx1201 on torch 2.8 (#9029 )	2025-07-23 19:21:29 -04:00
comfyanonymous	a86a58c308	Fix xpu function not implemented p2. (#9027 )	2025-07-23 18:18:20 -04:00
comfyanonymous	39dda1d40d	Fix xpu function not implemented. (#9026 )	2025-07-23 18:10:59 -04:00
comfyanonymous	5ad33787de	Add default device argument. (#9023 )	2025-07-23 14:20:49 -04:00
Simon Lui	255f139863	Add xpu version for async offload and some other things. (#9004 )	2025-07-22 15:20:09 -04:00
comfyanonymous	a96e65df18	Disable omnigen2 fp16 on older pytorch versions. (#8672 )	2025-06-26 03:39:09 -04:00
comfyanonymous	6e28a46454	Apple most likely is never fixing the fp16 attention bug. (#8485 )	2025-06-10 13:06:24 -04:00
comfyanonymous	7f800d04fa	Enable AMD fp8 and pytorch attention on some GPUs. (#8474 ) Information is from the pytorch source code.	2025-06-09 12:50:39 -04:00
comfyanonymous	97755eed46	Enable fp8 ops by default on gfx1201 (#8464 )	2025-06-08 14:15:34 -04:00
comfyanonymous	daf9d25ee2	Cleaner torch version comparisons. (#8453 )	2025-06-07 10:01:15 -04:00
comfyanonymous	704fc78854	Put ROCm version in tuple to make it easier to enable stuff based on it. (#8348 )	2025-05-30 15:41:02 -04:00
comfyanonymous	89a84e32d2	Disable initial GPU load when novram is used. (#8294 )	2025-05-26 16:39:27 -04:00
comfyanonymous	e5799c4899	Enable pytorch attention by default on AMD gfx1151 (#8282 )	2025-05-26 04:29:25 -04:00
comfyanonymous	0b50d4c0db	Add argument to explicitly enable fp8 compute support. (#8257 ) This can be used to test if your current GPU/pytorch version supports fp8 matrix mult in combination with --fast or the fp8_e4m3fn_fast dtype.	2025-05-23 17:43:50 -04:00
comfyanonymous	0a66d4b0af	Per device stream counters for async offload. (#7873 )	2025-04-29 20:28:52 -04:00

321 Commits
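
The headroom check described in commit f17251bec6 ("Account for the VRAM cost of weight offloading") can be sketched roughly as below. This is an illustration of the idea in the commit message only, not the code in ComfyUI: the names `Module`, `offload_cost`, `plan_load`, and `vram_budget` are hypothetical, sizes are plain integers, and the choice to keep reserving buffer space for the largest offloaded weight is an assumption drawn from the message's "all weights that follow will be smaller" argument.

```python
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    weight_bytes: int  # VRAM used if the weight stays resident
    offload_cost: int  # hypothetical estimate of temp VRAM needed to stream it in


def plan_load(modules, vram_budget, num_streams):
    """Decide which weights stay resident and which are offloaded.

    Modules are visited from largest to smallest offload cost. A weight is
    loaded only if the budget still covers the load itself plus the offload
    buffers (one per stream) for the largest weight that may be streamed.
    """
    ordered = sorted(modules, key=lambda m: m.offload_cost, reverse=True)
    resident, offloaded = [], []
    used = 0      # VRAM taken by resident weights
    reserve = 0   # VRAM set aside for offload buffers
    for m in ordered:
        # Pessimistically assume this weight itself might need to be offloaded.
        needed_reserve = max(reserve, m.offload_cost * num_streams)
        if used + m.weight_bytes + needed_reserve <= vram_budget:
            resident.append(m.name)
            used += m.weight_bytes
        else:
            # First offloaded weight is the largest, so its buffers bound the rest.
            offloaded.append(m.name)
            reserve = needed_reserve
    return resident, offloaded, used, reserve


# A LoRA-patched weight is given a larger offload cost than its plain size.
modules = [
    Module("attn", weight_bytes=400, offload_cost=400),
    Module("lora_attn", weight_bytes=300, offload_cost=600),
    Module("mlp", weight_bytes=200, offload_cost=200),
    Module("norm", weight_bytes=100, offload_cost=100),
]
resident, offloaded, used, reserve = plan_load(modules, vram_budget=1000, num_streams=1)
print(resident, offloaded, used, reserve)
```

Because the list is sorted by offload cost, the buffer reserved when the first weight is offloaded is large enough for every later one, which is the property the commit message relies on.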