ComfyUI

mirror of https://git.datalinker.icu/comfyanonymous/ComfyUI synced 2025-12-09 22:14:34 +08:00

Author	SHA1	Message	Date
comfyanonymous	6fd463aec9	Fix regression when text encoder loaded directly on GPU. (#11129 )	2025-12-05 15:33:16 -05:00
comfyanonymous	43071e3de3	Make old scaled fp8 format use the new mixed quant ops system. (#11000 )	2025-12-05 14:35:42 -05:00
rattus	519c941165	Prs/lora reservations (reduce massive Lora reservations especially on Flux2) (#11069 ) * mp: only count the offload cost of math once This was previously bundling the combined weight storage and computation cost * ops: put all post async transfer compute on the main stream Some models have massive weights that need either complex dequantization or lora patching. Don't do these patchings on the offload stream, instead do them on the main stream to syncrhonize the potentially large vram spikes for these compute processes. This avoids having to assume a worst case scenario of multiple offload streams all spiking VRAM is parallel with whatever the main stream is doing.	2025-12-03 02:28:45 -05:00
rattus	0ff0457892	mm: wrap the raw stream in context manager (#10958 ) The documentation of torch.foo.Stream being usable with with: suggests it starts at version 2.7. Use the old API for backwards compatibility.	2025-11-28 16:38:12 -05:00
comfyanonymous	bdb10a583f	Fix loras not working on mixed fp8. (#10899 )	2025-11-26 00:07:58 -05:00
comfyanonymous	acfaa5c4a1	Don't try fp8 matrix mult in quantized ops if not supported by hardware. (#10874 )	2025-11-25 02:55:49 -05:00
comfyanonymous	25022e0b09	Cleanup and fix issues with text encoder quants. (#10872 )	2025-11-25 01:48:53 -05:00
comfyanonymous	cb96d4d18c	Disable workaround on newer cudnn. (#10807 )	2025-11-19 23:56:23 -05:00
contentis	3b3ef9a77a	Quantized Ops fixes (#10715 ) * offload support, bug fixes, remove mixins * add readme	2025-11-12 18:26:52 -05:00
rattus	c350009236	ops: Put weight cast on the offload stream (#10697 ) This needs to be on the offload stream. This reproduced a black screen with low resolution images on a slow bus when using FP8.	2025-11-09 22:52:11 -05:00
comfyanonymous	0f4ef3afa0	This seems to slow things down slightly on Linux. (#10624 )	2025-11-03 21:47:14 -05:00
comfyanonymous	0652cb8e2d	Speed up torch.compile (#10620 )	2025-11-03 17:37:12 -05:00
rattus	135fa49ec2	Small speed improvements to --async-offload (#10593 ) * ops: dont take an offload stream if you dont need one * ops: prioritize mem transfer The async offload streams reason for existence is to transfer from RAM to GPU. The post processing compute steps are a bonus on the side stream, but if the compute stream is running a long kernel, it can stall the side stream, as it wait to type-cast the bias before transferring the weight. So do a pure xfer of the weight straight up, then do everything bias, then go back to fix the weight type and do weight patches.	2025-11-01 18:48:53 -04:00
comfyanonymous	c58c13b2ba	Fix torch compile regression on fp8 ops. (#10580 )	2025-11-01 00:25:17 -04:00
comfyanonymous	906c089957	Fix small performance regression with fp8 fast and scaled fp8. (#10537 )	2025-10-29 19:29:01 -04:00
rattus	ab7ab5be23	Fix Race condition in --async-offload that can cause corruption (#10501 ) * mm: factor out the current stream getter Make this a reusable function. * ops: sync the offload stream with the consumption of w&b This sync is nessacary as pytorch will queue cuda async frees on the same stream as created to tensor. In the case of async offload, this will be on the offload stream. Weights and biases can go out of scope in python which then triggers the pytorch garbage collector to queue the free operation on the offload stream possible before the compute stream has used the weight. This causes a use after free on weight data leading to total corruption of some workflows. So sync the offload stream with the compute stream after the weight has been used so the free has to wait for the weight to be used. The cast_bias_weight is extended in a backwards compatible way with the new behaviour opt-in on a defaulted parameter. This handles custom node packs calling cast_bias_weight and defeatures async-offload for them (as they do not handle the race). The pattern is now: cast_bias_weight(... , offloadable=True) #This might be offloaded thing(weight, bias, ...) uncast_bias_weight(...) * controlnet: adopt new cast_bias_weight synchronization scheme This is nessacary for safe async weight offloading. * mm: sync the last stream in the queue, not the next Currently this peeks ahead to sync the next stream in the queue of streams with the compute stream. This doesnt allow a lot of parallelization, as then end result is you can only get one weight load ahead regardless of how many streams you have. Rotate the loop logic here to synchronize the end of the queue before returning the next stream. This allows weights to be loaded ahead of the compute streams position.	2025-10-29 17:17:46 -04:00
contentis	8817f8fc14	Mixed Precision Quantization System (#10498 ) * Implement mixed precision operations with a registry design and metadate for quant spec in checkpoint. * Updated design using Tensor Subclasses * Fix FP8 MM * An actually functional POC * Remove CK reference and ensure correct compute dtype * Update unit tests * ruff lint * Implement mixed precision operations with a registry design and metadate for quant spec in checkpoint. * Updated design using Tensor Subclasses * Fix FP8 MM * An actually functional POC * Remove CK reference and ensure correct compute dtype * Update unit tests * ruff lint * Fix missing keys * Rename quant dtype parameter * Rename quant dtype parameter * Fix unittests for CPU build	2025-10-28 16:20:53 -04:00
comfyanonymous	b4f30bd408	Pytorch is stupid. (#10398 )	2025-10-19 01:25:35 -04:00
comfyanonymous	5b80addafd	Turn off cuda malloc by default when --fast autotune is turned on. (#10393 )	2025-10-18 22:35:46 -04:00
comfyanonymous	9da397ea2f	Disable torch compiler for cast_bias_weight function (#10384 ) * Disable torch compiler for cast_bias_weight function * Fix torch compile.	2025-10-17 20:03:28 -04:00
comfyanonymous	b1293d50ef	workaround also works on cudnn 91200 (#10375 )	2025-10-16 19:59:56 -04:00
comfyanonymous	19b466160c	Workaround for nvidia issue where VAE uses 3x more memory on torch 2.9 (#10373 )	2025-10-16 18:16:03 -04:00
comfyanonymous	3374e900d0	Faster workflow cancelling. (#10301 )	2025-10-13 23:43:53 -04:00
comfyanonymous	139addd53c	More surgical fix for #10267 (#10276 )	2025-10-09 16:37:35 -04:00
Kohaku-Blueleaf	7be2b49b6b	Fix LoRA Trainer bugs with FP8 models. (#9854 ) * Fix adapter weight init * Fix fp8 model training * Avoid inference tensor	2025-09-20 21:24:48 -04:00
contentis	e2d1e5dad9	Enable Convolution AutoTuning (#9301 )	2025-09-01 20:33:50 -04:00
comfyanonymous	4e5c230f6a	Fix last commit not working on older pytorch. (#9346 )	2025-08-14 23:44:02 -04:00
Xiangxi Guo (Ryan)	f0d5d0111f	Avoid torch compile graphbreak for older pytorch versions (#9344 ) Turns out torch.compile has some gaps in context manager decorator syntax support. I've sent patches to fix that in PyTorch, but it won't be available for all the folks running older versions of PyTorch, hence this trivial patch.	2025-08-14 23:41:37 -04:00
comfyanonymous	9df8792d4b	Make last PR not crash comfy on old pytorch. (#9324 )	2025-08-13 15:12:41 -04:00
contentis	3da5a07510	SDPA backend priority (#9299 )	2025-08-13 14:53:27 -04:00
comfyanonymous	111f583e00	Fallback to regular op when fp8 op throws exception. (#8761 )	2025-07-02 00:57:13 -04:00
comfyanonymous	d42613686f	Fix issue with fp8 ops on some models. (#8045 ) _scaled_mm errors when an input is non contiguous.	2025-05-10 07:52:56 -04:00
comfyanonymous	ac10a0d69e	Make loras work with --async-offload (#7824 )	2025-04-26 19:56:22 -04:00
comfyanonymous	0dcc75ca54	Add experimental --async-offload lowvram weight offloading. (#7820 ) This should speed up the lowvram mode a bit. It currently is only enabled when --async-offload is used but it will be enabled by default in the future if there are no problems.	2025-04-26 16:11:21 -04:00
comfyanonymous	9ad792f927	Basic support for hidream i1 model.	2025-04-15 17:35:05 -04:00
comfyanonymous	8a438115fb	add RMSNorm to comfy.ops	2025-04-14 18:00:33 -04:00
catboxanon	1714a4c158	Add CublasOps support (#7574 ) * CublasOps support * Guard CublasOps behind --fast arg	2025-04-12 18:29:15 -04:00
comfyanonymous	70e15fd743	No need for scale_input when fp8 matrix mult is disabled.	2025-03-07 04:49:20 -05:00
comfyanonymous	e1474150de	Support fp8_scaled diffusion models that don't use fp8 matrix mult.	2025-03-07 04:39:21 -05:00
comfyanonymous	4dc6709307	Rename argument in last commit and document the options.	2025-03-01 02:43:49 -05:00
Chenlei Hu	4d55f16ae8	Use enum list for --fast options (#7024 )	2025-03-01 02:37:35 -05:00
comfyanonymous	cf0b549d48	--fast now takes a number as argument to indicate how fast you want it. The idea is that you can indicate how much quality vs speed you want. At the moment: --fast 2 enables fp16 accumulation if your pytorch supports it. --fast 5 enables fp8 matrix mult on fp8 models and the optimization above. --fast without a number enables all optimizations.	2025-02-28 02:48:20 -05:00
comfyanonymous	ab888e1e0b	Add add_weight_wrapper function to model patcher. Functions can now easily be added to wrap/modify model weights.	2025-02-12 05:55:35 -05:00
comfyanonymous	99a1fb6027	Make fast fp8 take a bit less peak memory.	2024-12-24 18:05:19 -05:00
Haoming	fbf68c4e52	clamp input (#5928 )	2024-12-07 14:00:31 -05:00
comfyanonymous	915fdb5745	Fix lowvram edge case.	2024-10-22 16:34:50 -04:00
comfyanonymous	8ce2a1052c	Optimizations to --fast and scaled fp8.	2024-10-22 02:12:28 -04:00
comfyanonymous	0075c6d096	Mixed precision diffusion models with scaled fp8. This change allows supports for diffusion models where all the linears are scaled fp8 while the other weights are the original precision.	2024-10-21 18:12:51 -04:00
comfyanonymous	83ca891118	Support scaled fp8 t5xxl model.	2024-10-20 22:27:00 -04:00
comfyanonymous	f9f9faface	Fixed model merging issue with scaled fp8.	2024-10-20 06:24:31 -04:00

1 2

85 Commits