* execution: Roll the UI cache into the outputs
Currently the UI cache sits parallel to the output cache and is expected
to be a content superset of it. At the same time, the two caches are
maintained completely separately, making it awkward to free output cache
content without changing the behaviour of the UI cache.
There are two actual users (getters) of the UI cache. The first is
the case of a direct content hit on the output cache when executing a
node. This case is handled very naturally by merging the UI and output
caches.
The second case is the history JSON generation at the end of the prompt.
This currently works by asking the cache for all_node_ids and then
pulling the cache contents for those nodes. all_node_ids is the set of
nodes in the dynamic prompt.
So fold the UI cache into the output cache. The current UI cache setter
now writes to a prompt-scope dict. When the output cache is set, just
get this value from the dict and tuple it up with the outputs.
When generating the history, simply iterate the prompt-scope dict.
This prepares for more complex caching strategies (like RAM-pressure
caching) where less than one full workflow will be cached and it is
desirable to keep the UI cache and output cache in sync.
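A minimal sketch of the folded flow (the names ui_outputs, set_ui,
set_output and the cache API here are illustrative stand-ins, not the
exact identifiers in execution.py):

```python
ui_outputs = {}  # prompt-scope dict, reset at the start of each prompt

def set_ui(node_id, ui_value):
    # The old UI-cache setter now just records into the prompt-scope dict.
    ui_outputs[node_id] = ui_value

def set_output(output_cache, node_id, outputs):
    # When the output cache is written, pick up the UI value and store the
    # pair, so a direct cache hit returns outputs and UI in one lookup.
    output_cache.set(node_id, (outputs, ui_outputs.get(node_id)))

def build_history():
    # History generation iterates the prompt-scope dict instead of asking
    # the cache for all_node_ids.
    return {nid: ui for nid, ui in ui_outputs.items() if ui is not None}
```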
* sd: Implement RAM getter for VAE
* model_patcher: Implement RAM getter for ModelPatcher
* sd: Implement RAM getter for CLIP
* Implement RAM Pressure cache
Implement a cache sensitive to RAM pressure. When RAM headroom drops
down below a certain threshold, evict RAM-expensive nodes from the
cache.
Models and tensors are measured directly for RAM usage. An OOM score
is then computed based on the RAM usage of the node.
Note that, due to indirection through shared objects (like a model
patcher), multiple nodes can account the same RAM as their individual
usage. The intent is that this frees chains of nodes, particularly
model loaders and their associated loras, since they all score similarly
and therefore sort close to each other.
This biases towards unloading model nodes mid-flow while still being
able to keep results like text encodings and the VAE.
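A rough sketch of the eviction idea, assuming hypothetical ram_usage()
and evict() accessors on cache entries and an illustrative headroom
threshold:

```python
import psutil

RAM_HEADROOM_MIN = 4 * 1024**3  # illustrative threshold, not the real default

def evict_on_ram_pressure(cache_entries):
    # Nothing to do while there is still headroom.
    if psutil.virtual_memory().available >= RAM_HEADROOM_MIN:
        return
    # Score entries by the RAM their outputs pin. Shared objects (one
    # ModelPatcher referenced by several nodes) score similarly on every
    # node that holds them, so loader/lora chains cluster together and
    # tend to be freed as a group.
    for entry in sorted(cache_entries, key=lambda e: e.ram_usage(), reverse=True):
        if psutil.virtual_memory().available >= RAM_HEADROOM_MIN:
            break
        entry.evict()
```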
* execution: Convert the cache entry to NamedTuple
As commented in review.
Convert this to a named tuple and abstract away the tuple type
completely from graph.py.
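For illustration, a cache entry along these lines (field names are
assumptions, not the real ones):

```python
from typing import Any, NamedTuple

class CacheEntry(NamedTuple):
    outputs: Any  # node output values
    ui: Any       # UI payload folded in from the prompt-scope dict

# Callers use entry.outputs / entry.ui instead of indexing a bare tuple,
# so the tuple layout is no longer part of graph.py's interface.
entry = CacheEntry(outputs=["latent"], ui={"images": []})
```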
* mm: factor out the current stream getter
Make this a reusable function.
* ops: sync the offload stream with the consumption of w&b
This sync is necessary because pytorch queues CUDA async frees on the
same stream that created the tensor. In the case of async offload, this
is the offload stream.
Weights and biases can go out of scope in python, which triggers the
pytorch garbage collection path to queue the free operation on the
offload stream, possibly before the compute stream has used the
weight. This causes a use-after-free on the weight data, leading to
total corruption of some workflows.
So sync the offload stream with the compute stream after the weight
has been used, so the free has to wait for that use.
cast_bias_weight is extended in a backwards-compatible way, with the
new behaviour opt-in via a defaulted parameter. This keeps custom node
packs that call cast_bias_weight working but disables async offload for
them (as they do not handle the race).
The pattern is now:
    cast_bias_weight(..., offloadable=True)  # this weight might be offloaded
    thing(weight, bias, ...)
    uncast_bias_weight(...)
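For example, a hedged sketch of the pattern inside an op's forward; the
arguments passed to uncast_bias_weight (and anything extra that
cast_bias_weight may return when offloadable=True) are assumptions, only
the overall shape follows the pattern above:

```python
import torch
import torch.nn.functional as F
from comfy.ops import cast_bias_weight, uncast_bias_weight  # uncast args assumed

class OffloadAwareLinear(torch.nn.Linear):
    def forward_comfy_cast_weights(self, input):
        # Opt in to the new behaviour; leaving offloadable at its default
        # keeps the old semantics (and no async offload) for unported callers.
        weight, bias = cast_bias_weight(self, input, offloadable=True)
        out = F.linear(input, weight, bias)
        # Sync the offload stream with the compute stream now that the weight
        # has been consumed, so the queued async free cannot run early.
        uncast_bias_weight(self, weight, bias)
        return out
```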
* controlnet: adopt new cast_bias_weight synchronization scheme
This is necessary for safe async weight offloading.
* mm: sync the last stream in the queue, not the next
Currently this peeks ahead and syncs the next stream in the queue of
streams with the compute stream. This doesn't allow much
parallelization, as the end result is you can only get one weight load
ahead regardless of how many streams you have.
Rotate the loop logic here to synchronize the end of the queue before
returning the next stream. This allows weights to be loaded ahead of the
compute stream's position.
* Implement mixed precision operations with a registry design and metadata for the quant spec in the checkpoint.
* Updated design using Tensor Subclasses
* Fix FP8 MM
* An actually functional POC
* Remove CK reference and ensure correct compute dtype
* Update unit tests
* ruff lint
* Fix missing keys
* Rename quant dtype parameter
* Fix unittests for CPU build
Same change pattern as 7e8dd275c243ad460ed5015d2e13611d81d2a569
applied to WAN2.2
If this suffers an exception (such as a VRAM OOM) it will leave the
encode() and decode() methods without running the cleanup of the WAN
feature cache. The comfy node cache then ultimately keeps a reference
to this object, which in turn holds references to large tensors from
the failed execution.
The feature cache is currently set up as a class variable on the
encoder/decoder; however, the encode and decode functions always clear
it on both entry and exit during normal execution.
It's likely the design intent was for this to be usable as a streaming
encoder where the input arrives in batches, but the functions as they
stand today don't support that.
So simplify by bringing the cache back to a local variable, so that if
it does hit a VRAM OOM the cache itself becomes proper garbage when the
encode()/decode() frames disappear from the stack.
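A simplified before/after sketch (class and attribute names here are
illustrative, not the real WAN VAE code):

```python
class WanVAE:
    def __init__(self, blocks):
        self.blocks = blocks
        # self._feat_cache = []  # old: state on the object outlives a VRAM
        #                        # OOM and is pinned via the comfy node cache

    def decode(self, z):
        feat_cache = {}  # new: local, freed when this frame unwinds, even
                         # if a block raises a VRAM OOM part way through
        for block in self.blocks:
            z = block(z, feat_cache)
        return z
```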
## Summary
Fixed incorrect type hint syntax in `MotionEncoder_tc.__init__()` parameter list.
## Changes
- Line 647: Changed `num_heads=int` to `num_heads: int`
- This corrects the parameter annotation from a default value assignment to proper type hint syntax
## Details
The parameter was using assignment syntax (`=`) instead of type annotation syntax (`:`), which would incorrectly set the default value to the `int` class itself rather than annotating the expected type.
When the VAE catches a VRAM OOM, it launches the tiled fallback logic
straight from the exception context.
Python, however, keeps references to the entire call stack that raised
the exception, including any local variables, for the sake of exception
reporting and debugging. In the case of tensors, this can hold
references to GBs of VRAM and prevent the allocator from freeing them.
So drop the except context completely before going back to the VAE
via the tiler, by leaving the except block with nothing but a flag.
This greatly increases the reliability of the tiler fallback,
especially on low-VRAM cards: with the bug, if the leak happened to
pin more than the headroom needed for a single tile, the tiler
fallback would itself OOM and fail the flow.
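The pattern, sketched with illustrative method names (the real VAE code
path differs):

```python
import torch

def decode_with_fallback(vae, samples):
    needs_tiled = False
    try:
        return vae.decode_full(samples)
    except torch.cuda.OutOfMemoryError:
        # Don't run the tiler from inside the except block: the in-flight
        # exception still references the failed call stack and its tensors.
        needs_tiled = True
    if needs_tiled:
        # The except context, and the VRAM it pinned, is gone by this point.
        return vae.decode_tiled(samples)
```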
* flux: math: Use addcmul_ to avoid an expensive VRAM intermediate
The rope process can be the VRAM peak, and allocating this intermediate
for the addition result before releasing the original can OOM.
Use addcmul_ instead.
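Generic illustration of the change (shapes and names are placeholders,
not the actual flux rope tensors):

```python
import torch

a = torch.randn(1, 24, 4096, 128)
b, c, d = torch.randn_like(a), torch.randn_like(a), torch.randn_like(a)

# Before: a * b + c * d materialises both products and a separate result
# tensor before anything can be released, which can set the VRAM peak.
# After: fuse the multiply-add in place into the first product's buffer.
out = (a * b).addcmul_(c, d)
```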
* wan: Delete the self attention before cross attention
This saves VRAM when the cross attention and FFN are what set the
VRAM peak.
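Sketch of the idea (function names are illustrative, not the exact WAN
block code):

```python
def block_forward(x, self_attn, cross_attn, ffn, context):
    y = self_attn(x)
    x = x + y
    del y  # drop the self-attention output now so it is not still alive
           # while the cross attention and FFN run and set the VRAM peak
    x = x + cross_attn(x, context)
    return x + ffn(x)
```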
When unloading models in load_models_gpu(), the model finalizer was not
being explicitly detached, leading to a memory leak. This caused
memory consumption to grow linearly over time as models are repeatedly
loaded and unloaded.
This change prevents orphaned finalizer references from accumulating in
memory during model switching operations.
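An illustrative sketch of the detach pattern (class and attribute names
are assumptions, not the real model_management code):

```python
import weakref

class LoadedModel:
    def __init__(self, model):
        self.model = model
        # weakref.finalize keeps its record (and the callback's closure)
        # alive until the finalizer fires or is detached.
        self._finalizer = weakref.finalize(model, LoadedModel._cleanup)

    @staticmethod
    def _cleanup():
        pass  # cleanup hook run when the model is garbage collected

    def model_unload(self):
        # Detach explicitly on unload so finalizer records do not accumulate
        # across repeated load/unload cycles.
        self._finalizer.detach()
        self.model = None
```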