WaveSpeed
Introduction
WaveSpeed introduces dynamic caching (First Block Cache) and torch.compile capabilities, enabling faster and more efficient processing while maintaining high accuracy. By integrating WaveSpeed into your ComfyUI workflows, you can achieve significant speedups without sacrificing much quality.
Features
Apply First Block Cache
The Apply First Block Cache node provides a dynamic caching mechanism designed to optimize model inference performance for certain types of transformer-based models. The key idea behind FBCache is to determine when the output of the model has changed sufficiently since the last computation and, if not, reuse the previous result to skip unnecessary computations.
When a model is processed through the node, the (residual) output of the first transformer block is saved for future comparison. For each subsequent step in the model's computation, the current residual output of the first transformer block is compared to the previous residual. If the difference between these two residuals is below the specified threshold (residual_diff_threshold), it indicates that the model's output hasn't changed significantly since the last computation.
When the comparison indicates minimal change, FBCache decides to reuse the previous result. This skips the computation of all subsequent transformer blocks for the current time step. To prevent over-reliance on cached results and maintain model accuracy, a limit can be set on how many consecutive cache hits can occur (max_consecutive_cache_hits). This ensures that the model periodically recomputes to incorporate new input information.
The start and end parameters define the time range during which FBCache can be applied.
Let $t$ be the current time step, and $r_t$ be the residual at time $t$.
The condition for using cached data is:
$$ |r_t - r_{t-1}| < \text{threshold} $$
If this condition holds, the model skips computing the subsequent transformer blocks and reuses the previous output. If not, it computes the full model as usual:
$$ \text{caching\_decision} = \left\{ \begin{array}{ll} 1 & \text{if } |r_t - r_{t-1}| < \text{threshold} \\ 0 & \text{otherwise} \end{array} \right. $$
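To make the mechanism concrete, here is a minimal sketch of that decision in Python. The function and parameter names are placeholders rather than WaveSpeed's actual code, and the treatment of a negative hit limit as "no limit" is an assumption; the real node patches the model's forward pass and may normalize the residual difference differently.

```python
import torch
from typing import Optional

def should_use_cache(first_block_residual: torch.Tensor,
                     prev_residual: Optional[torch.Tensor],
                     residual_diff_threshold: float,
                     consecutive_hits: int,
                     max_consecutive_cache_hits: int) -> bool:
    """Hypothetical, simplified FBCache decision for one sampling step."""
    if prev_residual is None:
        return False  # nothing cached yet: run the full model
    if max_consecutive_cache_hits >= 0 and consecutive_hits >= max_consecutive_cache_hits:
        return False  # hit limit reached: force a full recomputation
    # |r_t - r_{t-1}| as a mean absolute difference between the residuals
    diff = (first_block_residual - prev_residual).abs().mean()
    return diff.item() < residual_diff_threshold  # below threshold: skip remaining blocks
```

On a cache hit, the stored output of the remaining blocks is reused and the hit counter increments; on a miss, the full model runs, the counter resets, and a fresh residual is stored.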
When evaluating FBCache, there are many variables to keep in mind, starting with the model you pair it with. In my experiments, for example, SDXL is a lot more stable with it than SD 3.5 Large. Prompt complexity and model stability also weigh heavily, so your evaluation will always be somewhat personal. Lastly, your chosen sampler and scheduler affect its effectiveness: ancestral/SDE samplers, for example, will prevent the cache from activating because they keep introducing random noise at each step.
If the scheduler produces very gradual changes between steps, the cache hit rate increases and expensive recalculations can be skipped. Karras and Exponential are particularly effective here because they are designed to spend more time sampling at the lower (i.e. finer) noise levels. Their noise schedules tend to create very smooth transitions between timesteps, resulting in smaller differences in the residual outputs of the first block. Schedulers that distribute their timesteps more uniformly, or with more abrupt changes, tend to produce larger differences between consecutive steps, which reduces the opportunity for caching and thereby diminishes the performance gains.
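To see why, you can compare the step-to-step noise changes of a Karras schedule against a uniform one. This is only an illustration: the sigma range and step count below are arbitrary placeholder values, and the formula is the standard Karras et al. (2022) schedule rather than anything WaveSpeed-specific.

```python
import torch

def karras_sigmas(n: int, sigma_min: float = 0.03, sigma_max: float = 14.6,
                  rho: float = 7.0) -> torch.Tensor:
    """Karras noise schedule: timesteps are packed densely at low noise levels."""
    ramp = torch.linspace(0, 1, n)
    min_inv_rho = sigma_min ** (1 / rho)
    max_inv_rho = sigma_max ** (1 / rho)
    return (max_inv_rho + ramp * (min_inv_rho - max_inv_rho)) ** rho

steps = 20
karras = karras_sigmas(steps)
uniform = torch.linspace(14.6, 0.03, steps)

# The Karras steps shrink rapidly toward the low-noise end, so consecutive
# model outputs (and first-block residuals) change less and cache more often;
# the uniform schedule keeps a constant, larger step size throughout.
print(karras.diff().abs())
print(uniform.diff().abs())
```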
A notable GitHub issue by easygoing0114 in #87 explores different settings with some very nice findings.
Compile Model+
This node dynamically compiles the model using PyTorch's _dynamo or inductor backends, depending on configuration.
⚠️ NOTE: Before playing with this node, it is pretty important you add the --gpu-only flag to your launch parameters, because the compilation may not work with model offloading.
Comfy-Cli users can just run:
comfy launch -- --gpu-only
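If you launch ComfyUI directly instead of through Comfy-Cli, the same flag goes on the regular launch command (assuming the standard main.py entry point):
python main.py --gpu-only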
The node accepts multiple parameters to control how the model is compiled (a rough mapping to torch.compile arguments is sketched after the list):
- model (Required Input): Takes any model input. This is where you connect your diffusion model output. Can be a base model, a patched model, or one that's already had First Block Cache applied.
- is_patcher: Controls how the model is handled during compilation. true treats the input as a fresh model that needs patching, while false expects the input to already be a patcher object. Generally you will want to leave this as true.
- object_to_patch: Specifies which part of the model architecture to optimize. The default value diffusion_model targets the main diffusion model component, which is typically what you want for standard Stable Diffusion workflows.
- compiler: Selects which compiler to use. The default torch.compile uses PyTorch's native compiler. The node will dynamically import the selected function.
- fullgraph: When true, PyTorch will attempt to compile the entire model as a single graph. This may or may not result in longer compilation times, but it will definitely increase memory usage during compilation. false is generally the safer choice.
- dynamic: Controls how the compiler handles varying input dimensions. When false, the compiler will optimize for fixed dimensions: it will expect consistent batch sizes, image resolutions, sequence lengths and feature dimensions, and it will recompile if any of these change, in exchange for better performance.
- mode: Sets the optimization mode.
  - "" (empty): The default; uses the compiler's default settings, which are fairly balanced.
  - max-autotune: Offers a more aggressive optimization strategy in exchange for longer compilation times. Uses CUDA graphs for caching GPU operations.
  - max-autotune-no-cudagraphs: Similar to max-autotune, but without CUDA graphs; useful if they cause issues.
- options: Allows you to pass additional options to the compiler. When not empty, it expects valid JSON.
- disable: Disables the compilation, ending the fun. Useful because bypassing the node doesn't work. 🤷♂️
- backend: Specifies the compilation backend to use. The default is torch.compile's standard backend, inductor.
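These options map fairly directly onto PyTorch's torch.compile API. The sketch below is not the node's actual implementation, just an illustration of how the parameters line up; the attribute lookup via object_to_patch and the JSON handling for options are assumptions.

```python
import json
import torch

def compile_diffusion_model(model, object_to_patch: str = "diffusion_model",
                            fullgraph: bool = False, dynamic: bool = False,
                            mode: str = "", options: str = "",
                            backend: str = "inductor"):
    """Hypothetical sketch of how the Compile Model+ options could be
    forwarded to torch.compile."""
    target = getattr(model, object_to_patch)  # e.g. the diffusion_model component
    kwargs = {
        "fullgraph": fullgraph,  # single-graph compilation (more memory during compile)
        "dynamic": dynamic,      # tolerate changing input shapes instead of recompiling
        "backend": backend,      # "inductor" is torch.compile's default backend
    }
    if options:
        # torch.compile accepts either mode or options, not both at once
        kwargs["options"] = json.loads(options)
    elif mode:
        kwargs["mode"] = mode    # "", "max-autotune" or "max-autotune-no-cudagraphs"
    setattr(model, object_to_patch, torch.compile(target, **kwargs))
    return model
```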
Comparing WaveSpeed and TeaCache: Dynamic Caching in ComfyUI
Both WaveSpeed and TeaCache introduce dynamic caching mechanisms to accelerate transformer-based diffusion models in ComfyUI. While their goals are similar—reducing redundant computation for faster inference—they differ in implementation, configuration, and supported models.
How They Work
WaveSpeed (First Block Cache)
- Mechanism: Caches the output of the first transformer block. If the difference between the current and previous residuals is below a threshold, it skips the rest of the blocks and reuses the previous output.
- Configuration:
  - residual_diff_threshold: Controls cache sensitivity.
  - max_consecutive_cache_hits: Prevents overuse of the cache.
  - start / end: Limits caching to a range of timesteps.
- Supported Models: SDXL, SD3.5, FLUX, LTXV, HunyuanVideo, and more.
- Integration: Provided as a node (Apply First Block Cache) in ComfyUI.
TeaCache
- Mechanism: Inspired by similar ideas, but uses the relative L1 norm between outputs to decide when to cache. If the change is below rel_l1_thresh, it reuses the previous result (a small sketch of this metric follows the list).
- Configuration:
  - rel_l1_thresh: Sensitivity of the cache.
  - start_percent / end_percent: Range of steps for caching.
- Supported Models: FLUX, LTXV, HunyuanVideo, HiDream, Wan2.1, and more (see node options).
- Integration: Provided as a node (TeaCache) in ComfyUI.
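For intuition, the relative L1 check can be written in a couple of lines of tensor math. This is a simplified sketch of the metric only, not TeaCache's actual code, and the tensor names are placeholders.

```python
import torch

def below_rel_l1_threshold(current: torch.Tensor, previous: torch.Tensor,
                           rel_l1_thresh: float) -> bool:
    """Relative L1 change: mean absolute difference between the two outputs,
    normalized by the magnitude of the previous output."""
    rel_l1 = (current - previous).abs().mean() / previous.abs().mean()
    return rel_l1.item() < rel_l1_thresh  # below the threshold -> reuse cached result
```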
Key Differences
Feature | WaveSpeed (FBCache) | TeaCache |
---|---|---|
Caching Metric | Residual difference (L2) | Relative L1 norm |
Configurability | More granular (threshold, max hits, range) | Simpler (threshold, range) |
Supported Models | Broad, including SDXL/3.5 | Broad, but model-specific |
Compilation Support | Yes (Compile Model+ node) | Yes (Compile Model node) |
Advanced Options | Max consecutive cache hits | N/A |
Node Category | wavespeed | TeaCache |
When to Use Which?
- WaveSpeed is ideal if you want fine-grained control, are working with SDXL/3.5, or want to experiment with advanced options like limiting cache hits.
- TeaCache is a great choice for quick setup, especially with models like FLUX, LTXV, or HunyuanVideo, and if you want a simple threshold-based approach.
Advanced Tips & Troubleshooting
- Sampler/Scheduler Impact: Both caches work best with schedulers that produce gradual changes (e.g., Karras, Exponential). Ancestral/SDE samplers may reduce cache effectiveness.
- Prompt/Resolution Changes: Changing input dimensions or prompts may trigger recompilation (for compiled models). Enable dynamic mode if you see frequent recompiles.
- Model Compatibility: Some nodes (e.g., FreeU Advanced) may interfere with caching. Check compatibility if you see issues.
- Debugging: For WaveSpeed, set max_consecutive_cache_hits to a low value to force periodic full computation and check for artifacts.
Quick Reference Table
Node Name | Package | Main Option(s) | Best For |
---|---|---|---|
Apply First Block Cache | WaveSpeed | residual_diff_threshold, max_consecutive_cache_hits | SDXL, SD3.5, FLUX, LTXV |
TeaCache | TeaCache | rel_l1_thresh | FLUX, LTXV, HunyuanVideo |
Compile Model+ | WaveSpeed | compiler, dynamic | All, with LoRA support |
Compile Model | TeaCache | mode , backend | All |
Example: Using Both in a Workflow
You can chain these nodes for maximum effect. For example, add Apply First Block Cache (WaveSpeed) after loading your model, then use Compile Model+ for further speedup. Or, use TeaCache for supported models and add Compile Model for compilation.
flowchart TD
    A[Load Diffusion Model] --> B[Apply First Block Cache]
    B --> C[Compile Model+]
    C --> D[Rest of Workflow]
Or, for TeaCache:
flowchart TD
    A[Load Diffusion Model] --> B[TeaCache]
    B --> C[Compile Model]
    C --> D[Rest of Workflow]
Conclusion
Both WaveSpeed and TeaCache offer powerful, configurable caching for faster inference in ComfyUI. Choose the one that best fits your model and workflow, and experiment with the thresholds for optimal results.