[Cover image: a fennec girl swimming underwater, lit by sunlight filtering from above]

WaveSpeed


Introduction


WaveSpeed introduces dynamic caching (First Block Cache) and torch.compile capabilities, enabling faster and more efficient processing while maintaining high accuracy. By integrating WaveSpeed into your ComfyUI workflows, you can achieve significant speedups with only a modest impact on output quality.

Features


Apply First Block Cache

The Apply First Block Cache node provides a dynamic caching mechanism designed to optimize model inference performance for certain types of transformer-based models. The key idea behind FBCache is to determine when the output of the model has changed sufficiently since the last computation and, if not, reuse the previous result to skip unnecessary computations.

When a model is processed through the node, the (residual) output of the first transformer block is saved for future comparison. For each subsequent step in the model’s computation, the current residual output of the first transformer block is compared to the previous residual. If the difference between these two residuals is below the specified threshold (residual_diff_threshold), it indicates that the model’s output hasn’t changed significantly since the last computation.

When the comparison indicates minimal change, FBCache decides to reuse the previous result. This skips the computation of all subsequent transformer blocks for the current time step. To prevent over-reliance on cached results and maintain model accuracy, a limit can be set on how many consecutive cache hits can occur (max_consecutive_cache_hits). This ensures that the model periodically recomputes to incorporate new input information.

The start and end parameters define the time range during which FBCache can be applied.

Let $t$ be the current time step, and $r_t$ be the residual at time $t$.

The condition for using cached data is:

$$ |r_t - r_{t-1}| < \text{threshold} $$

If this condition holds, the model skips computing the subsequent transformer blocks and reuses the previous output. If not, it computes the full model as usual:

$$ \text{caching\_decision} = \left\{ \begin{array}{ll} 1 & \text{if } |r_t - r_{t-1}| < \text{threshold} \\ 0 & \text{otherwise} \end{array} \right. $$
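To make the mechanism concrete, here is a minimal Python sketch of that per-step decision. The structure and names (first_block, remaining_blocks, the state dict) are illustrative stand-ins, not WaveSpeed's actual implementation, which hooks into the model's forward pass inside ComfyUI:

```python
import torch

def fb_cache_forward(x, t, state, first_block, remaining_blocks, *,
                     residual_diff_threshold=0.1, max_consecutive_cache_hits=3,
                     start=0.0, end=1.0):
    """One denoising step with a First-Block-Cache-style shortcut.

    `state` carries the previous first-block residual, the cached final
    output, and the current run of consecutive cache hits.
    """
    residual = first_block(x)  # residual output of the first transformer block
    prev = state["prev_residual"]

    use_cache = (
        start <= t <= end
        and prev is not None
        and state["hits"] < max_consecutive_cache_hits
        and (residual - prev).abs().mean() < residual_diff_threshold
    )

    if use_cache:
        out = state["cached_output"]         # cache hit: skip the remaining blocks
        state["hits"] += 1
    else:
        out = remaining_blocks(x, residual)  # cache miss: full computation
        state["cached_output"] = out
        state["hits"] = 0

    state["prev_residual"] = residual
    return out


# Toy usage with stand-in blocks (not a real diffusion model):
state = {"prev_residual": None, "cached_output": None, "hits": 0}
first_block = lambda h: h * 0.5
remaining_blocks = lambda h, r: h + r
for step in range(5):
    x = torch.randn(1, 4)
    t = step / 4  # normalized "sampling percentage", as used by start/end
    _ = fb_cache_forward(x, t, state, first_block, remaining_blocks)
```

The threshold, consecutive-hit limit, and start/end window in the sketch correspond to the same knobs exposed by the node.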
[Figure: an XY Plot comparing inference times]

When evaluating FBCache, there is a plethora of variables to keep in mind, starting with the model you pair it with. In my experiments, for example, SDXL is a lot more stable with it than SD 3.5 Large. Prompt complexity and model stability also factor heavily, so your results will be somewhat personal to your setup. Lastly, your chosen sampler and scheduler affect how well it works: ancestral/SDE samplers, for instance, will prevent the cache from activating because they keep introducing random noise at each step.

If the scheduler produces very gradual changes between steps, the cache hit rate increases and expensive recalculations can be skipped. Karras and Exponential are particularly effective here because they are designed to spend more time sampling at the lower (i.e. finer) noise levels; their noise schedules create very smooth transitions between timesteps, resulting in smaller differences in the residual outputs of the first block. Schedulers that distribute their timesteps more uniformly, or with more abrupt changes, produce larger differences between consecutive steps, which reduces the opportunity for caching and thereby diminishes the performance gains.
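To see why this matters, here is a small sketch comparing step-to-step sigma spacing for a Karras schedule versus a uniformly spaced one. The formula is the standard Karras et al. (2022) schedule with rho = 7; the SD-like sigma range (14.6 down to 0.03) is an assumption for illustration:

```python
import torch

def karras_sigmas(n, sigma_min=0.03, sigma_max=14.6, rho=7.0):
    """Karras et al. (2022) noise schedule: dense steps at low noise levels."""
    ramp = torch.linspace(0, 1, n)
    min_inv_rho = sigma_min ** (1 / rho)
    max_inv_rho = sigma_max ** (1 / rho)
    return (max_inv_rho + ramp * (min_inv_rho - max_inv_rho)) ** rho

n = 20
karras = karras_sigmas(n)
uniform = torch.linspace(14.6, 0.03, n)

# Step-to-step sigma changes; the tail (low-noise) steps matter most for caching.
print("Karras  last 5 diffs:", (karras[:-1] - karras[1:])[-5:])
print("Uniform last 5 diffs:", (uniform[:-1] - uniform[1:])[-5:])
```

The Karras tail steps are far more closely spaced than the uniform ones, which is exactly the regime where consecutive first-block residuals are most likely to stay under the threshold.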

A notable GitHub issue by easygoing0114 in #87 explores different settings with some very nice findings.

Compile Model+

This node dynamically compiles the model through PyTorch’s torch._dynamo machinery, using a backend such as inductor depending on configuration.

⚠️ NOTE: Before playing with this node, it is important that you add the --gpu-only flag to your launch parameters, because the compilation may not work with model offloading.

Comfy-Cli users can just run:

comfy launch -- --gpu-only

The node accepts multiple parameters to control how the model is compiled:

  • model (Required Input): Takes any model input. This is where you connect your diffusion model output. Can be a base model, patched model, or one that’s already had First Block Cache applied.
  • is_patcher: Controls how the model is handled during compilation. true treats the input as a fresh model that needs patching, while false expects the input to already be a patcher object. Generally you will want to leave this as true.
  • object_to_patch: Specifies which part of the model architecture to optimize. The default value diffusion_model targets the main diffusion model component, which is typically what you want for standard Stable Diffusion workflows.
  • compiler: Selects which compiler to use; the default, torch.compile, uses PyTorch’s native compiler. The node dynamically imports the selected function.
  • fullgraph: When true, PyTorch will attempt to compile the entire model as a single graph. This may or may not result in longer compilation times, but it will definitely increase memory usage during compilation; false is generally the safer choice.
  • dynamic: Controls how the compiler handles varying input dimensions. When false, the compiler optimizes for fixed dimensions and expects consistent batch sizes, image resolutions, sequence lengths, and feature dimensions; it will recompile if any of these change, in exchange for better performance.
  • mode: Sets the optimization mode.
    • "" (empty): The default; uses the compiler’s default settings, which are fairly balanced.
    • max-autotune: A more aggressive optimization strategy in exchange for longer compilation times. Uses CUDA graphs to cache GPU operations.
    • max-autotune-no-cudagraphs: Similar to max-autotune but without CUDA graphs; useful if CUDA graphs cause issues.
  • options: Allows you to pass additional options to the compiler. When not empty, it expects valid JSON.
  • disable: Disables the compilation, ending the fun. Useful because bypassing the node doesn’t work. 🤷‍♂️
  • backend: Specifies the compilation backend to use; the default, inductor, is torch.compile’s standard backend.
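For intuition, the node roughly amounts to a torch.compile call over the selected sub-module. The sketch below approximates that mapping; it assumes a ComfyUI-style ModelPatcher layout (model_patcher.model.diffusion_model) and is not the node’s actual source:

```python
import json
import torch

def compile_patched_model(model_patcher, *, object_to_patch="diffusion_model",
                          fullgraph=False, dynamic=False, mode="",
                          options="", backend="inductor"):
    """Approximation of Compile Model+: compile one sub-module in place."""
    # Assumes the wrapped model exposes the sub-module as an attribute,
    # e.g. model_patcher.model.diffusion_model (an assumption for this sketch).
    target = getattr(model_patcher.model, object_to_patch)
    compiled = torch.compile(
        target,
        fullgraph=fullgraph,
        dynamic=dynamic,
        mode=mode or None,                               # "" -> compiler default mode
        options=json.loads(options) if options else None,
        backend=backend,
    )
    setattr(model_patcher.model, object_to_patch, compiled)
    return model_patcher
```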

Comparing WaveSpeed and TeaCache: Dynamic Caching in ComfyUI

Both WaveSpeed and TeaCache introduce dynamic caching mechanisms to accelerate transformer-based diffusion models in ComfyUI. While their goals are similar—reducing redundant computation for faster inference—they differ in implementation, configuration, and supported models.

How They Work

WaveSpeed (First Block Cache)

  • Mechanism: Caches the output of the first transformer block. If the difference between the current and previous residuals is below a threshold, it skips the rest of the blocks and reuses the previous output.
  • Configuration:
    • residual_diff_threshold: Controls cache sensitivity.
    • max_consecutive_cache_hits: Prevents overuse of cache.
    • start/end: Limits caching to a range of timesteps.
  • Supported Models: SDXL, SD3.5, FLUX, LTXV, HunyuanVideo, and more.
  • Integration: Provided as a node (Apply First Block Cache) in ComfyUI.

TeaCache

  • Mechanism: Inspired by similar ideas, but uses the relative L1 norm between outputs to decide when to cache. If the change is below rel_l1_thresh, it reuses the previous result (see the sketch after this list).
  • Configuration:
    • rel_l1_thresh: Sensitivity of cache.
    • start_percent/end_percent: Range of steps for caching.
  • Supported Models: FLUX, LTXV, HunyuanVideo, HiDream, Wan2.1, and more (see node options).
  • Integration: Provided as a node (TeaCache) in ComfyUI.
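The caching test itself differs mainly in the metric. A minimal sketch of the relative-L1 check (variable names are illustrative, not TeaCache’s internals):

```python
import torch

def should_reuse_cache(curr: torch.Tensor, prev: torch.Tensor, rel_l1_thresh: float) -> bool:
    """TeaCache-style test: relative L1 change between consecutive outputs."""
    rel_l1 = (curr - prev).abs().mean() / prev.abs().mean()
    return rel_l1.item() < rel_l1_thresh

# Example: a 2% relative change against a threshold of 0.05 -> reuse the cache
prev = torch.ones(8)
curr = prev * 1.02
print(should_reuse_cache(curr, prev, rel_l1_thresh=0.05))  # True
```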

Key Differences

| Feature | WaveSpeed (FBCache) | TeaCache |
| --- | --- | --- |
| Caching Metric | Residual difference (L2) | Relative L1 norm |
| Configurability | More granular (threshold, max hits, range) | Simpler (threshold, range) |
| Supported Models | Broad, including SDXL/3.5 | Broad, but model-specific |
| Compilation Support | Yes (Compile Model+ node) | Yes (Compile Model node) |
| Advanced Options | Max consecutive cache hits | N/A |
| Node Category | wavespeed | TeaCache |

When to Use Which?

  • WaveSpeed is ideal if you want fine-grained control, are working with SDXL/3.5, or want to experiment with advanced options like limiting cache hits.
  • TeaCache is a great choice for quick setup, especially with models like FLUX, LTXV, or HunyuanVideo, and if you want a simple threshold-based approach.

Advanced Tips & Troubleshooting

  • Sampler/Scheduler Impact: Both caches work best with schedulers that produce gradual changes (e.g., Karras, Exponential). Ancestral/SDE samplers may reduce cache effectiveness.
  • Prompt/Resolution Changes: Changing input dimensions or prompts may trigger recompilation (for compiled models). Enable dynamic mode if you see frequent recompiles.
  • Model Compatibility: Some nodes (e.g., FreeU Advanced) may interfere with caching. Check compatibility if you see issues.
  • Debugging: For WaveSpeed, set max_consecutive_cache_hits to a low value to force periodic full computation and check for artifacts.

Quick Reference Table

| Node Name | Package | Main Option(s) | Best For |
| --- | --- | --- | --- |
| Apply First Block Cache | WaveSpeed | residual_diff_threshold, max_consecutive_cache_hits | SDXL, SD3.5, FLUX, LTXV |
| TeaCache | TeaCache | rel_l1_thresh | FLUX, LTXV, HunyuanVideo |
| Compile Model+ | WaveSpeed | compiler, dynamic | All, with LoRA support |
| Compile Model | TeaCache | mode, backend | All |

Example: Using Both in a Workflow

You can chain these nodes for maximum effect. For example, apply Apply First Block Cache (WaveSpeed) after loading your model, then use Compile Model+ for further speedup. Or, use TeaCache for supported models and add Compile Model for compilation.

flowchart TD
    A[Load Diffusion Model] --> B[Apply First Block Cache]
    B --> C[Compile Model+]
    C --> D[Rest of Workflow]

Or, for TeaCache:

flowchart TD
    A[Load Diffusion Model] --> B[TeaCache]
    B --> C[Compile Model]
    C --> D[Rest of Workflow]

Conclusion

Both WaveSpeed and TeaCache offer powerful, configurable caching for faster inference in ComfyUI. Choose the one that best fits your model and workflow, and experiment with the thresholds for optimal results.