PatchModelAddDownscale (Kohya DeepShrink)
Introduction
When generating high-resolution images with diffusion models, various issues often emerge, such as inconsistent anatomy, poor composition, or unnatural structures. The PatchModelAddDownscale technique, also known as Kohya DeepShrink, addresses these problems with an approach to the diffusion process that maintains quality while enabling higher resolutions.
What is PatchModelAddDownscale?
The PatchModelAddDownscale node (Kohya DeepShrink) enables higher-resolution generation with better consistency by applying strategic downscaling to the UNet during early denoising steps. It adds a downscaling operation to specific model blocks, allowing the model to establish composition at a lower resolution first before gradually transitioning to full resolution. This helps prevent consistency issues like incorrect anatomy or composition problems that often occur at higher resolutions.
Mathematical Foundation
The enhanced version (v2) implements a three-phase scaling process with a gradual transition:
Full Downscale (when $\sigma_\text{start} \geq \sigma \geq \sigma_\text{end}$):
$$h' = D\left(h, \frac{W}{f}, \frac{H}{f}\right)$$
Gradual Transition (when $\sigma_\text{end} > \sigma \geq \sigma_\text{gradual}$):
$$s(\sigma) = \frac{1}{f} + \left(1 - \frac{1}{f}\right) \cdot \frac{\sigma_\text{end} - \sigma}{\sigma_\text{end} - \sigma_\text{gradual}}$$
$$h' = D(h, W \cdot s(\sigma), H \cdot s(\sigma))$$
Original Size (when $\sigma < \sigma_\text{gradual}$):
$$h' = h$$
Where:
- $h$ is the latent representation at a given UNet block
- $D$ is the downscaling function using the specified method
- $W, H$ are the original width and height
- $f$ is the downscale factor
- $\sigma$ is the current noise level
- $\sigma_\text{start}, \sigma_\text{end}, \sigma_\text{gradual}$ are noise levels corresponding to the percentage parameters
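The three phases above can be sketched as a single function that returns the scale applied to $W$ and $H$ at a given noise level. This is a minimal illustration, not the node's actual code: the names `deepshrink_scale` and `target_size` are hypothetical, rounding to whole pixels is an assumption, and so is the behaviour for $\sigma > \sigma_\text{start}$, which the formulas do not cover (the sketch leaves the latent at full size there).

```python
def deepshrink_scale(sigma, sigma_start, sigma_end, sigma_gradual, factor):
    """Return the scale s applied to width and height at noise level sigma."""
    if sigma > sigma_start:
        # Assumption: before the start threshold nothing is downscaled.
        return 1.0
    if sigma >= sigma_end:
        # Full downscale phase: s = 1 / f.
        return 1.0 / factor
    if sigma >= sigma_gradual:
        # Gradual transition: interpolate linearly from 1/f up to 1.
        progress = (sigma_end - sigma) / (sigma_end - sigma_gradual)
        return 1.0 / factor + (1.0 - 1.0 / factor) * progress
    # Original size phase.
    return 1.0


def target_size(width, height, scale):
    """Scale a (width, height) pair, rounding to whole units (assumption)."""
    return max(1, round(width * scale)), max(1, round(height * scale))


# Illustrative noise levels only; real values depend on the model and sampler.
SIGMA_START, SIGMA_END, SIGMA_GRADUAL, FACTOR = 14.0, 4.0, 1.0, 2.0

for sigma in (10.0, 4.0, 2.5, 1.0, 0.5):
    s = deepshrink_scale(sigma, SIGMA_START, SIGMA_END, SIGMA_GRADUAL, FACTOR)
    print(sigma, s, target_size(128, 128, s))
```

Because $s(\sigma_\text{end}) = 1/f$ and $s(\sigma_\text{gradual}) = 1$, the scale is continuous at both phase boundaries, which is what makes the transition gradual rather than a single jump.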
Key Parameters
The effectiveness of DeepShrink depends on several key parameters:
- downscale_factor ($f$): Typically 2.0, controls the amount of downscaling
- start_percent and end_percent: Define when in the sampling process scaling occurs
- gradual_percent: Controls when the transition to full resolution completes
- downscale_method: Method used for downscaling (typically “bicubic”)
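One way to keep these parameters consistent is to group them in a small validated container. The sketch below is hypothetical and not part of the node: the class name `DeepShrinkParams` is invented, the percentages are assumed to be fractions of sampling progress between 0 and 1, and the ordering check assumes that noise decreases as sampling progresses, so that $\sigma_\text{start} \geq \sigma_\text{end} \geq \sigma_\text{gradual}$ corresponds to start_percent ≤ end_percent ≤ gradual_percent. The example values are purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeepShrinkParams:
    """Hypothetical container for the parameters listed above."""
    start_percent: float
    end_percent: float
    gradual_percent: float
    downscale_factor: float = 2.0      # "typically 2.0"
    downscale_method: str = "bicubic"  # "typically bicubic"

    def __post_init__(self):
        # Assumption: percentages are fractions of sampling progress in [0, 1].
        for name in ("start_percent", "end_percent", "gradual_percent"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")
        # The phases must occur in order: downscale, then transition.
        if not self.start_percent <= self.end_percent <= self.gradual_percent:
            raise ValueError(
                "expected start_percent <= end_percent <= gradual_percent"
            )
        # A factor below 1 would upscale rather than downscale.
        if self.downscale_factor < 1.0:
            raise ValueError("downscale_factor must be at least 1.0")


# Illustrative values, not recommended settings.
params = DeepShrinkParams(start_percent=0.0, end_percent=0.35, gradual_percent=0.6)
print(params)
```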
How It Works
DeepShrink operates on a simple principle: diffusion models often perform better at establishing overall composition at lower resolutions, then refining details at higher resolutions. The technique implements this by:
- Early Steps (High Noise Levels): Downscaling the latent representation during the early denoising steps, which forces the model to focus on overall composition and structure.
- Transition Phase: Gradually increasing the resolution as the denoising progresses, which allows the model to smoothly adapt from broad strokes to finer details.
- Final Steps (Low Noise Levels): Processing at full resolution for the final steps, where the detailed elements are refined.
This approach bridges the gap between the model’s training resolution and the target generation resolution, resulting in more coherent outputs.
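To make the three stages concrete, the toy example below resizes a small 2D grid as a stand-in for the latent at one UNet block. It is a sketch under stated assumptions: nearest-neighbour sampling is swapped in for the bicubic method to keep the code dependency-free, the helper names `resize_nearest` and `apply_phase` are hypothetical, and the per-phase scales (0.5, 0.75, 1.0) are illustrative values for a downscale factor of 2.

```python
def resize_nearest(grid, new_h, new_w):
    """Resize a 2D list with nearest-neighbour sampling (stand-in for bicubic)."""
    old_h, old_w = len(grid), len(grid[0])
    return [
        [
            grid[min(old_h - 1, int(y * old_h / new_h))]
                [min(old_w - 1, int(x * old_w / new_w))]
            for x in range(new_w)
        ]
        for y in range(new_h)
    ]


def apply_phase(grid, scale):
    """Downscale the grid by `scale`; a scale of 1.0 leaves it untouched."""
    if scale >= 1.0:
        return grid
    new_h = max(1, round(len(grid) * scale))
    new_w = max(1, round(len(grid[0]) * scale))
    return resize_nearest(grid, new_h, new_w)


# An 8x8 toy "latent" whose values encode their own position.
latent = [[y * 8 + x for x in range(8)] for y in range(8)]

# Illustrative scales for the early, transition, and final stages.
phases = [("early", 0.5), ("transition", 0.75), ("final", 1.0)]

shapes = {}
for name, scale in phases:
    out = apply_phase(latent, scale)
    shapes[name] = (len(out), len(out[0]))
    print(name, shapes[name])
```

The grid the block sees grows from a coarse view in the early stage to the untouched full-size grid in the final stage, mirroring the progression from composition to detail described above.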
Benefits and Applications
PatchModelAddDownscale offers several key advantages:
- Better Composition: More coherent overall structure and composition
- Improved Anatomy: Fewer anatomical inconsistencies
- Higher Resolution Generation: Enables generation at resolutions beyond what the model was trained on
- Smoother Transitions: Gradual scaling prevents jarring shifts in the generation process
It’s particularly useful for:
- Portrait generation where anatomical correctness is crucial
- Complex scenes with multiple interacting elements
- Any case where you want to generate at resolutions significantly higher than the model’s training resolution
Implementation in ComfyUI
In ComfyUI, the PatchModelAddDownscale node typically connects to a loaded model, modifying its internal behavior during sampling. The parameters can be tuned depending on the specific model and generation scenario, with higher downscale factors generally providing more stability at very high resolutions.
Conclusion
PatchModelAddDownscale represents an elegant mathematical solution to the challenges of high-resolution generation with diffusion models. By strategically applying downscaling during the early phases of the process, it enables models to produce more coherent and anatomically correct images at resolutions that would otherwise lead to inconsistent results.