Perturbed Attention Guidance (PAG)


Introduction


Perturbed Attention Guidance (PAG) is an advanced technique used in diffusion models to enhance image quality and structural coherence. Unlike other enhancement methods that require additional training or model modifications, PAG works by cleverly manipulating the sampling process itself, making it both efficient and relatively easy to implement.

What is Perturbed Attention Guidance?


Perturbed Attention Guidance improves sample quality by guiding the diffusion process using self-attention mechanisms. It works by creating intermediate samples with degraded structure (by replacing self-attention maps with identity matrices) and then steering the generation away from these degraded samples. This enhances structural coherence without requiring additional training or modules.

Mathematical Foundation


Mathematically, PAG applies a guidance signal proportional to the difference between normal denoising and perturbed denoising:

$$ \epsilon_\text{PAG}(\mathbf{x}_t, c, t) = \epsilon_\theta(\mathbf{x}_t, c, t) + s \cdot [\epsilon_\theta(\mathbf{x}_t, c, t) - \epsilon_\theta^\text{perturbed}(\mathbf{x}_t, c, t)] $$

Where:

  • $\epsilon_\theta(\mathbf{x}_t, c, t)$ is the standard denoising prediction
  • $\epsilon_\theta^\text{perturbed}(\mathbf{x}_t, c, t)$ is the denoising prediction with perturbed attention maps
  • $s$ is the PAG scale that controls guidance strength
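
As a concrete illustration, the guidance term simply amplifies the gap between the normal and perturbed predictions. A minimal PyTorch sketch (shapes and values are illustrative, not from the source):

```python
import torch

def pag_combine(eps, eps_perturbed, s=3.0):
    # eps, eps_perturbed: noise predictions, e.g. shape (batch, C, H, W)
    return eps + s * (eps - eps_perturbed)

eps = torch.randn(1, 4, 64, 64)                # standard prediction
eps_hat = eps + 0.1 * torch.randn_like(eps)    # prediction with degraded structure
guided = pag_combine(eps, eps_hat)             # steered away from the degraded sample
```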

The perturbation replaces the self-attention map with the identity matrix, so each token's output is simply its own value vector:

$$ \text{Att}^\text{perturbed}(Q, K, V) = V $$

Instead of the normal attention calculation:

$$ \text{Att}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V $$
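
The contrast is easiest to see in code. A sketch of both computations (tensor layout assumed for illustration):

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # Standard scaled dot-product self-attention
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

def perturbed_attention(q, k, v):
    # Attention map replaced by the identity matrix: I @ v == v,
    # so query-key (structural) information is discarded while
    # the value (appearance) pathway is preserved
    return v
```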

How It Works


  1. Normal Denoising Process: The model computes a standard denoising prediction.
  2. Perturbed Denoising: The model computes a second prediction with perturbed attention maps.
  3. Guidance Application: The difference between these predictions is scaled and added to the normal prediction.

This approach essentially tells the model “don’t generate images that look like those with degraded attention,” pushing it to create images with stronger structural coherence.

Benefits and Applications


PAG offers several key advantages:

  1. Improved Structure: More coherent and anatomically correct subjects
  2. Enhanced Composition: Better spatial relationships between elements
  3. No Additional Training: Works with existing models without fine-tuning
  4. Adjustable Strength: The guidance scale can be tuned for different images and models

PAG is particularly useful for:

  • Complex scenes with multiple subjects
  • Images requiring precise anatomical details
  • High-resolution generations where structural issues often emerge
  • Situations where fine-tuning models is impractical

Implementation in ComfyUI


In ComfyUI, PAG is implemented through dedicated nodes that modify the denoising process. The primary node available is the PerturbedAttentionGuidance node, which can be found in the “model_patches/unet” category.

Available PAG Nodes

  1. PerturbedAttentionGuidance - A basic implementation that provides essential functionality with minimal complexity:

    • Takes a model and applies PAG with a single scale parameter
    • Scale parameter (default: 3.0, range: 0.0-100.0) controls the strength of guidance
    • Works by patching the middle UNet block’s self-attention mechanism
    • More resistant to breaking with ComfyUI updates due to its simplicity
  2. PerturbedAttentionGuidance (Advanced) - Available from pamparamm’s repository, this version offers more options and greater flexibility:

    • Provides additional configuration parameters beyond the basic scale
    • Allows for more fine-grained control over the PAG implementation
    • May require occasional updates to maintain compatibility with ComfyUI

Implementation Details

The PAG node works by modifying the attention mechanism in the UNet:

```python
import comfy.model_patcher
import comfy.samplers

def perturbed_attention(q, k, v, extra_options, mask=None):
    # Identity attention: skip softmax(QK^T / sqrt(d)) and return V directly
    return v

def post_cfg_function(args):
    model = args["model"]
    cond = args["cond"]
    cond_pred = args["cond_denoised"]
    cfg_result = args["denoised"]
    sigma = args["sigma"]
    x = args["input"]
    model_options = args["model_options"].copy()

    # Patch self-attention ("attn1") in the chosen UNet block with the identity;
    # scale, unet_block, and unet_block_id are the node's input parameters
    model_options = comfy.model_patcher.set_model_options_patch_replace(
        model_options, perturbed_attention, "attn1", unet_block, unet_block_id)
    (pag,) = comfy.samplers.calc_cond_batch(model, [cond], x, sigma, model_options)

    # Apply PAG formula: result + scale * (normal_prediction - perturbed_prediction)
    return cfg_result + (cond_pred - pag) * scale
```
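
For the patch to take effect, this post-CFG function is registered on a clone of the model so the original stays unmodified. A condensed sketch of how the node wires it up, assuming post_cfg_function is defined as a closure over scale, unet_block, and unet_block_id:

```python
def patch(model, scale, unet_block="middle", unet_block_id=0):
    m = model.clone()
    # Run post_cfg_function after each CFG step during sampling
    m.set_model_sampler_post_cfg_function(post_cfg_function)
    return (m,)
```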

Using PAG in Your Workflow

To incorporate PAG into your ComfyUI workflow:

  1. Add the PerturbedAttentionGuidance node to your workflow

  2. Connect your model to the node’s input

  3. Adjust the scale parameter based on your needs:

    • Lower values (0.5-2.0): Subtle structural improvements
    • Medium values (2.0-5.0): Balanced enhancement
    • Higher values (5.0+): Strong structural enforcement
  4. Connect the output to subsequent nodes in your generation pipeline

The key parameter is the PAG scale, which controls the strength of the guidance. Higher values enforce stronger structural coherence but may limit creativity, while lower values allow more freedom but provide less guidance.

Extended Theoretical Foundation and Insights from the ECCV Paper


Deeper Theoretical Justification

PAG is grounded in a rigorous energy-based framework. The method introduces an implicit discriminator $\mathcal{D}$ that distinguishes desirable (realistic) from undesirable (degraded) samples during the diffusion process. The guidance term is derived as the gradient of the generator loss of this discriminator, leading to the update:

$$ \tilde{\epsilon}_\theta(x_t) = \epsilon_\theta(x_t) + s(\epsilon_\theta(x_t) - \hat{\epsilon}_\theta(x_t)) $$

where $\hat{\epsilon}_\theta(x_t)$ is the prediction with perturbed self-attention (identity matrix). This formulation generalizes classifier-free guidance (CFG), which is a special case where the perturbation is dropping the class label. PAG, however, works even in unconditional settings and for tasks where CFG is not applicable.
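
For comparison, classifier-free guidance can be written in exactly the same form, with label dropping as the perturbation and $\epsilon_\theta(x_t, \varnothing)$ denoting the unconditional prediction:

$$ \tilde{\epsilon}_\theta(x_t) = \epsilon_\theta(x_t, c) + s \cdot [\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)] $$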

Why Perturb Only Self-Attention?

The self-attention module in diffusion U-Nets separates structure (query-key similarity) from appearance (value tensor). By perturbing only the self-attention map (replacing it with the identity matrix), PAG degrades structural information while preserving appearance, avoiding out-of-distribution issues that arise from perturbing the value tensor directly.


Algorithm and Pseudocode

The ECCV paper provides the following pseudocode for PAG sampling:

```python
for t in T, T-1, ..., 1:
    eps = Model(x_t)  # normal self-attention
    eps_hat = Model'(x_t)  # perturbed self-attention (identity)
    eps_guided = eps + s * (eps - eps_hat)
    x_{t-1} ~ N(mean(eps_guided), variance)
```
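
A runnable version of the same loop, written here as a simple Euler sampler over a sigma schedule (the sampler choice and the model(x, sigma) noise-prediction interface are illustrative assumptions, not from the paper):

```python
import torch

@torch.no_grad()
def sample_with_pag(model, model_perturbed, x, sigmas, s=3.0):
    # model(x, sigma) -> predicted noise with normal self-attention;
    # model_perturbed: the same network with identity self-attention
    for i in range(len(sigmas) - 1):
        eps = model(x, sigmas[i])
        eps_hat = model_perturbed(x, sigmas[i])
        eps_guided = eps + s * (eps - eps_hat)
        # Euler step: for epsilon-prediction models, dx/dsigma ≈ eps
        x = x + eps_guided * (sigmas[i + 1] - sigmas[i])
    return x
```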

Comparison to Other Guidance Methods

  • Classifier-Free Guidance (CFG):

    • CFG is a special case of the PAG framework, where the perturbation is dropping the class label.
    • PAG can be combined with CFG for even better results, especially in text-to-image tasks.
  • Self-Attention Guidance (SAG):

    • PAG is more robust to scale and less sensitive to hyperparameters than SAG.
    • PAG is faster and more stable: the perturbed branch needs nothing from the normal pass, so both predictions can be computed concurrently in a single batched forward call (see the sketch below), whereas SAG must extract attention maps from a first pass before running its second.
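
The batching point can be made concrete. A hypothetical sketch, where the perturb_attention flag is an invented illustration of per-sample attention switching, not a real model argument:

```python
import torch

def pag_predictions(model, x, sigma):
    # Duplicate the latent so the normal and perturbed branches
    # run together as one batch of size 2 * batch_size
    x2 = torch.cat([x, x], dim=0)
    # 'perturb_attention' is hypothetical: it marks which batch
    # elements should use identity self-attention
    out = model(x2, sigma, perturb_attention=[False, True])
    eps, eps_hat = out.chunk(2, dim=0)
    return eps, eps_hat
```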

Limitations and Future Work

  • At very high guidance scales, over-saturation or artifacts may appear.
  • PAG requires two forward passes per step, increasing computational cost compared to no guidance.

Further Reading

For a rigorous theoretical explanation, extensive experiments, and additional ablation studies, see the ECCV 2024 paper "Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance" (Ahn et al.): https://arxiv.org/abs/2403.17377