Perturbed Attention Guidance (PAG)
Introduction
Perturbed Attention Guidance (PAG) is an advanced technique used in diffusion models to enhance image quality and structural coherence. Unlike other enhancement methods that require additional training or model modifications, PAG works by cleverly manipulating the sampling process itself, making it both efficient and relatively easy to implement.
What is Perturbed Attention Guidance?
Perturbed Attention Guidance improves sample quality by guiding the diffusion process using self-attention mechanisms. It works by creating intermediate samples with degraded structure (by replacing self-attention maps with identity matrices) and then steering the generation away from these degraded samples. This enhances structural coherence without requiring additional training or modules.
Mathematical Foundation
Mathematically, PAG applies a guidance signal proportional to the difference between normal denoising and perturbed denoising:
$$ \epsilon_\text{PAG}(\mathbf{x}_t, c, t) = \epsilon_\theta(\mathbf{x}_t, c, t) + s \cdot [\epsilon_\theta(\mathbf{x}_t, c, t) - \epsilon_\theta^\text{perturbed}(\mathbf{x}_t, c, t)] $$

Where:
- $\epsilon_\theta(\mathbf{x}_t, c, t)$ is the standard denoising prediction
- $\epsilon_\theta^\text{perturbed}(\mathbf{x}_t, c, t)$ is the denoising prediction with perturbed attention maps
- $s$ is the PAG scale that controls guidance strength
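In code, the update is a single tensor operation once both predictions are available (a minimal sketch; the function name pag_update and the tensor shapes are illustrative, not part of any library):

```python
import torch

def pag_update(eps_normal: torch.Tensor, eps_perturbed: torch.Tensor, s: float) -> torch.Tensor:
    # eps_PAG = eps + s * (eps - eps_hat): push the prediction away
    # from the structurally degraded (perturbed) prediction
    return eps_normal + s * (eps_normal - eps_perturbed)
```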
The perturbation replaces key attention maps with identity matrices:
$$ \text{Att}^\text{perturbed}(Q, K, V) = V $$

Instead of the normal attention calculation:
$$ \text{Att}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V $$

How It Works
- Normal Denoising Process: The model computes a standard denoising prediction.
- Perturbed Denoising: The model computes a second prediction with perturbed attention maps.
- Guidance Application: The difference between these predictions is scaled and added to the normal prediction.
This approach essentially tells the model “don’t generate images that look like those with degraded attention,” pushing it to create images with stronger structural coherence.
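The two attention variants are easy to contrast in code (a minimal sketch of scaled dot-product self-attention; batch and head dimensions are omitted for clarity):

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # Standard self-attention: softmax(Q K^T / sqrt(d)) V
    d = q.shape[-1]
    weights = F.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1)
    return weights @ v

def perturbed_attention(q, k, v):
    # PAG perturbation: the attention map is the identity matrix,
    # so each token attends only to itself and the output is simply V
    return v
```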
Benefits and Applications
PAG offers several key advantages:
- Improved Structure: More coherent and anatomically correct subjects
- Enhanced Composition: Better spatial relationships between elements
- No Additional Training: Works with existing models without fine-tuning
- Adjustable Strength: The guidance scale can be tuned for different images and models
PAG is particularly useful for:
- Complex scenes with multiple subjects
- Images requiring precise anatomical details
- High-resolution generations where structural issues often emerge
- Situations where fine-tuning models is impractical
Implementation in ComfyUI
In ComfyUI, PAG is implemented through dedicated nodes that modify the denoising process. The primary node available is the PerturbedAttentionGuidance node, which can be found in the “model_patches/unet” category.
Available PAG Nodes
PerturbedAttentionGuidance - A basic implementation that provides essential functionality with minimal complexity:
- Takes a model and applies PAG with a single scale parameter
- Scale parameter (default: 3.0, range: 0.0-100.0) controls the strength of guidance
- Works by patching the middle UNet block’s self-attention mechanism
- More resistant to breaking with ComfyUI updates due to its simplicity
PerturbedAttentionGuidance (Advanced) - Available from pamparamm’s repository, this version offers more options and greater flexibility:
- Provides additional configuration parameters beyond the basic scale
- Allows for more fine-grained control over the PAG implementation
- May require occasional updates to maintain compatibility with ComfyUI
Implementation Details
The PAG node works by modifying the attention mechanism in the UNet:
```python
import comfy.model_patcher
import comfy.samplers

def perturbed_attention(q, k, v, extra_options, mask=None):
    return v  # Replace self-attention with the identity function: Att(Q, K, V) = V

def post_cfg_function(args):
    # scale, unet_block ("middle"), and unet_block_id come from the node's enclosing scope
    model, cond, x = args["model"], args["cond"], args["input"]
    cond_pred, cfg_result, sigma = args["cond_denoised"], args["denoised"], args["sigma"]
    model_options = args["model_options"].copy()
    # Swap self-attention ("attn1") in the target UNet block for the perturbed version
    model_options = comfy.model_patcher.set_model_options_patch_replace(
        model_options, perturbed_attention, "attn1", unet_block, unet_block_id)
    (pag,) = comfy.samplers.calc_cond_batch(model, [cond], x, sigma, model_options)
    # Apply the PAG formula: result + scale * (normal prediction - perturbed prediction)
    return cfg_result + (cond_pred - pag) * scale
```
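In the node itself, this callback is registered on a clone of the model via ComfyUI's ModelPatcher (a condensed sketch of the node's patch method; unet_block and unet_block_id are the closure variables referenced in the snippet above):

```python
def patch(self, model, scale):
    unet_block, unet_block_id = "middle", 0  # patch the middle block's self-attention
    m = model.clone()
    # ... perturbed_attention and post_cfg_function defined as above ...
    m.set_model_sampler_post_cfg_function(post_cfg_function)
    return (m,)
```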
Using PAG in Your Workflow
To incorporate PAG into your ComfyUI workflow:
- Add the PerturbedAttentionGuidance node to your workflow
- Connect your model to the node’s input
- Adjust the scale parameter based on your needs:
  - Lower values (0.5-2.0): Subtle structural improvements
  - Medium values (2.0-5.0): Balanced enhancement
  - Higher values (5.0+): Strong structural enforcement
- Connect the output to subsequent nodes in your generation pipeline (see the sketch below)
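For reference, the wiring looks like this in API-format workflow JSON, expressed here as a Python dict (a hedged sketch; the node ids, the checkpoint name, and the omitted sampler inputs are hypothetical):

```python
workflow = {
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "sd_xl_base_1.0.safetensors"}},
    "2": {"class_type": "PerturbedAttentionGuidance",
          # PAG sits between the loader's MODEL output and the sampler's model input
          "inputs": {"model": ["1", 0], "scale": 3.0}},
    "3": {"class_type": "KSampler",
          "inputs": {"model": ["2", 0]}},  # remaining KSampler inputs omitted
}
```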
The key parameter is the PAG scale, which controls the strength of the guidance. Higher values enforce stronger structural coherence but may limit creativity, while lower values allow more freedom but provide less guidance.
Extended Theoretical Foundation and Insights from the ECCV Paper
Deeper Theoretical Justification
PAG is grounded in a rigorous energy-based framework. The method introduces an implicit discriminator \(\mathcal{D}\) that distinguishes desirable (realistic) from undesirable (degraded) samples during the diffusion process. The guidance term is derived as the gradient of the generator loss of this discriminator, leading to the update:
$$ \tilde{\epsilon}_\theta(x_t) = \epsilon_\theta(x_t) + s(\epsilon_\theta(x_t) - \hat{\epsilon}_\theta(x_t)) $$

where \(\hat{\epsilon}_\theta(x_t)\) is the prediction with perturbed self-attention (identity matrix). This formulation generalizes classifier-free guidance (CFG), which is a special case where the perturbation is dropping the class label. PAG, however, works even in unconditional settings and for tasks where CFG is not applicable.
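To make the generalization concrete (a short worked instance in the document's notation): choosing "drop the conditioning" as the perturbation recovers the familiar CFG update, up to a reparameterization of the scale:

$$ \epsilon_\text{CFG}(\mathbf{x}_t, c, t) = \epsilon_\theta(\mathbf{x}_t, c, t) + s \cdot [\epsilon_\theta(\mathbf{x}_t, c, t) - \epsilon_\theta(\mathbf{x}_t, \varnothing, t)] $$

Here the unconditional prediction \(\epsilon_\theta(\mathbf{x}_t, \varnothing, t)\) plays the role of \(\hat{\epsilon}_\theta\).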
Why Perturb Only Self-Attention?
The self-attention module in diffusion U-Nets separates structure (query-key similarity) from appearance (value tensor). By perturbing only the self-attention map (replacing it with the identity matrix), PAG degrades structural information while preserving appearance, avoiding out-of-distribution issues that arise from perturbing the value tensor directly.
Algorithm and Pseudocode
The ECCV paper provides the following pseudocode for PAG sampling:
```python
for t in T, T-1, ..., 1:
    eps = Model(x_t)       # normal self-attention
    eps_hat = Model'(x_t)  # perturbed self-attention (identity)
    eps_guided = eps + s * (eps - eps_hat)
    x_{t-1} ~ N(mean(eps_guided), variance)
```
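A more concrete version of this loop, using the epsilon-prediction convention and a simple Euler step (an illustrative sketch, not the paper's sampler; model_normal and model_perturbed stand for the same network with and without the attention patch, and sigmas is a decreasing noise schedule):

```python
import torch

@torch.no_grad()
def pag_sample(model_normal, model_perturbed, x, sigmas, s):
    # sigmas: decreasing noise levels, ending at (or near) zero
    for i in range(len(sigmas) - 1):
        eps = model_normal(x, sigmas[i])         # normal self-attention
        eps_hat = model_perturbed(x, sigmas[i])  # identity self-attention
        eps_guided = eps + s * (eps - eps_hat)   # PAG update
        x = x + (sigmas[i + 1] - sigmas[i]) * eps_guided  # Euler step toward lower noise
    return x
```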
Comparison to Other Guidance Methods
Classifier-Free Guidance (CFG):
- CFG is a special case of the PAG framework, where the perturbation is dropping the class label.
- PAG can be combined with CFG for even better results, especially in text-to-image tasks (see the sketch below).
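A common way to combine the two guidance signals is additive (a sketch following the additive form described in the paper; the two scales are tuned jointly in practice):

```python
def cfg_plus_pag(eps_uncond, eps_cond, eps_cond_perturbed, cfg_scale, pag_scale):
    # CFG steers toward the prompt; PAG steers away from structurally degraded samples
    return (eps_uncond
            + cfg_scale * (eps_cond - eps_uncond)
            + pag_scale * (eps_cond - eps_cond_perturbed))
```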
Self-Attention Guidance (SAG):
- PAG is more robust to scale and less sensitive to hyperparameters than SAG.
- PAG is faster and more stable, since the normal and perturbed predictions can be computed concurrently (e.g., batched in a single forward pass) when the model supports it.
Limitations and Future Work
- At very high guidance scales, over-saturation or artifacts may appear.
- PAG requires two forward passes per step, increasing computational cost compared to no guidance.
Further Reading
For a rigorous theoretical explanation, extensive experiments, and more ablation studies, see the ECCV 2024 paper “Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance” (Ahn et al., arXiv:2403.17377).