The ComfyUI Bible


Installing ComfyUI


If you need help installing ComfyUI, this isn’t the place for a full walkthrough: if you are using Windows, you can use the prebuilt package; otherwise, you can install it manually.

Requirements for running the code examples on this page.

To run the supplied visualization code in this document, you’ll need the following Python packages:

torch
torchvision
numpy
Pillow
opencv-python
diffusers

Additionally, you’ll need FFmpeg installed on your system for the video conversion. The code assumes FFmpeg is available in your system PATH.
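
If you want a head start on that conversion step, here is a minimal sketch that calls FFmpeg from Python; the frame naming pattern and frame rate are assumptions, not something the original code prescribes:

import subprocess

# Stitch numbered PNG frames into an MP4.
# Assumes ffmpeg is on the PATH and frames are named frame_0000.png, frame_0001.png, ...
subprocess.run([
    "ffmpeg", "-y",
    "-framerate", "24",
    "-i", "frame_%04d.png",
    "-c:v", "libx264",
    "-pix_fmt", "yuv420p",
    "diffusion.mp4",
], check=True)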

Understanding Diffusion Models


Before diving into ComfyUI’s practical aspects, let’s understand the mathematical foundations of diffusion models that power modern AI image generation. You can skip most, though not all, of the intimidating equations and jump straight to the practical part, but your brain will thank you if you work through them!

The Diffusion Process

Diffusion models work through a process that’s similar to gradually adding static noise to a TV signal, and then learning to remove that noise to recover the original picture.

Forward Diffusion

The forward diffusion process systematically transforms a clear image into pure random noise through a series of precise mathematical steps. Here’s how it works:

  1. We start with an original image $x_0$ containing clear, detailed information
  2. At each timestep $t$, we apply a carefully controlled amount of Gaussian noise
  3. The noise intensity follows a schedule $\beta_t$ that gradually increases over time
  4. Each step slightly corrupts the previous image state according to our diffusion equation
  5. After $T$ timesteps, we reach $x_T$ which is effectively pure Gaussian noise

The code used to generate the video.

This demonstrates pixel-space diffusion, just to give you an idea of what’s happening.
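
That code is not reproduced inline here, but a minimal sketch of the same idea - repeatedly scaling the image down and adding Gaussian noise, assuming a local file named input.png - could look like this:

import torch
import numpy as np
from PIL import Image

# Load an image and scale pixel values to [-1, 1]
img = Image.open("input.png").convert("RGB")
x = torch.from_numpy(np.array(img)).float() / 127.5 - 1.0

# Linear noise schedule beta_t over T timesteps
T = 1000
betas = torch.linspace(1e-4, 0.02, T)

frames = []
for t in range(T):
    noise = torch.randn_like(x)
    # One forward step: scale the previous state and add Gaussian noise
    x = torch.sqrt(1 - betas[t]) * x + torch.sqrt(betas[t]) * noise
    if t % 100 == 0:
        frame = ((x.clamp(-1, 1) + 1) * 127.5).byte().numpy()
        frames.append(Image.fromarray(frame))   # frames can then be written out for FFmpeg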

To actually see the diffusion process in latent space, we first have to learn about latent space.

Latent Space

It’s important to understand that AI models don’t work directly with regular images. Instead, they work with something called “latents” - a compressed representation of images that’s more efficient for the AI to process.

Think of them like a blueprint of an image:

  • A regular image stores exact colors for each pixel (like a detailed painting)
  • A latent stores abstract patterns and features (like an architect’s blueprint)

Here is a video showing the forward diffusion process in latent space:

The code used to generate the video.

The main differences you’ll notice compared to pixel-space diffusion are:

  • The noise patterns will look more structured due to the VAE’s learned latent space.
  • The degradation process might preserve more semantic information.
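
As a rough sketch of what this looks like in code, you can encode the image with a VAE from diffusers and add noise to the latent; the model name and the 0.18215 scaling factor are common Stable Diffusion defaults, assumed here for illustration:

import torch
import numpy as np
from PIL import Image
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

# Prepare the image as a (1, 3, H, W) tensor in [-1, 1]
img = Image.open("input.png").convert("RGB").resize((512, 512))
x = torch.from_numpy(np.array(img)).float() / 127.5 - 1.0
x = x.permute(2, 0, 1).unsqueeze(0)

with torch.no_grad():
    # Encode to a (1, 4, 64, 64) latent, applying the usual SD scaling factor
    latents = vae.encode(x).latent_dist.sample() * 0.18215

    # Add noise in latent space, then decode to see what the corrupted latent "looks like"
    noisy = 0.7 * latents + 0.3 * torch.randn_like(latents)
    decoded = vae.decode(noisy / 0.18215).sample
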
The Mathematics

The process of adding noise can be described mathematically with the following equation:

$$q(x_t|x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_tI)$$

Let’s break down each component of this equation:

  1. Core Variables

    • $x_t$ - The image state at the current timestep
    • $x_{t-1}$ - The image state at the previous timestep
    • $\beta_t$ - The noise schedule parameter that controls noise intensity
    • $I$ - The identity matrix (ensures noise is applied uniformly)
  2. Distribution Components

    • $q(x_t|x_{t-1})$ - The probability distribution of the current state given the previous state
    • $\mathcal{N}$ - Denotes a normal (Gaussian) distribution
    • $\sqrt{1-\beta_t}x_{t-1}$ - The mean of the distribution
    • $\beta_tI$ - The variance of the distribution

The equation describes a carefully controlled process of gradually corrupting an image with noise. At each timestep:

  • The previous image is scaled down by $\sqrt{1-\beta_t}$, which decreases as more noise is added
  • Random Gaussian noise with variance $\beta_tI$ is added to this scaled image
  • The process creates a smooth transition from the original image to pure noise
  • The noise schedule $\beta_t$ ensures this happens in a controlled, predictable way

This mathematical foundation is crucial because it allows the model to learn the reverse process - taking a noisy image and progressively removing noise to generate the final output.

Alpha Bar

Alpha bar ($\bar{\alpha}$) is a crucial concept in diffusion models that represents the cumulative product of the signal scaling factors. Here’s how it works:

  1. At each timestep $t$, we have:

    • $\beta_t$ (beta): the noise schedule parameter
    • $\alpha_t$ (alpha): $1 - \beta_t$, the signal scaling factor
  2. Alpha bar is then defined as:

    $$\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$$

This means:

  • $\bar{\alpha}$ starts at 1 (no noise) and decreases over time
  • It represents how much of the original signal remains at time $t$
  • At the end of diffusion, $\bar{\alpha}$ approaches 0 (pure noise)

The complete forward process can be written in terms of $\bar{\alpha}$:

$$x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon$$

where:

  • $x_t$ is the noisy image at time t
  • $x_0$ is the original image
  • $\epsilon$ is random Gaussian noise
  • $\sqrt{\bar{\alpha}_t}$ controls how much original signal remains
  • $\sqrt{1-\bar{\alpha}_t}$ controls how much noise is added

This formulation proves to be particularly powerful for several key reasons. First, it enables us to directly sample any timestep from the original image $x_0$ without having to calculate all intermediate steps. Additionally, it provides precise visibility into the ratio between signal and noise at each point in the process. Finally, this formulation makes the reverse process mathematically tractable, which is essential for training the model to denoise images effectively.
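
To make the closed-form expression concrete, here is a small sketch that samples $x_t$ directly from $x_0$; the linear beta schedule is illustrative, not the schedule any particular model was trained with:

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)   # cumulative product of the alphas

def sample_xt(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t from x_0 in one shot, without iterating over intermediate steps."""
    eps = torch.randn_like(x0)
    return torch.sqrt(alpha_bar[t]) * x0 + torch.sqrt(1 - alpha_bar[t]) * eps

# The remaining signal fraction sqrt(alpha_bar_t) shrinks toward zero as t grows
x0 = torch.randn(3, 64, 64)   # stand-in for an image
for t in (0, 250, 500, 999):
    print(t, float(torch.sqrt(alpha_bar[t])))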

The process serves as the foundation for training our AI models. By understanding exactly how images are corrupted, we can teach the model to reverse this corruption during the generation process.

Reverse Diffusion

Something cool happens when the AI learns to reverse this process. It learns to:

  1. Start with pure noise
  2. Gradually remove the noise
  3. Eventually recover a clear image

This is what happens every time you generate an image with Stable Diffusion or similar AI models. The model has learned to take random noise and progressively “denoise” it into a clear image that matches your prompt.
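
A heavily simplified sketch of such a reverse loop is shown below; the noise-predicting network is a hypothetical stand-in, and real samplers are considerably more refined than this plain DDPM-style update:

import torch

def reverse_diffusion(model, shape, betas):
    """Start from pure noise and denoise step by step (simplified DDPM-style loop)."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                        # 1. start with pure noise
    for t in reversed(range(len(betas))):         # 2. gradually remove the noise
        eps = model(x, t)                         # hypothetical epsilon-prediction network
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x                                      # 3. eventually, a clear image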

Latents

In ComfyUI, you’ll encounter latents in three main ways:

Empty Latent Image

When you want to generate an image from scratch, referred to as text-to-image (t2i).

  • Creates a blank canvas of random noise in latent space
  • Size: 8x smaller than your target image in each dimension (e.g., a 512x512 image becomes 64x64 latents)
  • This is your starting point for pure text-to-image generation

Load Image → VAE Encode

When you want to modify an existing image, referred to as image-to-image (i2i).

  • Load Image: Brings your regular image into ComfyUI
  • VAE Encode: Converts it into latents.
  • This is your starting point for image-to-image generation

VAE Decode

The final step in any workflow.

  • Converts latents back into a regular image
  • Like turning the blueprint back into a detailed painting
  • This is how you get your final output image

The beauty of working with latents is that they’re much more efficient:

  • Takes less memory (8x smaller than regular images in each dimension)
  • Easier for the AI to manipulate
  • Contains the essential “structure” of images without unnecessary details

The KSampler Node

The KSampler node in ComfyUI implements the reverse diffusion process. It takes the noisy latent $x_T$ and progressively denoises it using the model’s predictions. The node’s parameters directly control this process:

  1. Steps: Controls the number of $t$ steps in the reverse process

    • More steps = finer granularity but slower
    • Mathematically: divides $[0,T]$ into this many intervals
  2. CFG Scale: Implements Classifier-Free Guidance

    $$\epsilon_\text{CFG} = \epsilon_\theta(x_t, t) + w[\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t)]$$
    • Higher values follow the prompt more strictly
    • Lower values (1-4) allow more creative freedom
  3. Scheduler: Controls $\beta_t$ schedule

    • Karras: noise levels follow the schedule from Karras et al. $$\sigma_i = \left(\sigma_\text{max}^{1/\rho} + \tfrac{i}{N-1}\left(\sigma_\text{min}^{1/\rho} - \sigma_\text{max}^{1/\rho}\right)\right)^\rho$$
    • Normal: Linear schedule $$\beta_t = \beta_\text{min} + \tfrac{t}{T}(\beta_\text{max} - \beta_\text{min})$$
  4. Sampler: Determines how to use model predictions

    • Euler: Simple first-order method $$x_{t-1} = x_t - \eta \nabla \log p(x_t)$$
    • DPM++ 2M: Second-order method with momentum $$v_t = \mu v_{t-1} + \epsilon_t$$ $$x_{t-1} = x_t + \eta v_t$$
  5. Seed: Controls the initial noise

    • Same seed + same parameters = reproducible results
    • Mathematically: initializes the random state for $x_T$
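
If it helps to see these parameters outside of ComfyUI, they map fairly directly onto the diffusers API. This is a diffusers sketch rather than ComfyUI’s internals, and the checkpoint name is only an example:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

generator = torch.Generator("cuda").manual_seed(42)   # Seed: reproducible initial noise
image = pipe(
    "a beautiful sunset over mountains, high quality",
    num_inference_steps=25,    # Steps
    guidance_scale=7.0,        # CFG Scale
    generator=generator,
).images[0]
image.save("sunset.png")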

The Noise Node

The optional Noise node lets you directly manipulate the initial noise $x_T$. It implements:

$$x_T \sim \mathcal{N}(0, I)$$

You can:

  • Set a specific seed for reproducibility
  • Control noise dimensions (width, height)
  • Mix different noise patterns
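
In latent space this amounts to a few lines; the 4-channel, 1/8-resolution shape matches Stable Diffusion latents, and the mixing weights below are arbitrary:

import torch

# Initialize x_T ~ N(0, I) for a 512x512 image: 4 latent channels at 1/8 resolution
width, height, seed = 512, 512, 42
gen = torch.Generator().manual_seed(seed)
x_T = torch.randn((1, 4, height // 8, width // 8), generator=gen)

# Mixing two noise patterns (simple weighted blend)
other = torch.randn_like(x_T)
mixed = 0.8 * x_T + 0.2 * other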

V-Prediction and Angular Parameterization

The core of v-prediction is representing the diffusion process as a rotation in a 2D space.

The visualization below shows three key aspects of the v-prediction process: the noisy image progression (left), the raw velocity field in latent space (middle), and the decoded velocity field (right). The velocity field represents the direction and magnitude of change at each point, showing how the image should be denoised. The middle panel reveals the actual latent-space velocities at 1/8 resolution, while the right panel shows what these velocities “look like” when decoded back to image space. The overlay displays the current timestep ($t$), angle in radians ($\phi$), and cumulative signal scaling factor ($\bar{\alpha}$), helping us understand how the process evolves over time.

The code used to generate the video.

# Calculate phi (angle) from alpha and sigma
phi_t = torch.arctan2(sigma_t, alpha_t)

This calculates the angle $\phi_t$ that represents how far along we are in the diffusion process.

  • At $\phi_t = 0$, we have the original image
  • At $\phi_t = \pi/2$, we have pure noise
  • The angle smoothly interpolates between these states

While the standard formulation predicts noise $\epsilon$, an alternative approach called v-prediction parameterizes the diffusion process in terms of velocity. In this formulation, we define an angle $\phi_t = \text{arctan}(\sigma_t/\alpha_t)$ that represents the progression through the diffusion process. For a variance-preserving process, we have:

$$\alpha_\phi = \cos(\phi), \quad \sigma_\phi = \sin(\phi)$$

The noisy image at angle $\phi$ can then be expressed as:

$$\mathbf{z}_\phi = \cos(\phi)\mathbf{x} + \sin(\phi)\epsilon$$

The key insight is to define a velocity vector:

$$\mathbf{v}_\phi = \frac{d\mathbf{z}_\phi}{d\phi} = \cos(\phi)\epsilon - \sin(\phi)\mathbf{x}$$

This velocity represents the direction of change in the noisy image as we move through the diffusion process. The model predicts this velocity instead of the noise:

$$\hat{\mathbf{v}}_\theta(\mathbf{z}_\phi) = \cos(\phi)\hat{\epsilon}_\theta(\mathbf{z}_\phi) - \sin(\phi)\hat{\mathbf{x}}_\theta(\mathbf{z}_\phi)$$

The sampling process then becomes a rotation in the $(\mathbf{z}_\phi, \mathbf{v}_\phi)$ plane:

$$\mathbf{z}_{\phi_{t-\delta}} = \cos(\delta)\mathbf{z}_{\phi_t} - \sin(\delta)\hat{\mathbf{v}}_\theta(\mathbf{z}_{\phi_t})$$

This formulation offers several key advantages: it provides a more natural parameterization of the diffusion trajectory, simplifies the sampling process into a straightforward rotation operation, and can potentially lead to improved sample quality in certain scenarios.
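
These relationships translate almost line for line into code. Here is a small sketch with stand-in tensors rather than actual model predictions:

import torch

def velocity(x, eps, phi):
    """v_phi = cos(phi) * eps - sin(phi) * x"""
    return torch.cos(phi) * eps - torch.sin(phi) * x

def rotate_step(z_phi, v_pred, delta):
    """One sampling step as a rotation in the (z, v) plane."""
    return torch.cos(delta) * z_phi - torch.sin(delta) * v_pred

# Noising as a rotation: z_phi = cos(phi) * x + sin(phi) * eps
x = torch.randn(1, 4, 64, 64)       # stand-in for a clean latent
eps = torch.randn_like(x)
phi = torch.tensor(0.3)
z_phi = torch.cos(phi) * x + torch.sin(phi) * eps

v = velocity(x, eps, phi)
z_prev = rotate_step(z_phi, v, delta=torch.tensor(0.05))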

Implementation in ComfyUI: V-Prediction Samplers

ComfyUI implements v-prediction through several specialized samplers in the KSampler node:

  1. DPM++ 2M Karras

    • Most advanced v-prediction implementation
    • Uses momentum-based updates
    • Best for detailed, high-quality images
    • Recommended settings:
      • Steps: 25-30
      • CFG: 7-8
      • Scheduler: Karras
  2. DPM++ SDE Karras

    • Stochastic differential equation variant
    • Good balance of speed and quality
    • Recommended settings:
      • Steps: 20-25
      • CFG: 7-8
      • Scheduler: Karras
  3. UniPC

    • Unified predictor-corrector method
    • Fastest v-prediction sampler
    • Great for quick iterations
    • Recommended settings:
      • Steps: 20-23
      • CFG: 7-8
      • Scheduler: Karras

Advanced V-Prediction Control

For more control over the v-prediction process, you can use:

  1. Advanced KSampler

    • Exposes additional v-prediction parameters

    • Allows fine-tuning of the angle calculations

    • Parameters:

      start_at_step: 0.0
      end_at_step: 1.0
      add_noise: true
      return_with_leftover_noise: false
      
  2. Sampling Refinement The velocity prediction can be refined using:

    • VAEEncode → KSampler → VAEDecode chain

    • Multiple sampling passes with decreasing step counts

    • Example workflow:

      First Pass: 30 steps, CFG 8
      Refine: 15 steps, CFG 4
      Final: 10 steps, CFG 2
      

Conditioning and Control

Text-to-image generation involves conditioning the diffusion process on text embeddings. The mathematical formulation becomes:

$$p_\theta(x_{t-1}|x_t, \mathbf{c}) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t, \mathbf{c}), \Sigma_\theta(x_t, t, \mathbf{c}))$$

where $\mathbf{c}$ represents the conditioning information. In the context of text-to-image generation, this conditioning vector typically comes from CLIP (Contrastive Language-Image Pre-training), a neural network developed by OpenAI that creates a shared embedding space for both text and images.

Understanding CLIP and Conditioning

CLIP (Contrastive Language-Image Pre-training) is a neural network trained to learn the relationship between images and text through contrastive learning. It consists of two encoders: one for text and one for images. During training, CLIP learns to maximize the cosine similarity between matching image-text pairs while minimizing it for non-matching pairs. This is achieved through a contrastive loss function operating on batches of N image-text pairs, creating an N×N similarity matrix.

The text encoder first tokenizes the input text into a sequence of tokens, then processes these through a transformer to produce a sequence of token embeddings $\text{CLIP}_\text{text}(\text{text}) \rightarrow [\mathbf{z}_1, ..., \mathbf{z}_n] \in \mathbb{R}^{n \times d}$, where $n$ is the sequence length and $d$ is the embedding dimension. Unlike traditional transformer architectures that use pooling layers, CLIP simply takes the final token’s embedding (corresponding to the [EOS] token) after layer normalization. The image encoder maps images to a similar high-dimensional representation $\text{CLIP}_\text{image}(\text{image}) \rightarrow \mathbf{z} \in \mathbb{R}^d$.
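
For illustration, you can inspect these token embeddings with the transformers library; the checkpoint name below is the public CLIP base model and is used only as an example:

import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

tokens = tokenizer(["a red cat sitting on a blue chair"], padding=True, return_tensors="pt")
with torch.no_grad():
    out = text_model(**tokens)

print(out.last_hidden_state.shape)   # (1, n_tokens, 512): one embedding per token
print(out.pooler_output.shape)       # (1, 512): the [EOS] token embedding after layer norm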

Implementation in ComfyUI: The CLIPTextEncode Node

ComfyUI exposes this CLIP architecture through the CLIPTextEncode node. When you input a prompt, the node:

  1. Tokenizes your text into subwords using CLIP’s tokenizer
  2. Processes tokens through the transformer layers
  3. Produces the conditioning vectors that guide the diffusion process

The node comes in two variants that expose different levels of control over the embedding process:

  1. Basic CLIPTextEncode

    • Implements the standard CLIP text encoding: $\text{CLIP}_\text{text}(\text{text})$

    • Single text input field for the entire prompt

    • Handles the full sequence of tokens as one unit

    • Example usage:

      prompt: "a beautiful sunset over mountains, high quality"
      negative prompt: "blurry, low quality, distorted"
      
  2. CLIPTextEncode (Advanced)

    • Extends the basic encoding with token-level control
    • Supports weight adjustment per token: (word:1.2)
    • Mathematical operation on token embeddings: $$z_i^\text{weighted} = w_i \cdot z_i$$ where $w_i$ is the weight applied to each token embedding $z_i$

The sequence of token embeddings plays a crucial role in steering the diffusion process. Through cross-attention layers in the U-Net, the model can attend to different parts of the text representation as it generates the image. This mechanism enables the model to understand and incorporate multiple concepts and their relationships from the prompt.

Consider what happens when you input a prompt like “a red cat sitting on a blue chair”. The text is first split into tokens, and each token (or subword) gets its own embedding. The model can then attend differently to “red”, “cat”, “blue”, and “chair” during different stages of the generation process, allowing it to properly place and render each concept in the final image.

Advanced CLIP Operations: The ConditioningCombine Node

To support complex prompting scenarios, ComfyUI provides the ConditioningCombine node that implements mathematical operations on conditioning vectors:

  1. Concatenation Mode

    $$c_\text{combined} = [c_1; c_2]$$
    • Preserves both conditions fully
    • Useful for regional prompting
  2. Average Mode

    $$c_\text{combined} = \alpha c_1 + (1-\alpha) c_2$$
    • Blends multiple conditions
    • $\alpha$ controls the mixing ratio
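
In tensor terms, these two modes amount to a concatenation along the token axis and a weighted average. A minimal sketch with dummy conditioning tensors (shapes chosen only for illustration):

import torch

# Two dummy conditioning tensors: (batch, tokens, embedding_dim)
c1 = torch.randn(1, 77, 768)
c2 = torch.randn(1, 77, 768)

# Concatenation mode: both conditions are preserved in full
c_concat = torch.cat([c1, c2], dim=1)    # (1, 154, 768)

# Average mode: blend with mixing ratio alpha
alpha = 0.6
c_avg = alpha * c1 + (1 - alpha) * c2    # (1, 77, 768)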

Image-Based Conditioning: The CLIPVisionEncode Node

This node implements the image encoder part of CLIP:

$$\text{CLIP}_\text{image}(\text{image}) \rightarrow \mathbf{z} \in \mathbb{R}^d$$

The complete pipeline for image-guided generation becomes:

$$z_\text{image} = \text{CLIP}_\text{image}(\text{image})$$
$$z_\text{text} = \text{CLIP}_\text{text}(\text{prompt})$$
$$c_\text{combined} = \text{Combine}(z_\text{text}, z_\text{image})$$
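
For illustration, the image-encoder side can be reproduced with the same public CLIP checkpoint via transformers; the model name and the reference.png path are assumptions of this sketch:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("reference.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    z_image = model.get_image_features(**inputs)   # (1, 512) image embedding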

Beyond simple text conditioning, modern diffusion models support various forms of guidance. Image conditioning (img2img) allows existing images to influence the generation process. Control signals through ControlNet provide fine-grained control over structural elements. Style vectors extracted from reference images can guide aesthetic qualities, while structural guidance through depth maps or pose estimation can enforce specific spatial arrangements. Each of these conditioning methods can provide additional context to guide the generation process.

Unconditional vs Conditional Generation and CFG

The diffusion model can actually generate images in two modes: unconditional, where no guidance is provided ($\mathbf{c} = \emptyset$), and conditional, where we use our CLIP embedding or other conditioning signals. Classifier-Free Guidance (CFG) leverages both of these modes to enhance the generation quality.

The CFG process works by predicting two denoising directions at each timestep:

  1. An unconditional prediction: $\epsilon_\theta(x_t, t)$
  2. A conditional prediction: $\epsilon_\theta(x_t, t, \mathbf{c})$

These predictions are then combined using a guidance scale $w$ (often called the CFG scale):

$$\epsilon_\text{CFG} = \epsilon_\theta(x_t, t) + w[\epsilon_\theta(x_t, t, \mathbf{c}) - \epsilon_\theta(x_t, t)]$$

The guidance scale $w$ controls how strongly the conditioning influences the generation. A higher value of $w$ (typically 7-12) results in images that more closely match the prompt but may be less realistic, while lower values (1-4) produce more natural images that follow the prompt more loosely. When $w = 0$, we get purely unconditional generation, and as $w \to \infty$, the model becomes increasingly deterministic in following the conditioning.

This is why in ComfyUI, you’ll often see a “CFG Scale” parameter in sampling nodes. It directly controls this weighting between unconditional and conditional predictions, allowing you to balance prompt adherence against image quality.
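
In code, the weighting is essentially a one-liner. A sketch with dummy noise predictions:

import torch

def cfg_combine(eps_uncond: torch.Tensor, eps_cond: torch.Tensor, w: float) -> torch.Tensor:
    """Classifier-free guidance: blend unconditional and conditional predictions."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# Dummy predictions for a latent of shape (1, 4, 64, 64)
eps_uncond = torch.randn(1, 4, 64, 64)
eps_cond = torch.randn(1, 4, 64, 64)

eps_cfg = cfg_combine(eps_uncond, eps_cond, w=7.0)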

Implementation in ComfyUI: CFG Control

  1. Basic CFG Control The KSampler node provides direct control over CFG through its parameters:

    • CFG Scale: The $w$ parameter in the equation

    • Recommended ranges:

      Photorealistic: 4-7
      Artistic: 7-12
      Strong stylization: 12-15
      
  2. Advanced CFG Techniques ComfyUI offers several nodes for fine-tuning CFG behavior:

    a) CFGScheduler Node

    • Dynamically adjusts CFG during sampling

    • Mathematical operation:

      $$w_t = w_\text{start} + t(w_\text{end} - w_\text{start})$$
    • Example schedule:

      Start CFG: 12 (for initial structure)
      End CFG: 7 (for natural details)
      

    b) CFGDenoiser Node

    • Provides manual control over the denoising process
    • Allows separate CFG for different regions
    • Useful for:
      • Regional prompt strength
      • Selective detail enhancement
      • Balancing multiple conditions
  3. Practical CFG Workflows

    a) Basic Text-to-Image

    CLIPTextEncode [positive] → KSampler (CFG: 7)
    CLIPTextEncode [negative] → KSampler
    

    b) Dynamic CFG

    CLIPTextEncode → CFGScheduler → KSampler
    Parameters:
    - Start: 12 (0-25% steps)
    - Middle: 8 (25-75% steps)
    - End: 6 (75-100% steps)
    

    c) Regional CFG

    CLIPTextEncode → ConditioningSetArea → CFGDenoiser
    Multiple regions with different CFG values:
    - Focus area: CFG 12
    - Background: CFG 7
    
  4. CFG Troubleshooting Common issues and solutions:

    • Over-saturation: Reduce CFG or use scheduling
    • Loss of details: Increase CFG in final steps
    • Unrealistic results: Lower CFG, especially in early steps
    • Inconsistent style: Use CFG scheduling to balance

Introduction to ComfyUI


ComfyUI provides a visual interface to interact with these mathematical processes through a node-based workflow.

Core Components


Models and Their Mathematical Foundations

Before starting with ComfyUI, you need to understand the different types of models:

  1. Base Models (Checkpoints)

    • Stored in models\checkpoints
    • Implement the full diffusion process: $p_\theta(x_{0:T})$
    • Examples: CompassMix XL Lightning, Pony Diffusion V6 XL

    The Load Checkpoint Node

    The Load Checkpoint node in ComfyUI is your interface to the base diffusion model. It loads the model weights ($\theta$) and initializes the U-Net architecture that performs the reverse diffusion process. When you connect this node, you’re essentially preparing the neural network that will implement:

    $$p_\theta(x_{t-1}|x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$$

    The node outputs three crucial components:

    • MODEL: The U-Net that predicts noise or velocity
    • CLIP: The text encoder for conditioning
    • VAE: The encoder/decoder for image latents

    These outputs correspond to the three main mathematical operations in the diffusion process:

    1. MODEL: Implements $p_\theta(x_{t-1}|x_t)$
    2. CLIP: Provides conditioning $c$ for $p_\theta(x_{t-1}|x_t, c)$
    3. VAE: Handles the mapping between image space $x$ and latent space $z$
  2. LoRAs (Low-Rank Adaptation)

    • Stored in models\loras
    • Mathematically represented as: $W = W_0 + BA$ where:
      • $W_0$ is the original weight matrix
      • $B$ and $A$ are low-rank matrices
    • Reduces parameter count while maintaining model quality

    The LoRA Loader Node

    The LoRA Loader node implements the low-rank adaptation by modifying the base model’s weights according to the equation:

    $$W_{\text{final}} = W_0 + \alpha \cdot (BA)$$

    where $\alpha$ is the weight parameter in the node (typically 0.5-1.0). The node:

    • Takes a MODEL input from Load Checkpoint
    • Applies the LoRA weights ($BA$) with scaling $\alpha$
    • Outputs the modified model

    When you adjust the weight parameter, you’re directly controlling the influence of the low-rank matrices on the original weights (see the small code sketch after this list).

  3. VAE (Variational Autoencoder)

    • Stored in models\vae
    • Implements the encoding $q_\phi(z|x)$ and decoding $p_\psi(x|z)$
    • Works in a lower-dimensional latent space for efficiency

    The VAE Encode/Decode Nodes

    These nodes implement the probabilistic encoding and decoding:

    $$z \sim q_\phi(z|x)$$
    $$x \sim p_\psi(x|z)$$
    • VAE Encode: Converts images to latents (mean + variance)
    • VAE Decode: Converts latents back to images

    The latent space operations are crucial because:

    1. They reduce memory usage (latents are 8x smaller in each spatial dimension)
    2. They provide a more stable space for diffusion
    3. They help maintain semantic consistency
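
As referenced in the LoRA section above, here is a minimal sketch of the merge $W = W_0 + \alpha \cdot (BA)$ on toy tensors; the matrix sizes and rank are arbitrary:

import torch

def apply_lora(W0: torch.Tensor, A: torch.Tensor, B: torch.Tensor, alpha: float) -> torch.Tensor:
    """Merge a low-rank update into a weight matrix: W = W0 + alpha * (B @ A)."""
    return W0 + alpha * (B @ A)

# Toy example: a 768x768 weight with a rank-8 LoRA update
W0 = torch.randn(768, 768)
A = torch.randn(8, 768)      # down-projection
B = torch.randn(768, 8)      # up-projection
W = apply_lora(W0, A, B, alpha=0.8)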

Node Based Workflow


The node-based interface in ComfyUI represents the mathematical operations as interconnected components. Each node performs specific operations in the diffusion process:

  • Sampling nodes implement the reverse diffusion process
  • Conditioning nodes handle text embeddings and other control signals
  • VAE nodes handle encoding/decoding between image and latent space: $\mathcal{E}(x)$ and $\mathcal{D}(z)$

Arcane Wizardry

When you’re new to node-based workflows, think of each connection as passing tensors and parameters between mathematical operations. The entire workflow represents a computational graph that implements the full diffusion process.

Getting Started

To begin experimenting with these concepts, you can clear your workflow:

You can add nodes by either right-clicking or double-clicking on an empty area:

Right Click Add Method