The ComfyUI Bible
Installing ComfyUI
If you need help installing ComfyUI, you’ve come to the wrong place. If you are using Windows, you can use the prebuilt package; otherwise, install it manually.
Requirements for running the code examples on this page.
To run the visualization code supplied in this document, you’ll need the following Python packages:
torchvision
numpy
Pillow
opencv-python
diffusers
Additionally, you’ll need FFmpeg installed on your system for the video conversion. The code assumes FFmpeg is available in your system PATH.
Understanding Diffusion Models
Before diving into ComfyUI’s practical aspects, let’s understand the mathematical foundations of diffusion models that power modern AI image generation. If you disagree, you can scribble over all the equations on your monitor with a permanent marker!
The Diffusion Process
Diffusion models work through a process that’s similar to gradually adding static noise to a TV signal, and then learning to remove that noise to recover the original picture.
Forward Diffusion
The forward diffusion process systematically transforms a clear image into pure random noise through a series of precise mathematical steps.
- We start with an original image $x_0$ containing clear, detailed information
- At each timestep $t$, we apply a carefully controlled amount of Gaussian noise
- The noise intensity follows a schedule $\beta_t$ that gradually increases over time
- Each step slightly corrupts the previous image state according to our diffusion equation
- After $T$ timesteps, we reach $x_T$ which is effectively pure Gaussian noise
The code used to generate the video.
This demonstrates pixel-space diffusion, just to give you an idea of what’s happening.
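Since that code isn’t reproduced here, the following is a minimal sketch of pixel-space forward diffusion under an assumed linear β schedule; the input filename, frame names, and schedule values are placeholders, not the original script.

```python
import numpy as np
from PIL import Image

# Minimal pixel-space forward diffusion sketch (assumed linear beta schedule).
# "input.png" and the frame filenames are placeholders.
T = 100
betas = np.linspace(1e-4, 0.05, T)

x = np.asarray(Image.open("input.png").convert("RGB"), dtype=np.float32) / 127.5 - 1.0

for t, beta in enumerate(betas):
    eps = np.random.randn(*x.shape).astype(np.float32)
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * eps  # q(x_t | x_{t-1})
    frame = np.clip((x + 1.0) * 127.5, 0, 255).astype(np.uint8)
    Image.fromarray(frame).save(f"frame_{t:03d}.png")

# The saved frames can be assembled into a video with FFmpeg, e.g.:
# ffmpeg -framerate 25 -i frame_%03d.png forward_diffusion.mp4
```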
To actually see the diffusion process in latent space, first we have to learn about latent space.
Latent Space
It’s important to understand that AI models don’t work directly with regular images. Instead, they work with something called “latents” - a compressed representation of images that’s more efficient for the AI to process.
Think of them like a blueprint of an image:
- A regular image stores exact colors for each pixel (like a detailed painting)
- A latent stores abstract patterns and features (like an architect’s blueprint)
Mathematically, the process of converting an image into a latent is defined as:
$$ z = \text{VAE}_\text{enc}(x) $$
Where $x$ is the original image, and $z$ is the corresponding latent representation. This is typically a lower-dimensional tensor with learned feature mappings that retain the essential structure of the image.
The decoder function is then used to reconstruct the image:
$$ \hat{x} = \text{VAE}_\text{dec}(z) $$
Where $\hat{x}$ is the reconstructed image. Ideally $\hat{x}$ should closely resemble $x$, but due to the compression process, some high-frequency details may be lost.
Variational Autoencoders (VAEs) introduce an additional probabilistic constraint by encoding images into a latent distribution rather than a fixed latent vector. Instead of directly mapping an image to a latent vector $z$, the encoder outputs two components:
$$ \mu = f_\text{enc}(x), \quad \sigma = g_\text{enc}(x) $$
Where $\mu$ represents the mean and $\sigma$ represents the standard deviation of the latent space distribution. A latent sample $z$ is drawn from this distribution using:
$$ z = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) $$
This reparameterization trick allows gradients to flow through the sampling process during training. The decoder then reconstructs the image as:
$$ \hat{x} = \text{VAE}_\text{dec}(z) $$
Because the VAE encodes images into a lower-dimensional space while enforcing a smooth latent distribution, diffusion models can perform denoising efficiently in this latent space instead of the original high-dimensional pixel space.
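To make the encode/decode step concrete, here is a minimal round-trip sketch using the diffusers library; the stabilityai/sd-vae-ft-mse checkpoint and the input filename are assumptions chosen for illustration.

```python
import torch
from PIL import Image
from torchvision.transforms.functional import to_tensor, to_pil_image
from diffusers import AutoencoderKL

# Minimal VAE round-trip sketch; the checkpoint name and "input.png" are placeholders.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

x = to_tensor(Image.open("input.png").convert("RGB")) * 2.0 - 1.0  # scale to [-1, 1]
x = x.unsqueeze(0)                                                  # add batch dimension

with torch.no_grad():
    z = vae.encode(x).latent_dist.sample()   # z = VAE_enc(x), sampled via mu + sigma * eps
    x_hat = vae.decode(z).sample             # x_hat = VAE_dec(z)

print(x.shape, z.shape)                      # e.g. [1, 3, 512, 512] -> [1, 4, 64, 64]
to_pil_image((x_hat[0].clamp(-1, 1) + 1) / 2).save("reconstruction.png")
```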
Here is a video showing the forward diffusion process in latent space:
The code used to generate the video.
The beauty of working with latents is that they’re much more efficient:
- Take up less memory (the latent grid is 8x smaller than the image in each spatial dimension)
- Are easier for the AI to manipulate
- Contain the essential “structure” of images without unnecessary details
The main differences you’ll notice compared to pixel-space diffusion are:
- The noise patterns will look more structured due to the VAE’s learned latent space.
- The degradation process might preserve more semantic information.
ComfyUI provides a visual interface to interact with these mathematical processes through a node-based workflow. You’ll encounter latents in three main ways:
Empty Latent Image
When you want to generate an image from scratch, referred to as text-to-image (t2i).
- This is your starting point for pure text-to-image generation
- Size: target image size. The node transparently outputs an 8x smaller grid than your target image (e.g., 512x512 image = 64x64 latents)
- The actual latent content is all zeros (sketched below). Unlike RGB zeros, these don’t decode to black but to a greyish color. Noise will be added at the start of the diffusion process (see the KSampler node).
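For a rough picture of what this node produces (assuming an SD-family model with 4 latent channels and an 8x spatial downscale), the empty latent is just a zero tensor at 1/8 of the target resolution:

```python
import torch

# What an Empty Latent Image roughly corresponds to for an SD-family model
# (4 latent channels, 8x spatial downscale); batch size 1, 512x512 target.
width, height, batch = 512, 512, 1
latent = torch.zeros([batch, 4, height // 8, width // 8])  # shape [1, 4, 64, 64]
```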
Load Image → VAE Encode
When you want to modify an existing image, referred to as image-to-image (i2i).
- Load Image: Brings your regular image into ComfyUI
- VAE Encode: Converts it into latents.
- This is your starting point for image-to-image generation
VAE Decode → Save Image
The final step in any workflow:
- Converts latents back into a regular image
- Like turning the blueprint back into a detailed painting
- This is how you get your final output image
The Mathematics
The forward diffusion process (as formulated in DDPM) is characterized by a Markov chain, where Gaussian noise is incrementally introduced over discrete time steps. This implies that the probability of the current state $x_t$ is contingent solely on the preceding state $x_{t-1}$. Mathematically, this transition is represented as:
$$ q(x_t | x_{t-1}) := \mathcal{N}(x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t \mathbf{I}) $$
Here, $q$ represents the diffusion transition, modeled as a normal distribution. Pixel values $x_t[i,j]$ are independently sampled with variance $\beta_t$ according to the noise schedule. The mean term, $\sqrt{1-\beta_t}x_{t-1}$, retains a scaled down, or faded version of the previous step, effectively diminishing its intensity as noise is progressively added, while $\beta_t \mathbf{I}$ ensures noise independence across pixels.
By normalizing the dataset, we can crudely approximate sampling the initial image $x_0 \sim \mathcal{D}$ as $x_0 \sim \mathcal{N}(0, \mathbf{I})$. Under this assumption, the variance of $\sqrt{1-\beta_t} x_0$ is $1-\beta_t$. Adding independent Gaussian noise with variance $\beta_t$ then ensures the total variance remains 1 at each step.
Alpha Bar and SNR
Alpha bar ($\bar{\alpha}$) is a crucial concept in diffusion models that represents the cumulative product of the signal scaling factors. Here’s how it works:
At each timestep $t$, we have:
- $\beta_t$ (beta): the noise schedule parameter
- $\alpha_t$ (alpha): $1 - \beta_t$, the signal scaling factor
Alpha bar is then defined as: $$\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$$
This means:
- $\bar{\alpha}$ starts at 1 (no noise) and decreases over time
- It represents how much of the original signal remains at time $t$
- At the end of diffusion, $\bar{\alpha}$ approaches 0 (pure noise)
The complete forward process can be written in terms of $\bar{\alpha}$: $$x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon$$
where:
- $x_t$ is the noisy image at time t
- $x_0$ is the original image
- $\epsilon \sim \mathcal{N}(0, I)$ is random Gaussian noise
- $\sqrt{\bar{\alpha}_t}$ controls how much original signal remains
- $\sqrt{1-\bar{\alpha}_t}$ controls how much noise is added
This formulation proves to be particularly powerful for several key reasons. First, it enables us to directly sample any timestep from the original image $x_0$ without having to calculate all intermediate steps. Additionally, it provides precise visibility into the ratio between signal and noise at each point in the process. Finally, this formulation makes the reverse process mathematically tractable, which is essential for training the model to denoise images effectively.
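To make the closed-form expression concrete, here is a small sketch that builds $\bar{\alpha}_t$ from an assumed linear β schedule and jumps directly from $x_0$ to $x_t$; it also prints the signal-to-noise ratio discussed below.

```python
import torch

# Closed-form forward diffusion: jump from x_0 directly to x_t using alpha_bar.
# The linear beta schedule and T = 1000 are assumptions for illustration.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)          # alpha_bar_t = prod_{i<=t} alpha_i

x0 = torch.randn(1, 4, 64, 64)                    # stand-in for a normalized latent
t = 500
eps = torch.randn_like(x0)
x_t = alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * eps

# Signal-to-noise ratio at this step of the variance-preserving process
snr = alpha_bar[t] / (1 - alpha_bar[t])
print(f"alpha_bar_{t} = {alpha_bar[t].item():.4f}, SNR = {snr.item():.4f}")
```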
With a sufficiently large dataset, we can estimate the variance of the distribution of images (or latents). The variance represents the average “signal’s energy” in the images. When two independent signals are added together, their variances add up under the Gaussian assumption - the assumption, named after Carl Friedrich Gauss, that the random variables involved follow a (multivariate) normal distribution. Such a distribution is completely defined by its mean and covariance functions, which is what makes the analysis and computation of the noising process tractable.
Diffusion models, such as Denoising Diffusion Probabilistic Models (DDPM), rely on the assumption that the data distribution can be approximated by a Gaussian distribution at each step of the diffusion process. This assumption simplifies the mathematical formulation and allows for efficient sampling and generation of images. The Gaussian assumption ensures that the noise added at each step follows a normal distribution, making the process mathematically tractable and allowing for the use of well-established techniques in probability theory and statistics.
This is important in the diffusion process, since the noise levels are also measured in terms of variance. After adding noise to the image, the Signal-to-Noise Ratio (SNR) is used to define how much useful signal is left over the noise background. This measure is independent of the total variance, thus any noising process can be remapped to a variance-preserving process, meaning we add as much noise as we remove image energy. The dataset is often normalized to a variance of 1 by removing the mean and dividing by the standard deviation.
The process serves as the foundation for training our AI models. By understanding exactly how images are corrupted, we can teach the model to reverse this corruption during the generation process.
Reverse Diffusion
When the AI learns to reverse this process, it learns to:
- Start with pure noise
- Gradually remove the noise
- Eventually recover a clear image
This is what happens every time you generate an image with Stable Diffusion or similar AI models. The model has learned to take random noise and progressively “denoise” it into a clear image that matches your prompt.
The KSampler Node
The KSampler node in ComfyUI implements the reverse diffusion process. It takes the noisy latent $x_T$ and progressively denoises it using the model’s predictions. The node’s parameters directly control this process:
Steps: Controls the number of $t$ steps in the reverse process
- More steps = finer granularity but slower
- Mathematically: divides $[0,T]$ into this many intervals
CFG Scale: Implements Classifier-Free Guidance
$$ \epsilon_\text{CFG} = \epsilon_\theta(x_t, t) + w[\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t)] $$
- Higher values follow the prompt more strictly
- Lower values (1-4) allow more creative freedom
Scheduler: Controls the $\beta_t$ (noise) schedule
Karras: the noise levels $\sigma_i$ follow the schedule from Karras et al. $$ \sigma_i = \sigma_\text{min}^{1-f(i)} \sigma_\text{max}^{f(i)} $$
Normal: Linear schedule $$ \beta_t = \beta_\text{min} + t(\beta_\text{max} - \beta_\text{min}) $$
Sampler: Determines how to use model predictions
Euler: Simple first-order method $$ x_{t-1} = x_t - \eta \nabla \log p(x_t) $$
DPM++ 2M: Second-order method with momentum $$ v_t = \mu v_{t-1} + \epsilon_t $$ $$ x_{t-1} = x_t + \eta v_t $$
Seed: Controls the initial noise
- Same seed + same parameters = reproducible results
- Mathematically: initializes the random state for $x_T$
The Noise Node
The optional Noise node lets you directly manipulate the initial noise $x_T$. It implements:
$$ x_T \sim \mathcal{N}(0, I) $$
You can:
- Set a specific seed for reproducibility
- Control noise dimensions (width, height)
- Mix different noise patterns
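As a rough sketch of what seeded noise initialization amounts to (the 4-channel latent shape is an assumption for SD-style models):

```python
import torch

# Reproducible initial noise x_T: same seed + same shape = same starting latent.
seed = 42
gen = torch.Generator().manual_seed(seed)
x_T = torch.randn([1, 4, 64, 64], generator=gen)  # x_T ~ N(0, I) for a 512x512 target
```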
V-Prediction and Angular Parameterization
The concept of v-prediction in image diffusion models has evolved significantly over the years. Initially, diffusion models focused on noise prediction, where the model learned to predict the noise added to the data at each step of the diffusion process. This approach was popularized by the Denoising Diffusion Probabilistic Models (DDPM) introduced by Ho et al. in 2020. The noise prediction method proved effective for generating high-quality images, but it had limitations in terms of stability and efficiency. To address these limitations, researchers explored alternative prediction methods, leading to the development of v-prediction. V-prediction involves predicting the velocity of the data as it evolves through the diffusion process, rather than the noise. This approach was introduced to improve the stability and accuracy of the diffusion models, particularly near the zero point of the time variable. By focusing on the velocity, v-prediction helps mitigate issues related to Lipschitz singularities, which can pose challenges during both training and inference.
We can view an image or a latent as a single vector, composed of all the pixel values. Remember that diffusion processes can be remapped to a variance-preserving process, where the vectors for noise, image and intermediary steps have unit length. This means that we can represent the diffusion process as a rotation between the image vector and a Gaussian noise vector. This happens in a high-dimensional space, but since we are only interpolating between two directions (noise and image), we can represent it in 2D coordinates. The intermediary noised images all lie on a circle, and the velocity vector is tangent to that trajectory.
The visualization below shows three key aspects of the v-prediction process: the noisy image progression (left), the raw velocity field in latent space (middle), and the decoded velocity field (right). The velocity field represents the direction and magnitude of change at each point, showing how the image should be denoised. The middle panel reveals the actual latent-space velocities at 1/8 resolution, while the right panel shows what these velocities “look like” when decoded back to image space. The overlay displays the current timestep ($t$), angle in radians ($\phi$), and cumulative signal scaling factor ($\bar{\alpha}$), helping us understand how the process evolves over time.
The code used to generate the video.
import torch

# Calculate phi (angle) from alpha_t = sqrt(alpha_bar_t) and sigma_t = sqrt(1 - alpha_bar_t)
phi_t = torch.arctan2(sigma_t, alpha_t)
This calculates the angle $\phi_t$ that represents how far along we are in the diffusion process.
- At $\phi_t = 0$, we have the original image
- At $\phi_t = \pi/2$, we have pure noise
- The angle smoothly interpolates between these states
While the standard formulation predicts noise $\epsilon$, an alternative approach called v-prediction parameterizes the diffusion process in terms of velocity. In this formulation, we use an angle $\phi_t = \text{arctan}(\sigma_t/\alpha_t)$ to represent the progression through the diffusion process.
$$ \alpha_\phi = \cos(\phi), \quad \sigma_\phi = \sin(\phi) $$
The noisy image at angle $\phi$ can then be expressed as:
$$ \mathbf{z}_\phi = \cos(\phi)\mathbf{x} + \sin(\phi)\epsilon $$
The key insight is to define a velocity vector:
$$ \mathbf{v}_\phi = \frac{d\mathbf{z}_\phi}{d\phi} = \cos(\phi)\epsilon - \sin(\phi)\mathbf{x} $$
Starting from a clean image ($\phi \approx 0$), the velocity vector $\mathbf{v}_\phi$ is primarily composed of noise ($\epsilon$), making it perpendicular to the image vector $\mathbf{x}$. This perpendicular relationship indicates that the velocity is introducing noise into the latent space, moving away from the clear image representation.
This velocity represents the direction of change in the noisy image as we move through the diffusion process. The model predicts this velocity instead of the noise:
$$ \hat{\mathbf{v}}_\theta(\mathbf{z}_\phi) = \cos(\phi)\hat{\epsilon}_\theta(\mathbf{z}_\phi) - \sin(\phi)\hat{\mathbf{x}}_\theta(\mathbf{z}_\phi) $$
The sampling process then becomes a rotation in the $(\mathbf{z}_\phi, \mathbf{v}_\phi)$ plane:
$$ \mathbf{z}_{\phi_{t-\delta}} = \cos(\delta)\mathbf{z}_{\phi_t} - \sin(\delta)\hat{\mathbf{v}}_\theta(\mathbf{z}_{\phi_t}) $$
By learning to predict the velocity vectors ($\mathbf{v}_\phi$), the model ensures a stable and directed denoising path, where each velocity vector incrementally adjusts the latent representation towards the target image. As the process evolves, the alignment of velocity vectors with the image components showcases the structured denoising, gradually revealing the underlying image from the noisy latent. This formulation offers several key advantages: it provides a more natural parameterization of the diffusion trajectory, simplifies the sampling process into a straightforward rotation operation, and can potentially lead to improved sample quality in certain scenarios.
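To illustrate the rotation view, here is a toy sketch of a single v-prediction sampling update; the shapes are arbitrary and the ground-truth velocity stands in for a network prediction.

```python
import torch

def v_prediction_step(z_phi, v_hat, delta):
    """One sampling update in the (z_phi, v_phi) plane:
    z_{phi - delta} = cos(delta) * z_phi - sin(delta) * v_hat."""
    return torch.cos(delta) * z_phi - torch.sin(delta) * v_hat

# Toy example using the ground-truth velocity instead of a network prediction.
x = torch.randn(1, 4, 64, 64)          # stand-in for a clean latent
eps = torch.randn_like(x)
phi = torch.tensor(torch.pi / 3)       # some point along the trajectory
z_phi = torch.cos(phi) * x + torch.sin(phi) * eps
v_phi = torch.cos(phi) * eps - torch.sin(phi) * x

z_prev = v_prediction_step(z_phi, v_phi, delta=torch.tensor(0.05))
```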
Implementation in ComfyUI: V-Prediction Samplers
ComfyUI implements v-prediction through several specialized samplers in the KSampler node:
DPM++ 2M Karras
- Most advanced v-prediction implementation
- Uses momentum-based updates
- Best for detailed, high-quality images
- Recommended settings:
- Steps: 25-30
- CFG: 7-8
- Scheduler: Karras
DPM++ SDE Karras
- Stochastic differential equation variant
- Good balance of speed and quality
- Recommended settings:
- Steps: 20-25
- CFG: 7-8
- Scheduler: Karras
UniPC
- Unified predictor-corrector method
- Fastest v-prediction sampler
- Great for quick iterations
- Recommended settings:
- Steps: 20-23
- CFG: 7-8
- Scheduler: Karras
Advanced V-Prediction Control
For more control over the v-prediction process, you can use:
Advanced KSampler
- Exposes additional v-prediction parameters
- Allows fine-tuning of the angle calculations
- Parameters:
start_at_step: 0.0
end_at_step: 1.0
add_noise: true
return_with_leftover_noise: false
Sampling Refinement
The velocity prediction can be refined using:
- A VAEEncode → KSampler → VAEDecode chain
- Multiple sampling passes with decreasing step counts
Example workflow:
First Pass: 30 steps, CFG 8
Refine: 15 steps, CFG 4
Final: 10 steps, CFG 2
Conditioning and Control

Text-to-image generation involves conditioning the diffusion process on text embeddings. The mathematical formulation becomes:
$$ p_\theta(x_{t-1}|x_t, \mathbf{c}) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t, \mathbf{c}), \Sigma_\theta(x_t, t, \mathbf{c})) $$
Here, $\mu_\theta$ and $\Sigma_\theta$ are conditioned on both $x_t$ and $\mathbf{c}$, which represents the conditioning information. In the context of text-to-image generation, this conditioning vector typically comes from CLIP (Contrastive Language-Image Pre-training), a neural network developed by OpenAI that creates a shared embedding space for both text and images.
Understanding CLIP and Conditioning
CLIP is a neural network trained to learn the relationship between images and text through contrastive learning. It consists of two encoders: one for text and one for images. During training, CLIP learns to maximize the cosine similarity between matching image-text pairs while minimizing it for non-matching pairs. This is achieved through a contrastive loss function operating on batches of N image-text pairs, creating an N×N similarity matrix.
The text encoder first tokenizes the input text into a sequence of tokens, then processes these through a transformer to produce a sequence of token embeddings $\text{CLIP}_\text{text}(\text{text}) \rightarrow [\mathbf{z}_1, \ldots, \mathbf{z}_n] \in \mathbb{R}^{n \times d}$, where $n$ is the sequence length and $d$ is the embedding dimension. Unlike traditional transformer architectures that use pooling layers, CLIP simply takes the final token’s embedding (corresponding to the [EOS] token) after layer normalization. The image encoder maps images to a similar high-dimensional representation $\text{CLIP}_\text{image}(\text{image}) \rightarrow \mathbf{z} \in \mathbb{R}^d$.
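As a rough sketch of what the text encoder produces, assuming the transformers library (not listed in the requirements above) and the openai/clip-vit-large-patch14 weights used by SD 1.x:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Sketch of CLIP text encoding; the checkpoint is the one used by SD 1.x.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

tokens = tokenizer("a beautiful sunset over mountains", padding="max_length",
                   max_length=77, return_tensors="pt")
with torch.no_grad():
    out = text_model(**tokens)

print(out.last_hidden_state.shape)  # [1, 77, 768]: per-token embeddings used for conditioning
print(out.pooler_output.shape)      # [1, 768]: the pooled [EOS] embedding
```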
Implementation in ComfyUI
ComfyUI exposes this CLIP architecture through the CLIPTextEncode node. When you input a prompt, the node:
- Tokenizes your text into subwords using CLIP’s tokenizer
- Processes tokens through the transformer layers
- Produces the conditioning vectors that guide the diffusion process
CLIPTextEncode
Implements the standard CLIP text encoding $\text{CLIP}_\text{text}(\text{text})$ with a single text input field for the entire prompt. Depending on whether you plug it into the positive or the negative input of a sampler node such as KSampler or SamplerCustom, it will condition the diffusion process accordingly.
- Single text input field for the entire prompt
- Handles the full sequence of tokens as one unit
- Example usage:
prompt: "a beautiful sunset over mountains, high quality"
negative prompt: "blurry, low quality, distorted"
Prompt Weighting
In CLIP-based text conditioning, you can modify the influence of specific terms using weight modifiers. The mathematical representation of weighted prompting can be expressed as:
$$ c_\text{weighted} = w \cdot c_\text{base} $$
where $w$ is the weight multiplier and $c_\text{base}$ is the base conditioning vector.
Syntax and Implementation
ComfyUI supports the following weighting syntax out of the box:
Parentheses Method
- Format:
(text:weight)
- Example:
(beautiful sunset:1.2)
- Weights > 1 increase emphasis
- Weights < 1 decrease emphasis
The weighting affects how strongly the diffusion process considers certain concepts during generation. For instance:
prompt: "a (red:1.3) cat on a (blue:.8) chair"
This prompt will emphasize the redness of the cat while subtly reducing the blue intensity of the chair.
When you apply weights, the underlying CLIP embeddings are scaled according to:
$$ \text{CLIP}_\text{weighted}(\text{text}, w) = w \cdot \text{CLIP}_\text{text}(\text{text}) $$
This scaling affects the cross-attention layers in the U-Net, modifying how strongly different parts of the prompt influence the generation process. Multiple weighted terms combine through their effects on the overall conditioning vector:
$$ c_\text{final} = \sum_i w_i \cdot \text{CLIP}_\text{text}(\text{text}_i) $$
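A toy sketch of this scaling idea follows; note that real implementations may also re-normalize or blend with the unweighted embedding, so this shows only the core intuition, and the token index ranges are made up.

```python
import torch

# Toy illustration of prompt weighting as per-token embedding scaling.
token_embeddings = torch.randn(1, 77, 768)        # stand-in for CLIP_text(text)
weights = torch.ones(1, 77, 1)
weights[:, 5:8] = 1.3                             # e.g. the tokens for "(red:1.3)"
weights[:, 12:14] = 0.8                           # e.g. the tokens for "(blue:.8)"

weighted = token_embeddings * weights             # c_weighted = w * c_base, per token
```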
Zero Weights
When a weight of 0.0 is applied, the corresponding term’s contribution to the conditioning vector becomes nullified:
$$ \text{CLIP}_\text{weighted}(\text{text}, 0) = \mathbf{0} $$
However, this doesn’t mean the term is simply ignored. Setting a weight to 0.0 can have unexpected effects because:
- The token still occupies a position in the sequence
- It may affect how surrounding terms are interpreted
For this reason, it’s generally better to omit terms you don’t want rather than setting their weights to 0.0, however, neutralizing unwanted effects from multi-token trigger words is a valuable use case for them. Some words or phrases that get split into multiple tokens by CLIP’s tokenizer can have unintended influences on generation. By using zero weights strategically, you can keep the trigger word’s primary effect while nullifying its unwanted secondary effects. Example:
# This prompt engineering ensures the character will be styled as a furry sticker,
# with a white outline around it but avoids putting stickers on the character's body.
furry (sticker:0.0)
- 0.0: Complete nullification
- 0.1 to 0.9: Reduced emphasis
- 1.0: Default weight
- 1.1 to 1.5: Moderate emphasis
- 1.5 to 2.0: Strong emphasis
- Above 2.0: Very strong emphasis (use with caution)
Depending on the model you are using, even 1.3 can be too strong, and the effect also depends on the specific token in question. When writing prompts, try to use related terms to complement the attention of the model before reaching for weights.
When a weight gets too strong, the result is known in our industry as an “explosion”. The term is visually accurate most of the time, but what actually happens could be literally anything, since the model was never trained on the conditioning input range you are trying to force images out of.
CLIPTextEncodeSD3, CLIPTextEncodeFlux, CLIPTextEncodeSDXL
Specialized CLIP nodes, as their name suggests, for different models.
While these are useful to some degree for FLUX, SD3 and later models, for SDXL, since both CLIP-L and CLIP-G were trained with the same prompts, you will achieve better results by feeding them the same prompt. The regular CLIPTextEncode node is therefore sufficient, and as shown on the screenshot below, for regional conditioning you can always use Conditioning (Set Area), which works with all of the CLIP nodes.
The sequence of token embeddings plays a crucial role in steering the diffusion process. Through cross-attention layers in the U-Net, the model can attend to different parts of the text representation as it generates the image. This mechanism enables the model to understand and incorporate multiple concepts and their relationships from the prompt.
Consider what happens when you input a prompt like “a red cat sitting on a blue chair”. The text is first split into tokens, and each token (or subword) gets its own embedding. The model can then attend differently to “red”, “cat”, “blue”, and “chair” during different stages of the generation process, allowing it to properly place and render each concept in the final image.
In most workflows, interaction with CLIP starts right from Load Checkpoint and ends at KSampler:
But you can also load it separately from the checkpoint using either the Load CLIP, DualCLIPLoader or TripleCLIPLoader nodes, as needed.
Advanced CLIP Operations: The ConditioningCombine Node
To support complex prompting scenarios, ComfyUI provides the ConditioningCombine node, which implements mathematical operations on conditioning vectors:
Concatenation Mode
$$ c_\text{combined} = [c_1; c_2] $$
- Preserves both conditions fully
- Useful for regional prompting
Average Mode
$$ c_\text{combined} = \alpha c_1 + (1-\alpha) c_2 $$
- Blends multiple conditions
- $\alpha$ controls the mixing ratio
Image-Based Conditioning: The CLIPVisionEncode Node
This node implements the image encoder part of CLIP:
$$ \text{CLIP}_\text{image}(\text{image}) \rightarrow \mathbf{z} \in \mathbb{R}^d $$
The complete pipeline for image-guided generation becomes:
$$ z_\text{image} = \text{CLIP}_\text{image}(\text{image}) $$
$$ z_\text{text} = \text{CLIP}_\text{text}(\text{prompt}) $$
$$ c_\text{combined} = \text{Combine}(z_\text{text}, z_\text{image}) $$
Beyond simple text conditioning, modern diffusion models support various forms of guidance. Image conditioning (img2img) allows existing images to influence the generation process. Control signals through ControlNet provide fine-grained control over structural elements. Style vectors extracted from reference images can guide aesthetic qualities, while structural guidance through depth maps or pose estimation can enforce specific spatial arrangements. Each of these conditioning methods can provide additional context to guide the generation process.
Unconditional vs Conditional Generation and CFG
The diffusion model can actually generate images in two modes: unconditional, where no guidance is provided ($\mathbf{c} = \emptyset$), and conditional, where we use our CLIP embedding or other conditioning signals. Classifier-Free Guidance (CFG) leverages both of these modes to enhance the generation quality.
The CFG process works by predicting two denoising directions at each timestep:
- An unconditional prediction: $\epsilon_\theta(x_t, t)$
- A conditional prediction: $\epsilon_\theta(x_t, t, \mathbf{c})$
These predictions are then combined using a guidance scale $w$ (often called the CFG scale):
$$ \epsilon_\text{CFG} = \epsilon_\theta(x_t, t) + w[\epsilon_\theta(x_t, t, \mathbf{c}) - \epsilon_\theta(x_t, t)] $$
The guidance scale $w$ controls how strongly the conditioning influences the generation. A higher value of $w$ (typically 7-12) results in images that more closely match the prompt but may be less realistic, while lower values (1-4) produce more natural images that follow the prompt more loosely. When $w = 0$, we get purely unconditional generation, and as $w \to \infty$, the model becomes increasingly deterministic in following the conditioning.
This is why in ComfyUI, you’ll often see a “CFG Scale” parameter in sampling nodes. It directly controls this weighting between unconditional and conditional predictions, allowing you to balance prompt adherence against image quality.
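In code, the CFG combination is just a weighted extrapolation from the unconditional toward the conditional prediction; a minimal sketch with stand-in tensors:

```python
import torch

def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the conditional one by the guidance scale w."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# Stand-ins for the two model outputs at one denoising step.
eps_uncond = torch.randn(1, 4, 64, 64)
eps_cond = torch.randn(1, 4, 64, 64)
eps_cfg = cfg_combine(eps_uncond, eps_cond, w=7.0)
```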
Implementation in ComfyUI: CFG Control
Basic CFG Control
The KSampler node provides direct control over CFG through its parameters:
CFG Scale: The $w$ parameter in the equation
Recommended ranges:
- Photorealistic: 4-7
- Artistic: 7-12
- Strong stylization: 12-15
Advanced CFG Techniques
ComfyUI offers several nodes for fine-tuning CFG behavior:
a) CFGScheduler Node
- Dynamically adjusts CFG during sampling
- Mathematical operation: $$w_t = w_\text{start} + t(w_\text{end} - w_\text{start})$$
- Example schedule:
Start CFG: 12 (for initial structure)
End CFG: 7 (for natural details)
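A small sketch of the linear schedule above, using the example start/end values (the step count is an assumption):

```python
import torch

# Linear CFG schedule: w_t = w_start + t * (w_end - w_start), with t in [0, 1].
w_start, w_end, steps = 12.0, 7.0, 30
t = torch.linspace(0.0, 1.0, steps)
cfg_schedule = w_start + t * (w_end - w_start)   # 12.0 ... 7.0 over the sampling steps
```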
b) CFGDenoiser Node
- Provides manual control over the denoising process
- Allows separate CFG for different regions
- Useful for:
- Regional prompt strength
- Selective detail enhancement
- Balancing multiple conditions
Practical CFG Workflows
a) Basic Text-to-Image
CLIPTextEncode [positive] → KSampler (CFG: 7)
CLIPTextEncode [negative] → KSampler
b) Dynamic CFG
CLIPTextEncode → CFGScheduler → KSampler
Parameters:
- Start: 12 (0-25% steps)
- Middle: 8 (25-75% steps)
- End: 6 (75-100% steps)
c) Regional CFG
CLIPTextEncode → ConditioningSetArea → CFGDenoiser
Multiple regions with different CFG values:
- Focus area: CFG 12
- Background: CFG 7
CFG Troubleshooting
Common issues and solutions:
- Over-saturation: Reduce CFG or use scheduling
- Loss of details: Increase CFG in final steps
- Unrealistic results: Lower CFG, especially in early steps
- Inconsistent style: Use CFG scheduling to balance
Core Components
Models and Their Mathematical Foundations
Before starting with ComfyUI, you need to understand the different types of models:
Base Models (Checkpoints)
- Stored in models\checkpoints
- Implement the full diffusion process: $p_\theta(x_{0:T})$
- Examples: CompassMix XL Lightning, Pony Diffusion V6 XL
The Load Checkpoint Node
The Load Checkpoint node in ComfyUI is your interface to the base diffusion model. It loads the model weights ($\theta$) and initializes the U-Net architecture that performs the reverse diffusion process. When you connect this node, you’re essentially preparing the neural network that will implement:
$$ p_\theta(x_{t-1}|x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)) $$
The node outputs three crucial components:
- MODEL: The U-Net that predicts noise or velocity
- CLIP: The text encoder for conditioning
- VAE: The encoder/decoder for image latents
These outputs correspond to the three main mathematical operations in the diffusion process:
- MODEL: Implements $p_\theta(x_{t-1}|x_t)$
- CLIP: Provides conditioning $c$ for $p_\theta(x_{t-1}|x_t, c)$
- VAE: Handles the mapping between image space $x$ and latent space $z$
LoRAs (Low-Rank Adaptation)
- Stored in models\loras
- Mathematically represented as: $W = W_0 + BA$ where:
- $W_0$ is the original weight matrix
- $B$ and $A$ are low-rank matrices
- Reduces parameter count while maintaining model quality
The LoRA Loader Node
The LoRA Loader node implements the low-rank adaptation by modifying the base model’s weights according to the equation:
$$ W_{\text{final}} = W_0 + \alpha \cdot (BA) $$
where $\alpha$ is the weight parameter in the node (typically 0.5-1.0). The node:
- Takes a MODEL input from Load Checkpoint
- Applies the LoRA weights ($BA$) with scaling $\alpha$
- Outputs the modified model
When you adjust the weight parameter, you’re directly controlling the influence of the low-rank matrices on the original weights.
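A minimal sketch of the weight update being applied (the shapes, rank, and scaling values are illustrative, not taken from any particular LoRA):

```python
import torch

# LoRA merge: W_final = W_0 + alpha * (B @ A), with rank r much smaller than the
# full dimensions. Shapes here are illustrative.
d_out, d_in, r = 768, 768, 16
W0 = torch.randn(d_out, d_in)        # original weight matrix
B = torch.randn(d_out, r) * 0.01     # low-rank factors learned during LoRA training
A = torch.randn(r, d_in) * 0.01
alpha = 0.8                          # the node's weight parameter

W_final = W0 + alpha * (B @ A)
```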
VAE (Variational Autoencoder)
The VAE (Variational Autoencoder) is a type of neural network that learns to:
- Encode images into a low-dimensional latent space representation using a neural encoder.
- Decode these latent vectors back into images during generation.
- Learn to generate new images by sampling from a lower-dimensional latent space for efficiency.
- Stored in models\vae
- Implements the encoding $q_\phi(z|x)$ and decoding $p_\psi(x|z)$
The VAE Encode/Decode Nodes
These nodes implement the probabilistic encoding and decoding:
$$ z \sim q_\phi(z|x) $$
$$ x \sim p_\psi(x|z) $$
- VAE Encode: Converts images to latents (mean + variance)
- VAE Decode: Converts latents back to images
The latent space operations are crucial because:
- They reduce memory usage (the latent grid is 8x smaller than the image in each spatial dimension)
- They provide a more stable space for diffusion
- They help maintain semantic consistency
Node Based Workflow
The node-based interface in ComfyUI represents the mathematical operations as interconnected components. Each node performs specific operations in the diffusion process:
- Sampling nodes implement the reverse diffusion process
- Conditioning nodes handle text embeddings and other control signals
- VAE nodes handle encoding/decoding between image and latent space: $\mathcal{E}(x)$ and $\mathcal{D}(z)$
When you’re new to node-based workflows, think of each connection as passing tensors and parameters between mathematical operations. The entire workflow represents a computational graph that implements the full diffusion process.
Getting Started
To begin experimenting with these concepts, you can clear your workflow:
You can add nodes by either right-clicking or double-clicking on an empty area: