
DoRA: Weight-Decomposed Low-Rank Adaptation for Stable Diffusion Models

Written by Gaeros and Claude et al.

DoRA weight decomposition visualization

DoRA (Weight-Decomposed Low-Rank Adaptation) by Liu et al. represents a significant advancement in LoRA training technology for diffusion models like SDXL. This innovative approach improves upon traditional LoRA by separately handling weight magnitudes and directions during the fine-tuning process.

By decomposing weights into these distinct components, DoRA achieves better performance than standard LoRA while maintaining similar parameter efficiency - a critical factor for training stable diffusion models. The paper reports consistent gains over LoRA across a range of tasks, including:

  • Commonsense reasoning (+1.0-4.4%)
  • Visual-language understanding (+0.9-1.9%)
  • Instruction tuning (+0.7-1.1%)

When applied to stable diffusion models like SDXL or Pony Diffusion, DoRA can create more accurate and efficient LoRA adaptations with the same training resources.

Understanding DoRA for LoRA Training

Key Insight for Diffusion Model Training

Drawing on Weight Normalization techniques, which achieve faster convergence by improving gradient conditioning through weight reparameterization, the authors analyzed how weights change during fine-tuning. By decomposing weights into magnitude and directional components, they found that traditional LoRA and full fine-tuning (FT) exhibit markedly different update patterns.

Their analysis revealed a critical empirical finding for stable diffusion model training: when fine-tuning pre-trained models, the most effective adaptations (as seen in FT) often involve either:

  1. Large magnitude changes with minimal directional adjustments, or
  2. Significant directional changes while preserving magnitudes

but rarely both at once. In contrast, standard LoRA showed a coupled behavior in which magnitude and directional changes scaled together, limiting its effectiveness for SDXL and other diffusion models.

This discovery suggested that pre-trained diffusion weights already encode useful feature combinations, and effective adaptation primarily needs to adjust either their importance (magnitude) or their mixing (direction) independently. This insight led to DoRA’s design: explicitly separating these components to enhance both learning capacity and training stability.

Weight Decomposition and Activations in Diffusion Models

In neural networks powering stable diffusion models, each weight matrix $W \in \mathbb{R}^{d \times k}$ transforms input vectors $x \in \mathbb{R}^k$ into output vectors $y \in \mathbb{R}^d$. This transformation can be understood through two complementary views:

  • Column view (geometric): Each column $w_j$ is a vector in output space ($\mathbb{R}^d$) that gets scaled by the corresponding input component $x_j$
  • Row view (algebraic): Each row $w^i$ computes one component of the output through inner product with the input

These views manifest in how the transformation works in diffusion model training:

  • Column view: $y = \sum_j x_j w_j$ (each input component $x_j$ scales its corresponding output-space direction $w_j$)
  • Row view: $y_i = \langle w^i, x \rangle$ (each row computes one component of the output)
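
Both views compute the same product, which is easy to check numerically. Here is a minimal sketch in PyTorch (dimensions and variable names are illustrative, not from the paper):

import torch

d, k = 4, 3                       # output and input dimensions
W = torch.randn(d, k)             # weight matrix
x = torch.randn(k)                # input vector

# Row view: each output component is an inner product with a row of W
y_row = torch.stack([W[i] @ x for i in range(d)])

# Column view: the output is a weighted sum of W's columns
y_col = sum(x[j] * W[:, j] for j in range(k))

assert torch.allclose(W @ x, y_row) and torch.allclose(W @ x, y_col)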

DoRA vs LoRA comparison

The paper defines $||\cdot||_c$ as the L2 norm of each column vector. When we decompose weights in DoRA for diffusion model training, each column vector $w_j$ is decomposed as:

  • Direction: $v_j = w_j/||w_j||_2$ (normalized vector in output space)
  • Magnitude: $m_j = ||w_j||_2$ (scalar)

This column-wise decomposition has a clear geometric meaning for SDXL models:

  • Each $v_j$ is a unit vector in $\mathbb{R}^d$ defining a direction in output space
  • $m_j$ scales this direction’s contribution when input component $x_j$ is active
  • The full transformation becomes $y = \sum_j (m_j x_j) v_j$
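
The decomposition itself is just a per-column norm and rescale. Extending the snippet above, a short illustrative sketch that recovers the original transformation from $m$ and $V$:

import torch

d, k = 4, 3
W = torch.randn(d, k)
x = torch.randn(k)

m = W.norm(dim=0)                 # column magnitudes m_j, shape (k,)
V = W / m                         # unit-norm column directions v_j, shape (d, k)

# y = sum_j (m_j * x_j) * v_j reproduces y = W x
y = sum(m[j] * x[j] * V[:, j] for j in range(k))
assert torch.allclose(y, W @ x, atol=1e-6)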

The paper’s empirical analysis reveals that effective fine-tuning of diffusion models (as seen in FT) often requires:

  1. Adjusting the scaling factors ($m_j$) while preserving the output-space directions ($v_j$), or
  2. Refining the output-space directions ($v_j$) while maintaining their relative scales ($m_j$)

Pattern Analysis of LoRA and Full Fine-tuning for Diffusion Models

Drawing from Weight Normalization, the paper introduces a decomposition analysis to understand how input features’ influence patterns change during fine-tuning SDXL and other diffusion models. For a weight matrix $W \in \mathbb{R}^{d \times k}$, the decomposition is:

$$W = m\frac{V}{||V||_{c}} = ||W||_{c}\frac{W}{||W||_{c}}$$

where:

  • $m \in \mathbb{R}^{1 \times k}$ contains the scaling factors for each input feature’s influence
  • $V \in \mathbb{R}^{d \times k}$ contains the unnormalized output-space directions
  • $||\cdot||_{c}$ computes the L2 norm of each column vector in output space

The analysis showed that standard LoRA exhibits coupled changes in both scaling factors and output-space directions, while full fine-tuning (FT) has the ability to independently adjust either scaling factors or directions. This key difference indicates that traditional LoRA lacks the capability for such decoupled adjustments, which limits its flexibility compared to FT when training stable diffusion models.
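
In the spirit of that analysis, one can measure how far a fine-tuned weight has drifted from its pre-trained counterpart in magnitude versus direction. The sketch below illustrates the idea rather than reproducing the paper's exact metrics; the function name and the simple averaging are assumptions:

import torch

def magnitude_direction_change(W0: torch.Tensor, W_ft: torch.Tensor):
    """Average per-column magnitude change and directional change (1 - cosine)."""
    m0, m1 = W0.norm(dim=0), W_ft.norm(dim=0)    # per-column magnitudes
    V0, V1 = W0 / m0, W_ft / m1                  # per-column unit directions
    delta_m = (m1 - m0).abs().mean()             # magnitude drift
    delta_d = (1 - (V0 * V1).sum(dim=0)).mean()  # directional drift
    return delta_m.item(), delta_d.item()

# Plotting delta_m against delta_d across layers and checkpoints is what reveals
# the decoupled (FT) versus coupled (LoRA) update patterns described above.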

Implementing DoRA for Diffusion Models

DoRA explicitly separates each column’s magnitude from its direction in the weight matrix. Given a pre-trained stable diffusion weight $W_0$, DoRA:

  1. Extracts the column norms as trainable parameters:

    • $m = ||W_0||_c \in \mathbb{R}^{1 \times k}$ (one scalar per input dimension)
    • Each $m_j$ represents the magnitude of input feature $j$’s influence
  2. Applies LoRA-style updates to the directions:

    • Base directions: $V = \frac{W_0}{||W_0||_c}$
    • Low-rank update: $\Delta V = BA$ where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$
    • $r \ll \min(d,k)$ is the rank hyperparameter (typically 8-32 for diffusion models)

The fine-tuned weight for your stable diffusion model is then computed as:

$$W' = m\frac{V+\Delta V}{||V+\Delta V||_c} = m\frac{W_0+BA}{||W_0+BA||_c}$$

where:

  • $\frac{V+\Delta V}{||V+\Delta V||_c}$ ensures each column is a unit vector in output space
  • $m$ scales these unit vectors to control each input feature’s influence
  • The separation allows independent updates to magnitudes and directions

This decomposition enables DoRA to adjust feature importance through $m$ without changing directions, refine feature combinations through $B,A$ while maintaining unit-length directions, and achieve more efficient adaptation by targeting the specific type of update needed for your diffusion model.
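
Putting the pieces together, a DoRA-style linear update might look like the minimal sketch below. It follows the paper's $1 \times k$ magnitude convention and uses illustrative names throughout; it is not the LyCORIS or PEFT implementation:

import torch
import torch.nn as nn

class DoRALinearSketch(nn.Module):
    """Illustrative DoRA update around a frozen pre-trained weight W0 of shape (d, k)."""
    def __init__(self, W0: torch.Tensor, rank: int = 16):
        super().__init__()
        d, k = W0.shape
        self.register_buffer("W0", W0)                        # frozen pre-trained weight
        self.m = nn.Parameter(W0.norm(dim=0, keepdim=True))   # magnitudes, shape (1, k)
        self.B = nn.Parameter(torch.zeros(d, rank))           # B starts at zero, so the initial
        self.A = nn.Parameter(torch.randn(rank, k) * 0.01)    # merged weight equals W0

    def merged_weight(self) -> torch.Tensor:
        V_prime = self.W0 + self.B @ self.A                   # directional update V + ΔV
        norm = V_prime.norm(dim=0, keepdim=True)              # column norms ||V + ΔV||_c, shape (1, k)
        return self.m * V_prime / norm                        # W' = m (W0 + BA) / ||W0 + BA||_c

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.merged_weight().T                     # y = W' x, for x of shape (..., k)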

Implementation Note: Column-wise Normalization for Diffusion Training

An important implementation detail, which was ambiguous in the original paper and later clarified in the LyCORIS library, concerns how the column-wise norms are computed when training stable diffusion models. For a weight matrix $W \in \mathbb{R}^{d \times k}$ (where $d$ is the output dimension and $k$ is the input dimension), LyCORIS exposes two implementations:

  • Column-wise view (new correct method, wd_on_out=True):

    • Directly computes norms across input dimensions for each output dimension
    • Results in norms of shape $(d, 1)$, one per output feature
    • Each norm represents the magnitude of that output feature’s response to all inputs
  • Transposed view (default method, wd_on_out=False):

    • Computes norms across output dimensions for each input dimension
    • Results in norms of shape $(1, k)$, one per input feature
    • Each norm represents how much an input feature influences all outputs combined

The column-wise view is considered “correct” for diffusion model training because:

  1. It preserves the natural interpretation where each column $w_j$ is a vector in output space ($\mathbb{R}^d$)
  2. Each normalized column represents a unit direction in output space, controlling how that input feature influences the output distribution
  3. It matches Weight Normalization’s original formulation for gradient conditioning

The key implementation in LyCORIS (which many use for training SDXL models) shows these two approaches:

if self.wd_on_out:
    # Column-wise view (correct):
    # Norms of shape (output_features, 1)
    weight_norm = (
        weight.reshape(weight.shape[0], -1)      # [d, k] -> [d, -1]
        .norm(dim=1)                             # Norm along input dimensions
        .reshape(weight.shape[0], *[1] * self.dora_norm_dims)
    ) + torch.finfo(weight.dtype).eps
else:
    # Transposed view (not recommended):
    # Norms of shape (1, input_features)
    weight_norm = (
        weight.transpose(0, 1)                    # [d, k] -> [k, d]
        .reshape(weight.shape[1], -1)            # Reshape to [k, -1]
        .norm(dim=1, keepdim=True)               # Norm along output dimensions
        .reshape(weight.shape[1], *[1] * self.dora_norm_dims)
        .transpose(0, 1)
    ) + torch.finfo(weight.dtype).eps
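
For a plain 2D linear weight of shape [out_features, in_features], the two branches above boil down to the following norm shapes (a toy example chosen only to show the shapes):

import torch

weight = torch.randn(640, 320)                  # [out_features, in_features] = [d, k]

norm_out = weight.norm(dim=1, keepdim=True)     # wd_on_out=True  -> shape (640, 1), one per output feature
norm_in  = weight.norm(dim=0, keepdim=True)     # wd_on_out=False -> shape (1, 320), one per input feature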

Gradient Analysis for Efficient Training

The gradients with respect to $V' = V+\Delta V$ and $m$ are:

$$\nabla_{V'} \mathcal{L} = \frac{m}{||V'||_{c}}\left( I - \frac{V'V'^{\mathbf{T}}}{||V'||_{c}^2} \right) \nabla_{W'} \mathcal{L}$$

$$\nabla_m \mathcal{L} = \frac{\nabla_{W'} \mathcal{L} \cdot V'}{||V'||_{c}}$$

To reduce memory overhead during training diffusion models, $||V + \Delta V||_{c}$ is treated as a constant during backpropagation (detached from the gradient graph), leading to simplified gradients:

$$\nabla_{V'} \mathcal{L} = \frac{m}{C} \nabla_{W'} \mathcal{L} \quad \text{where } C = ||V'||_{c}$$

This modification preserves the gradient with respect to $m$ while reducing training memory by ~24.4% for LLaMA and ~12.4% for VL-BART. The impact on accuracy is negligible, with only a 0.2% difference on LLaMA and no measurable difference on VL-BART. For resource-intensive stable diffusion training, this memory efficiency is particularly valuable.
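
In code, this amounts to detaching the column norm from the autograd graph when forming the merged weight. A sketch of the idea with illustrative names (the detach_norm flag is an assumption, not a LyCORIS option):

import torch

def dora_merged_weight(W0, m, B, A, detach_norm=True):
    """Merged DoRA weight W' = m * (W0 + B @ A) / ||W0 + B @ A||_c for W0 of shape (d, k)."""
    V_prime = W0 + B @ A
    norm = V_prime.norm(dim=0, keepdim=True)
    if detach_norm:
        # Memory-saving trick: treat ||V + ΔV||_c as a constant during backprop
        norm = norm.detach()
    return m * V_prime / norm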

Training Results with DoRA

Commonsense Reasoning Models

  • Tested on LLaMA-7B/13B, LLaMA2-7B, and LLaMA3-8B
  • DoRA outperforms LoRA by:
    • +3.7% on LLaMA-7B
    • +1.0% on LLaMA-13B
    • +2.9% on LLaMA2-7B
    • +4.4% on LLaMA3-8B
  • Half-rank DoRA ($\text{DoRA}^{\dagger}$) still outperforms full-rank LoRA

While these aren’t diffusion models, the principles apply similarly - DoRA consistently outperforms standard LoRA approaches.

Image/Video-Text Understanding (Relevant to Diffusion)

  • VL-BART backbone
  • Image tasks: VQA, GQA, NLVR2, COCO Caption
    • DoRA: 77.4% vs LoRA: 76.5%
  • Video tasks: TVQA, How2QA, TVC, YC2C
    • DoRA: 85.4% vs LoRA: 83.5%

These vision-language results suggest that DoRA's advantages carry over to visual domains, which is encouraging for stable diffusion training.

Visual Instruction Tuning

  • LLaVA-1.5-7B model
  • DoRA: 67.6% vs LoRA: 66.9% vs FT: 66.5%

Using DoRA for Your Stable Diffusion Training

To use DoRA in your SDXL LoRA training workflow:

  1. Use the LyCORIS implementation with the algo=dora parameter
  2. Consider setting wd_on_out=True for the theoretically correct column-wise normalization
  3. Maintain similar rank settings as you would for standard LoRA (typically 8-32)
  4. You may be able to use fewer training steps due to DoRA’s improved efficiency

Example training command with DoRA:

accelerate launch --num_cpu_threads_per_process=2  "./sdxl_train_network.py" \
    --pretrained_model_name_or_path=/models/ponyDiffusionV6XL_v6StartWithThisOne.safetensors \
    --train_data_dir=/training_dir \
    --resolution="1024,1024" \
    --output_dir="/output_dir" \
    --output_name="dora-test" \
    --network_module="lycoris.kohya" \
    --network_args \
               "algo=dora" \
               "conv_dim=32" \
               "conv_alpha=16" \
               "use_tucker=False" \
               "wd_on_out=True" \
    --network_dim=32 \
    --network_alpha=16 \
    --max_train_steps=400 \
    --learning_rate=1e-4

For detailed information on LyCORIS implementation and other training parameters, check our guide on LyCORIS model optimization.

Conclusion: DoRA’s Impact on Stable Diffusion Training

DoRA represents a significant advancement in LoRA training techniques for stable diffusion models. By decomposing weights into magnitude and directional components, it offers several advantages:

  1. Higher Quality Results: DoRA consistently outperforms traditional LoRA across diverse tasks, including visual processing relevant to diffusion models.

  2. Efficient Parameter Usage: By separately handling weight magnitudes and directions, DoRA makes more effective use of the same parameter budget as LoRA.

  3. Flexible Adaptation: The decomposed structure allows the model to independently adjust either feature importance or feature combinations, leading to more precise fine-tuning.

  4. Memory Efficiency: The simplified gradient computation reduces training memory requirements without sacrificing performance.

  5. Theoretical Soundness: DoRA’s approach aligns with empirical observations about how weights change during effective fine-tuning.

For stable diffusion practitioners looking to create high-quality, efficient LoRA adaptations, DoRA offers a compelling upgrade over traditional methods. When combined with other optimization techniques like training change tracking and proper model shrinking, it enables the creation of more powerful LoRAs with the same computational resources.