10 Minute LoRA Training for the Ultimate Degenerates

10 Minute SDXL LoRA Training for the Ultimate Degenerates


Introduction


This is a short, direct guide to an experimental SDXL LoRA training technique. Training completes in just 80 steps and works well with both the Pony Diffusion V6 XL and CompassMix XL models.

The “10 Minute” in the title is just clickbait. Your actual training time will depend on:

  • Your dataset size
  • Step size configuration
  • GPU capabilities

The good news? It’s incredibly fast! These 80 steps work perfectly for character and style training with 40-200 images without scaling issues. You’ll only need to adjust the output name between training sessions.
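
To see why a fixed 80 steps can cover both small and large datasets, it helps to work out how many samples those steps actually consume. The sketch below is only arithmetic, not part of the training code. It assumes the batch size of 8 and the 6 gradient accumulation steps from the training command later in this guide, and it assumes each training step counts as one optimizer update after accumulation. The function names are made up for illustration, and bucketing can make some batches smaller, so treat the numbers as approximate.

```python
def samples_seen(steps: int = 80, batch_size: int = 8, grad_accum: int = 6) -> int:
    """Approximate number of training samples consumed over the whole run."""
    return steps * batch_size * grad_accum


def passes_over_dataset(num_images: int, **settings) -> float:
    """Roughly how many times each image is seen (ignores bucketing effects)."""
    return samples_seen(**settings) / num_images


if __name__ == "__main__":
    print(f"samples seen in 80 steps: {samples_seen()}")
    for num_images in (40, 200):
        passes = passes_over_dataset(num_images)
        print(f"{num_images} images: about {passes:.1f} passes over the dataset")
```

Under these assumptions the run covers a few thousand samples, so even a 200-image dataset is seen many times over.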

Setup and Training


First, you’ll need to modify sd-scripts with the optimizer-specific changes described in our Custom Optimizer Guide. Instead of the optimizer on that page, we’ll use SAVEUS for SDXL model training.

The SAVEUS Optimizer

import torch
from torch.optim import Optimizer
from typing import Callable, Optional, Tuple

class SAVEUS(Optimizer):
    r"""
    Implements the SAVEUS optimization algorithm, incorporating techniques from ADOPT.
    
    The optimizer combines several advanced optimization techniques:
    1. Gradient Centralization: Removes the mean of gradients for each layer:
       g_t = g_t - mean(g_t)
    
    2. Adaptive Gradient Normalization: Normalizes gradients using their standard deviation:
       g_t = (1 - α) * g_t + α * (g_t / std(g_t))
       where α is the normalization parameter
    
    3. Momentum with Amplification: 
       - First moment: m_t = β₁ * m_{t-1} + (1 - β₁) * g_t
       - Amplified gradient: g_t = g_t + amp_fac * m_t
       where β₁ is the first moment decay rate
    
    4. Adaptive Step Sizes:
       - Second moment: v_t = β₂ * v_{t-1} + (1 - β₂) * g_t²
       - Bias correction: v̂_t = v_t / (1 - β₂ᵗ)
       - Step size: η_t = lr / (1 - β₁ᵗ)
       where β₂ is the second moment decay rate
    
    Complete Update Rule:
    1. If decouple_weight_decay:
       θ_t = θ_{t-1} * (1 - η_t * λ) - η_t * g_t / (√v̂_t + ε)
    2. Otherwise:
       θ_t = θ_{t-1} - η_t * (g_t + λ * θ_{t-1}) / (√v̂_t + ε)

    Where:
    - θ_t: Parameters at step t
    - η_t: Learning rate with bias correction
    - g_t: Gradient (after centralization, normalization, and amplification)
    - v̂_t: Bias-corrected second moment estimate
    - λ: Weight decay coefficient
    - ε: Small constant for numerical stability
    
    Arguments:
        params (iterable):
            Iterable of parameters to optimize or dicts defining parameter groups.
        lr (float, optional):
            Learning rate (default: 1e-3).
        betas (Tuple[float, float], optional):
            Coefficients for computing running averages of gradient (β₁) and its square (β₂) (default: (0.9, 0.999)).
        eps (float, optional):
            Term added to the denominator to improve numerical stability (default: 1e-8).
        weight_decay (float, optional):
            Weight decay (L2 penalty) (default: 0).
        centralization (float, optional):
            Strength of gradient centralization (default: 0.5).
        normalization (float, optional):
            Interpolation factor for normalized gradients (default: 0.5).
        normalize_channels (bool, optional):
            Whether to normalize gradients channel-wise (default: True).
        amp_fac (float, optional):
            Amplification factor for the momentum term (default: 2.0).
        clip_lambda (Optional[Callable[[int], float]], optional):
            Function computing gradient clipping threshold from step number (default: step**0.25).
        decouple_weight_decay (bool, optional):
            Whether to apply weight decay directly to weights (default: False).
        clip_gradients (bool, optional):
            Whether to enable gradient clipping (default: False).
    """

    def __init__(
        self,
        params,
        lr: float = 1e-3,
        betas: Tuple[float, float] = (0.9, 0.999),
        eps: float = 1e-8,
        weight_decay: float = 0.0,
        centralization: float = 0.5,
        normalization: float = 0.5,
        normalize_channels: bool = True,
        amp_fac: float = 2.0,
        clip_lambda: Optional[Callable[[int], float]] = lambda step: step**0.25,
        decouple_weight_decay: bool = False,
        clip_gradients: bool = False,
    ):
        defaults = dict(
            lr=lr,
            betas=betas,
            eps=eps,
            weight_decay=weight_decay,
            centralization=centralization,
            normalization=normalization,
            normalize_channels=normalize_channels,
            amp_fac=amp_fac,
            clip_lambda=clip_lambda,
            decouple_weight_decay=decouple_weight_decay,
            clip_gradients=clip_gradients,
        )
        super(SAVEUS, self).__init__(params, defaults)

    def normalize_gradient(
        self,
        x: torch.Tensor,
        use_channels: bool = False,
        alpha: float = 1.0,
        epsilon: float = 1e-8,
    ) -> None:
        r"""Normalize gradient with standard deviation.
        :param x: torch.Tensor. Gradient.
        :param use_channels: bool. Channel-wise normalization.
        :param alpha: float. Interpolation weight between original and normalized gradient.
        :param epsilon: float. Small value to prevent division by zero.
        """
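        size: int = x.dim()
        if size > 1 and use_channels:
            s = x.std(dim=tuple(range(1, size)), keepdim=True).add_(epsilon)
            # Out-of-place div: x.div_(s) would overwrite x before lerp_ reads it,
            # forcing full normalization regardless of alpha.
            x.lerp_(x.div(s), weight=alpha)
        elif torch.numel(x) > 2:
            s = x.std().add_(epsilon)
            x.lerp_(x.div(s), weight=alpha)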

    def step(self, closure: Optional[Callable] = None):
        """Perform a single optimization step.
        
        Args:
            closure (Callable, optional):
                A closure that reevaluates the model and returns the loss.
        """
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            lr = group["lr"]
            betas = group["betas"]
            eps = group["eps"]
            weight_decay = group["weight_decay"]
            centralization = group["centralization"]
            normalization = group["normalization"]
            normalize_channels = group["normalize_channels"]
            amp_fac = group["amp_fac"]
            clip_lambda = group["clip_lambda"]
            decouple_weight_decay = group["decouple_weight_decay"]
            clip_gradients = group["clip_gradients"]

            for p in group["params"]:
                if p.grad is None:
                    continue
                grad = p.grad.data
                if grad.is_sparse:
                    raise RuntimeError("SAVEUS does not support sparse gradients")

                state = self.state[p]

                # State initialization
                if len(state) == 0:
                    state["step"] = 0
                    state["ema"] = torch.zeros_like(p.data)
                    state["ema_squared"] = torch.zeros_like(p.data)

                ema, ema_squared = state["ema"], state["ema_squared"]
                beta1, beta2 = betas
                state["step"] += 1

                # Center the gradient
                if centralization != 0:
                    grad.sub_(
                        grad.mean(dim=tuple(range(1, grad.dim())), keepdim=True).mul_(centralization)
                    )

                # Normalize the gradient
                if normalization != 0:
                    self.normalize_gradient(
                        grad, use_channels=normalize_channels, alpha=normalization
                    )

                # Bias correction
                bias_correction = 1 - beta1 ** state["step"]
                bias_correction_sqrt = (1 - beta2 ** state["step"]) ** 0.5
                step_size = lr / bias_correction

                # Update EMA of gradient
                ema.mul_(beta1).add_(grad, alpha=1 - beta1)
                # Amplify gradient with EMA
                grad.add_(ema, alpha=amp_fac)
                # Update EMA of squared gradient
                ema_squared.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

                # Compute denominator
                denom = ema_squared.sqrt().div_(bias_correction_sqrt).add_(eps)

                if decouple_weight_decay and weight_decay != 0:
                    p.data.mul_(1 - step_size * weight_decay)
                elif weight_decay != 0:
                    grad.add_(p.data, alpha=weight_decay)

                # Apply gradient clipping if enabled
                if clip_gradients and clip_lambda is not None:
                    clip = clip_lambda(state["step"])
                    grad.clamp_(-clip, clip)

                # Update parameters
                p.data.addcdiv_(grad, denom, value=-step_size)

        return loss
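
The formulas in the docstring are easier to follow when traced by hand on a few numbers. The sketch below is plain Python, not the optimizer itself: it treats one row of a weight gradient as a single channel and follows the documented steps (centralization, interpolated normalization, momentum amplification, bias-corrected second moment) with the default hyperparameters. The helper names and the sample gradients are made up for illustration.

```python
import math
import statistics


def centralize(row, strength=0.5):
    """Subtract a fraction of the row mean from every gradient entry."""
    mean = sum(row) / len(row)
    return [g - strength * mean for g in row]


def normalize(row, alpha=0.5, eps=1e-8):
    """Blend each entry with its std-normalized value: (1 - a) * g + a * g / std."""
    std = statistics.stdev(row) + eps  # sample std, like torch.std's default
    return [(1 - alpha) * g + alpha * (g / std) for g in row]


def first_step_deltas(row, lr=1e-3, beta1=0.9, beta2=0.999, amp_fac=2.0, eps=1e-8):
    """Parameter change on step 1, starting from zero-initialized EMA state."""
    step = 1
    step_size = lr / (1 - beta1 ** step)
    deltas = []
    for g in row:
        ema = (1 - beta1) * g                  # first moment
        g_amp = g + amp_fac * ema              # amplified gradient
        ema_sq = (1 - beta2) * g_amp ** 2      # second moment
        denom = math.sqrt(ema_sq) / math.sqrt(1 - beta2 ** step) + eps
        deltas.append(-step_size * g_amp / denom)
    return deltas


if __name__ == "__main__":
    grads = [0.2, -0.1, 0.5, 0.0]  # hypothetical gradient row
    processed = normalize(centralize(grads))
    print("processed gradients:", processed)
    print("first-step deltas:  ", first_step_deltas(processed))
```

On the very first step the second-moment bias correction cancels the gradient magnitude, so each nonzero entry moves by roughly lr / (1 - β₁) against the sign of its gradient, which is about ten times lr with the default β₁ of 0.9.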

Optimized Training Settings

Use these training settings for the fastest results:

accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" \
    --pretrained_model_name_or_path=/models/ponyDiffusionV6XL_v6StartWithThisOne.safetensors \
    --train_data_dir=/training_dir \
    --resolution="1024,1024" \
    --output_dir="/output_dir" \
    --output_name="yifftoolkit-schnell" \
    --enable_bucket \
    --min_bucket_reso=256 \
    --max_bucket_reso=2048 \
    --network_alpha=4 \
    --save_model_as="safetensors" \
    --network_module="lycoris.kohya" \
    --network_args \
               "preset=full" \
               "conv_dim=256" \
               "conv_alpha=4" \
               "rank_dropout=0" \
               "module_dropout=0" \
               "use_tucker=False" \
               "use_scalar=False" \
               "rank_dropout_scale=False" \
               "algo=locon" \
               "dora_wd=False" \
               "train_norm=False" \
    --network_dropout=0 \
    --lr_scheduler="cosine" \
    --lr_scheduler_args="num_cycles=0.375" \
    --learning_rate=0.0003 \
    --unet_lr=0.0003 \
    --text_encoder_lr=0.0001 \
    --network_dim=8 \
    --no_half_vae \
    --flip_aug \
    --save_every_n_steps=1 \
    --mixed_precision="bf16" \
    --save_precision="fp16" \
    --cache_latents \
    --cache_latents_to_disk \
    --optimizer_type=SAVEUS \
    --max_grad_norm=1 \
    --max_data_loader_n_workers=8 \
    --bucket_reso_steps=32 \
    --multires_noise_iterations=12 \
    --multires_noise_discount=0.4 \
    --log_prefix=xl-locon \
    --log_with=tensorboard \
    --logging_dir=/output_dir/logs \
    --gradient_accumulation_steps=6 \
    --gradient_checkpointing \
    --train_batch_size=8 \
    --dataset_repeats=1 \
    --shuffle_caption \
    --max_train_steps=80 \
    --sdpa \
    --caption_extension=".txt" \
    --sample_prompts=/training_dir/sample-prompts.txt \
    --sample_sampler="euler_a" \
    --sample_every_n_steps=10

Pro tip: Set --sample_every_n_steps to 1 at least once to watch how quickly the LoRA learns. The training progress is fascinating to observe!
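
The cosine scheduler with num_cycles=0.375 never decays all the way to zero within the 80 steps. The sketch below shows the shape of the curve, assuming the scheduler follows the usual half-cosine formula used by get_cosine_schedule_with_warmup in diffusers and transformers, with no warmup (the command above sets none) and the U-Net learning rate of 0.0003 as the peak. The function name cosine_lr is hypothetical.

```python
import math


def cosine_lr(step, total_steps=80, peak_lr=3e-4, num_cycles=0.375, warmup_steps=0):
    """Learning rate at a given step under a cosine schedule with optional warmup."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    factor = 0.5 * (1.0 + math.cos(math.pi * num_cycles * 2.0 * progress))
    return peak_lr * max(0.0, factor)


if __name__ == "__main__":
    for step in (0, 20, 40, 60, 80):
        lr = cosine_lr(step)
        print(f"step {step:2d}: lr = {lr:.2e} ({lr / 3e-4:.0%} of peak)")
```

With three eighths of a cycle, the run stops partway down the cosine curve, so under these assumptions the final steps still train at a meaningful fraction of the peak rate instead of near zero.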

LoRA Optimization: Shrinking


After training, shrink your LoRA with resize_lora. This reduces the file size while aiming to preserve quality:

python resize_lora.py -r fro_ckpt=1,thr=-3.55 {model_path} {lora_path}

LoRA Optimization: Chopping


In the resize_lora repository, you’ll find a Python script called chop_blocks.py. This utility allows you to remove unused layers from your LoRA that don’t contain information about your trained character, style, or concept.

To print the block layout, run it on your original trained LoRA (not the resized one) without a block vector:

python chop_blocks.py {lora_path} 

The output will look something like this:

INFO: Blocks layout:
INFO:   [ 0]  input_blocks.1 layers=9
INFO:   [ 1]  input_blocks.2 layers=9
INFO:   [ 2]  input_blocks.3 layers=3
INFO:   [ 3]  input_blocks.4 layers=78
INFO:   [ 4]  input_blocks.5 layers=75
INFO:   [ 5]  input_blocks.6 layers=3
INFO:   [ 6]  input_blocks.7 layers=318
INFO:   [ 7]  input_blocks.8 layers=315
INFO:   [ 8]  middle_block.0 layers=9
INFO:   [ 9]  middle_block.1 layers=306
INFO:   [10]  middle_block.2 layers=9
INFO:   [11] output_blocks.0 layers=318
INFO:   [12] output_blocks.1 layers=318
INFO:   [13] output_blocks.2 layers=321
INFO:   [14] output_blocks.3 layers=78
INFO:   [15] output_blocks.4 layers=78
INFO:   [16] output_blocks.5 layers=81
INFO:   [17] output_blocks.6 layers=12
INFO:   [18] output_blocks.7 layers=12
INFO:   [19] output_blocks.8 layers=12
INFO: Vector string : "1,INP01,INP02,INP03,INP04,INP05,INP06,INP07,INP08,MID00,MID01,MID02,OUT00,OUT01,OUT02,OUT03,OUT04,OUT05,OUT06,OUT07,OUT08"
INFO: Pass through layers: 264

Selective Layer Retention

For example, to keep only output_blocks.1 (OUT01), run the script on the resized LoRA with a block vector:

python chop_blocks.py {⚠️resized⚠️_lora_path} 1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0

Note: The first number is for compatibility with ComfyUI; the next numbers correspond to each block starting with input_blocks.1.
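
Typing out twenty-one comma-separated flags by hand is error-prone, so it can help to generate the vector from block names. The sketch below is a small standalone helper, not part of chop_blocks.py. The block names are copied from the "Vector string" line in the output above, and build_block_vector is a hypothetical name.

```python
# Block names in the order printed by chop_blocks.py in its "Vector string" line.
BLOCK_NAMES = (
    [f"INP{i:02d}" for i in range(1, 9)]
    + [f"MID{i:02d}" for i in range(3)]
    + [f"OUT{i:02d}" for i in range(9)]
)


def build_block_vector(keep, names=BLOCK_NAMES):
    """Return the comma-separated vector that keeps only the named blocks."""
    unknown = set(keep) - set(names)
    if unknown:
        raise ValueError(f"unknown block names: {sorted(unknown)}")
    # The leading 1 is the ComfyUI compatibility slot described in the note above.
    flags = ["1"] + ["1" if name in keep else "0" for name in names]
    return ",".join(flags)


if __name__ == "__main__":
    print(build_block_vector({"OUT01"}))
```

Passing several names, for example {"OUT00", "OUT01"}, keeps more than one block in the same call.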

Visual Block Analysis with ComfyUI

To visually identify which blocks contain useful information, install ComfyUI-Inspire-Pack and use the Lora Loader (Block Weight) node in your ComfyUI workflow.

[Screenshot: the "Lora Loader (Block Weight)" node from ComfyUI-Inspire-Pack, with the fields model, clip, category_filter, lora_name, strength_model, strength_clip, inverse, control_after_generate, and preset. The preset field holds a block weight string.]

Important: Set control_after_generate to fixed!

You can use the presets to check all IN, OUT, or MID blocks, but the most useful information is typically in OUT01. Check our Experimental Features Guide for more advanced techniques.

Once you’ve identified which blocks to keep, chop up your resized LoRA to create a tiny model that’s easy to share and use!

Conclusion


This ultra-fast SDXL LoRA training method gives you:

  1. Incredibly efficient training (just 80 steps!)
  2. High-quality results with minimal compute resources
  3. The ability to further optimize your LoRA through shrinking and chopping

By following this guide, you can create compact, performant LoRA models for your specific characters or styles in minutes rather than hours. Experiment with different settings to find what works best for your specific training needs.

For more advanced techniques, check out our guides on LyCORIS Chopping and Tracking Training Changes.