10 Minute SDXL LoRA Training for the Ultimate Degenerates
Introduction
This is a short, direct guide to an experimental SDXL LoRA training technique. Training completes in just 80 steps and works well with both Pony Diffusion V6 XL and CompassMix XL models.
The “10 Minute” in the title is just clickbait; your actual training time will depend on:
- Your dataset size
- Step size configuration
- GPU capabilities
The good news? It’s still incredibly fast. The same 80 steps work for character and style training with datasets of 40-200 images, with no need to scale the step count to the dataset size; you only need to change the output name between training sessions.
Setup and Training
First, you’ll need to modify sd-scripts with the optimizer-specific changes described in our Custom Optimizer Guide; instead of the optimizer shown on that page, we’ll use SAVEUS for SDXL training.
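The exact edits depend on your sd-scripts version, but the idea is the same as in the Custom Optimizer Guide: save the SAVEUS class below somewhere importable (for example as library/saveus.py, a name chosen here purely for illustration) and add a branch for it where get_optimizer() in library/train_util.py selects the optimizer class. A rough sketch only; surrounding variable names follow current sd-scripts and may differ in your checkout:

    # Sketch of the hook inside get_optimizer() in library/train_util.py (exact code varies by version).
    # Assumes the SAVEUS class below was saved as library/saveus.py.
    if optimizer_type == "SAVEUS".lower():
        from library.saveus import SAVEUS

        optimizer_class = SAVEUS
        optimizer = optimizer_class(trainable_params, lr=lr, **optimizer_kwargs)

With that in place, --optimizer_type=SAVEUS in the training command further down resolves to this class.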
The SAVEUS Optimizer
import torch
from torch.optim import Optimizer
from typing import Callable, Optional, Tuple
class SAVEUS(Optimizer):
    r"""
    Implements the SAVEUS optimization algorithm, incorporating techniques from ADOPT.

    The optimizer combines several advanced optimization techniques:

    1. Gradient Centralization: Removes the mean of gradients for each layer:
           g_t = g_t - mean(g_t)
    2. Adaptive Gradient Normalization: Normalizes gradients using their standard deviation:
           g_t = (1 - α) * g_t + α * (g_t / std(g_t))
       where α is the normalization parameter
    3. Momentum with Amplification:
       - First moment: m_t = β₁ * m_{t-1} + (1 - β₁) * g_t
       - Amplified gradient: g_t = g_t + amp_fac * m_t
       where β₁ is the first moment decay rate
    4. Adaptive Step Sizes:
       - Second moment: v_t = β₂ * v_{t-1} + (1 - β₂) * g_t²
       - Bias correction: m̂_t = m_t / (1 - β₁ᵗ)
                          v̂_t = v_t / (1 - β₂ᵗ)
       - Step size: η_t = lr / (1 - β₁ᵗ)
       where β₂ is the second moment decay rate

    Complete Update Rule:

    1. If decouple_weight_decay:
           θ_t = θ_{t-1} * (1 - η_t * λ) - η_t * g_t / √(v̂_t + ε)
    2. Otherwise:
           θ_t = θ_{t-1} - η_t * (g_t + λ * θ_{t-1}) / √(v̂_t + ε)

    Where:
    - θ_t: Parameters at step t
    - η_t: Learning rate with bias correction
    - g_t: Gradient (after centralization, normalization, and amplification)
    - v̂_t: Bias-corrected second moment estimate
    - λ: Weight decay coefficient
    - ε: Small constant for numerical stability

    Arguments:
        params (iterable):
            Iterable of parameters to optimize or dicts defining parameter groups.
        lr (float, optional):
            Learning rate (default: 1e-3).
        betas (Tuple[float, float], optional):
            Coefficients for computing running averages of gradient (β₁) and its square (β₂) (default: (0.9, 0.999)).
        eps (float, optional):
            Term added to the denominator to improve numerical stability (default: 1e-8).
        weight_decay (float, optional):
            Weight decay (L2 penalty) (default: 0).
        centralization (float, optional):
            Strength of gradient centralization (default: 0.5).
        normalization (float, optional):
            Interpolation factor for normalized gradients (default: 0.5).
        normalize_channels (bool, optional):
            Whether to normalize gradients channel-wise (default: True).
        amp_fac (float, optional):
            Amplification factor for the momentum term (default: 2.0).
        clip_lambda (Optional[Callable[[int], float]], optional):
            Function computing gradient clipping threshold from step number (default: step**0.25).
        decouple_weight_decay (bool, optional):
            Whether to apply weight decay directly to weights (default: False).
        clip_gradients (bool, optional):
            Whether to enable gradient clipping (default: False).
    """

    def __init__(
        self,
        params,
        lr: float = 1e-3,
        betas: Tuple[float, float] = (0.9, 0.999),
        eps: float = 1e-8,
        weight_decay: float = 0.0,
        centralization: float = 0.5,
        normalization: float = 0.5,
        normalize_channels: bool = True,
        amp_fac: float = 2.0,
        clip_lambda: Optional[Callable[[int], float]] = lambda step: step**0.25,
        decouple_weight_decay: bool = False,
        clip_gradients: bool = False,
    ):
        defaults = dict(
            lr=lr,
            betas=betas,
            eps=eps,
            weight_decay=weight_decay,
            centralization=centralization,
            normalization=normalization,
            normalize_channels=normalize_channels,
            amp_fac=amp_fac,
            clip_lambda=clip_lambda,
            decouple_weight_decay=decouple_weight_decay,
            clip_gradients=clip_gradients,
        )
        super(SAVEUS, self).__init__(params, defaults)

    def normalize_gradient(
        self,
        x: torch.Tensor,
        use_channels: bool = False,
        alpha: float = 1.0,
        epsilon: float = 1e-8,
    ) -> None:
        r"""Normalize gradient with standard deviation.

        :param x: torch.Tensor. Gradient.
        :param use_channels: bool. Channel-wise normalization.
        :param alpha: float. Interpolation weight between original and normalized gradient.
        :param epsilon: float. Small value to prevent division by zero.
        """
        size: int = x.dim()
        if size > 1 and use_channels:
            s = x.std(dim=tuple(range(1, size)), keepdim=True).add_(epsilon)
            # Out-of-place div so lerp_ interpolates between the original and the
            # normalized gradient, as documented above (an in-place div_ here would
            # overwrite x first and make alpha a no-op).
            x.lerp_(x.div(s), weight=alpha)
        elif torch.numel(x) > 2:
            s = x.std().add_(epsilon)
            x.lerp_(x.div(s), weight=alpha)

    def step(self, closure: Optional[Callable] = None):
        """Perform a single optimization step.

        Args:
            closure (Callable, optional):
                A closure that reevaluates the model and returns the loss.
        """
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            lr = group["lr"]
            betas = group["betas"]
            eps = group["eps"]
            weight_decay = group["weight_decay"]
            centralization = group["centralization"]
            normalization = group["normalization"]
            normalize_channels = group["normalize_channels"]
            amp_fac = group["amp_fac"]
            clip_lambda = group["clip_lambda"]
            decouple_weight_decay = group["decouple_weight_decay"]
            clip_gradients = group["clip_gradients"]

            for p in group["params"]:
                if p.grad is None:
                    continue

                grad = p.grad.data
                if grad.is_sparse:
                    raise RuntimeError("SAVEUS does not support sparse gradients")

                state = self.state[p]

                # State initialization
                if len(state) == 0:
                    state["step"] = 0
                    state["ema"] = torch.zeros_like(p.data)
                    state["ema_squared"] = torch.zeros_like(p.data)

                ema, ema_squared = state["ema"], state["ema_squared"]
                beta1, beta2 = betas
                state["step"] += 1

                # Center the gradient
                if centralization != 0:
                    grad.sub_(
                        grad.mean(dim=tuple(range(1, grad.dim())), keepdim=True).mul_(centralization)
                    )

                # Normalize the gradient
                if normalization != 0:
                    self.normalize_gradient(
                        grad, use_channels=normalize_channels, alpha=normalization
                    )

                # Bias correction
                bias_correction = 1 - beta1 ** state["step"]
                bias_correction_sqrt = (1 - beta2 ** state["step"]) ** 0.5
                step_size = lr / bias_correction

                # Update EMA of gradient
                ema.mul_(beta1).add_(grad, alpha=1 - beta1)
                # Amplify gradient with EMA
                grad.add_(ema, alpha=amp_fac)
                # Update EMA of squared gradient
                ema_squared.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
                # Compute denominator
                denom = ema_squared.sqrt().div_(bias_correction_sqrt).add_(eps)

                if decouple_weight_decay and weight_decay != 0:
                    p.data.mul_(1 - step_size * weight_decay)
                elif weight_decay != 0:
                    grad.add_(p.data, alpha=weight_decay)

                # Apply gradient clipping if enabled
                if clip_gradients and clip_lambda is not None:
                    clip = clip_lambda(state["step"])
                    grad.clamp_(-clip, clip)

                # Update parameters
                p.data.addcdiv_(grad, denom, value=-step_size)

        return loss
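Before wiring it into sd-scripts, you can smoke-test the class on a throwaway model to confirm it runs; the tiny linear layer and random data below are purely illustrative, not a training recipe:

    import torch

    # Toy smoke test for the SAVEUS class above.
    model = torch.nn.Linear(16, 4, bias=False)
    optimizer = SAVEUS(model.parameters(), lr=1e-3, amp_fac=2.0, clip_gradients=True)

    x = torch.randn(32, 16)
    target = torch.randn(32, 4)

    for _ in range(5):
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), target)
        loss.backward()
        optimizer.step()
        print(loss.item())  # the loss should generally trend downward over the five steps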
Optimized Training Settings
Use these training settings for the fastest results:
accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" \
--pretrained_model_name_or_path=/models/ponyDiffusionV6XL_v6StartWithThisOne.safetensors \
--train_data_dir=/training_dir \
--resolution="1024,1024" \
--output_dir="/output_dir" \
--output_name="yifftoolkit-schnell" \
--enable_bucket \
--min_bucket_reso=256 \
--max_bucket_reso=2048 \
--network_alpha=4 \
--save_model_as="safetensors" \
--network_module="lycoris.kohya" \
--network_args \
"preset=full" \
"conv_dim=256" \
"conv_alpha=4" \
"rank_dropout=0" \
"module_dropout=0" \
"use_tucker=False" \
"use_scalar=False" \
"rank_dropout_scale=False" \
"algo=locon" \
"dora_wd=False" \
"train_norm=False" \
--network_dropout=0 \
--lr_scheduler="cosine" \
--lr_scheduler_args="num_cycles=0.375" \
--learning_rate=0.0003 \
--unet_lr=0.0003 \
--text_encoder_lr=0.0001 \
--network_dim=8 \
--no_half_vae \
--flip_aug \
--save_every_n_steps=1 \
--mixed_precision="bf16" \
--save_precision="fp16" \
--cache_latents \
--cache_latents_to_disk \
--optimizer_type=SAVEUS \
--max_grad_norm=1 \
--max_data_loader_n_workers=8 \
--bucket_reso_steps=32 \
--multires_noise_iterations=12 \
--multires_noise_discount=0.4 \
--log_prefix=xl-locon \
--log_with=tensorboard \
--logging_dir=/output_dir/logs \
--gradient_accumulation_steps=6 \
--gradient_checkpointing \
--train_batch_size=8 \
--dataset_repeats=1 \
--shuffle_caption \
--max_train_steps=80 \
--sdpa \
--caption_extension=".txt" \
--sample_prompts=/training_dir/sample-prompts.txt \
--sample_sampler="euler_a" \
--sample_every_n_steps=10
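For a sense of scale: max_train_steps counts optimizer steps (the usual sd-scripts behaviour with gradient accumulation), so each of the 80 steps consumes train_batch_size × gradient_accumulation_steps images. A quick back-of-the-envelope check, using a 100-image dataset purely as an example:

    # How much data do 80 steps cover with the settings above?
    train_batch_size = 8
    gradient_accumulation_steps = 6
    max_train_steps = 80
    dataset_size = 100  # example only; substitute your own dataset size

    effective_batch = train_batch_size * gradient_accumulation_steps  # 48 images per optimizer step
    images_seen = effective_batch * max_train_steps                   # 3840 image samples in total
    print(f"~{images_seen / dataset_size:.1f} passes over a {dataset_size}-image dataset")  # ~38.4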
Pro tip: Set --sample_every_n_steps to 1 at least once to watch how quickly the LoRA model learns. The training progress is fascinating to observe!
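If you don’t already have a /training_dir/sample-prompts.txt, it is a plain text file with one prompt per line; sd-scripts accepts per-line sampling flags such as --n (negative prompt), --w/--h (resolution), --d (seed), --l (CFG scale), and --s (steps). The prompt content below is just an illustrative Pony-style example:

    score_9, score_8_up, score_7_up, anthro fox, sitting, detailed background --n low quality, blurry --w 1024 --h 1024 --d 42 --l 7.5 --s 20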
LoRA Optimization: Shrinking
After training, optimize your LoRA size with resize_lora. This technique helps reduce model size while maintaining quality:
python resize_lora.py -r fro_ckpt=1,thr=-3.55 {model_path} {lora_path}
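For example, with the paths used in the training command above (substitute your own base model and LoRA locations), the call would look like this:

    python resize_lora.py -r fro_ckpt=1,thr=-3.55 \
        /models/ponyDiffusionV6XL_v6StartWithThisOne.safetensors \
        /output_dir/yifftoolkit-schnell.safetensors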
LoRA Optimization: Chopping
In the resize_lora repository, you’ll find a Python script called chop_blocks.py. This utility allows you to remove unused layers from your LoRA that don’t contain information about your trained character, style, or concept.
Run it on your original trained LoRA (not the resized one) with:
python chop_blocks.py {lora_path}
The output will look something like this:
INFO: Blocks layout:
INFO: [ 0] input_blocks.1 layers=9
INFO: [ 1] input_blocks.2 layers=9
INFO: [ 2] input_blocks.3 layers=3
INFO: [ 3] input_blocks.4 layers=78
INFO: [ 4] input_blocks.5 layers=75
INFO: [ 5] input_blocks.6 layers=3
INFO: [ 6] input_blocks.7 layers=318
INFO: [ 7] input_blocks.8 layers=315
INFO: [ 8] middle_block.0 layers=9
INFO: [ 9] middle_block.1 layers=306
INFO: [10] middle_block.2 layers=9
INFO: [11] output_blocks.0 layers=318
INFO: [12] output_blocks.1 layers=318
INFO: [13] output_blocks.2 layers=321
INFO: [14] output_blocks.3 layers=78
INFO: [15] output_blocks.4 layers=78
INFO: [16] output_blocks.5 layers=81
INFO: [17] output_blocks.6 layers=12
INFO: [18] output_blocks.7 layers=12
INFO: [19] output_blocks.8 layers=12
INFO: Vector string : "1,INP01,INP02,INP03,INP04,INP05,INP06,INP07,INP08,MID00,MID01,MID02,OUT00,OUT01,OUT02,OUT03,OUT04,OUT05,OUT06,OUT07,OUT08"
INFO: Pass through layers: 264
Selective Layer Retention
For example, to keep only output_blocks.1 (OUT01), use:
python chop_blocks.py {⚠️resized⚠️_lora_path} 1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
Note: The first number is for compatibility with ComfyUI; the following numbers correspond to each block, starting with input_blocks.1.
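If counting positions in that vector by hand feels error-prone, a small helper (mine, not part of the resize_lora repository) can build the string from the layout indices printed by chop_blocks.py:

    # Build the chop_blocks.py vector string from the layout indices you want to keep.
    # Indices refer to the "Blocks layout" listing above (0 = input_blocks.1, 12 = output_blocks.1, ...).
    def vector_string(keep: set, num_blocks: int = 20) -> str:
        # The leading "1" is the ComfyUI-compatibility value mentioned in the note above.
        return ",".join(["1"] + ["1" if i in keep else "0" for i in range(num_blocks)])

    print(vector_string({12}))  # keep only output_blocks.1 (OUT01)
    # -> 1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0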
Visual Block Analysis with ComfyUI
To visually identify which blocks contain useful information, install ComfyUI-Inspire-Pack and use the Lora Loader (Block Weight) node in your ComfyUI workflow.

Important: Set control_after_generate to fixed!
You can use the presets to check all IN, OUT, or MID blocks, but most of the useful information typically ends up in OUT1. Check our Experimental Features Guide for more advanced techniques.
Once you’ve identified which blocks to keep, chop up your resized LoRA to create a tiny model that’s easy to share and use!
Conclusion
This ultra-fast SDXL LoRA training method gives you:
- Incredibly efficient training (just 80 steps!)
- High-quality results with minimal compute resources
- The ability to further optimize your LoRA through shrinking and chopping
By following this guide, you can create compact, performant LoRA models for your characters or styles in minutes rather than hours. Experiment with different settings to find what works best for your training needs.
For more advanced techniques, check out our guides on LyCORIS Chopping and Tracking Training Changes.