Advanced Training Concepts

This guide covers more advanced concepts for LoRA training that can help you optimize your results.

Steps vs. Epochs

When training a model, it’s crucial to understand the difference between steps and epochs.

Steps

A step is a single update to the model’s parameters based on one batch of data. During each step:

  1. The model processes a batch of images
  2. Loss is calculated based on the model’s predictions
  3. Gradients are computed and used to update the model’s weights

Epochs

An epoch represents a complete pass through the entire training dataset. After processing all batches in the dataset, one epoch is completed.

Example Calculation

Let’s say you have:

  • 1000 training images
  • Batch size of 10 images
  • Training for 10 epochs

This means:

  • Each epoch consists of 100 steps (1000 images ÷ 10 images per batch)
  • Total training will involve 1000 steps (100 steps × 10 epochs)
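
The same arithmetic in a few lines of Python (the variable names here are purely illustrative, not trainer options):

    # Illustrative arithmetic only; these are plain variables, not training flags.
    num_images = 1000   # size of the training dataset
    batch_size = 10     # images processed per step
    num_epochs = 10     # complete passes through the dataset

    steps_per_epoch = num_images // batch_size   # 1000 / 10 = 100
    total_steps = steps_per_epoch * num_epochs   # 100 * 10 = 1000

    print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")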

Choosing Between Steps and Epochs

When configuring your training, you can specify either:

--max_train_steps=400

OR

--max_train_epochs=10

If both are specified, the number of epochs takes precedence. For smaller datasets, epoch-based training is often more intuitive. For larger datasets, step-based training gives more control over the total training duration.
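
As a rough sketch of that precedence rule (illustrative Python only, not the trainer's actual code), the epoch setting simply overrides the step budget when both are present:

    # Hypothetical helper for illustration; not part of any training script.
    def resolve_total_steps(max_train_steps, max_train_epochs, steps_per_epoch):
        """If an epoch count is given, it overrides the step count."""
        if max_train_epochs is not None:
            return max_train_epochs * steps_per_epoch
        return max_train_steps

    # With 100 steps per epoch, --max_train_epochs=10 wins over --max_train_steps=400:
    print(resolve_total_steps(max_train_steps=400, max_train_epochs=10, steps_per_epoch=100))  # 1000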

Gradient Accumulation

Gradient accumulation is a technique that allows you to simulate larger batch sizes even with limited GPU memory.

How Gradient Accumulation Works

Instead of updating the model weights after each batch, gradient accumulation:

  1. Processes multiple batches
  2. Accumulates (adds up) the gradients from each batch
  3. Updates the model weights once after a specified number of batches
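
The process above can be sketched as a plain PyTorch training loop. This is a self-contained toy example (a small linear model and random data), not the trainer's internal code:

    import torch
    from torch import nn

    # Toy model and data so the sketch runs on its own; a real LoRA trainer builds these for you.
    model = nn.Linear(16, 1)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    batches = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(12)]  # 12 batches of 8 samples

    accumulation_steps = 6  # analogous to --gradient_accumulation_steps=6

    optimizer.zero_grad()
    for i, (x, y) in enumerate(batches):
        loss = nn.functional.mse_loss(model(x), y)
        (loss / accumulation_steps).backward()   # gradients from each batch add up
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()                     # one weight update every 6 batches
            optimizer.zero_grad()                # clear the accumulated gradients

Dividing the loss by accumulation_steps keeps the accumulated gradient comparable to an average over the larger effective batch.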

Benefits of Gradient Accumulation

  • Enables training with effectively larger batch sizes
  • Improves training stability
  • Allows for higher learning rates
  • Works around memory limitations

Implementation of Gradient Accumulation

To use gradient accumulation, add:

--gradient_accumulation_steps=6

This means the model parameters will be updated every 6 batches, effectively creating a batch size that’s 6 times larger than the specified --train_batch_size.

Effective Batch Size Calculation

Your effective batch size is:

$$\text{Effective Batch Size} = \text{Train Batch Size} \times \text{Gradient Accumulation Steps}$$

For example, with --train_batch_size=8 and --gradient_accumulation_steps=6, your effective batch size is 48.
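
In code (illustrative numbers only; how a given trainer counts "steps" under accumulation can vary, so this only computes optimizer updates):

    train_batch_size = 8
    gradient_accumulation_steps = 6

    effective_batch_size = train_batch_size * gradient_accumulation_steps  # 48

    # With the earlier 1000-image dataset, that is roughly 20 optimizer updates per epoch.
    updates_per_epoch = 1000 // effective_batch_size  # 20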

Gradient Checkpointing

Gradient checkpointing is a technique to reduce memory usage during training by trading compute for memory.

How Gradient Checkpointing Works

Rather than storing all activations from the forward pass (which requires a lot of memory), gradient checkpointing:

  1. Discards some activations during the forward pass
  2. Recomputes them during the backward pass when needed for gradient calculation
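
PyTorch exposes this trade-off directly through torch.utils.checkpoint, which is a convenient way to see the idea in isolation. A minimal, self-contained sketch (the small Sequential block is a placeholder; this is separate from the trainer flag shown below):

    import torch
    from torch import nn
    from torch.utils.checkpoint import checkpoint

    block = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))
    x = torch.randn(4, 64, requires_grad=True)

    # Activations inside `block` are not kept after the forward pass;
    # they are recomputed when backward() needs them for gradients.
    y = checkpoint(block, x, use_reentrant=False)
    y.sum().backward()

Passing use_reentrant=False selects the non-reentrant implementation that newer PyTorch versions recommend.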

Benefits of Gradient Checkpointing

  • Significantly reduces memory usage
  • Allows training larger models or using larger batch sizes
  • Makes training possible on GPUs with limited VRAM

Implementation of Gradient Checkpointing

To enable gradient checkpointing:

--gradient_checkpointing

Note that this will slow down training somewhat (typically 20-30%), but the memory savings can be substantial.

Network Dropout

Dropout is a regularization technique that helps prevent overfitting by randomly “dropping” (setting to zero) some neurons during training.
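
To see what dropout does to a tensor of activations, here is a tiny standalone PyTorch example (not LoRA-specific; the values are arbitrary):

    import torch
    from torch import nn

    dropout = nn.Dropout(p=0.1)     # each element has a 10% chance of being zeroed
    activations = torch.ones(8)

    dropout.train()                 # dropout is only active in training mode
    print(dropout(activations))     # some elements are 0, the rest are scaled by 1/(1 - p)

    dropout.eval()                  # at inference time dropout is a no-op
    print(dropout(activations))     # all ones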

Types of Dropout in LoRA Training

LoRA implementations support several types of dropout:

  1. Network Dropout: Regular dropout applied to the whole network

    --network_dropout=0.1
    
  2. Rank Dropout: Dropout applied to specific ranks in the low-rank matrices

    --network_args "rank_dropout=0.1"
    
  3. Module Dropout: Randomly skips entire LoRA modules during training

    --network_args "module_dropout=0.1"
    
Recommended Dropout Values

  • For most cases, starting with small dropout values (0.1-0.2) is recommended
  • For smaller datasets prone to overfitting, higher values (0.3-0.4) may help
  • Setting any dropout to 0 disables that type of dropout

Multi-Resolution Noise

Multi-resolution noise is a technique that adds noise at multiple resolutions during training, helping the model generate more diverse images.

Benefits

  • Improves image diversity
  • Works particularly well with small datasets
  • Makes it easier for the model to generate very bright or very dark images
  • Helps the model learn details at different scales

Implementation

--multires_noise_iterations=10 --multires_noise_discount=0.1

  • multires_noise_iterations: Number of noise scales to add (6-10 recommended)
  • multires_noise_discount: Controls how much the noise is weakened at each resolution (0.1 recommended)
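
The general idea behind multi-resolution ("pyramid") noise can be sketched as follows. This is an illustrative reimplementation, not the trainer's exact code, and the latent shape is a placeholder:

    import torch
    import torch.nn.functional as F

    def multires_noise(shape, iterations=10, discount=0.1):
        """Base Gaussian noise plus progressively lower-resolution noise, each weighted by discount**i."""
        b, c, h, w = shape
        noise = torch.randn(b, c, h, w)
        for i in range(1, iterations):
            scale = 2 ** i
            if h // scale < 1 or w // scale < 1:
                break  # stop once the downscaled resolution collapses
            low_res = torch.randn(b, c, h // scale, w // scale)
            upsampled = F.interpolate(low_res, size=(h, w), mode="bilinear", align_corners=False)
            noise += upsampled * (discount ** i)
        return noise / noise.std()  # renormalize so the overall noise level stays roughly unchanged

    noise = multires_noise((1, 4, 64, 64))  # e.g. a batch of one 4-channel, 64x64 latent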

This technique has minimal downsides and is generally recommended for all LoRA training.

Next Steps

Now that you understand advanced training concepts, the final step is to learn how to monitor your training progress. Continue to the Monitoring Training Progress guide to learn how to use tools like Wandb or TensorBoard to track your training metrics and visualize results.