Monitoring Training Progress

Monitoring your LoRA training is crucial for identifying issues early, making informed adjustments, and understanding the overall training progress. This guide focuses on using Weights & Biases (wandb) and TensorBoard, two powerful visualization tools for machine learning.

Setting Up Weights & Biases (wandb)

Weights & Biases (wandb) is a popular experiment tracking tool that provides beautiful visualizations, is accessible from anywhere via the web, and offers collaboration features.

Creating a wandb Account

Before you can use wandb, you need to create an account on the Weights & Biases website. Sign-up is free, and the free tier provides ample resources for personal projects.

Installing wandb

Install the wandb Python package:

pip install wandb

Logging In to wandb

After installation, log in to your wandb account from the terminal:

wandb login

You’ll be prompted to enter your API key, which you can find in your wandb account settings.

Enabling wandb Logging in sd-scripts

To enable wandb for your LoRA training, add these parameters to your training command:

--log_prefix=your-model-name
--log_with=wandb
--wandb_api_key=your-api-key
  • log_prefix: A prefix for your wandb logs
  • log_with: Specifies wandb as the logging tool
  • wandb_api_key: Your wandb API key (optional if you’ve already logged in via the command line)

You can also set optional parameters to organize your runs. Note that sd-scripts does not expose dedicated --wandb_project or --wandb_entity flags; the closest equivalents are:

--log_tracker_name=your-project-name
--wandb_run_name=your-run-name
  • log_tracker_name: The project name under which runs are grouped (for train_network.py the default is “network_train”)
  • wandb_run_name: The display name for this particular run (available in recent versions)

The wandb client also honors the standard WANDB_PROJECT and WANDB_ENTITY environment variables if you prefer to set the project and entity that way. A complete example command is shown below.
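
Putting it together, a minimal sketch of a training invocation with wandb logging enabled (the paths, model, and dataset config are placeholders; adapt them to your setup):

accelerate launch train_network.py \
  --pretrained_model_name_or_path=/models/base-model.safetensors \
  --dataset_config=/training_dir/dataset.toml \
  --output_dir=/output_dir \
  --network_module=networks.lora \
  --log_with=wandb \
  --log_prefix=your-model-name \
  --log_tracker_name=your-project-name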

Viewing Your wandb Dashboard

Once your training starts, logs will be automatically sent to wandb. You can view your experiment by:

  1. Going to the wandb website
  2. Navigating to your project
  3. Selecting your run

The dashboard will show training losses, sample images (if generated), and other metrics in real-time.

Setting Up TensorBoard

Enabling TensorBoard Logging

To enable TensorBoard for your training, add these parameters to your training command (a complete example follows the list):

--log_prefix=your-model-name
--log_with=tensorboard
--logging_dir=/output_dir/logs
  • log_prefix: A prefix for your TensorBoard logs
  • log_with: Specifies TensorBoard as the logging tool
  • logging_dir: Directory where logs will be stored
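
For example, a minimal sketch of the same training invocation logging to TensorBoard instead (placeholders as before):

accelerate launch train_network.py \
  --pretrained_model_name_or_path=/models/base-model.safetensors \
  --dataset_config=/training_dir/dataset.toml \
  --output_dir=/output_dir \
  --network_module=networks.lora \
  --log_with=tensorboard \
  --log_prefix=your-model-name \
  --logging_dir=/output_dir/logs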

Installing TensorBoard

If you haven’t already installed TensorBoard:

pip install tensorboard

For more detailed installation instructions, refer to the official TensorBoard installation guide.

Launching TensorBoard

After your training is running (or even after it’s completed), navigate to your output directory and start TensorBoard:

cd /output_dir
tensorboard --logdir=logs

Then open a web browser and go to http://localhost:6006/ to view your training metrics.
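
If you are training on a remote machine, TensorBoard’s --bind_all flag exposes the dashboard on all network interfaces, or you can forward the port with an SSH tunnel and keep it local (replace user@remote-host with your own login):

tensorboard --logdir=logs --bind_all

or, from your local machine:

ssh -L 6006:localhost:6006 user@remote-host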

Understanding Loss Curves

TensorBoard displays loss curves that show how the loss changes over time (and the validation loss as well, if your configuration logs one). Understanding these curves is crucial for monitoring progress.

Healthy Loss Curve Characteristics

A healthy training process typically shows:

  1. Downward Trend: The loss should generally decrease over time
  2. Smooth Decline: Extremely jagged or erratic curves may indicate issues
  3. Plateauing: Eventually, the curve should flatten as the model approaches optimal performance

Warning Signs to Watch For

Watch for these potential issues (a small script for checking them programmatically follows the list):

  1. Increasing Loss: If the loss starts increasing consistently, it may indicate that the learning rate is too high
  2. No Improvement: If the loss doesn’t decrease at all, there might be issues with your dataset or training configuration
  3. NaN Values: If the loss suddenly becomes NaN (Not a Number), it indicates numerical instability
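
The sketch below reads the logged scalars back with TensorBoard’s event-file reader and flags NaNs or an overall upward trend. The tag loss/current is an assumption (sd-scripts’ tag names vary by script and version), so list the actual tags first:

import math
from tensorboard.backend.event_processing import event_accumulator

# Point this at the run directory containing the events.out.tfevents.* file.
ea = event_accumulator.EventAccumulator("/output_dir/logs/your-run")
ea.Reload()

print(ea.Tags()["scalars"])  # see which scalar tags were actually logged

# "loss/current" is an assumed tag name; substitute one printed above.
values = [s.value for s in ea.Scalars("loss/current")]

if not values:
    print("No loss values found under that tag")
elif any(math.isnan(v) for v in values):
    print("Warning: NaN loss detected -- training became numerically unstable")
elif values[-1] > values[0]:
    print("Warning: final loss is higher than initial loss")
else:
    print(f"Loss went from {values[0]:.4f} to {values[-1]:.4f} over {len(values)} points")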

Sample Image Generation

In addition to monitoring loss curves, generating sample images during training provides visual feedback on your model’s progress.

To enable sample generation during training, add the following (an example prompt file is shown below):

--sample_prompts=/training_dir/sample-prompts.txt
--sample_sampler="euler_a"
--sample_every_n_steps=100

or

--sample_every_n_epochs=1
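
The prompt file contains one prompt per line, and sd-scripts accepts inline options after each prompt. A hedged example of /training_dir/sample-prompts.txt (the option letters follow the sd-scripts documentation: --n negative prompt, --w/--h size, --d seed, --s steps, --l CFG scale; the trigger word is a placeholder):

# sample-prompts.txt -- lines starting with # are ignored
masterpiece, best quality, your-trigger-word, portrait --n lowres, bad anatomy --w 512 --h 768 --d 1 --s 28 --l 7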

These samples will be saved in a sample subfolder of your output directory and can help you:

  • Visually track learning progress
  • Identify when the model begins to produce satisfactory results
  • Determine if the model is overfitting
  • Decide when to stop training

Making Adjustments Based on Monitoring

Based on what you observe in wandb or TensorBoard and in your sample images, you might want to adjust your training parameters:

If Loss is Not Decreasing

  • Increase the learning rate
  • Check your dataset for issues
  • Adjust network dimensions or alpha

If Loss is Increasing or Fluctuating Wildly

  • Decrease the learning rate
  • Add or increase gradient clipping (--max_grad_norm)
  • Increase batch size or gradient accumulation steps (see the example after this list)
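
For instance, a set of stabilizing adjustments to append to the training command; the values are illustrative starting points, not recommendations:

--learning_rate=5e-5
--max_grad_norm=1.0
--gradient_accumulation_steps=4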

If Training Seems to Plateau Too Early

  • Adjust the learning rate scheduler (an example follows this list)
  • Increase the number of training steps/epochs
  • Consider modifying the network architecture
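
As one example of a scheduler adjustment, sd-scripts supports several schedulers via --lr_scheduler; a cosine schedule with restarts and a warmup phase might look like this (values illustrative):

--lr_scheduler=cosine_with_restarts
--lr_warmup_steps=200
--lr_scheduler_num_cycles=3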

Advanced TensorBoard Features

TensorBoard offers several advanced features:

Comparing Runs

You can view multiple training runs simultaneously in TensorBoard, allowing you to compare different parameter configurations side by side.
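
If each run writes to its own subdirectory, pointing TensorBoard at the parent directory picks all of them up. You can also label runs explicitly with TensorBoard’s --logdir_spec option (the run names and paths here are placeholders):

tensorboard --logdir=/output_dir/logs

or

tensorboard --logdir_spec=baseline:/output_dir/logs/run1,higher-lr:/output_dir/logs/run2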

Distribution and Histogram Analysis

TensorBoard can display distributions and histograms of your model’s weights and gradients, providing insights into how these values evolve during training.

Custom Scalar Tracking

You can track custom scalars beyond just loss, such as learning rates, gradient norms, or any other metric you want to monitor.
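
If you extend or post-process a training run yourself, custom scalars can be written with PyTorch’s SummaryWriter; a minimal generic sketch (not specific to sd-scripts, with an invented decay schedule purely for illustration):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="/output_dir/logs/custom")
for step in range(100):
    # Log any scalar alongside the loss, e.g. a decaying learning rate.
    writer.add_scalar("lr/unet", 1e-4 * (0.99 ** step), step)
writer.close()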

Conclusion

Effectively monitoring your LoRA training is essential for producing high-quality models. Both wandb and TensorBoard provide powerful visualization capabilities that help you understand what’s happening during training and make informed adjustments to improve results.

Next Steps

Congratulations! You’ve completed the entire LoRA training guide series. With the knowledge you’ve gained, you can now:

  1. Prepare high-quality datasets with proper tagging and captioning
  2. Set up your training environment with the necessary tools
  3. Configure optimal training parameters for your specific needs
  4. Shrink and optimize your models for better performance
  5. Monitor training progress and make informed adjustments

If you’re ready to start a new training project, you can return to the Dataset Preparation guide. For more advanced topics, you might want to explore other sections of our documentation, such as fine-tuning techniques or specialized LoRA applications.