Monitoring Training Progress
Monitoring your LoRA training is crucial for identifying issues early, making informed adjustments, and understanding the overall training progress. This guide focuses on using Weights & Biases (wandb) and TensorBoard, two powerful visualization tools for machine learning.
Setting Up Weights & Biases (wandb)
Weights & Biases (wandb) is a popular experiment tracking tool that provides beautiful visualizations, is accessible from anywhere via the web, and offers collaboration features.
Creating a wandb Account
Before you can use wandb, you need to create an account on the Weights & Biases website. Signing up is free, and the free tier provides sufficient resources for personal projects.
Installing wandb
Install the wandb Python package:
pip install wandb
Logging In to wandb
After installation, log in to your wandb account from the terminal:
wandb login
You’ll be prompted to enter your API key, which you can find in your wandb account settings.
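If the interactive prompt is inconvenient (for example, on a remote machine or in a setup script), you can pass the key directly to the login command or set it as an environment variable; replace your-api-key with your actual key:
wandb login your-api-key
export WANDB_API_KEY=your-api-key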
Enabling wandb Logging in sd-scripts
To enable wandb for your LoRA training, add these parameters to your training command:
--log_prefix=your-model-name
--log_with=wandb
--wandb_api_key=your-api-key
log_prefix: A prefix for your wandb logs
log_with: Specifies wandb as the logging tool
wandb_api_key: Your wandb API key (optional if you’ve already logged in via the command line)
You can also add the following optional parameters:
--wandb_project=your-project-name
--wandb_entity=your-username-or-team
wandb_project: The project name for organizing your experiments (default: “sd-scripts”)
wandb_entity: Your wandb username or team name
Viewing Your wandb Dashboard
Once your training starts, logs will be automatically sent to wandb. You can view your experiment by:
- Going to the wandb website
- Navigating to your project
- Selecting your run
The dashboard will show training losses, sample images (if generated), and other metrics in real-time.
Setting Up TensorBoard
Enabling TensorBoard Logging
To enable TensorBoard for your training, add these parameters to your training command:
--log_prefix=your-model-name
--log_with=tensorboard
--logging_dir=/output_dir/logs
log_prefix: A prefix for your TensorBoard logs
log_with: Specifies TensorBoard as the logging tool
logging_dir: Directory where logs will be stored
Installing TensorBoard
If you haven’t already installed TensorBoard:
pip install tensorboard
For more detailed installation instructions, refer to the official TensorBoard installation guide.
Launching TensorBoard
After your training is running (or even after it’s completed), navigate to your output directory and start TensorBoard:
cd /output_dir
tensorboard --logdir=logs
Then open a web browser and go to http://localhost:6006/ to view your training metrics.
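If you’re training on a remote server, note that TensorBoard binds to localhost by default. You can either expose it on all network interfaces with --bind_all (then browse to the server’s address on port 6006), or keep it local and forward the port over SSH; the hostname below is a placeholder:
tensorboard --logdir=logs --bind_all
ssh -L 6006:localhost:6006 user@training-server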
Understanding Loss Curves
TensorBoard displays loss curves that show how the loss changes over time (the training loss by default; a validation loss appears only if your configuration computes one). Understanding these curves is crucial for monitoring progress.
Healthy Loss Curve Characteristics
A healthy training process typically shows:
- Downward Trend: The loss should generally decrease over time
- Smooth Decline: Extremely jagged or erratic curves may indicate issues
- Plateauing: Eventually, the curve should flatten as the model approaches optimal performance
Warning Signs to Watch For
Watch for these potential issues:
- Increasing Loss: If the loss starts increasing consistently, it may indicate that the learning rate is too high
- No Improvement: If the loss doesn’t decrease at all, there might be issues with your dataset or training configuration
- NaN Values: If the loss suddenly becomes NaN (Not a Number), it indicates numerical instability
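You can also inspect the logged losses programmatically. The sketch below uses TensorBoard’s EventAccumulator to read scalar events and apply the checks above; the run path is a placeholder, and the tag name loss/average is an assumption about what your training script logs, so print the available tags first if unsure:
import math
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Point this at the run subdirectory created inside --logging_dir
ea = EventAccumulator("/output_dir/logs/your-run")
ea.Reload()
print(ea.Tags()["scalars"])  # tags actually logged for this run

values = [e.value for e in ea.Scalars("loss/average")]  # assumed tag name

# A NaN loss indicates numerical instability: training has diverged
if any(math.isnan(v) for v in values):
    print("WARNING: NaN loss detected")

# Crude trend check: compare the mean of the first and last quarters of the run
q = max(1, len(values) // 4)
early = sum(values[:q]) / q
late = sum(values[-q:]) / q
print(f"early mean: {early:.4f}, late mean: {late:.4f}")
if late >= early:
    print("Loss is not decreasing; consider a lower learning rate or a dataset check")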
Sample Image Generation
In addition to monitoring loss curves, generating sample images during training provides visual feedback on your model’s progress.
To enable sample generation during training:
--sample_prompts=/training_dir/sample-prompts.txt
--sample_sampler="euler_a"
--sample_every_n_steps=100
or, to sample on an epoch schedule instead:
--sample_every_n_epochs=1
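The prompts file contains one prompt per line. sd-scripts also understands per-prompt options appended after the prompt text, such as --n for a negative prompt, --w and --h for size, --d for a fixed seed, --l for CFG scale, and --s for sampling steps. The prompt below is purely illustrative; a fixed seed makes successive samples directly comparable:
masterpiece, your-trigger-word, 1girl, standing in a garden --n lowres, bad anatomy --w 512 --h 512 --d 42 --l 7.5 --s 28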
These samples will be saved in your output directory and can help you:
- Visually track learning progress
- Identify when the model begins to produce satisfactory results
- Determine if the model is overfitting
- Decide when to stop training
Making Adjustments Based on Monitoring
Based on what you observe in TensorBoard and sample images, you might want to adjust your training parameters:
If Loss is Not Decreasing
- Increase the learning rate
- Check your dataset for issues
- Adjust network dimensions or alpha
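In sd-scripts these adjustments correspond to flags like the following; the values are illustrative starting points, not recommendations:
--learning_rate=2e-4
--network_dim=32
--network_alpha=16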
If Loss is Increasing or Fluctuating Wildly
- Decrease the learning rate
- Add or increase gradient clipping (--max_grad_norm)
- Increase batch size or gradient accumulation steps
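The corresponding sd-scripts flags might look like this, with illustrative values:
--learning_rate=5e-5
--max_grad_norm=1.0
--train_batch_size=2
--gradient_accumulation_steps=4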
If Training Seems to Plateau Too Early
- Adjust the learning rate scheduler
- Increase the number of training steps/epochs
- Consider modifying the network architecture
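For example (the scheduler choice and values here are illustrative; sd-scripts supports several schedulers):
--lr_scheduler=cosine_with_restarts
--lr_scheduler_num_cycles=3
--max_train_epochs=20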
Advanced TensorBoard Features
TensorBoard offers several advanced features:
Comparing Runs
You can view multiple training runs simultaneously in TensorBoard, allowing you to compare different parameter configurations side by side.
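TensorBoard treats each subdirectory of the log directory as a separate run, so pointing it at the parent directory overlays all of them (directory names below are illustrative):
/output_dir/logs/run-dim32/
/output_dir/logs/run-dim64/
tensorboard --logdir=/output_dir/logs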
Distribution and Histogram Analysis
TensorBoard can display distributions and histograms of your model’s weights and gradients, providing insights into how these values evolve during training.
Custom Scalar Tracking
You can track custom scalars beyond just loss, such as learning rates, gradient norms, or any other metric you want to monitor.
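If you’re extending the training script yourself, a minimal sketch with PyTorch’s SummaryWriter shows the idea; the metric name and values here are purely illustrative:
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="/output_dir/logs/custom")
for step in range(1000):
    lr = 1e-4 * (0.99 ** (step // 100))  # illustrative decaying value
    writer.add_scalar("custom/learning_rate", lr, step)
writer.close()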
Conclusion
Effectively monitoring your LoRA training is essential for producing high-quality models. Both wandb and TensorBoard provide powerful visualization capabilities that help you understand what’s happening during training and make informed adjustments to improve results.
Next Steps
Congratulations! You’ve completed the entire LoRA training guide series. With the knowledge you’ve gained, you can now:
- Prepare high-quality datasets with proper tagging and captioning
- Set up your training environment with the necessary tools
- Configure optimal training parameters for your specific needs
- Shrink and optimize your models for better performance
- Monitor training progress and make informed adjustments
If you’re ready to start a new training project, you can return to the Dataset Preparation guide. For more advanced topics, you might want to explore other sections of our documentation, such as fine-tuning techniques or specialized LoRA applications.