Dataset Preparation
Before you begin collecting your dataset, you will need to decide what you want to teach the model: a character, a style, or a new concept.
For now let’s imagine you want to teach your model wickerbeasts so you can generate your VRChat avatar every night.
Create the training_dir Directory
Before starting we need a directory where we’ll organize our datasets. We will also be using Git and Hugging Face to version control our smut.
Windows
Open up a terminal by pressing Win + R and typing in pwsh, or open PowerShell from the Start menu.
# Create a directory for your training data
mkdir C:\training_dir
cd C:\training_dir
Make sure to download and install Git for Windows if you haven’t already.
Linux
Open your terminal with Ctrl + Alt + T or through your application menu.
# Create a directory for your training data
mkdir -p ~/training_dir
cd ~/training_dir
Install Git if you haven’t already:
# For Debian/Ubuntu
sudo apt-get update
sudo apt-get install git
# For Arch Linux
sudo pacman -S git
# For Fedora
sudo dnf install git
macOS
Open Terminal from the Applications folder or by searching in Spotlight.
# Create a directory for your training data
mkdir -p ~/training_dir
cd ~/training_dir
If you don’t have Git installed, you can install it using Homebrew:
# Install Homebrew if needed
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Install Git
brew install git
Understanding Git for Dataset Management
Git is a powerful version control system that helps you track changes to files over time. For dataset management, Git offers several benefits:
- Version tracking: Keep track of changes to your dataset
- Branching: Create separate branches for different concepts or characters
- Collaboration: Share your dataset with others and merge their contributions
- Backup: Store your dataset on remote repositories like Hugging Face
Key Git Concepts
- Repository: A storage location for your project containing all files and their revision history
- Commit: A snapshot of your files at a specific point in time
- Branch: A parallel version of your repository that allows you to work on different features independently
- Clone: Creating a local copy of a remote repository
- Push/Pull: Sending/retrieving changes to/from a remote repository
Getting Started with Git
Set up your Git identity:
git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"
Setting Up Your Dataset Repository
Hugging Face provides an excellent platform for hosting datasets. Follow their getting started guide to create a new dataset repository.
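If you’d rather stay in the terminal, the Hugging Face CLI can create the repository for you. A minimal sketch, assuming you name the dataset training_dir to match this guide:
# Install the CLI and log in with a token from https://huggingface.co/settings/tokens
pip install -U "huggingface_hub[cli]"
huggingface-cli login
# Create a new dataset repository under your account
huggingface-cli repo create training_dir --type dataset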
Once you have your dataset repository on Hugging Face, you can clone it to your local machine (if you already created an empty training_dir earlier, Git will clone straight into it):
Windows
# Replace 'user' with your Hugging Face username
git clone git@hf.co:datasets/user/training_dir C:\training_dir
cd C:\training_dir
Linux / macOS
# Replace 'user' with your Hugging Face username
git clone git@hf.co:datasets/user/training_dir ~/training_dir
cd ~/training_dir
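Since image datasets are full of large binary files, Hugging Face repositories rely on Git LFS. A minimal one-time setup sketch (the *.png pattern is just an example; check your repo’s .gitattributes for patterns that are already tracked):
# One-time Git LFS setup on this machine
git lfs install
# Optionally track extra patterns not already covered by .gitattributes
git lfs track "*.png"
git add .gitattributes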
Creating Branches for Different Concepts
For each concept you want to train (like wickerbeast in our example), create a separate branch:
# Create and switch to a new branch
git branch wickerbeast
git checkout wickerbeast
# Or use this shorthand to create and switch at once
# git checkout -b wickerbeast
Common Git Workflow for Dataset Management
After adding or modifying files in your dataset:
# See what files have changed
git status
# Add all changes to staging
git add .
# Commit your changes with a descriptive message
git commit -m "Added 50 wickerbeast images"
Push your changes to Hugging Face:
git push origin wickerbeast
If you want to create a new branch for a different concept:
# First commit any current changes
git add .
git commit -m "Finalized wickerbeast dataset"
# Create a new branch from main
git checkout main
git checkout -b new_concept
With Git set up, let’s continue with downloading some wickerbeast data. For this we’ll make good use of the furry booru e621.net. There are several ways to download data from this site with the metadata intact; here we’ll use Gallery-DL, a command-line option.
Gallery-DL
Gallery-DL is a command-line program to download image galleries and collections from several image hosting sites, including e621.net. It’s perfect for users who prefer working with the terminal or need to automate the image collection process.
Windows
You can install Gallery-DL using pip:
pip install gallery-dl
Or download the latest Windows executable from the releases page.
Linux
Install Gallery-DL using pip:
pip install gallery-dl
On Arch Linux, Gallery-DL is available in the official repositories:
sudo pacman -S gallery-dl
For Debian/Ubuntu, you can also use pip or install from source:
# Using pip in a virtual environment (recommended)
python -m venv galleryenv
source galleryenv/bin/activate
pip install gallery-dl
macOS
Install Gallery-DL using pip:
pip3 install gallery-dl
If you have Homebrew installed, you can also use it:
brew install gallery-dl
Using Gallery-DL for E621
To download wickerbeast images from e621.net, use the following command:
# Download 40 wickerbeast images
gallery-dl "https://e621.net/posts?tags=wickerbeast+solo+-duo+-group+-comic+-meme+-animated+order:score" -l 40
For images with multiple characters:
# Download 10 wickerbeast images with multiple characters
gallery-dl "https://e621.net/posts?tags=wickerbeast+-solo+-comic+-meme+-animated+order:score" -l 10
Basic Dataset Configuration for sd-scripts
Before diving into advanced TOML configurations, it’s helpful to understand how to set up datasets using basic command-line arguments with sd-scripts. This approach is simpler but offers less flexibility than the TOML method.
Simple Directory Structure
For basic training with sd-scripts, you can organize your dataset like this:
training_dir/
├── train_data/ # Main training images directory
│ ├── image1.png
│ ├── image1.txt # Caption file with same name as image
│ ├── image2.png
│ └── image2.txt
└── reg_data/ # Optional regularization images
├── reg1.png
├── reg1.txt
├── reg2.png
└── reg2.txt
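One caveat: sd-scripts’ DreamBooth-style loader usually expects images to sit in subdirectories of --train_data_dir named <repeats>_<name>, where the leading number controls how many times each image is repeated per epoch. If your images aren’t picked up with the flat layout above, restructure it like this (the 10 and 1 repeat counts are just example values):
training_dir/
├── train_data/
│   └── 10_wickerbeast/       # 10 repeats of the concept images
│       ├── image1.png
│       └── image1.txt
└── reg_data/
    └── 1_animal/             # 1 repeat of the regularization images
        ├── reg1.png
        └── reg1.txt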
Command-Line Training Options
Here are the basic command-line arguments for dataset configuration in sd-scripts:
DreamBooth Style Training
For training a concept or character using the DreamBooth approach, point --train_data_dir at your concept images and --reg_data_dir at your (optional) regularization images; the remaining flags set the caption extension, training resolution, batch size, epoch count, learning rate, and the LoRA network shape:
accelerate launch train_network.py \
--pretrained_model_name_or_path="./models/stable-diffusion-v1-5" \
--train_data_dir="./train_data" \
--reg_data_dir="./reg_data" \
--output_dir="./output/my_lora" \
--caption_extension=".txt" \
--resolution=512 \
--train_batch_size=1 \
--max_train_epochs=10 \
--learning_rate=1e-4 \
--network_module="networks.lora" \
--network_dim=32 \
--network_alpha=16
Additional Dataset Options
You can further customize your dataset with these arguments (shown here with explanatory comments; append the bare flags to your training command):
--shuffle_caption         # Shuffle caption tags
--keep_tokens=1           # Keep first N tokens unshuffled
--color_aug               # Enable color augmentation
--flip_aug                # Enable flip augmentation
--random_crop             # Enable random cropping
--dataset_repeats=10      # Repeat dataset N times
--enable_bucket           # Enable resolution bucketing
--min_bucket_reso=256     # Min bucket resolution
--max_bucket_reso=1024    # Max bucket resolution
--bucket_reso_steps=64    # Bucket resolution steps
--bucket_no_upscale       # Don't upscale small images
Caption Processing Options
sd-scripts offers several options for handling captions during training:
--caption_dropout_rate=0.1            # Probability of dropping the entire caption
--caption_tag_dropout_rate=0.1        # Probability of dropping individual tags
--caption_dropout_every_n_epochs=2    # Drop all captions every N epochs
--caption_extension=".txt"            # Caption file extension
--shuffle_caption                     # Shuffle caption tags
For a detailed explanation of how captions are processed in sd-scripts, see the How do the Captions Get Processed guide.
File Types Support
By default, sd-scripts supports these image formats:
- PNG (.png)
- JPEG (.jpg, .jpeg)
- Bitmap (.bmp)
- WebP (.webp)
- JXL (.jxl) [if plugin is installed]
- AVIF (.avif) [if plugin is installed]
Each image should have a corresponding caption file with the same name but with the caption extension (default: .txt).
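If some images are missing captions, a small shell loop can stub out empty caption files for you to fill in later. A bash sketch, assuming PNG images in train_data/:
# Create an empty .txt caption file for every image that lacks one
for img in train_data/*.png; do
  txt="${img%.png}.txt"
  [ -f "$txt" ] || touch "$txt"
done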
Working with Regularization Images
Regularization images help prevent overfitting and language drift in DreamBooth-style training. They’re examples of the base class or generic concept that your specific subject belongs to.
For a comprehensive guide on regularization images, including what they are, why they’re essential, and how to create and use them effectively, see the Regularization Images guide.
A basic example setup:
- Place regularization images in a separate directory
- Specify this directory with --reg_data_dir
- Caption files for regularization images should contain generic class concepts
For example, if training a “wickerbeast” concept:
- Training image captions: “wickerbeast, red fur, standing”
- Regularization image captions: “animal, standing”
Example Training Commands
Training a Character/Concept
accelerate launch train_network.py \
--pretrained_model_name_or_path="./models/stable-diffusion-v1-5" \
--train_data_dir="./wickerbeast_images" \
--reg_data_dir="./animal_images" \
--caption_extension=".txt" \
--resolution=512 \
--train_batch_size=1 \
--max_train_epochs=15 \
--shuffle_caption \
--keep_tokens=1 \
--network_module="networks.lora" \
--network_dim=32 \
--network_alpha=16 \
--output_dir="./output/wickerbeast_lora" \
--save_model_as="safetensors"
Training an Art Style
accelerate launch train_network.py \
--pretrained_model_name_or_path="./models/stable-diffusion-v1-5" \
--train_data_dir="./impressionist_style" \
--caption_extension=".txt" \
--resolution=768 \
--train_batch_size=2 \
--max_train_epochs=5 \
--shuffle_caption \
--flip_aug \
--color_aug \
--network_module="networks.lora" \
--network_dim=16 \
--network_alpha=8 \
--output_dir="./output/impressionist_style" \
--save_model_as="safetensors"
Limitations of Basic Configuration
While the command-line approach is simpler, it has limitations compared to the TOML method:
- Can’t mix multiple datasets with different resolutions
- Can’t set different parameters for different image directories
- Limited control over how captions are processed
- Can’t use advanced features like tag grouping or wildcard notation
For more complex training needs, consider using the TOML configuration described in the next section.
Advanced Dataset Configuration for sd-scripts
While sd-scripts supports basic dataset setups using the command-line parameters, more advanced configuration is possible through the use of TOML configuration files. This section covers how to organize your dataset directories and create TOML config files that fully leverage sd-scripts’ capabilities.
Directory Structure
For training with sd-scripts, it’s recommended to organize your datasets in a structured manner. Here’s a suggested directory layout:
training_dir/
├── configs/
│ └── dataset_config.toml # TOML configuration for your datasets
├── datasets/
│ ├── concept1/ # Images for your first concept
│ │ ├── image1.png
│ │ ├── image1.txt # Caption file for image1.png
│ │ ├── image2.png
│ │ └── image2.txt
│ ├── concept2/ # Images for your second concept
│ │ ├── image1.png
│ │ ├── image1.txt
│ │ ├── image2.png
│ │ └── image2.txt
│ └── regularization/ # Regularization images if needed
│ ├── reg1.png
│ ├── reg1.txt
│ ├── reg2.png
│ └── reg2.txt
└── output/ # Directory for trained models
└── my_lora/ # Each training run gets its own folder
TOML Configuration File
The TOML configuration file allows you to specify multiple datasets with different settings, providing much more flexibility than command-line arguments. Here’s a comprehensive example of a dataset_config.toml file:
[general]
# General settings that apply to all datasets
shuffle_caption = true
caption_extension = ".txt"
keep_tokens = 1
enable_bucket = true
bucket_reso_steps = 64
bucket_no_upscale = true
# First dataset (DreamBooth style)
[[datasets]]
resolution = 512 # Training resolution
batch_size = 4 # Batch size for this dataset
keep_tokens = 2 # Override general keep_tokens setting
# Main concept subset
[[datasets.subsets]]
image_dir = "datasets/concept1"
class_tokens = "wickerbeast furry" # Class tokens for concept
num_repeats = 10 # Repeat each image 10 times
# Regularization subset
[[datasets.subsets]]
is_reg = true # This is a regularization dataset
image_dir = "datasets/regularization"
class_tokens = "animal"
keep_tokens = 1 # Override parent dataset's keep_tokens
# Second dataset (at different resolution)
[[datasets]]
resolution = [768, 768] # Different resolution as [width, height]
batch_size = 2 # Different batch size for this dataset
flip_aug = true # Enable flip augmentation for this dataset
[[datasets.subsets]]
image_dir = "datasets/concept2"
caption_prefix = "masterpiece, best quality, " # Add prefix to all captions
caption_suffix = ", digital art" # Add suffix to all captions
num_repeats = 8
Understanding TOML Configuration Options
The configuration file has three levels of settings:
- [general] - Global settings applied to all datasets
- [[datasets]] - Settings for each dataset
- [[datasets.subsets]] - Settings for individual subsets within a dataset
Global and Dataset Options
Option | Description | Default | Example |
---|---|---|---|
`resolution` | Training resolution (single value or `[width, height]`) | `512` | `512` or `[768, 768]` |
`batch_size` | Batch size for training | `1` | `4` |
`enable_bucket` | Enable resolution bucketing | `false` | `true` |
`bucket_no_upscale` | Don’t upscale images smaller than target resolution | `false` | `true` |
`bucket_reso_steps` | Resolution steps for bucketing | `64` | `64` |
`min_bucket_reso` | Minimum resolution for bucketing | `256` | `256` |
`max_bucket_reso` | Maximum resolution for bucketing | `1024` | `1024` |
Subset Options
Option | Description | Default | Example |
---|---|---|---|
`image_dir` | Directory containing training images | Required | `"datasets/concept1"` |
`class_tokens` | Class tokens for DreamBooth training | None | `"wickerbeast furry"` |
`is_reg` | Whether subset is for regularization | `false` | `true` |
`num_repeats` | Number of times to repeat each image | `1` | `10` |
`keep_tokens` | Number of tokens to keep (not shuffle) at caption start | `0` | `2` |
`caption_extension` | File extension for caption files | `".txt"` | `".txt"` |
`shuffle_caption` | Whether to shuffle caption tags | `false` | `true` |
`flip_aug` | Enable horizontal flip augmentation | `false` | `true` |
`color_aug` | Enable color augmentation | `false` | `true` |
`random_crop` | Enable random cropping | `false` | `true` |
`caption_prefix` | Prefix to add to all captions | None | `"masterpiece, best quality, "` |
`caption_suffix` | Suffix to add to all captions | None | `", digital art"` |
`alpha_mask` | Use transparency as a mask when calculating loss | `false` | `true` |
`caption_dropout_rate` | Caption dropout probability | `0.0` | `0.1` |
`caption_tag_dropout_rate` | Caption tag dropout probability | `0.0` | `0.05` |
`keep_tokens_separator` | Separator for keeping specific token sections | None | `"\|\|\|"` |
`secondary_separator` | Secondary separator for grouping tags | None | `";;;"` |
`enable_wildcard` | Enable wildcard notation in captions | `false` | `true` |
Advanced Caption Features
The TOML configuration enables several advanced caption features:
1. Wildcard Notation
With enable_wildcard = true, you can use wildcard notation in captions:
1girl, hatsune miku, {simple|white|plain} background
This randomly selects “simple”, “white”, or “plain” when training.
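For reference, the flag lives at the subset level (or under [general] to apply everywhere); a minimal TOML sketch:
[[datasets.subsets]]
image_dir = "datasets/concept1"
enable_wildcard = true  # allows {a|b|c} choices in caption files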
2. Keep Tokens Separator
With keep_tokens_separator = "|||":
1girl, hatsune miku ||| best quality, digital art ||| masterpiece, highly detailed
The part before the first separator (“1girl, hatsune miku”) and the part after the second (“masterpiece, highly detailed”) are always kept in place, never shuffled or dropped; only the middle section is subject to shuffling and dropout.
3. Secondary Separator for Tag Grouping
With secondary_separator = ";;;":
1girl, hatsune miku, sky;;;clouds;;;sunset
The “sky;;;clouds;;;sunset” part is treated as a single group, maintaining their relationship.
Using the TOML Configuration
To use your TOML configuration with sd-scripts, add the --dataset_config parameter to your training command:
accelerate launch --num_cpu_threads_per_process=2 "./train_network.py" \
--pretrained_model_name_or_path="./models/stable-diffusion-v1-5" \
--network_module="networks.lora" \
--network_dim=32 \
--network_alpha=16 \
--train_batch_size=1 \
--learning_rate=1e-4 \
--max_train_epochs=5 \
--output_dir="./output/my_lora" \
--save_model_as="safetensors" \
--dataset_config="./configs/dataset_config.toml" \
--mixed_precision="fp16"
Special Dataset Features
Masked Loss Training
sd-scripts supports masked loss training, which lets you focus on specific parts of the image during training. This is useful for character training where you want to ignore backgrounds.
There are two ways to implement masked loss:
1. Using transparency - Set alpha_mask = true in your TOML configuration:
[[datasets.subsets]]
image_dir = "datasets/masked_concept"
alpha_mask = true  # Use transparency as mask
2. Using separate mask images - Provide a directory with mask images:
[[datasets.subsets]]
image_dir = "datasets/concept"
conditioning_data_dir = "datasets/concept_masks"  # Directory containing masks
Mask images should be the same size as the training images, with white areas marking the parts to train on and black areas the parts to ignore. Depending on your sd-scripts version, you may also need to pass --masked_loss on the training command line for the conditioning_data_dir approach to take effect.
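If your training images already have transparent backgrounds, you can derive standalone mask images from their alpha channels rather than painting them by hand. A sketch using ImageMagick (assumes the magick binary from ImageMagick 7 and an existing datasets/concept_masks directory):
# Extract each image's alpha channel as a white-on-black mask
for img in datasets/concept/*.png; do
  magick "$img" -alpha extract "datasets/concept_masks/$(basename "$img")"
done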
By leveraging TOML configuration files, you can create sophisticated dataset setups that maximize the effectiveness of your training with sd-scripts.
Next Steps
Once you have prepared your dataset with quality images, you should:
- Consider whether you need Regularization Images to prevent overfitting and language drift in your model.
- Automatically tag and caption your images to make them more useful for training. Continue to the Auto-Tagging and Captioning guide to learn how to use AI to analyze, tag, and caption your image content.