Dataset Preparation

Before you begin collecting your dataset, you will need to decide what you want to teach the model: it can be a character, a style, or a new concept.

For now, let’s imagine you want to teach your model wickerbeasts so you can generate your VRChat avatar every night.

Create the training_dir Directory

Before starting, we need a directory where we’ll organize our datasets. We will also be using Git and Hugging Face to version control our smut.

On Windows, open a terminal by pressing Win + R and typing pwsh, or open PowerShell from the Start menu.

# Create a directory for your training data
mkdir C:\training_dir
cd C:\training_dir

Make sure to download and install Git for Windows if you haven’t already.

On Linux, open your terminal with Ctrl + Alt + T or through your application menu.

# Create a directory for your training data
mkdir -p ~/training_dir
cd ~/training_dir

Install Git if you haven’t already:

# For Debian/Ubuntu
sudo apt-get update
sudo apt-get install git

# For Arch Linux
sudo pacman -S git

# For Fedora
sudo dnf install git

On macOS, open Terminal from the Applications folder or by searching in Spotlight.

# Create a directory for your training data
mkdir -p ~/training_dir
cd ~/training_dir

If you don’t have Git installed, you can install it using Homebrew:

# Install Homebrew if needed
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install Git
brew install git

Understanding Git for Dataset Management

Git is a powerful version control system that helps you track changes to files over time. For dataset management, Git offers several benefits:

  • Version tracking: Keep track of changes to your dataset
  • Branching: Create separate branches for different concepts or characters
  • Collaboration: Share your dataset with others and merge their contributions
  • Backup: Store your dataset on remote repositories like Hugging Face

Key Git Concepts

  • Repository: A storage location for your project containing all files and their revision history
  • Commit: A snapshot of your files at a specific point in time
  • Branch: A parallel version of your repository that allows you to work on different features independently
  • Clone: Creating a local copy of a remote repository
  • Push/Pull: Sending/retrieving changes to/from a remote repository

Getting Started with Git

Set up your Git identity:

git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"
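
If you want to experiment locally before wiring up a remote, here is a minimal sketch of turning your training_dir into a repository and recording a first snapshot:

# Initialize a repository in the current directory
git init

# Stage everything and record the first commit
git add .
git commit -m "Initial dataset commit"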

Setting Up Your Dataset Repository

Hugging Face provides an excellent platform for hosting datasets. Follow their getting started guide to create a new dataset repository.

Once you have your dataset repository on Hugging Face, you can clone it to your local machine:

# Windows: replace 'user' with your Hugging Face username
git clone git@hf.co:/datasets/user/training_dir C:\training_dir
cd C:\training_dir
# Linux/macOS: replace 'user' with your Hugging Face username
git clone git@hf.co:/datasets/user/training_dir ~/training_dir
cd ~/training_dir
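
Note that Hugging Face stores large binary files such as images through Git LFS. Make sure Git LFS is installed (it ships with recent Git for Windows; on Linux and macOS install the git-lfs package), then enable it once per machine:

# Set up the Git LFS hooks for your user account
git lfs install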

Creating Branches for Different Concepts

For each concept you want to train (like wickerbeast in our example), create a separate branch:

# Create and switch to a new branch
git branch wickerbeast
git checkout wickerbeast

# Or use this shorthand to create and switch at once
# git checkout -b wickerbeast

Common Git Workflow for Dataset Management

  1. After adding or modifying files in your dataset:

    # See what files have changed
    git status
    
    # Add all changes to staging
    git add .
    
    # Commit your changes with a descriptive message
    git commit -m "Added 50 wickerbeast images"
    
  2. Push your changes to Hugging Face:

    git push origin wickerbeast
    
  3. If you want to create a new branch for a different concept:

    # First commit any current changes
    git add .
    git commit -m "Finalized wickerbeast dataset"
    
    # Create a new branch from main
    git checkout main
    git checkout -b new_concept
    
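  4. Push the new branch to Hugging Face so it is backed up remotely:

    # Publish the branch and set its upstream in one step
    git push -u origin new_concept
    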

With Git set up, let’s continue with downloading some wickerbeast data. For this, we’ll make good use of the furry booru e621.net. There are several ways to download data from this site with the metadata intact; we’ll cover both command-line and GUI options.

Gallery-DL is a command-line program to download image galleries and collections from several image hosting sites, including e621.net. It’s perfect for users who prefer working with the terminal or need to automate the image collection process.

On Windows, you can install Gallery-DL using pip:

pip install gallery-dl

Or download the latest Windows executable from the releases page.

On Linux, install Gallery-DL using pip:

pip install gallery-dl

On Arch Linux, Gallery-DL is available in the AUR:

# Using an AUR helper like yay
yay -S gallery-dl

# Or manually
git clone https://aur.archlinux.org/gallery-dl.git
cd gallery-dl
makepkg -si

For Debian/Ubuntu, you can also use pip or install from source:

# Using pip in a virtual environment (recommended)
python -m venv galleryenv
source galleryenv/bin/activate
pip install gallery-dl

On macOS, install Gallery-DL using pip:

pip3 install gallery-dl

If you have Homebrew installed, you can also use it:

brew install gallery-dl

To download wickerbeast images from e621.net, use the following command:

# Download the 40 highest-scoring solo wickerbeast images
gallery-dl "https://e621.net/posts?tags=wickerbeast+solo+-duo+-group+-comic+-meme+-animated+order:score" --range "1-40"

For images with multiple characters:

# Download the 10 highest-scoring wickerbeast images with multiple characters
gallery-dl "https://e621.net/posts?tags=wickerbeast+-solo+-comic+-meme+-animated+order:score" --range "1-10"
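
gallery-dl can also write each post’s metadata (tags, rating, score, and so on) to a JSON file next to each image, which keeps the e621 tag data available for captioning later:

# Same query as above, plus a .json metadata sidecar per post
gallery-dl --write-metadata "https://e621.net/posts?tags=wickerbeast+solo+-duo+-group+-comic+-meme+-animated+order:score" --range "1-40"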

Basic Dataset Configuration for sd-scripts

Before diving into advanced TOML configurations, it’s helpful to understand how to set up datasets using basic command-line arguments with sd-scripts. This approach is simpler but offers less flexibility than the TOML method.

Simple Directory Structure

For basic training with sd-scripts, you can organize your dataset like this:

training_dir/
├── train_data/               # Main training images directory
│   ├── image1.png
│   ├── image1.txt            # Caption file with same name as image
│   ├── image2.png
│   └── image2.txt
└── reg_data/                 # Optional regularization images
    ├── reg1.png
    ├── reg1.txt
    ├── reg2.png
    └── reg2.txt
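
On Linux or macOS you can scaffold this layout in one line (PowerShell’s mkdir creates parent directories by default, so drop the -p flag on Windows):

# Create the training and regularization image directories
mkdir -p train_data reg_data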

Command-Line Training Options

Here are the basic command-line arguments for dataset configuration in sd-scripts:

DreamBooth Style Training

For training a concept or character using the DreamBooth approach:

# Note: train_data_dir holds your concept images and reg_data_dir optional
# regularization images; caption_extension sets the caption file extension;
# network_module, network_dim, and network_alpha select and size the LoRA.
accelerate launch train_network.py \
  --pretrained_model_name_or_path="./models/stable-diffusion-v1-5" \
  --train_data_dir="./train_data" \
  --reg_data_dir="./reg_data" \
  --output_dir="./output/my_lora" \
  --caption_extension=".txt" \
  --resolution=512 \
  --train_batch_size=1 \
  --max_train_epochs=10 \
  --learning_rate=1e-4 \
  --network_module="networks.lora" \
  --network_dim=32 \
  --network_alpha=16

Additional Dataset Options

You can further customize your dataset with these arguments:

--shuffle_caption                # Shuffle caption tags
--keep_tokens=1                  # Keep first N tokens unshuffled
--color_aug                      # Enable color augmentation
--flip_aug                       # Enable flip augmentation
--random_crop                    # Enable random cropping
--dataset_repeats=10             # Repeat dataset N times
--enable_bucket                  # Enable resolution bucketing
--min_bucket_reso=256            # Min bucket resolution
--max_bucket_reso=1024           # Max bucket resolution
--bucket_reso_steps=64           # Bucket resolution steps
--bucket_no_upscale              # Don't upscale small images

Caption Processing Options

sd-scripts offers several options for handling captions during training:

--caption_dropout_rate=0.1            # Probability of dropping caption
--caption_tag_dropout_rate=0.1        # Probability of dropping tags
--caption_dropout_every_n_epochs=2    # Drop captions every N epochs
--caption_extension=".txt"            # Caption file extension
--shuffle_caption                     # Shuffle caption tags

For a detailed explanation of how captions are processed in sd-scripts, see the How do the Captions Get Processed guide.

Supported File Types

By default, sd-scripts supports these image formats:

  • PNG (.png)
  • JPEG (.jpg, .jpeg)
  • Bitmap (.bmp)
  • WebP (.webp)
  • JXL (.jxl) [if plugin is installed]
  • AVIF (.avif) [if plugin is installed]

Each image should have a corresponding caption file with the same name but with the caption extension (default: .txt).
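
If some images are missing caption files, you can stub them out quickly on Linux/macOS (a sketch; adjust the glob if your images aren’t PNGs):

# Create an empty .txt caption for any .png that lacks one
for f in train_data/*.png; do
  [ -e "${f%.png}.txt" ] || touch "${f%.png}.txt"
done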

Working with Regularization Images

Regularization images help prevent overfitting and language drift in DreamBooth-style training. They’re examples of the base class or generic concept that your specific subject belongs to.

For a comprehensive guide on regularization images, including what they are, why they’re essential, and how to create and use them effectively, see the Regularization Images guide.

A basic example setup:

  1. Place regularization images in a separate directory
  2. Specify this directory with --reg_data_dir
  3. Caption files for regularization images should contain generic class concepts

For example, if training a “wickerbeast” concept:

  • Training image captions: “wickerbeast, red fur, standing”
  • Regularization image captions: “animal, standing”

Example Training Commands

Training a Character/Concept

accelerate launch train_network.py \
  --pretrained_model_name_or_path="./models/stable-diffusion-v1-5" \
  --train_data_dir="./wickerbeast_images" \
  --reg_data_dir="./animal_images" \
  --caption_extension=".txt" \
  --resolution=512 \
  --train_batch_size=1 \
  --max_train_epochs=15 \
  --shuffle_caption \
  --keep_tokens=1 \
  --network_module="networks.lora" \
  --network_dim=32 \
  --network_alpha=16 \
  --output_dir="./output/wickerbeast_lora" \
  --save_model_as="safetensors"

Training an Art Style

accelerate launch train_network.py \
  --pretrained_model_name_or_path="./models/stable-diffusion-v1-5" \
  --train_data_dir="./impressionist_style" \
  --caption_extension=".txt" \
  --resolution=768 \
  --train_batch_size=2 \
  --max_train_epochs=5 \
  --shuffle_caption \
  --flip_aug \
  --color_aug \
  --network_module="networks.lora" \
  --network_dim=16 \
  --network_alpha=8 \
  --output_dir="./output/impressionist_style" \
  --save_model_as="safetensors"

Limitations of Basic Configuration

While the command-line approach is simpler, it has limitations compared to the TOML method:

  1. Can’t mix multiple datasets with different resolutions
  2. Can’t set different parameters for different image directories
  3. Limited control over how captions are processed
  4. Can’t use advanced features like tag grouping or wildcard notation

For more complex training needs, consider using the TOML configuration described in the next section.

Advanced Dataset Configuration for sd-scripts

While sd-scripts supports basic dataset setups using the command-line parameters, more advanced configuration is possible through the use of TOML configuration files. This section covers how to organize your dataset directories and create TOML config files that fully leverage sd-scripts’ capabilities.

Directory Structure

For training with sd-scripts, it’s recommended to organize your datasets in a structured manner. Here’s a suggested directory layout:

training_dir/
├── configs/
│   └── dataset_config.toml   # TOML configuration for your datasets
├── datasets/
│   ├── concept1/             # Images for your first concept
│   │   ├── image1.png
│   │   ├── image1.txt        # Caption file for image1.png
│   │   ├── image2.png
│   │   └── image2.txt
│   ├── concept2/             # Images for your second concept
│   │   ├── image1.png
│   │   ├── image1.txt
│   │   ├── image2.png
│   │   └── image2.txt
│   └── regularization/       # Regularization images if needed
│       ├── reg1.png
│       ├── reg1.txt
│       ├── reg2.png
│       └── reg2.txt
└── output/                   # Directory for trained models
    └── my_lora/              # Each training run gets its own folder

TOML Configuration File

The TOML configuration file allows you to specify multiple datasets with different settings, providing much more flexibility than command-line arguments. Here’s a comprehensive example of a dataset_config.toml file:

[general]
# General settings that apply to all datasets
shuffle_caption = true
caption_extension = ".txt"
keep_tokens = 1
enable_bucket = true
bucket_reso_steps = 64
bucket_no_upscale = true

# First dataset (DreamBooth style)
[[datasets]]
resolution = 512             # Training resolution
batch_size = 4               # Batch size for this dataset
keep_tokens = 2              # Override general keep_tokens setting

  # Main concept subset
  [[datasets.subsets]]
  image_dir = "datasets/concept1"
  class_tokens = "wickerbeast furry"  # Class tokens for concept
  num_repeats = 10                    # Repeat each image 10 times

  # Regularization subset
  [[datasets.subsets]]
  is_reg = true              # This is a regularization dataset
  image_dir = "datasets/regularization"
  class_tokens = "animal"
  keep_tokens = 1            # Override parent dataset's keep_tokens

# Second dataset (at different resolution)
[[datasets]]
resolution = [768, 768]      # Different resolution as [width, height]
batch_size = 2               # Different batch size for this dataset
flip_aug = true              # Enable flip augmentation for this dataset

  [[datasets.subsets]]
  image_dir = "datasets/concept2"
  caption_prefix = "masterpiece, best quality, "  # Add prefix to all captions
  caption_suffix = ", digital art"                # Add suffix to all captions
  num_repeats = 8

Understanding TOML Configuration Options

The configuration file has three levels of settings:

  1. [general] - Global settings applied to all datasets
  2. [[datasets]] - Settings for each dataset
  3. [[datasets.subsets]] - Settings for individual subsets within a dataset

Global and Dataset Options

| Option | Description | Default | Example |
|--------|-------------|---------|---------|
| resolution | Training resolution (single value or [width, height]) | 512 | 512 or [768, 768] |
| batch_size | Batch size for training | 1 | 4 |
| enable_bucket | Enable resolution bucketing | false | true |
| bucket_no_upscale | Don't upscale images smaller than target resolution | false | true |
| bucket_reso_steps | Resolution steps for bucketing | 64 | 64 |
| min_bucket_reso | Minimum resolution for bucketing | 256 | 256 |
| max_bucket_reso | Maximum resolution for bucketing | 1024 | 1024 |

Subset Options

| Option | Description | Default | Example |
|--------|-------------|---------|---------|
| image_dir | Directory containing training images | Required | "datasets/concept1" |
| class_tokens | Class tokens for DreamBooth training | | "wickerbeast furry" |
| is_reg | Whether subset is for regularization | false | true |
| num_repeats | Number of times to repeat each image | 1 | 10 |
| keep_tokens | Number of tokens to keep (not shuffle) at caption start | 0 | 2 |
| caption_extension | File extension for caption files | ".txt" | ".txt" |
| shuffle_caption | Whether to shuffle caption tags | false | true |
| flip_aug | Enable horizontal flip augmentation | false | true |
| color_aug | Enable color augmentation | false | true |
| random_crop | Enable random cropping | false | true |
| caption_prefix | Prefix to add to all captions | | "masterpiece, best quality, " |
| caption_suffix | Suffix to add to all captions | | ", digital art" |
| alpha_mask | Use transparency as mask for calculating loss | false | true |
| caption_dropout_rate | Caption dropout probability | 0.0 | 0.1 |
| caption_tag_dropout_rate | Caption tag dropout probability | 0.0 | 0.05 |
| keep_tokens_separator | Separator for keeping specific token sections | | "\|\|\|" |
| secondary_separator | Secondary separator for grouping tags | | ";;;" |
| enable_wildcard | Enable wildcard notation in captions | false | true |

Advanced Caption Features

The TOML configuration enables several advanced caption features:

1. Wildcard Notation

With enable_wildcard = true, you can use wildcard notation in captions:

1girl, hatsune miku, {simple|white|plain} background

This randomly selects “simple”, “white”, or “plain” when training.

2. Keep Tokens Separator

With keep_tokens_separator = "|||":

1girl, hatsune miku ||| best quality, digital art ||| masterpiece, highly detailed

The first section (“1girl, hatsune miku”) and the third section (“masterpiece, highly detailed”) are kept fixed and are never shuffled or dropped. Only the middle section (“best quality, digital art”) can be shuffled or dropped, depending on your other settings.

3. Secondary Separator for Tag Grouping

With secondary_separator = ";;;":

1girl, hatsune miku, sky;;;clouds;;;sunset

The “sky;;;clouds;;;sunset” part is treated as a single group, maintaining their relationship.
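
Putting these together, a subset that enables all three features might look like the following sketch (the separator strings are arbitrary, as long as they match what you use in your caption files):

[[datasets.subsets]]
image_dir = "datasets/concept1"
enable_wildcard = true
keep_tokens_separator = "|||"
secondary_separator = ";;;"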

Using the TOML Configuration

To use your TOML configuration with sd-scripts, add the --dataset_config parameter to your training command:

accelerate launch --num_cpu_threads_per_process=2 "./train_network.py" \
  --pretrained_model_name_or_path="./models/stable-diffusion-v1-5" \
  --network_module="networks.lora" \
  --network_dim=32 \
  --network_alpha=16 \
  --train_batch_size=1 \
  --learning_rate=1e-4 \
  --max_train_epochs=5 \
  --output_dir="./output/my_lora" \
  --save_model_as="safetensors" \
  --dataset_config="./configs/dataset_config.toml" \
  --mixed_precision="fp16"

Special Dataset Features

Masked Loss Training

sd-scripts supports masked loss training, which lets you focus on specific parts of the image during training. This is useful for character training where you want to ignore backgrounds.

There are two ways to implement masked loss:

  1. Using transparency - Set alpha_mask = true in your TOML configuration:

    [[datasets.subsets]]
    image_dir = "datasets/masked_concept"
    alpha_mask = true  # Use transparency as mask
    
  2. Using separate mask images - Provide a directory with mask images:

    [[datasets.subsets]]
    image_dir = "datasets/concept"
    conditioning_data_dir = "datasets/concept_masks"  # Directory containing masks
    

Mask images should be the same size as training images, with white areas indicating parts to train and black areas to ignore.
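
If your subject is already cut out on a transparent background, one way to derive a standalone mask is to extract the alpha channel, for example with ImageMagick 7 (assuming it is installed):

# Extract the alpha channel as a white-on-black mask image
magick datasets/concept/image1.png -alpha extract datasets/concept_masks/image1.png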

By leveraging TOML configuration files, you can create sophisticated dataset setups that maximize the effectiveness of your training with sd-scripts.

Next Steps

Once you have prepared your dataset with quality images, you should:

  1. Consider whether you need Regularization Images to prevent overfitting and language drift in your model

  2. Automatically tag and caption your images to make them more useful for training. Continue to the Auto-Tagging and Captioning guide to learn how to use AI to analyze, tag, and caption your image content.