Dataset Preparation

Before you begin collecting your dataset, you will need to decide what you want to teach the model: it can be a character, a style, or a new concept.

For now, let’s imagine you want to teach your model wickerbeasts so you can generate your VRChat avatar every night.

Create the training_dir Directory

Before starting, we need a directory where we’ll organize our datasets. We will also be using Git and Hugging Face to version control our smut.

On Windows, open a terminal by pressing Win + R and typing pwsh, or open PowerShell from the Start menu.

# Create a directory for your training data
mkdir C:\training_dir
cd C:\training_dir

Make sure to download and install Git for Windows if you haven’t already.

On Linux, open your terminal with Ctrl + Alt + T or through your application menu.

# Create a directory for your training data
mkdir -p ~/training_dir
cd ~/training_dir

Install Git if you haven’t already:

# For Debian/Ubuntu
sudo apt-get update
sudo apt-get install git

# For Arch Linux
sudo pacman -S git

# For Fedora
sudo dnf install git

On macOS, open Terminal from the Applications folder or by searching in Spotlight.

# Create a directory for your training data
mkdir -p ~/training_dir
cd ~/training_dir

If you don’t have Git installed, you can install it using Homebrew:

# Install Homebrew if needed
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install Git
brew install git

Understanding Git for Dataset Management

Git is a powerful version control system that helps you track changes to files over time. For dataset management, Git offers several benefits:

  • Version tracking: Keep track of changes to your dataset
  • Branching: Create separate branches for different concepts or characters
  • Collaboration: Share your dataset with others and merge their contributions
  • Backup: Store your dataset on remote repositories like Hugging Face

Key Git Concepts

  • Repository: A storage location for your project containing all files and their revision history
  • Commit: A snapshot of your files at a specific point in time
  • Branch: A parallel version of your repository that allows you to work on different features independently
  • Clone: Creating a local copy of a remote repository
  • Push/Pull: Sending/retrieving changes to/from a remote repository

Getting Started with Git

Set up your Git identity:

git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"
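
If you want to experiment locally before wiring up a remote, here is a minimal sketch of turning your training_dir into a repository and recording a first snapshot:

# Initialize a repository in the current directory
git init

# Stage everything and record the first commit
git add .
git commit -m "Initial dataset commit"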

Setting Up Your Dataset Repository

Hugging Face provides an excellent platform for hosting datasets. Follow their getting started guide to create a new dataset repository.

Once you have your dataset repository on Hugging Face, you can clone it to your local machine:

# Windows: replace 'user' with your Hugging Face username
git clone git@hf.co:/datasets/user/training_dir C:\training_dir
cd C:\training_dir
# Linux/macOS: replace 'user' with your Hugging Face username
git clone git@hf.co:/datasets/user/training_dir ~/training_dir
cd ~/training_dir
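
Note that Hugging Face stores large binary files such as images through Git LFS. Make sure Git LFS is installed (it ships with recent Git for Windows; on Linux and macOS install the git-lfs package), then enable it once per machine:

# Set up the Git LFS hooks for your user account
git lfs install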

Creating Branches for Different Concepts

For each concept you want to train (like wickerbeast in our example), create a separate branch:

# Create and switch to a new branch
git branch wickerbeast
git checkout wickerbeast

# Or use this shorthand to create and switch at once
# git checkout -b wickerbeast

Common Git Workflow for Dataset Management

  1. After adding or modifying files in your dataset:

    # See what files have changed
    git status
    
    # Add all changes to staging
    git add .
    
    # Commit your changes with a descriptive message
    git commit -m "Added 50 wickerbeast images"
    
  2. Push your changes to Hugging Face:

    git push origin wickerbeast
    
  3. If you want to create a new branch for a different concept:

    # First commit any current changes
    git add .
    git commit -m "Finalized wickerbeast dataset"
    
    # Create a new branch from main
    git checkout main
    git checkout -b new_concept
    
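  4. Push the new branch to Hugging Face so it is backed up remotely:

    # Publish the branch and set its upstream in one step
    git push -u origin new_concept
    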

With Git set up, let’s continue with downloading some wickerbeast data. For this, we’ll make good use of the furry booru e621.net. There are several ways to download data from this site with the metadata intact; we’ll cover both command-line and GUI options.

Gallery-DL is a command-line program to download image galleries and collections from several image hosting sites, including e621.net. It’s perfect for users who prefer working with the terminal or need to automate the image collection process.

On Windows, you can install Gallery-DL using pip:

pip install gallery-dl

Or download the latest Windows executable from the releases page.

On Linux, install Gallery-DL using pip:

pip install gallery-dl

On Arch Linux, Gallery-DL is available in the AUR:

# Using an AUR helper like yay
yay -S gallery-dl

# Or manually
git clone https://aur.archlinux.org/gallery-dl.git
cd gallery-dl
makepkg -si

For Debian/Ubuntu, you can also use pip or install from source:

# Using pip in a virtual environment (recommended)
python -m venv galleryenv
source galleryenv/bin/activate
pip install gallery-dl

On macOS, install Gallery-DL using pip:

pip3 install gallery-dl

If you have Homebrew installed, you can also use it:

brew install gallery-dl

To download wickerbeast images from e621.net, use the following command:

# Download the 40 highest-scoring solo wickerbeast images
gallery-dl "https://e621.net/posts?tags=wickerbeast+solo+-duo+-group+-comic+-meme+-animated+order:score" --range "1-40"

For images with multiple characters:

# Download the 10 highest-scoring wickerbeast images with multiple characters
gallery-dl "https://e621.net/posts?tags=wickerbeast+-solo+-comic+-meme+-animated+order:score" --range "1-10"
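
gallery-dl can also write each post’s metadata (tags, rating, score, and so on) to a JSON file next to each image, which keeps the e621 tag data available for captioning later:

# Same query as above, plus a .json metadata sidecar per post
gallery-dl --write-metadata "https://e621.net/posts?tags=wickerbeast+solo+-duo+-group+-comic+-meme+-animated+order:score" --range "1-40"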

Basic Dataset Configuration for sd-scripts

Before diving into advanced TOML configurations, it’s helpful to understand how to set up datasets using basic command-line arguments with sd-scripts. This approach is simpler but offers less flexibility than the TOML method.

Simple Directory Structure

For basic training with sd-scripts, you can organize your dataset like this:

training_dir/
├── train_data/               # Main training images directory
│   ├── image1.png
│   ├── image1.txt            # Caption file with same name as image
│   ├── image2.png
│   └── image2.txt
└── reg_data/                 # Optional regularization images
    ├── reg1.png
    ├── reg1.txt
    ├── reg2.png
    └── reg2.txt
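
On Linux or macOS you can scaffold this layout in one line (PowerShell’s mkdir creates parent directories by default, so drop the -p flag on Windows):

# Create the training and regularization image directories
mkdir -p train_data reg_data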

Command-Line Training Options

Here are the basic command-line arguments for dataset configuration in sd-scripts:

DreamBooth Style Training

For training a concept or character using the DreamBooth approach:

# Note: train_data_dir holds your concept images and reg_data_dir optional
# regularization images; caption_extension sets the caption file extension;
# network_module, network_dim, and network_alpha select and size the LoRA.
accelerate launch train_network.py \
  --pretrained_model_name_or_path="./models/stable-diffusion-v1-5" \
  --train_data_dir="./train_data" \
  --reg_data_dir="./reg_data" \
  --output_dir="./output/my_lora" \
  --caption_extension=".txt" \
  --resolution=512 \
  --train_batch_size=1 \
  --max_train_epochs=10 \
  --learning_rate=1e-4 \
  --network_module="networks.lora" \
  --network_dim=32 \
  --network_alpha=16

Additional Dataset Options

You can further customize your dataset with these arguments:

--shuffle_caption                # Shuffle caption tags
--keep_tokens=1                  # Keep first N tokens unshuffled
--color_aug                      # Enable color augmentation
--flip_aug                       # Enable flip augmentation
--random_crop                    # Enable random cropping
--dataset_repeats=10             # Repeat dataset N times
--enable_bucket                  # Enable resolution bucketing
--min_bucket_reso=256            # Min bucket resolution
--max_bucket_reso=1024           # Max bucket resolution
--bucket_reso_steps=64           # Bucket resolution steps
--bucket_no_upscale              # Don't upscale small images

Caption Processing Options

sd-scripts offers several options for handling captions during training:

--caption_dropout_rate=0.1            # Probability of dropping caption
--caption_tag_dropout_rate=0.1        # Probability of dropping tags
--caption_dropout_every_n_epochs=2    # Drop captions every N epochs
--caption_extension=".txt"            # Caption file extension
--shuffle_caption                     # Shuffle caption tags

For a detailed explanation of how captions are processed in sd-scripts, see the How do the Captions Get Processed guide.

Supported File Types

By default, sd-scripts supports these image formats:

  • PNG (.png)
  • JPEG (.jpg, .jpeg)
  • Bitmap (.bmp)
  • WebP (.webp)
  • JXL (.jxl) [if plugin is installed]
  • AVIF (.avif) [if plugin is installed]

Each image should have a corresponding caption file with the same name but with the caption extension (default: .txt).
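
If some images are missing caption files, you can stub them out quickly on Linux/macOS (a sketch; adjust the glob if your images aren’t PNGs):

# Create an empty .txt caption for any .png that lacks one
for f in train_data/*.png; do
  [ -e "${f%.png}.txt" ] || touch "${f%.png}.txt"
done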

Working with Regularization Images

Regularization images help prevent overfitting and language drift in DreamBooth-style training. They’re examples of the base class or generic concept that your specific subject belongs to.

For a comprehensive guide on regularization images, including what they are, why they’re essential, and how to create and use them effectively, see the Regularization Images guide.

A basic example setup:

  1. Place regularization images in a separate directory
  2. Specify this directory with --reg_data_dir
  3. Caption files for regularization images should contain generic class concepts

For example, if training a “wickerbeast” concept:

  • Training image captions: “wickerbeast, red fur, standing”
  • Regularization image captions: “animal, standing”

Example Training Commands

Training a Character/Concept

accelerate launch train_network.py \
  --pretrained_model_name_or_path="./models/stable-diffusion-v1-5" \
  --train_data_dir="./wickerbeast_images" \
  --reg_data_dir="./animal_images" \
  --caption_extension=".txt" \
  --resolution=512 \
  --train_batch_size=1 \
  --max_train_epochs=15 \
  --shuffle_caption \
  --keep_tokens=1 \
  --network_module="networks.lora" \
  --network_dim=32 \
  --network_alpha=16 \
  --output_dir="./output/wickerbeast_lora" \
  --save_model_as="safetensors"

Training an Art Style

accelerate launch train_network.py \
  --pretrained_model_name_or_path="./models/stable-diffusion-v1-5" \
  --train_data_dir="./impressionist_style" \
  --caption_extension=".txt" \
  --resolution=768 \
  --train_batch_size=2 \
  --max_train_epochs=5 \
  --shuffle_caption \
  --flip_aug \
  --color_aug \
  --network_module="networks.lora" \
  --network_dim=16 \
  --network_alpha=8 \
  --output_dir="./output/impressionist_style" \
  --save_model_as="safetensors"

Limitations of Basic Configuration

While the command-line approach is simpler, it has limitations compared to the TOML method:

  1. Can’t mix multiple datasets with different resolutions
  2. Can’t set different parameters for different image directories
  3. Limited control over how captions are processed
  4. Can’t use advanced features like tag grouping or wildcard notation

For more complex training needs, consider using the TOML configuration described in the next section.

Advanced Dataset Configuration for sd-scripts

While sd-scripts supports basic dataset setups using the command-line parameters, more advanced configuration is possible through the use of TOML configuration files. This section covers how to organize your dataset directories and create TOML config files that fully leverage sd-scripts’ capabilities.

Directory Structure

For training with sd-scripts, it’s recommended to organize your datasets in a structured manner. Here’s a suggested directory layout:

training_dir/
├── configs/
│   └── dataset_config.toml   # TOML configuration for your datasets
├── datasets/
│   ├── concept1/             # Images for your first concept
│   │   ├── image1.png
│   │   ├── image1.txt        # Caption file for image1.png
│   │   ├── image2.png
│   │   └── image2.txt
│   ├── concept2/             # Images for your second concept
│   │   ├── image1.png
│   │   ├── image1.txt
│   │   ├── image2.png
│   │   └── image2.txt
│   └── regularization/       # Regularization images if needed
│       ├── reg1.png
│       ├── reg1.txt
│       ├── reg2.png
│       └── reg2.txt
└── output/                   # Directory for trained models
    └── my_lora/              # Each training run gets its own folder

TOML Configuration File

The TOML configuration file allows you to specify multiple datasets with different settings, providing much more flexibility than command-line arguments. Here’s a comprehensive example of a dataset_config.toml file:

[general]
# General settings that apply to all datasets
shuffle_caption = true
caption_extension = ".txt"
keep_tokens = 1
enable_bucket = true
bucket_reso_steps = 64
bucket_no_upscale = true

# First dataset (DreamBooth style)
[[datasets]]
resolution = 512             # Training resolution
batch_size = 4               # Batch size for this dataset
keep_tokens = 2              # Override general keep_tokens setting

  # Main concept subset
  [[datasets.subsets]]
  image_dir = "datasets/concept1"
  class_tokens = "wickerbeast furry"  # Class tokens for concept
  num_repeats = 10                    # Repeat each image 10 times

  # Regularization subset
  [[datasets.subsets]]
  is_reg = true              # This is a regularization dataset
  image_dir = "datasets/regularization"
  class_tokens = "animal"
  keep_tokens = 1            # Override parent dataset's keep_tokens

# Second dataset (at different resolution)
[[datasets]]
resolution = [768, 768]      # Different resolution as [width, height]
batch_size = 2               # Different batch size for this dataset
flip_aug = true              # Enable flip augmentation for this dataset

  [[datasets.subsets]]
  image_dir = "datasets/concept2"
  caption_prefix = "masterpiece, best quality, "  # Add prefix to all captions
  caption_suffix = ", digital art"                # Add suffix to all captions
  num_repeats = 8

Understanding TOML Configuration Options

The configuration file has three levels of settings:

  1. [general] - Global settings applied to all datasets
  2. [[datasets]] - Settings for each dataset
  3. [[datasets.subsets]] - Settings for individual subsets within a dataset

Global and Dataset Options

| Option | Description | Default | Example |
|--------|-------------|---------|---------|
| resolution | Training resolution (single value or [width, height]) | 512 | 512 or [768, 768] |
| batch_size | Batch size for training | 1 | 4 |
| enable_bucket | Enable resolution bucketing | false | true |
| bucket_no_upscale | Don't upscale images smaller than target resolution | false | true |
| bucket_reso_steps | Resolution steps for bucketing | 64 | 64 |
| min_bucket_reso | Minimum resolution for bucketing | 256 | 256 |
| max_bucket_reso | Maximum resolution for bucketing | 1024 | 1024 |

Subset Options

| Option | Description | Default | Example |
|--------|-------------|---------|---------|
| image_dir | Directory containing training images | Required | "datasets/concept1" |
| class_tokens | Class tokens for DreamBooth training | | "wickerbeast furry" |
| is_reg | Whether subset is for regularization | false | true |
| num_repeats | Number of times to repeat each image | 1 | 10 |
| keep_tokens | Number of tokens to keep (not shuffle) at caption start | 0 | 2 |
| caption_extension | File extension for caption files | ".txt" | ".txt" |
| shuffle_caption | Whether to shuffle caption tags | false | true |
| flip_aug | Enable horizontal flip augmentation | false | true |
| color_aug | Enable color augmentation | false | true |
| random_crop | Enable random cropping | false | true |
| caption_prefix | Prefix to add to all captions | | "masterpiece, best quality, " |
| caption_suffix | Suffix to add to all captions | | ", digital art" |
| alpha_mask | Use transparency as mask for calculating loss | false | true |
| caption_dropout_rate | Caption dropout probability | 0.0 | 0.1 |
| caption_tag_dropout_rate | Caption tag dropout probability | 0.0 | 0.05 |
| keep_tokens_separator | Separator for keeping specific token sections | | "\|\|\|" |
| secondary_separator | Secondary separator for grouping tags | | ";;;" |
| enable_wildcard | Enable wildcard notation in captions | false | true |

Advanced Caption Features

The TOML configuration enables several advanced caption features:

1. Wildcard Notation

With enable_wildcard = true, you can use wildcard notation in captions:

1girl, hatsune miku, {simple|white|plain} background

This randomly selects “simple”, “white”, or “plain” when training.

2. Keep Tokens Separator

With keep_tokens_separator = "|||":

1girl, hatsune miku ||| best quality, digital art ||| masterpiece, highly detailed

The first section (“1girl, hatsune miku”) and the third section (“masterpiece, highly detailed”) are kept fixed and are never shuffled or dropped. Only the middle section (“best quality, digital art”) can be shuffled or dropped, depending on your other settings.

3. Secondary Separator for Tag Grouping

With secondary_separator = ";;;":

1girl, hatsune miku, sky;;;clouds;;;sunset

The “sky;;;clouds;;;sunset” part is treated as a single group, maintaining their relationship.
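
Putting these together, a subset that enables all three features might look like the following sketch (the separator strings are arbitrary, as long as they match what you use in your caption files):

[[datasets.subsets]]
image_dir = "datasets/concept1"
enable_wildcard = true
keep_tokens_separator = "|||"
secondary_separator = ";;;"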

Using the TOML Configuration

To use your TOML configuration with sd-scripts, add the --dataset_config parameter to your training command:

accelerate launch --num_cpu_threads_per_process=2 "./train_network.py" \
  --pretrained_model_name_or_path="./models/stable-diffusion-v1-5" \
  --network_module="networks.lora" \
  --network_dim=32 \
  --network_alpha=16 \
  --train_batch_size=1 \
  --learning_rate=1e-4 \
  --max_train_epochs=5 \
  --output_dir="./output/my_lora" \
  --save_model_as="safetensors" \
  --dataset_config="./configs/dataset_config.toml" \
  --mixed_precision="fp16"

Special Dataset Features

Masked Loss Training

sd-scripts supports masked loss training, which lets you focus on specific parts of the image during training. This is useful for character training where you want to ignore backgrounds.

There are two ways to implement masked loss:

  1. Using transparency - Set alpha_mask = true in your TOML configuration:

    [[datasets.subsets]]
    image_dir = "datasets/masked_concept"
    alpha_mask = true  # Use transparency as mask
    
  2. Using separate mask images - Provide a directory with mask images:

    [[datasets.subsets]]
    image_dir = "datasets/concept"
    conditioning_data_dir = "datasets/concept_masks"  # Directory containing masks
    

Mask images should be the same size as training images, with white areas indicating parts to train and black areas to ignore.
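
If your subject is already cut out on a transparent background, one way to derive a standalone mask is to extract the alpha channel, for example with ImageMagick 7 (assuming it is installed):

# Extract the alpha channel as a white-on-black mask image
magick datasets/concept/image1.png -alpha extract datasets/concept_masks/image1.png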

By leveraging TOML configuration files, you can create sophisticated dataset setups that maximize the effectiveness of your training with sd-scripts.

Next Steps

Once you have prepared your dataset with quality images, you should:

  1. Consider whether you need Regularization Images to prevent overfitting and language drift in your model

  2. Automatically tag and caption your images to make them more useful for training. Continue to the Auto-Tagging and Captioning guide to learn how to use AI to analyze, tag, and caption your image content.