Dataset Preparation


Before you begin collecting your dataset, decide what you want to teach the model: a character, a style, or a new concept.

For now let’s imagine you want to teach your model wickerbeasts so you can generate your VRChat avatar every night.

Create the training_dir Directory

Before starting, we need a directory where we’ll organize our datasets. We will also be using Git and Hugging Face to version control our smut.

On Windows, open a terminal by pressing Win + R and typing pwsh, or open PowerShell from the Start menu.

# Create a directory for your training data
mkdir C:\training_dir
cd C:\training_dir

Make sure to download and install Git for Windows if you haven’t already.

On Linux, open your terminal with Ctrl + Alt + T or through your application menu.

# Create a directory for your training data
mkdir -p ~/training_dir
cd ~/training_dir

Install Git if you haven’t already:

# For Debian/Ubuntu
sudo apt-get update
sudo apt-get install git

# For Arch Linux
sudo pacman -S git

# For Fedora
sudo dnf install git

On macOS, open Terminal from the Applications folder or by searching in Spotlight.

# Create a directory for your training data
mkdir -p ~/training_dir
cd ~/training_dir

If you don’t have Git installed, you can install it using Homebrew:

# Install Homebrew if needed
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install Git
brew install git
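
Whichever platform you are on, it is worth confirming that Git is reachable from your terminal before going further. The snippet below is a small sketch for bash or zsh; the `git_status` variable is just an illustrative name. In PowerShell, running `git --version` on its own tells you the same thing.

```shell
# Report whether Git is on the PATH, and which version it is.
if command -v git >/dev/null 2>&1; then
    git_status="installed: $(git --version)"
else
    git_status="missing: install Git using the steps for your platform"
fi
echo "$git_status"
```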

Understanding Git for Dataset Management

Git is a powerful version control system that helps you track changes to files over time. For dataset management, Git offers several benefits:

  • Version tracking: Keep track of changes to your dataset
  • Branching: Create separate branches for different concepts or characters
  • Collaboration: Share your dataset with others and merge their contributions
  • Backup: Store your dataset on remote repositories like Hugging Face

Key Git Concepts

  • Repository: A storage location for your project containing all files and their revision history
  • Commit: A snapshot of your files at a specific point in time
  • Branch: A parallel version of your repository that allows you to work on different features independently
  • Clone: Creating a local copy of a remote repository
  • Push/Pull: Sending/retrieving changes to/from a remote repository
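
To make the first few concepts concrete, here is a minimal sketch that creates a repository, a commit, and a branch in a throwaway directory, so nothing touches your real dataset. The identity values and the `sample.txt` file are placeholders made up for the demo.

```shell
# Work in a temporary directory created just for this demo.
demo_dir="$(mktemp -d)"
cd "$demo_dir"

# Repository: initialise an empty one.
git init -q

# Identity for this demo only (placeholder values, stored in this repo's config).
git config user.name "Demo User"
git config user.email "demo@example.com"

# Commit: snapshot one placeholder file.
echo "placeholder caption" > sample.txt
git add sample.txt
git commit -q -m "Add sample file"

# Branch: start a parallel line of work from that commit.
git checkout -q -b wickerbeast

commit_count="$(git rev-list --count HEAD)"
current_branch="$(git rev-parse --abbrev-ref HEAD)"
echo "commits: $commit_count, branch: $current_branch"
```

Clone and push/pull need a remote repository, so they are left out of this local-only demo.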

Getting Started with Git

Set up your Git identity:

git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"

Setting Up Your Dataset Repository

Hugging Face provides an excellent platform for hosting datasets. Follow their getting started guide to create a new dataset repository.

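Once you have your dataset repository on Hugging Face, you can clone it to your local machine. The commands below use SSH, so your Hugging Face account needs an SSH key registered. Git will only clone into a directory that is empty or does not exist yet.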

# Replace 'user' with your Hugging Face username

# Windows (PowerShell)
git clone git@hf.co:/datasets/user/training_dir C:\training_dir
cd C:\training_dir

# Linux and macOS
git clone git@hf.co:/datasets/user/training_dir ~/training_dir
cd ~/training_dir

Creating Branches for Different Concepts

For each concept you want to train (like wickerbeast in our example), create a separate branch:

# Create and switch to a new branch
git branch wickerbeast
git checkout wickerbeast

# Or use this shorthand to create and switch at once
# git checkout -b wickerbeast

Common Git Workflow for Dataset Management

  1. After adding or modifying files in your dataset:

    # See what files have changed
    git status
    
    # Add all changes to staging
    git add .
    
    # Commit your changes with a descriptive message
    git commit -m "Added 50 wickerbeast images"
    
  2. Push your changes to Hugging Face:

    git push origin wickerbeast
    
  3. If you want to create a new branch for a different concept:

    # First commit any current changes
    git add .
    git commit -m "Finalized wickerbeast dataset"
    
    # Create a new branch from main
    git checkout main
    git checkout -b new_concept
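
If you would like to rehearse steps 1 and 2 before pushing anything to Hugging Face, the sketch below runs them against a local bare repository that stands in for the remote. Everything here is an assumption made for the demo: the temporary paths, the placeholder file, and the identity values. It bash/zsh only.

```shell
work="$(mktemp -d)"

# A bare repository plays the role of the Hugging Face remote (demo assumption).
git init -q --bare "$work/remote.git"

# Clone it, as you would clone your dataset repository.
git clone -q "$work/remote.git" "$work/training_dir"
cd "$work/training_dir"
git config user.name "Demo User"
git config user.email "demo@example.com"

# Step 1: add a file on the concept branch, then stage and commit it.
git checkout -q -b wickerbeast
echo "placeholder" > image_001.txt
git status --short
git add .
git commit -q -m "Added placeholder wickerbeast file"

# Step 2: push the branch to the remote.
git push -q origin wickerbeast

# Confirm the remote now lists the branch.
remote_branches="$(git ls-remote --heads origin)"
echo "$remote_branches"
```

Git may warn that you cloned an empty repository; that is expected here, since the stand-in remote starts with no commits.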
    

With Git set up, let’s continue with downloading some wickerbeast data. For this we’ll make good use of the furry booru e621.net. There are several ways to download data from this site with the metadata intact. We’ll cover both command-line and GUI options.

Gallery-DL is a command-line program to download image galleries and collections from several image hosting sites, including e621.net. It’s perfect for users who prefer working with the terminal or need to automate the image collection process.

On Windows, you can install Gallery-DL using pip:

pip install gallery-dl

Or download the latest Windows executable from the releases page.

On Linux, install Gallery-DL using pip:

pip install gallery-dl

On Arch Linux, Gallery-DL is available in the AUR:

# Using an AUR helper like yay
yay -S gallery-dl

# Or manually
git clone https://aur.archlinux.org/gallery-dl.git
cd gallery-dl
makepkg -si

For Debian/Ubuntu, use pip inside a virtual environment:

# Using pip in a virtual environment (recommended)
python3 -m venv galleryenv
source galleryenv/bin/activate
pip install gallery-dl

On macOS, install Gallery-DL using pip:

pip3 install gallery-dl

If you have Homebrew installed, you can also use it:

brew install gallery-dl

To download wickerbeast images from e621.net, use the following command:

# Download 40 wickerbeast images
gallery-dl "https://e621.net/posts?tags=wickerbeast+solo+-duo+-group+-comic+-meme+-animated+order:score" --range 1-40
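
The search URL is just a list of tags joined with `+`, where a leading `-` excludes a tag and `order:score` sorts by score. If you expect to try many tag combinations, a small helper can assemble the URL for you. This is a sketch: `e621_url` is a hypothetical function name, not part of Gallery-DL.

```shell
# Build an e621 search URL from a list of tags.
# The function body runs in a subshell so the IFS change stays local.
e621_url() (
    IFS="+"
    # "$*" joins all arguments using the first character of IFS.
    printf 'https://e621.net/posts?tags=%s\n' "$*"
)

solo_url="$(e621_url wickerbeast solo -duo -group -comic -meme -animated order:score)"
echo "$solo_url"
```

You can then pass the result straight to the downloader, for example `gallery-dl "$solo_url"`.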

For images with multiple characters:

# Download 10 wickerbeast images with multiple characters
gallery-dl "https://e621.net/posts?tags=wickerbeast+-solo+-comic+-meme+-animated+order:score" --range 1-10
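
After a download finishes, it helps to check how many images actually landed on disk before committing them. The helper below is a rough sketch: `count_images` is a made-up name, the extension list is an assumption about what you want to count, and the demo runs on placeholder files rather than real downloads.

```shell
# Count image files under a directory, matching extensions case-insensitively.
count_images() {
    find "$1" -type f \( -iname '*.png' -o -iname '*.jpg' -o -iname '*.jpeg' \
        -o -iname '*.webp' -o -iname '*.gif' \) | wc -l | tr -d ' '
}

# Demo with placeholder files in a temporary directory.
demo_dir="$(mktemp -d)"
touch "$demo_dir/a.png" "$demo_dir/b.jpg" "$demo_dir/notes.txt"
image_count="$(count_images "$demo_dir")"
echo "images found: $image_count"
```

Point the function at whichever directory your downloads were saved to; check the paths Gallery-DL prints while downloading if you are unsure where that is.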

Next Steps

Once you have prepared your dataset with quality images, the next step is to automatically tag and caption these images to make them more useful for training. Continue to the Auto-Tagging and Captioning guide to learn how to use AI to analyze, tag, and caption your image content.