Tag Normalization
Tag normalization is a critical step in preparing datasets for AI model training. It ensures consistency, reduces redundancy, and improves the quality of your training data. This guide explores how to effectively normalize tags using tools like e6db and provides best practices for maintaining clean, standardized datasets.
Why Normalize Tags?
When collecting data from various sources like e621, Danbooru, or other image boards, tags often suffer from several issues:
- Inconsistent formatting: Tags may use spaces, underscores, or different word separators
- Redundant tags: Some tags directly imply others (e.g., “canine” is implied by “wolf”)
- Tag conflicts: Different sources may use different terms for the same concept
- Irrelevant tags: A dataset may include meta tags that carry no useful signal for training (like “high_resolution” or “pooled”)
- Format variations: Same concept represented with slight variations (“blue_eyes” vs “blue eyes”)
Normalizing tags addresses these issues, resulting in cleaner, more efficient training data.
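The formatting issues above can be illustrated with a small Python sketch. It is not part of e6db: the function names are hypothetical, and lowercasing and duplicate removal are assumptions made for the example. It also assumes captions are comma-separated tag lists.

```python
def normalize_format(tag: str, use_underscores: bool = True) -> str:
    """Lowercase a tag, trim it, and unify its word separator."""
    # Treat underscores and any run of whitespace as the same separator.
    words = tag.strip().lower().replace("_", " ").split()
    sep = "_" if use_underscores else " "
    return sep.join(words)


def normalize_caption(caption: str, use_underscores: bool = True) -> str:
    """Normalize every tag in a comma-separated caption, dropping duplicates."""
    seen: dict[str, None] = {}  # a dict keeps first-seen order
    for raw in caption.split(","):
        tag = normalize_format(raw, use_underscores)
        if tag:  # skip empty entries such as a trailing comma
            seen.setdefault(tag, None)
    return ", ".join(seen)


print(normalize_caption("Blue Eyes, blue_eyes,  wolf , canine"))
# prints: blue_eyes, wolf, canine
```

Note that this only unifies formatting. Recognizing that “canine” is redundant next to “wolf” requires knowledge of tag implications, which is what a tag database adds.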
Using e6db for Tag Normalization
The e6db tool provides a powerful solution for tag normalization, particularly for furry datasets that use tags from e621.
Installation
First, clone the repository:
git clone https://huggingface.co/datasets/Gaeros/e6db
Basic Usage
From inside the cloned e6db directory, run the normalization script on your dataset directory:
python ./normalize_tags.py /training_dir
For separate input and output directories:
python ./normalize_tags.py /input_dir /output_dir
Key Features
The e6db normalization tool provides several powerful features:
- Tag standardization: Converts all tags to one consistent word separator (either spaces or underscores)
- Implication filtering: Removes tags that are directly implied by others
- Category-based filtering: Can remove tags from specific categories (e.g., “meta” or “pool”)
- Alias resolution: Resolves alternative names to canonical tags
- Custom blacklisting: Allows you to specify tags to be removed
- Statistical reporting: Provides stats on tag frequency and distribution
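To make implication filtering concrete, here is a minimal Python sketch of the idea. The implication table and function names are hypothetical, and following chains transitively (wolf implies canine, canine implies mammal) is an assumption of this example rather than a statement about how e6db behaves.

```python
# Hypothetical implication table: tag -> tags it directly implies.
IMPLICATIONS = {
    "wolf": {"canine"},
    "canine": {"mammal"},
}


def implied_closure(tag, implications):
    """Return every tag implied by `tag`, following chains transitively."""
    result = set()
    stack = [tag]
    while stack:
        current = stack.pop()
        for implied in implications.get(current, ()):
            if implied not in result:
                result.add(implied)
                stack.append(implied)
    return result


def drop_implied(tags, implications):
    """Remove tags that are implied by another tag in the same list."""
    redundant = set()
    for tag in tags:
        redundant |= implied_closure(tag, implications)
    return [t for t in tags if t not in redundant]


print(drop_implied(["wolf", "canine", "mammal", "blue_eyes"], IMPLICATIONS))
# prints: ['wolf', 'blue_eyes']
```

A tag is only dropped when something more specific is present: a caption containing “canine” but not “wolf” keeps “canine”.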
Advanced Configuration
To further customize the normalization process, create a normalize.toml configuration file. This allows for detailed control over how tags are processed.
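Hand-written TOML is easy to get subtly wrong, for example by declaring the same table twice. One way to catch this before running the tool is to parse the file with Python's standard tomllib module (Python 3.11 or later). The sketch below is independent of e6db: check_config is a hypothetical helper, the embedded sample is abbreviated, and the checks reflect only the option names and values described in this guide.

```python
import tomllib

# Abbreviated example configuration, embedded so the sketch is self-contained.
SAMPLE = '''
use_underscores = true
keep_implied = false
blacklist = ["hi_res", "high_resolution"]
blacklist_categories = ["meta", "pool"]
on_alias_conflict = "warn"

[aliases]
"closed_eye" = "closed_eyes"

[renames]
"kitsune" = "fox"
'''

ALLOWED_CONFLICT_MODES = {"silent", "warn", "raise", "overwrite"}


def check_config(text):
    """Parse TOML text and run a few basic sanity checks on it."""
    # tomllib raises TOMLDecodeError on syntax errors and duplicate tables.
    config = tomllib.loads(text)
    mode = config.get("on_alias_conflict")
    if mode is not None and mode not in ALLOWED_CONFLICT_MODES:
        raise ValueError(f"unknown on_alias_conflict value: {mode!r}")
    for key in ("blacklist", "blacklist_categories"):
        if not isinstance(config.get(key, []), list):
            raise TypeError(f"{key} must be a list")
    return config


config = check_config(SAMPLE)
print(config["renames"])
# prints: {'kitsune': 'fox'}
```

To check a file on disk instead, open it in binary mode and call tomllib.load on the file object.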
Sample Configuration
Here’s a basic configuration file:
# Tag handling settings
use_underscores = true # Use underscores instead of spaces in tags
keep_underscores = [
"in_heat",
"rating:safe"
] # Tags that should keep underscores regardless
# Implication settings
keep_implied = false # Whether to keep tags implied by others
min_antecedent_freq = 100 # Minimum frequency for implication filtering
# Blacklisting options
blacklist = [
"hi_res",
"high_resolution",
"high_quality"
] # Tags to always remove
blacklist_categories = [
"meta",
"pool"
] # Categories to remove entirely
# Alias handling
on_alias_conflict = "warn" # How to handle conflicts (silent, warn, raise, overwrite)
# Tag transformation
artist_by_prefix = true # Add "by_" prefix to artist tags
remove_parens = true # Remove parenthetical suffixes (e.g., _(artist))
# Custom mappings
[aliases]
"blue eyes" = "blue_eyes"
"closed_eye" = "closed_eyes"
[aliases_overrides]
"my_important_tag" = "my_important_tag" # Force keeping the original
[renames]
"kitsune" = "fox"
"unwanted_rename" = "preferred_name" # Force a specific rename
Configuration Options Explained
Tag Handling
- use_underscores: Converts spaces to underscores in output tags
- keep_underscores: List of tags that should maintain underscores
Implication Filtering
- keep_implied: Whether to keep tags implied by others
- min_antecedent_freq: Only apply implications for tags meeting this frequency threshold
- drop_antecedent_freq: Drop antecedent tags below this frequency
Blacklisting
- blacklist: Specific tags to remove
- blacklist_regexp: Remove tags matching these regular expressions
- blacklist_categories: Remove entire categories of tags
- blacklist_implied: Whether to also blacklist tags implied by blacklisted tags
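The three kinds of blacklist rule (exact tags, regular expressions, and categories) can be sketched in Python as follows. This is an illustration only: the category table and apply_blacklist function are hypothetical, and matching each pattern against the whole tag with fullmatch is an assumption, since the tool may match differently.

```python
import re

# Hypothetical tag -> category lookup; a real tool would read this from its database.
TAG_CATEGORY = {
    "hi_res": "meta",
    "watermark": "meta",
    "wolf": "species",
    "blue_eyes": "general",
}


def apply_blacklist(tags, blacklist=(), regexps=(), categories=()):
    """Drop tags matching an exact name, a pattern, or a banned category."""
    exact = set(blacklist)
    compiled = [re.compile(p) for p in regexps]
    banned_categories = set(categories)
    kept = []
    for tag in tags:
        if tag in exact:
            continue
        if any(p.fullmatch(tag) for p in compiled):
            continue
        # Tags missing from the lookup have no category and are kept.
        if TAG_CATEGORY.get(tag) in banned_categories:
            continue
        kept.append(tag)
    return kept


print(apply_blacklist(
    ["wolf", "hi_res", "blue_eyes", "watermark", "2023"],
    blacklist=["hi_res"],
    regexps=[r"\d{4}"],
    categories=["meta"],
))
# prints: ['wolf', 'blue_eyes']
```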
Aliases and Renames
- aliases: Custom tag equivalence mappings
- aliases_overrides: Aliases that override existing mappings
- renames: Change output tag names
- on_alias_conflict: How to handle conflicts when creating aliases
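The four on_alias_conflict modes can be pictured with the following Python sketch. The add_alias and resolve helpers are hypothetical, and the exact behavior is an assumption: here “silent” and “warn” keep the existing mapping, “overwrite” replaces it, and “raise” stops with an error. The real tool may differ in detail.

```python
import warnings

VALID_MODES = ("silent", "warn", "raise", "overwrite")


def add_alias(aliases, alias, target, on_conflict="warn"):
    """Register alias -> target, handling a clash with an existing mapping."""
    if on_conflict not in VALID_MODES:
        raise ValueError(f"unknown conflict mode: {on_conflict!r}")
    existing = aliases.get(alias)
    if existing is not None and existing != target:
        message = f"alias {alias!r} already maps to {existing!r}, not {target!r}"
        if on_conflict == "raise":
            raise ValueError(message)
        if on_conflict == "warn":
            warnings.warn(message)
        if on_conflict in ("silent", "warn"):
            return  # assumed behavior: keep the existing mapping
    aliases[alias] = target  # new alias, or "overwrite" mode


def resolve(tag, aliases):
    """Map a tag to its canonical name, or return it unchanged."""
    return aliases.get(tag, tag)


aliases = {"closed_eye": "closed_eyes"}
add_alias(aliases, "closed_eye", "eyes_closed", on_conflict="silent")
print(resolve("closed_eye", aliases))  # prints: closed_eyes
add_alias(aliases, "closed_eye", "eyes_closed", on_conflict="overwrite")
print(resolve("closed_eye", aliases))  # prints: eyes_closed
```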
Analyzing Tag Distributions
The tool also provides useful statistics about your dataset:
# Show top 100 most common tags
python ./normalize_tags.py /training_dir -k 100
# Show top 50 tags for specific categories
python ./normalize_tags.py /training_dir -k 50 -s species -s character
# Show tags removed by implication
python ./normalize_tags.py /training_dir -j 100
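For a quick frequency check outside the tool, a few lines of Python are enough. This sketch assumes captions are stored as comma-separated tags in .txt files, which may not match your dataset layout; count_tags is a hypothetical helper, not an e6db function.

```python
from collections import Counter
from pathlib import Path


def count_tags(directory, pattern="*.txt"):
    """Count how often each tag appears across all caption files."""
    counts = Counter()
    for path in sorted(Path(directory).rglob(pattern)):
        text = path.read_text(encoding="utf-8")
        # Split on commas and ignore empty entries.
        counts.update(t.strip() for t in text.split(",") if t.strip())
    return counts


def print_top_tags(directory, k=100):
    """Print the k most common tags with their counts."""
    for tag, count in count_tags(directory).most_common(k):
        print(f"{count:8d}  {tag}")
```

Calling print_top_tags("/training_dir", k=100) before and after normalization gives a rough view of how the tag distribution changed.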
Best Practices with Version Control
When working with tag normalization, it’s essential to track changes to understand how the process impacts your dataset.
Using Git for Tag Change Tracking
Initialize a git repository for your dataset:
cd /training_dir
git init
git add .
git commit -m "Initial dataset commit"
After running tag normalization:
git add .
git commit -m "Applied tag normalization"
Viewing Tag Changes
Git provides powerful tools to track changes to tag files:
# View differences specifically for tags
git diff --word-diff-regex='[^,]+' --patience
To compare changes between the current and previous commit:
git diff HEAD^ HEAD --word-diff-regex='[^,]+' --patience
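When a word diff is hard to scan, the same comparison can be made per caption in Python. The sketch below assumes comma-separated captions; tag_changes is a hypothetical helper that reports which tags were added and which were removed between two versions of a caption.

```python
def split_tags(caption):
    """Split a comma-separated caption into a list of trimmed tags."""
    return [t.strip() for t in caption.split(",") if t.strip()]


def tag_changes(before, after):
    """Return (added, removed) tag lists between two caption versions."""
    old = split_tags(before)
    new = split_tags(after)
    added = [t for t in new if t not in old]
    removed = [t for t in old if t not in new]
    return added, removed


added, removed = tag_changes(
    "wolf, canine, blue eyes, hi_res",
    "wolf, blue_eyes",
)
print("added:", added)      # prints: added: ['blue_eyes']
print("removed:", removed)  # prints: removed: ['canine', 'blue eyes', 'hi_res']
```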
Understanding the Code
The e6db tag normalizer works by:
- Loading a database of known tags and their relationships
- Creating a tag set normalizer that applies rules to your tags
- Processing each file in your dataset
- Encoding tags to their canonical form
- Removing blacklisted and implied tags
- Decoding back to string format
- Writing the results
The core functionality is in the TagSetNormalizer class, which handles:
- Encoding tags to standardized IDs
- Tracking tag implications
- Managing tag categories
- Applying custom rules and transformations
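The encode, filter, decode flow described above can be sketched with a toy class. This is not the TagSetNormalizer API: ToyNormalizer, its constructor arguments, and its methods are all hypothetical, and passing unknown tags through unchanged is an assumption of the example.

```python
class ToyNormalizer:
    """Toy illustration of an encode -> filter -> decode pipeline."""

    def __init__(self, vocabulary, aliases, implied_ids, blacklist):
        self.id_to_tag = list(vocabulary)
        self.tag_to_id = {tag: i for i, tag in enumerate(vocabulary)}
        self.aliases = aliases          # alternative name -> canonical tag
        self.implied_ids = implied_ids  # tag ID -> set of implied tag IDs
        self.blacklist_ids = {self.tag_to_id[t] for t in blacklist}

    def encode(self, tags):
        """Resolve aliases, then map known tags to integer IDs."""
        encoded = []
        for tag in tags:
            canonical = self.aliases.get(tag, tag)
            # Unknown tags stay as strings and pass through untouched.
            encoded.append(self.tag_to_id.get(canonical, canonical))
        return encoded

    def normalize(self, tags):
        """Encode, drop implied and blacklisted IDs, decode back to strings."""
        encoded = self.encode(tags)
        implied = set()
        for item in encoded:
            if isinstance(item, int):
                implied |= self.implied_ids.get(item, set())
        kept = [
            x for x in encoded
            if x not in implied and x not in self.blacklist_ids
        ]
        return [self.id_to_tag[x] if isinstance(x, int) else x for x in kept]


normalizer = ToyNormalizer(
    vocabulary=["wolf", "canine", "hi_res", "blue_eyes"],
    aliases={"blue eyes": "blue_eyes"},
    implied_ids={0: {1}},  # wolf (0) implies canine (1)
    blacklist=["hi_res"],
)
print(normalizer.normalize(["wolf", "canine", "blue eyes", "hi_res", "my_custom_tag"]))
# prints: ['wolf', 'blue_eyes', 'my_custom_tag']
```

Working with integer IDs rather than strings makes set operations such as implication lookups cheap, which is one plausible reason for an encode step in a tool that processes large datasets.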
Common Issues and Solutions
Problem: Too Many Tags Removed
If normalization removes too many useful tags:
# Run with a custom configuration that adjusts the implication threshold
python ./normalize_tags.py /training_dir --config custom_config.toml
In your config:
min_antecedent_freq = 500 # Higher value means fewer implications
keep_implied = ["important_tag1", "important_tag2"] # Keep these tags even if implied
Problem: Desired Tags Getting Renamed
If important tags are being renamed undesirably:
# In your configuration
[aliases_overrides]
"my_important_tag" = "my_important_tag" # Force keeping the original
[renames]
"unwanted_rename" = "preferred_name" # Force a specific rename
Conclusion
Tag normalization is an essential step in creating high-quality training datasets. With tools like e6db, you can efficiently clean and standardize your tags, ensuring your AI models receive consistent, meaningful data during training.
By investing time in proper tag normalization, you’ll improve the effectiveness of your training process and ultimately achieve better results in your generated images.
Next Steps
After normalizing your tags for consistency, the next step is to set up your environment for training. Continue to the Installation and Setup guide to learn how to install and configure the necessary tools for LoRA training.