Tag Normalization

Tag normalization is a critical step in preparing datasets for AI model training. It ensures consistency, reduces redundancy, and improves the quality of your training data. This guide explores how to effectively normalize tags using tools like e6db and provides best practices for maintaining clean, standardized datasets.

Why Normalize Tags?

When collecting data from various sources like e621, Danbooru, or other image boards, tags often suffer from several issues:

  • Inconsistent formatting: Tags may use spaces, underscores, or different word separators
  • Redundant tags: Some tags directly imply others (e.g., “canine” is implied by “wolf”)
  • Tag conflicts: Different sources may use different terms for the same concept
  • Irrelevant tags: Datasets may include unnecessary meta tags (like “high_resolution” or “pooled”)
  • Format variations: Same concept represented with slight variations (“blue_eyes” vs “blue eyes”)

Normalizing tags addresses these issues, resulting in cleaner, more efficient training data.
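The cleanup steps above can be sketched in a few lines. This is a hypothetical helper for illustration only, not part of e6db: it unifies separators, resolves a small assumed alias map, and drops exact duplicates while preserving order.

```python
# Minimal normalization sketch (hypothetical, not e6db's implementation):
# unify case and separators, resolve simple aliases, drop duplicates.
ALIASES = {"blue eyes": "blue_eyes"}  # assumed example mapping

def normalize(tags):
    seen, out = set(), []
    for tag in tags:
        tag = tag.strip().lower()
        # Resolve aliases first, then standardize on underscores.
        tag = ALIASES.get(tag, tag).replace(" ", "_")
        if tag not in seen:
            seen.add(tag)
            out.append(tag)
    return out

print(normalize(["Blue Eyes", "blue_eyes", "wolf"]))  # ['blue_eyes', 'wolf']
```

Real tools like e6db go much further, using a full tag database rather than a hand-written alias map.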

Using e6db for Tag Normalization

The e6db tool provides a powerful solution for tag normalization, particularly for furry datasets that use tags from e621.

Installation

First, clone the repository:

git clone https://huggingface.co/datasets/Gaeros/e6db

Basic Usage

Run the normalization script on your dataset directory:

python ./normalize_tags.py /training_dir

For separate input and output directories:

python ./normalize_tags.py /input_dir /output_dir

Key Features

The e6db normalization tool provides several powerful features:

  1. Tag standardization: Converts all tags to a consistent format (spaces vs underscores)
  2. Implication filtering: Removes tags that are directly implied by others
  3. Category-based filtering: Can remove tags from specific categories (e.g., “meta” or “pool”)
  4. Alias resolution: Resolves alternative names to canonical tags
  5. Custom blacklisting: Allows you to specify tags to be removed
  6. Statistical reporting: Provides stats on tag frequency and distribution
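Implication filtering (feature 2) is worth a closer look. A hedged sketch, assuming a mapping from each tag to the tags it implies (the mapping below is invented example data, not e6db's database):

```python
# Sketch of implication filtering: drop any tag that another tag in the
# set already implies. IMPLICATIONS is assumed example data.
IMPLICATIONS = {"wolf": {"canine", "canid", "mammal"}}

def drop_implied(tags):
    implied = set()
    for tag in tags:
        implied |= IMPLICATIONS.get(tag, set())
    return [t for t in tags if t not in implied]

print(drop_implied(["wolf", "canine", "smiling"]))  # ['wolf', 'smiling']
```

Since “wolf” implies “canine”, the redundant tag is removed while unrelated tags pass through untouched.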

Advanced Configuration

To further customize the normalization process, create a normalize.toml configuration file. This allows for detailed control over how tags are processed.

Sample Configuration

Here’s a basic configuration file:

# Tag handling settings
use_underscores = true  # Use underscores instead of spaces in tags
keep_underscores = [
  "in_heat",
  "rating:safe"
]  # Tags that should keep underscores regardless

# Implication settings
keep_implied = false  # Whether to keep tags implied by others
min_antecedent_freq = 100  # Minimum frequency for implication filtering

# Blacklisting options
blacklist = [
  "hi_res",
  "high_resolution",
  "high_quality"
]  # Tags to always remove

blacklist_categories = [
  "meta",
  "pool"
]  # Categories to remove entirely

# Alias handling
on_alias_conflict = "warn"  # How to handle conflicts (silent, warn, raise, overwrite)

# Tag transformation
artist_by_prefix = true  # Add "by_" prefix to artist tags
remove_parens = true  # Remove parenthetical suffixes (e.g., _(artist))

# Custom mappings
[aliases]
"blue eyes" = "blue_eyes"
"closed_eye" = "closed_eyes"

[renames]
"kitsune" = "fox"
"unwanted_rename" = "preferred_name"  # Force a specific rename

[aliases_overrides]
"my_important_tag" = "my_important_tag"  # Force keeping the original

Configuration Options Explained

Tag Handling

  • use_underscores: Converts spaces to underscores in output tags
  • keep_underscores: List of tags that should maintain underscores

Implication Filtering

  • keep_implied: Whether to keep tags implied by others; can also be a list of specific tags to keep even when implied
  • min_antecedent_freq: Only apply implications for tags meeting this frequency threshold
  • drop_antecedent_freq: Drop antecedent tags below this frequency
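To make the threshold behavior concrete, here is an illustrative sketch (not e6db's code) of how a setting like min_antecedent_freq might gate implication filtering; the frequency counts and implication map are invented example data:

```python
# Only tags whose dataset frequency meets the threshold trigger removal
# of the tags they imply. FREQ and IMPLIES are assumed example data.
FREQ = {"wolf": 250, "rare_species": 3}
IMPLIES = {"wolf": {"canine"}, "rare_species": {"mammal"}}

def filter_implied(tags, min_antecedent_freq):
    implied = set()
    for tag in tags:
        if FREQ.get(tag, 0) >= min_antecedent_freq:
            implied |= IMPLIES.get(tag, set())
    return [t for t in tags if t not in implied]

tags = ["wolf", "canine", "rare_species", "mammal"]
# "wolf" (freq 250) passes the threshold of 100, "rare_species" (freq 3)
# does not, so "canine" is dropped but "mammal" is kept.
print(filter_implied(tags, 100))
```

Raising the threshold therefore makes the filter more conservative: rare tags stop pruning their implied companions.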

Blacklisting

  • blacklist: Specific tags to remove
  • blacklist_regexp: Remove tags matching these regular expressions
  • blacklist_categories: Remove entire categories of tags
  • blacklist_implied: Whether to also blacklist tags implied by blacklisted tags
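A rough sketch of how exact and regexp blacklists might combine. The names mirror the config keys above, but the logic and the example patterns are assumptions for illustration:

```python
import re

# Remove exact blacklisted tags and any tag matching a blacklist regexp.
# BLACKLIST and BLACKLIST_REGEXP are assumed example values.
BLACKLIST = {"hi_res", "high_resolution"}
BLACKLIST_REGEXP = [re.compile(r".*_at_source")]

def apply_blacklist(tags):
    return [
        t for t in tags
        if t not in BLACKLIST
        and not any(rx.fullmatch(t) for rx in BLACKLIST_REGEXP)
    ]

print(apply_blacklist(["wolf", "hi_res", "cropped_at_source"]))  # ['wolf']
```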

Aliases and Renames

  • aliases: Custom tag equivalence mappings
  • aliases_overrides: Aliases that override existing mappings
  • renames: Change output tag names
  • on_alias_conflict: How to handle conflicts when creating aliases
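The conflict policies can be pictured with a small sketch. The semantics below ("silent" and "warn" keep the existing mapping, "overwrite" replaces it, "raise" aborts) are an assumption based on the option names, not e6db's verified behavior:

```python
# Hypothetical illustration of on_alias_conflict handling.
def add_alias(aliases, name, target, on_conflict="warn"):
    if name in aliases and aliases[name] != target:
        if on_conflict == "raise":
            raise ValueError(f"alias conflict for {name!r}")
        if on_conflict == "warn":
            print(f"warning: {name!r} already maps to {aliases[name]!r}")
        if on_conflict != "overwrite":
            return aliases  # "silent"/"warn": keep the existing mapping
    aliases[name] = target
    return aliases

aliases = {"blue eyes": "blue_eyes"}
add_alias(aliases, "blue eyes", "eyes_blue", on_conflict="silent")
print(aliases["blue eyes"])  # existing mapping kept: blue_eyes
```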

Analyzing Tag Distributions

The tool also provides useful statistics about your dataset:

# Show top 100 most common tags
python ./normalize_tags.py /training_dir -k 100

# Show top 50 tags for specific categories
python ./normalize_tags.py /training_dir -k 50 -s species -s character

# Show tags removed by implication
python ./normalize_tags.py /training_dir -j 100

Best Practices with Version Control

When working with tag normalization, it’s essential to track changes to understand how the process impacts your dataset.

Using Git for Tag Change Tracking

Initialize a git repository for your dataset:

cd /training_dir
git init
git add .
git commit -m "Initial dataset commit"

After running tag normalization:

git add .
git commit -m "Applied tag normalization"

Viewing Tag Changes

Git provides powerful tools to track changes to tag files. The --word-diff-regex='[^,]+' option treats each comma-separated tag as a single “word”, so added and removed tags are highlighted individually instead of as whole changed lines:

# View differences specifically for tags
git diff --word-diff-regex='[^,]+' --patience

To compare changes between the current and previous commit:

git diff HEAD^ HEAD --word-diff-regex='[^,]+' --patience

Understanding the Code

The e6db tag normalizer works by:

  1. Loading a database of known tags and their relationships
  2. Creating a tag set normalizer that applies rules to your tags
  3. Processing each file in your dataset
  4. Encoding tags to their canonical form
  5. Removing blacklisted and implied tags
  6. Decoding back to string format
  7. Writing the results
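The seven steps above can be condensed into one hypothetical function. Here encode, decode, and implied_of are simplified stand-ins for what e6db's actual classes do against their tag database:

```python
# Sketch of the normalization pipeline (simplified, not e6db's code).
def normalize_caption(text, encode, decode, blacklist, implied_of):
    tags = [t.strip() for t in text.split(",") if t.strip()]  # parse caption
    ids = [encode(t) for t in tags]                           # canonical form
    implied = set()
    for i in ids:                                             # collect implied
        implied |= implied_of(i)
    kept = [i for i in ids if i not in blacklist and i not in implied]
    return ", ".join(decode(i) for i in kept)                 # back to a string

# Toy demo: lowercasing as "encoding", a one-entry implication map.
IMPLIED = {"wolf": {"canine"}}
print(normalize_caption("Wolf, canine, hi_res", str.lower, str,
                        {"hi_res"}, lambda t: IMPLIED.get(t, set())))  # wolf
```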

The core functionality is in the TagSetNormalizer class, which handles:

  • Encoding tags to standardized IDs
  • Tracking tag implications
  • Managing tag categories
  • Applying custom rules and transformations

Common Issues and Solutions

Problem: Too Many Tags Removed

If normalization removes too many useful tags:

# Use a custom config to adjust the implication threshold
python ./normalize_tags.py /training_dir --config custom_config.toml

In your config:

min_antecedent_freq = 500  # Higher value means fewer implications
keep_implied = ["important_tag1", "important_tag2"]  # Keep these tags even if implied

Problem: Desired Tags Getting Renamed

If important tags are being renamed undesirably:

# In your configuration
[aliases_overrides]
"my_important_tag" = "my_important_tag"  # Force keeping the original

[renames]
"unwanted_rename" = "preferred_name"  # Force a specific rename

Conclusion

Tag normalization is an essential step in creating high-quality training datasets. With tools like e6db, you can efficiently clean and standardize your tags, ensuring your AI models receive consistent, meaningful data during training.

By investing time in proper tag normalization, you’ll improve the effectiveness of your training process and ultimately achieve better results in your generated images.

Next Steps

After normalizing your tags for consistency, the next step is to set up your environment for training. Continue to the Installation and Setup guide to learn how to install and configure the necessary tools for LoRA training.