Taggers and Captioners

Introduction

Creating high-quality AI models requires robust datasets with detailed metadata. This guide explores specialized AI tools designed to automatically analyze and annotate furry artwork, dramatically reducing the time-intensive process of manual labeling.

Automated tagging and captioning tools serve distinct but complementary functions:

  • Taggers: Identify specific elements, attributes, and features present in images (e.g., “anthro,” “canine,” “outdoor scene”)
  • Captioners: Generate natural language descriptions that capture the overall content and context of images

These tools utilize specialized vision models trained on domain-specific data to recognize furry art characteristics that general-purpose AI systems might miss. While they’re not perfect, they provide a solid foundation that can be manually refined, potentially saving dozens of hours when preparing large datasets.

Taggers

Auto taggers employ computer vision and machine learning to analyze visual content and generate relevant tags. These specialized models are trained to recognize common elements in furry artwork including:

  • Species and character traits
  • Poses and expressions
  • Clothing and accessories
  • Environmental elements
  • Art styles and techniques

Modern taggers like JTP2 (Joint Tagger Project PILOT 2) can identify thousands of distinct elements and attach a confidence score to each prediction, dramatically speeding up the dataset preparation workflow.

JTP2

The Joint Tagger Project's PILOT 2 model (JTP2) represents a significant advancement in specialized tagging for furry and anthro artwork. Developed with a focus on comprehensive metadata generation, it's particularly valuable for training furry art generators.

Technical Overview

JTP2 is built on a Vision Transformer (ViT) architecture based on the SigLIP model, modified with:

  • A gated prediction head for improved multi-label classification
  • Support for over 9,000 distinct tags
  • A custom image preprocessing pipeline optimized for artwork
  • Confidence scoring for each predicted tag

The model was trained on a diverse corpus of annotated furry artwork, giving it specialized knowledge of species characteristics, artistic styles, and content elements specific to the furry fandom.

Integration with Other Tools

Furrence-2-Large, a fine-tune of Microsoft's Florence-2 by Lodestone and Thouph, incorporates JTP2 for its tagging capabilities. You can modify its demo script to save both tags and captions simultaneously, creating a more efficient workflow.

JTP2 Implementation

You can download JTP2 from Hugging Face; place the model weights (JTP_PILOT2-e3-vit_so400m_patch14_siglip_384.safetensors) and the accompanying tags.json in the same folder as the script, since both are loaded by filename. The tagger script below processes an entire directory of images, automatically creating a text file of tags for each image.

You will need torch, torchvision, safetensors, Pillow, and timm installed:

pip install torch torchvision safetensors Pillow timm

The complete script (saved here as jtp2_tagger.py, though any name works):
import os
import json
from PIL import Image
import safetensors.torch
import timm
from timm.models import VisionTransformer
import torch
from torchvision.transforms import transforms
from torchvision.transforms import InterpolationMode
import torchvision.transforms.functional as TF
import argparse

torch.set_grad_enabled(False)

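# Resize an image to fit within the target bounds while preserving aspect ratio,
# optionally padding the result out to the exact bounds with a constant fill.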
class Fit(torch.nn.Module):
    def __init__(self, bounds: tuple[int, int] | int, interpolation=InterpolationMode.LANCZOS, grow: bool = True, pad: float | None = None):
        super().__init__()
        self.bounds = (bounds, bounds) if isinstance(bounds, int) else bounds
        self.interpolation = interpolation
        self.grow = grow
        self.pad = pad

    def forward(self, img: Image.Image) -> Image.Image:
        wimg, himg = img.size
        hbound, wbound = self.bounds
        hscale = hbound / himg
        wscale = wbound / wimg
        if not self.grow:
            hscale = min(hscale, 1.0)
            wscale = min(wscale, 1.0)
        scale = min(hscale, wscale)
        if scale == 1.0:
            return img
        hnew = min(round(himg * scale), hbound)
        wnew = min(round(wimg * scale), wbound)
        img = TF.resize(img, (hnew, wnew), self.interpolation)
        if self.pad is None:
            return img
        hpad = hbound - hnew
        wpad = wbound - wnew
        tpad = hpad // 2
        bpad = hpad - tpad
        lpad = wpad // 2
        rpad = wpad - lpad
        return TF.pad(img, (lpad, tpad, rpad, bpad), self.pad)

    def __repr__(self) -> str:
        return f"{self.__class__.__name__}(bounds={self.bounds}, interpolation={self.interpolation.value}, grow={self.grow}, pad={self.pad})"

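# Composite RGBA images onto a solid background color so the model always
# receives a 3-channel tensor.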
class CompositeAlpha(torch.nn.Module):
    def __init__(self, background: tuple[float, float, float] | float):
        super().__init__()
        self.background = (background, background, background) if isinstance(background, float) else background
        self.background = torch.tensor(self.background).unsqueeze(1).unsqueeze(2)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        if img.shape[-3] == 3:
            return img
        alpha = img[..., 3, None, :, :]
        img[..., :3, :, :] *= alpha
        background = self.background.expand(-1, img.shape[-2], img.shape[-1])
        if background.ndim == 1:
            background = background[:, None, None]
        elif background.ndim == 2:
            background = background[None, :, :]
        img[..., :3, :, :] += (1.0 - alpha) * background
        return img[..., :3, :, :]

    def __repr__(self) -> str:
        return f"{self.__class__.__name__}(background={self.background})"

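# JTP2 preprocessing: fit to 384x384, composite any alpha channel onto 50% gray,
# normalize to [-1, 1], then center-crop (padding if needed) to exactly 384x384.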
transform = transforms.Compose([
    Fit((384, 384)),
    transforms.ToTensor(),
    CompositeAlpha(0.5),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5], inplace=True),
    transforms.CenterCrop((384, 384)),
])

model = timm.create_model("vit_so400m_patch14_siglip_384.webli", pretrained=False, num_classes=9083)  # type: VisionTransformer

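# Gated prediction head: the linear layer emits two values per tag, and the
# sigmoid prediction is multiplied by a sigmoid gate for multi-label output.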
class GatedHead(torch.nn.Module):
    def __init__(self, num_features: int, num_classes: int):
        super().__init__()
        self.num_classes = num_classes
        self.linear = torch.nn.Linear(num_features, num_classes * 2)
        self.act = torch.nn.Sigmoid()
        self.gate = torch.nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.linear(x)
        x = self.act(x[:, :self.num_classes]) * self.gate(x[:, self.num_classes:])
        return x

model.head = GatedHead(min(model.head.weight.shape), 9083)
safetensors.torch.load_model(model, "JTP_PILOT2-e3-vit_so400m_patch14_siglip_384.safetensors")

if torch.cuda.is_available():
    model.cuda()
    if torch.cuda.get_device_capability()[0] >= 7:  # tensor cores
        model.to(dtype=torch.float16, memory_format=torch.channels_last)

model.eval()

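# tags.json ships alongside the model weights; its keys, in model output order,
# form the tag vocabulary. Underscores are swapped for spaces for readability.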
with open("tags.json", "r") as file:
    tags = json.load(file)  # type: dict
allowed_tags = list(tags.keys())

for idx, tag in enumerate(allowed_tags):
    allowed_tags[idx] = tag.replace("_", " ")

sorted_tag_score = {}

def run_classifier(image, threshold):
    global sorted_tag_score
    img = image.convert('RGBA')
    tensor = transform(img).unsqueeze(0)
    if torch.cuda.is_available():
        tensor = tensor.cuda()
        if torch.cuda.get_device_capability()[0] >= 7:  # tensor cores
            tensor = tensor.to(dtype=torch.float16, memory_format=torch.channels_last)
    with torch.no_grad():
        probits = model(tensor)[0].cpu()
        values, indices = probits.topk(250)
    tag_score = dict()
    for i in range(indices.size(0)):
        tag_score[allowed_tags[indices[i]]] = values[i].item()
    sorted_tag_score = dict(sorted(tag_score.items(), key=lambda item: item[1], reverse=True))
    return create_tags(threshold)

def create_tags(threshold):
    global sorted_tag_score
    filtered_tag_score = {key: value for key, value in sorted_tag_score.items() if value > threshold}
    text_no_impl = ", ".join(filtered_tag_score.keys())
    return text_no_impl, filtered_tag_score

def process_directory(directory, threshold):
    results = {}
    for root, _, files in os.walk(directory):
        for file in files:
            if file.lower().endswith(('.jpg', '.jpeg', '.png')):
                image_path = os.path.join(root, file)
                image = Image.open(image_path)
                tags, _ = run_classifier(image, threshold)
                results[image_path] = tags
                # Save tags to a text file with the same name as the image
                text_file_path = os.path.splitext(image_path)[0] + ".txt"
                with open(text_file_path, "w") as text_file:
                    text_file.write(tags)
    return results

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run inference on a directory of images.")
    parser.add_argument("directory", type=str, help="Target directory containing images.")
    parser.add_argument("--threshold", type=float, default=0.2, help="Threshold for tag filtering.")
    args = parser.parse_args()

    results = process_directory(args.directory, args.threshold)
    for image_path, tags in results.items():
        print(f"{image_path}: {tags}")
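
With the model weights and tags.json in place, point the script at your image directory; --threshold (default 0.2) sets how confident a prediction must be before its tag is written out:

python jtp2_tagger.py /path/to/images --threshold 0.2

Each image gets a matching .txt file containing its comma-separated tags.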

eva02-vit-large-448-8046

The eva02-vit tagger represents another powerful option for automatically analyzing furry artwork. Developed by Thouph, this model builds upon the EVA-02 vision transformer architecture to create a specialized tagging solution.

Technical Highlights

  • Architecture: Based on EVA-02, a powerful vision transformer developed at BAAI
  • Resolution: Processes images at 448x448 pixels, offering good detail recognition
  • Tag Coverage: Supports over 8,000 tags (8046 specifically)
  • Model Size: Large model variant with robust feature extraction capabilities

When to Choose This Tagger

Eva02-vit excels in several areas where it may outperform other taggers:

  • Fine-grained species detection: Particularly strong at distinguishing between similar species
  • Art style recognition: Better identifies traditional art techniques and digital styles
  • Lower resource requirements: Can run on systems with more limited GPU memory

Getting Started

You can access the model on Hugging Face. Implementation requires minimal dependencies:

pip install torch timm

For a quick demonstration and implementation example, check out this Colab Notebook that shows the tagger in action and provides ready-to-use code.
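
If you would rather run it as a standalone script than in the notebook, the inference loop follows the same multi-label pattern as JTP2. The sketch below is only an outline: the file names (model.pth, tags.json), normalization statistics, and 0.3 threshold are placeholder assumptions, and it additionally uses torchvision and Pillow for preprocessing, so defer to the model repository and the Colab notebook for the exact details.

import json
import torch
from PIL import Image
from torchvision import transforms

# Assumptions: the repository ships a fully pickled model plus a JSON tag list;
# adjust file names, preprocessing, and threshold to match the actual release.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.load("model.pth", map_location=device)  # placeholder file name
model.eval()

with open("tags.json") as f:  # placeholder file name; one tag per output index
    tag_names = json.load(f)

preprocess = transforms.Compose([
    transforms.Resize((448, 448)),  # the model operates on 448x448 input
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # assumed stats
])

image = Image.open("example.png").convert("RGB")
with torch.no_grad():
    logits = model(preprocess(image).unsqueeze(0).to(device))[0]
    probs = torch.sigmoid(logits).cpu()  # multi-label: independent sigmoid per tag

threshold = 0.3  # tune for precision vs. recall
predicted = [(name, p.item()) for name, p in zip(tag_names, probs) if p > threshold]
for name, score in sorted(predicted, key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.2f}")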

Captioners

Image captioning tools represent a more complex AI challenge than tagging, as they must generate coherent natural language descriptions that accurately capture visual elements, relationships, and context. Unlike taggers that simply identify discrete elements, captioners must understand spatial relationships, character interactions, scene composition, and even artistic intent.

How Captioning Works

Modern captioning systems for furry artwork typically combine:

  1. Vision encoders: Specialized neural networks (often transformers) that analyze visual features
  2. Language decoders: Text generation systems that convert visual features into descriptive text
  3. Domain-specific fine-tuning: Additional training on anthro/furry art to recognize species, character traits, and furry-specific elements

The most advanced captioners like Furrence-2-Large integrate both tagging capabilities and language modeling, using identified tags to enhance caption accuracy.
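
As a point of reference, this encoder-decoder pattern is what general-purpose captioners implement as well; the snippet below runs one such generic (non-furry-specialized) model through the transformers pipeline API, which the specialized tools in this section improve on with domain fine-tuning and tag conditioning.

from transformers import pipeline

# Generic vision-encoder + language-decoder captioner; shown only to
# illustrate the pattern, not as a substitute for the specialized tools below.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-large")
print(captioner("example.png")[0]["generated_text"])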

Benefits for Dataset Creation

  • Rich training signals: Detailed captions provide context and relationships that simple tags can’t capture
  • Natural language supervision: Enables text-to-image models to learn from more nuanced descriptions
  • Improved search: Makes finding specific images in large collections easier
  • Content organization: Helps categorize images by scene type, activity, or mood

Current Limitations

While the latest generation of captioning tools shows impressive results, some challenges remain:

  • Species identification: May confuse similar species (e.g., wolf vs. husky, fox vs. coyote)
  • Color accuracy: Sometimes misidentifies character colors, especially for complex patterns
  • Spatial relationships: May incorrectly describe the positions or interactions between characters
  • Artistic style detection: Less reliable at identifying specific art styles or techniques

For best results, use these tools as a starting point and review/edit captions for critical elements like species, color, and positioning before incorporating them into training datasets.

Furrence-2-Large

Furrence-2-Large is a specialized image captioning system based on the Florence2-large model fine-tuned by lodestone-horizon and Thouph. It combines the Florence2 model with the Joint Tagger Project PILOT 2 (JTP2) for advanced image tagging and custom prompt construction.

Features

  • High-quality detailed captions for furry artwork
  • JTP PILOT 2 tagging for accurate content recognition
  • Support for multiple image formats
  • Web interface for easy use

Setup Instructions for Web Demo

  1. Clone the repository:

    git clone https://huggingface.co/spaces/Thouph/Furrence-2-Large-Demo
    cd Furrence-2-Large-Demo
    
  2. Install required dependencies:

    pip install -r requirements.txt
    pip install gradio safetensors timm transformers torch torchvision pillow
    
  3. Run the web demo:

    python app.py
    

This will launch a local web interface at http://127.0.0.1:7860 where you can upload images and generate captions.

Adapting for Batch Processing

If you need to process multiple images, you can adapt the demo code to create a batch processing script:

import os
import sys
from pathlib import Path
from PIL import Image
import torch
from transformers import AutoProcessor

# Import necessary components from the demo.
# Make sure you've cloned the Furrence-2-Large-Demo repository first and that
# this path points at your local clone: it has to be on sys.path before the
# demo's modules (app, florence2_implementation) can be imported.
sys.path.append("path/to/Furrence-2-Large-Demo")
from app import generate_prompt, tagger_model, tagger_transform, pruner, allowed_tags, THRESHOLD
from florence2_implementation.modeling_florence2 import Florence2ForConditionalGeneration

# Load models
model_id = "lodestone-horizon/furrence2-large"
model = Florence2ForConditionalGeneration.from_pretrained(model_id).eval()
processor = AutoProcessor.from_pretrained("./florence2_implementation/", trust_remote_code=True)

def caption_image(image_path, expected_caption_length=100, seq_len=512):
    try:
        # Open image
        image = Image.open(image_path).convert("RGB")
        
        # Generate prompt
        prompt_input = generate_prompt(image, expected_caption_length)
        
        # Process image and generate caption
        pixel_values = processor.image_processor(image, return_tensors="pt")["pixel_values"]
        encoder_inputs = processor.tokenizer(
            text=prompt_input,
            return_tensors="pt",
        )
        
        generated_ids = model.generate(
            input_ids=encoder_inputs["input_ids"],
            attention_mask=encoder_inputs["attention_mask"],
            pixel_values=pixel_values,
            max_new_tokens=seq_len,
            early_stopping=False,
            do_sample=True,
            temperature=0.7,
            top_k=50,
            top_p=0.9,
        )
        
        generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
        
        # Save caption to file
        caption_file = Path(image_path).with_suffix('.florence')
        with open(caption_file, 'w', encoding='utf-8') as f:
            f.write(generated_text)
            
        print(f"Processed {image_path}")
        return generated_text
        
    except Exception as e:
        print(f"Error processing {image_path}: {e}")
        return None

# Process all images in a directory
def process_directory(directory):
    for img_path in Path(directory).glob('*.png'):
        caption_image(img_path)
    for img_path in Path(directory).glob('*.jpg'):
        caption_image(img_path)

if __name__ == "__main__":
    if len(sys.argv) > 1:
        input_path = sys.argv[1]
        if os.path.isdir(input_path):
            process_directory(input_path)
        else:
            caption_image(input_path)
    else:
        print("Please provide an image or directory path")
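
Assuming you save the adapted script as, say, batch_caption.py inside your clone of the demo repository (the name is arbitrary), you can point it at a single image or a whole directory:

python batch_caption.py /path/to/your/images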

Output

When using the adapted batch processing script, a .florence file will be created for each image with the generated caption.

Joy-Caption

Joy-Caption is a specialized multi-modal captioning system designed specifically for furry artwork, developed by fancyfeast. This tool combines powerful vision encoding with advanced language modeling to generate nuanced, context-aware descriptions.

Architecture and Capabilities

Joy-Caption integrates:

  • JoyTag image analysis: Leverages the JoyTag tagger for detailed visual element detection
  • LLM Integration: Uses Meta-Llama-3.1-8B for generating natural-sounding captions
  • Domain-specific fine-tuning: Optimized for furry art elements and characteristics

The system excels at recognizing species-specific details, character relationships, and artistic elements common in furry artwork that general captioners might miss or misinterpret.

Key Features

  • Contextual understanding: Generates captions that reflect relationships between characters
  • Style recognition: Identifies and describes artistic styles specific to furry art
  • Species accuracy: Particularly strong at correctly identifying anthropomorphic species
  • Expression detection: Captures nuanced emotional states and expressions
  • Iterative refinement: Uses multi-pass processing to improve caption quality

Joy-Caption Implementation

To use Joy-Caption:

  1. Access the Hugging Face Space for web interface usage

  2. Clone the repository:

    git clone https://huggingface.co/spaces/fancyfeast
    
  3. Important: You’ll need access credentials for meta-llama/Meta-Llama-3.1-8B to use the full functionality

The system outputs detailed captions that can be directly incorporated into training datasets or used as a starting point for manual refinement.

Joy-Caption Batch Processing

Joy-Caption offers powerful batch processing capabilities, allowing you to automatically caption entire directories of images. The implementation relies on three main components:

  1. caption_utils.py: Provides shared utilities for image captioning
  2. joy_caption.py: The main implementation with batch processing functionality
  3. e6db: Used for various normalization tasks

Key Features of Batch Processing

  • Multiple Caption Types: Generate descriptive captions, training prompts, booru tag lists, and more
  • Quality Control: Automatically validates captions and regenerates them if they don't meet quality thresholds
  • Tag Integration: Can use existing tags (.tags files) to enhance caption accuracy
  • Customization Options: Control caption length, style, and format
  • Efficient Processing: Handles various image formats (PNG, JPEG, WEBP, JXL, AVIF)

Using the Command Line Interface

The simplest way to batch process a directory of images:

python joy_caption.py /path/to/your/images

With advanced options:

python joy_caption.py /path/to/your/images --caption_type "descriptive" --caption_length "300" --feed-from-tags

Command Line Options

  • --caption_type: Type of caption to generate (descriptive, training prompt, booru tag list, etc.)
  • --caption_length: Target length for the generated captions
  • --feed-from-tags: Use existing .tags files to enhance captions
  • --random-tags: Randomly select a specified number of tags from .tags files
  • --artist-from-folder: Extract the artist name from the parent folder
  • --dont-strip-commas: Preserve commas in the generated captions
  • --add-commas-to-sentence-ends: Add commas after periods in sentences
  • --verbose: Increase output verbosity

Integration with Existing Tags

A particularly powerful feature is the ability to use existing tag files to guide the captioning process. If you have images with corresponding .tags files (same filename but with .tags extension), the script will:

  1. Read and categorize the tags (artist, character, species, copyright, etc.)
  2. Build a structured prompt incorporating these tags
  3. Generate a caption that accurately reflects the tagged content

This creates a synergy between automatic tagging and captioning, producing more accurate and consistent results.
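
For example, a dataset entry might consist of the pair below (file names and tags are purely illustrative):

dataset/wolf_portrait.png
dataset/wolf_portrait.tags

If the .tags file contains something like "by some_artist, anthro, male, wolf, portrait, smiling", joy_caption.py folds those categorized tags into its prompt and writes the finished caption to dataset/wolf_portrait.caption.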

Output Format

For each processed image, the script:

  1. Generates a caption based on your specifications
  2. Validates the caption against quality criteria (minimum sentence count, word variety, etc.)
  3. Saves the caption in a .caption file with the same base name as the image

Next Steps

After tagging and captioning your images, the next step is to normalize the tags to ensure consistency throughout your dataset. Continue to the Tag Normalization guide to learn how to standardize tags for better training results.