Taggers and Captioners
Introduction
Creating high-quality AI models requires robust datasets with detailed metadata. This guide explores specialized AI tools designed to automatically analyze and annotate furry artwork, dramatically reducing the time-intensive process of manual labeling.
Automated tagging and captioning tools serve distinct but complementary functions:
- Taggers: Identify specific elements, attributes, and features present in images (e.g., “anthro,” “canine,” “outdoor scene”)
- Captioners: Generate natural language descriptions that capture the overall content and context of images
These tools utilize specialized vision models trained on domain-specific data to recognize furry art characteristics that general-purpose AI systems might miss. While they’re not perfect, they provide a solid foundation that can be manually refined, potentially saving dozens of hours when preparing large datasets.
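For example, for the same illustration the two kinds of output might look something like this (an illustrative example, not the output of any particular model):
- Tagger output: anthro, canine, wolf, solo, forest, outdoor scene, digital media (artwork)
- Captioner output: "An anthro wolf stands alone in a forest clearing at dusk, painted in a soft digital style."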
Taggers
Auto taggers employ computer vision and machine learning to analyze visual content and generate relevant tags. These specialized models are trained to recognize common elements in furry artwork including:
- Species and character traits
- Poses and expressions
- Clothing and accessories
- Environmental elements
- Art styles and techniques
Modern taggers like JTP2 (Joint Tagger Project PILOT 2) can identify thousands of distinct elements, each with a confidence score, substantially streamlining the dataset preparation workflow.
JTP2
The Joint Tagger Project (JTP2) represents a significant advancement in specialized tagging for furry and anthro artwork. Developed with a focus on comprehensive metadata generation, it’s particularly valuable for training furry art generators.
Technical Overview
JTP2 is built on a Vision Transformer (ViT) architecture based on the SigLIP model, modified with:
- A gated prediction head for improved multi-label classification
- Support for over 9,000 distinct tags
- A custom image preprocessing pipeline optimized for artwork
- Confidence scoring for each predicted tag
The model was trained on a diverse corpus of annotated furry artwork, giving it specialized knowledge of species characteristics, artistic styles, and content elements specific to the furry fandom.
Integration with Other Tools
Furrence-2-Large, a fine-tune of Microsoft's Florence-2 by Thouph and Lodestone, incorporates JTP2 for its tagging capabilities. Its script can be modified to save both tags and captions simultaneously, creating a more efficient workflow (see the batch processing example in the Furrence-2-Large section below).
JTP2 Implementation
You can download JTP2 from Hugging Face. The tagger script below processes an entire directory of images, automatically creating text files containing tags for each image.
You will need torch, safetensors, Pillow, and timm installed:
pip install torch safetensors Pillow timm
import os
import json
from PIL import Image
import safetensors.torch
import timm
from timm.models import VisionTransformer
import torch
from torchvision.transforms import transforms
from torchvision.transforms import InterpolationMode
import torchvision.transforms.functional as TF
import argparse
torch.set_grad_enabled(False)
class Fit(torch.nn.Module):
def __init__(self, bounds: tuple[int, int] | int, interpolation=InterpolationMode.LANCZOS, grow: bool = True, pad: float | None = None):
super().__init__()
self.bounds = (bounds, bounds) if isinstance(bounds, int) else bounds
self.interpolation = interpolation
self.grow = grow
self.pad = pad
def forward(self, img: Image) -> Image:
wimg, himg = img.size
hbound, wbound = self.bounds
hscale = hbound / himg
wscale = wbound / wimg
if not self.grow:
hscale = min(hscale, 1.0)
wscale = min(wscale, 1.0)
scale = min(hscale, wscale)
if scale == 1.0:
return img
hnew = min(round(himg * scale), hbound)
wnew = min(round(wimg * scale), wbound)
img = TF.resize(img, (hnew, wnew), self.interpolation)
if self.pad is None:
return img
hpad = hbound - hnew
wpad = wbound - wnew
tpad = hpad // 2
bpad = hpad - tpad
lpad = wpad // 2
rpad = wpad - lpad
return TF.pad(img, (lpad, tpad, rpad, bpad), self.pad)
def __repr__(self) -> str:
return f"{self.__class__.__name__}(bounds={self.bounds}, interpolation={self.interpolation.value}, grow={self.grow}, pad={self.pad})"
class CompositeAlpha(torch.nn.Module):
def __init__(self, background: tuple[float, float, float] | float):
super().__init__()
self.background = (background, background, background) if isinstance(background, float) else background
self.background = torch.tensor(self.background).unsqueeze(1).unsqueeze(2)
def forward(self, img: torch.Tensor) -> torch.Tensor:
if img.shape[-3] == 3:
return img
alpha = img[..., 3, None, :, :]
img[..., :3, :, :] *= alpha
background = self.background.expand(-1, img.shape[-2], img.shape[-1])
if background.ndim == 1:
background = background[:, None, None]
elif background.ndim == 2:
background = background[None, :, :]
img[..., :3, :, :] += (1.0 - alpha) * background
return img[..., :3, :, :]
def __repr__(self) -> str:
return f"{self.__class__.__name__}(background={self.background})"
transform = transforms.Compose([
Fit((384, 384)),
transforms.ToTensor(),
CompositeAlpha(0.5),
transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5], inplace=True),
transforms.CenterCrop((384, 384)),
])
model = timm.create_model("vit_so400m_patch14_siglip_384.webli", pretrained=False, num_classes=9083) # type: VisionTransformer
class GatedHead(torch.nn.Module):
def __init__(self, num_features: int, num_classes: int):
super().__init__()
self.num_classes = num_classes
self.linear = torch.nn.Linear(num_features, num_classes * 2)
self.act = torch.nn.Sigmoid()
self.gate = torch.nn.Sigmoid()
def forward(self, x: torch.Tensor) -> torch.Tensor:
x = self.linear(x)
x = self.act(x[:, :self.num_classes]) * self.gate(x[:, self.num_classes:])
return x
model.head = GatedHead(min(model.head.weight.shape), 9083)
safetensors.torch.load_model(model, "JTP_PILOT2-e3-vit_so400m_patch14_siglip_384.safetensors")
if torch.cuda.is_available():
model.cuda()
if torch.cuda.get_device_capability()[0] >= 7: # tensor cores
model.to(dtype=torch.float16, memory_format=torch.channels_last)
model.eval()
with open("tags.json", "r") as file:
tags = json.load(file) # type: dict
allowed_tags = list(tags.keys())
for idx, tag in enumerate(allowed_tags):
allowed_tags[idx] = tag.replace("_", " ")
sorted_tag_score = {}
def run_classifier(image, threshold):
global sorted_tag_score
img = image.convert('RGBA')
tensor = transform(img).unsqueeze(0)
if torch.cuda.is_available():
tensor = tensor.cuda()
if torch.cuda.get_device_capability()[0] >= 7: # tensor cores
tensor = tensor.to(dtype=torch.float16, memory_format=torch.channels_last)
with torch.no_grad():
probits = model(tensor)[0].cpu()
values, indices = probits.topk(250)
tag_score = dict()
for i in range(indices.size(0)):
tag_score[allowed_tags[indices[i]]] = values[i].item()
sorted_tag_score = dict(sorted(tag_score.items(), key=lambda item: item[1], reverse=True))
return create_tags(threshold)
def create_tags(threshold):
global sorted_tag_score
filtered_tag_score = {key: value for key, value in sorted_tag_score.items() if value > threshold}
text_no_impl = ", ".join(filtered_tag_score.keys())
return text_no_impl, filtered_tag_score
def process_directory(directory, threshold):
results = {}
for root, _, files in os.walk(directory):
for file in files:
if file.lower().endswith(('.jpg', '.jpeg', '.png')):
image_path = os.path.join(root, file)
image = Image.open(image_path)
tags, _ = run_classifier(image, threshold)
results[image_path] = tags
# Save tags to a text file with the same name as the image
text_file_path = os.path.splitext(image_path)[0] + ".txt"
with open(text_file_path, "w") as text_file:
text_file.write(tags)
return results
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Run inference on a directory of images.")
parser.add_argument("directory", type=str, help="Target directory containing images.")
parser.add_argument("--threshold", type=float, default=0.2, help="Threshold for tag filtering.")
args = parser.parse_args()
results = process_directory(args.directory, args.threshold)
for image_path, tags in results.items():
print(f"{image_path}: {tags}")
eva02-vit-large-448-8046
The eva02-vit tagger represents another powerful option for automatically analyzing furry artwork. Developed by Thouph, this model builds upon the EVA-02 vision transformer architecture to create a specialized tagging solution.
Technical Highlights
- Architecture: Based on EVA-02, a powerful vision transformer developed at BAAI
- Resolution: Processes images at 448x448 pixels, offering good detail recognition
- Tag Coverage: Supports over 8,000 tags (8046 specifically)
- Model Size: Large model variant with robust feature extraction capabilities
When to Choose This Tagger
Eva02-vit excels in several areas where it may outperform other taggers:
- Fine-grained species detection: Particularly strong at distinguishing between similar species
- Art style recognition: Better identifies traditional art techniques and digital styles
- Lower resource requirements: Can run on systems with more limited GPU memory
Getting Started
You can access the model on Hugging Face. Implementation requires minimal dependencies:
pip install torch timm
For a quick demonstration and implementation example, check out this Colab Notebook that shows the tagger in action and provides ready-to-use code.
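If you prefer to script it yourself rather than use the notebook, the general pattern looks roughly like the sketch below. The timm architecture name is the stock EVA-02 large 448 model; the checkpoint and tag-list file names (model.pth, tags_8046.json) are placeholders, so substitute the actual files from the Hugging Face repository and treat the Colab notebook as the canonical reference.
# Rough sketch of loading an EVA-02 based tagger with timm.
# File names below are placeholders; check the Hugging Face repo / Colab notebook.
import json
import torch
import timm
from timm.data import create_transform, resolve_data_config
from PIL import Image

model = timm.create_model(
    "eva02_large_patch14_448.mim_m38m_ft_in22k_in1k",  # base architecture the tagger fine-tunes
    pretrained=False,
    num_classes=8046,
)
model.load_state_dict(torch.load("model.pth", map_location="cpu"))  # placeholder checkpoint name
model.eval()

# Preprocessing that matches the model's 448x448 input configuration
transform = create_transform(**resolve_data_config({}, model=model))
tensor = transform(Image.open("example.png").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    scores = torch.sigmoid(model(tensor)[0])  # independent per-tag probabilities

tag_names = json.load(open("tags_8046.json"))  # placeholder: list of tag names by index
for idx in (scores > 0.3).nonzero().flatten().tolist():
    print(f"{tag_names[idx]}: {scores[idx].item():.3f}")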
Captioners
Image captioning tools represent a more complex AI challenge than tagging, as they must generate coherent natural language descriptions that accurately capture visual elements, relationships, and context. Unlike taggers that simply identify discrete elements, captioners must understand spatial relationships, character interactions, scene composition, and even artistic intent.
How Captioning Works
Modern captioning systems for furry artwork typically combine:
- Vision encoders: Specialized neural networks (often transformers) that analyze visual features
- Language decoders: Text generation systems that convert visual features into descriptive text
- Domain-specific fine-tuning: Additional training on anthro/furry art to recognize species, character traits, and furry-specific elements
The most advanced captioners like Furrence-2-Large integrate both tagging capabilities and language modeling, using identified tags to enhance caption accuracy.
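Conceptually, the hand-off from tagger to captioner can be as simple as folding high-confidence tags into the text prompt that conditions the language decoder. The sketch below is a simplified illustration of that idea, not the actual Furrence-2-Large prompt format (its real construction lives in the demo's generate_prompt function).
# Simplified illustration of tag-conditioned captioning, not any model's real prompt format.
def build_caption_prompt(tag_scores: dict[str, float], target_words: int, threshold: float = 0.3) -> str:
    confident = [tag for tag, score in sorted(tag_scores.items(), key=lambda kv: -kv[1]) if score > threshold]
    return (f"Describe this image in roughly {target_words} words. "
            f"It is known to contain: {', '.join(confident)}.")

# The resulting prompt is fed to the language decoder together with the encoded image features.
print(build_caption_prompt({"anthro": 0.98, "wolf": 0.91, "forest": 0.45, "rain": 0.12}, 100))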
Benefits for Dataset Creation
- Rich training signals: Detailed captions provide context and relationships that simple tags can’t capture
- Natural language supervision: Enables text-to-image models to learn from more nuanced descriptions
- Improved search: Makes finding specific images in large collections easier
- Content organization: Helps categorize images by scene type, activity, or mood
Current Limitations
While the latest generation of captioning tools shows impressive results, some challenges remain:
- Species identification: May confuse similar species (e.g., wolf vs. husky, fox vs. coyote)
- Color accuracy: Sometimes misidentifies character colors, especially for complex patterns
- Spatial relationships: May incorrectly describe the positions or interactions between characters
- Artistic style detection: Less reliable at identifying specific art styles or techniques
For best results, use these tools as a starting point and review/edit captions for critical elements like species, color, and positioning before incorporating them into training datasets.
Furrence-2-Large
Furrence-2-Large is a specialized image captioning system based on the Florence2-large model fine-tuned by lodestone-horizon and Thouph. It combines the Florence2 model with the Joint Tagger Project PILOT 2 (JTP2) for advanced image tagging and custom prompt construction.
Features
- High-quality detailed captions for furry artwork
- JTP PILOT 2 tagging for accurate content recognition
- Support for multiple image formats
- Web interface for easy use
Setup Instructions for Web Demo
Clone the repository:
git clone https://huggingface.co/spaces/Thouph/Furrence-2-Large-Demo
cd Furrence-2-Large-Demo
Install required dependencies:
pip install -r requirements.txt
pip install gradio safetensors timm transformers torch torchvision pillow
Run the web demo:
python app.py
This will launch a local web interface at http://127.0.0.1:7860 where you can upload images and generate captions.
Adapting for Batch Processing
If you need to process multiple images, you can adapt the demo code to create a batch processing script:
import os
import sys
from pathlib import Path
from PIL import Image
import torch
from transformers import AutoProcessor
# Import necessary components from the demo.
# Make sure you've cloned the Furrence-2-Large-Demo repository first and append its path
# before the imports below, so Python can find the demo's modules.
sys.path.append("path/to/Furrence-2-Large-Demo")
from florence2_implementation.modeling_florence2 import Florence2ForConditionalGeneration
from app import generate_prompt, tagger_model, tagger_transform, pruner, allowed_tags, THRESHOLD
# Load models
model_id = "lodestone-horizon/furrence2-large"
model = Florence2ForConditionalGeneration.from_pretrained(model_id).eval()
processor = AutoProcessor.from_pretrained("./florence2_implementation/", trust_remote_code=True)
def caption_image(image_path, expected_caption_length=100, seq_len=512):
try:
# Open image
image = Image.open(image_path).convert("RGB")
# Generate prompt
prompt_input = generate_prompt(image, expected_caption_length)
# Process image and generate caption
pixel_values = processor.image_processor(image, return_tensors="pt")["pixel_values"]
encoder_inputs = processor.tokenizer(
text=prompt_input,
return_tensors="pt",
)
generated_ids = model.generate(
input_ids=encoder_inputs["input_ids"],
attention_mask=encoder_inputs["attention_mask"],
pixel_values=pixel_values,
max_new_tokens=seq_len,
early_stopping=False,
do_sample=True,
temperature=0.7,
top_k=50,
top_p=0.9,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
# Save caption to file
caption_file = Path(image_path).with_suffix('.florence')
with open(caption_file, 'w', encoding='utf-8') as f:
f.write(generated_text)
print(f"Processed {image_path}")
return generated_text
except Exception as e:
print(f"Error processing {image_path}: {e}")
return None
# Process all images in a directory
def process_directory(directory):
for img_path in Path(directory).glob('*.png'):
caption_image(img_path)
for img_path in Path(directory).glob('*.jpg'):
caption_image(img_path)
if __name__ == "__main__":
if len(sys.argv) > 1:
input_path = sys.argv[1]
if os.path.isdir(input_path):
process_directory(input_path)
else:
caption_image(input_path)
else:
print("Please provide an image or directory path")
Output
When using the adapted batch processing script, a .florence file will be created for each image, containing the generated caption.
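Assuming the adapted script is saved as batch_caption.py inside the cloned demo directory (the filename is arbitrary), a run looks like this:
python batch_caption.py /path/to/your/images
Every processed image then has a sibling caption file, e.g. scene01.png → scene01.florence.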
Joy-Caption
Joy-Caption is a specialized multi-modal captioning system designed for furry artwork, developed by fancyfeast. It combines powerful vision encoding with advanced language modeling to generate nuanced, context-aware descriptions.
Architecture and Capabilities
Joy-Caption integrates:
- JoyTag image analysis: Leverages the JoyTag tagger for detailed visual element detection
- LLM Integration: Uses Meta-Llama-3.1-8B for generating natural-sounding captions
- Domain-specific fine-tuning: Optimized for furry art elements and characteristics
The system excels at recognizing species-specific details, character relationships, and artistic elements common in furry artwork that general captioners might miss or misinterpret.
Key Features
- Contextual understanding: Generates captions that reflect relationships between characters
- Style recognition: Identifies and describes artistic styles specific to furry art
- Species accuracy: Particularly strong at correctly identifying anthropomorphic species
- Expression detection: Captures nuanced emotional states and expressions
- Iterative refinement: Uses multi-pass processing to improve caption quality
Joy-Caption Implementation
To use Joy-Caption:
Access the Hugging Face Space for web interface usage
Clone the repository:
git clone https://huggingface.co/spaces/fancyfeast
Important: You’ll need access credentials for meta-llama/Meta-Llama-3.1-8B to use the full functionality
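Because Meta-Llama-3.1-8B is a gated model, you must request access on its Hugging Face model page and then authenticate locally before the weights can be downloaded, for example with:
huggingface-cli login
Paste an access token from your Hugging Face account settings when prompted.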
The system outputs detailed captions that can be directly incorporated into training datasets or used as a starting point for manual refinement.
Joy-Caption Batch Processing
Joy-Caption offers powerful batch processing capabilities, allowing you to automatically caption entire directories of images. The implementation uses two main Python scripts plus a supporting library:
- caption_utils.py: provides shared utilities for image captioning
- joy_caption.py: the main implementation with the batch processing functionality
- e6db: used for various normalization tasks
Key Features of Batch Processing
- Multiple Caption Types: Generate descriptive captions, training prompts, booru tag lists, and more
- Quality Control: Automatically validates captions and regenerates if they don’t meet quality thresholds
- Tag Integration: Can use existing tags (.tags files) to enhance caption accuracy
- Customization Options: Control caption length, style, and format
- Efficient Processing: Handles various image formats (PNG, JPEG, WEBP, JXL, AVIF)
Using the Command Line Interface
The simplest way to batch process a directory of images:
python joy_caption.py /path/to/your/images
With advanced options:
python joy_caption.py /path/to/your/images --caption_type "descriptive" --caption_length "300" --feed-from-tags
Command Line Options
Option | Description |
---|---|
--caption_type | Type of caption to generate (descriptive, training prompt, booru tag list, etc.) |
--caption_length | Target length for the generated captions |
--feed-from-tags | Use existing .tags files to enhance captions |
--random-tags | Randomly select a specified number of tags from .tags files |
--artist-from-folder | Extract artist name from parent folder |
--dont-strip-commas | Preserve commas in the generated captions |
--add-commas-to-sentence-ends | Add commas after periods in sentences |
--verbose | Increase output verbosity |
Integration with Existing Tags
A particularly powerful feature is the ability to use existing tag files to guide the captioning process. If you have images with corresponding .tags files (same filename but with .tags extension), the script will:
- Read and categorize the tags (artist, character, species, copyright, etc.)
- Build a structured prompt incorporating these tags
- Generate a caption that accurately reflects the tagged content
This creates a synergy between automatic tagging and captioning, producing more accurate and consistent results.
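For example, with a hypothetical pair like this:
dataset/wolf_portrait.png
dataset/wolf_portrait.tags (containing, say, "by example_artist, anthro, wolf, male, forest")
the following run will fold those tags into the captioning prompt:
python joy_caption.py dataset --caption_type "descriptive" --feed-from-tags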
Output Format
For each processed image, the script:
- Generates a caption based on your specifications
- Validates the caption against quality criteria (minimum sentence count, word variety, etc.)
- Saves the caption in a .caption file with the same base name as the image
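Continuing the hypothetical example above, the directory ends up looking like this:
dataset/wolf_portrait.png
dataset/wolf_portrait.tags
dataset/wolf_portrait.caption (the newly generated caption)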
Next Steps
After tagging and captioning your images, the next step is to normalize the tags to ensure consistency throughout your dataset. Continue to the Tag Normalization guide to learn how to standardize tags for better training results.