Dataset Tools

Most of these scripts are self-explanatory from the file name alone, but almost all of them also contain an AI-generated description of exactly what they do. To use them, you will need to edit the path to your training_dir folder. The variable is called path or directory and looks something like this:

def main():
    path = 'C:\\Users\\kade\\Desktop\\training_dir_staging'

A ⚡ in the title means it is a PowerShell script, 🐍 means Python, and 🦀 means Rust, of course!

Don’t be afraid of editing Python scripts. Unlike the real snake, these won’t bite! In the worst case, they’ll just delete your files!

🦀 extract-metadata
Processes safetensors files, extracts their metadata, converts it into a JSON object, and writes the JSON to a new file. It can process individual files or all safetensors files in a directory.
🦀 format-json
Formats JSON files from single-line to multi-line format using serde_json.
🦀 remove-extra-file-extensions
This Rust script renames text files in a specified directory by removing any extra image extensions (.jpeg, .png, or .jpg) from their names.
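The renaming rule can be illustrated with a minimal Python sketch. The actual tool is written in Rust and its exact matching rules may differ; the function names `strip_image_extension` and `rename_in_directory` are hypothetical, and the sketch assumes the text files end in .txt.

```python
from pathlib import Path

# Image extensions that can end up in front of ".txt", per the description above.
IMAGE_EXTENSIONS = {".jpeg", ".png", ".jpg"}


def strip_image_extension(name: str) -> str:
    """Turn 'image.png.txt' into 'image.txt'; leave other names unchanged."""
    path = Path(name)
    suffixes = path.suffixes  # e.g. ['.png', '.txt']
    if (
        len(suffixes) >= 2
        and suffixes[-1] == ".txt"
        and suffixes[-2].lower() in IMAGE_EXTENSIONS
    ):
        # Cut off both trailing suffixes, then put ".txt" back.
        stem = path.name[: -len(suffixes[-2] + suffixes[-1])]
        return stem + ".txt"
    return name


def rename_in_directory(directory: str) -> None:
    """Rename matching text files directly inside `directory` (not recursive)."""
    for file in Path(directory).glob("*.txt"):
        new_name = strip_image_extension(file.name)
        if new_name != file.name:
            file.rename(file.with_name(new_name))
```

Keeping the name logic in its own function makes it easy to check the rule on a few file names before letting it rename anything on disk.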
🐍 Check for Duplicate Words Between Captions and Tags
This script traverses a directory, searches for text files, processes each file to extract tags and captions, and highlights occurrences of tags within captions using random colors, displaying the results in a visually rich format in the terminal.
🐍 Check for Large Images
This script scans a directory for images and checks their dimensions. If an image’s dimensions exceed specified thresholds, the script logs the image’s path and dimensions to a file.
🐍 Check for Transparency
This script checks for transparency in PNG images within a specified directory and its subdirectories.
🐍 Convert RGBA to RGB in PNGs
This script automates the process of converting .png images from RGBA to RGB format in a specified directory, utilizing multiprocessing to enhance efficiency.
🐍 Count Images in Folder
This script counts the total number of JPEG and PNG images in a specified directory.
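A minimal sketch of the counting logic, assuming files are identified by extension and that subdirectories are included (the original script may only look at the top level). The name `count_images` is hypothetical.

```python
from pathlib import Path

# Extensions treated as JPEG or PNG images (an assumption about the original script).
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def count_images(directory: str) -> int:
    """Count JPEG and PNG files in `directory`, including subdirectories."""
    return sum(
        1
        for file in Path(directory).rglob("*")
        if file.is_file() and file.suffix.lower() in IMAGE_EXTENSIONS
    )
```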
🐍 Create Empty Captions for Images
This Python script creates an empty text file with the same name as each image file (.jpg, .png, or .jpeg) present in a specified directory. The script checks if the directory exists, and then iterates through all the image files in the directory.
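The steps above can be sketched as follows. This is an illustration rather than the script itself: the name `create_empty_captions` is hypothetical, and skipping captions that already exist is an assumption added here so that the sketch cannot wipe existing work.

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def create_empty_captions(directory: str) -> list[Path]:
    """Create an empty .txt next to every image; return the files that were created."""
    root = Path(directory)
    if not root.is_dir():
        raise NotADirectoryError(f"{directory} is not a directory")

    created = []
    for image in sorted(root.iterdir()):
        if image.is_file() and image.suffix.lower() in IMAGE_EXTENSIONS:
            caption = image.with_suffix(".txt")
            if not caption.exists():  # assumption: never overwrite an existing caption
                caption.touch()
                created.append(caption)
    return created
```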
🐍 e621 JSON to Caption
This Python script is designed to process JSON files found within a specified directory and its subdirectories. Each JSON file is expected to contain data related to image posts sourced from e621.net or e6ai.net. The script parses these JSON files, extracts relevant information such as image URL, ratings, and tags, and generates caption files (.txt) based on this data.
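A rough sketch of the JSON-to-caption step, under several assumptions that are not taken from the script: each post has a single-letter "rating" field and a "tags" object that maps a category name to a list of tag strings, optionally wrapped in a top-level "post" key. The rating tag names in `RATING_TAGS` and both function names are hypothetical.

```python
import json
from pathlib import Path

# Hypothetical mapping from a single-letter rating to a caption tag.
RATING_TAGS = {"s": "rating_safe", "q": "rating_questionable", "e": "rating_explicit"}


def post_to_caption(post: dict) -> str:
    """Build a comma-separated caption from a post's rating and tags."""
    parts = []
    rating = RATING_TAGS.get(post.get("rating", ""))
    if rating:
        parts.append(rating)
    # Assumed layout: "tags" maps a category name to a list of tag strings.
    for category_tags in post.get("tags", {}).values():
        parts.extend(category_tags)
    return ", ".join(parts)


def write_captions(directory: str) -> None:
    """Write a .txt caption next to every .json file under `directory`."""
    for json_file in Path(directory).rglob("*.json"):
        data = json.loads(json_file.read_text(encoding="utf-8"))
        post = data.get("post", data)  # accept both wrapped and bare post objects
        caption = post_to_caption(post)
        json_file.with_suffix(".txt").write_text(caption, encoding="utf-8")
```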
🐍 FurryTagger
Loads eva02-vit-large-448-8046, applies it to a set of images in a specified directory, and writes the model’s output tags to a text file for each image.
🐍 Newlines to Commas
Recursively modifies the content of .txt files in the specified directory and its subdirectories by replacing newlines with commas and spaces.
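A minimal sketch of the replacement, assuming blank lines and surrounding whitespace should be dropped so the result has no empty entries (the original script may handle these differently). The function names are hypothetical.

```python
from pathlib import Path


def newlines_to_commas(text: str) -> str:
    """Join the non-empty lines of `text` with ', '."""
    lines = [line.strip() for line in text.splitlines()]
    return ", ".join(line for line in lines if line)


def convert_directory(directory: str) -> None:
    """Rewrite every .txt file under `directory` in place."""
    for txt_file in Path(directory).rglob("*.txt"):
        original = txt_file.read_text(encoding="utf-8")
        txt_file.write_text(newlines_to_commas(original), encoding="utf-8")
```

Since `convert_directory` overwrites files in place, it is worth running it on a copy of the dataset first.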
🐍 Replace Transparency with Black
This Python script processes all .png images in a specified directory by adding a black background layer to each, using multiprocessing to handle the images in parallel.
🐍 Search for Tag
This script is used to search for the word “anthrofied” in all .txt files within a specified directory and its subdirectories. It uses multiprocessing to speed up the search by checking multiple files simultaneously.
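The parallel search could look roughly like this. It is a sketch, not the script's actual code: the function names are hypothetical, and it does a plain substring match, so a longer tag that contains the search word also counts as a hit.

```python
from functools import partial
from multiprocessing import Pool
from pathlib import Path


def file_contains(tag: str, path: Path) -> bool:
    """Return True if `tag` occurs anywhere in the text file at `path`."""
    try:
        return tag in path.read_text(encoding="utf-8", errors="ignore")
    except OSError:
        return False  # unreadable files are treated as non-matching


def search_for_tag(directory: str, tag: str = "anthrofied") -> list[Path]:
    """Check all .txt files under `directory` in parallel and return the matches."""
    files = sorted(Path(directory).rglob("*.txt"))
    with Pool() as pool:
        hits = pool.map(partial(file_contains, tag), files)
    return [path for path, hit in zip(files, hits) if hit]
```

On Windows, call `search_for_tag` from inside an `if __name__ == "__main__":` block, because multiprocessing starts its workers by re-importing the script.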
⚡ Format-JSONFiles
Formats JSON files from single-line to multi-line format using the jq command-line JSON processor.
⚡ Format-JSONFilesToSingleLine
Formats JSON files to a single-line format using the jq utility.
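For comparison, both directions can be sketched with Python's standard json module. This is not what the PowerShell scripts run (they call jq); the function names are hypothetical, and the two-space indent is an assumption chosen to match jq's default output.

```python
import json


def to_multi_line(text: str) -> str:
    """Pretty-print a JSON document with a two-space indent."""
    return json.dumps(json.loads(text), indent=2, ensure_ascii=False)


def to_single_line(text: str) -> str:
    """Collapse a JSON document onto one line with no extra whitespace."""
    return json.dumps(json.loads(text), separators=(",", ":"), ensure_ascii=False)
```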
⚡ Get-Seed
Retrieves the ss_seed value from the metadata of a .safetensors file.
⚡ Inspect-Lora
Takes a file path as input and uses Python to read the metadata from a .safetensors file. It then pretty-prints the metadata contents to the console and saves it next to the LoRA.
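Reading the metadata does not require loading any tensors. A .safetensors file starts with an 8-byte little-endian header length followed by a JSON header, and the free-form metadata lives under the `__metadata__` key. The sketch below reads it with the standard library only; the function names are hypothetical and this is not necessarily how Get-Seed or Inspect-Lora do it.

```python
import json
import struct


def read_safetensors_metadata(path: str) -> dict:
    """Return the `__metadata__` dictionary from a .safetensors file header."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header length as a little-endian unsigned 64-bit integer.
        (header_length,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_length).decode("utf-8"))
    return header.get("__metadata__", {})


def get_seed(path: str):
    """Return the ss_seed value as stored (a string), or None if it is absent."""
    return read_safetensors_metadata(path).get("ss_seed")
```

Metadata values are stored as strings, so convert the seed with `int(...)` if you need a number.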

The Eventual Rust Rewrite


I also have dataset-tools, which is a horrible amalgamation of various tools I use.

My gists are also a graveyard of useful snippets and Jupyter notebooks I have at some point used.

But to make it even more complicated, I also stashed my most frequently used scripts and training scripts in k4d3/toolkit on Hugging Face.