Dataset Tools

A “small” collection of Python, PowerShell, and Rust scripts that dataset curators might find handy. A ⚡ in the title means it is a PowerShell script, 🐍 means Python, and 🦀 means Rust, of course!

🦀 extract-metadata
Processes safetensors files, extracts their metadata, converts it into a JSON object, and writes the JSON to a new file. It can process individual files or all safetensors files in a directory.
🦀 format-json
Formats JSON files from single-line to multi-line format using serde_json.
🦀 remove-extra-file-extensions
This Rust script renames text files in a specified directory by removing any extra image extensions (.jpeg, .png, or .jpg) from their names.
๐Ÿ Check for Duplicate Words Between Captions and Tags
This script traverses through a directory, searches for text files, processes each file to extract tags and captions, and highlights occurrences of tags within captions using random colors, displaying the results in a visually rich format in the terminal.
๐Ÿ Check for Large Images
This script checks the resolution of all images in a specified directory and its subdirectories. If the resolution of an image exceeds a certain limit, the path of the image is written to an output file. The script uses multiprocessing to speed up the process.
๐Ÿ Check for Transparency
This script recursively traverses a specified directory, identifying image files with extension .png. For each identified image, it checks if it contains transparency by examining its mode with PIL.
๐Ÿ Convert RGBA to RGB in PNGs
This script automates the process of converting .png images from RGBA to RGB format in a specified directory, utilizing multiprocessing to enhance efficiency.
๐Ÿ Count Images in Folder
This script counts the total number of JPEG and PNG images in a specified directory.
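As a rough illustration of the counting step, here is a minimal sketch. The function name `count_images`, the extension set, and the choice to look only at the top level of the directory are assumptions made for this example, not details taken from the script itself.

```python
from pathlib import Path

# Extensions treated as JPEG or PNG (assumed for this sketch).
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def count_images(directory: str) -> int:
    """Count JPEG and PNG files directly inside `directory` (non-recursive)."""
    return sum(
        1
        for path in Path(directory).iterdir()
        # Compare lowercased suffixes so that "B.PNG" is counted too.
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )
```

Swapping `iterdir()` for `rglob("*")` would extend the count to subdirectories.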
๐Ÿ Create Empty Captions for Images
This Python script creates an empty text file with the same name as each image file (.jpg, .png, or .jpeg) present in a specified directory. The script checks if the directory exists, and then iterates through all the image files in the directory.
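The behaviour described above could look roughly like the sketch below. The name `create_empty_captions` is hypothetical, and skipping captions that already exist is an assumption added here for safety; the original script may behave differently.

```python
from pathlib import Path

# Image extensions named in the description.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def create_empty_captions(directory: str) -> list[Path]:
    """Create an empty .txt file next to each image and return the new paths."""
    root = Path(directory)
    if not root.is_dir():
        raise NotADirectoryError(directory)

    created = []
    for image in sorted(root.iterdir()):
        if image.is_file() and image.suffix.lower() in IMAGE_EXTENSIONS:
            caption = image.with_suffix(".txt")
            # Assumption: never overwrite a caption that is already there.
            if not caption.exists():
                caption.touch()
                created.append(caption)
    return created
```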
๐Ÿ e621 JSON to Caption
This Python script is designed to process JSON files found within a specified directory and its subdirectories. Each JSON file is expected to contain data related to image posts sourced from e621.net or e6ai.net. The script parses these JSON files, extracts relevant information such as image URL, ratings, and tags, and generates caption files (.txt) based on this data.
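To make the JSON-to-caption step concrete, here is a hedged sketch. It assumes each JSON file holds one post with a `rating` code and a `tags` object that maps category names to lists of tags. The source does not confirm that layout, the `RATING_TAGS` mapping, the function names, or the caption naming, and the image URL is left unused.

```python
import json
from pathlib import Path

# Hypothetical mapping from rating codes to caption tags.
RATING_TAGS = {"s": "rating_safe", "q": "rating_questionable", "e": "rating_explicit"}


def caption_from_post(post: dict) -> str:
    """Build a comma-separated caption from one post dictionary."""
    tags = []
    rating = RATING_TAGS.get(post.get("rating"))
    if rating:
        tags.append(rating)
    # Assumed layout: "tags" maps a category name to a list of tag strings.
    for category_tags in post.get("tags", {}).values():
        tags.extend(category_tags)
    return ", ".join(tags)


def write_captions(directory: str) -> int:
    """Write a .txt caption next to every .json file; return how many were written."""
    written = 0
    for json_path in Path(directory).rglob("*.json"):
        post = json.loads(json_path.read_text(encoding="utf-8"))
        # Assumption: the caption shares the JSON file's base name.
        json_path.with_suffix(".txt").write_text(caption_from_post(post), encoding="utf-8")
        written += 1
    return written
```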
๐Ÿ FurryTagger
Loads eva02-vit-large-448-8046, applies it to a set of images in a specified directory, and write the modelโ€™s output tags to a text file for each image.
๐Ÿ Newlines to Commas
Recursively modify the content of .txt files in the specified directory and its subdirectories by replacing newlines with commas and spaces.
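A minimal sketch of that replacement follows. The function name `newlines_to_commas` is hypothetical, and trimming whitespace and dropping blank lines (so that a trailing newline does not leave a dangling comma) are assumptions made here.

```python
from pathlib import Path


def newlines_to_commas(directory: str) -> int:
    """Join the lines of every .txt file under `directory` with ", ".

    Returns the number of files whose content changed.
    """
    changed = 0
    for path in Path(directory).rglob("*.txt"):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8")
        # Assumption: blank lines and surrounding whitespace are discarded.
        parts = [line.strip() for line in text.splitlines() if line.strip()]
        new_text = ", ".join(parts)
        if new_text != text:
            path.write_text(new_text, encoding="utf-8")
            changed += 1
    return changed
```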
๐Ÿ Replace Transparency with Black
This Python script processes all .png images in a specified directory by adding a black layer to each, utilizing multiprocessing to handle the images in parallel for efficiency.
๐Ÿ Search for Tag
This script is used to search for the word “anthrofied” in all .txt files within a specified directory and its subdirectories. It uses multiprocessing to speed up the search by checking multiple files simultaneously.
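One way such a parallel search could be structured is sketched below. The function names are hypothetical, and whole-word matching with a regular expression is an assumption; the actual script may use a plain substring test.

```python
import re
from functools import partial
from multiprocessing import Pool
from pathlib import Path


def file_contains_tag(path: Path, tag: str) -> bool:
    """Return True if `tag` appears as a whole word in the file at `path`."""
    text = path.read_text(encoding="utf-8", errors="ignore")
    return re.search(rf"\b{re.escape(tag)}\b", text) is not None


def search_for_tag(directory: str, tag: str = "anthrofied") -> list[Path]:
    """Check every .txt file under `directory` in parallel and return the matches."""
    paths = sorted(p for p in Path(directory).rglob("*.txt") if p.is_file())
    # Each worker process checks a share of the files.
    with Pool() as pool:
        hits = pool.map(partial(file_contains_tag, tag=tag), paths)
    return [path for path, hit in zip(paths, hits) if hit]
```

On platforms that start worker processes with the spawn method, `search_for_tag` should be called from under an `if __name__ == "__main__":` guard.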
⚡ Format-JSONFiles
Formats JSON files from single-line to multi-line format using the jq command-line JSON processor.
⚡ Format-JSONFilesToSingleLine
Formats JSON files to a single-line format using the jq utility.
⚡ Get-Seed
Retrieves the ss_seed value from the metadata of a .safetensors file.
⚡ Inspect-Lora
Takes a file path as input and uses Python to read the metadata from a .safetensors file. It then pretty-prints the metadata contents to the console and saves it next to the LoRA.
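For readers curious how metadata such as ss_seed can be read, the sketch below parses the .safetensors header with only the standard library. A .safetensors file begins with an 8-byte little-endian length, followed by a JSON header whose optional `__metadata__` key holds string metadata. The function names are hypothetical, and this is not necessarily how the scripts above do it.

```python
import json
import struct


def read_safetensors_metadata(path: str) -> dict:
    """Return the `__metadata__` dictionary from a .safetensors header."""
    with open(path, "rb") as f:
        # The first 8 bytes give the JSON header length as an unsigned 64-bit integer.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    return header.get("__metadata__", {})


def get_seed(path: str):
    """Return the ss_seed metadata value, or None if it is absent."""
    return read_safetensors_metadata(path).get("ss_seed")


def pretty_metadata(path: str) -> str:
    """Format the metadata as indented JSON, ready to print or save."""
    return json.dumps(read_safetensors_metadata(path), indent=2, sort_keys=True)
```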