Dataset Tools
A “small” collection of Python, PowerShell, and Rust scripts that dataset curators might find handy. In each title, ⚡ marks a PowerShell script, 🐍 a Python script, and 🦀 a Rust one, of course!
- 🦀 extract-metadata
- Processes `safetensors` files, extracts their metadata, converts it into a JSON object, and writes the JSON to a new file. It can process individual files or all `safetensors` files in a directory.
- 🦀 format-json
- Formats JSON files from single-line to multi-line format using `serde_json`.
- 🦀 remove-extra-file-extensions
- This Rust script renames text files in a specified directory by removing any extra image extensions (`.jpeg`, `.png`, or `.jpg`) from their names.
- 🐍 Check for Duplicate Words Between Captions and Tags
- This script traverses a directory, finds text files, extracts tags and captions from each one, and highlights occurrences of tags within captions using random colors, displaying the results in a visually rich format in the terminal.
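The highlighting step described above can be sketched roughly as follows. This is a minimal illustration, not the script itself: the function name `highlight_tags`, the ANSI color list, and the whole-word, case-insensitive matching are all assumptions, and how the real script separates tags from captions inside a file is not covered here.

```python
import random
import re

# ANSI foreground color codes to pick from (an assumed palette).
ANSI_COLORS = [31, 32, 33, 34, 35, 36]
RESET = "\033[0m"


def highlight_tags(caption, tags, rng=random):
    """Wrap every whole-word occurrence of a tag in `caption` in a random ANSI color."""
    tags = [tag for tag in tags if tag]
    if not tags:
        return caption
    # One random color per tag, so repeated hits of the same tag look the same.
    colors = {tag.lower(): rng.choice(ANSI_COLORS) for tag in tags}
    # Longest tags first, so "red fox" wins over "fox" when both are tags.
    alternation = "|".join(re.escape(t) for t in sorted(tags, key=len, reverse=True))
    pattern = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)

    def paint(match):
        code = colors[match.group(0).lower()]
        return f"\033[{code}m{match.group(0)}{RESET}"

    # A single pass over the caption avoids re-matching inside inserted color codes.
    return pattern.sub(paint, caption)
```

Printing `highlight_tags("a red fox sitting", ["fox", "tree"])` in a terminal that understands ANSI escapes should show the word "fox" in color and leave the rest of the caption untouched.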
- 🐍 Check for Large Images
- This script checks the resolution of all images in a specified directory and its subdirectories. If the resolution of an image exceeds a certain limit, the path of the image is written to an output file. The script uses multiprocessing to speed up the process.
- 🐍 Check for Transparency
- This script recursively traverses a specified directory, identifying image files with the `.png` extension. For each identified image, it checks whether it contains transparency by examining its mode with PIL.
- 🐍 Convert RGBA to RGB in PNGs
- This script converts `.png` images in a specified directory from RGBA to RGB format, using multiprocessing to speed up the work.
- 🐍 Count Images in Folder
- This script counts the total number of JPEG and PNG images in a specified directory.
- 🐍 Create Empty Captions for Images
- This Python script creates an empty text file with the same name as each image file (.jpg, .png, or .jpeg) present in a specified directory. The script checks if the directory exists, and then iterates through all the image files in the directory.
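A minimal sketch of the caption-creation behaviour described above, using only the standard library. The function name `create_empty_captions` is hypothetical, and two details are assumptions rather than facts about the real script: extensions are matched case-insensitively, and an existing caption file is left alone instead of being overwritten.

```python
from pathlib import Path

# Image extensions named in the description.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def create_empty_captions(directory):
    """Create an empty .txt next to each image in `directory`; return the new paths."""
    folder = Path(directory)
    # Check that the directory exists before iterating over it.
    if not folder.is_dir():
        raise NotADirectoryError(f"{folder} is not a directory")

    created = []
    for image in sorted(folder.iterdir()):
        if image.is_file() and image.suffix.lower() in IMAGE_EXTENSIONS:
            caption = image.with_suffix(".txt")
            # Assumption: never clobber a caption that already exists.
            if not caption.exists():
                caption.touch()
                created.append(caption)
    return created
```

Returning the list of created files makes it easy to print a summary or to undo the run.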
- 🐍 e621 JSON to Caption
- This Python script processes JSON files found within a specified directory and its subdirectories. Each JSON file is expected to contain data about image posts sourced from e621.net or e6ai.net. The script parses these files, extracts relevant information such as image URL, ratings, and tags, and generates caption files (`.txt`) from this data.
- 🐍 FurryTagger
- Loads `eva02-vit-large-448-8046`, applies it to a set of images in a specified directory, and writes the model’s output tags to a text file for each image.
- 🐍 Newlines to Commas
- Recursively modifies the content of `.txt` files in the specified directory and its subdirectories by replacing newlines with commas and spaces.
- 🐍 Replace Transparency with Black
- This Python script processes all `.png` images in a specified directory by adding a black layer to each, utilizing multiprocessing to handle the images in parallel for efficiency.
- 🐍 Search for Tag
- This script searches for the word “anthrofied” in all .txt files within a specified directory and its subdirectories. It uses multiprocessing to speed up the search by checking multiple files simultaneously.
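The parallel search described above might look something like the sketch below. The function names `file_contains_word` and `search_directory` are hypothetical, and reading files as UTF-8 while ignoring decode errors is an assumption; only the search word and the use of multiprocessing come from the description.

```python
import sys
from multiprocessing import Pool
from pathlib import Path

SEARCH_WORD = "anthrofied"  # the word named in the description


def file_contains_word(path):
    """Return `path` if the file's text contains SEARCH_WORD, otherwise None."""
    try:
        text = Path(path).read_text(encoding="utf-8", errors="ignore")
    except OSError:
        # Unreadable or missing files are treated as non-matches.
        return None
    return path if SEARCH_WORD in text else None


def search_directory(directory, processes=None):
    """Check every .txt file under `directory` using a pool of worker processes."""
    files = [str(p) for p in Path(directory).rglob("*.txt")]
    with Pool(processes) as pool:
        results = pool.map(file_contains_word, files)
    return sorted(hit for hit in results if hit is not None)


# Usage: python search_for_tag.py <directory>  (the file name is illustrative)
if __name__ == "__main__" and len(sys.argv) == 2:
    for match in search_directory(sys.argv[1]):
        print(match)
```

The worker function lives at module level so that it can be sent to the worker processes, and the pool is only started under the `__main__` guard, which is required on platforms that spawn rather than fork.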
- ⚡ Format-JSONFiles
- Formats JSON files from single-line to multi-line format using the `jq` command-line JSON processor.
- ⚡ Format-JSONFilesToSingleLine
- Formats JSON files to a single-line format using the `jq` utility.
- ⚡ Get-Seed
- Retrieves the `ss_seed` value from the metadata of a `.safetensors` file.
- ⚡ Inspect-Lora
- Takes a file path as input and uses Python to read the metadata from a `.safetensors` file. It then pretty-prints the metadata contents to the console and saves them next to the LoRA.
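Several of the tools above (extract-metadata, Get-Seed, Inspect-Lora) read metadata from `.safetensors` files. As a rough illustration of what that involves, here is a standard-library sketch that relies on the published file layout: an 8-byte little-endian header length, followed by a JSON header whose optional `__metadata__` key holds string key/value pairs. The function names are hypothetical, and the actual scripts may well use the `safetensors` library or a different approach.

```python
import json
import struct


def read_safetensors_metadata(path):
    """Return the `__metadata__` dict from a .safetensors header (empty if absent)."""
    with open(path, "rb") as handle:
        # The first 8 bytes hold the header size as a little-endian unsigned 64-bit integer.
        (header_size,) = struct.unpack("<Q", handle.read(8))
        # The next `header_size` bytes are a UTF-8 JSON document describing the tensors.
        header = json.loads(handle.read(header_size).decode("utf-8"))
    return header.get("__metadata__", {})


def get_seed(path):
    """Return the `ss_seed` metadata value, or None when it is not present."""
    return read_safetensors_metadata(path).get("ss_seed")
```

Because only the header is read, this stays fast even for multi-gigabyte files. Pretty-printing the result is then a matter of `print(json.dumps(metadata, indent=2))`.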