An easy-to-use image-captioning tool for dataset annotation. Need to caption your images? Inscriptor has you covered. It is an implementation built on diffusers, utilizing BLIP-2's zero-shot instructed vision-to-language generation. It runs best with the `blip2-opt-6.7b-coco` model, but you can switch to any other model in the BLIP-2 family with lower or higher requirements.
Recommended hardware: 24 GB VRAM and 48 GB RAM.
Install PyTorch, either with pip:

```shell
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

or with conda:

```shell
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
```

or follow the instructions at https://pytorch.org/get-started/locally/.
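To confirm the install matches the hardware requirements above, a quick sanity check can report whether PyTorch sees a CUDA device and how much VRAM it has. This is a hedged sketch, not part of Inscriptor; the helper name `torch_cuda_status` is made up for illustration.

```python
import importlib.util

def torch_cuda_status() -> str:
    """Report whether PyTorch is installed and whether CUDA is usable.

    Returns a short human-readable status string. (Illustrative helper,
    not part of Inscriptor itself.)
    """
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch  # imported lazily so the check runs even without PyTorch
    if not torch.cuda.is_available():
        return "torch installed, CUDA not available"
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    return f"CUDA OK: {props.name}, {vram_gb:.0f} GB VRAM"

print(torch_cuda_status())
```

If the reported VRAM is below ~24 GB, consider one of the smaller BLIP-2 models.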
Coming soon
Inside `inscriptor-mass-captioning.ipynb`, point `imagesDirectory` to your local dataset directory. The dataset directory should contain images in any of these formats: `.jpg`, `.jpeg`, `.png`, `.webp`, `.gif`. It can also contain subfolders with their own images; Inscriptor will search the directories recursively. The names of these subfolders can be used as tokens to add to the caption. For example, if you have a folder named `cat` and another named `dog`, the captions for images inside them will contain the tokens `cat` and `dog`. For each image, Inscriptor generates a `.txt` file in the same location as the original image, with the same filename. The `.txt` file contains the caption generated by Inscriptor.