SmolVLM is a compact open multimodal model that accepts arbitrary sequences of image and text inputs to produce text outputs. Designed for efficiency, SmolVLM can answer questions about images, describe visual content, create stories grounded on multiple images, or function as a pure language model without visual inputs. Its lightweight architecture makes it suitable for on-device applications while maintaining strong performance on multimodal tasks.
- Developed by: Hugging Face 🤗
- Model type: Multi-modal model (image+text)
- Language(s) (NLP): English
- License: Apache 2.0
- Architecture: Based on Idefics3 (see technical summary)
- Demo: SmolVLM Demo
- Blog: Blog post
SmolVLM can be used for inference on multimodal (image + text) tasks where the input comprises text queries along with one or more images. Text and images can be interleaved arbitrarily, enabling tasks like image captioning, visual question answering, and storytelling based on visual content. The model does not support image generation.
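For reference, a minimal sketch of running SmolVLM inference with plain transformers (outside of ComfyUI); the image path and generation settings are placeholders:

```python
# Minimal sketch of SmolVLM inference with transformers (outside ComfyUI).
# "image.jpg" and max_new_tokens are illustrative placeholders.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct", torch_dtype=torch.bfloat16
).to(device)

image = Image.open("image.jpg")

# Interleave image and text in a chat-style message.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Can you describe this image?"},
        ],
    }
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device)

generated_ids = model.generate(**inputs, max_new_tokens=500)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```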
To fine-tune SmolVLM on a specific task, you can follow the fine-tuning tutorial.
SmolVLM leverages the lightweight SmolLM2 language model to provide a compact yet powerful multimodal experience. It introduces several changes compared to previous Idefics models:
- Image compression: We introduce more aggressive image compression than Idefics3, enabling faster inference and lower RAM usage.
- Visual Token Encoding: SmolVLM uses 81 visual tokens to encode image patches of size 384×384. Larger images are divided into patches, each encoded separately, enhancing efficiency without compromising performance.
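To get a rough sense of the visual token budget, here is a small illustrative calculation; it assumes simple tiling into 384×384 patches and ignores any resizing or global image view the actual preprocessor may apply:

```python
# Rough estimate of visual tokens for an image tiled into 384x384 patches.
# This is an approximation; the real preprocessing follows the Idefics3 scheme.
import math

PATCH_SIZE = 384        # side length of each sub-image
TOKENS_PER_PATCH = 81   # visual tokens per 384x384 patch

def approx_visual_tokens(width: int, height: int) -> int:
    cols = math.ceil(width / PATCH_SIZE)
    rows = math.ceil(height / PATCH_SIZE)
    return cols * rows * TOKENS_PER_PATCH

print(approx_visual_tokens(384, 384))    # 81
print(approx_visual_tokens(1536, 768))   # 4 * 2 * 81 = 648
```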
Clone this repository into the `ComfyUI/custom_nodes` folder.
Install the dependencies listed in `requirements.txt`; transformers version 4.38.0 or newer is required:
pip install -r requirements.txt
Example workflows:

- Use as a single-image captioning node
- Combine a simple caption with a tag caption and save the results to output files

(Save the example image and drag it into ComfyUI to try it.)
The model should be downloaded automatically the first time you use the node. If that doesn't happen, you can download it manually:
HuggingFaceTB/SmolVLM-Instruct
The downloaded model will be placed under the `ComfyUI/LLM` folder.
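A sketch of a manual download using `huggingface_hub`; the target path follows the note above, and the `SmolVLM-Instruct` subfolder name is an assumption you may need to adjust to your install:

```python
# Sketch of a manual model download; the local_dir layout is an assumption.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="HuggingFaceTB/SmolVLM-Instruct",
    local_dir="ComfyUI/LLM/SmolVLM-Instruct",  # assumed subfolder under ComfyUI/LLM
)
```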