SmolVLM is a model that processes visual input and generates textual output. It stands out by requiring significantly fewer GPU resources than comparable models, roughly half the memory in Hugging Face's comparisons.
Hugging Face describes SmolVLM as an “open multimodal model” that accepts arbitrary combinations of visual and text input and generates text output. The model is versatile: It can answer questions about images, describe visual content, create stories based on multiple images, or function as a traditional language model without visual input.
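To illustrate that versatility, the sketch below shows how such a model could be asked a question about an image via the Transformers library. The checkpoint name HuggingFaceTB/SmolVLM-Instruct, the chat format, and the file example.jpg are assumptions for illustration; the model card on Hugging Face is the authoritative reference.

```python
# Minimal sketch: visual question answering with a SmolVLM-style checkpoint.
# Model ID, prompt format, and image path are assumptions, not confirmed details.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-Instruct"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda" if torch.cuda.is_available() else "cpu")

image = Image.open("example.jpg")  # placeholder image path

# Chat-style prompt combining an image placeholder with a text question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=200)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```

The same setup works without an image, in which case the model behaves like a regular text-only language model.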
SmolVLM can be an interesting option for companies, especially given the high cost of implementing large language models in organizations. Multimodal models, which process both text and visual input, can be particularly costly because they demand substantial computing resources.
A new approach
For SmolVLM, Hugging Face significantly modified the architecture, resulting in a model that needs only 5.02 GB of RAM. That is far less than, for example, InternVL2 2B, which requires 10.52 GB. This efficiency makes SmolVLM suitable for on-device applications while still delivering strong performance.
Technically, Hugging Face applies a new image compression method that lets the model run inference faster while using less RAM. SmolVLM uses 81 visual tokens to encode image patches of 384×384 pixels. Larger images are divided into patches that are encoded separately. This keeps the model efficient without compromising performance.
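To make that token budget concrete, here is a rough, illustrative calculation of how many visual tokens an image would cost under the 384×384-patch, 81-tokens-per-patch scheme described above. The actual SmolVLM preprocessing may differ, for instance in how it resizes images or adds a global view, so treat this as a back-of-the-envelope sketch only.

```python
# Back-of-the-envelope estimate of the visual-token cost described above.
# Assumes, purely for illustration, that an image is tiled into 384x384 patches
# and that each patch is encoded with 81 visual tokens.
import math

PATCH_SIZE = 384        # assumed patch edge length in pixels
TOKENS_PER_PATCH = 81   # visual tokens per encoded patch

def visual_token_estimate(width: int, height: int) -> int:
    """Rough visual-token count for an image split into 384x384 patches."""
    patches_x = math.ceil(width / PATCH_SIZE)
    patches_y = math.ceil(height / PATCH_SIZE)
    return patches_x * patches_y * TOKENS_PER_PATCH

# Example: a 1536x1024 photo tiles into 4 x 3 = 12 patches -> 972 visual tokens.
print(visual_token_estimate(1536, 1024))
```

Keeping the per-patch token count this low is what holds memory usage down even when several images are fed to the model at once.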