BLIP (Bootstrapping Language‑Image Pre‑training) is a vision‑language model that unifies image and text understanding in a single framework. This post dives into BLIP’s architecture and pre‑training objectives, then shows you how to run it locally for image captioning, visual question answering, and cross‑modal retrieval.
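
To give a quick taste before we dig into the details, here is a minimal captioning sketch using the Hugging Face `transformers` implementation of BLIP. The checkpoint name and example image URL are illustrative choices, not requirements; any BLIP captioning checkpoint or local image file works the same way. It assumes `transformers`, `torch`, `Pillow`, and `requests` are installed.

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Illustrative checkpoint; other BLIP captioning checkpoints work too.
ckpt = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(ckpt)
model = BlipForConditionalGeneration.from_pretrained(ckpt)

# Example image; swap in your own with Image.open("photo.jpg").
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Preprocess the image, generate caption tokens, and decode them to text.
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```

Running this prints a short natural‑language caption for the image; the sections below unpack what happens under the hood and extend the same setup to VQA and retrieval.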