Vision Language Model - Labellerr AI

Vision Language Model

A collection of 4 posts

Gemma 4 12B Tutorial

Gemma 4 12B: Run Locally, Fine-Tune, Benchmark Performance

Google Gemma 4 12B introduces an encoder-free multimodal architecture that natively processes text, images, audio, and video. Learn how it works, see its benchmark results and fine-tuning advantages, and find out how to run it locally on consumer hardware.

Advanced Vision Language Models: Gemma 3 And 3N Explained

Gemma 3 represents a leap in vision-language AI, featuring SigLIP-based visual encoders, context windows of up to 128k tokens, and state-of-the-art multilingual and function-calling capabilities.

Vision Language Model

BLIP Explained: Use It For VQA & Captioning

BLIP (Bootstrapping Language‑Image Pre‑training) is a Vision‑Language Model that fuses image and text understanding. This post covers BLIP’s architecture and training tasks, and shows you how to set it up locally for captioning, visual QA, and cross‑modal retrieval.