YOLO YOLO11 vs YOLOv8: Model Comparison A detailed expert comparison of YOLOv8 and YOLO11 object detection models, covering performance, accuracy, hardware needs, and practical recommendations for developers and researchers.
YOLOv12 Building a Pill Counting System with Labellerr and YOLO Fine-tuning YOLO for pill counting enables accurate detection and tracking of pills in pharmaceutical settings. Learn how to customize YOLO for your dataset to handle overlapping pills, varied lighting, and real-time counting tasks efficiently.
dino DINOv3 Explained: The Future of Self-Supervised Learning DINOv3 is Meta’s open-source vision backbone trained on over a billion images using self-supervised learning. It provides pretrained models, adapters, training code, and deployment support for advanced, annotation-free vision solutions.
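For a taste of what an annotation-free backbone buys you, here is a minimal feature-extraction sketch. The torch.hub entry point name `dinov3_vits16` is an assumption modeled on DINOv2's naming convention; check the facebookresearch/dinov3 repository for the exact model names and weight-access instructions.

```python
import torch
from PIL import Image
from torchvision import transforms

# Hypothetical hub entry point, patterned on DINOv2's "dinov2_vits14";
# DINOv3 weights may be gated and require an extra weights argument.
backbone = torch.hub.load("facebookresearch/dinov3", "dinov3_vits16")
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

image = preprocess(Image.open("photo.jpg")).unsqueeze(0)  # placeholder path
with torch.no_grad():
    features = backbone(image)   # one dense embedding per image, no labels needed
print(features.shape)
```

Features like these can feed a small linear head for classification, retrieval, or segmentation without ever fine-tuning the backbone.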
cvpr CVPR 2025: Breakthroughs in GenAI and Computer Vision CVPR 2025 (June 11–15, Music City Center, Nashville & virtual) features top-tier computer vision research: 3D modeling, multimodal AI, embodied agents, AR/VR, deep learning, workshops, demos, art exhibits and robotics innovations.
AI KOSMOS-2 Explained: Microsoft’s Multimodal Marvel KOSMOS-2 brings grounding to vision-language models, letting AI pinpoint visual regions based on text. In this blog, I explore how well it performs through real-world experiments and highlight both its promise and limitations in grounding and image understanding.
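If you want to try grounded captioning yourself, a minimal sketch using the Hugging Face transformers integration of KOSMOS-2 looks like this (the image path is a placeholder):

```python
from PIL import Image
from transformers import AutoProcessor, Kosmos2ForConditionalGeneration

model = Kosmos2ForConditionalGeneration.from_pretrained("microsoft/kosmos-2-patch14-224")
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")

image = Image.open("street.jpg")
prompt = "<grounding>An image of"   # the <grounding> token requests region links

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=64)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# post_process_generation splits the caption from the grounded entities
caption, entities = processor.post_process_generation(generated_text)
print(caption)
print(entities)   # [(phrase, (start, end), [(x1, y1, x2, y2), ...]), ...]
```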
cvpr CVPR 2025: Breakthroughs in Object Detection & Segmentation CVPR 2025 (June 11–15, Music City Center, Nashville & virtual) spotlights the latest object detection and segmentation research alongside 3D modeling, multimodal AI, embodied agents, AR/VR, deep learning, workshops, demos, art exhibits and robotics innovations.
Vision Language Model BLIP Explained: Use It For VQA & Captioning BLIP (Bootstrapping Language-Image Pre-training) is a Vision-Language Model that fuses image and text understanding. This blog dives into BLIP's architecture and training tasks, and shows you how to set it up locally for captioning, visual QA, and cross-modal retrieval.
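As a quick taste of the local setup the post walks through, here is a minimal captioning sketch using the transformers BLIP checkpoint (the image path is a placeholder):

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("dog.jpg")

# Unconditional captioning: the model describes the image from scratch
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))

# Conditional captioning: a text prefix steers the caption
inputs = processor(images=image, text="a photograph of", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```

Swapping in `BlipForQuestionAnswering` with the `Salesforce/blip-vqa-base` checkpoint gives the VQA variant with the same processor pattern.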
object tracking Learn DeepSORT: Real-Time Object Tracking Guide Learn to implement DeepSORT for robust multi-object tracking in videos. This guide covers setup, integration with detectors like YOLO for real-time use.
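A minimal sketch of the detector-plus-tracker loop, assuming the `deep-sort-realtime` package and an Ultralytics YOLO detector (both are assumptions; the guide may use different versions):

```python
# pip install deep-sort-realtime ultralytics
import cv2
from ultralytics import YOLO
from deep_sort_realtime.deepsort_tracker import DeepSort

detector = YOLO("yolov8n.pt")      # any YOLO detector works here
tracker = DeepSort(max_age=30)     # drop tracks unseen for 30 frames

cap = cv2.VideoCapture("traffic.mp4")   # placeholder video path
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    result = detector(frame, verbose=False)[0]
    # DeepSort expects ([left, top, width, height], confidence, class) tuples
    detections = []
    for box in result.boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        detections.append(([x1, y1, x2 - x1, y2 - y1], float(box.conf), int(box.cls)))
    for track in tracker.update_tracks(detections, frame=frame):
        if track.is_confirmed():
            print(track.track_id, track.to_ltrb())   # stable ID + current box
cap.release()
```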
Vision-language models How to Fine-Tune Llama 3.2 Vision On a Custom Dataset? Unlock advanced multimodal AI by fine-tuning Llama 3.2 Vision on your own dataset. Follow this guide through Unsloth, NeMo 2.0, and Hugging Face workflows to customize image-text reasoning for OCR, VQA, captioning, and more.
object tracking How to Implement ByteTrack for Multi-Object Tracking This blog walks through a code implementation of ByteTrack, which combines high- and low-confidence detections to maintain consistent object IDs across frames. By matching strong detections first and “rescuing” weaker ones, it excels at tracking in cluttered or occluded scenes.
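The core two-stage idea fits in a few lines. Below is a self-contained sketch of ByteTrack-style association (IoU matching solved with the Hungarian algorithm); a full tracker adds Kalman prediction and track lifecycle management on top of this:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(tracks, dets):
    """Pairwise IoU between two lists of [x1, y1, x2, y2] boxes."""
    m = np.zeros((len(tracks), len(dets)))
    for i, t in enumerate(tracks):
        for j, d in enumerate(dets):
            ix1, iy1 = max(t[0], d[0]), max(t[1], d[1])
            ix2, iy2 = min(t[2], d[2]), min(t[3], d[3])
            inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
            union = ((t[2] - t[0]) * (t[3] - t[1]) +
                     (d[2] - d[0]) * (d[3] - d[1]) - inter)
            m[i, j] = inter / (union + 1e-9)
    return m

def byte_associate(tracks, boxes, scores, high=0.6, min_iou=0.3):
    """Two-stage ByteTrack-style association; returns (track_idx, det_idx) pairs.

    Stage 1 matches high-confidence detections; stage 2 "rescues" weak ones
    against whatever tracks are still unmatched (often occluded objects).
    """
    matches, unmatched = [], list(range(len(tracks)))
    for mask in (scores >= high, scores < high):
        det_idx = np.flatnonzero(mask)
        if not unmatched or det_idx.size == 0:
            continue
        cost = 1.0 - iou_matrix([tracks[i] for i in unmatched], boxes[det_idx])
        rows, cols = linear_sum_assignment(cost)
        keep = [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= min_iou]
        matches += [(unmatched[r], int(det_idx[c])) for r, c in keep]
        taken = {r for r, _ in keep}
        unmatched = [t for k, t in enumerate(unmatched) if k not in taken]
    return matches

# Toy example: the low-confidence (0.3) detection is still rescued in stage 2
tracks = [[0, 0, 10, 10], [20, 20, 30, 30]]
boxes = np.array([[1, 1, 11, 11], [21, 21, 31, 31]])
scores = np.array([0.9, 0.3])
print(byte_associate(tracks, boxes, scores))   # [(0, 0), (1, 1)]
```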
computer vision Best Open-Source Vision Language Models of 2025 Discover the leading open-source vision-language models (VLMs) of 2025 including Qwen 2.5 VL, LLaMA 3.2 Vision, and DeepSeek-VL. This guide compares key specs, encoders, and capabilities like OCR, reasoning, and multilingual support.
Llama A Hands-On Guide to Meta's Llama 3.2 Vision Explore Meta’s Llama 3.2 Vision in this hands-on guide. Learn how to use its multimodal image-text capabilities, deploy the model via AWS or locally, and apply it to real-world use cases like OCR, VQA, and visual reasoning across industries.
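For local inference, a minimal transformers sketch looks like the following; note the checkpoint is gated behind Meta's license, and the image path and prompt here are placeholders:

```python
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"   # gated: requires access approval
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("invoice.png")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Extract the total amount from this invoice."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(out[0], skip_special_tokens=True))
```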
segmentation SegGPT Demo + Code: Next-Gen Segmentation is Here SegGPT is a versatile, unified vision model that performs semantic, instance, panoptic, and niche-domain segmentation via in-context “color-in” prompting—no task-specific fine-tuning required, instantly adapting to new classes from just a few annotated examples.
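A sketch of that in-context flow, assuming the transformers SegGPT integration and the `BAAI/seggpt-vit-large` checkpoint (all paths are placeholders): one annotated prompt pair steers segmentation of a new image, with no fine-tuning step anywhere.

```python
import torch
from PIL import Image
from transformers import SegGptImageProcessor, SegGptForImageSegmentation

checkpoint = "BAAI/seggpt-vit-large"
processor = SegGptImageProcessor.from_pretrained(checkpoint)
model = SegGptForImageSegmentation.from_pretrained(checkpoint)

# One annotated example "teaches" the task; the model colors in the target image
prompt_image = Image.open("example_image.png")
prompt_mask = Image.open("example_mask.png")
target_image = Image.open("new_image.png")

inputs = processor(images=target_image, prompt_images=prompt_image,
                   prompt_masks=prompt_mask, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

mask = processor.post_process_semantic_segmentation(
    outputs, target_sizes=[target_image.size[::-1]])[0]
```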
Semantic segmentation SegFormer Tutorial: Master Semantic Segmentation Fast Learn how SegFormer uses Transformers and MLPs to perform semantic segmentation, then implement SegFormer yourself.
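A minimal inference sketch with the transformers SegFormer integration, using an ADE20K-finetuned checkpoint (the image path is a placeholder):

```python
import torch
from PIL import Image
from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation

ckpt = "nvidia/segformer-b0-finetuned-ade-512-512"
processor = SegformerImageProcessor.from_pretrained(ckpt)
model = SegformerForSemanticSegmentation.from_pretrained(ckpt)

image = Image.open("room.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits      # (1, num_classes, H/4, W/4)

# Upsample to the input resolution and take the per-pixel argmax
mask = torch.nn.functional.interpolate(
    logits, size=image.size[::-1], mode="bilinear", align_corners=False
).argmax(dim=1)[0]
print(mask.shape, mask.unique())         # per-pixel class IDs
```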
computer vision The Ultimate YOLO-NAS Guide (2025): What It Is & How to Use Explore YOLO-NAS! This guide explains how Neural Architecture Search (NAS) was used to design its highly efficient and accurate object detection models for diverse hardware.
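A minimal inference sketch, assuming the `super-gradients` package that ships YOLO-NAS (the image path is a placeholder):

```python
# pip install super-gradients
from super_gradients.training import models

# Pretrained checkpoints come in S/M/L variants trading speed for accuracy
model = models.get("yolo_nas_s", pretrained_weights="coco")

predictions = model.predict("street.jpg", conf=0.5)
predictions.show()            # or predictions.save("out.jpg")
```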
YOLO The Only YOLOv11 Multi-Labeling Guide You’ll Ever Need This guide details how to perform all YOLOv11 vision tasks: detection, segmentation, pose estimation, and more.
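The unified API is the main draw: every task uses the same `YOLO` class, only the checkpoint changes. Note that Ultralytics publishes these weights under the `yolo11*` naming (without the "v"). A minimal sketch (the image path is a placeholder):

```python
from ultralytics import YOLO

# Same API for every task; only the pretrained checkpoint differs
detect   = YOLO("yolo11n.pt")        # object detection
segment  = YOLO("yolo11n-seg.pt")    # instance segmentation
pose     = YOLO("yolo11n-pose.pt")   # pose / keypoint estimation
classify = YOLO("yolo11n-cls.pt")    # image classification

results = segment("street.jpg")
for r in results:
    print(r.boxes.xyxy)              # boxes for each detected instance
    print(r.masks.data)              # per-instance segmentation masks
```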
computer vision Computer Vision in Security & Surveillance Explore how computer vision is revolutionizing security and surveillance, enabling real-time threat detection, facial recognition, and automated monitoring to enhance safety and operational efficiency across various sectors.
Vision Agent Vision Agent Using SAM: Description-Based Object Segmentation Build Vision Agents using Segment Anything (SAM)! Learn how to combine text descriptions (as with Grounding DINO) and SAM for powerful, zero-shot object segmentation, bypassing traditional training needs. Understand and build your own description-based vision agent.
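A condensed sketch of the text-to-mask pipeline using the transformers integrations of Grounding DINO and SAM; the checkpoints, query text, and image path are illustrative choices, not the article's exact setup:

```python
import torch
from PIL import Image
from transformers import (AutoProcessor, GroundingDinoForObjectDetection,
                          SamModel, SamProcessor)

device = "cuda" if torch.cuda.is_available() else "cpu"
image = Image.open("scene.jpg")

# Step 1: text description -> boxes (Grounding DINO; queries are lowercase + period)
gd_proc = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-tiny")
gd = GroundingDinoForObjectDetection.from_pretrained("IDEA-Research/grounding-dino-tiny").to(device)
inputs = gd_proc(images=image, text="a red backpack.", return_tensors="pt").to(device)
with torch.no_grad():
    out = gd(**inputs)
boxes = gd_proc.post_process_grounded_object_detection(
    out, inputs.input_ids, box_threshold=0.35, text_threshold=0.25,
    target_sizes=[image.size[::-1]])[0]["boxes"]

# Step 2: boxes -> pixel masks (SAM); no task-specific training anywhere
sam_proc = SamProcessor.from_pretrained("facebook/sam-vit-base")
sam = SamModel.from_pretrained("facebook/sam-vit-base").to(device)
sam_in = sam_proc(image, input_boxes=[boxes.tolist()], return_tensors="pt").to(device)
with torch.no_grad():
    sam_out = sam(**sam_in)
masks = sam_proc.image_processor.post_process_masks(
    sam_out.pred_masks.cpu(), sam_in["original_sizes"].cpu(),
    sam_in["reshaped_input_sizes"].cpu())
```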
object detection RT-DETRv2 Beats YOLO? Full Comparison + Tutorial Explore a comparison of RT-DETR and RT-DETRv2, transformer-powered real-time object detectors, and learn how to implement RT-DETRv2 using Hugging Face.
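A minimal inference sketch via transformers; RT-DETRv2 support requires a recent release, and the `PekingU/rtdetr_v2_r18vd` checkpoint below is one of several converted variants:

```python
import torch
from PIL import Image
from transformers import RTDetrImageProcessor, RTDetrV2ForObjectDetection

processor = RTDetrImageProcessor.from_pretrained("PekingU/rtdetr_v2_r18vd")
model = RTDetrV2ForObjectDetection.from_pretrained("PekingU/rtdetr_v2_r18vd")

image = Image.open("street.jpg")   # placeholder path
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.5)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 2), box.tolist())
```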
computer vision How to Perform Object Detection Tasks Using OWL v2 Explore how to implement OWLv2, a powerful open-vocabulary object detection model. Learn about its zero-shot capabilities, classification, guided image query, and how it understands text and images together for real-world use.
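Here is what zero-shot querying looks like in practice with the transformers OWLv2 integration; the text queries and image path are placeholders:

```python
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

image = Image.open("kitchen.jpg")
texts = [["a photo of a cat", "a photo of a coffee mug"]]   # free-form queries, no training

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs, threshold=0.2, target_sizes=target_sizes)
for score, label, box in zip(results[0]["scores"], results[0]["labels"], results[0]["boxes"]):
    print(texts[0][label], round(score.item(), 2), box.tolist())
```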
computer vision How To Perform Vision Tasks Using Florence 2 Discover how Florence 2 handles a wide range of vision tasks, from object detection to OCR, using just prompts. Learn how this unified vision model simplifies complex workflows without sacrificing accuracy.
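The prompt-as-task interface is easy to try with transformers (Florence-2 ships custom modeling code, hence `trust_remote_code=True`); the task token and image path below are illustrative:

```python
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)

image = Image.open("receipt.jpg")
task = "<OCR>"   # other task prompts: <OD>, <CAPTION>, <DENSE_REGION_CAPTION>, ...

inputs = processor(text=task, images=image, return_tensors="pt")
ids = model.generate(input_ids=inputs["input_ids"],
                     pixel_values=inputs["pixel_values"], max_new_tokens=256)
text = processor.batch_decode(ids, skip_special_tokens=False)[0]

# The post-processor parses the raw output into task-specific structure
print(processor.post_process_generation(text, task=task,
                                        image_size=(image.width, image.height)))
```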
computer vision How Computer Vision Powers Autonomous Vehicles Computer vision helps self-driving cars “see” and understand their surroundings using AI, cameras, LiDAR, and radar. It powers object detection, lane tracking, and decision-making in real time, making autonomous vehicles smarter, safer, and ready for complex road conditions.
computer vision How To Fine-Tune YOLO For Pose Estimation On Custom Dataset Fine-tuning YOLO for pose estimation on a custom dataset allows for precise keypoint detection tailored to specific applications like sports analytics, healthcare, and robotics. In this guide, we cover everything from dataset annotation and keypoint formatting to model training and fine-tuning.
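Once the keypoint dataset is annotated and formatted, the Ultralytics training call itself is short; `pose-data.yaml` below is a placeholder for your dataset config, which must declare `kpt_shape` along with the train/val paths:

```python
from ultralytics import YOLO

# "pose-data.yaml" is a placeholder: the config declares kpt_shape
# (keypoints x dims) plus train/val image paths in Ultralytics format.
model = YOLO("yolo11n-pose.pt")     # start from a pretrained pose checkpoint
model.train(data="pose-data.yaml", epochs=100, imgsz=640, batch=16)

metrics = model.val()               # keypoint mAP on the validation split
results = model("athlete.jpg")      # placeholder image; see results[0].keypoints.xy
```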
Vision-Language-Action Models How Vision-Language-Action Models Power Humanoid Robots Vision-Language-Action (VLA) models are transforming robotics by integrating visual perception, natural language understanding, and real-world actions. This groundbreaking AI approach enables robots to comprehend and interact with their environment like never before.
Image Annotation The Role of Image Labeling in Computer Vision Explore the critical role of image labeling in computer vision, where annotated data enables AI models to recognize and interpret visual information accurately.