CVPR 2025: Breakthroughs in Object Detection & Segmentation

The annual IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) continues to be the premier stage for groundbreaking research.

The 2025 conference showcases significant advancements in how machines perceive and understand the visual world.

This first installment of our CVPR 2025 series explores key papers in data annotation and object detection, highlighting models that learn with less human supervision and achieve new levels of accuracy in identifying and segmenting objects.

The following tables give a brief summary of the papers covered in this article.

Data Labeling and Annotation

| Model | Summary | Paper Link |
| --- | --- | --- |
| VideoGLaMM | VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos. A large multimodal model for fine-grained, pixel-level visual grounding in videos. Connects a large language model with dual vision encoders and a spatio-temporal decoder, enabling the model to generate object masks linked to natural language queries and movements in video content. | CVPR |
| DOtA | Learning to Detect Objects from Multi-Agent LiDAR Scans without Manual Labels. Detect Objects from Multi-Agent LiDAR scans (DOtA) is an unsupervised 3D object detection method that leverages multi-agent viewpoints to generate high-quality detections for autonomous driving, without manual labels. | CVPR |
| PointSR | PointSR: Self-Regularized Point Supervision for Drone-View Object Detection. A framework for drone-view object detection using only point-level supervision. Introduces a self-regularized sampling strategy and a Temporal-Ensembling Encoder to generate high-quality pseudo-box labels from point annotations, improving detection in densely packed scenes. | CVPR |
| DViN | DViN: Dynamic Visual Routing Network for Weakly Supervised Referring Expression Comprehension. Dynamic Visual Routing Network for weakly supervised referring expression comprehension. Features a sparse routing mechanism for feature combination and a Routing-based Feature Alignment objective, achieving state-of-the-art results in fine-grained visual understanding. | CVPR |

Object Detection and Segmentation

| Model | Summary | Paper Link |
| --- | --- | --- |
| MI-DETR | MI-DETR: An Object Detection Model with Multi-time Inquiries Mechanism. Introduces a parallel Multi-time Inquiries (MI) mechanism to DETR-like models, allowing object queries to perform multiple inquiries in parallel. This improves detection performance, especially for small and occluded objects, by gathering richer image features. | CVPR |
| QueryMeldNet | Scaling up Image Segmentation across Data and Tasks. A scalable segmentation framework that handles multiple datasets and tasks. Uses a "query meld" mechanism to fuse different query types, balancing instance- and stuff-level segmentation, and leverages synthetic data for improved generalization in open-set segmentation. | CVPR |
| v-CLR | v-CLR: View-Consistent Learning for Open-World Instance Segmentation. View-Consistent Learning framework for open-world instance segmentation. Enforces appearance-invariant representations by generating multiple image views with altered textures and ensuring feature consistency, enabling robust segmentation of novel objects. | CVPR |
| CALICO | CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models. The first Large Vision-Language Model for part-focused semantic co-segmentation. Features a Correspondence Extraction Module for part-level semantic similarity and parameter-efficient adaptation, enabling detailed analysis of objects and their parts across images. | CVPR |
| Multi-scale Spiking Detector (MSD) | Brain-Inspired Spiking Neural Networks for Energy-Efficient Object Detection. An energy-efficient object detection framework based on Spiking Neural Networks (SNNs). Uses a novel spiking convolutional neuron and multi-scale fusion to emulate biological neural processes, achieving high accuracy with lower energy consumption and fewer parameters. | CVPR |
| SGC-Net | SGC-Net: Stratified Granular Comparison Network for Open-Vocabulary HOI Detection. Stratified Granular Comparison Network for open-vocabulary human-object interaction detection. Aggregates global and local semantic features and uses a Hierarchical Group Comparison module with LLMs to enhance discrimination between interaction classes. | arXiv |

Data Labeling and Annotation

This year, a key trend is the development of models that reduce the dependency on manually labeled data.

These papers introduce sophisticated techniques for weakly supervised and self-supervised learning, enabling robust performance without exhaustive human effort.

1. VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

To address the challenge of aligning text with complex video content, VideoGLaMM is introduced as a large multimodal model designed for fine-grained, pixel-level grounding.

The architecture connects a large language model with a dual vision encoder for spatial and temporal details, and a spatio-temporal decoder to generate precise object masks.

This allows the model to respond to natural language queries by generating textual responses that are directly linked to specific objects and their movements in the video, providing a detailed, grounded understanding of the video's content.
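
To make the dual-encoder design more concrete, here is a minimal PyTorch-style sketch of how per-frame spatial tokens and clip-level temporal tokens can feed a decoder conditioned on language features. The module choices, sizes, and patch-embedding stand-ins are my own illustrative assumptions, not the authors' released architecture.

```python
import torch
import torch.nn as nn

class DualEncoderGrounder(nn.Module):
    """Toy spatio-temporal grounding head -- NOT VideoGLaMM's actual code."""
    def __init__(self, dim=256):
        super().__init__()
        self.spatial_enc = nn.Conv2d(3, dim, kernel_size=16, stride=16)     # per-frame patch tokens
        self.temporal_enc = nn.Conv3d(3, dim, kernel_size=(2, 16, 16),
                                      stride=(2, 16, 16))                   # clip-level tokens
        self.text_proj = nn.Linear(768, dim)                                # stand-in for LLM hidden states
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2)
        self.mask_head = nn.Linear(dim, 16 * 16)                            # coarse per-query mask logits

    def forward(self, video, text_hidden):
        # video: (B, 3, T, H, W); text_hidden: (B, L, 768) from the language model
        b = video.shape[0]
        frames = video.transpose(1, 2).flatten(0, 1)                        # (B*T, 3, H, W)
        sp = self.spatial_enc(frames).flatten(2).transpose(1, 2)            # (B*T, N, dim)
        sp = sp.reshape(b, -1, sp.shape[-1])                                # (B, T*N, dim) spatial tokens
        tp = self.temporal_enc(video).flatten(2).transpose(1, 2)            # (B, M, dim) temporal tokens
        memory = torch.cat([sp, tp], dim=1)                                 # joint spatio-temporal memory
        queries = self.text_proj(text_hidden)                               # language-conditioned queries
        return self.mask_head(self.decoder(queries, memory))                # mask logits per text token
```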

2. Learning to Detect Objects from Multi-Agent LiDAR Scans without Manual Labels

This paper introduces DOtA (Detect Objects from Multi-Agent LiDAR scans), an unsupervised method for 3D object detection that eliminates the need for manual labels.

The model leverages the complementary viewpoints from multiple agents to overcome the limitations of data sparsity and limited fields of view that hinder single-agent systems.

DOtA uses the shared ego-pose and ego-shape information between agents to initialize a detector and generate preliminary labels, which are then refined using multi-scale encoding to produce high-quality object detections for autonomous driving applications.
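
The core multi-agent intuition is easy to sketch: each agent's points are mapped into a shared frame using the exchanged ego-poses, and preliminary boxes are kept only when scans from more than one agent support them. The NumPy snippet below is an illustrative simplification with made-up thresholds, not the paper's actual label-generation pipeline.

```python
import numpy as np

def to_shared_frame(points_xyz, ego_pose):
    """Map an agent's LiDAR points (N, 3) into the shared frame via its 4x4 ego-pose."""
    homo = np.concatenate([points_xyz, np.ones((len(points_xyz), 1))], axis=1)
    return (homo @ ego_pose.T)[:, :3]

def point_support(center, size, points, margin=0.1):
    """Count points inside an axis-aligned preliminary box (illustrative geometry)."""
    lo, hi = center - size / 2 - margin, center + size / 2 + margin
    return int(np.all((points >= lo) & (points <= hi), axis=1).sum())

def filter_preliminary_labels(boxes, agent_points, agent_poses,
                              min_points=20, min_agents=2):
    """Keep preliminary boxes that enough agents observe with enough points."""
    shared = [to_shared_frame(p, T) for p, T in zip(agent_points, agent_poses)]
    kept = []
    for center, size in boxes:
        counts = [point_support(center, size, pts) for pts in shared]
        if sum(c >= min_points for c in counts) >= min_agents:   # cross-agent agreement
            kept.append((center, size))
    return kept
```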

3. PointSR: Self-Regularized Point Supervision for Drone-View Object Detection

Addressing the high cost of bounding-box annotations, PointSR is a framework for object detection in drone imagery using only point-level supervision.

This is particularly challenging for drone-view images, which often contain densely packed, small objects.

PointSR introduces a self-regularized sampling strategy that integrates temporal and informational constraints to generate high-quality pseudo-box labels from simple point annotations.

The model uses a Temporal-Ensembling Encoder to ensure stable predictions and an informative negative sampling strategy to refine the quality of the generated bounding boxes.
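
Temporal ensembling of this kind is often implemented as an EMA "teacher" whose weights smooth the student over training, together with a rule that expands confident points into pseudo boxes. The sketch below shows that generic pattern only; the momentum, box size, and helper names are assumptions for illustration, not PointSR's exact procedure.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Exponential moving average of student weights into a temporal-ensembling teacher."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s.detach(), alpha=1.0 - momentum)

def points_to_pseudo_boxes(points, scores, base_size=16.0, min_score=0.5):
    """Expand confident point annotations (N, 2) into square pseudo boxes (K, 4) in xyxy form."""
    pts = points[scores >= min_score]
    half = base_size / 2.0
    return torch.stack([pts[:, 0] - half, pts[:, 1] - half,
                        pts[:, 0] + half, pts[:, 1] + half], dim=1)
```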

4. DViN: Dynamic Visual Routing Network for Weakly Supervised Referring Expression Comprehension

DViN is a novel framework for weakly supervised Referring Expression Comprehension (REC), the task of locating an object in an image based on a natural language description.

The model addresses the performance limitations of existing weakly supervised methods, which struggle with fine-grained visual understanding. DViN features a sparse routing mechanism that dynamically combines features from multiple visual encoders, improving descriptive power.

It also introduces a weakly supervised objective called Routing-based Feature Alignment, which enhances visual understanding through both intra-modal and inter-modal alignment, achieving state-of-the-art results.
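
The sparse routing mechanism can be pictured as a small gating network that activates only the top-k visual encoders for a given expression. The snippet below is a generic top-k router in PyTorch, included to illustrate the idea; the gate design and fusion are my assumptions, not DViN's exact layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseFeatureRouter(nn.Module):
    """Generic top-k routing over several visual encoders (illustrative only)."""
    def __init__(self, feat_dim=512, num_encoders=3, top_k=2):
        super().__init__()
        self.gate = nn.Linear(feat_dim, num_encoders)
        self.top_k = top_k

    def forward(self, text_feat, encoder_feats):
        # text_feat: (B, D); encoder_feats: list of (B, D), one per visual encoder
        logits = self.gate(text_feat)                                  # (B, E) routing scores
        topv, topi = logits.topk(self.top_k, dim=-1)                   # keep a sparse subset
        weights = torch.zeros_like(logits).scatter_(-1, topi, F.softmax(topv, dim=-1))
        stacked = torch.stack(encoder_feats, dim=1)                    # (B, E, D)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)            # (B, D) fused feature
```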

Object Detection and Segmentation

The papers in this category push the boundaries of how accurately and efficiently models can identify, classify, and delineate objects.

From new transformer architectures to brain-inspired neural networks, these contributions tackle challenges in open-world settings, part-focused segmentation, and energy efficiency.

1. MI-DETR: An Object Detection Model with Multi-time Inquiries Mechanism

This paper identifies a key limitation in the cascaded decoder architecture of DETR-like models, where object queries are updated sequentially, limiting their ability to learn from image features.

To address this, MI-DETR introduces a parallel Multi-time Inquiries (MI) mechanism. This allows object queries to perform multiple parallel inquiries to the image features, gathering more comprehensive information.

This simple yet effective modification allows MI-DETR to outperform existing DETR-like models, showing significant improvements on the COCO benchmark, especially for challenging, small, or occluded objects.
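
Conceptually, a parallel-inquiries block can be read as several independent cross-attention passes over the same image tokens whose outputs are fused back into the object queries. The toy module below illustrates that reading; the number of inquiries, head count, and fusion layer are assumptions, not MI-DETR's published configuration.

```python
import torch
import torch.nn as nn

class ParallelInquiries(nn.Module):
    """Object queries make several inquiries to image tokens in parallel (illustrative)."""
    def __init__(self, dim=256, num_inquiries=3, nhead=8):
        super().__init__()
        self.inquiries = nn.ModuleList(
            [nn.MultiheadAttention(dim, nhead, batch_first=True) for _ in range(num_inquiries)])
        self.fuse = nn.Linear(dim * num_inquiries, dim)

    def forward(self, queries, image_tokens):
        # queries: (B, Q, D); image_tokens: (B, N, D) flattened encoder features
        outs = [attn(queries, image_tokens, image_tokens)[0] for attn in self.inquiries]
        return self.fuse(torch.cat(outs, dim=-1))      # (B, Q, D) queries enriched in one shot
```

Because each inquiry sees the original queries rather than the output of the previous one, the information gathered is not bottlenecked by a strictly sequential, cascaded update.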

2. Scaling up Image Segmentation across Data and Tasks

To overcome the limitations of specialized segmentation models, this work proposes scaling up image segmentation across diverse datasets and tasks simultaneously.

The introduced framework, QueryMeldNet, is designed to scale with both data volume and task diversity. It uses a dynamic object query mechanism called "query meld" that fuses different query types to balance instance-level and stuff-level segmentation.

The framework also leverages synthetically generated data to reduce reliance on human annotation, demonstrating improved generalization and performance on open-set segmentation benchmarks.
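
One simple way to picture a "query meld" is a shared attention step over the concatenated instance and stuff queries, so that each query type can borrow context from the other before decoding masks. The sketch below captures only that intuition; all dimensions are illustrative and the actual QueryMeldNet fusion is more involved.

```python
import torch
import torch.nn as nn

class QueryMeld(nn.Module):
    """Mix instance-level and stuff-level queries in one shared pool (illustrative)."""
    def __init__(self, dim=256, nhead=8):
        super().__init__()
        self.mix = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, instance_q, stuff_q):
        # instance_q: (B, Qi, D); stuff_q: (B, Qs, D)
        melded = torch.cat([instance_q, stuff_q], dim=1)     # shared query pool
        mixed, _ = self.mix(melded, melded, melded)          # queries exchange context
        return self.norm(melded + mixed)                     # (B, Qi + Qs, D) melded queries
```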

3. v-CLR: View-Consistent Learning for Open-World Instance Segmentation

This paper tackles open-world instance segmentation by addressing the bias of visual networks to learn appearance-based information like texture, which causes failures when encountering novel objects.

The proposed v-CLR framework enforces the learning of appearance-invariant representations. It introduces additional "views" of an image where texture is altered but structure is preserved.

By enforcing consistency between object features across these different views, the model is encouraged to rely on more robust, generalizable features, significantly improving its ability to discover and segment previously unseen objects.
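
The consistency idea itself reduces to a loss that pulls together features of the same object computed from the original view and a texture-altered view. Below is a minimal cosine-similarity version of such an objective, written as a hedged illustration; the exact formulation used in v-CLR may differ.

```python
import torch
import torch.nn.functional as F

def view_consistency_loss(feats_view_a, feats_view_b):
    """feats_*: (N, D) features of the same N objects under two views of one image."""
    a = F.normalize(feats_view_a, dim=-1)
    b = F.normalize(feats_view_b, dim=-1)
    return (1.0 - (a * b).sum(dim=-1)).mean()   # 0 when matched object features agree exactly
```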

4. CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models

CALICO introduces and addresses the new task of part-focused semantic co-segmentation, which involves identifying and segmenting common objects and their unique parts across multiple images.

This is a more granular task than standard segmentation. CALICO is the first Large Vision-Language Model designed for this multi-image, part-level reasoning.

It features a novel Correspondence Extraction Module to identify part-level semantic similarities and adapts this information into the LVLM in a parameter-efficient way, enabling detailed, comparative analysis of objects across different images.
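
At its simplest, extracting part-level correspondences means comparing per-part features across images and keeping the most similar pairs. The snippet below shows that baseline with greedy matching; it is a stand-in to illustrate the idea, not CALICO's Correspondence Extraction Module.

```python
import torch
import torch.nn.functional as F

def part_correspondences(parts_img1, parts_img2, min_sim=0.5):
    """parts_*: (P, D) part features from two images; returns (i, j, similarity) pairs."""
    sim = F.normalize(parts_img1, dim=-1) @ F.normalize(parts_img2, dim=-1).T
    best_sim, best_j = sim.max(dim=1)          # best match in image 2 for each part in image 1
    return [(i, int(j), float(s))
            for i, (j, s) in enumerate(zip(best_j, best_sim)) if s >= min_sim]
```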

5. Brain-Inspired Spiking Neural Networks for Energy-Efficient Object Detection

This work presents the Multi-scale Spiking Detector (MSD), an energy-efficient, high-performance object detection framework based on Spiking Neural Networks (SNNs).

Inspired by biological neural processes, the model uses a novel spiking convolutional neuron within an Optic Nerve Nucleus Block to enhance deep feature extraction.

The framework emulates the brain's ability to respond to stimuli from different objects by using spiking multi-scale fusion to integrate features from various depths.

MSD achieves competitive accuracy on standard benchmarks while requiring significantly less energy and fewer parameters than traditional ANN-based detectors.
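
For readers new to SNNs, the sketch below shows a textbook leaky integrate-and-fire convolutional unit: membrane potential accumulates over discrete timesteps and binary spikes are emitted when it crosses a threshold. This is the general mechanism MSD builds on, not the paper's specific spiking neuron or fusion design.

```python
import torch
import torch.nn as nn

class SpikingConv(nn.Module):
    """Leaky integrate-and-fire convolution over a sequence of timesteps (illustrative)."""
    def __init__(self, in_ch, out_ch, tau=0.5, v_th=1.0):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.tau, self.v_th = tau, v_th

    def forward(self, x_seq):
        # x_seq: (T, B, C, H, W) -- inputs over T discrete timesteps
        v = torch.zeros_like(self.conv(x_seq[0]))     # membrane potential
        spikes = []
        for x_t in x_seq:
            v = self.tau * v + self.conv(x_t)         # leaky integration of new input
            s = (v >= self.v_th).float()              # fire where the threshold is crossed
            v = v - s * self.v_th                     # soft reset after firing
            spikes.append(s)
        return torch.stack(spikes)                    # binary spike trains, (T, B, C', H, W)
```

Because activations are sparse binary events rather than dense floating-point values, downstream computation can in principle skip work wherever no spike occurred, which is the general source of SNN energy savings.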

6. SGC-Net: Stratified Granular Comparison Network for Open-Vocabulary HOI Detection

SGC-Net addresses two key issues in open-vocabulary human-object interaction (OV-HOI) detection: feature granularity deficiency and semantic similarity confusion.

The proposed network introduces a Granularity Sensing Alignment module that aggregates global semantic features with local details from intermediate layers, creating a more robust alignment between visual features and text embeddings.

Additionally, a Hierarchical Group Comparison module uses an LLM to recursively compare and group interaction classes, generating more discriminative descriptions and improving the ability to distinguish between semantically similar interactions.
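
A stripped-down view of granularity-aware alignment is a learned weighting over pooled features from several backbone layers (local detail plus the global output), compared against class text embeddings. The module below illustrates that pattern only; the learned layer weights and dimensions are my assumptions, not SGC-Net's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GranularityAlignment(nn.Module):
    """Aggregate multi-layer visual features and score them against text embeddings (illustrative)."""
    def __init__(self, num_layers=4):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_feats, text_embeds):
        # layer_feats: (L, B, D) pooled features from L backbone layers (last = global)
        # text_embeds: (C, D) one embedding per interaction-class description
        w = F.softmax(self.layer_weights, dim=0)                  # learn how much each layer matters
        visual = (w[:, None, None] * layer_feats).sum(dim=0)      # (B, D) granularity-aware feature
        visual = F.normalize(visual, dim=-1)
        return visual @ F.normalize(text_embeds, dim=-1).T        # (B, C) similarity logits
```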

What's Next: A Hands-On Exploration

The innovations showcased at CVPR 2025 represent more than just incremental improvements; they signal major shifts in how AI will interact with and understand our visual world.

While these summaries provide a high-level glimpse, the true test of any model lies in its practical application.

Therefore, this overview is just the beginning. In a series of upcoming blog posts, I will conduct a hands-on exploration of these groundbreaking models.

I plan to implement and run many of them individually to assess their capabilities, uncover their challenges, and see how they perform on real-world data.

Join me for this technical deep dive as we put these systems to the test. Stay tuned for the first installment, where we'll kick things off with a detailed look at VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos.

FAQs

Q: When and where is CVPR 2025 held?
A: June 11–15, 2025 at Music City Center, Nashville, Tennessee, with virtual attendance options.

Q: What are the major topics?
A: Key themes include 3D computer vision, multimodal vision-language, embodied AI, image & video synthesis, AR/VR, robotics, low-level vision, and more.

Q: How competitive is the conference?
A: Out of 13,008 submissions, about 2,872 papers were accepted (≈22% acceptance), with only ~3.3% selected for oral presentations.

Q: What formats are available?
A: The conference includes keynote talks, oral and poster sessions, demos, tutorials, workshops, an art program, and an industry expo.

Q: Who sponsors and attends?
A: Co-sponsored by IEEE Computer Society and CVF; attracts 10k+ attendees and exhibitors like Adobe, Apple, Google, Meta, Sony, and Waymo.

References

CVPR 2025 List Of All The Papers