CVPR 2025: Breakthroughs in GenAI and Computer Vision
CVPR 2025 (June 11–15, Music City Center, Nashville, and virtual) features top-tier computer vision research spanning 3D modeling, multimodal AI, embodied agents, AR/VR, deep learning, and robotics, alongside workshops, demos, and art exhibits.

Welcome to the second part of our exploration into the most influential papers from CVPR 2025.
In Part 1, we covered groundbreaking advancements in data annotation and object detection. Now, we turn our attention to how machines track objects in motion, perceive the world in 3D, and generate novel visual content.
This installment highlights key developments in object tracking, 3D perception, Vision-Language Models (VLMs), and specialized applications in medicine and document analysis.
The tables below give a brief summary of the papers covered in this article, grouped by category.
**Object Tracking**

| Model | Summary | Paper Link |
| --- | --- | --- |
| Focusing on Tracks for Online Multi-Object Tracking | Introduces a robust method for online multi-object tracking by learning from the entire trajectory ("track") of each object, not just individual frames. This approach handles occlusions and re-identification, providing more stable tracking in real-world scenarios. | CVPR |

**Pose and Depth Perception**

| Model | Summary | Paper Link |
| --- | --- | --- |
| Any6D | Model-free 6D pose estimation of novel objects. A system for estimating the 3D position and rotation (6D pose) of previously unseen objects, without needing a pre-existing 3D model. Enables robots to grasp and manipulate unfamiliar items in unstructured environments. | CVPR |
| HiPART | Hierarchical Pose AutoRegressive Transformer for occluded 3D human pose estimation. A transformer-based model for estimating full 3D human pose from 2D videos, even when parts of the body are occluded. Uses a hierarchical approach to infer hidden limb positions, producing more accurate pose reconstructions. | CVPR |
| Seurat | From moving points to depth. Reconstructs dense 3D depth maps from a moving camera by analyzing how points in the scene shift over time, enabling immersive AR and robust robotic navigation. | CVPR |

**VLM and Generative AI**

| Model | Summary | Paper Link |
| --- | --- | --- |
| VLMs-4-All Workshop | Workshop and papers focused on democratizing Vision-Language Models, exploring efficient training, domain adaptation, and safety for real-world multimodal applications. | Workshop |
| Janus | Decoupling visual encoding for unified multimodal understanding and generation. Proposes a flexible architecture that decouples visual encoding from the language model, excelling at both understanding multimodal inputs and generating visual content from text. | CVPR |
| Task Preference Optimization | Improving multimodal large language models with vision task alignment. Fine-tunes VLMs using task-specific preference alignment rather than generic feedback, resulting in significant performance gains in object detection and segmentation. | CVPR |
| Chat2SVG | Vector graphics generation with large language models and image diffusion models. Combines LLMs and diffusion models to generate editable SVG vector graphics from chat-based text prompts, enabling automated design and illustration. | CVPR |
| PhD | A ChatGPT-visual hallucination evaluation dataset. Introduces a benchmark dataset for measuring and analyzing hallucinations in vision-language models, using ChatGPT-generated prompts to test model reliability and factual accuracy. | CVPR |

**Specialized Applications**

| Model | Summary | Paper Link |
| --- | --- | --- |
| DocSAM | Unified document image segmentation via query decomposition and heterogeneous mixed learning. Segments complex document images into titles, paragraphs, tables, and figures using a query decomposition approach, streamlining automated document processing. | CVPR |
| STPro | Spatial and temporal progressive learning for weakly supervised spatio-temporal grounding. Learns to localize actions in video with only weak supervision, progressively refining spatial and temporal localization for efficient large-scale video analysis. | CVPR |
| M3-VOS | Multi-phase, multi-transition, and multi-scenery video object segmentation. Robustly segments objects across long, complex videos with changing scenery and object states, enabling advanced video editing and analysis. | CVPR |
| MIMO | A medical vision-language model with visual referring multimodal input and pixel-grounding multimodal output; produces pixel-level grounded outputs that directly link diagnostic findings to image regions for clinical transparency. | CVPR |
| Omni-Scene | Omni-Gaussian representation for ego-centric sparse-view scene reconstruction. Generates detailed 3D scene reconstructions from sparse, ego-centric camera views using a novel Omni-Gaussian representation, advancing VR and robotics. | CVPR |
1. Object Tracking
This research focuses on the complex task of following specific objects as they move and interact within a video.
1.1 Focusing on Tracks for Online Multi-Object Tracking
This paper introduces a more intelligent and robust method for tracking multiple objects in real-time. Instead of making decisions based on single frames, the model learns from an object's entire movement history, or "track."
By focusing on the complete trajectory, the system becomes significantly better at handling challenges like temporary occlusions and re-identifying objects that reappear.
This leads to more stable and reliable tracking, crucial for autonomous systems and surveillance.
Read the paper. (Link)
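The paper's exact formulation isn't reproduced here, but the core idea of matching detections against whole trajectories rather than single frames can be sketched as follows. This is a minimal, hypothetical illustration: each track keeps a short history of appearance embeddings, and new detections are assigned with the Hungarian algorithm using a cost averaged over that history.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

class Track:
    """A tracked object that remembers its recent appearance embeddings."""
    def __init__(self, track_id, embedding, history=30):
        self.id = track_id
        self.history = history
        self.embeddings = [embedding]          # trajectory-level appearance memory

    def update(self, embedding):
        self.embeddings.append(embedding)
        self.embeddings = self.embeddings[-self.history:]

    def cost_to(self, embedding):
        # Cosine distance averaged over the whole stored track, not just the last frame.
        embs = np.stack(self.embeddings)
        sims = embs @ embedding / (
            np.linalg.norm(embs, axis=1) * np.linalg.norm(embedding) + 1e-8)
        return 1.0 - sims.mean()

def associate(tracks, detections, max_cost=0.5):
    """Assign detection embeddings to existing tracks with the Hungarian algorithm."""
    if not tracks or len(detections) == 0:
        return [], list(range(len(detections)))
    cost = np.array([[t.cost_to(d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_cost]
    matched_dets = {c for _, c in matches}
    unmatched = [c for c in range(len(detections)) if c not in matched_dets]
    return matches, unmatched

# Toy usage: two tracks, two new detections (random embeddings stand in for a re-ID network).
rng = np.random.default_rng(0)
tracks = [Track(i, rng.normal(size=128)) for i in range(2)]
detections = [t.embeddings[0] + 0.05 * rng.normal(size=128) for t in tracks]
print(associate(tracks, detections))
```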
2. Pose and Depth Perception
These papers delve into the critical ability to understand the 3D orientation of objects and the spatial layout of a scene.
2.1 Any6D: Model-free 6D Pose Estimation of Novel Objects
This research presents a revolutionary system for determining the precise 3D position and rotation (6D pose) of any object, even if the model has never encountered it before.
Critically, it does not require a pre-existing 3D model of the object, which has been a major bottleneck for robotic manipulation.
This technology is a significant leap forward, enabling robots to intelligently grasp and interact with unfamiliar items in unstructured environments.
Read the paper. (Link)
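Any6D's estimation pipeline is beyond a short snippet, but the quantity it predicts, a 6D pose, is simply a rotation plus a translation. The sketch below (plain NumPy, with made-up values) shows how such a pose is packed into a 4x4 transform and used to map points from an object's frame into the camera frame, which is what a downstream grasp planner would consume.

```python
import numpy as np

def pose_to_matrix(rotation: np.ndarray, translation: np.ndarray) -> np.ndarray:
    """Pack a 3x3 rotation and a 3-vector translation into a 4x4 rigid transform."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def transform_points(T: np.ndarray, points: np.ndarray) -> np.ndarray:
    """Apply a 4x4 transform to an (N, 3) array of points given in the object frame."""
    homogeneous = np.hstack([points, np.ones((len(points), 1))])
    return (T @ homogeneous.T).T[:, :3]

# Hypothetical pose: the object is rotated 90 degrees about Z and sits 0.5 m in front of the camera.
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0,              0,             1]])
t = np.array([0.0, 0.0, 0.5])

object_points = np.array([[0.01, 0.0, 0.0], [0.0, 0.02, 0.0]])  # points on the object surface
print(transform_points(pose_to_matrix(R, t), object_points))
```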
2.2 HiPART: Hierarchical Pose AutoRegressive Transformer for Occluded 3D Human Pose Estimation
This work tackles the difficult challenge of estimating a person's complete 3D body pose when parts of them are hidden from view.
The proposed transformer-based model uses a hierarchical approach, allowing it to understand the structural constraints of the human body.
This enables it to logically infer the position of occluded limbs, resulting in more accurate and coherent 3D pose reconstructions directly from standard 2D videos.
Read the paper. (Link)
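The actual HiPART architecture is not shown here; the sketch below only illustrates the hierarchical, parent-to-child idea the summary describes: joints are predicted in the order of a skeleton hierarchy, so an occluded wrist can be inferred from the already-estimated elbow and shoulder. The tiny `predict_offset` function is a stand-in for the learned model, and the skeleton is deliberately simplified.

```python
import numpy as np

# A simplified skeleton hierarchy: each joint lists its parent (None for the root).
PARENTS = {
    "pelvis": None, "spine": "pelvis", "neck": "spine",
    "l_shoulder": "neck", "l_elbow": "l_shoulder", "l_wrist": "l_elbow",
}

def predict_offset(joint: str, pose_so_far: dict) -> np.ndarray:
    """Stand-in for the learned autoregressive model: returns a fixed 3D offset per joint."""
    canned = {"spine": [0, 0.2, 0], "neck": [0, 0.2, 0],
              "l_shoulder": [0.15, 0, 0], "l_elbow": [0.25, -0.05, 0],
              "l_wrist": [0.25, -0.05, 0]}
    return np.array(canned[joint], dtype=float)

def decode_pose(root_position: np.ndarray) -> dict:
    """Decode joints coarse-to-fine: every joint is placed relative to its parent."""
    pose = {"pelvis": root_position}
    for joint, parent in PARENTS.items():
        if parent is None:
            continue
        # Even if this joint is occluded in the image, its position is constrained
        # by the parent already decoded higher up the hierarchy.
        pose[joint] = pose[parent] + predict_offset(joint, pose)
    return pose

for joint, xyz in decode_pose(np.zeros(3)).items():
    print(joint, xyz)
```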
2.3 Seurat: From Moving Points to Depth
This paper introduces an elegant method for reconstructing the 3D depth of a scene using only a moving camera.
Named Seurat, the model observes how points in the environment shift as the camera moves through it. By analyzing this motion, it effectively calculates the distance to various surfaces and builds a dense, detailed depth map of the scene.
This technique is fundamental for creating immersive experiences in augmented reality and for robust navigation in robotics.
Read the paper. (Link)
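Seurat's learned pipeline isn't reproduced here, but the underlying geometric intuition, that nearby points shift more than distant ones as the camera translates, can be shown with classic triangulation. Assuming a purely sideways camera motion and a known focal length (both hypothetical), depth is inversely proportional to how far each tracked point moves between frames.

```python
import numpy as np

def depth_from_disparity(disparity_px: np.ndarray,
                         baseline_m: float,
                         focal_px: float) -> np.ndarray:
    """Classic two-view relation: depth = focal * baseline / disparity.
    Valid here only because we assume a pure sideways camera translation."""
    return focal_px * baseline_m / np.maximum(disparity_px, 1e-6)

# Hypothetical tracked points: horizontal pixel shift between two frames.
disparity = np.array([40.0, 10.0, 2.5])       # large shift = close, small shift = far
depths = depth_from_disparity(disparity, baseline_m=0.10, focal_px=800.0)
print(depths)  # ~[2.0, 8.0, 32.0] metres: the faster-moving points are nearer
```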
3. VLM and Generative AI
This category showcases advances in models that connect vision and language, as well as AI that generates new visual content.
3.1 Vision Language Models For All (VLMs-4-All) Workshop
This workshop focuses on one of the hottest topics in AI: making powerful Vision-Language Models accessible and effective for a wide range of real-world problems.
The sessions and papers explore new methods for training more efficient models, adapting them to specialized domains, and ensuring they operate safely and reliably.
The goal is to democratize this technology, empowering developers to build sophisticated multimodal applications without needing massive computational resources.
Read about it. (Link)
3.2 Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
This paper proposes a more flexible and powerful architecture for Vision-Language Models (VLMs) named Janus.
It uniquely decouples the process of encoding visual information from the language model itself.
This innovative design allows the model to excel at both understanding multimodal inputs (like answering questions about an image) and generating high-quality visual content from text descriptions, setting a new standard for unified multimodal AI.
Read the paper. (Link)
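The real Janus model is far larger, but the decoupling idea can be sketched in a few lines of PyTorch: one visual pathway feeds the language backbone for understanding, a separate output pathway produces tokens for image generation, and both share the same backbone. All module sizes and names below are hypothetical placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DecoupledVLM(nn.Module):
    """Toy sketch: separate visual pathways for understanding vs. generation,
    sharing one language backbone (all dimensions are made up)."""
    def __init__(self, dim=256, vocab=1000, image_tokens=64):
        super().__init__()
        self.understanding_encoder = nn.Linear(768, dim)      # e.g. patch features from a ViT
        self.generation_head = nn.Linear(dim, image_tokens)   # predicts discrete image tokens
        self.text_embed = nn.Embedding(vocab, dim)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2)
        self.text_head = nn.Linear(dim, vocab)

    def understand(self, image_feats, text_ids):
        # Image features go through the understanding encoder, then join the text tokens.
        vis = self.understanding_encoder(image_feats)
        seq = torch.cat([vis, self.text_embed(text_ids)], dim=1)
        return self.text_head(self.backbone(seq))

    def generate_image_tokens(self, text_ids):
        # Generation uses the same backbone but a different output pathway.
        hidden = self.backbone(self.text_embed(text_ids))
        return self.generation_head(hidden)

model = DecoupledVLM()
image_feats = torch.randn(1, 16, 768)          # 16 patch features from a hypothetical ViT
text_ids = torch.randint(0, 1000, (1, 8))
print(model.understand(image_feats, text_ids).shape)        # (1, 24, 1000)
print(model.generate_image_tokens(text_ids).shape)          # (1, 8, 64)
```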
3.3 Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
This research introduces a smarter way to fine-tune VLMs for specific visual tasks. Instead of using generic human feedback, this method aligns the model's training with task-specific preferences (e.g., "which of these two bounding boxes is more accurate?").
This targeted optimization process helps the model specialize more effectively, leading to significant performance gains on practical applications like object detection and image segmentation.
Read the paper. (Link)
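The paper's full training recipe isn't shown here; the snippet below is only a minimal sketch of the kind of pairwise preference loss such alignment typically builds on (a DPO-style objective, used purely as an illustrative stand-in). Given log-probabilities the model assigns to a preferred and a rejected answer, for example the more accurate of two bounding boxes, the loss pushes the preferred one up relative to a frozen reference model.

```python
import torch
import torch.nn.functional as F

def preference_loss(logp_preferred, logp_rejected,
                    ref_logp_preferred, ref_logp_rejected, beta=0.1):
    """DPO-style pairwise loss (illustrative stand-in, not the paper's exact objective).
    Encourages the policy to prefer the task-preferred output relative to a reference."""
    policy_margin = logp_preferred - logp_rejected
    reference_margin = ref_logp_preferred - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - reference_margin)).mean()

# Toy numbers: the policy currently prefers the *rejected* answer, so the loss is high.
loss = preference_loss(torch.tensor([-4.0]), torch.tensor([-2.0]),
                       torch.tensor([-3.0]), torch.tensor([-3.0]))
print(loss.item())
```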
3.4 Chat2SVG: Vector Graphics Generation with Large Language Models and Image Diffusion Models
This work presents an innovative tool that allows anyone to create high-quality Scalable Vector Graphics (SVGs) through a simple chat interface.
The system intelligently combines the conversational reasoning of a Large Language Model with the creative power of an image diffusion model to translate text descriptions into clean, editable vector graphics.
This technology opens up exciting new avenues for automated design, personalized logos, and on-the-fly illustrations.
Read the paper. (Link)
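Chat2SVG's actual pipeline (LLM drafting plus diffusion-guided refinement) can't be condensed into a snippet, but the sketch below shows why the vector output format matters: an SVG built from a handful of structured shape descriptions stays fully editable, unlike a raster image. The shapes and helper function here are hypothetical, not part of the paper.

```python
def shapes_to_svg(shapes, width=200, height=200) -> str:
    """Turn a list of simple shape dictionaries into editable SVG markup.
    (Hypothetical helper, only to illustrate the editable vector output format.)"""
    body = []
    for s in shapes:
        if s["type"] == "circle":
            body.append(f'<circle cx="{s["cx"]}" cy="{s["cy"]}" r="{s["r"]}" fill="{s["fill"]}"/>')
        elif s["type"] == "rect":
            body.append(f'<rect x="{s["x"]}" y="{s["y"]}" width="{s["w"]}" '
                        f'height="{s["h"]}" fill="{s["fill"]}"/>')
    return (f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
            + "".join(body) + "</svg>")

# A prompt like "a red sun over a green field" might be decomposed into shapes like these:
scene = [
    {"type": "rect", "x": 0, "y": 120, "w": 200, "h": 80, "fill": "green"},
    {"type": "circle", "cx": 100, "cy": 60, "r": 30, "fill": "red"},
]
print(shapes_to_svg(scene))
```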
3.5 PhD: A ChatGPT-Visual Hallucination Evaluation Dataset
This paper addresses the critical challenge of reliability in VLMs by introducing a new benchmark dataset called PhD.
It is specifically designed to test when and why models "hallucinate"—or invent visual details not present in an image.
Using a novel approach where ChatGPT generates challenging prompts, this dataset provides a standardized way to measure, understand, and ultimately reduce factual errors in vision-language systems.
Read the paper. (Link)
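The dataset's construction is more involved than this, but the basic protocol for measuring object-level hallucination can be sketched as a yes/no probe: ask the model whether an object is present, compare against ground truth, and report how often it confidently claims to see something that isn't there. The `vlm_answer` function below is a placeholder for a real model call.

```python
def vlm_answer(image_id: str, question: str) -> str:
    """Placeholder for a real VLM call; here it always claims the object is present."""
    return "yes"

def hallucination_rate(samples) -> float:
    """Fraction of absent objects the model nonetheless claims to see."""
    wrong = 0
    negatives = 0
    for image_id, obj, present in samples:
        if present:
            continue  # only probes about absent objects can reveal hallucination
        negatives += 1
        answer = vlm_answer(image_id, f"Is there a {obj} in the image?")
        if answer.strip().lower().startswith("yes"):
            wrong += 1
    return wrong / max(negatives, 1)

# Toy benchmark entries: (image, object, is_the_object_actually_present)
samples = [("img_001", "dog", True), ("img_001", "umbrella", False),
           ("img_002", "car", False)]
print(hallucination_rate(samples))  # 1.0 with the always-"yes" placeholder model
```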
4. Specialized Applications
These models apply advanced computer vision techniques to solve specific, high-impact problems in document analysis, video understanding, and medical imaging.
4.1 DocSAM: Unified Document Image Segmentation via Query Decomposition and Heterogeneous Mixed Learning
This paper introduces a powerful model for intelligently parsing the layout of complex documents.
DocSAM can precisely segment a document image, automatically distinguishing between elements like titles, text paragraphs, tables, and figures.
It uses a novel query decomposition method to handle this diverse mix of content types within a single, unified framework, making it highly effective for digital document processing and automated information extraction.
Read the paper. (Link)
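DocSAM's full architecture isn't reproduced here; the snippet sketches only the general query-based segmentation pattern the summary refers to, as popularized by DETR/Mask2Former-style models: a set of learned queries is decoded against pixel embeddings, and each query yields one mask plus a class such as title, paragraph, table, or figure. All dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class QueryMaskHead(nn.Module):
    """Toy query-based segmentation head: each learned query produces one mask and one class."""
    def __init__(self, num_queries=16, dim=64, num_classes=4):  # title/paragraph/table/figure
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, pixel_embeddings):
        # pixel_embeddings: (B, dim, H, W) features from some document-image backbone.
        B, D, H, W = pixel_embeddings.shape
        flat = pixel_embeddings.flatten(2)                          # (B, dim, H*W)
        masks = torch.einsum("qd,bdn->bqn", self.queries, flat).view(B, -1, H, W)
        classes = self.classifier(self.queries)                     # (num_queries, num_classes)
        return masks.sigmoid(), classes.softmax(dim=-1)

head = QueryMaskHead()
features = torch.randn(1, 64, 32, 32)    # hypothetical backbone features for a document page
masks, classes = head(features)
print(masks.shape, classes.shape)         # (1, 16, 32, 32) and (16, 4)
```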
4.2 STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Grounding
This research presents a model that can pinpoint where and when a specific action occurs in a video, a task known as spatio-temporal grounding.
It learns to do this with "weak supervision," meaning it only needs to know that an action happens somewhere in the video, not its exact timing or location.
The model progressively refines its search across space and time to zero in on the event, making large-scale video analysis far more efficient.
Read the paper. (Link)
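STPro's progressive curriculum isn't shown here; the sketch illustrates only the weakly supervised starting point such methods build on, a multiple-instance-learning objective: the model scores many candidate space-time regions, and because only a video-level label is available, supervision flows through the best-scoring region. Names and shapes below are hypothetical.

```python
import torch
import torch.nn.functional as F

def weak_grounding_loss(region_scores: torch.Tensor, video_label: torch.Tensor) -> torch.Tensor:
    """Multiple-instance learning: region_scores is (B, num_regions) for one action class.
    Only the video-level label is known, so the loss is applied to the max-scoring region."""
    video_score, _ = region_scores.max(dim=1)          # best spatio-temporal candidate per video
    return F.binary_cross_entropy_with_logits(video_score, video_label)

# Toy batch: 2 videos, 5 candidate space-time regions each; only the first video contains the action.
scores = torch.tensor([[0.2, 2.5, -1.0, 0.1, 0.0],
                       [-2.0, -1.5, -0.5, -3.0, -1.0]])
labels = torch.tensor([1.0, 0.0])
print(weak_grounding_loss(scores, labels))
```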
4.3 M^3-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation
This paper tackles the extremely difficult task of segmenting a specific object throughout a long and complex video.
The M^3-VOS model is engineered for robustness, excelling in scenarios with changing scenery, object state transitions (like a flower blooming), and different phases of motion.
It achieves state-of-the-art performance, providing a powerful and reliable tool for detailed video editing and content analysis.
Read the paper. (Link)
4.4 MIMO: A Medical Vision Language Model with Visual Referring Multimodal Input and Pixel Grounding Multimodal Output
This research introduces MIMO, a Vision-Language Model built specifically for the medical domain.
It can analyze medical images and related text, but its key innovation is that its outputs are "pixel-grounded."
This means it can highlight the exact region in a scan that corresponds to its diagnostic finding, creating a transparent and trustworthy link between its analysis and the visual evidence, which is invaluable for clinical support.
Read the paper. (Link)
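MIMO itself is a large model; the sketch below only shows what a pixel-grounded output could look like as a data structure: each textual finding is paired with a binary mask over the image, so a clinician can see exactly which pixels support the statement. The class and example values are hypothetical, not the paper's output format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GroundedFinding:
    """One diagnostic statement tied to the pixels that support it."""
    text: str
    mask: np.ndarray          # boolean array with the same H x W shape as the scan

    def area_fraction(self) -> float:
        return float(self.mask.mean())

# Hypothetical output for a 128x128 scan: a small region flagged in the upper-left quadrant.
mask = np.zeros((128, 128), dtype=bool)
mask[20:40, 30:60] = True
finding = GroundedFinding(text="Opacity in the upper-left region", mask=mask)
print(finding.text, f"covers {finding.area_fraction():.1%} of the image")
```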
4.5 Omni-Scene: Omni-Gaussian Representation for Ego-Centric Sparse-View Scene Reconstruction
This paper presents a new method for creating detailed 3D models of an environment from a first-person perspective.
Even with just a few viewpoints from a moving camera, Omni-Scene can generate a comprehensive 3D reconstruction using a novel "Omni-Gaussian" representation.
This technology marks a major step forward for creating immersive virtual reality content and enabling advanced spatial awareness in robotics from limited visual data.
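Omni-Scene's rendering pipeline is out of scope for a snippet, but the primitive it builds on, a 3D Gaussian, is easy to show: each primitive stores a centre, a scale, an opacity, and (in practice) a colour, and the scene is the sum of many such blobs. The minimal sketch below evaluates the density of a few isotropic Gaussians at a query point; all values are made up, and real Gaussian-splatting pipelines use anisotropic covariances and rasterize to images.

```python
import numpy as np

def gaussian_density(query: np.ndarray, centers: np.ndarray,
                     scales: np.ndarray, opacities: np.ndarray) -> float:
    """Summed density of isotropic 3D Gaussians at a query point."""
    sq_dist = np.sum((centers - query) ** 2, axis=1)
    return float(np.sum(opacities * np.exp(-0.5 * sq_dist / scales ** 2)))

# A toy "scene" of three Gaussian primitives (centre, scale, opacity are all hypothetical).
centers = np.array([[0.0, 0.0, 1.0], [0.5, 0.0, 1.2], [0.0, 0.5, 0.8]])
scales = np.array([0.2, 0.1, 0.3])
opacities = np.array([0.9, 0.7, 0.5])
print(gaussian_density(np.array([0.1, 0.1, 1.0]), centers, scales, opacities))
```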
What's Next: A Hands-On Exploration
The innovations showcased at CVPR 2025 represent more than just incremental improvements; they signal major shifts in how AI will interact with and understand our visual world.
While these summaries provide a high-level glimpse, the true test of any model lies in its practical application.
Therefore, this overview is just the beginning. In a series of upcoming blog posts, I will conduct a hands-on exploration of these groundbreaking models.
I plan to implement and run many of them individually to assess their capabilities, uncover their challenges, and see how they perform on real-world data.
Join me for this technical deep dive as we put these systems to the test. Stay tuned for the first installment, where we'll kick things off with a detailed look at VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos.
FAQs
Q: When and where is CVPR 2025 held?
A: June 11–15, 2025 at Music City Center, Nashville, Tennessee, with virtual attendance options.
Q: What are the major topics?
A: Key themes include 3D computer vision, multimodal vision-language, embodied AI, image & video synthesis, AR/VR, robotics, low-level vision, and more.
Q: How competitive is the conference?
A: Out of 13,008 submissions, about 2,872 papers were accepted (≈22% acceptance), with only ~3.3% selected for oral presentations.
Q: What formats are available?
A: The conference includes keynote talks, oral and poster sessions, demos, tutorials, workshops, an art program, and an industry expo.
Q: Who sponsors and attends?
A: Co-sponsored by IEEE Computer Society and CVF; attracts 10k+ attendees and exhibitors like Adobe, Apple, Google, Meta, Sony, and Waymo.