MASA Implementation Guide: Track Objects in Video Using SAM
MASA revolutionizes object tracking by combining the Segment Anything Model (SAM) with self-supervised matching. Track any object across video frames without category-specific training or manual labels. Learn how MASA works and implement state-of-the-art universal tracking in your projects.

Have you ever wondered how modern AI systems can recognize, segment, and track virtually any object in a video or image, regardless of how diverse, complex, or unpredictable the scene?
In fields like autonomous driving, robotics, and video surveillance, real-world applications increasingly depend on advanced segmentation and tracking algorithms.
Yet, despite this progress, most existing solutions are limited by their reliance on specific object categories, static scenes, or labor-intensive manual annotations.
Enter MASA (Matching Anything by Segment Anything), a groundbreaking approach that promises to revolutionize the way we match and track objects across frames by leveraging the power of Segment Anything Models (SAM).
MASA combines the generalization ability of SAM with robust matching algorithms, enabling seamless object association in challenging, dynamic environments, without the need for category-specific training.
But how does MASA actually work? What makes it different from traditional tracking-by-detection or segmentation pipelines? And most importantly, how can you implement MASA in your own projects to unlock state-of-the-art performance with minimal effort?

MASA Inference Result
In this blog, we’ll dive deep into the implementation of MASA.
Why is MASA required?
Imagine you’re hiking with a friend who suddenly puts on sunglasses. At first, you might squint to recognize their face, but within seconds, your brain adjusts.
MASA does something similar for AI: it helps models instantly "recognize" new situations they weren’t trained for, like foggy roads, blurry medical scans, or glitchy video feeds.
The Problem: Stubborn AI Brains
Most AI models are like rigid students: they memorize textbook examples but panic when faced with real-world surprises.
For example, a self-driving car AI trained on sunny days might freeze in rain, and a tumor-spotting model could misdiagnose a scan if the hospital lighting changes.
Retraining these models takes hours (or even days) and requires expensive, labeled data.
How does MASA work?
MASA (Matching Anything by Segment Anything) uses a smart and simple approach to match and track any object in videos or images, no matter the scene or object type. Here’s how it works, step by step:
Segment Everything with SAM
MASA starts by using the Segment Anything Model (SAM), a powerful AI tool that can find and outline every object in an image, even if it has never seen that object before.
SAM divides the image into distinct object regions, creating a detailed map of everything present.
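To make this step concrete, here is a minimal sketch of running SAM's automatic mask generation on a single frame using the official segment-anything package. MASA wraps this step inside its own pipeline, so you would not normally call it by hand; the frame path is a hypothetical example, and the checkpoint path follows the layout used later in this guide.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load the SAM ViT-H checkpoint (path matches the setup described later in this guide).
sam = sam_model_registry["vit_h"](
    checkpoint="saved_models/pretrain_weights/sam_vit_h_4b8939.pth"
)
mask_generator = SamAutomaticMaskGenerator(sam)

# Read one frame (hypothetical path) and convert BGR -> RGB, as SAM expects RGB input.
frame = cv2.cvtColor(cv2.imread("demo/frame_0001.jpg"), cv2.COLOR_BGR2RGB)

# Each entry contains a binary 'segmentation' mask, a 'bbox', an 'area', and quality scores.
masks = mask_generator.generate(frame)
print(f"SAM proposed {len(masks)} object regions in this frame")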
Create Instance-Level Matches
To learn how to match objects across frames, MASA applies different geometric transformations (like flipping or rotating) to the same image.
By comparing these transformed images, MASA automatically knows which regions in both images belong to the same object.
This process generates “self-supervision” signals, so MASA learns to recognize and match objects without needing labeled data.
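The sketch below illustrates this idea in its simplest form: one image and its SAM masks are flipped horizontally, and because both views come from the same frame, the pairing between regions is known for free. This is only an illustration of the self-supervision principle, not MASA's actual training code; the function name and details are assumptions.
import numpy as np

def make_training_pair(image, masks):
    """Build two views of one frame whose instance correspondence is known by construction."""
    view_a, masks_a = image, masks
    view_b = image[:, ::-1, :].copy()             # horizontal flip as the geometric transform
    masks_b = [m[:, ::-1].copy() for m in masks]  # flip each binary mask the same way
    # Region i in view_a corresponds to region i in view_b, so (i, i) pairs are positive
    # matches, while (i, j) with i != j serve as negatives; no labels are needed.
    positive_pairs = [(i, i) for i in range(len(masks))]
    return (view_a, masks_a), (view_b, masks_b), positive_pairs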
Extract and Compare Features
For each segmented region, MASA extracts features such as shape, color, and texture.
It then uses advanced matching algorithms to compare these features between frames.
If two regions look alike and share similar features, MASA considers them to be the same object.
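A toy version of this matching step might look like the following: per-region embeddings from two frames are compared by cosine similarity and assigned one-to-one. MASA's learned embeddings and matching head are more sophisticated; this function is only illustrative, and the similarity threshold is an assumption.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_regions(feats_prev, feats_curr, min_sim=0.5):
    """Match region embeddings across two frames by cosine similarity."""
    a = feats_prev / np.linalg.norm(feats_prev, axis=1, keepdims=True)
    b = feats_curr / np.linalg.norm(feats_curr, axis=1, keepdims=True)
    sim = a @ b.T                             # (num_prev, num_curr) similarity matrix
    rows, cols = linear_sum_assignment(-sim)  # assignment that maximizes total similarity
    # Keep only confident matches; unmatched regions would start new tracks.
    return [(r, c, float(sim[r, c])) for r, c in zip(rows, cols) if sim[r, c] >= min_sim]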
Use Spatial Relationships
MASA doesn’t just look at appearance. It also considers where objects are and how they move.
By tracking the positions and movements of objects, MASA can correctly match objects even if they change appearance, move around, or overlap with others.
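One common way to fold spatial cues into a match score is to blend appearance similarity with box overlap between consecutive frames, as sketched below. The weighting scheme here is an illustrative assumption, not MASA's actual formulation.
def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def fused_score(appearance_sim, box_prev, box_curr, w_app=0.7, w_iou=0.3):
    """Blend appearance similarity with spatial overlap (weights are illustrative)."""
    return w_app * appearance_sim + w_iou * box_iou(box_prev, box_curr)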
Universal MASA Adapter
MASA includes a special adapter that can connect with other detection or segmentation models.
This adapter transforms the features from these models, allowing them to benefit from MASA’s matching ability. As a result, even existing models can track any object they detect, without extra training or manual labeling.
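Conceptually, the adapter sits on top of a frozen detector backbone and maps its feature maps into embeddings suited for instance matching. The tiny PyTorch module below only illustrates that idea; MASA's real adapter is multi-scale and trained with a contrastive objective, and the layer sizes here are assumptions.
import torch
import torch.nn as nn

class TrackingAdapter(nn.Module):
    """Illustrative stand-in for MASA's adapter: frozen detector features in,
    matching-friendly embeddings out."""

    def __init__(self, in_channels=256, embed_dim=256):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Conv2d(in_channels, embed_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(embed_dim, embed_dim, kernel_size=1),
        )

    def forward(self, detector_features: torch.Tensor) -> torch.Tensor:
        # detector_features: (N, C, H, W) feature map from a frozen detector backbone.
        return self.transform(detector_features)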
No Manual Labels Needed
Unlike older methods that rely on manually labeled videos, MASA learns from raw, unlabeled images. This makes the process faster, cheaper, and more flexible, and it works well even in new or complex environments.
How can you implement MASA?
NOTE: This guide is written for Linux systems.
To implement MASA, first clone its GitHub repository:
git clone https://github.com/siyuanliii/masa.git
Navigate to the project directory (cd masa) and create a Conda environment using:
conda env create -f environment.yml
conda activate masaenv
Run the install_dependencies.sh script to install the required dependencies:
sh install_dependencies.sh
Install the remaining Python packages using:
pip install -r requirements.txt
Now, launch a Python interpreter (type python in the terminal) and download the required NLTK data files:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')
nltk.download('omw-1.4')
Preparation
- First, create a folder named saved_models in the root directory of the project, then download the following models and put them in the saved_models folder.
- Download the MASA-GroundingDINO weights and save them as saved_models/masa_models/gdino_masa.pth.
- (Optional) Download the demo videos and put them in the demo folder. We provide two short videos for testing (minions_rush_out.mp4 and giraffe_short.mp4). You can download more demo videos here.
- Finally, create the demo_outputs folder in the root directory of the project to save the output videos.
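If you prefer to script the folder setup, a small helper like the one below (run from the repository root) creates the expected layout; the directory names follow the steps above, and saved_models/pretrain_weights is used later for the SAM-H checkpoint.
import os

# Create the folders expected by the inference commands below.
for path in ["saved_models/masa_models", "saved_models/pretrain_weights", "demo_outputs"]:
    os.makedirs(path, exist_ok=True)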
To run inference, run the following command:
python demo/video_demo_with_text.py demo/minions_rush_out.mp4 --out demo_outputs/minions_rush_out_outputs.mp4 --masa_config configs/masa-gdino/masa_gdino_swinb_inference.py --masa_checkpoint saved_models/masa_models/gdino_masa.pth --texts "yellow_minions" --score-thr 0.2 --unified --show_fps --fp16
The output video will be saved in the demo_outputs folder.
Inference Result
You can run another inference example with the following command:
python demo/video_demo_with_text.py demo/giraffe_short.mp4 --out demo_outputs/giraffe_short_outputs.mp4 --masa_config configs/masa-gdino/masa_gdino_swinb_inference.py --masa_checkpoint saved_models/masa_models/gdino_masa.pth --texts "giraffe" --score-thr 0.2 --unified --show_fps --fp16
Inference Result
For object detection with SAM segmentation masks:
- Download the SAM-H weights and save them as saved_models/pretrain_weights/sam_vit_h_4b8939.pth.
- Download carton_kangaroo_dance.mp4 and put it in the demo folder.
Then run:
python demo/video_demo_with_text.py demo/carton_kangaroo_dance.mp4 --out demo_outputs/carton_kangaroo_dance_outputs.mp4 --masa_config configs/masa-gdino/masa_gdino_swinb_inference.py --masa_checkpoint saved_models/masa_models/gdino_masa.pth --texts "kangaroo" --score-thr 0.2 --unified --show_fps --fp16 --sam_mask
Object Detection and SAM segmentation Result
Conclusion
Implementing MASA unlocks the ability to match and track any object in images or videos with minimal manual effort and high flexibility.
By following the MASA workflow, you can leverage the power of the Segment Anything Model (SAM) for universal segmentation, use MASA’s robust matching algorithms to associate objects across frames, and integrate these capabilities into your applications.
The process is straightforward: segment everything in each frame, extract features, and let MASA’s matching engine do the rest.
With no need for manual labeling or category-specific training, MASA makes state-of-the-art object tracking accessible to everyone.
Start experimenting with MASA in your projects, and you’ll quickly see how it transforms complex visual data into actionable, organized information, no matter what you want to match or where you want to deploy it.
FAQs
How does MASA differ fundamentally from traditional object tracking methods?
Unlike traditional trackers requiring category-specific training or detection pipelines, MASA leverages SAM's universal segmentation and self-supervised feature matching. This allows it to track any segmented object across frames without prior knowledge of object types or manual annotations.
Can MASA run in real-time for applications like autonomous driving?
Performance depends heavily on hardware and video resolution. While powerful, running SAM + MASA matching on high-res streams can be computationally intensive. Optimization (like --fp16) helps, but achieving real-time speeds on edge devices may require model distillation or dedicated hardware acceleration for demanding applications.
Does MASA require any pre-labeled video data to learn object matching?
No, that's a key innovation. MASA generates its own training signals ("self-supervision") by applying geometric transformations (e.g., flipping, rotating) to single images. It learns matching by identifying corresponding regions between these transformed views, eliminating the need for labor-intensive video annotations.