MASA Implementation Guide: Track Objects in Video Using SAM
MASA revolutionizes object tracking by combining the Segment Anything Model (SAM) with self-supervised matching. Track any object across video frames without category-specific training or manual labels. Learn how MASA works and implement state-of-the-art universal tracking in your projects.

Have you ever wondered how modern AI systems can recognize, segment, and track virtually any object in a video or image, regardless of how diverse, complex, or unpredictable the scene?
In fields like autonomous driving, robotics, and video surveillance, real-world applications increasingly depend on advanced segmentation and tracking algorithms.
Yet, despite this progress, most existing solutions are limited by their reliance on specific object categories, static scenes, or labor-intensive manual annotations.
Enter MASA (Matching Anything by Segment Anything), a groundbreaking approach that promises to revolutionize the way we match and track objects across frames by leveraging the power of Segment Anything Models (SAM).
MASA combines the generalization ability of SAM with robust matching algorithms, enabling seamless object association in challenging, dynamic environments, without the need for category-specific training.
But how does MASA actually work? What makes it different from traditional tracking-by-detection or segmentation pipelines? And most importantly, how can you implement MASA in your own projects to unlock state-of-the-art performance with minimal effort?

MASA Inference Result
In this blog, we’ll dive deep into the implementation of MASA.
Why is MASA required?
Imagine you’re hiking with a friend who suddenly puts on sunglasses. At first, you might squint to recognize their face, but within seconds, your brain adjusts.
MASA does something similar for AI: it helps models instantly "recognize" new situations they weren’t trained for, like foggy roads, blurry medical scans, or glitchy video feeds.
The Problem: Stubborn AI Brains
Most AI models are like rigid students: they memorize textbook examples but panic when faced with real-world surprises.
For example, a self-driving car AI trained on sunny days might freeze in rain, and a tumor-spotting model could misdiagnose a scan if the hospital lighting changes.
Retraining these models takes hours (or even days) and requires expensive, labeled data.
How does MASA work?
MASA (Matching Anything by Segment Anything) uses a smart and simple approach to match and track any object in videos or images, no matter the scene or object type. Here’s how it works, step by step:
Segment Everything with SAM
MASA starts by using the Segment Anything Model (SAM), a powerful AI tool that can find and outline every object in an image, even if it has never seen that object before.
SAM divides the image into distinct object regions, creating a detailed map of everything present.
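To make this step concrete, here is a minimal sketch of running SAM's automatic mask generation on a single frame using the official segment-anything package. MASA wraps this step inside its own pipeline, so you would not normally call it by hand; the frame path is a hypothetical example, and the checkpoint path follows the layout used later in this guide.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load the SAM ViT-H checkpoint (path matches the setup described later in this guide).
sam = sam_model_registry["vit_h"](
    checkpoint="saved_models/pretrain_weights/sam_vit_h_4b8939.pth"
)
mask_generator = SamAutomaticMaskGenerator(sam)

# Read one frame (hypothetical path) and convert BGR -> RGB, as SAM expects RGB input.
frame = cv2.cvtColor(cv2.imread("demo/frame_0001.jpg"), cv2.COLOR_BGR2RGB)

# Each entry contains a binary 'segmentation' mask, a 'bbox', an 'area', and quality scores.
masks = mask_generator.generate(frame)
print(f"SAM proposed {len(masks)} object regions in this frame")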
Create Instance-Level Matches
To learn how to match objects across frames, MASA applies different geometric transformations (like flipping or rotating) to the same image.
By comparing these transformed images, MASA automatically knows which regions in both images belong to the same object.
This process generates “self-supervision” signals, so MASA learns to recognize and match objects without needing labeled data.
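The sketch below illustrates this idea in its simplest form: one image and its SAM masks are flipped horizontally, and because both views come from the same frame, the pairing between regions is known for free. This is only an illustration of the self-supervision principle, not MASA's actual training code; the function name and details are assumptions.
import numpy as np

def make_training_pair(image, masks):
    """Build two views of one frame whose instance correspondence is known by construction."""
    view_a, masks_a = image, masks
    view_b = image[:, ::-1, :].copy()             # horizontal flip as the geometric transform
    masks_b = [m[:, ::-1].copy() for m in masks]  # flip each binary mask the same way
    # Region i in view_a corresponds to region i in view_b, so (i, i) pairs are positive
    # matches, while (i, j) with i != j serve as negatives; no labels are needed.
    positive_pairs = [(i, i) for i in range(len(masks))]
    return (view_a, masks_a), (view_b, masks_b), positive_pairs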
Extract and Compare Features
For each segmented region, MASA extracts features such as shape, color, and texture.
It then uses advanced matching algorithms to compare these features between frames.
If two regions look alike and share similar features, MASA considers them to be the same object.
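A toy version of this matching step might look like the following: per-region embeddings from two frames are compared by cosine similarity and assigned one-to-one. MASA's learned embeddings and matching head are more sophisticated; this function is only illustrative, and the similarity threshold is an assumption.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_regions(feats_prev, feats_curr, min_sim=0.5):
    """Match region embeddings across two frames by cosine similarity."""
    a = feats_prev / np.linalg.norm(feats_prev, axis=1, keepdims=True)
    b = feats_curr / np.linalg.norm(feats_curr, axis=1, keepdims=True)
    sim = a @ b.T                             # (num_prev, num_curr) similarity matrix
    rows, cols = linear_sum_assignment(-sim)  # assignment that maximizes total similarity
    # Keep only confident matches; unmatched regions would start new tracks.
    return [(r, c, float(sim[r, c])) for r, c in zip(rows, cols) if sim[r, c] >= min_sim]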
Use Spatial Relationships
MASA doesn’t just look at appearance. It also considers where objects are and how they move.
By tracking the positions and movements of objects, MASA can correctly match objects even if they change appearance, move around, or overlap with others.
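One common way to fold spatial cues into a match score is to blend appearance similarity with box overlap between consecutive frames, as sketched below. The weighting scheme here is an illustrative assumption, not MASA's actual formulation.
def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def fused_score(appearance_sim, box_prev, box_curr, w_app=0.7, w_iou=0.3):
    """Blend appearance similarity with spatial overlap (weights are illustrative)."""
    return w_app * appearance_sim + w_iou * box_iou(box_prev, box_curr)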
Universal MASA Adapter
MASA includes a special adapter that can connect with other detection or segmentation models.
This adapter transforms the features from these models, allowing them to benefit from MASA’s matching ability. As a result, even existing models can track any object they detect, without extra training or manual labeling.
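Conceptually, the adapter sits on top of a frozen detector backbone and maps its feature maps into embeddings suited for instance matching. The tiny PyTorch module below only illustrates that idea; MASA's real adapter is multi-scale and trained with a contrastive objective, and the layer sizes here are assumptions.
import torch
import torch.nn as nn

class TrackingAdapter(nn.Module):
    """Illustrative stand-in for MASA's adapter: frozen detector features in,
    matching-friendly embeddings out."""

    def __init__(self, in_channels=256, embed_dim=256):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Conv2d(in_channels, embed_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(embed_dim, embed_dim, kernel_size=1),
        )

    def forward(self, detector_features: torch.Tensor) -> torch.Tensor:
        # detector_features: (N, C, H, W) feature map from a frozen detector backbone.
        return self.transform(detector_features)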
No Manual Labels Needed
Unlike older methods that rely on manually labeled videos, MASA learns from raw, unlabeled images. This makes the process faster, cheaper, and more flexible, and it works well even in new or complex environments.
How can you implement MASA?
NOTE: This guide is written for Linux systems.
To implement MASA, first clone its GitHub repository:
git clone https://github.com/siyuanliii/masa.git
Navigate to the project directory (cd masa) and create a Conda environment using:
conda env create -f environment.yml
conda activate masaenv
Run the install_dependencies.sh script to install the required dependencies:
sh install_dependencies.sh
Install the remaining Python packages using:
pip install -r requirements.txt
Now, launch a Python interpreter (type python in the terminal) and download the required NLTK data files:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')
nltk.download('omw-1.4')
Preparation
- First, create a folder named saved_models in the root directory of the project, then download the following models and put them in the saved_models folder.
- Download the MASA-GroundingDINO weights and save them as saved_models/masa_models/gdino_masa.pth.
- (Optional) Download the demo videos and put them in the demo folder. We provide two short videos for testing (minions_rush_out.mp4 and giraffe_short.mp4). You can download more demo videos here.
- Finally, create the demo_outputs folder in the root directory of the project to save the output videos.
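If you prefer to script the folder setup, a small helper like the one below (run from the repository root) creates the expected layout; the directory names follow the steps above, and saved_models/pretrain_weights is used later for the SAM-H checkpoint.
import os

# Create the folders expected by the inference commands below.
for path in ["saved_models/masa_models", "saved_models/pretrain_weights", "demo_outputs"]:
    os.makedirs(path, exist_ok=True)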
To run inference, run the following command:
python demo/video_demo_with_text.py demo/minions_rush_out.mp4 --out demo_outputs/minions_rush_out_outputs.mp4 --masa_config configs/masa-gdino/masa_gdino_swinb_inference.py --masa_checkpoint saved_models/masa_models/gdino_masa.pth --texts "yellow_minions" --score-thr 0.2 --unified --show_fps --fp16
The output video will be saved in the demo_outputs folder.
Inference Result
You can run another inference example with the following command:
python demo/video_demo_with_text.py demo/giraffe_short.mp4 --out demo_outputs/giraffe_short_outputs.mp4 --masa_config configs/masa-gdino/masa_gdino_swinb_inference.py --masa_checkpoint saved_models/masa_models/gdino_masa.pth --texts "giraffe" --score-thr 0.2 --unified --show_fps --fp16
Inference Result
For object detection with SAM segmentation masks:
- Download the SAM-H weights and save them as saved_models/pretrain_weights/sam_vit_h_4b8939.pth.
- Download carton_kangaroo_dance.mp4 and put it in the demo folder.
Then run:
python demo/video_demo_with_text.py demo/carton_kangaroo_dance.mp4 --out demo_outputs/carton_kangaroo_dance_outputs.mp4 --masa_config configs/masa-gdino/masa_gdino_swinb_inference.py --masa_checkpoint saved_models/masa_models/gdino_masa.pth --texts "kangaroo" --score-thr 0.2 --unified --show_fps --fp16 --sam_mask
Object Detection and SAM segmentation Result
Conclusion
Implementing MASA unlocks the ability to match and track any object in images or videos with minimal manual effort and high flexibility.
By following the MASA workflow, you can leverage the power of the Segment Anything Model (SAM) for universal segmentation, use MASA’s robust matching algorithms to associate objects across frames, and integrate these capabilities into your applications.
The process is straightforward: segment everything in each frame, extract features, and let MASA’s matching engine do the rest.
With no need for manual labeling or category-specific training, MASA makes state-of-the-art object tracking accessible to everyone.
Start experimenting with MASA in your projects, and you’ll quickly see how it transforms complex visual data into actionable, organized information, no matter what you want to match or where you want to deploy it.
FAQs
How does MASA differ fundamentally from traditional object tracking methods?
Unlike traditional trackers requiring category-specific training or detection pipelines, MASA leverages SAM's universal segmentation and self-supervised feature matching. This allows it to track any segmented object across frames without prior knowledge of object types or manual annotations.
Can MASA run in real-time for applications like autonomous driving?
Performance depends heavily on hardware and video resolution. While powerful, running SAM + MASA matching on high-res streams can be computationally intensive. Optimization (like --fp16) helps, but achieving real-time speeds on edge devices may require model distillation or dedicated hardware acceleration for demanding applications.
Does MASA require any pre-labeled video data to learn object matching?
No, that's a key innovation. MASA generates its own training signals ("self-supervision") by applying geometric transformations (e.g., flipping, rotating) to single images. It learns matching by identifying corresponding regions between these transformed views, eliminating the need for labor-intensive video annotations.