Learn DeepSORT: Real-Time Object Tracking Guide
Learn to implement DeepSORT for robust multi-object tracking in videos. This guide covers how it works, setup, and integration with detectors like YOLO for real-time use.

Are you struggling to keep track of multiple moving objects in your video feeds, whether for surveillance, traffic monitoring, or sports analytics?
You’re not alone: in crowded scenes or when objects overlap, traditional tracking methods can lose up to 30% of their accuracy.
As an AI researcher with hands-on experience in computer vision, I can tell you that these methods often struggle in real-world conditions.
That’s where DeepSORT comes in.
DeepSORT (Deep Simple Online and Realtime Tracking) combines deep learning with classic tracking techniques to deliver more accurate and robust object tracking, even in complex environments.
In this blog, we’ll explore how DeepSORT works, why it’s so effective, and how you can implement and test it in your own projects.
What is Object Tracking?
Object tracking is the process of following moving objects across video frames.
The goal is to assign a unique ID to each object and keep that ID consistent as the object moves.
This is important in many real-world applications, such as monitoring vehicles on roads, tracking people in public spaces, or analyzing players in sports videos.
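To make that concrete, here is a toy illustration of what a tracker's output looks like across two frames; the IDs and box coordinates are invented purely for this example:
# Frame-by-frame tracking output: the same object keeps the same ID
tracks_by_frame = {
    0: {1: (120, 80, 60, 90), 2: (300, 150, 55, 85)},  # frame 0: ID -> (x, y, w, h)
    1: {1: (126, 82, 60, 90), 2: (295, 152, 55, 85)},  # frame 1: same IDs, boxes moved
}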
How DeepSORT Works
DeepSORT tracks multiple objects in a video by combining object detection, motion prediction, and appearance matching.
Here’s how it works, step by step:
1. Object Detection: An object detector (such as YOLO or Faster R-CNN) scans each video frame and draws a bounding box around every object it finds.
2. Feature Extraction: For each detected object, DeepSORT extracts appearance features (distinctive characteristics such as color and texture) using a neural network or color histograms. These features help tell objects apart, even when they are close together or look similar.
3. Motion Prediction: A Kalman filter predicts where each tracked object will be in the next frame, based on the object’s last known position and velocity.
4. Data Association (Matching): New detections are matched to existing tracks by comparing their predicted positions (via Mahalanobis distance) and their appearance features (via cosine similarity). The Hungarian algorithm then finds the best overall assignment, so each object keeps its unique ID as it moves; a simplified sketch of this step follows below.
5. Track Management: If a detection matches an existing track, the track is updated with the new position and features. If a detection matches no track, DeepSORT starts a new one. If a track goes unmatched for several frames, it is removed, on the assumption that the object has left the scene.
This combination of motion and appearance information lets DeepSORT track objects accurately, even when they overlap or disappear briefly.
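To make the matching step concrete, here is a minimal sketch in NumPy and SciPy. It is illustrative rather than DeepSORT's exact implementation: real DeepSORT uses the Kalman filter's Mahalanobis distance and a matching cascade, whereas this sketch uses a plain Euclidean motion cost and a single weighting factor (lam, an assumption made for the example):
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(pred_centers, track_feats, det_centers, det_feats, lam=0.5):
    # Motion cost: distance between predicted and detected box centers
    # (DeepSORT itself uses the Kalman filter's Mahalanobis distance here)
    motion = np.linalg.norm(pred_centers[:, None, :] - det_centers[None, :, :], axis=2)
    motion /= motion.max() + 1e-9  # normalize to [0, 1]

    # Appearance cost: 1 - cosine similarity between feature embeddings
    t = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    d = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    appearance = 1.0 - t @ d.T

    # Hungarian algorithm finds the assignment with minimum total cost
    cost = lam * motion + (1 - lam) * appearance
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))  # (track_index, detection_index) pairs
The essence survives the simplification: build one cost matrix that blends motion and appearance, then let the Hungarian algorithm (linear_sum_assignment) pick the globally cheapest assignment.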
Why This Works Better
DeepSORT outperforms basic tracking methods because it does not rely only on the position of objects.
By using deep learning to compare how objects look, it can:
- Track objects that cross paths or overlap (occlusion).
- Handle situations where objects look very similar.
- Recover tracking even if an object is briefly lost from view.
DeepSORT Tracking in Various Cases
DeepSORT handles a wide range of scenarios, for example tracking people in crowds, players on a sports field, cars in traffic, and planes in the air.
Implementing DeepSORT
Implementing DeepSORT is straightforward: you can do it with a few Python libraries and an object detection model like YOLO. First, install the dependencies:
!pip install ultralytics deep-sort-realtime opencv-python numpy
!pip install git+https://github.com/openai/CLIP.git
Now, import the required libraries into your environment:
from ultralytics import YOLO
from deep_sort_realtime.deepsort_tracker import DeepSort
import cv2
import os
import numpy as np
import random
import clip
def deepsort(path, output='output.mp4', target_classes=None):
    # Initialize the YOLO detection model
    model = YOLO('yolov10n.pt')  # choose your model

    # Open the input video
    cap = cv2.VideoCapture(path)

    # Get video properties
    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = int(cap.get(cv2.CAP_PROP_FPS))

    # Create the output directory if it does not exist
    os.makedirs("output_videos", exist_ok=True)
    output_path = f"output_videos/{output}"

    # Initialize the video writer
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out = cv2.VideoWriter(output_path, fourcc, fps, (frame_width, frame_height))

    # Initialize the DeepSort tracker with a CLIP appearance embedder
    tracker = DeepSort(
        max_age=20,               # frames to keep an unmatched track alive
        n_init=2,                 # detections needed to confirm a track
        embedder='clip_ViT-B/16',
        half=True,
        embedder_gpu=True
    )

    # Color palette mapping each track ID to a distinct color
    color_palette = {}

    # Default target classes (person, car, truck) if none provided
    if target_classes is None:
        target_classes = [0, 2, 7]  # COCO class IDs: 0=person, 2=car, 7=truck

    frame_count = 0
    try:
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break

            # Run YOLO detection on the frame
            results = model(frame, verbose=False)[0]

            # Convert detections to the DeepSort input format:
            # ([left, top, width, height], confidence, class_id)
            detections = []
            for box in results.boxes:
                x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
                conf = float(box.conf[0])
                cls_id = int(box.cls[0])
                # Keep only the target classes
                if cls_id in target_classes:
                    detections.append(([x1, y1, x2 - x1, y2 - y1], conf, cls_id))

            # Update the tracker with the current frame's detections
            tracks = tracker.update_tracks(detections, frame=frame)

            # Draw tracking results
            for track in tracks:
                if not track.is_confirmed():
                    continue
                track_id = track.track_id
                ltrb = track.to_ltrb()
                x1, y1, x2, y2 = map(int, ltrb)

                # Assign a random but distinct color to each new ID
                if track_id not in color_palette:
                    color_palette[track_id] = (
                        random.randint(50, 200),
                        random.randint(50, 200),
                        random.randint(50, 200)
                    )
                color = color_palette[track_id]

                # Draw a thick bounding box
                cv2.rectangle(frame, (x1, y1), (x2, y2), color, 4)

                # Prepare the ID label
                text = f"ID:{track_id}"
                text_scale = 1.5
                text_thickness = 4
                text_size = cv2.getTextSize(text, cv2.FONT_HERSHEY_SIMPLEX,
                                            text_scale, text_thickness)[0]

                # Position a white background strip above the bounding box
                bg_x1 = x1
                bg_y1 = max(0, y1 - text_size[1] - 10)  # keep within the frame
                bg_x2 = x1 + text_size[0] + 5
                bg_y2 = y1 - 10

                # Draw the background only if it fits inside the frame
                if bg_y1 >= 0 and bg_y2 < frame_height and bg_x2 < frame_width:
                    cv2.rectangle(frame,
                                  (bg_x1, bg_y1),
                                  (bg_x2, bg_y2),
                                  (255, 255, 255), -1)  # white background

                # Draw the ID in the same color as the bounding box
                cv2.putText(frame, text, (x1, y1 - 15),
                            cv2.FONT_HERSHEY_SIMPLEX, text_scale, color,
                            text_thickness)

            # Write the annotated frame to the output video
            out.write(frame)

            # Print progress every 10 frames
            frame_count += 1
            if frame_count % 10 == 0:
                print(f"Processed {frame_count} frames")
    except KeyboardInterrupt:
        print("Interrupted by user")
    finally:
        # Release resources
        cap.release()
        out.release()
        print(f"Video saved to: {output_path}")
        print(f"Total frames processed: {frame_count}")
Limitations of DeepSORT
While DeepSORT is powerful, it has some limitations:
- Dependency on detector accuracy: If the object detector misses objects, DeepSORT cannot track them.
- Appearance feature limitations: If two objects look almost identical, DeepSORT may confuse them.
- Computational cost: Extracting deep features for every detection can be slow on large videos or with many objects.
Conclusion
DeepSORT is a robust and widely used algorithm for multi-object tracking. By combining motion prediction and deep learning-based appearance features, it handles complex tracking scenarios better than traditional methods.
However, its performance depends on the quality of object detection and the distinctiveness of object appearances.
With open-source implementations and active community support, DeepSORT is a great choice for real-time object tracking tasks.
FAQs
What's needed to implement DeepSORT?
You'll need Python, an object detector (like YOLO), and libraries such as OpenCV, NumPy, and a DeepSORT implementation (e.g., from GitHub). Pre-trained appearance models are essential for feature extraction.
How do I test DeepSORT’s performance?
Use MOT metrics like MOTA (Multi-Object Tracking Accuracy) and IDF1. Tools like py-motmetrics evaluate tracking consistency, occlusion handling, and ID switches on benchmark videos.
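For a feel of the API, here is a minimal sketch using the motmetrics package (installed with pip install motmetrics); the IDs and distances are toy values for a single frame:
import numpy as np
import motmetrics as mm

# One accumulator collects matching events frame by frame
acc = mm.MOTAccumulator(auto_id=True)

# Per frame: ground-truth IDs, tracker (hypothesis) IDs, and a distance
# matrix between them (np.nan marks pairs that cannot match)
acc.update([1, 2], [1, 2], [[0.1, np.nan],
                            [np.nan, 0.2]])

# Compute MOTA and IDF1 over the accumulated frames
mh = mm.metrics.create()
print(mh.compute(acc, metrics=['mota', 'idf1'], name='demo'))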
Can DeepSORT run in real-time?
Yes, but speed depends on hardware and detector choice. With a lightweight detector (e.g., YOLOv5 Nano) and GPU acceleration, it can reach 20-30 FPS. Optimize by reducing input resolution or filtering object classes, as in the snippet below.
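Both optimizations fit in the detection call itself. The snippet below reuses the model and frame from the implementation above; imgsz and classes are standard Ultralytics predict arguments, and the values here are illustrative:
# Detect at reduced resolution and only for the 'person' class (COCO ID 0)
results = model(frame, imgsz=480, classes=[0], verbose=False)[0]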