Learn DeepSORT: Real-Time Object Tracking Guide
Learn to implement DeepSORT for robust multi-object tracking in videos. This guide covers how it works, setup, and integration with detectors like YOLO for real-time use.

Are you struggling to keep track of multiple moving objects in your video feeds, whether for surveillance, traffic monitoring, or sports analytics?
You’re not alone: in crowded scenes or when objects overlap, traditional tracking methods can lose up to 30% of their accuracy.
As an AI researcher with hands-on experience in computer vision, I can tell you that these methods often struggle in real-world conditions.
That’s where DeepSORT comes in.
DeepSORT (Deep Simple Online and Realtime Tracking) combines deep learning with classic tracking techniques to deliver more accurate and robust object tracking, even in complex environments.
In this blog, we’ll explore how DeepSORT works, why it’s so effective, and how you can implement and test it in your own projects.
What is Object Tracking?
Object tracking is the process of following moving objects across video frames.
The goal is to assign a unique ID to each object and keep that ID consistent as the object moves.
This is important in many real-world applications, such as monitoring vehicles on roads, tracking people in public spaces, or analyzing players in sports videos.
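To make that concrete, here is a toy illustration of what a tracker's output looks like across two frames; the IDs and box coordinates are invented purely for this example:
# Frame-by-frame tracking output: the same object keeps the same ID
tracks_by_frame = {
    0: {1: (120, 80, 60, 90), 2: (300, 150, 55, 85)},  # frame 0: ID -> (x, y, w, h)
    1: {1: (126, 82, 60, 90), 2: (295, 152, 55, 85)},  # frame 1: same IDs, boxes moved
}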
How DeepSORT Works
DeepSORT tracks multiple objects in a video by combining object detection, motion prediction, and appearance matching.
Here’s how it works, step by step:
1. Object Detection: An object detector (such as YOLO or Faster R-CNN) scans each video frame and draws a bounding box around every object it finds.
2. Feature Extraction: For each detected object, DeepSORT extracts appearance features (distinctive characteristics such as color and texture) using a neural network or color histograms. These features help tell objects apart, even when they are close together or look similar.
3. Motion Prediction: A Kalman filter predicts where each tracked object will be in the next frame, based on the object’s last known position and velocity.
4. Data Association (Matching): New detections are matched to existing tracks by comparing their predicted positions (via Mahalanobis distance) and their appearance features (via cosine similarity). The Hungarian algorithm then finds the best overall assignment, so each object keeps its unique ID as it moves; a simplified sketch of this step follows below.
5. Track Management: If a detection matches an existing track, the track is updated with the new position and features. If a detection matches no track, DeepSORT starts a new one. If a track goes unmatched for several frames, it is removed, on the assumption that the object has left the scene.
This combination of motion and appearance information lets DeepSORT track objects accurately, even when they overlap or disappear briefly.
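To make the matching step concrete, here is a minimal sketch in NumPy and SciPy. It is illustrative rather than DeepSORT's exact implementation: real DeepSORT uses the Kalman filter's Mahalanobis distance and a matching cascade, whereas this sketch uses a plain Euclidean motion cost and a single weighting factor (lam, an assumption made for the example):
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(pred_centers, track_feats, det_centers, det_feats, lam=0.5):
    # Motion cost: distance between predicted and detected box centers
    # (DeepSORT itself uses the Kalman filter's Mahalanobis distance here)
    motion = np.linalg.norm(pred_centers[:, None, :] - det_centers[None, :, :], axis=2)
    motion /= motion.max() + 1e-9  # normalize to [0, 1]

    # Appearance cost: 1 - cosine similarity between feature embeddings
    t = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    d = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    appearance = 1.0 - t @ d.T

    # Hungarian algorithm finds the assignment with minimum total cost
    cost = lam * motion + (1 - lam) * appearance
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))  # (track_index, detection_index) pairs
The essence survives the simplification: build one cost matrix that blends motion and appearance, then let the Hungarian algorithm (linear_sum_assignment) pick the globally cheapest assignment.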
Why This Works Better
DeepSORT outperforms basic tracking methods because it does not rely only on the position of objects.
By using deep learning to compare how objects look, it can:
- Track objects that cross paths or overlap (occlusion).
- Handle situations where objects look very similar.
- Recover tracking even if an object is briefly lost from view.
DeepSORT Tracking in Various Cases
DeepSORT handles a wide range of scenarios, for example tracking people in crowds, players on a sports field, cars in traffic, and planes in the air.
Implementing DeepSORT
Implementing DeepSORT is straightforward: you can do it with a few Python libraries and an object detection model like YOLO. First, install the dependencies:
!pip install ultralytics deep-sort-realtime opencv-python numpy
!pip install git+https://github.com/openai/CLIP.git
Now, import the required libraries into your environment:
from ultralytics import YOLO
from deep_sort_realtime.deepsort_tracker import DeepSort
import cv2
import os
import numpy as np
import random
import clip
def deepsort(path, output='output.mp4', target_classes=None):
    # Initialize the YOLO detection model
    model = YOLO('yolov10n.pt')  # choose your model

    # Open the input video
    cap = cv2.VideoCapture(path)

    # Get video properties
    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = int(cap.get(cv2.CAP_PROP_FPS))

    # Create the output directory if it does not exist
    os.makedirs("output_videos", exist_ok=True)
    output_path = f"output_videos/{output}"

    # Initialize the video writer
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out = cv2.VideoWriter(output_path, fourcc, fps, (frame_width, frame_height))

    # Initialize the DeepSort tracker with a CLIP appearance embedder
    tracker = DeepSort(
        max_age=20,               # frames to keep an unmatched track alive
        n_init=2,                 # detections needed to confirm a track
        embedder='clip_ViT-B/16',
        half=True,
        embedder_gpu=True
    )

    # Color palette mapping each track ID to a distinct color
    color_palette = {}

    # Default target classes (person, car, truck) if none provided
    if target_classes is None:
        target_classes = [0, 2, 7]  # COCO class IDs: 0=person, 2=car, 7=truck

    frame_count = 0
    try:
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break

            # Run YOLO detection on the frame
            results = model(frame, verbose=False)[0]

            # Convert detections to the DeepSort input format:
            # ([left, top, width, height], confidence, class_id)
            detections = []
            for box in results.boxes:
                x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
                conf = float(box.conf[0])
                cls_id = int(box.cls[0])
                # Keep only the target classes
                if cls_id in target_classes:
                    detections.append(([x1, y1, x2 - x1, y2 - y1], conf, cls_id))

            # Update the tracker with the current frame's detections
            tracks = tracker.update_tracks(detections, frame=frame)

            # Draw tracking results
            for track in tracks:
                if not track.is_confirmed():
                    continue
                track_id = track.track_id
                ltrb = track.to_ltrb()
                x1, y1, x2, y2 = map(int, ltrb)

                # Assign a random but distinct color to each new ID
                if track_id not in color_palette:
                    color_palette[track_id] = (
                        random.randint(50, 200),
                        random.randint(50, 200),
                        random.randint(50, 200)
                    )
                color = color_palette[track_id]

                # Draw a thick bounding box
                cv2.rectangle(frame, (x1, y1), (x2, y2), color, 4)

                # Prepare the ID label
                text = f"ID:{track_id}"
                text_scale = 1.5
                text_thickness = 4
                text_size = cv2.getTextSize(text, cv2.FONT_HERSHEY_SIMPLEX,
                                            text_scale, text_thickness)[0]

                # Position a white background strip above the bounding box
                bg_x1 = x1
                bg_y1 = max(0, y1 - text_size[1] - 10)  # keep within the frame
                bg_x2 = x1 + text_size[0] + 5
                bg_y2 = y1 - 10

                # Draw the background only if it fits inside the frame
                if bg_y1 >= 0 and bg_y2 < frame_height and bg_x2 < frame_width:
                    cv2.rectangle(frame,
                                  (bg_x1, bg_y1),
                                  (bg_x2, bg_y2),
                                  (255, 255, 255), -1)  # white background

                # Draw the ID in the same color as the bounding box
                cv2.putText(frame, text, (x1, y1 - 15),
                            cv2.FONT_HERSHEY_SIMPLEX, text_scale, color,
                            text_thickness)

            # Write the annotated frame to the output video
            out.write(frame)

            # Print progress every 10 frames
            frame_count += 1
            if frame_count % 10 == 0:
                print(f"Processed {frame_count} frames")
    except KeyboardInterrupt:
        print("Interrupted by user")
    finally:
        # Release resources
        cap.release()
        out.release()
        print(f"Video saved to: {output_path}")
        print(f"Total frames processed: {frame_count}")
Limitations of DeepSORT
While DeepSORT is powerful, it has some limitations:
- Dependency on detector accuracy: If the object detector misses objects, DeepSORT cannot track them.
- Appearance feature limitations: If two objects look almost identical, DeepSORT may confuse them.
- Computational cost: Extracting deep features for every detection can be slow on large videos or with many objects.
Conclusion
DeepSORT is a robust and widely used algorithm for multi-object tracking. By combining motion prediction and deep learning-based appearance features, it handles complex tracking scenarios better than traditional methods.
However, its performance depends on the quality of object detection and the distinctiveness of object appearances.
With open-source implementations and active community support, DeepSORT is a great choice for real-time object tracking tasks.
FAQs
What's needed to implement DeepSORT?
You'll need Python, an object detector (like YOLO), and libraries such as OpenCV, NumPy, and a DeepSORT implementation (e.g., from GitHub). Pre-trained appearance models are essential for feature extraction.
How do I test DeepSORT’s performance?
Use MOT metrics like MOTA (Multi-Object Tracking Accuracy) and IDF1. Tools like py-motmetrics evaluate tracking consistency, occlusion handling, and ID switches on benchmark videos.
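For a feel of the API, here is a minimal sketch using the motmetrics package (installed with pip install motmetrics); the IDs and distances are toy values for a single frame:
import numpy as np
import motmetrics as mm

# One accumulator collects matching events frame by frame
acc = mm.MOTAccumulator(auto_id=True)

# Per frame: ground-truth IDs, tracker (hypothesis) IDs, and a distance
# matrix between them (np.nan marks pairs that cannot match)
acc.update([1, 2], [1, 2], [[0.1, np.nan],
                            [np.nan, 0.2]])

# Compute MOTA and IDF1 over the accumulated frames
mh = mm.metrics.create()
print(mh.compute(acc, metrics=['mota', 'idf1'], name='demo'))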
Can DeepSORT run in real-time?
Yes, but speed depends on hardware and detector choice. With a lightweight detector (e.g., YOLOv5 Nano) and GPU acceleration, it can reach 20-30 FPS. Optimize by reducing input resolution or filtering object classes, as in the snippet below.
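Both optimizations fit in the detection call itself. The snippet below reuses the model and frame from the implementation above; imgsz and classes are standard Ultralytics predict arguments, and the values here are illustrative:
# Detect at reduced resolution and only for the 'person' class (COCO ID 0)
results = model(frame, imgsz=480, classes=[0], verbose=False)[0]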