RT-DETRv2 Beats YOLO? Full Comparison + Tutorial

Explore a comparison between RT-DETR and RT-DETRv2, transformer-powered real-time object detectors, and learn how to implement them using Hugging Face.

RT-DETR vs RT-DETRv2

Can transformer-based detectors finally dethrone YOLO in real-time object detection? And more importantly, should you be betting on RT-DETR or its successor, RT-DETRv2, for your next high-performance project?

Here’s what the numbers say:

RT-DETR stunned the community by delivering 108 FPS on a T4 GPU while achieving 53.1% AP on COCO, a remarkable feat that outpaced YOLOv8 in both speed and accuracy.

But the story didn’t stop there. Its next-gen upgrade, RT-DETRv2, improved performance even further, boosting accuracy by +2.2% AP (to 55.3%), without compromising real-time speed.

So, how do you decide which model aligns better with your goals, whether it's low-latency surveillance, drone-based analytics, or scalable deployment pipelines?

We’ll break down what each version offers under the hood, walk through various object detection scenarios, and share hands-on implementation tips.

What are RT-DETR and RT-DETRv2?

RT-DETR and RT-DETRv2 are fast and accurate object detection models built on the DETR (Detection Transformer) framework. Unlike earlier DETR models, which were accurate but slow, these versions are optimized for real-time performance, making them ideal for tasks like video analytics, autonomous driving, and robotics.

RT-DETR vs RT-DETRv2

What makes RT-DETR special is its end-to-end design. Unlike traditional detectors that rely on extra post-processing steps like Non-Maximum Suppression (NMS), RT-DETR directly predicts object locations using a transformer decoder.

It also uses a hybrid encoder that mixes CNNs for local feature extraction with transformers for global understanding. Plus, it selects object queries based on uncertainty, so the model focuses on the most relevant areas first, making training faster and more efficient.
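To see what skipping NMS means in practice, here is a tiny illustrative snippet (not RT-DETR's own code) showing the duplicate-suppression step a traditional detector must run, which RT-DETR's one-to-one query matching makes unnecessary:


import torch
from torchvision.ops import nms

# Traditional detectors emit overlapping candidate boxes and prune them with NMS
boxes = torch.tensor([[10., 10., 100., 100.],
                      [12., 12., 102., 102.],   # near-duplicate of box 0
                      [200., 200., 300., 300.]])
scores = torch.tensor([0.90, 0.80, 0.75])
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0, 2]): the near-duplicate is suppressed

# RT-DETR trains each decoder query to own at most one object,
# so its raw outputs only need a confidence threshold, not an NMS pass

Illustrative NMS step that RT-DETR avoids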

What's new in RT-DETRv2?

RT-DETRv2 builds on the strengths of RT-DETR but adds smart upgrades to make it even more accurate and easier to deploy.

One of the biggest improvements is how it handles objects of different sizes.

Using a method called Selective Multi-Scale Sampling, the model treats large and small objects differently, which helps it detect tiny items like a ball just as well as larger ones like a car.

This leads to better accuracy across a wide range of real-world scenarios.
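The exact sampling scheme lives inside the decoder's deformable attention, but the core idea can be sketched as giving each feature-map scale its own sampling budget. The snippet below is a conceptual sketch only, not the released RT-DETRv2 implementation, and the per-level point counts are hypothetical:


import torch

num_queries, num_heads, hidden_dim = 300, 8, 256
points_per_level = [4, 4, 4]   # hypothetical budgets for the 1/8, 1/16, 1/32 scales
total_points = sum(points_per_level)

# One head predicts (x, y) offsets for every head/point pair; the offsets are
# then split per scale, so fine levels (small objects) and coarse levels
# (large objects) can be tuned independently
offset_head = torch.nn.Linear(hidden_dim, num_heads * total_points * 2)
offsets = offset_head(torch.randn(1, num_queries, hidden_dim))
offsets = offsets.view(1, num_queries, num_heads, total_points, 2)
per_level_offsets = offsets.split(points_per_level, dim=3)
print([o.shape for o in per_level_offsets])

Conceptual sketch of per-scale sampling budgets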

Implementation of RT-DETR and RT-DETRv2

We will implement both RT-DETR and RT-DETRv2 using Hugging Face's transformers library.

Before that, we install the necessary modules using:



!pip install transformers torch pillow matplotlib requests opencv-python timm


Installing Required libraries

Then we import them:


import torch
import requests
from tqdm.autonotebook import tqdm
from PIL import Image, ImageDraw, ImageFont
import numpy as np
import random
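
The detection function below also expects two globals that the imports above don't provide: a device string and a coco_classes list. Here is a minimal setup; the class list follows the standard 80-class COCO ordering, which we assume matches the checkpoints' label ids (after loading a model, you can also rebuild it from model.config.id2label):


# Pick a GPU if available; the detection function moves the model and inputs here
device = "cuda" if torch.cuda.is_available() else "cpu"

# Standard 80-class COCO label list (assumed to match the checkpoints' label ids)
coco_classes = [
    'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train',
    'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign',
    'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
    'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag',
    'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite',
    'baseball bat', 'baseball glove', 'skateboard', 'surfboard',
    'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon',
    'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot',
    'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant',
    'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote',
    'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink',
    'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear',
    'hair drier', 'toothbrush'
]

Defining the device and COCO class names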

Import Libraries

Next, we create a function that runs inference on an image and draws bounding boxes around the detected objects.


def RT_DETR_OBJ_DET(image,
                    class_names=coco_classes,
                    threshold=0.5,
                    model=None,
                    processor=None):
    """
    Perform object detection on the given image using a pre-trained model.
    Args:
        image (PIL.Image): The input image for object detection.
        class_names (list): List of class names for the model.
        model: Pre-trained object detection model.
        processor: Image processor for the model.        
    Returns:
        PIL.Image: The image with detected objects drawn on it."""
    # Prepare model and inputs
    model.to(device)
    inputs = processor(images=image, return_tensors="pt").to(device)
    
    # Inference
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Post-process on GPU
    target_sizes = torch.tensor([image.size[::-1]], device=device)  # (height, width)
    detections = processor.post_process_object_detection(
        outputs, 
        target_sizes=target_sizes, 
        threshold=threshold
    )
    
    # Prepare drawing
    image = image.copy()
    draw = ImageDraw.Draw(image)
    
    # Process detections
    for detection in detections:
        # Move detections to the CPU for drawing
        scores = detection['scores'].cpu().numpy()
        labels = detection['labels'].cpu().numpy()
        boxes = detection['boxes'].cpu().numpy().astype(int)
        
        # Draw results
        for label, score, box in zip(labels, scores, boxes):
            x1, y1, x2, y2 = box
            color = tuple(random.randint(0, 255) for _ in range(3))
            
            # Draw rectangle
            draw.rectangle([x1, y1, x2, y2], outline=color, width=2)
            
            # Draw label
            text = f"{class_names[label]}: {score:.2f}"
            text_bbox = draw.textbbox((0, 0), text)
            text_w, text_h = text_bbox[2]-text_bbox[0], text_bbox[3]-text_bbox[1]
            text_y = max(y1 - text_h, 0)
            
            draw.rectangle([x1, text_y, x1+text_w, text_y+text_h], fill=color)
            draw.text((x1, text_y), text, fill='white')
    
    return image

Function to detect objects and draw bounding boxes

Now we load the required models.

RT-DETR


from transformers import RTDetrForObjectDetection, RTDetrImageProcessor

model = RTDetrForObjectDetection.from_pretrained("PekingU/rtdetr_r50vd")
processor = RTDetrImageProcessor.from_pretrained("PekingU/rtdetr_r50vd")


Loading RT-DETR model

RT-DETRv2


from transformers import RTDetrV2ForObjectDetection, RTDetrImageProcessor

model = RTDetrV2ForObjectDetection.from_pretrained("PekingU/rtdetr_v2_r18vd")
processor = RTDetrImageProcessor.from_pretrained("PekingU/rtdetr_v2_r18vd")


Loading RT-DETRv2 model
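
Because the drawing function indexes class_names by the predicted label id, it is worth spot-checking that the checkpoint's own label map (exposed as model.config.id2label in transformers) lines up with our coco_classes list:


# Spot-check the checkpoint's label map against our class list
for i in [0, 18, 57]:
    print(i, model.config.id2label[i], "|", coco_classes[i])

Checking the label map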

For loading images from a URL, we create another helper function:


def load_img(url):
    # Stream the image from the URL and ensure it is in RGB mode
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
    return image

Loading an image from a URL
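
Putting the pieces together, a minimal end-to-end call looks like this; the URL points to a sample image from the public COCO validation set, used here purely for illustration:


# Load a sample image, run detection, and show the annotated result
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = load_img(url)
result = RT_DETR_OBJ_DET(image, threshold=0.5, model=model, processor=processor)
result.show()  # in a notebook, display(result) also works

Running inference on a sample image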

Comparison of models in various situations

Real-Time Object Detection Comparison

RT-DETR and RT-DETRv2 in real time

Detection in a complex environment

Original Image

RT-DETR inference result

RT-DETRv2 inference result

RT-DETRv2 performs slightly better than its predecessor in detecting objects. For example, RT-DETRv2 easily detects the plant in the foreground, whereas RT-DETR fails to notice it.

Detection when objects are small and varied

<Link Name>

Original Image

RT-DETR inference result

RT-DETR inference result

RT-DETRv2 inference result

RT-DETRv2 inference result

RT-DETRv2 performs better in detecting small objects

RT-DETRv2 performs better in detecting small objects

Detected number of objects vs. the actual number of objects

Original Image

RT-DETR inference result

RT-DETRv2 inference result

RT-DETR and RT-DETRv2 perform equally well in detecting the number of objects.

Here, out of 25 sheep, both RT-DETR and RT-DETRv2 accurately detect 13, which is barely half of the total. With fine-tuning, we can overcome this limitation.
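
As a starting point, here is a minimal sketch of preparing RT-DETRv2 for fine-tuning on a custom single-class "sheep" dataset; the dataset preparation and training loop are omitted, and the label mapping is a hypothetical example:


from transformers import RTDetrV2ForObjectDetection, RTDetrImageProcessor

# Hypothetical single-class label mapping for a custom sheep dataset
id2label = {0: "sheep"}

model = RTDetrV2ForObjectDetection.from_pretrained(
    "PekingU/rtdetr_v2_r18vd",
    id2label=id2label,
    label2id={v: k for k, v in id2label.items()},
    ignore_mismatched_sizes=True,  # re-initializes the classification head
)
processor = RTDetrImageProcessor.from_pretrained("PekingU/rtdetr_v2_r18vd")

# From here, train on annotated images, e.g. with the transformers Trainer
# or a plain PyTorch loop over processor-prepared batches

Sketch of a fine-tuning setup for RT-DETRv2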

Conclusion

RT-DETR and RT-DETRv2 represent significant advancements in real-time object detection, successfully challenging the previously dominant YOLO family of models.

By combining the end-to-end detection capabilities of transformer-based models with real-time performance, these architectures offer compelling alternatives for computer vision tasks that require both speed and accuracy.

The improvements introduced in RT-DETRv2 further enhance real-time detection accuracy, particularly for small objects.

As these models continue to evolve and gain adoption, they are likely to become increasingly important tools in the computer vision landscape, particularly for applications requiring real-time processing capabilities.

FAQ

How does RT-DETRv2 improve upon RT-DETR?

RT-DETRv2 introduces Selective Multi-Scale Sampling, improving detection of small and large objects. It also boosts COCO AP by +2.2%, while maintaining real-time performance.

Which model is better for edge devices – RT-DETR or RT-DETRv2?

RT-DETR is slightly lighter and ideal for tighter resource constraints, while RT-DETRv2 offers better accuracy with similar efficiency, making it the better choice for most real-time applications if GPU resources are sufficient.

Can RT-DETRv2 replace YOLOv8 in real-time detection tasks?

Yes, in many scenarios. RT-DETRv2 surpasses YOLOv8 in both FPS and accuracy on COCO, making it a viable transformer-based alternative for real-time use cases.

References

Labellerr's Notebook
