How to Perform Object Detection Tasks Using OWL v2

Explore how to implement OWLv2, a powerful open-vocabulary object detection model. Learn about its zero-shot detection, open-vocabulary classification, and image-guided queries, and how it understands text and images together for real-world use.

OWLv2 - Open-World Localization, version 2

What if you could detect any object in an image, without ever training your model on that object?

That’s exactly what OWLv2, a powerful AI model from Google DeepMind, can do.

In today’s world of computer vision, where most models need large, labeled datasets to recognize even common objects, OWLv2 stands out.

It's a zero-shot, open-vocabulary model, meaning it understands both images and words, and it can detect objects it has never seen before, just by reading a description.

In this blog, we’ll explain how you can use it in different scenarios and walk through the different tasks it can handle, like zero-shot object detection, open-vocabulary classification, and image-guided object detection.

What is OWLv2?

OWLv2 is a powerful computer vision model developed by researchers at Google.

It helps computers locate and identify objects in images, using both text and visuals.

OWLv2 is short for Open-World Localization, version 2, and it's designed to recognize objects, even ones it hasn’t seen during training.

This makes it useful in real-world situations where you might deal with new or uncommon items.

The model uses a combination of images and text prompts to detect and classify objects.

You can ask it to find things using natural language, like “a book below a candle” or “glasses on the book,” and it will locate them in the image.

OWLv2 inference

What tasks can OWLv2 perform?

OWLv2 is not just a single-purpose model; it’s a versatile tool that supports a wide range of vision-language tasks.

By combining the power of visual understanding with natural language prompts, OWLv2 allows you to perform advanced image analysis without the need for task-specific training.

This makes it ideal for real-world applications where flexibility and scalability are key.

In the following sections, we’ll implement several key tasks using OWLv2.

  • Zero-Shot Object Detection refers to the model’s ability to detect objects based on text descriptions, even if it has never encountered those objects during training.
  • Open-Vocabulary Classification lets you classify objects into thousands of categories using dynamic text prompts instead of fixed labels.
  • Image-Guided Object Detection uses a reference image to find visually similar objects in new scenes.

Implementing various tasks with OWLv2

We will use Hugging Face's transformers library to load the model.


from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

Load the OWLv2 model from Hugging Face
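
Optionally, you can move the model to a GPU for faster inference; just remember that the tensors returned by the processor must be moved to the same device before calling the model (the helper functions below keep everything on the CPU for simplicity):


import torch

# Optional: use a GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
# When on GPU, also move the processor outputs, e.g. inputs = inputs.to(device)

Optionally move the model to a GPU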

We will create a few helper functions that we will reuse across these tasks.


import requests
from PIL import Image

def load_image_from_url(url: str) -> Image.Image:
    # Fetch the image over HTTP and convert it to RGB
    return Image.open(requests.get(url, stream=True).raw).convert("RGB")

Function to load an image from a URL

import torch


def detect_objects(
    image: Image.Image,
    queries: list[list[str]],
    processor: Owlv2Processor,
    model: Owlv2ForObjectDetection,
    threshold: float = 0.1
) -> list[dict]:
    
    inputs = processor(text=queries, images=[image], return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Prepare target size for rescaling boxes: (height, width)
    target_sizes = torch.tensor([image.size[::-1]])
    results = processor.post_process_object_detection(
        outputs=outputs,
        target_sizes=target_sizes,
        threshold=threshold
    )
    return results

Run object detection on an image for given text queries
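
Before drawing anything, it can help to print the raw detections. post_process_object_detection returns one dict per image with boxes, scores, and labels, where each label is an index back into the query list. A quick sketch, assuming image and queries are defined as in the zero-shot example below:


results = detect_objects(image, queries, processor, model, threshold=0.1)
for box, score, label in zip(results[0]["boxes"], results[0]["scores"], results[0]["labels"]):
    # label indexes into queries[0]; box is [xmin, ymin, xmax, ymax] in pixels
    print(f"{queries[0][label]}: {score:.2f} at {[round(v, 1) for v in box.tolist()]}")

Inspect raw detections as text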

import random

import matplotlib.pyplot as plt


def display_image_with_boxes(
    image: Image.Image,
    results: list[dict],
    queries: list[list[str]],
    figsize: tuple = (10, 10)
) -> None:
    
    plt.figure(figsize=figsize)
    plt.imshow(image)
    ax = plt.gca()

    # Assuming batch size of 1 for simplicity
    res = results[0]
    boxes, scores, labels = res["boxes"], res["scores"], res["labels"]

    # Assign a distinct random color for each detected object index
    colors = []
    for _ in range(len(boxes)):
        colors.append((random.random(), random.random(), random.random()))

    for (box, score, label), color in zip(zip(boxes, scores, labels), colors):
        xmin, ymin, xmax, ymax = box.tolist()
        ax.add_patch(plt.Rectangle(
            (xmin, ymin), xmax - xmin, ymax - ymin,
            edgecolor=color, facecolor='none', linewidth=2
        ))
        text = queries[0][label]
        ax.text(
            xmin,
            ymin - 5,
            f"{text}: {score:.2f}",
            fontsize=12,
            backgroundcolor=color,
            color='white'
        )

    plt.axis('off')
    plt.show()

Draw bounding boxes around detected objects

Zero-Shot Object Detection

Detect objects that the model has never seen before, just by using text descriptions. No additional training data required.
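
First, load an image to run detection on using the helper defined earlier. The URL below is only a placeholder, so substitute any image you like:


image = load_image_from_url("https://example.com/desk_with_objects.jpg")  # placeholder URL, replace with your own image

Load a sample image for zero-shot detection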


queries = [["shoes", "jeep", "phone", "glasses", "pen", "mug", "jacket", "toy", "cloth", "book"]]
results = detect_objects(image, queries, processor, model, threshold=0.1)
display_image_with_boxes(image, results, queries)

Detect objects using free-text descriptions without training
Zero-Shot Object Detection

Open-Vocabulary Classification

Classify objects into various categories, guided by natural language instead of fixed labels.


url = "https://i.pinimg.com/736x/42/d8/44/42d84425fba8413a77495a9d950290e9.jpg"
image = load_image_from_url(url)

Load sample image
Multiple objects sample image


queries = [['glasses']]
results = detect_objects(image, queries, processor, model, threshold=0.3)
display_image_with_boxes(image, results, queries)

Query to find glasses

queries = [['shades']]
results = detect_objects(image, queries, processor, model, threshold=0.3)
display_image_with_boxes(image, results, queries)

Query to find shades

Classify objects by various classes
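
The same detection helper doubles as an open-vocabulary classifier: pass a longer list of candidate labels and read off which prompt each detected box matched. A small sketch (the candidate labels below are arbitrary examples):


candidate_labels = [["glasses", "shades", "sunglasses", "hat", "watch", "scarf"]]
results = detect_objects(image, candidate_labels, processor, model, threshold=0.2)

for box, score, label in zip(results[0]["boxes"], results[0]["scores"], results[0]["labels"]):
    # The index of the matched prompt serves as the predicted class for this box
    print(f"Predicted class: {candidate_labels[0][label]} (confidence {score:.2f})")

Classify detections by comparing multiple candidate prompts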

Image-Guided Object Detection

Use one image to find similar objects in another, enabling more intuitive and visual search workflows.

Here, we search for a cat in the target image using a query image of a cat.


target_image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
  
query_image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000524280.jpg", stream=True).raw)

Step 1: Load target and query images
Target image and query image


inputs = processor(images=target_image, query_images=query_image, return_tensors="pt")

Step 2: Prepare inputs

import torch
  
# Run image-guided detection
with torch.no_grad():
    outputs = model.image_guided_detection(**inputs)
    target_sizes = torch.tensor([target_image.size[::-1]])
    results = processor.post_process_image_guided_detection(
        outputs=outputs,
        target_sizes=target_sizes,
        threshold = 0.9  # Adjust confidence threshold
    )[0]

Step 3: Perform image-guided detection

from PIL import ImageDraw
  
draw = ImageDraw.Draw(target_image)
for box in results["boxes"]:
    draw.rectangle(box.tolist(), outline="blue", width=3)
target_image.show()

Step 4: Visualize matches
Image query result
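
The post-processed results also include a confidence score for every match, which is handy when tuning the threshold above. For example:


for box, score in zip(results["boxes"], results["scores"]):
    print(f"Match with confidence {score:.2f} at {[round(v, 1) for v in box.tolist()]}")

Inspect the confidence of each image-guided match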

Various use cases of OWLv2

OWLv2 is an advanced object detection model that understands both images and text.

It runs efficiently at inference time and doesn’t need retraining to recognize new objects. Thanks to its zero-shot learning ability, it adapts quickly to new tasks across many industries.

Here’s how different fields use OWLv2:

Surveillance and Security

OWLv2 helps spot unusual items or actions in security footage using text prompts like “suspicious package”. It doesn't need a pre-made list of threats, so it quickly adjusts to new risks without retraining.

Autonomous Systems

In self-driving cars, OWLv2 detects rare objects like “overturned kayak” that standard models might miss. Delivery drones use it to avoid things like “kite strings” while flying.

Healthcare

Doctors use OWLv2 to find rare problems in medical images, such as “fractured cervical vertebrae” or a “dislodged IV line”, even when there's little labeled data available.

Retail and E-Commerce

Warehouses use OWLv2 to find specific items, like “2025 limited-edition sneakers”, just by describing them. Online stores also use it for visual search—people can look for items using descriptions like “glass coffee table with bronze legs”.

Content Moderation

OWLv2 helps platforms detect harmful content in photos or videos using prompts like “graphic violence” or “counterfeit money”. It adjusts easily to new content policies without retraining the model.

Robotics

In factories, robots use OWLv2 to identify tools or parts, like a “misaligned gearbox”, by just using text commands. At home, robots can follow instructions like “find the yellow LEGO under the couch.”

Agriculture and Environment

Farmers use OWLv2 to spot crop diseases such as “wheat stem rust” or invasive pests like “spotted lanternfly” from drone images. It also helps track endangered animals in wildlife photos.

Disaster Response

During emergencies, OWLv2 can find survivors in drone footage using prompts like “person waving from rubble”, even in difficult conditions where older models struggle.

Conclusion

Implementing OWLv2 across different tasks highlights its flexibility, power, and ease of use.

Whether it's zero-shot object detection, open-vocabulary classification, or image-guided object detection, OWLv2 handles each task using natural language and without the need for additional training.

Its ability to understand both visual and textual inputs allows developers and researchers to solve complex problems with minimal data and maximum efficiency.

As we've seen through these examples, OWLv2 makes advanced computer vision accessible, adaptable, and ready for real-world applications.

With just a few lines of code, you can unlock a wide range of capabilities, making it a valuable tool for any modern AI pipeline.

FAQs

What are the key tasks OWLv2 can perform?

OWLv2 can perform zero-shot object detection, open-vocabulary classification, and image-guided object detection using natural language descriptions and reference images.

Do I need labeled data to use OWLv2?

No, OWLv2 works in a zero-shot setting. You can detect and classify new objects using text prompts without additional training data.

How do I run inference using OWLv2?

You preprocess your image and text prompts, pass them through the model, and retrieve bounding boxes and class predictions based on similarity in the shared latent space.
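
For reference, here is a minimal end-to-end sketch using the same checkpoint as above; the image URL is a placeholder and the text prompts are arbitrary examples:


import requests
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

# Placeholder URL: replace with your own image
image = Image.open(requests.get("https://example.com/sample.jpg", stream=True).raw).convert("RGB")
texts = [["a cat", "a remote control"]]

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_object_detection(
    outputs=outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.1
)[0]
for box, score, label in zip(results["boxes"], results["scores"], results["labels"]):
    print(texts[0][label], f"{score:.2f}", [round(v, 1) for v in box.tolist()])

Minimal end-to-end OWLv2 inference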

References

OWLv2 Notebook

Hugging Face Documentation
