How to Perform Object Detection Tasks Using OWLv2
Explore how to implement OWLv2, a powerful open-vocabulary object detection model. Learn about its zero-shot detection capabilities, open-vocabulary classification, image-guided queries, and how it understands text and images together for real-world use.

What if you could detect any object in an image, without ever training your model on that object?
That’s exactly what OWLv2, a powerful AI model from Google DeepMind, can do.
In today’s world of computer vision, where most models need large, labeled datasets to recognize even common objects, OWLv2 stands out.
It's a zero-shot, open-vocabulary model, meaning it understands both images and words, and it can detect objects it has never seen before, just by reading a description.
In this blog, we’ll explain how you can use it in different scenarios and the different tasks it can handle, like zero-shot object detection, open-vocabulary classification, and image-guided object detection.
What is OWLv2?
OWLv2 is a powerful computer vision model developed by researchers at Google.
It helps computers find and describe objects in images, using both text and visuals.
OWLv2 is the second version of OWL-ViT (Vision Transformer for Open-World Localization), and it's designed to recognize objects, even ones it hasn’t seen during training.
This makes it useful in real-world situations where you might deal with new or uncommon items.
The model uses a combination of images and text prompts to detect and classify objects.
You can ask it to find things using natural language, like “a book below a candle” or “glasses on the book,” and it will locate them in the image.
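For example, the text queries for a single image are just a nested list of strings (one inner list per image), which is the same format the code later in this post uses:
text_queries = [["a book below a candle", "glasses on the book"]]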

OWLv2 inference
What tasks can OWLv2 perform?
OWLv2 is not just a single-purpose model; it’s a versatile tool that supports a wide range of vision-language tasks.
By combining the power of visual understanding with natural language prompts, OWLv2 allows you to perform advanced image analysis without the need for task-specific training.
This makes it ideal for real-world applications where flexibility and scalability are key.
In the following sections, we’ll implement several key tasks using OWLv2.
- Zero-Shot Object Detection refers to the model’s ability to detect objects based on text descriptions, even if it has never encountered those objects during training.
- Open-Vocabulary Classification lets you classify objects into thousands of categories using dynamic text prompts instead of fixed labels.
- Image-Guided Object Detection uses a reference image to find visually similar objects in new scenes.
Implementing various tasks with OWLv2
We will use Hugging Face's transformers library to load the model.
from transformers import Owlv2Processor, Owlv2ForObjectDetection
processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")
We will create a few helper functions that we'll reuse across the tasks.
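These helpers rely on a few standard libraries (Pillow, requests, PyTorch, Matplotlib, and random for box colors), so import those first:
import random

import matplotlib.pyplot as plt
import requests
import torch
from PIL import Image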
def load_image_from_url(url: str) -> Image.Image:
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
    return image
def detect_objects(
    image: Image.Image,
    queries: list[list[str]],
    processor: Owlv2Processor,
    model: Owlv2ForObjectDetection,
    threshold: float = 0.1
) -> list[dict]:
    inputs = processor(text=queries, images=[image], return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Prepare target size for rescaling boxes: (height, width)
    target_sizes = torch.Tensor([image.size[::-1]])
    results = processor.post_process_object_detection(
        outputs=outputs,
        target_sizes=target_sizes,
        threshold=threshold
    )
    return results
def display_image_with_boxes(
    image: Image.Image,
    results: list[dict],
    queries: list[list[str]],
    figsize: tuple = (10, 10)
) -> None:
    plt.figure(figsize=figsize)
    plt.imshow(image)
    ax = plt.gca()
    # Assuming batch size of 1 for simplicity
    res = results[0]
    boxes, scores, labels = res["boxes"], res["scores"], res["labels"]
    # Assign a distinct random color for each detected object index
    colors = []
    for _ in range(len(boxes)):
        colors.append((random.random(), random.random(), random.random()))
    for (box, score, label), color in zip(zip(boxes, scores, labels), colors):
        xmin, ymin, xmax, ymax = box.tolist()
        ax.add_patch(plt.Rectangle(
            (xmin, ymin), xmax - xmin, ymax - ymin,
            edgecolor=color, facecolor='none', linewidth=2
        ))
        text = queries[0][label]
        ax.text(
            xmin,
            ymin - 5,
            f"{text}: {score:.2f}",
            fontsize=12,
            backgroundcolor=color,
            color='white'
        )
    plt.axis('off')
    plt.show()
Zero-Shot Object Detection
Detect objects that the model has never seen before, just by using text descriptions. No additional training data is required. Load a sample image with load_image_from_url, then pass the text queries you want to search for:
# Load an image first, e.g. image = load_image_from_url(<your image URL>)
queries = [["shoes", "jeep", "phone", "glasses", "pen", "mug", "jacket", "toy", "cloth", "book"]]
results = detect_objects(image, queries, processor, model, threshold=0.1)
display_image_with_boxes(image, results, queries)
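If you want the raw predictions rather than a plot, you can read them straight from the post-processed results (a small sketch reusing the results and queries from above):
for box, score, label in zip(results[0]["boxes"], results[0]["scores"], results[0]["labels"]):
    # label is an index into queries[0]; box is [xmin, ymin, xmax, ymax] in pixels
    print(f"{queries[0][label]}: {score:.2f} at {[round(v, 1) for v in box.tolist()]}")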

Zero-Shot Object Detection
Open-Vocabulary Classification
Classify objects into various categories, guided by natural language instead of fixed labels.
url = "https://i.pinimg.com/736x/42/d8/44/42d84425fba8413a77495a9d950290e9.jpg"
image = load_image_from_url(url)

Multiple objects sample image
queries = [['glasses']]
results = detect_objects(image, queries, processor, model, threshold=0.3)
display_image_with_boxes(image, results, queries)
queries = [['shades']]
results = detect_objects(image, queries, processor, model, threshold=0.3)
display_image_with_boxes(image, results, queries)
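You can also pass several candidate labels in one query and compare their scores, which is closer to a classification workflow. A minimal sketch, reusing the image and helpers from above (the extra candidate labels here are just illustrative):
queries = [["glasses", "shades", "hat", "watch"]]
results = detect_objects(image, queries, processor, model, threshold=0.1)
res = results[0]
if len(res["scores"]) > 0:
    # Each label is an index into queries[0]; the highest-scoring detection
    # tells you which description matches the object best.
    best = res["scores"].argmax()
    print(f'{queries[0][res["labels"][best]]}: {res["scores"][best]:.2f}')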

Classifying objects with different text labels
Image-Guided Object Detection
Use one image to find similar objects in another, enabling more intuitive and visual search workflows.
Here, we use a query image of a cat to find matching cats in the target image.
target_image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
query_image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000524280.jpg", stream=True).raw)

Target Image & Query Image
inputs = processor(images=target_image, query_images=query_image, return_tensors="pt")
# Run image-guided detection
with torch.no_grad():
    outputs = model.image_guided_detection(**inputs)
target_sizes = torch.tensor([target_image.size[::-1]])
results = processor.post_process_image_guided_detection(
    outputs=outputs,
    target_sizes=target_sizes,
    threshold=0.9  # Adjust confidence threshold
)[0]
from PIL import ImageDraw
draw = ImageDraw.Draw(target_image)
for box in results["boxes"]:
    draw.rectangle(box.tolist(), outline="blue", width=3)
target_image.show()
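Each returned box also comes with a similarity score, which you can inspect before drawing (a short sketch using the same results dict from above):
for box, score in zip(results["boxes"], results["scores"]):
    # Higher scores mean the region looks more like the query image
    print(f"score {score:.2f} at {[round(v, 1) for v in box.tolist()]}")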

Image Query Result
Various use cases of OWLv2
OWLv2 is an advanced object detection model that understands both images and text.
It works in real time and doesn’t need retraining to recognize new objects. Thanks to its zero-shot learning ability, it adapts quickly to new tasks across many industries.
Here’s how different fields use OWLv2:
Surveillance and Security
OWLv2 helps spot unusual items or actions in security footage using text prompts like “suspicious package”. It doesn't need a pre-made list of threats, so it quickly adjusts to new risks without retraining.
Autonomous Systems
In self-driving cars, OWLv2 detects rare objects like “overturned kayak” that standard models might miss. Delivery drones use it to avoid things like “kite strings” while flying.
Healthcare
Doctors use OWLv2 to find rare problems in medical images, such as “fractured cervical vertebrae” or a “dislodged IV line”, even when there's little labeled data available.
Retail and E-Commerce
Warehouses use OWLv2 to find specific items, like “2025 limited-edition sneakers”, just by describing them. Online stores also use it for visual search—people can look for items using descriptions like “glass coffee table with bronze legs”.
Content Moderation
OWLv2 helps platforms detect harmful content in photos or videos using prompts like “graphic violence” or “counterfeit money”. It adjusts easily to new content policies without retraining the model.
Robotics
In factories, robots use OWLv2 to identify tools or parts, like a “misaligned gearbox”, by just using text commands. At home, robots can follow instructions like “find the yellow LEGO under the couch.”
Agriculture and Environment
Farmers use OWLv2 to spot crop diseases such as “wheat stem rust” or invasive pests like “spotted lanternfly” from drone images. It also helps track endangered animals in wildlife photos.
Disaster Response
During emergencies, OWLv2 can find survivors in drone footage using prompts like “person waving from rubble”, even in difficult conditions where older models struggle.
Conclusion
Implementing OWLv2 across different tasks highlights its flexibility, power, and ease of use.
Whether it's zero-shot object detection, open-vocabulary classification, or image-guided object detection, OWLv2 handles each task using natural language and without the need for additional training.
Its ability to understand both visual and textual inputs allows developers and researchers to solve complex problems with minimal data and maximum efficiency.
As we've seen through these examples, OWLv2 makes advanced computer vision accessible, adaptable, and ready for real-world applications.
With just a few lines of code, you can unlock a wide range of capabilities, making it a valuable tool for any modern AI pipeline.
FAQs
What are the key tasks OWLv2 can perform?
OWLv2 can perform zero-shot object detection, open-vocabulary classification using natural language descriptions, and image-guided object detection using a reference image.
Do I need labeled data to use OWLv2?
No, OWLv2 works in a zero-shot setting. You can detect and classify new objects using text prompts without additional training data.
How do I run inference using OWLv2?
You preprocess your image and text prompts, pass them through the model, and retrieve bounding boxes and class predictions based on similarity in the shared latent space.
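As a rough end-to-end sketch (condensing the helpers used earlier in this post):
import requests
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw).convert("RGB")
queries = [["a cat", "a remote control"]]

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Rescale boxes to the original image size and keep confident predictions
results = processor.post_process_object_detection(
    outputs=outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.1
)[0]
for box, score, label in zip(results["boxes"], results["scores"], results["labels"]):
    print(f"{queries[0][label]}: {score:.2f} at {box.tolist()}")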