Egocentric AI

Real-Time Focus and Distraction Monitoring AI

Learn to build an egocentric workspace assistant using a custom computer vision pipeline. By combining custom YOLO instance segmentation on Labellerr with a smart state machine, this project tracks hand-object proximity in real time to calculate a reliable daily focus percentage score.

Aaryan Aggarwal

Jun 22, 2026 • 6 min read


Modern knowledge work and academic studying rely heavily on deep mental focus. Every day, millions of students and professionals sit at their desks, surrounded by textbooks, notebooks, and digital tools. However, our physical workspaces are filled with subtle productivity traps.

The greatest culprit is the smartphone. A single notification can break our concentration, pulling us into hours of mindless scrolling. Traditional digital productivity tools, like browser extensions or screen-time trackers, fail to solve this problem. They only monitor what happens on your computer screen. They remain completely blind to what your hands are doing in the physical world.

That is why we built the Real-Time Focus and Distraction Monitoring AI. This project uses egocentric, first-person computer vision to turn a standard downward-facing webcam into an intelligent workspace assistant. It tracks exactly what items you pick up, handle, or leave behind on your desk.

By combining high-speed object detection with structural tracking, the system automatically measures how much time you spend studying versus being distracted. It bridges the gap between physical behavior and digital metrics. In this blog, we will explore how this system works, how it overcomes the challenge of hand occlusion, and how custom data pipelines create an unbiased focus scoreboard.

The Problem with Generic Tracking Frameworks

Most computer vision projects attempt to solve object tracking by using public, off-the-shelf datasets. These pre-trained models are excellent for standard security camera perspectives or simple image detection demonstrations. However, they tend to fail when deployed in an egocentric environment, where the camera sits at the user's point of view and looks directly down at their hands.

When a camera looks at a desk from this angle, it faces a massive technical challenge called dynamic occlusion. Dynamic occlusion happens when your own hands cover, block, or change the visual appearance of an object as you reach out to grab it. If you use a generic public model, the tracking logic breaks down the instant your hand touches your phone. The model suddenly drops the label entirely because your palm is hiding the device's edges.

Furthermore, standard public models cannot judge the context of human behavior. They cannot tell the difference between a phone sitting peacefully on a desk mat and a phone being held in an active hand. To build a system that detects true human-object interaction without dropping labels during a hand grab, developers cannot rely on public assets alone. They must build a tailored, domain-specific dataset from scratch.

How the Custom Focus Monitor Fixes This

To solve these tracking limitations, we designed an independent pipeline built on three core layers: custom instance segmentation, a proximity interaction engine, and an automated behavioral state machine. By taking control of the training data and combining it with rule-based software logic, the system turns a raw webcam feed into a productivity telemetry engine.

Instead of relying on basic rectangular bounding boxes that capture extra background noise, our system relies on precise pixel-level boundaries. We recorded high-resolution video streams of custom desk setups, capturing different phones, pens, and paper formats from multiple first-person angles. These video frames were uploaded directly into the Labellerr Instance Segmentation Platform.

Using Labellerr's professional annotation suite, we traced the exact contours of every target class. This meant labeling a phone even when it was partially covered by a hand during a reach event. We explicitly trained the model to understand what a "partial phone" or "hidden pen" looks like. This custom-labeled dataset allowed us to train a compact, lightweight YOLO network optimized for high-speed workspace detection.
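As a rough illustration of how traced contours can become training labels, the sketch below converts one pixel-space polygon into the normalized text line that Ultralytics-style YOLO segmentation datasets expect (`class x1 y1 x2 y2 ...`, with coordinates scaled to 0 to 1). The article does not say which export format or YOLO implementation was used, so the format choice, the `CLASS_IDS` mapping, and the function name are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# Hypothetical class mapping; the real project's label IDs are not given.
CLASS_IDS = {"hand": 0, "phone": 1, "pen": 2, "paper": 3}


def polygon_to_yolo_seg_line(
    class_id: int,
    polygon_px: Sequence[Tuple[float, float]],
    image_w: int,
    image_h: int,
) -> str:
    """Turn a pixel-space contour into one normalized YOLO-seg label line."""
    if len(polygon_px) < 3:
        raise ValueError("a polygon needs at least three points")
    parts: List[str] = [str(class_id)]
    for x, y in polygon_px:
        # Normalize to [0, 1] and clamp points traced slightly off-frame.
        nx = min(max(x / image_w, 0.0), 1.0)
        ny = min(max(y / image_h, 0.0), 1.0)
        parts.append(f"{nx:.6f}")
        parts.append(f"{ny:.6f}")
    return " ".join(parts)


# A partially visible phone traced as a triangle in a 1280 x 720 frame.
visible_phone = [(320, 180), (960, 180), (960, 540)]
print(polygon_to_yolo_seg_line(CLASS_IDS["phone"], visible_phone, 1280, 720))
# prints: 1 0.250000 0.250000 0.750000 0.250000 0.750000 0.750000
```

Only the visible part of an occluded object is traced here, which matches the idea of teaching the model what a "partial phone" looks like.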

Once the model detects the objects, the system must determine whether the user is actually interacting with them. The usual overlap metric in computer vision is Intersection over Union (IoU). However, IoU is a poor fit for workspace tools. A pen's bounding box is long and thin, while a hand's box is large and roughly square. When a hand picks up a pen, the intersection area is tiny compared with the union of the two boxes, so the IoU value stays close to zero and the system misses the interaction entirely.

To fix this, we implemented an Intersection over Minimum Area (IoM) filter inside our pipeline:

IoM = intersection area / min(area of box A, area of box B)

By dividing the intersection area by the area of the smaller box, we get a contact metric that does not shrink just because the hand box is large. If the IoM score rises above 15%, the system registers an interaction event.
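To make the difference concrete, here is a minimal sketch that computes both metrics for a hand box and a pen box. The `Box` type, the function names, and the example coordinates are illustrative assumptions, not the project's actual code. Only the 15% threshold comes from the description above.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixel coordinates (illustrative type)."""
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def area(self) -> float:
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def intersection_area(a: Box, b: Box) -> float:
    """Area of the overlapping rectangle, or 0 if the boxes do not overlap."""
    w = min(a.x2, b.x2) - max(a.x1, b.x1)
    h = min(a.y2, b.y2) - max(a.y1, b.y1)
    return max(0.0, w) * max(0.0, h)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union: overlap divided by the combined area."""
    inter = intersection_area(a, b)
    union = a.area + b.area - inter
    return inter / union if union > 0 else 0.0


def iom(a: Box, b: Box) -> float:
    """Intersection over Minimum Area: overlap divided by the smaller box."""
    smaller = min(a.area, b.area)
    return intersection_area(a, b) / smaller if smaller > 0 else 0.0


IOM_THRESHOLD = 0.15  # the 15% contact threshold described in the text


def is_interacting(hand: Box, obj: Box, threshold: float = IOM_THRESHOLD) -> bool:
    return iom(hand, obj) > threshold


hand = Box(100, 100, 300, 300)  # 200 x 200 px
pen = Box(260, 180, 410, 190)   # 150 x 10 px, one end under the hand

print(f"IoU = {iou(hand, pen):.4f}")  # prints: IoU = 0.0097
print(f"IoM = {iom(hand, pen):.3f}")  # prints: IoM = 0.267
print(is_interacting(hand, pen))      # prints: True
```

With these example boxes, the overlap is 400 square pixels. Against a union of 41,100 it is under 1%, but against the pen's own 1,500 it is about 27%, which clears the 15% threshold.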

Project Workflow

Real-time video runs at 30 frames per second. To prevent the tracking metrics from flickering if an object is briefly hidden, the system feeds the continuous contact events into a prioritized state machine. The state machine operates on strict behavioral rules.

If the system detects an active hand interacting with a phone, it flags a distraction event. Phone usage carries the highest priority and overrides all other behaviors. If there is no phone interaction, but the hands are contacting pens or paper, the system enters the active study state. If no objects are being touched, the system defaults to a passive focus state, assuming the user is reading or thinking. It logs these states frame-by-frame, converting frame counts directly into a real-time focus percentage.
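The priority rules above can be sketched as a small per-frame classifier plus a counter. The class labels (`"phone"`, `"pen"`, `"paper"`), the state names, and the function names are hypothetical. The sketch also assumes that both active study and passive focus count toward the focus percentage, which the article implies but does not spell out.

```python
from collections import Counter
from enum import Enum
from typing import Iterable


class State(Enum):
    DISTRACTED = "distracted"
    ACTIVE_STUDY = "active_study"
    PASSIVE_FOCUS = "passive_focus"


STUDY_CLASSES = {"pen", "paper"}  # hypothetical label names


def classify_frame(touched: Iterable[str]) -> State:
    """Apply the priority rules to the set of objects a hand is touching."""
    touched = set(touched)
    if "phone" in touched:
        return State.DISTRACTED  # phone contact overrides everything else
    if touched & STUDY_CLASSES:
        return State.ACTIVE_STUDY
    return State.PASSIVE_FOCUS  # nothing touched: reading or thinking


def focus_percentage(counts: Counter) -> float:
    """Share of frames spent in either focus state (assumed definition)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    focused = counts[State.ACTIVE_STUDY] + counts[State.PASSIVE_FOCUS]
    return 100.0 * focused / total


# Four frames: writing, phone grabbed while holding a pen, idle, paper.
frames = [{"pen"}, {"pen", "phone"}, set(), {"paper"}]
counts = Counter(classify_frame(touched) for touched in frames)
print(focus_percentage(counts))  # prints: 75.0
```

At 30 frames per second, dividing each state's frame count by 30 would also give the time spent in that state in seconds.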

Real-World Applications

This tracking technology extends far beyond a personal study timer. Because the core pipeline is built on an independent data workflow, the underlying logic of egocentric human-object interaction can be customized for several major industries.

In modern fulfillment centers, operators must quickly assemble component packages or kits. Missing a single item ruins the shipment. By applying our egocentric tracking pipeline to a worker's station, an AI assistant can monitor the assembly line in real time. It can verify that an operator's hand has interacted with the correct component bin in the proper sequence, automating quality control without slowing down the workflow.

Maintaining strict cleanliness guidelines is vital inside hospital operating rooms. Medical residents must learn to handle highly specialized surgical tools with absolute precision. By mounting an egocentric camera to a doctor's headlamp, this system can track how long a resident handles specific instruments like scalpels or forceps. It provides automated, objective feedback on tool-switching efficiency and procedural dexterity without requiring an instructor to watch over their shoulder.

Understanding how customers physically interact with products on checkout counters or interactive display kiosks provides valuable market insight. This software can be integrated above self-service kiosks to analyze user engagement. It can track which products are picked up, how long they are held, or if an item is mistakenly left behind on the counter surface, optimizing store management seamlessly.

Key Features of the System

The technical pipeline rests on four engineering pillars that ensure real-time accuracy and fluid execution.

Our custom segmentation dataset was built entirely using the Labellerr platform to ensure pixel-perfect boundary mapping across thousands of unique hand-object occlusion frames. This foundational layer is what keeps the system highly accurate during fast hand movements.

We completely overhauled the proximity matrix by swapping standard IoU for a tailored IoM algorithm. This algorithmic change ensures small or narrow workspace tools like pens are tracked accurately during continuous hand interaction instead of registering false negatives.

The software uses a headless execution script optimized for cloud and notebook environments. It handles heavy model processing seamlessly and outputs a fully rendered metrics video directly to disk without requiring a local graphical display.

Our context-aware analytical engine features a multi-tier state machine. It intelligently filters background noise and sudden frame drops to separate active work, passive reading, and smartphone distractions into distinct, smooth metrics.
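The article does not describe how brief detection dropouts are filtered, so the following is only one plausible approach: hold a contact "on" for a few frames after the last confirmed detection. The class name `ContactHold` and the `hold_frames` parameter are assumptions made for this sketch.

```python
class ContactHold:
    """Keep a contact active for a few frames after the detector loses it.

    This is an assumed smoothing strategy, not the project's documented one.
    """

    def __init__(self, hold_frames: int = 15):
        # 15 frames is roughly half a second at 30 frames per second.
        self.hold_frames = hold_frames
        self._frames_since_contact = None  # None until a contact is seen

    def update(self, raw_contact: bool) -> bool:
        """Feed one frame's raw contact flag and get the smoothed flag."""
        if raw_contact:
            self._frames_since_contact = 0
            return True
        if self._frames_since_contact is None:
            return False
        self._frames_since_contact += 1
        return self._frames_since_contact <= self.hold_frames


hold = ContactHold(hold_frames=3)
raw = [True, False, False, True, False, False, False, False]
print([hold.update(flag) for flag in raw])
# prints: [True, True, True, True, True, True, True, False]
```

A short hold window trades a small delay in ending a state for metrics that do not flip on every missed detection. The right window length would need tuning against real footage.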

Conclusion

The Real-Time Focus and Distraction Monitoring AI represents a significant shift in how we track and analyze human productivity. By moving away from rigid browser extensions and embracing an egocentric data workflow through Labellerr, we have created a tool that accurately measures physical work habits on standard camera hardware. This project proves that when you combine clean machine learning data with context-aware software logic, you can turn a basic webcam into an intelligent workspace supervisor.

Whether deployed as a student study assistant, an industrial assembly monitor, or a medical training analyzer, this technology provides an unbiased way to evaluate focus. It eliminates the guesswork of time tracking and delivers hard data for every movement. As we move toward a future of smart, responsive environments, custom-trained tools like this will become the baseline standard. Through the power of precision interaction tracking, we are making workspace automation smarter and more insightful for everyone.

FAQs

Why does standard Intersection over Union (IoU) fail when tracking small workspace items like pens?

IoU measures the overlap area divided by the total combined area of both bounding boxes. When a massive hand bounding box overlaps a long, skinny pen box, the combined union area stays huge while the intersection area is tiny, causing the IoU score to drop near zero.

How does the Intersection over Minimum Area (IoM) algorithm fix this tracking issue?

IoM divides the overlapping intersection area strictly by the area of the smaller object (the pen) instead of the total union. This ensures that even a partial hand touch registers a high, mathematically accurate contact score.

Why is an egocentric camera perspective better than a traditional security camera view for focus tracking?

An egocentric view captures the exact first-person line of sight of the worker, mirroring their true interaction with tools. This perspective provides clean data on hand-object touch events that a distant wall-mounted camera would completely miss due to distance and body angles.
