How Labellerr Improved Model Accuracy for a Healthcare AI Startup
Every year, millions of patients check into hospitals seeking care and leave with an infection they didn't arrive with. Hospital-acquired infections (HAIs) remain one of the most stubborn, costly, and often fatal challenges in modern healthcare.
Despite rigorous protocols and dedicated Environmental Services teams, the gap between what should be cleaned and what actually gets cleaned is nearly impossible to close through human oversight alone. A missed surface. A rushed procedure. A log that says "done" when it wasn't quite right.
The consequences are measured not just in dollars (HAIs cost the U.S. healthcare system an estimated $28 billion annually) but in patient lives.
Solving this problem requires more than better checklists. It requires a system that sees everything, in real time, without fatigue or bias. It requires AI.
One healthcare technology startup set out to build exactly that: an AI-powered computer vision system capable of auditing every cleaning event inside a clinical environment, automatically and continuously.
But building a vision model that works reliably in the controlled chaos of a real hospital is no small feat. It demands one foundational ingredient before anything else: clean, accurate, and scalable training data.
That's where Labellerr came in.
About the Customer
Our customer is an early-stage deep tech startup based in California, operating at the intersection of artificial intelligence and healthcare safety. Founded in 2022, the company is on a mission to eliminate preventable hospital-acquired infections, one of the most persistent and costly challenges in modern healthcare, using AI-powered computer vision.
Their flagship product is a real-time automated auditing system that monitors environmental cleaning and disinfection events within hospital and clinical settings.
By analyzing every cleaning interaction as it happens, the system gives Environmental Services (EVS) teams and infection control officers actionable, data-backed insights to ensure compliance and close the gaps that manual auditing simply cannot catch at scale.
At the core of their technology stack is a suite of computer vision models trained to understand complex, real-world clinical environments. Scaling and improving these models demanded high-quality, precisely annotated training data, which is where Labellerr came in.
Customer Requirements
Before Labellerr could begin, it was important to understand the full scope and complexity of what the customer needed. This wasn't a straightforward labeling task; it was a precision annotation challenge inside one of the most demanding visual environments imaginable.
The customer needed 50+ distinct medical objects tracked consistently across every scene, all of them patient-contact surfaces that matter deeply in infection control.
These included high-touch equipment like bed rails, IV poles, vital signs monitors, breathing tubes, humidifier units, and a range of other clinical apparatus that EVS staff interact with during every cleaning event.
Each of these objects had to be individually identified, labeled, and tracked with a stable identity throughout the video, because in the context of hospital safety AI, a missed object or a broken tracking chain isn't just a data error; it's a gap in patient protection.
The full scope of the task:
- 100+ videos to be annotated end-to-end
- Video annotation with object tracking, not just frame-level labeling, but consistent identity tracking of objects across the full length of each video
- 50+ distinct medical object classes including bed rails, IV poles, vital signs monitors, breathing tubes, humidifier units, and other patient-contact clinical equipment
- Each video ranged from 900 to 1,800 frames, making manual frame-by-frame annotation both time-intensive and error-prone at scale
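A back-of-the-envelope calculation using the scope above shows why fully manual, frame-by-frame labeling doesn't scale. The per-label time below is an illustrative assumption, not a measured figure:

```python
# Rough annotation-volume estimate from the project scope above.
# The 5-seconds-per-label figure is an illustrative assumption.
videos = 100
frames_per_video = (900 + 1800) // 2   # midpoint of the stated range
objects_per_frame = 50
seconds_per_label = 5                  # assumed time per manual label

total_labels = videos * frames_per_video * objects_per_frame
hours = total_labels * seconds_per_label / 3600

print(total_labels)    # 6,750,000 individual object labels
print(round(hours))    # roughly 9,375 person-hours if done by hand
```

Even with generous assumptions, the raw volume runs into millions of labels, which is why the workflow described below leans on tracking and model assistance rather than pure manual effort.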
The sheer volume was significant. But the real challenge wasn't quantity; it was consistency.
Challenges Faced by the Customer
Hospital environments are visually complex. Cleaning staff move across rooms, equipment shifts position, objects are picked up, put down, tucked behind carts, or temporarily obscured by people passing through the frame.
For a computer vision model to learn from this data, every object needs to be tracked with a consistent identity, from the first frame it appears to the last, even when it momentarily vanishes from view.
This is where conventional object tracking models fell short. Standard tracking algorithms rely on continuous visual signals. The moment an object becomes occluded (hidden behind another object or person), many trackers lose the thread entirely, assigning a new identity when the object reappears. Across 900 to 1,800 frames per video, with 50+ objects in motion, these inconsistencies compound quickly and corrupt the training data.
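The identity-switch failure mode is easy to see in a minimal sketch. Below, a naive IoU-based tracker (a simplified stand-in for a conventional tracker, not any specific library) matches detections only against the previous frame, so a two-frame occlusion is enough to break the identity chain. All names and coordinates are illustrative:

```python
# Minimal sketch of why a naive IoU tracker breaks identities under
# occlusion. Detections are (x, y, w, h) boxes per frame; an empty
# frame simulates the object being hidden from view.

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def naive_track(frames, iou_thresh=0.3):
    next_id, active, out = 0, {}, []     # active: id -> last seen box
    for dets in frames:
        assigned = {}
        for box in dets:
            # match against boxes from the *previous* frame only
            match = next((i for i, prev in active.items()
                          if iou(prev, box) >= iou_thresh), None)
            if match is None:            # no overlap -> brand-new ID
                match, next_id = next_id, next_id + 1
            assigned[match] = box
        active = assigned                # unmatched tracks are dropped
        out.append(sorted(assigned))
    return out

# One IV pole drifting right, occluded for two frames mid-sequence:
frames = [[(10, 10, 5, 5)], [(11, 10, 5, 5)], [], [], [(13, 10, 5, 5)]]
print(naive_track(frames))   # [[0], [0], [], [], [1]] -- identity lost
```

The same object comes back as ID 1 after the gap. Multiply that by 50+ objects and hundreds of occlusion events per video, and the label corruption described above follows directly.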
The challenge Labellerr faced was clear: how do you maintain tracking consistency across long, dense, occlusion-heavy videos, at scale, without sacrificing annotation accuracy?
How Labellerr Helped
Deploying SAM2 for Robust Object Tracking
Labellerr's team turned to SAM2 (Segment Anything Model 2) by Meta, a state-of-the-art foundation model built for video object segmentation and tracking. Unlike traditional trackers, SAM2 is designed to handle object persistence across frames, making it significantly more resilient in occluded environments.
But deploying SAM2 on raw video alone wasn't enough. The team developed a targeted workflow to specifically address the occlusion problem.
Keyframe Extraction at Occlusion Points
Rather than running tracking from the first frame and hoping it held through every occlusion, Labellerr's annotators identified the specific keyframes where occlusion events occurred: the exact moments where an object disappeared or re-emerged. These keyframes were used as fresh reference points to reinitialize object tracking, ensuring that identity consistency was restored precisely at the moments it was most at risk of breaking down.
This approach, keyframe-anchored tracking, meant that even in the most visually chaotic scenes, object identities remained stable throughout the full video. No lost threads. No duplicate IDs. No corrupted sequences.
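The logic of keyframe-anchored tracking can be sketched in a few lines. Here `propagate` is a stand-in for a SAM2-style propagator that follows one object forward until it is lost from view; the real system operates on segmentation masks, but the identity-preserving structure is the same. All names are illustrative:

```python
# Sketch of keyframe-anchored tracking: instead of propagating an
# object once from frame 0, tracking is re-seeded at every annotator-
# marked keyframe where the object re-emerges, so it keeps its
# original ID on both sides of an occlusion gap.

def propagate(frames, start, obj_id):
    """Follow obj_id from `start` until it vanishes; return frame -> id."""
    track = {}
    for idx in range(start, len(frames)):
        if obj_id not in frames[idx]:   # occluded: propagation stops here
            break
        track[idx] = obj_id
    return track

def keyframe_anchored_track(frames, obj_id, keyframes):
    """Re-seed propagation at each keyframe so identity survives gaps."""
    tracked = {}
    for kf in keyframes:                # keyframes where object is visible
        tracked.update(propagate(frames, kf, obj_id))
    return tracked

# Toy video: an IV pole is visible, occluded for frames 3-4, reappears.
frames = [{"iv_pole"}, {"iv_pole"}, {"iv_pole"}, set(), set(),
          {"iv_pole"}, {"iv_pole"}]
# Annotators mark frame 0 (first appearance) and frame 5 (re-emergence).
result = keyframe_anchored_track(frames, "iv_pole", keyframes=[0, 5])
print(sorted(result))   # [0, 1, 2, 5, 6] -- same ID spans the gap
```

The key design choice is that the re-seeding points come from human annotators, who spot exactly where an occlusion ends, rather than from the tracker itself guessing at re-identification.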
Closing the Loop with Model-Assisted Pre-Annotation
As the annotated dataset grew, Labellerr put it to work immediately. The customer's computer vision model, now trained on Labellerr's high-quality annotated data, was integrated back into the annotation pipeline as a pre-annotation engine for subsequent batches.
Incoming videos were automatically pre-labeled by the model, with annotators then reviewing and correcting outputs rather than starting from scratch. This closed-loop system meant that with every new batch, the process got faster, the pre-annotations got more accurate, and the human effort required per video steadily decreased, without any compromise on the final label quality.
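The economics of that closed loop are straightforward: human effort shrinks to reviewing only the labels the model got wrong. The sketch below simulates three batches with assumed (not customer-reported) accuracy figures to show the shape of the effect:

```python
# Sketch of the model-assisted pre-annotation loop: the model proposes
# labels, annotators correct only the wrong ones, and the corrected
# batch is (conceptually) fed back for retraining. Accuracy figures
# per batch are illustrative assumptions, not customer metrics.

def review_batch(pre_labels, ground_truth):
    """Annotators keep correct pre-labels and fix the rest."""
    corrections = sum(1 for p, g in zip(pre_labels, ground_truth) if p != g)
    return ground_truth, corrections      # final labels, human edits made

# Simulate three batches of 1,000 labels as model accuracy matures.
truth = ["bed_rail"] * 1000
for accuracy in (0.70, 0.85, 0.95):       # assumed per-batch accuracy
    wrong = int(len(truth) * (1 - accuracy))
    pre = ["iv_pole"] * wrong + truth[wrong:]   # model's pre-annotations
    _, edits = review_batch(pre, truth)
    print(f"accuracy {accuracy:.0%}: {edits} manual corrections")
# Manual corrections fall from 300 to 150 to 50 as the loop matures,
# while the final labels stay at full ground-truth quality.
```

Because the final labels are always the human-verified ones, the loop compresses effort without lowering the quality bar, which is the property the paragraph above describes.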
Results and Impact
The combined approach of SAM2-powered tracking, keyframe-based occlusion handling, and model-assisted pre-annotation delivered measurable outcomes across every dimension that mattered:
- 100+ videos annotated with consistent, reliable object tracking across all frames
- 50+ object classes tracked simultaneously with stable identities, even through heavy occlusion
- Significant reduction in annotation turnaround time as the model pre-annotation loop matured with each batch
- Higher data quality compared to traditional tracking pipelines, validated through annotation review and model performance
- A scalable, repeatable pipeline the customer can continue to use as their video dataset grows
Most importantly, the customer now has a training dataset strong enough to power a computer vision model that can reliably audit hospital cleaning events in real-world conditions: the exact kind of visual complexity the annotations were built to handle.
Conclusion
Building AI that works in the real world starts with data that reflects the real world, in all its messiness, complexity, and unpredictability. For a healthcare AI startup tackling one of medicine's hardest operational problems, that meant training data annotated with a level of precision that standard tools simply couldn't deliver.
Labellerr's ability to combine cutting-edge models like SAM2 with human expertise and a self-improving annotation pipeline made the difference, turning hundreds of complex clinical videos into a robust, high-quality training dataset that is now at the core of an AI system designed to protect patients.
Because when the mission is hospital safety, the data behind the model can't afford to be anything less than accurate.
FAQs
Q1: What makes video annotation for healthcare AI different from standard object tracking tasks?
Healthcare environments involve dense, fast-moving scenes with frequent occlusions — staff, equipment, and surfaces constantly overlap. Unlike typical tracking tasks, every object must maintain a consistent identity across hundreds of frames, making accuracy-critical annotation far more complex than general-purpose use cases.
Q2: Why is SAM2 better suited for occluded environments than traditional tracking models?
Traditional trackers rely on continuous visual signals and often lose object identity the moment something is hidden from view. SAM2 is built for video object segmentation with persistent memory across frames, making it significantly more resilient when objects temporarily disappear and reappear in complex scenes.
Q3: How does model-assisted pre-annotation improve the data labeling process over time?
As annotated data grows, the trained model can be fed back into the pipeline to auto-label new incoming videos. Annotators then review and correct predictions rather than labeling from scratch — reducing turnaround time, lowering cost per video, and improving consistency with each successive batch.