EMMA: Teaching Robots Through Egocentric Human Learning

What if robots could learn to navigate and manipulate objects just by watching humans do it? No expensive teleoperation rigs, no lab-controlled setups, just a person wearing smart glasses going about a task.

That is exactly what EMMA (Egocentric Mobile Manipulation) delivers. Published in IEEE Robotics and Automation Letters by researchers at Georgia Institute of Technology, EMMA is one of the most practically significant advances in robot learning this decade. It does not just improve performance; it changes where robot training data comes from.


The Problem: Mobile Manipulation Is Starving for Data

Teaching robots to move and manipulate at the same time is hard. Static manipulation, picking up objects from a fixed position, has seen rapid progress. But mobile manipulation, where the robot must navigate while also performing precise arm tasks, is a different beast.

The bottleneck is data. Current systems like Mobile ALOHA rely on mobile robot teleoperation: a human physically controls a robot through a full task, and that motion gets recorded. This process is slow, expensive, physically demanding, and hard to scale. You need the robot hardware on-site, a trained operator, and controlled conditions, every single time.

EMMA's core argument is simple: humans already do mobile manipulation constantly. They walk to a shelf, grab items, carry them across a room, and hand them off. If you can record that data from the human's first-person view and teach a robot from it, you break the teleoperation bottleneck entirely.

What Is EMMA?

EMMA is an end-to-end framework that co-trains robot policies using two data sources: egocentric human mobile manipulation data and static robot manipulation data collected via standard teleoperation. The human data is captured using Meta Project Aria glasses, a wearable that records egocentric RGB video, 3D hand pose, and global head position using SLAM.

The robot never sees mobile teleoperation data at all. Yet it learns to navigate and manipulate effectively by bridging the gap between how humans move and how a differential-drive robot moves.

Official Paper

The Hardware: A Robot Built to Mirror a Human

  Figure: EMMA's custom bimanual mobile manipulator.

The team built a custom low-cost bimanual mobile manipulator. It uses two 6-DoF ViperX 300s arms mounted in an inverted configuration on a height-adjustable rig, sitting on an AgileX TRACER differential-drive base capable of speeds up to 2 m/s. The robot stands 1.75m tall, close to human height.

Critically, the robot wears Aria glasses in a position that mimics where a human adult's eyes would be. Each arm also carries an Intel RealSense D405 wrist camera for close-range manipulation. This design deliberately shrinks the perceptual gap between human and robot before any algorithmic bridging begins.

The Architecture: Three Systems Working Together

  Figure: Overview of EMMA's three-component architecture.

EMMA's architecture has three main components. Together, they solve a hard alignment problem: how do you train one neural network from data that comes from two fundamentally different embodiments?

1. Data Retargeting and Alignment

Raw human head pose data cannot be fed directly to a differential-drive robot. Humans walk omnidirectionally. Robots on a differential drive can only move in straight lines and arcs.

EMMA solves this with an optimization problem. Given the 3D head trajectory projected onto the ground plane, it finds velocity commands (linear v and angular ω) that produce the closest feasible robot trajectory under kinematic constraints.

The optimization balances three terms: position tracking accuracy, heading alignment, and velocity smoothness. Linear velocity is constrained to ±1.6 m/s and angular velocity to ±1.5 rad/s. The result is a smooth, executable path that preserves the human's navigation intent without violating robot physics.
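To make the idea concrete, here is a minimal sketch of this kind of retargeting. The unicycle (differential-drive) model, the 0.1 s timestep, the cost weights, and the use of scipy's L-BFGS-B solver are all illustrative assumptions; only the three cost terms and the ±1.6 m/s and ±1.5 rad/s limits come from the paper.

```python
import numpy as np
from scipy.optimize import minimize

DT = 0.1                    # control timestep in seconds (assumed)
V_MAX, W_MAX = 1.6, 1.5     # limits from the paper: m/s and rad/s

def rollout(controls, start_pose):
    """Integrate differential-drive (unicycle) kinematics for a flat (v, w) vector."""
    v, w = controls.reshape(-1, 2).T
    x, y, th = start_pose
    poses = []
    for vi, wi in zip(v, w):
        th = th + wi * DT
        x = x + vi * np.cos(th) * DT
        y = y + vi * np.sin(th) * DT
        poses.append((x, y, th))
    return np.array(poses)

def retarget(head_xy, head_yaw, start_pose,
             w_pos=1.0, w_head=0.1, w_smooth=0.01):
    """Find (v, w) commands whose rollout tracks the projected head trajectory."""
    T = len(head_xy)

    def cost(controls):
        poses = rollout(controls, start_pose)
        pos_err = np.sum((poses[:, :2] - head_xy) ** 2)                  # position tracking
        dyaw = poses[:, 2] - head_yaw
        head_err = np.sum(np.arctan2(np.sin(dyaw), np.cos(dyaw)) ** 2)   # heading alignment
        smooth = np.sum(np.diff(controls.reshape(-1, 2), axis=0) ** 2)   # velocity smoothness
        return w_pos * pos_err + w_head * head_err + w_smooth * smooth

    x0 = np.zeros(2 * T)
    bounds = [(-V_MAX, V_MAX), (-W_MAX, W_MAX)] * T
    res = minimize(cost, x0, bounds=bounds, method="L-BFGS-B")
    return res.x.reshape(-1, 2)     # one (v, w) command per timestep

# Example: a straight 1 m walk sampled at 10 Hz (synthetic stand-in for Aria head pose)
head_xy = np.stack([np.linspace(0, 1, 20), np.zeros(20)], axis=1)
commands = retarget(head_xy, head_yaw=np.zeros(20), start_pose=(0.0, 0.0, 0.0))
```

Any smooth trajectory optimizer would work here; the essential point is that the output is a sequence of (v, ω) commands the differential-drive base can actually execute.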

For manipulation, the system aligns coordinate frames using SLAM for human data and hand-eye calibration for robot data, then applies Z-score normalization separately per source to manage distributional gaps from biomechanics and sensor differences.
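The normalization step itself is simple. Below is a minimal sketch, assuming per-dimension statistics and placeholder arrays standing in for the aligned human hand poses and robot end-effector poses:

```python
import numpy as np

def fit_stats(x, eps=1e-6):
    """Per-dimension mean and std for one data source (human OR robot)."""
    return x.mean(axis=0), x.std(axis=0) + eps

def z_normalize(x, stats):
    mean, std = stats
    return (x - mean) / std

# Placeholder arrays standing in for aligned human hand poses and robot EEF poses.
human_eef = np.random.randn(1000, 6) * 0.30 + 0.5
robot_eef = np.random.randn(800, 6) * 0.15 - 0.1

# Each embodiment keeps its own statistics, so biomechanical and sensor offsets
# between humans and the robot are absorbed here rather than by the policy.
human_norm = z_normalize(human_eef, fit_stats(human_eef))
robot_norm = z_normalize(robot_eef, fit_stats(robot_eef))
```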

2. Co-Training Architecture

The policy model is a decoder-only Transformer with modality-specific input stems and multiple output heads. A shared vision stem processes the egocentric RGB stream from Aria glasses, whether worn by a human or mounted on the robot. This shared stem forces visual feature alignment across embodiments.

Separate stems handle robot wrist cameras, end-effector poses, joint positions, and navigation waypoints. The Transformer trunk is updated by all data sources and modalities, learning representations that work across both humans and robots.

There are four output heads: robot bimanual joint actions, human Cartesian end-effector actions, robot base navigation actions, and an auxiliary phase prediction head.

During a human data batch, the navigation head, human manipulation head, and phase head are active. During a robot data batch, the robot manipulation head and wrist vision stem are active. The trunk and ego vision stem train on everything.
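The sketch below illustrates that routing in PyTorch. The dimensions, token construction, and pooling are assumptions for illustration; only the overall layout (a shared ego vision stem and trunk, a robot-only wrist stem, and per-source output heads) follows the description above.

```python
import torch
import torch.nn as nn

D = 256  # trunk width (assumed)

class EMMAPolicySketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.ego_stem = nn.Linear(512, D)      # shared: Aria ego features (human or robot)
        self.wrist_stem = nn.Linear(512, D)    # robot-only wrist camera features
        self.state_stem = nn.Linear(32, D)     # proprioception / waypoints per source
        layer = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=4)
        self.heads = nn.ModuleDict({
            "robot_arms": nn.Linear(D, 14),    # bimanual joint actions
            "human_eef": nn.Linear(D, 12),     # human Cartesian end-effector actions
            "base": nn.Linear(D, 2),           # (v, w) navigation actions
            "phase": nn.Linear(D, 2),          # navigation vs. manipulation phase
        })

    def forward(self, ego, wrist, state, source):
        tokens = [self.ego_stem(ego), self.state_stem(state)]
        if source == "robot":
            tokens.append(self.wrist_stem(wrist))
        h = self.trunk(torch.stack(tokens, dim=1)).mean(dim=1)
        active = ["robot_arms"] if source == "robot" else ["human_eef", "base", "phase"]
        return {k: self.heads[k](h) for k in active}

# A human batch updates the ego stem, trunk, and navigation/human/phase heads;
# a robot batch updates the ego stem, wrist stem, trunk, and robot arm head.
policy = EMMAPolicySketch()
out = policy(torch.randn(4, 512), None, torch.randn(4, 32), source="human")
```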

3. Auxiliary Phase Identification

Mobile manipulation tasks alternate between navigation and manipulation phases. Moving the base while trying to perform a precise grasp causes drift and errors. EMMA uses an unsupervised phase detection algorithm based on the ratio of hand velocity to head velocity.

When hand velocity is high relative to head velocity, and head velocity is below 0.4 m/s, the system classifies that moment as a manipulation phase. A Gaussian Mixture Model fitted to head positions during these periods spatially localizes up to K manipulation zones.

The result: the robot knows when to stop its base and focus on its arms, and when to move without wasting effort on unnecessary arm motion. Segmentation accuracy across all tested tasks was 92% or higher (mean-over-frames, MoF ≥ 0.92).
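Here is a minimal sketch of such a phase detector. The hand-to-head velocity ratio threshold and the value of K are assumptions; only the 0.4 m/s head-speed cutoff and the idea of fitting a GMM to head positions come from the description above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def detect_manipulation(hand_vel, head_vel, ratio_thresh=2.0, head_thresh=0.4):
    """Per-frame mask: manipulation when the hands move much faster than the head
    and the head is nearly stationary (< 0.4 m/s)."""
    ratio = hand_vel / np.maximum(head_vel, 1e-3)
    return (ratio > ratio_thresh) & (head_vel < head_thresh)

def localize_zones(head_xy, manip_mask, k=3):
    """Fit a GMM to head positions during manipulation frames to find up to K zones."""
    pts = head_xy[manip_mask]
    if len(pts) == 0:
        return np.empty((0, 2))
    gmm = GaussianMixture(n_components=min(k, len(pts))).fit(pts)
    return gmm.means_           # ground-plane centers of the manipulation zones

# Synthetic example: 500 frames of head/hand speeds and a drifting head position
T = 500
head_vel = np.abs(np.random.randn(T)) * 0.5
hand_vel = np.abs(np.random.randn(T)) * 1.0
head_xy = np.cumsum(np.random.randn(T, 2) * 0.02, axis=0)
zones = localize_zones(head_xy, detect_manipulation(hand_vel, head_vel), k=2)
```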

Four Real-World Tasks, One Framework

The team tested EMMA across four real-world tasks, running 50 trials per variant for a total of 1,150 mobile manipulation rollout evaluations.

Table Service: The robot picks up utensils from a kitchen table, navigates 2m to a dining table, places them, then returns to pick a croissant and carry it back, navigating around a wine glass.

Handover Wine: The robot picks a wine glass from a mat and navigates toward a human standing in a 3m × 3m area, handing it precisely to their right hand.

Grocery Shopping: The robot simultaneously grabs a juice pouch and a chip bag from different shelf positions, then uses both arms to pick a large popcorn bag, loads them into a shopping bag, and navigates to a table.

Push Chair: The robot grasps an office chair backrest and pushes it against a table. This task requires continuous arm-base coupling with no phase switching, testing unified full-body coordination.

Results: Human Data Beats Teleoperation

  Figure: Cumulative success rates for EMMA vs. Mobile ALOHA across the Serve Utensils, Serve Croissant, Handover Wine, and Grocery Shopping subtasks.

The results are clear. On the Handover Wine task, replacing one hour of mobile robot teleoperation data with one hour of human mobile manipulation data produced an 82% success rate versus 52% for Mobile ALOHA, a 30-point jump. On Grocery Shopping, EMMA significantly outperformed Mobile ALOHA (p < 0.05) with a full task success rate of 16% versus Mobile ALOHA's 0%.

For Push Chair, EMMA trained on 10 minutes of mobile robot data plus 20 minutes of human data matched the performance of Mobile ALOHA trained on 30 minutes of pure robot teleoperation. Human data effectively subsidized robot data at a favorable ratio.

Scaling behavior matters here

Keeping static robot data fixed at one hour, EMMA's success rate on Handover Wine climbed from 0.36 at 15 minutes of human data to 0.82 at 60 minutes. Mobile ALOHA grew from 0.26 to 0.52 over the same range of equivalent robot teleoperation time. The performance gap widened from 10 percentage points to 30 as data volume increased.

Generalization: The Proof That Really Matters

  Figure: Generalization results, with EMMA at 54% success in a novel environment vs. Mobile ALOHA at near 0%.

EMMA was tested on the Handover Wine task in a completely unseen environment: a new spatial layout where the recipient stood within a 5m × 2m area. EMMA had never seen robot data from this layout. It still achieved a 54% full-task success rate.

Mobile ALOHA, trained only on teleoperated data from the original room, failed to even complete the initial wine glass grasp. The environment change broke it entirely.

This gap exists because human data is collected in varied environments, with varied lighting, varied object positions, and varied movement patterns. That natural diversity transfers. Teleoperation data, recorded under controlled lab conditions, does not.

Limitations

  • EMMA assumes that what the robot sees during deployment resembles what humans saw during data collection. When the viewpoint or kinematic gap grows large, transfer quality falls.
  • For tasks that need tightly coupled arm-base coordination from the start, beyond what Phase Identification handles, pure human data transfer is insufficient. Limited mobile robot data is still needed as an anchor.

Conclusion

EMMA makes a compelling case that the future of robot training data is not in teleoperation labs. It is on people. Egocentric human data, collected with a wearable, in real environments, at natural scale, can produce robot policies that match or outperform systems built on expensive mobile teleoperation. It scales better. It generalizes better. And it costs a fraction of the effort.

The architecture is clean, the results are rigorous, and the implications reach well beyond mobile manipulation.

Power Your Robot Learning Pipeline with Labellerr

Egocentric robot learning starts with high-quality data, and that is precisely where most teams hit a wall. Collecting, curating, and annotating first-person video at the scale EMMA demands requires a purpose-built data pipeline.

Labellerr offers end-to-end egocentric data services built for robotics teams. From wearable data capture workflows to multi-modal annotation (hand pose, gaze tracking, action segmentation, phase labeling, and semantic scene tagging), Labellerr handles the full data lifecycle.

Whether you are building mobile manipulation policies, training imitation learning systems, or benchmarking cross-embodiment transfer, Labellerr's annotation platform and managed data services give your team the labeled data it needs to ship faster.

Ready to scale your egocentric data pipeline? Talk to the Labellerr team today.

FAQs

1. What makes EMMA different from traditional robot teleoperation systems?

EMMA trains robots using egocentric human data collected from wearable devices instead of relying entirely on mobile robot teleoperation. This reduces hardware dependency, lowers data collection costs, and improves generalization in real-world environments.

2. How does EMMA transfer human movement to a mobile robot?

EMMA uses trajectory optimization and kinematic alignment to convert human head and hand movements into feasible robot navigation and manipulation commands while respecting differential-drive robot constraints.

3. Why is egocentric data important for robot learning?

Egocentric data captures natural human interactions in diverse environments, helping robots learn navigation and manipulation behaviors that generalize better than lab-recorded teleoperation datasets.