ExBody2: Stable Expressive Humanoid Whole Body Motion Tracking with Sim-to-Real Learning

Humanoid robots fail when expressive motion breaks balance. ExBody2 shows how feasibility-aware data curation and decoupled control enable stable, long-horizon, human-like full-body motion on real robots.


Ask a humanoid robot to walk forward, and most systems can handle it.
Ask the same robot to walk while throwing a punch, shifting weight dynamically, and recovering balance mid-motion, and things break.

This is not a hardware problem. It is a whole-body control problem. Human motion is expressive, high-momentum, and deeply coupled across the entire body. A punch is not just an arm movement; it is a coordinated interaction between feet, hips, torso, and shoulders.

Traditional humanoid controllers fail because they treat these components independently and rely on brittle global tracking assumptions.

ExBody2 addresses this exact failure mode.
It shows, for the first time, that a real humanoid robot can execute long-horizon, expressive, full-body human motions without sacrificing stability.


The Core Difficulty of Whole-Body Humanoid Control

Whole-body humanoid control is fundamentally different from manipulation or locomotion alone. A humanoid must simultaneously manage balance, contact forces, and motion tracking across more than 20 degrees of freedom. Any small error in the lower body propagates upward, destabilizing the entire system.

Previous learning-based approaches faced a hard trade-off. Methods that emphasized expressiveness often became unstable, while methods that prioritized stability suppressed dynamic motion. This tension becomes worse when learning directly from human motion data, which contains movements that are physically impossible for robots.

ExBody2 reframes the problem. Instead of forcing a single policy to learn everything at once, it introduces structure at the dataset level, the policy level, and the control objective itself.

What ExBody2 Is, Precisely

ExBody2 is a sim-to-real framework for expressive humanoid whole-body motion tracking. Its goal is to allow a humanoid robot to imitate full-body human motion sequences with high fidelity while remaining stable in real-world deployment.

The system is built around three technical ideas:

1. Human motion data is automatically filtered using an empirically grounded feasibility metric instead of manual heuristics.

2. Motion learning is divided into a generalist-specialist hierarchy to balance diversity and precision.

3. Motion tracking is decoupled from velocity control to eliminate global drift.

These ideas are tightly coupled. Removing any one of them causes the system to collapse under expressive motion.

Overall Pipeline

Motion Retargeting and Dataset Preparation

ExBody2 begins with large-scale human motion datasets, such as CMU Mocap, that contain a wide range of actions including walking, dancing, punching, and dynamic transitions.

These datasets are retargeted to the humanoid robot’s kinematic structure so that joint trajectories and keypoints are expressed in the robot’s coordinate space.

At this stage, the dataset still contains infeasible motions. This is intentional. ExBody2 does not assume feasibility a priori. Instead, it learns feasibility empirically by observing how a policy fails to track certain motions.

This choice is critical. Human intuition is a poor filter for robot feasibility, especially when dealing with subtle balance violations and long-horizon instability.

Automated Data Curation Using Tracking Error

The central insight of ExBody2 is that lower-body feasibility determines whole-body stability. Upper-body expressiveness can be diverse, but lower-body motion must remain within physical limits for the robot to stay upright.

To operationalize this, ExBody2 defines a tracking error score for each motion sequence s:

e(s) = α · E_key(s) + β · E_dof(s)

Here, E_key(s) is the mean lower-body keypoint position error, which penalizes motions that cause large deviations in foot and leg placement, and E_dof(s) is the mean lower-body joint-angle tracking error. The coefficients are set to α = 0.1 and β = 0.9, deliberately prioritizing joint feasibility over raw spatial alignment.
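The score itself is straightforward to compute. Below is a minimal sketch in Python, assuming the per-frame lower-body errors have already been collected from rollouts; the helper name and array shapes are illustrative, not from the paper's code.

```python
import numpy as np

# Coefficients reported in the paper: joint-angle feasibility dominates.
ALPHA, BETA = 0.1, 0.9

def tracking_error_score(keypoint_err, dof_err):
    """Feasibility score e(s) for one motion sequence.

    keypoint_err : (T,) per-frame mean lower-body keypoint position error
    dof_err      : (T,) per-frame mean lower-body joint-angle error
    """
    e_key = float(np.mean(keypoint_err))   # E_key(s)
    e_dof = float(np.mean(dof_err))        # E_dof(s)
    return ALPHA * e_key + BETA * e_dof    # e(s) = alpha*E_key + beta*E_dof
```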

A base policy π_0 is trained on the full, unfiltered dataset. This policy is then used to evaluate every motion sequence, producing an empirical distribution of tracking errors. Instead of guessing which motions are feasible, ExBody2 lets the robot reveal them through failure.

Discovering the Optimal Feasibility-Diversity Balance

Filtering too aggressively removes dynamic motions and reduces generalization. Filtering too loosely introduces infeasible samples that destabilize training. ExBody2 resolves this by explicitly optimizing the filtering threshold.

For a given threshold τ, a filtered dataset D_τ is defined as all motion sequences with e(s) ≤ τ. A policy π_τ is trained on this subset, and its performance is evaluated across the full dataset. The optimal threshold is defined as:

τ* = argmax_τ E_{s ∈ D} [ Performance(π_τ, s) ]

Empirical evaluation shows that a moderate threshold, specifically τ = 0.15, yields the best results. This threshold removes extreme lower-body motions while preserving upper-body diversity, producing a dataset that is both expressive and physically feasible.
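A minimal sketch of the threshold search follows, assuming `scores` maps sequence IDs to e(s) values, and that `train_policy` and `evaluate` are hypothetical stand-ins for the paper's RL training and full-dataset evaluation loops:

```python
def filter_dataset(scores, tau):
    """D_tau: all sequences whose feasibility score is at most tau."""
    return [seq_id for seq_id, e in scores.items() if e <= tau]

def select_threshold(scores, candidate_taus, train_policy, evaluate):
    """Pick tau* = argmax_tau E_{s in D}[Performance(pi_tau, s)]."""
    best_tau, best_perf = None, float("-inf")
    for tau in candidate_taus:
        subset = filter_dataset(scores, tau)
        pi_tau = train_policy(subset)          # train on the filtered subset
        perf = evaluate(pi_tau, list(scores))  # but evaluate on the FULL dataset
        if perf > best_perf:
            best_tau, best_perf = tau, perf
    return best_tau

# The paper's sweep lands on tau = 0.15 as the best trade-off.
```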

This step is the foundation of the entire ExBody2 system.

The Generalist-Specialist Policy Hierarchy

Training a single policy on all motions forces it to average across incompatible dynamics. Training separate policies from scratch destroys generalization. ExBody2 avoids both extremes.

A generalist policy is trained on the automatically curated dataset using the optimal threshold. This policy learns a broad prior over human motion while maintaining stability. It is robust, adaptable, and capable of tracking diverse motion types.

For high-precision tasks such as dancing or kung fu, the generalist is then fine-tuned to create specialist policies. These specialists inherit the generalist’s robustness while refining control strategies for a narrow motion distribution.

This pretrain–finetune structure is essential. Experiments show that specialist policies consistently outperform policies trained from scratch, especially on high-momentum motions.
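Conceptually, the specialist step is ordinary pretrain-then-fine-tune. A sketch, assuming a PyTorch-style policy object and a hypothetical `finetune` routine that continues RL training on the narrow motion set (the learning rate is an illustrative placeholder):

```python
import copy

def make_specialist(generalist, specialist_motions, finetune, lr=1e-4):
    """Fine-tune a copy of the generalist on one motion family
    (e.g. a set of dance clips) instead of training from scratch."""
    specialist = copy.deepcopy(generalist)           # inherit the broad motion prior
    finetune(specialist, specialist_motions, lr=lr)  # continue RL on the subset
    return specialist
```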

Decoupled Motion-Velocity Control

Traditional whole-body imitation methods rely on global keypoint tracking. Over time, small errors accumulate, and the robot diverges from the reference trajectory. This leads to instability even when the local motion is correct.

ExBody2 decouples control objectives. Global movement is governed by velocity tracking, while posture and expression are governed by local keypoint tracking. Keypoints are expressed in the robot’s local coordinate frame, not the world frame.

This allows the robot to remain expressive even when its global position drifts slightly. Stability is preserved because balance is controlled through velocity and orientation objectives rather than brittle spatial alignment.
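The key implementation detail is the frame transform applied to the tracking targets. A minimal sketch, assuming the root position and heading are available from state estimation (function name and shapes are illustrative):

```python
import numpy as np

def world_to_local_keypoints(keypoints_w, root_pos_w, root_yaw):
    """Express keypoints in the robot's yaw-aligned local frame.

    keypoints_w : (K, 3) keypoint positions in the world frame
    root_pos_w  : (3,)   robot root position in the world frame
    root_yaw    : float  robot heading in radians

    Subtracting the root position and rotating by -yaw removes global
    drift from the tracking targets: only the pose relative to the robot
    is penalized, while global movement is handled by velocity tracking.
    """
    c, s = np.cos(-root_yaw), np.sin(-root_yaw)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
    return (keypoints_w - root_pos_w) @ R.T
```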

Teacher-Student Learning for Sim-to-Real Transfer

Teacher-student framework for humanoid motion learning, where the teacher uses privileged information and the student learns from past observations to generate control actions.


The teacher policy is trained in simulation using Proximal Policy Optimization (PPO), where the full system state is accessible. This training happens in the reinforcement learning phase of ExBody2, before any sim-to-real transfer.

During this stage, the teacher has access to privileged information that is unavailable on real hardware, including the true root linear velocity, precise body link positions, and physical environment parameters such as ground friction.

The teacher’s objective is to learn a control policy that tracks the reference motion while maintaining stability. This is achieved by maximizing the expected cumulative discounted reward, defined as:

E_π̂ [ Σ_{t=0}^{T} γ^t · R(s_t, â_t) ]

Here,
π̂ denotes the teacher policy,
t is the discrete time step,
T is the episode horizon,
γ ∈ (0, 1) is the discount factor controlling the importance of future rewards,
s_t is the system state at time step t, which includes privileged information, proprioceptive observations, and motion tracking targets,
â_t is the action output by the teacher policy at time t, and
R(s_t, â_t) is the reward function, which encourages accurate joint tracking, keypoint alignment, velocity tracking, and overall stability.
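The paper's exact reward weights are not reproduced here, but the reward's shape can be sketched as a weighted sum of exponentiated tracking errors, a common pattern in motion-imitation RL. All term names, weights, and scales below are assumptions:

```python
import numpy as np

def tracking_reward(q, q_ref, kp, kp_ref, v_root, v_ref,
                    w_joint=0.5, w_key=0.3, w_vel=0.2):
    """Illustrative reward R(s_t, a_t): higher when joint angles, local
    keypoints, and root velocity all stay close to the reference."""
    r_joint = np.exp(-np.sum((q - q_ref) ** 2))       # joint-angle tracking
    r_key   = np.exp(-np.sum((kp - kp_ref) ** 2))     # local keypoint alignment
    r_vel   = np.exp(-np.sum((v_root - v_ref) ** 2))  # root velocity tracking
    return w_joint * r_joint + w_key * r_key + w_vel * r_vel
```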

The teacher policy outputs target joint positions rather than torques. These targets are passed to low-level Proportional-Derivative (PD) controllers, which convert them into motor torques.

This design choice ensures smooth motion execution and improves stability during both training and deployment.
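The action-to-torque conversion is a standard per-joint PD law. A minimal sketch (gains are per-joint and tuned for the hardware; nothing here is a hardware value):

```python
def pd_torque(q_target, q, q_dot, kp, kd):
    """Convert policy joint-position targets into motor torques.

    q_target : desired joint positions output by the policy
    q, q_dot : measured joint positions and velocities
    kp, kd   : per-joint proportional and derivative gains
    """
    # Target joint velocity is taken as zero, the usual convention when
    # the policy outputs positions at a lower rate than the PD loop runs.
    return kp * (q_target - q) - kd * q_dot
```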

Once the teacher policy has converged, ExBody2 transitions to the student policy training stage, which is designed for real-world execution.

The student policy does not have access to privileged information, as such data cannot be reliably measured on physical robots. Instead, the student observes a history of past proprioceptive states and motion targets.

The student policy is trained using DAgger-style imitation learning, where the teacher provides supervision during rollouts. The student minimizes the following mean squared error loss:

ℓ = ‖ a_t − â_t ‖²

In this equation,
ℓ denotes the distillation loss,
a_t is the action predicted by the student policy at time step t, and
â_t is the corresponding teacher action.

This loss is evaluated at every visited state during training, and gradients are backpropagated through the student network to align its behavior with that of the teacher.

A history length of ten frames is used for the student’s input. This temporal window allows the student to implicitly infer quantities such as velocity and contact conditions, which the teacher observes explicitly through privileged information.

As a result, the student policy becomes robust to sensor noise, partial observability, and real-world uncertainty while maintaining behavior closely aligned with the teacher’s stable and expressive motion strategy.
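Putting the distillation pieces together, here is a minimal DAgger-style update in PyTorch. The environment interface (`observe_history`, `privileged_state`, `step`) is a hypothetical stand-in for the simulator wrapper, not the paper's API:

```python
import torch
import torch.nn.functional as F

HISTORY = 10  # frames of proprioception and motion targets in the student input

def distill_step(student, teacher, env, optimizer):
    """One DAgger-style update: act with the student, supervise with the teacher."""
    obs_hist = env.observe_history(HISTORY)  # student sees only past observations
    priv = env.privileged_state()            # teacher-only privileged information
    with torch.no_grad():
        a_teacher = teacher(priv)            # supervision target
    a_student = student(obs_hist)
    loss = F.mse_loss(a_student, a_teacher)  # l = ||a_t - a_hat_t||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    env.step(a_student.detach())             # visit states under the student policy
    return loss.item()
```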

Evaluation

| Method | Whole-Body Joint Error (MPJPE, rad) | Upper-Body Error (rad) | Lower-Body Error (rad) |
| --- | --- | --- | --- |
| ExBody (Original) | 0.2020 | 0.1343 | 0.2952 |
| OmniH2O* | 0.1681 | 0.1564 | 0.1843 |
| ExBody2 (Ours) | 0.1079 | 0.0953 | 0.1253 |

Simulation Results (on the CMU Dataset)

In simulation, ExBody2 achieves a whole-body mean per-joint position error of 0.1079 radians, outperforming both OmniH2O and the original ExBody. This reduction indicates more accurate full-body motion tracking across diverse motion sequences.

The improvement is especially pronounced in the lower body. Compared to OmniH2O, lower-body joint error drops from 0.1843 radians to 0.1253 radians. Because lower-body accuracy directly governs balance and contact stability, this gain reflects a meaningful improvement in physical feasibility rather than superficial pose alignment.

| Method | Whole-Body Joint Error (MPJPE, rad) | Upper-Body Error (rad) | Lower-Body Error (rad) |
| --- | --- | --- | --- |
| ExBody (Original) | 0.2178 | 0.1223 | 0.3239 |
| OmniH2O* | 0.1396 | 0.1273 | 0.1533 |
| ExBody2 (Ours) | 0.1074 | 0.1092 | 0.1054 |

Real-World Results (Unitree G1 Deployment)

The same pattern appears in real-world deployment on the Unitree G1 humanoid. ExBody2 maintains a whole-body MPJPE of 0.1074 radians, compared to 0.1396 radians for OmniH2O and 0.2178 radians for ExBody.

Lower-body joint error in the real world is reduced to 0.1054 radians. This level of precision allows the robot to remain stable during dynamic, long-horizon motions, including extended dance sequences lasting over 40 seconds.

Together, these results show that expressive humanoid motion does not require sacrificing stability. When feasibility is enforced at both the dataset and control levels, accurate motion imitation and robust physical execution can coexist.

Conclusion

ExBody2 shows that expressive humanoid motion is not a hardware fantasy. It is a systems problem that can be solved through principled data curation, structured learning, and realistic control objectives.

The framework enables humanoid robots to move with intent rather than caution. It lays a foundation for future work in human robot interaction, teleoperation, and embodied learning, where motion quality is as important as task success.

This is not the end of humanoid motion control. But it is the first time the problem is attacked honestly, at the level where it actually breaks.

Why do traditional humanoid controllers fail during expressive full-body motion?

Traditional controllers rely on global tracking and treat body parts independently. This causes instability when high-momentum, coupled motions require precise lower-body coordination for balance.

How does ExBody2 maintain stability while executing expressive motions?

ExBody2 enforces lower-body feasibility through automated data curation and decouples velocity control from posture tracking, preventing drift and balance failure.

Why is lower-body accuracy more critical than upper-body accuracy in humanoid control?

Lower-body errors directly affect contact forces and balance. Even small inaccuracies in leg motion can destabilize the entire humanoid system.
