DreamDojo Platform for Scalable Robot Training
Robotics still suffers from a data bottleneck. We want robots that can fold clothes, pack fruit or manipulate tools in unseen kitchens. Yet most robot learning pipelines rely on narrow teleoperation datasets.
These datasets are expensive to collect. They cover limited scenes. They encode a small set of skills. Modern embodied AI systems demand more. They need generalization across objects, environments, and action distributions.
They need a robot world model that predicts what will happen after an action. They must support model based reinforcement learning and policy evaluation without repeated physical deployment. Human videos provide the missing scale.
Everyday human interactions encode contact dynamics, object permanence, deformation and causal structure. If we can learn from that signal, we can train foundation models for robotics that understand physics before ever touching hardware.
DreamDojo takes this path. It introduces a generalist robot world model trained from 44,711 hours of egocentric human video. It learns interaction dynamics from large-scale human experience, then adapts to specific robots through post training and distillation.
The result is a controllable, action conditioned, real time world model for embodied AI. This is a shift from narrow policy learning to scalable world modeling.
What Is DreamDojo?
DreamDojo is a foundation robot world model. It implements an interactive state transition function:
s_{t+1} ∼ p(· | s_t, a_t)
where s_t is the state of the world at time step t, a_t is the action taken at time t, s_{t+1} is the next state after executing that action, and p(· | s_t, a_t) denotes the conditional distribution over possible next states given the current state and action.
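The transition function above can be sketched as a sampling interface. This is a minimal illustration, not DreamDojo's implementation: `transition_fn` is a hypothetical stand-in for the learned model, and the Gaussian noise scale is an arbitrary toy choice.

```python
import numpy as np

def step(state, action, transition_fn, rng):
    """Sample s_{t+1} ~ p(. | s_t, a_t) from a stochastic transition model.

    `transition_fn` is a hypothetical stand-in for a learned world model:
    here it maps (state, action) to the mean of a Gaussian over next states.
    """
    mean = transition_fn(state, action)
    # The model defines a distribution over next states, not a point estimate.
    return mean + 0.01 * rng.standard_normal(mean.shape)

# Toy transition: the next latent state drifts toward the action.
toy_model = lambda s, a: 0.9 * s + 0.1 * a
rng = np.random.default_rng(0)
s_next = step(np.zeros(8), np.ones(8), toy_model, rng)
```

Repeatedly calling `step` with a policy's actions yields an imagined rollout, which is exactly what planning and policy evaluation consume.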
In DreamDojo, states are video frames represented in a latent space. Actions are either latent actions extracted from human videos or robot joint commands after post training. The model predicts future video latents conditioned on these actions.
Unlike prior video world models trained on games or driving data, DreamDojo targets contact rich manipulation. It supports high dimensional continuous control. It handles counterfactual actions. It generalizes to unseen objects.
This matters for generalist robotics. A robot equipped with such a world model can simulate trajectories, evaluate policies, and perform model based planning without repeated real world rollouts.
DreamDojo positions robot world modeling inside the foundation model paradigm.
Large Scale Human Video Training for Robot World Models
Quantitative Insights into DreamDojo’s Robotics World Modeling
The core insight behind DreamDojo is scale. The DreamDojo-HV dataset contains 44,711 hours of egocentric human video.
The full data mixture includes:
· 9,869 unique scenes
· 6,015 unique tasks
· 43,237 unique objects
The dataset spans household, retail, industrial, educational, and administrative environments. It covers diverse skills beyond pick and place, including scrubbing, folding, rotation, assembly, and carving. This diversity drives generalization.
Robot datasets often contain narrow task distributions. Human video data contains stochastic intentions, varied contact modes, and long horizon subtask structure. DreamDojo leverages this distributional richness to improve action controllability and physics modeling.
The final pretraining mixture combines In-lab, EgoDex and DreamDojo-HV data. The scale exceeds prior robot world model datasets in both duration and diversity. This is large scale human video training applied directly to embodied AI.
Core Technical Architecture of the DreamDojo World Model
Overview of DreamDojo
DreamDojo builds on Cosmos-Predict2.5, a latent video diffusion model trained with flow matching. The base objective is:
L_flow(θ) = E_{x, ε, c, t} ‖ u(x_t, t, c; θ) − v_t ‖²
where v_t = ε − x and u(·) is the denoiser that predicts the velocity field in latent space.
DreamDojo modifies this architecture for robotics.
First, it uses relative actions instead of absolute joint poses. Actions are re-baselined at the beginning of each latent frame chunk. This reduces modeling complexity and improves compositional control over continuous robot actions.
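The re-baselining step can be sketched as follows. This is an illustrative interpretation, assuming 1-D joint poses and a chunk size matching the latent-frame chunk length; the real action format is richer.

```python
import numpy as np

def rebaseline_actions(abs_actions, chunk_size=4):
    """Convert absolute joint poses into actions expressed relative to the
    first pose of each chunk. `chunk_size` matching the latent-frame chunk
    length is an assumption for this sketch."""
    abs_actions = np.asarray(abs_actions, dtype=float)
    rel = abs_actions.copy()
    for start in range(0, len(abs_actions), chunk_size):
        # Subtract the chunk's first pose so each chunk starts at zero.
        rel[start:start + chunk_size] -= abs_actions[start]
    return rel

poses = np.array([[1.0], [1.2], [1.5], [1.6], [2.0], [2.1]])
rel = rebaseline_actions(poses, chunk_size=4)
# The first entry of each chunk becomes zero; the rest are offsets from it.
```

Because every chunk starts from zero, the model sees the same relative pattern regardless of where in joint space the motion occurs, which is what makes control compositional.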
Second, it injects actions in temporal chunks. Because the WAN2.2 tokenizer compresses time by a factor of 4, DreamDojo concatenates four consecutive actions and conditions the corresponding latent frame on them. This preserves causality between actions and predicted future states.
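The chunking itself is a simple reshape, sketched below under the assumption of flat action vectors and a sequence length already padded to a multiple of the temporal stride.

```python
import numpy as np

def chunk_actions(actions, temporal_stride=4):
    """Group consecutive actions so each latent frame is conditioned on the
    `temporal_stride` raw actions it covers (the WAN2.2 tokenizer compresses
    time 4x). Returns shape (num_latent_frames, stride * action_dim)."""
    actions = np.asarray(actions)
    t, d = actions.shape
    assert t % temporal_stride == 0, "pad or trim to a multiple of the stride"
    return actions.reshape(t // temporal_stride, temporal_stride * d)

acts = np.arange(16, dtype=float).reshape(8, 2)  # 8 timesteps, 2-dim actions
chunks = chunk_actions(acts)                     # -> shape (2, 8)
```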
Third, it introduces a temporal consistency loss:
L_temporal(θ) = E [ Σ_{i=1}^{K−1} ‖ (z_{i+1} − z_i) − (v_{i+1} − v_i) ‖² ]
where K is the latent sequence length, z_i is the predicted velocity at step i, v_i is the ground-truth velocity at step i, and ‖·‖² is the squared L2 norm measuring the difference between predicted and true temporal transitions.
The final objective becomes:
L_final(θ) = L_flow(θ) + λ L_temporal(θ)
where λ = 0.1 is the coefficient balancing the temporal consistency loss L_temporal(θ) against the flow matching loss L_flow(θ), and θ denotes the model parameters being optimized.
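The combined objective can be computed as below. This is a sketch, not the training code: it assumes per-sample velocity tensors of shape (K, D) and uses simple mean reductions.

```python
import numpy as np

def dreamdojo_losses(pred_velocity, true_velocity, lam=0.1):
    """Sketch of the combined objective: flow matching loss plus the
    temporal consistency term on differences between successive
    latent-frame velocities. Shapes (K, D) are an assumption."""
    flow = np.mean((pred_velocity - true_velocity) ** 2)
    dz = pred_velocity[1:] - pred_velocity[:-1]   # predicted transitions
    dv = true_velocity[1:] - true_velocity[:-1]   # ground-truth transitions
    temporal = np.mean((dz - dv) ** 2)
    return flow + lam * temporal

rng = np.random.default_rng(0)
z = rng.standard_normal((5, 16))
v = rng.standard_normal((5, 16))
loss = dreamdojo_losses(z, v)
```

Note that the temporal term is zero whenever predicted and true velocities differ only by a constant offset; it penalizes errors in how the prediction changes over time, not the absolute error itself.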
These changes improve action controllability, object completeness, and long horizon consistency, turning a generic video diffusion backbone into an action conditioned robot world model.
Continuous Latent Actions for Cross Embodiment Transfer
Latent Video Understanding Turns Actions into Robot Predictions
Human videos lack explicit action labels. DreamDojo addresses this limitation by introducing continuous latent actions.
A 700M-parameter spatiotemporal Transformer VAE extracts compact action embeddings directly from consecutive video frames, allowing the model to infer motion without relying on manual annotations.
The encoder consumes frame pairs f_{t:t+1} and outputs a low-dimensional latent â_t. The decoder reconstructs f_{t+1} conditioned on f_t and â_t.
This formulation forces the latent variable to capture only the motion necessary to transform one frame into the next. The training objective is:
L_pred = E_{q_φ(â | f_{t:t+1})} [ log p_θ(f_{t+1} | â, f_t) ] − β D_KL( q_φ(â | f_{t:t+1}) ‖ p(â) )
where q_φ(â | f_{t:t+1}) is the encoder distribution parameterized by φ, â is the latent action, f_t and f_{t+1} are consecutive video frames, p_θ(f_{t+1} | â, f_t) is the decoder distribution parameterized by θ, D_KL is the Kullback–Leibler divergence, p(â) is the prior distribution over latent actions, and β = 10⁻⁶ is the regularization coefficient controlling the KL term.
The KL regularization introduces an information bottleneck, ensuring that the latent action remains compact and motion centric rather than encoding static visual content.
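Assuming a Gaussian encoder and a standard-normal prior (a common VAE setup, not stated explicitly here), the KL term has a closed form and the loss can be sketched as:

```python
import numpy as np

def latent_action_loss(recon_err, mu, logvar, beta=1e-6):
    """Sketch of the latent-action VAE objective: reconstruction error plus
    a beta-weighted KL term against a standard-normal prior. `mu` and
    `logvar` parameterize the encoder q(a_hat | f_{t:t+1}); the Gaussian
    form is an assumption of this sketch."""
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims.
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return recon_err + beta * kl

mu = np.zeros(8)
logvar = np.zeros(8)
loss = latent_action_loss(recon_err=0.0, mu=mu, logvar=logvar)
# With a standard-normal posterior the KL term vanishes.
```

The tiny β keeps the bottleneck loose enough to encode fine-grained motion while still discouraging the latent from memorizing static appearance.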
As a result, the embedding captures semantically meaningful interaction dynamics that transfer across embodiments.
During world model pretraining, these continuous latent actions serve as a universal control proxy. They condition the diffusion backbone through learned MLP projections, providing a unified action interface across large scale human video and downstream robot control.
This design bridges raw human interaction data with robot action spaces while preserving strong zero shot generalization.
Training Pipeline and Scaling Strategy
DreamDojo uses a three-phase training pipeline.
Pretraining uses 44,711 hours of human video with a sampling ratio of 1:2:10 across In-lab, EgoDex, and DreamDojo-HV. Models are trained for 140k steps with effective batch size 1024. Two variants are released: 2B and 14B parameter models.
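The 1:2:10 ratio translates into per-source sampling probabilities as follows; the normalization shown is the obvious reading, not a detail confirmed by the source.

```python
def mixture_weights(ratios):
    """Convert the stated 1:2:10 In-lab : EgoDex : DreamDojo-HV sampling
    ratio into normalized per-source sampling probabilities."""
    total = sum(ratios)
    return [r / total for r in ratios]

weights = mixture_weights([1, 2, 10])
# Under this ratio DreamDojo-HV supplies 10/13 of sampled clips.
```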
Post-training adapts the model to specific robots such as GR-1, G1, and AgiBot. The action conditioning layer is reset and finetuned with robot joint trajectories.
Distillation converts the bidirectional diffusion model into an autoregressive student model using Self Forcing. The teacher uses 35 denoising steps. The student uses 4 steps. This enables real-time inference at 10.81 FPS.
The scaling strategy mirrors foundation models in NLP and vision. Pretrain large. Adapt small. Distill for deployment.
Diversity, Generalization and Out-of-Distribution Robustness
DreamDojo evaluates generalization on six OOD benchmarks.
Human preference evaluation shows strong improvements over the base model:
· DreamDojo-14B is preferred over Cosmos-Predict2.5 73.50% of the time on physics correctness and 72.55% of the time on action following.
· Quantitative metrics also improve with larger data mixtures: increasing human video diversity consistently boosts PSNR and SSIM across OOD and counterfactual settings.
· Distillation preserves generalization: the student model runs at 10.81 FPS versus 2.72 FPS for the teacher, with only modest long-horizon degradation.
These results show that large-scale human video training improves cross-scene generalization and counterfactual reasoning.
Applications in Robot Learning and Model Based Reinforcement Learning
Core applications
DreamDojo enables three core applications.
Policy evaluation
In fruit-packing tasks, DreamDojo's predicted success rates correlate strongly with real-world outcomes (Pearson r = 0.995; Mean Maximum Rank Violation = 0.003). This shows that the world model can act as a reliable simulator.
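The correlation metric is a standard Pearson coefficient between predicted and real success rates; a small sketch with toy numbers (not the paper's data) makes the evaluation protocol concrete.

```python
import numpy as np

def pearson_r(predicted, real):
    """Correlate world-model-predicted success rates with real-world
    outcomes. The data below is toy, chosen only for illustration."""
    predicted = np.asarray(predicted, dtype=float)
    real = np.asarray(real, dtype=float)
    return np.corrcoef(predicted, real)[0, 1]

pred = [0.2, 0.5, 0.7, 0.9]    # hypothetical predicted success rates
real = [0.25, 0.45, 0.72, 0.88]  # hypothetical real success rates
r = pearson_r(pred, real)      # close to 1 when the two rankings agree
```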
Model based planning
By generating multiple action proposals and selecting among them with a value model, DreamDojo improves success rates. For one policy group, planning improves performance by 17% over the best checkpoint; in other settings, it yields nearly a 2× improvement over uniform sampling.
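The propose-simulate-select loop can be sketched as below. All three callables (`propose`, `world_model`, `value_fn`) are hypothetical stand-ins for the learned components; the toy value function simply prefers next states near the origin.

```python
import numpy as np

def plan_best_action(state, propose, world_model, value_fn, k=8, rng=None):
    """Sampling-based planning sketch: draw k action proposals, roll each
    through the world model, and keep the proposal whose predicted next
    state the value model scores highest."""
    rng = rng or np.random.default_rng(0)
    proposals = [propose(state, rng) for _ in range(k)]
    scores = [value_fn(world_model(state, a)) for a in proposals]
    return proposals[int(np.argmax(scores))]

# Toy setup: random proposals, additive dynamics, value favors the origin.
best = plan_best_action(
    state=np.ones(4),
    propose=lambda s, rng: rng.standard_normal(4),
    world_model=lambda s, a: s + a,
    value_fn=lambda s: -np.linalg.norm(s),
)
```

Because every rollout happens inside the world model, the robot only executes the single selected action in the real world.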
Live teleoperation
The distilled DreamDojo-2B runs in real time on a single GPU and supports VR-based control at interactive rates. These capabilities demonstrate how foundation models for robotics can integrate with control stacks, reinforcement learning, and teleoperation systems.
Technical Implications for Generalist Robotics
DreamDojo reframes robot learning. Instead of training policies from scratch per task, we can train large robot world models from human data. We can adapt them to embodiments with limited robot data. We can simulate outcomes before executing actions.
This changes how we design embodied AI systems.
It reduces reliance on teleoperation. It supports scalable model-based reinforcement learning. It enables safe policy evaluation. It brings robotics closer to the foundation model paradigm seen in language and vision.
Most importantly, it shows that large-scale human video training can produce physically grounded world models that transfer to robots.
Conclusion
DreamDojo demonstrates that robot world models benefit from scale. By training on 44,711 hours of human video, introducing continuous latent actions, and applying autoregressive distillation, it achieves strong physics modeling, action controllability, and real-time interactivity.
It generalizes to unseen objects. It handles counterfactual actions. It correlates with real-world performance in policy evaluation.
The path forward is clear. Foundation models for robotics will rely on large-scale multimodal pretraining, embodiment adaptation, and fast generative simulation. DreamDojo provides a blueprint for that future.
The robot dojo is now virtual, learned from human experience, and ready to simulate the next action before it happens.
What is a robot world model and why is it important?
A robot world model predicts future states of the environment given current observations and actions. It enables simulation, planning, and policy evaluation without repeated physical interaction.
How does DreamDojo achieve zero-shot generalization?
DreamDojo leverages 44,711 hours of large-scale human video and continuous latent actions as a universal control proxy, allowing it to generalize to unseen objects and environments.
Why use human video data instead of only robot demonstrations?
Human videos provide large-scale interaction diversity across tasks, scenes, and objects, enabling broader generalization and richer physical understanding than limited teleoperation datasets.