ExoActor: Exocentric Video Generation as Generalizable Interactive Humanoid Control

Yanghao Zhou*, Jingyu Ma*, Yibo Peng, Zhenguo Sun, Yu Bai, Börje F. Karlsson
Beijing Academy of Artificial Intelligence (BAAI)
*Equal Contribution    Corresponding Author

Abstract

Humanoid control systems have made significant progress in recent years, yet modeling fluent, interaction-rich behavior between a robot, its surrounding environment, and task-relevant objects remains a fundamental challenge. This difficulty arises from the need to jointly capture spatial context, temporal dynamics, robot actions, and task intent at scale, which is a poor match for conventional supervision. We propose ExoActor, a novel framework that leverages the generalization capabilities of large-scale video generation models to address this problem. The key insight in ExoActor is to use third-person video generation as a unified interface for modeling interaction dynamics. Given a task instruction and scene context, ExoActor synthesizes plausible execution processes that implicitly encode coordinated interactions between robot, environment, and objects. The generated video is then transformed into executable humanoid behaviors through a pipeline that estimates human motion and executes it via a general motion controller, yielding a task-conditioned behavior sequence. To validate the proposed framework, we implement it as an end-to-end system and demonstrate its generalization to new scenarios without additional real-world data collection. We conclude by discussing limitations of the current implementation and outlining promising directions for future research, illustrating how ExoActor provides a scalable approach to modeling interaction-rich humanoid behaviors, potentially opening a new avenue for generative models to advance general-purpose humanoid intelligence.

Method

1. Exocentric Video Generation

A captured third-person robot observation is first transferred into a human-like reference frame with Nano Banana Pro (Gemini 3.1 Pro), preserving scene layout, viewpoint, pose, orientation, scale, and robot-specific proportions. Step-wise action prompts then guide video generation models such as Kling 3 and Veo 3.1 to produce task-consistent execution clips.
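
As an illustration of what "step-wise action prompts" could look like in practice, the sketch below turns an ordered list of action steps into one prompt per step, each carrying the same scene constraints. The class `StepPrompt`, the function `build_step_prompts`, and the wording of `SCENE_CONSTRAINTS` are all hypothetical; the page does not publish the actual prompts, and no video-generation API is called here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepPrompt:
    index: int  # 1-based position of the step within the task
    text: str   # prompt text that would be passed to a video generator


# Hypothetical constraint wording; the real prompts are not given in the source.
SCENE_CONSTRAINTS = (
    "Fixed third-person camera. Keep the scene layout, viewpoint, "
    "and the person's scale and proportions unchanged."
)


def build_step_prompts(task: str, steps: list[str]) -> list[StepPrompt]:
    """Build one prompt per action step, preserving the step order."""
    if not steps:
        raise ValueError("at least one action step is required")
    total = len(steps)
    prompts = []
    for i, step in enumerate(steps, start=1):
        text = (
            f"Task: {task}. Step {i} of {total}: {step.strip()}. "
            f"{SCENE_CONSTRAINTS}"
        )
        prompts.append(StepPrompt(index=i, text=text))
    return prompts


if __name__ == "__main__":
    for p in build_step_prompts(
        "pick up the box",
        ["walk to the table", "grasp the box with both hands"],
    ):
        print(p.index, p.text)
```

Repeating the same constraint text in every step is one simple way to keep consecutive clips visually consistent; whether ExoActor does this is an assumption.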

2. Interaction-Aware Motion Estimation

Generated videos are translated into structured motion using GENMO for whole-body trajectories and WiLoR for bilateral hand states, yielding a synchronized representation of body motion, hand pose, and interaction state.
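
A minimal sketch of such a synchronized representation follows. It assumes that the body estimator yields one pose vector per video frame and that the hand estimator may miss a hand in some frames (reported as `None`), in which case the last known state is carried forward. The names `SyncedFrame`, `fill_missing`, and `synchronize`, the boolean open/closed hand state, and the forward-fill policy are illustrative assumptions, not the actual GENMO or WiLoR interfaces.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class SyncedFrame:
    t: float                # timestamp in seconds
    body_pose: list[float]  # whole-body pose vector for this frame
    left_closed: bool       # True if the left hand is closed (grasping)
    right_closed: bool      # True if the right hand is closed (grasping)


def fill_missing(states: Sequence[Optional[bool]], default: bool = False) -> list[bool]:
    """Forward-fill frames in which the hand was not detected (None)."""
    filled, last = [], default
    for state in states:
        if state is not None:
            last = state
        filled.append(last)
    return filled


def synchronize(
    fps: float,
    body_poses: Sequence[Sequence[float]],
    left: Sequence[Optional[bool]],
    right: Sequence[Optional[bool]],
) -> list[SyncedFrame]:
    """Merge per-frame body poses and hand states onto one timeline."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    if not (len(body_poses) == len(left) == len(right)):
        raise ValueError("all streams must have one entry per video frame")
    left_filled = fill_missing(left)
    right_filled = fill_missing(right)
    return [
        SyncedFrame(
            t=i / fps,
            body_pose=list(pose),
            left_closed=left_filled[i],
            right_closed=right_filled[i],
        )
        for i, pose in enumerate(body_poses)
    ]
```

Keying every stream to the video frame index keeps body motion and hand state aligned without any timestamp matching, which is the simplest choice when both estimators run on the same clip.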

3. General Motion Tracking

The estimated motion is executed with a general humanoid motion-tracking controller, allowing the Unitree G1 robot to follow generated task demonstrations while maintaining dynamic stability.
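
Before a reference motion reaches a tracking controller, it usually has to match the controller's update rate and respect the robot's joint limits. The sketch below shows one generic way to do that preprocessing: linear resampling from the video frame rate to a target rate, followed by clamping. The function names, the rates, and the limits are hypothetical; the page does not describe how ExoActor prepares its reference motion, and this is not the controller itself.

```python
from typing import Sequence


def resample_trajectory(
    frames: Sequence[Sequence[float]], src_fps: float, dst_hz: float
) -> list[list[float]]:
    """Linearly interpolate joint-angle frames from src_fps to dst_hz."""
    if len(frames) < 2:
        raise ValueError("need at least two frames to interpolate")
    if src_fps <= 0 or dst_hz <= 0:
        raise ValueError("rates must be positive")
    last = len(frames) - 1
    duration = last / src_fps
    n_out = int(round(duration * dst_hz)) + 1
    out = []
    for k in range(n_out):
        s = min(k / dst_hz * src_fps, last)  # fractional index into source frames
        i = min(int(s), last - 1)            # index of the left neighbour
        a = s - i                            # blend weight toward the right neighbour
        out.append(
            [(1 - a) * q0 + a * q1 for q0, q1 in zip(frames[i], frames[i + 1])]
        )
    return out


def clamp_to_limits(
    frame: Sequence[float], lower: Sequence[float], upper: Sequence[float]
) -> list[float]:
    """Clamp each joint angle to its [lower, upper] range."""
    return [min(max(q, lo), hi) for q, lo, hi in zip(frame, lower, upper)]


if __name__ == "__main__":
    # Two source frames at 1 fps, resampled to 2 Hz (illustrative numbers only).
    reference = resample_trajectory([[0.0], [1.0]], src_fps=1.0, dst_hz=2.0)
    print([clamp_to_limits(f, [-0.8], [0.8]) for f in reference])
```

Linear interpolation is adequate for a sketch; a real system might prefer a smoother scheme and would interpolate rotations in a rotation-aware way.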

Key Contributions

  • We identify exocentric (third-person) video generation as a scalable paradigm for modeling interaction dynamics in humanoid control, leveraging the generalization capabilities of large pretrained video models.
  • We propose ExoActor, an end-to-end framework that synthesizes task execution videos and directly converts them into executable humanoid behaviors via human motion estimation and general motion tracking, without task-specific data collection.
  • We demonstrate the feasibility of this paradigm on real-world humanoid tasks, showing that generated videos can be translated into interaction-aware behaviors across diverse scenarios.
  • We discuss key challenges and future directions, including physically grounded video generation, improved motion execution pipelines, integration with vision-based whole-body control, and extensions to manipulation-intensive tasks.

Pipeline Demonstration

Easy Tasks

[Video grid. Columns: original robot observation, generated human-like exocentric video, full-body pose estimation, real-world execution. Rows: tasks B1 to B4.]

Moderate Tasks

[Video grid. Columns: original robot observation, generated human-like exocentric video, full-body pose estimation, real-world execution. Rows: tasks A1 to A6.]

Challenging Tasks

[Video grid. Columns: original robot observation, generated human-like exocentric video, full-body pose estimation, real-world execution. Rows: tasks S1 to S4.]

BibTeX

@article{zhou2026exoactor,
  title={{E}xo{A}ctor: {E}xocentric Video Generation Enables Generalizable Interactive Humanoid Controlling},
  author={Yanghao Zhou and Jingyu Ma and Yibo Peng and Zhenguo Sun and Yu Bai and Börje F. Karlsson},
  journal={arXiv preprint arXiv:2604.27711},
  year={2026},
  url={https://arxiv.org/abs/2604.27711}
}