Heracles: Bridging Precise Tracking and Generative Synthesis for General Humanoid Control
X-Humanoid Team
X-Humanoid

Heracles. A state-conditioned diffusion middleware that bridges precise motion tracking and generative synthesis for general-purpose humanoid control.

Abstract

Achieving general-purpose humanoid control requires a delicate balance between the precise execution of commanded motions and the flexible, anthropomorphic adaptability needed to recover from unpredictable environmental perturbations. Current general controllers predominantly formulate motion control as a rigid reference-tracking problem. While effective in nominal conditions, these trackers often exhibit brittle, non-anthropomorphic failure modes under severe disturbances, lacking the generative adaptability inherent to human motor control. To overcome this limitation, we propose Heracles, a novel state-conditioned diffusion middleware that bridges precise motion tracking and generative synthesis. Rather than relying on rigid tracking paradigms or complex explicit mode-switching, Heracles operates as an intermediary layer between high-level reference motions and low-level physics trackers. By conditioning on the robot’s real-time state, the diffusion model implicitly adapts its behavior: it approximates an identity map when the state closely aligns with the reference, preserving zero-shot tracking fidelity. Conversely, when encountering significant state deviations, it seamlessly transitions into a generative synthesizer to produce natural, anthropomorphic recovery trajectories. Our framework demonstrates that integrating generative priors into the control loop not only significantly enhances robustness against extreme perturbations but also elevates humanoid control from a rigid tracking paradigm to an open-ended, generative general-purpose architecture.

System Overview

Overview of the Heracles framework. (a) A flow matching model learns to synthesize feasible keyframe trajectories conditioned on the current state. (b) Reference motions are quantized into discrete tokens via FSQ, shared by reconstruction and action prediction heads. (c) At inference, the middleware generates trajectories through closed-loop replanning for the motion tracker to execute.

The proposed framework intrinsically decouples high-level intent generation from high-frequency physical execution, comprising two primary components: a state-conditioned generative middleware and a low-level, general-purpose physics tracking policy.
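To make the tokenization step in panel (b) more concrete, the following is a minimal sketch of finite scalar quantization (FSQ) in NumPy. It is not the Heracles implementation: the function name `fsq_quantize`, the latent size, and the level counts `[5, 5, 5]` are illustrative assumptions, and the sketch is restricted to odd level counts to keep the rounding logic simple. Each latent channel is squashed into a bounded range, rounded to one of a fixed number of levels, and the per-channel digits are combined into a single discrete token id.

```python
import numpy as np


def fsq_quantize(z, levels):
    """FSQ-style quantization sketch (hypothetical helper, odd level counts only).

    z:      array of shape (..., d), continuous latents from some encoder.
    levels: length-d sequence of odd integers, the number of levels per channel.
    Returns (codes, token_ids): codes lie in [-1, 1], token_ids in [0, prod(levels)).
    """
    z = np.asarray(z, dtype=np.float64)
    levels = np.asarray(levels, dtype=np.int64)
    if np.any(levels % 2 == 0):
        raise ValueError("this sketch only handles odd level counts")

    half = (levels - 1) / 2.0                  # e.g. 2.0 for 5 levels
    bounded = np.tanh(z) * half                # squash each channel into (-half, half)
    # Round to the nearest integer level, then shift to digits in 0..L-1.
    digits = np.round(bounded).astype(np.int64) + half.astype(np.int64)

    codes = (digits - half) / half             # quantized values rescaled to [-1, 1]
    # Mixed-radix encoding: combine the per-channel digits into one token id.
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))
    token_ids = (digits * basis).sum(axis=-1)
    return codes, token_ids


# Toy latents for three keyframes (made-up values, not real motion data).
latents = np.array([[0.0, 0.0, 0.0],
                    [10.0, 10.0, 10.0],
                    [-10.0, -10.0, -10.0]])
codes, token_ids = fsq_quantize(latents, levels=[5, 5, 5])
print(token_ids)
```

With three channels of five levels each, the implicit codebook has 5 × 5 × 5 = 125 entries and needs no learned codebook vectors. In training, FSQ is typically paired with a straight-through estimator so that gradients pass through the rounding step; that part is omitted here.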

BibTeX

@misc{heracles2026,
  title         = {Heracles: Bridging Precise Tracking and Generative Synthesis for General Humanoid Control},
  author        = {X-Humanoid Team},
  year          = {2026},
  eprint        = {2603.27756},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/2603.27756}
}