Project Page AAAI 2026

Mitigating Error Accumulation in Co-Speech Motion Generation via Global Rotation Diffusion and Multi-Level Constraints

1Tongyi Lab, Alibaba Group  ·   2Nanyang Technological University
GlobalDiff teaser
Local rotation diffusion leads to error accumulation in distal joints. Global rotation diffusion avoids this but lacks structural priors (top left). We address this with constraints at the joint, skeleton, and temporal levels to ensure coherent and plausible motion (bottom).

📋 Abstract

Reliable co-speech motion generation requires precise motion representation and consistent structural priors across all joints. Existing generative methods typically operate on local joint rotations, which are defined hierarchically based on the skeleton structure. This leads to cumulative errors during generation, manifesting as unstable and implausible motions at end-effectors. In this work, we propose GlobalDiff, a diffusion-based framework that operates directly in the space of global joint rotations for the first time, fundamentally decoupling each joint's prediction from upstream dependencies and alleviating hierarchical error accumulation. To compensate for the absence of structural priors in global rotation space, we introduce a multi-level constraint scheme. Specifically, a joint structure constraint introduces virtual anchor points around each joint to better capture fine-grained orientation. A skeleton structure constraint enforces angular consistency across bones to maintain structural integrity. A temporal structure constraint utilizes a multi-scale variational encoder to align the generated motion with ground-truth temporal patterns. These constraints jointly regularize the global diffusion process and reinforce structural awareness. Extensive evaluations on standard co-speech benchmarks show that GlobalDiff generates smooth and accurate motions, improving performance by 46.0% compared to the current SOTA across multiple speaker identities. The code will be released at https://github.com/Xiangyue-Zhang/GlobalDiff.
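The error-accumulation argument above can be made concrete with a toy kinematic chain. The sketch below (illustrative only, not the paper's code) composes local rotations from root to end-effector and shows that a small per-joint prediction error grows linearly with chain depth, whereas predicting each joint's global rotation directly keeps the error bounded by the per-joint error; all names and numbers are hypothetical.

```python
# Illustrative sketch: why local-rotation errors accumulate along a
# kinematic chain, while direct global-rotation prediction does not.
import numpy as np

def rot_z(theta):
    """3x3 rotation matrix about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def chain_global(local_rots):
    """Compose local rotations root -> leaf to get each joint's global rotation."""
    out, acc = [], np.eye(3)
    for R in local_rots:
        acc = acc @ R
        out.append(acc)
    return out

def angle_err(Ra, Rb):
    """Geodesic angle (radians) between two rotation matrices."""
    cos = (np.trace(Ra.T @ Rb) - 1.0) / 2.0
    return np.arccos(np.clip(cos, -1.0, 1.0))

depth = 8          # joints from root to end-effector (toy value)
eps = 0.02         # small per-joint prediction error in radians (toy value)
true_glob = chain_global([rot_z(0.3)] * depth)
pred_glob = chain_global([rot_z(0.3 + eps)] * depth)

# Hierarchical prediction: end-effector error grows to ~depth * eps.
local_err = angle_err(true_glob[-1], pred_glob[-1])
# Direct global prediction: the same per-joint error stays at ~eps.
direct_err = angle_err(true_glob[-1], true_glob[-1] @ rot_z(eps))
print(f"hierarchical end-effector error: {local_err:.3f} rad")  # ~0.160
print(f"direct global prediction error: {direct_err:.3f} rad")  # ~0.020
```

With eight joints in the chain, the hierarchical error is roughly eight times the per-joint error, which is the behavior GlobalDiff avoids by diffusing global rotations.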

🎬 Video

🏗️ Framework

GlobalDiff: global rotation diffusion augmented with multi-level structural constraints.

GlobalDiff framework
Architecture of GlobalDiff. Our model generates consistent and expressive co-speech motion using global rotation diffusion augmented with multi-level structural constraints. The diffusion model is conditioned on seed pose and prosodic features and predicts body motion through stacked motion generation blocks. To enforce structural plausibility, we introduce: (a) a joint structure constraint using virtual anchor points to disambiguate orientations; (b) a skeleton structure constraint that enforces angular consistency across adjacent bones; and (c) a temporal structure constraint based on a shared multi-scale VAE encoder to preserve temporal dynamics. Facial expressions are generated in parallel from prosody using a transformer encoder.
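To give a flavor of constraint (b), here is a minimal sketch of an angular-consistency loss between adjacent bones, computed from predicted and ground-truth joint positions on a toy chain skeleton. The skeleton, function names, and exact formulation (squared difference of inter-bone cosines) are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of a skeleton-level angular-consistency loss.
import numpy as np

# Toy skeleton: parent index per joint (-1 = root); bones run parent -> child.
PARENTS = [-1, 0, 1, 2, 3]

def bone_dirs(joints):
    """Unit bone direction for each non-root joint, keyed by child joint."""
    dirs = {}
    for j, p in enumerate(PARENTS):
        if p < 0:
            continue
        v = joints[j] - joints[p]
        dirs[j] = v / np.linalg.norm(v)
    return dirs

def skeleton_angle_loss(pred, gt):
    """Mean squared difference of cosines between consecutive bone pairs."""
    dp, dg = bone_dirs(pred), bone_dirs(gt)
    loss, n = 0.0, 0
    for j, p in enumerate(PARENTS):
        if p < 0 or PARENTS[p] < 0:
            continue  # need two consecutive bones: (grandparent->p, p->j)
        loss += (dp[j] @ dp[p] - dg[j] @ dg[p]) ** 2
        n += 1
    return loss / max(n, 1)

# Identical poses give zero loss; bending the chain raises it.
gt = np.array([[0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0], [4, 0, 0]], float)
pred = gt.copy()
pred[3:] = [[2, 1, 0], [2, 2, 0]]   # bend the chain at joint 2
print(skeleton_angle_loss(gt, gt))    # 0.0
print(skeleton_angle_loss(pred, gt))
```

A loss of this shape is differentiable in the joint positions, so it can regularize the diffusion output directly; the paper's actual constraint may differ in detail.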

🎨 Qualitative Comparisons

GlobalDiff consistently generates semantically grounded and physically coherent co-speech motions across all speaker identities.

Qualitative comparison on BEAT2
Comparison on BEAT2 Dataset. We compare gestures produced on four speakers—Wayne (ID1), Scott (ID2), Lawrence (ID4), and Carla (ID6)—across varied utterances. For "exaggerating", GlobalDiff produces wide, symmetric arm extensions; for "wow", elevated open-hand motions reflecting surprise; for "driving", realistic steering-like motions. This comparison underscores the advantage of our global rotation strategy and structural constraints.

📊 Quantitative Comparisons

GlobalDiff consistently achieves the best performance across nearly all metrics on both single-speaker and multi-speaker settings.

Quantitative results
We obtain the lowest FGD and comparable MSE, indicating superior spatial fidelity and motion reconstruction accuracy. On the BeatAlign metric, our method performs comparably to the top baselines. These results highlight the strength of our global rotation modeling and multi-level structural constraints.

📄 BibTeX

@misc{zhang2025mitigatingerroraccumulationcospeech,
      title={Mitigating Error Accumulation in Co-Speech Motion Generation via Global Rotation Diffusion and Multi-Level Constraints},
      author={Xiangyue Zhang and Jianfang Li and Jianqiang Ren and Jiaxu Zhang},
      year={2025},
      eprint={2511.10076},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.10076},
}