Reliable co-speech motion generation requires precise motion representation and consistent structural priors across all joints. Existing generative methods typically operate on local joint rotations, which are defined hierarchically based on the skeleton structure. This leads to cumulative errors during generation, manifesting as unstable and implausible motions at end-effectors. In this work, we propose GlobalDiff, a diffusion-based framework that, for the first time, operates directly in the space of global joint rotations, fundamentally decoupling each joint's prediction from upstream dependencies and alleviating hierarchical error accumulation. To compensate for the absence of structural priors in global rotation space, we introduce a multi-level constraint scheme. Specifically, a joint structure constraint introduces virtual anchor points around each joint to better capture fine-grained orientation. A skeleton structure constraint enforces angular consistency across bones to maintain structural integrity. A temporal structure constraint utilizes a multi-scale variational encoder to align the generated motion with ground-truth temporal patterns. These constraints jointly regularize the global diffusion process and reinforce structural awareness. Extensive evaluations on standard co-speech benchmarks show that GlobalDiff generates smooth and accurate motions, improving performance by 46.0% over the current SOTA across multiple speaker identities. The code will be released at https://github.com/Xiangyue-Zhang/GlobalDiff.
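The error-accumulation argument above can be made concrete with a toy forward-kinematics experiment. The sketch below (our own illustration, not code from the paper) builds a planar joint chain and compares the two parameterizations: with local, parent-relative rotations, a small error at the root perturbs the world orientation of every descendant, so end-effector drift grows with chain depth; with global rotations, the same per-joint error stays confined to a single bone.

```python
import numpy as np

def rot_z(theta):
    """3x3 rotation about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def fk_local(local_angles, bone_len=1.0):
    """Forward kinematics with LOCAL (parent-relative) rotations: each
    joint's world orientation is the product of all ancestor rotations,
    so a single upstream error propagates down the whole chain."""
    pos, world, out = np.zeros(3), np.eye(3), [np.zeros(3)]
    for a in local_angles:
        world = world @ rot_z(a)          # accumulate down the hierarchy
        pos = pos + world @ np.array([bone_len, 0.0, 0.0])
        out.append(pos)
    return np.array(out)

def fk_global(global_angles, bone_len=1.0):
    """Same chain, but each joint's world orientation is given DIRECTLY,
    so a per-joint error affects only that joint's own bone."""
    pos, out = np.zeros(3), [np.zeros(3)]
    for a in global_angles:
        pos = pos + rot_z(a) @ np.array([bone_len, 0.0, 0.0])
        out.append(pos)
    return np.array(out)

# 5-joint chain; cumulative sums convert local angles to equivalent globals.
local = np.full(5, 0.1)
global_ = np.cumsum(local)
eps = 0.05                                # same-size error at the ROOT joint

clean = fk_local(local)
local_err = fk_local(local + np.array([eps, 0, 0, 0, 0]))
global_err = fk_global(global_ + np.array([eps, 0, 0, 0, 0]))

# Under local rotations the displacement grows with depth; under global
# rotations it is a constant offset that does not amplify down the chain.
print(np.linalg.norm(local_err - clean, axis=1))
print(np.linalg.norm(global_err - clean, axis=1))
```

Real skeletons use 3D rotations per joint rather than a planar angle, but the qualitative behavior is the same: hierarchical composition multiplies orientation errors toward the end-effectors, which is the instability GlobalDiff's global-rotation formulation avoids.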
Architecture of GlobalDiff. Our model generates consistent and expressive co-speech motion via global rotation diffusion augmented with multi-level structural constraints. The diffusion model is conditioned on the seed pose and prosodic features and predicts body motion through stacked motion generation blocks. To enforce structural plausibility, we introduce: (a) a joint structure constraint using virtual anchor points to disambiguate orientations; (b) a skeleton structure constraint that enforces angular consistency across adjacent bones by aligning their angular matrices; and (c) a temporal structure constraint based on a shared multi-scale VAE encoder to preserve temporal dynamics. Facial expressions are generated in parallel from prosody using a transformer encoder.
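To make constraints (a) and (b) more tangible, here is a minimal sketch of how such losses could look. The paper does not publish exact formulas here, so the anchor placement (one point per axis) and the cosine-matrix form are illustrative assumptions: the joint constraint supervises each global rotation through the positions of virtual anchor points it transports, and the skeleton constraint matches the pairwise angular (cosine) matrix of bone directions against the ground truth.

```python
import numpy as np

def anchor_loss(R_pred, R_gt, anchors=None):
    """Joint structure constraint (sketch): supervise each joint's global
    rotation via the positions of virtual anchor points placed around the
    joint, which disambiguates orientation better than comparing raw
    rotation matrices. The anchor set and L2 form are illustrative."""
    if anchors is None:
        anchors = np.eye(3)                # one anchor along each axis
    # (J, 3, 3) rotations applied to (K, 3) anchors -> (J, K, 3) positions
    p_pred = np.einsum('jab,kb->jka', R_pred, anchors)
    p_gt = np.einsum('jab,kb->jka', R_gt, anchors)
    return np.mean(np.sum((p_pred - p_gt) ** 2, axis=-1))

def angular_consistency_loss(bones_pred, bones_gt):
    """Skeleton structure constraint (sketch): align the pairwise angular
    (cosine) matrices of bone direction vectors so that relative angles
    between bones match the ground-truth skeleton."""
    def cos_matrix(b):
        u = b / np.linalg.norm(b, axis=-1, keepdims=True)
        return u @ u.T                     # (B, B) pairwise cosines
    return np.mean((cos_matrix(bones_pred) - cos_matrix(bones_gt)) ** 2)
```

Both losses are zero when prediction matches ground truth and penalize orientation and inter-bone-angle deviations respectively; in training they would be weighted terms added to the diffusion objective.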
Comparison on BEAT2 Dataset. As shown in the figure, GlobalDiff consistently generates semantically grounded and physically coherent co-speech motions across all speaker identities, outperforming the baselines. We compare gestures produced by all three methods on four speakers, Wayne (ID1), Scott (ID2), Lawrence (ID4), and Carla (ID6), across varied utterances. RAG-GESTURE is excluded because its public outputs are unavailable. For the phrase "exaggerating", GlobalDiff produces wide, symmetric arm extensions that clearly express emphasis, whereas EMAGE shows constrained gestures and HoloGest yields flat, non-expressive arm poses. For "wow", our model generates elevated open-hand motions that reflect surprise, while EMAGE collapses the upper limbs and HoloGest shows minimal variation. When expressing "driving", GlobalDiff synthesizes realistic, steering-like motions; in contrast, EMAGE and HoloGest often fall back on generic, low-effort gestures. For "all the time", our model maintains consistent chest-level hand motions across speakers, preserving semantic clarity, while EMAGE introduces lateral imbalance and HoloGest shows reduced motion fidelity. This comparison underscores the advantage of our global rotation strategy and structural constraints in producing expressive, semantically accurate, and identity-consistent gestures.
Quantitative comparison with SOTA methods. The table provides a detailed quantitative comparison between our method and several state-of-the-art baselines on both single-speaker and multi-speaker co-speech motion generation. Our method consistently achieves the best performance across nearly all metrics. Specifically, we obtain the lowest FGD and comparable MSE, indicating superior spatial fidelity and motion reconstruction accuracy. On BeatAlign, our method performs comparably with the top baselines, reflecting robust temporal alignment with speech rhythm. Although RAG-GESTURE attains diversity closest to the ground truth in the single-speaker setting, our method maintains competitive diversity scores while preserving structural integrity and semantic consistency. These results highlight the strength of our global rotation modeling and multi-level structural constraints in producing accurate, expressive, and speaker-agnostic co-speech motion.
@misc{zhang2025mitigatingerroraccumulationcospeech,
  title={Mitigating Error Accumulation in Co-Speech Motion Generation via Global Rotation Diffusion and Multi-Level Constraints},
  author={Xiangyue Zhang and Jianfang Li and Jianqiang Ren and Jiaxu Zhang},
  year={2025},
  eprint={2511.10076},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.10076},
}