EchoMask: Speech-Queried Attention-based Mask Modeling for Holistic Co-Speech Motion Generation

🏆 Accepted at ACM MM 2025 (* co-first authors; † corresponding author)

Abstract

Masked modeling frameworks have shown promise in co-speech motion generation. However, they struggle to identify semantically significant frames for effective motion masking. In this work, we propose a speech-queried attention-based mask modeling framework for co-speech motion generation. Our key insight is to leverage motion-aligned speech features to guide the masked motion modeling process, selectively masking rhythm-related and semantically expressive motion frames. Specifically, we first propose a motion-audio alignment module (MAM) to construct a latent motion-audio joint space. In this space, both low-level and high-level speech features are projected, enabling motion-aligned speech representations via learnable speech queries. Then, a speech-queried attention mechanism (SQA) is introduced to compute frame-level attention scores through interactions between motion keys and speech queries, guiding selective masking toward motion frames with high attention scores. Finally, the motion-aligned speech features are also injected into the generation network to facilitate co-speech motion generation. Qualitative and quantitative evaluations confirm that our method outperforms existing state-of-the-art approaches, producing high-quality co-speech motion. The code will be released at https://github.com/Xiangyue-Zhang/EchoMask.

Video

Comparison of Masked Motion Modeling Concepts


By delving into the underlying limitations, we identify the core bottleneck in advancing semantically grounded masked motion modeling: the strategy for selecting semantically rich motion frames to mask. Existing methods predominantly adopt random (a) or loss-based (b) masking strategies. Random masking often fails to target semantically meaningful regions because semantic content is sparsely distributed across motion sequences. Loss-based masking prioritizes frames with high reconstruction error on the assumption that such frames are semantically rich. However, this assumption does not always hold: high reconstruction loss may simply reflect abrupt yet uninformative transitions. For instance, a sharp change in hand position might signal the end of a sentence rather than a meaningful gesture. As a result, such strategies struggle to accurately identify semantically significant frames, ultimately limiting the quality of speech-conditioned motion generation. Moreover, prior methods mask at the token level using discrete code indices, which loses fine-grained motion details and hinders accurate detection of frames critical for motion intent and speech alignment.

Based on this observation, we raise a central question: can speech be used as a query to identify semantically important motion frames worth focusing on during masked modeling? To explore this, we propose a new masked motion modeling framework, EchoMask, for co-speech motion generation.
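The contrast with random masking can be sketched in a few lines. This is a minimal NumPy illustration of the idea, not the paper's implementation; the function names `attention_scores` and `select_mask_frames` are hypothetical, and the attention here is a plain scaled dot-product between speech queries and motion keys:

```python
import numpy as np

def attention_scores(speech_q, motion_k):
    """Frame-level importance: softmax over scaled dot products between
    speech queries (T, d) and motion keys (T, d), averaged per motion frame."""
    logits = speech_q @ motion_k.T / np.sqrt(motion_k.shape[1])  # (T, T)
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn.mean(axis=0)  # shape (T,), sums to 1

def select_mask_frames(scores, mask_ratio=0.4):
    """Mask the frames with the highest speech-motion attention,
    instead of masking uniformly at random."""
    k = int(round(mask_ratio * len(scores)))
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(0)
T, d = 8, 16
speech_q = rng.standard_normal((T, d))
motion_k = rng.standard_normal((T, d))
scores = attention_scores(speech_q, motion_k)
masked = select_mask_frames(scores, mask_ratio=0.25)  # 2 of 8 frames
print(sorted(int(i) for i in masked))
```

Random masking would draw the same number of indices uniformly, which with sparse semantic content rarely hits the expressive frames; scoring frames against speech first biases the mask toward them.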


Framework


Architecture of EchoMask. (a) MAM projects motion and audio into a shared latent space. Learnable speech queries \(Q'\) are refined through hierarchical cross-attention with HuBERT features (\( \gamma_l, \gamma_h \)) and jointly processed with the quantized latent motion \( \tilde{z}_m \) via a shared transformer, optimized with a contrastive loss. (b) Given a motion sequence \( m \), the mask-transformer teacher computes a cross-attention map \( \mathcal{M} \) between latent poses \( p \) and motion-aligned speech features \( Q \), identifying semantically important frames. These frames are masked via a Soft2Hard strategy to produce \( \tilde{m} \), from which the student transformer generates motion tokens.
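One plausible reading of the Soft2Hard step is an annealed mix of random and attention-driven masking: early in training the mask is mostly random (soft guidance), and a schedule parameter gradually shifts selection toward the highest-attention frames (hard selection). The sketch below is an assumption about that schedule, not the paper's exact procedure; `soft2hard_mask` and the parameter `alpha` are hypothetical names:

```python
import numpy as np

def soft2hard_mask(scores, mask_ratio, alpha, rng):
    """Build a boolean frame mask of size round(mask_ratio * T).
    Each masked slot is taken from the top-scoring unchosen frames with
    probability `alpha`, otherwise uniformly at random; annealing alpha
    from 0 to 1 moves the policy from soft (random) to hard (attention)."""
    T = len(scores)
    k = int(round(mask_ratio * T))
    ranked = np.argsort(scores)[::-1]          # frames by descending score
    pool = list(range(T))                      # frames still available
    chosen = []
    for _ in range(k):
        if rng.random() < alpha:
            cand = next(i for i in ranked if i in pool)  # best remaining
        else:
            cand = pool[rng.integers(len(pool))]         # random remaining
        chosen.append(int(cand))
        pool.remove(cand)
    mask = np.zeros(T, dtype=bool)
    mask[chosen] = True
    return mask

rng = np.random.default_rng(1)
scores = np.array([0.05, 0.30, 0.10, 0.25, 0.05, 0.15, 0.05, 0.05])
hard = soft2hard_mask(scores, mask_ratio=0.5, alpha=1.0, rng=rng)
print(hard)  # with alpha=1.0, exactly the four highest-scoring frames
```

At `alpha=1.0` the mask deterministically covers the top-scoring frames (indices 1, 3, 5, 2 above), matching the teacher's attention map; at `alpha=0.0` it degenerates to random masking.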


Qualitative Comparisons


Comparison on BEAT2 Dataset. Red boxes highlight implausible or uncoordinated motions, while green boxes indicate coherent and semantically appropriate results. Our EchoMask consistently generates co-speech motions that are semantically aligned with ground truth. For instance, when articulating “never” and “start”, our model positions both hands in a poised gesture near the torso, reflecting a thoughtful and intentional motion, whereas prior methods such as DiffSHEG and EMAGE either generate imbalanced hand postures or ambiguous limb placements.


Facial Comparison on the BEAT2 Dataset. Our approach tightly synchronizes facial expressions with both phonetic and semantic cues in speech, producing natural and articulate lip movements. This alignment enhances both clarity and expressiveness; for example, in words like “job” and “angry,” each syllable transition is smoothly and accurately reflected in the facial motion.


Quantitative Comparisons


Quantitative comparison with SOTA methods. Lower values indicate better performance for FMD, FGD, MSE, and LVD, while higher values are better for BC and DIV. For clarity, we report FGD \(\times 10^{-1}\), BC \(\times 10^{-1}\), MSE \(\times 10^{-8}\), and LVD \(\times 10^{-5}\). Best results are shown in bold.


BibTeX

@inproceedings{zhang2025echomask,
  title={EchoMask: Speech-Queried Attention-based Mask Modeling for Holistic Co-Speech Motion Generation},
  author={Zhang, Xiangyue and Li, Jianfang and Zhang, Jiaxu and Ren, Jianqiang and Bo, Liefeng and Tu, Zhigang},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  pages={10827--10836},
  year={2025}
}
