Project Page 🔥 ICCV 2025

SemTalk: Holistic Co-speech Motion Generation with Frame-level Semantic Emphasis

1Wuhan University  ·   2Tongyi Lab, Alibaba Group  ·   3Zhejiang University
SemTalk teaser
Why SemTalk is needed. Co-speech motion is not only rhythm. Most frames carry ordinary beat-aligned movement, but a few frames carry semantic emphasis that makes a gesture feel intentional. SemTalk separates these two sources of motion and fuses them frame by frame, so the generated speaker can stay rhythmically stable while still producing sparse, meaningful gestures.

Abstract

Good co-speech motion generation cannot be achieved without carefully integrating common rhythmic motion with rare yet essential semantic motion. In this work, we propose SemTalk for holistic co-speech motion generation with frame-level semantic emphasis. Our key insight is to learn general motions and sparse motions separately, and then fuse them adaptively. In particular, rhythmic consistency learning is explored to establish rhythm-related base motion, ensuring a coherent foundation that synchronizes gestures with the speech rhythm. Subsequently, semantic emphasis learning is designed to generate semantic-aware sparse motion, focusing on frame-level semantic cues. Finally, to integrate sparse motion into the base motion and generate semantic-emphasized co-speech gestures, we further leverage a learned semantic score for adaptive synthesis. Qualitative and quantitative comparisons on two public datasets demonstrate that our method outperforms the state-of-the-art, delivering high-quality co-speech motion with enhanced semantic richness over a stable base motion. The code will be released at https://github.com/Xiangyue-Zhang/SemTalk.

Demo

Method

SemTalk uses a two-stream design: one stream learns a stable rhythm-aligned base, while another stream activates sparse semantic motion only when speech calls for emphasis.

SemTalk framework
Architecture of SemTalk. (a) Base Motion Generation uses rhythmic consistency learning to produce rhythm-aligned codes \( q^r \), conditioned on rhythmic features \( \gamma_b \), \( \gamma_h \), seed pose \( \tilde{m} \), and \( id \). (b) Sparse Motion Generation employs semantic emphasis learning to generate semantic codes \( q^s \), activated by semantic score \( \psi \). (c) Adaptive Fusion automatically combines \( q^r \) and \( q^s \) based on \( \psi \) to produce mixed codes \( q^m \) at the frame level, yielding rhythmically aligned and contextually rich motion.
Two streams: Separate rhythmic base motion from sparse semantic gestures.
Frame-level: Fuse base and semantic codes according to a learned semantic score.
BEAT2 + SHOW: Evaluated on both in-domain and cross-dataset holistic motion generation.
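
The frame-level fusion step can be illustrated with a minimal sketch. This is not the released SemTalk implementation: the page only states that \( q^r \) and \( q^s \) are combined based on \( \psi \), so the hard per-frame selection, the function name `fuse_codes`, the threshold of 0.5, and the integer code values are all assumptions made for illustration.

```python
from typing import List, Sequence


def fuse_codes(base_codes: Sequence[int],
               semantic_codes: Sequence[int],
               semantic_score: Sequence[float],
               threshold: float = 0.5) -> List[int]:
    """Per frame, keep the semantic code when the score exceeds the
    threshold; otherwise keep the rhythm-aligned base code.

    Hard selection and the 0.5 threshold are illustrative assumptions,
    not details taken from the paper.
    """
    if not (len(base_codes) == len(semantic_codes) == len(semantic_score)):
        raise ValueError("all inputs must have one entry per frame")
    return [s if psi > threshold else r
            for r, s, psi in zip(base_codes, semantic_codes, semantic_score)]


# Toy example with five frames (made-up code indices and scores).
q_r = [12, 12, 40, 41, 7]        # base codes q^r
q_s = [90, 91, 92, 93, 94]       # semantic codes q^s
psi = [0.1, 0.2, 0.9, 0.8, 0.3]  # semantic score per frame
q_m = fuse_codes(q_r, q_s, psi)
print(q_m)  # [12, 12, 92, 93, 7]
```

Only frames 2 and 3 have a score above the threshold, so only those frames take semantic codes; the rest stay on the base motion.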

Semantic Emphasis

The central design choice is not to make every frame more expressive. Instead, SemTalk learns when the speech contains a phrase that should trigger stronger, sparse gestures.

Semantic score visualization
Semantic score. The learned score highlights frame ranges where speech semantics require visible emphasis. These frames receive sparse semantic motion codes, while ordinary frames keep the rhythm-aligned base motion. This avoids the common failure mode where generated motion is either too flat everywhere or overly active everywhere.
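
To make the idea of highlighted frame ranges concrete, the sketch below turns a per-frame score sequence into contiguous emphasized ranges. The helper name `emphasized_ranges`, the threshold of 0.5, and the example scores are hypothetical; the page does not say how the visualization derives its ranges.

```python
from typing import List, Sequence, Tuple


def emphasized_ranges(scores: Sequence[float],
                      threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) frame ranges whose score stays
    above the threshold. The threshold is an illustrative assumption."""
    ranges = []
    start = None
    for t, psi in enumerate(scores):
        if psi > threshold and start is None:
            start = t                      # a new emphasized run begins
        elif psi <= threshold and start is not None:
            ranges.append((start, t - 1))  # the run ended on the previous frame
            start = None
    if start is not None:                  # run extends to the final frame
        ranges.append((start, len(scores) - 1))
    return ranges


scores = [0.1, 0.7, 0.9, 0.2, 0.1, 0.8]
print(emphasized_ranges(scores))  # [(1, 2), (5, 5)]
```

Frames outside the returned ranges would keep the rhythm-aligned base motion, which matches the sparse behavior described above.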

Qualitative Results

The visual comparisons show what the semantic branch changes: gestures become more intentional at meaningful words while staying stable during ordinary speech.

Comparison on BEAT2
Comparison on BEAT2 Dataset. SemTalk* denotes the model trained with only the Base Motion Generation stage. The full SemTalk model, in contrast, emphasizes sparse yet vivid motions. For example, when the phrase "my opinion" is spoken, SemTalk-driven characters raise both hands and extend their index fingers to emphasize the statement.
Comparison on SHOW
Comparison on SHOW Dataset. SemTalk shows more agile gestures than TalkSHOW, EMAGE, and DiffSHEG when applied to unseen data. Our method captures natural and contextually rich gestures, particularly in moments of emphasis such as "I like to do" and "relaxing."
Facial comparison on BEAT2
Facial Comparison on BEAT2. Our approach synchronizes facial expressions closely with phonetic and semantic cues in speech, generating natural lip movements that enhance clarity and expressiveness.
SemTalk user study
User study. Human preference results support the same conclusion as the qualitative figures: separating rhythm and semantic emphasis improves perceived realism and semantic match rather than only optimizing numeric metrics.

Quantitative Results

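Lower is better for FGD, MSE, and LVD; higher is better for BC and DIV. On both BEAT2 and SHOW, SemTalk achieves the lowest FGD, MSE, and LVD and the highest BC among the listed methods. It does not lead on DIV, where TalkSHOW and EMAGE score higher on BEAT2 and EMAGE scores higher on SHOW.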

BEAT2
Method     FGD (lower)   BC (higher)   DIV (higher)   MSE (lower)   LVD (lower)
CaMN       6.644         6.769         10.86          -             -
DSG        8.811         7.241         11.49          -             -
TalkSHOW   6.209         6.947         13.47          7.791         7.771
EMAGE      5.512         7.724         13.06          7.680         7.556
DiffSHEG   8.986         7.142         11.91          7.665         8.673
SemTalk    4.278         7.770         12.91          6.153         6.938
SHOW
Method           FGD (lower)   BC (higher)   DIV (higher)   MSE (lower)   LVD (lower)
CaMN             22.12         7.712         10.37          -             -
DSG              24.84         8.027         10.23          -             -
Habibie et al.   27.22         8.209         8.541          145.6         47.35
TalkSHOW         24.43         8.249         10.98          139.6         45.17
EMAGE            22.12         8.280         12.46          136.1         42.44
DiffSHEG         24.87         8.061         10.79          139.0         45.77
SemTalk          20.18         8.304         11.36          134.1         39.15
The web table keeps the key benchmark numbers visible without opening the result image. The original paper table reports scaled values for readability; the relative comparison follows the paper figure.
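
For readers who want to check the rankings themselves, here is a small sketch that picks the best method per metric from the BEAT2 numbers listed on this page. The dictionary layout and the helper name `best_per_metric` are illustrative choices, not part of any released evaluation code; methods without a reported value for a metric are skipped.

```python
# BEAT2 values copied from the table on this page.
BEAT2 = {
    "CaMN":     {"FGD": 6.644, "BC": 6.769, "DIV": 10.86},
    "DSG":      {"FGD": 8.811, "BC": 7.241, "DIV": 11.49},
    "TalkSHOW": {"FGD": 6.209, "BC": 6.947, "DIV": 13.47, "MSE": 7.791, "LVD": 7.771},
    "EMAGE":    {"FGD": 5.512, "BC": 7.724, "DIV": 13.06, "MSE": 7.680, "LVD": 7.556},
    "DiffSHEG": {"FGD": 8.986, "BC": 7.142, "DIV": 11.91, "MSE": 7.665, "LVD": 8.673},
    "SemTalk":  {"FGD": 4.278, "BC": 7.770, "DIV": 12.91, "MSE": 6.153, "LVD": 6.938},
}

# True means lower is better, False means higher is better.
LOWER_IS_BETTER = {"FGD": True, "BC": False, "DIV": False, "MSE": True, "LVD": True}


def best_per_metric(table, lower_is_better):
    """Return {metric: best method}, ignoring methods with no reported value."""
    best = {}
    for metric, lower in lower_is_better.items():
        reported = {name: row[metric] for name, row in table.items() if metric in row}
        pick = min if lower else max
        best[metric] = pick(reported, key=reported.get)
    return best


for metric, method in best_per_metric(BEAT2, LOWER_IS_BETTER).items():
    print(f"{metric}: {method}")
# FGD: SemTalk
# BC: SemTalk
# DIV: TalkSHOW
# MSE: SemTalk
# LVD: SemTalk
```

The same helper works for the SHOW numbers if they are entered in the same dictionary form.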

BibTeX

@inproceedings{zhang2025semtalk,
  title={SemTalk: Holistic Co-speech Motion Generation with Frame-level Semantic Emphasis},
  author={Zhang, Xiangyue and Li, Jianfang and Zhang, Jiaxu and Dang, Ziqiang and Ren, Jianqiang and Bo, Liefeng and Tu, Zhigang},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={13761--13771},
  year={2025}
}