Project Page ECCV 2026

StreamTalk: Streaming Co-Speech Gesture Generation with Key-Pose Anchoring

1 The University of Tokyo · 2 Tongyi Lab, Alibaba Group · 3 Nanyang Technological University · 4 Renmin University
* Equal contribution.
StreamTalk teaser
Why StreamTalk is needed. Existing streaming systems generate one clip, feed its tail to the next clip, and never check whether the trajectory is still plausible. The left side illustrates the resulting distributional drift: small local errors accumulate until the motion moves away from natural human poses. StreamTalk adds a closed-loop correction path. At each clip boundary, it retrieves a plausible destination key pose and refines the clip toward that anchor, keeping minute-scale co-speech gestures stable.

Abstract

Real-time co-speech gesture generation must produce 3D motion clip by clip as speech streams in. Existing streaming methods are fundamentally open-loop: each clip is synthesized from past context and directly passed to the next window, so small errors compound into long-horizon distributional drift. StreamTalk addresses this with a closed-loop streaming framework. At inference time, Streaming Pose-Guided Generation (SPG) first generates a coarse clip, retrieves a plausible tail key pose from a speaker-specific motion database, and refines the clip with this destination anchor before moving to the next window. During training, Stochastic Anchor Masking (SAM) teaches the model to inpaint complete motion from sparse boundary anchors, while a part-aware DiT separates hands, body, and translation streams. On BEAT2, StreamTalk improves motion quality, suppresses minute-scale drift, and runs in real time.

Method

StreamTalk uses a generate-retrieve-refine cycle to periodically pull streaming motion back toward a plausible pose distribution.

StreamTalk method
Architecture. During training, Stochastic Anchor Masking (SAM) randomly hides pose and translation frames so the model learns to reconstruct full motion from sparse boundary cues. During inference, Streaming Pose-Guided Generation (SPG) runs a generate-retrieve-refine loop: generate a coarse clip, retrieve a plausible tail key pose from a speaker-specific motion database, then refine the clip with that anchor. The part-aware DiT separates hands, body, and translation so global trajectory errors do not corrupt local articulation.
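The generate-retrieve-refine loop described above can be sketched as plain control flow. This is a minimal sketch, not the released implementation: `stream_gestures`, `generate_coarse`, `retrieve_tail_pose`, `refine`, and `context_len` are hypothetical names introduced here for illustration, and the model calls are passed in as opaque callables.

```python
from typing import Callable, List, Sequence

Pose = List[float]   # one frame of pose features (hypothetical representation)
Clip = List[Pose]    # one streaming window of frames


def stream_gestures(
    audio_windows: Sequence[object],
    seed_context: Clip,
    generate_coarse: Callable[[object, Clip], Clip],
    retrieve_tail_pose: Callable[[Clip], Pose],
    refine: Callable[[object, Clip, Clip, Pose], Clip],
    context_len: int = 4,  # assumed number of tail frames carried over
) -> List[Clip]:
    """Run one generate-retrieve-refine cycle per audio window."""
    context = list(seed_context)
    clips: List[Clip] = []
    for audio in audio_windows:
        coarse = generate_coarse(audio, context)           # 1. draft clip from past context
        anchor = retrieve_tail_pose(coarse)                # 2. plausible destination key pose
        refined = refine(audio, context, coarse, anchor)   # 3. regenerate toward the anchor
        clips.append(refined)
        context = refined[-context_len:]                   # corrected tail feeds the next window
    return clips
```

The point of the sketch is the last line of the loop: the next window is conditioned on the refined clip, never on the uncorrected draft.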

Closed-Loop Streaming

Open-loop and closed-loop streaming
Open loop versus closed loop. Open-loop methods only know where the current clip starts, so each window can wander in a locally plausible but globally drifting direction. SPG gives each clip a weak destination: one retrieved key pose near the tail. This single waypoint is enough to pull the trajectory back toward the natural pose manifold without over-constraining every frame.
Closed-loop: generate, retrieve, and refine each streaming window.
76 FPS: real-time inference reported on the BEAT2 benchmark.
Minute-scale: designed for stable long-horizon co-speech gesture synthesis.
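
The open-loop versus closed-loop contrast can be illustrated with a one-dimensional toy model. This is only an illustration of error accumulation, not the StreamTalk model: `simulate_tail_offsets`, `per_clip_error`, and `pull` are hypothetical names, and the constant per-clip error is an assumption chosen to keep the example deterministic.

```python
def simulate_tail_offsets(num_clips, per_clip_error, pull=0.0, anchor=0.0):
    """Return the tail offset after each clip in a 1-D toy model.

    pull=0.0 is open loop; pull in (0, 1] moves each clip tail that
    fraction of the way toward the anchor before the next clip starts.
    """
    offsets = []
    tail = 0.0
    for _ in range(num_clips):
        tail = tail + per_clip_error          # clip inherits the previous tail and adds error
        tail = tail + pull * (anchor - tail)  # closed-loop correction toward the anchor
        offsets.append(tail)
    return offsets


open_loop = simulate_tail_offsets(100, 0.01)
closed_loop = simulate_tail_offsets(100, 0.01, pull=0.5)
```

In this toy setting the open-loop offset grows in proportion to the number of clips, while the closed-loop offset settles near `per_clip_error * (1 - pull) / pull`, so it stays bounded however long the stream runs.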

Core Quantitative Results

The key number is FGD: a lower FGD means the generated gesture distribution is closer to real human motion. StreamTalk achieves the lowest FGD in both the single-speaker streaming setting (0.383) and the harder all-speaker setting (0.293), while keeping beat consistency (BC) and diversity (DIV) close to the ground truth.
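
To make "closer distribution" concrete: FGD is usually computed as a Fréchet distance between Gaussian fits of features extracted from real and generated motion. The sketch below is a simplified version that assumes diagonal covariances, so it needs only the Python standard library; the full metric uses full covariance matrices and a learned feature encoder, neither of which is shown here. The function name `diagonal_frechet_distance` is hypothetical.

```python
import math
from statistics import fmean, pvariance


def diagonal_frechet_distance(real, generated):
    """Fréchet distance between two feature sets under diagonal-Gaussian fits.

    real, generated: lists of feature vectors with the same dimensionality.
    Simplification (assumption): covariances are treated as diagonal, so the
    distance reduces to a per-dimension sum of squared mean gaps and squared
    standard-deviation gaps.
    """
    dims = len(real[0])
    total = 0.0
    for d in range(dims):
        r = [row[d] for row in real]
        g = [row[d] for row in generated]
        mean_gap = fmean(r) - fmean(g)
        std_gap = math.sqrt(pvariance(r)) - math.sqrt(pvariance(g))
        total += mean_gap ** 2 + std_gap ** 2
    return total
```

Identical feature sets give a distance of zero, and the value grows as the generated features shift or change spread relative to the real ones.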

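BEAT2 1-Speaker Setting

| Method | FGD (lower is better) | BC (closer to GT is better) | DIV (closer to GT is better) |
| --- | --- | --- | --- |
| GT | - | 0.703 | 11.97 |
| EMAGE | 0.551 | 0.772 | 13.06 |
| RAG-GESTURE | 0.808 | 0.734 | 11.97 |
| EchoMask | 0.462 | 0.774 | 13.37 |
| GestureLSM | 0.409 | 0.714 | 13.42 |
| SemTalk | 0.428 | 0.777 | 12.91 |
| StreamTalk | 0.383 | 0.704 | 13.18 |

BEAT2 All-Speakers Setting

| Method | FGD (lower is better) | BC (closer to GT is better) | DIV (closer to GT is better) |
| --- | --- | --- | --- |
| GT | - | 0.477 | 7.29 |
| CaMN | 0.512 | 0.200 | 5.58 |
| EMAGE | 0.692 | 0.284 | 6.06 |
| HoloGest | 0.646 | 0.803 | 13.53 |
| RAG-GESTURE | 0.487 | 0.514 | 9.94 |
| StreamTalk | 0.293 | 0.616 | 7.27 |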

BC and DIV are judged by their closeness to the ground-truth row rather than simply maximized. The all-speaker result is especially important because it shows that the retrieved anchors help beyond a single speaker and generalize across identities.
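
Reading BC and DIV relative to the ground truth amounts to ranking methods by absolute gap to the GT row. The snippet below does this for the all-speakers numbers reported above; the helper names `gap_to_gt` and `closest_to_gt` are introduced here for illustration only.

```python
# BC and DIV values copied from the BEAT2 all-speakers table.
GT = {"BC": 0.477, "DIV": 7.29}
ALL_SPEAKERS = {
    "CaMN": {"BC": 0.200, "DIV": 5.58},
    "EMAGE": {"BC": 0.284, "DIV": 6.06},
    "HoloGest": {"BC": 0.803, "DIV": 13.53},
    "RAG-GESTURE": {"BC": 0.514, "DIV": 9.94},
    "StreamTalk": {"BC": 0.616, "DIV": 7.27},
}


def gap_to_gt(results, gt, metric):
    """Absolute distance of each method from the ground-truth value."""
    return {name: abs(vals[metric] - gt[metric]) for name, vals in results.items()}


def closest_to_gt(results, gt, metric):
    """Name of the method whose metric is nearest the ground truth."""
    gaps = gap_to_gt(results, gt, metric)
    return min(gaps, key=gaps.get)


div_winner = closest_to_gt(ALL_SPEAKERS, GT, "DIV")
bc_winner = closest_to_gt(ALL_SPEAKERS, GT, "BC")
```

On these numbers StreamTalk has the smallest DIV gap (7.27 against 7.29), while RAG-GESTURE has the smallest BC gap (0.514 against 0.477).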

What Makes It Work?

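The gains come from matching training and inference: SAM teaches the model to use sparse anchors, and SPG supplies those anchors during streaming inference. Either piece alone helps less than the full closed-loop system. In the 1-speaker setting, SPG alone lowers FGD from 0.478 to 0.455 and SAM alone does not improve it (0.503), whereas the combination reaches 0.383.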
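
One way to picture SAM is as a per-clip visibility mask sampled during training. The sketch below is an assumption-heavy illustration, not the paper's masking schedule: the function name `sample_anchor_mask`, the always-visible head frames, the tail-only anchor positions, and the `keep_prob` parameter are all hypothetical choices made here.

```python
import random


def sample_anchor_mask(num_frames, head_frames=4, max_tail_anchors=1,
                       keep_prob=0.5, rng=None):
    """Return a list of booleans: True = frame visible, False = masked.

    Assumptions (not from the source): the first `head_frames` frames are
    always visible as start-of-clip context, and each of the last
    `max_tail_anchors` frames is revealed independently with `keep_prob`.
    """
    rng = rng or random.Random()
    visible = [False] * num_frames
    for t in range(min(head_frames, num_frames)):
        visible[t] = True                    # boundary context at the clip start
    tail_start = max(head_frames, num_frames - max_tail_anchors)
    for t in range(tail_start, num_frames):
        if rng.random() < keep_prob:         # stochastically reveal a tail anchor
            visible[t] = True
    return visible
```

Under this reading, a model trained on such masks sees clips both with and without a tail anchor, which matches the two passes SPG makes at inference: a coarse pass without a destination and a refinement pass with one.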

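SAM and SPG Ablation

| Method | 1-Spk FGD | 1-Spk BC | 1-Spk DIV | All FGD | All BC | All DIV |
| --- | --- | --- | --- | --- | --- | --- |
| GT | - | 0.703 | 11.97 | - | 0.477 | 7.29 |
| StreamTalk base | 0.478 | 0.716 | 12.30 | 0.391 | 0.621 | 7.21 |
| + SAM | 0.503 | 0.747 | 13.72 | 0.379 | 0.613 | 7.31 |
| + SPG | 0.455 | 0.695 | 14.24 | 0.353 | 0.607 | 7.49 |
| + SAM and SPG | 0.383 | 0.704 | 13.18 | 0.293 | 0.616 | 7.27 |

Anchor Source and Refinement

| Variant | FGD (lower is better) | BC (closer to GT is better) | DIV (closer to GT is better) |
| --- | --- | --- | --- |
| GT | - | 0.703 | 11.97 |
| Random anchor | 0.673 | 0.743 | 13.12 |
| Retrieved anchor | 0.503 | 0.747 | 12.57 |
| Retrieved + linear refinement | 0.471 | 0.753 | 12.31 |
| Retrieved + StreamTalk refinement | 0.383 | 0.704 | 13.11 |

Key-Pose Position and Count

| Position | Number | FGD (lower is better) | BC (closer to GT is better) | DIV (closer to GT is better) |
| --- | --- | --- | --- | --- |
| Random | 8 | 0.601 | 0.759 | 13.51 |
| Random | 4 | 0.540 | 0.716 | 13.43 |
| Random | 1 | 0.426 | 0.715 | 13.05 |
| Middle | 1 | 0.408 | 0.693 | 13.55 |
| Tail | 4 | 0.443 | 0.700 | 13.15 |
| Tail | 1 | 0.383 | 0.704 | 13.18 |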

A random anchor hurts because an arbitrary pose can point the motion in the wrong direction. A retrieved tail anchor works because drift is mainly a destination problem: the clip needs one plausible waypoint near its boundary, not dense constraints at many frames.
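
As a rough illustration of a retrieved tail anchor, the sketch below looks up the nearest pose in a per-speaker database and then blends a clip toward it with a linear ramp. Several things here are assumptions rather than details from the source: the query is taken to be the coarse clip's last frame, the distance is Euclidean, and the ramp is only one simple reading of a linear refinement baseline. The names `retrieve_tail_anchor` and `linear_refine` are hypothetical.

```python
import math


def retrieve_tail_anchor(query_pose, pose_database):
    """Return the database pose nearest to query_pose (Euclidean distance)."""
    return min(pose_database, key=lambda pose: math.dist(pose, query_pose))


def linear_refine(clip, anchor):
    """Shift frames toward the anchor with a weight ramping from 0 to 1.

    The first frame is unchanged, so continuity with the previous clip is
    kept, and the last frame lands exactly on the anchor.
    """
    if len(clip) == 1:
        return [list(anchor)]
    offset = [a - t for a, t in zip(anchor, clip[-1])]
    last = len(clip) - 1
    return [
        [x + (i / last) * o for x, o in zip(frame, offset)]
        for i, frame in enumerate(clip)
    ]
```

For example, `linear_refine([[0.0], [1.0], [2.0]], [4.0])` returns `[[0.0], [2.0], [4.0]]`. A ramp like this constrains only the destination, which is the idea behind a single tail waypoint; the learned refinement in StreamTalk replaces the ramp with a second generation pass.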

Long-Horizon Stability

StreamTalk is designed for streaming, so what matters is not only whether a short clip looks good but also whether quality decays after many clip-to-clip transitions.
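
The velocity and acceleration statistics reported in this section rest on frame-to-frame differences of the pose sequence. The sketch below shows only that finite-difference step; how the resulting distributions are compared to the ground truth is not specified on this page and is not reproduced here. The function name `finite_differences` and the `fps` parameter are hypothetical.

```python
def finite_differences(frames, fps):
    """Per-frame velocity and acceleration for a sequence of pose vectors.

    frames: list of equal-length pose vectors, one per frame.
    fps: frame rate used to convert per-frame differences to per-second units.
    """
    velocity = [
        [(b - a) * fps for a, b in zip(prev, cur)]
        for prev, cur in zip(frames, frames[1:])
    ]
    acceleration = [
        [(b - a) * fps for a, b in zip(prev, cur)]
        for prev, cur in zip(velocity, velocity[1:])
    ]
    return velocity, acceleration
```

A discontinuity at a clip boundary shows up as an outlier in these values, which is why velocity and acceleration statistics are sensitive to clip-to-clip transitions.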

Velocity and Acceleration Distribution Distance

| Method | VEL (lower is better) | ACC (lower is better) |
| --- | --- | --- |
| GT | 0.0 | 0.0 |
| EMAGE | 2.40e2 | 4.03e2 |
| GestureLSM | 2.33e2 | 4.30e2 |
| SemTalk | 2.28e2 | 3.58e2 |
| StreamTalk | 2.15e2 | 2.33e2 |

Sliding-Window FGD Stability

| Window | w/o SPG mean | w/o SPG std | w/ SPG mean | w/ SPG std |
| --- | --- | --- | --- | --- |
| 70 | 0.4600 | 0.0097 | 0.4000 | 0.0044 |
| 80 | 0.4600 | 0.0149 | 0.4000 | 0.0042 |
| 90 | 0.4600 | 0.0167 | 0.4000 | 0.0033 |
| 100 | 0.4600 | 0.0150 | 0.4000 | 0.0025 |

FGD over time
FGD over time. Open-loop models tend to accumulate errors as more clips are generated. SPG keeps the sliding-window FGD lower and less volatile, which means later parts of a long sequence remain closer to the real gesture distribution instead of drifting away.
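
A sliding-window stability summary of this kind can be sketched as follows. This assumes the metric is evaluated on every full window of consecutive items and then summarized by its mean and standard deviation across windows; the page does not define the window unit or stride, so `sliding_window_summary`, `window`, `stride`, and the pluggable `metric` callable are hypothetical.

```python
from statistics import fmean, pstdev


def sliding_window_summary(items, window, metric, stride=1):
    """Evaluate `metric` on each full window and return (mean, std) of the values.

    items: ordered sequence (for example, per-clip features).
    metric: callable mapping a window of items to a float, such as an
            FGD-style distance against reference data.
    """
    if window > len(items):
        raise ValueError("window is longer than the sequence")
    values = [
        metric(items[start:start + window])
        for start in range(0, len(items) - window + 1, stride)
    ]
    return fmean(values), pstdev(values)
```

A lower mean indicates better quality on average, and a lower standard deviation indicates that quality changes less as the window moves through the sequence.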

Qualitative and User Study

The tables show distribution-level improvements. The visual comparison shows what those numbers look like in motion: smoother transitions, fewer frozen poses, and more stable arm spacing across long speech.

StreamTalk qualitative comparison
Qualitative comparison on BEAT2. EMAGE often becomes repetitive with low-amplitude swings, GestureLSM is locally smooth but can spatially drift at clip boundaries, and SemTalk can freeze during pauses or low-energy speech. StreamTalk preserves rhythmic articulation while keeping arm positions and idle motion stable across the sequence, because each clip is corrected before it becomes the context for the next clip.
User study results
User study. Participants ranked shuffled long video segments on realism, rhythm consistency, motion-speech synchrony, and diversity. StreamTalk receives the strongest overall preference, especially on realism and rhythm consistency, matching the quantitative finding that closed-loop anchoring suppresses drift without removing expressive variation.

BibTeX

@inproceedings{zhang2026streamtalk,
  title={StreamTalk: Streaming Co-Speech Gesture Generation with Key-Pose Anchoring},
  author={Zhang, Xiangyue and Li, Jianfang and Zhang, Jiaxu and Yang, Kaixing and Hoi, Steven},
  booktitle={European Conference on Computer Vision},
  year={2026}
}