HiTMM: Generative Temporal Masked Modeling of Human Interactive Motions

Nanjing University of Science and Technology, Beijing Normal University, University of Sassari
HiTMM Motion Generation Demo

HiTMM generates human-human interactive motions along a shared timeline under causal temporal control. Given a textual description of an interaction, our method produces high-quality 3D human motions while maintaining temporal continuity.

Abstract

Recent years have seen steady progress in human-human interaction generation. However, directly generating complex two-person interactive motions remains a significant challenge. Moreover, existing models typically employ two independent timelines when generating motions for interactive scenarios involving two individuals. This design overlooks the temporal dependencies between the two motions at each timestep and fails to account for the roles of active and reactive participants during generation, often resulting in unrealistic and unnatural motions. In this work, we propose HiTMM, a novel framework for Human interaction generation based on Temporal Masked Modeling. HiTMM first decomposes the human interaction into two separate single-person motions. Because the individual motions within an interaction belong to the same type, they can be mapped to a shared latent space through a coarse-to-fine approach that produces multi-layer discrete tokens. We then arrange all tokens of the two interacting individuals along a shared timeline. Subsequently, we employ a masked transformer and a residual transformer to model the base-layer and rest-layer motion tokens, respectively. Both the base-layer and rest-layer motion tokens are arranged along a single timeline, allowing the model to learn the role of the initially generated motion. Notably, because our model uses a shared temporal representation, it can perform temporal editing on specific regions within human interaction sequences. Experimental results show that our model achieves an FID of 5.017 on the InterHuman dataset, surpassing the current state-of-the-art model (vs. 5.154 for InterMask), and an FID of 0.373 on the InterX dataset (vs. 0.399 for InterMask).
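To make the coarse-to-fine tokenization concrete, below is a minimal residual vector quantization (RVQ) sketch in PyTorch. The layer count, codebook size, and latent dimension are illustrative placeholders, not HiTMM's actual configuration.

```python
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    """Minimal residual vector quantizer: each layer quantizes the residual
    left over by the previous layers, yielding coarse-to-fine token layers."""

    def __init__(self, num_layers=6, codebook_size=512, dim=256):
        super().__init__()
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, dim) for _ in range(num_layers)]
        )

    def forward(self, z):  # z: (batch, time, dim) motion latents for one person
        residual, tokens = z, []
        for codebook in self.codebooks:
            # squared distance to every codebook entry -> nearest index per frame
            dist = (residual.unsqueeze(-2) - codebook.weight).pow(2).sum(-1)
            idx = dist.argmin(dim=-1)              # discrete tokens of this layer
            tokens.append(idx)
            residual = residual - codebook(idx)    # pass the leftover to the next layer
        return torch.stack(tokens)                 # (num_layers, batch, time) = t^{0:V}

# Applied to each person's motion latents, this yields t_a^{0:V} and t_b^{0:V}.
```

In this sketch, layer 0 of the returned stack plays the role of the base-layer tokens, while layers v > 0 correspond to the rest-layer tokens handled by the residual transformer.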



Overall Architecture of HiTMM



Overview of HiTMM. (a) The motion of each individual is quantized through residual vector quantization (RVQ), yielding the token stacks \( t_a^{0:V} \) and \( t_b^{0:V} \). (b) The base-layer motion tokens of the two individuals are first cross-combined into \( t_{cross}^0 \), which is then masked and processed by the masked transformer (TMT) for prediction. (c) The remaining tokens \( t_{cross}^{v>0} \) are predicted layer by layer by the residual transformer (TRT), conditioned on the tokens \( t_{cross}^{0:v-1} \) from the preceding layers.
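As a rough sketch of step (b), the two individuals' base-layer tokens can be interleaved along a single shared timeline and then randomly masked for the TMT to predict. The interleaving order, mask-token id, and mask ratio below are assumptions for illustration, not HiTMM's exact settings.

```python
import torch

MASK_ID = 1024  # hypothetical id reserved for the [MASK] token

def cross_combine(t_a0, t_b0):
    """Interleave the base-layer tokens of persons a and b along one timeline:
    (B, T) + (B, T) -> (B, 2T), ordered a_0, b_0, a_1, b_1, ..."""
    B, T = t_a0.shape
    return torch.stack([t_a0, t_b0], dim=-1).reshape(B, 2 * T)

def random_mask(t_cross, mask_ratio=0.5):
    """Replace a random subset of positions with MASK_ID; the masked transformer
    is trained to recover the original tokens at those positions."""
    mask = torch.rand_like(t_cross, dtype=torch.float) < mask_ratio
    return t_cross.masked_fill(mask, MASK_ID), mask

t_a0 = torch.randint(0, 512, (1, 8))   # person a, base-layer tokens
t_b0 = torch.randint(0, 512, (1, 8))   # person b, base-layer tokens
t_cross0 = cross_combine(t_a0, t_b0)   # shared timeline of length 2T
masked_tokens, mask = random_mask(t_cross0)
# masked_tokens would be fed to the TMT, which predicts the tokens under `mask`.
```

Because both persons' tokens share one sequence, a prediction at any position can attend to the other person's earlier and concurrent tokens, which is what allows the model to capture active and reactive roles.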

Gallery of Generation

Comparisons


We evaluate HiTMM against three state-of-the-art baselines, including two diffusion-based models (InterGen and in2IN) and a masked modeling approach (InterMask). Compared with these existing methods, HiTMM is better at capturing subtle linguistic nuances, enabling the generation of more realistic and natural human motions.


One person kicks the right leg first and then the left leg towards another person.




One person greets with his right hand, while the other person places his hands on his chest.




Two people exchange greetings with a wave of the hand.


Application: Temporal Editing


We demonstrate HiTMM's ability to perform localized motion inpainting within existing motion sequences, guided by textual descriptions. Specifically, we present inpainting results for three key regions of a motion clip: the prefix (beginning), the middle (in-between), and the suffix (end).
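Since every token sits on one shared timeline, this kind of editing can be viewed as token-level inpainting: tokens outside the edited region are kept fixed as context, and only the selected segment is re-masked and regenerated under the new text prompt. The helper below is a hypothetical sketch of how such region masks might be built; it is not HiTMM's released interface.

```python
import torch

MASK_ID = 1024  # hypothetical [MASK] token id, as in the sketch above

def editing_mask(t_cross, region, edit_frac=0.25):
    """Mask only the chosen region of the shared timeline so the transformer
    regenerates it while the remaining tokens stay fixed as context.
    region: 'prefix' (beginning), 'inbetween' (middle), or 'suffix' (end)."""
    B, L = t_cross.shape
    n = int(L * edit_frac)
    edited = t_cross.clone()
    if region == "prefix":
        edited[:, :n] = MASK_ID
    elif region == "suffix":
        edited[:, L - n:] = MASK_ID
    else:  # 'inbetween'
        start = (L - n) // 2
        edited[:, start:start + n] = MASK_ID
    return edited

# Example: regenerate only the middle of an existing interaction, conditioned on
# a new text prompt, while the prefix and suffix tokens stay untouched.
tokens = torch.randint(0, 512, (1, 64))
edit_input = editing_mask(tokens, region="inbetween")
```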

Prefix


These two people greet each other.

One person lowers his arms.


Inbetweening


These two people greet each other.

Two people put their hands on their chests.


Suffix


These two people greet each other.

One person opens his arms to another person.

BibTeX

BibTeX code here