HiTMM: Generative Temporal Masked Modeling of Human Interactive Motions

Human-human interaction generation has seen recent progress. However, directly generating complex two-person interactive motions remains a significant challenge. Existing models typically employ two independent timelines when generating motions for interactive scenarios involving two individuals. This design overlooks the temporal dependencies between the two motions at each timestep and fails to account for the roles of active and reactive participants during generation, often resulting in unrealistic and unnatural motions. In this work, we propose HiTMM, a novel framework for Human interaction generation based on Temporal Masked Modeling. HiTMM first decomposes the human interaction into two separate single-person motions. Individual motions within the interaction belong to the same type, enabling them to be mapped to a shared latent space through a coarse-to-fine approach that generates multi-layer discrete tokens. We then arrange all tokens of the two interacting individuals along a shared timeline. Subsequently, we employ a masked transformer to model the base-layer motion tokens and a residual transformer to model the rest-layer motion tokens. Both the base-layer and rest-layer motion tokens are arranged along a single timeline, allowing the model to learn the role of the initially generated motion. Because our model uses a shared temporal representation, it can also perform temporal editing on specific regions within human interaction sequences. Experimental results show that our model achieves an FID of 5.017 on the InterHuman dataset, surpassing the current state-of-the-art model (vs 5.154 for InterMask), and an FID of 0.373 on the InterX dataset (vs 0.399 for InterMask).
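To make the shared-timeline idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "arranging tokens along a shared timeline" means interleaving the two persons' base-layer tokens per timestep (a0, b0, a1, b1, ...), and that masking replaces a random subset of positions with a sentinel. The function names, the `MASK_ID` value, and the interleaving order are all hypothetical.

```python
import numpy as np

MASK_ID = -1  # hypothetical sentinel for a masked position (not from the paper)


def interleave_shared_timeline(tokens_a, tokens_b):
    """Place two equal-length token sequences on one timeline: a0, b0, a1, b1, ..."""
    tokens_a = np.asarray(tokens_a, dtype=np.int64)
    tokens_b = np.asarray(tokens_b, dtype=np.int64)
    if tokens_a.ndim != 1 or tokens_a.shape != tokens_b.shape:
        raise ValueError("expected two 1-D token sequences of equal length")
    timeline = np.empty(2 * len(tokens_a), dtype=np.int64)
    timeline[0::2] = tokens_a  # even slots: person A
    timeline[1::2] = tokens_b  # odd slots: person B
    return timeline


def split_shared_timeline(timeline):
    """Inverse of interleave_shared_timeline: recover the per-person sequences."""
    timeline = np.asarray(timeline)
    return timeline[0::2], timeline[1::2]


def mask_tokens(timeline, ratio, rng):
    """Replace a random fraction of positions with MASK_ID; return masked copy and indices."""
    timeline = np.asarray(timeline, dtype=np.int64)
    n_mask = int(round(ratio * len(timeline)))
    idx = rng.choice(len(timeline), size=n_mask, replace=False)
    masked = timeline.copy()
    masked[idx] = MASK_ID
    return masked, np.sort(idx)
```

With this layout, a single transformer sees both persons' tokens for a given timestep next to each other, which is one way the temporal dependency between the two motions could be exposed to the model.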


HiTMM generates human-human interactive motions using a shared timeline under causal temporal control. Given a text description of an interaction, our method produces high-quality 3D human motions that maintain temporal continuity.

Abstract

Overall Architecture of HiTMM
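The abstract describes a coarse-to-fine tokenizer that produces multi-layer discrete tokens (a base layer plus rest layers). Residual vector quantization is a common way to obtain such layered tokens, so the sketch below illustrates that general technique. Whether HiTMM's tokenizer matches it in detail is an assumption; the function name and the toy codebooks are hypothetical.

```python
import numpy as np


def residual_quantize(x, codebooks):
    """Quantize vectors x of shape (N, D) with a list of codebooks, each (K, D).

    Layer 0 quantizes x itself (coarse); each later layer quantizes the
    residual left by the previous layers (fine). Returns the token indices
    with shape (num_layers, N) and the reconstruction with shape (N, D).
    """
    residual = np.asarray(x, dtype=np.float64)
    recon = np.zeros_like(residual)
    tokens = []
    for codebook in codebooks:
        codebook = np.asarray(codebook, dtype=np.float64)
        # Squared Euclidean distance from every residual to every code.
        dists = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        chosen = codebook[idx]
        tokens.append(idx)
        recon = recon + chosen
        residual = residual - chosen
    return np.stack(tokens, axis=0), recon
```

In this reading, row 0 of the returned tokens plays the role of the base layer handled by the masked transformer, and the remaining rows play the role of the rest layers handled by the residual transformer.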

Gallery of Generation

Two humans take a step back and prepare for the assault.

One tosses the ball to the other, and the other catches it.

Two people bow to each other.

The first person raises the right leg aggressively towards the second.

These two people are fencing each other.

Two humans keep up a forward motion.

Two performers approach each other.

These two people attacked each other with their legs.

Two people extend their arms and rotate counterclockwise.

Comparisons

InterGen

in2IN

InterMask

HiTMM

One person kicks the right leg first and then the left leg towards another person

InterGen

in2IN

InterMask

HiTMM

One person greets with his right hand, while the other person places his hands on his chest.

InterGen

in2IN

InterMask

HiTMM

Two people exchange greetings with a wave of the hand.

Application: Temporal Editing

We demonstrate HiTMM's ability to perform localized, text-guided motion inpainting within existing motion sequences. Specifically, we present inpainting results for three regions of a motion clip: the prefix (beginning), the middle (inbetweening), and the suffix (end).
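As a rough illustration of how region selection for such editing could work on a shared timeline, the sketch below builds a boolean mask over timesteps for the prefix, middle, or suffix region and applies it to both persons' tokens at once. It is a sketch under stated assumptions, not the authors' code: the function names, the `fraction` parameter, and the `mask_id` sentinel are hypothetical, and the model that would regenerate the masked tokens from the new text is not shown.

```python
import numpy as np


def editing_mask(num_steps, region, fraction=0.5):
    """Boolean mask over timesteps; True marks the steps to regenerate."""
    if region not in ("prefix", "middle", "suffix"):
        raise ValueError("region must be 'prefix', 'middle', or 'suffix'")
    if num_steps < 1 or not 0.0 < fraction <= 1.0:
        raise ValueError("need num_steps >= 1 and 0 < fraction <= 1")
    span = max(1, int(round(fraction * num_steps)))
    if region == "prefix":
        start = 0
    elif region == "suffix":
        start = num_steps - span
    else:  # middle: center the edited span
        start = (num_steps - span) // 2
    mask = np.zeros(num_steps, dtype=bool)
    mask[start:start + span] = True
    return mask


def apply_editing_mask(tokens, mask, mask_id=-1):
    """Mask the selected timesteps for every person; tokens has shape (persons, T)."""
    edited = np.asarray(tokens, dtype=np.int64).copy()
    edited[..., mask] = mask_id  # the same timesteps are masked for both persons
    return edited
```

Because the mask is defined over timesteps rather than per person, the unmasked tokens of both individuals stay fixed as context while the selected region is regenerated.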

Prefix

These two people greet each other.

One person lowers his arms.

Inbetweening

These two people greet each other.

Two people put their hands on their chests.

Suffix

These two people greet each other.

One person opens his arms to another person.

BibTeX