Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval

Warning

This publication does not fall under the Institute of Computer Science, but under the Faculty of Informatics. The official publication page is on the muni.cz website.

Authors	MESSINA Nicola, SEDMIDUBSKÝ Jan, FALCHI Fabrizio, REBOK Tomáš
Year of publication	2025
Type	Article in a scholarly periodical
Journal / Source	ACM Transactions on Multimedia Computing, Communications, and Applications
MU Faculty / Unit	Faculty of Informatics
Citation
Web	https://doi.org/10.1145/3744565
DOI	https://doi.org/10.1145/3744565
Keywords	3D human motion; cross-modal retrieval; multi-modal understanding; text-motion retrieval
Description	Pose-estimation methods enable extracting human motion from common videos in the structured form of 3D skeleton sequences. Despite great application opportunities, effective content-based access to such spatio-temporal motion data is a challenging problem. In this paper, we focus on the recently introduced text-motion retrieval tasks, which aim to search for the database motions most relevant to a specified natural-language textual description (text-to-motion) and vice versa (motion-to-text). Despite recent efforts to explore these promising avenues, a primary challenge remains the insufficient data available to train robust text-motion models effectively. To address this issue, we propose to investigate joint-dataset learning, in which we train on multiple text-motion datasets simultaneously, together with a Cross-Consistent Contrastive Loss function (CCCL), which regularizes the learned text-motion common space by imposing uni-modal constraints that augment the representation ability of the trained network. To learn a proper motion representation, we also introduce a transformer-based motion encoder, called MoT++, which employs spatio-temporal attention to process skeleton data sequences. We demonstrate the benefits of the proposed approaches on the widely used KIT Motion-Language and HumanML3D datasets, and also report some results on the recent Motion-X dataset. We perform detailed experimentation on joint-dataset learning and cross-dataset scenarios, showing the effectiveness of each introduced module in a carefully conducted ablation study and, in turn, pointing out the limitations of state-of-the-art methods. The code for reproducing our results is available at https://github.com/mesnico/MOTpp.
Related projects:	Automated forensic laboratory of digital data for detecting complex criminal activity (Automatizovaná forenzní laboratoř digitálních dat pro odhalování komplexní trestné činnosti)
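
The description above refers to retrieval in a common text-motion space learned with a contrastive objective. The following is a minimal sketch of that general idea only, not the authors' implementation: it ranks motion embeddings by cosine similarity to a text embedding and computes a standard symmetric InfoNCE loss over matched text-motion pairs. The function names (`cosine`, `rank_motions`, `symmetric_info_nce`), the temperature value, and the toy two-dimensional embeddings are all assumptions made for illustration. The uni-modal constraints of CCCL are not specified in the description and are not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_motions(text_emb, motion_embs):
    """Text-to-motion retrieval: motion indices, most similar first."""
    scores = [cosine(text_emb, m) for m in motion_embs]
    return sorted(range(len(motion_embs)), key=lambda i: scores[i], reverse=True)


def symmetric_info_nce(text_embs, motion_embs, temperature=0.07):
    """Standard symmetric InfoNCE; the i-th text matches the i-th motion.

    Assumption: this is a generic cross-modal contrastive loss, not CCCL.
    """
    n = len(text_embs)
    sim = [[cosine(t, m) / temperature for m in motion_embs] for t in text_embs]

    def mean_cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            top = max(row)  # subtract the max for numerical stability
            log_z = top + math.log(sum(math.exp(s - top) for s in row))
            total += log_z - row[i]
        return total / n

    # Rows give the text-to-motion direction, columns the motion-to-text one.
    columns = [list(col) for col in zip(*sim)]
    return 0.5 * (mean_cross_entropy(sim) + mean_cross_entropy(columns))


if __name__ == "__main__":
    query = [1.0, 0.0]
    motions = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
    print(rank_motions(query, motions))

    texts = [[1.0, 0.0], [0.0, 1.0]]
    print(symmetric_info_nce(texts, [[1.0, 0.0], [0.0, 1.0]]))
    print(symmetric_info_nce(texts, [[0.0, 1.0], [1.0, 0.0]]))
```

In this toy setting the loss is close to zero when each text embedding coincides with its paired motion embedding and becomes large when the pairs are swapped, which is the behaviour a contrastive objective is meant to enforce in the shared space.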