Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval

Authors

MESSINA Nicola, SEDMIDUBSKÝ Jan, FALCHI Fabrizio, REBOK Tomáš

Year of publication: 2025
Type: Article in Periodical
Magazine / Source: ACM Transactions on Multimedia Computing, Communications, and Applications
MU Faculty or unit

Faculty of Informatics

Keywords: 3D human motion; cross-modal retrieval; multi-modal understanding; text-motion retrieval
Description: Pose-estimation methods enable extracting human motion from common videos in the structured form of 3D skeleton sequences. Despite great application opportunities, effective content-based access to such spatio-temporal motion data is a challenging problem. In this paper, we focus on the recently introduced text-motion retrieval tasks, which aim to search for database motions that are most relevant to a specified natural-language textual description (text-to-motion) and vice versa (motion-to-text). Despite recent efforts to explore these promising avenues, a primary challenge remains the insufficient data available to train robust text-motion models effectively. To address this issue, we investigate joint-dataset learning, where we train on multiple text-motion datasets simultaneously, together with a Cross-Consistent Contrastive Loss function (CCCL), which regularizes the learned text-motion common space by imposing uni-modal constraints that augment the representation ability of the trained network. To learn a proper motion representation, we also introduce a transformer-based motion encoder, called MoT++, which employs spatio-temporal attention to process skeleton data sequences. We demonstrate the benefits of the proposed approaches on the widely used KIT Motion-Language and HumanML3D datasets, and also report results on the recent Motion-X dataset. We perform detailed experiments on joint-dataset learning and cross-dataset scenarios, showing the effectiveness of each introduced module in a carefully conducted ablation study and, in turn, pointing out the limitations of state-of-the-art methods. The code for reproducing our results is available at: https://github.com/mesnico/MOTpp
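The exact CCCL formulation is not given in this abstract. The following is a minimal PyTorch sketch of one plausible instantiation, assuming a symmetric InfoNCE cross-modal term combined with uni-modal consistency terms between two views of the same text or motion; all names, the augmented-view inputs, and the weighting scheme are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
# Hypothetical sketch of a cross-modal contrastive loss with added
# uni-modal consistency terms, in the spirit of the CCCL described above.
# The paper's actual formulation may differ; names are illustrative.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings a[i] <-> b[i]."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def cross_consistent_loss(text_emb, motion_emb,
                          text_emb_aug, motion_emb_aug, lam=0.5):
    # Cross-modal term: align each text with its paired motion.
    cross = info_nce(text_emb, motion_emb)
    # Uni-modal terms: embeddings of two views of the same text (or the
    # same motion) should also be close, regularizing the shared space.
    uni = (info_nce(text_emb, text_emb_aug) +
           info_nce(motion_emb, motion_emb_aug))
    return cross + lam * uni
```

Similarly, a factorized spatio-temporal attention block over skeleton sequences might look like the sketch below. This illustrates the general technique named in the abstract, not the actual MoT++ architecture.

```python
# Generic factorized spatio-temporal attention over a skeleton sequence;
# an assumption-laden sketch, not the MoT++ implementation.
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, frames, joints, dim)
        B, T, J, D = x.shape
        # Spatial attention: joints attend to each other within a frame.
        s = x.reshape(B * T, J, D)
        n = self.norm1(s)
        s = s + self.spatial(n, n, n)[0]
        # Temporal attention: each joint attends across frames.
        t = s.reshape(B, T, J, D).permute(0, 2, 1, 3).reshape(B * J, T, D)
        n = self.norm2(t)
        t = t + self.temporal(n, n, n)[0]
        return t.reshape(B, J, T, D).permute(0, 2, 1, 3)
```

Factorizing attention over joints and then frames keeps the per-block cost at O(J^2 + T^2) rather than the O((J*T)^2) of full joint spatio-temporal attention, which is why this pattern is common for skeleton data.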