AI-Powered Real-Time Hand-Object Pose Estimation Unveiled

Abstract

Significant advancements have been achieved in understanding the poses and interactions of two hands manipulating an object. The emergence of augmented reality (AR) and virtual reality (VR) technologies has heightened the demand for real-time performance in these applications. However, current state-of-the-art models often achieve promising results at the expense of substantial computational overhead. In this paper, we present the query-optimized real-time Transformer (QORT-Former), the first Transformer-based real-time framework for 3D pose estimation of two hands and an object. We first limit the number of queries and decoders to meet the efficiency requirement. Given the limited number of queries and decoders, we optimize the queries that are input to the Transformer decoder to secure better accuracy: (1) we divide the queries into three types (a left-hand query, a right-hand query, and an object query) and enhance the query features (2) by using the contact information between the hands and the object and (3) by applying a three-step update of the enhanced image and query features with respect to one another. With the proposed methods, we achieve real-time pose estimation using just 108 queries and one decoder (53.5 FPS on an RTX 3090 Ti GPU). Our method surpasses state-of-the-art results on the H2O dataset by 17.6% (left hand), 22.8% (right hand), and 27.2% (object), and on the FPHA dataset by 5.3% (right hand) and 10.4% (object). Additionally, it sets the state of the art in interaction recognition while maintaining real-time efficiency with an off-the-shelf action recognition module.
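As a rough illustration of the query design the abstract describes, the sketch below shows three learnable query groups (left hand, right hand, object) refined against image features over three update steps in a single decoder. This is not the authors' released code: the 36/36/36 split of the 108 queries, the feature dimension, and the reuse of one cross-attention layer across the steps are illustrative assumptions, and the contact-based query enhancement is omitted.

```python
import torch
import torch.nn as nn

class ThreeTypeQueryDecoder(nn.Module):
    """Illustrative sketch of a single-decoder design with three query groups
    (left hand, right hand, object) refined over three update steps.
    Not the authors' implementation; the 36/36/36 split of the 108 queries,
    the feature dimension, and the single shared cross-attention layer are
    assumptions, and contact-based query enhancement is omitted."""

    def __init__(self, dim=256, n_left=36, n_right=36, n_obj=36, num_steps=3):
        super().__init__()
        self.left_q = nn.Parameter(torch.randn(n_left, dim))
        self.right_q = nn.Parameter(torch.randn(n_right, dim))
        self.obj_q = nn.Parameter(torch.randn(n_obj, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.num_steps = num_steps

    def forward(self, image_feats):
        # image_feats: (batch, num_pixels, dim) flattened backbone feature map
        b = image_feats.shape[0]
        queries = torch.cat([self.left_q, self.right_q, self.obj_q], dim=0)
        queries = queries.unsqueeze(0).expand(b, -1, -1)
        for _ in range(self.num_steps):
            # Queries attend to image features at each step; in the full method,
            # image and query features are updated with respect to one another.
            attended, _ = self.cross_attn(queries, image_feats, image_feats)
            queries = self.norm(queries + attended)
        return queries  # (batch, 108, dim): refined hand and object queries

# Example usage with a dummy 16x16 feature map:
decoder = ThreeTypeQueryDecoder()
feats = torch.randn(2, 16 * 16, 256)
out = decoder(feats)  # torch.Size([2, 108, 256])
```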

A new AI-powered framework has been developed that enables real-time analysis of two hands manipulating an object.

A research team, led by Professor Seungryul Baek of the UNIST Artificial Intelligence Graduate School, has introduced the Query-Optimized Real-Time Transformer (QORT-Former) framework, which accurately estimates the 3D poses of two hands and an object in real time. Unlike previous methods that require substantial computational resources, QORT-Former achieves exceptional efficiency while maintaining state-of-the-art accuracy.

Figure 1. Examples of estimated 3D poses on the H2O dataset: each row shows (a) the input RGB image, (b) our hand-object queries, (c) the ground-truth contact map, (d) the predicted contact map, and (e) the final 3D pose estimation results.

To optimize performance, the team proposed a novel query division strategy that enhances query features by leveraging contact information between the hands and the object, in conjunction with a three-step feature update within the transformer decoder. With only 108 queries and a single decoder, QORT-Former achieves 53.5 frames per second (FPS) on an RTX 3090 Ti GPU, making it the fastest known model for hand-object pose estimation.
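For readers curious how a frames-per-second figure like this is typically measured, the following is a generic benchmarking sketch, not the authors' evaluation script; the 256x256 input resolution and the warm-up and iteration counts are placeholder assumptions.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, input_shape=(1, 3, 256, 256), n_warmup=20, n_iters=200):
    """Generic FPS measurement for a single-image pose model.
    The input resolution and iteration counts are placeholder assumptions."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    for _ in range(n_warmup):        # warm up to exclude one-off startup costs
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()     # wait for all queued GPU work to finish
    start = time.perf_counter()
    for _ in range(n_iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return n_iters / (time.perf_counter() - start)  # frames per second
```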

Professor Seungryul Baek stated, "QORT-Former represents a significant advancement in the understanding of hand-object interactions." He further noted, "It not only enables real-time applications in augmented reality (AR), virtual reality (VR), and robotics, but also pushes the boundaries of real-time AI models."

"Our work demonstrates that efficiency and accuracy can be optimized simultaneously," Co-first author Khalequzzaman Sayem remarked. "We anticipate broader adoption of our method in fields that require real-time hand-object interaction analysis."

The study has been accepted to the 39th Annual AAAI Conference on Artificial Intelligence (AAAI 2025), one of the world's most prestigious academic conferences in the field of artificial intelligence. The work was supported by the Ministry of Science and ICT (MSIT), the AI Center, and CJ Corporation.

Journal Reference

Elkhan Ismayilzada, Md Khalequzzaman Chowdhury Sayem, Yihalem Yimolal Tiruneh, et al., "QORT-Former: Query-optimized Real-time Transformer for Understanding Two Hands Manipulating Objects," in Proc. of the Annual AAAI Conference on Artificial Intelligence (AAAI), Pennsylvania, USA, 2025.
