Experts at the Hong Kong Polytechnic University and Sun Yat-Sen University in China are teaching a computer to train itself to predict human pose in video. The approach is called self-supervision: the machine needs minimal human guidance and largely teaches itself.

The machine learning methods typically used for such tasks require large amounts of labeled data, and larger "deep" neural networks demand ever more of it. The researchers at Sun Yat-Sen have shown that a neural network can improve its learning by continuously matching the guesses of multiple networks against one another, gradually reducing its reliance on the "ground truth" supplied by an annotated data set.

According to the authors, previous methods of deducing human pose have been successful, but at the cost of a "time-consuming network architecture (e.g., ResNet-50) and limited scalability for all scenarios due to the insufficient 3D pose data."

The scientists show that this self-supervised method outperforms other artificial intelligence (AI) techniques at predicting human pose on several benchmark videos. According to them, it even beats their own results published in 2017.

The paper, titled "3D Human Pose Machines with Self-supervised Learning," is posted on the arXiv pre-print server; its authors are Keze Wang, Chenhan Jiang, Liang Lin, Pengxu Wei, and Chen Qian.

In their 2017 paper, the authors used the labeled MPII data set to extract 2-D human body parts from still images, then transformed those 2-D body parts into 3-D representations. In the latest paper, they again use the MPII data set for the same "pre-training," obtaining 2-D human poses from images, and they again use the "Human3.6M" data set, as in 2017, to provide the 3-D ground truth.
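To make that two-stage pipeline concrete, here is a minimal PyTorch sketch: a 2-D pose module of the kind that would be pre-trained on MPII-style 2-D labels, followed by a module that lifts 2-D joints into 3-D of the kind that would be fitted against Human3.6M-style 3-D ground truth. The tiny networks, joint count, and layer sizes are illustrative assumptions, not the authors' actual architecture.

```python
# Illustrative two-stage pipeline: image features -> 2-D pose -> 3-D pose.
# The MLPs below are placeholders for the real modules described in the paper.
import torch
import torch.nn as nn

NUM_JOINTS = 16  # assumed MPII-style joint count


class Pose2DEstimator(nn.Module):
    """Stand-in for the image-to-2-D-pose module (pre-trained on 2-D labels)."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, NUM_JOINTS * 2),   # (x, y) per joint
        )

    def forward(self, image_features):
        return self.net(image_features).view(-1, NUM_JOINTS, 2)


class Lifter2Dto3D(nn.Module):
    """Stand-in for the 2-D-to-3-D lifting module (trained with 3-D ground truth)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NUM_JOINTS * 2, 256), nn.ReLU(),
            nn.Linear(256, NUM_JOINTS * 3),   # (x, y, z) per joint
        )

    def forward(self, pose_2d):
        return self.net(pose_2d.flatten(1)).view(-1, NUM_JOINTS, 3)


# Pre-training stage: each module would be fitted against its own labeled data.
features = torch.randn(4, 512)          # dummy image features for a batch of 4
pose_2d = Pose2DEstimator()(features)   # supervised with MPII-style 2-D labels
pose_3d = Lifter2Dto3D()(pose_2d)       # supervised with Human3.6M-style 3-D labels
print(pose_2d.shape, pose_3d.shape)     # torch.Size([4, 16, 2]) torch.Size([4, 16, 3])
```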

The main difference this time lies in the final part of the neural network, where they do not use the 2-D and 3-D annotations. Instead, they match the 2-D pose implied by the 3-D model's prediction against the 2-D poses produced in the first step.

"After initialization, we substitute the predicted 2D poses and 3D poses for the 2D and 3D ground-truth to optimize" the model "in a self-supervised fashion," the authors write.

They "project the 3D coordinate(s)" of the 3-D human pose "into the image plane to obtain the projected 2D pose" and then "minimize the dissimilarity" between this projected 2-D pose and the original 2-D pose from the first step "as an optimization objective."
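Read literally, that objective can be sketched in a few lines of code: project the predicted 3-D joints back onto the image plane and penalize their distance from the first-stage 2-D estimate. The simple orthographic projection and mean-squared-error dissimilarity below are assumptions for illustration; the paper's exact projection model and loss may differ.

```python
# Sketch of the self-supervised objective: project 3-D pose to 2-D and
# compare it against the 2-D pose estimated in the first stage.
import torch


def project_to_image_plane(pose_3d, scale=1.0):
    """Orthographic projection (drop depth, keep x and y) -- an assumption for illustration."""
    return pose_3d[..., :2] * scale


def self_supervised_loss(pose_3d_pred, pose_2d_initial):
    """Dissimilarity between the projected 2-D pose and the first-stage 2-D estimate."""
    projected_2d = project_to_image_plane(pose_3d_pred)
    return torch.mean((projected_2d - pose_2d_initial) ** 2)


# Example: batch of 4 poses with 16 joints each.
pose_3d_pred = torch.randn(4, 16, 3, requires_grad=True)
pose_2d_initial = torch.randn(4, 16, 2)
loss = self_supervised_loss(pose_3d_pred, pose_2d_initial)
loss.backward()   # gradients flow into the 3-D prediction without any 3-D labels
print(loss.item())
```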

In effect, the neural network keeps checking whether its 3-D estimate is accurate. Finally, a long short-term memory (LSTM) network captures the continuity of the body across sequences of video frames to refine the 3-D model.
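A rough sketch of that temporal step is below, assuming a standard LSTM that smooths per-frame 3-D estimates across a clip; the layer sizes, joint count, and use of a plain nn.LSTM are illustrative, not the authors' exact design.

```python
# Illustrative temporal refinement: an LSTM reads per-frame 3-D pose estimates
# and outputs refined poses that respect continuity across frames.
import torch
import torch.nn as nn

NUM_JOINTS = 16


class TemporalPoseRefiner(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=NUM_JOINTS * 3, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, NUM_JOINTS * 3)

    def forward(self, pose_3d_seq):
        # pose_3d_seq: (batch, frames, joints, 3) -> refined poses of the same shape
        b, t = pose_3d_seq.shape[:2]
        h, _ = self.lstm(pose_3d_seq.view(b, t, -1))
        return self.out(h).view(b, t, NUM_JOINTS, 3)


refiner = TemporalPoseRefiner()
noisy_seq = torch.randn(2, 10, NUM_JOINTS, 3)   # 2 clips, 10 frames each
refined = refiner(noisy_seq)
print(refined.shape)   # torch.Size([2, 10, 16, 3])
```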

With this last step, the authors reduced the need for 3-D data, relying instead on 2-D images alone. "The imposed correction mechanism enables us to leverage the external large-scale 2D human pose data to boost 3D human pose estimation," they wrote.