Towards Robust and Smooth 3D Multi-Person Pose Estimation from Monocular Videos in the Wild

Prof. Joonseok Lee’s team proposed POTR-3D, the first sequence-to-sequence 2D-to-3D lifting model for 3D Multi-person Pose Estimation (3DMPPE). It is powered by a novel geometry-aware data augmentation strategy that can generate unbounded data with a variety of views while accounting for the ground plane and occlusions. Through extensive experiments, they verify that the proposed model and data augmentation generalize robustly to diverse unseen views, recover poses under heavy occlusion, and reliably generate more natural and smoother outputs. The effectiveness of their approach is verified not only by state-of-the-art performance on public benchmarks, but also by qualitative results on more challenging in-the-wild videos.

[3D pose estimation task]

3D pose estimation aims to reproduce the 3D coordinates of a person appearing in an untrimmed 2D video. It has been extensively studied in the literature and has many real-world applications, e.g., sports [1], healthcare [2], games [3], movies [4], and video compression [5]. Instead of fully rendering 3D voxels, this work narrows the scope to reconstructing a handful of body key-points (e.g., neck, knees, or ankles), which concisely represent the dynamics of human motion in the real world.

[Challenges]

3D multi-person pose estimation (3DMPPE) from a monocular video is particularly challenging and still largely uncharted, and existing methods are far from applicable to in-the-wild scenarios. The authors identify three unresolved issues with existing methods: lack of robustness to camera views unseen during training, vulnerability to occlusion, and severe jittering in the output.
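Of the three issues, jitter is the simplest to quantify. The sketch below is an illustration only, not necessarily the metric the authors use: it assumes a hypothetical helper `mean_acceleration` and a track stored as a list of frames, each frame being a list of (x, y, z) joints. It takes the mean magnitude of the second finite difference over time as a rough proxy for jitter.

```python
import math


def mean_acceleration(track):
    """Mean magnitude of the per-joint second finite difference over time.

    track: list of frames; each frame is a list of (x, y, z) joint positions.
    Returns 0.0 when fewer than three frames are available.
    """
    if len(track) < 3:
        return 0.0
    total, count = 0.0, 0
    for prev, cur, nxt in zip(track, track[1:], track[2:]):
        for p, c, n in zip(prev, cur, nxt):
            # Second finite difference per coordinate: p - 2c + n
            ax, ay, az = (p[i] - 2.0 * c[i] + n[i] for i in range(3))
            total += math.sqrt(ax * ax + ay * ay + az * az)
            count += 1
    return total / count


# One joint moving at constant velocity: no jitter.
smooth = [[(0.1 * t, 0.0, 0.0)] for t in range(5)]
# Same motion with an alternating +/-0.05 offset: visible jitter.
jittery = [[(0.1 * t + (0.05 if t % 2 else -0.05), 0.0, 0.0)] for t in range(5)]

print(mean_acceleration(smooth), mean_acceleration(jittery))
```

A lower value indicates a smoother trajectory; the constant-velocity track scores essentially zero, while the alternating offset raises the score.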

[Proposed Method]

The proposed method, POTR-3D, realizes a sequence-to-sequence (seq2seq) 2D-to-3D lifting model for 3DMPPE for the first time. The authors also devise a simple but effective data augmentation strategy that can generate an unlimited amount of occlusion-aware augmented data with diverse views. Together, the two components tackle the three challenges above and adapt well to in-the-wild videos.
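The general idea behind this kind of augmentation can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: every function name, the pinhole camera parameters, the camera height, and the pixel-radius occlusion test are hypothetical. Each person is rotated about the vertical axis and moved along the ground plane (so feet stay on the ground), then projected to 2D, and a joint is marked occluded when a joint of a closer person projects nearby.

```python
import math
import random


def rotate_about_vertical(joints, angle):
    """Rotate (x, y, z) joints about the vertical y-axis; heights are unchanged."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in joints]


def place_on_ground(joints, dx, dz):
    """Translate a person along the ground plane (y = 0) only."""
    return [(x + dx, y, z + dz) for x, y, z in joints]


def project(joints, focal=1000.0, cx=640.0, cy=360.0, cam_height=1.5):
    """Pinhole projection; the camera sits cam_height above the ground, looking along +z."""
    return [(cx + focal * x / z, cy - focal * (y - cam_height) / z)
            for x, y, z in joints]


def occlusion_mask(scene_2d, scene_3d, radius=20.0):
    """Mark a joint hidden if another person's closer joint projects within radius pixels."""
    masks = []
    for i, (p2d, p3d) in enumerate(zip(scene_2d, scene_3d)):
        mask = []
        for (u, v), (_, _, z) in zip(p2d, p3d):
            hidden = any(
                math.hypot(u - u2, v - v2) < radius and z2 < z
                for j, (q2d, q3d) in enumerate(zip(scene_2d, scene_3d)) if j != i
                for (u2, v2), (_, _, z2) in zip(q2d, q3d)
            )
            mask.append(hidden)
        masks.append(mask)
    return masks


def augment_scene(people, rng, lateral=(-2.0, 2.0), depth=(3.0, 8.0)):
    """Build one synthetic multi-person sample: 2D inputs, 3D targets, occlusion flags."""
    scene_3d = []
    for joints in people:
        rotated = rotate_about_vertical(joints, rng.uniform(0.0, 2.0 * math.pi))
        scene_3d.append(place_on_ground(rotated, rng.uniform(*lateral), rng.uniform(*depth)))
    scene_2d = [project(p) for p in scene_3d]
    return scene_2d, scene_3d, occlusion_mask(scene_2d, scene_3d)


# Toy three-joint "person": ankle, hip, head, standing at the origin.
figure = [(0.0, 0.0, 0.0), (0.0, 0.9, 0.0), (0.0, 1.7, 0.0)]
sample_2d, sample_3d, sample_mask = augment_scene([figure, figure], random.Random(0))
```

Because rotation and translation are sampled at random, calling `augment_scene` repeatedly yields as many distinct views and occlusion patterns as desired from a fixed set of single-person poses.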

Sungchan Park, Eunyi Lyou, Inhoe Lee, Joonseok Lee.

Proceedings of the 19th IEEE/CVF International Conference on Computer Vision (ICCV), 2023.

References

[1] Lewis Bridgeman, Marco Volino, Jean-Yves Guillemaut, and Adrian Hilton. Multi-person 3D pose estimation and tracking in sports. In CVPR Workshops, 2019.
[2] Qingqiang Wu, Guanghua Xu, Sicong Zhang, Yu Li, and Fan Wei. Human 3D pose estimation in a lying position by RGB-D images for medical diagnosis and rehabilitation. In Proc. of the Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), 2022.
[3] Shian-Ru Ke, LiangJia Zhu, Jenq-Neng Hwang, Hung-I Pai, Kung-Ming Lan, and Chih-Pin Liao. Real-time 3D human pose estimation from monocular view with applications to event detection and video gaming.
[4] Karteek Alahari, Guillaume Seguin, Josef Sivic, and Ivan Laptev. Pose estimation and segmentation of people in 3D movies. In ICCV, 2013.
[5] Ting-Chun Wang, Arun Mallya, and Ming-Yu Liu. One-shot free-view neural talking-head synthesis for video conferencing. arXiv:2011.15126, 2020.