Most work in the pose-to-body translation space deals with single-person activity. For example, Wang et al. 2018, Chan et al. 2019, and Balakrishnan et al. 2018 have demonstrated that single-person pose maps can be converted into dance videos.
While these works have shown stunningly good synthesis outputs, the lack of experiments on how well GANs handle multi-person pose synthesis is concerning. Unlike single-person settings, multi-person interactions exhibit interesting occlusions caused not only by the camera setup but also by the interactions between people.
In this project we attempt a two-stage GAN approach to address the interaction problem in pose-to-body synthesis for the multi-person setting. Our sample multi-pose-to-body output is presented below:
In addition to the challenges of collecting dance videos with no camera motion and a static background, we have seen difficulties in obtaining reliable pose estimates when the dancers occlude each other (see the failure cases below).
Data
Human-Human Interaction
We collected 10 couple-dance videos from YouTube, each about 3 minutes long. We extracted about 3000 frames per video and used 2000 frames for training and 1000 frames for testing. We found that, under human-human interaction, pose estimators like OpenPose and HRNet failed. Here are some failure cases:
[Figures: OpenPose/HRNet failure cases under inter-person occlusion]
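For reference, here is a minimal sketch of the frame extraction and train/test split described above, using OpenCV; the `extract_frames` helper, the directory layout, and the file names are illustrative assumptions rather than our released preprocessing script.

```python
import cv2
from pathlib import Path

def extract_frames(video_path, out_dir, train_frames=2000, test_frames=1000):
    """Dump frames from one dance video and split them chronologically into
    train/test sets (roughly 2000 train / 1000 test frames per video)."""
    out_dir = Path(out_dir)
    for split in ("train", "test"):
        (out_dir / split).mkdir(parents=True, exist_ok=True)

    cap = cv2.VideoCapture(str(video_path))
    idx = 0
    while idx < train_frames + test_frames:
        ok, frame = cap.read()
        if not ok:  # video ended early
            break
        split = "train" if idx < train_frames else "test"
        cv2.imwrite(str(out_dir / split / f"frame_{idx:05d}.png"), frame)
        idx += 1
    cap.release()

# e.g. one of the 10 YouTube couple videos (~3000 frames each):
# extract_frames("videos/couple_01.mp4", "data/couple_01")
```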
We propose a two-stage model for handling human interactions in multi-person pose maps (a sketch of how the stages fit together follows the list below). The stages are:
Local stage (G-Local)
Global refinement stage (G-Global)
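As a point of reference, here is a minimal sketch of how the two stages fit together at inference time, assuming pix2pixHD-style generators; the per-dancer generators `g_local_male` / `g_local_female`, the `g_global` refinement network, and the `crop_person` helper are hypothetical names standing in for our actual implementation.

```python
import torch

def two_stage_synthesis(pose_map, g_local_male, g_local_female, g_global, crop_person):
    """Two-stage multi-person pose-to-body synthesis (sketch).

    Stage 1 (G-Local):  each dancer's pose crop is translated independently.
    Stage 2 (G-Global): the composited local outputs, together with the full
                        pose map, are refined into one coherent frame.
    """
    b, _, h, w = pose_map.shape

    # Stage 1: per-person local translation on cropped pose maps.
    crop_m, box_m = crop_person(pose_map, person="male")    # box = (x1, y1, x2, y2)
    crop_f, box_f = crop_person(pose_map, person="female")
    local_m = g_local_male(crop_m)    # RGB patch for the male dancer
    local_f = g_local_female(crop_f)  # RGB patch for the female dancer

    # Composite the local RGB patches back onto a full-resolution canvas.
    canvas = torch.zeros(b, 3, h, w, device=pose_map.device)
    canvas[:, :, box_m[1]:box_m[3], box_m[0]:box_m[2]] = local_m
    canvas[:, :, box_f[1]:box_f[3], box_f[0]:box_f[2]] = local_f

    # Stage 2: global refinement conditioned on the full pose map and the
    # composited local predictions; this is where occlusion and interaction
    # between the dancers are resolved.
    return g_global(torch.cat([pose_map, canvas], dim=1))
```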
We discard the temporal dynamics of the frames in the videos, owing to the large computational resources needed even to run the baseline experiments from methods like Wang et al. 2018. Hence we explore only frame-level synthesis.
We use the pix2pixHD model from NVIDIA's open-source code for our implementation of the local generators. The code for our generator and the preprocessing steps is released here.
We first train the local generators on occlusion-free frames and then incorporate the global generator. When training the global generator, we initialize from the checkpoints of the best local generators and train on all the frames (including frames with occlusion) from the given video.
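A minimal sketch of this training schedule, assuming PyTorch-style training loops; `train_one_epoch`, the data loaders, and the checkpoint paths are hypothetical placeholders rather than the released code.

```python
import torch

def train_two_stage(g_local, g_global, occlusion_free_loader, all_frames_loader,
                    train_one_epoch, local_epochs=50, global_epochs=50):
    """Stage-wise training schedule (sketch).

    1. Train the local generator(s) only on occlusion-free frames.
    2. Load the best local checkpoint and train the global generator on all
       frames of the video, including the occluded ones.
    """
    # Stage 1: local generator(s) on occlusion-free frames only.
    for _ in range(local_epochs):
        train_one_epoch(g_local, occlusion_free_loader)
    torch.save(g_local.state_dict(), "checkpoints/g_local_best.pth")

    # Stage 2: global refinement, warm-started from the local checkpoint and
    # trained on every frame (occluded and occlusion-free).
    g_local.load_state_dict(torch.load("checkpoints/g_local_best.pth"))
    for _ in range(global_epochs):
        train_one_epoch(g_global, all_frames_loader, frozen_local=g_local)
    torch.save(g_global.state_dict(), "checkpoints/g_global_final.pth")
```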
We used four NVIDIA RTX 2080 Ti GPUs to train the models.
We use the pix2pixHD implementation as our baseline and present results below on one of the two videos with the most occlusion. Throughout the rest of this work, all our critical analysis is based on this video. The baseline simply takes the entire pose image and maps it to the RGB image in one step.
[GIFs: Input Multi-Pose | Baseline (Single Stage Translation)]
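For contrast with the two-stage pipeline above, the baseline boils down to a single generator call on the full multi-person pose map; a minimal sketch, where `g_single` stands for a pix2pixHD-style generator and is not the exact released API:

```python
import torch

def baseline_synthesis(pose_map, g_single):
    """Single-stage baseline (sketch): one generator maps the entire
    multi-person pose map to the RGB frame in a single pass, with no
    per-dancer local stage and no global refinement."""
    with torch.no_grad():
        return g_single(pose_map)  # (B, C_pose, H, W) -> (B, 3, H, W)
```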
We present some results on pose-to-body translation for extreme poses under no occlusion.
In addition, we present the output from the proposed model below. While both outputs look largely similar, the major differences are in the background synthesis and in the interaction when the dancers swap positions. Perceptually, we found that the baseline produces more inconsistent artifacts than the proposed method; for example, the edge artifacts in the sky have a color closer to the surrounding sky in the proposed model's output than in the baseline's.
Please refresh the page to time-sync the GIFs, and make sure everything is visible in the window.
[GIFs: Input Multi-Pose | Baseline (Single Stage Translation) | Local Generator Output for Male Dancer | Local Generator Output for Female Dancer]
We show qualitatively how the training of the global generator progresses through various epochs. We train for a total of 50 epochs.
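These progression visuals can be produced by running successive global-generator checkpoints on the same held-out conditioning input; a minimal sketch, where the checkpoint naming scheme and the `progression/` output folder are illustrative assumptions:

```python
import torch
from pathlib import Path
from torchvision.utils import save_image

def dump_progression(g_global, fixed_input,
                     epochs=range(5, 55, 5),
                     ckpt_pattern="checkpoints/g_global_epoch_{:02d}.pth"):
    """Render the same held-out conditioning input (pose map plus composited
    local outputs) with several global-generator checkpoints, so qualitative
    progress over the 50 training epochs can be compared side by side."""
    Path("progression").mkdir(exist_ok=True)
    for epoch in epochs:
        g_global.load_state_dict(torch.load(ckpt_pattern.format(epoch)))
        g_global.eval()
        with torch.no_grad():
            out = g_global(fixed_input)
        # normalize=True rescales a tanh output in [-1, 1] to a viewable image
        save_image(out, f"progression/epoch_{epoch:02d}.png", normalize=True)
```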
All the videos used for this project were obtained from YouTube.
We recommend viewing the website on a 21-inch display or larger for accessibility and a better viewing experience.