Overview

Wearing face masks has emerged as a way to limit the spread of COVID-19. However, face masks also cover a large portion of the face, hiding expressions that would otherwise be a great aid to communication during conversation. Linguists have confirmed that reading the shape of the lips greatly helps language understanding, and whether the speaker is smiling reveals their emotions. We would therefore like to generate the facial expressions covered by masks, aiming for faces that look as natural as possible.

Goal

Datasets

We use the Flickr-Faces-HQ (FFHQ) dataset, which contains faces without masks. We also use the Correctly Masked Face Dataset (CMFD) subset of MaskedFace-Net, which contains faces with face masks. Because MaskedFace-Net is built by overlaying masks on FFHQ images, each masked face has an unmasked counterpart, so the two datasets are paired for convenient use.

Part 1: CycleGAN

We first naively treated mask versus no mask as a style-domain translation problem. We trained the model on unpaired data and then used only the "mask -> no mask" generator to produce faces without face masks.

The CycleGAN loss is the combination of the following four terms with the cycle-consistency loss.
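For reference, the standard CycleGAN objective from the original CycleGAN paper is commonly written as below. The notation is ours rather than this project's: $G$ maps masked faces $X$ to unmasked faces $Y$, $F$ maps back, $D_X$ and $D_Y$ are the two discriminators, and $\lambda_{cyc}$ is an assumed weight. The loss actually used in this project may differ in detail.

$$
\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda_{cyc}\,\mathcal{L}_{cyc}(G, F)
$$

$$
\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim X}\big[\lVert F(G(x)) - x \rVert_1\big] + \mathbb{E}_{y \sim Y}\big[\lVert G(F(y)) - y \rVert_1\big]
$$

Counting both adversarial terms and both directions of the cycle-consistency term gives four terms in total.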

The result is shown below, and it is pretty bad. Although we meant to remove the mask, the generated faces are tinted blue while the mask areas turn skin-colored. Clearly, wearing a mask should not be treated as a "style"; intuitively, the covered and uncovered areas need to be handled separately. Paired data may also work better for delicate facial expressions.

Part 2: Pixel2Pixel

The result is shown below: the generated mouth often shows the folds of the mask. One possible reason is that the $L_1$ loss is disruptive here, because the network is trying to “generate a mouth” with respect to the “shape and color of a face mask”. What’s more, we would do better to adopt an architecture that “already knows” there will be a mouth under the mask.

Part 3: StyleGAN2

So we tried StyleGAN2. We did not optimize the generator; instead we used one pretrained on the FFHQ dataset, which already knows there will be a nose and a mouth under the mask. The main optimization happens in the latent space (W+ space). We first generate a random face from noise by mapping it to the latent code manifold of faces. We then compute the loss with respect to the target face and gradually make the generated face similar to the target. Specifically, the perceptual loss is computed from the conv2 layer of the pretrained VGG network. We tried several weightings of the perceptual loss + $L_2$ combination, and $\lambda=0.8$ gave the best result.

The underlying idea is this: we believe the face with a face mask lacks the latent code for the mouth, so we “borrow” that code from a randomly generated face in order to generate a face that looks like the picture on the right.
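The latent-space optimization described above can be sketched with a toy example. This is a minimal illustration under strong assumptions, not the project's code: a fixed linear map stands in for the pretrained StyleGAN2 generator, a plain squared error replaces the perceptual + $L_2$ loss, and all names (`generate`, `optimize_latent`, `weights`) are hypothetical. What it does show is the key point that only the latent code is updated while the generator stays frozen.

```python
import random

def generate(weights, latent):
    """Toy frozen 'generator': a fixed linear map from latent code to pixels."""
    return [sum(w * z for w, z in zip(row, latent)) for row in weights]

def squared_error(image, target):
    """Sum of squared pixel differences (stand-in for the real loss)."""
    return sum((p - t) ** 2 for p, t in zip(image, target))

def optimize_latent(weights, target, latent, steps=500, lr=0.05):
    """Gradient descent on the latent code only; `weights` is never modified."""
    latent = list(latent)
    for _ in range(steps):
        residual = [p - t for p, t in zip(generate(weights, latent), target)]
        # d/dz_j of sum_i (G(z)_i - t_i)^2 = 2 * sum_i residual_i * W_ij
        grad = [2.0 * sum(r * row[j] for r, row in zip(residual, weights))
                for j in range(len(latent))]
        latent = [z - lr * g for z, g in zip(latent, grad)]
    return latent

rng = random.Random(0)
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]    # frozen toy generator (assumption)
target = generate(weights, [0.3, -0.7])            # a target the generator can reach
start = [rng.gauss(0.0, 1.0) for _ in range(2)]    # latent of the "random face"
fitted = optimize_latent(weights, target, start)
```

In this toy setting the target is reachable, so the optimized latent reproduces it closely; with a real generator and a masked target the outcome depends on the starting latent.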

Since the $L_1$ loss had failed in the Pixel2Pixel model, this time we tried the network with pure BCE loss and got some nightmarish results, shown below. We found that we should not remove the $L_1$ loss but instead apply it with a mask constraint.

So we switched to MSE loss with masking. For each picture, we manually draw a binary mask over the blue face mask to drop that area out of backpropagation. The result is shown in the middle, and the ground truth is shown on the right. As you can see, the generated face looks pretty natural now.
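A masked MSE of this kind can be sketched as follows. This is a simplified illustration on a flattened toy image, not the project's implementation; the names `masked_mse`, `masked_mse_grad`, and `keep` are hypothetical. Here `keep` is 1 for visible pixels and 0 for pixels under the face mask, so the covered pixels contribute neither to the loss nor to the gradient.

```python
def masked_mse(generated, target, keep):
    """Mean squared error over pixels where keep == 1; other pixels are ignored."""
    kept = sum(keep)
    if kept == 0:
        raise ValueError("binary mask excludes every pixel")
    return sum(k * (g - t) ** 2 for g, t, k in zip(generated, target, keep)) / kept

def masked_mse_grad(generated, target, keep):
    """Gradient of masked_mse with respect to each generated pixel."""
    kept = sum(keep)
    if kept == 0:
        raise ValueError("binary mask excludes every pixel")
    return [2.0 * k * (g - t) / kept for g, t, k in zip(generated, target, keep)]

# Flattened toy "image": the last two pixels lie under the face mask.
generated = [0.2, 0.5, 0.9, 0.1]
target = [0.0, 0.5, 0.0, 1.0]
keep = [1, 1, 0, 0]

loss = masked_mse(generated, target, keep)
grad = masked_mse_grad(generated, target, keep)
```

Because the gradient is exactly zero under the face mask, the optimizer is free to fill that region with whatever the pretrained generator considers a plausible mouth and nose.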

Some extra results are shown below. As we can see, the generated image may not resemble the original in the uncovered area.

The generated result depends heavily on the first “random face”, so extreme failure cases appear when that face is too different from the target face. The failure case below shows that when the target face is female while the random face is male, the result is not good. Also, the network sometimes can’t figure out whether the color around the face is a hat, clothes, or hair. That's why we get a super stylish man on the right who looks totally different from the target face.

Part 4: Future Work

For future work, we can add facial landmark detection to make the mask drawing process automatic. We can also design an encoder that encodes the masked target image into the latent space, and then make sure the sampled latent code stays close to that encoding. In principle, though, guessing the facial expression under the cover still involves a lot of randomness.
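One way the automatic mask drawing might work is to turn detected landmarks into a filled polygon. The sketch below assumes a landmark detector has already returned points outlining the covered region; the coordinates are made up for illustration, and `point_in_polygon` and `keep_mask_from_landmarks` are hypothetical helpers, not part of any library.

```python
def point_in_polygon(x, y, polygon):
    """Even-odd ray casting test for (x, y) against a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def keep_mask_from_landmarks(height, width, polygon):
    """Return a height x width grid: 0 inside the polygon (covered), 1 elsewhere."""
    return [[0 if point_in_polygon(col + 0.5, row + 0.5, polygon) else 1
             for col in range(width)]
            for row in range(height)]

# Made-up landmarks (x, y) on an 8x8 image: nose bridge, right cheek, chin, left cheek.
landmarks = [(4.0, 2.0), (7.0, 4.0), (4.0, 7.0), (1.0, 4.0)]
keep = keep_mask_from_landmarks(8, 8, landmarks)
```

The resulting grid has the same role as the hand-drawn mask: zeros mark the area to exclude from the loss.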

16-726 Learning-Based Image Synthesis