Given an image and a pretrained generator, I can solve for the latent variable that best reconstructs the target image. How well this works, however, depends on many factors, which I explore below.
The simplest loss function minimizes the L2 norm between the target and reconstructed images. However, the L2 distance does not match our perceptual distance, so I also tried a perceptual loss that minimizes the distance between activations at different layers of VGG. The perceptual loss made a slight improvement in image quality, especially in recovering high-frequency details like whiskers and the fur on the chin. It does, however, take longer to run.
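A minimal sketch of this projection loop is shown below, assuming a pretrained `generator` module and using truncated VGG-19 features from `torchvision` for the perceptual term; the layer cut-off, loss weight, and `z_dim` attribute are illustrative assumptions rather than the exact setup I used.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Frozen VGG-19 feature extractor for the perceptual loss
# (the cut-off at layer 16 is illustrative; several layers can be combined).
vgg = models.vgg19(pretrained=True).features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def reconstruction_loss(fake, target, perc_weight=0.5):
    """L2 in pixel space plus L2 between VGG feature activations."""
    l2 = F.mse_loss(fake, target)
    perc = F.mse_loss(vgg(fake), vgg(target))
    return l2 + perc_weight * perc

def project(generator, target, n_steps=1000, lr=0.01):
    """Optimize a latent z so that generator(z) matches the target image."""
    # z_dim: the generator's latent dimensionality (assumed attribute name)
    z = torch.randn(1, generator.z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        loss = reconstruction_loss(generator(z), target)
        loss.backward()
        opt.step()
    return z.detach()
```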
I can also vary the image generator used. Switching to a vanilla GAN clearly produces worse-looking results, although it runs much faster. StyleGAN has more capacity than the vanilla GAN and also performs generation at multiple feature resolutions, allowing it to better capture both low and high frequencies.
The StyleGAN network uses a mapping network that maps from Gaussian noise $z$ to a latent code $w$. We can also allow a separate latent code for each block in StyleGAN ($w+$).
The optimization was slightly slower for both $w$ and $w+$. $w+$ looks the best, which is not surprising given that it has the most degrees of freedom.
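Below is a minimal sketch of how the three latent spaces can be parameterized for this optimization. It assumes a StyleGAN2-ADA-style model exposing `mapping`, `synthesis`, `z_dim`, and `num_ws`; those names are assumptions about the checkpoint's API, not necessarily the code I ran.

```python
import torch

def init_latent(G, space="w+"):
    """Initialize the optimization variable in z, w, or w+ space."""
    z = torch.randn(1, G.z_dim)
    if space == "z":
        return z.requires_grad_(True)
    with torch.no_grad():
        w = G.mapping(z, None)          # shape: (1, num_ws, w_dim)
    if space == "w":
        # a single shared code, later broadcast to every synthesis block
        w = w[:, :1, :].clone()
    return w.requires_grad_(True)       # "w+": a separate code per block

def synthesize(G, latent, space="w+"):
    """Decode the optimized variable back to an image."""
    if space == "z":
        return G(latent, None)
    w = latent.expand(-1, G.num_ws, -1) if space == "w" else latent
    return G.synthesis(w)
```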
The Vanilla GAN's interpolation looks less noisy, but it is much more abrupt. Interpolation in $z$ looks unnatural, and $w+$ does not change the pose properly. Interpolation in $w$ looks best.
*Figure: interpolation results for two source/destination image pairs, comparing the Vanilla GAN, StyleGAN (z), StyleGAN (w), and StyleGAN (w+).*
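The interpolations themselves can be produced by linearly blending the two projected latent codes; a minimal sketch, reusing the hypothetical `synthesize` helper from above, with the same idea applying in $z$, $w$, or $w+$:

```python
import torch

def interpolate(G, latent_src, latent_dst, space="w", n_frames=8):
    """Linearly interpolate between two projected latent codes and
    decode each intermediate code back to an image."""
    frames = []
    for t in torch.linspace(0.0, 1.0, n_frames):
        latent = (1.0 - t) * latent_src + t * latent_dst
        frames.append(synthesize(G, latent, space))
    return torch.cat(frames, dim=0)
```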
I tried optimizing for both the provided sketches and my own sketches. For each sketch, I optimized in the $z$, $w$, and $w+$ spaces. I found that the optimized parameter often drifts out of distribution, so I also tried adding a proximity loss that penalizes the latent $z$ for deviating from 0.
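One way to set up this objective is sketched below. The binary `mask` restricting the loss to drawn pixels and the `prox_weight` hyperparameter are illustrative assumptions, not necessarily the exact formulation I used.

```python
import torch
import torch.nn.functional as F

def sketch_loss(fake, sketch, mask, z=None, prox_weight=0.01):
    """Match the generated image to the sketch only where strokes exist,
    plus an optional proximity term keeping z close to the prior mean (0)."""
    loss = F.mse_loss(fake * mask, sketch * mask)
    if z is not None:
        loss = loss + prox_weight * z.pow(2).mean()
    return loss
```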
Overall, the results were roughly what I expected. No single latent space performed significantly better than the others; what mattered most was how close the initialization was to the target image. If the optimization had to traverse a large region of the latent space, it was likely to drift off the distribution of real cats, which the proximity loss on $z$ slightly mitigated. I suspect the results would be better across the board if I had selected the best result over multiple runs.
Denser sketches are generally more challenging, especially Image 1. Colors that are out of distribution are challenging to optimize for, as shown by the overlay figures.