Tiancheng Zhao (andrewid: tianchen), Chia-Chun Hsieh (andrewid: chiachun)
In this project, we explore the use of multi-modal instructions to edit images.
Object-deforming, structural-change style image editing is under-explored in the current diffusion-based image-editing literature (see Fig. 1).
We hypothesize that multi-modal instructions are a natural way to address this, because
visual inputs such as strokes and masks can specify where and how the structure should change much more precisely than text alone.
We experimented with three different approaches: (1) text-only instruction editing with Instruct-Pix2Pix, (2) stroke-based editing with a latent diffusion model (in the spirit of SDEdit), and (3) mask-guided cross-attention editing built on pix2pix-zero.
To verify that text-only editing instructions perform poorly for our purposes (structural changes),
we first explored different input images, prompts, numbers of diffusion steps, and guidance weights with Instruct-Pix2Pix.
Indeed, we observe that this method performs well for attribute changes, but produces many artifacts
across all of our samples when we ask the model to make structural changes. This can be attributed
to the training process of Instruct-Pix2Pix, which uses Prompt-to-Prompt to generate its training data,
and Prompt-to-Prompt itself struggles with structural edits.
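For reference, a minimal sketch of how such an Instruct-Pix2Pix run can be reproduced with the publicly available diffusers pipeline is shown below; the checkpoint name, image path, and prompt are illustrative assumptions rather than our exact setup.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

# Public Instruct-Pix2Pix checkpoint (assumed checkpoint name).
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("chihuahua.jpg").convert("RGB")  # hypothetical input image

# image_guidance_scale / guidance_scale are the image and text guidance weights
# discussed below; num_inference_steps is the step count we sweep.
edited = pipe(
    prompt="Make the chihuahua red",
    image=image,
    num_inference_steps=30,
    image_guidance_scale=1.5,
    guidance_scale=15,
).images[0]
edited.save("edited.png")
```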
For this experiment, we fixed the image guidance weight to 1.5 and text guidance weight to 15.
[Results: attribute edits with Instruct-Pix2Pix at 10, 30, and 50 inference steps for the prompts "Make the dog Brown", "Make the background snowy", and "Make the chihuahua red"; images omitted.]
For this experiment, we fixed the image guidance weight to 1.5, and number of inference steps to 30.
[Results: attribute edits with Instruct-Pix2Pix at text guidance weights 1.5, 15, and 30 for the same three prompts; images omitted.]
For this experiment, we fixed the image guidance weight to 1.5 and text guidance weight to 15.
[Results: structural edits with Instruct-Pix2Pix at 10, 30, and 50 inference steps for the prompts "Make the Samoyed jump", "Make the Samoyed lift its left leg", and "Give the chihuahua wings"; images omitted.]
For this experiment, we fixed the image guidance weight to 1.5, and number of inference steps to 30.
[Results: structural edits with Instruct-Pix2Pix at text guidance weights 1.5, 15, and 30 for the same three structural prompts; images omitted.]
For our second approach, stroke-based editing with a latent diffusion model, we again conducted experiments with different input images, prompts, numbers of diffusion steps, and guidance weights. We observe that while this approach produces better structural edits than Instruct-Pix2Pix, we were unable to isolate the editing to only the object of interest.
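Our exact pipeline is not listed here, but the idea can be sketched with diffusers' image-to-image pipeline: paint coarse strokes onto the input where the structure should change, partially noise the stroked image, and denoise it conditioned on the text prompt. The checkpoint name, file paths, and the strength value below are assumptions.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Stable Diffusion v1.5 as a stand-in for the latent diffusion model we used.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The input image has already been hand-edited with coarse strokes that
# roughly indicate the desired structural change (hypothetical file name).
stroked = Image.open("samoyed_with_strokes.png").convert("RGB")

generator = torch.Generator("cuda").manual_seed(1)  # the seed matters a lot in our experiments

result = pipe(
    prompt="A samoyed lifts its leg",
    image=stroked,
    strength=0.6,             # how much noise is added before denoising (SDEdit-style trade-off)
    num_inference_steps=30,   # the step counts swept below
    guidance_scale=15,        # text guidance weight
    generator=generator,
).images[0]
result.save("stroke_edited.png")
```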
For this experiment, we fixed the text guidance weight to 15.
We notice that the number of inference steps and the random seed have a large influence on the quality of the generated images. Moreover, the best number of inference steps is specific to each image and text prompt (35 for the jumping Samoyed and 30 for the Samoyed lifting its leg).
[Results: stroke-based editing at 30, 35, and 40 inference steps, with two random seeds, for the prompts "A jumping samoyed" and "A samoyed lifts its leg"; images omitted.]
For this experiment, we fixed the number of inference steps to 35 for the jumping samoyed and 30 for the samoyed lifting its leg.
We notice that the text guidance weight has little influence on the quality of the generated images.
[Results: stroke-based editing at text guidance weights 5, 10, and 15, with two random seeds, for the same two prompts; images omitted.]
This method can also be applied to other diffusion models, such as Versatile Diffusion, another publicly available LDM-based diffusion model. The results are similar, so we do not include them here.
We also observe that the artifacts introduced by the stroke-based editing cannot be removed when the number of inference steps is less than 30. We believe the nature of the LDM structure (diffusion in the latent space rather than in pixel space) causes this phenomenon. As in SDEdit, there is a trade-off: if the number of inference steps is too large, the generated image drifts far from the input, and if it is too small, the artifacts from the strokes remain. Unfortunately, we could not validate this hypothesis due to the lack of publicly available pixel-space text-to-image diffusion models.
Pix2pix-zero adopts BLIP to generate a text prompt for the input image. We then use CLIP to find the corresponding cross-attention map by selecting the caption word with the highest similarity to the target object to be modified. We apply a mask to constrain the region to be modified and resize it to the resolution of each level of the cross-attention maps.
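A minimal sketch of the word-matching and mask-resizing step, using the Hugging Face CLIP text encoder; the model name, the example caption, and the helper functions are assumptions rather than the exact code we ran.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

def best_matching_word(caption: str, target: str) -> str:
    """Return the caption word whose CLIP text embedding is most similar to the target object."""
    words = caption.split()
    inputs = tokenizer(words + [target], padding=True, return_tensors="pt")
    with torch.no_grad():
        embeds = F.normalize(text_model(**inputs).text_embeds, dim=-1)  # (len(words)+1, dim)
    sims = embeds[:-1] @ embeds[-1]                                     # cosine similarity to target
    return words[sims.argmax().item()]

def resize_mask(mask: torch.Tensor, res: int) -> torch.Tensor:
    """Downsample a binary H x W mask to the res x res grid of one cross-attention level."""
    return F.interpolate(mask[None, None].float(), size=(res, res), mode="nearest")[0, 0]

caption = "a white dog standing on top of a lush green field"  # BLIP caption of our example image
print(best_matching_word(caption, "dog"))                       # -> "dog"
mask16 = resize_mask(torch.zeros(512, 512), 16)                 # e.g. the 16x16 attention level
```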
Since pix2pix-zero adopts DDIM to add noise and then denoise, it preserves the structure of the image and the identity of the object better than SDEdit, which adopts DDPM to add noise and DDIM to denoise.
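For intuition (standard DDPM/DDIM equations, added here for clarity rather than taken from the original write-up): DDPM noising draws fresh Gaussian noise, while DDIM inversion is deterministic, so the reverse DDIM trajectory can retrace it and land back near the input.

$$x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\quad \epsilon \sim \mathcal{N}(0, I) \qquad \text{(DDPM forward, stochastic)}$$

$$x_{t+1} = \sqrt{\bar\alpha_{t+1}}\,\frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}} + \sqrt{1-\bar\alpha_{t+1}}\,\epsilon_\theta(x_t, t) \qquad \text{(DDIM inversion, deterministic)}$$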
To erase parts of the target object, we can zero out (set to 0) the corresponding region of the object's cross-attention maps.
However, naively masking out the region can lead to a pattern mismatch between the masked region and the rest of the image.
To achieve better quality, we can also specify which object should fill in the masked-out region by setting the value of that object's cross-attention maps in the region to 1. We display the text prompt generated by BLIP so that we can precisely specify the object used for the fill.
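A minimal sketch of this cross-attention edit, assuming the maps at one UNet level are available as a tensor of shape (heads, H*W, num_tokens) and that the relevant token indices were already located with the CLIP matching step above; all names, shapes, and indices are hypothetical.

```python
import torch

def edit_cross_attention(attn: torch.Tensor, mask: torch.Tensor,
                         zero_token_idx: int, one_token_idx: int) -> torch.Tensor:
    """Inside the masked region, zero one token's attention and set another token's to 1.

    attn: (heads, H*W, num_tokens) cross-attention maps at one UNet level.
    mask: (H, W) binary mask, already resized to this level's resolution.
    """
    attn = attn.clone()
    region = mask.flatten().bool()          # spatial positions to edit
    attn[:, region, zero_token_idx] = 0.0   # erase: remove this token's influence in the region
    attn[:, region, one_token_idx] = 1.0    # fill: force e.g. the "field" token to take over
    return attn

# Hypothetical usage for erasing part of the dog at the 16x16 attention level:
attn16 = torch.rand(8, 16 * 16, 77)                       # 8 heads, 77 text tokens
leg_mask16 = torch.zeros(16, 16)
leg_mask16[10:, 4:8] = 1                                   # region covering the part to remove
dog_token_idx, field_token_idx = 3, 9                      # indices found via the CLIP matching step
erased16 = edit_cross_attention(attn16, leg_mask16, dog_token_idx, field_token_idx)
```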
Our observations under different inference steps and cross-attention guidance weights are summarized below:
[Results: part removal at inference steps 30, 40, and 50 and cross-attention guidance weights 0.05, 0.1, and 0.15, comparing the naive erase with the better fill-in variant; the input image, input mask, and BLIP prompt "a white dog standing on top of a lush green field" are shown; images omitted.]
To add parts to the target object, we should first zero out (set to 0) the original object's cross-attention maps in that region, since the original object has high activation there; without masking out the original object, the image will hardly change.
Then, we set the value of the target object's cross-attention maps in that region to 1. Again, we display the text prompt generated by BLIP so that we can precisely specify the original object to mask out.
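The add case reuses the same edit with the token roles swapped; the snippet below calls the edit_cross_attention sketch from the erase section (the token indices are again hypothetical).

```python
# Suppress the original object ("dog") and boost the new part's token ("leg")
# inside the masked region, using edit_cross_attention from the sketch above.
added16 = edit_cross_attention(attn16, leg_mask16,
                               zero_token_idx=dog_token_idx,   # the original object dominating the region
                               one_token_idx=leg_token_idx)    # hypothetical index of a "leg" token
```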
The results are shown below:
[Results: part addition at inference steps 30, 40, and 50 and cross-attention guidance weights 0.05, 0.1, and 0.15; the input image and input mask are shown; images omitted.]
However, not all masks work equally well. As shown below, our method fails to add a leg horizontal to the ground. The underlying reason is that the verb "standing" in the text prompt constrains the editing process: the dog must be standing on top of a lush green field.
We have similar observations to the previous part, except that increasing the cross-attention guidance weight and the number of inference steps impairs the editing.
[Results: failed attempt to add a leg horizontal to the ground, at inference steps 30, 40, and 50 and cross-attention guidance weights 0.05, 0.1, and 0.15; the input image and input mask are shown; images omitted.]
We also tried cross-object editing: removing or adding a leg while simultaneously transforming the dog into a cat. The results are reasonable, though not perfect.
[Results: cross-object (dog-to-cat) editing combined with removing or adding a leg; the input image and the two edited outputs are shown; images omitted.]
We successfully achieve some basic structural changes with our second and third approaches, but limitations remain: the second approach modifies unintended regions of the image, and the third approach is constrained by the text prompt generated by BLIP. In addition, since we do not fine-tune the pre-trained diffusion models, our approaches inherit their biases. We leave more flexible approaches and fine-tuning-based approaches for future work.