Tiancheng Zhao (andrewid: tianchen), Chia-Chun Hsieh (andrewid: chiachun)
In this project, we explore the use of multi-modal instructions to edit images.
Object-deforming, structural-change style image editing is under-explored in the current diffusion-based image-editing literature (see Fig. 1).
We hypothesize that multi-modal instructions are a natural way to address this, because
visual inputs such as strokes and masks can specify where and how the structure should change much more precisely than text alone.
We experimented with three different approaches: (1) text-only instruction editing with Instruct-Pix2Pix, (2) stroke-based editing with a latent diffusion model (in the spirit of SDEdit), and (3) mask-guided cross-attention editing built on pix2pix-zero.
To verify that text-only editing instructions perform poorly for our purposes (structural changes),
we first explored different input images, prompts, numbers of diffusion steps, and guidance weights with Instruct-Pix2Pix.
Indeed, we observe that this method performs well for attribute changes, but produces many artifacts
across all of our samples when we ask the model to make structural changes. This can be attributed
to the training process of Instruct-Pix2Pix, which uses Prompt-to-Prompt to generate its training data,
and Prompt-to-Prompt itself struggles with structural edits.
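For reference, a minimal sketch of how such an Instruct-Pix2Pix run can be reproduced with the publicly available diffusers pipeline is shown below; the checkpoint name, image path, and prompt are illustrative assumptions rather than our exact setup.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

# Public Instruct-Pix2Pix checkpoint (assumed checkpoint name).
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("chihuahua.jpg").convert("RGB")  # hypothetical input image

# image_guidance_scale / guidance_scale are the image and text guidance weights
# discussed below; num_inference_steps is the step count we sweep.
edited = pipe(
    prompt="Make the chihuahua red",
    image=image,
    num_inference_steps=30,
    image_guidance_scale=1.5,
    guidance_scale=15,
).images[0]
edited.save("edited.png")
```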
For this experiment, we fixed the image guidance weight to 1.5 and text guidance weight to 15.
[Results: attribute edits with Instruct-Pix2Pix at 10, 30, and 50 inference steps for the prompts "Make the dog Brown", "Make the background snowy", and "Make the chihuahua red"; images omitted.]
For this experiment, we fixed the image guidance weight to 1.5, and number of inference steps to 30.
[Results: attribute edits with Instruct-Pix2Pix at text guidance weights 1.5, 15, and 30 for the same three prompts; images omitted.]
For this experiment, we fixed the image guidance weight to 1.5 and text guidance weight to 15.
[Results: structural edits with Instruct-Pix2Pix at 10, 30, and 50 inference steps for the prompts "Make the Samoyed jump", "Make the Samoyed lift its left leg", and "Give the chihuahua wings"; images omitted.]
For this experiment, we fixed the image guidance weight to 1.5, and number of inference steps to 30.
[Results: structural edits with Instruct-Pix2Pix at text guidance weights 1.5, 15, and 30 for the same three structural prompts; images omitted.]
For our second approach, stroke-based editing with a latent diffusion model, we again conducted experiments with different input images, prompts, numbers of diffusion steps, and guidance weights. We observe that while this approach produces better structural edits than Instruct-Pix2Pix, we were unable to isolate the editing to only the object of interest.
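Our exact pipeline is not listed here, but the idea can be sketched with diffusers' image-to-image pipeline: paint coarse strokes onto the input where the structure should change, partially noise the stroked image, and denoise it conditioned on the text prompt. The checkpoint name, file paths, and the strength value below are assumptions.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Stable Diffusion v1.5 as a stand-in for the latent diffusion model we used.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The input image has already been hand-edited with coarse strokes that
# roughly indicate the desired structural change (hypothetical file name).
stroked = Image.open("samoyed_with_strokes.png").convert("RGB")

generator = torch.Generator("cuda").manual_seed(1)  # the seed matters a lot in our experiments

result = pipe(
    prompt="A samoyed lifts its leg",
    image=stroked,
    strength=0.6,             # how much noise is added before denoising (SDEdit-style trade-off)
    num_inference_steps=30,   # the step counts swept below
    guidance_scale=15,        # text guidance weight
    generator=generator,
).images[0]
result.save("stroke_edited.png")
```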
For this experiment, we fixed the text guidance weight to 15.
We notice that the number of inference steps and the random seed have a large influence on the quality of the generated images. Moreover, the best number of inference steps is specific to each image and text prompt (35 for the jumping Samoyed and 30 for the Samoyed lifting its leg).
[Results: stroke-based editing at 30, 35, and 40 inference steps, with two random seeds, for the prompts "A jumping samoyed" and "A samoyed lifts its leg"; images omitted.]
For this experiment, we fixed the number of inference steps to 35 for the jumping samoyed and 30 for the samoyed lifting its leg.
We notice that the text guidance weight has little influence on the quality of the generated images.
[Results: stroke-based editing at text guidance weights 5, 10, and 15, with two random seeds, for the same two prompts; images omitted.]
This method can also be applied to other diffusion models, such as Versatile Diffusion, another publicly available LDM-based diffusion model. The results are similar, so we do not include them here.
We also observe that the artifacts introduced by the stroke-based editing cannot be removed when the number of inference steps is less than 30. We believe the nature of the LDM structure (diffusion in the latent space rather than in pixel space) causes this phenomenon. As in SDEdit, there is a trade-off: if the number of inference steps is too large, the generated image drifts far from the input, and if it is too small, the artifacts from the strokes remain. Unfortunately, we could not validate this hypothesis due to the lack of publicly available pixel-space text-to-image diffusion models.
Pix2pix-zero adopts BLIP to generate a text prompt for the input image. We then use CLIP to find the corresponding cross-attention map by selecting the caption word with the highest similarity to the target object to be modified. We apply a mask to constrain the region to be modified and resize it to the resolution of each level of the cross-attention maps.
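A minimal sketch of the word-matching and mask-resizing step, using the Hugging Face CLIP text encoder; the model name, the example caption, and the helper functions are assumptions rather than the exact code we ran.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

def best_matching_word(caption: str, target: str) -> str:
    """Return the caption word whose CLIP text embedding is most similar to the target object."""
    words = caption.split()
    inputs = tokenizer(words + [target], padding=True, return_tensors="pt")
    with torch.no_grad():
        embeds = F.normalize(text_model(**inputs).text_embeds, dim=-1)  # (len(words)+1, dim)
    sims = embeds[:-1] @ embeds[-1]                                     # cosine similarity to target
    return words[sims.argmax().item()]

def resize_mask(mask: torch.Tensor, res: int) -> torch.Tensor:
    """Downsample a binary H x W mask to the res x res grid of one cross-attention level."""
    return F.interpolate(mask[None, None].float(), size=(res, res), mode="nearest")[0, 0]

caption = "a white dog standing on top of a lush green field"  # BLIP caption of our example image
print(best_matching_word(caption, "dog"))                       # -> "dog"
mask16 = resize_mask(torch.zeros(512, 512), 16)                 # e.g. the 16x16 attention level
```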
Since pix2pix-zero adopts DDIM to add noise and then denoise, it preserves the structure of the image and the identity of the object better than SDEdit, which adopts DDPM to add noise and DDIM to denoise.
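For intuition (standard DDPM/DDIM equations, added here for clarity rather than taken from the original write-up): DDPM noising draws fresh Gaussian noise, while DDIM inversion is deterministic, so the reverse DDIM trajectory can retrace it and land back near the input.

$$x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\quad \epsilon \sim \mathcal{N}(0, I) \qquad \text{(DDPM forward, stochastic)}$$

$$x_{t+1} = \sqrt{\bar\alpha_{t+1}}\,\frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}} + \sqrt{1-\bar\alpha_{t+1}}\,\epsilon_\theta(x_t, t) \qquad \text{(DDIM inversion, deterministic)}$$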
To erase parts of the target object, we can zero out (set to 0) the corresponding region of the object's cross-attention maps.
However, naively masking out the region can lead to a pattern mismatch between the masked region and the rest of the image.
To achieve better quality, we can also specify which object should fill in the masked-out region by setting the value of that object's cross-attention maps in the region to 1. We display the text prompt generated by BLIP so that we can precisely specify the object used for the fill.
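A minimal sketch of this cross-attention edit, assuming the maps at one UNet level are available as a tensor of shape (heads, H*W, num_tokens) and that the relevant token indices were already located with the CLIP matching step above; all names, shapes, and indices are hypothetical.

```python
import torch

def edit_cross_attention(attn: torch.Tensor, mask: torch.Tensor,
                         zero_token_idx: int, one_token_idx: int) -> torch.Tensor:
    """Inside the masked region, zero one token's attention and set another token's to 1.

    attn: (heads, H*W, num_tokens) cross-attention maps at one UNet level.
    mask: (H, W) binary mask, already resized to this level's resolution.
    """
    attn = attn.clone()
    region = mask.flatten().bool()          # spatial positions to edit
    attn[:, region, zero_token_idx] = 0.0   # erase: remove this token's influence in the region
    attn[:, region, one_token_idx] = 1.0    # fill: force e.g. the "field" token to take over
    return attn

# Hypothetical usage for erasing part of the dog at the 16x16 attention level:
attn16 = torch.rand(8, 16 * 16, 77)                       # 8 heads, 77 text tokens
leg_mask16 = torch.zeros(16, 16)
leg_mask16[10:, 4:8] = 1                                   # region covering the part to remove
dog_token_idx, field_token_idx = 3, 9                      # indices found via the CLIP matching step
erased16 = edit_cross_attention(attn16, leg_mask16, dog_token_idx, field_token_idx)
```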
Our observations under different inference steps and cross-attention guidance weights are summarized below:
[Results: part removal at inference steps 30, 40, and 50 and cross-attention guidance weights 0.05, 0.1, and 0.15, comparing the naive erase with the better fill-in variant; the input image, input mask, and BLIP prompt "a white dog standing on top of a lush green field" are shown; images omitted.]
To add parts to the target object, we should first zero out (set to 0) the original object's cross-attention maps in that region, since the original object has high activation there; without masking out the original object, the image will hardly change.
Then, we set the value of the target object's cross-attention maps in that region to 1. Again, we display the text prompt generated by BLIP so that we can precisely specify the original object to mask out.
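The add case reuses the same edit with the token roles swapped; the snippet below calls the edit_cross_attention sketch from the erase section (the token indices are again hypothetical).

```python
# Suppress the original object ("dog") and boost the new part's token ("leg")
# inside the masked region, using edit_cross_attention from the sketch above.
added16 = edit_cross_attention(attn16, leg_mask16,
                               zero_token_idx=dog_token_idx,   # the original object dominating the region
                               one_token_idx=leg_token_idx)    # hypothetical index of a "leg" token
```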
The results are shown below:
[Results: part addition at inference steps 30, 40, and 50 and cross-attention guidance weights 0.05, 0.1, and 0.15; the input image and input mask are shown; images omitted.]
However, not all masks work equally well. As shown below, our method fails to add a leg horizontal to the ground. The underlying reason is that the verb "standing" in the text prompt constrains the editing process: the dog must be standing on top of a lush green field.
We have similar observations to the previous part, except that increasing the cross-attention guidance weight and the number of inference steps impairs the editing.
[Results: failed attempt to add a leg horizontal to the ground, at inference steps 30, 40, and 50 and cross-attention guidance weights 0.05, 0.1, and 0.15; the input image and input mask are shown; images omitted.]
We also tried cross-object editing: removing or adding a leg while simultaneously transforming the dog into a cat. The results are reasonable, though not perfect.
[Results: cross-object (dog-to-cat) editing combined with removing or adding a leg; the input image and the two edited outputs are shown; images omitted.]
We successfully achieve some basic structural changes with our second and third approaches, but limitations remain: the second approach modifies unintended regions of the image, and the third approach is constrained by the text prompt generated by BLIP. In addition, since we do not fine-tune the pre-trained diffusion models, our approaches inherit their biases. We leave more flexible approaches and fine-tuning-based approaches for future work.