1. Introduction

Image Synthesis course “Audience Choice Award” 🥇
Check out our code on GitHub!

Team:

  1. Eileen Li (chenyil)
  2. Simon Seo (myunggus)
  3. Yu-Hsuan Yeh (yuhsuany)

A pipeline using pretrained foundation models that output a visual narrative with 3-5 images and a text prompt given by the user. sample output gif of pipeline

2. Abstract

as a child, have you ever wanted to be the main character of your favorite story?
Recent advancements in generative models and LLMs, such as Stable Diffusion and ChatGPT, have ushered in a new wave of excitement for AI. We want to leverage this technology to enable applications that, up until recently, might have seemed like magic. We will focus on creating illustrations for a chosen story, where the main character(s) of the plot can be customized to the user’s liking. By using user input to inform the story’s title and characters, our system allows people to see themselves in the stories they read and to explore their own imagination. Our goal is to obtain the best results possible, which means leveraging pretrained foundation models as much as we can and exploring SOTA image editing methods for customization. The result will be a cohesive, compelling illustrated story where anyone can be the protagonist.

3. Our Goals

Here are the goals of our project:

1) Generate the complete story from just a title. 2) Substitute one character in the story with a character of our choosing, such as replacing the wolf with our teammate’s dog, ‘Aspen’. 3) Ensure a consistent style across all scenes while achieving high-quality visualization.

4. TuT Architecture

diagram of pipeline

We’ve created an end-to-end pipeline that generates custom illustrated storybooks. You provide a title and some character customization images to start with, and it produces the plot along with beautiful illustrations to tell a coherent story.

There are three key modules:

  1. Story generation with LLM
  2. Train text2image model on custom images for chosen character
  3. Story illustration from generated prompts and custom text2image model

5. Method

5.1. Story Generation with LLM

example of generated story caption and text2image prompt

The first module creates the text of the story. We use OpenAI’s Chat Completion API (somewhat similar to ChatGPT) to generate a story from the title of a well-known story or a short description of a custom story. Along with the story, we also generate text2image prompts at this stage. We found that generating text2image prompts separately works better than using the story plot directly for image generation. We use the GPT 3.5 Turbo model which is available at no monetary cost and provides a simple web API.

We first condition the chat completion model as a children’s book storyteller. This is what we tell the system even before we ask it to generate stories:

self.messages = [{
    "role": "system",
    "content": (
        "You are a storyteller who tells children's stories. "
        "You don't use pronouns. You are good at counting numbers. "
        "You will focus on getting the number of sentences correct. "
        "Be descriptive about appearance of character and background. "
        "Answer the following question."
    )
}]
completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=self.messages
)
chat_response = completion.choices[0].message.content

Note that on top of conditioning the model on its role, we encourage it to be accurate at counting the number of sentences and to be dscriptive about the physical appearance of the scene. More on this later.

Next, we output sentences that can be used as prompts in the text2image generation process.

prompt = (
    f"Write a book of the story {title} with {custom_key} as the main character '{character.orig_name}' "
    f"with {self.num_sentences} pages, one sentence per page, and one illustration per page. "
    f"Provide a prompt for a speech-to-image system in square brackets to generate each illustration. "
    f"Show me a numbered list of exactly {self.num_sentences} sentences. "
    "As an example, for the story of Jack and the Beanstalk, you would generate: \n "
    "1. On a small farm, Jack lived with his mother in a small cottage [Generate an image of the cottage]. \n"
    "2. They were poor and had no food to eat, so his mother asked Jack to sell their cow [Generate an image of Jack leading the cow to a market]. \n"
    "3. On the way to the market, Jack met a strange old man who gave him magic beans in exchange for the cow [Generate an image of the old man and Jack exchanging the cow and magic beans]. \n"
)

We made sure to prioritize the structure of the language model’s output, and even provided a specific template for it to follow. This was important in order to maintain consistency and standardization throughout the pipeline, since we needed to extract the captions and text2image prompts separately for the next steps.

5.1.1. Examples of generated stories

Little Red Riding Hood

1. In a small house in the forest, Little Red lived with her grandmother. [Generate an image of the small house in the forest]
2. Little Red was a brave girl who loved to explore the woods. [Generate an image of Little Red exploring the woods]
3. One day, Little Red's mother asked her to take a basket of treats to her sick grandmother. [Generate an image of Little Red holding the basket of treats]
4. Little Red promised her mother she would not stray from the path through the woods. [Generate an image of Little Red on the path through the woods]
5. As Little Red walked through the woods, she saw a big, scary wolf. [Generate an image of the wolf lurking in the forest]
6. The wolf asked Little Red where she was going, and she foolishly told him. [Generate an image of the wolf and Little Red talking]
7. The wolf ran ahead of Little Red to her grandmother's house, where he gobbled up her grandmother. [Generate an image of the wolf eating the grandmother]
8. When Little Red arrived, the wolf disguised himself as her grandmother and tried to eat her too. [Generate an image of the wolf disguised as the grandmother]
9. Little Red outsmarted the wolf by tricking him into the closet and locking him inside. [Generate an image of Little Red locking the wolf in the closet]
10. Little Red and her grandmother lived happily ever after, knowing to be cautious and never trust a stranger. [Generate an image of Little Red and her grandmother hugging]

Jack and the Beanstalk

1. Jack was a young boy with curly brown hair who lived in a small cottage. [Generate an image of Jack outside his cottage]
2. He had a mother who worked hard every day to provide for the two of them [Generate an image of Jack's mother working in the garden]
3. One day, Jack's mother asked him to sell their cow so they could have some money to buy food. [Generate an image of Jack leading the cow to the market]
4. On the way to the market, Jack met an old man who offered to trade him magic beans for the cow. [Generate an image of the old man and Jack shaking on the deal]
5. When Jack's mother found out he traded the cow for beans, she threw them out the window in anger [Generate an image of the beans flying out the window]
6. The next morning, Jack found a giant beanstalk had grown up to the sky [Generate an image of the beanstalk towering above the cottage]
7. Curiosity getting the better of him, Jack decided to climb up the beanstalk and see where it led [Generate an image of Jack starting to climb the beanstalk]
8. At the top of the beanstalk, Jack discovered a castle owned by a giant [Generate an image of Jack peeking into the castle]
9. The giant's wife helped Jack hide from her husband and even gave him food to eat [Generate an image of the wife sneaking food to Jack]
10. Before leaving the castle, Jack stole a harp made of gold [Generate an image of Jack grabbing the harp]
11. The giant chased Jack down the beanstalk but Jack managed to cut it down and defeat him [Generate an image of Jack cutting the beanstalk and the giant falling]
12. Jack and his mother were left with the harp and a happy ending [Generate an image of Jack and his mother celebrating with the harp]

Avatar: The Last Airbender

1. In a world of four nations, Aang is the last Airbender, living a peaceful life in the Southern Air Temple. [Generate an image of the Southern Air Temple]
2. One day, Aang falls into the ocean and is frozen for 100 years. [Generate an image of Aang being frozen in ice]
3. When he awakens, he meets two new friends, Katara and Sokka. [Generate an image of Aang, Katara, and Sokka meeting for the first time]
4. Aang discovers that the Fire Nation wants to conquer the world, and he is the only one who can stop them. [Generate an image of the Fire Nation soldiers]
5. Aang travels to the Northern Water Tribe to learn waterbending from Master Pakku. [Generate an image of the Northern Water Tribe]
6. Along his journey, Aang meets more friends, such as Toph and Zuko. [Generate an image of Aang, Toph, and Zuko together]
7. Aang defeats Fire Lord Ozai, ending the Hundred Year War. [Generate an image of Aang in the Avatar State fighting Fire Lord Ozai]
8. He establishes peace throughout the world and works towards rebuilding the Air Nation. [Generate an image of Aang with the Air Nomads]
9. Aang falls in love with Katara and they start a family. [Generate an image of Aang and Katara with their children]
10. Aang is remembered as a hero and is greatly respected by future generations. [Generate an image of Aang's statue in honor of his legacy]

5.2. Train Text2Image with Custom Images

5.2.1. Dataset

For customization, we prepare seed datasets with each of the authors and Aspen (the dog). We find that optimal results are achieved with 5-10 selfies taken from different angles and light settings.

5.2.2. Comparing Dreambooth and Textual Inversion

In our project, we have tried two different approaches for generating images, namely Dreambooth [5] and Textual Inversion [2]. Dreambooth yielded better overall results, but at the cost of increased computational requirements. In contrast, Textual Inversion is much more lightweight and requires less computational resources.

5.2.3. Dreambooth quality enhancement

Additionally, to further enhance the quality of the generated images produced by Dreambooth, we finetuned the text encoder and preserved priors by providing class images (ex. boy vs. Simon boy). Adding both these training tricked resulted in higher-quality images that are more visually appealing and more closely aligned with the intended style of our project.

5.2.4. Effect of speed up techniques

To optimize the performance of our approach, we have utilized several speed-up techniques, including mixed precision, 8-bit Adam, and LoRA. These methods have enabled us to achieve faster image generation times and to process a bunch of data in a shoter amount of time.

5.3. Story Illustration with Custom Text2Image

5.3.1. The comparison between different text2image models

We experimented with several different text2image models: Stable Diffusion (v1.5 and v2), Midjourney V5 (through paid API), and OpenJourney (public model trained on Midjourney images).

Midjourney V5 has a clear advantage over the other models and produces the most consistently beautiful and relevant images. Both Stable Diffusion and OpenJourney generate unwanted artifacts for more complex scenes and require many attempts for the handpicked outputs you see below:

However, we cannot gain access to the Midjourney model for customization and use Stable Diffusion [4] and OpenJourney for our project.

5.3.2. Handling page consistency

We want to ensure that the overall style of images in the storybook stays consistent. This could be approached in three ways (from easy to difficult).

  1. tell our text2image model to generate in a particular style, forcing consistency across pages
  2. generate each image and provide the first generated image or a user provided image as a reference style image, or
  3. train a style consistency loss over all of the images.

We were able to make the first idea work and created styles using prompt engineering of the text2image model. We append these as suffixes to our prompt.

{
    "Cyberpunk": "as concept art illustration cyberpunk vibe, urban futurism, neon lights, night, bladerunner, realistic, extreme realism, hd render, professional quality, illustration, trending on artstation, complex details, refined, uhd, highly detailed, intricate, vivid colors",
    "Digital Art": "as digital art illustration professional quality, illustration, trending on artstation, complex details, refined, uhd, highly detailed, intricate, vivid colors",
    "Disney 3D": "as disney 3d film still 3d render, smooth render, professional quality, cinematic, octane 3d render, blender render, global illumination, ambient occlusion, ray tracing, materiality, cgi, uhd, highly detailed, intricate, vivid colors",
    "Disney Cartoon": "as disney cartoon animation, whimsical, by Walt Disney, cel-shaded, professional illustration, magical, dynamic, colorful, uhd, highly detailed, intricate, vivid colors",
    "Edward Gorey": "as pen and ink illustration by Edward Gorey, spooky, creepy, intricate, professional quality, illustration, trending on artstation, complex details, refined, uhd, highly detailed, intricate, vivid colors",
    "Fauvism": "as oil painting by Fauvism, fauvist painting, by Henri Matisse, masterpiece, masterwork, uhd, highly detailed, intricate, vivid colors",
    "Impressionism": "as oil painting by Impressionism, by Claude Monet, by Pierre-Auguste Renoir, by Mary Cassatt, masterpiece, masterwork, uhd, highly detailed, intricate, vivid colors",
    "Isometric 3D": "as isometric 3d render geometric, clean, colorful, busy, professional quality, cinematic, octane 3d render, blender render, global illumination, ambient occlusion, ray tracing, materiality, cgi, uhd, highly detailed, intricate, vivid colors",
    "Jackson Pollock": "as drip painting chaotic, exciting, splatter, gestural, abstract, by Jackson Pollock, masterpiece, masterwork, uhd, highly detailed, intricate, vivid colors",
    "Looney Tunes": "as cartoon illustration cartoon, animation, looney tunes, whimsical, by Chuck Jones, cel-shaded, professional illustration, magical, dynamic, colorful, uhd, highly detailed, intricate, vivid colors",
    "M.C. Escher": "as pencil drawing black and white, by M.C. Escher, extreme detail, intricate, subtle shading, intellectual, impossible geometry, professional quality, illustration, trending on artstation, complex details, refined, uhd, highly detailed, intricate, vivid colors",
    "Masterpiece": "as realistic painting classic, realism, realistic, intricate, masterpiece, masterwork, uhd, highly detailed, intricate, vivid colors",
    "Naive Art": "as oil painting by Henri Rousseau, naive art, high saturation, bold colors, distinct shapes, masterpiece, masterwork, uhd, highly detailed, intricate, vivid colors",
    "Nature Field Guide": "as field guide illustration academic, by John James Audubon, clear, distinct, masterpiece, masterwork, uhd, highly detailed, intricate, vivid colors",
    "New Yorker Cartoon": "as new yorker cartoon drawing, black and white, humorous, funny, clever",
    "Origami": "as origami intricate, detailed, crisp, professional quality, cinematic, studio lighting, rim lighting, super resolution, volumetric lighting, depth of field, 4k, uhd, highly detailed, intricate, vivid colors",
    "Pen and Ink": "as pen and ink drawing intricate, extreme detail, black and white, professional quality, illustration, trending on artstation, complex details, refined, uhd, highly detailed, intricate, vivid colors",
    "Pixar": "as pixar film still 3d render, smooth render, professional quality, cinematic, octane 3d render, blender render, global illumination, ambient occlusion, ray tracing, materiality, cgi, uhd, highly detailed, intricate, vivid colors",
    "Pop Art": "as silkscreen print Pop Art, by Andy Warhol, colorful, high saturation, masterpiece, masterwork, uhd, highly detailed, intricate, vivid colors",
    "Pro Photo": "as photograh arri-taken photo, professional lighting, depth of field, lighting gallery, unsplash, nikon-taken photo, canon-taken photo, professional quality, cinematic, studio lighting, rim lighting, super resolution, volumetric lighting, depth of field, 4k, uhd, highly detailed, intricate, vivid colors",
    "Renaissance": "as renaissance painting realism, realistic painting, Renaissance, refined brushwork, masterpiece, masterwork, uhd, highly detailed, intricate, vivid colors",
    "Soap Opera": "as image soap opera style, in the style of a telenovela, hazy light, 1980s tv look, glow filter, video, vignette, melodramatic, soft lighting, evenly lit, ethereal glow, romantic, emotional, cinematic",
    "Street Art": "as graffiti painting Street Art, by Banksy, by Keith Haring, by Jean-Michel Basquiat, masterpiece, masterwork, uhd, highly detailed, intricate, vivid colors",
    "Studio Ghibli": "as anime illustration wondrous, expressive, by Hayao Miyazaki, by Studio Ghibli, cel-shaded, professional illustration, magical, dynamic, colorful, uhd, highly detailed, intricate, vivid colors",
} 

6. Results

6.1. Textual Inversion vs. Dreambooth

Below we show a direct comparison between Textual Inversion and Dreambooth, with all else being the same except for the customization method.

6.1.1. Textual Inversion

[“Jack and the Beanstalk”, Simon as “Jack”, style=”Digital Art”]

6.1.2. Dreambooth

[“Jack and the Beanstalk”, Simon as “Jack”, style=”Digital Art”]

6.2. Comparing styles

Below we compare several different styles, for “Jack and the Beanstalk” and Simon as “Jack” using Dreambooth.

6.2.1. Fauvism

6.2.2. Edward Gorey

6.2.3. Pop Art!

6.2.4. Pen and Ink

6.2.5. Studio Ghibli

7. Limitations

We originally envisioned a web app where users can run TuT in real time to get customized results. However, the computational requirements make that really difficult as each model takes ~30 minutes for training and inference.

Moreover, the results do not look as great as we would’ve hoped running end-to-end. We can obtain better results if, for each illustration page, we generate many outputs and pick the best one.

8. Further Explorations

8.1. Going beyond existing stories

We tried out using ChatGPT for novel stories and it seems to work pretty well. We share our results for the following prompt:

You are a storyteller. 
Generate a adventure story about a protagonist Aspen the dog that takes place at H-Mart. 
There should be conflict and a twist in the story, as well as a resolution at the end. 
The story should be exactly ten sentences. 
Show me the sentences as a numbered list. 
For each sentence, generate a prompt for a text-to-image model.

8.1.1. Story

Aspen the dog loved going on adventures with his owner, and today they were headed to H-Mart for some shopping.
Prompt: Aspen and his owner walking towards H-Mart with a shopping cart.

Once inside, Aspen's nose went wild with all the delicious scents, and he led his owner down each aisle.
Prompt: Aspen, standing on his hind legs, sniffing the air in front of the shelves stocked with food and treats.

Suddenly, Aspen caught a whiff of something strange and started barking furiously at one of the aisles.
Prompt: Aspen, barking at an aisle in H-Mart, with his owner trying to calm him down.

His owner thought Aspen was just being silly, but Aspen knew something was wrong.
Prompt: Aspen, looking determined and focused, while his owner looks skeptical.

Aspen's instincts were proven right when they stumbled upon a group of thieves trying to rob the store.
Prompt: Aspen, standing in front of a group of thieves, growling and barking, while his owner stands behind him in shock.

The thieves were startled by Aspen's sudden appearance and fled the store, but not before grabbing Aspen and running away with him.
Prompt: Aspen, being grabbed by the thieves and taken out of the store, with his owner running after them in desperation.

Aspen was scared and alone, but he refused to give up hope. He chewed through the ropes that bound him and escaped from his captors.
Prompt: Aspen, chewing through the ropes that bound him, while the thieves are distracted.

Aspen ran as fast as he could until he found his way back to H-Mart, where he reunited with his owner.
Prompt: Aspen and his owner, hugging each other tightly in front of H-Mart, with shoppers looking on.

Aspen's bravery had saved the day, and the grateful store manager rewarded him with a lifetime supply of dog treats.
Prompt: Aspen, surrounded by bags of dog treats, with his owner and the store manager looking on with gratitude.

Aspen and his owner left H-Mart, proud of their adventure and grateful for the loyal and courageous dog by their side.
Prompt: Aspen and his owner walking away from H-Mart, with Aspen carrying a big bag of dog treats and his owner looking at him with pride.

8.1.2. TuT illustrations

8.2. Multiple characters

While we did not have time for this project, we hope to try out multiple character customization in the future. This can be done naively by training multiple tokens for Textual Inversion or finetuning sequentially on Dreambooth. However, these naive methods are reported to not work very well and perhaps something like [3] could be explored.

9. References