Text-to-Image Generation: Unleashing the Power of DALL-E 2 and DALL-E mini
In a groundbreaking paper published on April 13, 2022, titled "Hierarchical Text-Conditional Image Generation with CLIP Latents" [1], Ramesh, Aditya, et al. presented compelling evidence that contrastive models such as CLIP learn robust image representations that capture both style and semantics. The researchers proposed a two-stage model comprising a prior, which generates a CLIP image embedding from a given text caption, and a decoder, which uses that image embedding to generate the corresponding image. Central to its functionality is the concept of diffusion, which enables the generation of high-quality images by iteratively refining a noisy output into a coherent image. This innovative AI system can create authentic, realistic visuals and artwork from natural language descriptions, and it excels at blending concepts, attributes, and styles to produce compelling outputs. The advancements made in DALL-E 2 and DALL-E mini are significant developments in the field of text-to-image generation. After reading this article you should know the following:
- What is DALL·E 2
- How DALL·E 2 works
- Python implementation of DALL·E 2
- What is DALL-E mini?
- Python implementation of DALL-E mini
So what is DALL·E 2, you may ask?
DALL-E 2 is the updated version of DALL-E, a generative model that takes a sentence and generates an original image in response. DALL-E 2 is a huge model with 3.5B parameters, yet interestingly it is smaller than GPT-3 and not as massive as DALL-E (12B). Despite its smaller size, DALL-E 2 produces images with 4x the resolution of DALL-E, and human judges favor it more than 70% of the time for caption matching and photorealism (Figure 1).
How does DALL·E 2 work?
The CLIP model takes image-caption pairs and creates "mental" representations in the form of vectors, called text/image embeddings (Figure 2, above the dotted line). The prior model then takes a caption (its CLIP text embedding) and generates a CLIP image embedding, and finally the diffusion decoder (unCLIP) takes that CLIP image embedding and generates an image.
DALL-E 2 is a specific instance of this two-part model, a prior followed by a decoder (Figure 2, below the dotted line). By chaining the two models, we can convert a sentence into an image.
It is interesting to note that the decoder is called unCLIP, since it performs the opposite procedure of the original CLIP model: it creates an original picture from a general mental representation, rather than producing a representation from an image.
To make it easier, imagine that you want to draw something. What do you do? You first imagine its image in your brain, and only then do you start drawing its likeness. This is essentially what the model does. The mental representation encodes the main semantically meaningful characteristics: people, animals, objects, styles, colors, and so on. DALL-E 2 can build a new image that preserves these characteristics while varying the non-essential details.
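To make the CLIP stage concrete, here is a minimal sketch of embedding a caption and an image into the same vector space. It assumes the Hugging Face transformers library and the public openai/clip-vit-base-patch32 checkpoint, which are not part of DALL-E 2 itself (the paper uses its own, larger CLIP model):
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# public CLIP checkpoint, used here only for illustration
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

caption = "Lightning hits a river, electric, dark cloudy sky"
image = Image.open("example.png")  # any local image, just for illustration

inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = clip(**inputs)

text_embedding = outputs.text_embeds    # the "mental" representation of the caption
image_embedding = outputs.image_embeds  # the "mental" representation of the image

# The prior maps a text embedding to an image embedding; the unCLIP decoder
# then turns that image embedding back into pixels.
print(torch.cosine_similarity(text_embedding, image_embedding))
The closer the two embeddings are, the better the image matches the caption; the prior and the decoder in Figure 2 operate entirely in this shared embedding space.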
Python implementation of DALL-E 2
First, let's install OpenAI's library:
pip install openai
Import all the needed libraries:
import urllib.request
import base64
import IPython.display
import openai
Then let’s get started with our prompt! What do we want to generate?
prompt = "Lightning hits a river, electric, dark cloudy sky"
Obtain your API key from your OpenAI account and use it as follows:
openai.api_key = "<YOUR_KEY>"
response = openai.Image.create(
    prompt=prompt,
    n=1,
    size="1024x1024"
)
image_url = response['data'][0]['url']
image_url
The OpenAI API will return an image URL; let's display that image:
def insert_image(url, caption):
    # download the image
    image = urllib.request.urlopen(url).read()
    # encode the image as base64 so it can be embedded inline
    image = base64.b64encode(image).decode('utf-8')
    # build the HTML <img> tag
    html = f"""<img src="data:image/png;base64,{image}" alt="{caption}" title="{caption}" />"""
    # display the image
    IPython.display.display(IPython.display.HTML(html))
    # display the caption
    IPython.display.display(IPython.display.Markdown(f"""# {caption}"""))
insert_image(image_url, prompt)
Sample DALL-E 2 outputs for prompt: “Lightning hits a river, electric, dark cloudy sky”
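If you want several candidates for the same prompt, the same Image.create call accepts n greater than 1 and returns one URL per image. A short sketch, reusing the insert_image helper defined above:
response = openai.Image.create(
    prompt=prompt,
    n=4,
    size="512x512"
)

# display every returned candidate with its own caption
for i, item in enumerate(response['data']):
    insert_image(item['url'], f"{prompt} (variant {i + 1})")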
What is DALL-E mini?
DALL-E mini, initially conceived by Boris Dayma, a computer scientist based in Texas, emerged as a submission for a coding contest. The application draws inspiration from the formidable DALL-E developed by the artificial intelligence startup OpenAI, hence the shared name (Figure 4). DALL-E mini, also known as Craiyon, is a web application that offers a more user-friendly approach while employing similar underlying technology. Dayma's openly accessible model is available to anyone online, and he developed it in collaboration with AI research communities on Twitter and GitHub.
Python implementation of DALL-E mini
Let's start by installing the min-dalle library:
! pip install min-dalle -q
Import MinDalle:
from min_dalle import MinDalle
Define your model, and your prompt:
model = MinDalle(is_mega=True, is_reusable=True)
prompt = "Dogs playing"
seed = 6
grid_size = 2
display(model.generate_image(prompt, seed, grid_size))
Sample Craiyon outputs for prompt: “Dogs Playing”
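If you want to keep the result, generate_image returns the grid as a PIL image in the min-dalle package (a grid_size x grid_size mosaic), so you can also save it to disk. A small sketch, reusing the model and prompt defined above:
# assumes generate_image returns a PIL image, as in the min-dalle package
image = model.generate_image(prompt, seed, grid_size)
image.save("dogs_playing.png")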