Image Generation with DALL·E 3
What is DALL-E 3?
DALL-E 3 is a closed-source text-to-image generation system developed by OpenAI. Unlike its predecessors, it adds an intermediary layer that refines the user's input prompt, which makes it easier and more convenient for users to jump in and get usable images.
In the following sections, I will show some images generated by DALL-E 3, including the exact prompts used to generate them.
Getting Started
All the images shown were generated using a standard ChatGPT Plus subscription.
- After the initial generation, images are upscaled using the open-source tool Upscayl.
- Finally, all images are converted to `.jpg` and downsized slightly before being added to this blog post.
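The final conversion step can be sketched with Pillow. The function name, scale factor, and JPEG quality below are illustrative choices, not the exact settings used for this post:

```python
from PIL import Image

def convert_and_downsize(src_path, dst_path, scale=0.5, quality=90):
    """Convert an image to JPEG and downsize it by `scale`."""
    img = Image.open(src_path).convert("RGB")  # JPEG has no alpha channel
    new_size = (int(img.width * scale), int(img.height * scale))
    img = img.resize(new_size, Image.LANCZOS)
    img.save(dst_path, "JPEG", quality=quality)
    return new_size
```

Downsizing after upscaling may seem redundant, but the upscaled originals are far larger than what a blog page needs, so this keeps page load times reasonable.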
Visual Depiction of Great Power Politics
Recently I have been listening to talks and podcasts with John Mearsheimer, and one recurring topic is the rivalry between the United States and China. So I thought it would be interesting to "visualise" what great power politics looks like.
Anime drawing capturing the essence of great power conflicts between China and the USA as depicted by John Mearsheimer: On the left, a rising dragon, symbolizing China, ascends with economic and military might. In the center, a majestic eagle, representing the USA, spreads its wings, asserting its dominance and influence. The two creatures circle each other, signifying the intense rivalry and strategic competition. On the right, a chessboard with pieces shaped like global landmarks hints at the geopolitical maneuvers and the stakes of this power struggle.
Artwork inspired by ‘Wanderer Above the Sea of Fog’ summarizing the great power dynamics between China and the USA based on John Mearsheimer’s perspective: A lone observer stands on a cliff’s edge, gazing into the distance. To the left, the silhouette of a dragon emerges from the mist, representing China’s ambitions. To the right, an eagle soars above the clouds, reflecting the USA’s influential role. The vast sea of fog below embodies the challenges and unknowns of their strategic competition.
Painting in the style of ‘Wanderer Above the Sea of Fog’ capturing the essence of great power conflicts between China and the USA as depicted by John Mearsheimer: In the foreground, a solitary figure stands atop a rugged peak, looking out over a vast, fog-covered landscape. On one side, mountainous terrains and ancient structures represent China’s rich history and rising power. On the opposite side, illuminated cities and modern infrastructures symbolize the USA’s established dominance. The foggy chasm in between signifies the uncertainties and complexities of their geopolitical rivalry.
Michelangelo’s David in a different context
According to DALL-E 3 as of the time this post was written, it is allowed to imitate art styles from artists, movements, or eras whose latest works were created before 1912.
Below is an attempt using the art style from the Renaissance period (1300–1600). This period is characterized by a revival of classical art and learning. Notable artists include Leonardo da Vinci, Michelangelo, and Raphael.
While the output does not resemble the movie at all, it is nevertheless a fascinating result.
Render inspired by Michelangelo’s ‘David’ summarizing the movie ‘Your Name’: A large marble slab in the center depicts the comet’s trajectory, symbolizing the pivotal event in the movie. Flanking this centerpiece are statues of Taki and Mitsuha, carved with the grace and precision akin to ‘David’. Their hands almost touch, emphasizing the near misses and connections they experience. The base of the sculptures integrates elements from both urban Tokyo and rural Itomori, reflecting the two contrasting worlds they inhabit.
Branch off from OpenAI example prompts
Another way to get started is to find existing examples and use their prompts as a starting point. Anecdotally, this approach works about 80% of the time, but in some cases, even with the same prompt, the system generates a significantly different image.
The first two images shown were taken from the official DALL-E 3 page. These images are the property of OpenAI and are presented here for illustrative purposes only.
In front of a deep black backdrop, a figure of middle years, her Tongan skin rich and glowing, is captured mid-twirl, her curly hair flowing like a storm behind her. Her attire resembles a whirlwind of marble and porcelain fragments. Illuminated by the gleam of scattered porcelain shards, creating a dreamlike atmosphere, the dancer manages to appear fragmented, yet maintains a harmonious and fluid form.
A middle-aged woman of Asian descent, her dark hair streaked with silver, appears fractured and splintered, intricately embedded within a sea of broken porcelain. The porcelain glistens with splatter paint patterns in a harmonious blend of glossy and matte blues, greens, oranges, and reds, capturing her dance in a surreal juxtaposition of movement and stillness. Her skin tone, a light hue like the porcelain, adds an almost mystical quality to her form.
The subsequent two images are attempts to “improve” upon the example images shown above.
In front of a pure white backdrop, a figure of a young ancient Chinese adult with unimaginable beauty and elegance, her skin rich and glowing, is captured in a full-body shot, mid-twirl. Her silky hair flows like a storm behind her. She is wearing attire that resembles a whirlwind of marble and porcelain fragments. The scene is illuminated by the gleam of scattered porcelain shards, creating a dreamlike atmosphere. The dancer appears fragmented yet maintains a harmonious and fluid form. The image is super ultra wide, showing the entire scene.
An 18-year-old Asian woman with dark hair streaked with silver appears fractured and splintered, intricately embedded within a sea of broken porcelain. The porcelain glistens with splatter paint patterns in glossy and matte blues, greens, oranges, and reds, capturing her dance in a surreal juxtaposition of movement and stillness. Her skin tone, a light hue like the porcelain, adds a mystical quality to her form. Her eyes are open, looking straight out of the frame.
Issues Encountered
Some notable issues and inconsistencies encountered include:
Text Generation: Supposedly, this is one of the key areas where the model has improved. Specifically, the research paper [1] shows the human evaluation interface used to gather feedback on prompt following, coherence, and style.
However, only selected official prompts/scenarios seem to work well. For reference, below is an official example prompt that works well.
[official example prompt] An illustration of an avocado sitting in a therapist’s chair, saying ‘I just feel so empty inside’ with a pit-sized hole in its center. The therapist, a spoon, scribbles notes.
Changing Image Format: Currently, if you do not explicitly ask for a vertical or wide image (or specify a particular context or purpose), you will typically get an image with a square (1:1) aspect ratio.
While you can request a change of format in a follow-up prompt, there is no guarantee that the new image will resemble the previous one. DALL-E 2 supported an outpainting feature, but it is not yet available for DALL-E 3.
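If you use the Images API instead of ChatGPT, the aspect ratio can be fixed up front: at the time of writing, the DALL-E 3 endpoint accepts three fixed size strings. A small helper can translate an orientation into the `size` parameter; the helper itself is a hypothetical sketch, only the size values come from the API documentation.

```python
# The three sizes accepted by the DALL-E 3 Images API as of this writing.
DALLE3_SIZES = {
    "square": "1024x1024",
    "wide": "1792x1024",
    "vertical": "1024x1792",
}

def pick_size(orientation="square"):
    """Return the DALL-E 3 `size` string for a requested orientation."""
    try:
        return DALLE3_SIZES[orientation]
    except KeyError:
        raise ValueError(f"orientation must be one of {sorted(DALLE3_SIZES)}")
```

Choosing the size at request time avoids the unreliable follow-up "please make it wide" prompt, since each generation starts from scratch at the requested aspect ratio.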
New Limitations and Safeguards: Since the original release of DALL-E 3, there have been new restrictions in the content generation policy. One of these restrictions concerns political content. For example, if you try to reuse the prompts presented in the Visual Depiction of Great Power Politics section above, you may find that the system aborts the generation halfway through.
As of the end of December 2023, those prompts still seem to work, despite the system stating that it is not allowed to generate political content that could be considered partisan or divisively charged.
This behaviour has been somewhat inconsistent, though, and OpenAI appears to be continuously fine-tuning the system prompts and safeguard mechanisms.
What is next?
Despite some of the shortcomings of the current implementation, it is fascinating to see how large language models and text-to-image models are slowly changing how we think about artworks.
One idea I have been bringing up frequently of late is that we will likely be able to direct our own movies and TV shows in the near future. Given that projects like StreamDiffusion already exist at the end of 2023, it is hard to imagine what will be around in 2030.
However, I also wonder how these advances in large language models and text-to-image models will change how we value media content such as artworks, movies, and video games.
Are we slowly approaching a future where more and more content is fully customised to suit our individual preferences? And when content is abundant and the barrier to creating it is so low, how does that change how we work and choose what we want to do?