OpenAI Unveils AI-Powered DALL-E Text-to-Image Generator

OpenAI unveiled DALL-E, a multimodal AI system that generates images from text by combining computer vision and NLP. The name is a reference to surrealist artist Salvador Dalí and Pixar’s animated robot WALL-E. DALL-E relies on a 12-billion-parameter version of GPT-3. OpenAI demonstrated that DALL-E can manipulate and rearrange objects in generated imagery and also create images from scratch based on text prompts. The company has stated that it plans to “analyze how models like DALL·E relate to societal issues.”

VentureBeat reports that the societal issues OpenAI plans to study include “economic impact on certain work processes and professions, the potential for bias in the model outputs, and the longer-term ethical challenges implied by this technology.” OpenAI also unveiled CLIP, “a multimodal model trained on 400 million pairs of images and text collected from the Internet,” which uses “zero-shot learning capabilities akin to GPT-2 and GPT-3 language models.”

OpenAI coauthors noted that CLIP “learns to perform a wide set of tasks during pre-training including optical character recognition (OCR), geo-localization, action recognition, and many others … measure[d] by benchmarking the zero-shot transfer performance of CLIP on over 30 existing datasets.” CLIP was found to be “competitive with prior task-specific supervised models” but fell “short in specialization tasks like satellite imagery classification or lymph node tumor detection.”

Both OpenAI chief scientist Ilya Sutskever and Google AI chief Jeff Dean had alluded to or predicted the release of a system like CLIP. VB notes that “the release of DALL-E follows the release of a number of generative models with the power to mimic or distort reality or predict how people paint landscape and still life art.”

Some systems, such as StyleGAN, have “demonstrated a propensity for racial bias.” One bias test in the CLIP paper showed the model was “most likely to miscategorize people under 20 as criminals or non-human, people classified as men were more likely to be labeled as criminals [than] people classified as women, and some label data contained in the dataset are heavily gendered.”

Engadget reports that DALL-E “can create images based on a description of its attributes, like ‘a pentagonal green clock,’ or ‘a collection of glasses is sitting on a table’.” In doing so, it “can also draw and combine multiple objects and provide different points of view, including cutaways and object interiors … [and] even infers details that aren’t mentioned in the description but would be required for a realistic image.”

Zero-shot reasoning “allows an agent to generate an answer from a description and cue without any additional training … [in this case] applied to the visual domain to perform both image-to-image and text-to-image translation.” It can even understand “how telephones and other objects change over time, grasping geographic facts and landmarks and creating images in photographic, illustration and even clip-art styles.”
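The mechanics of zero-shot classification can be sketched in a few lines. The toy example below is only an illustration, not OpenAI's code: real CLIP uses learned image and text encoders, so the `embed` function here (which returns random unit vectors) is a hypothetical stand-in for those encoders. The comparison step, however, mirrors the published approach: embed the image and a set of candidate captions in a shared space, then pick the caption whose embedding is most similar to the image's.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(items, dim=512):
    """Hypothetical stand-in encoder: maps each input to a random
    unit vector. In real CLIP this would be a trained image or
    text encoder producing embeddings in a shared space."""
    vecs = rng.normal(size=(len(items), dim))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

# Candidate captions act as class labels; no task-specific training.
labels = [
    "a pentagonal green clock",
    "a collection of glasses is sitting on a table",
    "an avocado armchair",
]
text_embeddings = embed(labels)

# One image embedding (placeholder input standing in for pixels).
image_embedding = embed(["<image pixels>"])[0]

# Cosine similarity of the image against each caption; the
# highest-scoring caption is the zero-shot prediction.
scores = text_embeddings @ image_embedding
prediction = labels[int(np.argmax(scores))]
```

Because the "classes" are just text prompts, swapping in a new label set requires no retraining, which is what lets CLIP transfer to dozens of benchmarks without task-specific tuning.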

For more information and a chance to “play” with DALL-E, visit the OpenAI blog.

This Avocado Armchair Could Be the Future of AI
OpenAI has extended GPT-3 with two new models that combine NLP with image recognition to give its AI a better understanding of everyday concepts.
MIT Technology Review, 1/5/21