Google’s Imagen AI Model Makes Advances in Text-to-Image

Google has released a research paper on a new text-to-image generator called Imagen, which combines the power of large transformer language models for text with the capabilities of diffusion models in high-fidelity image generation. “Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis,” the company said. Google is also introducing DrawBench, a benchmark for text-to-image models, which it says it used to compare Imagen with other recent methods including VQGAN+CLIP, latent diffusion models, and OpenAI’s DALL-E 2.

AI-powered text-to-image models have struggled to translate natural-language prompts into images that accurately and predictably match what the prompter imagined, and their results are often quirky or surprising. But Google says it has achieved a breakthrough with its approach, and that Imagen consistently outperformed other models in the DrawBench side-by-side comparisons, both in the accuracy of the images produced and in their quality.

“The advances Google’s researchers claim with Imagen are several,” reports TechCrunch, emphasizing its “diffusion techniques, which basically start with a pure noise image and slowly refine it bit by bit until the model thinks it can’t make it look any [better].” This, TechCrunch says, “was an improvement over top-to-bottom generators that could get it hilariously wrong on first guess, and others that could easily be led astray.”
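The refinement loop TechCrunch describes can be illustrated with a toy sketch. This is not Imagen's sampler: the `toy_denoiser` function below is a hypothetical stand-in for a trained network and simply cheats by knowing the target image, and the step schedule is an assumption chosen so the arithmetic is easy to follow. It only shows the general shape of the idea, starting from pure noise and removing a little of the estimated noise at each step.

```python
import numpy as np

def toy_denoiser(x, target):
    # Hypothetical stand-in for a learned model: it "predicts" the noise
    # as the difference between the current image and a known target.
    # A real diffusion model has no access to the target.
    return x - target

def sample(target, steps=50, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)  # start from pure noise
    errors = []
    for t in range(steps):
        predicted_noise = toy_denoiser(x, target)
        # Remove a growing fraction of the estimated noise; on the last
        # step the fraction is 1, so the remaining noise is removed.
        x = x - predicted_noise / (steps - t)
        errors.append(float(np.abs(x - target).mean()))
    return x, errors

# A tiny 4x4 "image" to refine toward (illustrative values only).
target = np.linspace(0.0, 1.0, 16).reshape(4, 4)
image, errors = sample(target)
print(errors[0], errors[-1])
```

In an actual diffusion model the noise estimate comes from a neural network conditioned on the text prompt, and the amount removed per step follows a learned or hand-tuned noise schedule rather than this simple fraction.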

Another significant aspect noted by TechCrunch is “improved language understanding through large language models using the transformer approach,” also used in language models including GPT-3.

“They say that existing text models can be used for the text encoding portion, and that their quality is more important than simply increasing visual fidelity,” TechCrunch reports. “That makes sense intuitively, since a detailed picture of nonsense is definitely worse than a slightly less detailed picture of exactly what you asked for.”

As an example, TechCrunch cites the Imagen paper’s comparative analysis of “results for it and DALL-E 2 doing ‘a panda making latte art,’” wherein DALL-E 2 ends up with “latte art of a panda; in most of Imagen’s it’s a panda making the art. (Neither was able to render a horse riding an astronaut, showing the opposite in all attempts. It’s a work in progress.)”

“Increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model,” Google reports in sharing its research paper. Imagen achieved a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters found Imagen samples to be on par with the COCO data itself in image-text alignment, Google says.
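For readers unfamiliar with the FID (Fréchet Inception Distance) metric cited above, here is a minimal sketch of how such a score is computed from two sets of feature vectors; lower is better. In practice the features come from a pretrained Inception network applied to real and generated images. The random arrays below are placeholders, and the function name `frechet_distance` is chosen for illustration, so this does not reproduce Google's evaluation.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    # Fit a Gaussian (mean and covariance) to each set of feature vectors.
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # The trace of the matrix square root of cov_a @ cov_b equals the sum
    # of the square roots of its eigenvalues, which are real and
    # non-negative for covariance matrices (up to numerical error).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

# Placeholder features: 200 samples with 4 dimensions each.
rng = np.random.default_rng(0)
real = rng.standard_normal((200, 4))
shifted = real + 2.0  # same spread, different mean

same_score = frechet_distance(real, real)
shifted_score = frechet_distance(real, shifted)
print(same_score, shifted_score)
```

Two identical feature sets score approximately zero, and the score grows as the mean or spread of the generated features drifts away from those of the reference set.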