Nvidia enters text-to-image battle with eDiff-I, takes on DALL-E, Imagen

The field of artificial intelligence (AI) text-to-image generators is the new battleground for tech conglomerates. Every AI-driven organization now aims to create a generative model that can exhibit extraordinary detail and conjure up compelling images from relatively simple text prompts. After OpenAI’s DALL-E 2, Google’s Imagen and Meta’s Make-A-Scene grabbed headlines with their image synthesis capabilities, Nvidia entered the race with its text-to-image model called eDiff-I.

Unlike other large text-to-image generative models that perform image synthesis through an iterative denoising process, Nvidia’s eDiff-I uses a set of expert denoisers that specialize in denoising different intervals of the generative process.

Nvidia’s unique image synthesis algorithm

The developers behind eDiff-I describe the text-to-image model as “a new generation of generative AI content creation tool that offers unprecedented text-to-image synthesis with instant style transfer and intuitive paint-with-words capabilities.”

In a recently published paper, the authors observe that current image synthesis algorithms rely heavily on the text prompt early in the generation process to lay out text-aligned content, while later in the process text conditioning is almost entirely ignored and the task shifts to producing outputs of high visual fidelity. This led to the realization that sharing the same model parameters across the entire generation process, as existing models do, may not be the best way to capture these distinct modes of generation.

“Therefore, in contrast to existing works, we propose to train an ensemble of text-to-image diffusion models specialized for different synthesis stages,” the Nvidia research team wrote in the paper. “To maintain training efficiency, we initially train a single model, which is then progressively split into specialized models that are further trained for the specific stages of the iterative generation process.”
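To make the “train once, then split” idea concrete, here is a minimal Python sketch of how such a progressive splitting schedule could look. The denoiser object and the `finetune` callable are placeholders for an actual diffusion model and training loop, not Nvidia’s code.

```python
import copy

def progressive_split(denoiser, finetune, interval=(0.0, 1.0), depth=2):
    """Recursively clone a trained denoiser and fine-tune each copy on half
    of its parent's noise-level interval, yielding a binary tree of experts."""
    if depth == 0:
        return [(interval, denoiser)]
    lo, hi = interval
    mid = (lo + hi) / 2.0
    experts = []
    for sub_interval in ((lo, mid), (mid, hi)):
        child = copy.deepcopy(denoiser)   # child inherits the parent's weights
        finetune(child, sub_interval)     # user-supplied training loop for that sub-interval only
        experts += progressive_split(child, finetune, sub_interval, depth - 1)
    return experts

# Example: depth=2 yields four experts covering [0, .25), [.25, .5), [.5, .75) and [.75, 1].
```

In this sketch each split halves the noise-level range an expert is responsible for; the real schedule Nvidia uses may split intervals unevenly, but the principle of inheriting weights and then specializing is the same.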

eDiff-I’s image synthesis pipeline combines three diffusion models: a base model that synthesizes samples at 64 x 64 resolution, and two super-resolution stacks that progressively upsample the images to 256 x 256 and 1024 x 1024 resolution, respectively.

These models process an input caption by first computing both its T5 XXL text embedding and its CLIP text embedding. The eDiff-I architecture can also use a CLIP image embedding computed from a reference image. This image embedding serves as a style vector, which is fed into the cascade of diffusion models to progressively generate images at 1024 x 1024 resolution.
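The conditioning and cascading described above could be wired together roughly as follows. This is a minimal sketch under assumed interfaces (`encode_text`, `encode_image`, `sample`, `upsample`); it is not Nvidia’s actual API.

```python
def ediff_like_generate(caption, reference_image, t5, clip, base_model, sr_256, sr_1024):
    """Condition a three-stage cascade on both text encoders plus an optional
    CLIP image embedding used as a style vector."""
    text_cond = {
        "t5": t5.encode_text(caption),      # T5 XXL text embedding
        "clip": clip.encode_text(caption),  # CLIP text embedding
    }
    # Optional style vector taken from a reference image (used for style transfer).
    style = clip.encode_image(reference_image) if reference_image is not None else None

    x = base_model.sample(text_cond, style)   # 64 x 64 base sample
    x = sr_256.upsample(x, text_cond, style)  # -> 256 x 256
    x = sr_1024.upsample(x, text_cond, style) # -> 1024 x 1024
    return x
```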

These design choices give eDiff-I a much higher level of control over the generated content. Beyond text-to-image synthesis, the eDiff-I model offers two additional features: style transfer, which lets users control the style of the generated sample with a reference image, and “paint with words,” which lets users compose an image by drawing segmentation maps on a virtual canvas, a handy feature for scenarios where a specific layout is desired.

Image source: Nvidia AI.
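Nvidia has not published its exact paint-with-words implementation in this article, but one common way to realize the idea is to bias cross-attention so that the tokens of a painted phrase attend more strongly inside the region the user drew for it. The sketch below is an illustrative approximation with hypothetical inputs (`word_masks`, `token_spans`), not the model’s actual code.

```python
import torch

def masked_cross_attention(q, k, v, word_masks, token_spans, weight=2.0):
    """
    q: (pixels, d) image queries; k, v: (tokens, d) text keys/values.
    word_masks: list of (pixels,) binary masks, one per painted phrase.
    token_spans: list of (start, end) token index ranges matching each mask.
    """
    logits = q @ k.T / q.shape[-1] ** 0.5  # standard scaled dot-product attention logits
    bias = torch.zeros_like(logits)
    for mask, (start, end) in zip(word_masks, token_spans):
        # Add a positive bias where the painted region meets the phrase's tokens.
        bias[mask.bool(), start:end] += weight
    attn = torch.softmax(logits + bias, dim=-1)
    return attn @ v
```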

A new denoising process

Synthesis in diffusion models typically occurs through a series of iterative denoising steps that gradually generate an image from random noise, with the same denoising neural network used throughout the process. eDiff-I instead trains a set of specialized denoisers, each responsible for a different interval of the generative process. Nvidia refers to these networks as “expert denoisers” and claims that this approach significantly improves the quality of image generation.

The denoising architecture used by eDiff-I. Image source: Nvidia AI.
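At sampling time, the only structural change from a standard diffusion loop is choosing which expert runs at each step. Here is a minimal sketch, assuming each expert is tagged with the normalized noise interval it owns and exposes a hypothetical `denoise` method.

```python
def denoise_with_experts(x_t, timesteps, experts):
    """
    experts: list of ((lo, hi), model) pairs covering t in [0, 1].
    A single network would normally handle every step; here the step's
    normalized time decides which specialized denoiser is evaluated.
    """
    for t in timesteps:                      # e.g. t runs from 1.0 (pure noise) toward 0.0
        for (lo, hi), model in experts:
            if lo <= t <= hi:
                x_t = model.denoise(x_t, t)  # only the matching expert runs at this step
                break
    return x_t
```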

Scott Stephenson, CEO of Deepgram, said the new methods presented in the eDiff-I training pipeline could be incorporated into new versions of DALL-E or Stable Diffusion, where they could enable significant advances in the quality and controllability of synthesized images.

“It definitely adds to the complexity of model training, but does not significantly increase the computational complexity in production use,” Stephenson told VentureBeat. “Being able to segment and define how each component of the resulting image should look could speed up the creative process significantly. Plus, it allows humans and machines to work more closely together.”

Better than contemporaries?

While other state-of-the-art contemporaries such as DALL-E 2 and Imagen use only a single text encoder, such as CLIP or T5, eDiff-I’s architecture uses both encoders in the same model. This allows eDiff-I to generate substantially diverse visuals from the same text input.

CLIP gives the generated image a stylized look; however, the output often misses details from the text. Images created with T5 text embeddings, on the other hand, render individual objects better. By combining the two, eDiff-I produces images that have both qualities.

Generate variants from the same text input. Image source: Nvidia AI.

The development team also found that the more descriptive the text prompt, the better T5 performs compared to CLIP, and that combining the two yields better synthesis results. The model was also evaluated on standard datasets such as MS-COCO, showing that CLIP + T5 embeddings provide significantly better trade-off curves than either encoder alone.

Nvidia’s study shows that eDiff-I outperformed competitors such as DALL-E 2, Make-A-Scene, GLIDE and Stable Diffusion on Frechet Inception Distance, or FID, a metric for evaluating the quality of AI-generated images. eDiff-I also achieved a better FID score than Google’s Imagen and Parti.

Zero-shot FID comparison with recent state-of-the-art models on the COCO 2014 validation dataset. Image source: Nvidia AI.
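For readers who want to run this kind of comparison on their own samples, FID can be computed with an off-the-shelf implementation such as torchmetrics. The snippet below is a hedged example that uses random tensors as stand-ins for real COCO images and model outputs; it is not the evaluation code Nvidia used.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)

# Stand-ins for real dataset images and generated samples (uint8, NCHW).
real_images = torch.randint(0, 255, (32, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 255, (32, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(float(fid.compute()))  # lower FID is better
```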

Comparing images generated from both simple and long, detailed captions, Nvidia’s study claims that DALL-E 2 and Stable Diffusion failed to synthesize images that accurately matched the text captions. The study also found that other generative models produced misspelled text or ignored some of the attributes in the prompt. eDiff-I, meanwhile, was able to correctly render the English text across a wide range of samples.

That said, the research team also noted that they generated multiple outputs from each method and selected the best one to include in the figure.

Comparison of image generation through detailed captions. Image source: Nvidia AI.

Current challenges for generative AI

Modern text-to-image diffusion models have the potential to democratize artistic expression by letting users produce detailed, high-quality images without specialist skills. However, they can also be used for advanced photo manipulation for malicious purposes, or to create deceptive or harmful content.

Recent advances in generative models and AI-based image editing have profound implications for image authenticity and beyond. Nvidia says these challenges can be addressed by automatically validating authentic images and detecting manipulated or fake content.

The training datasets of current large-scale text-to-image generative models are mostly unfiltered and can include biases that are captured by the model and reflected in the generated data. It is therefore crucial to be aware of such biases in the underlying data and to counteract them by actively collecting more representative data or using bias-correction methods.

“Generative AI image models face the same ethical challenges as other areas of artificial intelligence: where training data comes from and how to understand its use in the model,” Stephenson said. “Large datasets of labeled images may contain copyrighted material, and it is often impossible to explain how (or whether) the copyrighted material was incorporated into the final product.”

According to Stephenson, the speed of model training is another challenge that generative AI models still face, especially during their development phase.

“While it takes between three and 60 seconds for a model to generate an image on some of the highest-end GPUs on the market, production-scale deployments will require either a significant increase in GPU supply or a way to generate images in a fraction of a second. The status quo is not scalable if demand increases 10 or 100 times,” Stephenson told VentureBeat.

The future of generative AI

Kyran McDonnell, founder and CEO of revolt, said that although today’s text-to-image models handle abstract art exceptionally well, they lack the architecture needed to construct the priors required to properly understand reality.

“They’ll be able to get closer to reality with enough training data and better models, but won’t really understand it,” he said. “Until this underlying problem is resolved, we will still see these models make common sense mistakes.”

McDonnell believes that next-generation text-to-image architectures, such as eDiff-I, will solve many of today’s quality issues.

“We can still expect composition errors, but the quality will be similar to where GANs are today when it comes to face generation,” McDonnell said.

Similarly, Stephenson said we will see generative AI applied across many more areas.

“Generative models trained on a brand’s overall style and ‘vibe’ could generate an endless variety of creative assets,” he said. “There’s a lot of room for enterprise applications, and generative AI hasn’t had its ‘mainstream moment’ yet.”
