Artificial intelligence

Generative AI changes everything. But what’s left when the hype is gone?

The big breakthrough behind the new models lies in the way the images are generated. The first version of DALL-E used an extension of the technology behind GPT-3, OpenAI’s large language model, producing images by predicting each next piece of an image as if it were the next word in a sentence. It worked, but not well. “It wasn’t a magical experience,” Altman says. “It’s amazing that it worked.”

Instead, DALL-E 2 uses what’s called a diffusion model. Diffusion models are neural networks trained to clean up images by removing noise that has been added to them. During training, images are corrupted a few pixels at a time, over many steps, until the originals are erased and you are left with random pixels. “If you do this a thousand times, the picture ends up looking like you’ve ripped the aerial cable off your TV: it’s just snow,” says Björn Ommer, who leads the group at the University of Munich in Germany that helped build the diffusion model that now powers Stable Diffusion.

The neural network is then trained to reverse this process, predicting what a slightly less noisy version of a given image would look like. The result is that if you give a diffusion model a mess of pixels, it will try to generate something a little cleaner. Feed the cleaned-up image back in, and the model will produce something cleaner still. Do this enough times and the model can take you from TV snow to a high-resolution image.
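The forward “noising” half of this process can be sketched numerically. This is a toy illustration in NumPy: the 8×8 image, step size, and step count are arbitrary choices for demonstration, and the reverse (denoising) step, which in a real system is performed by a trained neural network, is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny 8x8 grayscale "image": a bright square on a dark background.
image = np.zeros((8, 8))
image[2:6, 2:6] = 1.0

# Forward (noising) process: blend the image toward random noise a little
# at a time. After enough steps nothing of the original remains -- the
# "TV snow" Ommer describes.
x = image.copy()
for step in range(1000):
    x = 0.99 * x + 0.01 * rng.normal(size=x.shape)
snow = x

# The correlation between the result and the original is essentially zero:
corr = np.corrcoef(image.ravel(), snow.ravel())[0, 1]
print(f"correlation with original image: {corr:.3f}")
```

A diffusion model is trained on many (noisier image, less noisy image) pairs produced this way, so that at generation time it can run the whole chain in reverse, starting from pure noise.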

AI art generators never work exactly the way you want them to. They often produce hideous results that, at best, can look like distorted stock imagery. In my experience, the only way to really make the work look good is to add a descriptor at the end naming a style that is aesthetically pleasing.

– Erik Carter

The trick with text-to-image models is that this process is guided by a language model that tries to match a prompt to the images the diffusion model is producing. This pushes the diffusion model toward images that the language model considers a good match.
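One common way this steering is implemented in diffusion models is classifier-free guidance, where each denoising step compares what the model predicts with and without the text prompt and amplifies the difference. This is a minimal sketch of that idea, not the exact mechanism of any one product; the function name and the scale value 7.5 are illustrative choices.

```python
import numpy as np

def guided_noise_prediction(pred_uncond, pred_cond, guidance_scale=7.5):
    """Classifier-free guidance: push the denoiser's output toward what it
    predicts when it sees the text prompt (pred_cond) and away from what it
    predicts with no prompt at all (pred_uncond)."""
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)

# Toy example with scalar "predictions" standing in for full noise maps:
unconditional = np.array([0.0])
conditional = np.array([1.0])
print(guided_noise_prediction(unconditional, conditional))  # [7.5]
```

A higher guidance scale makes the output follow the prompt more literally, usually at some cost in variety.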

But these models don’t pull the links between text and images out of thin air. Today, most text-to-image models are trained on a large data set called LAION, which contains billions of pairings of text and images pulled from the internet. This means that the images you get from a text-to-image model are a compendium of the world as it is depicted online, distorted by bias (and pornography).

One last thing: there is a small but crucial difference between the two most popular models, DALL-E 2 and Stable Diffusion. DALL-E 2’s diffusion model works on full-size images. Stable Diffusion, on the other hand, uses a technique called latent diffusion, invented by Ommer and his colleagues. It works on compressed versions of images encoded within the neural network in what is known as a latent space, where only the essential features of an image are retained.
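The savings from working in a latent space come down to simple arithmetic. Using the image and latent sizes from Stable Diffusion’s published setup (a 512×512 RGB image compressed to a 64×64×4 latent), each denoising step touches far fewer values:

```python
import math

# Shapes from Stable Diffusion's published architecture:
pixel_shape = (512, 512, 3)   # full-resolution RGB image
latent_shape = (64, 64, 4)    # compressed latent the diffusion runs on

pixel_values = math.prod(pixel_shape)    # 786,432 values per image
latent_values = math.prod(latent_shape)  # 16,384 values per latent

print(f"values per denoising step: {pixel_values} vs {latent_values}")
print(f"reduction factor: {pixel_values // latent_values}x")  # 48x
```

Repeating a denoising step dozens of times over 48× fewer values is a large part of why the model fits on a consumer GPU.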

This means that Stable Diffusion requires less computing power to run. Unlike DALL-E 2, which runs on OpenAI’s powerful servers, Stable Diffusion can run on (good) personal computers. Much of the explosion of creativity and the rapid development of new applications is due to the fact that Stable Diffusion is both open source – programmers are free to modify it, build on it, and make money from it – and light enough for people to run at home.

Redefining Creativity

For some, these models are a step toward artificial general intelligence, or AGI – a much-hyped term referring to a future AI with general-purpose or even human-like abilities. OpenAI has been explicit about its goal of achieving AGI. For that reason, Altman doesn’t care that DALL-E 2 now competes with a host of similar tools, some of them free. “We’re here to create AGI, not image generators,” he says. “It will fit into a broader product road map. It’s one small piece of what an AGI will do.”
