Text-to-image AI exploded this year as technical advancements dramatically improved the fidelity of art that AI systems could create. Controversial as systems like Stable Diffusion and OpenAI's DALL-E 2 are, platforms such as DeviantArt and Canva have adopted them to power creative tools, personalize branding, and even dream up new products.
But the technology at the heart of these systems is capable of much more than generating art. Called diffusion, it is used by some intrepid research groups to produce music, synthesize DNA sequences and even discover new drugs.
So what is diffusion, exactly, and why is it such a big leap from the previous state of the art? As the year draws to a close, it's worth taking a look at the origins of diffusion and how it evolved over time to become the influential force it is today. The story of diffusion isn't over – technical improvements arrive with each passing month – but the last year or two brought especially remarkable progress.
The birth of diffusion
You may remember the deepfake app trend of several years ago – apps that inserted portraits of people into existing images and videos, realistically replacing the original subjects in that content. Using AI, the apps would "insert" a person's face – or in some cases, their entire body – into a scene, often convincingly enough to fool someone at first glance.
Most of these apps relied on an AI technology called generative adversarial networks, or GANs for short. GANs consist of two parts: a generator, which produces synthetic examples (e.g., images) from random data, and a discriminator, which attempts to distinguish the synthetic examples from real examples in a training dataset. (Typical GAN training datasets consist of hundreds to millions of examples of the things the GAN is meant to eventually capture.) The generator and discriminator improve their respective capabilities until the discriminator can no longer tell the real examples from the synthesized ones with better than the 50% accuracy expected from chance.
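The generator/discriminator tug-of-war can be sketched at toy scale. The snippet below is a deliberately tiny, illustrative "GAN" on one-dimensional data – the real data distribution, learning rates, and update rules are all simplified assumptions for the sake of demonstration, not any production architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

REAL_MEAN = 3.0          # "real" data: samples from N(3, 1)
w, b = 0.0, 0.0          # discriminator: D(x) = sigmoid(w*x + b)
mu = 0.0                 # generator: G(z) = mu + z, with z ~ N(0, 1)
lr = 0.05

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for step in range(2000):
    real = REAL_MEAN + rng.standard_normal()
    fake = mu + rng.standard_normal()

    # Discriminator update: push D(real) toward 1 and D(fake) toward 0.
    d_real, d_fake = sigmoid(w * real + b), sigmoid(w * fake + b)
    w += lr * ((1 - d_real) * real - d_fake * fake)
    b += lr * ((1 - d_real) - d_fake)

    # Generator update: nudge mu so the discriminator scores fakes higher.
    d_fake = sigmoid(w * fake + b)
    mu += lr * (1 - d_fake) * w

print(f"generator mean after training: {mu:.2f}")
```

With enough steps, the generator's mean drifts toward the real data's mean of 3 – and once the two distributions overlap, the discriminator's accuracy falls back toward chance, which is exactly the equilibrium described above.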
The most successful GANs can create, for example, snapshots of fictional apartment buildings. StyleGAN, a system developed by Nvidia a few years ago, can generate high-resolution portraits of fictional people by learning attributes such as facial pose, freckles, and hair. Beyond image generation, GANs have been applied to the 3D modeling space and vector sketches, showing an ability to output video clips as well as speech and even looping instrument samples in songs.
In practice, however, GANs suffered from a number of shortcomings owing to their architecture. Training the generator and discriminator simultaneously was inherently unstable; sometimes the generator would "collapse" and output many similar-looking samples. GANs also needed lots of data and compute to train and run, which made them difficult to scale.
Enter diffusion.
How Diffusion Works
Diffusion was inspired by physics – it is the physical process by which something moves from a region of higher concentration to one of lower concentration, like a sugar cube dissolving in coffee. Sugar granules in coffee are initially concentrated at the top of the liquid, but gradually spread throughout it.
More precisely, diffusion systems borrow from diffusion in non-equilibrium thermodynamics, where the process increases the entropy – or randomness – of the system over time. Consider a gas – it will eventually spread out to fill an entire space evenly through random motion. Similarly, data such as images can be transformed into a uniform distribution by randomly adding noise.
Diffusion systems slowly destroy the structure of data by adding noise until only noise remains.
In physics, diffusion is spontaneous and irreversible – sugar diffused into coffee cannot be reconstituted into a cube. But diffusion systems in machine learning aim to learn a kind of "reverse diffusion" process to restore the destroyed data, gaining the ability to recover data from noise.
Diffusion systems have been around for nearly a decade. But a relatively recent OpenAI innovation called CLIP (short for "Contrastive Language-Image Pre-Training") made them far more practical for everyday applications. CLIP classifies data – for example, images – to "score" each step of the diffusion process based on how likely the data is to be classified under a given text prompt (e.g., "a sketch of a dog on a flowery lawn").
At first, the data has a very low CLIP score, because it is mostly noise. But as the diffusion system reconstructs data from the noise, it slowly comes closer to matching the prompt. A useful analogy is uncarved marble – like a master sculptor telling a novice where to carve, CLIP guides the diffusion system toward an image that yields a higher score.
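The "guidance" idea can be caricatured with a stand-in scorer. Real systems use the gradient of a learned model like CLIP to steer each denoising step; in this sketch, a hypothetical target vector plays the role of "the image the prompt describes," and a crude accept-if-better search stands in for gradient guidance – every name and number here is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a CLIP-style scorer: higher means "closer to the prompt".
target = rng.standard_normal(32)

def score(img):
    return -np.linalg.norm(img - target)   # best possible score is 0

# Start from pure noise and repeatedly propose small perturbations,
# keeping only those the scorer prefers.
img = rng.standard_normal(32)
initial = score(img)
for step in range(5000):
    candidate = img + 0.05 * rng.standard_normal(32)
    if score(candidate) > score(img):
        img = candidate

print(f"score went from {initial:.2f} to {score(img):.2f}")
```

The point of the sketch is only the shape of the loop: noise in, repeated small steps out, each one nudged toward whatever the scorer says better matches the prompt.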
OpenAI introduced CLIP alongside the DALL-E image generation system. Since then, it has made its way into DALL-E’s successor, DALL-E 2, as well as open-source alternatives like Stable Diffusion.
What can diffusion do?
So what can CLIP-guided diffusion models do? Well, as mentioned earlier, they're quite good at generating art – from photorealistic images to sketches, drawings, and paintings in the style of virtually any artist. In fact, there is evidence to suggest that they problematically regurgitate some of their training data.
But the models' talents, controversial as they are, don't stop there.
Researchers have also experimented with using guided diffusion models to compose new music. Harmonai, an organization financially backed by Stability AI, the London-based startup behind Stable Diffusion, has released a diffusion-based model that can output clips of music after training on hundreds of hours of existing songs. More recently, developers Seth Forsgren and Hayk Martiros created a hobby project dubbed Riffusion, which uses a diffusion model cleverly trained on spectrograms – visual representations of audio – to generate ditties.
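The spectrogram trick works because audio can be converted into an image-like grid of frequency content over time, which an image diffusion model can then learn to generate. A minimal NumPy sketch of that representation (the sample rate, tone, and frame size are illustrative choices):

```python
import numpy as np

# One second of a 440 Hz sine tone - a stand-in for a real recording.
sr = 8000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440.0 * t)

# Slice the audio into short frames and take each frame's spectrum;
# stacking the spectra yields a 2-D "image": time on one axis,
# frequency on the other.
frame = 256
frames = audio[: len(audio) // frame * frame].reshape(-1, frame)
spectrogram = np.abs(np.fft.rfft(frames, axis=1))   # shape: (time, freq)

# The brightest column of frequency bins should sit near 440 Hz.
peak_bin = int(np.argmax(spectrogram.mean(axis=0)))
peak_hz = peak_bin * sr / frame
print(f"spectrogram shape: {spectrogram.shape}, peak near {peak_hz:.0f} Hz")
```

A Riffusion-style system generates such a grid as if it were an image, then runs the transform in the other direction to recover playable audio.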
Beyond the realm of music, several labs are attempting to apply diffusion technology to biomedicine in the hopes of discovering novel treatments for disease. Startup Generate Biomedicines and a team from the University of Washington trained diffusion-based models to produce protein designs with specific properties and functions, as MIT Tech Review reported earlier this month.
The models work in different ways. Generate Biomedicines' model adds noise by unraveling the amino acid chains that make up a protein, then assembles random chains to form a new protein, guided by constraints specified by the researchers. The University of Washington model, on the other hand, starts with a scrambled structure and uses information about how the pieces of a protein should fit together, provided by a separate AI system trained to predict protein structure.
They've already had some success. The model designed by the University of Washington group was able to find a protein that binds to parathyroid hormone – the hormone that controls calcium levels in the blood – better than existing drugs.
Meanwhile, at OpenBioML, a Stability AI-backed effort to bring machine learning-based approaches to biochemistry, researchers have developed a system called DNA-Diffusion to generate cell-type-specific regulatory DNA sequences – segments of nucleic acid molecules that influence the expression of specific genes within an organism. DNA-Diffusion will – if all goes according to plan – generate regulatory DNA sequences from text instructions such as "A sequence that will activate a gene to its maximum expression level in cell type X" and "A sequence that activates a gene in the liver and heart, but not in the brain."
What's next for diffusion models? The sky may well be the limit. Already, researchers have applied diffusion to generating videos, compressing images and synthesizing speech. That's not to say diffusion won't eventually be replaced by a more efficient, more capable machine learning technique, the way GANs were by diffusion. But it's the architecture du jour for a reason; diffusion is nothing if not versatile.