
Riffusion’s AI generates music from text using visual sonograms

An AI-generated image of musical notes exploding from a computer screen.


On Thursday, a pair of tech hobbyists released Riffusion, an AI model that generates music from text prompts by creating a visual representation of sound and converting it to audio for playback. It uses a fine-tuned version of the Stable Diffusion 1.5 image synthesis model, applying visual latent diffusion to process sound in a novel way.

Created as a hobby project by Seth Forsgren and Hayk Martiros, Riffusion works by generating sonograms, which store audio in a two-dimensional image. In a sonogram, the X axis represents time (the order in which frequencies are played, from left to right) and the Y axis represents the frequency of sounds. Meanwhile, the color of each pixel in the image represents the amplitude of the sound at that particular moment.
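For a concrete sense of that encoding, here is a minimal sketch (not Riffusion's actual code) that turns an audio file into a spectrogram image using torchaudio; the file name clip.wav and the STFT parameters are arbitrary assumptions.

```python
import torch
import torchaudio
import torchaudio.transforms as T
from PIL import Image

# Load an audio clip and collapse it to a single channel.
waveform, sample_rate = torchaudio.load("clip.wav")   # shape: (channels, samples)
mono = waveform.mean(dim=0)

# Short-time Fourier transform: each column is one slice of time, each row one frequency band.
spec = T.Spectrogram(n_fft=1024, hop_length=256, power=2.0)(mono)

# Compress the dynamic range to decibels so amplitudes fit into image-like values.
db = T.AmplitudeToDB(top_db=80)(spec)

# Normalize to 0-255 and save as an 8-bit grayscale image:
# X = time, Y = frequency, pixel brightness = amplitude.
img = ((db - db.min()) / (db.max() - db.min()) * 255).to(torch.uint8)
Image.fromarray(img.flip(0).numpy()).save("sonogram.png")  # flip so low frequencies sit at the bottom
```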

Since a sonogram is a type of image, Stable Diffusion can process it. Forsgren and Martiros trained a custom Stable Diffusion model with sample sonograms linked to descriptions of the sounds or musical genres they represented. With that knowledge, Riffusion can generate new music on the fly based on text prompts describing the type of music or sound you want to hear, such as “jazz”, “rock”, or even the sound of typing on a keyboard.
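A hedged sketch of how a fine-tuned Stable Diffusion checkpoint could be prompted for a new sonogram through Hugging Face's diffusers library; the checkpoint identifier below is illustrative, not necessarily how Forsgren and Martiros distribute their weights.

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumed name of a fine-tuned, diffusers-compatible checkpoint.
pipe = StableDiffusionPipeline.from_pretrained(
    "riffusion/riffusion-model-v1",
    torch_dtype=torch.float16,
).to("cuda")

# The prompt describes the music; the output is a spectrogram image, not audio.
image = pipe(
    prompt="jazz",
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]

image.save("generated_sonogram.png")
```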

After generating the sonogram image, Riffusion uses Torchaudio to convert the sonogram to sound and play it back as audio.
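The article doesn't detail that conversion step, but a rough sketch of the general approach looks like this: undo the image encoding, then use torchaudio's Griffin-Lim transform to estimate the phase information the picture never stored. The scaling constants mirror the first sketch above (a clean round trip of its output) and are assumptions, not Riffusion's exact mapping.

```python
import numpy as np
import torch
import torchaudio
import torchaudio.transforms as T
from PIL import Image

hop_length, sample_rate = 256, 44100   # assumed to match however the image was encoded

# Load the grayscale sonogram and undo the 0-255 / dB scaling used in the first sketch.
img = np.asarray(Image.open("sonogram.png").convert("L"), dtype=np.float32)
n_fft = 2 * (img.shape[0] - 1)                # infer the FFT size from the frequency rows
db = img[::-1].copy() / 255.0 * 80.0 - 80.0   # flip back, map brightness to [-80, 0] dB
power_spec = torch.from_numpy(10.0 ** (db / 10.0)).float()

# Griffin-Lim iteratively estimates the phase that the image never stored.
griffin_lim = T.GriffinLim(n_fft=n_fft, hop_length=hop_length, power=2.0, n_iter=64)
waveform = griffin_lim(power_spec)

torchaudio.save("output.wav", waveform.unsqueeze(0), sample_rate)
```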

A sonogram represents time, frequency, and amplitude in a two-dimensional image.

“This is the v1.5 Stable Diffusion model with no modifications, just fine-tuned on images of spectrograms paired with text,” the creators of Riffusion write on its explanation page. “It can generate infinite variations of a prompt by varying the seed. All of the same web UI and techniques like img2img, inpainting, negative prompts, and interpolation work out of the box.”
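A small sketch of the “vary the seed” idea from that quote, again using the diffusers API and the same illustrative checkpoint name as above.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "riffusion/riffusion-model-v1", torch_dtype=torch.float16  # assumed checkpoint name
).to("cuda")

# Same prompt, different seeds -> different sonograms, and thus different riffs.
for seed in (0, 1, 2):
    generator = torch.Generator(device="cuda").manual_seed(seed)
    image = pipe("rock", generator=generator).images[0]
    image.save(f"rock_seed_{seed}.png")
```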

Visitors to the Riffusion site can experiment with the AI model through an interactive web app that generates interpolated sonograms (smoothly stitched together for uninterrupted playback) in real time while visualizing the sonogram continuously on the left side of the page.
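That interpolation presumably happens in the model's latent space. As a rough illustration, not the authors' implementation, one can blend the initial latent noise of two seeds and decode each blend with the same prompt, so neighboring sonograms change gradually.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "riffusion/riffusion-model-v1", torch_dtype=torch.float16  # assumed checkpoint name
).to("cuda")

# Latent noise shape for a 512x512 output image.
shape = (1, pipe.unet.config.in_channels, 64, 64)

def noise(seed):
    # Draw the initial latent noise on the CPU for reproducibility, then move it to the GPU.
    g = torch.Generator().manual_seed(seed)
    return torch.randn(shape, generator=g).to("cuda", torch.float16)

a, b = noise(1), noise(2)

for i, t in enumerate([0.0, 0.25, 0.5, 0.75, 1.0]):
    latents = (1.0 - t) * a + t * b            # simple linear blend of the starting noise
    image = pipe("smooth tropical dance jazz", latents=latents).images[0]
    image.save(f"interp_{i}.png")
```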

A screenshot of the Riffusion website, which lets you type prompts and hear the resulting sonograms.

Riffusion can also merge styles. For example, typing “smooth tropical dance jazz” brings in elements of different genres for a novel result, encouraging experimentation by blending styles.

Of course, Riffusion isn’t the first AI-powered music generator. Earlier this year, Harmonai released Dance Diffusion, an AI-powered generative music model. OpenAI’s Jukebox, announced in 2020, also generates new music with a neural network. And sites like Soundraw create music non-stop on the fly.

Compared to those more streamlined AI music efforts, Riffusion feels more like the hobby project it is. The music it generates ranges from interesting to unintelligible, but it remains a notable application of latent diffusion technology that manipulates audio in visual space.

The Riffusion model checkpoint and code are available on GitHub.
