Stability AI Develops ‘Stable Audio’ Generative Text-to-Music

By Paula Parisi
September 15, 2023

Stability AI is launching Stable Audio, a music generation AI tool that uses latent diffusion to deliver what the company says is high-quality 44.1 kHz music for commercial use. Stable Audio uses a web-based interface to generate music from a text prompt and a desired duration. Because its latent diffusion architecture is conditioned on text metadata as well as audio file duration and start time, it overcomes a problem common to diffusion-based audio generation: output that amounts to an arbitrary section of a song, starting or ending in the middle of a phrase, rather than a cohesive musical segment.
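As a rough illustration of what conditioning on duration and start time can look like, the Python sketch below records, for a randomly chosen training window, where that window begins and how long the full track is, and builds the matching values for a user request. Every name here (TimingConditioning, pick_training_chunk, request_conditioning) is hypothetical and chosen for illustration only; this is a sketch under those assumptions, not Stability AI's implementation. The only figure taken from the article is the 44.1 kHz sample rate.

```python
import random
from dataclasses import dataclass

SAMPLE_RATE = 44_100  # Hz, the output rate cited in the article


@dataclass
class TimingConditioning:
    """Hypothetical container for the two timing values fed to the model."""
    seconds_start: float  # where the window begins within the full track
    seconds_total: float  # length of the full track


def pick_training_chunk(total_samples: int, chunk_samples: int, rng: random.Random):
    """Pick a random fixed-length window and note its position in the track."""
    if total_samples <= 0 or chunk_samples <= 0:
        raise ValueError("sample counts must be positive")
    # Tracks shorter than one window always start at sample 0.
    max_start = max(0, total_samples - chunk_samples)
    start = rng.randint(0, max_start)
    timing = TimingConditioning(
        seconds_start=start / SAMPLE_RATE,
        seconds_total=total_samples / SAMPLE_RATE,
    )
    return start, timing


def request_conditioning(duration_seconds: float) -> TimingConditioning:
    """At generation time, ask for a piece that starts at 0 and lasts the requested time."""
    return TimingConditioning(seconds_start=0.0, seconds_total=float(duration_seconds))
```

Under these assumptions, a model that sees both values during training can associate "start at 0, total of 20 seconds" with a complete short piece, which is one plausible reading of why the generated clip does not begin or end mid-phrase.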

Stability AI is offering both a basic free version of Stable Audio, which can be used to generate and download tracks of up to 20 seconds, and a Pro paid subscription that delivers 90-second tracks that are downloadable for use in commercial projects, the company explains in a news release. Technical specs of the latent audio diffusion model are included in a research paper.

“Stability says that Audio Diffusion’s underlying, roughly 1.2-billion-parameter model affords greater control over the content and length of synthesized audio than the generative music tools released before it,” according to TechCrunch, which reports that Stability released the new app “under pressure from investors to translate over $100 million in capital into revenue-generated products.”

The ROI challenge is not unique to Stability, which is also behind the open-source image generator Stable Diffusion, but its investors' pockets are not as deep as those of some of its competitors.

Stable Audio isn’t the company’s first stab at generative music. The London-based startup debuted something called Dance Diffusion a year ago but halted updates on the text-based song and sound effects generator. To create it, Stability funded a research organization called Harmonai that is now its music development division.

Stable Audio was developed by Stability researchers together with Harmonai, and the team took a different approach than it had with Dance Diffusion.

“Dance Diffusion generated short, random audio clips from a limited sound palette, and the user had to fine-tune the model themselves if they wanted any control,” whereas “Stable Audio can generate longer audio, and the user can guide generation using a text prompt and by setting the desired duration,” Stability AI VP of Audio Ed Newton-Rex told TechCrunch.

VentureBeat explains how the generative AI power of Stable Audio is different, “enabling users to create new music that goes beyond the repetitive notes that are common with MIDI and symbolic generation.”

The VentureBeat article goes on to discuss how Stable Audio cannot be used to create music “in the style of” known acts because it wasn’t trained on those tunes. Rather, it “used over 800,000 pieces of licensed music from audio library AudioSparx,” since creative musicians typically don’t want to copycat others’ style.

Given all the activity in the space, generative music is shaping up to be among the most competitive consumer-use AI sectors. Meta Platforms has two AI tools for music generation, AudioCraft and MusicGen. Google has MusicLM, currently in beta, and last month announced its YouTube subsidiary was launching a Music AI Incubator in conjunction with Universal Music Group.
