Stability AI has released an AI model, quick and lightweight enough to run on mobile devices, that generates stereo audio. Called Stable Audio Open Small, the open-source model is the result of a collaboration between the AI startup and chipmaker Arm. While several AI-powered apps generate audio (Suno and Udio among them), most rely on cloud processing and thus can't be used offline. Stability says Stable Audio Open Small is also IP-safe because it was trained entirely on audio from the royalty-free libraries Free Music Archive and Freesound.
“Stable Audio Open Small is 341 million parameters in size and optimized to run on Arm CPUs,” writes TechCrunch, explaining that it is “designed for quickly generating short audio samples and sound effects (e.g., drum and instrument riffs)” and “can produce up to 11 seconds of audio on a smartphone in less than 8 seconds.”
In a news release, Stability AI says the mobile-optimized text-to-audio model builds on its Stable Audio Open model, released in June 2024, but is “smaller and faster while preserving output quality and prompt adherence.” The earlier version has 1.1 billion parameters, while the new release is “significantly easier to run on consumer hardware,” per The Decoder.
Developers interested in incorporating Stable Audio Open Small into apps and tools can refer to the new Arm Learning Path, “which offers hands-on guidance using Stable Audio Open Small on Arm CPUs,” Stability AI says.
The new model “is based on a technique known as ‘Adversarial Relativistic-Contrastive’ (ARC), developed by researchers at the University of California, Berkeley and others,” writes The Decoder, noting that “on high-end hardware like an Nvidia H100 GPU, it can produce 44 kHz stereo audio in just 75 milliseconds — fast enough for near real-time generation.”
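The reported figures imply the model runs faster than real time on both classes of hardware. A quick back-of-envelope sketch in Python, using the article's numbers (44.1 kHz stereo, 11-second clips, under 8 seconds on a smartphone) and assuming the 75 ms H100 figure refers to a full 11-second clip:

```python
# Back-of-envelope math from the article's reported numbers.
# Assumptions: "44 kHz" means the standard 44.1 kHz rate, and the
# 75 ms H100 figure covers a full 11-second clip.

SAMPLE_RATE = 44_100  # Hz
CHANNELS = 2          # stereo
CLIP_SECONDS = 11     # maximum clip length on a smartphone

def clip_samples(seconds: float) -> int:
    """Total samples across both stereo channels for one clip."""
    return int(seconds * SAMPLE_RATE * CHANNELS)

def realtime_factor(audio_seconds: float, gen_seconds: float) -> float:
    """Seconds of audio produced per second of compute."""
    return audio_seconds / gen_seconds

print(clip_samples(CLIP_SECONDS))            # -> 970200
print(round(realtime_factor(11, 8), 2))      # -> 1.38 (on a phone)
print(round(realtime_factor(11, 0.075), 1))  # -> 146.7 (on an H100)
```

Even the on-device case comes out ahead of real time (about 1.4× at the quoted 8-second worst case), which is what makes offline, interactive use plausible.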
Stable Audio Open Small was announced at Mobile World Congress in March. The release version is free for commercial and non-commercial use under the permissive Stability AI Community License. The model weights are available for download on Hugging Face and the code is posted on GitHub.
“Stability AI says the model is especially good at generating sound effects and field recordings” and “still struggles with music, particularly with singing voices,” according to The Decoder, which says it works best with English-language prompts.
“As AI-driven creative media workloads move to the edge, smaller models help align compute resources with task complexity,” Stability AI says, referencing an MIT Technology Review piece on the trend.