Stability AI, a company best known for its AI-generated visuals, has launched a text-to-audio generative AI platform called Stable Audio.
Stable Audio uses a diffusion model, the same type of AI model that powers the company’s more popular image platform, Stable Diffusion, but trained on audio rather than images. Users can generate songs or background audio for any project.
Audio diffusion models tend to generate audio of a fixed length, which is terrible for music production since songs vary in length. Stability AI’s new platform lets users generate sounds of different lengths, which required the company to train the model on music paired with text metadata describing each clip’s start and end times within a song.
Previously, a model trained on 30-second clips could only generate 30 seconds of audio, often producing arbitrary sections of songs. Stability AI said tweaking the model now gives Stable Audio users more control over how long the generated track will be.
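Stability AI hasn’t published how this works under the hood, but as a rough illustration of the idea, here is a minimal, hypothetical PyTorch sketch of timing conditioning: a clip’s start time and the song’s total length are embedded into vectors and handed to the diffusion model alongside the text-prompt embedding. Every class and parameter name below is an assumption for illustration, not Stability AI’s actual code.

```python
# Hypothetical sketch: conditioning a diffusion model on where a training
# clip sits inside a longer song, so generation length becomes controllable.
import torch
import torch.nn as nn

class TimingConditioner(nn.Module):
    """Embeds a clip's start time and total song length (in seconds)."""
    def __init__(self, embed_dim: int = 128, max_seconds: float = 600.0):
        super().__init__()
        self.max_seconds = max_seconds
        # Two small MLPs map normalized scalar times to embedding vectors.
        self.start_mlp = nn.Sequential(
            nn.Linear(1, embed_dim), nn.SiLU(), nn.Linear(embed_dim, embed_dim))
        self.total_mlp = nn.Sequential(
            nn.Linear(1, embed_dim), nn.SiLU(), nn.Linear(embed_dim, embed_dim))

    def forward(self, seconds_start: torch.Tensor,
                seconds_total: torch.Tensor) -> torch.Tensor:
        # Normalize to [0, 1] so the network sees a bounded input range.
        s = (seconds_start / self.max_seconds).unsqueeze(-1)
        t = (seconds_total / self.max_seconds).unsqueeze(-1)
        # Concatenate both embeddings into one conditioning vector, which the
        # diffusion model would receive alongside the text embedding.
        return torch.cat([self.start_mlp(s), self.total_mlp(t)], dim=-1)

# During training, each fixed-length crop records where it came from,
# e.g. a 30-second window starting 95 s into a 180-second song.
cond = TimingConditioner()
timing = cond(torch.tensor([95.0]), torch.tensor([180.0]))
print(timing.shape)  # torch.Size([1, 256])

# At inference, requesting seconds_start=0 and seconds_total=90 would nudge
# the model toward a complete 90-second piece rather than an arbitrary
# mid-song excerpt.
```

Because the model has seen clips labeled with their position and the full song duration, asking it for a piece that starts at zero and runs the full requested length steers it toward complete tracks instead of fixed-length fragments.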
“Stable Audio represents the cutting-edge audio generation research by Stability AI’s generative audio research lab, Harmonai,” the company said in a statement. “We continue to improve our model architectures, datasets, and training procedures to improve output quality, controllability, inference speed, and output length.”
According to the company, it trained Stable Audio with “a dataset consisting of over 800,000 audio files containing music, sound effects, and single-instrument stems” and text metadata from stock music licensing company AudioSparx. The dataset represents more than 19,500 hours of sounds. By partnering with a licensing company, Stability AI says it has permission to use copyrighted material.
Stable Audio will have three pricing tiers: a free version that lets users create 20 tracks a month, each up to 45 seconds long; an $11.99 Professional level for 500 tracks of up to 90 seconds each; and an Enterprise subscription, through which companies can customize their usage and price. Audio made with the free version cannot be used commercially.
Text-to-audio generation is not new, as other big names in generative AI have been playing around with the concept. Meta released AudioCraft in August, a suite of generative AI models that help create natural-sounding speech, sound, and music from prompts. It is so far only available to researchers and some audio professionals. Google’s MusicLM also lets people generate sounds but is likewise only available to researchers.
As with other generative AI audio platforms, a big chunk of Stable Audio’s potential use cases lies in generating background music for podcasts or videos, speeding up those workflows.
Stability AI announced its plans to expand into audio, video, and 3D image generation last year.