HyperAI초신경

Stability AI and Arm Unveil Compact Text-to-Audio Model for Mobile Devices

6 hours ago

Stability AI and Arm have unveiled a compact text-to-audio model designed to run directly on modern smartphones. Named Stable Audio Open Small, the model can generate stereo audio clips up to 11 seconds long in approximately 7 seconds on a smartphone. On high-end hardware such as an Nvidia H100 GPU, it can produce 44.1 kHz stereo audio in just 75 milliseconds, enabling near real-time generation.

The full version of Stable Audio Open was released last year as a free, open-source model with 1.1 billion parameters. The new, smaller version has just 341 million parameters, which significantly reduces its computational requirements and makes it feasible to run on consumer-grade devices. The collaboration between Stability AI and Arm, first announced in March, was essential to achieving this milestone.

Redesigned for Mobile Hardware

To optimize the model for mobile devices, the team completely revamped its architecture. The system is now composed of three key components:

Autoencoder: Compresses the audio data, reducing the amount of storage needed.
Embedding Module: Interprets text prompts and converts them into a form the model can understand.
Diffusion Model: Generates the final audio output based on the interpreted text.

Even without distillation techniques, the new architecture cuts memory usage from 6.5 GB to 3.6 GB, which allows the model to run efficiently on smartphones. For development and testing, the team used the Vivo X200 Pro, an Android device equipped with 12 GB of RAM and a MediaTek Dimensity 9400 chip, which was introduced in late 2024.

Applications and Limitations

Stability AI says the model excels at generating sound effects and field recordings. However, it currently struggles with music, especially vocal tracks. The model performs best with English-language prompts, though efforts are ongoing to expand its capabilities.
The training dataset comprised around 472,000 audio clips sourced from the Freesound database. All clips were licensed under CC0, CC-BY, or CC-Sampling+ terms, a choice intended to keep the dataset compliant with copyright law. Automated checks were also used to filter out potentially infringing material.

Availability and Licensing

Stable Audio Open Small is available under the Stability AI Community License for open-source use. For commercial applications, separate licensing terms apply. Developers can access the code on GitHub, and the model weights are hosted on Hugging Face.

Example Outputs

To give users a better idea of its capabilities, several audio samples have been provided:

Seawaves
Soulful Calm Hip-Hop
Warm Arpeggios on House Beats with Drums and Effects
Synthwave Loop with Pulsating Bass and Dreamy Pads

These clips showcase the model's ability to create realistic and diverse audio, from natural sounds to music samples, although its limitations with more complex musical elements, such as vocals, are evident.

The release of Stable Audio Open Small marks a significant step toward democratizing text-to-audio technology by making it accessible to a broader audience through mobile devices. The advancement holds promise for applications ranging from content creation and gaming to augmented reality and educational tools.
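
For readers who want to put the headline figures in perspective, the short Python sketch below works through the arithmetic. It only uses numbers quoted in this article and measures nothing itself. The helper names are illustrative, and applying the 11-second clip length to the H100 timing is an assumption, since the article does not say how long the clip generated in 75 milliseconds was.

```python
# Figures as reported in the article; none of them are measured here.
CLIP_SECONDS = 11.0        # maximum clip length
PHONE_GEN_SECONDS = 7.0    # approximate generation time on a smartphone
H100_GEN_SECONDS = 0.075   # 75 ms on an Nvidia H100
FULL_PARAMS = 1.1e9        # Stable Audio Open
SMALL_PARAMS = 341e6       # Stable Audio Open Small
MEM_BEFORE_GB = 6.5
MEM_AFTER_GB = 3.6


def speed_vs_real_time(audio_seconds: float, generation_seconds: float) -> float:
    """Seconds of audio produced per second of compute (above 1 beats real time)."""
    return audio_seconds / generation_seconds


def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


if __name__ == "__main__":
    phone = speed_vs_real_time(CLIP_SECONDS, PHONE_GEN_SECONDS)
    # Assumption: the H100 figure also refers to an 11-second clip.
    h100 = speed_vs_real_time(CLIP_SECONDS, H100_GEN_SECONDS)
    print(f"Smartphone: {phone:.2f}x real time")
    print(f"H100 (assumed 11 s clip): {h100:.1f}x real time")
    print(f"Parameter reduction: {percent_reduction(FULL_PARAMS, SMALL_PARAMS):.1f}%")
    print(f"Memory reduction: {percent_reduction(MEM_BEFORE_GB, MEM_AFTER_GB):.1f}%")
```

On these figures, the smartphone generates audio roughly 1.6 times faster than playback, while the smaller model has about 69% fewer parameters and needs about 45% less memory than the stated 6.5 GB baseline.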
