
SongGen: A Fully Open-Source Single-Stage Auto-Regressive Transformer Designed for Controllable Song Generation

### Abstract: SongGen - A Single-Stage Auto-Regressive Transformer for Controllable Song Generation

**Introduction:** Generating songs from text is challenging because lyrics and melody must combine intricately to convey emotion. The difficulty is compounded by the scarcity of high-quality open-source data, which has historically limited research and development in the field. Traditional text-to-song approaches are often multi-stage, generating vocals and instrumental music separately, which introduces inefficiency and reduces control over the final output. To address these issues, the researchers introduce SongGen, a fully open-source single-stage auto-regressive transformer that generates songs controllably from text descriptions, lyrics, and an optional reference voice.

**Model Overview:** SongGen is an auto-regressive transformer decoder paired with a neural audio codec. It predicts audio token sequences that are then decoded into songs. The model supports two generation modes: Mixed Mode and Dual-Track Mode. In Mixed Mode, the X-Codec encodes raw audio into discrete tokens, and the training loss emphasizes earlier codebooks to improve vocal clarity. A variant, Mixed Pro, adds an auxiliary loss on vocals that further enhances their quality. Dual-Track Mode instead generates vocals and accompaniment as separate tracks, synchronizing them through either a Parallel or an Interleaving pattern: the Parallel pattern aligns the two tracks' tokens frame by frame, while the Interleaving pattern strengthens the interaction between vocals and accompaniment across layers.

**Conditioning and Data Processing:** To condition the model, lyrics are tokenized with a VoiceBPE tokenizer, voice features are extracted by a frozen MERT encoder, and text attributes are encoded with FLAN-T5. These embeddings guide the song generation process through cross-attention.
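The codebook-weighted objective described for Mixed Mode, and the auxiliary vocal term added by Mixed Pro, can be sketched as below. The decay factor, the weight normalization, and the `aux_weight` coefficient are illustrative assumptions, not the paper's exact values.

```python
def codebook_weights(n_codebooks, decay=0.5):
    """Per-codebook loss weights that emphasize earlier codebooks, which
    carry the coarser, perceptually dominant information (illustrative
    geometric decay, normalized to sum to 1)."""
    raw = [decay ** k for k in range(n_codebooks)]
    total = sum(raw)
    return [w / total for w in raw]

def mixed_pro_loss(mixed_losses, vocal_losses, aux_weight=0.1):
    """Mixed Pro objective sketch: codebook-weighted loss on the mixed
    track plus an auxiliary vocal-track loss (hypothetical aux_weight)."""
    w_mixed = codebook_weights(len(mixed_losses))
    w_vocal = codebook_weights(len(vocal_losses))
    mixed = sum(w * l for w, l in zip(w_mixed, mixed_losses))
    vocal = sum(w * l for w, l in zip(w_vocal, vocal_losses))
    return mixed + aux_weight * vocal
```

Dropping the auxiliary term recovers the plain Mixed Mode loss; the decreasing weights are what bias training toward the earlier, vocal-clarity-critical codebooks.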
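The two Dual-Track synchronization schemes can be illustrated with a toy token layout; the token values and sequence lengths here are hypothetical, and real tracks would carry one token per codebook per frame.

```python
def parallel_pattern(vocal_tokens, acc_tokens):
    """Parallel pattern: vocal and accompaniment tokens are aligned
    frame by frame and predicted together at each step."""
    return list(zip(vocal_tokens, acc_tokens))

def interleave_av(vocal_tokens, acc_tokens):
    """Interleaving (A-V) pattern: the accompaniment token precedes the
    vocal token within each frame, so the vocal prediction can attend
    to the same frame's accompaniment."""
    seq = []
    for v, a in zip(vocal_tokens, acc_tokens):
        seq.extend([a, v])
    return seq

vocal = ["v0", "v1", "v2"]
acc = ["a0", "a1", "a2"]
print(parallel_pattern(vocal, acc))  # [('v0', 'a0'), ('v1', 'a1'), ('v2', 'a2')]
print(interleave_av(vocal, acc))     # ['a0', 'v0', 'a1', 'v1', 'a2', 'v2']
```

Interleaving doubles the sequence length relative to the parallel layout, trading decoding cost for richer cross-track interaction.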
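The cross-attention conditioning can be sketched as a single scaled dot-product attention over the concatenated lyric, voice, and text-attribute embeddings. All dimensions below are illustrative, and the learned query/key/value projections of a real transformer layer are omitted for brevity.

```python
import numpy as np

def cross_attention(decoder_states, cond_memory):
    """Single-head scaled dot-product cross-attention (no learned
    projections, for illustration): each decoder frame attends over
    the conditioning memory and returns a conditioned context."""
    d = decoder_states.shape[-1]
    scores = decoder_states @ cond_memory.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ cond_memory

rng = np.random.default_rng(0)
lyric_emb = rng.standard_normal((6, 8))  # VoiceBPE lyric tokens (hypothetical sizes)
voice_emb = rng.standard_normal((2, 8))  # frozen MERT voice features
text_emb = rng.standard_normal((3, 8))   # FLAN-T5 text attributes
memory = np.concatenate([lyric_emb, voice_emb, text_emb])  # (11, 8) memory
decoder = rng.standard_normal((4, 8))    # 4 decoder frames
context = cross_attention(decoder, memory)  # (4, 8) conditioned context
```

Concatenating the three embedding streams into one memory is the simplest way to let every decoder step weigh lyrics, voice, and text attributes jointly.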
Addressing the lack of public text-to-song datasets, the researchers built an automated pipeline to process 8,000 hours of audio from various sources, with stringent filtering strategies ensuring data quality.

**Evaluation and Performance:** SongGen was evaluated against several existing models, including Stable Audio Open, MusicGen, Parler-TTS, and Suno. MusicGen generated only instrumental music, Stable Audio Open produced unclear vocals, and fine-tuning Parler-TTS for singing also proved ineffective. Despite using only 2,000 hours of labeled data, SongGen outperformed these baselines in text relevance and vocal control. Among its modes, Mixed Pro significantly improved vocal quality (VQ) and phoneme error rate (PER), while the Interleaving (A-V) dual-track pattern excelled in vocal quality but scored slightly lower on harmony (HAM). Attention analysis showed that SongGen effectively captures musical structure, maintaining coherence even without a reference voice. Ablation studies confirmed that high-quality fine-tuning (HQFT), curriculum learning (CL), and VoiceBPE-based lyric tokenization improve the model's stability and accuracy.

**Ethical Considerations:** While SongGen is a powerful tool for text-to-song generation, its ability to mimic voices raises ethical concerns: the potential for abuse through voice impersonation calls for protective measures. As foundational work, SongGen can serve as a baseline for future research, guiding improvements in audio quality, lyric alignment, and expressive singing synthesis while these ethical and legal challenges are addressed.

**Conclusion:** SongGen represents a significant advance in text-to-song generation by simplifying the process into a single-stage auto-regressive transformer.
Its open-source nature makes it accessible to beginners and experts alike, enabling precise control over vocal and instrumental components. The model's strong performance in text relevance and vocal control, together with its ability to capture musical structure, positions it as a valuable resource for future research and development in controllable text-to-song generation.

**Technical Details and Resources:** For more detailed information, readers can refer to the paper's technical details and the project's GitHub page. Credit goes to the researchers behind SongGen; interested readers are encouraged to follow related updates on the project's social media platforms and the ML SubReddit.

**Recommended Read:** For additional context, readers may want to explore the release of NEXUS by LG AI Research, an advanced system that integrates agent AI and data-compliance standards to address legal concerns in AI datasets.

**Author:** Divyesh Vitthal Jawkhede is a consulting intern at Marktechpost and a BTech student in Agricultural and Food Engineering at the Indian Institute of Technology, Kharagpur. A Data Science and Machine Learning enthusiast, he aims to apply these technologies to solving practical challenges in agriculture.
