
Lightweight Text-to-Speech Model Sopro Enables Zero-Shot Voice Cloning with Minimal Resources

Sopro is a lightweight text-to-speech (TTS) model developed as a personal project, designed to deliver fast, efficient speech synthesis with zero-shot voice cloning. The name "Sopro" comes from the Portuguese word for "breath" or "blow," reflecting the model's focus on natural, expressive speech generation.

Unlike many modern TTS systems that rely on large Transformer architectures, Sopro uses dilated convolutions, inspired by WaveNet, combined with lightweight cross-attention layers, making it efficient enough to deploy on modest hardware. While it does not reach state-of-the-art quality across all voices or scenarios, Sopro stands out for its low resource requirements: it was trained entirely on a single L40S GPU, demonstrating what efficient architectures can achieve with limited compute. The model is designed to be simple, fast, and extensible, with room for future improvement, especially through better training data and fine-tuning.

Key features include:

- Zero-shot voice cloning: the model can mimic a new voice from just a short reference audio sample.
- An efficient architecture that enables real-time or near real-time inference.
- A lightweight design well suited to edge devices and local deployment.

Installation is straightforward. The project is available via PyPI or directly from the GitHub repository. Minimal dependency versions are pinned to simplify setup, though performance can be tuned by selecting compatible versions of PyTorch; for instance, torch==2.6.0 without torchvision has been reported to run up to three times faster on M3-based Macs.

Users interact with Sopro through a command-line interface that exposes standard generation parameters such as temperature and top_p for controlling sampling randomness. The model supports non-streaming, streaming, and interactive streaming modes. For a live demo, users can run the web interface via Docker and access it at http://localhost:8000.
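A setup might look like the following. Note that this is a sketch: the exact PyPI package name and repository URL are assumptions, so check the project's README for the real identifiers. Only the torch==2.6.0 pin (without torchvision) is taken from the report above.

```shell
# Hypothetical package name -- verify against the project's README/PyPI page.
pip install sopro

# Or install straight from the repository (URL is a placeholder):
pip install git+https://github.com/<user>/sopro.git

# Optional: pin torch 2.6.0 without torchvision, reported up to ~3x faster
# on M3-based Macs.
pip install torch==2.6.0
```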
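To make the temperature and top_p parameters concrete, here is a minimal, self-contained sketch of temperature plus nucleus (top_p) sampling over a logit vector. This illustrates the general technique those CLI flags control; it is not Sopro's actual sampling code, and the function name is invented for illustration.

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_p=0.9, rng=None):
    """Generic temperature + nucleus (top_p) sampling, for illustration only."""
    rng = rng or np.random.default_rng()
    # Temperature rescales the logits: <1 sharpens the distribution
    # (more deterministic), >1 flattens it (more random).
    scaled = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # top_p keeps the smallest set of tokens whose cumulative
    # probability mass reaches top_p, then renormalizes.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept))
```

With a very low temperature the scaled distribution collapses onto the most likely token, which is why low-temperature generation sounds more deterministic.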
A few important caveats are worth noting. Due to storage constraints during training, the raw audio was discarded after preprocessing; the dataset was instead pre-tokenized with a neural codec, which may have cost some subtle vocal nuances. Future improvements could involve training directly on raw audio to enhance speaker embeddings and voice fidelity.

Additionally, inference is currently limited to approximately 32 seconds (400 frames), beyond which the model tends to hallucinate or degrade in quality. This limit can be adjusted, but stability diminishes with longer outputs.

The project was developed with minimal optimization: no extensive hyperparameter tuning or architectural refinement, leaving room for further enhancements. Potential improvements include caching convolutional states for faster inference and expanding support to additional languages.

AI tools were used during development primarily to organize code, assist with brainstorming, conduct ablation studies, and build the web demo. The creator welcomes contributions and feedback, and invites support through a Buy Me a Coffee link to help fund future improvements and access to more computational resources. The training data was curated from publicly available sources, and the project acknowledges the broader AI community for enabling such experimentation. Sopro exemplifies how innovative, lightweight models can emerge from constrained resources and passion-driven development.
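The stated numbers imply the codec's frame rate: 400 frames covering about 32 seconds works out to roughly 12.5 frames per second, or one frame every 80 ms. A quick check of that arithmetic (the constants are taken from the limits quoted above; the helper name is invented):

```python
# Limits quoted for the model: ~32 s of audio corresponds to 400 codec frames.
MAX_FRAMES = 400
MAX_SECONDS = 32

frames_per_second = MAX_FRAMES / MAX_SECONDS   # 12.5 frames/s
ms_per_frame = 1000 / frames_per_second        # 80.0 ms per frame

def seconds_to_frames(seconds):
    """Rough frame budget for a requested duration at this codec rate."""
    return int(seconds * frames_per_second)
```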
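The "caching convolutional states" idea mentioned above is the same trick used for WaveNet-style fast generation: a causal dilated convolution at step t only needs the input from `dilation` steps back, so a small ring buffer per layer replaces re-running the convolution over the whole history. A minimal single-channel sketch (kernel size 2, invented class name, not Sopro's actual code):

```python
class CachedDilatedConv:
    """Incremental causal dilated conv (kernel size 2) with a cached state.

    y[t] = w_prev * x[t - dilation] + w_curr * x[t], with zeros before t=0.
    Each step is O(1) instead of re-convolving the full history.
    """

    def __init__(self, dilation, w_prev, w_curr):
        self.dilation = dilation
        self.w_prev = w_prev                  # weight on x[t - dilation]
        self.w_curr = w_curr                  # weight on x[t]
        self.buffer = [0.0] * dilation        # ring buffer of past inputs
        self.pos = 0

    def step(self, x):
        past = self.buffer[self.pos]          # x[t - dilation] (0.0 early on)
        self.buffer[self.pos] = x             # store current input for later
        self.pos = (self.pos + 1) % self.dilation
        return self.w_prev * past + self.w_curr * x
```

Stacking such layers with growing dilations (1, 2, 4, ...) is what gives dilated-convolution models their large receptive field at low per-step cost.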
