HyperAI

OpenAudio-S1-mini: Efficient Text-to-Speech Generation Tool

1. Tutorial Introduction

OpenAudio-S1-mini is an open-source text-to-speech (TTS) model released by the Fish Audio team on May 26, 2025. It combines a neural network architecture that performs well on natural language processing tasks with multi-task learning and an advanced neural vocoder to achieve high-quality speech synthesis. The project supports a variety of mainstream languages, including Chinese, which makes it suitable for cross-language communication. With as little as 15 seconds of sample audio, it can clone a voice and generate speech that closely resembles the target speaker.

This tutorial runs on a single RTX 4090 GPU.

2. Project Examples

Text-to-speech 

3. Operation Steps

1. After starting the container, click the API address to enter the Web interface

2. Once you enter the webpage, you can use the model

If "Bad Gateway" is displayed, the model is still initializing. Because the model is large, wait about 1-2 minutes and then refresh the page. In the Safari browser, audio may not play directly in the page; download the file before playing it.
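
The wait-and-refresh step above can also be automated. The following is a minimal sketch, not part of the tutorial's own tooling: the function names `http_status` and `wait_until_ready`, the polling interval, and the example URL are all assumptions made for illustration. It treats any HTTP 5xx response (such as 502 Bad Gateway) as "still initializing" and retries until the page responds or a time limit is reached.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code of a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Raised for 4xx/5xx responses, e.g. 502 while the model is loading.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


def wait_until_ready(url, probe=http_status, interval=10.0, max_wait=180.0,
                     sleep=time.sleep):
    """Poll `url` until it stops returning a 5xx error.

    Returns True once the page responds, False if `max_wait` seconds of
    sleeping have elapsed without a successful response.
    """
    waited = 0.0
    while True:
        status = probe(url)
        if status is not None and status < 500:
            return True
        if waited >= max_wait:
            return False
        sleep(interval)
        waited += interval


# Example usage (the URL is a placeholder for the container's API address):
# ready = wait_until_ready("http://localhost:7860")
```

The `probe` and `sleep` arguments exist only so the loop can be exercised without a network connection; in normal use the defaults are sufficient.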

How to use

2.1 Text to Audio

Parameter Description:

  • Advanced Config:
    • Iterative Prompt Length: 0 disables iterative prompting. A non-zero value sets the length of the prompt text used in each step when speech is generated iteratively.
    • Maximum tokens per batch: 0 means unlimited. A non-zero value caps the number of tokens processed in each batch.
    • Top-P: Nucleus sampling probability, which controls the trade-off between diversity and determinism in the generated output.
    • Repetition Penalty: Penalty coefficient that controls how often repeated content appears in the generated output. The larger the value, the more strongly repetition is avoided.
    • Temperature: Temperature coefficient, which adjusts the randomness of the generated output. The larger the value, the more random the result.
    • Seed: Random seed. Fixing the seed fixes the random number sequence, which makes results reproducible.
  • Reference Audio:
    • Use Memory Cache: Select whether to use memory cache.
    • Reference Audio: Upload an audio file (WAV format) to be used as the reference voice.
    • Reference Text: Enter the transcript of the uploaded audio.
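
To make the sampling parameters under Advanced Config more concrete, here is a small, self-contained sketch of how Temperature, Top-P, Repetition Penalty, and Seed typically interact when a model picks its next token. It is a generic illustration of these techniques, not the project's actual implementation: the function name `sample_token`, the default values, and the exact form of the repetition penalty (divide positive logits, multiply negative ones) are assumptions.

```python
import math
import random


def sample_token(logits, previous_tokens, temperature=0.7, top_p=0.8,
                 repetition_penalty=1.1, rng=None):
    """Pick one token index from `logits` (illustrative defaults only).

    `temperature` must be greater than 0.
    """
    rng = rng or random.Random()
    adjusted = list(logits)

    # Repetition penalty: lower the score of tokens that were already used.
    for tok in set(previous_tokens):
        if adjusted[tok] > 0:
            adjusted[tok] /= repetition_penalty
        else:
            adjusted[tok] *= repetition_penalty

    # Temperature: values above 1 flatten the distribution (more random),
    # values below 1 sharpen it (more deterministic).
    scaled = [x / temperature for x in adjusted]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-P (nucleus sampling): keep the smallest set of most likely tokens
    # whose cumulative probability reaches `top_p`, then sample among them.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


# Seed: the same seed yields the same sequence of choices.
logits = [2.0, 1.0, 0.5, -1.0]
rng_a, rng_b = random.Random(42), random.Random(42)
run_a = [sample_token(logits, [0], rng=rng_a) for _ in range(10)]
run_b = [sample_token(logits, [0], rng=rng_b) for _ in range(10)]
print(run_a == run_b)
```

In this sketch, a very small Top-P reduces sampling to always choosing the most likely token, while a large Repetition Penalty can push a previously used token out of the candidate set entirely.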

4. Discussion

If you come across a high-quality project, please leave us a message to recommend it. We have also set up a tutorial discussion group. Scan the QR code and add the note [SD Tutorial] to join the group, discuss technical issues, and share your results.

Citation Information

The citation information for this project is as follows:

@misc{fish-speech-v1.4,
      title={Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis},
      author={Shijia Liao and Yuxuan Wang and Tianyu Li and Yifan Cheng and Ruoyi Zhang and Rongzhi Zhou and Yijin Xing},
      year={2024},
      eprint={2411.01156},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2411.01156},
}