HyperAI

CSM Conversational Speech Generation Model Demo

1. Tutorial Introduction

CSM (Conversational Speech Model) is a conversational speech model released by the Sesame team in 2025. It aims to improve the emotional interaction capabilities of voice assistants through natural, coherent speech generation. Built on a multimodal learning framework, the model combines text and audio data and uses an end-to-end Transformer architecture to directly generate natural, expressive speech. Given text and audio inputs, it produces RVQ audio codes: a Llama backbone processes the interleaved input, and a small audio decoder generates Mimi audio codes.

This tutorial uses the CSM-1B model to generate a two-person conversation (English only); the recommended compute resource is a single RTX 4090.

2. Operation Steps

1. After the container starts, click the API address to open the web interface

2. Select the speakers

3. Enter the dialogue text and run speech synthesis (English only)
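The steps above boil down to assigning each line of dialogue to a speaker and synthesizing the turns in order, with earlier turns serving as context so each voice stays consistent. The sketch below is a minimal, hypothetical illustration of that structure in plain Python; the `Turn` and `Conversation` names are assumptions for illustration, not part of the CSM codebase or the web interface.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    speaker: int  # speaker id: 0 or 1 in the two-person demo (hypothetical convention)
    text: str     # English text to synthesize (CSM-1B only supports English)

@dataclass
class Conversation:
    turns: list = field(default_factory=list)

    def add_turn(self, speaker: int, text: str) -> None:
        # The demo supports exactly two speakers
        if speaker not in (0, 1):
            raise ValueError("two-person demo: speaker must be 0 or 1")
        self.turns.append(Turn(speaker, text))

    def context_for_next(self) -> list:
        # Prior turns are passed as context when synthesizing the next turn,
        # which is how the model keeps each speaker's voice coherent
        return list(self.turns)

conv = Conversation()
conv.add_turn(0, "Hey, have you tried the new CSM demo?")
conv.add_turn(1, "Yes, the generated speech sounds very natural.")
```

In the actual web interface these steps are performed with form controls rather than code, but the underlying idea is the same: each synthesis request carries a speaker id, the text, and the accumulated conversation context.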

Exchange and Discussion

🖌️ If you come across a high-quality project, please leave us a message to recommend it! We have also set up a tutorial exchange group; scan the QR code and note [SD Tutorial] to join the group, discuss technical issues, and share your results↓