About
What is Fish Speech?
Fish Speech, powered by Fish Audio S2, is a state-of-the-art text-to-speech system designed to generate natural, realistic, and emotionally rich speech. S2 is trained on over 10 million hours of audio across approximately 50 languages and combines a Dual-Autoregressive architecture with Reinforcement Learning Alignment. It supports fine-grained inline control of prosody and emotion using natural-language tags, native multi-speaker and multi-turn generation, and rapid voice cloning from short reference samples. The system is optimized for production streaming via SGLang for efficient, low-latency inference, which suits applications that need high-quality voiceovers and audio content.
Best used for
Ideal for developers and content creators who need to integrate advanced text-to-speech capabilities into their applications, generate natural-sounding voiceovers with emotional control, and clone voices rapidly. Especially valuable for creating multilingual audio content and developing AI agents with realistic speech.
Common actions
Audio content, Content creation, AI voice, speech synthesis, voiceover, text to speech, voice generator, natural language processing
Capabilities
Key features
- Fine-grained inline control
- Dual-Autoregressive architecture
- Reinforcement Learning Alignment
- Production streaming via SGLang
- Multilingual text-to-speech
- Native multi-speaker generation
- Rapid voice cloning
Target Audience
developers, audio engineers, content creators, AI researchers
Integrations
Not yet documented
Pricing & Plans
Open Source · Likely Not Free
FAQs
What is the Fish Audio S2 model?
Fish Audio S2 is the latest text-to-speech model developed by Fish Audio. It is trained on over 10 million hours of audio across approximately 50 languages, combining Reinforcement Learning Alignment with a Dual-Autoregressive architecture to produce natural, realistic, and emotionally rich speech.
Can Fish Speech control speech emotions and prosody?
Yes, Fish Speech S2 allows for fine-grained inline control over speech generation. Users can embed natural-language instructions like '[laugh]', '[whispers]', or '[super happy]' directly within the text to influence prosody and emotion at specific word or phrase positions.
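As a rough illustration of what inline tags look like in practice, the sketch below scans a script for bracketed tags before it is sent for synthesis. It is plain text processing only and does not call Fish Speech. The helper names `list_tags` and `strip_tags` are hypothetical, and the assumption that every tag is a single pair of square brackets is inferred from the three examples above rather than from a documented grammar.

```python
import re

# Assumption: an inline tag is any non-nested [ ... ] span, as in the
# examples '[laugh]', '[whispers]', and '[super happy]'.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def list_tags(text: str) -> list[str]:
    """Return the inline tags found in the text, in order of appearance."""
    return [match.group(1).strip() for match in TAG_PATTERN.finditer(text)]


def strip_tags(text: str) -> str:
    """Return the text with tags removed and whitespace collapsed."""
    without_tags = TAG_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_tags).strip()


script = "[whispers] Don't wake them. [laugh] Too late!"
print(list_tags(script))
print(strip_tags(script))
```

A check like this can be useful for reviewing which instructions a script contains, or for producing a tag-free transcript for captions.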
Does Fish Speech support multiple languages and speakers?
Fish Speech S2 offers high-quality multilingual text-to-speech without requiring phonemes or language-specific preprocessing, supporting languages like English, Chinese, Japanese, and more. It also features native multi-speaker generation and multi-turn generation for complex dialogues.
How does Fish Speech handle voice cloning?
Fish Speech S2 supports rapid and accurate voice cloning using a short reference audio sample, typically 10–30 seconds. The model captures timbre, speaking style, and emotional tendencies to produce realistic and consistent cloned voices without additional fine-tuning.
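Before submitting a reference clip, it can help to confirm that its length falls in the typical range mentioned above. The following is a minimal sketch using only Python's standard `wave` module, so it handles uncompressed WAV files only. The function names and the 10 and 30 second bounds as hard limits are assumptions for illustration; the listing describes that range as typical, not as an enforced requirement.

```python
import wave

# Assumption: treat the "typical" 10 to 30 second range as a soft guideline.
MIN_SECONDS = 10.0
MAX_SECONDS = 30.0


def wav_duration_seconds(source) -> float:
    """Return the duration of a WAV file given a path or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_typical_reference_length(seconds: float) -> bool:
    """Return True if the duration falls within the typical reference range."""
    return MIN_SECONDS <= seconds <= MAX_SECONDS
```

For example, `is_typical_reference_length(wav_duration_seconds("reference.wav"))` would flag clips that are too short or too long, where `reference.wav` is a placeholder file name.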