Dia-1.6B is a 1.6 billion parameter text-to-speech model from Nari Labs that generates high-fidelity dialogue directly from transcripts. Designed for realistic vocal performance, Dia supports expressive features such as emotion and tone control, along with non-verbal cues like laughter, coughing, and sighs.

The model accepts speaker conditioning through audio prompts, enabling limited voice cloning and speaker consistency across generations. It is optimized for English and built for real-time performance on enterprise GPUs; CPU and quantized versions are planned. Transcripts use [S1]/[S2] tags to differentiate speakers, and the model integrates easily into Python workflows. While Dia is not tuned to a specific voice, a user-provided audio prompt can guide output style.

Licensed under Apache 2.0, Dia is intended for research and educational use, with explicit restrictions on misuse such as identity mimicry or deceptive content.
Features
- Realistic TTS from transcripts with speaker tagging ([S1]/[S2])
- Emotion and tone control via conditioning audio
- Supports non-verbal sounds like (laughs), (coughs), etc.
- Voice cloning through user-provided audio prompts
- Python API for simple text-to-audio generation
- Real-time performance on supported GPUs
- Planned CLI tool, PyPI package, and quantized version
- Licensed under Apache 2.0 with strict misuse policies
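Because the model consumes a single tagged transcript string, a small helper can assemble one from a list of speaker/line pairs. This is an illustrative sketch, not part of the Dia API: the `format_transcript` helper is an assumption, and only the [S1]/[S2] speaker tags and parenthesized non-verbal cues come from the format described above.

```python
def format_transcript(turns):
    """Join (speaker, text) pairs into a Dia-style tagged transcript.

    `speaker` is 1 or 2, mapped to the [S1]/[S2] tags that differentiate
    speakers; non-verbal cues like (laughs) are written inline in the text.
    """
    return " ".join(f"[S{speaker}] {text}" for speaker, text in turns)

dialogue = [
    (1, "Did you hear the news?"),
    (2, "No, what happened? (laughs)"),
    (1, "We finally shipped the release. (sighs)"),
]

transcript = format_transcript(dialogue)
print(transcript)
# [S1] Did you hear the news? [S2] No, what happened? (laughs) [S1] We finally shipped the release. (sighs)
```

The resulting string can then be passed to the model's Python API as a single prompt for text-to-audio generation.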