ZETTS: Towards Zero-shot and Emotion-controlled TTS

This demo showcases conditional timbre generation and conditional emotion generation from text.

"Seen" means the label or speaker appeared in the training dataset; "Unseen" means it did not.

Comparison with Other Models

{{k}}

Adaptation and Emotion Control

Original Utterance from {{spk_desc_list[j-1]}}:

{{!['Happy', 'Neutral', 'Sad', 'Suprise', 'Angry'].includes(e)? e + '(Unseen)': e}}
{{text_list[i-1]}}

Emotion Intensity Control (Angry)

{{((i-1)/10).toFixed(1)}}