This demo shows conditional timbre generation and conditional emotion generation from text.
"Seen" means the label or speaker appears in the training dataset; "Unseen" means it does not.