From Handcrafted Features to Implicitly-learned Representations: Creating the Next Generation of Expressive and Natural Speech Synthesis

Raul Fernandez - IBM Watson speech team

Modern speech-synthesis frameworks approach the problem of generating speech in two steps. First, a purely linguistic front-end module extracts from the text a series of representations thought to be relevant to the acoustic and prosodic realization of utterances. Then, one of a variety of back-end architectures exploits these representations to generate the output speech. In this talk I will present an overview of how this paradigm has evolved over time. I will start with the classical scenario, where features are “handcrafted” to reflect prior knowledge about the task, and where extracting them relies heavily on linguistic knowledge and/or the existence of annotated corpora and supervised learning techniques. I will then move toward modern approaches informed by large unlabeled corpora and deep-learning techniques, where the two-step paradigm ultimately breaks down in favor of a single holistic model that can learn the text-to-acoustic mapping directly. I will illustrate how these approaches have been implemented to solve a variety of speech-synthesis tasks (such as the generation of natural and expressive prosody), and discuss the pros and cons of these different ways of solving the problem.

If you would like to meet the speaker, please contact Ingmar Steiner.