Overview of speech synthesis technology
A text-to-speech system (or engine) is composed of two parts: a front end and a back end. Broadly, the front end takes input in the form of text and outputs a symbolic linguistic representation. The back end takes the symbolic linguistic representation as input and outputs the synthesized speech waveform. The naturalness of a speech synthesizer usually refers to how much the output sounds like the speech of a real person.
The front end has two major tasks. First, it converts raw text containing items such as numbers and abbreviations into their written-out word equivalents. This process is often called text normalization, pre-processing, or tokenization. Second, it assigns a phonetic transcription to each word, and divides and marks the text into prosodic units such as phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme (TTP) or grapheme-to-phoneme (GTP) conversion. Together, the phonetic transcriptions and the prosodic-unit information make up the symbolic linguistic representation that the front end outputs.
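The two front-end tasks can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names (normalize, to_phonemes, front_end), the tiny lookup tables, and the hand-written ARPAbet-style phoneme symbols do not come from any particular system. A real front end would also handle multi-digit numbers, ambiguous abbreviations, and prosodic phrasing, all of which are omitted here.

```python
import re

# Hypothetical, hand-written lookup tables (illustration only).
ABBREVIATIONS = {"dr.": "doctor", "st.": "street"}
DIGITS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}
# ARPAbet-style symbols without stress marks; a real lexicon is far larger.
LEXICON = {
    "doctor": ["D", "AA", "K", "T", "ER"],
    "smith": ["S", "M", "IH", "TH"],
    "has": ["HH", "AE", "Z"],
    "two": ["T", "UW"],
    "cats": ["K", "AE", "T", "S"],
}


def normalize(text):
    """Text normalization: expand abbreviations and single digits into words."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif token in DIGITS:
            words.append(DIGITS[token])
        else:
            # Strip punctuation; multi-digit numbers are not handled here.
            words.append(re.sub(r"[^a-z']", "", token))
    return [w for w in words if w]


def to_phonemes(words):
    """Text-to-phoneme conversion by dictionary lookup.

    Out-of-vocabulary words map to None; a real system would fall back
    to letter-to-sound rules or a trained model.
    """
    return [(word, LEXICON.get(word)) for word in words]


def front_end(text):
    """Run both front-end tasks and return (word, phonemes) pairs."""
    return to_phonemes(normalize(text))


if __name__ == "__main__":
    for word, phonemes in front_end("Dr. Smith has 2 cats."):
        print(word, phonemes)
```

Dictionary lookup is only one possible approach to the conversion step; it is used here because it keeps the sketch short and makes the out-of-vocabulary case explicit.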
The other part, the back end, takes the symbolic linguistic representation and converts it into actual sound output. The back end is often referred to as the synthesizer. The different techniques synthesizers use are described below.
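The interface of the back end can be sketched in the same way. The Python example below is only a placeholder under stated assumptions: the Phrase type, the back_end function, the sample rate, and the durations are all hypothetical, and the "synthesis" simply emits a fixed-length sine tone per phoneme followed by a pause at each phrase boundary. It shows the shape of the data flow (symbolic representation in, waveform samples out), not any real synthesis technique.

```python
import math
from dataclasses import dataclass

SAMPLE_RATE = 16000  # samples per second (assumed value)


@dataclass
class Phrase:
    """One prosodic unit of the symbolic linguistic representation."""
    phonemes: list  # phoneme symbols, e.g. ["HH", "AE", "Z"]


def back_end(phrases, phone_ms=80, pause_ms=200):
    """Placeholder back end: turn phrases into a list of waveform samples.

    Each phoneme becomes a sine tone whose pitch is derived arbitrarily
    from its symbol; each phrase boundary becomes a stretch of silence.
    A real synthesizer would produce speech sounds here instead.
    """
    phone_len = SAMPLE_RATE * phone_ms // 1000
    pause_len = SAMPLE_RATE * pause_ms // 1000
    samples = []
    for phrase in phrases:
        for symbol in phrase.phonemes:
            # Arbitrary, deterministic pitch between 120 Hz and 310 Hz.
            freq = 120.0 + 10.0 * (sum(map(ord, symbol)) % 20)
            samples.extend(
                math.sin(2.0 * math.pi * freq * t / SAMPLE_RATE)
                for t in range(phone_len)
            )
        samples.extend([0.0] * pause_len)  # pause at the phrase boundary
    return samples


if __name__ == "__main__":
    representation = [Phrase(["HH", "AE", "Z"]), Phrase(["T", "UW"])]
    waveform = back_end(representation)
    print(len(waveform) / SAMPLE_RATE, "seconds of audio")
```

Keeping the representation as plain data between the two stages means a front end and a back end can be developed or replaced independently, which is the division of labor described above.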
History
Long before modern electronic signal processing was invented, speech researchers tried to build machines to create human speech. Early examples of 'speaking heads' were made by Gerbert (d. 1003), Albertus Magnus (1198-1280), and Roger Bacon (1214-1294).
In 1779, Christian Kratzenstein of St. Petersburg built models of the human vocal tract that could produce the five long vowel sounds (a, e, i, o and u). This was followed by the bellows-operated 'Acoustic-Mechanical Speech Machine' by Wolfgang von Kempelen of Vienna, Austria, described in his 1791 paper Mechanismus der menschlichen Sprache nebst der Beschreibung seiner sprechenden Maschine (J.B. Degen, Wien). This machine added models of the tongue and lips, enabling it to produce consonants as well as vowels. In 1837 Charles Wheatstone produced a 'speaking machine' based on von Kempelen's design, and in 1857 M. Faber built the 'Euphonia'. Wheatstone's design was resurrected in 1923 by Paget.