
Speech synthesis

Speech synthesis is the generation of human speech without directly using a human voice.

Generally speaking, a speech synthesizer is software or hardware capable of rendering artificial speech.

Speech synthesis systems are often called text-to-speech (TTS) systems in reference to their ability to convert text into speech. However, there exist systems that can only render symbolic linguistic representations like phonetic transcriptions into speech.

Table of contents
1 Overview of speech synthesis technology
2 History
3 Synthesizer technologies
4 Text-to-phoneme challenges
5 Front-end challenges
6 Examples of current systems
7 Speech synthesis markup languages
8 External links

Overview of speech synthesis technology

A text-to-speech system (or engine) is composed of two parts: a front end and a back end. Broadly, the front end takes input in the form of text and outputs a symbolic linguistic representation. The back end takes the symbolic linguistic representation as input and outputs the synthesized speech waveform. The naturalness of a speech synthesizer usually refers to how much the output sounds like the speech of a real person.

The front end has two major tasks. First, it takes the raw text and converts things like numbers and abbreviations into their written-out word equivalents; this process is often called text normalization, pre-processing, or tokenization. It then assigns phonetic transcriptions to each word and divides and marks the text into prosodic units such as phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme (TTP) or grapheme-to-phoneme (GTP) conversion. Together, the phonetic transcriptions and the prosodic information make up the symbolic linguistic representation that the front end outputs.
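
The two front-end tasks can be illustrated with a minimal sketch in Python. Everything here is a toy stand-in: the digit table, the spell-out fallback used in place of a real grapheme-to-phoneme step, and the single prosodic marker are invented for this example and do not come from any particular TTS system.

    import re

    DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
                   "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

    def normalize(text):
        """Text normalization: split into tokens and expand single digits."""
        tokens = re.findall(r"[A-Za-z]+|\d", text)
        return [DIGIT_WORDS.get(t, t.lower()) for t in tokens]

    def grapheme_to_phoneme(word):
        """Placeholder for the TTP/GTP step: just spell the word out.
        A real system would use a pronunciation dictionary and/or rules."""
        return " ".join(word.upper())

    def front_end(sentence):
        """Return a symbolic linguistic representation: per-word phoneme
        strings plus a crude sentence-level prosodic marker."""
        words = normalize(sentence)
        marker = "question" if sentence.rstrip().endswith("?") else "statement"
        return {"phonemes": [grapheme_to_phoneme(w) for w in words],
                "prosody": marker}

    print(front_end("Is gate 7 open?"))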

The other part, the back end, takes the symbolic linguistic representation and converts it into actual sound output. The back end is often referred to as the synthesizer. The different techniques synthesizers use are described below.

History

Long before modern electronic signal processing was invented, speech researchers tried to build machines to create human speech. Early examples of 'speaking heads' were made by Gerbert (d. 1003), Albertus Magnus (1198-1280), and Roger Bacon (1214-1294).

In 1779, Christian Kratzenstein of St. Petersburg built models of the human vocal tract that could produce the five long vowel sounds (a, e, i, o and u). This was followed by the bellows-operated 'Acoustic-Mechanical Speech Machine' by Wolfgang von Kempelen of Vienna, Austria, described in his 1791 paper Mechanismus der menschlichen Sprache nebst der Beschreibung seiner sprechenden Maschine (J.B. Degen, Wien). This machine added models of the tongue and lips, enabling it to produce consonants as well as vowels. In 1837 Charles Wheatstone produced a 'speaking machine' based on von Kempelen's design, and in 1857 M. Faber built the 'Euphonia'. Wheatstone's design was resurrected in 1923 by Paget.

(Image: the Bell Labs VODER)
In the 1930s, Bell Labs developed the VOCODER, a keyboard-operated electronic speech analyzer and synthesizer that was said to be clearly intelligible. Homer Dudley refined this device into the VODER, which he exhibited at the New York World's Fair of 1939.

Early electronic speech synthesizers sounded very robotic and were often barely intelligible. Output from contemporary TTS systems, by contrast, is sometimes difficult to distinguish from real human speech.

Despite the success of electronic speech synthesis, research is still being conducted into mechanical speech synthesizers for use in humanoid robots. Even a perfect electronic synthesizer is limited by the quality of the transducer (usually a loudspeaker) that produces the sound, so in a robot a mechanical system may be able to produce a more natural sound than a small loudspeaker.

See also: Dennis Klatt's History of Speech Synthesis.

Reference: History and Development of Speech Synthesis, from Helsinki University of Technology Laboratory of Acoustics and Audio Signal Processing (2003-09-13)

Synthesizer technologies

There are two main technologies used for generating synthetic speech waveforms: concatenative synthesis and formant synthesis.

Concatenative synthesis

Concatenative synthesis is based on the concatenation (stringing together) of segments of recorded speech. It generally produces the most natural-sounding synthesized speech, but natural variation in speech and the automated techniques used to segment the waveforms sometimes result in audible glitches that detract from that naturalness. There are three main subtypes of concatenative synthesis: unit selection synthesis, which searches a large database of recorded speech for the best-matching units; diphone synthesis, which uses a minimal database containing one recording of each sound-to-sound transition (diphone) in the language; and domain-specific synthesis, which concatenates prerecorded words and phrases for applications with a limited vocabulary, such as talking clocks.
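
At its simplest, the concatenation step just joins prerecorded waveform segments end to end; real systems also choose the best-matching units and smooth the joins. A minimal sketch in Python, where the unit file names are hypothetical and all files are assumed to share the same sample rate and format:

    import wave

    def concatenate_units(unit_files, out_path):
        """Join prerecorded speech units (WAV files with identical formats)
        end to end. Real systems also select among candidate units and
        smooth the joins to hide discontinuities."""
        frames = []
        params = None
        for path in unit_files:
            with wave.open(path, "rb") as w:
                if params is None:
                    params = w.getparams()
                frames.append(w.readframes(w.getnframes()))
        with wave.open(out_path, "wb") as out:
            out.setparams(params)
            for chunk in frames:
                out.writeframes(chunk)

    # Hypothetical diphone recordings strung together into a word:
    concatenate_units(["h-e.wav", "e-l.wav", "l-o.wav"], "hello.wav")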

Formant synthesis

Formant synthesis does not use any human speech samples at runtime. Instead, the synthesized speech is created from an acoustic model: parameters such as fundamental frequency, voicing, and noise levels are varied over time to create a waveform of artificial speech. This method is sometimes called rule-based synthesis, although many concatenative systems also use rule-based components for parts of the pipeline, such as the front end, so some argue the term is not specific enough.
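
The idea can be sketched with a source-filter model in Python: an impulse-train "glottal" source is passed through cascaded second-order resonators tuned to formant frequencies. The formant and bandwidth values below are rough textbook figures for an /a/-like vowel, not parameters from any particular synthesizer, and the result is only a crude steady vowel rather than speech.

    import math, wave, struct

    RATE = 16000

    def resonator(signal, freq, bw):
        """Second-order digital resonator, the basic building block of
        formant synthesizers."""
        T = 1.0 / RATE
        c = -math.exp(-2 * math.pi * bw * T)
        b = 2 * math.exp(-math.pi * bw * T) * math.cos(2 * math.pi * freq * T)
        a = 1.0 - b - c
        y1 = y2 = 0.0
        out = []
        for x in signal:
            y = a * x + b * y1 + c * y2
            out.append(y)
            y2, y1 = y1, y
        return out

    def synthesize_vowel(f0=120, formants=((730, 90), (1090, 110), (2440, 170)),
                         duration=0.5):
        """Crude /a/-like vowel: impulse-train source through cascaded
        formant resonators, then normalized to a peak of 1.0."""
        n = int(duration * RATE)
        period = int(RATE / f0)
        source = [1.0 if i % period == 0 else 0.0 for i in range(n)]
        signal = source
        for freq, bw in formants:
            signal = resonator(signal, freq, bw)
        peak = max(abs(s) for s in signal) or 1.0
        return [s / peak for s in signal]

    samples = synthesize_vowel()
    with wave.open("vowel.wav", "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(RATE)
        w.writeframes(b"".join(struct.pack("<h", int(s * 32000)) for s in samples))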

Many systems based on formant synthesis technology generate artificial, robotic-sounding speech, and the output would never be mistaken for the speech of a real human. However, maximum naturalness is not always the goal of a speech synthesis system, and formant synthesis systems have some advantages over concatenative systems.

First, formant-synthesized speech can be reliably intelligible even at very high speeds, avoiding the acoustic glitches that often plague concatenative systems; high-speed synthesized speech is used by the visually impaired to navigate computers quickly with a screen reader. Second, formant synthesizers are usually smaller programs than concatenative systems because they carry no database of speech samples, so they can be used in embedded systems where memory and processor power are scarce. Finally, because formant-based systems have total control over every aspect of the output speech, they can produce a wide variety of prosody and intonation, conveying not just questions and statements but a range of emotions and tones of voice.

Other synthesis methods

Text-to-phoneme challenges

Speech synthesis systems use two basic approaches to determine the pronunciation of a word. The dictionary-based approach stores a large lexicon containing words and their correct pronunciations; it is fast and accurate, but it fails for any word not in the lexicon, and the lexicon itself can be large. The rule-based approach applies letter-to-sound rules to derive a pronunciation, which works for arbitrary input but becomes complicated for languages such as English with many irregular spellings. In practice, most systems combine the two, using a dictionary for common and irregular words and rules as a fallback for everything else.
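
A minimal sketch of the combined approach in Python; the three-word lexicon and the single-letter "rules" below are deliberately crude stand-ins invented for illustration, not data from a real system.

    # Dictionary lookup with a rule-based fallback.
    LEXICON = {
        "one": "W AH N",
        "two": "T UW",
        "colonel": "K ER N AH L",   # irregular spelling: letter rules would fail here
    }

    LETTER_RULES = {
        "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F",
        "g": "G", "h": "HH", "i": "IH", "k": "K", "l": "L", "m": "M",
        "n": "N", "o": "AA", "p": "P", "r": "R", "s": "S", "t": "T",
        "u": "AH",
    }

    def to_phonemes(word):
        word = word.lower()
        if word in LEXICON:                       # dictionary-based path
            return LEXICON[word]
        return " ".join(LETTER_RULES[ch] for ch in word if ch in LETTER_RULES)

    print(to_phonemes("colonel"))   # from the dictionary
    print(to_phonemes("drum"))      # from the letter rules: D R AH M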

Front-end challenges

The process of normalizing text is rarely straightforward. Texts are full of homographs, numbers and abbreviations that all ultimately require expansion into a phonetic representation.

There are many words in English that are pronounced differently based on context. For example, "read" is pronounced one way in "I will read the book" and another in "She read the book yesterday", and "record" is stressed differently as a noun ("a record") and as a verb ("to record").

Since most TTS systems do not generate semantic representations of their input texts, various techniques are used to guess the proper way to disambiguate homographs, like looking at neighboring words and using statistics about frequency of occurrence.
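
A toy illustration of the neighboring-word idea in Python; the cue-word list is invented for this example, and real systems use far richer statistical models trained on large corpora.

    PAST_TENSE_CUES = {"yesterday", "already", "had", "has", "was"}

    def pronounce_read(words, i):
        """Choose a pronunciation for 'read' by looking at nearby words."""
        window = {w.lower() for w in words[max(0, i - 3):i + 4]}
        return "R EH D" if window & PAST_TENSE_CUES else "R IY D"

    sentence = "Yesterday she read the book".split()
    print(pronounce_read(sentence, sentence.index("read")))   # R EH D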

Deciding how to convert numbers is another problem TTS systems have to address. It is a fairly simple programming challenge to convert a number into words, like 1325 becoming "one thousand three hundred twenty-five". However, numbers occur in many different contexts in texts, and 1325 should probably be read as "thirteen twenty-five" when part of an address (1325 Main St.) and as "one three two five" if it is the last four digits of a social security number. Often a TTS system can infer how to expand a number based on surrounding words, numbers, and punctuation, and sometimes the systems provide a way to specify the type of context if it is ambiguous.
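
A sketch of context-dependent number expansion in Python. The context labels ("cardinal", "address", "digits") and the word tables are illustrative only, and the cardinal reading handles just 0-9999.

    ONES = ["zero", "one", "two", "three", "four",
            "five", "six", "seven", "eight", "nine"]
    TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
             "sixteen", "seventeen", "eighteen", "nineteen"]
    TENS = ["", "", "twenty", "thirty", "forty", "fifty",
            "sixty", "seventy", "eighty", "ninety"]

    def two_digits(n):
        if n < 10:
            return ONES[n]
        if n < 20:
            return TEENS[n - 10]
        word = TENS[n // 10]
        return word if n % 10 == 0 else word + "-" + ONES[n % 10]

    def expand_number(text, context="cardinal"):
        n = int(text)
        if context == "digits":                       # e.g. last four digits of an ID
            return " ".join(ONES[int(d)] for d in text)
        if context == "address" and 1000 <= n <= 9999:
            return two_digits(n // 100) + " " + two_digits(n % 100)
        words = []                                    # plain cardinal reading
        if n >= 1000:
            words.append(ONES[n // 1000] + " thousand")
            n %= 1000
        if n >= 100:
            words.append(ONES[n // 100] + " hundred")
            n %= 100
        if n or not words:
            words.append(two_digits(n))
        return " ".join(words)

    print(expand_number("1325"))              # one thousand three hundred twenty-five
    print(expand_number("1325", "address"))   # thirteen twenty-five
    print(expand_number("1325", "digits"))    # one three two five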

Similarly, abbreviations can be ambiguous even when each expansion is easy on its own. "etc." is always rendered "et cetera", but "in." may be a unit or a word, as in "Yesterday it rained 3 in. Take 1 out, then put 3 in.", and "St." may be "Saint" or "Street", as in "St. John St." TTS systems with intelligent front ends make educated guesses about ambiguous abbreviations, while simpler systems expand them the same way in every context, sometimes producing nonsensical but comical output such as "Yesterday it rained three in." or "Take one out, then put three inches."
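
A toy sketch of context-sensitive abbreviation expansion in Python; the two rules below are invented for illustration, and real front ends use much larger rule sets or trained classifiers.

    def expand_in(words, i):
        """Expand 'in' to 'inches' only when it follows a number."""
        prev = words[i - 1] if i > 0 else ""
        return "inches" if prev.isdigit() else "in"

    def expand_st(words, i):
        """'St' is 'Saint' before a capitalized name, otherwise 'Street'."""
        nxt = words[i + 1] if i + 1 < len(words) else ""
        return "Saint" if nxt[:1].isupper() else "Street"

    tokens = "Yesterday it rained 3 in.".rstrip(".").split()
    print([expand_in(tokens, i) if t == "in" else t
           for i, t in enumerate(tokens)])
    # ['Yesterday', 'it', 'rained', '3', 'inches']

    st_tokens = "St. John St.".replace(".", "").split()
    print([expand_st(st_tokens, i) if t == "St" else t
           for i, t in enumerate(st_tokens)])
    # ['Saint', 'John', 'Street']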

Examples of current systems

Some freely available text-to-speech systems:

Several very natural-sounding commercial concatenative TTS systems offer online demos; all of them support US English, and most offer other languages as well. ASY is an articulatory synthesis program developed at Haskins Laboratories.

The Klatt Synthesizer, developed in 1980 by Dennis Klatt, is a cascade/parallel formant synthesizer whose basic approach still serves as the waveform synthesizer of many formant synthesis systems.

Well-known external hardware devices:

Speech synthesis markup languages

A number of markup languages for the rendition of text as speech in an XML-compliant format have been established, most recently SSML, proposed by the W3C (still in draft status at the time of this writing). Older speech synthesis markup languages include SABLE and JSML. Although each of these was proposed as a standard, none of them has been widely adopted.

Speech synthesis markup languages should be distinguished from dialogue markup languages such as VoiceXML, which includes, in addition to text-to-speech markup, tags related to speech recognition, dialogue management and touchtone dialing.

External links

See also: speech processing, speech recognition