Speech synthesis

The development of speech synthesis

As early as the late 18th century, researchers were attempting to reproduce human speech by machine. In 1937, the American Homer Dudley succeeded for the first time in reconstructing spoken utterances electronically with the help of a vocoder. Synthesis systems with phonetic input were developed in the early 1950s. About 20 years later, the first fully text-driven systems were available. Since then, the technology has been developed continuously, with a particular focus on optimising system structure and output quality.

Text-to-Speech (TTS) programs

Text-to-speech systems were developed first and foremost to make everyday life easier for people with impairments. Devices with speech output, such as computers, watches and dictionaries, give people with visual or reading difficulties access to content they could not otherwise use. For people with speech impairments, a speech synthesis system can provide an artificial voice. TTS systems are also used on customer portals, in infotainment and in interaction with machines and robots.

How do TTS systems work?

A text-to-speech system converts written text into speech in a two-step process. In the first step, the program analyses the input text linguistically so that it can be pronounced correctly; in the second step, the analysed content is converted into a synthetic speech signal. The software that converts writing into speech is called a speech synthesiser.

Quality features of the TTS software

Speech synthesis aims to make the speech output as close to human speech as possible. The decisive quality features of a system are intelligibility and naturalness. Naturalness depends above all on prosody, the sentence melody, which should sound as natural as possible. Prosody covers the characteristic features of a language such as intonation, rhythm and pauses in speech, and it is very difficult to produce. For this reason, computer voices can still be distinguished from human voices.

Different approaches to speech synthesis

To convert text into speech sounds (phonemes), there are two basic methods of determining the pronunciation of a word from its spelling: the rule-based approach and the lexicon-based approach. Each has its own advantages and disadvantages, which is why most TTS systems combine the two.
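One common way to combine the two approaches is to consult the lexicon first and fall back to rules only for words it does not contain. The following is a minimal sketch of that idea, not a description of any particular TTS product. The names `LEXICON`, `spell_out` and `pronounce` are hypothetical, the phoneme symbols are ARPAbet-style and purely illustrative, and `spell_out` is only a stand-in for a real rule set.

```python
# Hypothetical mini-lexicon for irregular words (ARPAbet-style symbols).
LEXICON = {
    "colonel": "k er n ah l",
    "one": "w ah n",
}


def spell_out(word):
    """Stand-in for a real rule set: maps every letter to itself."""
    return " ".join(word)


def pronounce(word, lexicon=LEXICON, fallback=spell_out):
    """Return (pronunciation, source): lexicon first, rules as fallback."""
    key = word.lower()
    if key in lexicon:
        return lexicon[key], "lexicon"
    return fallback(key), "rules"


for w in ["Colonel", "cat"]:
    print(w, pronounce(w))
```

The second element of the returned tuple records which path produced the result, which makes it easy to see how often the fallback is used.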

Rule-based speech synthesis

As the name suggests, this approach is based on pronunciation rules combined with a list of exceptions. It is particularly reliable for languages with a very regular correspondence between spelling and sound, such as French and Spanish. Rule-based TTS systems produce a pronunciation for every input word but have some difficulty with foreign words and abbreviations.
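The idea of rules plus an exception list can be sketched in a few lines. The rule table below is a hypothetical, heavily simplified set loosely inspired by Spanish spelling, and the output uses made-up ASCII phoneme symbols; `RULES`, `EXCEPTIONS` and `rule_based_g2p` are illustrative names, not part of any real library.

```python
# Hypothetical letter-to-sound rules; multi-letter spellings come first
# so that "ch" is matched before the single letter "c".
RULES = [
    ("ch", "tS"),
    ("qu", "k"),
    ("h", ""),    # silent letter
    ("v", "b"),
    ("z", "s"),
    ("c", "k"),
]

# Exception list for words the rules would get wrong (e.g. foreign words).
EXCEPTIONS = {"whisky": "wiski"}


def rule_based_g2p(word):
    """Convert a word to a phoneme string using exceptions, then rules."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    out = []
    i = 0
    while i < len(word):
        for spelling, sound in RULES:
            if word.startswith(spelling, i):
                out.append(sound)
                i += len(spelling)
                break
        else:
            # No rule applies: the letter stands for itself.
            out.append(word[i])
            i += 1
    return "".join(out)


for w in ["chico", "hola", "queso", "whisky"]:
    print(w, "->", rule_based_g2p(w))
```

Because every letter has at least the default mapping, the function returns something for any input, which mirrors the strength of the approach. Foreign words only come out right if someone has added them to the exception list.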

Lexicon-based speech synthesis

The lexicon-based (or dictionary-based) approach works like a large dictionary containing the words of a language and their correct pronunciations. When a text is entered, each word is looked up individually, which is fast and very accurate. It becomes problematic only if a word is not in the dictionary: in that case, no result is generated.
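A lexicon lookup can be illustrated with an ordinary dictionary. This is a minimal sketch under stated assumptions: the three-entry `LEXICON` is hypothetical (a real one would cover a language's vocabulary), the phoneme symbols are ARPAbet-style and illustrative, and `lookup` and `transcribe` are names chosen for this example.

```python
from typing import List, Optional, Tuple

# Hypothetical mini-lexicon mapping words to phoneme strings.
LEXICON = {
    "speech": "s p iy ch",
    "text": "t eh k s t",
    "to": "t uw",
}


def lookup(word: str) -> Optional[str]:
    """Return the stored pronunciation, or None if the word is unknown."""
    return LEXICON.get(word.lower())


def transcribe(text: str) -> List[Tuple[str, Optional[str]]]:
    """Look up each whitespace-separated word individually."""
    return [(word, lookup(word)) for word in text.split()]


for word, phonemes in transcribe("Text to speech synthesiser"):
    print(word, "->", phonemes)
```

The last word is not in the sample lexicon, so its lookup yields `None`: the dictionary is quick and exact for known words but has nothing to offer for unknown ones.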
