Speech synthesis is the artificial generation of human speech. It is usually carried out by a text-to-speech (TTS) system, i.e. a device or computer program that converts written text into acoustic signals. Speech synthesis is used, among other things, to give visually impaired people access to written content.
As early as the late 18th century, researchers were attempting to reproduce human speech by machine. In 1937, the American Homer Dudley succeeded for the first time in reconstructing spoken utterances electronically with the help of a vocoder. Synthesis systems with phonetic input were developed in the early 1950s. About 20 years later, the first fully text-driven systems were available. Since then, the technology has been developed continuously, with a particular focus on optimising system structure and output quality.
Text-to-Speech (TTS) programs
Text-to-speech systems were developed first and foremost to make everyday life easier for people with impairments. Devices with speech output, such as computers, watches and dictionaries, give people with visual or reading difficulties access to content they could not otherwise use. For people with speech impairments, a speech synthesis system can provide an artificial voice. TTS systems are also used on customer portals, in infotainment and in interaction with machines and robots.
How do TTS systems work?
A text-to-speech system converts written text into speech in a two-step process. For correct pronunciation, the program analyses the input text in the first step from a linguistic point of view before the content is converted into a synthetic speech signal in the second step. The software used to convert writing into speech is called a speech synthesiser.
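The two-step structure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a working synthesiser: the function names `analyse` and `synthesise`, the abbreviation table and the sample rate are all hypothetical, the first step is reduced to text normalisation and tokenisation, and the second step only emits a placeholder tone per word where a real system would generate speech.

```python
import math
import re

SAMPLE_RATE = 16_000  # samples per second (assumed value for this sketch)

# Hypothetical abbreviation table; a real front end would be far larger.
ABBREVIATIONS = {"dr.": "doctor", "e.g.": "for example"}


def analyse(text: str) -> list[str]:
    """Step 1: linguistic analysis, reduced here to normalisation and tokenisation."""
    words = []
    for token in text.lower().split():
        token = ABBREVIATIONS.get(token, token)      # expand known abbreviations
        words.extend(re.findall(r"[a-z']+", token))  # drop punctuation
    return words


def synthesise(words: list[str], ms_per_word: int = 200) -> list[float]:
    """Step 2: signal generation placeholder (one sine tone per word, not real speech)."""
    samples_per_word = SAMPLE_RATE * ms_per_word // 1000
    samples = []
    for index, _word in enumerate(words):
        frequency = 200.0 + 20.0 * index  # arbitrary pitch so words are distinguishable
        samples.extend(
            math.sin(2 * math.pi * frequency * t / SAMPLE_RATE)
            for t in range(samples_per_word)
        )
    return samples


words = analyse("Dr. Smith reads e.g. books")
signal = synthesise(words)
```

The point of the sketch is the interface between the two steps: the first produces a cleaned-up linguistic representation, and the second consumes it to produce a sequence of audio samples.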
Quality features of the TTS software
Speech synthesis aims to make the speech output as close to human speech as possible. The decisive quality criteria for a system are intelligibility and naturalness. Naturalness depends largely on prosody, the melody of a sentence, which covers characteristic features of a language such as intonation, rhythm and pauses in speech. Natural-sounding prosody is very difficult to produce, which is why computer voices can often still be distinguished from human voices.
Different approaches to speech synthesis
To convert text into speech sounds (phonemes), there are two basic methods of determining the pronunciation of a word from its spelling: the rule-based approach and the lexicon-based approach. Each has its own advantages and disadvantages, which is why most TTS systems combine the two.
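One common way to combine the two methods is to consult a lexicon first and fall back to rules only for words it does not contain. The sketch below assumes exactly that ordering; the names `hybrid_pronounce` and `spell_out` are hypothetical, the lexicon holds a single ARPAbet-style entry for illustration, and the "rules" are a deliberately naive stand-in that maps each letter to itself.

```python
# Tiny illustrative lexicon with ARPAbet-style symbols (stress marks omitted).
LEXICON = {"colonel": ["K", "ER", "N", "AH", "L"]}


def spell_out(word: str) -> list[str]:
    """Naive stand-in for real pronunciation rules: one symbol per letter."""
    return [letter.upper() for letter in word if letter.isalpha()]


def hybrid_pronounce(word: str, lexicon=LEXICON, rules=spell_out) -> tuple[list[str], str]:
    """Return the phonemes and which method produced them."""
    key = word.lower()
    if key in lexicon:
        return lexicon[key], "lexicon"  # irregular words are handled accurately
    return rules(key), "rules"          # unknown words still get some output


for word in ("Colonel", "blip"):
    print(word, hybrid_pronounce(word))
```

Returning the source alongside the phonemes makes it easy to see which words were covered by the lexicon and which relied on the fallback.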
Rule-based speech synthesis
As the name suggests, this approach is based on pronunciation rules combined with a list of exceptions. It is particularly reliable for languages with a very regular spelling-to-sound correspondence, such as French and Spanish. Rule-based TTS systems produce a pronunciation for every input but have some difficulty with foreign words and abbreviations.
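As a rough illustration, rules of this kind can be written as an ordered list of spelling-to-sound mappings that is checked from left to right through the word, with an exception list consulted first. The rule set below is a toy loosely modelled on Spanish spelling and is far from complete (for example, it ignores that "c" sounds different before "e" and "i"); the names `RULES`, `EXCEPTIONS` and `rule_based_pronounce`, and the exception entry itself, are assumptions made for this sketch.

```python
# Ordered rules: multi-letter spellings come first so "ch" wins over "c".
RULES = [
    ("ch", "tʃ"),
    ("ll", "ʝ"),
    ("qu", "k"),
    ("ñ", "ɲ"),
    ("j", "x"),
    ("v", "b"),
    ("h", ""),   # silent letter: produces no sound
    ("c", "k"),  # simplification: only correct before a, o, u
]

# Hypothetical exception list for words the rules would get wrong.
EXCEPTIONS = {"whisky": ["w", "i", "s", "k", "i"]}


def rule_based_pronounce(word: str) -> list[str]:
    """Apply the exception list, then the ordered rules, letter by letter."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    phonemes = []
    i = 0
    while i < len(word):
        for spelling, sound in RULES:
            if word.startswith(spelling, i):
                if sound:
                    phonemes.append(sound)
                i += len(spelling)
                break
        else:
            phonemes.append(word[i])  # default: the letter stands for itself
            i += 1
    return phonemes


for word in ("chico", "hola", "queso", "whisky"):
    print(word, rule_based_pronounce(word))
```

Because every letter has a default mapping, the function returns something for any input, which mirrors both the strength of the approach and its weakness with foreign words.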
Lexicon-based speech synthesis
The lexicon-based or dictionary-based approach relies on a large dictionary that ideally contains all the words of a language together with their correct pronunciations. When a text is entered, each word is looked up individually, which is fast and very accurate. It only becomes problematic if a word is not contained in the dictionary, because then no result is generated.
FAQ: More questions about speech synthesis
What is TTS?
Text to speech (abbreviation: TTS) refers to a method for converting written text into speech. This is a form of speech synthesis.
What does speech synthesis mean?
Speech synthesis is the artificial generation of human speech. Various devices and programs, such as TTS software, can be used for this.
What are the approaches to speech synthesis?
To determine the pronunciation of a word from its spelling, a distinction is made between two approaches: the rule-based and the lexicon-based approach. Most text-to-speech systems use the two in combination.
What is neural speech synthesis?
Neural speech synthesis refers to a form of speech generation based on machine learning. An artificial neural network is trained on recordings of human speech and learns to predict how text should sound when spoken. The result is a more fluid and natural-sounding voice.
Where is text to speech used?
While text to speech was initially used mainly to help people with disabilities communicate or to provide them with barrier-free access to content, it can now be used wherever text needs to be converted into speech, e.g. in customer service portals or when using smart devices.