Speech synthesis

Speech synthesis is the artificial production of human speech. A text-to-speech (TTS) system (also called a reading machine) converts continuous text into synthesized speech.

Two basic approaches to generating speech signals can be distinguished. In so-called signal modeling, recordings of speech (samples) are used. Alternatively, the signal can be generated entirely in the computer by so-called physiological (articulatory) modeling. While the first systems were based on formant synthesis, the systems currently used in industry are mainly based on signal modeling. A particular problem for speech synthesis is the generation of natural intonation (prosody).


History

Long before the invention of electronic signal processing, scientists tried to build machines that could produce human speech. Gerbert of Aurillac (d. 1003) is credited with building a bronze "talking head" that was reportedly able to say "yes" and "no". The apparatuses of Albertus Magnus (1198-1280) and Roger Bacon (1214-1294) probably belong to the realm of legend.

In 1779 the German scientist Christian Kratzenstein, working in Copenhagen, prompted by a competition of the St. Petersburg Academy, built a "speech organ" that could synthesize the five long vowels (a, e, i, o and u) by means of free-swinging reed pipes coupled to resonators modeled on the human vocal tract. Wolfgang von Kempelen had already been developing a speaking machine since 1760, which he described in his 1791 publication "Mechanism of Human Speech, Together with the Description of His Speaking Machine". Like Kratzenstein's, this synthesis used a bellows as a lung equivalent, but the actual excitation came, much closer to anatomy, from a single beating reed pipe. This made some vowels and plosives possible. In addition, some fricatives could be produced via different mechanisms. A leather tube was attached above the vocal cords; it could be deformed by hand, thus replicating the variable geometry and resonance behavior of the vocal tract. Von Kempelen wrote:

" Gain in a period of three weeks admirable skill in playing, especially if you moved to the Latin, French or Italian language, because the German is harder to much [ because of the frequent consonants bundle ]. "

In 1837 Charles Wheatstone built a speaking machine based on this design; a replica can be found in the Deutsches Museum. In 1857 Joseph Faber built the Euphonia, which likewise follows this principle.

At the end of the 19th century, interest moved away from replicating the human speech organs (genetic speech synthesis) toward simulating the acoustic space (gennematic speech synthesis). Hermann von Helmholtz was the first to synthesize vowels, using tuning forks that were tuned to the resonant frequencies of the vocal tract in particular vowel positions. These resonant frequencies are called formants. Speech synthesis by combining formants was the technical mainstream until the mid-1990s.

In the 1930s, Bell Labs developed the vocoder, a keyboard-operated electronic speech synthesizer that was said to be clearly intelligible. Homer Dudley improved this machine into the Voder, which was presented at the 1939 World's Fair. The Voder used electronic oscillators to generate the formant frequencies.

The first computer-based speech synthesis systems were developed in the late 1950s; the first complete text-to-speech system was completed in 1968. In 1961 the physicist John Larry Kelly, Jr. created a speech synthesis at Bell Labs on an IBM 704 and had it sing the song "Daisy Bell". The director Stanley Kubrick was so impressed by it that he integrated it into the film 2001: A Space Odyssey.

Present day

While early electronic speech syntheses still sounded very robotic and were at times difficult to understand, since around the turn of the millennium they have achieved a quality at which they are sometimes hard to distinguish from human speakers. This is mainly because the technology has turned away from actual synthesis of the speech signal and instead focuses on optimally concatenating recorded speech segments.

Synthesis

Speech synthesis presupposes an analysis of human language with respect to its phonemes, but also to its prosody, since a sentence can take on different meanings through its intonation alone.

As for the synthesis method itself, various approaches exist. Common to all of them is that they rely on a database in which characteristic information about speech segments is stored. Elements from this inventory are linked together to form the desired utterance. Speech synthesis systems can be classified by the inventory of their database and, in particular, by the method used to link its elements. Signal synthesis tends to become easier the larger the database is, since it then already contains elements that are closer to the desired utterance and less signal processing is necessary. For the same reason, a large database usually allows the most natural-sounding synthesis.

One difficulty of synthesis lies in joining the inventory elements. Since these come from different utterances, they differ in volume and fundamental frequency, and the positions of the formants vary as well. During preprocessing of the database, or when the inventory elements are joined, these differences must be evened out (normalization) so as not to compromise the quality of the synthesis. A minimal sketch of such a normalization step is shown below.
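As an illustration, the following Python sketch equalizes the loudness of two segments and smooths the seam with a short crossfade. The function name and parameters are illustrative only; real systems additionally align fundamental frequency and formants:

```python
import numpy as np

def normalize_and_join(seg_a, seg_b, sr=16000, fade_ms=10):
    """Join two speech segments after equalizing their RMS level and
    smoothing the seam with a short crossfade (a minimal sketch)."""
    # Equalize loudness: scale the second segment to the RMS of the first.
    rms_a = np.sqrt(np.mean(seg_a ** 2))
    rms_b = np.sqrt(np.mean(seg_b ** 2))
    seg_b = seg_b * (rms_a / max(rms_b, 1e-12))

    # Crossfade over a few milliseconds to avoid clicks at the join.
    n = int(sr * fade_ms / 1000)
    fade = np.linspace(0.0, 1.0, n)
    overlap = seg_a[-n:] * (1.0 - fade) + seg_b[:n] * fade
    return np.concatenate([seg_a[:-n], overlap, seg_b[n:]])
```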

Unit Selection

Unit selection provides the best quality, especially within a restricted domain. The synthesis uses a large speech database in which each recorded utterance is segmented into some or all of the following units:

  • Phonemes/diphones,
  • Syllables,
  • Morphemes,
  • Phrases,
  • Sentences.

These segments are stored in an index together with a number of acoustic and phonetic properties such as fundamental frequency contour, duration, or neighboring units.

During synthesis, special search algorithms (weighted decision trees) determine the largest possible sequence of segments that matches the utterance to be synthesized as closely as possible with respect to these properties. Since this sequence is output with little or no signal processing, the naturalness of the spoken language is preserved as long as few concatenation points are required. A dynamic-programming sketch of this search is given below.
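The selection can be viewed as a shortest-path search that trades off how well each candidate unit matches the target against how well adjacent units join. The following Python sketch implements such a Viterbi-style search; the function names and the cost functions `target_cost` and `join_cost` are placeholders, not part of the original text:

```python
import numpy as np

def select_units(targets, candidates, target_cost, join_cost):
    """Dynamic-programming (Viterbi-style) unit selection.
    targets:    list of feature vectors for the desired utterance
    candidates: candidates[i] = list of database units for position i
    Returns the unit sequence minimizing summed target and join costs."""
    n = len(targets)
    # best[i][j]: cheapest total cost ending in candidate j at position i
    best = [[target_cost(targets[0], c) for c in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for i in range(1, n):
        row, ptr = [], []
        for cand in candidates[i]:
            costs = [best[i-1][k] + join_cost(prev, cand)
                     for k, prev in enumerate(candidates[i-1])]
            k = int(np.argmin(costs))
            row.append(costs[k] + target_cost(targets[i], cand))
            ptr.append(k)
        best.append(row)
        back.append(ptr)
    # Trace back the cheapest path through the candidate lattice.
    j = int(np.argmin(best[-1]))
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]
```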

Diphone synthesis

Experiments at the beginning of the 20th century showed that correct reproduction of the transitions between sounds is essential for the intelligibility of synthesized speech. To store all sound transitions, a database with about 2500 entries is used. In it, the time range from the stationary part (the phoneme center) of one phoneme to the stationary part of the following phoneme is stored. For synthesis, this information is assembled accordingly (concatenated).
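To make the inventory structure concrete: a diphone database is indexed by pairs of adjacent phonemes, from the center of one to the center of the next. A small illustrative helper (the naming scheme is hypothetical):

```python
def text_to_diphones(phonemes):
    """Map a phoneme sequence to the diphone units that a
    phoneme-center-to-phoneme-center database would provide.
    '#' marks silence at the utterance boundaries."""
    seq = ['#'] + list(phonemes) + ['#']
    return [f"{a}-{b}" for a, b in zip(seq, seq[1:])]

# e.g. text_to_diphones(['h', 'a', 'u', 's'])
# -> ['#-h', 'h-a', 'a-u', 'u-s', 's-#']
```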

Further coarticulation effects, which contribute greatly to the naturalness of speech, can be accounted for with more extensive databases. One example is Hadifix, which contains demisyllables, diphones, and suffixes.

Signal generation

Signal generation reproduces the desired segments from the database with the specified fundamental frequency contour. This shaping of the fundamental frequency contour can be achieved in different ways, and it is in this respect that the following methods differ.

Source-filter model

Syntheses that use a source-filter separation employ a signal source with a periodic waveform. Its period length is adjusted to the fundamental frequency of the utterance to be synthesized. Depending on the phoneme type, this excitation is additionally mixed with noise. The final filtering imposes the characteristic speech-sound spectra. An advantage of this class of methods is the easy control of the fundamental frequency via the source. A drawback is that determining the filter parameters stored in the database from speech samples is difficult. Depending on the type of filter, or on the underlying model of speech production, the following methods are distinguished:
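The source part can be sketched in a few lines of Python; the function name and the `noise_mix` parameter are illustrative assumptions:

```python
import numpy as np

def excitation(f0, length, sr=16000, voiced=True, noise_mix=0.05):
    """Source signal for a source-filter synthesizer: a periodic pulse
    train at the desired fundamental frequency for voiced sounds, pure
    noise for unvoiced ones, plus a small noise admixture."""
    if not voiced:
        return np.random.randn(length)
    period = int(sr / f0)                 # period length in samples
    src = np.zeros(length)
    src[::period] = 1.0                   # impulse train at f0
    return src + noise_mix * np.random.randn(length)
```

The resulting excitation is then shaped by one of the filter types described in the following subsections.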

Formant synthesis

Formant synthesis is based on the observation that, in order to distinguish the vowels, it suffices to reproduce the first two formants accurately. Each formant is modeled by a band-pass filter, a second-order pole filter, whose center frequency and quality factor (Q) are controllable. Formant synthesis is comparatively easy to implement with analog electronic circuits.
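Digitally, one formant corresponds to a two-pole resonator. The following sketch uses the standard digital resonator design; the formant values in the usage comment are rough illustrative figures, not taken from the original text:

```python
import numpy as np
from scipy.signal import lfilter

def formant_filter(x, fc, bw, sr=16000):
    """Second-order digital resonator (two-pole band-pass) for one
    formant with center frequency fc and bandwidth bw, both in Hz."""
    r = np.exp(-np.pi * bw / sr)          # pole radius from bandwidth
    theta = 2 * np.pi * fc / sr           # pole angle from center frequency
    a = [1.0, -2 * r * np.cos(theta), r * r]
    b = [1 - r]                           # rough gain normalization
    return lfilter(b, a, x)

# A vowel-like sound: the excitation passed through the first two
# formants in series, e.g. roughly /a/-like (illustrative values):
# y = formant_filter(formant_filter(src, 700, 90), 1200, 110)
```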

Acoustic model

The acoustic model reproduces the entire resonance behavior of the vocal tract by means of a suitable filter. The vocal tract is often simplified as a tube of variable cross-section, where transverse modes are neglected because the lateral extent of the vocal tract is small. The continuous change of cross-section is further approximated by equidistant cross-section jumps. A frequently chosen filter type is the lattice (ladder) filter, in which there is a direct relationship between the cross-sections and the filter coefficients.
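This direct relationship can be stated concretely: at each jump between adjacent tube sections with areas A_i and A_{i+1}, the lattice coefficient is the acoustic reflection coefficient of that junction. A minimal sketch (sign conventions vary between texts, and the example area function is purely illustrative):

```python
import numpy as np

def reflection_coefficients(areas):
    """Direct relationship between tube cross-sections and lattice
    filter coefficients: at each cross-section jump A_i -> A_{i+1},
    k_i = (A_i - A_{i+1}) / (A_i + A_{i+1})."""
    a = np.asarray(areas, dtype=float)
    return (a[:-1] - a[1:]) / (a[:-1] + a[1:])

# Example: a crude /a/-like area function (wide pharynx, narrow mouth):
# k = reflection_coefficients([2.6, 1.6, 0.6, 1.0, 3.0])
```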

These filters are closely related to linear predictive coding (LPC), which is likewise used for speech synthesis. LPC also models the entire resonance behavior, but there is no direct connection between the filter coefficients and the cross-sectional shape of the vocal tract.
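For illustration, LPC coefficients can be estimated from a speech frame with the autocorrelation method and then used to resynthesize the frame as an all-pole filter driven by an excitation signal. A compact textbook-style sketch in Python (assumes a non-silent frame longer than the model order):

```python
import numpy as np
from scipy.signal import lfilter

def lpc(frame, order=12):
    """LPC coefficients via the autocorrelation method and
    Levinson-Durbin recursion (a compact textbook sketch)."""
    n = len(frame)
    r = np.correlate(frame, frame, mode='full')[n-1:n+order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i-1:0:-1])) / err
        a[1:i+1] = a[1:i+1] + k * a[i-1::-1][:i]  # coefficient update
        err *= (1 - k * k)
    return a

# Resynthesis: drive the all-pole filter 1/A(z) with an excitation, e.g.
# y = lfilter([1.0], lpc(frame), excitation_signal)
```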

Articulatory synthesis

Compared with the acoustic model, articulatory synthesis additionally establishes a relationship between the position of the articulators and the resulting cross-sectional shape of the vocal tract. Here, in addition to discrete-time lattice filters, solutions of the time-continuous horn equation are used to simulate the resonance behavior; the time signal is then obtained from them by inverse Fourier transform.
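For reference, the time-continuous horn equation referred to here is commonly written in Webster's form, with sound pressure p(x, t), local cross-sectional area A(x), and speed of sound c:

$$\frac{\partial^2 p}{\partial t^2} = \frac{c^2}{A(x)}\,\frac{\partial}{\partial x}\!\left(A(x)\,\frac{\partial p}{\partial x}\right)$$

Its solutions for a given area function yield the transfer behavior of the tract, from which the time signal is obtained as described above.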

Overlap Add

Pitch Synchronous Overlap Add, abbreviated PSOLA, is a synthesis method in which the database contains recordings of the speech signal. If the signals are periodic, they are annotated with information about their fundamental frequency (pitch), and the beginning of each period is marked. During synthesis, these periods are excised together with a certain surrounding context by means of a window function and added to the signal to be synthesized at the appropriate position: depending on whether the desired fundamental frequency is higher or lower than that of the database entry, they are combined correspondingly more densely or less densely than in the original. To adjust the duration of a sound, periods can be omitted or doubled. This method is also known as TD-PSOLA or PSOLA-TD(TM), where TD stands for Time Domain and emphasizes that the method works in the time domain.
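The core overlap-add loop can be sketched compactly. Pitch marks are assumed to be given as sample indices, and the two-period Hann window is one common choice; the function name and defaults are illustrative:

```python
import numpy as np

def td_psola(signal, marks, target_f0, sr=16000):
    """Minimal TD-PSOLA sketch: excise two-period Hann-windowed grains
    at the given pitch marks and overlap-add them at the spacing
    dictated by the target fundamental frequency."""
    hop = int(sr / target_f0)                 # new period in samples
    out = np.zeros(len(signal) + hop)
    pos = 0
    for i in range(1, len(marks) - 1):
        left, right = marks[i-1], marks[i+1]  # grain spans two periods
        grain = signal[left:right] * np.hanning(right - left)
        if pos + len(grain) > len(out):
            break
        out[pos:pos+len(grain)] += grain      # overlap-add at new spacing
        pos += hop                            # denser/sparser than original
    return out
```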

A further development is the Multiband Resynthesis OverLap Add method, Mbrola for short. Here the segments in the database are brought to a uniform fundamental frequency in a preprocessing step, and the phase of the harmonics is normalized. As a result, the transition from one segment to the next causes fewer perceptual disturbances during synthesis, and the achieved speech quality is higher.

These synthesis methods are related to granular synthesis, which is used in electronic music production for sound generation and sound transformation.

Parametric speech synthesis from hidden Markov models (HMM) and/or stochastic Markov graphs (SMG)

Parametric speech synthesis is a group of methods based on stochastic models. These models are either hidden Markov models (HMM), stochastic Markov graphs (SMG), or, more recently, a combination of the two. The basic principle is that the symbolic phoneme sequences obtained from text preprocessing are first broken down into segments by statistical modeling, and each of these segments is then assigned a specific model from an existing database. Each of these models is in turn described by a number of parameters and is finally linked with the other models. Rendering an artificial speech signal from these parameters then completes the synthesis. When flexible stochastic Markov graphs are used, such a model can even be optimized: by being trained in advance on examples of natural speech, it can attain a certain degree of naturalness. Statistical methods of this kind are borrowed from the opposite field, speech recognition, and are motivated by findings on the relationship between the probability of a particular spoken word sequence and its expected approximate speaking rate or prosody. A sketch of the parameter-generation step follows below.
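The generation step can be illustrated as follows: each segment's model contributes per-state parameter means and durations, which are concatenated into a parameter trajectory and smoothed before a vocoder renders audio from it. Everything here (the layout of the hypothetical `models` database, the moving-average smoothing) is a simplified stand-in for the maximum-likelihood parameter generation used in real HMM synthesis:

```python
import numpy as np

def generate_trajectory(phonemes, models):
    """Sketch of HMM-based parameter generation for one parameter
    track (e.g. F0): string together per-state means according to
    their durations, then smooth the result."""
    frames = []
    for ph in phonemes:
        for mean, duration in models[ph]:   # one (mean, frames) pair per state
            frames.extend([mean] * duration)
    traj = np.array(frames, dtype=float)
    # Moving-average smoothing as a stand-in for ML parameter
    # generation with delta features.
    kernel = np.ones(5) / 5.0
    return np.convolve(traj, kernel, mode='same')
```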

Applications of text-to-speech software

The use of speech synthesis software need not be an end in itself. People with visual impairments - for example, cataracts or age-related macular degeneration - use TTS software solutions to have on-screen text read aloud to them directly. Blind people can operate a computer using screen-reader software and have controls and text content announced. But teachers also use speech synthesis to record lectures. Likewise, authors use TTS software to check their own texts for errors and intelligibility.

Of particular interest is software that can create MP3 files, since speech synthesis can then also be used to produce simple podcasts or audio blogs. Experience shows that producing podcasts or audio blogs manually can be very time-consuming.

When working with US software, note that the available voices vary in quality. English voices are of higher quality than German ones. A 1:1 copy of the text into a TTS program is not advisable; post-processing is necessary in every case. It is not just a matter of expanding abbreviations: inserting punctuation marks - even where they are not grammatically correct - can help control the speaking pace. German "translations" full of anglicisms usually pose an insurmountable problem for speech synthesis.

Common applications include announcements in telephone and navigation systems.

Speech synthesis software

  • BOSS, developed at the Institute of Communication Sciences of the University of Bonn
  • Audiodizer
  • AnalogX SayIt
  • Browsealoud from textHelp
  • Cepstral Text-to-Speech
  • CereProc
  • DeskBot
  • eSpeak (open source, 20 languages, SAPI5)
  • Festival
  • Festvox
  • FreeTTS
  • GhostReader
  • Infovox
  • IVONA Text-to-Speech
  • Linguatec Voice Reader
  • Loquendo TTS
  • Logox Clip Reader
  • MARY Text-to-Speech, developed by the DFKI Language Technology Lab
  • Mbrola
  • MWS Reader from direct innovation UG (limited liability)
  • NaturalReader from NaturalSoft
  • ReadSpeaker: web page reading and podcasting
  • RealSpeak from Nuance (formerly ScanSoft), now KobaSpeech 3
  • SVOX
  • SpeechConcept
  • Sprechomat
  • TextAloud MP3
  • Toshiba ToSpeak
  • VirSyn CANTOR vocal synthesis
  • Virtual Voice
  • Vocal Generator: a special program for amateur musicians
  • Vocaloid: for synthesizing singing
  • VoiceFlux:Pro
  • Your Speaker: incl. control of pronunciation (voice control module)

Speech synthesis hardware

  • Votrax SC-01A (analog formant synthesis)
  • SC-02 / SSI-263 / "Arctic 263"
  • SP0250
  • SP0256-AL2 "Orator" (CTS256A-AL2)
  • SP0264
  • SP1000
  • TMS5110A (LPC)
  • TMS5200
  • MSM5205
  • MSM5218RS (ADPCM)