Speech recognition

Speech recognition, or automatic speech recognition, is a branch of applied computer science, engineering and computational linguistics. It deals with the study and development of methods that make spoken language accessible to machines, in particular computers, for automatic data acquisition. Speech recognition is to be distinguished from voice or speaker recognition, a biometric method for identifying persons; however, the implementations of these methods are similar.

Historical Development

Research on speech recognition systems began in the 1960s but was largely unsuccessful at first: the systems developed by private companies allowed the recognition of a few dozen individual words under laboratory conditions. This was due partly to the limited knowledge in this new field of research and partly to the limited technical possibilities of the time.

Development only gained momentum in the mid-1980s. During this time it was discovered that homophones could be distinguished by context tests. By compiling and evaluating statistics on the frequency of certain word combinations, it became possible to decide which of several similar- or identical-sounding words was meant. These so-called trigram statistics subsequently became an important component of all speech recognition systems. In 1984 IBM presented a first speech recognition system that could recognize about 5,000 individual English words. However, the system required several minutes of computing time on a mainframe for each recognition pass. More advanced in this respect was a system developed by Dragon Systems: it could be used on a portable PC.

In 1991, IBM presented for the first time at CeBIT a speech recognition system that could recognize 20,000 to 30,000 German words. However, the presentation of the system, called Tangora 4, had to take place in a specially shielded room, since the noise of the trade fair would otherwise have disturbed it.

At the end of 1993, IBM introduced the first voice recognition system developed for the mass market: the system, called IBM Personal Dictation System, ran on ordinary PCs and cost under $1,000. When it was presented under the name IBM VoiceType Dictation System at CeBIT 1994, it met with great interest from visitors and the press.

In 1997, both the software IBM ViaVoice (the successor of IBM VoiceType) and version 1.0 of Dragon NaturallySpeaking were published for PC end users. In 2004, IBM released parts of its speech recognition applications as open source, causing quite a stir. Industry insiders suspected a tactical move against Microsoft, which is also active in this field and which, since 2007 with the release of its PC operating system Windows Vista, has offered speech recognition functions as an integral part of the system, both for control and for dictation; these have been developed further up to Windows 8.1.

While the development of IBM ViaVoice has been discontinued, Dragon NaturallySpeaking has developed into the most popular third-party speaker-dependent speech recognition software for Windows PCs and has been manufactured and distributed by Nuance Communications since 2005.

In 2008, Nuance acquired Philips Speech Recognition Systems, Vienna, and with it the rights to the SpeechMagic software development kit (SDK), which is particularly widespread in the healthcare sector. For Apple's iMac personal computers, the company MacSpeech had distributed third-party speech recognition software under the name iListen since 2006, based on Philips components. In 2008 this was replaced by MacSpeech Dictate, which used the core components of Dragon NaturallySpeaking; after Nuance Communications acquired MacSpeech, it was renamed Dragon Dictate in 2010 (version 2.0), and since 2012 version 3.0 has been distributed.

Current status

Currently, a broad distinction can be made between two types of speech recognition:

  • Speaker-independent speech recognition
  • Speaker-dependent speech recognition

Characteristic of speaker-independent speech recognition is that the user can begin using it immediately, without a prior training phase. The vocabulary, however, is limited to a few thousand words.

Speaker-dependent speech recognizers must be trained by the user on his or her own particular pronunciation before use. Use in applications with frequently changing users is not possible. In return, the vocabulary is considerably larger than that of speaker-independent recognizers. The current version 11 of the software Dragon NaturallySpeaking, for example, contains around 300,000 word forms. A further distinction is made between

  • Front -end systems and
  • Back-end systems.

In front-end systems (also known as online dictation), the processing of the speech and its conversion into text take place directly on the user's computer, so that the user can read the result with practically no perceptible delay. In back-end systems (also called offline dictation or server-based recognition), the conversion is performed on a remote server instead, so that the text becomes available only with a delay. Such systems are particularly widespread in the medical field.

Speaker-independent speech recognition is preferred in technical applications, for example in automatic dialogue systems such as timetable information services. Wherever only a limited vocabulary is used, speaker-independent speech recognition is practised successfully. Systems for recognizing the spoken English digits 0 to 9 achieve a recognition rate of nearly 100 %.

With speaker-dependent speech recognition, very high recognition rates can likewise be achieved on a limited vocabulary. Where an unrestricted vocabulary is used, however, complete accuracy can no longer be guaranteed. Even an accuracy of 95 percent is too low, since too much would have to be corrected (and errors can compound because of the trigram statistics that are frequently used).

Current systems now achieve recognition rates of about 99 percent when dictating continuous text on personal computers and thus meet the requirements of many practical areas, such as scientific texts, business correspondence or legal briefs. Their use reaches its limits where the respective author constantly needs new words and word forms not contained in the software's dictionary; adding them manually is possible, but is not efficient if they occur only once in texts by the same speaker. Poets and journalists therefore benefit less from speech recognition than, for example, doctors and lawyers.

In addition to the size of the dictionary, the quality of the acoustic recording plays a crucial role. With microphones mounted directly in front of the mouth (for example headsets or telephones), significantly higher recognition accuracy is achieved than with room microphones placed further away.

See also: Steno mask

In practice, however, the most important influencing factors are precise pronunciation and dictation in coherent, sufficiently long passages, so that the language model can work optimally.

Lipreading

To increase recognition accuracy even further, attempts are sometimes made to film the speaker's face with a video camera and to read off the lip movements from it. By combining these results with those of acoustic recognition, a significantly higher recognition rate can be achieved, particularly with noisy recordings.

This corresponds to observations on human speech recognition: in 1976 Harry McGurk found that people also infer the spoken language from lip movements (McGurk effect).

Speech output

Since communication in human language usually takes the form of a dialogue between two partners, speech recognition is often found in combination with speech synthesis. In this way, the user of the system can be given acoustic feedback on the success of the speech recognition and information about any actions performed. Likewise, the user can be prompted to make another spoken input.

Problems

To understand how a speech recognition system works, one must first be clear about the challenges it has to deal with.

Discrete and continuous speech

In a sentence in everyday language, the individual words are pronounced without perceptible pauses between them. As a human being, one can intuitively orient oneself by the transitions between the words; earlier speech recognition systems were not capable of this. They required discrete (discontinuous) speech, in which artificial pauses had to be made between the words.

Modern systems, however, are also capable of understanding continuous (flowing) speech.

Discrete language

In discrete speech, one can clearly hear that the pauses between the words are longer and more distinct than the transitions between the syllables within a word such as "encyclopedia".

Continuous language

In continuous speech the individual words merge into one another; no pauses are recognizable.

Size of the vocabulary

Through inflection, that is, the modification of a word according to its grammatical function, a multitude of word forms arises from root words (lexemes). This matters for the size of the vocabulary, since in speech recognition all word forms must be treated as separate words.

The size of the dictionary depends heavily on the language. On the one hand, average German speakers have, at about 4,000 words, a considerably larger vocabulary than English speakers with about 800 words. On the other hand, inflection in German produces roughly ten times as many word forms, whereas in English only about four times as many word forms arise. (citation needed)

Homophones

In many languages there are words or word forms that have different meanings but are pronounced identically. The German words "Meer" (sea) and "mehr" (more), for example, sound identical but have nothing to do with each other. Such words are called homophones. Since a speech recognition system, unlike a human, usually has no knowledge of the world, it cannot distinguish between the possibilities on the basis of meaning.

The question of upper or lower case also falls within this area.

Formants

At the acoustic level, the position of the formants plays a particular role: the frequency components of spoken vowels are typically concentrated on certain characteristic frequencies, which are called formants. The two lowest formants are the most important for distinguishing the vowels: the lower one lies in the range of 200 to 800 hertz, the higher one in the range of 800 to 2,400 hertz. The individual vowels can be distinguished by the position of these frequencies.
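
As a rough illustration of how formant positions can be extracted in practice, the following Python sketch estimates the two lowest formants of a sustained vowel from the roots of a linear-prediction (LPC) polynomial. The file name, the LPC order and the filtering thresholds are illustrative assumptions and are not taken from the article.

```python
import numpy as np
import librosa

# Hypothetical mono recording of a sustained vowel; librosa is assumed to be available.
y, sr = librosa.load("vowel.wav", sr=16000)

# Fit a linear-prediction model; order ~ 2 + sr/1000 is a common rule of thumb.
a = librosa.lpc(y, order=2 + sr // 1000)

# Resonances (formant candidates) correspond to the complex roots of the LPC polynomial.
roots = np.array([r for r in np.roots(a) if np.imag(r) > 0])
freqs = np.angle(roots) * sr / (2 * np.pi)
bandwidths = -np.log(np.abs(roots)) * sr / np.pi

# Keep plausible, narrow resonances and report the two lowest as F1 and F2.
formants = sorted(f for f, bw in zip(freqs, bandwidths) if f > 90 and bw < 400)
print("F1 = %.0f Hz, F2 = %.0f Hz" % (formants[0], formants[1]))
```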

Consonants

Consonants are comparatively difficult to recognize; individual consonants (so-called plosives), for example, can be detected only through the transitions to the neighbouring sounds, as the following example shows:

It can be seen that within the word "speak" the consonant p (more precisely, the closure phase of the phoneme p) is in fact only silence and is recognized solely by the transitions to the neighbouring sounds; removing it therefore causes no audible difference.

Other consonants can certainly be recognized by characteristic spectral patterns. The sound s, for example, like the sound f (both fricatives), is characterized by a high proportion of energy in the higher frequency bands. It is noteworthy that the information relevant for distinguishing these two sounds lies largely outside the spectral range transmitted by telephone networks (up to about 3.4 kHz). This explains why spelling out words over the telephone is tedious and error-prone even between two people, unless a special spelling alphabet is used.
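
This distinction can be made concrete by measuring how much of a fricative's energy lies above the telephone band. A minimal sketch, assuming a short mono recording of an isolated /s/ or /f/ in a hypothetical file fricative.wav:

```python
import numpy as np
import soundfile as sf

x, sr = sf.read("fricative.wav")          # hypothetical mono recording of /s/ or /f/
power = np.abs(np.fft.rfft(x)) ** 2       # power spectrum of the whole recording
freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)

# Fraction of the energy above the ~3.4 kHz upper edge of the telephone band:
ratio = power[freqs > 3400].sum() / power.sum()
print("energy above 3.4 kHz: %.1f %%" % (100 * ratio))
```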

Realization

A speech recognition system consists of the following components: a preprocessing stage, which decomposes the analog speech signal into its individual frequencies, followed by the actual recognition, which uses acoustic models, dictionaries and language models.

Preprocessing

Preprocessing consists essentially of the steps of sampling, filtering, transforming the signal into the frequency domain, and creating the feature vector.

Sampling

During sampling, the analog signal is digitized, i.e. converted into an electronically processable bit sequence, so that it can be processed further more easily.
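
A minimal illustration of this step: a continuous-valued signal is reduced to 16-bit integer samples at a typical speech sampling rate of 16 kHz. The sine wave merely stands in for a real microphone signal.

```python
import numpy as np

sr = 16000                                  # typical sampling rate for speech recognition
t = np.arange(0, 0.01, 1.0 / sr)            # 10 ms worth of sampling instants
analog = 0.5 * np.sin(2 * np.pi * 440 * t)  # stand-in for the continuous microphone signal

# Quantize to 16-bit integer samples, as a sound card's A/D converter would.
samples = np.round(analog * 32767).astype(np.int16)
print(samples[:8])
```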

Filtering

The most important task of the filtering step is to distinguish ambient sounds, such as background noise or engine noise, from speech. For this purpose, the energy of the signal or the zero-crossing rate is used, for example.
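
The following sketch shows how short-time energy and zero-crossing rate can be used to label frames as speech or background noise. The frame lengths and thresholds are illustrative assumptions, and utterance.wav is a hypothetical recording.

```python
import numpy as np
import soundfile as sf

x, sr = sf.read("utterance.wav")     # hypothetical mono recording
frame_len = int(0.025 * sr)          # 25 ms frames
hop = int(0.010 * sr)                # 10 ms hop

def frames(signal, n, hop):
    return [signal[i:i + n] for i in range(0, len(signal) - n, hop)]

labels = []
for f in frames(x, frame_len, hop):
    energy = np.sum(f ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(f)))) / 2
    # A frame is tentatively labelled "speech" if it is loud and not noise-like;
    # the thresholds are illustrative only.
    labels.append(energy > 1e-3 and zcr < 0.25)
print("speech frames: %d of %d" % (sum(labels), len(labels)))
```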

Transformation

For speech recognition it is not the time-domain signal that is relevant, but the signal in the frequency domain. The signal is therefore transformed by means of an FFT. From the result, the frequency spectrum, the frequency components present in the signal can be read off.
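
A minimal example of this transformation, computing the magnitude spectrum of one windowed 25 ms frame with the FFT (the file name is again a hypothetical stand-in):

```python
import numpy as np
import soundfile as sf

x, sr = sf.read("utterance.wav")                            # hypothetical mono recording
frame = x[:int(0.025 * sr)] * np.hamming(int(0.025 * sr))   # one windowed 25 ms frame

spectrum = np.abs(np.fft.rfft(frame))                       # magnitude spectrum via FFT
freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
# spectrum[i] is the strength of the frequency component at freqs[i] Hz.
print(freqs[np.argmax(spectrum)], "Hz is the strongest component of this frame")
```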

Feature vector

For the actual speech recognition, a feature vector is created. It consists of mutually dependent or independent features generated from the digital speech signal. Besides the spectrum already mentioned, this includes above all the cepstrum. Feature vectors can be compared, for example, by means of a previously defined metric.
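
As one common concrete realization, mel-frequency cepstral coefficients (MFCCs, closely related to the cepstrum discussed next) can serve as the feature vector, and two such vectors can be compared with a Euclidean metric. A sketch using the librosa library, with an assumed input file:

```python
import numpy as np
import librosa

# Hypothetical mono recording; librosa is assumed to be available.
y, sr = librosa.load("utterance.wav", sr=16000)

# 13 cepstral coefficients per frame: one feature vector per ~25 ms of speech.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, number_of_frames)

# Two feature vectors can be compared with a predefined metric, e.g. Euclidean distance:
d = np.linalg.norm(mfcc[:, 0] - mfcc[:, 1])
print("distance between the first two frames:", d)
```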

Cepstrum

The cepstrum is obtained from the spectrum by taking the FFT of the logarithmized magnitude spectrum. In this way, periodicities in the spectrum can be recognized. These are generated in the human vocal tract and by the excitation of the vocal cords. The periodicities of the vocal cord excitation predominate and are therefore found in the upper part of the cepstrum, whereas the lower part represents the configuration of the vocal tract. The latter is what is relevant for speech recognition, so only these lower portions of the cepstrum are included in the feature vector. Since the room transfer function, that is, the change of the signal caused for example by reflections from walls, does not change over time, it can be represented by the mean value of the cepstrum. This mean is therefore frequently subtracted from the cepstrum in order to compensate for echoes. Likewise, the first derivative of the cepstrum can be used to compensate for the room transfer function and may also be included in the feature vector.
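
A minimal sketch of this computation, deriving per-frame cepstra from the logarithmized magnitude spectrum and applying cepstral mean subtraction; the frame sizes, the number of retained coefficients and the file name are illustrative choices, not specifications from the article.

```python
import numpy as np
import soundfile as sf

x, sr = sf.read("utterance.wav")                 # hypothetical mono recording
frame_len, hop = int(0.025 * sr), int(0.010 * sr)

cepstra = []
for i in range(0, len(x) - frame_len, hop):
    frame = x[i:i + frame_len] * np.hamming(frame_len)
    log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)  # logarithmized magnitude spectrum
    cepstrum = np.fft.irfft(log_mag)                      # transform again -> cepstrum
    cepstra.append(cepstrum[:13])        # keep only the lower part (vocal tract shape)

cepstra = np.array(cepstra)
# Cepstral mean subtraction: the (roughly constant) room/channel transfer function
# shows up in the mean, which is therefore removed from every frame.
features = cepstra - cepstra.mean(axis=0)
```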

Recognition

Hidden Markov Models

Hidden Markov models (HMMs) play an important role in the recognition process. They make it possible to find the phonemes that best match the input signals. For this purpose, the acoustic model of a phoneme is divided into different parts: the beginning, a varying number of middle sections depending on its length, and the end. The input signals are compared against these stored sections, and possible combinations are found using the Viterbi algorithm.

For the recognition of discontinuous (discrete) speech, in which a pause is made after each word, it was sufficient to calculate each word together with a pause model within the HMM. Since the computing power of modern PCs has increased significantly, flowing (continuous) speech can now also be recognized by building larger hidden Markov models that comprise several words and the transitions between them.
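
To illustrate the decoding step, the following toy example runs the Viterbi algorithm over three phoneme-like HMM states. All probabilities and observations are invented for illustration and are not taken from any real acoustic model.

```python
import numpy as np

states = ["sil", "a", "n"]
trans = np.array([[0.6, 0.4, 0.0],     # sil -> sil / a / n
                  [0.0, 0.7, 0.3],     # a   -> sil / a / n
                  [0.3, 0.0, 0.7]])    # n   -> sil / a / n
emit = np.array([[0.7, 0.2, 0.1],      # P(observation_t | state), one row per frame
                 [0.2, 0.7, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.1, 0.3, 0.6],
                 [0.2, 0.2, 0.6]])
log_trans = np.log(trans + 1e-12)
log_emit = np.log(emit + 1e-12)
start = np.log(np.array([1.0, 1e-12, 1e-12]))   # utterances start in silence

T, N = log_emit.shape
delta = np.full((T, N), -np.inf)                # best log score ending in each state
back = np.zeros((T, N), dtype=int)              # backpointers for the traceback
delta[0] = start + log_emit[0]
for t in range(1, T):
    for j in range(N):
        scores = delta[t - 1] + log_trans[:, j]
        back[t, j] = int(np.argmax(scores))
        delta[t, j] = scores[back[t, j]] + log_emit[t, j]

# Trace back the most probable state sequence.
path = [int(np.argmax(delta[-1]))]
for t in range(T - 1, 0, -1):
    path.append(back[t, path[-1]])
print([states[s] for s in reversed(path)])      # -> ['sil', 'a', 'a', 'n', 'n']
```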

Neural Networks

Alternatively, attempts have been made to use neural networks for the acoustic model. With time delay neural networks, in particular, the changes in the frequency spectrum over time were to be used for recognition. This development did produce positive results, but was ultimately abandoned in favour of HMMs.

There is, however, also a hybrid approach in which the data obtained from preprocessing are pre-classified by a neural network, and the output of the network is used as parameters for the hidden Markov models. This has the advantage that data from shortly before and shortly after the current point in time can also be used without increasing the complexity of the HMMs. In addition, the classification of the data and the context-sensitive composition (formation of meaningful words and sentences) can be separated from each other.

Language model

The language model then attempts to determine the probability of certain word combinations and thereby to exclude false or improbable hypotheses. Either a grammar model using formal grammars or a statistical model based on N-grams can be used for this purpose.

A bigram or trigram stores the probability of occurrence of word combinations of two or three words (current software such as Dragon NaturallySpeaking uses pentagram statistics, i.e. word combinations of five words). These statistics are obtained from large text corpora. Each hypothesis produced by the speech recognizer is then checked and, if necessary, discarded if its probability is too small. In this way, homophones, i.e. different words with identical pronunciation, can also be distinguished: the German phrase "Vielen Dank" (thank you) would be more probable than "Fielen Dank", although both are pronounced the same.
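
A toy version of such a statistic, estimating trigram probabilities from a tiny made-up English corpus with add-one smoothing and using them to rank two acoustically similar hypotheses (real systems derive these counts from corpora of many millions of words):

```python
from collections import Counter

# Trigram statistics over a tiny made-up corpus.
corpus = ("thank you very much thank you for the flowers "
          "he fell down the stairs").split()
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))
vocab_size = len(set(corpus))

def trigram_prob(w1, w2, w3):
    # P(w3 | w1, w2) with add-one smoothing so unseen combinations are not zero.
    return (trigrams[(w1, w2, w3)] + 1) / (bigrams[(w1, w2)] + vocab_size)

# Two acoustically similar hypotheses; the language model prefers the likelier one.
print(trigram_prob("thank", "you", "very"))    # higher probability
print(trigram_prob("thank", "you", "fairy"))   # lower probability: discarded
```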

With trigrams, more accurate estimates of the occurrence probabilities of word combinations are theoretically possible than with bigrams. However, the text databases from which the trigrams are extracted must be substantially larger than for bigrams, since all permissible combinations of three words must occur in them in statistically significant numbers (i.e. each substantially more than once). Combinations of four or more words were long not used because, in general, no text databases could be found that contain all such word combinations in sufficient numbers. An exception is Dragon, which from version 12 on also uses pentagrams, noticeably increasing the recognition accuracy of this system.

If grammars are used, they are usually context-free grammars. In that case, however, every word must be assigned its function within the grammar. For this reason, such systems are usually used only for a limited vocabulary and special applications, not in common speech recognition software for PCs.

Evaluation

The quality of a speech recognition system can be indicated by various figures. Besides the recognition speed (usually given as the real-time factor, RTF), the recognition quality can be measured as word accuracy or word recognition rate.
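
Word accuracy can be computed from the Levenshtein alignment between the reference transcript and the recognizer output. A minimal sketch; the example sentences are invented:

```python
def word_accuracy(reference, hypothesis):
    """Word accuracy = 1 - (substitutions + insertions + deletions) / reference length,
    computed from the usual Levenshtein alignment of the two word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return 1.0 - d[len(ref)][len(hyp)] / len(ref)

# Reference transcript vs. recognizer output (invented example):
print(word_accuracy("please recognize speech", "please recognize the speech"))  # ~0.67
```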

Vocabularies

To facilitate working with speech recognition, predefined vocabularies are already available for the integration of speech recognition systems. These vocabularies are called ConTexts in the SpeechMagic environment and vocabularies in the Dragon environment. The better the vocabulary is adapted to the vocabulary and dictation style (frequency of word sequences) used by the speaker, the higher the recognition accuracy. In addition to the speaker-independent lexicon (specialist and basic vocabulary), a vocabulary also includes an individual word sequence model (language model). All the words known to the software are stored in the vocabulary with their phonetics and spelling. In this way, a spoken word is recognized by its sound. If words differ in meaning and spelling but sound the same, the software falls back on the word sequence model. It defines the probability with which, for the specific user, one word follows another.
