Voice activity detection

Voice activity detection (VAD) is a speech processing technique that detects the presence or absence of human speech. Its main uses are in speech coding and speech recognition. It can facilitate speech processing and can be used to deactivate some processes during a pause in speech: in IP telephony applications it avoids unnecessary coding and transmission of silence packets, saving processing power and transmission capacity.

Voice activity detection is a key technology for a variety of voice-based applications. Various algorithms have therefore been developed, with different characteristics, each representing a trade-off between latency, sensitivity, accuracy and computational cost. Some algorithms also provide further analysis of the signal, for example whether the speech is voiced, unvoiced or sustained. Voice activity detection is usually independent of the language being spoken.

It was first investigated for use in time-assignment speech interpolation (TASI) systems.

Algorithm

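The typical design of a VAD algorithm is as follows:

  • First, there may be a noise reduction stage, for example spectral subtraction.
  • Next, features or quantities are computed from a section (frame) of the input signal.
  • Finally, a classification rule is applied to label the section as speech or non-speech; often this rule checks whether a value exceeds a threshold.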

In this sequence there may be feedback loops, in which the VAD decision is used to adapt the noise estimation or to adjust the threshold(s) dynamically. These feedback mechanisms improve detection performance under varying noise.
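The feedback idea can be illustrated with a minimal sketch, assuming a plain energy feature and a running noise-floor estimate. The function names, the 6 dB margin and the smoothing factor are illustrative assumptions, not values from any standard, and the first frame is assumed to contain only noise.

```python
import math


def frame_energy_db(frame):
    """Mean power of one frame in dB (a small floor avoids log(0))."""
    power = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(power + 1e-12)


def energy_vad(frames, margin_db=6.0, alpha=0.9):
    """Label each frame True (speech) or False (noise).

    margin_db and alpha are hypothetical tuning values.
    """
    noise_db = None
    decisions = []
    for frame in frames:
        level = frame_energy_db(frame)
        if noise_db is None:
            noise_db = level  # assumption: the first frame is noise only
        is_speech = level > noise_db + margin_db
        if not is_speech:
            # Feedback: frames judged to be noise update the noise estimate,
            # which in turn moves the decision threshold.
            noise_db = alpha * noise_db + (1.0 - alpha) * level
        decisions.append(is_speech)
    return decisions


quiet = [0.01, -0.01] * 80  # about -40 dB
loud = [0.5, -0.5] * 80     # about -6 dB
print(energy_vad([quiet, quiet, loud, loud, quiet]))
# → [False, False, True, True, False]
```

Because only non-speech frames update the noise estimate, a slowly rising background level raises the threshold with it, while speech frames leave it untouched.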

A representative set of recently published VAD methods formulates the decision rule frame by frame, using instantaneous measures of the divergence distance between speech and noise. The measures used in voice activity detection include spectral slope, correlation coefficients, log likelihood ratio, cepstral, weighted cepstral and modified distance measures.

Regardless of the choice of VAD algorithm, a compromise must be made between detecting noise as speech and detecting speech as noise (between false positives and false negatives). A VAD running in a mobile phone must be able to recognize speech in the presence of a wide range of very different types of acoustic background noise. Under these difficult detection conditions it is often desirable for the VAD to be conservative, classifying a frame as speech when in doubt, in order to reduce the risk of losing speech segments. The greatest difficulty in detecting speech in this environment is the very low signal-to-noise ratio (SNR) encountered. When parts of the speech utterance are buried in noise, it may be impossible to distinguish between speech and noise using simple level detection.
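The trade-off between the two error types can be made concrete with a small sketch. The frame levels and labels below are invented for illustration, and `count_errors` is a hypothetical helper; it simply counts both kinds of error for a fixed level threshold.

```python
def count_errors(levels_db, is_speech, threshold_db):
    """Return (missed_speech, false_alarms) for a fixed level threshold."""
    pairs = list(zip(levels_db, is_speech))
    missed = sum(1 for lv, sp in pairs if sp and lv <= threshold_db)
    false_alarms = sum(1 for lv, sp in pairs if not sp and lv > threshold_db)
    return missed, false_alarms


# Hypothetical frame levels in dB with hand-made reference labels.
levels = [-42, -38, -35, -30, -33, -20, -36, -41]
labels = [False, False, False, True, True, True, True, False]

for thr in (-31, -34, -37):
    print(thr, count_errors(levels, labels, thr))
# → -31 (2, 0)
# → -34 (1, 0)
# → -37 (0, 1)
```

Lowering the threshold makes the detector more conservative in the sense used above: fewer speech frames are lost, at the cost of more noise frames passed on as speech.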

Applications

  • Voice activity detection is a fundamental part of various voice communication systems such as telephone conferencing, echo cancellation, speech recognition, speech signal coding and hands-free calling.
  • In the field of multimedia applications, voice activity detection allows simultaneous use of speech and data applications.
  • Similarly, in the Universal Mobile Telecommunications System (UMTS) it controls and reduces the average bit rate and improves the overall speech quality.
  • In mobile radio systems (e.g., GSM and CDMA2000) with discontinuous transmission (DTX), voice activity detection is essential for increasing total capacity by reducing co-channel interference and the energy consumption of mobile devices.

For a wide range of applications such as digital mobile radio, Digital Simultaneous Voice and Data (DSVD) or speech storage, it is desirable to transmit speech coding parameters discontinuously. The benefits can include lower average power consumption in mobile devices, a higher average bit rate for concurrent services such as data transmission, or a higher effective capacity of memory chips. However, the benefits depend on the proportion of pauses in conversations and on the reliability of the voice activity detection used. On the one hand, it is advantageous to have a low proportion of speech activity. On the other hand, clipping, that is the loss of portions of active speech, should be minimized in order to maintain quality. This is the crucial problem for a VAD algorithm under heavy noise conditions.
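How the benefit depends on the proportion of pauses can be shown with a short calculation. This is a sketch under stated assumptions: the two bit rates and the 40 % speech activity are hypothetical figures chosen for illustration, not values from a particular codec.

```python
def average_bit_rate(active_kbps, silence_kbps, speech_fraction):
    """Average bit rate when a lower rate is used while no speech is detected."""
    if not 0.0 <= speech_fraction <= 1.0:
        raise ValueError("speech_fraction must lie between 0 and 1")
    return speech_fraction * active_kbps + (1.0 - speech_fraction) * silence_kbps


# Hypothetical: 12.2 kbit/s during speech, 1.0 kbit/s during pauses,
# speech detected 40 % of the time.
print(round(average_bit_rate(12.2, 1.0, 0.4), 2))  # → 5.48
```

A VAD that labels too many noise frames as speech pushes `speech_fraction` up and erodes the saving, which is why reliability matters as much as the share of pauses.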

Use in telephone sales

A controversial use of voice activity detection is in conjunction with the predictive dialers used by telemarketing companies. To maximize agent productivity, telemarketing companies set up a predictive dialer to call more numbers than there are agents available, knowing that most calls will end either unanswered or at an answering machine. When a person answers, they usually speak briefly ("Hello", "Good evening", etc.) and then a period of silence follows. Answering machine messages usually contain 3 to 15 seconds of continuous speech. With properly chosen VAD parameters, a dialer can determine whether a person or an answering machine has answered the call and, if it is a person, transfer the call to an available agent. If an answering machine is detected, the dialer hangs up. Often the system correctly detects that a person has answered, but no agent is available.
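The distinction described above can be sketched as a rule over per-frame VAD decisions. This is a minimal illustration, not how any particular dialer works: `classify_answer`, the 20 ms frame length, the 2.5 s speech limit and the 0.5 s pause are all hypothetical choices.

```python
def classify_answer(decisions, frame_ms=20, max_greeting_ms=2500,
                    min_silence_ms=500):
    """Guess 'person', 'machine' or 'unknown' from per-frame VAD decisions.

    decisions: iterable of booleans, True where the VAD detected speech.
    All thresholds are hypothetical tuning values.
    """
    speech_ms = 0
    silence_ms = 0
    started = False
    for is_speech in decisions:
        if is_speech:
            started = True
            speech_ms += frame_ms
            silence_ms = 0
            if speech_ms > max_greeting_ms:
                return "machine"  # long stretch of speech, as in a recorded message
        elif started:
            silence_ms += frame_ms
            if silence_ms >= min_silence_ms:
                return "person"  # short greeting followed by a pause
    return "unknown"


person = [False] * 10 + [True] * 40 + [False] * 30  # 0.8 s greeting, then silence
machine = [True] * 200                              # 4 s of continuous speech
print(classify_answer(person), classify_answer(machine))
# → person machine
```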

Performance evaluation

To evaluate a VAD, its output on test recordings is compared with that of an "ideal" VAD, created by manually marking the presence and absence of speech in the recordings. The performance of a VAD is usually analyzed in terms of four parameters:

  • FEC (Front End Clipping): clipping introduced at the transition from noise to speech activity;
  • MSC (Mid Speech Clipping): clipping within a speech segment caused by misclassifying speech as noise;
  • OVER: noise interpreted as speech because the VAD flag remains active after the transition from speech to noise;
  • NDS (Noise Detected as Speech): noise interpreted as speech during a period of silence.
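The four parameters can be counted from a manually labeled reference and the VAD output, one boolean per frame. The following is a sketch of one reasonable reading of the definitions above; the function name and the exact boundary handling are assumptions (for example, a speech segment that is missed entirely is counted wholly as FEC).

```python
def clipping_metrics(reference, detected):
    """Count FEC, MSC, OVER and NDS frames.

    reference, detected: equal-length sequences of booleans (True = speech).
    """
    if len(reference) != len(detected):
        raise ValueError("sequences must have the same length")
    counts = {"FEC": 0, "MSC": 0, "OVER": 0, "NDS": 0}
    onset = False     # inside a speech segment that the VAD has not yet detected
    hangover = False  # after a speech segment, VAD flag still active
    prev_ref = False
    prev_det = False
    for ref, det in zip(reference, detected):
        if ref:
            if not prev_ref:
                onset = True  # a new reference speech segment starts
            hangover = False
            if det:
                onset = False
            elif onset:
                counts["FEC"] += 1
            else:
                counts["MSC"] += 1
        else:
            if prev_ref:
                hangover = prev_det  # flag carried over from speech into noise
            onset = False
            if det:
                counts["OVER" if hangover else "NDS"] += 1
            else:
                hangover = False
        prev_ref, prev_det = ref, det
    return counts


T, F = True, False
reference = [F, F, T, T, T, T, T, F, F, F, F]
detected = [F, T, F, T, T, F, T, T, T, F, T]
print(clipping_metrics(reference, detected))
# → {'FEC': 1, 'MSC': 1, 'OVER': 2, 'NDS': 2}
```

Dividing each count by the number of reference speech or noise frames turns the counts into rates that can be compared across recordings of different length.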

Although the method described above provides useful objective information about the performance of a VAD, it is only an approximate measure of the subjective effect. For example, depending on the type of comfort noise generator chosen, the effects of clipped speech segments can sometimes be hidden by the presence of background noise, so that some of the clipping measured by objective tests is not actually audible. It is therefore important to subject VADs to subjective tests, mainly to ensure that the perceived clipping is acceptable. This type of test requires a certain number of listeners to judge recordings containing the output of the VAD under test. The listeners rate the following features:

  • Quality;
  • Intelligibility;
  • Audibility of incisions.

These ratings, obtained by listening to several speech sequences, are then used to calculate average results for each of the features listed above, giving an overall assessment of the behavior of the VAD under test. So while objective methods are very useful in the initial stage of development for checking the quality of a VAD, subjective methods are more meaningful. Because they are more expensive (they require the participation of a certain number of people over a few days), they are generally used only when a proposal is in the process of being standardized.

Implementations

  • An early standardized VAD was developed in 1991 by British Telecom for use in the pan-European digital mobile network. It uses inverse filtering trained on non-speech segments to filter out background noise, so that a simple level threshold can then decide more reliably whether a voice is present.
  • The G.729 standard calculates the following features for its voice activity detection: line spectral frequencies, full-band energy, low-band energy (<1 kHz) and zero-crossing rate. It applies a simple classification with a fixed decision boundary in the space defined by these features, and then applies smoothing and adaptive corrections to this estimate.
  • The GSM standard developed by ETSI contains two options for voice activity detection. The first option calculates the signal-to-noise ratio in nine frequency bands and applies a threshold to these values. The second option calculates different parameters: channel power, voice metrics and noise power. It then applies a threshold to the voice metrics, and this threshold varies with the estimated signal-to-noise ratio.
  • The Speex audio compression library uses a procedure named Improved Minima-Controlled Recursive Averaging, which uses a smoothed representation of the spectral power and then searches for the minima of a smoothed periodogram. As of version 1.2, it was replaced by what the author called a "kludge".
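Two of the features named for G.729, full-band energy and zero-crossing rate, are simple enough to sketch directly. The code below is an illustration of the feature definitions only, with hypothetical function names; it is not the normative G.729 procedure, and it omits the line spectral frequencies, the low-band energy and the decision logic.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        raise ValueError("frame needs at least two samples")
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def full_band_energy_db(frame):
    """Mean power of the frame in dB (a small floor avoids log(0))."""
    power = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(power + 1e-12)


# 10 ms frames at an assumed 8 kHz sampling rate.
low_tone = [math.sin(2 * math.pi * 100 * n / 8000) for n in range(80)]
hiss_like = [0.05 * (-1) ** n for n in range(80)]  # sign flips on every sample

for name, frame in (("low tone", low_tone), ("hiss-like", hiss_like)):
    print(name, round(zero_crossing_rate(frame), 3),
          round(full_band_energy_db(frame), 1))
```

Voiced speech tends to combine high energy with a low zero-crossing rate, while broadband noise tends to show the opposite pattern, which is why the two features complement each other in a classifier.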