VoiceXML (Voice Extensible Markup Language) is an XML application for describing dialog flows in a speech dialog system. It was developed specifically for telephone applications. Since June 2007 the current version, VoiceXML 2.1, has been a recommendation of the World Wide Web Consortium (W3C) and therefore has the same status as a web standard such as HTML. Applications developed in VoiceXML thus run on any VoiceXML-compliant voice platform. By analogy with the HTML web browser, a VoiceXML interpreter is also referred to as a voice browser.

To extend graphical user interfaces on the World Wide Web with natural-language input and output, and thus turn them into multimodal user interfaces, two further dialog description languages have evolved as alternatives or supplements to VoiceXML:

  • SALT (Speech Application Language Tags)
  • X+V (XHTML+Voice)

SALT was initiated by Microsoft and is intended to link voice applications more closely with the content and mechanisms of the World Wide Web. X+V combines XHTML and VoiceXML elements in order to merge the Internet and telephony.

History of development

In the first voice applications there was no separation between application and platform. Dialog flows were programmed and compiled in just as "hard-wired" a fashion as the interfaces to the telephone system. This had the advantage that voice applications could usually be created quickly and ran reliably. The price was a rigidity that is unacceptable by today's standards: if a dialog had to be changed, for example, the application programmer had to intervene deep in the source code.

In more recent speech applications, the application was therefore separated from the platform so that dialogs could be maintained easily. However, the scripting languages and tools used to describe these applications were (and in many cases still are) proprietary, and thus differed from vendor to vendor.

VoiceXML 2.0 is a standardization effort that aims at a unified description of speech applications. At the same time it is an interface language that can be used for communication between the application and the platform. The standard has since become widespread and is supported by many vendors, although it does not dominate the market completely. Besides the proprietary solutions and application platforms that remain very popular, there are competing standardization approaches, in particular the SALT standard driven by a Microsoft-led consortium. The VoiceXML 2.0 specification was published on 16 March 2004.

VoiceXML 2.1 was published on 19 June 2007 and extends version 2.0 with a number of additional features. These are intended to compensate for shortcomings identified while working with VoiceXML 2.0. Version 2.1 is fully backward compatible with version 2.0.

Work is currently under way on the specification for VoiceXML 3.0. This version is to bring a complete redesign of the specification, allowing it to be used as a domain-specific language for developing speech interfaces outside telephony as well. Backward compatibility with VoiceXML 2.1 is to be provided by a special profile.

Analogies to the World Wide Web

A comparison of VoiceXML with HTML reveals a number of parallels. Like HTML, VoiceXML is both a description language and an interface standard:

  • VoiceXML can be used directly to code voice applications, just as HTML can be used directly to code user interfaces.
  • An application can also be defined with a proprietary tool that generates VoiceXML code from it (dynamically or statically). This is comparable to using a document management system to maintain a website. In this case VoiceXML is largely reduced to its role as an interface standard.
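
The second route, generating VoiceXML from a higher-level definition, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular tool works: the function name build_menu_document and the target URIs timetable.vxml and fares.vxml are hypothetical. The elements <vxml>, <menu>, <prompt> and <choice> and the namespace URI are taken from the W3C VoiceXML 2.x specifications.

```python
import xml.etree.ElementTree as ET

# Namespace used by VoiceXML 2.0 and 2.1 documents.
VXML_NS = "http://www.w3.org/2001/vxml"


def build_menu_document(prompt_text, choices):
    """Return a VoiceXML 2.1 document (as a string) containing one <menu>.

    `choices` maps the phrase a caller may say to the URI of the next
    dialog. The function name and the URIs are hypothetical.
    """
    vxml = ET.Element("vxml", {"version": "2.1", "xmlns": VXML_NS})
    menu = ET.SubElement(vxml, "menu", {"id": "main"})
    ET.SubElement(menu, "prompt").text = prompt_text
    for phrase, target in choices.items():
        # Each <choice> names the spoken phrase and where to go next.
        choice = ET.SubElement(menu, "choice", {"next": target})
        choice.text = phrase
    return ET.tostring(vxml, encoding="unicode")


if __name__ == "__main__":
    print(build_menu_document(
        "Say timetable or fares.",
        {"timetable": "timetable.vxml", "fares": "fares.vxml"},
    ))
```

The string produced here could equally be written by hand and served as a static file, which corresponds to the first route in the list.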

At the present state of the technology, however, the analogy still breaks down at one important point: the VoiceXML browser (as part of the platform) does not yet reside directly on the end user's phone but, for efficiency reasons, often sits in the same server room as the application server. Communication between caller and platform takes place over the public telephone network. For the caller, and often for the operator as well, it is therefore of little significance which standard the platform and the application use to communicate. Only when increased computing power allows the browser (and with it, above all, speech recognition and speech synthesis) to run on the phone itself will standardization become really important for the caller (more precisely, the user of the voice application). The situation is thus still in some ways similar to the question of whether the user interface of a locally run application should be implemented in HTML, in Visual Basic, or with a (proprietary) GUI builder: what matters most is the quality of the resulting user interface.


VoiceXML is, like any standard, a compromise. Desired features may therefore not be supported, or only in a later version. In such cases VoiceXML can be extended with proprietary additions. This dilutes the advantages mentioned above somewhat, but is still more practical than building the whole system on a proprietary scripting language.

VoiceXML as a scripting language for application development rests on the fundamental idea that dialogs between human and machine can be formalized explicitly as predefined flowcharts. In this view the caller "navigates" through the predefined dialog flow, often even using explicit navigation commands such as "Back" and "Main menu". This concept reaches its limits where the interaction approaches a free human-machine dialog in which the caller can take the initiative by speaking complete sentences, such as "no, to Hamburg, and so that I get there at around 6 p.m." (so-called conversational or mixed-initiative dialogs). VoiceXML does offer constructs that give the caller certain freedoms in navigating the dialog flow (for example, so-called form filling); however, the effort of application development inherently rises sharply as the freedom in the dialog flow increases. For such dialogs it proves useful to introduce a so-called dialog manager, which decides dynamically, based on the dialog history, how the system reacts. Such a dialog manager can be used to generate VoiceXML documents dynamically, as the interface to the voice platform.
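
As a rough illustration of a dialog manager that generates VoiceXML dynamically, the following Python sketch returns the document for the next dialog turn, based on which pieces of information the caller has already supplied. Everything application-specific is an assumption made for the example: the slot names, the prompts, the grammar file names (destination.grxml, arrival_time.grxml) and the submit URL /dialog are hypothetical. Only the VoiceXML elements themselves (<form>, <field>, <prompt>, <grammar>, <filled>, <submit>, <block>) come from the specification.

```python
import xml.etree.ElementTree as ET

# Namespace used by VoiceXML 2.0 and 2.1 documents.
VXML_NS = "http://www.w3.org/2001/vxml"

# Hypothetical slots of a train-travel dialog, each with its prompt.
SLOT_PROMPTS = {
    "destination": "Where do you want to travel to?",
    "arrival_time": "When do you want to arrive?",
}


def next_document(filled_slots, submit_url="/dialog"):
    """Generate the VoiceXML document for the next dialog turn.

    `filled_slots` is the dialog history reduced to the slot values
    collected so far. `submit_url` is a hypothetical endpoint of the
    dialog manager that receives the caller's next answer.
    """
    vxml = ET.Element("vxml", {"version": "2.1", "xmlns": VXML_NS})
    form = ET.SubElement(vxml, "form")
    missing = [slot for slot in SLOT_PROMPTS if slot not in filled_slots]
    if not missing:
        # Everything is known: read back a confirmation and stop asking.
        block = ET.SubElement(form, "block")
        block.text = "Travelling to {destination}, arriving at {arrival_time}.".format(
            **filled_slots
        )
        return ET.tostring(vxml, encoding="unicode")
    # Ask only for the first slot that is still missing.
    slot = missing[0]
    field = ET.SubElement(form, "field", {"name": slot})
    ET.SubElement(field, "prompt").text = SLOT_PROMPTS[slot]
    ET.SubElement(
        field, "grammar", {"src": slot + ".grxml", "type": "application/srgs+xml"}
    )
    # Once the field is filled, send the value back to the dialog manager.
    filled = ET.SubElement(field, "filled")
    ET.SubElement(filled, "submit", {"next": submit_url, "namelist": slot})
    return ET.tostring(vxml, encoding="unicode")
```

If the caller has already said "to Hamburg", calling next_document({"destination": "Hamburg"}) yields a form that asks only for the arrival time. The decision logic lives in the application, and VoiceXML serves purely as the interface to the voice platform.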

Multimodal applications, i.e. those that combine speech with graphical output, are currently supported by VoiceXML only to a limited extent. There are, however, efforts to move dialog description languages in a multimodal direction. X+V (XHTML+Voice), for example, attempts to merge VoiceXML with XHTML by means of special synchronization elements. SALT offers a further approach: it is intended as an add-on to HTML, but for its voice functions it relies on a proprietary approach that differs from VoiceXML. So far, the main problem with these technical solutions is that a convincing use case for their practical application is still lacking.