Raymond Lau, Giovanni Flammia, Christine Pao and Victor Zue
Spoken Language Systems Group
MIT Laboratory for Computer Science
545 Technology Square
Cambridge, MA 02139
United States of America
raylau@sls.lcs.mit.edu, flammia@sls.lcs.mit.edu,
pao@sls.lcs.mit.edu, zue@sls.lcs.mit.edu
This paper presents WebGALAXY, a flexible multi-modal user interface system that allows wide access to selected information on the World Wide Web (WWW) by integrating spoken and typed natural language queries and hypertext navigation. WebGALAXY extends our GALAXY spoken language system, a distributed client-server system for retrieving information from online sources through speech and natural language. WebGALAXY supports a spoken user interface via a standard telephone line as well as a graphical user interface via a standard Web browser, using either a Java/JavaScript or a cgi-bin/forms front end. Natural language understanding is performed by the system, and information servers retrieve the requested information from various online resources, including WWW servers, Gopher servers, and CompuServe. Currently, queries in three domains are supported: weather, air travel, and points of interest around Boston.
We believe, as do many others, that a speech interface for a browser is ideal for naive users because it is the most natural, flexible, efficient, and economical form of human communication. However, providing a speech interface is much more than simply being able to "speak" the icons and hyperlinks that are designed for keyboard and mouse. This is because replacing one modality by another, while undoubtedly useful in hands-busy environments and for disabled users, does not necessarily expand the system's capabilities or lead to new interaction paradigms. Instead, we need to explore how spoken language technology can significantly expand the user's ability to obtain the desired information from the Web easily and quickly. In our view, speech interfaces should be an augmentation of, rather than a replacement for, mouse and keyboard. A user should be able to choose among many input/output modalities to achieve the task in the most natural and efficient manner.
Spoken language interaction is particularly appropriate when the information space is broad and diverse, or when the user's request contains complex constraints. Both of these situations occur frequently on the Web. For example, finding a specific homepage or document now requires remembering a URL, searching through the Web for a pointer to the desired document, or using one of the available keyword search engines. The heart of the problem is that the current interface presents the user with a fixed set of choices at any point, of which one is to be selected. Only by stepping through the offered choices and conforming to the prescribed organization of the Web can the user reach the document they desire. The multitude of indexes and meta-indexes on the Web is testimony to the reality and magnitude of this problem. The power of spoken language in this situation is that it allows the user to specify what information or document is desired (e.g., "Show me the MIT homepage," "Will it rain tomorrow in Seattle," or "What is the zip code of Santa Clara, California"), without having to know where and how the information is stored. Complex requests can arise when a user is interested in obtaining information from online databases. Constraint specifications, which come naturally to users (e.g., "I want to fly from Boston to Hong Kong with a stopover in Tokyo," or "Show me the hotels in Boston with a pool and a Jacuzzi"), are both diverse and rich in structure. Menu- or form-based paradigms cannot readily cover the space of possible queries. A spoken language interface, on the other hand, offers a user significantly more power in expressing constraints, thereby freeing them from having to adhere to a rigid, preconceived indexing and command hierarchy.
In fact, many tasks that a user would like to perform on the Web (browsing for the cheapest airfare, for example, or looking for a reference article) are exercises in interactive problem-solving. The solution is often built up incrementally, with both user and computer playing active roles in the "conversation." Therefore, several language-based technologies must be developed and integrated to reach this goal. On the input side, speech recognition must be combined with natural language processing so the computer can understand spoken commands (often in the context of previous parts of the dialogue). On the output side, some of the information provided by the computer, and any of the computer's requests for clarification, must be converted to natural sentences, and perhaps delivered verbally.
Since 1989, our group has been conducting research leading to the development of conversational interfaces to computers. The most recent system we developed, called GALAXY, is a speech-based interface that enables universal information access using spoken dialogue. The initial demonstration of GALAXY is in the domain of travel planning and knowledge navigation, making use of many online databases, most of them available on the Web. Users can query the system in natural English (e.g., "What is the weather forecast for Miami tomorrow," "How many hotels are there in Boston," or "Do you have any information on Switzerland") and receive verbal and visual responses.
The GALAXY conversational interface was a client application running under the X Window System. A major constraint of the X-based client was that a user had to have access to an X server in order to use it. Unfortunately, most personal computer users do not have X server software, and furthermore the X protocol requires a high-bandwidth Internet connection to function acceptably. This paper describes the WebGALAXY project, whose goal is to integrate the client into a Web browser, which is available on almost any platform today, with no need for the user to download any additional software or plug-ins. More importantly, WebGALAXY serves as an illustration of a possible new interface paradigm that is rich, flexible, and intuitive.
Integrating a new information server into the system requires three stages. First, we must define the appropriate semantic frame representation and the access protocols needed to reach the online resources. Frequently, a local database containing pointers to various resources, along with auxiliary information, is created to help with this step (e.g., a list of city names and airport codes for air travel). Then, we must add new entries to the pronunciation lexicon of the voice recognition component for the additional words in the new domain. Finally, we need to add a new set of grammar and discourse rules, along with new lexical entries, for the natural language understanding and speech synthesis components. Once a new server is integrated into GALAXY, system performance can be improved iteratively by running usability tests: collecting speech data from user sessions improves voice recognition accuracy, while collecting natural language data allows us to refine and broaden the coverage of the grammar rules that guide the natural language component.
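As a concrete illustration of the first two stages, the sketch below shows what a semantic frame, an auxiliary-table entry, and a pronunciation-lexicon entry might look like for a weather query. The paper does not specify GALAXY's actual frame syntax or phone set, so all field names and the phonemic baseform here are invented for exposition.

```python
# Hypothetical semantic frame for the spoken query
# "Will it rain tomorrow in Seattle?"  All field names are invented;
# GALAXY's actual frame syntax is not shown in this paper.
weather_frame = {
    "domain": "weather",
    "clause": "verify",        # a yes/no question
    "topic": "precipitation",
    "city": "Seattle",
    "state": "Washington",     # filled in from the auxiliary city table
    "date": "tomorrow",        # resolved to a calendar date downstream
}

# The kind of auxiliary-table and pronunciation-lexicon entries the
# integration stages above would have to supply (both invented):
city_table_entry = {"city": "Seattle", "state": "Washington", "airport": "SEA"}
lexicon_entry = ("seattle", "s iy ae t ax l")  # word -> phonemic baseform
```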
Adding support for a new language requires the definition of new acoustic-phonetic units, a new lexicon, and a new set of grammar rules. The semantic frame representation is language-independent, and allows the information retrieved from the database to be translated from one language into another. Prototype versions of GALAXY are currently available for Spanish and Mandarin Chinese.
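The sketch below illustrates how a single language-independent frame might drive response generation in two languages. The templates and per-language lexica are invented for exposition; GALAXY's actual generation component is more sophisticated.

```python
# A language-independent frame paraphrased by per-language templates.
# Both the templates and the lexica are invented for illustration.
templates = {
    "english": "The forecast for {city} calls for {condition}.",
    "spanish": "El pronóstico para {city} indica {condition}.",
}
lexica = {
    "english": {"rain": "rain"},
    "spanish": {"rain": "lluvia"},
}

frame = {"city": "Boston", "condition": "rain"}   # same frame for both languages

for lang in ("english", "spanish"):
    print(templates[lang].format(city=frame["city"],
                                 condition=lexica[lang][frame["condition"]]))
```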
The development of WebGALAXY required several changes to the original GALAXY architecture. The previous GALAXY client's functionality was split into two parts: a new WebGALAXY hub and a standard Web browser. The hub maintains the state of the current discourse with the user and also mediates the information flow between the various servers and the Web browser. The Web browser provides the entire graphical user interface to WebGALAXY. Figure 2 outlines the WebGALAXY architecture.

Figure 2. The WebGALAXY client/server architecture utilizes the telephone network for speech input/output and standard World Wide Web protocols for the graphical interface, making GALAXY accessible to a global audience.
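The following sketch outlines the hub's mediating role described above. The class, method, and field names are assumptions made for illustration; the hub's internal interfaces are not described in this paper.

```python
# Minimal sketch of the hub's mediation between the recognizer, the
# natural language server, the domain servers, and the Web browser.
# All names and the message format are invented for illustration.
class Hub:
    def __init__(self, recognizer, nl_server, domain_servers, browser):
        self.recognizer = recognizer
        self.nl_server = nl_server
        self.domain_servers = domain_servers   # e.g. {"weather": ..., "air_travel": ...}
        self.browser = browser
        self.discourse_state = {}              # history, last displayed list, etc.

    def handle_utterance(self, audio):
        text = self.recognizer.recognize(audio)
        # The NL server interprets the text in the context of the discourse.
        frame = self.nl_server.parse(text, self.discourse_state)
        reply = self.domain_servers[frame["domain"]].retrieve(frame)
        self.discourse_state["last_frame"] = frame
        self.browser.display(reply)            # HTML pushed to the Web browser
```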
Two graphical user interfaces are currently supported: a Java/JavaScript interface with rich interactivity and a more austere cgi-bin/forms interface for browsers that do not support Java. Additionally, WebGALAXY is designed to support a displayless interface using only spoken language interaction. To start WebGALAXY, the user simply goes to the WebGALAXY homepage, selects either the Java or the forms interface, optionally enters a phone number for spoken interaction, and clicks the Start button. If a phone number was entered, the user is called back shortly, and spoken language interaction can take place over the phone. With or without a phone number, the user can always interact with the system through typing and clicking.
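A minimal sketch of this start-up flow follows. The paper does not detail the cgi-bin program, so the helper functions here are invented stubs:

```python
# Hedged sketch of the WebGALAXY session start-up flow; all names invented.
import uuid

def hub_new_session(interface):
    """Create a new session record at the hub (stubbed for illustration)."""
    return {"id": uuid.uuid4().hex, "interface": interface}

def telephone_dialout(phone, session):
    """Stub: in the real system the telephone server calls the user back."""
    print(f"dialing {phone} for session {session['id']}")

def start_session(form):
    session = hub_new_session(form.get("interface", "forms"))
    if form.get("phone_number"):
        # If a number was given, the user is called back so that spoken
        # interaction can proceed over an ordinary telephone line.
        telephone_dialout(form["phone_number"], session)
    return session   # typing and clicking remain available either way

# Example: a user who chose the forms interface and asked to be called.
start_session({"interface": "forms", "phone_number": "617-555-0100"})
```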
In the display shown here, the user asked for the forecast for Boston orally. The request was correctly handled by the natural language server, and the reply with the forecast was generated by the Weather domain server and displayed by WebGALAXY. The user could also have typed the same request. Certain requests generate lists. For example, the request "Show me Chinese restaurants in Cambridge" would generate a list as a reply. The user can then continue to interact verbally, using the names of the restaurants or their ordinal positions ("the second one"). The user can also click on an item in the list and then say "Give me the phone number," referring to the clicked item. For certain types of lists, clicking twice on an item gives more detailed information. Requests for homepages, such as "Show me the homepage for MIT," will retrieve and display the target homepage in the lower area. The user is free to continue browsing with the mouse and keyboard from that page, for example by clicking a link.
Figure 3. The Java-based graphical interface. At the top, an Applet connected to the WebGALAXY hub displays the speech recognition output and the system status in real time. The Applet directs the display of HTML-formatted responses from WebGALAXY to the frames below.
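The follow-up behavior described above (referring to list items by name, by ordinal position, or by clicking) suggests a resolution step against the most recently displayed list. The sketch below is an inference from that description, not the system's actual algorithm:

```python
# Inferred sketch: resolving a follow-up reference against the most
# recently displayed list.  Identifiers are invented for illustration.
ORDINALS = {"first": 0, "second": 1, "third": 2}

def resolve_reference(utterance, displayed_list, clicked_index=None):
    """Return the list item a follow-up request refers to, or None."""
    if clicked_index is not None:           # "Give me the phone number" after a click
        return displayed_list[clicked_index]
    for word, idx in ORDINALS.items():      # "the second one"
        if word in utterance and idx < len(displayed_list):
            return displayed_list[idx]
    for item in displayed_list:             # reference by name
        if item["name"].lower() in utterance.lower():
            return item
    return None

restaurants = [{"name": "Mary Chung"}, {"name": "Royal East"}]
print(resolve_reference("the second one", restaurants))                 # by ordinal
print(resolve_reference("give me the phone number", restaurants,
                        clicked_index=0))                               # by click
```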
There is also a limited amount of interactive input permitted within GALAXY. For example, when a list is displayed (e.g., in response to "Show flights from Boston to Los Angeles tomorrow"), the user is permitted to click on items in that list. Normally, a single click only modifies the dialog state, so that GALAXY knows that "item X has been selected." The user may then follow up with a request affecting the clicked-on item. If we were to create normal links, i.e., standard HREF tags, such clicks would necessarily generate an annoying visual update as the browser loads a new page. Instead, we have decided to use JavaScript to send a message to the Java applet, which then communicates it back to the hub over the control channel.
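On the hub side, handling such a control-channel message might look like the following sketch. The wire format shown is invented; the paper specifies only that JavaScript notifies the applet, which relays the event to the hub:

```python
# Hub-side sketch of a control-channel click message; the "SELECT n"
# wire format is invented for illustration.
def handle_control_message(message, dialog_state):
    if message.startswith("SELECT "):
        # e.g. "SELECT 2": record the selection without reloading any page,
        # so a follow-up request can refer to the selected item.
        dialog_state["selected_item"] = int(message.split()[1])

state = {}
handle_control_message("SELECT 2", state)
print(state)   # {'selected_item': 2}
```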
In both the Java and Forms interfaces, access to the hub's control channel is regulated through the use of magic cookies. Before the hub will initiate a session over the control channel, it must receive a valid cookie. The hub's HTTP server does not currently impose any access restrictions. Thus any user can grab the WebGALAXY artwork and the Java class files used to implement the applet, but we do not currently consider this to be a serious risk. We merely want to restrict the initiation of new WebGALAXY sessions with the hub.
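A minimal sketch of such magic-cookie gating follows, assuming the cookie is minted when the start page is served and that each cookie starts at most one session (neither detail is specified in the paper):

```python
# Sketch of magic-cookie gating for the hub's control channel.
import secrets

valid_cookies = set()

def issue_cookie():
    """Mint a cookie when the start page is served; embedded in the page."""
    cookie = secrets.token_hex(16)
    valid_cookies.add(cookie)
    return cookie

def accept_session(cookie):
    """The hub initiates a control-channel session only for a valid cookie."""
    if cookie in valid_cookies:
        valid_cookies.discard(cookie)  # assumption: one-shot cookies
        return True
    return False

c = issue_cookie()
assert accept_session(c)       # first presentation succeeds
assert not accept_session(c)   # a replayed cookie is rejected
```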
We have successfully tested WebGALAXY with Netscape Navigator 3.0 running under Windows, MacOS, Linux, SunOS, and Solaris, and with Microsoft Internet Explorer 3.0 running under Windows. We have demonstrated WebGALAXY from locations within the United States, Italy, and Sweden, and with Internet connections as slow as 28,800 bps. However, due to the limited number of telephone interfaces we have (one), the computational resources dedicated to the information, natural language, and recognition servers, and the constantly changing, developmental nature of the system, we are not yet able to make WebGALAXY available to the general public.
We are witnessing a shift in the types of interfaces to the Web. Desktop browsers are being replaced by browsers that reside in many types of devices, such as hand-held personal digital assistants, smart digital telephones, and television set-top boxes. Each of these devices has specific input and output interfaces and limitations. A multi-modal user interface that supports typed and spoken natural language could provide easy and universal access to the Web from different devices and in multiple languages, reaching a much wider audience.
On the content side, we recognize that the semantic units on the Web are rapidly shifting from a single static modality (ASCII text and graphics files) to multiple dynamic modalities (text and graphics, responses generated on the fly, speech, audio, and video data). Clearly, these diverse types of content and modalities require a paradigm shift in content organization and in the underlying communication protocols.
WebGALAXY is a small step toward shifting the paradigm of the Web user interface from simple point-and-click navigation in a deep forest of HTML documents to richer, more flexible, and more intuitive navigation. The WebGALAXY servers and clients are being designed to handle multiple input and output modalities in a systematic way. Currently, WebGALAXY encodes the communication between the various servers and clients with specific protocols that are defined at the software level, i.e., in software source code and in text files that list program arguments and parameters.

To allow for rapid application development and portability to new domains and new languages, we are trying to minimize the need for writing software by specifying many of the domain parameters, such as lexica, dictionaries, grammar rules, and dialogue management rules, in easily edited text files. We would like to push this approach further by specifying a uniform semantic content communication protocol for the information that is currently scattered across a variety of formats such as HTTP, CORBA, SQL, and other open and proprietary protocols. We are especially interested in standardizing the protocol for accessing the natural language server and the speech recognition server. A standard, generic, and simple communication protocol modeled after HTTP and HTML would foster the rapid deployment of a multitude of natural language and speech recognition servers across the Web. In addition, we would like to create an authoring tool for the rapid development of information servers that rely primarily on Web-based resources.
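To make the idea of an HTTP-modeled protocol concrete, the following speculative sketch shows what a request to a network natural language server and its frame-bearing reply might look like. The verb, header names, and media type are all invented; no such standard exists in the system described here:

```python
# A speculative, HTTP-like exchange with a natural language server.
# Everything shown (verb, headers, media type) is invented for illustration.
request = (
    "PARSE /nl HTTP/1.0\r\n"
    "Content-Language: en\r\n"
    "Content-Type: text/plain\r\n"
    "\r\n"
    "will it rain tomorrow in seattle\r\n"
)

response = (
    "HTTP/1.0 200 OK\r\n"
    "Content-Type: application/x-semantic-frame\r\n"
    "\r\n"
    '{"domain": "weather", "topic": "precipitation", '
    '"city": "Seattle", "date": "tomorrow"}\r\n'
)
```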
Extensions to HTML, coupled with an easy-to-use authoring tool, could specifically handle semantic content across different media, natural language queries and responses, and dialogue management for maintaining the discourse state between interactions.
Finally, WebGALAXY, and spoken language interaction with the Web in general, would clearly benefit from advances in Internet and telephone technology that allow simultaneous transmission of HTML data and voice input/output over the same connection. We are also interested in extending HTML to explicitly represent mixed media, such as a speech signal and ASCII text, that reside in the same document and possibly in the same data transmission channel.