Speech access to computing applications has typically been provided by developing generic screen-reading applications: programs that build a model of the visual display and then present this model using speech. This technique has been fairly effective in retrofitting speech output to applications that were developed with no thought to their use in an eyes-free environment by a functionally blind user.
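To make the paradigm concrete, here is a minimal sketch (my own illustration, not any particular screen reader) of an off-screen model: the screen reader mirrors whatever the application paints into a row-by-column snapshot of the display, and all speech is generated from that snapshot, with no knowledge of the application that produced it. All names in the sketch are invented for illustration.

    # Minimal sketch of the screen-reading paradigm: speech is produced
    # from a model of the visual display, not from the application.
    from typing import List

    class OffScreenModel:
        """A row/column snapshot of the character display."""
        def __init__(self, rows: int, cols: int) -> None:
            self.cells: List[List[str]] = [[" "] * cols for _ in range(rows)]

        def write(self, row: int, col: int, text: str) -> None:
            # Mirror whatever the application paints at this position.
            for i, ch in enumerate(text):
                self.cells[row][col + i] = ch

        def speak_line(self, row: int) -> str:
            # The spoken text comes from the display model alone; the
            # reader cannot know what the application meant by it.
            return "".join(self.cells[row]).rstrip()

    model = OffScreenModel(rows=24, cols=80)
    model.write(0, 0, "File  Edit  View")
    model.write(2, 0, "Dear reader,")
    print(model.speak_line(0))  # spoken with no application context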
Today, a moderate degree of spoken access to the WWW can be provided by using screen-reading applications on standard computing platforms like Windows 95 in conjunction with popular WWW browsers like Netscape Navigator or Internet Explorer. Other users who need speech access but are wary of using a GUI rely on terminal-based browsers like Lynx.
The inherent shortcoming of the screen-reading approach is that the spoken output is decoupled from the application being used. The speech the user hears is generated purely from the visual display appearing on the screen, with no regard to the application context. In the case of standard applications, this produces interfaces that range from moderately useful to minimally usable.
Things get considerably worse when retrofitting speech to the WWW using the screen-reading paradigm. On the WWW, the document is the interface, and WWW designers are free to lay out an interface of their choice using standard (or, alas, increasingly non-standard) HTML markup to achieve a particular visual effect. Thus, most screen-reading applications available today are completely defeated by the use of tables in HTML to achieve specific visual layouts such as multicolumn documents; the user of a screen-reading application hears the text in the left-to-right order in which it appears on the screen, rather than in the order in which one would read the content.
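The following sketch (a constructed example, not taken from any real page or screen reader) shows this failure mode: two unrelated columns of text are laid out with a table, and a reader that walks the text in visual, left-to-right order interleaves the two columns sentence by sentence.

    # Constructed example: a two-column layout built with an HTML table.
    # Walking the text in screen order interleaves the two columns.
    from html.parser import HTMLParser

    TWO_COLUMN_PAGE = """
    <table>
      <tr><td>Column one, paragraph one.</td>
          <td>Column two, paragraph one.</td></tr>
      <tr><td>Column one, paragraph two.</td>
          <td>Column two, paragraph two.</td></tr>
    </table>
    """

    class ScreenOrderReader(HTMLParser):
        """Collects text in left-to-right, row-by-row screen order."""
        def __init__(self) -> None:
            super().__init__()
            self.spoken: list[str] = []

        def handle_data(self, data: str) -> None:
            text = data.strip()
            if text:
                self.spoken.append(text)

    reader = ScreenOrderReader()
    reader.feed(TWO_COLUMN_PAGE)
    # Prints the columns interleaved: "Column one, paragraph one.
    # Column two, paragraph one. Column one, paragraph two. ..."
    print(" ".join(reader.spoken))

A sighted reader would read all of column one and then all of column two; the spoken rendering above jumps between them at every row.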
The problems described above arise primarily because the screen-reading approach examines only the final-form appearance of the document as rendered by the visual browser. The rest of this paper describes the speech-enabling approach, a technique in which spoken output is produced by tightly integrating speech feedback with the specific application being used. This technique has been implemented in Emacspeak, which can be thought of as a fully speech-enabled audio desktop. The remaining sections of this paper focus on one aspect of this environment, namely the speech-enabled browser.
The speech-enabled WWW browser available on the Emacspeak platform produces succinct, context-specific spoken feedback by examining the logical structure of the document rather than its visual appearance on the screen. By tightly integrating the speech system with the WWW browser, the environment provides a fluent spoken interface to interactive document elements such as fill-out forms. Finally, it should be pointed out that speech-enabled WWW browsers like the one described here rely implicitly on well-structured HTML documents, in which the logical structure of the document is clearly separated from its visual presentation.
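As a rough illustration of the contrast (a hypothetical sketch, not the actual Emacspeak implementation), a structure-aware renderer walks the parsed document and announces each element by its logical role, so a text entry field is spoken as an edit field with its name rather than as stray characters on the screen. The tag handling below is deliberately simplified.

    # Hypothetical sketch of structure-aware spoken rendering: announce
    # elements by logical role instead of echoing the visual display.
    from html.parser import HTMLParser

    DOCUMENT = """
    <h1>Search</h1>
    <form>
      <label>Query</label>
      <input type="text" name="q">
      <input type="submit" value="Go">
    </form>
    """

    class StructureSpeaker(HTMLParser):
        """Produces one spoken phrase per logical document element."""
        def __init__(self) -> None:
            super().__init__()
            self.role = None          # role of the element being read
            self.phrases: list[str] = []

        def handle_starttag(self, tag, attrs) -> None:
            attrs = dict(attrs)
            if tag == "h1":
                self.role = "heading level 1"
            elif tag == "input" and attrs.get("type") == "text":
                self.phrases.append(f"edit field: {attrs.get('name', 'unnamed')}")
            elif tag == "input" and attrs.get("type") == "submit":
                self.phrases.append(f"button: {attrs.get('value', 'submit')}")

        def handle_data(self, data: str) -> None:
            text = data.strip()
            if text:
                self.phrases.append(f"{self.role}: {text}" if self.role else text)
                self.role = None

    speaker = StructureSpeaker()
    speaker.feed(DOCUMENT)
    for phrase in speaker.phrases:
        print(phrase)  # e.g. "heading level 1: Search", "edit field: q"

Note that such a renderer can announce roles only because the markup exposes the document structure; a page that encodes everything as visual layout leaves it nothing meaningful to say.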