Information Access:

A Cornerstone of Web Publishing


Authors:
Paul Campbell, Solutions Consultant
Lawrence Fitzpatrick, Developer/Architect of CPL
Assisted by:
David C. Macdonald, Director of Marketing for Electronic Publishing
Adrienne Griffith, Product Manager (acting), PLServer

Abstract

Background
Personal Library Software, a full-text search and retrieval company based in Rockville, MD, has devoted the past 11 years to developing the most intelligent and intuitive search tools in the PC, Mac & UNIX market. On the basis of our history, we have taken this expertise and focused it on complementing the power of Mosaic and the World Wide Web with PLServer, an Internet publishing tool.

PLServer
PLServer targets the goal of reducing the effort needed to place information on the WWW, while fundamentally increasing its accessibility. We have integrated our robust search engine to communicate with an HTTP server. Our Mosaic forms-based interface is the guide to the information.

The intelligent search tools provided by PLServer include the ability to search databases using natural language or conversational syntax; relevance feedback and ranking of retrieved documents; and dynamic concept searching, which automatically builds a database-specific list of related words to improve the search.

As noted above, PLServer aids in the process of publishing information by providing basic HTML mark-up and automatically applying it to untagged data.

Objective
It is PLS's intent to facilitate the Gutenberg-like potential of the Web and Mosaic with powerful information management and publishing tools.

To try PLServer: http://www.pls.com

INFORMATION ACCESS: A Cornerstone of Web Publishing

Introduction

Close to 500 years ago Johann Gutenberg spawned a revolution in publishing with the invention of the printing press. Mass production of information was enabled. Today, many believe that we are witnessing a revolution in publishing of equal, if not greater, magnitude with the emergence of the World Wide Web and MOSAIC. Skeptics, on the other hand, view these claims as one of the many overhyped declarations of how a new technology will change the world. The tendency towards overstating the significance of technology is nothing new. Hyperbole has become almost commonplace in the computer industry. However, close inspection of the Web and MOSAIC reveals the hype to be substantive. The Web as a communications medium possesses some very attractive characteristics for publishing. Clearly, the Web has not totally arrived with all the functionality for electronic communications. However, with the continuing introduction of additional robust capabilities, the Web will become the preferred platform for online publishing.

Publishing, whether hard copy or digital, is comprised of four building blocks: Content, Medium, Market and Access.

a). Content is the information to be distributed, whether printed text, graphics, audio or video.

b). The Medium is the delivery mechanism. With technology, the Medium has migrated from printed text to CD-ROM and online services. Today, the Web has emerged as a very compelling medium.

c). All publishing by its definition requires a Market. Today's publishers are concerned with targeting their message, effectiveness of communication, distribution channels, pricing, and promotion considerations. These decisions are obviously influenced by the Content and in many respects limited by the available Medium.

d). The final cornerstone, Access, is often the most ignored or misunderstood in terms of its importance. As we will illustrate, this is essential in mediating a digital world. Access is defined here as the ability to quickly and easily find what you want and need in a pool of information. In a paper-restricted world this was handled via the table of contents and the index. These hierarchical metaphors are limited when transferred to digital form. Digital publishers require more effective access methods that are geared towards the Medium, the Content, and the Market.

All four of the publishing building blocks are intertwined and have an impact on one another. For the purposes of this discussion the focus will be on the relationship between the Medium and Access. Specifically, the medium of the World Wide Web will be analyzed in terms of its potential as a publishing vehicle. The assessment will explain the aspects of the Web that make it appealing. The conclusion to be drawn from this analysis is that any realization of this potential is incumbent upon the implementation of industrial strength Access tools. Without the power of intelligent search and retrieval capabilities, publishers and their audience will quickly find the Web overwhelming and confusing.

Method

As stated, the purpose of this paper is to outline the pivotal role of intelligent information access in World Wide Web publishing. In order to accomplish this, we must highlight the aspects of the Web which are so compelling for publishing. By contrasting the Web architecture to that of traditional forms of online publishing, the advantages become apparent.

From this platform the value of intelligent access tools will be introduced. Specifically it will be demonstrated through an in-depth review of PLServer, a Web-based search and retrieval tool. PLServer is an advanced product from Personal Library Software, a company that has devoted the last 11 years to developing intelligent tools to aid in the process of accessing and retrieving information from large electronic data repositories.

Access to information is necessary in all cases of publishing; however, it gains particular importance at this juncture of the Web's development. The growth and promise of the Web as a publishing medium hinge on intelligent access mechanisms. Otherwise the hype will lead to frustration from users unable to find important information at the right time.

Web & Mosaic = Publishing (almost)

The phenomenal growth of the World Wide Web and Mosaic over the past year and a half is unprecedented. To appreciate the magnitude of this increase, consider that at the current estimated rate of expansion, Web usage will exceed the world's digitized voice traffic in three years. No other technology has ever proliferated at this rate.

This level of popularity can be traced to the inherent architecture of the Web and the emergence of MOSAIC. With the Internet as the backbone, the Web is a set of standards for resolving global hyperlinks and viewing documents. Given its universal addressing structure, hyperlinks can resolve to documents or to any network resources like ftp, mail or Gophers. With the advent of the MOSAIC graphical browser, the Web can support navigation to all the resources of the Internet. It is not surprising that the Web is so amenable to publishing: the goal at its inception was to facilitate the sharing of information.
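The universal addressing structure mentioned above can be illustrated with a minimal sketch. It assumes Python's standard urllib.parse module; apart from http://www.pls.com, the hosts and paths are invented for illustration, and the describe helper is a hypothetical name.

```python
from urllib.parse import urlsplit

# Hypothetical link targets; only www.pls.com comes from this paper.
links = [
    "http://www.pls.com/",
    "ftp://ftp.example.org/pub/readme.txt",
    "gopher://gopher.example.org/",
    "mailto:someone@example.org",
]

def describe(url):
    """Return (scheme, host, path); host is '' when the scheme has none."""
    parts = urlsplit(url)
    return parts.scheme, parts.hostname or "", parts.path

for link in links:
    # The scheme tells a client which kind of resource the link resolves to.
    print(describe(link))
```

One address syntax covers documents, file archives, Gopher servers and mail, which is what lets a single browser navigate all of them.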

What makes the Web such a dynamic medium for electronic publishing is that, in addition to its openness, ubiquity and simplicity in operation, it removes (at least) three serious limitations inherent in today's online publishing model. Today's interaction with an online publisher typically involves running a vendor-specific access application that makes a telephone connection directly to a vendor's host for the duration of the work session.

The first problem with this online access model is that an extremely scarce resource, the communications link, is consumed totally by one application. A user cannot simultaneously access another host. Furthermore, the act of making a connection is such an extremely heavyweight operation (typically taking minutes), that once connected there is a disincentive to "surf" to another service. It is not uncommon, while connected, to have the desire to "bop over" to another resource for complementary or supplementary information only to quell the urge after realizing that it will cost you 10 minutes to negotiate the logout/login/logout/login round-trip sequence.

The second problem with this model is that as a result of the cost (in both time and resources) of making a connection, the current crop of online access applications confuse a user session with a physical connection. Because of the need to have the user control this most precious resource, the communications line, the "center of gravity" of most of today's conventional online access applications is: the C-O-N-N-E-C-T-I-O-N! This should not be. The access application should implement a model that optimizes information discovery, not preservation of a scarce compute resource.

In addition, the online industry has derived a cost model that takes into account the expense of maintaining a physical connection for long periods of time -- i.e., billing for connection time. A large motivation regarding this cost model is to put a governor on the consumption of expensive resources on the host. Users are being taxed for something which is only tangentially related to information discovery -- physical connect time.

The third problem with the current online model relates to the plethora of incompatible end-user navigation applications. Clearly, there is a need for vendor-specific applications in many cases. But, for much of the information that the online community publishes, there is little semantic difference between the UI, navigational, and search models offered by different vendors. A more profound fallout of the vendor-specific application model is that it poses a bootstrapping problem for a vendor's efforts at market expansion. Not only must the vendor locate a potential client, he must convince him to install Yet Another Bloated Application (YABA) on his system.

The Web does not exhibit any of these problems. Due to the nature of an Internet connection, one can simultaneously access multiple sites within the Internet with a single connection. The communications line has been turned into an extension of the LAN. This is a rather profound change, with the operative word being synergy. Different service providers can excel at different information offerings all navigable from a "standard" interface. Contrast this with the tendency in today's online market toward "everything but the kitchen sink" vendors. This is incredibly inefficient. Why must every online service vendor provide lousy e-mail services?

As a viable alternative, the Web is fostering a positive trend towards "componentization" of information services. MOSAIC enables Web servers to specialize in offering a service that they do best, while being ensured that they are accessible by their prospective clients. This functionality is not limited to MOSAIC alone; given the Web's structure, any browser that adheres to its standards will most likely be acceptable.

The present interactions with httpd are almost transaction-oriented connections as opposed to the session-oriented connections of today's online systems. We would assert that a transaction processing model of interacting with an online service is vastly preferable to the present connection model. It allows processing to distribute appropriately to the client or server, and it minimizes the interface between client and server, which is good for evolution, resource utilization, and optimization. Also important is that it allows the end-user applications (note that with HTML, the distinction between a document and an application gets blurry!) to concentrate on implementing effective information access models and not force the user to manage low-level physical resources like the communications link.

Finally, standardized client applications like Mosaic have demonstrated that a common user interface can be used effectively to distribute content from multiple providers to multiple clients. We recognize that there are cases where this does not work well, but hopefully this can be seen as a challenge for the next generation of "extensible" Web clients.

A common navigation application also addresses the market bootstrapping problem. The information provider has an application-enabled user base from which to cull new clients. One might posit that MOSAIC could be used as a Trojan Horse by an online provider wishing to host his application on the new clients' workstations!

Given this openness and lack of barriers to publishing, all forecasts of the Web's growth must be taken seriously. However, as mentioned above, the inexorable increase in information and traffic mandates industrial-strength Access tools.

"....Where is the knowledge we have lost in information" The Rock, T.S. Eliot

Intelligent Access with PLServer

The perfect complement for Web publishers is a robust search and retrieval tool. Any electronic medium promotes proliferation of information, but something as seductive as the Web demands it. The only conceivable way to survive this sea without drowning is through the implementation of intelligent access methods.

The above excerpt from T.S. Eliot captures the essence of Personal Library Software's existence: to provide intelligent tools which turn information into knowledge. PLServer is the direct response to a swelling market voice. The core of this product is industrial-strength software that has gained pre-eminence in the major online service community. Companies like Dow Jones News Retrieval, America Online, Apple, and many others have chosen the PLS engine to support their systems. PLS is qualified to offer its implementation as an example of the functionality needed by Web publishers. These features are necessary for the Web to handle the increasing volume of data while maintaining its usability.

Behind the PLServer implementation is the concept of intelligent retrieval and scalability. The core engine of this technology, Callable Personal Librarian (CPL), has been implemented in major proprietary online solutions with requirements very similar to most electronic publishing environments. Understanding why it was chosen in these traditional online instances will illuminate its value on the Web. The requirements for robustness must be met on two fronts, database administration and end-user functionality.

Database administration explicitly refers to the ability to manage large, distributed sources of information. Issues like concurrency control and transaction integrity are integral to convenient database administration. Concurrency control allows real-time database update to proceed simultaneously with end-user searching. In publishing applications, the desire to take advantage of the 24/7 nature of Internet communications makes this feature essential.

Transaction control and related restart/recovery mechanisms are considered mandatory as the size of the database grows. As incremental updates to the information are made, the full database should not be jeopardized in the indexing process. For example, if a 10 GB database is being updated with 250 MB of new data, the indexing process would only apply to the new 250 MB. The rest of the repository remains searchable due to concurrency and is not corruptible in the event of any system problems during indexing. This type of database integrity is an insurance policy against damaging the database.
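The incremental update described above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the CPL implementation: the SegmentedIndex class, its methods, and the in-memory segment layout are all hypothetical. New documents are indexed into a separate segment, and that segment is published only after indexing succeeds, so a failure part-way through leaves the searchable index unchanged.

```python
class SegmentedIndex:
    """Toy inverted index kept as a list of immutable segments (hypothetical)."""

    def __init__(self):
        self.segments = []  # each segment maps term -> set of document ids

    def search(self, term):
        """Return sorted ids of documents containing term, across all segments."""
        hits = set()
        for segment in self.segments:
            hits |= segment.get(term.lower(), set())
        return sorted(hits)

    def update(self, new_docs):
        """Index only new_docs, then publish the new segment in one step."""
        segment = {}
        for doc_id, text in new_docs.items():
            if not isinstance(text, str):
                # Simulates a problem during indexing; nothing is published.
                raise ValueError(f"cannot index document {doc_id!r}")
            for term in text.lower().split():
                segment.setdefault(term, set()).add(doc_id)
        # Commit: existing segments are never modified, and searches see
        # either the old list or the new one, never a half-built segment.
        self.segments = self.segments + [segment]


index = SegmentedIndex()
index.update({1: "web publishing", 2: "search tools"})
try:
    index.update({3: "web search", 4: None})  # fails while indexing
except ValueError:
    pass
print(index.search("web"))  # the failed batch left no trace
index.update({3: "web search"})
print(index.search("web"))
```

Only the new batch is ever tokenized, which mirrors the 250 MB example: the cost of an update depends on the size of the new data, not on the size of the existing repository.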

Administration capabilities are highly valuable. However, the most important needs to be fulfilled are those of the end user. As the pace of change and the amount of information grow geometrically, harnessing technology to combat data asphyxiation is mandatory. PLS describes the type of software we produce as "intelligent access". Three characteristics of the searching methodology underline the software's intelligence:

1) Natural Language refers to querying the database using conversational syntax as opposed to complex Boolean operators. An example of a natural language search is: "tell me about the fall of communism in the soviet union". The utility of this is apparent as it represents the way we normally interact when requesting information. This does not preclude the use of the conventional Boolean operations, but it is far more intuitive than Boolean query language when the user is not a trained searcher and does not already possess a mental map of the data space. Since an online database has constituents from various backgrounds, supporting both Boolean and natural language is important.

2) Relevance feedback and ranking is the process that occurs during query evaluation to order the documents retrieved by the query such that those more likely to be relevant to the user are presented before those less likely to be relevant. Relevancy is determined via a sophisticated proprietary algorithm which in essence considers the frequency with which the query terms are found in the document; the proximity of the search terms in relation to each other; and the rarity of any particular word, which increases a document's value. The algorithm weighs these factors and then assigns a relevance score. In PLServer, after a search is invoked the hypertext hitlist is returned in order with the feedback scores normalized relative to the most relevant. Relevance ranking manages the glut of information by ordering the results.

3) Dynamic Concept Searching is the most abstract to define and the most powerful in use. It is designed to assist the searcher by dynamically generating a list of words which relate to the query. For example, suppose the query is "health care" against a content database composed of news stories. In PLServer there is an on-screen button called 'Concept Search' that would be invoked. The first action to follow is that a Related Word List is generated. These words are not synonyms or predefined relationships, but represent a dynamic selection of terms that co-occur at a threshold above coincidence, and therefore may intimate that a significant relationship exists. Returning to the example, the health care query could yield a related word list with terms like Hillary, universal, coverage, HMO, Mitchell, etc. These terms are not thesaurus content, nor could the words have been precoded, because they reflect the changing content of the database, not a static representation of knowledge. This feature has direct implications on the worth of the information retrieved. Information is changing so rapidly that no one can keep up with all the necessary developments. Also supporting this approach is the psychological theory that people's recognition is better than their recall. In other words, present a list of associated terms that reflect the query and the database(s) being searched, and the user can recognize and choose the appropriate terms. Contrast this to recall, where the user has to generate the possible list of words that could return valuable information. As volumes increase, the need for this capability rises. The description could continue, but the feature really comes to life through a demonstration.
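The three characteristics above can be made concrete with a small sketch. It is not the proprietary PLS algorithm: every function name, the stopword list, the sample documents and the thresholds are assumptions chosen for illustration. The ranking uses only term frequency and rarity (proximity is omitted for brevity), and the related-word list is approximated by a simple co-occurrence ratio.

```python
import math
import re
from collections import Counter

# Illustrative stopword list; a real system would use a far larger one.
STOPWORDS = {"tell", "me", "about", "the", "of", "in", "a", "an", "and", "to", "is"}

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def content_terms(query):
    """Reduce a conversational query to its content-bearing terms."""
    return [t for t in tokenize(query) if t not in STOPWORDS]

def rank(query, docs):
    """Return [(doc_id, score)] best first, scaled so the top hit scores 1.0."""
    terms = content_terms(query)
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    scores = {}
    for doc_id, tf in counts.items():
        score = 0.0
        for term in terms:
            df = sum(1 for c in counts.values() if term in c)
            if df and tf[term]:
                # Frequency in this document times rarity in the collection.
                score += tf[term] * (math.log(len(docs) / df) + 1.0)
        if score > 0:
            scores[doc_id] = score
    if not scores:
        return []
    top = max(scores.values())
    return sorted(((d, s / top) for d, s in scores.items()),
                  key=lambda pair: (-pair[1], pair[0]))

def related_words(query, docs, min_lift=1.5, min_hits=2):
    """Words that appear with the query terms more often than chance suggests."""
    terms = set(content_terms(query))
    doc_terms = {d: set(tokenize(t)) - STOPWORDS for d, t in docs.items()}
    hits = [words for words in doc_terms.values() if terms <= words]
    if not hits:
        return []
    related = []
    for word in set().union(*hits) - terms:
        in_hits = sum(1 for words in hits if word in words)
        in_all = sum(1 for words in doc_terms.values() if word in words)
        # Ratio of the word's rate in matching documents to its overall rate.
        lift = (in_hits / len(hits)) / (in_all / len(doc_terms))
        if in_hits >= min_hits and lift >= min_lift:
            related.append((word, round(lift, 2)))
    return sorted(related, key=lambda pair: (-pair[1], pair[0]))

# Invented headlines standing in for a news database.
docs = {
    "d1": "Health care reform: universal health coverage debated",
    "d2": "HMO enrollment grows as health care costs rise",
    "d3": "Universal coverage remains the goal of health care plan",
    "d4": "Baseball strike talks stall",
    "d5": "Stock markets rise on earnings",
    "d6": "Coverage of the baseball strike continues",
}

print(content_terms("tell me about the fall of communism in the soviet union"))
print(rank("tell me about health care", docs))
print(related_words("health care", docs))
```

The related words are derived from whatever documents are currently in the collection, so the list changes as the database changes; nothing is looked up in a fixed thesaurus.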

To PLS, the purpose for all these tools is Access, which is fundamentally a challenge on three tiers: a) finding information you know exists; b) finding information you think exists; and, most important, c) finding information you don't know exists. With the amount of data growing on Web servers, and the number of MOSAIC clients proliferating in the thousands per day, these users will not know the full breadth or utility of what they are searching and therefore need tools like those described to facilitate this process.

Conclusion

It is no accident that WWW/MOSAIC is being termed the "killer environment/app." for the Internet. Its appeal is well founded in both technological elegance and a solid business rationale. The Web could significantly alter the state of publishing, broadening the scope and depth of that community. This paper has attempted to highlight what makes the Web so special as an electronic communications medium. The next objective was to illustrate that the rudiments of publishing call for intelligent Access methods; otherwise the medium will become a quagmire. PLServer as a major implementation to address this need is important, but the philosophy behind the product is what will drive the provision of Access to information today and in the future.

A popular refrain from the sixties was that the "Revolution would not be televised". Gil Scott-Heron was right; instead, it will be on the Net.


Author Biographies:

Paul A. Campbell, Solutions Consultant, has been involved with the marketing and sale of Personal Library Software's products since February 1994. Prior to joining PLS, he was affiliated with the company as a representative of Xerox Corporation's document management division. Mr. Campbell wrote two white papers for Xerox's Palo Alto Research Center on the impact of electronic publishing. Educated at the University of Toronto, Mr. Campbell holds an Honors Bachelor's degree in Industrial Relations and Economics.

Lawrence Fitzpatrick, Developer/Architect of CPL, is architect and lead developer of the PLS text engine, Callable Personal Librarian. He came to PLS in 1987 from the National Library of Medicine where he was principally responsible for the design and implementation of the IRx Information Retrieval Workbench text engine. Mr. Fitzpatrick received a Master of Engineering from the University of Virginia where his area of study was computer science applications to biomedical problems. He received a Bachelor of Science degree in Biology with a minor in Computer Science from Georgetown University in 1979. Mr. Fitzpatrick is also a part-time lecturer at the University of Maryland University College, where he teaches courses on database systems and object-oriented programming.

David C. Macdonald, Director of Marketing for Electronic Publishing, has more than 16 years of experience in technical marketing and sales. Prior to joining Personal Library Software, Mr. Macdonald directed the worldwide sales and marketing functions of BRS and ORBIT Online Products for Infopro Technologies, Inc. Mr. Macdonald also managed the systems development, product management and sales of the Electronic Publishing unit of Canada Systems Group. With a Bachelor of Management Science degree from the University of Ottawa, his background includes extensive telecommunications, information systems and service, and systems engineering experience.

Adrienne Griffith, Product Manager (acting), PLServer, manages the design, productization and marketing of PLServer. Before the advent of PLServer, Ms. Griffith was responsible for marketing and communications for Personal Library Software's other products and services. Ms. Griffith holds a Bachelor of Science degree in Communications from Syracuse University.


Contact: Paul Campbell, paulc@pls.com