Untangling the Web - The role of text retrieval in a hypertext environment

Theresa Kasper
Development Manager
Information Dimensions Inc.

Abstract

The World Wide Web (WWW) serves as a fascinating tool for accessing information from a multitude of resources and sites. However, information within a particular organization is still very hard to find. Flat file resources require users to know a great deal about the system they are on, as well as the domain of information that is available. Once a document is found, it also becomes apparent that with today's WWW browsers the displayed text is hard to navigate because of the document's size. Currently, the Web has no means of locating, within a large document, the text excerpts that are relevant to the user's interest.

This paper examines how the Web can be enhanced to resolve some of the problems mentioned above by using a textual database system. We show the advantages of using a database system to store all documents in one storage medium. We discuss the advantages a retrieval engine has in locating, and keeping up to date, hypertext links on a topic of interest without requiring much knowledge of a site's file structure. We show how the definition of a URL will change to include query requests instead of file locations. We also discuss navigation tools that provide movement forward and backward with regard to content, not just mindless window pages. Using the database package BASISplus, we demonstrate through examples how such a system can make the Web a more efficient and effective tool for locating and retrieving relevant information in this massive world of available resources.

1. Introduction

Information retrieval systems serve the purpose of finding data items that are relevant to the user's query. The World Wide Web is a tool that has become very popular as a means of easily accessing information from other sites. This paper discusses how the Web and database systems (information retrieval systems) can be used together to give the user simple tools for locating and accessing the massive amount of information available. It discusses various aspects of storing, retrieving and displaying documents through a database system. It also aims to demonstrate how closely a database system needs to interact with the Web server, and suggests future Web developments that would facilitate the integration of database systems. The paper illustrates some of its points with a database system called BASISplus, developed by Information Dimensions, Inc.

2. Database vs Flat File Resources

The World Wide Web (WWW) opens the door to a multitude of information from a multitude of sites. It is therefore important for users to be able to grasp what kind of information is available and where they can locate it. Part of the problem with the Web is that the information is stored in flat files that are scattered throughout a site's file system. Typically, a user finds information by locating a single document and following the hypertext links from that document into other documents. This always leaves the user with a feeling that there may be more documents or information out there. Information may be missed simply because the document being read did not have a hypertext link to the other available information. Now consider the experience of users who enter a Web site that has a database system in place. Immediately, they know where all the information is located, and they can tailor queries to access all the information relevant to their needs and interests. Because users can issue queries directly, as well as follow hypertext links that contain queries, they have better tools to keep them from wondering whether they missed a crucial piece of data.
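The difference between browsing by links and searching a document base can be sketched in a few lines of Python. The three documents, their file names and the helper functions below are hypothetical and are not part of BASISplus; the sketch only assumes that a query is evaluated over every stored document, while browsing reaches only what is linked.

```python
from collections import deque

# Hypothetical site: "notes.html" exists but no other document links to it.
documents = {
    "index.html": {"text": "Welcome to the site.", "links": ["intro.html"]},
    "intro.html": {"text": "Introduction to Mosaic.", "links": []},
    "notes.html": {"text": "Unlinked notes about Mosaic.", "links": []},
}


def reachable_by_links(start):
    """Return every document a user can reach by following links from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        name = queue.popleft()
        for target in documents[name]["links"]:
            if target in documents and target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def found_by_query(term):
    """Return every document whose text contains the term (case-insensitive)."""
    needle = term.lower()
    return {name for name, doc in documents.items() if needle in doc["text"].lower()}


browsed = reachable_by_links("index.html")
queried = found_by_query("Mosaic")
missed = queried - browsed  # relevant documents that browsing alone never reaches
```

Here the query finds a relevant document that link-following from the entry page misses, which is the situation described above.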

Simply storing the information in a single storage medium does not guarantee that the relevant information will be found. It is the retrieval tools provided by the database system that give the user additional power in finding the relevant documents. Tools like:

Because data is rarely static and documents are continually added to the database, users can create queries that are saved and re-run over the document base, so that the most up-to-date and relevant information is always returned.
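As a minimal sketch of a saved query, assume a toy in-memory document base with simple substring matching. The class names DocumentBase and SavedQuery are illustrative assumptions and do not describe how BASISplus stores or runs queries.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentBase:
    """A toy in-memory document collection (hypothetical, not BASISplus)."""
    docs: dict = field(default_factory=dict)  # primary key -> document text

    def add(self, key, text):
        self.docs[key] = text

    def search(self, term):
        """Return the sorted keys of documents whose text contains the term."""
        needle = term.lower()
        return sorted(k for k, text in self.docs.items() if needle in text.lower())


@dataclass
class SavedQuery:
    """A query stored by name so that it can be re-run later."""
    name: str
    term: str

    def run(self, base):
        return base.search(self.term)


base = DocumentBase()
base.add(1, "Mosaic is a popular Web browser.")
base.add(2, "Quarterly sales report.")

mosaic_docs = SavedQuery(name="mosaic", term="Mosaic")
first_run = mosaic_docs.run(base)   # only document 1 matches so far

base.add(3, "New release of NCSA Mosaic announced.")
second_run = mosaic_docs.run(base)  # the same saved query now also finds document 3
```

The saved query itself never changes; only the document base does, so each run reflects the current contents.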

3. The URL as a Query

As we continue with the idea of using a database system to hold HTML documents and to retrieve the information needed, we must also assume that the URL as it stands now will evolve by adding database and query attributes to its definition. Currently, a URL is defined as:

                      scheme://host.domain[:port]/path/filename
For example, to include a link to NCSA's Beginner's Guide to HTML in a document, you would use:
                      <A HREF="http://www.ncsa.uiuc.edu/General/Internet/WWW/HTMLPrimer.html">
                                NCSA's Beginner's Guide to HTML</A>
This would make the text “NCSA's Beginner's Guide to HTML” a hyperlink to this document.
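To make the parts of this definition concrete, the NCSA URL can be split into its components with Python's standard urllib.parse module. This is only an illustration of the scheme, host, port, path and filename fields; the variable names are our own.

```python
import posixpath
from urllib.parse import urlsplit

url = "http://www.ncsa.uiuc.edu/General/Internet/WWW/HTMLPrimer.html"
parts = urlsplit(url)

scheme = parts.scheme  # "http"
host = parts.hostname  # "www.ncsa.uiuc.edu"
port = parts.port      # None, because the optional [:port] part is absent

# Separate the directory path from the final filename.
path, filename = posixpath.split(parts.path)
```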

Through the use of a database system, URLs can specify not only files but also databases and queries against a database. With the BASISplus database, the following URL syntax may be used to access a specific BASISplus database:

                      scheme://host.domain[:port]/db_name/model_name/view_name/SF

The db_name is a valid BASISplus database name. The model_name is the defined user data model, which is a collection of views that a user has available in the database. The view_name is the schema that describes the database as seen by a user. Following the database information is the action to be performed, which in this case is SF for "Search Form." This URL accesses the database and brings up a search form in which the user can enter queries.

For example, to include a link to access the database NEWS which holds research documents, you would use:

                      <A HREF="http://www.site.edu/NEWS/ALL/DOCS/SF">
                      BASISplus News research database</A>

This would make the text “BASISplus News research database” a hypertext link that invokes the search form for the News database. Once the user follows this link, the search form is shown and the user can supply the query criteria to retrieve information on a specific concept or topic (Figure 1).

Figure 1

Hypertext links that access databases should also be able to specify a query that returns a number of documents directly. Such a URL includes the query itself, which ensures that the HTML document containing the link always leads to the most up-to-date information. As the database changes, with documents added and outdated data deleted, the HTML document containing the query link becomes a live, evolving document.

The syntax of such a URL would be:

                      scheme://host.domain[:port]/db_name/model_name/view_name/SDW?W=search_where_clause

For example, if I had an HTML document in which I wanted a link to documents on Mosaic, a URL producing such results would be:

                      <A HREF="http://www.site.edu/NEWS/ALL/DOCS/SDW?W=text phrase any Mosaic">
                      BASISplus News research database, documents on Mosaic</A>

The example above references the News database (using the ALL model and the database view DOCS). The parameter "W=text phrase any Mosaic" specifies a query on the Text field that searches for the phrase Mosaic. The "phrase any" string is the query operator used for this search. The following information is returned to the user (Figure 2).
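A query URL of this form can also be assembled programmatically. The sketch below uses Python's standard urllib.parse module; the helper name database_url is hypothetical. The example above writes the where clause with literal spaces, whereas the sketch percent-encodes them, on the assumption that the receiving server decodes the query string before passing it to the database.

```python
from urllib.parse import quote, urlencode


def database_url(host, db_name, model_name, view_name, action, params=None):
    """Build scheme://host/db_name/model_name/view_name/action[?params]."""
    segments = (db_name, model_name, view_name, action)
    path = "/".join(quote(segment, safe="") for segment in segments)
    url = f"http://{host}/{path}"
    if params:
        # quote_via=quote encodes spaces as %20 rather than "+".
        url += "?" + urlencode(params, quote_via=quote)
    return url


search_form = database_url("www.site.edu", "NEWS", "ALL", "DOCS", "SF")
mosaic_query = database_url(
    "www.site.edu", "NEWS", "ALL", "DOCS", "SDW",
    {"W": "text phrase any Mosaic"},
)
```

Building the URL from named parts keeps the database, model, view and action separate from the query text, so the same helper serves both the search form link and the direct query link.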

Figure 2

Notice that each of the items displayed in Figure 2 is a hypertext link that will display the actual document found in the database. The URL that specifies a particular document simply contains the document's primary key, which uniquely identifies that document.

The syntax of such a URL would be:

                      scheme://host.domain[:port]/db_name/model_name/view_name/DDD?K=primary_key_value

In Figure 2, the URL of the first hypertext link is:

                      <A HREF="http://www.site.edu/NEWS/ALL/DOCS/DDD?K=1056">
                          DATA RESEARCH INTRODUCES LIBRARY SOFTWARE AGENT - OTHER NEW</A>

The example above has a primary_key_value of 1056. For this particular database the primary key has been defined as a unique number that the system assigns. The primary key is defined by the database administrator and can be any value as long as the value is unique. Figure 3 is a display of the document.
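On the server side, the three URL forms shown in this section (SF, SDW and DDD) could be taken apart as sketched below. This is only an illustration under our own assumptions: the function name, the returned dictionary and the short action descriptions are hypothetical, and the sketch does not describe how the BASISplus Web server is actually implemented.

```python
from urllib.parse import parse_qs, urlsplit

# The action codes are those used in this paper; the descriptions are informal.
ACTIONS = {
    "SF": "show the search form",
    "SDW": "run the query given in parameter W",
    "DDD": "display the document whose primary key is given in parameter K",
}


def parse_database_url(url):
    """Split a database URL into db_name, model_name, view_name, action and parameters."""
    parts = urlsplit(url)
    segments = [segment for segment in parts.path.split("/") if segment]
    if len(segments) != 4:
        raise ValueError("expected /db_name/model_name/view_name/action")
    db_name, model_name, view_name, action = segments
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    # parse_qs decodes percent-encoding; keep the first value of each parameter.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "db_name": db_name,
        "model_name": model_name,
        "view_name": view_name,
        "action": action,
        "params": params,
    }


form_request = parse_database_url("http://www.site.edu/NEWS/ALL/DOCS/SF")
document_request = parse_database_url("http://www.site.edu/NEWS/ALL/DOCS/DDD?K=1056")
```

Note that the primary key arrives as the string "1056"; converting it to the key type defined by the database administrator is left to the database layer.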

Figure 3

Given current constraints, the URLs described above, which invoke a database system and manipulate data through it, have been specified using the current URL syntax. It is hoped that as more and more users access database systems via the Web, the URL itself will evolve and become more conducive to customization for database systems. One consideration is to allow the scheme of the URL to be user defined. A user-defined scheme could provide additional entry points into the Web server for processing that scheme, which in this case would mean processing requests to a particular database system. The work done with this database system also showed that each of these URLs requires some action to be performed. Another consideration is to formalize a method of supporting a user-defined action as part of the URL. These considerations need to be thought out more carefully; the goal, however, is to allow better integration with database systems.

4. Navigational Tools

Much of the data found on the Web is textual in nature. As research papers, government documents and other lengthy documents become accessible on the Web, users require navigational tools that take them to the pertinent information within a document quickly and easily. A database system enhances the navigation tools of current Web browsers because it knows where in the document the user's query criteria were found. With this information, database systems can provide “hit-to-hit scrolling”. (A hit is the term or phrase that matches the user's query criteria.) This feature enables the user to navigate through the document quickly and efficiently. Searching for relevant information is, as a whole, an iterative process; it is therefore critical that Web browsers have effective navigational tools that give users immediate access to the pertinent parts of a document. “Hit-to-hit” scrolling must be provided in both directions, forward and backward, within the document.
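Hit-to-hit scrolling can be sketched as follows, assuming that a hit is recorded as the character offset at which the query term starts. The function names and the sample text are hypothetical; a real database system would record hit positions when the query is evaluated rather than rescanning the text.

```python
import bisect
import re


def find_hits(text, term):
    """Return the sorted character offsets at which term occurs (case-insensitive)."""
    return [m.start() for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE)]


def next_hit(hits, position):
    """Return the first hit after position, or None if there is none."""
    i = bisect.bisect_right(hits, position)
    return hits[i] if i < len(hits) else None


def previous_hit(hits, position):
    """Return the last hit before position, or None if there is none."""
    i = bisect.bisect_left(hits, position)
    return hits[i - 1] if i > 0 else None


text = "Mosaic is a browser. Many sites recommend Mosaic. Get Mosaic today."
hits = find_hits(text, "mosaic")

forward = next_hit(hits, hits[0])        # scroll forward from the first hit
backward = previous_hit(hits, hits[-1])  # scroll backward from the last hit
```

Because the hit offsets are kept sorted, moving in either direction is a binary search rather than a rescan of the document.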

The collection of documents retrieved by a query is known as a document set. Along with tools to navigate within a document, we must also provide tools to navigate within a document set. The navigation tools that must be provided are next and previous document. These features will allow a user to traverse a document set going from one document to the next based on the document currently displayed. The database system should also provide the capability of going directly to the first and last document in the set.
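Navigation within a document set can be sketched as a cursor over the primary keys returned by a query. The DocumentSet class below is a hypothetical illustration; apart from 1056, the key values are made up for the example.

```python
class DocumentSet:
    """A cursor over the primary keys of the documents retrieved by a query."""

    def __init__(self, keys):
        if not keys:
            raise ValueError("a document set must contain at least one document")
        self._keys = list(keys)
        self._index = 0

    @property
    def current(self):
        return self._keys[self._index]

    def first(self):
        self._index = 0
        return self.current

    def last(self):
        self._index = len(self._keys) - 1
        return self.current

    def next(self):
        """Move to the next document; stay on the last one if already there."""
        if self._index < len(self._keys) - 1:
            self._index += 1
        return self.current

    def previous(self):
        """Move to the previous document; stay on the first one if already there."""
        if self._index > 0:
            self._index -= 1
        return self.current


result_set = DocumentSet([1056, 1060, 1071])
second = result_set.next()
final = result_set.last()
still_final = result_set.next()  # no document follows the last one
start = result_set.first()
```

Whether next on the last document should stop, wrap around or report an error is a design choice; the sketch simply stays in place.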

Some documents can become very large. It may not be wise to return the entire document, especially while the user is still reviewing it to determine whether it is relevant to his or her needs. Therefore, it is necessary to be able to send “chunks” of a document to the Web browser while the user is determining its importance. Once the user has decided that the document is relevant, a tool may be needed that returns the entire paper, article or document.
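One simple way to deliver a document in chunks is to cut it into fixed-size pieces, as in the sketch below. The function name and the chunk size are arbitrary assumptions; a real system would more likely break at paragraph or section boundaries.

```python
def chunk_document(text, chunk_size):
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


document = "A long research paper. " * 10
chunks = chunk_document(document, chunk_size=100)

preview = chunks[0]            # sent first, while the user judges relevance
full_text = "".join(chunks)    # returned once the user asks for the whole document
```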

The navigational requirements, and other aspects of manipulating data through a database system, create a need for customization of and control over the browser's menu bar. In this particular application, each document had its own menu bar drawn as part of the HTML document. It seems apparent that menu bar support in the client needs to allow items to be added to the menu bar and existing items to be modified. This would certainly help when the client is used with a database system.

5. Conclusion

The Web is a powerful and easy-to-use tool that has become very popular in industries and organizations throughout the world. As the Web continues to develop and improve, it is necessary to keep in mind the wealth of information that the user is asked to locate, review and analyze. Tools that support the user's quest to access relevant information need to be developed. Database and information retrieval systems in general have proven to be key tools in locating relevant information. The Web needs to consider developing tools and capabilities that open the door to various database systems, so that they can assist the user in locating, accessing and reviewing the valuable information that is currently available.

Theresa Kasper

General Information:
    Business Address: Information Dimensions Inc., 5080 Tuttle Crossing Boulevard, Dublin, Ohio 43017
    Phone: (614) 761-7214
    EMail: Kasper@idi.oclc.org

Education:
    Bachelor's degree in Russian Languages (BA)
    Bachelor's degree in Computer Science (BS)

Skills:
    Systems: UNIX, PC/WINDOWS, VAX/VMS, MACINTOSH
    Languages: C, Fortran, some experience with Pascal and Cobol
    Other: FutureBasic/PG:pro on the Mac, Basisplus DB system, database algorithms, and presentation skills

Experience:
    Development Manager for Basisplus DB system. Currently involved with development of a Web server that services database requests.
    Project Leader and developer for on-line Thesaurus construction for the Basisplus Database system.
    Singularization algorithm used to convert a plural form to its singular. Used for DB indexing.
    Journaling algorithms for DB system.
    User Interface development for DB system.
