Using Coollists to Index HTML Documents in the Web

Jong-Gyun Lim
E-CIM Center, Corporate Technical Operations
Samsung Electronics, Suwon, Korea

Abstract

This paper suggests a partial solution (limited to HTML documents) to the Web-indexing problem using Coollists. Roughly, a Coollist is equivalent to a Hotlist in Mosaic, except that it automatically records the titles of all visited HTML documents by default. Thus, in theory, by maintaining a merged list of everybody's Coollists, a complete index of all the HTML files on the Web should eventually be created.

In practice, even if transferring everybody's Coollists to a single site were feasible, the growth and change rate of the Web makes us question whether the archie metaphor of "every index server maintains all the know-wheres" could be applied to the rest of the Web.

The new metaphor we are suggesting is a library metaphor. Let each organization maintain the merged Coollists of its individuals. If an organization has surplus computing resources, let it maintain a merged list of other merged lists. This way, individuals are likely to find documents of interest within their own organization. But just as libraries have specialties, organizations have their own characteristics; therefore, individuals will also find other interesting documents at "neighboring" sites. Bigger libraries carry more books. Likewise, there will be sites that merge many merged lists together, which will be useful for blind keyword searching of the titles.

For our current implementation of the Coollist, we take advantage of the CERN proxy-cache server to collect the indices of all the visited HTML files. People at three of the 19 plants within the company tried the merged list of Coollists and found it almost indispensable. People who used to save almost every URL they visited, and those who wanted a comprehensive list of URLs, found it particularly useful.

In the paper, we describe the results of our experiment in detail and also point out how our approach might solve the scalability problems of other Web indexing solutions.

1 Introduction

The Web is growing fast. The introduction of the Hypertext Markup Language [1], along with browsers such as Mosaic, has accelerated the growth of the Web as much as it has facilitated the browsing of Web documents. The number of HTML documents has already exceeded the number of Gopher documents, and the HTML+ document style is quickly emerging as the standard Web document style. There is a dilemma, though. The freedom that a hypertext document provides to its users causes them to get lost in the complex graph of hypertext documents. Browsers provide tools like the Window History or the Hotlist to help viewers navigate the hypertext space. However, if we want to find a particular piece of information on the Web, we must somehow discover the documents on the Web and organize them in such a way that we can establish their presence with minimal effort. A number of tools have already been developed that traverse and index the Web automatically, and research efforts are under way to find the best Web resource discovery and indexing ideas.

In this paper we briefly review the existing Web indexing solutions and their limitations, and suggest a solution complementary to the existing ones. The two main differences between our solution and those of the others are that in our solution:
  1. HTML documents are discovered passively, as a by-product of users' ordinary browsing, rather than by automated traversal or manual registration.
  2. The indices are maintained in a distributed fashion, with each organization merging the Coollists of its own members.

The major advantages of our approach are that we solve the resource discovery problem without the use of expensive Web wanderers such as Robots or Worms [4] or the manual maintenance effort of document providers, and that the structure of the index resembles the infostructure of the Web more closely than that of other existing indexing systems. Our solution is also based on the following hypotheses:
  1. Organizing the entire Web into a subject hierarchy is difficult.
  2. Finding a desired document from the hierarchically organized Web is even more difficult.
  3. A more recently browsed document is usually more valuable.
  4. Documents browsed by a person during a contiguous time span are likely to be related.
  5. People within the same organization are more likely to be interested in one cluster of documents than another.
With these hypotheses, we have designed an indexing mechanism using Coollists. In the following two sections we describe the existing indexing solutions and the motivations for using Coollists as our indexing solution. Then in Section 4, we define a Coollist and describe the indexing algorithm using Coollists. In Section 5 we describe our implementation experience with Coollists and the reactions of the users of the experimental system. Based on our experience, we make suggestions on implementing an HTML document indexing system using Coollists.

2 Indexing the Web; Related Work

Today, if we want to find some desired information on the Web, we can choose from a number of services that employ different resource discovery methods and search interfaces. Some are more actively used than others, but all suffer from one limitation or another. Upon studying those services, we have identified four criteria that we believe should be met by a successful Web indexing system: 1. Infostructure preservation, 2. Accuracy, 3. Maintainability, 4. Scalability.

2.1 Infostructure Preservation

HTML documents on the Web have a graph infostructure, unlike Gopher or Anonymous FTP, which have a hierarchical subject infostructure. Information on Gopher or FTP is meant to be found by recursively browsing sub-menus or subdirectories, whereas information on the Web is meant to be found by following hypertext links. Both Gopher and FTP experienced fast growth, but the need to search the entire Gopherspace or all FTP sites is satisfied by Veronica [9] and archie [2], respectively. We believe that one of the reasons Veronica and archie became widely used is that their query results are presented in the same infostructure as that of the documents being searched; the result of a Veronica search puts us on a Gopher menu, and archie returns a list of paths and file names. We refer to this characteristic of an indexing system as infostructure preservation.

BBS-style HTML document indexing systems such as GENVL [6] are not infostructure preserving because they attempt to organize the HTML document indexes under a subject hierarchy. Information requesters must use their judgement to select the most closely matching subject name from the top-level subject listing and recursively search the sub-listings until they find the requested document. This is like overlaying a Gopher infostructure on top of the HTML document infostructure. Because the authors of HTML documents do not have a Gopher-like infostructure in mind when creating their documents, categorizing them into a subject hierarchy is challenging, not to mention finding them there.

Archie-like HTML document indexing systems such as ALIWEB [5] and W3 Catalog [10] do not overlay a different infostructure, but neither do they take advantage of the graph infostructure of the HTML documents; these systems generate a plain list of HTML document titles (and keywords in the case of ALIWEB) that (partially) match the requested keyword.

There are many Spiders, Worms, and Robots that automatically traverse and discover resources on the Web [4]. However, they typically leave only a plain list of the items discovered and lose the structural information that they found during the traversal.

2.2 Accuracy

HTML document titles are not always very meaningful. Authors of HTML documents often assume that the readers are within the context of the parent document and thus omit redundant information from the title. Several children documents may even inherit the title of their parent document. This becomes problematic for systems that rely on titles to search for the requested document. We call this the accuracy problem. An indexing system with little accuracy results in longer search times on the requester's side, because the requester has to try and err several times before finding the desired document.

W3 Catalog and WWWW [6] generate lists of HTML document titles that (partially) match the keyword request. A title alone, without context information, is however not sufficient to identify the content. These limitations may be complemented by WAIS-like services [3] [7] that maintain a precomputed index of the text in the documents.

GENVL is better because it provides structural information in addition to the titles. However, following wrong branches increases the search effort and thus lengthens the search time.

ALIWEB is perhaps the most accurate indexing system available because HTML documents in ALIWEB carry additional fields of information: a description and keywords. These fields, however, are not required fields of an HTML document, and thus document providers must make additional efforts to supply them.

2.3 Maintainability

Ideally, a resource discovery system should be fully automatic. Archie and Veronica require the least human intervention, which is another reason why they are successful. The HTML document indexing system that requires the least amount of continuous manual effort to maintain is the most maintainable system.

Although ALIWEB, like archie, only requires initial registration of the sites serving HTML index files, the effort required to create the index file and the additional, continuous effort to update it prohibit its widespread growth.

GENVL requires no manual effort to maintain the system itself. "Posting" HTML documents to the Mother-of-All BBS does require continuous effort, though.

W3 Catalog requires no manual maintenance effort except for adding new sources to the Catalog. Web wanderers and WWWW require no manual maintenance effort either.

2.4 Scalability

What portion of the Web does the indexing system cover? As the coverage of the Web increases, will the indexing system create a resource bottleneck on a single machine? Most of the existing indexing systems are limited by some sort of scalability problem. Systems like WAIS that manage huge amounts of indexing data distribute the indices to several sites. However, this places a burden on the users, who must find the right index site to search for the document they want. Archie maintains relatively little index information; thus, all the archie indexes are stored at a central location and replicated to a number of mirror sites. However, as the number of archiving sites increases, archie indexes now face the need to be distributed like WAIS indexes [8].

Existing HTML document indexing systems face the same kind of scalability problems that WAIS and archie are experiencing. ALIWEB must process and store all the registered index files, which will become a problem as the number of registered sites increases. W3 Catalog and WWWW will need to find a way to distribute their index databases to avoid a bottleneck on particular systems.

GENVL's BBS approach is perhaps the most scalable because the hierarchical nature of its index organization distributes the resources across many different sites. However, pages towards the top of the hierarchy, such as the Mother-of-All BBS page and some of its more popular children pages, will eventually need to be replicated among several sites.

3 Motivations

Before introducing Coollists and our solution to the HTML document indexing problem using Coollists, we share the two metaphors that motivated our solution: the Talk-of-the-Town Metaphor and the Library Metaphor. We also discuss our hypothesis that HTML documents browsed by a person during a contiguous time span are likely to be related. This hypothesis will be the basis for our HTML document infostructure reconstruction algorithm.

3.1 Talk-of-the-Town Metaphor

In a typical town, one can hear about the current events and news that are most talked about. Those who have been to an interesting event spread the news to those who have not. There are also newspapers and guide books that summarize the current events, places to visit, things to buy, and books to read. Usually the most interesting talks of the town are the most recent ones, and as time progresses they become less important or obsolete.

The situation is similar with HTML documents. While you are looking at an interesting HTML document, someone next door looks over at your screen and asks for the URL of that document. Or, while browsing the Web, you find documents that you think should be shared with others in your organization, and you either place them in your homepage or store their URLs in your Hotlist and mail it out or post it on the local BBS. If a URL were indeed interesting and useful, it would be referenced by most of the people in the organization, perhaps even multiple times. Chances are that if someone collected all the URLs referenced by the people in that organization, then the most current and relevant URLs of that organization could be identified.

3.2 Library Metaphor

At a typical university, there are several specialized libraries and possibly one or more general libraries. A specialized library is a place where people of common interests agree to place their books so that they can be shared by others; e.g., the Math library, the Science library, and the Music library. The collection of books maintained by those libraries changes when new users are introduced or the interests of the existing users change. Specialized libraries are physically located near the departments where their users are. Most people can find books of their interest in the specialized libraries, but when they cannot, they must make a trip to other libraries.

A general library houses many books from many different subject areas. Once within the general library, people can find most of the books they want. However, the time it takes to travel to the distant general library makes it more troublesome to find a book there than at the specialized library. The bigger the general library, the better the chance of finding the books, but the more effort it takes to access them; e.g., making a trip to the Library of Congress.

Assuming that the collection of all the URLs referenced by an organization is housed in a specialized library, people in that organization will be able to find many of the HTML documents they need in that library. If each organization maintains such a library, then people from one organization may "visit" another organization's library to learn about the interests of that organization. For example, if a researcher at an electronics research center wants to find information about DNA, s/he may search the Biochemistry Lab library that contains the collection of all the URLs referenced by the researchers in that laboratory. Organizations that have surplus resources may decide to create a general library of referenced URLs. For example, someone may decide to merge all the URL listings from 100 biochemistry lab libraries in the U.S., or from all the URL libraries of the manufacturing sites within a company. The more general the URL libraries are, the more chance there is of finding a document, but the longer it will take to search for it.

3.3 Information Clustering

A cluster of information is a related set of information. For example, the list of files under a subdirectory of an anonymous FTP site is a cluster of information, and so is the list of documents on the same Gopher menu. For the HTML document infostructure, a cluster of information could be defined as the set of documents referenced by a single HTML document. However, there are also many closely related HTML documents that are not referenced by a common parent HTML document, and even documents referenced by the same parent HTML document are not necessarily a related set of information.

Another, and perhaps more accurate, way of defining an HTML document cluster is to group the set of HTML documents that are browsed by a person during a contiguous time span. Unless the person is making a random selection of URLs, s/he is likely to browse documents that are related to the task at hand. Also, browsing typically follows the infostructure of HTML documents; thus, a parent document and its children are likely to fall into the same information cluster. Because task-oriented browsing will only select those document links that are related to the task, children that are only remotely related to their parent document will be excluded from the cluster, which is why our way of clustering HTML documents may be more accurate.
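
To make the idea concrete, the following is a minimal Python sketch of this clustering rule. The (URL, title, date-viewed) record format follows the Coollist triplet defined in Section 4.1; the 30-minute idle threshold is only an illustrative assumption, not a value we prescribe.

    from datetime import timedelta

    # Illustrative idle threshold; we do not fix a particular value.
    IDLE_THRESHOLD = timedelta(minutes=30)

    def split_into_clusters(entries, threshold=IDLE_THRESHOLD):
        """Group (url, title, viewed_at) triplets, assumed ordered by
        viewing time, into clusters separated by idle gaps longer
        than `threshold`."""
        clusters, current, last_time = [], [], None
        for url, title, viewed_at in entries:
            if last_time is not None and viewed_at - last_time > threshold:
                clusters.append(current)  # a long idle gap ends the session
                current = []
            current.append((url, title, viewed_at))
            last_time = viewed_at
        if current:
            clusters.append(current)
        return clusters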

4 Coollist

In this section, we define the Coollist, the Coollist Repository, and the Coollist Library, and describe how they work together in the HTML document indexing algorithm.

4.1 Coollist

A Coollist is a list exactly like the Hotlist in NCSA Mosaic, except that all viewed HTML documents are automatically added to the list by default. As in the Hotlist, an entry in a Coollist is a (<URL>, <document title>, <date viewed>) triplet. A Coollist is transmitted to a Coollist Repository and destroyed each time a new browsing session is started. The beginning of a browsing session is defined by idle time, a duration of inactivity long enough to assume that the user has switched tasks.
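
The following is a minimal sketch of this client-side behavior. The idle threshold and the use of an HTTP POST to a hypothetical repository endpoint are illustrative assumptions only; we do not prescribe a particular transport.

    import time
    import urllib.request

    IDLE_SECONDS = 30 * 60  # illustrative session boundary, not prescribed

    class Coollist:
        """Client-side Coollist: records every viewed document and is
        shipped to a repository when a new browsing session begins."""

        def __init__(self, repository_url):
            self.repository_url = repository_url  # hypothetical endpoint
            self.entries = []                     # (URL, title, date-viewed)
            self.last_view = None

        def record(self, url, title):
            now = time.time()
            # A long idle gap marks a new session: transmit and destroy
            # the old list before recording into a fresh one.
            if self.last_view is not None and now - self.last_view > IDLE_SECONDS:
                self.flush()
            self.entries.append((url, title, now))
            self.last_view = now

        def flush(self):
            if not self.entries:
                return
            body = "\n".join("%s\t%s\t%f" % e for e in self.entries).encode()
            # An HTTP POST is only one possible transport.
            urllib.request.urlopen(self.repository_url, data=body)
            self.entries = []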

4.2 Coollist Repository

A Coollist Repository is a designated server to which Coollists are transmitted, and where they are merged and stored. During the installation of browsers like Mosaic, client systems are advised to select the closest Coollist Repository server for transmitting their Coollists. By default, HTTP cache servers will be used as Coollist Repository servers. We encourage this as the default setting because we believe caching servers will be in widespread use in the near future, and because clients are most likely to choose a caching server that is closest in physical distance, thus minimizing network traffic.

The merged list of Coollists should never be sorted, and the entries of one Coollist should never be interleaved with those of another. A newly arrived Coollist should be prepended to the top of the merged list without any alteration. The length of the merged list is limited only by the resource availability of the server. However, most Coollist Libraries will request only a certain portion from the top of the merged list.
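
A minimal sketch of these repository rules follows, again assuming the triplet entry format; the capacity bound is illustrative.

    from collections import deque

    class CoollistRepository:
        """Coollist Repository sketch: arriving Coollists are prepended
        whole, never sorted or interleaved, and libraries fetch only
        a slice from the top."""

        def __init__(self, max_entries=100000):  # illustrative capacity
            self.max_entries = max_entries
            self.merged = deque()                # newest entries first

        def add_coollist(self, coollist):
            # Prepend the whole list without alteration so that the
            # entries of one Coollist stay contiguous (one cluster).
            for entry in reversed(coollist):
                self.merged.appendleft(entry)
            while len(self.merged) > self.max_entries:
                self.merged.pop()                # drop the oldest entries

        def top(self, n):
            # Libraries usually request only a portion from the top.
            return list(self.merged)[:n]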

4.3 Coollist Library

A Coollist Library is the HTML document indexing system that collects merged Coollists from one or more Coollist Repositories and provides search interfaces to document requesters. Any host can become a Coollist Library as long as it satisfies the following three requirements:
  1. Announce the name and URL of the library's interface HTML form document.
  2. Provide the list of Coollist Repositories and/or other Coollist Libraries from which it collects the merged Coollists.
  3. Provide descriptive information about its specialties (if it is a specialized library).
A library that collects merged Coollists from other libraries is likely to become a general library. A general library does not, however, receive Coollists through the specialized libraries. Instead, it retrieves the list of Coollist Repositories from each of the specialized libraries and then receives the Coollists directly from those Coollist Repositories; this approach avoids a number of potential problems. Bigger libraries will access their Coollist Repositories at a lower frequency than smaller libraries, and they will request a smaller portion of the lists from the Coollist Repositories than smaller libraries will. These flexibilities allow general libraries to balance their coverage against the usage of their resources.
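
The collection rule for a general library can be sketched as follows; fetch_repository_list and fetch_top stand for hypothetical transport functions that we do not specify here.

    def build_general_library(special_libraries, fetch_repository_list,
                              fetch_top, portion=100):
        """Collect Coollists for a general library: ask each specialized
        library for its repository list (requirement 2 above), then pull
        the merged lists directly from those repositories."""
        repositories = set()
        for library in special_libraries:
            repositories.update(fetch_repository_list(library))
        merged = []
        for repository in repositories:
            # A bigger library would poll less often and take a smaller
            # top portion than a smaller library would.
            merged.extend(fetch_top(repository, portion))
        return merged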

5 Implementation Experience

In this section we describe our implementation experience. Our implementation took place before we developed the Coollist concept described in this paper. Thus, it does not necessarily prove the validity of our approach. Nevertheless, the positive reactions of the users provided us with enough motivation to pursue a full implementation of the described algorithm based on this experience.

5.1 Coollists Under the CERN Caching Server; Our Experience

In our implementation, a Coollist Repository does not collect Coollists from the individual users. Instead, it constructs something similar to merged Coollists from the HTML files under the ./cache directory of the CERN caching server. More precisely, it first applies the UNIX find command to locate HTML files; it then applies the ls -lut command to select those documents whose access time is later than the previous recording time; next, it sorts them in reverse chronological order and compiles the list of (<URL>, <document title>, <date viewed>) triplets by parsing the <TITLE> field and constructing the URL from the pathname; finally, it prepends the list of triplets to the merged Coollists.
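
The following Python sketch is a rough equivalent of this pipeline. The mapping from a cache pathname back to a URL is simplified here, since the actual mapping depends on the directory layout of the CERN server.

    import os
    import re

    TITLE_RE = re.compile(r"<TITLE>(.*?)</TITLE>", re.IGNORECASE | re.DOTALL)

    def scan_cache(cache_root, since, url_prefix="http://"):
        """Collect HTML files under the cache accessed after `since`
        (seconds since the epoch) as (URL, title, date-viewed) triplets,
        newest first -- a rough stand-in for find plus ls -lut."""
        triplets = []
        for dirpath, _, filenames in os.walk(cache_root):
            for name in filenames:
                if not name.endswith((".html", ".htm")):
                    continue  # real cache layouts may name files differently
                path = os.path.join(dirpath, name)
                atime = os.stat(path).st_atime  # last access, as ls -lut shows
                if atime <= since:
                    continue
                with open(path, errors="replace") as f:
                    match = TITLE_RE.search(f.read())
                title = match.group(1).strip() if match else name
                # Reconstructing the URL from the pathname is simplified here.
                url = url_prefix + os.path.relpath(path, cache_root)
                triplets.append((url, title, atime))
        triplets.sort(key=lambda t: t[2], reverse=True)  # reverse chronological
        return triplets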

Effectively, this method generates a list of the HTML documents most recently referenced by the users of the cache server. There are, however, many limitations to this method. First, it does not generate a true Coollist with the property of containing a cluster of related information. Furthermore, documents that the cache server does not cache are not stored in the Coollist Repository; this is a rather significant problem because those are the documents stored on the local hosts and thus often the most frequently referenced documents. Also, this cannot be a general solution because it depends too heavily on the directory structure of the cache server.

Alternatively, we could have used the ~/.mosaic-global-history files or the Window History listings in Mosaic as a means of collecting Coollists. However, the Global History file does not record the <TITLE> information along with the recorded URLs, and the Window History, which was actually the best candidate, could not be saved to a file.

We designated three CERN Caching servers as Coollist Repositories. One of them is located at our R&D center, covering about 120 users. The other two are located at the CAD Centers of the Refrigerator and Video manufacturing sites, respectively, covering about 100 users each. With the three Coollist Repositories we built four Coollist Libraries: one for each Coollist Repository, and a fourth that combines all three repositories. The search interface provided by a library is an HTML form file. A keyword search on titles returns the first matching title, and consecutive searches return the next matches. A matched title is returned along with the cluster of information it belongs to; in our implementation, we return the five documents above and below the matched document in the list. Users find this a very useful feature. With the full implementation of the algorithm we will be able to present to the user the actual cluster of information, which should prove to be even more useful.
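
A minimal sketch of this title search with its context window follows; case-insensitive substring matching is our assumption.

    def search_with_context(merged_list, keyword, start=0, window=5):
        """Return the index of the next title containing `keyword` at or
        after `start`, together with the `window` entries above and below
        it -- an approximation of the cluster the document belongs to."""
        kw = keyword.lower()
        for i in range(start, len(merged_list)):
            url, title, viewed = merged_list[i]
            if kw in title.lower():
                low = max(0, i - window)
                high = min(len(merged_list), i + window + 1)
                return i, merged_list[low:high]
        return None, []

A consecutive search is then simply another call with start set to one past the index returned by the previous call.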

Users at our R&D center use this tool regularly. Some of them are fascinated by it: when they want to see an interesting HTML document being displayed on the screen of a person in a neighboring cubicle, they are sure to find the document just by knowing part of its title. Using the keyword search on the Coollist Library, they find the title of the very document quickly, often on the first try. The group of documents referenced by the manufacturing sites is already beginning to reveal characteristics different from those referenced by our R&D center, which lends support to our hypothesis that people within the same organization are more likely to be interested in one cluster of documents than another.

5.2 Proposed Implementation Methods

A Coollist can best be implemented by taking advantage of the Window History feature in Mosaic. All that needs to be done is to modify Mosaic such that it:
  1. Automatically records every viewed HTML document as a (<URL>, <document title>, <date viewed>) triplet, as the Window History already does.
  2. Detects the idle time that marks the beginning of a new browsing session.
  3. Transmits the Coollist to the designated Coollist Repository and destroys it each time a new session starts.
A Coollist Repository server should be written and, hopefully, packaged with HTTP caching servers so that we can encourage making a cache site a Coollist Repository site as well. A Coollist Repository server should:
  1. Receive the Coollists transmitted by the clients.
  2. Prepend each newly arrived Coollist to the top of the merged list without sorting it or interleaving it with other Coollists.
  3. Serve the requested top portion of the merged list to the Coollist Libraries.
With these implementations, a Coollist Library can be built as described in Section 4.3.

6 Conclusion

A very different method of indexing HTML documents in the Web has been introduced. Our method follows the Talk-of-the-Town and Library metaphors and meets the four criteria that we believe are basic requirements for Web indexing systems: infostructure preservation, accuracy, maintainability, and scalability. Our prototype system received positive reactions from people at three different sites within our organization. However, implementing a widely used HTML document indexing system requires collaboration with the developers of the browsers and the HTTP servers.

7 About the Author

Jong-Gyun Lim received a B.S. Honors degree in Computer Science from New York University and an M.S. degree in Computer Science from Columbia University, and he is currently a Ph.D. candidate in the Department of Computer Science at Columbia University. During his leave of absence, he is working at the Samsung Electronics R&D Center as a senior engineer. His areas of interest are Natural Language Generation, Text Planning, and Knowledge Representation. His e-mail address is jglim@rnd.sec.samsung.co.kr.

References

 [1] T. Berners-Lee. Hypertext mark-up language. CERN.
    http://info.cern.ch/hypertext/WWW/MarkUp/MarkUp.html.

 [2] P. Deutsch and A. Emtage. The archie system: An internet electronic
    directory service. ConneXions, 6(2), February 1992.

 [3] B. Kahle and A. Medlar. An information system for corporate users:
    Wide area information servers. ConneXions - The Interoperability
    Report, 5(11):2-9, November 1991.

 [4] M. Koster. World wide web wanderers, spiders and robots. Nexor Corp.
    http://web.nexor.co.uk/mak/doc/robots/robots.html.

 [5] M. Koster. Aliweb - archie-like indexing in the web. In Proceedings of
    the First International WWW Conference, Geneva, May 1994.
    http://web.nexor.co.uk/mak/doc/aliweb-paper/paper/html.

 [6] O. A. McBryan. Genvl and wwww: Tools for taming the web. In
    Proceedings of the First International WWW Conference, Geneva, May 1994.
    http://www.cs.colorado.edu/home/mcbryan/public_html/bb/summary.html.

 [7] C. Neuss and S. Hofling. Lost in hyperspace? Free text searches in the
    web. In Proceedings of the First International WWW Conference, Geneva,
    May 1994.

 [8] M. F. Schwartz, A. Emtage, B. Kahle, and B. C. Neuman. A comparison of
    internet resource discovery approaches. Computing Systems, 5(4), 1992.

 [9] Veronica frequently asked questions.
    gopher://gopher.unr.edu/00/veronica/veronica-faq.

 [10] W3 Catalog README. http://cui_www.unige.ch/W3Catalog/README.html.