Using Coollists to Index HTML Documents in the Web
Jong-Gyun Lim
E-CIM Center, Corporate Technical Operations
Samsung Electronics, Suwon, Korea
Abstract
This paper suggests a partial solution (limited to HTML documents) to the
Web-indexing problem using Coollists. Roughly, a Coollist is
equivalent to a Hotlist in Mosaic except that it automatically
records all the visited HTML document titles by default. Thus, in
theory, by maintaining a merged list of everybody's Coollists, a complete
index of all the HTML files in the Web should be created eventually.
In practice, even if transferring everybody's Coollists to a single site
were feasible, the growth and change rate of the Web raises the question of
whether the archie metaphor of "every index server maintains all the
know-wheres" could be applied to the rest of the Web.
The new metaphor we are suggesting is a library metaphor. Let each
organization maintain the merged Coollists of its individuals. If some
organization has surplus computing resources, let it maintain the merged
list of other merged lists. This way, individuals are likely to find
documents of interest within their own organization. But organizations
have characteristics just as libraries have specialities. Therefore,
individuals will also find other interesting documents at "neighboring"
sites. Bigger libraries carry more books. Likewise, there will be sites
that merge many merged lists together, which will be useful for blind
keyword searches of the titles.
For our current implementation of Coollists, we take advantage of the CERN
proxy-cache server to collect the indices of all the visited HTML files.
People at three of the 19 plants within the company tried the merged list
of Coollists and found it almost indispensable. People who used to save
almost every URL they visited and those who wanted some comprehensive list
of URLs found it particularly useful.
In the paper, we describe the results of our experiment in detail and also
point out how our approach might solve the scalability problem of other
Web indexing solutions.
1 Introduction
The Web is growing fast. The introduction to the Web of the Hypertext Markup
Language [1], along with browsers such as Mosaic, has accelerated the growth
of the Web as much as it has facilitated the browsing of Web documents. The
number of HTML documents has already exceeded that of Gopher documents, and
the HTML+ document style is quickly emerging as the standard Web document
style. There is a dilemma, though. The freedom that a hypertext document
provides to its users causes them to get lost in the complex graph of
hypertext documents. Browsers provide tools like Window History or Hotlist
to help the viewers navigate the hypertext space.
However, if we want to find a particular piece of information in the Web, we
must somehow discover the documents in the Web and organize them in such
a way that we can find their presence with minimal effort. A number of tools
have already been developed that traverse and index the Web automatically.
Also, research efforts are under way to find the best Web resource
discovery and indexing ideas.
In this paper we briefly review the existing Web indexing solutions and
their limitations and suggest a solution complementary to the existing
ones. The two main differences between our solution and those of the others
are that in our solution:
- An index of Web resources is automatically updated by the viewers as they
reference the document rather than by the document providers.
- An index is not a list of documents but a list of document clusters
such that the result of a query is the list of related documents as well as
the document with the (partially) matching title.
The major advantages of our approach are that we solve the resource
discovery problem without the use of expensive Web wanderers such as Robots
or Worms [4] or the manual maintenance effort of document providers, and
that the structure of the index resembles the infostructure of the Web more
closely than that of other existing indexing systems. Our solution is also
based on the following hypotheses:
- Organizing the entire Web into a subject hierarchy is difficult.
- Finding a desired document in the hierarchically organized Web is even more difficult.
- A more recently browsed document is usually more valuable.
- Documents browsed by a person during a contiguous time span are likely to be related.
- People within the same organization are more likely to be interested in one cluster of documents than in another.
With these hypotheses, we have designed an indexing mechanism using
Coollists. In the following two sections we describe the existing indexing
solutions and the motivations for using Coollists as our indexing solution.
Then in Section 4, we define a Coollist and describe the indexing algorithm
using Coollists. In Section 5 we describe our implementation experience
with Coollists and the reactions of the users of the experimental
system. Based on our experience, we make suggestions on implementing an
HTML document indexing system using Coollists.
2 Existing Web Indexing Solutions
Today if we want to find some desired information in the Web, we can choose
from a number of services that employ different resource discovery
methods and search interfaces. Some of them are more actively used than
others, but they all suffer from one limitation or another. Upon
studying those services we have identified four criteria that we believe
should be met by a successful Web indexing system: 1. Infostructure
preservation, 2. Accuracy, 3. Maintainability, 4. Scalability.
HTML documents on the Web have a graph infostructure, unlike Gopher or
Anonymous FTP, which have a hierarchical subject infostructure. Information
on Gopher or FTP is meant to be found by recursively browsing the sub-menus
or subdirectories, whereas information on the Web is meant to be found by
following hypertext links. Both Gopher and FTP experienced fast growth, but
the need to search the entire Gopherspace or the FTP sites is satisfied by
Veronica [9] and archie [2], respectively. We believe that one of the
reasons why Veronica and archie became widely used is that their query
results were presented in the same infostructure as that of the documents
being searched; the result of a Veronica search puts us on a Gopher menu,
and archie returns a list of paths and file names. We refer to this
characteristic of an indexing system as infostructure preserving.
BBS style HTML document indexing systems such as GENVL [6] are not
infostructure preserving because they attempt to organize the HTML document
indexes under a subject hierarchy. Information requesters must use their
judgement to select a more closely matching subject name from the top level
subject listing and recursively search the sub-listings until they find the
requested document. This is like overlaying a Gopher infostructure on top of
the HTML document infostructure. Because the authors of the HTML documents
do not have the Gopher-like infostructure in mind when creating the
documents, categorizing them into a subject hierarchy is challenging, not to
mention finding them.
Archie-like HTML document indexing systems such as ALIWEB [5] and W3
Catalog [10] do not overlay a different infostructure, but neither do they
take advantage of the graph infostructure of the HTML documents; these
systems generate a plain list of HTML document titles (and keywords in the
case of ALIWEB) that (partially) match the requested keyword.
There are many Spiders, Worms, and Robots that automatically traverse
and discover resources on the Web [4]. However, they typically leave only a
plain list of the items discovered and lose the structural information that
they found during the traversal.
HTML document titles are not always very meaningful. Authors of HTML
documents often assume that the readers are within the context
of the parent document and thus omit redundant information from the title.
They sometimes even pass the title of a parent document on to several of its
children documents. This is problematic for systems that rely on
titles to search for the requested document. We call this an
accuracy problem. An indexing system with little accuracy
results in longer search times on the requester's side because requesters
have to try and err several times before finding the desired document.
W3 Catalog and WWWW [6] generate lists of HTML document titles that
(partially) match the keyword request. A title alone without context
information, however, is not sufficient to identify the content. These
limitations may be complemented by WAIS-like services [3] [7] that
maintain precomputed indexes of the text in the documents.
GENVL is better because it provides structural information in addition
to the titles. However, following wrong branches increases the search
effort and thus lengthens the search time.
ALIWEB is perhaps the most accurate indexing system available because
HTML documents in ALIWEB carry additional fields of information: a
description and keywords. These fields, however, are not required fields of
an HTML document, and thus document providers must make an additional effort
to supply them.
Ideally, a resource discovery system should be fully automatic. Archie and
Veronica require the least human intervention, which is another reason why
they are successful. An HTML document indexing system that requires the
least amount of continuous manual effort to maintain is the most
maintainable system.
Although ALIWEB, like archie, requires only the initial registration of the
sites serving HTML index files, the effort required to create the index
file and the additional, continuous effort to update it prohibit its
widespread growth.
GENVL requires no manual maintenance effort to maintain the system
itself. "Posting" HTML documents to the Mother-of-All BBS does require
continuous effort, though.
W3 Catalog requires no manual maintenance effort except for adding new
Catalog sources. Web wanderers and WWWW require no manual maintenance
effort either.
What portion of the Web does the indexing system cover? As the coverage of
the Web increases, will the indexing system create a resource bottleneck on
a single machine? Most of the existing indexing systems are limited by some
sort of scalability problem. Systems like WAIS that manage huge
amounts of indexing data distribute the indices to several sites. However,
this places the burden on the users, who must find the right index site to
search for the document they want. Archie maintains relatively little index
information. Thus, all the archie indexes are stored in a central location
and duplicated to a number of mirroring sites. However, as
the number of archiving sites increases, archie indexes are now facing the
need to be distributed like WAIS indexes [8].
Existing HTML document indexing systems face the same kind of
scalability problems that WAIS and archie are experiencing. ALIWEB must
process and store all the registered index files, which will be a problem as
the number of registered sites increases. W3 Catalog and WWWW will need to
find a way to distribute their index databases to avoid a bottleneck on
particular systems.
GENVL's BBS approach is perhaps the most scalable approach because the
hierarchical nature of its index organization distributes the resources to
many different sites. However, pages towards the top of the hierarchy, such
as the Mother-of-All BBS page and some of its more popular children pages,
will eventually need to be replicated among several sites.
3 Motivations
Before introducing Coollists and our solution to the HTML document indexing
problem using Coollists, we share the two metaphors that motivated our
solution: the Talk-of-the-Town Metaphor and the Library
Metaphor. We also discuss our hypothesis that HTML documents browsed
by a person during a contiguous time span are likely to be related. This
hypothesis will be the basis for our HTML document infostructure
reconstruction algorithm.
In a typical town, one can hear about the current events and news that are
most talked about. Those who have been to an interesting event spread the
news to those who have not. There are also newspapers and guide books that
summarize the current events, places to visit, things to buy, and books to
read. Usually the most interesting talk of the town is the most recent ones
and as time progresses they become less important or obsolete.
The situation is similar with HTML documents. While you are looking at an
interesting HTML document, someone next door looks over at your screen and
asks for the URL of that document. Or while browsing the Web, you find
documents that you think should be shared with others in your organization,
and you either place them on your homepage or store their URLs in your
Hotlist and mail it out or post it on the local BBS. If a URL is indeed
interesting and useful, it will be referenced by most of the people in the
organization, maybe even multiple times. Chances are that if someone
collected all the URLs that are referenced by the people in that
organization, then the most current and relevant URLs of that organization
could be identified.
At a typical university, there are several specialized libraries and
possibly one or more general libraries. A specialized library is a place
where people of common interests agree to place their books so that they
can be shared by others; e.g., Math library, Science library, and
Music library. The collection of books maintained by those libraries
changes when new users are introduced or the interests of the existing users
change. Specialized libraries are physically located near the departments
where the users are. Most people can find books of their interest from the
specialized libraries, but when they cannot, they must make a trip to other
libraries.
A general library houses many books from many different subject areas.
Once within the general library, people can find most of the books they
want. However, the time it takes to travel to the distant general library
makes it more troublesome to find a book from the general library than from
the specialized library. The bigger the general library, the better the
chance of finding the books, but the more effort it takes to access the
books from it; e.g., making a trip to the Library of Congress.
Assuming that the collection of all the URLs referenced by an
organization is housed in a specialized library, people in that
organization will be able to find many of the needed HTML documents in
that library. If each organization maintains such a library, then people
from one organization may "visit" another organization's library to find out
or learn about the interests of that organization. For example, if a
researcher at an electronics research center wants to find information about
DNA, s/he may search the Biochemistry Lab library that contains the
collection of all the URLs referenced by the researchers in that laboratory.
Organizations that have surplus resources may decide to create a general
library of referenced URLs. For example, someone may decide to merge all the
URL listings from 100 biochemistry lab libraries in the U.S. or from all the
URL libraries of manufacturing sites within a company. The more general the
URL libraries are, the more chance there is to find a document, but it will
take longer to search for it.
A cluster of information is a related set of information. For
example, a list of files under a subdirectory of an anonymous FTP site is a
cluster of information. A list of documents on the same Gopher menu is a
cluster of information. For the HTML document infostructure, a cluster of
information could be defined as the set of documents referenced by a single
HTML document. However, there are also many closely related HTML documents
that are not referenced by a common parent HTML document, and even documents
that are referenced by the same parent HTML document may not necessarily be
a related set of information.
Another, and perhaps more accurate, way of defining an HTML document
cluster is to group the set of HTML documents that are browsed by a person
during a contiguous time span. Unless the person is making a random
selection of URLs, s/he is likely to browse documents that are related to
the task at hand. Also, browsing typically follows the infostructure of
HTML documents. Thus, a parent document and its children are likely to fall
into the same information cluster. Because task-oriented browsing of
documents will only select those document links that are related to the
task, some of the children that are only remotely related to their parent
document will be excluded from the cluster, which is why our way of
clustering HTML documents may be more accurate.
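As a concrete illustration of this clustering hypothesis, the following is a
minimal sketch in Python that splits a person's browsing records into
clusters at idle gaps. The (timestamp, URL, title) record format and the
30-minute threshold are our own illustrative assumptions; the paper does not
fix an exact idle time.

```python
from datetime import timedelta

# Illustrative idle threshold; the exact value is left open.
IDLE_GAP = timedelta(minutes=30)

def cluster_visits(visits):
    """Group (timestamp, url, title) records into clusters of related documents.

    A new cluster starts whenever the gap since the previous visit exceeds
    IDLE_GAP, i.e. the person is assumed to have switched tasks.
    """
    clusters = []
    current = []
    previous_time = None
    for timestamp, url, title in sorted(visits, key=lambda v: v[0]):
        if previous_time is not None and timestamp - previous_time > IDLE_GAP:
            clusters.append(current)
            current = []
        current.append((url, title))
        previous_time = timestamp
    if current:
        clusters.append(current)
    return clusters
```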
4 Indexing Algorithm Using Coollists
In this section, we define the Coollist, Coollist Repository, and
Coollist Library, and describe how each of them works with the others
in the HTML document indexing algorithm.
4.1 Coollist
A Coollist is a list exactly like the Hotlist in NCSA Mosaic
except that all the viewed HTML documents are automatically added to the
list by default. As in the Hotlist, the format of an entry in a Coollist
is the (<URL>, <document title>, <date viewed>) triplet. A Coollist is
transmitted to a Coollist Repository and destroyed each time a new browsing
session is started. The beginning of a browsing session is defined by
idle time, a duration of time long enough to assume that the user
has switched tasks.
4.2 Coollist Repository
A Coollist Repository is a designated server to which Coollists
are transmitted, and where they are merged and stored. During the
installation of browsers like Mosaic, client systems are recommended to
select the closest Coollist Repository server to transmit their Coollists
to. By default, HTTP cache servers will be used as Coollist Repository
servers. We encourage this as the default setting because we believe that
caching servers will be in widespread use in the near future and that
clients are most likely to choose a caching server that is closest in
physical distance, thus minimizing the network traffic.
The merged list of Coollists should never be sorted. Also, Coollists
should not be mixed with other Coollists. A newly arrived Coollist should
be appended to the top of the merged list without any alteration. The
length of the merged list is limited only by the resource availability of
the server. However, most Coollist Libraries will request only a
certain portion from the top of the merged list.
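The repository behaviour described above can be summarized in a few lines.
The sketch below is a minimal Python illustration under our own class and
method names; only the triplet format and the prepend-without-sorting rule
come from the text.

```python
class CoollistRepository:
    """Keeps a merged list of Coollists, never sorted or interleaved."""

    def __init__(self, max_entries=None):
        self.merged = []                 # newest Coollist first
        self.max_entries = max_entries   # bounded only by server resources

    def receive(self, coollist):
        """Prepend a newly arrived Coollist without altering its order."""
        self.merged = list(coollist) + self.merged
        if self.max_entries is not None:
            self.merged = self.merged[:self.max_entries]

    def top(self, n):
        """Serve the top portion of the merged list requested by a Library."""
        return self.merged[:n]
```

A Coollist Library would then call top(n) to retrieve whatever portion of the
merged list it wants.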
4.3 Coollist Library
A Coollist Library is the HTML document indexing system that
collects merged Coollists from one or more Coollist Repositories and
provides search interfaces to the document requesters. Any host can become
a Coollist Library as long as it satisfies the following three requirements:
- Announce the name of the library interface HTML form document with its URL.
- Provide the list of Coollist Repositories and/or other Coollist
Libraries from which it collects the merged Coollists.
- Provide descriptive information about its specialities (if it is a specialized library).
A library that collects merged Coollists from other libraries is likely
to become a general library. Actually, a general library does not receive
Coollists through the special libraries. Instead, it retrieves the list of
Coollist Repositories from each of the special libraries and then receives
the Coollists directly from those Coollist Repositories, as sketched after
the list below. This approach avoids a number of potential problems:
- Graph cycling problem: a library receives its own Coollists through
other libraries.
- Coollist Library host bottleneck problem: some libraries must transfer
all of their possibly huge data regularly.
- Coollist transfer latency problem: Coollists become outdated while hopping
around several libraries.
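A minimal sketch of this collection scheme follows. The transport helpers
fetch_repository_list and fetch_top are hypothetical stand-ins for whatever
transfer mechanism the libraries and repositories agree on, and the portion
size is purely illustrative.

```python
def collect_for_general_library(special_libraries, fetch_repository_list,
                                fetch_top, portion=500):
    """Build a general library's merged list by going directly to the
    Coollist Repositories rather than through the special libraries.

    fetch_repository_list and fetch_top are caller-supplied (hypothetical)
    transport helpers; portion is an illustrative request size.
    """
    repositories = set()
    for library in special_libraries:
        # Each special library publishes the repositories it collects from.
        repositories.update(fetch_repository_list(library))

    merged = []
    for repository in repositories:  # de-duplicated, so no Coollist arrives twice
        merged.extend(fetch_top(repository, portion))
    return merged
```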
Bigger libraries will access their Coollist Repositories at a
lower frequency than smaller libraries will. Also, bigger libraries will
request a smaller portion of the lists from the Coollist Repositories
than smaller libraries will. This flexibility allows general
libraries to balance their coverage and the usage of their resources.
5 Implementation Experience
In this section we describe our implementation experience. Our
implementation took place before we developed the Coollist concept
described in this paper. Thus, it does not necessarily prove the validity
of our approach. Nevertheless, the positive reactions from the users
provided us with enough motivation to build a full implementation of the
described algorithm based on this one.
In our implementation, a Coollist Repository does not collect the Coollists
from the individual users. Instead, it constructs something similar to
merged Coollists from the HTML files under the ./cache directory of the
CERN caching server. More precisely, it first applies the UNIX find
command to find HTML files, then applies the ls -lut command to select
those documents whose access time is later than the previous recording
time, next sorts them in reverse chronological order, compiles the list
of (<URL>, <document title>, <date viewed>) triplets by parsing the <TITLE>
field and constructing the URL from the pathname, and finally prepends the
list of triplets to the merged Coollists.
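The following Python sketch mirrors this procedure, replacing the
find / ls -lut pipeline with a directory walk. It assumes a cache layout in
which the pathname under the cache root mirrors the document's URL
(e.g. ./cache/http/host/path.html); that layout assumption, and the helper
names, are ours.

```python
import os
import re
import time

TITLE_RE = re.compile(r"<TITLE>(.*?)</TITLE>", re.IGNORECASE | re.DOTALL)

def scan_cache(cache_root, last_scan_time):
    """Collect (url, title, date_viewed) triplets for cached HTML files
    accessed since the previous recording time, newest first."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(cache_root):
        for name in filenames:
            if not name.endswith((".html", ".htm")):
                continue
            path = os.path.join(dirpath, name)
            accessed = os.stat(path).st_atime        # access time, as with ls -lut
            if accessed <= last_scan_time:
                continue
            with open(path, errors="replace") as f:
                match = TITLE_RE.search(f.read())
            title = match.group(1).strip() if match else name
            # Assumed layout: <cache_root>/<scheme>/<host>/<path>
            relative = os.path.relpath(path, cache_root)
            scheme, _, rest = relative.partition(os.sep)
            url = f"{scheme}://{rest.replace(os.sep, '/')}"
            entries.append((accessed, url, title))
    entries.sort(reverse=True)                        # reverse chronological order
    return [(url, title, time.ctime(accessed)) for accessed, url, title in entries]
```

The returned triplets are then prepended to the merged Coollists as described
above.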
Effectively, this method generates the list of HTML documents that were
most recently referenced by the users of the cache server. There are many
limitations to this method, however. First, it does not generate a true
Coollist with the property of containing a cluster of related information.
Furthermore, documents that the cache server does not cache are not stored
in the Coollist Repository, which is a rather significant problem because
those are the documents stored on the local hosts and thus often the most
frequently referenced documents. Also, this cannot be the general solution
because it depends too much on the directory structure of the cache server.
Alternatively, we could have used the ~/.mosaic-global-history files or
the Window History listings in Mosaic as a means of collecting Coollists.
However, the Global History file does not keep the <TITLE>
information along with the recorded URLs, and the Window History, which
was actually the best candidate, cannot be saved to a file.
We designated three CERN caching servers as Coollist Repositories. One
of them is located at our R&D center, covering about 120 users. The other
two are located at the CAD Centers of the Refrigerator and Video
manufacturing sites, respectively, covering about 100 users each. With the
three Coollist Repositories we built four Coollist Libraries: one for each
Coollist Repository and a fourth one that combines all three repositories.
The search interface provided by a library is an HTML form file. A keyword
search on titles returns the first title matched, and consecutive searches
return the next matches. A matched title is returned along with the cluster
of information it belongs to. In our implementation, we return the five
documents above and below the matched document in the list. Users find this
a very useful feature. With the full implementation of the algorithm we will
be able to present to the user the actual cluster of information, which
should prove to be even more useful.
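The behaviour of this search interface can be sketched as follows. The
function name and the start parameter used to resume a consecutive search
are our own; the five-above/five-below context window is the one described
above.

```python
def search_with_context(merged_list, keyword, start=0, context=5):
    """Return the first entry at or after `start` whose title contains
    `keyword` (case-insensitively), together with the five entries above
    and below it, and the index from which a consecutive search resumes.

    merged_list is a list of (url, title, date_viewed) triplets.
    """
    keyword = keyword.lower()
    for index in range(start, len(merged_list)):
        _url, title, _date = merged_list[index]
        if keyword in title.lower():
            lower = max(0, index - context)
            upper = min(len(merged_list), index + context + 1)
            return merged_list[lower:upper], index + 1
    return [], len(merged_list)
```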
Users at our R&D center use this tool regularly. Some of them are
fascinated by it: when they want to see an interesting HTML document being
displayed on the screen of a person in a neighboring cubicle, they are sure
to find the document just by knowing part of its title. Using the keyword
search on the Coollist Library, they can find the title of that very
document on the first try. The group of documents referenced by the
manufacturing sites is already beginning to reveal characteristics different
from those referenced by our R&D center, which will eventually support our
hypothesis that people within the same organization are more likely to be
interested in one cluster of documents than in another.
A Coollist can best be implemented by taking advantage of the Window
History feature in Mosaic. All that needs to be done is to modify Mosaic, as
sketched after the list below, such that it:
- Saves the Window History into a dot file like .mosaic-window-history,
- Transmits the dot file to the repository site and purges it upon invocation, and
- Provides a resource that points to its Coollist Repository site.
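A minimal sketch of the transmit-and-purge step follows. The dot-file name
comes from the list above, but the plain-text POST to the repository and the
payload format are assumptions of ours, not part of any existing Mosaic or
CERN server interface.

```python
import os
import urllib.request

DOT_FILE = os.path.expanduser("~/.mosaic-window-history")

def transmit_and_purge(repository_url):
    """On browser invocation, send the saved Window History (the previous
    session's Coollist) to the Coollist Repository, then delete the file.

    The POST endpoint and payload format are illustrative assumptions.
    """
    if not os.path.exists(DOT_FILE):
        return
    with open(DOT_FILE, "rb") as f:
        body = f.read()
    request = urllib.request.Request(
        repository_url, data=body, headers={"Content-Type": "text/plain"})
    urllib.request.urlopen(request)
    os.remove(DOT_FILE)
```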
A Coollist Repository server should be written and hopefully packaged with
HTTP caching servers so that we can encourage making a cache site a
Coollist Repository site as well. A Coollist Repository server should:
- Receive the Coollists from the browsers that transmit them, and
- Transmit its merged Coollists to the Coollist Libraries that request them.
With these implementations, a Coollist Library can be built as described in Section 4.3.
6 Conclusion
A very different method of indexing HTML documents in the Web has been
introduced. Our method follows the Talk-of-the-Town and Library metaphors
and meets the four criteria that we believe are basic requirements for Web
indexing systems: infostructure preservation, accuracy, maintainability, and
scalability. Our prototype system received positive reactions from people
at three different sites within our organization. However, implementing a
widely used HTML document indexing system requires collaboration with the
developers of the browsers and the HTTP servers.
7 About the Author
Jong-Gyun Lim received a B.S. Honors degree in Computer Science from New
York University and an M.S. degree in Computer Science from Columbia
University, and he is currently a Ph.D. candidate in the Department of
Computer Science at Columbia University. During his leave of absence, he is
working at the Samsung Electronics R&D Center as a senior engineer. His
areas of interest are Natural Language Generation, Text Planning, and
Knowledge Representation. His email address is:
jglim@rnd.sec.samsung.co.kr
References
[1] T. Berners-Lee. Hypertext mark-up language. CERN.
http://info.cern.ch/hypertext/WWW/MarkUp/MarkUp.html.
[2] P. Deutsch and A. Emtage. The archie system: An internet electronic
directory service. ConneXions, 6(2), February 1992.
[3] B. Kahle and A. Medlar. An information system for corporate users:
Wide area information servers. ConneXions - The Interoperability
Report, 5(11):2-9, November 1991.
[4] M. Koster. World wide web wanderers, spiders and robots. Nexor Corp.
http://web.nexor.co.uk/mak/doc/robots/robots.html.
[5] M. Koster. Aliweb - archie-like indexing in the web. In Proceedings of
the First International WWW Conference, Geneva, May 1994.
http://web.nexor.co.uk/mak/doc/aliweb-paper/paper/html.
[6] O. A. McBryan. Genvl and wwww: Tools for taming the web. In
Proceedings of the First International WWW Conference, Geneva, May 1994.
http://www.cs.colorado.edu/home/mcbryan/public_html/bb/summary.html.
[7] C. Neuss and S. Hofling. Lost in hyperspace? free text searches in the
web. In Proceedings of the First International WWW Conference, Geneva,
May 1994.
[8] M. F. Schwartz, A. Emtage, B. Kahle, and B. C. Neuman. A comparison of
internet resource discovery approaches. Computing Systems, 5(4), 1992.
[9] gopher://gopher.unr.edu/00/veronica/veronica-faq.
[10] http://cui_www.unige.ch/W3Catalog/README.html.