While the World-Wide Web (WWW, or Web for short) offers an incredibly rich base of information, organized as a hypertext, it does not provide a uniform and efficient way to retrieve specific information, based on user-defined search-criteria.
Two types of search tools have been developed for the WWW: index databases built by robots, and client-based search tools that navigate through the Web when a search is requested.
The future of client-based searching will depend heavily on assistance
from WWW servers.
The fish-search algorithm is currently being used to search by navigating
through individual WWW documents.
Substantial speedup and reduction of network resource consumption will
become possible when the same search algorithm is converted to navigate
through a Web of servers instead of a Web of documents.
In order to implement such a search algorithm, WWW servers need to offer
a service similar to that provided by
GlimpseHTTP.
This paper highlights the properties, implementation and possible
future developments of a client-based search tool called the
fish-search, and compares it to other approaches.
The fish-search, implemented on top of Mosaic for X, offers an open-ended
selection of search criteria.
It allows the search to start from the "current" document or from the
documents in the user's hotlist.
To some extent, the loose organization of the WWW is both the key to its
success and the biggest problem when trying to find specific information in it.
Documents in the WWW are uniquely identified by means of a
Universal Resource Locator (URL), which includes the name of the
server the document resides on.
Documents may contain pointers (links) to other documents, possibly
on other servers. The Web structure of documents and pointers makes the
WWW a Hypertext. Therefore we shall use the terms node and link
in the sequel, as is done in most hypertext literature [SK89].
The only access to information in the WWW is defined by means of the
HyperText Transfer Protocol (HTTP), which is used by the user's
(client) interface to retrieve nodes from a WWW server.
There is no protocol for asking a server what nodes it has, and no
protocol to find out which servers exist.
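As an illustration of this limitation, the following minimal sketch (in modern Python, with an example host and path) retrieves a single node by its URL; HTTP offers no corresponding request for listing a server's nodes or for discovering which servers exist.

    import http.client

    # The only operation a WWW client has is "retrieve the node named by this
    # URL".  There is no request for "list all nodes on this server" or
    # "list all existing servers".  Host and path below are examples only.
    conn = http.client.HTTPConnection("www.example.org")
    conn.request("GET", "/index.html")        # fetch one node by its URL
    response = conn.getresponse()
    body = response.read()
    conn.close()
    print(response.status, len(body), "bytes")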
The access to the WWW is therefore limited to browsing (sometimes
called navigation).
Browsing is the common interaction paradigm for hypertext, when a user
is gathering information. It is very useful for reading and comprehending
the contents of a hypertext, but not suitable for locating a specific
piece of information.
Most hypertext systems offer a global search facility that lets you
find a specific node based on information you know about its contents,
regardless of the navigation that would be
necessary (the links to follow) to get to that node.
In a distributed and loosely organized hypertext like the WWW there is
no global search facility.
However, two kinds of attempts have been made to provide such a
facility: index databases built by robots, and client-based search tools like the fish-search.
In Section 2 we describe how this client-based search works.
The search is currently embedded in Mosaic (but only in the University of
Tübingen's 2.4.2 version).
Section 3 briefly presents a Mosaic-independent interface to the fish-search,
by means of a cgi-script,
which must be installed on a local (server) machine.
Finally, in Section 4 we propose a new approach to client-based searching,
which assumes there will be some cooperation of the servers in the future.
Reliably finding information in the WWW will not be possible in the future
if the access to WWW-servers remains limited to the retrieval of (single)
nodes. Tools like
GlimpseHTTP
will improve the possibilities for finding information in the WWW.
Because the WWW is very large, not necessarily completely connected,
and changing all the time, it is impossible to provide databases that
truly function as a catalog of the WWW. The robots used to build such databases run for weeks,
gathering information, forgetting most of it (because there is too much
to index everything in the WWW), and failing to reach the newest and
still not very well known sites.
The information these databases contain is very useful, but unreliable and
incomplete. Most organizations offering access to such index databases
fail to warn the user about these shortcomings of their systems.
A number of factors determine how effective and efficient the search-tool
can be.
Apart from the search string, the widget offers two different sets of
selectable options and parameters.
A complete search of the WWW is normally impossible, since the WWW is not
necessarily completely connected, since the part of the WWW that can be
reached depends on how well the starting node is connected,
and since some parts of the WWW are "hidden" behind nodes with embedded
forms and clickable maps.
But most importantly, a complete search would take weeks.
The fish-search is intended
to run for a limited time, which can be selected, and to find a limited
number of relevant nodes, which can also be selected.
While the fish-search can be used as a general-purpose search tool,
special care has to be taken to prevent the search from being too slow and
ineffective. If all nodes have to be retrieved from remote locations,
such as servers on other continents, the search will be too slow.
Using a cache may significantly improve the search speed.
Limiting the search to a local domain may improve the speed even more.
The fish-search has been used very successfully as a global search tool
for local parts of the WWW.
In order to find information in the whole WWW, starting from an answer
produced by an index database like the
JumpStation or the
World Wide Web Worm
may increase the chance of finding relevant nodes.
A detailed description of the fish-search algorithm and its heuristics
can be found in [BHKP94].
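To convey the flavor of the algorithm, here is a highly simplified sketch in Python; the helper functions and the way depth is handled are rough approximations chosen for this example, not the actual implementation or the heuristics described in [BHKP94].

    import re
    import time
    from urllib.request import urlopen

    def fetch(url):
        """Retrieve one node; return its text, or None on failure."""
        try:
            with urlopen(url, timeout=10) as f:
                return f.read().decode("latin-1", errors="replace")
        except OSError:
            return None

    def extract_links(text):
        """Very crude href extraction, for illustration only."""
        return re.findall(r'href="(http[^"]+)"', text, re.IGNORECASE)

    def is_relevant(text, query):
        """Simplistic relevance test: the query string occurs in the node."""
        return query.lower() in text.lower()

    def fish_search(start_url, query, depth=3, max_answers=10, time_limit=120):
        """Simplified fish-search: follow links from relevant nodes eagerly,
        and give up on a branch after `depth` irrelevant nodes in a row."""
        candidates = [(start_url, depth)]          # (url, remaining depth)
        seen, answers = set(), []
        deadline = time.time() + time_limit
        while candidates and len(answers) < max_answers and time.time() < deadline:
            url, d = candidates.pop(0)
            if url in seen:
                continue
            seen.add(url)
            text = fetch(url)
            if text is None:
                continue
            relevant = is_relevant(text, query)
            if relevant:
                answers.append(url)
            # Children of relevant nodes are explored further ("the fish
            # reproduce"); children of irrelevant nodes get a reduced depth.
            child_depth = depth if relevant else d - 1
            if child_depth > 0:
                children = [(l, child_depth) for l in extract_links(text)]
                # Children of relevant nodes go to the front of the list.
                candidates = children + candidates if relevant else candidates + children
        return answers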
We concentrate on the search-tool that is part of the Tübingen
Mosaic 2.4.2, a derivative of NCSA's Mosaic for X 2.4.
The figure below shows the widget used to specify a search request.
Figure 1: Widget for specifying a search request.
In the example shown in the figure, the search criterion is the command
!agrep -1 "comp[ue]ting" | wc -l (an agrep pipeline that counts the lines
approximately matching the pattern), and the starting node resides on the
server win.tue.nl.
When the domain menu item *.y.z is selected (as in the figure), only nodes
on servers with a domain name ending in tue.nl are searched.
A more general way to specify the domains to include or exclude from the
search is being considered for future versions of the search tool.
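A small sketch of how the *.y.z restriction described above can be expressed; the function names are hypothetical and used only for illustration.

    def domain_suffix(server, components=2):
        """Keep the last `components` labels of a server name,
        e.g. "win.tue.nl" -> "tue.nl" for the *.y.z setting."""
        return ".".join(server.split(".")[-components:])

    def allowed_by_domain(candidate, start_server, components=2):
        """True if `candidate` lies in the same *.y.z domain as the start server."""
        suffix = domain_suffix(start_server, components)
        return candidate == suffix or candidate.endswith("." + suffix)

    # With start server "win.tue.nl" and the *.y.z setting, only servers
    # ending in "tue.nl" are searched:
    assert allowed_by_domain("www.phys.tue.nl", "win.tue.nl")
    assert not allowed_by_domain("info.cern.ch", "win.tue.nl")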
The "Query Setup" window shown in Figure 1 only contains widgets one
can include in WWW forms.
The fish-search form is not identical to the Query Setup window for Mosaic.
Since your "current" node is the search form, the "current" node cannot
also be the answer from an index database or any other node you would like.
Therefore, the form contains a field for specifying the URL of the document
to start from.
The server also does not know what your hotlist contains.
Searching your hotlist through the forms-based search is not possible.
Using the fish-search by means of this form requires you to install the
actual search program on a WWW server. Also, you must write an HTML document
containing the form. The default selections for the toggles and menus
are preset in the form. In order to modify them you must change the form,
not a configuration file for your WWW browser.
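As an illustration of what such an installation involves, here is a minimal sketch of a cgi-script front end in Python; the field names and the call to the fish_search() function sketched earlier are assumptions made for this example, not the actual fish-search interface.

    #!/usr/bin/env python3
    import cgi

    def main():
        form = cgi.FieldStorage()
        query = form.getfirst("query", "")
        # A form has no notion of a "current" node or of the user's hotlist,
        # so the starting URL must be supplied as an ordinary form field.
        start_url = form.getfirst("start_url", "")
        max_answers = int(form.getfirst("max_answers", "10"))

        print("Content-Type: text/html\n")
        print("<title>Search results</title>")
        for url in fish_search(start_url, query, max_answers=max_answers):
            print('<a href="%s">%s</a><br>' % (url, url))

    if __name__ == "__main__":
        main()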
Compared to the Mosaic version of the fish-search, the forms-based version
has two more drawbacks:
In order to be useful for searching the neighborhood of a remote site
the fish-search program needs to be installed on a WWW server in that
neighborhood. Only then are the Internet transfers limited to that
neighborhood, and thus faster. So which neighborhoods can be efficiently
searched depends on where the fish-search programs are installed.
The fish-search cgi-script, when completed, will be made available from
ftp.win.tue.nl in the directory pub/infosystems/www,
a directory that also contains a copy of the Tübingen Mosaic for X and the
Lagoon cache [BP94].
The problem of finding information, even if one has ample time, like the robots have, is further complicated by the fact that the WWW is not necessarily completely reachable even from a large number of starting nodes (there will always be servers nobody is pointing to yet), and there are parts of the WWW that are hidden behind forms and clickable maps. The number of possible inputs for forms and maps is much too large to try all possibilities in order to find the information behind them.
The core of the problem, however, is the following:
A database like GlimpseHTTP can be kept up to date by means of a program
that is run daily (nightly).
A complete and up to date list of all the nodes on a server containing
a given string, expression, or approximation, can be obtained within seconds.
With the fish-search this would take minutes to hours depending on the
connection with that server, and with all existing index-databases put
together, this list would still not be obtained because all these databases
are always outdated and incomplete.
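To make the first half of this comparison concrete, the following is a minimal sketch of such a per-server index: a job run nightly scans the server's document tree, and queries are answered locally from the stored index. The file layout and the regular-expression matching are illustrative assumptions, not GlimpseHTTP itself (which uses the glimpse/agrep machinery).

    import os, pickle, re

    def build_index(document_root, index_file="server.index"):
        """Scan the server's document tree (run nightly, e.g. from cron)."""
        index = {}                                     # path -> document text
        for dirpath, _dirs, files in os.walk(document_root):
            for name in files:
                if name.endswith((".html", ".txt")):
                    path = os.path.join(dirpath, name)
                    with open(path, errors="replace") as f:
                        index[path] = f.read()
        with open(index_file, "wb") as f:
            pickle.dump(index, f)

    def query_index(pattern, index_file="server.index"):
        """Return all indexed nodes whose contents match `pattern`."""
        with open(index_file, "rb") as f:
            index = pickle.load(f)
        return [path for path, text in index.items() if re.search(pattern, text)]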
In order to search the whole WWW (or a large part thereof) for a string
or expression, the navigation strategy of the fish-search could be used
to jump from server to server, instead of from node to node.
To do that, a server should not only provide a list of nodes containing
relevant information, but also a list of other servers it knows about.
Each of these servers can then be asked the same question.
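A sketch of what such a server-hopping search could look like; the query_server() call and its (matching nodes, known servers) response are assumptions about a service that does not yet exist.

    def query_server(server, query):
        """Ask one server for (matching_nodes, known_servers).  This is a
        placeholder for the proposed "extended GlimpseHTTP-like" service."""
        raise NotImplementedError

    def server_fish_search(start_server, query, max_servers=50):
        """Navigate from server to server instead of from node to node."""
        to_visit, visited, answers = [start_server], set(), []
        while to_visit and len(visited) < max_servers:
            server = to_visit.pop(0)
            if server in visited:
                continue
            visited.add(server)
            matching_nodes, known_servers = query_server(server, query)
            answers.extend(matching_nodes)
            to_visit.extend(s for s in known_servers if s not in visited)
        return answers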
Providing this kind of information on the whole WWW is not as difficult
as it seems. There are only a few popular WWW server packages, most
notably the ones offered by CERN and NCSA. If these packages were to
include such an "extended GlimpseHTTP-like" tool and a standard
form to access it, the implementation of an efficient and very effective
fish-search would be relatively easy.
Index-databases could be replaced by caches containing recent answers
to searches for frequently used strings or expressions. They could also
provide the names (and addresses) of servers they know about.
For many applications the quick but incomplete and outdated answers
one gets from the current generation of index databases will still be
sufficient for some time. For up-to-date information on occurrences of
strings or expressions in a small part of the WWW the current fish-search
will still be sufficient. But for a high-quality information service,
the client-based node by node search is too slow, and the second-hand
information on the index-databases too unreliable. Furthermore,
more and more people (administrators) object to the network resource
consumption by the robots that are used by the current generation of
search tools.
By installing software on each server (or small clusters of servers)
that provides up-to-date and complete information on its own part of the
WWW, the Web should become searchable again.
Organized (query) access to a server's information should be offered
by that server, not by remote databases or search tools.
This idea motivated the designers of
GlimpseHTTP
to develop a forms-based search interface to the complete information
that is stored on a single server (actually, in a directory tree).
GlimpseHTTP offers approximate regular expression search (from which the
fish-search borrowed the agrep library) on the complete contents of all
nodes on a server, including the nodes that are hidden behind forms
and maps, and the nodes to which there is no link in the WWW, not even
on that same server.
Paul De Bra received his Ph.D. in computer science from the University of Antwerp in 1987. His research was on Horizontal Decompositions of Relational Databases. As a post-doctoral researcher he spent one and a half years at AT&T Bell Laboratories in Murray Hill, NJ, working on WYSIWYG interfaces for document processing. In December 1989 he joined the Eindhoven University of Technology.
His research interests include models, information retrieval, and structure analysis and querying for hypermedia databases.