The TkWWW Robot: a Distributed Database Query Processor

The TkWWW Robot extends the functionality of the TkWWW browser. It can also be used as a standalone agent that processes queries on the WWW. In either case, the robot traverses the web according to functions described using Tcl extensions developed for the TkWWW browser and the TkWWW robot. Tcl language extensions allow novice users to direct a search based on key words and HTML data types or to invoke predefined queries. Expert users can add their own Tcl extensions for use in more sophisticated searches. The system can be quickly adapted to support new resource types that may become available on the WWW, such as live audio/video connections for interaction with experts.
When TkWWW robots are dispatched from the TkWWW browser, their results can be used to guide further browsing. Robots can also be run in the background to build HTML indexes, compile WWW statistics, collect portfolios of pictures, or perform any other function that can be described by the TkWWW Tcl extensions. Searches can be restricted according to URL names, limits on search path lengths, etc. The TkWWW robot can interact with its master by popping up windows to request input according to a Tcl query description. The main advantage that the TkWWW robot has over existing web spiders and robots is its flexibility: it can adapt to virtually any criteria to guide its search path and to control the selection of data for retrieval.
The paper presents the architecture and implementation of the TkWWW robot. It describes Tcl extensions implemented to support TkWWW robot queries. The TkWWW Robot is being developed over the summer at the Air Force Rome Laboratory with funding from the Air Force Office of Scientific Research.
The global distributed WWW database is growing dramatically, as is the Internet itself. If the WWW follows a growth pattern similar to the Internet, we can expect an explosion of information and commercial applications in the near future. The quantity of data available has already prompted the development of techniques for building indexes to help users direct their search. Many of the index-building applications require exhaustive search of the net, taking significant time.
Most access to WWW resources today is through interactive browsers. In these systems, a WWW query is limited to the use of a single pointer to reference a page. The page is retrieved and the user determines whether further references can provide additional information. WWW public databases are hierarchically structured. The only dynamically created state information associated with a search is a stack of branch points that can be used to move forward and backward through a history of pages to effectively search a hierarchically structured information server.
WWW database structures and the tools used for their access must be enhanced in order to provide effective access to an information database that is growing rapidly. The terms robot and spider (http://web.nexor.co.uk/mak/doc/robots/robots.html) are used in reference to automated browsing tools for access to publicly accessible databases on the internet. A simple robot is described in this paper that is flexible enough to implement many query processing techniques. The robot is analogous to the access methods of a classical database, but it provides a much higher level interface which simplifies the implementation of new query processing techniques in a heterogeneous WWW environment. The paper also discusses the changes in database structure that may be expected as the current structuring techniques become overloaded by the increasing size of the database.
The first section of the paper discusses the issues that were considered in deciding to build a robot. The second section describes the rationale for basing the robot on Tcl/Tk and describes its implementation. The third section examines issues related to whether servers should provide additional functionality. The fourth section discusses research directions for further experimentation with the TkWWW Robot. The paper concludes with some general comments from the experience gained in implementing a robot.
Locality of Reference
Robots exploit the locality of reference inherent in the natural organization of WWW data. Home pages reflect the main interests of their maintainers. While there is nothing to prevent random links and disorganized information hierarchies, links are most often used to present sub-topics in accordance with the underlying storage structure. The notion of locality of reference suggests that a logical neighborhood of a node has a good chance of referencing the major sites that store related data. Pages found in browsing the WWW are often linked to pages presenting related topics. Links can also be used to revisit pages that may be changing dynamically.
Improved storage organization would help to reduce the search radius needed to find most of the related data. Although a dynamic robot search does not guarantee that all desired data is retrieved, there is no way that any query processor can make that guarantee on today's WWW. The increasingly dynamic nature of WWW servers, storing information like live video (http://tns-www.lcs.mit.edu/vs/demos.html) and weather information (xmosaic http://rs560.cl.msu.edu/weather/) affects the ability of indexes to be properly maintained. Autonomous operation of servers and reliability considerations make it impossible to deliver all possible responses to a query.
Resource Requirements
There is some concern that automated browsing will either be too slow to be effective or will overload existing Internet communication channels. These are the same concerns that characterize the development of any advanced research system. Software must be developed on today's systems if it is to be ready in time to serve the next generation of computers. The rate of development of Internet resources and the drop in cost of computer processors, memory, and disk all support the assertion that robots will be able to operate at reasonable levels of resource consumption within a short time.
The main questions for resource consumption are how far it is worth searching for a result and whether the result is worth the expenditure of resources needed for the search. In a single machine environment, a database scheduler can determine priorities for processing to provide for balanced and efficient operation. Distributed database systems cooperate to provide similar scheduling functionality. The growth of the WWW has placed the same type of unpredictable resource requirements on computer server implementations that have characterized the Internet communications processors for years.
Client/Server Architecture: Parallelism
The current client/server architecture implemented in the WWW is based on full autonomy of the server. Client applications, like Mosaic, TkWWW, and Gopher, request pages from the server, which are then processed locally. Because of the synchronous nature of the read function, this architecture limits the degree of parallel processing that can occur: a client may have just a single request outstanding, even if it is waiting on a very slow server.
There are two approaches to developing increased parallelism. In the first, servers do local processing to minimize the amount of data that is transmitted. This approach requires that a filter be sent to the server. A natural extension would be for a more complex query to be sent to the server and for additional steps in the processing to take place there. A second approach can improve parallelism while having less impact on autonomy. Several machines in a local network can cooperate to share the I/O burden associated with a query. This approach suffers from the need to share a common network connection, limiting the degree of parallelism that can be accomplished.
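As an illustration of the second approach, sub-searches could be farmed out to cooperating machines and run in the background. The lines below are a sketch only, not part of TkWWW: "tkrobot" is a hypothetical command-line wrapper around the robot, and the host names and URLs are placeholders.

# Sketch: distribute sub-searches over cooperating hosts with rsh.
# "tkrobot" is a hypothetical command-line robot; hosts and URLs are placeholders.
set hosts {hostA hostB hostC}
set start_urls {page1.html page2.html page3.html}
set i 0
foreach url $start_urls {
    set host [lindex $hosts [expr $i % [llength $hosts]]]
    # Run each sub-search in the background and collect its links in a file.
    exec rsh $host tkrobot -url $url -hops 2 > links.$i &
    incr i
}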
The prototype TkWWW Robot is the first step in developing a general-purpose robot that can be programmed to provide a query processing interface to users. Users will, in general, be interested in finding particular types of HTML objects (similar to projection in relational databases), finding HTML objects that satisfy specific search criteria (relational database selection), and looking for logical relationships between HTML objects scattered over the WWW database (relational database join). The initial TkWWW Robot implementation projects all links found within a set of WWW pages.
Figure 1 shows the TkWWW Robot browsing interface after a search for links within a radius of 2 hops is selected from the robot pull-down menu. Notice that the links on the robot home page (shown in the background) are labeled to show the links that can be expected to be found on the corresponding page. The pop-up dialog box contains all links found on the pages referenced by the home page in addition to all links found on the home page. Any of these links can be selected directly for further browsing.
Figure 1 - TkWWW Robot Browsing Interface
Design and Implementation
The TkWWW Robot is an extension of the TkWWW browser, which is written in Tcl/Tk (URL ref). The TkWWW browser implements commands in the C language that augment the base Tcl/Tk command language and provide a convenient interface for accessing WWW resources. The TkWWW Robot uses TkWWW commands along with additional commands that are specifically designed to support the implementation of autonomous agent data retrieval. The diagram below shows the layering of the TkWWW Robot implementation. Commands at all levels of the system are used in developing query processing algorithms and automated browsing tools.
Figure 2 - TkWWW Robot System Structure
In order to provide maximum flexibility in implementing additional robot functionality, the commands added for the TkWWW Robot should be as simple as possible. The interface implemented in the prototype robot operates on a URL (WWW page address) and returns all links stored on the page. The links are returned as a character stream, which is easily processed using the powerful string-handling commands of Tcl. Figure 3 shows the call to the HtRobot function, which returns a list of pairs of page addresses and link identifiers.

set RobotList [HtRobot $page_value]

Figure 3 - The HtRobot Function Interface
Higher-level Tcl functions loop through the address/identifier pairs and produce a display of the identifiers, with appropriate callbacks defined to invoke WWW page access commands when one of the identifiers is selected. The tkW3RobotDialog function, implemented in the prototype TkWWW Robot, is designed to provide access to all links within a radius of 1 hop from the current page (another, similar function implements the 2-hop access shown in Figure 1 above). The implementation, shown in Figure 4, is simplified by the use of the tkW3NavigateRobot function, which uses the HtRobot interface described above to access the links on the specified page. The list is then processed to display the links and to set up appropriate callbacks for their access. The simple loop at the end of the procedure builds the pop-up menu shown in Figure 1 above.

proc tkW3RobotDialog {} {
    global tkW3Robot
    set parent .
    set w .robot

    # tkW3NavigateRobot retrieves the list of links, using HtRobot
    tkW3NavigateRobot

    # Set up the pop-up dialog window (some buttons not shown)
    DLG:toplevel $parent $w
    tkW3OutputMakeButtons $w.button_frame {
        {current "Add current" "tkW3RobotAdd"}
        ...
        {edit "Edit Title" "tkW3RobotEditTitle"}
    }

    # Draw box and set up bindings
    pack append $w $w.button_frame top
    DLG:draw_listbox $w {}
    DLG:draw_buttons $w [list "Dismiss" "Help"]
    DLG:bind_button $w 1 "DLG:hide $w"
    DLG:bind_button $w 2 "tkW3HelpNoHelp"

    # Loop through the list of page address/description pairs
    # and add the descriptions to the pop-up dialog box
    set Rcnt 0
    set Rlen [llength $tkW3Robot(list)]
    while { $Rcnt < $Rlen } {
        $w.list insert end [lindex $tkW3Robot(list) [expr $Rcnt + 1]]
        incr Rcnt 2
    }

    # Display the pop-up box
    DLG:show $parent $w
}

Figure 4 - Tcl/Tk/TkWWW Processing for a List of Links
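The same HtRobot primitive can be composed into deeper searches. The procedure below is not part of the TkWWW sources; it is a minimal sketch of a bounded-radius traversal, assuming that HtRobot returns a flat list alternating page addresses and link titles, as the loop in Figure 4 suggests.

# RobotCollect is a hypothetical helper: gather the addresses of all links
# reachable within $hops hops of $url, assuming HtRobot returns a flat
# {addr1 title1 addr2 title2 ...} list.
proc RobotCollect {url hops} {
    set visited($url) 1
    set frontier [list $url]
    set found {}
    for {set h 0} {$h < $hops} {incr h} {
        set next {}
        foreach page $frontier {
            if {[catch {HtRobot $page} pairs]} {
                continue            ;# skip pages that cannot be retrieved
            }
            for {set i 0} {$i < [llength $pairs]} {incr i 2} {
                set addr [lindex $pairs $i]
                if {![info exists visited($addr)]} {
                    set visited($addr) 1
                    lappend found $addr
                    lappend next $addr
                }
            }
        }
        set frontier $next
    }
    return $found
}

Restrictions on URL names or on search path lengths, as mentioned in the introduction, would be added as extra tests inside the inner loop.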
Implementation Problems
The current structure of the system made it somewhat difficult
to isolate the links that were found on each HTML page. The functions for
parsing
and generation of the bitmap display are integrated rather than being
performed separately. The current implementation identifies links during the
parsing process and
copies them into a global list structure for use in the HtRobot
interface function. The call to the parser is implemented
in the same way as a call to display a page. Page display functions
accept the URL address of a page as a parameter but default to the
current page when none is given.
Eventually, WWW will support distributed query processing with
parallel execution of server functions to support queries generated by
client programs that provide the parameters for a dynamic search of the
WWW. Today's concern over limited processing resources and bandwidth
will be reduced as the resources become available in the Internet, or
its successor, to support the ideas and software being developed in
research laboratories today. The addition of local processing resources to provide adequate query processing support will be driven by the need for companies to compete commercially. Providing convenient
query processing
support will be comparable to providing a facility where people can shop.
In addition to ordering products from on-line hypertext catalogs,
the WWW provides a market for research ideas where the acceptance of ideas
may depend on how easily they can be located.
Database Structure
If we assume that the WWW is moving in a commercial direction,
then the perspective of the WWW database designer should be oriented
toward a structure that is general enough to provide the
information that would be expected, in a standard format, rather than
supporting a totally unstructured information system. The hypertext page
is analogous to the tuple in relational database systems. Pages could be
linked together to form an IMS-like database system. A more appropriate model
for WWW databases is an object-oriented database structure. The WWW access
methods (page retrieval) are already designed to access pages of varying
formats. Additional structure is required to support identification of attributes from base classes, which define basic structures that can be reused in the definitions of sub-classes. Such a definition could be used for accessing a personal URL-type page: each page would provide a core of information (in this case the name, company, and email address) associated with the individual.
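As an illustration only (the format below is hypothetical, not an existing WWW or TkWWW convention), a personal page could carry its core attributes in a fixed layout that a robot can rely on, with free-form content following the core.

# Sketch: emit the hypothetical "core" of a personal URL-type page so that
# the name, company, and email address always appear in a predictable form.
proc MakePersonalPage {name company email} {
    return "<HTML><HEAD><TITLE>$name</TITLE></HEAD>
<BODY>
<H1>$name</H1>
<ADDRESS>$company</ADDRESS>
<A HREF=\"mailto:$email\">$email</A>
<!-- unstructured, page-specific content may follow the standard core -->
</BODY></HTML>"
}
# Example: MakePersonalPage "Scott Spetka" "SUNY Institute of Technology" "scott@sunyit.edu"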
Access Methods
Data that is made available on the WWW will be more efficiently
retrieved through robots which implement parallel exploration of access
paths to the data required. A search for a product will be successful
if data is readily accessible (see above) and if an appropriate search
engine or robot is in use. In the case of the TkWWW Robot prototype,
described above,
the support for distributed processing would retrieve lists of <address, title>
pairs for the links known to the server. Either all links or only those at
a particular hop count from the entry point would be retrieved. The server
would, with a minor increase in local processing, significantly
reduce network traffic.
The TkWWW browser operates on the HTML
retrieved from remote servers and performs all
operations locally. It would seem reasonable that
simple search functions be added to servers that
would allow search operations to use remote processing
resources in order to reduce the impact on the
network at a server. A server-based search capability
would allow a robot or browser to obtain all links stored at a
site without retrieving all of its pages and then throwing away
all data except the links.
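A minimal sketch of how such a facility might be used from the robot side, assuming a hypothetical HtServerLinks command that asks the httpd for its link list and signals an error when the server does not support the operation:

# HtServerLinks is hypothetical: ask the server for the links it stores,
# limited to $hops hops from $url, in the same flat {addr title ...} form
# that HtRobot builds on the client side.
proc GetLinks {url hops} {
    if {[catch {HtServerLinks $url $hops} links]} {
        # Server-side search not available: fall back to retrieving the
        # page and extracting its links locally.
        set links [HtRobot $url]
    }
    return $links
}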
Indexing
Indexing techniques are commonly used in database systems that
are highly structured. Many static indexes have been developed for access
to WWW resources. They are mainly produced through exhaustive search
techniques. The indexes are not updated upon changes to the objects
they point to, as they would be in a standard
database system. This is due in part to an effort to maintain the autonomy
of servers. Robots, like the W4 Worm, cannot insist that the objects they put into their indices report back to them when their status changes. Increasing the frequency of exhaustive search in order to avoid a few missed references would place a large burden on the network. Good indexes exist on
the WWW for convenient search of many resources, by name, subject, etc.
The only mechanism that supports user controlled static index
construction for today's
browsers requires manually inserting pointers to pages into a bookmark
list. In most cases, bookmarks are unrelated and are used as starting points
for independent searches.
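A user-controlled index built by the robot itself could start from exactly such a bookmark list. The procedure below is a sketch, not a released TkWWW feature; it assumes the flat address/title list format used in Figure 4 and writes a one-level HTML index of everything found one hop from each bookmark.

# Sketch: build a one-level HTML index from a list of bookmark URLs.
proc BuildIndex {bookmarks outfile} {
    set f [open $outfile w]
    puts $f "<HTML><HEAD><TITLE>Robot Index</TITLE></HEAD><BODY><UL>"
    foreach url $bookmarks {
        set pairs [HtRobot $url]
        set i 0
        while {$i < [llength $pairs]} {
            set addr  [lindex $pairs $i]
            set title [lindex $pairs [expr $i + 1]]
            puts $f "<LI><A HREF=\"$addr\">$title</A>"
            incr i 2
        }
    }
    puts $f "</UL></BODY></HTML>"
    close $f
}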
Future Robots: Using Robots on Tomorrow's WWW
Eventually, the WWW must move away from a model where all decisions regarding WWW navigation are made by the local browser/client/robot on the basis of remote information retrieval. Current robots have no choice but to retrieve links from remote sites as a basis for their navigation. The TkWWW Robot, with its flexible implementation, is poised to change as the WWW moves toward increased intelligence in database servers. This section describes some of the changes that will force WWW designers to address the issues involved in increasing the functionality of WWW servers and clients.
Real-Time Video Communication
Much of what has driven the evolution of the WWW involves
integration of multimedia databases through a high-level
object-oriented interface. Many systems of the future will have real-time
constraints and will include videotelecommunications as an integral component.
The TkWWW robot is poised to meet the challenges of these new environments.
In addition to having real-time constraints, future servers
will be characterized by increasingly dynamic behavior. Servers may offer
buttons to initiate dynamic videoconferences based on the instantaneous
availability of an appropriate research expert or salesman, depending on
the organization offering the connection.
Searching for sources of live video provides the WWW analog of
a channel scanner where each WWW page may provide a source of live video.
Robot starting points, similar to the "bookmark" capability in today's
browsers, give a robot a set of initial pages from which to begin a search;
this functionality is already implemented in the TkWWW Robot. A robot may
find all experts in an area that are prepared for a videoconference.
In a similar way, real-time video may be used for talking to sales and
technical representatives. Information services, analogous to touchtone
hierarchy traversal for telephone-based systems, already exist, but without
the ability to establish a direct videoconference connection.
A typical TkWWW Robot query of the future (with a natural language interface): "Go find someone to talk to and negotiate a price of less than $X for the conversation."
Individual Databases and Servers
Robots are useful to support the flexibility of
personal data representation and communication, helping people
communicate by integrating network, video, and multimedia databases into
modern information systems. They can perform automated answering, and can
filter and customize messages received through a communication system
depending on the authorization of a caller. A robot can actively
pursue information on behalf of its master by finding servers that
have answers to questions and retrieving the appropriate data.
Advanced Functionality
The use of a Tcl-based robot offers the potential for client-agent robots
to communicate and cooperate with distributed server-side robots.
EXPECT extensions to the basic Tcl command set allow discrimination
of requests and can be used to enhance both the TkWWW browser and the
TkHttpd. Security would not be a problem since the browser would control
interaction with the server. A server (httpd) robot may query the
client database to determine areas where a search could provide
additional information and avoid unnecessary overhead needed to
return data that is already available. This suggests that data
should be structured as HTML in a hierarchic database that is
understood by the httpd.
Future Research
The TkWWW prototype demonstrated the feasibility of implementing a robot. Extensions will be needed to support the additional capabilities described in the paper. In addition to handling the basic link objects, the robot will be upgraded to retrieve other types of HTML objects. New TkWWW Robot commands to search for objects independent of type, perhaps based on size or some other attribute, would also be interesting. A command that searches for a logical combination of given parameters could find pages that satisfy expressions like (mpeg or jpeg or gif).
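Until such commands exist, this kind of selection can be approximated in Tcl over the link list that HtRobot already returns. The procedure below is a sketch that matches object types by file-name extension only; a real command would use the object's declared type.

# Sketch: does the page at $url reference an object whose name matches any
# of the given extensions, e.g. {mpeg jpeg gif}?
proc PageHasType {url extensions} {
    set pairs [HtRobot $url]
    set i 0
    while {$i < [llength $pairs]} {
        set addr [lindex $pairs $i]
        foreach ext $extensions {
            if {[string match "*.$ext" $addr]} {
                return 1
            }
        }
        incr i 2
    }
    return 0
}
# Example: PageHasType $url {mpeg jpeg gif}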
More complex commands may search pages based on the relationship of objects on a page. A low-level TkWWW Robot command to return the next object on a page would allow queries like "get the addresses of all pages that have mpegs immediately followed by audios", i.e., queries where the ordering of objects is important. In general, the Tcl functions will determine the characteristics that are sought in the objects. For example, "find pages that have an mpeg followed by at least 3 audio files where the sizes of the audio files are increasing" would indicate how many WWW database managers sort their files.
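A sketch of how such an ordering test might look once an object-enumeration command exists. HtObjects below is hypothetical; it is assumed to return the types of a page's embedded objects in document order.

# HtObjects is hypothetical: assume it returns a page's object types in
# document order, e.g. {text mpeg audio audio gif}.
proc MpegThenAudio {url} {
    set types [HtObjects $url]
    for {set i 0} {$i < [llength $types] - 1} {incr i} {
        if {[lindex $types $i] == "mpeg" &&
            [lindex $types [expr $i + 1]] == "audio"} {
            return 1
        }
    }
    return 0
}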
There are several other research issues related to the
operation of WWW servers that should be addressed
to improve the general efficiency of Internet robots.
Other issues involve commercializing the WWW.
Robots must be able to avoid servers that would charge for their
access. The current approach to limiting robot access, through the installation of an advisory file that requests that robots refrain from access, could be extended to provide the equivalent of a toll booth where desired by the server operator.
Some of the appropriate structure may depend on the rate structure
that develops as NSFnet is turned over to commercial providers.
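Honoring today's advisory file is already easy to express in Tcl. The sketch below assumes a hypothetical HtGetPage command that retrieves a URL as text; it makes only a crude check for any Disallow line, and a charging server could answer the same request with pricing information instead.

# Sketch: crude robot-exclusion check against a site's /robots.txt file.
# HtGetPage is hypothetical (retrieve a URL and return its text).
proc SiteAllowsRobots {site} {
    if {[catch {HtGetPage "$site/robots.txt"} text]} {
        return 1                 ;# no advisory file: assume access is allowed
    }
    foreach line [split $text "\n"] {
        if {[string match "Disallow:*" $line]} {
            return 0
        }
    }
    return 1
}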
Conclusion
The TkWWW Robot can be programmed to perform most of the functions of other robots that are in common use. The work being done on the robot is focused on providing additional low-level functionality and on developing techniques to implement indexes based on the current content of databases of interest. It seems likely that additional types of data will continue to be integrated into the WWW framework. The TkWWW Robot can adapt quickly to changes in its environment, making it a reasonable target for future development.
The ability to develop advanced robot functionality depends
on whether server functions can be developed to help robots do
their work efficiently. If initial experimentation with robots is
successful, additional server functionality seems inevitable. Servers
that do not support robot access will be overlooked, causing a great
deal of effort invested in their development and maintenance to
be of limited value.
The TkWWW Browser is GNU General Public License software. It is
available through the Internet.
Author Biography
Scott Spetka received his Ph.D. degree in computer science
from UCLA in 1989.
He is currently an Assistant Professor in the Computer Science
Department at the State University of New York Institute of Technology
at Utica/Rome. His research interests are in the areas of distributed
databases, operating systems and networks. During the last year, Scott
has been developing a network of PCs running the Unix operating system.
Before becoming involved in WWW research, Scott was developing SUNY Nodes,
a Network Oriented Data Engineering System. The system is used to
experiment with query processing techniques.
Author Email Address
scott@sunyit.edu