The TkWWW Robot: Beyond Browsing


Scott Spetka


SUNY Institute of Technology at Utica/Rome


Abstract

The TkWWW Robot extends the functionality of the TkWWW browser. It can also be used as a standalone agent that processes queries on the WWW. In either case, the robot traverses the web according to functions described using Tcl extensions developed for the TkWWW browser and the TkWWW robot. Tcl language extensions allow novice users to direct a search based on key words and HTML data types or to invoke predefined queries. Expert users can add their own Tcl extensions for use in more sophisticated searches. The system can be quickly adapted to support new resource types that may become available on the WWW, such as live audio/video connections for interaction with experts.

When TkWWW robots are dispatched from the TkWWW browser, their results can be used to guide further browsing. Robots can also be run in the background to build HTML indexes, compile WWW statistics, collect a portfolio of pictures, or perform any other function that can be described by the TkWWW Tcl extensions. Searches can be restricted according to URL names, limits on search path lengths, etc. The TkWWW robot can interact with its master by popping up windows to request input according to a Tcl query description. The main advantage that the TkWWW robot has over existing web spiders and robots is its flexibility: virtually any criterion can be used to guide its search path and to control the selection of data for retrieval.

The paper presents the architecture and implementation of the TkWWW robot. It describes Tcl extensions implemented to support TkWWW robot queries. The TkWWW Robot is being developed over the summer at the Air Force Rome Laboratory with funding from the Air Force Office of Scientific Research.


Introduction


The global distributed WWW database is growing dramatically, as is the Internet itself. If the WWW follows a growth pattern similar to the Internet's, we can expect an explosion of information and commercial applications in the near future. The quantity of data available has already spurred the development of techniques for building indexes that help users direct their searches. Many of the index-building applications require exhaustive search of the net, taking significant time.

Most access to WWW resources today is through interactive browsers. In these systems, a WWW query is limited to the use of a single pointer to reference a page. The page is retrieved and the user determines whether further references can provide additional information. WWW public databases are hierarchically structured. The only dynamically created state information associated with a search is a stack of branch points that can be used to move forward and backward through a history of pages to effectively search a hierarchically structured information server.

WWW database structures and the tools used for their access must be enhanced in order to provide effective access to an information database that is growing rapidly. The terms robot and spider (http://web.nexor.co.uk/mak/doc/robots/robots.html) are used in reference to automated browsing tools for access to publicly accessible databases on the Internet. This paper describes a simple robot that is flexible enough to implement many query processing techniques. The robot is analogous to the access methods of a classical database, but it provides a much higher level interface, which simplifies the implementation of new query processing techniques in a heterogeneous WWW environment. The paper also discusses the changes in database structure that may be expected as the current structuring techniques become overloaded by the increasing size of the database.

The first section of the paper discusses the issues that were considered in deciding to build a robot. The second section describes the rationale for basing the robot on Tcl/Tk and describes its implementation. The third section examines issues related to whether servers should provide additional functionality. The fourth section discusses research directions for further experimentation with the TkWWW Robot. The paper concludes with some general comments from the experience gained in implementing a robot.

Issues for Consideration

Locality of Reference

Robots exploit the locality of reference inherent in the natural organization of WWW data. Home pages reflect the main interests of their maintainers. While there is nothing to prevent random links and disorganized information hierarchies, links are most often used to present sub-topics in accordance with the underlying storage structure. The notion of locality of reference suggests that a logical neighborhood of a node has a good chance of referencing the major sites that store related data. Pages found in browsing the WWW are often linked to pages presenting related topics. Links can also be used to revisit pages that may be changing dynamically.

Improved storage organization would help to reduce the search radius needed to find most of the related data. Although a dynamic robot search does not guarantee that all desired data is retrieved, there is no way that any query processor can make that guarantee on today's WWW. The increasingly dynamic nature of WWW servers, storing information like live video (http://tns-www.lcs.mit.edu/vs/demos.html) and weather information (http://rs560.cl.msu.edu/weather/), affects the ability of indexes to be properly maintained. Autonomous operation of servers and reliability considerations make it impossible to deliver all possible responses to a query.

Resource Requirements

There is some concern that automated browsing will either be too slow to be effective or will overload existing Internet communication channels. These are the same concerns that characterize the development of any advanced research system. Software must be developed on today's systems if it is to be ready in time to serve the next generation of computers. The rate of development of Internet resources and the drop in cost of computer processors, memory, and disk all support the assertion that robots will be able to operate at reasonable levels of resource consumption within a short time.

The main questions for resource consumption are how far it is worth searching for a result and whether the result is worth the expenditure of resources needed for the search. In a single machine environment, a database scheduler can determine priorities for processing to provide for balanced and efficient operation. Distributed database systems cooperate to provide similar scheduling functionality. The growth of the WWW has placed the same type of unpredictable resource requirements on computer server implementations that have characterized the Internet communications processors for years.

Client/Server Architecture: Parallelism

The current client/server architecture implemented in the WWW is based on full autonomy of the server. Client applications, like Mosaic, TkWWW, and Gopher, request pages from the server which are then processed locally. This architecture minimizes the degree of parallel processing that can occur because of the synchronous nature of the read function. A client may have just a single request outstanding, even if it is waiting on a very slow server.

There are two approaches to developing increased parallelism. In the first, servers do local processing to minimize the amount of data that is transmitted. This approach requires that a filter be sent to the server. A natural extension would be for a more complex query to be sent to the server and for additional steps in the processing to take place there. A second approach can improve parallelism while having less impact on autonomy. Several machines in a local network can cooperate to share the I/O burden associated with a query. This approach suffers from the need to share a common network connection, limiting the degree of parallelism that can be accomplished.
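A crude form of the second approach can already be approximated on the client side. The sketch below launches one background robot per starting URL so that several slow servers can be queried concurrently; it assumes a standalone robot executable, here called tkwww-robot, and a Tcl list of starting points in $start_urls, both of which are illustrative names rather than part of the current system.

    set i 0
    foreach url $start_urls {
        # "&" makes exec return immediately, leaving the robot
        # running in the background; its output is collected later
        exec tkwww-robot $url > /tmp/robot.out.$i &
        incr i
    }

The degree of parallelism achieved this way is still bounded by the shared network connection noted above.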

TkWWW Robot - Using Robots on Today's WWW

The prototype TkWWW Robot is the first step in developing a general purpose robot that can be programmed to provide a query processing interface to users. Users will, in general, be interested in finding particular types of HTML objects (similar to projection in relational databases), finding HTML objects that satisfy specific search criteria (relational database selection), and looking for logical relationships between HTML objects scattered over the WWW database (relational database join). The initial TkWWW Robot implementation projects all links found within a set of WWW pages.

Figure 1 shows the TkWWW Robot browsing interface after a search for links within a radius of 2 hops is selected from the robot pull-down menu. Notice that the links on the robot home page (shown in the background) are labeled to show the links that can be expected to be found on the corresponding page. The pop-up dialog box contains all links found on the pages referenced by the home page, in addition to all links found on the home page. Any of these links can be selected directly for further browsing.

Figure 1 - TkWWW Robot Browsing Interface

Design and Implementation

The TkWWW Robot is an extension of the TkWWW browser, which is written in Tcl/Tk (URL ref). The TkWWW browser implements commands in the C language that augment the base Tcl/Tk command language, providing a convenient interface for access to WWW resources. The TkWWW Robot uses TkWWW commands along with additional commands that are specifically designed to support the implementation of autonomous agent data retrieval. The diagram below shows the layering of the TkWWW Robot implementation. Commands at all levels of the system are used in developing query processing algorithms and automated browsing tools.

Figure 2 - TkWWW Robot System Structure

In order to provide maximum flexibility in implementing additional robot functionality, the commands added for the TkWWW Robot should be as simple as possible. The interface implemented in the prototype robot operates on a URL (WWW page address) and returns all links stored on the page. The links are returned as a character stream which is easily processed using the powerful string handling commands of Tcl. Figure 3 shows the call to the HtRobot function, which returns a list of address/identifier pairs to the higher-level Tcl/Tk functions; HtRobot itself is written in the C language. The $page_value parameter specifies the address of the page to search for links.

set RobotList [HtRobot $page_value]

Figure 3 - The HtRobot Function Interface

Higher level Tcl functions loop through the address/identifier pairs and produce a display of the identifiers, with appropriate callbacks defined to invoke WWW page access commands when one of the identifiers is selected. The tkW3RobotDialog function, implemented in the prototype TkWWW Robot, is designed to provide access to all links within a radius of 1 hop from the current page (another similar function implements the 2 hop access shown in figure 1 above). The implementation, shown in figure 4, is simplified by the use of the tkW3NavigateRobot function, which uses the HtRobot interface described above to access the links on the specified page. The list is then processed to display the links and set up appropriate callbacks for their access. This simple loop builds the pop-up menu shown in figure 1 above.

proc tkW3RobotDialog {} {
    global tkW3Robot
    set parent .
    set w .robot
    # tkW3NavigateRobot returns a list of links, using HtRobot
    tkW3NavigateRobot
    # Set up the pop-up dialog window (some buttons not shown)
    DLG:toplevel $parent $w
    tkW3OutputMakeButtons $w.button_frame {
        {current "Add current" "tkW3RobotAdd"}
        ...
        {edit "Edit Title" "tkW3RobotEditTitle"}
    }
    # Draw box and set up bindings
    pack append $w $w.button_frame top
    DLG:draw_listbox $w {}
    DLG:draw_buttons $w [list "Dismiss" "Help"]
    DLG:bind_button $w 1 "DLG:hide $w"
    DLG:bind_button $w 2 "tkW3HelpNoHelp"

    # Loop through list of page address/description pairs
    # and add them to the pop-up dialog box
    set Rcnt 0
    set Rlen [llength $tkW3Robot(list)]
    while { $Rcnt < $Rlen } {
        $w.list insert end \
            [lindex $tkW3Robot(list) [expr $Rcnt + 1]]
        incr Rcnt 2
    }
    # Display the pop-up box
    DLG:show $parent $w
}

Figure 4 - Tcl/Tk/TkWWW Processing for a List of Links
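The 1 hop and 2 hop dialogs are special cases of a more general traversal. The sketch below generalizes them to a radius of N hops using the same HtRobot primitive; the procedure name, the visited array used to avoid revisiting pages, and the flattening of results into a single list are all illustrative choices, not part of the prototype.

proc RobotRadius {url hops} {
    global visited
    if {$hops <= 0 || [info exists visited($url)]} {
        return {}
    }
    set visited($url) 1
    # HtRobot returns a flat list of address/identifier pairs
    set pairs [HtRobot $url]
    set result $pairs
    set i 0
    set len [llength $pairs]
    while { $i < $len } {
        # Element $i is an address; recurse with one less hop
        set result [concat $result \
            [RobotRadius [lindex $pairs $i] [expr $hops - 1]]]
        incr i 2
    }
    return $result
}

A 2 hop search like the one in figure 1 would then amount to a call such as "set RobotList [RobotRadius $page_value 2]".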

Implementation Problems

The current structure of the system made it somewhat difficult to isolate the links that were found on each HTML page. The functions for parsing and generation of the bitmap display are integrated rather than being performed separately. The current implementation identifies links during the parsing process and copies them into a global list structure for use in the HtRobot interface function. The call to the parser is implemented in the same way as a call to display a page. Page display functions accept the URL address of a page as a parameter but default to the current page when none is given.

The TkWWW Robot: a Distributed Database Query Processor

Eventually, the WWW will support distributed query processing with parallel execution of server functions to support queries generated by client programs that provide the parameters for a dynamic search of the WWW. Today's concern over limited processing resources and bandwidth will be reduced as the resources become available in the Internet, or its successor, to support the ideas and software being developed in research laboratories today. Adding local processing resources to provide adequate query processing support will be driven by the need for companies to compete commercially. Providing convenient query processing support will be comparable to providing a facility where people can shop. In addition to ordering products from on-line hypertext catalogs, the WWW provides a market for research ideas, where the acceptance of ideas may depend on how easily they can be located.

Database Structure

If we assume that the WWW is moving in a commercial direction, then the perspective of the WWW database designer should be oriented toward a structure that is general enough to provide the information that would be expected, in a standard format, rather than supporting a totally unstructured information system. The hypertext page is analogous to the tuple in relational database systems. Pages could be linked together to form an IMS-like database system. A more appropriate model for WWW databases is an object-oriented database structure. The WWW access methods (page retrieval) are already designed to access pages of varying formats. Additional structure is required to support identification of attributes from base classes, which define the basic structures that sub-classes use in their definitions. Such a definition could be used for accessing a personal URL type page. Each page should provide a core of information (in this case the name, company, and email address) associated with the individual.
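For example, a "personal page" class might require a small core of attributes while leaving the rest of the page free-form. A hypothetical instance is shown below; the list-based layout is only one possible encoding of the core, since HTML today has no convention for marking required attributes.

<title>Personal Page: Scott Spetka</title>
<h1>Scott Spetka</h1>
<ul>
<li> Name: Scott Spetka
<li> Company: SUNY Institute of Technology at Utica/Rome
<li> Email: scott@sunyit.edu
</ul>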

Access Methods

Data that is made available on the WWW will be more efficiently retrieved through robots which implement parallel exploration of access paths to the data required. A search for a product will be successful if data is readily accessible (see above) and if an appropriate search engine or robot is in use. In the case of the TkWWW Robot prototype, described above, the support for distributed processing would retrieve lists of <address, title> pairs for the links known to the server. Either all links or only those at a particular hop count from the entry point would be retrieved. The server would, with a minor increase in local processing, significantly reduce network traffic.

The TkWWW browser operates on the HTML retrieved from remote servers and performs all operations locally. It would seem reasonable that simple search functions be added to servers that would allow search operations to use remote processing resources in order to reduce the impact on the network at a server. A server-based search capability would allow a robot or browser to return all links stored at a site, rather than retrieving all pages, then throwing away all data except the links.
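As a rough sketch of what such a server-side operation might look like, the hypothetical procedure below extracts link targets from a page's HTML text at the server, so that only the links, rather than entire pages, cross the network. The regexp-based parse is a deliberate simplification; a real server would reuse its existing HTML parser.

proc ServerLinks {html} {
    set links {}
    # Peel off the text up to and including each anchor,
    # collecting the HREF target of every link (simplified parse)
    while {[regexp -nocase {<a[^>]*href="([^"]*)"(.*)} $html whole target rest]} {
        lappend links $target
        set html $rest
    }
    return $links
}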

Indexing

Indexing techniques are commonly used in database systems that are highly structured. Many static indexes have been developed for access to WWW resources. They are mainly produced through exhaustive search techniques. The indexes are not updated upon changes to the objects they point to, as they would be in a standard database system. This is due in part to an effort to maintain the autonomy of servers. Robots, like the W4 Worm, cannot insist that the objects that they put into their indices report back to them upon changes in their status. Increasing the frequency of exhaustive search would place a large burden on the network in order to avoid a few missed references. Good indexes exist on the WWW for convenient search of many resources, by name, subject, etc.

The only mechanism that supports user controlled static index construction for today's browsers requires manually inserting pointers to pages into a bookmark list. In most cases, bookmarks are unrelated and are used as starting points for independent searches.
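The robot output makes it straightforward to go one step further and generate a static HTML index automatically. A minimal sketch, assuming the flat address/identifier list returned through HtRobot and a file name supplied by the user (the procedure name is illustrative):

proc RobotWriteIndex {pairs filename} {
    set f [open $filename w]
    puts $f "<title>Robot Index</title>"
    puts $f "<ul>"
    set i 0
    set len [llength $pairs]
    while { $i < $len } {
        set address [lindex $pairs $i]
        set title [lindex $pairs [expr $i + 1]]
        # One anchor per link found by the robot
        puts $f "<li> <a href=\"$address\">$title</a>"
        incr i 2
    }
    puts $f "</ul>"
    close $f
}

An index built this way reflects only the state of the database at the time of the search, so it shares the staleness problem of the static indexes described above.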

Future Robots: Using Robots on Tomorrow's WWW

Eventually, the WWW must move away from a model where all decisions regarding WWW navigation are made by the local browser/client/robot, based on remote information retrieval. Current robots have no choice but to retrieve links from remote sites as a basis for their navigation. The TkWWW Robot, with its flexible implementation, is poised to change as the WWW moves in the direction of increased intelligence in database servers. This section describes some of the changes that will force WWW designers to address the issues involved in increasing the functionality in WWW servers and clients.

Real-Time Video Communication

Much of what has driven the evolution of the WWW involves integration of multimedia databases through a high-level object-oriented interface. Many systems of the future will have real-time constraints and will include videotelecommunications as an integral component. The TkWWW robot is poised to meet the challenges of these new environments. In addition to having real-time constraint requirements, future servers will be characterized by increasingly dynamic behavior. Servers may offer buttons to initiate dynamic videoconferences based on the instantaneous availability of an appropriate research expert or salesman, depending on the organization offering the connection.

Searching for sources of live video provides the WWW analog of a channel scanner, where each WWW page may provide a source of live video. Robot starting points, similar to the "bookmark" capability in today's browsers, give a robot a set of initial pages for a search; this functionality is already implemented in the TkWWW Robot. A robot may find all experts in an area who are prepared for a videoconference. In a similar way, real-time video may be used for talking to sales and technical representatives. Information services analogous to touchtone hierarchy traversal for telephone-based systems already exist, though without the ability to establish a direct videoconference connection.

A typical TkWWW Robot query of the future (with a natural language interface):

	"Go find someone to talk to and negotiate
	a price of less than $X for the conversation"

Individual Databases and Servers

Robots are useful to support the flexibility of personal data representation and communication, helping people communicate by integrating network, video, and multimedia databases into modern information systems. They can automatically answer, filter, and customize messages received through a communication system, depending on the authorization of a caller. A robot can actively pursue information on behalf of its master by finding servers that have answers to questions and retrieving the appropriate data.

Advanced Functionality

The use of a Tcl-based robot offers the potential for communication and cooperation between client-agent robots and robots distributed at servers. EXPECT extensions to the basic Tcl command set allow discrimination of requests and can be used to enhance both the TkWWW browser and the TkHttpd server. Security would not be a problem, since the browser would control interaction with the server. A server (httpd) robot may query the client database to determine areas where a search could provide additional information, avoiding the unnecessary overhead of returning data that is already available. This suggests that data should be structured as HTML in a hierarchic database that is understood by the httpd.

Future Research

The TkWWW prototype demonstrated the feasibility of implementing a robot. Extensions will be needed to develop support for the additional capabilities described in the paper. In addition to handling the basic link objects, the robot will be upgraded to retrieve other types of HTML objects. New TkWWW Robot commands to search for objects independent of type, perhaps based on size or some other attribute, would also be interesting. A command that searches for a logical combination of given parameters could find pages that satisfy expressions like (mpeg or jpeg or gif).

More complex commands may search pages based on the relationship of objects on a page. A low-level TkWWW Robot command to return the next object on a page would allow queries like "get the addresses of all pages that have mpegs immediately followed by audios", i.e., queries where the ordering of objects is important. In general, Tcl functions will determine the characteristics that are sought in the objects. For example, "find pages that have an mpeg followed by at least 3 audio files where the sizes of the audio files are increasing" would indicate how many WWW database managers sort their files. A sketch of how the simpler type-based selection might be written appears below.
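The sketch assumes a hypothetical HtObjects command that returns a flat list of type/address pairs for the objects on a page; no such command exists in the prototype, which returns only links.

proc PageMatchesTypes {url wanted} {
    # wanted is a list of acceptable object types, e.g. {mpeg jpeg gif}
    set objects [HtObjects $url]    ;# hypothetical command
    set i 0
    set len [llength $objects]
    while { $i < $len } {
        # Element $i is the object type; $i+1 is its address
        if {[lsearch $wanted [lindex $objects $i]] >= 0} {
            return 1
        }
        incr i 2
    }
    return 0
}

A query like (mpeg or jpeg or gif) then reduces to calling PageMatchesTypes with that type list on each page the robot visits.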

There are several other research issues related to the operation of WWW servers that should be addressed to improve the general efficiency of Internet robots. Other issues involve commercializing the WWW. Robots must be able to avoid servers that would charge for their access. The current approach to limiting robot access, through the installation of an advisory file which requests that robots refrain from access, can be extended to provide the equivalent of a toll booth where desired by the server operator. Some of the appropriate structure may depend on the rate structure that develops as NSFnet is turned over to commercial providers.

Conclusion

The TkWWW robot can be programmed to perform most of the functions of other robots that are in common use. The work being done on the robot is focused on providing additional low level functionality and developing techniques to implement indexes based on the current content of databases of interest. It seems likely that additional types of data will continue to be integrated into the WWW framework. The TkWWW Robot can adapt quickly to changes in the environment, making it a reasonable target for future development.

The ability to develop advanced robot functionality depends on whether server functions can be developed to help robots do their work efficiently. If initial experimentation with robots is successful, additional server functionality seems inevitable. Servers that do not support robot access will be overlooked, causing a great deal of effort invested in their development and maintenance to be of limited value.

Acknowledgment

The TkWWW Browser is GNU General Public License software. It is available through the Internet.

Author Biography

Scott Spetka received his Ph.D. degree in computer science from UCLA in 1989. He is currently an Assistant Professor in the Computer Science Department at the State University of New York Institute of Technology at Utica/Rome. His research interests are in the areas of distributed databases, operating systems and networks. During the last year, Scott has been developing a network of PCs running the Unix operating system. Before becoming involved in WWW research, Scott was developing SUNY Nodes, a Network Oriented Data Engineering System. The system is used to experiment with query processing techniques.

Author Email Address

scott@sunyit.edu