AstroWeb Tools

Robert E. Jackson
Staff Scientist
Computer Sciences Corporation

Abstract

AstroWeb is a WWW interface to a collection of Internet-accessible resources aimed at the astronomical community. It consolidates the separate resource listings previously maintained by the consortium members in the United States, France, Germany, and Australia.

A set of tools using Tcl has been developed to assist in the maintenance and distribution of the collection by the widely distributed consortium members. These tools allow the members to coordinate their resource discovery efforts while providing them with complete control over their local presentations.

AstroWeb Origins

AstroWeb began with the realization that several people had independently created large collections of Internet-accessible resources aimed at the astronomical community. It seemed advantageous to combine their efforts so that each listing covered the same set of resources. On January 24, 1994, the first e-mail was sent, and the first three members, Bob Jackson (CSC/STScI), Don Wells (NRAO), and Hans-Martin Adorf (ESO/ST-ECF), agreed to coordinate their efforts. After the members agreed on a common interface format, HTML DL/DT/DD, Wells created Awk tools to merge the three lists. By the end of February, two more creators of large resource listings, Andre Heck (CDS) and Anton Koekemoer (MSSSO), were invited to join the consortium, and their listings were merged into the combined listing. AstroWeb went public on April 6, 1994.

The individual listings were all based on the same dataset, but each person chose somewhat different presentation formats and structures. At NRAO, the resources were listed by Category, while at STScI they were listed by protocol, i.e., http, gopher, etc. Both these sites included the resource Description in their presentations, while at ESO/ST-ECF the Description was omitted.

Tool Origin

To make changes to the merged resource database, anyone could use HTML forms to enter information, which was saved to files at NRAO. This input was inspected by Don Wells, who then used Awk-based tools to update the master database. However, Don had other responsibilities, and change requests would sometimes pile up. As a result, local versions of the listing were updated before the master listing was. This was clearly a serious problem.

The need for resource `validation' was identified very early. In the volatile world of the Internet, URLs can vanish or change as easily as they are created. Thus one of the first AstroWeb tools was one which went through the resource listing and checked whether each URL still worked.

It was obvious from the start that the merged resource listing would have to be put into a WAIS database. This would provide users with the ability to find resources independent of the local formatting and structuring. Since the database was at one site and the search tool was at another site, some tools were needed to bridge the gap.

The unified central resource listing combined with the different formatting needs of the individual sites required the creation of a tool which could provide a generalized reformatting capability.

EditMaster

EditMaster is the tool which allows consortium members to make changes to the merged resource listing and have the changes available immediately, without any human intervention. It uses an HTML form to add a new resource, where the form restricts the user to legal Category values. It uses a page of HTML to select a resource for deletion. It also uses an HTML form to allow the user to edit the entire database. In this last case, the user can edit directly in the form, or they can SAVE AS, use their favorite file editor, and then OPEN LOCAL via their WWW client.

This last ability is the equivalent of the yet-to-be-implemented POST method. It allows the user to send an entire file to an HTTP server via HTML forms. Since the merged resource listing can be represented in ASCII characters, it can be put into an HTML TEXTAREA region after escaping any "<", ">", and "&" characters. The server script can extract the TEXTAREA information, unescape the characters, and then save the information in a file. The method even works for UUENCODED binary files.
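
The escaping round trip described above can be sketched as follows. This is only an illustration in Python, not the Tcl code used by EditMaster; the function names `escape_for_textarea` and `unescape_from_textarea` are hypothetical, and the sample line is made up.

```python
import html


def escape_for_textarea(text):
    """Escape "&", "<", and ">" so the text can sit inside a TEXTAREA."""
    # quote=False leaves double quotes alone; only the three characters
    # named in the text are replaced by entities.
    return html.escape(text, quote=False)


def unescape_from_textarea(text):
    """Reverse the escaping before the server saves the text to a file."""
    return html.unescape(text)


# Hypothetical line from a listing; the URL is a placeholder.
sample = '<DT><A HREF="http://www.example.org/?a=1&b=2">Example</A>'
escaped = escape_for_textarea(sample)
restored = unescape_from_textarea(escaped)
```

After the round trip, `restored` is identical to `sample`, while `escaped` contains no raw "<" or ">" that could end the TEXTAREA early.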

The tool uses the Concurrent Versions System (CVS) to store the merged resource listing. CVS allows several people to edit the merged listing simultaneously and does not require a user to `checkout' or `lock' the file. Using CVS also has the advantage of tracking the version which the user fetched, and it prevents a user who fetched an earlier version from undoing the changes made by a user who fetched a later version. This version tracking ability would be difficult to implement if the information were stored in a conventional database.

The tool itself is implemented as a CGI script using Tcl. Tcl is an interpreted scripting language with a high-level, easy-to-learn syntax and with extensive facilities for operating on strings, interacting with UNIX processes, and communicating with UNIX sockets. To eliminate the need to parse the merged resource listing file each time it was changed, the resource listing file was converted from the original HTML syntax to Tcl command syntax, which automatically loads all the information into Tcl associative arrays. A simple Tcl procedure is used to create an HTML version from the Tcl version whenever the listing is changed.
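
The idea of keeping the listing as data and generating the HTML DL/DT/DD form from it can be sketched as follows. The sketch uses a Python dictionary in place of Tcl associative arrays; the field names, the sample record, and the function `render_listing` are assumptions made for illustration and are not taken from the AstroWeb sources.

```python
import html

# Hypothetical in-memory form of the listing: one record per resource,
# keyed by a short identifier. All values are placeholders.
resources = {
    "example-archive": {
        "url": "http://www.example.org/archive/",
        "longtitle": "Example Data Archive",
        "category": "Data",
        "description": "Images & catalogs from a fictional survey.",
    },
}


def render_listing(records):
    """Produce an HTML DL/DT/DD listing from the record dictionary."""
    lines = ["<DL>"]
    for key in sorted(records):
        record = records[key]
        url = html.escape(record["url"])
        title = html.escape(record["longtitle"], quote=False)
        description = html.escape(record["description"], quote=False)
        lines.append('<DT><A HREF="' + url + '">' + title + "</A>")
        lines.append("<DD>" + description)
    lines.append("</DL>")
    return "\n".join(lines)


listing_html = render_listing(resources)
```

Because the HTML is always regenerated from the data, the HTML file never needs to be parsed back into records.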

The CGI script provides security by demanding a username and password and allowing access from only a specific set of sites. It also verifies that the required fields were populated and that only legal Categories were used.
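
The field and Category checks might look like the sketch below. It is written in Python rather than Tcl, and the required field names, the set of legal Categories, and the function `validate_submission` are all placeholders chosen for illustration; the actual lists used by EditMaster are not given here.

```python
# Assumed field names and categories; the real ones are not specified.
REQUIRED_FIELDS = ("url", "longtitle", "category")
LEGAL_CATEGORIES = {"Data", "Observatories", "Software"}


def validate_submission(form):
    """Return a list of problems found in a submitted form (empty if none)."""
    errors = []
    for field in REQUIRED_FIELDS:
        # A field counts as populated only if it has non-blank content.
        if not form.get(field, "").strip():
            errors.append("missing required field: " + field)
    category = form.get("category", "").strip()
    if category and category not in LEGAL_CATEGORIES:
        errors.append("illegal category: " + category)
    return errors
```

A submission would be accepted only when the returned list is empty; otherwise the messages could be shown back to the user.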

ValidateMaster

ValidateMaster checks, three times a day, that each non-Telnet and non-Usenet resource is still available and creates HTML files listing the "dead" or "unreliable" resources. "Dead" resources are those with 10 consecutive failures, and "unreliable" resources are those with more than 10 failures in the last 20 tries. Currently ~47 of the 1029 URLs are listed as "dead" or "unreliable".
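
The two definitions can be expressed as a small classification rule. The Python sketch below is an illustration only; it assumes that the 10 consecutive failures are the 10 most recent tries and that "dead" takes precedence over "unreliable", neither of which is stated above. The function name `classify` is hypothetical.

```python
def classify(history):
    """Classify a resource from its check history.

    history is a list of booleans, oldest first, where True means the
    check succeeded. Only the last 20 entries are considered.
    """
    recent = history[-20:]
    last_ten = recent[-10:]
    # Assumption: "10 consecutive failures" means the 10 most recent tries.
    if len(last_ten) == 10 and not any(last_ten):
        return "dead"
    failures = sum(1 for ok in recent if not ok)
    if failures > 10:
        return "unreliable"
    return "ok"
```

For example, a resource that failed 11 of its last 20 checks but succeeded on the most recent one would be reported as "unreliable" rather than "dead".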

The tool obtains the Tcl version of the merged database via Lynx and produces a list of URLs from the resources' own URLs and from any URLs contained in the resource Descriptions. WWW and Gopher URLs are tested using the Tcl `server_open' function. WAIS and FTP resources are tested using the `waissearch' and `ftp' clients. Earlier versions attempted to test TELNET resources, but the tool hung too often while checking them to be really useful. Since TELNET resources are so expensive to set up, they are less likely to be taken down or moved than other types of resources.
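
Collecting the URLs to be tested can be sketched as follows. This Python illustration does not reproduce ValidateMaster itself; the record layout, the regular expression, and the function `urls_to_check` are assumptions, and the sketch only builds the list of URLs without contacting any server.

```python
import re
from urllib.parse import urlsplit

# Assumed pattern for URLs embedded in free-text Descriptions.
URL_PATTERN = re.compile(r'\b(?:http|gopher|ftp|wais|telnet)://[^\s"<>]+')
# Telnet and Usenet resources are not checked.
SKIPPED_SCHEMES = {"telnet", "news"}


def urls_to_check(records):
    """Gather resource URLs plus URLs found in Descriptions, minus skipped ones."""
    found = set()
    for record in records:
        found.add(record["url"])
        for match in URL_PATTERN.findall(record.get("description", "")):
            # Drop punctuation that trails a URL at the end of a sentence.
            found.add(match.rstrip(".,;)"))
    return sorted(u for u in found if urlsplit(u).scheme not in SKIPPED_SCHEMES)
```

Each URL in the resulting list would then be handed to the test appropriate for its protocol.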

The results of the last 20 tries are stored in a file along with the string returned by the most recent failure. This information is available in separate HTML reports on the "dead" and "unreliable" URLs, and as an "Inactive???" label in the listings sorted by protocol and by category.

This information has been invaluable in detecting changed and inactive URLs. Any static list of URLs must be `validated' or else it will become populated with pointers to nowhere.

IndexMaster

IndexMaster generates a WAIS searchable index to the collection. Each resource is categorized by keywords located in HTML comments. The entire HTML file version of the collection is put into the WAIS index and it can be searched by category keyword, URL, resource name, etc.

The tool obtains the Tcl version of the merged database via Lynx and produces an HTML version with each resource separated by a line of dashes. This HTML version is indexed by WAIS using the `-t dash' option. The user queries the WAIS index via a CGI script which uses `waissearch' directly on the WAIS files and removes the lines of dashes.
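
The separator handling on both sides can be sketched as follows. This is a Python illustration with hypothetical function names; the length of the dash line is an assumption, and the sketch does not invoke the WAIS indexer or `waissearch'.

```python
# Assumed separator; the actual number of dashes used is not stated.
SEPARATOR = "-" * 40


def join_records(records):
    """Write one resource per entry, with a line of dashes between entries."""
    return ("\n" + SEPARATOR + "\n").join(records) + "\n"


def strip_separators(text):
    """Remove lines consisting only of dashes before showing results."""
    kept = [line for line in text.splitlines() if set(line.strip()) != {"-"}]
    return "\n".join(kept)
```

The first function corresponds to preparing the file for indexing, and the second to cleaning up the text returned to the user.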

Because the entire HTML file is naively indexed, the user can search on the Categories, which are hidden in HTML comments, or on portions of the URL, as well as on any text appearing in the resource Longtitle or Description.

The searchable index also provides rapid access to specific resources where the user knows part of the Longtitle or the Shorttitle. Without this tool, the user would be forced to search the individual HTML pages to find the desired resource.

ReformatMaster

ReformatMaster fetches the collection from a central repository and reformats the contents into the presentation desired at the member's site.

The tool obtains the Tcl version of the merged database and the datafile saved by ValidateMaster and produces HTML files for resources in each protocol and in each Category. Other sites can modify their local copies of the tool to fit their own presentation structure and format.
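
Grouping the resources by protocol and by Category, with the validation results applied as a label, can be sketched as follows. The Python below is an illustration only; the record fields, the `inactive_urls` argument, and the function `group_resources` are assumptions and do not describe the actual ReformatMaster code.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_resources(records, inactive_urls=frozenset()):
    """Return (by_protocol, by_category) dictionaries of resource titles."""
    by_protocol = defaultdict(list)
    by_category = defaultdict(list)
    for record in records:
        label = record["longtitle"]
        # Flag resources that the validation data marks as inactive.
        if record["url"] in inactive_urls:
            label += " (Inactive???)"
        by_protocol[urlsplit(record["url"]).scheme].append(label)
        by_category[record["category"]].append(label)
    return dict(by_protocol), dict(by_category)
```

Each key of the two dictionaries would correspond to one generated HTML file, and a site preferring a different structure could change only this grouping step.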

Conclusion

A small personal list of interesting Internet resources can be created and maintained by a single person using only an editor and a WWW client. For larger and more comprehensive lists, software tools are needed to support distributed editing of the resource list, validating the contents of the resource list, and distributed display of the resource list. The AstroWeb tools meet these needs and can be easily adapted to other resource listings.

Author's Biography

Robert E. Jackson is a Staff Scientist with Computer Sciences Corporation at the Space Telescope Science Institute (STScI) in Baltimore, MD. He received a B.S. in Physics from the California Institute of Technology and a Ph.D. in Astronomy from the University of California, Santa Cruz. He has been instrumental in introducing and applying WWW, WAIS, and Tcl technology at STScI when not involved in the development and maintenance of the Proposal Entry Processor system (PEPSI) and its successors at STScI.

Author's Email Address

jackson@stsci.edu