The Unified Computer Science Technical Report Index:

Lessons in indexing diverse resources

Marc VanHeyningen
Computer Science Department
Indiana University
mvanheyn@cs.indiana.edu

Abstract

UCSTRI is a WWW service which provides a searchable index over thousands of existing technical reports, theses, preprints, and other documents broadly related to computer science. This service has been in operation since May of 1993 and has enjoyed significant use; it received an honorable mention for "Best Professional Service" in the 1994 Best of the Web awards, and is available at <URL:http://www.cs.indiana.edu/cstr/search>.

The design and philosophy of UCSTRI are presented and compared with other approaches to indexing technical reports. The lessons learned about organizing electronic resources are discussed, both with regard to technical publications and to network services in general.


Introduction

In the electronic revolution currently underway in the field of ``publishing,'' the academic community has in some ways held the leading edge. This community was among the first to have access to global network connectivity, and academic publication greatly simplifies compensation issues since the author does not generally expect financial remuneration. Academics have long exchanged preprints and technical reports with colleagues relatively freely; this makes their findings widely available prior to publication in more conventional channels, such as journals [Odlyzko]. These factors form an ideal environment for electronic publishing.

In recent years, this informal network has moved onto the Internet, as many departments, research groups, and other institutions make publications such as technical reports, preprints, and theses freely available electronically. The typical arrangement is a set of documents, usually in PostScript® format, made available via FTP. This brought a large amount of information into the realm of network accessibility, but did little to help scholars search for items of interest within this sea of data. Pointers to items could be passed around through essentially the same informal network of people, now moved online.

The idea of combining archives of academic papers to form a searchable interface is not a new one; the domain is a standard resource discovery problem. Many other attempts to index online information also exist; however, academic papers are particularly suited to indexing because they typically are made available along with ``metadata'' which provides a concise and manageable description of their contents (author names, title, sometimes an abstract.) This is richer indexing information than is found in, say, FTP filenames or Gopher menu entries.


Enter UCSTRI

The Unified Computer Science Technical Report Index, or UCSTRI (rhymes with ``Spruce Tree''), is, as its name suggests, an attempt to unify a wide variety of technical documents broadly related to computer science into a single searchable index. Technical reports about computer science form the core of the collection; theses, preprints, and other papers from CS and related areas are also included. The essential ideas at UCSTRI's core are reflected in the design described below.

Design

UCSTRI requires two major modules. An index builder polls numerous FTP sites for item information to construct a master index file. The list of sites and their characteristics is the only component of the system's operation that must be maintained by hand. A search engine then processes queries against that file to return citations and hypertext links to appropriate items. The overall structure is similar to other indexing systems (e.g. [Aliweb].)


Figure showing UCSTRI's major components


A major distinction between UCSTRI and ALIWEB lies in the rigidity of the index file; ALIWEB relies on the provider to have created the file in a specific format, while UCSTRI assumes the provider probably already has such a file and assumes as little as possible about its structure.

The hard part of UCSTRI is the index builder: it must be general enough to extract the metadata from index files on remote servers even though those files do not necessarily follow any consistent format. The indexer must find the index file, split it up into separate records for each document, and match those records with the filenames of the items themselves.

Here is a typical example of such a file:

TR 319   Andrew J. Hanson.  The Rolling Ball: Context-Free Control
         of Spatial Orientation with Two-Dimensional Input Devices.
         (November 1990)
         17 pgs
         
TR 340   Gregory J. E. Rawlins.  The new publishing: Technology's impact
         on the publishing industry over the next decade.  (Nov. 1991).
         68pgs

The indexer splits the file into records by some expression (in this case, blank lines.) Associated with that index is a directory of filenames:

ARCHIVE       Index         README.IU     STIQUITO.INFO
TR319.ps.Z    TR340.ps.Z    TR344.ps.Z    TR345.ps.Z

Filename extensions are highly standardized and easily removed. In this particular index, the records contain a space after TR while the filenames do not; such standard variations can be accommodated by simple substitution rules. The textual contents of the index file are simply included blindly with whitespace folded together. From the example above, the entries created are:
Indiana U CS TR319.ps.Z(45K)
TR 319 Andrew J. Hanson. The Rolling Ball: Context-Free Control of Spatial Orientation with Two-Dimensional Input Devices. (November 1990) 17 pgs

Indiana U CS TR340.ps.Z(167K)
TR 340 Gregory J. E. Rawlins. The new publishing: Technology's impact on the publishing industry over the next decade. (Nov. 1991). 68pgs

When parsing ordinary index files, the content descriptions are opaque to the indexer (as are the files containing the documents themselves.) Some sites employ more specific formats, such as that used by the UNIX program refer or the one defined by [RFC1357]; in those cases more structure is available, so the resulting entries can be formatted more nicely. After entries are culled from all the sites, they are placed in a single master index file.
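
As an illustration of the splitting and matching just described, a minimal sketch in Python follows; the function and variable names are hypothetical, and the actual indexer is considerably more general.

import re

def build_entries(index_text, filenames,
                  record_sep=r"\n\s*\n",            # records separated by blank lines
                  substitutions=(("TR ", "TR"),)):  # per-site rule: drop the space
    # Map filename stems (extensions removed) to the full filenames.
    stems = {re.sub(r"\.(ps|dvi|txt)(\.(Z|gz))?$", "", name): name
             for name in filenames}
    entries = []
    for record in re.split(record_sep, index_text.strip()):
        text = " ".join(record.split())             # fold whitespace together
        if not text:
            continue
        key = text
        for old, new in substitutions:
            key = key.replace(old, new)
        ident = key.split()[0]                      # e.g. "TR319"
        if ident in stems:                          # keep records that match a file
            entries.append((stems[ident], text))
    return entries

Run over the example above, such a sketch would pair the TR 319 and TR 340 records with TR319.ps.Z and TR340.ps.Z respectively.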

The search engine was designed for lightweight simplicity and power; termed Simple Index Keyword Search, or SIKS, it accepts multiple keywords (actually regular expressions) and returns items ordered by how many expressions each matched. This search is flexible for users familiar with regular expressions, but does not use pre-constructed tables such as those employed by [Wais]; such an engine could be substituted without altering UCSTRI's essential design.
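
To make the ranking concrete, a SIKS-style search might be sketched as follows (again in Python, with hypothetical names; this is not the actual engine):

import re

def siks_search(entries, query_terms):
    # Each query term is treated as a case-insensitive regular expression.
    patterns = [re.compile(term, re.IGNORECASE) for term in query_terms]
    scored = []
    for filename, text in entries:
        hits = sum(1 for p in patterns if p.search(text))
        if hits:
            scored.append((hits, filename, text))
    # Items matching the most expressions come first.
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored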

A sample query might be a search for information about Knuth's work with sandwich theorems, specifying the keywords sandwich theorem knuth. The results are shown below.


Figure showing results from a sample query


Results

UCSTRI's structure has permitted it to grow reasonably rapidly; as of this writing, it indexes 9,766 items at 177 different sites around the network (although there are inevitably some items which are duplicates, errors, or otherwise not useful.) The resulting index file, about 6.2 megabytes in size, is small enough to manage easily. Although active participation on the part of indexed sites is not required, some sites have become interested in UCSTRI and design their FTP archives with it in mind.

Between March 18 and September 11, UCSTRI received 119,630 queries originating from 21,053 distinct IP addresses. The two graphs below show where these queries came from and when they arrived. These figures include only actual searches on the database, not connections to view the search cover page or information about the service. A number of hosts could not be resolved by the Domain Name System; they are listed as DNS failures.


Figures showing top-level domains of queries and frequencies of queries over time


UCSTRI is a reasonably old and mature service as WWW services go; it first came online in May of 1993. Use is somewhat volatile, particularly as the large segment of academic users ebbs and flows during the summer, and the esoteric nature of much technical information limits the audience of serious users.

Stability is always a problem for network services, and certainly is for UCSTRI. The service is not formally supported in any way; it is administered as a hobby and run with machine resources donated by the Indiana University Computer Science department. We now have a mirror site in Japan, but in general no provisions exist for effectively distributing the current system.

The lack of active participation by information providers causes recurring maintenance problems as circumstances change: a file format will change, an FTP server will move, or a filename will change from Index to Index-1994.

The synopsis is that UCSTRI is a hack that works for now. The maintenance required is relatively high; supporting the service as it could be supported would probably take 8-10 hours a week (in practice, the support it gets is somewhat irregular.)


UCSTRI and other indexers

Rik Harris's WAIS index of abstracts [Harris] provided some of the first broad search functionality. Unfortunately, its interface does not provide hypertext links to the final content when available online. Its list of sites, however, was invaluable in constructing UCSTRI.

The Wide Area Technical Report System [Waters] is another attempt at organizing such information. The National CS TR Library [Dienst, Davis] represents another, more ambitious approach to a distributed digital library. Both systems offer more sophisticated functionality and better scalability than UCSTRI, but both also require sites to use specific software to be included. The development time is consequently much longer because consensus must be reached with a large number of participants; each is still working with only a handful of participating sites in the short term.

One intriguing recent addition is a broker for CS technical reports as a demonstration application of Harvest [Harvest]. This system extracts information from the documents themselves, unlike most other systems, which treat documents (typically in difficult-to-analyze formats like compressed PostScript) as opaque objects [Essence]. Since document formats are more standardized than index formats, Harvest is able to function with less intensive maintenance than UCSTRI. This broker has somewhat broader coverage than UCSTRI (indeed, UCSTRI is one of the sources from which the broker builds its list of sites) though, unlike UCSTRI, it includes many entries for reports not available online. The broker also seems to have greater problems with duplicate entries (for example, the TRs of the author's department are all listed three times under different domain names for the same machine.) The Harvest TR broker also uses significantly more storage space for its index, though provisions for distribution make that system more scalable overall.

The various indexing services might be compared along several scales indicating ease of use for providers of documents, for maintainers of the service itself, and for users. For providers, Harvest is easiest (no effort is required), followed closely by UCSTRI (little or no effort is required.) For maintainers, UCSTRI is probably the most time-consuming and tedious to maintain well. For users, ignoring differences in coverage, Dienst is probably the easiest to use for generating powerful queries; WATERS is roughly comparable, with UCSTRI falling significantly below both.


Lessons learned

As the amount of available information grows, organizing the data by time becomes important. Technical reports and related information are of most use early in their lifetime, and older reports are less likely to be of value to users. UCSTRI orders its results by time using the modification dates on the files obtained from FTP directory listings, but such a solution is sadly incomplete. It should be possible to restrict queries to a specific time interval, such as the past six months.
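
A restriction of this kind could be sketched as follows, assuming each entry carries the modification time taken from the FTP directory listing (the names here are hypothetical):

import time

SIX_MONTHS = 182 * 24 * 60 * 60        # roughly six months, in seconds

def order_and_restrict(dated_entries, max_age=SIX_MONTHS):
    # dated_entries: (mtime_in_seconds, citation_text) pairs
    cutoff = time.time() - max_age
    recent = [entry for entry in dated_entries if entry[0] >= cutoff]
    recent.sort(key=lambda entry: entry[0], reverse=True)   # newest first
    return recent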

Building an effective index for resource discovery which is smaller than the space being indexed requires finding characteristics which concisely describe the content of each item. Some approaches, such as [Essence], attempt to extract such information automatically; others, such as [Waters] or [Dienst], make the provider responsible for supplying it in a specific form. UCSTRI steers a middle course, assuming that this ``metadata'' often already exists and is maintained but is not necessarily made available in a standardized format. The former WAIS archive of FTPable-READMEs also employed such a strategy.

In general, metadata provided by people is likely to do a better job of concisely expressing the essence of an item to a human reader than metadata extracted by a program. The result is that search results from a system based on explicit metadata are likely to be more intelligible to the user than results from a system based on implicit metadata, such as the Harvest TR indexer [Harvest] or Archie [Archie].

Like most other successful network resource indices, UCSTRI is a quick solution that works. As a more general framework for resource discovery on the Internet evolves, the need for such solutions tends to go away. Provider-supplied metadata, however, seems likely to continue to play an important role in any general solution.


Acknowledgments

UCSTRI is run on facilities made available by the Department of Computer Science at Indiana University, Bloomington. Thanks to Bill Dueber and Tom Loos for helping develop this document, and to Jun-ichiro Itoh for providing a mirror in Japan and enhancing UCSTRI's formatting to handle Harris's index format.

References

[Aliweb]
Koster, M. ``ALIWEB -- Archie-like indexing in the web.'' First International Conference on the World Wide Web, Geneva, 1994.
<URL:http://web.nexor.co.uk/mak/doc/aliweb-paper/paper.html>
[Archie]
Emtage, A. & Deutsch, P. ``Archie -- An electronic directory service for the Internet.'' Proceedings of the USENIX Winter Conference, pp 93--110, January 1992.
[Davis]
Davis, J. & Lagoze, C. ``A protocol and server for a distributed digital technical report library.'' Technical Report 94-1418, Computer Science Department, Cornell University. June 24, 1994.
<URL:http://cs-tr.cs.cornell.edu/TR/CORNELLCS:TR94-1418>
[Dienst]
Davis, J. & Lagoze, C. ``Dienst, a protocol for a distributed digital document library.'' Internet Draft, work in progress.
<URL:http://cs-tr.cs.cornell.edu/Info/dienst_protocol.html>
[Essence]
Hardy, D., Schwartz, M. ``Customized Information Extraction as a Basis for Resource Discovery.'' Technical Report CU-CS-707-94, Department of Computer Science, University of Colorado, Boulder, Colorado, March 1994.
<URL:ftp://ftp.cs.colorado.edu/pub/techreports/schwartz/Essence.Jour.ps.Z>
[Harris]
Harris, Rik. ``Computer Science Technical Reports Archive Sites.''
<URL:http://www.rdt.monash.edu.au/tr/siteslist.html>.
[Harvest]
Bowman, C., Danzig, P., Hardy, D., Manber, U., & Schwartz, M. ``Harvest: A scalable, customizable discovery and access system.'' Technical Report CU-CS-732-94, Department of Computer Science, University of Colorado, Boulder, Colorado, August 1994.
<URL:ftp://ftp.cs.colorado.edu/pub/cs/techreports/schwartz/Harvest.ps.Z>
[Odlyzko]
Odlyzko, A. ``Tragic loss or good riddance? The impending demise of traditional scholarly journals.'' International Journal for Human-Computer Studies (formerly International Journal for Man-Machine Studies), to appear. Condensed version to appear in Notices Amer. Math. Soc., Jan. 1995.
<URL:ftp://netlib.att.com/netlib/att/math/odlyzko/tragic.loss.Z>
[Rawlins]
Rawlins, G. ``The new publishing: Technology's impact on the publishing industry over the next decade.'' Technical Report 340, Department of Computer Science, Indiana University. November, 1991. An abbreviated version of this paper appeared in Journal of the American Society for Information Science 44:474.
<URL:ftp://ftp.cs.indiana.edu/pub/techreports/TR340.ps.Z>
[RFC1357]
Cohen, D. ``A format for e-mailing bibliographic records.'' Request For Comments 1357, Network Working Group, Internet Engineering Taskforce. July, 1992.
<URL:ftp://nic.merit.edu/documents/rfc/rfc1357.txt>
[Wais]
Kahle, B. & Medlar, A. ``An Information System for Corporate Users: Wide Area Information Servers.'' Wais corporate paper version 3.
<URL:ftp://ftp.think.com/wais/wais-corporate-paper.text>
[Waters]
Maly, K., French, J., Selman, A., & Fox, E. ``Wide area technical report service.'' Technical Report 94-13, Dept. of Computer Science, Old Dominion University. June 6, 1994.
<URL:http://www.cs.odu.edu:8000/wais/www.cs.odu.edu:210/WATERS/HTML/1654/1=hengest%3A210;2=/home/waters/NEW-WATERS/server/WATERS;3=0%201654%20/home/waters/NEW-WATERS/html/hengest.cs.odu.edu_77.html;4=hengest%3A210;5=/home/waters/NEW-WATERS/server/WATERS;6=0%201654%20/home/waters/NEW-WATERS/html/hengest.cs.odu.edu_77.html;7=%00;>

Author's Biography

Marc VanHeyningen is a doctoral student in the Computer Science Department at Indiana University, Bloomington. His research, with advisor Gregory J. E. Rawlins, involves evolving index strategies for large image databases. He is also employed by University Computing Services to construct a document registry and index for network resources available throughout the university.

Marc has been actively involved in the WWW community for some time; he authored the first sophisticated Perl HTTP daemon which, after much work by others, formed the core of the Plexus server. He still administers the departmental HTTP server.

Author's Email address

mvanheyn@cs.indiana.edu