UCSTRI is a WWW service which provides a searchable index over thousands of existing technical reports, theses, preprints, and other documents broadly related to computer science. The service has been in operation since May of 1993 and has enjoyed significant use; it received an honorable mention for "Best Professional Service" in the 1994 Best of the Web awards, and is available at <URL:http://www.cs.indiana.edu/cstr/search>.
The design and philosophy of UCSTRI are presented and compared with other approaches to indexing technical reports. The lessons learned about organizing electronic resources are discussed, both regarding technical publications and network services in general.
In recent years, this informal network has moved onto the Internet, as many departments, research groups, and other institutions make publications such as technical reports, preprints, and theses freely available electronically. The typical arrangement was a set of documents, usually in PostScript® format, made available via FTP. This brought a large amount of information into the realm of network accessibility, but did little to help scholars search this sea of data for items of interest. Pointers to particular items could still be passed around, but only through essentially the same informal network of people, now moved online.
The idea of combining archives of academic papers to form a searchable interface is not a new one; the domain presents a standard resource discovery problem. Many other attempts to index online information exist; however, academic papers are particularly well suited to indexing because they are typically made available along with ``metadata'' which provides a concise and manageable description of their contents (author names, title, and sometimes an abstract). This is richer indexing information than is found in, say, FTP filenames or Gopher menu entries.
The hard part of UCSTRI is in the index builder: it must be general enough to extract the metadata from index files on remote servers despite the fact that those files do not necessarily follow any consistent format. The indexer must find the index file, split it up into separate records for each different document, and match those records with the filenames of the items themselves.
Here is a typical example of such a file:
    TR 319 Andrew J. Hanson. The Rolling Ball: Context-Free Control of
    Spatial Orientation with Two-Dimensional Input Devices. (November
    1990) 17 pgs

    TR 340 Gregory J. E. Rawlins. The new publishing: Technology's impact
    on the publishing industry over the next decade. (Nov. 1991). 68pgs

The indexer splits the file into records by some expression (in this case, blank lines). Associated with that index is a directory of filenames:
    ARCHIVE         Index           README.IU       STIQUITO.INFO
    TR319.ps.Z      TR340.ps.Z      TR344.ps.Z      TR345.ps.Z

Filename extensions are highly standardized and easily removed. In this particular index, the records contain a space after TR while the filenames do not; such standard changes can be accommodated by simple substitution rules.
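A minimal sketch of this splitting-and-matching step, written in Python purely for illustration (the function names, the leading-identifier pattern, and the substitution rule are assumptions, not UCSTRI's own code):

    import re

    def split_records(index_text, separator=r"\n\s*\n"):
        # Split an index file into records; in this example the records
        # are separated by blank lines.
        return [r.strip() for r in re.split(separator, index_text) if r.strip()]

    def match_filename(record, filenames, substitutions=((" ", ""),)):
        # Take a leading identifier such as "TR 319" from the record,
        # normalize it with simple substitution rules (here: drop the
        # space, giving "TR319"), and compare it against each filename
        # with its standardized extensions stripped.
        m = re.match(r"([A-Za-z]+[-\s]?\d+)", record)
        if m is None:
            return None
        key = m.group(1)
        for old, new in substitutions:
            key = key.replace(old, new)
        for name in filenames:
            if name.split(".")[0] == key:   # "TR319.ps.Z" -> "TR319"
                return name
        return None

Applied to the example above, the first record would pair with TR319.ps.Z and the second with TR340.ps.Z.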
The textual contents of the index file are simply included blindly, with whitespace folded together. From the example above, the entries created are:

- Indiana U CS TR319.ps.Z(45K)
  TR 319 Andrew J. Hanson. The Rolling Ball: Context-Free Control of Spatial Orientation with Two-Dimensional Input Devices. (November 1990) 17 pgs
- Indiana U CS TR340.ps.Z(167K)
  TR 340 Gregory J. E. Rawlins. The new publishing: Technology's impact on the publishing industry over the next decade. (Nov. 1991). 68pgs
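The assembly of such an entry can be sketched as follows (again an illustrative Python fragment; the site label and file size are assumed inputs here, and the exact entry layout is only an approximation of what UCSTRI produces):

    def fold_whitespace(text):
        # Collapse all runs of whitespace, including newlines, so the
        # record text becomes a single line.
        return " ".join(text.split())

    def make_entry(site_label, filename, size_kb, record):
        # One master-index entry: a heading naming the site, the file,
        # and its size, followed by the record text included blindly.
        heading = "%s %s(%dK)" % (site_label, filename, size_kb)
        return heading + "\n" + fold_whitespace(record)

For example, make_entry("Indiana U CS", "TR340.ps.Z", 167, record), where record is the corresponding block of text from the index file, would produce the second entry shown above.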
When parsing ordinary index files, the content descriptions are opaque to the indexer (as are the files containing the documents themselves). Some sites employ more specific formats, such as the format used by the UNIX program refer or the bibliographic format defined by [RFC1357]; these provide more structure, so the resulting entries can be formatted more nicely. After culling from all the sites, the entries are placed in a single master index
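A record in a tagged format along the lines of [RFC1357] can be picked apart field by field rather than treated as opaque text. A minimal parsing sketch, assuming the TAG:: value layout of that format (the sample record and its field values are invented for illustration):

    def parse_tagged_record(text):
        # Collect TAG:: value fields; untagged lines continue the
        # previous field.
        fields = {}
        current = None
        for line in text.splitlines():
            if "::" in line:
                tag, _, value = line.partition("::")
                current = tag.strip()
                fields[current] = value.strip()
            elif current is not None:
                fields[current] += " " + line.strip()
        return fields

    sample = """BIB-VERSION:: CS-TR-v2.0
    ID:: IUCS//TR340
    TITLE:: The new publishing: Technology's impact on the
      publishing industry over the next decade
    AUTHOR:: Gregory J. E. Rawlins
    DATE:: November 1991
    END:: IUCS//TR340"""

    record = parse_tagged_record(sample)   # record["TITLE"], record["AUTHOR"], ...

The parsed fields can then be presented with explicit author, title, and date lines instead of a blindly included block of text.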
file. The search engine was designed for lightweight simplicity and power. Termed the Simple Index Keyword Search, or SIKS, it accepts multiple keywords (actually regular expressions) and returns items ordered by how many of the expressions each matched. This search is flexible for users familiar with regular expressions, but it does not use pre-constructed tables such as those employed by [Wais]; such an engine could be substituted without altering UCSTRI's essential design.
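A minimal sketch of this ranking scheme in Python (the function name is hypothetical, and the choices to omit non-matching entries and to match case-insensitively are assumptions of this sketch, not statements about SIKS itself):

    import re

    def keyword_search(entries, patterns):
        # Score each entry by how many of the given regular expressions
        # it matches, drop entries matching none, and return the rest
        # ordered from most matches to fewest.
        compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
        scored = []
        for entry in entries:
            hits = sum(1 for rx in compiled if rx.search(entry))
            if hits:
                scored.append((hits, entry))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [entry for hits, entry in scored]

With the query below, entries matching all of sandwich, theorem, and knuth would be listed ahead of entries matching only one or two of them.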
A sample query might be a search for information about Knuth's work with sandwich theorems, made by specifying the keywords sandwich theorem knuth. The results are shown below.
Between March 18 and September 11, UCSTRI received 119,630 queries originating from 21,053 distinct IP addresses. The two graphs below show where these queries came from and when they arrived. These figures include only actual searches on the database, not connections to view the search cover page or information about the service. A number of hosts could not be resolved by the Domain Name System; they are listed as DNS failures.
Stability is always a problem for network services, and certainly is for UCSTRI. The service is not formally supported in any way; it is administered as a hobby and run with machine resources donated by the Indiana University Computer Science department. We now have a mirror site in Japan, but in general no provisions exist for effectively distributing the current system.
The lack of active participation by information providers leads to frequent maintenance problems as circumstances change: a file format will change, an FTP server will move, or a filename will change from Index to Index-1994.
The synopsis is that UCSTRI is a hack that works for now. The maintenance required is relatively high; supporting the service as well as it could be supported would probably take 8-10 hours a week (in practice, the support it gets is somewhat irregular).
The Wide Area Technical Report System [Waters] is another attempt at organizing such information. The National CS TR Library [Dienst, Davis] represents another, more ambitious approach to a distributed digital library. Both systems offer more sophisticated functionality and better scalability than UCSTRI, but both also require sites to use specific software to be included. The development time is consequently much longer because consensus must be reached with a large number of participants; each is still working with only a handful of participating sites in the short term.
One intriguing recent addition is a broker for CS technical reports built as a demonstration application of Harvest [Harvest]. This system extracts information from the documents themselves, unlike most other systems, which treat documents (typically in difficult-to-analyze formats like compressed PostScript) as opaque objects [Essence]. Since document files are more standard than index formats, Harvest is able to function with less intensive maintenance than UCSTRI. The broker has somewhat broader coverage than UCSTRI (indeed, UCSTRI is one of the sources from which the broker builds its list of sites), though, unlike UCSTRI, it includes many entries for reports not available online. The broker also seems to have greater problems with duplicate entries (for example, the TRs of the author's department are all listed three times under different domain names for the same machine). The Harvest TR broker also uses significantly more storage space for its index, though provisions for distribution make that system more scalable overall.
The various indexing services can be compared along several scales: ease of use for providers of documents, for maintainers of the service itself, and for users. For providers, Harvest is easiest (no effort is required), followed closely by UCSTRI (little or no effort is required). For maintainers, UCSTRI is probably the most time-consuming and tedious to maintain well. For users, ignoring differences in coverage, Dienst is probably the easiest to use for generating powerful queries; WATERS is roughly comparable, with UCSTRI falling significantly below both.
Building an effective index for resource discovery which is smaller than the space being indexed requires finding characteristics which concisely describe the content of each item. Some approaches, such as [Essence], attempt to extract such information automatically; others, such as [Waters] or [Dienst], make the provider responsible for supplying it in a specific form. UCSTRI steers a middle course, assuming that this ``metadata'' often already exists and is maintained, but is not necessarily made available in a standardized format. The former WAIS archive of FTPable-READMEs also employed such a strategy.
In general, metadata provided by people is likely to do a better job of concisely expressing the essence of an item to a human reader than metadata extracted by a program. The result is that search results from a system based on explicit metadata are likely to be more intelligible to the user than search results from a system based on implicit metadata, such as the Harvest TR indexer [Harvest] or Archie [Archie].
Like most other successful network resource indices, UCSTRI is a quick solution that works. As a more general framework for resource discovery on the Internet evolves, the need for such solutions tends to go away. Provider-supplied metadata, however, seems likely to continue to play an important role in any general solution.
<URL:http://web.nexor.co.uk/mak/doc/aliweb-paper/paper.html>
<URL:http://cs-tr.cs.cornell.edu/TR/CORNELLCS:TR94-1418>
<URL:http://cs-tr.cs.cornell.edu/Info/dienst_protocol.html>
<URL:ftp://ftp.cs.colorado.edu/pub/techreports/schwartz/Essence.Jour.ps.Z>
<URL:http://www.rdt.monash.edu.au/tr/siteslist.html>
<URL:ftp://ftp.cs.colorado.edu/pub/cs/techreports/schwartz/Harvest.ps.Z>
<URL:ftp://netlib.att.com/netlib/att/math/odlyzko/tragic.loss.Z>
<URL:ftp://ftp.cs.indiana.edu/pub/techreports/TR340.ps.Z>
<URL:ftp://nic.merit.edu/documents/rfc/rfc1357.txt>
<URL:ftp://ftp.think.com/wais/wais-corporate-paper.text>
<URL:http://www.cs.odu.edu:8000/wais/www.cs.odu.edu:210/WATERS/HTML/1654/1=hengest%3A210;2=/home/waters/NEW-WATERS/server/WATERS;3=0%201654%20/home/waters/NEW-WATERS/html/hengest.cs.odu.edu_77.html;4=hengest%3A210;5=/home/waters/NEW-WATERS/server/WATERS;6=0%201654%20/home/waters/NEW-WATERS/html/hengest.cs.odu.edu_77.html;7=%00;>
Marc has been actively involved in the WWW community for some time; he authored the first sophisticated Perl HTTP daemon which, after much work by others, formed the core of the Plexus server. He still administers the departmental HTTP server.