The goal of drop-in publishing is to simplify digital publishing over the Internet. We would like digital publishing of non-commercial matter (e.g. technical reports, course notes, brochures) to be as easy as sending email is now, but with the virtues of archival storage and easy searching that we associate with electronic libraries. We propose a protocol, Dienst, that allows communication between clients and document servers by encoding object-oriented messages within URLs. A preliminary version of this protocol now runs at eight sites, and we describe some of its features. Next we present tools for automating the maintenance of document collections. Finally, we discuss the problems we've had with the Web as it stands, hoping to motivate changes that would improve the performance of digital library systems such as ours.
"However one may sing the praises of those who by their virtue either defend or increase the glory of their country, their actions only affect worldly prosperity, and within narrow limits....[but] Aldus is building up a library which has no other limits than the world itself."Desiderius Erasmus wrote these words in praise of his friend Aldus, a book publisher of the 16th century. More than 400 years later, digital publishing may finally enable us to fulfill this vision, providing universal access to all the world's information. What's in the way?
The existing technologies (WWW, Gopher, and even anonymous FTP) make reproduction and transmission fairly fast and cheap, but do little or nothing to help writers write or readers find or read documents. In our view, the problem is that they provide too little structure to the document collection. All of them present basically the same abstraction, namely a hierarchy of files, but do nothing to help the user locate a file within a hierarchy. Every site is different. Some group reports by year, others by project name; but even if every site on the Internet organized its hierarchy identically, it would not be enough, because every site also has its own conventions for naming files, indicating data formats, and making searchable indices. A writer who wishes to contribute has basically the same problem - it's easy to copy a file into an anonymous FTP area, but hard to make sure that it's indexed properly. A considerate writer might want to provide the same document in several formats, to increase the chances of accessibility, but this is a nuisance. We claim what's needed is a new, higher-level protocol that hides the underlying details, along with tools to simplify library management.
This paper presents our first steps towards the universal library. We describe a protocol for universal access and the server that implements it. (For those familiar with our server - in this paper we describe not the currently running protocol, but rather the one we have submitted as an Internet Draft [DIENSTPROT], which corrects a number of design flaws in the working version. We regret any confusion this causes.) We present a number of tools that integrate with our server to make publishing a document on-line relatively easy. We also discuss the steps we took to bring a large, existing collection online from paper. Finally, since our protocol is based on the World Wide Web, we also describe some of the problems we've observed in using it, in the hope that others at this conference will have solutions we can adopt.
Our focus on non-commercial publishing requires explanation. We realize that some content providers will not place their intellectual property on the net until clear definitions of legal rights and mechanisms for payment and protection are in place. We have nothing to contribute in these areas. Nevertheless there are a number of providers, such as universities or corporate internal groups, for whom these issues are less pressing, and we believe that we can thus make some useful contribution without working on the additional issues raised by economics.
Dienst supports a message-passing interface to this document model. Messages may be addressed to every document server, to a particular server, to one document, or to a particular part of a document. A message is encoded into the "path" portion of a URL, and contains the name of the message, the recipient, and the arguments, if any. A message may be sent to any convenient Dienst server (the nearest, for example), which will execute it locally or forward it as appropriate. To the user, Dienst appears to be a single virtual document collection that hides the details of the server distribution. (Note that the actual implementation does not use an object-oriented language; we use message passing only as a convenient conceptual model.)
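The fragment below sketches this encoding idea in Python. It is only an illustration: the concrete URL syntax, the service and message names, and the example identifier are placeholders, not the grammar defined in the Internet Draft [DIENSTPROT].

    from urllib.parse import quote

    def dienst_message_url(host, service, verb, recipient=None, **args):
        """Pack a Dienst-style message (service, verb, recipient, arguments)
        into the path and query portions of an ordinary HTTP URL, so that
        any Web client or server can carry it."""
        path = f"/Dienst/{service}/{verb}"
        if recipient is not None:
            path += "/" + quote(recipient, safe="")   # e.g. a DocID
        query = "&".join(f"{k}={quote(str(v))}" for k, v in args.items())
        return f"http://{host}{path}" + (f"?{query}" if query else "")

    # Hypothetical request: page 2 of the TIFF version of one document.
    print(dienst_message_url("cs-tr.cs.cornell.edu", "Repository", "Body",
                             "cstr/cornell.cs/TR94-1418",
                             format="image/tiff", page=2))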
Each document in Dienst has a unique identifier, called a DocID, which names the document in a location-independent manner. A DocID serves exactly the same role as a URN, and when URNs are fully specified we will adopt them. A DocID has three components: a naming convention, a publisher, and a number. To ensure that each DocID is unique, each component (or rather, the institution that issues each component) guarantees that the next component is unique - thus each naming convention controls a namespace of publishers, and each publisher issues a set of numbers.
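As a minimal sketch of this structure (the separator character and the example identifier are assumptions for illustration; the protocol draft defines the concrete syntax):

    from typing import NamedTuple

    class DocID(NamedTuple):
        """The three components of a DocID, outermost first."""
        naming_convention: str  # controls a namespace of publishers
        publisher: str          # issues a set of document numbers
        number: str             # unique within the publisher

    def parse_docid(text: str, sep: str = "/") -> DocID:
        convention, publisher, number = text.split(sep, 2)
        return DocID(convention, publisher, number)

    # Purely illustrative:
    # parse_docid("cstr/cornell.cs/TR94-1418")
    #   -> DocID(naming_convention='cstr', publisher='cornell.cs',
    #            number='TR94-1418')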
For each publisher, there must be at least one server to handle messages for the documents issued by that publisher. In our view, the minimum commitment a publisher must make to issue a document is to store and deliver the document to the network. When a Dienst server receives a message for a document it locates the closest server for the document's publisher and forwards the message to it.
Dienst messages address four types of digital library services: user interface services, which present library information in a format designed for human readability; repository services, which store documents and support retrieval of all or part of them; index services, which provide search; and miscellaneous services, which provide general information about a server.
Of these four services, only the first is used directly by a human. The others are used by programs, in particular other Dienst servers, but also by other digital library or publishing systems. For example, the Stanford Information Filtering Tool ([SIFT]) obtains bibliographic records through the index interface, and we are currently designing a gateway to the WATERS ([WATERS]) system. We encourage other developers of digital library systems to provide both user interfaces and application interfaces to their systems.
All services except the last are optional at a given site. This allows maximal flexibility in the way that particular server implementations interoperate. For example, one server may exist solely as a user interface gateway, providing transparent access for users to a particular domain of indexes and repositories. We see this flexible interoperability as key to the development of a digital library infrastructure where the "collection" will span multiple sites and continents.
The Dienst protocol includes a message that asks, for a given document, in which formats it is available. We specify formats with MIME ([MIME]) Content-types. Dienst does not support the notion of explicit conversion between document formats (as System 33 does [Putz]). A repository willing and able to provide a document in a given format should simply list that format, even if it is only obtained through a conversion service.
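A hedged sketch of how a client might use such a message follows. The message name "Formats", the URL layout, and the one-Content-type-per-line reply are assumptions for illustration; the draft [DIENSTPROT] defines the real request and response.

    import urllib.request
    from urllib.parse import quote

    def available_formats(server, docid):
        """Ask a server which MIME Content-types it can deliver a document
        in, assuming a hypothetical 'Formats' message that replies with one
        Content-type per line."""
        url = f"http://{server}/Dienst/Repository/Formats/{quote(docid, safe='')}"
        with urllib.request.urlopen(url) as resp:
            return [line.strip() for line in resp.read().decode().splitlines()
                    if line.strip()]

    # e.g. available_formats("cs-tr.cs.cornell.edu", "cstr/cornell.cs/TR94-1418")
    # might return ["application/postscript", "image/tiff", "image/gif"]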
Diversity is the rule on the Internet, and each site supporting Dienst is likely to store its documents in a different way. The Dienst protocol hides all detail of the underlying storage organization -- this is in sharp contrast to FTP, Gopher, and "bare" HTTP, where the underlying hierarchy is visible. Each Dienst repository includes a function which maps from a DocID and format to the actual storage pathname on that server. This hides both the details of file system structure and the file typing or naming conventions from outside users. Thus one may request, say, the second page of the TIFF version of a document from a server without needing to know where and how it is stored.
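One site's mapping function might look like the following sketch. The directory layout, file extensions, and per-page naming are invented for illustration, which is exactly the point - no caller ever sees them.

    import os

    # One hypothetical site's layout: format -> (subdirectory, extension).
    FORMAT_LAYOUT = {
        "application/postscript": ("postscript", ".ps"),
        "image/tiff":             ("tiff",       ".tif"),
        "image/gif":              ("gif",        ".gif"),
    }

    def storage_path(root, publisher, number, fmt, page=None):
        """Map DocID components, a MIME format, and an optional page to a
        local pathname.  Outside users only name the document, the format,
        and the part they want; this layout stays private to the server."""
        subdir, ext = FORMAT_LAYOUT[fmt]
        name = number if page is None else f"{number}-p{page:03d}"
        return os.path.join(root, publisher, subdir, name + ext)

    # e.g. storage_path("/library", "cornell.cs", "TR94-1418", "image/tiff", page=2)
    #   -> "/library/cornell.cs/tiff/TR94-1418-p002.tif"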
Every query language is based on an underlying model of the metadata it queries. The initial query language in Dienst assumes a minimal data model, in which documents have an author, title, and abstract in addition to the publisher and number. A query may refer to any of these fields; if it refers to more than one, the terms are connected with an implicit "and". Thus one might query for all documents by author "Wilson" at publisher "Stanford".
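The implicit-and rule can be captured in a few lines. The case-insensitive substring matching below is an assumption about the matching semantics, not something the protocol specifies.

    def matches(record, **query):
        """Return True if a bibliographic record (a dict of field -> text)
        satisfies every given field: an implicit 'and' over the fields of
        the minimal data model (author, title, abstract, publisher, number)."""
        return all(value.lower() in record.get(field, "").lower()
                   for field, value in query.items())

    # e.g. matches(rec, author="Wilson", publisher="Stanford")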
A search request returns a document of type text/x-dienst-response, consisting of records containing meta-information on all the matching documents. This meta-information follows the encoding proposed for Uniform Resource Characteristics (URC) [URC]. The URC draft proposes fields such as title, author, Content-type, and URL, all of which are obviously applicable; we have added a number of experimental attributes.
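To make the shape of such a response concrete, here is a hedged parsing sketch. It assumes an RFC-822-like layout, with one "Attribute: value" line per field and a blank line between records; the actual encoding is whatever the URC draft [URC] specifies.

    def parse_dienst_response(text):
        """Parse a search response into a list of dicts, one per matching
        document, under the layout assumption described above."""
        records, current = [], {}
        for line in text.splitlines():
            if not line.strip():
                if current:
                    records.append(current)
                    current = {}
            elif ":" in line:
                key, value = line.split(":", 1)
                current[key.strip().lower()] = value.strip()
        if current:
            records.append(current)
        return records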
One uses Dienst by connecting to any convenient Dienst server (that supports the user interface services) using a standard Web client. This server will display a form for searching the collection. Unless the user restricts the search to a single publisher, all Dienst servers are searched in parallel. Each Dienst server is made aware of all other Dienst servers by fetching a list of all servers from a single, central meta-server. Thus when a new server comes online, other servers become aware of it after only a short time. The results from a search are displayed as a list of the DocID, author, title, and date for each matching document, and include a URL for each document. Selecting one displays the document in more detail, including a list of the available formats (obtained as described above). The user can retrieve the document in any of the formats.
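The fan-out might be organized along the lines below. The server-list handling and URL construction are left abstract, all names are illustrative, and each raw response would then be parsed as in the earlier sketch.

    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    def search_all(servers, query_url_for, timeout=30):
        """Send the same search to every known Dienst server in parallel
        and collect the raw responses.  'servers' would come from the
        central meta-server; query_url_for(server) builds that server's
        search URL."""
        def ask(server):
            try:
                with urllib.request.urlopen(query_url_for(server),
                                            timeout=timeout) as r:
                    return server, r.read().decode("utf-8", "replace")
            except OSError:
                return server, ""   # an unreachable server contributes nothing
        with ThreadPoolExecutor(max_workers=min(16, max(1, len(servers)))) as pool:
            return dict(pool.map(ask, servers))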
Some repositories include page images as 4-bit, 72 dot-per-inch GIF files. When this is the case, the user interface service can display the document a page at a time, inline in the user's Web client. We have found that such pages are readable on most monitors and save considerable network bandwidth compared to the 600 dpi TIFF images. In addition, some sites also store reduced-size "thumbnail" page images, which let the user quickly browse through a document and then click to view an interesting page (say, one that contains a graphic) at full-page size. Although we have not done any formal user studies, anecdotal evidence suggests that this is a very powerful and helpful feature.
The server also allows the user to download and/or print all or selected pages of the document. Local users may print directly, while remote users can download a PostScript version of the document and then print it manually. Since not all documents are available in PostScript, the server can translate from TIFF images to Level 2 PostScript on the fly.
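One plausible way to implement this on-the-fly translation is to shell out to the tiff2ps utility that ships with the TIFF package cited below [Leffler]. The sketch assumes that tool is on the server's path and that its -2 option selects Level 2 output in the installed release; check the manual page of your version.

    import subprocess

    def tiff_pages_to_postscript(tiff_files, ps_path):
        """Concatenate scanned page images into one Level 2 PostScript file
        by running tiff2ps (assumed installed; '-2' assumed to request
        Level 2 PostScript)."""
        with open(ps_path, "wb") as out:
            subprocess.run(["tiff2ps", "-2", *tiff_files],
                           stdout=out, check=True)

    # e.g. tiff_pages_to_postscript(["p001.tif", "p002.tif"], "TR94-1418.ps")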
At Cornell we have implemented a set of tools that mostly automate the process of managing a digital library. The tools are closely integrated with the Dienst digital library server. They are similar in spirit to those implemented for the Wide Area Technical Report Server ([WATERS]) system, known as Techrep, but whereas Techrep is designed to maintain the centralized index and unstructured FTP-based document repository that is characteristic of WATERS, the tools described here are tailored for the distributed indexes and structured repositories characteristic of Dienst.
Our design goal was to make the digital library maintainable by a document librarian (DL) with relatively little computer training. This DL serves four major roles: 1) as the general manager of the collection; 2) as the reviewer of document submissions, to protect against counterfeit submissions; 3) as the clearing house for copyright issues; and 4) as the archiver of document hardcopy. This system has recently been installed in the Cornell Computer Science Department and is now the means for all technical report submissions.
The author submits a document by completing an HTML form that contains text fields for bibliographic data about the document. These fields are the document title, author(s), pathname of the PostScript file, abstract, and submitter's e-mail address. The submitter can quickly complete this form by "cutting and pasting" text from the document source.
The remainder of the process is fully automated. Software that is integrated with the digital library server generates the RFC-1357 bibliographic entry from the submitter's entry, checks the validity of the PostScript file, builds the actual database entry, and generates the GIF images for online viewing and browsing of the document.
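The bibliographic step amounts to reformatting the form fields into the record syntax of RFC 1357. The sketch below shows only a handful of the fields defined by the RFC, and the version string and date formatting are assumptions; a production entry would carry the full required set.

    from datetime import date

    def rfc1357_entry(docid, title, authors, abstract):
        """Emit a minimal RFC-1357-style bibliographic record from the
        submission form fields, using the RFC's 'TAG:: value' line syntax.
        Only a few illustrative fields are included."""
        lines = [
            "BIB-VERSION:: CS-TR-v2.0",          # version tag; check the RFC
            f"ID:: {docid}",
            f"ENTRY:: {date.today():%B %d, %Y}", # date the record was created
        ]
        lines += [f"AUTHOR:: {name}" for name in authors]
        lines += [f"TITLE:: {title}",
                  f"ABSTRACT:: {abstract}",
                  f"END:: {docid}"]
        return "\n".join(lines)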
The image conversions in this process are done with the Extended Portable Bitmap Toolkit ([PBMPLUS]). PBMPLUS consists of a number of filters for conversion between a variety of image formats (TIFF, GIF, X bitmaps, etc.) and a small set of portable formats, and a set of tools to perform manipulations (rotations, color transformations, scaling) on the portable-format files. PBMPLUS has the advantages of being free, quite reliable, able to handle a wide variety of graphics formats, and quite powerful in its basic image-manipulation capabilities.
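A typical pipeline for producing the inline GIF page images might chain the PBMPLUS filters as below. The target width and the 16-level quantization are assumptions chosen to approximate the 72 dpi, 4-bit images described earlier.

    import subprocess

    def tiff_to_gif(tiff_path, gif_path, width=612, colors=16):
        """Run a PBMPLUS pipeline: tifftopnm | pnmscale | ppmquant | ppmtogif.
        612 pixels is roughly 72 dpi across an 8.5-inch page and 16 colors
        approximates 4-bit grayscale; both values are illustrative."""
        pipeline = (f"tifftopnm {tiff_path} | pnmscale -xsize {width} "
                    f"| ppmquant {colors} | ppmtogif > {gif_path}")
        subprocess.run(pipeline, shell=True, check=True)

    # A thumbnail could use the same pipeline with a much smaller width.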
We have just begun to use this automated system in the Computer Science Department at Cornell. At a later time we will evaluate the effectiveness of the system, with special attention paid to the number of documents that require a special submission procedure (i.e., are not translatable to PostScript). Obviously, if the ratio of these to the number of submitted documents is high, we will need to rethink the design of the system.
The one form common to all existing documents is hardcopy - the department maintains archival copies of the entire TR corpus. A production scanning facility on campus allowed the department to convert the entire corpus to high-quality, 600 dpi, Group 4-compressed TIFF images. Over a nine-month period all hardcopy pages were scanned to individual TIFF files and downloaded via FTP to disk in the Computer Science Department. Each TIFF file ranges in size from around one kilobyte for a blank page to almost two megabytes for a page that contains a high-quality photographic image. The total collection of page images now occupies around 3.6 gigabytes.
It should be noted that scanning a collection, even one as modest as the Cornell CS TRs, is time consuming, labor intensive, and not without problems. Even the most careful scanning technician occasionally misses pages, skews pages, or misses part of a page due to an unnoticed fold when the page is put on the scanner bed. These problems are difficult, if not impossible, to detect automatically. In addition, any problems that are detected are computationally intensive to correct. For example, a simple ninety-degree rotation of a 600 dpi TIFF image (due to incorrect scanning orientation) can take up to thirty minutes on a reasonably equipped SPARCstation 10.
An example illustrates the difficulty of correcting scanning problems. We discovered after all scanning was complete that many of our older TRs were scanned from pages that were oriented in landscape mode - two pages side by side. The result was a TIFF file containing two page images, which made correct page mapping impossible in the document server. While it was easy to find files with this problem (by reading the height and width from the TIFF header with a publicly available TIFF package [Leffler]), reasonably quick correction required handcrafting C code to split the files. Even with the handcrafted code, the location and correction process took over a week of compute time on a powerful workstation.
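The detect-and-split logic itself is simple. The sketch below restates it using the modern Pillow imaging library purely for illustration; the actual fix was handcrafted C against the TIFF package [Leffler], and multi-page handling, compression, and resolution tags are ignored here.

    from PIL import Image  # Pillow, used here only for illustration

    def split_if_landscape(path, left_out, right_out):
        """If a scanned page image is wider than it is tall (a two-page
        landscape scan), split it down the middle into two portrait pages.
        Returns True if a split was performed."""
        with Image.open(path) as img:
            width, height = img.size
            if width <= height:
                return False          # already a single portrait page
            mid = width // 2
            img.crop((0, 0, mid, height)).save(left_out)
            img.crop((mid, 0, width, height)).save(right_out)
            return True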
In addition to manually scanning documents, we also had to manually enter the RFC-1357 bibliographic files. While it would have been easy to write translators between RFC-1357 and other common bibliographic formats such as BibTeX, refer, etc., a consistent electronic bibliographic format was not available for all the TRs.
First, Dienst provides a uniform protocol for search, retrieval, and display of documents. This protocol addresses a flexible document model where each document has a unique name, can be in multiple formats, and consists of a set of named parts. These parts can be physical, such as pages, or logical, such as chapters and tables. In addition, the protocol allows full interoperability between distributed digital library servers. The result is that the user sees a single virtual document collection.
Second, Dienst provides a set of tools that permit easy management of a digital library. These tools automate document submission, permit a document librarian to manage the collection, and facilitate the production of archival hardcopy.
We plan over the next year to build on this technology in a number of ways. Installation of the digital library server is too difficult; we intend to implement tools that will "auto-configure" the server. The search engine in the current implementation is primitive; we intend to include more advanced search engines, for example full-text search, to make document discovery in a collection more powerful and easier. The current strategy of conducting a parallel search over all servers does not scale to a very large number of servers; we intend to use meta-information about individual document servers to improve the search strategy. With this facility, one could, for example, choose to search only those libraries that have a high probability of containing computer science documents. We plan to examine and possibly incorporate current work on copyright servers, so Dienst might be used for commercial documents. Finally, we hope to use some of the current work in location-independent identifiers to refine the method by which documents on the net are addressed in Dienst.
[Dienst] James R. Davis, Carl Lagoze. A protocol and server for a distributed digital technical report library. Cornell University Computer Science Department Technical Report 94-1418, June 1994.
[DIENSTPROT] James R. Davis, Carl Lagoze. Dienst, A Protocol for a Distributed Digital Document Library. Internet Draft.
[EZPUB] Cornell Information Technologies. How to Use EZ-PUBLISH and the Docutech Printer at Cornell Information Technologies. November 24, 1993.
[GLOSS] Luis Gravano, Hector Garcia-Molina, Anthony Tomasic. The Efficiency of GLOSS for the Text Database Discovery Problem. Stanford University Technical Report CS-TN-93-2.
[Leffler] Sam Leffler. Public TIFF package. Available via FTP from sgi.com/graphics/tiff/v3.2beta.tar.Z.
[MIME] Nathaniel S. Borenstein, Ned Freed. MIME (Multipurpose Internet Mail Extensions). RFC-1521.
[PBMPLUS] Jef Poskanzer. Extended Portable Bitmap Toolkit. Available from many anonymous FTP sites including ftp.ee.utah.edu.
[Putz] Steve Putz. Design and Implementation of the System 33 Document Service. Xerox PARC P93-00112, 1993.
[SIFT] Online service at http://sift.stanford.edu.
[TIFF] Aldus Corporation. TIFF Revision 6.0 Specification.
[URC] Michael Mealling. Encoding and Use of Uniform Resource Characteristics. Internet Draft.
[URL] Tim Berners-Lee. Uniform Resource Locators (URL). Internet Draft.
[WWW] Tim Berners-Lee, Robert Cailliau, Jean-François Groff, and Bernd Pollermann. World-Wide Web: The Information Universe. Electronic Networking: Research, Applications and Policy 2(1):52-58, 1992.
[WATERS] Kurt J. Maly, Edward A. Fox, James C. French, and Alan L. Selman. Wide Area Technical Report Server. Available online at http://www.cs.odu.edu/WATERS/WATERS-paper.ps.
Carl Lagoze works for the Computer Science Department at Cornell University as a Senior Software Engineer in the CSTR project. He received a Master of Software Engineering from the Wang Institute of Graduate Studies in 1987. After receiving his degree he worked in both academia and the commercial world developing tools for the generation of language-specific editors. Over the past two years he has discovered the joys of digital libraries and the fascinating world of information capture and access. From the view of his non-technical friends, he is doing "something on that information superhighway." Mr. Lagoze is also the proud parent of the cutest baby ever and an avid cyclist and canoeist.
Contact author: davis@dri.cornell.edu, 607-255-1134