Where a gopher interface to an archive is available, it provides a menu-based view in which the archive administrator can describe each resource, albeit in a maximum of around 70 characters given the terminal sizes that most common gopher clients run on.
More commonly, only the anonymous FTP form was available, providing just a UNIX shell-like interface to the archive.
Usually, a donated file was lucky to have a single line of text describing its contents; more often the filename was the best hint to the package (foobar-1.3.tar.gz for version 1.3 of package foobar). Sometimes there would be a README or index file in the same directory, describing each of the files in natural language. That is fine for people prepared to read every single README file in the archive, but natural language is a poor way to describe the files: the information is unstructured and hence not machine readable or writable. There are also additional problems:
In the first three cases it would be easy to allow only materials with correctly formatted metadata onto the archive, but the latter cases are more difficult. Mirroring is a process which makes an identical copy (a clone, or mirror image) of the files on a remote site at the local site. These files cannot be modified locally, so any metadata must be held externally, in other files. For the newsgroups, a large number of articles is archived daily, so generating the metadata by hand would be very tedious. To handle all of the above sources, a metadata standard was therefore needed with the following requirements:
Investigations were done into the metadata forms that were available at the time:
Title, Version, Entered-date, Description, Keywords, Author, Maintained-by, Primary-site, Alternate-site, Original-site, Platforms, Copying-policy
The form of the entries is similar to Internet Mail headers[4], with colon-separated attribute-value pairs that can wrap over several lines. There is a short description of the valid values for each field, but little concrete definition of the data format: most of it is free-form text.
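For illustration, a minimal LSM entry for the hypothetical foobar package mentioned earlier might look something like this (the values are invented, and the exact layout conventions of real entries may differ):

Title:          foobar
Version:        1.3
Entered-date:   1995-07-01
Description:    A hypothetical package, used here only to show the
                layout of an LSM entry.
Keywords:       example, hypothetical
Author:         a.n.author@host.site.country (A. N. Author)
Maintained-by:  a.n.author@host.site.country (A. N. Author)
Primary-site:   sunsite.unc.edu /pub/Linux/apps
Platforms:      UNIX
Copying-policy: freely redistributable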
Later on, tools were built to process these templates, index them and create such things as the Linux Software Map[5]. At present, submissions to the archive may be rejected by the maintainers if they do not come with an LSM template.
Unfortunately, the LSM templates are very much intended for software packages that are replicated at different sites and hence are not particularly appropriate for indexing a much richer set of files.
The draft has a rich range of templates, attributes and values that can be used to describe common and useful elements. The goal is for these to be used to index archives and to be made publicly available within them, allowing the archive contents, services and administrative data to be searched, indexed and shared.
This template scheme is based on the same RFC822-like form as the LSM templates, with colon-separated attribute-value pairs known as data elements. One or more data elements are collected into templates, each of which has a single Template-Type field describing its basic type. Multiple templates can be collected in index files, separated by blank lines. The attributes can be structured in several ways:
There are 14 currently defined template types:
SITEINFO, LARCHIVE, MIRROR, USER, ORGANIZATION, SERVICE, DOCUMENT, IMAGE, SOFTWARE, MAILARCHIVE, USENET, SOUND, VIDEO and FAQ, and each has appropriate attributes defined for it. Most of the types are self-explanatory apart from SITEINFO, which describes the FTP site itself, and LARCHIVE, which describes a logical (sub-)archive. More types can be defined if necessary, having the same basic attributes as DOCUMENT.
It also turns out that the LSM templates were based on an early draft of the IAFA templates (June 1992), modified to have more consistent elements. Later versions were modified to be more similar, but some differences remain.
IAFA templates were chosen as the basis of my metadata. They were rich, extensible and a standard, albeit a draft one.
Template-Type: DOCUMENT
URI: path
Description: description

but of course, not everything is a document and more intelligence was needed to determine the metadata.
The information is structured hierarchically and hence there is a need to list the sub-directories of any given directory. There is no clean way to do this in the draft; the only way would be to rely on a convention that a DOCUMENT with a URI ending in '/' is a directory. It is better to add a Template-Type of DIRECTORY, since a directory is not a document. Another template type that was added was EVENT, used to describe conferences, workshops and other items that have a date range.
In addition, there was no way to describe symbolic links. These are used in my archive to point from one area to another, so that the directories /parallel/transputer/compilers/occam and /parallel/occam/compilers have the same content although the names of the final directories differ. If the alternative, a site-relative URI, were used, the directory names would be the same and hence confusing for the browser. A simple extension to the format of the URI field allowed symbolic links to be added.
Another extension concerned how templates are separated. The draft uses a blank line, defined as either an empty line or a line consisting only of white space. I use just the former, an empty line, so that paragraph breaks can be put into descriptions and other text (see the example later).
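As a minimal sketch of the difference this makes (in Python, with invented names; this is not the actual indexing code), an index file can be split into templates on truly empty lines so that whitespace-only lines survive inside descriptive text:

def read_templates(path):
    """Split an index file into templates separated only by truly empty
    lines; lines containing just white space are kept, so they can act
    as paragraph breaks inside descriptive text."""
    templates, current = [], []
    with open(path) as index:
        for line in index:
            if line == "\n":             # an empty line ends a template
                if current:
                    templates.append(current)
                    current = []
            else:                        # whitespace-only lines are kept
                current.append(line.rstrip("\n"))
    if current:
        templates.append(current)
    return templates

Parsing the colon-separated data elements and their continuation lines out of each template is then a separate step.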
Extra elements that were added include:
In later versions of the draft, this was changed to be the MIME[6] type of the document but this would not be sufficient to describe the tar files above so the earlier version of the definition was kept. MIME types could be added easily.
update-afa-indices operates on a directory tree (or a subtree of it). In each directory there is a single index file, AFA-INDEX, containing the templates for each of the files and directories there. There is also a configuration file, .ixconfig, that allows specific files or directories to be excluded from the index. This allows mirrored areas to be kept identical to the remote site without all of the files having to be shown. For example, if the entire contents of a text index file are represented in the index, there is no need to include a reference to that file.
The program walks the tree, reading the IAFA indices and looking for differences between the templates and the entries in the file system for items 1 and 2 from the list above. Only if there is a difference is item 3 calculated, since it is an expensive operation. If an entry is new, it is appended to the end of the index; files or directories that have been deleted are automatically removed from the index. After all the processing is done, the index file is sorted by fields configurable in the .ixconfig file.
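The following is a rough sketch of one directory's pass of that update (in Python; it is not the actual program). It assumes that items 1 and 2 are the entry's size and last-revision date and that item 3 is the expensive derived data such as the Format field; the field names and the derive_format helper are invented, and the .ixconfig exclusions and final sorting are omitted:

import os, time

def derive_format(path):
    # Placeholder for the expensive "item 3" work; the real program
    # derives richer data such as the file's format.
    return "directory" if os.path.isdir(path) else "file"

def update_directory(directory, templates):
    """One directory's update pass.  'templates' maps each name in the
    AFA-INDEX to a dictionary of its data elements."""
    present = set()
    for name in sorted(os.listdir(directory)):
        present.add(name)
        path = os.path.join(directory, name)
        st = os.stat(path)
        size, date = st.st_size, time.ctime(st.st_mtime)
        entry = templates.get(name)
        if entry is None:
            # A new file or directory: append a fresh template.
            templates[name] = {"URI": name, "Size": size, "Date": date,
                               "Format": derive_format(path)}
        elif (entry["Size"], entry["Date"]) != (size, date):
            # Items 1 or 2 differ, so the expensive item 3 is recomputed.
            entry.update(Size=size, Date=date, Format=derive_format(path))
    # Entries whose files have been deleted drop out of the index.
    for name in list(templates):
        if name not in present:
            del templates[name]
    return templates

A complete example template follows.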
Template-Type: EVENT
Description: Call for papers for the Fifth International Conference on
 Parallel Computing (ParCo'95) being held from 19th-22nd September 1995
 at International Conference Center, Gent, Belgium.
 Topics: Applications and Algorithms; Systems Software and Hardware.
 Deadlines: Abstracts: 31st January 1995; Notification: 15th April 1995;
 Posters: 30th June 1995.
 See also <URL:http://www.elis.rug.ac.be/announce/parco95/cfp.html>
Author-Email: a.n.author@host.site.country
Author-Name: A. N. Author
Title: Fifth International Conference on Parallel Computing
X-Acronym: ParCo'95
X-End-Date: 1995-09-22
X-Expires-Date: 1995-09-22
X-Start-Date: 1995-09-19
Format-v0: ASCII document
Format-v1: PostScript document
Last-Revision-Date-v0: Wed, Jan 11 11:24:39 1995 GMT
Last-Revision-Date-v1: Wed, Sep 21 10:41:01 1994 GMT
Size-v0: 4516
Size-v1: 71330
URI-v0: parco95.ascii
URI-v1: parco95.ps
X-Gopher-Description-v0: 5th Int. Conference on Parallel Computing (ParCo'95) CFP (ASCII)
X-Gopher-Description-v1: 5th Int. Conference on Parallel Computing (ParCo'95) CFP (PS)

which describes a pair of files for a conference call. The derived text index output would be:
parco95.ascii   "Fifth International Conference on Parallel Computing"
    Call for papers for the Fifth International Conference on Parallel
    Computing (ParCo'95) being held from 19th-22nd September 1995 at
    International Conference Center, Gent, Belgium. Topics: Applications
    and Algorithms; Systems Software and Hardware. Deadlines: Abstracts:
    31st January 1995; Notification: 15th April 1995; Posters: 30th June
    1995. See also <URL:http://www.elis.rug.ac.be/announce/parco95/cfp.html>
    Author: A. N. Author <a.n.author@host.site.country>. [ASCII document]

parco95.ps   "Fifth International Conference on Parallel Computing"
    Call for papers for the Fifth International Conference on Parallel
    Computing (ParCo'95) being held from 19th-22nd September 1995 at
    International Conference Center, Gent, Belgium. Topics: Applications
    and Algorithms; Systems Software and Hardware. Deadlines: Abstracts:
    31st January 1995; Notification: 15th April 1995; Posters: 30th June
    1995. See also <URL:http://www.elis.rug.ac.be/announce/parco95/cfp.html>
    Author: A. N. Author <a.n.author@host.site.country>. [PostScript document]

and the derived gopher elements would be these entries in the gopher tree:
5th Int. Conference on Parallel Computing (ParCo'95) CFP (ASCII)
5th Int. Conference on Parallel Computing (ParCo'95) CFP (PS)

and the HTML element would be (as part of a conformant HTML 2.0 index file):
<DL>
<DT><A NAME="parco95.ascii" href="/www4/parco95.ascii"><B>Fifth International
    Conference on Parallel Computing (<I>ParCo'95</I>)</B></A> [ASCII
    document] (4516 bytes)<BR>
<DT><A NAME="parco95.ps" href="/www4/parco95.ps"><B>Fifth International
    Conference on Parallel Computing (<I>ParCo'95</I>)</B></A> [PostScript
    document] (71330 bytes)<BR>
<DD>Call for papers for the Fifth International Conference on Parallel
    Computing (ParCo'95) being held from 19th-22nd September 1995 at
    International Conference Center, Gent, Belgium. <P>
    <I>Topics:</I> Applications and Algorithms; Systems Software and
    Hardware.<P>
    <I>Deadlines:</I> Abstracts: 31st January 1995; Notification: 15th
    April 1995; Posters: 30th June 1995.<P>
    See also <A href="http://www.elis.rug.ac.be/announce/parco95/cfp.html">http://www.elis.rug.ac.be/announce/parco95/cfp.html</A><P>
    Author: A. N. Author (<I>a.n.author@host.site.country</I>).
</DL>

which looks like this when displayed formatted:
Topics: Applications and Algorithms; Systems Software and Hardware.
Deadlines: Abstracts: 31st January 1995; Notification: 15th April 1995; Posters: 30th June 1995.
See also http://www.elis.rug.ac.be/announce/parco95/cfp.html
Author: A. N. Author (a.n.author@host.site.country).
This software is also configurable via the .ixconfig file, which allows hand-written indices, e.g. the top-level index.html that forms the home page, to be left untouched. In addition, some areas can be left without indices, for example directories containing icons used in the HTML pages.
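As a minimal sketch (again in Python, not the actual generator), one <DT> line of the kind shown above could be produced from a parsed template, with the -v0/-v1 variant fields selected by a version number:

def html_entry(base_url, template, version):
    """Produce one <DT> line like those in the example above; 'version'
    selects the -v0/-v1 variant data elements of the template."""
    uri = template["URI-v%d" % version]
    return ('<DT><A NAME="%s" href="%s/%s"><B>%s</B></A> [%s] (%s bytes)<BR>'
            % (uri, base_url, uri, template["Title"],
               template["Format-v%d" % version],
               template["Size-v%d" % version]))

For the ParCo'95 template, html_entry("/www4", template, 0) would give the first <DT> line of the example (apart from the italicised acronym in the title, which this sketch does not attempt).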
There are also problems of encoding: there is no way to use binary data, non-ASCII characters or, indeed, blank lines as paragraph breaks in descriptions (without the extension I used). Some of these problems have been addressed in other formats, and other metadata standards for different purposes are being designed which may provide a rich enough structure to cope with these difficulties.
Subject, Title, Author, Publisher, OtherAgent, Date, ObjectType, Form, Identifier, Relation, Source, Language, Coverage

The elements are syntax-independent; no single encoding was defined and it was intended that they could be mapped into more complex systems such as SGML or USMARC[15] and could use any appropriate cataloguing code such as AACR2, LCSH or Dewey Decimal.
In addition to collated indexers like ALIWEB and Harvest, there are web crawlers that try to "index the web". These could benefit from rich metadata, provided by document authors or site administrators, that would be difficult to extract automatically. In the best of all worlds, each WWW site would create the metadata for each of the files it wants to make available to the world, and the results would be distributed automatically using a hierarchy of caches (for efficiency). ALIWEB and Harvest allow forms of such systems to be built, using IAFA templates and SOIF respectively. The web crawlers could then fetch just the new metadata, rather than crawling the web continually, and index it intelligently with their own software.
The use of gopher as an Internet service is declining while the use of HTTP (the WWW) is increasing rapidly. This system shows that the metadata can remain independent of, and richer than, the presentation format and can survive evolutionary changes in the technology. Similarly, when a new (or de facto) standard for metadata appears, it should be easy to derive that metadata from the IAFA templates.
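As a rough illustration of such a derivation (the field correspondences below are my own guesses, not a published crosswalk), a start could be made with a simple mapping table:

# An assumed correspondence between some IAFA data elements and the
# Dublin Core elements listed above; for illustration only.
IAFA_TO_DUBLIN_CORE = {
    "Title": "Title",
    "Author-Name": "Author",
    "Last-Revision-Date-v0": "Date",
    "Format-v0": "Form",
    "URI-v0": "Identifier",
}

def to_dublin_core(template):
    """Translate a parsed IAFA template into Dublin Core element pairs."""
    return {dc: template[iafa]
            for iafa, dc in IAFA_TO_DUBLIN_CORE.items()
            if iafa in template}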
The software can be found at the HENSA Unix Archive at:
<URL:ftp://unix.hensa.ac.uk/pub/tools/www/iafatools/>
<URL:http://www.hensa.ac.uk/tools/www/iafatools/>
2. L. McLoughlin, mirror, Imperial College, University of London, UK, <URL:ftp://src.doc.ic.ac.uk/packages/mirror/>
3. J. Kopmanis and L. Wirzenius, Linux Software Map Entry Template, August 1994, <URL:ftp://sunsite.unc.edu/pub/Linux/docs/lsm-template>.
4. D. Crocker, Standard for the Format of ARPA Internet Text Messages, RFC822, University of Delaware, August 1982, <URL:ftp://nic.merit.edu/documents/rfc/rfc0822.txt>.
5. T. Boutell and L. Wirzenius, Linux Software Map, June 1995, <URL:http://siva.cshl.org/lsm/lsm.html>.
6. N. Borenstein and N. Freed, MIME (Multipurpose Internet Mail Extensions), September 1993, <URL:ftp://nic.merit.edu/documents/rfc/rfc1521.txt> and <URL:ftp://nic.merit.edu/documents/rfc/rfc1522.txt>.
7. D. R. Hardy and M. F. Schwartz, Customized Information Extraction as a Basis for Resource Discovery, Technical Report CU-CS-707-94, Department of Computer Science, University of Colorado, Boulder, March 1994 (revised February 1995). To appear, ACM Transactions on Computer Systems.
8. D. Hardy, M. Schwartz and D. Wessels, Harvest User's Manual, University of Colorado, Boulder, USA, April 1995, <URL:http://harvest.cs.colorado.edu/harvest/user-manual/>.
9. C. Mic Bowman, P. B. Danzig, D. R. Hardy, U. Manber and M. F. Schwartz, The Harvest Information Discovery and Access System, Proceedings of the Second International World Wide Web Conference, pp. 763-771, Chicago, Illinois, October 1994.
10. M. Koster, ALIWEB, Proceedings of First International WWW Conference, 25-27 May 1994, CERN, Geneva, Switzerland. ALIWEB is at <URL:http://web.nexor.co.uk/public/aliweb/aliweb.html>.
11. R. Daniel Jr and M. Mealling, URC Scenarios and Requirements, Internet Draft, March 1995, <URL:ftp://nic.merit.edu/documents/internet-drafts/draft-ietf-uri-urc-req-01.txt>
12. P. E. Hoffman and R. Daniel Jr, Trivial URC Syntax: urc0, Internet Draft, May 1995, <URL:ftp://nic.merit.edu/documents/internet-drafts/draft-ietf-uri-urc-trivial-00.txt>
13. R. Daniel Jr and T. Allen, An SGML-based URC Service, Internet Draft, June 1995, <URL:ftp://nic.merit.edu/documents/internet-drafts/draft-ietf-uri-urc-sgml-00.txt>
14. S. Weibel, J. Godby and E. Miller, OCLC/NCSA Metadata Workshop Report, Dublin, Ohio, USA, March 1995, <URL:http://www.oclc.org:5046/conferences/metadata/dublin_core_report.html>.
15. USMARC Advisory Group, Mapping the Dublin Core Metadata Elements to USMARC, 1995, <URL:gopher://marvel.loc.gov/00/.listarch/usmarc/dp86.doc>.
(2) HENSA Unix Archive at <URL:http://www.hensa.ac.uk/>
(3) Linux archive, SunSITE USA at <URL:ftp://sunsite.unc.edu/pub/Linux/welcome.html>