Specifying metadata standards for metadata tool configuration

Andrew Waugh

CSIRO Mathematical and Information Sciences,
723 Swanston St, Carlton, VIC, 3053, Australia

andrew.waugh@cmis.csiro.au

Abstract
A critical problem for metadata applications is flexibility. A metadata application must be sufficiently flexible to cope with changes to metadata standards over time and to allow users to extend a standard to cope with local requirements. A key component of supporting flexible metadata applications is software which can be dynamically configured by a specification of the metadata standard. By contrast, in current metadata software the metadata standard is embedded in the code, making changes relatively more difficult and expensive. Configurable software also leads to better tools at a lower cost, as it is not necessary to re-implement functionality for every new metadata standard. This paper describes a metadata specification designed to support dynamic configuration of metadata software by capturing features of metadata standards. The specification comprises three components: the classification of the metadata standard, the metadata schema, and the metadata expression.

Keywords
Metadata data models; Metadata standards; Metadata software

1. Introduction

There are many objects available on the Internet, including documents, data sets, and programs. Metadata is the information associated with those objects that allows access to and manipulation of the objects. Typically, metadata describes what the object is (e.g. title, subject, keywords), how to use the object (e.g. where to retrieve it from, how it is encoded), and how the object is to be managed (e.g. lifecycle, relationships with other objects). A comprehensive list of metadata resources is maintained by IFLA [IFLA].

Different communities are standardising the metadata they need to access and manipulate the resources they use. This is producing a large number of metadata standards. Typical metadata standards include Dublin Core [DC] for resource discovery, GILS [GILS] for accessing government information, and ANZLIC [ANZLIC] and FGDC [FGDC] for describing geographical data sets.

An essential characteristic of any system is flexibility to adjust to change. For a metadata application, flexibility means the ability to extend the metadata standard quickly and easily. The ability to modify a metadata standard is needed for the following reasons:

It is relatively easy to change the standard itself; it is much harder to change the software that implements the standard. Current metadata software typically has the standard embedded in the code, and changing the standard means rewriting the code. The typical Dublin Core metadata creation tool, for example, is a CGI script targeted at a specific metadata application. Such scripts are inflexible: changes to the metadata standard require the script to be modified, and retargeting the script to a different metadata application would require it to be completely rewritten. Such scripts also have a very limited ability to handle complex structured metadata and have very limited functionality, being limited to data entry rather than being a true metadata editor.

This lack of sophisticated metadata manipulation tools highlights the advantage of the metadata specification approach advocated in this paper. The typical metadata tools have limited functionality and flexibility because they address only one metadata standard. Hence it is not worth expending a lot of resources on them as they have limited application.

An alternative is to prepare a metadata specification which is used to configure the metadata software. Ideally a change to the standard merely requires a change to the specification and no code needs to be touched. Other advantages of this approach are:

Our main interest in a metadata specification is to configure metadata software, but the specification could also be used to formally document a metadata standard.

This paper describes the results of research into what information can be included in a metadata specification designed to allow the easy configuration of metadata software. In Section 2, we describe the metadata specification we developed in detail, and Section 3 describes a metadata editor we developed to test the specification. The strengths and weaknesses of the specification approach are discussed in Section 4.

2. The metadata specification

A metadata specification captures some of the characteristics of a metadata standard. Ideally all characteristics would be captured by the specification, but the law of diminishing returns comes into play; some features are simply too complex to be worth the effort of capturing.

Our metadata specification is divided into three components: a metadata classification, the metadata schema, and the metadata expression. The metadata classification describes the descriptive power of the standard; how complex may the metadata values be? The metadata schema captures characteristics of individual metadata standards. The metadata expression captures how a metadata value is expressed during storage or transmission over a network.

2.1. Some definitions

A metadata instance consists of a set of facts about a resource. An instance can be represented as an acyclic directed labelled graph (Fig. 1).

Fig. 1.

The root node represents the resource being described. The leaf nodes represent primitive values (e.g. strings containing names), and the interior nodes represent structured values (i.e. a value where the information is structured into sub-properties). For convenience we treat structured values as ``resources''. This allows a convenient definition that an edge in the graph (Fig. 2) represents a property (fact) and links a resource and a value. The label of the edge is the property type.

Fig. 2.

This formal definition of metadata is based on that presented in RDF [RDF]; the major change is that RDF metadata may be cyclic, whereas we have restricted it to be acyclic. We do not consider this restriction to be of practical concern when dealing with metadata as cycles do not make sense; a cycle in a metadata value would mean that a part of a value is the value itself. Restricting the data model to acyclic graphs has the advantage of allowing the complex concept of reification in RDF to be much more simply expressed as annotations (to be described in Section 2.2.2).
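The data model above can be sketched in a few lines of Java. This is only an illustration under assumptions: the class and method names (`Property`, `Value`, `leaf`, `structured`, `countLeaves`) are hypothetical and are not taken from the software described in this paper, and the sketch does not enforce acyclicity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the Section 2.1 data model; all names are illustrative.
final class Property {
    final String type;   // the edge label (the property type)
    final Value value;   // the node the edge points to

    Property(String type, Value value) {
        this.type = type;
        this.value = value;
    }
}

// A node in the graph: either a primitive leaf or a structured interior node.
final class Value {
    final String primitive;                               // non-null only for leaf nodes
    final List<Property> properties = new ArrayList<>();  // outgoing labelled edges

    private Value(String primitive) { this.primitive = primitive; }

    static Value leaf(String text) { return new Value(text); }
    static Value structured() { return new Value(null); }

    boolean isLeaf() { return primitive != null; }

    // Adds a labelled edge from this node to a child; returns this for chaining.
    Value add(String type, Value child) {
        if (isLeaf()) throw new IllegalStateException("primitive values have no properties");
        properties.add(new Property(type, child));
        return this;
    }

    // Counts the primitive values reachable from this node.
    int countLeaves() {
        if (isLeaf()) return 1;
        int n = 0;
        for (Property p : properties) n += p.value.countLeaves();
        return n;
    }
}
```

A root resource with a primitive title and a structured creator would then be built as `Value.structured().add("title", Value.leaf("A report")).add("creator", Value.structured().add("name", Value.leaf("A. Author")))`.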

2.2. A metadata classification

We classify metadata standards into four classes based on their expressive power. In essence the classification answers the question ``in this standard, how complicated can the information in a metadata value be?'' Software written to handle one class of metadata standard will not be able to manipulate more powerful classes of metadata.

Metadata standards can be divided into four classes (Fig. 3) depending on whether the standard supports structured values and whether it supports annotated values.

Fig. 3.

The four classes overlap (Fig. 4) and a metadata standard that supports structured, annotated, metadata can support the simpler types of metadata.

Fig. 4.
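One way to picture the classification is as two independent capabilities. The sketch below is an assumption-laden illustration: the enum `MetadataClass` and its method `canHandle` are hypothetical names, not part of the specification described here.

```java
// Hypothetical encoding of the four classes of Fig. 3; names are illustrative.
enum MetadataClass {
    SIMPLE(false, false),
    SIMPLE_ANNOTATED(false, true),
    STRUCTURED(true, false),
    STRUCTURED_ANNOTATED(true, true);

    final boolean structuredValues;  // horizontal axis of Fig. 3
    final boolean annotatedValues;   // vertical axis of Fig. 3

    MetadataClass(boolean structuredValues, boolean annotatedValues) {
        this.structuredValues = structuredValues;
        this.annotatedValues = annotatedValues;
    }

    // True if software written for this class can manipulate metadata of the other class,
    // i.e. the other class needs no capability that this class lacks.
    boolean canHandle(MetadataClass other) {
        return (structuredValues || !other.structuredValues)
            && (annotatedValues || !other.annotatedValues);
    }
}
```

Under this reading, `STRUCTURED_ANNOTATED.canHandle(...)` is true for every class, while `SIMPLE.canHandle(SIMPLE_ANNOTATED)` is false.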

2.2.1. Structured values

The horizontal axis of Fig. 3 divides metadata standards into those which only support ``simple'' values and those which support ``structured'' values. The information in a ``simple'' value is in one undifferentiated lump. For example, a simple ``address'' property might contain:

	Address ("723 Swanston Street Carlton Victoria 3053, Australia").

A ``structured'' value structures the information value into properties. An address might be expressed:

	Address (
		Street ("723 Swanston Street")
		Suburb ("Carlton")
		State ("Victoria")
		Postcode ("3053")
		Country ("Australia"))

Explicitly labelling the information in a value in this way simplifies machine processing of the value as it is trivial for a program to extract a component. However, it can be more expensive to generate the value in the first place (as something or someone must identify the components).

Not all metadata standards support structured values. Strictly speaking, Dublin Core does not, although some uses of ``sub-elements'' are really structured values. Other standards (e.g. ANZLIC and GILS) have some structured values. A very few metadata standards (such as RDF) fully support structured values.

Conceptually, a value can be considered as a resource (Fig. 5), which can, recursively, contain a simple value or a structured value.

Fig. 5.

2.2.2. Annotated values

The vertical axis of Fig. 3 divides metadata standards into those which allow values to be annotated, and those which do not. An annotation is information about a value (as distinct from information which is part of the value).

For example, PICS allows a rating (the value) to be annotated with information about who assigned the rating. The current Dublin Core standard defines three types of annotation (called qualifiers): scheme (what standard the value was drawn from); subelement (a refinement of the semantics of the property type); and language (the language of the metadata value).

Conceptually, an annotation is a property where the resource is a value (Fig. 6).

Fig. 6.
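The idea that an annotation is simply a property attached to a value can be sketched as follows. The class `AnnotatedValue` and its members are hypothetical names chosen for illustration; the annotation types used in the usage note are the Dublin Core qualifiers mentioned above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a value that carries annotations (facts about the value itself).
final class AnnotatedValue {
    final String content;  // the value proper
    final Map<String, String> annotations = new LinkedHashMap<>();  // annotation type -> annotation

    AnnotatedValue(String content) { this.content = content; }

    // Attaches an annotation to this value; returns this for chaining.
    AnnotatedValue annotate(String type, String information) {
        annotations.put(type, information);
        return this;
    }
}
```

For example, `new AnnotatedValue("Specifying metadata standards").annotate("language", "en")` leaves the content of the title untouched and records the language as information about it.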

2.3. Metadata schema

The metadata schema is the configuration information that specialises the metadata software for a particular metadata standard. The schema, for example, turns a generic metadata editor into a Dublin Core editor, an ANZLIC editor, or a GILS editor.

The characteristics captured in the schema developed for this work fall into four groups: structural information, HCI information, validation information, and default information.

The grammar used to express the metadata schema in our example metadata editor is represented by the following BNF productions (Fig. 7):

Input		:= "root" Word ( Properties )+ 
Properties	:= Word "property" Definition
Definition	:= ( Description )? ( Label )? ( Logo )? ( NoValues )?
		   Validate ( Annotations )?
Description	:= "description" Url
Label		:= "label" String
Logo		:= "logo" Url
NoValues	:= "values" PosInt PosInt
ValuesOrdered	:= ( "valuesOrdered" | "valuesUnordered" )
Annotations	:= "annotations" Word ( "," Word )*
Validate	:= ( "container" "property" ContainerAttr )
		| ( "string" "property" StringAttr )
		| ( "integer" "property" IntAttr )
		| ( "real" "property" FloatAttr )
ContainerAttr	:= ( "set" | "sequence" ) "of" Word ( "," Word )*
StringAttr	:= ( "maxLength" PosInt )? ValidValues
IntAttr		:= ( "range" SRange Int "," Int ERange )? ValidValues
FloatAttr	:= ( "range" SRange Float "," Float ERange )? ValidValues
SRange		:= ( "[" | "(" )
ERange		:= ( "]" | ")" )
ValidValues	:= ("valid" "values" ("only")? String ("," String )* )?
		   ( "defaults" String ( "," String )* )?

Fig. 7.

Part of the schema definition for the Dublin Core standard is shown in Fig. 8.

-- Sample Dublin Core Schema
--
-- Uses the proposed qualifier list from
--	http://www.loc.gov/marc/dcqualif.html
-- and
--	http://sunsite.berkeley.edu/Metadata/types.html

root dublinCore

dublinCore property
	description 
	label "Dublin Core"
	logo 
	container property set of
		title, creator, subject, description, publisher,
		contributors, date, type, format, source,
		language, relation, coverage, rights

title property
	description 
	label "Title"
	string property
	annotations titlequal

	titlequal property
		label "Qualifiers"
		container property set of titleqt, lang
		
	titleqt property
		label "Type"
		string property
		valid values only "Alternative"

Fig. 8.

2.3.1. Structural information

Structural information describes how the information in the value may be structured (i.e. whether the value can contain structured properties). In the grammar, the structural information is the basic framework on which the other schema information is supported. The schema contains the following structural information:

2.3.2. HCI information

HCI (Human/Computer Interface) information is used to inform the human user about the semantics of the metadata. The HCI information for each property includes:

2.3.3. Validation information

Validation information is used to validate property values. Validation information is a primitive representation of the semantics of the attribute. The schema can specify:

Such a simple encoding of the semantics of a value is very limiting. Many properties have rules that govern valid values (e.g. telephone numbers), but to predefine them all in the specification for the schema is not feasible. There are also application-specific validations; for example, checking a part number or employee number against a database.
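The simple checks that the grammar of Fig. 7 can express (a maximum length, a closed list of valid values, a numeric range) might be implemented along the following lines. This is a sketch under assumptions: the class `SchemaChecks` and its methods are hypothetical, an empty string is used to signal a valid value, and ``['' and ``]'' are assumed to denote inclusive bounds while ``('' and ``)'' denote exclusive ones.

```java
import java.util.List;

// Hypothetical checks mirroring StringAttr and IntAttr in Fig. 7; names are illustrative.
final class SchemaChecks {
    private SchemaChecks() {}

    // Returns "" if the string is acceptable, otherwise an error message.
    // maxLength < 0 means no limit; an empty validValues list means any value is allowed.
    // This corresponds to the "valid values only" form of the grammar.
    static String checkString(String s, int maxLength, List<String> validValues) {
        if (maxLength >= 0 && s.length() > maxLength) {
            return "longer than " + maxLength + " characters";
        }
        if (!validValues.isEmpty() && !validValues.contains(s)) {
            return "not one of " + validValues;
        }
        return "";
    }

    // Returns "" if v lies in the range, otherwise an error message.
    // Assumption: '[' and ']' are inclusive bounds, '(' and ')' are exclusive bounds.
    static String checkInt(int v, char start, int lo, int hi, char end) {
        boolean aboveLo = (start == '[') ? v >= lo : v > lo;
        boolean belowHi = (end == ']') ? v <= hi : v < hi;
        return (aboveLo && belowHi) ? "" : "outside range " + start + lo + "," + hi + end;
    }
}
```

With the `titleqt` property of Fig. 8, `checkString("Alternative", -1, Arrays.asList("Alternative"))` would accept the value, and any other string would be rejected.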

Distributed object technologies offer an alternative. A validation object can encapsulate a validation test for a particular property (Fig. 9). For example, the validation object could encapsulate the Human Resources database. The metadata software would send a request containing an employee number and be returned an indication as to whether it was valid. The test implemented by the validation object can be arbitrarily complex, but the interface can be very simple: String Validate(String), where the object is passed a string to validate and returns a null string if the string is valid or an error message otherwise.

Fig. 9.
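The validation object interface might look as follows in Java. This is a minimal sketch under stated assumptions: the method is spelled `validate` to follow Java naming conventions, the ``null string'' is interpreted as the empty string, and an in-memory set stands in for the Human Resources database (a real validation object would sit behind a distributed object interface). The names `Validator` and `EmployeeNumberValidator` are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// The single-method interface described in the text.
// Assumption: "" (the empty string) signals a valid value; anything else is an error message.
interface Validator {
    String validate(String value);
}

// Hypothetical validation object; the set of known numbers stands in for a database lookup.
final class EmployeeNumberValidator implements Validator {
    private final Set<String> knownNumbers;

    EmployeeNumberValidator(Set<String> knownNumbers) {
        this.knownNumbers = new HashSet<>(knownNumbers);  // defensive copy
    }

    @Override
    public String validate(String value) {
        return knownNumbers.contains(value) ? "" : "unknown employee number: " + value;
    }
}
```

The metadata software only ever sees the `Validator` interface, so the test behind it can be replaced or made more elaborate without changing the editor.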

It would be equally possible to implement the validation object using Java applets and for the metadata software to download the applets. However, a value would typically be much smaller than the code for the validation object and hence it would normally be more efficient to transfer the value to the validation object than to transfer the validation object to the metadata software.

2.3.4. Default information

Default information contains the initial contents of a newly created value.

2.4. Metadata expression

The metadata expression is how a value is expressed (stored or transmitted) outside a metadata application. There are many different equivalent ways of expressing a metadata value; the binary based expressions used for ASN.1, for example, are quite different to the character based expressions used for HTML.

In practice, the metadata expression is closely related to the classification of the standard as the expression format must support the expressive power of the standard. For example, Dublin Core was initially a simple, non annotated metadata standard and could be expressed in HTML-2.0. Considerable problems were caused when Dublin Core was extended to support qualifications (i.e. it changed to a simple, annotated, metadata standard) which could not be expressed in HTML-2.0. The work-around was to encode the qualification in the value, but it was recognised that this would cause problems with indexing. The eventual solution was to lobby for HTML-4.0 to include support for qualifications.

The metadata expression is a problem from the point of view of a metadata specification. The wide range of possible ways of expressing metadata makes it difficult to encode the expression in a configuration file, as was done for the metadata schema.

We chose, instead, to implement each expression as a Java class. All expression classes are subclasses of an abstract class ``Instance''. Instance defines five methods: express(), parse(), addControls(), removeControls(), and action(). The first two methods generate and parse the metadata expressions. The last three methods allow the user of the editor to control options available in the expression (e.g. selection of HTML-4.0 instead of HTML-2.0).
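The shape of such an expression class can be sketched as below. Only the class name ``Instance'' and the five method names come from the description above; every parameter and return type is an assumption, as is the subclass `LineInstance`, which uses a made-up one-line-per-property format purely for illustration (it is not one of the expressions implemented for the editor, and it handles only simple, non-annotated values).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reconstruction: method names are from the text, signatures are assumed.
abstract class Instance {
    abstract String express(Map<String, String> metadata);  // generate the external form
    abstract Map<String, String> parse(String text);         // read the external form back
    abstract void addControls(List<String> controls);        // offer expression options to the user
    abstract void removeControls(List<String> controls);     // withdraw those options
    abstract void action(String control);                    // react when the user picks an option
}

// Illustrative expression: one "name=value" line per property.
final class LineInstance extends Instance {
    private static final String USE_COLON = "use colon";
    private String separator = "=";

    @Override
    String express(Map<String, String> metadata) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            out.append(e.getKey()).append(separator).append(e.getValue()).append('\n');
        }
        return out.toString();
    }

    @Override
    Map<String, String> parse(String text) {
        Map<String, String> metadata = new LinkedHashMap<>();
        for (String line : text.split("\n")) {
            int i = line.indexOf(separator);  // split at the first separator only
            if (i > 0) {
                metadata.put(line.substring(0, i), line.substring(i + separator.length()));
            }
        }
        return metadata;
    }

    @Override
    void addControls(List<String> controls) { controls.add(USE_COLON); }

    @Override
    void removeControls(List<String> controls) { controls.remove(USE_COLON); }

    @Override
    void action(String control) {
        if (USE_COLON.equals(control)) separator = ":";
    }
}
```

The option mechanism here plays the role that the choice between HTML-2.0 and HTML-4.0 plays in the text: the editor asks the expression which controls it offers and passes the user's choice back through `action()`.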

3. PrismEd: a generic metadata editor

To test and develop the ideas presented in this paper we have implemented a generic metadata editor named PrismEd. It allows a user to create or read a metadata instance, edit it, and store the result. PrismEd was designed to edit metadata that has structured, annotated values. Specification files were written to configure PrismEd to edit Dublin Core metadata, ANZLIC metadata, and GILS metadata. A limited function version of the editor can be downloaded as an applet [PrismEd].

PrismEd can be used to create metadata from scratch, but we expect that its main role will be as a component in a metadata management process (Fig. 10). In this process, most of the metadata instance is captured automatically (e.g. extracted from the underlying data, or from the system that produced the data). PrismEd would be used to check and augment this automatically generated information. The editor would also be subsequently used to maintain the metadata.

Fig. 10.

The user interface design of PrismEd attempts to display the maximum amount of information whilst retaining an uncluttered display. When editing metadata, we have found that as much of the metadata as possible should be displayed on the screen as this aids comprehension. Unfortunately, text editing components can use up a lot of screen real estate, particularly if scrollbars are necessary. This problem is compounded in metadata applications where the application often allows lengthy values (say 2000 characters), but the actual values are normally very much shorter than this. PrismEd compromises by placing metadata values on buttons. Clicking on the button pops up a window allowing the value to be edited. In Fig. 11, the user is editing a Geographic Extent Polygon value.

Fig. 11.

PrismEd is written in Java 1.0.2 and is currently about 5000 lines of code. It was written to run as either an applet or an application; unfortunately both have limitations. It is normally run as an application as PrismEd loads faster, the run time environment is less buggy, and it can read and write files from the local file system. The disadvantage is that the run time environment lacks the integrated network environment of a browser. Classes must be found to interpret HTML and generic processing of URLs is difficult.

4. Results and further work

Generation of a schema from a moderately complicated standard (e.g. ANZLIC) for PrismEd required about four hours (including testing, but excluding generation of help files). This is much faster than any other method of producing an ANZLIC editor. Reflecting changes in the metadata standard, or extending it to handle local properties, simply involves editing the schema.

The main limitation of the approach described in this paper is adding new metadata expressions. As described in Section 2.4, it is necessary to write Java routines to generate and parse each different expression. It took, for example, about three hours to generate the PrismEd schema for GILS, but to write the Z39.50 interface necessary to interface PrismEd to a GILS server would have required far more effort. So far, our experience has been that these routines are not difficult to write, but much more experience needs to be gained before a categorical statement can be made. In the meantime, RDF opens the possibility that many metadata standards will have the same expression (XML [XML]).

The metadata classification in this paper has drawn heavily from the ideas presented in RDF. We view RDF as an abstract metadata standard. It defines a structured, annotated, metadata standard, specialises this standard to provide a number of components that are expected to be generally useful (e.g. InstanceOf), and defines a method of expressing metadata values in XML [XML]. Other standards bodies will specialise RDF by defining properties for their particular application, thus producing a concrete metadata standard. For example, work is progressing on using RDF to represent Dublin Core.

The work presented in this paper can be viewed as a prototype of the systems that could be built using RDF. We have simplified the RDF data model slightly as we believe that the model is needlessly complex and we have added the concept of the schema which specialises the model for a particular metadata application.

The development of a metadata specification similar to the one proposed in this paper can be expected to have a number of beneficial effects on metadata standards:

However, a specification might straitjacket the development of new ideas for metadata standards. It is important that the specification itself be flexible and capable of accommodating new ideas.

5. Conclusions

It is possible to implement generic metadata tools. This will result in greater flexibility for metadata standards because it will be easy to reconfigure the software that manipulates the metadata. It will also result in software with greater functionality and flexibility as the development cost can be spread over a wider customer base.

In developing generic metadata tools, it is necessary to consider three issues: classification of the metadata standards to be used, the metadata schema, and the metadata expression.

The metadata classification limits the expressive power of a generic metadata tool as it limits the standards that can be manipulated by the generic tool. We classify metadata applications along two dimensions: the complexity of the values (whether simple or structured values are supported), and whether values can be annotated. We believe that a metadata tool that can manipulate structured, annotated values can be configured to manipulate any metadata.

The metadata schema is the configuration information needed to configure the tool to a particular metadata standard. We have divided the characteristics of a schema into Structural, HCI, Validation, and Default characteristics. Structural information describes the organisation of information in the metadata. HCI information indicates the semantics of the information to users of the tool. Validation information is used to validate values. Default information controls the default values.

The metadata expression describes how the metadata instance is represented outside the tool. There is an enormous range of ways of expressing metadata, ranging from databases to text files of various formats. This makes it difficult to write a generic program to handle the range. We have compromised by providing an API into which a variety of external representations can be slotted. Java makes this easy. This problem may be reduced by the adoption of standard metadata expressions based on XML.

It is likely that the specification presented in this paper will evolve as experience grows with specifying metadata standards, particularly as new metadata standards with new concepts are developed. We believe that the benefits of being able to use generic metadata software will outweigh the costs of developing this metadata specification.

Acknowledgements

The development of a previous version of the software described in this paper was funded in part by the Cooperative Research Centres Program through the Department of the Prime Minister and Cabinet of Australia.

References

[IFLA] Digital Libraries: Metadata resources page, International Federation of Library Associations and Institutions (IFLA), http://www.nlc-bnc.ca/ifla/II/metadata.htm

[ANZLIC] Core metadata elements for land and geographic directories in Australia and New Zealand, The Australia New Zealand Land Information Council (ANZLIC), http://www.anzlic.org.au/metaelem.htm

[DC] Dublin Core metadata, http://purl.oclc.org/docs/metadata/dublin_core

[FGDC] Content Standard for digital geospatial metadata, Version 1.0, (US) Federal Geographic Data Committee (FGDC), http://www.fgdc.gov/Metadata/Metadata.html

[GILS] Application profile for the Government Information Locator Service (GILS), Version 2, http://www.usgs.gov/gils/prof_v2.html

[PrismEd] http://www.mel.dit.csiro.au:8080/~ajw/schema/editor.html

[RDF] Resource Description Framework (RDF), model and syntax, Version 1, 2 October 1997, in: O. Lassila and R. Swick (Eds.), W3C, http://www.w3.org/TR/WD-rdf-syntax-971002.html

[XML] Extensible Markup Language (XML): Part 1, Syntax, W3C Working Draft, 30 June 1997 http://www.w3.org/TR/WD-xml-lang-970630.html

Vitae

Andrew Waugh is a Senior Scientist in the CSIRO Division of Mathematical and Information Sciences. He has an M.Sc. from the University of Melbourne. Experience with X.500 electronic directories focused his attention on the difficulties of economically creating and managing accurate metadata. He was seconded to the Research Data Networks CRC from 1994 to 1997 where he extended this interest to the difficulty of finding resources on the developing net.