The present paper proposes that a database can often be organized as a large collection of small text files, each containing the structured information that relates to a particular object. This means that the basic idea of WWW pages - small text files that are cross-linked by symbolic addresses - is generalized to become a database technique as well. The database pages are expressed in HORL notation (HyperObject Representation Language). Just as a WWW browser accesses and reads HTML pages dynamically as it needs them, our WWDB database system accesses and reads HORL pages as it needs them in the course of its data processing operations. When an HORL page is read by the database system, its contents are converted to an internal representation, and it does not have to be read again during the same session.
The WWDB technique can easily be combined with WWW usage. It was developed as a tool for generating WWW pages with structured contents, for example annotated publication lists, directories of authors, journals, and conferences, etc. It has also been put to a second usage in a mail management system. Our experience suggests that it may be an attractive and viable technique for many kinds of low-to-medium duty database applications, in particular, those where the database is a source of information rather than e.g. the basis of a transaction processing system.
In this paper, we first describe the application within which the WWDB concepts were developed, and then proceed to a description of the essential technical characteristics of the existing, experimental implementation. Finally we discuss what conclusions can be drawn from the project so far, and the perspectives for continued use of this technique.
An electronic colloquium generalizes this concept to the electronic arena, and the WWW is an ideal substrate for it. Simply speaking, the colloquium home page offers a menu of specific services, in particular:
The important thing about a colloquium is that it should have a very clear focus, and be oriented to a particular research topic. Thus, given that articles addressing the given topic may appear in any one of a large number of journals or conferences, but in each of those only a small percentage of the contents is actually relevant for the colloquium topic, the colloquium will be highly selective. Ideally, it will present all relevant contributions from those sources, and no irrelevant contributions. The colloquium members define the focus and perform the selection.
Electronic colloquia of this kind are particularly important on the European scene, where they offer a possibility for researchers in different countries to obtain continuous interaction with a group of sufficient critical size.
The Compulog project of Esprit (European Union research program in information technology) has recently started an electronic colloquium for spatial and temporal reasoning (ECSTER). This is a sub-area of research in knowledge representation and artificial intelligence, dealing with logical and algorithmic methods for reasoning about actions and their effects, developments over time, etc. Planning, scheduling, and diagnosis based on temporal and spatio-temporal data are some of its application areas. It is an example of a specialized research topic, containing work that ranges from the highly theoretical to the quite practical, where an electronic colloquium would be of interest. The present number of active researchers in Europe in this area is estimated to be around one hundred, most of them working in isolation or in small local groups.
For obvious reasons, we chose to use the WWW as the information carrier for ECSTER. At first, experimental versions of the important colloquium pages were set up completely manually, using a text editor, but it became readily apparent that this was inefficient and inconvenient. The problem lay not only in having to write HTML syntax, but also in the redundancy of the actual information contents: the same data tended to appear repeatedly in multiple contexts. Furthermore, in order to have a reasonable order in the accumulated information, it was necessary to organize it in terms of a multi-level directory structure under the operating system being used (Unix), but the chore of locating files in different directory levels was a nuisance in itself. Finally, we wished to have a clear separation between the working version and the public version of each HTML file, so that a maintainer of a page or substructure could work with it to satisfaction, and only then release it for public viewing.
In summary, the practical overhead concerned both the structure within the WWW pages and between them. Furthermore, the same problems arose with other information that we were dealing with, besides the WWW pages. Distribution of published papers is an important function of an electronic colloquium, and the various aspects of a paper (full text, abstract, commentary, annex containing experimental data or software and its documentation, etc.) impose administrative burdens that are fairly analogous to those that arise for the WWW pages in HTML format.
It became clear very soon, therefore, that we needed to introduce a structured representation for these kinds of information. This structured representation should be a database in the sense of having little or no redundancy, so that each essential information element is only represented once, and it should lend itself easily to being processed in operations that combine related information elements. The HTML representation should then be generated from that database, or, to be precise, large sections of the HTML files should be generated from underlying structured data. There must always be some parts which only serve presentation purposes, and which continue to be best written in HTML. For example, an HTML page may have the following essential structure:
We believe that this situation is typical of many "bread and butter" applications of WWW. Naturally, home pages and other pages that a user encounters more or less immediately must be more interesting and less standardized, but as regards those pages that serve a productive purpose, it seems that the information one wants to present is often structured, and can best be generated from an underlying representation. The availability of a richer presentation language, with audio, animation, color, and embedded video capabilities does not change the essential situation: if anything, it will increase the need for a structured representation of the information. We will return to this topic in the final section of the paper.
If HTML pages are generated from underlying data, there is a choice whether the generation is to be done in advance, under the direction of the person editing the data, or on demand as the user accesses the information. The difference is a practical one, since the generation process is quite similar in both cases. In our application there has not yet been any strong reason for on-line generation of HTML, so we have chosen the former alternative so far. The methods proposed below would work equally well in the case of on-line generation, however.
For the representation of the structured data, the most obvious choice might have been to use a conventional database system. However, we chose instead to organize our database using large numbers of small text files, which are expressed in the HORL syntax. The resulting database is still of moderate size, but it has the inherent capability of growth that is suggested by its name, a World-Wide Data Base. We now proceed to describe this design and the reasons why it was chosen.
The WWDB is an object-oriented database in the literal sense that it is organized as a collection of objects each of which has a number of properties. Objects are classified into types; objects of the same type have similar sets of properties. Other notions that are often associated with the term "object-oriented", such as message-passing and inheritance, are not presently used in the WWDB. In database terminology, the WWDB may be described as a binary database.
Each object has a name and a description. The object's name is like an identifier in an ordinary programming language. The description is an expression that maps labels to properties. For example, the combination of the name |France| and the type |countries| may be assigned the following description:
{ CAPITAL ~ |Paris|,
  CURRENCY ~ FRF,
  NEIGHBORS ~ { |Belgium|, |Germany|, |Switzerland|, |Italy|, |Spain|, |Andorra| }}
where the tilde character is to be read as an arrow, connecting a label and the corresponding property. The description is the real "object"; several objects may have the same name, but for each combination of a name and a type there may be at most one description. (For example, if persons are denoted by their last name, then the combination of |France| and |persons| may represent Anatole France). Properties may be names, or sets of things, but also numbers, strings, sequences of things, new mappings, etc.
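The rule that a name and a type together identify at most one description can be illustrated with a small Python sketch. It is only an illustration: the WWDB itself is written in Lisp, the `descriptions` table and `describe` helper are hypothetical names, and the FIRSTNAME property is invented for the example.

```python
# Hypothetical in-memory model: at most one description per (name, type) pair.
descriptions = {
    ("France", "countries"): {
        "CAPITAL": "Paris",
        "CURRENCY": "FRF",
        "NEIGHBORS": {"Belgium", "Germany", "Switzerland",
                      "Italy", "Spain", "Andorra"},
    },
    # Same name, different type: a separate object.
    # FIRSTNAME is an invented property, used only for illustration.
    ("France", "persons"): {"FIRSTNAME": "Anatole"},
}


def describe(name, type_name):
    """Return the description for a name/type pair, or None if absent."""
    return descriptions.get((name, type_name))
```

Keying on the pair rather than on the name alone is what lets |France| denote both a country and a person without conflict.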
So far, this is quite conventional, and it should be clear how one can build a database with authors, publications, universities, cities, countries, journals, conferences, and so on as some of the types. Rather than storing these objects and object descriptions in an ordinary database system, we chose to create one file for each object, and to store the description in textual form in that file. Instead of a database system, we now have a database browser, that is, a program that reads database text files as it needs them.
One of the uses of the database browser is for interactive updating of the database: adding more information, or correcting its existing data contents. Typically, this usage is interleaved with generation of HTML pages. A number of other tasks are also evident, such as database search and consistency checking, but so far the HTML generation task has dominated in our applications.
The full ECSTER structure consists of a number of such pages, which are linked in an approximate tree structure. The public versions and the working versions are linked as parallel structures, so that a public page links to a public subpage, and a working page to the corresponding working sub-page. Only on exit from the structure, for example in references to the full text of an archived article, or the reference to the home page of a researcher, do the parallel structures converge to common points.
Extended versions do not form a third parallel structure. Instead, an extended-version page links to working versions of subordinate or neighboring pages.
The on-line reader is invited to visit ECSTER's
public home page and the
working home page, as
well as their respective sub-pages, in order to see how this works.
The clickable item "[revision]" has locally the effect of invoking
the WWDB system; remote users will only see the Lisp code but not
its execution.
As discussed above, each page typically contains some parts that
are to be edited directly at the HTML level, and some parts that
are to be generated from the database. The direct-editing parts
are modified in the usual fashion, for example using the editing
capability of the HTML browser, or using a plain text editor. (We
are currently using Emacs for this purpose). The
automatically generated parts
are distinguished by the separate quasi-HTML command label,
so the text of an HTML page may have the following structure:
In order to change the autogenerated information, for example for
adding one more author, or updating the information about a particular
conference, the maintainer clicks the [revision] link at the
top of the working page. This invokes or resets
the WWDB browser, which is put in a state where the
database object corresponding to the current HTML page is the current
object, and |webpages| is the current type. The maintainer
may then use the database browser to update the datastructures, and
finally invoke commands that regenerate the current HTML page.
Concretely, suppose the current HTML page has the filename
To use those methods, the maintainer uses the interactive
commands rg and rgp. Writing
As the WWDB browser is invoked from the working page, only the
program and a kernel set of objects are loaded. Additional object
descriptions are loaded from their text files as they are needed.
For example, the first time the command rg contents is
given in the example, it will cause the WWDB browser to load the
description of the object |authorlist| from its file,
and then in turn it will load the descriptions of all the authors
that are in the author-list. The presentation of these authors
may in turn require the loading of additional objects. For
example, the current affiliation of an author may be represented by
specifying a WWDB name, for example |TU-Munich|,
and then the corresponding description
has to be loaded in its turn. An encouraging observation from the
present experimental implementation is that this loading of
successive objects can be performed quite rapidly, and does not pose
any practical performance problems.
Loaded objects are retained in working memory, so the next time
the same rg command is issued they do not have to be
re-loaded.
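The load-on-demand behaviour described above might be sketched as follows in Python. This is an illustration under stated assumptions, not the WWDB code: `LazyStore` is a hypothetical name, JSON files stand in for HORL text files, and the directory layout (one subdirectory per type) is invented for the example.

```python
import json
import os


class LazyStore:
    """Load object descriptions from per-object files on first use."""

    def __init__(self, root):
        self.root = root
        self.cache = {}      # (name, type) -> description
        self.file_reads = 0  # counts actual disk reads

    def get(self, name, type_name):
        """Return a description, reading its file only if not yet cached."""
        key = (name, type_name)
        if key not in self.cache:
            # Assumed layout: <root>/<type>/<name>.json (JSON stands in for HORL).
            path = os.path.join(self.root, type_name, name + ".json")
            with open(path, encoding="utf-8") as handle:
                self.cache[key] = json.load(handle)
            self.file_reads += 1
        return self.cache[key]
```

A generator that follows a reference such as |TU-Munich| would call `get` for it; only the first call touches the disk, and later calls in the same session are served from memory.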
The operation GENERATE, by contrast, is a simpler operation
which generates HTML expressions according to the formatting
directive of its single argument, in the presence of the current
object. For example the |ecster-heading| format will
use the HEADING property of the current object for both
the HTML TITLE lines and the first-level headings.
Thus the routine of the web-page maintainer is to modify the database,
using the viewing and editing commands of the WWDB browser, and
to regenerate the HTML working page from time to time using the
rg command. Since the WWW-HTML browser and the WWDB browser
appear in separate windows on the workstation, it is easy to
reload the regenerated HTML page, look at it, and return to the
WWDB browser as necessary.
What has now been described is the basic organization of the system.
Additional modifications can easily be introduced into the same
architecture. For example, if a given update or set of updates
of the database affect a number of HTML pages, it would be
desirable to keep track of those dependencies, regenerate all
affected pages, and inform the user of what pages have been
changed. The existence of a database with flexible datastructures
is a sound basis for implementing such services.
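One way such dependency tracking could look is sketched below in Python. This service is not part of the system described here; the `DependencyIndex` class and its method names are hypothetical, and the sketch assumes that each generation run can report which objects it consulted.

```python
from collections import defaultdict


class DependencyIndex:
    """Remember which pages were generated from which objects."""

    def __init__(self):
        self._pages_by_object = defaultdict(set)

    def record(self, page, objects_used):
        """Note that generating `page` consulted each of `objects_used`."""
        for obj in objects_used:
            self._pages_by_object[obj].add(page)

    def affected_pages(self, updated_objects):
        """Return the sorted list of pages that need regeneration."""
        pages = set()
        for obj in updated_objects:
            pages |= self._pages_by_object.get(obj, set())
        return sorted(pages)
```

After an update, the maintainer's tool would regenerate every page returned by `affected_pages` and report that list to the user.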
For another example, if the manually edited (non-automatic)
parts of a working page have been edited, and are to be transferred
to public status, then all links to working versions of subpages
or other related pages must be replaced by links to public versions.
This requires a systematic scan of the entire text contents, either
removing all substrings of the form -wv or (more
reliably) removing such substrings only if they appear in an appropriate
context, and otherwise giving a warning message.
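The more reliable variant, which removes -wv only in an appropriate context and warns otherwise, could be sketched in Python as follows. The choice of "appropriate context" here (an href value ending in -wv.html) is an assumption made for the example, and `publish_links` is a hypothetical name.

```python
import re

# Assumed context: href values ending in "-wv.html",
# e.g. href="researchers/index-wv.html".
WORKING_LINK = re.compile(r'(href="[^"]*)-wv(\.html")', re.IGNORECASE)


def publish_links(html):
    """Rewrite working-version links to public ones.

    Returns the rewritten text and a list of warnings for any
    remaining "-wv" substrings found outside a recognized link.
    """
    rewritten = WORKING_LINK.sub(r"\1\2", html)
    warnings = [
        f"line {number}: unexpected '-wv' left in place"
        for number, line in enumerate(rewritten.splitlines(), start=1)
        if "-wv" in line
    ]
    return rewritten, warnings
```

Occurrences of -wv outside a recognized link are left untouched and reported, so that the maintainer can inspect them by hand.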
The reason for choosing CommonLisp for the experimental system
was that the operations of printing and reading datastructures
are built into the language, so that the transfer between the
text-file representation and the in-memory representation of
the object descriptions is trivial. An additional reason was that
CommonLisp datastructures lend themselves easily to the implementation
of embedded sublanguages, such as the script language for
defining HTML generators.
The present implementation is experimental, and has been written
without any particular consideration of efficiency. In spite of this,
it operates with quite adequate speed. The loading time for
the WWDB system is 9 seconds on a Sparcstation 10, provided that
the LAN does not slow it down. (Typically it is only loaded once a
day, and then used repeatedly throughout the day). The time
for regenerating the list of authors, with its current 170 members,
in the
The technique described here can be implemented quite compactly.
The following figures for the size of the present program show
that it is easy to implement and re-implement a WWDB browser.
The figures refer to lines of LISP S-expressions, spaciously
printed:
The main limitation of CommonLisp and Xlisp at present is the limited
access to screen dialogue capability. An obvious alternative would be
to use Java, which would remedy that limitation. On the other hand, it
would require a separate implementation of a package for printing and
reading datastructures. The same requirement would arise if the
work were redone in, e.g., C++.
For convenience in the development stage, we are using standard
Lisp I/O of data structures (that is, Lisp's read and
print functions) in parallel with the HORL representation.
The program is freely available. After some additional polishing, we
intend to make the program available via ftp and the
documentation via WWW.
The additional step that has been taken by WWDB is to make a much
more complete separation of content and appearance, and to organize
content as a database, while at the same time retaining the
distributed text-file organization of WWW. Appearance has not been
an issue for the present project: we are satisfied with the
appearance capabilities offered by HTML for the time being, which
is why we manage by generating HTML pages.
Java, which is generally viewed as the next step of development
after HTML, is a programming language, not more, not less. Improved
appearance capabilities are its particular strength, so in this way
it represents an orthogonal development to the one shown by WWDB.
It follows that WWDB and Java together would most likely be a very
powerful combination.
One must recognize, however, that an absolute separation of
content and appearance is not possible. It must be understood
as a guiding principle, and not as a strict rule.
Description-file retrieval within one file system.
Briefly, the details are as follows. For every
combination of an object name and a type name, the WWDB browser must
be able to retrieve the full name of the file containing the object
description, so that it can then load the contents of that file.
The full name is constructed as (access path) + (file name) +
(extension), where the extension is standardized as .horl
for the hyperobject representation language used above, and
.lsp if the same information is expressed in classical
CommonLisp format. The retrieval process assumes that the description
of the type is already available; if it is not then retrieval is called
recursively with the previous type as the new object, and with
|types| as the new type. (Types are a special kind of objects,
of course). Then, two main cases are allowed:
(1) The same access path for all objects of the same type.
In this case, the type description contains the access path for the
members of the type, and the object name serves as file name.
The construction of the full file name for the object is trivial.
(2) Each object in the type has its own access path.
In this case, the access path is a property of the object, but not
of the type. In fact, it is stored as a subproperty under the
property META; an example of this was shown above for the case of
an object of type |webpages|. The problem, of course, is
that as long as the object description is still only stored
as a text file, the browser does not know how to find it.
For this reason, WWDB contains the notion of concierges.
A concierge, who is a key person particularly in Paris, is someone to
whom you mention a name, and he or she will tell you where to go in
order to find the person with that name. Similarly, a WWDB concierge
is a WWDB object containing a mapping from names to access paths.
Therefore, the retrieval process which is given an object name, a
type name, and the description for that type,
will first check with the type description whether
this type has a single common access path, or individual paths. In
the former case, the type contains the access path and the object
name becomes file name. In the latter case, the browser will go
through all the currently loaded concierge objects and ask each of
them whether they have an appropriate access path for the present
combination of object name and type name, until it finds one that
can provide the information.
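The retrieval procedure just described can be sketched in Python. The sketch makes several assumptions that are not taken from the implementation: descriptions and concierges are modelled as dictionaries, a type-wide access path is stored under the key "ACCESSPATH", and each concierge entry holds an access path together with a file name (in the web-page example the file name index-wv differs from the object name). The function name `resolve_path` is hypothetical.

```python
def resolve_path(name, type_name, type_description, concierges):
    """Return the full file name of an object description.

    type_description: dict; a path shared by the whole type, if any,
        is stored under "ACCESSPATH" (assumed key).
    concierges: iterable of dicts mapping (name, type) to a pair
        (access path, file name).
    """
    common = type_description.get("ACCESSPATH")
    if common is not None:
        # Case 1: one access path for all objects of the type;
        # the object name serves as file name.
        return f"{common}/{name}.horl"
    # Case 2: ask each loaded concierge in turn.
    for concierge in concierges:
        entry = concierge.get((name, type_name))
        if entry is not None:
            access_path, file_name = entry
            return f"{access_path}/{file_name}.horl"
    raise LookupError(f"no access path known for {name!r} of type {type_name!r}")
```

The loop stops at the first concierge that knows the name/type pair, which matches the behaviour described above; the order in which concierges are consulted is therefore significant.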
Types, in particular, have distributed access paths. The initially
loaded WWDB system therefore only needs to load the description of
the types |types| and |concierges|, a concierge
for all types (that is, all members of the type |types|)
that may need to be used initially, and concierge(s)
for other relevant objects, for example for relevant members of
|webpages|.
Description-file retrieval by world-wide access. The access
mechanism with concierges which know about access paths has
been generalized to allow arbitrary URLs, besides local paths.
In this way, it is possible to construct a database that is similar
to the vast body of displayable information already existing in HTML
format. Individual contributions can be set up locally and made available
on the Internet, and these contributions can be accessed and used by
the database browsers of other users regardless of where they are.
The power of this concept is that the usage of the information is
not limited to viewing; it can also be processed, combined with
information from elsewhere, and presented in very flexible ways.
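Treating local paths and URLs uniformly might look as follows in Python. This is a sketch only: `read_description_text` is a hypothetical name, the list of URL schemes is an assumption, and the text is assumed to be UTF-8 encoded.

```python
from urllib.parse import urlparse
from urllib.request import urlopen


def read_description_text(location):
    """Read raw description text from a local path or a URL."""
    scheme = urlparse(location).scheme
    if scheme in ("http", "https", "ftp", "file"):
        # Remote (or file://) location: fetch through the URL machinery.
        with urlopen(location) as response:
            return response.read().decode("utf-8")
    # Anything else is treated as a path in the local file system.
    with open(location, encoding="utf-8") as handle:
        return handle.read()
```

With such a function, a concierge can hand back either kind of location and the rest of the browser does not need to distinguish between them.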
The usage for electronic colloquia is important enough, but we foresee
that the same technique can be used for much broader purposes. Imagine,
for example, a world-wide database containing geographical and
historical information: countries, cities, activities in those cities,
historical events, and so on. It would be reasonable to start with
fairly elementary facts, and then to extend the database by gradually
attaching additional information to existing ones. A world-wide database
with those kinds of contents could develop into an
encyclopaedia that is available freely to everyone (in the same sense
and to the same extent as the present WWW is free). More specialized
knowledge bases in various academic disciplines might use the same
technique.
Some additional constructs would be necessary as the
world-wide database becomes larger and larger. The present design
requires all participating partners to use the same naming scheme
for types and for concierges. The distributed system may accommodate
multiple descriptions for a given combination of object name and
type name, for types with individual access paths, as long as each
user only selects a subset of all available concierges. In this way
the user only "sees" one of the descriptions for each object/type
combination. But in a world-wide context, it may be necessary to
accommodate different uses of the same type name or the same
concierge name concurrently. One plausible way of doing that is to
allow multiple
domains, where each domain consists of a set of information
providers, and the present naming scheme is used within each
domain. For information exchange between domains, one would use
the well-known technique of mediators,
that is, devices that translate a query that has been issued in one
domain as an object/type combination, into a corresponding query
in another domain.
The WWDB approach represents a deviation from this traditional mode
of thinking. It allows data to be represented in small and simple
text files whose contents are open to everyone. One is not dependent
on the continued use of a particular database software; it is very
easy to implement and re-implement support for the HORL format.
This has been demonstrated by the moderate size of the present
operational program, where the program kernel is a mere 12 pages
of code.
Besides bringing independence from any particular software, the
compactness of the WWDB design has another important effect:
access to the world-wide database can easily be integrated with
any user interface, be it a conventional WWW browser, a UIMS,
a document preparation system, or a particular application program.
The TSIMMIS project at Stanford University [Garcia-Molina et al, 1995]
advocates a tagged object model which is similar to our view
of data. However, the notation that is used in TSIMMIS is on a quite
low level compared to the set-theoretic notation used in our HORL.
The TSIMMIS authors do not report using access paths or URLs as first-class
objects in their database.
The WWDB approach goes against current trends in the database area
in another respect as well: large main-memory databases are presently
a subject of considerable interest. Although this is important for
many applications, one cannot hold a world-wide database in-core.
The WWDB approach uses a browser-like database tool that loads HORL
pages as it needs them.
What are the disadvantages of the WWDB approach? One of the major
issues in traditional database technology is data
consistency and integrity: a database system
shall contain type declarations for the data it contains, and various
control mechanisms for verifying the structural correctness and the
consistency of those data. In a WWDB, we make a virtue of necessity
and consider data consistency and integrity as a separate issue.
Anyone who posts information as an information source in the WWDB
will have to make his own commitments as to the structural properties
of the data he provides. In some cases this may be a very small issue.
In other cases it may not be sufficient, and then the WWDB approach
is not appropriate for those cases.
Concurrent update is another, although related, topic. The WWDB approach
is oriented towards the assumption that the data are object-oriented,
and that different information providers make non-conflicting
contributions and updates to the body of object descriptions for
name/type pairs. The classical example from transaction data
processing - making a withdrawal from one account, and a corresponding
deposit to another account - would obtain miserable performance
in the WWDB architecture.
Actually, one observation from our Electronic Colloquium
project has been that the traditional list of references in
scientific articles is likely to become an obsolete
construct in the age of electronic publication. Why should
one freeze the reference list into the article; why not generalize
it into a bibliographic reference structure which
connects articles by binary links, and which can be gradually
incremented over time, even after the article has been
published?
The homepage of the WWDB project:
[http://vir.liu.se/brs/database/]
The homepage of the ECSEL electronic colloquium:
[http://vir.liu.se/brs/]
The author's homepage:
[http://www.ida.liu.se/~erisa/]
Hector Garcia-Molina, Joachim Hammer, et al:
Integrating and Accessing Heterogeneous Information Sources
in TSIMMIS.
Presented at the AAAI Symposium, 1995.
Also available on-line in
[postscript].
Guy L. Steele Jr: Common LISP. The language.
Digital Press, 1984.
<label heading>
Automatically generated heading
</label heading>
Text pertaining to manually written and edited parts...
<label contents>
Automatically generated part
</label contents>
More text pertaining to manually written and edited parts...
<label footing>
Automatically generated footing
</label footing>
Of course the label and /label commands are ignored
by the HTML browsers, and for the maintainer of the page they indicate
that whatever goes between <label x> and </label x>
shall be left alone, since it will be regenerated anyway.
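Regenerating one labelled segment while leaving the rest of the page alone can be sketched in Python as follows. The function name `replace_segment` is hypothetical, and the sketch assumes that each marker stands on a line of its own.

```python
def replace_segment(page_text, label, new_lines):
    """Replace the lines between <label X> and </label X> markers.

    The marker lines themselves and all text outside them are kept.
    """
    start_marker = f"<label {label}>"
    end_marker = f"</label {label}>"
    result = []
    inside = False
    found = False
    for line in page_text.splitlines():
        stripped = line.strip()
        if stripped == start_marker and not inside:
            result.append(line)
            result.extend(new_lines)  # freshly generated content
            inside = True
            found = True
        elif stripped == end_marker and inside:
            result.append(line)
            inside = False
        elif not inside:
            result.append(line)       # manually edited text is preserved
        # Lines inside the segment are dropped: they are regenerated.
    if not found or inside:
        raise ValueError(f"segment {label!r} is missing or unterminated")
    return "\n".join(result)
```

Raising an error for a missing or unterminated segment avoids silently discarding manually written text that follows an unmatched opening marker.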
/info/www/ext/brs/researchers/index-wv.html
and that it contains three autogenerated segments as described above.
The corresponding database object is stored as the file
/info/www/ext/brs/researchers/index-wv.horl
with contents which may look as follows (simplified form)
{ META ~
    { ACCESSPATH ~ "/info/www/ext/brs/authors",
      OBJNAME ~ |author-index|,
      FILENAME ~ |index-wv|,
      PUBLNAME ~ |index|,
      EXTENDNAME ~ |index-xv| },
  TITLE ~ "Catalogue of authors in the ECSTER area",
  FORMAT ~ |ecster-page|,
  LANGUAGE ~ |english|,
  GENERATORS ~ {
      |heading| ~ (GENERATE |ecster-heading|),
      |contents| ~ (ALLMEMBERS |authorlist| |author-display|),
      |footing| ~ (GENERATE |ecster-footing|) }}
The name of this database object is author-index; it could
not be just index or index-wv since many
different HTML pages are called index. This object
description contains enough information in order to reconstruct
the full file name of the working version, the public version, and
the extended version of the HTML page, since both access path and
filename are there. It also contains relevant
parameters, such as the language in which the page is written, which
is needed in order to write language-independent generators.
Finally, it specifies the generator methods for regenerating the three
auto-generated segments.
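Reconstructing the three file names from the META properties is a matter of simple concatenation, as the following Python sketch suggests. The function name `page_files` is hypothetical, and the META mapping is modelled as a plain dictionary with the property names used in the object description.

```python
def page_files(meta):
    """Build the three HTML file names from a META mapping."""
    base = meta["ACCESSPATH"]
    return {
        "working": f"{base}/{meta['FILENAME']}.html",
        "public": f"{base}/{meta['PUBLNAME']}.html",
        "extended": f"{base}/{meta['EXTENDNAME']}.html",
    }
```

Because the object name (OBJNAME) plays no part in the file names, many pages can share the file name index while remaining distinct database objects.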
rg contents
to the WWDB browser (or selecting the same command from pull-down
menus; not implemented at present)
when |author-index| is the current object,
will cause the current HTML working page to be regenerated,
retaining all lines except the text between the lines
<label contents> and </label contents>. The
expression (ALLMEMBERS |authorlist| |author-display|)
in the object description specifies the recipe for this generation
process: the object |authorlist| is an object containing
a list (ordered set) of authors, which is here used as the basis
for generation, and
|author-display| is a script specifying
how to generate the appropriate HTML expressions based on a
given object of type author.
(Naturally, |authorlist| may be used in several different
contexts).
The operation ALLMEMBERS
looks up the list of members represented by its first argument,
and generates HTML code for each of them in succession using the
script of the second argument.
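The behaviour of ALLMEMBERS can be approximated in Python as follows. This is a sketch rather than the Lisp implementation: objects are assumed to be already loaded into a dictionary, the property names MEMBERS and NAME are invented for the example, and the display script is modelled as an ordinary function.

```python
from html import escape


def all_members(store, list_name, display):
    """Generate HTML for every member of a named list, in order.

    store: dict mapping object names to descriptions (already loaded).
    display: function turning one member description into HTML.
    """
    members = store[list_name]["MEMBERS"]  # MEMBERS is an assumed property
    return "\n".join(display(store[name]) for name in members)


def author_display(author):
    """Hypothetical display script: one list item per author."""
    return f"<li>{escape(author['NAME'])}</li>"
```

Passing the display script as an argument keeps the list object reusable, which is the point made above about |authorlist| being usable in several contexts.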
rgp contents
Details of the current implementation
The present implementation is a single-user system which is being
used regularly as a working tool for maintaining the ECSEL
information structures. (It is also used locally as a mail manager,
and for administering the user's own publications).
The WWDB browser has been implemented in Xlisp, which is a
variant of CommonLisp [Steele, 1984]. The Xlisp implementation
is available for Unix, PC, and Macintosh platforms. The present
system runs in a Unix environment on Sun workstations under the
Solaris operating system.
In other words, with 50 lines per page, the core program is
about 12 pages of CommonLisp programs. There is of course
no reason why it cannot be implemented in any other language.
Discussion
The presently implemented system has the advantage of simplicity: with
a small implementation effort it has been possible to realize a tool
which is reasonably general, and which works well for the intended
first application. However, the purpose of the present article is to
promote the general idea embodied in the program, rather than the
program itself. We shall now address the design idea from several
distinct perspectives.
Separation of content and appearance in WWW languages
In spite of its intentions, HTML does in fact combine content and
appearance. It is true that the contents are presented in a somewhat
media-independent format, but it remains that HTML is a markup
language. An HTML file contains everything that is going to be
written on the screen, plus high-level information about how it is
to be written.
The world-wide perspective
The example above showed how a WWDB object may have a property
containing the file name of an HTML file; this file can then
be read and regenerated from the WWDB browser. In the same
way, a WWDB object can have a property containing the
file name of an HORL file, which is how the browser can
start from some objects and successively read additional ones.
We have first used this technique within the same computer
system, but we have also started to use it with arbitrary
URLs as properties, allowing the WWDB browser to read and
access HTML or HORL files from foreign servers.
Let us first describe how this works within the local file
system, and then discuss the extension.
The distributed database perspective
The current wisdom in the database area is that databases are
represented by database systems, which are a particular
kind of software, and where all data are "owned" by
a particular database system. The user is supposed to enter his or
her data into the database system, and this system can then
be used for performing various operations on the data that are
in it. Distributed database systems allow the additional
possibility of having several such software systems which run on
different computers, and which exchange information as needed.
Heterogeneous distributed database systems allow, in
addition, that the participating database systems can have different
internal structure, that is, they may organize "their" respective
data in different ways.
Related and relevant work
We have not been able to find any earlier usage of our basic
idea - a world-wide database defined by a simple, textual
language (HORL) whereby database object descriptions can be
stored in a fully distributed fashion as small text files.
However, the proposal obviously touches on a number of current
topics in different parts of computer science, including
databases, knowledge bases, office systems, multi-media,
and so on. It is neither possible nor meaningful to make a
full account of all those ramifications here. We refer instead
to the home page of the WWDB project (please refer to the URL
in the list of references below), which contains both an account
of these related areas, and references to other articles
(including forthcoming ones) about the WWDB project.
Summary
In summary, the WWDB represents an approach that has significant
similarities and significant differences with current
HTML-oriented WWW technology. It is orthogonal to the Hot Java
development since Java is a programming language, and WWDB
addresses data structuring. Similarly, it has significant connections
and significant differences compared to current database
technology. In other contexts, we discuss its relationship to
knowledge-base technology and to office information systems
(please refer to the WWDB home page for the references).
The following are the salient points of this new approach:
References
WWW pages:
Conventional publications: