a) NEDO I.T. Researcher / ETL,
Electrotechnical Laboratories,
Umezono 1-1-4, Tsukuba, Ibaraki 305, Japan
inder@etl.go.jp
b) ETL,
Electrotechnical Laboratories,
Umezono 1-1-4, Tsukuba, Ibaraki 305, Japan
hurst@etl.go.jp
c) Chuo University, Tokyo, and ETL,
Electrotechnical Laboratories,
Umezono 1-1-4, Tsukuba, Ibaraki 305, Japan
kato@etl.go.jp
The Internet is giving us access to unprecedented amounts of information, both as static "multimedia" documents and as the results of interrogating databases. However, the sheer volume of information available is one of the major obstacles to exploiting its full potential. Used with skill, search engines and catalogue systems can do an excellent job of finding relevant sites or pages, many of which will link to several other relevant documents or other information. Even so, they can leave users with so much material that simply retrieving and assessing the most promising items is a substantial and time-consuming task.
There is a clear need for tools to assist in sifting, organising, combining and presenting material in a way that suits the individual user's task and preferences. Once we have a number of sites where we can expect to find relevant information, can we create a system that will automatically interrogate them, summarising and combining the results in some way?
We report an initial exploration of the problems involved in doing this for the task of buying books. Our target is to produce a system which will, in response to the user's query, draw information from a number of separate sources, and produce an on-line document or data-set which integrates the various products and prices that are available.
A major obstacle to carrying out this task is the fact that each bookstore's Web site is designed to present that store's wares as effectively as possible, to help visitors locate things of interest and to persuade them to buy from that store.
Once the content of a particular document has been extracted, the agent must actually combine it with information from other sources. Much work on combining data from multiple sources focuses on reconciling the structure of the data, based on metadata for each source. We doubt that Web agents will have access to accurate metadata; instead, they will have to recognise the relevant structures by analysing the data itself within the context of their own ontology. However, precise alignment of data models is not necessary for much useful agent functionality. The limiting factors involve the data itself, and the problems associated with identifying references to "the same" object in different databases, something that will require substantial amounts of domain knowledge (see below).
Finally, the system must organise the information, and present at least some of it to the user. At its simplest, this could involve generating a report that identifies the books the system has located, and gives their price and availability.
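As a concrete illustration of this simplest case, the sketch below (in Python, and our own illustration rather than Maxwell's code) formats a plain-text report from a handful of offer records; the Offer fields and example values are assumptions made purely for the example.

    from dataclasses import dataclass

    @dataclass
    class Offer:
        vendor: str
        title: str
        authors: str
        price: str          # kept as the vendor's own string, e.g. "$12.95"
        availability: str   # e.g. "ships in 24 hours"

    def report(offers):
        # list each located book with the price and availability each vendor reports
        lines = []
        for o in sorted(offers, key=lambda o: (o.title, o.vendor)):
            lines.append("%s (%s)" % (o.title, o.authors))
            lines.append("    %s: %s, %s" % (o.vendor, o.price, o.availability))
        return "\n".join(lines)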
A system hoping to combine and organise information about a number of objects must have an appropriate way of deciding when two different descriptions refer to the same object or "equivalent" things. So, if a book-buyer's assistant is to integrate the information it gathers, it needs to know enough about books to decide what constitutes a book, and how the various books should be grouped.
Books can be identified by means of their title, author and so on. However, we are dealing with material intended to be concise and appealing to potential customers, not complete or precise. Thus, even before considering errors, which abound, the descriptions of books are partial and varied, and decisions about co-reference are hard. For instance, different vendors may ignore, or include within the title, various attributes of the book, such as sub-titles, editions and format. In addition, matching between any databases must allow for "compression" of information appropriate to that domain: in our case, the use of initials, abbreviations, truncations and so on.
For these reasons, simple string matching is seldom adequate, and an agent must have enough knowledge of its domain to make good matching and discriminating decisions on the basis of imperfect or partial information. Nevertheless, such decisions will still be fallible, which raises the issue of how, if at all, the system should present the uncertainty to the user. In any event, the system must handle inconsistencies sensibly, and must at all times help users confirm its decisions (e.g. by providing URLs for retrieving the relevant raw information).
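To make this kind of "compression" concrete, the sketch below shows one simple way an agent might decide whether two author names are compatible, allowing for initials, truncations and dropped middle names. It illustrates the general idea only, and is not Maxwell's actual matching code.

    import re

    def _tokens(name):
        # split "Smith, John A." into ["smith", "john", "a."]
        return [t for t in re.split(r"[,\s]+", name.lower()) if t]

    def _token_matches(short, long):
        # an initial or truncated token ("J.", "Encyclopa.") matches any
        # token it is a prefix of, once the trailing full stop is removed
        return long.startswith(short.rstrip("."))

    def names_compatible(a, b):
        ta, tb = _tokens(a), _tokens(b)
        shorter, longer = (ta, tb) if len(ta) <= len(tb) else (tb, ta)
        longer = longer[:]            # copy: matched tokens are consumed
        for s in shorter:
            for i, l in enumerate(longer):
                if _token_matches(s, l) or _token_matches(l, s):
                    del longer[i]
                    break
            else:
                return False          # some token has no plausible counterpart
        return True

    # names_compatible("Smith, John A.", "J. Smith")  -> True
    # names_compatible("Smith, John A.", "Jones, J.") -> False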
Although the need for domain knowledge has been illustrated in terms of buying books, comparable problems will arise in any domain. The implementation of an information-gathering agent must therefore ensure that the domain knowledge used is encapsulated, and can be replaced when the agent is applied to another domain.
The ontology of Maxwell's model of the book domain includes vendors, authors, URLs, publications in various formats and offers-for-sale. It also has functions for computing numerical similarity measures for pairs of titles, authors and formats, and uses them to decide how reasonable it is to assume that two different imprecise descriptions actually refer to the same publication.
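The fragment below sketches how per-field similarity measures of this kind might be combined into a single judgement about whether two descriptions plausibly denote one publication. The similarity function, weights and threshold shown are invented for the illustration; Maxwell's own measures are domain-specific.

    from difflib import SequenceMatcher

    def field_similarity(a, b):
        # crude stand-in for the domain-specific measures, in the range 0..1
        if not a or not b:
            return 0.5                # missing data is neutral, not disqualifying
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    WEIGHTS = {"title": 0.5, "authors": 0.35, "format": 0.15}

    def same_publication_score(desc1, desc2):
        # desc1 and desc2 are dictionaries of the fields named in WEIGHTS
        return sum(w * field_similarity(desc1.get(f), desc2.get(f))
                   for f, w in WEIGHTS.items())

    def plausibly_same(desc1, desc2, threshold=0.75):
        return same_publication_score(desc1, desc2) >= threshold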
Maxwell's local meta-data describes a number of book stores and their search interfaces. Each store's search control forms are encoded by specifying the mapping between the various fields on the form and the relevant attributes in Maxwell's domain model, and indicating which type of matching will be carried out.
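The sketch below conveys the flavour of such an encoding: each field on the vendor's search form is mapped onto an attribute of the domain model, together with the kind of matching the vendor performs. The store name, URL, form fields and attribute names are all invented for the example.

    SEARCH_FORM_METADATA = {
        "store":  "Example Books",
        "action": "http://bookstore.example.com/cgi-bin/search",
        "method": "GET",
        "fields": {
            # form field : (domain-model attribute, matching done by the vendor)
            "ti": ("publication.title",    "substring"),
            "au": ("author.surname",       "exact"),
            "kw": ("publication.keywords", "keyword"),
        },
    }

    def build_query(request):
        # translate a query phrased in domain terms into the vendor's form fields
        params = {}
        for form_field, (attribute, _match) in SEARCH_FORM_METADATA["fields"].items():
            if attribute in request:
                params[form_field] = request[attribute]
        return SEARCH_FORM_METADATA["action"], params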
Each class of information page returned by a book store is characterised by a "template" which is composed of a number of regular expression fragments, together with information to control and coordinate their matches. This mechanism was designed to allow controlled recognition of sets of patterns as they appear within otherwise uninterpretable material.
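The sketch below illustrates the underlying idea: regular-expression fragments, one per attribute, are applied in a controlled order, skipping the otherwise uninterpretable markup between matches. The patterns and page layout are invented, and the control information Maxwell attaches to real templates is omitted.

    import re

    FRAGMENTS = [
        ("title",   re.compile(r"<b>(?P<title>[^<]+)</b>")),
        ("authors", re.compile(r"by\s+(?P<authors>[^<]+)")),
        ("price",   re.compile(r"(?P<price>[$£]\d+(?:\.\d{2})?)")),
    ]

    def extract_records(page):
        # yield one dictionary per book record, matching fragments left to right
        pos, record = 0, {}
        while True:
            name, pattern = FRAGMENTS[len(record)]
            match = pattern.search(page, pos)
            if match is None:
                break                       # no further complete records
            record[name] = match.group(name).strip()
            pos = match.end()
            if len(record) == len(FRAGMENTS):
                yield record
                record = {}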
Once Maxwell has the information describing the relevant books the vendor is offering, it develops a list of candidate matches between these and the publications that it knows about.
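One plausible reading of this step, again as an illustration rather than a description of Maxwell's code, is sketched below: each extracted record is scored against each publication already known (for example with a combined similarity measure such as the one sketched earlier), and pairs above a modest cut-off are kept, best first, for later confirmation.

    def candidate_matches(vendor_records, known_publications, score, cutoff=0.6):
        # `score` rates a (vendor record, known publication) pair between 0 and 1;
        # the cut-off value here is illustrative only
        candidates = []
        for record in vendor_records:
            for publication in known_publications:
                s = score(record, publication)
                if s >= cutoff:
                    candidates.append((s, record, publication))
        candidates.sort(key=lambda c: c[0], reverse=True)
        return candidates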
Another major limitation on the system is the gathering of meta-information, and in particular the need for this to be done by hand. Maxwell's use of declarative formalisms for its meta-information minimises the effort required. Nevertheless, most vendor sites are complex, and one cannot tell whether all their output formats have been dealt with. They are also updated frequently, but without notice. There is therefore good reason to gather the meta-information automatically. This is seldom straightforward, particularly when it involves analysing material that aims to be concise and attractive to people, not simple and consistent for agents! However, the content is tightly constrained and there is ample domain knowledge available to guide the process.
The software developed to date has yielded valuable insight into the general nature of the problem. In particular, it has highlighted a number of specific challenges, and given reason to believe that they can be addressed.
The full version of this paper, including 18 references and URLs, discusses the issues raised in more detail, and includes the architecture and sample output from the system. It is available as ETL Technical Report TR-98-3, and at http://www.etl.go.jp/etl/taiwa/~inder/MaxwellPaper1/.