A prototype agent to assist shoppers

Robert Inder (a), Matthew Hurst (b) and Toshikazu Kato (c)

(a) NEDO I.T. Researcher/ETL, Electrotechnical Laboratories,
Umezono 1-1-4, Tsukuba, Ibaraki 305, Japan

inder@etl.go.jp

(b) ETL, Electrotechnical Laboratories,
Umezono 1-1-4, Tsukuba, Ibaraki 305, Japan

hurst@etl.go.jp

(c) Chuo University, Tokyo and ETL, Electrotechnical Laboratories,
Umezono 1-1-4, Tsukuba, Ibaraki 305, Japan

kato@etl.go.jp

Abstract
This paper describes initial work assessing the problems associated with building an agent to retrieve and combine data from various sources on the WWW. It describes a prototype system built to explore options in the area of home shopping. It highlights the problems facing the system and identifies directions for further work.

Keywords
Agent; Shopping; Database query; Data integration

1. Introduction

The Internet is giving us access to unprecedented amounts of information, both as static "multi-media" documents and as the results of interrogating databases. However, the sheer volume of information available is one of the major obstacles to exploiting its full potential. Used with skill, search engines and catalogue systems can do an excellent job of finding relevant sites or pages, many of which will link to other relevant documents or information. Even so, they can leave users with so much material that merely retrieving and assessing the most promising items is a substantial and time-consuming task.

There is a clear need for tools to assist in sifting, organising, combining and presenting material in a way that suits the individual user's task and preferences. Once we have a number of sites where we can expect to find relevant information, can we create a system that will automatically interrogate them, summarising and combining the results in some way?

We report an initial exploration of the problems involved in doing this for the task of buying books. Our target is to produce a system which will, in response to the user's query, draw information from a number of separate sources, and produce an on-line document or data-set which integrates the various products and prices that are available.

A major obstacle to carrying out this task is the fact that each bookstore's Web site is designed to present that store's wares as effectively as possible, to help visitors locate things of interest and persuade them to buy them from that store.

2. The task

An agent to support comparative shopping for books will first have to locate a number of bookstores, and interrogate their catalogues. Once information has been retrieved from a vendor, it must be interpreted well enough to allow it to be combined with information from other sources, and presented effectively. One can view this as almost a paradigm case of information extraction. The system knows in advance the kinds of information that the document will contain (titles, authors etc.), and has a domain model, or ontology, which can be used to specify a template which must be completed based on the contents of the document. Moreover, many of the words used will clearly identify the type of information they provide (e.g. "paperback" is probably a book format, "Rudyard Kipling" an author).

Once the content of a particular document has been extracted, the agent must actually combine it with information from other sources. Much work on combining data from multiple sources focuses on reconciling the structure of the data, based on metadata for each source. We doubt Web agents will have access to accurate metadata, but will have to recognise the relevant structures by analysing the data itself within the context of its own ontology. However, precise alignment of data models is not necessary for much useful agent functionality. The limiting factors involve the data itself, and the problems associated with identifying references to "the same" object in different databases — something that will require substantial amounts of domain knowledge (see below).

Finally, the system must organise the information, and present at least some of it to the user. At its simplest, this could involve generating a report that identifies the books the system has located, and gives their price and availability.
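As an illustration only (the record format, field names and layout here are assumptions, not the system's actual output), such a report might be generated along these lines:

```python
def format_report(offers):
    """Render one line per offer, cheapest first: title, vendor,
    price and availability. `offers` is a list of dicts in a
    hypothetical record format."""
    lines = []
    for o in sorted(offers, key=lambda o: o["price"]):
        lines.append("%-30s %-15s $%6.2f  %s"
                     % (o["title"], o["vendor"], o["price"], o["availability"]))
    return "\n".join(lines)
```

Sorting by price is one plausible organisation; grouping offers by publication, or by vendor, would be equally natural presentations of the same data.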

3. The role of domain knowledge

Domain knowledge is key to the operation of the system, in both the initial information extraction and its subsequent combination.

A system hoping to combine and organise information about a number of objects must have an appropriate way of deciding when two different descriptions refer to the same object or "equivalent" things. So, if a book-buyer's assistant is to integrate the information it gathers, it needs to know enough about books to decide what constitutes a book, and how the various books should be grouped.

Books can be identified by means of their title, author and so on. However, we are dealing with material intended to be concise and appealing to potential customers, not complete or precise. Thus, even before considering errors, which abound, book descriptions are partial and varied, and decisions about co-reference are hard. For instance, different vendors may omit, or fold into the title, various attributes of the book, such as sub-titles, editions and formats. In addition, matching between any databases must allow for "compression" of information appropriate to the domain: in our case, the use of initials, abbreviations, truncations and so on.

For these reasons, simple string matching is seldom adequate, and an agent must have enough knowledge of its domain to make good matching and discrimination decisions on the basis of imperfect or partial information. Nevertheless, such decisions will still be fallible, which raises the issue of how, if at all, the system should present this uncertainty to the user. In any event, the system must handle inconsistencies sensibly, and must always help users confirm its decisions (e.g. by providing URLs for retrieving the relevant raw information).
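To make the need for such domain knowledge concrete, matching of this kind can be sketched as follows. This is an illustrative sketch, not the system's actual code: the normalisation rules, the treatment of subtitles and the handling of initials are all assumptions.

```python
import re

def normalize_title(title):
    """Lower-case, drop any subtitle after ':' and strip punctuation."""
    main = title.split(":")[0]
    return set(re.sub(r"[^a-z0-9 ]", "", main.lower()).split())

def title_similarity(a, b):
    """Jaccard overlap of the normalised title tokens, in [0, 1]."""
    ta, tb = normalize_title(a), normalize_title(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def authors_match(a, b):
    """Treat 'R. Kipling' and 'Rudyard Kipling' as compatible:
    surnames must agree exactly; forenames may be reduced to initials."""
    pa = a.replace(".", "").lower().split()
    pb = b.replace(".", "").lower().split()
    if pa[-1] != pb[-1]:                  # surname mismatch
        return False
    for fa, fb in zip(pa[:-1], pb[:-1]):  # compare forenames pairwise
        if fa[0] != fb[0]:                # initials must agree
            return False
        if len(fa) > 1 and len(fb) > 1 and fa != fb:
            return False                  # two full forenames that differ
    return True
```

Even this sketch embodies domain decisions (that subtitles are ignorable, and that an initial is compatible with any full forename sharing it), which is exactly the kind of knowledge the agent must encapsulate.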

Although the need for domain knowledge has been illustrated here in terms of buying books, comparable problems will arise in any domain. The implementation of an information-gathering agent must therefore encapsulate its domain knowledge, so that it can be replaced when the agent is moved to another domain.

4. The prototype system

A prototype version of the agent, "Maxwell", has been implemented, and is able to query and combine results from several bookstores.

The ontology of Maxwell's model of the book domain includes vendors, authors, URLs, publications in various formats and offers-for-sale. It also has functions for computing numerical similarity measures for pairs of titles, authors and formats, and uses them to decide how reasonable it is to assume that two different imprecise descriptions actually refer to the same publication.
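A minimal sketch of how such per-attribute similarity measures might be combined into a co-reference decision; the weights and thresholds here are purely illustrative assumptions, not Maxwell's actual values.

```python
def match_confidence(sim_title, sim_author, sim_format):
    """Weighted combination of per-attribute similarities, each in [0, 1].
    The weights are illustrative, favouring titles over authors and formats."""
    return 0.6 * sim_title + 0.3 * sim_author + 0.1 * sim_format

def classify(score, accept=0.85, reject=0.40):
    """Three-way decision, so that marginal cases can be flagged to
    the user rather than silently merged or separated."""
    if score >= accept:
        return "same-publication"
    if score <= reject:
        return "different"
    return "marginal"
```

The middle band is the important design point: it is what allows the system to identify marginal decisions explicitly instead of forcing every pair into a yes/no answer.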

Maxwell's local meta-data describes a number of book stores and their search interfaces. Each store's search control forms are encoded by specifying the mapping between the various fields on the form and the relevant attributes in Maxwell's domain model, and indicating which type of matching will be carried out.
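Such a declarative store description might look something like the following sketch; the store name, URL, field names and record format are invented for illustration and are not Maxwell's actual formalism.

```python
# Hypothetical declarative description of one vendor's search form,
# mapping its form field names onto attributes of the domain model
# and recording what kind of matching the store performs.
STORE_METADATA = {
    "name": "Example Books",                            # invented vendor
    "search_url": "http://books.example.com/search",    # illustrative URL
    "fields": {
        "au": {"attribute": "author", "match": "substring"},
        "ti": {"attribute": "title",  "match": "keyword"},
    },
}

def build_query(store, publication):
    """Translate a domain-model query into this store's form parameters,
    omitting any attribute the query does not specify."""
    params = {}
    for field, spec in store["fields"].items():
        value = publication.get(spec["attribute"])
        if value is not None:
            params[field] = value
    return params
```

Keeping this mapping declarative means a new store can be added, or an existing one updated, without touching the agent's code.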

Each class of information page returned by a book store is characterised by a "template" which is composed of a number of regular expression fragments, together with information to control and coordinate their matches. This mechanism was designed to allow controlled recognition of sets of patterns as they appear within otherwise uninterpretable material.
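The template mechanism might be sketched as follows: each named regular-expression fragment is searched for in turn, starting from the end of the previous match, so that patterns are recognised in a controlled order within otherwise uninterpretable material. The fragments and the sample page are hypothetical, not taken from any real vendor.

```python
import re

# A hypothetical result-page template: named regular-expression
# fragments, applied in document order.
TEMPLATE = [
    ("title",  re.compile(r"<b>(?P<title>[^<]+)</b>")),
    ("author", re.compile(r"by\s+(?P<author>[A-Z][^,<]+)")),
    ("price",  re.compile(r"\$(?P<price>\d+\.\d{2})")),
]

def extract(page, template=TEMPLATE):
    """Match each fragment in sequence, collecting the named groups.
    Advancing the search position coordinates the fragments, keeping
    their matches in document order."""
    record, pos = {}, 0
    for name, pattern in template:
        m = pattern.search(page, pos)
        if m:
            record[name] = m.group(name)
            pos = m.end()
    return record

page = '<b>Kim</b> by Rudyard Kipling, paperback, $7.99'
# extract(page) -> {'title': 'Kim', 'author': 'Rudyard Kipling', 'price': '7.99'}
```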

Once Maxwell has the information describing the relevant books the vendor is offering, it develops a list of candidate matches between these and the publications that it knows about.

5. Further work

The system's most obvious limitation is its reasoning, which is both limited and imperfect. The initial book-matching mechanism that was implemented proved to be far too simple (not least because it assumed vendors would be internally consistent, which they are not), and thus too restrictive. Experience of using it has led us to make it more tolerant in a number of ways, but the current heuristics for individuating domain objects are still inadequate, and a major revision is required. Moreover, although the system explicitly identifies marginal decisions, it as yet makes no attempt to use follow-up queries to acquire additional information to clarify difficult decisions, and has no mechanism for revising previous decisions in the light of subsequent evidence. We are confident that the system's expertise can be extended to give adequate performance, particularly if coupled with a truth-maintenance framework.

Another major limitation of the system is the gathering of meta-information, and in particular the need for this to be done by hand. Maxwell's use of declarative formalisms for its meta-information minimises the effort required. Nevertheless, most vendor sites are complex, and one cannot tell whether all their output formats have been dealt with. They are also updated frequently, but without notice. There is therefore good reason to gather the meta-information automatically. This is seldom straightforward, particularly when it involves analysing material that aims to be concise and attractive to people, not simple and consistent for agents! However, the content is tightly constrained, and there is ample domain knowledge available to guide the process.

The software developed to date has yielded valuable insight into the general nature of the problem. In particular, it has highlighted a number of specific challenges, and given reason to believe that they can be addressed.

The full version of this paper, including 18 references and URLs, discusses the issues raised in more detail, and includes the architecture and sample output from the system. It is available as ETL Technical Report TR-98-3, and at http://www.etl.go.jp/etl/taiwa/~inder/MaxwellPaper1/.