Efficient Acquisition of Web Data through Restricted Query Interfaces
Simon Byers
byers@research.att.com
AT&T Labs-Research
Juliana Freire
juliana@research.bell-labs.com
Bell Laboratories
Cláudio Silva
csilva@research.att.com
AT&T Labs-Research
Abstract:
A wealth of information is available on the Web. But often, such
data are hidden behind form interfaces which allow only a
restrictive set of queries over the underlying databases, greatly
hindering data exploration. The ability to materialize these
databases has endless applications, from allowing the data to be
effectively mined to providing better response times in Web
information integration systems. However, reconstructing database images through
restricted interfaces can be a daunting task, and sometimes
infeasible due to network traffic and high latencies from Web
servers. In this paper we introduce the problem of generating
efficient query covers, i.e., given a restricted query interface, how
to efficiently reconstruct a complete image of the
underlying database. We propose a solution to the problem of
finding covers for spatial queries over databases accessible through
nearest-neighbor interfaces. Our algorithm guarantees complete
coverage and leads to speedups of over 50 when compared against the
naive solution. We use our case study to illustrate useful
guidelines to attack the general coverage problem, and we also
discuss practical issues related to materializing Web
databases, such as automation of data retrieval and techniques which
make it possible to circumvent unfriendly sites, while keeping the
anonymity of the person performing the queries.
Keywords: query coverage, dynamic content, restricted query interfaces,
spatial queries, wrappers
The hidden Web has experienced explosive growth as an increasing
number of databases, from product catalogs and census data to
celebrities' burial sites, go online. This information is often
hidden, in the sense that it is placed behind form interfaces,
and published on demand in response to users' requests. It is
estimated that 80% of all the data on the Web can only be accessed
via form interfaces [3].
There are many reasons for providing such interfaces on the Web.
Transferring a big file with all the information in a database can
unnecessarily overload Web servers (especially if users are interested
in a small subset of the data). Furthermore, accessing a particular
record within a big file can be rather cumbersome. The alternative of
giving direct access to the databases through expressive query
languages such as SQL [4] or XML-QL [2] is not
practical, as these languages are too complex for casual Web users.
Form interfaces are thus a good choice as they provide a very simple
way to query (and filter) data.
Simplicity, however, comes at a price. Form interfaces can be quite
restrictive, disallowing interesting queries and hindering data
exploration. In some cases, the restrictions are intentionally
imposed by content providers to protect their data: the book database
and user comments, for example, are important intellectual property for
amazon.com, central to its business, and it is to Amazon's benefit that
others cannot replicate them. In other instances, the
restrictions are just an annoyance and often the result of bad design.
Take for example the U.S. Census Bureau
Tract Street Locator, 1
which requires a zip code and the first 4 letters of the street name,
not allowing users to get information about all streets in a given
zip code.
As a result, there is a great wealth of information buried and
apparently inaccessible in these Web sites.
To answer queries that are not allowed through a given interface,
or simply to get better performance, many applications can benefit
from materializing a view of this hidden data. This approach is
currently used by a number of services, such as
comparison shopping engines and job search sites. In general, these
services act in cooperation with Web sites, which may give them a less
restrictive back-door interface to the data. However, when Web sites
do not cooperate, an interesting problem is how to reconstruct their
databases using the restricted query interfaces readily available on
the Web.
The problem of querying Web sites through restrictive query interfaces
has been studied in the context of information mediation systems (see
e.g.,
[6]). However, to the best of our knowledge,
the problem of generating efficient query covers that retrieve
complete images of the databases through restricted query interfaces
has not been considered before.
In this paper we define and study the problem of generating efficient
query covers for non-cooperative Web sites: given a restricted query
interface, how to efficiently reconstruct a complete image of the
underlying database. We focus on an important subset of this problem,
namely, finding efficient covers for spatial queries over databases
that are accessible through nearest-neighbor interfaces, and propose
an algorithm that guarantees complete coverage and leads to speedups
of over 50 when compared against the naive solution. We also discuss
issues involved in building these database images and propose general
guidelines.
Figure 1: Locations of stores from one U.S. retailer.
Figure 2: Typical coverage queries, and the partial coverage obtained while completing the cover for the region.
Consider the following scenario. A large retail business is working on
expansion plans, and needs to figure out where to place 10 new stores.
A good strategy would be to discover where their competitors are
located, and/or how their competitors have expanded over
time. Nowadays, these data are often available on Web sites of the
businesses themselves, as most retailers offer store locators which
allow a potential customer to identify a location near them. One
feature of many such locators is that they return only the closest k
results to the query point, and in practice k ranges between 5 and
10. Using this restrictive form interface, it can be tricky (and
costly) to find the locations of all the stores of a particular
retailer in the U.S. A natural solution would be to get a list of all
the zip codes in the U.S., submit them one by one, and merge the results.
However, given the high latencies of most Web sites, executing up to
10,000 queries (1 per zip code) may take a long time; and this problem
is compounded if these queries have to be executed periodically (for
example, to track a competitor's expansion strategy).
Given a query interface for nearest-neighbor queries of the type
"find the k closest stores to a particular location", and a region R,
our goal is to minimize the number of queries necessary to
find all the stores in R. While in principle our coverage
algorithm works for any number of dimensions, we focus our discussion
on the two-dimensional case.
A naive technique for finding a cover of a region R is to
simply break the region into small pieces, then perform one query for
each piece (say, perform the query for the centroid of the
region). This technique is quite popular on the Web. For instance,
often to find all the stores of a particular chain, one performs one
query per zip code. While this does not guarantee coverage, since it
might happen that more than k stores belong to a single zip code, in
practice, this often produces satisfactory results.
Given that there are several thousand zip codes in the United States,
this technique is likely to be very time consuming. Also, this
technique does not exploit the data-sensitive nature of the k-NN
queries being performed, because it does not take into account the
radius2 returned by the query.
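For concreteness, the following is a minimal sketch of the naive per-zip-code scheme. The store_locator(zip_code) wrapper is a hypothetical placeholder for the form-submission and page-parsing code of a particular site; each returned store is assumed to carry a unique identifier.

    def naive_cover(zip_codes, store_locator):
        """Issue one query per zip code and merge (de-duplicate) the results."""
        stores = {}
        for zip_code in zip_codes:
            for store in store_locator(zip_code):   # at most k results per query
                stores[store["id"]] = store         # duplicates collapse on "id"
        return list(stores.values())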
Figure 1 shows the locations of the
stores for a U.S. retailer.
As can be clearly seen in Figure 2, the radius returned
by a given query can vary substantially. Our technique exploits such
variations to achieve an efficient solution. Our algorithm is quite
simple, and is composed of two parts:
1. We use a spatial data structure to keep track of which parts of R
have already been covered by previous queries.
2. At any given point in time, we use the coverage information
obtained thus far to determine where to perform the next query so as to
minimize overlaps.
We use a simple greedy scheme for maximizing the profit of each
query. We assume the best place to perform a query is the largest
empty circle in the uncovered region. In practice, we use a quadtree
[5] to mark which regions of R have been covered
(the unmarked regions are the regions of space which have not been
seen, and for which no information is available). Given a query point
q, the output of the query is a list of the k nearest neighbors of q.
We simply mark on the quadtree the nodes inside a ball centered at q
and of radius r, where r is the distance from q to the farthest
returned neighbor. See
Figure 2 for an example. In effect, we find the
largest uncovered quadtree node, and use its center as the next query
point.
Note that we use the quadtree for two purposes: to determine
coverage and decide when we can stop; and to determine the next
query point. A nice feature of using recursive data structures such as the
quadtree is that they make it easier to extend the technique to higher
dimensions [5].
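The loop below is a minimal sketch of this greedy scheme, simplified in two ways: the adaptive quadtree is replaced by a uniform grid of cells (so "largest uncovered node" degenerates to "any uncovered cell"), and a hypothetical k_nn_query(point) wrapper is assumed to return the k nearest stores together with the radius r of the farthest one. A full implementation would refine an adaptive quadtree as in [5] and mark a node only when it lies entirely inside the query ball, which is what preserves the completeness guarantee.

    import math

    def greedy_cover(region, k_nn_query, cell_size):
        """Greedy coverage of a rectangular region (xmin, ymin, xmax, ymax)."""
        xmin, ymin, xmax, ymax = region
        nx = int(math.ceil((xmax - xmin) / cell_size))
        ny = int(math.ceil((ymax - ymin) / cell_size))
        covered = [[False] * ny for _ in range(nx)]   # simplified spatial index

        def center(i, j):
            return (xmin + (i + 0.5) * cell_size, ymin + (j + 0.5) * cell_size)

        stores = {}
        while True:
            # Next query point: the center of some still-uncovered cell
            # (an adaptive quadtree would pick the largest uncovered node).
            target = next(((i, j) for i in range(nx) for j in range(ny)
                           if not covered[i][j]), None)
            if target is None:
                break                          # everything is covered: done
            q = center(*target)
            neighbors, r = k_nn_query(q)       # k nearest stores and radius r
            for s in neighbors:
                stores[s["id"]] = s            # merge, de-duplicating by "id"
            # Mark cells whose center falls inside the ball of radius r at q
            # (a conservative implementation would require full containment).
            for i in range(nx):
                for j in range(ny):
                    cx, cy = center(i, j)
                    if math.hypot(cx - q[0], cy - q[1]) <= r:
                        covered[i][j] = True
            covered[target[0]][target[1]] = True   # make progress even if r is tiny
        return list(stores.values())

The number of queries issued by this loop is governed by how large the returned radii are, which is exactly the data-sensitive behavior the naive per-zip-code scheme ignores.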
Querying for the U.S. locations of one retailer that has about 750
stores using the zip code query (which returns 10 stores and takes
about 1.2 seconds per query) requires over 10,000 queries
(corresponding to the different zip codes) at a total cost of 12,000
seconds, or over three-and-a-half hours. Using our simple scheme, we
need only 191 queries at a total cost of 229 seconds. This is over 52
times faster (see [1] for more complete experimental
results).
3 Acquiring Web Data
A critical requirement to reconstruct database images is to minimize
the number of queries required to retrieve the data. In general, a good
strategy is to pose queries that are as general as possible, and some
straightforward guidelines can be followed: leave optional attributes
unspecified; for required attributes, choose the most general options;
and given a choice among required attributes, select the attribute with
the smallest domain. However, for query interfaces that
limit the number of returned answers, this strategy is not effective,
as one cannot guarantee that all answers are retrieved. In the
previous section we discussed how to address this problem by
using spatial constraints to guarantee complete coverage.
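As a rough illustration of these guidelines (not a general algorithm), the sketch below plans form submissions from a hypothetical schema that records, for each attribute, whether it is required and which options it accepts, with the most general ("any"-style) option assumed to be listed first; when a concrete value must be supplied, it enumerates the required attribute with the smallest domain so that the fewest submissions are needed.

    def plan_queries(schema):
        """schema: {attribute: (required, [options...])}, most general option first."""
        # Optional attributes are simply left unspecified.
        required = {a: opts for a, (req, opts) in schema.items() if req}
        if not required:
            return [{}]
        # Enumerate the required attribute with the smallest domain, and use
        # the most general option for every other required attribute.
        pivot = min(required, key=lambda a: len(required[a]))
        plans = []
        for value in required[pivot]:
            query = {a: opts[0] for a, opts in required.items() if a != pivot}
            query[pivot] = value
            plans.append(query)
        return plans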
There are a number of practical issues involved in the Web data
acquisition process: in order to build a warehouse of hidden Web data,
several queries to one or more Web sites may be required -- as a result,
these queries must be automated, their execution optimized, and
possibly anonymized; and since the degree of difficulty of coverage
queries is directly related to how data is accessed and represented,
finding alternative means to access the data, or different
representations, may simplify the problem. We refer the reader to
[1] for more details on these issues.
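As an example of the kind of automation involved, the snippet below submits a single GET-style form using only Python's standard library. The URL and field names are hypothetical placeholders rather than the interface of any real site, and issues such as POST forms, session cookies, throttling, and anonymization (e.g., routing requests through a proxy) are omitted.

    from urllib.parse import urlencode
    from urllib.request import urlopen

    def submit_form(base_url, fields):
        """Submit a GET-style form and return the raw HTML of the result page."""
        url = base_url + "?" + urlencode(fields)
        with urlopen(url, timeout=30) as response:
            return response.read().decode("utf-8", errors="replace")

    # Hypothetical usage; a wrapper would then parse the HTML into records:
    # html = submit_form("http://www.example.com/storelocator",
    #                    {"zip": "07932", "k": "10"})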
It is worth pointing out that the problem of reconstructing database
images through restricted interfaces is very broad, and in this paper
we propose a solution to a small but important subset of the problem.
We are currently investigating a comprehensive framework to address
the general reconstruction problem.
References
[1] S. Byers, J. Freire, and C. Silva. Efficient acquisition of Web data through restricted query interfaces. Technical report, AT&T Shannon Labs and Bell Laboratories, 2000.
[2] A. Deutsch, M. Fernandez, D. Florescu, A. Levy, and D. Suciu. A query language for XML. In Proc. of WWW, pages 77-91, 1999.
[3] S. Lawrence and C. Giles. Searching the World Wide Web. Science, 280(4):98-100, 1998.
[4] J. Melton and A. Simon. Understanding the New SQL: A Complete Guide. Morgan Kaufmann, 1993.
[5] H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, Reading, MA, 1990.
[6] R. Yerneni, C. Li, H. Garcia-Molina, and J. Ullman. Computing capabilities of mediators. In Proc. SIGMOD, pages 443-454, 1999.
Footnotes
1. Tract Street Locator: http://tier2.census.gov/ctsl/ctsl.htm
2. Radius: in practice, some sites do not return the radius directly, but given an address (location), it is possible to find the latitude and longitude and compute the radius by performing extra queries (possibly on different Web sites).