WWW2010 - Raleigh: Semi-Structured Data

Semi-Structured Data

Thursday, 1:30–3:00 PM
Chair: Jim Hendler

LCA-based Selection for XML Document Collections

Georgia Koloniari, Evaggelia Pitoura

In this paper, we address the problem of database selection for XML document collections: given a set of collections and a user query, how to rank the collections by their goodness with respect to the query. Goodness is determined by the relevance of the documents in the collection to the query. We consider keyword queries and support Lowest Common Ancestor (LCA) semantics for defining query results, so that the relevance of each document to a query is determined by properties of the LCA of those nodes in the XML document that contain the query keywords. To avoid evaluating queries against each document in a collection, we propose maintaining, in a preprocessing phase, information about the LCAs of all pairs of keywords in a document and using it to approximate the properties of the LCA-based results of a query. To improve storage and processing efficiency, we use appropriate summaries of the LCA information based on Bloom filters. We address both a boolean and a weighted version of the database selection problem. Our experimental results show that our approach incurs low errors in the estimation of the goodness of a collection and provides rankings that are very close to the actual ones.
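As a rough illustration of LCA semantics for a keyword pair, the sketch below finds the elements whose text contains each keyword and reports the depth of the deepest lowest common ancestor over all matching node pairs. Using depth as the LCA "property" is an assumption made for this example, as are the sample document `DOC` and the helper names; the Bloom filter summaries and the ranking scheme from the abstract are not reproduced here.

```python
import xml.etree.ElementTree as ET
from itertools import product

# Hypothetical sample document, used only for illustration.
DOC = """
<bib>
  <paper><title>XML selection</title><author>Pitoura</author></paper>
  <paper><title>RDF joins</title><author>Atre</author></paper>
</bib>
"""

def keyword_paths(root, keyword):
    """Return child-index paths (from the root) of elements whose text contains keyword."""
    paths = []

    def walk(node, path):
        if node.text and keyword.lower() in node.text.lower():
            paths.append(path)
        for i, child in enumerate(node):
            walk(child, path + (i,))

    walk(root, ())
    return paths

def lca_depth(p, q):
    """Depth of the LCA of two nodes, i.e. the length of their common path prefix."""
    depth = 0
    for a, b in zip(p, q):
        if a != b:
            break
        depth += 1
    return depth

def deepest_lca_depth(root, kw1, kw2):
    """Depth of the deepest LCA over all node pairs matching kw1 and kw2, or None."""
    pairs = product(keyword_paths(root, kw1), keyword_paths(root, kw2))
    depths = [lca_depth(p, q) for p, q in pairs]
    return max(depths) if depths else None
```

With this sample, `deepest_lca_depth(ET.fromstring(DOC), "XML", "Pitoura")` gives depth 1 (both keywords fall under the same `paper` element), while the pair "XML" and "Atre" meets only at the root, depth 0. A deeper LCA suggests the keywords are more tightly related within the document.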

Faceted Exploration of Image Search Results

Roelof van Zwol, Börkur Sigurbjörnsson

This paper describes MediaFaces, a system that enables faceted exploration of media collections. The system processes semi-structured information sources to extract objects and facets, e.g., the relationships between two objects. Next, we rank the facets based on a statistical analysis of image search query logs and the tagging behaviour of users annotating photos in Flickr. For a given object of interest, we can then retrieve the top-k most relevant facets and present them to the user. The system is currently deployed in production in Yahoo!’s image search engine. We present the system architecture, its main components, and the application of the system as part of the image search experience.

Matrix “Bit”loaded: A scalable lightweight join query processor for RDF data

Medha Atre, Vineet Chaoji, Mohammed Zaki, James Hendler

The Semantic Web community, until now, has used traditional database systems for the storage and querying of RDF data. The SPARQL query language also closely follows SQL syntax. As a natural consequence, most SPARQL query processing techniques are based on database query processing and optimization techniques. For SPARQL join query optimization, previous works like RDF-3X and Hexastore have proposed to use 6-way indexes on the RDF data. Although these indexes speed up merge-joins by orders of magnitude, for complex join queries generating large intermediate join results, the scalability of the query processor still remains a challenge. In this paper, we introduce (i) BitMat, a compressed bit-matrix structure for storing huge RDF graphs, and (ii) a novel, light-weight SPARQL join query processing method that employs an initial pruning technique, followed by a variable-binding-matching algorithm on BitMats to produce the final results. Our query processing method does not build intermediate join tables and works directly on the compressed data. We have demonstrated our method against RDF graphs of up to 1.33 billion triples, the largest among results published until now (single-node, non-parallel systems), and have compared our method with the state-of-the-art RDF stores RDF-3X and MonetDB. Our results show that the competing methods are most effective with highly selective queries. On the other hand, BitMat can deliver 2-3 orders of magnitude better performance on complex, low-selectivity queries over massive data.
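The general idea of pruning join candidates with bit operations, rather than materializing intermediate join tables, can be sketched as follows. This is a minimal, uncompressed illustration and not the BitMat layout or algorithm from the paper: the class `BitMatrixStore`, its methods, and the sample `TRIPLES` are all hypothetical, and Python integers stand in for bit rows.

```python
from collections import defaultdict

# Hypothetical sample triples (subject, predicate, object), for illustration only.
TRIPLES = [
    ("p1", "type", "Paper"),
    ("p1", "author", "Atre"),
    ("p2", "type", "Paper"),
    ("d1", "type", "Dataset"),
    ("d1", "author", "Zaki"),
]

class BitMatrixStore:
    """One subject-by-object bit matrix per predicate; each row is a Python int."""

    def __init__(self, triples):
        self.ids = {}    # term -> integer id
        self.terms = []  # integer id -> term
        # rows[predicate][subject_id] = bitmask of object ids
        self.rows = defaultdict(lambda: defaultdict(int))
        for s, p, o in triples:
            self.rows[p][self._id(s)] |= 1 << self._id(o)

    def _id(self, term):
        if term not in self.ids:
            self.ids[term] = len(self.terms)
            self.terms.append(term)
        return self.ids[term]

    def subject_mask(self, predicate, obj=None):
        """Bitmask of subjects that have `predicate`, optionally with object `obj`."""
        if obj is not None and obj not in self.ids:
            return 0
        obj_bit = None if obj is None else 1 << self.ids[obj]
        mask = 0
        for sid, objs in self.rows[predicate].items():
            if obj_bit is None or objs & obj_bit:
                mask |= 1 << sid
        return mask

    def decode(self, mask):
        """Translate a bitmask back into the set of terms it represents."""
        return {t for i, t in enumerate(self.terms) if (mask >> i) & 1}
```

For a query shaped like `?x type Paper . ?x author ?y`, the candidate bindings for `?x` are the bitwise AND of the two subject masks: `store.subject_mask("type", "Paper") & store.subject_mask("author")`, which decodes to `{"p1"}` on the sample data. No intermediate table of partial results is built at this step.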
