Indexing and Caching
Thursday 3:30–5:00 PM
Chair: Ricardo Baeza-Yates
Scalable Techniques for Document Identifier Assignment in Inverted Indexes
Shuai Ding, Josh Attenberg, Torsten Suel
Web search engines are based on a full-text data structure called an inverted index. The size of the inverted index is a major performance bottleneck during query processing, and a large amount of research has focused on fast and effective techniques for compressing this structure. Several authors have recently proposed techniques for improving index compression by optimizing the assignment of document identifiers to the documents in the collection, leading to significant improvements in overall index size. In this paper, we propose improved techniques for document identifier assignment. Previous work includes simple and fast heuristics such as sorting by URL, as well as more involved approaches based on Traveling Salesman or graph partitioning problems that achieve good compression but do not scale to larger document collections. We propose a new framework based on performing a Traveling Salesman computation on a reduced sparse graph obtained using Locality-Sensitive Hashing, which achieves improved compression while scaling to tens of millions of documents. Based on this framework, we describe a number of new algorithms, and perform a detailed evaluation on three large data sets showing improvements in index size.
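The pipeline the abstract describes, reducing the pairwise similarity graph with Locality-Sensitive Hashing and then running a Traveling Salesman heuristic on the sparse result, can be sketched in miniature as follows. This is an illustrative sketch only, not the paper's algorithm: the hash functions, bucketing parameters, and the nearest-neighbor TSP heuristic are simplifications chosen for brevity, and all function names are invented.

```python
import zlib
from itertools import combinations

def minhash_signature(terms, num_hashes=16):
    # Stand-in for real MinHash: simulate k independent hash functions
    # by salting a deterministic CRC32 with the hash index.
    return tuple(min(zlib.crc32(f"{i}:{t}".encode()) for t in terms)
                 for i in range(num_hashes))

def lsh_buckets(docs, bands=4, num_hashes=16):
    # Band the signatures; documents sharing any band land in the same
    # bucket, yielding a sparse candidate set instead of all O(n^2) pairs.
    rows = num_hashes // bands
    buckets = {}
    for doc_id, terms in docs.items():
        sig = minhash_signature(terms, num_hashes)
        for b in range(bands):
            key = (b, sig[b * rows:(b + 1) * rows])
            buckets.setdefault(key, []).append(doc_id)
    return buckets

def similarity_graph(docs, buckets):
    # Edge weight = Jaccard similarity, computed only for bucket co-members.
    edges = {}
    for members in buckets.values():
        for a, b in combinations(sorted(set(members)), 2):
            if (a, b) not in edges:
                sa, sb = docs[a], docs[b]
                edges[(a, b)] = len(sa & sb) / len(sa | sb)
    return edges

def greedy_tsp_order(doc_ids, edges):
    # Nearest-neighbor TSP heuristic on the sparse graph: repeatedly hop
    # to the most similar unvisited neighbor, so similar documents receive
    # nearby identifiers and the index's d-gaps shrink.
    nbrs = {d: {} for d in doc_ids}
    for (a, b), w in edges.items():
        nbrs[a][b] = w
        nbrs[b][a] = w
    order, seen = [], set()
    cur = doc_ids[0]
    while cur is not None:
        order.append(cur)
        seen.add(cur)
        cand = [(w, n) for n, w in nbrs[cur].items() if n not in seen]
        cur = max(cand)[1] if cand else None
    order += [d for d in doc_ids if d not in seen]
    return order
```

The returned order is a permutation of the document identifiers; assigning new IDs in this order places similar documents close together, which is what makes the postings-list gaps more compressible.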
Sync Kit: A Persistent Client-Side Database Caching Toolkit for Data Intensive Websites
Edward Benson, Adam Marcus, David Karger, Samuel Madden
We introduce a client-server toolkit that demonstrates the potential of client-side database storage for improving the performance of data-intensive websites. The Sync Kit toolkit is designed to make use of the embedded relational database defined in the upcoming HTML5 standard to offload some data storage and processing from the web server to clients. Our toolkit provides various synchronization strategies for relational database tables between the browser and the web server, along with a client-side template library, so that portions of web applications may be executed client-side. Unlike prior work in this area, this toolkit persists both templates and data across web sessions, reducing load and bandwidth usage on the server by a factor of up to 5 versus a traditional server-only web stack and a factor of 3 versus a template-caching approach.
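One table-synchronization strategy of the kind the abstract describes can be sketched as a version-based delta sync: the client persists rows and a last-seen version, and the server ships only rows that changed since that version. This is a hypothetical sketch, not Sync Kit's API; the function names and the `(row_id, version, payload)` row shape are assumptions made for illustration.

```python
def server_delta(table_rows, client_version):
    # table_rows: list of (row_id, version, payload) held on the server.
    # Return only rows newer than the client's version, plus the new
    # high-water mark, so unchanged rows never cross the wire.
    fresh = [r for r in table_rows if r[1] > client_version]
    new_version = max((r[1] for r in table_rows), default=client_version)
    return fresh, new_version

def client_sync(cache, table_rows):
    # cache: {"version": int, "rows": {row_id: payload}}, persisted
    # client-side across sessions (e.g., in the browser's database).
    fresh, cache["version"] = server_delta(table_rows, cache["version"])
    for row_id, _, payload in fresh:
        cache["rows"][row_id] = payload
    return cache
```

Because the cache survives across sessions, a returning visitor's first sync transfers only the delta rather than the whole table, which is the source of the bandwidth savings the abstract reports.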
A Refreshing Perspective of Search Engine Caching
Berkant Barla Cambazoglu, Flavio Junqueira, Vassilis Plachouras, Scott Banachowski, Baoqiu Cui, Swee Lim, Bill Bridge
Web search engines are required to process very large amounts of data with a very low latency to return the best answers to user queries. To achieve low latency, designers of search engines take advantage of the query frequency distribution and employ caching at various levels of the system, e.g., caching of query results. The search engine also needs to update its index frequently to incorporate changes to the Web. After every index update, however, the query results for some entries in the cache might be stale. One solution to this problem is to flush the content of the cache. As the frequency of index updates increases, this solution imposes a significant performance penalty. In this work, we first argue that eviction policies are not as important as policies for invalidating queries and updating the cache content upon new versions of the index. We then introduce a novel caching algorithm that uses a time-to-live value to expire queries and refreshes queries using idle cycles of the engine. The cache selects the queries to refresh according to their frequency or temperature and to their age. More frequent and older queries are selected for refreshing because they are more likely to result in more hits in the future. In addition, we present mechanisms for setting the rate of refreshes issued by the cache. Our evaluation results on real workloads show that our cache outperforms an extension of LRU with refreshes, while the refresh rate is set so that the latency of queries is within a specified range. An implementation of this algorithm is currently in production use.
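The refresh policy the abstract outlines, expiring entries by time-to-live and spending idle cycles on the queries whose frequency ("temperature") and age make future hits most likely, can be sketched as follows. This is a minimal illustration, not the production algorithm: the class, the logical clock, and the frequency-times-age score are assumptions chosen for brevity.

```python
import heapq

class RefreshingCache:
    def __init__(self, ttl):
        self.ttl = ttl          # entries older than this are treated as expired
        self.clock = 0          # logical clock advanced on each lookup
        self.entries = {}       # query -> {"result", "stamp", "freq"}

    def get(self, query):
        self.clock += 1
        e = self.entries.get(query)
        if e is None or self.clock - e["stamp"] > self.ttl:
            return None         # miss or expired: caller recomputes and put()s
        e["freq"] += 1          # hit count approximates the query's temperature
        return e["result"]

    def put(self, query, result):
        e = self.entries.setdefault(query, {"freq": 0})
        e.update(result=result, stamp=self.clock)

    def refresh_candidates(self, k):
        # During idle cycles, pick the k queries most worth refreshing:
        # more frequent and older entries score higher, since they are
        # more likely to produce stale hits in the future.
        def score(q):
            e = self.entries[q]
            return e["freq"] * (self.clock - e["stamp"])
        return heapq.nlargest(k, self.entries, key=score)
```

An idle-cycle loop would re-execute the queries returned by `refresh_candidates` against the current index and `put` the fresh results back, with the refresh rate throttled so that query latency stays within its target range.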