Caching Strategies for Data-Intensive Web Sites

Daniela Florescu	Valerie Issarny
Daniela.Florescu@inria.fr	Valerie.Issarny@inria.fr
Patrick Valduriez	Khaled Yagoub
Patrick.Valduriez@inria.fr	Khaled.Yagoub@prism.uvsq.fr

1. Introduction

We propose a customizable cache system architecture for Web sites whose content is dynamically extracted from large relational databases. Our solution improves the response time of those sites by customizing data caching at the various levels of data elaboration within the Web site, according to the users' access profiles. The system can cache the results of database queries, intermediate XML fragments, and HTML files.

Performance problems may arise when a Web site provides access to large numbers of pages that are dynamically built from a database. In this context, producing a Web page may require costly interaction with the database system, mainly for connection and querying. The database cost adds to the already non-negligible base cost of Web page delivery. One solution for reducing the client waiting time is to cache HTML pages. Although it performs well, this solution has several major drawbacks: it incurs significant space overhead, it makes propagating updates from the database to the cached data more difficult, and its caching granularity (i.e., a page) is not always appropriate. For instance, different fragments of a page can have different update frequencies, and caching at the page level forces recomputation of the entire page even if some parts of it did not change.

To overcome these problems, we propose a Web site management system, called Weave [1], that supports customized cache management and the automatic generation of runtime policies from declarative specifications of Web sites.

2. Features

In the following, we briefly discuss the key features of the system and the corresponding components.

Weave is a data-intensive Web site management system. It relies on a declarative specification of the Web site through an XML graph data model that captures the structure and the content of the site independently of its graphical representation. A site schema then represents an XML view definition over a relational database. This enables managing the Web site at three levels: database, XML fragments, and HTML files. The graphical representation of the site is described using XSL style sheets.

Weave comes with a declarative language, WeaveL, for specifying the XML site schema. A WeaveL program consists of a set of site class specifications. Each class specification includes the declaration of the parameters identifying an instance of the class, the SQL query whose result gives all possible instances for those parameters (and thus describes how to produce all instances of the class), the specification of the data contained in an instance, and the specification of the hyperlinks from an instance of the class.
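
As an illustration only, the four parts of a site class specification listed above can be modeled as a small Python data structure. This is not WeaveL syntax, which is not shown here; the class name SiteClass, its field names, the instance_key helper, and the example table and column names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SiteClass:
    """Hypothetical Python model of one site class specification (not WeaveL)."""
    name: str
    parameters: List[str]      # parameters identifying an instance of the class
    instances_query: str       # SQL query whose result gives all instances
    data: Dict[str, str] = field(default_factory=dict)   # content name -> SQL query
    links: Dict[str, str] = field(default_factory=dict)  # link label -> target class

    def instance_key(self, **values) -> Tuple:
        """Build a key identifying one instance from its parameter values."""
        missing = [p for p in self.parameters if p not in values]
        if missing:
            raise ValueError(f"missing parameters: {missing}")
        return (self.name,) + tuple(values[p] for p in self.parameters)


# Assumed example schema: a product page identified by product_id.
product_page = SiteClass(
    name="ProductPage",
    parameters=["product_id"],
    instances_query="SELECT id AS product_id FROM products",
    data={"title": "SELECT name FROM products WHERE id = :product_id"},
    links={"category": "CategoryPage"},
)
```

A key such as the one returned by instance_key could then serve to identify a cached item for a given instance, at whichever level it is cached.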

Weave proposes a three-level caching architecture composed of DB, XML, and HTML caches. The DB cache stores, in the DBMS, the results of parameterized SQL computations in the form of relational tables, and reuses these results for subsequent requests [2]. This improves the performance of database query handling, allows for efficient update propagation, and enables caching of data that are shared among various pages.
Compared to the HTML cache, which caches HTML files on disk, the XML cache stores less data and allows careful control over the granularity of the cached data, ranging from the entire page to fragments of the page. Moreover, it reduces the load that the Web server generates on the database.
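
The lookup order implied by the three levels can be sketched as follows. This is a minimal in-memory sketch under simplifying assumptions, not the Weave implementation: all three caches are plain dictionaries keyed by the same page key (whereas the DB cache described above lives in the DBMS as relational tables and can be shared among pages), and the names ThreeLevelCache, run_query, to_xml, to_html, and levels are hypothetical.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

Rows = List[Tuple]


class ThreeLevelCache:
    """Serve a page from the HTML, XML, or DB cache, in that order."""

    def __init__(
        self,
        run_query: Callable[[Hashable], Rows],   # assumed: queries the database
        to_xml: Callable[[Rows], str],           # assumed: builds the XML fragment
        to_html: Callable[[str], str],           # assumed: applies the style sheet
        levels: Iterable[str] = ("db", "xml", "html"),
    ) -> None:
        self.run_query = run_query
        self.to_xml = to_xml
        self.to_html = to_html
        self.levels = set(levels)                # which levels are allowed to cache
        self.db_cache: Dict[Hashable, Rows] = {}
        self.xml_cache: Dict[Hashable, str] = {}
        self.html_cache: Dict[Hashable, str] = {}

    def get_page(self, key: Hashable) -> str:
        if key in self.html_cache:               # cheapest hit: finished page
            return self.html_cache[key]
        xml = self.xml_cache.get(key)
        if xml is None:
            rows = self.db_cache.get(key)
            if rows is None:                     # miss at every level
                rows = self.run_query(key)
                if "db" in self.levels:
                    self.db_cache[key] = rows
            xml = self.to_xml(rows)
            if "xml" in self.levels:
                self.xml_cache[key] = xml
        html = self.to_html(xml)
        if "html" in self.levels:
            self.html_cache[key] = html
        return html

    def invalidate(self, key: Hashable) -> None:
        """Drop every cached form of the item, e.g. after a database update."""
        for cache in (self.db_cache, self.xml_cache, self.html_cache):
            cache.pop(key, None)
```

Restricting levels to a subset, for instance only "xml", trades the space taken by HTML files for one extra formatting step per request, while still avoiding repeated database queries.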

The runtime behavior of a Web site can be controlled by a customized runtime policy so as to make optimal use of the caches according to behavioral information such as the users' access patterns. A runtime policy specifies which kinds of data to prefetch or cache (HTML pages, XML fragments, relational tables, or any combination of those), which particular items to prefetch or cache (e.g., particular HTML pages), and which actions to execute in response to events such as page requests, data updates, or environmental changes.

Finally, we introduce a high-level language, WeaveRPL, for the abstract specification of the cache system's behavior (the specification is similar for the three caches). The language is based on event-condition rules. It makes it possible to explicitly specify the global runtime policy implemented by an individual cache manager, setting overall features such as the maximum cache size and the actions to be carried out upon a global event such as a cache overflow. Furthermore, it builds upon the declarative Web site specification and allows the definition of per-site-class customized caching, that is, how to handle events related to data retrieval, addition, removal, and staleness.
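
To make the rule-based style concrete, the following Python sketch shows one way a cache manager could dispatch events to rules that pair a condition with an action. It is not WeaveRPL syntax, which is not shown here; the names Rule, PolicyCache, evict_oldest, and drop_item, the event names, and the oldest-first eviction choice are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Rule:
    """Hypothetical rule: on `event`, if `condition` holds, run `action`."""
    event: str
    condition: Callable[["PolicyCache", str], bool]
    action: Callable[["PolicyCache", str], None]


class PolicyCache:
    """Cache manager whose behavior is driven entirely by its rules."""

    def __init__(self, max_size: int, rules: List[Rule]) -> None:
        self.max_size = max_size            # global feature: maximum cache size
        self.rules = rules
        self.items: Dict[str, Any] = {}     # dicts keep insertion order

    def fire(self, event: str, key: str) -> None:
        for rule in self.rules:
            if rule.event == event and rule.condition(self, key):
                rule.action(self, key)

    def add(self, key: str, value: Any) -> None:
        self.items[key] = value
        if len(self.items) > self.max_size:
            self.fire("overflow", key)      # global event

    def mark_stale(self, key: str) -> None:
        self.fire("stale", key)             # per-item event


def evict_oldest(cache: PolicyCache, key: str) -> None:
    """Assumed overflow action: remove the item inserted first."""
    oldest = next(iter(cache.items))
    del cache.items[oldest]


def drop_item(cache: PolicyCache, key: str) -> None:
    """Assumed staleness action: discard the stale item."""
    cache.items.pop(key, None)


rules = [
    Rule("overflow", lambda cache, key: True, evict_oldest),
    Rule("stale", lambda cache, key: key in cache.items, drop_item),
]
```

Changing the policy then amounts to swapping rules, for example replacing drop_item with an action that recomputes the item, without touching the cache manager itself.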

The foreseen advantage of a three-level caching architecture, coupled with runtime policy customization, is that it makes it possible to specialize the runtime behavior of a Web site according to each user, to the user's context, or to the portion of the Web site being accessed, and to automatically balance the load of such a system.

3. References

[1] http://caravel.inria.fr/~yagoub/weave.html
[2] Daniela Florescu, Alon Levy, Dan Suciu, and Khaled Yagoub. Run time management of data intensive Web sites. In Proc. of the Int. Conf. on Very Large Data Bases (VLDB), Edinburgh, UK, 1999.