Not surprisingly, the problem of keeping track of the dynamics of the web becomes inherently different for the continuous query case compared to the discrete query case. We use the term monitoring to explicitly account for the differences from the classical crawling problem. A monitoring task fetches a web page, much like a crawler does, but with the goal of fetching new information relevant to one or more queries while a crawl is not done with any specific user request in mind. The work involved in handling continuous queries is portrayed in Figure 1. For continuous queries, since the system should maintain the freshness or currency of responses to users, the problem translates to one of (a) knowing which pages are relevant, (b) tracking the changes to the pages, to determine the characteristics of changes to these pages, and from these, (c) deciding when to monitor the pages for changes, so that responses are current. The last problem is the focus of this paper, and has several subproblems: allocating the resources needed for monitoring the pages, scheduling the actual monitoring tasks, and then monitoring. Specifically, in this paper, we address the problem of distributing a given number of monitoring tasks among the pages whose changes need to be tracked so as to respond to a set of continuous queries.
In Figure 1, the feedback arcs from the monitoring phase to the earlier phases indicate that observations made during the monitoring phase can be used to adjust subsequent decisions.
It could be argued that discrete queries posed every so often can be considered to be equivalent to continuous queries but the following reasons should help dispel this misconception: First, determining the next time when the discrete query should be posed by the user is highly non-trivial. If the time-interval is kept small then it may induce unnecessary load on the system, particularly when the updates are not frequent. If we set the time-interval to be large, it may lead to loss of information if updates are more frequent than expected. Second, in contrast to discrete queries which have zero lifetimes, continuous queries have a non-zero lifetime, enabling a query system to study a query's characteristics carefully and thereby answer it more efficiently. Unlike in the case of discrete queries, the time taken to provide the system's first response to a continuous query may not be as important as the maintenance of currency during all the responses.
The above discussion makes it clear that not only the nature of the crawling problem but optimization goals also become different when we move from discrete to continuous queries. To this end, our optimization metric minimizes the information loss, compared to an ideal monitoring algorithm which can monitor every change of a page.
We show the optimality of CAM's resource allocation algorithm under specific change scenarios and formally prove that, in the continuous query case, that proportional allocation, in which the pages with high frequency of change are allocated more monitoring tasks, works better than uniform allocation, which allocates an equal number of monitoring tasks to each page, independent of its change frequency. On the surface, this seems to contradict a prior result [3] that the uniform allocation of crawling resources produces better results than its proportional counterpart. We provide an intuitive explanation for this apparently surprising behavior. This also emphasizes why CAM is distinct from earlier work on crawl maintenance for answering discrete queries.