Several systems have been developed for monitoring sources on the web; CONQUER [14], WebCQ [13], and C3 are the projects most closely related to our work. The most evident difference between CAM and these systems lies in the monitoring approach: the way source sites are monitored in CAM is more efficient. CAM keeps responses to continuous queries current by focusing on the problem of dynamically monitoring the sources of information relevant to the queries. From the change characteristics of these pages (observed during a tracking phase), a probabilistic model of their change behavior is formulated, and weights are assigned to pages to denote their importance for the current queries. During the resource allocation phase, the resources needed to continuously monitor these pages for changes are allocated on the basis of these statistics. Given these resource allocations, the scheduling phase produces a near-optimal monitoring schedule.

We also presented experimental evidence for the effectiveness of our approach, which offers a several-fold improvement in the returned information compared to the classical Uniform and Proportional techniques; in general, CAM's advantage grows under skewed page update and page weight distributions. We showed that these classical techniques do not work well for answering continuous queries because of the specific nature of such queries, and we formally proved that Proportional allocation outperforms Uniform allocation for CQs. CAM's resource allocation algorithm is designed to be optimal: it maximizes the value of the returned information when updates are quasi-deterministic, i.e., when updates occur at specific time instants and the actual occurrence of a change is associated with a probability. Similar resource allocation techniques can be developed to be optimal for other types of update behavior.
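To make the quasi-deterministic setting concrete, the following is a minimal sketch of one way such an allocator could look: each page carries a weight and a set of candidate update instants with associated change probabilities, and a budget of monitoring probes is spent greedily on the (page, instant) pairs with the highest expected weighted gain. The names (`Page`, `allocate_probes`) and the greedy rule are illustrative assumptions, not the exact algorithm of the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    url: str
    weight: float                      # importance for the current queries
    # candidate update instants -> probability a change occurs at that instant
    change_prob: dict[int, float] = field(default_factory=dict)

def allocate_probes(pages: list[Page], budget: int) -> list[tuple[str, int]]:
    """Greedily assign `budget` monitoring probes to the (page, instant)
    pairs with the highest expected value of captured information."""
    candidates = [
        (page.weight * prob, page.url, instant)
        for page in pages
        for instant, prob in page.change_prob.items()
    ]
    candidates.sort(reverse=True)                  # highest expected gain first
    return [(url, instant) for _, url, instant in candidates[:budget]]

if __name__ == "__main__":
    pages = [
        Page("a.com", weight=3.0, change_prob={9: 0.8, 17: 0.4}),
        Page("b.com", weight=1.0, change_prob={9: 0.9, 12: 0.5}),
    ]
    # Expected gains: a.com@9 = 2.4, a.com@17 = 1.2, b.com@9 = 0.9, b.com@12 = 0.5,
    # so a budget of 2 probes goes to a.com at instants 9 and 17.
    print(allocate_probes(pages, budget=2))
```

Note how a skewed weight distribution (here, a.com dominating) concentrates probes on the important page, which is consistent with the observation above that CAM's advantage grows under skew.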
There have been several studies of web crawling that attempt to capture web dynamics. The earliest study we are aware of is by Brewington and Cybenko [1]. They not only studied the dynamics of the web but also raised some very interesting issues for developing better crawling techniques. In particular, they showed that change behavior varies significantly from page to page, so crawling all pages an equal number of times is a fallacious strategy. [3] and [2] address a number of issues relating to the design of effective crawlers. In [4,18], the authors approach the problem formally and devise an optimal crawling technique. (Some aspects of our formal model are adapted from [18] and modified to suit our problem definition.) A common assumption made in most of these studies is that page changes follow a Poisson, or memoryless, process. This assumption has indeed been shown to hold for a large set of pages, but [1] also found that most web pages are modified during US working hours, i.e., 5 a.m. to 5 p.m. In our case, we go beyond these assumptions and present an optimal monitoring technique for answering continuous queries that is independent of any assumption about page change behavior. Instead, we collect page change statistics during a tracking period and perform resource allocation solely on the basis of this collected information. We then keep updating these statistics after every T time units based on the results of the monitoring performed. This makes our solution robust and adaptable to any web scenario.
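A minimal sketch of this tracking-and-refresh loop is given below: per-instant change probabilities are seeded during the tracking period and then folded together with each new monitoring observation. The class name, the per-instant granularity, and the exponential-smoothing update are assumptions made for illustration; the paper's exact statistics-update procedure may differ.

```python
class ChangeTracker:
    """Per-page change statistics, estimated during a tracking period
    and refreshed after every monitoring round (illustrative sketch)."""

    def __init__(self, instants: list[int], alpha: float = 0.2):
        self.alpha = alpha                       # smoothing weight for new evidence
        self.prob = {t: 0.0 for t in instants}   # estimated P(change at instant t)
        self.seen = {t: 0 for t in instants}     # observations per instant

    def record(self, instant: int, changed: bool) -> None:
        """Fold one monitoring observation into the running estimate."""
        self.seen[instant] += 1
        if self.seen[instant] == 1:
            self.prob[instant] = float(changed)  # first observation seeds the estimate
        else:
            # exponential smoothing: no Poisson/memoryless assumption is needed
            self.prob[instant] += self.alpha * (float(changed) - self.prob[instant])
```

After every T time units, the resource allocator would simply re-read `tracker.prob`, so the monitoring schedule adapts to whatever change behavior the pages actually exhibit.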
It is important to mention the push-based alternative for answering continuous queries: information is pushed from the web sources instead of being pulled by users, as is assumed in our scheme [16]. Here, users register their queries with the sources, and when the sources update the relevant pages they themselves propagate the changes to the users. This, of course, is applicable only to push-enabled Web sites, and even in that case the onus of consolidating and aggregating the information returned from the sources is left to the end application.
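For contrast with our pull-based scheme, the push model just described amounts to a simple publish-subscribe interaction, sketched below. This is a hypothetical toy illustration of the model, not an API from [16].

```python
from collections import defaultdict
from typing import Callable

class PushSource:
    """Toy push-enabled source: clients register interest in a page and
    the source notifies them whenever it updates that page."""

    def __init__(self):
        self._subscribers: dict[str, list[Callable[[str, str], None]]] = defaultdict(list)

    def register(self, page: str, callback: Callable[[str, str], None]) -> None:
        self._subscribers[page].append(callback)

    def update(self, page: str, new_content: str) -> None:
        # The source, not the client, initiates delivery of the change.
        for notify in self._subscribers[page]:
            notify(page, new_content)

src = PushSource()
src.register("news/sports", lambda page, body: print(f"{page} changed: {body}"))
src.update("news/sports", "final score posted")
```

Even in this model, note that each source pushes its changes independently; combining the streams into a single query answer remains the end application's job.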