Subsections

Resource Allocation in CAM

As noted earlier, we need to distinguish between pages on the basis of two metrics. One is the nature of page change behavior and the other is the importance of a page for one or more queries. Page change behavior is studied during a tracking phase and is characterized by associating a probability of change with every potential update instant. Next we show the way in which pages can be ranked by assigning weights to them using relevance measures. These relevance measures are determined for each page for each query during the tracking period.

Goals of the Resource Allocation Phase

CAM aims is to minimize the weighted importance of changes that are not reported to users: $\sum_{i \in P}^{}$ W_i E_i. Here E_i denotes expected number of lost changes for i^th page.

We make two assumptions about updates. First, inter-update intervals are independent of each other. Second, an update results in complete loss of information from previous updates. That is, the number of lost updates is an indication of the amount of lost information. While these may not always be true, they give us a simple way to state the goal to be accomplished. Thus,

E_i

$\displaystyle \sum_{j \in U_i}^{}$ $\displaystyle \rho_{i,j}^{}$ (1 - y_i,
j).

The constraint on resources is given by

$\displaystyle \sum_{i \in P}^{}$ $\displaystyle \sum_{j \in U_i}^{}$ y_{i, j} = C.

where C denotes the maximum available number of monitoring tasks (which we assume are all used up). Implicitly, we assume that during the monitoring epoch of length T, the relevance of each updated version of page for queries remains the same as estimated during the tracking period. At the end of a monitoring epoch, we update these relevance measures on the basis of the monitored information. So unless T is very large, or page updates are very erratic, our assumption is acceptable.

It is important to point out that it is relatively easy to accommodate the following extensions to the above model. In certain cases, we may have more information about specific changes than the case described above. For example, if we can measure change behavior of i^th page with respect to j^th query, then it would be possible to allocate resources even more efficiently. For example, suppose we get to know during the tracking period that a particular news site mainly declares health updates only once at the start of day and in the rest of the time, it remains mainly concerned about political and sports updates, then we can better characterize the change behavior of this page with respect to queries concerned with sports, medical and political domain.

Suppose $\rho_{i,j,k}^{}$ denotes the probability of change of i^th page at j^th update instant where this change is relevant for query k. Then E_i, the weighted expected number of lost changes for i^th page is,

$\displaystyle \sum_{k \in Q}^{}$ $\displaystyle \omega_{k}^{}$ r_i,
k $\displaystyle \left(\vphantom{ \sum_{j \in U_i} \rho_{i,j,k} (1-y_{i,j}) }\right.$ $\displaystyle \sum_{j \in U_i}^{}$ $\displaystyle \rho_{i,j,k}^{}$ (1 - y_i,
j) $\displaystyle \left.\vphantom{ \sum_{j \in U_i} \rho_{i,j,k} (1-y_{i,j}) }\right)$

If we can extract even more information about changes to the ith page (by measuring, say, not only the probability of change at time u_{i, j}, but also the average importance of change at this time) then it can make better resource allocation. For example, suppose we find that a particular research site compiles and announces all its previous day's research updates daily at 10:00a.m. in the morning, but throughout the rest of the day, it updates its page only if/when some new research breakthrough takes place. Then it is clear that visit to this page right after 10:00a.m. is more likely to be fruitful than any other visit to this page.

The Resource Allocation Algorithm

The formulated resources allocation problems are discrete, separable and convex.

Discrete: because variable y_i,
j can take only discrete values. Our problem is inherently discrete due to discrete nature of monitoring. Either a monitoring task is allocated to a page or it won't be. There can't be anything between these.
Separable: because optimizing function could be expressed in terms of y_{i, j} only.
Convex: owing to the convex nature of the objective function.

Discrete, separable and convex problems have been studied intensively [10]. Formally, it can be stated as minimizing $\sum_{i = 1}^{G}$ F_i(x_i) with resource constraint $\sum_{i = 1}^{G}$ x_i = Z, where x_is are discrete and F_is are convex. A greedy algorithm exists for the discrete case [7]. There is a faster algorithm also for our problem, due to Galil and Megiddo, which has complexity O(G(log Z)²). The fastest algorithm is due to Frederickson and Johnson [8] of complexity O(max{G, G log(Z/G)}). In our case, the output of these algorithms is a set of y_i,
j's. This set in turn gives us the number of monitoring tasks allocated to a page ( x_i = $\sum_{j=0}^{p_i}$ y_{i, j}) as well as the ideal time instants, (namely, the js for which y_{i, j} is 1), at which these allocated monitoring tasks should be executed.

Given the design of the above resource allocation algorithm, the y_{i, j}s that result are optimal, i.e., maximize the value of the returned information, when the updates are quasi-deterministic, i.e., occur at specific time instants and the actual occurrence of a change is associated with a probability.

Sandeep Pandey 2003-03-05