As noted earlier, we need to distinguish between pages on the basis of two metrics. One is the nature of page change behavior and the other is the importance of a page for one or more queries. Page change behavior is studied during a tracking phase and is characterized by associating a probability of change with every potential update instant. Next we show the way in which pages can be ranked by assigning weights to them using relevance measures. These relevance measures are determined for each page for each query during the tracking period.
CAM aims is to minimize the weighted importance of changes that are not reported to users: Wi E i. Here Ei denotes expected number of lost changes for ith page.
We make two assumptions about updates. First, inter-update
intervals are independent of each other. Second, an update results
in complete loss of information from previous updates. That is, the
number of lost updates is an indication of the amount of lost
information. While these may not always be true, they give us a
simple way to state the goal to be accomplished. Thus,
Ei | = | (1 - yi, j). |
yi, j = C. |
It is important to point out that it is relatively easy to accommodate the following extensions to the above model. In certain cases, we may have more information about specific changes than the case described above. For example, if we can measure change behavior of ith page with respect to jth query, then it would be possible to allocate resources even more efficiently. For example, suppose we get to know during the tracking period that a particular news site mainly declares health updates only once at the start of day and in the rest of the time, it remains mainly concerned about political and sports updates, then we can better characterize the change behavior of this page with respect to queries concerned with sports, medical and political domain.
Suppose denotes the probability of change of ith page at jth update instant where this change is relevant for query k. Then Ei, the weighted expected number of lost changes for ith page is,
If we can extract even more information about changes to the ith page (by measuring, say, not only the probability of change at time ui, j, but also the average importance of change at this time) then it can make better resource allocation. For example, suppose we find that a particular research site compiles and announces all its previous day's research updates daily at 10:00a.m. in the morning, but throughout the rest of the day, it updates its page only if/when some new research breakthrough takes place. Then it is clear that visit to this page right after 10:00a.m. is more likely to be fruitful than any other visit to this page.
The formulated resources allocation problems are discrete, separable and convex.
Given the design of the above resource allocation algorithm, the yi, js that result are optimal, i.e., maximize the value of the returned information, when the updates are quasi-deterministic, i.e., occur at specific time instants and the actual occurrence of a change is associated with a probability.
Sandeep Pandey 2003-03-05