The Web is a constantly updating source of information. A large number of latest developments, events and commentaries are constantly being posted on the Web in the form of blogs, news articles, usenet posts etc. In order to help the users avoid the tedious task of periodically searching for updated information on the Web, some of the search engines provide an ``alert'' service (e.g. Google alerts [16] or Live alerts [26]). The main idea is to let the users create profiles (e.g. by specifying a few query keywords) describing the information for which they would like to receive updates. After that, the search engines are continuously monitoring the newly collected information from the Web and will alert the user (e.g. through email) whenever there is new information that matches the user's profile. In this way, the user can stay current with a developing news story, track a good deal on a desired product or follow who is writing about her blog.
The continuous monitoring of the Web to match a given set of user-defined keywords, is also known as continuous querying. Typically, the results of a continuous query can be returned to the user either in an ``as-found'' basis or ``in-batch'' after a particular time interval (e.g. once a day).
Although running continuous queries on the Web can potentially help the users to stay current with important updates, in general, the amount of information returned as updates to the user can be ``unbounded''. For example, if the user is following a very controversial or popular topic, she may receive hundreds of updated pages as an alert, and may thus be overwhelmed by this huge amount of information.
Since a user typically can allow a limited time for comprehending the information delivered to her, one way to alleviate this problem is to allow the user to restrict (or bound)1 the number of returned results within a particular time interval. More specifically, the user may decide that, say, every day, she is interested in reading only the 10 most relevant updates to her continuous query and would like to receive only those updates. For the cases where it is acceptable to return the results in-batch the solution is straightforward: we first collect all the relevant results within a day, we rank them and then return the top-10 to the user. However, in the cases where the ``freshness'' of the results is very important, and thus we need to return them as-found, the user is not willing to wait until we collect all the relevant results and return them at the end of the query period. For example, if a user is tracking Web pages describing digital cameras offered for sale, she would like to know the 10 best pages according to some specification as soon as they appear, since the cameras may be sold after a short period of time.
Returning the  best results in
  an as-found basis to a given continuous query involves two main
  challenges. First, the potentially relevant results within a time
  period (e.g. a day) are not known in advance. Without knowing
  all the relevant results how can we find the
  top-
 best results in
  an as-found basis to a given continuous query involves two main
  challenges. First, the potentially relevant results within a time
  period (e.g. a day) are not known in advance. Without knowing
  all the relevant results how can we find the
  top- among them to return to the user?
  Second, the points in time where the top-
 among them to return to the user?
  Second, the points in time where the top- relevant results will appear are also not known in advance. Given
  that the freshness of results is very important, how can we
  ensure that we return the top-
  relevant results will appear are also not known in advance. Given
  that the freshness of results is very important, how can we
  ensure that we return the top- results as
  soon as they appear?
 results as
  soon as they appear?
Regarding the first challenge, clearly we will have to wait
  until we see all results in order to calculate the
  exact top- . However, in a
  practical scenario, we may safely assume that the user is willing
  to exchange some imprecision in the top-
. However, in a
  practical scenario, we may safely assume that the user is willing
  to exchange some imprecision in the top- results for a greater freshness. For example, in our digital
  camera example, the user may be happy to receive the 10 deals out
  of which only 9 belong to the top-10, but receive them soon
  enough to actually be able to buy the products.
  results for a greater freshness. For example, in our digital
  camera example, the user may be happy to receive the 10 deals out
  of which only 9 belong to the top-10, but receive them soon
  enough to actually be able to buy the products.
Given this relevance/freshness tradeoff, in this paper we
  present an optimization method for bounded continuous search
  queries on the Web. More specifically, our goal is to extract the
  top- relevant pages that appear over a
  specified time interval on the Web and return them to the user as
  soon as possible. Our proposed solution utilizes principles from
  the field of optimal stopping [34] in order to realize fresh, high
  quality and a bounded number of search results during a specified
  execution time of a continuous query. Optimal stopping is a
  well-known problem in the field of financial mathematics. The
  main idea of this paper is to consider the development of the
  relevance of versions of Web pages as relevance charts and
  to treat the problem of estimating top-k results as-found similar
  to a basic problem in financial mathematics, the problem of
  finding the buying or selling points for stocks at optimal
  prices.
 relevant pages that appear over a
  specified time interval on the Web and return them to the user as
  soon as possible. Our proposed solution utilizes principles from
  the field of optimal stopping [34] in order to realize fresh, high
  quality and a bounded number of search results during a specified
  execution time of a continuous query. Optimal stopping is a
  well-known problem in the field of financial mathematics. The
  main idea of this paper is to consider the development of the
  relevance of versions of Web pages as relevance charts and
  to treat the problem of estimating top-k results as-found similar
  to a basic problem in financial mathematics, the problem of
  finding the buying or selling points for stocks at optimal
  prices.
In summary, our contributions in this paper are:
In the next section we start our discussion by presenting the
  strategy that is presently employed by the current search
  engines. In section ![[*]](file:/usr/share/latex2html/icons/crossref.png) we
  demonstrate that in the cases where we need to obtain information
  as up-to-date as possible, current approaches may return
  sub-optimal results to the users. In Section
 we
  demonstrate that in the cases where we need to obtain information
  as up-to-date as possible, current approaches may return
  sub-optimal results to the users. In Section ![[*]](file:/usr/share/latex2html/icons/crossref.png) we define
  a new language for bounded continuous search queries and present
  our optimization approach which utilizes principles from the
  field of optimal stopping . Finally, in the experimental section
 we define
  a new language for bounded continuous search queries and present
  our optimization approach which utilizes principles from the
  field of optimal stopping . Finally, in the experimental section
  ![[*]](/www2007/file:/usr/share/latex2html/icons/crossref.png) we
  verify that our new method can generate fresher and more relevant
  results for a variety of continuous queries. We conclude with an
  overview of related research and a summary.
 we
  verify that our new method can generate fresher and more relevant
  results for a variety of continuous queries. We conclude with an
  overview of related research and a summary.
 . The index
. The index
   corresponds to the time a
  document is obtained from the Web in a push or a pull manner
  [21].2
 corresponds to the time a
  document is obtained from the Web in a push or a pull manner
  [21].2
 and a query profile (i.e. a set of query terms)
 and a query profile (i.e. a set of query terms)
   [39]. In the formula the term
 [39]. In the formula the term
   (term frequency) is
  the frequency of a term
 (term frequency) is
  the frequency of a term  in a
  document
 in a
  document  . N is the number of
  documents in the collection,
. N is the number of
  documents in the collection, 
   is the
  inverse document frequency where
 is the
  inverse document frequency where  is the number of documents in the collection that
  contain
 is the number of documents in the collection that
  contain  . Formula 1 is one way to estimate the relevance of a
  document with respect to a query. In this paper in order to keep
  our discussion more concise whenever we talk about
  relevance we will use equation 1,
  but in general we could use any other estimation.
. Formula 1 is one way to estimate the relevance of a
  document with respect to a query. In this paper in order to keep
  our discussion more concise whenever we talk about
  relevance we will use equation 1,
  but in general we could use any other estimation.
  A problem of this measure is the computation of the
   term. At a specific point in
  time the entire set of documents appearing in the stream is not
  yet available. Possible approaches to this problem have been
  presented in [38],
  [18] and [31]. In this work we consider each
  version of the information source or incoming document as a
  single document 'collection'. The
 term. At a specific point in
  time the entire set of documents appearing in the stream is not
  yet available. Possible approaches to this problem have been
  presented in [38],
  [18] and [31]. In this work we consider each
  version of the information source or incoming document as a
  single document 'collection'. The  -term is based only on the state of the information
  source at the current point in time. If
-term is based only on the state of the information
  source at the current point in time. If  is the state of the information source
 is the state of the information source
   at time
 at time  we denote the respective ranking of document
 we denote the respective ranking of document 
   with respect to
  the query
 with respect to
  the query  as
 as 
   . As
  described above in this work we make the simplifying assumption
  that at each point in time a single (new) document is available.
  We may therefore define
. As
  described above in this work we make the simplifying assumption
  that at each point in time a single (new) document is available.
  We may therefore define 
   and
 and 
   .
.
  The function  is used to obtain
  relevance estimations for new documents in the sequence with
  respect to a query profile. In this work we simplify the search
  for optimal documents by defining quality as the
  estimation provided by the ranking function
 is used to obtain
  relevance estimations for new documents in the sequence with
  respect to a query profile. In this work we simplify the search
  for optimal documents by defining quality as the
  estimation provided by the ranking function  . Quality is used as a synonym for the relevance of
  documents.
. Quality is used as a synonym for the relevance of
  documents.
  
  In order to find optimal documents we have therefore simply to
  find documents with the highest ranking values according to
   . This quality definition is
  used because the development of estimation functions similar to
  (1) is not the focus of this paper but
  has been examined thoroughly in the field of information
  retrieval. In this paper we show that even if an optimal
  estimator for the quality of documents is given (or assumed) the
  optimization of bounded continuous search queries is not a
  trivial problem.
. This quality definition is
  used because the development of estimation functions similar to
  (1) is not the focus of this paper but
  has been examined thoroughly in the field of information
  retrieval. In this paper we show that even if an optimal
  estimator for the quality of documents is given (or assumed) the
  optimization of bounded continuous search queries is not a
  trivial problem.
  Based on the previous definitions we may now define a common
  strategy to process bounded document filtering.
   
  
In this work we consider a PE method that applies function (1)
  as the ranking or evaluation function. Obviously a PE method is a
  bounded filtering method according to definition 1 due to the bounding condition,
  which may e.g. imply a maximal number of pages returned in each
  evaluation period or a total maximal number. In the latter case
  the number of returned pages per evaluation period is determined
  by (the closest integer to) the total number of documents to be
  returned according to the bounding condition divided by
   , the number of evaluation
  periods. A PE-query may e.g. inquire about the 'best 10' pages
  each day with respect to a set of query terms, as e.g. realized
  by the GoogleAlert system [16]. The PE-method is illustrated
  in figure
, the number of evaluation
  periods. A PE-query may e.g. inquire about the 'best 10' pages
  each day with respect to a set of query terms, as e.g. realized
  by the GoogleAlert system [16]. The PE-method is illustrated
  in figure ![[*]](/www2007/file:/usr/share/latex2html/icons/crossref.png) .
  In the figure '
.
  In the figure ' '-symbols denote
  ranking values depending on the current state of a specific data
  source or a new document in a stream, a query profile and a
  ranking function (as e.g. function (1)).
  In this case the query execution time is
'-symbols denote
  ranking values depending on the current state of a specific data
  source or a new document in a stream, a query profile and a
  ranking function (as e.g. function (1)).
  In this case the query execution time is  . There are two evaluation periods. The
  bounding condition is 4, i.e. the best two documents in each
  evaluation period have to be selected and forwarded to a user as
  indicated by circled ranking values.
. There are two evaluation periods. The
  bounding condition is 4, i.e. the best two documents in each
  evaluation period have to be selected and forwarded to a user as
  indicated by circled ranking values.
![[*]](file:/usr/share/latex2html/icons/crossref.png) that by
  applying the PE strategy documents are returned with a certain
  delay between the point in time when a document is
  obtained3 and the time at the end of an
  evaluation period when a document or a respective notification is
  forwarded to a user. We may therefore define a freshness (or
  reciprocal: delay) metric as:
 that by
  applying the PE strategy documents are returned with a certain
  delay between the point in time when a document is
  obtained3 and the time at the end of an
  evaluation period when a document or a respective notification is
  forwarded to a user. We may therefore define a freshness (or
  reciprocal: delay) metric as: are the end points
  of the evaluation periods (figure
 are the end points
  of the evaluation periods (figure ![[*]](file:/usr/share/latex2html/icons/crossref.png) ). At these
  points in time results are sent to a user.
). At these
  points in time results are sent to a user.  is the number of evaluation periods and
 is the number of evaluation periods and
   is the number of requested
  (estimated best) objects.
 is the number of requested
  (estimated best) objects.  is
  the number of optimal elements to be selected in the n-th
  evaluation period (with
 is
  the number of optimal elements to be selected in the n-th
  evaluation period (with 
   ) and
) and
   is the j-th best element
  selected in the n-th evaluation period.
 is the j-th best element
  selected in the n-th evaluation period. .
.
  The validity of this theorem may be demonstrated by
  considering the example in figure 1. If
  the best documents have to be selected that appear during the
  query execution time ![$[e_0, e_2]$](/www2007/test-img12.png) the optimal
  strategy is to store documents and to wait until
 the optimal
  strategy is to store documents and to wait until  . At this point the 4 highest ranked documents may be
  returned if we assume that the bounding condition implies a
  number of 4 documents to be returned. However, as shown in the
  example in figure 1 the delay-value as
  defined above may be significant.
   href="/www2007/#foot55">1 In order to acquire fresher
  results, a larger number of evaluation periods has to be
  considered. In figure 1 the query period
  is subdivided into 2 evaluation periods. The bounding condition
  in this case is 2 documents for each of the two evaluation
  periods in order to fulfill the global bounding condition of
  maximal 4 documents. The freshness of retrieved optimal documents
  is obviously increased. However the selected documents (as
  illustrated by circled '
. At this point the 4 highest ranked documents may be
  returned if we assume that the bounding condition implies a
  number of 4 documents to be returned. However, as shown in the
  example in figure 1 the delay-value as
  defined above may be significant.
   href="/www2007/#foot55">1 In order to acquire fresher
  results, a larger number of evaluation periods has to be
  considered. In figure 1 the query period
  is subdivided into 2 evaluation periods. The bounding condition
  in this case is 2 documents for each of the two evaluation
  periods in order to fulfill the global bounding condition of
  maximal 4 documents. The freshness of retrieved optimal documents
  is obviously increased. However the selected documents (as
  illustrated by circled '  '-symbols) are no longer the optimal ones and
  represent a suboptimal choice.
'-symbols) are no longer the optimal ones and
  represent a suboptimal choice.

 
  ![[*]](file:/usr/share/latex2html/icons/crossref.png) we
  wouldn't have the restriction of the maximal number of objects to
  be returned and could forward each object above a specified
  threshold. However in this case not knowing future ranking
  values, a suboptimal threshold may be chosen, which affects
  precision and recall results.
 we
  wouldn't have the restriction of the maximal number of objects to
  be returned and could forward each object above a specified
  threshold. However in this case not knowing future ranking
  values, a suboptimal threshold may be chosen, which affects
  precision and recall results.
   consisting of a
  normal query
 consisting of a
  normal query  (e.g. written in SQL), a
  trigger condition
 (e.g. written in SQL), a
  trigger condition  and a
  termination criterion
 and a
  termination criterion  . In this work
  we consider only time-based trigger conditions. We extend the
  basic notation of OpenCQ in order to support continuous search
  queries. For this purpose we assume the availability of a ranking
  function for query results as provided by (1). A main extension with respect to many
  continuous query languages is the possibility to provide a
  bounding condition. In the considered query language a user has
  to define the number of estimated best results to be returned.
  This feature is well-known from common search engines. The best
  'n' results are displayed on the first result page. A further
  specific attribute is the requirement to specify a user profile
  consisting of query terms. An example for the considered query
  language is the following:
. In this work
  we consider only time-based trigger conditions. We extend the
  basic notation of OpenCQ in order to support continuous search
  queries. For this purpose we assume the availability of a ranking
  function for query results as provided by (1). A main extension with respect to many
  continuous query languages is the possibility to provide a
  bounding condition. In the considered query language a user has
  to define the number of estimated best results to be returned.
  This feature is well-known from common search engines. The best
  'n' results are displayed on the first result page. A further
  specific attribute is the requirement to specify a user profile
  consisting of query terms. An example for the considered query
  language is the following: BEST
    10
 BEST
    10 BEST'
  directive in the query denotes that, given an appropriate ranking
  measure and estimation method, the query engine should estimate
  and return the best documents. In general it is not known, if
  versions of a data source that appear in the future have a higher
  ranking and thereby declassify the current version as
  (relatively) irrelevant. The current version would thereby create
  'costs' in terms of information overload and a decreased
  'precision', if returned to a user.
 BEST'
  directive in the query denotes that, given an appropriate ranking
  measure and estimation method, the query engine should estimate
  and return the best documents. In general it is not known, if
  versions of a data source that appear in the future have a higher
  ranking and thereby declassify the current version as
  (relatively) irrelevant. The current version would thereby create
  'costs' in terms of information overload and a decreased
  'precision', if returned to a user.
  In the example the current version may contain the terms 'camera 12 mega' but a future version may contain the terms 'camera 12 mega' and 'flash' which declassifies the current version. However if the query engine waits until all versions have been available, the respective cameras may already be sold.
In the following we refer to this query language as bounded
  continuous search query language (BCSQL).
 , where
, where
   is the number of candidates and
 is the number of candidates and
   is the Euler number. In
  other words approximately one third of the candidates should be
  observed without being chosen. This result has been proved 'in
  the limit', for
 is the Euler number. In
  other words approximately one third of the candidates should be
  observed without being chosen. This result has been proved 'in
  the limit', for 
   . Further
  strategies for the basic SSP are discussed in [19]. Extensions of the basic SSP have
  been proposed in [14] and
  [30].
. Further
  strategies for the basic SSP are discussed in [19]. Extensions of the basic SSP have
  been proposed in [14] and
  [30]. candidates in a
  stream of ranked documents by
 candidates in a
  stream of ranked documents by  choices,
  we denote as k-SSP. An obvious extension of the single SSP
  is not to consider a single observation period (needed to adjust
  the optimal selection probability) but to consider
 choices,
  we denote as k-SSP. An obvious extension of the single SSP
  is not to consider a single observation period (needed to adjust
  the optimal selection probability) but to consider  observation periods. Our method, following an
  approach in [15], first
  implies the choice of
 observation periods. Our method, following an
  approach in [15], first
  implies the choice of  starting
  times
 starting
  times 
   .7 After rejecting the
  first
.7 After rejecting the
  first 
   candidates,
  the first candidate considered for selection is examined at or
  after time
 candidates,
  the first candidate considered for selection is examined at or
  after time  .
. candidates have already been
  examined with
 candidates have already been
  examined with  objects accepted and
 objects accepted and
   rejected, the
 rejected, the  object is chosen if it is at least better
  than one of the
 object is chosen if it is at least better
  than one of the  objects already selected.
  It is rejected if it is worse than at least one of the
 objects already selected.
  It is rejected if it is worse than at least one of the
   objects rejected.
 objects rejected. is ranked
 is ranked  (between the accepted and the rejected objects) it is chosen if
  (between the accepted and the rejected objects) it is chosen if 
  
   and
  rejected if
 and
  rejected if 
   , where
, where
   is the current point in
  time.
 is the current point in
  time. choices have been made where
 choices have been made where
   and
 and  candidates are left with respect to the entire
  sequence of input candidates to be evaluated, all of the
  remaining candidates must be chosen in order to guarantee that
 candidates are left with respect to the entire
  sequence of input candidates to be evaluated, all of the
  remaining candidates must be chosen in order to guarantee that
   objects are chosen.
 objects are chosen.
An example is shown in figure ![[*]](file:/usr/share/latex2html/icons/crossref.png) . We denote
  the sequence of candidates as
. We denote
  the sequence of candidates as 
   appearing
  at times
 appearing
  at times 
   respectively.
  We consider a number of 2 candidates to be returned and two
  starting points
 respectively.
  We consider a number of 2 candidates to be returned and two
  starting points 
   and
 and 
   . Candidates
. Candidates
   and
 and  are rejected due to the first observation phase. Candidate
  are rejected due to the first observation phase. Candidate
   at
 at  is accepted because it is better than all of
  the previously rejected candidates.
 is accepted because it is better than all of
  the previously rejected candidates.  is better than all the previously rejected candidates and worse
  than all the previously accepted candidates. It is rejected
  because it appears before the stopping time
  is better than all the previously rejected candidates and worse
  than all the previously accepted candidates. It is rejected
  because it appears before the stopping time  . It would have been accepted if
. It would have been accepted if 
   .
.  is accepted because it is better than at least one
  previously accepted candidate. Due to the previous choice of two
  candidates, candidates
 is accepted because it is better than at least one
  previously accepted candidate. Due to the previous choice of two
  candidates, candidates  and
 and  are not considered.
 are not considered.
  Based on this selection strategy the main problem is to find
  optimal times 
   in
  order to maximize the probability of choosing the best
 in
  order to maximize the probability of choosing the best
   candidates. Due to the equal
  distribution of ranking values intuitively the starting times
  should be spread evenly over the considered time period. In
  [15] a strategy is proposed
  to position starting times that is proved to be optimal and
  applied in section
 candidates. Due to the equal
  distribution of ranking values intuitively the starting times
  should be spread evenly over the considered time period. In
  [15] a strategy is proposed
  to position starting times that is proved to be optimal and
  applied in section ![[*]](file:/usr/share/latex2html/icons/crossref.png) .
.
![[*]](file:/usr/share/latex2html/icons/crossref.png) the delay
  parameter is not adjustable if the SSP is applied to the
  optimization of retrieval results. Results are returned
  immediately (delay=0) if estimated to have a high ranking.
 the delay
  parameter is not adjustable if the SSP is applied to the
  optimization of retrieval results. Results are returned
  immediately (delay=0) if estimated to have a high ranking.
  ![[*]](file:/usr/share/latex2html/icons/crossref.png) shows the
  basic steps of the BCS query processing algorithm. The input of
  the algorithm are the start and the end time of the continuous
  query, the trigger condition, a value '
 shows the
  basic steps of the BCS query processing algorithm. The input of
  the algorithm are the start and the end time of the continuous
  query, the trigger condition, a value ' ' for the number of estimated best items to be chosen and
  a query profile
' for the number of estimated best items to be chosen and
  a query profile  .
. of reload operations
  are computed. Applying the k-SSP strategy in section
 of reload operations
  are computed. Applying the k-SSP strategy in section ![[*]](file:/usr/share/latex2html/icons/crossref.png) the
  starting times
 the
  starting times 
   are
  computed based on the number of candidates and the number 'k' of
  highly ranked candidates to be chosen. At time
 are
  computed based on the number of candidates and the number 'k' of
  highly ranked candidates to be chosen. At time  the first candidate is loaded in step 7 and the
  ranking with respect to the search query '
 the first candidate is loaded in step 7 and the
  ranking with respect to the search query ' ' is computed (section 2.1).
  The ranking is compared to previous ranking values in step 9
  which are available in the list
' is computed (section 2.1).
  The ranking is compared to previous ranking values in step 9
  which are available in the list  and the relative ranking is computed. In step 10
  it is determined if a new version is chosen as a highly ranked
  candidate according to section
 and the relative ranking is computed. In step 10
  it is determined if a new version is chosen as a highly ranked
  candidate according to section ![[*]](file:/usr/share/latex2html/icons/crossref.png) . In figure
. In figure
  ![[*]](/www2007/file:/usr/share/latex2html/icons/crossref.png) we assume the availability of a function isSelected(C)
  that indicates, if a candidate C has been selected. In
  step 11 the new candidate C is inserted into the list
  rankList at the position determined by the ranking value.
  If the candidate is chosen, a message is sent to the client.
  Finally the algorithm waits until the time of the subsequent
  reload time in step 13 and returns to step 6.
  we assume the availability of a function isSelected(C)
  that indicates, if a candidate C has been selected. In
  step 11 the new candidate C is inserted into the list
  rankList at the position determined by the ranking value.
  If the candidate is chosen, a message is sent to the client.
  Finally the algorithm waits until the time of the subsequent
  reload time in step 13 and returns to step 6. based on s,e,tc
    based on s,e,tc based on s,e,tc
 based on s,e,tc based on
    based on  ,
, 

 )
)
 based on
 based on
     ,
, 
 to previous
    rankings
 to previous
    rankings according to
 according to into
    rankList
 into
    rankList )
)![[*]](file:/usr/share/latex2html/icons/crossref.png) ) and the
  quality of search results according to definition
) and the
  quality of search results according to definition ![[*]](file:/usr/share/latex2html/icons/crossref.png) . Applying
  the k-SSP method (figure
. Applying
  the k-SSP method (figure ![[*]](file:/usr/share/latex2html/icons/crossref.png) ) objects
  that are estimated to be relevant are returned to a user
  immediately after detection on the Web.8 In this case we
  assume an immediate decision of the filtering method and the
  delay value in formula (
) objects
  that are estimated to be relevant are returned to a user
  immediately after detection on the Web.8 In this case we
  assume an immediate decision of the filtering method and the
  delay value in formula (![[*]](file:/usr/share/latex2html/icons/crossref.png) ) is
  0.
) is
  0.![[*]](file:/usr/share/latex2html/icons/crossref.png) we defined
  the quality or relevance of a single document retrieved by a
  search engine. In order to measure the quality of a set of
  retrieved documents we build the sum of quality values of the
  individual documents. In [20] a very similar relevance
  measure is presented that is based on graded relevance
  assessments (in contrast to binary relevance assessments usually
  considered in IR). In [20]
  the functions
 we defined
  the quality or relevance of a single document retrieved by a
  search engine. In order to measure the quality of a set of
  retrieved documents we build the sum of quality values of the
  individual documents. In [20] a very similar relevance
  measure is presented that is based on graded relevance
  assessments (in contrast to binary relevance assessments usually
  considered in IR). In [20]
  the functions|  | 
|  | 
 denotes
  the entire set of documents and relevance is a function
  providing relevance values for documents, retr is the set
  of retrieved documents. In the experiments we apply the same
  measures and define the relevance function according to
  definition
 denotes
  the entire set of documents and relevance is a function
  providing relevance values for documents, retr is the set
  of retrieved documents. In the experiments we apply the same
  measures and define the relevance function according to
  definition ![[*]](file:/usr/share/latex2html/icons/crossref.png) .
.
![[*]](file:/usr/share/latex2html/icons/crossref.png) shows two
  examples for the relevance evolution of two queries ('oscar' and
  'basketball') over a period of 60 days according to the quality
  definition in (1). The reason to
  consider generated data in the following is to show the basic
  functionality in principle. Real Web data are considered to show
  that the presented filtering method based on k-SSP selection may
  be applied to distributions of relevance developments of real
  information.
 shows two
  examples for the relevance evolution of two queries ('oscar' and
  'basketball') over a period of 60 days according to the quality
  definition in (1). The reason to
  consider generated data in the following is to show the basic
  functionality in principle. Real Web data are considered to show
  that the presented filtering method based on k-SSP selection may
  be applied to distributions of relevance developments of real
  information.
   with identical
  likelihoods are generated. An individual sequence is gradually
  provided as an input to the BCS and PE algorithms.
 with identical
  likelihoods are generated. An individual sequence is gradually
  provided as an input to the BCS and PE algorithms.
![[*]](/www2007/file:/usr/share/latex2html/icons/crossref.png) a
  number of 50 sequences of (distinct) ranking values (in
 a
  number of 50 sequences of (distinct) ranking values (in 
   ) of candidate
  size
) of candidate
  size  were generated. Figure
 were generated. Figure
  ![[*]](/www2007/file:/usr/share/latex2html/icons/crossref.png) shows the respective mean values of acquired retrieval qualities
  (gr and gp according to (
  shows the respective mean values of acquired retrieval qualities
  (gr and gp according to (![[*]](file:/usr/share/latex2html/icons/crossref.png) )) for
  different values of
)) for
  different values of  , the number of
  objects to be returned. The PE strategy is applied to the data
  based on a single evaluation period (PE(1)) of length 99, where
  the time span between the appearance of two candidates is assumed
  to be 1.9 These retrieval results are optimal
  since the PE(1) method has knowledge of the whole distribution of
  (previous) retrieval values. The lower graph shows the retrieval
  quality of the BCS method. The graph below shows the retrieval
  quality of the random method that chooses a number of
, the number of
  objects to be returned. The PE strategy is applied to the data
  based on a single evaluation period (PE(1)) of length 99, where
  the time span between the appearance of two candidates is assumed
  to be 1.9 These retrieval results are optimal
  since the PE(1) method has knowledge of the whole distribution of
  (previous) retrieval values. The lower graph shows the retrieval
  quality of the BCS method. The graph below shows the retrieval
  quality of the random method that chooses a number of  arbitrary candidates. Figure
 arbitrary candidates. Figure ![[*]](/www2007/file:/usr/share/latex2html/icons/crossref.png) shows
  that the BCS method provides significantly better results than
  the random strategy. As expected the quality is lower than the
  quality provided by the PE(1) method which has access to the
  entire set of ranking values.
 shows
  that the BCS method provides significantly better results than
  the random strategy. As expected the quality is lower than the
  quality provided by the PE(1) method which has access to the
  entire set of ranking values.
  ![[*]](file:/usr/share/latex2html/icons/crossref.png) the
  retrieval quality of the BCS strategy is lower than the retrieval
  quality of the PE(1) strategy. However the BCSQL results are
  returned to a user immediately while the PE(1) strategy returns
  results at the end of a single evaluation period which is in this
  case identical to the query period. If fresher results are
  requested when using the PE method obviously a larger number of
  evaluation periods has to be considered during the query
  execution time. We proportion requested items
 the
  retrieval quality of the BCS strategy is lower than the retrieval
  quality of the PE(1) strategy. However the BCSQL results are
  returned to a user immediately while the PE(1) strategy returns
  results at the end of a single evaluation period which is in this
  case identical to the query period. If fresher results are
  requested when using the PE method obviously a larger number of
  evaluation periods has to be considered during the query
  execution time. We proportion requested items  to the number of evaluation periods. If
 to the number of evaluation periods. If 
   the selected
  candidates are proportioned with an equal likelood to the
  evaluation periods. In the following we consider the tradeoff
  between retrieval quality and freshness of data.
 the selected
  candidates are proportioned with an equal likelood to the
  evaluation periods. In the following we consider the tradeoff
  between retrieval quality and freshness of data.
 and
 and
   . We consider the mean values
  of 200 generated sequences. Figure
. We consider the mean values
  of 200 generated sequences. Figure ![[*]](file:/usr/share/latex2html/icons/crossref.png) (left)
  shows the development of the graded recall for the PE(n) strategy
  when the number of PE intervals is increased from 1 to N, the
  number of versions over time (curve
 (left)
  shows the development of the graded recall for the PE(n) strategy
  when the number of PE intervals is increased from 1 to N, the
  number of versions over time (curve  ). Due to a constant k (
). Due to a constant k ( in def. (
 in def. (![[*]](file:/usr/share/latex2html/icons/crossref.png) )) it is
  sufficient to consider the graded recall. Similar results would
  have been obtained by considering the graded precision. If the
  number of evaluation intervals is N, the retrieval quality of the
  PE strategy (PE(N)) is obviously similar to a strategy where
  candidates are selected randomly (
)) it is
  sufficient to consider the graded recall. Similar results would
  have been obtained by considering the graded precision. If the
  number of evaluation intervals is N, the retrieval quality of the
  PE strategy (PE(N)) is obviously similar to a strategy where
  candidates are selected randomly ( in figure
 in figure ![[*]](file:/usr/share/latex2html/icons/crossref.png) ): Since
  the number of candidates is
): Since
  the number of candidates is  , the
  average graded recall of the random strategy is
, the
  average graded recall of the random strategy is 
   (according
  to (
 (according
  to (![[*]](/www2007/file:/usr/share/latex2html/icons/crossref.png) )). In
  figure
)). In
  figure ![[*]](file:/usr/share/latex2html/icons/crossref.png) (left)
  graph
 (left)
  graph  shows the retrieval
  quality of the BCS strategy and graph
 shows the retrieval
  quality of the BCS strategy and graph  shows the retrieval quality of the PE strategy
  when only a single evaluation period is considered (PE(1)). Both
  strategies do not depend on the the number of observation periods
  (x-axis).
 shows the retrieval quality of the PE strategy
  when only a single evaluation period is considered (PE(1)). Both
  strategies do not depend on the the number of observation periods
  (x-axis). in figure
 in figure ![[*]](file:/usr/share/latex2html/icons/crossref.png) (left)) of
  the maximal number of evaluation periods where the graded recall
  of the PE strategy is better than the graded recall of the BCS
  strategy. I.e. if the number of intervals is further increased
  because fresher results shall be returned by the PE strategy, the
  retrieval quality is lower than the retrieval quality of the BCS
  method. Below the intersection point IS in figure
 (left)) of
  the maximal number of evaluation periods where the graded recall
  of the PE strategy is better than the graded recall of the BCS
  strategy. I.e. if the number of intervals is further increased
  because fresher results shall be returned by the PE strategy, the
  retrieval quality is lower than the retrieval quality of the BCS
  method. Below the intersection point IS in figure ![[*]](file:/usr/share/latex2html/icons/crossref.png) (left) the
  PE strategy therefore becomes inferior to the BCS strategy. The
  BCS strategy provides maximal freshness due to an immediate
  delivery of results. In this situation also the retrieval quality
  is superior to the PE method in a probabilistic sense.
 (left) the
  PE strategy therefore becomes inferior to the BCS strategy. The
  BCS strategy provides maximal freshness due to an immediate
  delivery of results. In this situation also the retrieval quality
  is superior to the PE method in a probabilistic sense.![[*]](file:/usr/share/latex2html/icons/crossref.png) (right) we
  consider the delay of the considered strategies according to
  definition (
 (right) we
  consider the delay of the considered strategies according to
  definition (![[*]](file:/usr/share/latex2html/icons/crossref.png) ). The data
  points close to
). The data
  points close to  denote
  the delay of results of the PE strategy considering a single
  evaluation period (PE(1)). The figure shows that results are
  delivered with a delay of approximately 0.5
 denote
  the delay of results of the PE strategy considering a single
  evaluation period (PE(1)). The figure shows that results are
  delivered with a delay of approximately 0.5 50% of the entire query execution time. If e.g.
  the query execution time is 40 days, results are returned with a
  delay of 20 days. Curve
50% of the entire query execution time. If e.g.
  the query execution time is 40 days, results are returned with a
  delay of 20 days. Curve  shows the delay of the PE(n) method where the number of
  evaluation periods is increased from 1 to N. Obviously the delay
  converges monotonically to 0. The main point in this graph is the
  y-value
  shows the delay of the PE(n) method where the number of
  evaluation periods is increased from 1 to N. Obviously the delay
  converges monotonically to 0. The main point in this graph is the
  y-value  of the PE-delay
  (figure
 of the PE-delay
  (figure ![[*]](file:/usr/share/latex2html/icons/crossref.png) (right))
  where the x value (
 (right))
  where the x value ( )
  corresponds to the x-value of the intersection point IS in the
  figure on the left. This point marks the minimal delay of the PE
  strategy where the retrieval quality is better than the retrieval
  quality of the BCS strategy. In other words: If results are
  requested by a user that are fresher than
)
  corresponds to the x-value of the intersection point IS in the
  figure on the left. This point marks the minimal delay of the PE
  strategy where the retrieval quality is better than the retrieval
  quality of the BCS strategy. In other words: If results are
  requested by a user that are fresher than  , a user should prefer the BCS strategy presuming
  he wants to acquire maximal retrieval quality. Otherwise, if less
  fresh results are sufficient, the PE strategy should be applied.
  We denote
, a user should prefer the BCS strategy presuming
  he wants to acquire maximal retrieval quality. Otherwise, if less
  fresh results are sufficient, the PE strategy should be applied.
  We denote  as
 as  -turning point in the following. The
-turning point in the following. The
   -turning point is
  obtained by a local linear fit of the PE and BCS recall close to
  the intersection point IS in figure
-turning point is
  obtained by a local linear fit of the PE and BCS recall close to
  the intersection point IS in figure ![[*]](file:/usr/share/latex2html/icons/crossref.png) (left) and
  by computing the intersection point of the respective PE and BCS
  fitting lines (close to
 (left) and
  by computing the intersection point of the respective PE and BCS
  fitting lines (close to  and
  and  in figure
 in figure ![[*]](file:/usr/share/latex2html/icons/crossref.png) , left).
  Then the point
, left).
  Then the point  in the PE-delay
  graph (the intersection of the
 in the PE-delay
  graph (the intersection of the  -value and the PE-delay graph) has to be
  extracted. In the example in figure
-value and the PE-delay graph) has to be
  extracted. In the example in figure ![[*]](file:/usr/share/latex2html/icons/crossref.png) (right)
  BCS should be used if results are requested that are fresher than
 (right)
  BCS should be used if results are requested that are fresher than
  
   3.3% of the
  query execution time. If the entire query period is 40 days, the
  BCS strategy should be used if results should be fresher than 1.3
  days.
3.3% of the
  query execution time. If the entire query period is 40 days, the
  BCS strategy should be used if results should be fresher than 1.3
  days.
  
|  | 
![[*]](file:/usr/share/latex2html/icons/crossref.png) shows the
 shows the
   -turning point for
  different values of the number of candidates
-turning point for
  different values of the number of candidates  and the number of best values to be returned
 and the number of best values to be returned  . It may be observed that the
. It may be observed that the  -turning point tends to decrease for higher
  values of
-turning point tends to decrease for higher
  values of  . The image on the right
  suggests that, provided that
. The image on the right
  suggests that, provided that  is
  sufficiently high,
 is
  sufficiently high,  is
  relatively constant.
 is
  relatively constant.
   BEST
 BEST  d
d
 url
url WHERE query='
 WHERE query=' singleterm
singleterm '
' delay
delay
| BCS | PE(n) | PE(1) | turning point | ||||
| gr | gr depend. on maximal delay | optimal gr | |||||
| source (language) | 2 days | 4 days | 8 days | 12 days |  (in days) | ||
| news.bbc.co.uk (e) | 0.387 | 0.135 | 0.246 | 0.408 | 0.506 | 0.595 | 7.9 | 
| www.faz.net-s-homepage.html (g) | 0.374 | 0.107 | 0.245 | 0.354 | 0.443 | 0.548 | 8.8 | 
| www.ftd.de (g) | 0.317 | 0.095 | 0.23 | 0.319 | 0.408 | 0.5 | 7.9 | 
| www.latimes.com (e) | 0.24 | 0.067 | 0.104 | 0.238 | 0.3 | 0.404 | 8.4 | 
| www.nytimes.com (e) | 0.231 | 0.094 | 0.159 | 0.248 | 0.321 | 0.424 | 7.4 | 
| www.spiegel.de (g) | 0.326 | 0.104 | 0.167 | 0.328 | 0.404 | 0.5 | 7.5 | 
| www.sueddeutsche.de (g) | 0.307 | 0.099 | 0.157 | 0.267 | 0.351 | 0.463 | 10.7 | 
| www.usatoday.com-news-front.htm (e) | 0.307 | 0.093 | 0.153 | 0.277 | 0.372 | 0.542 | 11.2 | 
| www.washingtonpost.com (e) | 0.121 | 0.074 | 0.113 | 0.167 | 0.223 | 0.307 | 5.1 | 
| www.welt.de-chl-20.html (g) | 0.178 | 0.057 | 0.076 | 0.155 | 0.198 | 0.296 | 10.6 | 
| www.wienerzeitung.at (g) | 0.332 | 0.118 | 0.154 | 0.322 | 0.439 | 0.605 | 9.6 | 
| www.zeit.de (g) | 0.276 | 0.074 | 0.134 | 0.283 | 0.361 | 0.479 | 7.5 | 
| normalized mean value | 57% | 18% | 31% | 58% | 73% | 100% | |
![[*]](/www2007/file:/usr/share/latex2html/icons/crossref.png) shows
  a representative fraction of these experiments. The table shows
  the retrieval quality (graded recall) of the BCS, the PE(j) and
  the PE(1) strategy for 12 Web pages, 7 in German and 5 in
  English. We consider a number of the 4 best objects to be chosen
  (
 shows
  a representative fraction of these experiments. The table shows
  the retrieval quality (graded recall) of the BCS, the PE(j) and
  the PE(1) strategy for 12 Web pages, 7 in German and 5 in
  English. We consider a number of the 4 best objects to be chosen
  ( ). Each entry in the table
  is the mean value of the respective quality parameters of all
  considered (480 German and 480 English) queries. In these
  experiments we specified a maximal delay of returned results of
  2, 4, 8 and 12 days and adjusted the number of evaluation
  intervals in the PE(j) method respectively. The table shows the
  resulting graded recall (gr) values of the different
  methods and the
). Each entry in the table
  is the mean value of the respective quality parameters of all
  considered (480 German and 480 English) queries. In these
  experiments we specified a maximal delay of returned results of
  2, 4, 8 and 12 days and adjusted the number of evaluation
  intervals in the PE(j) method respectively. The table shows the
  resulting graded recall (gr) values of the different
  methods and the  turning
  point (in days) for each source.
 turning
  point (in days) for each source. value (in days) in table
 value (in days) in table
  ![[*]](file:/usr/share/latex2html/icons/crossref.png) marks the
  maximal freshness where a user obtains the best results by the
  PE(n) method. Below this point the BCS strategy provides results
  of a higher retrieval quality. The last row of table
 marks the
  maximal freshness where a user obtains the best results by the
  PE(n) method. Below this point the BCS strategy provides results
  of a higher retrieval quality. The last row of table ![[*]](file:/usr/share/latex2html/icons/crossref.png) shows the
  mean values of all data sources standardized by the maximal
  recall value provided by the PE(1) strategy.
 shows the
  mean values of all data sources standardized by the maximal
  recall value provided by the PE(1) strategy.
![[*]](file:/usr/share/latex2html/icons/crossref.png) shows the
 shows the
   -turning point for
  exemplary data sources and the mean value of 14 data sources with
  respect to the number of best items to be returned. The mean
  value of the
-turning point for
  exemplary data sources and the mean value of 14 data sources with
  respect to the number of best items to be returned. The mean
  value of the  values (e.g. for
  k=8) is approximately 7.3 days. This is 9.1% of the query
  execution time of 80 days.
 values (e.g. for
  k=8) is approximately 7.3 days. This is 9.1% of the query
  execution time of 80 days.
   documents respectively
 documents respectively 
    
     ,
    where
,
    where  is a set of (new)
    documents
 is a set of (new)
    documents 
     , is not
    the focus of this paper. In this case the applied information
    retrieval measures presented here have to be modified as
    described e.g. in [38].
, is not
    the focus of this paper. In this case the applied information
    retrieval measures presented here have to be modified as
    described e.g. in [38]. '-symbols in figure
'-symbols in figure ![[*]](file:/usr/share/latex2html/icons/crossref.png) .
. 4
4 1 week, usually only a
    suboptimal choice is possible.
 1 week, usually only a
    suboptimal choice is possible. denotes the
    ranking position of object
 denotes the
    ranking position of object  with
    respect to objects
 with
    respect to objects 
     , then the
    independence assumption is
, then the
    independence assumption is 
     ,
, 
     .
. candidates have been rejected and
 candidates have been rejected and  accepted we are at time
 accepted we are at time  . A rank of '1' marks the best object.
. A rank of '1' marks the best object. -value
    due to definition (
-value
    due to definition (![[*]](file:/usr/share/latex2html/icons/crossref.png) ).
).