Recall is a well-established measure in information retrieval (IR) and
information extraction (IE) for evaluating the effectiveness of either
retrieving relevant documents or extracting relevant statements, with
statements considered to be the smallest bits of information about
entities [5].
For information gathering (IG) or knowledge acquisition from the Web [3,4],
which can be modeled as a consecutive process of IR, IE, and Information
Integration (II), the actual goal is not to locate and extract all
appearances of relevant information, but to extract as much unique
information as possible, disregarding all similar or redundant
appearances. For domains with large portions of redundant information on
the Web, such as digital consumer products or news, the dominant part of
the relevant information can be gathered even with low overall recall. As an
example, even a low domain-specific recall would probably be enough for
learning the information that Shizuka Arakawa won the gold medal in figure
skating at the Winter Olympic Games 2006.
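To make this intuition tangible, here is a minimal Python sketch of such a redundant domain; the uniform sampling model, the redundancy factor of 50 appearances per fact, and all identifiers are illustrative assumptions, not taken from the paper.

```python
import random

def covered_core_fraction(n_core=1000, redundancy=50, recall=0.05, seed=0):
    """Fraction of unique core facts seen when only a fraction `recall`
    of all redundant appearances is retrieved uniformly at random."""
    rng = random.Random(seed)
    # Every core fact appears `redundancy` times on the (simulated) Web.
    instances = [fact for fact in range(n_core) for _ in range(redundancy)]
    sampled = rng.sample(instances, int(recall * len(instances)))
    return len(set(sampled)) / n_core

if __name__ == "__main__":
    for recall in (0.01, 0.05, 0.10):
        print(f"recall {recall:.0%} -> core coverage "
              f"{covered_core_fraction(recall=recall):.1%}")
```

In this toy setting, a recall of 5% already covers roughly nine out of ten core facts, while a recall of 1% covers far less.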
Therefore, this paper argues for shifting attention from the redundant representation of information to the actual core information, stripped of all redundancy (Fig. 1).
In what follows, we consider relevance to be a binary decision. Similarly, we disregard the issue of conflicting information and assume two statements to either express the same information, and hence be redundant, or not.
[Equation (1)]
Figure 2(a) shows equation 3 for example parameter values. Despite the
apparent innocence of this formula, its derivation is not
straightforward. The proof succeeds by applying a limit argument to a
geometric model of a probabilistic lottery. Figure 2(b) illustrates that
this approximation is suitable and that it holds increasingly strongly as
the number of unique occurrences of relevant information grows.
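The proof itself is not reproduced in this excerpt; the following sketch only illustrates what such a limit argument can look like in the simplest setting, assuming a homogeneous redundancy factor k (every unique piece of information appears exactly k times among N = nk instances) and uniform sampling, and should not be read as the paper's actual derivation of equation 3.

```latex
% Illustrative limit argument under an assumed homogeneous redundancy factor k.
% N = nk redundant instances in total; a fraction r of them is drawn uniformly at random.
\[
  \Pr[\text{a fixed piece of information is missed}]
    = \frac{\binom{N-k}{rN}}{\binom{N}{rN}}
    = \prod_{i=0}^{k-1} \frac{(1-r)N - i}{N - i}
    \;\xrightarrow{\,N \to \infty\,}\; (1-r)^{k},
\]
\[
  \text{so the expected fraction of core information learned approaches } 1-(1-r)^{k}.
\]
```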
By building on Proposition 1, we can formulate the general equation for
arbitrary redundancy distributions with multiple layers of partly
redundant information, given as equation 4. Calculating the first
derivative and then taking the limit, we further obtain the approximation
of equation 5.
Figure 2(d) depicts equation 5 together with the closed solutions for the three canonical redundancy distributions (homogeneous, linear, and geometric) from Fig. 2(c).
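Since the closed solutions themselves are not reproduced in this excerpt, the following Python sketch only illustrates the qualitative behaviour of the three canonical shapes by simulation; the concrete copy counts (a constant 20, a linearly growing count, and a geometrically growing count) are made-up parametrizations, not the ones used in Fig. 2(c).

```python
import random

def coverage(copies, recall, seed=0):
    """Fraction of unique pieces covered when a fraction `recall` of all
    redundant instances is sampled uniformly at random; copies[i] is the
    number of redundant appearances of piece i."""
    rng = random.Random(seed)
    instances = [i for i, c in enumerate(copies) for _ in range(c)]
    sampled = rng.sample(instances, int(recall * len(instances)))
    return len(set(sampled)) / len(copies)

n = 2000
distributions = {
    "homogeneous": [20] * n,                               # every piece appears 20 times
    "linear":      [1 + (40 * i) // n for i in range(n)],  # copy counts grow linearly from 1 to 40
    "geometric":   [2 ** (10 * i // n) for i in range(n)], # copy counts double in steps, 1 to 512
}
for name, copies in distributions.items():
    print(name, [round(coverage(copies, r), 3) for r in (0.05, 0.2, 0.5)])
```

Under the skewed geometric shape, a small recall already finds the heavily duplicated pieces but keeps missing the rarely mentioned ones, which is the regime in which the difference between recall over instances and coverage of the core information is most pronounced.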
To demonstrate its practical relevance, assume we want to know the
fraction of available redundant information instances we have to retrieve
and extract in order to learn a certain portion of the core information
from a given domain of interest. Further assume that the redundancy
distribution of information within this domain follows a known function.
The inverse function of equation 4 then gives us the required joint
recall of the IR and IE steps of our IG system.
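The analytic inverse is not reproduced in this excerpt; as a hedged sketch of the same computation, the snippet below numerically inverts a monotone coverage function by bisection, using the homogeneous approximation 1 - (1 - r)^k with an assumed redundancy factor k = 20 and an assumed target of 90% core coverage.

```python
def required_recall(coverage_fn, target, tol=1e-6):
    """Smallest recall r with coverage_fn(r) >= target, found by bisection
    (coverage_fn must be monotonically increasing on [0, 1])."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if coverage_fn(mid) >= target:
            hi = mid
        else:
            lo = mid
    return hi

k = 20                                   # assumed homogeneous redundancy factor
coverage = lambda r: 1 - (1 - r) ** k    # assumed coverage model, not the paper's equation 4
print(required_recall(coverage, target=0.90))   # ~0.109 by bisection
print(1 - (1 - 0.90) ** (1 / k))                # closed-form check: 1 - 0.1**(1/20), also ~0.109
```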
Deriving a particular analytic solution is not always simple, mainly due
to the required transitions between discrete and continuous viewpoints.
However, given the approximation derived from equation 5, which holds all
the more strongly under the conditions discussed above, we can state the
following approximation, independent of the actual redundancy
distribution:
[Equation (6)]
The intriguing open and relevant problem is to derive the analytic
solution for a generalized Zipf redundancy distribution, as this
distribution is generally assumed to approximate well the frequency of
appearance of individual pieces of information on the Web [6,7]. A more
playful formulation of the research question is as follows: assume the
Zipf distribution with the exponent parameter of Fig. 2(c) adequately
describes the redundancy distribution of information on the Web [2]. What
is the generalization of the 20/80 rule [8] that holds for the Web?
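The analytic answer for the Zipf case is exactly the open problem posed above; a Monte Carlo estimate, however, is easy to set up. In the sketch below, the exponent s = 1, the cutoff of 1000 copies for the most redundant piece, and the population size are all assumed example values.

```python
import random

def zipf_copies(n, s=1.0, top=1000):
    """Assumed Zipf-like redundancy: the piece of rank i appears about
    top / (i + 1)**s times, with a floor of one appearance."""
    return [max(1, round(top / (i + 1) ** s)) for i in range(n)]

def coverage(copies, recall, seed=0):
    """Fraction of unique pieces covered at a given instance-level recall."""
    rng = random.Random(seed)
    instances = [i for i, c in enumerate(copies) for _ in range(c)]
    sampled = rng.sample(instances, int(recall * len(instances)))
    return len(set(sampled)) / len(copies)

copies = zipf_copies(n=5000, s=1.0)
for recall in (0.05, 0.2, 0.5, 0.8):
    print(f"recall {recall:.0%} -> core coverage {coverage(copies, recall):.1%}")
```

Reading off the recall at which the estimated core coverage crosses 80% gives a numerical stand-in for the sought generalization of the 20/80 rule; the open problem asks for this relationship in closed form.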
Acknowledgement. The author would like to express his gratitude to
Professor Georg Gottlob for overall motivation of this research, and to
Konrad Richter for always having time to discuss some interesting
mathematical problems.