Personal webservers have proven to be a popular means of sharing files and peer collaboration. Unfortunately, the transient availability and rapidly evolving content on such hosts render centralized, crawl-based search indices stale and incomplete. To address this problem, we propose YouSearch, a distributed search application for personal webservers operating within a shared context (e.g., a corporate intranet). With YouSearch, search results are always fast, fresh and complete -- properties we show arise from an architecture that exploits both the extensive distributed resources available at the peer webservers and a centralized repository of summarized network state. YouSearch extends the concept of a shared context within web communities by enabling peers to aggregate into groups and users to search over specific groups. In this paper, we describe the challenges, design, implementation and experiences with a successful intranet deployment of YouSearch.
H.3.4 [Systems and Software]: Distributed systems, information networks, Performance evaluation (efficiency and effectiveness); H.4.1 [Office Automation]: Groupware; H.5.4 [Hypertext/Hypermedia]: Architectures, User Issues.
Algorithms, Performance, Human Factors.
Web search, Intranet search, Peer-to-Peer networks, P2P, Decentralized systems, Information communities.
1,500 people within IBM use the YouServ [14] personal webserving system every week.
Critical to group collaboration is the ability to find desired content. While the Web offers URLs for accessing content from a named location, the popular method for locating content on the web is through search. We believe personal webservers are used in a manner which makes search even more preferable:
Motivated by the desideratum of freshness, we opted for a Peer-to-Peer (P2P) paradigm in designing YouSearch. In this architecture, each webserver is enhanced with a search component consisting of a content indexer and a query evaluator. The indexer regularly monitors shared files, immediately updating its local index when changes are detected. Queries are forwarded to local indexes for evaluation at run time. Only clients that are available at query time respond, ensuring live results.
The need for speed and completeness led us towards a hybrid architecture in which the P2P network is augmented with a light-weight centralized component. Peers maintain compact site summaries (in the form of a Bloom filter [16]) which are aggregated at a centralized registrar. These summaries are queried so that searches target only the relevant machines. YouSearch does not rely on query flooding or other routing-based schemes that are subject to network fragmentation and limited search horizons. Peers help reduce query load on the system by caching and sharing query results. They also cooperate to maintain freshness of the summary aggregation. This minimizes the role of centralized resources for low cost and graceful scaling.
Enhancing the Shared Context of Users
2 months, the system has been adopted by nearly 1,500 users and the number is steadily increasing. Our experiences in such a real-life active usage scenario show that YouSearch is fast and efficient, and most importantly, satisfies users' need for search on personal webservers.
YouSearch is a deployed application that combines several ideas to provide an effective and coherent solution for web searching over peer-hosted content. This section discusses the most closely related systems and proposals.
P2P File-sharing Systems
We believe that the success of Gnutella and KaZaa in spite of these problems is due to their use for music and video sharing. In such networks, the most popular songs and files are both widely replicated and the target of most queries. Documents in corporate environments do not follow such replication or query patterns. In such settings, we believe that their inherent limits on performance will prove a stumbling block.
Napster [11] provided search over peer-hosted music files by adopting a hybrid scheme that is similar to YouSearch. Napster required the entire song list and song meta data from each peer to be centralized for indexing. This overhead forced fragmentation of Napster into islands of servers whose peers could not communicate. Such centralization of the term index becomes even more infeasible when terms arise from both file names as well as document contents, as in YouSearch.
Perhaps more important than the technical details is the philosophy of use behind current P2P systems. These systems form closed communities of users. Even people who wish to merely access or search shared content must install special purpose software that uses a proprietary protocol. As a result, each such protocol creates a partition of shared data that is inaccessible to users of the others. YouSearch, in contrast, is at its core web-compatible, and closely mimics the existing user experience of conducting a web search.
P2P Research Proposals
PlanetP [18] is a research project that proposes another searchable P2P network. As in YouSearch, each peer constructs bloom filters to summarize its content. PlanetP peers then gossip with each other to achieve a loosely consistent directory of every peer in the network. In contrast, YouSearch peers exploit a designated lightweight registrar as a ``blackboard'' for storing network state. The advantage is that registrar-maintained state is much more consistent and can be more efficiently maintained. The registrar also performs other useful functions like locating result caches that would be impractical to provide via a gossiping scheme.
Webserving Tools
One webserver supporting search across transient peers is BadBlue [2]. BadBlue can be used to search over a (private) network of hosts running BadBlue or other Gnutella-compliant software. In relying on the Gnutella protocol for searching the network, BadBlue suffers from the problems already outlined above.
Collaborative Applications
Bharat in [15] proposed SearchPad as an extension to the search interface at a client to keep track of chosen ``interesting'' query results. The results are remembered and used in isolation by each individual client. The tool was found to be helpful for users in a study conducted in a corporate setting. YouSearch enables users to record and share results with other members in their groups. We believe that such a shared context among group members makes this functionality even more useful.
Caching of Resources at Peers
In this section we discuss the experience of a typical end user of the YouSearch application as it is currently deployed within the IBM intranet. We highlight how the usage characteristics of YouSearch motivate our design choices.
The Web Search Metaphor
Notice that unlike web search engines, YouSearch offers no single centralized query form. Instead, each participating host offers its own web-accessible search interface. Thus, our solution is an amalgam of a P2P execution infrastructure and the user expectation of a ``universal search form.''
The hits from each individual host are appropriately ranked using standard text search metrics. Though not currently implemented, background rank aggregation of inter-host results [20] could cause the most relevant (unviewed) results to percolate upwards in the ranking even if discovered later. The ranking of already-viewed results, however, should be frozen to preserve back and forward semantics of web browsing.
Searching within a Shared Context
2 months. Over the last week, nearly 1,500 people have made their content available to the search system. The full set of functionality described in this paper is being gradually rolled out. For example, the initial deployment of YouSearch provided conjunctive queries, exclusion and site operators. A month later we provided users with the group operator. The feature allowing sharing of user-recommended results among group members is implemented but not currently deployed. We intend to wait for users to familiarize themselves with the group operator before its release.
The participants in YouSearch include (1) peer nodes who run YouSearch-enabled clients, (2) browsers who search YouSearch-enabled content through their web browsers, and (3) a Registrar, which is a centralized light-weight service that acts like a ``blackboard'' on which peer nodes store and look up (summarized) network state.
The search system in YouSearch functions as follows. Each peer node closely monitors its own content to maintain a fresh local index. A bloom filter content summary is created by each peer and pushed to the registrar. When a browser issues a search query at a peer p, the peer p first queries the summaries at the registrar to obtain a set of peers R in the network that are hosting relevant documents. The peers in R are then directly contacted by p with the query to obtain the URLs for its results.
To quickly satisfy any subsequently issued queries with identical terms, the results from each query issued at a peer p are cached for a limited time at p. The peer p notifies the registrar of any cache entry it maintains. This allows peers other than p that happen to receive the same query to locate and return the cached results instead of executing the query from scratch.
Peers in YouSearch use the centralized registrar as a blackboard by posting the state of their node content for other peers to query. This way, the registrar can be used to avoid costly and often ineffective flooding of queries to irrelevant peers, and also for efficient location of relevant result caches. Though its role is important, the registrar remains a mostly passive, and hence light-weight entity. The peer nodes perform almost all of the heavy work in searching.
We now describe the indexing process in detail. An indexing process (Figure 3) is periodically executed at every peer node. The process starts off with an Inspector that examines shared files in accordance with a user specified index access policy. Each shared file is inspected for its last modification date and time. If the file is new or the file has changed, the file is passed to the Indexer. The Indexer maintains a disk-based inverted-index over the shared content. The name and path information of the file are indexed as well. Our implementation uses the engine described in [17] for indexing local node content, which provides sub-second query response times, efficient index builds, and excellent results ranking.
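The Inspector's change check can be sketched as follows. This is only an illustration under stated assumptions, not the YouSearch code: the function name `find_changed_files` and the in-memory `last_seen` dictionary are hypothetical, the user-specified index access policy is omitted, and deleted files are not handled.

```python
import os


def find_changed_files(root, last_seen):
    """Return shared files under `root` that are new or modified.

    `last_seen` maps path -> modification time recorded on the previous
    pass and is updated in place. (Hypothetical helper, not YouSearch code.)
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mtime = os.stat(path).st_mtime_ns
            if last_seen.get(path) != mtime:
                changed.append(path)      # would be handed to the Indexer
                last_seen[path] = mtime
    return changed
```

Running this periodically and passing only the returned paths to the indexer keeps each pass proportional to the number of shared files rather than to their total size.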
The Summarizer obtains a list of terms T from the Indexer and creates a bloom filter [16] from them in the following way. A bit vector V of length L is created with each bit set to 0. A specified hash function H with range {1,2,...,L} is used to hash each term t in T and the bit at position H(t) in V is set to 1.
Notice that bloom filters are precise summaries of content at a peer. Suppose we want to determine if a term t occurs at a peer p with a bloom filter V. We inspect the bit position H(t) in V. If the bit is 0, then it is guaranteed that a document with term t does not appear at p. If the bit is 1, then the term might or might not occur at p since multiple terms might hash to the same value thus setting the same bit in V. The probability that such conflicts occur can be reduced by increasing the length L.
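A minimal sketch of a single-hash bloom filter of this kind is given below. The hash `H` here is an illustrative stand-in built on MD5 (an assumption; the text does not fix H at this point), and the names `summarize` and `might_contain` are hypothetical.

```python
import hashlib


def H(term, L):
    """Illustrative hash with range {1, ..., L} (a stand-in for H)."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % L + 1


def summarize(terms, L):
    """Build bit vector V with V[H(t)] = 1 for each term t (index 0 unused)."""
    V = [0] * (L + 1)
    for t in terms:
        V[H(t, L)] = 1
    return V


def might_contain(V, term):
    """False means the term is definitely absent; True means only 'maybe'."""
    return V[H(term, len(V) - 1)] == 1
```

A `False` answer is always correct, while a `True` answer may be a collision with some other term, which is the asymmetry described above.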
Conflicts can also be reduced by using k independent hash functions H_1, H_2, ..., H_k and checking that the bit positions H_1(t), H_2(t), ..., H_k(t) are all set to 1. Conventionally, the k hash functions are used to set bits in a single bit vector. However, in YouSearch we construct k different bloom filters, one for each hash function. It can be shown that such a design incurs a slight increase in false positives. Nevertheless, we opt for multiple bloom filters to enable further decentralization of the registrar. As the network grows and the load on the registrar increases, we can easily distribute load by placing the k bloom filters across multiple servers.
The k bloom filters are sent to the registrar when a peer becomes available, and whenever changes in its content are detected. The Summary Manager at the registrar aggregates these bloom filters into a structure that maps each bit position to a set of peers whose bloom filters have the corresponding bit set.
We now describe the querying process in detail. Any browser (Bob) visiting a YouSearch enabled website (Alice's peer) has a query interface from which he can search the network.
Suppose Bob wishes to search all of YouSearch enabled content (Figure 4). Bob's query is received by Alice's peer node via a web interface and is forwarded to a Canonical Transformer that converts the query into a canonical form consisting of sets of terms labeled with the associated modifier. For example, a query of pdf group:YouSearchTeam will be converted to {, }. The canonical query is forwarded to the Result Gatherer.
The Result Gatherer sends the canonical query to the registrar where the Query Manager computes the hash of keywords to determine the corresponding bits for each of the k bloom filters. The registrar looks up its bit position to IP address mapping and determines the intersection R of peer IP address sets. The set R is then returned to the querying peer (Alice).
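The registrar-side aggregation and lookup can be sketched as a small in-memory structure. This is one plausible reading of the description, not the deployed code: the class and method names are hypothetical, peers are identified by plain strings, and each bloom filter is passed as the set of bit positions that are set.

```python
from collections import defaultdict


class SummaryManager:
    """Maps (filter index, bit position) -> set of peers with that bit set."""

    def __init__(self, k):
        self.k = k
        self.peers_by_bit = [defaultdict(set) for _ in range(k)]
        self.bits_by_peer = {}

    def publish(self, peer, filters):
        """Replace `peer`'s summary; filters[i] is the set of bit positions
        set in its i-th bloom filter."""
        self.remove(peer)
        for i, bits in enumerate(filters):
            for b in bits:
                self.peers_by_bit[i][b].add(peer)
        self.bits_by_peer[peer] = [set(bits) for bits in filters]

    def remove(self, peer):
        """Drop every entry for a peer that left or was reported unreachable."""
        for i, bits in enumerate(self.bits_by_peer.pop(peer, [])):
            for b in bits:
                self.peers_by_bit[i][b].discard(peer)

    def candidates(self, keyword_bits):
        """keyword_bits holds one k-tuple of bit positions per query keyword.
        Returns R, the peers whose filters have every such bit set."""
        result = None
        for positions in keyword_bits:
            for i, b in enumerate(positions):
                peers = self.peers_by_bit[i].get(b, set())
                result = set(peers) if result is None else result & peers
        return result or set()
```

Because the index is keyed by filter number, each of the k tables could live on a different server, with the querying side intersecting the per-table answers.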
The Result Gatherer at Alice's peer obtains R. If the query contained special modifiers (e.g., site, group), R is further filtered to contain only peers that satisfy the modifier. It then contacts each of the peers in R and obtains a list of URLs U for matching documents. The results are then passed to Result Display which then appropriately formats and displays U. In order to reduce the latency perceived by Bob, Result Display shows its results even as Result Gatherer is collecting them.
The decision to restrict query processing at the registrar to keywords was motivated by the desire to reduce processing load at the registrar. The trade-off involves increased bandwidth usage as R could be further filtered by query modifiers. However, the group modifier in particular is an expensive filter as it involves a remote procedure call to the intranet group server. By delegating modifier processing to querying peers, the incurred cost is distributed across the community. Moreover, we have now allowed a possible extension of having groups defined locally at each peer. The registrar can then be made transparent to maintenance costs as groups evolve over time.
Note that the use of bloom filters as summaries of shared content implies that R is guaranteed to contain all peers that have a matching document (no false negatives). This ensures result completeness. However, R can have false positives in which case Alice will receive 0 answers from some of the peers it contacts. The chances of such false positives can be reduced by increasing the length L of a bloom filter or the number k of bloom filters used.
Suppose Bob wishes to limit his search to Alice's node alone. The query is received by the web interface at Alice's peer, transformed into a canonical form and forwarded to Result Gatherer as before. The Result Gatherer recognizes the query to be a local query and looks up its local index to find documents that match the query. The index returns a list of ranked URLs U for the matching documents. The ranked list is sent to Result Display which formats and displays results as before.
As the network matures, a significant fraction of the queries will be repeated. Indeed, it has been widely reported that queries have a Zipfian distribution and individual queries are temporally clustered [29]. Caching search results enables a search solution to reduce costs by reusing the search effort.
Since YouSearch has abundant resources (the computing and storage resources of all peers in the network) at its disposal, it is extremely aggressive in its use of caching. Every time a global query is answered that returns non-zero results, the querying peer (Alice's peer node) caches the result set of URLs U. The peer then informs the registrar of the fact. The registrar adds a mapping from the query to the IP-address of the caching peer (Alice) in its cache table.
Each cache entry is associated with a (small) lifetime. The caching peer itself monitors and expires entries in its cache, and informs the registrar of any such changes.
Suppose Bob asks a global query at Alice's peer node that has been cached at other peer nodes in the network. The Result Gatherer at Alice's node sends the query to the Query Manager at the registrar. The Query Manager at the registrar looks up its cache mapping for the query and determines the set of peers that are caching the query. The registrar picks one of them (Ted) at random and asks the Result Gatherer at Alice's node to obtain the cached results from Ted's peer node.
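The two halves of this caching scheme can be sketched as follows, assuming queries are represented by hashable canonical keys and peers by plain strings. The class names `ResultCache` and `CacheTable` are hypothetical, and the injectable `clock` exists only to make the lifetime logic easy to exercise.

```python
import random
import time


class ResultCache:
    """Per-peer cache of query results with a fixed lifetime in seconds."""

    def __init__(self, lifetime, clock=time.monotonic):
        self.lifetime = lifetime
        self.clock = clock
        self.entries = {}            # query -> (expiry time, list of URLs)

    def put(self, query, urls):
        self.entries[query] = (self.clock() + self.lifetime, list(urls))

    def get(self, query):
        entry = self.entries.get(query)
        if entry is None:
            return None
        expiry, urls = entry
        return urls if self.clock() < expiry else None

    def expire(self):
        """Drop expired entries; return the queries to report to the registrar."""
        now = self.clock()
        dead = [q for q, (expiry, _) in self.entries.items() if expiry <= now]
        for q in dead:
            del self.entries[q]
        return dead


class CacheTable:
    """Registrar-side mapping from a query to the peers caching it."""

    def __init__(self):
        self.peers_by_query = {}

    def add(self, query, peer):
        self.peers_by_query.setdefault(query, set()).add(peer)

    def remove(self, query, peer):
        self.peers_by_query.get(query, set()).discard(peer)

    def pick(self, query):
        """Return one caching peer chosen at random, or None if there is none."""
        peers = self.peers_by_query.get(query)
        return random.choice(sorted(peers)) if peers else None
```

Choosing the caching peer at random spreads repeated queries across all peers that hold the result, so no single cache becomes a hot spot.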
Suppose Bob searched for a query that he expects others in his group to be interested in (e.g., how to install a printer, location of weekly meetings, etc.). Since peers in a group share a context, Bob expects that results they are interested in will be similar to his own interests. It would be convenient for them if Bob could persist his efforts in inspecting the results to determine the most relevant answer to his own query.
YouSearch provides a mechanism by which such information can be shared by members in the group. Each global query result is displayed with a check box that allows Bob to indicate if he found the result relevant and would like to recommend it. If Bob opts to do so, the query and the selected result are sent to the registrar who maintains such mappings from query to the recommended URL. If Bob is signed in, the result is stored as a recommendation of Bob, otherwise it is recorded as an anonymous recommendation.
Bob can also query the recommended information. The query is evaluated as before except that the Result Gatherer obtains the result set of URLs directly from the registrar. The results displayed by Result Display are grouped into three categories: recommended by Bob himself, by anonymous users, and by identified users. Bob can also delete a result from the recommended information that was set by Bob himself or by an anonymous user. Thus the information pool is editable, allowing peers to modify and augment shared information.
Peers take it upon themselves to keep the network state consistent, absolving the registrar of having to actively maintain its store of bloom index and cache mappings. Each peer informs the registrar of any changes in its index or cache entries. Whenever a peer leaves the network, it asks the registrar to remove its entries from the index and cache tables.
If a misbehaving peer (Carol) neglects to inform the registrar of its changes, the mappings at the registrar become inconsistent. In such a case, the registrar would include Carol's peer node in its set of hints of peers to contact for a particular query. The peers are designed to handle and report such inconsistencies. A querying peer (Alice) will try to retrieve answers from Carol who is no longer available causing the request to time out. Alice then informs the registrar that Carol is unreachable. The registrar verifies the information by contacting Carol and subsequently removes entries for Carol from its mappings. Thus, the registrar in YouSearch serves just as a repository of state, changing the state at the explicit direction of the peers themselves.
YouSearch has been deployed as a component of the YouServ personal web-hosting application within the IBM corporate intranet since September 16, 2002, with a limited beta release preceding it a week before on September 9, 2002. This section provides a look at the usage trends of YouSearch compiled to date. We show how these trends support YouSearch's design principles and our assertion regarding the importance of search within such a network.
YouServ was originally deployed within the IBM intranet in July 2001. A peer enables content sharing by starting the YouServ client, and disables it when the client stops. The interval between start and stop of a YouServ client is defined as a YouServ session for that peer. Figure 5 shows the distribution of session durations observed among the 3,055 YouServ peers that were active between July 1, 2002 and November 9, 2002. As can be seen, most of the sessions are short, with half of the sessions being less than 3 hours long. The dotted arrow in the figure points to the average session duration of 684 minutes.
YouServ offers the ability for replicating content across trusted peer nodes so that peer transience need not necessarily translate into content availability transience. The idea is that as long as some peer hosting the content is online, the content remains available, and through the same URLs. Although users are aware of this feature, only a small fraction (171 of the 3,055) of the peers made use of it. Thus, result freshness remains an important concern.
Figure 6 shows the distribution of YouSearch-enabled peers within the IBM intranet. Although the dominant fraction of peers is located within the USA, YouSearch is being used actively in 43 countries across the world. The different time zones of its users, combined with short session durations, cause the YouServ web to be in constant churn, rendering centralized crawling impractical.
The proliferation of YouServ peers on the IBM intranet shows that people want a simple and effective means of sharing content using the web. Recent usage statistics indicate the value-add of YouSearch. Figure 7 shows the number of unique users who actively used YouServ sometime during the week before each plotted point. Note that a significant growth in this usage metric coincides with the day YouSearch was released. Another sign of demand for search is the fact that the beta release alone (interval [-7, 0] in Figure 7) was downloaded and used by nearly one hundred users.
In our IBM intranet deployment, the length of each bloom filter is L=64 Kbits and the number of bloom filters is set to k=3. The three hash functions are computed as follows. An MD5 [25] hash of each term at a peer is determined. An MD5 hash is 16 bytes long from which three 16-bit hash values are extracted and used as keys in the bloom filter. Note that the use of MD5 ensures that the hash is strongly random and that the resulting bloom filters are independent.
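A sketch of this derivation is shown below. The text does not say which bytes of the digest are used, so taking the first three consecutive 16-bit big-endian words is an assumption, as is the function name `bloom_positions`. Positions here are numbered from 0 to 65,535, which matches a filter of 64 Kbits.

```python
import hashlib


def bloom_positions(term, k=3):
    """Return k bit positions in [0, 65535], one per bloom filter.

    Assumption: the k values are consecutive 16-bit big-endian words at the
    start of the 16-byte MD5 digest; the text does not specify the bytes.
    """
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return tuple(int.from_bytes(digest[2 * i:2 * i + 2], "big") for i in range(k))
```

Since a 16-byte digest holds eight such words, the same construction would support up to k=8 filters without computing a second hash.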
As mentioned in Section 3, only publicly shared files at each peer are indexed. Users can also explicitly declare specific directories to be indexable or not. For each indexable file, the Indexer indexes keywords that appear in its URL. In addition, if the file is of type HTML or text, the Indexer indexes keywords within the file itself.
Figure 8(a) shows that only a few peers have a large (two-thirds) fraction of their bits set. In addition to increasing network traffic, these peers could face high query processing loads. A relatively simple solution to this problem is to create partitions of content at such peers, with distinct bloom filters summarizing each partition instead of each node.
Figure 8(b) shows that most of the bits in the bloom index are highly selective, though a few bits that correspond to the most frequently occurring words (about ) are set by almost 80% of peers. YouSearch could be made to filter stop words to reduce this effect.
To evaluate overall query performance, we logged statistics for global queries asked at a YouSearch peer. We formed a query set comprised of the first 1,500 global queries logged during the monitoring period (September 9, 2002 to November 9, 2002). The queries were sorted based on the time taken to gather answers at the querying peer.
Figure 9 shows that more than half the queries were answered in less than 10 seconds. Nearly 10% of the queries took more than a minute to be answered. Note that these times are not the response times seen by the browser. As discussed in Section 3, the results are displayed to the user while they are being gathered.
We also plotted the number of peers that were contacted for answers to the corresponding query. Not surprisingly, the curve for number of peers contacted follows the time curve closely. The jitter in this curve can be attributed to the geographic distance between peers and the fact that even on the IBM intranet, nodes have vastly different bandwidth and latency characteristics (some connect via dialup VPN, for example). We note that the current implementation probes peers sequentially. Parallelizing such probes by contacting nodes simultaneously will result in proportional improvements in gather times. A factor of ten in improvement is easily feasible and will bring the median gather time to a sub-second range.
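The parallel probing suggested here might look like the following sketch. It is not part of the deployed system, which probes sequentially. The function `gather` and its `probe` callback (which stands for whatever call fetches result URLs from one peer) are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def gather(peers, probe, max_workers=16):
    """Call probe(peer) for each peer concurrently and yield (peer, urls)
    as each answer arrives. A probe that raises is treated as no answer."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(probe, p): p for p in peers}
        for fut in as_completed(futures):
            peer = futures[fut]
            try:
                yield peer, fut.result()
            except Exception:
                yield peer, []       # unreachable or failed peer
```

Yielding answers in completion order lets the display start with the fastest peers, so a single slow or unreachable node delays only its own results.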
Not surprisingly, the longest queries were also observed to have large answer sets. For these queries, the results were in fact collected faster than a user is likely to inspect them.
94.47% of queries were simple keyword queries with an average length of 1.22 keywords and standard deviation 0.58. The remaining 5.53% (83 of 1,500) involved an advanced feature like site or group search. About 70% of the queries had at least one answer with an average of 254.41 answers per query obtained from an average of 22.83 peers. Figure 9 plots the number of peers (size of R) that were probed to obtain answers.
We believe that the users are still adjusting to the availability of search, with a significant amount of content remaining unindexed due to YouSearch's default behavior of leaving content unindexed should it be hidden behind index.html files. As users become more familiar with YouSearch, more data will be made available for searching, and the fraction of successful queries will increase.
We observed that 17.38% of the queries had a false-positive peer in their result sets. The average number of false-positive peers was 0.139, which corresponds to 0.6% of an average result set of size 22.83 peers. Most of these false positives are due to the few peers in Figure 8(a) that have a large fraction of their bits set.
Of the successful queries, 3.54% were served from peer caches. This value will increase as the system grows due to higher query loads. Additionally, we have been very aggressive in clearing caches: the default cache lifetime is set to 5 minutes. Increasing this default parameter will lead to improved cache hit rates, though with a slight penalty in result freshness. Indeed, nearly a third (31.31%) of all queries in our sample were asked more than once.
To better quantify the effect of caching on performance, we issued a sample of 25 queries at one peer, and then repeated these same 25 queries at different peers in the network. The second time a query was executed, results were gathered from a cache instead of gathered from scratch. Figure 10 shows the times taken in the two invocations. Clearly caching improves performance, often by an order of magnitude.
Recall that YouSearch utilizes a centralized registrar for providing an aggregated bloom filter index. In this section we analyze the communication and processing demands required of this component.
Suppose there are n peers participating in the network. Each of the n peers will send its k bloom filters of size l bits every T seconds if its content has changed, where T is the period at which crawls occur at each peer. Let f be the fraction of peers whose content changes in an interval of T seconds. The registrar thus has an average inbound traffic of $n \cdot f \cdot k \cdot l$ bits every T seconds.
In the current YouSearch deployment, k=3, l=65,536 and T=300 seconds. We conservatively set the frequency of site changes, f, to 20%. With such settings, assuming the registrar has a T1-line bandwidth of 1.54M bits per second, of which 20% is consumed by networking overhead, the registrar could support 9,856 peers. Assuming a corporate private network with a T3 line capacity of 44.736M bits per second, the registrar could support n=286,310 peers. These are rather loose upper bounds given our conservative setting of site-modification frequency. Bandwidth could be further reduced by having peers send only changed bits [22] instead of re-sending the entire bloom filter with each change. The current design nevertheless easily supports the current and projected user bases within the IBM deployment.
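The two capacity figures can be reproduced with a short calculation: the largest n for which n·f·k·l bits fit into the usable link capacity over T seconds. The sketch assumes that 1M denotes 2^20 bits (the reading that reproduces the quoted numbers) and that the same 20% overhead applies to both links. The function name `max_peers` is hypothetical.

```python
from fractions import Fraction
from math import floor


def max_peers(link_mbits_per_s, overhead, T, f, k, l):
    """Largest n with n*f*k*l <= usable bits per T seconds.

    Decimal inputs are passed as strings so the arithmetic is exact.
    Assumes 1M = 2**20 bits, which reproduces the figures in the text.
    """
    usable = Fraction(link_mbits_per_s) * 2**20 * (1 - Fraction(overhead)) * T
    return floor(usable / (Fraction(f) * k * l))


print(max_peers("1.54", "0.2", 300, "0.2", 3, 65536))     # T1 line: 9856
print(max_peers("44.736", "0.2", 300, "0.2", 3, 65536))   # T3 line: 286310
```

With these settings the bound simplifies to 6,400 peers per Mbit/s of raw link capacity, so halving f or doubling T doubles the number of peers supported.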
Let us now consider the processing costs at the registrar. For any reasonable number of peers, the registrar can easily maintain the three mappings of Section 4 in main memory. For each query, the registrar performs a small number of simple lookups on these data structures which amount to easily-optimized bit-vector operations. This design permits even modest hardware to scale to tens if not hundreds of thousands of users and their queries.
We realized that tuning peer deployments for optimal performance would be a difficult job. The difficulty stems from the end user's reluctance to download new releases of software that is already providing a desired level of satisfaction. In this section we present a simple solution that we designed in response.
While every client-side software deployment suffers from similar end-user inertia, the problem is especially severe for P2P networks. A P2P network draws its benefits from having large numbers of peers participate to form a single community of users. The large number of peers directly translates to a large number of individual software deployments that need to be tuned. More importantly, the single community requires that such tuning be simultaneous across all peers: the success of the community relies on all peers using the same protocol.
L=512b. We might have found that a significant fraction of the community sets most of the bits. Each query would then be mapped to all the peers in the network causing the Result Gatherer phase to degenerate into a broadcast of the query over the entire network. The correct thing to do would be to increase the size of the bloom filters to 1024 bits. However, the problem is not solved by just rolling out a new version of the software. If the new and the old versions co-existed, the same term would be hashed to a different bit position in each version. Thus the same query would be represented differently at different peers. While the problem could be temporarily solved by maintaining two different filters at the Registrar (one each for 512 bits and for 1024 bits), the complexity of the code would increase. Further, each such tweak would result in more filters being created and maintained for backward compatibility.
Similar to the size of bloom filters discussed in Example 6.1, there are several parameters in the code that need to be tweaked based on the usage patterns of peers (e.g., the number of bloom filters to create at each peer, the frequency of sending bloom filters to the Registrar from each peer, the duration of time-outs while contacting a peer to gather results, etc.). We label such parameters, which arise from an implementation of the conceptual design, tunable parameters.
We programmed YouSearch to allow distributed tuning. Each YouSearch peer is enabled with a Tuning Manager. The Tuning Manager works with a centralized Administrator to receive and effect the changes pushed out by the Administrator. The Manager creates a local state file on disk at the peer during the installation process. Signed messages received from the Administrator are interpreted, acted upon and persisted in the state file by the Manager. The rest of the application reads the maintenance state to respond to the changes pushed out by the Administrator.
The various parameters that need to be tuned in the code are identified and set to values read from the maintenance state file at application launch. Changes to the values of these tunable parameters can be sent to the Tuning Manager by the Administrator. The Manager merely overwrites the values for these parameters in the state file to effect the changes. Thus, we can simultaneously change the settings for tunable parameters of as many peers as we like, allowing us to experiment and tune the network as it evolves.
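A minimal sketch of such a manager follows. The text says only that messages are signed, so the HMAC-SHA256 scheme with a shared key used here is an assumption made to keep the example within the standard library (a real deployment would more likely verify a public-key signature). The JSON state file format, the class name `TuningManager`, and the parameter name `bloom_bits` in the usage note are likewise hypothetical.

```python
import hashlib
import hmac
import json


class TuningManager:
    """Applies signed parameter updates and persists them to a state file."""

    def __init__(self, state_path, key, defaults):
        self.state_path = state_path
        self.key = key                      # hypothetical shared secret (bytes)
        try:
            with open(state_path) as fh:
                self.params = json.load(fh)
        except FileNotFoundError:
            self.params = dict(defaults)    # first run: create the state file
            self._save()

    def _save(self):
        with open(self.state_path, "w") as fh:
            json.dump(self.params, fh)

    def apply(self, payload, signature):
        """payload: JSON bytes of {name: value}; signature: hex HMAC-SHA256.
        Returns True if the update was accepted and persisted."""
        expected = hmac.new(self.key, payload, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, signature):
            return False                    # ignore unsigned or forged messages
        self.params.update(json.loads(payload))
        self._save()
        return True
```

An update such as `{"bloom_bits": 1024}` overwrites the stored value, and the rest of the application picks it up the next time it reads the state file at launch.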
In this paper, we addressed the challenge of providing fresh, fast and complete search over personal webserver-hosted content. Because of the transient availability of personal webservers and their rapidly evolving content, any crawl-based search solution suffers from stale and incomplete results. Our solution, YouSearch, is instead a hybrid peer-to-peer system that relies primarily on the webservers themselves to keep pace with changing network state. It scales gracefully and costs little since its centralized resource requirements are small.
YouSearch also enhances the shared context among its users. Personal webservers can be aggregated into overlapping, user specified groups, and these groups searched just as individual nodes. Any group member can persist result recommendations so that others can draw upon its knowledge.
Within two months of its deployment, YouSearch has already been adopted by nearly 1,500 users. Our study of its usage in this real-life setting showed that YouSearch performs well and, most importantly, satisfies user needs.
Future work might consider allowing authenticated peers to search secured in addition to public content. Other useful extensions would include having a peer generate snippets of matching body text for its (cached) search results, exploit social networks (defined by existing group definitions) for personalized inter-host ranking, and even actively maintain cached results for the most popular queries (instead of simply timing them out to avoid staleness). Unlike purely centralized search architectures, the plethora of compute, storage, and bandwidth available to the set of YouSearch peers as a whole puts few constraints on further enhancement.
We thank Rakesh Agrawal for many insightful comments on this draft and YouSearch in general. We thank Dan Gruhl for designing the YouSearch logo. Finally, we acknowledge the many users of our internal YouSearch deployment for their valuable feedback. Mayank also thanks Amit Somani for introducing him to the YouServ project.