Abstract
In the ocean of Web data, Web search engines are the primary way to access content. As the data is on the order of petabytes, current search engines are very large centralized systems based on replicated clusters. Web data, however, is always evolving. The number of Web sites continues to grow rapidly (over 270 million at the beginning of 2011) and there are currently more than 20 billion indexed pages. On the other hand, there are more than one billion Internet users, and hundreds of millions of queries are issued each day. In the near future, centralized systems are likely to become less effective against such a data-query load, suggesting the need for fully distributed search engines. Such engines need to maintain high-quality answers, fast response time, high query throughput, high availability, and scalability, despite network latency and scattered data. In this tutorial we present the architecture of current search engines and explore the main challenges behind the design of all the processes of a distributed Web retrieval system: crawling, indexing, and query processing.