Refereed Papers
Track: Search: Crawlers
Paper Title:
IRLbot: Scaling to 6 Billion Pages and Beyond
Authors:
- Hsin-Tsang Lee (Texas A&M University)
- Derek Leonard (Texas A&M University)
- Xiaoming Wang (Texas A&M University)
- Dmitri Loguinov (Texas A&M University)
Abstract:
This paper shares our experience in designing a web crawler that can
download billions of pages using a single-server implementation and
models its performance. We show that with the quadratically increasing
complexity of verifying URL uniqueness, BFS crawl order, and fixed
per-host rate-limiting, current crawling algorithms cannot effectively
cope with the sheer volume of URLs generated in large crawls,
highly-branching spam, legitimate multi-million-page blog sites, and
infinite loops created by server-side scripts. We offer a set of
techniques for dealing with these issues and test their performance in an
implementation we call IRLbot. In our recent experiment that lasted 41
days, IRLbot running on a single server successfully crawled 6.3
billion valid HTML pages (7.6 billion connection requests) and
sustained an average download rate of 319 Mb/s (1,789 pages/s).
Unlike our prior experiments with algorithms proposed in related work,
this version of IRLbot did not experience any bottlenecks and
successfully handled content from over 117 million hosts, parsed out
394 billion links, and discovered a subset of the web graph with 41
billion unique nodes.
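
For context, the short sketch below is not taken from the paper; it is a minimal illustration of the naive breadth-first crawl with an in-memory URL-seen set whose growth the abstract identifies as a scaling bottleneck. The fetch and parse_links callables and all names are hypothetical placeholders, not IRLbot's interfaces.

    # Minimal sketch (not IRLbot's method): naive BFS crawling with an
    # in-memory duplicate check. At the scale reported above (tens of
    # billions of unique URLs), the 'seen' set alone would require
    # hundreds of gigabytes of RAM, which motivates the techniques the
    # paper proposes.
    from collections import deque

    def crawl_bfs(seed_urls, fetch, parse_links, max_pages):
        seen = set(seed_urls)          # unbounded memory growth
        frontier = deque(seed_urls)    # strict BFS crawl order
        pages = 0
        while frontier and pages < max_pages:
            url = frontier.popleft()
            page = fetch(url)          # caller-supplied downloader (placeholder)
            pages += 1
            for link in parse_links(page):   # caller-supplied parser (placeholder)
                if link not in seen:   # uniqueness check: fast per lookup,
                    seen.add(link)     # but memory-bound as the crawl grows
                    frontier.append(link)
        return pages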