Refereed Papers
Track: Search: Crawlers
Paper Title:
iRobot: An Intelligent Crawler for Web Forums
Authors:
- Rui Cai (Microsoft Research, Asia)
- Jiang-Ming Yang (Microsoft Research, Asia)
- Wei Lai (Microsoft Research, Asia)
- Yida Wang (Microsoft Research, Asia)
- Lei Zhang (Microsoft Research, Asia)
Abstract:
In this paper we study the problem of Web forum crawling, which is
a fundamental step in many Web applications, such as search
engines and Web data mining. As a typical form of user-created content
(UCC), Web forums have become an important resource on the Web
because of the rich information contributed by millions of Internet
users every day. However, Web forum crawling is not a trivial
problem because of deep link structures, the large number of
duplicate pages, and the many invalid pages caused by login
failure issues. In this paper, we propose and build a prototype of
an intelligent forum crawler, iRobot, which is able to
understand the content and the structure of a forum site, and then
decide how to choose traversal paths among different kinds of
pages. To do this, we first randomly sample (download) a few
pages from the target forum site, and use page content
layout as the feature by which to group those pre-sampled pages and
reconstruct the forum's sitemap. After that, we select an optimal
crawling path which only traverses informative pages and skips
invalid and duplicate ones. Extensive experimental results on
several forums demonstrate the performance of our system in the following
aspects: 1) Effectiveness – Compared to a generic crawler,
iRobot significantly reduces the number of duplicate and invalid pages; 2)
Efficiency – At the small cost of pre-sampling a few pages to
learn the necessary knowledge, iRobot saves substantial network
bandwidth and storage as it only fetches informative pages
from a forum site; and 3) Long threads that are divided into multiple
pages can be re-concatenated and archived as a whole thread,
which is of great help for further indexing and data mining.
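
The idea of grouping pre-sampled pages by content layout can be illustrated with a minimal sketch. It is not the authors' method: the layout feature used here (the set of tag and class pairs on a page) is a simplified stand-in, and the names `layout_signature`, `group_by_layout`, and the sample URLs are hypothetical. The point is only that pages built from the same template fall into one group, regardless of how many posts or threads they list.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects (tag, class) pairs for every opening tag; text is ignored."""

    def __init__(self):
        super().__init__()
        self.pairs = set()

    def handle_starttag(self, tag, attrs):
        self.pairs.add((tag, dict(attrs).get("class")))


def layout_signature(html):
    """Assumed layout feature: the set of (tag, class) pairs on the page."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.pairs)


def group_by_layout(pages):
    """Group page URLs whose HTML shares the same layout signature."""
    groups = defaultdict(list)
    for url, html in pages.items():
        groups[layout_signature(html)].append(url)
    return dict(groups)


# Hypothetical pre-sampled pages: two thread lists, one thread, one login form.
sampled = {
    "/forum?id=1": '<ul class="threads"><li class="t"><a>A</a></li></ul>',
    "/forum?id=2": '<ul class="threads"><li class="t"><a>B</a></li>'
                   '<li class="t"><a>C</a></li></ul>',
    "/thread?id=7": '<div class="post"><p>hello</p></div>',
    "/login": '<form class="login"><input></form>',
}
groups = group_by_layout(sampled)
for urls in groups.values():
    print(sorted(urls))
```

In this toy input the two thread-list pages share a signature even though they contain different numbers of entries, while the login page forms its own group. A crawler could then treat such a group as invalid and avoid links that lead to it.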