Today's web crawlers face numerous challenges. The Internet has
reached such proportions that a crawler can neither be expected
to scan the Web in its entirety, nor refresh all content in a
timely manner. Content of questionable merit has proliferated;
to conserve constrained resources such as bandwidth, processing
time, and storage, a crawler must avoid such content while
directing its efforts toward discovering higher-value content
and refreshing known good content.
We implemented netSifter within IBM's WebFountain [3] to counter the
challenges faced by current crawler methodologies. netSifter
consists of a scalable, flexible architecture that approaches the
Web as a collection of websites, so as to avoid evaluating web
pages one by one. The key idea is to rank the URL frontier by
making page-level judgments based on knowledge of the
originating site. This site knowledge is created
by performing various analyses on a sample of pages from the
site, and subsequently formulating an overall site score. Future
pages from the site can then be prioritized in relation to other
pages via the site score. The rationale for this idea is that
pages from a common site are related, and share characteristics
indicative of quality.
netSifter is extensible to varied needs, allowing it to take
advantage of both modern focused crawling techniques [1] and full-web
crawling strategies [2]. For example,
netSifter can make use of relatively expensive content-based
analyses while remaining scalable, since only a sample of pages
from a site are examined. Additionally, netSifter can make
simplified use of popularity measures by, for example, examining
outlinks from the sample set of web pages, or by incorporating
link-based ranking results into the site scores periodically.
Since netSifter employs a plurality of analysis techniques, a
website is not excluded or included based on any one
metric.
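As a minimal sketch of how such a plurality of metrics might be combined, assuming hypothetical analyzer names and weights (netSifter's actual annotators and configuration are not given here):

```python
# Hypothetical sketch: combining several per-page analyzer scores into one
# weighted score so that no single metric alone includes or excludes a site.
# Analyzer names and weights are illustrative assumptions.

ANALYZER_WEIGHTS = {
    "content_quality": 0.4,   # e.g. text/template quality signal
    "spam_signals":    0.3,   # negative evidence such as keyword stuffing
    "link_popularity": 0.3,   # e.g. outlink- or SiteRank-derived signal
}

def combine_scores(scores: dict[str, float]) -> float:
    """Weighted average of analyzer scores, each assumed to lie in [-1, 1]."""
    total = sum(ANALYZER_WEIGHTS[name] * scores[name] for name in ANALYZER_WEIGHTS)
    return total / sum(ANALYZER_WEIGHTS.values())

# Example: a page with good content, mild spam evidence, decent popularity.
page_score = combine_scores(
    {"content_quality": 0.8, "spam_signals": -0.2, "link_popularity": 0.5}
)
```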
Table 1: Results for unbiased and netSifter crawls

|           | Pages     | Sites  | ODP (Sites) | ODP (Pages) | ODP Utility | ODP Utility (Weighted) | SiteRank Utility |
|-----------|-----------|--------|-------------|-------------|-------------|------------------------|------------------|
| unbiased  | 8,290,541 | 95,298 | 35.8%       | 60.7%       | 5,423,823   | 1,564,620,987          | 251.9            |
| netSifter | 7,245,369 | 65,747 | 43.0%       | 67.7%       | 5,075,633   | 4,364,345,087          | 438.1            |
The
architecture is divided into two stages: online and offline
analysis. The online analysis stage comprises the scheduler
and the online analysis manager. These components direct the
crawl, and route pages to offline analysis. The offline analysis
stage contains a UIMA [4]
analytics chain which performs more extensive content inspection
and analysis in order to generate site scores. This analytics
chain can be distributed across many nodes via a service-oriented
architecture, and is able to run independently and asynchronously
with respect to the main crawling process. The individual
components are described below.
- Site Score Database The site score database serves
as the link between the online and offline analysis stages,
where online analysis queries for site scores in order to
prioritize the URL frontier, and offline analysis generates or
updates them. Site scores represent the relative value of a
site compared to any other site.
- Scheduler The crawler perpetually iterates over a
list of all URLs that have been discovered or seeded. The list
is retrieved in batches, and each URL in the batch is given a
score based on its website's score. The batch is sorted by
score in descending order, and then sent to the fetcher. The
fetcher is allocated a time period in which to fetch as many
URLs as possible, though it is generally not able to exhaust a
given list. If a site score does not exist, a slightly higher
than neutral score is assigned, as unscored sites are favored
for their potential to contain novel content (a prioritization
sketch appears after this list).
- Online Analysis Manager In addition to being sent
to the main system for storage and indexing, all newly fetched
pages are routed to the online analysis manager, which
determines whether or not a page should be sent to the offline
analysis stage. If a page is empty, contains "soft" errors
(e.g., an error page returned with a successful HTTP status),
or fails other basic data-validity tests, it is
discarded. A page is sent offline if no site score for the page
exists, or if a sufficient period of time has elapsed since a
site score was produced.
- Offline Analytics Chain The offline analytics
chain passes a page through a series of annotators, each of
which scores the page using a different algorithm. The final
annotator in the chain combines the individual scores into a
weighted average and applies optional heuristics. The final
page score is then stored temporarily in a database. Once a
sufficient sample of pages for a site have been collected,
those page scores are aggregated and submitted to the site
score database.
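The prioritization sketch referenced in the scheduler description above might look roughly like this, assuming a site_scores table keyed by hostname; NEUTRAL and NOVELTY_BONUS are illustrative values, not the system's actual parameters:

```python
# Minimal sketch of the scheduler's batch prioritization. Unscored sites
# receive a slightly-above-neutral default so novel content is favored.
from urllib.parse import urlparse

NEUTRAL = 0.0
NOVELTY_BONUS = 0.1  # hypothetical bump for never-scored sites

def prioritize_batch(urls: list[str], site_scores: dict[str, float]) -> list[str]:
    def score(url: str) -> float:
        host = urlparse(url).hostname or ""
        return site_scores.get(host, NEUTRAL + NOVELTY_BONUS)
    # Sort descending so the fetcher sees the most promising URLs first;
    # it may not exhaust the batch before its time slice ends.
    return sorted(urls, key=score, reverse=True)
```

Sorting each batch rather than the entire frontier keeps the prioritization cheap while still ensuring the fetcher spends its limited time period on the highest-scoring URLs.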
netSifter employs a sampling method to determine when a
sufficient sample has been collected from a site. It is assumed
that the sample a crawler generates consists of independent and
identically distributed pages. A sufficient sample is defined
as one which produces a page-score mean $\bar{x}$ that
estimates the population mean $\mu$ to within $\pm\epsilon$ at
a confidence level of 95% using a standard t-test.
We require a website to pass the t-test for three
sequential observations in order to provide an opportunity for
the crawl to move past locally consistent content, and find a
better estimate of the mean for the entire site. A maximum
sample size is also imposed, so that the system does not keep
analyzing pages from websites which never produce a consistent
site score.
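A minimal sketch of this sufficiency check, assuming i.i.d. page scores; EPSILON (the tolerated half-width, written as $\pm\epsilon$ above) and MAX_SAMPLE are assumed values, and the three-sequential-passes rule is left to the caller:

```python
# Sketch of the sample-sufficiency test: the sample mean must estimate the
# population mean within EPSILON at 95% confidence. EPSILON and MAX_SAMPLE
# are hypothetical parameters, not the paper's actual settings.
import math
from scipy import stats

EPSILON = 0.05      # tolerated half-width around the sample mean (assumed)
CONFIDENCE = 0.95
MAX_SAMPLE = 500    # cap on pages analyzed per site (assumed)

def sample_is_sufficient(page_scores: list[float]) -> bool:
    n = len(page_scores)
    if n >= MAX_SAMPLE:
        return True   # stop analyzing sites that never stabilize
    if n < 2:
        return False
    mean = sum(page_scores) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in page_scores) / (n - 1))
    # Half-width of the 95% confidence interval for the population mean.
    half_width = stats.t.ppf((1 + CONFIDENCE) / 2, df=n - 1) * sd / math.sqrt(n)
    return half_width <= EPSILON
```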
Figure 1: netSifter and SiteRank score correlation
To validate that netSifter accurately measures site
quality, we compared website scores to SiteRank (a site-level
variant of PageRank) scores. A list of corresponding netSifter
and SiteRank scores for sites was produced, and sorted in
descending order of SiteRank score. Sites were grouped into
buckets, and counts of netSifter scores greater than, equal to,
and less than 0 were generated. The results can be seen in
Figure 1. There
is a positive correlation between higher netSifter scores and
site-connectedness.
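The bucketed comparison could be reproduced roughly as follows; the pair representation and BUCKET_SIZE are assumptions for illustration:

```python
# Sketch of the correlation check: sites sorted by descending SiteRank and
# grouped into fixed-size buckets, counting positive, zero, and negative
# netSifter scores per bucket. BUCKET_SIZE is an assumed value.
BUCKET_SIZE = 1000

def bucket_counts(pairs: list[tuple[float, float]]) -> list[tuple[int, int, int]]:
    """pairs holds (siterank_score, netsifter_score) per site."""
    ordered = sorted(pairs, key=lambda p: p[0], reverse=True)
    buckets = []
    for i in range(0, len(ordered), BUCKET_SIZE):
        chunk = [ns for _, ns in ordered[i:i + BUCKET_SIZE]]
        buckets.append((
            sum(1 for s in chunk if s > 0),   # netSifter positive
            sum(1 for s in chunk if s == 0),  # netSifter zero
            sum(1 for s in chunk if s < 0),   # netSifter negative
        ))
    return buckets
```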
Among the top 1000 SiteRank scores, 148 sites were scored
negatively by netSifter. Many of these were Asian-language,
spam, link-farm, and adult-content sites. The presence of
Asian-language sites
indicates that some annotators improperly handle non-English
content. netSifter correctly assigned negative scores to spam and adult-content sites,
even though these sites were rated well by SiteRank. Among the
bottom 1000 SiteRank scores, 373 sites were scored positively by
netSifter. Manual examination of these sites rated a large
majority as positive. This shows that netSifter was able to
identify interesting sites which were not well-connected.
The second experiment compares a netSifter crawler against an
unbiased crawler. The utility of the crawlers was measured
using ODP Utility (the count of pages crawled from sites which
appear in the ODP listings), ODP Utility (Weighted) (the sum
over sites of the number of pages crawled from a site
multiplied by the number of that site's appearances in the
ODP), and SiteRank Utility (the sum over sites of the number of
pages crawled from a site multiplied by the SiteRank of that
site). The results are shown in Table 1.
Though netSifter crawled fewer pages than the unbiased crawl, it
outperformed it on ODP Utility (Weighted) and
SiteRank Utility.
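Under these definitions, the three measures could be computed from per-site page counts roughly as follows; the lookup-table names are illustrative assumptions:

```python
# Sketch of the three utility measures defined above. pages_per_site maps a
# site to the number of its pages crawled; odp_appearances and siterank are
# assumed lookup tables for ODP listing counts and SiteRank scores.

def odp_utility(pages_per_site: dict[str, int],
                odp_appearances: dict[str, int]) -> int:
    # Count of crawled pages whose site appears in the ODP at all.
    return sum(n for site, n in pages_per_site.items()
               if odp_appearances.get(site, 0) > 0)

def odp_utility_weighted(pages_per_site: dict[str, int],
                         odp_appearances: dict[str, int]) -> int:
    # Pages from a site weighted by how often the site appears in the ODP.
    return sum(n * odp_appearances.get(site, 0)
               for site, n in pages_per_site.items())

def siterank_utility(pages_per_site: dict[str, int],
                     siterank: dict[str, float]) -> float:
    # Pages from a site weighted by that site's SiteRank.
    return sum(n * siterank.get(site, 0.0)
               for site, n in pages_per_site.items())
```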
netSifter demonstrates that by exploiting the logical
association of a web page to a website, and then forming an
estimate of the overall quality of a website, the URL frontier of
a web-scale crawler can be effectively prioritized to bias the
crawler towards higher-quality content.
References

- [1] S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: a new approach to topic-specific Web resource discovery. Computer Networks (Amsterdam, Netherlands: 1999), 1999.
- [2] J. Cho, H. García-Molina, and L. Page. Efficient crawling through URL ordering. Computer Networks and ISDN Systems, 1998.
- [3] D. Gruhl, L. Chavet, D. Gibson, J. Meyer, P. Pattanayak, A. Tomkins, and J. Zien. How to build a WebFountain: An architecture for very large-scale text analytics. IBM Systems Journal, 43(1):64-77, 2004.
- [4] International Business Machines Corporation. UIMA. http://www.research.ibm.com/UIMA.
Footnotes

- González: Work done at IBM Almaden Research Center.