Utilizing the Subjective Intent of Authoring Formats to Perform Focused Web Crawling

Hok Peng Leung and Wynne Hsu
School of Computing
National University of Singapore
Lower Kent Ridge Road, Singapore 119260
{leunghp, whsu}@comp.nus.edu.sg

Introduction

A successful web information retrieval system must be able to determine quickly and accurately whether a document or a link should be explored further. Many researchers have sought to improve the performance of such systems by exploiting the different kinds of information available in web documents. In this paper, we propose a fast and accurate approach to determining the relevance of a document by taking into account the information embedded within its HTML formatting tags. Using this information, we can quickly narrow the scope of our search to the most promising sites. In addition, we propose a new query formulation strategy to further improve the accuracy of the approach. We have conducted a number of experiments to test the effectiveness of the proposed approach and the crawling strategy. The experimental results indicate a significant improvement over the standard information retrieval algorithm based on tf*idf. Furthermore, unlike the tf*idf scheme, our algorithm does not require the whole document space to be known in advance, which makes it suitable for the web, where the entire document space cannot be known in advance.

Our HTML-based Text-Emphasis cum Query Reformulation Approach

The motivation for HTML-based text emphasis is to take the subjective intent of the author into account during retrieval by assigning different weights to the different HTML tags used to format the document. The algorithm recursively assigns higher weights to terms that are enclosed within text-emphasis tags. In addition, query phrase formulation is used to process the query as a composition of phrases rather than as individual terms. For example, for a query Q="Natural Language Processing", a document that contains the string "Natural Language…Processing" is more relevant than documents that contain only single words such as "Natural", "Language", or "Processing". In this strategy, we first generate all the possible phrases that can be formed from the query string. They include:

1. (Natural Language Processing)
2. (Natural) & (Language Processing)
3. (Natural Language) & (Processing)
4. (Natural) & (Language) & (Processing)
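The enumeration of these cases can be sketched in a few lines of Python. This is only an illustration: the function name `segmentations` and the approach of choosing cut points between adjacent words are assumptions made here, not taken from the paper.

```python
from itertools import combinations


def segmentations(query):
    """Return every way to split the query into contiguous phrases.

    Results are ordered from the fewest phrases (the whole query)
    to the most phrases (one word per phrase).
    """
    words = query.split()
    n = len(words)
    results = []
    # Choose k cut points among the n - 1 gaps between adjacent words.
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = [0, *cuts, n]
            phrases = [" ".join(words[a:b]) for a, b in zip(bounds, bounds[1:])]
            results.append(phrases)
    return results


for case in segmentations("Natural Language Processing"):
    print(" & ".join(f"({p})" for p in case))
```

For the three-word query this prints the four cases listed above, in the same order. A query of n words has 2^(n-1) such segmentations, so very long queries would probably need to be capped or pruned in practice.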

We attempt to match each phrase to the document iteratively, starting from the longest phrase (case 1) and ending with the shortest phrases (case 4). The results are then weighted and averaged to give the overall similarity measure. Full details can be found in the full thesis.
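To make the combination of tag emphasis and phrase matching concrete, here is a minimal Python sketch. The paper gives neither the tag weights nor the averaging formula, so everything specific below is an assumption for illustration only: the values in `TAG_WEIGHTS`, the multiplication of weights for nested tags, the per-phrase score, and the `1 / len(phrases)` case weight.

```python
import re
from html.parser import HTMLParser

# Hypothetical emphasis weights; the source does not list the actual values.
TAG_WEIGHTS = {"title": 4.0, "h1": 3.0, "h2": 2.0, "b": 1.5, "strong": 1.5, "em": 1.5}


class EmphasisParser(HTMLParser):
    """Collect (term, weight) pairs; nested emphasis tags multiply weights."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open emphasis tags
        self.terms = []  # (lowercased term, weight) in document order

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag (tolerates unclosed inner tags).
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        weight = 1.0
        for tag in self.stack:
            weight *= TAG_WEIGHTS[tag]
        self.terms.extend((w, weight) for w in re.findall(r"\w+", data.lower()))


def phrase_score(phrase, terms):
    """Best mean emphasis weight over all occurrences of the phrase, else 0."""
    target = phrase.lower().split()
    best = 0.0
    for i in range(len(terms) - len(target) + 1):
        window = terms[i:i + len(target)]
        if [t for t, _ in window] == target:
            best = max(best, sum(w for _, w in window) / len(target))
    return best


def similarity(cases, html):
    """Weighted average of per-case scores, processed from case 1 onward."""
    parser = EmphasisParser()
    parser.feed(html)
    parser.close()
    total = weight_sum = 0.0
    for phrases in cases:
        # Assumed case weight: cases with fewer, longer phrases count more.
        case_weight = 1.0 / len(phrases)
        case_score = sum(phrase_score(p, parser.terms) for p in phrases) / len(phrases)
        total += case_weight * case_score
        weight_sum += case_weight
    return total / weight_sum if weight_sum else 0.0


cases = [
    ["Natural Language Processing"],
    ["Natural", "Language Processing"],
    ["Natural Language", "Processing"],
    ["Natural", "Language", "Processing"],
]
emphasized = "<h1>Natural Language Processing</h1><p>An overview.</p>"
plain = "<p>Natural language processing, an overview.</p>"
scattered = "<p>Processing of natural stone.</p>"
for doc in (emphasized, plain, scattered):
    print(similarity(cases, doc))
```

Under these assumed weights, the page that places the full phrase in a heading scores highest, the page with the phrase in plain text scores lower, and the page that contains only scattered query words scores lowest but above zero.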

Based on the new similarity measure, our crawler decides which page is the most relevant one from which to begin its search. Once a page has been selected, the crawler determines which of the many links on that page is the most relevant to follow.
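One common way to realize this kind of relevance-driven selection is a best-first frontier backed by a priority queue. The paper does not say whether link selection is global or per page, so the sketch below is a stand-in technique, not the authors' crawler. The callbacks `get_links` and `score` are hypothetical, and the crawl runs on an in-memory toy graph instead of the network.

```python
import heapq


def focused_crawl(seeds, get_links, score, max_pages=10):
    """Visit pages in descending score order, starting from the best seed.

    get_links(url) -> iterable of outgoing URLs (hypothetical callback)
    score(url)     -> relevance estimate, higher is better (hypothetical callback)
    """
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-score(url), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return visited


# Toy link graph and relevance scores, invented for illustration.
graph = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
relevance = {"a": 0.9, "b": 0.2, "c": 0.7, "d": 0.8}
order = focused_crawl(["a"], lambda u: graph.get(u, []), relevance.get)
print(order)
```

In a real crawler, `score` would be the similarity measure applied to the fetched page or to the text around a link, and `get_links` would fetch and parse the page. A strictly per-page greedy descent would instead compare only the links of the page just visited.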

Experiment Results

Figure 1. Overall Performance Chart.

Conclusions
