Utilizing the Subjective Intent of Authoring Formats to Perform Focused
Web Crawling
Hok Peng Leung and Wynne Hsu
School of Computing
National University of Singapore
Lower Kent Ridge Road, Singapore 119260
{leunghp, whsu}@comp.nus.edu.sg
Introduction
A successful web information retrieval system requires the ability to determine
quickly and accurately whether a document or a link should be further explored.
Many researchers have sought to improve the performance of such systems
by exploiting the different kinds of information available in web documents. In
this paper, we propose a fast and accurate approach to determining the
relevance of a document by taking into account the information embedded
within its HTML formatting tags. Using this information, we can quickly
narrow down the scope of our search to the most promising sites. In addition,
a new query formulation strategy is proposed to further improve the accuracy
of the new approach. We conducted a number of experiments to test
the effectiveness of the proposed approach and the crawling strategy. The
experimental results indicate a significant improvement
over the standard information retrieval algorithm based on tf*idf. Furthermore,
our algorithm, unlike the tf*idf scheme, does not require the whole document
space to be known in advance. This feature makes our algorithm well suited
to the web, where it is impossible to know the entire document space in
advance.
Our HTML-based Text-Emphasis cum Query Reformulation Approach
The motivation for HTML-based text emphasis is to take the subjective intent
of the author into account during retrieval by assigning different
weights to the different HTML tags used to format the document. The algorithm
recursively assigns higher weights to terms
that are enclosed within text-emphasis tags. In addition, query phrase
formulation is used to process the query as a composition of phrases rather
than as individual terms. For example, for a query Q="Natural Language
Processing", a document that contains the string "Natural Language…Processing"
is more relevant than documents that contain only single words such as
"Natural", "Language", or "Processing". In this strategy, we first generate
all the possible phrases that can be formed from the query string. They
include:
- (Natural Language Processing)
- (Natural) & (Language Processing)
- (Natural Language) & (Processing)
- (Natural) & (Language) & (Processing)
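The enumeration above can be sketched in a few lines of Python. This is only an illustration, under the assumption that a "phrase" is a contiguous run of query words; the function name `segmentations` is ours and does not come from the thesis.

```python
from itertools import combinations


def segmentations(query):
    """Return every split of the query into contiguous phrases,
    ordered from the fewest (longest) phrases to the most (shortest)."""
    words = query.split()
    n = len(words)
    result = []
    for cuts in range(n):  # number of cut points placed between words
        for positions in combinations(range(1, n), cuts):
            bounds = (0, *positions, n)
            # Each adjacent pair of bounds delimits one phrase.
            result.append(
                [" ".join(words[a:b]) for a, b in zip(bounds, bounds[1:])]
            )
    return result
```

For a query of n words this yields 2^(n-1) cases, so the three-word example produces exactly the four cases listed, in the same order.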
We attempt to match each of these cases to the document in turn, starting
from the longest phrase (case 1, the first item above) and ending with the
shortest phrases (case 4, the last item). The
results are then weighted and averaged to give the overall similarity measure.
Full details can be found in the full thesis.
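To make the two ideas in this section concrete, the following is a minimal sketch of how tag-based emphasis and phrase matching might be combined. It is not the measure defined in the thesis: the tag weights in `TAG_WEIGHTS`, the choice to multiply the weights of nested tags, the requirement that phrase words be adjacent, and the 1/len weighting of each case are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical emphasis weights; the actual values are not given here.
TAG_WEIGHTS = {"h1": 3.0, "h2": 2.5, "h3": 2.0,
               "b": 2.0, "strong": 2.0, "em": 1.5, "i": 1.5}


class EmphasisParser(HTMLParser):
    """Collects (term, weight) pairs; weights of nested emphasis tags multiply."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open emphasis tags, innermost last
        self.terms = []  # (lowercased term, weight) in document order

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the innermost matching tag; tolerates sloppy HTML.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        weight = 1.0
        for tag in self.stack:
            weight *= TAG_WEIGHTS[tag]
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.terms.append((term, weight))


def emphasized_terms(html):
    parser = EmphasisParser()
    parser.feed(html)
    parser.close()
    return parser.terms


def phrase_score(phrase, terms):
    """Best mean emphasis weight over adjacent matches of the phrase, else 0.0."""
    words = phrase.lower().split()
    best = 0.0
    for i in range(len(terms) - len(words) + 1):
        window = terms[i:i + len(words)]
        if [t for t, _ in window] == words:
            best = max(best, sum(w for _, w in window) / len(words))
    return best


def similarity(cases, html):
    """Weighted average over the cases; cases with fewer, longer phrases count more."""
    terms = emphasized_terms(html)
    total = weight_sum = 0.0
    for case in cases:
        case_score = sum(phrase_score(p, terms) for p in case) / len(case)
        case_weight = 1.0 / len(case)  # assumption: favour longer phrases
        total += case_weight * case_score
        weight_sum += case_weight
    return total / weight_sum if weight_sum else 0.0
```

Here `cases` is the list of phrase combinations for a query, each given as a list of phrases. Under these assumptions, a page whose heading contains the whole query phrase scores higher than a page that only mentions the individual words in body text.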
Based on the new similarity measure, our crawler
decides which page is the most relevant one from which to begin its search.
Once a page has been selected, the crawler determines which of the many links
within that page is the most relevant to drill down into.
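The crawling strategy described above can be sketched as a greedy walk. This is an illustrative simplification: the names `drill_down`, `score`, and `links_of` are hypothetical, the similarity measure and the link extractor are passed in as plain functions, and the depth limit and the absence of backtracking are our assumptions rather than details taken from the thesis.

```python
def drill_down(seeds, score, links_of, max_depth=5):
    """Start at the best-scoring seed, then repeatedly follow the
    best-scoring unvisited link found on the current page."""
    if not seeds:
        return []
    page = max(seeds, key=score)
    path = [page]
    visited = {page}
    for _ in range(max_depth):
        candidates = [url for url in links_of(page) if url not in visited]
        if not candidates:
            break  # dead end: no unvisited links to follow
        page = max(candidates, key=score)
        visited.add(page)
        path.append(page)
    return path
```

In a real crawler, `score` would apply the similarity measure to the fetched page or to the text around the link, and `links_of` would download and parse the page; both are left abstract here so the sketch runs without network access.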