WWW2010

Classification and Clustering

Friday, 1:30–3:00 PM
Chair: Hang Li

Multi-modality in One-class Classification

Boris Chidlovskii, Matthijs Hovelynck

We propose a method for improving classification performance in a one-class setting by combining classifiers of different modalities. We apply it to the problem of distinguishing responsive documents in a corpus of e-mails, such as the Enron Corpus. We propose a way to turn the social network that is implicit in a large body of electronic communication into valuable features for classifying the exchanged documents. Working in a one-class setting, we take a semi-supervised approach based on the Mapping Convergence framework. We propose an alternative interpretation that allows for broader applicability by dismissing the prerequisite that positive and negative items must be naturally separable. We propose an extension to the one-class evaluation framework that turns out to be useful even when very few positive training examples are available. We extend the one-class setting to the co-training principle, which enables us to take advantage of the availability of multiple redundant views on the data. We evaluate this extension on the Enron Corpus, classifying towards responsiveness.

A Large Scale Active Learning System for Topical Categorization on the Web

Suju Rajan, Dragomir Yankov, Scott Gaffney, Adwait Ratnaparkhi

Many web applications, such as ad matching systems, vertical search engines, and page categorization systems, require the identification of a particular type or class of pages on the Web. The sheer number and diversity of the pages on the Web, however, make it hard to obtain a good sample of the class of interest. In this paper, we describe a successfully deployed end-to-end system that starts from a manually collected, biased training sample and makes use of several state-of-the-art machine learning systems working in tandem, including a powerful active learning component, in order to achieve a good classification system. The system is evaluated on the traffic to a real-world ad-matching platform and is shown to significantly reduce editorial effort and labeling time while maintaining pre-specified performance criteria.
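
The abstract does not say which query strategy the active learning component uses. As a point of reference only, the sketch below shows uncertainty sampling, one common strategy, in which the pages sent to editors are those whose classifier score lies closest to the decision boundary. The function name, the 0.5 boundary, and the scores are all hypothetical and are not taken from the paper.

```python
def select_most_uncertain(scores, batch_size):
    """Return indices of the batch_size items whose score is closest to 0.5.

    scores: estimated probabilities (one per unlabeled page) that the page
    belongs to the target class. Assumption: a binary classifier with a
    decision boundary at 0.5; the paper does not specify this.
    """
    # Rank every index by distance from the boundary, most uncertain first.
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i] - 0.5))
    return ranked[:batch_size]


# Hypothetical scores for five unlabeled pages.
page_scores = [0.95, 0.52, 0.10, 0.45, 0.70]
to_label = select_most_uncertain(page_scores, batch_size=2)
```

Labeling only the selected pages and retraining is what lets an approach of this kind cut editorial effort: confidently scored pages are never shown to an editor.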

The Paths More Taken: Matching DOM Trees to Search Logs for Accurate Webpage Clustering

Deepayan Chakrabarti, Rupesh Mehta

An unsupervised clustering of the webpages on a website is a primary requirement for most wrapper induction and automated data extraction methods. Since page content can vary drastically across pages of one cluster (e.g., all product pages on amazon.com), traditional clustering methods typically use some distance function between the DOM trees representing a pair of webpages. However, without knowing which portions of the DOM tree are “important,” such distance functions might discriminate between similar pages based on trivial features (e.g., differing number of reviews on two product pages), or club together distinct types of pages based on superficial features present in the DOM trees of both (e.g., matching footer/copyright), leading to poor clustering performance. We propose using search logs to automatically find paths in the DOM trees that mark out important portions of pages, e.g., the product title in a product page. Such paths are identified via a global analysis of the entire website, whereby search data for popular pages can be used to infer good paths even for other pages that receive little or no search traffic. The webpages on the website are then clustered using these “key” paths. Our algorithm only requires information on search queries, and the webpages clicked in response to them; there is no need for human input, and it does not need to be told which portion of a webpage the user found interesting. The resulting clusterings achieve an adjusted Rand score of over 0.9 on half of the websites (a score of 1 indicating a perfect clustering), and 59% better scores on average than competing algorithms. Besides leading to refined clusterings, these key paths can be useful in the wrapper induction process itself, as shown by the high degree of match between the key paths and the manually identified paths used in existing wrappers for these sites (90% average precision).
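
For readers unfamiliar with the evaluation measure, the sketch below computes the standard adjusted Rand index between a reference clustering and a predicted one. It is a minimal illustration of the measure, not the authors' evaluation code, and the page-type labels in the example are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two flat clusterings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have equal length")
    n = len(labels_true)

    # Contingency counts: items per (true cluster, predicted cluster) pair,
    # plus the marginal cluster sizes of each clustering.
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of item pairs placed together in both, in true, and in predicted.
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())
    total_pairs = comb(n, 2)

    # Agreement expected by chance, and the maximum attainable agreement.
    expected = sum_true * sum_pred / total_pairs if total_pairs else 0.0
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        # Degenerate case, e.g. both clusterings use a single cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical page types for four pages; cluster ids are arbitrary names.
reference = ["product", "product", "review", "review"]
same_grouping = adjusted_rand_index(reference, [1, 1, 0, 0])
crossed_grouping = adjusted_rand_index(reference, [0, 1, 0, 1])
```

A score of 1 means the two clusterings group the pages identically, regardless of how the clusters are named; scores near 0 indicate chance-level agreement, and negative scores indicate agreement below chance.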
