WWW2010

Information Extraction

Thursday, 1:30–3:00 PM
Chair: Eugene Agichtein

Relational Duality: Unsupervised Extraction of Semantic Relations between Entities on the Web

Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka, Nguyen Duc

Extracting semantic relations between entities is an important first step in various tasks in Web mining and natural language processing, such as information extraction, relation detection, and social network mining. A relation can be expressed extensionally by stating all the instances of that relation, or intensionally by defining all the paraphrases of that relation. For example, consider the ACQUISITION relation between two companies. An extensional definition of ACQUISITION contains all pairs of companies where one company is acquired by the other (e.g., (YouTube, Google) or (Powerset, Microsoft)). On the other hand, we can intensionally define ACQUISITION as the relation described by lexical patterns such as X is acquired by Y or Y purchased X, where X and Y denote two companies. We utilize this dual representation of semantic relations to propose a novel sequential co-clustering algorithm that can efficiently extract a large number of relations from unlabeled data. We provide an efficient heuristic to find the parameters of the proposed co-clustering algorithm. Using the clusters produced by the algorithm, we train an L1-regularized logistic regression model to identify the representative patterns that describe the relation expressed by each cluster. We evaluate the proposed method on three different tasks: measuring relational similarity between entity pairs, open information extraction (Open IE), and classifying relations in a social network system. Experiments conducted using a benchmark dataset show that the proposed method improves on existing relational similarity measures. Moreover, the proposed method significantly outperforms the current state-of-the-art Open IE systems in both precision and recall. The proposed method correctly classifies 53 relation types in an online social network containing 470,671 nodes and 35,652,475 edges, thereby demonstrating its efficacy in real-world relation detection tasks.
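The two views of a relation described in the abstract above (a set of entity pairs versus a set of lexical patterns) can be illustrated with a small sketch. This is not the authors' co-clustering algorithm; it only indexes the same observations both ways and compares two entity pairs by the patterns they share. The observation list, the placeholder pair (PersonA, CompanyB), the function names, and the use of Jaccard overlap are all assumptions made for illustration.

```python
from collections import defaultdict

# Toy (entity pair, lexical pattern) observations, invented for illustration.
OBSERVATIONS = [
    (("YouTube", "Google"), "X is acquired by Y"),
    (("YouTube", "Google"), "Y purchased X"),
    (("Powerset", "Microsoft"), "X is acquired by Y"),
    (("Powerset", "Microsoft"), "Y purchased X"),
    (("PersonA", "CompanyB"), "X is the CEO of Y"),  # placeholder entities
]


def build_views(observations):
    """Index the same observations by pattern and by entity pair."""
    pairs_by_pattern = defaultdict(set)  # pattern -> entity pairs it connects
    patterns_by_pair = defaultdict(set)  # entity pair -> patterns describing it
    for pair, pattern in observations:
        pairs_by_pattern[pattern].add(pair)
        patterns_by_pair[pair].add(pattern)
    return dict(pairs_by_pattern), dict(patterns_by_pair)


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


pairs_by_pattern, patterns_by_pair = build_views(OBSERVATIONS)

# Pairs that share patterns are candidates for the same relation.
same = jaccard(patterns_by_pair[("YouTube", "Google")],
               patterns_by_pair[("Powerset", "Microsoft")])
different = jaccard(patterns_by_pair[("YouTube", "Google")],
                    patterns_by_pair[("PersonA", "CompanyB")])
```

In this toy data the two acquisition pairs share every pattern, while the placeholder pair shares none with them; grouping pairs and patterns jointly is the intuition behind clustering along both dimensions.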

Automatic Extraction of Clickable Structured Web Contents for Name Entity Queries

Xiaoxin Yin, Wenzhao Tan, Xiao Li, Yi-Chin Tu

Today, the major web search engines answer queries by showing ten result snippets, which users must inspect to identify relevant results. In this paper, we investigate how to extract structured information from the web in order to directly answer queries by showing the contents being searched for. We treat users’ search trails (i.e., post-search browsing behaviors) as implicit labels on the relevance between web contents and user queries. Based on such labels, we use an information extraction approach to build wrappers and extract structured information. An important observation is that many web sites contain pages for name entities of certain categories (e.g., AOL Music contains a page for each musician), and these pages have the same format. This makes it possible to build wrappers from a small number of implicit labels and use them to extract structured information from many web pages for different name entities. We propose STRUCLICK, a fully automated system for extracting structured information for queries containing name entities of certain categories. It can identify important web sites from web search logs, build wrappers from users’ search trails, filter out bad wrappers built from random user clicks, and combine structured information from different web sites for each query. Compared with existing approaches to information extraction, STRUCLICK can assign semantics to extracted data without any human labeling or supervision. We perform comprehensive experiments, which show that STRUCLICK achieves high accuracy and good scalability.

A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition

Utku Irmak, Reiner Kraft

Named entity recognition studies the problem of locating and classifying parts of free text into a set of predefined categories. Although extensive research has focused on the detection of person, location and organization entities, there are many other entities of interest, including phone numbers, dates, times and currencies (to name a few examples). We refer to these types of entities as “semi-structured named entities”, since they usually follow certain syntactic formats according to some conventions, although their structure is typically not well-defined. Regular expression solutions require a significant amount of manual effort, and supervised machine learning approaches rely on large sets of labeled training data. Therefore, these approaches do not scale when we need to support many semi-structured entity types in many languages and regions. In this paper, we study this problem and propose a novel three-level bootstrapping framework for the detection of semi-structured entities. We describe the proposed techniques for phone, date and time entities, and perform extensive evaluations on English, German, Polish, Swedish and Turkish documents. Despite the minimal input from the user, our approach can achieve 95% precision and 84% recall for phone entities, and 94% precision and 81% recall for date and time entities, on average. We also discuss implementation details and report run-time performance results, which show significant improvements over regular-expression-based solutions.
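To make the scaling concern in the abstract above concrete, here is a minimal sketch of the hand-written regular expression baseline it contrasts against, not of the bootstrapping framework itself. The pattern covers a single assumed convention (three digits, three digits, four digits, separated by a hyphen, dot, or space); the pattern, the function name, and the sample numbers are hypothetical.

```python
import re

# Hypothetical pattern for one convention only: NNN-NNN-NNNN,
# with '-', '.', or a space as the separator.
PHONE_ONE_FORMAT = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")


def find_phones(text):
    """Return all substrings matching the single hand-written format."""
    return PHONE_ONE_FORMAT.findall(text)


# Numbers written in the assumed convention are found.
matched = find_phones("Call 555-123-4567 or 555.987.6543 today.")

# A number written in a different convention is missed entirely,
# so every new format or region needs another manually written pattern.
missed = find_phones("Telefon: 030 12345678")
```

Each additional entity type, language, or regional format would need its own pattern of this kind, which is the manual effort the abstract refers to.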
