Logical Structure Based Semantic Relationship Extraction from Semi-Structured Documents

Zhang Kuo

Tsinghua University, Beijing, 100084, China

Wu Gang

Tsinghua University, Beijing, 100084, China

Li JuanZi

Tsinghua University, Beijing, 100084, China

Copyright is held by the World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others.
WWW 2006, May 23-26, 2006, Edinburgh, Scotland.
ACM 1-59593-323-9/06/0005.

ABSTRACT

This paper addresses the problem of extracting semantic relationships from semi-structured documents. Much research has been devoted to semantic information extraction. However, most previous work focuses on detecting 'isolated' pieces of semantic information by means of linguistic analysis or the linkage information in web pages, and little research has been done on extracting semantic relationships from semi-structured documents. We propose a method for semantic relationship extraction that uses the logical information in a semi-structured document (such documents usually carry various kinds of structural information; for example, a semi-structured document may be laid out hierarchically). To the best of our knowledge, extracting semantic relationships by using logical information has not been investigated previously. We propose a probabilistic approach and define the features used in the probabilistic model.

Categories & Subject Descriptors

I.2.4 [Artificial Intelligence]: Knowledge Representation Formalisms and Methods -- Relation systems, Semantic networks

General Terms

Algorithms, Performance, Languages.

Keywords

Semi-structured document, Logical structure, Relationship extraction, Ontology

1. INTRODUCTION

Currently, a few semantic annotation platforms extract information from web pages and annotate it with an ontology. For example, S-CREAM [1] supports semi-automatic annotation of web pages. KIM [2] provides a novel Knowledge and Information Management infrastructure and services for automatic semantic annotation, indexing, and retrieval of documents. WebKB [3] extracts instances of classes and relations based on web page contents and their linkage paths.

However, little of this previous work focuses on detecting semantic relationships. Furthermore, existing semantic annotation systems mainly discover relationships by using web linkage or sentence structure. As a result, only relationships between pages or within a sentence can be extracted.

There also exists a large number of semi-structured documents other than web pages, such as academic papers and enterprise reports. Documents of this kind differ from web pages because they usually do not contain hyperlinks and are authored in a strict logical structure. Therefore, we need a new algorithm that fits the characteristics of such documents.

In this paper, we propose an approach that exploits the logical structure of a document to extract relationships. We first extract text pieces as datatype property values with iASA [4]. We then use logistic regression to compute the probability that two property values belong to the same instance. Finally, we find the relationships between the property values that maximize the loss function.

2. Problem Statement

We first define a knowledge base in our scenario. A knowledge base can be viewed as a 3-tuple (C, P, I), where C denotes a set of concepts, P denotes a set of properties, and I denotes the set of instances of all concepts. Specifically, let c ∈ C denote a concept, p ∈ P denote a property, and i_c denote an instance of concept c, i.e., i_c ∈ I.
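As a rough illustration, the knowledge base 3-tuple could be represented as follows. This is only a sketch: the class names, the mapping from each property to its domain concept, and the hotel values are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Instance:
    """An instance i_c of a concept c, with its property values."""
    concept: str
    values: Dict[str, str] = field(default_factory=dict)  # property name -> value


@dataclass
class KnowledgeBase:
    """The 3-tuple (C, P, I): concepts, properties, and instances."""
    concepts: Set[str] = field(default_factory=set)
    # Assumption: each property is stored with the name of its domain concept.
    properties: Dict[str, str] = field(default_factory=dict)
    instances: List[Instance] = field(default_factory=list)

    def add_instance(self, instance: Instance) -> None:
        # Reject unknown concepts and properties whose domain is another concept.
        if instance.concept not in self.concepts:
            raise ValueError(f"unknown concept: {instance.concept}")
        for prop in instance.values:
            if self.properties.get(prop) != instance.concept:
                raise ValueError(f"{prop} is not a property of {instance.concept}")
        self.instances.append(instance)


# Made-up example data for the hotel scenario.
kb = KnowledgeBase(
    concepts={"Hotel"},
    properties={"name": "Hotel", "address": "Hotel", "phone": "Hotel"},
)
kb.add_instance(Instance("Hotel", {"name": "Example Hotel", "phone": "000-0000"}))
```

Keeping the domain concept next to each property makes it easy to enumerate the property pairs that share a domain concept, which is what the later steps operate on.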

We now illustrate the problem of relationship extraction by an example. Say we have a document snippet about hotel information:

The task is then to associate the pieces of information that belong together (in this example, we need to associate each hotel name with its address and phone number). This is exactly the problem that semantic relationship extraction addresses. The output might be:

3. Our Approach

Our approach is based on two observations:

1. Property values of the same instance are usually in the same relative position in the logical structure. For instance, the hotel name is usually in the parent logical level of the hotel address.

2. Two property values usually appear in a document in a constant order. For instance, the hotel address usually appears before the phone number.

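Definition 1. Two property values l₁ and l₂ are called relevant if and only if there exist a concept c ∈ C, an instance i_c ∈ I, and properties p_m, p_n ∈ P such that p_m(i_c, l₁) and p_n(i_c, l₂) hold, i.e., l₁ is the value of property p_m and l₂ is the value of property p_n of i_c. In other words, l₁ and l₂ are property values of the same instance. We use r_mn(l₁, l₂) to denote the relevant relation.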

Our approach has two main steps. In the first step, we take the property values extracted by iASA and their logical structure information as input, and use logistic regression to predict the probability of r_mn(l₁, l₂) for every pair of property values. In the second step, we use these relevance probabilities to construct the instances by maximizing the loss function defined in Section 3.2. The output is a set of constructed instances similar to those in Figure 2.

3.1 Relevant Probability Estimation

We consider one implementation of our approach. We employ logistic regression [5] for relevance probability estimation. To the best of our knowledge, this has not been investigated previously.

The learning based probability estimation consists of two stages: training and prediction.
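As a minimal sketch of these two stages, the following pure-Python code fits a logistic regression model by batch gradient ascent on the log-likelihood and then predicts a relevance probability. The function names, the two-feature toy data, and the training hyperparameters are illustrative assumptions, not details from the paper.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w: Sequence[float], x: Sequence[float]) -> float:
    """Estimated probability that the pair described by features x is relevant."""
    # w holds one weight per feature followed by the bias term.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1])


def train(samples: Sequence[Tuple[Sequence[float], int]],
          epochs: int = 2000, lr: float = 0.5) -> List[float]:
    """Fit the weights by batch gradient ascent on the log-likelihood."""
    w = [0.0] * (len(samples[0][0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in samples:
            err = y - predict(w, x)
            for j, xj in enumerate(x):
                grad[j] += err * xj
            grad[-1] += err  # gradient of the bias term
        w = [wj + lr * gj / len(samples) for wj, gj in zip(w, grad)]
    return w


# Made-up training pairs: features are [Higher_logic_level, Appear_before],
# label 1 means the two property values belong to the same instance.
toy = [([1, 1], 1), ([1, 0], 1), ([0, 1], 0), ([0, 0], 0)]
weights = train(toy)
p_relevant = predict(weights, [1, 1])
p_irrelevant = predict(weights, [0, 1])
```

In practice a library implementation of logistic regression would normally replace the hand-written training loop; the loop is spelled out here only to keep the example self-contained.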

In training, we fit a regression model for each pair of properties p_m, p_n that have the same domain concept. Table 1 shows the major features used in the regression model.

Table 1: Features used in regression model

Features	Comments
Higher_logic_level	Whether l₁ is in a higher logical level than l₂
Same_logic_level	Whether l₁ is in the same logical level as l₂
Lower_logic_level	Whether l₁ is in a lower logical level than l₂
Appear_before	Whether l₁ appears before l₂
Logical_distance	The distance in the logical structure tree
Same_sentence	Whether l₁ and l₂ are in the same sentence
Same_paragraph	Whether l₁ and l₂ are in the same paragraph

The logical distance between l₁ and l₂ is defined in terms of their closest common ancestor in the logical structure tree, where level(l) denotes the length of the path from l to the root node.
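The features of Table 1 could be computed along the following lines. This is a sketch under stated assumptions: the `Node` and `PropertyValue` classes are hypothetical, a "higher" logical level is taken to mean closer to the root, and the logical distance is assumed to be the usual tree path length, level(l₁) + level(l₂) - 2 · level(closest common ancestor), which is one plausible reading of the description above rather than a formula confirmed by the paper.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(eq=False)
class Node:
    """A node of the logical structure tree (hypothetical representation)."""
    name: str
    parent: Optional["Node"] = None


@dataclass
class PropertyValue:
    """An extracted property value with assumed position metadata."""
    text: str
    node: Node       # tree node that contains the value
    offset: int      # character offset in the document
    sentence: int    # sentence index
    paragraph: int   # paragraph index


def level(node: Node) -> int:
    """Length of the path from node to the root."""
    depth = 0
    while node.parent is not None:
        node, depth = node.parent, depth + 1
    return depth


def closest_common_ancestor(a: Node, b: Node) -> Node:
    seen = set()
    cur: Optional[Node] = a
    while cur is not None:          # collect a and all of its ancestors
        seen.add(cur)
        cur = cur.parent
    cur = b
    while cur is not None and cur not in seen:
        cur = cur.parent            # climb from b until the paths meet
    if cur is None:
        raise ValueError("nodes are not in the same tree")
    return cur


def pair_features(l1: PropertyValue, l2: PropertyValue) -> Dict[str, float]:
    d1, d2 = level(l1.node), level(l2.node)
    ancestor = closest_common_ancestor(l1.node, l2.node)
    return {
        "Higher_logic_level": float(d1 < d2),   # assumption: higher = nearer the root
        "Same_logic_level": float(d1 == d2),
        "Lower_logic_level": float(d1 > d2),
        "Appear_before": float(l1.offset < l2.offset),
        "Logical_distance": float(d1 + d2 - 2 * level(ancestor)),  # assumed formula
        "Same_sentence": float(l1.sentence == l2.sentence),
        "Same_paragraph": float(l1.paragraph == l2.paragraph),
    }


# Made-up example: a hotel name in a section heading, its address in the section body.
root = Node("document")
heading = Node("section heading", parent=root)
body = Node("section body", parent=heading)
name = PropertyValue("Example Hotel", heading, offset=0, sentence=0, paragraph=0)
address = PropertyValue("1 Example Road", body, offset=20, sentence=1, paragraph=1)
features = pair_features(name, address)
```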

3.2 Instance Construction

For each concept c, we associate property values with instances so as to maximize the loss function:

where p(i, l) means the value of property p of instance i is l, and P(r_mn(l_a, l_b)) represents the probability that l_a and l_b are relevant.

Obviously, it is infeasible to enumerate all candidate instance lists {i_c₁, i_c₂, …, i_ck} and select the one that maximizes the loss function. We therefore propose an algorithm to construct the instances:

then attach l to i*, i.e., set l as the value of property p_n of i*, and detach l from the original instance.

Step 3. Compute Loss^(k). If Loss^(k) - Loss^(k-1) falls below a preset threshold, then stop; otherwise repeat Step 2.

The complexity of Step 2 is O(n_l²), where n_l is the number of property values. A property value l may be reattached to instances more than once, because changes in the attachment of other property values may affect l.
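The iterative reattachment idea can be sketched as below. Several parts are assumptions rather than the paper's definitions: every property value starts in its own instance, the objective is the sum of pairwise log-odds log(P / (1 - P)) over values that share an instance, and an instance holds at most one value per property. The function names and the probability table are hypothetical.

```python
import math
from typing import Callable, Dict, List, Set


def log_odds(p: float, eps: float = 1e-9) -> float:
    """log(p / (1 - p)), with p clamped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def construct_instances(props: List[str],
                        prob: Callable[[int, int], float],
                        threshold: float = 1e-6,
                        max_rounds: int = 100) -> List[int]:
    """Return an instance label for each property value (props[i] is its property)."""
    n = len(props)
    assign = list(range(n))  # assumed initial state: one instance per value

    def loss() -> float:
        # Assumed objective: pairwise log-odds summed over co-attached values.
        return sum(log_odds(prob(a, b))
                   for a in range(n) for b in range(a + 1, n)
                   if assign[a] == assign[b])

    previous = loss()
    for _ in range(max_rounds):
        for i in range(n):  # one reattachment pass costs O(n^2) prob lookups
            gains: Dict[int, float] = {}
            blocked: Set[int] = set()
            for j in range(n):
                if j == i:
                    continue
                gains[assign[j]] = gains.get(assign[j], 0.0) + log_odds(prob(i, j))
                if props[j] == props[i]:
                    blocked.add(assign[j])  # instance already has this property
            # Default choice: a fresh, empty instance, which contributes gain 0.
            best_label, best_gain = max(assign) + 1, 0.0
            for label, gain in gains.items():
                if label not in blocked and gain > best_gain:
                    best_label, best_gain = label, gain
            assign[i] = best_label  # attach to i*, detaching from the old instance
        current = loss()
        if current - previous < threshold:
            break
        previous = current
    return assign


# Made-up probabilities for two hotels: values 0 and 2 are names, 1 and 3 addresses.
props = ["name", "address", "name", "address"]
table = {(0, 1): 0.9, (2, 3): 0.9, (0, 3): 0.2, (1, 2): 0.2}


def prob(i: int, j: int) -> float:
    return table.get((min(i, j), max(i, j)), 0.5)


labels = construct_instances(props, prob)  # values 0, 1 share a label, as do 2, 3
```

Under these assumptions each reattachment never lowers the objective, so the loop stops once a full pass yields no improvement above the threshold.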

4. Conclusion

In this paper, we investigated the problem of semantic relationship extraction from semi-structured documents. We gave a definition of the relationship extraction problem and proposed an approach to it based on logistic regression.

REFERENCES

[1] S. Handschuh, S. Staab, and F. Ciravegna. S-CREAM - Semi-automatic CREAtion of Metadata. In Proceedings of EKAW 2002.

[2] B. Popov, A. Kiryakov, A. Kirilov, D. Manov, D. Ognyanoff, and M. Goranov. KIM - Semantic Annotation Platform. International Semantic Web Conference 2003: 834-849.

[3] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to Construct Knowledge Bases from the World Wide Web. Artificial Intelligence, 118(1-2): 69-113, 2000.

[4] J. Tang, JZ. Li, HJ. Lu, BY. Liang, XT. Huang, and KH. Wang. iASA: Learning to Annotate the Semantic Web. Journal on Data Semantics (4): 110-145, 2005.

[5] D.H. Freeman. Applied Categorical Data Analysis. Dekker, New York, 1987.

Zhang Kuo

Tsinghua University, Beijing, 100084, China

zkuo99@mails.tsinghua.edu.cn

Wu Gang

Tsinghua University, Beijing, 100084, China

wug03@mails.tsinghua.edu.cn

Li JuanZi

Tsinghua University, Beijing, 100084, China

ljz@keg.cs.tsinghua.edu.cn
