Using Graph Matching Techniques to Wrap Data from PDF Documents

Tamir Hassan

Vienna University of Technology
Vienna, Austria

Robert Baumgartner

Vienna University of Technology
Vienna, Austria

ABSTRACT

Wrapping is the process of navigating a data source, semi-automatically extracting data and transforming it into a form suitable for data processing applications. There are currently a number of established products on the market for wrapping data from web pages. One such product is Lixto [1], which resulted from research performed at our institute.

Our work is concerned with extending the wrapping functionality of Lixto to PDF documents. As the PDF format is relatively unstructured, this is a challenging task. We have developed a method to segment the page into blocks, which are represented as nodes in a relational graph. This paper describes our current research in the use of relational matching techniques on this graph to locate wrapping instances.

Categories & Subject Descriptors

I.7.5 [Document and Text Processing]: Document Capture--document analysis; H.3.3 [Information Systems]: Information Search and Retrieval

General Terms

Algorithms, Experimentation

Keywords

Wrapping, PDF, Document understanding, Logical structure, Graph matching

1 Introduction

In HTML, unlike in the relatively unstructured PDF format, the structure of the code corresponds to some extent to the logical structure of the document. This has led to the development of a number of tools that use this structure to locate data items. One such product is the Lixto Visual Wrapper, which allows the user to interactively select data items from a visual rendition of the web page. The system then generates a wrapping program that automatically extracts this data from similarly structured sources, or from sources whose content changes over time.

2 Our Approach

We therefore developed our own HTML conversion process, which attempts to represent the logical structure of the PDF in the resultant HTML code. This now gives us limited wrapping functionality for many documents, although it depends heavily on the accuracy of the document understanding process, which is an inherently imprecise task. There are many complex documents, such as the example in Fig. 1, a real use case of quality management data from the automotive domain.² Such documents cannot be fully understood without additional input from the user.

We have identified three main data structures within a PDF document that could be used to locate instances of data to be wrapped: its content, its logical structure and its geometric structure.

Whilst our HTML conversion allows us to use the content and logical structure to identify wrapping instances, it does not give us direct access to the document's geometric structure. The graph matching method described in this paper allows us to use a combination of all three structures, essentially shifting some of the burden of the document understanding process to the user. We expect this to compensate for the inherent inaccuracies and limitations of document understanding.
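
To illustrate how the three structures might be combined in a single query, the following is a minimal sketch and not the method implemented in our system. The `Block` class, the role labels "heading" and "data", the alignment tolerance, the top-left coordinate origin and the sample values are all assumptions made for this example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    text: str    # content
    role: str    # logical structure label (assumed: "heading" or "data")
    x1: float    # geometric structure: bounding box, origin assumed top-left
    y1: float
    x2: float
    y2: float


def directly_below(upper, lower, tolerance=5.0):
    """True if `lower` starts below `upper` and is roughly left-aligned with it."""
    return lower.y1 >= upper.y2 and abs(lower.x1 - upper.x1) <= tolerance


def find_instances(blocks, heading_pattern, value_pattern):
    """Pair each matching heading with the data blocks aligned beneath it."""
    results = []
    for h in blocks:
        if h.role != "heading" or not re.search(heading_pattern, h.text):
            continue
        for d in blocks:
            if (d.role == "data" and directly_below(h, d)
                    and re.fullmatch(value_pattern, d.text)):
                results.append((h.text, d.text))
    return results


# Invented sample blocks, for illustration only.
blocks = [
    Block("Part No.", "heading", 50, 100, 120, 112),
    Block("4711", "data", 50, 116, 90, 128),
    Block("Supplier", "heading", 200, 100, 260, 112),
    Block("ACME", "data", 200, 116, 240, 128),
    Block("4712", "data", 300, 116, 340, 128),  # not aligned with any heading
]

print(find_instances(blocks, r"Part", r"\d+"))
```

The query uses content (the regular expressions), logical structure (the role labels) and geometric structure (the alignment test) together, so the unaligned block "4712" is rejected even though its content matches.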

3 Implementation

3.1 Obtaining PDF data

3.2 Page segmentation

3.3 Graph representation

As each of the nodes has a set of co-ordinates, this representation maps easily onto the visual domain, where the user can interactively select an example wrapping instance, and its corresponding sub-graph is found automatically.
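
The mapping from a user's selection to a sub-graph can be sketched as follows. This is a simplified illustration under stated assumptions: the `Node` class, the labelled edge triples, the relation names "right" and "below", and the rule that a node is selected only when its bounding box lies entirely inside the selection rectangle are hypothetical choices, not details of our implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    id: int
    x1: float
    y1: float
    x2: float
    y2: float


def inside(node, rect):
    """True if the node's bounding box lies entirely within rect."""
    rx1, ry1, rx2, ry2 = rect
    return rx1 <= node.x1 and ry1 <= node.y1 and node.x2 <= rx2 and node.y2 <= ry2


def select_subgraph(nodes, edges, rect):
    """Return the ids of the selected nodes and the edges induced by them."""
    chosen = {n.id for n in nodes if inside(n, rect)}
    sub_edges = [(a, b, rel) for (a, b, rel) in edges if a in chosen and b in chosen]
    return chosen, sub_edges


# Invented sample graph: four blocks arranged in a 2 x 2 grid.
nodes = [
    Node(1, 50, 100, 120, 112),
    Node(2, 200, 100, 260, 112),
    Node(3, 50, 116, 90, 128),
    Node(4, 200, 116, 240, 128),
]
edges = [(1, 2, "right"), (1, 3, "below"), (3, 4, "right"), (2, 4, "below")]

# The user drags a rectangle around the left-hand column.
selected, selected_edges = select_subgraph(nodes, edges, (40, 90, 130, 140))
print(selected, selected_edges)
```

Only edges with both endpoints inside the selection are kept, so the result is the induced sub-graph of the selected blocks.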

3.4 Similarity measures

The familiar notion of edit cost can be used to define the similarity of two sub-graphs. Allowed operations would include not just additions and deletions of single nodes or edges, but also additions and deletions of complete rows of elements. For example, a certain paragraph may be one line longer, or a certain table might have an extra row added, yet the logical structure in relation to shape would remain the same. We are thus finding wrapping instances using both logical and visual similarity.
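
The effect of a cheap row operation can be shown with a minimal sketch. It substitutes a plain edit distance over sequences of rows for a full graph edit distance, and it assumes that each row is summarised by a tuple describing its cells; the cost values 0.1 and 1.0 are arbitrary choices for the example.

```python
def gap_cost(rows, i, repeat_cost=0.1, full_cost=1.0):
    """Cost of inserting or deleting rows[i].

    A row that repeats the shape of a neighbouring row (an extra table row,
    an extra line of a paragraph) is cheap; any other row costs full_cost.
    """
    same_as_prev = i > 0 and rows[i] == rows[i - 1]
    same_as_next = i + 1 < len(rows) and rows[i] == rows[i + 1]
    return repeat_cost if (same_as_prev or same_as_next) else full_cost


def edit_cost(a, b):
    """Edit distance between two row sequences using the gap costs above."""
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + gap_cost(a, i - 1)
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + gap_cost(b, j - 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = d[i - 1][j - 1] + (0.0 if a[i - 1] == b[j - 1] else 1.0)
            delete = d[i - 1][j] + gap_cost(a, i - 1)
            insert = d[i][j - 1] + gap_cost(b, j - 1)
            d[i][j] = min(substitute, delete, insert)
    return d[n][m]


# Invented row signatures: "h" = heading cell, "t" = text cell, "n" = number cell.
header = ("h", "h", "h")
row = ("t", "n", "n")

table = [header, row, row]
longer_table = [header, row, row, row]            # one extra row of the same shape
changed_table = [header, row, row, ("t", "t")]    # a row of a different shape

print(edit_cost(table, longer_table))
print(edit_cost(table, changed_table))
```

With these costs, the table with one more row of the same shape stays close to the example (cost 0.1), whereas appending a row of a different shape costs a full edit (1.0).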

This method could be further extended to discriminate between headings and data. The logical relations present in the graph enable us to determine, with some degree of certainty, which blocks contain headings and which blocks contain just "data" (plain body text). Any "edits" that affect heading elements would therefore correspond to a change in logical structure, and would carry a higher edit cost than the equivalent operation on body text alone.
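
A role-dependent edit cost of this kind might look as follows. This is a sketch only: the weights (3.0 for headings, 1.0 for data), the representation of each block as a (role, kind) pair, and the choice of charging a substitution at the heavier of the two roles are assumptions made for the example.

```python
from functools import lru_cache

# Assumed weights: an edit that touches a heading costs three times as much.
ROLE_WEIGHT = {"heading": 3.0, "data": 1.0}


def weighted_edit_cost(a, b):
    """Edit distance between two sequences of (role, kind) pairs."""
    a, b = tuple(a), tuple(b)

    @lru_cache(maxsize=None)
    def cost(i, j):
        if i == len(a) and j == len(b):
            return 0.0
        options = []
        if i < len(a):   # delete a[i]
            options.append(ROLE_WEIGHT[a[i][0]] + cost(i + 1, j))
        if j < len(b):   # insert b[j]
            options.append(ROLE_WEIGHT[b[j][0]] + cost(i, j + 1))
        if i < len(a) and j < len(b):   # keep or substitute
            if a[i] == b[j]:
                change = 0.0
            else:
                change = max(ROLE_WEIGHT[a[i][0]], ROLE_WEIGHT[b[j][0]])
            options.append(change + cost(i + 1, j + 1))
        return min(options)

    return cost(0, 0)


# Invented example sequences.
template = [("heading", "title"), ("data", "para"), ("data", "para")]
extra_text = template + [("data", "para")]
extra_heading = [("heading", "title")] + template

print(weighted_edit_cost(template, extra_text))
print(weighted_edit_cost(template, extra_heading))
```

An additional paragraph of body text costs 1.0, whereas an additional heading costs 3.0, so a candidate whose logical structure differs from the example is ranked as less similar.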

3.5 The matching process

REFERENCES

[1] R. Baumgartner, S. Flesca, and G. Gottlob.
Visual web information extraction with Lixto.
In The VLDB Journal, pages 119-128, 2001.

[2] W. J. Christmas, J. Kittler, and M. Petrou.
Structural matching in computer vision using probabilistic relaxation.
IEEE Trans. on Pattern Anal. and Mach. Intell., 17(8):749-764, Aug. 1995.

[3] J. Llados, E. Marti, and J. J. Villanueva.
Symbol recognition by error-tolerant subgraph matching between region adjacency graphs.
IEEE Trans. on Pattern Anal. and Mach. Intell., 23(10):1137-1143, Oct. 2001.

Tamir Hassan: hassan@dbai.tuwien.ac.at
Robert Baumgartner: baumgart@dbai.tuwien.ac.at
