In HTML, on the other hand, the structure of the code somewhat corresponds to the logical structure of the document. This has led to the development of a number of tools that use this structure to locate data items. One such product is the Lixto Visual Wrapper, which allows the user to interactively select data items from a visual rendition of the web page. The system then generates a wrapping program to automatically extract this data from similarly structured sources, or from sources whose content changes over time.
We therefore developed our own HTML conversion process, which attempts to represent the logical structure of the PDF in the resultant HTML code. This now gives us limited wrapping functionality in many documents, although this is heavily dependent on the accuracy of the document understanding process, which is inherently an imprecise task. There are many complex documents, such as the example in Fig. 1, a real use-case example of quality management data from the automotive domain.2 Such documents can not be fully understood without additional input from the user.
Figure 1: A sample page of data to be wrapped. Brackets indicate the individual wrapping instances.
We have identified three main data structures within a PDF document that could be used to locate instances of data to be wrapped:
Whilst our HTML conversion allows us to use the content and logical structure to identify wrapping instances, it does not give us direct access to the document's geometric structure. The graph matching method described in this paper allows us to use a combination of all three structures, essentially shifting some of the burden of the document understanding process to the user. We expect this to compensate for the inherent inaccuracies and limitations of document understanding.
Figure 2: Sub-graph for one record (wrapping instance) from Fig. 1. Note that edges with arrows represent superior-to-inferior relationships.
As each of the nodes has a set of co-ordinates, this representation maps easily onto the visual domain, where the user can interactively select an example wrapping instance, and its corresponding sub-graph is found automatically.
The familiar notion of edit cost can be used to define the similarity of two sub-graphs. Allowed operations would include not just additions and deletions of single nodes or edges, but additions and deletions of complete rows of elements. For example, a certain paragraph may be one line longer or a certain table might have an extra row added. Yet, the logical structure with relation to shape would remain the same. Thus we are finding wrapping instances using both logical and visual similarity.
Furthermore, this method could be further extended to discriminate between headings and data. The logical relations present in the graph enable us to determine, with some degree of certainty, which blocks contain headings and which blocks contain just ``data'' (plain body text). Any ``edits'' that affect heading elements would therefore correspond to a change in logical structure, and this would carry a higher edit cost than the equivalent operation to only body text.
[1] R. Baumgartner, S. Flesca, and G. Gottlob.
Visual web information extraction with Lixto.
In The VLDB Journal, pages 119-128, 2001.
[2] W. J. Christmas, J. Kittler, and M. Petrou.
Structural matching in computer vision using probabilistic relaxation.
IEEE Tran. on Pattern Anal. and Mach. Intel.,
17(8):749-764, Aug. 1995.
[3] J. Llados, E. Marti, and J. J. Villanueva.
Symbol recognition by error-tolerant subgraph matching between
region adjacency graphs.
IEEE Tran. on Pattern Anal. and Mach. Intel.,
23(10):1137-1143, Oct. 2001.