Generation of the original hypertext base

2.1 Generation of the original hypertext base

The conversion of paper documents into hypertext documents can, of course, be done manually. However, this approach is not feasible for larger documents or document collections, so we propose converting paper documents automatically. This paper describes only the main architecture of the transformation process; more details can be found in [9][8].

The process of automatic document conversion consists of three steps: document preprocessing, document segmentation, and detection of hypertext links (see the upper half of fig. 1). Document preprocessing mainly consists of scanning the paper document and processing it with a commercial OCR package (SCANWORX). The preprocessing phase yields image representations and enhanced ASCII representations of the original document. An enhanced ASCII representation delivers additional information, e.g. about fonts, along with the pure sequence of characters. The reliability of this output is currently not perfect. Possible errors therefore have to be taken into account both in the subsequent processing steps and in the access mechanisms provided.

Document segmentation builds on the results of the preprocessing. It determines and classifies the different logical segments of the document, e.g. headings, paragraphs, figures, and captions. This is equivalent to the general problem of document understanding: mapping the geometric structure to the logical structure, as described e.g. in [11]. In hypertext terms, segmentation can be interpreted as the detection and classification of hypertext nodes. It thus forms the internal hypertext nodes, which are mainly used for link generation (in contrast to the representational nodes, i.e. the images).
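To make the segmentation step more concrete, the following minimal sketch classifies OCR text blocks into logical segments using font information of the kind an enhanced ASCII representation might carry. The `TextBlock` fields, the `classify` function, and the threshold rules are assumptions made for illustration only; they are not the classification method of the system described here.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """One OCR text block plus font information (hypothetical layout)."""
    page: int
    text: str
    font_size: float
    bold: bool = False


def classify(block: TextBlock, body_size: float = 10.0) -> str:
    """Assign a logical label using purely illustrative rules."""
    text = block.text.strip()
    # Captions are recognized by their leading keyword.
    if text.lower().startswith(("figure ", "fig. ", "table ")):
        return "caption"
    # Larger type, or short bold lines, are treated as headings.
    if block.font_size > body_size or (block.bold and len(text) < 80):
        return "heading"
    return "paragraph"


# Hypothetical sample blocks, as they might arrive from preprocessing.
blocks = [
    TextBlock(1, "2.1 Generation of the original hypertext base", 14.0, bold=True),
    TextBlock(1, "The conversion of paper documents can be done manually.", 10.0),
    TextBlock(1, "Figure 1: Conversion and storage of paper documents", 9.0),
]
labels = [classify(b) for b in blocks]
print(labels)
```

Because OCR output is not fully reliable, a real classifier would have to tolerate misrecognized characters and font attributes; the fixed rules above ignore this.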

The detection of links concludes the document conversion. Our system detects four categories of links:

Page links: rebuild the original sequence of pages by connecting each page with the directly preceding and succeeding pages.
Hierarchical links: reflect the hierarchical relations of logical segments, e.g. several subsections are part of a section.
Syntactical links: are embedded in the text and can be found by looking for key phrases such as ``see section 3''.
Similarity links: connect segments that are related to each other with regard to their topics. They are generated by statistical evaluation of the document concerned [10].
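The four link categories can be illustrated with a small sketch. The node store, all function names, and the key-phrase pattern are hypothetical. In particular, the plain term-frequency cosine used for similarity links is only a stand-in for the statistical evaluation of [10], whose details are not given here.

```python
import math
import re
from collections import Counter

# Hypothetical node store: id -> (page, section number, text).
nodes = {
    "n1": (1, "2", "Document preparation is described here, see section 3 for details."),
    "n2": (1, "2.1", "Scanning and OCR produce an enhanced ASCII representation."),
    "n3": (2, "3", "Access mechanisms rely on OCR output of the scanned document."),
}


def page_links(pages):
    """Connect each page with its direct successor."""
    ordered = sorted(set(pages))
    return list(zip(ordered, ordered[1:]))


def hierarchical_links(nodes):
    """Link a section to its subsections, e.g. '2' contains '2.1'."""
    by_number = {num: nid for nid, (_, num, _) in nodes.items()}
    links = []
    for num, nid in by_number.items():
        parent = num.rpartition(".")[0]  # '' for top-level sections
        if parent in by_number:
            links.append((by_number[parent], nid))
    return links


# Illustrative key phrase; a real system would need several patterns.
SEE_SECTION = re.compile(r"\bsee\s+section\s+(\d+(?:\.\d+)*)", re.IGNORECASE)


def syntactical_links(nodes):
    """Find explicit cross-references such as 'see section 3'."""
    by_number = {num: nid for nid, (_, num, _) in nodes.items()}
    links = []
    for nid, (_, _, text) in nodes.items():
        for target in SEE_SECTION.findall(text):
            if target in by_number:
                links.append((nid, by_number[target]))
    return links


def cosine(a, b):
    """Term-frequency cosine similarity of two texts (stand-in measure)."""
    ta = Counter(re.findall(r"[a-z]+", a.lower()))
    tb = Counter(re.findall(r"[a-z]+", b.lower()))
    dot = sum(ta[t] * tb[t] for t in ta)
    norm = math.sqrt(sum(v * v for v in ta.values())) * \
        math.sqrt(sum(v * v for v in tb.values()))
    return dot / norm if norm else 0.0


print(page_links(page for page, _, _ in nodes.values()))
print(hierarchical_links(nodes))
print(syntactical_links(nodes))
print(cosine(nodes["n2"][2], nodes["n3"][2]))
```

A similarity link would then be created for each pair of segments whose score exceeds some threshold. Choosing that threshold, and filtering out common words, is left open in this sketch.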

The results of the transformation process are stored in a database (see fig. 1). Using our system HYPERFACS, a user can then navigate through the document space by querying this database directly or indirectly.
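As a rough illustration of storing nodes and links in a database, the sketch below uses an in-memory SQLite database. The schema, the table and column names, and the reading of "direct" querying as a content search and "indirect" querying as link following are all assumptions; the actual database design of HYPERFACS is not described in this section.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical schema: one table for nodes, one for typed links.
con.executescript("""
CREATE TABLE node (id TEXT PRIMARY KEY, kind TEXT, page INTEGER, text TEXT);
CREATE TABLE link (source TEXT, target TEXT, category TEXT);
""")
con.executemany(
    "INSERT INTO node VALUES (?, ?, ?, ?)",
    [
        ("n1", "heading", 1, "Document preparation, see section 3"),
        ("n2", "paragraph", 1, "Scanning and OCR output"),
        ("n3", "heading", 2, "Access mechanisms"),
    ],
)
con.executemany(
    "INSERT INTO link VALUES (?, ?, ?)",
    [
        ("n1", "n2", "hierarchical"),
        ("n1", "n3", "syntactical"),
    ],
)

# Assumed "direct" access: search the node contents.
direct = con.execute(
    "SELECT id FROM node WHERE text LIKE ? ORDER BY id", ("%OCR%",)
).fetchall()

# Assumed "indirect" access: follow the links leaving a given node.
followed = con.execute(
    "SELECT target, category FROM link WHERE source = ? ORDER BY target",
    ("n1",),
).fetchall()

print(direct)
print(followed)
```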

Figure 1: Conversion and storage of paper documents in HYPERFACS
