Paper documents can, of course, be converted into hypertext documents manually. However, because this approach is not feasible for larger documents or document collections, we propose converting paper documents automatically. This paper describes only the main architecture of the transformation process; more details can be found in [9][8].
The process of automatic document conversion consists of three steps: document preprocessing, document segmentation, and detection of hypertext links (see upper half of fig. 1). Document preprocessing mainly consists of scanning the paper document and processing it with commercial OCR software (SCANWORX). The preprocessing phase yields image representations and enhanced ASCII representations of the original document. An enhanced ASCII representation delivers additional information, e.g. on fonts, alongside the pure sequence of characters. The reliability of this output is not yet perfect. Possible errors therefore have to be taken into account in the subsequent processing steps as well as in the access mechanisms provided.
Document segmentation builds on the results of preprocessing. It determines and classifies the logical segments of the document, e.g. headings, paragraphs, figures, and captions. This is equivalent to the general problem of document understanding: mapping the geometric structure to the logical structure, as described e.g. in [11]. In hypertext terms, segmentation can be interpreted as the detection and classification of hypertext nodes. It thus forms the internal hypertext nodes, which are mainly used for link generation (in contrast to the representational nodes, i.e. the images).
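To make the mapping from OCR output to logical segments more concrete, the following minimal sketch classifies lines of an enhanced ASCII representation into node types. It is an illustration only and not the segmentation procedure of our system: the `Line` and `Node` types, the font-size threshold, and the simple rules in `classify` are all assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """One OCR line with hypothetical font attributes (enhanced ASCII)."""
    text: str
    font_size: float
    bold: bool


@dataclass
class Node:
    """A logical segment, i.e. a candidate internal hypertext node."""
    kind: str  # "heading", "caption", or "paragraph"
    text: str


def classify(line: Line, body_size: float = 10.0) -> str:
    """Assign a logical label from font information (assumed heuristic)."""
    if line.text.lower().startswith(("figure", "fig.")):
        return "caption"
    if line.bold or line.font_size > body_size:
        return "heading"
    return "paragraph"


def segment(lines: List[Line]) -> List[Node]:
    """Group classified lines into nodes, merging adjacent paragraph lines."""
    nodes: List[Node] = []
    for line in lines:
        kind = classify(line)
        if kind == "paragraph" and nodes and nodes[-1].kind == "paragraph":
            nodes[-1].text += " " + line.text
        else:
            nodes.append(Node(kind, line.text))
    return nodes
```

Because OCR output is not fully reliable, a realistic classifier would have to tolerate misrecognized characters and font attributes; the rules above ignore this for brevity.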
The detection of links concludes the document conversion. Our system detects four categories of links:
The results of the transformation process are stored in a database (see fig. 1). Using our system HYPERFACS, a user can then navigate through the document space by querying this database directly or indirectly.
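As an illustration of how nodes and links might be stored and queried, the sketch below uses an in-memory SQLite database. The schema, the function names, and the link category label are hypothetical and chosen only for this example; they do not describe the actual HYPERFACS database.

```python
import sqlite3
from typing import List, Tuple


def build_store() -> sqlite3.Connection:
    """Create an in-memory database with an assumed node/link schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE node (
            id   INTEGER PRIMARY KEY,
            kind TEXT NOT NULL,
            text TEXT NOT NULL
        );
        CREATE TABLE link (
            source   INTEGER NOT NULL REFERENCES node(id),
            target   INTEGER NOT NULL REFERENCES node(id),
            category TEXT NOT NULL
        );
        """
    )
    return conn


def add_node(conn: sqlite3.Connection, kind: str, text: str) -> int:
    """Insert a node and return its generated id."""
    cur = conn.execute(
        "INSERT INTO node (kind, text) VALUES (?, ?)", (kind, text)
    )
    return cur.lastrowid


def add_link(conn: sqlite3.Connection, source: int, target: int,
             category: str) -> None:
    """Insert a directed link between two nodes."""
    conn.execute(
        "INSERT INTO link (source, target, category) VALUES (?, ?, ?)",
        (source, target, category),
    )


def search_nodes(conn: sqlite3.Connection, term: str) -> List[Tuple[int, str]]:
    """Query node text for a search term (substring match)."""
    rows = conn.execute(
        "SELECT id, kind FROM node WHERE text LIKE ? ORDER BY id",
        ("%" + term + "%",),
    )
    return rows.fetchall()


def linked_nodes(conn: sqlite3.Connection,
                 source: int) -> List[Tuple[str, str, str]]:
    """Follow the links that start at a given node."""
    rows = conn.execute(
        "SELECT n.kind, n.text, l.category "
        "FROM link AS l JOIN node AS n ON n.id = l.target "
        "WHERE l.source = ? ORDER BY n.id",
        (source,),
    )
    return rows.fetchall()
```

In this sketch, `search_nodes` stands for a query on node content and `linked_nodes` for navigation along stored links.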
Figure 1: Conversion and storage of paper documents in HYPERFACS