Full text searches

Next: Hypertext Links Up: Document access Previous: Direct access

3.2 Full text searches

Searching documents within a computer environment in most cases includes the ability to start full text searches. There, full text searches replace and, to some degree, extend the index contained in most printed technical documents. In order to support a basic form of full text search in image bases, any kind of OCR software will work. However, to implement further mechanisms like highlighting either requires the construction of own software tools or the use of special OCR packages that return the coordinates of words in addition to pure character information. In order to take into consideration possible errors of the OCR package, too, we have implemented two different tools: one tool for exact full text searches and one for fuzzy full text searches.

The script for exact full text searches accepts two parameters: a query string and a database. It takes the string as specified by the user and tries to match it with the content of the full text database. Thereby, precise matches are requested, though a precise match within the database does not necessarily guarantee a precise match within the original document, again due to possible OCR errors. In our prototype system, the queried database is managed by a ``real'' relational database system which uses B* trees for its index. Thus, a quick evaluation of queries can be achieved. However, little effort would be necessary in order to exchange the underlying database management system or even replace it with a simple file access tool. Thus, the portability of the approach can be preserved.

real char. recogn.     real char. recogn.     real char. recogn.     real char. recogn.
---------------------------------------------------------------------------------------
1          l           O          0           e          c           .          ,
i          l           0          O           c          e           f          t
1          I           0          o           nothing    ,           t          c
1          i           e          o           nothing    .           y          v
i          t           o          c           nothing    '           ,          .
i          I           a          o           .          nothing     M          N
I          l           s          a           space      nothing     5          S
l          I           a          s           a          nothing     g          q
l          1           a          e           -          nothing     M          H
t          l           e          a           .          space       h          b
---------------------------------------------------------------------------------------

Table 1: Part of the applied confusion matrix

The emphasis of an exact full text search lies on precision, whereas the emphasis of fuzzy full text searches lies on recall. Fuzzy full text searches take into consideration possible or probable OCR errors. This can be done by means of enhancing the user's original search string using a confusion matrix for characters or character patterns (see table 1). Thus, a plain string is transformed into a larger regular expression, e. g. the string ``Abbildung'' (German word for ``figure'') is transformed into ``A[ \.',]?[bh][ \.',]?[bh][ \.',]?[iltI]?[ \.',]?[l1I][ \.',]?d[ \.',]?[un][ \.',]?(ii|ri|[num])[ \.',]?[gq]'' (PERL syntax). Matching the full text data with this enhanced string, most of the occurrences of the query string within the original document can be found: e. g. occurrences of ``Abbildung'' that are split into several words (see fig. 6).

However, a pure implementation of this approach is of little use, because precision will not be acceptable any more. E. g. if ``Seite'' (in English ``page'') is selected as a query term, it is transformed into ``[Ss][ \.',]?[ecao]?[ \.',]?[iltI]?[ \.',]?[tcl][ \.',]?[ecao]?''. According to the applied confusion matrix, ``e'' and ``i'' may be omitted. Therefore, also patterns like ``sc'' or ``sl'' would be considered as matches. To improve precision for the fuzzy full text search, several approaches can be thought of, e. g.:

Because many of the problems with regard to precision derive from wrong insertions or deletions of characters, the computing of the difference between the length of the query string and the length of the retrieved string may be of help. This difference is easy and fast to compute and helps to eliminate some of the wrong hits. However, tests have shown that the elimination rate is not high enough in order to achieve acceptable precision.
The Levenshtein measure is used in order to compute the distance between the query string and the retrieved pattern [5]. Thereby, the number of insertions, deletions, and substitutions of characters are counted that are necessary in order to transform one string into the other. In order to measure the distance between a regular expression and a pattern, the agrep tool may be used [12]. This solution is currently implemented.
The distance between the query string and a candidate pattern may be determined as the minimum number of errors that has to be made in order to transform the query string into the candidate pattern (according to the confusion matrix). This solution will be used in the future.

The results of a full text query may be represented in two ways. The ordinary representation lists all the pages that contain the search string (fig. 5). The user may then either select a plain representation of a page that contains a hit (``normal'') or he may select the ``highlighting'' mode, thus triggering the inverted display of search results. In the latter case, the URI of a search hit contains the coordinates of the matching string. This coordinates are passed to the server script that manages the display of nodes by means of the QUERY_STRING environment variable. Then, a temporary image file is created which is sent back to the WWW client together with the rest of the node information.

Especially with regard to fuzzy full text searches a different representation of search results is sensible: there, the user may be interested in the actual string that was retrieved (that may be different to the query string). In such cases, the KWIC display supports the user who may now inspect and judge the relevance of a hit within its context (fig. 6).

___________________________________________________

Figure 5: Part of a full text search result (ordinary representation)

___________________________________________________

Figure 6: Part of a full text search result (KWIC representation)

Next: Hypertext Links Up: Document access Previous: Direct access