Full-text indexing of non-textual resources

David Byers

Department of Computer Science, Linköpings Universitet,
S-58183 Linköping, Sweden

davby@ida.liu.se

Abstract
Full-text indexing of resources on the World Wide Web is limited to simple content types, such as HTML and plain text. More complex content types, such as Postscript, PDF and proprietary word-processing formats, are excluded, despite the fact that such documents are usually rich in content. The reason for excluding these types of resources is simply that it would be too expensive and too difficult to extract a textual representation from them. The operator of a search engine is not motivated to expend the additional resources that would be needed to handle such documents: the gain would be fairly small, and search engines are extremely popular even when they are limited to HTML and plain text documents. The situation is quite different from the point of view of the content provider. A site may have significant amounts of its content in non-textual documents, and the content provider may nevertheless want those documents indexed by normal search engines. In this paper we present several server-side solutions that allow existing indexing software to index a textual representation of non-textual resources.

Keywords
Searching; Indexing; Content negotiation

1. Introduction

One of the most popular ways of locating information on the World Wide Web is through the use of full-text indices, such as WebCrawler, Lycos, OpenText, Excite and AltaVista. Users can query the services through an interface on the World Wide Web, and retrieve a list of documents matching a specified set of criteria. The traffic to such services attests to their popularity.

The same kind of indexing and searching tools are also useful when deployed on single sites and on intranets. An organization using World Wide Web technology for disseminating information might want to make the contents of the internal network searchable; or a university publishing reports or articles on-line may want to provide full-text searching of their documents through standard tools and services. As long as all documents are in simple formats, such as plain text or HTML, normal indexing tools and search engines are adequate, but if the documents are in some other format, such as Postscript, PDF or a proprietary word-processing format, normal indexing tools are inadequate since they are unable to deal with such content types.

Our goal has been to find a method whereby we could enable full-text indexing and searching of the content in compressed Postscript and PDF files in a technical reports archive through standard tools and services. We have examined a number of methods for accomplishing this goal. We also present two methods that improve performance but require small modifications to both the indexing and server software.

Specifically, our main requirements were that existing indexing tools and search engines should be usable without modification, and that any changes needed should be confined to the server.

We have not been too concerned with the translators used to produce indexable forms of the Postscript and PDF files. There are publicly available tools that are adequate for our purposes.

We recognize that the more general problem of content negotiation has been addressed in various works, including several IETF HTTP Working Group drafts and proposals [5, 6, 7]. The mechanisms that we propose in this paper could easily be replaced by more general content negotiation mechanisms in the future, but those general mechanisms are of little help in the short term. Therefore we need to explore other options.

A system like Harvest [4] can also address the general problem effectively and efficiently. In Harvest the information provider can generate indexing information using a tool called a gatherer. The indexing information is then retrieved by one or more entities that provide query interfaces on the indexed information. Since the provider generates the indexing information, the generation can be adapted however the provider wants: a provider that wants to have Postscript files indexed simply runs a gatherer capable of extracting indexing information from Postscript files.

The main disadvantage with Harvest is that it relies on protocols that are not widely implemented. Although the server providing the information does not need to be aware of its participation in such a system, the clients do. Since one of our goals was to allow existing indexing tools to index non-textual resources, a solution like Harvest is simply not an option.

2. Document indexing

Most tools for automatically indexing documents on the World Wide Web use the same technique for finding resources. The process is initialized with a set of known resources. These are retrieved, any links that appear to point to other indexable resources are scheduled for retrieval, and the process is repeated [10]. The software responsible for traversing the Web and supplying the indexing software with documents to index is often called a robot.
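As a rough illustration of this traversal, the following Python sketch implements the loop just described. Real robots add politeness rules, robots.txt handling and far more robust HTML parsing; all names and the choice of indexable suffixes here are illustrative only.

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

# Suffixes the robot considers indexable; links to anything else are skipped.
INDEXABLE = ('.html', '.htm', '.txt', '/')

class LinkExtractor(HTMLParser):
    """Collect the href attributes of all <a> elements in a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def crawl(start_urls, index_document):
    queue = list(start_urls)              # the initial set of known resources
    seen = set(queue)
    while queue:
        url = queue.pop(0)
        try:
            body = urlopen(url).read().decode('latin-1', 'replace')
        except OSError:
            continue
        index_document(url, body)         # hand the document to the indexer
        parser = LinkExtractor()
        parser.feed(body)
        for link in parser.links:
            target = urljoin(url, link)
            # Only links that appear to point to indexable resources are
            # scheduled; references such as tr-96-11.ps.gz are ignored.
            if target.endswith(INDEXABLE) and target not in seen:
                seen.add(target)
                queue.append(target)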

A small example will help demonstrate the process. Assume that the HTML file reproduced in Fig. 1 has been retrieved by a robot.

<html>
   <body>
     <a href="author.html">About the author</a><br>
     <a href="abstract.html">Abstract</a><br>
     <a href="tr-96-11.ps.gz">Report (gzip'd Postscript)</a><br>
     <a href="tr-96-11.fm.gz">Report (gzip'd FrameMaker)</a><br>
   </body>
</html>
    

Fig. 1. Example of an HTML document with references to non-textual resources.

The document in Fig. 1 contains four references: to author.html, abstract.html, tr-96-11.ps.gz and tr-96-11.fm.gz. The first two of these appear to be references to HTML documents, so the robot will schedule them for retrieval. The third reference appears to be a link to a compressed Postscript file, and the fourth appears to be a link to a compressed FrameMaker file. These two references will probably be ignored by most robots, since they are unable to extract the content of Postscript and FrameMaker files.

The result is that the most interesting information, that in the actual document, is not indexed, and is therefore not searchable.

3. Enabling indexing of non-textual resources

In order to allow standard indexing software to index the content of non-textual resources, changes need to be made to the indexing software, to the server software, or to both. Since modifying the indexing software may be impractical, impossible or simply undesirable, we have concentrated on solutions that require changes only to the server.

3.1. Why most current content negotiation schemes are not adequate

Content negotiation is the process by which a server and user-agent select a variant of a resource for the server to send.

In HTTP/1.0 [2] this process is driven by use of the Accept header in the HTTP request. Using this scheme, clients can rank content types in order of preference, and the server chooses a variant based on this ranking. HTTP/1.1 [3] provides for more flexible types of content negotiation. The proposed standard outlines two types of content negotiation: server-driven content negotiation and agent-driven content negotiation.

Server-driven content negotiation mechanisms have the server selecting the variant to send to the client, based on information provided by the client. The use of Accept headers in HTTP/1.0 is an example of a simple server-driven content negotiation scheme.
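To make the mechanism concrete, the following Python sketch shows a naive server-driven selection based only on the Accept header; the variant table and the quality-value parsing are simplified assumptions, not part of any real server.

# Sketch of naive server-driven variant selection based only on the
# Accept header.  The variant table and the quality-value parsing are
# simplified assumptions made for illustration.

def parse_accept(header):
    """Return a {media_type: quality} mapping from an Accept header value."""
    prefs = {}
    for item in header.split(','):
        parts = [p.strip() for p in item.split(';')]
        quality = 1.0
        for param in parts[1:]:
            if param.startswith('q='):
                quality = float(param[2:])
        prefs[parts[0]] = quality
    return prefs

# Available variants of one resource, e.g. a technical report.
VARIANTS = {
    'application/postscript': 'tr-96-11.ps',
    'text/html':              'tr-96-11.html',
    'text/plain':             'tr-96-11.txt',
}

def select_variant(accept_header):
    prefs = parse_accept(accept_header)
    best, best_quality = None, 0.0
    for media_type, filename in VARIANTS.items():
        quality = prefs.get(media_type, prefs.get('*/*', 0.0))
        if quality > best_quality:
            best, best_quality = filename, quality
    return best

# A browser and an indexing robot may send very similar Accept headers,
# so this mechanism alone cannot tell them apart.
print(select_variant('text/html, text/plain;q=0.8, */*;q=0.1'))   # tr-96-11.html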

Agent-driven content negotiation mechanisms have the client selecting the variant to receive from a list of alternatives provided by the server. Agent-driven content negotiation has several advantages over server-driven content negotiation. An obvious advantage is that the client is guaranteed to receive the best variant available. A disadvantage is that all accesses to resources that have multiple variants will require at least two requests.

Both agent-driven and server-driven content negotiation, as well as combinations of the two, such as transparent content negotiation, are discussed in detail in various Internet drafts [5, 6, 7].

At first glance it would appear that agent-driven content negotiation would be the best solution for our particular problem, since it allows the client to choose the best representation for indexing. However, one of our goals was to enable indexing of non-textual resources using current tools. Agent-driven content negotiation has not yet been standardized, and to the best of our knowledge currently available indexing software does not implement any of the proposals. Furthermore, even if support for agent-driven content negotiation becomes widely implemented, each request for a resource with more than one variant will require at least two HTTP requests, which is undesirable.

Server-driven solutions based on Accept headers are not adequate. Accept headers allow the server to select one of several representations for the same content, but it is more difficult to select variations of the content itself. For example, a robot collecting text to index and an interactive client that can only display textual content will send very similar, perhaps even identical, Accept headers.

3.2. Client-side processing

The most obvious solution to enabling indexing of non-textual resources is to give the indexing software itself the capability to deal with such resources. For example, the indexing software could extract text from Postscript or word-processing files, or find features in image files. This solution allows considerable flexibility in the indexing and searching process. For example, the search engine could be queried for images similar to another image or for audio files containing some specific sound.

Since our goal was to make non-textual resources indexable by normal search engines and run-of-the-mill indexing software, client-side processing is not an option. Furthermore, the required processing is often computationally expensive; given the amount of work the indexing software already has to perform, adding more might not be desirable.

The final reason for not using client-side processing is that it makes the introduction of new content types difficult. With a server-side approach, only the servers that actually provide resources of the new type would have to be changed, whereas a client-side solution would require the indexing software to be changed. Changes to the server are less undesirable since the content provider has an interest in having the content indexed, whereas the operators of the indexing software can probably get away with ignoring unusual content types.

For these reasons we have chosen to concentrate on server-side approaches.

3.3. Munged links

The munged links method was the first method we considered. The idea is to show robots different links from the ones normal users see: robots see links to textual representations of non-textual resources, whereas users see links to the real resources.

When a robot retrieves a document containing references to non-textual resources, those references are replaced by references to files containing a textual representation of the corresponding resource. This causes the robot to retrieve the textual version of the non-textual resource, which is then indexed.

A problem with this approach is that queries on the index will return references to the textual representation, not to the non-textual resource. To compensate for this effect, a request from a non-robot for a textual variant of a non-textual resource needs to be redirected to the non-textual version.

For example, when a robot requests the HTML file in Fig. 1, the server could transform it to the following:

<html> <body> 
  <a href="author.html">About the author</a><br> 
  <a href="abstract.html">Abstract</a><br>
  <a href="tr-96-11.ps.html">Report (gzip'd Postscript)</a><br>
  <a href="tr-96-11.fm.html">Report (gzip'd FrameMaker)</a><br>
</body> </html>
      

Note that the references to the compressed Postscript and FrameMaker files have been replaced by references to HTML files, which the robot will follow. These HTML files are assumed to contain a textual representation of the Postscript and FrameMaker files. When a user requests one of these HTML files, the server issues a redirect to tr-96-11.ps.gz or tr-96-11.fm.gz. The result is that robots will see text files and users will see the non-textual resources.

Following references from an index will take users directly to the non-textual resource, but the index will show a reference to the textual representation of the resource. Although we have not studied it in detail, we feel that this may confuse users. Furthermore, the server needs to either parse all resources as they are being sent or preprocess them in order to replace the links to non-textual resources. Preprocessing in turn requires tool support to ensure that it is done every time a resource is modified. An advantage of the munged links method is that the documents containing the references do not need to be authored any differently; all rewriting is handled by the server.
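The rewriting and the compensating redirect can be sketched in a few lines of Python. The pairing of .ps.gz with .ps.html and .fm.gz with .fm.html, and the plain string replacement, are assumptions made purely for illustration.

# Rough sketch of the munged links method: rewrite links served to
# robots, and redirect users who follow a munged link back to the
# real resource.

MUNGED = {'.ps.gz': '.ps.html', '.fm.gz': '.fm.html'}

def munge_links(html):
    """Rewrite href attributes that point to non-textual resources."""
    for real, textual in MUNGED.items():
        html = html.replace(real + '"', textual + '"')
    return html

def redirect_target(path):
    """If a user requests a munged link, return the real resource."""
    for real, textual in MUNGED.items():
        if path.endswith(textual):
            return path[:-len(textual)] + real
    return None

# A robot requesting the page in Fig. 1 receives the rewritten links:
print(munge_links('<a href="tr-96-11.ps.gz">Report</a>'))    # ...tr-96-11.ps.html...
# A user who follows such a link is redirected to the real resource:
print(redirect_target('/reports/tr-96-11.ps.html'))          # /reports/tr-96-11.ps.gz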

3.4. Conditional redirection

Munged links can be viewed as a variant of conditional redirection. With conditional redirection, references are not made to non-textual resources directly; instead, they are constructed to look like references to text or HTML resources. When such a reference is followed, a redirect is sent as the reply: to a textual representation of the non-textual resource if the request was made by a robot, and to the non-textual resource itself if the request was made by a user.

The main advantage of this method over munged links is that the server does not have to parse everything it sends in order to rewrite links. That advantage is easily outweighed by the disadvantages. One important disadvantage is that documents will not contain references to the resources that users actually see, which rules out the use of several tools for publishing and maintaining information structures on the Web. Aside from these points, this approach shares the properties of the munged links approach.
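A sketch of the dispatch logic, assuming the published links follow a .ps.html-style naming convention and that a pre-generated textual representation exists; the table below is invented for this illustration.

# Sketch of conditional redirection: a published reference such as
# tr-96-11.ps.html is redirected to the textual representation for
# robots and to the real non-textual resource for users.

VARIANTS = {
    # published link        (redirect for robots,  redirect for users)
    '/tr-96-11.ps.html': ('/text/tr-96-11.txt', '/tr-96-11.ps.gz'),
    '/tr-96-11.fm.html': ('/text/tr-96-11.txt', '/tr-96-11.fm.gz'),
}

def conditional_redirect(path, is_robot):
    """Return the Location value for a 302 reply, or None for other paths."""
    if path in VARIANTS:
        robot_target, user_target = VARIANTS[path]
        return robot_target if is_robot else user_target
    return None

print(conditional_redirect('/tr-96-11.ps.html', is_robot=True))    # /text/tr-96-11.txt
print(conditional_redirect('/tr-96-11.ps.html', is_robot=False))   # /tr-96-11.ps.gz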

3.5. Maskerade

Maskerade may be the best of the methods we have considered: it requires only small changes to the server software, places virtually no additional load on the server, and operates nearly transparently. The only problem with the approach is that it does not work with all indexing tools, including some of the tools we needed our solution to work with.

Unlike the previous approaches, links are not changed at all. When a robot requests a non-textual resource, the server simply lies: it indicates that the resource is plain text by sending the appropriate content-type header, and it replaces the non-textual resource with its textual representation.

This method has a number of advantages: the server does not need to preprocess or modify anything; documents do not need to contain any special information; and the resulting index will point directly at the real resources. This solution would be ideal if it worked everywhere, but unfortunately it only works as long as robots follow all links, even those that seem to point to non-textual resources. This is not always the case.
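The server-side behaviour can be sketched as follows, assuming that a textual representation has been generated next to each non-textual file (produced, for instance, with gunzip and ps2ascii); the file layout and helper names are assumptions for this illustration.

import os

# Sketch of the maskerade method: when a robot requests a non-textual
# resource, send a pre-generated textual representation labelled as
# plain text instead of the real file.

NON_TEXTUAL = ('.ps.gz', '.fm.gz', '.pdf')

def textual_sibling(path):
    for suffix in NON_TEXTUAL:
        if path.endswith(suffix):
            return path[:-len(suffix)] + '.txt'
    return None

def serve(path, is_robot):
    """Return (content_type, file_to_send) for one request."""
    sibling = textual_sibling(path)
    if is_robot and sibling and os.path.exists(sibling):
        return ('text/plain', sibling)            # the server "lies" to the robot
    return ('application/octet-stream', path)     # normal delivery for users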

3.6. Modified content

Our final approach, modified content, is the one we have settled on for our application. Our document repository contains a "cover page" for each report. The cover page contains copyright information, information about the document, and related links. With the modified content method, the textual content of the non-textual resource is indexed together with the cover page.

When a robot requests a cover page, the server simply appends the textual representation of the document that the cover page references to the end of the cover page. When a user requests the same cover page, the textual representation is not appended. The result is that the index will point to the cover page, not the non-textual resource itself.

For example, if this paper were stored as tr-96-11.ps.gz, a compressed Postscript file, in a report archive, and there were a cover page containing the HTML in Fig. 1, the following would be sent when a robot requests the cover page:

<html> <body>
  <a href="author.html">About the author</a> <br>
  <a href="abstract.html">Abstract</a> <br>
  <a href="tr-96-11.ps.gz">Report (gzip'd Postscript)</a> <br>
  <a href="tr-96-11.fm.gz">Report (gzip'd FrameMaker)</a>
  <br>
</body> </html>
Full-Text Indexing of Non-Textual Resources David Byers
Department of Computer and Information Science Linköping 
University davby@ida.liu.se Introduction One of the most
popular ways of . . . . . . 
    

This approach is an excellent choice in those cases where every non-textual resource is logically paired with one or more textual resources. In our particular application it is actually preferable to have the index point to the cover page rather than the non-textual resource itself.

When applicable, this method is quite attractive since it places very little extra load on the server; it requires no changes to the indexing software; it generates no redirects; and references from the index will point to existing resources.
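A sketch of the server-side step, assuming each cover page can be mapped to a pre-extracted text file; the mapping below is invented for this illustration.

import os

# Sketch of the modified content method: when a robot requests a cover
# page, the textual representation of the report it references is
# appended before the page is sent; users get the cover page unchanged.

TEXT_FOR_COVER = {
    'tr-96-11.html': 'tr-96-11.txt',     # text extracted from tr-96-11.ps.gz
}

def cover_page_body(path, is_robot):
    with open(path, encoding='latin-1') as page:
        body = page.read()
    text_file = TEXT_FOR_COVER.get(os.path.basename(path))
    if is_robot and text_file and os.path.exists(text_file):
        with open(text_file, encoding='latin-1') as text:
            body += '\n' + text.read()   # robots index cover page plus report text
    return body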

4. Distinguishing between robots and users

All of the methods for indexing non-textual resources presented in this paper require the server to recognize whether a request was sent by a robot or by a user. We have examined four different options: one does not require any changes to the robots; one is useless; one requires minor changes to the robots but is very accurate; and one is still just a rough idea.

4.1. The user-agent HTTP header

The most obvious method for recognizing requests from robots is to examine the user-agent header in the HTTP request. The server has a catalog of how known robots identify themselves, and a request is classified as coming from a robot if the user-agent field of the request is present in the catalog. This method does not require any change to the robots, but new and unknown robots will be misclassified as normal users.

The main advantage of this approach is its simplicity. It requires no change to the robots, and little or no change to the server. The disadvantage is primarily lack of accuracy.

Use of the user-agent field for identifying robots is not very sophisticated and hardly novel, but it performs well enough in practice to be a viable choice. It is also the best method available at this time.
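In Python the catalog lookup amounts to very little; the robot names below are a small illustrative sample rather than our actual catalog.

# Sketch of user-agent based classification.  A real catalog lists many
# more robots and has to be kept up to date as new robots appear.

KNOWN_ROBOTS = ('webcrawler', 'lycos', 'scooter', 'intraseek')

def is_robot_by_user_agent(user_agent):
    ua = (user_agent or '').lower()
    return any(name in ua for name in KNOWN_ROBOTS)

print(is_robot_by_user_agent('Lycos_Spider_(T-Rex)'))      # True
print(is_robot_by_user_agent('Mozilla/4.0 (compatible)'))  # False: unknown agents look like users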

4.2. Using an extension method

The first bit of information any client sends in an HTTP request is the access method. Both robots and real users use the GET method to retrieve documents. If robots were to use a different method, perhaps INDEX, it would be easy for the server to distinguish between robots and users.

Use of a nonstandard access method will fail when communicating with HTTP/1.0 [2] and HTTP/1.1 [3] servers that do not understand it. The robot could respond to such an error by reissuing the request using the GET method.

The major advantage of using an extension method for indexing is that the server can easily and accurately recognize robots. This advantage is easily outweighed by the disadvantages. Performance suffers when communicating with HTTP/1.0 and HTTP/1.1 servers that do not implement the extension method, since each access will require two requests: one using the INDEX method and one using the GET method. Another disadvantage is that this method requires fairly extensive changes to the robots.
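A robot-side sketch of this behaviour, using Python's http.client; INDEX is the hypothetical extension method named above, and the set of status codes treated as "not understood" is an assumption.

import http.client

# Robot-side sketch: try the hypothetical INDEX extension method first,
# and fall back to an ordinary GET if the server rejects it.

def fetch_for_indexing(host, path):
    conn = http.client.HTTPConnection(host)
    conn.request('INDEX', path)                     # extension method
    reply = conn.getresponse()
    if reply.status in (400, 405, 501):
        reply.read()                                # drain the error reply
        conn.close()
        conn = http.client.HTTPConnection(host)     # reissue as a GET
        conn.request('GET', path)
        reply = conn.getresponse()
    return reply.status, reply.read()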

4.3. Using an extension header

An extension header is a nonstandard header in the HTTP request. Extension headers are explicitly supported in HTTP/1.1 and appear to cause no problems with most HTTP/1.0 servers. If robots were to send an extension header, perhaps named X-indexing, with all regular GET requests, the server would have no problem recognizing requests from robots.

Although this method requires changes to the robots, the change is simple and does not cause compatibility problems. It has the same advantages as extension access methods, but none of the disadvantages.

Our preferred solution is to use a combination of extension header fields and classification based on the user-agent field. When the appropriate extension header is present, the server knows that the client making the request is a robot. When the extension header field is not present, the server uses the user-agent field to make an educated guess.
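The combined check is simple to express. The header name matches the X-indexing header discussed above (HTTP header names are case-insensitive), and the catalog is again a small illustrative sample.

# Sketch of the combined classification: an explicit X-indexing extension
# header always wins; otherwise the user-agent catalog is consulted.

KNOWN_ROBOTS = ('webcrawler', 'lycos', 'scooter', 'intraseek')

def request_is_from_robot(headers):
    """headers: a dict of HTTP request header names to values."""
    if any(name.lower() == 'x-indexing' for name in headers):
        return True                                  # the robot identified itself
    user_agent = headers.get('User-Agent', '').lower()
    return any(robot in user_agent for robot in KNOWN_ROBOTS)

# A modified robot simply adds the header to an ordinary GET request:
print(request_is_from_robot({'User-Agent': 'ExampleBot/1.0', 'X-indexing': 'yes'}))   # True
print(request_is_from_robot({'User-Agent': 'Mozilla/4.0 (compatible)'}))              # False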

4.4. Access pattern based classification

We think that it may be possible to solve the classification problem fairly accurately by automatically classifying new user agents based on their patterns of access to documents on the server. Classification could be based on a fixed set of rules, such as the simplistic "if a previously unknown user agent requests robots.txt, then classify it as a robot."

Accurate classification probably requires more sophisticated analysis than such rules. One possibility would be to manually identify patterns in accesses from known robots and formulate more complex rules that recognize similar access patterns from unknown user agents. Another possibility might be to train an artificial neural network to recognize robots based on their patterns of access.
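Purely to illustrate the flavour of such rules, a rule-based classifier might start out as the following sketch; every threshold and heuristic beyond the robots.txt rule is invented and untested.

# Invented sketch of rule-based access-pattern classification; the
# robots.txt rule comes from the text, the remaining heuristics and
# thresholds are made up for illustration.

def looks_like_robot(accesses):
    """accesses: list of (path, referer) pairs seen from one user agent."""
    paths = [path for path, _ in accesses]
    if '/robots.txt' in paths:
        return True                       # the simplistic rule quoted above
    # Made-up heuristics: many requests, never a Referer header, and no
    # requests for inline images or style sheets.
    no_referer = sum(1 for _, referer in accesses if not referer)
    inline = sum(1 for p in paths if p.endswith(('.gif', '.jpg', '.png', '.css')))
    return len(paths) > 20 and no_referer == len(paths) and inline == 0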

We have not attempted to develop this idea since other request classification methods perform well enough for our purposes, but nonetheless we feel that it is an intriguing approach.

5. Implementation

At this time we have an implementation of the maskerade and modified content methods in the Apache server. The current implementation is not suitable for real use; it is nothing more than a proof of concept.

We are in the process of implementing the conditional redirection, maskerade and modified content methods in the Roxen Challenger Web server [8]. We also intend to modify the IntraSeek robot [9] to work with maskerade and to send an extension header.

The technical reports archive will use the modified content approach since we want searches to take users to the cover page of a report rather than the report itself. The server will use the user-agent header to recognize robots. It will also react to the use of an extension header, X-indexing, to indicate that a request originates with a robot.

6. Conclusions

In this paper we presented four different methods that allow existing indexing software to index a textual representation of non-textual resources on the World Wide Web or on an intranet. Unlike more ambitious and complete content-negotiation proposals, the methods presented here will operate with existing indexing tools. All methods are similar in the sense that the server sends different responses depending on whether the request was made from a robot or from a user.

The "munged links" method fools the robot into retrieving a textual variant of the non-textual resource by changing the links in all referring documents. The "conditional redirect" method uses HTTP redirect replies to send users and robots to different variants of the same resource. The maskerade method replaces the non-textual content with a textual representation when the resource is requested by a robot, but in order to function, the robot must follow all references it sees, even those that at first glance appear to point to non-textual resources. The final method, modified content, appends a textual version of the non-textual resource to the end of all referring documents.

Of these four approaches, we feel that maskerade is the best choice in general, provided it works at all. If maskerade cannot be used, then some version of conditional redirect is probably most straightforward. For some applications, including the one that motivated us, the modified content approach is ideal.

We also presented three methods for recognizing requests from robots. We feel that the best method is to let the server look for an extension header named X-Indexing in the HTTP request. If one is found the request is assumed to be from a robot. If one is not found, then the server should examine the user-agent header to see if the request comes from a known robot.

In conclusion, the initial experiments with our document repository were quite successful, but we still hope to be able to replace these methods with standardized content negotiation methods in the future.

References

[1]
Proceedings of The 2nd World Wide Web Conference '94: Mosaic and the Web, 1994, http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/WWW2_Proceedings.html
[2]
T. Berners-Lee, R. Fielding and H. Frystyk, Hypertext Transfer Protocol – HTTP/1.0, Request for Comments 1945, May 1996, http://ds.internic.net/rfc/rfc1945.txt
[3]
R. Fielding, J. Gettys, J. Mogul, H. Frystyk and T. Berners-Lee, Hypertext Transfer Protocol – HTTP/1.1, Request for Comments 2068, January 1997, http://ds.internic.net/rfc/rfc2068.txt
[4]
C.M. Bowman, P.B. Danzig, D.R. Hardy, U. Manber, and M.F. Schwartz, The Harvest information discovery and access system, in: Proc. of the 2nd World Wide Web Conference [1], http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/schwartz.harvest/schwartz.harvest.html
[5]
K. Holtman and A. Mutz, The HTTP remote variant selection algorithm – RVSA/1.0, HTTP Working Group Internet Draft, July 1997, ftp://ftp.ietf.org/internet-drafts/draft-ietf-http-rvsa-v10-02.txt
[6]
K. Holtman and A. Mutz, Transparent content negotiation in HTTP, HTTP Working Group Internet Draft, September, 1997, ftp://ftp.ietf.org/internet-drafts/draft-ietf-http-negotiation-04.txt
[7]
K. Holtman, A. Mutz and T. Hardie, The alternates header field, HTTP Working Group Internet Draft, ftp://ftp.ietf.org/internet-drafts/draft-ietf-http-alternates-01.txt
[8]
Idonex, Roxen Challenger product overview, http://www.roxen.com/products/challenger/
[9]
Idonex, Roxen IntraSeek product overview, http://www.roxen.com/products/intraseek/
[10]
B. Pinkerton, Finding what people want: Experiences with the WebCrawler, in: Proc. of the 2nd World Wide Web Conference [1], http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/pinkerton/WebCrawler.html

URLs

AltaVista
http://www.altavista.digital.com/
Excite
http://www.excite.com/
IETF HyperText Transfer Protocol Working Group
http://www.ietf.org/html.charters/http-charter.html
Idonex
http://www.idonex.com/
OpenText
http://www.opentext.com/
Lycos
http://www.lycos.com/
Roxen Challenger
http://www.roxen.com/
WebCrawler
http://www.webcrawler.com/