This research provides a systematic approach for integrating Web systems through linking interrelated elements and functions. The infrastructure generates the vast majority of link anchors and links automatically through the use of structural relationship rules, in addition to lexical analysis.
digital library, service integration, automatic link generation, collaborative filtering, lexical analysis
This research provides a general method for integrating Web systems through linking the interrelated elements and functions. While our approach is a general one, we shall illustrate it using digital libraries as our sample domain.
The purpose of the Digital Library Service Integration project (DLSI) is to automatically generate links for digital library collections to related collections and services. Collections are libraries of computerized documents. Services include searching, providing annotations and peer review. Figure 1 presents an example of what users would see.
DLSI supplements collections by linking them automatically to relevant services and related collections. DLSI supplements services by automatically giving relevant objects in collections (and other services) direct access to these services. Users see a totally integrated environment, using their system just as before. However, they will see additional link anchors, and when they click on one, DLSI will present a list of supplemental links. DLSI will filter and rank order this set of generated links according to user preferences and tasks.
The DLSI infrastructure provides a systematic approach for integrating digital library systems, and by extension, any other information system with a Web interface. Systems generally require no changes to integrate with DLSI.
Figure 1: Mockup of a document with DLSI support. DLSI automatically adds link anchors, including an icon in the top right-hand corner for the document as a whole. Choosing one prompts DLSI to generate a list of links. The figure superimposes two possible sets of links for different elements: the concept "Plant Pathology" and the document as a whole. Each link shows a descriptive label, and the system to which it leads.
Figure 2: DLSI Architecture. DLSI is within the shaded area. The dashed paths indicate that once integrated, collections and services can share features through DLSI links automatically. Integrated systems also continue to operate independently of DLSI.
Figure 2 presents the DLSI integration infrastructure. To integrate a system, an analyst must write a wrapper, initiate communications between the system and its wrapper, and define relationship rules. (The DLSI Integration Manager module manages the relationship rules.)
(1) Develop a Wrapper: The wrapper's main task is to parse the display screens that appear on the user's Web browser to identify the "elements of interest" that DLSI will make into link anchors. First, wrappers will parse the display based on an understanding of the structure of its content. Second, DLSI will parse the display content using lexical analysis to identify additional elements of interest. If a service can operate on an element, DLSI will generate a link anchor over the element. Among the links generated for that anchor will be a link leading directly to that service's feature.
(2) Develop Relationship Rules: Relationship rules specify the "structural relationships" for automatically generating links for recognized object types within the system being integrated.
(3) Initiate Communications: Several possible ways exist to ensure information passes between the system being integrated and the wrapper.
Most other kinds of information systems could be integrated in the same manner as digital library collections and services.
We need to emphasize that DLSI generates the vast majority of link anchors and links automatically. If a system can operate on an element, DLSI will generate a link leading directly to that system's feature. For example, if there were a discussion thread about a course, any time that course's identifier appeared in a screen or document, DLSI would automatically detect this and add an anchor over the course identifier.
DLSI typically generates link anchors in two ways. First, "wrappers" parse screens and documents based on an understanding of the structure of the system's displays (i.e., using form templates, XML markup or parsing rules). Most anchors are identified in this manner.
Second, DLSI parses the screen and document content using lexical analysis to identify additional anchors. DLSI generates links automatically based on relationship rules.
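The two ways of identifying anchors can be illustrated with a minimal sketch. It assumes the display arrives as well-formed XML; the tag-to-class mapping, the thesaurus entries and the sample screen are all hypothetical and stand in for whatever a real wrapper would know about its system.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical mapping from XML tags to element classes (structural knowledge).
STRUCTURAL_TAGS = {"course": "course", "author": "person"}

# Hypothetical thesaurus of terms worth anchoring (lexical knowledge).
THESAURUS = {"plant pathology", "soil science"}


def structural_elements(xml_text):
    """Pass 1: find elements from the known structure of the display."""
    root = ET.fromstring(xml_text)
    found = []
    for node in root.iter():
        if node.tag in STRUCTURAL_TAGS and node.text:
            found.append((STRUCTURAL_TAGS[node.tag], node.text.strip()))
    return found


def lexical_elements(xml_text):
    """Pass 2: scan the free text for thesaurus terms."""
    root = ET.fromstring(xml_text)
    text = " ".join(root.itertext()).lower()
    found = []
    for term in sorted(THESAURUS):
        if re.search(r"\b" + re.escape(term) + r"\b", text):
            found.append(("concept", term))
    return found


screen = (
    "<doc><course>CIS 101</course>"
    "<body>An introduction to plant pathology.</body></doc>"
)
elements = structural_elements(screen) + lexical_elements(screen)
print(elements)
```

Each resulting pair (element class, value) is a candidate link anchor; the structural pass finds the course identifier and the lexical pass finds the concept.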
Relationship rules define which relationships (links) should be available for which kinds of elements. For example, in Figure 1, the relationship rule underlying the first concept link would include the following parameters:
Because relationship rules operate at the "class" or "kind of element" level, each rule works for every element of that class. For example, the rule above applies to any "concept" found in any document displayed.
Each relationship rule represents a single relationship for a single element class. As elements can have many relationships, each element class can have several relationship rules. Each element instance triggers the same set of relationship rules, assuming conditions are satisfied for each. In Figure 1, nine relationship rules triggered for the "concept" element (or more rules triggered, but the filtering mechanism produced this customized list).
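One way to picture class-level rules is as records keyed by element class. The sketch below is illustrative only: the field names, the example.org URL templates and the optional condition are assumptions, not the actual DLSI rule parameters.

```python
from dataclasses import dataclass
from typing import Callable, Optional
from urllib.parse import quote


@dataclass
class RelationshipRule:
    """One relationship for one element class (all field names hypothetical)."""
    element_class: str      # kind of element the rule applies to
    label: str              # descriptive label shown to the user
    target_system: str      # system to which the link leads
    url_template: str       # placeholder command template
    condition: Optional[Callable[[str], bool]] = None


RULES = [
    RelationshipRule("concept", "Search for this concept", "Search Service",
                     "https://search.example.org/?q={value}"),
    RelationshipRule("concept", "View annotations", "Annotation Service",
                     "https://notes.example.org/concept/{value}"),
    RelationshipRule("course", "Discussion thread", "Discussion Service",
                     "https://forum.example.org/course/{value}",
                     condition=lambda value: value.startswith("CIS")),
]


def links_for(element_class, value):
    """Trigger every rule of the element's class whose condition holds."""
    links = []
    for rule in RULES:
        if rule.element_class != element_class:
            continue
        if rule.condition is not None and not rule.condition(value):
            continue
        links.append({
            "label": rule.label,
            "system": rule.target_system,
            "url": rule.url_template.format(value=quote(value)),
        })
    return links


for link in links_for("concept", "Plant Pathology"):
    print(link["label"], "->", link["system"])
```

Because the rules are attached to the class rather than the instance, any other concept value passed to `links_for` would trigger the same two rules.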
The DLSI Integration Manager uses the relationship rules to determine which elements in a display will have links. The Integration Manager then creates an integrated HTML or XML document consisting of the original display output together with DLSI's anchors, which it will send to the user's browser. When the user selects an anchor, DLSI will use the relationship rules to generate a list of relevant links. When the user selects one, the Integration Manager passes the appropriate information to the appropriate collection or service for that link.
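The step of merging anchors into the display output might look roughly like the following. This is a simplified sketch that assumes the display is plain text rather than existing markup; the `/dlsi/links` endpoint and the `dlsi-anchor` class name are invented placeholders.

```python
import html
import re
from urllib.parse import quote


def add_anchors(display_text, elements):
    """Wrap each recognized element in an anchor.

    `elements` maps the exact text of an element to its element class.
    Selecting the anchor would request the list of links for that element.
    """
    if not elements:
        return html.escape(display_text)
    # Try longer element texts first so they win over shorter ones.
    ordered = sorted(elements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(text) for text in ordered))

    out, pos = [], 0
    for match in pattern.finditer(display_text):
        out.append(html.escape(display_text[pos:match.start()]))
        text = match.group(0)
        href = "/dlsi/links?class={}&value={}".format(
            elements[text], quote(text))
        out.append('<a class="dlsi-anchor" href="{}">{}</a>'.format(
            html.escape(href), html.escape(text)))
        pos = match.end()
    out.append(html.escape(display_text[pos:]))
    return "".join(out)


print(add_anchors("See CIS 101 & more", {"CIS 101": "course"}))
```

The original text is preserved around each anchor, which mirrors the idea that the integrated document is the original display output plus the added anchors.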
The Integration Manager is built upon the Dynamic Hypermedia Engine project [1, 2, 3].
DLSI wrappers perform lexical analysis when they parse documents and display screens to determine additional "elements of interest," which the Integration Manager will supplement with DLSI link anchors. Our Noun Phrase Extractor works this way: Tokenization is first performed on the document or display screen. We then use the WordNet lexical database [http://www.cogsci.princeton.edu/~wn/] to assign part-of-speech tags to tokens. Finally, a morphological and syntactic rule base is used to parse sentences and extract noun phrases. The Noun Phrase Extractor extracts noun phrases in their root forms (this takes care of morphological changes) from returned documents. These root form noun phrases are then separated into two lists of phrases: those that are in the master thesaurus file and those that are not. Any found in the master thesaurus will be made into supplemental link anchors. Keywords and key phrases from participating collections and services also will be added to this integrated master file.
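The pipeline above (tokenize, tag, extract root-form noun phrases, split against the thesaurus) can be sketched in a few lines. This is a toy illustration: the small `POS` dictionary stands in for WordNet lookups, the plural-stripping rule stands in for the morphological rule base, and the adjective-noun pattern stands in for the syntactic rules. None of these are the actual DLSI components.

```python
import re

# Toy part-of-speech lexicon standing in for WordNet lookups (assumption).
POS = {
    "fungal": "ADJ", "common": "ADJ",
    "disease": "NOUN", "diseases": "NOUN",
    "plant": "NOUN", "plants": "NOUN",
    "pathology": "NOUN", "crop": "NOUN", "crops": "NOUN",
    "studies": "VERB", "affect": "VERB",
}

# Hypothetical master thesaurus entries.
MASTER_THESAURUS = {"plant pathology", "fungal disease"}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def root_form(word):
    """Naive stand-in for morphological analysis: strip a plural 's'."""
    if word.endswith("s") and word[:-1] in POS:
        return word[:-1]
    return word


def noun_phrases(text):
    """Extract maximal ADJ* NOUN+ runs per sentence, nouns in root form."""
    phrases = []
    for sentence in re.split(r"[.!?;:,]", text):
        current, has_noun = [], False
        for token in tokenize(sentence) + [""]:  # "" flushes the last run
            tag = POS.get(token)
            if tag == "NOUN":
                current.append(root_form(token))
                has_noun = True
            elif tag == "ADJ" and not has_noun:
                current.append(token)
            else:
                if has_noun:
                    phrases.append(" ".join(current))
                current, has_noun = [], False
                if tag == "ADJ":  # an adjective may start the next phrase
                    current.append(token)
    return phrases


def partition(phrases, thesaurus):
    """Split phrases into those in the thesaurus and those that are not."""
    known = [p for p in phrases if p in thesaurus]
    unknown = [p for p in phrases if p not in thesaurus]
    return known, unknown


text = "Plant pathology studies fungal diseases. Common diseases affect crops."
known, unknown = partition(noun_phrases(text), MASTER_THESAURUS)
print("anchors:", known)
print("not in thesaurus:", unknown)
```

Only the phrases in the first list would become supplemental link anchors.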
The number of potential links that DLSI could generate for a particular element on a screen could vary from several to well over a hundred, resulting in the well-known hypermedia problem of cognitive overload. With a large number of links, filtering and ordering them is critical for effective use. Filtering and rank ordering in DLSI pose several challenges. First, the ordering should be customized to each user's needs. Second, it should dynamically reorganize as users advance through the system. Third, it must support multiple needs for the same user. A user may have several different tasks (needs), and the links should be reorganized depending on the user's current task.
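As a minimal sketch of task-dependent filtering and ordering, each (user, task) pair could carry a weight per link category, so that switching tasks reorders the same set of links. The profile structure, category names and weights below are all hypothetical.

```python
# Hypothetical per-user, per-task weights for link categories.
PROFILES = {
    ("alice", "teaching"): {"annotation": 0.9, "collection": 0.6, "search": 0.2},
    ("alice", "research"): {"review": 0.8, "search": 0.7},
}

# Hypothetical links generated for one element.
LINKS = [
    {"label": "Search for concept", "category": "search"},
    {"label": "Peer reviews", "category": "review"},
    {"label": "Annotations", "category": "annotation"},
    {"label": "Related collections", "category": "collection"},
]


def rank_links(links, task_weights, max_links=5):
    """Drop links whose category has no positive weight, then sort the
    rest by weight (highest first) and keep at most `max_links`."""
    scored = [(task_weights.get(link["category"], 0.0), link) for link in links]
    kept = [pair for pair in scored if pair[0] > 0]
    kept.sort(key=lambda pair: pair[0], reverse=True)  # stable for ties
    return [link["label"] for _, link in kept[:max_links]]


print(rank_links(LINKS, PROFILES[("alice", "teaching")]))
print(rank_links(LINKS, PROFILES[("alice", "research")]))
```

The same user sees a different ordering for each task, and updating the weights as the user advances through the system would reorder the list dynamically.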
DLSI incorporates collaborative filtering to filter information based on people's evaluations or behaviors. It generates recommendations using the following algorithm [4, 5, 6]:
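As a rough illustration of collaborative filtering in general (not necessarily the algorithm of [4, 5, 6]), the sketch below uses a standard user-based scheme: it measures how similar other users' ratings are to the current user's, then predicts a score for each unrated link as a similarity-weighted average. The ratings data and the choice of cosine similarity are assumptions made for the example.

```python
from math import sqrt

# Hypothetical ratings of links by users (1 = not useful, 5 = very useful).
RATINGS = {
    "ann": {"search": 5, "annotations": 4, "reviews": 1},
    "bob": {"search": 4, "annotations": 5, "glossary": 4},
    "cat": {"search": 1, "reviews": 5, "glossary": 2},
}


def similarity(a, b):
    """Cosine similarity over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def recommend(user, ratings):
    """Predict scores for items the user has not rated, best first."""
    mine = ratings[user]
    totals, weights = {}, {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        weight = similarity(mine, theirs)
        if weight <= 0:
            continue
        for item, rating in theirs.items():
            if item in mine:
                continue
            totals[item] = totals.get(item, 0.0) + weight * rating
            weights[item] = weights.get(item, 0.0) + weight
    predictions = {item: totals[item] / weights[item] for item in totals}
    return sorted(predictions.items(), key=lambda kv: kv[1], reverse=True)


print(recommend("ann", RATINGS))
```

Here the prediction for ann's unrated "glossary" link is pulled toward bob's rating, because bob's past ratings resemble ann's more closely than cat's do.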
This research's primary contribution is providing a relatively straightforward, sustainable infrastructure for integrating information systems. Other contributions include:
We gratefully acknowledge support by the NSF under grants EISA-9818309, EIA-0083758, IIS-0135531 and DUE-0226075. DLSI is part of the National Science Digital Library project (http://www.nsdl.org).