WebGUIDE: Querying and Navigating Changes in Web Repositories

Fifth International World Wide Web Conference
May 6-10, 1996, Paris, France

Fred Douglis¹
Thomas Ball²
Yih-Farn Chen¹
Eleftherios Koutsofios¹

¹AT&T Research, 600 Mountain Ave., Murray Hill, NJ 07974-2008
²Bell Laboratories, 1000 E. Warrenville Rd., Naperville, IL 60566-7013

Abstract:

WebGUIDE is a system for exploring changes to World Wide Web pages and Web structure that supports recursive document comparison: users may explore the differences between pages with respect to two dates. Differences between pages are computed automatically and summarized in a new HTML page, and differences in link structure are shown via graphical representations. WebGUIDE is the combination of two tools that complement one another: the AT&T Internet Difference Engine (AIDE) [9] is a tool for tracking and viewing modifications to World Wide Web pages, which has been extended to support recursive tracking of pages; Ciao [6] is a graphical navigator that allows users to query and browse structural connections embedded in a document repository. The union of these tools lets users get information on the evolution of pages of interest (both textually and graphically), browse the differences interactively, and dynamically modify the set of pages with which they interact.

Keywords: differencing, repository, navigator, HTML, AIDE, CIAO

1 Introduction

Browsing and searching are popular ways to access and find information on the World Wide Web (WWW). While GUI-based browsers and powerful search engines are now ubiquitous, tools and mechanisms that provide access to historical information and tracking of updates have been developed only recently and are not in widespread use. Search engines and browsers help users locate and inspect information of interest, while tracking tools help users keep up to date on this information. WWW services and applications can benefit from a mechanism that tracks changes, maintains page version histories, and automatically computes differences [2]. Such a mechanism becomes more useful still when combined with facilities for dealing with the vast number of documents on the Web: graphical views of pages [8], with querying and filtering based on user-specified criteria [6]; recursive tracking and viewing of changes to related Web documents; and prioritization of change notifications based on user criteria. In this paper we focus on the first two of these improvements.

We have combined and expanded upon two existing tools, Ciao [6] and the AT&T Internet Difference Engine (AIDE) [9][2], in order to provide two sorts of visual cues. We call the resulting system the Web Graphical User Interface to a Difference Engine, or WebGUIDE. With Ciao we show high-level structural differences by displaying graphs that show the relationships between pages and color the nodes to indicate which pages have been modified. Using AIDE, we show low-level textual differences by marking up changes between versions and by modifying anchors so that documents reached from that page are annotated as well.

The next section describes Ciao and AIDE in greater detail. Section 3 discusses the architecture of WebGUIDE. Section 4 gives an example of a WebGUIDE session. Section 5 covers related work, and Section 6 concludes.

2 Ciao and AIDE

This section introduces Ciao and AIDE, with examples of their use.

2.1 Ciao

Ciao is a customizable graphical navigator that allows users to query and browse structural connections embedded in a document repository. Ciao involves three major components: an abstractor that converts source documents to a database according to a data model that describes the documents' internal structure, a repository that keeps versions of the documents and corresponding databases, and a graphical interface that allows users to query and visualize the information structure. Ciao has been instantiated for C, C++, ksh, HTML, and some business information repositories.

[Ciao-HTML Example]
Figure 1: Example of Ciao-HTML as applied to the AT&T home page.

Ciao-HTML can be used to explore the structure of HTML documents. The data model for HTML includes entities such as HTML pages, anchors, headers, and images, and relationships among them. Unlike other instantiations, the Ciao-HTML database can expand in real time as the user tries to explore links to pages that are not currently incorporated in the database. Figure 1 shows a snapshot of Ciao-HTML on a version of the AT&T Home Page.

The user started with a query to retrieve all relationships between the AT&T Home Page and its anchors, which resulted in a graph shown in the upper-left window. The user then expanded two of the anchors, Home and Work, in place to show further link connections. Since the graph had become more complicated, the user decided to create yet another window (somewhat like Netscape's clone feature) shown in the lower right with the Home node as the root to focus on that page and its derivatives. The user also visited two of the home pages by sending requests to her browser. All these operations were done through pop-up menus attached to the graph nodes. These query and navigation features of Ciao-HTML allow the user to browse complex Web structures comfortably.

Note that Ciao-HTML runs as an external application on the user's machine, and interfaces with the browser by sending it commands to visit particular nodes. It retrieves and processes pages independently from the browser (relying on a proxy-caching server to ensure that the same pages are not fetched multiple times from off-site).

2.2 AT&T Internet Difference Engine

The AT&T Internet Difference Engine [9][2] combines notification of changes to pages on the Web with a customized view of what has changed on those pages. Notification of changes has become relatively commonplace [17][16][7][19], but viewing changes has not. AIDE supports this with a shared version repository, into which users "deposit" pages of interest when they have seen them, and a tool called HtmlDiff, which creates a page that highlights the differences between two versions of an HTML document. In addition to seeing the changes to a page since the user last viewed it, it is possible to see a history of versions and to compare any pair of them. All archival and differencing is performed on a server, using CGI scripts.

Figure 2: Example of HtmlDiff as applied to the WWW-5 home page. It shows, for instance, that the text describing the WWW4 conference was updated once the conference took place, and that an "SMEs Forum" link was added to the schedule. Some text is omitted to permit the output to fit on one page. Also, the rows in this table are for demonstrative purposes, for inclusion with borders in this document.
HtmlDiff: Here is the first difference. There are 9 differences on this page. ~~is old.~~ *is new.*
Fifth International World Wide Web Conference May 6-11, 1996, Paris, France General Information
The World Wide Web network Information System is now driving the Internet expansion throughout the World. The World Wide Web was originally created at CERN by Tim Berners-Lee for high-energy physicists and since then, has developed into millions of users from a wide variety of application domains. It is recognized as being of strategic importance for the future development of the global information society. Since 1994, several International WWW Conferences have been organized:
WWW1: Geneva, May 1994 WWW2: Chicago, October 1994 WWW3: Darmstadt, April 1995 WWW4: ~~to be held in Boston in December 1995~~ *Boston,* *December* *11-15,* *1995*
The Fifth International World Wide Web Conference will take place on May 6-11, 1996 at CNIT-Paris La Defense. The CNIT is one of the largest conference and exhibition centers in Europe, located on the western side of ~~Paris,~~ *Paris*, France. [omitted]
Conference & Exhibition Schedule: May 6: Tutorials and Workshops May 7-9: Technical Program May 10: Developer's Day May 9-11: Exhibition *May* *10-11:* *SMEs* *Forum*
Important Dates: Call for Papers including Format Guidelines: *open* November 1, 1995 Call for Exhibitors: *open* November 1, 1995 Deadline for submission of Technical Papers: January 29, 1996 Deadline for submission of Tutorial / Workshop Proposals: February 23, 1996 Notification of acceptance of Papers: March 4, 1996 Deadline for submission of final versions of accepted papers: April 5, 1996
~~For more information~~ *Guided* *Tour* of *this* *site* *Send* a *mail* to *Organizers*
Created: 30 October Last updated: ~~7 November~~ *December* 16

Figure 2 gives an example of HtmlDiff's output. Bold italics indicate new text, struck-out text indicates deletions, and arrows point to either (including changes to URLs, which are not otherwise highlighted).

Note that until the functionality of AIDE and Ciao was combined as WebGUIDE, the only interface to AIDE was through simple HTML forms and anchors. Once the volume of pages tracked by a single user exceeds some threshold, or links are followed recursively, more sophisticated interfaces are necessary to provide visual feedback and navigational tools.

3 System Architecture

WebGUIDE consists of four components: a version and meta-data repository, a robot that tracks modifications, a difference engine, and a graph generator. Pieces of these components have been described elsewhere [9][6]; we focus here on how they have evolved since. The architecture is depicted in Figure 3.

[WebGUIDE Architecture]
Figure 3: WebGUIDE architecture. AIDE and Ciao each have their own databases. AIDE stores versions of pages and information about when pages have been modified, as well as which users have seen which versions; Ciao has an entity-relationship database that is extended dynamically as the result of queries. Ciao accesses the AIDE version repository to compare versions of pages. All data are stored on a central server which is accessed via a CGI interface.

3.1 Repository

The AIDE version repository is a centralized service that archives versions of pages. Unlike a search engine such as Inktomi [13] or Lycos [15], it retrieves and stores only pages that users explicitly request. (Note however that in the extreme case, a user could specify a page that ultimately leads to many other pages, such as Yahoo [20], along with a high level of recursion, and achieve a similar effect.) Pages are stored in RCS [18] format, so storing multiple versions does not result in excessive storage overhead as long as changes are relatively small.

In addition, AIDE maintains a relational database containing meta-data about each page, each user, and the relationships between them. For each URL, it stores the following (among other information):

Last modification date: This is used to find pages that have been modified since a user saw them.
Last check: The time when the last modification date was obtained is used to determine when the page should next be checked.
Checksum: This is used in case the last modification date is unavailable.
History: This records information about archived versions, including the date and the RCS version number.
Frequency of checks: Different users may request different minimum frequencies to check a URL; this is the minimum across all users.

For each user, the database contains global information (such as an email address) and per-URL information. For each <user, URL> combination it keeps:

Last time seen: The last time a user has seen a page through AIDE is saved. (Of course, if the user views the page directly, AIDE has no way of knowing this unless AIDE has access to her history file.)
History: AIDE keeps a history of which versions the user has seen, which is a subset of all versions recorded for the page.
Minimum frequency of checks: How often the URL should be checked.
Notification method: Most changes to URLs will be reported upon request by a user (invoking a CGI script), but in some cases the user may request email notification. In addition, for those URLs that are reported together, a priority can cause them to be ordered to call attention to some more than others. This is similar to Tapestry [11], which orders email and netnews postings based on user criteria.
Auto-archive: The user can specify that a page should be archived every time a change is detected, or versions can be archived only upon explicit request of the user.
Depth: The depth indicates how many levels of hyperlinks to follow when checking for modifications and archiving versions. Typically it will be zero.
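
As an illustration only, the per-URL and per-<user, URL> records listed above might be modeled as follows. The class and field names are hypothetical and are not taken from AIDE's actual relational schema; the sketch simply mirrors the attributes named in the two lists.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class ArchivedVersion:
    """One entry in a page's history (hypothetical name)."""
    rcs_revision: str            # RCS version number, e.g. "1.24"
    archived_at: datetime        # date the version was archived


@dataclass
class UrlRecord:
    """Per-URL meta-data (hypothetical name)."""
    url: str
    last_modified: Optional[datetime] = None    # may be unavailable
    last_check: Optional[datetime] = None       # when last_modified was obtained
    checksum: Optional[str] = None              # used when no date is available
    history: List[ArchivedVersion] = field(default_factory=list)
    check_interval: Optional[timedelta] = None  # aggregated over all users


@dataclass
class Subscription:
    """Per-<user, URL> meta-data (hypothetical name)."""
    user: str
    url: str
    last_seen: Optional[datetime] = None        # last time seen through AIDE
    seen_revisions: List[str] = field(default_factory=list)
    check_interval: Optional[timedelta] = None  # how often to check the URL
    notify_by_email: bool = False               # default: report on request
    priority: int = 0                           # ordering among reported URLs
    auto_archive: bool = False                  # archive on every detected change
    depth: int = 0                              # levels of links to follow
```

The default depth of zero reflects the typical case noted above; the other defaults are assumptions made for the sketch.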

3.2 Tracking Modifications

The robot, like the tools described in Section 5.1, periodically checks pages for updates. It queries the database for all pages that have not been checked within their minimum polling frequency. (For pages that are to be checked recursively, the polling frequency for links is typically lower than that of the root page.)

AIDE does not check pages that are "known" to be new: if every user who has expressed an interest in a page has already been told that the page has been modified, and none of them has since visited the page through AIDE or viewed its differences, the page is not checked again with the same frequency.

The time of each check is recorded in the database, as well as the new modification time. Modified pages are reported to interested users immediately if requested. The new page is archived automatically if specified by any user.
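
The two decisions the robot makes for each page (whether the page is due for a check, and whether it has changed) can be sketched as below. This is a minimal sketch under stated assumptions, not the robot's actual code: the function names are hypothetical, polling is expressed as an interval, and SHA-256 stands in for whatever checksum AIDE uses.

```python
import hashlib
from datetime import datetime, timedelta


def is_due(last_check, interval, now):
    """A page is due if it was never checked or its polling interval has elapsed."""
    return last_check is None or now - last_check >= interval


def has_changed(old_last_modified, new_last_modified, old_checksum, body):
    """Compare modification dates when both are known; otherwise compare checksums.

    SHA-256 is an assumption made for this sketch.
    """
    if old_last_modified is not None and new_last_modified is not None:
        return new_last_modified > old_last_modified
    return hashlib.sha256(body).hexdigest() != old_checksum
```

For example, `is_due(None, timedelta(days=1), datetime(1996, 4, 1))` is true because the page has never been checked.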

3.3 HTML Differencing and Recursion

Originally, differencing was done only on a per-page basis, with no notion of recursion [9]. That mode is useful when most pages are checked in isolation, but less so when pages are tracked recursively. Now, one can visit a page with links to modified pages and have those links highlighted. By following the link, HtmlDiff is invoked recursively on the new page, and its links are similarly highlighted. Thus one can see the differences between a set of related pages at any points in time for which contents have been archived.

The recursive comparison interface works as follows: the user selects two versions of an HTML document for comparison. The two timestamps associated with these documents define the time range for future document comparison as the user browses. When HtmlDiff compares two documents, it gathers up all the URLs in the document and queries the version repository to determine if there are different versions of the documents specified by the URLs for the two dates. If so, an icon is inserted before the hypertext link; this icon is itself a hypertext link that transfers control back to AIDE in order to compare the versions of the documents. This on-the-fly analysis and annotation of the pages is similar to the transducers of Brooks et al. [3], but for a different purpose.
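
The annotation step described above can be sketched roughly as follows. This is not HtmlDiff's implementation: the function names, the CGI path and its query parameters, and the icon location are all hypothetical, and the repository lookup is replaced by a precomputed set of URLs known to differ between the two dates.

```python
import re
from html.parser import HTMLParser
from urllib.parse import quote


class LinkCollector(HTMLParser):
    """Gather the href of every anchor in a page."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.urls.append(value)


def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.urls


# Simplified matcher: only handles double-quoted href attributes.
ANCHOR = re.compile(r'<a\s[^>]*href="([^"]+)"[^>]*>', re.IGNORECASE)


def annotate(html, changed, diff_cgi, old_date, new_date):
    """Insert an icon link before each anchor whose target is in `changed`."""

    def mark(match):
        url = match.group(1)
        if url not in changed:
            return match.group(0)
        target = (f"{diff_cgi}?url={quote(url, safe='')}"
                  f"&amp;old={old_date}&amp;new={new_date}")
        icon = (f'<a href="{target}">'
                f'<img src="/icons/changed.gif" alt="[changed]"></a>')
        return icon + match.group(0)

    return ANCHOR.sub(mark, html)
```

The icon's link carries the two dates along, so following it compares the target page over the same time range.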

Clearly, the effectiveness of recursive comparison depends on the quantity of historical information in the version repository. Many URLs will not have any history and will not be filtered. Other URLs may have historical information, but not for the exact dates specified for recursive comparison. In the latter case, we make a number of approximations in order to provide more comparative information. Suppose that the current date is 96.04.01, that the user asks for version comparison between the dates 95.09.20 and 96.03.06, and that for a given URL, versions exist as of 95.10.30, 96.01.01, and 96.03.10. In this case, we use the dates closest to those specified (up to some epsilon interval), so the comparison will use the 95.10.30 and 96.03.10 versions. For another URL, there may only be a version stored for 95.10.15. In this case, we compare the stored version and the current version on the WWW. The epsilon interval used for date approximation may be user-specified.
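
The date approximation can be illustrated with the following sketch, which reproduces the two cases in the example above. The function names and the `LIVE` marker are hypothetical, and the 45-day epsilon used in the usage note is an assumption chosen so that the 95.10.30 version (40 days from 95.09.20) qualifies. Comparing a single stored version against the live page whenever only one endpoint can be matched is likewise an assumption that generalizes the second case.

```python
from datetime import date, timedelta

LIVE = "live"  # hypothetical marker: fetch the current page from the Web


def closest_version(versions, target, epsilon):
    """Return the archived date nearest to target, or None if none is within epsilon."""
    if not versions:
        return None
    best = min(versions, key=lambda v: abs(v - target))
    return best if abs(best - target) <= epsilon else None


def pick_pair(versions, old, new, epsilon):
    """Choose the two versions to compare for a requested date range."""
    a = closest_version(versions, old, epsilon)
    b = closest_version(versions, new, epsilon)
    if a is None and b is None:
        return None              # no usable history for this URL
    if a is None or b is None or a == b:
        return (a or b, LIVE)    # one stored version: compare with the live page
    return (a, b)
```

With `epsilon = timedelta(days=45)`, the versions 95.10.30, 96.01.01, and 96.03.10 and the requested dates 95.09.20 and 96.03.06 yield the pair (95.10.30, 96.03.10), while a URL with only a 95.10.15 version yields that version paired with `LIVE`.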

Recursive HTML comparison allows users to see that a hypertext link points to a page for which there are changes. However, this only works well for one level of indirection. If the currently viewed page and a changed page are separated by a long chain of unchanged pages, it is bothersome to force the user to step through the unchanged pages to get to the differences. The Ciao graphical interface addresses this problem by providing a graphical overview of the changed pages, allowing the user to quickly navigate to changed pages.

3.4 Graph Generator

The graphical view of relationships between URLs of interest to a user, and their states, could be generated in a number of ways. WebGUIDE generates graphs on the fly as embedded images, using a tool called webdot [10]. The images can be clickable, so clicking on a node can invoke another operation. Unfortunately, image maps do not currently support operations other than selecting a URL based on location within the image, so unlike an external application like Ciao [6] or WebMap [8], one cannot click on a node to bring up a menu directly. Instead, it is necessary to go to a URL and have that page provide the user interface to select an operation. We have taken this approach, and will support several operations through this indirect page:

Visit the URL represented by the node.
Show the differences between the current version of the page and the previous version saved by the user.
Remember the page represented by the node.
See the node's version history.
Perform a Ciao query to modify the graph, for instance to select nodes matching some criteria.

Another approach might be a helper application that would run on the user's machine, external to the browser, as WebMap does. This would be complicated by the need to interact with a database and CGI services on another machine, rather than being self-contained like WebMap, and would require that a user install an external software package.

Perhaps the most elegant approach is to provide full interactive access to the graph using Java [14]. We intend to explore this possibility in the future.

4 A WebGUIDE Session

We are now ready to go through a WebGUIDE session to see how a user might interact with WebGUIDE to query and navigate changes in a Web repository. The following example demonstrates how the components of AIDE and Ciao are combined seamlessly in WebGUIDE to provide effective browsing, searching, archiving, and differencing capabilities, all under a simple visual interface.

[Ciao diff example]
Figure 4: A Structure Difference Graph for Two Versions of the AT&T Home Page, http://www.att.com. In the actual system, this would be an imagemap that would take the user to a form based on the node selected.

The user visits the WebGUIDE home page and is interested in viewing the history of http://www.att.com. This is done through a standard form-based interface, and a history list showing all available versions is sent back.
She then picks "version 1.24" and "version 1.23" and invokes the graph diff operator. Several things happen at this point:
- The corresponding HTML pages are reconstructed from the RCS repository.
- The Ciao-HTML abstractor is invoked to create a database for each home page. These databases are transient, and are deleted after a period of non-use.
- The difference engine invokes the Ciao dbdiff operator to compute the difference database.
- The graph generator sends back the webdot graph computed from the difference database to show the connections between the AT&T home page and other anchors, highlighting the additions, deletions, and changes of nodes and edges. Figure 4 shows a graphical difference generated by WebGUIDE for the AT&T home page for the dates 95.11.28 and 96.01.23. The root page is drawn as a rectangular node and anchors as ovals. Yellow nodes indicate that the corresponding pages have been changed, red ones are new anchors, white ones are deleted, and light-blue ones are those that remain the same. Similarly, dashed lines indicate new links, dotted lines indicate deleted links, and solid lines are those links that remain intact. The graph gives a high-level view of the structural changes that have occurred in the AT&T home page since the last visit (if version 1.24 is the current version).
The user may then decide to invoke HtmlDiff on the AT&T Home Page to see detailed text changes, similar to the example shown in Figure 2.
The user may also become interested in the new addition "Network Systems" and visit that node using the same mechanism shown in Figure 1.
She may then decide to incorporate it in the WebGUIDE repository by issuing the "Remember" command.
Since "Network Systems" appears to be quite related to the user's current interests, she may decide to explore further by recursively visiting the descendants of the Network Systems node and "remembering" them in the WebGUIDE repository.
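
To make the coloring conventions of the difference graph concrete, the sketch below compares the anchor sets of two versions of a root page and emits a graph description in the DOT language, using the node colors and edge styles described in the session above. It is only an approximation under stated assumptions: the function name is hypothetical, plain Python sets stand in for the Ciao databases and the dbdiff operator, and only the root page's direct links are considered.

```python
def diff_graph(root, old_links, new_links, changed_pages):
    """Return DOT text for the link differences between two versions of root.

    old_links / new_links: sets of link targets in the old and new version.
    changed_pages: targets present in both versions whose content has changed.
    """
    lines = ["digraph diff {", f'  "{root}" [shape=box];']
    for target in sorted(old_links | new_links):
        if target not in old_links:
            color, style = "red", "dashed"        # new anchor, new link
        elif target not in new_links:
            color, style = "white", "dotted"      # deleted anchor, deleted link
        elif target in changed_pages:
            color, style = "yellow", "solid"      # link intact, page changed
        else:
            color, style = "lightblue", "solid"   # unchanged
        lines.append(
            f'  "{target}" [shape=ellipse, style=filled, fillcolor={color}];')
        lines.append(f'  "{root}" -> "{target}" [style={style}];')
    lines.append("}")
    return "\n".join(lines)
```

The resulting text can be rendered by any tool that accepts DOT input.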

5 Related Work

Related work falls in two categories: tracking modifications, discussed in Section 5.1, and browsing tools, in Section 5.2.

5.1 Tracking Modifications

A number of tools will watch for modifications of Web pages and notify the user, as a list of updated pages [7][17], annotations in the bookmark view [16], or email [19]. While the last of these runs as a centralized service that periodically checks a URL once on behalf of a large community of users, the others run on the user's own machine. Running locally has the advantage of privacy, as well as giving the polling mechanism access to authenticated URLs, but also the disadvantage of not scaling well to an Internet-wide community.

AIDE expands upon these tools by providing a version archive and automatic differencing [2][9]. Many users can share a single archive. Tracking modifications was originally handled by users individually but is migrating to the central server.

Chawathe, et al., recently developed algorithms for detecting how hierarchically structured documents change [5]. They can identify not only additions and deletions but also the movement of substantially similar subtrees from one point in a document to another. Their initial prototype is specialized for LaTeX documents, but they note its applicability to HTML, and we intend to investigate the incorporation of their algorithms into HtmlDiff.

5.2 Browsing

WebMap [8] is a tool for visualizing relationships between pages, particularly ones that a user has previously viewed with a Web browser. It relies on notification from the browser to keep its list of nodes, and their relationships, up to date. It also can send commands back to the browser to visit a URL represented by a node in the graph. Dömel additionally describes a "domain" concept, which can be used to group pages together, for instance to print a set of related pages. It does not support differencing or notification of modifications.

Hyper-G [1] supports automatic maintenance of large datasets, including hierarchical navigation and searches based on attributes or contents. Its underlying database includes such things as the links between entities, much as Ciao-HTML does, providing an efficient query mechanism. However, Hyper-G is a layer above such things as HTML (its native language is the Hyper-G Text Format), and it works with a specific set of documents stored within a collection of Hyper-G servers. WebGUIDE works on all documents in the Web.

GlimpseHTTP [12] uses the "neighborhood" of a Web page to limit searches. This is done by indexing a directory hierarchy and providing a CGI query facility to perform full-text searches of the hierarchy or any subcomponent. It is being extended (as something called gweb) to retrieve pages over the Web and index them locally. Its automated searches have similarities to our system, but it is oriented toward full-text indexing rather than differencing or meta-queries.

DeckScape [4] is a browser that changes what is currently a "standard" depth-first paradigm into a multi-level one. Each "deck" is traversed linearly, but separate decks may be maintained in parallel, and documents may be moved between decks. Thus users have more control over navigating a set of disjoint pages that are either unrelated or branch out from a common ancestor, and can switch back and forth among unrelated pages. More importantly, unlike the common browsers (e.g., Netscape or Mosaic), state (the contents of decks) is preserved across invocations, so users can retain context over time. We view DeckScape as complementary to WebGUIDE, since the enhanced navigational abilities offered by Ciao-HTML (or WebMap) should be directly supported by browsers. However, neither DeckScape nor WebMap supports extended queries about the relationships between nodes.

6 Status and Future Work

To date, we have extended the AIDE CGI system and HtmlDiff application to support recursive differencing and recursive archival of a page and its immediate descendants; and we have integrated the Ciao analysis facilities, the webdot HTML graph generation, and AIDE CGI system to support graphical display of structural differences between two versions of a page, such as the addition or deletion of an anchor or image.

We must still complete the migration of the AIDE notification component into the CGI system, so that it has current knowledge of when pages change, and then make Ciao query the AIDE database to highlight anchors that point to changed pages. We must also provide increased functionality from the webdot graph, allowing users to expand and query links as well as visit the nodes they represent.

A couple of lingering issues remain. One is the way the AIDE repository evolves over time. If a user loses interest in a page, it need not be tracked if no other users have registered an interest, but what about its version archive? Worse, what if the URL is deleted or relocated? Over time it might be desirable to delete or merge old versions, or it might be appropriate to keep historical data for archival purposes.

Another issue is that of copyright. One can argue that archiving a page on the Web does not violate copyright any more than caching it in a proxy-caching server for an indefinite interval; however, archiving it permanently stretches the limits of this comparison. Furthermore, to what extent does highlighting differences constitute an infringement as a "derivative work"? Since HTML does not dictate how a viewer displays particular markup, it is unclear whether the owner of HTML content can dictate exactly which fonts are used in a document. It would be preferable, though, to evolve to a standard in which content providers would explicitly authorize archival and differencing within their documents.

References

1: Keith Andrews, Frank Kappe, and Hermann Maurer. Serving information to the Web with Hyper-G. In Proceedings of the Third International WWW Conference, April 1995.
2: Thomas Ball and Fred Douglis. An internet difference engine and its applications. In Proceedings of 1996 COMPCON, February 1996, pp. 71-76.
3: Charles Brooks, Murray S. Mazer, Scott Meeks, and Jim Miller. Application-specific proxy servers as http stream transducers. In Proceedings of the Fourth International WWW Conference, December 1995.
4: Marc H. Brown and Robert A. Shillner. Deckscape: An experimental web browser. In Proceedings of the Third International WWW Conference, April 1995.
5: S. Chawathe, A. Rajaraman, H. Garcia-Molina, and J. Widom. Change detection in hierarchically structured information. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 1996. (To appear.)
6: Yih-Farn Chen, Glenn S. Fowler, Eleftherios Koutsofios, and Ryan S. Wallach. Ciao: A Graphical Navigator for Software and Document Repositories. In International Conference on Software Maintenance, pages 66-75, 1995. See also the Ciao home page.
7: B. B. Cutter III. w3new. http://www.stuff.com/~bcutter/programs/w3new/w3new.html.
8: Peter Dömel. Webmap - a graphical hypertext navigation tool. In Proceedings of the Second International WWW Conference, 1994.
9: Fred Douglis and Thomas Ball. Tracking and viewing changes on the web. In Proceedings of 1996 USENIX Technical Conference, January 1996. See also the AIDE home page.
10: John Ellson. Personal communication.
11: David Goldberg, David Nichols, Brian M Oki, and Douglas Terry. Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12):61-70, December 1992.
12: Burra Gopal, Paul Klark, and Udi Manber. Combining browsing and searching. Unpublished manuscript, October 1995.
13: Inktomi. http://inktomi.berkeley.edu.
14: Java. http://www.javasoft.com/.
15: Lycos. http://www.lycos.com/.
16: First Floor Software. http://www.firstfloor.com/.
17: Specter, Inc. Webwatch. http://www.specter.com/users/janos/webwatch/index.html.
18: W. Tichy. RCS: a system for version control. Software-Practice & Experience, 15(7):637-654, July 1985.
19: Url-minder. http://www.netmind.com/URL-minder/URL-minder.html.
20: Yahoo. http://yahoo.com/.
