Fred Douglis¹
¹AT&T Research, 600 Mountain Ave., Murray Hill, NJ 07974-2008
Thomas Ball²
Yih-Farn Chen¹
Eleftherios Koutsofios¹
²Bell Laboratories, 1000 E. Warrenville Rd., Naperville, IL 60566-7013
Keywords: differencing, repository, navigator, HTML, AIDE, CIAO
1 Introduction
Browsing and searching are popular ways to access
and find information on the World Wide Web (WWW).
While GUI-based browsers and powerful search
engines are now ubiquitous, tools and mechanisms
that provide access to historical information and tracking of
updates have been developed only recently and are not in widespread use.
Search engines and browsers help users locate and inspect
information of interest, while tracking
tools help users to keep up-to-date on this pertinent information.
WWW services and applications can
benefit from a mechanism that tracks changes, maintains page
version histories, and automatically computes
differences [2]. Its usefulness will be
further increased by mechanisms for dealing with the vast number of
documents on the Web: graphical views of pages [8],
with querying and filtering based on user-specified
criteria [6]; recursive tracking and viewing of changes to related
Web documents; and prioritization of change notifications based on user
criteria. In this paper we focus on the first two of these improvements.
We have combined and expanded upon two existing tools, Ciao [6] and the AT&T Internet Difference Engine (AIDE) [9][2], in order to provide two sorts of visual cues. We call the resulting system the Web Graphical User Interface to a Difference Engine, or WebGUIDE. With Ciao we show high-level structural differences by displaying graphs that show the relationships between pages and by coloring the nodes to indicate which pages have been modified. Using AIDE, we show low-level textual differences by marking up changes between versions and by modifying anchors so that documents reached from that page are annotated in turn.
The next section describes Ciao and AIDE in greater detail. Section 3 discusses the architecture of WebGUIDE. Section 4 gives an example of a WebGUIDE session. Section 5 covers related work, and Section 6 concludes.
This section introduces Ciao and AIDE and gives examples of their use.
Ciao is a customizable graphical navigator that allows users to query and browse structural connections embedded in a document repository. Ciao involves three major components: an abstractor that converts source documents to a database according to a data model that describes the documents' internal structure, a repository that keeps versions of the documents and corresponding databases, and a graphical interface that allows users to query and visualize the information structure. Ciao has been instantiated for C, C++, ksh, HTML, and some business information repositories.
Figure 1: Example of Ciao-HTML as applied to the AT&T home page.
Ciao-HTML can be used to explore the structure of HTML documents. The data model for HTML includes entities such as HTML pages, anchors, headers, and images, and the relationships among them. Unlike other instantiations, the Ciao-HTML database can expand in real time as the user tries to explore links to pages that are not currently incorporated in the database. Figure 1 shows a snapshot of Ciao-HTML on a version of the AT&T Home Page.
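To make the data model concrete, the following is a minimal sketch of an abstractor that turns one HTML page into entity-relationship tuples. It is not Ciao's actual abstractor: the names PageAbstractor and abstract_page and the (source, kind, target) tuple layout are assumptions made for illustration, and parsing relies on Python's standard html.parser module.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class PageAbstractor(HTMLParser):
    """Collects (source, kind, target) tuples for one HTML page (hypothetical layout)."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.relations = []        # (page URL, relationship kind, target)
        self._header_text = None   # text pieces of the header being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            target = urljoin(self.page_url, attrs["href"])
            self.relations.append((self.page_url, "anchor", target))
        elif tag == "img" and attrs.get("src"):
            target = urljoin(self.page_url, attrs["src"])
            self.relations.append((self.page_url, "image", target))
        elif tag in HEADER_TAGS:
            self._header_text = []

    def handle_data(self, data):
        if self._header_text is not None:
            self._header_text.append(data)

    def handle_endtag(self, tag):
        if tag in HEADER_TAGS and self._header_text is not None:
            text = "".join(self._header_text).strip()
            self.relations.append((self.page_url, "header", text))
            self._header_text = None


def abstract_page(page_url, html_text):
    """Return the relationship tuples found in html_text."""
    parser = PageAbstractor(page_url)
    parser.feed(html_text)
    parser.close()
    return parser.relations
```

Tuples of this kind could be loaded into any relational store; expanding a node "in place" then amounts to running the same abstraction on the anchor's target and adding the resulting tuples.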
The user started with a query to retrieve all relationships between the AT&T Home Page and its anchors, which resulted in a graph shown in the upper-left window. The user then expanded two of the anchors, Home and Work, in place to show further link connections. Since the graph had become more complicated, the user decided to create yet another window (somewhat like Netscape's clone feature) shown in the lower right with the Home node as the root to focus on that page and its derivatives. The user also visited two of the home pages by sending requests to her browser. All these operations were done through pop-up menus attached to the graph nodes. These query and navigation features of Ciao-HTML allow the user to browse complex Web structures comfortably.
Note that Ciao-HTML runs as an external application on the user's machine, and interfaces with the browser by sending it commands to visit particular nodes. It retrieves and processes pages independently from the browser (relying on a proxy-caching server to ensure that the same pages are not fetched multiple times from off-site).
2.2 AT&T Internet Difference Engine
The AT&T Internet Difference Engine [9][2] combines notification of changes to pages on the Web with a customized view of what has changed on those pages. Notification of changes has become relatively commonplace [17][16][7][19], but viewing changes has not. AIDE supports viewing changes with a shared version repository, into which users ``deposit'' pages of interest when they have seen them, and a tool called HtmlDiff, which creates a page that highlights the differences between two versions of an HTML document. In addition to seeing the changes to a page since the user last viewed it, it is possible to see a history of versions and to compare any pair of them. All archival and differencing are performed on a server, using CGI scripts.
Figure 2: Example of HtmlDiff as applied to the WWW-5 home page. It shows, for instance, that the text describing the WWW4 conference was updated once the conference took place, and that an ``SMEs Forum'' link was added to the schedule. Some text is omitted to permit the output to fit on one page. Also, the rows in this table are for demonstrative purposes, for inclusion with borders in this document.
HtmlDiff: Here is the first difference. There are 9 differences on this page.
Fifth International World Wide Web Conference
May 6-11, 1996, Paris, France
General Information
The World Wide Web network Information System is now driving the Internet expansion throughout the World. The World Wide Web was originally created at CERN by Tim Berners-Lee for high-energy physicists and since then, has developed into millions of users from a wide variety of application domains. It is recognized as being of strategic importance for the future development of the global information society. Since 1994, several International WWW Conferences have been organized:
The Fifth International World Wide Web Conference will take place on May 6-11, 1996 at CNIT-Paris La Defense. The CNIT is one of the largest conference and exhibition centers in Europe, located on the western side of [omitted]
Conference & Exhibition Schedule:
Important Dates:
Created: 30 October Last updated:
Figure 2 gives an example of HtmlDiff's output. Bold italics indicate new text, struck-out text indicates deletions, and arrows point to either (including changes to URLs, which are not otherwise highlighted).
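The markup convention just described can be illustrated with a small word-level comparison. This is only a sketch under stated assumptions: it substitutes Python's difflib.SequenceMatcher for whatever comparison HtmlDiff actually performs, it works on plain text rather than HTML, and the function name mark_up_differences is hypothetical. New text is wrapped in bold italics and deleted text is struck out; the arrows of the real output are omitted.

```python
import difflib
import html


def mark_up_differences(old_text, new_text):
    """Return HTML showing new_text with insertions and deletions marked."""
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    pieces = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("delete", "replace"):
            # Words present only in the old version: struck out.
            deleted = html.escape(" ".join(old_words[i1:i2]))
            pieces.append("<strike>" + deleted + "</strike>")
        if op in ("insert", "replace"):
            # Words present only in the new version: bold italics.
            inserted = html.escape(" ".join(new_words[j1:j2]))
            pieces.append("<b><i>" + inserted + "</i></b>")
        if op == "equal":
            pieces.append(html.escape(" ".join(new_words[j1:j2])))
    return " ".join(pieces)
```

For instance, comparing "will take place in Paris" with "took place in Paris" strikes out "will take" and marks "took" as new, leaving the shared words untouched.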
Note that until the functionality of AIDE and Ciao was combined in WebGUIDE, the only interface to AIDE was through simple HTML forms and anchors. Once the volume of pages tracked by a single user exceeds some threshold, or links are followed recursively, more sophisticated interfaces are necessary to provide visual feedback and navigational tools.
3 System Architecture
WebGUIDE consists of four components: a
version and meta-data repository, a robot that tracks modifications, a
difference
engine, and a graph generator.
Pieces of these components have been
described elsewhere [9][6]; we focus
here on how they have
evolved since. The architecture is depicted in Figure 3.
Figure 3: WebGUIDE architecture. AIDE and Ciao
each have their own databases. AIDE stores versions of pages and
information about when pages have been modified, as well as
which users have seen which versions; Ciao has an
entity-relationship database that is extended dynamically as the
result of queries. Ciao accesses the AIDE version repository to
compare versions of pages. All data are stored on a central server
which is accessed via a CGI interface.
The AIDE version repository is a centralized service that archives versions of pages. Unlike a search engine such as Inktomi [13] or Lycos [15], it retrieves and stores only pages that users explicitly request. (Note however that in the extreme case, a user could specify a page that ultimately leads to many other pages, such as Yahoo [20], along with a high level of recursion, and achieve a similar effect.) Pages are stored in RCS [18] format, so storing multiple versions does not result in excessive storage overhead as long as changes are relatively small.
In addition, AIDE maintains a relational database containing meta-data about each page, each user, and the relationships between them. For each URL, it stores the following (among other information):
The robot, like the tools described in Section 5.1, periodically checks pages for updates. It queries the database for all pages that have not been checked within their minimum polling interval. (For pages that are to be checked recursively, the polling frequency for links is typically lower than that of the root page.)
AIDE does not check pages that are ``known'' to be new: if every user who has expressed an interest in a page has already been told that the page has been modified, and none of them has since visited the page through AIDE or viewed its differences, the page is not checked again with the same frequency.
The time of each check is recorded in the database, as well as the new modification time. Modified pages are reported to interested users immediately if requested. The new page is archived automatically if specified by any user.
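The polling rule can be sketched as follows. This is an illustration, not AIDE's implementation: the record layout TrackedPage, the function names, and in particular the KNOWN_NEW_BACKOFF factor are assumptions, since the text says only that a page ``known'' to be new is not checked with the same frequency.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical factor by which polling slows down for pages "known" to be new.
KNOWN_NEW_BACKOFF = 4


@dataclass
class TrackedPage:
    url: str
    poll_interval: timedelta                  # minimum time between checks
    last_checked: datetime
    last_modified: Optional[datetime] = None
    known_new: bool = False                   # all interested users notified, none has looked


def effective_interval(page):
    """Polling interval after applying the (assumed) back-off for known-new pages."""
    if page.known_new:
        return page.poll_interval * KNOWN_NEW_BACKOFF
    return page.poll_interval


def pages_due(pages, now):
    """Return the pages whose polling interval has elapsed."""
    return [p for p in pages if now - p.last_checked >= effective_interval(p)]


def record_check(page, now, modified):
    """Record a check; return True if the page is newer than previously recorded."""
    previous = page.last_modified
    page.last_checked = now
    if modified is not None and (previous is None or modified > previous):
        page.last_modified = modified
        # The first observation only establishes a baseline.
        return previous is not None
    return False
```

A robot loop would call pages_due, fetch each returned URL, and pass the observed modification time to record_check; a True result is the point at which notification and automatic archival would be triggered.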
3.3 HTML Differencing and Recursion
Originally, differencing was done only on a per-page basis, with no notion of recursion [9]. That mode is useful when most pages are checked in isolation, but less so when pages are tracked recursively. Now, one can visit a page with links to modified pages and have those links highlighted. Following such a link invokes HtmlDiff recursively on the new page, and its links are similarly highlighted. Thus one can see the differences between a set of related pages at any points in time for which contents have been archived.
The recursive comparison interface works as follows: the user selects two versions of an HTML document for comparison. The two timestamps associated with these documents define the time range for future document comparison as the user browses. When HtmlDiff compares two documents, it gathers up all the URLs in the document and queries the version repository to determine if there are different versions of the documents specified by the URLs for the two dates. If so, an icon is inserted before the hypertext link; this icon is itself a hypertext link that transfers control back to AIDE in order to compare the versions of the documents. This on-the-fly analysis and annotation of the pages is similar to the transducers of Brooks et al. [3], but for a different purpose.
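The annotation step can be sketched as below. The sketch makes several assumptions that are not from the system itself: the callback versions_differ stands in for the query to the version repository, the CGI path /cgi-bin/htmldiff, its parameter names, and the icon path are invented for illustration, and a regular expression is used in place of a real HTML parser, so only double-quoted href attributes are recognized.

```python
import html
import re
from urllib.parse import urlencode, urljoin

# Simplification: matches opening <a> tags with a double-quoted href only.
ANCHOR_RE = re.compile(r'<a\s[^>]*?href="([^"]+)"[^>]*>', re.IGNORECASE)


def annotate_links(page_url, html_text, old_date, new_date, versions_differ,
                   diff_cgi="/cgi-bin/htmldiff"):
    """Insert a 'changed' icon before each link whose target differs between dates."""

    def annotate(match):
        target = urljoin(page_url, match.group(1))
        if not versions_differ(target, old_date, new_date):
            return match.group(0)          # leave unchanged links alone
        query = urlencode({"url": target, "from": old_date, "to": new_date})
        diff_url = html.escape(diff_cgi + "?" + query)
        icon = ('<a href="' + diff_url + '">'
                '<img src="/icons/changed.gif" alt="[changed]"></a>')
        return icon + match.group(0)       # icon first, then the original anchor

    return ANCHOR_RE.sub(annotate, html_text)
```

Because the icon carries both dates in its query string, following it can re-enter the comparison with the same time range, which is what makes the browsing recursive.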
Clearly, the effectiveness of recursive comparison depends on the quantity of historical information in the version repository. Many URLs will not have any history and will not be filtered. Other URLs may have historical information, but not for the exact dates specified for recursive comparison. In the latter case, we make a number of approximations in order to provide more comparative information. Suppose that the current date is 96.04.01, that the user asks for version comparison between the dates 95.09.20 and 96.03.06, and that for a given URL, versions exist as of 95.10.30, 96.01.01, and 96.03.10. In this case, we use the dates closest to those specified (up to some epsilon interval), so the comparison will use the 95.10.30 and 96.03.10 versions. For another URL, there may only be a version stored for 95.10.15. In this case, we compare the stored version and the current version on the WWW. The epsilon interval used for date approximation may be user-specified.
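The date approximation can be worked through in code. The following sketch assumes an epsilon of 60 days and hypothetical function names; treating the case where both requested dates resolve to the same stored version as "compare against the current page" is also an assumption, made by analogy with the single-version case in the text.

```python
from datetime import date, timedelta

CURRENT = "current"   # sentinel for the live version on the WWW


def nearest_version(versions, target, epsilon):
    """Return the archived date closest to target, or None if none is within epsilon."""
    candidates = [v for v in versions if abs(v - target) <= epsilon]
    return min(candidates, key=lambda v: abs(v - target), default=None)


def choose_versions(versions, old_target, new_target, epsilon):
    """Pick the pair of versions to compare, or None if the URL has no usable history."""
    old = nearest_version(versions, old_target, epsilon)
    new = nearest_version(versions, new_target, epsilon)
    if old is None and new is None:
        return None
    if old is None or new is None or old == new:
        # Only one usable stored version: compare it with the live page.
        stored = old if old is not None else new
        return (stored, CURRENT)
    return (old, new)


EPSILON = timedelta(days=60)   # assumed value; the text leaves it user-specified
requested = (date(1995, 9, 20), date(1996, 3, 6))

# First URL from the text: three archived versions.
first = [date(1995, 10, 30), date(1996, 1, 1), date(1996, 3, 10)]
print(choose_versions(first, *requested, EPSILON))

# Second URL from the text: a single archived version.
second = [date(1995, 10, 15)]
print(choose_versions(second, *requested, EPSILON))
```

With these inputs the first call selects the 95.10.30 and 96.03.10 versions, and the second pairs the 95.10.15 version with the current page, matching the two cases described above.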
Recursive HTML comparison allows users to see that a hypertext link points to a page for which there are changes. However, this only works well for one level of indirection. If the currently viewed page and a changed page are separated by a long chain of unchanged pages, it is bothersome to force the user to step through the unchanged pages to get to the differences. The Ciao graphical interface addresses this problem by providing a graphical overview of the changed pages, allowing the user to quickly navigate to changed pages.
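The navigation problem can be stated as a graph search: find every changed page reachable from the current one, together with the chain of pages leading to it. The sketch below is a plain breadth-first search over an in-memory link table; the dictionary representation and the function name are assumptions for illustration and say nothing about how Ciao stores or queries its database.

```python
from collections import deque


def reachable_changed_pages(links, changed, start):
    """Map each changed page reachable from start to a shortest link path to it.

    links   -- dict: page URL -> list of URLs it links to
    changed -- set of URLs known to have been modified
    """
    paths = {}
    seen = {start}
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        for target in links.get(path[-1], []):
            if target in seen:
                continue               # already reached by a path at least as short
            seen.add(target)
            new_path = path + [target]
            if target in changed:
                paths[target] = new_path
            queue.append(new_path)
    return paths
```

A graphical overview presents the result of such a search at once, so the user can jump directly to a changed page instead of stepping through each unchanged page on the path.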
The graphical view of relationships between URLs of interest to a user, and their states, could be generated in a number of ways. WebGUIDE generates graphs on the fly as embedded images, using a tool called webdot [10]. The images can be clickable, so clicking on a node can invoke another operation. Unfortunately, image maps do not currently support operations other than selecting a URL based on location within the image, so unlike an external application like Ciao [6] or WebMap [8], one cannot click on a node to bring up a menu directly. Instead, it is necessary to go to a URL and have that page provide the user interface to select an operation. We have taken this approach, and will support several operations through this indirect page:
Another approach might be a helper application that would run on the user's machine, external to the browser, as WebMap does. This would be complicated by the need to interact with a database and CGI services on another machine, rather than being self-contained like WebMap, and would require that a user install an external software package.
Perhaps the most elegant approach is to provide full interactive access to the graph using Java [14]. We intend to explore this possibility in the future.
4 A WebGUIDE Session
We are now ready to go through a WebGUIDE session to see how a user might
interact with WebGUIDE to query and navigate changes in a Web repository.
The following example demonstrates how the components of AIDE
and Ciao are combined seamlessly in WebGUIDE to provide effective browsing,
searching, archiving, and differencing capabilities, all under a simple
visual interface.
Figure 4: A Structure Difference Graph for Two
Versions of the AT&T Home Page, http://www.att.com. In the actual
system, this would be an imagemap that would take the user to a form
based on the node selected.
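A structure difference graph of this kind can be approximated by comparing the sets of link targets in two versions of a page and emitting a description in the DOT graph language used by the Graphviz layout tools. The sketch below is an assumption-laden illustration: the function names, the color scheme, and the restriction to a single root page are choices made here, and how WebGUIDE actually passes graphs to webdot is not described in this paper.

```python
def dot_quote(text):
    """Quote a string as a DOT identifier."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


# Assumed color scheme for the three possible states of a link.
STYLE = {
    "added": "color=green",
    "deleted": "color=red, style=dashed",
    "unchanged": "color=black",
}


def structure_difference_dot(root, old_targets, new_targets):
    """Return DOT text for the links of root, colored by how they changed."""
    old_targets, new_targets = set(old_targets), set(new_targets)
    lines = ["digraph structure_diff {", "  " + dot_quote(root) + ";"]
    for target in sorted(old_targets | new_targets):
        if target not in old_targets:
            status = "added"
        elif target not in new_targets:
            status = "deleted"
        else:
            status = "unchanged"
        attrs = STYLE[status]
        lines.append("  " + dot_quote(target) + " [" + attrs + "];")
        lines.append("  " + dot_quote(root) + " -> " + dot_quote(target)
                     + " [" + attrs + "];")
    lines.append("}")
    return "\n".join(lines)
```

The two target sets could come from abstracting each archived version of the page; images or other entity kinds would be handled the same way with an extra node attribute.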
5 Related Work
Related work falls in two categories: tracking modifications,
discussed in Section 5.1, and browsing tools,
in Section 5.2.
A number of tools will watch for modifications of Web pages and notify the user, as a list of updated pages [7][17], annotations in the bookmark view [16], or email [19]. While the last of these runs as a centralized service that periodically checks each URL once on behalf of a large community of users, the others run on the user's own machine. Running locally has the advantage of privacy, as well as giving the polling mechanism access to authenticated URLs, but also the disadvantage of not scaling well to an Internet-wide community.
AIDE expands upon these tools by providing a version archive and automatic differencing [2][9]. Many users can share a single archive. Tracking modifications was originally handled by users individually but is migrating to the central server.
Chawathe, et al., recently developed algorithms for detecting how hierarchically structured documents change [5]. They can identify not only additions and deletions but also the movement of substantially similar subtrees from one point in a document to another. Their initial prototype is specialized for LaTeX documents, but they note its applicability to HTML, and we intend to investigate the incorporation of their algorithms into HtmlDiff.
WebMap [8] is a tool for visualizing relationships between pages, particularly ones that a user has previously viewed with a Web browser. It relies on notification from the browser to keep its list of nodes, and their relationships, up to date. It also can send commands back to the browser to visit a URL represented by a node in the graph. Dömel additionally describes a ``domain'' concept, which can be used to group pages together, for instance to print a set of related pages. It does not support differencing or notification of modifications.
Hyper-G [1] supports automatic maintenance of large datasets, including hierarchical navigation and searches based on attributes or contents. Its underlying database includes such things as the links between entities, much as Ciao-HTML does, providing an efficient query mechanism. However, Hyper-G is a layer above such things as HTML (its native language is the Hyper-G Text Format), and it works with a specific set of documents stored within a collection of Hyper-G servers. WebGUIDE works on all documents in the Web.
GlimpseHTTP [12] uses the ``neighborhood'' of a Web page to limit searches. This is done by indexing a directory hierarchy and providing a CGI query facility to perform full-text searches of the hierarchy or any subcomponent. It is being extended (as something called gweb) to retrieve pages over the Web and index them locally. Its automated searches have similarities to our system, but it is oriented toward full-text indexing rather than differencing or meta-queries.
DeckScape [4] is a browser that changes what is currently a ``standard'' depth-first paradigm into a multi-level one. Each ``deck'' is traversed linearly, but separate decks may be maintained in parallel, and documents may be moved between decks. Thus users have more control over navigating a set of disjoint pages that are either unrelated or branch out from a common ancestor, and can switch back and forth among unrelated pages. More importantly, unlike the common browsers (e.g., Netscape or Mosaic), state (the contents of decks) is preserved across invocations, so users can retain context over time. We view DeckScape as complementary to WebGUIDE, since the enhanced navigational abilities offered by Ciao-HTML (or WebMap) should be directly supported by browsers. However, neither DeckScape nor WebMap supports extended queries about the relationships between nodes.
6 Status and Future Work
To date, we have extended the AIDE CGI system and HtmlDiff
application to support recursive differencing and recursive archival
of a page and its immediate descendants; and we have integrated the
Ciao analysis facilities, the webdot HTML graph generation,
and the AIDE CGI system to support graphical display of structural
differences between two versions of a page, such as the addition or
deletion of an anchor or image.
We must still complete the migration of the AIDE notification component into the CGI system, so that it has current knowledge of when pages change, and then make Ciao query the AIDE database to highlight anchors that point to changed pages. We must also provide increased functionality from the webdot graph, allowing users to expand and query links as well as visit the nodes they represent.
A couple of lingering issues remain. One is the way the AIDE repository evolves over time. If a user loses interest in a page, it need not be tracked if no other users have registered an interest, but what about its version archive? Worse, what if the URL is deleted or relocated? Over time it might be desirable to delete or merge old versions, or it might be appropriate to keep historical data for archival purposes.
Another issue is that of copyright. One can argue that archiving a page on the Web does not violate copyright any more than caching it in a proxy-caching server for an indefinite interval; however, archiving it permanently stretches the limits of this comparison. Furthermore, to what extent does highlighting differences constitute an infringement as a "derivative work"? Since HTML does not dictate how a viewer displays particular markup, it is unclear whether the owner of HTML content can dictate exactly which fonts are used in a document. It would be preferable, though, to evolve to a standard in which content providers would explicitly authorize archival and differencing within their documents.