WWW5 Fifth International World Wide Web Conference
May 6-10, 1996, Paris, France


Interactive Video on WWW: Beyond VCR-like Interfaces

Arun Katkere - Jennifer Schlenzig - Amarnath Gupta - Ramesh Jain

Contact email: katkere@ucsd.edu
Visual Computing Laboratory
University of California, San Diego
9500 Gilman Drive, Mail Code 0407
La Jolla, CA 92093-0407, USA

Abstract:

The WWW is evolving into a predominantly visual medium. The demand for access to images and video has been increasing rapidly. Interactive video systems, which provide access to the content in video archives, are starting to emerge on the WWW. Partly due to the two-dimensional nature of the web, and partly due to the fact that the images that comprise the video are two-dimensional, most of these systems provide a VCR-like interface (play, fast-forward, reverse, etc., with additions like object selection, motion specification in the image space, and viewpoint selection). The basis of this paper is the realization that the video streams represent projections of a three-dimensional world, and that the user is interested in this three-dimensional content and not the actual configuration of pixels in the image space. In this paper, we justify this intuition by enumerating the information-bearing entities that the user is interested in, and the information specification mechanisms that allow the user to query upon these entities. We describe how such an intuitive system could be implemented using WWW technologies -- VRML, HTML, and HTTP -- and present our current WWW prototype, which is based on extensions to some of these standards. This system is built on top of our multiple perspective interactive video (MPI Video) paradigm, which provides a framework for the management of and interactive access to multiple streams of video data capturing different perspectives of related events.

1. Introduction

  In a very short time, the World Wide Web has emerged as the most powerful framework for locating and accessing remote, distributed information. A number of protocols and interfaces have been designed for the many different kinds of information. For navigational access to documents with text, images, and references, the hypertext metaphor for information request has been most popular. For database-style search, both keyword and forms-based interfaces have been developed; these are essentially Web-extended (or HTML-enhanced) versions of individual native database languages[1, 24]. For most applications, these different needs of information access and manipulation do not cross boundaries. Thus virtual reality users do not make information-browsing queries, and hypertext document surfers typically do not navigate in a three-dimensional world. But why not? A truly collaborative virtual work environment must allow users to access documents, and 3D visualization of schemas would surely improve user-database interaction. The purpose of this paper is to advocate the use of three-dimensional user interfaces as a means of accessing various types of data on the World Wide Web. Specifically, we address these issues in the context of multiple perspective interactive video (MPI Video)[6, 9].

Currently, the popular interface to interactive video resembles an enhanced VCR interface, which allows only brief, sporadic feedback from the user. This limited interaction provides no support for querying a database beyond simple ``who'' queries. To achieve interactive video we must empower the user to manipulate the spatio-temporal content of the video. In addition, it is within the province of the interface to offer more than button clicks and mouse movements. Environments such as ALIVE and those that incorporate gesture understanding[17] will have the greatest potential as interactive interfaces.

It is the paradigm of MPI Video, described in more detail in Section 2, which demands and in fact enables this level of interaction. More than just a collection of video streams, the MPI Video environment is a heterogeneous, distributed information structure. The primary source of information is a number of live video streams acquired from a set of cameras covering a closed environment such as a football game. This environment has a static component consisting of a model of the environment which resides on a server. The server also contains a library of possible dynamic objects that can appear in the environment. Multiple sensors capture the event and the system dynamically reconstructs a sequence of camera-independent three-dimensional scenes from the video streams using computer vision techniques[7]. In MPI Video the role of the user is to view and navigate in this world as the real-life event unfolds. While remaining in this world, the user may also request additional information on any static or dynamic object. Secondary information resources such as hyper-linked HTML documents, databases of static images, and ftp sites of reference archives are available to the system and may need to be accessed either to initiate a user query or as the result of a query.

In Section 3 of this paper we propose a set of information classes that can be formulated in the MPI Video environment. We demonstrate why without a three-dimensional interface the user would lose the potential expressive power required in this paradigm. In Section 4, we elaborate on our information exchange architecture and how it supports the current query specification interface. In Section 5 we conclude the paper with a discussion of our future work plan.

2. The MPI Video paradigm

 

Figure 1: MPI Video System Architecture Overview

Multiple Perspective Interactive Video[6], MPI Video, provides a framework for the management of and interactive access to multiple streams of video data capturing different perspectives of related events[9]. MPI Video has dominant database and hypermedia components which allow a user not only to interact with live events but also to browse the underlying database for similar or related events and to construct interesting queries.

Figure 2: Different Layers of the environment model. Arrows indicate data input either from other layers or from sensors.

The MPI Video architecture shown in Figure 1[6, 9] has the following components:

  1. Video Data Analyzer: The MPI Video system must detect and recognize objects of potential interest and their locations in the scene. This requires powerful image segmentation methods. For structured applications, one may use knowledge of the domain and may even change or label objects to make the segmentation task easier.
  2. Environment Model Builder: Individual camera scenes will be combined in this system to form a model of the environment. All potential objects of interest and their locations will be recorded in the environment model. The representation of the environment model depends on the facilities provided to the viewer.
  3. Viewer Interface: A viewer is able to select the perspective that he or she desires. This information should be obtained from the user in a friendly but directed manner.
  4. View Selector: The view selector responds to the user's request by selecting appropriate images to be displayed. These images may all come from one perspective or the system may have to select the best camera at every point in time to display the selected view and perspective.
  5. Video Database: If the event is not a real time event, then it is possible to store the episode in a video database. Each camera sequence will be stored along with its metadata. Some of the metadata is feature based and allows content-based operations[5, 21]. Data can also be collected during a real time event and stored for later use.
  6. Virtual View Builder: A particularly important component of MPI Video is Immersive Video[12], in which a virtual camera is created for the viewer by combining the extracted model with the original video streams, giving a sense of omniscient presence. The viewer in an Immersive Video environment is no longer constrained by the limitations of a physical camera.

Three aspects central to this architecture are[9]:

  1. Video data analysis and the assimilation of the multiple streams to form a single integrated world representation, including selection of a ``best view'' from the input data streams (a minimal best-view sketch follows this list).
  2. A database subsystem which stores the raw video data, the derived data generated by the video analysis portion and any meta-data input by the user. The database supports content-based query operations by the user or software agents.
  3.   A hypermedia interface which supports navigation and querying of the wealth of data input to and derived by the system[22].
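
The ``best view'' selection in the first aspect can be made concrete with a small sketch. The Java fragment below is a minimal illustration under stated assumptions, not the system's actual algorithm: it supposes each camera reports a visibility fraction and a viewing distance for the object of interest, and it picks the highest-scoring view. All class and field names are hypothetical.

    // Hypothetical sketch: pick the camera whose view best covers a requested
    // object, scoring each candidate view by visibility and distance.
    import java.util.List;

    public class BestViewSelector {

        // Per-camera observation of one tracked object (names are illustrative).
        public static class CameraView {
            final int cameraId;
            final double visibleFraction;  // 0.0 (occluded) .. 1.0 (fully visible)
            final double viewingDistance;  // meters from camera to object

            CameraView(int cameraId, double visibleFraction, double viewingDistance) {
                this.cameraId = cameraId;
                this.visibleFraction = visibleFraction;
                this.viewingDistance = viewingDistance;
            }

            // Favor unoccluded, nearby views; the weighting is a placeholder.
            double score() {
                return visibleFraction / (1.0 + viewingDistance);
            }
        }

        // Return the id of the best camera for this time instant
        // (assumes at least one candidate view).
        public static int selectBestView(List<CameraView> views) {
            CameraView best = views.get(0);
            for (CameraView v : views) {
                if (v.score() > best.score()) best = v;
            }
            return best.cameraId;
        }

        public static void main(String[] args) {
            List<CameraView> views = List.of(
                new CameraView(1, 0.9, 20.0),
                new CameraView(2, 0.4, 8.0),
                new CameraView(3, 1.0, 15.0));
            System.out.println("Best camera: " + selectBestView(views));
        }
    }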

In this paper, we describe an interface to MPI Video in which the user primarily interacts with the system using an intuitive three-dimensional metaphor[8]. A basic WWW system has been built using behavior-enhanced VRML (e.g., VRBS[13], the upcoming VRML 2.x standard, etc.), HTML forms, and CGI scripts. Extensions to some of these are suggested in order to achieve a functional interactive video interface on the WWW.

2.1 MPI Video modeling

  An important component of an MPI Video system is the Environment Model, a coherent, dynamic, multi-layered, three-dimensional representation of the content in the video streams (Figure 2[7]). It is this view-independent, task-dependent model that bridges the gap between two-dimensional image arrays, which by themselves have no meaning, and the complex information requirements placed on the system by users and other components.
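
To make the layered representation concrete, here is a minimal sketch of a two-layer environment model: a static layer for scene geometry plus a time-indexed layer of dynamic objects. The classes, fields, and the file name are illustrative assumptions, not the system's actual schema.

    // Illustrative sketch of a layered environment model: a static layer for
    // the scene geometry plus a dynamic layer of tracked objects per frame.
    import java.util.HashMap;
    import java.util.Map;

    public class EnvironmentModel {

        // Static layer: fixed scene geometry, e.g. a reference to a VRML file.
        public static class StaticLayer {
            final String sceneDescription;
            StaticLayer(String sceneDescription) { this.sceneDescription = sceneDescription; }
        }

        // Dynamic layer entry: where a tracked object is at a given frame.
        public static class ObjectState {
            final double x, y, z;  // world coordinates, meters
            ObjectState(double x, double y, double z) { this.x = x; this.y = y; this.z = z; }
        }

        final StaticLayer staticLayer;
        // frame number -> (object id -> state)
        final Map<Integer, Map<Integer, ObjectState>> dynamicLayer = new HashMap<>();

        EnvironmentModel(StaticLayer staticLayer) { this.staticLayer = staticLayer; }

        void record(int frame, int objectId, ObjectState state) {
            dynamicLayer.computeIfAbsent(frame, f -> new HashMap<>()).put(objectId, state);
        }

        ObjectState lookup(int frame, int objectId) {
            Map<Integer, ObjectState> atFrame = dynamicLayer.get(frame);
            return atFrame == null ? null : atFrame.get(objectId);
        }

        public static void main(String[] args) {
            // "courtyard.wrl" is a hypothetical scene file name.
            EnvironmentModel model = new EnvironmentModel(new StaticLayer("courtyard.wrl"));
            model.record(340, 12, new ObjectState(4.0, 0.0, 7.5));
            ObjectState s = model.lookup(340, 12);
            System.out.println("object 12 at frame 340: " + s.x + ", " + s.y + ", " + s.z);
        }
    }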

The transformation of video data into objects in the environment model is an ill-posed and difficult problem. With the extra information provided by the multiple perspective video data, however, and with certain realistic assumptions that hold in a large class of applications, it is possible to construct accurate models robustly and quickly. In our current MPI Video systems, we make the following assumptions[8]:

In addition, we use the following sources of information extensively:

3. Information Specification in MPI Video

  Expressiveness of interaction is fundamental to the design of any user interface model. This expressiveness can be achieved by using an information visualization metaphor, as used by several database browsing and visualization groups[3]. Our motivation for developing a three-dimensional interface for MPI Video stems from the intuition that if a user is given a three-dimensional world changing dynamically with time, he or she can meaningfully operate in this world (i.e., specify an object, a space, or a search condition) only with the ability to navigate and act in it.

Intuitively, a three-dimensional interface would be extremely useful because:

Query Specification
It provides a natural way to specify several types of queries, such as those involving spatial relationships (a sketch of such a spatial query follows this list).
Infinite Perspectives
Unlimited control over viewpoint allows a viewer to observe ``interesting'' actions from a convenient perspective.
Selective Viewing
Unlike video, which is often cluttered, only interesting objects need be displayed.
Query Result Visualization
The results of many types of queries are presented better in 3D.
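
As an illustration of the query specification benefit, the sketch below shows the kind of query that a 3D region specification makes natural: ``which objects entered this box between two frames?''. The Box and Sighting types are hypothetical stand-ins for the system's internal representations.

    // Sketch of a spatial query over a box the user sweeps out in 3D.
    import java.util.ArrayList;
    import java.util.List;

    public class SpatialQuery {

        // An axis-aligned box defining a region of interest.
        public static class Box {
            final double minX, minY, minZ, maxX, maxY, maxZ;
            Box(double minX, double minY, double minZ,
                double maxX, double maxY, double maxZ) {
                this.minX = minX; this.minY = minY; this.minZ = minZ;
                this.maxX = maxX; this.maxY = maxY; this.maxZ = maxZ;
            }
            boolean contains(double x, double y, double z) {
                return x >= minX && x <= maxX && y >= minY && y <= maxY
                    && z >= minZ && z <= maxZ;
            }
        }

        // One time-stamped observation of a tracked object.
        public record Sighting(int objectId, int frame, double x, double y, double z) {}

        // "Which objects entered this box between frames t1 and t2?"
        static List<Integer> objectsIn(List<Sighting> track, Box box, int t1, int t2) {
            List<Integer> hits = new ArrayList<>();
            for (Sighting s : track)
                if (s.frame() >= t1 && s.frame() <= t2
                        && box.contains(s.x(), s.y(), s.z())
                        && !hits.contains(s.objectId()))
                    hits.add(s.objectId());
            return hits;
        }

        public static void main(String[] args) {
            Box here = new Box(0, 0, 0, 5, 3, 5);
            List<Sighting> track = List.of(
                new Sighting(12, 100, 2.0, 1.0, 2.0),
                new Sighting(7, 150, 9.0, 1.0, 9.0));
            System.out.println(objectsIn(track, here, 50, 200));  // prints [12]
        }
    }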

3.1 Information-bearing entities in MPI Video

  To substantiate this intuition, let us first specify the information-bearing entities in MPI Video.

3.2 Functional requirements

  Next, let us explore what operations need to be performed in the World Wide Web to define, update and manipulate the above information categories. The interface needs to allow:

4. MPI Video information exchange architecture

  For a large number of users to access MPI Video archives, the interface has to be widely accessible in addition to being intuitive and easy to use. In this section, we describe how we can accomplish the user interactions described in Section 3 using the existing WWW protocols (such as HTTP) and languages (such as HTML, VRML, and Java). With languages such as VRML and Java in a nascent stage, some enhancements are needed to implement even a rudimentary system. Wherever possible, our current implementations and proposed systems build upon expected language enhancements.

4.1 Current prototype

 

Figure 4: Schematic of the MPI Video interface showing the video data streams and the remote server, and the interface at the local user site. The local interface uses an HTML browser for initiating form-based queries and displaying text- and image-based system information, and a VRML browser for interacting with a three-dimensional dynamic model of the underlying video data. Interactions between the different components are also shown. The client side is made dynamic and intelligent using behaviors.

Figure 4 shows schematically the various components of a WWW-based MPI Video system and some interactions between the components. To implement such a WWW-based MPI Video system, we need technologies for:

Unfortunately, current WWW technology is designed to present, and provide rudimentary interactions with, two-dimensional layouts. While this has proven sufficient for most of the current WWW-based interactive video and video database systems[19, 23], our user information specification paradigm, which allows users to interact with the system at the content level instead of the data level, cannot be easily implemented with this technology.

VRML[14], which is being designed primarily for multiuser interactions (``a scalable, fully interactive cyberspace''[15] such as Stephenson's Metaverse[20]), is currently usable as a way of presenting static 3D content on the WWW. For our current implementations, we use a behavior-enhanced VRML prototype, VRBS[13], to present dynamic 3D content as well as to provide rudimentary 3D interactions. Because this implementation uses VRBS, an experimental VRML behavior system that is not widely used, we cannot make it widely accessible on the WWW until it is reimplemented using VRML 2.0 in a few months. At the time of writing, standardization of behaviors in VRML is underway, and the features we need to provide complex user interactions are being discussed (e.g., the Moving Worlds proposal[18]).

Figure 5: Snapshot of an MPI Video WWW interface prototype showing dynamic models based on live events in a 3D (VRML) browser, a set of associated queries in an HTML browser and results of some queries.

Figure 5 shows a sample session at a client (user) site which uses the VRML browser, the HTML browser, and other applications[8]. The information is presented using the VRML browser, with interesting dynamic and static portions of the scene hyper-linked to either related VRML worlds or HTML query forms.

For the session shown in Figure 5, we used a campus courtyard with pedestrians, covered by six video-resolution cameras. The video sequence was digitized at 10 frames per second and processed by the MPI Video modeling system. (A QuickTime video segment is available.) The dynamic objects in the environment were detected and tracked to create a database of object ids, object locations, video clips, and related information, built using flat files and dbm files.

The interface to this database operates at two levels: via CGI scripts that interact with the underlying database, and via the VRML and HTML browsers that construct queries based on user input. All context information required to answer these queries is encoded as parameters to the CGI programs. Currently, the server answers all queries. As we will discuss in Section 4.3, since information about objects, their locations, and their structure is continuously sent to each client, the client has the necessary information to answer a large class of queries without going to the server.
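
To make the context passing concrete: a hypothetical query URL might read http://server/cgi-bin/mpiquery?type=locate&object=12&frame=340, with all context carried in the parameters. The minimal CGI-style program below decodes such a stateless request; the program name and parameter names are assumptions for illustration, not our actual scripts.

    // Minimal CGI-style sketch: context travels entirely in the query string,
    // so the stateless server can answer without per-user session state.
    import java.util.HashMap;
    import java.util.Map;

    public class MpiQueryCgi {
        public static void main(String[] args) {
            // CGI passes the URL parameters in the QUERY_STRING variable,
            // e.g. "type=locate&object=12&frame=340".
            String query = System.getenv("QUERY_STRING");
            Map<String, String> params = new HashMap<>();
            if (query != null) {
                for (String pair : query.split("&")) {
                    String[] kv = pair.split("=", 2);
                    if (kv.length == 2) params.put(kv[0], kv[1]);
                }
            }
            // A CGI program writes its HTTP response to standard output:
            // a header, a blank line, then the body.
            System.out.println("Content-Type: text/html\n");
            System.out.println("<p>Query type: " + params.get("type")
                + ", object: " + params.get("object")
                + ", frame: " + params.get("frame") + "</p>");
        }
    }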

Using a combination of VRML and HTML browsers, the current system handles several types of queries:

To handle these queries, two client-side behaviors are implemented (a language-neutral sketch follows the list):

UpdateState
This behavior is called periodically to update the state of the world. The new state is downloaded from the server.
Monitor
This behavior is created on demand when the user requests monitoring of a certain region. Currently, since the behavior system (VRBS) does not have any ``sensors'', this behavior has to be called periodically. With the addition of sensors, this behavior could be called only when an event such as an object entering the specified region occurs.
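
In the prototype these behaviors run inside the VRML browser via VRBS. As a language-neutral illustration only, the Java sketch below uses an ordinary polling loop to stand in for the browser's behavior engine; the update download and the region test are elided placeholders, not our actual code.

    // Polling stand-in for the two client-side behaviors described above.
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class ClientBehaviors {

        // UpdateState: periodically download the new world state from the
        // server and move the corresponding scene objects.
        static void updateState() {
            // ... fetch object positions from the server (omitted) ...
            System.out.println("UpdateState: refreshed world state");
        }

        // Monitor: poll a watched region. With VRML "sensors" this would
        // instead fire only when an object actually enters the region.
        static void monitor() {
            boolean objectInRegion = false;  // placeholder for a real region test
            if (objectInRegion) {
                System.out.println("Monitor: object entered region");
            }
        }

        public static void main(String[] args) {
            ScheduledExecutorService ticker = Executors.newScheduledThreadPool(2);
            ticker.scheduleAtFixedRate(ClientBehaviors::updateState,
                                       0, 100, TimeUnit.MILLISECONDS);
            ticker.scheduleAtFixedRate(ClientBehaviors::monitor,
                                       0, 100, TimeUnit.MILLISECONDS);
        }
    }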

4.2 Handling user interactions

  While the prototype described above is useful for testing all forms of communication -- server and VRML browser, server and VRML behaviors, VRML behaviors and browser, server and HTML browser, VRML browser and HTML browser, etc. -- it supports only limited user interaction in 3D: viewpoint control and object selection. Even with the other queries handled using HTML forms, this prototype is a significant step towards intuitive interactive video interfaces. To advance the interface further, we need to handle the types of user interactions described in Section 3.2 using VRML. The discussion in this section is based on the current version of one of the VRML 2.0 proposals[18]; we can safely assume that the VRML 2.0 standard will provide similar functionality.

A toolkit of editable basic shapes (such as lines, cubes, and cylinders) is used to define paths, regions, and volumes of interest and to construct query objects. With a version of VRML that supports scripting, it is possible to implement a simple object construction suite. A more interesting question is how these objects are associated with queries. For example, if a user wants to ask the system ``did anybody come here?'', the user's definition of here has to be somehow associated with the form where the query is being constructed. A somewhat circuitous but feasible method is to ask the user to label each object she creates and to use those labels as parameters to queries. A more elegant solution is to allow the user to drag and drop objects; this requires the browsers to espouse a standard such as OpenDoc[16], or the different browsers to be integrated.
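
The labeling method can be sketched as a simple registry shared between the two browsers: the 3D construction suite records each labeled query object, and the HTML form refers to it by name. The Region type and the method names below are illustrative assumptions only.

    // Sketch of the labeling workaround: user-built query objects are named
    // in the 3D browser and resolved by name when a form query arrives.
    import java.util.HashMap;
    import java.util.Map;

    public class QueryObjectRegistry {

        // A simple spherical region of interest (illustrative).
        public static class Region {
            final double x, y, z, radius;
            Region(double x, double y, double z, double radius) {
                this.x = x; this.y = y; this.z = z; this.radius = radius;
            }
        }

        private final Map<String, Region> byLabel = new HashMap<>();

        // Called when the user labels a shape in the 3D construction suite.
        void define(String label, Region region) { byLabel.put(label, region); }

        // Called when an HTML form query names the region, e.g. "here".
        Region resolve(String label) { return byLabel.get(label); }

        public static void main(String[] args) {
            QueryObjectRegistry registry = new QueryObjectRegistry();
            registry.define("here", new Region(4.0, 0.0, 7.5, 2.0));
            // "Did anybody come here?" becomes a query parameterized by "here".
            System.out.println("here -> radius " + registry.resolve("here").radius);
        }
    }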

4.3 Clients and servers

  WWW is currently based on a client-server model[4], and an MPI Video system could be implemented in this framework; our current prototypes use this model of interaction. When the system is to be accessed by a large number of users, especially in a live-event scenario, several problems arise: the load on the server grows with the number of querying clients; handling a user's context information over a stateless protocol such as HTTP becomes more difficult; and appreciable delay in the query response is counterproductive.

4.3.1 Intelligent clients

  In our case, the server is already sending the clients continuous information about object positions and structure. If each client caches this information and has the intelligence to answer frequently asked queries by itself, the load on the server is reduced, and the response is faster than forwarding the query to the server and waiting for a reply. The user's context is stored at the client and passed to the server when necessary. This model raises two key issues: specifying client-side intelligence, and determining the default environment model entities that are sent to a client at every time instant so that most queries can be handled by the client.

Client-side intelligence is specified using a safe language such as Java; typically this is the same language used for scripting behaviors in VRML. Because we do not want to send a general-purpose query handling engine to each client, the client-side query handling logic is, to a certain extent, domain dependent. Determining the default environment model is a much harder problem and is highly domain dependent. For example, the frequently queried entities in a football game are not the same as those queried in an interactive drama. An interesting strategy that could be followed is to start with a minimal default set, with minimal client query handling, and to incrementally enhance both based on server access statistics. Exactly how this is achieved is currently under investigation.
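
A minimal sketch of such an intelligent client follows, assuming a simple position cache fed by server updates and a server fallback for cache misses; the cache contents and the server call are placeholders, not our actual protocol.

    // Local-first client: answer from the cached environment model when
    // possible, otherwise forward the query to the server.
    import java.util.HashMap;
    import java.util.Map;

    public class IntelligentClient {

        // Cached object positions, refreshed continuously by server updates.
        private final Map<Integer, double[]> positionCache = new HashMap<>();

        void onServerUpdate(int objectId, double[] xyz) {
            positionCache.put(objectId, xyz);
        }

        // "Where is object N?" -- answered locally if the cache has it.
        double[] whereIs(int objectId) {
            double[] cached = positionCache.get(objectId);
            if (cached != null) return cached;  // no server round trip
            return askServer(objectId);         // fall back to the server
        }

        private double[] askServer(int objectId) {
            // ... HTTP request to the MPI Video server (omitted) ...
            return new double[] {0, 0, 0};
        }

        public static void main(String[] args) {
            IntelligentClient client = new IntelligentClient();
            client.onServerUpdate(12, new double[] {4.0, 0.0, 7.5});
            System.out.println("object 12 x = " + client.whereIs(12)[0]);  // local hit
        }
    }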

4.3.2 Multicasting the environment model

  If we have the client-side functionality described above, then for live events incremental updates to the default environment model may be multicast to the clients. This approach reduces the load on the server and engenders scalability. In this scenario, a new client joining the multicast session contacts the server (a nearby server in the case of multiple servers) to download the current environment model and a default context. Alternatively, the client may choose to download the environment model and the context from a nearby ``friend''. After this bootstrapping, the client monitors the current state by listening to the multicast channel for environment model updates. When the user queries, the client first consults its local environment model to see if it can answer the query locally. If the system is designed correctly, the information will be available locally most of the time. Queries that cannot be handled locally are passed to the server, and the client may choose to augment its environment model with the results.
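
The bootstrap-then-listen behavior can be sketched with standard IP multicast. In the fragment below, the group address, port, and packet handling are assumptions for illustration only.

    // Client bootstrap, then listen for incremental environment-model updates.
    import java.net.DatagramPacket;
    import java.net.InetAddress;
    import java.net.MulticastSocket;

    public class ModelUpdateListener {
        public static void main(String[] args) throws Exception {
            // 1. Bootstrap: download the full current environment model and a
            //    default context from the server over HTTP (omitted).

            // 2. Join the update channel; the group and port are hypothetical.
            InetAddress group = InetAddress.getByName("230.0.0.1");
            try (MulticastSocket socket = new MulticastSocket(4567)) {
                socket.joinGroup(group);
                byte[] buffer = new byte[1500];
                while (true) {
                    DatagramPacket packet = new DatagramPacket(buffer, buffer.length);
                    socket.receive(packet);
                    // ... decode and apply the incremental update (omitted) ...
                    System.out.println("update: " + packet.getLength() + " bytes");
                }
            }
        }
    }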

This approach can be extended to handle clients (and networks) with different capabilities. The environment model, shown in Figure 2, is made up of easily decomposable layers. Hence, akin to the layered video concept of the MBONE[10], we can multicast the environment model on several channels. Based on its capabilities (whether it can handle 10k polygons per second or a 4-joint articulated model) or the network bandwidth, the client can choose to listen to a subset of the available channels.
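
Channel selection by capability might look like the following sketch, which assumes one multicast group per model layer and a per-client polygon budget; the layer names, groups, and thresholds are invented for illustration.

    // Pick the model layers (multicast channels) a client can afford.
    import java.util.List;

    public class LayerSelector {

        record Layer(String name, String group, int polygonsPerSecond) {}

        static List<Layer> choose(List<Layer> available, int polygonBudget) {
            return available.stream()
                .filter(l -> l.polygonsPerSecond() <= polygonBudget)
                .toList();
        }

        public static void main(String[] args) {
            List<Layer> layers = List.of(
                new Layer("coarse-blobs", "230.0.0.1", 1_000),
                new Layer("articulated", "230.0.0.2", 5_000),
                new Layer("full-geometry", "230.0.0.3", 50_000));
            // A client with a 10k polygons/second budget skips the
            // full-geometry layer and listens only to the first two groups.
            choose(layers, 10_000).forEach(l -> System.out.println(l.name()));
        }
    }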

4.4 Language for the environment model

  Information in the MPI Video paradigm is shared through the environment model. How can this information be exchanged on the WWW? VRML, which is convenient for representing graphical entities, is a good starting point. In addition, we need the ability to represent multi-modal information seamlessly, in both the continuous and discrete domains (``all media are equal''[11]). We should also be able to add semantics to the environment model, establishing links between different entities in the environment (e.g., this set of polygons is Person A's left hand).

Another issue is the encoding and decoding of the environment model. In a WWW-based scenario, the server should spend time on the creation of the environment model if this helps the client transform the model into usable information faster. For instance, several higher-level entities in the environment model are deducible from the voxels that represent occupancy; from a WWW perspective, it is more efficient to compute these higher-level entities at the server than to have every client derive them.
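
The centroid of an object is one such higher-level entity, easily derived from occupancy voxels. The sketch below assumes a boolean voxel grid; the representation is an assumption, not the system's actual model format.

    // Server-side derivation of a higher-level entity from occupancy voxels:
    // the centroid of the occupied cells of one object.
    public class VoxelSummary {

        // occupied[x][y][z] == true where the object fills space
        static double[] centroid(boolean[][][] occupied) {
            double sx = 0, sy = 0, sz = 0;
            long count = 0;
            for (int x = 0; x < occupied.length; x++)
                for (int y = 0; y < occupied[x].length; y++)
                    for (int z = 0; z < occupied[x][y].length; z++)
                        if (occupied[x][y][z]) { sx += x; sy += y; sz += z; count++; }
            return count == 0 ? null
                : new double[] { sx / count, sy / count, sz / count };
        }

        public static void main(String[] args) {
            boolean[][][] grid = new boolean[4][4][4];
            grid[1][2][3] = true;
            grid[3][2][1] = true;
            double[] c = centroid(grid);
            System.out.println(c[0] + ", " + c[1] + ", " + c[2]);  // 2.0, 2.0, 2.0
        }
    }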

5. Conclusion

  Video is spatio-temporal data. To fully access the available information we must move beyond the preconceived notions of the VCR interface and keep in mind that:

  1. Interactive TV is more than video-on-demand. Providing the user with only the capability to download videos at a convenient time, or select merchandise for purchase, ignores the fact that the scene captured by the video is inherently three-dimensional. It is this 3D data which the user wishes to manipulate.
  2. User-desired interactions require a 3D interface. Only 3D will support the desirable query-by-example.
  3. Our current implementation is a step towards this goal, but assistance from the World Wide Web community is needed to enhance protocols which can support MPI Video. The worldwide success of any web-based application depends on the presence of standards which allow communication in a heterogeneous environment.

References

1
J. Boyle, J. E. Fothergill, and P. M. Gray. Design of a 3D user interface to a database. In J. Lee and G. Grinstein, editors, Database Issues for Data Visualization. IEEE Visualization '93 Workshop. Berlin, Germany: Springer-Verlag, 1994.

2
L. Campbell and A. Bobick. Recognition of human body motion using phase space constraints. Technical Report 309, MIT Media Laboratory, Perceptual Computing Section, MIT, Cambridge, MA, 1995.

3
C. Graham. Database visualization and VRML. In S. N. Spencer, editor, First Annual Symposium on the Virtual Reality Modeling Language, pages 21-24, San Diego, CA, Dec. 13-15, 1995. ACM Press.

4
K. Hughes. Entering the World-Wide Web: A Guide to Cyberspace. WWW document, Oct. 1993.

5
R. Jain and A. Hampapur. Metadata in Video Databases. In SIGMOD Record: Special Issue On Metadata For Digital Media. ACM: SIGMOD, Dec. 1994.

6
R. Jain and K. Wakimoto. Multiple Perspective Interactive Video. In Proceedings of the International Conference on Multimedia Computing and Systems, pages 202-211, Washington, DC, USA, May 15-18, 1995. Los Alamitos, CA, USA: IEEE Computer Society Press.

7
A. Katkere, S. Moezzi, D. Kuramura, P. Kelly, and R. Jain. Towards video-based immersive environments. ACM-Springer Multimedia Systems Journal: Special Issue on Multimedia and Multisensory Virtual Worlds, Spring 1996.

8
A. Katkere, J. Schlenzig, and R. Jain. VRML-Based WWW interface to MPI Video. In S. N. Spencer, editor, First Annual Symposium on the Virtual Reality Modeling Language, pages 25-32, 137, San Diego, CA, Dec. 13-15, 1995. ACM Press.

9
P. H. Kelly, A. Katkere, D. Y. Kuramura, S. Moezzi, S. Chatterjee, and R. Jain. An architecture for Multiple Perspective Interactive Video. In ACM Multimedia 1995 Proceedings, pages 201-212, San Francisco, CA, Nov. 5-9, 1995.

10
S. McCanne. Layered Video. WWW document, Dec. 1995.

11
Microsoft Corporation. ActiveVRML white paper, Dec. 1995.

12
S. Moezzi, A. Katkere, D. Y. Kuramura, and R. Jain. Immersive Video. In Proceedings of the IEEE Virtual Reality Annual International Symposium 1996, Mar. 1996. To be published.

13
D. R. Nadeau and J. L. Moreland. The Virtual Reality Behavior System (VRBS): a behavior language protocol for VRML. In S. N. Spencer, editor, First Annual Symposium on the Virtual Reality Modeling Language, pages 53-61, San Diego, CA, Dec. 13-15, 1995. ACM Press.

14
M. Pesce. VRML: browsing and building cyberspace. New Riders, 1995.

15
M. D. Pesce. VRML Architecture Group. WWW document, 1996.

16
K. Piersol. A Close-Up of OpenDoc. BYTE, Mar. 1994.

17
J. Schlenzig, E. Hunter, and R. Jain. Recursive identification of gesture inputs using hidden Markov models. In Proceedings of the Second IEEE Workshop on Applications of Computer Vision, pages 187-194. IEEE Computer Society Press, Dec. 5-7, 1994.

18
Silicon Graphics, WorldMaker, Sony, OnLive, Black Sun, Visual Software, and Paper, Inc. The Moving Worlds Proposal for VRML 2.0. WWW document, Jan. 1996. maintained by Chris Marrin.

19
J. R. Smith and S.-F. Chang. VisualSEEk: a Content-Based Image/Video Retrieval System. Java-based WWW demo, 1996.

20
N. Stephenson. Snow Crash. Bantam Books, 1992.

21
D. Swanberg, T. Weymouth, and R. Jain. Domain information model: an extended data model for insertions and query. In Proceedings of the Multimedia Information Systems, pages 39-51, Feb. 1992.

22
L.-C. Tai. Hypermedia in Multiple Perspective Interactive Video. Visual Computing Laboratory internal document, 1996. version 0.7.

23
Telemedia, Networks, and Systems Group, MIT LCS. TNS Technology Demonstrations. WWW demo, 1996.

24
C. Varela, D. Nekhayev, P. Chandrasekharan, C. Krishnan, V. Govindan, D. Modgil, S. Siddiqui, O. Nickolayev, D. Lebedenko, and M. Winslett. DB: browsing object-oriented databases over the web. In Proceedings of the Fourth International World Wide Web Conference, 1995.

25
D. Yow, B. Yeo, M. M. Yeung, and B. Liu. Analysis and Presentation of Soccer Highlights from Digital Video. In Proceedings, Second Asian Conference on Computer Vision, Dec. 1995.

