Arun Katkere - Jennifer Schlenzig - Ramesh Jain
Praja, Inc.
5405 Morehouse Drive, Suite 330
San Diego, CA 92121, USA.
katkere@praja.com, schlenz@praja.com, jain@praja.com
http://www.praja.com
The World Wide Web (WWW) as a mechanism for providing access to "real world" information in the form of live or recorded video and audio data is becoming more common. However, access to this information is limited to simple playback. In this paper, we present an architecture and a WWW implementation of Multiple Perspective Interactive Video (MPI Video), an infrastructure for accessing these new forms of information in more useful ways. Using an information system to merge the sensory and virtual data into a coherent, accessible, dynamic database, MPI Video provides content-centric interactivity. Multiple users can access this database to retrieve disparate information at the same time. Most interactions occur in a three-dimensional interface (a natural medium for interacting with real-world data) that combines relevant real and virtual components. This approach is useful in several applications, including teleconferencing, remote monitoring, and interactive entertainment. In this paper, we present the concepts of MPI Video and describe the latest implementation, a Web-based Remote Access, Monitoring, and Presence (RAMP) system.
The sharing of information among people separated in space and time has been a strong motivation behind the World Wide Web. Clearly the Web has been extremely successful in achieving this goal so far. It has become a major mechanism for collaboration, and it appears that what we have seen is only the beginning. In this paper we present an advancement of collaboration techniques that allows people to share not only virtual documents and virtual objects but also real objects and environments, by introducing telepresence mechanisms. The proposed approach uses content-centric interactivity to provide a smooth merger of real and virtual environments and content-based access to real (and possibly live) data. We describe the latest implementation of Multiple Perspective Interactive Video [3], MPI Video, as an example of a system that provides content-centric interactivity for remote events, such as sports broadcasts, that are typically captured from multiple perspectives.
The current WWW is dominated by documents. Multimedia documents are becoming more prevalent, but these are typically authored by hand. The documents may represent three-dimensional (3D) dynamic worlds, but they are based on hand-built models. VRML[7] has played a key role in bringing three-dimensionality to the Web by providing a standard means of describing these "virtual worlds." At the same time, cameras are becoming more prevalent in the real world. The falling cost of cameras and computers is leading to multiple cameras being deployed in the same environment, giving viewers the ability to select a particular camera to view a region of interest. It is widely expected that cameras will soon be present in many public places. Not surprisingly, the next major change is said to be the exploitation of the Web as a distribution system for real-time multimedia content[16]. Many groups have begun solving the problems of Web-based video delivery [11, 1, 13]. Consequently, the first generation of WWW tools for streaming video and live video is now available. Many individuals and groups have started using these tools to place live cameras on the Web. A recent check of an index of live cameras on the Web[12] revealed hundreds of entries.
Just as search engines were required to handle the deluge of text-based documents, this next wave demands sophisticated access tools. A simple VCR-like playback mechanism is as inadequate for this data as hyperlink-based navigation was for the text-based static Web. MPI Video provides the infrastructure to access sensor-derived data, live or archived. Using an information system to mediate between the variety of sensors (e.g., cameras, microphones) and the multitude of users, MPI Video provides powerful content-centric interactivity that smoothly combines the real world with the virtual world. For example, a user can select a viewpoint of choice and can navigate at the desired speed within the 3D scene without disturbing the events in the environment. The information assimilated from multiple sensors and a priori data provides mechanisms to overcome the limitations and misconceptions inherent in any single perspective.
Many issues that arise in implementing MPI Video depend on the specific application (e.g., remote surveillance and monitoring, telepresence, education and training, or entertainment), but several core issues must be addressed to implement any application successfully. In this paper, we present a brief overview of MPI Video, explain the concepts of content-centric interactivity and gestalt vision[2], and describe the current implementation.
Figure 1. The MPI Video conceptual architecture: a task-specific environment model is constructed automatically from the multi-perspective, multi-modal data. The environment model is used to personalize the way the underlying data are presented to the user. The user has access both to the uninterpreted data and to the MPI Video-generated abstractions.
Multiple Perspective Interactive Video provides a framework for the management of, and interactive access to, multiple streams of video data capturing different perspectives of related events[6]. MPI Video has dominant database and hypermedia components, which allow a user not only to interact with live events but also to browse the underlying database (which is automatically generated from the sensory data) for similar or related events, or to construct additional queries. By processing the video to extract information about the captured scenes, MPI Video allows a user to interact with the system on a semantic level. This differs greatly from typical scenarios in which queries are based on keywords entered by a person and are subject to that person's interpretations. Instead, our users are able to make queries based on the actual activity occurring in the video data. For example, in a football application the user may ask to see the closest view of the quarterback. The response to the query is an MPI Playback, which includes automatic switching of cameras to provide the best view, where what is best is defined by the user[6].
This camera-switching capability represents a shift in access mechanism. In many applications, such as sports broadcasts, traffic monitoring, and visual surveillance, multiple cameras are placed at strategically selected points to provide an operator with a global view of events. In all these applications, the different camera feeds are brought to one location, where all of the views are displayed. In a broadcast application, one of these views is selected by the editor or producer of the program to be broadcast to consumers. Our system eliminates this centralized control without transferring the tedious task of camera selection to the consumer: the MPI Video system handles the low-level control in response to high-level requests by the user.
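To make this concrete, the per-user switching logic might be organized as in the following minimal sketch (the class and method names are our own illustration rather than the system's actual API, and the dwell-time hysteresis is an added assumption to prevent rapid camera flicker):

```python
# Hypothetical sketch of one user's MPI Playback session: the system, not
# the user, performs low-level camera selection. The scoring callback
# encodes the user's high-level request (e.g., "closest view of person 3").
import time
from typing import Callable, Optional

class PlaybackSession:
    def __init__(self, score: Callable[[int], float], min_dwell: float = 2.0):
        self.score = score            # higher score = better view for this user
        self.min_dwell = min_dwell    # seconds between switches (assumed hysteresis)
        self.current: Optional[int] = None
        self.last_switch = 0.0

    def on_model_update(self, camera_ids: list[int]) -> int:
        """Called on each environment-model update; returns the camera to show."""
        best = max(camera_ids, key=self.score)
        now = time.monotonic()
        if best != self.current and (
                self.current is None or now - self.last_switch >= self.min_dwell):
            self.current = best       # a real system would switch the stream here
            self.last_switch = now
        return self.current
```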
Content-centric interactivity can be implemented by combining the tools and techniques being developed in different areas of computer science, including evolving information systems and the delivery mechanisms created by the network infrastructure commonly available today. The image stream from each camera is processed to extract task-dependent information, which is fed to an information system called the Environment Model (EM). The EM is a coherent, dynamic, multilayered, three-dimensional representation of the content in the video streams. It is this view-independent, task-dependent model that bridges the gap between two-dimensional image arrays, which by themselves have no meaning, and the complex information requirements placed on the system by users and by other components. The environment model is an active real-time database containing spatial and object information at several levels of abstraction. The assimilated information contained in the environment model allows us to achieve gestalt vision[2], where the whole is greater than the sum of the parts.
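A minimal sketch of such an environment model follows (all names are illustrative assumptions; the actual EM maintains more layers of abstraction, spatial indexing, and event information than shown here):

```python
# Hypothetical sketch of a simple environment model: tracked objects with
# position, velocity, and trajectory history in world coordinates, plus the
# static scene and camera geometry needed to answer view-related queries.
from dataclasses import dataclass, field

@dataclass
class TrackedObject:
    object_id: int
    position: tuple[float, float, float]       # world coordinates (meters)
    velocity: tuple[float, float, float]       # meters per second
    trajectory: list[tuple[float, float, float]] = field(default_factory=list)

@dataclass
class EnvironmentModel:
    static_geometry: str                       # e.g., a VRML scene of the rooms
    cameras: dict[int, tuple[float, float, float]]  # camera id -> world position
    objects: dict[int, TrackedObject] = field(default_factory=dict)

    def update(self, object_id: int, position, velocity) -> None:
        """Assimilate one observation already fused across the cameras."""
        obj = self.objects.setdefault(
            object_id, TrackedObject(object_id, position, velocity))
        obj.trajectory.append(position)
        obj.position, obj.velocity = position, velocity
```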
The EM information system offers two major facilities. A user may interact with the EM at many different levels of information abstraction. A user can view the information of interest in a visualization mode preferred by the user for that information, ranging from simple text to immersive environments. Also, a user can view any information of interest from any viewpoint of interest. Thus the human multiplexor is removed and the user becomes the producer of information. Another major advantage is that the EM can be used by several users to view different information at the same time. Because the EM is an information system, it can be designed to reside at one or multiple locations and satisfy information or entertainment needs of a diverse, distributed group of users at one time.
The transformation of video data into objects in the environment model is an ill-posed and difficult problem. With the extra information provided by the multiple-perspective video data, and with certain realistic assumptions that hold in a large class of applications, however, it is possible to construct accurate models robustly and quickly. In our current MPI Video systems, we assume that certain information is available[5]; in addition, we make extensive use of several other sources of information.
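As one example of the kind of realistic assumption used in this class of systems, if objects move on a known ground plane, a detection in a calibrated camera's image back-projects to a unique world position. A minimal sketch follows (the per-camera homography H is assumed to come from offline calibration; this illustrates the idea rather than the exact formulation of [5]):

```python
# Hypothetical sketch: map an image point to the ground plane using a known
# 3x3 homography obtained by calibrating the camera against the floor plan.
import numpy as np

def image_to_ground(H: np.ndarray, u: float, v: float) -> tuple[float, float]:
    """Back-project pixel (u, v) to ground-plane coordinates (x, y)."""
    x, y, w = H @ np.array([u, v, 1.0])
    return x / w, y / w
```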
New interface mechanisms and metaphors are required to allow proper interfaces for content-centric interactivity and gestalt vision. We believe that in addition to menu-based selections, new methods of spatiotemporal interactions will be required to allow intuitive access to the objects and events in the scene[4].
Expressiveness of interaction is fundamental to the design of any user-interface model. This expressiveness can be achieved by adopting a 3D information visualization metaphor, as used by several database browsing and visualization groups[14]. Our motivation for developing a three-dimensional interface for MPI Video stems from the intuition that a user given a three-dimensional world that changes dynamically with time can operate in it meaningfully (i.e., specify an object, a space, or a search condition) only when he or she is able to navigate and act within it.
A three-dimensional interface would be extremely useful in several areas, and the current system provides several specific instances of such an interface.
The environment model, presented in Section 2, forms the basis for the creation of this three-dimensional interface. Components of the interface reflect the current state of the environment.
With the latest VRML standard, VRML 2.0[15], the ability to model and interact with dynamic three-dimensional scenes on the WWW has increased manyfold. It is expressive enough to specify the three-dimensional model-based interactions described in the previous section. MPI Video uses VRML both for representing scene geometry (static scene geometry, camera geometry, iconic representations of the dynamic objects, and overlays such as trajectories and velocity vectors) and for providing user interactions such as drawing regions of interest, selecting objects of interest, and sketching queries.
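For example, a tracked object in the environment model might be rendered as a simple VRML 2.0 icon at its current world position. The sketch below emits such a node (the box geometry, color, and DEF naming are our own illustrative choices, not the system's actual encoding):

```python
# Hypothetical sketch: emit a VRML 2.0 Transform placing a box icon at a
# tracked object's position; a viewer would update the translation as the
# environment model changes.
def object_icon_vrml(object_id: int, x: float, y: float, z: float) -> str:
    return f"""DEF OBJ_{object_id} Transform {{
  translation {x:.2f} {y:.2f} {z:.2f}
  children [
    Shape {{
      appearance Appearance {{ material Material {{ diffuseColor 0.8 0.2 0.2 }} }}
      geometry Box {{ size 0.5 1.8 0.5 }}
    }}
  ]
}}"""
```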
Virtual worlds use the concept of avatars to display positions of objects in the virtual world. An avatar allows a user to see the relative positions of objects in the space. This aids navigation and allows more natural interactions with other users who may be literally thousands of miles and several time zones away. Most current avatars are preselected objects with a limited set of motions and facial expressions, adequate only to show the relative positions of objects. It is clear that machine-vision techniques will soon be combined with graphics techniques to create avatars that reflect the facial expressions of a user and may even look like the user, if so desired.
Since we are dealing with real worlds and are displaying real scenes, the objects must look realistic. The best avatar for a person in this situation is the person himself. Thus, depending on the application, one should try to use the model of the person, or pictures of the person, as his avatar and show all facial expressions and motions for the person. This can be done in some situations, but in others it may be very difficult.
The interface should provide powerful and flexible navigation mechanisms, much as video games do. In most telepresence applications, it is essential that a user feel as if he or she is navigating through the environment, subject to all its physical rules. In some cases, navigation may involve just expressing a desired point of view with respect to some objects or events. This will usually require combining the visual mechanisms offered by VRML with symbolic methods.
Figure 2. Layout of the office used for this example and the location of the cameras in the scene.
Our test environment is an office with two rooms. Six cameras are used to cover the environment. The layout of the rooms and cameras is shown in Figure 2. Figure 3 shows the high-level architecture of the current implementation. Different components of this architecture roughly correspond to the components shown in the conceptual architecture (Figure 1).
Figure 3. Overview of the WWW version of MPI Video.
The current implementation consists of several cooperating components.
The current implementation uses "Motion JPEG"[10] for video compression and RTP/RTCP[9] for video delivery. The use of other compression algorithms better suited to video is being investigated.
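For concreteness, the fixed 12-byte RTP header for one such video packet can be packed as in the sketch below (static payload type 26 denotes JPEG video in the standard audio/video profile; the JPEG-specific payload header and frame fragmentation are omitted):

```python
# Sketch: pack the fixed 12-byte RTP header (version 2, no padding,
# no extension, no CSRCs). The marker bit is set on the last packet of a frame.
import struct

def rtp_header(seq: int, timestamp: int, ssrc: int, marker: bool) -> bytes:
    byte0 = 2 << 6                             # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | 26            # M bit and payload type 26 (JPEG)
    return struct.pack("!BBHII", byte0, byte1, seq & 0xFFFF,
                       timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)
```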
Figure 4. Sample screen of the MPI Video client showing the different components of the interface and some queries.
The client interface to the system is shown in Figure 4. This interface allows the user to perform a variety of queries related to the position and velocity of the dynamic objects (people) in the environment. As illustrated in Figure 4, queries include requesting notification when a person enters a selected region and maintaining the trajectories of selected objects. The system also allows the user to request several different types of best view (e.g., proximity or frontal). The result of a proximity best view request is given in Figure 5.
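A region-entry notification of this kind reduces to a simple spatial watch over environment-model updates, as in the following minimal sketch (the watch/notify interface is our own illustration of this class of query; regions are axis-aligned boxes on the floor plan):

```python
# Hypothetical sketch: notify once each time an object enters the region.
from typing import Callable

def make_region_watch(xmin: float, ymin: float, xmax: float, ymax: float,
                      notify: Callable[[int], None]):
    inside: set[int] = set()                  # objects currently in the region

    def on_update(object_id: int, x: float, y: float) -> None:
        in_region = xmin <= x <= xmax and ymin <= y <= ymax
        if in_region and object_id not in inside:
            inside.add(object_id)
            notify(object_id)                 # person entered the selected region
        elif not in_region:
            inside.discard(object_id)

    return on_update
```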
Figure 5. An example of "best proximity view": a sequence of camera changes that occurred when the user requested that the view from the closest camera to the selected object be shown.
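Given the calibrated camera positions held in the environment model, the proximity criterion itself is straightforward, as the sketch below shows (occlusion by static geometry, such as the wall between the two rooms, is assumed to be handled elsewhere in the model):

```python
# Hypothetical sketch: pick the camera nearest the selected object; using
# this as the scoring criterion for playback yields camera changes like
# those illustrated in Figure 5.
import math

def closest_camera(cameras: dict[int, tuple[float, float, float]],
                   obj_pos: tuple[float, float, float]) -> int:
    return min(cameras, key=lambda cid: math.dist(cameras[cid], obj_pos))
```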
The World Wide Web has proved itself as an important means of distributing many forms of information. The next step is to make that information more accessible. MPI Video includes an environment model that stores information in such a way that there is a smooth merging of real and virtual worlds. Queries to the system are based on the user's interest in the content of the video, rather than on predefined keywords that tell more about the person designing the system than about the video they are meant to describe. The user can access the information in several ways, including via the World Wide Web. MPI Video offers the many people working on Web-based video delivery a motivating application that moves beyond video-on-demand and point-to-point video conferencing. Future applications of MPI Video include telemeeting systems that merge virtual- and real-world information to yield a productive, collaborative environment of people, objects, documents, and other assorted media. In addition, this technology can remove the elitist nature of surveillance systems by permitting wide access to the video data covering public spaces. It is expected that new forms of entertainment will be discovered as the power and flexibility of the Environment Model and MPI Video are explored and exploited.
Several people have assisted in developing and implementing the ideas reported in this paper. We would like to acknowledge Don Kuramura, David Kosiba, John Studarus, David Lehman, and C. K. Prahalad at Praja and Edd Hunter, Patrick Kelly, Saied Moezzi, and Andy Tai at the Visual Computing Laboratory.