Bitstream Syntax Description Language:
Application of XML-Schema to Multimedia Content Adaptation

Myriam Amielh, Sylvain Devillers

Philips Research France
51, rue Carnot - B.P. 301 92156 Suresnes Cedex, France
{Myriam.Amielh, Sylvain.Devillers}@philips.com

Abstract

Although many content adaptation strategies are available to face the problem of transparently accessing multimedia documents from any kind of device connected to the Web, no solution exists at the bitstream level to adapt an atomic media element. In this paper, we present a complete framework based on XML to describe the structure of a bitstream. These descriptions are then transformed to dynamically adapt multimedia data to the network and terminal capabilities. For that purpose, we introduce a new schema language, the Bitstream Syntax Description Language (BSDL) to specify the document model of a multimedia bitstream. We give the in-depth specification of this language and show several applications on emerging formats featuring scalability properties. Finally, we describe a Web-based implementation on a Client/Server system.

Keywords

Content Adaptation, XML, XML-Schema, XSLT, Scalable Encoding, Multimedia, Image, Video.

Approximate word counts:

7800

1 INTRODUCTION

Web services are becoming accessible from a wide range of non-desktop devices, including mobile phones, televisions and Personal Digital Assistants (PDAs). Indeed, citing a recent survey, the W3C estimated that the number of web documents viewed through non-desktop devices, rather than through browsers on PCs, will rise dramatically in the coming years. Heterogeneity in terminal capabilities limits interoperability between devices and Web services. This is why many studies are currently being carried out to provide strategies for content adaptation and profile negotiation, and to ensure seamless web access from various kinds of devices.

At the server side, holding one version of each content item for every possible kind of terminal is a costly solution, especially when the server has to address both low-end and high-end devices. A better solution is to dynamically publish adapted content according to the terminal properties. The generation of dynamic content is an issue for which many solutions have been developed. XML technologies [1] are broadly used to create text documents independently of their presentation and to dynamically generate adapted content, displayable for instance by Web or WAP browsers. Several multimedia document models like SMIL2.0 [2] have been developed to adapt composite documents containing links. Finally, content servers sometimes hold a collection of proprietary software tools able to parse, edit and adapt multimedia bitstreams. But gathering and maintaining such a set of dedicated tools is not a reliable solution.

However, up to now, no generic solution has been provided at a lower level to adapt an atomic media element, dealing with one coding format, for instance an MPEG-4 or JPEG2000 bitstream. In this paper, we describe a complete framework for bitstream adaptation based on a generic method, relying on the use of a common structuring language to describe the structure of a bitstream.

The main elements of this framework were previously outlined in [3] and [4]. Section 2 summarizes these foundations, in particular how to fully exploit scalability for bitstream adaptation by describing the bitstream structure with XML. In this paper, we introduce a new schema language, the Bitstream Syntax Description Language (BSDL), to validate XML bitstream descriptions. Section 3 is dedicated to the in-depth BSDL specification, notably its structural and datatype aspects. In order to validate this new technology, we show in Section 4 several applications to MPEG-4 videos, JPEG2000 still images and finally to a proprietary video compression format. Section 5 places our technology in the context of a complete Web-based system and shows a possible server implementation for non-streamed media, in order to demonstrate that this solution can be easily integrated in a Client/Server context.

2 MULTIMEDIA CONTENT ADAPTATION

In this section, we identify a level of content for which no solution has been proposed so far to enable adaptation to different kinds of devices. We also show that scalability properties, which are provided by many encoding formats, can be efficiently exploited for that purpose. Afterwards, we give the fundamentals of a new framework we designed and tested to solve this problem.

2.1 Adaptation of documents through the web

The variety of devices gaining access to the Internet and exchanging an increasing range of formats makes content adaptation a crucial issue. In recent years, the consumption of new audio and video formats has been rising sharply, not only on PCs but also on mobile phones and Personal Digital Assistants (PDAs) such as iPAQ or PALM devices.

Initially, the issue was to publish text documents in different versions adapted to the capabilities of the rendering devices. For modeling adaptive text document, web site designers employ XML in order to separate the structure of the document from its presentation. In this way, the source document can be dynamically processed in order to generate a presentation tailored to the available resources, e.g. in HTML or WML for respectively a web or WAP browser.

Following up on this, web site developers rapidly came up against the question of how to adapt composite documents to the different agents expected to display them. This problem relates to the transformation of the document structure. Here again, multimedia document models employing XML have been designed to model adaptive multimedia content (for instance SMIL [2] or MADEUS [5]).

Although content adaptation is a topic on which much work is currently being carried out, the solutions always relate to text or composite documents. Up to now, no solution has been provided for the adaptation of the media bitstream itself. However, numerous encoding formats provide scalability properties, also called progressivity, which could be straightforwardly exploited for adaptation purposes.

In this paper, we describe a new technology for multimedia content adaptation taking the device capabilities into account. This work is located below the usage of media elements in multimedia composite documents. Thus, to avoid any confusion, the reader shall assume that in this paper the term "multimedia content" relates to an atomic content, which deals with a single coding format, possibly multiplexed (e.g. JPEG2000, MPEG-4...).

2.2 Scalable content

In this section we define and illustrate the notion of scalable content and we conclude on the role of scalability for content adaptation strategies. The term scalability refers to methods that allow the partial decoding or transmission of a compressed bitstream. In scalable multimedia coding standards such as MPEG-4 video or JPEG2000 still images, data are organized in a way such that, when retrieving a bitstream, it is possible to first render a degraded version of the content, and then progressively improve it by loading additional data retrieved from the source. From a multimedia content coded in a relevant scalable way, it is hence possible to retrieve an adapted version from a single bitstream by performing simple operations such as data truncation.

There are several types of scalability. The most commonly implemented ones are SNR (Signal to Noise Ratio) scalability, temporal scalability, and spatial scalability. SNR scalability consists in compressing a quality-degraded version of an image together with one or several complementary images that enhance its quality. Temporal scalability makes it possible to decode a bitstream at various temporal resolutions: in practice, the decoder can extract a temporally sub-sampled version of the initial video from the bitstream. Spatial scalability applies to the content size: the terminal can first render a small image and then progressively display larger images until the full resolution is reached. Other, more application-oriented scalable encoding techniques also exist. For instance, visual scalability in 3D scenes makes it possible to progressively decode a realistic 3D environment, taking the viewer position into account, so that only the relevant information is processed.

Scalable encoding plays a key role in content adaptation for Client/Server architectures. Depending on the system features (bit rate, errors, available resources), the client decoder can take some portions of a stream and still decode its content (audio, video or still images) at different quality levels. Similarly, the server can send selected portions of the initial content. Hence, if a bitstream is scalable, low-performance and high-performance decoders can coexist. While low-performance decoders may decode only small portions of the bitstream, producing a basic quality, high-performance decoders may decode much more data and produce a significantly higher quality.

2.3 XML description of the bitstream structure

In the previous section, we saw that with multimedia content encoded in a scalable way, it is possible to retrieve an adapted version from a single bitstream by performing simple operations such as data truncation. However, this operation requires dedicated software to parse the bitstream and cut off the irrelevant parts. A server providing content coded in several different standards therefore needs as many software modules as there are formats to manipulate. Since the number of multimedia coding formats is continuously growing, maintaining numerous software modules dedicated to content adaptation will be an increasingly difficult task. For that reason, we proposed a generic approach to this issue by providing a method based on XML for manipulating bitstreams.

A multimedia bitstream consists of a structured sequence of binary symbols, this structure being specific to the coding format. We propose to use XML to describe the high-level structure of a bitstream and we call the resulting XML document a Bitstream Description. This description is not meant to replace the original binary format, but acts as an additional layer, similar to metadata. It does not describe the bitstream on a bit-by-bit basis, but addresses its high-level structure, i.e. how it is organized in layers or packets of data. Lastly, it does not deal with the semantics of the bitstream, i.e. the original object, image, audio or video it represents, but only considers it as a sequence of binary symbols.

Document 1 shows a fragment of the XML description of a JPEG2000 bitstream [6], an emerging still image coding standard based on a wavelet transform. In a simplified view, the bitstream is organized as a sequence of packets, each containing a header identified by a two-byte marker with the hexadecimal value FF91, followed by two bytes indicating the length of the header, and four bytes indicating the packet number. This header is followed by the packet data, consisting of a sequence of bytes resulting from the arithmetic coding of the wavelet coefficients.

The parameters listed above are defined in the XML description as elements and their values written in a readable format, hexadecimal in the case of the marker and numeric for the other two parameters. Due to the scalable properties of the bitstream, the packet data does not need to be decoded to adapt the image, and may be manipulated as a whole. Its inner structure is therefore not further detailed. The data themselves are not included in the description, but referred to by a pointer indicating the relevant byte range in the original file.

<Packet>
  <Header>
    <Marker>ff91</Marker>
    <Length>4</Length>
    <PacketNumber>1</PacketNumber>
  </Header>
  <PacketData>155-396</PacketData>
</Packet>
Document 1: Example of a fragment JPEG2000 Bitstream Description
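
To illustrate how a program might consume such a description, the following minimal Python sketch reads the fragment of Document 1 with the standard library XML parser. The function name read_packet and the returned dictionary layout are illustrative assumptions and are not part of the framework described here.

```python
import xml.etree.ElementTree as ET

# Fragment from Document 1.
DESCRIPTION = """
<Packet>
  <Header>
    <Marker>ff91</Marker>
    <Length>4</Length>
    <PacketNumber>1</PacketNumber>
  </Header>
  <PacketData>155-396</PacketData>
</Packet>
"""

def read_packet(xml_text):
    """Return the header fields and the byte range of one <Packet> element."""
    packet = ET.fromstring(xml_text)
    header = packet.find("Header")
    # The packet data is not embedded: only its byte range is given.
    first, last = packet.findtext("PacketData").split("-")
    return {
        "marker": bytes.fromhex(header.findtext("Marker")),
        "length": int(header.findtext("Length")),
        "packet_number": int(header.findtext("PacketNumber")),
        "data_range": (int(first), int(last)),
    }

packet = read_packet(DESCRIPTION)
print(packet)
```

Note that the sketch interprets the text values itself (hexadecimal for the marker, decimal for the other fields); nothing in the description alone states how they should be encoded in binary.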

2.4 Description Transformation

We saw in section 2.2 that a scalable coding format makes it possible to transform content by simple editing operations. For example, if an image is encoded in a progressive-by-resolution scheme, it is possible to render a smaller resolution image by cutting off the relevant packets of data and modifying some header parameters related to the image size.

Once we get an XML representation of the bitstream, it is then possible to define the corresponding editing operations on the XML document, which will consist in removing the elements corresponding to the packets of data to be cut off, and in modifying some element or attribute values.

The strength of describing a bitstream structure using XML is that there are solutions already available to perform such XML-to-XML editing operations. Notably, the W3C language XSLT [7] is an efficient way to specify transformations on XML documents by means of style sheets. An XSLT style sheet contains one or several templates defining the modifications to be applied to the elements or attributes matching a set of conditions. For instance, a template can remove some elements or change their content.

The XSLT language also provides the possibility to define parametric style sheets. In Document 2, a parameter called nCompOut is defined at the head of the style sheet, and is used later to change the value of the Lsiz element.

<!--Parameter definition: number of output components - Default value = 1 -->
  <xsl:param name="nCompOut">1</xsl:param>
<!--Template: Match Lsiz -->
  <xsl:template match="Lsiz" >
    <xsl:copy>
      <xsl:value-of select="38 + 3 * $nCompOut"/>
    </xsl:copy>
  </xsl:template>
Document 2: Example of parametric style sheet
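
The effect of the template in Document 2 can be imitated with a few lines of Python. This is only a stand-in for an XSLT processor, written with the standard library; the function name set_lsiz and the sample SIZ fragment are hypothetical and serve only to show the parameter being applied.

```python
import xml.etree.ElementTree as ET

def set_lsiz(description_xml, n_comp_out=1):
    """Mimic the template of Document 2: rewrite the content of every Lsiz element."""
    root = ET.fromstring(description_xml)
    for lsiz in root.iter("Lsiz"):
        # Same expression as the select attribute: 38 + 3 * $nCompOut
        lsiz.text = str(38 + 3 * n_comp_out)
    return ET.tostring(root, encoding="unicode")

# Hypothetical description fragment for an image with three components.
before = "<SIZ><Lsiz>47</Lsiz></SIZ>"
after = set_lsiz(before, n_comp_out=1)
print(after)
```

Calling set_lsiz with n_comp_out=3 would write 47 again, the value this expression gives for three components.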

Once the client capabilities have been translated in terms of bitstream quality, the parameters can be used to transmit the adaptation values to the style sheet, for instance the number of SNR layers to be kept, or the number of temporal levels to be suppressed.

Thus, XSLT can be straightforwardly used on XML bitstream descriptions (Figure 1). Furthermore, processors able to apply XSL transformations are likely to be available on web servers (for instance, the Apache project provides the Xalan Java API [8]).

Figure 1: XML-to-XML transformation

Figure 2 below summarizes the overall method for scalable bitstream adaptation. Assuming that we can produce the XML description of a scalable bitstream, we apply one or several XSLT style sheet(s) to produce a transformed XML description, from which it is possible to re-generate an adapted and compliant bitstream.

Figure 2: Multimedia Content Adaptation with XML

3 BITSTREAM SYNTAX DESCRIPTION LANGUAGE (BSDL) SPECIFICATION

3.1 Introduction of a bitstream schema

The Bitstream Description introduced in section 2.3 is a temporary representation intended for content adaptation. Once the XSLT transformation has been applied to this description, the original binary format needs to be re-generated, since it is the only format that the end user can understand and use. However, to achieve full interoperability, the way the bitstream is generated should be generic and should not depend on the coding format.

The description itself contains the data to be written in the output bitstream, but does not specify how the value should be encoded. For example, if the description contains an element <Length>4</Length>, additional information is required to know whether the value "4" should be encoded as a short integer on two bytes or as a floating point variable. This property is common to all XML documents describing the same bitstream format and thus should be specified in the document model defining the set of constraints common to these descriptions.

The XML specification [1] provides a grammar named Document Type Definition (DTD) for this purpose, but it supports only a limited set of types. In particular, the value "4" in the example above is considered by the DTD as a string of characters. A document model or schema expressing the types of elements and their coding scheme is therefore required to achieve full interoperability.

Several schema languages have been developed to increase the expressive power of DTD constraints. [9] reviews several schema languages, comparing their respective features. Among them, XML-Schema [10][11][12], recently standardized by the W3C, provides the richest set of datatype and structure constraints, and partially solves the issue stated above. This is why we chose this language to write the document model defining the family of XML documents that describe the same bitstream format. We call such a document model a Bitstream Schema.

However, the primary role of a schema is to validate an XML document, i.e. to check that it follows a set of constraints on its structure and datatypes. In this paper, we propose to use this schema to generate the bitstream from its description. In this way, we are extending the XML-Schema language by introducing a new functionality. We thus define a new language we name Bitstream Syntax Description Language (BSDL), built on top of XML-Schema.

From a semantic point of view, BSDL adds a new functionality beyond validation, which is to provide generic software with the information it needs to generate a bitstream, as pictured in Figure 3. This generation mechanism is described in Section 3.2. From a syntactic point of view, it extends the XML-Schema language by introducing a set of restrictions and extensions, using XML-Schema extension mechanisms (Section 3.3), and applying to its structures (Section 3.4) and datatypes (Section 3.5).

Figure 3: Functionality of a Bitstream Schema

3.2 Bitstream generation: XMLtoBin Parser

In the following, we call XMLtoBin Parser the generic software that parses the Bitstream Description and generates the bitstream, and BintoXML the software performing the reverse operation. Neither the BintoXML parser nor the specific BSDL extensions it relies on are covered in this paper. The XMLtoBin parser works as follows: it parses the Bitstream Description following the navigation path of an XML document and assigns to each element its declaration and type definition in the corresponding Bitstream Schema. When the XML element contains string data, the data is binary encoded following the rule given by its type and the resulting binary symbol is appended to the output bitstream, as illustrated in Figure 4. When the element contains a pointer to an external entity, the relevant range of data is copied from the pointed resource into the output bitstream.

<!-- Description -->
<Packet>
  <Header>
    <Marker>ff91</Marker>
    <Length>4</Length>
    <PacketNumber>1</PacketNumber>
  </Header>
  <!-- etc... -->
</Packet>
<!-- Schema -->
<xsd:element name="Header">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name="Marker"
                   type="xsd:hexBinary"/>
      <xsd:element name="Length"
                   type="xsd:short"/>
      <xsd:element name="PacketNumber"
                   type="xsd:int"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>
<!-- Output bitstream -->
0xFF91 0x00 0x04 0x00 0x00 0x00 0x01 ...
Figure 4: Example of bitstream generation
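
The generation step of Figure 4 can be sketched in Python as follows. This is a simplified illustration and not the XMLtoBin Parser itself: the element-to-type table HEADER_TYPES is written by hand here, whereas a real parser would read it from the Bitstream Schema, and the big-endian byte order is inferred from the output shown in the figure.

```python
import struct
import xml.etree.ElementTree as ET

# Encoders for the three built-in types used in Figure 4.
# Big-endian byte order is an assumption inferred from the figure output.
ENCODERS = {
    "xsd:hexBinary": bytes.fromhex,
    "xsd:short": lambda text: struct.pack(">h", int(text)),
    "xsd:int": lambda text: struct.pack(">i", int(text)),
}

# Hand-written stand-in for the type assignments found in the schema.
HEADER_TYPES = {
    "Marker": "xsd:hexBinary",
    "Length": "xsd:short",
    "PacketNumber": "xsd:int",
}

def xml_to_bin(header_xml, types=HEADER_TYPES):
    """Encode the children of a <Header> element in document order."""
    out = bytearray()
    for child in ET.fromstring(header_xml):
        encode = ENCODERS[types[child.tag]]
        out += encode(child.text.strip())
    return bytes(out)

header = (
    "<Header><Marker>ff91</Marker><Length>4</Length>"
    "<PacketNumber>1</PacketNumber></Header>"
)
print(xml_to_bin(header).hex())  # ff91000400000001
```

The eight bytes produced match the output bitstream of Figure 4: two bytes for the marker, two for the length and four for the packet number.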

Note that the XMLtoBin Parser does not need to validate the document in the XML-Schema meaning since this may have been done previously by a regular XML-Schema parser. However, it may be required to first validate the BSDL specific restrictions and extensions. It is up to the developer to choose whether a single software should validate the XML-Schema syntax, the specific BSDL restrictions and extensions, and generate the bitstream, or to assign these functions to different software modules.

3.3 Construction of BSDL over XML-Schema

The XML-Schema language specification is divided into two parts, Structures [11] and Datatypes [12]. The Bitstream Syntax Description Language specification given next follows this division. It is not intended to be exhaustive, but it is detailed enough to convey the main principles of the application. A working knowledge of XML-Schema is assumed in the following sections; the reader may first study the Primer [10] for a good introduction to XML-Schema.

In the following, the "xsd:" prefix is used as a convention to refer to the XML-Schema namespace "http://www.w3.org/2001/XMLSchema", "xsi:" refers to "http://www.w3.org/2001/XMLSchema-instance", and "bsdl:" to BSDL extensions.

The Bitstream Syntax Description Language (BSDL) uses a set of XML-Schema components for which a semantics may be assigned in the context of the bitstream generation. Some components should therefore be ignored by the XMLtoBin parser because they have no meaning in the BSDL context, while other constructs are excluded. Lastly, some components are currently excluded from the BSDL because their impact on the bitstream generation has not been fully studied yet, but may be allowed or ignored in future versions.

In order to remain conformant to XML-Schema, we use its extension mechanisms. In this way, the Bitstream Schema may be validated by an XML-Schema parser.

In particular, XML-Schema provides two ways of adding application-specific extensions. Firstly, most schema components may contain an xsd:annotation, which itself contains an xsd:appinfo, intended as a placeholder for application-specific information. We use xsd:appinfo to plug in BSDL extensions written in an XML-Schema-like syntax. Furthermore, all XML-Schema components allow attributes from non-schema namespaces, which leaves the possibility to add new attributes to some schema components by qualifying them with a specific namespace. The current specification of BSDL does not make use of this feature, but future versions may do so.

3.4 Structural aspects of BSDL

An important restriction on the Bitstream Description is that data should be embedded as element content and not in an attribute, since the order of attributes in XML is not significant: should attributes contain symbols to be added to the output bitstream, external knowledge would be required to specify in what order they should be processed. Any attribute declaration in the schema will therefore be ignored, as long as it is valid for XML-Schema.

Another major restriction is that a type must be assigned to each element of the instance. For this reason, mixed content models are excluded from BSDL, since in this case, character data inserted between the child elements of a parent element have no type assigned by the schema.

In the following example extracted from the XML-Schema Primer and reproduced in Document 3, the character data "Dear Mr." contained in the <salutation> element and at the same level as the child element <name> have no type assigned by the schema.

<!-- XML document fragment -->
<salutation>Dear Mr.<name>Robert Smith</name>.</salutation>
<!-- Corresponding schema fragment -->
<xsd:element name="salutation">
  <xsd:complexType mixed="true">
    <xsd:sequence>
      <xsd:element name="name" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>
Document 3: Example of an XML element with mixed content

For the same reason, all types should be explicit. XML-Schema provides wildcard mechanisms to append an untyped element or attribute, the type then being specified in the instance (via the xsi:type attribute). All components or attributes (xsd:any, xsd:anyType, xsd:anyAttribute...) allowing open content are therefore excluded from BSDL. Similarly, the use of abstract element declarations and substitution groups is prohibited in BSDL.
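
A tool validating the BSDL-specific restrictions could look for these constructs in a Bitstream Schema before any bitstream is generated. The sketch below is one possible way to do so in Python; the function name find_excluded_constructs and the wording of the findings are assumptions, and the checks cover only the constructs named in this section.

```python
import xml.etree.ElementTree as ET

XSD = "http://www.w3.org/2001/XMLSchema"
TRUE = ("true", "1")  # lexical forms of the boolean value true

# Wildcard components excluded by Section 3.4 (illustrative, not exhaustive).
EXCLUDED_TAGS = ("any", "anyAttribute")

def find_excluded_constructs(schema_xml):
    """Return human-readable findings for constructs excluded from BSDL."""
    findings = []
    for node in ET.fromstring(schema_xml).iter():
        if not node.tag.startswith("{" + XSD + "}"):
            continue  # skip non-schema content such as appinfo extensions
        local = node.tag.split("}", 1)[1]
        name = node.get("name")
        if local in EXCLUDED_TAGS:
            findings.append(f"wildcard xsd:{local}")
        if local == "complexType" and node.get("mixed") in TRUE:
            findings.append("mixed content model")
        if local == "element":
            if node.get("abstract") in TRUE:
                findings.append(f"abstract element {name}")
            if node.get("substitutionGroup") is not None:
                findings.append(f"substitution group member {name}")
            # Simplification: compares the local part only, without
            # resolving the namespace prefix of the type name.
            if node.get("type", "").endswith(":anyType"):
                findings.append(f"xsd:anyType on element {name}")
    return findings

# Schema fragment of Document 3, wrapped in a schema element.
SCHEMA = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="salutation">
    <xsd:complexType mixed="true">
      <xsd:sequence>
        <xsd:element name="name" type="xsd:string"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
"""
print(find_excluded_constructs(SCHEMA))
```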

3.5 Datatypes aspects of BSDL

3.5.1 Introduction
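
XML-Schema defines a datatype as a 3-tuple consisting of a value space, a lexical space, and a set of facets characterizing them (see section 2.1 of [12]).

We define BSDL datatypes by adding a fourth component: a binary representation of the values.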

BSDL datatypes are therefore restricted to those defined by XML-Schema for which a binary representation may be defined. For example, xsd:integer represents the mathematical concept for an unbounded integer. No implicit binary representation may be assigned to this type, which is therefore excluded from BSDL. On the other hand, xsd:int is derived from xsd:integer by restricting its value space to the values that may be represented on four bytes (xsd:minInclusive = -2147483648 and xsd:maxInclusive = 2147483647). BSDL includes this type and assigns a binary representation on four bytes.

3.5.2 Restriction of datatypes

BSDL includes the following list of types:

Other XML-Schema types are excluded from BSDL, either because they are not used in a multimedia bitstream, such as the types related to dates, time or duration, or because they have no implicit binary representation.

3.5.3 BSDL pseudo-facets

Beyond the XML-Schema types, BSDL needs to manipulate symbols encoded on a number of bits that is not a multiple of eight. One way to do this is to use a specific facet. However, XML-Schema facets characterize a value space along independent axes or dimensions (see section 2.4 of [12]). Since BSDL does not consider the values of types but only their binary representations, we choose not to use the XML-Schema facets and to ignore them when specified. Instead, we introduce a new kind of facet, which we call a pseudo-facet, that characterizes the binary representation of a type. In particular, we define the pseudo-facet bsdl:bitsLength to specify the number of bits on which a value should be encoded. It applies to unsigned integer types (xsd:unsignedByte, xsd:unsignedInt, xsd:unsignedLong...) and xsd:hexBinary. Since this schema component is an extension of XML-Schema, we assign it the BSDL namespace, and include it in the schema via the xsd:appinfo schema component (Document 4).

<xsd:simpleType name="threeBitType">
  <xsd:restriction base="xsd:unsignedByte">
    <xsd:annotation><xsd:appinfo>
      <bsdl:bitsLength value="3"/>
    </xsd:appinfo></xsd:annotation>
  </xsd:restriction>
</xsd:simpleType>
Document 4: Example of a Bitstream Schema fragment using the pseudo-facet bsdl:bitsLength

Any unsigned integer type can be used as the base type as long as it is consistent with the number of bits specified in the pseudo-facet. Furthermore, xsd:maxExclusive and other related facets may be used by the schema author, but are ignored by BSDL even if their values may be correlated with the bsdl:bitsLength. For example, using <xsd:maxExclusive value="8"> to specify that the value should be encoded on three bits is incorrect.
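
Encoding values on a number of bits that is not a multiple of eight requires a writer working at the bit level. The following Python sketch shows one way to pack such values; the class name BitWriter, the most-significant-bit-first order and the zero padding of the last byte are assumptions made for illustration and are not specified by BSDL in this section.

```python
class BitWriter:
    """Accumulate values of arbitrary bit length, most significant bit first."""

    def __init__(self):
        self._bits = []

    def write(self, value, bits_length):
        """Append an unsigned value encoded on exactly bits_length bits."""
        if not 0 <= value < (1 << bits_length):
            raise ValueError(f"{value} does not fit on {bits_length} bits")
        for shift in range(bits_length - 1, -1, -1):
            self._bits.append((value >> shift) & 1)

    def to_bytes(self):
        """Return the accumulated bits, zero-padded to a whole number of bytes."""
        padded = self._bits + [0] * (-len(self._bits) % 8)
        out = bytearray()
        for i in range(0, len(padded), 8):
            byte = 0
            for bit in padded[i:i + 8]:
                byte = (byte << 1) | bit
            out.append(byte)
        return bytes(out)

writer = BitWriter()
writer.write(5, 3)   # a threeBitType value (Document 4): bits 101
writer.write(18, 5)  # a hypothetical five-bit field: bits 10010
print(writer.to_bytes().hex())  # b2
```

The two fields together fill exactly one byte (10110010). The range check in write plays the role of the consistency requirement between the value and the declared number of bits.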

3.5.4 Simple type derivation

As in XML-Schema, the author of a schema may define his/her own datatypes by deriving them from built-in types. Beyond the derivation by restriction seen above, derivation by list is also allowed. In this case, each item of the list is binary encoded following the rule of the item type and sequentially appended to the output bitstream. On the other hand, BSDL excludes derivation by union, since this allows several types to be assigned to the same element in the instance. Document 5 shows examples of simple type derivation. In the case of derivation by restriction, the xsd:length facet sets a constraint on the value, but has no impact on the encoding. In the case of derivation by list, the element value "1 2 3" will be encoded into the output bitstream as the following byte sequence: "0x00 0x01 0x00 0x02 0x00 0x03". Lastly, in the case of derivation by union, the number of bytes on which the value 5 should be encoded (four or two) is not determined, which is why this derivation is excluded.

<!-- Schema -->
<!-- Derivation by restriction -->
<xsd:simpleType name="markerType">
  <xsd:restriction base="xsd:hexBinary">
    <xsd:length value="2"/>
  </xsd:restriction>
</xsd:simpleType>
<!-- Derivation by list -->
<xsd:simpleType name="arrayOfShortType">
  <xsd:list itemType="xsd:short"/>
</xsd:simpleType>
<!-- Derivation by union: excluded by BSDL! -->
<xsd:simpleType name="intOrShortType">
  <xsd:union memberTypes="xsd:int xsd:short"/>
</xsd:simpleType>
<!-- Description -->
<!-- Derivation by restriction -->
<marker>FF91</marker>
<!-- Derivation by list -->
<arrayOfShort>1 2 3</arrayOfShort>
<!-- Derivation by union: excluded by BSDL! -->
<intOrShort>5</intOrShort>
Document 5: Example of simple type derivation
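
The three cases of Document 5 can be reproduced with the Python struct module. This is an illustrative sketch only: the function names are hypothetical and big-endian byte order is assumed, consistent with the byte sequence quoted in the text.

```python
import struct

def encode_short_list(text):
    """Encode a whitespace-separated list of xsd:short items, one after another."""
    return b"".join(struct.pack(">h", int(item)) for item in text.split())

def encode_marker(text):
    """Encode a markerType value; xsd:length constrains validity, not encoding."""
    value = bytes.fromhex(text)
    if len(value) != 2:
        raise ValueError("markerType requires exactly two bytes")
    return value

print(encode_marker("FF91").hex())          # ff91
print(encode_short_list("1 2 3").hex())     # 000100020003

# Derivation by union: the same lexical value has two candidate encodings,
# so the generator cannot decide which one to write.
as_int = struct.pack(">i", 5)    # four bytes
as_short = struct.pack(">h", 5)  # two bytes
print(len(as_int), len(as_short))
```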

3.5.5 Introduction of a new type bsdl:byteRange

As seen above, the bitstream data may be embedded in the description in a readable format, or be referred to by a pointer. For this latter case, we introduce a new datatype named bsdl:byteRange. It is derived from the XML-Schema datatype xsd:string, as specified in Document 6. This definition is included in a schema specifying BSDL extensions, which should therefore be imported by any Bitstream Schema using this type.

Its semantics is defined as follows: an element of type bsdl:byteRange indicates the first and last byte offsets of the relevant byte range in the described bitstream. When generating the output bitstream, the XMLtoBin Parser should fetch the described resource, copy the relevant byte range and append it to the output bitstream. The content of an element of type bsdl:byteRange should therefore follow the syntax:

firstByte-lastByte

where firstByte and lastByte indicate the first and last byte offsets of the relevant range of data, the first byte of the resource having the number 0. Note that this syntax is compatible with the one used in HTTP/1.1 to specify a byte range (see [14], section 14.35.1).

In the example shown in Document 6, the bytes from 15 to 200 (inclusive) will be copied from the described bitstream and appended to the output bitstream.

<!-- Fragment of the schema defining BSDL extensions -->
<xsd:simpleType name="byteRange">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value="[0-9]*-[0-9]*"/>
  </xsd:restriction>
</xsd:simpleType>
<!-- Schema fragment of an element declaration with a bsdl:byteRange type-->
<xsd:element name="Packet" type="bsdl:byteRange"/>
<!-- Description fragment using a bsdl:byteRange type -->
<Packet>15-200</Packet>
Document 6: Example of bsdl:byteRange

Note that we currently restrict the use of bsdl:byteRange to byte-aligned ranges of data. This has proven to be sufficient in the applications tested so far, but the syntax may be easily extended to indicate non byte-aligned ranges of data.
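
The copy operation attached to bsdl:byteRange can be sketched in Python as follows. The function name copy_byte_range is an assumption, and the sketch deliberately requires both offsets to be present, which is stricter than the pattern given in Document 6.

```python
import re

# Stricter than the schema pattern: both offsets must be present.
BYTE_RANGE = re.compile(r"([0-9]+)-([0-9]+)")

def copy_byte_range(resource, byte_range):
    """Return bytes firstByte..lastByte (inclusive, zero-based) of resource."""
    match = BYTE_RANGE.fullmatch(byte_range.strip())
    if match is None:
        raise ValueError(f"not a firstByte-lastByte range: {byte_range!r}")
    first, last = int(match.group(1)), int(match.group(2))
    if first > last or last >= len(resource):
        raise ValueError(f"range {byte_range!r} is outside the resource")
    # Python slices exclude the end index, hence last + 1.
    return resource[first:last + 1]

# Stand-in for the described bitstream: byte i has the value i.
resource = bytes(range(256))
chunk = copy_byte_range(resource, "15-200")
print(len(chunk), chunk[0], chunk[-1])  # 186 15 200
```

Because both offsets are inclusive, the range 15-200 of Document 6 yields 186 bytes.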

4 APPLICATIONS AND RESULTS

4.1 Introduction / Flexibility of the level of description

This section reports how the original method introduced in this paper has been tested on several coding formats (image and video), standards (MPEG, JPEG) and proprietary developments. In the following, we briefly describe for each case the format, its scalable features, the BSDL schema proposed and the XSLT style sheet to transform the BSDL description. Each proposed schema is not unique, but corresponds to the specific requirements of an application, and demonstrates the flexibility of the method.

For example, the JPEG2000 BSDL schema shown in Section 4.2 details each parameter of the main header though only a subset of them are used. On the other hand, the MPEG-4 BSDL schema shown in Section 4.3 views the header and the different types of frames as black boxes and does not detail their content. This example shows that a very coarse level of description may be sufficient for some applications. On the other hand, other applications dealing with more elaborate scalability mechanisms of MPEG-4 Videos would require a finer description. We therefore see that the level of description and hence the schema are not unique and depend on the application.

4.2 Application to JPEG2000 still images

The emerging still image coding method JPEG2000 [15][16] is based on the wavelet transform, an inherently scalable method which allows scalable encoding driven by SNR quality, color component or resolution, i.e. image size.

In the first case (progression by SNR quality), the bitstream is organized in layers, each layer carrying a quality increment. In the second case (progression by color component), data are organized by color components. By removing the last components of a color image, a luminance (i.e. gray-level) image is thus obtained. In the last case (progression by resolution), data are organized following the successive image size decompositions starting from the smallest one. For example, considering a 512² image, the bitstream first yields a 32² image, then a 64² image and so forth up to the full original resolution. All these image transformations can be achieved by editing the bitstream in a relevant way.

The JPEG2000 standard defines the following terminology:

Main and tile-part headers are organized in markers and marker segments. A marker is a particular 2-byte value, and a marker segment consists of a marker followed by two bytes indicating the marker segment length and then the actual parameters. The bitstream is divided into packets, the number of which depends on the number of color components, quality layers and resolutions. The packets are the elementary segments of data that should be removed to obtain a degraded version of the image. Figure 5 shows the structure of elements declared in the JPEG2000 BSDL schema.

Figure 5

Figure 5: BSDL schema for a high-level description of a JPEG2000 image
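The marker segment layout described above (a 2-byte marker, a 2-byte length, then the parameters) can be sketched as follows. This is a minimal illustration, not part of the BSDL tools: the function name `read_marker_segment` is hypothetical, the input bytes are synthetic, and the sketch assumes that the length field counts its own two bytes and the parameters but not the marker.

```python
import struct


def read_marker_segment(data, offset):
    """Read one marker segment starting at `offset`.

    Returns (marker, parameters, next_offset). Assumes the 2-byte length
    counts itself and the parameters, but not the 2-byte marker.
    """
    marker, length = struct.unpack_from(">HH", data, offset)  # big-endian
    if length < 2:
        raise ValueError("marker segment length must be at least 2")
    start = offset + 4           # first parameter byte
    end = offset + 2 + length    # one past the last parameter byte
    if end > len(data):
        raise ValueError("truncated marker segment")
    return marker, data[start:end], end


# Synthetic bytes: marker 0xFF64, length 6 (2 length bytes + 4 parameter bytes).
segment = bytes([0xFF, 0x64, 0x00, 0x06]) + b"\x00\x01Hi"
marker, params, next_offset = read_marker_segment(segment, 0)
print(hex(marker), params, next_offset)
```

Calling the function repeatedly with the returned `next_offset` walks through a header one marker segment at a time, which is the granularity at which the JPEG2000 schema describes the main header.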

We wrote three XSLT style sheets corresponding to the different types of scalability listed above. Each one removes a given number of packets from the bitstream description, this number depending on the targeted quality or resolution level. Furthermore, the style sheet updates some parameters in the main header. For example, when producing a smaller resolution, the image width and height parameters are modified accordingly in the SIZ marker segment. Once the XSLT transformation has been applied, a compliant bitstream can be generated from the resulting description.

Note that this operation is more powerful than a simple file truncation, since we can modify some parameters and remove segments of data from an arbitrary location in the bitstream and not only at its end. In this way, the resulting bitstream remains fully compliant with the coding format.

4.3 Application to MPEG-4 videos

A simple application using MPEG-4 Video Elementary Streams [17] enabled us to validate the generation of a bitstream at different frame-rates. An MPEG-4 video bitstream contains a succession of frames also called Video Object Planes (VOP). An I-VOP is encoded independently of any other VOP, a P-VOP is predicted (using motion compensation) from another previously decoded VOP and a B-VOP is bi-directionally predicted from past and future VOPs.

As MPEG-4 frames contain their own time stamps, the decoder can skip some B-VOPs to provide a lower time resolution video. Otherwise, all the decoded VOPs can be used to provide the full temporal rate. For this test, we simply applied an XSLT Style Sheet on the bitstream high-level description to get a new description without B-VOPs. The BSDL schema for this application is not as detailed as the JPEG2000 schema, where each header parameter was described. We remain here at a level of description only identifying I, B or P VOPs and the header as a whole without detailing them. Header and VOPs data are not directly embedded within the description but pointed to, using byteRange types. Figure 6 below shows the structure of elements declared in this MPEG-4 BSDL schema.

Figure 6

Figure 6: BSDL schema for a high-level description of MPEG-4 Video Streams

Then, we applied an XSLT style sheet to remove the B-VOP elements from the initial description. The obtained XML description was processed to generate a compliant bitstream at a lower framerate. Document 7 below shows an excerpt of the input and the output descriptions.

<Bitstream xmlns="mpeg4"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="mpeg4 MPEG4Bitstream.xsd">
    <Header>0-17</Header>
    <I_VOP>18-4658</I_VOP>
    <P_VOP>4659-4756</P_VOP>
    <B_VOP>4757-4772</B_VOP>
    <B_VOP>4773-4795</B_VOP>
    <P_VOP>4796-4973</P_VOP>
    <B_VOP>4974-5026</B_VOP>
    <B_VOP>5027-5065</B_VOP>
    <P_VOP>5066-5300</P_VOP>
    <!-- etc -->
</Bitstream>
<Bitstream xmlns="mpeg4"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="mpeg4 MPEG4Bitstream.xsd">
    <Header>0-17</Header>
    <I_VOP>18-4658</I_VOP>
    <P_VOP>4659-4756</P_VOP>
    <P_VOP>4796-4973</P_VOP>
    <P_VOP>5066-5300</P_VOP>
    <!-- etc -->
</Bitstream>
Document 7: Initial and transformed XML descriptions of an MPEG-4 video stream
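The transformation and the bitstream generation step can be sketched together in a few lines. This is only an illustration under stated assumptions, not the style sheet or the generator used in our tests: the function names `drop_b_vops` and `assemble` are hypothetical, the source bytes are placeholders rather than real video data, and each element is assumed to hold an inclusive byte range of the form "first-last", as the consecutive ranges of Document 7 suggest.

```python
import xml.etree.ElementTree as ET

NS = "mpeg4"  # namespace URI used in Document 7


def drop_b_vops(description_xml):
    """Parse a description and return its root without any B_VOP element."""
    root = ET.fromstring(description_xml)
    for b_vop in root.findall(f"{{{NS}}}B_VOP"):
        root.remove(b_vop)
    return root


def assemble(root, source):
    """Concatenate the byte ranges listed in the description.

    Each child element is assumed to hold an inclusive range "first-last".
    """
    out = bytearray()
    for element in root:
        first, last = (int(v) for v in element.text.split("-"))
        out += source[first:last + 1]
    return bytes(out)


description = """<Bitstream xmlns="mpeg4">
    <Header>0-17</Header>
    <I_VOP>18-4658</I_VOP>
    <P_VOP>4659-4756</P_VOP>
    <B_VOP>4757-4772</B_VOP>
    <B_VOP>4773-4795</B_VOP>
    <P_VOP>4796-4973</P_VOP>
</Bitstream>"""

source = bytes(i % 256 for i in range(4974))  # placeholder bytes, not real video
adapted = drop_b_vops(description)
print([child.tag.split("}")[1] for child in adapted])
print(len(assemble(adapted, source)))
```

The two removed B-VOPs cover bytes 4757 to 4795, so the assembled output is 39 bytes shorter than the 4974-byte placeholder source.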

For information only, Table 1 shows the sizes of the resulting XML descriptions.

| MPEG-4 file | Bitstream size (bytes) | XML description size (bytes) | Initial number of frames | Final number of frames | Size ratio (XML/Bitstream) |
|---|---|---|---|---|---|
| akiyo_n30_m3.bits | 165034 | 12108 | 297 | 100 | 0.073 |
| foreman_n15_m2.bits | 572100 | 12999 | 299 | 99 | 0.023 |
Table 1: results for two different MPEG-4 video bitstreams

Note that even though XML data is obviously much more verbose than a binary format, the bitstream description is smaller than the original bitstream. This is because we do not describe each field of the bitstream with an XML element, but give a high-level description. For example, the single element <I_VOP>18-4658</I_VOP> describes a 4641-byte field (bytes 18 to 4658 inclusive) with only 22 ASCII characters.

4.4 Application to a proprietary scalable format for video

In order to show that our method is also effective on non-standard formats, we tested it on a video coding method developed at Philips Research France. The 3D-subband video codec proposed by Bottreau et al. [18] is based on a 3D-wavelet transform, the third dimension being the time scale, and provides several progressive modes.

The bitstream is organized in Groups of Frames (GOF), each containing one or several SNR quality layers (called BitPlanes), which are subdivided into several time resolution layers (called JT), which in turn include several spatial resolution layers (called JS). Separators in the bitstream delimit the successive spatial, temporal and quality levels (GOFEndCode, BitPlaneEndCode, JTEndCode). If the required spatial resolution, frame rate or quality is lower than that provided by the encoder, then the decoder or server simply skips some parts of the bitstream.

Figure 7 shows the structure of elements declared in the 3D-subband BSDL schema.

Figure 7

Figure 7: BSDL schema for a proprietary video bitstream

Document 8 shows an excerpt of a 3D-subband XML bitstream description.

<!--bitstream description for 3ds -->
<Bitstream xmlns="3DScal"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="3DScal 3DScalableBitstream.xsd">
  <Header>
    <bitrate>1024</bitrate>
    <colorType>2</colorType>
    <nbGOFframes>16</nbGOFframes>
    <imageWidth>352</imageWidth>
    <imageHeigth>288</imageHeigth>
    <framerateFactor>1</framerateFactor>
    <nbTempLevels>4</nbTempLevels>
    <nbTempRecLevels>4</nbTempRecLevels>
    <nbSpatLevels>5</nbSpatLevels>
    <nbSpatRecLevels>5</nbSpatRecLevels>
    <!-- etc -->
  </Header>
  <GOF>
    <BitPlane>
      <JT>
        <JS>20-650</JS>
        <JS>651-652</JS>
        <JS>653-654</JS>
        <JTEndCode>7ffb</JTEndCode>
      </JT>
      <!-- etc -->
    </BitPlane>
    <!-- etc -->
  </GOF>
</Bitstream>
Document 8: Initial XML description of a proprietary video stream

As this format provides three types of scalability, we tested three different XSLT style sheets: the first one to change the image size, the second one to change the frame rate and the last one to change the SNR quality. Document 9 shows the style sheet dedicated to the adaptation of the spatial resolution. Each style sheet takes a parameter that controls the number of layers to be kept for the corresponding kind of progressivity (for instance, the parameter keepJsLevels in Document 9).

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
  xmlns:s3d="3DScal"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="1.0">
    <xsl:output method="xml" indent="yes"/>
    <!-- Parameter: number of JS levels to keep -->
    <xsl:param name="keepJsLevels">2</xsl:param>
    <!-- Match all default template -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
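    <!-- Set nbSpatRecLevels to the value of keepJsLevels -->
    <xsl:template match="s3d:Header/s3d:nbSpatRecLevels">
        <xsl:copy>
            <xsl:value-of select="$keepJsLevels"/>
        </xsl:copy>
    </xsl:template>
    <!-- Copy only the first keepJsLevels JS elements of each JT element.
         The test sits inside the template because XSLT 1.0 does not allow
         a variable reference in a match pattern. -->
    <xsl:template match="s3d:JT/s3d:JS">
        <xsl:if test="count(preceding-sibling::s3d:JS) &lt; $keepJsLevels">
            <xsl:copy>
                <xsl:apply-templates select="@*|node()"/>
            </xsl:copy>
        </xsl:if>
    </xsl:template>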
</xsl:stylesheet>
Document 9: XSL style sheet to adapt spatial resolution
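For readers who want to experiment without an XSLT processor, the effect of Document 9 can be approximated in a general-purpose language. The sketch below is an illustration only, not the tool used in our tests: the function name `keep_spatial_levels` is hypothetical, and the input is a shortened version of Document 8 with the elided parts left out.

```python
import xml.etree.ElementTree as ET

NS = {"s3d": "3DScal"}  # namespace URI used in Documents 8 and 9


def keep_spatial_levels(root, keep_js_levels):
    """Keep the first `keep_js_levels` JS elements of every JT, in place."""
    level = root.find("s3d:Header/s3d:nbSpatRecLevels", NS)
    if level is not None:
        level.text = str(keep_js_levels)  # mirrors the header update of Document 9
    for jt in root.iter("{3DScal}JT"):
        for js in jt.findall("s3d:JS", NS)[keep_js_levels:]:
            jt.remove(js)
    return root


description = """<Bitstream xmlns="3DScal">
  <Header>
    <nbSpatLevels>5</nbSpatLevels>
    <nbSpatRecLevels>5</nbSpatRecLevels>
  </Header>
  <GOF>
    <BitPlane>
      <JT>
        <JS>20-650</JS>
        <JS>651-652</JS>
        <JS>653-654</JS>
        <JTEndCode>7ffb</JTEndCode>
      </JT>
    </BitPlane>
  </GOF>
</Bitstream>"""

root = keep_spatial_levels(ET.fromstring(description), 2)
print([js.text for js in root.iter("{3DScal}JS")])           # ['20-650', '651-652']
print(root.find("s3d:Header/s3d:nbSpatRecLevels", NS).text)  # 2
```

As in the style sheet, the separator element JTEndCode is left untouched, so the structure of each JT remains intact after the spatial layers are removed.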

For information only, Table 2 shows the size of the resulting XML description.

| 3DS file | Bitstream size (bytes) | XML description size (bytes) | Initial number of spatial layers | Final number of spatial layers | Size ratio (XML/Bitstream) |
|---|---|---|---|---|---|
| forman.bits | 134000 | 13000 | 5 | 2 | 0.097 |
Table 2: results for a 3D-subband video bitstream

4.5 Other applications under study

We are currently testing the generation of adapted bitstreams on two other types of encoding formats: the MPEG-4 Visual Texture Coding (VTC) format [17] and the MPEG-4 Fine Granularity Scalable (FGS) format [19]. The MPEG-4 VTC bitstream is an example of application-oriented scalability. The MPEG-4 FGS format is a relevant way to validate the application of this technology for encoders of different complexities.

5 SYSTEM APPROACH

5.1 A server implementation

In order to validate our solutions, we integrated the bitstream generator XMLtoBin in a Client/Server system and tested the delivery of JPEG2000 bitstreams. The test platform is partly based on the Cocoon servlet, which is an implementation of a web-publishing framework provided by the Apache project [20].

Our implementation involves an Apache server, configured so that it forwards JPEG2000 files to a servlet called MediaHandler, and XML files to Cocoon. Basically, Cocoon is a servlet dedicated to dynamic document production, which receives an XML description, applies an XSLT style sheet and returns the transformed XML description. The test scenario is the following one:

Figure 8 below shows the request processing in the server implementation.

Figure 8

Figure 8: System architecture

The HTTP server and the Cocoon and MediaHandler servlets may run on different machines, so that the different tasks are distributed to dedicated servers and no single server is overloaded.

Note that this architecture is based on HTTP and therefore does not suit the case of streamed media such as an MPEG-4 video. Furthermore, XML transformation with XSLT is context-dependent and cannot be run on-line. We are currently extending this architecture to allow the streaming of media. For this, we shall need to use RTP-based connections and to define a subset of XSLT that may be run on-line.

5.2 Content negotiation aspect

The technology we describe in this paper relies on the exchange of capabilities between the Client and the Server. Although the capability description aspect is not in the scope of our work, we identified several ongoing efforts, which are relevant to integrate our solution in a Client/Server system. We can mention notably the W3C solution CC/PP (Composite Capabilities/Preference Profiles) to describe device capabilities [21]. A CC/PP profile contains a number of attributes, names and associated values that are used by a server to determine the most appropriate form of a resource to be delivered to a client.

The Internet Engineering Task Force (IETF) also formed a Content Negotiation Working Group (Conneg) to cover content negotiation inside and outside of HTTP [22]. HTTP enables web site authors to put multiple versions of the same information under a single resource URI. The Conneg specification offers an extensible negotiation mechanism to automatically retrieve the best version of an HTML document. For that purpose, when the resource is accessed, the user agent sends small accept-headers, which express both user agent capabilities and user preferences.
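To make the accept-header mechanism concrete, the sketch below picks one of the versions held by a server from the media types announced by a user agent. It is a simplified illustration and not part of our implementation: the function names are hypothetical, wildcards such as `image/*` are ignored, and only the `q` preference parameter is interpreted.

```python
def parse_accept(header):
    """Parse an HTTP Accept header into (media_type, q) pairs, best first."""
    entries = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if not parts[0]:
            continue
        q = 1.0  # default preference when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                q = float(value)
        entries.append((parts[0], q))
    # sorted() is stable, so equal q-values keep their header order
    return sorted(entries, key=lambda e: e[1], reverse=True)


def choose_variant(header, available):
    """Return the preferred media type that the server can provide, or None."""
    for media_type, q in parse_accept(header):
        if q > 0 and media_type in available:
            return media_type
    return None


print(choose_variant("image/jp2;q=0.9, image/jpeg, image/png;q=0.5",
                     ["image/png", "image/jp2"]))  # image/jp2
```

In a system such as the one described in Section 5.1, the selected entry would only determine which media format is served; the choice of resolution or quality level would still rely on a capability description such as a CC/PP profile.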

The Wireless Application Protocol (WAP) standard is an adaptation of Web protocols and languages for mobile phone applications. The User Agent Profile (UAProf) specification [23] extends CC/PP to enable the end-to-end flow of capability and preference information between the WAP client, the intermediate network points and the origin server. It seeks to inter-operate seamlessly with CC/PP distribution over the Internet. The specification defines a set of components and attributes that WAP devices may convey within the capability and preference information, such as hardware characteristics (screen size, color capabilities, manufacturer, etc.) and network characteristics (latency, reliability, etc.).

6 FUTURE WORKS

We are currently extending BSDL into a grammar that makes it possible to parse a bitstream and generate its XML description. Hence, a generic software agent can parse a bitstream without any internal knowledge of its structure, only by referring to the BSDL schema, as depicted in Figure 9.

Figure 9

Figure 9: Extensions of the Bitstream Syntax Description Language

The bitstream structure often depends on the encoding options. For instance, the presence of one or several enhancement layers can be specified in a field located in the bitstream header. Thus, a block of data may be present only if a specific flag occurs earlier in the stream. In some cases, the value of a field may also depend on the value of a previous field. Consequently, we need to introduce conditional statements and variables in BSDL. For that purpose, we are presently defining new BSDL elements. As XSLT provides some mechanisms to model variables, we intend to derive these elements from some of its structures.
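The need for variables and conditional statements can be illustrated with a hand-written parser for a toy bitstream. Everything in this sketch is hypothetical (the layout, the element names and the function `describe`); it only shows the kind of dependency that a generic, schema-driven parser would have to express declaratively.

```python
import xml.etree.ElementTree as ET


def describe(data):
    """Build an XML description of a toy bitstream.

    Layout (hypothetical): one flag byte, one length byte, a base layer of
    that length, then an enhancement layer only when the flag is 1.
    """
    root = ET.Element("Bitstream")
    has_enhancement = data[0]                  # variable set by an early field
    ET.SubElement(root, "hasEnhancement").text = str(has_enhancement)
    base_length = data[1]                      # value reused by a later field
    ET.SubElement(root, "baseLength").text = str(base_length)
    base_end = 2 + base_length
    ET.SubElement(root, "Base").text = f"2-{base_end - 1}"
    if has_enhancement == 1 and base_end < len(data):  # conditional statement
        ET.SubElement(root, "Enhancement").text = f"{base_end}-{len(data) - 1}"
    return root


with_layer = describe(bytes([1, 3]) + b"abc" + b"xyz")
without_layer = describe(bytes([0, 3]) + b"abc")
print(ET.tostring(with_layer, encoding="unicode"))
print(ET.tostring(without_layer, encoding="unicode"))
```

Here the position of the Enhancement element depends on the value read into `base_length`, and its very presence depends on `has_enhancement`; neither dependency can be expressed by a purely static content model.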

7 CONCLUSION

XML technologies have been overwhelmingly successful for structuring, editing and exchanging electronic textual data. However, up to now no solution has been developed at the bitstream level to exchange adapted media elements. In this paper, we briefly reviewed our previous work on that issue: an original extension of the use of XML technologies to describe the syntax of a multimedia bitstream. Then, we introduced and specified a new schema language called Bitstream Syntax Description Language (BSDL) that embraces such XML bitstream descriptions. Moreover, this language provides a cross-standard and generic approach to adapt scalable content according to the terminal resources. As a validation, we showed three applications to multimedia formats featuring scalability properties: JPEG2000 still images, MPEG-4 videos and a proprietary video-encoding format. Finally, we showed a possible web-based implementation of this framework for a Client/Server system and non-streamed media.

The extension of XML technologies to multimedia bitstream modeling is an original and promising solution for advanced web publishing and allows numerous industrial applications, beyond the Internet context. Actually, the content adaptation technology we present here can be straightforwardly considered for any industrial application dealing with devices producing, exchanging or receiving multimedia contents. This technology is notably relevant for advanced broadcasting chains involving Set-Top Boxes (STBs), high and low definition televisions and Personal Digital Assistants (PDAs).

REFERENCES