Rich
Media Retrieval on the Web – a Multi-level
Indexing Approach
Jian Zhai1 Jun
Yang1 Qing Li1 Liu
Wenyin2 Bo Feng1
1Dept. of Computer Engineering and
Information Technology
2Dept. of Computer Science
ABSTRACT
Rich media is experiencing a breathtaking growth in recent
years. Because of the potentially very large index size, it is hard to adopt and adapt content-based method to search rich media files on the Web. In this work, we describe a multi-level indexing method to solve this problem.
Specifically, we propose a novel technique of indexing rich media at
different levels in a large collection. Query examples are presented to show that our approach can find more relevant rich
media files and filter out
noise data more effectively than traditional search
engines
without notable loss in efficiency.
Keywords
rich media retrieval,
multi-level index, XML representation.
Rich media is a new media format that usually
integrates with audio, video, hypertext, graphic, vector objects, animations,
and user interactions. For example, SMIL [1], Flash [2], and PowerPoint [3] are all instances of rich media. According to the
latest statistics, the three major rich media formats, SMIL, Flash, and
PowerPoint, are supported by 99%, 98%, and 96% [4] of the web browsers worldwide, respectively.
However, there are still many difficulties when we try
to implement a practical content-based rich media retrieval system in the Web
environment. Though Google can retrieve the content
of PowerPoint and PDF files, it is shown by the latest statistics that these two types of files account for only 5% of
all rich media files on the Web. [4]
Particularly, the index size would be too massive if we
had simply applied traditional indexing techniques on top of the hundreds of
thousands of rich media files (approximately 9,537,000 files, according to Google’s search result) appeared on the Web, not to mention
the fact that this number keeps increasing these years. As a result, the gigantic
index size entails tremendous calculation in the retrieval process.
To solve this problem, we make use of the vector based
information of rich media to reduce the size of indices. The extracted feature is quite
small in size but can still precisely capture the characteristics of the
original rich media file. Also, several similar rich media files can share the
same feature vector, which exactly outlines the characteristics of the group of
rich media files. In addition, we provide a multi-level indexing approach as
another means to solve the large index size problem. With properly designed
data structure and algorithm, the computational complexity of finding the
matched results is relatively low.
In our model, five levels are provided at which rich
media files are indexed. From the top to bottom, they are domain level,
category level, semantic level, structural level, and featured level,
respectively.
Domain level index is the highest level of index in our model. Its
function is similar to that of the web site directory of Yahoo! i.e.,
indicating the subject or topic of the media file. Index at domain level cannot
be changed.
Category level index
specifies the categories of rich media files. Indices are subject to change as
the result of user feedbacks, as in the case of many existing retrieval systems. . Its classification
is more fine-grained than Domain level index and can be modified according to
newly inserted indices.
Semantic level index emphasizes the semantics of rich media files.
This level and below are created automatically.
Structural
level index focuses on the internal
characteristics of rich media files.
Feature level is the lowest level for index of rich media files. At
this level, the index describes the main objects existing in the rich media files.
At each level, the indices are all mapped to a numeric range. The query keyword is
also converted to a query id by a lexicon. ª Then the remaining query can be processed in the same
way as the traditional (database) multi-level index operations.
Figure 1. The systematic view of multi-level
index
The organization of the multi-level index is shown in Figure 1. Each
item in rich media library (Records) includes the URL and several character ids.
In order to make the abstract approach idiographic, we
have implemented a prototype system to validate our algorithm. To illustrate our
approach of multi-level indexing method, we present a four-tier architecture to
address the representation, indexing, retrieval, and query specification of rich
media files. As shown in Figure
2, the architecture is constituted by representation module,
indexing module, retrieval module and user interface from bottom to top.
Figure 2. The 4
tier architecture of rich media retrieval system
To allow this architecture to support different types
of rich media files, we only need to adapt/extend its representation module. Currently, our system only targets at Flash files because they
account for the biggest portion of rich media on the Web (about 90%), although it is effectively applicable to the
content-based retrieval of other rich media formats.
In our experiment, we don’t realize a crawler. The
testing data is achieved from Google search with “.swf” in the link area.
We compared the performance of our system with other
existing Web search engines, including Google and Flashkit.
The selected rich media file for testing is a Flash
shooting game [6]. In this experiment, the keyword “M4A1 gun” (which is the name of a weapon in the game) was used as a query. In our test,
the file is found and returned as the result expectedly.
Then we tried to search the same file using Google and FlashKit. It is shown that none of them find the file ones we are looking for. The reason is
that our advanced method in multi-level index approach embeds a feature vector (such as “M4A1”) in the lexicon or even in the
description field of index data, so that the required target can be
located more accurately.
When a user input “shoot game” as the searching keyword in FlashKit, more than 200 Flash movies were returned as the results but we
can easily find that some of them are not a shooting game or not even a game. Such noisy results should be filtered out in the desired scenario.
When we tried the same query in our system, the result is positive because actually
there is no “shoot game” text object in this Flash movie, and our multi-level
index mechanism is able to filter out such
noise data.
As an online content-based search engine, searching
speed is an important
aspect. In order to see how much the multi-level indexing algorithm can improve
the searching speed, we carried out ten keyword-based experiments with and without the multi-level index algorithm being applied to the system. Table
1 gives the detailed
comparisons.
Table 1. Test results for speed
Input keyword |
Number of movies found |
Scratch time (in second) |
Stop level |
|
Multi-level indexing |
Non-multi-level indexing |
|||
age of empire |
1,940 |
0.58 |
51 |
3 |
lion king |
1,060 |
0.32 |
47 |
5 |
football |
17,300 |
0.56 |
67 |
3 |
birthday card |
89,400 |
0.66 |
82 |
4 |
classic music |
4,340 |
0.47 |
53 |
4 |
Samsung |
740 |
0.27 |
36 |
4 |
yesterday once more |
780 |
0.22 |
37 |
5 |
Mp3 |
115,600 |
0.71 |
97 |
3 |
Jacky Chan |
343 |
0.20 |
30 |
4 |
abasdfdaas |
0 |
0.02 |
0.1 |
0 |
As shown above, when the multi-level index is not used the
system, the
query processing is based on a straightforward string comparison, which makes the response time too long to be acceptable in practice. In the case of using our multi-level indexing algorithm,
the responding time is evidently shortened.
In fact, our response time is at the same level as Google’s. By
making use of the vector characteristics of rich media, our multi-level index method is moreover suitable for content based
retrieval on rich media.
We have presented an approach to content-based rich media
retrieval using multi-level indexing method, and demonstrated
its effectiveness through experimental studies. Admittedly,
there are still some
valuable research problems to be addressed in the future. Currently, Hoffman
encoding is used in the lexicon, in which the efficiency
bottleneck still exists. If the multi-level
index is also applied to the lexicon, the efficiency may be improved. However, a
fully-fledged algorithm is yet to be designed for maintaining the indices of the lexicon. Our future work will address this issue too.
This work was partially supported by a grant from CityU
(Project No.
7001384).
[1] http://www.w3.org/AudioVideo/
[2] http://www.macromedia.com/
[3] http://www.microsoft.com/office/powerpoint/
[4]
http://www.copyright.gov/1201/comments/reply/090maddux.pdf.
[5] Yannis Labrou and Tim Finin, “Yahoo! as an Ontology - Using Yahoo! Categories to
Describe Documents”, Eighth International Conference on Knowledge and
Information Management (CIKM-99),
ª Details of our
algorithm description can be found at http://personal.cityu.edu.hk/~50005280/research/TR20021115.pdf.