In this paper, we present a new system employing an up-to-date image recognition technique for visual object categorization/recognition: the bag-of-keypoints representation [1]. The bag-of-keypoints representation has recently become popular in the computer vision research community, and it has been shown to represent image concepts very well in the context of visual object categorization/recognition in spite of its simplicity [1].
The basic idea of the bag-of-keypoints representation is that a set of local image patches is sampled, either with an interest point detector or randomly, and a visual descriptor vector is computed for each patch with the Scale Invariant Feature Transform (SIFT) descriptor [2]. The resulting set of descriptor vectors is then quantized against a pre-specified codebook by vector quantization, and the quantized distribution vector is used as the characterization of the image. To classify images associated with the quantized vectors as relevant or irrelevant, we use an SVM classifier.
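As a concrete illustration of this pipeline, the following is a minimal sketch in Python using OpenCV's SIFT implementation. It assumes a precomputed codebook (e.g., k-means centroids over descriptors from many images); the codebook size and the use of OpenCV are our assumptions, not details given in the paper.

```python
# A minimal sketch of bag-of-keypoints feature extraction with OpenCV's
# SIFT. The codebook (k centroids, e.g., from k-means over descriptors
# of many images) is assumed to be precomputed; its size is not given
# in the paper.
import cv2
import numpy as np

def bok_histogram(image_path, codebook):
    """Return a normalized bag-of-keypoints histogram for one image.
    codebook: (k, 128) array of codeword centroids in SIFT space."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    _, descriptors = sift.detectAndCompute(img, None)  # (n, 128) SIFT vectors
    if descriptors is None:                            # no keypoints found
        return np.zeros(len(codebook))
    # Vector quantization: assign each descriptor to its nearest codeword.
    dists = np.linalg.norm(descriptors[:, None] - codebook[None, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()                           # quantized distribution
```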
In this paper, we propose a Web image-gathering system based on the bag-of-keypoints model. Through experiments, we show that the new system greatly outperforms our previous systems.
The proposed system gathers images associated with user-provided keywords fully automatically. The input to the system is thus just keywords, and the output is several hundred or thousand images associated with those keywords.
Our proposed system consists of two stages: a collection stage and a selection stage. In this paper, we modify only the selection stage of our previous system [4].
In the collection stage, we gather many images and HTML documents related to the given keywords using Web search engines. We evaluate the relevancy of each image by analyzing its associated HTML documents. According to this relevancy, we divide the images into two groups: images in group A are highly relevant to the keywords, and the others are classified into group B. Since the images in group A are likely to be relevant, we use them as training data for an SVM classifier in the next stage, although they include a small number of irrelevant ones. The details are described in [4,5].
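The relevancy evaluation itself is detailed in [4,5]; purely as an illustration of the group-A/group-B split, the sketch below scores an image by keyword occurrences in its surrounding HTML text and thresholds the score. Both the scoring function and the threshold are hypothetical, not the actual method of [4,5].

```python
# Hypothetical illustration of the group-A/group-B split: the real
# relevancy evaluation is the HTML analysis of [4,5]; here an image is
# scored simply by keyword counts in nearby HTML text, and a threshold
# decides the group.
def relevancy_score(keywords, html_context):
    text = html_context.lower()
    return sum(text.count(k.lower()) for k in keywords)

def split_into_groups(images, keywords, threshold=3):
    """images: list of (image_url, html_context) pairs.
    Returns (group_a, group_b) by the hypothetical score above."""
    group_a, group_b = [], []
    for url, context in images:
        if relevancy_score(keywords, context) >= threshold:
            group_a.append(url)    # highly relevant -> training data
        else:
            group_b.append(url)
    return group_a, group_b
```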
In the selection stage, we select relevant images from all the downloaded images by image analysis. In this paper, we use the bag-of-keypoints model [1] as the image representation and an SVM classifier as the classification method. In general, to select true images with a machine learning method such as an SVM, we need labeled training images. However, we do not want to pick good images by hand. Instead, we regard the images classified into group A as training images, although they always include some irrelevant ones. In this paper, we provide the classifier with all group-A images as relevant training images.
In the selection stage, we first convert all the downloaded images into feature vectors based on the bag-of-keypoints representation, and then train an SVM classifier with all the vectors in group A as training data. Next, we classify all the vectors in groups A and B as relevant or irrelevant with the trained SVM. Finally, we obtain only the images classified as relevant to the provided keywords. A sketch of this processing is given below.
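Since only relevant (group-A) vectors are mentioned as training data, one plausible reading is a one-class SVM; the following is a minimal sketch under that assumption using scikit-learn. The RBF kernel and the nu parameter are our own choices, not the paper's.

```python
# A minimal sketch of the selection stage, assuming a one-class SVM
# trained only on the (noisy) group-A bag-of-keypoints vectors.
import numpy as np
from sklearn.svm import OneClassSVM

def select_relevant(group_a_vecs, all_vecs, nu=0.1):
    """Train on group-A vectors, then keep the images from groups A and B
    whose vectors the SVM classifies as relevant (+1)."""
    clf = OneClassSVM(kernel="rbf", gamma="scale", nu=nu)  # nu is our guess
    clf.fit(group_a_vecs)                  # group A = noisy positive data
    labels = clf.predict(all_vecs)         # +1 relevant, -1 irrelevant
    return np.where(labels == 1)[0]        # indices of relevant images
```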
We carried out experiments independently for the following ten concepts: beach, sunset, flower, waterfall, mountain, lion, apple, baby, notebook PC, and Chinese noodle. For ``lion'' and ``apple'' only, we added the subsidiary keywords ``animal'' and ``fruit'' in the collection stage to restrict their meanings to the animal ``lion'' and the fruit ``apple'', respectively.
In the collection stage, we gathered around 5000 URLs for each concept from both Google Search and Yahoo Web Search, using their ``text search'' rather than ``image search''. The exact numbers vary across concepts, since we excluded duplicate URLs from the URL list for each concept.
Table 1 shows the results of the collection stage, namely the raw images, together with the evaluations of Google Image Search and of our previous systems, which employ the CBIR-based image selection method [4] and the GMM-based probabilistic method [5], for comparison. The results of the collection stage consist of the number of images downloaded from the Web with HTML analysis only and their precision. To compute precision and recall, we randomly selected 500 images for each concept and checked their relevancy by subjective evaluation. Note that we cannot estimate the recall of the downloaded images, since the denominator would be the number of images associated with the given concept on the whole Web, which cannot be known. For Google Image Search, the table shows the precision of the output images ranked between 1 and 500. The average precision of the raw images, 62.2%, was slightly superior to the average precision of the top 500 Google Image Search results, 58.6%, while we collected about 3000 images per concept. This shows that our image collection method is better than Google Image Search.
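For concreteness, the sampled precision estimate described above amounts to the following (the function and its names are ours, not the paper's):

```python
import random

def estimate_precision(images, is_relevant, sample_size=500):
    """Estimate the precision of a gathered image set from a random
    sample, as in the subjective evaluation described above.
    is_relevant: human relevancy judgment, image -> bool."""
    sample = random.sample(images, min(sample_size, len(images)))
    return sum(is_relevant(img) for img in sample) / len(sample)
```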
Table 1 also shows the number, precision, and recall of the results of the proposed method based on the bag-of-keypoints model and an SVM. In the experiments, we chose the parameter setting so that the recall rates were close to those of the two previous methods shown in Table 1, for easy comparison. Note that in the Web image-gathering task, the recall rate is less important than the precision rate, since the more Web sites we crawl, the more images we can easily get. Hence, we mainly evaluate the system performance by precision below.
In case (1), we obtained 81.1% precision on average, which outperformed the 66.0% precision of the CBIR method and the 73.5% precision of the GMM-based probabilistic method. Except for ``baby'' and ``notebook PC'', the precision for each concept was also improved. In particular, for ``flower'', ``lion'' and ``apple'', the precisions were improved prominently. This shows that the bag-of-keypoints representation is very effective for classifying ``object'' images. On the other hand, the precisions for ``baby'' and ``notebook PC'' were not good, and were lower than those of the probabilistic method. This is because we used all the group-A images as positive training samples, and for these two concepts the precisions of the raw group-A images were only 56% and 57%, respectively. In short, the training data for these two concepts contained too many irrelevant samples, which is why the precisions were not improved. To overcome this, we need to prepare better raw group-A images or to develop a mechanism to remove irrelevant training samples.
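One hypothetical version of such a mechanism (not implemented in the paper) would be to fit the classifier once, drop the lowest-scoring fraction of the group-A training vectors as likely outliers, and refit; the drop fraction and parameters below are our assumptions.

```python
# Hypothetical outlier-pruning mechanism for noisy group-A training data,
# sketched under the same one-class SVM assumption as above.
import numpy as np
from sklearn.svm import OneClassSVM

def prune_and_refit(group_a_vecs, drop_frac=0.2):
    clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
    clf.fit(group_a_vecs)
    scores = clf.decision_function(group_a_vecs)   # low = outlier-like
    keep = scores >= np.quantile(scores, drop_frac)
    clf.fit(np.asarray(group_a_vecs)[keep])        # refit on pruned set
    return clf
```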
We have prepared a Web site presenting the experimental results reported in this paper. The URL is as follows:
http://mm.cs.uec.ac.jp/yanai/www07/
Table 1: Results for the raw images (collection stage), the two previous methods (CBIR [4] and region-based [5]), and the proposed bag-of-keypoints system. Each cell shows the number of images with precision % (and recall %, where available) in parentheses; ``Goo.'' is the precision of the top 500 Google Image Search results.

| concepts | Goo. prec. | raw images A | raw images B | raw images A+B | CBIR [4] A+B | region-based [5] A+B | proposed A | proposed B | proposed A+B |
|---|---|---|---|---|---|---|---|---|---|
| sunset | 79.8 | 790 (67) | 710 (44) | 1500 (55.3) | 828 (62.2, 62.1) | 636 (91.0, 70.2) | 441 (94, 78) | 113 (92, 34) | 564 (93.3, 62.5) |
| mountain | 48.8 | 1950 (88) | 3887 (71) | 5837 (79.2) | 3423 (82.6, 61.2) | 3510 (89.0, 65.0) | 1628 (94, 89) | 1133 (92, 46) | 2761 (93.7, 68.7) |
| Chinese noodle | 65.2 | 901 (78) | 1695 (55) | 2596 (66.6) | 1492 (71.0, 61.3) | 1266 (77.0, 53.2) | 572 (84, 68) | 448 (94, 33) | 1020 (86.9, 54.2) |
| waterfall | 72.4 | 2065 (71) | 2584 (70) | 4649 (70.3) | 3281 (71.4, 71.7) | 3504 (76.8, 74.6) | 1728 (80, 94) | 1535 (86, 62) | 3263 (82.3, 82.0) |
| beach | 63.2 | 768 (69) | 1155 (62) | 1923 (65.5) | 1128 (67.3, 60.3) | 983 (73.3, 62.5) | 440 (84, 70) | 262 (93, 24) | 702 (86.5, 48.7) |
| flower | 65.6 | 576 (72) | 1418 (67) | 1994 (69.6) | 952 (79.3, 54.4) | 758 (71.9, 41.0) | 360 (84, 73) | 348 (94, 18) | 708 (86.7, 45.0) |
| lion | 44.0 | 511 (87) | 1548 (49) | 2059 (66.0) | 967 (71.0, 50.5) | 711 (69.4, 53.6) | 414 (87, 81) | 375 (73, 18) | 789 (85.5, 56.1) |
| apple | 47.6 | 1141 (78) | 2137 (59) | 3278 (64.3) | 1495 (68.8, 48.8) | 1252 (67.2, 37.7) | 759 (85, 73) | 212 (84, 20) | 971 (85.3, 38.5) |
| baby | 39.4 | 1833 (56) | 1738 (53) | 3571 (54.5) | 1831 (55.1, 51.8) | 1338 (63.9, 45.9) | 1441 (54, 76) | 601 (61, 29) | 2042 (55.7, 58.4) |
| notebook PC | 60.2 | 781 (57) | 1756 (32) | 2537 (43.6) | 1290 (46.9, 54.6) | 867 (56.0, 47.6) | 612 (58, 80) | 602 (45, 42) | 1214 (55.0, 66.3) |
| TOTAL/AVG. | 58.6 | 11316 (72) | 18628 (56) | 29944 (62.2) | 16687 (66.0, 57.7) | 14825 (73.5, 55.1) | 7926 (80, 78) | 5371 (81, 32) | 13297 (81.1, 58.0) |
As future work, we plan to prepare better raw group-A images by improving the HTML analysis methods and by combining the query keywords for Web search engines with effective subsidiary keywords. We also need to study how to remove irrelevant data from the training data, or how to learn from imperfect training image data gathered from the Web.