The existing image search engines do not use link-structure analysis algorithms to analyze and classify images according to the link-structure on the Web. The image-ranking methods using link-structure analysis algorithms proposed by the paper [4] may cause the problem of the "Tightly-knit" community (TKC) effect as described in [3, 5, 7]. We propose three combinations of image-ranking schemes to solve the problem of TKC effect on image search. Our experiments demonstrate that the image-ranking methods proposed by [4] have the TKC effect, but the three proposed combinations of image-ranking schemes can solve the TKC effect on image search.
Image Ranking Scheme, PageRank, Kleinberg's mutual reinforcement approach, LinkSink, TKC effect
The most famous link-structure analysis algorithms are PageRank [6] and Kleinberg's mutual reinforcement approach [2]. However, PageRank has the problem of Link Sink [6] while Kleinberg's mutual reinforcement approach has the problem of "Tightly-knit" community (TKC) effect [3, 5, 7].
Link Sink occurs when page a and page b point to each other but have no links link to other pages. If page a is pointed to by an external page, during the iteration of PageRank, the loop accumulates the weight and never distributes the weight to other pages. This causes the oscillation of the algorithm and the algorithm cannot converge.
The paper about SALSA [5] describes the TKC and TKC effect as follows: "A 'tightly-knit' community is a small but highly interconnected set of sites. Roughly specking, the TKC effect occurs when such a community scores high in link-analyzing algorithms, even though the sites in the TKC are not authoritative on the topic, or pertain to just one aspect of the topic".
To solve these problems, [6] modifies the PageRank equation and [3] introduces a modification of Kleinberg's algorithm called Hub-Threshold Kleinberg's algorithm.
For image search using link-structure analysis algorithms, the paper on PicASHOW [4] defines a t-relevance score rt(p) of page p and suggests that the authority scores calculated in Kleinberg's Mutual Reinforcement approach can be the t-relevance score. Then, it suggests three image ranking schemes. They are PageRank-influenced ranking scheme, Co-citation based analysis and Enhanced co-citation based analysis.
However, they still have the TKC effect if the t-relevant score is calculated using Kleinberg's algorithm. To solve the TKC problem, we propose there combinations of image-ranking schemes.
In this paper, we first propose three combinations of image-ranking schemes to solve the TKC problem. Then, we describe the experiments to evaluate the image-ranking schemes and demonstrate that the three combinations of image-ranking schemes proposed can solve the TKC problem. Beside the TKC problem, we also evaluated whether the algorithms had the link sink problem and showed the image-to-image relations. Moreover, we used large random generated link structures to evaluate the speed of the algorithms.
The image-ranking schemes we proposed are based on PageRank-influenced ranking scheme, enhanced co-citation based analysis, Hub-Threshold Kleinberg's algorithm and PageRank algorithm.
It uses the PageRank score of page p, instead of the authority score of page p calculated by the original Kleinberg's algorithm as the t-relevant score rt(p) since PageRank does not have the problem of TKC effect.
It uses the authority score of page p calculatied by Hub-Threshold Kleinberg's algorithm, instead of original Kleinberg's algorithm as the t-relevant score rt(p) since Hub-Threshold Kleinberg's algorithm does not have the problem of TKC effect.
The enhanced co-citation based analysis is same as that suggested by the paper [4]. The difference is the t-relevant score rt(p) used is the authority score of page p calculated by Hub-Threshold Kleinberg's algorithm, instead of original Kleinberg's algorithm since Hub-Threshold Kleinberg's algorithm does not have the problem of TKC effect.
This section explains the evaluation of the image-ranking methods using synthetic link structures. We evaluated the image-ranking methods that may cause link sink problem or TKC problem on image search. The image-ranking methods are:
The web structure in Figure 1 is used to evaluate the link sink problem on the image-ranking methods. In Figure 1, page 2 and page 3 form a link sink structure. Figure 2 shows the results of each method on the structure.
The result shows that PrinfPR is affected by PageRank algorithm; it has the link sink problem while the others do not. Images 3 and 4 of Figure 1 should not be highly ranked as shown by the other methods. But PrinfPR thinks that images 2, 3 and 4 should be highly ranked.
Note that PageRank-Influenced method sums the authority scores of the pages containing the image. The effects of Link Sink are the same if the page contains distinct image or multiple images.
The web structure in Figure 3 is used to evaluate the problem of TKC effect on the image-ranking schemes.
In Figure 3, that the pages inside the dotted rectangle are relevant to the topic t. But the pages outside the rectangle may be little or not relevant to the topic t. The page 5 is relevant to topic t but it does not contain the keyword on t. But P4 which is a page or a small set of pages is irrelevant to topic t. Users want to find the images contained in page 5 (images 18 and 19) and the pages inside the rectangle (images 1-15). To emphasize the results and for simplification, Figure 4 shows the results of the each method on the structure considering P4 as a page.
The results show the problem of TKC effect on PrinfKM, CoKM, CoHT and ECKM. Images 16 and 17 obtain very high scores from PrinfKM, CoKM, CoHT and ECKM. But images 16 and 17 may not be relevant to the topic.
PrinfPR, PrinfHT and ECHT solved the problem of TKC effect. The irrelevant images 16 and 17 do not obtain high scores from these methods. The other images become significant on PrinfPR, PrinfHT and ECHT.
Also, no TKC effect in page domain does not imply that there is no TKC effect in image domain. Hub-Threshold Kleinberg's algorithm does not induce TKC effect in our example, but CoHT has TKC effect and ECHT does not.
Moreover, the ranking scores of image 1 and images 2 to 15 are the same on the results produced by PrinfKM, PrinfPR and PrinfHT. But they are different on CoKM, CoHT, ECKM and ECHT. The ranking score of image 1 is higher than that of images 2 to 15. The ranking scores are same on PrinfKM, PrinfPR and PrinfHT because PageRank-influenced ranking scheme only analyzes the page-to-page relations in the link structure. It does not analyze the image-to-image relations.
On the other hand, co-citation based analysis and enhanced co-citation based analysis analyze the image-to-image relations. The ranking score of image 1 is higher because it is pointed to by page 6, which is a good image container [4] that also points to highly ranked image 19.
Likewise, for considering P4 as a small set of sites, the three ranking schemes proposed can still solve the problem of TKC effect while CoKM and CoHT cannot. Figure 5 shows the results of the each method on the structure considering P4 as a small set of sites.
We also evaluated the speed of the three image-ranking schemes. We generated a large link structure by random and then counted the time used for performing PageRank-influenced ranking scheme, co-citation based analysis and enhanced co-citation based analysis on the structure.
The link structure generated had 1000 pages and 1000 images. The average out-degree of each page was 20 and each page contained 20 images. Figure 5 shows the calculation time of each scheme.
The result shows that enhanced co-citation based analysis and co-citation based analysis required much more time than PageRank-influenced ranking scheme. This was because the multiplication of the adjacency matrix of page-to-page relation and the adjacency matrix of page-to-image relation required a lot of time. The time complexity of multiplication of two matrices is at least O(N2), but that of PageRank-influenced ranking scheme is only about O(N), where N is the number of web pages or images. The difference is more significant if N increases.
PageRank-influenced ranking scheme with PageRank algorithm, PageRank-influenced ranking scheme with Hub-Threshold Kleinberg's algorithm and Enhanced co-citation based analysis with Hub-Threshold Kleinberg's algorithm can solve the problem of TKC effect. However, the first one may cause link sink problem and both first and second do not analyze the relations between images.
The speed of the third method is a problem when the number of pages or images is large because it is based on enhanced co-citation based analysis. The solution to this is to set a threshold. If the number of pages or images is less than the threshold, the search engine calculates the image scores using the third method. Otherwise, it calculates the image scores using the second method.
This project is supported in part by grants from the Hong Kong's Research Grants Council (RGC) under CUHK 4407/99E and #2050259.