Detecting Semantic Cloaking on the Web

Baoning Wu, Department of Computer Science & Engineering, Lehigh University, USA
Brian D. Davison, Department of Computer Science & Engineering, Lehigh University, USA

Track: Industrial Practice and Experience

By supplying different versions of a web page to search engines and to browsers, a content provider attempts to cloak the real content from the view of the search engine. Semantic cloaking refers to differences in meaning between pages which have the effect of deceiving search engine ranking algorithms. In this paper, we propose an automated two-step method to detect semantic cloaking pages based on different copies of the same page downloaded by a web crawler and a web browser. The first step is a filtering step, which generates a candidate list of semantic cloaking pages. In the second step, a classifier is used to detect semantic cloaking pages from the candidates generated by the filtering step. Experiments on manually labeled data sets show that we can generate a classifier with a precision of 93% and a recall of 85%. We apply our approach to links from the dmoz Open Directory Project and estimate that more than 50,000 of these pages employ semantic cloaking.
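The two-step structure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the filtering rule or the classifier's features, so the term-difference filter, the `min_extra_terms` threshold, and the `toy_classifier` stand-in below are all hypothetical choices made only to show how a cheap filter can feed a more selective second stage, and how precision and recall would be computed against manual labels.

```python
import re


def terms(text):
    """Lowercased alphanumeric tokens of a page copy, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_candidate(crawler_copy, browser_copy, min_extra_terms=3):
    """Step 1 (hypothetical filter): flag a page when the copy served to the
    crawler contains at least `min_extra_terms` terms that the copy served
    to the browser never shows."""
    extra = terms(crawler_copy) - terms(browser_copy)
    return len(extra) >= min_extra_terms


def toy_classifier(crawler_copy, browser_copy):
    """Step 2 stand-in (assumption): a real system would use a trained
    classifier; here a page is flagged when more than half of the
    crawler-side terms are hidden from the browser."""
    crawler_terms = terms(crawler_copy)
    if not crawler_terms:
        return False
    hidden = crawler_terms - terms(browser_copy)
    return len(hidden) / len(crawler_terms) > 0.5


def detect(pages, classify=toy_classifier, min_extra_terms=3):
    """Run the filter, then classify only the surviving candidates.
    `pages` maps a URL to a (crawler_copy, browser_copy) pair."""
    results = {}
    for url, (crawler_copy, browser_copy) in pages.items():
        if is_candidate(crawler_copy, browser_copy, min_extra_terms):
            results[url] = bool(classify(crawler_copy, browser_copy))
        else:
            results[url] = False
    return results


def precision_recall(predicted, actual):
    """Precision and recall of boolean predictions against manual labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Running the classifier only on filtered candidates keeps the more expensive second step off the large majority of pages whose crawler and browser copies essentially agree.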

Other items being presented by these speakers

Topical TrustRank: Using Topicality to Combat Web Spam (Search Track)
