Copyright is held by the World Wide Web Conference Committee (IW3C2).
Distribution of these papers is limited to classroom use, and personal use by others.
WWW 2006, May 23-26, 2006, Edinburgh, Scotland.
ACM 1-59593-323-9/06/0005.
A large and growing number of web pages display contextual advertising based on keywords automatically extracted from the text of the page, and this is a substantial source of revenue supporting the web today. Despite the importance of this area, little formal, published research exists. We describe a system that learns how to extract keywords from web pages for advertisement targeting. The system uses a number of features, such as term frequency of each potential keyword, inverse document frequency, presence in meta-data, and how often the term occurs in search query logs. The system is trained with a set of example pages that have been hand-labeled with ``relevant'' keywords. Based on this training, it can then extract new keywords from previously unseen pages. Accuracy is substantially better than several baseline systems.
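As a rough illustration of the feature set listed in the abstract, the following is a minimal sketch (not the authors' implementation) of how per-candidate features such as term frequency, inverse document frequency, presence in meta-data, and query-log frequency might be computed for the unigrams on a page. The helper names and the corpus statistics (DOC_FREQ, NUM_DOCS, QUERY_LOG_COUNTS) are placeholder assumptions for the example only.

# Minimal sketch of candidate-keyword feature extraction; the background
# statistics below are illustrative assumptions, not real corpus data.
import math
import re
from collections import Counter

# Hypothetical statistics, normally estimated from a large document corpus
# and from search engine query logs.
DOC_FREQ = {"book": 120_000, "books": 90_000, "the": 9_900_000}
NUM_DOCS = 10_000_000
QUERY_LOG_COUNTS = {"book": 50_000, "books": 80_000, "the": 10}

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def extract_features(page_text, meta_text):
    """Return a feature dictionary for each candidate unigram on the page."""
    tokens = tokenize(page_text)
    tf = Counter(tokens)
    meta_tokens = set(tokenize(meta_text))
    features = {}
    for term, count in tf.items():
        df = DOC_FREQ.get(term, 1)
        features[term] = {
            "tf": count / len(tokens),                       # relative term frequency
            "idf": math.log(NUM_DOCS / df),                  # inverse document frequency
            "in_meta": term in meta_tokens,                  # appears in page meta-data
            "query_log_freq": QUERY_LOG_COUNTS.get(term, 0), # popularity in query logs
        }
    return features

if __name__ == "__main__":
    page = "Buy books online. The best books and more books."
    meta = "books, online bookstore"
    for term, feats in extract_features(page, meta).items():
        print(term, feats)

In a trained system of the kind the abstract describes, feature vectors like these would be scored by a classifier learned from the hand-labeled pages; the sketch stops at feature computation.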
H.3.1 [Content Analysis and Indexing]: Abstracting methods; H.4.m [Information Systems]: Miscellaneous
Algorithms, Experimentation
keyword extraction, information extraction, advertising
Advertising on the web is often done through keywords. Advertisers pick a keyword, and their ad appears based on that keyword. For instance, an advertiser like Amazon.com might pick a keyword like ``book'' or ``books.'' If someone searches for the word ``book'' or ``books,'' an Amazon ad is shown. Similarly, if the keyword ``book'' is highly prominent on a web page, Amazon would like an ad to appear on that page.

The instructions given to the annotators explained the task in these terms: We need to show our computer program examples of web pages, and then tell it which keywords are ``highly prominent.'' That way, it can learn that words like ``the'' and ``click here'' are never highly prominent. It might learn that words that appear on the right (or maybe the left) are more likely to be highly prominent, and so on. Your task is to create the examples for the system to learn from. We will give you web pages, and you should list the highly prominent words that an advertiser might be interested in.

There was one more important instruction: to try to use only words or phrases that actually occurred on the page being labeled. The remaining portion of the instructions gave examples and described technical details of the labeling process.

We used a snapshot of the pages to make sure that the training, testing, and labeling processes all used identical pages. The snapshotting process had the additional advantage that most images, and all content-targeted advertising, were not displayed to the annotators, preventing them from selecting terms that occurred only in images and from being influenced by a third-party keyword selection process.
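The instruction that labels should use only words or phrases occurring on the snapshotted page suggests a simple consistency check over the labeled data. The sketch below is an assumption about what such a check might look like, not the paper's tooling; the LabeledPage class and its fields are illustrative.

# Minimal sketch of a check that hand-labeled keywords occur verbatim in the
# snapshotted page text; names and structure are illustrative assumptions.
import re
from dataclasses import dataclass, field

def normalize(text):
    return re.sub(r"\s+", " ", text.lower()).strip()

@dataclass
class LabeledPage:
    snapshot_text: str                              # frozen page text shown to the annotator
    keywords: list = field(default_factory=list)    # annotator-selected phrases

    def on_page(self, phrase):
        return normalize(phrase) in normalize(self.snapshot_text)

    def split_labels(self):
        """Separate labels that occur verbatim on the page from those that do not."""
        on = [k for k in self.keywords if self.on_page(k)]
        off = [k for k in self.keywords if not self.on_page(k)]
        return on, off

if __name__ == "__main__":
    page = LabeledPage(
        snapshot_text="Discount books and used textbooks shipped worldwide.",
        keywords=["used textbooks", "books", "free shipping"],
    )
    keep, flag = page.split_labels()
    print("verbatim on page:", keep)   # ['used textbooks', 'books']
    print("needs review:", flag)       # ['free shipping']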