Word Sense Based User Models for Web Sites

Carlo Strapparava, Bernardo Magnini and Anna Stefani
ITC-IRST Istituto per la Ricerca Scientifica e Tecnologica, I-38050 Povo/Trento, Italy
{strappa, magnini, stefani}@irst.itc.it

Introduction

SiteIF is a personal agent that follows the user's browsing, "watching over the user's shoulder". It learns the user's interests from the requested pages, which are analyzed to generate or update a model of the user. Exploiting this model, the system tries to anticipate which documents in the web site could be interesting for the user.

The latest developments of SiteIF take advantage of natural language processing (NLP) techniques in building the user model. The system exploits the WordNet lexical database to make the model more meaningful and accurate. In this perspective the user model is built as a semantic network whose nodes represent not simply word frequencies but word sense frequencies. Furthermore, by taking advantage of MultiWordNet, a multilingual extension of WordNet developed at IRST, the resulting user model is independent of the language of the browsed documents. This is particularly important for multilingual web sites, which are becoming very common, especially among news sites and in electronic commerce domains.
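To give a concrete flavour of the language-independence idea, the sketch below uses NLTK's Open Multilingual WordNet interface as a stand-in for MultiWordNet (it is not the resource or code actually used in SiteIF): an English word and an Italian word that express the same concept map to the same synset identifier, so both contribute evidence to the same node of the user model, regardless of the document language.

# Minimal sketch, assuming NLTK with the 'wordnet' and 'omw-1.4' data packages
# installed (nltk.download('wordnet'); nltk.download('omw-1.4')).
# NLTK's Open Multilingual WordNet is used here as a stand-in for MultiWordNet.
from nltk.corpus import wordnet as wn

def shared_synsets(word_en, word_it):
    """Return the synsets that an English and an Italian word have in common."""
    en = set(wn.synsets(word_en, lang='eng'))
    it = set(wn.synsets(word_it, lang='ita'))
    return en & it

# 'newspaper' and the Italian 'giornale' should share at least one synset,
# so they update the same concept node in the user model network.
print(shared_synsets('newspaper', 'giornale'))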

A Word Sense Based User Model

Personalizing the access to news is a challenging application for research on user modelling and adaptivity. In fact, the number of Web sites containing news is rapidly growing (e.g. electronic newspapers, news servers, press agencies), and many other sites are becoming news providers. Information preferences vary greatly across users, and to serve individual interests filtering systems must be highly personalized. This implies that the system has to be able to recognize users and to maintain a model of their interests. Moreover, it is important to build tools that not only help users satisfy their information needs, but also suggest new documents that are potentially interesting for them. This is valuable both from the users' point of view and from that of the web site maintainers: for example, in the field of electronic commerce, knowledge of customers' personal interests allows the exploitation of the one-to-one marketing paradigm [10].

Several tools have been proposed to search and retrieve relevant documents [6,1,5,7]. However, all these systems share some basic limitations: the user's profile is represented as a simple list of keywords, and the learning method requires the user's conscious and active involvement, either by filling in a form of keywords (i.e. topics) describing their interests or by assigning a score to each visited document. Another common limitation is that the representation of the user's interests is built without word sense disambiguation, taking into account only morpho-syntactic properties such as word frequency and word co-occurrence. This yields a representation that is often not accurate enough from a semantic point of view. The issue is even more important on the Web, where documents deal with many different topics and the chance of misinterpreting word senses is a real problem.

Even if classical, fine-grained NLP techniques, such as full parsing and semantic/pragmatic analysis, are still quite unrealistic for large web domains (many subjects, large lexicons, etc.), there is growing interest in shallow techniques, which make it possible to produce systems with a high degree of robustness that are not restricted to a particular domain. In this perspective, the availability of large-scale word sense repositories such as WordNet [8,3] has increased the interest in concrete NLP applications that can take advantage of sense distinctions.

The SiteIF system [12] exploits the possibility of disambiguating word senses in the web site documents using the WordNet hierarchy. We use a measure of semantic similarity in an IS-A taxonomy very much like the one described in [11]. The idea is to build a semantic network whose nodes represent not simply word frequencies but word sense (i.e. synset, in WordNet terminology) frequencies, and whose arcs represent synset co-occurrences in a document. This approach also takes advantage of the MultiWordNet project under way at IRST [2], which addresses the problem of extending the WordNet hierarchy to other languages (in particular to Italian). This makes it possible to build a user interest model that is independent of the language of the browsed documents, which is particularly important for multilingual web sites, increasingly common among news sites and in electronic commerce domains.
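The sketch below shows one possible realization of such a network; it is an illustrative simplification, not the SiteIF implementation. Each node stores a synset and its frequency, each arc stores how often two synsets co-occur in the same document. Disambiguation is reduced here to a most-frequent-sense heuristic over nouns, whereas SiteIF relies on a Resnik-style semantic similarity measure over the IS-A taxonomy [11]; the class and attribute names are assumptions made for the example.

# Sketch of a sense-based user model network (not the original SiteIF code).
# Nodes: synset -> frequency; arcs: pair of synsets -> co-occurrence count.
from collections import defaultdict
from itertools import combinations
from nltk.corpus import wordnet as wn

class UserModel:
    def __init__(self):
        self.node_freq = defaultdict(int)   # synset name -> frequency
        self.arc_freq = defaultdict(int)    # frozenset of two synset names -> count

    def update(self, document_words):
        """Augment the model with the synsets of one browsed document."""
        synsets = set()
        for word in document_words:
            senses = wn.synsets(word, pos=wn.NOUN)
            if senses:
                # Simplification: take the most frequent sense; SiteIF instead
                # disambiguates with a Resnik-style similarity measure [11].
                synsets.add(senses[0].name())
        for s in synsets:
            self.node_freq[s] += 1
        for s1, s2 in combinations(sorted(synsets), 2):
            self.arc_freq[frozenset((s1, s2))] += 1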

A document representation module analyses the web site documents and produces their internal representations, each constituted by a list of synsets. During the modelling phase, the system builds or augments the user model from the documents browsed during a navigation session. During the filtering phase, the system compares each document (i.e. its representation in terms of synsets) in the site with the user model and classifies it as worth the user's attention or not. The computation of relevance is in fact enriched by navigation through the WordNet hierarchy: the classification value of a document is calculated considering also the hypernymy/hyponymy relations (or other WordNet semantic relations) between the document concepts and the synsets in the user model network. This means that it is possible to suggest documents that specialize, generalize, or are otherwise related to the concepts the user is interested in.
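A possible, deliberately simplified realization of this matching step is sketched below: a document scores the full weight of every user model synset it contains, plus a discounted contribution for concepts that are direct hypernyms or hyponyms of a model synset. The discount factor and the relevance threshold are illustrative assumptions, not values taken from SiteIF.

# Sketch of the filtering phase (illustrative; not the SiteIF algorithm).
from nltk.corpus import wordnet as wn

def score_document(doc_synsets, user_model, related_discount=0.5):
    """Score a document (a set of synset names) against the user model."""
    score = 0.0
    for name in doc_synsets:
        if name in user_model.node_freq:
            # Exact conceptual match: full weight.
            score += user_model.node_freq[name]
        else:
            # Partial credit when the document concept specializes or
            # generalizes a concept the user is interested in.
            s = wn.synset(name)
            neighbours = s.hypernyms() + s.hyponyms()
            related = [user_model.node_freq[n.name()]
                       for n in neighbours if n.name() in user_model.node_freq]
            if related:
                score += related_discount * max(related)
    return score

def is_relevant(doc_synsets, user_model, threshold=3.0):
    return score_document(doc_synsets, user_model) >= threshold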

Since the representation of the user's interests is made of synsets (and not of simple words), the relevant documents proposed to the user embody not just the "same" words as other visited documents, but the same concepts.

Once the user model network is built, a method is needed to estimate how much the user model "talks about the same things", in order to dynamically infer the interest areas of the resulting model. The idea of introducing a notion of user model coherence by looking for clusters formed by cohesive synset chains is common to many approaches (see for example [9,4]). The user model clustering process detects aggregations of synsets that display particular cohesion both from a topological point of view (high connectivity in the network) and in terms of relevance (weights of nodes and arcs). Particular UM network configurations can influence the filtering process, and in particular the matching phase. For example, a user who shows many interests (i.e. clusters) might enjoy suggestions for documents not strictly related to previous ones; in this case the matching module will perform a deeper navigation of the WordNet hierarchy (e.g. considering more semantic relations) when computing the relevant documents. A minimal clustering sketch is given below.
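The following is a minimal sketch of the clustering idea, assuming the synset/co-occurrence network of the earlier sketch: synsets connected by sufficiently heavy arcs are grouped into the same cluster, and each cluster is read as one interest area. The weight threshold and the use of plain connected components are illustrative choices, not the cohesion criteria used in SiteIF.

# Sketch of user model clustering (illustrative; not the SiteIF implementation).
# Connected components over arcs whose co-occurrence weight exceeds a threshold.
def interest_clusters(user_model, min_arc_weight=2):
    # Adjacency list restricted to sufficiently cohesive arcs.
    adjacency = {s: set() for s in user_model.node_freq}
    for arc, weight in user_model.arc_freq.items():
        if weight >= min_arc_weight:
            s1, s2 = tuple(arc)
            adjacency[s1].add(s2)
            adjacency[s2].add(s1)
    # Collect connected components by depth-first search.
    clusters, seen = [], set()
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        if len(component) > 1:          # ignore isolated synsets
            clusters.append(component)
    return clusters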

We have already started practical experimentation using bilingual (Italian and English) documents supplied by AdnKronos, an important Italian news and service provider.

References

  1. Armstrong, R., D. Freitag, T. Joachims, and T. Mitchell: 1995, "WebWatcher: A Learning Apprentice for the World Wide Web", AAAI 1995 Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments, Stanford, March 1995.
  2. Artale, A., B. Magnini, and C. Strapparava: 1997, "WordNet for Italian and its Use for Lexical Discrimination", in M. Lenzerini (ed.): AI*IA97: Advances in Artificial Intelligence, Springer Verlag, 1997.
  3. Fellbaum, C. (ed.): 1998, WordNet. An Electronic Lexical Database, The MIT Press, 1998.
  4. Hirst, G and D. St-Onge: 1998, "Lexical Chains Representations of Context for the Detection and Correction of Malapropisms". In Fellbaum, C. (ed.) WordNet. An Electronic Lexical Database, The MIT Press, 1998.
  5. Kamba, T. and H. Sakagami: 1997, "Learning Personal Preferences on Online Newspaper Articles from User Behaviors", Sixth International World Wide Web Conference Proceedings, 1997. URL: http://atlanta.cs.nchu.edu.tw/www/PAPER142-2.html
  6. Lieberman, H.: 1995, "Letizia: An Agent That Assists Web Browsing", Proceedings of the 1995 International Joint Conference on Artificial Intelligence, Montreal, Canada, August 1995. URL: http://lcs.www.media.mit.edu/groups/agents/papers/
  7. Minio, M. and C. Tasso: 1996, "User Modeling for Information Filtering on INTERNET Services: Exploiting an Extended Version of the UMT Shell", Workshop on User Modeling for Information Filtering on the World Wide Web, in Proceedings of the Fifth International Conference on User Modeling, Kailua-Kona, Hawaii, January 1996. URL: http://www.cs.su.oz.au/~bob/um96-workshop.html
  8. Miller, G.: 1990, "An On-Line Lexical Database", International Journal of Lexicography, 3(4), pp. 235-312, 1990.
  9. Morris, J. and G. Hirst: 1991, "Lexical Cohesion Computed by Thesaural Relations as an Indicator of the Structure of Text", Computational Linguistics, 17(1), 1991.
  10. Peppers, D. and M. Rogers: 1997, The One to One Future: Building Relationships One Customer at a Time, Currency Doubleday, 1997.
  11. Resnik, P.: 1995, "Disambiguating Noun Groupings with Respect to WordNet Senses", Third Workshop on Very Large Corpora, MIT, June 1995.
  12. Stefani, A. and C. Strapparava: 1998, "Personalizing Access to Web Sites: The SiteIF Project", 2nd Workshop on Adaptive Hypertext and Hypermedia (held in conjunction with HYPERTEXT 98), Pittsburgh, USA, June 1998.