The Semantic Web is a vision in which web resources are made not only for humans to read but also for machines to understand and automatically process [3]. This requires that web resources be annotated with machine-understandable metadata. Currently, the primary approach is to first define an ontology and then use it to add semantic markup to web resources. The semantic markup is written in standard languages such as RDF [20] and OWL [23], and the semantics is provided by the ontology, which is shared among different web agents and applications. Usually, semantic annotations are made manually using a toolkit such as Protege or CREAM [26,31], or semi-automatically through user interaction with a disambiguation algorithm [18,4,5,6]. There is also some work on automatic annotation with minimal human effort; these approaches either extract metadata from the web site's underlying databases [12] or analyze the text content of the web pages using learning algorithms [7] and/or NLP techniques [8]. Most of these methods use a pre-defined ontology as the semantic model for the annotations, and the manual and semi-automatic methods usually require the user to be familiar with the concepts of ontologies and taxonomies. Although these approaches have been successfully used in applications like bioinformatics (e.g. [22]) and knowledge management (e.g. [18]), they also have some disadvantages. Firstly, establishing an ontology as a semantic backbone for a large number of distributed web resources is not easy: different people and applications may have different views on what exists in these web resources, which makes it difficult to reach commitment to a common ontology. Secondly, even if consensus on a common ontology can be achieved, the ontology may not keep pace with the rapid change of the targeted web resources or of the vocabularies users employ in their applications. Thirdly, using ontologies for manual annotation requires the annotator to have some skill in ontology engineering, which is quite a high requirement for ordinary web users.
In this paper, we explore a complementary approach to semantic annotation that focuses on the ``social annotations'' of the web. In recent years, weblog and social bookmark services have become more and more popular on the web. A weblog service usually allows the user to categorize blog posts under different category names chosen by the user. Social bookmark services (e.g. del.icio.us) enable users not only to share their web bookmarks but also to assign ``tags'' to these bookmarks. These category names and tags are freely chosen by the user without any a-priori dictionary, taxonomy, or ontology to conform to; they can be any strings the user deems appropriate for the web resource. We see them as the ``social annotations'' of the web. We use the word ``social'' to emphasize that these annotations are made by a large number of ordinary web users, with implicit social interactions, on the open web, without a pre-defined formal ontology. Social annotations remove the high barrier to entry because web users can annotate web resources easily and freely without using, or even knowing about, taxonomies or ontologies. They directly reflect the dynamics of the users' vocabularies and thus evolve with the users. They also decompose the burden of annotating the entire web into the annotation of the web resources each individual user is interested in.
Obviously, without a shared taxonomy or ontology, social annotations suffer from the usual problem of semantic ambiguity. The same annotation may mean different things to different people, and two seemingly different annotations may bear the same meaning. Without clear semantics, these social annotations will not be of much use to web agents and applications on the Semantic Web. In this paper, using a social bookmark service as an example, we propose a probabilistic generative model of the user's annotation behavior that automatically derives the emergent semantics [2] of the tags: synonymous tags are grouped together, and highly ambiguous tags are identified and separated. The relationship with formal annotations is also discussed. Furthermore, we apply the derived emergent semantics to discover and search shared web bookmarks, and describe the implementation and evaluation of this application.
Delicious ``allows you to easily add sites you like to your personal collection of links, to categorize those sites with keywords, and to share your collection not only between your own browsers and machines, but also with others'' [29]. There are many bookmark manager tools available [17,11]. What is special about social bookmark services like Delicious is their use of keywords called ``tags'' as the fundamental construct for users to annotate and categorize web resources. These tags are freely chosen by the user without a pre-defined taxonomy or ontology. Some example tags are ``blog'', ``mp3'', ``photography'', and ``todo''. The tags page of the Delicious web site (http://del.icio.us/tags/) lists the most popular tags among its users and their relative frequency of use. This style of user-created categorization with unrestricted tags and vocabularies was named ``folksonomy'' by Thomas Vander Wal in a discussion on an information architecture mailing list [32]; the name is a combination of ``folk'' and ``taxonomy''.
As pointed out in [21], folksonomy is a form of user-created metadata, which is very different from professional creation of metadata (e.g. by librarians) and author creation of metadata (e.g. by a web page's author). Without tight control of the tags in use and some expertise in taxonomy building, such a system soon runs into problems caused by ambiguity and synonymy. [21] cites examples of ambiguous and synonymous tags in Delicious. For example, the tag ``ANT'' is used by many users to annotate web resources about Apache Ant, a build tool for Java; one user, however, uses it to tag web resources about ``Actor Network Theory''. Synonymous tags, like ``mac'' and ``macintosh'' or ``blog'' and ``weblog'', are also widely used.
Despite the seeming chaos of unrestricted tag use, social bookmark services still attract many web users and provide a viable and effective mechanism for organizing web resources. [21] attributes this success to the following reasons.
Inspired by research on Latent Semantic Indexing [30], we conduct a statistical study of the co-occurrence counts. We represent the semantics of an entity (a web resource, a tag, or a user) as a multi-dimensional vector in which each dimension represents a category of knowledge. Every entity can thus be mapped to a multi-dimensional vector whose component on each dimension measures the relatedness between the entity and the corresponding category of knowledge: if an entity relates to a particular category of knowledge, the corresponding component of its vector has a high score. For example, in Del.icio.us the tag 'xp' is used to tag web pages about both 'Extreme Programming' and 'Windows XP'; its vector should therefore have high scores on dimensions such as 'software' and 'programming'. This is in fact what we obtain in our experiments in Section 3.2. In each annotation, the user, tag, and resource co-occur in the same semantic context, so the three kinds of entities draw on the same body of knowledge. Hence we can represent all three in the same multi-dimensional vector space, which we call the conceptual space. As illustrated in Fig. 1, we can map users, web resources, and tags to vectors in this conceptual space. An ambiguous tag may have several notable components on different dimensions, while an unambiguous tag should have only one prominent component. In short, we can use vectors in this conceptual space to represent the semantics of entities. The conceptual space is not a new idea; it also appears in the literature studying, e.g., the meaning of words [33] and texts [30].
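As a purely illustrative sketch (the dimension names and numeric values below are hypothetical, not taken from our experiments), the conceptual-space representation can be pictured as follows:

```python
# A hypothetical conceptual space with named dimensions (illustrative only).
dimensions = ["software", "programming", "music", "photography"]

# Vector of the ambiguous tag 'xp': component d holds its weight on
# dimension d, with components summing to 1. Two prominent components
# reveal two distinct senses ('Windows XP' vs. 'Extreme Programming').
xp_vector = [0.45, 0.40, 0.10, 0.05]

# An unambiguous tag such as 'mp3' concentrates its mass on one dimension.
mp3_vector = [0.05, 0.05, 0.85, 0.05]
```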
Our next job is to determine the number of dimensions and to obtain the vector values of entities from their co-occurrences. There is a body of research on the statistical analysis of co-occurring objects in unsupervised learning. These approaches first develop parametric models and then estimate the parameters by maximizing the log-likelihood of the existing data set; the estimated parameter values can then be used to predict the probability of future co-occurrences. Mixture models [14] and clustering models based on the deterministic annealing algorithm [27] are approaches of this kind, and they have been used in many fields such as Information Retrieval [13] and Computational Linguistics [9]. In a separate paper [36], we applied the Separable Mixture Model [14] (one of the mixture models mentioned above) to the co-occurrences of tags and resources alone, without users. In this paper, we extend the bigram Separable Mixture Model to a tripartite probabilistic model to obtain the emergent semantics contained in the social annotation data.
We assume that the conceptual space is a $D$-dimensional vector space, where each dimension represents a particular category of knowledge contained in the social annotation data. The generation of the existing social annotation data can be modeled by the following probabilistic process: first choose a conceptual dimension $d$ with probability $P(d)$, then independently generate a user $u$, a resource $r$, and a tag $t$ with probabilities $P(u|d)$, $P(r|d)$, and $P(t|d)$. The probability of observing an annotation triple $(u, r, t)$ is therefore

$$P(u, r, t) = \sum_{d=1}^{D} P(d)\, P(u|d)\, P(r|d)\, P(t|d) \qquad (1)$$
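For concreteness, Eq. (1) transcribes directly into code (a sketch; the parameter containers indexed by entity id and dimension are our own layout choice):

```python
def p_annotation(u, r, t, P_d, P_u_d, P_r_d, P_t_d):
    """P(u, r, t) = sum_d P(d) P(u|d) P(r|d) P(t|d), as in Eq. (1).

    P_d[d] holds P(d); P_u_d[u][d], P_r_d[r][d], P_t_d[t][d] hold the
    conditional probabilities for user u, resource r, and tag t.
    """
    return sum(P_d[d] * P_u_d[u][d] * P_r_d[r][d] * P_t_d[t][d]
               for d in range(len(P_d)))
```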
The probabilities in (1) can be estimated by maximizing the log-likelihood

$$L = \sum_{i=1}^{N} \log P(u_{a_i}, r_{b_i}, t_{c_i}) \qquad (2)$$

using the EM (Expectation-Maximization) method. Suppose that the social annotation data set contains $N$ triples, and let $(u_{a_i}, r_{b_i}, t_{c_i})$ denote the $i$-th record in the data set, containing the $a_i$-th user, the $b_i$-th resource, and the $c_i$-th tag in the respective sets of users, resources, and tags. The $N \times D$ matrix $Q$ is the indicator matrix of the EM algorithm; its entry $q_{i,d}$ denotes the probability of assigning the $i$-th record to dimension $d$.
E-step:

$$q_{i,d} = \frac{P(d)\, P(u_{a_i}|d)\, P(r_{b_i}|d)\, P(t_{c_i}|d)}{\sum_{d'=1}^{D} P(d')\, P(u_{a_i}|d')\, P(r_{b_i}|d')\, P(t_{c_i}|d')} \qquad (3)$$

M-step:

$$P(u|d) = \frac{\sum_{i:\, u_{a_i} = u} q_{i,d}}{\sum_{i=1}^{N} q_{i,d}} \qquad (4)$$

$$P(r|d) = \frac{\sum_{i:\, r_{b_i} = r} q_{i,d}}{\sum_{i=1}^{N} q_{i,d}} \qquad (5)$$

$$P(t|d) = \frac{\sum_{i:\, t_{c_i} = t} q_{i,d}}{\sum_{i=1}^{N} q_{i,d}} \qquad (6)$$

$$P(d) = \frac{1}{N} \sum_{i=1}^{N} q_{i,d} \qquad (7)$$
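As a concrete illustration, a minimal NumPy sketch of one EM iteration is given below (assuming the annotations are encoded as integer-indexed triples; the variable names and array shapes are our own choices):

```python
import numpy as np

def em_step(users, resources, tags, P_d, P_u_d, P_r_d, P_t_d):
    """One EM iteration over N annotation triples, Eqs. (3)-(7).

    users, resources, tags: int arrays of length N with entity indices.
    P_d: shape (D,); P_u_d, P_r_d, P_t_d: shape (num_entities, D),
    each column d summing to 1.
    """
    # E-step (3): responsibility q[i, d] of dimension d for record i.
    q = P_d * P_u_d[users] * P_r_d[resources] * P_t_d[tags]      # (N, D)
    q /= q.sum(axis=1, keepdims=True)
    # M-step (4)-(6): accumulate responsibilities per entity, then normalize.
    new_u = np.zeros_like(P_u_d); np.add.at(new_u, users, q)
    new_r = np.zeros_like(P_r_d); np.add.at(new_r, resources, q)
    new_t = np.zeros_like(P_t_d); np.add.at(new_t, tags, q)
    total = q.sum(axis=0)                        # sum_i q[i, d]
    # M-step (7): P(d) is the average responsibility over all records.
    return total / len(users), new_u / total, new_r / total, new_t / total
```

For clarity this sketch materializes the full $N \times D$ responsibility matrix; as discussed below, the E-step output and the M-step input can instead be interleaved record by record so that the matrix never needs to be stored.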
Iterating the E-step and M-step on the existing data set, the log-likelihood gradually converges to a local maximum, and we obtain the estimated values of $P(d)$, $P(u|d)$, $P(r|d)$, and $P(t|d)$. We can use these values to calculate the vectors of users, resources, and tags in the conceptual space using Bayes' theorem. For example, the component of the vector of user $u$ on dimension $d$ can be calculated as:

$$P(d|u) = \frac{P(u|d)\, P(d)}{\sum_{d'=1}^{D} P(u|d')\, P(d')} \qquad (8)$$
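In code, Eq. (8) amounts to the following (a sketch reusing the array layout of the EM sketch above; it works identically for users, resources, and tags):

```python
def conceptual_vectors(P_d, P_cond):
    """P(d|entity) via Bayes' theorem, Eq. (8).

    P_cond holds P(entity|d) with shape (num_entities, D); each row of
    the result is that entity's vector in the conceptual space.
    """
    v = P_cond * P_d                     # proportional to P(entity|d) P(d)
    return v / v.sum(axis=1, keepdims=True)
```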
In each iteration, the time complexity of the above EM algorithm is $O(N \cdot D)$, which is linear in both the number of annotations and the number of dimensions of the conceptual space. Notice that the number of co-occurrence records $N$ is usually much larger than the size of any one set of entities, so the indicator matrix $Q$ occupies most of the storage space. We therefore interleave the output of the E-step with the input of the M-step, so that the indicator matrix $Q$ never needs to be saved. Hence the space complexity of the algorithm, excluding the storage of the raw triples, is $O(D \cdot (|U| + |R| + |T|))$, where $|U|$, $|R|$, and $|T|$ are the numbers of users, resources, and tags.
In Figure 2, we can see that the log-likelihood increases very fast from 2 to 40 dimensions and much more slowly beyond 40 dimensions. Because the web bookmarks collected from Del.icio.us are mainly in the field of IT, the knowledge repository is relatively small, and a conceptual space with 40 dimensions is basically sufficient to represent the major categories of meaning in Del.icio.us. Higher dimensions are very probably redundant and can be replaced by other dimensions or combinations of them; a large number of dimensions may also cause over-fitting. As for the number of iterations, 80 iterations provide satisfying results; more iterations give no great improvement and may cause over-fitting. In our experiments, we model the data with 40 dimensions and estimate the parameters with 80 iterations.
Table 1: Top 5 tags in 10 out of 40 conceptual dimensions

| Dimension | Top 5 tags |
|---|---|
| 1 | java programming Java eclipse software |
| 2 | css CSS web design webdesign |
| 3 | blog blogs design weblogs weblog |
| 4 | music mp3 audio Music copyright |
| 5 | search google web Google tools |
| 6 | python programming Python web software |
| 7 | rss RSS blog syndication blogs |
| 8 | games fun flash game Games |
| 9 | gtd productivity GTD lifehacks organization |
| 10 | programming perl development books Programming |
We choose the top 5 tags according to $P(t|d)$ on each dimension and randomly list 10 dimensions in Table 1. From this table, we can see that each dimension concerns a particular category of semantics: dimension 1 is mainly about 'programming', and dimension 5 is about 'search engines'. Semantically related tags have high component values in the same dimension, such as 'mp3' and 'music', while 'css' and 'CSS', or 'games' and 'Games', are actually about the same thing.
We also study the ambiguity of different tags across the dimensions. The entropy of a tag $t$ can be computed as

$$\mathrm{Entropy}(t) = -\sum_{d=1}^{D} P(d|t) \log P(d|t) \qquad (9)$$
Table 2: Tags and their entropy

| No | Tag | Entropy | Tag | Entropy |
|---|---|---|---|---|
| 1 | todo | 3.08 | cooking | 0 |
| 2 | list | 2.99 | blogsjava | 0 |
| 3 | guide | 2.92 | nu | 0 |
| 4 | howto | 2.84 | eShopping | 0 |
| 5 | online | 2.84 | snortgiggle | 0 |
| 6 | tutorial | 2.78 | czaby | 0 |
| 7 | articles | 2.77 | ukquake | 0 |
| 8 | collection | 2.76 | mention | 0 |
| 9 | the | 2.71 | convention | 0 |
| 10 | later | 2.70 | wsj | 0 |
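Eq. (9) translates directly into code; a small sketch (operating on the conceptual vectors computed above, with natural logarithms as an assumed convention) is:

```python
import numpy as np

def tag_entropy(tag_vector):
    """Entropy of a tag's P(d|t) vector over the D dimensions, Eq. (9).

    Single-sense tags such as 'cooking' in Table 2 score 0; tags spread
    over many dimensions, such as 'todo', score high.
    """
    p = tag_vector[tag_vector > 0]       # treat 0 * log 0 as 0
    return float(-np.sum(p * np.log(p)))
```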
We can also provide a more interactive search interface for the case when a user queries with a tag $t$ that is ambiguous, i.e. whose entropy calculated by (9) is larger than a predefined threshold. The user will then, in addition to the usual query results, also get a list of categories of knowledge, each labeled with its top tags, as further disambiguation choices for the tag. The categories are ranked by $P(d|t)$. When the user chooses a specific category of knowledge $d$, the returned resources are ranked by $P(r|d)$, which helps narrow the search scope and increase search accuracy.
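A sketch of this interaction follows (the threshold value and the helper names are illustrative assumptions, not part of the system description; `tag_entropy` is the function from the sketch above):

```python
import numpy as np

ENTROPY_THRESHOLD = 2.0          # assumed value; the text only requires
                                 # some predefined threshold

def disambiguation_choices(t, tag_vectors, P_t_d, P_r_d, n_dims=3, n_tags=5):
    """For an ambiguous query tag t, rank its candidate senses.

    tag_vectors[t] holds P(d|t); categories are ranked by P(d|t), each
    labeled with its top tags under P(t|d), and resources within a chosen
    category d are ranked by P(r|d).
    """
    v = tag_vectors[t]
    if tag_entropy(v) <= ENTROPY_THRESHOLD:
        return None                          # unambiguous: usual results suffice
    dims = np.argsort(v)[::-1][:n_dims]      # categories ranked by P(d|t)
    return [(d,
             np.argsort(P_t_d[:, d])[::-1][:n_tags],   # label tags for sense d
             np.argsort(P_r_d[:, d])[::-1])            # resources ranked by P(r|d)
            for d in dims]
```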
$P(w|r)$ can be viewed as the probability of producing a query word $w$ from the language model of the document resource $r$. We can use the popular Jelinek-Mercer smoothed language model [16] to estimate $P(w|r)$:

$$P(w|r) = \lambda\, P_{ML}(w|r) + (1 - \lambda)\, P_{ML}(w|C) \qquad (15)$$

where $P_{ML}$ denotes maximum-likelihood estimates, $C$ is the whole document collection, and $\lambda$ is the smoothing parameter.
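A sketch of Eq. (15) with maximum-likelihood estimates computed from raw term counts ($\lambda$ and the count dictionaries are assumed inputs):

```python
def p_word_given_resource(word, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    """Jelinek-Mercer smoothing, Eq. (15):
    P(w|r) = lam * P_ML(w|r) + (1 - lam) * P_ML(w|C).

    doc_tf / coll_tf map words to term frequencies in the resource's page
    text and in the whole collection; lam is the smoothing parameter.
    """
    p_doc = doc_tf.get(word, 0) / doc_len if doc_len else 0.0
    p_coll = coll_tf.get(word, 0) / coll_len
    return lam * p_doc + (1.0 - lam) * p_coll
```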
When the input query is a boolean combination of tags and other words, we adopt the extended boolean retrieval model [28] to estimate $P(q|r)$. The query is represented in the following manner: a query $q$ is either a single term or a conjunction $q_1 \wedge \cdots \wedge q_m$ or disjunction $q_1 \vee \cdots \vee q_n$ of sub-queries, evaluated under the p-norm scheme of [28] as

$$P(q_1 \wedge \cdots \wedge q_m \,|\, r) = 1 - \left( \frac{\sum_{i=1}^{m} \left(1 - P(q_i|r)\right)^p}{m} \right)^{1/p}$$

$$P(q_1 \vee \cdots \vee q_n \,|\, r) = \left( \frac{\sum_{i=1}^{n} P(q_i|r)^p}{n} \right)^{1/p}$$
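A recursive evaluator for such queries under the p-norm extended boolean model [28] can be sketched as follows (the query-tree encoding and the choice $p = 2$ are our assumptions):

```python
def query_score(q, term_score, p=2.0):
    """Evaluate a boolean query tree against one resource.

    q is a term string, or ('and', [sub-queries]) or ('or', [sub-queries]);
    term_score(term) returns P(term|r) for the resource being scored.
    """
    if isinstance(q, str):
        return term_score(q)
    op, subs = q
    scores = [query_score(s, term_score, p) for s in subs]
    n = len(scores)
    if op == 'and':
        return 1.0 - (sum((1.0 - s) ** p for s in scores) / n) ** (1.0 / p)
    return (sum(s ** p for s in scores) / n) ** (1.0 / p)
```

For example, the query "java AND (blog OR weblog)" would be encoded as `('and', ['java', ('or', ['blog', 'weblog'])])`.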
Our search models are quite flexible. The web bookmark discovery model, the personalized search model, and the complex query support model are independent optional parts built on top of the basic model; we can use them separately or combine several of them. For example, (19) combines all of them.
In the back-end part, after the data is collected and stored in the 'Social Annotations DB', the system runs the EM algorithm of the tripartite model developed in Section 3.1 and computes the vectors of users, web resources, and tags as the semantic index. For the words that are not tags but appear in the web pages of the URLs, the language model approach developed in Section 3.3.4 is implemented to index them.
In the front-end part, when a user initiates a search action, three parameters are passed to the system: the input query, the user's identification, and the search model (personalized, discovery, or both). In the 'query processor' unit, the input query is first parsed into a boolean combination of tags and other keywords and then mapped to a vector in the conceptual space.
One important difference of our search model is its ability to discover semantically related web resources through the emergent semantics, even if a web resource is not tagged with the query tags and does not contain the query keywords. This search capability is not available in current social bookmarking services. We evaluate the effectiveness of this discovery ability using our implemented system.
We choose 5 widely used tags, 'google', 'delicious', 'java', 'p2p', and 'mp3', from the Del.icio.us folksonomy data set and input them into our system separately. The system works in the resource discovery mode (filtering out the URLs already tagged with these tags) and returns the discovered list of URLs. We take the top 20 URLs in every list to evaluate the semantic relatedness between the tags and the results. As the URLs in Del.icio.us are mainly on IT subjects, we invited 10 students in our lab, doctoral or master candidates majoring in computer science and engineering, to take part in the experiment. Each student was given all 100 URLs and asked to judge the semantic relatedness between the tag and the web pages of the URLs based on their knowledge, scoring the relatedness from 0 points (not relevant) to 10 points (highly relevant). We average their scores on each URL and use the graded precision to evaluate the effectiveness of the resource discovery capability. The graded precision is:
Semantic annotation is a key problem in the Semantic Web area, and much work has been done on the topic. Early work like [26,31] mainly uses an ontology engineering tool to first build an ontology and then manually annotate web resources within the tool. To help automate this manual process, many techniques have been proposed and evaluated. [7] learns from a small number of training examples and then automatically tags concept instances on the web; the work has been tested on a very large scale and achieves impressive precision. [4] helps users annotate documents by automatically generating natural language sentences according to the ontology and letting users interact with these sentences to incrementally formalize them. Another interesting approach, proposed by [5], utilizes the web itself as a disambiguation source: most annotations can be disambiguated purely by the number of hits returned by web search engines. [6] improves this method using more sophisticated statistical analysis. Given that many web pages nowadays are generated from a back-end database, [12] proposes to automatically produce semantic annotations for the web pages from the database. Information extraction techniques are employed by [8] to automatically extract instances of the concepts of a given ontology from web pages. However, all this work on semantic annotation follows the traditional top-down approach, which assumes that an ontology is built before the annotation process.
Much work has been done to help users manage their bookmarks on the (semantic) web, such as [17]; [11] gives a good review of the available social bookmark tools. These tools make social bookmarking easy to use but lack the capability to derive emergent semantics from the social bookmarks.
Work on emergent semantics [19,2] has appeared recently, for example [35,1,15]. [1] proposes an emergent semantics framework and shows how the spreading of simple ontology mappings among adjacent peers can be used to incrementally achieve a global consensus on the ontology mapping. [15] describes how to incrementally obtain a unified data schema from the users of a large collection of heterogeneous data sources. [35] is more closely related to our work: it proposes that the semantics of a web page should not, and cannot, be decided by the author alone, but is also determined by how users use the web page. This idea is similar to ours; in our work, a URL's semantics is determined from its co-occurrences with users and tags. However, our method of deriving emergent semantics differs from [35]: we use a probabilistic generative model to analyze the annotation data, while [35] utilizes the common sub-paths of users' web navigation paths.
Unlike traditional formal semantic annotation based on RDF or OWL, social annotation works in a bottom-up way. We will study the evolution of social annotations and their combination with formal annotations, for example, enriching formal annotations with social annotations.
Social annotations are also sensitive to topic drift in the user community: as a particular kind of annotation increases, the answers to the same query may change. Our model can reflect this change, but it requires periodic re-calculation over the entire data set, which is quite time-consuming. One goal of our future work is to improve the model to support incremental analysis of the social annotation data.
1: Discovery results for query tag 'delicious'

| Rank | URL |
|---|---|
| 1 | http://www.betaversion.org/~stefano/linotype/news/57 |
| 2 | http://www.amk.ca/talks/2003-03/ |
| 3 | http://www.ldodds.com/foaf/foaf-a-matic.html |
| 4 | http://www.foaf-project.org/ |
| 5 | http://gmpg.org/xfn/ |
| 6 | http://www.ilrt.bris.ac.uk/discovery/rdf/resources/ |
| 7 | http://xml.mfd-consult.dk/foaf/explorer/ |
| 8 | http://xmlns.com/foaf/0.1/ |
| 9 | http://simile.mit.edu/welkin/ |
| 10 | http://www.xml.com/pub/a/2004/09/01/hack-congress.html |
| 11 | http://www.w3.org/2001/sw/ |
| 12 | http://simile.mit.edu/ |
| 13 | http://jena.sourceforge.net/ |
| 14 | http://www.w3.org/RDF/ |
| 15 | http://www.foafspace.com/ |
2: Discovery results for query tag 'google'

| Rank | URL |
|---|---|
| 1 | http://www.musicplasma.com/ |
| 2 | http://www.squarefree.com/bookmarklets/ |
| 3 | http://www.kokogiak.com/amazon4/default.asp |
| 4 | http://www.feedster.com/ |
| 5 | http://www.gnome.org/projects/beagle/ |
| 6 | http://www.faganfinder.com/urlinfo/ |
| 7 | http://www.newzbin.com/ |
| 8 | http://www.daypop.com/ |
| 9 | http://www.copernic.com/ |
| 10 | http://www.alltheweb.com/ |
| 11 | http://a9.com/-/search/home.jsp?nocookie=1 |
| 12 | http://snap.com/index.php/ |
| 13 | http://www.blinkx.tv/ |
| 14 | http://www.kartoo.com/ |
| 15 | http://www.bookmarklets.com/ |