There is a general agreement that ``encoding some of the semantics of web resources in a machine processable form'' would allow designers to implement much smarter applications for final users, including information integration and interoperability, web service discovery and composition, semantic browsing, and so on. In a nutshell, this is what the Semantic Web is about. However, it is less obvious how such a result can be achieved in practice, possibly starting from the current web. Indeed, providing explicit semantic to already existing data and information sources can be extremely time and resource consuming, and may require skills that users (including web professionals) may not have.
Our work starts from the observation that in many Web sites,
web-based applications (such as web portals, e-marketplaces,
search engines), and in the file system of personal computers, a
wide variety of schemas (such as taxonomies, directory trees,
Entity Relationship schemas, RDF Schemas) are published which (i)
convey a clear meaning to humans (e.g. help in the navigation of
large collections of documents), but (ii) convey only a small
fraction (if any) of their meaning to machines, as their intended
meaning is not formally/explicitly represented. Well-known
examples are: classification schemas (or directories) used for
organizing and navigating large collections of documents,
database schemas (e.g. Entity-Relationship), used for describing
the domain about which data are provided; RDF schemas, used for
defining the terminology used in a collection of RDF statements.
As an example, imagine that a multimedia repository uses a
taxonomy like the one depicted in Figure 1 to classify pictures. For
humans, it is straightforward to understand that any resource
classified at the end of the path:
In the paper, we present a general methodology and an implementation to make this rich meaning available and usable by computer programs. This is a contribution to bootstrapping semantics on the Web, which can be used to automatically elicit knowledge from very common web objects. The paper has two main parts. In the first part we argue that, in making explicit the meaning of a schema, most approaches tend to focus on what we call structural meaning, but almost completely disregard (i) the linguistic meaning of components (typically encoded in the labels), and (ii) its composition with structural meaning; our thesis is that this approach misses the most important aspects of how meaning is encoded in schemas. The second part of the paper describes our method for eliciting meaning from schemas, and presents an implementation called CTXMATCH2. In conclusion, as an example of an application, we show how the results of this elicitation process can be used for schema and ontology matching and alignment.
Consider three very common types of schemas: hierarchical classifications (HCs), ER Schemas and RDF Schemas. Examples are depicted in Figures 2, 3 and 1.
They are used in different domains (document management, database design, vocabulary specification) to provide a structure which can be used to organize information sources. However, there is a second purpose which is typically overlooked, namely to provide humans with an easier access to those data. This is achieved mainly by labelling the elements of a schema with meaningful labels, typiclaly from some natural language. This is why, in our opinion, it is very uncommon to find a taxonomy (or an ER schema, or an RDF Schema specification) whose labels are meaningless for humans. Imagine, for example, how odd (and maybe hopeless) it would be for a human to navigate a classification schema whose labels are meaningless strings; or to read a ER schema whose nodes are labeled with random strings. Of course, humans would still be able to identify and use some formal properties of such schemas (for example, in a classification schema, we can always infer that a child node is more specific than its parent node, because this belongs to the structural understanding of a classification), but we would have no clues about what the two nodes are about. Similar observations can be made for the two other types of schemas. So, our research interest can be stated as follows: can we define a method which can be used to automatically elicit and represent the meaning of a schema in a form that makes available to machines the same kind of rich meaning which is available to humans when going through a schema?
We said that each node (e.g. in a HC) has an intuitive meaning
for humans. For example, the node in Figure 1 can
be easily interpreted as ``pictures of mountains in Tuscany'',
whereas
can be interpreted as
``color pictures of mountains in Trentino''. However, this
meaning is mostly implicit, and its elicitation may require a lot
of knowledge which is either encoded in the structure of the
schema, or must be extracted from external resources. In
[5], we
identified at least three distinct levels of knowledge which are
used to elicit a schema's meaning:
Most past attempts focused only on the first level. A recent example is [19], in which the authors present a methodology for converting thesauri into RDF/OWL; the proposed method is very rich from a structural point of view, but labels are disregarded, and no background domain knowledge is used. As to ER schemas, a formal semantics is defined for example in [4], using Description Logics; again, the proposed semantics is completely independent from the intuitive meaning of expressions used to label single components. For RDF Schemas, the situation is slightly different. Indeed, the common understanding is that RDFS schemas are used to define the meaning of terms, and thus their meaning is completely explicit; however, we observe that even for RDFS the associated semantics (see http://www.w3.org/TR/rdf-mt/) is purely structural, which means that there is no special interpretation provided for the labels used to name classes or other resources.
However, as we argued above through a few examples, labels (together with their organization in a schema) appear to be one of the main sources of meaning for humans. So we think that considering only structural semantics is not enough, and may lead to at least two serious problems:
The first issue can be explained through a simple example.
Suppose we have some method for making explicit the meaning of paths in HCs,
and that
does not take
the meaning of labels into account. Now imagine we apply
to the path
-
in Figure 1,
and compare to a path like
in
another schema (notice that typical HCs do not provide any
explicit information about edges in the path). Whatever
representation
is capable
of producing, the outcome for the two paths will be structurally
isomorphic, as the two paths are structurally isomorphic.
However, our intuition is that the two paths have a very
different semantic structure: the first should result is a term
where a class (``pictures'') is modified/restricted by two
attributes (``pictures of beaches located in
Tuscany''); the second is a standard Is-A hierarchy, where the
relation between the three classes is subsumption. The only way
we can imagine to explain this semantic (but not structural)
difference is by appealing to the meaning of labels. We grasp the
meaning of the first path because we know that pictures have a
subject (e.g. beaches), that beaches have a geographical
location, and that Tuscany is a geographical location. All this
is part of what we called lexical and domain knowledge. Without
it, we would not have any reason to consider ``pictures'' as a
class and ``Tuscany'' and ``beaches'' as values for attributes of
pictures. Analogously, we know that (a sense of the word) ``dog''
in English refers to a subclass of the class denoted by (a sense
of the word) ``mammals'' in English, and similarly for
``animals''.
The second issue is closely related to the first one. How do
we understand (intuitively) that
refers to pictures
of beaches located in Tuscany, and not e.g. to
pictures working for Tuscany teaching beaches?
After all, the edges between nodes are not qualified, and
therefore any structurally possible relation is in principle
admissible. The answer is trivial: because, among other things,
we know that pictures do not work for anybody (but they
may have a subject), that Tuscany can't be the teacher of a beach
(but can be the geographical location of a beach). It is only
this body of background knowledge which allows humans to
conjecture the correct relation between the meanings of node
labels. If we disregard it, there is no special reason to prefer
one interpretation to the other.
The examples above should be sufficient to support the conclusion that any attempt to design a methodology for eliciting the meaning of schemas (basically, for reconstructing the intuitive meaning of any schema element into an explicit and formal representation of such a meaning) cannot be based exclusively on structural semantics, but must seriously take into account at least lexical and domain knowledge about the labels used in the schema5. The methodology we propose in the next section is an attempt to do this.
Intuitively, the problem of semantic elicitation can be viewed as the problem of computing and representing the (otherwise implicit) meaning of a schema in a machine understandable way. Clearly, meaning for human beings has very complex aspects, directly related to human cognitive and social abilities. Trying to reconstruct the entire and precise meaning of a term would probably be a hopeless goal, so our intuitive characterization must be read as referring to a reasonable approximation of meaning.
In our method, meanings are represented in a formal language
(called WDL, for WordNet Description Logic), which is the result
of combining two main ingredients: a logical language (in
this paper, use the logical language which belongs to the family of Description
Logics [2]), and
IDs of lexical entries in a dictionary (more specifically, from
WordNet [8], a well-known
electronic lexical database). Description logics are a family
logical languages that are defined starting from a set of
primitive concepts, relations and individuals, with a set of
logical constructors, and has been proved to provide a good
compromise between expressivity and computability. It is
supported with efficient reasoning services (see for instance
[14]);
WordNet is the largest and most shared online lexical resource,
whose design is inspired by psycholinguistic theories of human
lexical memory. WORDNET associates
with any word ``word'' a list of senses (equivalent to
entries in a dictionary), denoted as
, each
of which denotes a possible meaning of ``word''.
The core idea of WDL is to use a DL language for representing
structural meaning, and any additional constraints (axioms) we
might have from domain knowledge; and to use
WORDNET to anchor the meaning of
labels in a schema to lexical meanings, which are listed and
uniquely identified as WORDNET
senses. Indeed, the primitives of any DL language do not have an
``intended'' meaning; this is evident from the fact that, as in
standard model-theoretic semantics, the primitive components of
DL languages (i.e. concepts, roles, individuals) are interpreted,
respectively, as generic sets, relations or individuals from some
domain. What we need to do is to ``ground'' their interpretation
to the WordNet sense that best represents their intended meaning
in the label. So, for example, a label like can be
interpreted as a generic class in a standard DL semantics, but
can be also assigned an intended meaning by attaching it to the
the first sense in WORDNET (which
in version 2.0 is defined as ``a body of (usually fresh) water
surrounded by land'').
The advantage of WDL w.r.t. a standard DL encoding is that assigning an intended meaning to a label allows us to import automatically a body of (lexical) knowledge which is associated with a given meaning of a word used in a label. For example, from WORDNET we know that there is a relation between the class ``lakes'' and the class ``bodies of water'', which in turn is a subclass of physical entities. In addition, if an ontology is available where classes and roles are also lexicalized (an issue that here we do not address directly, but details can be found in [17]), then we can also import and use additional domain knowledge about a given (sense of) a word, for example that lakes can be holiday destinations, that Trentino has plenty of lakes, even that a lake called ``Lake Garda'' is partially located in Trentino, and so on and so forth.
Technically, the idea described above is implemented by using WORDNET senses as primitives for a DL language. A WDL language is therefore defined as follows:
Some remarks are necessary.
Despite the fact that the intended semantics cannot be formally represented or easily determined by a computer, one should accept its existence and consider it at the same level as a ``potential'' WordNet sense. Under this hypothesis we can assume that expressions in WDL convey meanings, and can be used to represents meaning in a machine. Put it differently, since the WDL primitives represent common-sense concepts, then the complex concepts of WDL will also represent common-sense concepts, since common-sense concepts are closed under boolean operations and universal and existential role restriction.
From this perspective, the problem of semantic
elicitation can be thought of as the problem of finding a WDL
expression for each element
of a schema, so that the
intuitive semantics of
is a
good enough approximation of the intended meaning of the
node.
This section is devoted to the description of a practical semantic elicitation algorithm. This algorithm has been implemented as basic functionality of the CTXMATCH2 matching platform [17], and has been extensively tested in the 2nd Ontology Alignment Evaluation Initiative7.
In the following we will adopt the notation to denote the meaning
of a node
.
to denote the label of
the node, and
or simply
to denote the
meaning of a label associated with the node
considered out of its
context.
is also called
the local meaning.
The algorithm for semantic elicitation is composed of three main steps. In the first step we use the structural knowledge on a schema to build a meaning skeleton. A meaning skeleton describes only the structure of a WDL complex concepts that constitutes the meaning of a node. In the second step, we fill nodes of with the appropriate concepts and individuals, using linguistic knowledge, and in the final step, we provide the roles, by exploiting domain knowledge.
Meaning skeletons are DL descriptions together with a set of
axioms. The basic components of a meaning skeleton (i.e. the
primitive concepts and roles) are the meanings of the single
labels associated with nodes, denoted by ), and the semantic relations between different
nodes (denoted by
).
Intuitively
represents a
semantic relation between the node
and the node
.
In the rest of this section we show how the meaning skeletons of
the types of schema considered in this paper are computed.
A number of alternative formalizations for HCs have been proposed
(e.g., [15,,]).
Despite their differences, they share the idea that, in a HC, the
meaning of a node is a specification of the meaning of its father
node. E.g., the meaning of a node labeled with ``clubs'', with a
father node which means ``documents about Ferrari cars'' is
``Ferrari fan clubs''. In DL, this is encoded as
, where
is some node that
connets the meaning of
with that of
. If the label of
is for instance ``F40'' (a
Ferrari model) then the meaning of
is ``documents about Ferrari F40 car'', then it is the
meaning of the label of
that acts as modifier of the meaning of
. In description logics this is formalized as
. The choice
between the first of the second case essentially depends both on
lexical knowledge, which provides the meaning of the labels, and
domain knowledge, which provides candidate relations between
and
. The following table summarizes some
meaning skeletons associated with the HC provided above:
node | meaning skeleton | |
---|---|---|
*[2pt] ![]() |
![]() |
|
*[3pt] ![]() |
![]() |
|
![]() |
||
*[3pt] ![]() |
![]() |
|
![]() |
||
![]() |
||
![]() |
Unlike HCs, the formal semantics for ER schemata is widely
shared. In [4],
one can find a comprehensive survey of this area. Roughly
speaking, any ER schema can be converted in an equivalent
set of DL axioms, which express the formal semantics of such a
schema. This formal semantics is defined independently from the
meaning of the single nodes (labels of nodes). Every node is
considered as an atom. To stress this fact in writing meaning
skeletons for ER, we will assign to each node an anonymous
identifier. For instance we use to denote the 5 nodes of the schema of
Figure 2.
If we apply the formal semantics described in [4] to the example of ER given above, we obtain the following meaning skeletons.
node | label | meaning skeleton |
---|---|---|
![]() |
Publication |
![]() |
![]() |
Author | ![]() |
![]() |
||
![]() |
||
![]() |
Person | ![]() |
![]() |
Article |
![]() |
![]() |
Journal | ![]() |
The meaning skeleton of the RDF Schema described in Figure 3 is provided by the formal semantics for RDF schema described for instance in [11]. Most commonly used RDFS constructs can be rephrased in terms of description logics, as discussed in [13]. As we did above, we report the meaning skeletons for some of the nodes of the RDF Schema of Figure 3 in a table, in which we ``anonymize'' the nodes, by giving them meaningless names.
node | label | meaning skeleton |
---|---|---|
![]() |
Staff | ![]() |
![]() |
Researcher | ![]() |
![]() |
||
![]() |
Paper | ![]() |
![]() |
Author | ![]() |
![]() |
||
![]() |
The observations about ER schemas mostly hold also for the meaning skeletons of RDF Schemas. Moreover, it is worth observing that the comments of the RDF Schema are not considered in the formal semantics, and therefore they are not reported in the meaning skeletons. However, we all know that comments are very useful to understand the real meaning of a concept, especially in large schema. As we will see later, they are indeed very important to select and add the right domain knowledge to the meaning skeleton.
If the label of a node is
a simple word like ``Image'', or ``Florence'', then
represents all
senses that this word can have in any possible context. For
example, WORDNET provides seven
senses for the word ``Images'' and two for ``Florence''. If
and
are nodes labeled with these two words, then
and
.
When labels are more complex than a single word, as for
instance ``University of Trento'', or ``Component of
Gastrointestinal Tract'' (occurring in Galen Ontology [16])
then is a more complex
DL description computable with advanced natural language
techniques. The description of these techniques is beyond the
scope of this paper and we refer the reader to [12]. For the sake of explanation
we therefore concentrate our attention to single word labels.
With respect to our methodology, a body of domain knowledge (called a knowledge base) can be viewed as a set of facts describing the properties and the relations between the objects of a domain. For instance, a geographical knowledge base may contain the fact that Florence is a town located in Italy, and that Florence is also a town located in South Carolina. Clearly, the knowledge base will use two different constants to denote the two Florences. From this simple example, one can see how knowledge base relations are defined between meanings rather than between linguistic entities.
More formally, we define a knowledge base to be a pair
where
is a T-box (terminological
box) and
is an A-box
(assertional box) of some descriptive language. Moreover, to
address the fact that knowledge is about meanings, we require
that the atomic concepts, roles, and individuals that appear in
the KB be taken from a set of senses provided by one (or more)
linguistic resources. An fragment of knowledge base relevant to
the examples given above is shown in Figure 4.
Domain knowledge is used to discover semantic relations
holding between local meanings. Intuitively, given two primitive
concepts and
, we search for a role
that possibly connect a
-object with a
-object. As an example,
suppose we need to find a role that connects the concept
and
the nominal concept
; in the knowledge base of
Figure 4, a candidate relation
is
.
This is because Florence#1 is a possible value of
the attributed
of
an
.
More formally, is a
semantic relation between the concept
and
w.r.t., the
knowledge base
if and
only if
According to this definition one can verify that
is
a semantic relation between
and
the nominal concept
. Indeed
(condition
),
(condition
) and for no other
primitive concepts
different
from
we have that
(condition
). Similarly
is
a semantic relation between the nominal concepts
and
, but
it is not a semantic relation between
and
.
The relations computed via conditions -
can be used also for
disambiguation of local meanings. Namely, the existence of a
semantic relation between two senses of two local meanings,
constitutes an evidence that those senses are the right one. This
allows us to discard all the others. For instance in the
situation depicted in Figure 5, it to keep the
sense
and
eliminate the other two senses from the local meaning
. Similarly we
prefer
on
since the former has more semantic relations that the latter.
As we said in the introduction, the idea and method we proposed can be applied to several fields, including semantic interoperability, information integration, peer-to-peer databases, and so on. Here, as an illustration, we briefly present an application which we developed, where semantic elicitation is used to implement a semantic method for matching hierarchical classifications (HCs).
Matching HCs is an especially interesting case for the Web. Indeed, classifying documents is one of the main techniques people use to improve navigation across large collections of documents. Probably the most blatant example is that of web directories, which most major search engines (e.g. Google, Yahoo!, Looksmart) use to classify web pages and web accessible resources. Suppose that a Web user is navigating Google's directory, and finds an interesting category of documents (for example, the category named 'Baroque' on the left hand side of Figure 6 along the path Arts > Music > History > Baroque) She might want to find semantically related categories in other web directories. One way of achieving this result is by ``comparing'' the meaning of the selected category with the meaning of other categories in different directories. In what follows, we will describe a P2P-like approach to this application, which was developed as part of a tool for supporting distributed knowledge management called KEx [3]. The example discussed in this section is adapted from [6]. The entire matching process is run by CTXMATCH2.
Imagine that both Google and Yahoo had enabled their web
directories with some semantic elicitation system8.
This means that each node in the two web directories is equipped
with a WDL formula which represents its meaning. In addition, we
can imagine that each node contains also a body of domain
knowledge which has been extracted from some ontology; this
knowledge is basically what it is locally known about the content
of the node (for example, given a node labeled , we can
imagine that it can contain also the information that Tuscany is
a region in Central Italy, whose capital is Florence, and so
on).
Let us go back to our Google user interested in Baroque music. When she selects this category, we can imagine that the following process is startedç
In the following table we present some results obtained through CTXMATCH2 for finding relations between the nodes of the portion of Google and Yahoo classifications depicted in Figure 6.
Google node | Yahoo node | semantic relation |
Baroque | Baroque | Disjoint (![]() |
Visual Arts | Visual Arts | More general than
(![]() |
Photography | Photography | Equivalent
(![]() |
Chat and Forum | Chat and Forum | Less general than
(![]() |
In the second example, CTXMATCH2 returns the
`more general than' relation between the nodes Visual
Arts. This is a rather sophisticated result: indeed, world
knowledge provides the information that `photography visual art' (
).
From structural knowledge, we can deduce that, while in the left
structure the node Visual Arts denotes the whole concept
(in fact photography is one of its children), in the
right structure the node Visual Arts denotes the concept
`visual arts except photography' (in fact photography is
one of its siblings). Given this information, it easy to deduce
that, although despite the two nodes lie on the same path, they
have different meanings.
The third example shows how the correct relation holding
between nodes Photography is returned (`equivalence'),
despite the presence of different paths, as world knowledge tells
us that
.
Finally, between the nodes Chat and Forum a `less general than' relation is found as world knowledge gives us the axiom `literature is a humanities'.
This work has been inspired from the approach described in [12] in which the technique of semantic elicitation has been applied to the special case of hierarchical classification. The approach described in this paper extends this initial approach in three main directions. First, the logic in which the meaning is expressed in some description logic, while in [12] meaning was encoded in propositional logic. Second, [12] adopts only WordNet as both linguistic and domain knowledge repository, while in this approach we allow the use of multiple linguistic resources, and knowledge bases. Third, in [12] no particular attention was paid to structural knowledge, while here we introduced the concept of a meaning skeleton, which captures exactly this notion.
The paper [1] describes an approach which enrich xml schema with the semantic encoded in an ontology. This approach is similar in the spirit of the idea of semantic elicitation of schemas, but it does not make an extensive use of explicit structural knowledge, and of linguistic knowledge, which are two of the three knowledge sources used in our approach.
The approach described in [18] describes a possible application of the linguistic enrichment of an ontology in the area of keyword based document retrieval. This approach is quite similar in the spirit on what we have proposed here, with the limitation of considering only hierarchical classifications. Moreover, in the process of enriching a concept hierarchy, no domain knowledge is used.
Finally, most of the approaches of schema matching uses linguistic knowledge (WordNet) and domain knowledge to find correspondences between elements of heterogeneous schemata. Among all the approaches CTXMATCH [5] and [10] is based on the idea of matching meaning, rather than matching syntax. Both approaches implement a two step algorithm, and the first phase computes the meaning of a node by using linguistic and domain knowledge. However both approaches are based on propositional logic.
Semantic elicitation mat be an important method for bootstrapping semantics on the web. Our method does not address the issue of extracting knowledge from documents, which of course will be the main source of semantic information. But knowledge extraction from documents is still an expensive and error prone task, as it must address a lot of well-known problems related to natural language analysis. Instead, semantic elicitation can be applied to objects which have a simpler structure (labels are typically quite simple from a linguistic point of view), and thus is less demanding from a computational point of view and more precise (needless to say, a lot of errors may occur, see [5] for a few tests). But schemas, as we said, are very common on the web, and have a very high informative power. Moreover, in many applications in the area integration of semantic web services the only available information is based on schemas and no data are present. Therefore, we assume that, in the short-mid term, this would be one of the main ways to add semantics to data on the web on a large scale.