Edgar Meij, Yahoo! Research, Netherlands
Krisztian Balog, Norwegian Univ. of Science and Technology, Norway
Daan Odijk, University of Amsterdam, Netherlands
The explosive increase in the amount of unstructured textual data being produced in all kinds of domains (web, news, social media, corporate, etc.) calls for advanced methodologies for making sense of this data. Historically, automatic techniques for making sense of unstructured text consisted of rather crude classifications at the document level, into categories such as “sports,” “finance,” etc. Later, advances in machine learning, text mining, and natural language processing enabled a more fine-grained analysis, giving rise to approaches such as named entity recognition (NER). NER effectively moves the granularity of the analysis from the document level to the phrase level and enables identifying the type of a phrase, such as “person,” “location,” etc. Once limited to fairly narrow domains, NER techniques are now mainstream.
Recent advances have enabled an even more precise manner of analysis, where phrases (consisting of a single term or a sequence of terms) are automatically linked to entries in a knowledge base. This process is commonly known as entity linking. Entity linking facilitates advanced forms of searching and browsing in various domains and contexts. It can be used, for instance, to anchor textual resources in background knowledge; authors or readers of a piece of text may find that entity links supply useful pointers. Another application can be found in search engines, where it is increasingly common to link queries to entities and present entity-specific overviews. More and more, users want to find the actual entities that satisfy their information need, rather than merely the documents that mention them, a process known as entity retrieval.
It is common to consider entities from a general-purpose knowledge base such as Wikipedia or Freebase, since they provide sufficient coverage for most tasks and applications. Wikipedia is therefore a common target for entity linking and a fertile ground for research on entity retrieval. Its rich structure (including article link structure and categorization, infoboxes, etc.) makes it possible to advance over plain document retrieval. Approaches for linking and retrieving entities are not Wikipedia-specific, however. Recent developments in the Web of Data enable the use of domain or task-specific entities. Alternatively, legacy or corporate knowledge bases can be used to provide entities. Entity linking and retrieval is also gaining popularity in the public domain, as witnessed by Wolfram Alpha, QWhisper, the Google Knowledge Graph, digital personal assistants such as Siri and Google Now, and entity-oriented vertical search results for places, products, etc. Examples of such collections and applications will be used throughout the tutorial as illustrative cases. Entity linking and retrieval is an emerging research area and no textbooks exist on the subject. With two relevant tracks in the research track and four relevant workshops, WWW2013 is therefore the ideal venue for this tutorial.
The first part of the tutorial focuses on entity linking, where typical solutions consist of two steps: deciding (i) whether a phrase should be linked and, if so, (ii) what the target of the link should be. While the first step can be addressed using NER techniques, it is the second step that is the main focus of this part of the tutorial. In order to automatically identify an entity link, it is crucial to analyze the surface forms for entities. These consist of lexical and spelling variants and acronyms, and may also include multilingual variants. It is common to use the set of surface forms for each entity to determine and weight candidate links. Furthermore, a phrase’s context (consisting of, e.g., linguistic features) can be used to further improve entity linking performance. Leveraging context is especially useful for disambiguation, i.e., when several entities share the same surface form. We will also introduce recent, advanced methods for entity linking, including graph-based methods and feature-based approaches in a machine learning setting. Here, a machine learning algorithm decides whether or not to link a phrase to an entity, based on information pertinent to the knowledge base, the phrase, its context, external information, or a combination thereof. Machine learning methods also enable task and/or domain adaptation, where manually curated training data encodes the peculiarities of the domain, data, or task at hand.
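To make the role of surface forms and context concrete, the following is a minimal sketch and not a method prescribed by the tutorial. It assumes a toy set of (surface form, entity) pairs, as might be harvested from anchor text, and a hand-made bag of context words per entity. All data and the names ANCHORS, ENTITY_CONTEXT, build_surface_form_index, commonness, and link are hypothetical. Candidates are weighted by how often a surface form refers to each entity, and that weight is boosted by word overlap with the phrase's context.

```python
from collections import Counter, defaultdict

# Hypothetical (surface form, entity) pairs, e.g. harvested from anchor text.
ANCHORS = [
    ("jaguar", "Jaguar_Cars"),
    ("jaguar", "Jaguar_Cars"),
    ("jaguar", "Jaguar_(animal)"),
    ("big cat", "Jaguar_(animal)"),
]

# Hypothetical context vocabulary associated with each entity.
ENTITY_CONTEXT = {
    "Jaguar_Cars": {"car", "engine", "british", "luxury"},
    "Jaguar_(animal)": {"cat", "jungle", "prey", "species"},
}


def build_surface_form_index(anchors):
    """Map each lowercased surface form to a count of the entities it refers to."""
    index = defaultdict(Counter)
    for surface, entity in anchors:
        index[surface.lower()][entity] += 1
    return index


def commonness(index, surface):
    """Estimate P(entity | surface form) from the anchor counts."""
    counts = index.get(surface.lower(), Counter())
    total = sum(counts.values())
    return {e: c / total for e, c in counts.items()} if total else {}


def link(index, surface, context_words, min_score=0.0):
    """Return the best-scoring candidate entity, or None if there is no candidate."""
    context = {w.lower() for w in context_words}
    scored = []
    for entity, prior in commonness(index, surface).items():
        overlap = len(context & ENTITY_CONTEXT.get(entity, set()))
        # Assumed scoring rule: the prior, boosted by context word overlap.
        scored.append((prior * (1 + overlap), entity))
    if not scored:
        return None
    score, entity = max(scored)
    return entity if score > min_score else None


index = build_surface_form_index(ANCHORS)
print(link(index, "jaguar", []))                   # prior alone decides
print(link(index, "jaguar", ["jungle", "prey"]))   # context overrides the prior
```

Without context, the more frequent sense wins. With context words that match the less frequent sense, the overlap term outweighs the prior. A feature-based learner, as discussed above, would replace the hand-written scoring rule with a trained model over such signals.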
As to evaluating entity linking, a number of initiatives exist. In 2007, INEX launched the Link the Wiki track, specifically devoted to developing standard procedures and metrics for the evaluation of link discovery within Wikipedia. TAC has been running the Knowledge Base Population track since 2009, with a separate entity linking task. The Knowledge Base Acceleration track is a new track at TREC, featuring a filtering task: given a textual stream consisting of news and social media content and a target entity from a knowledge base, generate a score for each document based on how pertinent it is to the target. In this part of the tutorial we will discuss the strengths and weaknesses of these initiatives and the lessons learned from them.
The second part of the tutorial focuses on entity retrieval and introduces methods and algorithms that rank entities with respect to a topic or to another entity. We start this part of the tutorial by considering scenarios where explicit representations of entities are available in the form of either Wikipedia pages or RDF triples. Under these conditions, ad-hoc entity search can be approached by adapting standard document retrieval methods based on textual entity representations. On top of this basic layer of models, we discuss ways of exploiting the structure and semantics present in these data sets, including type and category information, attributes and typed relations, and link structure. We then continue in a setting with more complex queries that require evidence to be collected and aggregated from massive volumes of unstructured textual data (with the potential help of some structured data) and that call for a combination of techniques from both entity linking and entity retrieval.
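As an illustration of adapting a standard document retrieval method to textual entity representations, here is a minimal sketch under stated assumptions. Each entity is represented by a short description (a stand-in for a Wikipedia page or text derived from RDF triples), and entities are ranked by query likelihood with Jelinek-Mercer smoothing. The toy descriptions, the smoothing weight, and the names ENTITIES, tokenize, and rank_entities are hypothetical. This is one common baseline, not necessarily the model presented in the tutorial.

```python
import math
from collections import Counter

# Hypothetical textual entity representations.
ENTITIES = {
    "Amsterdam": "capital city of the netherlands with canals",
    "Oslo": "capital city of norway on a fjord",
    "Trondheim": "city in norway home to a university of science and technology",
}


def tokenize(text):
    """Lowercase whitespace tokenization; a real system would do more."""
    return text.lower().split()


def rank_entities(query, entities, lam=0.1):
    """Rank entities by log query likelihood with Jelinek-Mercer smoothing.

    lam is the (assumed) weight given to the collection language model.
    """
    term_counts = {e: Counter(tokenize(t)) for e, t in entities.items()}
    collection = Counter()
    for tf in term_counts.values():
        collection.update(tf)
    collection_len = sum(collection.values())

    scores = {}
    for entity, tf in term_counts.items():
        length = sum(tf.values())
        score = 0.0
        for term in tokenize(query):
            p_collection = collection[term] / collection_len
            p = (1 - lam) * tf[term] / length + lam * p_collection
            if p == 0.0:
                # Term occurs nowhere in the collection.
                score = float("-inf")
                break
            score += math.log(p)
        scores[entity] = score
    return sorted(scores, key=scores.get, reverse=True)


print(rank_entities("norway university", ENTITIES))
```

Smoothing lets an entity that matches only part of the query still receive a finite score, so partially matching entities rank above entities that match nothing. The structural signals mentioned above (types, categories, typed relations, links) would be layered on top of such a text-only baseline.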
We conclude this part of the tutorial with a discussion of entity retrieval evaluation initiatives.
In 2005, an expert finding track was launched at TREC, where a list of experts had to be returned for a given topic. Later on, a separate Entity track started at TREC in 2009, acknowledging the need for a generalization to arbitrary types of entities. This track defined the related entity finding task: return a ranked list of entities (of a specified type) that engage in a given relationship with a given source entity. Common to both tasks is that entities are not directly represented as retrievable units; one of the challenges is to recognize entities in the document collection, then aggregate the textual information associated with them for the purpose of retrieval. INEX also featured an Entity Ranking track between 2007 and 2009. Most recently, the Semantic Search Challenge introduced the ad-hoc entity search task in the Web of Data: given a keyword query targeting a particular entity, provide a ranked list of relevant entities identified by their URIs.
Finally, a number of publicly available toolkits and web services for entity linking and entity retrieval exist, including EARS, Wikipedia Miner, OpenCalais, Sindice, and DBpedia Spotlight. The last part of the tutorial will give an overview and comparative analysis of these, followed by a hands-on session where they will be evaluated in various settings, including full-text documents, microblogs, queries, and the web. Participants will work through guided examples using a web-based learning environment.
Slides: http://ejmeij.github.io/entity-linking-and-retrieval-tutorial/