Information Network Analysis and Extraction on the World Wide Web

Presenters: Tim Weninger, Univ. of Illinois at Urbana-Champaign, US
Jiawei Han, Univ. of Illinois at Urbana-Champaign, US

Extraction, search, and exploration of semi-structured Web pages is an increasingly popular task because of its potential impact on the way information is managed and retrieved. For example, a bibliographic database of computer science research publications (e.g., DBLP) could be integrated with information from the authors' homepages. This integration would allow a free-text query to return structured (database-style) results. Alternatively, a structured (SQL-style) query could be used to return Web results. Furthermore, the unstructured and structured information could be used to enhance one another, providing more informative search results.
The general process of this effort is to take semi-structured, HTML-style Web data and transform it into relatively structured, manipulable information. This process becomes especially challenging when working with a data source as vast and varied as the World Wide Web. Researchers have found that most Web sites, especially large sites, contain informative structures. This observation has led to the development of several approaches that leverage the structural similarities of Web pages (from a common Web site) to learn Web wrappers that extract content, data records, lists, and tables.
The extracted content is often put into a database for later retrieval; however, it is usually not sufficient to extract content from only a single site. Instead, information from many sites is gathered in order to enhance the information en masse. In these cases, content extraction is only part of the story; we may also choose to integrate the data into a consistent database with a single schema, or link data records with Web pages. These efforts usually rely on heuristic assumptions about Web page presentation. The final section of this tutorial explores recent efforts to develop more principled approaches toward the same goal, using recent developments in information network analysis.
This tutorial is intended to combine shallow introductions to several techniques, leading to deeper insights into the larger task. Thus, the tutorial is composed of several specific, rather shallow techniques which, when combined, can be used to explain deeper, more complex ideas.

Link to material: http://web.engr.illinois.edu/~weninge1/pubs/tutorial_WWW13.pptx