Tutorials

WWW 2017 will offer three full-day and ten half-day tutorials that will provide a high-quality learning experience to conference attendees on current and emerging Web topics.

Participants will have the opportunity to exchange ideas and experiences, and to identify new opportunities for collaboration and new directions for future research.

The tutorials will take place at the conference venue, the Perth Convention and Exhibition Centre, on Monday 3 April and Tuesday 4 April 2017. Where available, a brief description of each tutorial has been provided; more descriptions will be added as they become available.

Tutorials will be scheduled so as to avoid conflicts with conference sessions in similar or related areas. A tentative schedule has been set out below. The exact schedule will be released closer to the conference date.

Accepted tutorials

Monday 3 April - Full day

Caching at the Web Scale
Victor Zakhary http://cs.ucsb.edu/~victorzakhary/
Divyakant Agrawal, Amr El Abbadi

Abstract: Today’s web applications and social networks are serving billions of users around the globe. These users generate billions of key lookups and millions of data object updates per second. A single user’s social network page load requires hundreds of key lookups. This scale creates many design challenges for the underlying storage systems. First, these systems have to serve user requests with low latency. Any increase in request latency leads to a decrease in user interest. Second, storage systems have to be highly available. Failures should be handled seamlessly without affecting user requests. Third, users consume an order of magnitude more data than they produce. Therefore, storage systems have to be optimized for read-intensive workloads. To address these challenges, distributed in-memory caching services have been widely deployed on top of persistent storage. In this tutorial, we survey recent developments in distributed caching services. We present the algorithmic and architectural efforts behind these systems, focusing on the design challenges as well as open research questions.
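
For readers new to the area, a minimal sketch of the look-aside (cache-aside) read path that such cache-over-storage deployments commonly use may help; the cache and db objects and their method names below are hypothetical stand-ins, not any particular system's API.

    # Look-aside caching: try the in-memory cache first, fall back to the
    # persistent store on a miss, then populate the cache for future reads.
    def read(key, cache, db, ttl_seconds=300):
        value = cache.get(key)
        if value is not None:       # cache hit: the low-latency path
            return value
        value = db.get(key)         # cache miss: read from persistent storage
        if value is not None:
            cache.set(key, value, ttl_seconds)  # warm the cache
        return value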

Audience: This tutorial targets researchers, designers, and practitioners interested in systems and infrastructure research for big data management and processing. Attendees with a basic background in cache replacement policies, sharding, replication, and consistency will benefit the most. For a general audience and newcomers, the tutorial introduces the design challenges that arise when caching services are designed at web scale. For researchers, algorithmic and architectural efforts in distributed caching are presented together, showing the spectrum of recent solutions side by side with their unresolved challenges. Our goal is to enable researchers to develop designs and algorithms that handle these challenges at scale.


Distributed Machine Learning: Foundations, Trends, and Practices
Tie-Yan Liu https://www.microsoft.com/en-us/research/people/tyliu/
Wei Chen, Taifeng Wang

Abstract: In recent years, artificial intelligence has achieved great success in many important applications. Both novel machine learning algorithms (e.g., deep neural networks) and their distributed implementations play critical roles in this success. In this tutorial, we will first review popular machine learning algorithms and the optimization techniques they use. Second, we will introduce widely used ways of parallelizing machine learning algorithms (including both data parallelism and model parallelism, and both synchronous and asynchronous parallelization), and discuss their theoretical properties, strengths, and weaknesses. Third, we will present some recent work that tries to improve standard parallelization mechanisms. Last, we will provide practical examples of parallelizing given machine learning algorithms using popular distributed platforms, such as Spark MLlib, DMTK, and TensorFlow. From this tutorial, the audience will gain a clear knowledge framework for distributed machine learning, as well as hands-on experience parallelizing a given machine learning algorithm using popular distributed systems.
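
As a conceptual sketch of the synchronous data-parallel scheme mentioned above, the toy example below trains a least-squares model with parameter averaging; the data, model, and learning rate are illustrative assumptions, and a real system would replace the averaging step with an all-reduce over the network.

    import numpy as np

    def grad_fn(w, shard):
        # Least-squares gradient on one worker's (X, y) data shard.
        X, y = shard
        return X.T @ (X @ w - y) / len(y)

    def synchronous_round(w, shards, lr=0.1):
        # Every worker steps from the same weights; a barrier then averages.
        local = [w - lr * grad_fn(w, shard) for shard in shards]
        return np.mean(local, axis=0)

    rng = np.random.default_rng(0)
    true_w = np.array([2.0, -1.0])
    shards = [(X, X @ true_w)
              for X in (rng.normal(size=(50, 2)) for _ in range(4))]

    w = np.zeros(2)
    for _ in range(200):
        w = synchronous_round(w, shards)
    print(w)  # approaches [2.0, -1.0]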

Audience: Both academic researchers and industrial practitioners in the artificial intelligence domain will be interested in this topic, especially those working on or around deep learning and large-scale machine learning.


Constructing Structured Information Networks from Massive Text Corpora
Xiang Ren http://xren7.web.engr.illinois.edu/
Meng Jiang, Jingbo Shang, Jiawei Han

Abstract: In today's computerized and information-based society, text data is rich but messy. People are inundated with vast amounts of natural-language text data, ranging from news articles, social media posts, and advertisements to a wide range of textual information from various domains (medical records, corporate reports). To turn such massive unstructured text data into actionable knowledge, one of the grand challenges is to gain an understanding of the factual information (e.g., entities, attributes, relations, events) in the text. In this tutorial, we introduce data-driven methods to construct structured information networks (where nodes are different types of entities attached with attributes, and edges are different relations between entities) for text corpora of different kinds (especially massive, domain-specific text corpora) to represent their factual information. We focus on methods that are minimally supervised, domain-independent, and language-independent, for fast network construction across various application domains (news, web, biomedical, reviews). We demonstrate on real datasets, including news articles, scientific publications, tweets, and reviews, how these constructed networks aid in text analytics and knowledge discovery at a large scale.
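
To make the target data structure concrete, the sketch below builds a tiny typed, attributed network with the networkx Python library; the entities and relation are invented examples, not output of the methods being presented.

    import networkx as nx

    # Nodes are typed entities with attributes; edges are typed relations.
    G = nx.MultiDiGraph()
    G.add_node("Barack Obama", entity_type="Person")
    G.add_node("United States", entity_type="Country")
    G.add_edge("Barack Obama", "United States", relation="president_of")

    for u, v, data in G.edges(data=True):
        print(u, data["relation"], v)  # Barack Obama president_of United States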

Audience: Researchers and practitioners in the fields of web search, information retrieval, data mining, text mining, and database systems. While an audience with a good background in these areas would benefit most from this tutorial, we believe the material presented would give a general audience and newcomers an introductory pointer to current work and important research topics in this field, and inspire them to learn more. Only preliminary knowledge of text mining, data mining, algorithms, and their applications is needed.

Monday 3 April - Half day - afternoon

Computational Models for Social Network Analysis
Jie Tang

Antisocial Behavior on the Web: Characterization and Detection
Srijan Kumar http://cs.umd.edu/~srijan/
Justin Cheng, Jure Leskovec

Abstract: Web platforms enable unprecedented speed and ease in the transmission of knowledge, and allow users to communicate and shape opinions. However, the safety, usability, and reliability of these platforms are compromised by the prevalence of online antisocial behavior; for example, 40% of users have experienced online harassment. Antisocial behavior takes the form of antisocial users, such as trolls, sockpuppets, and vandals, and of misinformation, such as hoaxes, rumors, and fraudulent reviews. This tutorial presents the state-of-the-art research spanning two aspects of antisocial behavior: first, characterization of its behavioral properties, and second, development of algorithms for identifying and predicting it. The tutorial first discusses antisocial users: trolls, sockpuppets, and vandals. We present the causes, community effects, and linguistic, social, and temporal characteristics of trolls. Then we discuss the types of sockpuppets (multiple accounts of the same user) and their behavioral characteristics on Wikipedia and in online discussion forums. Vandals make destructive edits on Wikipedia, and we discuss the properties of vandals and vandalism edits. In each case, algorithms for detecting and predicting the antisocial user are also discussed. The second part of the tutorial discusses misinformation: hoaxes, rumors, and fraudulent reviews. First, we present the characteristics and impact of hoaxes on Wikipedia, followed by the spread and evolution of rumors on social media. Second, we discuss algorithms to identify fake reviews and reviewers from their characteristics, and the camouflage and coordination among sophisticated fraudsters. Again, in each case we present detection algorithms using textual, temporal, sentiment, network structure, and rating patterns. Finally, the tutorial concludes with future research avenues.
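
As a schematic of the feature-based detection algorithms surveyed in the tutorial, the toy classifier below scores accounts from invented behavioral features; the feature choices, values, and labels are hypothetical illustrations, not the presenters' actual models.

    from sklearn.ensemble import RandomForestClassifier

    # Hypothetical features per account:
    # [posts per day, fraction of posts flagged, account age in days]
    X = [
        [120, 0.40, 3],     # very active, often flagged, brand-new account
        [5, 0.00, 900],     # low activity, clean history, old account
        [80, 0.25, 10],
        [2, 0.01, 1200],
    ]
    y = [1, 0, 1, 0]        # 1 = antisocial, 0 = benign (toy labels)

    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    print(clf.predict([[60, 0.30, 7]]))  # classify an unseen account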

Audience: This tutorial targets academic, industry, and government researchers and practitioners with interests in social network anomaly detection, user behavior modeling, graph mining, cybersecurity, and community policy design. Beginners in the area will learn the basics of these algorithms. Experts will learn in-depth algorithms and case studies for detecting online antisocial behavior, covering both platform-specific and platform-independent techniques.


Scalable deep document / sequence reasoning with Cognitive Toolkit
Sayan Pathak
William Darling, Clemens Marschner

Abstract: Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. DNNs are powerful because they can perform arbitrary parallel computation for a modest number of steps. Large DNNs can be trained with supervised backpropagation whenever the labeled training set has enough information to specify the network’s parameters. Thus, if there exists a parameter setting of a large DNN that achieves good results (for example, because humans can solve the task very rapidly), supervised backpropagation will find these parameters and solve the problem. Sequences pose a challenge for DNNs because they require that the dimensionality of the inputs and outputs be known and fixed. Researchers have shown that the Long Short-Term Memory (LSTM) architecture can solve general sequence-to-sequence problems. The LSTM’s ability to successfully learn on data with long-range temporal dependencies makes it a natural choice for this application, due to the considerable time lag between the inputs and their corresponding outputs. Recently, research has shown that, just as the human brain pays attention to certain parts of the enormous amount of information that comes its way and distills the important, attention-worthy subcomponents, the same concept can be extended to neural networks. This approach, commonly referred to as “attention”, helps improve performance on challenging sequence processing tasks where simple sequence-to-sequence models fail. Additionally, we will also walk through the Reasoning Network (ReasoNet) framework, which has achieved industry-leading results in the area of reading comprehension.

In this tutorial, we introduce Microsoft’s Cognitive Toolkit (CNTK) and use it to implement a range of deep network algorithms for sequence data. We will walk the audience through solving sequence-to-sequence problems in hands-on exercises with the toolkit. While there are several other deep learning toolkits, we will introduce a few key innovations available in CNTK that provide unique advantages in addition to its ease of use, speed, and scalability. CNTK achieves such scalability via advanced algorithms such as 1-bit SGD and block-momentum SGD, which we will explain in detail in this tutorial.
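
As a rough illustration of the attention idea described above, the NumPy sketch below scores each encoder state against the current decoder state, normalizes the scores with a softmax, and returns the weighted sum as a context vector; the shapes and the dot-product scoring function are illustrative assumptions, not CNTK's API.

    import numpy as np

    def attention(decoder_state, encoder_states):
        scores = encoder_states @ decoder_state   # one alignment score per step
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                  # softmax over time steps
        return weights @ encoder_states           # weighted-sum context vector

    encoder_states = np.random.randn(6, 4)  # 6 time steps, hidden size 4
    decoder_state = np.random.randn(4)
    context = attention(decoder_state, encoder_states)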

Audience: We expect that individuals with an interest in text mining, entity tagging, and relationship mining will find the tutorial very useful for extending their knowledge of building advanced deep learning models with scalability in mind. We expect both beginners as well as experts to benefit from the proposed topics.

Tuesday 4 April - Full day

Towards Semantic Applications: from Knowledge Management to Data Publishing on the Web
Dung Xuan Thi Le http://semanticsoftware.com.au/semantic-team/meet-the-leadership-team/
Michel Héon, Nick Volmer

Abstract: Semantic web technologies focus on supporting the expression of data in a common language for interoperability, together with knowledge capture, representation, reuse, and sharing. In the semantic management space, an ontology can be used as a knowledge representation language, and several textual syntaxes exist for ontology building, such as N-Triples, Manchester, and Turtle. Many situations call for representing knowledge graphically, for example, for knowledge elicitation, knowledge sharing between humans, or modelling knowledge-based systems. In addition, existing public ontologies are typically large and complex, and each could describe many specific sub-domains. Understanding such a large ontology's structure, or reusing it to address a specific sub-domain, comes at a high cost that can be avoided. Given a significantly large master data repository, we can expose the data as Resource Description Framework (RDF) triples and at the same time extract a knowledge representation that describes the extracted information for management and sharing purposes. To extract data and capture knowledge effectively, the process should be made adaptable for users who have little or no semantic web knowledge and who often face challenges. In the semantic integration space, important questions still need to be addressed, such as: how flexible and adaptable are the entities, attributes, and relationships being captured; how can inferences be enabled without the need for a standard rule engine or reasoner; and how are RDF triples managed for efficient manipulation, performance, and scalability?
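
As a small, concrete illustration of exposing data as RDF triples and serializing them in the Turtle syntax mentioned above, the sketch below uses the rdflib Python library; the namespace and resources are invented for the example.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF

    EX = Namespace("http://example.org/")
    g = Graph()
    g.bind("ex", EX)

    # Expose one master-data record as RDF triples.
    g.add((EX.item42, RDF.type, EX.Product))
    g.add((EX.item42, EX.name, Literal("Widget")))

    print(g.serialize(format="turtle"))  # emit the graph in Turtle syntax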

Audience: The event targets researchers and practitioners who are interested in applying semantic computing to explore solutions for creating ontologies, pruning existing ontologies, and moving towards semantic applications. The technologies and topics in this tutorial are relevant to researchers and to people from the IoT, cognitive computing, social media, health, and oil industries, among others, who want to put their existing information (data) into a common language for semantic applications or for publishing data on the web.


The Lifecycle of Geotagged Data
Rossano Schifanella http://www.di.unito.it/~schifane/
Bart Thomee, David Shamma

Abstract: The world is a big place. At any given instant something is happening somewhere, and even when nothing is going on, people still find ways to generate data, from posting on social media and taking photos to issuing search queries. A substantial number of these actions are associated with a location, and in an increasingly mobile and connected world (both in terms of people and devices), this number is only bound to grow. Yet in the literature we observe that many researchers unwittingly treat the geospatial dimension as if it were a regular feature dimension, despite it requiring special attention. To help researchers and students avoid pitfalls and steer clear of erroneous conclusions, this tutorial teaches how geotagged data differs from regular data, along with best practices for dealing with such data. We cover the lifecycle of geotagged data in research, with topics ranging from how this kind of data is created, represented, processed, modeled, analyzed, and visualized, to how it is perceived. The tutorial requires both passive and active involvement: we not only present the material, but attendees also get the opportunity to interact with it using a variety of open source data and tools that we have prepared in a virtual machine. Participants should expect to leave the course with a high-level understanding of methods for properly using geospatial data and reporting results, the necessary context to better understand the geography literature, and resources for further engaging with georeferenced data.
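
One example of the special attention that geospatial data demands: latitude/longitude pairs are not points on a Euclidean plane, so distances between them should use a great-circle formula such as the haversine. A minimal sketch (the coordinates below are merely illustrative):

    from math import radians, sin, cos, asin, sqrt

    def haversine_km(lat1, lon1, lat2, lon2, earth_radius_km=6371.0):
        # Great-circle distance between two (latitude, longitude) points.
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = sin((lat2 - lat1) / 2) ** 2 \
            + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * earth_radius_km * asin(sqrt(a))

    # Perth to Sydney: roughly 3,300 km along the great circle.
    print(haversine_km(-31.95, 115.86, -33.87, 151.21))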

Audience: This introductory tutorial targets all researchers and students who want to learn how to properly work with geotagged multimedia data. It provides information to get complete novices started, while not shying away from presenting advanced representation, modeling, and analysis techniques for those interested in a deeper understanding of geographic data. A substantial portion of the data on the World Wide Web refers to specific geographic places or areas, and in an increasingly mobile world this data is created and consumed at varying locations. Considering that hundreds of papers using geotagged data are published every year, each year more than the year before, we deem our tutorial particularly relevant to the audience at WWW 2017.

Tuesday 4 April - Half day - morning

Semantic Data Management in Practice
Olaf Hartig http://olafhartig.de/
Olivier Curé

Abstract: After years of research and development, standards and technologies for semantic data are sufficiently mature to be used as the foundation of novel data science projects that employ semantic technologies in various application domains such as bioinformatics, materials science, criminal intelligence, and social science. Typically, such projects are carried out by domain experts who have a conceptual understanding of semantic technologies but lack the expertise to choose and employ existing data management solutions for the semantic data in their project. For such experts, including domain-focused data scientists, project coordinators, and project engineers, our tutorial will deliver a practitioner's guide to semantic data management. We will discuss the following important aspects of semantic data management and demonstrate how to address them in practice using mature, production-ready tools: storing and querying semantic data; understanding, searching, and visualizing the data; automated reasoning; integrating external data and knowledge; and cleaning the data.
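
For a taste of the first aspect, storing and querying semantic data, the sketch below loads an RDF file and runs a SPARQL query with the rdflib Python library; the file name and the FOAF property are placeholders for whatever data a project actually holds.

    from rdflib import Graph

    g = Graph()
    g.parse("people.ttl", format="turtle")  # hypothetical input file

    # Find every resource that has a FOAF name.
    query = """
        SELECT ?person ?name WHERE {
            ?person <http://xmlns.com/foaf/0.1/name> ?name .
        }
    """
    for person, name in g.query(query):
        print(person, name)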

Audience: The tutorial targets both academic and industrial researchers, data scientists, and other professionals who aim to exploit semantic technologies in some application domain. The attendees will learn about the key aspects of semantic data management that they may need to address to achieve the goals of their application projects, and they will be guided through the maze of mature data management tools ready to be used in such projects. Our main goal is to enable the audience to select and employ the right tools for the right jobs in applied semantic technology projects.


Digital Demography
Bogdan State https://sociology.stanford.edu/people/bogdan-state
Ingmar Weber

Abstract: Due to the increasing availability of large-scale data on human behavior collected on the social web, as well as advances in analyzing ever larger data sets, interest in applying computational methods to study population dynamics continues to grow. Data scientists entering the interdisciplinary field of Computational Social Science (CSS) often lack background in the theories and methods of the social sciences, whereas social scientists are often not aware of cutting-edge advances in computational methods. This problem is felt particularly acutely in demography, which has proven to be one of the most promising areas for developing novel means of answering key scientific questions using digital data and computational methods.

This tutorial exposes computer scientists to digital demography by presenting (i) an overview of what demographic research is, (ii) data sets and methods traditionally used, (iii) novel data sets and computational approaches, and (iv) opportunities for future advances in this area. The goal of this tutorial is to give participants a rich repertoire of research questions, data sets, and methods that help to address challenges related to demographic changes.

Audience: This tutorial is aimed at participants with a basic level of data mining and data processing skills. The content covered in the tutorial is designed to introduce both PhD students and researchers interested in learning about, or advancing their current knowledge of, digital methods for demographic research.

Tuesday 4 April - Half day - afternoon

Semantic Web meets Internet of Things and Web of Things
Amelie Gyrard https://www.insight-centre.org/users/amelie-gyrard
Pankesh Patel, Soumya Kanti Datta, Muhammad Intizar Ali

Abstract: Ever-growing interest in, and wide adoption of, Internet of Things (IoT) and Web technologies are unleashing the true potential of designing a broad range of high-quality consumer applications. Smart cities, smart buildings, and e-health are among the application domains that currently benefit, and will continue to benefit, from IoT and Web technologies for the foreseeable future. Similarly, semantic technologies have proven their effectiveness in various domains; among the many challenges that semantic Web technologies address are to (i) mitigate heterogeneity by providing semantic interoperability, (ii) facilitate easy integration of data applications, (iii) deduce and extract new knowledge to build applications providing smart solutions, and (iv) facilitate interoperability among various data processes, including the representation, management, and storage of data.

In this tutorial, our focus will be on the combination of Web technologies, Semantic Web, and IoT technologies, and we will show our audience how a merger of these technologies is leading towards an evolution from the IoT to the Web of Things (WoT) to the Semantic Web of Things. This tutorial will introduce the basics of the Internet of Things, Web of Things, and Semantic Web, and will demonstrate tools and techniques designed to enable the rapid development of semantics-based Web of Things applications. One key aspect of this tutorial is to familiarize its audience with the open source tools designed by different Semantic Web, IoT, and WoT projects, and to provide the audience with rich hands-on experience using these tools to build smart applications with minimal effort, thus minimizing the learning curve. We will showcase real-world use case scenarios designed using semantically-enabled WoT frameworks (e.g. CityPulse, FIESTA-IoT).

Audience: This tutorial should be of interest to a wide community, including Web and Semantic Web technologists, academics, students, and most importantly application developers who are striving to build IoT- and WoT-based data applications. The tutorial is designed for participants with all levels of experience and expertise. Ideal preparation is some basic knowledge of IoT, WoT, Semantic Web, and Web technologies/standards.


Large Scale Distributed Data Science from scratch using Apache Spark 2.0
James Shanahan https://www.ischool.berkeley.edu/people/james-shanahan
Liang Dai

Abstract: Apache Spark is an open-source cluster computing framework. It has emerged as the next-generation big data processing engine, overtaking Hadoop MapReduce, which helped ignite the big data revolution. Spark maintains MapReduce’s linear scalability and fault tolerance, but extends it in a few important ways: it is much faster (100 times faster for certain applications); it is much easier to program in, due to its rich APIs in Python, Java, Scala, SQL, and R (MapReduce has two core calls) and its core data abstraction, the distributed data frame; and it goes far beyond batch applications to support a variety of compute-intensive tasks, including interactive queries, streaming, machine learning, and graph processing.

This tutorial will provide an accessible introduction to large-scale distributed machine learning and data mining, and to Spark and its potential to revolutionize academic and commercial data science practices. It is divided into two parts: the first part covers fundamental Spark concepts, including Spark Core, functional programming à la MapReduce, RDDs/data frames/datasets, the Spark shell, Spark Streaming and online learning, Spark SQL, MLlib, and more; the second part focuses on hands-on algorithmic design and development with Spark, developing algorithms from scratch such as decision tree learning, association rule mining (Apriori), graph processing algorithms such as PageRank and shortest path, gradient descent algorithms such as support vector machines and matrix factorization, and deep learning. The homegrown implementations will help shed some light on the internals of the MLlib libraries (and on the difficulties of parallelizing some key machine learning algorithms). Industrial applications and deployments of Spark will also be presented. Example code will be made available in Python (PySpark) notebooks.
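
To give a flavor of the from-scratch style of the second part, the sketch below runs batch gradient descent for linear regression over an RDD; the toy data, learning rate, and application name are illustrative assumptions.

    import numpy as np
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("gd-sketch").getOrCreate()
    sc = spark.sparkContext

    # Toy data: y = 2x + 1, with a bias feature prepended to each point.
    data = sc.parallelize([(np.array([1.0, x]), 2.0 * x + 1.0)
                           for x in np.linspace(0.0, 1.0, 100)]).cache()
    n = data.count()
    w = np.zeros(2)

    for _ in range(50):
        # Each iteration ships the current w to the workers, which compute
        # partial gradients that are summed by reduce (a synchronous step).
        grad = data.map(lambda p: (p[0].dot(w) - p[1]) * p[0]) \
                   .reduce(lambda a, b: a + b)
        w -= 0.5 * grad / n

    print(w)  # moves towards [1.0, 2.0]
    spark.stop()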

Audience: The tutorial is targeted at most WWW attendees, both industry practitioners and researchers, who wish to learn best practices of large-scale data science using next-generation tools. The level of the tutorial can be considered introductory, with hands-on exposure to algorithmic development in Spark (and PySpark, the Python API to Spark).


The tutorials will cater to a varied range of interests and backgrounds: beginners, developers, designers, researchers, practitioners, users, lecturers and representatives of governments and funding agencies who wish to learn new technologies.

More detail on the tutorials will be published closer to the conference date.

