The BIG Web

Keynote Speakers

Jon Kleinberg

Jon Kleinberg is the Tisch University Professor in the Departments of Computer Science and Information Science at Cornell University…

Title of presentation:
Fairness in Data Analysis and Algorithms

Alessandro Vespignani

Alessandro Vespignani is the Sternberg Family Distinguished University Professor at Northeastern University…

Title of presentation:
Big data epidemiology: more than forecast

Invited Speakers

  • Doctor AI – Interpretable Deep Learning Methods for modeling Electronic Health Records
    Speaker: Jimeng Sun (GA Tech)

    Jimeng Sun is an Associate Professor in the College of Computing at Georgia Tech. Prior to Georgia Tech, he was a researcher at the IBM TJ Watson Research Center. His research focuses on health analytics and machine learning, especially the design of tensor factorizations, deep learning methods, and large-scale predictive modeling systems. He has published over 120 papers and filed over 20 patents (5 granted). He received the SDM/IBM Early Career Research Award in 2017, the ICDM Best Research Paper Award in 2008, the SDM Best Research Paper Award in 2007, and the KDD Dissertation Runner-up Award in 2008. Dr. Sun received his B.S. and M.Phil. in Computer Science from Hong Kong University of Science and Technology in 2002 and 2003, and his M.Sc. and Ph.D. in Computer Science from Carnegie Mellon University in 2006 and 2007.

    Abstract: Deep neural networks provide great potential to create better models for longitudinal electronic health records (EHRs). In this talk, we will present a series of case studies of deep learning for modeling EHR.
    1) We illustrate how recurrent neural networks (RNN) can be used to model temporal relations among events in electronic health records (EHRs) to predict heart failures.
    2) We introduce an interpretable predictive model RETAIN which achieves high accuracy while remaining clinically interpretable and is based on a two-level neural attention model that detects influential past visits and significant clinical variables within those visits (e.g. key diagnoses).
    3) Finally, we present a new approach, medical Generative Adversarial Network (medGAN), to generate realistic synthetic patient records.

  • Of Nets and Neighbors: making ML systems correctable in production
    Speaker: Misha Bilenko (Yandex)

    Misha Bilenko leads Machine Intelligence and Research (MIR) at Yandex, one of Europe’s largest Internet companies. MIR teams conduct research and produce AI products and technologies in areas that include speech recognition and synthesis, machine translation, computer vision, and machine learning algorithms and tools. Before Yandex, Misha led the Machine Learning Algorithms team in the Cloud+Enterprise division of Microsoft, where his work was incorporated into multiple products across all Microsoft divisions. He started his career in the Machine Learning Group at Microsoft Research after receiving his Ph.D. in Computer Science from the University of Texas at Austin and completing stints at Google and IBM Research.

    Abstract: How does one build a production ML system that can effectively incorporate corrections, while avoiding the typical risks and engineering costs of online learning methods? An effective solution requires a combination of the very latest algorithms and some well-dated ones. Parametric machine learning methods – such as neural networks, boosted trees, factorization methods and their ensembles – yield state-of-the-art results on ML benchmarks and competitions. Real-world deployments of ML systems, however, differ dramatically from those static settings. In this talk, we discuss issues that differentiate production and academic ML systems, leading to the need for combining parametric models with their non-parametric brethren, i.e., modern variants of Nearest Neighbor algorithms. The combined approach is particularly suitable for systems where incorporating corrections must be accomplished rapidly, as illustrated by some lively real-life examples from a large-scale conversational assistant.

  • Empowering the Quantum Revolution
    Speaker: Julie Love (Microsoft)

    Dr. Julie Love is Director at Microsoft leading business development for Quantum Computing. In addition to holding a range of strategic and operational roles at Microsoft she has worked as a management consultant with McKinsey in the semiconductor sector, led strategy for Adobe’s creative business, and held board and advisory roles for quantum computing start-ups. Julie has a Ph.D. in quantum physics from Yale and an undergraduate physics degree from MIT.

    Abstract: Quantum computing represents a profound paradigm shift in computing where we can try out, on a single piece of hardware, all possible computational paths at once. In quantum computing, the unit of computation is not 1 and 0, but a linear superposition of both. Harnessing the power of quantum mechanics requires us to build an extraordinary piece of hardware and to program in such a way that we can “hear” the right answer. I’ll explain quantum bits and entanglement and how they can be leveraged to solve problems the world’s fastest supercomputers would take billions of years to compute.
    I will share Microsoft’s unique approach to building the world’s most scalable quantum system based on a new topological phase of matter and show the software Microsoft has released to empower developers everywhere to participate in the quantum revolution.

  • Applying Machine Learning to next generation Networks for Threat Detection, Cognitive and Predictive Analytics
    Speaker: JP Vasseur (Cisco)

    JP Vasseur, PhD, is a Cisco Fellow. Since joining Cisco in 1998, he has worked on a number of networking technologies such as IP/MPLS, Quality of Service, Traffic Engineering, network recovery, “The Internet of Things” (as the Chief Architect of the Internet of Things), Security, and Wireless Networks. From 1992 to 1998, he worked for Service Providers in large multi-protocol environments. He is an active member of the Internet Engineering Task Force (co-author of more than 35 IETF RFCs, founder and co-chair of several Working Groups such as the PCE and ROLL WGs), and an active member of several SDOs.

    JP has been leading world-class engineering teams in advanced networking and Analytics/Machine Learning (Self Learning Networks, Cloud-based Machine Learning) with key applications such as Security and network cognitive and predictive analytics. JP is a regular speaker at various international conferences, is involved in various research projects in the area of IP/Sensor Networks/Internet of Things/Security, and is a member of a number of Technical Program Committees. JP Vasseur is also Associate Professor at Telecom Paris. He is the (co)inventor of more than 500 patents in the area of IP/MPLS, Security, The Internet of Things and Machine Learning/Analytics. He is the coauthor of “Network Recovery” (Morgan Kaufmann, July 2004), “Definitive MPLS Network Designs” (Cisco Press, March 2005) and “Interconnecting Smart Objects with IP: The Next Internet” (Morgan Kaufmann, July 2010).

    JP received an engineering degree in Computer Science (France), a Master of Science in Computer Science (Stevens – USA) and a PhD in Networking (Mines-Telecom Paris – France).

    Abstract: Networking technologies have been evolving fast over the past two decades, leading to a broad range of technologies (numerous PHY/MACs, routing, QoS, high availability, security, …) while having to meet increasingly stringent requirements (from best effort to deterministic networks). Advanced analytics with Machine Learning is already playing a key role in today’s networks, a trend that will undoubtedly increase very quickly in the coming years. In this talk, several key use cases applying Machine Learning to Networking will be presented: threat detection and cloud-based cognitive & predictive analytics in wireless networks, along with results from deployed networks. A demo will also be provided using an advanced cloud-based machine learning architecture.

  • Challenges and Innovations in Building a Product Knowledge Graph
    Speaker: Luna Dong (Amazon)

    Luna Dong is a Principal Scientist at Amazon, leading the efforts to construct the Amazon Product Knowledge Graph. She was one of the major contributors to the Google Knowledge Vault project, and led the Knowledge-based Trust project, which the Washington Post called the “Google Truth Machine”. She received the VLDB Early Career Research Contribution Award for “advancing the state of the art of knowledge fusion”. She co-authored the book “Big Data Integration”, and is the PC co-chair for SIGMOD 2018 and WAIM 2015.

    Abstract: Knowledge graphs have been used to support a wide range of applications and enhance search results for multiple major search engines, such as Google and Bing. At Amazon we are building a Product Graph, an authoritative knowledge graph for all products in the world. The thousands of product verticals we need to model, the vast number of data sources we need to extract knowledge from, the huge volume of new products we need to handle every day, and the various applications in Search, Discovery, Personalization, Voice, that we wish to support, all present big challenges in constructing such a graph.

    In this talk we describe four scientific directions we are investigating in building and using such a graph, namely, harvesting product knowledge from the web, hands-off-the-wheel knowledge integration and cleaning, human-in-the-loop knowledge learning, and graph mining and graph-enhanced search. This talk will present our progress to achieve near-term goals in each direction, and show the many research opportunities towards our moon-shot goals.

  • The European Science Cloud: from vision to reality
    Speaker: Athanasios Karalopoulos (European Commission)

    Athanasios is a policy officer at the ‘Open Data Policy and Science Cloud’ Unit of the General Directorate on Research & Innovation of the European Commission. He is a member of the team that is responsible for the design and implementation of the European Open Science Cloud, contributing mostly to the data and governance aspects of this European initiative. In addition, he is responsible for the drafting and implementation of an action plan that aims to improve the ‘findability’, ‘accessibility’, ‘interoperability’ and ‘reusability’ of research data in Europe (aka FAIR data action plan). Athanasios’ career at the European Commission, before he joined the European Open Science Cloud team, includes: a) 4 years at the General Directorate on Informatics of the European Commission, as Programme Manager at the Interoperability Solutions for Public Administrations Programme (ISA Programme) and b) 3.5 years at the cabinet of the EU Commissioner for maritime affairs and fisheries. His research interests include semantic interoperability, open and linked data, information management and geo-informatics.

    Abstract: The EOSC emerged as a policy idea in 2014-2015 following a wide consultation of European scientific stakeholders on how science is being changed by ICT, globalisation etc. EOSC is Europe’s answer to the need to offer its 1.7 million researchers a seamless environment for cross-border and cross-discipline data-driven research. It will be based on Europe’s existing initiatives in data infrastructures and services. EOSC became one of the key policy priorities for the Commissioner for Science and Innovation, Carlos Moedas.

    In 2018 EOSC is ready to be launched on the basis of a set of guiding principles that European scientific stakeholders have agreed on with regard to data sharing, governance and business model.

  • Efficient Evaluation of Interactive Systems with Theoretical Guarantees
    Speaker: Maarten de Rijke (Univ. Amsterdam)

    Maarten de Rijke is professor of computer science at the University of Amsterdam. He is a member of the Royal Netherlands Academy of Arts and Sciences (KNAW), editor-in-chief of ACM Transactions on Information Retrieval, and co-editor-in-chief of Foundations and Trends in Information Retrieval. His research focuses on the interface of artificial intelligence and information retrieval (search engines, recommender systems, conversational agents). He has published over 700 papers.

    Abstract: Evaluation is of tremendous importance to the development of online interactive systems such as search engines. Any proposed change to the interactive system should be verified to ensure it is a true improvement. Online approaches to evaluation aim to measure the actual utility of an interactive system in a natural usage environment. Interleaved comparison methods are a within-subject setup for online experimentation with search engines and recommender systems. For interleaved comparison, two experimental conditions (“control” and “treatment”) are typical.
    Recently, multileaved comparisons have been introduced for the purpose of efficiently comparing large numbers of systems or variations of systems. While considerably more efficient than interleaved comparisons, until recently no multileaved comparison method has been able to infer correct preferences without degrading the search experience of the user. In the talk I present a multileaved comparison method that performs comparisons based on document-pair preferences and that is provably correct without hurting the user experience. In addition, the new method has good empirical performance in terms of sensitivity and scalability.

    The talk is based on joint work with Harrie Oosterhuis.

  • Discovering Polygamous Relationships in Spatio-Temporal Datasets
    Speaker: Juliana Freire (NYU)

    Juliana Freire is a Professor of Computer Science and Data Science at New York University. She also holds an appointment in the Courant Institute of Mathematical Sciences and is a faculty member at the NYU Center for Urban Science and Progress. She is the lead investigator and executive director of the NYU Moore-Sloan Data Science Environment, the elected chair of the ACM Special Interest Group on Management of Data (SIGMOD), and a council member of the Computing Research Association’s Computing Community Consortium (CCC).
    Her research interests are in large-scale data analysis and integration, visualization, provenance management, and web information discovery. She has made fundamental contributions to data management methods and tools that address problems introduced by emerging applications including urban analytics and computational reproducibility.
    Freire has published over 180 technical papers, several open-source systems, and is an inventor of 12 U.S. patents. She has co-authored 5 award-winning papers, including one that received the ACM SIGMOD Most Reproducible Paper Award. She is an ACM Fellow and a recipient of an NSF CAREER, two IBM Faculty awards, and a Google Faculty Research award. She has chaired or co-chaired workshops and conferences, and participated as a program committee member in over 70 events. Her research has been funded by the National Science Foundation, DARPA, Department of Energy, National Institutes of Health, Sloan Foundation, Gordon and Betty Moore Foundation, W. M. Keck Foundation, Google, Amazon, AT&T Research, Microsoft Research, Yahoo! and IBM. She received M.Sc. and Ph.D. degrees in computer science from the State University of New York at Stony Brook.

    Abstract: The ability to collect data from urban environments through a variety of sensors, coupled with a push towards openness and transparency by governments, has resulted in the availability of numerous spatio-temporal datasets containing information about diverse components of the cities, including their residents, infrastructure, and the environment. By analyzing the data exhaust from these components, we have the opportunity to better understand how they interact and obtain insights to help address important challenges brought about by urbanization with respect to transportation, resource consumption, housing affordability, and inadequate or aging infrastructure.
    In this talk, we present Data Polygamy (DP), a new approach that leverages computational topology to discover spatio-temporal relationships between disparate datasets. We show how these relationships can be used to provide explanations for patterns and unusual behavior in data, to identify important datasets and variables for model design, and to guide domain experts in data discovery. We also discuss experimental results which show that DP is efficient and scalable.

  • Teaching Machines to understand natural Language
    Speaker: Antoine Bordes (Facebook)

    Antoine Bordes has led the Facebook Artificial Intelligence Research lab in Paris since early 2017. Antoine joined the NYC lab of Facebook AI Research in 2014. Prior to joining Facebook, he was a CNRS researcher in Compiegne in France and a postdoctoral fellow in Yoshua Bengio’s lab at the University of Montreal. He received his PhD in machine learning from Pierre & Marie Curie University in Paris in 2010, with two awards for best PhD from the French Association for Artificial Intelligence and from the French Armament Agency. Antoine’s current interests are centered around natural language understanding using neural networks, with a focus on question answering and dialogue systems. He has published more than 50 papers, which have accumulated more than 6,000 citations.

    Abstract: Despite the recent successes of Deep Learning for multiple tasks ranging from image segmentation to speech recognition, understanding language remains a largely unsolved problem for machines. This is still highly challenging for multiple reasons such as the intrinsic complexity of language, the need for machine common sense or the difficulty of actually evaluating natural language understanding. Yet, current research is making progress and this talk will present some of it in the areas of open-domain question answering (answering questions on any topic) and machine reading (answering questions related to a short piece of text). We will show how the combined use of innovative neural network architectures with new training and test benchmarks can yield promising results.

  • Machine Learning Practice: How to Make a Machine Learning Strategy
    Speaker: Vanja Josifovski (Pinterest)

    Vanja Josifovski is the Chief Technology Officer of Pinterest. He is working on establishing the broader technical strategy and leads the efforts around machine learning for the company. Vanja also leads the Discovery team, a group that oversees content selection for the Pinterest home feed, search, and visual discovery. Prior to his current role, as head of Pinterest’s Growth team, Vanja was responsible for driving both user and engagement growth. Before joining Pinterest, Vanja worked on large scale machine learning and information extraction at Google Research. His career began with roles at Yahoo Research and IBM Research. Vanja holds a PhD in large scale database systems.

    Abstract: The past decade saw the rise of machine learning in practice, making it a fundamental part of the success of most companies. Fueled by the rise of the Web and the huge amounts of data produced by powerful mobile devices, the practice has changed multiple times over the last decade or so. In this talk we will review how to approach the question of defining a Machine Learning strategy at a small to medium size company such as Pinterest. The strategy helps guide the company’s investments and defines the framework for all ML work. We will give an overview of the key elements of the strategy at Pinterest and the thought process behind it.

  • Speech and Language to AI Evolution
    Speaker: Xuedong Huang (Microsoft)

    Xuedong Huang is a Microsoft Technical Fellow in AI and Research. He leads Microsoft’s Speech and Language Group.

    In 1993, Huang joined Microsoft to found the company’s speech technology group. As the general manager of Microsoft’s spoken language efforts, he helped to bring speech to the mass market by introducing SAPI to Windows in 1995 and Speech Server to the enterprise call center in 2004. He served as General Manager for MSR Incubation and Chief Architect for Bing and Ads. In 2015, he returned to AI and Research to lead the advanced technology group.

    Abstract: Amongst all creatures the human species stands unique in Darwin’s natural selection process because of our ability to communicate, our ability to manipulate symbols, and our ability to construct language. Speech and language provide the way we communicate our collective intelligence from one generation to the next. It is no exaggeration to state that it is speech and language that differentiated human intelligence from animal intelligence in the course of evolution. The impact of speech and language on the evolution of AI should be as foundational as that of speech and language on the evolution of Homo sapiens!

Panel: Machine learning in the medicine program

List of participants

  • Shai Shen-Orr (Technion)

    Systems Biologist and Data Scientist Shai Shen-Orr is the Co-founder and Chief Scientist of CytoReason and a Professor in the Faculty of Medicine at the Technion – Israel Institute of Technology, where he directs the laboratory of Systems Immunology and Precision Medicine. In his research, Shai develops new analytical methodologies for grappling with the intricate complexities of the immune system. He has applied these tools to understand how the immune system works in advanced age and to define biomarkers to evaluate immune health.
    His research has been cited numerous times and has laid the foundation of CytoReason, a company that uses an artificial intelligence model of the immune system to make predictions from biological data. Shai did his BSc at the Technion ('99), MSc at the Weizmann Institute of Science ('01), PhD at Harvard University ('07) and his postdoctoral studies at Stanford University.

  • Jimeng Sun (GA Tech)

    Jimeng Sun is an Associate Professor in the College of Computing at Georgia Tech. Prior to Georgia Tech, he was a researcher at the IBM TJ Watson Research Center. His research focuses on health analytics and machine learning, especially the design of tensor factorizations, deep learning methods, and large-scale predictive modeling systems. He has published over 120 papers and filed over 20 patents (5 granted). He received the SDM/IBM Early Career Research Award in 2017, the ICDM Best Research Paper Award in 2008, the SDM Best Research Paper Award in 2007, and the KDD Dissertation Runner-up Award in 2008. Dr. Sun received his B.S. and M.Phil. in Computer Science from Hong Kong University of Science and Technology in 2002 and 2003, and his M.Sc. and Ph.D. in Computer Science from Carnegie Mellon University in 2006 and 2007.

  • Elad Yom-Tov (Microsoft Research)

    Elad Yom-Tov is a Principal Researcher at Microsoft Research. Before joining Microsoft he was with Yahoo Research, IBM Research, and Rafael. His primary research interests are in applying large-scale Machine Learning and Information Retrieval methods to medicine. Dr. Yom-Tov studied at Tel-Aviv University and the Technion, Israel. He has published four books, over 100 papers (of which 3 were awarded prizes), and was awarded more than 20 patents. His latest book is “Crowdsourced Health: How What You Do on the Internet Will Improve Medicine” (MIT Press, 2016).

  • Alessandro Vespignani (Northeastern University)

    Alessandro Vespignani is the Sternberg Family Distinguished University Professor at Northeastern University. He is the founding director of the Network Science Institute and leads the Laboratory for the Modeling of Biological and Socio-technical Systems.
    His recent work focuses on data-driven computational modeling and forecasting of emerging infectious diseases; resilience of complex networks; and collective behavior of techno-social systems.
    He is an elected fellow of the American Physical Society, a member of the Academy of Europe, and a fellow of the Institute for Quantitative Social Sciences at Harvard University. He has served on the boards and in the leadership of a variety of professional associations, journals and the Institute for Scientific Interchange Foundation.

Accepted Papers

  • Fine-grained Video Attractiveness Prediction Using Multimodal Deep Learning on a Large Real-world Dataset
    Authors: Xinpeng Chen, Jingyuan Chen, Lin Ma, Jian Yao, Wei Liu, Jiebo Luo and Tong Zhang

    Billions of videos are online ready to be viewed and shared. Among the enormous volume of videos, some popular ones are widely viewed by online users while the majority attract little attention. Furthermore, within each video, different segments may attract significantly different numbers of views. This phenomenon leads to a challenging yet important problem, namely fine-grained video attractiveness prediction, which relies only on the video content to forecast video attractiveness at fine-grained levels, specifically video segments 5 seconds long in this paper. However, one major obstacle for such a challenging problem is that no suitable benchmark dataset currently exists. To this end, we construct the first fine-grained video attractiveness dataset (FVAD), which is collected from one of the most popular video websites, Tencent Video. In total, the constructed FVAD consists of 1,019 drama episodes totaling 780.6 hours and covering different categories and a wide variety of video content. Apart from the large number of videos, hundreds of millions of user behaviors while watching videos are also included, such as view counts, fast-forward, fast-rewind, and so on, where view counts reflect the video attractiveness while other engagements capture the interactions between the viewers and videos. First, we demonstrate that video attractiveness and the different engagements present different relationships. Second, FVAD provides us an opportunity to study the fine-grained video attractiveness prediction problem. We design different sequential models to perform video attractiveness prediction by relying solely on video content. The sequential models exploit the multimodal relationships between visual and audio components of the video content at different levels.
    Experimental results demonstrate the effectiveness of our proposed sequential models on different visual and audio representations, the necessity of incorporating the two modalities, as well as the complementary behaviors of the sequential prediction models at different levels. As a side contribution, the FVAD dataset will be released to help researchers improve fine-grained video attractiveness prediction.

  • Urban Perception of Commercial Activeness from Satellite Images and Streetscapes
    Authors: Wenshan Wang, Su Yang, Zhiyuan He, Minjie Wang, Jiulong Zhang and Weishan Zhang

    People can perceive social attributes such as safety, richness, and happiness from streetscapes by means of visual perception, which inspires research on urban perception. To the best of our knowledge, this is the first work focused on revealing the relationship between commercial activeness and the visual patterns of satellite images and streetscapes. We propose to make use of bag of features (BoF) in the context of computer vision and sparse representation in the sense of machine learning to predict the commercial activeness of urban commercial districts. After obtaining the urban commercial districts via clustering, we predict their degrees of commercial activeness using four image features, namely, Histogram of Oriented Gradients (HOG), Autoencoder, GIST, and multifractal spectra for satellite images and street view images, respectively. The performance evaluation with four large-scale datasets demonstrates that the presented computational framework can not only predict commercial activeness with satisfactory precision compared with predictions based on Point of Interest (POI) data but also discover the related visual patterns.

  • Positivity Bias in Customer Satisfaction Ratings
    Authors: Kunwoo Park, Meeyoung Cha and Eunhee Rhim

    Customer ratings are a valuable source for understanding customer satisfaction and are critical for designing better customer experiences and recommendations. The majority of customers, however, do not respond to rating surveys, which makes the results less representative. To understand overall satisfaction, this paper investigates how likely customers who did not respond were to have had satisfactory experiences compared to respondents. To infer customer satisfaction in such unlabeled sessions, we propose models using recurrent neural networks (RNNs) that learn continuous representations of unstructured text conversations. By analyzing online chat logs of over 170,000 sessions from Samsung’s customer service department, we make a novel finding that while labeled sessions contributed by a small fraction of customers received overwhelmingly positive reviews, the majority of unlabeled sessions would have received lower ratings from customers. The data analytics presented in this paper not only have practical implications for helping detect dissatisfied customers on live chat services but also make theoretical contributions on discovering the level of bias in online rating platforms.

  • Anomaly Detection with Partially Observed Anomalies
    Authors: Ya-Lin Zhang, Longfei Li, Jun Zhou, Xiaolong Li and Zhi-Hua Zhou

    In this paper, we consider the problem of anomaly detection. Previous studies mostly deal with this task in either a supervised or an unsupervised manner according to whether label information is available. However, there always exist settings which differ from these two standard ones. In this paper, we address the scenario when anomalies are partially observed, i.e., we are given a large number of unlabeled instances as well as a handful of labeled anomalies. We refer to this problem as anomaly detection with POA (partially observed anomalies), and propose a two-stage method, ADOA, to solve it.
    Firstly, by addressing the differences between the anomalies, the observed anomalies are clustered, while the unlabeled instances are filtered to get potential anomalies and reliable normal instances. Then, with the above instances, a weight is attached to each instance according to the confidence of its label, and a weighted multi-class model is built, which is then used to distinguish different anomalies from the normal instances.
    Experimental results show that in the aforementioned setting, existing methods perform unsatisfactorily and the proposed method performs significantly better than all of them, which validates the effectiveness of the proposed approach under the described setting.

  • Automated Extractions for Machine Generated Mail
    Authors: Dotan Di Castro, Iftah Gamzu, Irena Grabovitch-Zuyev, Liane Lewin-Eytan, Abhinav Pundir, Nil Sahoo and Michael Viderman

    Mail extraction is a critical task whose objective is to extract valuable data from the content of mail messages. This task is key for many types of applications including re-targeting, mail search, and mail summarization, which utilize the important personal data pieces in mail messages to achieve their objectives. We focus on machine generated traffic, which comprises most of the Web mail traffic today, and use its structured and large-scale repetitive nature to devise a fully automated extraction method. Our solution builds on an advanced structural clustering technique previously presented by some of the authors of this work. The heart of our solution is an offline process that leverages the structural mail-specific characteristics of the clustering, and automatically creates extraction rules that are later applied online for each newly arriving message. We provide a full description of our process, which has been productized in the Yahoo Mail backend. We complete our work with large-scale experiments carried out on real Yahoo Mail traffic, and evaluate the performance of our automatic extraction method.

  • Post Purchase Search Engine Marketing
    Authors: Qianyun Zhang, Shawndra Hill and David Rothschild

    Though consumer behavior in response to search engine marketing has been studied extensively, few efforts have been made to understand how consumers search and respond to ads post purchase. This is in part due to the fact that purchases are difficult to track and link to search queries. Thus, it is unsurprising that advertisers have been targeting consumers on search engines similarly regardless of the heterogeneity in their search intents and context. Advertising to current customers the same way as to prospective customers inevitably leads to wasteful and inefficient marketing. Employing a unique dataset that combines both search query and purchase data, we examine consumers’ searching behavior and response to search engine marketing after purchase. We study large advertising campaigns for two popular technology products. We find that over half of the branded keyword searches come from consumers who have already purchased, and that advertising response varies based on whether searchers are pre or post purchase. In general, post-purchase searchers are less likely to click on focal brand ads (i.e., they are less responsive to ads for products they already own). However, post-purchase searchers are still responsive to advertising, and much more likely to click on ads for complementary products (i.e., they are more responsive to ads for relevant products other than the focal product). Our findings offer unique academic contributions regarding consumer behavior along with practical implications for how platforms should market post-purchase targeting and how marketers should advertise to customers post purchase.

More about The BIG Web Track and previous BIG conferences

The BIG Web Track Chairs:

  • Evgeniy Gabrilovich (Google, US)
  • Kira Radinsky (eBay, Israel)
  • Kuansan Wang (Microsoft, US)

The BigData Innovators Gathering (BIG) will bring together academic and industry leaders in the big data space to share the state of the art and its successful application in business. This event will be co-located with the WWW conference for the fifth time, now as a fully fledged alternate track named The BIG Web in The Web Conference.