Web and Society

List of accepted papers:

  • VizByWiki: Mining Data Visualizations from the Web to Enrich News Articles
    Authors: Allen Yilun Lin, Joshua Ford, Eytan Adar and Brent Hecht

    Keywords: News articles, Wikimedia Commons, User-generated content, Data visualizations, Peer-production, Wikipedia

    Data visualizations in news articles (e.g. maps, line graphs, bar charts) greatly enrich the content of these articles and produce well-established improvements to reader comprehension. However, existing systems that generate news data visualizations either require substantial manual effort or are limited to very specific types of data visualizations, thereby greatly restricting the number of news articles that can be enhanced with visualizations. To address this issue, we define a new problem: given a news article, retrieve relevant visualizations that already exist on the web. We show that this problem is tractable through a new system – VizByWiki – that mines contextually relevant data visualizations from Wikimedia Commons, the central file repository for Wikipedia. Using a novel ground truth dataset, we show that VizByWiki is able to retrieve at least one relevant data visualization for as many as 48% of popular online news articles. We also demonstrate that VizByWiki can automatically rank visualizations according to their usefulness with reasonable ranking quality (nDCG@5 of 0.82). To facilitate further advances on the news visualization retrieval problem, we release our ground truth dataset and make our system and its source code publicly available.

  • Geographical Feature Extraction for Entities in Location-based Social Networks
    Authors: Daizong Ding, Mi Zhang, Xudong Pan, Duocai Wu and Pearl Pu

    Keywords: Location-based Social Networks, Feature Embedding, Deep Learning

    Location-based embedding is a fundamental problem to solve in location-based social networks (LBSN). In this paper, we propose a geographical convolutional neural tensor network (GeoCNTN) as a generic embedding model. GeoCNTN first takes the raw location data and extracts from it a better-conditioned representation using our proposed Geo-CMeans algorithm. We then use a convolutional neural network (CNN) and an embedding structure to extract individual latent structural patterns from the preprocessed data. Finally, we apply a neural tensor network (NTN) to combine the implicitly related features we have obtained into a unified geographical feature. The advantages of our GeoCNTN mainly come from its novel neural network structure, which intrinsically offers a mechanism to extract latent structural features from the geographical data, as well as its wide applicability in various LBSN-related tasks. We evaluate the embedding efficacy of our model in two case studies, i.e. link prediction and entity classification in user-group LBSN. Results on the Meetup-USA dataset show that GeoCNTN performs significantly better on both tasks, with improvements of 9% in NDCG and 11% in F1 score, respectively.

  • On Ridesharing Competition and Accessibility: Evidence from Uber, Lyft, and Taxi
    Authors: Shan Jiang, Le Chen, Christo Wilson and Alan Mislove

    Keywords: ridesharing, sharing economy, spatial econometrics

    Ridesharing services such as Uber and Lyft have become an important part of the Vehicle For Hire (VFH) market, which used to be dominated by taxis. Unfortunately, ridesharing services are not required to share data like taxis, which has made it challenging to compare the competitive dynamics of these services, or assess their impact on cities. In this paper, we comprehensively compare Uber, Lyft, and taxis with respect to key market features (supply, demand, price, and wait time) in San Francisco and New York City. Based on point pattern statistics, we develop novel statistical techniques to validate our measurement methods. Using spatial lag models, we investigate the accessibility of VFH services, and find that transportation infrastructure and socio-economic features have substantial effects on VFH market features.

  • Auditing the Personalization and Composition of Politically-Related Search Engine Results Pages
    Authors: Ronald Robertson, David Lazer and Christo Wilson

    Keywords: Search engine results, search ranking bias, search suggestions, personalization, filter bubble

    Search engines are a primary means through which people obtain information in today’s connected world. Yet, apart from the search engine companies themselves, little is known about how their algorithms filter, rank, and present the web to users. This question is especially pertinent with respect to political queries, given growing concerns about filter bubbles, and the recent finding that bias or favoritism in search rankings can influence voting behavior. In this study, we conduct a targeted algorithm audit of Google Search using a dynamic set of political queries. We designed a Chrome extension to survey participants and collect the Search Engine Results Pages (SERPs) and autocomplete suggestions that they would have been exposed to while searching our set of political queries during the month after Donald Trump’s Presidential inauguration. Using this data, we found significant differences in the composition and personalization of politically-related SERPs by query type, subjects’ characteristics, and date.

  • “You are no Jack Kennedy”: On Media Selection of Highlights from Presidential Debates
    Authors: Chenhao Tan, Hao Peng and Noah Smith

    Keywords: media bias, presidential debates, quotations, wording

    Political speeches and debates play an important role in shaping the images of politicians, and the public often relies on media outlets to select bits of political communication from a large pool of utterances. It is an important research question to understand what factors impact this selection process. To quantitatively explore the selection process, we build a three-decade dataset of presidential debate transcripts and post-debate coverage. We first examine the effect of wording and propose a binary classification framework that controls for both the speaker and the debate situations. We find that crowdworkers can only achieve an accuracy of 60% in this task, indicating that media choices are not entirely obvious. Our classifiers outperform crowdworkers on average, mainly in primary debates. We also compare important factors from crowdworkers’ free responses with those from data-driven methods and find interesting differences. Few crowdworkers mentioned that “context matters”, whereas our data show that well-quoted sentences are more distinct from the previous utterance by the same speaker than less-quoted sentences. Finally, we examine the aggregate effect of media preferences towards different wordings to understand the extent of fragmentation among media outlets. By analyzing a bipartite graph built from quoting behavior in our data, we observe a decreasing trend in bipartisan coverage.

  • What We Read, What We Search: Media Attention and Public Attention among 193 Countries
    Authors: Haewoon Kwak, Jisun An, Joni Salminen, Soon-Gyo Jung and Bernard Jansen

    Keywords: Media Attention, Public Attention, Google Trends, Google News, Multiplex Network

    In this research, we investigate the alignment of international attention of news media organizations within 193 countries with the expressed international interests of the public within those same countries from March 7, 2016 to April 14, 2017. We collect fourteen months of large-scale longitudinal data of online news from Unfiltered News and web search volume from Google Trends. We build a multiplex network of media attention and public attention in order to study its structural and dynamic properties, along with investigating specific topical interests by country. Structurally, the media attention and the public attention are both similar and different depending on the resolution of the analysis. For example, we find that 63.2% of the country-specific media and the public pay attention to different countries, but local attention flow patterns, which are measured by network motifs, are very similar. We also show that there are strong regional similarities in both media and public attention that are only disrupted by major worldwide incidents (e.g., Brexit, Nice Terrorist Attack). Using Granger causality, we show that there are a substantial number of countries where media attention and public attention are dissimilar by topical interest. Our findings show that the media and public attention toward specific countries are often at odds, indicating that the public within these countries may be ignoring their country-specific news outlets and seeking other online sources to address their media needs and desires.

  • Political Discourse on Social Media: Echo Chambers, Gatekeepers, and the Price of Bipartisanship
    Authors: Kiran Garimella, Gianmarco De Francisci Morales, Aristides Gionis and Michael Mathioudakis

    Keywords: echo chambers, polarization, filter bubble, controversy, social media

    Echo chambers, i.e., situations where one is exposed only to opinions that agree with their own, are an increasing concern for the political discourse in many democratic countries. This paper studies the phenomenon of political echo chambers on social media. We identify the two components in the phenomenon: the opinion that is shared (“echo”), and the place that allows its exposure (“chamber” – the social network), and examine closely how these two components interact. We define a production and consumption measure for social-media users, which captures the political leaning of the content shared and received by them. By comparing the two, we find that Twitter users are, to a large degree, exposed to political opinions that agree with their own. We also find that users who try to bridge the echo chambers, by sharing content with diverse leaning, have to pay a “price of bipartisanship” in terms of their network centrality and content appreciation. In addition, we study the role of “gatekeepers,” users who consume content with diverse leaning but produce partisan content (with a single-sided leaning), in the formation of echo chambers. Finally, we apply these findings to the task of predicting partisans and gatekeepers from social and content features. While partisan users turn out to be relatively easy to identify, gatekeepers prove to be more challenging.

  • Me, My Echo Chamber, and I: Introspection on Social Media Polarization
    Authors: Nabeel Gillani, Ann Yuan, Martin Saveski, Soroush Vosoughi and Deb Roy

    Keywords: political polarization, randomized experiment, social networks

    Homophily (our tendency to surround ourselves with others who share our perspectives and opinions about the world) is both a part of human nature and an organizing principle underpinning many of our digital social networks. However, when it comes to politics or culture, homophily can amplify tribal mindsets and produce “echo chambers” that degrade the quality, safety, and diversity of discourse online. While several studies have empirically proven this point, few have explored how making users aware of the extent and nature of their political echo chambers influences their subsequent beliefs and actions. In this paper, we introduce Social Mirror, a social network visualization tool that enables a sample of Twitter users to explore the politically-active parts of their social network. We use Social Mirror to recruit Twitter users with a prior history of political discourse to a randomized experiment where we evaluate the effects of different treatments on participants’ i) beliefs about their network connections, ii) the political diversity of who they choose to follow, and iii) the political alignment of the URLs they choose to share. While we see no effects on average political alignment of shared URLs, we find that recommending accounts of the opposite political ideology to follow reduces participants’ beliefs in the political homogeneity of their network connections but still enhances their connection diversity one week after treatment. Conversely, participants who enhance their belief in the political homogeneity of their Twitter connections have less diverse network connections 2-3 weeks after treatment. We explore the implications of these disconnects between beliefs and actions on future efforts to promote healthier exchanges in our digital public spheres.

  • Algorithmic Glass Ceiling in Social Networks: The effects of social recommendations on network diversity
    Authors: Ana Stoica, Christopher Riederer and Augustin Chaintreau

    Keywords: social recommender, fairness, random walks, homophily

    Social recommendations (friend suggestion, people to follow, and the like) were shown to affect the network growth of social media. Simultaneously, a growing body of work has documented signs of intrinsic barriers to equal opportunity online, either due to decisions informed by algorithms using personal data, or even in the spontaneous growth of interactions that online services facilitate. Leveraging new data collected from Instagram, we offer here for the first time an analysis that studies the effect of gender, homophily and growth dynamics together with the effect of social recommendation algorithms. Our main finding is that prominent social recommendation algorithms, under natural conditions, exacerbate the under-representation of demographic groups at the top. We prove, empirically and through mathematical analysis, the presence of an algorithmic glass ceiling, exhibiting all properties of the metaphorical barrier preventing subgroups from reaching superior notoriety. Most concerning, we mathematically prove, under fixed minority and homophily parameters, that the algorithmic effect is systematically larger than the glass ceiling generated by the spontaneous growth of social networks. We briefly discuss ways to address this concern in future designs.

  • (Don’t) Mention the War: A Comparison of Wikipedia and Britannica Articles on National Histories
    Authors: Anna Samoilenko, Florian Lemmerich, Maria Zens, Mohsen Jadidi, Mathieu Génois and Markus Strohmaier

    Keywords: Computational history, Collective memory, Wikipedia, Britannica, Null Model, Focal points, Readability, Natural language processing

    In this paper we present a large-scale quantitative comparison between expert- and crowdsourced writing of history by analysing articles from the English Wikipedia and Britannica. In order to quantify attention to particular periods, we extract mentioned year numbers and utilise them to study historical timelines of nations stretched over the last thousand years. By combining this temporal analysis with lexical analysis of both encyclopedic corpora we can identify distinctive historiographic points of view in each encyclopedia. We find that Britannica focuses on social and cultural phenomena, e.g. religion, as well as the geographical characteristics of states, while Wikipedia puts emphasis on political aspects, concentrating on wars and violent conflicts, and events of high popularity. Finally, both encyclopedias exhibit characteristics of English Academic prose, with Britannica being slightly less readable compared to Wikipedia, according to several readability scores.

  • Human Perceptions of Fairness in Algorithmic Decision Making: A Case Study of Criminal Risk Prediction
    Authors: Nina Grgic-Hlaca, Elissa M. Redmiles, Krishna P. Gummadi and Adrian Weller

    Keywords: Algorithmic fairness, Procedural fairness, Fair feature selection

    Algorithms are increasingly used to make important decisions about human lives, ranging from social benefit assignment to predicting risk of criminal recidivism. The sensitivity of these decisions has raised important questions about ensuring fairness in algorithmic decision making. Recent work has focused on addressing two aspects of fairness: distributional fairness, concerning the fairness of the outcomes of decision making, and procedural fairness, concerning the fairness of the decision-making process. Assessment of procedural fairness relies on human judgments, for example, about the fairness of using certain features in the algorithm making the decision. Yet to our knowledge, almost no prior work examines people’s perceptions of fairness in algorithmic decision making. We propose a framework for understanding why people perceive certain features as fair to be used in algorithms. We define a framework of eight underlying properties of features, such as relevance, volitionality and reliability, and hypothesize that people use these properties to form a heuristic for making moral judgments about the fairness of feature use in decision-making algorithms. We evaluate our framework through a series of scenario-based surveys with 576 people. We find that, based on a person’s assessment of the eight underlying properties in our exemplar scenario, we can predict if they will perceive a feature as fair to be used with at least 87% accuracy.

  • Adaptive Sensitive Reweighting to Mitigate Bias in Fairness-aware Classification
    Authors: Emmanouil Krasanakis, Eleftherios Spyromitros-Xioufis, Symeon Papadopoulos and Yiannis Kompatsiaris

    Keywords: Classification Fairness, Treating Bias, Reweighting

    Machine learning bias and fairness have recently emerged as key issues due to the pervasive deployment of data-driven decision making in a variety of sectors and services. It has often been argued that unfair classifications can be attributed to bias in training data, but previous attempts to ‘repair’ training data have led to limited success. To circumvent shortcomings prevalent in data repairing approaches, e.g. by weighting training samples of the sensitive group (e.g. gender, race, financial status) based on their misclassification error, we present a process that iteratively adapts training sample weights with a theoretically grounded model. This model addresses different kinds of bias to better achieve fairness objectives, such as trade-offs between accuracy and disparate impact elimination or disparate mistreatment elimination. We show that, compared to previous fairness-aware approaches on real-world and synthetic datasets, our methodology achieves better or similar trade-offs between accuracy and unfairness mitigation.

  • To Stay or to Leave: Churn Prediction for Urban Migrants in the Initial Period
    Authors: Yang Yang, Zongtao Liu, Chenhao Tan, Fei Wu, Yueting Zhuang and Yafeng Li

    Keywords: urban migrants, migrant integration, churn prediction, mobile communication networks

    In China, 260 million people migrate to cities to realize their urban dreams. Despite the fact that these migrants play an important role in the rapid urbanization process, many of them fail to settle down and eventually leave the city. The integration process of migrants is thus an important issue both for scholars and policymakers. In this paper, we use Shanghai as an example to investigate migrants’ behavior in their first weeks and in particular, how their behavior relates to early departure. We employ a one-month complete dataset of telecommunication metadata in Shanghai with 54 million users and 698 million call logs, plus a novel housing price dataset for 20K real estates in Shanghai. This dataset allows us to identify new migrants to Shanghai because it is uncommon for a temporary visitor to apply for a local number in China. We find that migrants who end up leaving early tend to neither develop diverse connections in their first weeks nor move around the city. Their active areas also have higher housing prices than those of staying migrants. We formulate classification tasks to predict whether a migrant is going to leave based on her behavior in the first few days. The prediction performance improves as we include data from more days. Interestingly, when using the same features, the classifier trained from only the first few days is already as good as the classifier trained using full data, suggesting that the performance difference mainly lies in the difference between features.

  • Computationally Inferred Genealogical Networks Uncover Long-Term Trends in Assortative Mating
    Authors: Eric Malmi, Aristides Gionis and Arno Solin

    Keywords: genealogy, family tree, pedigree, population reconstruction, probabilistic record linkage, assortative mating, social stratification, homogamy

    Genealogical networks, also known as family trees or population pedigrees, are commonly studied by genealogists wanting to know about their ancestry, but they also provide a valuable resource for disciplines such as digital demography, genetics, and computational social science. These networks are typically constructed by hand through a very time-consuming process, which requires comparing large numbers of historical records manually. We develop computational methods for automatically inferring large-scale genealogical networks. A comparison with human-constructed networks attests to the accuracy of the proposed methods. To demonstrate the applicability of the inferred large-scale genealogical networks, we present a longitudinal analysis on the mating patterns observed in a network. This analysis shows a consistent tendency of people choosing a spouse with a similar socioeconomic status, a phenomenon known as assortative mating. Interestingly, we do not observe this tendency to consistently decrease (nor increase) over our study period of 150 years.

  • A Structured Approach to Understanding Recovery and Relapse in AA
    Authors: Yue Zhang, Arti Ramesh, Jennifer Golbeck, Dhanya Sridhar and Lise Getoor
    Presentation moved from track Web Content Analysis, Semantics and Knowledge

    Keywords: alcoholism, alcoholics anonymous, social media analysis, hinge-loss Markov random fields, Twitter

    Alcoholism, also known as Alcohol Use Disorder (AUD), is a serious problem affecting millions of people worldwide. Recovery from AUD is known to be challenging and often leads to relapse at various points after enrolling in a rehabilitation program such as Alcoholics Anonymous (AA). In this work, we take a structured approach to understand recovery and relapse from AUD using social media data. To do so, we combine linguistic and psychological attributes of users with relational features that capture useful structure in the user interaction network. We evaluate our models on AA-attending users extracted from the Twitter social network and predict recovery at two different points, 90 days and 1 year after the user joins AA. Our experiments reveal that our structured approach is helpful in predicting recovery in these users. We perform extensive quantitative analysis of different groups of features and dependencies among them. Our analysis sheds light on the role of each feature group and how they combine to predict recovery and relapse. Finally, we present qualitative analysis of different reasons behind users relapsing to AUD. Our models and analysis are helpful in making meaningful predictions in scenarios where only a subset of features is available. Our analysis can potentially be helpful in identifying and preventing relapse early.

  • Community Interaction and Conflict on the Web
    Authors: Srijan Kumar, William L. Hamilton, Jure Leskovec and Dan Jurafsky

    Keywords: intercommunity, user interaction, antisocial behavior

    Users organize themselves into communities on web platforms. These communities can interact with one another, often leading to conflicts and toxic interactions. However, little is known about the mechanisms of interactions between communities and how they impact users. Here we study intercommunity interactions across 36,000 communities on Reddit, examining cases where users of one community are mobilized by negative sentiment to comment in another community. We show that such conflicts tend to be initiated by a handful of communities: less than 1% of communities start 74% of conflicts. While conflicts tend to be initiated by highly active community members, they are carried out by significantly less active members. We find that conflicts are marked by formation of echo chambers, where users primarily talk to other users from their own community. In the long term, conflicts have adverse effects and reduce the overall activity of users in the targeted communities. Our analysis of user interactions also suggests strategies for mitigating the negative impact of conflicts, such as increasing the direct engagement between attackers and defenders. Further, we design classifiers to predict whether conflict will occur by creating an LSTM model which combines graph embeddings, user, community, and text features, and we also use these techniques to predict if a user will participate in a conflict. Altogether, this work presents a data-driven view of community interactions and conflict, and paves the way towards healthier online communities.
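
Note on the ranking metric: two of the abstracts above report ranking quality as nDCG (nDCG@5 of 0.82 for VizByWiki, and a 9% NDCG improvement for GeoCNTN). For readers unfamiliar with the metric, the following is a minimal sketch of one common way to compute nDCG@k. It assumes graded relevance labels, a linear gain, and a log2 rank discount; the papers themselves may use a different variant (for example an exponential gain), and the function names here are illustrative only, not taken from either paper.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance scores.

    relevances: graded relevance labels in ranked order (best-ranked first).
    Assumes a linear gain and a log2(rank + 1) discount with 1-based ranks.
    """
    return sum(
        rel / math.log2(position + 2)  # position is 0-based, so rank + 1 = position + 2
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG@k of the given ranking divided by the DCG@k of the ideal ordering."""
    ideal = sorted(relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant items at all; define the score as 0
    return dcg_at_k(relevances, k) / ideal_dcg
```

Under these assumptions, a ranking that already lists items in decreasing order of relevance scores 1.0, and any ranking that places a less relevant item above a more relevant one scores strictly less than 1.0.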