In this keynote we describe progress in work that our research teams have been doing over the past years, including advances on difficult problems in artificial intelligence, on building large-scale computer systems for machine learning research, and, in collaboration with many teams at Google, on applying our research and systems to dozens of Google products. Our group has open-sourced the TensorFlow system [2], a widely used system designed to easily express machine learning ideas and to quickly train, evaluate, and deploy machine learning systems. We then highlight some of our research accomplishments and relate them to the National Academy of Engineering's Grand Engineering Challenges for the 21st Century.
This is joint work with many people at Google.
The Internet moves in phases, and we are entering the third in 20 years. In this keynote, using a framework drawn from the Law of the Horse [1], I describe the phase we are entering - the surveillance phase - and the threat it presents to society generally, and democracy in particular. Along the way, I offer an understanding of the Net circa 1999, and the phase that followed it, circa 2009. At each stage, our inability to govern has been a significant liability. In the phase we are entering, it will be devastating.
Over the past three years, platforms, governments and a plethora of nonprofit initiatives have prioritized fighting online misinformation through a variety of different means. Yet the current framework is too fragmented to deliver global results. The big tech platforms have data, but no public accountability. Governments (mostly) have democratic legitimacy, but little information on what is actually going on in the platforms they're itching to regulate. And nonprofit initiatives too often lack the scale to effect change at the level needed. What if we came up with a dramatically new deliberative process that involves a global community of concerned citizens ready to share information and participate in consultations to improve collective decision-making? What if a more accountable, diverse and verifiable Web were still possible?
Existing unbiased learning-to-rank models use counterfactual inference, notably Inverse Propensity Scoring (IPS), to learn a ranking function from biased click data. They handle the click incompleteness bias, but usually assume that the clicks are noise-free, i.e., a clicked document is always assumed to be relevant. In this paper, we relax this unrealistic assumption and study click noise explicitly in the unbiased learning-to-rank setting. Specifically, we model the noise as the position-dependent trust bias and propose a noise-aware Position-Based Model, named TrustPBM, to better capture user click behavior. We propose an Expectation-Maximization algorithm to estimate both examination and trust bias from click data in TrustPBM. Furthermore, we show that it is difficult to use a pure IPS method to incorporate click noise and thus propose a novel method that combines a Bayes rule application with IPS for unbiased learning-to-rank. We evaluate our proposed methods on three personal search data sets and demonstrate that our proposed model can significantly outperform the existing unbiased learning-to-rank methods.
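To make the bias-correction idea concrete, here is a minimal Python sketch of a Bayes-corrected, propensity-weighted click weight under a trust-bias click model. The parameter names (theta, eps_pos, eps_neg, prior) and the particular combination are illustrative assumptions, not the exact TrustPBM estimator derived in the paper.

```python
import numpy as np

def bayes_corrected_weight(position, theta, eps_pos, eps_neg, prior=0.5):
    """Weight for a click observed at `position` (illustrative, not TrustPBM's exact form).

    theta[k]   : P(examined | position k)         -- examination bias
    eps_pos[k] : P(click | examined, relevant)    -- trust bias on relevant docs
    eps_neg[k] : P(click | examined, irrelevant)  -- click noise
    prior      : assumed prior probability of relevance
    """
    # Bayes rule: probability that the clicked document is actually relevant.
    p_rel_given_click = (eps_pos[position] * prior) / (
        eps_pos[position] * prior + eps_neg[position] * (1.0 - prior))
    # Inverse-propensity part: correct for unequal examination probabilities.
    ips = 1.0 / (theta[position] * eps_pos[position])
    return p_rel_given_click * ips

# Toy bias parameters for a 3-position ranking (assumed values).
theta   = np.array([0.95, 0.70, 0.45])
eps_pos = np.array([0.90, 0.85, 0.80])
eps_neg = np.array([0.15, 0.10, 0.08])

for k in range(3):
    print(k, round(bayes_corrected_weight(k, theta, eps_pos, eps_neg), 3))
```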
Graph edges, along with their labels, can represent information of fundamental importance, such as links between web pages, friendship between users, the rating given by users to other users or items, and much more. We introduce LEAP, a trainable, general framework for predicting the presence and properties of edges on the basis of the local structure, topology, and labels of the graph. The LEAP framework is based on the exploration and machine-learning aggregation of the paths connecting nodes in a graph. We provide several methods for performing the aggregation phase by training path aggregators, and we demonstrate the flexibility and generality of the framework by applying it to the prediction of links and user ratings in social networks.
We validate the LEAP framework on two problems: link prediction and user rating prediction. On eight large datasets, including the arXiv collaboration network, the yeast protein-protein interaction network, and the US airline routes network, we show that the link prediction performance of LEAP is at least as good as that of current state-of-the-art methods, such as SEAL and WLNM. Next, we consider the problem of predicting user ratings on other users: this problem is known as the edge-weight prediction problem in weighted signed networks (WSN). On Bitcoin networks and Wikipedia RfA, we show that LEAP performs consistently better than the Fairness & Goodness based regression models when varying the fraction of training edges from 10% to 90%. These examples demonstrate that LEAP, in spite of its generality, can match or best the performance of approaches that have been especially crafted to solve very specific edge prediction problems.
Email remains a critical channel for communicating information in both personal and work accounts. The number of emails people receive every day can be overwhelming, which in turn creates challenges for efficient information management and consumption. Having a good estimate of the significance of emails forms the foundation for many downstream tasks (e.g. email prioritization); but determining significance at scale is expensive and challenging.
In this work, we hypothesize that the cumulative set of actions on any individual email can be considered as a proxy for the perceived significance of that email. We propose two approaches to summarize observed actions on emails, which we then evaluate against the perceived significance. The first approach is a fixed-form utility function parameterized on a set of weights, and we study the impact of different weight assignment strategies. In the second approach, we build machine learning models to capture perceived significance directly from the observed actions. For evaluation, we collect human judgments on email significance for both personal and work emails. Our analysis suggests that there is a positive correlation between actions and significance of emails and that actions performed on personal and work emails are different. We also find that the degree of correlation varies across people, which may reflect the individualized nature of email activity patterns or significance. Subsequently, we develop an example of real-time email significance prediction by using action summaries as implicit feedback at scale. Evaluation results suggest that the resulting significance predictions have positive agreement with human assessments, albeit not at statistically strong levels. We speculate that we may require personalized significance prediction to improve agreement levels.
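As a rough illustration of the first approach, the following sketch computes a fixed-form utility score as a weighted sum of per-email action counts. The action names and weight values are hypothetical; the paper studies several weight assignment strategies rather than prescribing these numbers.

```python
# Hypothetical action names and weights; the paper evaluates several
# weight assignment strategies rather than prescribing these values.
ACTION_WEIGHTS = {
    "open": 1.0,
    "reply": 3.0,
    "forward": 2.5,
    "star": 2.0,
    "archive": 0.5,
    "delete": -1.0,
}

def significance_score(action_counts):
    """Proxy significance = weighted sum of per-email action counts."""
    return sum(ACTION_WEIGHTS.get(action, 0.0) * count
               for action, count in action_counts.items())

# Example: an email that was opened twice, replied to once, and starred.
print(significance_score({"open": 2, "reply": 1, "star": 1}))  # 7.0
```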
Can neural networks learn to compare graphs without feature engineering? In this paper, we show that it is possible to learn representations for graph similarity with neither domain knowledge nor supervision (i.e. feature engineering or labeled graphs). We propose Deep Divergence Graph Kernels, an unsupervised method for learning representations over graphs that encodes a relaxed notion of graph isomorphism. Our method consists of three parts. First, we learn an encoder for each anchor graph to capture its structure. Second, for each pair of graphs, we train a cross-graph attention network which uses the node representations of an anchor graph to reconstruct another graph. This approach, which we call isomorphism attention, captures how well the representations of one graph can encode another. We use the attention-augmented encoder's predictions to define a divergence score for each pair of graphs. Finally, we construct an embedding space for all graphs using these pair-wise divergence scores.
Unlike previous work, much of which relies on 1) supervision, 2) domain specific knowledge (e.g. a reliance on Weisfeiler-Lehman kernels), and 3) known node alignment, our unsupervised method jointly learns node representations, graph representations, and an attention-based alignment between graphs.
Our experimental results show that Deep Divergence Graph Kernels can learn an unsupervised alignment between graphs, and that the learned representations achieve competitive results when used as features on a number of challenging graph classification tasks. Furthermore, we illustrate how the learned attention allows insight into the alignment of sub-structures across graphs.
With the ever-increasing cases of hate spread on social media platforms, it is critical to design abuse detection mechanisms to proactively avoid and control such incidents. While there exist methods for hate speech detection, they stereotype words and hence suffer from inherently biased training. Bias removal has been traditionally studied for structured datasets, but we aim at bias mitigation from unstructured text data.
In this paper, we make two important contributions. First, we systematically design methods to quantify the bias for any model and propose algorithms for identifying the set of words which the model stereotypes. Second, we propose novel methods leveraging knowledge-based generalizations for bias-free learning.
Knowledge-based generalizations provide an effective way to encode knowledge because the abstractions they introduce not only generalize content but also facilitate the retraction of information from the hate speech detection classifier, thereby reducing the imbalance. We experiment with multiple knowledge generalization policies and analyze their effect on general performance and in mitigating bias. Our experiments with two real-world datasets, a Wikipedia Talk Pages dataset (WikiDetox) of size ~96k and a Twitter dataset of size ~24k, show that the use of knowledge-based generalizations results in better performance by forcing the classifier to learn from generalized content. Our methods utilize existing knowledge bases and can easily be extended to other tasks.
Product bundling, offering a combination of items to customers, is one of the marketing strategies commonly used by online e-commerce and offline retailers. A high-quality bundle generalizes frequent items of interest, and diversity across bundles boosts the user experience and eventually increases transaction volume. In this paper, we formalize personalized bundle list recommendation as a structured prediction problem and propose a bundle generation network (BGN), which decomposes the problem into quality and diversity parts via determinantal point processes (DPPs). BGN uses a typical encoder-decoder framework with a proposed feature-aware softmax to alleviate the inadequate representation of the traditional softmax, and integrates masked beam search and DPP selection to produce high-quality and diversified bundle lists with appropriate bundle sizes. We conduct extensive experiments on three public datasets and one industrial dataset, including two generated from co-purchase records and two extracted from real-world online bundle services. BGN significantly outperforms the state-of-the-art methods in terms of quality, diversity and response time over all datasets. In particular, BGN improves the precision of the best competitors by 16% on average while maintaining the highest diversity on four datasets, and yields a 3.85x improvement in response time over the best competitors in the bundle list recommendation problem.
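The quality/diversity decomposition can be illustrated with a greedy DPP selection over candidate bundles. The sketch below uses the standard quality-weighted-similarity L-ensemble kernel and greedy log-determinant maximization as stand-ins; BGN's exact kernel, feature-aware softmax, and masked beam search are not reproduced here.

```python
import numpy as np

def greedy_dpp_select(quality, similarity, k):
    """Greedy MAP selection from a DPP with kernel L_ij = q_i * S_ij * q_j."""
    L = quality[:, None] * similarity * quality[None, :]
    selected = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in range(len(quality)):
            if i in selected:
                continue
            idx = selected + [i]
            # Log-determinant of the selected submatrix rewards high quality
            # and penalizes redundancy among the chosen bundles.
            gain = np.linalg.slogdet(L[np.ix_(idx, idx)])[1]
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
    return selected

# Toy candidate bundles: quality scores and pairwise similarities (assumed).
q = np.array([0.90, 0.85, 0.80, 0.40])
S = np.array([[1.0, 0.9, 0.2, 0.1],
              [0.9, 1.0, 0.3, 0.1],
              [0.2, 0.3, 1.0, 0.2],
              [0.1, 0.1, 0.2, 1.0]])
print(greedy_dpp_select(q, S, k=2))  # [0, 2]: skips bundle 1, a near-duplicate of 0
```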
Clinical notes contain detailed information about the health status of patients for each of their encounters with a health system. Developing effective models to automatically assign medical codes to clinical notes has been a long-standing active research area. Despite great recent progress in medical informatics fueled by deep learning, it is still a challenge to find the specific piece of evidence in a clinical note which justifies a particular medical code out of all possible codes. Considering the large number of online disease knowledge sources, which contain detailed information about signs and symptoms of different diseases, their risk factors, and epidemiology, there is an opportunity to exploit such sources. In this paper, we consider Wikipedia as an external knowledge source and propose Knowledge Source Integration (KSI), a novel end-to-end code assignment framework, which can integrate external knowledge during training of any baseline deep learning model. The main idea of KSI is to calculate matching scores between a clinical note and disease-related Wikipedia documents, and combine the scores with the output of the baseline model. To evaluate KSI, we experimented with automatic assignment of ICD-9 diagnosis codes to emergency department clinical notes from the MIMIC-III data set, aided by Wikipedia documents corresponding to the ICD-9 codes. We evaluated several baseline models, ranging from logistic regression to recently proposed deep learning models known to achieve state-of-the-art accuracy on clinical notes. The results show that KSI consistently improves the baseline models and that it is particularly successful in the assignment of rare codes. In addition, by analyzing the weights of KSI models, we can gain an understanding of which words in Wikipedia documents provide useful information for predictions.
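A minimal sketch of the integration idea, under simplifying assumptions: a crude token-overlap matching score between the note and a code's Wikipedia page shifts the baseline model's logit for that code. KSI itself learns the matching and the combination end to end; the `alpha` weight and overlap score below are purely illustrative.

```python
import numpy as np

def match_score(note_tokens, wiki_tokens):
    """Crude token-overlap score between a clinical note and a code's Wikipedia page."""
    note, wiki = set(note_tokens), set(wiki_tokens)
    return len(note & wiki) / (len(wiki) + 1e-9)

def ksi_probability(baseline_logit, note_tokens, wiki_tokens, alpha=2.0):
    """Sigmoid of the baseline logit shifted by the knowledge matching score.

    `alpha` is an assumed fixed weight; KSI learns the combination instead.
    """
    z = baseline_logit + alpha * match_score(note_tokens, wiki_tokens)
    return 1.0 / (1.0 + np.exp(-z))

note = "patient presents with chest pain and shortness of breath".split()
wiki_mi = "myocardial infarction chest pain shortness of breath sweating".split()
print(round(ksi_probability(-0.5, note, wiki_mi), 3))
```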
Many approaches focus on detecting dense blocks in the tensor of multimodal data to prevent fraudulent entities (e.g., accounts, links) from engaging in retweet boosting, hashtag hijacking, link advertising, etc. However, no existing method is effective at finding a dense block if it only possesses high density on a subset of all dimensions of the tensor. In this paper, we connect dense-block detection with dense-subgraph mining by modeling a tensor as a weighted graph without losing any density information. Based on this weighted graph, which we call the information sharing graph (ISG), we propose D-Spot, an algorithm for finding multiple densest subgraphs that is faster (up to 11x faster than the state-of-the-art algorithm) and can be computed in parallel. In an N-dimensional tensor, the entity group found by ISG+D-Spot is at least 1/2 of the optimum with respect to density, compared with the 1/N guarantee ensured by competing methods. We use nine datasets to demonstrate that ISG+D-Spot establishes a new state of the art in dense-block detection accuracy, specifically for fraud detection.
Data collection often involves the partial measurement of a larger system. A common example arises in collecting network data: we often obtain network datasets by recording all of the interactions among a small set of core nodes, so that we end up with a measurement of the network consisting of these core nodes along with a potentially much larger set of fringe nodes that have links to the core. Given the ubiquity of this process for assembling network data, it is crucial to understand the role of such a “core-fringe” structure.
Here we study how the inclusion of fringe nodes affects the standard task of network link prediction. One might initially think the inclusion of any additional data is useful, and hence that it should be beneficial to include all fringe nodes that are available. However, we find that this is not true; in fact, there is substantial variability in the value of the fringe nodes for prediction. Once an algorithm is selected, in some datasets, including any additional data from the fringe can actually hurt prediction performance; in other datasets, including some amount of fringe information is useful before prediction performance saturates or even declines; and in further cases, including the entire fringe leads to the best performance. While such variety might seem surprising, we show that these behaviors are exhibited by simple random graph models.
Tor hidden services allow offering and accessing various Internet resources while guaranteeing a high degree of provider and user anonymity. So far, most research work on the Tor network has aimed at discovering protocol vulnerabilities to de-anonymize users and services. Other work has aimed at estimating the number of available hidden services and classifying them. Something that still remains largely unknown is the structure of the graph defined by the network of Tor services. In this paper, we describe the topology of the Tor graph (aggregated at the hidden service level), measuring both global and local properties by means of well-known metrics. We consider three different snapshots obtained by extensively crawling Tor three times over a 5-month time frame. We separately study these three graphs and their shared “stable” core. In doing so, besides assessing the renowned volatility of Tor hidden services, we make it possible to distinguish time-dependent and structural aspects of the Tor graph. Our findings show that, among other things, the graph of Tor hidden services presents some of the characteristics of social and surface web graphs, along with a few unique peculiarities, such as a very high percentage of nodes having no outbound links.
Despite being vast repositories of factual information, cross-domain knowledge graphs, such as Wikidata and the Google Knowledge Graph, only sparsely provide short synoptic descriptions for entities. Such descriptions, which briefly identify the most discernible features of an entity, provide readers with a near-instantaneous understanding of what kind of entity they are being presented with. They can also aid in tasks such as named entity disambiguation, ontological type determination, and answering entity queries. Given the rapidly increasing number of entities in knowledge graphs, a fully automated synthesis of succinct textual descriptions from underlying factual information is essential. To this end, we propose a novel fact-to-sequence encoder-decoder model with a suitable copy mechanism to generate concise and precise textual descriptions of entities. In an in-depth evaluation, we demonstrate that our method significantly outperforms state-of-the-art alternatives.
This paper provides an in-depth and diversified analysis of the Wikidata query logs, which were recently made publicly available. Although the usage of Wikidata queries has been the object of recent studies, our analysis of the query traffic reveals interesting and unforeseen findings concerning the usage, types of recursion, and the shape classification of complex recursive queries. Wikidata-specific features combined with recursion let us identify a significant subset of the entire corpus that can be used by the community for further assessment. We consider and analyze the queries across many different dimensions, such as robotic versus organic queries, the presence or absence of constants, and correctly executed versus timed-out queries. A further investigation that we pursue in this paper is to find, given a query, a number of queries structurally similar to the given query. We provide a thorough characterization of the queries in terms of their expressive power, their topological structure and shape, along with a deeper understanding of the usage of recursion in these logs. We make the code for the analysis available as open source.
The spread of content produced by fake news publishers was one of the most discussed characteristics of the 2016 U.S. Presidential Election. Yet, little is known about the prevalence and focus of such content, how its prevalence changed over time, and how this prevalence related to important election dynamics. In this paper, we address these questions using tweets that mention the two presidential candidates sampled at the daily level, the news content mentioned in such tweets, and open-ended responses from nationally representative telephone interviews. The results of our analysis highlight various important lessons for news consumers and journalists. We find that (i.) traditional news producers outperformed fake news producers in aggregate, (ii.) the prevalence of content produced by fake news publishers increased over the course of the campaign, particularly among tweets that mentioned Clinton, and (iii.) changes in such prevalence closely followed changes in net Clinton favorability. Turning to content, we (iv.) identify similarities and differences in agenda setting by fake and traditional news media and show that (v.) the information individuals most commonly reported having read, seen or heard about the candidates was more closely aligned with content produced by fake news outlets than traditional news outlets, in particular for information Republican voters retained about Clinton. We also model the fake-ness of retained information as a function of demographic characteristics. Implications for platform owners, news consumers, and journalists are discussed.
Incorporating a knowledge graph (KG) into a recommender system is promising for improving recommendation accuracy and explainability. However, existing methods largely assume that a KG is complete and simply transfer the “knowledge” in the KG at the shallow level of entity raw data or embeddings. This may lead to suboptimal performance, since a practical KG can hardly be complete, and it is common that a KG has missing facts, relations, and entities. Thus, we argue that it is crucial to consider the incomplete nature of KGs when incorporating them into recommender systems.
In this paper, we jointly learn the models of recommendation and knowledge graph completion. Distinct from previous KG-based recommendation methods, we transfer the relation information in the KG so as to understand the reasons why a user likes an item. For example, if a user has watched several movies directed by (relation) the same person (entity), we can infer that the director relation plays a critical role when the user makes the decision, thus helping to understand the user's preference at a finer granularity.
Technically, we contribute a new translation-based recommendation model, which specially accounts for various preferences in translating a user to an item, and then jointly train it with a KG completion model by combining several transfer schemes. Extensive experiments on two benchmark datasets show that our method outperforms state-of-the-art KG-based recommendation methods. Further analysis verifies the positive effect of joint training on both tasks of recommendation and KG completion, and the advantage of our model in understanding user preference. We publish our project at https://github.com/TaoMiner/joint-kg-recommender.
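For intuition, a translation-based scoring function of the kind such models build on can be sketched as follows; the TransE-style L2 score and shared entity embeddings are standard assumptions rather than the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16
user = rng.normal(size=dim)   # user embedding
item = rng.normal(size=dim)   # item embedding (also an entity in the KG)
pref = rng.normal(size=dim)   # latent preference / relation vector

def translation_score(head, relation, tail):
    """Higher is better: negative L2 distance of (head + relation) from tail."""
    return -np.linalg.norm(head + relation - tail)

# Recommendation triple (user, preference, item) ...
print(translation_score(user, pref, item))
# ... and a KG triple such as (movie, directed_by, person) would be scored with the
# same function, so joint training shares gradients through item/entity embeddings.
```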
Enriching the content of news articles with auxiliary resources is a technique often employed by online news services to keep articles up-to-date and thereby increase users' engagement. We address the task of enriching news articles with related search queries, which are extracted from a search engine's query log. Clicking on a recommended query invokes a search session that allows the user to further explore content related to the article. We present a three-phase retrieval framework for query recommendation that incorporates various article-dependent and article-independent relevance signals. Evaluation based on an offline experiment, performed using annotations by professional editors, and a large-scale online experiment, conducted with real users, demonstrates the merits of our approach. In addition, a comprehensive analysis of our online experiment reveals interesting characteristics of the type of queries users tend to click and the nature of their interaction with the resultant search engine results page.
We revisit the opinion susceptibility problem that was proposed by Abebe et al. [1], in which agents influence one another's opinions through an iterative process. Each agent has some fixed innate opinion. In each step, the opinion of an agent is updated to some convex combination between its innate opinion and the weighted average of its neighbors' opinions in the previous step. The resistance of an agent measures the importance it places on its innate opinion in the above convex combination. Under non-trivial conditions, this iterative process converges to some equilibrium opinion vector. For the unbudgeted variant of the problem, the goal is to select the resistance of each agent (from some given range) such that the sum of the equilibrium opinions is minimized.
Contrary to the claim in the aforementioned KDD 2018 paper, the objective function is in general non-convex. Hence, formulating the problem as a convex program might have potential correctness issues. We instead analyze the structure of the objective function, and show that any local optimum is also a global optimum, which is somewhat surprising given that the objective function might not be convex. Furthermore, we combine the iterative process and the local search paradigm to design very efficient algorithms that can solve the unbudgeted variant of the problem optimally on large-scale graphs containing millions of nodes.
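For readers unfamiliar with the underlying dynamics, the sketch below iterates the opinion update described above to (approximately) reach the equilibrium whose sum the unbudgeted variant minimizes over the resistance values; the toy graph and parameter values are illustrative.

```python
import numpy as np

def equilibrium_opinions(alpha, s, W, iters=500):
    """Iterate x <- alpha*s + (1 - alpha)*(W @ x) to (approximate) convergence."""
    x = s.copy()
    for _ in range(iters):
        x = alpha * s + (1.0 - alpha) * (W @ x)
    return x

s = np.array([0.9, 0.1, 0.4])       # innate opinions
W = np.array([[0.0, 0.5, 0.5],      # row-stochastic neighbor weights
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
alpha = np.array([0.7, 0.2, 0.5])   # resistance of each agent, in (0, 1]

x_star = equilibrium_opinions(alpha, s, W)
print(x_star, x_star.sum())         # the unbudgeted problem minimizes this sum over alpha
```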
Community norm violations can impair constructive communication and collaboration online. As a defense mechanism, community moderators often address such transgressions by temporarily blocking the perpetrator. Such actions, however, come with the cost of potentially alienating community members. Given this tradeoff, it is essential to understand to what extent, and in which situations, this common moderation practice is effective in reinforcing community rules.
In this work, we introduce a computational framework for studying the future behavior of blocked users on Wikipedia. After their block expires, they can take several distinct paths: they can reform and adhere to the rules, but they can also recidivate, or straight-out abandon the community. We reveal that these trajectories are tied to factors rooted both in the characteristics of the blocked individual and in whether they perceived the block to be fair and justified. Based on these insights, we formulate a series of prediction tasks aiming to determine which of these paths a user is likely to take after being blocked for their first offense, and demonstrate the feasibility of these new tasks. Overall, this work builds towards a more nuanced approach to moderation by highlighting the tradeoffs that are in play.
We consider the problem of regulating products with negative externalities to a third party that is neither the buyer nor the seller, but where both the buyer and seller can take steps to mitigate the externality. The motivating example to have in mind is the sale of Internet-of-Things (IoT) devices, many of which have historically been compromised for DDoS attacks that disrupted Internet-wide services such as Twitter [5, 26]. Neither the buyer (i.e., consumers) nor seller (i.e., IoT manufacturers) was known to suffer from the attack, but both have the power to expend effort to secure their devices. We consider a regulator who regulates payments (via fines if the device is compromised, or market prices directly), or the product directly via mandatory security requirements.
Both regulations come at a cost: implementing security requirements increases production costs, and the existence of fines decreases consumers' values, thereby reducing the seller's profits. The focus of this paper is to understand the efficiency of various regulatory policies. That is, policy A is more efficient than policy B if A more successfully minimizes negative externalities while A and B reduce the seller's profits equally.
We develop a simple model to capture the impact of regulatory policies on a buyer's behavior. In this model, we show that for homogeneous markets, where the buyer's ability to follow security practices is always high or always low, the optimal (externality-minimizing for a given profit constraint) regulatory policy needs to regulate only payments or production. In arbitrary markets, by contrast, we show that while the optimal policy may require regulating both aspects, there is always an approximately optimal policy which regulates just one.
Online advertising is one of the primary sources of funding for content, services, and applications on both web and mobile platforms. Mobile in-app advertisements are implemented on top of existing web technologies with the same ad-serving model (i.e., users - publishers - ad networks - advertisers). Even so, in-app advertising is different from traditional web advertising. For example, malicious mobile app developers can generate fraudulent ad clicks in an automated fashion, but malicious web publishers have to leverage bots to launch click fraud. In spite of using the same underlying web infrastructure, these ad threats behave differently on different platforms.
Existing works have separately studied click fraud and malvertising in mobile settings. However, it is not known whether there is a strong relationship between these two dominant threats. In this paper, we develop an ad collection framework, MAdLife, on Android to capture all in-app ad traffic generated during each ad's entire lifespan. We revisit both threats in a fine-grained manner with MAdLife to determine the relationship between them. Furthermore, MAdLife also allows us to explore other threats related to landing pages.
We analyzed 5.7K Android apps crawled from the Google Play store, and collected 83K ads and their landing pages with MAdLife. 58K of these ads land on a web page, similar to traditional web ads. We discovered 37 click-fraud apps, and revealed that 1.49% of the 58K ads are related to malvertising. We also found a strong correlation between fraudulent apps and malicious ads. Specifically, over 15.44% of malicious ads originate from the fraudulent apps. Conversely, 18.36% of the ads displayed in the fraudulent apps are malicious, compared to only 1.28% found in the remaining apps. Due to fraudulent apps, users are much more (14x) likely to encounter malvertising ads. Additionally, we discovered that 243 popular JavaScript snippets, embedded in over 10% of the landing pages, are malicious. Finally, we also present the first analysis of inappropriate mobile in-app ads.
Modeling the behaviors of drug-target-disease interactions is crucial in the early stage of drug discovery and holds great promise for precision medicine and personalized treatments. The growing availability of new types of data on the internet brings a great opportunity to learn a more comprehensive relationship among drugs, targets, and diseases. However, existing methods often consider drug-target interactions or drug-disease interactions separately, which ignores the dependencies among these three entities. Also, many of them cannot directly incorporate rich heterogeneous information from diverse sources.
In this work, we investigate the utility of tensor factorization for modeling drug-target-disease relationships, specifically leveraging different types of online data. Our motivation is two-fold. First, in human metabolic systems, many drugs interact with protein targets in cells to modulate target activities, which in turn alter biological pathways to promote healthy functions and to treat diseases. Instead of binary relationships of <drug, disease> or <drug, target>, a tighter triple relationship <drug, target, disease> should be exploited to better understand drug mechanisms of action (MoAs). Second, medical data can be collected from different sources (e.g., a drug's chemical structure, a target's sequence, or expression measurements). Therefore, effectively exploiting the complementarity among multiple sources is of great importance. Our method elegantly explores a <drug, target, disease> tensor together with the complementarity among different data sources, thus improving prediction accuracy. We achieve this goal by formulating the problem as a coupled tensor-matrix factorization and directly optimizing it on a nonlinear manifold. Experimental results on real-world datasets show that the proposed model outperforms several competitive methods. Our model opens up opportunities to use large Web data to predict drugs' MoAs in pharmacological studies.
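A minimal sketch of a coupled tensor-matrix objective of this kind is shown below: a CP factorization of the <drug, target, disease> tensor whose drug factor also reconstructs a drug-feature matrix. The shapes, rank, squared loss, and plain Euclidean (rather than manifold) formulation are simplifying assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_drug, n_target, n_disease, n_feat, rank = 8, 6, 5, 10, 3

X = rng.random((n_drug, n_target, n_disease))  # observed <drug, target, disease> tensor
F = rng.random((n_drug, n_feat))               # drug side information (e.g., fingerprints)

A = rng.random((n_drug, rank))                 # drug factors (shared across both terms)
B = rng.random((n_target, rank))               # target factors
C = rng.random((n_disease, rank))              # disease factors
V = rng.random((n_feat, rank))                 # feature loadings

def coupled_loss(X, F, A, B, C, V, mu=0.5):
    # CP reconstruction: X_hat[i, j, k] = sum_r A[i, r] * B[j, r] * C[k, r]
    X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)
    tensor_term = np.sum((X - X_hat) ** 2)
    matrix_term = np.sum((F - A @ V.T) ** 2)   # coupling through the shared drug factor A
    return tensor_term + mu * matrix_term

print(coupled_loss(X, F, A, B, C, V))  # a solver would minimize this over A, B, C, V
```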
Recommendation from implicit feedback is a highly challenging task due to the lack of reliable negative feedback data. Only positive feedback is observed, and the unobserved feedback can be attributed to two reasons: unknown or dislike. Existing methods address this challenge by treating all the unobserved data as negative (dislike) but downweighting the confidence of these data. However, this treatment causes two problems: (1) Confidence weights of the unobserved data are usually assigned manually, which lacks flexibility and may create empirical bias in evaluating users' preferences. (2) To handle the massive volume of unobserved feedback data, most of the existing methods rely on stochastic inference and data sampling strategies. However, since users are only aware of a very small fraction of items in a large dataset, it is difficult for existing samplers to select informative training instances in which the user really dislikes the item rather than does not know it.
To address the above two problems, we propose a new recommendation method, SamWalker, that leverages social information to infer data confidence and guide the sampling process. By modeling data confidence with a social context-aware function, SamWalker can adaptively assign different weights to different data based on users' social contexts. Further, a personalized random-walk-based sampling strategy is developed to adaptively draw informative training instances, which can speed up gradient estimation and reduce sampling variance. Extensive experiments on three real-world datasets demonstrate the superiority of the proposed SamWalker method and its sampling strategy.
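As a rough illustration of the sampling idea (not the authors' exact sampler), the sketch below walks a toy social graph from a user and proposes unconsumed items reached through the walk as informative negatives; the graph layout, restart probability, and walk length are assumptions.

```python
import random

# Toy social graph and per-user consumed items (assumed data).
social = {"u1": ["u2", "u3"], "u2": ["u1", "u4"], "u3": ["u1"], "u4": ["u2"]}
consumed = {"u1": {"i1"}, "u2": {"i2", "i3"}, "u3": {"i4"}, "u4": {"i5"}}

def sample_negative(user, walk_len=3, restart=0.3, seed=None):
    """Random walk with restart over the social graph; items consumed by the
    walk's end user but not by `user` are plausible 'seen but disliked' negatives."""
    rng = random.Random(seed)
    current = user
    for _ in range(walk_len):
        if rng.random() < restart or not social[current]:
            current = user            # restart at the target user
        else:
            current = rng.choice(social[current])
    candidates = sorted(consumed[current] - consumed[user])
    return rng.choice(candidates) if candidates else None

print([sample_negative("u1", seed=s) for s in range(5)])
```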
Recommendation serendipity is increasingly recognized as being as important as other beyond-accuracy objectives (such as novelty and diversity) in eliminating the “filter bubble” phenomenon of traditional recommender systems. However, little work has empirically verified the effects of serendipity on increasing user satisfaction and behavioral intention. In this paper, we report the results of a large-scale user survey (involving over 3,000 users) conducted in an industrial mobile e-commerce setting. The study has identified significant causal relationships from novelty, unexpectedness, relevance, and timeliness to serendipity, and from serendipity to user satisfaction and purchase intention. Moreover, our findings reveal that user curiosity plays a moderating role in strengthening the relationships from novelty to serendipity and from serendipity to satisfaction. Our third contribution lies in the comparison of several recommender algorithms, which demonstrates the significant improvements of the serendipity-oriented algorithm over the relevance- and novelty-oriented approaches in terms of user perceptions. We finally discuss the implications of this experiment, which include the feasibility of developing a more precise metric for measuring recommendation serendipity, and the potential benefit of a curiosity-based personalized serendipity strategy for recommender systems.
Sentiment classification typically relies on a large amount of labeled data. In practice, the availability of labels is highly imbalanced among different languages, e.g., more English texts are labeled than texts in any other languages, which creates a considerable inequality in the quality of related information services received by users speaking different languages. To tackle this problem, cross-lingual sentiment classification approaches aim to transfer knowledge learned from one language that has abundant labeled examples (i.e., the source language, usually English) to another language with fewer labels (i.e., the target language). The source and the target languages are usually bridged through off-the-shelf machine translation tools. Through such a channel, cross-language sentiment patterns can be successfully learned from English and transferred into the target languages. This approach, however, often fails to capture sentiment knowledge specific to the target language, and thus compromises the accuracy of the downstream classification task. In this paper, we employ emojis, which are widely available in many languages, as a new channel to learn both the cross-language and the language-specific sentiment patterns. We propose a novel representation learning method that uses emoji prediction as an instrument to learn respective sentiment-aware representations for each language. The learned representations are then integrated to facilitate cross-lingual sentiment classification. The proposed method demonstrates state-of-the-art performance on benchmark datasets, which is sustained even when sentiment labels are scarce.
Graph smoothing methods are an extremely popular family of approaches for semi-supervised learning. The choice of graph used to represent relationships in these learning problems is often a more important decision than the particular algorithm or loss function used, yet this choice is less well-studied in the literature. In this work, we demonstrate that for social networks, the basic friendship graph itself may often not be the appropriate graph for predicting node attributes using graph smoothing. More specifically, standard graph smoothing is designed to harness the social phenomenon of homophily whereby individuals are similar to “the company they keep.” We present a decoupled approach to graph smoothing that decouples notions of “identity” and “preference,” resulting in an alternative social phenomenon of monophily whereby individuals are similar to “the company they're kept in,” as observed in recent empirical work. Our model results in a rigorous extension of the Gaussian Markov Random Field (GMRF) models that underlie graph smoothing, interpretable as smoothing on an appropriate auxiliary graph of weighted or unweighted two-hop relationships.
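A small sketch contrasting the two smoothing choices, under simplifying assumptions (a plain iterative-averaging smoother rather than the full GMRF machinery): the same labels are propagated once over the friendship graph and once over a weighted two-hop graph.

```python
import numpy as np

def smooth(adjacency, labels, observed_mask, iters=50):
    """Iteratively replace unobserved labels with the weighted neighbor mean."""
    deg = adjacency.sum(axis=1)
    deg[deg == 0] = 1.0
    x = labels.copy()
    for _ in range(iters):
        x_new = (adjacency @ x) / deg
        x = np.where(observed_mask, labels, x_new)  # clamp observed labels
    return x

# Friendship graph: edges 0-1, 0-2, 1-3, 2-3 (a 4-cycle).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
A2 = A @ A                     # weighted two-hop ("company they're kept in") graph
np.fill_diagonal(A2, 0.0)      # drop self-loops

y = np.array([1.0, 0.0, 0.0, 0.0])            # only node 0's attribute is observed
mask = np.array([True, False, False, False])
print(smooth(A, y, mask))      # one-hop smoothing spreads the label toward everyone
print(smooth(A2, y, mask))     # two-hop smoothing singles out node 3, which shares node 0's company
```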
Recently, data mining through analyzing the complex structure and diverse relationships of multi-networks has attracted much attention in both academia and industry. One crucial prerequisite for this kind of multi-network mining is to map the nodes across different networks, i.e., so-called network alignment. In this paper, we propose a cross-network embedding method, CrossMNA, for the multi-network alignment problem, using structural information only. Unlike previous methods, which focus on pair-wise learning and hold the topology-consistency assumption, our proposed CrossMNA considers multi-network scenarios which involve at least two types of networks with diverse network structures. CrossMNA leverages cross-network information to refine two types of node embedding vectors: an inter-vector for network alignment and an intra-vector for other downstream network analysis tasks. Finally, we verify the effectiveness and efficiency of our proposed method using several real-world datasets. The extensive experiments show that CrossMNA can significantly outperform the existing baseline methods on the multi-network alignment task, and also achieve better performance on the link prediction task with less memory usage.
We present a method for implementing shrinkage of treatment effect estimators, and hence improving their precision, via experiment splitting. Experiment splitting reduces shrinkage to a standard prediction problem. The method makes minimal distributional assumptions, and allows the degree of shrinkage in one metric to depend on other metrics. Using a dataset of 226 Facebook News Feed A/B tests, we show that a lasso estimator based on repeated experiment splitting has a 44% lower mean squared predictive error than the conventional, unshrunk treatment effect estimator, an 18% lower mean squared predictive error than the James-Stein shrinkage estimator, and would lead to substantially improved launch decisions over both.
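A minimal, simulated sketch of experiment splitting as a prediction problem is given below (not the exact pipeline used on the Facebook data): treatment effects are estimated on two independent halves of each experiment, and a lasso regression from one half's estimates to the other's supplies the shrinkage map.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
n_experiments, n_units = 200, 400

true_effects = rng.normal(0.0, 0.2, n_experiments)   # mostly small true effects
half_a, half_b = [], []
for tau in true_effects:
    # Noisy difference-in-means estimate from each (independent) half.
    half_a.append((tau + rng.normal(0, 2.0, n_units)).mean()
                  - rng.normal(0, 2.0, n_units).mean())
    half_b.append((tau + rng.normal(0, 2.0, n_units)).mean()
                  - rng.normal(0, 2.0, n_units).mean())
half_a, half_b = np.array(half_a), np.array(half_b)

# Predicting half B's estimate from half A's estimate learns a shrinkage map.
model = Lasso(alpha=0.01).fit(half_a.reshape(-1, 1), half_b)
shrunk = model.predict(half_a.reshape(-1, 1))

print("unshrunk MSE:", round(float(np.mean((half_a - true_effects) ** 2)), 4))
print("shrunk MSE:  ", round(float(np.mean((shrunk - true_effects) ** 2)), 4))
```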
This paper introduces an active-learning-based truth estimator for social networks, such as Twitter, that enhances estimation accuracy significantly by requesting a well-selected (small) fraction of data to be labeled. Data assessment and truth discovery from arbitrary open online sources are a hard problem due to uncertainty regarding source reliability. Multiple truth-finding systems have been developed to solve this problem. Their accuracy is limited by the noisy nature of the data, where distortions, fabrications, omissions, and duplication are introduced. This paper presents a semi-supervised truth estimator for social networks, in which a portion of the inputs are carefully selected to be reliably verified. The challenge is to find the subset of observations to verify that would maximally enhance the overall fact-finding accuracy. This work extends previous passive approaches to recursive truth estimation, as well as semi-supervised approaches where the estimator has no control over the choice of data to be labeled. Results show that by optimally selecting claims to be verified, we improve estimation accuracy by 12% over the unsupervised baseline, and by 5% over previous semi-supervised approaches.
With the rapid development of the fashion market, customers' demand for fashion recommendation is rising. In this paper, we aim to investigate a practical problem of fashion recommendation by answering the question “which item should we select to match the given fashion items and form a compatible outfit”. The key to this problem is to estimate outfit compatibility. Previous works, which focus on the compatibility of two items or represent an outfit as a sequence, fail to make full use of the complex relations among items in an outfit. To remedy this, we propose to represent an outfit as a graph. In particular, we construct a Fashion Graph, where each node represents a category and each edge represents the interaction between two categories. Accordingly, each outfit can be represented as a subgraph by putting items into their corresponding category nodes. To infer outfit compatibility from such a graph, we propose Node-wise Graph Neural Networks (NGNN), which can better model node interactions and learn better node representations. In NGNN, the node interaction on each edge is different, determined by parameters correlated to the two connected nodes. An attention mechanism is utilized to calculate the outfit compatibility score with the learned node representations. NGNN can be used to model outfit compatibility not only from the visual or textual modality but also from multiple modalities. We conduct experiments on two tasks: (1) Fill-in-the-blank: suggesting an item that matches the existing components of an outfit; (2) Compatibility prediction: predicting the compatibility scores of given outfits. Experimental results demonstrate the clear superiority of our proposed method over others.
The proliferation of online communities has created exciting opportunities to study the mechanisms that explain group success. While a growing body of research investigates community success through a single measure - typically, the number of members - we argue that there are multiple ways of measuring success. Here, we present a systematic study to understand the relations between these success definitions and test how well they can be predicted based on community properties and behaviors from the earliest period of a community's lifetime. We identify four success measures that are desirable for most communities: (i) growth in the number of members; (ii) retention of members; (iii) long-term survival of the community; and (iv) volume of activities within the community. Surprisingly, we find that our measures do not exhibit very high correlations, suggesting that they capture different types of success. Additionally, we find that different success measures are predicted by different attributes of online communities, suggesting that success can be achieved through different behaviors. Our work sheds light on what success represents in online communities and what predicts it. Our results suggest that success is multi-faceted and cannot be measured or predicted by a single metric. This insight has practical implications for the creation of new online communities and the design of platforms that facilitate such communities.
Network embedding is the task of learning continuous node representations for networks, which has been shown to be effective in a variety of tasks such as link prediction and node classification. Most existing works aim to preserve different network structures and properties in low-dimensional embedding vectors, while neglecting the existence of noisy information in many real-world networks and the overfitting issue in the embedding learning process. Most recently, generative adversarial network (GAN)-based regularization methods have been exploited to regularize the embedding learning process, encouraging a global smoothness of embedding vectors. These methods have very complicated architectures and suffer from the well-recognized non-convergence problem of GANs. In this paper, we aim to introduce a more succinct and effective local regularization method, namely adversarial training, to network embedding so as to achieve model robustness and better generalization performance. Firstly, the adversarial training method is applied by defining adversarial perturbations in the embedding space with an adaptive L2 norm constraint that depends on the connectivity pattern of node pairs. Though effective as a regularizer, it suffers from an interpretability issue which may hinder its application in certain real-world scenarios. To improve this strategy, we further propose an interpretable adversarial training method by enforcing the reconstruction of the adversarial examples in the discrete graph domain. These two regularization methods can be applied to many existing embedding models, and we take DeepWalk as the base model for illustration in this paper. Empirical evaluations on both link prediction and node classification demonstrate the effectiveness of the proposed methods.
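To illustrate the first (non-interpretable) regularizer, the sketch below perturbs a node embedding in the loss-increasing direction within an L2 ball and evaluates the loss at the perturbed point; the simple edge-level logistic loss and the fixed epsilon are assumptions, whereas the paper adapts the constraint to the connectivity of each node pair.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def edge_loss(u, v):
    """Negative log-likelihood of observing edge (u, v)."""
    return -np.log(sigmoid(u @ v) + 1e-12)

def adversarial_loss(u, v, eps=0.5):
    # Gradient of the edge loss with respect to u (closed form for this loss).
    grad_u = -(1.0 - sigmoid(u @ v)) * v
    # Worst-case perturbation inside an L2 ball of radius eps (fixed here;
    # the paper adapts it to the connectivity of the node pair).
    delta = eps * grad_u / (np.linalg.norm(grad_u) + 1e-12)
    return edge_loss(u + delta, v)

rng = np.random.default_rng(0)
u, v = rng.normal(size=8), rng.normal(size=8)
print(round(edge_loss(u, v), 3), round(adversarial_loss(u, v), 3))
# Training would minimize edge_loss + lambda * adversarial_loss over all edges.
```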
Finding diagrams that contain a specific part or a similar part is important in many engineering tasks. In this search task, the query part is expected to match only a small region in a complex image. This paper investigates several local matching networks that explicitly model local region-to-region similarities. Deep convolutional neural networks extract local features and model local matching patterns. Spatial convolution is employed to cross-match local regions at different scale levels, addressing cases where the target part appears at a different scale, position, and/or angle. A gating network automatically learns region importance, removing noise from sparse areas and visual metadata in engineering diagrams.
Experimental results show that local matching approaches are more effective for engineering diagram search than global matching approaches. Suppressing unimportant regions via the gating network enhances accuracy. Matching across different scales via spatial convolution substantially improves robustness to scale and rotation changes. A pipelined architecture efficiently searches a large collection of diagrams by using a simple local matching network to identify a small set of candidate images and a more sophisticated network with convolutional cross-scale matching to re-rank candidates.
We study how to evaluate Anti-Fingerprinting Privacy Enhancing Technologies (AFPETs). Experimental methods have the advantage of control and precision, and can be applied to new AFPETs that currently lack a user base. Observational methods have the advantage of scale and of drawing from the browsers currently in real-world use. We propose a novel combination of these methods, offering the best of both worlds, by applying experimentally created models of an AFPET's behavior to an observational dataset. We apply our evaluation methods to a collection of AFPETs and find the Tor Browser Bundle to be the most effective among them. We further uncover inconsistencies in some AFPETs' behaviors.
Heterographic puns, which usually rely on words with similar sounds or spellings, play a critical role in human writing and literature. Recognizing heterographic puns is an important and difficult research problem because of this ambiguity. However, most existing methods for this task only focus on designing features with rule-based or machine learning methods. In this paper, we propose an end-to-end computational approach, the Pronunciation Spelling Understanding Gated Attention (PSUGA) network. For pronunciation, we exploit a hierarchical attention model with phoneme embeddings. For spelling, we consider character-level, word-level, tag-level, position-level and contextual-level embeddings with an attention model. To combine the two parts, we present a gated attention mechanism to control the information integration. We have conducted extensive experiments on the SemEval-2017 Task 7 and Pun of the Day datasets. Experimental results show that our approach significantly outperforms state-of-the-art methods.
In this paper, we study the efficacy of login challenges at preventing account takeover, as well as evaluate the amount of friction these challenges create for normal users. These secondary authentication factors, presently deployed at Google, Microsoft, and other major identity providers as part of risk-aware authentication, trigger in response to a suspicious login or account recovery attempt. Using Google as a case study, we evaluate the effectiveness of fourteen device-based, delegation-based, knowledge-based, and resource-based challenges at preventing over 350,000 real-world hijacking attempts stemming from automated bots, phishers, and targeted attackers. We show that knowledge-based challenges prevent as few as 10% of hijacking attempts rooted in phishing and 73% of automated hijacking attempts. Device-based challenges provide the best protection, blocking over 94% of hijacking attempts rooted in phishing and 100% of automated hijacking attempts. We evaluate the usability limitations of each challenge based on a sample of 1.2M legitimate users. Our results illustrate that login challenges act as an important barrier to hijacking, but that friction in the process leads to 52% of legitimate users failing to sign in, though 97% of users eventually access their account within a short period.
RNN models have achieved state-of-the-art performance in a wide range of text mining tasks. However, these models are often regarded as black boxes and are criticized for their lack of interpretability. In this paper, we enhance the interpretability of RNNs by providing interpretable rationales for RNN predictions. Nevertheless, interpreting RNNs is a challenging problem. Firstly, unlike existing methods that rely on local approximation, we aim to provide rationales that are more faithful to the decision-making process of RNN models. Secondly, a flexible interpretation method should be able to assign contribution scores to text segments of varying lengths, instead of only to individual words. To tackle these challenges, we propose a novel attribution method, called REAT, to provide interpretations for RNN predictions. REAT decomposes the final prediction of an RNN into additive contributions of each word in the input text. This additive decomposition enables REAT to further obtain phrase-level attribution scores. In addition, REAT is generally applicable to various RNN architectures, including GRU, LSTM and their bidirectional versions. Experimental results demonstrate the faithfulness and interpretability of the proposed attribution method. Comprehensive analysis shows that our attribution method can unveil the useful linguistic knowledge captured by RNNs. Further analysis demonstrates that our method can be utilized as a debugging tool to examine the vulnerability and failure reasons of RNNs, which may lead to several promising future directions for improving the generalization ability of RNNs.
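As a simplified illustration of additive decomposition (not REAT's exact derivation), the sketch below applies the output layer to each intermediate hidden state of a toy tanh RNN and credits each word with the resulting change in the logit; the contributions sum exactly to the final prediction, and phrase-level scores follow by summing contiguous words.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 6
Wx = rng.normal(0, 0.5, (d_h, d_in))
Wh = rng.normal(0, 0.5, (d_h, d_h))
w_out = rng.normal(0, 0.5, d_h)           # output layer producing a single logit

def word_contributions(embeddings):
    """Credit each word with the change in the output logit after its update."""
    h = np.zeros(d_h)
    prev_logit, contribs = 0.0, []
    for x in embeddings:                  # one word embedding per time step
        h = np.tanh(Wx @ x + Wh @ h)
        logit = float(w_out @ h)
        contribs.append(logit - prev_logit)
        prev_logit = logit
    return np.array(contribs), prev_logit

words = rng.normal(size=(5, d_in))        # a toy 5-word "sentence"
contribs, final_logit = word_contributions(words)
print(np.allclose(contribs.sum(), final_logit))  # True: contributions sum to the prediction
print(contribs.round(3))                         # per-word attribution; phrase score = sum over a span
```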
Recent interest in graph embedding methods has focused on learning a single representation for each node in the graph. But can nodes really be best described by a single vector representation? In this work, we propose a method for learning multiple representations of the nodes in a graph (e.g., the users of a social network). Based on a principled decomposition of the ego-network, each representation encodes the role of the node in a different local community in which the nodes participate. These representations allow for improved reconstruction of the nuanced relationships that occur in the graph - a phenomenon that we illustrate through state-of-the-art results on link prediction tasks on a variety of graphs, reducing the error by up to 90%. In addition, we show that these embeddings allow for effective visual analysis of the learned community structure.
Motivated by the increasing need to preserve privacy in digital devices, we introduce the on-device public-private model of computation. Our motivation comes from social-network based recommender systems in which the users want to receive recommendations based on the information available on their devices, as well as the suggestions of their social contacts, without sharing such information or contacts with the central recommendation system. Our model allows us to solve many algorithmic problems while providing absolute (deterministic) guarantees of the privacy of on-device data and the user's contacts. In fact, we ensure that the private data and private contacts are never revealed to the central system. Our restrictive model of computation presents several interesting algorithmic challenges because any computation based on private information and contacts must be performed on local devices of limited capabilities. Despite these challenges, under realistic assumptions of inter-device communication, we show several efficient algorithms for fundamental data mining and machine learning problems, ranging from k-means clustering to heavy hitters. We complement this analysis with strong impossibility results for efficient private algorithms without allowing inter-device communication. In our experimental evaluation, we show that our private algorithms provide results almost as accurate as those of the non-private ones while speeding up the on-device computations by orders of magnitude.
In recent years, Graph Neural Networks (GNNs), which can naturally integrate node information and topological structure, have been demonstrated to be powerful in learning on graph data. These advantages of GNNs provide great potential to advance social recommendation, since data in social recommender systems can be represented as a user-user social graph and a user-item graph, and learning latent factors of users and items is the key. However, building social recommender systems based on GNNs faces challenges. For example, the user-item graph encodes both interactions and their associated opinions; social relations have heterogeneous strengths; and users are involved in two graphs (i.e., the user-user social graph and the user-item graph). To address these three challenges simultaneously, in this paper, we present a novel graph neural network framework (GraphRec) for social recommendations. In particular, we provide a principled approach to jointly capture interactions and opinions in the user-item graph and propose the framework GraphRec, which coherently models the two graphs and heterogeneous strengths. Extensive experiments on two real-world datasets demonstrate the effectiveness of the proposed framework GraphRec.
Representing words as embeddings in a continuous vector space has been proven successful in improving performance on many natural language processing (NLP) tasks. Beyond the traditional methods that learn embeddings from large text corpora, ensemble methods have been proposed to leverage the merits of pre-trained word embeddings as well as external semantic sources. In this paper, we propose a knowledge-enhanced ensemble method to combine both knowledge graphs and pre-trained word embedding models. Specifically, we interpret relations in knowledge graphs as linear translations from one word to another. We also propose a novel weighting scheme to further distinguish edges in the knowledge graph with the same type of relation. Extensive experiments demonstrate that our proposed method is up to 20% better than the state of the art on the word analogy task and up to 16% better than the state of the art on the word similarity task.
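The "relation as linear translation" reading can be sketched as follows: a relation vector is estimated as the average embedding offset over the knowledge-graph edges of that relation and then applied to analogy-style queries. The toy 2-D embeddings and the specific relation are illustrative assumptions.

```python
import numpy as np

# Toy 2-D word embeddings (assumed values, for illustration only).
emb = {
    "paris":  np.array([2.0, 1.0]), "france":  np.array([1.0, 3.0]),
    "rome":   np.array([2.2, 0.8]), "italy":   np.array([1.2, 2.9]),
    "berlin": np.array([1.9, 1.1]), "germany": np.array([0.9, 3.1]),
}
capital_of = [("paris", "france"), ("rome", "italy")]  # KG edges of one relation

# Relation vector = mean translation offset over the knowledge-graph edges.
r = np.mean([emb[tail] - emb[head] for head, tail in capital_of], axis=0)

def closest(vec, exclude):
    return min((w for w in emb if w not in exclude),
               key=lambda w: np.linalg.norm(emb[w] - vec))

# Analogy-style query: berlin + capital_of ≈ ?
print(closest(emb["berlin"] + r, exclude={"berlin"}))  # germany
```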
Entity linking is the task of aligning mentions to corresponding entities in a given knowledge base. Previous studies have highlighted the necessity for entity linking systems to capture global coherence. However, there are two common weaknesses in previous global models. First, most of them calculate the pairwise scores between all candidate entities and select the most relevant group of entities as the final result. In this process, the consistency among wrong entities as well as that among right ones is involved, which may introduce noise and increase model complexity. Second, the cues from previously disambiguated entities, which could contribute to the disambiguation of subsequent mentions, are usually ignored by previous models. To address these problems, we convert global linking into a sequence decision problem and propose a reinforcement learning model which makes decisions from a global perspective. Our model makes full use of previously referred entities and explores the long-term influence of the current selection on subsequent decisions. We conduct experiments on different types of datasets, and the results show that our model outperforms state-of-the-art systems and has better generalization performance.
Third-party applications present a convenient way for attackers to orchestrate a large number of fake and compromised accounts on popular online social networks. Despite recent high-profile reports of third-party application abuse on popular online social networks, prior work lacks automated approaches for accurate and early detection of abusive applications. In this paper, we perform a longitudinal study of abusive third-party applications on Twitter that perform a variety of malicious and spam activities in violation of Twitter's Terms of Service (ToS). Our measurements, spanning a period of 16 months, demonstrate an ongoing arms race between attackers continuously registering and abusing new applications and Twitter trying to detect them. We find that hundreds of thousands of abusive applications remain undetected by Twitter for several months while posting tens of millions of tweets. We propose a machine learning approach for accurate and early detection of abusive Twitter applications by analyzing their first few tweets. The evaluation shows that our machine learning approach can accurately detect abusive applications with 92.7% precision and 87.0% recall by analyzing their first seven tweets. The deployment of our machine learning approach in the wild shows that attackers continue to abuse third-party applications despite Twitter's recent countermeasures targeting third-party applications.
Online services are playing critical roles in almost all aspects of users' lives. Users usually have multiple online identities (IDs) in different online services. In order to fuse the separated user data in multiple services for better business intelligence, it is critical for service providers to link online IDs belonging to the same user. On the other hand, the popularity of mobile networks and GPS-equipped smart devices has provided a generic way to link IDs, i.e., utilizing the mobility traces of IDs. However, linking IDs based on their mobility traces has been a challenging problem due to the highly heterogeneous, incomplete and noisy mobility data across services.
In this paper, we propose DPLink, an end-to-end deep learning based framework, to complete the user identity linkage task for heterogeneous mobility data collected from different services with different properties. DPLink is made up of a feature extractor, including a location encoder and a trajectory encoder to extract representative features from trajectories, and a comparator that decides whether to link two trajectories as the same user. Particularly, we propose a pre-training strategy with a simple task to train the DPLink model to overcome the training difficulties introduced by the highly heterogeneous nature of different source mobility data. Besides, we introduce a multi-modal embedding network and a co-attention mechanism in DPLink to deal with the low-quality problem of mobility data. By conducting extensive experiments on two real-life ground-truth mobility datasets with eight baselines, we demonstrate that DPLink outperforms the state-of-the-art solutions by more than 15% in terms of hit-precision. Moreover, it can be extended with external geographical context data and works stably with heterogeneous noisy mobility traces. Our code is publicly available.
Network embedding aims at learning an effective vector transformation for entities in a network. We observe that there are two diverse branches of network embedding: for homogeneous graphs and for multi-relational graphs. This paper then proposes MARINE, a unified embedding framework for both homogeneous and multi-relational networks to preserve both the proximity and relation information. We also extend the framework to incorporate existing features of nodes in a graph, which can further be exploited for the ensemble of embeddings. Our solution has complexity linear in the number of edges, which makes it suitable for large-scale network applications. Experiments conducted on several real-world network datasets, along with applications in link prediction and multi-label classification, exhibit the superiority of our proposed MARINE.
The study of influence maximization in social networks has largely ignored disparate effects these algorithms might have on the individuals contained in the social network. Individuals may place a high value on receiving information, e.g. job openings or advertisements for loans. While well-connected individuals at the center of the network are likely to receive the information that is being distributed through the network, poorly connected individuals are systematically less likely to receive the information, producing a gap in access to the information between individuals. In this work, we study how best to spread information in a social network while minimizing this access gap.
We propose to use the maximin social welfare function as an objective function, where we maximize the minimum probability of receiving the information under an intervention. We prove that in this setting this welfare function constrains the access gap whereas maximizing the expected number of nodes reached does not. We also investigate the difficulties of using the maximin, and present hardness results and analysis for standard greedy strategies. Finally, we investigate practical ways of optimizing for the maximin, and give empirical evidence that a simple greedy-based strategy works well in practice.
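A hypothetical sketch of the simple greedy strategy for the maximin objective might look as follows, using Monte Carlo estimates of per-node reach probabilities under an independent-cascade spread; the spread model, the probability p, and the number of trials are illustrative assumptions, not the paper's exact setup.

```python
import random
import networkx as nx

def simulate_reach(graph, seeds, p=0.1):
    """One independent-cascade simulation; returns the set of reached nodes."""
    reached, frontier = set(seeds), list(seeds)
    while frontier:
        node = frontier.pop()
        for nbr in graph.neighbors(node):
            if nbr not in reached and random.random() < p:
                reached.add(nbr)
                frontier.append(nbr)
    return reached

def min_access_probability(graph, seeds, trials=200):
    """Monte Carlo estimate of the minimum reach probability over all nodes."""
    counts = {v: 0 for v in graph}
    for _ in range(trials):
        for v in simulate_reach(graph, seeds):
            counts[v] += 1
    return min(c / trials for c in counts.values())

def greedy_maximin_seeds(graph, k):
    """Greedily add the seed that most improves the minimum reach probability."""
    seeds = set()
    for _ in range(k):
        best = max(
            (v for v in graph if v not in seeds),
            key=lambda v: min_access_probability(graph, seeds | {v}),
        )
        seeds.add(best)
    return seeds

if __name__ == "__main__":
    g = nx.barabasi_albert_graph(100, 2, seed=0)
    print(greedy_maximin_seeds(g, k=3))
```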
Web systems that provide the same functionality usually share a certain amount of items. This makes it possible to combine data from different websites to improve recommendation quality, known as the cross-domain recommendation task. Despite many research efforts on this task, the main drawback is that they largely assume the data of different systems can be fully shared. Such an assumption is unrealistic - different systems are typically operated by different companies, and it may violate business privacy policy to directly share user behavior data since it is highly sensitive.
In this work, we consider a more practical scenario to perform cross-domain recommendation. To avoid the leakage of user privacy during the data sharing process, we consider sharing only the information of the item side, rather than user behavior data. Specifically, we transfer the item embeddings across domains, making it easier for two companies to reach a consensus (e.g., legal policy) on data sharing since the data to be shared is user-irrelevant and has no explicit semantics. To distill useful signals from transferred item embeddings, we rely on the strong representation power of neural networks and develop a new method named NATR (short for Neural Attentive Transfer Recommendation). We perform extensive experiments on two real-world datasets, demonstrating that NATR achieves similar or even better performance than traditional cross-domain recommendation methods that directly share user-relevant data. Further insights are provided on the efficacy of NATR in using the transferred item embeddings to alleviate the data sparsity issue.
Good quality similarity metrics can significantly facilitate the performance of many large-scale, real-world applications. Existing studies have proposed various solutions to learn a Mahalanobis or bilinear metric in an online fashion by either restricting distances between similar (dissimilar) pairs to be smaller (larger) than a given lower (upper) bound or requiring similar instances to be separated from dissimilar instances with a given margin. However, these linear metrics learned by leveraging fixed bounds or margins may not perform well in real-world applications, especially when data distributions are complex. We aim to address the open challenge of “Online Adaptive Metric Learning” (OAML) for learning adaptive metric functions on-the-fly. Unlike traditional online metric learning methods, OAML is significantly more challenging since the learned metric could be non-linear and the model has to be self-adaptive as more instances are observed. In this paper, we present a new online metric learning framework that attempts to tackle the challenge by learning an ANN-based metric with adaptive model complexity from a stream of constraints. In particular, we propose a novel Adaptive-Bound Triplet Loss (ABTL) to effectively utilize the input constraints, and present a novel Adaptive Hedge Update (AHU) method for online updating of the model parameters. We empirically validate the effectiveness and efficacy of our framework on various applications such as real-world image classification, facial verification, and image retrieval.
Mental illness such as depression is a significant risk factor for suicide ideation, behaviors, and attempts. A report by the Substance Abuse and Mental Health Services Administration (SAMHSA) shows that 80% of the patients suffering from Borderline Personality Disorder (BPD) have suicidal behavior, 5-10% of whom commit suicide. While multiple initiatives have been developed and implemented for suicide prevention, a key challenge has been the social stigma associated with mental disorders, which deters patients from seeking help or sharing their experiences directly with others including clinicians. This is particularly true for teenagers and younger adults, for whom suicide is the second leading cause of death in the US. Prior research involving surveys and questionnaires (e.g. PHQ-9) for suicide risk prediction failed to provide a quantitative assessment of risk that informed timely clinical decision-making for intervention. Our interdisciplinary study concerns the use of Reddit as an unobtrusive data source for gleaning information about suicidal tendencies and other related mental health conditions afflicting depressed users. We provide details of our learning framework that incorporates domain-specific knowledge to predict the severity of suicide risk for an individual. Our approach involves developing a suicide risk severity lexicon using medical knowledge bases and a suicide ontology to detect cues relevant to suicidal thoughts and actions. We also use language modeling, medical entity recognition and normalization, and negation detection to create a dataset of 2181 redditors who have discussed or implied suicidal ideation, behavior, or attempt. Given the importance of clinical knowledge, our gold standard dataset of 500 redditors (out of 2181) was developed by four practicing psychiatrists following the guidelines outlined in the Columbia Suicide Severity Rating Scale (C-SSRS), with a pairwise annotator agreement of 0.79 and a group-wise agreement of 0.73. Compared to the existing four-label classification scheme (no risk, low risk, moderate risk, and high risk), our proposed C-SSRS-based 5-label classification scheme distinguishes people who are supportive from those who show different severities of suicidal tendency. Our 5-label classification scheme outperforms the state-of-the-art schemes by improving the graded recall by 4.2% and reducing the perceived risk measure by 12.5%. A convolutional neural network (CNN) provided the best performance in our scheme due to the discriminative features and use of domain-specific knowledge resources, in comparison to the SVM-L that has been used in state-of-the-art tools over a similar dataset.
Most popular web browsers include “reader modes” that improve the user experience by removing non-essential page elements. Reader modes reformat the page to hide elements that are not related to the page's main content. Such page elements include site navigation, advertising related videos and images, and most JavaScript. The intended end result is that users can enjoy the content they are interested in, without distraction.
In this work, we consider whether the “reader mode” can be widened to also provide performance and privacy improvements. Instead of its use as a post-render feature to clean up the clutter on a page, we propose SpeedReader as an alternative multistep pipeline that is part of the rendering pipeline. Once the tool decides during the initial phase of a page load that a page is suitable for reader mode use, it directly applies document tree translation before the page is rendered. Based on our measurements, we believe that SpeedReader can be continuously enabled in order to drastically improve end-user experience, especially on slow mobile connections. Combined with our approach to predicting which pages should be rendered in reader mode with 91% accuracy, SpeedReader achieves average speedups and bandwidth reductions of up to 27× and 84×, respectively. We further find that our novel “reader mode” approach brings with it significant privacy improvements to users. Our approach effectively removes all commonly recognized trackers, issues 115 fewer requests to third parties, and interacts with 64 fewer trackers on average, on transformed pages.
One of the important yet insufficiently studied subjects in fair allocation is the externality effect among agents. For a resource allocation problem, externalities imply that the share allocated to an agent may affect the utilities of other agents.
In this paper, we conduct a study of fair allocation of indivisible goods when the externalities are not negligible. Inspired by the models in the context of network diffusion, we present a simple and natural model, namely network externalities, to capture the externalities. To evaluate fairness in the network externalities model, we generalize the idea behind the notion of maximin-share (MMS) to achieve a new criterion, namely, extended-maximin-share (EMMS). Next, we consider two problems concerning our model.
First, we discuss the computational aspects of finding the value of for every agent. For this, we introduce a generalized form of partitioning problem that includes many famous partitioning problems such as maximin, minimax, and leximin. We further show that a 1/2-approximation algorithm exists for this partitioning problem.
Next, we investigate finding approximately optimal allocations, i.e., allocations that guarantee each agent a utility of at least a fraction of his extended-maximin-share. We show that, under the natural assumption that the agents are α-self-reliant, an α/2-EMMS allocation always exists. Combining this with the former result yields a polynomial-time α/4-EMMS allocation algorithm.
To make images on Twitter and other social media platforms accessible to screen reader users, image descriptions (alternative text) need to be added that describe the information contained within the image. The lack of alternative text has been an enduring accessibility problem since the “alt” attribute was added in HTML 2.0 over 20 years ago, and the rise of user-generated content has only increased the number of images shared. As of 2016, Twitter provides users the ability to turn on a feature that allows descriptions to be added to images in their tweets, presumably in an effort to combat this accessibility problem. What has remained unknown is whether simply enabling users to provide alternative text has an impact on experienced accessibility. In this paper, we present a study of 1.09 million tweets with images, finding that only 0.1% of those tweets included descriptions. In a separate analysis of the timelines of 94 blind Twitter users, we found that the image tweets they encountered included descriptions more often than in the broader sample. Even users with the feature turned on only write descriptions for about half of the images they tweet. To better understand why users provide alternative text descriptions (or not), we interviewed 20 Twitter users who have written image descriptions. Users did not remember to add alternative text, did not have time to add it, or did not know what to include when writing the descriptions. Our findings indicate that simply making it possible to provide image descriptions is not enough, and reveal future directions for automated tools that may support users in writing high-quality descriptions.
Cryptocurrencies are a novel and disruptive technology that has prompted a new approach to how currencies work in the modern economy. As such, online discussions related to cryptocurrencies often go beyond posts about the technology and underlying architecture of the various coins, to subjective speculations of price fluctuations and predictions. Furthermore, online discussions, potentially driven by foreign adversaries, criminals or hackers, can have a significant impact on our economy and national security if spread at scale.
This paper is the first to qualitatively measure and contrast discussion growth about three popular cryptocurrencies with key distinctions in motivation, usage, and implementation - Bitcoin, Ethereum, and Monero - on Reddit. More specifically, we measure how discussions relevant to these coins spread in online social environments - how deep and how wide they go, how long they last, how many people they reach, etc. More importantly, we compare user behavior patterns between the focused community of the official coin subreddits and the general community across Reddit as a whole. Our Reddit sample covers three years of data between 2015 and 2018 and includes a time period of a record-high Bitcoin price rise.
Our results demonstrate that while the largest discussions on Reddit are focused on Bitcoin, posts about Monero (a cryptocurrency often used by criminals for illegal transactions on the Dark Web) start discussions that are typically longer and wider. Bitcoin posts trigger subsequent discussion more immediately but Monero posts are more likely to trigger a longer lasting discussion. We find that moderately subjective posts across all three coins trigger larger, longer, and more viral discussion cascades within both focused and general communities on Reddit. Our analysis aims to bring awareness to the spread of online discussions about cryptocurrencies, in addition to informing models for forecasting cryptocurrency prices that rely on discussions in social media.
Activity tracking apps often make use of goals as one of their core motivational tools. There are two critical components to this tool: setting a goal, and subsequently achieving that goal. Despite its crucial role in how a number of prominent self-tracking apps function, there has been relatively little investigation of the goal-setting and achievement aspects of self-tracking apps.
Here we explore this issue, investigating a particular goal setting and achievement process that is extensive, recorded, and crucial for both the app and its users' success: weight loss goals in MyFitnessPal. We present a large-scale study of 1.4 million users and weight loss goals, allowing for an unprecedentedly detailed view of how people set and achieve their goals. We find that, even for difficult long-term goals, behavior within the first 7 days distinguishes those who ultimately achieve their goals, that is, those who lose at least as much weight as they set out to, from those who do not. For instance, high amounts of early weight loss, which some researchers have classified as unsustainable, lead to higher goal achievement rates. We also show that early food intake, self-monitoring motivation, and attitude towards the goal are important factors. We then show that we can use our findings to predict goal achievement with an accuracy of 79% ROC AUC just 7 days after a goal is set. Finally, we discuss how our findings could inform steps to improve goal achievement in self-tracking apps.
With the overwhelming popularity of Knowledge Graphs (KGs), researchers have long devoted attention to link prediction for completing KGs. However, they mainly focus on improving performance on binary relational data, where facts are usually represented as triples in the form of (head entity, relation, tail entity). In practice, n-ary relational facts are also ubiquitous. When encountering such facts, existing studies usually decompose them into triples by introducing a multitude of auxiliary virtual entities and additional triples. These conversions complicate carrying out link prediction for facts with more than two arities. It has even been shown that they may cause a loss of structural information. To overcome these problems, in this paper, without decomposition, we represent each n-ary relational fact as a set of its role-value pairs. We further propose a method to conduct Link Prediction on N-ary relational data, thus called NaLP, which explicitly models the relatedness of all the role-value pairs in the same n-ary relational fact. Experimental results validate the effectiveness and merits of the proposed NaLP method.
A user's digital identity information has privacy and security requirements. Privacy requirements include confidentiality of the identity information itself, anonymity of those who verify and consume a user's identity information, and unlinkability of online transactions which involve a user's identity. Security requirements include correctness, ownership assurance and prevention of counterfeits of a user's identity information. Such privacy and security requirements, although conflicting, are critical for identity management systems enabling the exchange of users' identity information between different parties during the execution of online transactions. Addressing all such requirements, without a centralized party managing the identity exchange transactions, raises several challenges. This paper presents a decentralized protocol for privacy preserving exchange of users' identity information addressing such challenges. The proposed protocol leverages advances in blockchain and zero knowledge proof technologies, as the main building blocks. We provide prototype implementations of the main building blocks of the protocol and assess its performance and security.
Data-driven websites are mostly accessed through search interfaces. Such sites follow a common publishing pattern that, surprisingly, has not been fully exploited for unsupervised data extraction yet: the result of a search is presented as a paginated list of result records. Each result record contains the main attributes about one single object, and links to a page dedicated to the details of that object.
We present RED, an automatic approach and a prototype system for extracting data records from sites following this publishing pattern. RED leverages the inherent redundancy between result records and corresponding detail pages to design an effective, yet fully-unsupervised and domain-independent method. It is able to extract from result pages all the attributes of the objects that appear both in the result records and in the corresponding detail pages.
With respect to previous unsupervised methods, our method does not require any a priori domain-dependent knowledge (e.g., an ontology), and can achieve significantly higher accuracy while automatically selecting only object attributes, a task that is out of the scope of traditional fully unsupervised approaches. With respect to previous supervised or semi-supervised methods, RED can reach similar accuracy in many domains (e.g., job postings) without requiring supervision for each domain, let alone each website.
Fraud transactions are one of the major threats faced by online e-commerce platforms. Recently, deep learning based classifiers have been deployed to detect fraud transactions. Inspired by findings on adversarial examples, this paper is the first to analyze the vulnerability of a deep fraud detector to slight perturbations on input transactions, which is very challenging since the sparsity and discretization of transaction data result in a non-convex discrete optimization. Inspired by the iterative Fast Gradient Sign Method (FGSM) for the L∞ attack, we first propose the Iterative Fast Coordinate Method (IFCM) for discrete L1 and L2 attacks, which is efficient at generating large numbers of instances with satisfactory effectiveness. We then provide two novel attack algorithms to solve the discrete optimization. The first one is the Augmented Iterative Search (AIS) algorithm, which repeatedly searches for effective “simple” perturbations. The second one is called the Rounded Relaxation with Reparameterization (R3), which rounds the solution obtained by solving a relaxed and unconstrained optimization problem with reparameterization tricks. Finally, we conduct an extensive experimental evaluation on the deployed fraud detector in TaoBao, one of the largest e-commerce platforms in the world, with millions of real-world transactions. Results show that (i) the deployed detector is highly vulnerable to attacks, as the average precision is decreased from nearly 90% to as low as 20% with small perturbations; (ii) our proposed attacks significantly outperform adaptations of state-of-the-art attacks; and (iii) the model trained with an adversarial training process is significantly more robust against attacks and performs well on the unperturbed data.
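For context, the iterative sign-step idea that IFCM builds on can be sketched generically as follows; this is a standard FGSM-style loop for an L∞ budget with a placeholder model and loss, not the authors' IFCM, AIS, or R3 algorithms.

```python
import torch

def iterative_sign_attack(model, x, y, loss_fn, step=0.01, eps=0.1, iters=10):
    """Generic iterative FGSM-style L-infinity attack (illustrative sketch only)."""
    x_adv = x.clone()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        # Ascend the loss along the sign of the gradient, then project back
        # into the L-infinity ball of radius eps around the original input.
        x_adv = (x_adv + step * grad.sign()).detach()
        x_adv = torch.clamp(x_adv, x - eps, x + eps)
    return x_adv

# Example usage with a placeholder linear "detector" (hypothetical):
# model = torch.nn.Linear(30, 2); loss_fn = torch.nn.CrossEntropyLoss()
# x_adv = iterative_sign_attack(model, torch.randn(8, 30), torch.randint(0, 2, (8,)), loss_fn)
```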
Web crawlers spend significant resources to maintain freshness of their crawled data. This paper describes the optimization of resources to ensure that product prices shown in ads in the context of a shopping sponsored search service are synchronized with current merchant prices. We are able to use the predictability of price changes to build a machine learned system leading to considerable resource savings for both the merchants and the crawler. We describe our solution to technical challenges due to partial observability of price history, feedback loops arising from applying machine learned models, and offers in a cold-start state. Empirical evaluation over large-scale product crawl data demonstrates the effectiveness of our model and confirms its robustness towards unseen data. We argue that our approach is applicable in more general data pull settings.
Descriptive titles provide crucial context for interpreting tables that are extracted from web pages and are a key component of search features such as tabular featured snippets from Google and Bing. Prior approaches have attempted to produce titles by selecting existing text snippets associated with the table. These approaches, however, are limited by their dependence on suitable titles existing a priori. In our user study, we observe that the relevant information for the title tends to be scattered across the page, and often (more than 80% of the time) does not appear verbatim anywhere in the page. We propose instead the application of a sequence-to-sequence neural network model as a more generalizable approach for generating high-quality table titles. This is accomplished by extracting many text snippets that have potentially relevant information to the table, encoding them into an input sequence, and using both copy and generation mechanisms in the decoder to balance relevance and readability of the generated title. We validate this approach with human evaluation on sample web tables and report that while sequence models with only a copy mechanism or only a generation mechanism are easily outperformed by simple selection-based baselines, the model with both capabilities performs the best, approaching the quality of crowdsourced titles while training on fewer than ten thousand examples. To the best of our knowledge, the proposed technique is the first to consider text-generation methods for table titles, and establishes a new state of the art.
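The combination of copy and generation mechanisms in the decoder is commonly formalized as a pointer-generator mixture; a standard (generic, not necessarily the authors' exact) formulation is:

```latex
% Mixture of generating from the vocabulary and copying from the encoded
% input snippets; notation is generic.
P(w \mid y_{<t}, x) \;=\; p_{\mathrm{gen}}\, P_{\mathrm{vocab}}(w)
\;+\; (1 - p_{\mathrm{gen}}) \sum_{i:\, x_i = w} a_{t,i},
\qquad
p_{\mathrm{gen}} = \sigma\!\big(w_c^\top c_t + w_s^\top s_t + w_x^\top x_t + b\big)
```

Here $a_{t,i}$ is the attention weight on input token $x_i$ at decoding step $t$, and $p_{\mathrm{gen}}$ interpolates between generating a vocabulary word and copying a word from the extracted snippets.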
The investigation of the effect of the built environment in a neighbourhood and how it impacts residents' health is of value to researchers from public health policy to social science. The traditional method to assess this impact is through surveys, which lead to temporally and spatially coarse-grained data and are often not cost-effective. Here we propose an approach to link the effects of neighbourhood services on citizen health using a technique that attempts to highlight the cause-effect aspects of these relationships. The method is based on the theory of propensity score matching with multiple 'doses' and it leverages existing fine-grained open web data. To demonstrate the method, we study the effect of sport venue presence on the prevalence of antidepressant prescriptions in over 600 neighbourhoods in London over a period of three years. We find the distribution of effects is approximately normal, centred on a small negative effect on prescriptions with increases in the availability of sporting facilities, on average. We assess the procedure through some standard quantitative metrics as well as matching on synthetic data generated by modelling the real data. This approach opens the door to fast and inexpensive alternatives to quantify and continuously monitor effects of the neighbourhood built environment on population health.
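A minimal sketch of the underlying propensity-score matching step is shown below, reduced to a binary treatment and nearest-neighbor matching with scikit-learn; the column names and synthetic data are hypothetical, and the multiple-'doses' extension used in the paper is omitted.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical neighbourhood-level data: covariates, a binary "treatment"
# (e.g. above-median availability of sport venues) and an outcome
# (e.g. an antidepressant prescription rate).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.normal(size=600),
    "density": rng.normal(size=600),
    "treated": rng.integers(0, 2, size=600),
    "outcome": rng.normal(size=600),
})

# 1) Estimate propensity scores P(treated | covariates).
ps_model = LogisticRegression().fit(df[["income", "density"]], df["treated"])
df["pscore"] = ps_model.predict_proba(df[["income", "density"]])[:, 1]

# 2) Match each treated unit to the control unit with the closest score
#    and average the outcome differences.
treated = df[df["treated"] == 1]
control = df[df["treated"] == 0]
effects = []
for _, row in treated.iterrows():
    match = control.iloc[(control["pscore"] - row["pscore"]).abs().argmin()]
    effects.append(row["outcome"] - match["outcome"])

print("estimated average treatment effect on the treated:", np.mean(effects))
```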
In the past few decades, there has been rapid growth in the quantity and variety of healthcare data. These large sets of data are usually high dimensional (e.g. patients, their diagnoses, and medications to treat their diagnoses) and cannot be adequately represented as matrices. Thus, many existing algorithms cannot analyze them. To accommodate these high-dimensional data, tensor factorization, which can be viewed as a higher-order extension of methods like PCA, has attracted much attention and emerged as a promising solution. However, tensor factorization is a computationally expensive task, and existing methods developed to factor large tensors are not flexible enough for real-world situations.
To address this scaling problem more efficiently, we introduce SGranite, a distributed, scalable, and sparse tensor factorization method fit through stochastic gradient descent. SGranite offers three contributions: (1) Scalability: it employs a block partitioning and parallel processing design and thus scales to large tensors, (2) Accuracy: we show that our method can achieve results faster without sacrificing the quality of the tensor decomposition, and (3) Flexible Constraints: we show our approach can encompass various kinds of constraints including l2 norm, l1 norm, and logistic regularization. We demonstrate SGranite's capabilities in two real-world use cases. In the first, we use Google searches for flu-like symptoms to characterize and predict influenza patterns. In the second, we use SGranite to extract clinically interesting sets (i.e., phenotypes) of patients from electronic health records. Through these case studies, we show SGranite has the potential to be used to rapidly characterize, predict, and manage large multimodal datasets, thereby promising a novel, data-driven solution that can benefit very large segments of the population.
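As a point of reference, the basic computation that SGranite scales up, rank-R CP factorization fit by stochastic gradient descent on observed entries, can be sketched as follows; this toy, single-machine version with l2 regularization is illustrative and omits SGranite's block partitioning and parallel processing.

```python
import random
import numpy as np

def cp_sgd(entries, shape, rank=8, lr=0.01, epochs=20, reg=0.01, seed=0):
    """Fit a rank-`rank` CP model A, B, C to observed (i, j, k, value) entries."""
    rng = np.random.default_rng(seed)
    A = rng.normal(scale=0.1, size=(shape[0], rank))
    B = rng.normal(scale=0.1, size=(shape[1], rank))
    C = rng.normal(scale=0.1, size=(shape[2], rank))
    for _ in range(epochs):
        random.shuffle(entries)          # shuffles the observed entries in place
        for i, j, k, v in entries:
            pred = np.sum(A[i] * B[j] * C[k])
            err = pred - v
            # One SGD step per factor row, with l2 regularization.
            gA = err * (B[j] * C[k]) + reg * A[i]
            gB = err * (A[i] * C[k]) + reg * B[j]
            gC = err * (A[i] * B[j]) + reg * C[k]
            A[i] -= lr * gA
            B[j] -= lr * gB
            C[k] -= lr * gC
    return A, B, C

# Example: a tiny patient x diagnosis x medication count tensor (synthetic).
obs = [(0, 1, 2, 3.0), (1, 0, 2, 1.0), (2, 2, 0, 2.0)]
A, B, C = cp_sgd(obs, shape=(3, 3, 3), rank=2)
```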
Crowdsourced knowledge bases like Wikidata suffer from low-quality edits and vandalism and employ machine learning-based approaches to detect both kinds of damage. We reveal that state-of-the-art detection approaches discriminate against anonymous and new users: benign edits from these users receive much higher vandalism scores than benign edits from older ones, causing newcomers to abandon the project prematurely. We address this problem for the first time by analyzing and measuring the sources of bias, and by developing a new vandalism detection model that avoids them. Our model FAIR-S reduces the bias ratio of the state-of-the-art vandalism detector WDVD from 310.7 to only 11.9 while maintaining high predictive performance at 0.963 ROC-AUC and 0.316 PR-AUC.
Information diffusion is usually modeled as a process in which immutable pieces of information propagate over a network. In reality, however, messages are not immutable, but may be morphed with every step, potentially entailing large cumulative distortions. This process may lead to misinformation even in the absence of malevolent actors, and understanding it is crucial for modeling and improving online information systems. Here, we perform a controlled, crowdsourced experiment in which we simulate the propagation of information from medical research papers. Starting from the original abstracts, crowd workers iteratively shorten previously produced summaries to increasingly smaller lengths. We also collect control summaries where the original abstract is compressed directly to the final target length. Comparing cascades to controls allows us to separate the effect of the length constraint from that of accumulated distortion. Via careful manual coding, we annotate lexical and semantic units in the medical abstracts and track them along cascades. We find that iterative summarization has a negative impact due to the accumulation of error, but that high-quality intermediate summaries result in less distorted messages than in the control case. Different types of information behave differently; in particular, the conclusion of a medical abstract (i.e., its key message) is distorted most. Finally, we compare extractive with abstractive summaries, finding that the latter are less prone to semantic distortion. Overall, this work is a first step in studying information cascades without the assumption that disseminated content is immutable, with implications on our understanding of the role of word-of-mouth effects on the misreporting of science.
The text snippets presented in web search results provide users with a slice of page content that they can quickly scan to help inform their click decisions. However, little is known about how these snippets are generated or how they relate to a user's search query. Motivated by the growing body of evidence suggesting that search engine rankings can influence undecided voters, we conducted an algorithm audit of the political partisanship of Google Search snippets relative to the webpages they are extracted from. To accomplish this, we constructed a lexicon of partisan cues to measure partisanship and constructed a set of left- and right-leaning search queries. Then, we collected a large dataset of Search Engine Results Pages (SERPs) by running our partisan queries and their autocomplete suggestions on Google Search. After using our lexicon to score the machine-coded partisanship of snippets and webpages, we found that Google Search's snippets generally amplify partisanship, and that this effect is robust across different types of webpages, query topics, and partisan (left- and right-leaning) queries.
With the wide adoption of multi-community structure in many popular online platforms, human mobility across online communities has drawn increasing attention from both academia and industry. In this work, we study the statistical patterns that characterize human movements in cyberspace. Inspired by previous work on human mobility in physical space, we decompose human online activities into return and exploration - two complementary types of movements. We then study how people perform these two movements, respectively. We first propose a preferential return model that uncovers the preferential properties of people returning to multiple online communities. Interestingly, this model echoes the previous findings on human mobility in physical space. We then present a preferential exploration model that characterizes exploration movements from a novel online community-group perspective. Our experiments quantitatively reveal the patterns of people exploring new communities, which share striking similarities with online return movements in terms of underlying principles. By combining the mechanisms of both return and exploration together, we are able to obtain an overall model that characterizes human mobility patterns in cyberspace at the individual level. We further investigate human online activities using our models, and discover valuable insights into the mobility patterns across online communities. Our models explain the empirically observed human online movement trajectories remarkably well, and, more importantly, shed better light on the understanding of human cyberspace dynamics.
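A hedged sketch of the return-plus-exploration mechanism, in the spirit of classic preferential-return models of human mobility, is given below; the exploration probability form and its parameters are placeholders rather than the paper's fitted model.

```python
import random
from collections import Counter

def simulate_online_moves(steps=1000, rho=0.6, gamma=0.2, seed=0):
    """Toy simulation: at each step either explore a new community
    (with probability rho * S**-gamma, where S is the number of communities
    visited so far) or return to a known one, preferentially by visit count."""
    random.seed(seed)
    visits = Counter({0: 1})            # start in community 0
    next_new = 1
    trajectory = [0]
    for _ in range(steps):
        s = len(visits)
        if random.random() < rho * s ** (-gamma):
            community = next_new        # explore a brand-new community
            next_new += 1
        else:                           # preferential return
            communities, counts = zip(*visits.items())
            community = random.choices(communities, weights=counts)[0]
        visits[community] += 1
        trajectory.append(community)
    return trajectory, visits

traj, visits = simulate_online_moves()
print("distinct communities visited:", len(visits))
```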
Citywide abnormal events, such as crimes and accidents, may result in loss of life or property if not handled efficiently. Automatically predicting abnormal events before they occur is important for a wide spectrum of applications, ranging from maintaining public order and disaster control to modeling people's activities. However, forecasting different categories of citywide abnormal events is very challenging as it is affected by many complex factors from different views: (i) dynamic intra-region temporal correlation; (ii) complex inter-region spatial correlations; (iii) latent cross-categorical correlations. In this paper, we develop a Multi-View and Multi-Modal Spatial-Temporal learning (MiST) framework to address the above challenges by promoting the collaboration of different views (spatial, temporal and semantic) and mapping the multi-modal units into the same latent space. Specifically, MiST can preserve the underlying structural information of multi-view abnormal event data and automatically learn the importance of view-specific representations, with the integration of a multi-modal pattern fusion module and a hierarchical recurrent framework. Extensive experiments on three real-world datasets, i.e., crime data and urban anomaly data, demonstrate the superior performance of our MiST method over the state-of-the-art baselines across various settings.
Under increasing scrutiny, many web companies now offer bespoke mechanisms allowing any third party to file complaints (e.g., requesting the de-listing of a URL from a search engine). While this self-regulation might be a valuable web governance tool, it places huge responsibility in the hands of these organisations, a responsibility that demands close examination. We present the first large-scale study of web complaints (over 1 billion URLs). We find a range of complainants, largely focused on copyright enforcement. Whereas the majority of organisations are occasional users of the complaint system, we find a number of bulk senders specialised in targeting specific types of domain. We identify a series of trends and patterns amongst both the domains and complainants. By inspecting the availability of the domains, we also observe that a sizeable portion go offline shortly after complaints are generated. This paper sheds critical light on how complaints are issued, who they pertain to and which domains go offline after complaints are issued.
Diffusions, such as the heat kernel diffusion and the PageRank vector, and their relatives are widely used graph mining primitives that have been successful in a variety of contexts including community detection and semi-supervised learning. The majority of existing methods and methodology involves linear diffusions, which then yield simple algorithms involving repeated matrix-vector operations. Recent work, however, has shown that sophisticated and complicated techniques based on network embeddings and neural networks can give empirical results superior to those based on linear diffusions. In this paper, we illustrate a class of nonlinear graph diffusions that are competitive with state-of-the-art embedding techniques and outperform classic diffusions. Our new methods enjoy much of the simplicity underlying classic diffusion methods as well. Formally, they are based on nonlinear dynamical systems that can be realized with an implementation akin to applying a nonlinear function after each matrix-vector product in a classic diffusion. This framework also enables us to easily integrate results from multiple data representations in a principled fashion. Furthermore, we present theoretical relationships that suggest choices of the nonlinear term. We demonstrate the benefits of these techniques on a variety of synthetic and real-world data.
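Since the methods are described as applying a nonlinear function after each matrix-vector product, a minimal sketch of such an iteration is easy to give; the tanh nonlinearity, the teleportation-style parameter, and the renormalization below are illustrative choices, not the paper's exact dynamical system.

```python
import numpy as np

def nonlinear_diffusion(adj, seed_vec, alpha=0.85, iters=50, nonlinearity=np.tanh):
    """PageRank-like diffusion with a nonlinearity applied after each
    matrix-vector product (illustrative sketch)."""
    deg = adj.sum(axis=1, keepdims=True)
    P = adj / np.maximum(deg, 1)             # row-stochastic transition matrix
    x = seed_vec / seed_vec.sum()
    for _ in range(iters):
        x = nonlinearity(alpha * (P.T @ x) + (1 - alpha) * seed_vec)
        x = x / x.sum()                       # renormalize to keep a distribution
    return x

# Tiny example graph: a path of 4 nodes, seeded on node 0.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
seed = np.array([1.0, 0, 0, 0])
print(nonlinear_diffusion(A, seed))
```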
It is common in recommendation systems that users both consume and produce information as they make strategic choices under uncertainty. While a social planner would balance “exploration” and “exploitation” using a multi-armed bandit algorithm, users' incentives may tilt this balance in favor of exploitation. We consider Bayesian Exploration: a simple model in which the recommendation system (the “principal”) controls the information flow to the users (the “agents”) and strives to incentivize exploration via information asymmetry. A single round of this model is a version of a well-known “Bayesian Persuasion game” from [24]. We allow heterogeneous users, relaxing a major assumption from prior work that users have the same preferences from one time step to another. The goal is now to learn the best personalized recommendations. One particular challenge is that it may be impossible to incentivize some of the user types to take some of the actions, no matter what the principal does or how much time she has. We consider several versions of the model, depending on whether and when the user types are reported to the principal, and design a near-optimal “recommendation policy” for each version. We also investigate how the model choice and the diversity of user types impact the set of actions that can possibly be “explored” by each type.
In consumer search, there is a set of items. An agent has a prior over her value for each item and can pay a cost to learn the instantiation of her value. After exploring a subset of items, the agent chooses one and obtains a payoff equal to its value minus the search cost. We consider a sequential model of consumer search in which agents' values are correlated and each agent updates her priors based on the exploration of past agents before performing her search. Specifically, we assume the value is the sum of a common-value component, called the quality, and a subjective score. Fixing the variance of the total value, we say a population is more diverse if the subjective score has a larger variance. We ask how diversity impacts average utility. We show that intermediate diversity levels yield significantly higher social utility than the extreme cases of no diversity (when agents under-explore) or full diversity (when agents are unable to learn from each other) and quantify how the impact of the diversity level changes depending on the time spent searching.
Ordering take-out food (a.k.a. takeaway food) on online-to-offline (O2O) food ordering and delivery platforms is becoming a new lifestyle for people living in big cities, thanks to its great convenience. Web users and mobile device users can order take-out food (i.e. obtain online food ordering services) on an O2O platform. Then the O2O platform will dispatch food carriers to deliver food from restaurants to users, i.e. providing users with offline food delivery services. For an O2O food ordering and delivery platform, improving food delivery efficiency, given the massive number of food orders each day and the limited number of food carriers, is of paramount importance to reducing the length of time users wait for their food. Thus, in this paper, we study the food delivery task grouping problem so as to improve food delivery efficiency and alleviate users' pain of waiting, a problem which, to the best of our knowledge, has not been studied yet. However, the food delivery task grouping problem is challenging for two reasons. First, the food delivery efficiency is affected by multiple factors, which are non-trivial to formulate and jointly consider. Second, the problem is a typical NP-hard problem and finding near-optimal grouping results is not easy. To address these two issues, we propose an effective task grouping method. On one hand, we provide formal formulations for the factors affecting the food delivery efficiency, and provide an objective that organically combines these factors so that it can better guide the task grouping. On the other hand, we propose heuristic algorithms to efficiently obtain effective task grouping results, consisting of a greedy algorithm and a replacement algorithm. We evaluate our task grouping method using take-out food order data from web users and mobile device users on a real-world O2O food ordering and delivery platform. Experimental results demonstrate that our task grouping method can save ~16% (87 seconds) of the average waiting time for each user, compared with many baseline methods. This indicates that our method is able to significantly improve the food delivery efficiency and can provide better food delivery services for users.
Community detection refers to the task of discovering groups of vertices sharing similar properties or functions so as to understand the network data. With the recent development of deep learning, graph representation learning techniques are also utilized for community detection. However, the communities can only be inferred by applying clustering algorithms based on learned vertex embeddings. Such general clustering algorithms, like K-means and Gaussian Mixture Models, cannot output highly overlapping communities, which have been shown to be very common in many real-world networks. In this paper, we propose CommunityGAN, a novel community detection framework that jointly solves overlapping community detection and graph representation learning. First, unlike the embedding of conventional graph representation learning algorithms where the vector entry values have no specific meanings, the embedding of CommunityGAN indicates the membership strength of vertices to communities. Second, a specifically designed Generative Adversarial Net (GAN) is adopted to optimize such embedding. Through the minimax competition between the motif-level generator and discriminator, both of them can alternately and iteratively boost their performance and finally output a better community structure. Extensive experiments on synthetic data and real-world tasks demonstrate that CommunityGAN achieves substantial community detection performance gains over the state-of-the-art methods.
Semantic text matching is one of the most important research problems in many domains, including, but not limited to, information retrieval, question answering, and recommendation. Among the different types of semantic text matching, long-document-to-long-document text matching has many applications, but has rarely been studied. Most existing approaches for semantic text matching have limited success in this setting, due to their inability to capture and distill the main ideas and topics from long-form text.
In this paper, we propose a novel Siamese multi-depth attention-based hierarchical recurrent neural network (SMASH RNN) that learns the long-form semantics, and enables long-form document-based semantic text matching. In addition to word information, SMASH RNN uses the document structure to improve the representation of long-form documents. Specifically, SMASH RNN synthesizes information from different document structure levels, including paragraphs, sentences, and words. An attention-based hierarchical RNN derives a representation for each document structure level. Then, the representations learned from the different levels are aggregated to learn a more comprehensive semantic representation of the entire document. For semantic text matching, a Siamese structure couples the representations of a pair of documents, and infers a probabilistic score as their similarity.
We conduct an extensive empirical evaluation of SMASH RNN with three practical applications, including email attachment suggestion, related article recommendation, and citation recommendation. Experimental results on public data sets demonstrate that SMASH RNN significantly outperforms competitive baseline methods across various classification and ranking scenarios in the context of semantic matching of long-form documents.
As the size of real-world graphs has drastically increased in recent years, a wide variety of graph engines have been developed to deal with such big graphs efficiently. However, the majority of graph engines have been designed without considering the power-law degree distribution of real-world graphs seriously. Two problems have been observed when existing graph engines process real-world graphs: inefficient scanning of the sparse indicator and the delay in iteration progress due to uneven workload distribution. In this paper, we propose RealGraph, a single-machine based graph engine equipped with the hierarchical indicator and the block-based workload allocation. Experimental results on real-world datasets show that RealGraph significantly outperforms existing graph engines in terms of both speed and scalability.
WhatsApp has revolutionized the way people communicate and interact. It is not only cheaper than the traditional Short Message Service (SMS) communication but it also brings a new form of mobile communication: the group chats. Such groups are great forums for collective discussions on a variety of topics. In particular, in events of great social mobilization, such as strikes and electoral campaigns, WhatsApp group chats are very attractive as they facilitate information exchange among interested people. Yet, recent events have raised concerns about the spreading of misinformation in WhatsApp. In this work, we analyze information dissemination within WhatsApp, focusing on publicly accessible political-oriented groups, collecting all shared messages during major social events in Brazil: a national truck drivers' strike and the Brazilian presidential campaign. We analyze the types of content shared within such groups as well as the network structures that emerge from user interactions within and cross-groups. We then deepen our analysis by identifying the presence of misinformation among the shared images using labels provided by journalists and by a proposed automatic procedure based on Google searches. We identify the most important sources of the fake images and analyze how they propagate across WhatsApp groups and from/to other Web platforms.
We consider the problem of analyzing timestamped relational events between a set of entities, such as messages between users of an on-line social network. Such data are often analyzed using static or discrete-time network models, which discard a significant amount of information by aggregating events over time to form network snapshots. In this paper, we introduce a block point process model (BPPM) for continuous-time event-based dynamic networks. The BPPM is inspired by the well-known stochastic block model (SBM) for static networks. We show that networks generated by the BPPM follow an SBM in the limit of a growing number of nodes. We use this property to develop principled and efficient local search and variational inference procedures initialized by regularized spectral clustering. We fit BPPMs with exponential Hawkes processes to analyze several real network data sets, including a Facebook wall post network with over 3,500 nodes and 130,000 events.
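The exponential Hawkes processes used here are typically specified by a conditional intensity of the following standard form (generic notation; the exact per-block-pair parameterization may differ in the paper):

```latex
% Conditional intensity for events between block pair (a, b): a baseline rate
% plus exponentially decaying self-excitation from past events.
\lambda_{ab}(t) \;=\; \mu_{ab} \;+\; \alpha_{ab} \sum_{t_i < t} e^{-\beta_{ab}\,(t - t_i)}
```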
In-browser cryptojacking is a form of resource abuse that leverages end-users' machines to mine cryptocurrency without obtaining the users' consent. In this paper, we design, implement, and evaluate Outguard, an automated cryptojacking detection system. We construct a large ground-truth dataset, extract several features using an instrumented web browser, and ultimately select seven distinctive features that are used to build an SVM classification model. Outguard achieves a 97.9% TPR and 1.1% FPR and is reasonably tolerant to adversarial evasions. We utilized Outguard in the wild by deploying it across the Alexa Top 1M websites and found 6,302 cryptojacking sites, of which 3,600 are new detections that were absent from the training data. These cryptojacking sites paint a broad picture of the cryptojacking ecosystem, with particular emphasis on the prevalence of cryptojacking websites and the shared infrastructure that provides clues to the operators behind the cryptojacking phenomenon.
Neural network models have achieved impressive results in the field of text classification. However, existing approaches often suffer from insufficient training data in large-scale text classification involving a large number of categories (e.g., several thousands of categories). Several neural network models have utilized multi-task learning to overcome the limited amount of training data. However, these approaches are also limited to small-scale text classification. In this paper, we propose a novel neural network-based multi-task learning framework for large-scale text classification. To this end, we first treat the different scales of text classification (i.e., large and small numbers of categories) as multiple, related tasks. Then, we train the proposed neural network, which learns small- and large-scale text classification tasks simultaneously. In particular, we further enhance this multi-task learning architecture by using a gate mechanism, which controls the flow of features between the small- and large-scale text classification tasks. Experimental results clearly show that our proposed model improves the performance of the large-scale text classification task with the help of the small-scale text classification task. The proposed scheme exhibits significant improvements of as much as 14% and 5% in terms of micro-averaged and macro-averaged F1-score, respectively, over state-of-the-art techniques.
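A minimal, hypothetical PyTorch sketch of such a gate, mixing shared features with large-task-specific features before the large-scale classifier, is shown below; the module name, layer sizes, and mixing scheme are assumptions for illustration rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class GatedMultiTaskHead(nn.Module):
    """Toy gate that controls feature flow into the large-scale task head
    (an illustrative sketch, not the paper's exact model)."""
    def __init__(self, dim, n_small, n_large):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
        self.small_head = nn.Linear(dim, n_small)
        self.large_head = nn.Linear(dim, n_large)

    def forward(self, shared, large_specific):
        # The gate decides, per feature, how much shared information flows in.
        g = torch.sigmoid(self.gate(torch.cat([shared, large_specific], dim=-1)))
        mixed = g * shared + (1 - g) * large_specific
        return self.small_head(shared), self.large_head(mixed)

head = GatedMultiTaskHead(dim=128, n_small=50, n_large=5000)
small_logits, large_logits = head(torch.randn(4, 128), torch.randn(4, 128))
```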
Implicit user feedback is a fundamental dataset for personalized recommendation models. Because of its inherent characteristics of sparse one-class values, it is challenging to uncover meaningful user/item representations. In this paper, we propose dual neural personalized ranking (DualNPR), which fully exploits both user- and item-side pairwise rankings in a unified manner. The key novelties of the proposed model are three-fold: (1) DualNPR discovers mutual correlation among users and items by utilizing both user- and item-side pairwise rankings, alleviating the data sparsity problem. We stress that, unlike existing models that require extra information, DualNPR naturally augments both user- and item-side pairwise rankings from a user-item interaction matrix. (2) DualNPR is built upon deep matrix factorization to capture the variability of user/item representations. In particular, it chooses raw user/item vectors as an input and learns latent user/item representations effectively. (3) DualNPR employs a dynamic negative sampling method using an exponential function, further improving the accuracy of top-N recommendation. In experimental results over three benchmark datasets, DualNPR outperforms baseline models by 21.9-86.7% in hit rate, 14.5-105.8% in normalized discounted cumulative gain, and 5.1-23.3% in the area under the ROC curve.
People receive dozens, or hundreds, of notifications per day and each notification poses some risk of accidental information disclosure in the presence of others; onlookers may see notifications on a mobile phone lock screen or on the periphery of a desktop or laptop display. We quantify the prevalence of these accidental disclosures in the context of email notifications, and we study people's relevant preferences and concerns. Our results are compiled from a retrospective survey of 131 respondents, and a contextual-labeling study where 169 participants labeled 1,040 meeting-email pairs. We find that, for 53% of people, at least 1 in 10 email notifications poses an information disclosure risk, and the real or perceived severity of these risks depends both on user characteristics and on meeting or email attributes. We conclude by exploring machine learning for predicting people's comfort levels, and we present implications for the design of future social-context aware notification systems.
Recent studies show that an overwhelming majority of emails are machine-generated and sent by businesses to consumers. Many large email services are interested in extracting structured data from such emails to enable intelligent assistants. This allows experiences like being able to answer questions such as “What is the address of my hotel in New York?” or “When does my flight leave?”. A high-quality email classifier is a critical piece in such a system. In this paper, we argue that the rich formatting used in business-to-consumer emails contains valuable information that can be used to learn better representations. Most existing methods focus only on textual content and ignore the rich HTML structure of emails. We introduce RiSER (Richly Structured Email Representation) - an approach for incorporating both the structure and content of emails. RiSER projects the email into a vector representation by jointly encoding the HTML structure and the words in the email. We then use this representation to train a classifier. To our knowledge, this is the first description of a neural technique for combining formatting information along with the content to learn improved representations for richly formatted emails. Experimenting with a large corpus of emails received by users of Gmail, we show that RiSER outperforms strong attention-based LSTM baselines. We expect that these benefits will extend to other corpora with richly formatted documents. We also demonstrate with examples where leveraging HTML structure leads to better predictions.
Current reputation systems in online labor markets (e.g., Freelancer, PeoplePerHour) experience three major shortcomings: (1) reputation inflation (i.e., reputation scores are inflated to “above average” values), (2) reputation attribution (i.e., attribution of reputation scores to individual skills is unfeasible), and (3) reputation staticity (i.e., reputation scores are uniformly averaged over time). These shortcomings render online reputation systems uninformative, and sometimes even misleading. This work proposes a data-driven approach that deflates reputation scores by solving the problems of reputation attribution and staticity. The deflating process starts with projecting any random set of skills to a set of competency dimensions. For each competency dimension, a Hidden Markov Model estimates a contractor's current (but latent) competency-specific expertise. Aggregation of competency-specific estimates provides expertise predictions for any given set of required skills. Empirical analysis on 61,330 completed tasks from a major online labor market shows that the resulting estimates are deflated and that they better predict contractor performance. These results suggest a series of implications for online (labor) markets and their users.
Question answering promises a means of efficiently searching web-based content repositories such as Wikipedia. However, the systems of this type most prevalent today merely conduct their learning once in an offline training phase while, afterwards, all parameters remain static. Thus, the possibility of improvement over time is precluded. As a consequence of this shortcoming, question answering is not currently taking advantage of the wealth of feedback mechanisms that have become prominent on the web (e. g., buttons for liking, voting, or sharing).
This is the first work that introduces a question-answering system for (web-based) content repositories with an on-line mechanism for user feedback. Our efforts have resulted in QApedia - a framework for on-line improvement based on shallow user feedback. In detail, we develop a simple feedback mechanism that allows users to express whether a question was answered satisfactorily or whether a different answer is needed. Even for this simple mechanism, the implementation represents a daunting undertaking due to the complex, multi-staged operations that underlie state-of-the-art systems for neural question answering. Another challenge with regard to web-based use is that feedback is limited (and possibly even noisy), as the true labels remain unknown. We thus address these challenges through a novel combination of neural question answering and a dynamic process based on distant supervision, asynchronous updates, and an automatic validation of feedback credibility in order to mine high-quality training samples from the web for the purpose of achieving continuous improvement over time.
Our QApedia framework is the first question-answering system that continuously refines its capabilities by improving its now dynamic content repository and thus the underlying answer extraction. QApedia not only achieves state-of-the-art results over several benchmarking datasets, but we further show that it successfully manages to learn from shallow user feedback, even when the feedback is noisy or adversarial. Altogether, our extensive experimental evaluation, with more than 2,500 hours of computational experiments, demonstrates that a feedback mechanism as simple as a binary vote (which is widespread on the web) can considerably improve performance when combined with an efficient framework for continuous learning.
We study the strategic implications that arise from adding one extra option to the miners participating in the bitcoin protocol. We propose that when adding a block, miners also have the ability to pay forward an amount to be collected by the first miner who successfully extends their branch, giving them the power to influence the incentives for mining. We formulate a stochastic game for the study of such incentives and show that with this added option, smaller miners can guarantee that the best response of even substantially more powerful miners is to follow the expected behavior intended by the protocol designer.
Given posts on 'abortion' and posts on 'religion' from a political forum, how can we find topics that are discriminative and those in common? In general, (1) how can we compare and contrast two or more different ('labeled') document collections? Moreover, (2) how can we visualize the data (in 2-d or 3-d) to best reflect the similarities and differences between the collections?
We introduce (to the best of our knowledge) the first contrastive and visual topic model, called ContraVis, that jointly addresses both problems: (1) contrastive topic modeling, and (2) contrastive visualization. That is, ContraVis learns not only latent topics but also embeddings for the documents, topics and labels for visualization. ContraVis exhibits three key properties by design. It is (i) Contrastive: It enables comparative analysis of different document corpora by extracting latent discriminative and common topics across labeled documents; (ii) Visually-expressive: Different from numerous existing models, it also produces a visualization for all of the documents, labels, and the extracted topics, where proximity in the coordinate space is reflective of proximity in semantic space; (iii) Unified: It extracts topics and visual coordinates simultaneously under a joint model.
Through extensive experiments on real-world datasets, we show ContraVis's potential for providing visual contrastive analysis of multiple document collections. We show both qualitatively and quantitatively that ContraVis significantly outperforms both unsupervised and supervised state-of-the-art topic models in contrastive power, semantic coherence and visual effectiveness.
Automatically classifying extremely short texts, such as social media posts and web page titles, plays an important role in a wide range of content analysis applications. However, traditional classifiers based on bag-of-words (BoW) representations often fail in this task. The underlying reason is that document similarity cannot be accurately measured under BoW representations due to the extreme sparseness of short texts. This makes it significantly difficult to capture the generality of short texts. To address this problem, we use a regularized word mover's distance (RWMD), which can better measure distances among short texts at the semantic level. We then propose an RWMD-based centroid classifier for short texts, named RWMD-CC. Basically, RWMD-CC computes a representative semantic centroid for each category under the RWMD measure, and predicts test documents by finding the closest semantic centroid. Testing is much more efficient than the prior art of WMD-based K-nearest-neighbor classification. Experimental results indicate that RWMD-CC achieves very competitive classification performance on extremely short texts.
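The centroid-classification step can be pictured with a small sketch that hides the distance measure behind a callable: a medoid-style centroid is chosen per category and test documents are assigned to the closest one. This is only a schematic stand-in; the names and the medoid choice are assumptions, and the paper's RWMD-based semantic centroid is computed differently.

```python
def fit_centroids(docs_by_class, distance):
    """Pick, for each class, the training document with the smallest
    total distance to the other documents of that class (a medoid),
    used here as a stand-in for a semantic centroid."""
    centroids = {}
    for label, docs in docs_by_class.items():
        totals = [sum(distance(d, other) for other in docs) for d in docs]
        centroids[label] = docs[totals.index(min(totals))]
    return centroids

def predict(doc, centroids, distance):
    """Assign the label whose centroid is closest under the distance."""
    return min(centroids, key=lambda label: distance(doc, centroids[label]))
```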
Domain ontologies are widely used to index literature for the convenience of literature retrieval. Due to the high cost of manually curating key aspects from the scientific literature, automated methods are crucially required to assist the process of semantic indexing. However, it is a challenging task due to the huge number of terms and the complex hierarchical relations involved in a domain ontology. In this paper, in order to lessen the curse of dimensionality and enhance training efficiency, we propose an approach named Deep Level-wise Extreme Multi-label Learning and Classification (Deep Level-wise XMLC) to facilitate the semantic indexing of literature. Specifically, Deep Level-wise XMLC is composed of two sequential modules. The first module, deep level-wise multi-label learning, decomposes the terms of a domain ontology into multiple levels and builds a special convolutional neural network for each level, with category-dependent dynamic max pooling and macro F-measure based weight tuning. The second module, a hierarchical pointer generation model, merges the level-wise outputs into a final summarized semantic indexing. We demonstrate the effectiveness of Deep Level-wise XMLC by comparing it with several state-of-the-art methods on automatic MeSH labeling of literature from PubMed MEDLINE and on automatic labeling of AmazonCat13K.
The problem of selecting a group of vertices under certain constraints to maximize their joint centrality arises in many practical scenarios. In this paper, we extend the notion of current flow closeness centrality (CFCC) to a set of vertices in a graph, and investigate the problem of selecting a subset S that maximizes its CFCC C(S) under the cardinality constraint |S| = k. We show the NP-hardness of the problem, and propose two greedy algorithms that minimize the reciprocal of C(S). We prove their approximation ratios by showing monotonicity and supermodularity. The proposed deterministic greedy algorithm has a provable approximation factor and cubic running time, while the proposed randomized algorithm gives an approximation guarantee in nearly-linear time for any ε > 0. Extensive experiments on model and real networks demonstrate the effectiveness and efficiency of the proposed algorithms, with the randomized algorithm being applied to massive networks with more than a million vertices.
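A generic greedy template of the kind the deterministic algorithm above instantiates might look as follows; the `objective` callable stands in for evaluating the group centrality gain (the expensive step the paper's analysis and algorithms focus on), and the function names are illustrative.

```python
def greedy_select(vertices, k, objective):
    """Generic cardinality-constrained greedy selection: repeatedly add
    the vertex with the largest marginal gain in the set objective until
    k vertices are chosen.  For group centrality, `objective(S)` would
    return the quantity being maximized (e.g., minus the reciprocal of
    the group current flow closeness)."""
    selected = set()
    for _ in range(k):
        best = max((v for v in vertices if v not in selected),
                   key=lambda v: objective(selected | {v}) - objective(selected))
        selected.add(best)
    return selected
```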
Node classification and graph classification are two graph learning problems that predict the class label of a node and the class label of a graph respectively. A node of a graph usually represents a real-world entity, e.g., a user in a social network, or a protein in a protein-protein interaction network. In this work, we consider a more challenging but practically useful setting, in which a node itself is a graph instance. This leads to a hierarchical graph perspective which arises in many domains such as social network, biological network and document collection. For example, in a social network, a group of people with shared interests forms a user group, whereas a number of user groups are interconnected via interactions or common members. We study the node classification problem in the hierarchical graph where a “node” is a graph instance, e.g., a user group in the above example. As labels are usually limited in real-world data, we design two novel semi-supervised solutions named SEmi-supervised grAph cLassification via Cautious/Active Iteration (or SEAL-C/AI in short). SEAL-C/AI adopt an iterative framework that takes turns to build or update two classifiers, one working at the graph instance level and the other at the hierarchical graph level. To simplify the representation of the hierarchical graph, we propose a novel supervised, self-attentive graph embedding method called SAGE, which embeds graph instances of arbitrary size into fixed-length vectors. Through experiments on synthetic data and Tencent QQ group data, we demonstrate that SEAL-C/AI not only outperform competing methods by a significant margin in terms of accuracy/Macro-F1, but also generate meaningful interpretations of the learned representations.
A fundamental question in any peer-to-peer ridesharing system is how to, both effectively and efficiently, dispatch user's ride requests to the right driver in real time. Traditional rule-based solutions usually work on a simplified problem setting, which requires a sophisticated hand-crafted weight design for either centralized authority control or decentralized multi-agent scheduling systems. Although recent approaches have used reinforcement learning to provide centralized combinatorial optimization algorithms with informative weight values, their single-agent setting can hardly model the complex interactions between drivers and orders. In this paper, we address the order dispatching problem using multi-agent reinforcement learning (MARL), which follows the distributed nature of the peer-to-peer ridesharing problem and possesses the ability to capture the stochastic demand-supply dynamics in large-scale ridesharing scenarios. Being more reliable than centralized approaches, our proposed MARL solutions could also support fully distributed execution through recent advances in the Internet of Vehicles (IoV) and the Vehicle-to-Network (V2N). Furthermore, we adopt the mean field approximation to simplify the local interactions by taking an average action among neighborhoods. The mean field approximation is capable of globally capturing dynamic demand-supply variations by propagating many local interactions between agents and the environment. Our extensive experiments have shown the significant improvements of MARL order dispatching algorithms over several strong baselines on the accumulated driver income (ADI), and order response rate measures. Besides, the simulated experiments with real data have also justified that our solution can alleviate the supply-demand gap during the rush hours, thus possessing the capability of reducing traffic congestion.
Many e-commerce platforms today allow users to give their rating scores and reviews on items as well as to establish social relationships with other users. As a result, such platforms accumulate heterogeneous data including numeric scores, short textual reviews, and social relationships. However, many recommender systems only consider historical user feedback in modeling user preferences. More specifically, most existing recommendation approaches only use rating scores but ignore reviews and social relationships in the user-generated data. In this paper, we propose TSNPF, a latent factor model to effectively capture user preferences and item features. Employing Poisson factorization, TSNPF fully exploits the wealth of information in rating scores, review text and social relationships altogether. It extracts topics of items and users from the review text and makes use of similarities between user pairs with social relationships, which results in a comprehensive understanding of user preferences. Experimental results on real-world datasets demonstrate that our TSNPF approach is highly effective at recommending items to users.
Tips, a compact and concise form of reviews, have received less attention from researchers. In this paper, we investigate the task of tips generation by considering “persona” information, which captures the intrinsic language style of the users or the different characteristics of the product items. In order to exploit the persona information, we propose a framework based on adversarial variational auto-encoders (aVAE) for persona modeling from the historical tips and reviews of users and items. The latent variables from aVAE are regarded as persona embeddings. Besides representing persona using the latent embeddings, we design a persona memory for storing the persona-related words for users and items. A pointer network is used to retrieve persona wordings from the memory when generating tips. Moreover, the persona embeddings are used as latent factors by a rating prediction component to predict the sentiment of a user over an item. Finally, the persona embeddings and the sentiment information are incorporated into a recurrent neural network based tips generation component. Extensive experimental results are reported and discussed to elaborate the peculiarities of our framework.
Travel time estimation of a given route with respect to real-time traffic conditions is extremely useful for many applications like route planning. We argue that it is even more useful to estimate the travel time distribution, from which we can derive the expected travel time as well as the uncertainty. In this paper, we develop a deep generative model - DeepGTT - to learn the travel time distribution for any route by conditioning on the real-time traffic. DeepGTT interprets the generation of travel time using a three-layer hierarchical probabilistic model. In the first layer, we present two techniques, amortization and spatial smoothness embeddings, to share statistical strength among different road segments; a convolutional neural net based representation learning component is also proposed to capture the dynamically changing real-time traffic condition. In the middle layer, a nonlinear factorization model is developed to generate an auxiliary random variable, i.e., speed. The introduction of this middle layer separates the static spatial features from the dynamically changing real-time traffic conditions, allowing us to incorporate the heterogeneous influencing factors into a single model. In the last layer, an attention mechanism based function is proposed to collectively generate the observed travel time. DeepGTT describes the generation process in a reasonable manner, and thus it not only produces more accurate results but also is more efficient. On a real-world large-scale data set, we show that DeepGTT produces substantially better results than state-of-the-art alternatives in two tasks: travel time estimation and route recovery from sparse trajectory data.
Crowd-sourcing is a cheap and popular means of creating training and evaluation datasets for machine learning; however, it poses the problem of 'truth inference', as individual workers cannot be wholly trusted to provide reliable annotations. Research into models of annotation aggregation attempts to infer a latent 'true' annotation, which has been shown to improve the utility of crowd-sourced data. However, existing techniques beat simple baselines only in low-redundancy settings, where the number of annotations per instance is low (≤ 3), or in situations where workers are unreliable and produce low-quality annotations (e.g., through spamming, random, or adversarial behaviours). As we show, datasets produced by crowd-sourcing are often not of this type: the data is highly redundantly annotated (≥ 5 annotations per instance), and the vast majority of workers produce high-quality outputs. In these settings, the majority vote heuristic performs very well, and most truth inference models underperform this simple baseline. We propose a novel technique, based on a Bayesian graphical model with conjugate priors, and simple iterative expectation-maximisation inference. Our technique produces performance competitive with the state-of-the-art benchmark methods, and is the only method that significantly outperforms the majority vote heuristic at one-sided level 0.025, shown by significance tests. Moreover, our technique is simple, is implemented in only 50 lines of code, and trains in seconds.
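For readers unfamiliar with truth inference, the sketch below shows a generic one-coin worker-reliability EM in the Dawid-Skene spirit, alternating between inferring soft true labels and re-estimating worker accuracies. It is not the authors' conjugate-prior model, only a minimal illustration of the expectation-maximisation style of inference the abstract refers to (assumes at least two label classes).

```python
import numpy as np
from collections import defaultdict

def em_truth_inference(annotations, n_labels, n_iters=50):
    """One-coin worker-reliability EM: alternately (E) infer a soft true
    label per item from reliability-weighted worker votes and (M)
    re-estimate each worker's accuracy.
    annotations: list of (item, worker, label) triples, labels in [0, n_labels)."""
    items = sorted({a[0] for a in annotations})
    workers = sorted({a[1] for a in annotations})
    acc = {w: 0.8 for w in workers}                      # initial reliability
    post = {i: np.ones(n_labels) / n_labels for i in items}
    for _ in range(n_iters):
        # E-step: posterior over the true label of each item
        for i in items:
            logp = np.zeros(n_labels)
            for (it, w, l) in annotations:
                if it != i:
                    continue
                for y in range(n_labels):
                    p = acc[w] if y == l else (1 - acc[w]) / (n_labels - 1)
                    logp[y] += np.log(p + 1e-12)
            p = np.exp(logp - logp.max())
            post[i] = p / p.sum()
        # M-step: worker accuracy = expected fraction of correct votes
        hits, counts = defaultdict(float), defaultdict(float)
        for (i, w, l) in annotations:
            hits[w] += post[i][l]
            counts[w] += 1.0
        acc = {w: hits[w] / counts[w] for w in workers}
    return {i: int(np.argmax(post[i])) for i in items}, acc
```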
Sources in computer-based collaborative systems, such as webpages, can help employees connect and cooperate with each other. It is natural to enable such systems to look not only for documents but also for experts. In this paper, we study the problem of expert retrieval in enterprise corpora: given a topic, also known as a query containing a set of words, identify a ranked list of candidate experts who have expertise on the topic. To tackle the problem, we propose an unsupervised semantic two-player minimax game, i.e., our unsupervised semantic generative adversarial networks (USGAN). Unlike almost all previous generative adversarial network-based algorithms, which require ground-truth training data, our USGAN is an unsupervised semantic expert retrieval algorithm that consists of a discriminative network and a generative network aiming at capturing the representations of words and experts in an unsupervised way. Candidates that have similar semantic representations to that of the topic are retrieved as relevant to the topic. Our USGAN may provide inspiration on how to extend the standard GAN and its variants in unsupervised ways to address other retrieval tasks where labelled data are missing. Experimental results on public datasets validate the effectiveness of the proposed expert retrieval algorithm.
We study the problem of estimating the total volume of queries of a specific domain, which were submitted to the Google search engine in a given time period. Our statistical model assumes a Zipf's law distribution of the population in the reference domain, and a non-uniform or noisy sampling of queries. Parameters of the distribution are estimated using nonlinear least square regression. Estimations with errors are then derived for the total number of queries and for the total number of searches (volume). We apply the method on the recipes and cooking domain, where a sample of queries is collected by crawling popular Italian websites specialized on this domain. The relative volumes of queries in the sample are computed using Google Trends, and transformed to absolute frequencies after estimating a scaling factor. Our model estimates that the volume of Italian recipes and cooking queries submitted to Google in 2017 and with at least 10 monthly searches consists of 7.2B searches.
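A minimal sketch of the Zipf-based extrapolation, assuming clean uniform sampling and the availability of NumPy and SciPy (the paper additionally corrects for non-uniform, noisy sampling and estimates a scaling factor from Google Trends): fit frequency = C / rank^s by nonlinear least squares and sum the fitted curve over the assumed query population. Function names and defaults are illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

def zipf(rank, C, s):
    """Zipf's law: frequency of the query at a given rank."""
    return C / np.power(rank, s)

def estimate_total_volume(sample_freqs, population_size):
    """Fit a Zipf curve to the sampled query frequencies (sorted into
    rank order) with nonlinear least squares, then extrapolate the total
    number of searches over the assumed population of queries."""
    freqs = np.sort(np.asarray(sample_freqs, dtype=float))[::-1]
    ranks = np.arange(1, len(freqs) + 1)
    (C, s), _ = curve_fit(zipf, ranks, freqs, p0=(freqs[0], 1.0))
    all_ranks = np.arange(1, population_size + 1)
    return zipf(all_ranks, C, s).sum(), (C, s)
```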
Roughly one in ten Americans move every year, bringing significant social and economic impact to both the places they move from and places they move to. We show that migration intent mined from internet search queries can forecast domestic migration and provide new insights beyond government data. We extract from a major search engine (Bing.com) 120 million raw queries with migration intent from 2014 to 2016, including origin and destination geographies, and the specific intent for migration such as whether the potential migration is housing or employment related. Using these queries, we map U.S. state level migration flows, validate them against government data, and demonstrate that adding search query-based metrics explains variance in migration prediction above robust baseline models. In addition, we show that the specific migration intent extracted from these queries unpack the differential demands of migrants with different demographic backgrounds and geographic interests. Examples include interactions between age, education, and income, and migration attributes such as buying versus renting housing and employment in technology versus manual labor job sectors. We discuss how local government, policy makers, and computational social scientists can benefit from this information.
Relation extraction is an important task in structuring the content of text data, and becomes especially challenging when learning with weak supervision, where only a limited number of labeled sentences are given and a large number of unlabeled sentences are available. Most existing work exploits unlabeled data based on the ideas of self-training (i.e., bootstrapping a model) and self-ensembling (e.g., ensembling multiple model variants). However, these methods either suffer from the issue of semantic drift, or do not fully capture the problem characteristics of relation extraction. In this paper, we leverage a key insight that retrieving sentences expressing a relation is a dual task of predicting the relation label for a given sentence; the two tasks are complementary to each other and can be optimized jointly for mutual enhancement. To model this intuition, we propose DualRE, a principled framework that introduces a retrieval module which is jointly trained with the original relation prediction module. In this way, high-quality samples selected by the retrieval module from unlabeled data can be used to improve the prediction module, and vice versa. Experimental results on two public datasets as well as case studies demonstrate the effectiveness of the DualRE approach.
Personalized PageRank (PPR) has numerous applications, such as link prediction and recommendation systems for social networks, which often require the full PPR to be known. Moreover, most real-life graphs are edge-weighted, e.g., the interactions between users on the Facebook network. However, it is computationally difficult to compute the full PPR, especially on large graphs, not to mention that most existing approaches do not consider the weights of edges. In particular, existing approaches cannot handle graphs with billions of edges on a moderate-size cluster. To address this problem, this paper presents a novel study on the computation of full edge-weighted PPR on large graphs using a distributed computing framework. Specifically, we employ a Monte Carlo approximation that performs a large number of random walks from each node of the graph, and exploit a parallel pipeline framework to reduce the overall running time of the full PPR computation. Based on that, we develop several optimization techniques which (i) alleviate the issue of large nodes that could explode the memory space, (ii) pre-compute short walks for small nodes, which largely speeds up the computation of random walks, and (iii) optimize the number of random walks to compute in each pipeline, which significantly reduces the overhead. With extensive experiments on a variety of real-life graph datasets, we demonstrate that our solution is several orders of magnitude faster than the state of the art, and meanwhile largely outperforms the baseline algorithms in terms of accuracy.
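The Monte Carlo building block can be sketched as follows: random walks with restart from a source node, with transitions proportional to edge weights, and visit counts normalised into a PPR estimate. This omits the paper's distributed pipeline and the optimisations for large and small nodes; the names, defaults, and estimator details are assumptions.

```python
import random
from collections import Counter, defaultdict

def monte_carlo_ppr(adj, source, alpha=0.15, n_walks=10000, rng=None):
    """Estimate the personalized PageRank vector of `source` by running
    random walks with restart probability `alpha`.  `adj` maps a node to
    a list of (neighbour, edge_weight) pairs; transitions are taken with
    probability proportional to edge weight, and normalised visit counts
    serve as the PPR estimate."""
    rng = rng or random.Random(0)
    visits = Counter()
    for _ in range(n_walks):
        node = source
        while True:
            visits[node] += 1
            if rng.random() < alpha or not adj[node]:
                break
            nbrs, weights = zip(*adj[node])
            node = rng.choices(nbrs, weights=weights, k=1)[0]
    total = sum(visits.values())
    return {v: c / total for v, c in visits.items()}

# toy weighted graph
adj = defaultdict(list, {0: [(1, 2.0), (2, 1.0)], 1: [(0, 1.0)], 2: [(0, 1.0)]})
ppr = monte_carlo_ppr(adj, source=0)
```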
The task of fashion recommendation includes two main challenges: visual understanding and visual matching. Visual understanding aims to extract effective visual features. Visual matching aims to model a human notion of compatibility to compute a match between fashion items. Most previous studies rely on recommendation loss alone to guide visual understanding and matching. Although the features captured by these methods describe basic characteristics (e.g., color, texture, shape) of the input items, they are not directly related to the visual signals of the output items (to be recommended). This is problematic because the aesthetic characteristics (e.g., style, design), based on which we can directly infer the output items, are lacking. Features are learned under the recommendation loss alone, where the supervision signal is simply whether the given two items are matched or not.
To address this problem, we propose a neural co-supervision learning framework, called the FAshion Recommendation Machine (FARM). FARM improves visual understanding by incorporating the supervision of generation loss, which we hypothesize to be able to better encode aesthetic information. FARM enhances visual matching by introducing a novel layer-to-layer matching mechanism to fuse aesthetic information more effectively, and meanwhile avoiding paying too much attention to the generation quality and ignoring the recommendation performance.
Extensive experiments on two publicly available datasets show that FARM outperforms state-of-the-art models on outfit recommendation, in terms of AUC and MRR. Detailed analyses of generated and recommended items demonstrate that FARM can encode better features and generate high quality images as references to improve recommendation performance.
Automatic question generation is an important technique that can improve the training of question answering, help chatbots to start or continue a conversation with humans, and provide assessment materials for educational purposes. Existing neural question generation models are not sufficient, mainly due to their inability to properly model the process of how each word in the question is selected, i.e., whether it is copied from the given passage or generated from a vocabulary. In this paper, we propose our Clue Guided Copy Network for Question Generation (CGC-QG), which is a sequence-to-sequence generative model with a copying mechanism, yet employing a variety of novel components and techniques to boost the performance of question generation. In CGC-QG, we design a multi-task labeling strategy to identify whether a question word should be copied from the input passage or be generated instead, guiding the model to learn the accurate boundaries between copying and generation. Furthermore, our input passage encoder takes as input, among a diverse range of other features, the prediction made by a clue word predictor, which helps identify whether each word in the input passage is a potential clue to be copied into the target question. The clue word predictor is designed based on a novel application of Graph Convolutional Networks to a syntactic dependency tree representation of each passage, and is thus able to predict clue words based only on their context in the passage and their relative positions to the answer in the tree. We jointly train the clue prediction as well as question generation with multi-task learning and a number of practical strategies to reduce the complexity. Extensive evaluations show that our model significantly improves the performance of question generation and outperforms all previous state-of-the-art neural question generation models by a substantial margin.
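A toy version of the copy-versus-generate labeling idea is shown below: question tokens that appear (non-trivially) in the passage are labeled COPY, the rest GENERATE. The actual strategy in CGC-QG is more refined (it also involves clue word prediction over a dependency tree); the stop-word list and label names here are illustrative.

```python
def copy_labels(passage_tokens, question_tokens):
    """Label each question token as COPY if it (non-trivially) appears in
    the passage, otherwise GENERATE; such labels can supervise a copy
    mechanism.  Stop words are excluded so that frequent function words
    are not forced through the copy path."""
    stop = {"the", "a", "an", "of", "to", "is", "was", "what", "who", "?"}
    passage = {t.lower() for t in passage_tokens}
    return ["COPY" if t.lower() in passage and t.lower() not in stop
            else "GENERATE"
            for t in question_tokens]

passage = "The Eiffel Tower was completed in 1889".split()
question = "When was the Eiffel Tower completed ?".split()
print(copy_labels(passage, question))
# ['GENERATE', 'GENERATE', 'GENERATE', 'COPY', 'COPY', 'COPY', 'GENERATE']
```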
Click-Through Rate prediction is an important task in recommender systems, which aims to estimate the probability that a user will click on a given item. Recently, many deep models have been proposed to learn low-order and high-order feature interactions from original features. However, since useful interactions are always sparse, it is difficult for a DNN to learn them effectively under a large number of parameters. In real scenarios, artificial features are able to improve the performance of deep models (such as Wide & Deep Learning), but feature engineering is expensive and requires domain knowledge, making it impractical in different scenarios. Therefore, it is necessary to augment the feature space automatically. In this paper, we propose a novel Feature Generation by Convolutional Neural Network (FGCNN) model with two components: Feature Generation and Deep Classifier. Feature Generation leverages the strength of CNNs to generate local patterns and recombine them to generate new features. Deep Classifier adopts the structure of IPNN to learn interactions from the augmented feature space. Experimental results on three large-scale datasets show that FGCNN significantly outperforms nine state-of-the-art models. Moreover, when applying some state-of-the-art models as the Deep Classifier, better performance is always achieved, showing the great compatibility of our FGCNN model. This work explores a novel direction for CTR prediction: it is quite useful to reduce the learning difficulties of DNNs by automatically identifying important features.
The problem of computing the (α, β)-core in a bipartite graph for given α and β is a fundamental problem in bipartite graph analysis and can be used in many applications such as online group recommendation, fraudster detection, etc. The existing solution to computing the (α, β)-core needs to traverse the entire bipartite graph once. Considering that real bipartite graphs can be very large and that requests to compute the (α, β)-core can be issued frequently in real applications, the existing solution is too expensive. In this paper, we present an efficient algorithm based on a novel index such that the algorithm runs in linear time with respect to the result size (thus, the algorithm is optimal, since at least linear time is needed to output the result). We prove that the index only requires O(m) space, where m is the number of edges in the bipartite graph. Moreover, we devise an efficient algorithm with time complexity O(d · m) for index construction, where d is bounded by, and in practice much smaller than, a graph-dependent quantity. We also discuss efficient algorithms to maintain the index when the bipartite graph is dynamically updated, as well as a parallel implementation of the index construction algorithm. Experimental results on real and synthetic graphs (with more than 1 billion edges) demonstrate that our algorithms achieve up to 5 orders of magnitude speedup for computing the (α, β)-core and up to 3 orders of magnitude speedup for index construction, respectively, compared with existing techniques.
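For reference, the simple peeling procedure that computes an (α, β)-core by a single traversal (the per-query baseline the index is designed to avoid repeating) can be sketched as follows; the data layout and names are assumptions.

```python
from collections import deque

def alpha_beta_core(left_adj, right_adj, alpha, beta):
    """Peel a bipartite graph until every remaining left vertex has degree
    >= alpha and every remaining right vertex has degree >= beta.
    left_adj / right_adj map vertices to sets of neighbours on the other side."""
    deg_l = {u: len(vs) for u, vs in left_adj.items()}
    deg_r = {v: len(us) for v, us in right_adj.items()}
    removed_l, removed_r = set(), set()
    queue = deque([("L", u) for u, d in deg_l.items() if d < alpha] +
                  [("R", v) for v, d in deg_r.items() if d < beta])
    while queue:
        side, x = queue.popleft()
        if side == "L":
            if x in removed_l:
                continue
            removed_l.add(x)
            for v in left_adj[x]:
                if v not in removed_r:
                    deg_r[v] -= 1
                    if deg_r[v] < beta:
                        queue.append(("R", v))
        else:
            if x in removed_r:
                continue
            removed_r.add(x)
            for u in right_adj[x]:
                if u not in removed_l:
                    deg_l[u] -= 1
                    if deg_l[u] < alpha:
                        queue.append(("L", u))
    return (set(left_adj) - removed_l, set(right_adj) - removed_r)
```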
With the rapid development of the Internet, millions of documents, such as news and web pages, are generated every day. Mining the topics and knowledge in them has attracted a lot of interest in both academia and industry. As one of the prevalent unsupervised data mining tools, topic models are usually explored as probabilistic generative models for large collections of texts. Traditional probabilistic topic models tend to find a closed-form solution of model parameters and approach the intractable posteriors via approximation methods, which usually leads to inaccurate parameter inference and low efficiency when it comes to quite large volumes of data. Recently, the emerging trend of neural variational inference can overcome the above issues, offering a scalable and powerful deep generative framework for modeling latent topics via neural networks. Interestingly, a common assumption of most neural variational topic models is that topics are independent and unrelated to each other. However, this assumption is unreasonable in many practical scenarios. In this paper, we propose a novel Centralized Transformation Flow to capture the correlations among topics by reshaping topic distributions. Furthermore, we present the Transformation Flow Lower Bound to improve the performance of the proposed model. Extensive experiments on two standard benchmark datasets have well validated the effectiveness of the proposed approach.
Search engines encounter a time vs. space trade-off: search responsiveness (i.e., a short query response time) comes at the cost of increased index storage. We propose a hybrid method which uses both (a) the recently published mapping-matrix-style index BitFunnel (BF) for search efficiency, and (b) the state-of-the-art Partitioned Elias-Fano (PEF) inverted-index compression method. We use this proposed hybrid method to minimize time while satisfying a fixed space constraint, and to minimize space while satisfying a fixed time constraint. Each document is stored using either BF or PEF, and we use a local search strategy to find an approximately optimal BF-PEF partition. Since performing full experiments on each candidate BF-PEF partition is impractically slow, we use a regression model to predict the time and space costs resulting from candidate partitions (space accuracy 97.6%; time accuracy 95.2%). Compared with a hybrid mathematical index (Ottaviano et al., 2015), the time cost is reduced by up to 47% without significantly exceeding its size. Compared with three mathematical encoding methods, the hybrid BF-PEF index performs list intersections around 16% to 76% faster (without significantly increasing the index size). Compared with BF, the index size is reduced by 45% while maintaining an intersection time comparable to that of BF.
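A schematic of the local search over per-document index choices, assuming the regression predictors are available as callables: start from an all-PEF assignment and flip documents to BF whenever predicted time improves within the space budget. This is a simplified sketch of one plausible search strategy, not the paper's exact procedure; all names are illustrative.

```python
def local_search_partition(docs, predict_time, predict_space, space_budget,
                           n_passes=3):
    """Greedy local search over per-document index choices ("BF" or "PEF"):
    start from all-PEF (compact), then flip a document's choice whenever
    the predicted query time improves and the predicted space stays within
    budget.  predict_time/predict_space stand in for learned cost models."""
    assign = {d: "PEF" for d in docs}
    for _ in range(n_passes):
        improved = False
        for d in docs:
            other = "BF" if assign[d] == "PEF" else "PEF"
            trial = dict(assign)
            trial[d] = other
            if (predict_space(trial) <= space_budget and
                    predict_time(trial) < predict_time(assign)):
                assign = trial
                improved = True
        if not improved:
            break
    return assign
```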
Diverse time series are measured in almost all fields, including biology, economics and sociology. A common challenge in using such data is the imputation of missing values with reasonable ones. Most existing approaches to data imputation assume that individuals' observations are independent of each other, which is rarely the case in the real world. In this paper, we study the social-aware time series imputation problem. Given a social network that represents social relations between individuals, we propose a sequential encoder-decoder-based framework and build a connection between the missing observations and the social context. In particular, the proposed model employs the attention mechanism to incorporate social context and temporal context into the imputation task. Experimental results based on two real-world datasets demonstrate that our approach outperforms 11 different baseline methods.
Facial recognition is a key enabling component for emerging Internet of Things (IoT) services such as smart homes or responsive offices. Through the use of deep neural networks, facial recognition has achieved excellent performance. However, this is only possible when trained with hundreds of images of each user in different viewing and lighting conditions. Clearly, this level of effort in enrolment and labelling is impossible for widespread deployment and adoption. Inspired by the fact that most people carry smart wireless devices with them, e.g. smartphones, we propose to use this wireless identifier as a supervisory label. This allows us to curate a dataset of facial images that are unique to a certain domain, e.g. a set of people in a particular office. This custom corpus can then be used to finetune existing pre-trained models, e.g. FaceNet. However, due to the vagaries of wireless propagation in buildings, the supervisory labels are noisy and weak. We propose a novel technique, AutoTune, which learns and refines the association between a face and a wireless identifier over time by increasing the inter-cluster separation and minimizing the intra-cluster distance. Through extensive experiments with multiple users on two sites, we demonstrate the ability of AutoTune to design an environment-specific, continually evolving facial recognition system entirely without user effort.
User behaviors are widely used as implicit feedback on user preferences in personalized information systems. In previous works and online applications, users' click signals are used as positive feedback for ranking, recommendation, evaluation, etc. However, when users click on a piece of low-quality news, they are more likely to have negative experiences and different reading behaviors. Hence, ignoring the quality effects of news may lead to the misinterpretation of user behaviors and mislead subsequent studies. To address these issues, we conducted an in-depth user study in a mobile news streaming scenario to investigate whether and how the quality of news may affect user preferences and user behaviors. Firstly, we verify that quality does affect user preferences, and that low-quality news results in a lower preference. We further find that this effect varies with both interaction phases and the user's interest in the topic of the news. Secondly, we inspect how users interact with low-quality news. Surprisingly, we find that users are more likely to click on low-quality news because of its persuasive titles. Moreover, users read less and more slowly, with fewer revisits and examinations, while reading low-quality news.
Based on the quality effects we have discovered, we propose the Preference Behavior Quality (PBQ) probability model, which incorporates quality into traditional behavior-only implicit feedback. The significant improvement demonstrates that incorporating quality helps build better implicit feedback. Given the importance and difficulty of collecting news quality labels, we further investigate how to identify quality automatically. Based on point-wise and pair-wise distinguishing experiments, we show that user behaviors, especially reading ratio and dwell time, are highly indicative of news quality. Our research comprehensively analyzes the effects of quality on user preferences and behaviors, and raises awareness of item quality in interpreting user behaviors and estimating user preferences.
Answer selection is an important problem in community question answering (CQA), as it enables the distilling of reliable information and knowledge. Most existing approaches tackle this problem as a text matching task. However, they ignore the influence of the community in voting the best answers. Answer quality is highly correlated with semantic relevance and user expertise in CQA. In this paper, we formalize the answer selection problem from the user expertise view, considering both the semantic relevance in question-answer pair and user expertise in question-user pair. We design a novel matching function, explicitly modeling the influence of user expertise in community acceptance. Moreover, we introduce latent user vectors into the representation learning of answer, capturing the implicit topic interests in learned user vectors. Extensive experiments on two datasets from real world CQA sites demonstrate that our model outperforms state-of-the-art approaches for answer selection in CQA. Furthermore, the user representations learned by our model provide us a quantitative way to understand both the authority and topic-sensitive interests of users.
Explainability and effectiveness are two key aspects for building recommender systems. Prior efforts mostly focus on incorporating side information to achieve better recommendation performance. However, these methods have some weaknesses: (1) the predictions of neural network-based embedding methods are hard to explain and debug; (2) symbolic, graph-based approaches (e.g., meta path-based models) require manual effort and domain knowledge to define patterns and rules, and ignore item association types (e.g., substitutable and complementary). In this paper, we propose a novel joint learning framework to integrate the induction of explainable rules from a knowledge graph with the construction of a rule-guided neural recommendation model. The framework encourages the two modules to complement each other in generating effective and explainable recommendations: 1) inductive rules, mined from item-centric knowledge graphs, summarize common multi-hop relational patterns for inferring different item associations and provide human-readable explanations for model predictions; 2) the recommendation module can be augmented by induced rules and thus gains better generalization ability when dealing with the cold-start issue. Extensive experiments show that our proposed method achieves significant improvements in item recommendation over baselines on real-world datasets. Our model demonstrates robust performance over “noisy” item knowledge graphs, generated by linking item names to related entities.
An effective virtual agent (VA) that serves humans not only completes tasks efficaciously, but also manages its interpersonal relationships with users judiciously. Although past research has studied how agents apologize or seek help appropriately, there lacks a comprehensive study of how to design an emotionally intelligent (EI) virtual agent. In this paper, we propose to improve a VA's perceived EI by equipping it with personality-driven responsive expression of emotions. We conduct a within-subject experiment to verify this approach using a medical assistant VA. We ask participants to observe how the agent (displaying a dominant or submissive trait, or having no personality) handles user challenges when issuing reminders and rate its EI. Results show that simply being emotionally expressive is insufficient for suggesting VAs as fully emotionally intelligent. Equipping such VAs with a consistent, distinctive personality trait (especially submissive) can convey a significantly stronger sense of EI in terms of the ability to perceive, use, understand, and manage emotions, and can better mitigate user challenges.
Recently, several JavaScript-based deep learning frameworks have emerged, making it possible to perform deep learning tasks directly in browsers. However, little is known about what, and how well, we can do with these frameworks for deep learning in browsers. To bridge the knowledge gap, in this paper, we conduct the first empirical study of deep learning in browsers. We survey the 7 most popular JavaScript-based deep learning frameworks, investigating to what extent deep learning tasks have been supported in browsers so far. Then we measure the performance of different frameworks when running different deep learning tasks. Finally, we investigate the performance gap between deep learning in browsers and on native platforms by comparing the performance of TensorFlow.js and TensorFlow in Python. Our findings could help application developers, deep-learning framework vendors and browser vendors to improve the efficiency of deep learning in browsers.
Email continues to be one of the most commonly used forms of online communication. As inboxes grow larger, users rely more heavily on email search to effectively find what they are looking for. However, previous studies on email have been exclusive to enterprises with access to large user logs, or limited to small-scale qualitative surveys and analyses of limited public datasets such as Enron and Avocado. In this work, we propose a novel framework that allows for experimentation with real email data. In particular, our approach provides a realistic way of simulating email re-finding tasks in a crowdsourcing environment using the workers' personal email data. We use our approach to experiment with various ranking functions and quality degradation to measure how users behave under different conditions, and conduct analysis across various email types and attributes. Our results show that user behavior can be significantly impacted as a result of the quality of the search ranker, but only when differences in quality are very pronounced. Our analysis confirms that time-based ranking begins to fail as email age increases, suggesting that hybrid approaches may help bridge the gap between relevance-based rankers and the traditional time-based ranking approach. Finally, we also found that users typically reformulate search queries by either entirely re-writing the query, or simply appending terms to the query, which may have implications for email query suggestion facilities.
Detecting and understanding implicit measures of user satisfaction are essential for enhancing recommendation quality. When users interact with a recommendation system, they leave behind fine grained traces of interaction signals, which contain valuable information that could help gauging user satisfaction. User interaction with such systems is often motivated by a specific need or intent, often not explicitly specified by the user, but can nevertheless inform on how the user interacts with, and the extent to which the user is satisfied by the recommendations served. In this work, we consider a complex recommendation scenario, called Slate Recommendation, wherein a user is presented with an ordered set of collections, called slates, in a specific page layout. We focus on the context of music streaming and leverage fine-grained user interaction signals to tackle the problem of predicting user satisfaction.
We hypothesize that user interactions are conditional on the specific intent users have when interacting with a recommendation system, and highlight the need for explicitly considering user intent when interpreting interaction signals. We present diverse approaches to identify user intents (interviews, surveys and a quantitative approach) and identify a set of common intents users have in a music streaming recommendation setting. Additionally, we identify the importance of shared learning across intents and propose a multi-level hierarchical model for user satisfaction prediction that leverages user intent information alongside interaction signals. Our findings from extensive experiments on large-scale real-world data demonstrate (i) the utility of considering different interaction signals, (ii) the role of intents in interpreting user interactions and (iii) the interplay between interaction signals and intents in predicting user satisfaction.
To provide stable and responsive public SPARQL query services, data providers enforce quotas on server usage. Queries which exceed these quotas are interrupted and deliver partial results. Such interruption is not an issue if it is possible to resume query execution afterward. Unfortunately, there is no preemption model for the Web that allows for suspending and resuming SPARQL queries. In this paper, we propose SaGe: a SPARQL query engine based on Web preemption. SaGe allows SPARQL queries to be suspended by the Web server after a fixed time quantum and resumed upon client request. Web preemption is tractable only if its cost in time is negligible compared to the time quantum. The challenge is to support the full SPARQL query language while keeping the cost of preemption negligible. Experimental results demonstrate that SaGe outperforms existing SPARQL query processing approaches by several orders of magnitude in terms of the average total query execution time and the time to first results.
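The preemption idea can be illustrated with a toy single-operator example: run for at most one time quantum, return partial results plus resumable state, and let the client send the state back to continue. A real preemptable SPARQL engine must serialise the state of a whole pipeline of operators, which is the hard part the paper addresses; the class and parameter names below are illustrative.

```python
import time

class PreemptableScan:
    """Minimal illustration of Web preemption for a single scan operator:
    run for at most `quantum` seconds, then return the results produced
    so far together with a `next_offset` the client sends back to resume."""
    def __init__(self, triples, quantum=0.075):
        self.triples = triples
        self.quantum = quantum

    def run(self, offset=0):
        deadline = time.monotonic() + self.quantum
        results = []
        i = offset
        while i < len(self.triples) and time.monotonic() < deadline:
            results.append(self.triples[i])   # stand-in for operator work
            i += 1
        next_offset = i if i < len(self.triples) else None
        return results, next_offset

# client loop: keep resuming until the server reports completion
scan = PreemptableScan(list(range(1000)))
offset, out = 0, []
while offset is not None:
    page, offset = scan.run(offset)
    out.extend(page)
```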
Email accounts represent an enticing target for attackers, both for the information they contain and the root of trust they provide to other connected web services. While defense-in-depth approaches such as phishing detection, risk analysis, and two-factor authentication help to stem large-scale hijackings, targeted attacks remain a potent threat due to the customization and effort involved. In this paper, we study a segment of targeted attackers known as “hack for hire” services to understand the playbook that attackers use to gain access to victim accounts. Posing as buyers, we interacted with 27 English, Russian, and Chinese blackmarket services, only five of which succeeded in attacking synthetic (though realistic) identities we controlled. Attackers primarily relied on tailored phishing messages, with enough sophistication to bypass SMS two-factor authentication. However, despite the ability to successfully deliver account access, the market exhibited low volume, poor customer service, and had multiple scammers. As such, we surmise that retail email hijacking has yet to mature to the level of other criminal market segments.
In this work, we propose a new, fast and scalable method for anomaly detection in large time-evolving graphs. It may be a static graph with dynamic node attributes (e.g. time-series), or a graph evolving in time, such as a temporal network. We define an anomaly as a localized increase in temporal activity in a cluster of nodes. The algorithm is unsupervised. It is able to detect and track anomalous activity in a dynamic network despite the noise from multiple interfering sources.
We use the Hopfield network model of memory to combine the graph and time information. We show that anomalies can be spotted with good precision using a memory network. The presented approach is scalable and we provide a distributed implementation of the algorithm.
To demonstrate its efficiency, we apply it to two datasets: Enron Email dataset and Wikipedia page views. We show that the anomalous spikes are triggered by the real-world events that impact the network dynamics. Besides, the structure of the clusters and the analysis of the time evolution associated with the detected events reveals interesting facts on how humans interact, exchange and search for information, opening the door to new quantitative studies on collective and social behavior on large and dynamic datasets.
The goal of Information Retrieval (IR) systems is to satisfy searchers' Information Need (IN). Our research focuses on next-generation IR engines, which can proactively detect, identify, and serve INs without receiving explicit queries. It is essential, therefore, to be able to detect when INs occur. Previous research has established that a realisation of INs physically manifests itself with specific brain activity. With this work we take the next step, showing that monitoring brain activity can lead to accurate predictions of a realisation of IN occurrence. We have conducted experiments whereby twenty-four participants performed a Q/A Task, while their brain activity was being monitored using functional Magnetic Resonance Imaging (fMRI) technology. The questions were selected and developed from the TREC-8 and TREC 2001 Q/A Tracks. We present two methods for predicting the realisation of an IN, i.e. Generalised method (GM) and Personalised method (PM). GM is based on the collective brain activity of all twenty-four participants in a predetermined set of brain regions known to be involved in representing a realisation of INs. PM is unique to each individual and employs a 'Searchlight' analysis to locate brain regions informative for distinguishing when a “specific” user realises an information need. The results of our study show that both methods were able to predict a realisation of an IN (statistically) significantly better than chance. Our results also show that PM (statistically) significantly outperformed GM in terms of prediction accuracy. These encouraging findings make the first fundamental step towards proactive IR engines based on brain signals.
Social influence plays a vital role in shaping a user's behavior in online communities dealing with items of fine taste like movies, food, and beer. For online recommendation, this implies that users' preferences and ratings are influenced due to other individuals. Given only time-stamped reviews of users, can we find out who-influences-whom, and characteristics of the underlying influence network? Can we use this network to improve recommendation?
While prior works in social-aware recommendation have leveraged social interaction by considering the observed social network of users, many communities like Amazon, Beeradvocate, and Ratebeer do not have explicit user-user links. Therefore, we propose GhostLink, an unsupervised probabilistic graphical model, to automatically learn the latent influence network underlying a review community - given only the temporal traces (timestamps) of users' posts and their content. Based on extensive experiments with four real-world datasets with 13 million reviews, we show that GhostLink improves item recommendation by around 23% over state-of-the-art methods that do not consider this influence. As additional use-cases, we show that GhostLink can be used to differentiate between users' latent preferences and influenced ones, as well as to detect influential users based on the learned influence graph.
Measuring similarities between vertices is an important task in network analysis, which has numerous applications. One major approach to define a similarity between vertices is by accumulating weights of walks between them that encompasses personalized PageRank (PPR) and Katz similarity. Although many effective methods for PPR based on efficient simulation of random walks have been proposed, these techniques cannot be applied to other walk-based similarity notions because the random walk interpretation is only valid for PPR.
In this paper, we propose a random-walk reduction method that reduces the computation of any walk-based similarity to the computation of a stationary distribution of a random walk. With this reduction, we can exploit techniques for PPR to compute other walk-based similarities. As a concrete application, we design an indexing method for walk-based similarities with which we can quickly estimate the similarity value of queried pairs of vertices, and theoretically analyze its approximation error. Our experimental results demonstrate that the instantiation of our method for Katz similarity is two orders of magnitude faster than existing methods on large real networks, without any deterioration in solution quality.
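As a point of reference for the Katz instantiation, the truncated power-series baseline (which the proposed reduction and index are designed to outperform) can be written in a few lines; the damping value and truncation length below are illustrative assumptions.

```python
import numpy as np

def katz_similarity(A, beta=0.05, max_len=50):
    """Katz index: accumulate beta^l * A^l over walk lengths l >= 1.
    Truncating the series is a simple baseline; the series converges when
    beta < 1 / spectral_radius(A)."""
    A = np.asarray(A, dtype=float)
    S = np.zeros_like(A)
    term = np.eye(A.shape[0])
    for _ in range(max_len):
        term = beta * term @ A      # now holds beta^l * A^l
        S += term
    return S

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
S = katz_similarity(A)
```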
Revealing important vertices is a fundamental task in network analysis. As such, many indicators have been proposed for doing so, which are collectively called centralities. However, the abundance of studies on centralities blurs their differences.
In this work, we compare centralities based on their sensitivity to modifications in the graph. Specifically, we introduce a quantitative measure called (average-case) edge sensitivity, which measures how much the centrality value of a uniformly chosen vertex (or edge) changes when we remove a uniformly chosen edge. Edge sensitivity is applicable to unweighted graphs, for which, to our knowledge, there has been no such theoretical analysis of centralities. We conducted a theoretical analysis of the edge sensitivities of six major centralities: closeness centrality, harmonic centrality, betweenness centrality, endpoint betweenness centrality, PageRank, and spanning tree centrality. Our experimental results on synthetic and real graphs confirm the tendency predicted by the theoretical analysis. We also discuss an extension of edge sensitivity to the setting in which we remove a uniformly chosen set of edges of size k, for an integer k ≥ 1.
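One plausible formalisation of average-case edge sensitivity, written out only for concreteness (the paper's exact definition and normalisation may differ), is:

```latex
S(c, G) \;=\; \mathbb{E}_{e \sim \mathrm{Unif}(E)}\;
              \mathbb{E}_{v \sim \mathrm{Unif}(V)}
              \bigl|\, c_{G}(v) \;-\; c_{G \setminus \{e\}}(v) \,\bigr|,
\qquad G = (V, E).
```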
Activity logs collected from wearable devices (e.g. Apple Watch, Fitbit, etc.) are a promising source of data to facilitate a wide range of applications such as personalized exercise scheduling, workout recommendation, and heart rate anomaly detection. However, such data are heterogeneous, noisy, diverse in scale and resolution, and have complex interdependencies, making them challenging to model. In this paper, we develop context-aware sequential models to capture the personalized and temporal patterns of fitness data. Specifically, we propose FitRec - an LSTM-based model that captures two levels of context information: context within a specific activity, and context across a user's activity history. We are specifically interested in (a) estimating a user's heart rate profile for a candidate activity; and (b) predicting and recommending suitable activities on this basis. We evaluate our model on a novel dataset containing over 250 thousand workout records coupled with hundreds of millions of parallel sensor measurements (e.g. heart rate, GPS) and metadata. We demonstrate that the model is able to learn contextual, personalized, and activity-specific dynamics of users' heart rate profiles during exercise. We evaluate the proposed model against baselines on several personalized recommendation tasks, showing the promise of using wearable data for activity modeling and recommendation.
Product descriptions play an important role in the e-commerce ecosystem, conveying to buyers information about merchandise they may purchase. Yet, on leading e-commerce websites, with high volumes of new items offered for sale every day, product descriptions are often lacking or missing altogether. Moreover, many descriptions include information that holds little value and sometimes even disrupts buyers, in an attempt to draw attention and purchases. In this work, we suggest mitigating these issues by generating short crowd-based product descriptions from user reviews. We apply an extractive approach, where review sentences are used in their original form to compose the product description. At the core of our method is a supervised approach to identifying candidate review sentences suitable to be used as part of a description. Our analysis, based on data from both the Fashion and Motors domains, reveals the top reasons for review sentences being unsuitable for the product's description, and these reasons are used, in turn, as part of a deep multi-task learning architecture. We then diversify the set of candidates by removing redundancies and, at the final step, select the top candidates to be included in the description. We compare different methods for each step and also conduct an end-to-end evaluation, based on ratings from professional annotators, showing that the generated descriptions are of high quality.
There are thousands of data repositories on the Web, providing access to millions of datasets. National and regional governments, scientific publishers and consortia, commercial data providers, and others publish data for fields ranging from social science to life science to high-energy physics to climate science and more. Access to this data is critical to facilitating reproducibility of research results, enabling scientists to build on others' work, and providing data journalists with easier access to information and its provenance. In this paper, we discuss Google Dataset Search, a dataset-discovery tool that provides search capabilities over potentially all datasets published on the Web. The approach relies on an open ecosystem, where dataset owners and providers publish semantically enhanced metadata on their own sites. We then aggregate, normalize, and reconcile this metadata, providing a search engine that lets users find datasets in the “long tail” of the Web. In this paper, we discuss both social and technical challenges in building this type of tool, and the lessons that we learned from this experience.
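For concreteness, the kind of semantically enhanced metadata this ecosystem relies on is typically schema.org/Dataset markup embedded in a provider's page, for example as JSON-LD. The record below is invented and the exact set of properties that Dataset Search ingests may differ; the property names follow the published schema.org vocabulary.

    import json

    # All field names follow the schema.org Dataset vocabulary; the record itself is invented.
    dataset_metadata = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": "City air-quality measurements (hypothetical example)",
        "description": "Hourly PM2.5 readings from fixed monitoring stations.",
        "url": "https://example.org/datasets/air-quality",  # hypothetical URL
        "license": "https://creativecommons.org/licenses/by/4.0/",
        "keywords": ["air quality", "PM2.5", "open data"],
        "creator": {"@type": "Organization", "name": "Example Environmental Agency"},
    }

    # Providers typically embed such a record in their pages as an application/ld+json block.
    print(json.dumps(dataset_metadata, indent=2))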
In 2017, Internet ad spending reached 209 billion USD worldwide, while, for comparison, TV ads brought in 178 billion USD. An Internet advertising campaign includes up to thousands of sub-campaigns on multiple channels, e.g., search, social, and display, whose parameters (bid and daily budget) need to be optimized every day, subject to a (cumulative) budget constraint. Such a process is often unaffordable for humans, and its automation is crucial. As marketing funnel models also show, the sub-campaigns are usually interdependent; e.g., display ads induce awareness, increasing the number of impressions (and thus also the number of conversions) of search ads. This interdependence is widely exploited by humans in the optimization process, whereas, to the best of our knowledge, no algorithm takes it into account. In this paper, we provide the first model capturing this sub-campaign interdependence. We also provide the IDIL algorithm, which, employing Granger Causality and Gaussian Processes, learns from past data and returns an optimal stationary bid/daily-budget allocation. We prove theoretical guarantees on the loss of IDIL w.r.t. the clairvoyant solution, and we show empirical evidence of its superiority in both realistic and real-world settings when compared with existing approaches.
We propose a new variant of the k-median problem, where the objective function models not only the cost of assigning data points to cluster representatives, but also a penalty term for disagreement among the representatives. We motivate this novel problem by applications where we are interested in clustering data while avoiding selecting representatives that are too far from each other. For example, we may want to summarize a set of news sources, but avoid selecting ideologically-extreme articles in order to reduce polarization.
To solve the proposed k-median formulation we adopt the local-search algorithm of Arya et al. [2]. We show that the algorithm provides a provable approximation guarantee, which becomes constant under a mild assumption on the minimum number of points in each cluster. We experimentally evaluate our problem formulation and proposed algorithm on datasets inspired by the motivating applications. In particular, we experiment with data extracted from Twitter, the US Congress voting records, and popular news sources. The results show that our objective can lead to choosing less polarized groups of representatives without significant loss in representation fidelity.
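A minimal sketch of the modified objective and of local-search swaps in this style is given below, assuming points in Euclidean space and an illustrative penalty weight lam on pairwise distances between the chosen representatives; the exact penalty form, swap rule, and approximation analysis are those of the paper and are not reproduced here.

    import itertools
    import numpy as np

    def objective(points, centers, lam=1.0):
        # k-median assignment cost plus a penalty on pairwise distances among the chosen medoids.
        assign = sum(min(np.linalg.norm(p - points[c]) for c in centers) for p in points)
        penalty = sum(np.linalg.norm(points[a] - points[b])
                      for a, b in itertools.combinations(centers, 2))
        return assign + lam * penalty

    def local_search(points, k, lam=1.0):
        centers = list(range(k))                      # arbitrary initial medoids (point indices)
        improved = True
        while improved:
            improved = False
            for out_c, in_c in itertools.product(list(centers), range(len(points))):
                if in_c in centers:
                    continue
                cand = [in_c if c == out_c else c for c in centers]
                if objective(points, cand, lam) < objective(points, centers, lam):
                    centers, improved = cand, True
        return centers

    pts = np.random.RandomState(0).rand(30, 2)
    print(local_search(pts, k=3))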
Neural networks have recently been shown to be highly effective at predicting links for constructing knowledge graphs. Existing research has mainly focused on designing 1) deep neural network models that are expressive in capturing fine-grained semantics, e.g., NTN and ConvE, but are less scalable; or 2) shallow models that are scalable, e.g., TransE and DistMult, yet limited in capturing expressive semantic features. In this work, we demonstrate that we can get the best of both worlds while drastically reducing the amount of data needed to train a deep network by leveraging active learning.
We present a novel deep active learning framework, ActiveLink, which can be applied to actively train any neural link predictor. Inspired by recent advances in Bayesian deep learning, ActiveLink takes a Bayesian view on neural link predictors, thereby enabling uncertainty sampling for deep active learning. ActiveLink extends uncertainty sampling by exploiting the underlying structure of the knowledge graph, i.e., links between entities, to improve sampling effectiveness. To accelerate model training, ActiveLink further adopts an incremental training method that allows deep neural networks to be incrementally trained while optimizing their generalizability at each iteration. Extensive validation on real-world datasets shows that ActiveLink is able to match state-of-the-art approaches while requiring only 20% of the original training data.
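One common way to realize such a Bayesian view, illustrated below, is Monte-Carlo dropout: score each unlabeled triple by the predictive entropy over several stochastic forward passes and query the most uncertain ones. The function stochastic_predict is a hypothetical stand-in for a link predictor evaluated with dropout kept active, and ActiveLink's structure-aware sampling and incremental training are omitted.

    import random
    import numpy as np

    def predictive_entropy(stochastic_predict, triple, passes=20):
        # stochastic_predict(triple) -> probability that the triple holds, with dropout active,
        # so repeated calls yield different samples from the approximate posterior predictive.
        p = float(np.mean([stochastic_predict(triple) for _ in range(passes)]))
        p = min(max(p, 1e-9), 1 - 1e-9)
        return -(p * np.log(p) + (1 - p) * np.log(1 - p))

    def select_batch(stochastic_predict, unlabeled_triples, batch_size=64):
        scores = [predictive_entropy(stochastic_predict, t) for t in unlabeled_triples]
        order = np.argsort(scores)[::-1]              # most uncertain first
        return [unlabeled_triples[i] for i in order[:batch_size]]

    # Toy usage with a dummy stochastic scorer standing in for a dropout-enabled link predictor.
    dummy = lambda triple: min(max(random.gauss(0.5, 0.1), 0.0), 1.0)
    print(select_batch(dummy, [("a", "r", "b"), ("c", "r", "d")], batch_size=1))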
We provide a framework for modeling social network formation through conditional multinomial logit models from discrete choice and random utility theory, in which each new edge is viewed as a “choice” made by a node to connect to another node, based on (generic) features of the other nodes available to make a connection. This perspective on network formation unifies existing models such as preferential attachment, triadic closure, and node fitness, which are all special cases, and thereby provides a flexible means for conceptualizing, estimating, and comparing models. The lens of discrete choice theory also provides several new tools for analyzing social network formation; for example, the significance of node features can be evaluated in a statistically rigorous manner, and mixtures of existing models can be estimated by adapting known expectation-maximization algorithms. We demonstrate the flexibility of our framework through examples that analyze a number of synthetic and real-world datasets. For example, we provide rigorous methods for estimating preferential attachment models and show how to separate the effects of preferential attachment and triadic closure. Non-parametric estimates of the importance of degree show a highly linear trend, and we expose the importance of looking carefully at nodes with degree zero. Examining the formation of a large citation graph, we find evidence for an increased role of degree when accounting for age.
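At the core of this framework is the conditional (multinomial) logit: the probability that a new edge attaches to candidate j is a softmax of a linear utility over j's features. The sketch below shows the choice probabilities and log-likelihood under this assumption; feature construction is up to the modeler. With log-degree as the only feature and a unit coefficient, the choice probabilities are proportional to degree, recovering preferential attachment as a special case.

    import numpy as np

    def choice_probs(theta, candidate_features):
        # candidate_features: (n_candidates, n_features); returns softmax(candidate_features @ theta).
        utilities = candidate_features @ theta
        utilities -= utilities.max()                  # numerical stability
        expu = np.exp(utilities)
        return expu / expu.sum()

    def log_likelihood(theta, choice_sets, chosen_indices):
        # Sum of log P(chosen candidate | its choice set) over observed edge formations.
        return sum(np.log(choice_probs(theta, X)[j]) for X, j in zip(choice_sets, chosen_indices))

    # With log-degree as the only feature and a unit coefficient, choice probabilities are
    # proportional to degree, i.e., preferential attachment.
    X = np.log(np.array([[1.0], [2.0], [4.0]]))       # degrees of three candidate nodes
    print(choice_probs(np.array([1.0]), X))           # ~ [1/7, 2/7, 4/7]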
Decision making is a challenging task in online recommender systems. The decision maker often needs to choose a contextual item at each step from a set of candidates. Contextual bandit algorithms have been successfully deployed in such applications, owing to their trade-off between exploration and exploitation and their state-of-the-art performance in minimizing online costs. However, the applicability of existing contextual bandit methods is limited by the over-simplified assumptions of the problem, such as assuming a simple form of the reward function or assuming a static environment where the states are not affected by previous actions.
In this work, we put forward Policy Gradients for Contextual Recommendations (PGCR) to solve the problem without those unrealistic assumptions. It optimizes over a restricted class of policies where the marginal probability of choosing an item (in expectation over the other items) has a simple closed form, and the gradient of the expected return over the policies in this class has a succinct form. Moreover, PGCR leverages two useful heuristic techniques, Time-Dependent Greed and Actor-Dropout. The former ensures that PGCR is empirically greedy in the limit, and the latter addresses the trade-off between exploration and exploitation by using the policy network with Dropout as a Bayesian approximation.
PGCR can solve the standard contextual bandit problem as well as its Markov Decision Process generalization. It can therefore be applied to a wide range of realistic recommendation settings, such as personalized advertising. We evaluate PGCR on toy datasets as well as a real-world dataset of personalized music recommendations. Experiments show that PGCR enables fast convergence and low regret, and outperforms both classic contextual-bandit and vanilla policy gradient methods.
User data is the primary input of digital advertising, fueling the free Internet as we know it. As a result, web companies invest heavily in elaborate tracking mechanisms to acquire user data that they can sell to data markets and advertisers. However, with the same-origin policy and cookies as the primary identification mechanism on the web, each tracker knows the same user under a different ID. To mitigate this, Cookie Synchronization (CSync) came to the rescue, facilitating an information-sharing channel between third parties that may or may not have direct access to the website the user visits. In the background, with CSync, they not only merge the user data they own but also reconstruct a user's browsing history, bypassing the same-origin policy.
In this paper, we perform the first, to our knowledge, in-depth study of CSync in the wild, using a year-long weblog from 850 real mobile users. Through our study, we aim to understand the characteristics of the CSync protocol and its impact on web users' privacy. To this end, we design and implement CONRAD, a holistic mechanism to detect CSync events in real time and to assess the resulting privacy loss on the user side, even when the synced IDs are obfuscated. Using CONRAD, we find that 97% of regular web users are exposed to CSync, most of them within the first week of their browsing, and that the median userID gets leaked, on average, to 3.5 different domains. Finally, we see that CSync increases the number of domains that track the user by a factor of 6.75.
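The basic signal behind detecting CSync, sketched below, is an identifier-looking value that appears in request URLs sent to more than one third-party domain. This simplified heuristic only covers IDs shared in the clear; CONRAD additionally handles obfuscated IDs and operates in real time, which is not reproduced here. The example log entries and the ID pattern are invented.

    import re
    from collections import defaultdict
    from urllib.parse import urlparse, parse_qsl

    ID_PATTERN = re.compile(r"^[A-Za-z0-9_-]{16,}$")  # heuristic: a long opaque token looks like a userID

    def find_csync_candidates(request_urls):
        seen = defaultdict(set)                       # candidate ID -> third-party domains it was sent to
        for url in request_urls:
            domain = urlparse(url).netloc
            for _, value in parse_qsl(urlparse(url).query):
                if ID_PATTERN.match(value):
                    seen[value].add(domain)
        return {uid: domains for uid, domains in seen.items() if len(domains) > 1}

    logs = [
        "https://tracker-a.example/pixel?uid=AAAABBBBCCCCDDDD1234",
        "https://tracker-b.example/sync?partner_uid=AAAABBBBCCCCDDDD1234",
    ]
    print(find_csync_candidates(logs))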
Ad-hoc retrieval models with implicit feedback often face problems such as imbalanced classes in the data set. Too few clicked documents may hurt the generalization ability of the models, whereas too many non-clicked documents may harm the effectiveness of the models and the efficiency of training. In addition, recent neural network-based models are vulnerable to adversarial examples due to their largely linear nature. To address these problems simultaneously, we propose an adversarial sampling and training framework to learn ad-hoc retrieval models with implicit feedback. Our key idea is (i) to augment clicked examples by adversarial training for better generalization and (ii) to obtain highly informative non-clicked examples by adversarial sampling and training. Experiments are performed on benchmark data sets for common ad-hoc retrieval tasks such as Web search, item recommendation, and question answering. Experimental results indicate that the proposed approaches significantly outperform strong baselines, especially for high-ranked documents, and that they outperform IRGAN on the Web search task using only 5% of the labeled data.
Invalid ad traffic is an inherent problem of programmatic advertising that has not been properly addressed so far. Traditionally, it has been assumed that invalid ad traffic only harms the interests of advertisers, who pay for the cost of invalid ad impressions, while other industry stakeholders earn revenue through commissions regardless of the quality of the impression. Our first contribution consists of providing evidence that Demand Side Platforms (DSPs), one of the most important intermediaries in the programmatic advertising supply chain, may themselves be suffering economic losses due to invalid ad traffic. Addressing the problem of invalid traffic at DSPs requires a highly scalable solution that can identify invalid traffic in real time at the individual bid-request level. The second and main contribution is the design and implementation of a solution to the invalid traffic problem: a system that can be seamlessly integrated into the current programmatic ecosystem by DSPs. Our system has been released under an open-source license, becoming the first auditable solution for invalid ad traffic detection. The intrinsic transparency of our solution, along with the good results obtained in industrial trials, has led the World Federation of Advertisers to endorse it.
The curation of a knowledge base is a crucial but costly task. In this work, we propose to take advantage of the edit history of the knowledge base in order to learn how to correct constraint violations. Our method is based on rule mining, and uses the edits that solved some violations in the past to infer how to solve similar violations in the present. The experimental evaluation of our method on Wikidata shows significant improvements over baselines.
The ability to continuously discover domain-specific content from the Web is critical for many applications. While focused crawling strategies have been shown to be effective for discovery, configuring a focused crawler is difficult and time-consuming. Given a domain of interest D, subject-matter experts (SMEs) must search for relevant websites and collect a set of representative Web pages to serve as training examples for creating a classifier that recognizes pages in D, as well as a set of pages to seed the crawl. In this paper, we propose DISCO, an approach designed to bootstrap domain-specific search. Given a small set of websites, DISCO aims to discover a large collection of relevant websites. DISCO uses a ranking-based framework that mimics the way users search for information on the Web: it iteratively discovers new pages, distills, and ranks them. It also applies multiple discovery strategies, including keyword-based and related queries issued to search engines, and backward and forward crawling. By systematically combining these strategies, DISCO is able to attain high harvest rates and coverage for a variety of domains. We perform extensive experiments in four social-good domains, using data gathered by SMEs in the respective domains, and show that our approach is effective and outperforms state-of-the-art methods.
Search engines play an important role on the Web, helping users find relevant resources and answers to their questions. At the same time, search logs can also be of great utility to researchers. For instance, a number of recent research efforts have relied on them to build prediction and inference models, for applications ranging from economics and marketing to public health surveillance. However, companies rarely release search logs, not least because of the privacy issues that ensue, as such logs are inherently hard to anonymize. As a result, it is very difficult for researchers to gain access to search data, and even when they do, they are fully dependent on the company providing it. Aiming to overcome these issues, this paper presents Private Data Donor (PDD), a decentralized and private-by-design platform providing crowd-sourced Web searches to researchers. We build on a cryptographic protocol for privacy-preserving data aggregation, and address a few practical challenges to add reliability to the system with regard to users who disconnect or stop using the platform. We discuss how PDD can be used to build a flu monitoring model, and evaluate the impact of the privacy-preserving layer on the quality of the results. Finally, we present the implementation of our platform, as a browser extension and a server, and report on a pilot deployment with real users.
Community detection is one of the most important problems in network analysis. Among the many algorithms proposed for this task, methods based on statistical inference are of particular interest: they are mathematically sound and have been shown to provide partitions of good quality. Statistical inference methods are based on fitting some random graph model (a.k.a. null model) to the observed network by maximizing the likelihood. The choice of this model is extremely important and is the main focus of the current study. We provide an extensive theoretical and empirical analysis comparing several models: the widely used planted partition model, a recently proposed degree-corrected modification of this model, and a new null model having some desirable statistical properties. We also develop and compare two likelihood optimization algorithms suitable for the models under consideration. An extensive empirical analysis on a variety of datasets shows, in particular, that the new model is the best one for describing most of the considered real-world complex networks according to the likelihood of observed graph structures.
We study the problem of large-scale network embedding, which aims to learn latent representations for network mining applications. Previous research shows that 1) popular network embedding benchmarks, such as DeepWalk, are in essence implicitly factorizing a matrix with a closed form, and 2) the explicit factorization of such matrix generates more powerful embeddings than existing methods. However, directly constructing and factorizing this matrix-which is dense-is prohibitively expensive in terms of both time and space, making it not scalable for large networks.
In this work, we present NetSMF, an algorithm for large-scale network embedding as sparse matrix factorization. NetSMF leverages theories from spectral sparsification to efficiently sparsify the aforementioned dense matrix, enabling significantly improved efficiency in embedding learning. The sparsified matrix is spectrally close to the original dense one, with a theoretically bounded approximation error, which helps maintain the representation power of the learned embeddings. We conduct experiments on networks of various scales and types. The results show that, among both popular benchmarks and factorization-based methods, NetSMF is the only method that achieves both high efficiency and effectiveness. We show that NetSMF requires only 24 hours to generate effective embeddings for a large-scale academic collaboration network with tens of millions of nodes, whereas DeepWalk would require months and the dense matrix factorization solution is computationally infeasible. The source code of NetSMF is publicly available.
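The sketch below shows only the final factorization step that NetSMF shares with explicit matrix-factorization approaches: given a sparse matrix that (after sparsification) approximates the dense DeepWalk matrix, a truncated SVD yields the node embeddings. Constructing and spectrally sparsifying that matrix is the paper's actual contribution and is not shown; the random matrix here is only a placeholder input.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import svds

    def embed_from_sparse_matrix(M, dim=16):
        # M: sparse (n_nodes x n_nodes) matrix standing in for the sparsified DeepWalk matrix.
        U, s, _ = svds(M.asfptype(), k=dim)
        return U * np.sqrt(s)                         # scale singular vectors by sqrt of singular values

    # Placeholder input: a random symmetric sparse matrix.
    rng = np.random.RandomState(0)
    A = sp.random(1000, 1000, density=0.01, random_state=rng)
    M = (A + A.T) * 0.5
    print(embed_from_sparse_matrix(M, dim=16).shape)  # (1000, 16)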
Knowledge about the organization of the main physical elements (e.g. streets) and objects (e.g. trees) that structure cities is important in the maintenance of city infrastructure and the planning of future urban interventions. In this paper, a novel approach to crowd-mapping urban objects is proposed. Our method capitalizes on strategies for generating crowdsourced object annotations from street-level imagery, in combination with object density and geo-location estimation techniques to enable the enumeration and geo-tagging of urban objects. To address both the coverage and precision of the mapped objects within budget constraints, we design a scheduling strategy for micro-task prioritization, aggregation, and assignment to crowd workers. We experimentally demonstrate the feasibility of our approach through a use case pertaining to the mapping of street trees in New York City and Amsterdam. We show that anonymous crowds can achieve high recall (up to 80%) and precision (up to 68%), with geo-location precision of approximately 3m. We also show that similar performance could be achieved at city scale, possibly with stringent budget constraints.
Measuring and characterizing web page performance is a challenging task. In the mobile world, highly varying technology characteristics coupled with opaque network configurations make it even more difficult. Aiming at reproducibility, we present a large-scale empirical study of web page performance collected in eleven commercial mobile networks spanning four countries. By digging into measurements from nearly two million web browsing sessions, we shed light on the impact of different web protocols, browsers, and mobile technologies on web performance. We find that the impact of mobile broadband access is sizeable. For example, the median page load time using mobile broadband increases by a third compared to wired access. Mobility clearly stresses the system, with handovers causing the most evident performance penalties. In contrast, our measurements show that the adoption of HTTP/2 and QUIC has practically negligible impact. To understand the interplay of all parameters, we adopt state-of-the-art statistical methods to identify the significance of different factors for web performance. Our analysis confirms the importance of access technology and mobility context, as well as of webpage composition and browser. Our work highlights the importance of large-scale measurements. Even with our controlled setup, the complexity of the mobile web ecosystem is challenging to untangle. For this reason, we are releasing the dataset as open data for validation and further research.
Information about world events is disseminated through a wide variety of news channels, each with specific considerations in the choice of their reporting. Although the multiplicity of these outlets should ensure a variety of viewpoints, recent reports suggest that the rising concentration of media ownership may void this assumption. This observation motivates the study of the impact of ownership on the global media landscape and its influence on the coverage the actual viewer receives. To this end, the selection of reported events has been shown to be informative about the high-level structure of the news ecosystem. However, existing methods only provide a static view into an inherently dynamic system, yielding underperforming statistical models and hindering our understanding of the media landscape as a whole.
In this work, we present a dynamic embedding method that learns to capture the decision process of individual news sources in their selection of reported events while also enabling the systematic detection of large-scale transformations in the media landscape over prolonged periods of time. In an experiment covering over 580M real-world event mentions, we show our approach to outperform static embedding methods in predictive terms. We demonstrate the potential of the method for news monitoring applications and investigative journalism by shedding light on important changes in programming induced by mergers and acquisitions, policy changes, or network-wide content diffusion. These findings offer evidence of strong content convergence trends inside large broadcasting groups, influencing the news ecosystem in a time of increasing media ownership concentration.
While keyphrase extraction has received considerable attention in recent years, relatively few studies exist on extracting keyphrases from social media platforms such as Twitter, and even fewer on extracting disaster-related keyphrases from such sources. During a disaster, keyphrases can be extremely useful for filtering relevant tweets that can enhance situational awareness. Previously, joint training of two different layers of a stacked Recurrent Neural Network for keyword discovery and keyphrase extraction had been shown to be effective in extracting keyphrases from general Twitter data. We improve the model's performance on both general Twitter data and disaster-related Twitter data by incorporating contextual word embeddings, POS tags, phonetics, and phonological features. Moreover, we discuss the shortcomings of the often-used F1-measure for evaluating the quality of predicted keyphrases with respect to ground-truth annotations. Instead of the F1-measure, we propose the use of embedding-based metrics to better capture the correctness of the predicted keyphrases. We also present a novel extension of an embedding-based metric that allows one to better control the penalty for the difference in the number of ground-truth and predicted keyphrases.
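In the spirit of the embedding-based evaluation described above, the sketch below scores predicted keyphrases by their best cosine similarity to some ground-truth keyphrase (and vice versa) and combines the two into an F1-like value. The function embed is a hypothetical phrase-embedding function, and the paper's metric and its penalty extension differ in their details; the toy bag-of-letters embedding is only for demonstration.

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def embedding_f1(predicted, gold, embed):
        # embed: hypothetical phrase -> vector function (e.g., averaged contextual embeddings).
        P, G = [embed(p) for p in predicted], [embed(g) for g in gold]
        precision = float(np.mean([max(cosine(p, g) for g in G) for p in P])) if P else 0.0
        recall = float(np.mean([max(cosine(g, p) for p in P) for g in G])) if G else 0.0
        return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

    # Toy usage with a bag-of-letters embedding as a stand-in for a real phrase encoder.
    def toy_embed(phrase):
        v = np.zeros(26)
        for ch in phrase.lower():
            if ch.isalpha():
                v[ord(ch) - ord("a")] += 1
        return v

    print(round(embedding_f1(["flood warning"], ["flooding warnings"], toy_embed), 3))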
Wikipedia is playing an increasingly central role on the web, and the policies its contributors follow when sourcing and fact-checking content affect millions of readers. Among these core guiding principles, verifiability policies have a particularly important role. Verifiability requires that information included in a Wikipedia article be corroborated against reliable secondary sources. Because of the manual labor needed to curate Wikipedia at scale, however, its contents do not always evenly comply with these policies. Citations (i.e., references to external sources) may not conform to verifiability requirements or may be missing altogether, potentially weakening the reliability of specific topic areas of the free encyclopedia. In this paper, we aim to provide an empirical characterization of the reasons why and how Wikipedia cites external sources to comply with its own verifiability guidelines. First, we construct a taxonomy of reasons why inline citations are required, by collecting labeled data from editors of multiple Wikipedia language editions. We then crowdsource a large-scale dataset of Wikipedia sentences annotated with categories derived from this taxonomy. Finally, we design algorithmic models to determine if a statement requires a citation, and to predict the citation reason. We evaluate the accuracy of such models across different classes of Wikipedia articles of varying quality, and on external datasets of claims annotated for fact-checking purposes.
We present a seamless challenge-response authentication protocol that leverages the variations in HTML5 canvas rendering introduced by the software and hardware stacks. After a training phase that leads to feature extraction with deep learning techniques, a server becomes able to authenticate a user based on fresh canvases, hence avoiding replay attacks. The whole authentication process is natively supported by any mainstream browser, is stateless on the client side, and can be transparent to the user. We argue that these features facilitate deployment and composition with other authentication mechanisms without degrading the user experience. We present the threat model under which our protocol is expected to operate and discuss its security. We also present a prototype implementation of our protocol and report on a real-world experiment that we ran in order to analyze its efficiency and effectiveness.
The commencement of the EU's General Data Protection Regulation (GDPR) has led to massive compliance and consent activities on websites. But did the new regulation result in fewer third-party server appearances? Based on an eight-month longitudinal study, from February to September 2018, of 1250 popular websites in Europe and the US, we present a mapping of the subtle shifts in the third-party topology before and after May 25, 2018. The 1250 websites cover 39 European countries from the EU, the EEA, and outside the EU, belonging to categories that span both public-oriented citizen services and commercially-oriented sites. The developments in the numbers and types of third parties vary across categories of websites and countries. Analyzing the number of third parties over time, even though we notice a decline in the number of third parties on websites belonging to certain categories, we are cautious about attributing these effects to the general assumption that GDPR would lead to less third-party activity. We believe that it is quite difficult to draw conclusions on cause-effect relationships in such a complex environment with many impacting factors.
Causal inference using observational data on multiple treatments is an important problem in a wide variety of fields. However, the existing literature tends to focus only on causal inference in the case of binary or multinoulli treatments. These models are either incompatible with multiple treatments, or extending them to multiple treatments is computationally expensive. We build on a previous formulation of causal inference using variational autoencoders (VAEs) and propose a novel architecture to estimate the causal effect of any subset of the treatments. The higher-order effects of multiple treatments are captured through a task embedding, which allows the model to scale to multiple treatments. The model is applied to a real digital marketing dataset to evaluate the next best set of marketing actions. For evaluation, the model is compared against competitive baseline models on two semi-synthetic datasets created using the covariates from the real dataset. Performance is measured along four evaluation metrics from the causal inference literature and one proposed by us, which measures the loss in the expected outcome when a particular model is used for decision making compared to the ground truth. The proposed model outperforms the baselines on all five evaluation metrics, beating the best baseline by over 30%. The proposed approach is also shown to be robust when a subset of the confounders is not observed. The results on real data show the importance of the flexible modeling approach provided by the proposed model.
In this paper, we present a semi-automated, “human-in-the-loop” framework for attribute design that assists human analysts in transforming raw attributes into effective derived attributes for classification problems. Our proposed framework is optimization-guided and fully agnostic to the underlying classification model. We present an algebra with various operators (arithmetic, relational, and logical) to transform raw attributes into derived attributes, and we solve two technical problems: (a) the top-k buckets design problem, which aims at presenting human analysts with k buckets, each containing promising choices of raw attributes that the analyst can focus on without having to examine all raw attributes; and (b) the top-l snippets generation problem, which iteratively aids human analysts with the top-l derived attributes involving a given attribute. For the former problem, we present an effective exact bottom-up algorithm that is empowered by pruning capabilities, as well as random-walk based heuristic algorithms that are intuitive and work well in practice. For the latter, we present a greedy heuristic algorithm that is scalable and effective. Rigorous evaluations are conducted on 6 different real-world datasets to show that our framework generates more effective derived attributes than fully manual or fully automated methods.
Triplestores are data management systems for storing and querying RDF data. Over recent years, various benchmarks have been proposed to assess the performance of triplestores across different performance measures. However, choosing the most suitable benchmark for evaluating triplestores in practical settings is not a trivial task. This is because triplestores experience varying workloads when deployed in real applications. We address the problem of determining an appropriate benchmark for a given real-life workload by providing a fine-grained comparative analysis of existing triplestore benchmarks. In particular, we analyze the data and queries provided with the existing triplestore benchmarks in addition to several real-world datasets. Furthermore, we measure the correlation between the query execution time and various SPARQL query features and rank those features based on their significance levels. Our experiments reveal several interesting insights about the design of such benchmarks. With this fine-grained evaluation, we aim to support the design and implementation of more diverse benchmarks. Application developers can use our results to analyze their data and queries and choose an appropriate data management system.
In this paper, we quantify the impact of self- and cross-excitation on the temporal development of user activity in Stack Exchange Question & Answer (Q&A) communities. We study differences in user excitation between growing and declining Stack Exchange communities, and between those dedicated to STEM and humanities topics by leveraging Hawkes processes. We find that growing communities exhibit early stage, high cross-excitation by a small core of power users reacting to the community as a whole, and strong long-term self-excitation in general and cross-excitation by casual users in particular, suggesting community openness towards less active users. Further, we observe that communities in the humanities exhibit long-term power user cross-excitation, whereas in STEM communities activity is more evenly distributed towards casual user self-excitation. We validate our findings via permutation tests and quantify the impact of these excitation effects with a range of prediction experiments. Our work enables researchers to quantitatively assess the evolution and activity potential of Q&A communities.
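The basic building block of such an analysis is the conditional intensity of a Hawkes process; for a univariate process with an exponential kernel it is lambda(t) = mu + sum over past events t_i of alpha * exp(-beta * (t - t_i)). The sketch below evaluates this intensity with illustrative parameters; the study itself fits multivariate versions with cross-excitation terms between user groups, which are not reproduced here.

    import numpy as np

    def hawkes_intensity(t, event_times, mu=0.2, alpha=0.8, beta=1.0):
        # lambda(t) = mu + sum_{t_i < t} alpha * exp(-beta * (t - t_i)); parameters are illustrative.
        past = np.asarray([s for s in event_times if s < t])
        return mu + alpha * float(np.exp(-beta * (t - past)).sum())

    events = [1.0, 1.2, 3.5]
    print([round(hawkes_intensity(t, events), 3) for t in (0.5, 1.5, 4.0)])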
In the medical domain, systematic reviews are a highly trustworthy evidence source used to inform clinical diagnosis and treatment, as well as governmental policy making. Systematic reviews must be complete, in that all literature relevant to the research question of the review must be synthesised in order to produce a recommendation. To identify the literature to screen for inclusion in systematic reviews, information specialists construct complex Boolean queries that capture the information needs defined by the research questions of the systematic review. However, in the quest for total recall, these Boolean queries return many non-relevant results.
In this paper, we propose automatic methods for Boolean query refinement in the context of systematic review literature retrieval, with the aim of alleviating this high-recall, low-precision problem. To do this, we build upon the current literature and define additional semantic transformations for Boolean queries in the form of query expansion and reduction. An empirical evaluation is conducted on a set of real systematic review queries to show how our method performs in a realistic setting. We found that the query refinement strategies produced queries that were more effective than the original in terms of six information retrieval evaluation measures. In particular, queries were refined to increase precision while maintaining, or even increasing, recall; this, in turn, translates into both time and cost savings in the laborious and expensive creation of systematic reviews.
Network traffic classification is an important tool for network administrators, enabling monitoring and service provisioning. Traditional techniques for classifying traffic do not work well for mobile app traffic due to the lack of unique signatures. Encryption renders this task even more difficult, since packet content is no longer available to parse. More recent techniques based on statistical analysis of parameters such as packet sizes and packet arrival times have shown promise; such techniques have been shown to classify traffic from a small number of applications with a high degree of accuracy. However, we show that when applied to a large number of applications, their performance falls short of satisfactory. In this paper, we propose a novel set of bit-sequence based features that exploit differences in the randomness of the data generated by different applications. These differences, which originate from dissimilarities in the applications' encryption implementations, leave footprints on the data they generate. We validate that these features can differentiate data encrypted with various ciphers (89% accuracy) and key sizes (83% accuracy). Our evaluation shows that such features can not only differentiate traffic originating from different categories of mobile apps (90% accuracy), but can also classify 175 individual applications with 95% accuracy.
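As an illustration of what bit-sequence features can look like, the sketch below computes a few simple randomness indicators over a payload: the fraction of 1-bits, byte-level entropy, and the longest run of identical bits. The exact feature set used in the paper may differ; this is only meant to convey the idea of exploiting randomness differences.

    import math
    from collections import Counter

    def bit_features(payload: bytes):
        # Illustrative randomness features; not the paper's exact feature set.
        bits = "".join(f"{byte:08b}" for byte in payload)
        ones_fraction = bits.count("1") / len(bits)
        counts = Counter(payload)
        byte_entropy = -sum((c / len(payload)) * math.log2(c / len(payload)) for c in counts.values())
        longest_run, run = 1, 1
        for prev, cur in zip(bits, bits[1:]):
            run = run + 1 if cur == prev else 1
            longest_run = max(longest_run, run)
        return {"ones_fraction": ones_fraction, "byte_entropy": byte_entropy, "longest_bit_run": longest_run}

    print(bit_features(b"\x00\x00\xff\xa5\x3c"))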
Group detection is gaining popularity as it enables various applications ranging from marketing to urban planning. Group information is an important social context that could facilitate a more comprehensive behavior analysis; an example is retailers determining the right incentive for potential customers. Existing methods use the received signal strength indicator (RSSI) to detect co-located people as groups. However, this approach has difficulties in crowded urban spaces, since many strangers with similar mobility patterns could be identified as groups. Moreover, RSSI is vulnerable to many factors, such as human body attenuation, and is thus unreliable in crowded scenarios. In this work, we propose a behavior-aware group detection system (BaG). BaG fuses people's mobility information and smartphone usage behaviors. We observe that people in a group tend to have similar phone usage patterns, and those patterns can be effectively captured by the proposed feature: the number of bursts (NoB). Unlike RSSI, NoB is more resilient to environmental changes, as it depends only on whether packets are received. Moreover, both mobility and usage patterns reflect the same underlying grouping information, and the latent associations between them cannot be fully utilized by conventional detection methods such as graph clustering. We therefore propose a detection method based on collective matrix factorization, which reveals the hidden associations by factorizing mobility information and usage patterns simultaneously. Experimental results indicate that BaG outperforms baseline approaches in terms of F-score. The proposed system also achieves robust and reliable performance in scenarios with different levels of crowdedness.
Given a terabyte-scale graph distributed across multiple machines, how can we summarize it, with much fewer nodes and edges, so that we can restore the original graph exactly or within error bounds?
As large-scale graphs are ubiquitous, ranging from web graphs to online social networks, compactly representing graphs becomes important to efficiently store and process them. Given a graph, graph summarization aims to find its compact representation consisting of (a) a summary graph where the nodes are disjoint sets of nodes in the input graph, and each edge indicates the edges between all pairs of nodes in the two sets; and (b) edge corrections for restoring the input graph from the summary graph exactly or within error bounds. Although graph summarization is a widely-used graph-compression technique readily combinable with other techniques, existing algorithms for graph summarization are not satisfactory in terms of speed or compactness of outputs. More importantly, they assume that the input graph is small enough to fit in main memory.
In this work, we propose SWeG, a fast parallel algorithm for summarizing graphs with compact representations. SWeG is designed not only for shared-memory but also for MapReduce settings, to summarize graphs that are too large to fit in main memory. We demonstrate that SWeG is (a) Fast: SWeG is up to 5400× faster than its competitors that give similarly compact representations, (b) Scalable: SWeG scales to graphs with tens of billions of edges, and (c) Compact: combined with state-of-the-art compression methods, SWeG achieves up to 3.4× better compression than they do.
We present techniques for generating random graphs whose Laplacian spectrum approximately matches that of a given input graph. The motivation for matching the Laplacian spectrum is that it naturally encodes high-level connectivity information about the input graph; most existing models (e.g., variants of the Configuration Model, Stochastic Block Model, or Kronecker Graphs) focus on local structure or limited high-level partitions.
Our techniques succeed in matching the spectrum of the input graph more closely than the benchmark models. We also evaluate our generative model using other global and local properties, including shortest path distances, betweenness centrality, degree distribution, and clustering coefficients. The graphs produced by our model almost always match the input graph better than those produced by the benchmark models with respect to shortest path distance and clustering coefficient distributions. The performance on betweenness centrality is comparable to the benchmarks, while a worse match on the degree distribution is a price our method pays for more global similarity.
Our results suggest that focusing on spectral properties may lead to good performance for other global properties, at a modest loss in local similarity. Since global connectivity patterns are usually more important than local features for processes such as information flow, spread of epidemics, routing, etc., our main goal is to advocate for a shift in focus from graph generative models matching local properties to those matching global connectivity patterns.
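For reference, the quantity being matched is the spectrum of the graph Laplacian L = D - A; the sketch below computes it and a naive distance between the sorted spectra of an input graph and a generated graph. The generative procedure itself is not reproduced, and the particular distance shown is an illustrative choice.

    import numpy as np
    import networkx as nx

    def laplacian_spectrum(G):
        A = nx.to_numpy_array(G)
        L = np.diag(A.sum(axis=1)) - A                # L = D - A
        return np.sort(np.linalg.eigvalsh(L))

    def spectral_distance(G1, G2):
        s1, s2 = laplacian_spectrum(G1), laplacian_spectrum(G2)
        m = min(len(s1), len(s2))
        return float(np.linalg.norm(s1[:m] - s2[:m]))

    # Toy usage: compare a real graph with a random graph of the same size and density.
    print(spectral_distance(nx.karate_club_graph(), nx.gnm_random_graph(34, 78, seed=0)))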
Modern enterprises rely on Data Leakage Prevention (DLP) systems to enforce privacy policies that prevent unintentional flow of sensitive information to unauthorized entities. However, these systems operate based on rule sets that are limited to syntactic analysis and therefore completely ignore the semantic relationships between participants involved in the information exchanges. For similar reasons, these systems cannot enforce complex privacy policies that require temporal reasoning about events that have previously occurred.
To address these limitations, we advocate a new design methodology for DLP systems centered on the notion of Contextual Integrity (CI). We use the CI framework to abstract real-world communication exchanges into formally defined information flows, where privacy policies describe sequences of admissible flows. CI allows us to decouple (1) the syntactic extraction of flows from information exchanges, and (2) the enforcement of privacy policies on these flows. We applied this approach to build VACCINE, a DLP auditing system for emails. VACCINE uses state-of-the-art techniques in natural language processing to extract flows from email text. It also provides a declarative language for describing privacy policies. These policies are automatically compiled into operational rules that the system uses to detect data leakages. We evaluated VACCINE on the Enron email corpus and show that it improves over the state of the art both in the expressivity of the policies that DLP systems can enforce and in its precision in detecting data leakages.
One of the central challenges in online advertising is attribution, namely, assessing the contribution of individual advertiser actions including emails, display ads and search ads to eventual conversion. Several heuristics are used for attribution in practice; however, there is no formal justification for them and many of these fail even in simple canonical settings. The main contribution in this work is to develop an axiomatic framework for attribution in online advertising. In particular, we consider a Markovian model for the user journey through the conversion funnel, in which ad actions may have disparate impacts at different stages. We propose a novel attribution metric, that we refer to as counterfactual adjusted Shapley value, which inherits the desirable properties of the traditional Shapley value. Furthermore, we establish that this metric coincides with an adjusted “unique-uniform” attribution scheme. This scheme is efficiently computable and implementable and can be interpreted as a correction to the commonly used uniform attribution scheme.
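As a reference point for the discussion above, the sketch below computes the plain Shapley value of each channel over an arbitrary set function value(S), e.g., the conversion probability when exactly the channels in S act on the user. The counterfactual adjustment and the Markovian user-journey model proposed in the paper are not reproduced, and the toy value function is invented.

    from itertools import permutations

    def shapley_values(channels, value):
        # value(S): hypothetical set function, e.g., conversion probability when exactly
        # the channels in frozenset S act on the user.
        contrib = {c: 0.0 for c in channels}
        perms = list(permutations(channels))
        for order in perms:
            coalition = set()
            for c in order:
                before = value(frozenset(coalition))
                coalition.add(c)
                contrib[c] += value(frozenset(coalition)) - before
        return {c: total / len(perms) for c, total in contrib.items()}

    # Invented toy values: email alone converts 10%, search alone 30%, both together 35%.
    v = {frozenset(): 0.0, frozenset({"email"}): 0.10,
         frozenset({"search"}): 0.30, frozenset({"email", "search"}): 0.35}
    print(shapley_values(["email", "search"], lambda S: v[S]))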
We investigate spatial patterns in mobile service consumption that emerge at national scale. Our investigation focuses on a representative case study, i.e., France, where we find that: (i) the demand for popular mobile services is fairly uniform across the whole country, and only a reduced set of peculiar services (mainly operating system updates and long-lived video streaming) yields geographic diversity; (ii) even for such distinguishing services, the spatial heterogeneity of demands is limited, and a small set of consumption behaviors is sufficient to characterize most of the mobile service usage across the country; (iii) the spatial distribution of these behaviors correlates well with the urbanization level, ultimately suggesting that the adoption of geographically-diverse mobile applications is linked to a dichotomy of cities and rural areas. We derive our results through the analysis of substantial measurement data collected by a major mobile network operator, leveraging an approach rooted in information theory that can be readily applied to other scenarios.
JavaScript has been used for various attacks on client-side web applications. To hinder both manual and automated analysis from detecting malicious scripts, code minification and code obfuscation may hide the behavior of a script. Unfortunately, little is currently known about how real-world websites use such code transformations. This paper presents an empirical study of obfuscation and minification in 967,149 scripts (424,023 unique) from the top 100,000 websites. The core of our study is a highly accurate (95%-100%) neural network-based classifier that we train to identify whether obfuscation or minification has been applied and, if so, with which tools. We find that code transformations are very widespread, affecting 38% of all scripts. Most of the transformed code has been minified, whereas advanced obfuscation techniques, such as encoding parts of the code or fetching all strings from a global array, affect less than 1% of all scripts (2,842 unique scripts in total). Studying which code gets obfuscated, we find that obfuscation is particularly common in certain website categories, e.g., adult content. Further analysis of the obfuscated code shows that most of it is similar to the output produced by a single obfuscation tool and that some obfuscated scripts trigger suspicious behavior, such as likely fingerprinting and timing attacks. Finally, we show that obfuscation comes at a cost, because it slows down execution and risks producing code that changes the intended behavior. Overall, our study shows that the security community must consider minified and obfuscated JavaScript code, and it provides insights into which kinds of transformations to focus on. Our learned classifiers provide an automated and accurate way to identify obfuscated code, and we release a set of real-world obfuscated scripts for future research.
This paper describes, develops, and validates SciLens, a method to evaluate the quality of scientific news articles. The starting point for our work is a set of structured methodologies that define a series of quality aspects for manually evaluating news. Based on these aspects, we describe a series of indicators of news quality. According to our experiments, these indicators help non-experts evaluate the quality of a scientific news article more accurately than non-experts who do not have access to them. Furthermore, SciLens can also be used to produce a completely automated quality score for an article, which agrees with expert evaluators more than manual evaluations done by non-experts do. One of the main elements of SciLens is its focus on both the content and the context of articles, where context is provided by (1) explicit and implicit references in the article to the scientific literature, and (2) reactions in social media referencing the article. We show that both contextual elements can be valuable sources of information for determining article quality. The validation of SciLens, done through a combination of expert and non-expert annotation, demonstrates its effectiveness for both semi-automatic and automatic quality evaluation of scientific news.
Request latency is a critical metric in determining the usability of web services. The latency of a request includes service time - the time when the request is being actively serviced - and waiting time - the time when the request is waiting to be served. Most existing works aim to reduce request latency by focusing on reducing the mean service time (that is, shortening the critical path).
In this paper, we explore an alternative approach to reducing latency - using variability as a guiding principle when designing web services. By tracking the service time variability of the request as it traverses across software layers within the user and kernel space of the web server, we identify the most critical stages of request processing. We then determine control knobs in the OS and application, such as thread scheduling and request batching, that regulate the variability in these stages, and demonstrate that tuning these specific knobs can significantly improve end-to-end request latency. Our experimental results with Memcached and Apache web server under different request rates, including real-world traces, show that this alternative approach can reduce mean and tail latency by 30-50%.
In this paper, we investigate the task of aggregating search results from heterogeneous sources in an e-commerce environment. First, unlike traditional aggregated web search, which merely presents multi-sourced results on the first page, this new task may present aggregated results on all pages and has to dynamically decide which source should be presented on the current page. Second, as pointed out by many existing studies, it is not trivial to rank items from heterogeneous sources because the relevance scores from different source systems are not directly comparable. To address these two issues, we decompose the task into two subtasks in a hierarchical structure: a high-level task for source selection, where we model the sequential patterns of user behaviors on aggregated results across different pages so as to understand user intents and select the relevant sources properly; and a low-level task for item presentation, where we formulate a slot-filling process to sequentially present the items, instead of giving each item a relevance score, when deciding the presentation order of heterogeneous items. Since both subtasks can be naturally formulated as sequential decision problems and can learn from future user feedback on search results, we build our model with hierarchical reinforcement learning. Extensive experiments demonstrate that our model obtains remarkable improvements in search performance metrics and achieves higher user satisfaction.
Understanding temporal dynamics has proved to be highly valuable for accurate recommendation. Sequential recommenders have been successful in modeling the dynamics of users and items over time. However, while different model architectures excel at capturing various temporal ranges or dynamics, distinct application contexts require adapting to diverse behaviors.
In this paper we examine how to build a model that can make use of different temporal ranges and dynamics depending on the request context. We begin with the analysis of an anonymized YouTube dataset comprising millions of user sequences. We quantify the degree of long-range dependence in these sequences and demonstrate that both short-term and long-term dependent behavioral patterns co-exist. We then propose a neural Multi-temporal-range Mixture Model (M3) as a tailored solution for dealing with both short-term and long-term dependencies. Our approach employs a mixture of models, each with a different temporal range, combined by a learned gating mechanism capable of choosing different model combinations for different contextual information. In empirical evaluations on a public dataset and our own anonymized YouTube dataset, M3 consistently outperforms state-of-the-art sequential recommendation methods.
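The gating idea can be sketched as follows: score vectors produced by experts covering different temporal ranges are combined with a context-dependent softmax gate. The actual M3 sub-models (sequence encoders with different ranges) and the end-to-end training procedure are not reproduced; all weights below are random placeholders.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def gated_mixture(expert_scores, context, gate_weights):
        # expert_scores: (n_experts, n_items); context: (d,); gate_weights: (n_experts, d).
        gate = softmax(gate_weights @ context)        # one mixing weight per temporal-range expert
        return gate @ expert_scores                   # combined (n_items,) scores

    rng = np.random.RandomState(0)
    scores = rng.rand(3, 5)                           # e.g., short-, mid-, long-range experts over 5 items
    print(gated_mixture(scores, rng.rand(8), rng.rand(3, 8)))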
Crowdsourcing has become a popular tool for large-scale data collection, where it is often assumed that crowd workers complete the work independently. In this paper, we relax this independence assumption and explore the use of peer communication, a form of direct interaction between workers, in crowdsourcing. In particular, in the crowdsourcing setting with peer communication, a pair of workers is asked to complete the same task together by first generating their initial answers to the task independently and then freely discussing the task with each other and updating their answers after the discussion. We first experimentally examine the effects of peer communication on individual microtasks. Our experiments, conducted on three types of tasks, consistently suggest that work quality is significantly improved in tasks with peer communication compared to tasks where workers complete the work independently. We next explore how to utilize peer communication to optimize the requester's total utility while taking into account the higher data correlation and higher cost introduced by peer communication. In particular, we model the requester's online decision problem of whether and when to use peer communication in crowdsourcing as a constrained Markov decision process that maximizes the requester's total utility under budget constraints. Our proposed approach is empirically shown to bring higher total utility than baseline approaches.
Real-time traffic volume inference is key to an intelligent city. It is a challenging task because accurate traffic volumes on the roads can only be measured at certain locations where sensors are installed. Moreover, the traffic evolves over time due to the influences of weather, events, holidays, etc. Existing solutions to the traffic volume inference problem often rely on dense GPS trajectories, which inevitably fail to account for the vehicles which carry no GPS devices or have them turned off. Consequently, the results are biased to taxicabs because they are almost always online for GPS tracking. In this paper, we propose a novel framework for the citywide traffic volume inference using both dense GPS trajectories and incomplete trajectories captured by camera surveillance systems. Our approach employs a high-fidelity traffic simulator and deep reinforcement learning to recover full vehicle movements from the incomplete trajectories. In order to jointly model the recovered trajectories and dense GPS trajectories, we construct spatiotemporal graphs and use multi-view graph embedding to encode the multi-hop correlations between road segments into real-valued vectors. Finally, we infer the citywide traffic volumes by propagating the traffic values of monitored road segments to the unmonitored ones through masked pairwise similarities. Extensive experiments with two big regions in a provincial capital city in China verify the effectiveness of our approach.
Open-domain dialogue agents must be able to converse about many topics while incorporating knowledge about the user into the conversation. In this work, we address the acquisition of such knowledge, for personalization in downstream Web applications, by extracting personal attributes from conversations. This problem is more challenging than the established task of information extraction from scientific publications or Wikipedia articles, because dialogues often give merely implicit cues about the speaker. We propose methods for inferring personal attributes, such as profession, age or family status, from conversations using deep learning. Specifically, we propose several Hidden Attribute Models, which are neural networks leveraging attention mechanisms and embeddings. Our methods are trained on a per-predicate basis to output rankings of object values for a given subject-predicate combination (e.g., ranking the doctor and nurse professions high when speakers talk about patients, emergency rooms, etc.). Experiments with various conversational texts including Reddit discussions, movie scripts and a collection of crowdsourced personal dialogues demonstrate the viability of our methods and their superior performance compared to state-of-the-art baselines.
With millions of images shared online on social networking sites, effective methods for image privacy prediction are urgently needed. In this paper, we propose an approach for fusing object, scene context, and image tag modalities derived from convolutional neural networks to accurately predict the privacy of images shared online. Specifically, our approach identifies the set of most competent modalities on the fly, according to each new target image whose privacy has to be predicted. The approach proceeds in three stages to predict the privacy of a target image: we first identify the neighborhood images that are visually similar and/or have similar sensitive content as the target image. Then, we estimate the competence of the modalities based on the neighborhood images. Finally, we fuse the decisions of the most competent modalities and predict the privacy label for the target image. Experimental results show that our approach predicts sensitive (or private) content more accurately than models trained on individual modalities (object, scene, and tags) and prior privacy prediction works. Our approach also outperforms strong baselines that train meta-classifiers to obtain an optimal combination of modalities.
Personalized recommendation algorithms learn a user's preference for an item by measuring a distance/similarity between them. However, some existing recommendation models (e.g., matrix factorization) assume a linear relationship between the user and item. This assumption limits the capacity of recommender systems, since the interactions between users and items in real-world applications are much more complex than a linear relationship. To overcome this limitation, in this paper we design and propose a deep learning framework called the Signed Distance-based Deep Memory Recommender, which captures non-linear relationships between users and items explicitly and implicitly, and works well in both the general recommendation task and the shopping-basket-based recommendation task. Through an extensive empirical study on six real-world datasets in the two recommendation tasks, our proposed approach achieves significant improvements over ten state-of-the-art recommendation models.
Collaborative crowd computing, e.g., human computation and crowdsourcing, involves a team of workers jointly solving tasks of varying difficulties. In such settings, the ability to manage the workflow based on workers' skills and task strains can improve output quality. However, many practical systems employ a simple additive scoring scheme to measure worker performance, and do not consider the task difficulty or worker interaction. Some prior works have looked at ways of measuring worker performance or task difficulty in collaborative settings, but usually assume sophisticated models. In our work, we address this question by taking a competitive perspective and leveraging the vast prior work on competitive games. We adapt TrueSkill's standard competitive model by treating the task as a fictitious worker that the team of humans jointly plays against. We explore two fast online approaches to estimate the worker and task ratings: (1) an ELO rating system, and (2) approximate inference with the Expectation Propagation algorithm. To assess the strengths and weaknesses of the various rating methods, we conduct a human study on Amazon's Mechanical Turk with a simulated ESP game. Our experimental design has the novel element of pairing a carefully designed bot with human workers; these encounters can be used, in turn, to generate a larger set of simulated encounters, yielding more data. Our analysis confirms that our ranking scheme performs consistently and robustly, and outperforms the traditional additive scheme in terms of predicted accuracy.
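As an illustration of the competitive framing above, the following is a minimal Python sketch of an Elo-style update in which the team of workers plays against the task, treated as a fictitious opponent. The rating scale, K-factor, and equal-credit team aggregation are illustrative assumptions, not the paper's exact TrueSkill/Expectation Propagation machinery.

```python
# Hedged sketch: an Elo-style update in which a team of workers "plays against"
# the task itself, treated as a fictitious opponent.
# The rating scale, K-factor, and team-aggregation rule are illustrative assumptions.

def expected_score(r_a, r_b):
    """Probability that side A beats side B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_team_vs_task(worker_ratings, task_rating, team_won, k=32.0):
    """Update each worker's rating and the task's rating after one encounter.

    worker_ratings: list of Elo ratings for the workers on the team.
    task_rating:    Elo rating of the task (the fictitious opponent).
    team_won:       True if the team solved the task correctly.
    """
    team_rating = sum(worker_ratings) / len(worker_ratings)  # simple average aggregation
    e_team = expected_score(team_rating, task_rating)
    s_team = 1.0 if team_won else 0.0
    delta = k * (s_team - e_team)
    new_workers = [r + delta for r in worker_ratings]   # credit/blame shared equally
    new_task = task_rating - delta                      # the task gains what the team loses
    return new_workers, new_task

# Example: two workers rated 1500 and 1600 solve a task rated 1550.
workers, task = update_team_vs_task([1500.0, 1600.0], 1550.0, team_won=True)
print(workers, task)
```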
Key to recommender systems is learning user preferences, which are expressed through various modalities. In online reviews, for instance, this manifests in numerical rating, textual content, as well as visual images. In this work, we hypothesize that modelling these modalities jointly would result in a more holistic representation of a review towards more accurate recommendations. Therefore, we propose Multimodal Review Generation (MRG), a neural approach that simultaneously models a rating prediction component and a review text generation component. We hypothesize that the shared user and item representations would augment the rating prediction with richer information from review text, while sensitizing the generated review text to sentiment features based on user and item of interest. Moreover, when review photos are available, visual features could inform the review text generation further. Comprehensive experiments on real-life datasets from several major US cities show that the proposed model outperforms comparable multimodal baselines, while an ablation analysis establishes the relative contributions of the respective components of the joint model.
Triangle counts of massive graphs can provide important information regarding the structure of the networks these graphs model. Exact triangle counting can be expensive, and thus researchers have proposed a number of approximation approaches. The state-of-the-art triangle count approximation techniques depend on wedge (two-path) sampling. In this paper we offer a mechanism to significantly improve wedge sampling for triangle counting. We shrink the sampling space by eliminating wedges that are less likely to participate in triangles. Experiments over large-scale real-world graphs show that the proposed mechanism provides a five- to a few-hundred-fold reduction in sampling space. When compared against the state-of-the-art approaches, it requires up to ~100× less sampling to provide the same accuracy, or incurs up to ~8× less error when used with the same sampling ratio.
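For context, the following is a minimal Python sketch of the standard wedge-sampling estimator that such techniques build on: sample wedges with probability proportional to how many wedges each vertex centers, measure the fraction that are closed, and scale by the total wedge count (each triangle closes three wedges). The sampling-space reduction described above would further restrict which wedges are eligible; that filtering step is not shown here.

```python
# Hedged sketch: plain wedge sampling for approximate triangle counting.
# The graph representation and the sampler are illustrative.
import random

def count_triangles_wedge_sampling(adj, num_samples=10000, seed=0):
    """adj: dict mapping vertex -> set of neighbors (undirected graph)."""
    rng = random.Random(seed)
    # number of wedges centered at v is C(deg(v), 2)
    wedge_counts = {v: len(nbrs) * (len(nbrs) - 1) // 2 for v, nbrs in adj.items()}
    total_wedges = sum(wedge_counts.values())
    if total_wedges == 0:
        return 0.0
    centers = list(wedge_counts)
    weights = [wedge_counts[v] for v in centers]
    closed = 0
    for _ in range(num_samples):
        v = rng.choices(centers, weights=weights, k=1)[0]
        a, b = rng.sample(list(adj[v]), 2)        # a uniformly random wedge (a, v, b)
        if b in adj[a]:                           # wedge is closed -> triangle
            closed += 1
    closure_rate = closed / num_samples
    # each triangle closes exactly 3 wedges
    return closure_rate * total_wedges / 3.0
```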
The phenomenal growth of graph data from a wide variety of real-world applications has rendered graph querying a problem of paramount importance. Traditional techniques use structural as well as node similarities to find matches of a given query graph in a (large) target graph. However, almost all existing techniques have tacitly ignored the presence of relationships in graphs, which are usually encoded through interactions between node and edge labels. In this paper, we propose RAQ (Relationship-Aware Graph Querying) to mitigate this gap. Given a query graph, RAQ identifies the k best matching subgraphs of the target graph that encode similar relationships as in the query graph. To assess the utility of RAQ as a graph querying paradigm for knowledge discovery and exploration tasks, we perform a user survey on the Internet Movie Database (IMDb), where an overwhelming 86% of the 170 surveyed users preferred the relationship-aware match over traditional graph querying. The need to perform subgraph isomorphism renders RAQ NP-hard. The querying is made practical through beam stack search. Extensive experiments on multiple real-world graph datasets demonstrate RAQ to be effective, efficient, and scalable.
Dynamically extracting and representing continually evolving knowledge entities is an essential scaffold for grounded intelligence and decision making. Creating knowledge schemas for newly emerging, unfamiliar, domain-specific ideas or events poses the following challenges: (i) detecting relevant, often previously unknown concepts associated with the new domain; and (ii) learning ontological, semantically accurate relationships among the new concepts, despite having severely limited annotated data. To this end, we propose a novel LSTM-based framework with attentive pooling, BOLT-K, to learn an ontology for a target subject or domain. We bootstrap our ontology learning approach by adapting and transferring knowledge from an existing, functionally related source domain. We also augment the inadequate labeled data available for the target domain with various strategies to minimize human expertise during model development and training. BOLT-K first employs semantic and graphical features to recognize the entity or concept pairs likely to be related to each other, and filters out spurious concept combinations. It is then jointly trained on knowledge from the target and source domains to learn relationships among the target concepts. The target concepts and their corresponding relationships are subsequently used to construct an ontology. We extensively evaluate our framework on several, real-world bio-medical and commercial product domain ontologies. We obtain significant improvements of 5-25% F1-score points over state-of-the-art baselines. We also examine the potential of BOLT-K in detecting the presence of novel kinds of relationships that were unseen during training.
Finding clusters of well-connected nodes in a graph is an extensively studied problem in graph-based data analysis. Because of its many applications, a large number of distinct graph clustering objective functions and algorithms have already been proposed and analyzed. To aid practitioners in determining the best clustering approach to use in different applications, we present new techniques for automatically learning how to set clustering resolution parameters. These parameters control the size and structure of communities that are formed by optimizing a generalized objective function. We begin by formalizing the notion of a parameter fitness function, which measures how well a fixed input clustering approximately solves a generalized clustering objective for a specific resolution parameter value. Under reasonable assumptions, which suit two key graph clustering applications, such a parameter fitness function can be efficiently minimized using a bisection-like method, yielding a resolution parameter that fits well with the example clustering. We view our framework as a type of single-shot hyperparameter tuning, as we are able to learn a good resolution parameter with just a single example. Our general approach can be applied to learn resolution parameters for both local and global graph clustering objectives. We demonstrate its utility in several experiments on real-world data where it is helpful to learn resolution parameters from a given example clustering.
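A minimal sketch of the bisection-like step, under the assumption that the parameter fitness function is well-behaved enough that a sign change of its (numerical) gradient brackets the best resolution value; the function `fitness_gradient`, the bracket, and the tolerance are illustrative, not the paper's exact procedure.

```python
# Hedged sketch: bisection-style search for a clustering resolution parameter.
# `fitness_gradient(gamma)` is assumed to be supplied by the user and to be negative
# below the optimal resolution and positive above it (an assumption of this sketch).

def bisect_resolution(fitness_gradient, lo, hi, tol=1e-4, max_iter=60):
    """Find gamma in [lo, hi] where the fitness gradient changes sign."""
    g_lo, g_hi = fitness_gradient(lo), fitness_gradient(hi)
    if g_lo * g_hi > 0:
        raise ValueError("gradient must change sign on [lo, hi]")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        g_mid = fitness_gradient(mid)
        if abs(hi - lo) < tol:
            return mid
        if g_lo * g_mid <= 0:
            hi, g_hi = mid, g_mid
        else:
            lo, g_lo = mid, g_mid
    return 0.5 * (lo + hi)

# Example with a toy quadratic fitness whose minimum is at gamma = 1.7.
print(bisect_resolution(lambda g: 2 * (g - 1.7), 0.1, 10.0))
```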
Data brokers such as Acxiom and Experian are in the business of collecting and selling data on people; the data they sell is commonly used to feed marketing as well as political campaigns. Despite the ongoing privacy debate, there is still very limited visibility into data collection by data brokers. Recently, however, online advertising services such as Facebook have begun to partner with data brokers (to add additional targeting features to their platforms), providing avenues to gain insight into data broker information.
In this paper, we leverage the Facebook advertising system, and its partnership with six data brokers across seven countries, in order to gain insight into the extent and accuracy of data collection by data brokers today. We find that a surprisingly large percentage of Facebook accounts (e.g., above 90% in the U.S.) are successfully linked to data broker information. Moreover, by running controlled ads to 183 crowdsourced U.S.-based volunteers, we find that at least 40% of data-broker-sourced user attributes are not at all accurate, that users can have widely varying fractions of inaccurate attributes, and that even important information such as financial information can have a high degree of inaccuracy. Overall, this paper provides the first fine-grained look into the extent and accuracy of data collection by offline data brokers, helping to inform the ongoing privacy debate.
The public is increasingly concerned about the practices of large technology companies with regards to privacy and many other issues. To force changes in these practices, there have been growing calls for “data strikes.” These new types of collective action would seek to create leverage for the public by starving business-critical models (e.g. recommender systems, ranking algorithms) of much-needed training data. However, little is known about how data strikes would work, let alone how effective they would be. Focusing on the important commercial domain of recommender systems, we simulate data strikes under a wide variety of conditions and explore how they can augment traditional boycotts. Our results suggest that data strikes can be effective and that users have more power in their relationship with technology companies than they do with other companies. However, our results also highlight important trade-offs and challenges that must be considered by potential organizers.
A semantic model of a data source is a representation of the concepts and relationships contained in the data. Building semantic models is a prerequisite to automatically publishing data to a knowledge graph. However, creating these semantic models is a complex process requiring considerable manual effort and can be error-prone. In this paper, we present a novel approach that efficiently searches over the combinatorial space of possible semantic models and applies a probabilistic graphical model to identify the most probable semantic model for a data source. Probabilistic graphical models offer many advantages over existing methods: they are robust to noisy inputs and provide a straightforward approach for exploiting relationships within the data. Our solution uses a conditional random field (CRF) to encode structural patterns and enforce conceptual consistency within the semantic model. In an empirical evaluation, our approach outperforms state-of-the-art systems by an average of 8.4% in F1 score, even with noisy input data.
In many online platforms, people must choose how broadly to allocate their energy. Should one concentrate on a narrow area of focus, and become a specialist, or apply oneself more broadly, and become a generalist? In this work, we propose a principled measure of how generalist or specialist a user is, and study behavior in online platforms through this lens. To do this, we construct highly accurate community embeddings that represent communities in a high-dimensional space. We develop sets of community analogies and use them to optimize our embeddings so that they encode community relationships extremely well. Based on these embeddings, we introduce a natural measure of activity diversity, the GS-score.
Applying our embedding-based measure to online platforms, we observe a broad spectrum of user activity styles, from extreme specialists to extreme generalists, in both community membership on Reddit and programming contributions on GitHub. We find that activity diversity is related to many important phenomena of user behavior. For example, specialists are much more likely to stay in communities they contribute to, but generalists are much more likely to remain on platforms as a whole. We also find that generalists engage with significantly more diverse sets of users than specialists do. Furthermore, our methodology leads to a simple algorithm for community recommendation, matching state-of-the-art methods like collaborative filtering. Our methods and results introduce an important new dimension of online user behavior and shed light on many aspects of online platform use.
Hypernymy is a semantic relation, expressing the “is-a” relation between a concept and its instances. Such relations are building blocks for large-scale taxonomies, ontologies and knowledge graphs. Recently, much progress has been made for hypernymy prediction in English using textual patterns and/or distributional representations. However, applying such techniques to other languages is challenging due to the high language dependency of these methods and the lack of large training datasets of lower-resourced languages.
In this work, we present a family of fuzzy orthogonal projection models for both monolingual and cross-lingual hypernymy prediction. For the monolingual task, we propose a Multi-Wahba Projection (MWP) model to distinguish hypernymy vs. non-hypernymy relations based on word embeddings. This model establishes distributional fuzzy mappings from the embeddings of a term to those of its hypernyms and non-hypernyms, which account for the complicated linguistic regularities of these relations. For cross-lingual hypernymy prediction, a Transfer MWP (TMWP) model is proposed to transfer semantic knowledge from the source language to target languages based on neural word translation. Additionally, an Iterative Transfer MWP (ITMWP) model is built upon TMWP, which augments the training sets of target languages when target languages are lower-resourced with limited training data. Experiments show that (i) MWP outperforms previous methods on two hypernymy prediction tasks for English, and (ii) TMWP and ITMWP are effective in predicting hypernymy for seven non-English languages.
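As a simplified stand-in for the projection idea above, the sketch below fits a single orthogonal map from term embeddings to hypernym embeddings via the classical Procrustes/Wahba solution (an SVD) and scores a candidate pair by the similarity between the projected term and the candidate hypernym. The fuzzy multi-projection and cross-lingual transfer components of MWP/TMWP are not reproduced here.

```python
# Hedged sketch: one orthogonal projection from term embeddings to hypernym embeddings,
# solved in closed form; an illustration, not the paper's fuzzy multi-projection model.
import numpy as np

def fit_orthogonal_projection(X_terms, Y_hypernyms):
    """Find orthogonal W minimizing ||X W - Y||_F (both matrices are (n, d))."""
    M = X_terms.T @ Y_hypernyms
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt

def hypernymy_score(x, y, W):
    """Cosine similarity between the projected term x @ W and a candidate hypernym y."""
    p = x @ W
    return float(p @ y / (np.linalg.norm(p) * np.linalg.norm(y) + 1e-9))
```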
Repeat consumption is a common scenario in daily life, such as repurchasing items and revisiting websites, and is a critical factor to be taken into consideration in recommender systems. Temporal dynamics play an important role in modeling repeat consumption. It is noteworthy that for items with distinct lifetimes, the tendency to consume an item again fluctuates differently over time. For example, users may repurchase milk weekly, but are likely to repurchase a mobile phone only after a long period of time. Therefore, how to adaptively incorporate various temporal patterns of repeat consumption into a holistic recommendation model is a new and important problem.
In this paper, we propose a novel unified model that introduces the Hawkes process into Collaborative Filtering (CF). Different from most previous work, which ignores the various time-varying patterns of repeat consumption, the model explicitly addresses two item-specific temporal dynamics: (1) a short-term effect and (2) a life-time effect; we name it the Short-Term and Life-Time Repeat Consumption (SLRC) model. SLRC learns the importance of the two factors for each item dynamically through interpretable parameters. In extensive experiments on four datasets from diverse scenarios, including two public collections, SLRC is superior to previous approaches for repeat consumption modeling. Moreover, due to the high flexibility of SLRC, various existing recommendation algorithms can easily be leveraged in this model to achieve significant improvements. In addition, SLRC is good at balancing recommendation of novel items and consumed items (exploration and exploitation). We also find that the learned parameters are highly interpretable, so the model can be leveraged to discover items' lifetimes and to distinguish different types of items, such as durable and fast-moving consumer goods.
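The following sketch illustrates the kind of intensity function such a model combines with collaborative filtering: a base rate (standing in for the CF score) plus a short-term excitation kernel and a life-time kernel peaking around the item's typical repurchase interval. The kernel forms, parameters, and the clamp at zero are illustrative assumptions rather than SLRC's exact parameterization.

```python
# Hedged sketch: a Hawkes-style repeat-consumption intensity with a short-term and a
# life-time component; base_rate stands in for the collaborative filtering score.
import math

def repeat_intensity(t, past_times, base_rate,
                     alpha_short=0.5, decay=1.0,
                     alpha_life=0.3, lifetime=7.0, sigma=2.0):
    """Intensity of (re)consuming an item at time t given past consumption times."""
    # short-term effect: recent consumptions excite (or inhibit) the next one,
    # with an exponentially decaying kernel
    short_term = sum(math.exp(-decay * (t - s)) for s in past_times if s < t)
    # life-time effect: repurchase tendency peaks around the item's typical lifetime
    life_time = sum(math.exp(-((t - s - lifetime) ** 2) / (2 * sigma ** 2))
                    for s in past_times if s < t)
    return max(0.0, base_rate + alpha_short * short_term + alpha_life * life_time)

# Example: an item with a one-week lifetime bought on days 0 and 7, scored on day 14.
print(repeat_intensity(14.0, [0.0, 7.0], base_rate=0.2))
```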
The continuing expansion of mobile app ecosystems has attracted substantial effort from the research community. However, although a large number of research studies have focused on analyzing corpora of mobile apps and app markets, little is known at a comprehensive level about the evolution of mobile app ecosystems. Because the mobile app ecosystem is continuously evolving over time, understanding the dynamics of app ecosystems could provide unique insights that cannot be achieved by studying a single static snapshot. In this paper, we seek to shed light on the dynamics of mobile app ecosystems. Based on 5.3 million app records (with both app metadata and apks) collected from three snapshots of Google Play over more than three years, we conduct the first study of the evolution of app ecosystems from different aspects. Our results suggest that although the overall ecosystem shows promising progress in regard to app popularity, user ratings, permission usage and privacy policy declaration, there still exist a considerable number of unsolved issues, including malicious apps, update issues, third-party tracking threats, improper app promotion behaviors, and spamming/malicious developers. Our study shows that understanding the evolution of mobile app ecosystems can help developers make better decisions on developing and releasing apps, provide insights for app markets to identify misbehaviors, and help mobile users choose desired apps.
Collaborative filtering often suffers from sparsity and cold-start problems in real recommendation scenarios; therefore, researchers and engineers usually use side information to address these issues and improve the performance of recommender systems. In this paper, we consider knowledge graphs as the source of side information. We propose MKR, a Multi-task feature learning approach for Knowledge graph enhanced Recommendation. MKR is a deep end-to-end framework that utilizes a knowledge graph embedding task to assist the recommendation task. The two tasks are associated by cross&compress units, which automatically share latent features and learn high-order interactions between items in recommender systems and entities in the knowledge graph. We prove that cross&compress units have sufficient capability of polynomial approximation, and show that MKR is a generalized framework over several representative methods of recommender systems and multi-task learning. Through extensive experiments on real-world datasets, we demonstrate that MKR achieves substantial gains in movie, book, music, and news recommendation over state-of-the-art baselines. MKR is also shown to be able to maintain satisfactory performance even if user-item interactions are sparse.
With the rapid development of the sharing economy, massive sharing systems such as Uber, Airbnb, and bikeshare have percolated into people's daily life. The sharing economy, at its core, aims to achieve efficient use of resources. The actual usage of shared resources, however, remains unclear. Little measurement or analysis, if any, has been conducted to investigate the resource usage status with the large-scale data collected from these sharing systems. In this paper, we analyze bike usage in three typical bikeshare systems based on 140 months of multi-event data. Our analysis shows that the most used 20% of bikes account for 45% of usage, while the least used 20% of bikes account for less than 1% of usage. To efficiently utilize shared bikes, we propose a usage balancing design called eShare, which has three components: (i) a statistical model based on archived data to infer historical usage; (ii) an entropy-based prediction model based on both real-time and archived data to infer future usage; (iii) a model-driven optimal calibration engine for bike selection to dynamically balance usage. We develop an ID-swapping-based evaluation methodology and measure the efficiency of eShare with data from three systems, including the world's largest bikeshare system with 84,000 bikes and 3,300 stations. Our results show that eShare not only fully utilizes shared bikes but also improves service quality.
Graph neural networks, as powerful deep-learning-based graph representation techniques, have shown superior performance and attracted considerable research interest. However, heterogeneous graphs, which contain different types of nodes and links, have not been fully considered in graph neural network research. The heterogeneity and rich semantic information bring great challenges for designing a graph neural network for heterogeneous graphs. Recently, one of the most exciting advancements in deep learning is the attention mechanism, whose great potential has been well demonstrated in various areas. In this paper, we first propose a novel heterogeneous graph neural network based on hierarchical attention, including node-level and semantic-level attention. Specifically, the node-level attention aims to learn the importance between a node and its meta-path-based neighbors, while the semantic-level attention is able to learn the importance of different meta-paths. With the learned importance from both node-level and semantic-level attention, the importance of nodes and meta-paths can be fully considered. The proposed model can then generate node embeddings by aggregating features from meta-path-based neighbors in a hierarchical manner. Extensive experimental results on three real-world heterogeneous graphs not only show the superior performance of our proposed model over the state of the art, but also demonstrate its potentially good interpretability for graph analysis.
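A minimal NumPy sketch of the semantic-level step: given one node-embedding matrix per meta-path (assumed to come from the node-level attention), compute a scalar importance for each meta-path with a small attention network, softmax the scores, and fuse the embeddings. The parameter shapes and the tanh scoring form are illustrative assumptions.

```python
# Hedged sketch: semantic-level attention over meta-path-specific node embeddings.
import numpy as np

def semantic_attention(metapath_embeddings, W, b, q):
    """Fuse per-meta-path node embeddings with learned attention weights.

    metapath_embeddings: list of P arrays, each of shape (N, d)  (N nodes, d dims)
    W: (d, d') projection, b: (d',) bias, q: (d',) semantic attention vector
    Returns (fused (N, d) embeddings, attention weights over the P meta-paths).
    """
    scores = []
    for Z in metapath_embeddings:
        # importance of a meta-path: average attention score over all nodes
        scores.append(np.mean(np.tanh(Z @ W + b) @ q))
    scores = np.array(scores)
    beta = np.exp(scores - scores.max())
    beta = beta / beta.sum()                      # softmax over meta-paths
    fused = sum(w_p * Z for w_p, Z in zip(beta, metapath_embeddings))
    return fused, beta

# Example with 3 meta-paths, 5 nodes, 4-dim embeddings.
rng = np.random.default_rng(0)
Zs = [rng.normal(size=(5, 4)) for _ in range(3)]
fused, beta = semantic_attention(Zs, rng.normal(size=(4, 8)), np.zeros(8), rng.normal(size=8))
print(beta)
```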
Aspect-level sentiment analysis aims to provide a complete and detailed view of sentiment analysis across different aspects. Existing solutions usually adopt a two-stage approach: first detecting the aspect categories in a document, then categorizing the polarity of opinion expressions for the detected aspect(s). Inevitably, such methods lead to error accumulation. Moreover, aspect detection and aspect-level sentiment classification are highly correlated with each other. The key issue is how to perform aspect detection and aspect-level sentiment classification jointly and effectively. In this paper, we propose the aspect-level sentiment capsules model (AS-Capsules), which is capable of performing aspect detection and sentiment classification simultaneously, in a joint manner. AS-Capsules utilizes the correlation between aspect and sentiment through shared components, including capsule embeddings, shared encoders, and shared attentions. AS-Capsules is also capable of communication among different capsules through a shared Recurrent Neural Network (RNN). More importantly, the AS-Capsules model does not require any linguistic knowledge as additional input. Instead, through the attention mechanism, the model is able to attend to aspect-related words and sentiment words corresponding to different aspect(s). Experiments show that the AS-Capsules model achieves state-of-the-art performance on a benchmark dataset for aspect-level sentiment analysis.
Social advertisement has emerged as a viable means to improve purchase sharing in the context of e-commerce. However, manually generating large numbers of advertising scripts can be prohibitively expensive for both e-platforms and online sellers, and developing the desired auto-generator requires substantial gold-standard training samples. In this paper, we put forward a novel seq2seq model to generate social advertisements automatically, in which a quality-sensitive loss function based on user click behavior is proposed to differentiate training samples of varied quality. Our motivation is to leverage click-through data as a quality indicator that quantitatively measures the textual fitness of each training sample, so that only those ground truths that satisfy social media users are considered eligible to optimize social advertisement generation. Specifically, in the qualified case, the ground truth should be utilized to supervise the whole training phase as much as possible, whereas in the opposite situation, the generated result ought to preserve the semantics of the original input to the greatest extent. Simulation experiments on a large-scale dataset demonstrate that our approach significantly outperforms two existing distant supervision methods and three state-of-the-art NLG solutions.
Social media provide access to behavioural data at an unprecedented scale and granularity. However, using these data to understand phenomena in a broader population is difficult due to their non-representativeness and the bias of statistical inference tools towards dominant languages and groups. While demographic attribute inference could be used to mitigate such bias, current techniques are almost entirely monolingual and fail to work in a global environment. We address these challenges by combining multilingual demographic inference with post-stratification to create a more representative population sample. To learn demographic attributes, we create a new multimodal deep neural architecture for joint classification of age, gender, and organization-status of social media users that operates in 32 languages. This method substantially outperforms current state of the art while also reducing algorithmic bias. To correct for sampling biases, we propose fully interpretable multilevel regression methods that estimate inclusion probabilities from inferred joint population counts and ground-truth population counts.
In a large experiment over multilingual heterogeneous European regions, we show that our demographic inference and bias correction together allow for more accurate estimates of populations and make a significant step towards representative social sensing in downstream applications with multilingual social media.
The increasing utilization of massive open online courses has significantly expanded global access to formal education. Despite the technology's promising future, student interaction on MOOCs is still a relatively under-explored and poorly understood topic. This work proposes a multi-level pattern discovery through hierarchical discriminative tensor factorization. We formulate the problem as a hierarchical discriminant subspace learning problem, where the goal is to discover the shared and discriminative patterns with a hierarchical structure. The discovered patterns enable a more effective exploration of the contrasting behaviors of two performance groups. We conduct extensive experiments on several real-world MOOC datasets to demonstrate the effectiveness of our proposed approach. Our study advances the current predictive modeling in MOOCs by providing more interpretable behavioral patterns and linking their relationships with the performance outcome.
Recommender systems have to handle a highly non-stationary environment, due to users' fast-changing interests over time. Traditional solutions have to periodically rebuild their models at high computational cost, and even this does not enable them to adjust automatically to abrupt changes in trends caused by timely information. It is important to note that the changes in reward distributions caused by a non-stationary environment can also be context dependent. When the change is orthogonal to the given context, previously maintained models should be reused for better recommendation prediction.
In this work, we focus on contextual bandit algorithms for making adaptive recommendations. We capitalize on the unique context-dependent property of reward changes to conquer the challenging non-stationary environment for model update. In particular, we maintain a dynamic ensemble of contextual bandit models, where each bandit model's reward estimation quality is monitored regarding given context and possible environment changes. Only the admissible models to the current environment will be used for recommendation. We provide a rigorous upper regret bound analysis of our proposed algorithm. Extensive empirical evaluations on both synthetic and three real-world datasets confirmed the algorithm's advantage against existing non-stationary solutions that simply create new models whenever an environment change is detected.
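The sketch below conveys the ensemble idea in simplified form: each candidate reward model tracks its recent prediction error, only models whose error stays within a tolerance are treated as admissible for the current environment, and a new model is spawned when none qualifies. The plain ridge-regression models, error window, and tolerance rule are illustrative stand-ins for the paper's contextual bandit machinery and its regret analysis.

```python
# Hedged sketch: a dynamic ensemble of per-context reward models with an
# admissibility check based on recent prediction error.
import numpy as np
from collections import deque

class RewardModel:
    def __init__(self, dim, lam=1.0, window=50):
        self.A = lam * np.eye(dim)     # ridge regression sufficient statistics
        self.b = np.zeros(dim)
        self.errors = deque(maxlen=window)

    def predict(self, x):
        return x @ np.linalg.solve(self.A, self.b)

    def update(self, x, reward):
        self.errors.append(abs(self.predict(x) - reward))
        self.A += np.outer(x, x)
        self.b += reward * x

    def admissible(self, tol=0.3):
        # too few observations to judge, or recent error within tolerance
        return len(self.errors) < 10 or np.mean(self.errors) < tol

class DynamicEnsemble:
    def __init__(self, dim):
        self.dim = dim
        self.models = [RewardModel(dim)]

    def score(self, x):
        """Average prediction of the currently admissible models."""
        active = [m for m in self.models if m.admissible()]
        if not active:                          # environment change: spawn a fresh model
            active = [RewardModel(self.dim)]
            self.models.append(active[0])
        return float(np.mean([m.predict(x) for m in active]))

    def feedback(self, x, reward):
        for m in self.models:
            m.update(x, reward)
```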
Social recommendation leverages social information to solve data sparsity and cold-start problems in traditional collaborative filtering methods. However, most existing models assume that social effects from friend users are static and under the forms of constant weights or fixed constraints. To relax this strong assumption, in this paper, we propose dual graph attention networks to collaboratively learn representations for two-fold social effects, where one is modeled by a user-specific attention weight and the other is modeled by a dynamic and context-aware attention weight. We also extend the social effects in user domain to item domain, so that information from related items can be leveraged to further alleviate the data sparsity problem. Furthermore, considering that different social effects in two domains could interact with each other and jointly influence users' preferences for items, we propose a new policy-based fusion strategy based on contextual multi-armed bandit to weigh interactions of various social effects. Experiments on one benchmark dataset and a commercial dataset verify the efficacy of the key components in our model. The results show that our model achieves great improvement for recommendation accuracy compared with other state-of-the-art social recommendation methods.
Compared to general web search engines, web image search engines display results in a different way. In web image search, results are typically placed in a grid-based manner rather than in a sequential result list. In this scenario, users can view results not only in the vertical direction but also in the horizontal direction. Moreover, pagination is usually not (explicitly) supported on image search engine result pages (SERPs), and users can view results by scrolling down without having to click a “next page” button. These differences lead to different interaction mechanisms and user behavior patterns, which, in turn, create challenges for evaluation metrics that were originally developed for general web search. While considerable effort has been invested in developing evaluation metrics for general web search, there has been relatively little effort to construct grid-based evaluation metrics.
To inform the development of grid-based evaluation metrics for web image search, we conduct a comprehensive analysis of user behavior so as to uncover how users allocate their attention in a grid-based web image search result interface. We obtain three findings: (1) “Middle bias”: Confirming previous studies, we find that image results in the horizontal middle positions may receive more attention from users than those in the leftmost or rightmost positions. (2) “Slower decay”: Unlike web search, users' attention does not decrease monotonically or dramatically with the rank position in image search, especially within a row. (3) “Row skipping”: Users may ignore particular rows and directly jump to results at some distance. Motivated by these observations, we propose corresponding user behavior assumptions to capture users' search interaction processes and evaluate their search performance. We show how to derive new metrics from these assumptions and demonstrate that they can be adopted to revise traditional list-based metrics like Discounted Cumulative Gain (DCG) and Rank-Biased Precision (RBP). To show the effectiveness of the proposed grid-based metrics, we compare them against a number of list-based metrics in terms of their correlation with user satisfaction. Our experimental results show that the proposed grid-based evaluation metrics better reflect user satisfaction in web image search.
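To make the direction concrete, here is a hedged sketch of how a DCG-style metric could be adapted to a grid: the discount decays per row rather than per rank (reflecting the slower decay), and positions near the horizontal middle of a row receive a mild boost (reflecting the middle bias). The specific decay and boost functions are illustrative; the paper derives its discounts from the stated user-behavior assumptions (including row skipping), which this toy version does not model.

```python
# Hedged sketch: a grid-aware DCG variant with row-level decay and a middle-position boost.

def grid_dcg(relevances, row_width=4, row_decay=0.85, middle_boost=0.1):
    """relevances: list of graded relevance values in row-major grid order."""
    score = 0.0
    for idx, rel in enumerate(relevances):
        row, col = divmod(idx, row_width)
        row_discount = row_decay ** row                       # decay per row, not per rank
        # positions near the horizontal middle of the row get slightly more attention
        centrality = 1.0 - abs(col - (row_width - 1) / 2) / ((row_width - 1) / 2)
        position_weight = 1.0 + middle_boost * centrality
        score += (2 ** rel - 1) * row_discount * position_weight
    return score

# Example: a 2 x 4 grid of relevance grades.
print(grid_dcg([3, 2, 2, 1, 1, 0, 2, 0]))
```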
Sarcasm is a sophisticated linguistic expression that is commonly observed on social media and e-commerce platforms. Failure to detect sarcastic expressions in natural language processing tasks, such as opinion mining and sentiment analysis, leads to poor model performance. Traditional approaches rely heavily on discrete handcrafted features and incur enormous human costs. It was not until recently that scholars began to employ neural networks to address these limitations and achieved new state-of-the-art performance. In this work, we propose a novel self-matching network to capture sentence “incongruity” information by exploring word-to-word interactions. In particular, we calculate the joint information of each word-to-word pair in the input sentence to build a self-matching attention vector, based on which we attend over the sentence and build its representation vector. Such a network allows a sentence to match against itself word by word and attend to words with conflicting sentiments. In addition, we incorporate a bi-directional LSTM network into our proposed network to retain compositional information. We concatenate incongruity information and compositional information through a Low-rank Bilinear Pooling method to control for potential information redundancy without losing discriminative power. Experimental results on publicly available datasets demonstrate that our model significantly outperforms extant baselines on standard evaluation metrics, including precision, recall, F1 score and accuracy.
To bridge the knowledge gap between research and practice, we present the first empirical study of the 16,500 most popular Android apps, demystifying how smartphone apps exploit deep learning in the wild. To this end, we build a new static tool that dissects apps and analyzes their deep learning functions. Our study answers three questions: which apps are the early adopters of deep learning, what do they use deep learning for, and what do their deep learning models look like. Our study has strong implications for app developers, smartphone vendors, and deep learning R&D. On one hand, our findings paint a promising picture of deep learning for smartphones, showing the prosperity of mobile deep learning frameworks as well as the prosperity of apps building their cores atop deep learning. On the other hand, our findings urge optimization of deep learning models deployed on smartphones, protection of these models, and validation of research ideas on these models.
Detecting local graph clusters is an important problem in big graph analysis. Given seed nodes in a graph, local clustering aims at finding subgraphs around the seed nodes that consist of nodes highly relevant to the seed nodes. However, existing local clustering methods either allow only a single seed node, or assume that all seed nodes come from the same cluster, which is not true in many real applications. Moreover, the assumption that all seed nodes are in a single cluster fails to use the crucial information of the relations between seed nodes. In this paper, we propose a method to take advantage of such relationships. With prior knowledge of the community membership of the seed nodes, the method labels seed nodes in the same (different) community with the same (different) color. To further use this information, we introduce a color-based random walk mechanism, where colors are propagated from the seed nodes to every node in the graph. Through the interaction of identical and distinct colors, we can incorporate the supervision from the seed nodes into the random walk process. We also propose a heuristic strategy to speed up the algorithm by more than two orders of magnitude. Experimental evaluations reveal that our clustering method outperforms state-of-the-art approaches by a large margin.
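A minimal sketch of a color-propagating random walk with restart: each seed color gets its own restart distribution, the walks are run to (approximate) convergence, and each node is associated with the color that visits it most. The damping factor, iteration count, and the winner-take-all assignment are illustrative assumptions; the interaction rules between identical and distinct colors described above are richer than this.

```python
# Hedged sketch: per-color random walk with restart plus a winner-take-all assignment.
import numpy as np

def colored_random_walk(adj_matrix, seeds_by_color, alpha=0.15, iters=100):
    """adj_matrix: (n, n) numpy array of an undirected graph.
    seeds_by_color: dict color -> list of seed node indices.
    Returns dict color -> visiting probabilities (length-n vector)."""
    n = adj_matrix.shape[0]
    deg = adj_matrix.sum(axis=1).astype(float)
    deg[deg == 0] = 1.0
    P = adj_matrix / deg[:, None]                 # row-stochastic transition matrix
    scores = {}
    for color, seeds in seeds_by_color.items():
        restart = np.zeros(n)
        restart[seeds] = 1.0 / len(seeds)
        x = restart.copy()
        for _ in range(iters):
            x = alpha * restart + (1 - alpha) * (P.T @ x)
        scores[color] = x
    return scores

def assign_clusters(scores):
    """Label each node with the color of the walk that visits it most."""
    colors = list(scores)
    stacked = np.vstack([scores[c] for c in colors])
    return [colors[i] for i in stacked.argmax(axis=0)]
```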
Location Based Social Networks (LBSNs) have been widely used as a primary data source to study the impact of mobility and social relationships on each other. Traditional approaches manually define features to characterize users' mobility homophily and social proximity, and show that mobility and social features can help friendship and location prediction tasks, respectively. However, these hand-crafted features not only require tedious human efforts, but also are difficult to generalize. In this paper, by revisiting user mobility and social relationships based on a large-scale LBSN dataset collected over a long-term period, we propose LBSN2Vec, a hypergraph embedding approach designed specifically for LBSN data for automatic feature learning. Specifically, LBSN data intrinsically forms a hypergraph including both user-user edges (friendships) and user-time-POI-semantic hyperedges (check-ins). Based on this hypergraph, we first propose a random-walk-with-stay scheme to jointly sample user check-ins and social relationships, and then learn node embeddings from the sampled (hyper)edges by preserving n-wise node proximity (n = 2 or 4). Our evaluation results show that LBSN2Vec both consistently and significantly outperforms the state-of-the-art graph embedding methods on both friendship and location prediction tasks, with an average improvement of 32.95% and 25.32%, respectively. Moreover, using LBSN2Vec, we discover the asymmetric impact of mobility and social relationships on predicting each other, which can serve as guidelines for future research on friendship and location prediction in LBSNs.
This paper presents Scalpel-CD, a first-of-its-kind system that leverages both human and machine intelligence to debug noisy labels from the training data of machine learning systems. Our system identifies potentially wrong labels using a deep probabilistic model, which is able to infer the latent class of a high-dimensional data instance by exploiting data distributions in the underlying latent feature space. To minimize crowd efforts, it employs a data sampler which selects data instances that would benefit the most from being inspected by the crowd. The manually verified labels are then propagated to similar data instances in the original training data by exploiting the underlying data structure, thus scaling out the contribution from the crowd. Scalpel-CD is designed with a set of algorithmic solutions to automatically search for the optimal configurations for different types of training data, in terms of the underlying data structure, noise ratio, and noise types (random vs. structural). In a real deployment on multiple machine learning tasks, we demonstrate that Scalpel-CD is able to improve label quality by 12.9% with only 2.8% instances inspected by the crowd.
People's content choices are ideally driven by their intentions, aspirations, and plans. However, in reality, choices may be modulated by recommendation systems which are typically trained to promote popular items and to reinforce users' historical behavior. As a result, the utility and user experience of content consumption can be affected implicitly and undesirably. To study this problem, we conducted a 2 × 2 randomized controlled field experiment (105 urban college students) to compare the effects of intention informed recommendations with classical intention agnostic systems. The study was conducted in the context of spoken word web content (podcasts) which is often consumed through subscription sites or apps. We modified a commercial podcast app to include (1) a recommender that takes into account users' stated intentions at onboarding, and (2) a Collaborative Filtering (CF) recommender during daily use. Our study suggests that: (1) intention-aware recommendations can significantly raise users' interactions (subscriptions and listening) with channels and episodes related to intended topics by over 24%, even if such a recommender is only used during onboarding, and (2) the CF-based recommender doubles users' explorations on episodes from not-subscribed channels and improves satisfaction for users onboarded with the intention-aware recommender.
Spatial-temporal prediction is a fundamental problem for constructing smart cities, and is useful for tasks such as traffic control, taxi dispatching, and environmental policy making. Due to data collection mechanisms, it is common to see data collections with unbalanced spatial distributions. For example, some cities may release taxi data for multiple years while others only release a few days of data; some regions may have constant water quality data monitored by sensors whereas other regions only have a small collection of water samples. In this paper, we tackle the problem of spatial-temporal prediction for cities with only a short period of data collection. We aim to utilize the long-period data from other cities via transfer learning. Different from previous studies that transfer knowledge from one single source city to a target city, we are the first to leverage information from multiple cities to increase the stability of transfer. Specifically, our proposed model is designed as a spatial-temporal network with a meta-learning paradigm. The meta-learning paradigm learns a well-generalized initialization of the spatial-temporal network, which can be effectively adapted to target cities. In addition, a pattern-based spatial-temporal memory is designed to distill long-term temporal information (i.e., periodicity). We conduct extensive experiments on two tasks: traffic (taxi and bike) prediction and water quality prediction. The experiments demonstrate the effectiveness of our proposed model over several competitive baseline models.
Recent advances in deep learning motivate the use of deep neural networks in Internet-of-Things (IoT) applications. These networks are modelled after signal processing in the human brain, thereby leading to significant advantages at perceptual tasks such as vision and speech recognition. IoT applications, however, often measure physical phenomena, where the underlying physics (such as inertia, wireless signal propagation, or the natural frequency of oscillation) are fundamentally a function of signal frequencies, offering better features in the frequency domain. This observation leads to a fundamental question: for IoT applications, can one develop a new brand of neural network structures that synthesize features inspired not only by the biology of human perception but also by the fundamental nature of physics? Hence, in this paper, instead of using conventional building blocks (e.g., convolutional and recurrent layers), we propose a new foundational neural network building block, the Short-Time Fourier Neural Network (STFNet). It integrates a widely used time-frequency analysis method, the Short-Time Fourier Transform, into data processing to learn features directly in the frequency domain, where the physics of underlying phenomena leave better footprints. STFNets bring additional flexibility to time-frequency analysis by offering novel nonlinear learnable operations that are spectral-compatible. Moreover, STFNets show that transforming signals to a domain that is more connected to the underlying physics greatly simplifies the learning process. We demonstrate the effectiveness of STFNets with extensive experiments on a wide range of sensing inputs, including motion sensors, WiFi, ultrasound, and visible light. STFNets significantly outperform the state-of-the-art deep learning models in all experiments. STFNet therefore demonstrates superior capability as the fundamental building block of deep neural networks for IoT applications with various sensor inputs.
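The core building block can be pictured with the following NumPy sketch: transform a windowed signal into the frequency domain with a short-time Fourier transform, multiply each frequency bin by a (learnable) weight, and transform back with overlap-add. The window length, hop size, Hann window choice, and single per-bin filter are illustrative; the actual STFNet layer adds further spectral-compatible nonlinear operations and is trained end to end.

```python
# Hedged sketch: STFT -> per-frequency-bin filtering -> inverse STFT (overlap-add).
import numpy as np

def stft(x, win=64, hop=32):
    """Short-Time Fourier Transform of a 1-D signal (Hann window, real FFT)."""
    window = np.hanning(win)
    frames = [x[i:i + win] * window for i in range(0, len(x) - win + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames])        # (num_frames, win//2 + 1)

def istft(spec, win=64, hop=32):
    """Overlap-add inverse of the STFT above."""
    window = np.hanning(win)
    out = np.zeros(hop * (len(spec) - 1) + win)
    norm = np.zeros_like(out)
    for i, frame in enumerate(spec):
        out[i * hop:i * hop + win] += np.fft.irfft(frame, n=win) * window
        norm[i * hop:i * hop + win] += window ** 2
    norm[norm == 0] = 1.0
    return out / norm

def spectral_filter_layer(x, weights, win=64, hop=32):
    """Apply (learnable) weights, one per frequency bin, in the STFT domain."""
    spec = stft(x, win, hop)
    return istft(spec * weights[None, :], win, hop)

# Example: suppress high frequencies of a noisy sine wave.
t = np.linspace(0, 1, 1024)
x = np.sin(2 * np.pi * 5 * t) + 0.3 * np.random.default_rng(0).normal(size=t.size)
w = np.where(np.arange(64 // 2 + 1) < 8, 1.0, 0.1)           # keep only the low bins
y = spectral_filter_layer(x, w)
```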
To accelerate software development, much research has been performed to help people understand and reuse the huge amount of available code resources. Two important tasks have been widely studied: code retrieval, which aims to retrieve code snippets relevant to a given natural language query from a code base, and code annotation, where the goal is to annotate a code snippet with a natural language description. Despite their advancement in recent years, the two tasks have mostly been explored separately. In this work, we investigate a novel perspective of Code annotation for Code retrieval (hence called “CoaCor”), where a code annotation model is trained to generate a natural language annotation that can represent the semantic meaning of a given code snippet and can be leveraged by a code retrieval model to better distinguish relevant code snippets from others. To this end, we propose an effective framework based on reinforcement learning, which explicitly encourages the code annotation model to generate annotations that can be used for the retrieval task. Through extensive experiments, we show that code annotations generated by our framework are much more detailed and more useful for code retrieval, and they can further improve the performance of existing code retrieval models significantly.
JavaScript execution is heavily used during the loading of web apps, taking a substantial portion of the app loading time. To accelerate JavaScript execution, snapshot-based app loading has been proposed [5, 17]. We take a snapshot of the JavaScript objects in the heap at some point during app loading (which we call the snapshot point) and save them in a file in advance. Then, we start app loading by copying the objects in the snapshot to the heap directly, skipping the JavaScript execution that creates those objects. One issue is that the JavaScript execution state at the snapshot point should be the same at every app loading. If the JavaScript execution included in the snapshot involves some nondeterminism (e.g., using a random function or the current time/location), snapshot-based app loading might be inapplicable since the loaded state might differ each time.
In this paper, we perform an empirical study of the nondeterministic behavior of web apps during app loading. The results show that nondeterminism is used frequently, but its impact is small from the user's perspective. Based on this observation, we propose two techniques that can increase the applicability of snapshots. First, we propose two types of snapshot points, which allow saving as much execution state as possible in the snapshot, just before the nondeterministic execution affects the user's perception. Second, we develop a tool to identify those snapshot points by static and dynamic analysis, so that the developer can easily identify them. Our experimental results show that the techniques can accelerate app loading by 1.81 times by applying snapshots that would otherwise be inapplicable, achieving performance comparable to that of a snapshot that completely ignores the nondeterminism issues.
Anonymous network services on the World Wide Web have emerged as a new web architecture, called the Dark Web. The Dark Web has been notorious for harboring cybercriminals who abuse anonymity. At the same time, the Dark Web has been a last resort for people who seek freedom of the press or want to avoid censorship. This anonymous nature allows website operators to conceal their identity and thereby leads users to have difficulty determining the authenticity of websites. Phishers abuse this perplexing authenticity to lure victims; however, little is known about the prevalence of phishing attacks on the Dark Web.
We conducted an in-depth measurement study to demystify the prevalent phishing websites on the Dark Web. We analyzed the text content of 28,928 HTTP Tor hidden services hosting 21 million dark webpages and confirmed 901 phishing domains. We also discovered a trend on the Dark Web in which service providers perceive dark web domains as their service brands. This trend exacerbates the risk of phishing for their service users who remember only a partial Tor hidden service address.
Our work facilitates a better understanding of the phishing risks on the Dark Web and encourages further research on establishing an authentic and reliable service on the Dark Web.
Recommender systems that can learn from cross-session data to dynamically predict the next item a user will choose are crucial for online platforms. However, existing approaches often use out-of-the-box sequence models which are limited by speed and memory consumption, are often infeasible for production environments, and usually do not incorporate cross-session information, which is crucial for effective recommendations. Here we propose Hierarchical Temporal Convolutional Networks (HierTCN), a hierarchical deep learning architecture that makes dynamic recommendations based on users' sequential multi-session interactions with items. HierTCN is designed for web-scale systems with billions of items and hundreds of millions of users. It consists of two levels of models: The high-level model uses Recurrent Neural Networks (RNN) to aggregate users' evolving long-term interests across different sessions, while the low-level model is implemented with Temporal Convolutional Networks (TCN), utilizing both the long-term interests and the short-term interactions within sessions to predict the next interaction. We conduct extensive experiments on a public XING dataset and a large-scale Pinterest dataset that contains 6 million users with 1.6 billion interactions. We show that HierTCN is 2.5x faster than RNN-based models and uses 90% less data memory compared to TCN-based models. We further develop an effective data caching scheme and a queue-based mini-batch generator, enabling our model to be trained within 24 hours on a single GPU. Our model consistently outperforms state-of-the-art dynamic recommendation methods, with up to 18% improvement in recall and 10% in mean reciprocal rank.
To enhance the performance of recommender systems, side information has been extensively explored with various features (e.g., visual features and textual features). However, side information has some demerits: (1) the extra data is not always available in all recommendation tasks; (2) it typically describes only items, and there is seldom a high-level feature describing users. To address these gaps, we introduce spectral features extracted from two hypergraph structures of the purchase records. Spectral features describe the similarity of users/items in the graph space, which is critical for recommendation. We leverage spectral features to model users' preferences and items' properties by incorporating them into a Matrix Factorization (MF) model.
In addition to modeling, we also use spectral features for optimization. Bayesian Personalized Ranking (BPR) is extensively leveraged to optimize models on implicit feedback data. However, in BPR, all missing values are equally regarded as negative samples, while many of them are in fact unseen positive ones. We enrich the positive samples by calculating the similarity among users/items using the spectral features. The key ideas are: (1) similar users shall have similar preferences for the same item; (2) a user shall have similar perceptions of similar items. Extensive experiments on two real-world datasets demonstrate the usefulness of the spectral features and the effectiveness of our spectrum-enhanced pairwise optimization. Our models outperform several state-of-the-art models significantly.
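The enrichment idea can be sketched as follows: the standard BPR stochastic update for a (user, preferred item, less-preferred item) triple, plus a helper that promotes items whose spectral features are close to a user's consumed items into additional soft positives. The cosine-similarity threshold, the shared weight on enriched samples, and the update details are illustrative assumptions, not the paper's exact optimization.

```python
# Hedged sketch: BPR SGD step plus spectral-similarity-based positive-sample enrichment.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bpr_step(U, V, u, i_pos, j_neg, lr=0.01, reg=0.01, weight=1.0):
    """One SGD step on the BPR loss for user u preferring item i_pos over item j_neg."""
    u_vec = U[u].copy()
    x_uij = u_vec @ (V[i_pos] - V[j_neg])
    g = weight * (1.0 - sigmoid(x_uij))           # gradient scale of -ln sigmoid(x_uij)
    U[u]     += lr * (g * (V[i_pos] - V[j_neg]) - reg * U[u])
    V[i_pos] += lr * (g * u_vec - reg * V[i_pos])
    V[j_neg] += lr * (-g * u_vec - reg * V[j_neg])

def enriched_positives(consumed, item_spectral, threshold=0.9):
    """Items whose spectral features are close to a consumed item count as soft positives."""
    extra = set()
    for i in consumed:
        sims = item_spectral @ item_spectral[i]
        sims = sims / (np.linalg.norm(item_spectral, axis=1)
                       * np.linalg.norm(item_spectral[i]) + 1e-9)
        extra.update(int(j) for j in np.flatnonzero(sims > threshold))
    return extra - set(consumed)
```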
In taxi ride-sharing, multiple customers are allotted to the same taxi as long as they are compatible, i.e., if none of them suffers a detour beyond a permissible threshold. To attract customers to ride-sharing, taxi operators promise a reduced fare upfront. As a result, if the taxi fails to pair the initial customer with additional compatible passengers, the taxi operator incurs a financial loss. Hence, it is important to ensure that the taxi finds compatible customers once it has picked up the initial customer. In the current scenario, the appearance of subsequent compatible customers is left to luck: a taxi moves along the shortest (or quickest) path for the existing customer and hopes to find additional compatible customers on its way. In this paper, we ask: Is the shortest path the optimal path for ride-sharing? To investigate this question, we develop a route recommendation algorithm called Share, which predicts the route with the highest probability of finding compatible customers while staying within the detour limits. Through extensive evaluations on real datasets, we show that Share reduces the failure rate in ride-sharing by up to 40%, while also improving waiting time for customers by 20%. Technically, the route recommendation problem is NP-hard. We overcome this bottleneck by reducing the search space from the entire road network to a smaller directed acyclic graph, on which we perform dynamic programming to identify the optimal route. This strategy allows us to answer queries within 0.2 seconds, thereby making Share deployable under real road conditions.
Online behavior leaves a digital footprint that can be analyzed to reveal our cognitive and psychological state through time. Recognizing these subtle cues can help identify different aspects of mental health, such as low self-esteem, depression, and anxiety. Google's web search engine, used daily by millions of people, logs every search query made by a user, and these logs are accessible through a platform called Google Takeout. Previous researchers have made efforts to detect and predict behaviors associated with depression and anxiety from web data, but only at a population level. This paper fills the gap by looking for signs of low self-esteem, a condition that works in a vicious cycle with depression and anxiety, at an individual level through Google search history data. We target college students, a population prone to depression, anxiety, and low self-esteem, and ask them to take a mental health assessment survey and to share their individual search histories. Textual analysis shows that search logs contain strong signals that can identify individuals with current low self-esteem. For example, participants with low self-esteem have fewer searches pertaining to family, friend, and money attributes; we also observed differences in the search category distribution over time when compared with individuals with moderate to high self-esteem. Using these markers, we were able to build a probabilistic classifier that can identify low self-esteem conditions, based on search history, with an average F1 score of 0.86.
We study computation contests in which players compete to find a solution to a given problem for a winner-take-all reward. The search processes are independent across the players, and the search speed of each player is proportional to their computational power. One concrete application of this abstract model is the mining process of proof-of-work blockchain systems, such as Bitcoin. Although previous studies on Bitcoin assume that a player's winning probability is proportional to their computational power, we show that this is not the case in the strict sense.
Because of the gaps between the winning probabilities and the proportions of computational powers, the Matthew effect will emerge in the system, where the rich get richer and the poor get poorer. In addition, we show that allowing the players to pool with each other or reducing the number of solutions to the problem may aggravate the Matthew effect and vice versa.
Facial appearance matters in social networks. Individuals frequently make trait judgments from facial cues. Although these face-based impressions lack evidence of their validity, they are of vital importance, because they may relate to human network-based social behavior, such as seeking certain individuals for help, advice, dating, and cooperation, and thus they may relate to centrality in social networks. However, little to no work has investigated the apparent facial traits that influence network centrality, despite the large body of research on attributes of the central position, including personality and behavior. In this paper, we examine whether perceived traits based on facial appearance affect network centrality by exploring the initial stage of social network formation in a first-year college residential area. We took face photos of participants, all freshmen living in the same residential area, and asked them to nominate community members for links in different networks. We then collected facial perception data by asking other participants to rate the facial images on three main attributes: dominance, trustworthiness, and attractiveness. Meanwhile, we proposed a framework to discover how facial appearance affects social networks. Our results revealed that perceived facial traits were correlated with network centrality and were predictive of people's centrality in different networks. Our findings provide psychological evidence regarding the interaction between faces and network centrality. They also offer insights into combining psychological and social network techniques, and they highlight the function of facial bias in cuing and signaling social traits. To the best of our knowledge, we are the first to explore the influence of facial perception on centrality in social networks.
Measuring the distances between vertices on graphs is one of the most fundamental components in network analysis. Since finding shortest paths requires traversing the graph, it is challenging to obtain distance information on large graphs very quickly. In this work, we present a preprocessing algorithm that is able to create landmark based distance sketches efficiently, with strong theoretical guarantees. When evaluated on a diverse set of social and information networks, our algorithm significantly improves over existing approaches by reducing the number of landmarks stored, preprocessing time, or stretch of the estimated distances.
On Erdős-Rényi graphs and random power-law graphs with degree distribution exponent 2 < β < 3, our algorithm outputs an exact distance data structure with space between Θ(n^{5/4}) and Θ(n^{3/2}) depending on the value of β, where n is the number of vertices. We complement the algorithm with tight lower bounds for Erdős-Rényi graphs and the case when β is close to two.
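To make the landmark idea concrete, here is a minimal sketch of how a landmark-based sketch answers distance queries via the triangle inequality: precompute BFS distances from a few landmarks, then bound d(u, v) from those distances. The tiny graph and single-landmark choice are illustrative only; the paper's algorithm chooses landmarks far more carefully to obtain its guarantees.

```python
from collections import deque

def bfs_distances(adj, source):
    """Single-source shortest-path (hop) distances via BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def build_sketch(adj, landmarks):
    """Precompute distances from every landmark to all nodes."""
    return {l: bfs_distances(adj, l) for l in landmarks}

def estimate_distance(sketch, u, v):
    """Lower/upper bounds on d(u, v) from the landmark sketch (triangle inequality)."""
    upper = min(d[u] + d[v] for d in sketch.values() if u in d and v in d)
    lower = max(abs(d[u] - d[v]) for d in sketch.values() if u in d and v in d)
    return lower, upper

# Toy graph (made up): a small path-plus-chord network
adj = {0: [1], 1: [0, 2], 2: [1, 3, 5], 3: [2, 4], 4: [3, 5], 5: [2, 4]}
sketch = build_sketch(adj, landmarks=[2])
print(estimate_distance(sketch, 0, 4))   # true distance is 4; the upper bound is tight here
```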
The understanding of talent flow is critical for sharpening company talent strategy and keeping competitiveness in the current fast-evolving environment. Existing studies on talent flow analysis generally rely on subjective surveys. However, without large-scale quantitative studies, it is hard to deliver fine-grained predictive business insights for better talent management. To this end, in this paper, we introduce a big data-driven approach for predictive talent flow analysis. Specifically, we first construct a time-aware job transition tensor by mining the large-scale job transition records of digital resumes from online professional networks (OPNs), where each entry refers to a fine-grained talent flow rate for a specific job position between two companies. Then, we design a dynamic latent factor based Evolving Tensor Factorization (ETF) model for predicting future talent flows. In particular, a novel evolving feature that jointly considers the influence of previous talent flows and the global market is introduced for modeling the evolving nature of each company. Furthermore, to improve the predictive performance, we also integrate several representative attributes of companies as side information for regularizing the model inference. Finally, we conduct extensive experiments on large-scale real-world data to evaluate the model's performance. The experimental results clearly validate the effectiveness of our approach compared with state-of-the-art baselines in terms of talent flow forecasting. Meanwhile, the results also reveal some interesting findings on the regularity of talent flows, e.g., Facebook became increasingly attractive to engineers from Google in 2016.
Stance detection has gained increasing interest from the research community due to its importance for fake news detection. The goal of stance detection is to categorize the overall position of a subject towards an object into one of four classes: agree, disagree, discuss, and unrelated. One of the major problems faced by current machine learning models used for stance detection is the severe class imbalance among these classes. Hence, most models fail to correctly classify instances that fall into minority classes. In this paper, we address this problem by proposing a hierarchical representation of these classes, which combines the agree, disagree, and discuss classes under a new related class. Further, we propose a two-layer neural network that learns from this hierarchical representation and controls the error propagation between the two layers using the Maximum Mean Discrepancy (MMD) regularizer. Compared with conventional four-way classifiers, this model has two advantages: (1) the hierarchical architecture mitigates the class imbalance problem; (2) the regularization helps the model better discern between the related and unrelated stances. Extensive experimentation demonstrates the state-of-the-art accuracy of the proposed model for stance detection.
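For readers unfamiliar with the regularizer, the sketch below computes the standard (biased) RBF-kernel estimate of squared Maximum Mean Discrepancy between two sets of hidden representations; such a term can be added to a training loss. The dimensions, kernel bandwidth, and random stand-in representations are assumptions, and this is the generic MMD estimator rather than the paper's exact architecture.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """RBF kernel matrix k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * sq)

def mmd2(X, Y, gamma=1.0):
    """(Biased) squared Maximum Mean Discrepancy between samples X and Y."""
    return rbf_kernel(X, X, gamma).mean() + rbf_kernel(Y, Y, gamma).mean() \
        - 2 * rbf_kernel(X, Y, gamma).mean()

# Hidden representations of 'related' vs 'unrelated' examples (random stand-ins)
rng = np.random.default_rng(0)
related = rng.normal(0.0, 1.0, size=(64, 16))
unrelated = rng.normal(0.5, 1.0, size=(64, 16))
print(mmd2(related, unrelated))   # added to the loss as a regularization term
```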
Social media platforms are rife with misinformation, and its potential negative influence on the public is a growing concern. This concern has drawn the attention of the research community to developing mechanisms for detecting misinformation. The task of misinformation detection consists of classifying whether a claim is True or False. Most research concentrates on developing machine learning models, such as neural networks, that output a single value in order to predict the veracity of a claim. One of the major problems faced by these models is their inability to represent the uncertainty of the prediction, which stems from the incomplete or finite information available about the claim being examined. We address this problem by proposing a Bayesian deep learning model. The Bayesian model outputs a distribution used to represent both the prediction and its uncertainty. In addition to the claim content, we also encode auxiliary information given by people's replies to the claim. First, the model encodes a claim to be verified and generates a prior belief distribution from which we sample a latent variable. Second, the model encodes all the people's replies to the claim in temporal order through a Long Short-Term Memory network in order to summarize their content. This summary is then used to update the prior belief, generating the posterior belief. Moreover, in order to train this model, we develop a Stochastic Gradient Variational Bayes algorithm to approximate the analytically intractable posterior distribution. Experiments conducted on two public datasets demonstrate that our model outperforms state-of-the-art detection models.
Network alignment, which aims to find the node correspondence across multiple networks, is a fundamental task in many areas, ranging from social network analysis to adversarial activity detection. The state of the art in the data mining community often views the node correspondence as a probabilistic cross-network node similarity, and thus inevitably introduces an O(n^2) lower bound on the computational complexity. Moreover, such methods might ignore the rich patterns (e.g., clusters) accompanying real networks. In this paper, we propose a multilevel network alignment algorithm (Moana) which consists of three key steps. It first efficiently coarsens the input networks into their structured representations, then aligns the coarsest representations of the input networks, followed by interpolations to obtain the alignment at multiple levels, including the node level at the finest granularity. The proposed coarsen-align-interpolate method has two key advantages. First, it overcomes the O(n^2) lower bound, achieving a linear complexity. Second, it helps reveal the alignment between rich patterns of the input networks at multiple levels (e.g., nodes, clusters, super-clusters, etc.). Extensive experimental evaluations demonstrate the efficacy of the proposed algorithm on both node-level alignment and alignment among rich patterns (e.g., clusters) at different granularities.
Nowadays, online shoppers pay more and more attention to detailed product descriptions, since a well-written description is a huge factor in making online sales. However, for a website like Alibaba with billions of product entries, the writing efficiency of human copywriters cannot match the growth rate of new products. To address this issue, we propose a novel pointer-generator neural network to generate product descriptions. In particular, coordinate encoders and a pattern-controlled decoder are utilized to improve generation quality with an attention mechanism. The coordinate encoders, equipped with a Transformer and a gated convolutional unit, are introduced to learn the source input representations. In the decoding phase, a pattern-controlled decoder is proposed to control the output description pattern (such as category, length, and style) to ensure the quality of the description. For evaluation, we build a substantial collection of real-world products along with human-written descriptions. An extensive set of experiments with human-annotated data demonstrates the advantage of the proposed method in terms of generation quality. Finally, an online deployment shows significant benefits of our model in a real online shopping scenario, as measured by the click-through rate.
Reasoning is essential for the development of large knowledge graphs, especially for completion, which aims to infer new triples based on existing ones. Both rules and embeddings can be used for knowledge graph reasoning, and each has its own advantages and difficulties. Rule-based reasoning is accurate and explainable, but rule learning that searches over the graph often suffers from inefficiency due to the huge search space. Embedding-based reasoning is more scalable and efficient, as the reasoning is conducted via computation between embeddings, but it has difficulty learning good representations for sparse entities because a good embedding relies heavily on data richness. Based on this observation, in this paper we explore how embedding and rule learning can be combined so that each compensates for the other's difficulties with its advantages. We propose IterE, a novel framework that iteratively learns embeddings and rules, in which rules are learned from embeddings with a proper pruning strategy and embeddings are learned from existing triples and new triples inferred by rules. Evaluations of the embedding quality of IterE show that rules help improve the quality of sparse entity embeddings and their link prediction results. We also evaluate the efficiency of rule learning and the quality of rules from IterE compared with AMIE+, showing that IterE is capable of generating high-quality rules more efficiently. Experiments show that iteratively learning embeddings and rules benefits both during learning and prediction.
Hashtags in online social networks have gained tremendous popularity during the past five years. The resulting large quantity of data has provided a new lens into modern society. Previously, researchers mainly relied on data collected from Twitter to study either a certain type of hashtag or a certain property of hashtags. In this paper, we perform the first large-scale empirical analysis of hashtags shared on Instagram, the major platform for hashtag sharing. We study hashtags along three different dimensions: the temporal-spatial dimension, the semantic dimension, and the social dimension. Extensive experiments performed on three large-scale datasets with more than 7 million hashtags in total provide a series of interesting observations. First, we show that the temporal patterns of hashtags can be categorized into four different clusters, and that people tend to share fewer hashtags at certain places and more hashtags at others. Second, we observe that a non-negligible proportion of hashtags exhibit large semantic displacement. We demonstrate that hashtags that are more uniformly shared among users, as quantified by the proposed hashtag entropy, are less prone to semantic displacement. In the end, we propose a bipartite graph embedding model to summarize users' hashtag profiles, and rely on these profiles to perform friendship prediction. Evaluation results show that our approach achieves effective prediction with an AUC (area under the ROC curve) above 0.8, which demonstrates the strong social signals carried by hashtags.
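As an illustration of an entropy-style measure of how uniformly a hashtag is shared across users, here is a minimal sketch; the per-user counts are invented, and the exact definition used in the paper may differ (e.g., in normalization).

```python
import math
from collections import Counter

def hashtag_entropy(user_counts):
    """Shannon entropy of how a hashtag's occurrences are spread across users:
    higher entropy means the hashtag is more uniformly shared."""
    total = sum(user_counts.values())
    probs = [c / total for c in user_counts.values()]
    return -sum(p * math.log2(p) for p in probs)

# Occurrence counts per user for two hypothetical hashtags
evenly_shared = Counter({"u1": 5, "u2": 5, "u3": 5, "u4": 5})
dominated     = Counter({"u1": 17, "u2": 1, "u3": 1, "u4": 1})
print(hashtag_entropy(evenly_shared))   # 2.0 bits
print(hashtag_entropy(dominated))       # ~0.85 bits
```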
Recently, neural models for information retrieval are becoming increasingly popular. They provide effective approaches for product search due to their competitive advantages in semantic matching. However, it is challenging to use graph-based features, though proved very useful in IR literature, in these neural approaches. In this paper, we leverage the recent advances in graph embedding techniques to enable neural retrieval models to exploit graph-structured data for automatic feature extraction. The proposed approach can not only help to overcome the long-tail problem of click-through data, but also incorporate external heterogeneous information to improve search results. Extensive experiments on a real-world e-commerce dataset demonstrate significant improvement achieved by our proposed approach over multiple strong baselines both as an individual retrieval model and as a feature used in learning-to-rank frameworks.
Multimodal dialogue systems are attracting increasing attention as a more natural and informative way of human-computer interaction. As one of their core components, the belief tracker estimates the user's goal at each step of the dialogue and provides a direct way to validate the system's ability to understand the dialogue. However, existing studies on belief trackers are largely limited to the textual modality, which cannot be easily extended to capture the rich semantics in multimodal systems such as those with product images. For example, in the fashion domain, the visual appearance of clothes plays a crucial role in understanding the user's intention. In this case, existing belief trackers may fail to generate accurate belief states for a multimodal dialogue system.
In this paper, we present the first neural multimodal belief tracker (NMBT) to demonstrate how multimodal evidence can facilitate semantic understanding and dialogue state tracking. Given the multimodal inputs, while applying a textual encoder to represent textual utterances, the model gives special consideration to the semantics revealed in the visual modality. It learns concept-level fashion semantics by delving deep into image sub-regions and integrating concept probabilities via multiple instance learning. Then, in each turn, an adaptive attention mechanism learns to automatically emphasize different evidence sources from both visual and textual modalities for more accurate dialogue state prediction. We perform an extensive evaluation on a multi-turn task-oriented dialogue dataset in the fashion domain, and the results show that our method achieves superior performance compared to a wide range of baselines.
Entity matching (EM), also known as entity resolution, fuzzy join, and record linkage, refers to the process of identifying records corresponding to the same real-world entities from different data sources. It is an important and long-standing problem in data integration and data mining. So far, progress has been made mainly in the form of model improvements, where models with better accuracy are developed when large amounts of training data are available. In real-world applications, we find that advanced approaches often require many labeled examples that are expensive to obtain, which has become a key obstacle to wider adoption.
In this work, we take a different tack, proposing a transfer-learning approach to EM that leverages pre-trained EM models from large-scale, production knowledge bases (KBs). Specifically, for each entity type in the KB (e.g., location, organization, people, etc.), we use rich synonymous names of known entities in the KB as training data to pre-train type-detection and EM models for each type, using a novel hierarchical neural network architecture we develop. Given a new EM task, with little or no training data, we can either fine-tune or directly leverage the pre-trained EM models to build end-to-end, high-quality EM systems. Experiments on a variety of real EM tasks suggest that the pre-trained approach is effective and outperforms existing EM methods.
“User reviews” are becoming an essential component of e-commerce. When buyers write a negative or doubting review, ideally, the sellers need to respond quickly to minimize the potential impact. With the number of reviews growing at a frightening speed, there is an urgent need to build a response writing assistant for customer service providers. In order to generate high-quality responses, the algorithm needs to consume and understand the information from both the original review and the target product. Classical sequence-to-sequence (Seq2Seq) methods can hardly satisfy this requirement. In this study, we propose a novel deep neural network model based on the Seq2Seq framework for the review response generation task on e-commerce platforms, which can incorporate product information through a gated multi-source attention mechanism and a copy mechanism. Moreover, we employ a reinforcement learning technique to reduce the exposure bias problem. To evaluate the proposed model, we constructed a large-scale dataset from a popular e-commerce website, which contains product information. Empirical studies on both automatic evaluation metrics and human annotations show that the proposed model can generate informative and diverse responses, significantly outperforming state-of-the-art text generation models.
Building height estimation is important in many applications such as 3D city reconstruction, urban planning, and navigation. Recently, a new building height estimation method using street scene images and 2D maps was proposed. This method is more scalable than traditional methods that use high-resolution optical data, LiDAR data, or RADAR data which are expensive to obtain. The method needs to detect building rooflines and then compute building height via the pinhole camera model. We observe that this method has limitations in handling complex street scene images in which buildings overlap with each other and the rooflines are difficult to locate. We propose CBHE, a building height estimation algorithm considering both building corners and rooflines. CBHE first obtains building corner and roofline candidates in street scene images based on building footprints from 2D maps and the camera parameters. Then, we use a deep neural network named BuildingNet to classify and filter corner and roofline candidates. Based on the valid corners and rooflines from BuildingNet, CBHE computes building height via the pinhole camera model. Experimental results show that the proposed BuildingNet yields a higher accuracy on building corner and roofline candidate filtering compared with the state-of-the-art open set classifiers. Meanwhile, CBHE outperforms the baseline algorithm by over 10% in building height estimation accuracy.
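The pinhole-camera step the abstract refers to can be sketched roughly as follows: recover the vertical angle to the detected roofline from its pixel offset and the focal length, then project that angle onto the camera-to-building ground distance known from the 2D map footprint. The function, its parameters, and the numbers are illustrative assumptions (a level camera unless a pitch is supplied), not CBHE's exact implementation.

```python
import math

def building_height(camera_height_m, distance_m, focal_px,
                     roofline_y_px, principal_y_px, pitch_deg=0.0):
    """Estimate building height with a pinhole camera model: the vertical angle
    to the roofline is recovered from its pixel offset relative to the principal
    point, then projected onto the known ground distance to the building."""
    # angle of the roofline above the optical axis (image y grows downward)
    angle = math.atan2(principal_y_px - roofline_y_px, focal_px) + math.radians(pitch_deg)
    return camera_height_m + distance_m * math.tan(angle)

# Illustrative numbers only (not from the paper's datasets)
print(building_height(camera_height_m=2.5, distance_m=30.0,
                      focal_px=1000.0, roofline_y_px=200.0, principal_y_px=500.0))
```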
Advertising (ad for short) keyword suggestion is important for sponsored search to improve online advertising and increase search revenue. There are two common challenges in this task. First, the keyword bidding problem: hot ad keywords are very expensive for most of the advertisers because more advertisers are bidding on more popular keywords, while unpopular keywords are difficult to discover. As a result, most ads have few chances to be presented to the users. Second, the inefficient ad impression issue: a large proportion of search queries, which are unpopular yet relevant to many ad keywords, have no ads presented on their search result pages. Existing retrieval-based or matching-based methods either deteriorate the bidding competition or are unable to suggest novel keywords to cover more queries, which leads to inefficient ad impressions.
To address the above issues, this work investigates the use of generative neural networks for keyword generation in sponsored search. Given a purchased keyword (a word sequence) as input, our model can generate a set of keywords that are not only relevant to the input but also satisfy the domain constraint, which enforces that the domain category of a generated keyword is as expected. Furthermore, a reinforcement learning algorithm is proposed to adaptively utilize domain-specific information in keyword generation. Offline evaluation shows that the proposed model can generate keywords that are diverse, novel, relevant to the source keyword, and in accordance with the domain constraint. Online evaluation shows that generative models can substantially improve coverage (COV), click-through rate (CTR), and revenue per mille (RPM) in sponsored search.
In many applications, it is important to characterize the way in which two concepts are semantically related. Knowledge graphs such as ConceptNet provide a rich source of information for such characterizations by encoding relations between concepts as edges in a graph. When two concepts are not directly connected by an edge, their relationship can still be described in terms of the paths that connect them. Unfortunately, many of these paths are uninformative and noisy, which means that the success of applications that use such path features crucially relies on their ability to select high-quality paths. In existing applications, this path selection process is based on relatively simple heuristics. In this paper we instead propose to learn to predict path quality from crowdsourced human assessments. Since we are interested in a generic task-independent notion of quality, we simply ask human participants to rank paths according to their subjective assessment of the paths' naturalness, without attempting to define naturalness or steering the participants towards particular indicators of quality. We show that a neural network model trained on these assessments is able to predict human judgments on unseen paths with near optimal performance. Most notably, we find that the resulting path selection method is substantially better than the current heuristic approaches at identifying meaningful paths.
The growth of the Web in recent years has resulted in the development of various online platforms that provide healthcare information services. These platforms contain an enormous amount of information, which could be beneficial for a large number of people. However, navigating through such knowledgebases to answer specific queries of healthcare consumers is a challenging task. A majority of such queries might be non-factoid in nature, and hence, traditional keyword-based retrieval models do not work well for such cases. Furthermore, in many scenarios, it might be desirable to get a short answer that sufficiently answers the query, instead of a long document with only a small amount of useful information. In this paper, we propose a neural network model for ranking documents for question answering in the healthcare domain. The proposed model uses a deep attention mechanism at word, sentence, and document levels, for efficient retrieval for both factoid and non-factoid queries, on documents of varied lengths. Specifically, the word-level cross-attention allows the model to identify words that might be most relevant for a query, and the hierarchical attention at sentence and document levels allows it to do effective retrieval on both long and short documents. We also construct a new large-scale healthcare question-answering dataset, which we use to evaluate our model. Experimental evaluation results against several state-of-the-art baselines show that our model outperforms the existing retrieval techniques.
As the popularity of adblocking has soared over the last few years, publishers are increasingly deploying anti-adblocking paywalls that ask users to either disable their adblockers or pay to access content. In this work we propose ShadowBlock, a new Chromium-based adblocking browser that can hide traces of adblocking activities from anti-adblockers as it removes ads from web pages. To bypass anti-adblocking paywalls, ShadowBlock takes advantage of existing filter lists used by adblockers and hides all ad elements stealthily, in such a way that anti-adblocking scripts cannot detect any tampering with the ads (e.g., absence of ad elements). Specifically, ShadowBlock introduces lightweight hooks in Chromium to ensure that DOM states queried by anti-adblocking scripts are exactly as if adblocking were not employed. We implement a fully working prototype by modifying Chromium; it shows great promise in terms of adblocking effectiveness and anti-adblocking circumvention, and it is also more efficient than state-of-the-art adblocking browser extensions. Our evaluation on the Alexa top-1K websites shows that ShadowBlock successfully blocks 98.3% of all visible ads while causing minor breakage on less than 0.6% of the websites. Most importantly, ShadowBlock is able to bypass anti-adblocking paywalls on more than 200 websites that deploy visible anti-adblocking paywalls, with a 100% success rate. Our performance evaluation further shows that ShadowBlock loads pages as fast as the state-of-the-art adblocking browser extension on average.
Learning continuous representations of nodes has recently attracted growing interest in both academia and industry, due to their simplicity and effectiveness in a variety of applications. Most existing node embedding algorithms and systems are capable of processing networks with hundreds of thousands or a few millions of nodes. However, how to scale them to networks with tens of millions or even hundreds of millions of nodes remains a challenging problem. In this paper, we propose GraphVite, a high-performance CPU-GPU hybrid system for training node embeddings, by co-optimizing the algorithm and the system. On the CPU end, augmented edge samples are generated in parallel by online random walks on the network and serve as the training data. On the GPU end, a novel parallel negative sampling is proposed to leverage multiple GPUs to train node embeddings simultaneously, without much data transfer and synchronization. Moreover, an efficient collaboration strategy is proposed to further reduce the synchronization cost between CPUs and GPUs. Experiments on multiple real-world networks show that GraphVite is highly efficient. It takes only about one minute for a network with 1 million nodes and 5 million edges on a single machine with 4 GPUs, and around 20 hours for a network with 66 million nodes and 1.8 billion edges. Compared to the current fastest system, GraphVite is about 50 times faster without any sacrifice in performance.
A considerable body of research has demonstrated that online search data can be used to complement current syndromic surveillance systems. The vast majority of previous work proposes solutions that are based on supervised learning paradigms, in which historical disease rates are required for training a model. However, for many geographical regions this information is either sparse or not available due to a poor health infrastructure. It is these regions that have the most to benefit from inferring population health statistics from online user search activity. To address this issue, we propose a statistical framework in which we first learn a supervised model for a region with adequate historical disease rates, and then transfer it to a target region, where no syndromic surveillance data exists. This transfer learning solution consists of three steps: (i) learn a regularized regression model for a source country, (ii) map the source queries to target ones using semantic and temporal similarity metrics, and (iii) re-adjust the weights of the target queries. It is evaluated on the task of estimating influenza-like illness (ILI) rates. We learn a source model for the United States, and subsequently transfer it to three other countries, namely France, Spain and Australia. Overall, the transferred (unsupervised) models achieve strong performance in terms of Pearson correlation with the ground truth (> .92 on average), and their mean absolute error does not deviate greatly from a fully supervised baseline.
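The three transfer steps described above can be sketched roughly as follows: fit a regularized regression on source-country query frequencies, map each source query to its most similar target query, and re-weight the transferred coefficients. All data here are random stand-ins, the similarity mapping uses hypothetical query embeddings, and scaling weights by mapping confidence is just one plausible re-adjustment, not necessarily the paper's.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# (i) Learn a regularized regression model for the source country.
X_src = rng.random((104, 20))                                  # weekly frequencies of 20 source queries
y_src = X_src @ rng.random(20) + 0.1 * rng.normal(size=104)    # synthetic ILI rates
src_model = Ridge(alpha=1.0).fit(X_src, y_src)

# (ii) Map each source query to its most similar target query, here via cosine
# similarity of (stand-in) semantic embeddings of the query texts.
emb_src = rng.normal(size=(20, 50))
emb_tgt = rng.normal(size=(30, 50))
sims = (emb_src / np.linalg.norm(emb_src, axis=1, keepdims=True)) @ \
       (emb_tgt / np.linalg.norm(emb_tgt, axis=1, keepdims=True)).T
mapping = sims.argmax(axis=1)            # source query i -> target query mapping[i]

# (iii) Transfer and re-adjust the weights, e.g. scaling by mapping confidence.
w_tgt = np.zeros(30)
for i, j in enumerate(mapping):
    w_tgt[j] += src_model.coef_[i] * sims[i, j]

X_tgt = rng.random((104, 30))            # target-country query frequencies
ili_estimate = X_tgt @ w_tgt + src_model.intercept_
print(ili_estimate[:5])
```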
We contribute by quantifying the effect of network latency and battery consumption on mobile app performance and retention, i.e., users' decisions to continue or stop using apps. We perform our analysis by fusing two large-scale crowdsensed datasets collected by piggybacking on information captured by mobile apps. We find that app performance has an impact on its retention rate. Our results demonstrate that high energy consumption and high latency decrease the likelihood of retaining an app. Conversely, we show that reducing latency or energy consumption does not guarantee a higher likelihood of retention once they are within reasonable standards of performance. However, we also demonstrate that what is considered reasonable depends on what users have become accustomed to, with device and network characteristics and app category playing a role. As our second contribution, we develop a model for predicting retention based on performance metrics. We demonstrate the benefits of our model through empirical benchmarks which show that our model not only predicts retention accurately, but also generalizes well across application categories, locations, and other factors moderating the effect of performance.
Bots (i.e., automated accounts) get involved in social campaigns typically for two obvious reasons: to inorganically sway public opinion and to build social capital by exploiting the organic popularity of social campaigns. In the process, bots interact with each other and engage in human-like activities (e.g., likes, retweets, and following).
In this work, we detect a large number of bots interested in politics. We perform multi-aspect (i.e., temporal, textual, and topographical) clustering of bots, and ensemble the clusters to identify campaigns of bots. We observe similarity among the bots in a campaign along various aspects, such as temporal correlation, sentiment alignment, and topical grouping. However, we also discover that bots compete for attention from humans and occasionally engage in arguments. We classify such bot interactions into two primary groups, agreeing (i.e., positive) and disagreeing (i.e., negative) interactions, and develop an automatic interaction classifier to discover novel interactions among bots participating in social campaigns.
Signal strength maps are of great importance to cellular providers for network planning and operation; however, they are expensive to obtain and may be limited or inaccurate in some locations. In this paper, we develop a prediction framework based on random forests to improve signal strength maps from limited measurements. First, we propose a random forests (RFs)-based predictor with a rich set of features, including location as well as time, cell ID, device hardware, and other features. We show that our RFs-based predictor can significantly improve the tradeoff between prediction error and the number of measurements needed compared to state-of-the-art data-driven predictors, requiring 80% fewer measurements for the same prediction accuracy, or reducing the relative error by 17% for the same number of measurements. Second, we leverage two types of real-world LTE RSRP datasets to evaluate the performance of different prediction methods: (i) a small but dense Campus dataset, collected on a university campus, and (ii) several large but sparser NYC and LA datasets, provided by a mobile data analytics company.
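A minimal sketch of the kind of feature-rich random-forest regressor described above, using scikit-learn on synthetic measurements; the feature set, the synthetic RSRP formula, and all numbers are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 5000
# One feature vector per measurement: location, time, cell ID, device class (all made up)
X = np.column_stack([
    rng.uniform(-118.5, -118.0, n),   # longitude
    rng.uniform(33.9, 34.2, n),       # latitude
    rng.integers(0, 24, n),           # hour of day
    rng.integers(0, 50, n),           # cell ID (integer-coded categorical)
    rng.integers(0, 5, n),            # device hardware class
])
# Synthetic RSRP values (dBm), a crude function of location plus noise
rsrp = -80 - 200 * np.abs(X[:, 0] + 118.25) + rng.normal(0, 3, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, rsrp, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("MAE (dB):", mean_absolute_error(y_te, model.predict(X_te)))
```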
Social media provides the government with novel methods to improve regulation. One leading case has been the use of Yelp reviews to target food safety inspections. While previous research on data from Seattle finds that Yelp reviews can predict unhygienic establishments, we provide a more cautionary perspective. First, we show that prior results are sensitive to what we call “Extreme Imbalanced Sampling”: extreme because the dataset was restricted from roughly 13k inspections to a sample of only 612 inspections with only extremely high or low inspection scores, and imbalanced by not accounting for class imbalance in the population. We show that extreme imbalanced sampling is responsible for claims about the power of Yelp information in the original classification setup. Second, a re-analysis that utilizes the full dataset of 13k inspections and models the full inspection score (regression instead of classification) shows that (a) Yelp information has lower predictive power than prior inspection history and (b) Yelp reviews do not significantly improve predictions, given existing information about restaurants and inspection history. Contrary to prior claims, Yelp reviews do not appear to aid regulatory targeting. Third, this case study highlights critical issues when using social media for predictive models in governance and corroborates recent calls for greater transparency and reproducibility in machine learning.
In this paper, we address the keyphrase extraction problem as sequence labeling and propose a model that jointly exploits the complementary strengths of Conditional Random Fields, which capture label dependencies through a transition parameter matrix holding the transition probabilities from one label to a neighboring label, and Bidirectional Long Short-Term Memory networks, which capture hidden semantics in text through long-distance dependencies. Our results on three datasets of scholarly documents show that the proposed model substantially outperforms strong baselines and previous approaches for keyphrase extraction.
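To make the CRF part concrete, the sketch below runs Viterbi decoding over per-token emission scores (standing in for BiLSTM outputs) and a label-transition score matrix, which is the mechanism that lets such a model enforce label dependencies, e.g. discouraging O followed by I-KP. The BIO tag set, random scores, and penalty value are illustrative assumptions, not the paper's trained parameters.

```python
import numpy as np

def viterbi(emissions, transitions):
    """Most likely label sequence given per-token emission scores (e.g. from a
    BiLSTM) and a label-to-label transition score matrix (the CRF part)."""
    T, L = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        # score of reaching label j at step t through label i at step t-1
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

labels = ["O", "B-KP", "I-KP"]        # BIO tags for keyphrase spans
rng = np.random.default_rng(0)
emissions = rng.normal(size=(6, 3))   # stand-in for BiLSTM outputs on 6 tokens
transitions = np.zeros((3, 3))
transitions[0, 2] = -5.0              # discourage the invalid transition O -> I-KP
print([labels[i] for i in viterbi(emissions, transitions)])
```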
Although deep learning models trained on electronic health records (EHR) data have shown state-of-the-art performance in many predictive clinical tasks, the discovery of adversarial examples (i.e., input data engineered to cause misclassification) has exposed vulnerabilities with lab and imaging data. We specifically consider adversarial examples with longitudinal EHR data, an area that has not been previously examined because of the challenges posed by temporal, high-dimensional, and sparse features. We propose a longitudinal adversarial attack based on saliency scores, a method that requires a minimal number of perturbations and automatically minimizes the likelihood of detection. Features are selected and modified by jointly modeling a saliency map and an attention mechanism. Experimental results with longitudinal EHR data show that the attack substantially reduces model performance for attention-based target models (from AUPR = 0.5 to AUPR = 0.08).
Story timeline summarization is widely used by analysts, law enforcement agencies, and policymakers for content presentation, story-telling, and other data-driven decision-making applications. Recent advancements in web technologies have rendered social media sites such as Twitter and Facebook as a viable platform for discovering evolving stories and trending events for story timeline summarization. However, a timeline summarization structure that models complex evolving stories by tracking event evolution to identify different themes of a story and generate a coherent structure that is easy for users to understand is yet to be explored. In this paper, we propose StoryGraph, a novel graph timeline summarization structure that is capable of identifying the different themes of a story. By using high penalty metrics that leverage user network communities, temporal proximity, and the semantic context of the events, we construct coherent paths and generate structural timeline summaries to tell the story of how events evolve over time. We performed experiments on real-world datasets to show the prowess of StoryGraph. StoryGraph outperforms existing models and produces accurate timeline summarizations. As a key finding, we discover that user network communities increase coherence leading to the generation of consistent summary structures.
In the last decade, drug overdose deaths have reached staggering proportions in the US. Besides the raw yearly death count, which is worrisome per se, an alarming picture comes from the steep acceleration of this rate, which increased by 21% from 2015 to 2016. While traditional public health surveillance suffers from its own biases and limitations, digital epidemiology offers a new lens to extract signals from the Web and social media that might be complementary to official statistics. In this paper we present a computational approach to identify a digital cohort that might provide an updated and complementary view on the opioid crisis. We introduce an information retrieval algorithm suitable for identifying relevant subspaces of discussion on social media, and use it to mine data from users showing explicit interest in discussions about opioid consumption on Reddit. Moreover, despite the pseudonymous nature of the user base, almost 1.5 million users were geolocated at the US state level, resembling the census population distribution with good agreement. A measure of the prevalence of interest in opiate consumption was estimated at the state level, producing a novel indicator with information that is not entirely encoded in standard surveillance. Finally, we provide a domain-specific vocabulary containing informal lexicon and street nomenclature extracted from user-generated content, which can be used by researchers and practitioners to implement novel digital public health surveillance methodologies to support policy makers in fighting the opioid epidemic.
Data cleaning and preparation has been a long-standing challenge in data science to avoid incorrect results and misleading conclusions obtained from dirty data. For a given dataset and a given machine learning-based task, a plethora of data preprocessing techniques and alternative data curation strategies may lead to dramatically different outputs with unequal quality performance. Most current work on data cleaning and automated machine learning, however, focuses on developing either cleaning algorithms or user-guided systems, and lacks a principled method to select the sequence of data preprocessing steps that leads to the best quality performance for a given task. In this paper, we propose Learn2Clean, a method based on Q-Learning, a model-free reinforcement learning technique, that selects, for a given dataset, ML model, and quality performance metric, the optimal sequence of tasks for preprocessing the data such that the quality of the ML model result is maximized. As a preliminary validation of our approach in the context of Web data analytics, we present some promising results on data preparation for clustering, regression, and classification on real-world data.
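A minimal tabular Q-Learning sketch of the idea of selecting a preprocessing sequence by reinforcement learning: states are the operators applied so far, actions are cleaning operators (or stop), and the terminal reward stands in for the downstream model's quality metric. The operator names, the toy reward function, and the state encoding are all invented for illustration and are much simpler than Learn2Clean's actual design.

```python
import random

actions = ["impute_missing", "remove_outliers", "normalize", "deduplicate", "stop"]

def run_pipeline_and_score(sequence):
    """Stand-in for: apply the preprocessing sequence, train the ML model,
    return the chosen quality metric (e.g. accuracy or silhouette score)."""
    score = 0.5
    if "impute_missing" in sequence:
        score += 0.15
    if "normalize" in sequence and sequence.index("normalize") > 0:
        score += 0.1
    return score + random.uniform(-0.02, 0.02)

Q = {}
alpha, gamma, eps = 0.3, 0.9, 0.2
for episode in range(500):
    state, sequence = (), []
    while True:
        qs = Q.setdefault(state, {a: 0.0 for a in actions})
        a = random.choice(actions) if random.random() < eps else max(qs, key=qs.get)
        if a == "stop" or len(sequence) >= 4:
            reward = run_pipeline_and_score(sequence)      # terminal reward only
            qs[a] += alpha * (reward - qs[a])
            break
        sequence.append(a)
        next_state = tuple(sequence)
        next_qs = Q.setdefault(next_state, {b: 0.0 for b in actions})
        qs[a] += alpha * (gamma * max(next_qs.values()) - qs[a])
        state = next_state

# Greedy rollout of the learned policy
state, best_seq = (), []
while len(best_seq) < 4 and state in Q:
    a = max(Q[state], key=Q[state].get)
    if a == "stop":
        break
    best_seq.append(a)
    state = tuple(best_seq)
print(best_seq)
```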
Most network embedding algorithms consist in measuring co-occurrences of nodes via random walks and then learning the embeddings using Skip-Gram with Negative Sampling (SGNS). While this has proven to be a relevant choice, there are alternatives, such as GloVe, which has not yet been investigated for network embedding. Even though SGNS handles non-co-occurrence better than GloVe, it has a worse time complexity. In this paper, we propose a matrix factorization approach for network embedding, inspired by GloVe, that handles non-co-occurrence well with a competitive time complexity. We also show how to extend this model to networks whose nodes are documents, by simultaneously learning word, node, and document representations. Quantitative evaluations show that our model achieves state-of-the-art performance while being less sensitive to the choice of hyper-parameters. Qualitatively speaking, we show how our model helps explore a network of documents by generating complementary network-oriented and content-oriented keywords.
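A rough sketch of a GloVe-style factorization applied to a node co-occurrence matrix: fit embeddings and biases by weighted least squares on the log counts, using the usual weighting function that caps frequent pairs and down-weights rare ones. The toy co-occurrence matrix, learning rate, and dimensionality are arbitrary, and the paper's model handles non-co-occurrence differently from this plain GloVe objective.

```python
import numpy as np

def glove_style_embed(cooc, dim=16, epochs=200, lr=0.05, x_max=10, alpha=0.75):
    """Weighted least squares on log co-occurrence counts, GloVe-style:
    minimize f(X_ij) * (w_i . c_j + b_i + b_j - log X_ij)^2 over observed pairs."""
    n = cooc.shape[0]
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(n, dim))
    C = rng.normal(scale=0.1, size=(n, dim))
    bw, bc = np.zeros(n), np.zeros(n)
    ii, jj = np.nonzero(cooc)                       # only observed co-occurrences
    x = cooc[ii, jj]
    f = np.minimum((x / x_max) ** alpha, 1.0)       # cap frequent pairs, down-weight rare ones
    for _ in range(epochs):
        pred = np.sum(W[ii] * C[jj], axis=1) + bw[ii] + bc[jj]
        err = f * (pred - np.log(x))
        gW, gC = err[:, None] * C[jj], err[:, None] * W[ii]
        np.add.at(W, ii, -lr * gW)
        np.add.at(C, jj, -lr * gC)
        np.add.at(bw, ii, -lr * err)
        np.add.at(bc, jj, -lr * err)
    return W + C                                    # common trick: sum both embedding sets

# Toy node co-occurrence matrix from random walks on a tiny graph (made up)
cooc = np.array([[0, 4, 1, 0],
                 [4, 0, 3, 1],
                 [1, 3, 0, 5],
                 [0, 1, 5, 0]], dtype=float)
print(glove_style_embed(cooc).shape)
```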
At the core of many important machine learning problems faced by online streaming services is a need to model how users interact with the content they are served. Unfortunately, there are no public datasets currently available that enable researchers to explore this topic. In order to spur that research, we release the Music Streaming Sessions Dataset (MSSD), which consists of 160 million listening sessions and associated user actions. Furthermore, we provide audio features and metadata for the approximately 3.7 million unique tracks referred to in the logs. This is the largest collection of such track metadata currently available to the public. This dataset enables research on important problems including how to model user listening and interaction behaviour in streaming, as well as Music Information Retrieval (MIR), and session-based sequential recommendations. Additionally, a subset of sessions were collected using a uniformly random recommendation setting, enabling their use for counterfactual evaluation of such sequential recommendations. Finally, we provide an analysis of user behavior and suggest further research problems which can be addressed using the dataset.
Over the last decade, the illegal distribution of child sexual abuse imagery (CSAI) has transformed alongside the rise of online sharing platforms. In this paper, we present the first longitudinal measurement study of CSAI distribution online and the threat it poses to society's ability to combat child sexual abuse. Our results illustrate that CSAI has grown exponentially, to nearly 1 million detected events per month, exceeding the capabilities of independent clearinghouses and law enforcement to take action. In order to scale CSAI protections moving forward, we discuss techniques for automating detection and response by using recent advancements in machine learning.
Societal-scale data is playing an increasingly prominent role in social science research; examples from research on geopolitical events include questions on how emergency events impact the diffusion of information or how new policies change patterns of social interaction. Such research often draws critical inferences from observing how an exogenous event changes meaningful metrics like network degree or network entropy. However, as we show in this work, standard estimation methodologies make systematically incorrect inferences when the event also changes the sparsity of the data.
To address this issue, we provide a general framework for inferring changes in social metrics when dealing with non-stationary sparsity. We propose a plug-in correction that can be applied to any estimator, including several recently proposed procedures. Using both simulated and real data, we demonstrate that the correction significantly improves the accuracy of the estimated change under a variety of plausible data generating processes. In particular, using a large dataset of calls from Afghanistan, we show that whereas traditional methods substantially overestimate the impact of a violent event on social diversity, the plug-in correction reveals the true response to be much more modest.
Generative Adversarial Networks (GANs) have not only achieved great success in various generation tasks, such as image generation, but have also boosted the accuracy of classification tasks by generating additional labeled data, a practice called data augmentation. In this paper, we propose a Rating Augmentation framework with GAN, named RAGAN, aiming to alleviate the data sparsity problem in collaborative filtering (CF) and ultimately improve recommendation accuracy significantly. We identify a unique challenge that arises when applying GANs to CF for rating augmentation: a naive RAGAN tends to generate values biased towards high ratings. We then propose a refined version of RAGAN, named RAGANBT, which addresses this challenge successfully. Via extensive experiments, we validate that RAGANBT is highly effective at solving the data sparsity problem, thereby providing existing CF models with great improvements in accuracy under various situations such as basic top-N recommendation, long-tail item recommendation, and recommendation to cold-start users.
Online donation platforms, such as DonorsChoose, GlobalGiving, or CrowdFunder, enable donors to financially support entities in need. In a typical scenario, after a fundraiser submits a request specifying her need, donors contribute financially to help raise the target amount within a pre-specified timeframe. While the goal of such platforms is to counterbalance societal inequalities, biased donation trends might exacerbate the unfair distribution of resources to those in need. Prior research has looked at the impact of biased data, models, or human behavior on inequality in different socio-technical systems, while largely ignoring the choice architecture, in which the funding decisions are made.
In this paper, we focus on (i) quantifying inequality in project funding on online donation platforms, and (ii) understanding the impact of platform design on donors' behavior in magnifying those inequalities. To this end, we borrow decomposable measures of income inequality from economics and apply them to identify candidate factors contributing to inequality on the DonorsChoose website. Analyzing longitudinal changes in the website design, we show how the platform design affects the relative contribution of the different factors. Our work motivates the need for a careful investigation of the impact of choice architectures on user decisions, in donation platforms in particular, and in online platforms more generally.
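The abstract does not name the measure, but the Theil index is a standard decomposable inequality measure, so the hedged sketch below shows how total inequality in (hypothetical) per-project funding could be split into between-group and within-group components for a candidate factor. The numbers and grouping are invented purely for illustration and are not the paper's data or necessarily its chosen measure.

```python
import numpy as np

def theil_index(x):
    """Theil T index of inequality for non-negative amounts x."""
    x = np.asarray(x, dtype=float)
    ratio = x / x.mean()
    return np.mean(np.where(ratio > 0, ratio * np.log(ratio), 0.0))

def theil_decomposition(amounts, groups):
    """Decompose total Theil inequality into between-group and within-group parts."""
    amounts = np.asarray(amounts, dtype=float)
    groups = np.asarray(groups)
    total_mean, n = amounts.mean(), len(amounts)
    between, within = 0.0, 0.0
    for g in set(groups):
        xg = amounts[groups == g]
        share, rel = len(xg) / n, xg.mean() / total_mean   # population share, relative mean
        between += share * rel * np.log(rel)
        within += share * rel * theil_index(xg)
    return between, within

# Hypothetical funding raised per project, grouped by a candidate factor
funding = [10, 12, 11, 300, 280, 320, 15, 9]
factor  = ["A", "A", "A", "B", "B", "B", "A", "A"]
b, w = theil_decomposition(funding, factor)
print("between:", round(b, 3), "within:", round(w, 3), "total:", round(theil_index(funding), 3))
```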
Consumers today face too many reviews to read when shopping online. Presenting the most helpful reviews, instead of all of them, will greatly ease purchase decision making. Most existing studies on review helpfulness prediction focus on domains with rich labels and are not suitable for domains with insufficient labels. In response, we explore a multi-domain approach that learns domain relationships to help the task by transferring knowledge from data-rich domains to data-deficient domains. To better model domain differences, our approach gates multi-granularity embeddings in a Neural Network (NN) based transfer learning framework to reflect the domain-variant importance of words. Extensive experiments empirically demonstrate that our model outperforms the state-of-the-art baselines and NN-based methods without gating on this task. Our approach facilitates more effective knowledge transfer between domains, especially when the target domain dataset is small. Meanwhile, the domain relationships and domain-specific embedding gating are insightful and interpretable.
We present collaborative similarity embedding (CSE), a unified framework that exploits comprehensive collaborative relations available in a user-item bipartite graph for representation learning and recommendation. In the proposed framework, we differentiate two types of proximity relations: direct proximity and k-th order neighborhood proximity. While learning from the former exploits direct user-item associations observable from the graph, learning from the latter makes use of implicit associations such as user-user similarities and item-item similarities, which can provide valuable information especially when the graph is sparse. Moreover, for improving scalability and flexibility, we propose a sampling technique that is specifically designed to capture the two types of proximity relations. Extensive experiments on eight benchmark datasets show that CSE yields significantly better performance than state-of-the-art recommendation methods.
In graphs with rich texts, incorporating textual information with structural information benefits the construction of expressive graph embeddings. Among various graph embedding models, random walk (RW)-based models form one of the most popular and successful groups. However, they are challenged by two issues when applied to graphs with rich texts: (i) sampling efficiency: starting from the training objective of RW-based models (e.g., DeepWalk and node2vec), we show that RW-based models are likely to generate large amounts of redundant training samples due to three main drawbacks; (ii) text utilization: these models have difficulty dealing with zero-shot scenarios where graph embedding models have to infer graph structure directly from texts. To solve these problems, we propose a novel framework, namely Text-driven Graph Embedding with Pairs Sampling (TGE-PS). TGE-PS uses Pairs Sampling (PS) to improve the sampling strategy of RW, reducing training samples by ~99% while preserving competitive performance. TGE-PS uses Text-driven Graph Embedding (TGE), an inductive graph embedding approach, to generate node embeddings from texts. Since each node contains rich texts, TGE is able to generate high-quality embeddings and provide reasonable predictions on the existence of links to unseen nodes. We evaluate TGE-PS on several real-world datasets, and experimental results demonstrate that TGE-PS produces state-of-the-art results on both traditional and zero-shot link prediction tasks.
Natural Language Generation (NLG), as an important part of Natural Language Processing (NLP), has begun to take full advantage of recent advances in language models. Based on recurrent neural networks (RNNs), NLG has made groundbreaking improvements and is widely applied in many tasks. RNNs typically learn a joint probability of words, and additional information is usually fed to the RNN hidden layer using implicit vector representations. Still, some problems remain unsolved. Standard RNNs are not applicable when we need to impose hard constraints on language generation tasks: for example, they cannot guarantee that designated words appear in the generated target sentence. In this paper, we propose a Backward-or-Forward Generative Adversarial Nets model (BoFGAN) to address this problem. Starting from a particular given word, a generative model at every time step generates a new preceding or subsequent word conditioned on the sequence generated so far, until both sides reach an end. To train the generator, we first model it as a stochastic policy using Reinforcement Learning; then we employ a discriminator to evaluate the quality of a complete sequence as the end reward; and lastly, we apply Monte Carlo (MC) search to estimate the long-term return and update the generator via policy gradient. Experimental results demonstrate the effectiveness and rationality of our proposed BoFGAN model.
With the rapid growth of cloud service systems and their increasing complexity, service failures become unavoidable. Outages, which are critical service failures, could dramatically degrade system availability and impact user experience. To minimize service downtime and ensure high system availability, we develop an intelligent outage management approach, called AirAlert, which can forecast the occurrence of outages before they actually happen and diagnose the root cause after they indeed occur. AirAlert works as a global watcher for the entire cloud system, which collects all alerting signals, detects dependency among signals and proactively predicts outages that may happen anywhere in the whole cloud system. We analyze the relationships between outages and alerting signals by leveraging Bayesian network and predict outages using a robust gradient boosting tree based classification method. The proposed outage management approach is evaluated using the outage dataset collected from a Microsoft cloud system and the results confirm the effectiveness of the proposed approach.
Team composition is a central factor in determining the effectiveness of a team. In this paper, we present a large-scale study on the effect of team composition on multiple measures of team effectiveness. We use a dataset from the largest multiplayer online battle arena (MOBA) game, Honor of Kings, with 96 million matches involving 100 million players. We measure team effectiveness based on team performance (whether a team is going to win), team tenacity (whether a team is going to surrender), and team rapport (whether a team uses abusive language). Our results confirm the importance of team diversity with respect to player roles, and show that diversity has varying effects on team effectiveness: although diverse teams perform well and show tenacity in adversity, they are more likely to become abusive when losing than less diverse teams. Our study also contributes to the situation vs. personality debate, showing that abusive players tend to choose the leading role and that players do not become more abusive when taking such roles.
Taxonomies are important building blocks of structured knowledge bases, and their construction from text sources and Wikipedia has received much attention. In this paper we focus on the construction of taxonomies for fictional domains, using noisy category systems from fan wikis or text extraction as input. Such fictional domains are archetypes of entity universes that are poorly covered by Wikipedia, such as also enterprise-specific knowledge bases or highly specialized verticals. Our fiction-targeted approach, called TiFi, consists of three phases: (i) category cleaning, by identifying candidate categories that truly represent classes in the domain of interest, (ii) edge cleaning, by selecting subcategory relationships that correspond to class subsumption, and (iii) top-level construction, by mapping classes onto a subset of high-level WordNet categories. A comprehensive evaluation shows that TiFi is able to construct taxonomies for a diverse range of fictional domains such as Lord of the Rings, The Simpsons or Greek Mythology with very high precision and that it outperforms state-of-the-art baselines for taxonomy induction by a substantial margin.
Bandit algorithms have gained increased attention in recommender systems, as they provide effective and scalable recommendations. These algorithms use reward functions, usually based on a numeric variable such as click-through rate, as the basis for optimization. On a popular music streaming service, a contextual bandit algorithm is used to decide which content to recommend to users, where the reward function is a binarization of a numeric variable that defines success based on a static threshold of user streaming time: 1 if the user streamed for at least 30 seconds and 0 otherwise. We explore alternative methods to provide a more informed reward function, based on the assumption that the streaming time distribution heavily depends on the type of user and the type of content being streamed. To automatically extract user and content groups from streaming data, we employ “co-clustering”, an unsupervised learning technique that simultaneously extracts clusters of rows and columns from a co-occurrence matrix. The streaming distributions within the co-clusters are then used to define rewards specific to each co-cluster. Our proposed co-cluster-based reward functions lead to an improvement of over 25% in expected stream rate compared to the standard binarized rewards.
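As a rough sketch of the pipeline, the code below co-clusters a synthetic user-by-content streaming-time matrix with scikit-learn's SpectralCoclustering (a stand-in for whatever co-clustering method the paper uses) and then defines a binary reward relative to each co-cluster's median streaming time rather than a single global 30-second cut-off. The matrix, cluster count, and median-based threshold are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(0)
# Synthetic matrix: total streaming seconds of each user on each content group
matrix = rng.gamma(shape=2.0, scale=30.0, size=(200, 40))

# Jointly cluster rows (users) and columns (content)
model = SpectralCoclustering(n_clusters=4, random_state=0).fit(matrix)
user_cluster, content_cluster = model.row_labels_, model.column_labels_

# Per-co-cluster streaming-time threshold (here the median), replacing a global 30s cut-off
thresholds = {}
for cu in range(4):
    for cc in range(4):
        block = matrix[np.ix_(user_cluster == cu, content_cluster == cc)]
        if block.size:
            thresholds[(cu, cc)] = np.median(block)

def reward(user, content, stream_seconds):
    """Binary bandit reward defined relative to the user/content co-cluster."""
    return float(stream_seconds >= thresholds[(user_cluster[user], content_cluster[content])])

print(reward(user=0, content=0, stream_seconds=45.0))
```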
Personalized Point of Interest (POI) recommendation that incorporates users' personal preferences is an important subject of research. However, challenges exist, such as sparse rating data and spatial location factors. As one of the biggest card payment organizations in the United States, our company holds abundant card payment transaction records with numerous features.
In this paper, using restaurant recommendation as a demonstrating example, we present a personalized POI recommendation system (Pcard) that learns user preferences based on user transaction history and restaurants' locations. With a novel embedding approach that captures user embeddings and restaurant embeddings, we model pairwise restaurant preferences with respect to each user based on their locations and dining histories. Finally, a ranking list of restaurants within a spatial region is presented to the user. The evaluation results show that the proposed approach is able to achieve high accuracy and present effective recommendations.
In this paper, we propose a method for improving the accuracy of multiclass classification tasks in crowdsourcing. In crowdsourcing, it is important to assign appropriate workers to appropriate tasks. In multiclass classification, different workers are good at different subcategories. In our method, we reorganize a given flat classification task into a hierarchical classification task consisting of several subtasks, and assign each worker to an appropriate subtask. In this approach, it is important to choose a good hierarchy. In our method, we first post a flat classification task with a portion of the data and collect statistics on each worker's ability. Based on the obtained statistics, we simulate all candidate hierarchical schemes, estimate their expected accuracy, choose the best scheme, and post it with the rest of the data. In our method, it is also important to allocate workers to appropriate subtasks. We designed several greedy worker allocation algorithms. The results of our experiments show that our method improves the accuracy of multiclass classification tasks.
Debate is a process that gives individuals the opportunity to express, and to be exposed to, diverging viewpoints on controversial issues; and the existence of online debating platforms makes it easier for individuals to participate in debates and obtain feedback on their debating skills. But understanding the factors that contribute to a user's success in debate is complicated: while success depends, in part, on the characteristics of the language they employ, it is also important to account for the degree to which their beliefs and personal traits are compatible with those of the audience. Friendships and previous interactions among users on the platform may further influence success.
In this work, we aim to better understand the mechanisms behind success in online debates. In particular, we study the relative effects of debaters' language, their prior beliefs and personal traits, and their social interactions with other users. We find, perhaps surprisingly, that characteristics of users' social interactions play the most important role in determining their success in debates although the best predictive performance is achieved by combining social interaction features with features that encode information on language use during the debate.
This paper introduces HopRank, an algorithm for modeling human navigation on semantic networks. HopRank leverages the assumption that users know or can see the whole structure of the network. Therefore, besides following links, they also follow nodes at certain distances (i.e., k-hop neighborhoods), and not at random as suggested by PageRank, which assumes only links are known or visible. We observe such preference towards k-hop neighborhoods on BioPortal, one of the leading repositories of biomedical ontologies on the Web. In general, users navigate within the vicinity of a concept. But they also “jump” to distant concepts less frequently. We fit our model on 11 ontologies using the transition matrix of clickstreams, and show that semantic structure can influence teleportation in PageRank. This suggests that users, to some extent, utilize knowledge about the underlying structure of ontologies, and leverage it to reach certain pieces of information. Our results help the development and improvement of user interfaces for ontology exploration.
Helpful reviews are essential for e-commerce and review websites, as they can help customers make quick purchase decisions and help merchants increase profits. The great number of online reviews with unknown helpfulness has recently led to promising research on building automatic mechanisms to assess review helpfulness. The mainstream methods generally extract various linguistic and embedding features solely from the text of a review as the evidence for helpfulness prediction. We, however, consider that the helpfulness of a review should be fully aware of the metadata (such as the title, the brand, the category, and the description) of its target product, besides the textual content of the review itself. Hence, in this paper we propose an end-to-end deep neural architecture directly fed by both the metadata of a product and the raw text of its reviews to acquire product-aware review representations for helpfulness prediction. The learned representations do not require tedious labor on feature engineering and are expected to be more informative as the target-aware evidence to assess the helpfulness of online reviews. We also construct two large-scale datasets, which are portions of real-world web data from Amazon and Yelp, respectively, to train and test our approach. Experiments are conducted on two different tasks: helpfulness identification and regression of online reviews, and the results demonstrate that our approach can achieve state-of-the-art performance with substantial improvements.
As emojis become prevalent in personal communications, people are always looking for new, interesting emojis to express emotions, show attitudes, or simply visualize texts. In this study, we collected more than thirty million tweets mentioning the word “emoji” in a one-year period to study emoji requests on Twitter. First, we filtered out bot-generated tweets and extracted emoji requests from the raw tweets using a comprehensive list of linguistic patterns. Then, we examined patterns of new emoji requests by exploring their time, locations, and context. Finally, we summarized users' advocacy behaviors, identified expressions of equity, diversity, and fairness issues due to unreleased but expected emojis, and discussed the significance of new emojis for society. To the best of our knowledge, this paper is the first to conduct a systematic, large-scale study on new emoji requests.
In this paper we investigate the problem of measuring end-to-end Incentive Compatibility (IC) regret given black-box access to an auction mechanism. Our goal is to 1) compute an estimate for IC regret in an auction, 2) provide a measure of certainty around the estimate of IC regret, and 3) minimize the time it takes to arrive at an accurate estimate. We consider two main problems, with different informational assumptions: in the advertiser problem the goal is to measure IC regret for some known valuation v, while in the more general demand-side platform (DSP) problem we wish to determine the worst-case IC regret over all possible valuations. The problems are naturally phrased in an online learning model and we design algorithms for both problems. We give an online learning algorithm for which, in the advertiser problem, the error in determining IC regret shrinks as a function of the number of time steps T, the size of the finite bid set B, and the number of auctions per time step n, with a corresponding bound for the DSP problem. For the DSP problem, we also consider stronger IC regret estimation and extend our algorithm to achieve better IC regret error. We validate the theoretical results using simulations with Generalized Second Price (GSP) auctions, which are known to not be incentive compatible and thus have strictly positive IC regret.
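To make the advertiser-problem setup concrete, here is a small simulation sketch of our own (not the paper's algorithm): it estimates the utility u(b) of each bid b on a grid by repeatedly querying a black-box auction, and reports the empirical IC regret max_b u(b) - u(v) for a known valuation v. The single-slot first-price auction used as the black box is an illustrative assumption, chosen because it is known to have strictly positive IC regret.

```python
import numpy as np

rng = np.random.default_rng(1)
VALUATION = 1.0                     # known valuation v (advertiser problem)
BIDS = np.linspace(0.0, 1.0, 21)    # finite bid set B

def blackbox_auction(bid):
    """Black-box auction returning (allocation, payment).
    Here: a first-price auction against one uniformly random competitor."""
    competitor = rng.uniform(0.0, 1.0)
    return (1.0, bid) if bid > competitor else (0.0, 0.0)

def estimate_utilities(n_queries=20000):
    """Estimate u(b) = v * x(b) - p(b) for each bid by repeated queries."""
    utils = np.zeros_like(BIDS)
    for i, b in enumerate(BIDS):
        total = 0.0
        for _ in range(n_queries):
            x, p = blackbox_auction(b)
            total += VALUATION * x - p
        utils[i] = total / n_queries
    return utils

utils = estimate_utilities()
truthful_util = utils[np.argmin(np.abs(BIDS - VALUATION))]
ic_regret = utils.max() - truthful_util
print(f"estimated IC regret at v={VALUATION}: {ic_regret:.3f}")
```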
We focus on the problem of interlinking Wikipedia tables with fine-grained table relations: equivalent and subPartOf. Such relations allow us to harness semantically related information by accessing related tables or facts therein. Determining the type of a relation is not trivial. Relations are dependent on the schemas, the cell-values, and the semantic overlap of the cell values in tables.
We propose TableNet, an approach for interlinking tables with subPartOf and equivalent relations. TableNet consists of two main steps: (i) for any source table we provide an efficient algorithm to find candidate related tables with high coverage, and (ii) a neural based approach that based on the table schemas and data, determines with high accuracy the fine-grained relation.
Based on an extensive evaluation with more than 3.2M tables, we show that TableNet retains more than 88% of relevant table pairs, and assigns table relations with an accuracy of 90%.
With the development of graph convolutional networks (GCNs), deep learning methods have started to be used on graph data. In addition to convolutional layers, pooling layers are another important component of deep learning. However, no effective pooling methods have been developed for graphs so far. In this work, we propose the graph pooling (gPool) layer, which employs a trainable projection vector to measure the importance of nodes in graphs. By selecting the k most important nodes to form the new graph, gPool achieves the same objective as regular max-pooling operations on images and texts. Another limitation of GCNs when used on graph-based text representation tasks is that they do not consider the order information of nodes in a graph. To address this limitation, we propose the hybrid convolutional (hConv) layer that combines GCN and regular convolutional operations. The hConv layer is capable of increasing receptive fields quickly and computing features automatically. Based on the proposed gPool and hConv layers, we develop new deep networks for text categorization tasks. Our experimental results show that the networks based on gPool and hConv layers achieve new state-of-the-art performance compared to baseline methods.
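The gPool description above lends itself to a compact sketch. The NumPy code below is an illustrative reading of the idea, not the authors' implementation: it scores nodes with a trainable projection vector, keeps the k highest-scoring nodes, gates their features, and slices the adjacency matrix to the induced subgraph.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gpool(X, A, p, k):
    """Graph pooling sketch.
    X: (n, d) node features, A: (n, n) adjacency,
    p: (d,) trainable projection vector, k: number of nodes to keep."""
    scores = X @ p / (np.linalg.norm(p) + 1e-12)   # project features onto p
    idx = np.argsort(scores)[-k:]                  # k most important nodes
    gate = sigmoid(scores[idx])[:, None]           # gate lets gradients reach p
    X_new = X[idx] * gate                          # pooled, gated node features
    A_new = A[np.ix_(idx, idx)]                    # induced subgraph adjacency
    return X_new, A_new

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
A = (rng.random((6, 6)) > 0.5).astype(float)
p = rng.normal(size=4)
X_new, A_new = gpool(X, A, p, k=3)
print(X_new.shape, A_new.shape)   # (3, 4) (3, 3)
```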
An important task in Location-Based Social Network applications is to predict user mobility, specifically a user's next point-of-interest (POI). This is challenging due to the implicit feedback of footprints, the sparsity of generated check-ins, and the joint impact of historical periodicity and recent check-ins. Motivated by the recent success of deep variational inference, we propose VANext (Variational Attention based Next) POI prediction: a latent variable model for inferring a user's next footprint, with historical mobility attention. The variational encoding captures latent features of recent mobility, followed by a search over similar historical trajectories for periodic patterns. A trajectory convolutional network is then used to learn historical mobility, significantly improving efficiency over the often used recurrent networks. A novel variational attention mechanism is proposed to exploit the periodicity of historical mobility patterns, combined with recent check-in preference, to predict next POIs. We also implement a semi-supervised variant, VANext-S, which relies on variational encoding for pre-training all current trajectories in an unsupervised manner, and uses the latent variables to initialize the current trajectory learning. Experiments conducted on real-world datasets demonstrate that VANext and VANext-S outperform state-of-the-art human mobility prediction models.
Understanding the economic nature of consumer decisions in e-Commerce is important to personalized recommendation systems. Established economic theories claim that informed consumers always attempt to maximize their utility by choosing the items of the largest marginal utility per dollar (MUD) within their budgets. For example, gaining 5 dollars of extra benefit by spending 10 dollars makes a consumer much more satisfied than obtaining the same amount of extra benefit by spending 20 dollars, although the second product may have a higher absolute utility value. Meanwhile, making purchases online can be a risky decision that causes dissatisfaction. For example, people may give low ratings to purchased items that they thought they would like when placing the order. Therefore, the design of recommender systems should also take users' risk attitudes into consideration to better learn consumer behaviors.
Motivated by the first consideration, in this paper, we propose a learning algorithm to maximize marginal utility per dollar for recommendations. For the second, economic theory shows that rational people can be arbitrarily close to risk neutral when stakes are arbitrarily small, and this is generally applicable to consumer online purchase behaviors because most people spend a small portion of their total wealth on a single purchase. To integrate this theory with machine learning, we propose to augment MUD optimization with an approximate risk-neutral constraint to generate personalized recommendations. Experiments on real-world e-Commerce datasets show that our approach is able to achieve better performance than many classical recommendation methods, in terms of both traditional recommendation measures such as precision and recall, as well as economic measures such as MUD.
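As a toy illustration of the marginal-utility-per-dollar idea (not the learning algorithm proposed in the paper), the snippet below ranks candidate items by estimated utility gain divided by price; the utility estimates and budget are hypothetical inputs that would come from a learned model and user profile.

```python
# Hypothetical candidates with model-estimated utility gains and prices.
candidates = [
    {"item": "headphones", "utility_gain": 5.0, "price": 10.0},
    {"item": "speaker",    "utility_gain": 5.0, "price": 20.0},
    {"item": "cable",      "utility_gain": 1.5, "price": 2.0},
]

def mud(c):
    """Marginal utility per dollar: extra benefit per unit of spending."""
    return c["utility_gain"] / c["price"]

budget = 25.0
ranked = sorted(candidates, key=mud, reverse=True)
recommended = [c["item"] for c in ranked if c["price"] <= budget]
print(recommended)   # ['cable', 'headphones', 'speaker']
```

Note how the two items with the same 5-dollar benefit are ordered by price, matching the MUD intuition in the abstract.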
Ranking algorithms play a crucial role in online platforms ranging from search engines to recommender systems. In this paper, we identify a surprising consequence of popularity-based rankings: the fewer the items reporting a given signal, the higher the share of the overall traffic they collectively attract. This few-get-richer effect emerges in settings where there are few distinct classes of items (e.g., left-leaning news sources versus right-leaning news sources), and items are ranked based on their popularity. We demonstrate analytically that the few-get-richer effect emerges when people tend to click on top-ranked items and have heterogeneous preferences for the classes of items. Using simulations, we analyze how the strength of the effect changes with assumptions about the setting and human behavior. We also test our predictions experimentally in an online experiment with human participants. Our findings have important implications for understanding the spread of misinformation.
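A small simulation can make the mechanism concrete. The sketch below is our own illustration under simplified assumptions, not the authors' experimental setup: items of two classes are ranked by popularity, users with heterogeneous class preferences click mostly near the top, and the minority class's share of total clicks is compared with its share of items.

```python
import numpy as np

rng = np.random.default_rng(0)
N_MINORITY, N_MAJORITY, STEPS = 3, 12, 20000

classes = np.array([0] * N_MINORITY + [1] * N_MAJORITY)  # 0 = minority class
clicks = np.ones(len(classes))                           # popularity counts

for _ in range(STEPS):
    order = np.argsort(-clicks)          # popularity-based ranking
    pref = rng.integers(0, 2)            # heterogeneous class preference
    # Position bias: examine top positions with geometrically decaying prob.
    for rank, item in enumerate(order):
        if rng.random() < 0.5 ** rank and classes[item] == pref:
            clicks[item] += 1
            break

share_minority = clicks[classes == 0].sum() / clicks.sum()
print(f"minority items: {N_MINORITY / len(classes):.0%} of items, "
      f"{share_minority:.0%} of clicks")
```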
Misspelled words of the malicious kind work by changing specific keywords and are intended to thwart existing automated applications for cyber-environment control, such as harassing content detection on the Internet and email spam detection. In this paper, we focus on malicious spelling correction, which requires an approach that relies on the context and the surface forms of targeted keywords. In the context of two applications, profanity detection and email spam detection, we show that malicious misspellings seriously degrade their performance. We then propose a context-sensitive approach for malicious spelling correction using word embeddings and demonstrate its superior performance compared to state-of-the-art spell checkers.
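To illustrate the general idea of context-sensitive correction (a toy sketch with a hypothetical embedding table, not the paper's model), the code below generates edit-distance-one candidates for a misspelled red-flagged surface form and picks the in-vocabulary candidate whose embedding best matches the average embedding of the surrounding context.

```python
import numpy as np

# Hypothetical word embeddings; in practice these would be pre-trained.
rng = np.random.default_rng(0)
vocab = ["free", "viagra", "pills", "cheap", "meeting", "agenda"]
emb = {w: rng.normal(size=8) for w in vocab}
emb["pills"] = emb["viagra"] + 0.1 * rng.normal(size=8)  # related words nearby

def edits1(word):
    """All strings at edit distance one (deletes, substitutions, inserts)."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    subs = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + subs + inserts)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def correct(misspelled, context):
    """Pick the in-vocabulary candidate closest to the context embedding."""
    ctx = np.mean([emb[w] for w in context if w in emb], axis=0)
    candidates = [w for w in edits1(misspelled) if w in emb]
    if not candidates:
        return misspelled
    return max(candidates, key=lambda w: cosine(emb[w], ctx))

print(correct("v1agra", context=["cheap", "pills", "free"]))  # -> viagra
```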
Social recommendations have been a very intriguing domain for researchers in the past decade. The main premise is that the social network of a user can be leveraged to enhance the rating-based recommendation process. This has been achieved in various ways, and under different assumptions about the network characteristics, structure, and availability of other information (such as trust, content, etc.). In this work, we create neighborhoods of influence leveraging only the social graph structure. These are in turn introduced in the recommendation process both as a pre-processing step and as a social regularization factor of the matrix factorization algorithm. Our experimental evaluation using real-life datasets demonstrates the effectiveness of the proposed technique.
Spell correction is a must-have feature for any modern search engine in applications such as web or e-commerce search. Typical spell correction solutions used in production systems consist of large indexed lookup tables based on a global model trained across many users over a large scale web corpus or a query log.
For search over personal corpora, such as email, this global solution is not sufficient, as it ignores the user's personal lexicon. Without personalization, global spelling fails to correct tail queries drawn from a user's own, often idiosyncratic, lexicon. Personalization using existing algorithms is difficult due to resource constraints and unavailability of sufficient data to build per-user models.
In this work, we propose a simple and effective personalized spell correction solution that augments existing global solutions for search over private corpora. Our event-driven spell correction candidate generation method is specifically designed with personalization as the key construct. Our novel spell correction and query completion algorithms do not require complex model training and are highly efficient. The proposed solution has shown over 30% click-through rate gain on affected queries when evaluated against a range of strong commercial personal search baselines: Google's Gmail, Drive, and Calendar search production systems.
We study the problem of search query inference from web documents, where a short, comprehensive natural language query is inferred from a long article. Search query generation or inference is of great value to search engines and recommenders in terms of locating potential target users and ranking content. Despite being closely related to other NLP tasks like abstract generation and keyword extraction, we point out that search query inference is, in fact, a new problem, in that the generated natural language query, which consists of a few words, is expected to be comprehensive enough to lead to the click-through of the corresponding document. Therefore, query generation requires an accurate inference of query words, as well as a deeper level of understanding on document semantic structures. Toward this end, we propose a novel generative model called the Graph-augmented Sequence to Attention (G-S2A) network. Adopting an Encoder-Decoder architecture, G-S2A incorporates a sentence-level Graph Convolutional Network (GCN), a keyword-level GCN, as well as a hierarchical recurrent neural network (RNN) into the encoder to generate structural document representations. An attentional Transformer decoder is then applied to combine different types of encoded features to generate a target query. On a query-document dataset from a real-world search engine, our model outperforms several neural generative models on a wide range of metrics.
Result ranking diversification has become an important issue for web search, summarization, and question answering. For more complex questions with multiple aspects, such as those in community-based question answering (CQA) sites, a retrieval system should provide a diversified set of relevant results, addressing the different aspects of the query, while minimizing redundancy or repetition. We present a new method, DRN, which learns novelty-related features from unlabeled data with minimal social signals, to emphasize diversity in ranking. Specifically, DRN parameterizes question-answer interactions via an LSTM representation, coupled with an extension of the neural tensor network, which in turn is combined with a novelty-driven sampling approach to automatically generate training data. DRN provides a novel and general approach to complex question answering diversification and suggests promising directions for search improvements.
As an alternative means of convenient and smart transportation, mobility-on-demand (MOD), typified by online ride-sharing and connected taxicabs, has been rapidly growing and spreading worldwide. The large volume of complex traffic and the uncertainty of market supplies/demands have made it essential for many MOD service providers to proactively dispatch vehicles towards ride-seekers.
To meet this need effectively, we propose STRide, an MOD coordination-learning mechanism reinforced spatio-temporally with capsules. We formalize the adaptive coordination of vehicles into a reinforcement learning framework. STRide incorporates spatial and temporal distributions of supplies (vehicles) and demands (ride requests), customers' preferences and other external factors. A novel spatio-temporal capsule neural network is designed to predict the provider's rewards based on MOD network states, vehicles and their dispatch actions. This way, the MOD platform adapts itself to the supply-demand dynamics with the best potential rewards. We have conducted extensive data analytics and experimental evaluation with three large-scale datasets (~ 21 million rides from Uber, Yellow Taxis and Didi). STRide is shown to outperform state-of-the-art methods, substantially reducing the request-rejection rate and passenger waiting time, and also increasing the service provider's profits, often yielding a 30% improvement over state-of-the-art approaches.
From a national security viewpoint, a higher degree of cyber autonomy is crucial to reduce the reliance on external, oftentimes untrustworthy entities, in order to achieve better resilience against adversaries. To probe into the concept of government cyber autonomy, this study examines the external dependency of public-facing government websites across the world's major industrialized, Group of Seven (G7) countries. Over a two-year period, we measured HTTPS adoption rates, the autonomy status of CAs, and the autonomy status of CPs on G7 government websites. We find that approximately 85% of web resources loaded by G7 government sites originate from the United States. By reviewing policy documents and surveying technicians who maintain government websites, we identify four significant forces that can influence the degree of a government's autonomy: government mandates on HTTPS adoption, website development outsourcing, citizens' fear of large-scale surveillance, and user confusion. Because government websites are considered critical information infrastructure, we expect this study to raise awareness of their complex dependency, thereby reducing the risk of blindly trusting external entities when using critical government services.
Collaborative Filtering (CF) is the key technique for recommender systems. CF exploits user-item behavior interactions (e.g., clicks) only and hence suffers from the data sparsity issue. One research thread is to integrate auxiliary information such as product reviews and news titles, leading to hybrid filtering methods. Another thread is to transfer knowledge from source domains, such as improving movie recommendation with knowledge from the book domain, leading to transfer learning methods. In real-world applications, a user registers for multiple services across websites. This motivates us to exploit both auxiliary and source information for recommendation in this paper. To achieve this, we propose a Transfer Meeting Hybrid (TMH) model for cross-domain recommendation with unstructured text. The proposed TMH model attentively extracts useful content from unstructured text via a memory network and selectively transfers knowledge from a source domain via a transfer network. On two real-world datasets, TMH shows better performance in terms of three ranking metrics in comparison with various baselines. We conduct thorough analyses to understand how the text content and transferred knowledge help the proposed model.
Recently, a number of algorithms under the theme of 'unbiased learning-to-rank' have been proposed, which can reduce position bias, the major type of bias in click data, and train a high-performance ranker with click data. Most of the existing algorithms, based on the inverse propensity weighting (IPW) principle, first estimate the click bias at each position, and then train an unbiased ranker with the estimated biases using a learning-to-rank algorithm. However, there has not been a method for unbiased pairwise learning-to-rank that can simultaneously conduct debiasing of click data and training of a ranker using a pairwise loss function. In this paper, we propose a novel framework to accomplish this goal and apply it to the state-of-the-art pairwise learning-to-rank algorithm, LambdaMART. Our algorithm, named Unbiased LambdaMART, can jointly estimate the biases at clicked positions and the biases at unclicked positions, and learn an unbiased ranker. Experiments on benchmark data show that Unbiased LambdaMART can significantly outperform existing algorithms by large margins. In addition, an online A/B test at a commercial search engine shows that Unbiased LambdaMART can effectively conduct debiasing of click data and enhance relevance ranking.
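The inverse propensity weighting principle these algorithms build on can be sketched in a few lines. The code below is a generic IPW illustration under assumed, fixed examination propensities, not the Unbiased LambdaMART algorithm itself: clicks are reweighted by the inverse of their position's examination probability so that the resulting loss corrects for position bias.

```python
import numpy as np

# Assumed examination propensities per rank position; in Unbiased LambdaMART
# these are estimated jointly with the ranker, here they are fixed inputs.
propensity = np.array([1.0, 0.7, 0.5, 0.35, 0.25])

def ipw_click_loss(scores, clicks, positions):
    """Inverse-propensity-weighted logistic loss over clicked documents.
    scores: ranker outputs, clicks: 0/1 labels, positions: ranks shown."""
    weights = clicks / propensity[positions]   # reweight clicks by 1/propensity
    neg_log_sigmoid = np.log1p(np.exp(-scores))
    return float(np.sum(weights * neg_log_sigmoid))

scores = np.array([2.0, 0.5, -0.3, 1.0, -1.2])
clicks = np.array([1, 0, 1, 0, 0])
positions = np.array([0, 1, 2, 3, 4])
print(ipw_click_loss(scores, clicks, positions))
```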
Recently, scholars have demonstrated empirical successes of deep learning in sequence labeling, and most of the prior works focused on the word representation inside the target sentence. Unfortunately, the global information, e.g., domain information of the target document, was ignored in previous studies. In this paper, we propose an innovative joint learning neural network which can encapsulate the global domain knowledge and the local sentence/token information to enhance the sequence labeling model. Unlike existing studies, the proposed method employs the domain labeling output as latent evidence to facilitate the tagging model, and such joint embedding information is generated by an enhanced highway network. Meanwhile, a redesigned CRF layer is deployed to bridge the 'local output labels' and 'global domain information'. Various kinds of information can iteratively contribute to each other, and moreover, domain knowledge can be learnt in either a supervised or unsupervised manner via the new model. Experiments with multiple data sets show that the proposed algorithm outperforms classical and recent state-of-the-art labeling methods.
Obfuscated language is created to avoid censorship in adversarial communication, such as conveying sensitive information, expressing strong sentiment, planning secret actions, and illegal trading. Obfuscated sentences are usually generated by replacing one word with another to conceal the textual content. Intelligence and security agencies identify such adversarial messages by scanning with a watch-list of red-flagged terms. Although semantic expansion techniques are adopted, the precision and recall of the identification are limited due to the ambiguity of such terms and the unbounded ways in which they can be created. To this end, this paper frames the obfuscated language identification problem as a text matching task, where each message is checked for whether it matches a red-flagged term. We propose a multimodal text matching model which combines textual and visual features. The proposed model extends a Bi-directional Long Short Term Memory network with a visual-level representation component to achieve the given task. Comparative experiments on a real-world dataset demonstrate that the proposed method achieves better performance than previous methods.
The Web is a tangled mass of interconnected services, where websites import a range of external resources from various third-party domains. The latter can also load resources hosted on other domains. For each website, this creates a dependency chain underpinned by a form of implicit trust between the first party and transitively connected third parties. The chain can only be loosely controlled, as first-party websites often have little, if any, visibility on where these resources are loaded from. This paper performs a large-scale study of dependency chains in the Web, finding that around 50% of first-party websites render content that they did not directly load. Although the majority (84.91%) of websites have short dependency chains (below 3 levels), we find websites with dependency chains exceeding 30. Using VirusTotal, we show that 1.2% of these third parties are classified as suspicious; although seemingly small, this limited set of suspicious third parties has remarkable reach into the wider ecosystem.
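The notion of a dependency chain can be made concrete with a short sketch. Given loader-to-resource edges observed while rendering a page (the domains below are made up for illustration), the code reconstructs chains starting at the first party and reports the maximum chain depth and which resources were not loaded directly by the first party.

```python
from collections import defaultdict

# Hypothetical "who loaded what" edges observed while rendering example.com.
edges = [
    ("example.com", "cdn-a.com"),
    ("example.com", "analytics.com"),
    ("analytics.com", "tracker-b.net"),
    ("tracker-b.net", "ads-c.io"),
]

children = defaultdict(list)
for loader, resource in edges:
    children[loader].append(resource)

def walk(domain, depth=0):
    """Yield (domain, depth) for every node reachable from the first party."""
    yield domain, depth
    for child in children[domain]:
        yield from walk(child, depth + 1)

first_party = "example.com"
reached = dict(walk(first_party))
max_depth = max(reached.values())
indirect = [d for d, depth in reached.items() if depth > 1]

print("max chain depth:", max_depth)       # 3
print("indirectly loaded:", indirect)      # ['tracker-b.net', 'ads-c.io']
```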
Interaction-based neural ranking has been shown to be effective for document search using distributed word representations. However, the time and space required are very expensive for online query processing with neural ranking. This paper investigates fast approximation of three interaction-based neural ranking algorithms using Locality Sensitive Hashing (LSH). It accelerates query-document interaction computation by using a runtime cache with precomputed term vectors, and speeds up kernel calculation by taking advantage of limited integer similarity values. This paper presents the design choices with cost analysis, and an evaluation that assesses efficiency benefits and relevance tradeoffs for the tested datasets.
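As background on the LSH idea used above, the snippet below shows a generic random-hyperplane sketch, not the paper's cache or kernel optimizations: term vectors are hashed into short binary signatures so that cosine similarity between query and document terms can be approximated from the fraction of agreeing bits.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_BITS = 300, 64
planes = rng.normal(size=(N_BITS, DIM))          # random hyperplanes

def signature(vec):
    """Sign of the projection on each hyperplane gives one bit."""
    return planes @ vec > 0

def approx_cosine(sig_a, sig_b):
    """Cosine similarity estimated from the fraction of agreeing bits."""
    agree = np.mean(sig_a == sig_b)
    return float(np.cos(np.pi * (1.0 - agree)))

# Precompute signatures for term vectors once; reuse at query time.
query_term = rng.normal(size=DIM)
doc_term = query_term + 0.3 * rng.normal(size=DIM)   # a similar term vector

true_cos = float(query_term @ doc_term /
                 (np.linalg.norm(query_term) * np.linalg.norm(doc_term)))
print("true cosine: ", round(true_cos, 3))
print("LSH estimate:", round(approx_cosine(signature(query_term),
                                           signature(doc_term)), 3))
```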
A knowledge graph (KG) uses triples to describe facts in the real world, and has been widely used in intelligent analysis and applications. However, noise and conflicts are inevitably introduced during the construction process. Moreover, KG-based tasks and applications assume that the knowledge in the KG is completely correct, which inevitably brings about potential deviations. In this paper, we establish a knowledge graph triple trustworthiness measurement model that quantifies the semantic correctness of triples and the degree to which the expressed facts are true. The model is a crisscrossing neural network structure. It synthesizes the internal semantic information in the triples and the global inference information of the KG to achieve trustworthiness measurement and fusion at three levels: the entity level, the relationship level, and the global KG level. We analyzed the validity of the model's output confidence values, and conducted experiments on the real-world dataset FB15K (from Freebase) for the knowledge graph error detection task. The experimental results showed that, compared with other models, our model achieved significant and consistent improvements.
Residential buildings constitute roughly one-fourth of the total energy use across the globe. Numerous studies have shown that providing an energy breakdown increases residents' awareness of energy use and can help save up to 15% energy. A significant amount of prior work has looked into source-separation techniques collectively called non-intrusive load monitoring (NILM), and most prior NILM research has leveraged high-frequency household aggregate data for energy breakdown. However, in practice most smart meters only sample hourly or once every 15 minutes, and existing NILM techniques show poor performance at such a low sampling rate.
In this paper, we propose a TreeCNN model for energy breakdown on low-frequency data. There are three key insights behind the design of our model: i) households consume energy with regular temporal patterns, which can be well captured by filters learned in CNNs; ii) the tree structure isolates the pattern learning of each appliance, which helps avoid the magnitude variance problem while preserving relationships among appliances; iii) the tree structure enables the separation of known appliances from unknown ones, which de-noises the input time series for better appliance-level reconstruction. Our TreeCNN model outperformed seven existing baselines on a public benchmark dataset, with lower estimation error and higher accuracy in detecting the active states of appliances.
Sequence-to-Sequence (Seq2Seq) models have achieved encouraging performance on the dialogue response generation task. However, existing Seq2Seq-based response generation methods suffer from a low-diversity problem: they frequently generate generic responses, which make the conversation less interesting. In this paper, we address the low-diversity problem by investigating its connection with model over-confidence reflected in predicted distributions. Specifically, we first analyze the influence of the commonly used Cross-Entropy (CE) loss function, and find that the CE loss function prefers high-frequency tokens, which results in low-diversity responses. We then propose a Frequency-Aware Cross-Entropy (FACE) loss function that improves over the CE loss function by incorporating a weighting mechanism conditioned on token frequency. Extensive experiments on benchmark datasets show that the FACE loss function is able to substantially improve the diversity of existing state-of-the-art Seq2Seq response generation methods, in terms of both automatic and human evaluations.
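One plausible reading of the frequency-based weighting is sketched below. This is an illustrative NumPy version under our own weighting assumption (inverse relative token frequency, normalized to mean one), not necessarily the exact scheme in the paper: the cross-entropy term for each target token is scaled by a weight that decreases with the token's corpus frequency, so generic high-frequency tokens contribute less to the loss.

```python
import numpy as np

def face_loss(logits, targets, token_freq):
    """Frequency-aware cross-entropy sketch.
    logits: (n, vocab) unnormalized scores, targets: (n,) token ids,
    token_freq: (vocab,) corpus token counts."""
    # Weight assumption: inverse relative frequency, normalized to mean 1.
    rel_freq = token_freq / token_freq.sum()
    weights = 1.0 / rel_freq
    weights = weights / weights.mean()

    # Standard cross-entropy per position (numerically stable softmax).
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(targets)), targets]

    return float(np.mean(weights[targets] * ce))

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 5))
targets = np.array([0, 0, 3, 4])                    # token 0 is very frequent
token_freq = np.array([1000.0, 200.0, 150.0, 30.0, 20.0])
print(face_loss(logits, targets, token_freq))
```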
Topic models have many important applications in fields such as Natural Language Processing. Topic embedding modelling aims at introducing word and topic embeddings into topic models to describe correlations between topics. Existing topic embedding methods use documents alone, and thus suffer from the topical fuzziness problem brought by the introduction of embeddings of semantically fuzzy words, e.g., polysemous words or some misleading academic terms. Links often exist between documents, forming document networks. The use of links may alleviate this semantic fuzziness, but links are sparse and noisy, which may in turn mislead topics. In this paper, we utilize community structure to solve these problems. It can not only alleviate the topical fuzziness of topic embeddings, since communities are often believed to be topic related, but also overcome the drawbacks brought by the sparsity and noise of networks (because communities encode higher-order network information). We give a new generative topic embedding model which incorporates documents (with topics) and the network (with communities) together, and uses probability transitions to describe the relationship between topics and communities to make it robust when topics and communities do not match. An efficient variational inference algorithm is then proposed to learn the model. We validate the superiority of our new approach on two tasks, document classification and visualization of topic embeddings, respectively.
We present the design and methodology for the large-scale hybrid paper recommender system used by Microsoft Academic. The system provides recommendations for approximately 160 million English research papers and patents. Our approach handles incomplete citation information while also alleviating the cold-start problem that often affects other recommender systems. We use the Microsoft Academic Graph (MAG), titles, and available abstracts of research papers to build a recommendation list for all documents, thereby combining co-citation and content based approaches. Tuning system parameters also allows for blending and prioritization of each approach which, in turn, allows us to balance paper novelty versus authority in recommendation results. We evaluate the generated recommendations via a user study of 40 participants, with over 2400 recommendation pairs graded, and discuss the quality of the results using ranking quality metrics, including nDCG scores. We see that there is a strong correlation between participant scores and the similarity rankings produced by our system, but that additional focus needs to be put towards improving recommender precision, particularly for content based recommendations. The results of the user survey and associated analysis scripts are made available via GitHub, and the recommendations produced by our system are available as part of the MAG on Azure to facilitate further research and enable novel research paper recommendation applications.
As virtually all aspects of our lives are increasingly impacted by algorithmic decision making systems, it is incumbent upon us as a society to ensure such systems do not become instruments of unfair discrimination on the basis of gender, race, ethnicity, religion, etc. We consider the problem of determining whether the decisions made by such systems are discriminatory, through the lens of causal models. We introduce two definitions of group fairness grounded in causality: fair on average causal effect (FACE), and fair on average causal effect on the treated (FACT). We use the Rubin-Neyman potential outcomes framework for the analysis of cause-effect relationships to robustly estimate FACE and FACT. We demonstrate the effectiveness of our proposed approach on synthetic data. Our analyses of two real-world data sets, the Adult income data set from the UCI repository (with gender as the protected attribute), and the NYC Stop and Frisk data set (with race as the protected attribute), show that the evidence of discrimination obtained by FACE and FACT, or lack thereof, is often in agreement with the findings from other studies. We further show that FACT, being somewhat more nuanced compared to FACE, can yield findings of discrimination that differ from those obtained using FACE.
In recent times, fake news and misinformation have had a disruptive and adverse impact on our lives. Given the prominence of microblogging networks as a source of news for most individuals, fake news now spreads at a faster pace and has a more profound impact than ever before. This makes detection of fake news an extremely important challenge. Fake news articles, just like genuine news articles, leverage multimedia content to manipulate user opinions, but they spread misinformation. A shortcoming of the current approaches for the detection of fake news is their inability to learn a shared representation of multimodal (textual + visual) information. We propose an end-to-end network, Multimodal Variational Autoencoder (MVAE), which uses a bimodal variational autoencoder coupled with a binary classifier for the task of fake news detection. The model consists of three main components: an encoder, a decoder and a fake news detector module. The variational autoencoder is capable of learning probabilistic latent variable models by optimizing a bound on the marginal likelihood of the observed data. The fake news detector then utilizes the multimodal representations obtained from the bimodal variational autoencoder to classify posts as fake or not. We conduct extensive experiments on two standard fake news datasets collected from popular microblogging websites: Weibo and Twitter. The experimental results show that across the two datasets, on average our model outperforms state-of-the-art methods by margins as large as ~ 6% in accuracy and ~ 5% in F1 scores.
To improve mobile application (app for short) user experience, it is very important to inform users about apps' privacy risk levels. To address the challenge of incorporating heterogeneous feature indicators (such as app permissions, user reviews, developers' descriptions and ad libraries) into the risk ranking model, we formalize the app risk ranking problem as an exclusive sparse coding optimization problem, taking advantage of features from different modalities via the maximization of feature consistency and the enhancement of feature diversity. We propose an efficient iterative re-weighted method to solve the resulting optimization problem, the convergence of which can be rigorously proved. Extensive experiments demonstrate consistent performance improvement on real-world mobile application datasets (in total, 13,786 apps, 37,966 descriptions, 10,557,681 user reviews and 200 ad libraries).
Knowledge graphs such as DBPedia and Freebase contain sparse linkage connectivity, which poses a severe challenge to link prediction between entities. In addressing this sparsity problem, our studies indicate that one needs to leverage models with low complexity to avoid overfitting the weak structural information in the graphs, requiring simple models that can efficiently encode the entities and their description information and then effectively decode their relationships. In this paper, we present a simple and efficient model that attains these two goals. Specifically, we use a bag-of-words model, where relevant words are aggregated using average pooling or a basic Graph Convolutional Network to encode entities into distributed embeddings. A factorization machine is then used to score the relationships between those embeddings to generate linkage predictions. Empirical studies on two real datasets confirm the efficiency of our proposed model and show superior predictive performance over state-of-the-art approaches.
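To make the encode/decode split concrete, here is a minimal sketch with illustrative shapes and a toy factorization machine, not the paper's trained model: entities are encoded by average-pooling the embeddings of their description words, and a second-order factorization machine scores the concatenated pair of entity embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM, FM_DIM = 16, 8
word_emb = {w: rng.normal(size=EMB_DIM)
            for w in ["french", "painter", "paris", "city", "born", "capital"]}

def encode_entity(description_words):
    """Bag-of-words encoder: average pooling of word embeddings."""
    vecs = [word_emb[w] for w in description_words if w in word_emb]
    return np.mean(vecs, axis=0)

# Toy factorization machine parameters over the concatenated pair features.
n_feat = 2 * EMB_DIM
w0, w = 0.0, rng.normal(size=n_feat) * 0.01
V = rng.normal(size=(n_feat, FM_DIM)) * 0.01

def fm_score(x):
    """Second-order FM: w0 + <w, x> + pairwise interactions via V."""
    linear = w0 + w @ x
    interactions = 0.5 * np.sum((x @ V) ** 2 - (x ** 2) @ (V ** 2))
    return float(linear + interactions)

head = encode_entity(["french", "painter", "born", "paris"])
tail = encode_entity(["paris", "capital", "city"])
print(fm_score(np.concatenate([head, tail])))   # link-prediction score
```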
Ranking, used extensively online and as a critical tool for decision making across many domains, may embed unfair bias. Tools to measure and correct for discriminatory bias are required to ensure that ranking models do not perpetuate unfair practices. Recently, a number of error-based criteria have been proposed to assess fairness with regard to the treatment of protected groups (as determined by sensitive data attributes, e.g., race, gender, or age). However this has largely been limited to classification tasks, and error metrics used in these approaches are not applicable for ranking. Therefore, in this work we propose to broaden the scope of fairness assessment to include error-based fairness criteria for rankings. Our approach supports three criteria: Rank Equality, Rank Calibration, and Rank Parity, which cover a broad spectrum of fairness considerations from proportional group representation to error rate similarity. The underlying error metrics are formulated to be rank-appropriate, using pairwise discordance to measure prediction error in a model-agnostic fashion. Based on this foundation, we then design a fair auditing mechanism which captures group treatment throughout the entire ranking, generating in-depth yet nuanced diagnostics. We demonstrate the efficacy of our error metrics using real-world scenarios, exposing trade-offs among fairness criteria and providing guidance in the selection of fair-ranking algorithms.
Citizen engagement and technology usage are two emerging trends driven by smart city initiatives. Typically, citizens report issues, such as broken roads, garbage dumps, etc. through web portals and mobile apps, in order for the government authorities to take appropriate actions. Several mediums - text, image, audio, video - are used to report these issues. Through a user study with 13 citizens and 3 authorities, we found that image is the most preferred medium to report civic issues. However, analyzing civic issue related images is challenging for the authorities as it requires manual effort. In this work, given an image, we propose to generate a Civic Issue Graph consisting of a set of objects and the semantic relations between them, which are representative of the underlying civic issue. We also release two multi-modal (text and images) datasets, that can help in further analysis of civic issues from images. We present an approach for adversarial adaptation of existing scene graph models that enables the use of scene graphs for new applications in the absence of any labelled training data. We conduct several experiments to analyze the efficacy of our approach, and using human evaluation, we establish the appropriateness of our model at representing different civic issues.
The Bitcoin payment system involves two agent types: Users that transact with the currency and pay fees and miners in charge of authorizing transactions and securing the system in return for these fees. Two of Bitcoin's challenges are (i) securing sufficient miner revenues as block rewards decrease, and (ii) alleviating the throughput limitation due to a small maximal block size cap. These issues are strongly related as increasing the maximal block size may decrease revenue due to Bitcoin's pay-your-bid approach. To decouple them, we analyze the “monopolistic auction” [8], showing: (i) its revenue does not decrease as the maximal block size increases, (ii) it is resilient to an untrusted auctioneer (the miner), and (iii) simplicity for transaction issuers (bidders), as the average gain from strategic bid shading (relative to bidding one's true maximal willingness to pay) diminishes as the number of bids increases.
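For readers unfamiliar with the monopolistic auction referenced above, the sketch below (our own illustration of the standard mechanism) computes its outcome for a set of fee bids: the miner chooses the cutoff price that maximizes revenue, includes every bid at or above it, and charges all included bidders that same price.

```python
def monopolistic_auction(bids):
    """Return (price, included_bids, revenue) for the monopolistic auction:
    all accepted transactions pay the same per-unit price, chosen to
    maximize price * number_of_accepted_bids."""
    sorted_bids = sorted(bids, reverse=True)
    best_price, best_revenue, best_k = 0.0, 0.0, 0
    for k, bid in enumerate(sorted_bids, start=1):
        revenue = bid * k            # charge price `bid` to the top-k bidders
        if revenue > best_revenue:
            best_price, best_revenue, best_k = bid, revenue, k
    return best_price, sorted_bids[:best_k], best_revenue

bids = [5.0, 4.0, 3.0, 1.0, 0.5]
price, included, revenue = monopolistic_auction(bids)
print(price, included, revenue)   # 3.0 [5.0, 4.0, 3.0] 9.0
```

Because every accepted bidder pays the cutoff price rather than their own bid, the gain from shading one's bid shrinks as the number of bids grows, which is the simplicity property cited above.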
There is a growing concern about the extent to which algorithmic personalization limits people's exposure to diverse viewpoints, thereby creating “filter bubbles” or “echo chambers.” Prior research on web search personalization has mainly reported location-based personalization of search results. In this paper, we investigate whether web search results are personalized based on a user's browsing history, which can be inferred by search engines via third-party tracking. Specifically, we develop a “sock puppet” auditing system in which a pair of fresh browser profiles, first, visits web pages that reflect divergent political discourses and, second, executes identical politically oriented Google News searches. Comparing the search results returned by Google News for distinctly trained browser profiles, we observe statistically significant personalization that tends to reinforce the presumed partisanship.
Modeling user behaviors as sequences provides critical advantages in predicting future user actions, such as predicting the next product to purchase or the next song to listen to, for personalized search and recommendation. Recently, recurrent neural networks (RNNs) have been adopted to leverage their power in modeling sequences. However, most of the previous RNN-based work suffers from the complex dependency problem, which may lose the integrity of highly correlated behaviors and may introduce noises derived from unrelated behaviors. In this paper, we propose to integrate a novel Time Slice Self-Attention (TiSSA) mechanism into RNNs for better modeling sequential user behaviors, which utilizes the time-interval-based gated recurrent units to exploit the temporal dimension when encoding user actions, and has a specially designed time slice hierarchical self-attention function to characterize both local and global dependency of user actions, while the final context-aware user representations can be used for downstream applications. We have performed experiments on a huge dataset collected from one of the largest e-commerce platforms in the world. Experimental results show that the proposed TiSSA achieves significant improvement over the state-of-the-art. TiSSA is also adopted in this large e-commerce platform, and the results of online A/B test further indicate its practical value.
Music listening is a commonplace activity that has transformed as users engage with online streaming platforms. When presented with anytime, anywhere access to a vast catalog of music, users face challenges in searching for what they want to hear. We propose that users who engage in domain-specific search (e.g., music search) have different information-seeking needs than in general search. Using a mixed-method approach that combines a large-scale user survey with behavior data analyses, we describe the construct of search mindset on a leading online streaming music platform and then investigate two types of search mindsets: focused, where a user is looking for one thing in particular, and non-focused, where a user is open to different results. Our results reveal that searches in the music domain are more likely to be focused than non-focused. In addition, users' behavior (e.g., clicks, streams, querying, etc.) on a music search system is influenced by their search mindset. Finally, we propose design implications for music search systems to best support their users.
Search engine users endeavor to formulate proper search queries during online search. To help users accurately express their information need during search, search engines are equipped with query suggestions to refine users' follow-up search queries. The success of a query suggestion system depends on whether we can understand and model user search intent accurately. In this work, we propose the Click Feedback-Aware Network (CFAN) to provide feedback-aware query suggestions. In addition to modeling the sequential search queries issued by a user, CFAN also considers user clicks on previously suggested queries as user feedback. These clicked suggestions, together with the issued search query sequence, jointly capture the underlying search intent of users. In addition, we explicitly focus on improving the robustness of the query suggestion system through adversarial training. Adversarial examples are introduced into the training of the query suggestion system, which not only improves the robustness of the system to nuisance perturbations, but also enhances the generalization performance on the original training data. Extensive experiments are conducted on a recent real-world search engine log. The experimental results demonstrate that the proposed method, CFAN, outperforms competitive baseline methods across various situations on the task of query suggestion.
We propose a novel training scheme for fast matching models in Search Ads, motivated by practical challenges. The first challenge stems from the pursuit of high throughput, which prohibits the deployment of inseparable architectures, and hence greatly limits model accuracy. The second problem arises from the heavy dependency on human-provided labels, which are expensive and time-consuming to collect, while how to leverage unlabeled search log data is rarely studied. The proposed training framework aims to mitigate both issues by treating the stronger but undeployable models as annotators, and learning a deployable model from both human-provided relevance labels and weakly annotated search log data. Specifically, we first construct multiple auxiliary tasks from the enumerated relevance labels, and train the annotators by jointly learning from those related tasks. The annotation models are then used to assign scores to both labeled and unlabeled training samples. The deployable model is first learned on the scored unlabeled data, and then fine-tuned on the scored labeled data, by leveraging both labels and scores via minimizing the proposed label-aware weighted loss. According to our experiments, compared with the baseline that directly learns from relevance labels, training with the proposed framework outperforms it by a large margin, and improves data efficiency substantially by dispensing with 80% of labeled samples. The proposed framework allows us to improve the fast matching model by learning from stronger annotators while keeping its architecture unchanged. Meanwhile, it offers a principled manner to leverage search log data in the training phase, which could effectively alleviate our dependency on human-provided labels.
Under the newly introduced setting of multistream classification, two data streams are involved, referred to as the source and target streams. The source stream continuously generates labeled data instances from a certain domain, while the target stream generates unlabeled instances from another domain. Existing approaches assume that the domains for both data streams are identical, which is often not true in real-world scenarios, since data streams from different sources may contain distinct features. Furthermore, obtaining labels for every instance in a data stream is often expensive and time-consuming. Therefore, it has become an important topic to explore whether labeled instances from other related streams can be helpful for predicting those unlabeled instances in a given stream. Note that the domains of the source and target streams may have distinct feature spaces and data distributions. Our objective is to predict the class labels of data instances in the target stream by using classifiers trained on the source stream.
We propose a framework for multistream classification using projected data from a common latent feature space, which is embedded from both the source and target domains. This framework is also crucial for enterprise system defenders to detect cross-platform attacks, such as Advanced Persistent Threats (APTs). Empirical evaluation and analysis on both real-world and synthetic datasets are performed to validate the effectiveness of our proposed algorithm in comparison with state-of-the-art techniques. Experimental results show that our approach significantly outperforms other existing approaches.
Predicting pregnancy has been a fundamental problem in women's health for more than 50 years. Previous datasets have been collected via carefully curated medical studies, but the recent growth of women's health tracking mobile apps offers potential for reaching a much broader population. However, the feasibility of predicting pregnancy from mobile health tracking data is unclear. Here we develop four models - a logistic regression model and three LSTM models - to predict a woman's probability of becoming pregnant using data from a women's health tracking app, Clue by BioWink GmbH. Evaluating our models on a dataset of 79 million logs from 65,276 women with ground truth pregnancy test data, we show that our predicted pregnancy probabilities meaningfully stratify women: women in the top 10% of predicted probabilities have an 89% chance of becoming pregnant over 6 menstrual cycles, as compared to a 27% chance for women in the bottom 10%. We develop a technique for extracting interpretable time trends from our deep learning models, and show these trends are consistent with previous fertility research. Our findings illustrate the potential that women's health tracking data offers for predicting pregnancy in a broader population; we conclude by discussing the steps needed to fulfill this potential.
In traditional machine learning, classifier training is typically undertaken in the setting of single-task learning, so the trained classifier can discriminate between different classes. However, this must be based on the assumption that different classes are mutually exclusive. In real applications, the above assumption does not always hold. For example, the same book may belong to multiple subjects. From this point of view, researchers were motivated to formulate multi-label learning problems. In this context, each instance can be assigned multiple labels, but classifier training is still typically undertaken in the setting of single-task learning. When probabilistic approaches are adopted for classifier training, multi-task learning can be enabled through transformation of a multi-labelled data set into several binary data sets. The above data transformation can result in the class imbalance issue. Without the above data transformation, multi-labelling of data results in an exponential increase in the number of classes, leading to fewer instances for each class and a higher difficulty in identifying each class. In addition, multi-labelling of data is very time consuming and expensive in some application areas, such as hate speech detection. In this paper, we introduce a novel formulation of the hate speech type identification problem in the setting of multi-task learning through our proposed fuzzy ensemble approach. In this setting, single-labelled data can be used for semi-supervised multi-label learning, and two new metrics (detection rate and irrelevance rate) are thus proposed to measure more effectively the performance of this kind of learning task. We report an experimental study on the identification of four types of hate speech, namely: religion, race, disability and sexual orientation. The experimental results show that our proposed fuzzy ensemble approach outperforms other popular probabilistic approaches, with an overall detection rate of 0.93.
Chinese word segmentation (CWS) is very important for Chinese text processing. Existing methods for CWS usually rely on a large number of labeled sentences to train word segmentation models, which are expensive and time-consuming to annotate. Luckily, the unlabeled data is usually easy to collect and many high-quality Chinese lexicons are off-the-shelf, both of which can provide useful information for CWS. In this paper, we propose a neural approach for Chinese word segmentation which can exploit both lexicon and unlabeled data. Our approach is based on a variant of posterior regularization algorithm, and the unlabeled data and lexicon are incorporated into model training as indirect supervision by regularizing the prediction space of CWS models. Extensive experiments on multiple benchmark datasets in both in-domain and cross-domain scenarios validate the effectiveness of our approach.
With the increasing popularity of micro-video sharing, where people shoot short videos effortlessly and share their daily stories on social media platforms, micro-video recommendation has attracted extensive research efforts to provide users with micro-videos that interest them. In this paper, the hypothesis we explore is that not only do users have multi-modal interests, but micro-videos also have multi-modal targeted audience segments. As a result, we propose a novel framework, the User-Video Co-Attention Network (UVCAN), which can learn multi-modal information from both the user and micro-video sides using an attention mechanism. In addition, UVCAN reasons about the attention in a stacked attention network fashion for both the user and the micro-video. Extensive experiments on two datasets collected from Toffee present superior results of our proposed UVCAN over state-of-the-art recommendation methods, which demonstrate the effectiveness of the proposed framework.
In modern recommender systems, both users and items are associated with rich side information, which can help in understanding users and items. Such information is typically heterogeneous and can be roughly categorized into flat and hierarchical side information. While side information has been proven to be valuable, the majority of existing systems exploit either only flat or only hierarchical side information due to the challenges brought by this heterogeneity. In this paper, we investigate the problem of exploiting heterogeneous side information for recommendation. Specifically, we propose a novel framework that jointly captures flat and hierarchical side information with mathematical coherence. We demonstrate the effectiveness of the proposed framework via extensive experiments on various real-world datasets. Empirical results show that our approach achieves a significant performance gain over state-of-the-art methods.
In on-demand ridesharing, optimizing the quality of supply-demand matches and minimizing the number of unfulfilled trip requests are important business problems. One approach is to match supply with demand based on greedy local heuristics, which is sub-optimal at the market level. Another approach is to find the globally optimal matching over the whole marketplace. However, because the computation has high latency and usually requires supply and demand data aggregated over time, rather than only the supply and demand at the time of individual trip requests, it is infeasible to perform the global optimization for every trip request in real time. This paper proposes a solution that periodically performs a global optimization offline based on forecasted supply and demand data, and uses the offline results to guide real-time supply-demand matching. The solution is a novel two-stage robust optimization scheme for supply-demand matching in on-demand ridesharing. We implemented the proposed solution and evaluated it using simulated trips based on public industry data. It reduces the number of unfulfilled trips significantly compared to a state-of-the-art approach.
Relation classification is a basic yet important task in natural language processing. Existing relation classification approaches mainly rely on distant supervision, which assumes that a bag of sentences mentioning a pair of entities and extracted from a given corpus should express the same relation type of this entity pair. The training of these models needs a lot of high-quality bag-level data. However, in some specific domains, such as medical domain, it is difficult to obtain sufficient and high-quality sentences in a text corpus that mention two entities with a certain medical relation between them. In such a case, it is hard for existing discriminative models to capture the representative features (i.e., common patterns) from diversely expressed entity pairs with a given relation. Thus, the classification performance cannot be guaranteed when limited features are obtained from the corpus. To address this challenge, in this paper, we propose to employ a generative model, called conditional variational autoencoder (CVAE), to handle the pattern sparsity. We define that each relation has an individually learned latent distribution from all possible sentences expressing this relation. As these distributions are learned based on the purpose of input reconstruction, the model's classification ability may not be strong enough and should be improved. By distinguishing the differences among different relation distributions, a margin-based regularizer is designed, which leads to a margin-based CVAE (MCVAE) that can significantly enhance the classification ability. Besides, MCVAE can automatically generate semantically meaningful patterns that describe the given relations. Experiments on two real-world datasets validate the effectiveness of the proposed MCVAE on the tasks of relation classification and relation-specific pattern generation.
Rumors can have devastating consequences for individuals and society. Analysis shows that the widespread diffusion of rumors typically results from deliberately promoted information campaigns that aim to shape collective opinions on the news events concerned. In this paper, we attempt to fight such chaos with itself, to make automatic rumor detection more robust and effective. Our idea is inspired by adversarial learning methods originating from Generative Adversarial Networks (GAN). We propose a GAN-style approach in which a generator is designed to produce uncertain or conflicting voices, complicating the original conversational threads in order to pressure the discriminator into learning stronger rumor-indicative representations from the augmented, more challenging examples. Unlike traditional data-driven approaches to rumor detection, our method can capture low-frequency but stronger non-trivial patterns via such adversarial training. Extensive experiments on two Twitter benchmark datasets demonstrate that our rumor detection method achieves much better results than state-of-the-art methods.
There is a growing need for discrete choice models that account for the complex nature of human choices, escaping traditional behavioral assumptions such as the transitivity of pairwise preferences. Recently, several parametric models of intransitive comparisons have been proposed, but in all cases the model log-likelihood is non-concave, making inference difficult. In this work we generalize this observation, showing that there cannot exist a parametric model that both (i) has a log-likelihood function that is concave in item-level parameters and (ii) can exhibit intransitive preferences. Given this observation, we also contribute a new, simple model for analyzing intransitivity in pairwise comparisons, taking inspiration from the Condorcet method (majority vote) in social choice theory. The majority vote model we analyze is defined as a voting process over independent Random Utility Models (RUMs). We infer a multidimensional embedding of each object or player, in contrast to the traditional one-dimensional embedding used by models such as the Thurstone or Bradley-Terry-Luce (BTL) models. We show that a three-dimensional majority vote model is capable of modeling arbitrarily strong and long intransitive cycles, and can also represent arbitrary pairwise comparison probabilities on any triplet. We provide experimental results that substantiate our claims regarding the effectiveness of our model in capturing intransitivity for various pairwise choice tasks such as predicting choices in recommendation systems, winners in online video games, and elections.
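To make the majority-vote construction concrete, here is a minimal sketch (not the authors' code) that computes the pairwise win probability of a d-dimensional majority vote over independent Bradley-Terry-Luce comparisons, one per embedding dimension; the example embeddings are hypothetical and chosen only to show that three dimensions already produce an intransitive cycle.

```python
import numpy as np
from itertools import product

def btl_win_prob(u, v):
    """BTL probability that a dimension with score u beats one with score v."""
    return 1.0 / (1.0 + np.exp(v - u))

def majority_vote_prob(x_i, x_j):
    """P(item i beats item j) under a majority vote over d independent
    per-dimension BTL comparisons (d should be odd)."""
    d = len(x_i)
    p = np.array([btl_win_prob(x_i[k], x_j[k]) for k in range(d)])
    total = 0.0
    for outcome in product([0, 1], repeat=d):        # 1 = dimension favors item i
        if sum(outcome) > d / 2:                     # i wins a strict majority of dimensions
            total += np.prod([p[k] if o else 1 - p[k] for k, o in enumerate(outcome)])
    return total

# Hypothetical 3-D embeddings arranged like rock-paper-scissors:
a, b, c = np.array([1., 0., -1.]), np.array([-1., 1., 0.]), np.array([0., -1., 1.])
for x, y, name in [(b, a, "b beats a"), (c, b, "c beats b"), (a, c, "a beats c")]:
    print(name, round(majority_vote_prob(x, y), 3))  # each ~0.58: an intransitive cycle
```

A one-dimensional BTL model could never produce such a cycle, which is exactly the gap the multidimensional majority-vote model is meant to fill.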
Smartphone sensors can be leveraged by malicious apps for a plethora of different attacks, which can also be deployed by malicious websites through the HTML5 WebAPI. In this paper we provide a comprehensive evaluation of the multifaceted threat that mobile web browsing poses to users, by conducting a large-scale study of mobile-specific HTML5 WebAPI calls used in the wild. We build a novel testing infrastructure consisting of actual smartphones on top of a dynamic Android app analysis framework, allowing us to conduct an end-to-end exploration. Our study reveals the extent to which websites are actively leveraging the WebAPI for collecting sensor data, with 2.89% of websites accessing at least one mobile sensor. To provide a comprehensive assessment of the potential risks of this emerging practice, we create a taxonomy of sensor-based attacks from prior studies, and present an in-depth analysis by framing our collected data within that taxonomy. We find that 1.63% of websites could carry out at least one of those attacks. Our findings emphasize the need for a standardized policy across browsers and the ability for users to control what sensor data each website can access.
A common approach when setting up a website is to utilize third-party Web hosting and content delivery networks. Without taking this trend into account, any measurement study inspecting the deployment and operation of websites can be heavily skewed. Unfortunately, the research community lacks generalizable tools that can be used to identify how and where a given website is hosted. Instead, a number of ad hoc techniques have emerged, e.g., using Autonomous System databases or domain prefixes from CNAME records. In this work we propose Pythia, a novel lightweight approach for identifying Web content hosted on third-party infrastructures, including both traditional Web hosts and content delivery networks. Our framework identifies the organization to which a given Web page belongs, and it detects which Web servers are self-hosted and which ones leverage third-party services to serve content. To test our framework we run it on 40,000 URLs and evaluate its accuracy, both by comparing the results with similar services and against a manually validated ground truth. Our tool achieves an accuracy of 90% and finds that under 11% of popular domains are self-hosted. We publicly release our tool to allow other researchers to reproduce our findings and to apply it to their own studies.
Classical event encoding and extraction methods rely on fixed dictionaries of keywords and templates or require ground truth labels for phrase/sentences. This hinders widespread application of information encoding approaches to large-scale free form (unstructured) text available on the web. Event encoding can be viewed as a hierarchical task where the coarser level task is event detection, i.e., identification of documents containing a specific event, and where the fine-grained task is one of event encoding, i.e., identifying key phrases, key sentences. Hierarchical models with attention seem like a natural choice for this problem, given their ability to differentially attend to more or less important features when constructing document representations. In this work we present a novel factorized bilinear multi-aspect attention mechanism (FBMA) that attends to different aspects of text while constructing its representation. We find that our approach outperforms state-of-the-art baselines for detecting civil unrest, military action, and non-state actor events from corpora in two different languages.
Stack Overflow, a Q&A site on programming, awards reputation points and badges (game elements) to users on performing various actions. Situating our work in Digital Signaling Theory, we investigate the role of these game elements in characterizing social qualities (specifically, popularity and impact) of its users. We operationalize these attributes using common metrics and apply statistical modeling to empirically quantify and validate the strength of these signals. Our results are based on a rich dataset of 3,831,147 users and their activities spanning nearly a decade since the site's inception in 2008. We present evidence that certain non-trivial badges, reputation scores and age of the user on the site positively correlate with popularity and impact. Further, we find that the presence of costly to earn and hard to observe signals qualitatively differentiates highly impactful users from highly popular users.
Knowledge Graphs (KGs) have proven to be incredibly useful for enriching semantic Web search results and allowing queries with a well-defined result set. In recent years much attention has been given to the task of inferring missing facts from existing facts in a KG. Approaches have also been proposed for inferring the types of entities; however, these succeed mainly on common types such as 'Person', 'Movie', or 'Actor'. There is still a large gap in the inference of fine-grained types, which are highly important for exploring specific lists and collections within web search. Fine-grained types also generally have relatively few observed instances in KGs to train on, which poses challenges for the development of effective approaches. To address this issue, this paper proposes a new approach to the fine-grained type inference problem, explicitly modeled to leverage domain knowledge and utilize additional data outside the KG, which improves performance on fine-grained type inference. Further improvements in efficiency are achieved by extending the model to probabilistic inference based on entity similarity and typed class classification. We conduct extensive experiments on type triple classification and entity prediction tasks on the Freebase FB15K benchmark dataset. The experimental results show that the proposed model outperforms state-of-the-art approaches for type inference in KGs, and achieves strong results on many-to-one relations when predicting tail entities in the KG completion task.
Computer-aided education systems now seek to provide each student with personalized materials based on the student's individual knowledge. To provide suitable learning materials, tracing each student's knowledge over time is important. However, predicting each student's knowledge is difficult because students tend to forget. Forgetting is driven mainly by two factors: the lag time since the previous interaction, and the number of past trials on a question. Although a few studies consider forgetting while modeling a student's knowledge, some consider only partial information about forgetting, whereas others consider multiple forgetting-related features but ignore the student's learning sequence. In this paper, we focus on modeling and predicting a student's knowledge by considering their forgetting behavior. We extend the deep knowledge tracing model [17], a state-of-the-art sequential model for knowledge tracing, to account for forgetting by incorporating multiple types of forgetting-related information. Experiments on knowledge tracing datasets show that our proposed model improves predictive performance compared to baselines. Moreover, we show that combining multiple types of information that affect forgetting leads to further performance improvements.
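As an illustration of how forgetting-related inputs can be fed into a DKT-style recurrent model, the following PyTorch sketch concatenates the usual one-hot interaction encoding with two hypothetical forgetting features (for example, log lag time since the last attempt on the same skill and the past trial count); the exact features and the way they are integrated in the paper may differ.

```python
import torch
import torch.nn as nn

class DKTForget(nn.Module):
    """Minimal sketch of a DKT-style LSTM extended with forgetting features.

    Per-step input: one-hot (skill x correctness) interaction encoding
    concatenated with hand-crafted forgetting features (illustrative choice)."""
    def __init__(self, n_skills, n_forget_feats=2, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(2 * n_skills + n_forget_feats, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_skills)   # P(correct) for each skill at the next step

    def forward(self, interactions, forget_feats):
        # interactions: (batch, T, 2*n_skills); forget_feats: (batch, T, n_forget_feats)
        x = torch.cat([interactions, forget_feats], dim=-1)
        h, _ = self.lstm(x)
        return torch.sigmoid(self.out(h))

# Toy usage: 4 students, 20 interactions, 50 skills, 2 forgetting features.
model = DKTForget(n_skills=50)
preds = model(torch.zeros(4, 20, 100), torch.zeros(4, 20, 2))   # shape (4, 20, 50)
```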
While online review services provide a two-way conversation between brands and consumers, malicious actors, including misbehaving businesses, have an equal opportunity to distort the reviews for their own gains. We propose OneReview, a method for locating fraudulent reviews, correlating data from multiple crowd-sourced review sites. Our approach utilizes Change Point Analysis to locate points at which a business' reputation shifts. Inconsistent trends in reviews of the same businesses across multiple websites are used to identify suspicious reviews. We then extract an extensive set of textual and contextual features from these suspicious reviews and employ supervised machine learning to detect fraudulent reviews.
We evaluated OneReview on about 805K and 462K reviews from Yelp and TripAdvisor, respectively, to identify fraud on Yelp. Supervised machine learning yields excellent results, with 97% accuracy. We applied the resulting model to suspicious reviews and detected about 62K fraudulent reviews (about 8% of all the Yelp reviews). We further analyzed the detected fraudulent reviews and their authors, and located several spam campaigns in the wild, including campaigns against specific businesses, as well as campaigns consisting of several hundred socially networked untrustworthy accounts.
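The Change Point Analysis step in OneReview is not specified in detail here; the sketch below illustrates the general idea with a simple recursive mean-shift (binary segmentation) detector over a business's monthly average-rating series, as a stand-in rather than the authors' exact procedure.

```python
import numpy as np

def change_points(ratings, min_seg=5, threshold=0.5):
    """Recursive binary segmentation: find indices where the mean rating shifts
    by more than `threshold`, keeping segments of at least `min_seg` points."""
    def split(lo, hi, out):
        best_k, best_gap = None, threshold
        for k in range(lo + min_seg, hi - min_seg):
            gap = abs(ratings[lo:k].mean() - ratings[k:hi].mean())
            if gap > best_gap:
                best_k, best_gap = k, gap
        if best_k is not None:
            out.append(best_k)
            split(lo, best_k, out)
            split(best_k, hi, out)
    cps = []
    split(0, len(ratings), cps)
    return sorted(cps)

# Toy reputation series: stable, then a dip, then recovery (monthly averages).
series = np.array([4.5] * 12 + [2.8] * 12 + [4.4] * 12)
print(change_points(series))   # -> [12, 24]
```

In the pipeline described above, reviews falling around such shift points on one site, with no matching shift on another site, would be flagged as suspicious and passed to the supervised classifier.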
Talent Search systems aim to recommend potential candidates who are a good match to the hiring needs of a recruiter expressed in terms of the recruiter's search query or job posting. Past work in this domain has focused on linear and nonlinear models which lack preference personalization in the user-level due to being trained only with globally collected recruiter activity data. In this paper, we propose an entity-personalized Talent Search model which utilizes a combination of generalized linear mixed (GLMix) models and gradient boosted decision tree (GBDT) models, and provides personalized talent recommendations using nonlinear tree interaction features generated by the GBDT. We also present the offline and online system architecture for the productionization of this hybrid model approach in our Talent Search systems. Finally, we provide offline and online experiment results benchmarking our entity-personalized model with tree interaction features, which demonstrate significant improvements in our precision metrics compared to globally trained non-personalized models.
Existing recommendation algorithms mostly focus on optimizing traditional recommendation measures, such as the accuracy of rating prediction in terms of RMSE or the quality of top-k recommendation lists in terms of precision, recall, MAP, etc. However, an important expectation for commercial recommendation systems is to improve the final revenue/profit of the system. Traditional recommendation targets such as rating prediction and top-k recommendation are not directly related to this goal.
In this work, we blend fundamental concepts from online advertising and micro-economics into personalized recommendation for profit maximization. Specifically, we propose value-aware recommendation based on reinforcement learning, which directly optimizes the economic value of candidate items to generate the recommendation list. In particular, we generalize the basic concept of click conversion rate (CVR) in computational advertising into the conversion rate of an arbitrary user action (XVR) in e-commerce, where the user actions can be clicking, adding to cart, adding to wishlist, etc. In this way, each type of user action is mapped to its monetized economic value. The economic values of different user actions are further integrated as the reward of a ranking list, and reinforcement learning is used to optimize the recommendation list for the maximum total value. Experimental results on both offline benchmarks and online commercial systems verify the improved performance of our framework, in terms of both traditional top-k ranking tasks and the economic profits of the system.
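A minimal sketch of the reward construction described above, with purely illustrative XVR and value numbers: each shown item contributes the expected monetized value of all user actions, and the list reward (which the reinforcement-learning policy maximizes) is the sum over the list.

```python
# Hypothetical per-action conversion rates (XVR) and monetized values;
# in practice these would be estimated from logged user behavior.
XVR = {"click": 0.10, "add_to_cart": 0.03, "add_to_wishlist": 0.02, "purchase": 0.01}
VALUE = {"click": 0.05, "add_to_cart": 0.40, "add_to_wishlist": 0.20}

def item_value(price):
    """Expected economic value of showing one item: sum over non-purchase actions
    of P(action) * value(action), plus P(purchase) * item price."""
    v = sum(XVR[a] * VALUE[a] for a in VALUE)
    return v + XVR["purchase"] * price

def list_reward(prices):
    """Reward of a recommendation list = total expected economic value."""
    return sum(item_value(p) for p in prices)

print(round(list_reward([19.9, 5.0, 42.0]), 3))
```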
Entity alignment associates entities in different knowledge graphs if they are semantically the same, and has been successfully used in knowledge graph construction and connection. Most recent solutions for entity alignment are based on knowledge graph embedding, which maps entities into a low-dimensional space where they are connected under the guidance of prior aligned entity pairs. The study in this paper focuses on two important issues that limit the accuracy of current entity alignment solutions: 1) labeled data of previously aligned entity pairs are difficult and expensive to acquire, whereas abundant unlabeled data go unused; and 2) knowledge graph embedding is affected by entity degree differences, which makes it challenging to align high-frequency and low-frequency entities. We propose a semi-supervised entity alignment method (SEA) that leverages both labeled entities and the abundant unlabeled entity information for alignment. Furthermore, we improve the knowledge graph embedding with awareness of degree differences by performing adversarial training. To evaluate our proposed model, we conduct extensive experiments on real-world datasets. The experimental results show that our model consistently outperforms state-of-the-art methods, with significant improvements in alignment accuracy.
Online social movements (OSMs) play a key role in promoting democracy in modern society. Most online activism is largely driven by critical offline events. Among the many studies investigating collective behavior in OSMs, few have explored the interaction between crowd dynamics and their offline context. Here, focusing on the Black Lives Matter OSM and utilizing an event-driven approach on a dataset of 36 million tweets and thousands of offline events, we study how different types of offline events, namely police violence and heightened protests, influence crowd behavior over time. We find that police violence events and protests play important roles in the recruitment process. Moreover, by analyzing re-participation dynamics and patterns of social interaction, we find that, in the long term, users who joined the movement during police violence events and protests show significantly more commitment than those who joined at other times. However, users recruited at other times are more committed to the movement than the other two groups in the short term. Furthermore, we observe that social ties formed during police violence events are more likely to be sustained over time than those formed at other times. By contrast, ties formed during protests are the least likely to be maintained. Altogether, our results shed light on the impact of bursting events on the recruitment, retention, and communication patterns of collective behavior in the Black Lives Matter OSM.
A major drawback of cross-network recommender solutions is that they can only be applied to users who overlap across networks. Thus, the non-overlapped users, who form the majority, are ignored. As a solution, we propose CnGAN, a novel multi-task learning based, encoder-GAN-recommender architecture. The proposed model synthetically generates source-network user preferences for non-overlapped users by learning the mapping from target- to source-network preference manifolds. The resultant user preferences are used in a Siamese network based neural recommender architecture. Furthermore, we propose a novel user-based pairwise loss function for recommendations using implicit interactions to better guide the generation process in the multi-task learning environment. We illustrate our solution by generating user preferences on the Twitter source network for recommendations on the YouTube target network. Extensive experiments show that the generated preferences can be used to improve recommendations for non-overlapped users. The resultant recommendations achieve superior performance compared to state-of-the-art cross-network recommender solutions in terms of accuracy, novelty and diversity.
When information or infectious diseases spread over a network, in many practical cases, one can observe when nodes adopt information or become infected, but the underlying network is hidden. In this paper, we analyze the problem of finding communities of highly interconnected nodes, given only the infection times of nodes. We propose, analyze, and empirically compare several algorithms for this task. The most stable performance, that improves the current state-of-the-art, is obtained by our proposed heuristic approaches, that are agnostic to a particular graph structure and epidemic model.
While test collections provide the cornerstone of system-based evaluation in information retrieval, human relevance judging has become prohibitively expensive as collections have grown ever larger. Consequently, intelligently deciding which documents to judge has become increasingly important. We propose a two-phase approach to intelligent judging across topics which does not require document rankings from a shared task. In the first phase, we dynamically select the next topic to judge via a multi-armed bandit method. In the second phase, we employ active learning to select which document to judge next for that topic. Experiments on three TREC collections (with varying scarcity of relevant documents) achieve a rank correlation of 0.90 for system ranking and find 90% of the relevant documents at 48% of the original budget. To support reproducibility and follow-on work, we have shared our code online.
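A rough sketch of the two-phase loop, assuming a Beta-Bernoulli Thompson-sampling bandit for topic selection and leaving document selection as a placeholder for the active-learning step; the paper's actual bandit, reward definition, and document selector may differ.

```python
import random

class TopicBandit:
    """Phase 1 sketch: Thompson sampling over topics, where the reward for a
    topic is whether the document just judged for it turned out relevant."""
    def __init__(self, topics):
        self.stats = {t: [1, 1] for t in topics}   # Beta(alpha, beta) per topic

    def select_topic(self):
        return max(self.stats, key=lambda t: random.betavariate(*self.stats[t]))

    def update(self, topic, relevant):
        self.stats[topic][0 if relevant else 1] += 1

def select_document(topic, pools):
    """Phase 2 placeholder: an active learner would pick the most informative
    unjudged document for `topic`; here we simply take the next pooled one."""
    return pools[topic].pop(0)

# Toy judging loop with a hypothetical oracle standing in for a human assessor.
pools = {t: [f"doc{t}_{i}" for i in range(50)] for t in ["t1", "t2", "t3"]}
bandit = TopicBandit(list(pools))
for _ in range(30):
    topic = bandit.select_topic()
    doc = select_document(topic, pools)
    relevant = random.random() < {"t1": 0.6, "t2": 0.2, "t3": 0.05}[topic]
    bandit.update(topic, relevant)
print(bandit.stats)   # the relevance-rich topic accumulates the most judgments
```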
Counterfeit apps impersonate existing popular apps in attempts to misguide users. Many counterfeits can be identified once installed, however even a tech-savvy user may struggle to detect them before installation. In this paper, we propose a novel approach of combining content embeddings and style embeddings generated from pre-trained convolutional neural networks to detect counterfeit apps. We present an analysis of approximately 1.2 million apps from Google Play Store and identify a set of potential counterfeits for top-10,000 apps. Under conservative assumptions, we were able to find 2,040 potential counterfeits that contain malware in a set of 49,608 apps that showed high similarity to one of the top-10,000 popular apps in Google Play Store. We also find 1,565 potential counterfeits asking for at least five additional dangerous permissions than the original app and 1,407 potential counterfeits having at least five extra third party advertisement libraries.
Sequential history of user interactions as well as the context of interactions provide valuable information to recommender systems, for modeling user behavior. Modeling both contexts and sequential information simultaneously, in context-aware sequential recommenders, has been shown to outperform methods that model either one of the two aspects. In long sequential histories, temporal trends are also found within sequences of contexts and temporal gaps that are not modeled by previous methods. In this paper we design new context-aware sequential recommendation methods, based on Stacked Recurrent Neural Networks, that model the dynamics of contexts and temporal gaps. Experiments on two large benchmark datasets demonstrate the advantages of modeling the evolution of contexts and temporal gaps - our models significantly outperform state-of-the-art context-aware sequential recommender systems.
We perform cross-lingual question retrieval in community question answering (cQA), i.e., we retrieve similar questions for queries that are given in another language. The standard approach to cross-lingual information retrieval, which is to automatically translate the query into the target language and continue with a monolingual retrieval model, typically falls short in cQA due to translation errors. This is even more the case in specialized domains such as technical cQA, which we explore in this work. To remedy this, we propose two extensions to this approach that improve cross-lingual question retrieval: (1) we enhance an NMT model with monolingual cQA data to improve the translation quality, and (2) we improve the robustness of a state-of-the-art neural question retrieval model to common translation errors by adding back-translations during training. Our results show that we achieve substantial improvements over the baseline approach and considerably close the gap to a setup where we have access to an external commercial machine translation service (i.e., Google Translate), which is often not the case in many practical scenarios. Our source code and data are publicly available.
Workers in crowd markets struggle to earn a living. One reason for this is that it is difficult for workers to accurately gauge the hourly wages of microtasks, and they consequently end up performing labor with little pay. In general, workers are provided with little information about tasks, and are left to rely on noisy signals, such as textual description of the task or rating of the requester. This study explores various computational methods for predicting the working times (and thus hourly wages) required for tasks based on data collected from other workers completing crowd work. We provide the following contributions. (i) A data collection method for gathering real-world training data on crowd-work tasks and the times required for workers to complete them; (ii) TurkScanner: a machine learning approach that predicts the necessary working time to complete a task (and can thus implicitly provide the expected hourly wage). We collected 9,155 data records using a web browser extension installed by 84 Amazon Mechanical Turk workers, and explored the challenge of accurately recording working times both automatically and by asking workers. TurkScanner was created using ~ 150 derived features, and was able to predict the hourly wages of 69.6% of all the tested microtasks within a 75% error. Directions for future research include observing the effects of tools on people's working practices, adapting this approach to a requester tool for better price setting, and predicting other elements of work (e.g., the acceptance likelihood and worker task preferences.)
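The working-time predictor can be approximated, at a very small scale, as a supervised regression over task features; the sketch below uses scikit-learn's gradient boosting on a handful of hypothetical features and toy records (the real TurkScanner uses roughly 150 derived features and thousands of collected records).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical task features: [reward in USD, # input fields, description length, batch size].
X_train = np.array([[0.10, 3, 400, 500],
                    [0.50, 12, 1500, 100],
                    [1.00, 25, 3000, 50],
                    [0.05, 1, 120, 2000]])
y_train = np.log([90, 420, 900, 35])          # observed working times in seconds (toy data)

model = GradientBoostingRegressor().fit(X_train, y_train)

def predicted_hourly_wage(task_features, reward_usd):
    """Predict working time, then convert the task reward into an implied hourly wage."""
    seconds = float(np.exp(model.predict([task_features])[0]))
    return reward_usd / seconds * 3600.0

print(round(predicted_hourly_wage([0.20, 5, 600, 300], 0.20), 2))   # implied USD per hour
```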
The Web is one of the most successful Internet applications. Yet, the quality of Web users' experience is still largely impenetrable. Whereas Web performance is typically gathered with controlled experiments, in this work we perform a large-scale study of one of the most popular websites, namely Wikipedia, explicitly asking (a small fraction of) its users for feedback on the browsing experience. We leverage user survey responses to build a data-driven model of user satisfaction which, despite including state-of-the-art quality of experience metrics, is still far from achieving accurate results, and we discuss directions to move forward. Finally, we aim to make our dataset publicly available, which will hopefully contribute to enriching and refining the scientific community's knowledge of Web users' quality of experience (QoE).
Music is known to exhibit different characteristics, depending on genre and style. While most research that studies such differences takes a musicological perspective and analyzes acoustic properties of individual pieces or artists, we conduct a large-scale analysis using various web resources. Exploiting content information from song lyrics, contextual information reflected in music artists' Wikipedia articles, and listening information, we particularly study the aspects of popularity, length, repetitiveness, and readability of lyrics and Wikipedia articles. We measure popularity in terms of song play count (PC) and listener count (LC), length in terms of character and word count, repetitiveness in terms of text compression ratio, and readability in terms of the Simple Measure of Gobbledygook (SMOG). Extending datasets of music listening histories and genre annotations from Last.fm, we extract and analyze 424,476 song lyrics by 18,724 artists from LyricWiki.
We set out to answer whether there exist significant genre differences in song lyrics (RQ1) and artist Wikipedia articles (RQ2) in terms of repetitiveness and readability. We also assess whether we can find evidence to support the cliché that lyrics of very popular artists are particularly simple and repetitive (RQ3). We further investigate whether the characteristics of popularity, length, repetitiveness, and readability correlate within and between lyrics and Wikipedia articles (RQ4).
We identify substantial differences in repetitiveness and readability of lyrics between music genres. In contrast, no significant differences between genres are found for artists' Wikipedia pages. Also, we find that lyrics of highly popular artists are repetitive but not necessarily simple in terms of readability. Furthermore, we uncover weak correlations between length of lyrics and of Wikipedia pages of the same artist, weak correlations between lyrics' reading difficulty and their length, and moderate correlations between artists' popularity and length of their lyrics.
This paper proposes an attributed network growth model. Despite the knowledge that individuals use limited resources to form connections to similar others, we lack an understanding of how local and resource-constrained mechanisms explain the emergence of structural properties found in real-world networks. We make three contributions. First, we propose a simple and accurate model of attributed network growth that jointly explains the emergence of in-degree, local clustering, clustering-degree relationship and attribute mixing patterns. Second, we make use of biased random walks to develop a model that forms edges locally, without recourse to global information. Third, we account for multiple sociological phenomena: bounded rationality; structural constraints; triadic closure; attribute homophily; preferential attachment. Our experiments show that the proposed Attributed Random Walk (ARW) model accurately preserves network structure and attribute mixing patterns of real-world networks; it improves upon the performance of eight well-known models by a significant margin of 2.5-10 × .
Microservice architectures and container technologies are broadly adopted by giant internet companies to support their web services, which typically have a strict service-level objective (SLO), tail latency, rather than average latency. However, diagnosing SLO violations, e.g., long tail latency problem, is non-trivial for large-scale web applications in shared microservice platforms due to million-level operational data and complex operational environments.
We identify a new type of tail latency problem for web services, small-window long-tail latency (SWLT), which is typically aggregated over a small statistical window (e.g., 1 minute or 1 second). We observe that SWLT usually occurs in a small number of containers in microservice clusters and shifts sharply among different containers at different time points. To diagnose the root causes of SWLT, we propose an unsupervised and low-cost diagnosis algorithm, ε-Diagnosis, which uses a two-sample test algorithm and ε-statistics to measure the similarity of time series and identify root-cause metrics from among millions of metrics. We implement and deploy a real-time diagnosis system on our production microservice platforms. The evaluation using real web application datasets demonstrates that ε-Diagnosis can identify all the actual root causes at runtime and significantly reduce the candidate problem space, outperforming other time-series-distance based root-cause analysis algorithms.
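To illustrate the two-sample-test idea, the sketch below compares each candidate metric's samples in a normal window against the anomalous window and ranks the metrics whose distributions shift significantly; it uses SciPy's Kolmogorov-Smirnov test as a stand-in for the paper's ε-statistics.

```python
import numpy as np
from scipy.stats import ks_2samp

def rank_root_causes(metrics_normal, metrics_abnormal, alpha=0.01):
    """Rank candidate root-cause metrics by how strongly their distribution
    differs between a normal window and the anomalous (SWLT) window.
    KS test used here as a stand-in for the epsilon-statistics in the paper."""
    scores = {}
    for name, normal in metrics_normal.items():
        stat, pvalue = ks_2samp(normal, metrics_abnormal[name])
        if pvalue < alpha:                 # distribution changed significantly
            scores[name] = stat
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

rng = np.random.default_rng(0)
normal = {"cpu": rng.normal(0.30, 0.05, 60), "gc_pause": rng.normal(10, 2, 60)}
abnormal = {"cpu": rng.normal(0.31, 0.05, 60), "gc_pause": rng.normal(45, 5, 60)}
print(rank_root_causes(normal, abnormal))  # gc_pause surfaces as the likely root cause
```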
Recommender systems are widely used to recommend the most appealing items to users. These recommendations can be generated by applying collaborative filtering methods. The low-rank matrix completion method is the state-of-the-art collaborative filtering method. In this work, we show that the skewed distribution of ratings in the user-item rating matrix of real-world datasets affects the accuracy of matrix-completion-based approaches. We also show that the number of ratings an item or a user has correlates positively with the ability of low-rank matrix-completion-based approaches to predict ratings for that item or user accurately. Furthermore, we use these insights to develop four matrix-completion-based approaches, i.e., Frequency Adaptive Rating Prediction (FARP), Truncated Matrix Factorization (TMF), Truncated Matrix Factorization with Dropout (TMF + Dropout) and Inverse Frequency Weighted Matrix Factorization (IFWMF), that outperform traditional matrix-completion-based approaches for users and items with few ratings in the user-item rating matrix.
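One plausible reading of the inverse-frequency weighting idea (not necessarily the paper's exact objective) is a matrix factorization in which each observed rating is weighted by the inverse of its item's rating count, as in this small SGD sketch.

```python
import numpy as np

def ifwmf(ratings, n_users, n_items, k=16, lr=0.01, reg=0.05, epochs=30):
    """Inverse-frequency-weighted MF sketch: each (user, item, rating) triple
    contributes to the loss with weight ~ 1 / count(item), so sparsely rated
    items are not drowned out by popular ones."""
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_users, k))
    Q = rng.normal(scale=0.1, size=(n_items, k))
    item_counts = np.bincount([i for _, i, _ in ratings], minlength=n_items)
    w = 1.0 / np.maximum(item_counts, 1)
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]
            P[u] += lr * (w[i] * err * Q[i] - reg * P[u])
            Q[i] += lr * (w[i] * err * P[u] - reg * Q[i])
    return P, Q

# Toy data: item 0 is popular, item 1 has a single rating.
ratings = [(0, 0, 5.0), (1, 0, 4.0), (2, 0, 5.0), (0, 1, 2.0)]
P, Q = ifwmf(ratings, n_users=3, n_items=2)
print(round(float(P[0] @ Q[1]), 2))   # prediction for the sparsely rated item
```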
In the past, there have been several studies analyzing password strength and structure. However, many questions remain open about what really makes passwords both memorable and strong. In this work, we aim to answer some of these questions by analyzing a password dataset through the lenses of data science and machine learning. We use a dataset of 3,260 memorable passwords collected from prior IRB-approved user studies over three years and classify the passwords into three strength groups using online and offline attack limits. We then apply tensor decomposition to analyze the dataset by constructing a third-order tensor from the passwords' syntactic and semantic features. In particular, we use the PARAFAC2 tensor decomposition to uncover the main characteristics and features that affect password strength. We quantitatively identify the underlying factors that are more frequently observed in strong and memorable passwords. We hope that our findings can validate widely accepted advice for creating strong passwords and provide useful insights for designing a better password suggestion system.
Rich engagement data can shed light on how people interact with online content and how such interactions may be determined by the content of the page. In this work, we investigate a specific type of interaction, backtracking, which refers to the action of scrolling back in a browser while reading an online news article. We leverage a dataset of close to 700K instances of more than 15K readers interacting with online news articles, in order to characterize and predict backtracking behavior. We first define different types of backtracking actions. We then show that “full” backtracks, where the readers eventually return to the spot at which they left the text, can be predicted by using features that were previously shown to relate to text readability. This finding highlights the relationship between backtracking and readability and suggests that backtracking could help assess readability of content at scale.
In this paper, we investigate to what extent the page modifications that make browser extensions fingerprintable are necessary for their operation. We characterize page modifications that are completely unnecessary for the extension's functionality as extension bloat. By analyzing 58,034 extensions from the Google Chrome store, we discovered that 5.7% of them were unnecessarily identifiable because of extension bloat. To protect users against unnecessary extension fingerprinting due to bloat, we describe the design and implementation of an in-browser mechanism that provides coarse-grained access control for extensions on all websites. The proposed mechanism and its built-in policies not only protect users from fingerprinting, but also offer additional protection against malicious extensions exfiltrating user data from sensitive websites.
Combining simple elements from the literature, we define a linear model that is geared toward sparse data, in particular implicit feedback data for recommender systems. We show that its training objective has a closed-form solution, and discuss the resulting conceptual insights. Surprisingly, this simple model achieves better ranking accuracy than various state-of-the-art collaborative-filtering approaches, including deep non-linear models, on most of the publicly available data-sets used in our experiments.
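The abstract does not spell out the closed-form solution; the sketch below shows a well-known item-item linear model of this kind (an EASE-style least-squares objective with an L2 penalty and a zero-diagonal constraint), which is consistent with the description but should not be read as the authors' exact formulation.

```python
import numpy as np

def closed_form_item_model(X, lam=500.0):
    """Item-item linear model with a closed-form solution (EASE-style sketch).
    X: (n_users, n_items) binary implicit-feedback matrix.
    Learns B (zero diagonal) minimizing ||X - X B||^2 + lam * ||B||^2."""
    G = X.T @ X + lam * np.eye(X.shape[1])   # regularized Gram matrix
    P = np.linalg.inv(G)
    B = P / (-np.diag(P))                    # Lagrangian solution of the zero-diagonal constraint
    np.fill_diagonal(B, 0.0)
    return B

# Toy usage: rank unseen items per user by the reconstructed scores.
X = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)
scores = X @ closed_form_item_model(X, lam=1.0)
print(np.round(scores, 2))
```

Because training reduces to a single regularized matrix inversion, no iterative optimization is needed, which is what makes the strong ranking results of such a simple model surprising.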
The degree to which Mexican immigrants in the U.S. are assimilating culturally has been widely debated. To examine this question, we focus on musical taste, a key symbolic resource that signals the social positions of individuals. We adapt an assimilation metric from earlier work to analyze self-reported musical interests among immigrants on Facebook. We use relative levels of interest in musical genres, where similarity to the host population in musical preferences is treated as evidence of cultural assimilation. Contrary to skeptics of Mexican assimilation, we find significant cultural convergence even among first-generation immigrants, which problematizes their use as assimilative “benchmarks” in the literature. Further, second-generation Mexican Americans show high cultural convergence vis-à-vis both Anglos and African-Americans, with the exception of those who speak Spanish. Rather than conforming to a single assimilation path, our findings reveal how Mexican immigrants defy simple unilinear theoretical expectations and illuminate their uniquely heterogeneous character.
Airbnb is a two-sided rental marketplace offering a variety of unique and more traditional accommodation options. Similar to other online marketplaces, we invest in optimizing the content surfaced on the search UI and ranking relevance to improve the guest online search experience. The unique Airbnb inventory, however, surfaces some major data challenges. Given the high stakes of booking less traditional accommodations, users can spend many days to weeks searching and scanning the description pages of many accommodation “listings” before making a decision to book. Moreover, much of the information about a listing is unstructured and can only be found by the user after they go through the details on the listing page. As a result, we have found that traditional search metrics do not work well in the context of our platform. Basic metrics of single user actions, such as click-through rates, number of listings viewed, or dwell time, are not consistently directionally correlated with our downstream business metrics. To address these issues we leverage machine learning to isolate signals of intent from rich behavioral data. These signals have key applications including analytical insights, ranking model inputs, and experimentation velocity. In this paper, we describe the development of a model-based user intent metric, the “intentful listing view”, which combines the signals of a variety of user micro-actions on the listing description page. We demonstrate that this learned metric is directionally correlated with downstream conversion metrics and sensitive across a variety of historical search experiments.
Online users engage in self-disclosure - revealing personal information to others - in pursuit of social rewards. However, there are associated costs of disclosure to users' privacy. User profiling techniques support the use of contributed content for a number of purposes, e.g., micro-targeting advertisements. In this paper, we study self-disclosure as it occurs in newspaper comment forums. We explore a longitudinal dataset of about 60,000 comments on 2202 news articles from four major English news websites. We start with detection of language indicative of various types of self-disclosure, leveraging both syntactic and semantic information present in texts. Specifically, we use dependency parsing for subject, verb, and object extraction from sentences, in conjunction with named entity recognition to extract linguistic indicators of self-disclosure. We then use these indicators to examine the effects of anonymity and topic of discussion on self-disclosure. We find that anonymous users are more likely to self-disclose than identifiable users, and that self-disclosure varies across topics of discussion. Finally, we discuss the implications of our findings for user privacy.
The visual appearance of a webpage carries valuable information about the page's quality and can be used to improve the performance of learning to rank (LTR). We introduce the Visual learning TO Rank (ViTOR) model that integrates state-of-the-art visual features extraction methods: (i) transfer learning from a pre-trained image classification model, and (ii) synthetic saliency heat maps generated from webpage snapshots. Since there is currently no public dataset for the task of LTR with visual features, we also introduce and release the ViTOR dataset, containing visually rich and diverse webpages. The ViTOR dataset consists of visual snapshots, non-visual features and relevance judgments for ClueWeb12 webpages and TREC Web Track queries. We experiment with the proposed ViTOR model on the ViTOR dataset and show that it significantly improves the performance of LTR with visual features.
Health literacy, i.e., the ability to read and understand medical text, is a relevant component of public health. Unfortunately, many medical texts are hard for the general population to grasp, as they are targeted at highly skilled professionals and use complex language and domain-specific terms. Here, automatic text simplification that makes text generally understandable would be very beneficial. However, research and development in medical text simplification is hindered by the lack of openly available training and test corpora containing complex medical sentences aligned with their simplified versions. In this paper, we introduce such a dataset to aid medical text simplification research. The dataset is created by filtering aligned health sentences using expert knowledge from an existing aligned corpus and a novel, simple, language-independent monolingual text alignment method. Furthermore, we use the dataset to train a state-of-the-art neural machine translation model and compare it to a model trained on a general simplification dataset, using an automatic evaluation and an extensive human-expert evaluation.
Capturing the meaning of sentences has long been a challenging task. Current models tend to apply linear combinations of word features to conduct semantic composition for larger-granularity units, e.g., phrases, sentences, and documents. However, semantic linearity does not always hold in human language. For instance, the meaning of the phrase “ivory tower” cannot be deduced by linearly combining the meanings of “ivory” and “tower”. To address this issue, we propose a new framework that models different levels of semantic units (e.g., sememe, word, sentence, and semantic abstraction) on a single Semantic Hilbert Space, which naturally admits a non-linear semantic composition by means of a complex-valued vector word representation. An end-to-end neural network is proposed to implement the framework for the text classification task, and evaluation results on six benchmark text classification datasets demonstrate the effectiveness, robustness and self-explanation power of the proposed model. Furthermore, intuitive case studies are conducted to help end users understand how the framework works.
The proliferation of publicly accessible urban data provides new insights into various urban tasks. A frequently used approach is to treat each region as a data sample and build a model over all the regions to observe the correlations between urban features (e.g., demographics) and the target variable (e.g., crime count). To define regions, most existing studies use fixed grids or pre-defined administrative boundaries (e.g., census tracts or community areas). In reality, however, the definition of regions should differ depending on the task (e.g., regional crime count prediction vs. real estate price estimation). In this paper, we propose the new problem of task-specific city region partitioning, aiming to find the best partition of a city w.r.t. a given task. We prove this is an NP-hard search problem with no trivial solution. To learn the partition, we first study two variants of Markov Chain Monte Carlo (MCMC). We further propose a reinforcement learning scheme for effectively sampling the search space. We conduct experiments on two real datasets from Chicago (i.e., crime counts and real estate prices) to demonstrate the effectiveness of our proposed method.
To alleviate the sparsity and cold-start problems of collaborative filtering based recommender systems, researchers and engineers usually collect attributes of users and items, and design delicate algorithms to exploit this additional information. In general, the attributes are not isolated but connected with each other, forming a knowledge graph (KG). In this paper, we propose Knowledge Graph Convolutional Networks (KGCN), an end-to-end framework that captures inter-item relatedness effectively by mining their associated attributes on the KG. To automatically discover both the high-order structural information and the semantic information of the KG, we sample from the neighbors of each entity in the KG as its receptive field, then combine neighborhood information with bias when calculating the representation of a given entity. The receptive field can be extended to multiple hops away to model high-order proximity information and capture users' potential long-distance interests. Moreover, we implement the proposed KGCN in a minibatch fashion, which enables our model to operate on large datasets and KGs. We apply the proposed model to three datasets about movie, book, and music recommendation, and experimental results demonstrate that our approach outperforms strong recommender baselines.
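A single KGCN-style propagation step can be sketched as follows: sample a fixed-size receptive field around an entity, weight each sampled neighbor by a user-relation score, and mix the aggregated neighborhood into the entity's own embedding; the dimensions, the softmax weighting, and the sum aggregator are illustrative choices rather than the paper's exact ones.

```python
import numpy as np

def kgcn_layer(entity, user_emb, ent_emb, rel_emb, neighbors, n_samples=4, rng=None):
    """One illustrative KGCN-style layer over a toy knowledge graph.

    neighbors[entity] is a list of (relation_id, neighbor_entity_id) edges."""
    rng = rng or np.random.default_rng(0)
    nbrs = neighbors[entity]
    sampled = [nbrs[i] for i in rng.integers(0, len(nbrs), n_samples)]   # fixed-size receptive field
    weights = np.array([user_emb @ rel_emb[r] for r, _ in sampled])      # user-relation scores
    weights = np.exp(weights) / np.exp(weights).sum()                    # softmax over sampled neighbors
    nbr_agg = sum(w * ent_emb[e] for w, (_, e) in zip(weights, sampled))
    return np.tanh(ent_emb[entity] + nbr_agg)                            # "sum" aggregator

# Toy KG: entity 0 connects to entities 1 and 2 via relations 0 and 1.
ent_emb = np.random.default_rng(1).normal(size=(3, 8))
rel_emb = np.random.default_rng(2).normal(size=(2, 8))
neighbors = {0: [(0, 1), (1, 2)]}
print(kgcn_layer(0, np.ones(8), ent_emb, rel_emb, neighbors).shape)      # (8,)
```

Stacking this layer so that sampled neighbors are themselves expanded gives the multi-hop receptive field described above.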
Network embedding is a method to learn low-dimensional representation vectors for nodes in complex networks. In real networks, nodes may have multiple tags but existing methods ignore the abundant semantic and hierarchical information of tags. This information is useful to many network applications and usually very stable. In this paper, we propose a tag representation learning model, Tag2Vec, which mixes nodes and tags into a hybrid network. Firstly, for tag networks, we define semantic distance as the proximity between tags and design a novel strategy, parameterized random walk, to generate context with semantic and hierarchical information of tags adaptively. Then, we propose hyperbolic Skip-gram model to express the complex hierarchical structure better with lower output dimensions. We evaluate our model on the NBER U.S. patent dataset and WordNet dataset. The results show that our model can learn tag representations with rich semantic information and it outperforms other baselines.
With the blossoming of social networking platforms like Twitter and Facebook, how to infer the opinions of online social network users on specific topics they have not yet explicitly expressed has received much attention. Existing solutions mainly rely on a user's previously posted messages. However, recent studies show that over 40% of users opt to be silent all or most of the time and post very few messages. Consequently, the performance of existing solutions drops dramatically when they are applied to infer silent users' opinions, and how to infer the opinions of these silent users becomes a meaningful yet challenging task. Inspired by collaborative filtering techniques in cold-start recommendation, we infer the opinions of silent users by leveraging the text content posted by active users and their relationships with silent users. Specifically, we first consider both observed and pseudo relationships among users, and cluster users into communities in order to extract various kinds of features for opinion inference. We then design a coupled sparse matrix factorization (CSMF) model to capture the complex relations among these features. Extensive experiments on real-world data from Twitter show that our CSMF model achieves over 80% accuracy in inferring silent users' opinions.
The task of temporal slot filling (TSF) is to extract the values (also called facts) of specific attributes for a given entity from text data and to find the time points when those values were valid. It is challenging to find precise time points given incomplete and noisy temporal contexts in the text. In this work, we propose an unsupervised approach with two modules that mutually enhance each other: one is a reliability estimator for fact extractors conditioned on the temporal contexts; the other is a fact trustworthiness estimator based on the extractors' reliability. The iterative learning process reduces the noise in the extractions. Experiments demonstrate that our approach, with this novel design, can accurately and efficiently extract precise temporal facts from newspaper corpora.
Users are usually involved in multiple social networks, without explicit anchor links that reveal the correspondence among the different accounts of the same user across networks. Anchor link prediction aims to identify these hidden anchor links, which is a fundamental problem for user profiling, information cascading, and cross-domain recommendation. Although existing methods perform well in terms of the accuracy of anchor link prediction, the pairwise search manner of inferring anchor links faces significant challenges when deployed in practical systems. To address these challenges, in this paper we propose a novel embedding and matching architecture that directly learns a binary hash code for each node. Hash codes offer an efficient index to filter out candidate node pairs for anchor link prediction. Extensive experiments on synthetic and real-world large-scale datasets demonstrate that our proposed method has high time efficiency without loss of competitive prediction accuracy in anchor link prediction.
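The efficiency gain comes from the indexing step: once every node carries a binary hash code, only cross-network pairs within a small Hamming radius need to be scored. The sketch below shows that filtering step with random codes standing in for the learned ones (a real system would bucket codes in a multi-index rather than use the brute-force double loop shown here).

```python
import numpy as np

def hamming(a, b):
    return int(np.count_nonzero(a != b))

def candidate_pairs(codes_src, codes_tgt, radius=8):
    """Keep only cross-network node pairs whose hash codes are within `radius`
    Hamming distance; only these survive to the expensive matching stage."""
    return [(u, v)
            for u, cu in codes_src.items()
            for v, cv in codes_tgt.items()
            if hamming(cu, cv) <= radius]

rng = np.random.default_rng(0)
codes_src = {u: rng.integers(0, 2, 32) for u in range(500)}   # 32-bit codes, random stand-ins
codes_tgt = {v: rng.integers(0, 2, 32) for v in range(500)}
pairs = candidate_pairs(codes_src, codes_tgt)
print(len(pairs), "of", 500 * 500, "pairs survive the Hamming filter")
```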
Chinese named entity recognition (CNER) is an important task in the Chinese natural language processing field. However, CNER is very challenging since Chinese entity names are highly context-dependent. In addition, Chinese texts lack delimiters to separate words, making it difficult to identify entity boundaries. Moreover, the training data for CNER in many domains is usually insufficient, and annotating enough training data for CNER is very expensive and time-consuming. In this paper, we propose a neural approach for CNER. First, we introduce a CNN-LSTM-CRF neural architecture to capture both local and long-distance contexts for CNER. Second, we propose a unified framework to jointly train CNER and word segmentation models in order to enhance the ability of the CNER model to identify entity boundaries. Third, we introduce an automatic method to generate pseudo-labeled samples from existing labeled data, which enriches the training data. Experiments on two benchmark datasets show that our approach can effectively improve the performance of Chinese named entity recognition, especially when training data is insufficient.
Semi-supervised multi-view feature learning (SMFL) is a feasible solution for webpage classification. However, how to fully extract complementarity and correlation information effectively in the semi-supervised setting has not been well studied. In this paper, we propose a semi-supervised multi-view individual and sharable feature learning (SMISFL) approach, which jointly learns multiple view-individual transformations and one sharable transformation to explore the view-specific property of each view and the common property across views. We design a semi-supervised multi-view similarity preserving term, which fully utilizes the label information of labeled samples and the similarity information of unlabeled samples from both intra-view and inter-view aspects. To promote learning of diversity, we impose a constraint on the view-individual transformations so that the learned view-specific features are statistically uncorrelated. Furthermore, we train a linear classifier, such that view-specific and shared features can be effectively combined for classification. Experiments on widely used webpage datasets demonstrate that SMISFL can significantly outperform state-of-the-art SMFL and webpage classification methods.
In this paper, we study the fairness-aware classification problem by formulating it as a constrained optimization problem. Several limitations exist in previous works due to the lack of a theoretical framework to guide the formulation. We propose a general fairness-aware framework that addresses these limitations. Our framework provides: (1) various fairness metrics that can be incorporated into classic classification models as constraints; (2) a convex constrained optimization problem that can be solved efficiently; and (3) lower and upper bounds of real-world fairness measures, established using surrogate functions, which provide a fairness guarantee for constrained classifiers. Within the framework, we propose a constraint-free criterion under which any learned classifier is guaranteed to be fair in terms of the specified fairness metric. If the constraint-free criterion is not satisfied, we further develop a method based on the bounds for constructing fair classifiers. Experiments using real-world datasets demonstrate our theoretical results and show the effectiveness of the proposed framework.
Modeling people's activities in urban space is a crucial socio-economic task but extremely challenging due to the lack of suitable methods. To model the temporal dynamics of human activities concisely and specifically, we present the State-sharing Hidden Markov Model (SSHMM). First, it extracts urban states from the whole city, capturing both the volume of population flows and the frequency with which each type of Point of Interest (PoI) is visited. Second, it characterizes the urban dynamics of each urban region as state transitions over the shared states, which reveals distinct daily rhythms of urban activities. We evaluate our method on a large-scale real-life mobility dataset, and the results demonstrate that SSHMM learns semantics-rich urban dynamics that are highly correlated with the functions of each region. Besides, it recovers the urban dynamics in different time slots with an error of 0.0793, outperforming the general HMM by 54.2%.
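A minimal way to see the state-sharing idea, assuming the hmmlearn library and synthetic features: fitting a single Gaussian HMM on the concatenated sequences of all regions forces every region to be described by the same set of urban states, and each region is then summarized by its decoded state path; SSHMM itself is more elaborate than this sketch.

```python
import numpy as np
from hmmlearn import hmm

# Synthetic stand-in data: 10 regions, 48 half-hour slots, 6 flow/PoI features per slot.
rng = np.random.default_rng(0)
n_regions, T, n_feats, n_states = 10, 48, 6, 4
X = rng.random((n_regions * T, n_feats))          # per-region sequences stacked row-wise
lengths = [T] * n_regions

# One HMM fit on all regions at once => all regions share the same state set.
model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                        n_iter=25, random_state=0)
model.fit(X, lengths)

# Each region's "daily rhythm" is its decoded path over the shared states.
region_paths = model.predict(X, lengths).reshape(n_regions, T)
print(region_paths[0])
```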
Hierarchical text classification has many real-world applications. However, labeling a large number of documents is costly. In practice, we can use semi-supervised learning or weakly supervised learning (e.g., dataless classification) to reduce the labeling cost. In this paper, we propose a path cost-sensitive learning algorithm to utilize the structural information and further make use of unlabeled and weakly-labeled data. We use a generative model to leverage the large amount of unlabeled data and introduce path constraints into the learning algorithm to incorporate the structural information of the class hierarchy. The posterior probabilities of both unlabeled and weakly labeled data can be incorporated with path-dependent scores. Since we put a structure-sensitive cost to the learning algorithm to constrain the classification consistent with the class hierarchy and do not need to reconstruct the feature vectors for different structures, we can significantly reduce the computational cost compared to structural output learning. Experimental results on two hierarchical text classification benchmarks show that our approach is not only effective but also efficient to handle the semi-supervised and weakly supervised hierarchical text classification.
In this paper, we study the problem of recommending personalized items to users given their sequential behaviors. Most sequential recommendation models only capture a user's short-term preference within a short session and neglect their general (unchanging over time) and long-term preferences. Moreover, they are based on deterministic neural networks and treat users' latent preferences as point vectors in a low-dimensional continuous space. In the real world, however, the evolution of users' preferences is full of uncertainty. We address this problem by proposing a hierarchical neural variational model (HNVM). HNVM models three types of user preferences, general, long-term and short-term, through a unified hierarchical deep generative process. Its hierarchical recurrent structure enables it to capture both long-term and short-term preferences. Experiments on two public datasets demonstrate that HNVM outperforms state-of-the-art sequential recommendation methods.
Answer ranking is an important task in Community Question Answering (CQA), in which “Good” answers should be ranked ahead of “Bad” or “Potentially Useful” answers. The state of the art is an attention-based classification framework that learns the mapping between questions and answers. However, we observe that existing attention-based methods perform poorly on complicated question-answer pairs. One major reason is that existing methods cannot obtain accurate alignments between questions and answers for such pairs; we call this phenomenon “attention divergence”. In this paper, we propose a new attention mechanism, the Focusing Attention Network (FAN), which automatically draws back divergent attention by incorporating semantic and metadata features. Our model focuses on the most important parts of the sentence and therefore improves answer ranking performance. Experimental results on the CQA datasets of SemEval-2016 and SemEval-2017 show that our method attains MAP scores of 79.38 and 88.72, respectively, outperforming the top-1 systems in the shared tasks by 0.19 and 0.29.
In recent years, with the prevalence of social media and smart devices, people casually reveal their locations, such as shops, hotels, and restaurants, in their tweets. Recognizing and linking such fine-grained location mentions to well-defined location profiles is beneficial for retrieval and recommendation systems. In this paper, we propose DLocRL, a new deep learning pipeline for fine-grained location recognition and linking in tweets, and verify its effectiveness on a real-world Twitter dataset.
Sequential recommendation, which models users' sequential behavioral patterns, plays a critical role in recommender systems. However, state-of-the-art Recurrent Neural Network (RNN) solutions rarely consider the non-linear feature interactions and non-monotone short-term sequential patterns that are essential for user behavior modeling in sparse sequence data. In this paper, we propose a novel Recurrent Convolutional Neural Network model (RCNN). It not only utilizes the recurrent architecture of an RNN to capture complex long-term dependencies, but also leverages the convolutional operations of a Convolutional Neural Network (CNN) to extract short-term sequential patterns among recurrent hidden states. Specifically, we first generate a hidden state at each time step with the recurrent layer. The recent hidden states are then regarded as an “image”, and RCNN searches for non-linear feature interactions and non-monotone local patterns via intra-step horizontal and inter-step vertical convolutional filters, respectively. The outputs of the convolutional filters and the hidden state are concatenated and fed into a fully-connected layer to generate the recommendation. Finally, we evaluate the proposed model on four real-world datasets from various application scenarios. The experimental results show that RCNN significantly outperforms state-of-the-art approaches to sequential recommendation.
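To make the "hidden states as an image" idea concrete, here is a minimal PyTorch sketch under assumed layer sizes; it is an illustration, not the published architecture or hyperparameters. A GRU produces hidden states, the last k states are stacked into a one-channel "image", and horizontal (intra-step) and vertical (inter-step) convolutions extract the two kinds of short-term patterns before a fully-connected scoring layer.

```python
import torch
import torch.nn as nn

class RCNNSketch(nn.Module):
    """Rough sketch of the RCNN idea: recurrent hidden states are stacked into an
    'image' and scanned with horizontal (intra-step) and vertical (inter-step) filters."""

    def __init__(self, n_items, emb_dim=64, hid_dim=64, k=5, n_filters=16):
        super().__init__()
        self.emb = nn.Embedding(n_items, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True)
        # horizontal filter: spans one time step across all hidden units
        self.conv_h = nn.Conv2d(1, n_filters, kernel_size=(1, hid_dim))
        # vertical filter: spans the k recent steps for each hidden unit
        self.conv_v = nn.Conv2d(1, n_filters, kernel_size=(k, 1))
        self.k = k
        self.fc = nn.Linear(hid_dim + n_filters * k + n_filters * hid_dim, n_items)

    def forward(self, seq):                       # seq: (batch, seq_len) item ids
        h, _ = self.gru(self.emb(seq))            # (batch, seq_len, hid_dim)
        recent = h[:, -self.k:, :].unsqueeze(1)   # (batch, 1, k, hid_dim) "image"
        horiz = torch.relu(self.conv_h(recent)).flatten(1)   # intra-step interactions
        vert = torch.relu(self.conv_v(recent)).flatten(1)    # inter-step local patterns
        last = h[:, -1, :]                        # long-term recurrent summary
        return self.fc(torch.cat([last, horiz, vert], dim=1))  # item scores

scores = RCNNSketch(n_items=1000)(torch.randint(0, 1000, (2, 20)))
print(scores.shape)   # torch.Size([2, 1000])
```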
With the flourishing of location-based social networks, posting check-ins has become a common way to document one's daily life. Users usually do not consider check-in records a violation of their privacy. However, by analyzing two real-world check-in datasets, our study shows that check-in records are vulnerable to linkage attacks. To address this problem, we design a partition-and-group framework that integrates the information of check-ins and additional mobility data to attain a novel privacy criterion, kt,l-anonymity. It ensures that adversaries with arbitrary background knowledge cannot use check-ins to re-identify users in other anonymous datasets or learn unreported mobility records. The proposed framework outperforms the state-of-the-art baseline, improving check-in utility by 24%~57% while providing a stronger privacy guarantee. We believe this study opens a new angle on attaining check-in services that are both privacy-preserving and useful.
Classic supervised learning makes the closed-world assumption that the classes seen in testing must have appeared in training. However, this assumption is often violated in real-world applications. For example, in a social media site new topics emerge constantly, and in e-commerce new categories of products appear daily. A model that cannot detect new/unseen topics or products can hardly function well in such open environments. A desirable model working in such environments must be able to (1) reject examples from unseen classes (not appearing in training) and (2) incrementally learn the new/unseen classes to expand the existing model. This is called open-world learning (OWL). This paper proposes a new OWL method based on meta-learning. The key novelty is that the model maintains only a dynamic set of seen classes that allows new classes to be added or deleted with no need for model re-training. Each class is represented by a small set of training examples. In testing, the meta-classifier uses only the examples of the maintained seen classes (including the newly added classes) on-the-fly for classification and rejection. Experimental results on e-commerce product classification show that the proposed method is highly effective.
Thanks to advancing mobile location services, people nowadays can post about places to share their visiting experiences on the go. A large place graph not only helps users explore interesting destinations, but also provides opportunities for understanding and modeling the real world. To improve the coverage and flexibility of the place graph, many platforms import place data from multiple sources, which unfortunately leads to numerous duplicated places that severely hinder subsequent location-related services. In this work, we take the anonymous place graph from Facebook as an example to systematically study the problem of place deduplication: we carefully formulate the problem, study its connections to various related tasks that lead to several promising basic models, and arrive at a systematic two-step data-driven pipeline based on place embedding, with multiple novel techniques, that works significantly better than the state of the art.
The potentially detrimental effects of cyberbullying have led to the development of numerous automated, data-driven approaches, with an emphasis on classification accuracy. Although not well defined, cyberbullying, as a form of abusive online behavior, is a repetitive process: a sequence of aggressive messages sent from a bully to a victim over a period of time with the intent to harm the victim. Existing work has focused on harassment (i.e., using profanity to classify toxic comments independently) as an indicator of cyberbullying, disregarding the repetitive nature of this harassing process. However, raising a cyberbullying alert immediately after an aggressive comment is detected can lead to a high number of false positives. At the same time, two key practical challenges remain unaddressed: (i) detection timeliness, which is necessary to support victims as early as possible, and (ii) scalability to the staggering rates at which content is generated in online social networks.
In this work, we introduce CONcISE, a novel approach for timely and accurate Cyberbullying detectiON on Instagram media SEssions. We propose a sequential hypothesis testing formulation that seeks to drastically reduce the number of features used in classifying each comment while maintaining high classification accuracy. CONcISE raises an alert only after a certain number of detections have been made. Extensive experiments on a real-world Instagram dataset with ~ 4M users and ~ 10M comments demonstrate the effectiveness, scalability, and timeliness of our approach and its benefits over existing methods.
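The session-level logic can be illustrated with a small sketch. The snippet below is a simplification with made-up thresholds and binary features, not the CONcISE implementation: it uses Wald's sequential probability ratio test so that most comments are decided after inspecting only a few features, and raises a session alert only after several comments have been flagged.

```python
import math

def sprt_comment(feature_scores, p1=0.8, p0=0.2, alpha=0.05, beta=0.05):
    """Wald sequential test over one comment: features are inspected one at a time
    and we stop as soon as the evidence crosses either threshold, so most comments
    need only a few features."""
    upper, lower = math.log((1 - beta) / alpha), math.log(beta / (1 - alpha))
    llr = 0.0
    for s in feature_scores:            # s in {0, 1}: does the feature fire?
        llr += math.log(p1 / p0) if s else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return True                 # comment flagged as aggressive
        if llr <= lower:
            return False
    return llr >= 0                     # undecided after all features: use the sign

def session_alert(comments, min_detections=3):
    """Raise a cyberbullying alert only after several flagged comments, reflecting
    the repetitive nature of bullying and reducing false positives."""
    hits = 0
    for feats in comments:
        hits += sprt_comment(feats)
        if hits >= min_detections:
            return True
    return False

print(session_alert([[1, 1], [0, 0, 0], [1, 0, 1], [1, 1]]))   # True
```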
With the rapid growth of online shopping services, fashion recommendation plays an important role in daily online shopping scenarios. Many recommender systems have been developed using visual information. However, few works take compatibility relationships into account when generating recommendations. The challenge is that fashion concepts are often subtle and subjective, varying across customers. In this paper, we propose a fashion compatibility knowledge learning method that incorporates visual compatibility relationships as well as style information. We also propose a fashion recommendation method with a domain adaptation strategy to alleviate the distribution gap between the items in the target domain and the items of external compatible outfits. Our results indicate that the proposed method is capable of learning visual compatibility knowledge and outperforms all the baselines.
The recent increase in online user-generated content (UGC) has led to the availability of a large number of posts about products and services. Often, these posts contain complaints from the consumers who purchased the products and services. However, discovering and summarizing product defects and the related knowledge from large quantities of user posts is a difficult task. Traditional aspect opinion mining models, which aim to discover the product aspects and their corresponding opinions, are not sufficient to discover product defect information from user posts. In this paper, we propose the Product Defect Latent Dirichlet Allocation model (PDLDA), a probabilistic model that identifies domain-specific knowledge about product issues using three interdependent topic dimensions: Component, Symptom, and Resolution. A Gibbs sampling based inference method for PDLDA is also introduced. To evaluate our model, we introduce three novel product review datasets. Both qualitative and quantitative evaluations show that the proposed model yields clear improvements in the quality of the discovered product defect information. Our model has the potential to benefit customers, manufacturers, and policy makers by automatically discovering product defects from online data.
Due to the anonymity it affords, the darknet has seen dramatic growth in underground drug markets (e.g., Dream Market and Valhalla). To combat drug trafficking (a.k.a. illicit drug trading) in cyberspace, there is an urgent need for automatic analysis of participants in darknet markets. One of the key challenges, however, is that drug traffickers (i.e., vendors) may maintain multiple accounts across different markets or within the same market. To address this issue, we propose and develop an intelligent system named uStyle-uID, which, for the first time, leverages both writing and photography styles for drug trafficker identification. At the core of uStyle-uID is an attributed heterogeneous information network (AHIN), which elegantly integrates both writing and photography styles along with the text and photo contents, as well as other supporting attributes (i.e., trafficker and drug information) and various kinds of relations. To efficiently measure the relatedness between nodes (i.e., traffickers) in the constructed AHIN, we propose a new network embedding model, Vendor2Vec, which learns low-dimensional representations of the nodes and leverages complementary attribute information attached to the nodes to guide meta-path based random walks for path-instance sampling. We then devise a learning model named vIdentifier to classify whether a given pair of traffickers are the same individual. Comprehensive experiments on data collected from four different darknet markets validate the effectiveness of uStyle-uID in drug trafficker identification, in comparison with alternative approaches.
Abstractive meeting summarization is a challenging problem in natural language understanding: it aims to automatically generate a condensed summary covering the important points of a meeting conversation. However, existing abstractive summarization work mainly focuses on structured text documents and may be ineffective for meeting summarization because it does not model unstructured, long-form conversational content. In this paper, we approach abstractive meeting summarization from the viewpoint of hierarchical adaptive segmental encoder-decoder network learning. We propose a hierarchical neural encoder based on adaptive recurrent networks that learns the semantic representation of a meeting conversation with adaptive conversation segmentation. We then develop a reinforced decoder network to generate high-quality summaries. We conduct extensive experiments on the well-known AMI meeting conversation dataset to validate the effectiveness of the proposed method.
Point-of-interest (POI) recommendation is essential to a variety of services for both users and businesses. An extensive number of models have been developed to improve recommendation performance by exploiting various characteristics of and relations among POIs (e.g., spatio-temporal, social, etc.). However, very few studies closely examine the underlying mechanism that accounts for why users prefer certain POIs to others. In this work, we make the first attempt to learn the distribution of users' latent preferences by proposing an Adversarial POI Recommendation (APOIR) model, consisting of two major components: (1) the recommender (R), which suggests POIs based on the learned distribution by maximizing the probability that these POIs are predicted as unvisited but potentially of interest; and (2) the discriminator (D), which distinguishes the recommended POIs from true check-ins and provides gradients as guidance to improve R in a rewarding framework. The two components are co-trained in a minimax game, each improving itself while pushing the other to its limit. By further integrating geographical and social relations among POIs into the reward function, and by optimizing R in a reinforcement learning manner, APOIR achieves significant improvements on four standard metrics compared with state-of-the-art methods.
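A compressed sketch of this kind of adversarial setup is shown below (toy PyTorch code under assumed embedding sizes and random data, not the APOIR implementation; the geographical and social terms of the reward are omitted). The recommender samples POIs and is updated by REINFORCE with the discriminator's score as the reward, while the discriminator is trained to separate sampled POIs from observed check-ins.

```python
import torch
import torch.nn as nn

n_users, n_pois, dim = 50, 200, 16
R_user, R_poi = nn.Embedding(n_users, dim), nn.Embedding(n_pois, dim)   # recommender
D_user, D_poi = nn.Embedding(n_users, dim), nn.Embedding(n_pois, dim)   # discriminator
opt_R = torch.optim.Adam(list(R_user.parameters()) + list(R_poi.parameters()), lr=1e-2)
opt_D = torch.optim.Adam(list(D_user.parameters()) + list(D_poi.parameters()), lr=1e-2)

def d_score(u, p):                                   # discriminator logit for (user, POI)
    return (D_user(u) * D_poi(p)).sum(-1)

users = torch.randint(0, n_users, (32,))
true_pois = torch.randint(0, n_pois, (32,))          # observed check-ins (toy data)

for step in range(100):
    # recommender step: sample POIs, REINFORCE with D's score as the reward
    logits = R_user(users) @ R_poi.weight.t()        # (batch, n_pois)
    dist = torch.distributions.Categorical(logits=logits)
    sampled = dist.sample()
    reward = torch.sigmoid(d_score(users, sampled)).detach()
    loss_R = -(dist.log_prob(sampled) * reward).mean()
    opt_R.zero_grad(); loss_R.backward(); opt_R.step()

    # discriminator step: real check-ins vs. POIs sampled by the recommender
    loss_D = (nn.functional.softplus(-d_score(users, true_pois)).mean()
              + nn.functional.softplus(d_score(users, sampled.detach())).mean())
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
```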
Unveiling human mobility patterns is an important task for many downstream applications, such as point-of-interest (POI) recommendation and personalized trip planning. Compelling results have been obtained with various sequential modeling methods and representation techniques. However, discovering and exploiting the context of trajectories, in terms of abstract topics associated with the motion, can provide a more comprehensive understanding of the dynamics of patterns. We propose a new paradigm for moving-pattern mining based on learning trajectory context, and a method, Context-Aware Variational Trajectory Encoding and Human Mobility Inference (CATHI), for learning user trajectory representations via a framework consisting of: (1) a variational encoder and a recurrent encoder; (2) a variational attention layer; and (3) two decoders. We simultaneously tackle two subtasks: (T1) recovering user routes (trajectory reconstruction); and (T2) predicting the trip that the user will travel (trajectory prediction). We show that the encoded contextual trajectory vectors efficiently characterize the hierarchical mobility semantics, from which one can decode the implicit meanings of trajectories. We evaluate our method on several public datasets and demonstrate that CATHI efficiently improves the performance of both subtasks compared to state-of-the-art approaches.
We present a novel generative Session-Based Recommendation (SBR) framework, called VAriational SEssion-based Recommendation (VASER), a non-linear probabilistic methodology that allows Bayesian inference for flexible parameter estimation in sequential recommendation. Instead of directly applying extended Variational AutoEncoders (VAE) to SBR, the proposed method introduces normalizing flows to estimate the probabilistic posterior, which is more effective than the agnostic presumed prior approximation used in existing deep generative recommendation approaches. VASER employs a soft attention mechanism to upweight the important clicks in a session. We empirically demonstrate that the proposed model significantly outperforms several state-of-the-art baselines, including recently proposed RNN/VAE-based approaches, on real-world datasets.
In this paper, we propose a Joint Collaborative Autoencoder framework that learns both user-user and item-item correlations simultaneously, leading to a more robust model and improved top-K recommendation performance. More specifically, we show how to model these correlations and demonstrate the importance of careful normalization to alleviate the influence of feedback heterogeneity. Further, we adopt a pairwise hinge-based objective function to directly maximize top-K precision and recall for top-K recommenders. Finally, we propose a mini-batch optimization algorithm to train the model. Extensive experiments on three public datasets show the effectiveness of the proposed framework over state-of-the-art non-neural and neural alternatives.
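For concreteness, a minimal version of a pairwise hinge objective of this kind looks as follows (a generic sketch, not the paper's exact loss or normalization):

```python
import torch

def topk_hinge_loss(pos_scores, neg_scores, margin=1.0):
    """Every observed (positive) item should outscore sampled unobserved (negative)
    items by at least `margin`, pushing positives toward the top of the ranked list."""
    # pos_scores: (batch,)   neg_scores: (batch, n_neg)
    return torch.clamp(margin - (pos_scores.unsqueeze(1) - neg_scores), min=0).mean()

pos = torch.tensor([2.0, 0.5])
neg = torch.tensor([[0.1, 1.8], [0.9, -0.2]])
print(topk_hinge_loss(pos, neg))   # tensor(0.6250)
```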
This paper introduces RecBoard, a unified web-based platform that enables researchers and practitioners to train, test, deploy, and monitor recommendation systems. RecBoard streamlines the end-to-end process of building recommendation systems by providing a collaborative user interface that automates repetitive tasks related to dataset management, model training, visualization, deployment, and monitoring. Our prototype demonstrates how RecBoard supports common tasks in research and development. RecBoard will be open-sourced and publicly available upon publication.
This paper demonstrates a novel analytics service, CrowdPT, for capturing a key piece of information, the price target (PT), of individual investors on social media. The PT, which appears as a conclusion in most analysts' reports, indicates not only investors' market sentiment (bullish/bearish) but also their analysis results. In order to provide the latest opinions of individual investors, we monitor Twitter in real time and update the information in the price chart daily. For all component stocks of the Dow Jones Industrial Average, textual information from numerous tweets is summarized into a single number, the PT, in CrowdPT. Case studies confirm the effectiveness of our analytics service in the financial domain and show that capturing the PT of individual investors is promising for stock price prediction. A Web API for CrowdPT is also provided for academic purposes.
We introduce and demonstrate the usefulness of a tool that automatically annotates therapist utterances in real time according to the therapeutic role they perform in an evidence-based psychological dialogue. This is implemented within the context of an online service that supports the delivery of one-to-one therapy. When combined with patient outcome measures, this tool allows us to discover the active ingredients of psychotherapy. In particular, we show that certain measures of therapy content are more strongly correlated with patient improvement than others, suggesting that they are a critical part of psychotherapy. As this tool gives us interpretable measures of therapy content, it can enable services to quality-control the therapy delivered. Furthermore, we show how specific insights can be presented to therapists so they can reflect on and improve their practice.
We present QAnswer, a Question Answering system that simultaneously queries three core datasets of the Semantic Web that are relevant for end users: Wikidata with Lexemes, LinkedGeoData and MusicBrainz. These datasets can be queried in English, German, French, Italian, Spanish, Portuguese, Arabic and Chinese. Moreover, QAnswer includes a fallback option to the search engine Qwant when the answer to a question cannot be found in the datasets mentioned above. These features make QAnswer the first prototype of a Question Answering system over a considerable part of the LOD cloud.
We describe Voyageur, an application of experiential search to the domain of travel. Unlike traditional search engines for online services, experiential search focuses on the experiential aspects of the service under consideration. In particular, Voyageur needs to handle queries for subjective aspects of the service (e.g., quiet hotel, friendly staff) and combine these with objective attributes, such as price and location. Voyageur also highlights interesting facts and tips about the services the user is considering, providing further insight into their choices.
In order to accurately populate and curate Knowledge Graphs (KGs), it is important to distinguish ⟨s, p, o⟩ facts that can be traced back to sources from facts that cannot be verified. Manually validating each fact is time-consuming. Prior work on automating this task relied on numerical confidence scores which might not be easily interpreted. To overcome this limitation, we present Tracy, a novel tool that generates human-comprehensible explanations for candidate facts. Our tool relies on background knowledge in the form of rules to rewrite the fact in question into other easier-to-spot facts. These rewritings are then used to reason over the candidate fact, creating semantic traces that can aid KG curators. The goal of our demonstration is to illustrate the main features of our system and to show how the semantic traces can be computed over both text and knowledge graphs with a simple and intuitive user interface.
How do learners schedule their online learning? This issue concerns both course instructors and researchers, especially in the context of self-paced online learning environments. Many indicators and methods have been proposed to understand and improve the time management of learning activities; however, there are few tools for visualizing, comparing and exploring time management to gain an intuitive understanding. In this demo, we introduce LearnExp, an interactive visual analytics system designed to explore the temporal patterns of learning activities and explain the relationships between academic performance and these patterns. The system helps instructors comparatively explore the distribution of learner activities from multiple aspects, and visually explains the time management of different learner groups alongside predictions of learning performance.
Usually considered a classification problem, entity resolution (ER) can be very challenging on real data due to the prevalence of dirty values. State-of-the-art solutions for ER are built on a variety of learning models (most notably deep neural networks), which require large amounts of accurately labeled training data. Unfortunately, high-quality labeled data usually require expensive manual work and are therefore not readily available in many real scenarios. In this demo, we propose a novel learning paradigm for ER, called gradual machine learning, which aims to enable effective machine labeling without manual labeling effort. It begins with some easy instances in a task, which can be automatically labeled by the machine with high accuracy, and then gradually labels more challenging instances based on iterative factor graph inference. In gradual machine learning, the hard instances in a task are gradually labeled in small stages based on the estimated evidential certainty provided by the labeled easier instances. Our extensive experiments on real data show that the proposed approach performs considerably better than its unsupervised alternatives, and its performance is also highly competitive with state-of-the-art supervised techniques. Using ER as a test case, we demonstrate that gradual machine learning is a promising paradigm potentially applicable to other challenging classification tasks requiring extensive labeling effort. Video: https://youtu.be/99bA9aamsgk
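The easy-to-hard labeling loop can be sketched as follows. This simplified Python example substitutes a plain confidence-ranked classifier for the paper's factor graph inference, so it only illustrates the staging idea, not the actual system:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def gradual_labeling(X_easy, y_easy, X_hard, batch=10):
    """Start from machine-labeled easy instances and, in small stages, label only the
    currently most certain hard instances, folding them back in as evidence."""
    X_lab, y_lab = X_easy.copy(), y_easy.copy()
    remaining = list(range(len(X_hard)))
    labels = np.full(len(X_hard), -1)
    while remaining:
        clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
        proba = clf.predict_proba(X_hard[remaining])
        conf = np.abs(proba[:, 1] - 0.5)
        take = np.argsort(-conf)[:batch]              # most certain hard instances first
        idx = [remaining[i] for i in take]
        labels[idx] = (proba[take, 1] > 0.5).astype(int)
        X_lab = np.vstack([X_lab, X_hard[idx]])       # newly labeled instances become evidence
        y_lab = np.concatenate([y_lab, labels[idx]])
        remaining = [r for r in remaining if r not in set(idx)]
    return labels

rng = np.random.default_rng(0)
X_easy = rng.normal(size=(40, 2)) + np.array([[2, 2]] * 20 + [[-2, -2]] * 20)
y_easy = np.array([1] * 20 + [0] * 20)
X_hard = rng.normal(scale=1.5, size=(30, 2))
print(gradual_labeling(X_easy, y_easy, X_hard))
```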
In this paper, we introduce Dixit, an interactive visual storytelling system with which the user iteratively composes a short story for a photo sequence. The user initiates the process by uploading a sequence of photos. Dixit first extracts text terms from each photo that describe the objects (e.g., boy, bike) or actions (e.g., sleep) in the photo, and then allows the user to add new terms or remove existing ones. Dixit then generates a short story based on these terms. Behind the scenes, Dixit uses an LSTM-based model trained on image caption data and FrameNet to distill terms from each image, and utilizes a transformer decoder to compose a context-coherent story. Users iteratively change images or terms with Dixit to craft their ideal story. Dixit also allows users to manually edit and rate stories. The proposed procedure opens up possibilities for interpretable and controllable visual storytelling, allowing users to understand the story formation rationale and to intervene in the generation process.
We present Exam Keeper, a tool to measure the availability of answers to exam questions for ESL students. Exam Keeper targets two major sources of answers: the web and apps. ESL teachers can use it to estimate which questions are easily answered with information on the web or by automatic question answering systems, which should help teachers avoid such questions on their exams or homework and prevent students from misusing technology. The demo video is available at https://youtu.be/rgq0UXOkb8o
Web pages and other documents often contain tables to provide numerical details in a structured manner. Typically, the text explains and highlights important quantities, often using approximate numbers and aggregates such as totals, averages or ratios. For a human reader, it is crucial to navigate between text and tables to understand the key information in its context, drill down into tables when details are needed, and obtain explanations on specific figures from the accompanying text.
In this demo, we present ExQuisiTe: a system to align quantity mentions in the text with related quantity mentions in tables, and enable extractive summarization that considers table contents. ExQuisiTe links quantity mentions in the text with quantities in tables, to support user-friendly exploration. ExQuisiTe handles exact single-cell references as well as rounded or truncated numbers and aggregations such as row or column totals.
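A toy version of the matching step might look like the following (illustrative only; the tolerance, the set of aggregate types, and the table representation are assumptions, not ExQuisiTe's actual logic):

```python
def align_quantity(mention, table, rel_tol=0.05):
    """Link a numeric mention from the text to table cells, tolerating rounding or
    truncation, and also to row/column aggregates (totals and averages).
    `table` is a list of rows of floats."""
    def close(a, b):
        return (b != 0 and abs(a - b) / abs(b) <= rel_tol) or a == b

    matches = []
    for i, row in enumerate(table):
        for j, cell in enumerate(row):
            if close(mention, cell):
                matches.append(("cell", i, j))
        for name, agg in (("row total", sum(row)), ("row average", sum(row) / len(row))):
            if close(mention, agg):
                matches.append((name, i, None))
    for j, col in enumerate(zip(*table)):
        for name, agg in (("column total", sum(col)), ("column average", sum(col) / len(col))):
            if close(mention, agg):
                matches.append((name, None, j))
    return matches

table = [[120.0, 80.0], [60.0, 40.0]]
print(align_quantity(198.0, table))   # [('row total', 0, None)] -- 198 is within 5% of 200
```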
This demonstration paper presents an argument-inducing online forum that prompts participants who lack premises for their claims in online discussions. The forum provides its participants with the following two subsystems: (1) the argument estimator for online discussions, which automatically generates a visualization of the argument structures in posts based on argument mining; the forum indicates structures such as claim-premise relations in real time by exploiting a state-of-the-art deep learning model. (2) The argument-inducing agent for online discussion (AIAD), which automatically generates a reply post, based on the argument estimator, requesting further reasons to improve participants' argumentation.
Our experimental discussion demonstrates that the argument estimator can detect argument structures in online discussions, and that AIAD can elicit premises from participants. To the best of our knowledge, our argument-inducing online forum is the first approach to either visualize or request real-time arguments in online discussions. Our forum can be used to collect and induce claim-reason pairs, rather than only opinions, to understand various lines of reasoning in online arguments such as civic discussions, online debates, and educational settings. The argument estimator code is available at https://github.com/EdoFrank/EMNLP2018-ArgMining-Morio and the demonstration video is available at https://youtu.be/T9fNJfneQV8.
In the past few years, numerous privacy vulnerabilities have been discovered in the WiFi standards and their implementations for mobile devices. These vulnerabilities allow an attacker to collect large amounts of data on the device user, which could be used to infer sensitive information such as religion, gender, and sexual orientation. Solutions for these vulnerabilities are often hard to design and typically require many years to be widely adopted, leaving many devices at risk.
In this paper, we present UNVEIL - an interactive and extendable platform to demonstrate the consequences of these attacks. The platform performs passive and active attacks on smartphones to collect and analyze data leaked through WiFi and communicate the analysis results to users through simple and interactive visualizations.
The platform currently performs two attacks. First, it captures probe requests sent by nearby devices and combines them with public WiFi location databases to generate a map of locations previously visited by the device users. Second, it creates rogue access points with the SSIDs of popular public WiFi networks (e.g., _Heathrow WiFi, Railways WiFi) and records the resulting internet traffic. This data is then analyzed and presented in a format that highlights the privacy leakage. The platform has been designed to be easily extendable to include more attacks and to be easily deployable in public spaces. We hope that UNVEIL will help raise public awareness of the privacy risks of WiFi networks.
Containerized web applications have recently gained popularity due to the advantages containers provide, including being lightweight and self-contained, fast start-up and shut-down, and easy scalability. With more than 267 cloud providers, finding a flexible deployment option for containerized web applications is very difficult, as each cloud offers numerous deployment infrastructures. Benchmarking is one of the principal ways to evaluate provisioned resources before production-level deployment. However, benchmarking the massive infrastructure resources provisioned by various cloud providers is a time-consuming, tedious and costly process, and is not practical to accomplish manually.
In this demonstration, we present the Smart Docker Benchmarking Orchestrator (SmartDBO), a general orchestration framework that automatically benchmarks (deploys and executes) users' containerized web applications across different cloud providers while meeting the constraints of budget and deployment configuration. SmartDBO aims to answer two questions: (i) how can the benchmarking of containerized web applications be automated across multi-cloud environments? and (ii) how can the diversity of a benchmarking solution be maximized, covering as many cloud providers and types of provisioned infrastructure as possible without exceeding users' budgets? We evaluate all the features of SmartDBO using SimplCommerce and TPC-W executing across Amazon AWS and Microsoft Azure.
With the rapid development of location-based social networks (LBSNs), point-of-interest (POI) recommendation has become an important way to meet users' personalized demands. The aim of POI recommendation is to provide personalized recommendations of POIs for mobile users. Traditional POI recommendation systems, however, cannot fully satisfy these demands: they cannot recommend the next POI based on a user's context information, and they provide no real-time guarantee on performance. In this demo, we propose R2SIGTP, a novel real-time next-POI recommendation system that provides more personalized real-time recommendations than existing ones. Our system has the following advantages: (1) it achieves real-time performance; (2) it uses a unified approach to integrate geographic and preference information; and (3) it incorporates the feedback of each individual user to provide more personalized recommendations. We have implemented our system: R2SIGTP is easy to use and can be accessed from a mobile browser to recommend the next POI in real time, based on the automatically identified user location and the current time. Experimental results on real-world LBSN data show that R2SIGTP's performance is satisfactory.
Data analytics is central to modern online services, particularly those that are data-driven. Often this entails processing large-scale datasets which may contain private, personal and sensitive information relating to individuals and organisations. Particular challenges arise where the cloud is used to store and process the sensitive data. In such settings, security and privacy concerns become paramount, as the cloud provider is trusted to guarantee the security of the services it offers, including data confidentiality. The question this work tackles is therefore: how can data analytics be performed securely in a public cloud?
To address this question, we design and implement SGX-PySpark, a secure distributed data analytics system which relies on a trusted execution environment (TEE) such as Intel SGX to provide strong security guarantees. To build SGX-PySpark, we integrate PySpark, a widely used framework for data analytics in industry that supports a wide range of queries, with SCONE, a shielded execution framework based on Intel SGX.
TweetSenti is a system for analyzing the sentiment towards an entity in tweets. A sentence or tweet may contain multiple entities, and they do not always have the same sentiment polarity. It is therefore necessary to detect the sentiment for a specific target entity. This type of target-dependent (entity-level) sentiment analysis has attracted attention and been used in many applications, but it is still a challenging task. TweetSenti employs a new approach to detecting entity-level sentiment. Our model splits a sentence into a left context and a right context according to the target entity, and it exploits two different types of word embeddings to represent each word: a general word embedding and a sentiment-specific word embedding. A hybrid neural network captures both the sequence and structure information on the two sides of the target entity. The sequence information is learned by attention-based bi-directional LSTM models, and the structure information is captured by multi-context CNN models. Based on this algorithm, we built a web-based application with which users can interactively analyze an entity's sentiment on Twitter in real time.
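The dual-embedding, two-context encoding can be sketched as follows (a rough PyTorch illustration with assumed dimensions; the attention and multi-context CNN components are omitted, and the split of the tweet into left and right contexts around the target is assumed to have been done upstream):

```python
import torch
import torch.nn as nn

class TargetContextEncoder(nn.Module):
    """Sketch of the entity-level idea: represent each word by concatenating a general
    and a sentiment-specific embedding, and encode the left and right contexts of the
    target entity separately before classifying the polarity."""
    def __init__(self, vocab, gen_dim=50, sent_dim=50, hid=64):
        super().__init__()
        self.general = nn.Embedding(vocab, gen_dim)
        self.sentiment = nn.Embedding(vocab, sent_dim)
        self.lstm = nn.LSTM(gen_dim + sent_dim, hid, batch_first=True, bidirectional=True)
        self.out = nn.Linear(4 * hid, 3)          # left + right summaries -> 3 polarities

    def encode(self, ids):
        x = torch.cat([self.general(ids), self.sentiment(ids)], dim=-1)
        _, (h, _) = self.lstm(x)
        return torch.cat([h[0], h[1]], dim=-1)    # final states of both directions

    def forward(self, left_ids, right_ids):
        return self.out(torch.cat([self.encode(left_ids), self.encode(right_ids)], dim=-1))

model = TargetContextEncoder(vocab=5000)
logits = model(torch.randint(0, 5000, (2, 6)), torch.randint(0, 5000, (2, 8)))
print(logits.shape)   # torch.Size([2, 3])
```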
Squerall is a tool that allows the querying of heterogeneous, large-scale data sources by leveraging state-of-the-art Big Data processing engines: Spark and Presto. Queries are posed on-demand against a Data Lake, i.e., directly on the original data sources without requiring prior data transformation. We showcase Squerall's ability to query five different data sources, including inter alia the popular Cassandra and MongoDB. In particular, we demonstrate how it can jointly query heterogeneous data sources, and how interested developers can easily extend it to support additional data sources. Graphical user interfaces (GUIs) are offered to support users in (1) building intra-source queries, and (2) creating required input files.
Fact checking is an essential task in journalism; its importance has been highlighted by recently increased concerns about, and efforts in, combating misinformation. In this paper, we present an automated fact-checking platform which, given a claim, retrieves relevant textual evidence from a document collection, predicts whether each piece of evidence supports or refutes the claim, and returns a final verdict. We describe the architecture of the system and the user interface, focusing on the choices made to improve its user-friendliness and transparency. We conducted a user study of the fact-checking platform in a journalistic setting: we integrated it with a collection of news articles and evaluated the platform using feedback from journalists in their workflow. We found that the predictions of our platform were correct 58% of the time, and 59% of the returned evidence was relevant.
In this paper we present a web-based open-source tool and a method for generating insight from any text or discourse using text network analysis. The tool (InfraNodus) can be used by researchers and writers to organize and better understand their notes, to measure the level of bias in a discourse, and to identify the parts of the discourse with potential for insight and new ideas. The method is based on a text network analysis algorithm, which represents any text as a network and identifies the most influential words in a discourse based on the terms' co-occurrence. A graph community detection algorithm is then applied to identify the topical clusters, which represent the main topics in the text as well as the relations between them. The community structure is used in conjunction with other measures to estimate the level of bias or cognitive diversity of the discourse. Finally, the structural gaps in the graph indicate the parts of the discourse where connections are lacking, thereby highlighting areas with potential for new ideas. The tool can be used as standalone software by end users or integrated into other tools via an API. Another interesting application is in the field of recommendation systems: structural gaps could indicate potentially interesting non-trivial connections in any connected datasets.
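The core pipeline, a co-occurrence graph, word centrality, and community detection, can be reproduced in a few lines with networkx (a toy sketch with an arbitrary window size and stop-word filter; InfraNodus's actual algorithm and parameters may differ):

```python
import itertools
import networkx as nx
from networkx.algorithms import community

text = ("online discussion needs better tools tools that map discussion topics "
        "and reveal gaps between topics in the discussion")
words = [w for w in text.split() if len(w) > 3]        # toy stop-word filter

# Words become nodes; co-occurrence within a sliding window becomes weighted edges.
G = nx.Graph()
window = 3
for i in range(len(words) - window + 1):
    for a, b in itertools.combinations(set(words[i:i + window]), 2):
        w = G.get_edge_data(a, b, {}).get("weight", 0) + 1
        G.add_edge(a, b, weight=w)

influence = nx.betweenness_centrality(G)                       # most influential words
topics = community.greedy_modularity_communities(G)            # topical clusters

print(sorted(influence, key=influence.get, reverse=True)[:3])
print([sorted(c) for c in topics])
```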
People with visual impairments often rely on screen readers when interacting with computer systems. Increasingly, these individuals also make extensive use of voice-based virtual assistants (VAs). We conducted a survey of 53 people who are legally blind to identify the strengths and weaknesses of both technologies, as well as the unmet opportunities at their intersection. We learned that virtual assistants are convenient and accessible, but lack the ability to deeply engage with content (e.g., read beyond the first few sentences of Wikipedia), and the ability to get a quick overview of the landscape (list alternative search results & suggestions). In contrast, screen readers allow for deep engagement with content (when content is accessible), and provide fine-grained navigation & control, but at the cost of increased complexity, and reduced walk-up-and-use convenience. In this demonstration, we showcase VERSE, a system that combines the positive aspects of VAs and screen readers, and allows other devices (e.g., smart watches) to serve as optional input accelerators. Together, these features allow people with visual impairments to deeply engage with web content through voice interaction.
Given the increasing amount of heterogeneous data stored in relational databases, file systems or cloud environments, such data needs to be easily accessed and semantically connected for further data analytics. The potential of data federation is largely untapped; this paper presents an interactive data federation system (https://vimeo.com/319473546) that applies large-scale techniques, including heterogeneous data federation, natural language processing, association rules and Semantic Web technologies, to perform data retrieval and analytics on social network data. The system first creates a Virtual Database (VDB) to virtually integrate data from multiple data sources. Next, an RDF generator is built to unify the data, and SPARQL queries support semantic search over the text data processed by natural language processing (NLP). Association rule analysis is used to discover patterns and recognize the most important co-occurrences of variables across data sources. The demonstration shows how the system facilitates interactive data analytics in different application scenarios (e.g., sentiment analysis, privacy-concern analysis, community detection).
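The RDF-generation and SPARQL-query step can be illustrated with rdflib (a toy sketch; the namespace, the predicate name, and the rows are hypothetical stand-ins for output of the virtual database):

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")            # hypothetical namespace
g = Graph()

# Rows pulled from the virtual database are converted into RDF triples.
rows = [("alice", "posted", "I love this park"),
        ("bob", "posted", "Worried about data privacy")]
for user, _, text in rows:
    g.add((EX[user], EX.posted, Literal(text)))

# SPARQL then queries the unified data, e.g. for privacy-related posts.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?user ?text WHERE {
        ?user ex:posted ?text .
        FILTER(CONTAINS(LCASE(?text), "privacy"))
    }
""")
for user, text in results:
    print(user, text)
```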
In this demo paper, we present the XFake system, an explainable fake news detector that assists end users in assessing news credibility. To effectively detect and interpret the fakeness of news items, we jointly consider both attributes (e.g., speaker) and statements. Specifically, the MIMIC, ATTN and PERT frameworks are designed, where MIMIC is built for attribute analysis, ATTN for statement semantic analysis and PERT for statement linguistic analysis. Beyond the explanations extracted from the designed frameworks, relevant supporting examples as well as visualizations are further provided to facilitate interpretation. Our implemented system is demonstrated on a real-world dataset crawled from PolitiFact, where thousands of verified political news items have been collected.
We present GraviTIE (Global Representation and Visualization of Text and Image Embeddings, pronounced “gravity”), an interactive visualization system for large-scale image datasets. GraviTIE operates on datasets consisting of images equipped with unstructured and semi-structured text, relying on multi-modal unsupervised learning methods to produce an interactive similarity map. Users interact with the similarity map through pan and zoom operations, as well as keyword-oriented queries. GraviTIE makes no assumptions about the form, scale, or content of the data, allowing it to be used for exploratory analysis, assessment of unsupervised learning methods, data curation and quality control, data profiling, and other purposes where flexibility and scalability are paramount. We demonstrate GraviTIE on three real datasets: 500k images from the Russian misinformation dataset from Twitter, 2 million art images, and 5 million scientific figures. A screencast video is available at https://vimeo.com/310511187.
This demo presents TaxVis, a visual detection system for tax auditors. The system supports tax evasion group detection based on a two-phase detection approach. Unlike pattern-matching-based methods, this two-phase method can analyze suspicious groups automatically, without manual extraction of tax evasion patterns. In the first phase, we use the network embedding method node2vec to learn representations of corporations from a Corporation Associated Network (CANet), and use LightGBM to calculate a suspicion score for each corporation. In the second phase, the system uses three detection rules to analyze transaction anomalies around the suspicious corporations. Based on these anomalies, potential tax evasion groups can be discovered. We demonstrate TaxVis on tax data from Shaanxi province, China, to verify the usefulness of the system.
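The first phase can be approximated with off-the-shelf components, as in the sketch below. It is illustrative only: it uses the third-party node2vec package and a stand-in graph with random labels, not the CANet data or the system's actual features.

```python
import networkx as nx
import numpy as np
import lightgbm as lgb
from node2vec import Node2Vec   # third-party package assumed for the embedding step

G = nx.karate_club_graph()                       # stand-in for the CANet
emb = Node2Vec(G, dimensions=16, walk_length=10, num_walks=50,
               workers=1).fit(window=5, min_count=1)
# the node2vec package keys nodes by their string representation
X = np.array([emb.wv[str(n)] for n in G.nodes()])

y = np.random.default_rng(0).integers(0, 2, len(X))     # toy audit labels
clf = lgb.LGBMClassifier(n_estimators=50, min_child_samples=5).fit(X, y)
scores = clf.predict_proba(X)[:, 1]                     # per-corporation suspicion scores

suspicious = np.argsort(-scores)[:5]             # candidates to inspect in phase two
print(suspicious)
```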
Web research, data science, and artificial intelligence have been rapidly changing our life and society. Researchers and practitioners in these fields spend a large amount of time reading the literature and comparing existing approaches. It would significantly improve their efficiency if there were a system that extracted and managed experimental evidence (say, that a specific method achieves a certain score on a specific metric and dataset) from the tables of paper PDFs for search, exploration, and analytics. We build such a demonstration system, called Tablepedia, which uses rule-based and learning-based methods to automate the “reading” of PDF tables. It has three modules: template recognition, unification, and SQL operations. We implement three functions to facilitate research and practice: (1) finding related methods and datasets, (2) finding top-performing baseline methods, and (3) finding conflicting reported numbers. A screencast is available on Vimeo: https://vimeo.com/310162310
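Once evidence tuples have been extracted, the SQL-operations module reduces to ordinary queries. The following minimal illustration uses sqlite3 and made-up tuples; the schema is an assumption, not Tablepedia's actual one:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE evidence (paper TEXT, method TEXT, dataset TEXT,"
            " metric TEXT, score REAL)")
con.executemany("INSERT INTO evidence VALUES (?,?,?,?,?)", [
    ("paperA", "LSTM",  "SQuAD", "F1", 77.3),
    ("paperB", "LSTM",  "SQuAD", "F1", 75.9),   # conflicting report for the same cell
    ("paperB", "BiDAF", "SQuAD", "F1", 81.5),
])

# top-performing baselines on a given dataset and metric
print(con.execute("SELECT method, MAX(score) FROM evidence "
                  "WHERE dataset='SQuAD' AND metric='F1' "
                  "GROUP BY method ORDER BY MAX(score) DESC").fetchall())

# conflicting reported numbers: same method/dataset/metric, different scores
print(con.execute("SELECT method, dataset, metric FROM evidence "
                  "GROUP BY method, dataset, metric "
                  "HAVING COUNT(DISTINCT score) > 1").fetchall())
```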
Traffic signal control is an emerging application scenario for reinforcement learning. Besides being an important problem that affects people's daily commute, traffic signal control poses unique challenges for reinforcement learning in terms of adapting to a dynamic traffic environment and coordinating thousands of agents, including vehicles and pedestrians. A key factor in the success of modern reinforcement learning is a good simulator that can generate a large number of data samples for learning. The most commonly used open-source traffic simulator, SUMO, is however not scalable to large road networks and large traffic flows, which hinders the study of reinforcement learning on traffic scenarios. This motivated us to create a new traffic simulator, CityFlow, with fundamentally optimized data structures and efficient algorithms. CityFlow supports flexible definitions of road networks and traffic flows based on synthetic and real-world data. It also provides a user-friendly interface for reinforcement learning. Most importantly, CityFlow is more than twenty times faster than SUMO and is capable of supporting city-wide traffic simulation with an interactive renderer for monitoring. Beyond traffic signal control, CityFlow can serve as the base for other transportation studies and creates new possibilities for testing machine learning methods in the intelligent transportation domain.
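A minimal usage sketch of the RL-facing interface follows. The method names reflect CityFlow's publicly documented Python API, but the config path, intersection id, and the trivial fixed-cycle "policy" below are placeholders rather than working defaults:

```python
import cityflow   # the simulator described above

# "examples/config.json" is a hypothetical path to a CityFlow road-network/flow config.
eng = cityflow.Engine("examples/config.json", thread_num=1)

for step in range(3600):                              # one simulated hour at 1 s resolution
    waiting = eng.get_lane_waiting_vehicle_count()    # observation an RL agent could use
    if step % 30 == 0:
        # a trivial stand-in policy: cycle the phases of one intersection every 30 steps
        eng.set_tl_phase("intersection_1_1", (step // 30) % 4)
    eng.next_step()                                   # advance the simulation by one step

print(eng.get_average_travel_time())                  # simple performance summary
```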