| Day | Time slots |
|---|---|
| Wednesday 27th | 08:30–10:00, 10:15–11:00, 11:00–12:30, 14:00–15:15, 15:15–16:45, 17:00–18:00 |
| Thursday 28th | 08:00–09:30, 09:45–11:15, 11:30–13:00, 14:00–15:00, 15:00–16:30, 16:45–18:15, 18:15–19:00 |
| Friday 29th | 08:30–09:30, 09:45–11:15, 11:30–13:00, 14:00–15:30, 15:45–16:45, 16:45–17:30 |
Introduction by the Workshop Organizers
No abstract available
Deep Partial Multiplex Network Embedding
Network embedding is an effective technique for learning low-dimensional representations of nodes in networks. Real-world networks are usually multiplex, with multi-view representations arising from different relations. Recently, there has been increasing interest in network embedding on multiplex data. However, most existing multiplex approaches assume that the data is complete in all views, whereas in real applications each view often suffers from missing data, resulting in partial multiplex data.
Multi-view Omics Translation with Multiplex Graph Neural Networks
The rapid development of high-throughput experimental technologies for biological sampling has made the collection of omics data (e.g., genomics, epigenomics, transcriptomics and metabolomics) possible at a small cost. While multi-view approaches to omics data have a long history, omics-to-omics translation is a relatively new strand of research with useful applications such as recovering missing or censored data and finding new correlations between samples. As the relations between omics can be non-linear and exhibit long-range dependencies between parts of the genome, deep neural networks can be an effective tool. Graph neural networks have been applied successfully in many different areas of research, especially in problems where annotated data is sparse, and have recently been extended to the heterogeneous graph case, allowing for the modelling of multiple kinds of similarities and entities. Here, we propose a meso-scale approach to construct multiplex graphs from multi-omics data, which can produce several graphs per omics as well as cross-omics graphs. We also propose a neural network architecture for omics-to-omics translation from these multiplex graphs, featuring a graph neural network encoder coupled with an attention layer. We evaluate the approach on the open The Cancer Genome Atlas (TCGA) dataset (N=3023), showing that for MicroRNA expression prediction our approach has lower prediction error than regularized linear regression or modern generative adversarial networks.
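As an illustration of the multiplex construction described above, the sketch below builds one k-nearest-neighbour similarity graph per omics view over the same samples; the `build_multiplex` helper, the choice of k, and the random stand-in data are assumptions for illustration, not the authors' pipeline.

```python
# Hypothetical sketch: one kNN similarity graph per omics layer, so the layers
# together form a multiplex graph over the same set of samples.
import numpy as np
from sklearn.neighbors import kneighbors_graph

def build_multiplex(omics_views, k=10):
    """omics_views: dict name -> (n_samples, n_features) matrix, same sample order."""
    layers = {}
    for name, X in omics_views.items():
        A = kneighbors_graph(X, n_neighbors=k, mode="connectivity", include_self=False)
        # Symmetrize the directed kNN graph so each layer is undirected.
        layers[name] = ((A + A.T) > 0).astype(np.float32)
    return layers

# Example with random stand-ins for two omics views of 100 samples.
rng = np.random.default_rng(0)
views = {"rna": rng.normal(size=(100, 200)), "mirna": rng.normal(size=(100, 50))}
multiplex = build_multiplex(views, k=5)
print({name: A.nnz for name, A in multiplex.items()})
```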
A Triangle Framework Among Subgraph Isomorphism, Pharmacophore and Structure-Function Relationship
Coronavirus disease 2019 (COVID-19) continues to rage in many countries and has therefore received the utmost attention from academic research and industrial practice. Pharmacophore models exploit both molecular topological similarity and functional compound similarity, making them reliable through the concept of bioisosterism. In this work, we analyze the targets of coronavirus proteins and the structural variation of the RNA virus, thereby completing the safety and pharmacodynamic evaluation of small-molecule anti-coronavirus oral drugs. Because chemical structures can be represented as graphs, common pharmacophore identification can be cast as a subgraph querying problem, which remains a hard problem pressing for a solution. In this work, we adopt simplified pharmacophore graph representations, reducing complete molecular structures to abstractions in order to detect isomorphic topological patterns and further improve substructure retrieval efficiency. Our subgraph isomorphism-based method uses a three-stage framework for learning and refining structural correspondences over a large graph. First, we generate a set of candidate matches and compare the query graph with these candidates on their numbers of vertices and edges, which noticeably reduces the number of candidate graphs. Second, we employ the permutation theorem to evaluate the row sums of the vertex and edge adjacency matrices of the query graph and candidate graphs. Lastly, the proposed scheme applies the equinumerosity theorem to verify whether the query graph and a candidate graph are isomorphic. The proposed quantitative structure-function relationship (QSFR) approach can be effectively applied to the identification of abstract pharmacophoric patterns.
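The three-stage filtering can be pictured with a minimal sketch along the following lines; this is a generic reading of the stages, not the authors' implementation, and the final verification simply delegates to NetworkX's isomorphism matcher rather than the permutation and equinumerosity theorems.

```python
# Sketch of staged candidate pruning followed by exact verification.
import networkx as nx
from networkx.algorithms import isomorphism

def match_candidates(query, candidates):
    # Stage 1: prune by vertex and edge counts.
    stage1 = [g for g in candidates
              if g.number_of_nodes() == query.number_of_nodes()
              and g.number_of_edges() == query.number_of_edges()]
    # Stage 2: prune by sorted degree sequences (row sums of the adjacency matrix).
    q_deg = sorted(d for _, d in query.degree())
    stage2 = [g for g in stage1 if sorted(d for _, d in g.degree()) == q_deg]
    # Stage 3: exact isomorphism verification on the survivors.
    return [g for g in stage2 if isomorphism.GraphMatcher(g, query).is_isomorphic()]

query = nx.cycle_graph(5)
candidates = [nx.cycle_graph(5), nx.path_graph(5), nx.complete_graph(5)]
print(len(match_candidates(query, candidates)))  # 1
```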
CCGG: A Deep Autoregressive Model for Class-Conditional Graph Generation
Graph data structures are fundamental for studying connected entities. With an increase in the number of applications where data is represented as graphs, the problem of graph generation has recently become a hot topic. However, despite its significance, conditional graph generation, which creates graphs with desired features, is relatively less explored in previous studies. This paper addresses the problem of class-conditional graph generation, which uses class labels as generation constraints, by introducing the Class Conditioned Graph Generator (CCGG). We build CCGG by adding the class information as an additional input to a graph generator model and including a classification loss in its total loss, along with a gradient passing trick. Our experiments show that CCGG outperforms existing conditional graph generation methods on various datasets. It also manages to maintain the quality of the generated graphs in terms of distribution-based evaluation metrics.
Mining Multivariate Implicit Relationships in Academic Networks
Multivariate cooperative relations exist widely in academic society. In-depth research on multivariate relationships can effectively promote the integration of disciplines and advance scientific and technological progress. The mining and analysis of advisor-advisee relationships in cooperation networks, a hot research issue in sociology and other disciplines, still faces various challenges such as the lack of universal models and the difficulty in identifying
SchemaWalk: Schema Aware Random Walks for Heterogeneous Graph Embedding
Heterogeneous Information Network (HIN) embedding has been a
Keynote Talk by Luz Rello: The Story Behind Dytective: How We Brought Research Results on Dyslexia and Accessibility to Spanish Public Schools
Introduction by the Workshop Organizers
Keynote Talk by Manolis Koubarakis:
"Geospatial Interlinking with JedAI-spatial"
No abstract available
Exploring Cross-Lingual Transfer to Counteract Data Scarcity for Causality Detection
Finding causal relations in text represents a difficult task for many languages other than English due to relative scarcity or even a lack of proper training data. In our study, we try to overcome this problem with the help of cross-lingual transfer and three state-of-the-art multilingual language representations (LASER, mBERT, XLM-R). We find that, for zero-shot-transfer with English as a transfer language, LASER is able to compete with the Transformer-based models. It is, however, outperformed by mBERT and especially XLM-R when using the smaller German data set as source language data for Swedish. Similar annotation schemes were used for annotating the German data and the Swedish test set, which could have played a role, especially since our experiments with various English data sets also suggest that differences in annotation may greatly impact performance. We also demonstrate that additional training data in the target language may help to further improve performance but attention needs to be given to the class distribution since our results for few-shot transfer for Swedish indicate that even adding a small set of sentences where the causal class dominated may lead to overgeneralization tendencies.
Introduction by the Workshop Organizers
Improving Operation Efficiency through Predicting Credit Card Application Turnaround Time with Index-based Encoding
This paper presents the successful use of index encoding and machine learning to predict the turnaround time of a complex business process: the credit card application process. Predictions are made on in-progress processes and refreshed when new information is available. The business process is complex, with each individual instance having different steps, sequence, and length. For instances predicted to have a higher than normal turnaround time, model explainability is employed to identify the top reasons. This allows for intervention in the process to potentially reduce turnaround time before completion.
Graph Representation Learning of Banking Transaction Network with Edge Weight-Enhanced Attention and Textual Information
In this paper, we propose a novel approach to capture inter-company relationships from banking transaction data using graph neural networks with a special attention mechanism and textual industry or sector information. Transaction data owned by financial institutions can be an alternative source of information for comprehending real-time corporate activities. Such transaction data can be applied to predict stock prices and miscellaneous macroeconomic indicators, as well as to refine credit and customer relationship management. Although inter-company relationships are important, traditional methods for extracting information have not captured them sufficiently. With the recent advances in deep learning on graphs, we can expect better extraction of inter-company information from banking transaction data. In particular, we analyze common issues that arise when we represent banking transactions as a network and propose an efficient solution to such problems by introducing a novel edge weight-enhanced attention mechanism, using textual information, and designing an efficient combination of existing graph neural networks.
Understanding Financial Information Seeking Behavior from User Interactions with Company Filings
Publicly-traded companies are required to regularly file financial statements and disclosures. Analysts, investors, and regulators leverage these filings to support decision making, with high financial and legal stakes. Despite their ubiquity in finance, little is known about the information seeking behavior of users accessing such filings. In this work, we present the first study on this behavior by analyzing the access logs of the Electronic Data Gathering and Retrieval system (EDGAR) from the US-based Securities and Exchange Commission (SEC), the primary resource for accessing company filings. We analyze logs that span 14 years of users accessing the filings of over 600K distinct companies. We provide an analysis of the information-seeking behavior for this high-impact domain. We find that little behavioral history is available for the majority of users, while frequent users have rich histories. Most sessions focus on filings belonging to a small number of companies, and individual users are interested in a limited number of companies. Out of all sessions, 66% contain filings from one or two companies, and 50% of frequent users are interested in six companies or fewer.
TweetBoost: Influence of Social Media on NFT Valuation
An NFT, or Non-Fungible Token, is a token that certifies a digital asset to be unique. A wide range of assets, including digital art, music, tweets, and memes, are being sold as NFTs. NFT-related content has been widely shared on social media sites such as Twitter. We aim to understand the dominant factors that influence NFT asset valuation. Towards this objective, we create a first-of-its-kind dataset linking Twitter and OpenSea (the largest NFT marketplace) to capture social media profiles and linked NFT assets. Our dataset contains 245,159 tweets posted by 17,155 unique users, directly linking 62,997 NFT assets on OpenSea worth 19 million USD. We have made the dataset publicly available.
Introduction by the Workshop Organizers
Multi-Context Based Neural Approach for COVID-19 Fake-News Detection
While the world is facing the disastrous coronavirus (COVID-19) pandemic, society is also fighting another battle: tackling misinformation. Fake-news detection is a challenging and active research problem, and the lack of suitable datasets and external world knowledge adds to the difficulty. Recently, owing to COVID-19 and the increased usage of social media (Twitter, Facebook), fake news and rumors about COVID-19 have been spreading rapidly, making this a much more relevant and important problem to mitigate. In this paper, we propose a novel multi-context based neural architecture for COVID-19 fake-news detection. In the proposed model, we leverage the rich information of three different pre-trained transformer-based models, i.e., BERT, BERTweet and COVID-Twitter-BERT, pertaining to three different aspects of information (viz. general English language semantics, tweet semantics, and information related to tweets on COVID-19), which together give us a single multi-context representation. Our experiments provide evidence that the proposed model outperforms the existing baseline and the candidate models (i.e., the three transformer architectures) by a large margin and becomes the state-of-the-art model for COVID-19 fake-news detection. We achieve new state-of-the-art performance on a benchmark COVID-19 fake-news dataset with 98.78% accuracy on the validation set and 98.69% accuracy on the test set.
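A minimal sketch of the multi-context idea is shown below: pool the [CLS] vectors of three pre-trained transformers and concatenate them into a single representation for a downstream classifier. The Hugging Face model identifiers and the untrained linear head are assumptions chosen for illustration, not the authors' released model.

```python
# Concatenating context vectors from three pre-trained encoders (a sketch).
import torch
from transformers import AutoTokenizer, AutoModel

names = ["bert-base-uncased", "vinai/bertweet-base",
         "digitalepidemiologylab/covid-twitter-bert-v2"]  # assumed model ids
tokenizers = [AutoTokenizer.from_pretrained(n) for n in names]
encoders = [AutoModel.from_pretrained(n) for n in names]

def multi_context_embedding(text):
    parts = []
    for tok, enc in zip(tokenizers, encoders):
        batch = tok(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            out = enc(**batch).last_hidden_state[:, 0]  # [CLS] token state
        parts.append(out)
    return torch.cat(parts, dim=-1)  # single multi-context representation

emb = multi_context_embedding("Viral post claims a home remedy cures COVID-19.")
classifier = torch.nn.Linear(emb.shape[-1], 2)  # fake vs. real (untrained here)
print(classifier(emb).shape)  # torch.Size([1, 2])
```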
KAHAN: Knowledge-Aware Hierarchical Attention Network for Fake News detection on Social Media
In recent years, fake news detection has attracted a great deal of attention due to the myriad of misinformation. Some previous methods have focused on modeling the news content, while others have combined user comments and user information on social media. However, existing methods ignore some important clues for detecting fake news, such as temporal information on social media and external knowledge related to the news. To this end, we propose a Knowledge-Aware Hierarchical Attention Network (KAHAN) that integrates this information into the model to establish fact-based associations with entities in the news content. Specifically, we introduce two hierarchical attention networks to model news content and user comments respectively, in which news content and user comments are represented at different levels of semantic granularity. Besides, to handle the random occurrence of user comments at the post level, we further design a time-based subevent division algorithm that aggregates user comments at the subevent level to learn temporal patterns. Moreover, News towards Entities (N-E) attention and Comments towards Entities (C-E) attention are introduced to measure the importance of external knowledge. Finally, we detect the veracity of the news by combining three aspects of the news: content, user comments, and external knowledge. We conducted extensive experiments and ablation studies on two real-world datasets, showing that our proposed method outperforms previous methods, and empirically validated each component of KAHAN.
Making Adversarially-Trained Language Models Forget with Model Retraining: A Case Study on Hate Speech Detection
Adversarial training has become almost the de facto standard for robustifying Natural Language Processing models against adversarial attacks. Although adversarial training has proven to achieve accuracy gains and boost the performance of algorithms, research has not shown how adversarial training will stand "the test of time" when models are deployed and updated with new non-adversarial data samples. In this study, we aim to quantify the temporal impact of adversarial training on naturally-evolving language models using the hate speech task. We conduct extensive experiments on the Tweet Eval benchmark dataset using multiple hate speech classification models. In particular, our findings indicate that adversarial training is highly task-dependent as well as dataset-dependent: models trained on the same dataset achieve high prediction accuracy but fare poorly when tested on a new dataset, even after retraining the models with adversarial examples. We attribute this temporal and limited effect of adversarial training to distribution shift of the training data, which implies that model quality will degrade over time as models are deployed in the real world and start serving new data.
Measuring the Privacy Dimension of Free Content Websites through Automated Privacy Policy Analysis and Annotation
Websites that provide books, music, movies, and other media free of charge are a central piece of the Internet, although they are poorly understood, especially for their security and privacy risks. In this paper, we contribute to the understanding of those websites by focusing on their privacy policy practices and reporting. Privacy policies are the primary channel where service providers inform users about their data collection and use practices. To better understand the data usage risks associated with using such websites, it is essential to study the reporting of the privacy policy practices. For instance, privacy policies may lack information on critical practices used by the service providers, such as data collection, use disclosure, tracking, and access, leading unaware users to potential data leakage. Studying 1,562 websites, we uncover that premium websites are more transparent in reporting their privacy practices, particularly in categories such as "Data Retention" and "Do Not Track", with premium websites being 85.00% and ~70% more likely to report their practices in comparison to the free content websites. We found the free content websites' privacy policies to be more similar and generic in comparison to premium websites' privacy policies, with ~11% higher similarity scores.
Hoaxes and Hidden agendas: A Twitter Conspiracy Theory Dataset (Data Paper)
Hoaxes and hidden agendas make for compelling conspiracy theories. While many of these theories are ultimately innocuous, others have the potential to do real harm, instigating real-world support or disapproval of the theories. This is further fueled by social media which provides a platform for conspiracy theories to spread at unprecedented rates. Thus, there is a need for the development of automated models to detect conspiracy theories from the social media space in order to quickly and effectively identify the topics of the season and the prevailing stance.
Influence of Language Proficiency on the Readability of Review Text and Transformer-based Models for Determining Language Proficiency
In this study, we analyze the influence of English language proficiency of non-native speakers on the readability of the text written by them. In addition, we present multiple approaches for automatically determining the language proficiency levels of non-native English speakers from the text data. To accomplish the above-mentioned tasks, we first introduce an annotated social media corpus of around 1000 reviews written by non-native English speakers of the following five English language proficiency (ELP) groups: very high proficiency (VHP), high proficiency (HP), moderate proficiency (MP), low proficiency (LP), and very low proficiency (VLP). We employ the Flesch Reading Ease (FRE) and Flesch-Kincaid Grade (FKG) tests to compute the readability scores of the reviews written by various ELP groups. We leverage both the classical machine learning (ML) classifiers and transformer-based approaches for deciding the language proficiency groups of the reviewers. We observe that distinct ELP groups do not exhibit any noticeable differences in the mean FRE scores, although slight differences are observed in the FKG test. In the language proficiency determination task, we notice that fine-tuned transformer-based approaches yield slightly better efficacy than the traditional ML classifiers.
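For reference, the two readability measures used in the study are computed from word, sentence, and syllable counts with the standard published formulas; syllable counting itself is not shown, and the example counts are illustrative.

```python
# Standard Flesch Reading Ease and Flesch-Kincaid Grade formulas.
def flesch_reading_ease(words, sentences, syllables):
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Example: a 100-word review written as 8 sentences with 140 syllables.
print(round(flesch_reading_ease(100, 8, 140), 1))   # 75.7
print(round(flesch_kincaid_grade(100, 8, 140), 1))  # 5.8
```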
Introduction by the Workshop Organizers
No abstract available
Introduction by the Workshop Organizers
Keynote Talk by Elias Carayannis:
"Smart Cities Futures: In the context of Industry 5.0 and Society 5.0 - Challenges and Opportunities for Policy and Practice"
Human-centric design in smart city technologies
Governance can be understood as the system by which actors in
Citizens as Developers and Consumers of Smart City Services: A Drone Tour Guide Case
The trend of urbanization started over two centuries ago and is no longer limited to high-income countries. As a result, city population growth has led to the emergence of applications that manage complex processes within cities by utilizing recent technological advances, thereby transforming them into smart cities. Besides automating complex processes within a city, technology also enables a simpler integration of citizens into identifying problems and creating corresponding solutions. This paper discusses an approach that enables citizens to design and later execute their own services within a smart city environment by employing conceptual modeling and microservices. The overall aim is to establish the role of a citizen developer. The proposed approach is then discussed within our proof-of-concept environment based on a drone tour guide case.
A Human-Centered Design Approach for the Development of a Digital Care Platform in a Smart City Environment
Digital solutions are increasingly being sought in the care sector to make services more efficient and to prepare for demographic change and the future shortage of care services and staff. One possibility here is the implementation of a digital care platform that is target group-oriented and built according to the needs of the individual stakeholders. To build this platform successfully, it is also necessary to take a closer look at the business model. This paper examines these points by applying a human-centered design approach that considers all perspectives and allows a deep understanding of the opportunities and challenges of a digital care platform in a smart city environment.
Chiara Magosso, Dragan Ahmetovic, Tiziana Armano, Cristian Bernareggi, Sandro Coriasco, Adriano Sofia, Luisa Testa, Anna Capietto: Math-to-Speech Effectiveness and Appreciation for People with Developmental Learning Disorders
Silvia Rodríguez Vázquez: The Use of ADKAR to Instil Change in the Accessibility of University Websites
Andy Coverdale, Sarah Lewthwaite, Sarah Horton: Teaching accessibility as a shared endeavour: building capacity across academic and workplace contexts (Best Communication Paper Candidate)
Brett L. Fiedler, Taliesin L. Smith, Jesse Greenberg, Emily B. Moore: For one or for all?: Survey of educator perceptions of Web Speech-based auditory description in science interactives
Sara Abdollahi: "Language-specific Event Recommendation"
Swati: "Evaluating and Improving Inferential Knowledge Based Systems for Bias Prediction in Multilingual News Headlines"
Elisavet Koutsiana: "Talking Wikidata: Re-Discovering Knowledge Engineering in Wikidata Discussion Pages"
Gabriel Maia: "Assessing the quality of sources in Wikidata across languages"
Gaurish Thakkar: "Learner sourcing for sentiment dataset creation"
Diego Alves: "Typological approach for improving Dependency Parsing"
Golsa Tahmasebzadeh: "Contextualization of images in news sources"
Tin Kuculo: "Contextualising Event Knowledge through QuoteKG: A Multilingual Knowledge Graph of Quotes"
Caio Mello: "The media coverage of London 2012 and Rio 2016 Olympic legacies: Exploring news articles with digital methods"
Abdul Sittar: "News spreading Barriers"
Gullal Singh Cheema: "Multimodal Claims on Social Media"
Sahar Tahmasebi: "Detecting Misinformation in Multimodal Claims"
Endri Kacupaj: "Conversational Question Answering over Knowledge Graphs"
Daniela Major: "The media coverage of the European Union: challenges and solutions"
A Generative Approach for Financial Causality Extraction
Causality represents the foremost relation between events in financial documents such as financial news articles and financial reports. Each financial causality contains a cause span and an effect span. Previous works proposed sequence labeling approaches to solve this task, but sequence labeling models find it difficult to extract multiple causalities and overlapping causalities from text segments. In this paper, we explore a generative approach to causality extraction using the encoder-decoder framework and pointer networks. We use a causality dataset from the financial domain, FinCausal, for our experiments, and our proposed framework achieves very competitive performance on this dataset.
FiNCAT: Financial Numeral Claim Analysis Tool
While making investment decisions by reading financial documents, investors need to differentiate between in-claim and out-of-claim numerals. In this paper, we present a tool which can do this task automatically. It extracts context embeddings of the numerals using a transformer-based pre-trained language model, BERT. Subsequently, it uses a Logistic Regression based model to detect whether a numeral is in-claim or out-of-claim. We use the FinNum-3 (English) dataset to train our model. We conducted rigorous experiments and our best model achieved a Macro F1 score of 0.8223 on the validation set. We have open-sourced this tool; it can be accessed at https://github.com/sohomghosh/FiNCAT_Financial_Numeral_Claim_Analysis_Tool
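A rough sketch of the described pipeline, a BERT context embedding of the numeral followed by logistic regression, is given below; the word-piece pooling, the toy examples, and the `numeral_embedding` helper are assumptions for illustration rather than the released tool.

```python
# Numeral context embedding + logistic regression (illustrative sketch).
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def numeral_embedding(sentence, numeral):
    """Mean of the hidden states of the word-pieces that make up `numeral`."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state[0]
    ids = tok(numeral, add_special_tokens=False)["input_ids"]
    positions = [i for i, t in enumerate(enc["input_ids"][0].tolist()) if t in ids]
    return hidden[positions].mean(dim=0).numpy()

# Toy training set: (sentence, numeral, label) with 1 = in-claim, 0 = out-of-claim.
samples = [
    ("We expect revenue to grow 15 percent next quarter.", "15", 1),
    ("The call was held on March 3 with 12 analysts.", "12", 0),
]
X = [numeral_embedding(s, n) for s, n, _ in samples]
y = [lbl for _, _, lbl in samples]
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X))
```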
Rayleigh Portfolios and Penalised Matrix Decomposition
Since the development and growth of personalised financial services online, effective tailor-made and fast statistical portfolio allocation techniques have been sought after. In this paper, we introduce a framework called Rayleigh portfolios, which encompasses many well-known approaches, such as the Sharpe Ratio, maximum diversification or minimum concentration. By showing the commonalities amongst these approaches, we are able to provide a solution to all such optimisation problems via matrix decomposition, and principal component analysis in particular. In addition, thanks to this reformulation, we show how to include sparsity in such portfolios, thereby catering for two additional requirements in portfolio construction: robustness and low transaction costs. Importantly, modifications to the usual penalised matrix decomposition algorithms can be applied to other problems in statistics. Finally, empirical applications show promising results.
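To illustrate the kind of ratio objective such a framework unifies: a quotient of quadratic forms w'Aw / w'Bw (a generalised Rayleigh quotient) is maximised by the leading generalised eigenvector of (A, B); for instance, the squared Sharpe ratio corresponds to A = μμ' and B = Σ. The sketch below solves this with a standard eigensolver on toy data and is not the paper's penalised (sparse) algorithm.

```python
# Maximising a Rayleigh-quotient portfolio objective via a generalised eigenproblem.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
R = rng.normal(0.0005, 0.01, size=(500, 4))       # toy daily returns, 4 assets
mu, Sigma = R.mean(axis=0), np.cov(R, rowvar=False)

A, B = np.outer(mu, mu), Sigma                    # squared-Sharpe choice of (A, B)
vals, vecs = eigh(A, B)                           # generalised eigenvalues, ascending
w = vecs[:, -1]                                   # eigenvector of the largest eigenvalue
if w @ mu < 0:                                    # fix the sign convention
    w = -w
w = w / np.abs(w).sum()                           # normalise to unit gross exposure
sharpe = (w @ mu) / np.sqrt(w @ Sigma @ w)
print(np.round(w, 3), round(float(sharpe), 3))
```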
SEBI Regulation Biography
The Securities and Exchange Board of India is the regulatory body for the securities and commodity markets in India. A growing number of SEBI documents, ranging from government regulations to legal case files, are now available in digital form. Advances in natural language processing and machine learning provide opportunities for extracting semantic insights from these documents.
Detecting Regulation Violations for an Indian Regulatory body through multi label classification
The Securities and Exchange Board of India (SEBI) is the regulatory body for securities and commodities in India. SEBI creates and enforces regulations that must be followed by all listed companies. To the best of our knowledge, this is the first work on identifying the regulation(s) that a SEBI-related case violates, which could be of substantial value to companies, lawyers, and other stakeholders in the regulatory process. We create a dataset for this task by automatically extracting violations from publicly available case files. Using this data, we explore various multi-label text classification methods to determine the potentially multiple regulations violated by (the facts of) a case. Our experiments demonstrate the importance of employing contextual text representations to understand complex financial and legal concepts. We also highlight the challenges that must be addressed to develop a fully functional system in the real world.
Numeral Tense Detection Based on Chinese Financial News
Time information is an important dimension of the information space and is expressed in language through tense. As an isolating language, Chinese does not express the tense of a sentence at the grammatical level; instead, adverbs or auxiliary words called tense operators assist tense understanding. We find that there is no research on numeral tense in financial texts. However, the tense of numerals is crucial in the finance domain, which pays close attention to time series. Therefore, in this paper, we propose a novel task of numeral tense detection in the finance domain. We annotate a numeral tense dataset based on Chinese financial news texts, called CFinNumTense, which divides numeral tenses into the categories "past tense", "future tense", "static state", and "time". We conduct the Chinese financial numeral tense detection task on CFinNumTense, employing the RoBERTa pre-trained model as the embedding layer and four baseline models (FNN, TextCNN, RNN, and BiLSTM) to detect numeral tenses. In ablation experiments, we design a numeral encoding (NE) to enrich the information about the target numeral in the text, and we design an auxiliary learning model based on the BiLSTM model. Experiments show that jointly learning target numeral tense detection and tense operator extraction strengthens the model's understanding of target numeral tenses in the text.
FinRED: A Dataset for Relation Extraction in Financial Domain
Relation extraction models trained on a source domain cannot be applied to a different target domain due to the mismatch between relation sets. In the current literature, there is no extensive open-source relation extraction dataset specific to the finance domain. In this paper, we release FinRED, a relation extraction dataset curated from financial news and earnings call transcripts containing relations from the finance domain. FinRED has been created by mapping Wikidata triplets using the distant supervision method. We manually annotate the test data to ensure proper evaluation. We also experiment with various state-of-the-art relation extraction models on this dataset to create a benchmark. We observe a significant drop in their performance on FinRED compared to general relation extraction datasets, which indicates the need for better models for financial relation extraction.
Best Paper Award and Wrap-up
A Bi-level assessment of Twitter data for election prediction: Delhi Assembly Elections 2020
Elections are the backbone of any democratic country, where voters elect the candidates as their representatives. The emergence of social networking sites has provided a platform for political parties and their candidates to connect with voters in order to spread their political ideas. Our study aims to use Twitter in assessing the outcome of the Delhi Assembly elections held in 2020, using a bi-level approach, i.e., concerning political parties and their candidates. We analyze the correlation of election results with the activities of different candidates and parties on Twitter, and the response of voters to them, especially the mentions and sentiment of voters towards a party over time. The Twitter profiles of the candidates are compared both at the party level and at the candidate level to evaluate their association with the outcome of the election. We observe that the number of followers and the replies to candidates' tweets are good indicators for predicting actual election outcomes. However, we observe that the number of tweets mentioning a party and the temporal analysis of voters' sentiment towards the party shown in tweets are not aligned with the election result. Moreover, the variations in the activeness of candidates and political parties on Twitter over time also could not help much in identifying the winner. Thus, merely using temporal data from Twitter is not sufficient to make accurate predictions, especially for countries like India.
Detection of Infectious Disease Outbreaks in Search Engine Time Series Using Non-Specific Syndromic Surveillance with Effect-Size Filtering
Novel infectious disease outbreaks, including most recently that of the COVID-19 pandemic, could be detected by non-specific syndromic surveillance systems. Such systems, utilizing a variety of data sources ranging from Electronic Health Records to internet data such as aggregated search engine queries, create alerts when unusually high rates of symptom reports occur. This is especially important for the detection of novel diseases, where their manifested symptoms are unknown.
Why Round Years are Special? Analyzing Time References in News Article Collections
Time expressions embedded in text are important for many downstream tasks in NLP and IR. They have, for example, been utilized for timeline summarization, named entity recognition, temporal information retrieval, and others. In this paper, we introduce a novel analytical approach to characterizing time expressions in diachronic text collections. Based on a collection of news articles published over a 34-year time span, we investigate several aspects of time expressions, with a focus on their interplay with the publication dates of their documents. We utilize a graph-based representation of temporal expressions to represent them through their co-occurring named entities. The proposed approach results in several observations that could be utilized in automatic systems that rely on processing temporal signals embedded in text. It could also be of importance for professionals (e.g., historians) who wish to understand fluctuations in collective memories and collective expectations based on large-scale, diachronic document collections.
No abstract available
Graph Augmentation Learning
Graph Augmentation Learning (GAL) provides outstanding solutions for graph learning in handling incomplete data, noisy data, etc. Numerous GAL methods have been proposed for graph-based applications such as social network analysis and traffic flow forecasting. However, the underlying reasons for the effectiveness of these GAL methods are still unclear. As a consequence, how to choose an optimal graph augmentation strategy for a certain application scenario remains a black box. There is a lack of systematic, comprehensive, and experimentally validated guidelines on GAL for scholars. Therefore, in this survey, we review GAL techniques in depth from the macro (graph), meso (subgraph), and micro (node/edge) levels. We further illustrate in detail how GAL enhances data quality and model performance. The aggregation mechanisms of augmentation strategies and graph learning models are also discussed for different application scenarios, i.e., data-specific, model-specific, and hybrid scenarios. To better show the advantages of GAL, we experimentally validate the effectiveness and adaptability of different GAL strategies in different downstream tasks. Finally, we share our insights on several open issues of GAL, including heterogeneity, spatio-temporal dynamics, scalability, and generalization.
Multi-Graph based Multi-Scenario Recommendation in Large-scale Online Video Services
Recently, industrial recommendation services have been boosted by the continual upgrade of deep learning methods. However, they still face de-biasing challenges such as exposure bias and the cold-start problem, where repeated training of machine learning models on human interaction history leads algorithms to suggest already-exposed items while ignoring less-active ones. Additional problems exist on multi-scenario platforms, e.g., appropriate data fusion from subsidiary scenarios, which we observe can be alleviated through the integration of graph-structured data via message passing.
Mining Homophilic Groups of Users using Edge Attributed Node Embedding from Enterprise Social Networks
We develop a method to identify groups of similarly behaving users with similar work contexts from their activity on enterprise social media. This would allow organizations to discover redundancies and increase efficiency. To better capture the network structure and communication characteristics, we model user communications with directed attributed edges in a graph. Communication parameters, including engagement frequency, emotion words, and post lengths, act as edge weights of the multiedge. On the resultant adjacency tensor, we develop a node embedding algorithm using higher-order singular value decomposition and a convolutional autoencoder. We develop a peer group identification algorithm using the cluster labels obtained from the node embedding and show its results on Enron emails and the StackExchange Workplace community. We observe that people with the same roles in enterprise social media are clustered together by our method. We provide a comparison with existing node embedding algorithms as a reference, indicating that attributed social networks and our formulations are an efficient and scalable way to identify peer groups in an enterprise.
RePS: Relation, Position and Structure aware Entity Alignment
Entity Alignment (EA) is the task of recognizing the same entity present in different knowledge bases. Recently, embedding-based EA techniques have established dominance, where alignment is done based on closeness in a latent space. Graph Neural Networks (GNN) gained popularity as the embedding module due to their ability to learn entities' representations based on their local sub-graph structures. Although GNNs show promising results, limited works have aimed to capture relations while considering their global importance and entities' relative positions during EA. This paper presents Relation, Position and Structure aware Entity Alignment (RePS), a multi-faceted representation learning-based EA method that encodes local, global, and relation information for aligning entities. To capture relations and neighborhood structure, we propose a relation-based aggregation technique, Graph Relation Network (GRN), that incorporates relation importance during aggregation. To capture the position of an entity, we propose a Relation aware Position Aggregator (RPA) that exploits entities' positions in a non-Euclidean space using training labels as anchors, which provides a global view of entities. Finally, we introduce a Knowledge Aware Negative Sampling (KANS) that generates harder-to-distinguish negative samples for the model to learn optimal representations. We perform exhaustive experimentation on four cross-lingual datasets and report an ablation study to demonstrate the effectiveness of GRN, KANS, and position encodings.
Scaling R-GCN Training with Graph Summarization
The training of Relational Graph Convolutional Networks (R-GCN) does not scale well with the size of the graph.
JGCL: Joint Self-Supervised and Supervised Graph Contrastive Learning
Semi-supervised and self-supervised learning on graphs are two popular avenues for graph representation learning. We demonstrate that no single method from semi-supervised and self-supervised learning works uniformly well for all settings. Self-supervised methods generally work well with very limited training data, but their performance could be further improved using the limited label information. We propose a joint self-supervised and supervised graph contrastive learning (JGCL) to capture the mutual benefits of both learning strategies. JGCL utilizes both supervised and self-supervised data augmentation and a joint contrastive loss function. Our experiments demonstrate that JGCL and its variants are one of the best performers across various proportions of labeled data when compared with state-of-the-art self-supervised, unsupervised, and semi-supervised methods on various benchmark graphs.
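The joint objective can be sketched as a supervised cross-entropy term on the labelled nodes plus a self-supervised contrastive term between two augmented views; the simplified NT-Xent loss and the weighting `lam` below are illustrative assumptions, not the exact JGCL loss.

```python
# Joint supervised + self-supervised contrastive objective (illustrative sketch).
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """Contrastive loss between two augmented views; matching rows are positives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                 # (N, N) cross-view similarities
    labels = torch.arange(z1.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def joint_loss(z1, z2, logits_labeled, y_labeled, lam=0.5):
    """Supervised term on the labelled subset plus self-supervised term on all nodes."""
    return F.cross_entropy(logits_labeled, y_labeled) + lam * nt_xent(z1, z2)

# Toy usage with random embeddings for 8 nodes, 3 of them labelled with 4 classes.
z1, z2 = torch.randn(8, 16), torch.randn(8, 16)
logits, y = torch.randn(3, 4), torch.tensor([0, 2, 1])
print(joint_loss(z1, z2, logits, y).item())
```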
Enhancing crowd flow prediction in various spatial and temporal granularities
Thanks to the diffusion of the Internet of Things, it is now possible to sense human mobility almost in real time using unconventional methods (e.g., the number of bikes at a bike station). Due to the spread of such technologies, recent years have witnessed significant growth in human mobility studies, motivated by their importance in a wide range of applications, from traffic management to public security and computational epidemiology.
A framework to enhance smart citizenship in coastal areas
Life quality in a city can be affected by the way citizens interact with the city. Under the smart city concept, citizens act as human sensors, reporting natural hazards, generating real-time data, and raising awareness about environmental issues. This crowdsourced knowledge supports the city's sustainability and tourism. In particular, smart seaside cities can fully utilize citizen science data to improve the efficiency of city services such as smart tourism and smart transportation. The most well-known characteristic of smart coastal cities is beach monitoring. Environmental assistance and awareness is a beach monitoring issue that could be enhanced through crowdsourced knowledge. Especially for coastal areas under the Natura 2000 network, it is important to identify and map citizens' knowledge.
Sensor Network Design for Uniquely Identifying Sources of Contamination in Water Distribution Networks
Sensors are being extensively adopted for use in smart cities in order to monitor various parameters, so that any anomalous behaviour manifesting in the deployment area can easily be detected. Sensors in a deployment area have two functions, sensing/coverage and communication, with this paper focusing on the former. Over the years, several coverage models have been proposed in which the underlying assumption is that a sensor placed at a certain location can sense its environment up to a certain distance. This assumption often leads to a Set Cover based problem formulation, which unfortunately has a serious limitation in the sense that it lacks unique identification capability for the location where anomalous behavior is sensed. This limitation can be overcome through the utilization of Identifying Codes. The optimal solution of the Identifying Code problem provides the minimum number of sensors needed to uniquely identify the location where anomalous behavior is sensed. In this paper, we introduce a budget-constrained version of the problem, whose goal is to find the largest number of locations that can be uniquely identified with the sensors that can be deployed within the specified budget. We provide an Integer Linear Programming formulation and a Maximum Set-Group Cover (MSGC) formulation for the problem and prove that the MSGC problem cannot have a polynomial time approximation algorithm with a 1/k factor performance guarantee unless P = NP.
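For concreteness, a generic minimum identifying-code ILP (every location covered, and every pair of locations separated by some sensor's closed neighbourhood) can be written as below; the paper's budget-constrained and MSGC variants are not reproduced here, and the toy graph is an assumption.

```python
# Generic identifying-code ILP on a toy graph, written with PuLP.
import itertools
import pulp
import networkx as nx

G = nx.cycle_graph(6)                     # toy network; nodes are candidate locations
N = {v: set(G[v]) | {v} for v in G}       # closed neighbourhoods

x = {v: pulp.LpVariable(f"x_{v}", cat="Binary") for v in G}
prob = pulp.LpProblem("identifying_code", pulp.LpMinimize)
prob += pulp.lpSum(x.values())            # minimise the number of sensors

for v in G:                               # every location is covered by some sensor
    prob += pulp.lpSum(x[u] for u in N[v]) >= 1
for v, w in itertools.combinations(G, 2): # every pair is separated by some sensor
    prob += pulp.lpSum(x[u] for u in N[v] ^ N[w]) >= 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(sorted(v for v in G if x[v].value() == 1))
```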
Multi-tenancy in Smart City Platforms
Multi-tenancy has emerged as a software architecture in an effort to optimize the use of compute resources and minimize the operational cost of large scale deployments. Its applicability, however, needs to take into account the particular context, as the challenges this architectural pattern brings may not make it an optimal choice in every
Wrap-up
Keynote Talk by Michael Bronstein (University of Oxford - United Kingdom): Graph Neural Networks beyond Weisfeiler-Lehman and vanilla Message Passing
MarkovGNN: Graph Neural Networks on Markov Diffusion
Most real-world networks contain well-defined community structures where nodes are densely connected internally within communities. To learn from these networks, we develop MarkovGNN that captures the formation and evolution of communities directly in different convolutional layers. Unlike most Graph Neural Networks (GNNs) that consider a static graph at every layer, MarkovGNN generates different stochastic matrices using a Markov process and then uses these community-capturing matrices in different layers. MarkovGNN is a general approach that could be used with most existing GNNs. We experimentally show that MarkovGNN outperforms other GNNs for clustering, node classification, and visualization tasks. The source code of MarkovGNN is publicly available at https://github.com/HipGraph/MarkovGNN.
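The core idea can be sketched as follows: build a row-stochastic transition matrix from the graph, take successive Markov steps, and let each layer propagate with a different diffusion matrix instead of a single static adjacency. This toy sketch is not the released implementation linked above.

```python
# Per-layer Markov diffusion matrices in a GCN-style forward pass (sketch).
import torch
import networkx as nx

G = nx.karate_club_graph()
A = torch.tensor(nx.to_numpy_array(G), dtype=torch.float32)
P = A / A.sum(dim=1, keepdim=True)            # random-walk transition matrix
diffusions = [torch.matrix_power(P, k) for k in (1, 2, 3)]

dims = [8, 16, 16, 8]
weights = [torch.nn.Linear(dims[i], dims[i + 1]) for i in range(3)]

def forward(X):
    H = X
    for Pk, W in zip(diffusions, weights):    # layer k propagates with the k-step matrix
        H = torch.relu(W(Pk @ H))
    return H

X = torch.randn(G.number_of_nodes(), dims[0])
print(forward(X).shape)                       # torch.Size([34, 8])
```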
Unsupervised Superpixel-Driven Parcel Segmentation of Remote Sensing Images Using Graph Convolutional Network
Accurate parcel segmentation of remote sensing images plays an important role in ensuring various downstream tasks. Traditionally, parcel segmentation is based on supervised learning using precise parcel-level ground truth information, which is difficult to obtain. In this paper, we propose an end-to-end unsupervised Graph Convolutional Network (GCN)-based framework for superpixel-driven parcel segmentation of remote sensing images. The key component is a novel graph-based superpixel aggregation model, which effectively learns superpixels' latent affinities and better aggregates similar ones in spatial and spectral spaces. We construct a multi-temporal multi-location testing dataset using Sentinel-2 images and the ground truth annotations in four different regions. Extensive experiments are conducted to demonstrate the efficacy and robustness of our proposed model. The best performance is achieved by our model compared with the competing methods.
Improving Bundles Recommendation Coverage in Sparse Product Graphs
In e-commerce, a group of similar or complementary products is recommended as a bundle based on the product category. Existing work on modeling bundle recommendations consists of graph-based approaches. In these methods, user-product interactions provide a more personalized experience. However, these approaches require robust user-product interactions and cannot be applied to cold-start scenarios. When a new product is launched, or for products with limited purchase history, the lack of user-product interactions renders these algorithms inaccessible, and hence no bundle recommendations will be provided to users for such product categories. These scenarios are frequent for retailers like Target, where much of the stock is seasonal and new brands are launched throughout the year. This work alleviates this problem by modeling product bundle recommendation as a supervised graph link prediction problem. A graph neural network (GNN) based product bundle recommendation system, BundlesSEAL, is presented. First, we build a graph using add-to-cart data and then use BundlesSEAL to predict the link representing the bundle relation between products represented as nodes. We also propose a heuristic to identify relevant pairs of products for efficient inference. Further, we apply BundlesSEAL to predicting edge weights instead of just link existence. BundlesSEAL-based link prediction alleviates the above-mentioned cold-start problem by increasing the coverage of product bundle recommendations in various categories by 50% while achieving a 35% increase in revenue over a behavioral baseline. The model was also validated on the Amazon product metadata dataset.
Revisiting Neighborhood-based Link Prediction for Collaborative Filtering
Collaborative filtering (CF) is one of the most successful and fundamental techniques in recommendation systems. In recent years,
Understanding Dropout for Graph Neural Networks
Graph neural networks (GNNs) have demonstrated superior performance on graph learning tasks. A GNN captures data dependencies via message passing amid neural networks, so the prediction of a node label can utilize information from its neighbors in the graph. Dropout is a regularization as well as an ensemble method for convolutional neural networks (CNNs) and has been carefully studied in that setting. However, few existing works have focused on dropout schemes for GNNs. Although GNNs and CNNs share a similar model architecture, both with convolutional layers and fully connected layers, their input data structures differ and their convolution operations differ. This suggests that dropout schemes for CNNs should not be directly applied to GNNs without a good understanding of the impact. In this paper, we divide the existing dropout schemes for GNNs into two categories: (1) dropout on feature maps and (2) dropout on graph structure. Based on the drawbacks of current GNN dropout models, we propose a novel layer compensation dropout and a novel adaptive heteroscedastic Gaussian dropout, which can be applied to any type of GNN model and outperform their corresponding baselines in shallow GNNs. An experimental study then shows that Bernoulli dropout generalizes better, while Gaussian dropout is slightly stronger in transductive performance. Finally, we theoretically study how different dropout schemes mitigate over-smoothing problems, and experimental results show that layer compensation dropout allows a GNN model to maintain or slightly improve its performance as more layers are added, while all the other dropout models suffer from performance degradation when the GNN goes deep.
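The two categories discussed above can be shown in minimal form, dropout on feature maps versus dropout on the graph structure (DropEdge-style); the paper's layer-compensation and adaptive heteroscedastic Gaussian variants are not reproduced here.

```python
# Feature dropout vs. graph-structure dropout for GNN inputs (illustrative sketch).
import torch
import torch.nn.functional as F

def feature_dropout(X, p=0.5, training=True):
    """Randomly zero entries of the node feature matrix (rescaled by 1/(1-p))."""
    return F.dropout(X, p=p, training=training)

def edge_dropout(edge_index, p=0.2, training=True):
    """Randomly drop a fraction p of the edges (columns of a 2 x E index tensor)."""
    if not training:
        return edge_index
    keep = torch.rand(edge_index.size(1)) >= p
    return edge_index[:, keep]

X = torch.randn(5, 4)
edge_index = torch.tensor([[0, 1, 2, 3, 4], [1, 2, 3, 4, 0]])
print(feature_dropout(X).count_nonzero().item(), edge_dropout(edge_index).shape)
```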
Surj: Ontological Learning for Fast, Accurate, and Robust Hierarchical Multi-label Classification
We consider multi-label classification in the context of complex hierarchical relationships organized into an ontology. These situations are ubiquitous in learning problems on the web and in science, where rich domain models are developed but labeled data is rare. Most existing solutions model the problem as a sequence of simpler problems: one classifier for each level in the hierarchy, or one classifier for each label. These approaches require more training data, which is often unavailable in practice: as the ontology grows in size and complexity, it becomes unlikely to find training examples for all expected combinations. In this paper, we learn offline representations of the ontology using a graph autoencoder and separately learn to classify input records, reducing dependence on training data: Since the relationships between labels are encoded independently of training data, the model can make predictions even for underrepresented labels, naturally generalize to DAG-structured ontologies and remain robust to low-data regimes. We show empirically that our label predictions respect the hierarchy (predicting a descendant implies predicting its ancestors) and propose a method of evaluating hierarchy violations that properly ignores irrelevant violations. Our main result is that our model outperforms all state-of-the-art models on 17 of 20 datasets across multiple domains by a significant margin, even with limited training data.
Wrap-up
Adam Chaboryk: Creating an Open Source, Customizable Accessibility Checker for Content Authors
Mohammad Gulam Lorgat, Hugo Paredes, and Tânia Rocha: An Approach to Teach Accessibility with Gamification
Ovidiu-Andrei Schipor, Laura-Bianca Bilius, Ovidiu-Ciprian Ungurean, Alexandru-Ionuţ Şiean, Alexandru-Tudor Andrei, and Radu-Daniel Vatavu: Personalized Wearable Interactions with WearSkill
Rachana Sreedhar, Nicole Tan, Jingyue Zhang, Kim Jin, Spencer Gregson, Eli Moreta-Feliz, Niveditha Samudrala and Shrenik Sadalgi: AIDE: An Automatic Image Description Engine for Review Imagery
Demo Madness
Introduction by the Workshop Organizers
Invited Talk by Arthur Gervais: How Dark is the Forest? On Blockchain Extractable Value and High-Frequency Trading in Decentralized Finance
How much is the fork? Fast Probability and Profitability Calculation during Temporary Forks
Estimating the probability, as well as the profitability, of different
Introduction by the Workshop Organizers
Keynote Talk by Karin Verspoor: "Why bother enabling biomedical literature analysis with semantics?"
No abstract available
Exploring Representations for Singular and Multi-Concept Relations for Biomedical Named Entity Normalization
Since the rise of the COVID-19 pandemic, peer-reviewed biomedical repositories have experienced a surge in chemical and disease related queries. These queries feature a wide variety of naming conventions and nomenclatures, from trademark and generic names to chemical composition mentions. Normalizing or disambiguating these mentions within texts provides researchers and data curators with more relevant articles returned by their search query. Named entity normalization aims to automate this disambiguation process by linking entity mentions to an appropriate candidate concept within a biomedical knowledge base or ontology. We explore several term embedding aggregation techniques and examine how a term's context affects evaluation performance. We also evaluate our embedding approaches for normalizing term instances containing one or many relations within unstructured texts.
Multi-touch Attribution for complex B2B customer journeys using Temporal Convolutional Networks
Customer journeys in Business-to-Business (B2B) transactions contain long and complex sequences of interactions between different stakeholders from the buyer and seller companies. On the seller side, there is significant interest in the multi-touch attribution (MTA) problem, which aims to identify the most influential stage transitions (in the B2B customer funnel), channels, and touchpoints. We design a novel deep learning-based framework, which solves these attribution problems by modeling the conversion of journeys as functions of stage transitions that occur in them. Each stage transition is modeled as a Temporal Convolutional Network (TCN) on the touchpoints that precede it. Further, a global conversion model Stage-TCN is built by combining these individual stage transition models in a non-linear fashion. We apply Layer-wise Relevance Propagation (LRP) based techniques to compute the relevance of all nodes and inputs in our network and use these to compute the required attribution scores. We run extensive experiments on two real-world B2B datasets and demonstrate superior accuracy of the conversion model compared to prior works. We validate the attribution scores using perturbation-based techniques that measure the change in model output when parts of the input having high attribution scores are deleted.
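A generic causal temporal-convolution block of the kind used per stage transition is sketched below; the channel sizes, dilations, and the inline scoring head are assumptions, and the full Stage-TCN combination and LRP attribution are not shown.

```python
# Causal (no future leakage) temporal convolution over a touchpoint sequence.
import torch
import torch.nn as nn

class CausalConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation        # left-pad only
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                              # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))
        return torch.relu(self.conv(x))

# Touchpoint sequence of length 20 with 6 features, scored for one stage transition.
tcn = nn.Sequential(CausalConvBlock(6, 16, dilation=1),
                    CausalConvBlock(16, 16, dilation=2))
x = torch.randn(4, 6, 20)
score = torch.sigmoid(nn.Linear(16, 1)(tcn(x)[:, :, -1]))  # use the last time step
print(score.shape)                                          # torch.Size([4, 1])
```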
Semantic Modelling of Document Focus-time for Temporal Information Retrieval
Accurate understanding of the temporal dynamics of Web content and user behaviors plays a crucial role in the interactive process between a search engine and its users. In this work, we focus on how to improve retrieval performance through a better understanding of the time factor. On the one hand, we propose a novel method to estimate the focus-time of documents leveraging their semantic information. On the other hand, we introduce a new way of understanding the temporal intent underlying a search query based on Google Trends. Furthermore, we apply the proposed methods to two search scenarios: temporal information retrieval and temporal diversity retrieval. Our experimental results based on the NTCIR Temporalia test collections show that: (1) Semantic information can be used to predict the temporal tendency of documents. (2) The semantic-based model works effectively even when few temporal expressions and entity names are available in documents. (3) The effectiveness of the estimated focus-time is comparable to that of the article's publication time in relevance modelling, and thus our method can be used as an alternative or supplementary tool when reliable publication dates are not available. (4) The trend time can improve the representation of temporal intents behind queries over query issue time.
Analytical Models for Motifs in Temporal Networks
Dynamic evolving networks capture temporal relations in domains such as social networks, communication networks, and financial transaction networks. In such networks, temporal motifs, which are repeated sequences of time-stamped edges/transactions, offer valuable information about the networks' evolution and function. However, calculating temporal motif frequencies is computationally expensive as it requires (i) identifying all instances of the static motifs in the static graph induced by the temporal graph, and (ii) counting the number of subsequences of temporal edges that correspond to a temporal motif and occur within a time window. Since the number of temporal motifs changes over time, finding interesting temporal patterns involves iterative application of the above process over many consecutive time windows, which makes it impractical to scale to large real temporal networks. Here, we develop a fast and accurate model-based method for counting motifs in temporal networks. We first develop the Temporal Activity State Block Model (TASBM) to model temporal motifs in temporal graphs. Then we derive closed-form analytical expressions that allow us to quickly calculate expected motif frequencies and their variances in a given temporal network. Finally, we develop an efficient model fitting method, so that for a given network we can quickly fit the TASBM model and compute motif frequencies. We apply our approach to two real-world networks: a network of financial transactions and an email network. Experiments show that our TASBM framework (1) accurately counts temporal motifs in temporal networks; (2) easily scales to networks with tens of millions of edges/transactions; and (3) is about 50x faster than explicit motif counting methods on networks of about 5 million temporal edges, a factor which increases with network size.
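For contrast with the analytical approach, explicit counting of even a simple temporal motif (an edge u→v followed by v→w within a window delta) already requires enumerating edge pairs, as in the naive sketch below.

```python
# Naive enumeration of a 2-edge temporal motif: u->v followed by v->w within delta.
from collections import defaultdict

def count_two_step(temporal_edges, delta):
    """temporal_edges: list of (src, dst, t) tuples."""
    out_by_src = defaultdict(list)
    for u, v, t in temporal_edges:
        out_by_src[u].append((v, t))
    count = 0
    for u, v, t1 in temporal_edges:
        for w, t2 in out_by_src[v]:
            if 0 < t2 - t1 <= delta and w != u:
                count += 1
    return count

edges = [("a", "b", 1), ("b", "c", 2), ("b", "d", 10), ("c", "a", 3)]
print(count_two_step(edges, delta=5))   # 2: a->b->c and b->c->a
```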
Rows from Many Sources: How to enrich row completions from Wikidata with a pre-trained Language Model
Row completion is the task of augmenting a given table of text and numbers with additional, relevant rows. The task divides into two steps: subject suggestion, the task of populating the main column; and gap filling, the task of populating the remaining columns. We present state-of-the-art results for subject suggestion and gap filling.
A Map of Science in Wikipedia
In recent decades, the rapid growth of Internet adoption is offering opportunities for convenient and inexpensive access to scientific information. Wikipedia, one of the largest encyclopedias worldwide, has become a reference in this respect, and has attracted widespread attention from scholars. However, a clear understanding of the scientific sources underpinning Wikipedia's contents remains elusive. In this work, we rely on an open dataset of citations from Wikipedia to map the relationship between Wikipedia articles and scientific journal articles. We find that most journal articles cited from Wikipedia belong to STEM fields, in particular biology and medicine ($47.6$\% of citations; $46.1$\% of cited articles). Furthermore, Wikipedia's biographies play an important role in connecting STEM fields with the humanities, especially history. These results contribute to our understanding of Wikipedia's reliance on scientific sources, and its role as knowledge broker to the public.
Improving Linguistic Bias Detection in Wikipedia using Cross-Domain Adaptive Pre-Training
Wikipedia is a collective intelligence platform that helps contributors to collaborate efficiently for creating and disseminating knowledge and content. A key guiding principle of Wikipedia is to maintain a neutral point of view (NPOV), which can be challenging for new contributors and experienced editors alike. Hence, several previous studies have proposed automated systems to detect biased statements on Wikipedia with mixed results. In this paper, we investigate the potential of cross-domain pre-training to learn bias features from multiple sources, including Wikipedia, news articles, and ideological statements from political figures in an effort to learn richer cross-domain indicators of bias that may be missed by existing methods. Concretely, we study the effectiveness of bias detection via cross-domain pre-training of deep transformer models. We find that the cross-domain bias classifier with continually pre-trained RoBERTa model achieves a precision of 89% with an F1 score of 87%, and can detect subtle forms of bias with higher accuracy than existing methods.
Lightning Talks
Ovidiu-Andrei Schipor, Laura-Bianca Bilius, Radu-Daniel Vatavu: WearSkill: Personalized and Interchangeable Input with Wearables for Users with Motor Impairments
Hwayeon Joh, YunJung Lee, Uran Oh: Understanding the Touchscreen-based Nonvisual Target Acquisition Task Performance of Screen Reader Users
Juliette Regimbal, Jeffrey Blum, Jeremy Cooperstock: IMAGE: A Deployment Framework for Creating Multimodal Experiences of Web Graphics
Rachana Sreedhar, Nicole Tan, Jingyue Zhang, Kim Jin, Spencer Gregson, Niveditha Samudrala, Eli Moreta-Feliz, Shrenik Sadalgi: AIDE: Automatic and Accessible Image Descriptions for Review Imagery in Online Retail
William Payne, Fabiha Ahmed, Michael Gardell, R. Luke DuBois, Amy Hurst: SoundCells: Designing a Browser-Based Music Technology for Braille and Print Notation (Best Technical Paper Candidate)
Introduction by the Workshop Organizers
Keynote Talk by Marinka Zitnik (Harvard)
Keynote Talk by Marzyeh Ghassemi (MIT)
Keynote Talk by Himabindu Lakkaraju (Harvard)
Keynote Talk by Eyal Klang (Sheba Medical Center)
Keynote Talk by Kai Shu (Illinois Institute of Technology): Combating Disinformation on Social Media and Its Challenges
Keynote Talk by Tim Althoff (University of Washington): Understanding and Facilitating Empathic Conversations in Online Social Media
Wrap-up
Going down the Wikipedia Rabbit Hole: Characterizing the Long Tail of Reading Sessions
Wiki rabbit holes are typically described as navigation paths followed by Wikipedia readers that lead them to long explorations, sometimes finding themselves in unexpected articles. Despite being a popular concept in Internet culture, our current understanding of its dynamics is based only on anecdotal reports. This paper provides a large-scale quantitative characterization of the navigation traces of readers that supposedly fell into one of these rabbit holes. First, we aggregate the users' sessions in navigation trees and operationalize the concept of wiki rabbit holes based on the depth of these trees. Then, we characterize these sessions in terms of structural patterns, time properties, and exploration in topics space.
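For concreteness, a minimal sketch of how the depth of a navigation tree could be computed from a session's click pairs, assuming sessions arrive as (source, target) article pairs that form a tree; the names and input format are hypothetical:

```python
def session_depth(clicks):
    """Depth of the navigation tree induced by (source_article, target_article)
    click pairs from one reading session (hypothetical input format)."""
    children = {}
    targets = set()
    for src, dst in clicks:
        children.setdefault(src, []).append(dst)
        targets.add(dst)
    # Roots are sources that never appear as a click target.
    roots = [src for src, _ in clicks if src not in targets]

    def depth(node):
        kids = children.get(node, [])
        return 1 + (max(depth(k) for k in kids) if kids else 0)

    return max((depth(r) for r in roots), default=0)

# Sessions whose tree depth exceeds a chosen threshold would be treated
# as candidate "rabbit hole" explorations.
```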
Offline Meetups of German Wikipedians: Boosting or braking activity?
The role of online attention in the supply of disinformation in Wikipedia
Lightning Talks
Invited Talk by Jiahua Xu: Yield Aggregators in DeFi
Analysis of Arbitrary Content on Blockchain-Based Systems using BigQuery
Blockchain-based systems have gained immense popularity as enablers of independent asset transfers and smart contract functionality. They have also, since as early as the first Bitcoin blocks, been used for storing arbitrary contents such as texts and images. On-chain data storage functionality is useful for a variety of legitimate use cases. It does, however, also pose a systematic risk. If abused, for example by posting illegal contents on a public blockchain, data storage functionality can lead to legal consequences for operators and users that need to store and distribute the blockchain, thereby threatening the operational availability of entire blockchain ecosystems. In this paper, we develop and apply a cloud-based approach for quickly discovering and classifying content on public blockchains. Our method can be adapted to different blockchain systems and offers insights into content-related usage patterns and potential cases of abuse. We apply our method on the two most prominent public blockchain systems---Bitcoin and Ethereum---and discuss our results. To the best of our knowledge, the presented study is the first to systematically analyze non-financial content stored on the Ethereum blockchain and the first to present a side-by-side comparison between different blockchains in terms of the quality and quantity of stored data.
Characterizing the OpenSea NFT Marketplace
'Non-Fungible Tokens' (NFTs) are unique digital identifiers that are used to represent ownership of various cryptoassets such as music, artwork, collectibles, game assets, and much more. At present, they primarily exist on the Ethereum blockchain and allow for an unalterable and provable chain of creation and ownership. Although the space is still young, we feel that NFTs present an exciting prospect of things to come in the digital technology space, and as such we seek to further understand market trends. That said, NFTs are still largely misunderstood by the wider community, so we take a mainly generalized view while also examining some more specific facets of the most popular trading marketplace, OpenSea. Prior work focused on specific collections or the underlying technology of NFTs but did not look at how the market was evolving over time. For our study, data on all OpenSea sales was collected from January 1, 2019 through December 31, 2021. This accounted for 5.25 million sales of 3.65 million unique NFT assets. We begin by presenting an overview of our data collection process as well as a summary of key statistics of the dataset. From there, we examine user behaviour in the market to show that a small subset of users is driving massive growth, while the typical user has been reducing their average transaction count over time. Second, we review the economic activity within the network to show how these power users drive extreme price volatility within the art and collectible categories. Lastly, we review the network of buyers and sellers to show how a tight-knit community structure has formed within NFT categories.
Wrap-up
Introduction by the Workshop Organizers
Optimizing Data Layout for Training Deep Neural Networks
The widespread popularity of deep neural networks (DNNs) has made them an important workload in modern datacenters. Training DNNs is both computation-intensive and memory-intensive. While prior works focus on training parallelization (e.g., data parallelism and model parallelism) and model compression schemes (e.g., pruning and quantization) to reduce the training time, choosing an appropriate data layout for input feature maps also plays an important role and is considered orthogonal to parallelization and compression in delivering the overall training performance. However, finding an optimal data layout is non-trivial since the preferred data layout varies depending on the DNN model as well as the pruning scheme that is applied. In this paper, we propose a simple-yet-effective data layout arbitration framework that automatically selects the beneficial data layout for different DNNs under different pruning schemes. The proposed framework is built upon a formulated cache estimation model. Experimental results indicate that our approach is always able to select the most beneficial data layout and achieves average training performance improvements of 14.3% and 3.1% compared to uniformly using two popular data layouts.
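For illustration, a minimal sketch of the arbitration step only, assuming a cache estimation model is available as a callable; the function name, signature, and layout names are hypothetical stand-ins for the framework's components:

```python
def pick_layout(layer_shapes, estimate_cache_misses, layouts=("NCHW", "NHWC")):
    """Choose the data layout with the lowest total estimated cache misses.
    `estimate_cache_misses(shape, layout)` stands in for the formulated
    cache estimation model (hypothetical signature)."""
    def total_cost(layout):
        return sum(estimate_cache_misses(shape, layout) for shape in layer_shapes)
    return min(layouts, key=total_cost)
```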
Security Challenges for Modern Data Centers with IoT: A Preliminary Study
The wide deployment of internet of things (IoT) devices has a profound impact on the data center industry from various perspectives, ranging from infrastructure operation and resource management to end users. This is a double-edged sword: it enables ubiquitous resource monitoring and intelligent management, and therefore significantly enhances daily operation efficiency, while introducing new security issues for modern data centers. The emerging security challenges are not only related to detecting new IoT attacks or vulnerabilities but also include the implementation of cybersecurity protection mechanisms (e.g., intrusion detection systems, vulnerability management systems) to enhance data center security. As the new security challenges with IoT have not been thoroughly explored in the literature, this paper provides a survey of the most recent IoT security issues for modern data centers by highlighting IoT attacks and the trend of newly discovered vulnerabilities. We find that vulnerabilities related to data centers have increased significantly since 2019. Compared to the total in 2018 (25 vulnerabilities), the number of data center vulnerabilities nearly quadrupled (to 98 vulnerabilities) in 2020. This paper also introduces existing cybersecurity tools and discusses the associated challenges and research issues for enhancing data center security.
Efficient Streaming Analytics with Adaptive Near-data Processing
Streaming analytics applications need to process massive volumes of data in a timely manner, in domains ranging from datacenter telemetry and geo-distributed log analytics to Internet-of-things systems. Such applications suffer from significant network transfer costs to transport the data to a stream processor and compute costs to analyze the data in a timely manner. Pushing the computation closer to the data source by partitioning the analytics query is an effective strategy to reduce resource costs for the stream processor. However, the partitioning strategy depends on the nature of the resource bottleneck encountered at the data source. Datacenter telemetry systems are constrained by limited compute resources on the data source (i.e., the monitored server node), which is shared by foreground customer applications hosted by the platform. On the other hand, geo-distributed applications suffer from limited network bandwidth across geo-distributed sites. Furthermore, resources available on the data source nodes can change over time, requiring the partitioning strategy to quickly adapt to the changing resource conditions. In this paper, we study different issues which affect query partitioning strategies. With insights obtained from partitioning techniques within cloud datacenters which operate under constrained compute conditions, we suggest several different ways to improve the performance of stream analytics applications operating in different resource environments, by significantly reducing the stream processor resource costs, while also reducing the overhead of making partitioning decisions.
Powering Multi-Task Federated Learning with Competitive GPU Resource Sharing
Federated learning (FL) has been applied to train different tasks, posing new computation challenges in training, especially when the scenario becomes multi-task. In this paper, we first profile the FL multi-task training process at the operator level to identify and address the problems in FL multi-task training. Second, we propose a Competitive GPU Resource Sharing method that efficiently partitions GPU resources to improve training efficiency. Third, for the imbalanced data problem in FL with multi-device training, we partition GPU resources according to the workload of different models. Experiments show that our method obtains a 2.1x speedup.
Graph Convolutional Networks for Chemical Relation Extraction
Extracting information regarding novel chemicals and chemical reactions from chemical patents plays a vital role in the chemical and pharmaceutical industry. Due to the increasing volume of chemical patents, there is an urgent need for automated solutions to extract relations between chemical compounds. Several studies have used models that apply attention mechanisms, such as Bidirectional Encoder Representations from Transformers (BERT), to capture the contextual information within a text. However, these models do not capture global information about a specific vocabulary. On the other hand, Graph Convolutional Networks (GCNs) capture global dependencies between terms within a corpus but not the local contextual information. In this work, we propose two novel approaches, GCN-Vanilla and GCN-BERT, for relation extraction. The GCN-Vanilla approach builds a single graph for the whole corpus based on word co-occurrence and sentence-word relations; we then model the graph with a GCN to capture the global information and classify the sentence nodes. The GCN-BERT approach combines GCN and BERT to capture both global and local information and builds a joint final representation for relation extraction. We evaluate our approaches on the CLEF-2020 dataset. Our results show the combined GCN-BERT approach outperforms standalone BERT and GCN models and achieves a higher F1 than that reported in our previous studies.
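As an illustration of the global-plus-local fusion idea, a minimal sketch assuming the GCN sentence-node embedding and the BERT sentence vector have already been computed; the single linear classifier and its parameters are hypothetical stand-ins, not the paper's architecture:

```python
import numpy as np

def combine_gcn_bert(gcn_node_vec, bert_cls_vec, weights, bias):
    """Fuse the global (GCN) and local (BERT) representations of a sentence
    and score relation classes with one linear layer (hypothetical classifier)."""
    fused = np.concatenate([gcn_node_vec, bert_cls_vec])   # global + local
    logits = weights @ fused + bias                         # shape: (num_classes,)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                                  # class probabilities
```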
Biomedical Word Sense Disambiguation with Contextualized Representation Learning
Representation learning is an important component in solving most Natural Language Processing~(NLP) problems, including Word Sense Disambiguation~(WSD). The WSD task tries to find the best meaning in a knowledge base for a word with multiple meanings~(an ambiguous word). WSD methods choose this best meaning based on the context, i.e., the words around the ambiguous word in the input text document. Thus, word representations may improve the effectiveness of disambiguation models if they carry useful information from the context and the knowledge base. A limitation of most current representation learning approaches is that they are trained on general English text and are not domain-specific. In this paper, we present a novel contextual, knowledge-base-aware sense representation method for the biomedical domain. The novelty of our representation is the integration of the knowledge base and the context. This representation lies in a space comparable to that of contextualized word vectors, thus allowing a word occurrence to be easily linked to its meaning by applying a simple nearest neighbor approach. Comparing our approach with state-of-the-art methods shows the effectiveness of our method in terms of text coherence.
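A minimal sketch of the nearest-neighbor linking step described above, assuming the sense representations and the contextualized occurrence vector live in the same space; the input format is hypothetical:

```python
import numpy as np

def disambiguate(context_vec, sense_vectors):
    """Link a word occurrence to the knowledge-base sense whose representation
    is nearest in the shared space. `sense_vectors` maps sense IDs to vectors
    (hypothetical input format)."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(sense_vectors, key=lambda sid: cosine(context_vec, sense_vectors[sid]))
```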
Keynote Talk by Olivier Bodenreider: Powering semantic analysis with bio-ontologies
No abstract available
Panel with Workshop Organizers and Keynote Speaker: Omar Alonso (Northeastern University), Ricardo Baeza-Yates (Northeastern University - UPF - UChile), Adam Jatowt (University of Innsbruck), Marc Spaniol (University of Caen Normandy)
Keynote Talk by Robin Christopherson: Out with accessibility - In With Inclusive Design
Keynote Talk by Andrew Beam (Harvard)
Keynote Talk by Greg Durrett (UT Austin)
Panel: Theoretical foundation for explainable AI in health
Moderator: Ben Glicksberg
Panelists: Fei Wang, Himabindu Lakkaraju, Eyal Klang, Andrew Beam and Greg Durrett
Panel: 10 Years After The SOPA/PIPA Blackout: The Past and Future of Online Protest
Moderator: Erik Moeller
Panelists: Tiffiniy Cheng, Mishi Choudhary and Cory Doctorow
Keynote Talk by Sean James (Director of Energy Research - Microsoft): Advanced Building Materials that Store Carbon
Keynote Talk by Liguang Xie (Senior Principal Architect - FutureWei Technologies): Building Next-Gen Cloud Infrastructure and Networking in a Cloud-Native Approach
Wrap-up
Panel with Trevor Cohen, Melissa Haendel, Chunhua Weng
Wrap-up
Keynote Talk by Lawrence Lessig: How can the Internet be so bad and so good: The lessons we must draw, and that Wiki must teach
Wrap-up
Introduction by the Workshop Organizers
Keynote Talk by Serena Villata: Towards argument-based explanatory dialogues: from argument mining to argument generation
Introduction by the Workshop Organizers
Keynote Talk by Stefano de Sabbata: Everyday digital geographies
Exploiting Geodata to Improve Image Recognition with Deep Learning
Due to the widespread availability of smartphones and digital cameras with GPS functionality, the number of photos associated with geographic coordinates or geoinformation on the internet is continuously increasing. Besides the obvious benefits of geotagged images for users, geodata can enable a better understanding of the image content and thus facilitate its classification. This work shows the added value of integrating auxiliary geodata in a multi-class single-label image classification task. Various ways of encoding and extracting auxiliary features from raw coordinates are compared, followed by an investigation of approaches to integrate these features into a convolutional neural network (CNN) via fusion models. We show the classification improvements of adding the raw coordinates and derived auxiliary features such as satellite photos and location-related texts (address information and tags). The results show that the best performance is achieved by a fusion model incorporating textual features based on address information: it improves performance the most while also reducing the training time. The accuracy over the 25 considered concepts increased to 85%, compared to 71% for the baseline, while the training time was reduced by 21%. Adding the satellite photos into the neural network shows significant performance improvements as well, but increases the training time. In contrast, numerical features derived directly from raw coordinates do not yield a convincing improvement in classification performance.
Introduction by the Workshop Organizers
Data models for annotating biomedical scholarly publications: the case of CORD-19
Semantic text annotations have been a key factor for supporting computer applications ranging from knowledge graph construction to biomedical question answering. In this systematic review, we provide an analysis of the data models that have been applied to semantic annotation projects for the scholarly publications available in the CORD-19 dataset, an open database of the full texts of scholarly publications about COVID-19. Based on Google Scholar and the screening of specific research venues, we retrieve seventeen publications on the topic mostly from the United States of America. Subsequently, we outline and explain the inline semantic annotation models currently applied on the full texts of biomedical scholarly publications. Then, we discuss the data models currently used with reference to semantic annotation projects on the CORD-19 dataset to provide interesting directions for the development of semantic annotation models and projects.
Quantifying the topic disparity of scientific articles
Citation count is a popular index for assessing scientific papers. However, it depends not only on the quality of a paper but also on various factors, such as conventionality, team size, and gender. Here, we examine the extent to which the conventionality of a paper is related to its citation percentile in a discipline by using our measure, topic disparity. The topic disparity is the cosine distance between a paper and its discipline in a neural embedding space. Using this measure, we show that topic disparity is negatively associated with the citation percentile in many disciplines, even after controlling for team size and the genders of the first and last authors. This result indicates that less conventional research tends to receive fewer citations than conventional research. Our proposed method can be used to complement raw citation counts and to recommend papers at the periphery of a discipline because of their less conventional topics.
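A minimal sketch of the measure, under the assumption that a discipline is represented by the centroid of its papers' embeddings (the paper's exact construction of the discipline vector may differ):

```python
import numpy as np

def topic_disparity(paper_vec, discipline_paper_vecs):
    """Cosine distance between a paper and the centroid of its discipline
    in the embedding space (a simplified reading of the measure)."""
    centroid = np.mean(discipline_paper_vecs, axis=0)
    cos_sim = np.dot(paper_vec, centroid) / (
        np.linalg.norm(paper_vec) * np.linalg.norm(centroid))
    return 1.0 - float(cos_sim)
```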
Personal Research Knowledge Graphs
Maintaining research-related information in an organized manner can be challenging for a researcher. In this paper, we envision personal research knowledge graphs (PRKGs) as a means to represent structured information about the research activities of a researcher. PRKGs can be used to power intelligent personal assistants, and personalize various applications. We explore what entities and relations should be potentially included in a PRKG, how to extract them from various sources, and how to share PRKGs within a research group.
Sequence-based extractive summarisation for Scientific Articles
This paper presents the results of research on supervised extractive text summarisation for scientific articles. We show that a simple sequential tagging model based only on the text within a document achieves strong results compared to a simple classification model. Improvements can be achieved through additional sentence-level features, though these were minimal. Through further analysis, we show the potential of the sequential model to exploit the structure of the document, which depends on the academic discipline the document is from.
Assessing Network Representations for Identifying Interdisciplinarity
Many studies have sought to identify interdisciplinary research as a
Introduction by the Workshop Organizers
Keynote Talk by Xing Chen: Adaptively Offloading the Software for Mobile Edge Computing
No abstract available
Word Embedding based Heterogeneous Entity Matching on Web of Things
Web of Things (WoT) is capable of promoting knowledge discovery and addressing interoperability problems of diverse Internet of Things (IoT) applications. However, due to the dynamic and diverse features of data entities on WoT, heterogeneous entity matching has become arguably the greatest "new frontier" for WoT advancements. Currently, the data entities and the corresponding knowledge on WoT are generally modelled with ontologies, and therefore, matching heterogeneous data entities on WoT can be converted to the problem of matching ontologies. Ontology matching is a complex cognitive process, and it is usually done manually by domain experts. To effectively distinguish heterogeneous entities and determine high-quality ontology alignments, this work proposes a word embedding based matching technique. Our approach models a word's semantics in the vector space and uses the angle between two vectors to measure the similarity of the corresponding words. In addition, the word embedding approach does not depend on a specific knowledge base and retains the rich semantic information of words, which makes our proposal more robust. The experiments use the Ontology Alignment Evaluation Initiative (OAEI) benchmark for testing, and the experimental results show that our approach outperforms other advanced matching methods.
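A minimal sketch of angle-based matching, assuming each ontology entity already has an embedding vector for its label; the similarity threshold is a hypothetical value, not one from the paper:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_entities(src_vecs, tgt_vecs, threshold=0.8):
    """Align each source-ontology entity with its most similar target-ontology
    entity, keeping only pairs above a similarity threshold. Both inputs map
    entity labels to embedding vectors (hypothetical format)."""
    alignment = {}
    for src_label, src_vec in src_vecs.items():
        best = max(tgt_vecs, key=lambda t: cosine(src_vec, tgt_vecs[t]))
        if cosine(src_vec, tgt_vecs[best]) >= threshold:
            alignment[src_label] = best
    return alignment
```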
Discovering Top-k Profitable Patterns for Smart Manufacturing
In the past, many studies were developed to discover useful knowledge from rich data for decision making in wide-ranging Internet of Things (IoT) and Web of Things (WoT) applications, such as smart manufacturing. Utility-driven pattern mining (UPM) is a well-known technology in the knowledge discovery domain. However, one of the biggest issues of UPM is the setting of a suitable minimum utility threshold (minUtil). The higher minUtil is, the fewer interesting patterns are obtained; conversely, the lower minUtil is, the more useless patterns are produced. In this paper, we propose a solution for discovering top-$k$ profitable patterns with the average-utility measure, which can be applied to manufacturing. The average utility divides the utility of a pattern by its length so as to measure the pattern fairly. The proposed new upper bounds on average utility are tighter than previous upper bounds. Moreover, based on these upper bounds, the novel algorithm utilizes merging and projection approaches to greatly reduce the search space. By adopting several threshold-raising strategies, the proposed algorithm can discover the correct top-$k$ patterns in a short time. We also evaluated the efficiency and effectiveness of the algorithm on real and synthetic datasets. The experimental results reveal that the algorithm not only obtains a complete set of top-$k$ interesting patterns, but also works better than the state-of-the-art algorithm in terms of runtime, memory consumption, and scalability. In particular, the proposed algorithm performs very well on dense datasets.
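A minimal sketch of the ranking criterion only (average utility as utility divided by pattern length), assuming itemset utilities are already available; the actual algorithm avoids full enumeration by pruning with tighter upper bounds:

```python
import heapq

def top_k_average_utility(itemset_utilities, k):
    """Rank patterns by average utility (total utility divided by pattern length)
    and keep the k best. `itemset_utilities` maps an itemset (tuple of items)
    to its total utility in the database (hypothetical, pre-computed input)."""
    return heapq.nlargest(k, itemset_utilities.items(),
                          key=lambda kv: kv[1] / len(kv[0]))
```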
Fast RFM Model for Customer Segmentation
With booming e-commerce and the World Wide Web (WWW), a powerful tool in customer relationship management (CRM), called the RFM analysis model, has been used to ensure that major enterprises make more profit. Combined with data mining technologies, the CRM system can automatically predict the future behavior of customers to raise the customer retention rate. However, a key issue is that the RFM analysis model must obtain private information, such as the ID and IP address of the customer, which is not safe and may be illegal. Thus, in this study, a novel algorithm based on a compact list-based data structure is proposed, along with several efficient pruning strategies, to address this issue. The new algorithm considers recency (R), frequency (F), and monetary/utility (M) as three different thresholds to discover interesting patterns whose R, F, and M values are no less than the user-specified minimums. More significantly, the downward-closure property of the frequency and utility metrics is utilized to discover super-itemsets. Then, in an extensive experimental study, it is demonstrated that the algorithm outperforms state-of-the-art algorithms on various datasets. It is also demonstrated that the proposed algorithm performs well when considering the frequency metric alone.
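A minimal sketch of the threshold test that defines an interesting pattern in this setting, assuming per-pattern recency, frequency, and monetary values have already been computed; the input format is hypothetical:

```python
def rfm_interesting(patterns, min_r, min_f, min_m):
    """Keep patterns whose recency, frequency, and monetary/utility values all
    meet the user-specified minimums. `patterns` maps a pattern to an (r, f, m)
    triple (hypothetical, pre-computed input)."""
    return {p: (r, f, m) for p, (r, f, m) in patterns.items()
            if r >= min_r and f >= min_f and m >= min_m}
```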
Introduction by the Workshop Organizers
Keynote Talk by Prof. Khalid Al-Khatib: The Role of Users' Personality in Argumentation
Keynote Talk by Dr. Shereen Oraby: Personalized Style Conditioning and Conversational Assistants
No abstract available
Alexander Hambley, Yeliz Yesilada, Markel Vigo, Simon Harper: Optimising The Website Accessibility Conformance Evaluation Methodology (Best Communication Paper Candidate)
Idil Ece Trabzon, Furkan Yagiz, Elmas Eda Karadavut, Mahmoud Elhewahey, Sukru Eraslan, Yeliz Yesilada, Simon Harper: Framework for Experiential Transcoding of Web Pages with Scanpath Trend Analysis
Jorge Sassaki Resende Silva, André Pimenta Freire, Paula Christina Figueira Cardoso: When Headers are Not There: Design and User Evaluation of an Automatic Topicalisation and Labelling Tool to Aid the Exploration of Web Documents by Blind Users
Wajdi Aljedaani, Mohamed Wiem Mkaouer, Stephanie Ludi, Ali Ouni, Ilyes Jenhani: On the Identification of Accessibility Bug Reports in Open Source Systems
Candace Williams, Lilian de Greef, Edward Harris III, Leah Findlater, Amy Pavel, Cynthia Bennett: Toward Supporting Quality Alt Text in Computing Publications
Introduction
Enhancing Multilingual Accessibility of Question Answering over Knowledge Graphs
There are more than 7000 languages spoken today in the world. Yet, English dominates in many research communities, in particular in the field of Knowledge Graph Question Answering (KGQA). The goal of a KGQA system is to provide natural-language access to a knowledge graph. While many research works aim to achieve the best possible QA quality over English benchmarks, only a small portion of them focuses on providing these systems in a way that different user groups (e.g., speakers of different languages) may use them with the same efficiency (i.e., accessibility). To address this research gap, we investigate the multilingual aspect of the accessibility, which enables speakers of different languages (including low-resource and endangered languages) to interact with KGQA systems with the same efficiency.
Enhancing Query Answer Completeness with Query Expansion based on Synonym Predicates
Community-based knowledge graphs are generated following hybrid approaches, where human intelligence empowers computational methods to effectively integrate encyclopedic knowledge or provide a common understanding of a domain. Existing community-based knowledge graphs represent essential sources of knowledge for enhancing the accuracy of data mining, information retrieval, question answering, and multimodal processing. However, despite the enormous effort of the contributing communities, community-based knowledge graphs may be incomplete and integrate duplicated data and metadata. We tackle the problem of enhancing query answering over incomplete community-based knowledge graphs by proposing an efficient query processing approach to estimate answer completeness and increase the results. It assumes that community-based knowledge graphs comprise synonym predicates that complement the knowledge graph triples required to raise query answer completeness. The aim is to propose a novel query expansion method
Personal Knowledge Graphs: Use cases in e-learning platforms
Personal Knowledge Graphs (PKGs) are introduced by the semantic web community as small-sized user-centric knowledge graphs (KGs). PKGs fill the gap of personalised representation of user data and interests on top of big, well-established encyclopedic KGs, such as DBpedia. Inspired by the recent widespread usage of PKGs in the medical domain to represent patient data, this PhD proposal aims to adopt a similar technique in the educational domain in e-learning platforms by deploying PKGs to represent users and learners. We propose a novel PKG development that relies on ontology and interlinks to Linked Open Data, hence adding the dimension of personalisation and explainability to users' featured data while respecting privacy. This research design is developed in two use cases: a collaborative search learning platform and an e-learning platform. Our preliminary results show that e-learning platforms can benefit from our approach by providing personalised recommendations and more user- and group-specific data
Towards Automated Technologies in the Referencing Quality of Wikidata
Wikidata is a general-purpose knowledge graph with the content being crowd-sourced through an open wiki, along with bot accounts. Currently, there are over 95 million interrelated data items and more than 1 billion statements in Wikidata accessible through a public SPARQL endpoint and different dump formats. The Wikidata data model enables assigning references to every single statement. Due to the rapid growth of Wikidata, the quality of Wikidata references is not well covered in the literature. To cover the gap, we suggest using automated tools to verify and improve the quality of Wikidata references. For verifying reference quality, we develop and implement a comprehensive referencing assessment framework based on Data Quality dimensions and criteria. To improve reference quality, we use Relation Extraction methods to establish a reference-suggesting framework for Wikidata. During the research, we managed to develop a subsetting approach to create a comparison platform and handle the big size of Wikidata. W
Incorporating External Knowledge for Evidence-based Fact Verification
The recent success of pre-trained language models such as BERT has led to its application in evidenced-based fact verification. While existing works employ these models for the contextual representation of evidence sentences to predict whether a claim is supported or refuted, they do not
Towards Analyzing the Bias of News Recommender Systems Using Sentiment and Stance Detection
News recommender systems are used by online news providers to alleviate information overload and to provide personalized content to users. However, algorithmic news curation has been hypothesized to create filter bubbles and to intensify users' selective exposure, potentially increasing their vulnerability to polarized opinions and fake news. In this paper, we use stance detection and sentiment analysis to annotate a German news corpus. We show that those annotations can be utilized to quantify the extent to which recommender systems suffer from stance and sentiment bias. In an experimental evaluation with four different recommender systems, our results show a slight tendency of all four models for recommending articles with negative sentiments and stances against the topic of refugees and migration. Moreover, we observed a positive correlation between the sentiment and stance bias of the content-based recommenders and the preexisting user bias, which indicates that these systems amplify users' opinions and decrease the diversity of recommended news. The knowledge-aware model appears to be the least prone to such biases, at the cost of predictive accuracy.
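One simple way to operationalize the bias measurement described above is to compare the mean stance or sentiment score of the recommended articles with the corpus mean; a minimal sketch, with hypothetical inputs:

```python
import numpy as np

def stance_sentiment_bias(recommended_scores, corpus_scores):
    """Bias of a recommender measured as the difference between the mean
    stance or sentiment score of its recommended articles and the corpus
    mean (one simple operationalization, not necessarily the paper's)."""
    return float(np.mean(recommended_scores) - np.mean(corpus_scores))

# For sentiment scores in [-1, 1], a negative value indicates a tendency to
# recommend more negative articles than the corpus average.
```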
Geotagging TweetsCOV19: Enriching a COVID-19 Twitter Discourse Knowledge Base with Geographic Information
Various aspects of the recent COVID-19 outbreak have been extensively discussed on online social media platforms and, in particular, on Twitter. Geotagging COVID-19-related discourse data on Twitter
Introduction by the Workshop Organizers
Conciseness, interest, and unexpectedness: user attitudes towards infographic and comic consent mediums
Being asked to consent to data sharing is a ubiquitous experience in digital services - yet it is very rare to encounter a well designed consent experience. Considering the momentum and importance of a European data space where personal information freely and easily flows across organizations, sectors and Member States, solving the long-discussed thorny issue of "how to get consent right" cannot be postponed any further. In this paper, we describe the first findings from a study based on 24 semi-structured interviews investigating participants' expectations and opinions toward consent in a data sharing scenario with a data trustee. We analyzed various dimensions of a consent form redesigned as a comic and an infographic, including language, information design, content and the writer-reader relationship. The results provide insights into the complexity of elements that should be considered when asking individuals to freely and mindfully disclose their data, especially sensitive information.
Internalization of privacy externalities through negotiation: Social costs of third-party web-analytic tools and the limits of the legal data protection framework
Tools for web analytics such as Google Analytics are implemented across the majority of websites. In most cases, their use is free of charge for website owners. However, to use those tools to their full potential, it is necessary to share the collected personal data of the users with the tool provider. This paper examines whether this constellation of data collection and sharing can be interpreted as a consumption externality in the sense of welfare economic theory. As this is shown to be the case, the further analysis examines whether the current technical and legal framework allows for an internalization of this externality through means of negotiation. It is illustrated that an internalization through negotiation is highly unlikely to succeed because of the existence of information asymmetries, transaction costs, and improper means for the enforcement of rights of disposal. It is further argued that even though some of these issues are addressed by data protection laws, the legal framework does not ensure the market situation necessary for a successful internalization. As a result, the externalities caused by data collection through third-party web-analytics tools continue to exist. This leads to an inefficiently high use of third-party web-analytics tools by website owners.
Introduction by the Workshop Organizers
Invited Talk by Michael Bronstein: Graph Neural Networks: Trends and Open Problems
Traffic Accident Prediction using Graph Neural Networks: New Datasets and the TRAVEL Model
TeleGraph: A Benchmark Dataset for Hierarchical Link Prediction
Benchmarking Large-Scale Graph Training Over Effectiveness And Efficiency
A Unified Framework for Rank-based Evaluation Metrics for Link Prediction in Knowledge Graphs
KGTuner: Efficient Hyper-parameter Search for Knowledge Graph Learning
Anonymous Hyperlocal Communities: What do they talk about?
In this paper, we study what users talk about in hyperlocal and anonymous online communities in Saudi Arabia (KSA).
Predicting Spatial Spreading on Social Media
The understanding and prediction of spreading phenomena is vital for numerous applications. The huge availability of social network data provides a platform for studying spreading phenomena. Past works studying and predicting spreading phenomena have explored the spread in the dimensions of time and volume, such as predicting the total number of infected users, predicting popularity, or predicting the time at which a piece of information reaches a threshold number of infected users. However, as information spreads from user to user, it also spreads from location to location. In this paper, we attempt to predict the spread in the dimension of geographic space. In accordance with past spreading prediction problems, we also design our problem to predict the spatial spread at an early stage. For this we utilize user-based features, textual features, emotion features, and geographical features. We feed these features into existing classification algorithms and evaluate them on three datasets from Twitter.
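A minimal sketch of the prediction setup, assuming the early-stage features have already been concatenated into one fixed-length vector per post; logistic regression stands in for the "existing classification algorithms" and is not necessarily what the paper uses:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_spread_classifier(features, labels):
    """Fit a standard classifier on early-stage feature vectors (user, textual,
    emotion, and geographic signals concatenated per post) to predict whether
    a post spreads beyond its origin region. Feature extraction is omitted;
    inputs are hypothetical."""
    X = np.asarray(features)
    y = np.asarray(labels)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)
    return clf
```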
Discussion
Wrap-up
GraphCite: Citation Intent Classification in Scientific Publications via Graph Embeddings
Citations are crucial in scientific works as they help position a new publication. Each citation carries a particular intent, for example, to highlight the importance of a problem or to compare against results provided by another method. The authors' intent when making a new citation has been studied to understand the evolution of a field over time or to make recommendations for further citations. In this work, we address the task of citation intent prediction from a new perspective. In addition to textual clues present in the citation phrase, we also consider the citation graph, leveraging high-level information of citation patterns. In this novel setting, we perform a thorough experimental evaluation of graph-based models for intent prediction and we show they improve significantly upon models that take into consideration only the citation phrase.
Examining the ORKG towards Representation of Control Theoretic Knowledge – Preliminary Experiences and Conclusions
Control theory is an interdisciplinary academic domain which contains sophisticated elements from various sub-domains of both mathematics and engineering. The issue of knowledge transfer thus poses a considerable challenge w.r.t. transfer between researchers focusing on different niches as well as w.r.t. transfer into potential application domains. The paper investigates the Open Research Knowledge Graph (ORKG) as medium to facilitate such knowledge transfer. The main results are a) a list of proposed best practices and b) a list of structural improvement suggestions.
SciNoBo: A Hierarchical Multi-Label Classifier of Scientific Publications
Classifying scientific publications according to Field-of-Science
Beyond reproduction, experiments want to be understood
Position paper.
Semi-automated Literature Review for Scientific Assessment of Socioeconomic Climate Change Scenarios
Climate change is now recognized as a global threat, and the literature surrounding it continues to increase exponentially. Expert bodies such as the Intergovernmental Panel on Climate Change (IPCC) are tasked with periodically assessing the literature to extract policy-relevant scientific conclusions that might guide policymakers. However, concerns have been raised that climate change research may be too voluminous for traditional literature review to adequately cover. It has been suggested that practices for literature review for scientific assessment be updated/augmented with semi-automated approaches from bibliometrics or scientometrics. In this study, we explored the feasibility of such recommendations for the scientific assessment of literature around socioeconomic climate change scenarios, so-called Shared Socioeconomic Pathways (SSPs).
A Study of Computational Reproducibility using URLs Linking to Open Access Datasets and Software
Datasets and software packages are considered important resources that can be used for replicating computational experiments. With the advocacy of Open Science and the growing interest of investigating reproducibility of scientific claims, including URLs linking to publicly available datasets and software packages has been an institutionalized part of research publications. In this preliminary study, we investigated the disciplinary dependencies and chronological trend of including open-access datasets and software (OADS) in electronic theses and dissertations (ETDs), based on a hybrid classifier called OADSClassifier, consisting of a heuristic and a supervised learning model. The classifier achieves a best F1 of 0.92. We found that the inclusion of OADS URLs exhibited a strong disciplinary dependence and the fraction of ETDs containing OADS URLs has been gradually increasing over the past 20 years. We developed and share a ground truth corpus consisting of 500 manually labeled sentences containing URLs from scientific papers. The datasets and source code are available at https://github.com/lamps-lab/oadsclassifier.
Mining with Rarity for Web Intelligence
Mining with rarity is a way to take advantage of data mining for Web intelligence. In some scenarios, rare patterns are meaningful for intelligent data systems. Interesting pattern discovery plays an important role in real-world applications, and a great deal of work has been done in this field. In general, a high-utility pattern may include frequent items as well as rare items. Rare pattern discovery has emerged gradually and helps policy-makers make related marketing strategies. However, the existing Apriori-like methods for discovering high-utility rare itemsets (HURIs) are not efficient. In this paper, we address the problem of mining with rarity and propose an efficient algorithm, named HURI-Miner, which uses a data structure called the revised utility-list to find HURIs from a transaction database. Furthermore, we utilize several powerful pruning strategies to prune the search space and reduce the computational cost. In the process of rare pattern mining, the HURIs are generated directly, without the generate-and-test method. Finally, a series of experimental results shows that the proposed method has superior effectiveness and efficiency.
A Hand Over and Call Arrival Cellular Signals-based Traffic Density Estimation Method
The growing number of vehicles has put a lot of pressure on the transportation system, and Intelligent Transportation Systems (ITS) face a great challenge from traffic congestion. Traffic density reflects the congestion of current traffic and thus explicitly indicates the traffic status. With the development of communication technology, people use mobile stations (MSs) at any time and cellular signals are everywhere. Different from traditional traffic information estimation methods based on the global positioning system (GPS) and vehicle detectors (VD), this paper resorts to Cellular Floating Vehicle Data (CFVD) to estimate the traffic density. In CFVD, handover (HO) and call arrival (CA) cellular signals are essential for estimating traffic flow and traffic speed. In addition, a mixture probability density distribution generator is adopted to assist in estimating the probabilities of HO and CA events. Through accurate traffic flow and traffic speed estimation, precise traffic density is achieved. In the simulation experiments, the proposed method achieves estimation MAPEs of 11.92%, 13.97%, and 16.47% for traffic flow, traffic speed, and traffic density, respectively.
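A minimal sketch of the underlying relations (flow, speed, and the fundamental identity density = flow / speed), assuming flow is estimated from HO/CA event counts and speed from the time between consecutive handovers on a segment of known length; all inputs are hypothetical:

```python
def speed_from_handovers(segment_length_km, handover_interval_hours):
    """Speed of a tracked phone estimated from the time between two consecutive
    handover events on a road segment of known length (simplified CFVD view)."""
    return segment_length_km / handover_interval_hours

def estimate_density(flow_vehicles_per_hour, speed_km_per_hour):
    """Fundamental relation of traffic flow theory: density (vehicles/km)
    equals flow divided by space-mean speed."""
    return flow_vehicles_per_hour / speed_km_per_hour
```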
Fraship: A Framework to Support End-User Personalization of Smart Home Services with Runtime Knowledge Graph
With the continuous popularization of smart home devices, people often anticipate using different smart devices through natural language instructions and require personalized smart home services. However, existing challenges include the interoperability of smart devices and a comprehensive understanding of the user environment. This study proposes Fraship, a framework supporting smart home service personalization for end-users. It incorporates a runtime knowledge graph acting as a bridge between users' language instructions and the corresponding operations of smart devices. The runtime knowledge graph is used to reflect contextual information in a specific smart home, based on which a language-instruction parser is proposed to allow users to manage smart home devices and services in natural language. We evaluated Fraship on a real-world smart home. Our results show that Fraship can effectively manage smart home devices and services based on the runtime knowledge graph, and it recognizes instructions more accurately than other approaches.
MI-GCN: Node Mutual Information-based Graph Convolutional Network
Graph Neural Networks (GNNs) have been widely used in various tasks for processing graphs and complex network data. However, recent studies show that GNNs may fail to effectively exploit the structural topology and the characteristics of the nodes in a graph, or even fail to use node information at all. This weakness in node embedding aggregation and delivery may severely affect the ability of GNNs to classify nodes. To overcome this issue, we propose a novel node Mutual Information-based Graph Convolutional Network (MI-GCN) for semi-supervised node classification. First, we analyze the node information entropy that measures the importance of nodes in a complex network, and further define the node joint information entropy and node mutual information in graph data. Then, we use node mutual information to strengthen the ability of GNNs to fuse node structure information. Extensive experiments demonstrate that our MI-GCN not only retains the advantages of the most advanced GNNs, but also improves the ability to fuse node structure information. MI-GCN achieves superior node classification performance compared to several baselines on real-world multi-type datasets, under both fixed and random data splits.
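A minimal sketch of one plausible instantiation of the entropy and mutual-information quantities, assuming the node information entropy is computed from the degree distribution of a node's (non-empty) neighborhood; the paper's exact definitions may differ:

```python
import numpy as np

def node_entropy(neighbor_degrees):
    """Entropy of the normalized degree distribution in a node's neighborhood
    (one plausible stand-in for the paper's node information entropy)."""
    degrees = np.asarray(neighbor_degrees, dtype=float)
    p = degrees / degrees.sum()
    return float(-np.sum(p * np.log2(p)))

def mutual_information(h_u, h_v, h_uv):
    """I(u; v) = H(u) + H(v) - H(u, v), usable as an edge weight that
    strengthens aggregation between structurally related nodes."""
    return h_u + h_v - h_uv
```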
A Spatio-Temporal Data-Driven Automatic Control Method for Smart Home Services
With the rapid development of smart home technologies, various smart devices have entered and brought convenience to people's daily life. Meanwhile, higher demands for smart home services have gradually emerged, which cannot be well satisfied by using traditional service provisioning manners. This is because traditional smart home control systems commonly rely on manual operations and fixed rules, which cannot satisfy changeable user demands and may seriously degrade the user experience. Therefore, it is necessary to capture user preferences based on their historical behavior data. To address the above problems, a temporal knowledge graph is first proposed to support the acquisition of user-perceived environmental data and user behavior data. Next, a user-centered smart home service prediction model is designed based on the temporal knowledge graph, which can predict the service status and automatically perform the corresponding service for each user. Finally, a prototype system is built according to a real-world smart home environment. The experimental results show that the proposed method can provide personalized smart home services and well satisfy user demands.
Wrap-up
Concept Annotation from Users Perspective: a New Challenge
Text data is highly unstructured and can often be viewed as a complex representation of different concepts, entities, events, sentiments, etc. For a wide variety of computational tasks, it is thus very important to annotate text data with the associated concepts/entities, which can put some initial structure/index on raw text data. However, it is not feasible to manually annotate a large amount of text, raising the need for automatic text annotation. In this paper, we focus on concept annotation in text data from the perspective of real-world users. Concept annotation is not a trivial task and its utility often relies heavily on the preference of the user. Despite significant progress in natural language processing research, we still lack a general-purpose concept annotation tool which can effectively serve users from a wide range of application domains. Thus, further investigation is needed from a user-centric point of view to design an automated concept annotation tool that will ensure maximum utility to its users. To achieve this goal, we created a benchmark corpus of two real-world data-sets, i.e., the "News Concept Data-set" and the "Medical Concept Data-set", to introduce the notion of user-oriented concept annotation and provide a way to evaluate this task. The term "user-centric" means that the desired concepts are defined as well as characterized by the users themselves. Throughout the paper, we describe the details of how we created the data-sets, the unique characteristics of each data-set, how these data-sets reflect real users' perspective on the concept annotation task, and finally, how they can serve as a great resource for future research on user-centric concept annotation.
Detecting Addiction, Anxiety, and Depression by Users Psychometric Profiles
Mental, neurological, and behavioral disorders are among the most common health issues in all countries. Detecting and characterizing people with mental disorders is an important task that could help the work of different healthcare professionals. Sometimes having a diagnosis for specific mental disorders requires a long time. This might be a problem because a diagnosis can give access to various support groups, treatment programs, and medications that might help the patients. In this paper, we study the problem of exploiting supervised learning approaches, based on users' psychometric profiles extracted from Reddit posts, to detect users dealing with Addiction, Anxiety, and Depression disorders. The empirical evaluation shows a very good predictive power of the psychometric profile and that features capturing the content of the post are more effective for the classification task rather than features describing the user writing style. We achieve an accuracy of 96% using the entire psychometric profile and an accuracy of 95% when we exclude from the user profile linguistic features.
Expressing Metaphorically, Writing Creatively: Metaphor Identification for Creativity Assessment in Writing
Metaphor, which can implicitly express profound meanings and emotions, is a unique writing technique frequently used in human language. In writing, meaningful metaphorical expressions can enhance the literariness and creativity of texts. Therefore, the usage of metaphor is a significant factor when assessing the creativity and literariness of writing. In particular, the assessment of creativity will not be accurate enough without metaphor identification in automatic writing assessment systems. However, few if any automatic writing assessment systems consider metaphorical expressions when scoring creativity. To improve the accuracy of automatic writing assessment, this paper proposes a novel creativity assessment model that incorporates a token-level metaphor identification method to extract metaphors as indicators for creativity scoring. The experimental results show that our model can accurately assess the creativity of different texts with precise metaphor identification. To the best of our knowledge, we are the first to apply automatic metaphor identification to assess writing creativity. Moreover, identifying features (e.g., metaphors) that influence writing creativity using computational approaches can offer fair and reliable assessment methods for educational settings.
A decision model for designing NLP applications
Among the uses of NLP models, we notice that some programs provide multiple output options, while others offer only a single result to end-users. However, there is little research on the situations in which providing multiple outputs from NLP models benefits the user experience. Therefore, in this paper, we summarize the progress of NLP applications that show parallel outputs from an NLP model to users at once, and then present a decision model that can decide whether a given condition is suitable for showing multiple outputs from the NLP model at once. We hope developers and UX designers can use the decision model to create an easy-to-use interface that presents numerous results from an NLP model at once. Moreover, we hope future researchers can reference the decision model in this paper to explore the potential of other NLP model usages that show parallel outputs at once, to create a more satisfactory user experience.
Do Not Read the Same News! Enhancing Diversity and Personalization of News Recommendation
Personalized news recommendation by machine is a widely studied area. Due to information overload, it is impossible for humans to select articles by reading them in full. Thus, the goal of news recommendation is to provide relevant news based on the user's interest. However, only a few implicit interactions, such as click histories, are available in news recommendation systems, which leads to recommendation results biased towards generally popular articles. In this paper, we suggest a novel news recommendation model for higher personalization. If a user reads news that is not widely clicked by others, then that news reflects the user's personal interest more than other, popular clicked news does. Based on this idea, we implement two user encoders, one to encode the general interest of all users and another to encode the user's individual interest. We also propose novel regularization methods for personalized recommendation. Experiments on real-world data show that our proposed method improves the diversity and the quality of recommendations for different click histories without any significant performance drop.
Keynote Talk by Julio Abascal: Web Accessibility and beyond in eGovernment: Does Web Accessibility ensure Accessibility to Administration's Websites?
Keynote Talk by Angela Bonifati(Université Claude Bernard Lyon I): Query-driven Graph Processing
No abstract available
Predicting SPARQL Query Dynamics
The SPARQL language is the recommendation for querying Linked Data, but querying SPARQL endpoints has problems with performance, particularly when clients remotely query SPARQL endpoints over the Web. Traditionally, caching techniques have been used to deal with performance issues by allowing the reuse of intermediate data and results across different queries. However, the resources in Linked Data represent real-world things which change over time. The resources described by these datasets are thus continuously created, moved, deleted, linked, and unlinked, which may lead to stale data in caches. This situation is more critical in the case of applications that consume or interact intensively with Linked Data through SPARQL, including query engines and browsers that constantly send expensive and repetitive queries. Applications that leverage Linked Data could benefit from knowledge about the dynamics of changing query results to efficiently deliver accurate services, since they could refresh at least the dynam
Canonicalisation of SPARQL 1.1 Queries
SPARQL is the standard query language for RDF as specified by the W3C. It is a highly expressive query language that contains the standard operations based on set algebra, as well as navigational operations found in graph query languages. Because of this, there are various ways to represent the same query, which may lead to redundancy in Semantic Web applications such as caching systems. We propose a canonicalisation method that is sound for the entire SPARQL 1.1 language and complete for the monotone fragment. Our results thus far indicate that although the method is inefficient in the worst case, in practice it takes only a fraction of a second for most real-world queries and scales better than naive methods for certain applications.
Introduction by the Workshop Organizers
Round Table
One year of DALIDA Data Literacy Workshops for Adults: a Report
In May 2021, during the 2nd Data Literacy Workshop, we reported on DALIDA, a project to design and deliver data literacy discussion workshops for adults in Ireland. While open to everyone, the target audience is adults from socially, economically, or educationally disadvantaged groups. The co-creation element in designing workshops was thus key to ensuring that the workshops appealed to that audience. We previously reported on the project and the results of the co-creation workshops. Now, almost a year later, we report on the delivery of these workshops. This experience paper describes the workshop's structure, elaborates on our challenges (primarily due to the pandemic), and details some of the lessons we've learned. We also present the findings of our participant evaluations. The most important lesson we've learned is that a collaboration between scholars and education and public engagement teams (EPE), where both stakeholders approach the projects as equals, is crucial for successful projects.
Towards benchmarking data literacy
The term data literacy is growing in presence in society. Until recently, most of the focus has been on equipping people with the skills to use data. However, the increasing impact that data has on society has demonstrated the need for a different approach, one where people are able to understand and think critically about how data is collected, used, and shared. Going beyond definitions, in this paper we present research into benchmarking data literacy through self-assessment, based on the creation of a set of adult data literacy levels. Although the work highlights the limitations of self-assessment, there is clear potential to build on the definitions to create IQ-style tests that help boost critical thinking and demonstrate the importance of data literacy education.
No abstract available
Methodology to Compare Twitter Reaction Trends between Disinformation Communities, to COVID related Campaign Events at Different Geospatial Granularities
There is an immediate need for a deeper understanding of how discussions in Twitter disinformation communities are triggered, and for monitoring such Twitter reactions alongside timelines of major relevant events. In this short paper, we present a novel way to quantify, compare, and relate two Twitter disinformation communities in terms of their reaction patterns to the timelines of major campaign events. Timelines of both NPI (Nonpharmaceutical Intervention) campaigns and disinformation campaigns are considered together, and we analyze these campaigns at three geospatial granularities: local county, state, and country/federal. We collected a novel dataset of campaigns (NPI + disinformation) at these different geospatial granularities and, together with a collected dataset of Twitter disinformation communities, conducted a case study to validate the effectiveness of our proposed algorithm.
Towards Building Live Open Scientific Knowledge Graphs
Due to the large number and heterogeneity of scientific data sources, it becomes increasingly difficult to follow the scientific discourse. For example, a publication available from DBLP may be discussed on Twitter, and its underlying data set may be available from a different venue such as arXiv. The scientific discourse a publication is involved in is thus spread across non-integrated sites, and it is very hard for researchers to follow all the discourses a publication or data set may be part of. Moreover, many of these data sources (DBLP, arXiv, Papers with Code, or Twitter, to name a few) are updated in real time. These systems are not integrated (they are silos), and there is no system that lets users actively query their content or, what would be even more beneficial, subscribe to it in a publish/subscribe fashion, i.e., a system that actively notifies researchers of interesting work when such work or discussions become available. In this position paper, we introduce our concept of a live open knowledge graph that integrates an extensible set of existing or new data sources in a streaming fashion, continuously fetches data from these heterogeneous sources, and interlinks and enriches them on the fly. Users can subscribe to continuously query the content of interest and get notified when new content becomes available. We also highlight the open challenges in realizing a system that enables this concept at scale.
A Policy-Oriented Architecture for Enforcing Consent in Solid
The Solid project aims to restore end-users' control over their data by decoupling services and applications from data storage. The existing Solid model for complete data governance by the user is realized through access control languages with limited expressivity and interpretability. In contrast, recent privacy and data protection regulations impose strict requirements on data processing applications and the scope of their operation. Solid's current access control mechanism lacks the granularity and contextual awareness needed to enforce these regulatory requirements. Therefore, we suggest an architecture for relating Solid's low-level technical access control rules with higher-level concepts such as the legal basis and purpose for data processing, the abstract types of information being processed, and the data sharing preferences of the data subject. Our architecture combines recent technical efforts by the Solid community panels with prior proposals made by researchers on the use of ODRL and SPECIAL policies as an extension to Solid's authorization mechanism. While our approach appears to avoid a number of pitfalls identified in previous research, further work is needed before it can be implemented and used in a practical setting.
Keynote Talk by Robin Berjon: Consent of the Governed
No abstract available
Invited Talk by Stephan Günnemann: Graph Neural Networks for Molecular Systems - Methods and Benchmarks
A Heterogeneous Graph Benchmark for Misinformation on Twitter
What's Wrong with Deep Learning in Tree Search for Combinatorial Optimization
A Content-First Benchmark for Self-Supervised Graph Representation Learning
Introduction by the Workshop Organizers
Keynote Talk by Nazli Goharian: NLP Applications in Mental Health
No abstract available
A large-scale temporal analysis of user lifespan durability on the Reddit social media platform
Social media platforms thrive upon the intertwined combination of user-created content and social interaction between these users.
"I’m always in so much pain and no one will understand" - Detecting Patterns in Suicidal Ideation on Reddit
Social media has become another venue for those struggling with
Introduction by the Workshop Organizers
Keynote Talk by Ichiro Ide: Tailoring applications to users through multi-modal understanding
Invited Talk by Chiaoi Tseng: Multimodal Discourse Approach to Narrative Strategies of Online News Videos
Visual Persuasion in COVID-19 Social Media Content: A Multi-Modal Characterization
Social media content routinely incorporates multi-modal design to convey information, shape meanings, and sway interpretations toward desirable implications, but the choices and outcomes of using both text and visual images have not been sufficiently studied. This work proposes a computational approach to analyzing the outcome of persuasive information in multi-modal content, focusing on two aspects, popularity and reliability, in COVID-19-related news articles shared on Twitter. The two aspects are intertwined in the spread of misinformation: for example, an unreliable article that aims to misinform has to attain some popularity. This work makes several contributions. First, we propose a multi-modal (image and text) approach to effectively identify the popularity and reliability of information sources simultaneously. Second, we identify textual and visual elements that are predictive of information popularity and reliability. Third, by modeling cross-modal relations and similarity, we are able to uncover how unreliable articles construct multi-modal meaning in a distorted, biased fashion. Our work demonstrates how to use multi-modal analysis to understand influential content and has implications for social media literacy and engagement.
Keynote Talk by Jason Priem: OpenAlex: An open and comprehensive index of scholarly works, citations, authors, institutions, and more
Keynote Talk by Alex Wade: The Semantic Scholar Academic Graph (S2AG)
No abstract available
Concept Annotation from Users Perspective: a New Challenge
Text data is highly unstructured and can often be viewed as a complex representation of different concepts, entities, events, sentiments, etc. For a wide variety of computational tasks, it is thus very important to annotate text data with the associated concepts/entities, which can put some initial structure/index on raw text data. However, it is not feasible to manually annotate a large amount of text, raising the need for automatic text annotation. In this paper, we focus on concept annotation in text data from the perspective of real-world users. Concept annotation is not a trivial task, and its utility often relies heavily on the preferences of the user. Despite significant progress in natural language processing research, we still lack a general-purpose concept annotation tool that can effectively serve users from a wide range of application domains. Thus, further investigation is needed from a user-centric point of view to design an automated concept annotation tool that ensures maximum utility to its users. To achieve this goal, we created a benchmark corpus of two real-world data-sets, i.e., a "News Concept Data-set" and a "Medical Concept Data-set", to introduce the notion of user-oriented concept annotation and provide a way to evaluate this task. The term "user-centric" means that the desired concepts are defined as well as characterized by the users themselves. Throughout the paper, we describe how we created the data-sets, the unique characteristics of each data-set, how these data-sets reflect real users' perspective on the concept annotation task, and finally, how they can serve as a great resource for future research on user-centric concept annotation.
Detecting Addiction, Anxiety, and Depression by Users Psychometric Profiles
Mental, neurological, and behavioral disorders are among the most common health issues in all countries. Detecting and characterizing people with mental disorders is an important task that could support the work of different healthcare professionals. Obtaining a diagnosis for a specific mental disorder can sometimes take a long time, which is a problem because a diagnosis gives access to various support groups, treatment programs, and medications that might help the patients. In this paper, we study the problem of exploiting supervised learning approaches, based on users' psychometric profiles extracted from Reddit posts, to detect users dealing with Addiction, Anxiety, and Depression disorders. The empirical evaluation shows a very good predictive power of the psychometric profile and that features capturing the content of the posts are more effective for the classification task than features describing the user's writing style. We achieve an accuracy of 96% using the entire psychometric profile and an accuracy of 95% when we exclude linguistic features from the user profile.
Expressing Metaphorically, Writing Creatively: Metaphor Identification for Creativity Assessment in Writing
Metaphor, which can implicitly express profound meanings and emotions, is a unique writing technique frequently used in human language. In writing, meaningful metaphorical expressions can enhance the literariness and creativity of texts. Therefore, the usage of metaphor is a significant impact factor when assessing the creativity and literariness of writing. Particularly, the assessment results of creativity will not be accurate enough without metaphor identification in automatic writing assessment systems. However, little to no automatic writing assessment system considers metaphorical expressions when giving the score of creativity. For improving the accuracy of automatic writing assessment, this paper proposes a novel creativity assessment model that imports a token-level metaphor identification method to extract metaphors as the indicators for creativity scoring. The experimental results show that our model can accurately assess the creativity of different texts with precise metaphor identification. To the best of our knowledge, we are the first to apply automatic metaphor identification to assess writing creativity. Moreover, identifying features (e.g., metaphors) that influence writing creativity using computational approaches can offer fair and reliable assessment methods for educational settings.
Brief report about the Doctoral Consortium by DC Chairs
Tlamelo Makati: Machine Learning for Accessible Web Navigation
William C. Payne: Sounds and (Braille) Cells: Co-Designing Music Technology with Blind and Visually Impaired Musicians
Closing Q&A
Keynote Talk by Benjamin Glicksberg (Mount Sinai)
Keynote Talk by Faisal Mahmood (Harvard)
Keynote Talk by Paul Varghese (Verily)
Keynote Talk by Mingchen Gao (U of Buffalo)
User Access Models to Event-Centric Information
Events such as terrorist attacks and Brexit play an important role in the research of social scientists and Digital Humanities researchers. They need innovative user access models to event-centric information which support them throughout their research. Current access models such as recommendation and information retrieval methods often fail to adequately capture essential features of events and provide acceptable access to event-centric information. This PhD research aims to develop efficient and effective user access models to event-centric information by leveraging well-structured information in Knowledge Graphs. The goal is to tackle the challenges researchers encounter during their research workflow, from exploratory search to accessing well-defined and complete information collections. This paper presents the specific research questions, the approach, and preliminary results.
Interactions in information spread
Large quantities of data flow on the internet. When a user decides to help spread a piece of information (by retweeting, liking, or posting content), most research assumes she does so according to the information's content, its publication date, the user's position in the network, the platform used, etc. However, there is another aspect that has received little attention in the literature: information interaction. The idea is that a user's choice is partly conditioned by the previous pieces of information she has been exposed to. In this document, we review the work done on interaction modeling and underline several aspects of interactions that complicate their study. Then, we present an approach well suited to address those challenges and detail a dedicated interaction model based on it. We show our approach fits the problem better than existing methods, and present leads for future work. Throughout the text, we show that taking interactions into account improves our comprehension of information spread.
Comprehensive Event Representations using Event Knowledge Graphs and Natural Language Processing
Recent work has utilized knowledge-aware approaches to natural language understanding, question answering, recommendation systems, and other tasks. These approaches rely on well-constructed and large-scale knowledge graphs that can be useful for many downstream applications and empower knowledge-aware models with commonsense reasoning. Such knowledge graphs are constructed through knowledge acquisition tasks such as relation extraction and knowledge graph completion. This work seeks to utilize and build on the growing body of work that uses findings from the field of natural language processing (NLP) to extract knowledge from text and build knowledge graphs. The focus of this research project is on how we can use transformer-based approaches to extract and contextualize event information, matching it to existing ontologies, to build comprehensive knowledge graph-based event representations. Specifically, sub-event extraction is used as a way of creating sub-event-aware event representations. These event repre
Geometric and Topological Inference for Deep Representations of Complex Networks
Understanding the deep representations of complex networks is an important step towards building interpretable and trustworthy machine learning applications in the age of the internet. Global surrogate models that approximate the predictions of a black box model (e.g. an artificial or biological neural net) are usually used to provide valuable theoretical insights for model interpretability. In order to evaluate how well a surrogate model can account for the representation in another model, we need to develop inference methods for model comparison. Previous studies have compared models and brains in terms of their representational geometries (characterized by the matrix of distances between representations of the input patterns in a model layer or cortical area). In this study, we propose to explore these summary statistical descriptions of representations in models and brains as part of a broader class of statistics that emphasize the topology as well as the geometry of representations. The topological summary st
Wrap-up
Towards digital economy through data literate workforce
In today's digital economy, data are part of everyone's work. Not only decision-makers but also average workers are expected to conduct data-based experiments, interpret data, and create innovative data-based products and services. In this endeavor, the entire workforce needs additional skills to thrive. This type of competence is collectively referred to as data literacy, and as such it is becoming one of the most valuable skills on the labor market. This paper aims to highlight the needs and shortcomings in terms of competencies for working with data, a critical factor for modern companies striving for digital transformation. Through systematic desk research spanning over 15 European countries, this paper sheds light on how data literacy is addressed in European higher education and professional training. In addition, our analysis uses results from an online survey conducted in 20 countries in Europe and North Africa. The results show that the most valuable data literacy competences of an employee are the ability to evaluate or reflect on data, and the skills related to reading or creating data classifications.
No abstract available
Discussion and Wrap-up
No abstract available
Wrap-up and Awards
Panel: The Web of Consent
Wrap-up
Invited Talk by Tina Eliassi-Rad: The Why, How, and When of Representations for Complex Systems
An Open Challenge for Inductive Link Prediction on Knowledge Graphs
An Explainable AI Library for Benchmarking Graph Explainers
Robust Synthetic GNN Benchmarks with GraphWorld
EXPERT: Public Benchmarks for Dynamic Heterogeneous Academic Graphs
Panel on Mental Health and Social Media
Utilizing Pattern Mining and Classification Algorithms to Identify Risk for Anxiety and Depression in the LGBTQ+ Community During the COVID-19 Pandemic
In this paper, we examine the results of pattern mining and decision trees applied to a dataset of survey responses about life for individuals in the LGBTQ+ community during COVID, which have the potential to be used as a tool to identify those at risk for anxiety and depression. The world was immensely affected by the pandemic in 2020 through 2022, and our study attempts to use the data from this period to analyze the impact on anxiety and depression. First, we used the FP-growth algorithm for frequent pattern mining, which finds groups of items that frequently occur together, and utilized the resulting patterns and measures to determine which features were significant when inspecting anxiety and depression. Then, we trained a decision tree with the selected features to classify if a person has anxiety or depression. The resulting decision trees can be used to identify individuals at risk for these conditions. From our results, we also identified specific risk factors that helped predict whether an individual was likely to experience anxiety and/or depression, such as satisfaction with their sex life, cutting meals, and worries of healthcare discrimination due to their gender identity or sexual orientation.
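A toy version of this two-step pipeline, with made-up survey columns standing in for the study's actual items and data, might look as follows (mlxtend for FP-growth, scikit-learn for the decision tree); it is only an illustration of the workflow, not the study's code.

# Illustrative sketch: frequent-pattern mining to surface co-occurring survey
# answers, then a shallow decision tree on the selected features.
# Column names and values below are invented placeholders, not the study's data.
import pandas as pd
from mlxtend.frequent_patterns import fpgrowth
from sklearn.tree import DecisionTreeClassifier

responses = pd.DataFrame({
    "cut_meals":          [1, 0, 1, 1, 0],
    "healthcare_worry":   [1, 1, 0, 1, 0],
    "sex_life_satisfied": [0, 1, 1, 0, 1],
    "anxiety":            [1, 0, 1, 1, 0],
})

# Step 1: FP-growth over one-hot survey answers finds frequently co-occurring items.
patterns = fpgrowth(responses.astype(bool), min_support=0.4, use_colnames=True)
print(patterns)  # inspect which itemsets co-occur with the anxiety indicator

# Step 2: train a decision tree on the features suggested by the mined patterns.
X = responses[["cut_meals", "healthcare_worry", "sex_life_satisfied"]]
y = responses["anxiety"]
clf = DecisionTreeClassifier(max_depth=3).fit(X, y)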
Supporting People Receiving Substance Use Treatment During COVID-19 Through A Professional-Moderated Online Peer Support Group
The COVID-19 pandemic exacerbated the ongoing opioid crisis in the United States. Individuals with a substance use disorder are vulnerable to relapse during times of acute stress. Online peer support communities (OPSCs) have the potential to decrease social isolation and increase social support for participants. In September 2020, we launched a private, professional-moderated OPSC using the Facebook Group platform to study its effects on the mental health wellness of women undergoing substance use treatment. This study was particularly meaningful as the participants were not able to join in-person treatment sessions due to the COVID-19 pandemic. Preliminary findings indicate that study participants reported decreased loneliness and increased online social support three months after initiating the OPSC. They tended to interact with content initiated by a clinical professional more than those generated by peers.
Invited Talk by Christian Otto: Characterization and Classification of Semantic Image-Text Relations
Leveraging Intra and Inter Modality Relationship for Multimodal Fake News Detection
Recent years have witnessed massive growth in the proliferation of fake news online. User-generated content is a blend of text and visual information, leading to different variants of fake news. As a result, researchers have started targeting multimodal methods for fake news detection. Existing methods capture high-level information from different modalities and jointly model them to make a decision. Given multiple input modalities, we hypothesize that not all modalities are equally responsible for decision-making. Hence, this paper presents a novel architecture that effectively identifies and suppresses information from weaker modalities and extracts relevant information from the strong modality on a per-sample basis. We also establish intra-modality relationships by extracting fine-grained image and text features. We conduct extensive experiments on real-world datasets to show that our approach outperforms the state of the art by an average of 3.05% and 4.525% on accuracy and F1-score, respectively. We will also release the code, implementation details, and model checkpoints for the community's interest.
Improving and Diagnosing Knowledge-Based Visual Question Answering via Entity Enhanced Knowledge Injection
Knowledge-Based Visual Question Answering (KBVQA) is a bi-modal task requiring external world knowledge in order to correctly answer a text question about an associated image. Recent single-modality text work has shown that knowledge injection into pre-trained language models, specifically entity-enhanced knowledge graph embeddings, can improve performance on downstream entity-centric tasks. In this work, we empirically study how and whether such methods, applied in a bi-modal setting, can improve an existing VQA system's performance on the KBVQA task. We experiment with two large publicly available VQA datasets, (1) KVQA, which contains mostly rare Wikipedia entities, and (2) OKVQA, which is less entity-centric and more aligned with common-sense reasoning. Both lack explicit entity spans, and we study the effect of different weakly supervised and manual methods for obtaining them. Additionally, we analyze how recently proposed bi-modal and single-modal attention explanations are affected by the incorporation of such entity-enhanced representations. Our results show substantially improved performance on the KBVQA task without the need for additional costly pre-training, and we provide insights into when entity knowledge injection helps improve a model's understanding. We provide our code and enhanced datasets for reproducibility.
ViNTER: Image Narrative Generation with Emotion-Arc-Aware Transformer
Image narrative generation describes the creation of stories regarding the content of image data from a subjective viewpoint. Given the importance of the subjective feelings of writers, characters, and readers in storytelling, image narrative generation methods must consider human emotion, which is their major difference from descriptive caption generation tasks. The development of automated methods to generate story-like text associated with images may be considered to be of considerable social significance, because stories serve essential functions both as entertainment and also for many practical purposes such as education and advertising. In this study, we propose a model called ViNTER (Visual Narrative Transformer with Emotion arc Representation) to generate image narratives that focus on time series representing varying emotions as ``emotion arcs,'' to take advantage of recent advances in multimodal Transformer-based pre-trained models. We present experimental results of both manual and automatic evaluations, which demonstrate the effectiveness of the proposed emotion-aware approach to image narrative generation.
Wrap-up
Panel with Alex Wade, Jason Priem and Natalia Manola
Wrap-up
Keynote Talk by Dr. Shiran Dudy: Personalization and Relevance in NLG
No abstract available
Keynote Talk by Dr. Maarten Sap: Endowing NLP Systems with Social Intelligence and Social Commonsense
Keynote Talk by Fei Wang (Weill Cornell)
Keynote Talk by Prithwish Chakraborty (IBM Research)
Keynote Talk by Pranav Rajpurkar (Harvard) and Adriel Saporta (Health AI at Apple)
Keynote Talk by Shamim Nemati (UC San Diego)
Panel: Applications of explainable AI in health
Moderator: Justin Rousseau
Panelists: Pranav Rajpurkar, Adriel Saporta, Shamim Nemati, Fei Wang, Benjamin Glicksberg, Faisal Mahmood, Paul Varghese, Prithwish Chakraborty and Mingchen Gao
Panel Discussion
Wrap-up
Keynote Talk by Munmun De Choudhury: Employing Social Media to Improve Mental Health: Pitfalls, Lessons Learned and the Next Frontier
Wrap-up
Panel
Wrap-up
Socialbots on Fire: Modeling Adversarial Behaviors of Socialbots via Multi-Agent Hierarchical Reinforcement Learning
Socialbots are software-driven user accounts on social platforms that act autonomously (mimicking human behavior) with the aim of influencing the opinions of other users or spreading targeted misinformation for particular goals. As socialbots undermine the ecosystem of social platforms, they are often considered harmful, and there have been several computational efforts to automatically detect them. However, to the best of our knowledge, the adversarial nature of these socialbots has not yet been studied. This begs the question: can adversaries, controlling socialbots, exploit AI techniques to their advantage? We successfully demonstrate that it is indeed possible for adversaries to exploit computational learning mechanisms such as reinforcement learning (RL) to maximize the influence of socialbots while avoiding detection. We first formulate adversarial socialbot learning as a cooperative game between two functional hierarchical RL agents. While one agent curates a sequence of activities that can avoid detection, the other agent aims to maximize network influence by selectively connecting with the right users. Our proposed policy networks are trained on a vast amount of synthetic graphs and generalize better than baselines on unseen real-life graphs, both in terms of maximizing network influence (up to +18%) and sustainable stealthiness (up to +40% undetectability) under a strong bot detector (with 90% detection accuracy). During inference, the complexity of our approach scales linearly, independent of a network's structure and the virality of news. This makes our approach a practical adversarial attack when deployed in a real-life setting.
MemStream: Memory-Based Streaming Anomaly Detection
Given a stream of entries over time in a multi-dimensional data setting where concept drift is present, how can we detect anomalous activities? Most of the existing unsupervised anomaly detection approaches seek to detect anomalous events in an offline fashion and require a large amount of data for training. This is not practical in real-life scenarios where we receive the data in a streaming manner and do not know the size of the stream beforehand. Thus, we need a data-efficient method that can detect and adapt to changing data trends, or concept drift, in an online manner. In this work, we propose MEMSTREAM, a streaming anomaly detection framework, allowing us to detect unusual events as they occur while being resilient to concept drift. We leverage the power of a denoising autoencoder to learn representations and a memory module to learn the dynamically changing trend in data without the need for labels. We prove the optimum memory size required for effective drift handling. Furthermore, MEMSTREAM makes use of two architecture design choices to be robust to memory poisoning. Experimental results show the effectiveness of our approach compared to state-of-the-art streaming baselines using 2 synthetic datasets and 11 real-world datasets.
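To make the encoder-plus-memory idea concrete, here is a minimal sketch of memory-based streaming scoring: each point is scored by its distance to a bounded memory of recent normal representations, and only low-scoring points are admitted into the memory so the detector can follow concept drift. The identity encoder and the simple FIFO update are illustrative placeholders, not the authors' denoising autoencoder or their theoretically derived memory size.

import numpy as np
from collections import deque

class StreamingMemoryDetector:
    # Sketch only: the encoder below is an identity placeholder standing in for
    # a trained (denoising) autoencoder, and the update rule is simplified.
    def __init__(self, memory_size=256, update_threshold=1.0):
        self.memory = deque(maxlen=memory_size)   # bounded FIFO memory of normal points
        self.update_threshold = update_threshold  # only low-score points enter the memory

    def encode(self, x):
        return np.asarray(x, dtype=float)         # placeholder for a learned representation

    def score(self, x):
        z = self.encode(x)
        if not self.memory:
            return 0.0
        # anomaly score: distance to the nearest memory entry
        return min(np.linalg.norm(z - m) for m in self.memory)

    def process(self, x):
        s = self.score(x)
        if s <= self.update_threshold:            # adapt to drift using normal points only
            self.memory.append(self.encode(x))
        return s

Feeding the stream through process() yields one score per entry; thresholding those scores flags anomalies while the memory tracks the drifting normal regime.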
Federated Unlearning via Class-Discriminative Pruning
We explore the problem of selectively forgetting categories from trained CNN classification models in federated learning (FL). Given that the data used for training cannot be accessed globally in FL, our insights probe deep into the internal influence of each channel. Through visualization of the feature maps activated by different channels, we observe that different channels contribute differently to different categories in image classification.
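As a rough illustration of selecting class-discriminative channels from such activation statistics (not the paper's actual pruning procedure), one could rank channels by how much more strongly they respond to the class to be forgotten than to the remaining classes:

import torch

@torch.no_grad()
def class_discriminative_channels(feature_maps, labels, target_class, top_k=16):
    # feature_maps: [N, C, H, W] activations from a convolutional layer on a probe set
    # labels:       [N] class labels; target_class is the category to forget
    per_sample = feature_maps.mean(dim=(2, 3))               # [N, C] mean activation per channel
    target = per_sample[labels == target_class].mean(dim=0)  # channel response for the target class
    others = per_sample[labels != target_class].mean(dim=0)  # channel response elsewhere
    discriminativeness = target - others                     # large values suggest class-specific channels
    return torch.topk(discriminativeness, k=top_k).indices   # candidate channels to prune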
ALLIE: Active Learning on Large-scale Imbalanced Graphs
Active learning reduces the cost of manual labeling by selecting the samples that contribute most to the model. Recent developments in active learning on graph data have shown promising results. However, existing methods commonly assume that the data/label distributions of graph data are balanced, which typically does not hold in real-world fraud detection scenarios. For example, abusive behaviors account for only a small portion compared with benign behaviors in online harassment, spam, or fake review detection tasks on e-commerce websites. Because of the low prevalence of positive samples in these scenarios, samples selected by existing graph-based active learning methods would mostly yield negative samples, resulting in limited model improvement on the positive class. Besides, selecting among all unlabeled samples in a large-scale dataset is inefficient.
An Accuracy-Lossless Perturbation Method for Defending Privacy Attacks in Federated Learning
Although federated learning improves the privacy of training data by exchanging local gradients or parameters rather than raw data, an adversary can still leverage local gradients and parameters to obtain local training data by launching reconstruction and membership inference attacks. To defend against such privacy attacks, many noise perturbation methods (such as differential privacy or CountSketch matrices) have been designed. However, the strong defence ability and high learning accuracy of these schemes cannot be ensured at the same time, which impedes the wide application of FL in practice (especially for medical or financial institutions that require both high accuracy and strong privacy guarantees). To overcome this issue, in this paper we propose \emph{an efficient model perturbation method for federated learning} to defend against reconstruction and membership inference attacks launched by curious clients. On the one hand, similar to differential privacy, our method selects random numbers as perturbation noise added to the global model parameters, and thus it is very efficient and easy to integrate in practice. Meanwhile, the randomly selected noise values are positive real numbers that can be arbitrarily large, so strong defence ability can be ensured. On the other hand, unlike differential privacy or other perturbation methods that cannot eliminate the added noise, our method allows the server to recover the true gradients by eliminating the added noise. Therefore, our method does not hinder learning accuracy at all. Extensive experiments demonstrate that for both regression and classification tasks, our method achieves the same accuracy as non-private approaches and outperforms the state-of-the-art related schemes. Besides, the defence ability of our method is significantly better than that of the state-of-the-art defence schemes. Specifically, for the membership inference attack, our method achieves attack success rate (ASR) of around $50\%$, which is equivalent to blind
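The distinctive property claimed here is that the perturbation is removable, so accuracy is untouched. The sketch below illustrates that property under a simplifying assumption that is not from the paper: client and server share a secret seed from which the (arbitrarily large, positive) noise can be regenerated and subtracted exactly.

import numpy as np

def perturb_update(update, seed):
    rng = np.random.default_rng(seed)
    noise = rng.uniform(1.0, 1e3, size=update.shape)   # large positive noise hides the true values
    return update + noise

def recover_update(perturbed, seed):
    rng = np.random.default_rng(seed)
    noise = rng.uniform(1.0, 1e3, size=perturbed.shape)
    return perturbed - noise                           # exact removal: no accuracy loss

true_update = np.array([0.1, -0.2, 0.05])
sent = perturb_update(true_update, seed=42)            # what an eavesdropper would observe
assert np.allclose(recover_update(sent, seed=42), true_update)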
AR-BERT: Aspect-relation enhanced Aspect-level Sentiment Classification with Multi-modal Explanations
Aspect level sentiment classification (ALSC) is a difficult problem with state-of-the-art models showing less than $80\%$ macro-F1 score on benchmark datasets.
Unfreeze with Care: Space-Efficient Fine-Tuning of Semantic Parsing Models
Semantic parsing is a key NLP task that maps natural language to structured meaning representations. As in many other NLP tasks, SOTA performance in semantic parsing is now attained by fine-tuning a large pre-trained language model (PLM). While effective, this approach is inefficient in the presence of multiple downstream tasks, as a new set of values for all parameters of the PLM needs to be stored for each task separately. Recent work has explored methods for adapting PLMs to downstream tasks while keeping most (or all) of their parameters frozen. We examine two such promising techniques, prefix tuning and bias-term tuning, specifically on semantic parsing. We compare them against each other on two different semantic parsing datasets, and we also compare them against full and partial fine-tuning, both in few-shot and conventional data settings. While prefix tuning is shown to do poorly for semantic parsing tasks off the shelf, we modify it by adding special token embeddings, which results in very strong performance without compromising parameter savings.
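For readers unfamiliar with the technique, a bare-bones prefix-tuning module looks roughly like the sketch below: the backbone is frozen and only a handful of virtual-token embeddings are trained. The special-token modification described in the paper is not reproduced here, and `backbone` is a stand-in for any encoder that maps an embedding sequence to hidden states.

import torch
import torch.nn as nn

class PrefixTunedEncoder(nn.Module):
    # Sketch: freeze the pre-trained backbone, learn only `prefix_len` virtual tokens.
    def __init__(self, backbone, embed_dim, prefix_len=10):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False                        # the PLM stays frozen
        self.prefix = nn.Parameter(torch.randn(prefix_len, embed_dim) * 0.02)

    def forward(self, token_embeddings):                   # token_embeddings: [B, T, D]
        b = token_embeddings.size(0)
        prefix = self.prefix.unsqueeze(0).expand(b, -1, -1)
        return self.backbone(torch.cat([prefix, token_embeddings], dim=1))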
EventBERT: A Pre-Trained Model for Event Correlation Reasoning
Event correlation reasoning infers whether a natural language paragraph containing multiple events conforms to human common sense. For example, "Andrew was very drowsy, so he took a long nap, and now he is very alert" is sound and reasonable. In contrast, "Andrew was very drowsy, so he stayed up a long time, now he is very alert" does not comply with human common sense. Such reasoning capability is essential for many downstream tasks, such as script reasoning, abductive reasoning, narrative incoherence, story cloze test, etc. However, conducting event correlation reasoning is challenging due to a lack of large amounts of diverse event-based knowledge and difficulty in capturing correlation among multiple events. In this paper, we propose EventBERT, a pre-trained model to encapsulate eventuality knowledge from unlabeled text. Specifically, we collect a large volume of training examples by identifying natural language paragraphs that describe multiple correlated events and further extracting event spans in an unsupervised manner. We then propose three novel event- and correlation-based learning objectives to pre-train an event correlation model on our created training corpus. Experimental results show EventBERT outperforms strong baselines on four downstream tasks, and achieves state-of-the-art results on most of them. Moreover, it outperforms existing pre-trained models by a large margin, e.g., 6.5~23%, in zero-shot learning of these tasks.
Learning and Evaluating Graph Neural Network Explanations based on Counterfactual and Factual Reasoning
Graph Neural Networks (GNNs) have shown great advantages on learning representations for structural data such as social networks, citation networks and molecular structures. However, due to the non-transparency of the deep learning models and the complex topological structure of the data, it is difficult to explain and interpret the predictions made by GNNs. To make GNNs explainable, in this paper, we propose a Counterfactual and Factual (CF^2) reasoning framework--a model-agnostic framework for explaining GNNs. By taking the insights from causal inference, CF^2 generates explanations by formulating an optimization problem based on both factual and counterfactual reasoning. This distinguishes CF^2 from previous explainable GNNs that only consider one of the causal reasoning perspectives.
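The sketch below shows one way a joint factual/counterfactual objective over a soft edge mask could be written; it paraphrases the idea in the abstract rather than reproducing the CF^2 paper's exact optimization problem, and `gnn(graph, edge_weights)` is a hypothetical hook that evaluates the model with down-weighted edges.

import torch
import torch.nn.functional as F

def factual_counterfactual_objective(gnn, graph, mask_logits, y, lam=0.5, beta=1e-3):
    # y: [N] original predictions of the GNN; gnn(graph, w) returns class logits [N, C]
    m = torch.sigmoid(mask_logits)                          # soft edge selection in [0, 1]
    logits_keep = gnn(graph, m)                             # factual: explanation subgraph alone
    logits_drop = gnn(graph, 1.0 - m)                       # counterfactual: everything but the explanation
    factual = F.cross_entropy(logits_keep, y)               # keeping it should preserve the prediction
    p_after_drop = F.softmax(logits_drop, dim=-1).gather(1, y.unsqueeze(1))
    counterfactual = p_after_drop.mean()                    # removing it should lower that confidence
    sparsity = m.mean()                                     # prefer compact explanations
    return factual + lam * counterfactual + beta * sparsity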
Path Language Modeling over Knowledge Graphs for Explainable Recommendation
To facilitate human decisions with credible suggestions, personalized recommender systems should have the ability to generate corresponding explanations while making recommendations. Knowledge graphs (KG), which contain comprehensive information about users and products, are widely used to enable this. By reasoning over a KG in a node-by-node manner, existing explainable models provide a KG-grounded path for each user-recommended item. Such paths serve as an explanation and reflect the historical behavior pattern of the user. However, not all items can be reached following the connections within the constructed KG under finite hops. Hence, previous approaches are constrained by a recall bias in terms of existing connectivity of KG structures. To overcome this, we propose a novel Path Language Modeling Recommendation (PLM-Rec) framework, learning a language model over KG paths consisting of entities and edges. Through path sequence decoding, PLM-Rec unifies recommendation and explanation in a single step and fulfills them simultaneously. As a result, PLM-Rec not only captures the user behaviors but also eliminates the restriction to pre-existing KG connections, thereby alleviating the aforementioned recall bias. Moreover, the proposed technique makes it possible to conduct explainable recommendation even when the KG is sparse or possesses a large number of relations. Experiments and extensive ablation studies on three Amazon e-commerce datasets demonstrate the effectiveness and explainability of the PLM-Rec framework.
MiDaS: Representative Sampling from Real-world Hypergraphs
Graphs are widely used for representing pairwise interactions in complex systems. Since such real-world graphs are large and often ever-growing, sampling a small representative subgraph is indispensable for various purposes: simulation, visualization, stream processing, representation learning, and crawling, to name a few. However, many complex systems consist of group interactions (e.g., collaborations of researchers and joint interactions of proteins), and thus they can be represented more naturally and accurately by hypergraphs (i.e., sets of sets) than by ordinary graphs.
A New Dynamic Algorithm for Densest Subhypergraphs
Computing a dense subgraph is a fundamental problem in graph mining, with a diverse set of applications ranging from electronic commerce to community detection in social networks. In many of these applications, the underlying context is better modelled as a weighted hypergraph that keeps evolving with time.
Lightning Fast and Space Efficient k-clique Counting
$K$-clique counting is a fundamental problem in network analysis which has attracted much attention in recent years. Computing the count of $k$-cliques in a graph for a large $k$ (e.g., $k=8$) is often intractable as the number of $k$-cliques increases exponentially w.r.t.\ (with respect to) $k$. Existing exact $k$-clique counting algorithms often struggle to handle large dense graphs, while sampling-based solutions either require a huge number of samples or consume very high storage space to achieve a satisfactory accuracy. To overcome these limitations, we propose a new framework to estimate the number of $k$-cliques which integrates both an exact $k$-clique counting technique and two novel color-based sampling techniques. The key insight of our framework is that we only apply the exact algorithm to compute the $k$-clique counts in the \emph{sparse regions} of a graph, and use the proposed sampling-based techniques to estimate the number of $k$-cliques in the \emph{dense regions} of the graph. Specifically, we develop two novel dynamic-programming-based $k$-color set sampling techniques to efficiently estimate the $k$-clique counts, where a $k$-color set contains $k$ nodes with $k$ different colors. Since a $k$-color set is often a good approximation of a $k$-clique in the dense regions of a graph, our sampling-based solutions are extremely accurate. Moreover, the proposed sampling techniques are space efficient and use near-linear space w.r.t.\ graph size. We conduct extensive experiments to evaluate our algorithms using 8 real-life graphs. The results show that our best algorithm is around two orders of magnitude faster than the state-of-the-art sampling-based solutions (with the same relative error of $0.1\%$) and can be up to three orders of magnitude faster than the state-of-the-art exact algorithm on large graphs.
Listing Maximal k-Plexes in Large Real-World Graphs
Listing dense subgraphs in a large graph is a key task in a variety of network analysis applications such as community detection.
FreSCo: Mining Frequent Patterns in Simplicial Complexes
Simplicial complexes are a generalization of graphs that model higher-order relations.
A Semi-Supervised VAE Based Active Anomaly Detection Framework in Multivariate Time Series for Online Systems
Nowadays, large online systems are built on microservice architectures. A failure in such an architecture may cause a cascade of failures due to fault propagation, so large online systems need to be monitored comprehensively to ensure service quality. Even though many anomaly detection techniques have been proposed, few of them can be directly applied to a given microservice or cloud server in an industrial environment. To address these challenges, this paper presents SLA-VAE, a semi-supervised, active anomaly detection framework based on a variational auto-encoder. SLA-VAE first defines anomalies based on a feature extraction module, introduces a semi-supervised VAE to identify anomalies in multivariate time series, and employs active learning to update the online model with a small number of uncertain samples. We conduct experiments on cloud server data from two different types of game business in Tencent. The results show that SLA-VAE significantly outperforms other state-of-the-art methods and is suitable for wide deployment in large online business systems.
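The following sketch illustrates the two ingredients named in the abstract, VAE-based scoring of multivariate windows and an active-learning query rule, using a hypothetical `vae` object with `encode_sample`/`decode` methods; it is only an illustration, not the SLA-VAE architecture.

import numpy as np

def anomaly_scores(vae, X, n_draws=8):
    # X: [N, D] windows of multivariate metrics; score = mean reconstruction error
    errs = []
    for _ in range(n_draws):                     # average over latent samples
        z = vae.encode_sample(X)                 # hypothetical stochastic encoder
        errs.append(((vae.decode(z) - X) ** 2).mean(axis=1))
    return np.mean(errs, axis=0)

def query_uncertain(scores, threshold, budget=10):
    # active learning: ask operators to label the points closest to the decision boundary
    uncertainty = -np.abs(scores - threshold)
    return np.argsort(uncertainty)[-budget:]     # indices of the most uncertain samples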
FedKC: Federated Knowledge Composition for Multilingual Natural Language Understanding
Multilingual natural language understanding, which aims to comprehend multilingual documents, is an important task. Existing efforts have focused on the analysis of centrally stored text data, but in practice, multilingual data is usually distributed. Federated learning is a promising paradigm to solve this problem: it trains local models with decentralized data on local clients and aggregates local models on a central server to obtain a good global model. However, existing federated learning methods assume that data are independent and identically distributed (IID) and cannot handle multilingual data, which is usually non-IID with severely skewed distributions. First, multilingual data is stored on local client devices such that only monolingual or bilingual data is stored on each client. This makes it difficult for local models to know the information of documents in other languages. Second, the distribution over different languages can be skewed: high-resource language data is much more abundant than low-resource language data. A model trained on such skewed data may focus more on high-resource languages but fail to capture the key information of low-resource languages.
A Sampling-based Learning Framework for Big Databases
Next-generation autonomous databases aim to apply reinforcement learning (RL) to tasks like query optimization and performance tuning with little or no intervention from human DBAs. Despite this promise, obtaining a decent policy model in the domain of database optimization is still challenging, primarily due to the inherent computational overhead of data-hungry RL frameworks, in particular on large databases. To mitigate this adverse effect, we propose {\em Mirror} in this work. The core of {\em Mirror} is a sampling process built into an RL framework, together with a process for transferring the policy model from the sampled database to its original counterpart. While conceptually simple, we identify that policy transfer between databases involves heavy noise and prediction drift that cannot be neglected. We therefore build a theoretically guided sampling algorithm into {\em Mirror}, assisted by a continuous fine-tuning module. Experiments on PostgreSQL and an industry database X-DB~\footnote{X-DB's real name is omitted for anonymity purposes.} validate that {\em Mirror} effectively reduces the computational cost while maintaining satisfactory performance.
UniParser: A Unified Log Parser for Heterogeneous Log Data
Logs provide first-hand information for engineers to diagnose failures in large-scale online service systems. Log parsing, which transforms semi-structured raw log messages into structured data, is a prerequisite of automated log analysis such as log-based anomaly detection and diagnosis. Almost all existing log parsers follow the general idea of extracting the common part as templates and the dynamic part as parameters. However, these log parsing methods often neglect the semantic meaning of log messages. Furthermore, the high diversity among various log sources also poses an obstacle to the generalization of log parsing across different systems. In this paper, we propose UniParser to capture the common logging behaviours from heterogeneous log data. UniParser utilizes a Token Encoder module and a Context Encoder module to learn patterns from the log token and its neighbouring context. A Context Similarity module is specially designed to model the commonalities of the learned patterns. We have performed extensive experiments on 16 public log datasets, and our results show that UniParser outperforms state-of-the-art log parsers by a large margin.
The Case of SPARQL UNION, FILTER and DISTINCT
SPARQL's Basic Graph Pattern (BGP) queries resemble SQL inner joins, and query optimisation and RDF indexing techniques for evaluating them efficiently are well researched. Reorderability of triple patterns is a crucial factor in BGP query plan optimisation. However, optimisation of the other components of SPARQL, such as OPTIONAL, UNION, FILTER, and DISTINCT, poses more challenges due to the restrictions they place on the reorderability of triple patterns. These components are important because they help in querying semi-structured data, as opposed to strictly structured relational data with a stringent schema; they are also part of the recursive SPARQL grammar and are performance intensive. Previously published optimisation techniques for BGP-OPTIONAL patterns have been shown to perform better for low-selectivity queries. In this paper, we use these BGP-OPTIONAL optimisation techniques as primitives and show how they can be applied to other SPARQL queries that intermix UNION, FILTER, and DISTINCT clauses. We mainly focus on the structural aspects of these queries and the challenges in using these optimisation techniques, identify the types of UNION, FILTER, and DISTINCT queries that can use these optimisations, and extend some of the previously published theoretical results.
Cross Pairwise Ranking for Unbiased Item Recommendation
Most recommender systems optimize the model on observed interaction data, which is affected by the previous exposure mechanism and exhibits many biases, such as popularity bias. The loss functions commonly used, such as pointwise BCE and pairwise BPR, are not designed to account for these biases in observed data. As a result, a model optimized on such a loss inherits the data biases, or even worse, amplifies them. For example, a few popular items take up more and more exposure opportunities, severely hurting recommendation quality on niche items, a phenomenon known as the notorious Matthew effect. While Inverse Propensity Scoring (IPS) could alleviate this issue via reweighting data samples, it is difficult to set the propensity score well since the exposure mechanism is seldom known and difficult to estimate with low variance. In this work, we develop a new learning paradigm named Cross Pairwise Ranking (CPR) that achieves unbiased recommendation without knowing the exposure mechanism. Instead of reweighting samples, we change the loss term of a sample: we innovatively sample multiple observed interactions at once and form the loss as a combination of their predictions. This theoretically offsets the influence of user/item propensity on learning, removing the influence of data biases caused by the exposure mechanism. In contrast to IPS, our proposed CPR ensures unbiased learning for each training instance without the need to set propensity scores. Experimental results demonstrate the superiority of CPR over state-of-the-art debiasing solutions in both model generalization and training efficiency.
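As a rough illustration of forming a loss from several observed interactions at once, the sketch below pairs each observed (user, item) sample in a batch with another observed sample and contrasts the two observed scores against the two crossed scores. This is one plausible instantiation of the idea, not necessarily the exact CPR loss, and `score_fn` is a hypothetical model scoring function.

import torch
import torch.nn.functional as F

def cross_pairwise_loss(score_fn, users, items):
    # users, items: aligned 1-D tensors of observed interactions in a batch
    u1, i1 = users, items
    u2, i2 = users.roll(1, dims=0), items.roll(1, dims=0)   # pair each sample with the next one
    margin = (score_fn(u1, i1) + score_fn(u2, i2)
              - score_fn(u1, i2) - score_fn(u2, i1))        # observed pairs vs. crossed pairs
    return -F.logsigmoid(margin).mean()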
Off-policy Learning over Heterogeneous Information for Recommendation
Reinforcement learning has recently become an active topic in recommender system research, where the logged data that records interactions between items and user feedback is used to discover the policy.
CBR: Context Bias aware Recommendation for Debiasing User Modeling and Click Prediction
With the prosperity of recommender systems, the biases existing in user behaviors, which may lead to inconsistency between user preferences and behavior records, have attracted wide attention. Though large efforts have been made to infer user preference from biased data by learning to debias, unfortunately, they mainly focus on the effect of one specific item attribute, e.g., position or modality, which may affect users' click probability on items. However, the comprehensive description of potential interactions between multiple items with various attributes, namely the context bias between items, may not be fully captured. To that end, in this paper, we design a novel Context Bias aware Recommendation (CBR) model for describing and debiasing the context bias caused by comprehensive interactions between multiple items. Specifically, we first propose a content encoder and a bias encoder based on multi-head self-attention to embed the latent interactions between items. Then, we calculate a biased representation for users based on an attention network, which is further utilized to infer the negative preference, i.e., the dislikes of users, based on the items the user never clicked. Finally, the real user preference is captured based on the negative preference to estimate the click prediction score. Extensive experiments on a real-world dataset demonstrate the competitiveness of our CBR framework compared with state-of-the-art baseline methods.
UKD: Debiasing Conversion Rate Estimation via Uncertainty-regularized Knowledge Distillation
Post-click conversion rate (CVR) estimation plays an important role in online advertising. Conventional CVR estimation models are usually trained on clicked samples, because only clicked ads have post-click conversion labels in user feedback logs. However, during online serving, the models need to estimate CVR scores for all impression ads (both clicked and unclicked), leading to the sample selection bias (SSB) issue and biased CVR estimation models. Intuitively, providing reliable supervision signals for unclicked ads is a feasible way to alleviate the SSB issue. In this paper, we propose an uncertainty-regularized knowledge distillation (UKD) framework to debias CVR estimation by distilling knowledge from unclicked ads. The workflow contains a click-adaptive teacher model and an uncertainty-regularized student model. The teacher learns click-adaptive representations for impression ads and produces pseudo-conversion labels on unclicked ads as supervision signals. The student is then trained on both clicked and unclicked ads with knowledge distillation, and performs uncertainty modeling to alleviate the inherent noise in the pseudo-labels. Experimental results on large-scale datasets show that UKD outperforms previous debiasing methods. Online experiments further verify that UKD achieves significant improvements.
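A minimal sketch of the student objective could combine a supervised term on clicked ads with a distillation term on unclicked ads that is down-weighted by the teacher's per-sample uncertainty. The specific weighting below is an assumption for illustration; the paper's regularization may differ.

import torch
import torch.nn.functional as F

def distillation_student_loss(student_logits, labels, is_clicked,
                              teacher_probs, teacher_uncertainty):
    # is_clicked: bool tensor; labels: true conversions (meaningful on clicked ads only)
    # teacher_probs: teacher pseudo-conversion labels; teacher_uncertainty in [0, 1]
    sup = F.binary_cross_entropy_with_logits(
        student_logits[is_clicked], labels[is_clicked])                  # supervised term
    unclicked = ~is_clicked
    distill = F.binary_cross_entropy_with_logits(
        student_logits[unclicked], teacher_probs[unclicked], reduction="none")
    distill = ((1.0 - teacher_uncertainty[unclicked]) * distill).mean()  # trust confident pseudo-labels more
    return sup + distill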
Rating Distribution Calibration for Selection Bias Mitigation in Recommendations
Real-world recommendation datasets have been shown to be subject to selection bias, which can make it difficult for recommendation models to learn users' real preferences and make accurate recommendations. Existing approaches to mitigate selection bias, such as data imputation and inverse propensity scoring, are sensitive to the quality of the additional imputation or propensity estimation models. To overcome these limitations, in this work we propose a novel self-supervised learning (SSL) framework, Rating Distribution Calibration (RDC), to tackle selection bias without introducing additional models. In addition to the original training objective, we introduce a rating distribution calibration loss. It aims to correct the predicted rating distribution of biased users by taking advantage of that of their similar unbiased users. We empirically evaluate RDC on two real-world datasets and one synthetic dataset. The experimental results show that RDC outperforms the original model as well as state-of-the-art debiasing approaches by a significant margin.
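One simple way to picture a rating-distribution calibration term (an illustrative assumption, not the paper's exact loss) is a KL divergence that pulls a biased user's predicted rating distribution toward an empirical distribution aggregated from similar, less-biased users:

import torch
import torch.nn.functional as F

def rating_distribution_calibration(pred_logits, peer_dist):
    # pred_logits: [B, K] unnormalised scores over K rating levels for B biased users
    # peer_dist:   [B, K] empirical rating distributions of their similar unbiased users
    log_pred = F.log_softmax(pred_logits, dim=-1)
    return F.kl_div(log_pred, peer_dist, reduction="batchmean")  # added to the base training objective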
Accurate and Explainable Recommendation via Review Rationalization
Auxiliary information, e.g., reviews, is widely adopted to improve collaborative filtering (CF) algorithms, e.g., to boost accuracy and provide explanations. However, most of the existing methods cannot distinguish between co-appearance and causality when learning from reviews, so that they may rely on spurious correlations rather than causal relations in recommendation --- leading to poor generalization performance and unconvincing explanations. In this paper, we propose a Recommendation via Review Rationalization (R3) method including 1) a rationale generator to extract rationales from reviews to alleviate the effects of spurious correlations; 2) a rationale predictor to predict user ratings on items only from rationales; and 3) a correlation predictor upon both rationales and correlational features to ensure conditional independence between spurious correlations and rating predictions given causal rationales. Extensive experiments on real-world datasets show that the proposed method can achieve better generalization performance than state-of-the-art CF methods and provide causal-aware explanations even when the test data distribution changes.
AmpSum: Adaptive Multiple-Product Summarization towards Improving Recommendation Captions
Explainable recommendation seeks to provide not only high-quality recommendations but also intuitive explanations. Our objective is not on generating accurate recommendations per se, but on producing user-friendly explanations. Importantly, the focus of existing work has been predominantly on explaining a single item recommendation. In e-commerce websites, product recommendations are usually organized into "widgets", each given a name to describe the products within. These widget names are usually generic in nature and inadequate to reveal the purpose of recommendation, in part because they may be manually crafted, making it difficult to attach meaningful and informative names at scale.
Comparative Explanations of Recommendations
As recommendation is essentially a comparative (or ranking) process, a good explanation should illustrate to users why an item is believed to be better than another, i.e., comparative explanations about the recommended items. Ideally, after reading the explanations, a user should reach the same ranking of items as the system's. Unfortunately, little research attention has yet been paid on such comparative explanations.
Explainable Neural Rule Learning
Although neural networks have achieved great successes in various machine learning tasks, people can hardly know what neural networks learn from data and how they make decisions due to their black-box nature. The lack of such explainability is one of the limitations of neural networks when applied in domains, e.g., healthcare and finance, that demand transparency and accountability. Moreover, explainability is beneficial for guiding a neural network to learn the causal patterns that can extrapolate out-of-distribution (OOD) data, which is critical in real-world applications and has surged as a hot research topic recently.
Interpreting BERT-based Text Similarity via Activation and Saliency Maps
We present BERT Interpretations (BTI), a novel technique for interpreting unlabeled paragraph similarities inferred by a pre-trained BERT model. Given a paragraph-pair, BTI identifies the important words that dictate each paragraph's semantics, matches between the words from both elements, and retrieves the most influencing word-pairs that explain the similarity between the paragraphs. We demonstrate the ability
Welcome by the Web4Good Special Track Chairs
Invited Talk by Carlos Castillo (Universitat Pompeu Fabra): Algorithmic Fairness of Link-based Recommendation
Reproducibility and Replicability of Web Measurement Studies
Measurement studies are an essential tool to analyze and understand how the modern Web works, as they aim to shed light on phenomena that are not yet fully understood.
A View into Youtube View Fraud
Artificially inflating the view count of videos on online portals, such as YouTube, opens up such platforms to manipulation and a unique class of fake engagement abuse known as view fraud. Limited prior research on such abuse focuses on automated or bot-driven approaches. In this paper, we explore organic or human-driven approaches to view fraud for YouTube, by investigating a long-running operation on a popular illegal free live streaming service (FLIS) named 123Movies. A YouTube video involved in this operation is overlayed as a pre-roll advertisement on an illegal stream that a visitor requests on 123Movies, forcing the visitor to watch a part of the video before the stream can be played. By having visitors view unsolicited YouTube videos, these FLIS services facilitate YouTube view fraud at a large scale. We reverse-engineered the video distribution methods on the streaming service and tracked the videos involved in this operation over a 9 month period. For a subset of these videos, we monitor their view counts and their respective YouTube channel dynamics over the same period. Our analysis reveals the characteristics of YouTube channels and videos participating in this view fraud, as well as how successful the view fraud campaigns are. Ultimately, our study provides an empirical grounding on an organic YouTube view fraud ecosystem.
Investigating Advertisers' Domain-changing Behaviors and Their Impacts on Ad-blocker Filter Lists
Ad blockers heavily rely on filter lists to block ad domains, which can serve advertisements and trackers. However, recent research has reported that some advertisers keep registering replica ad domains (RAD domains)---new domains that serve the same purpose as the original ones---which tend to slip through ad-blocker filter lists. Although this phenomenon might negatively affect ad blockers' effectiveness, no study to date has thoroughly investigated its prevalence and the issues caused by RAD domains. In this work, we proposed methods to discover RAD domains and categorize their change patterns. From a crawl of 50,000 websites, we identified 1,761 unique RAD domains, 1,096 of which survived for an average of 410.5 days before they were blocked; the rest had not been blocked as of February 2021. Notably, we found that non-blocked RAD domains could extend the timespan of ad or tracker distribution by more than two years. Our analysis further revealed a taxonomy of four techniques used to create RAD domains, including two less-studied ones. Additionally, we discovered that the RAD domains affected 10.4% of the websites we crawled, and that 23.9% of the RAD domains exhibited privacy-intrusive behaviors, severely undermining ad blockers' privacy protection.
Verba Volant, Scripta Volant: Understanding Post-publication Title Changes in News Outlets
Digital media facilitates broadcasting daily news via more flexible channels (e.g. Web, Twitter). Unlike conventional newspapers which become uneditable upon publication, online news sources are free to modify news headlines after their initial release. However, the motivation and potential confounding effects behind such post-edits have been understudied.
Beyond Bot Detection: Combating Fraudulent Online Survey Takers
Different techniques have been recommended to detect fraudulent responses in online surveys, but little research has systematically tested the extent to which they actually work in practice. In this paper, we conduct an empirical evaluation of 22 anti-fraud tests in two complementary online surveys. The first survey recruits Rust programmers on public online forums and social media networks. We find that fraudulent respondents exhibit both bot-like and human characteristics. Among the different tests, those designed based on domain knowledge achieve the best effectiveness. By combining individual tests, we can achieve detection performance as good as commercial techniques while making the results more explainable.
Exploring Edge Disentanglement for Node Classification
Edges in real-world graphs are typically formed by a variety of factors and carry diverse relation semantics. For example, connections in a social network could indicate friendship, being colleagues, or living in the same neighborhood. However, these latent factors are usually concealed behind mere edge existence due to the data collection and graph formation processes. Despite rapid developments in graph learning over these years, most models take a holistic approach and treat all edges as equal. One major difficulty in disentangling edges is the lack of explicit supervision. In this work, with close examination of edge patterns, we propose three heuristics and design three corresponding pretext tasks to guide the automatic edge disentanglement. Concretely, these self-supervision tasks are enforced on a multi-head graph attention module to be trained jointly with downstream tasks to encourage automatic edge disentanglement. Heads in this module are expected to capture distinguishable relations and neighborhood interactions, and outputs from them are aggregated as node representations. The proposed DisGNN can easily be incorporated into various neural architectures, and we conduct experiments on 6 real-world datasets. Empirical results show that it can achieve significant performance gains.
QEN: Applicable Taxonomy Completion via Evaluating Full Taxonomic Relations
Taxonomy is a fundamental type of knowledge graph for a wide range of web applications like searching and recommendation systems. To keep a taxonomy automatically updated with the latest concepts, the taxonomy completion task matches a pair of proper hypernym and hyponym in the original taxonomy with the new concept as its parent and child. Previous solutions utilize term embeddings as input and only evaluate the parent-child relations between the new concept and the hypernym-hyponym pair. Such methods ignore the important sibling relations, and are not applicable in reality since term embeddings are not available for the latest concepts. They also suffer from the relational noise of the "pseudo-leaf" node, which is a null node acting as a node's hyponym to enable the new concept to be a leaf node. To tackle the above drawbacks, we propose the Quadruple Evaluation Network (QEN), a novel taxonomy completion framework that utilizes easily accessible term descriptions as input, and applies a pretrained language model and code attention for accurate inference while reducing online computation. QEN evaluates both parent-child and sibling relations to enhance accuracy and reduce the noise brought by the pseudo-leaf node. Extensive experiments on three real-world datasets in different domains with different sizes and term description sources prove the effectiveness and robustness of QEN on overall performance and especially on the performance for adding non-leaf nodes, which largely surpasses previous methods and achieves a new state of the art for the task.
EvoLearner: Learning Description Logics with Evolutionary Algorithms
Classifying nodes in knowledge graphs is an important task, e.g., predicting missing types of entities, predicting which molecules cause cancer, or predicting which drugs are promising treatment candidates. While black-box models often achieve high predictive performance, they are only post-hoc and locally explainable and do not allow the learned model to be easily enriched with domain knowledge. Towards this end, learning description logic concepts from positive and negative examples has been proposed. However, learning such concepts often takes a long time and state-of-the-art approaches provide limited support for literal data values, although they are crucial for many applications. In this paper, we propose EvoLearner - an evolutionary approach to learn ALCQ(D), which is the attributive language with complement (ALC) paired with qualified cardinality restrictions (Q) and data properties (D). We contribute a novel initialization method for the initial population: starting from positive examples (nodes in the knowledge graph), we perform biased random walks and translate them to description logic concepts. Moreover, we improve support for data properties by maximizing information gain when deciding where to split the data. We show that our approach significantly outperforms the state of the art on the benchmarking framework SML-Bench for structured machine learning. Our ablation study confirms that this is due to our novel initialization method and support for data properties.
TaxoEnrich: Self-Supervised Taxonomy Completion via Structure-Semantic Representations
Taxonomies are fundamental to many real-world applications in various domains, serving as structural representations of knowledge. To deal with the increasing volume of new concepts that need to be organized as taxonomies, researchers turn to automatic completion of an existing taxonomy with new concepts.
Creating Signature-Based Views for Description Logic Ontologies with Transitivity and Qualified Number Restrictions
Developing ontologies for the Semantic Web is a time-consuming, laborious and error-prone task that requires collaborative efforts. Moreover, developing new ontologies from scratch just for the use of a particular Web application is deemed inefficient, since it requires the investment of considerable manpower and resources and causes redundancy. Thus, a potentially better idea is to reuse well-established ontologies, as per certain requirements.
Neural Predicting Higher-order Patterns in Temporal Networks
Dynamic systems that consist of a set of interacting elements can be abstracted as temporal networks. Recently, higher-order patterns that involve multiple interacting nodes have been found crucial to indicate domain-specific laws of different temporal networks. This poses the challenge of designing more sophisticated hypergraph models for these higher-order patterns and the associated new learning algorithms.
TREND: TempoRal Event and Node Dynamics for Graph Representation Learning
Temporal graph representation learning has drawn significant attention due to the prevalence of temporal graphs in the real world. However, most existing works resort to taking discrete snapshots of the temporal graph, or are not inductive to deal with new nodes, or do not model the exciting effects between events. In this work, we propose TREND, a novel framework for temporal graph representation learning, driven by TempoRal Event and Node Dynamics and built upon a Hawkes process-based GNN. TREND presents a few major advantages: (1) it is inductive due to its GNN architecture; (2) it captures the exciting effects between events by the adoption of the Hawkes process; (3) as our main novelty, it captures the individual and collective characteristics of events by integrating both event and node dynamics, driving a more precise modeling of the temporal process. Extensive experiments on four real-world datasets demonstrate the effectiveness of our proposed model.
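For readers unfamiliar with the Hawkes process the framework builds on, the following is a minimal sketch of its exponential-kernel conditional intensity. In a GNN-based model, the base rate and excitation strength would typically be functions of node embeddings rather than fixed scalars; treating them as constants here is an assumption for illustration.

```python
import math

def hawkes_intensity(t, event_times, mu=0.1, alpha=0.5, beta=1.0):
    """Exponential-kernel Hawkes conditional intensity:
        lambda(t) = mu + sum_{t_i < t} alpha * exp(-beta * (t - t_i)).
    Past events temporarily raise the rate of future events ("excitation")."""
    return mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in event_times if ti < t)

# Example: intensity just after a burst of interactions at t = 1, 2, 2.5.
print(hawkes_intensity(3.0, [1.0, 2.0, 2.5]))
```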
Unbiased Graph Embedding with Biased Graph Observations
Graph embedding techniques have been increasingly employed in real-world machine learning tasks on graph-structured data, such as social recommendations and protein structure modeling. Since the generation of a graph is inevitably affected by some sensitive node attributes (such as gender and age of users in a social network), the learned graph representations can inherit such sensitive information and introduce undesirable biases in downstream tasks. Most existing works on debiasing graph representations add ad-hoc constraints on the learnt embeddings to restrict their distributions, which however compromise the utility of resulting graph representations in downstream tasks.
Collaborative Knowledge Distillation for Heterogeneous Information Network Embedding
Learning low-dimensional representations for Heterogeneous Information Networks (HINs) has drawn increasing attention recently thanks to its effectiveness on downstream tasks. Existing HIN embedding methods mainly learn meta-path based embeddings and then combine them to get the final representation. However, the dependence between meta-paths is largely ignored. As a result, learning the embedding for each meta-path relies only on that specific meta-path, which may be highly sparse and insufficient to produce a high-quality embedding. In this paper, to deal with this problem, we propose Semantic Dependence Aware Heterogeneous Information Network Embedding (SDA), which learns embeddings at the node level, subgraph level and meta-path level. By maximizing mutual information between embeddings from different levels and meta-paths, learning an embedding from one meta-path benefits from the knowledge of other meta-paths. Extensive experiments are conducted on four real-world datasets and the results demonstrate the effectiveness of the proposed method.
Geometric Graph Representation Learning via Maximizing Rate Reduction
Learning discriminative node representations benefits various downstream tasks in graph analysis such as community detection and node classification. Existing graph representation learning methods (e.g., based on random walk and contrastive learning) are limited to maximizing the local similarity of connected nodes. Such a pairwise learning scheme could fail to capture the global distribution of representations, since it has no constraints on the global geometric properties of representation space. To this end, we propose Geometric Graph Representation Learning (G2R) to learn node representations in an unsupervised manner via maximizing rate reduction. In this way, G2R maps nodes in distinct groups (implicitly hidden in adjacency matrix) into different subspaces, while each subspace is compact and different subspaces are dispersedly distributed. G2R adopts a graph neural network as the encoder and maximizes the rate reduction with the adjacency matrix. Furthermore, we theoretically and empirically demonstrate that rate reduction maximization is equivalent to maximizing the principal angles between different subspaces. Experiments on the real-world datasets show that G2R outperforms various baselines on node classification and community detection tasks.
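A small NumPy sketch of a rate-reduction objective of this general form, in the spirit of the coding-rate formulation popularized by MCR^2: the whole representation is expanded while each group is compressed. The exact parameterization used by G2R may differ; the group assignment here stands in for the groups implicitly hidden in the adjacency matrix.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """R(Z) = 1/2 * logdet(I + d/(n*eps^2) * Z @ Z.T), with Z of shape (d, n)."""
    d, n = Z.shape
    return 0.5 * np.linalg.slogdet(np.eye(d) + d / (n * eps**2) * Z @ Z.T)[1]

def rate_reduction(Z, groups, eps=0.5):
    """Hedged sketch of the rate-reduction objective: total coding rate of
    all representations minus the group-weighted coding rates, where
    `groups` maps each column of Z to a group index."""
    n = Z.shape[1]
    total = coding_rate(Z, eps)
    compressed = sum(
        (np.sum(groups == g) / n) * coding_rate(Z[:, groups == g], eps)
        for g in np.unique(groups))
    return total - compressed
```

Maximizing this quantity pushes different groups into different, well-separated subspaces while keeping each group compact, which matches the geometric intuition described above.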
Modeling and Optimizing the Scaling Performance in Distributed Deep Learning Training
Distributed Deep Learning (DDL) is widely used to accelerate deep neural network training for various web applications. In each iteration of DDL training, each worker synchronizes neural network gradients with other workers. This introduces communication overhead and degrades the scaling performance. In this paper, we propose a recursive model, OSF, for estimating the scaling performance of DDL training of neural network models, given the settings of the DDL system. OSF captures two main characteristics of DDL training: the overlap between computation and communication, and the tensor fusion for batching updates.
DREW: Efficient Winograd CNN Inference with Deep Reuse
Deep learning has been used in various domains, including Web services. Convolutional neural networks (CNNs), as representatives of deep learning, are one of the most commonly used neural networks in Web systems. However, CNNs have heavy computation patterns. Meanwhile, different from the training process, the inference process is more often executed on devices with low computing power in industry, such as CPUs. The limited computing resources and high computation pressure limit the effective use of CNN algorithms in industry. Fortunately, a minimal filtering algorithm, Winograd, can reduce the computations by reducing the number of multiplication operations. We find that the Winograd algorithm can be further accelerated by reusing similar data and computation patterns, which is called deep reuse. In this paper, we propose a new inference method, called DREW, which combines deep reuse with Winograd for further accelerating CNNs. DREW handles three difficulties. First, it can detect the similarities from the complex minimal filtering patterns by clustering. Second, it keeps the online clustering cost within a reasonable range. Third, it provides an adjustable clustering granularity, balancing performance and accuracy. Experiments show that 1) DREW further accelerates the Winograd convolution by an average speedup of 2.10×; 2) when DREW is applied to end-to-end Winograd CNN inference, it achieves an average performance speedup of 1.85× with no (<0.4%) accuracy loss; 3) DREW reduces the number of convolution operations to 10% of the original operations on average.
PaSca: A Graph Neural Architecture Search System under the Scalable Paradigm
Graph neural networks (GNNs) have achieved state-of-the-art performance in various graph-based tasks. However, as mainstream GNNs are designed based on the neural message passing mechanism, they do not scale well to data size and message passing steps. Although there has been an emerging interest in the design of scalable GNNs, current research focuses on specific GNN designs, rather than the general design space, which limits the discovery of potential scalable GNN models. In addition, existing solutions cannot support extensive exploration over the design space for scalable GNNs. This paper proposes PaSca, a new paradigm and system that offers a principled approach to systematically construct and explore the design space for scalable GNNs, rather than studying individual designs. Through deconstructing the message passing mechanism, PaSca presents a novel Scalable Graph Neural Architecture Paradigm (SGAP), together with a general architecture design space consisting of 150k different designs. Following the paradigm, we implement an auto-search engine that can automatically search for well-performing and scalable GNN architectures to balance the trade-off between multiple criteria (e.g., accuracy and efficiency) via multi-objective optimization. Empirical studies on ten benchmark datasets demonstrate that the representative instances (i.e., PaSca-V1, V2, and V3) discovered by our system achieve consistently competitive performance among strong baselines. Concretely, PaSca-V3 outperforms the state-of-the-art GNN method JK-Net by 0.4% in terms of predictive accuracy on our large industry dataset while achieving up to 28.3 times training speedups.
TRACE: A Fast Transformer-based General-Purpose Lossless Compressor
Deep-learning-based compressors have received interest recently due to their much improved compression ratios. However, modern approaches suffer from long execution times. To ease this problem, this paper targets cutting down the execution time of deep-learning-based compressors. Building history dependencies sequentially is responsible for the long inference latency. We introduce the transformer into deep-learning compressors to build history dependencies in parallel. However, the existing transformer is too heavy and incompatible with compression tasks.
Not All Layers Are Equal: A Layer-Wise Adaptive Approach Toward Large-Scale DNN Training
Large-batch training with data parallelism is a widely adopted approach to efficiently train a large deep neural network (DNN) model. Large-batch training, however, often suffers from model quality degradation because of its fewer iterations. To alleviate this problem, learning rate (lr) scaling methods have generally been applied, which increase the learning rate to make each update larger. Unfortunately, we observe that large-batch training with state-of-the-art lr scaling methods still often degrades model quality when the batch size crosses a specific limit, rendering such lr methods less useful. To explain this phenomenon, we hypothesize that existing lr scaling methods overlook the subtle but important differences across "layers" in training, which results in the degradation of the overall model quality. From this hypothesis, we propose a novel approach (LENA) toward learning rate scaling for large-scale DNN training, employing: (1) a layer-wise adaptive learning rate scaling to adjust the learning rate for each layer individually, and (2) a layer-wise state-aware warm-up to track the state of the training for each layer and finish its warm-up automatically. A comprehensive evaluation with varying batch sizes demonstrates that LENA achieves the target accuracy (i.e., the accuracy of single-worker training): (1) within the fewest iterations across different batch sizes (up to 45.2% fewer iterations and 44.7% shorter time than the existing state-of-the-art gradient-variance-based method), and (2) for training very large batch sizes, surpassing the limits of all baselines.
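The layer-wise idea can be illustrated with a trust-ratio-style scaling rule, sketched below in PyTorch. This mirrors LARS/LAMB-style scaling and is only an approximation of the approach described above, not its exact update rule or warm-up mechanism.

```python
import torch

def layerwise_scaled_lrs(model, base_lr=0.1, eps=1e-8):
    """Hedged sketch of layer-wise learning-rate scaling: each parameter
    tensor's rate is scaled by the ratio of its weight norm to its gradient
    norm, so no single layer takes an oversized step. Call after backward()."""
    lrs = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        trust = p.data.norm() / (p.grad.data.norm() + eps)
        lrs[name] = base_lr * float(trust)
    return lrs
```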
Distributional Contrastive Embedding for Clarification-based Conversational Critiquing
Managing uncertainty in preferences is core to creating the next generation of conversational recommender systems (CRS). However, an often-overlooked element of conversational interaction is the role of clarification. Users are notoriously noisy at revealing their preferences, and a common error is being unnecessarily specific, e.g., suggesting "chicken fingers" when a restaurant with a "kids menu" was the intended preference. Correcting such errors requires reasoning about the level of generality and specificity of preferences and verifying that the user has expressed the correct level of generality. To this end, we propose a novel clarification-based conversational critiquing framework that allows the system to clarify user preferences as it accepts critiques. To support clarification, we propose the use of distributional embeddings that can capture the specificity and generality of concepts through distributional coverage while facilitating state-of-the-art embedding-based recommendation methods. Specifically, we incorporate Distributional Contrastive Embeddings of critiqueable keyphrases with user preference embeddings in a Variational Autoencoder recommendation framework that we term DCE-VAE. Our experiments show that our proposed DCE-VAE is (1) competitive in terms of general performance in comparison to state-of-the-art recommenders and (2) supports effective clarification-based critiquing in comparison to alternative clarification baselines. In summary, this work adds a new dimension of clarification to enhance the well-known critiquing framework along with a novel data-driven distributional embedding for clarification suggestions that significantly improves the efficacy of user interaction with critiquing-based CRSs.
Towards a Multi-view Attentive Matching for Personalized Expert Finding
In Community Question Answering (CQA) websites, expert finding aims at seeking suitable experts to answer questions.
Similarity-based Multi-Domain Dialogue State Tracking with Copy Mechanisms for Task-based Virtual Personal Assistants
Task-based Virtual Personal Assistants (VPAs) rely on multi-domain Dialogue State Tracking (DST) models to monitor goals throughout a conversation. Previously proposed models show promising results on established benchmarks, but they have difficulty adapting to unseen domains due to domain-specific parameters in their model architectures. We propose a new Similarity-based Multi-domain Dialogue State Tracking model (SM-DST) that uses retrieval-inspired and fine-grained contextual token-level similarity approaches to efficiently and effectively track dialogue state. The key difference with state-of-the-art DST models is that SM-DST has a single model with shared parameters across domains and slots. Because we base SM-DST on similarity it allows the transfer of tracking information between semantically related domains as well as to unseen domains without retraining. Furthermore, we leverage copy mechanisms that consider the system's response and the dialogue state from previous turn predictions, allowing it to more effectively track dialogue state for complex conversations. We evaluate SM-DST on three variants of the MultiWOZ DST benchmark datasets. The results demonstrate that SM-DST significantly and consistently outperforms state-of-the-art models across all datasets by absolute 5-18% and 3-25% in the few- and zero-shot settings, respectively.
Multiple Choice Questions based Multi-Interest Policy Learning for Conversational Recommendation
Conversational recommendation system (CRS) is able to obtain fine-grained and dynamic user preferences based on interactive dialogue. Previous CRS assumes that the user has a clear target item in a conversation, which deviates from the real scenario: for the user who resorts to CRS, he might not have a clear idea about what he really wants. Specifically, the user may have a clear single preference for some attribute types (e.g., color) of items, while for other attribute types, the user has multiple preferences or vague intentions, which results in multiple attribute instances (e.g., black) of one attribute type accepted by the user. In addition, the user shows a preference for different combinations of attribute instances rather than the single combination of all the attribute instances he prefers. Therefore, we propose a more realistic conversational recommendation scenario in which users may have multiple interests with attribute instance combinations and accept multiple items with partially the same attribute instances.
Discovering Personalized Semantics for Soft Attributes in Recommender Systems Using Concept Activation Vectors
Interactive recommender systems (RSs) have emerged as a promising paradigm to overcome the limitations of the primitive user feedback used by traditional RSs by allowing users to express intent, preferences and contexts in a richer fashion, often using natural language. One challenge in effectively using such feedback is inferring a user's semantic intent from the open-ended terms (attributes or tags) used to describe an item, and using it to refine recommendation results. Leveraging concept activation vectors (CAVs) (Kim et al. 2018), an approach to model interpretability, we develop a framework to learn a representation that captures the semantics of such attributes and connects them to user preferences and behaviors in RSs. A novel feature of our approach is its ability to distinguish objective and subjective attributes and associate different senses with different users. We demonstrate on both synthetic and real-world datasets that our CAV representation accurately interprets users' subjective semantics, and can improve recommendations through interactive critiquing.
STAM: A Spatiotemporal Aggregation Method for Graph Neural Network-based Recommendation
Graph neural network-based recommendation systems have been blossoming recently, and their core component is the aggregation method that determines neighbor embedding learning. Prior art usually focuses on how to aggregate information from the perspective of spatial structure information, but temporal information about neighbors is left insufficiently explored.
EXIT: Extrapolation and Interpolation-based Neural Controlled Differential Equations for Time-series Classification and Forecasting
Deep learning inspired by differential equations is a recent research trend and has marked the state-of-the-art performance for many machine learning tasks. Among them, time-series modeling with neural controlled differential equations (NCDEs) is considered as a breakthrough. In many cases, NCDE-based models not only provide better accuracy than recurrent neural networks (RNNs) but also make it possible to process irregular time-series. In this work, we enhance NCDEs by redesigning their core part, i.e., generating a continuous path from a discrete time-series input. NCDEs typically use interpolation algorithms to convert discrete time-series samples to continuous paths. However, we propose to i) generate another latent continuous path using an encoder-decoder architecture, which corresponds to the interpolation process of NCDEs, i.e., our neural network-based interpolation vs. the existing explicit interpolation, and ii) exploit the generative characteristic of the decoder, i.e., extrapolation beyond the time domain of original data if needed. Therefore, our NCDE design can use both the interpolated and extrapolated information for downstream machine learning tasks. In our experiments with 5 real-world datasets and 12 baselines, our extrapolation and interpolation-based NCDEs outperform existing baselines by non-trivial margins.
CAMul: Calibrated and Accurate Multi-view Time-Series Forecasting
Probabilistic time-series forecasting enables reliable decision-making across many domains. Most forecasting problems have diverse sources of data containing multiple modalities and structures. Leveraging information as well as uncertainty from these data sources for well-calibrated and accurate forecasts is an important and challenging problem. Most previous work on multi-modal learning and forecasting simply aggregates intermediate representations from each data view by summation or concatenation and does not explicitly model uncertainty for each data view. We propose CAMul, a general probabilistic multi-view forecasting framework that can learn representations and uncertainty from diverse data sources. It integrates the knowledge and uncertainty from each data view in a dynamic context-specific manner, assigning more importance to useful views to model a well-calibrated forecast distribution. We use CAMul for multiple domains with varied sources and modalities and show that CAMul outperforms other state-of-the-art probabilistic forecasting models by over 25% in accuracy and calibration.
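As an illustration of context-dependent view fusion, the sketch below attention-weights per-view Gaussian forecasts and moment-matches the mixture to a single Gaussian. The tensor shapes, the dot-product scoring, and the Gaussian assumption are illustrative choices, not CAMul's actual architecture.

```python
import torch
import torch.nn.functional as F

def fuse_view_forecasts(means, stds, context, view_keys):
    """Hedged sketch of context-aware fusion of per-view forecasts.
    means/stds: shape (V,) per-view Gaussian parameters;
    context: shape (d,); view_keys: shape (V, d). All names illustrative."""
    # Attention over views conditioned on the current context.
    w = F.softmax(view_keys @ context / context.shape[0] ** 0.5, dim=0)
    # Moment-match the attention-weighted mixture to one Gaussian.
    mu = (w * means).sum()
    var = (w * (stds ** 2 + means ** 2)).sum() - mu ** 2
    return mu, var.clamp_min(1e-8).sqrt()
```

The intuition matches the abstract: views that are more useful in the current context receive larger weights and therefore dominate both the fused mean and the fused uncertainty.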
Knowledge-based Temporal Fusion Network for Interpretable Online Video Popularity Prediction
Predicting the popularity of online videos has many real-world applications, such as recommendation, precise advertising, and edge caching strategies. Although many efforts have been dedicated to online video popularity prediction, several challenges still exist: (1) The meta-data from online videos is usually sparse and noisy, which makes it difficult to learn a stable and robust representation. (2) The influence of content features and temporal features in different life cycles of online videos changes dynamically, so it is necessary to build a model that can capture these dynamics. (3) In addition, there is a great need to interpret the predictive behavior of the model to assist administrators of video platforms in subsequent decision-making.
Using Web Data to Reveal 22-Year History of Sneaker Designs
Web data and computational models can play an important role in analyzing cultural trends. The current study uses 23,492 sneaker product images and meta information collected from a global reselling shop, StockX.com. We construct an index, named the sneaker design index, that summarizes the design characteristics of sneakers using a contrastive learning method. This index allows us to study changes in design over a 22-year period (1999-2020). The data suggest that sneakers have tended to employ brighter colors and lower hue and saturation values over time. Yet, each brand continues to build toward its particular trajectory of shape-related design patterns. The embedding analysis also predicts which sneakers will likely see a high premium in the reselling market, suggesting viable algorithm-driven investment and design strategies. The current work is the first to apply data science methods to a new research domain - i.e., the analysis of product design evolution over a long historical period - and has implications for the novel use of Web data to understand cultural patterns that are otherwise hard to obtain.
A Duo-generative Approach to Explainable Multimodal COVID-19 Misinformation Detection
This paper focuses on a critical problem of explainable multimodal COVID-19 misinformation detection where the goal is to accurately detect misleading information in multimodal COVID-19 news articles and provide the reason or evidence that can explain the detection results. Our work is motivated by the lack of judicious study of the association between different modalities (e.g., text and image) of the COVID-19 news content in current solutions. In this paper, we present a generative approach to detect multimodal COVID-19 misinformation by investigating the cross-modal association between the visual and textual content that is deeply embedded in the multimodal news content. Two critical challenges exist in developing our solution: 1) how to accurately assess the consistency between the visual and textual content of a multimodal COVID-19 news article? 2) How to effectively retrieve useful information from the unreliable user comments to explain the misinformation detection results? To address the above challenges, we develop a duo-generative explainable misinformation detection (DGExplain) framework that explicitly explores the cross-modal association between the news content in different modalities and effectively exploits user comments to detect and explain misinformation in multimodal COVID-19 news articles. We evaluate DGExplain on two real-world multimodal COVID-19 news datasets. Evaluation results demonstrate that DGExplain significantly outperforms state-of-the-art baselines in terms of the accuracy of multimodal COVID-19 misinformation detection and the explainability of detection explanations.
Construction of Large-Scale Misinformation Labeled Datasets from Social Media Discourse using Label Refinement
Malicious accounts spreading misinformation have led to widespread false and misleading narratives in recent times, especially during the COVID-19 pandemic, and social media platforms struggle to eliminate such content rapidly. This is because adapting to new domains requires human-intensive fact-checking that is slow and difficult to scale. To address this challenge, we propose to leverage news-source credibility labels as weak labels for social media posts and propose model-guided refinement of labels to construct large-scale, diverse misinformation-labeled datasets in new domains. The weak labels can be inaccurate at the article or social media post level when the stance of the user does not align with the news source or article credibility. We propose a framework that uses a detection model self-trained on the initial weak labels, with uncertainty sampling based on the entropy of the model's predictions, to identify potentially inaccurate labels and correct them using self-supervision or relabeling. The framework will incorporate the social context of the post, in terms of the community of its associated user, for surfacing inaccurate labels towards building a large-scale dataset with minimum human effort. To provide labeled datasets that distinguish misleading narratives in which information might be missing significant context or has inaccurate ancillary details, the proposed framework will use the few labeled samples as class prototypes to separate high-confidence samples into false, unproven, mixture, mostly false, mostly true, true, and debunk information. The approach is demonstrated by providing a large-scale misinformation dataset on COVID-19 vaccines.
VICTOR: An Implicit Approach to Mitigate Misinformation via Continuous Verification Reading
We design and evaluate VICTOR, an easy-to-apply module on top of a recommender system to mitigate misinformation. VICTOR takes an elegant, implicit approach to deliver fake-news verifications, such that readers of fake news can continuously access more verified news articles about fake-news events without explicit correction. We frame fake-news intervention within VICTOR as a graph-based question-answering (QA) task, with Q as a fake-news article and A as the corresponding verified articles. Specifically, VICTOR adopts reinforcement learning: it first considers fake-news readers' preferences supported by underlying news recommender systems and then directs their reading sequence towards the verified news articles. Given the various pitfalls of explicit misinformation debunking, VICTOR's implicit approach can increase the chances of readers being exposed to diverse and nuanced aspects of the misinformation event, which has the potential to reduce their faith in the fake news. To verify the performance of VICTOR, we collect and organize VERI, a new dataset consisting of real-news articles, user browsing logs, and fake-real news pairs for a large number of misinformation events. We evaluate zero-shot and few-shot VICTOR on VERI to simulate the never-exposed-ever and seen-before conditions of users while reading a piece of fake news. Results demonstrate that, compared to baselines, VICTOR proactively delivers 6% more verified articles with a diversity increase of 7.5% to over 68% of at-risk users who have been exposed to fake news. Moreover, we conduct a field user study in which 165 participants evaluated fake news articles. Participants in the VICTOR condition show better exposure rates, proposal rates, and click rates on verified news articles than those in the other two conditions. Altogether, our work demonstrates the potential of VICTOR, i.e., combating fake news by delivering verified information implicitly.
Moral Emotions Shape the Virality of COVID-19 Misinformation on Social Media
While false rumors pose a threat to the successful overcoming of the COVID-19 pandemic, our understanding of how rumors diffuse in online social networks is - even for non-crisis situations - still in its infancy. Here we analyze a large sample of COVID-19 rumor cascades from Twitter that have been fact-checked by third-party organizations. The data comprise N = 10,610 rumor cascades that have been retweeted more than 24 million times. We investigate whether COVID-19 misinformation spreads more virally than the truth and whether the differences in the diffusion of true vs. false rumors can be explained by the moral emotions they carry. We observe that, on average, COVID-19 misinformation is more likely to go viral than truthful information. However, the veracity effect is moderated by moral emotions: false rumors are more viral than the truth if the source tweets embed a high number of other-condemning emotion words, whereas a higher number of self-conscious emotion words is linked to a less viral spread. The effects are pronounced both for health misinformation and for false political rumors. These findings offer insights into how true vs. false rumors spread and highlight the importance of considering emotions from the moral emotion families in social media content.
Regulating Online Political Advertising
In the United States, regulations have been established in the past to oversee political advertising in TV and radio. The laws governing these marketplaces were enacted with the fundamental premise that important political information is provided to voters through advertising, and politicians should be able to easily inform the public. Today, online advertising constitutes a major part of all political ad spending, but lawmakers have not been able to keep up with this rapid change. In the online advertising marketplace, ads are typically allocated to the highest bidder through an auction. Auction mechanisms provide benefits to platforms in terms of revenue maximization and automation, but they operate very differently to offline advertising, and existing approaches to regulation cannot be easily implemented in auction-based environments. We first provide a theoretical model and deliver key insights that can be used to regulate online ad auctions for political ads, and analyze the implications of the proposed interventions empirically. We characterize the optimal auction mechanisms where the regulator takes into account both the ad revenues collected and societal objectives (such as the share of ads allocated to politicians, or the prices paid by them). We use bid data generated from Twitter's political advertising database to analyze the implications of implementing these changes. The results suggest that achieving favorable societal outcomes at a small revenue cost is possible through easily implementable, simple regulatory interventions.
Et tu, Brute? Privacy Analysis of Government Websites and Mobile Apps
Past privacy measurement studies on web tracking focused on high-ranked commercial websites, as user tracking is extensively used for monetization on those sites. Conversely, governments across the globe now offer services online which, unlike commercial sites, are funded by public money and do not generally make it to the top million website lists. As such, web tracking on those services has not been comprehensively studied, even though these services deal with privacy- and security-sensitive user data and are used by a significant number of users. In this paper, we perform privacy and security measurements on government websites and Android apps: 150,244 unique websites (from 206 countries) and 1166 Android apps (from 71 countries). We found numerous commercial trackers on these services - e.g., 17% of government websites and 37% of government Android apps host Google trackers; 13% of government sites contain YouTube cookies with an expiry date in the year 9999; and 27% of government Android apps leak sensitive information (e.g., user/device identifiers, passwords, API keys) to third parties or to any network attacker (when sent over HTTP). We also found that 304 government sites and 40 apps are flagged by VirusTotal as malicious. We hope our findings will help improve the privacy and security of online government services, given that governments are now apparently taking Internet privacy/security seriously and imposing strict regulations on commercial sites.
Compressive Sensing Approaches for Sparse Distribution Estimation Under Local Privacy
In recent years, local differential privacy (LDP) has been adopted by many web service providers like Google \cite{erlingsson2014rappor}, Apple \cite{apple2017privacy} and Microsoft \cite{bolin2017telemetry} to collect and analyse users' data privately. In this paper, we consider the problem of discrete distribution estimation under local differential privacy constraints. Distribution estimation is one of the most fundamental estimation problems and is widely studied in both non-private and private settings. In the local model, private mechanisms with provably optimal sample complexity are known. However, they are optimal only in the worst-case sense; their sample complexity is proportional to the size of the entire universe, which could be huge in practice. In this paper, we consider sparse or approximately sparse (e.g., highly skewed) distributions, and show that the number of samples needed can be significantly reduced. This problem has been studied recently \cite{acharya2021estimating}, but that work only considers strictly sparse distributions and the high-privacy regime. We propose new privatization mechanisms based on compressive sensing. Our methods work for approximately sparse distributions and the medium-privacy regime, and have optimal sample and communication complexity.
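To make the setting concrete, here is a heavily simplified NumPy sketch of one plausible pipeline: each user privatizes a single linear measurement of their one-hot item indicator with Laplace noise, and the server recovers an approximately sparse distribution from the averaged measurements. The measurement design, noise calibration, and decoder below are illustrative assumptions, not the mechanisms proposed in the paper.

```python
import numpy as np

def privatize(item, A_row, epsilon):
    # A user holding `item` (index in [0, k)) reports one noisy linear
    # measurement of the one-hot vector e_item; noise is calibrated to the
    # measurement's sensitivity so the single report is epsilon-LDP.
    value = A_row[item]
    sensitivity = 2 * np.abs(A_row).max()
    return value + np.random.laplace(scale=sensitivity / epsilon)

def estimate_sparse_distribution(reports, row_ids, A, n_iter=300, lr=0.5):
    # Average the reports assigned to each measurement row (assumes every
    # row received at least one report), then recover an approximately
    # sparse p with y ~ A p via projected gradient steps toward the simplex
    # (a crude stand-in for a compressive-sensing decoder).
    m, k = A.shape
    y = np.array([np.mean([r for r, j in zip(reports, row_ids) if j == i])
                  for i in range(m)])
    p = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        p -= lr * A.T @ (A @ p - y) / m
        p = np.clip(p, 0.0, None)
        p /= p.sum() + 1e-12
    return p
```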
Measuring Alexa Skill Privacy Practices across Three Years
Smart Personal Assistants (SPA) are transforming the way users interact with technology. This transformation is mostly fostered by the proliferation of voice-driven applications (called skills) offered by third-party developers through an online market. The number of skills has rocketed in recent years, with the Amazon Alexa skill ecosystem growing from just 135 skills in early 2016 to about 125k skills in early 2021. Along with the growth in skills, there is increasing concern over the risks that third-party skills pose to users' privacy. In this paper, we perform a systematic and longitudinal measurement study of the Alexa marketplace. We shed light on how this ecosystem evolves using data collected across three years between 2019 and 2021. We demystify developers' data disclosure practices and present an overview of the third-party ecosystem. We see how the research community continuously contributes to the market's sanitation, but the Amazon vetting process still requires significant improvement. We performed a responsible disclosure process, reporting 675 skills with privacy issues to both Amazon and all affected developers, out of which 246 have important issues (i.e., broken traceability). We see that 107 out of the 246 (43.5%) skills continue to display broken traceability almost one year after being reported. Overall, the state of affairs in the ecosystem has improved over the years. Yet, newly submitted skills and unresolved known issues pose an endemic risk.
Measuring the Privacy vs. Compatibility Trade-off in Preventing Third-Party Stateful Tracking
Despite much web privacy research on sophisticated tracking techniques (e.g., fingerprinting, cache collusion, bounce tracking), most tracking on the Web is still done by transmitting stored identifiers across site boundaries. "Stateful" tracking is not a bug but a misfeature of classical browser storage policies: per-site storage is shared across all visits, from both first- and third-party (i.e., embedded in other sites) context, enabling the most pervasive forms of online tracking.
DP-VAE: Human-Readable Text Anonymization for Online Reviews with Differentially Private Variational Autoencoders
As advanced machine learning models have proven to be effective at identifying individuals from the writing style of their texts or from their behavior in picking and judging movies, it is highly important to investigate methods that can ensure the anonymity of users sharing their data on the web.
A Rapid Source Localization Method in the Early Stage of Large-scale Network Propagation
Recently, sensor-based methods for diffusion source localization have worked efficiently in relieving the hazards of propagation. In reality, the best opportunity is to infer the source immediately once sensors observe a hazard propagating. However, most existing algorithms need the observed information of all prearranged sensors to infer the diffusion sources, probably missing the best chance to restrain the hazard propagation. Therefore, how to infer the source accurately and timely based on the limited information of sensors in the early propagation stage is of great significance. In this paper, we propose a novel greedy full-order neighbor localization (GFNL) strategy to solve this problem. More specifically, GFNL includes two main components, i.e., a greedy sensor deployment strategy (DS) and a direction-path-based source estimator strategy (ES). In more detail, in order to ensure sensors can capture the diffusion information in the early period, DS deploys a set of sensors in a network based on the proposed greedy strategy to minimize the geodesic distance between the source candidate set and the sensor set. Then, when a fraction of sensors observe a propagation, the acquired information is utilized by ES to infer the source. Comprehensive experiments demonstrate the superiority of our proposed GFNL compared with some state-of-the-art methods.
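A toy version of the greedy deployment step is sketched below, using networkx and assuming unweighted shortest-path (geodesic) distance. The exact objective and tie-breaking are simplified guesses at the strategy described above, not the paper's DS component.

```python
import networkx as nx

def greedy_sensor_deployment(G, candidates, budget):
    """Hedged sketch: greedily add sensors so the total geodesic distance
    from each candidate source to its nearest deployed sensor shrinks."""
    dist = dict(nx.all_pairs_shortest_path_length(G))
    sensors, nearest = [], {c: float("inf") for c in candidates}
    for _ in range(budget):
        def cost_if_added(v):
            return sum(min(nearest[c], dist[c].get(v, float("inf"))) for c in candidates)
        best = min((v for v in G.nodes if v not in sensors), key=cost_if_added)
        sensors.append(best)
        for c in candidates:
            nearest[c] = min(nearest[c], dist[c].get(best, float("inf")))
    return sensors

# Example on a small random graph with three hypothetical candidate sources.
G = nx.erdos_renyi_graph(50, 0.1, seed=1)
print(greedy_sensor_deployment(G, candidates=[0, 1, 2], budget=4))
```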
An Invertible Graph Diffusion Neural Network for Source Localization
Localizing the source of graph diffusion phenomena, such as misinformation propagation, is an important yet extremely challenging task in the real world. Existing source localization models typically are heavily dependent on hand-crafted rules and only tailored for certain domain-specific applications. Unfortunately, a large portion of the graph diffusion process for many applications is still unknown to human beings, so it is important to have expressive models for learning such underlying rules automatically. Recently, there is a surge of research on expressive models such as Graph Neural Networks (GNNs) for automatically learning the underlying graph diffusion. However, source localization is instead the inverse of graph diffusion, which is a typical inverse problem on graphs that is well known to be ill-posed because there can be multiple solutions, and hence it differs from traditional (semi-)supervised learning settings. This paper aims to establish a generic framework of invertible graph diffusion models for source localization in graphs, namely Invertible Validity-aware Graph Diffusion (IVGD), to handle major challenges including 1) the difficulty of leveraging knowledge in graph diffusion models for modeling their inverse processes in an end-to-end fashion, 2) the difficulty of ensuring the validity of the inferred sources, and 3) efficiency and scalability in source inference. Specifically, first, to inversely infer sources of graph diffusion, we propose a graph residual scenario to make existing graph diffusion models invertible with theoretical guarantees; second, we develop a novel error compensation mechanism that learns to offset the errors of the inferred sources. Finally, to ensure the validity of the inferred sources, a new set of validity-aware layers has been devised to project inferred sources to feasible regions by flexibly encoding constraints with unrolled optimization techniques. A linearization technique is proposed to strengthen the efficiency of our proposed layers. The convergence of
Generating Simple Directed Social Network Graphs for Information Spreading
Online social networks have become a dominant medium in everyday life to stay in contact with friends and to share information. In Twitter, users can establish connections with other users by following them, who in turn can follow back. In recent years, researchers have studied several properties of social networks and designed random graph models to describe them. Many of these approaches either focus on the generation of undirected graphs or on the creation of directed graphs without modeling the dependencies between directed and reciprocal edges (two directed edges of opposite direction between two nodes). We propose a new approach to create directed social network graphs, focusing on the creation of reciprocal and directed edges separately, but considering correlations between them. To achieve high clustering coefficients (which is common in real-world networks), we apply an edge rewiring procedure that preserves the node degrees.
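To illustrate the rewiring step mentioned above, here is a networkx sketch that accepts a random directed edge swap only when it does not lower the average clustering coefficient while preserving in- and out-degrees. The acceptance rule and stopping criterion are illustrative assumptions, and the sketch ignores the reciprocal-versus-directed distinction that the proposed model treats explicitly.

```python
import random
import networkx as nx

def rewire_for_clustering(G, n_steps=1000, seed=0):
    """Hedged sketch: degree-preserving edge swaps that keep a swap only
    when average clustering does not decrease."""
    rng = random.Random(seed)
    cc = nx.average_clustering(G)
    for _ in range(n_steps):
        (a, b), (c, d) = rng.sample(list(G.edges()), 2)
        if len({a, b, c, d}) < 4 or G.has_edge(a, d) or G.has_edge(c, b):
            continue
        G.remove_edges_from([(a, b), (c, d)])
        G.add_edges_from([(a, d), (c, b)])
        new_cc = nx.average_clustering(G)
        if new_cc >= cc:
            cc = new_cc                                    # keep the swap
        else:
            G.remove_edges_from([(a, d), (c, b)])          # revert the swap
            G.add_edges_from([(a, b), (c, d)])
    return G
```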
Graph Neural Network for Higher-Order Dependency Networks
Graph neural networks (GNNs) have become a popular tool to analyze graph data. Existing GNNs only focus on networks with first-order dependency, that is, conventional networks following the Markov property. However, many networks in real life exhibit higher-order dependency, such as click-stream data where the choice of the next page depends not only on the current page but also on previous pages. This kind of sequential data from complex systems contains natural dependencies, which are often ignored by existing GNNs and make them ineffective. To address this problem, we propose, for the first time, new GNN approaches for higher-order networks in this paper. First, we form sequence fragments from the current node and its predecessor nodes of different orders as candidate higher-order dependencies. When a fragment significantly affects the probability distribution over the different successor nodes of the current node, we include it in the higher-order dependency set. We formulate the network with higher-order dependency as an augmented conventional first-order network, and then feed it into GNNs to obtain network embeddings. Moreover, we further propose a new end-to-end GNN framework for dealing with higher-order networks directly in the model. Specifically, the higher-order dependency is used as the neighbor aggregation controller when the node is embedded and updated. In a graph neural network layer, in addition to the first-order neighbor information, it also aggregates the middle node information from the higher-order dependency segment. We finally test the new approaches on two representative networks with higher-order dependency and compare them with some state-of-the-art GNN methods. The results show significant improvements of these new approaches by considering higher-order dependency.
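The first step, deciding which higher-order contexts actually matter, can be sketched as follows for the second-order case, using a KL-divergence test between the context's next-step distribution and the first-order one. The threshold and test statistic are illustrative assumptions, not the exact rule used in the paper.

```python
from collections import Counter, defaultdict
from math import log2

def detect_second_order_dependencies(walks, threshold=0.1):
    """Hedged sketch: keep context (prev, curr) when its next-step
    distribution diverges from that of curr alone."""
    first = defaultdict(Counter)    # curr -> next-node counts
    second = defaultdict(Counter)   # (prev, curr) -> next-node counts
    for w in walks:
        for i in range(1, len(w) - 1):
            first[w[i]][w[i + 1]] += 1
            second[(w[i - 1], w[i])][w[i + 1]] += 1

    def dist(counter):
        tot = sum(counter.values())
        return {k: v / tot for k, v in counter.items()}

    kept = []
    for ctx, counts in second.items():
        p, q = dist(counts), dist(first[ctx[1]])
        kl = sum(pv * log2(pv / q[k]) for k, pv in p.items() if k in q)
        if kl > threshold:
            kept.append(ctx)
    return kept

# Example: page B leads to different pages depending on the previous page.
walks = [["A", "B", "C"]] * 30 + [["D", "B", "E"]] * 30
print(detect_second_order_dependencies(walks))
```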
Learning the Markov Order of Paths in Graphs
We address the problem of learning the Markov order in categorical sequences that represent paths in a network, i.e., sequences of variable lengths where transitions between states are constrained to a known graph. Such data pose challenges for standard Markov order detection methods and demand modeling techniques that explicitly account for the graph constraint. Adopting a multi-order modeling framework for paths, we develop a Bayesian learning technique that (i) detects the correct Markov order more reliably than a competing method based on the likelihood ratio test, (ii) requires considerably less data than methods using AIC or BIC, and (iii) is robust against partial knowledge of the underlying constraints. We further show that a recently published method that uses a likelihood ratio test exhibits a tendency to overfit the true Markov order of paths, which is not the case for our Bayesian technique. Our method is important for data scientists analyzing patterns in categorical sequence data that are subject to (partially) known constraints, e.g. click stream data or other behavioral data on the Web, information propagation in social networks, mobility trajectories, or pathway data in bioinformatics. Addressing the key challenge of model selection, our work is also relevant for the growing body of research that emphasizes the need for higher-order models in network analysis.
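A compact sketch of the kind of Bayesian evidence computation involved, assuming a symmetric Dirichlet prior whose support is restricted to the successors allowed by the graph. Note that a faithful comparison across orders should score the same set of transitions (e.g., only positions at index >= max_order), which this toy version glosses over; it also assumes the observed paths respect the graph constraint.

```python
from collections import Counter, defaultdict
from math import lgamma

def log_evidence(paths, graph_succ, order, alpha=1.0):
    """Hedged sketch: Dirichlet-multinomial marginal likelihood of a k-th
    order Markov model of paths, with each context's successor support
    restricted by the graph (graph_succ: node -> list of allowed successors)."""
    counts = defaultdict(Counter)
    for p in paths:
        for i in range(order, len(p)):
            counts[tuple(p[i - order:i])][p[i]] += 1

    total = 0.0
    for ctx, c in counts.items():
        allowed = graph_succ[ctx[-1]]
        n, K = sum(c.values()), len(allowed)
        total += lgamma(K * alpha) - lgamma(K * alpha + n)
        total += sum(lgamma(alpha + c[s]) - lgamma(alpha) for s in allowed)
    return total

def select_order(paths, graph_succ, max_order=3):
    # Pick the order with the highest (approximate) evidence.
    return max(range(1, max_order + 1),
               key=lambda k: log_evidence(paths, graph_succ, k))
```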
Accelerating Serverless Computing by Harvesting Idle Resources
Serverless computing automates fine-grained resource scaling and simplifies the development and deployment of online services with stateless functions. However, it is still non-trivial for users to allocate appropriate resources due to various function types, dependencies, and input sizes. Misconfiguration of resource allocations leaves functions either under-provisioned or over-provisioned and leads to continuous low resource utilization. This paper presents Freyr, a new resource manager (RM) for serverless platforms that maximizes resource efficiency by dynamically harvesting idle resources from over-provisioned functions to under-provisioned functions. Freyr monitors each function's resource utilization in real-time, detects over-provisioning and under-provisioning, and learns to harvest idle resources safely and accelerate functions efficiently by applying deep reinforcement learning algorithms along with a safeguard mechanism. We have implemented and deployed a Freyr prototype in a 13-node Apache OpenWhisk cluster. Experimental results show that 38.8% of function invocations have idle resources harvested by Freyr, and 39.2% of invocations are accelerated by the harvested resources. Freyr reduces the 99th-percentile function response latency by 32.1% compared to the baseline RMs.
QCluster: Clustering Packets for Flow Scheduling
Flow scheduling is crucial in data centers, as it directly influences the user experience of applications. According to different assumptions and design goals, there are four typical flow scheduling problems/solutions: SRPT, LAS, Fair Queueing, and Deadline-Aware scheduling. When implementing these solutions in commodity switches with a limited number of queues, they need to set static parameters by measuring traffic in advance, while the optimal parameters vary across time and space. This paper proposes a generic framework, namely QCluster, to adapt all scheduling problems to a limited number of queues. The key idea of QCluster is to cluster packets with similar weights/properties into the same queue. QCluster is implemented in Tofino switches and can cluster packets at a speed of 3.2 Tbps. To the best of our knowledge, QCluster is the fastest clustering algorithm. Experimental results on a testbed with programmable switches and in ns-2 show that QCluster reduces the average flow completion time (FCT) for short flows by up to 56.6%, and reduces the overall average FCT by up to 21.7% over the state-of-the-art. All the source code for ns-2 is available on GitHub.
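The core idea of mapping packets with similar weights into the same (scarce) queue can be sketched in a few lines of software. The class below is a hypothetical, host-side illustration with made-up names; the actual system runs at line rate in the switch data plane and its clustering logic is more involved.

```python
class QueueClusterer:
    """Assign packets with similar weights to the same queue.

    A toy, software-only illustration of the clustering idea: keep one
    running center per hardware queue and send each packet to the
    nearest center, nudging that center towards the packet's weight so
    that queue boundaries track the traffic mix over time.
    """

    def __init__(self, n_queues, initial_centers, lr=0.05):
        assert len(initial_centers) == n_queues
        self.centers = list(initial_centers)
        self.lr = lr

    def enqueue(self, weight):
        # Nearest-center assignment, then a small online update.
        q = min(range(len(self.centers)),
                key=lambda i: abs(self.centers[i] - weight))
        self.centers[q] += self.lr * (weight - self.centers[q])
        return q

clusterer = QueueClusterer(n_queues=4, initial_centers=[1, 10, 100, 1000])
for weight in [3, 12, 850, 7, 1500, 90]:
    print(weight, "-> queue", clusterer.enqueue(weight))
```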
Fograph: Enabling Real-Time Deep Graph Inference with Fog Computing
Graph Neural Networks (GNNs) have gained growing interest in miscellaneous applications owing to their outstanding ability in extracting latent representations on graph structures. To render GNN-based services for IoT-driven smart applications, the traditional model serving paradigm resorts to the cloud by fully uploading geo-distributed input data to the remote datacenter. However, our empirical measurements reveal the significant communication overhead of such cloud-based serving and highlight the profound potential of applying the emerging fog computing. To maximize the architectural benefits brought by fog computing, in this paper, we present Fograph, a novel distributed real-time GNN inference framework that leverages the diverse resources of multiple fog nodes in proximity to IoT data sources. By introducing heterogeneity-aware execution planning and GNN-specific compression techniques, Fograph tailors its design to well accommodate the unique characteristics of GNN serving in fog environments. Prototype-based evaluation and case study demonstrate that Fograph significantly outperforms state-of-the-art cloud serving and vanilla fog deployment by up to 5.39× latency speedup and 6.84× throughput improvement.
Robust System Instance Clustering for Large-Scale Web Services
System instance clustering is crucial for large-scale Web services because it can significantly reduce the training overhead of anomaly detection methods. However, the vast number of system instances with massive time points, redundant metrics, and noises bring significant challenges. We propose OmniCluster to accurately and efficiently cluster system instances for large-scale Web services. It combines one-dimensional convolutional autoencoder (1D-CAE), which extracts the main features of system instances, with a simple, novel, yet effective three-step feature selection strategy. We evaluated OmniCluster using real-world data collected from a top-tier content service provider providing services for one billion+ monthly active users (MAU), proving that OmniCluster achieves high accuracy (NMI=0.9160) and reduces the training overhead of five anomaly detection models by 95.01% on average.
Pyramid: Enabling Hierarchical Neural Networks with Edge Computing
As a key 5G enabler technology, edge computing facilitates edge artificial intelligence (edge AI) by allowing machine learning models to be trained at the network edge on edge devices (e.g., mobile phones) and edge servers. Compared with centralized cloud AI, edge AI significantly reduces the network traffic incurred by training data transmission and, in the meantime, enables low-latency inference, which is critical to many delay-sensitive internet-of-things (IoT) applications, e.g., autonomous driving, advanced manufacturing, augmented/virtual reality, etc. Existing studies of edge AI have mainly focused on resource and performance optimization for training and inference, leveraging edge computing merely as a tool for training and inference acceleration. However, the unique ability of edge computing to process data with location awareness - a powerful feature for various IoT applications - has not been leveraged. In this paper, we propose a novel framework named HierML that further unleashes the potential of edge AI by facilitating hierarchical machine learning based on location-aware data processing. We motivate and present HierML with traffic prediction as an illustrative example and validate HierML through experiments conducted on real-world traffic data. The results indicate the usefulness of HierML by showing that a novel multi-layered machine learning model can be built on top of HierML to make accurate local and global traffic predictions with low latency.
HRCF: Enhancing Collaborative Filtering via Hyperbolic Geometric Regularization
In large-scale recommender systems, the user-item networks are generally scale-free or expand exponentially. In representation learning, the latent features (a.k.a, embeddings) depend on how well the embedding space matches the data distribution.
Improving Graph Collaborative Filtering with Neighborhood-enriched Contrastive Learning
Recommender systems play an important role in alleviating the information explosion of various online services. Among the research on recommender systems, graph collaborative filtering methods model the interactions between users and items as a bipartite graph. Despite their effectiveness, these methods suffer from data sparsity in real scenarios. Recently, contrastive learning has been adopted in graph collaborative filtering methods to alleviate the sparsity of data. However, these methods construct contrastive pairs by random sampling, which neglects the neighbor relations among users (or items) and fails to fully exploit the potential of contrastive learning for recommendation.
GSL4Rec: Session-based Recommendations with Collective Graph Structure Learning and Next Interaction Prediction
Users' social connections have recently shown significant benefits to the session-based recommendations, and graph neural networks (GNNs) have exhibited great success in learning the pattern of information flow among users. However, the current paradigm presumes a given social network, which is not necessarily consistent with the fast-evolving shared interests and is expensive to collect. We propose a novel idea to learn the graph structure among users and make recommendations collectively in a coupled framework. This idea raises two challenges, i.e., scalability and effectiveness. We introduce an innovative graph-structure learning framework for session-based recommendations (GSL4Rec) for solving both challenges simultaneously. Our framework has a two-stage strategy, i.e., the coarse neighbor screening and the self-adaptive graph structure learning, to enable the exploration of potential links among all users while maintaining a tractable amount of computation for scalability. We also propose a phased heuristic learning strategy to sequentially and synergistically train the graph learning part and recommendation part of GSL4Rec, thus improving the effectiveness by making the model easier to achieve good local optima. Experiments on five public datasets from different domains demonstrate that our proposed model significantly outperforms strong baselines, including state-of-the-art social network-based methods.
Graph Neural Transport Networks with Non-local Attentions for Recommender Systems
Graph Neural Networks (GNNs) have emerged as powerful tools for collaborative filtering. A key challenge of recommendations is to distill long-range collaborative signals from user-item graphs. Typically, GNNs generate embeddings of users/items by propagating and aggregating the messages between local neighbors. Thus, the ability of GNNs to capture long-range dependencies heavily depends on their depths. However, simply training deep GNNs has several bottleneck effects, e.g., over-fitting, over-smoothing, which may lead to unexpected results if GNNs are not well regularized.
Hypercomplex Graph Collaborative Filtering
Hypercomplex algebras are well-developed in the area of mathematics. Recently, several hypercomplex recommendation approaches have been proposed and yielded great success. However, two vital issues have not been well-considered in existing hypercomplex recommenders. First, these methods are only designed for specific and low-dimensional hypercomplex algebras (e.g., complex and quaternion algebras), ignoring the exploration and utilization of high-dimensional ones. Second, these recommenders treat every user-item interaction as an isolated data instance, without considering high-order user-item collaborative relationships.
Mostra: A Flexible Balancing Framework to Trade-off User, Artist and Platform Objectives for Music Sequencing
We consider the task of sequencing tracks on music streaming platforms where the goal is to maximise not only user satisfaction, but also artist- and platform-centric objectives, needed to ensure long-term health and sustainability of the platform. Grounding the work across four objectives: Satisfaction, Discovery, Exposure and Boost, we highlight the need and the potential to trade-off performance across these objectives, and propose Mostra, a Set Transformer-based encoder-decoder architecture equipped with submodular multi-objective beam search decoding. The proposed model affords system designers the power to balance multiple goals, and dynamically control the impact on user satisfaction to satisfy other, artist- and platform-centric objectives. Through extensive experiments on data from a large-scale music streaming platform, we present insights on the trade-offs that exist across different objectives, and demonstrate that the proposed framework leads to a superior, just-in-time balancing across various objectives.
Contrastive Learning with Positive-Negative Frame Mask for Music Representation
Self-supervised learning, especially contrastive learning, has made an outstanding contribution to the development of many deep learning research fields. Recently, researchers in the acoustic signal processing field noticed its success and leveraged contrastive learning for better music representation. Typically, existing approaches maximize the similarity between two distorted audio segments sampled from the same music. In other words, they ensure semantic agreement at the music level. However, those coarse-grained methods neglect inessential or noisy elements at the frame level, which may be detrimental to learning effective representations of music. Towards this end, this paper proposes a novel Positive-nEgative frame mask for Music Representation based on the contrastive learning framework, abbreviated as PEMR. Concretely, PEMR incorporates a Positive-Negative Mask Generation module, which leverages transformer blocks to generate frame masks on the log-Mel spectrogram. We can generate self-augmented positives and negatives from the mask by masking important or inessential components, respectively. We devise a novel contrastive learning objective to accommodate both the self-augmented positives/negatives and positives sampled from the same music. We conduct experiments on four public datasets. The experimental results on two music-related downstream tasks, music classification and cover song identification, demonstrate the generalization ability and transferability of PEMR for music representation learning.
Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval
With the recent boom of video-based social platforms (e.g., YouTube and TikTok), video retrieval using sentence query has become an important demand and attracts increasing research attention. Despite the decent performance, existing text-video retrieval models in vision and language communities are impractical for large-scale Web search because they adopt brute-force search based on high-dimensional embeddings. To improve the efficiency, Web search engines widely apply vector compression libraries (e.g., FAISS) to post-process the learned embeddings. Unfortunately, separate compression from feature encoding degrades the robustness of representations and incurs performance decay. To pursue a better balance between performance and efficiency, we propose the first quantized representation learning method for cross-view video retrieval, namely Hybrid Contrastive Quantization (HCQ). Specifically, HCQ learns both coarse-grained and fine-grained quantizations with transformers, which provide complementary understandings for texts and videos and preserve comprehensive semantic information. By performing Asymmetric-Quantized Contrastive Learning (AQ-CL) across views, HCQ aligns texts and videos at coarse-grained and multiple fine-grained levels. This hybrid-grained learning strategy serves as strong supervision on the cross-view video quantization model, where contrastive learning at different levels can be mutually promoted. Extensive experiments on three Web video benchmark datasets demonstrate that HCQ achieves competitive performance with state-of-the-art non-compressed retrieval methods while showing high efficiency in storage and computation.
Cross-Modal Ambiguity Learning for Multimodal Fake News Detection
Cross-modal learning is essential to enable accurate fake news detection due to the fast-growing multimodal contents in online social communities. A fundamental challenge of multimodal fake news detection lies in the inherent ambiguity across different content modalities, i.e., decisions made from unimodalities may disagree with each other, which may lead to inferior multimodal fake news detection. To address this issue, we formulate the cross-modal ambiguity learning problem from an information-theoretic perspective and propose CAFE --- an ambiguity-aware multimodal fake news detection method. CAFE mainly consists of 1) a cross-modal alignment module to transform the heterogeneous unimodality features into a shared semantic space, 2) a cross-modal ambiguity learning module to estimate the ambiguity between different modalities, and 3) a cross-modal fusion module to capture the cross-modal correlations. Based on such design, CAFE can judiciously and adaptively aggregate unimodal features and cross-modal correlations, i.e., rely on unimodal features when cross-modal ambiguity is weak and refer to cross-modal correlations when cross-modal ambiguity is strong, to achieve more accurate fake news detection. Experimental studies on two widely used datasets (Twitter and Weibo) demonstrate that CAFE can outperform state-of-the-art fake news detection methods by 2.2-18.9% and 1.7-11.4% in terms of accuracy, respectively.
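The "rely on unimodal features when ambiguity is weak" rule can be mimicked with a tiny fusion function. The sketch below uses the Jensen-Shannon divergence between the text and image predictions as an assumed ambiguity score, which is not necessarily how CAFE estimates ambiguity; the function names and blending rule are illustrative only.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two class distributions."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def ambiguity_aware_fusion(text_probs, image_probs, crossmodal_probs):
    """Blend unimodal and cross-modal predictions by ambiguity.

    When the text and image branches agree (low JS divergence) the
    unimodal average dominates; when they disagree, more weight shifts
    to the cross-modal correlation branch.
    """
    a = min(1.0, js_divergence(text_probs, image_probs) / np.log(2))
    unimodal = 0.5 * (np.asarray(text_probs) + np.asarray(image_probs))
    fused = (1.0 - a) * unimodal + a * np.asarray(crossmodal_probs)
    return fused / fused.sum()

print(ambiguity_aware_fusion([0.9, 0.1], [0.85, 0.15], [0.6, 0.4]))  # agree
print(ambiguity_aware_fusion([0.9, 0.1], [0.2, 0.8], [0.3, 0.7]))    # disagree
```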
VisGNN: Personalized Visualization Recommendation via Graph Neural Networks
In this work, we develop a Graph Neural Network (GNN) framework for the problem of personalized visualization recommendation. The GNN-based framework first represents the large corpus of datasets and visualizations from users as a large heterogeneous graph. It then decomposes each visualization into its data and visual components and jointly models them as a large graph to obtain embeddings of the users, attributes (across all datasets in the corpus), and visual-configurations. From these user-specific embeddings of the attributes and visual-configurations, we can then predict the probability of any visualization arising from a specific user. Finally, our experiments demonstrate the effectiveness of using graph neural networks for automatic and personalized recommendation of visualizations to specific users based on their data and visual (design choice) preferences. To the best of our knowledge, this is the first work to develop and leverage GNNs for this problem.
Towards an Interpretable Approach to Classify and Summarize Crisis Events from Microblogs
Microblogging platforms like Twitter have been heavily leveraged to report and exchange information about natural disasters. The real-time data on these sites is highly helpful in gaining situational awareness and planning aid efforts. However, disaster-related messages are immersed in a high volume of irrelevant information. The situational data of disaster events also vary greatly in terms of information types, ranging from general situational awareness (caution, infrastructure damage, casualties) to individual needs or content not related to the crisis. It thus requires efficient methods to handle data overload and prioritize various types of information. This paper proposes a novel interpretable classification-summarization framework that first classifies tweets into different disaster-related categories and then summarizes those tweets. Unlike existing work, our classification model can provide explanations or rationales for its decisions. In the summarization phase, we employ an Integer Linear Programming (ILP) based optimization technique along with the help of rationales to generate summaries of event categories. Extensive evaluation on large-scale disaster events shows that (a) our model can classify tweets into disaster-related categories with an 85% F1 score and high interpretability, and (b) the summarizer achieves a 5-25% improvement in terms of ROUGE-1 F-score over most state-of-the-art approaches.
Exposing Query Identification for Search Transparency
Search systems control the exposure of ranked content to searchers. In many cases, creators value not only the exposure of their content but, moreover, an understanding of the specific searches where the content surfaced. The problem of identifying which queries expose a given piece of content in the ranking results is an important and relatively under-explored search transparency challenge. Exposing queries are useful for quantifying various issues of search bias, privacy, data protection, security, and search engine optimization.
To Trust or Not To Trust: How a Conversational Interface Affects Trust in a Decision Support System
Trust is an important component of human-AI relationships and plays a major role in shaping the reliance of users on online algorithmic decision support systems. With recent advances in natural language processing, text and voice-based conversational interfaces have provided users with new ways of interacting with such systems. Despite the growing applications of conversational user interfaces (CUIs), little is currently understood about the suitability of such interfaces for decision support and how CUIs inspire trust among humans engaging with decision support systems. In this work, we aim to address this gap and answer the following research question: how does a conversational interface compare to a typical web-based graphical user interface for building trust in the context of decision support systems? To this end, we built two distinct user interfaces: 1) a text-based conversational interface, and 2) a conventional web-based graphical user interface. Both of these served as interfaces to an online decision support system for suggesting housing options to the participants, given a fixed set of constraints. We carried out a 2x2 between-subjects study on the Prolific crowdsourcing platform. Our findings present clear evidence that suggests the conversational interface was significantly more effective in building user trust and satisfaction in the online housing recommendation system when compared to the conventional web interface. Our results highlight the potential impact of conversational interfaces for trust development in web-based decision support systems.
Generating Perturbation-based Explanations with Robustness to Out-of-Distribution Data
Perturbation-based techniques are promising for explaining black-box machine learning models due to their effectiveness and ease of implementation. However, prior works have faced the problem of Out-of-Distribution (OoD) data --- an artifact of randomly perturbed data becoming inconsistent with the original dataset, degrading the reliability of generated explanations. To the best of our knowledge, the OoD data problem in perturbation-based algorithms is still under-explored. This work addresses the OoD issue by designing a simple yet effective module that can quantify the affinity between the perturbed data and the original dataset distribution. Specifically, we penalize the influence of unreliable OoD data among the perturbed samples by integrating the inlier scores and prediction results of the target models, thereby making the final explanations more robust. Our solution is shown to be compatible with the most popular perturbation-based XAI algorithms, such as RISE, OCCLUSION, and LIME. Extensive experiments confirm that our methods exhibit superior performance in most cases on computational and cognitive metrics. In particular, we point out the degradation problem of the RISE algorithm for the first time. With our design, the performance of RISE can be boosted significantly. Besides, our solution also resolves a fundamental problem with a faithfulness indicator, a commonly used evaluation metric of XAI algorithms that appears sensitive to the OoD issue.
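One simple way to realize the inlier-weighting idea is to down-weight perturbed samples that an outlier detector considers unusual before fitting a LIME-style linear surrogate. The sketch below uses scikit-learn's IsolationForest as an assumed affinity module and a ridge surrogate; it is not the authors' method, only an illustration of the weighting scheme, and all names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import Ridge

def ood_aware_explanation(model_predict, x, background, n_samples=500, seed=0):
    """LIME-style perturbation explanation with OoD down-weighting.

    Perturbed points replace random subsets of features of x with
    background means; each point's contribution to the surrogate fit is
    scaled by an inlier score from an IsolationForest trained on the
    original data (a simple stand-in for an affinity module).
    """
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    masks = rng.integers(0, 2, size=(n_samples, d))          # 1 = keep feature
    mean_bg = background.mean(axis=0)
    perturbed = masks * x + (1 - masks) * mean_bg
    preds = model_predict(perturbed)
    iso = IsolationForest(random_state=seed).fit(background)
    scores = iso.score_samples(perturbed)                     # higher = more inlier
    weights = (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)
    surrogate = Ridge(alpha=1.0).fit(masks, preds, sample_weight=weights)
    return surrogate.coef_                                    # per-feature importance

# toy usage: explain a linear "black box" around one point
background = np.random.default_rng(1).normal(size=(200, 4))
black_box = lambda X: X @ np.array([2.0, -1.0, 0.0, 0.5])
print(ood_aware_explanation(black_box, background[0], background))
```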
BZNet: Unsupervised Multi-scale Branch Zooming Network for Detecting Low-quality Deepfake Videos
Generating a deep learning-based fake video is no longer rocket science. The advancement of automated Deepfake (DF) generation tools that mimic certain targets has rendered society vulnerable to fake news or misinformation propagation. In real-world scenarios, DF videos are compressed to low-quality (LQ) videos, taking up less storage space and facilitating dissemination through social media. Such LQ DF videos are much more challenging to detect than high-quality (HQ) DF videos. To address this challenge, we rethink the design of standard deep learning-based DF detectors, specifically exploiting feature extraction to enhance the features of LQ images. We propose a novel LQ DF detection architecture, multi-scale Branch Zooming Network (BZNet), which adopts an unsupervised super-resolution (SR) technique and utilizes multi-scale images for training. We train our BZNet only using highly compressed LQ images and experiment under a realistic setting, where HQ training data are not readily accessible. Extensive experiments on the FaceForensics++ LQ and GAN-generated datasets demonstrate that our BZNet architecture improves the detection accuracy of existing CNN-based classifiers by 4.21% on average. Furthermore, we evaluate our method against a real-world Deepfake-in-the-Wild dataset collected from the internet, which contains 200 videos featuring 50 celebrities worldwide, outperforming the state-of-the-art methods by 4.13%.
From Discrimination to Generation: Knowledge Graph Completion with Generative Transformer
Knowledge graph completion aims to address the problem of extending a KG with missing triples. In this paper, we provide an approach, GenKGC, which converts knowledge graph completion into a sequence-to-sequence generation task with a pre-trained language model. We further introduce relation-guided demonstration and entity-aware hierarchical decoding for better representation learning and fast inference. Experimental results on three datasets show that our approach can obtain better or comparable performance than baselines and achieve faster inference compared with previous methods based on pre-trained language models. We also release a new large-scale Chinese knowledge graph dataset, AliopenKG500, for research purposes. Code and datasets are available at https://github.com/zjunlp/PromptKGC/tree/main/GenKGC.
Hypermedea: A Framework for Web (of Things) Agents
Hypermedea is an extension of the JaCaMo multi-agent programming framework for acting on Web and Web of Things environments. It includes four main components: (1) a Linked Data component to discover statements (as RDF triples) about some Web environment, (2) an ontology component to infer implicit statements from discovered ones, (3) a planner to reason over the consequences of agent operations in the environment, and (4) a protocol binding component that turns high-level commands into low-level protocol-specific operations. Hypermedea is evaluated in terms of the performance of the Linked Data navigation and planning components, which both encapsulate computation-intensive algorithms.
GraphReformCD: Graph Reformulation for Effective Community Detection in Real-World Graphs
Community detection, one of the most important tools for graph analysis, finds groups of strongly connected nodes in a graph. However, community detection may suffer from misleading information in a graph, such as a nontrivial number of inter-community edges or an insufficient number of intra-community edges. In this paper, we propose GraphReformCD, which reformulates a given graph into a new graph in such a way that community detection can be conducted more accurately. For the reformulation, it builds a k-nearest-neighbor graph that gives each node k opportunities to connect itself to nodes that are likely to belong to the same community as the node. To find the nodes that belong to the same community, it employs structural similarities such as the Jaccard index and SimRank. To validate the effectiveness of GraphReformCD, we perform extensive experiments with six real-world and four synthetic graphs. The results show that GraphReformCD enables state-of-the-art methods to improve their accuracy.
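The reformulation step (connecting each node to its k structurally most similar nodes) is easy to sketch with the Jaccard index mentioned in the abstract. The snippet below is a naive O(n^2) illustration with invented names, not the paper's implementation, and it omits the SimRank variant.

```python
def jaccard(a, b):
    """Jaccard similarity of two neighbor sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def reformulate_knn_graph(adj, k=3):
    """Rebuild a graph by connecting each node to its k most structurally
    similar nodes (Jaccard over closed neighborhoods), in the spirit of
    the reformulation step described in the abstract."""
    nodes = list(adj)
    new_edges = set()
    for u in nodes:
        nu = adj[u] | {u}                 # include the node itself
        scored = sorted(
            ((jaccard(nu, adj[v] | {v}), v) for v in nodes if v != u),
            reverse=True)
        for _, v in scored[:k]:
            new_edges.add(tuple(sorted((u, v))))
    return new_edges

# two loosely linked communities {1,2,3} and {4,5,6}
adj = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}
print(sorted(reformulate_knn_graph(adj, k=2)))
```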
GraphZoo: A Development Toolkit for Graph Neural Networks with Hyperbolic Geometries
Hyperbolic spaces have recently gained prominence for representation learning in graph processing tasks such as link prediction and node classification. Several Euclidean graph models have been adapted to work in the hyperbolic space and the variants have shown a significant increase in performance. However, research and development in graph modeling currently involve several tedious tasks with a scope of standardization including data processing, parameter configuration, optimization tricks, and unavailability of public codebases. With the proliferation of new tasks such as knowledge graph reasoning and generation, there is a need in the community for a unified framework that eases the development and analysis of both Euclidean and hyperbolic graph networks, especially for new researchers in the field. To this end, we present a novel framework GraphZoo, that makes learning, using, and designing graph processing pipelines/models systematic by abstraction over the redundant components. The framework contains a
QAnswer: Towards question answering search over websites
Question Answering (QA) is increasingly used by search engines to provide results to their end users, yet very few websites currently use QA technologies for their search functionality. To illustrate the potential of QA technologies for the website search practitioner, we demonstrate web searches that combine QA over knowledge graphs and QA over free text -- each being usually tackled separately. We also discuss the different benefits and drawbacks of both approaches for website searches. We use case studies of websites hosted by the Wikimedia Foundation (namely Wikipedia and Wikidata). Differently from a search engine (e.g., Google, Bing), the data are indexed integrally, i.e., we do not index only a subset, and exclusively, i.e., we index only data available on the corresponding website.
A Graph Temporal Information Learning Framework for Popularity Prediction
Effectively predicting the future popularity of online content has important implications in a wide range of areas, including online advertising, user recommendation, and fake news detection. Existing approaches mainly consider the popularity prediction task via path modeling or discrete graph modeling. However, most of them heavily exploit the underlying diffusion structural and sequential information, while ignoring the temporal evolution information among different snapshots of cascades. In this paper, we propose a graph temporal information learning framework based on an improved graph convolutional network (GTGCN), which can capture both the temporal information governing the spread of information within a snapshot, and the inherent temporal dependencies among different snapshots. We validate the effectiveness of the GTGCN by applying it to a Sina Weibo dataset in the scenario of predicting retweet cascades. Experimental results demonstrate the superiority of our proposed method over state-of-the-art approaches.
SHACL and ShEx in the Wild: A Community Survey on Validating Shapes Generation and Adoption
Knowledge Graphs (KGs) are the de-facto standard to represent heterogeneous domain knowledge on the Web and within organizations. Various tools and approaches exist to manage KGs and ensure the quality of their data. Among these, the Shapes Constraint Language (SHACL) and the Shapes Expression Language (ShEx) are the two state-of-the-art languages to define validating shapes for KGs. In the last few years, the usage of these constraint languages has increased, and hence new needs arose. One such need is to enable the efficient generation of these shapes. Yet, since these languages are relatively new, we witness a lack of understanding of how they are effectively employed for existing KGs. Therefore, in this work, we answer the question of how validating shapes are being generated and adopted. Our contribution is threefold. First, we conducted a community survey to analyze the needs of users (both from industry and academia) generating validating shapes. Then, we cross-referenced our results with an extensive survey of the ex
Towards Knowledge-Driven Symptom Monitoring & Trigger Detection of Primary Headache Disorders
Headache disorders are experienced by many people around the world. In current clinical practice, the follow-up and diagnosis of headache disorder patients only happens intermittently, based on subjective data self-reported by the patient. The mBrain system tries to make this process more continuous, autonomous and objective by additionally collecting contextual and physiological data via a wearable, mobile app and machine learning algorithms. To support the monitoring of headache symptoms during attacks for headache classification and the detection of headache triggers, much knowledge and contextual data is available from heterogeneous sources, which can be consolidated with semantics. This paper presents a demonstrator of knowledge-driven services that perform these tasks using Semantic Web technologies. These services are deployed in a distributed cascading architecture that includes DIVIDE to derive and manage the RDF stream processing queries that perform the contextually relevant filtering in an intelli
Using Schema.org and Solid for Linked Data-based Machine-to-Machine Sales Contract Conclusion
We present a demo in which two robotic arms, controlled by rule-based Linked Data agents, trade a good in Virtual Reality. The agents follow the necessary steps to conclude a sales contract under German law. To conclude the contract, the agents exchange messages between their Solid Pods. The data in the messages is modelled using suitable terms from Schema.org.
Technology Growth Ranking Using Temporal Graph Representation Learning
A key component of technology sector business strategy is understanding the mechanisms by which technologies are adopted and the rate of their growth over time. Furthermore, predicting how technologies grow in relation to each other informs business decision-making in terms of product definition, research and development, and marketing strategies. An important avenue for exploring technology trends is by looking at activity in the software community. Social networks for developers can provide useful technology trend insights and have an inherent temporal graph structure. We demonstrate an approach to technology growth ranking that adapts spatiotemporal graph neural networks to work with structured temporal relational graph data.
Linking Streets in OpenStreetMap to Persons in Wikidata
Geographic web sources such as OpenStreetMap (OSM) and knowledge graphs such as Wikidata are often unconnected. An example connection that can be established between these sources are links between streets in OSM to the persons in Wikidata they were named after. This paper presents StreetToPerson, an approach for connecting streets in OSM to persons in a knowledge graph based on relations in the knowledge graph and spatial dependencies. Our evaluation shows that we outperform existing approaches by 26 percentage points. In addition, we apply StreetToPerson on all OSM streets in Germany, for which we identify more than 180,000 links between streets and persons.
Am I a Real or Fake Celebrity? Evaluating Face Recognition and Verification APIs under Deepfake Impersonation Attack
Recent advancements in web-based multimedia technologies, such as face recognition web services powered by deep learning, have been significant. As a result, companies such as Microsoft, Amazon, and Naver provide highly accurate commercial face recognition web services for a variety of multimedia applications. Naturally, such technologies face persistent threats, as virtually anyone with access to deepfakes can quickly launch impersonation attacks. These attacks pose a serious threat to authentication services, which rely heavily on the performance of their underlying face recognition technologies. Despite its gravity, deepfake abuse involving commercial web services and their robustness have not been thoroughly measured and investigated. By conducting a case study on celebrity face recognition, we examine the robustness of black-box commercial face recognition web APIs (Microsoft, Amazon, Naver, and Face++) and open-source tools (VGGFace and ArcFace) against Deepfake Impersonation (DI) attacks. We demonstrate the vulnerability of face recognition technologies to DI attacks, achieving respective success rates of 78.0% for targeted (TA) attacks; we also propose mitigation strategies, lowering respective attack success rates to as low as 1.26% for TA attacks with adversarial training. Our code is available here: https://anonymous.4open.science/r/DI_Attack.
Game of Hide-and-Seek: Exposing Hidden Interfaces in Embedded Web Applications of IoT Devices
Recent years have seen increased attacks targeting embedded web applications of IoT devices. An important target of such attacks is the hidden interface of embedded web applications, which employs no protection but exposes security-critical actions and sensitive information to illegitimate users. With the severity and the pervasiveness of this issue, it is crucial to identify the vulnerable hidden interfaces, shed light on best practices and raise public awareness.
ET-BERT: A Contextualized Datagram Representation with Pre-training Transformers for Encrypted Traffic Classification
Encrypted traffic classification requires discriminative and robust traffic representation captured from content-invisible and imbalanced traffic data for accurate classification, which is challenging but indispensable to achieve network security and network management. The major limitation of existing solutions is that they highly rely on the statistical or deep features obtained from the limited labeled data, which are overly dependent on data size and hard to generalize on unseen data. How to leverage open-domain unlabeled traffic data to learn traffic representation with strong generalization ability remains a key challenge. In this paper, we propose a new traffic representation model called Encrypted Traffic Bidirectional Encoder Representations from Transformer (ET-BERT), which pre-trains deep contextualized datagram-level traffic representations from large-scale unlabeled data. The pre-trained ET-BERT model can be fine-tuned on a small number of task-specific labeled data and achieves new state-of-the-art performance across five encrypted traffic classification tasks, remarkably pushing the F1 of ISCX-Tor to 99.2% (4.4% absolute improvement), ISCX-VPN-Service to 98.9% (5.2% absolute improvement), Cross-Platform (Android) to 92.5% (5.4% absolute improvement), and TLS 1.3 to 97.4% (10.0% absolute improvement). Notably, we provide an in-depth explanation of the empirically powerful pre-training model by analyzing the randomness of ciphers. This gives us insights into the boundary of classification ability over encrypted traffic.
Attention-Based Vandalism Detection in OpenStreetMap
OpenStreetMap (OSM), a collaborative crowd-sourced Web map, is a unique source of openly available worldwide map data, increasingly adopted in many Web applications. To maintain trust and transparency, vandalism detection in OSM is critical and remarkably challenging due to the large scale of the dataset, the sheer number of contributors, various vandalism forms, and the lack of annotated data to train machine learning algorithms. This paper presents Ovid - a novel machine learning method for vandalism detection in OpenStreetMap. Ovid relies on a neural network architecture that adopts a multi-head attention mechanism to effectively summarize information indicating vandalism from OpenStreetMap changesets. To facilitate automated vandalism detection, we introduce a set of original features that capture changeset, user, and edit information. Furthermore, we extract a dataset of real-world vandalism incidents from the OpenStreetMap's edit history for the first time and provide this dataset as open data. Our evaluation results on real-world vandalism data demonstrate that the proposed Ovid method outperforms the baselines by eight percentage points regarding the F1 score on average.
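A minimal sketch of the attention-based summarization idea is shown below: edit-level feature vectors are attended over with a user/changeset query vector, and a small head scores the changeset. This is a toy PyTorch model with assumed feature dimensions and an invented class name, not Ovid itself or its actual feature set.

```python
import torch
import torch.nn as nn

class ChangesetVandalismClassifier(nn.Module):
    """Toy sketch of attention-based changeset scoring (not Ovid itself).

    Each changeset is a variable-length set of edit feature vectors plus
    one user/changeset feature vector; multi-head attention summarizes
    the edits, and a small head outputs a vandalism probability.
    """

    def __init__(self, edit_dim=16, user_dim=8, hidden=32, heads=4):
        super().__init__()
        self.edit_proj = nn.Linear(edit_dim, hidden)
        self.user_proj = nn.Linear(user_dim, hidden)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden),
                                  nn.ReLU(),
                                  nn.Linear(hidden, 1))

    def forward(self, edit_feats, user_feats):
        edits = self.edit_proj(edit_feats)               # (B, n_edits, hidden)
        query = self.user_proj(user_feats).unsqueeze(1)  # (B, 1, hidden)
        summary, _ = self.attn(query, edits, edits)      # attend over edits
        x = torch.cat([summary.squeeze(1), query.squeeze(1)], dim=-1)
        return torch.sigmoid(self.head(x)).squeeze(-1)

model = ChangesetVandalismClassifier()
scores = model(torch.randn(2, 5, 16), torch.randn(2, 8))
print(scores.shape)  # torch.Size([2])
```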
Link: Black-Box Detection of Cross-Site Scripting Vulnerabilities Using Reinforcement Learning
Black-box web scanners have been a prevalent means of performing penetration testing to find reflected cross-site scripting (XSS) vulnerabilities. Unfortunately, off-the-shelf black-box web scanners suffer from unscalable testing as well as false negatives that stem from a testing strategy that employs fixed attack payloads, thus disregarding the exploitation of contexts to trigger vulnerabilities. To this end, we propose a novel method of adapting attack payloads to a target reflected XSS vulnerability using reinforcement learning (RL). We present Link, a general RL framework in which states, actions, and a reward function are designed to find reflected XSS vulnerabilities in a black-box and fully automatic manner. Link finds 45, 202, and 60 vulnerabilities with no false positives in Firing-Range, OWASP, and WAVSEP benchmarks, respectively, outperforming state-of-the-art web scanners in terms of finding vulnerabilities and ending testing campaigns earlier. Link also finds 44 vulnerabilities in 12 real-world applications, demonstrating the promising efficacy of using RL in finding reflected XSS vulnerabilities.
ONBRA: Rigorous Estimation of the Temporal Betweenness Centrality in Temporal Networks
In network analysis, the betweenness centrality of a node informally captures the fraction of shortest paths visiting that node. The computation of the betweenness centrality measure is a fundamental task in the analysis of modern networks, enabling the identification of the most central nodes in such networks. In addition to being massive, modern networks also contain information about the time at which their events occur. Such networks are often called temporal networks. The presence of temporal information makes the study of betweenness centrality in temporal networks (i.e., temporal betweenness centrality) much more challenging than in static networks (i.e., networks without temporal information). Moreover, the exact computation of temporal betweenness centrality is often impractical on even moderately sized networks, given the extremely high computational cost of the task. A natural approach to reduce this computational cost is to obtain high-quality estimates of the exact values of temporal betweenness centrality, computed using rigorous sampling-based algorithms.
Temporal Walk Centrality: Ranking Nodes in Evolving Networks
We propose the Temporal Walk Centrality, which quantifies the importance of a node by measuring its ability to obtain and distribute information in a temporal network. In contrast to the widely-used betweenness centrality, we assume that information does not necessarily spread on shortest paths but on temporal random walks that satisfy the time constraints of the network. We show that temporal walk centrality can identify nodes playing central roles in dissemination processes that might not be detected by related betweenness concepts and other common static and temporal centrality measures. We propose exact and approximation algorithms with different running times depending on the properties of the temporal network and parameters of our new centrality measure. A technical contribution is a general approach to lift existing algebraic methods for counting walks in static networks to temporal networks. Our experiments on real-world temporal networks show the efficiency and accuracy of our algorithms. Finally, we demonstrate that the rankings by temporal walk centrality often differ significantly from those of other state-of-the-art temporal centralities.
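As a rough intuition for walk-based temporal centrality, the sketch below counts, for each node, the length-2 temporal walks that pass through it (an arrival followed by a strictly later departure). This is only a crude proxy written for illustration; the paper's measure counts longer walks with time-respecting weights and relies on far more efficient algebraic algorithms.

```python
from collections import defaultdict

def temporal_walk_through_counts(temporal_edges):
    """Count length-2 temporal walks passing through each node.

    temporal_edges: iterable of (u, v, t). A walk u -> v -> w counts for
    v whenever the edge leaving v departs strictly later than the edge
    arriving at v. A crude, length-2 proxy for temporal walk counting,
    not the measure defined in the paper.
    """
    in_times = defaultdict(list)   # arrival times at each node
    out_times = defaultdict(list)  # departure times from each node
    for u, v, t in temporal_edges:
        out_times[u].append(t)
        in_times[v].append(t)
    centrality = {}
    for v in set(in_times) | set(out_times):
        outs = sorted(out_times[v])
        count = 0
        for t_in in in_times[v]:
            # departures strictly after the arrival continue the walk
            count += sum(1 for t_out in outs if t_out > t_in)
        centrality[v] = count
    return centrality

edges = [("a", "b", 1), ("b", "c", 2), ("b", "d", 3), ("c", "b", 4), ("b", "e", 5)]
print(temporal_walk_through_counts(edges))
```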
FirmCore Decomposition of Multilayer Networks
A key graph mining primitive is extracting dense structures from graphs, and this has led to interesting notions such as k-cores which subsequently have been employed as building blocks for capturing the structure of complex networks and for designing efficient approximation algorithms for challenging problems such as finding the densest subgraph. In applications such as biological, social, and transportation networks, interactions between objects span multiple aspects. Multilayer (ML) networks have been proposed for accurately modeling such applications. In this paper, we present FirmCore, a new family of dense subgraphs in ML networks, and show that it satisfies many of the nice properties of k-cores in single-layer graphs. Unlike the state of the art core decomposition of ML graphs, FirmCores have a polynomial time algorithm, making them a powerful tool for understanding the structure of massive networks. We also extend FirmCore for directed ML graphs. We show that FirmCores and directed FirmCores can be used to obtain efficient approximation algorithms for finding the densest subgraphs of ML graphs and their directed counterparts. Our extensive experiments over several real ML graphs show that our FirmCore decomposition algorithm is significantly more efficient than known algorithms for core decompositions of ML graphs. Furthermore, it returns solutions of matching or better quality for the densest subgraph problem over (possibly directed) ML graphs.
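Assuming the (k, λ) definition in which a node must have at least k neighbors in at least λ layers of the remaining subgraph (a detail not spelled out in the abstract), a FirmCore can be sketched with naive iterative peeling as below. This is an illustrative sketch only; the paper's decomposition algorithm is a much more efficient bucket-based procedure and also covers the directed case.

```python
def firmcore(layers, k, lam):
    """Return the assumed (k, lambda)-FirmCore of a multilayer graph.

    layers: list of adjacency dicts {node: set(neighbors)}, one per layer.
    A node survives if, within the remaining subgraph, it has at least k
    neighbors in at least lam layers. Naive iterative peeling.
    """
    nodes = set()
    for layer in layers:
        nodes.update(layer)

    def enough_layers(v, alive):
        good = sum(1 for layer in layers
                   if len(layer.get(v, set()) & alive) >= k)
        return good >= lam

    alive = set(nodes)
    changed = True
    while changed:
        drop = {v for v in alive if not enough_layers(v, alive)}
        alive -= drop
        changed = bool(drop)
    return alive

# two layers over nodes 1..4; node 4 is weakly connected in both layers
layer1 = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
layer2 = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
print(sorted(firmcore([layer1, layer2], k=2, lam=2)))  # -> [1, 2, 3]
```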
Graph Alignment with Noisy Supervision
Recent years have witnessed increasing attention on the application of graph alignment to on-Web tasks, such as knowledge graph integration and social network linking. Despite achieving remarkable performance, prevailing graph alignment models still suffer from noisy supervision, yet how to mitigate the impact of noise in labeled data is still under-explored. The negative sampling based discrimination model has been a feasible solution to detect the noisy data and filter them out. However, due to its sensitivity to the sampling distribution, the negative sampling based discriminator would lead to an inaccurate decision boundary. Furthermore, it is difficult to find an abiding threshold to separate the potential positive (benign) and negative (noisy) data in the whole training process. To address these important issues, in this paper, we design a non-sampling discrimination model resorting to unbiased risk estimation of positive-unlabeled learning to circumvent the impact of negative sampling. We also propose to select the appropriate potential positive data at different training stages by an adaptive filtration threshold enabled by curriculum learning, for maximally improving the performance of the alignment model and discrimination model. Extensive experiments conducted on several real-world datasets validate the effectiveness of our proposed method.
SATMargin: Practical Maximal Frequent Subgraph Mining via Margin Space Sampling
Maximal Frequent Subgraph (MFS) mining asks to identify the maximal subgraph that commonly appears in a set of graphs, which has been found useful in many applications in social science, biology, and other domains. Previous studies focused on reducing the search space of MFSs and discovered the theoretically smallest search space. Despite the theoretical success, no practical algorithm can exhaustively search the space as it is huge even for small graphs with only tens of nodes and hundreds of edges. Moreover, deciding whether a subgraph is an MFS needs to solve subgraph monomorphism (SM), an NP-complete problem that introduces extra challenges. Here, we propose a practical MFS mining algorithm that targets large MFSs, named SATMargin. SATMargin adopts random walk in the search space to perform efficient search and utilizes a customized conflict learning Boolean Satisfiability (SAT) algorithm to accelerate SM queries. We design a mechanism that reuses SAT solutions to combine the random walk and the SAT solver effectively. We evaluate SATMargin over synthetic graphs and six real-world graph datasets. SATMargin shows superior performance to baselines in finding more and larger MFSs. We further demonstrate the effectiveness of SATMargin in a case study of RNA graphs. The identified frequent subgraph by SATMargin well matches the functional core structure of RNAs previously detected in biological experiments.
Filter-enhanced MLP is All You Need for Sequential Recommendation
Recent years have witnessed the remarkable progress in sequential recommendation with deep learning.
Learning to Augment for Casual User Recommendation
Users who come to recommendation platforms are heterogeneous in activity levels. There usually exists a group of core users who visit the platform regularly and consume a large body of contents upon each visit, while others are casual users who tend to visit the platform occasionally and consume less each time. As a result, consumption activities from core users often dominate the training data used for learning. As core users can exhibit different activity patterns from casual users, recommender systems trained on historical user activity data usually achieve much worse performance on casual users than core users. To bridge the gap, we propose a model-agnostic framework L2Aug to improve recommendations for casual users through data augmentation, without sacrificing core user experience. L2Aug is powered by a data augmentor that learns to generate augmented interaction sequences, in order to fine-tune and optimize the performance of the recommendation system for casual users. On four real-world public datasets, L2Aug outperforms other treatment methods and achieves the best sequential recommendation performance for both casual and core users. We also test L2Aug in an online simulation environment with real-time feedback to further validate its efficacy, and showcase its flexibility in supporting different augmentation actions.
Sequential Recommendation with Decomposed Item Feature Routing
Sequential recommendation basically aims to capture user evolving preference.
Unbiased Sequential Recommendation with Latent Confounders
Sequential recommendation holds the promise of understanding user preference by capturing successive behavior correlations.
Intent Contrastive Learning for Sequential Recommendation
Users' interactions with items are driven by various intents (e.g., preparing for holiday gifts, shopping for fishing equipment, etc.). However, users' underlying intents are often unobserved/latent, making it challenging to leverage such a latent intent factor for sequential recommendation (SR).
DiriE: Knowledge Graph Embedding with Dirichlet Distribution
Knowledge graph embedding aims to learn representations of entities and relations in low-dimensional space. Recently, extensive studies combine the characteristics of knowledge graphs with different geometric spaces, including Euclidean space, complex space, hyperbolic space and others, which achieves significant progress in representation learning. However, existing methods are subject to at least one of the following limitations: 1) ignoring the uncertainty, 2) incapability of modeling complex relation patterns. To address the above issues simultaneously, we propose a novel model named DiriE, which embeds entities as Dirichlet distributions and relations as multinomial distributions. DiriE employs Bayesian inference to measure the relations between entities and learns binary embeddings of knowledge graphs for modeling complex relation patterns. Additionally, we propose a two-step negative triple generation method that generates negative triples of both entities and relations. We conduct a solid theoretical analysis to demonstrate the effectiveness and robustness of our method, including the expressiveness of complex relation patterns and the ability to model uncertainty. Furthermore, extensive experiments show that our method outperforms state-of-the-art methods in link prediction on benchmark datasets.
WebFormer: The Web-page Transformer for Structure Information Extraction
Structure information extraction refers to the task of extracting structured text fields of an object from web pages, such as extracting a product offer from a shopping page including product title, description, brand and price. It is an important research topic which has been widely studied in document understanding and web search. Recent natural language models with sequence modeling have demonstrated state-of-the-art performance on web information extraction. However, effectively serializing tokens from unstructured web pages is challenging in practice due to a variety of web layout patterns. Limited work has focused on modeling the web layout for extracting the text fields.
KnowPrompt: Knowledge-aware Prompt-tuning with Synergistic Optimization for Relation Extraction
Recently, prompt-tuning has achieved promising results for specific few-shot classification tasks. The core idea of prompt-tuning is to insert text pieces (i.e., templates) into the input and transform a classification task into a masked language modeling problem. However, for relation extraction, determining an appropriate prompt template requires domain expertise, and it is cumbersome and time-consuming to obtain a suitable label word. Furthermore, there exists abundant semantic and prior knowledge among the relation labels that cannot be ignored. To this end, we focus on incorporating knowledge among relation labels into prompt-tuning for relation extraction and propose a Knowledge-aware Prompt-tuning approach with synergistic optimization (KnowPrompt). Specifically, we inject latent knowledge contained in relation labels into prompt construction with learnable virtual type words and answer words. Then, we synergistically optimize their representation with structured constraints. Extensive experimental results on five datasets with standard and low-resource settings demonstrate the effectiveness of our approach. Our code and datasets are available in an anonymous GitHub link for reproducibility.
Is this Question Real? Dataset Collection on Perceived Intentions and Implicit Attack Detection
The proliferation of social media and online communication platforms has made social interactions more accessible and led to a significant expansion of research into language use with a particular focus on toxic behavior and hate speech. Few studies, however, have focused on the tacit information that may imply a negative intention and the perspective that impacts the interpretation of this intention. Conversation is a joint activity that relies on coordination between what one party expresses and how the other party construes what has been expressed. Thus, how a message is perceived becomes equally important regardless of whether the sent message includes any form of explicit attack or offense. This study focuses on identifying the implicit attacks and the negative intentions in text-based conversation from the point of view of the reader. We focus on questions in conversations and investigate the perceived intention underlying them. We introduce our dataset that includes questions, intention polarity, and the type of attacks. We conduct a meta-analysis on the data to demonstrate how a question can be used as a means of attack and how different perspectives can lead to multiple interpretations. We also report benchmark results of several models for detecting instances of tacit attacks in questions with the aim of avoiding latent or manifest conflict in conversations.
Zero-Shot Stance Detection via Contrastive Learning
Zero-shot stance detection (ZSSD) is challenging as it requires detecting the stance of a previously unseen target during the inference stage. Being able to detect the target-dependent transferable stance features from the training data is arguably an important step in ZSSD. Generally speaking, stance features can be grouped into target-invariant and target-specific categories. Target-invariant stance features carry the same stance regardless of the targets they are associated with. On the contrary, target-specific stance features only co-occur with certain targets. As such, it is important to distinguish these two types of stance features when learning stance features of unseen targets. To this end, in this paper, we revisit ZSSD from a novel perspective by developing an effective approach to distinguish the types (target-invariant/-specific) of stance features, so as to better learn transferable stance features. To be specific, inspired by self-supervised learning, we frame the stance-feature-type identification as a pretext task in ZSSD. Furthermore, we devise a novel hierarchical contrastive learning strategy to capture the correlation and difference between target-invariant and -specific representations and further among different stance labels. This essentially allows the model to exploit transferable stance features more effectively for representing the stance of previously unseen targets. Extensive experiments on three benchmark datasets show that the proposed framework achieves the state-of-the-art performance in ZSSD.
Invited Talk by Daniela Paolotti (ISI Foundation): Having an impact: the challenges of using data science for social good
Semantic IR fused Heterogeneous Graph Model in Tag-based Video Search
With the rapid growth of video resources on the Internet, text-video retrieval has become a common requirement. Scholars have handled text-video retrieval tasks with two broad categories of methods: concept-based methods and neural semantic matching networks. In addition to deep neural semantic matching models, some scholars mined query-video relationships from the click graph, which expresses users' implicit judgments on relevance relations. However, the poor generalization of click-based and concept-based models, and the difficulty semantic-based models have in capturing semantic information from short queries, prevent existing methods from fully enhancing IR performance. In this paper, we propose a framework, ETHGS, that combines the abilities of concept-based, click-based and semantic-based models in IR, and publish a new video retrieval dataset, QVT, from a real-world video search engine. In ETHGS, we make use of tags (i.e., concepts) to construct a heterogeneous graph to allev
Modeling Position Bias Ranking for Streaming Media Services
We tackle the problem of position bias estimation for streaming media services. Position bias is a widely studied topic in the ranking literature, and its impact on ranking quality is well understood. Although several methods exist to estimate position bias, their applicability to an industrial setting is limited, either because they require ad-hoc interventions that harm user experience, or because their learning accuracy is poor. In this paper, we present a novel position bias estimator that overcomes these limitations: it can be applied to the streaming media services scenario without manual interventions while delivering best-in-class estimation accuracy. We compare the proposed method against existing techniques on real and synthetic data and illustrate its applicability to the Amazon Music use case.
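As background for readers, the sketch below fits the classic position-based click model (PBM), in which P(click) = P(examination at position) x P(relevance of item), with a simple EM procedure over click logs. This is the textbook baseline that such estimators build on, not the estimator proposed in the paper, and all names are illustrative.

```python
# Background sketch only: EM for the position-based model on click logs.
import numpy as np
from collections import defaultdict

def fit_pbm(logs, n_positions, n_iters=50):
    """logs: list of (item_id, position, clicked) tuples from ranked lists."""
    theta = np.full(n_positions, 0.5)            # examination prob. per position
    rel = defaultdict(lambda: 0.5)               # relevance per item
    for _ in range(n_iters):
        theta_num = np.zeros(n_positions); theta_den = np.zeros(n_positions)
        rel_num = defaultdict(float); rel_den = defaultdict(float)
        for item, pos, clicked in logs:
            if clicked:
                e = a = 1.0                      # a click implies exam & relevance
            else:                                # posterior given no click
                denom = 1.0 - theta[pos] * rel[item]
                e = theta[pos] * (1.0 - rel[item]) / denom
                a = rel[item] * (1.0 - theta[pos]) / denom
            theta_num[pos] += e; theta_den[pos] += 1.0
            rel_num[item] += a; rel_den[item] += 1.0
        theta = theta_num / np.maximum(theta_den, 1.0)          # M-step
        for item in rel_den:
            rel[item] = rel_num[item] / rel_den[item]
    return theta, dict(rel)

# toy click log: (item, position, clicked)
logs = [(0, 0, 1), (0, 1, 0), (1, 1, 1), (1, 2, 0), (2, 2, 0)]
theta, rel = fit_pbm(logs, n_positions=3)
print(theta)
```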
Multi-task Ranking with User Behaviors for Text-Video Search
Text-video search has become an important requirement on many industrial video-sharing platforms, e.g., YouTube, TikTok, and WeChat Channels, thereby attracting increasing research attention. Traditional relevance-based ranking methods for text-video search concentrate on exploiting the semantic relevance between video and query. However, relevance is no longer the principal issue in the ranking stage, because the candidate items retrieved from the matching stage naturally guarantee adequate relevance. Instead, we argue that boosting user satisfaction should be the ultimate goal of ranking, and that it is promising to excavate cheap and rich user behaviors for model training. To achieve this goal, we propose an effective Multi-Task Ranking pipeline with User Behaviors (MTRUB) for text-video search. Specifically, to exploit multi-modal data effectively, we put forward a Heterogeneous Multi-modal Fusion Module (HMFM) to fuse the query and video features of different modalities in adaptive ways. Besides that, we design
FastClip: An Efficient Video Understanding System with Heterogeneous Computing and Coarse-to-fine Processing
Recently, video media, including live streaming, have been growing exponentially in many areas such as E-commerce shopping and gaming. Understanding video content is critical for many real-world applications. However, processing long videos is usually time-consuming and expensive. In this paper, we present an efficient video understanding system, which aims to speed up video processing with a coarse-to-fine two-stage pipeline and a heterogeneous computing framework. In the first stage, we use a coarse but fast multi-modal filtering module to recognize and remove useless video segments from a long video; this module can be deployed on an edge device and reduces the computation required in the next stage. In the second stage, several semantic models are applied to finely parse the remaining segments. To accelerate model inference, we propose a novel heterogeneous computing framework which trains a model with lightweight and heavyweight backbones to support distributed deployment. Specifically, the heavyweight
Multilingual Semantic Sourcing using Product Images for Cross-lingual Alignment
In online retail stores with ever-increasing catalogs, product search is the primary means for customers to discover products of their interest. Surfacing irrelevant products can lead to a poor customer experience and, in extreme situations, loss of engagement. With the recent advances in NLP, Deep Learning models are being used to represent queries and products in a shared semantic space to enable semantic sourcing. These models require a lot of human-annotated (query, product, relevance) tuples to give competitive results, which is expensive to generate. The problem becomes more prominent in emerging marketplaces/languages due to data paucity. When expanding to new marketplaces, it becomes imperative to support regional languages to reach a wider customer base and delight them with a good customer experience. Recently, in the NLP domain, approaches using parallel data corpora for training multilingual models have become prominent, but such corpora are expensive to generate. In this work, we learn semantic alignm
CoProtector: Protect Open-Source Code against Unauthorized Training Usage with Data Poisoning
GitHub Copilot, trained on billions of lines of public code, has recently become the buzzword in the computer science research and practice community. Although it is designed to provide powerful intelligence to help developers implement safe and effective code, practitioners and researchers raise concerns about its ethical and security problems, e.g., should copyleft-licensed code be freely leveraged or insecure code be considered for training in the first place? These problems have a significant impact on Copilot and other similar products that aim to learn knowledge from large-scale source code through deep learning models, which are inevitably on the rise with the fast development of artificial intelligence. To mitigate such impacts, we argue that there is a need to invent effective mechanisms for protecting open-source code from being exploited by deep learning models. To this end, we design and implement a prototype, CoProtector, which utilizes data poisoning techniques to arm source code repositories for defending against such exploits. Our large-scale experiments empirically show that CoProtector is effective in achieving its purpose, significantly reducing the performance of Copilot-like deep learning models while being able to stably reveal the secretly embedded watermark backdoors.
TTAGN: Temporal Transaction Aggregation Graph Network for Ethereum Phishing Scams Detection
In recent years, phishing scams have become the most serious type of crime on Ethereum, the second-largest blockchain platform. Existing phishing scam detection techniques on Ethereum mostly use traditional machine learning or network representation learning to mine key information from the transaction network to identify phishing addresses. However, these methods only use the most recent transaction record, or even ignore these records entirely, and only manually designed features are taken for the node representation. In this paper, we propose a Temporal Transaction Aggregation Graph Network (TTAGN) to enhance phishing scam detection performance on Ethereum. Specifically, in the temporal edges representation module, we model the temporal relationship of historical transaction records between nodes to construct the edge representation of the Ethereum transaction network. Moreover, the edge representations around a node are aggregated to fuse topological interactive relationships into its representation, also referred to as trading features, in the edge2node module. We further combine trading features with common statistical and structural features obtained by a graph neural network to identify phishing addresses. Evaluated on real-world Ethereum phishing scam datasets, our TTAGN (92.8% AUC and 81.6% F1-score) outperforms the state-of-the-art methods, and the effectiveness of the temporal edges representation and edge2node modules is also demonstrated.
Revisiting Email Forwarding Security under the Authenticated Received Chain Protocol
Email authentication protocols such as SPF, DKIM, and DMARC are used to detect email spoofing attacks, but they face key challenges when handling email forwarding scenarios. In 2019, the new Authenticated Received Chain (ARC) protocol was introduced to support mail forwarding applications while preserving authentication records. Two years later, it is still not well understood how ARC is implemented, deployed, and configured in practice. In this paper, we perform an empirical analysis of ARC usage and examine how it affects spoofing detection decisions on popular email providers that support ARC. After analyzing an email dataset of 400K messages, we show that ARC is not yet widely adopted, but it is starting to attract adoption from major email providers (e.g., Gmail, Outlook). Our controlled experiment shows that most email providers' ARC implementations are done correctly. However, some email providers (Zoho) have misinterpreted the meaning of ARC results, which can be exploited by spoofing attacks. Finally, we empirically investigate forwarding-based ``Hide My Email'' services offered by iOS 15 and Firefox, and show their implementations break ARC and can be leveraged by attackers to launch more successful spoofing attacks against otherwise well-configured email receivers (e.g., Gmail).
HiddenCPG: Large-Scale Vulnerable Clone Detection Using Subgraph Isomorphism of Code Property Graphs
A code property graph (CPG) is a joint representation of syntax, control flows, and data flows of a target application. Recent studies have demonstrated the promising efficacy of leveraging CPGs for the identification of vulnerabilities. It recasts the problem of implementing a specific static analysis for a target vulnerability as a graph query composition problem. It requires devising coarse-grained graph queries that model vulnerable code patterns. Unfortunately, such coarse-grained queries often leave vulnerabilities due to faulty input sanitization undetected. In this paper, we propose HiddenCPG, a scalable system designed to identify various web vulnerabilities, including bugs that stem from incorrect sanitization. We designed HiddenCPG to find a subgraph in a target CPG that matches a given CPG query having a known vulnerability, which is known as the subgraph isomorphism problem. To address the scalability challenge that stems from the NP-complete nature of this problem, HiddenCPG leverages optimization techniques designed to boost the efficiency of matching vulnerable subgraphs. HiddenCPG found 2,464 potential vulnerabilities including 42 CVEs in 7,174 real-world CPGs having a combined total of 1 billion nodes and 1.2 billion edges. The experimental results demonstrate that the scalable detection of buggy code clones in 7,174 real-world PHP applications is feasible with precision and efficiency.
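As an illustration of the underlying matching primitive only, the snippet below runs VF2 subgraph isomorphism from networkx on a toy labeled query and target graph. HiddenCPG's contribution lies in the optimizations that make this primitive scale to CPGs with roughly a billion nodes, none of which are shown here, and the node labels and the query pattern are hypothetical.

```python
# Minimal illustration of the matching primitive: VF2 subgraph isomorphism.
import networkx as nx
from networkx.algorithms import isomorphism

# toy "CPG": nodes carry a coarse label such as an AST node type (hypothetical)
target = nx.DiGraph()
target.add_nodes_from([(1, {"type": "call"}), (2, {"type": "var"}),
                       (3, {"type": "call"}), (4, {"type": "echo"})])
target.add_edges_from([(1, 2), (2, 3), (3, 4)])

# hypothetical query pattern: a variable flows through a call into an echo
query = nx.DiGraph()
query.add_nodes_from([("a", {"type": "var"}), ("b", {"type": "call"}),
                      ("c", {"type": "echo"})])
query.add_edges_from([("a", "b"), ("b", "c")])

gm = isomorphism.DiGraphMatcher(
    target, query,
    node_match=lambda n1, n2: n1["type"] == n2["type"])
print(gm.subgraph_is_isomorphic())             # True for this toy pair
print(list(gm.subgraph_isomorphisms_iter()))   # concrete node mappings
```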
Understanding the Practice of Security Patch Management across Multiple Branches in OSS Projects
Since the users of open source software (OSS) projects may not use the latest version all the time, OSS development teams often support code maintenance for old versions through maintaining multiple stable branches. Typically, the developers create a stable branch for each old stable version, deploy security patches on the branch, and release fixed versions at regular intervals. As such, old-version applications in production environments are protected from disclosed vulnerabilities for a long time. However, the rapidly growing number of OSS vulnerabilities has greatly strained this patch deployment model, and a critical need has arisen for the security community to understand the practice of security patch management across stable branches. In this work, we conduct a large-scale empirical study of stable branches in OSS projects and the security patches deployed on them, by investigating 608 stable branches belonging to 26 popular OSS projects as well as more than 2,000 security fixes for 806 CVEs deployed on stable branches.
Confidence May Cheat: Self-Training on Graph Neural Networks under Distribution Shift
Graph Convolutional Networks (GCNs) have recently attracted vast interest and achieved state-of-the-art performance on graphs, but their success typically hinges on careful training with large amounts of expensive and time-consuming labeled data. To alleviate labeled data scarcity, self-training methods have been widely adopted on graphs by labeling high-confidence unlabeled nodes and then adding them to the training set. In this line, we conduct a thorough empirical study of current self-training methods on graphs. Surprisingly, we find that high-confidence unlabeled nodes are not always useful, and may even introduce a distribution shift between the original labeled dataset and the dataset augmented by self-training, severely hindering the capability of self-training on graphs. To this end, in this paper, we propose a novel Distribution Recovered Graph Self-Training framework (DR-GST), which can recover the distribution of the original labeled dataset. Specifically, we first prove that the loss function of the self-training framework under the distribution shift case equals that under the population distribution if each pseudo-labeled node is weighted by a proper coefficient. Considering the intractability of this coefficient, we then propose to replace it with the information gain, after observing the same changing trend between them, where the information gain is estimated via both dropout variational inference and dropedge variational inference in DR-GST. However, such a weighted loss function will enlarge the impact of incorrect pseudo labels. As a result, we apply a loss correction method to improve the quality of pseudo labels. Both our theoretical analysis and extensive experiments on five benchmark datasets demonstrate the effectiveness of the proposed DR-GST, as well as each well-designed component in DR-GST.
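The skeleton below sketches the general recipe the abstract describes: pseudo-label high-confidence nodes and weight their loss by an information-gain estimate obtained from Monte-Carlo dropout. It is a simplified illustration under assumed interfaces (e.g., a GNN callable `model(x, edge_index)` and a placeholder `DummyGNN`), not the complete DR-GST framework, which additionally uses drop-edge variational inference and loss correction.

```python
# Simplified self-training skeleton with information-gain weighting
# (illustrative; interfaces such as model(x, edge_index) are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DummyGNN(nn.Module):
    """Placeholder model so the sketch runs; ignores edges entirely."""
    def __init__(self, d_in, n_cls):
        super().__init__()
        self.drop = nn.Dropout(0.5)
        self.lin = nn.Linear(d_in, n_cls)
    def forward(self, x, edge_index):
        return self.lin(self.drop(x))

def mc_dropout_probs(model, x, edge_index, n_samples=10):
    """Average softmax predictions over stochastic forward passes."""
    model.train()                                # keep dropout active
    with torch.no_grad():
        return torch.stack([F.softmax(model(x, edge_index), dim=1)
                            for _ in range(n_samples)])   # (S, N, C)

def pseudo_label_weights(probs, threshold=0.9):
    mean_p = probs.mean(dim=0)                                   # (N, C)
    conf, pseudo = mean_p.max(dim=1)
    # information gain ~ predictive entropy minus expected entropy (BALD)
    pred_entropy = -(mean_p * mean_p.clamp_min(1e-12).log()).sum(dim=1)
    exp_entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=2).mean(dim=0)
    weight = (pred_entropy - exp_entropy).clamp(min=0.0)
    weight = weight / weight.max().clamp_min(1e-12)              # scale to [0, 1]
    return conf > threshold, pseudo, weight

def weighted_pseudo_loss(logits, pseudo, weight, keep):
    if keep.sum() == 0:                          # no confident nodes this round
        return logits.sum() * 0.0
    ce = F.cross_entropy(logits[keep], pseudo[keep], reduction="none")
    return (weight[keep] * ce).mean()

x, edge_index = torch.randn(20, 8), torch.empty(2, 0, dtype=torch.long)
model = DummyGNN(8, 3)
keep, pseudo, w = pseudo_label_weights(mc_dropout_probs(model, x, edge_index))
print(weighted_pseudo_loss(model(x, edge_index), pseudo, w, keep))
```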
Polarized Graph Neural Networks
Despite the recent success of Message-passing Graph Neural Networks (MP-GNNs), the strong inductive bias of homophily limits their ability to generalize to heterophilic graphs and leads to the over-smoothing problem. Most existing works attempt to mitigate this issue by emphasizing the contribution from similar neighbors and reducing that from dissimilar ones when performing aggregation, where the dissimilarities are utilized passively and their positive effects are ignored, leading to suboptimal performance. Inspired by the idea of \emph{attitude polarization} in social psychology, whereby people tend to become more extreme when exposed to an opposite opinion, we propose the Polarized Graph Neural Network (Polar-GNN). Specifically, pairwise similarities and dissimilarities of nodes are first modeled with node features and topological structure information. In particular, we assign negative weights to dissimilar neighbors. Then nodes aggregate the messages on a hyper-sphere through a \emph{polarization operation}, which effectively exploits both similarities and dissimilarities. Furthermore, we theoretically demonstrate the validity of the proposed operation. Lastly, an elaborately designed loss function is introduced for the hyper-spherical embedding space. Extensive experiments on real-world datasets verify the effectiveness of our model.
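A rough sketch of the polarization idea, under stated simplifications: neighbors receive signed weights proportional to cosine similarity, so dissimilar neighbors push a node's embedding away rather than merely being down-weighted, and the result is re-projected onto the unit hyper-sphere. The actual Polar-GNN weighting and loss differ.

```python
# Signed-weight aggregation on the hyper-sphere (toy dense-adjacency sketch).
import torch
import torch.nn.functional as F

def polarized_aggregate(h, adj):
    """h: (N, d) node embeddings; adj: (N, N) dense 0/1 adjacency (toy scale)."""
    h = F.normalize(h, dim=1)          # embeddings live on the unit sphere
    sim = h @ h.t()                    # cosine similarity in [-1, 1]
    w = sim * adj                      # signed weights: negative for dissimilar
    agg = h + w @ h                    # self term plus signed neighbour messages
    return F.normalize(agg, dim=1)     # re-project onto the hyper-sphere

h = torch.randn(5, 8)
adj = (torch.rand(5, 5) > 0.5).float()
adj.fill_diagonal_(0)
print(polarized_aggregate(h, adj).shape)   # torch.Size([5, 8])
```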
Graph-adaptive Rectified Linear Unit for Graph Neural Networks
Graph Neural Networks (GNNs) have achieved remarkable success by extending traditional convolution to learning on non-Euclidean data. The key to GNNs is adopting the neural message-passing paradigm with two stages: aggregation and update. The current design of GNNs considers topology information in the aggregation stage. However, in the update stage, all nodes share the same updating function. The identical updating function treats each node embedding as an independent and identically distributed random variable and therefore ignores the implicit relationships between neighborhoods, which limits the capacity of GNNs. The updating function is usually implemented with a linear transformation followed by a nonlinear activation function. To make the updating function topology-aware, we inject topological information into the nonlinear activation function and propose the Graph-adaptive Rectified Linear Unit (GReLU), a new parametric activation function that incorporates neighborhood information in a novel and efficient way. The parameters of GReLU are obtained from a hyperfunction based on both node features and the corresponding adjacency matrix. To reduce the overfitting risk and the computational cost, we decompose the hyperfunction into two independent components for nodes and features respectively. We conduct comprehensive experiments to show that the plug-and-play method, GReLU, is efficient and effective with different GNN backbones on various downstream tasks.
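The toy module below conveys the core idea of a topology-aware activation: per-node, per-channel slopes are produced by a small hyperfunction of the node's own features and an aggregated neighborhood summary. The decomposition into node and feature components used by the real GReLU is omitted, and all layer sizes are illustrative assumptions.

```python
# Hand-wavy sketch of a graph-adaptive activation (not the exact GReLU).
import torch
import torch.nn as nn

class GraphAdaptiveReLU(nn.Module):
    def __init__(self, dim, hidden=16):
        super().__init__()
        # hyperfunction: node feature + neighbourhood summary -> per-channel slopes
        self.hyper = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, h, adj_norm):
        """h: (N, d) pre-activations; adj_norm: (N, N) row-normalised adjacency."""
        neigh = adj_norm @ h                       # neighbourhood summary
        slope = torch.sigmoid(self.hyper(torch.cat([h, neigh], dim=1)))
        # leaky-ReLU-like unit whose negative slope depends on the graph
        return torch.where(h > 0, h, slope * h)

h = torch.randn(6, 8)
adj = torch.rand(6, 6)
adj_norm = adj / adj.sum(dim=1, keepdim=True)
print(GraphAdaptiveReLU(8)(h, adj_norm).shape)     # torch.Size([6, 8])
```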
Designing the Topology of Graph Neural Networks: A Novel Feature Fusion Perspective
In recent years, Graph Neural Networks (GNNs) have shown superior performance on diverse real-world applications. To improve the model capacity, besides designing aggregation operations, GNN topology design is also very important. In general, there are two mainstream GNN topology design manners. The first is to stack aggregation operations to obtain higher-level features, but this easily suffers from a performance drop as the network goes deeper. The second utilizes multiple aggregation operations in each layer, which provides an adequate and independent feature extraction stage on local neighbors but is costly for obtaining higher-level information. To enjoy the benefits while alleviating the corresponding deficiencies of these two manners, we learn to design the topology of GNNs from a novel feature fusion perspective, dubbed F$^2$GNN. To be specific, we provide a feature fusion perspective on designing GNN topology and propose a novel framework to unify existing topology designs with feature selection and fusion strategies. Then we develop a neural architecture search method on top of the unified framework, which contains a set of selection and fusion operations in the search space and an improved differentiable search algorithm. The performance gains on eight real-world datasets demonstrate the effectiveness of F$^2$GNN. We further conduct experiments to show that F$^2$GNN can improve the model capacity while alleviating the deficiencies of existing GNN topology design manners, especially the over-smoothing problem, by utilizing different levels of features adaptively.
RawlsGCN: Towards Rawlsian Difference Principle on Graph Convolutional Network
Graph Convolutional Networks (GCNs) play pivotal roles in many real-world applications. Despite the successes of GCN deployment, GCNs often exhibit performance disparity with respect to node degrees, resulting in worse predictive accuracy for low-degree nodes. We formulate the problem of mitigating the degree-related performance disparity in GCNs from the perspective of the Rawlsian difference principle, which originates from the theory of distributive justice. Mathematically, we aim to balance the utility between low-degree nodes and high-degree nodes while minimizing the task-specific loss. Specifically, we reveal the root cause of this degree-related unfairness by analyzing the gradients of weight matrices in GCN. Guided by the gradients of weight matrices, we further propose a pre-processing method, RawlsGCN-Graph, and an in-processing method, RawlsGCN-Grad, that achieve fair predictive accuracy on low-degree nodes without modifying the GCN architecture or introducing additional parameters. Extensive experiments on real-world graphs demonstrate the effectiveness of our proposed RawlsGCN methods in significantly reducing degree-related bias while retaining comparable overall performance.
Sequential Recommendation via Stochastic Self-Attention
Sequential recommendation models the dynamics of a user's previous behaviors in order to forecast the next item and has drawn a lot of attention. Transformer-based approaches, which embed items as vectors and use dot-product self-attention to measure the relationship between items, demonstrate superior capabilities among existing sequential methods. However, users' real-world sequential behaviors are uncertain rather than deterministic, posing a significant challenge to present techniques. We further suggest that dot-product-based approaches cannot fully capture collaborative transitivity, which can be derived in item-item transitions inside sequences and is beneficial for cold start items. We further argue that BPR loss has no constraint on positive and sampled negative items, which misleads the optimization.
Towards Automatic Discovering of Deep Hybrid Network Architecture for Sequential Recommendation
Recent years have witnessed tremendous success in deep learning-based sequential recommendation (SR), which can capture evolving user preferences and provide more timely and accurate recommendations. One of the most effective deep SR architectures is to stack high-performance residual blocks, e.g., prevalent self-attentive and convolutional operations, for capturing long- and short-range dependence of sequential behaviors. By carefully revisiting previous models, we observe: (1) simple architecture modification of gating each residual connection can help us train deeper SR models and yield significant improvements; (2) compared with self-attention mechanism, stacking of convolution layers also can cover each item of the whole sequential behaviors and achieve competitive or even superior recommendation performance.
Generative Session-based Recommendation
Session-based recommendation has recently attracted increasing attention from the academic and industry communities. Previous models mostly introduce different inductive biases to fit the training data by designing various neural models. However, recommendation data can be quite sparse in practice, especially for user sequential behaviors, which makes it hard to learn reliable models. To solve this problem, in this paper we propose a novel generative session-based recommendation framework. Our key idea is to generate additional samples to complement the original training data. In order to generate high-quality samples, we consider two aspects: (1) the rationality as a sequence of user behaviors, and (2) the informativeness for training the target model. To satisfy these requirements, we design a doubly adversarial network. The first adversarial module aims to make the generated samples conform to the underlying patterns of real user sequential preferences (rationality requirement). The second adversarial module is targeted at broadening the model's experience by generating samples which induce larger model losses (informativeness requirement). In our model, the samples are generated based on a reinforcement learning strategy, where the reward is related to both of the above aspects. To stabilize the training process, we introduce a self-paced regularizer to learn the agent in an easy-to-hard manner. We conduct extensive experiments based on real-world datasets to demonstrate the effectiveness of our model. To promote this research direction, we have released our project at https://anonymous-code-repo.github.io/DASP/.
Disentangling Long and Short-Term Interests for Recommendation
Modeling users' long-term and short-term interests is crucial for accurate recommendation. However, since there is no manually annotated label for user interests, existing approaches always follow the paradigm of entangling these two aspects, which may lead to inferior recommendation accuracy and interpretability. In this paper, to address this, we propose a Contrastive learning framework to disentangle Long and Short-term interests for Recommendation (CLSR) with self-supervision. Specifically, we first propose two separate encoders to independently capture user interests of different time scales. We then extract long-term and short-term interest proxies from the interaction sequences, which serve as pseudo labels for user interests. Then pairwise contrastive tasks are designed to supervise the similarity between interest representations and their corresponding interest proxies. Finally, since the importance of long-term and short-term interests is dynamically changing, we propose to adaptively aggregate them through an attention-based network for prediction. We conduct experiments on two large-scale real-world datasets for e-commerce and short-video recommendation. Empirical results show that our CLSR consistently outperforms all state-of-the-art models with significant improvements: AUC and GAUC are improved by over 0.02, and NDCG is improved by over 10%. Further counterfactual evaluations demonstrate that stronger disentanglement of long and short-term interests is successfully achieved by CLSR.
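A minimal sketch of the self-supervision signal, assuming mean pooling over the whole sequence as the long-term proxy and pooling over the most recent items as the short-term proxy; a triplet-style contrastive term then pushes each interest representation toward its own proxy. The encoders, the exact proxy construction, and the attention-based fusion of CLSR are not shown.

```python
# Proxy construction and a triplet-style disentangling loss (illustrative).
import torch
import torch.nn.functional as F

def interest_proxies(item_emb_seq, k_recent=5):
    """item_emb_seq: (B, T, d) embeddings of a user's interaction sequence."""
    long_proxy = item_emb_seq.mean(dim=1)
    short_proxy = item_emb_seq[:, -k_recent:, :].mean(dim=1)
    return long_proxy, short_proxy

def disentangle_loss(u_long, u_short, long_proxy, short_proxy, margin=0.5):
    """Each interest representation should match its own proxy best."""
    def triplet(anchor, pos, neg):
        d_pos = 1 - F.cosine_similarity(anchor, pos)
        d_neg = 1 - F.cosine_similarity(anchor, neg)
        return F.relu(d_pos - d_neg + margin).mean()
    return (triplet(u_long, long_proxy, short_proxy)
            + triplet(u_short, short_proxy, long_proxy))

seq = torch.randn(4, 20, 16)                      # toy interaction sequences
u_long, u_short = torch.randn(4, 16), torch.randn(4, 16)
lp, sp = interest_proxies(seq)
print(disentangle_loss(u_long, u_short, lp, sp).item())
```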
Efficient Online Learning to Rank for Sequential Music Recommendation
Music streaming services heavily rely upon recommender systems to acquire, engage, and retain users. One notable component of these services are playlists, which can be dynamically generated in a sequential manner based on the user's feedback during a listening session. Online learning to rank approaches have recently been shown effective at leveraging such feedback to learn users' preferences in the space of song features. Nevertheless, these approaches can suffer from slow convergence as a result of their random exploration component and get stuck in local minima as a result of their session-agnostic exploitation component. To overcome these limitations, we propose a novel online learning to rank approach which efficiently explores the space of candidate recommendation models by restricting itself to the orthogonal complement of the subspace of previous underperforming exploration directions. Moreover, to help overcome local minima, we propose a session-aware exploitation component which adaptively leverages the current best model during model updates. Our thorough evaluation using simulated listening sessions from Last.fm demonstrates substantial improvements over state-of-the-art approaches regarding early-stage performance and overall long-term convergence.
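The snippet below illustrates only the exploration restriction described above: a candidate perturbation is sampled and the component lying in the span of previously underperforming directions is projected out, so new exploration stays in their orthogonal complement. Variable names are illustrative, and the session-aware exploitation component is not sketched.

```python
# Exploration restricted to the orthogonal complement of "bad" directions.
import numpy as np

def explore_direction(dim, bad_directions, rng):
    """bad_directions: list of previously underperforming direction vectors."""
    u = rng.standard_normal(dim)
    if bad_directions:
        # orthonormal basis of the "bad" subspace, then remove its component
        Q, _ = np.linalg.qr(np.stack(bad_directions, axis=1))
        u = u - Q @ (Q.T @ u)
    norm = np.linalg.norm(u)
    return u / norm if norm > 1e-12 else u

rng = np.random.default_rng(0)
bad = [np.eye(8)[0]]                              # pretend direction e_0 failed
d = explore_direction(8, bad, rng)
print(abs(np.dot(d, bad[0])) < 1e-10)             # True: orthogonal to e_0
```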
"I Have No Text in My Post": Using Visual Hints to Model User Emotions in Social Media
As emotion plays an important role in people's everyday lives and is often mirrored in their social media use, extensive research has been conducted to characterize and model emotions from social media data. However, prior research has not sufficiently considered trends in social media use - the increasing use of images and the decreasing use of text - nor identified the features of images in social media that are likely to be different from those in non-social media. Our study aims to fill this gap by (1) considering the notion of visual hints that depict contextual information of images, (2) presenting their characteristics in positive or negative emotions, and (3) demonstrating their effectiveness in emotion prediction modeling through an in-depth analysis of their relationship with the text in the same posts. The results of our experiments showed that our visual hint-based model achieved a 20% improvement in emotion prediction compared with the baseline. In particular, the performance of our model was comparable with that of the text-based model, highlighting not only a strong relationship between visual hints of the image and emotion, but also the potential of using only images for emotion prediction, which well reflects current and future trends of social media use.
A Meta-learning based Stress Category Detection Framework on Social Media
Psychological stress has become a widespread and serious health issue in modern society. Detecting the stressors that cause stress could enable people to take effective actions to manage it. Previous work relied on a stressor dictionary built upon words from the stressor-related categories in LIWC (Linguistic Inquiry and Word Count), and focused on stress categories that appear frequently on social media. In this paper, we build a meta-learning based stress category detection framework, which can learn how to distinguish a new stress category with very little data through learning on frequently appearing categories, without relying on any lexicon. It comprises three modules, i.e., an encoder module, an induction module, and a relation module. The encoder module focuses on learning a category-relevant representation of each tweet with a Dependency Graph Convolutional Network and tweet attention. The induction module deploys a Mixture of Experts mechanism to integrate and summarize a representation for each category. The relation module is adopted to measure the correlation between each pair of query tweets and categories. Through the three modules and the meta-training process, we obtain a model which learns how to identify stress categories and can be directly employed on a new category with little labelled data. Our experimental results show that the proposed framework can achieve 75.3% accuracy with 3 labeled examples for rarely appearing stress categories. We also build a stress category dataset consisting of 12 stress categories with 1,549 manually labeled stressful microblogs, which can help train AI models to assist psychological stress diagnosis.
Revisiting Graph based Social Recommendation: A Distillation Enhanced Social Graph Network
Social recommendation, which leverages social connections among users to construct Recommender Systems (RS), plays an important role in alleviating the problem of information overload. Recently, Graph Neural Networks (GNNs) have received increasing attention due to their great capacity for graph data. Since data in RS essentially has graph structure, the field of GNN-based RS is flourishing. However, we argue that existing works lack in-depth thinking about GNN-based social recommendation. These methods contain implicit assumptions that are not well analyzed in practical applications. To tackle these problems, we conduct statistical analyses on widely used social recommendation datasets. We design metrics to evaluate the social information, which can provide guidance about whether and how we should use this information in the RS task. Based on these analyses, we introduce the Knowledge Distillation (KD) technique into social recommendation. We train a model that integrates information from the user-item interaction graph and the user-user social graph, and train two auxiliary models that each use only one of the above graphs. These models are trained simultaneously, where the KD technique makes them learn from each other. The KD technique restricts the training process and can be regarded as a regularization strategy. Our extensive experiments show that our model significantly and consistently outperforms the state-of-the-art competitors on real-world datasets.
Effective Messaging on Social Media: What Makes Online Content Go Viral?
In this paper, we propose and test three content-based hypotheses that significantly increase message virality. We measure virality as the retweet counts of messages in a pair of real-world Twitter datasets: A large dataset - UK Brexit with 51 million tweets from 2.8 million users between June 1, 2015 and May 12, 2019 and a smaller dataset - Nord Stream 2 with 516,000 tweets from 250,000 users between October 1, 2019 and October 15, 2019. We hypothesize, test, and conclude that messages incorporating "negativity bias", "causal arguments" and "threats to personal or societal core values of target audiences" singularly and jointly increase message virality on social media.
A Guided Topic-Noise Model for Short Texts
Researchers using social media data want to understand the discussions occurring in and about their respective fields. These domain
"This is Fake! Shared It by Mistake": Assessing the Intent of Fake News Spreaders
Individuals can be misled by fake news and spread it unintentionally without knowing that it is false. This phenomenon has been frequently observed but has not been investigated. Our aim in this work is to assess the intent of fake news spreaders. To distinguish between intentional versus unintentional spreading, we study the psychological interpretations behind unintentional spreading. With this foundation, we then propose an influence graph, using which we assess the intent of fake news spreaders. Our extensive experiments show that the assessed intent can help significantly differentiate between intentional and unintentional fake news spreaders. Furthermore, the estimated intent can significantly improve the current techniques that detect fake news. To our best knowledge, this is the first work to model individuals' intent in fake news spreading.
Domain Adaptive Fake News Detection via Reinforcement Learning
With social media being a major force in information consumption, accelerated propagation of fake news has presented new challenges for platforms to distinguish between legitimate and fake news. Effective fake news detection is a non-trivial task due to the diverse nature of news domains and expensive annotation costs. In this work, we address the limitations of existing automated fake news detection models by incorporating auxiliary information (e.g., user comments and user-news interactions) into a novel reinforcement learning-based model called \textbf{RE}inforced \textbf{A}daptive \textbf{L}earning \textbf{F}ake \textbf{N}ews \textbf{D}etection (REAL-FND). REAL-FND exploits cross-domain and within-domain knowledge that makes it robust in a target domain, despite being trained in a different source domain. Extensive experiments on real-world datasets illustrate the effectiveness of the proposed model, especially when limited labeled data is available in the target domain.
Fostering Engagement of Underserved Communities with Credible Health Information on Social Media
The COVID-19 pandemic has necessitated rapid top-down dissemination of reliable and actionable information. This presents unique challenges in engaging hard-to-reach, low-literate communities that live in poverty and lack access to the Internet. Voice-based social media platforms, accessible over simple phones, have shown demonstrable impact in connecting underserved populations with each other and providing them access to instrumental information. We describe the design and deployment of a voice-based social media platform in Pakistan for actively engaging such communities with reliable COVID-related information. We developed three strategies to overcome the hesitation, mistrust, and skepticism exhibited by these populations in engaging with COVID content. Users were: (1) encouraged to listen to reliable COVID advisories, (2) incentivized to share reliable content with others, and (3) encouraged to think critically about COVID-related information behaviors. Using a mixed-methods evaluation, we show that users approached with all three strategies had significantly higher engagement with COVID content compared to others. We conclude by discussing how new designs of social media can enable users to engage with and propagate credible information.
Screenshots, Symbols, and Personal Thoughts: The Role of Instagram for Social Activism
In this paper, we highlight the use of Instagram for social activism, taking 2019 Hong Kong protests as a case study.
On Explaining Multimodal Hateful Meme Detection Models
Hateful meme detection is a new multimodal task that has gained significant traction in academic and industry research communities. Recently, researchers have applied pre-trained visual-linguistic models to perform the multimodal classification task, and some of these solutions have yielded promising results. However, what these visual-linguistic models learn for the hateful meme classification task remains unclear. For instance, it is unclear whether these models are able to capture the derogatory or slur references in the multimodality (i.e., image and text) of hateful memes. To fill this research gap, this paper proposes three research questions to improve our understanding of these visual-linguistic models performing the hateful meme classification task. We found that the image modality contributes more to the hateful meme classification task, and the visual-linguistic models are able to perform visual-text slur grounding to a certain extent. Our error analysis also shows that the visual-linguistic models have acquired biases, which resulted in false-positive predictions (i.e., wrongly predicting non-hateful memes as hateful).
Hate Speech in the Political Discourse on Social Media: Disparities Across Parties, Gender, and Ethnicity
Social media has become an indispensable communication channel for politicians. However, the political discourse on social media is increasingly characterized by hate speech, which affects not only the reputation of individual politicians but also the functioning of society at large. In this work, we shed light on the role of hate speech in the political discourse on social media. Specifically, we empirically model how the amount of hate speech in replies to posts from politicians on Twitter depends on the party affiliation, gender, and ethnicity of the politician who posted the tweet. For this purpose, we employ Twitter's Historical API to collect every tweet posted by members of the 117th U.S. Congress over an observation period of more than six months. Additionally, we gather replies for each tweet and use machine learning to predict the amount of hate speech they embed. Subsequently, we implement hierarchical regression models to analyze whether politicians with certain characteristics receive more hate speech. All else being equal, we find that tweets are particularly likely to receive hate speech in replies if they are authored by (i) persons of color from the Democratic party, (ii) white Republicans, and (iii) females. Furthermore, our analysis reveals that more negative sentiment (in the source tweet) is associated with more hate speech (in replies). However, the association varies across parties: negative sentiment attracts more hate speech for Democrats (vs. Republicans). Altogether, our empirical findings imply statistically significant differences in how politicians are treated on social media depending on their party affiliation, gender, and ethnicity.
Q&A
Spot Virtual Machine Eviction Prediction in Microsoft Cloud
Azure Spot Virtual Machines (Spot VMs) utilize unused compute capacity at significant cost savings. They can be evicted when Azure needs the capacity back, and are therefore suitable for workloads that can tolerate interruptions. A good prediction of Spot VM evictions is beneficial for Azure to optimize capacity utilization, and offers users information to better plan Spot VM deployments by selecting clusters so as to reduce potential evictions. The current in-service cluster-level prediction method ignores node heterogeneity by aggregating node information. In this paper, we propose a spatial-temporal node-level Spot VM eviction prediction model to capture inter-node relations and time dependency. Experiments with Azure data show that our node-level eviction prediction model performs better than the node-level and cluster-level baselines.
ROSE: Robust Caches for Amazon Product Search
Product search engines like Amazon Search often use caches to improve the customer experience; caches can improve both the system's latency and search quality. However, as search traffic increases over time, the cache's ever-growing size can diminish overall system performance. Furthermore, typos, misspellings, and redundancy widely witnessed in real-world product search queries can cause unnecessary cache misses, reducing the cache's utility. In this paper, we introduce ROSE, a RObuSt cachE, a system that is tolerant to misspellings and typos while retaining the look-up cost of traditional caches. The core component of ROSE is a randomized hashing schema that enables ROSE to index and retrieve an arbitrarily large set of queries with constant memory and constant time. ROSE is also robust to any query intent, typos, and grammatical errors with theoretical guarantees. Extensive experiments on real-world datasets demonstrate the effectiveness and efficiency of ROSE. ROSE is deployed in the
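To give a flavor of typo-tolerant query hashing in general (the actual ROSE schema is different and comes with the theoretical guarantees mentioned above), the sketch below computes a SimHash signature over character trigrams; queries differing by a small typo usually end up with signatures that are close in Hamming distance, so a cache can bucket them together.

```python
# Illustrative only: SimHash over character trigrams. Small typos change few
# trigrams, so the resulting signatures tend to differ in only a few bits.
import hashlib

def char_ngrams(text, n=3):
    text = f" {text.lower().strip()} "
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]

def simhash(text, bits=32):
    votes = [0] * bits
    for gram in char_ngrams(text):
        h = int(hashlib.md5(gram.encode()).hexdigest(), 16)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

s1 = simhash("wireless headphones")
s2 = simhash("wireles headphones")           # one-character typo
print(hamming(s1, s2), "of 32 bits differ")  # typically a small number
# A robust cache could bucket queries whose signatures lie within a small
# Hamming radius, trading a few extra comparisons for typo tolerance.
```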
Unsupervised Customer Segmentation with Knowledge Graph Embeddings
We propose an unsupervised customer segmentation method based on behavioural data. We model sequences of beer consumption from a publicly available dataset of 2.9M reviews of more than 110,000 brands over 12 years as a knowledge graph, learn their representations with knowledge graph embedding models, and apply off-the-shelf cluster analysis. Experiments and cluster interpretation show that we learn meaningful clusters of beer customers, without relying on expensive consumer surveys or time-consuming data annotation campaigns.
On Reliability Scores for Knowledge Graphs
The Instacart KG is a central data store which contains facts regarding grocery products, ranging from taxonomic classifications to product nutritional information. With a view towards providing reliable and complete information for downstream applications, we propose an automated system for providing these facts with a score based on their reliability. This system passes data through a series of contextualized unit tests; the outcome of these tests are aggregated in order to provide a fact with a discrete score: reliable, questionable, or unreliable. These unit tests are written with explainability, scalability, and correctability in mind.
DC-GNN: Decoupled Graph Neural Networks for Improving and Accelerating Large-Scale E-commerce Retrieval
In large-scale E-commerce retrieval scenarios, Graph Neural Network (GNN) based click-through rate (CTR) prediction has become one of the state-of-the-art techniques due to its powerful capability for topological feature extraction and relational reasoning. However, the conventional GNN-based CTR prediction model used in large-scale E-commerce retrieval suffers from low training efficiency, as E-commerce retrieval normally involves billions of entities and hundreds of billions of relations. Under this limitation on training efficiency, only shallow graph algorithms can be employed, which severely hinders the representation capability of GNNs and consequently weakens retrieval quality. In order to deal with the trade-off between training efficiency and representation capability, we propose the Decoupled Graph Neural Networks framework, namely DC-GNN, to improve and accelerate the GNN-based CTR prediction model for large-scale E-commerce retrieval. Specifically, DC-GNN decouples the conventional paradigm into
Learning Explicit User Interest Boundary for Recommendation
The core objective of modelling recommender systems from implicit feedback is to maximize the positive sample score $s_p$ and minimize the negative sample score $s_n$, which can usually be summarized into two paradigms: the pointwise and the pairwise. The pointwise approaches fit each sample with its label individually, which is flexible in weighting and sampling at the instance level but ignores the inherent ranking property. By qualitatively minimizing the relative score $s_n - s_p$, the pairwise approaches capture the ranking of samples naturally but suffer from low training efficiency. Additionally, both approaches struggle to explicitly provide a personalized decision boundary to determine whether users are interested in unseen items. To address those issues, we innovatively introduce an auxiliary score $b_u$ for each user to represent the User Interest Boundary (UIB) and individually penalize samples that cross the boundary with pairwise paradigms, i.e., the positive samples whose score is lower than $b_u$ and the negative samples whose score is higher than $b_u$. In this way, our approach successfully achieves a hybrid loss of the pointwise and the pairwise paradigms, combining the advantages of both. Analytically, we show that our approach can provide a personalized decision boundary and significantly improve training efficiency without any special sampling strategy. Extensive results show that our approach achieves significant improvements not only on classical pointwise or pairwise models but also on state-of-the-art models with complex loss functions and complicated feature encodings.
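A minimal sketch of the boundary idea, assuming a softplus surrogate for the pairwise penalties: positives are penalized when they score below the user's boundary $b_u$, negatives when they score above it. The exact surrogate, regularization, and per-user parameterization of $b_u$ may differ from the paper.

```python
# Boundary-based hybrid loss sketch (softplus surrogate is an assumption).
import torch
import torch.nn.functional as F

def uib_loss(s_pos, s_neg, b_u):
    """s_pos, s_neg: (B,) scores of positive / sampled negative items;
    b_u: (B,) learnable per-user interest boundary."""
    loss_pos = F.softplus(b_u - s_pos)   # positive items should exceed b_u
    loss_neg = F.softplus(s_neg - b_u)   # negative items should stay below b_u
    return (loss_pos + loss_neg).mean()

s_pos, s_neg = torch.tensor([2.0, 0.5]), torch.tensor([0.1, 1.2])
b_u = torch.tensor([1.0, 1.0], requires_grad=True)
print(uib_loss(s_pos, s_neg, b_u).item())
```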
A Gain-Tuning Dynamic Negative Sampler for Recommendation
Selecting reliable negative training instances is a challenging task in implicit feedback-based recommendation, which is optimized by pairwise learning on user feedback data.
FairGAN: GANs-based Fairness-aware Learning for Recommendations with Implicit Feedback
Ranking algorithms in recommender systems influence people's decisions. Conventional ranking algorithms based on implicit feedback data aim to maximize the utility to users by capturing users' preferences over items. However, these utility-focused algorithms tend to cause fairness issues that require careful consideration in online platforms. Existing fairness-focused studies do not explicitly consider the problem of lacking negative feedback in implicit feedback data, while previous utility-focused methods ignore the importance of fairness in recommendations. To fill this gap, we propose a Generative Adversarial Networks (GANs) based learning algorithm, FairGAN, which maps the exposure fairness issue to the problem of negative preferences in implicit feedback data. FairGAN does not explicitly treat unobserved interactions as negative; instead, it adopts a novel fairness-aware learning strategy to dynamically generate fairness signals. This optimizes the search direction to make FairGAN capable of searching the space of the optimal ranking that can fairly allocate exposure to individual items while preserving users' utilities as much as possible. Experiments on four real-world datasets demonstrate that the proposed algorithm significantly outperforms state-of-the-art algorithms in terms of both recommendation quality and fairness.
CAUSPref: Causal Preference Learning for Out-of-Distribution Recommendation
Despite the tremendous development of recommender systems owing to recent progress in machine learning, current recommender systems are still vulnerable to distribution shifts of users and items in realistic scenarios, leading to sharp declines in performance in testing environments. This is even more severe in many common applications where only implicit feedback from sparse data is available. Hence, it is crucial to promote the performance stability of recommendation methods in different environments. In this work, we first make a thorough analysis of the implicit recommendation problem from the viewpoint of out-of-distribution (OOD) generalization. Then, under the guidance of our theoretical analysis, we propose to incorporate a recommendation-specific DAG learner into a novel causal representation-based recommendation framework named Causal Preference Learning (CPL), mainly consisting of causal learning of invariant user preferences and anti-preference negative sampling to deal with implicit feedback. Extensive experimental results on real-world datasets clearly demonstrate that our approach surpasses the benchmark models significantly under various types of out-of-distribution settings, and show its impressive interpretability.
Knowledge-aware Conversational Preference Elicitation with Bandit Feedback
Conversational recommender systems (CRS) have been proposed recently to mitigate the cold-start problem suffered by traditional recommender systems. By introducing conversational key-terms, existing conversational recommenders can effectively reduce the need for extensive exploration and elicit user preferences faster and more accurately. However, existing conversational recommenders leveraging key-terms rely heavily on the availability and quality of those key-terms, and their performance may degrade significantly when the key-terms are incomplete or not well labeled, which usually happens when new items are consistently incorporated into the system and acquiring well-labeled key-terms requires costly human effort. Besides, existing CRS methods leverage the feedback on different conversational key-terms separately, without considering the underlying relations between the key-terms. In this case, the learning of conversational recommenders is sample-inefficient, especially when there is a large number of candidate conversational key-terms.
Graph Neural Networks Beyond Compromise Between Attribute and Topology
Although existing Graph Neural Networks (GNNs) based on message passing achieve state-of-the-art performance, the over-smoothing issue, the node similarity distortion issue, and unsatisfactory link prediction performance cannot be ignored. This paper summarizes these issues as the interference between topology and attribute for the first time. By leveraging the recently proposed optimization perspective of GNNs, this interference is analyzed and ascribed to the fact that \textit{the learned representation in GNNs essentially compromises between the topology and node attributes}.
Inflation Improves Graph Neural Networks
Graph neural networks (GNNs) have gained significant success in graph representation learning and have emerged as the go-to approach for many graph-based tasks such as node classification, link prediction, and node clustering. Despite their effectiveness, the performance of GNNs is known to decline gradually as the number of layers increases. This attenuation is partly caused by over-smoothing, in which repetitive graph convolution eventually renders node representations identical, and partly by noise propagation, in which nodes with poor homogeneity absorb noise when their features are aggregated. In this paper, we observe that the feature propagation process of GNNs can be seen as a Markov chain, analyze the inevitability of the over-smoothing problem, and then use the idea of Markov clustering to propose a novel and general solution, called a graph inflation layer, to simultaneously address the above issues by preventing local noise from propagating globally as depth increases, while retaining the uniqueness of local homogeneity characteristics. By applying the additional inflation layer, various variants of GCN and other GCN-based models can also be improved. Besides, our method is suitable for graphs both with and without features. We evaluate our method on node classification over several real networks. Results show that our model can significantly outperform other methods and maintains stable performance as depth increases.
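The snippet below shows the Markov-clustering-style inflation step in isolation: entries of a column-stochastic propagation matrix are raised to a power r > 1 and re-normalized, which sharpens strong local connections and suppresses weak, likely-noisy ones. How the layer is wired into a particular GCN variant is simplified away here.

```python
# Inflation of a column-stochastic propagation matrix (Markov clustering idea).
import torch

def inflate(P, r=2.0, eps=1e-12):
    """P: (N, N) column-stochastic matrix; returns the inflated matrix."""
    P_r = P.clamp_min(0).pow(r)
    return P_r / P_r.sum(dim=0, keepdim=True).clamp_min(eps)

A = torch.tensor([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
A_hat = A + torch.eye(3)                        # add self-loops
P = A_hat / A_hat.sum(dim=0, keepdim=True)      # column-normalised propagation
print(inflate(P, r=2.0))
```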
EDITS: Modeling and Mitigating Data Bias for Graph Neural Networks
Graph Neural Networks (GNNs) have demonstrated superior performance on analyzing attributed networks in various web-based applications such as social recommendation and web search. Nevertheless, with the wide-spreading practice of GNNs in high-stake decision-making processes such as online fraud detection, there is an increasing societal concern that GNNs could make discriminatory decisions towards certain demographic groups. Despite many recent explorations towards fair GNNs, these works are tailored for a specific GNN model. However, myriads of GNN variants have been proposed for different applications, and it is costly to fine-tune existing debiasing algorithms for each specific GNN architecture. In this paper, different from existing works that debias GNN models, we aim to directly debias the input attributed network to achieve more fair GNNs through feeding GNNs with less biased data. Specifically, we first propose novel definitions and metrics to measure the bias in an attributed network, which leads to the optimization objective to mitigate bias. Based on the optimization objective, we develop a framework named EDITS to mitigate the bias in attributed networks while maintaining the performance of GNNs in downstream tasks. It is worth noting that EDITS works in a model-agnostic manner, which means that it is independent of the specific GNNs that are applied for downstream tasks. Extensive experiments on both real-world and synthetic datasets demonstrate the validity of the proposed bias metrics and the superiority of EDITS on both bias mitigation and utility maintenance. Open-source implementation: https://github.com/Anonymoussubmissionpurpose/EDITS.
Meta-Weight Graph Neural Network: Push the Limits Beyond Global Homophily
Graph Neural Networks (GNNs) show strong expressive power on graph data mining, by aggregating information from neighbors and using the integrated representation in downstream tasks. The same aggregation methods and parameters are used for each node in a graph to enable GNNs to utilize homophilic relational data. However, not all graphs are homophilic, and even within the same graph the distributions may vary significantly. Using the same convolution over all nodes may therefore lead to ignoring various graph patterns. Furthermore, many existing GNNs integrate node features and structure identically, which ignores the distributions of nodes and further limits the expressive power of GNNs. To solve these problems, we propose the Meta-Weight Graph Neural Network (MWGNN) to adaptively construct graph convolution layers for different nodes. First, we model the Node Local Distribution (NLD) from the node feature, topological structure and positional identity aspects with the Meta-Weight. Then, based on the Meta-Weight, we generate adaptive graph convolutions to perform node-specific weighted aggregation and boost the node representations. Finally, we design extensive experiments on real-world and synthetic benchmarks to evaluate the effectiveness of MWGNN. These experiments show the excellent expressive power of MWGNN in dealing with graph data with various distributions.
GBK-GNN: Gated Bi-Kernel Graph Neural Network for Modeling Both Homophily and Heterophily
Graph Neural Networks (GNNs) are widely used for a variety of graph-based machine learning tasks. For node-level tasks, GNNs have strong power to model the homophily property of graphs (i.e., connected nodes are more similar), while their ability to capture the heterophily property is often doubtful. This is partially caused by the design of the feature transformation, which uses the same kernel for the nodes in the same hop, and the subsequent aggregation operator. One kernel cannot model the similarity and the dissimilarity (i.e., the positive and negative correlation) between node features simultaneously, even if we use attention mechanisms like the Graph Attention Network (GAT), since the weight calculated by attention is always a positive value. In this paper, we propose a novel GNN model based on a bi-kernel feature transformation and a selection gate. Two kernels capture homophily and heterophily information respectively, and the gate is introduced to select which kernel we should use for the given node pairs. We conduct extensive experiments on various datasets with different homophily-heterophily properties. The experimental results show consistent and significant improvements over state-of-the-art GNN methods.
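A condensed sketch of the bi-kernel idea (layer internals simplified, dimensions illustrative): two linear kernels transform neighbor messages and a learned gate decides, per edge, how much of the homophily versus heterophily kernel to use before sum aggregation.

```python
# Bi-kernel message passing with a per-edge selection gate (illustrative).
import torch
import torch.nn as nn

class BiKernelLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w_homo = nn.Linear(in_dim, out_dim, bias=False)
        self.w_hetero = nn.Linear(in_dim, out_dim, bias=False)
        self.gate = nn.Sequential(nn.Linear(2 * in_dim, 1), nn.Sigmoid())

    def forward(self, h, edge_index):
        """h: (N, d); edge_index: (2, E) source/target node indices."""
        src, dst = edge_index
        g = self.gate(torch.cat([h[dst], h[src]], dim=1))        # (E, 1)
        msg = g * self.w_homo(h[src]) + (1 - g) * self.w_hetero(h[src])
        out = torch.zeros(h.size(0), msg.size(1))
        out.index_add_(0, dst, msg)                              # sum aggregation
        return out

h = torch.randn(4, 8)
edge_index = torch.tensor([[0, 1, 2, 3], [1, 0, 3, 2]])
print(BiKernelLayer(8, 16)(h, edge_index).shape)                 # (4, 16)
```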
Lessons from the AdKDD’21 Privacy-Preserving ML Challenge
Designing data sharing mechanisms providing performance as well as strong privacy guarantees is a hot topic for the Online Advertising industry.
An Empirical Investigation of Personalization Factors on TikTok
TikTok is currently the fastest-growing social media platform, with over 1.5 billion active users, the majority of whom are from Generation Z. Arguably, its most important success driver is its recommendation system. Despite the importance of TikTok's algorithm to the platform's success and content distribution, little work has been done on the empirical analysis of the algorithm. Our work lays the foundation to fill this research gap. Using a sock-puppet audit methodology with a custom algorithm developed by us, we tested and analysed the effect of the language and location used to access TikTok and of the follow and like features, as well as how the recommended content changes as a user watches certain posts longer than others. We provide evidence that all the tested factors influence the content recommended to TikTok users. Further, we identified that the follow feature has the strongest influence, followed by the like feature and video view rate. We also discuss the implications of our findings in the context of the formation of filter bubbles on TikTok and the proliferation of problematic content.
Graph-based Extractive Explainer for Recommendations
Explanations in a recommender system assist users in making informed decisions among a set of recommended items. Great research attention has been devoted to generating natural language explanations that depict how the recommendations are generated and why the users should pay attention to them. However, due to the different limitations of those solutions, e.g., template-based or generation-based, it is hard to make the explanations easily perceivable, reliable, and personalized at the same time.
Multi-level Recommendation Reasoning over Knowledge Graphs with Reinforcement Learning
Knowledge graphs (KGs) have been widely used to improve recommendation accuracy. The multi-hop paths on KGs also enable recommendation reasoning, which is considered a crystal type of explainability. In this paper, we propose a reinforcement learning framework for multi-level recommendation reasoning over KGs, which leverages both ontology-view and instance-view KGs to model multi-level user interests. This framework ensures convergence to a more satisfying solution by effectively transferring high-level knowledge to lower levels. Based on the framework, we propose a multi-level reasoning path extraction method, which automatically selects between high-level concepts and low-level ones to form reasoning paths that better reveal user interests. Experiments on three datasets demonstrate the effectiveness of our method.
Evidence-aware Fake News Detection with Graph Neural Networks
The prevalence and perniciousness of fake news have become a critical issue on the Internet, which in turn stimulates the development of automatic fake news detection. In this paper, we focus on evidence-based fake news detection, where several pieces of evidence are utilized to probe the veracity of news (i.e., a claim). Most previous methods first employ sequential models to embed the semantic information and then capture the claim-evidence interaction based on different attention mechanisms. Despite their effectiveness, they still suffer from two main weaknesses. Firstly, due to the inherent drawbacks of sequential models, they fail to integrate relevant information that is scattered far apart across the evidence for veracity checking. Secondly, they neglect the large amount of redundant information contained in the evidence, which may be useless or even harmful. To solve these problems, we propose a unified graph-based semantic structure mining framework, namely GET in short. Specifically, different from existing work that treats claims and evidence as sequences, we model them as graph-structured data and capture the long-distance semantic dependency among dispersed relevant snippets via neighborhood propagation. After obtaining contextual semantic information, our model reduces information redundancy by performing graph structure learning. Finally, the fine-grained semantic representations are fed into the downstream claim-evidence interaction module for prediction. Comprehensive experiments demonstrate the superiority of GET over state-of-the-art methods.
EvidenceNet: Evidence Fusion Network for Fact Verification
Fact verification is a challenging task that requires the retrieval of multiple pieces of evidence from a reliable corpus to verify the truthfulness of a claim. Although current methods have achieved satisfactory performance, they still suffer from one or more of the following three problems: (1) being unable to extract sufficient contextual information from the evidence sentences; (2) containing redundant evidence information; and (3) being incapable of capturing the interaction between claim and evidence. To tackle these problems, we propose an evidence fusion network called EvidenceNet. The proposed EvidenceNet model captures global contextual information from various levels of evidence information for deep understanding. Moreover, a gating mechanism is designed to filter out redundant information in the evidence. In addition, a symmetrical interaction attention mechanism is proposed for identifying the interaction between claim and evidence. We conduct extensive experiments on the FEVER dataset. The experimental results show that the proposed EvidenceNet model outperforms current fact verification methods and achieves state-of-the-art performance.
The Impact of Twitter Labels on Misinformation Spread and User Engagement: Lessons from Trump's Election Tweets
Social media platforms are performing "soft moderation" by attaching warning labels to misinformation to reduce dissemination of and engagement with such content. This study investigates the warning labels Twitter placed on Donald Trump's false tweets about the 2020 US Presidential election. We categorize the warning labels by type: "veracity labels" calling out falsity and "contextual labels" providing more information. In addition, we categorize labels by rebuttal strength and textual overlap (linguistic, topical) with the underlying tweet. Using appropriate statistical tools, we find that, overall, label placement did not change the propensity of users to share and engage with labeled content. Nevertheless, we show that textual overlap and rebuttal strength reduced user interactions in terms of liking, retweeting, quote tweeting, replying, and generating toxic comments. We also find that these properties were associated with users creating more polarized, but deliberate, replies. Results show that liberals engaged more than conservatives when false content was labeled, and that the user population in terms of tweeting activity varied across warning labels. The case study has direct implications for the design of effective soft moderation and related policies.
Conspiracy Brokers: Understanding the Monetization of YouTube Conspiracy Theories
Conspiracy theories are increasingly a subject of research interest as society grapples with the growth of misinformation on the web. Previous journalistic and academic work has established YouTube as one of the most popular sites for people to host and discuss different theories. In this paper, we present an analysis of monetization methods of conspiracy theorist YouTube creators and the types of advertisers potentially targeting this content. We collect 184,218 ad impressions from 6347 unique advertisers found on both conspiracy-focused channels and mainstream YouTube content. We classify the advertisements into different business categories and compare the prevalence between conspiracy and mainstream content. We also identify common offsite monetization methods used by conspiracy channels for additional revenue. In comparison with mainstream content, conspiracy videos had similar levels of advertising from well-known brands, but an almost eleven times higher prevalence of likely predatory or deceptive advertising. Additionally, we found that conspiracy channels were more than twice as likely as mainstream channels to use offsite monetization methods, and 53% of the demonetized channels we observed were still able to leverage third-party sites for monetization. Our results indicate that conspiracy theorists on YouTube had many potential avenues to generate revenue and that predatory advertisements are more frequently served when viewing conspiracy videos.
How Misinformation Density Affects Health Information Search
Search engine results can include false information that is inaccurate, misleading, or even harmful, and people may not recognize such false results when searching online. We suspect that the number of search results containing false information (false information density) may influence people's search activities and outcomes. We conducted a user study over Zoom to examine this matter. The experiment used a between-subjects design. We asked 60 participants to finish two health information search tasks using search engines with High, Medium, or Low false information result density levels. We recorded participants' search activities, perceptions of the systems, and answers to topic-related factual questions measured before and after searching. Our findings indicate that the false information density of search results strongly affects users' search behavior and answers to factual questions. Exposure to search results with higher false information density levels created obstacles for search and learning, making people search more frequently but answer factual questions less correctly. Our study enriches the current understanding of false information in health information search.
Significance and Coverage in Group Testing on the Social Web
We tackle the longstanding question of checking hypotheses on the social Web. In particular, we address the challenges that arise in the context of testing an input hypothesis on many data samples, in our case, user groups. This is referred to as Multiple Hypothesis Testing, a method of choice for data-driven discoveries.
Detecting False Rumors from Retweet Dynamics on Social Media
False rumors are known to have detrimental effects on society. To prevent the spread of false rumors, social media platforms such as Twitter must detect them early. In this work, we develop a novel probabilistic mixture model that classifies true vs. false rumors based on the underlying spreading process. Specifically, our model is the first to formalize the self-exciting nature of true vs. false retweeting processes. This results in a novel mixture marked Hawkes model (MMHM). Owing to this, our model obviates the need for feature engineering; instead, it directly models the spreading process in order to infer whether online rumors are false. Our evaluation is based on 13,650 retweet cascades of both true and false rumors from Twitter. Our model recognizes false rumors with a balanced accuracy of 64.97% and an AUC of 69.46%. It outperforms state-of-the-art baselines (both neural and feature-engineering-based) by a considerable margin while being fully interpretable. Our work has direct implications for practitioners: it leverages the spreading process as an implicit quality signal and, based on it, detects false content.
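For intuition, a self-exciting (Hawkes) retweet process with an exponential kernel has intensity lambda(t) = mu + alpha * beta * sum over past retweet times t_i < t of exp(-beta * (t - t_i)). The following is an illustrative sketch of the cascade log-likelihood under that simple process, not the paper's mixture marked Hawkes model; parameter names and values are ours:

```python
# Log-likelihood of one retweet cascade under a plain exponential-kernel Hawkes
# process. A mixture model would evaluate this under "true" and "false" parameter
# sets and classify a cascade by the resulting mixture responsibility.
import numpy as np

def hawkes_loglik(times, T, mu, alpha, beta):
    times = np.asarray(times, dtype=float)
    loglik = 0.0
    for j, t in enumerate(times):
        past = times[:j]
        intensity = mu + alpha * beta * np.exp(-beta * (t - past)).sum()
        loglik += np.log(intensity)
    # compensator: integral of the intensity over the observation window [0, T]
    loglik -= mu * T + alpha * (1.0 - np.exp(-beta * (T - times))).sum()
    return loglik

print(hawkes_loglik([0.1, 0.4, 0.5, 2.0], T=10.0, mu=0.2, alpha=0.8, beta=1.5))
```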
Rumor Detection on Social Media with Graph Adversarial Contrastive Learning
Rumors spread through the Internet, especially on Twitter, have harmed social stability and residents' daily lives. Recently, in addition to utilizing the text features of posts for rumor detection, the structural information of rumor propagation trees has also been exploited. Most rumors with salient features can be quickly identified by graph models dominated by cross-entropy loss. However, these conventional models may generalize poorly and lack robustness in the face of noise and adversarial rumors, or even conversational structures that are deliberately perturbed (e.g., by adding or deleting some comments). In this paper, we propose a novel Graph Adversarial Contrastive Learning (GACL) method to handle these complex cases, where contrastive learning is introduced as part of the loss function for explicitly perceiving differences between conversational threads of the same class and of different classes. At the same time, an Adversarial Feature Transformation (AFT) module is designed to produce conflicting samples, pressuring the model to mine event-invariant features. These adversarial samples are also used as hard negative samples in contrastive learning to make the model more robust and effective. Experimental results on three public benchmark datasets show that our GACL method achieves better results than other state-of-the-art models.
Identifying the Adoption or Rejection of Misinformation Targeting COVID-19 Vaccines in Twitter Discourse
Although billions of COVID-19 vaccine doses have been administered, too many people remain hesitant. Misinformation about COVID-19 vaccines, propagating on social media, is believed to drive hesitancy towards vaccination. However, exposure to misinformation does not necessarily indicate misinformation adoption. In this paper we describe a novel framework for identifying the stance towards misinformation, relying on attitude consistency and its properties. The interactions between attitude consistency, adoption or rejection of misinformation, and the content of microblogs are exploited in a novel neural architecture, where the stance towards misinformation is organized in a knowledge graph. This new neural framework enables the identification of stance towards misinformation about COVID-19 vaccines with state-of-the-art results. The experiments are performed on a new dataset of misinformation about COVID-19 vaccines, called CoVaxLies, collected from recent Twitter discourse. Because CoVaxLies provides a taxonomy of the misinformation about COVID-19 vaccines, we are able to show which types of misinformation are mostly adopted and which are mostly rejected.
Massive Text Normalization via an Efficient Randomized Algorithm
Many popular machine learning techniques in natural language processing and data mining rely heavily on high-quality text sources. However, real-world text datasets contain a significant amount of spelling errors and improperly punctuated variants, on which the performance of these models quickly deteriorates. Moreover, real-world, web-scale datasets contain hundreds of millions or even billions of lines of text, over which existing text cleaning tools are prohibitively expensive to execute and may require additional overhead to learn the corrections. In this paper, we present FLAN, a scalable randomized algorithm to clean and canonicalize massive text data. Our algorithm relies on the Jaccard similarity between words to suggest correction results. We efficiently handle the pairwise word-to-word comparisons via Locality Sensitive Hashing (LSH). We also propose a novel stabilization process to address the issue of hash collisions between dissimilar words, which is a consequence of the randomized nature of LSH and is exacerbated by the massive scale of real-world datasets. Compared with existing approaches, our method is more efficient, both asymptotically and in empirical evaluations, and does not rely on additional features, such as lexical/phonetic similarity or word embedding features. In addition, FLAN does not require any annotated data or supervised learning. We further theoretically show the robustness of our algorithm with upper bounds on the false positive and false negative rates of corrections. Our experimental results on real-world datasets demonstrate the efficiency and efficacy of FLAN. Leveraging recent advances in efficiently computing minhash signatures, FLAN requires significantly less computational time than baseline text normalization techniques on large-scale Twitter and Reddit datasets. In a human evaluation of normalization quality, FLAN achieves 5% and 14% improvements against baselines on the Reddit and Twitter datasets, respectively. Our method also improves performance
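A toy illustration of the general MinHash/LSH recipe the abstract describes (character-shingle Jaccard similarity with banded signatures to avoid all-pairs comparison); this is not FLAN's implementation, and all function names are ours:

```python
# Candidate word pairs via MinHash + LSH banding, then Jaccard filtering.
import random
from collections import defaultdict

def shingles(word, k=3):
    word = f"^{word}$"
    return {word[i:i + k] for i in range(len(word) - k + 1)}

def minhash_signature(shingle_set, seeds):
    return tuple(min(hash((seed, s)) for s in shingle_set) for seed in seeds)

def lsh_candidates(words, num_hashes=20, bands=5):
    rows = num_hashes // bands
    seeds = [random.Random(i).random() for i in range(num_hashes)]
    sigs = {w: minhash_signature(shingles(w), seeds) for w in words}
    buckets, pairs = defaultdict(list), set()
    for w, sig in sigs.items():
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(w)   # band -> bucket
    for bucket in buckets.values():
        for i in range(len(bucket)):
            for j in range(i + 1, len(bucket)):
                pairs.add((bucket[i], bucket[j]))
    return pairs

def jaccard(a, b):
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

candidates = lsh_candidates(["recieve", "receive", "banana"])
corrections = {pair for pair in candidates if jaccard(*pair) > 0.5}
```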
MagNet: Cooperative Edge Caching by Automatic Content Congregating
Nowadays, the surge of Internet content and the need for high Quality of Experience (QoE) put the backbone network under unprecedented pressure. Emerging edge caching solutions help ease the pressure by caching content closer to users. However, these solutions suffer from two challenges: 1) a low hit ratio due to edges' high density and small coverage; 2) unbalanced edge workloads caused by dynamic requests and heterogeneous edge capacities. In this paper, we formulate a typical cooperative edge caching problem and propose MagNet, a decentralized and cooperative edge caching system, to address these two challenges. The proposed MagNet system consists of two innovative mechanisms: 1) the Automatic Content Congregating (ACC) mechanism, which utilizes a neural embedding algorithm to capture underlying patterns in historical traces and cluster contents into types. The ACC can then guide requests to their optimal edges according to their types, so that contents congregate automatically in different edges by type. This process forms a virtuous cycle between edges and requests, driving a high hit ratio. 2) the Mutual Assistance Group (MAG) mechanism, which lets idle edges share overloaded edges' workloads by promptly forming temporary groups. To evaluate the performance of MagNet, we conduct experiments comparing it with classical, Machine Learning (ML)-based, and cooperative caching solutions using a real-world trace. The results show that MagNet can improve the hit ratio from 40% and 60% to 75% for non-cooperative and cooperative solutions, respectively, and significantly improve the balance of edge workloads.
Commutativity-guaranteed Docker Image Reconstruction towards Effective Layer Sharing
Owing to their lightweight nature, containers have become a promising solution for cloud native technologies. Container images containing applications and dependencies support flexible service deployment and migration. Rapid adoption and integration of containers generate millions of images to be stored. Additionally, non-local images have to be frequently downloaded from the registry, resulting in huge amounts of traffic. Content Addressable Storage (CAS) has been adopted to save storage and networking by enabling identical layers to be shared across images. However, according to our measurements, the benefits are significantly limited because layers are rarely fully identical in practice. In this paper, we propose to reconstruct docker images to raise the number of identical layers and thereby reduce storage and network consumption. We explore the layered structure of images and define the commutativity of files to assure image validity. The image reconstruction is formulated as an integer nonlinear programming problem. Inspired by the observed similarity of layers, we design a similarity-aware online image reconstruction algorithm. Extensive evaluations are conducted to verify the performance of the proposed approach.
A Comprehensive Benchmark of Deep Learning Libraries on Mobile Devices
Deploying deep learning (DL) on mobile devices has been a notable trend in recent years.
Learning-based Fuzzy Bitrate Matching at the Edge for Adaptive Video Streaming
The rapid growth of video traffic imposes significant challenges on content delivery over the Internet. Meanwhile, edge computing is developed to accelerate video transmission as well as relieve the traffic load on origin servers. Although some related techniques (e.g., transcoding and prefetching) have been proposed to improve edge services, they cannot fully utilize cached videos. Therefore, we propose a Learning-based Fuzzy Bitrate Matching scheme (LFBM) at the edge for adaptive video streaming, which utilizes the capacity of the network and edge servers. Based on user requests, cache states, and network conditions, LFBM utilizes reinforcement learning to make a decision: either fetching the video at the exact bitrate from the origin server or responding with a different representation from the edge server. In the simulation, compared with the baseline, LFBM improves the cache hit ratio by 128%. Besides, compared with the scheme without fuzzy bitrate matching, it improves Quality of Experience (QoE) by 45%. Moreover, the real-network experiments further demonstrate the effectiveness of LFBM. It increases the hit ratio by 84% compared with the baseline and improves the QoE by 51% compared with the scheme without fuzzy bitrate matching.
LocFedMix-SL: Localize, Federate, and Mix for Improved Scalability, Convergence, and Latency in Split Learning
Split learning (SL) is a promising distributed learning framework that makes it possible to utilize the huge data and parallel computing resources of mobile devices. SL is built upon a model-split architecture, wherein a server stores an upper model segment that is shared by different mobile clients, each storing its lower model segment. Without exchanging raw data, SL achieves high accuracy and fast convergence by only uploading smashed data from clients and downloading global gradients from the server. Nonetheless, the original implementation of SL serves multiple clients sequentially, incurring high latency with many clients. A parallel implementation of SL has great potential for reducing latency, yet existing parallel SL algorithms compromise scalability and/or convergence speed. Motivated by this, the goal of this article is to develop a scalable parallel SL algorithm with fast convergence and low latency. As a first step, we identify that the fundamental bottleneck of existing parallel SL comes from the model-split and parallel computing architectures, under which the server-client model updates are often imbalanced, and the client models are prone to detaching from the server's model. To fix this problem, by carefully integrating local parallelism, federated learning, and mixup augmentation techniques, we propose a novel parallel SL framework, coined LocFedMix-SL. Simulation results corroborate that LocFedMix-SL achieves improved scalability, convergence speed, and latency compared to sequential SL as well as state-of-the-art parallel SL algorithms such as SplitFed and LocSplitFed.
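To make the model-split idea concrete, here is a toy single-client split-learning step in PyTorch (no federation or mixup, and a sketch under our own assumptions rather than LocFedMix-SL itself): the client runs the lower segment and uploads only the "smashed" activations, while the server runs the upper segment and returns the cut-layer gradients.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

client_net = nn.Sequential(nn.Linear(32, 16), nn.ReLU())   # lower segment (on device)
server_net = nn.Linear(16, 10)                              # upper segment (on server)
opt_c = torch.optim.SGD(client_net.parameters(), lr=0.1)
opt_s = torch.optim.SGD(server_net.parameters(), lr=0.1)

x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))       # raw data never leaves the client
opt_c.zero_grad(); opt_s.zero_grad()
smashed = client_net(x)                                      # client forward pass
smashed_srv = smashed.detach().requires_grad_(True)          # "uploaded" smashed data
loss = F.cross_entropy(server_net(smashed_srv), y)
loss.backward(); opt_s.step()                                # server-side update
smashed.backward(smashed_srv.grad)                           # cut-layer gradients sent back
opt_c.step()                                                 # client-side update
```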
Keynote Talk by Selen Turkay (Queensland University of Technology)
Moderated Discussion
Alexa, in you, I trust! Fairness and Interpretability Issues in E-commerce Search through Smart Speakers
In traditional (desktop) e-commerce search, a customer issues a specific query and the system returns a ranked list of products in order of relevance to the query. However, an increasingly popular alternative in e-commerce search is to issue a voice-query to a smart speaker (e.g., Amazon Echo) powered by a voice assistant (VA, e.g., Alexa). In this situation, the VA usually spells out the details of only one product, an explanation citing the reason for its selection, and a default action of adding the product to the customer's cart. This reduced autonomy of the customer in the choice of a product during voice-search makes it necessary for a VA to be far more responsible and trustworthy in its explanation and default action.
End-to-end Learning for Fair Ranking Systems
Ranking systems are a pervasive aspect of our everyday lives: They
Fairness Audit of Machine Learning Models with Confidential Computing
Algorithmic discrimination is one of the significant concerns in applying machine learning models to a real-world system. Many researchers have focused on developing fair machine learning algorithms without discrimination based on legally protected attributes. However, the existing research has barely explored various security issues that can occur while evaluating model fairness and verifying fair models. In this study, we propose a fairness auditing framework that assesses the fairness of ML algorithms while addressing potential security issues such as data privacy, model secrecy, and trustworthiness. To this end, our proposed framework utilizes confidential computing and builds a chain of trust through enclave attestation primitives combined with public scrutiny and state-of-the-art software-based security techniques, enabling fair ML models to be securely certified and clients to verify a certified one. Our micro-benchmarks on various ML models and real-world datasets show the feasibility of the fairness certification implemented with Intel SGX in practice. In addition, we analyze the impact of data poisoning, which is an additional threat during data collection for fairness auditing. Based on the analysis, we illustrate the theoretical curves of fairness gap and minimal group size and the empirical results of fairness certification on poisoned datasets.
Discussion on Fairness
XAlign: Cross-lingual Fact-to-Text Alignment and Generation for Low-Resource Languages
Multiple critical scenarios need automated generation of descriptive text in low-resource (LR) languages given English fact triples. For example, Wikipedia text generation given English Infoboxes, automated generation of non-English product descriptions using English product attributes, etc. Previous work on fact-to-text (F2T) generation has focused on English only. Building an effective cross-lingual F2T (XF2T) system requires alignment between English structured facts and LR sentences. Either we need to manually obtain such alignment data at a large scale, which is expensive, or build automated models for cross-lingual alignment. To the best of our knowledge, there has been no previous attempt on automated cross-lingual alignment or generation for LR languages. We propose two unsupervised methods for cross-lingual alignment. We contribute XAlign, an XF2T dataset with 0.45M pairs across 8 languages, of which 5402 pairs have been manually annotated. We also train strong baseline XF2T generation models on XAli
QuatRE: Relation-Aware Quaternions for Knowledge Graph Embeddings
We propose a simple yet effective embedding model to learn quaternion embeddings for entities and relations in knowledge graphs. Our model aims to enhance correlations between head and tail entities given a relation within the Quaternion space with Hamilton product. The model achieves this goal by further associating each relation with two relation-aware rotations, which are used to rotate quaternion embeddings of the head and tail entities, respectively. Experimental results show that our proposed model produces state-of-the-art performances on well-known benchmark datasets for knowledge graph completion. Our code is available at: \url{https://github.com/daiquocnguyen/QuatRE}.
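For reference, the Hamilton product that underlies quaternion KG embeddings, together with a rotate-then-match score in the spirit described above; this is a sketch with our own conventions, not QuatRE's exact scoring function or its two relation-aware rotations:

```python
# Hamilton product of quaternion arrays and a simple rotate-then-match score.
import numpy as np

def hamilton(q, p):
    # q, p: arrays [..., 4] holding (a, b, c, d) = a + b*i + c*j + d*k
    a1, b1, c1, d1 = np.moveaxis(q, -1, 0)
    a2, b2, c2, d2 = np.moveaxis(p, -1, 0)
    return np.stack([
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    ], axis=-1)

def score(head, rel, tail):
    # rotate the head by the unit-normalised relation quaternion,
    # then measure agreement with the tail via an inner product
    rel = rel / np.linalg.norm(rel, axis=-1, keepdims=True)
    return (hamilton(head, rel) * tail).sum(-1)
```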
Universal Graph Transformer Self-Attention Networks
The transformer has been extensively used in research domains such as computer vision, image processing, and natural language processing. The transformer, however, has not been actively used in graph neural networks. To this end, we introduce a transformer-based advanced GNN model, named UGformer, to learn graph representations. In particular, given an input graph, we present two UGformer variants. The first variant is to leverage the transformer on a set of sampled neighbors for each input node, while the second is to leverage the transformer on all input nodes. Experimental results demonstrate that these two UGformer variants achieve state-of-the-art accuracies on well-known benchmark datasets for graph classification and inductive text classification, respectively.
A Two-stage User Intent Detection Model on Complicated Utterances with Multi-task Learning
As one of the most natural forms of human-machine interaction, dialogue systems such as chatbots and intelligent customer service bots have attracted much attention in recent years. Intents of concise user utterances can be easily detected with classic text classification or text matching models, while complicated utterances are harder to understand directly. In this paper, to improve user intent detection from complicated utterances in an intelligent customer service bot, JIMI (JD Instant Messaging intelligence), which is designed to create an innovative online shopping experience in e-commerce, we propose a two-stage model which combines sentence Compression and intent Classification together with Multi-task learning, called MCC. Besides, a dialogue-oriented language model is trained to further improve the performance of MCC. Experimental results show that our model can achieve good performance on both a public dataset and the JIMI dataset.
HybEx: A Hybrid Tool for Template Extraction
This paper presents HybEx, a hybrid site-level web template extractor. It is a hybrid tool because it combines two algorithms for template and content extraction: (i) TemEx, a site-level template detection technique that includes a mechanism to identify candidate webpages from the same website that share the same template. Once the candidates are identified, it performs a mapping between them in order to infer the template. And (ii) Page-level ConEx, a content extraction technique that isolates the main content of a webpage. It is based on translating some features of the DOM nodes into points in R^4 and computing the Euclidean distance between those points. The key idea is to add a preprocessing step to TemEx that removes the main content inferred by Page-level ConEx. Adding this new phase to the TemEx algorithm increases its runtime; however, this increase is very small compared to the TemEx runtime because Page-level ConEx is a page-level tech
Personal Attribute Prediction from Conversations
Personal knowledge bases (PKBs) are critical to many applications, such as Web-based chatbots and personalized recommendation. Conversations containing rich personal knowledge can be regarded as a main source for populating a PKB. Given a user, a user attribute, and user utterances from a conversational system, we aim to predict the personal attribute value for the user, which is helpful for the enrichment of PKBs. However, three issues exist in previous studies: (1) manually labeled utterances are required for model training; (2) personal attribute knowledge embedded in both utterances and external resources is underutilized; (3) the performance on predicting some difficult personal attributes is unsatisfactory. In this paper, we propose a framework, DSCGN, based on a pre-trained language model with a noise-robust loss function to predict personal attributes from conversations without requiring any labeled utterances. We yield two categories of supervision, i.e., document-level supervision via a
COCTEAU: an Empathy-Based Tool for Decision-Making
Traditional approaches to data-informed policymaking are often tailored to specific contexts and lack strong citizen involvement and collaboration, which are required to design sustainable policies. We argue for the importance of empathy-based methods in the policymaking domain, given their successes in diverse settings such as healthcare and education. In this paper, we introduce COCTEAU (Co-Creating The European Union), a novel framework built on the combination of empathy and gamification to create a tool aimed at strengthening interactions between citizens and policy-makers. We describe our design process and our concrete implementation, which has already undergone preliminary assessments with different stakeholders. Moreover, we briefly report pilot results from the assessment. Finally, we describe the structure and goals of our demonstration regarding the newfound formats and organizational aspects of academic conferences.
Scriptoria: A Crowd-powered Music Transcription System
In this demo we present Scriptoria, an online crowdsourcing system to tackle the complex transcription process of classical orchestral scores. The system's requirements are based on experts' feedback from classical orchestra members. The architecture enables an end-to-end transcription process (from PDF to MEI) using a scalable microtask design. Reliability, stability, task and UI design were also evaluated and improved through Focus Group Discussions. Finally, we gathered valuable comments on the transcription process itself alongside future additions that could greatly enhance current practices in their field.
ECCE: Entity-centric Corpus Exploration Using Contextual Implicit Networks
In the Digital Age, the analysis and exploration of unstructured document collections is of central importance to members of investigative professions, whether they might be scholars, journalists, paralegals, or analysts. In many of their domains, entities play a key role in the discovery of implicit relations between the contents of documents and thus serve as natural entry points to a detailed manual analysis, such as the prototypical 5Ws in journalism or stock symbols in finance. To assist in these analyses, entity-centric networks have been proposed as a language model that represents document collections as a cooccurrence graph of entities and terms, and thereby enables the visual exploration of corpora. Here, we present ECCE, a web-based application that implements entity-centric networks, augments them with contextual language models, and provides users with the ability to upload, manage, and explore document collections. Our application is available as a Web-based service at https://tinyurl.com/bdf552
Effectiveness of Data Augmentation to Identify Relevant Reviews for Product Question Answering
With the rapid growth of e-commerce and an increasing number of questions posted on the Question Answer (QA) platforms of e-commerce websites, there is a need to provide automated answers to questions. In this paper, we use transformer-based review ranking models which provide a ranked list of reviews as a potential answer to a newly posed question. Since no explicit training data is available for this task, we exploit the product reviews along with available QA pairs to learn a relevance function between a question and a review sentence. Further, we present a data augmentation technique that fine-tunes the T5 model to generate new questions from a customer review and a summary of the review. We utilize a real-world dataset from three categories on Amazon.com. To assess the performance of our models, we use the annotated question review dataset from RIKER (Zhao et al., 2019). Experimental results show that our Deberta-RR model with the augmentation technique outperforms the current state-of-the-art model by 5.84%,
Does Evidence from Peers Help Crowd Workers in Assessing Truthfulness?
Misinformation has been rapidly spreading online. The current approach to dealing with it is deploying expert fact-checkers that follow forensic processes to identify the veracity of statements. Unfortunately, such an approach does not scale well. To deal with this, crowdsourcing has been looked at as an opportunity to complement the work of trained journalists. In this work, we look at the effect of presenting the crowd with evidence from others while judging the veracity of statements. We implement several variants of the judgment task design to understand whether and how the presented evidence affects the way crowd workers judge truthfulness and their performance. Our results show that, in certain cases, the presented evidence may mislead crowd workers who would otherwise be more accurate if judging independently from others. Those who made correct use of the provided evidence, however, could benefit from it and generate better judgments.
Graph-level Semantic Matching model for Knowledge base Aggregate Question Answering
In knowledge base question answering, aggregate questions are a kind of complex question with long-distance dependencies, which affect query graph matching. Most previous semantic parsing approaches have made significant progress in complex question answering. However, they mostly compare only the textual similarity of the predicate sequences, ignoring the global alignment between questions and query graphs. In this paper, we propose a Graph-level Semantic Matching (GSM) model, where a question-guiding mechanism is applied to overcome the gap in structure and representation between questions and query graphs. In addition, due to the structural complexity of query graphs, we propose a two-channel model to explicitly encode the structural and relational semantics of query graphs. Finally, the experimental results show that GSM outperforms strong baselines, especially on aggregate questions.
PEAR: Personalized Re-ranking with Contextualized Transformer for Recommendation
The goal of recommender systems is to provide ordered item lists to users that best satisfy their demands. As a critical task in the recommendation pipeline, re-ranking has received increasing attention in recent years. In contrast to conventional ranking models that score each item individually, re-ranking aims to explicitly model the mutual influence among items to further refine the ordering of items in a given initial ranking list. In this paper, we present a personalized re-ranking model (dubbed PEAR) based on a contextualized transformer. PEAR makes several main improvements over the existing models. Specifically, PEAR not only captures both feature-level and item-level interactions but also models item contexts including both the initial ranking list and the historical clicked item list. In addition to item-level ranking score prediction, we also augment the training of PEAR with a list-level classification task to assess users' satisfaction on the whole ranking list. Experiments have been conducted on
Beyond NDCG: behavioral testing of recommender systems with RecList
As with most Machine Learning systems, recommender systems are typically evaluated through performance metrics computed over held-out data points. However, real-world behavior is undoubtedly nuanced: ad hoc error analysis and deployment-specific tests must be employed to ensure the desired quality in actual deployments. In this paper, we propose RecList, a behavioral-based testing methodology. RecList organizes recommender systems by use case and introduces a general plug-and-play procedure to scale up behavioral testing. We demonstrate its capabilities by analyzing known algorithms and black-box commercial systems, and we release RecList as an open source, extensible package for the community.
Personalized Complementary Product Recommendation
Complementary product recommendation aims at providing product suggestions that are often bought together to serve a joint demand. Existing work mainly focuses on modeling product relationships at a population level, but does not consider personalized preferences of different customers. In this paper, we propose a framework for personalized complementary product recommendation capable of recommending products that fit the demand and preferences of the customers. Specifically, we model product relations and user preferences with a graph attention network and a sequential behavior transformer, respectively. The two networks are cast together through personalized re-ranking and contrastive learning, in which the user and product embedding are learned jointly in an end-to-end fashion. The system recognizes different customer interests by learning from their purchase history and the correlations among customers and products. Experimental results demonstrate that our model benefits from learning personalized inform
DCAF-BERT: A Distilled Cachable Adaptable Factorized Model For Improved Ads CTR Prediction
In this paper we present a Click-through-rate (CTR) prediction model for product advertisement at Amazon. CTR prediction is challenging because the model needs to a) learn from text and numerical features, b) maintain low-latency at inference time, and c) adapt to a temporal advertisement distribution shift. Our proposed model is DCAF-BERT, a novel lightweight cache-friendly factorized model that consists of twin-structured BERT-like encoders for text with a mechanism for late fusion for tabular and numeric features. The factorization of the model allows for compartmentalised retraining which enables the model to easily adapt to distribution shifts. The twin encoders are carefully trained to leverage historical CTR data, using a large pre-trained language model and cross-architecture knowledge distillation (KD). We empirically find the right combination of pretraining, distillation and fine-tuning strategies for teacher and student which leads to a 1.7\% ROC-AUC lift over the previous best model offline. In a
A Multi-Task Learning Approach for Delayed Feedback Modeling
Conversion rate (CVR) prediction is one of the most essential tasks for digital display advertising. In industrial recommender systems, online learning is particularly favored for its capability to capture the dynamic change of data distribution, which often leads to significant improvements in conversion rates. However, the gap between a click behavior and the corresponding conversion ranges from a few minutes to days; therefore, fresh data may not have accurate label information when they are ingested by the training algorithm, which is called the delayed feedback problem of CVR prediction. To solve this problem, previous works label the delayed positive samples as negative and correct them at their conversion time, then optimize the expectation of the actual conversion distribution via importance sampling under the observed distribution. However, these methods approximate the actual feature distribution as the observed feature distribution, which may introduce additional bias to the delayed feedback
ClusterSCL: Cluster-Aware Supervised Contrastive Learning on Graphs
We study the problem of supervised contrastive (SupCon) learning on graphs. The SupCon loss has been recently proposed for classification tasks by pulling data points in the same class closer than those of different classes. By design, it is difficult for SupCon to handle datasets with large intra-class variances and high inter-class similarities. This issue becomes further challenging when it couples with graph structures. To address this, we present the cluster-aware supervised contrastive learning loss (ClusterSCL) for graph learning tasks. The main idea of ClusterSCL is to retain the structural and attribute properties of graphs in the form of nodes' cluster distributions during supervised contrastive learning. Specifically, ClusterSCL introduces the strategy of cluster-aware data augmentation and integrates it with the SupCon loss. Extensive experiments on several widely-adopted graph benchmarks demonstrate the superiority of ClusterSCL over the cross-entropy, SupCon, and other graph contrastive objectives.
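As background, a sketch of the plain supervised contrastive (SupCon) objective that ClusterSCL builds on; the cluster-aware data augmentation that is the paper's contribution is omitted here, and all names are ours:

```python
# Supervised contrastive loss: pull same-class embeddings together, push
# different-class embeddings apart, averaged over positives per anchor.
import torch
import torch.nn.functional as F

def supcon_loss(z, labels, temperature=0.1):
    # z: (N, d) node embeddings, labels: (N,) class ids
    z = F.normalize(z, dim=-1)
    sim = z @ z.t() / temperature
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, -1e9)                  # exclude self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    return -(log_prob * pos.float()).sum(1).div(pos.sum(1).clamp(min=1)).mean()
```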
Graph Communal Contrastive Learning
Graph representation learning is crucial for many real-world applications (e.g., social relation analysis). A fundamental problem for graph representation learning is how to effectively learn representations without human labeling, which is usually costly and time-consuming. Graph contrastive learning (GCL) addresses this problem by pulling the positive node pairs (or similar nodes) closer while pushing the negative node pairs (or dissimilar nodes) apart in the representation space. Despite the success of the existing GCL methods, they primarily sample node pairs based on the node-level proximity yet the community structures have rarely been taken into consideration. As a result, two nodes from the same community might be sampled as a negative pair. We argue that the community information should be considered to identify node pairs in the same communities, where the nodes inside are semantically similar. To address this issue, we propose a novel Graph Communal Contrastive Learning (gCooL) framework to jointly learn the community partition and node representations in an end-to-end fashion. Specifically, the proposed gCooL consists of two components: a Dense Community Aggregation (DeCA) algorithm for community detection and a Reweighted Self-supervised Cross-contrastive (ReSC) training scheme to utilize the community information. Additionally, the real-world graphs are complex and often consist of multiple views. In this paper, we demonstrate that the proposed gCooL can also be naturally adapted to multiplex graphs. Finally, we comprehensively evaluate the proposed gCooL on a variety of real-world graphs. The experimental results show that the gCooL outperforms the state-of-the-art methods.
CGC: Contrastive Graph Clustering for Community Detection and Tracking
Given entities and their interactions in the web data, which may have occurred at different times, how can we effectively find communities of entities and track their evolution in an unsupervised manner? In this paper, we approach this important task from a graph clustering perspective. Recently, state-of-the-art clustering performance in various domains has been achieved by deep clustering methods. Especially, deep graph clustering (DGC) methods have successfully extended deep clustering to graph-structured data by learning node representations and cluster assignments in a joint optimization framework. Despite some differences in modeling choices (e.g., encoder architectures), existing DGC methods are mainly based on autoencoders, minimizing reconstruction loss, and use the same clustering objective with relatively minor adaptations. Also, while many real-world graphs are dynamic in nature, previous studies have designed DGC methods only for static graphs. In this work, we develop CGC, a novel end-to-end framework for graph clustering, which fundamentally differs from existing methods. CGC learns node embeddings and cluster assignments in a contrastive graph learning framework, where positive and negative samples are carefully selected in a multi-level scheme such that they reflect the hierarchical community structures and network homophily. Also, we extend CGC for time-evolving data, where temporal graph clustering is performed in an incremental learning fashion, with the ability to detect change points. Extensive evaluation on static and temporal real-world graphs demonstrates that the proposed CGC consistently outperforms existing methods.
Fair k-Center Clustering in MapReduce and Streaming Settings
Center-based clustering techniques are fundamental to many real-world applications such as data summarization and social network analysis. In this work, we study the problem of fairness aware k-center clustering over large datasets. We are given an input dataset comprising a set of $n$ points, where each point belongs to a specific demographic group characterized by a protected attribute such as race or gender. The goal is to identify k clusters such that all clusters have considerable representation from all groups and the maximum radius of these clusters is minimized.
Unsupervised Graph Poisoning Attack via Contrastive Loss Back-propagation
Graph contrastive learning is the state-of-the-art unsupervised graph representation learning framework and has shown performance comparable with supervised approaches. However, evaluating whether graph contrastive learning is robust to adversarial attacks is still an open problem, because most existing graph adversarial attacks are supervised models, which means they heavily rely on labels and can only be used to evaluate graph contrastive learning in a specific scenario. For unsupervised graph representation methods such as graph contrastive learning, it is difficult to acquire labels in real-world scenarios, making traditional supervised graph attack methods difficult to apply for testing their robustness. In this paper, we propose a novel unsupervised gradient-based adversarial attack that does not rely on labels for graph contrastive learning. We compute the gradients of the adjacency matrices of the two views and flip the edges via gradient ascent to maximize the contrastive loss. In this way, we can fully use the multiple views generated by graph contrastive learning models and pick the most informative edges without knowing their labels, which allows our attack to be adapted to more kinds of downstream tasks. Extensive experiments show that our attack outperforms unsupervised baseline attacks and has comparable performance with supervised attacks in multiple downstream tasks including node classification and link prediction. We further show that our attack can be transferred to other graph representation models as well.
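A hedged sketch of the gradient-based flip selection described above, using a dense adjacency matrix for clarity; `contrastive_loss_fn` is a stand-in for the loss computed over the two GCL views, and this is not the authors' code:

```python
import torch

def pick_edge_flips(adj, contrastive_loss_fn, budget):
    # adj: dense (N, N) 0/1 adjacency matrix; we want to maximize the contrastive loss
    a = adj.clone().float().requires_grad_(True)
    contrastive_loss_fn(a).backward()
    # flipping 0 -> 1 raises the loss when the gradient is positive,
    # flipping 1 -> 0 when it is negative, so sign the score towards the feasible flip
    score = a.grad * (1 - 2 * adj)
    idx = torch.topk(score.flatten(), budget).indices
    n = adj.size(0)
    return [(int(i) // n, int(i) % n) for i in idx]   # (row, col) edges to flip
```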
MCL: Mixed-Centric Loss for Collaborative Filtering
The majority of recent work in latent Collaborative Filtering (CF) has focused on developing new model architectures to learn accurate user and item representations. Typically, a standard pairwise loss function (BPR, triplet, etc.) is used in these models, and little exploration is done on how to optimally extract signals from the available preference information. In the implicit setting, negative examples are generally sampled, and standard pairwise losses allocate weights that solely depend on the difference in user distance between observed (positive) and negative item pairs. This can ignore valuable global information from other users and items, and lead to sub-optimal results. Motivated by this problem, we propose a novel loss which first leverages mining to select the most informative pairs, followed by a weighting process to allocate more weight to harder examples. Our weighting process consists of four different components, and incorporates distance information from other users, enabling the model to better position the learned representations. We conduct extensive experiments and demonstrate that our loss can be applied to different types of CF models leading to significant gains on each type. In particular, by applying our loss to the graph convolutional architecture, we achieve new state-of-the-art results on four different datasets. Further analysis shows that through our loss the model is able to learn a better user-item representation space compared to other losses. Full code for this work will be released at the time of publication.
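To illustrate the mine-then-weight idea in general terms, here is a generic hard-pair weighting sketch for a distance-based CF loss; it is not MCL's four-component weighting, and all names and constants are ours:

```python
import torch

def mined_weighted_loss(user, pos_item, neg_items, margin=0.5, gamma=4.0):
    # user, pos_item: (B, d); neg_items: (B, K, d) sampled negatives
    d_pos = (user - pos_item).pow(2).sum(-1)                  # (B,)
    d_neg = (user.unsqueeze(1) - neg_items).pow(2).sum(-1)    # (B, K)
    violation = (margin + d_pos.unsqueeze(1) - d_neg).clamp(min=0)
    mined = (violation > 0).float()                           # keep informative pairs only
    weight = torch.exp(gamma * violation) * mined             # harder violation -> larger weight
    return (weight * violation).sum() / weight.sum().clamp(min=1e-8)
```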
Deep Unified Representation for Heterogeneous Recommendation
Recommendation system has been a widely studied task both in academia and industry.
Re4: Learning to Re-contrast, Re-attend, Re-construct for Multi-interest Recommendation
Effectively representing users lies at the core of modern recommender systems. Since users' interests naturally exhibit multiple aspects, it is of increasing interest to develop multi-interest frameworks for recommendation, rather than represent each user with an overall embedding. Despite their effectiveness, existing methods solely exploit the encoder (the forward flow) to represent multiple aspects. However, without explicit regularization, the interest embeddings may not be distinct from each other nor semantically reflect representative historical items. Towards this end, we propose the Re4 framework, which leverages the backward flow to reexamine each interest embedding. Specifically, Re4 encapsulates three backward flows, i.e., 1) Re-contrast, which drives each interest embedding to be distinct from other interests using contrastive learning; 2) Re-attend, which ensures the interest-item correlation estimation in the forward flow to be consistent with the criterion used in final recommendation; and 3) Re-construct, which ensures that each interest embedding can semantically reflect the information of representative items that relate to the corresponding interest. We demonstrate the novel forward-backward multi-interest paradigm on ComiRec, and perform extensive experiments on three real-world datasets. Empirical studies validate that Re4 is helpful in learning distinct and effective multi-interest representations.
Modality Matches Modality: Pretraining Modality-Disentangled Item Representations for Recommendation
Recent works have shown the effectiveness of incorporating
Dynamic Gaussian Embedding of Authors
Authors publish documents in a dynamic manner. Their topic of interest and writing style might shift over time. Tasks such as author classification, author identification or link prediction are difficult to resolve in such complex data settings. We propose a new representation learning model, DGEA (for Dynamic Gaussian Embedding of Authors), that is more suited to solve these tasks by capturing this temporal evolution. The representations should retain some form of multi-topic information and temporal smoothness. We formulate a general embedding framework: author representation at time t is a Gaussian distribution that leverages pre-trained document vectors, and that depends on the publications observed until t. We propose two models that fit into this framework. The first one, K-DGEA, uses a first order Markov model optimized with an Expectation Maximization Algorithm with Kalman Equations, while the second one, R-DGEA, makes use of a Recurrent Neural Network. We evaluate our method on several quantitative tasks: author identification, classification, and co-authorship prediction, on two datasets written in English. In addition, our model is language agnostic since it only requires pre-trained document embeddings. It shows good results over baselines: e.g., our method outperforms the best existing method by up to 18% on an author classification task, on a news dataset.
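For intuition, a scalar-variance Kalman-style update of an author's Gaussian embedding from a stream of pre-trained document vectors might look as follows; this is an illustrative simplification of the first-order Markov idea behind K-DGEA, with made-up noise values:

```python
import numpy as np

def kalman_update(mean, var, doc_vec, process_var=0.01, obs_var=0.1):
    # predict: the author's representation drifts between publications
    var = var + process_var
    # correct: blend the prediction with the newly observed document vector
    gain = var / (var + obs_var)
    mean = mean + gain * (doc_vec - mean)
    var = (1.0 - gain) * var
    return mean, var

mean, var = np.zeros(3), 1.0
for doc in [np.array([0.2, 0.1, 0.0]), np.array([0.3, 0.0, 0.1])]:
    mean, var = kalman_update(mean, var, doc)   # Gaussian author embedding over time
```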
Scheduling Virtual Conferences Fairly: Achieving Equitable Participant and Speaker Satisfaction
In recent years, almost all conferences have moved to virtual mode due to COVID-19 pandemic-induced restrictions on travel and social gathering. Contrary to in-person conferences, virtual conferences face the challenge of efficiently scheduling talks, accounting for the availability of participants from different timezones and their interests in attending different talks. A natural objective for conference organizers is to maximize some efficiency measure, e.g., the total expected audience participation across all talks. However, we show that in the virtual conference setting, optimizing for efficiency alone can result in an unfair schedule, i.e., individual utilities for participants and speakers can be highly unequal.
Second-level Digital Divide: A Longitudinal Study of Mobile Traffic Consumption Imbalance in France
We study the interaction between the consumption of digital services via mobile devices and urbanization levels, using measurement data collected in an operational network serving the whole territory of France.
Context-based Collective Preference Aggregation for Prioritizing Crowd Opinions in Social Decision-making
Given an issue that needs to be solved, people can collect many human opinions from crowds on the web and then prioritize them for social decision-making. One solution for prioritization is to collect a large number of pairwise preference comparisons from crowds and utilize the aggregated preference labels as the collective preferences on the opinions. In practice, because there is a large number of combinations of candidate opinion pairs, we can only collect labels for a small subset of pairs. The problem is how to estimate collective preferences from only a small number of pairwise crowd preferences on the opinions. Existing preference aggregation methods for general scenarios only utilize the pairwise preference labels. In our scenario, additional contextual information, such as the text content of the opinions, can potentially improve aggregation performance. Therefore, we propose preference aggregation approaches that can effectively incorporate contextual information by externally or internally building relations between opinion contexts and preference scores. We propose approaches for both homogeneous and heterogeneous settings of modeling the evaluators. Experiments based on real datasets collected from a real-world crowdsourcing platform show that our approaches generate better aggregation results than the baselines for estimating collective preferences, especially when only a small number of preference labels is available.
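One simple way to fold opinion content into pairwise preference aggregation, sketched here for illustration only (a Bradley-Terry-style model whose latent score is a linear function of each opinion's text features; this is not the paper's proposed approach, and all names are ours):

```python
import numpy as np

def fit_contextual_bt(features, comparisons, lr=0.1, epochs=200):
    # features: (n_opinions, d) text features; comparisons: list of (winner_idx, loser_idx)
    w = np.zeros(features.shape[1])
    for _ in range(epochs):
        for win, lose in comparisons:
            diff = features[win] - features[lose]
            p_win = 1.0 / (1.0 + np.exp(-w @ diff))   # P(winner preferred over loser)
            w += lr * (1.0 - p_win) * diff            # gradient step on the log-likelihood
    return features @ w                               # collective preference score per opinion
```

Because the score is tied to the text features rather than to per-opinion free parameters, opinions with few or no labeled comparisons still receive a preference estimate.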
How Do Mothers and Fathers Talk About Parenting to Different Audiences?
While major strides have been made towards gender equality in public life, serious inequality remains in the domestic sphere, especially around parenting. The present study analyses discussions about parenting on Reddit (a content aggregation website) to explore gender stereotypes and audience effects. It suggests a novel method to study topical variation in individuals' language when interacting with different audiences. Comments posted in 2020 were collected from three parenting subreddits (i.e., topical communities), described as being for fathers (r/Daddit), mothers (r/Mommit), and all parents (r/Parenting). Users posting on r/Parenting and r/Daddit or on r/Parenting and r/Mommit were assumed to identify as fathers or mothers, respectively, allowing gender comparison. Users' comments on r/Parenting (to a mixed-gender audience) were compared with their comments to single-gender audiences on r/Daddit or r/Mommit, using Latent Dirichlet Allocation (LDA) topic modelling. The best model included 12 topics: Education/Family advice, Work/Raise children, Leisure activities, School/Teaching, Birth/Pregnancy, Thank you/Appreciation, Physical appearance/Picture, Furniture/Design, Medical care, Food, Sleep training, and Change/Potty training. Findings indicated that mothers expressed more stereotypical concerns, reflected in Medical care, Food, and Sleep training, regardless of the audience. Second, both mothers and fathers discussed the topic Education/Family advice more with the mixed-gender audience. Finally, the topics Birth/Pregnancy (usually announcements) and Physical appearance/Picture were discussed the most by fathers with a single-gender audience. These results demonstrate that concerns expressed by parents on Reddit are context-sensitive but also consistent with gender stereotypes, potentially reflecting a persistent gendered and unequal division of labour in parenting.
Topic Discovery via Latent Space Clustering of Pretrained Language Model Representations
Topic models have been the prominent tools for automatic topic discovery from text corpora. Despite their effectiveness, topic models suffer from several limitations including the inability of modeling word ordering information in documents, the difficulty of incorporating external linguistic knowledge, and the lack of both accurate and efficient inference methods for approximating the intractable posterior. Recently, pre-trained language models (PLMs) have brought astonishing performance improvements to a wide variety of tasks due to their superior representations of text. Interestingly, there have not been standard approaches to deploy PLMs for topic discovery as better alternatives to topic models. In this paper, we begin by analyzing the challenges of using PLM embeddings for topic discovery, and then propose a joint latent space learning and clustering framework built upon PLM embeddings. In the latent space, topic-word and document-topic distributions are jointly modeled so that the discovered topics can be interpreted by coherent and distinctive terms and meanwhile serve as meaningful summaries of the documents. Our model effectively leverages the strong representation power and generic linguistic features brought by PLMs for topic discovery, and is conceptually simpler than topic models. On two benchmark datasets in different domains, our model generates significantly more coherent and diverse topics than strong topic models, and offers better topic-wise document representations, based on both automatic and human evaluations.
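A simplified sketch of the embed-cluster-label pipeline that such approaches build on; the paper's joint latent space model goes further, and `doc_embeddings` is assumed to be precomputed with any pre-trained language model:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def discover_topics(docs, doc_embeddings, n_topics=10, top_k=8):
    # cluster PLM document embeddings into topics
    clusters = KMeans(n_clusters=n_topics, n_init=10).fit_predict(doc_embeddings)
    # label each cluster with its most salient terms by average tf-idf weight
    tfidf = TfidfVectorizer(stop_words="english")
    X = tfidf.fit_transform(docs)
    vocab = np.array(tfidf.get_feature_names_out())
    topics = []
    for c in range(n_topics):
        idx = np.where(clusters == c)[0]
        weight = np.asarray(X[idx].mean(axis=0)).ravel()
        topics.append(vocab[weight.argsort()[::-1][:top_k]].tolist())
    return clusters, topics
```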
TaxoCom: Topic Taxonomy Completion with Hierarchical Discovery of Novel Topic Clusters
Topic taxonomies, which represent the latent topic (or category) structure of document collections, provide valuable knowledge of contents in many applications such as web search and information filtering. Recently, several unsupervised methods have been developed to automatically construct the topic taxonomy from a text corpus, but it is challenging to generate the desired taxonomy without any prior knowledge. In this paper, we study how to leverage the partial (or incomplete) information about the topic structure as guidance to find out the complete topic taxonomy. We propose a novel framework for topic taxonomy completion, named TaxoCom, which recursively expands the topic taxonomy by discovering novel sub-topic clusters of terms and documents. To effectively identify novel topics within a hierarchical topic structure, TaxoCom devises its embedding and clustering techniques to be closely-linked with each other: (i) locally discriminative embedding optimizes the text embedding space to be discriminative among known (i.e., given) sub-topics, and (ii) novelty adaptive clustering assigns terms into either one of the known sub-topics or novel sub-topics. Our comprehensive experiments on two real-world datasets demonstrate that TaxoCom not only generates the high-quality topic taxonomy in terms of term coherency and topic coverage but also outperforms all other baselines for a downstream task.
Genre-Controllable Story Generation via Supervised Contrastive Learning
While controllable text generation has received more attention due to the recent advance in large-scale pretrained language models, there is a lack of research that focuses on story-specific controllability.
User Satisfaction Estimation with Sequential Dialogue Act Modeling in Goal-oriented Conversational Systems
User Satisfaction Estimation (USE) is an important yet challenging task in goal-oriented conversational systems. Whether the user is satisfied with the system largely depends on the fulfillment of the user's needs, which can be implicitly reflected by users' dialogue acts. However, existing studies often neglect the sequential transitions of dialogue acts or rely heavily on annotated dialogue act labels when utilizing dialogue acts to facilitate USE. In this paper, we propose a novel framework, namely USDA, to incorporate the sequential dynamics of dialogue acts for predicting user satisfaction, by jointly learning User Satisfaction Estimation and Dialogue Act Recognition tasks. Specifically, we first employ a Hierarchical Transformer to encode the whole dialogue context, with two task-adaptive pre-training strategies serving as a second-phase in-domain pre-training step to enhance the dialogue modeling ability. Depending on the availability of dialogue act labels, we further develop two variants of USDA to capture the dialogue act information in either supervised or unsupervised manners. Finally, USDA leverages the sequential transitions of both content and act features in the dialogue to predict user satisfaction. Experimental results on four benchmark goal-oriented dialogue datasets across different applications show that the proposed method substantially and consistently outperforms existing methods on USE, and validate the important role of dialogue act sequences in USE.
Modeling Inter Round Attack of Online Debaters for Winner Prediction
In a debate, two debaters with opposite stances put forward arguments to fight for their viewpoints. Debaters organize their arguments to support their proposition and attack opponents' points. The common purpose of debating is to persuade the opponents and the audiences to agree with the mentioned propositions. Previous works have investigated the issue of identifying which debater is more persuasive. However, modeling the interaction of arguments between rounds is rarely discussed. In this paper, we focus on assessing the overall performance of debaters in a multi-round debate on online forums. To predict the winner in a multi-round debate, we propose a novel neural model that is aimed at capturing the interaction of arguments by exploiting raw text, structure information, argumentative discourse units (ADUs), and the relations among ADUs. Experimental results show that our model achieves competitive performance compared with the existing models, and is capable of extracting essential argument relations during a multi-round debate by leveraging argument structure and attention mechanism.
Socially-Equitable Interactive Graph Information Fusion-based Prediction for Urban Dockless E-Scooter Sharing
Urban dockless e-scooter sharing (DES) has become a popular Web-of-Things (WoT) service and has been widely adopted worldwide. Despite its early commercial success, conventional mobility demand and supply prediction based on machine learning, and subsequent redistribution, may favor advantaged socio-economic communities and tourist regions, at the expense of reducing mobility accessibility and resource allocation for historically disadvantaged communities. To address this unfairness, we propose a socially-Equitable Interactive Graph information fusion-based mobility flow prediction system for Dockless E-scooter Sharing (EIGDES). By considering city regions as nodes connected by trips, EIGDES learns and captures the complex interactions across spatial and temporal graph features through a novel interactive graph information dissemination and fusion structure. We further design a novel model learning objective with metrics that capture both the mobility distributions and the socio-economic factors, ensuring spatial fairness in the communities' resource accessibility and their experienced DES prediction accuracy. Through its integration with the optimization regularizer, EIGDES jointly learns the DES flow patterns and socio-economic factors, and returns socially-equitable flow predictions. Our in-depth experimental study upon a total of more than 2,122,270 DES trips from three metropolitan cities in North America has demonstrated EIGDES's effectiveness in accurately predicting DES flow patterns.
Multi-dimensional Probabilistic Regression over Imprecise Data Streams
In applications of Web of Things or Web of Events, a massive volume of multi-dimensional streaming data is automatically and continuously generated from different sources, such as GPS, sensors, and other measurement equipment, and is essentially imprecise (inaccurate and/or uncertain). It is challenging to monitor and get insights over streaming data of such imprecision and low-level abstraction, in order to capture potentially dramatic data changing trends and to initiate prompt responses. In this work, we investigate solutions for conducting multi-dimensional and multi-granularity probabilistic regression for the imprecise streaming data. The probabilistic nature of streaming data poses big computational challenges to the regression and its aggregation. In this paper, we study a series of techniques on multi-dimensional probabilistic regression, including aggregation, sketching, popular path materialization, and exception-driven querying. Extensive experiments on real and synthetic datasets demonstrate the efficiency and scalability of our proposals.
Lie to Me: Abusing the Mobile Content Sharing Service for Fun and Profit
Online content sharing is a widely used feature in Android apps. In this paper, we observe a new Fake-Share attack in which adversaries abuse existing content sharing services to manipulate the displayed source of shared content, bypass the content review of targeted Online Social Apps (OSAs), and induce users to click on the shared fraudulent content. We show that seven popular content-sharing services (including WeChat, AliPay, and KakaoTalk) are vulnerable to such an attack. To detect this kind of attack and explore whether adversaries have leveraged it in the wild, we propose DeFash, a multi-granularity detection tool including static analysis and dynamic verification. The extensive in-the-lab and in-the-wild experiments demonstrate that DeFash is effective in detecting such attacks. We have identified 51 real-world apps involved in Fake-Share attacks. We have further harvested over 24K Sharing Identification Information (SIIs) that attackers can abuse. It is hence urgent for our community to take actions to detect and mitigate this kind of attack.
Knowledge Enhanced GAN for IoT Traffic Generation
Network traffic data facilitate understanding the Internet of Things (IoT) behaviors and improve IoT services quality in the real world.
Beyond the First Law of Geography: Learning Representations of Satellite Imagery by Leveraging Point-of-Interests
Satellite imagery depicts the earth's surface remotely and provides comprehensive information for many applications, such as land use monitoring and urban planning. Most satellite images are unlabeled, which limits their use in many downstream tasks. Although some studies on learning representations for unlabeled satellite imagery have recently emerged, they only consider the spatial information of the images, and the representations lack knowledge of human factors. To bridge this gap, besides the representation of spatial information, we propose to use Point-of-Interest (POI) data to capture human factors and introduce a contrastive learning method to merge the POI data into the representation of satellite imagery. On top of the spatial representation and POI-related representation, we design an attention model to merge the representations from different modalities. We evaluate our method leveraging real-world socioeconomic indicators from Beijing. The results show that the representation containing POI information can estimate the commercial activity-related indicators better than the spatial representation. Our proposed attention model can estimate the socioeconomic indicators with $R^2$ of $0.874$ at most and outperforms the baseline methods.
A Model-Agnostic Causal Learning Framework for Recommendation using Search Data
Machine-learning based recommendation has become an effective means to help people automatically discover their interests.
Optimizing Rankings for Recommendation in Matching Markets
Based on the success of recommender systems in e-commerce and entertainment, there is growing interest in their use in matching markets like job search. While this holds potential for improving market fluidity and fairness, we show in this paper that naively applying existing recommender systems to matching markets is sub-optimal. Considering the standard process where candidates apply and then get evaluated by employers, we present a new recommendation framework to model this interaction mechanism and propose efficient algorithms for computing personalized rankings in this setting. We show that the optimal rankings need to not only account for the potentially divergent preferences of candidates and employers, but they also need to account for capacity constraints. This makes conventional ranking systems that merely rank by some local score (e.g., one-sided or reciprocal relevance) highly sub-optimal, not only for an individual user, but also for societal goals (e.g., low unemployment). To address this shortcoming, we propose the first method for jointly optimizing the rankings for all candidates in the market to explicitly maximize social welfare. In addition to the theoretical derivation, we evaluate the method both on simulated environments and on data from a real-world networking-recommendation system that we built and fielded at a large computer science conference.
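To make the capacity-constraint argument concrete, the toy linear program below assigns candidates to jobs so as to maximize total welfare subject to job capacities, in contrast to ranking each candidate purely by local relevance. It is a generic illustration of welfare-aware matching, not the ranking algorithm proposed in the paper; all utilities and capacities are invented.

```python
# Toy welfare-maximizing assignment under capacity constraints (illustrative only).
import numpy as np
from scipy.optimize import linprog

utility = np.array([[0.90, 0.80],    # candidate 0's value for jobs 0 and 1
                    [0.85, 0.30],    # candidate 1
                    [0.80, 0.20]])   # candidate 2
capacity = np.array([1, 2])          # job 0 has one opening, job 1 has two

n_cand, n_jobs = utility.shape
c = -utility.ravel()                 # linprog minimizes, so negate the utilities

# Each candidate takes at most one job; each job respects its capacity.
A_ub, b_ub = [], []
for i in range(n_cand):
    row = np.zeros(n_cand * n_jobs)
    row[i * n_jobs:(i + 1) * n_jobs] = 1
    A_ub.append(row)
    b_ub.append(1)
for j in range(n_jobs):
    row = np.zeros(n_cand * n_jobs)
    row[j::n_jobs] = 1
    A_ub.append(row)
    b_ub.append(capacity[j])

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=(0, 1))
print(res.x.reshape(n_cand, n_jobs).round(2))
# Ranking by local relevance alone would point every candidate at job 0;
# the joint optimum spreads candidates so job 0's single opening is respected.
```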
PNMTA: A Pretrained Network Modulation and Task Adaptation Approach for User Cold-Start Recommendation
User cold-start recommendation is a serious problem that limits the performance of recommender systems (RSs). Recent studies have focused on treating this issue as a few-shot problem and seeking solutions with model-agnostic meta-learning (MAML). Such methods regard making recommendations for one user as a task and adapt to new users with a few steps of gradient updates on the metamodel. However, none of those methods consider the limitation of user representation learning imposed by the special task setting of MAML-based RSs. And they learn a common meta-model for all users while ignoring the implicit grouping distribution induced by the correlation differences among users. In response to the above problems, we propose a pretrained network modulation and task adaptation approach (PNMTA) for user cold-start recommendation.
Deep Interest Highlight Network for Click-Through Rate Prediction in Trigger-Induced Recommendation
In many classical e-commerce platforms, personalized recommendation has been proven to be of great business value, which can improve user satisfaction and increase the revenue of platforms. In this paper, we present a new recommendation problem, Trigger-Induced Recommendation (TIR), where users' instant interest can be explicitly induced with a trigger item and follow-up related target items are recommended accordingly. TIR has become ubiquitous and popular in e-commerce platforms. In this paper, we figure out that although existing recommendation models are effective in traditional recommendation scenarios by mining users' interests based on their massive historical behaviors, they are struggling in discovering users' instant interests in the TIR scenario due to the discrepancy between these scenarios, resulting in inferior performance.
No abstract available
Industry talk by Aline Cretenoud (Logitech)
Analyzing the Differences Between Professional and Amateur Esports Through Win Probability
Estimating each team's win probability at any given time of a game is a common task for any sport, including esports. Doing so is important for valuing player actions, assessing profitable bets, or engaging fans with interesting metrics. Past studies of win probability in esports have relied on data extracted from matches held in well-structured and organized professional tournaments. In these tournaments, players compete on set teams and are oftentimes well acquainted with all participants. However, there has been little study of win probability modeling in casual gaming environments -- those where players are randomly matched -- even though these environments form the bulk of gaming hours played. Furthermore, as data become more complex, win probability models become less interpretable. In this study, we identify the differences between professional, high-skill, and casual Counter-Strike: Global Offensive (CSGO) gaming through interpretable win probability models. We also investigate the use of player skill priors for win probability models and use our estimated models to provide suggestions to improve the CSGO matchmaking experience.
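As a rough illustration of an interpretable win-probability model with a player-skill prior, the sketch below fits a logistic regression on a few synthetic in-game features; the feature names and simulated data are placeholders, not the authors' CSGO feature set.

```python
# Interpretable win-probability sketch with a team-skill prior feature.
# Features and data are synthetic stand-ins for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
skill_gap = rng.normal(0, 1, n)          # prior: rating difference between teams
econ_gap = rng.normal(0, 1, n)           # in-round state, e.g. equipment value gap
alive_gap = rng.integers(-4, 5, n)       # players alive difference

logit = 0.8 * skill_gap + 0.5 * econ_gap + 0.4 * alive_gap
y = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([skill_gap, econ_gap, alive_gap])
model = LogisticRegression().fit(X, y)
print(dict(zip(["skill_gap", "econ_gap", "alive_gap"], model.coef_[0].round(2))))
# Comparing such coefficients fitted on professional, high-skill and casual
# match data is one way to surface the differences the study investigates.
```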
Winning Tracker: A New Model for Real-time Winning Prediction in MOBA Games
With increasing popularity, Multiplayer Online Battle Arena (MOBA) games, where two opposing teams compete against each other, have played a major role in E-sports tournaments. Within game analysis, real-time winning prediction is an important but challenging problem, mainly due to the complicated coupling of the overall Confrontation, the randomness of the player's Movement, and unclear optimization goals. Existing research struggles to solve this problem in a dynamic, comprehensive and systematic way. In this study, we design a unified framework, namely Winning Tracker (WT), for solving this problem. Specifically, offense and defense extractors are developed to extract the Confrontation of both sides. A well-designed trajectory representation algorithm is applied to extracting an individual's Movement information. Moreover, we design a hierarchical attention mechanism to capture team-level strategies and facilitate the interpretability of the framework. To accurately optimize the training process, we adopt a multi-task learning method to design short-term and long-term goals, which are used to represent the competition state within a local period and make forward-looking predictions respectively. Intensive experiments on a real-world data set demonstrate that our proposed method WT outperforms state-of-the-art algorithms. Furthermore, our work has been practically deployed in real MOBA games, and provided case studies reflecting its outstanding commercial value.
Large-scale Personalized Video Game Recommendation via Social-aware Contextualized Graph Neural Network
Because of the large number of online games available nowadays, online game recommender systems are necessary for users and online game platforms. The former can discover more potential online games of interest, and the latter can attract users to dwell longer in the platform. This paper investigates the characteristics of user behaviors with respect to the online games on the Steam platform. Based on the observations, we argue that a satisfying recommender system for online games is able to characterize: 1) personalization, 2) game contextualization and 3) social connection. However, simultaneously addressing all three is rather challenging for game recommendation. Firstly, personalization for game recommendation requires the incorporation of the dwelling time of engaged games, which is ignored in existing methods.
DraftRec: Personalized Draft Recommendation for Winning in Multi-Player Online Battle Arena Games
This paper presents a personalized draft recommender system for Multiplayer Online Battle Arena (MOBA) games, which are among the most popular online video games around the world. When playing MOBA games, players go through a stage called drafting, where they alternately select a virtual character (i.e., champion) to play by not only considering their proficiency, but also the synergy and competence of their team's champion combination. However, despite the games' popularity, the complexity of drafting sets up a huge barrier for beginners to enter the game. To alleviate this problem, we propose DraftRec, a novel hierarchical Transformer-based model that recommends champions with a high probability of winning while understanding each player's individual characteristics via player- and match-level representations.
Learning Privacy-Preserving Graph Convolutional Network with Partially Observed Sensitive Attributes
Recent studies have shown Graph Neural Networks (GNNs) are extremely vulnerable to attribute inference attacks. While showing promising performance, existing privacy-preserving GNN research assumed that the sensitive attributes of all users are known beforehand. However, due to different privacy preferences, some users (i.e., private users) may prefer not to reveal sensitive information that others (i.e., non-private users) would not mind disclosing. For example, in social networks, male users are typically less sensitive to their age information than female users and are therefore more willing to give out their age information on social media. This disclosure potentially leads to the age information of female users in the network being exposed, partly because social media users are connected: the homophily property and message-passing mechanism of GNNs can exacerbate individual privacy leakage. To address this problem, we study a novel and practical problem of learning privacy-preserving GNNs with partially observed sensitive attributes.
Privacy-preserving Fair Learning of Support Vector Machine with Homomorphic Encryption
Fair learning has received a lot of attention in recent years since machine learning models can be unfair in automated decision-making systems with respect to sensitive attributes such as gender, race, etc. However, to mitigate the discrimination on the sensitive attributes and train a fair model, most fair learning methods require access to the sensitive attributes in the training or validation phases. In this study, we propose a privacy-preserving training algorithm for a fair support vector machine classifier based on Homomorphic Encryption (HE), where the privacy of both sensitive information and model secrecy can be preserved. The expensive computational costs of HE can be significantly improved by protecting only the sensitive information, introducing a refined formulation and a low-rank approximation using shared eigenvectors. Through experiments on synthetic and real-world data, we demonstrate the effectiveness of our algorithm in terms of accuracy and fairness and show that our method significantly outperforms other privacy-preserving solutions in terms of better trade-offs between accuracy and fairness. To the best of our knowledge, our algorithm is the first privacy-preserving fair learning algorithm using HE.
Can I only share my eyes? A Web Crowdsourcing based Face Partition Approach Towards Privacy-Aware Face Recognition
Human face images represent a rich set of visual information for online social media platforms to optimize the machine learning (ML)/AI models in their data-driven facial applications (e.g., face detection, face recognition). However, there exists a growing privacy concern from social media users to share their online face images that will be annotated by unknown crowd workers and analyzed by ML/AI researchers in the model training and optimization process. In this paper, we focus on a privacy-aware face recognition problem where the goal is to empower the facial applications to train their face recognition models with images shared by social media users while protecting the identity of the users. Our problem is motivated by the limitation of current privacy-aware face recognition approaches that mainly prevent algorithmic attacks by manipulating face images but largely ignore the potential privacy leakage related to human activities (e.g., crowdsourcing annotation). To address such limitations, we develop FaceCrowd, a web crowdsourcing based face partition approach to improve the performance of current face recognition models by designing a novel crowdsourced partial face graph generated from privacy-preserved social media face images. We evaluate the performance of FaceCrowd using two real-world human face datasets that consist of large-scale human face images. The results show that FaceCrowd not only improves the accuracy of the face recognition models but also effectively protects the identity information of the social media users who share their face images.
Discussion on Privacy
A Cluster-Based Nearest Neighbor Matching Algorithm for Enhanced A/A Validation in Online Experimentation
Online controlled experiments are commonly used to measure how much value new features deployed to the products bring to the users. Although the experiment design is straightforward in theory, running large-scale online experiments can be quite complex. An essential step to run a rigorous experiment is to validate the balance between the buckets (a.k.a. the random samples) before it proceeds to the A/B phase. This step is called A/A validation and it serves to ensure that there are no pre-existing significant differences between the test and control buckets. In this paper, we propose a new matching algorithm to assign users to buckets and improve A/A balance. It has the capability to deal with massive user sizes and shows improved performance compared to existing methods.
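The general idea can be sketched as follows: cluster users on pre-experiment covariates, pair nearby users within each cluster, and split each pair across the two buckets so that covariate means stay balanced before the A/B phase. The exact matching rule in the paper may differ; the cluster count and features below are assumptions.

```python
# Cluster-then-pair bucket assignment sketch for A/A balance (illustrative).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
covariates = rng.normal(size=(1000, 4))        # e.g. past-activity features per user

labels = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(covariates)

bucket = np.empty(len(covariates), dtype=int)
for k in np.unique(labels):
    idx = np.where(labels == k)[0]
    # Order cluster members by distance to the cluster centre so that
    # consecutive users are near-neighbours, then alternate bucket assignment.
    centre = covariates[idx].mean(axis=0)
    order = idx[np.argsort(np.linalg.norm(covariates[idx] - centre, axis=1))]
    bucket[order] = np.arange(len(order)) % 2

# A/A check: covariate means of the two buckets should be close.
print(covariates[bucket == 0].mean(axis=0).round(3))
print(covariates[bucket == 1].mean(axis=0).round(3))
```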
Privacy-Preserving Methods for Repeated Measures Experiments
Evolving privacy practices have led to increasing restrictions around the storage of user level data. In turn, this has resulted in analytical challenges, such as properly estimating experimental statistics, especially in the case of long-running tests with repeated measurements. We propose a method for analyzing A/B tests which avoids aggregating and storing data at the unit-level. The approach utilizes a unit-level hashing mechanism which generates and stores the mean and variance of random subsets of the original population, thus allowing estimation of the variance of the average treatment effect (ATE) by bootstrap. Across a sample of past A/B tests at Netflix, we provide empirical results that demonstrate the effectiveness of the approach, and show how techniques to improve the sensitivity of experiments, such as regression adjustment, are still feasible under this new design.
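A minimal sketch of the hashing design, assuming each unit is hashed into one of B subsets and only per-subset sums and counts are retained: the average treatment effect and its bootstrap variance are then computed from subset means rather than from unit-level data. The number of subsets, the simulated metric and the single-measurement setup are simplifications of the approach described above.

```python
# Unit-level hashing into subsets, then bootstrap over subset means (sketch).
import hashlib
import numpy as np

B = 100                                      # assumed number of hashed subsets
sums = np.zeros((2, B))                      # one row per experiment arm
counts = np.zeros((2, B))

def record(user_id: str, arm: int, metric: float) -> None:
    """Stream an observation into its hashed subset; raw unit data is not stored."""
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % B
    sums[arm, h] += metric
    counts[arm, h] += 1

rng = np.random.default_rng(0)
for i in range(20000):
    arm = i % 2
    record(f"user{i}", arm, rng.normal(0.1 * arm, 1.0))   # simulated data, true ATE = 0.1

subset_means = sums / np.maximum(counts, 1)               # guard against empty subsets
ate = subset_means[1].mean() - subset_means[0].mean()

# Bootstrap over subsets instead of units to estimate the variance of the ATE.
boot = [subset_means[1, rng.integers(0, B, B)].mean()
        - subset_means[0, rng.integers(0, B, B)].mean() for _ in range(2000)]
print(round(ate, 3), round(float(np.std(boot)), 4))
```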
Informative Integrity Frictions in Social Networks
Social media platforms such as Facebook and Twitter benefited from massive adoption in the last decade, and in turn introduced the possibility of spreading harmful content, including false and misleading information. Some of these contents get massive distribution through user actions such as sharing, to the point that content removal or distribution reduction does not stop their viral spread. At the same time, social media platforms' efforts to implement solutions to preserve their Integrity are typically not transparent, so users are not aware of any Integrity intervention happening on the site. In this paper we present the rationale for adding visible friction mechanisms to content share actions in the Facebook News Feed, its design and implementation challenges, and results obtained when applying them in the platform. We discuss effectiveness metrics for such interventions, and show their effects in terms of positive Integrity outcomes, as well as in terms of bringing awareness to users about potential
Fair Effect Attribution in Parallel Online Experiments
A/B tests serve the purpose of reliably identifying the effect of changes introduced in online services. It is common for online platforms to run a large number of simultaneous experiments by splitting incoming user traffic randomly in treatment and control groups. Despite a perfect randomization between different groups, simultaneous experiments can interact with each other and create a negative impact on average population outcomes such as engagement metrics. These are measured globally and monitored to protect overall user experience. Therefore, it is crucial to measure these interaction effects and attribute their overall impact in a fair way to the respective experimenters. We suggest an approach to measure and disentangle the effect of simultaneous experiments by providing a cost sharing approach based on Shapley values. We also provide a counterfactual perspective, that predicts shared impact based on conditional average treatment effects making use of causal inference techniques. We illustrate our app
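The cost-sharing idea can be illustrated with an exact Shapley computation over a handful of experiments, where v(S) denotes the measured impact on a global metric when only the experiments in S are active. The value function below is invented for illustration; in practice it would come from the counterfactual effect estimates described above.

```python
# Exact Shapley-value cost sharing across simultaneous experiments (toy example).
from itertools import permutations

experiments = ["exp_a", "exp_b", "exp_c"]

def v(active: frozenset) -> float:
    """Hypothetical impact on a global metric when only `active` experiments run."""
    base = {"exp_a": -1.0, "exp_b": -0.5, "exp_c": -0.2}
    total = sum(base[e] for e in active)
    if {"exp_a", "exp_b"} <= active:        # a and b interact badly when run together
        total += -0.6
    return total

shapley = {e: 0.0 for e in experiments}
perms = list(permutations(experiments))
for order in perms:
    seen = set()
    for e in order:
        shapley[e] += v(frozenset(seen | {e})) - v(frozenset(seen))
        seen.add(e)
shapley = {e: round(s / len(perms), 3) for e, s in shapley.items()}
print(shapley)   # the interaction penalty is split fairly between exp_a and exp_b
```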
Deriving Customer Experience Implicitly from Social Media Data
Organizations that focus on maximizing satisfaction and on a consistent, seamless experience throughout the entire customer journey are the ones that dominate the market. Net Promoter Score (NPS) is a widely accepted metric to measure the customer experience, and the most common way to calculate it to date is by conducting a survey. But this comes with a bottleneck: the whole process can be costly, low-sample, and responder-biased, and the issues captured can be limited to the questionnaire used for the survey. We have devised a mechanism to approximate it implicitly from the mentions extracted from the four major social media platforms - Twitter, Facebook, Instagram, and YouTube. Our Data Cleaning pipeline discards the viral and promotional content (from brands, sellers, marketplaces, or public figures), and the Machine Learning pipeline captures the different customer journey nodes specific to e-commerce (like discovery, delivery, pricing) with their appropriate sentiment. Since the framework is generic and relies only on pub
Invited Talk by David Rousset (Microsoft): Building Green Progressive Web Apps
Walks in Cyberspace: Improving Web Browsing and Network Activity Analysis With 3D Live Graph Rendering
Web navigation generates traces that are useful for Web cartography, User Equipment Behavior Analysis (UEBA) and resource allocation planning. However, this data requires to be interpreted, sometimes enriched and appropriately visualized to reach its full potential. In this paper, we propose to explore the strengths and weaknesses of standard data collection methods such as mining web browser history and network traffic dumps. We developed the DynaGraph framework that combines classical traces dumping tools with a Web app for live 3D rendering of graph data. We show that mining navigation history provides useful insights but fails to provide real-time analytics and is not easy to deploy. Conversely, mining network traffic dumps appears easy to set up but rapidly fails once the data traffic is encrypted. We show that navigation patterns emerge depending on the data sampling rate when using 3D rendering.
With One Voice: Composing a Travel Voice Assistant from Repurposed Models
Voice assistants provide users a new way of interacting with digital products, allowing them to retrieve information and complete tasks with an increased sense of control and flexibility. Such products are comprised of several machine learning models, like Speech-to-Text transcription, Named Entity Recognition and Resolution, and Text Classification. Building a voice assistant from scratch takes the prolonged efforts of several teams constructing numerous models and orchestrating between components. Alternatives such as using third-party vendors or re-purposing existing models may be considered to shorten time-to-market and development costs. However, each option has its benefits and drawbacks. We present key insights from building a voice search assistant for Booking.com. Our paper compares the achieved performance and development efforts in dedicated tailor-made solutions against existing re-purposed models. We share and discuss our data-driven decisions about implementation trade-offs and their estimated o
Efficient Neural Ranking using Forward Indexes
Neural approaches, specifically transformer models, for ranking documents have delivered impressive gains in ranking performance.
Learning Probabilistic Box Embeddings for Effective and Efficient Ranking
Ranking has been one of the most important tasks in information retrieval. With the development of deep representation learning, many researchers propose to encode both the query and items into embedding vectors and rank the items according to the inner product or distance measures in the embedding space. However, the ranking models based on vector embeddings may have shortages in effectiveness and efficiency. For effectiveness, they lack the intrinsic ability to model the diversity and uncertainty of queries and items in ranking. For efficiency, nearest neighbor search in a large collection of item vectors can be costly. In this work, we propose to use the recently proposed probabilistic box embeddings for effective and efficient ranking, in which queries and items are parameterized as high-dimensional axis-aligned hyper-rectangles. For effectiveness, we utilize probabilistic box embeddings to model the diversity and uncertainty with the overlapping relations of the hyper-rectangles, and prove that such overlapping measure is a kernel function which can be adopted in other kernel-based methods. For efficiency, we propose a box embedding-based indexing method, which can safely filter irrelevant items and reduce the retrieval latency. We further design a training strategy to increase the proportion of irrelevant items that can be filtered by the index. Experiments on public datasets show that the box embeddings and the box embedding-based indexing approaches are effective and efficient in two ranking tasks: ad hoc retrieval and product recommendation.
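For intuition, the snippet below scores query-item pairs by the overlap volume of axis-aligned boxes, the quantity that underlies both the relevance measure and the index-time filtering described above; the box parameterization and the hard (non-smoothed) overlap are simplifications of the paper's formulation.

```python
# Scoring with axis-aligned box embeddings via overlap volume (simplified sketch).
import numpy as np

def box(center, offset):
    center, offset = np.asarray(center, float), np.asarray(offset, float)
    return center - offset, center + offset            # (min corner, max corner)

def overlap_volume(box_a, box_b):
    lo = np.maximum(box_a[0], box_b[0])
    hi = np.minimum(box_a[1], box_b[1])
    return float(np.prod(np.clip(hi - lo, 0.0, None)))

query = box([0.0, 0.0], [1.0, 1.0])                    # a broad (uncertain) query
item_a = box([0.5, 0.5], [0.5, 0.5])                   # sits inside the query box
item_b = box([3.0, 3.0], [0.5, 0.5])                   # disjoint from the query

print(overlap_volume(query, item_a))                   # 1.0 -> relevant
print(overlap_volume(query, item_b))                   # 0.0 -> can be filtered early
# Because zero overlap along any single dimension forces a zero score, a
# box-based index can discard such items without computing the full product.
```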
Enterprise-Scale Search: Accelerating Inference for Sparse Extreme Multi-Label Ranking Trees
Tree-based models underpin many modern semantic search engines and recommender systems due to their sub-linear inference times. In industrial applications, these models operate at extreme scales, where every bit of performance is critical. Memory constraints at extreme scales also require that models be sparse, hence tree-based models are often back-ended by sparse matrix algebra routines. However, there are currently no sparse matrix techniques specifically designed for the sparsity structure one encounters in tree-based models for extreme multi-label ranking/classification (XMR/XMC) problems. To address this issue, we present the *masked sparse chunk multiplication* (MSCM) technique, a sparse matrix technique specifically tailored to XMR trees. MSCM is easy to implement, embarrassingly parallelizable, and offers a significant performance boost to any existing tree inference pipeline at no cost. We perform a comprehensive study of MSCM applied to several different sparse inference schemes and benchmark our methods on a general purpose extreme multi-label ranking framework. We observe that MSCM gives consistently dramatic speedups across both the online and batch inference settings, single- and multi-threaded settings, and on many different tree models and datasets. To demonstrate its utility in industrial applications, we apply MSCM to an enterprise-scale semantic product search problem with 100 million products and achieve sub-millisecond latency of 0.88 ms per query on a single thread, an 8x reduction in latency over vanilla inference techniques. The MSCM technique requires absolutely no sacrifices to model accuracy as it gives exactly the same results as standard sparse matrix techniques. Therefore, we believe that MSCM will enable users of XMR trees to save a substantial amount of compute resources in their inference pipelines at very little cost. Our code is publicly available at [LINK WITHHELD], as well as our complete benchmarks and code for reproduction at [LINK TO BE PROVIDED].
Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval
Ad-hoc search calls for the selection of appropriate answers from a massive-scale corpus. Nowadays, embedding-based retrieval (EBR) becomes a promising solution, where deep learning based document representation and ANN search techniques are allied to handle this task. However, a major challenge is that the ANN index can be too large to fit into memory, given the considerable size of the answer corpus. In this work, we tackle this problem with Bi-Granular Document Representation, where the lightweight sparse embeddings are indexed and standby in memory for coarse-grained candidate search, and the heavyweight dense embeddings are hosted on disk for fine-grained post verification. For the best retrieval accuracy, a Progressive Optimization framework is designed. The sparse embeddings are learned ahead for high-quality search of candidates. Conditioned on the candidate distribution induced by the sparse embeddings, the dense embeddings are continuously learned to optimize the discrimination of ground-truth from the shortlisted candidates. Besides, two techniques, contrastive quantization and locality-centric sampling, are introduced for the learning of sparse and dense embeddings, which substantially contribute to their performances. Thanks to the above features, our method effectively handles massive-scale EBR with strong advantages in accuracy: with up to $+4.3\%$ recall gain on million-scale corpus, and up to $+17.5\%$ recall gain on billion-scale corpus. Our method is also applied to a major sponsored search platform with substantial gains on revenue ($+1.95\%$), Recall ($+1.01\%$) and CTR ($+0.49\%$). Our code will be open-sourced to facilitate the technique development.
Socialformer: Social Network Inspired Long Document Modeling for Document Ranking
Utilizing pre-trained language models such as BERT has achieved great success for neural document ranking in Information Retrieval. Limited by the computational and memory requirements, long document modeling becomes a critical issue. Recent works propose to modify the full attention matrix in Transformer by designing sparse attention patterns. However, most of them only focus on local connections of terms within a fixed-size window to model semantic dependencies. How to build suitable remote connections between terms to better model document representation remains underexplored. In this paper, we propose the model Socialformer, which introduces the characteristics of social networks into designing sparse attention patterns for long document modeling in document ranking. Specifically, we consider two document-standalone and two query-aware patterns to construct a graph like social networks. Endowed with the characteristic of social networks, most pairs of nodes in such a graph can reach with a short path while ensuring the sparsity. To facilitate efficient calculation, we segment the graph into multiple subgraphs to simulate friend circles in social scenarios. This pruning allows us to implement a two-stage information transmission model with the transformer encoder. Experimental results on two document ranking benchmarks confirm the effectiveness of our model on long document modeling.
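A small-world-style sparse attention mask, local windows plus a few random remote connections, conveys the core intuition: most token pairs become reachable via short paths while the mask stays sparse. The function below is illustrative only and does not reproduce the paper's document-standalone and query-aware patterns.

```python
# Small-world-style sparse attention mask (illustrative, not the paper's patterns).
import numpy as np

def small_world_mask(seq_len: int, window: int = 4, remote: int = 2, seed: int = 0):
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True                              # local connections
        mask[i, rng.integers(0, seq_len, remote)] = True   # a few remote connections
    return mask | mask.T                                   # keep the graph symmetric

mask = small_world_mask(512)
print(round(float(mask.mean()), 3))   # fraction of allowed attention entries (sparse)
# In the attention layer, disallowed positions would be set to -inf before softmax.
```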
SimGRACE: A Simple Framework for Graph Contrastive Learning without Data Augmentation
Graph contrastive learning (GCL) has emerged as a dominant technique for graph representation learning, which maximizes the mutual information between paired graph data augmentations that share the same semantics. Unfortunately, it is difficult to preserve semantics well during augmentations in view of the diverse nature of graph data. Currently, data augmentations in GCL that are designed to preserve semantics broadly fall into three unsatisfactory categories. First, the augmentations can be manually picked per dataset by trial-and-error. Second, the augmentations can be selected via cumbersome search (or optimization). Third, the augmentations can be obtained by introducing expensive domain-specific knowledge as guidance. All of these limit the efficiency and more general applicability of existing GCL methods. To circumvent these crucial issues, instead of devising more advanced graph data augmentation strategies, we propose a Simple framework for GRAph Contrastive lEarning, SimGRACE for brevity, which does not require data augmentations. More specifically, we take the original graph data as input and use the GNN model with its perturbed version as two encoders to obtain two correlated views. We then maximize the agreement of these two views. SimGRACE is inspired by the observation that graph data can preserve their semantics well during encoder perturbations, while not requiring manual trial-and-error, cumbersome search or expensive domain knowledge for augmentation selection. We also explain why SimGRACE can succeed. Furthermore, we devise an adversarial training scheme, dubbed AT-SimGRACE, to enhance the robustness of graph contrastive learning and theoretically explain the reasons. Albeit simple, we show that SimGRACE can yield competitive or better performance compared with state-of-the-art methods in terms of generalizability, transferability and robustness, while enjoying an unprecedented degree of flexibility, efficiency and ease of use. The codes and datasets are available at this anonymous github link: https://githu
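The encoder-perturbation idea admits a compact sketch: pass the same input through the encoder and through a weight-perturbed copy, then pull the two views together with a contrastive (NT-Xent) loss. A plain MLP stands in for the GNN encoder here, and the perturbation scale and temperature are assumed values, so this is an outline of the mechanism rather than the authors' released implementation.

```python
# Encoder-perturbation contrastive sketch (MLP stands in for a GNN encoder).
import copy
import torch
import torch.nn.functional as F

encoder = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(),
                              torch.nn.Linear(64, 64))

def perturbed_copy(model, eta: float = 0.1):
    """Return a copy of the model with Gaussian noise added to its weights."""
    twin = copy.deepcopy(model)
    with torch.no_grad():
        for p in twin.parameters():
            p.add_(eta * p.std() * torch.randn_like(p))
    return twin

def nt_xent(z1, z2, tau: float = 0.2):
    """Contrastive loss pulling matching rows of z1 and z2 together."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    n = z1.size(0)
    z = torch.cat([z1, z2], dim=0)
    sim = z @ z.t() / tau
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float("-inf"))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

x = torch.randn(128, 32)                       # a batch of graph-level readouts (assumed)
loss = nt_xent(encoder(x), perturbed_copy(encoder)(x))
loss.backward()                                # gradients flow to the original encoder
print(float(loss))
```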
Augmentations in Graph Contrastive Learning: Current Methodological Flaws & Towards Better Practices
Graph classification---i.e., the task of inferring the label of a given graph---has a wide range of applications in bioinformatics, social sciences, automated fake news detection, web document classification, and more. In many practical scenarios, including web-scale applications, where labels are scarce or hard to obtain, unsupervised learning is a natural paradigm but typically it trades off performance.
Adversarial Graph Contrastive Learning with Information Regularization
Contrastive learning is an effective unsupervised method in graph representation learning. Recently, the data augmentation based contrastive learning method has been extended from images to graphs. However, most prior works are directly adapted from the models designed for images. Unlike the data augmentation on images, the data augmentation on graphs is far less intuitive and much harder to provide high-quality contrastive samples, which are the key to the performance of contrastive learning models. This leaves much space for improvement over the existing graph contrastive learning frameworks. In this work, by introducing an adversarial graph view and an information regularizer, we propose a simple but effective method, Adversarial Graph Contrastive Learning (AGCL), to extract informative contrastive
Dual Space Graph Contrastive Learning
Unsupervised graph representation learning has emerged as a powerful tool to address real-world problems and achieves huge success in the graph learning domain. Graph contrastive learning is one of the unsupervised graph representation learning methods, which recently attracts attention from researchers and has achieved state-of-the-art performances on various tasks. The key to the success of graph contrastive learning is to construct proper contrasting pairs to acquire the underlying structural semantics of the graph. However, this key part is currently not fully explored: most of the ways to generate contrasting pairs focus on augmenting or perturbing graph structures to obtain different views of the input graph. But such strategies could degrade performance by adding noise into the graph, which may narrow the range of applications of graph contrastive learning. In this paper, we propose a novel graph contrastive learning method, namely Dual Space Graph Contrastive (DSGC) Learning, to conduct graph contrastive learning among views generated in different spaces including the hyperbolic space and the Euclidean space. Since both spaces have their own advantages to represent graph data in the embedding spaces, we hope to utilize graph contrastive learning to bridge the spaces and leverage advantages from both sides. The comparison experiment results show that DSGC achieves competitive or better performances on all the datasets. Plus, we conduct extensive experiments to analyze the impact of different graph encoders on DSGC, giving insights about how to better leverage the advantages of contrastive learning between different spaces.
Robust Self-Supervised Structural Graph Neural Network for Social Network Prediction
Recently, self-supervised graph representation learning achieved some success both in research and many real web applications, including recommendation system, social networks, and anomaly detection. However, prior works suffer from two problems. Firstly, in social networks influential neighbors are important, but the overwhelming routine in graph representation-learning utilizes the node-wise similarity metric defined on embedding vectors that cannot exactly capture the subtle local structure and the network proximity. Secondly, existing works implicitly assume a universal distribution across datasets, which presumably leads to sub-optimal models considering the potential distribution shift. To address these problems, in this paper, we learn structural embeddings in which the proximity is characterized by 1-Wasserstein distance. We propose a distributionally robust self-supervised graph neural network framework to learn the representations. More specifically, in our method, the embeddings are computed based on subgraphs centering at the node of interest and represent both the node of interest and its neighbors, which better preserves the local structure of nodes. To make our model end-to-end trainable, we adopt a deep implicit layer to compute the Wasserstein distance, which can be formulated as a differentiable convex optimization problem. Meanwhile, our distributionally robust formulation explicitly constrains the maximal diversity for matched queries and keys. As such, our model is insensitive to the data distributions and has better generalization abilities. Extensive experiments demonstrate that the graph encoder learned by our approach can be utilized for various downstream analyses, including node classification, graph classification, and top-k similarity search. The results show our algorithm outperforms state-of-the-art baselines, and the ablation study validates the effectiveness of our design.
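As a rough, non-learned analogue of a 1-Wasserstein structural proximity, one can compare the degree distributions of two nodes' ego-networks; the snippet below does exactly that with networkx and scipy, and leaves out the subgraph embeddings, implicit layers and distributionally robust training described above.

```python
# 1-Wasserstein distance between ego-network degree sequences (rough analogue only).
import networkx as nx
from scipy.stats import wasserstein_distance

G = nx.karate_club_graph()

def ego_degrees(G, node):
    """Degree sequence of the radius-1 ego-network around `node`."""
    ego = nx.ego_graph(G, node)
    return [d for _, d in ego.degree()]

# Hubs 0 and 33 have structurally similar neighbourhoods; node 1 is less similar to 33.
print(wasserstein_distance(ego_degrees(G, 0), ego_degrees(G, 33)))
print(wasserstein_distance(ego_degrees(G, 1), ego_degrees(G, 33)))
```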
Learning Robust Recommenders through Cross-Model Agreement
Learning from implicit feedback is one of the most common cases in the application of recommender systems. Generally speaking, interacted examples are considered as positive while negative examples are sampled from uninteracted ones. However, noisy examples are prevalent in real-world implicit feedback. A noisy positive example may have been interacted with even though it actually reflects negative user preference. A noisy negative example, uninteracted with only because the user was unaware of it, could also denote potential positive user preference. Conventional training methods overlook these noisy examples, leading to sub-optimal recommendations.
Learning Recommenders for Implicit Feedback with Importance Resampling
Recommendation from implicit feedback has been widely studied recently, but it seriously suffers from the lack of negative samples, which has a significant impact on the training of recommendation models. Existing negative sampling is based on static or adaptive probability distributions. Sampling from an adaptive probability receives more attention, since it tends to generate more hard examples, making recommender training converge faster. However, item sampling becomes much more time-consuming, particularly for complex recommendation models. In this paper, we propose an Adaptive Sampling method based on Importance Resampling (AdaSIR for short), which is not only almost equally efficient and accurate for any recommender model, but can also robustly accommodate arbitrary proposal distributions. More concretely, AdaSIR maintains a contextualized sample pool of fixed size with importance resampling, from which items are only uniformly sampled. Such a simple sampling method can be proved to provide approximately accurate adaptive sampling under some conditions. The sample pool plays two extra important roles in (1) reusing historical hard samples with certain probabilities; (2) estimating the rank of positive samples for weighting, such that recommender training can concentrate more on difficult positive samples. Extensive empirical experiments demonstrate that AdaSIR outperforms state-of-the-art methods in terms of sampling efficiency and effectiveness.
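The sampling-importance-resampling step at the heart of such methods can be sketched as follows: draw a cheap uniform candidate pool, weight candidates by the current model's scores, and resample a few hard negatives. The pool reuse and rank-estimation components of AdaSIR are omitted, and the scoring function below is a random stand-in.

```python
# Sampling-importance-resampling for hard negative sampling (simplified sketch).
import numpy as np

rng = np.random.default_rng(0)
n_items, pool_size, n_neg = 10000, 200, 5

def sample_hard_negatives(score_fn, user):
    pool = rng.integers(0, n_items, pool_size)     # cheap uniform proposal
    logits = score_fn(user, pool)                  # current model's scores
    w = np.exp(logits - logits.max())
    w /= w.sum()                                   # importance weights
    return rng.choice(pool, size=n_neg, replace=False, p=w)

# Hypothetical scoring function standing in for the recommender model.
fake_score = lambda user, items: rng.normal(size=len(items))
print(sample_hard_negatives(fake_score, user=42))
```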
FeedRec: News Feed Recommendation with Various User Feedbacks
Personalized news recommendation techniques are widely adopted by many online news feed platforms to target user interests. Learning accurate user interest models is important for news recommendation. Most existing methods for news recommendation rely on implicit feedbacks like click behaviors for inferring user interests and model training. However, click behaviors are implicit feedbacks and usually contain heavy noise. In addition, they cannot help infer complicated user interest such as dislike.
Choice of Implicit Signal Matters: Accounting for User Aspirations in Podcast Recommendations
Recommender systems are modulating what billions of people are exposed to on a daily basis. These systems are typically optimized for user engagement signals such as clicks, streams, etc. A common practice among practitioners is to use one signal chosen for optimization purposes. Although this practice is prevalent, little research has been done to explore the downstream impacts. Through online experiments on large-scale recommendation systems, we show that the choice of user engagement signal used for optimization can not only influence the content users are exposed to, but also what they consume. In this work, we use podcast recommendations with two engagement signals, Subscriptions vs. Plays, to show that the choice of user engagement matters. We deployed recommendation models optimized for each signal and observed that consumption outcomes substantially differ depending on the engagement signals used. Upon further investigation, we observed that users' patterns of podcast engagement differ depending on the type of podcast. Further, podcasts cater to different user goals & needs. Optimizing for streams can bias the recommendations towards certain podcast types, undermine users' aspirational interests and put some show categories at a disadvantage. Finally, using calibration we demonstrate that balanced recommendations can help address this issue and thereby satisfy diverse user interests.
Adaptive Experimentation with Delayed Binary Feedback
Conducting experiments with objectives that take significant delays to materialize (e.g. conversions, add-to-cart events, etc.) is challenging. Although the classical "split sample testing" is still valid for the delayed feedback, the experiment will take longer to complete, which also means spending more resources on worse-performing strategies due to their fixed allocation schedules.
Characterizing, Detecting, and Predicting Online Ban Evasion
Moderators and automated methods enforce bans on malicious users who engage in disruptive behavior. However, malicious users can easily create a new account to evade such bans. Previous research has focused on other forms of online deception, like the simultaneous operation of multiple accounts by the same entities (sockpuppetry), impersonation of other individuals, and studying the effects of de-platforming individuals and communities. Here we conduct the first data-driven study of ban evasion, i.e., the act of circumventing bans on an online platform, leading to temporally disjoint operation of accounts by the same user. We curate a novel dataset of 8,551 ban evasion pairs (parent, child) identified on Wikipedia and contrast their behavior with benign users and non-evading malicious users. We find that evasion child accounts demonstrate similarities with respect to their banned parent accounts on several behavioral axes, from similarity in usernames and edited pages to similarity in content added to the platform and its psycholinguistic attributes. We reveal key behavioral attributes of accounts that are likely to evade bans. Based on the insights from the analyses, we train logistic regression classifiers to detect and predict ban evasion at three different points in the ban evasion lifecycle. Results demonstrate the effectiveness of our methods in predicting future evaders (AUC = 0.78), early detection of ban evasion (AUC = 0.85), and matching child accounts with parent accounts (MRR = 0.97). Our work can aid moderators by reducing their workload and identifying evasion pairs faster and more efficiently than current manual and heuristic-based approaches.
GraphNLI: A Graph-based Natural Language Inference Model for Polarity Prediction in Online Debates
An online forum that allows participatory engagement between users, very often, becomes a stage for heated debates. These debates sometimes escalate into full blown exchanges of hate and misinformation. As such, modeling these conversations through the lens of argumentation theory as graphs of supports and attacks has shown promise. However, the argumentative relation of supports and attacks, also called the polarity, is difficult to infer from natural language exchanges, not least because support or attack relationship in natural language is intuitively contextual.
Understanding Conflicts in Online Conversations
With the rise of social media, users from across the world are able to connect and converse with each other online. While these connections have facilitated a growth in knowledge, online discussions can also end in acrimonious conflict. Previous computational studies have focused on creating online conflict detection models from inferred labels, primarily examine disagreements but not acrimony, and do not examine the conflict's emergence. Social science studies have investigated offline conflict, which can differ from its online form, and also rarely examine its emergence. Instead, we aim to interpret and understand how online conflicts arise in online personal conversations. We use a Facebook tool that allows group members to report conflict comments as our ground truth. We contrast discussions ending with a conflict report with paired non-conflict discussions from the same post. We study both user characteristics (e.g., historical user-to-user interactions) and conversation dynamics (e.g., changes in emotional intensity over the course of the conversation). Using statistical modeling techniques, we investigate which features are useful in predicting conflict. Our findings indicate that user characteristics such as the commenter's gender and previous involvement in negative online activity are strong indicators of conflict. Meanwhile, conversational dynamics, such as an increase in person-oriented discussion, are important signals of conflict as well. These results help us understand how conflicts emerge and suggest better detection models and ways to alert group administrators and members early on to mediate the conversation.
Emotion Bubbles: Emotional Composition of Online Discourse Before and After the COVID-19 Outbreak
The COVID-19 pandemic has been the single most important global agenda in the past two years. Apart from its health and economic impacts, it has also left marks on people's psychological states with the rise of depression, domestic violence, and Sinophobia. We thus traced how the overall emotional states of individual Twitter users have changed before and after the pandemic. Our data including more than 9M tweets posted by 9,493 users illustrated that the threat posed by the virus did not upset the emotional equilibrium of social media. In early 2020, COVID-19 related tweets skyrocketed in number and were filled with negative emotions; however, this emotional outburst was short-lived. We found that those who had expressed positive emotions in a pre-COVID period remained positive after the pandemic. The opposite was true for those that regularly expressed negative emotions. We show that individuals achieved such an emotional consistency by selectively focusing on emotion-reinforcing COVID-19 sub-topics. The implications of the present study were discussed in light of an emotionally motivated confirmation bias, which we conceptualized as emotion bubbles, and the public resilience upon a global health risk.
Leveraging Google's Publisher-Specific IDs to Detect Website Administration
Digital advertising is the most popular way for content monetization on the Internet. Publishers spawn new websites, and older ones change hands with the sole purpose of monetizing user traffic. In this ever-evolving ecosystem, it is challenging to effectively tell: Which entities monetize what websites? What categories of websites does an average entity typically monetize and how diverse these websites are? How has this website administration ecosystem changed across time?
Recommendation Unlearning
Recommender systems provide essential web services by learning users' personal preferences from collected data. However, in many cases, systems also need to forget some training data. From the perspective of privacy, several privacy regulations have recently been proposed, requiring systems to eliminate any impact of the data whose owner requests to forget. From the perspective of utility, if a system's utility is damaged by some bad data, the system needs to forget these data to regain utility. From the perspective of usability, users can delete noise and incorrect entries so that a system can provide more useful recommendations. While unlearning is very important, it has not been well considered in existing recommender systems. Although some research has studied the problem of machine unlearning in the domains of image and text data, existing methods cannot be directly applied to recommendation as they are unable to consider the collaborative information.
A Contrastive Sharing Model for Multi-Task Recommendation
Multi-Task Learning (MTL) has attracted increasing attention in recommender systems. A crucial challenge in MTL is to learn suitable shared parameters among tasks and to avoid negative transfer of information. The most recent sparse sharing models use independent parameter masks, which only activate useful parameters for a task, to choose the useful subnet for each task. However, as all the subnets are optimized in parallel for each task independently, they face the problem of conflict between parameter gradient updates (i.e., the parameter conflict problem). To address this challenge, we propose a novel Contrastive Sharing Recommendation model in MTL learning (CSRec). Each task in CSRec learns from the subnet by the independent parameter mask as in sparse sharing models, but a contrastive mask is carefully designed to evaluate the contribution of the parameter to a specific task. A conflicting parameter is then optimized relying more on the task that is more impacted by it. Besides, we adopt an alternating training strategy in CSRec, making it possible to self-adaptively update the conflicting parameters by fair competition. We conduct extensive experiments on three real-world large-scale datasets, i.e., Tencent Kandian, Ali-CCP and Census-income, showing better effectiveness of our model over state-of-the-art methods for both offline and online MTL recommendation scenarios.
Who to Watch Next: Two-side Interactive Networks for Live Broadcast Recommendation
With the prevalence of the live broadcast business nowadays, a new type of recommendation service, called live broadcast recommendation, is widely used in many mobile e-commerce apps. Different from classical item recommendation, live broadcast recommendation automatically recommends anchors to users, considering the interactions among triple objects (i.e., users, anchors, items) rather than binary interactions between users and items. Existing methods used in industry usually only consider attribute and statistical information of users and anchors, failing to take full advantage of rich dynamic contextual information. Moreover, their techniques based on binary objects, ranging from early matrix factorization to recently emerged deep learning, obtain objects' embeddings by mapping from pre-existing features without encoding collaborative signals among the triple objects, which leads to limited performance. In this paper, we propose a novel framework named TWINS for live broadcast recommendation. In order to fully use both static and dynamic information on the user and anchor sides, we combine product-based neural networks with recurrent neural networks to learn the embedding of each object. In addition, instead of directly measuring similarity, TWINS effectively injects collaborative effects into the embedding process in an explicit manner by modeling interactive patterns between the user's browsing history and the anchor's broadcast history in both the item and anchor aspects. Furthermore, we design a novel co-retrieval technique to efficiently select key user-browsed and anchor-broadcast items among massive historical records. Offline experiments on real large-scale data show the superior performance of the proposed TWINS compared to representative methods, and further online experiments on the Taobao live broadcast app show that TWINS gains average performance improvements of around 8% in ACTR, 3% in UCTR, and 3.5% in UCVR.
Neuro-Symbolic Interpretable Collaborative Filtering for Attribute-based Recommendation
Recommender System (RS) is ubiquitous on today's Internet to provide multifaceted personalized information services.
Making Decision like Human: Joint Aspect Category Sentiment Analysis and Rating Prediction with Fine-to-Coarse Reasoning
Joint aspect category sentiment analysis (ACSA) and rating prediction (RP) is a newly proposed task (namely ASAP) that integrates the characteristics of both fine-grained and coarse-grained sentiment analysis. However, prior joint models for the ASAP task only consider the shallow interaction between the two granularities. In this work, we draw inspiration from human intuition and present an innovative fine-to-coarse reasoning framework for better joint task performance. Our system advances mainly in three aspects. First, we additionally make use of the category label text features, co-encoding them with the input document texts, allowing the model to accurately capture the key clues of each category. Second, we build a fine-to-coarse hierarchical label graph, modeling the aspect categories and the overall rating as a hierarchical structure for full interaction of the two granularities. Third, we propose to perform global iterative reasoning with cross-collaboration between the hierarchical label graph and the context graphs, enabling sufficient communication between categories and review contexts. On the ASAP dataset, experimental results demonstrate that our proposed framework outperforms state-of-the-art baselines by large margins, achieving accuracy improvements of 5.03% and 4.59% on ACSA and RP, respectively. Further in-depth analyses show that our method is effective in addressing both the unbalanced data distribution and the long-text issue.
Industry talk by Amine Issa (Mobalytics)
FingFormer: Contrastive Graph-based Finger Operation Transformer for Unsupervised Mobile Game Bot Detection
This paper studies the task of detecting bots in online mobile games. Given the lack of labeled cheating samples and the restricted data available in real detection systems, we aim to study the finger operations captured by screen sensors to infer potential bots in an unsupervised way. In detail, we introduce a Transformer-style detection model, namely FingFormer. It studies the finger operations in the form of graph structures in order to capture the spatial and temporal relatedness between the two hands' operations. To optimize the model in an unsupervised way, we introduce two contrastive learning strategies to refine both finger moving patterns and players' operation habits.
Unsupervised Representation Learning of Player Behavioral Data with Confidence Guided Masking
Players of online games generate rich behavioral data during gaming. Based on this player behavioral data, game developers can build a range of data science applications, including bot detection, social recommendation, and churn prediction with machine learning techniques, to improve the gaming experience and increase revenue. However, the development of such applications requires much work, including data cleansing, training sample labeling, feature engineering, and model development, which makes the use of such applications in small and medium-sized game studios still uncommon. While acquiring supervised learning data is costly, unlabeled behavioral logs are often continuously and automatically generated in games. Thus we resort to unsupervised representation learning of player behavioral data to facilitate optimizing intelligent services in games. Behavioral data has many unique properties, including semantic complexity, excessive length, extreme long-tail imbalance among tokens, etc. A noteworthy property of raw player behavioral data is that much of it is task-irrelevant. For these data characteristics, we introduce a BPE-enhanced compression method and propose a novel adaptive masking strategy called Masking by Token Confidence (MTC) for the Masked Language Modeling (MLM) pre-training task. MTC dynamically adjusts the masking probabilities of tokens based on the model's predictions, increasing the masking probabilities of task-relevant tokens. Experiments on four downstream tasks and successful deployment in a world-renowned MMORPG prove the effectiveness of the proposed MTC strategy.
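To make the confidence-guided masking idea concrete, here is a minimal sketch (not the paper's MTC implementation): per-token masking probabilities are raised for tokens the model currently predicts with low confidence, while the expected masking rate stays near a base rate. The function name, confidence values, and rate parameters are hypothetical placeholders.

```python
import numpy as np

def confidence_guided_mask_probs(token_confidences, base_rate=0.15, floor=0.05, ceil=0.5):
    """Map per-token prediction confidences (0..1) to masking probabilities.

    Tokens predicted with low confidence are treated as more task-relevant and
    receive higher masking probabilities; probabilities are rescaled so the
    expected masking rate stays near `base_rate`.  Sketch only, not MTC itself.
    """
    conf = np.asarray(token_confidences, dtype=float)
    raw = 1.0 - conf                        # low confidence -> high priority
    raw = raw / (raw.mean() + 1e-8)         # normalize to mean 1
    return np.clip(base_rate * raw, floor, ceil)

rng = np.random.default_rng(0)
confidences = rng.uniform(0.2, 0.99, size=12)   # hypothetical per-token confidences
probs = confidence_guided_mask_probs(confidences)
mask = rng.random(12) < probs                   # tokens selected for [MASK]
print(np.round(probs, 3), mask)
```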
Nebula: Reliable Low-latency Video Transmission for Mobile Cloud Gaming
Mobile cloud gaming enables high-end games on constrained devices by streaming the game content from powerful servers through mobile networks. Mobile networks suffer from highly variable bandwidth, latency, and losses that affect the gaming experience. This paper introduces Nebula, an end-to-end cloud gaming framework to minimize the impact of network conditions on the user experience. Nebula relies on an end-to-end distortion model adapting the video source rate and the amount of frame-level redundancy based on the measured network conditions.
The Price to Play: A Privacy Analysis of Free and Paid Games in the Android Ecosystem
With an ever-growing number of smartphone users, the mobile
Distributionally-robust Recommendations for Improving Worst-case User Experience
Modern recommender systems have evolved rapidly along with deep learning models that are well-optimized for overall performance, especially those trained under Empirical Risk Minimization (ERM). However, a recommendation algorithm that focuses solely on average performance may reinforce exposure bias and exacerbate the "rich-get-richer" effect, leading to unfair user experiences. In a simulation study, we demonstrate that such a performance gap among user groups is enlarged by an ERM-trained recommender in the long term. To mitigate such amplification effects, we propose to optimize for worst-case performance under the Distributionally Robust Optimization (DRO) framework, with the goal of improving long-term fairness for disadvantaged subgroups. In addition, we propose a simple yet effective streaming optimization improvement called Streaming-DRO (S-DRO), which effectively reduces loss variance for recommendation problems with sparse and long-tailed data distributions. Our results on two large-scale datasets suggest that (1) DRO is a flexible and effective technique for improving worst-case performance, and (2) Streaming-DRO outperforms vanilla DRO and other strong baselines by improving worst-case and overall performance at the same time.
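The group-DRO building block that the abstract alludes to can be sketched in a few lines (this is the generic formulation, not the paper's S-DRO; the group assignments, losses, and step size below are hypothetical):

```python
import torch

def group_dro_loss(per_sample_loss, group_ids, group_weights, eta=0.1):
    """One group-DRO step: upweight user groups with high average loss.

    per_sample_loss: (N,) losses from any recommendation objective.
    group_ids:       (N,) long tensor assigning each sample to a user group.
    group_weights:   (G,) tensor on the simplex, updated in place (no grad).
    """
    G = group_weights.numel()
    group_losses = []
    for g in range(G):
        m = group_ids == g
        group_losses.append(per_sample_loss[m].mean() if m.any() else torch.tensor(0.0))
    group_losses = torch.stack(group_losses)
    with torch.no_grad():                        # exponentiated-gradient update
        group_weights *= torch.exp(eta * group_losses)
        group_weights /= group_weights.sum()
    return (group_weights * group_losses).sum()  # robust objective to backprop

# hypothetical batch: 8 samples from 3 user groups
losses = torch.rand(8, requires_grad=True)
groups = torch.tensor([0, 0, 1, 2, 2, 1, 0, 2])
weights = torch.ones(3) / 3
group_dro_loss(losses, groups, weights).backward()
```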
Following Good Examples - Health Goal-Oriented Food Recommendation based on Behavior Data
Typical recommender systems try to mimic the past behaviors of users to make future recommendations. For example, in food recommendation, they tend to recommend the foods the user prefers. While the recommended foods may be easily accepted by the user, they cannot improve the user's dietary habits toward a specific goal such as weight control. In this paper, we build a food recommendation system that can be used on the web or in a mobile app to help users meet their goals on body weight, while also taking into account their health information (BMI) and the nutrition information of foods (calories). Instead of applying dietary guidelines as constraints, we build recommendation models from the successful behaviors of comparable users: the weight loss model is trained using the historical food consumption data of similar users who successfully lost weight. By combining such a goal-oriented recommendation model with a general model, the recommendations can be smoothly tuned toward the goal without disruptive food changes. We tested the approach on real data collected from a popular weight management app. It is shown that our recommendation approach can better predict the foods for test periods in which the user truly meets the goal than typical existing approaches.
Link Recommendations for PageRank Fairness
Network algorithms play a critical role in a variety of applications, such as recommendations, diffusion maximization, and web search. In this paper, we focus on the fairness of such algorithms, and more specifically of the PageRank algorithm. PageRank fairness asks for a fair allocation of the PageRank weights to the minority group of nodes, both at a global and at a personalized level. We look at the structure of the network: concretely, we provide analytical formulas for computing the effect of an edge addition on fairness as well as the conditions that an edge must satisfy so that its addition improves fairness. We also provide analytical formulas for evaluating the role that existing edges play in fairness. We use our findings to propose efficient link recommendation algorithms based on absorbing random walks that aim at maximizing fairness. We evaluate the impact that our link recommendation algorithms have on improving PageRank fairness using various real datasets.
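The paper's analytical formulas are not reproduced here, but the underlying objective is easy to state in code. The sketch below measures the PageRank mass received by a protected node set and scores candidate edges by brute-force recomputation; the graph, source node, and protected group are hypothetical, and the brute-force loop stands in for the paper's efficient absorbing-random-walk algorithms.

```python
import networkx as nx

def minority_pagerank_mass(G, minority, alpha=0.85):
    """Fraction of total PageRank weight allocated to the minority group."""
    pr = nx.pagerank(G, alpha=alpha)
    return sum(pr[v] for v in minority)

def best_edge_for_fairness(G, source, minority, alpha=0.85):
    """Brute-force variant: try each edge from `source` and keep the one that
    most increases the minority group's PageRank share."""
    base = minority_pagerank_mass(G, minority, alpha)
    best_edge, best_gain = None, 0.0
    for v in G.nodes:
        if v == source or G.has_edge(source, v):
            continue
        H = G.copy()
        H.add_edge(source, v)
        gain = minority_pagerank_mass(H, minority, alpha) - base
        if gain > best_gain:
            best_edge, best_gain = (source, v), gain
    return best_edge, best_gain

# hypothetical toy graph with protected group {4, 5}
G = nx.gnp_random_graph(20, 0.15, seed=7, directed=True)
print(best_edge_for_fairness(G, source=0, minority={4, 5}))
```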
Causal Representation Learning for Out-of-Distribution Recommendation
Modern recommender systems learn user representations from historical interactions, which suffer from the problem of user feature shifts, such as an income increase. Historical interactions inject out-of-date information into the representation, in conflict with the latest user features, leading to improper recommendations. In this work, we consider the Out-Of-Distribution (OOD) recommendation problem in an OOD environment with user feature shifts. To pursue high fidelity, we set two additional objectives for representation learning: 1) strong OOD generalization and 2) fast OOD adaptation.
ExpScore: Learning Metrics for Recommendation Explanation
Many information access and machine learning systems, including recommender systems, lack transparency and accountability. High-quality recommendation explanations are of great significance to enhance the transparency and interpretability of such systems.
Veracity-aware and Event-driven Personalized News Recommendation for Fake News Mitigation
In the current era of information explosion, fake news has drawn much attention from researchers. Most researchers have focused on detecting fake news, while the mitigation of fake news is less studied, though it is even more important for combating fake news. Most existing work on fake news mitigation learns a prevention strategy (i.e., an optimal model parameter) based on an information diffusion model under special settings. However, the learned strategy is often not very actionable in the real world. In this paper, we propose a novel strategy to mitigate fake news by recommending personalized true news to users. Accordingly, we propose a novel news recommendation framework tailored for fake news mitigation, called Rec4Mit for short. Thanks to its particular design, Rec4Mit is not only able to capture a given user's current reading interest (i.e., which event the user focuses on, e.g., the US election) from her/his recent reading history (which may contain true and/or fake news), but also to accurately predict the label (true or fake) of candidate news. As a result, Rec4Mit can recommend the most suitable true news to best match the user's preference as well as to mitigate fake news. In particular, for those users who have read fake news of a certain event, Rec4Mit is able to recommend the corresponding true news of the same event. Extensive experiments on real-world datasets show that Rec4Mit significantly outperforms state-of-the-art news recommendation methods in terms of the capability to recommend personalized true news for the fake news mitigation task.
Q&A
Short, Colorful, and Irreverent! A Comparative Analysis of New Users on WallstreetBets During the Gamestop Short-squeeze
WallStreetBets (WSB) is a Reddit community that primarily discusses high-risk options and stock trading. In January 2021, it attracted worldwide attention as one of the epicentres of a significant short squeeze on US markets. Following this event, the number of users and their activity increased exponentially. In this paper, we study the changes caused in the WSB community by such an increase in activity. We perform a comparative analysis between long-term users and newcomers and examine their respective writing styles, topics, and susceptibility to community feedback. We report a significant difference in the post length and the number of emojis between the regular and new users joining WSB. Newer users' activity also tends to follow more closely the stock prices of the affected companies. Finally, although community feedback affects the choices of topics for all users, new users are less prone to select their subsequent message topics based on past community feedback.
Cyclic Arbitrage in Decentralized Exchanges
Decentralized Exchanges (DEXes) enable users to create markets for exchanging any pair of cryptocurrencies. The direct exchange rate of two tokens may not match the cross-exchange rate in the market, and such price discrepancies open up arbitrage possibilities by trading through different cryptocurrencies cyclically. In this paper, we conduct a systematic investigation of cyclic arbitrage in DEXes. We propose a theoretical framework for studying cyclic arbitrage. With our framework, we analyze the profitability conditions and optimal trading strategies of cyclic transactions. We further examine exploitable arbitrage opportunities and the market size of cyclic arbitrage with transaction-level data of Uniswap V2. We find that traders have executed 292,606 cyclic arbitrages over eleven months and exploited more than 138 million USD in revenue. However, the revenue of the most profitable unexploited opportunity is persistently higher than 1 ETH (4,000 USD), which indicates that DEX markets may not be efficient.
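The core profitability condition for a cyclic arbitrage is simply that the product of exchange rates along a cycle exceeds one. A minimal sketch (hypothetical rates, ignoring pool fees, slippage, and gas costs, which the paper's framework does account for):

```python
from itertools import permutations

# hypothetical direct exchange rates between tokens on a DEX
rates = {
    ("ETH", "DAI"): 4000.0, ("DAI", "ETH"): 1 / 4050.0,
    ("ETH", "USDC"): 3990.0, ("USDC", "ETH"): 1 / 4000.0,
    ("DAI", "USDC"): 1.002,  ("USDC", "DAI"): 0.999,
}

def triangular_arbitrage(rates, tokens):
    """Return 3-cycles whose rate product exceeds 1, i.e., cycles that are
    profitable before fees; real cyclic arbitrage must also account for pool
    fees, constant-product slippage, and gas costs."""
    cycles = []
    for a, b, c in permutations(tokens, 3):
        legs = [(a, b), (b, c), (c, a)]
        if all(l in rates for l in legs):
            product = rates[(a, b)] * rates[(b, c)] * rates[(c, a)]
            if product > 1.0:
                cycles.append(((a, b, c), product))
    return sorted(cycles, key=lambda x: -x[1])

print(triangular_arbitrage(rates, ["ETH", "DAI", "USDC"]))
```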
An Exploratory Study of Stock Price Movements from Earnings Calls
Financial market analysis has focused primarily on extracting signals from accounting, stock price, and other numerical "hard" data reported in P&L statements or earnings per share reports. Yet, it is well known that decision-makers routinely use "soft" text-based documents that interpret the hard data they narrate. Recent advances in computational methods for analyzing unstructured and soft text-based data at scale offer possibilities for understanding financial market behavior that could improve investments and market equity. A critical and ubiquitous form of soft data is earnings calls. Earnings calls are periodic (often quarterly) statements, usually by CEOs, that attempt to influence investors' expectations of a company's past and future performance. Here, we study the statistical relationship between earnings calls, company sales, stock performance, and analysts' recommendations. Our study covers a decade of observations with approximately 100,000 transcripts of earnings calls from 6,300 public companies.
WISE: Wavelet based Interpretable Stock Embedding for Risk-Sensitive Portfolio Management
Markowitz's portfolio theory is the cornerstone of the risk-averse portfolio selection (RPS) problem, the core of which lies in minimizing the risk, i.e., a value calculated based on a portfolio risk matrix. Because the real risk matrix is unobservable, usual practice compromises by using the covariance matrix of all stocks in the portfolio, computed from their historical prices, to estimate the risk matrix, which, however, lacks interpretability of the computed risk degree. In this paper, we propose a novel RPS method named WISE based on wavelet decomposition, which not only fully exploits stock time series from the perspectives of the time domain and frequency domain, but also has the advantage of providing interpretability of the portfolio decision from different frequency angles. In addition, in WISE, we design a theoretically guaranteed wavelet basis selection mechanism and three auxiliary enhancement tasks to adaptively find suitable wavelet parameters and improve the representation ability of the stock embeddings.
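The wavelet decomposition that WISE builds on can be illustrated with PyWavelets; the wavelet family, decomposition level, and the synthetic price series below are placeholder choices, not the ones selected by the paper's basis-selection mechanism.

```python
import numpy as np
import pywt  # PyWavelets

# hypothetical daily closing prices for one stock
rng = np.random.default_rng(1)
prices = 100 + np.cumsum(rng.normal(0, 1, size=256))

# multi-level discrete wavelet transform: one approximation (low-frequency
# trend) plus detail coefficients at increasingly fine frequency bands
coeffs = pywt.wavedec(prices, wavelet="db4", level=3)

# energy share per band: a simple, interpretable summary of where the
# variation of the series lives in the frequency domain
energies = np.array([np.sum(c ** 2) for c in coeffs])
shares = energies / energies.sum()
for name, share in zip(["A3", "D3", "D2", "D1"], shares):
    print(f"{name}: {share:.2%} of signal energy")
```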
Search Filter Ranking with Language-Aware Label Embeddings
A search on the major eCommerce platforms returns up to thousands of relevant products, making it impossible for an average customer to audit all the results. Browsing the list of relevant items can be simplified using search filters for specific requirements (e.g., shoes of the wrong size). The complete list of available filters is often overwhelming and hard to visualize. Thus, successful user interfaces aim to display only the filters relevant to customer queries. In this work, we frame the filter selection task as an extreme multi-label classification (XMLC) problem based on historical interactions with eCommerce sites. We learn from customers' clicks and purchases which subset of filters is most relevant to their queries, treating the relevant/not-relevant signal as binary labels. A common problem in classification settings with a large number of classes is that some classes are underrepresented. These rare categories are difficult to predict. Building on previous work, we show that classification performance on these rare categories can be improved with language-aware label embeddings.
Invited Talk by Ruben Verborgh (Ghent University - imec): Developing Apps for a Decentralized Web
Web Audio Modules 2.0: an open Web Audio plugin standard
A group of academic researchers and developers from the computer music industry have joined forces for over a year to propose a new version of Web Audio Modules, an open-source framework facilitating the development of high-performance Web Audio plugins (instruments, real-time audio effects, and MIDI processors). While JavaScript and Web standards are becoming increasingly flexible and powerful, C, C++, and domain-specific languages such as FAUST or Csound remain the standard used by professional developers of native plugins. Fortunately, it is now possible to compile them to WebAssembly, which makes them suitable for the Web platform. Our work aims to create a continuum between native and browser-based audio app development and to appeal to programmers from both worlds. This paper presents our proposal, including guidelines and implementations for an open Web Audio plugin standard - essentially the infrastructure to support high-level audio plugins for the browser.
JSRehab: Weaning Common Web Interface Components from JavaScript Addiction
Leveraging JavaScript (JS) for User Interface (UI) interactivity has been the norm on the web for many years. Yet, using JS increases bandwidth and battery consumption, as scripts need to be downloaded and processed by the browser. Moreover, client-side JS may expose visitors to security vulnerabilities such as Cross-Site Scripting (XSS). This paper introduces a new server-side plugin, called JSRehab, that automatically rewrites common web interface components with alternatives that do not require any JavaScript. The main objective of JSRehab is to drastically reduce - and ultimately remove - the inclusion of JS in a web page to improve its responsiveness and consume fewer resources. We report on our implementation of JSRehab for Bootstrap, by far the most popular UI framework, and evaluate it on a corpus of 100 webpages. We show through manual validation that it is indeed possible to lower the dependency of pages on JS while keeping their interactivity and accessibility intact. We observe that JSRehab brings energy savings.
Regulatory Instruments for Fair Personalized Pricing
Personalized pricing is a business strategy that charges different prices to individual consumers based on their characteristics and behaviors. It has become common practice in many industries nowadays due to the availability of a growing amount of high-granularity consumer data. The discriminatory nature of personalized pricing has triggered heated debate among policymakers and academics on how to design regulation policies to balance market efficiency and social welfare. In this paper, we propose two sound policy instruments, i.e., capping the range of the personalized prices or capping their ratios. We investigate the optimal pricing strategy of a profit-maximizing monopoly under both regulatory constraints and the impact of imposing them on consumer surplus, producer surplus, and social welfare. We theoretically prove that both proposed constraints can help balance consumer surplus and producer surplus at the expense of total surplus for common demand distributions, such as uniform, logistic, and exponential distributions. Experiments on both simulated and real-world datasets demonstrate the correctness of these theoretical results. Our findings and insights shed light on regulatory policy design for the increasingly monopolized businesses of the digital era.
Optimal Collaterals in Multi-Enterprise Investment Networks
We study a market of investments on networks, where each agent (vertex) can invest in any enterprise linked to her, and at the same time raise capital for her firm's enterprise from other agents she is linked to. Failing to raise sufficient capital results in the firm defaulting and being unable to invest in others. Our main objective is to examine the role of collateral contracts in handling the strategic risk that can propagate into systemic risk throughout the network in a cascade of defaults. We take a mechanism-design approach and solve for the optimal scheme of collateral contracts that capital raisers offer their investors. These contracts aim at sustaining the efficient level of investment as a unique Nash equilibrium while minimizing the total collateral.
Allocating Stimulus Checks in Times of Crisis
We study the problem of financial assistance (bailouts, stimulus payments, or subsidy allocations) in a network where individuals experience income shocks. These questions are pervasive both in policy domains and in the design of new Web-enabled forms of financial interaction. We build on the financial clearing framework of Eisenberg and Noe, extending it to incorporate a bailout policy based on discrete bailouts motivated by stimulus programs in both offline and online settings. We show that optimally allocating such bailouts on a financial network in order to maximize a variety of social welfare objectives of this form is a computationally intractable problem. We develop approximation algorithms to optimize these objectives and establish guarantees for their approximation ratios. Then, we incorporate multiple fairness constraints in the optimization problems and study their boundedness. Finally, we apply our methodology both to a system of large financial institutions with real-world data and to a realistic societal setting with financial interactions between people and businesses, for which we use semi-artificial data derived from mobility patterns. Our results suggest that the algorithms we develop and study perform reasonably in practice and outperform other network-based heuristics. We argue that, viewed through this societal-level lens, the presented problem could assist policymakers in making informed decisions on issuing subsidies.
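The Eisenberg-Noe clearing framework that the paper extends has a compact fixed-point formulation, sketched below with a cash-injection argument as a crude stand-in for a discrete bailout; the three-node network is hypothetical, and none of the paper's approximation algorithms or fairness constraints are shown.

```python
import numpy as np

def eisenberg_noe_clearing(liabilities, external_assets, bailout=None, tol=1e-10):
    """Fixed-point iteration for the Eisenberg-Noe clearing payment vector.

    liabilities[i, j] is what node i owes node j; external_assets[i] is node
    i's cash from outside the network; bailout[i] is an optional injection.
    """
    L = np.asarray(liabilities, dtype=float)
    p_bar = L.sum(axis=1)                                  # total obligations
    with np.errstate(invalid="ignore", divide="ignore"):
        Pi = np.where(p_bar[:, None] > 0, L / p_bar[:, None], 0.0)
    e = np.asarray(external_assets, dtype=float)
    if bailout is not None:
        e = e + np.asarray(bailout, dtype=float)
    p = p_bar.copy()
    while True:                                            # monotone iteration
        p_new = np.minimum(p_bar, e + Pi.T @ p)
        if np.max(np.abs(p_new - p)) < tol:
            return p_new
        p = p_new

# hypothetical 3-node network: node 0 defaults without a bailout
L = np.array([[0, 5, 0], [0, 0, 4], [3, 0, 0]], dtype=float)
e = np.array([1.0, 1.0, 0.5])
print(eisenberg_noe_clearing(L, e))                        # node 0 pays 4 of 5
print(eisenberg_noe_clearing(L, e, bailout=[1.0, 0, 0]))   # all obligations met
```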
CoSimHeat: An Effective Heat Kernel Similarity Measure Based on Billion-Scale Network Topology
Myriads of web applications in the Big Data era demand an effective measure of similarity based on billion-scale network structures, e.g., collaborative filtering. Recently, CoSimRank has been devised as a promising graph-theoretic similarity model, which iteratively captures the notion that "two distinct nodes are evaluated as similar if they are connected with similar nodes". However, the existing CoSimRank model for assessing similarities may either yield unsatisfactory results or be rather cost-prohibitive, rendering it impractical on massive graphs. In this paper, we propose CoSimHeat, a novel scalable graph-theoretic similarity model based on heat diffusion. Specifically, we first formulate the CoSimHeat model by taking advantage of heat diffusion to emulate the activities of similarity propagation on the Web. Then, we show that the similarities produced by CoSimHeat are more satisfactory than those from the CoSimRank family, since CoSimHeat fulfils four axioms that an ideal similarity model should satisfy while circumventing the "dead-loop" problem of CoSimRank. Next, we propose a fast algorithm to substantially accelerate CoSimHeat computations on billion-sized graphs, with guarantees of accuracy. Our experiments on various datasets validate that CoSimHeat achieves higher accuracy and is orders of magnitude faster than state-of-the-art competitors.
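CoSimHeat itself is not reproduced here, but the heat-diffusion building block it relies on can be sketched as a truncated Taylor expansion of the heat kernel on a small graph; the adjacency matrix, diffusion time, and number of terms are hypothetical.

```python
import numpy as np

def heat_diffusion(A, x0, t=1.0, terms=10):
    """Approximate exp(-t L) @ x0 with a truncated Taylor series, where
    L = I - P and P is the row-stochastic transition matrix of the graph."""
    A = np.asarray(A, dtype=float)
    deg = np.maximum(A.sum(axis=1), 1e-12)
    P = A / deg[:, None]
    L = np.eye(A.shape[0]) - P
    out = x0.astype(float).copy()
    term = x0.astype(float).copy()
    for k in range(1, terms):
        term = (-t / k) * (L @ term)       # next Taylor term of exp(-tL) x0
        out += term
    return out

def node_similarity(A, u, v, t=1.0):
    """Overlap of the heat profiles diffused from two seed nodes."""
    n = A.shape[0]
    hu = heat_diffusion(A, np.eye(n)[u], t)
    hv = heat_diffusion(A, np.eye(n)[v], t)
    return float(hu @ hv)

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(node_similarity(A, 0, 1), node_similarity(A, 0, 3))
```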
Efficient and Effective Similarity Search over Bipartite Graphs
Similarity search over a bipartite graph aims to retrieve from the graph the nodes that are similar to each other, which finds applications in various fields such as online advertising and recommender systems. Existing similarity measures either (i) overlook the unique properties of bipartite graphs, or (ii) fail to capture high-order information between nodes accurately, leading to suboptimal result quality. Recently, Hidden Personalized PageRank (HPP) has been applied to this problem and found to be more effective than prior similarity measures. However, existing solutions for HPP computation incur significant computational costs, rendering them inefficient, especially on large graphs.
RETE: Retrieval-Enhanced Temporal Event Forecasting on Unified Query Product Evolutionary Graph
With the increasing demands on e-commerce platforms, a massive amount of user action history is emerging. Those enriched action records are vital to understanding users' interests and intents. Recently, prior works on user behavior prediction have mainly focused on interactions with product-side information. However, interactions with search queries, which usually act as a bridge between users and products, are still underinvestigated. In this paper, we explore a new problem named temporal event forecasting, a generalized user behavior prediction task in a unified query-product evolutionary graph, to embrace both query and product recommendation in a temporal manner. This setting involves two challenges: (1) the action data for most users is scarce; (2) user preferences are dynamically evolving and shifting over time. To tackle these issues, we propose a novel Retrieval-Enhanced Temporal Event (RETE) forecasting framework. Unlike existing methods that enhance user representations by roughly absorbing information from connected entities in the whole graph, RETE efficiently and dynamically retrieves relevant entities centered on each user as high-quality subgraphs, preventing noise propagation from the densely evolving graph structures that incorporate abundant search queries. Meanwhile, RETE autoregressively accumulates retrieval-enhanced user representations at each time step to capture evolutionary patterns for joint query and product prediction. Empirically, extensive experiments on both the public benchmark and four real-world industrial datasets demonstrate the effectiveness of the proposed RETE method.
Unified Question Generation with Continual Lifelong Learning
Question Generation (QG), a challenging Natural Language Processing task, aims at generating questions based on given answers and context. Existing QG methods mainly focus on building or training models for specific QG datasets. These works are subject to two major limitations: (1) They are dedicated to specific QG formats (e.g., answer-extraction or multi-choice QG); therefore, to address a new format of QG, a re-design of the QG model is required. (2) Optimal performance is only achieved on the dataset they were trained on. As a result, we have to train and keep various QG models for different QG datasets, which is resource-intensive and ungeneralizable. To solve these problems, we propose a model named Unified-QG based on lifelong learning techniques, which can continually learn QG tasks across different datasets and formats. Specifically, we first build a format-convert encoding to transform different kinds of QG formats into a unified representation. Then, a method named STRIDER (SimilariTy RegularIzed Difficult Example Replay) is built to alleviate catastrophic forgetting in continual QG learning. Extensive experiments were conducted on 8 QG datasets across 4 QG formats (answer-extraction, answer-abstraction, multi-choice, and boolean QG) to demonstrate the effectiveness of our approach. Experimental results demonstrate that our Unified-QG can effectively and continually adapt to QG tasks as datasets and formats vary. In addition, we verify the ability of a single trained model to improve the performance of 8 Question Answering (QA) systems by generating synthetic QA data.
Translating Place-Related Questions to GeoSPARQL Queries
Many place-related questions can only be answered by complex spatial reasoning, a task poorly supported by factoid question retrieval. Such reasoning, using combinations of spatial and non-spatial criteria pertinent to place-related questions, is increasingly possible on linked data knowledge bases. Yet, to enable question answering based on linked knowledge bases, natural language questions must first be re-formulated as formal queries. Here, we first present an enhanced version of YAGO2geo, the geospatially enabled variant of the YAGO2 knowledge base, by linking and adding more than one million places from OpenStreetMap data to YAGO2. We then propose a novel approach to translate place-related questions into logical representations, theoretically grounded in the core concepts of spatial information. Next, we use a dynamic template-based approach to generate fully executable GeoSPARQL queries from the logical representations. We test our approach using the Geospatial Gold Standard dataset and report substantial improvements over existing methods.
Can Machine Translation be a Reasonable Alternative for Multilingual Question Answering Systems over Knowledge Graphs?
Providing access to information is the main and most important purpose of the Web.
Collaborative Filtering with Attribution Alignment for Review-based Non-overlapped Cross Domain Recommendation
Cross-Domain Recommendation (CDR) has been widely studied to utilize knowledge from different domains to alleviate the data sparsity and cold-start problems in recommender systems. In this paper, we focus on the Review-based Non-overlapped Cross Domain Recommendation (RNCDR) problem. The problem is common and challenging due to two main aspects: there are only positive user-item ratings in the target domain, and there are no overlapping users across domains. Most previous CDR approaches cannot solve the RNCDR problem well, since (1) they cannot effectively combine reviews with other information (e.g., IDs or ratings) to obtain expressive user or item embeddings, and (2) they cannot reduce the domain discrepancy between users and items; e.g., the commonly adopted adversarial learning approaches require an extra domain discriminator and easily make training unstable. To fill this gap, we propose the Collaborative Filtering with Attribution Alignment model (CFAA), a cross-domain recommendation framework for the RNCDR problem. CFAA includes two main modules, i.e., a rating prediction module and an embedding attribution alignment module. The former jointly mines reviews, one-hot IDs, and multi-hot historical ratings to generate expressive user and item embeddings. The latter includes vertical attribution alignment and horizontal attribution alignment, which reduce the discrepancy from multiple perspectives. Our empirical study on Douban and Amazon datasets demonstrates that CFAA significantly outperforms state-of-the-art models under the RNCDR setting.
Differential Private Knowledge Transfer for Privacy-Preserving Cross-Domain Recommendation
Cross Domain Recommendation (CDR) has been widely studied to alleviate the cold-start and data sparsity problems that commonly exist in recommender systems. CDR models can improve the recommendation performance of a target domain by leveraging the data of other source domains. However, most existing CDR models assume information can directly 'transfer across the bridge', ignoring privacy issues. To address the privacy concern in CDR, in this paper we propose a novel two-stage privacy-preserving CDR framework (PriCDR). In the first stage, we propose two methods, i.e., Johnson-Lindenstrauss Transform (JLT) based and Sparse-aware JLT (SJLT) based, to publish the rating matrix of the source domain using differential privacy. We theoretically analyze the privacy and utility of our proposed differential privacy based rating publishing methods. In the second stage, we propose a novel heterogeneous CDR model (HeteroCDR), which uses a deep auto-encoder and a deep neural network to model the published source rating matrix and the target rating matrix, respectively. In this way, PriCDR can not only protect the data privacy of the source domain, but also alleviate the data sparsity of the source domain. We conduct experiments on two benchmark datasets and the results demonstrate the effectiveness of our proposed PriCDR and HeteroCDR.
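As a rough illustration of the first stage only, the sketch below publishes a rating matrix through a Johnson-Lindenstrauss style random projection plus Gaussian noise; the projection width and noise scale are placeholders, not the calibrated parameters that give PriCDR its formal differential-privacy guarantee.

```python
import numpy as np

def jl_private_publish(ratings, k=64, noise_scale=1.0, seed=0):
    """Publish a (users x items) rating matrix via a JL-style random
    projection of the item dimension, plus Gaussian noise.  Sketch only; the
    paper derives the width and noise needed for differential privacy."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(n_items, k))  # projection
    projected = ratings @ R                                    # (users x k)
    return projected + rng.normal(0.0, noise_scale, size=projected.shape)

# hypothetical sparse source-domain ratings
rng = np.random.default_rng(42)
ratings = (rng.random((100, 500)) < 0.02) * rng.integers(1, 6, size=(100, 500))
published = jl_private_publish(ratings.astype(float), k=32)
print(published.shape)   # the target domain trains on this sanitized view
```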
KoMen: Domain Knowledge Guided Interaction Recommendation for Emerging Scenarios
User-User interaction recommendation, or interaction recommendation, is an indispensable service in social platforms, where the system automatically predicts with whom a user wants to interact. In real-world social platforms, we observe that user interactions may occur in diverse scenarios, and new scenarios constantly emerge, such as new games or sales promotions. We observe two challenges in these emerging scenarios: (1) The behavior of users on the emerging scenarios could be different from existing ones due to the diversity among scenarios; (2) Emerging scenarios may only have scarce user behavioral data for model learning. These two challenges raise a dilemma: scenario diversity calls for scenario-specific models, while data scarcity in emerging scenarios encourages the models to properly share as much information as possible. To achieve a prominent trade-off, we present KoMen: a Domain Knowledge Guided Meta-learning framework for Interaction Recommendation. KoMen first learns a set of global model parameters shared among all scenarios and then quickly adapts the parameters for an emerging scenario based on its similarities with the existing ones. There are two highlights of KoMen: (1) KoMen customizes global model parameters by incorporating domain knowledge of the scenarios, which captures scenario inter-dependencies with very limited training. An example of domain knowledge is a taxonomy that organizes scenarios by their purpose and function. (2) KoMen learns the scenario-specific parameters through a mixture-of-expert architecture, which reduces model variance resulting from data scarcity while still achieving the expressiveness to handle diverse scenarios. Extensive experiments demonstrate that KoMen achieves state-of-the-art performance on a public benchmark dataset and a large-scale real industry dataset. Remarkably, KoMen improves over the best baseline w.r.t. weighted ROC-AUC by 2.14 and 2.03 on the two datasets, respectively. The code will be released to GitHub upon acceptance.
OffDQ: An Offline Deep Learning Framework for QoS Prediction
With the increasing prevalence of web services over the Internet, developing a robust Quality of Service (QoS) prediction algorithm for recommending services in real time is becoming a challenge. Designing an efficient QoS prediction algorithm that achieves high accuracy while supporting fast prediction, so that it can be integrated into a real-time system, is one of the primary focuses in the domain of Services Computing. The major state-of-the-art QoS prediction methods do not yet efficiently meet both criteria simultaneously, possibly due to a lack of analysis of the challenges involved in designing the prediction algorithm. In this paper, we systematically analyze the various challenges associated with QoS prediction, propose solution strategies to overcome them, and thereby propose a novel offline framework using deep neural architectures for QoS prediction to achieve our goals. On the one hand, our framework handles the sparsity of the dataset, captures the non-linear relationships in the data, and exploits the correlation between users and services to achieve the desired prediction accuracy. On the other hand, our framework, being an offline prediction strategy, enables faster responsiveness. We performed extensive experiments on the publicly available WS-DREAM-1 dataset to show the trade-off between prediction performance and prediction time. Furthermore, we observed that our framework significantly improves one of the criteria (prediction accuracy or responsiveness) without considerably compromising the other, compared to state-of-the-art methods.
QLUE: A Computer Vision Tool for Uniform Qualitative Evaluation of Web Pages
The increasing complexity of modern web pages has attracted a number of solutions that offer optimized versions of these pages that are lighter to process and faster to load. These solutions have been quantitatively evaluated, showing significant speed-ups in page load times and/or considerable savings in bandwidth and memory consumption. However, while these solutions often produce optimized versions of existing pages, they rarely evaluate the impact of their proposed optimizations on the original content and functionality. Additionally, due to the lack of a unified metric to evaluate the similarity of the pages generated by these solutions in comparison to the original pages, it is not possible to fairly compare results obtained from different user study campaigns unless recruiting the exact same users, which is extremely challenging. In this paper, we demonstrate the lack of qualitative evaluation metrics and propose QLUE (QuaLitative Uniform Evaluation), a tool that uses computer vision to automate the qualitative evaluation of web pages generated by web complexity solutions with respect to their original versions. QLUE evaluates the content and the functionality of these pages separately using two metrics: QLUE's Structural Similarity, to assess the former, and QLUE's Functional Similarity, to assess the latter, a task that proves challenging for humans given the complex functional dependencies in modern pages. Our results show that QLUE computes content and functional scores comparable to those provided by humans. Specifically, for 90% of 100 pages, the human evaluators gave content similarity scores between 90% and 100%, while QLUE shows the same range of similarity scores for more than 75% of the pages. QLUE's time complexity results show that it is capable of generating the scores in a matter of minutes.
Who Has the Last Word? Understanding How to Sample Online Discussions (journal paper)
In online debates, as in offline ones, individual utterances or arguments support or attack each other, leading to some subset of arguments (potentially from different sides of the debate) being considered more relevant than others. However, online conversations are much larger in scale than offline ones, with often hundreds of thousands of users weighing in, collaboratively forming large trees of comments by starting from an original post and replying to each other. In large discussions, readers are often forced to sample a subset of the arguments being put forth. Since such sampling is rarely done in a principled manner, users may not read all the relevant arguments needed to get a full picture of the debate from a sample. This article addresses the question of how users should sample online conversations to selectively favour the currently justified or accepted positions in the debate. We apply techniques from argumentation theory and complex networks to build a model that predicts the probability of sampling the currently justified or accepted arguments.
Calibrated Click-Through Auctions
We analyze the optimal information design in a click-through auction with fixed valuations per click, but stochastic click-through rates. While the auctioneer takes as given the auction rule of the click-through auction, namely the generalized second-price auction, the auctioneer can design the information flow regarding the click-through rates among the bidders. A natural requirement in this context is to ask for the information structure to be calibrated in the learning sense. With this constraint, the auction needs to rank the ads by a product of the bid and an unbiased estimator of the click-through rates, and the task of designing an optimal information structure is thus reduced to the task of designing an optimal unbiased estimator.
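For readers unfamiliar with the underlying mechanism, the sketch below shows the generalized second-price auction the paper takes as given, ranking ads by bid times estimated click-through rate; the bids and CTR estimates are hypothetical, and designing the calibrated, unbiased estimator is the paper's actual subject and is not shown.

```python
def gsp_allocate(bids, ctr_estimates):
    """Rank ads by bid x estimated CTR and charge generalized second prices.

    bids, ctr_estimates: dicts keyed by advertiser id (per-click values).
    """
    scored = sorted(bids, key=lambda a: bids[a] * ctr_estimates[a], reverse=True)
    results = []
    for pos, a in enumerate(scored):
        if pos + 1 < len(scored):
            b = scored[pos + 1]
            # per-click price: lowest bid that keeps `a` ranked above `b`
            price = bids[b] * ctr_estimates[b] / ctr_estimates[a]
        else:
            price = 0.0                      # last slot pays the reserve (0 here)
        results.append((a, pos, round(price, 4)))
    return results

bids = {"ad1": 2.0, "ad2": 1.5, "ad3": 1.0}      # hypothetical bids per click
ctrs = {"ad1": 0.05, "ad2": 0.09, "ad3": 0.04}   # unbiased CTR estimates
print(gsp_allocate(bids, ctrs))
```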
Auction Design in an Auto-bidding Setting: Randomization Improves Efficiency beyond VCG
Autobidding is an area of increasing importance in the domain of online advertising. We study the problem of designing auctions in an autobidding setting with the goal of maximizing welfare at system equilibrium. Previous results showed that the price of anarchy (PoA) under VCG is 2 and also that this is tight even with two bidders. This raises an interesting question as to whether VCG yields the best efficiency in this setting, or whether the PoA can be improved upon. We present a prior-free randomized auction in which the PoA is approx. 1.91 for the case of two bidders, proving that one can achieve an efficiency strictly better than that under VCG in this setting. We also provide a stark impossibility result for the problem in general as the number of bidders increases -- we show that no (randomized) auction can have a PoA strictly better than 2 asymptotically as the number of bidders per query (the degree) grows. While it was shown in previous work that one can improve on the PoA of 2 if the auction is allowed to use the bidder's values for the queries in addition to the bidder's bids, we note that our randomized auction does not use such additional information; our impossibility result also holds for auctions without additional value information.
Equilibria in Auctions with Ad Types
This paper studies the equilibrium quality of semi-separable position auctions (known as the Ad Types setting) with greedy or optimal allocation combined with generalized second-price (GSP) or Vickrey-Clarke-Groves (VCG) pricing. We make three contributions: first, we give upper and lower bounds on the Price of Anarchy (PoA) for auctions which use greedy allocation with GSP pricing, greedy allocation with VCG pricing, and optimal allocation with GSP pricing. Second, we give Bayes-Nash equilibrium characterizations for two-player, two-slot instances (for all auction formats) and show that there exists both a revenue hierarchy and revenue equivalence across some formats. Finally, we use no-regret learning algorithms and bidding data from a large online advertising platform to evaluate the performance of the mechanisms under semi-realistic conditions. We find that the VCG mechanism tends to obtain revenue and welfare comparable to or better than that of the other mechanisms. We also find that in practice, each of the mechanisms obtains significantly better welfare than our worst-case bounds might suggest.
On Designing a Two-stage Auction for Online Advertising
For the scalability of industrial online advertising systems, a two-stage auction architecture is widely used to enable efficient ad allocation over a large ad corpus within a limited response time. Currently deployed two-stage ad auctions usually retrieve an ad subset using a coarse ad quality metric in a pre-auction stage, and then determine the auction outcome using a refined metric in the subsequent stage. However, this simple and greedy solution suffers from serious performance degradation, as it treats the decision in each stage separately, leading to an improper ad selection metric for the pre-auction stage.
Price Manipulability in First-Price Auctions
First-price auctions have many desirable properties, including uniquely possessing some, like credibility. However, first-price auctions are also inherently non-truthful, and non-truthfulness may result in instability and inefficiencies. Given these pros and cons, we seek to quantify the extent to which first-price auctions are susceptible to manipulation.
Semi-Siamese Bi-encoder Neural Ranking Model Using Lightweight Fine-Tuning
A BERT-based Neural Ranking Model (NRM) can be either a cross-encoder or a bi-encoder. Between the two, the bi-encoder is highly efficient because all the documents can be pre-processed before the actual query time. Although query and document are independently encoded, the existing bi-encoder NRMs are Siamese models where a single language model is used to consistently encode both query and document. In this work, we show two approaches for improving the performance of BERT-based bi-encoders. The first approach is to replace the full fine-tuning step with lightweight fine-tuning. We examine lightweight fine-tuning methods that are adapter-based, prompt-based, and a hybrid of the two. The second approach is to develop semi-Siamese models where queries and documents are handled with a limited amount of difference. The limited difference is realized by learning two lightweight tuning modules, while the main BERT language model is kept common for both query and document. We provide extensive experimental results for monoBERT, TwinBERT, and ColBERT, where three performance metrics are evaluated over the Robust04, ClueWeb09b, and MS-MARCO datasets. The results confirm that both lightweight fine-tuning and semi-Siamese modeling are considerably helpful for improving BERT-based bi-encoders. In fact, lightweight fine-tuning is helpful for cross-encoders, too.
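A minimal sketch of the adapter-based, semi-Siamese idea as we read it from the abstract: one frozen shared encoder with separate small trainable adapters for the query and document sides. The module names, sizes, and the stand-in encoder are hypothetical, not the paper's configuration.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter for lightweight fine-tuning: only these few
    parameters are trained while the underlying encoder stays frozen."""
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.ReLU()

    def forward(self, h):
        return h + self.up(self.act(self.down(h)))   # residual connection

class SemiSiameseBiEncoder(nn.Module):
    """One shared frozen encoder, but separate adapters for queries and
    documents, so the two sides differ only by a small trainable module."""
    def __init__(self, encoder, hidden_size=768):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        self.query_adapter = Adapter(hidden_size)
        self.doc_adapter = Adapter(hidden_size)

    def score(self, q_inputs, d_inputs):
        q = self.query_adapter(self.encoder(q_inputs))
        d = self.doc_adapter(self.encoder(d_inputs))
        return (q * d).sum(dim=-1)                    # dot-product relevance

# stand-in encoder so the sketch runs without downloading BERT
toy_encoder = nn.Sequential(nn.Linear(32, 768), nn.Tanh())
model = SemiSiameseBiEncoder(toy_encoder)
q, d = torch.randn(4, 32), torch.randn(4, 32)
print(model.score(q, d).shape)   # torch.Size([4])
```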
StruBERT: Structure-aware BERT for Table Search and Matching
A large amount of information is stored in data tables. Users can search for data tables using a keyword-based query representing an information need. A table is composed primarily of data values that are organized in rows and columns providing implicit structural information. A table is usually accompanied by secondary information such as the caption, page title, etc., that form the textual information. Understanding the connection between the textual and structural information is an important yet neglected aspect in table retrieval as previous methods treat each source of information independently. In addition, users can search for data tables that are similar to an existing table, and this setting can be seen as a content-based table retrieval where the query and queried object are both data tables. In this paper, we propose StruBERT, a structure-aware BERT model that fuses the textual and structural information of a data table to produce context-aware representations for both textual and tabular content of a data table. StruBERT features are integrated in a new end-to-end neural ranking model to solve three table-related downstream tasks: keyword- and content-based table retrieval, and table similarity. We evaluate our approach using three datasets, and we demonstrate substantial improvements in terms of retrieval and classification metrics over state-of-the-art methods.
Learning Neural Ranking Models Online from Implicit User Feedback
Existing online learning to rank (OL2R) solutions are limited to linear models, which are unable to capture possible non-linear relations between queries and documents. In this work, to unleash the power of representation learning in OL2R, we propose to directly learn a neural ranking model from users' implicit feedback (e.g., clicks) collected on the fly. We focus on RankNet and LambdaRank, due to their great empirical success and wide adoption in offline settings.
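The RankNet objective mentioned above reduces to a pairwise logistic loss, which can be driven directly by click feedback (clicked documents should outscore documents shown above them but skipped). The network, features, and click pattern below are hypothetical, and the sketch omits the exploration and click-noise handling that an actual OL2R algorithm needs.

```python
import torch
import torch.nn as nn

class NeuralRanker(nn.Module):
    """Small scoring network over query-document feature vectors."""
    def __init__(self, n_features=16, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

def ranknet_click_loss(scores, clicked_idx, skipped_idx):
    """RankNet pairwise loss: -log sigmoid(score_clicked - score_skipped)."""
    diff = scores[clicked_idx].unsqueeze(1) - scores[skipped_idx].unsqueeze(0)
    return torch.nn.functional.softplus(-diff).mean()

ranker = NeuralRanker()
opt = torch.optim.Adam(ranker.parameters(), lr=1e-3)

# hypothetical impression: 10 candidates, user clicked docs 2 and 7
feats = torch.randn(10, 16)
clicked = torch.tensor([2, 7])
skipped = torch.tensor([0, 1, 3])   # shown above a click but not clicked

loss = ranknet_click_loss(ranker(feats), clicked, skipped)
loss.backward()
opt.step()
print(float(loss))
```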
Preferences on a Budget: Prioritizing Document Pairs when Crowdsourcing Relevance Judgments
In Information Retrieval (IR) evaluation, preference judgments are collected by presenting to the assessors a pair of documents and asking them to select which of the two, if any, is the most relevant.
Towards a Better Understanding of Human Reading Comprehension with Brain Signals
Reading comprehension is a complex cognitive process involving many human brain activities. Many works have studied the patterns and attention allocation of reading comprehension in information retrieval related scenarios. However, little is known about what happens in the human brain during reading comprehension and how these cognitive activities can affect the information retrieval process. Additionally, with the advances in brain imaging techniques such as electroencephalogram (EEG), it is possible to collect brain signals in almost real time and explore whether they can be utilized as feedback to facilitate information acquisition performance.
Ontology-enhanced Prompt-tuning for Few-shot Learning
Few-shot Learning (FSL) aims to make predictions based on a limited number of samples. Structured data such as knowledge graphs and ontology libraries have been leveraged to benefit the few-shot setting in various tasks. However, the priors adopted by existing methods suffer from missing knowledge, knowledge noise, and knowledge heterogeneity, which hinder few-shot performance. In this study, we explore knowledge injection for FSL with pre-trained language models and propose ontology-enhanced prompt-tuning (OntoPrompt). Specifically, we develop an ontology transformation based on an external knowledge graph to address the missing knowledge issue, which completes and converts structured knowledge into text. We further introduce span-sensitive knowledge injection via a visible matrix to select informative knowledge and handle the knowledge noise issue. To bridge the gap between knowledge and text, we propose a collective training algorithm to optimize representations jointly. We evaluate our proposed OntoPrompt on three tasks, including relation extraction, event extraction, and knowledge graph completion, with eight datasets. Experimental results demonstrate that our approach obtains better few-shot performance than baselines.
Knowledge Graph Reasoning with Relational Digraph
Reasoning on a knowledge graph (KG) aims to infer new facts from existing ones. Methods based on relational paths in the literature have shown strong, interpretable, and transferable reasoning ability. However, paths are naturally limited in capturing complex topology in KGs. In this paper, we introduce a novel relational structure, the relational directed graph (r-digraph), which is composed of overlapping relational paths, to capture the KG's local evidence. Since r-digraphs are more complex than paths, efficiently constructing and learning from them is challenging: directly encoding the structures cannot scale well, and the query-dependent local evidence is hard to capture. Here, we propose a variant of graph neural network, RED-GNN, to address these challenges. Specifically, RED-GNN recursively encodes multiple r-digraphs with shared edges and selects the strongly correlated edges through query-dependent attention weights. We demonstrate that RED-GNN is not only efficient but also achieves significant performance gains over existing methods in both inductive and transductive reasoning tasks. Besides, the learned attention weights in RED-GNN can exhibit interpretable dependencies for KG reasoning.
Time-aware Entity Alignment using Temporal Relational Attention
Knowledge graph (KG) alignment matches entities across different KGs, which is important for knowledge fusion and integration. Temporal KGs (TKGs) extend traditional KGs by associating static triples with specific timestamps. While entity alignment (EA) between KGs has drawn increasing attention from the research community, EA between TKGs still remains unexplored. In this work, we propose a novel Temporal Relational Entity Alignment method (TREA) to learn alignment-oriented TKG embeddings. We first map entities, relations, and timestamps into a uniform embedding space. Furthermore, a temporal relational attention mechanism is utilized to capture relation and time information between nodes. Finally, entity alignments are obtained by computing the similarities of their multi-view vector representations, and a margin-based full multi-class log-loss is used for efficient training. Additionally, we construct three new real-world datasets from three large-scale temporal knowledge bases, i.e., ICEWS, Wikidata, and YAGO, as new references for evaluating temporal and non-temporal EA methods. Experimental results show that our method significantly outperforms state-of-the-art EA methods, thanks to the inclusion of time information.
Rethinking Graph Convolutional Networks in Knowledge Graph Completion
Graph convolutional networks (GCNs), which are effective in modeling graph structures, have become increasingly popular in knowledge graph completion (KGC). GCN-based KGC models first use GCNs to generate expressive entity representations and then use knowledge graph embedding (KGE) models to capture the interactions among entities and relations. However, many GCN-based KGC models fail to outperform state-of-the-art KGE models despite introducing additional computational complexity. This phenomenon motivates us to explore the real effect of GCNs in KGC. Therefore, in this paper, we build upon representative GCN-based KGC models and introduce variants to find which factor of GCNs is critical in KGC. Surprisingly, we observe from experiments that the graph structure modeling in GCNs does not have a significant impact on the performance of KGC models, which is in contrast to common belief. Instead, the transformations for entity representations are responsible for the performance improvements. Based on this observation, we propose a simple yet effective framework named LTE-KGE, which equips existing KGE models with linearly transformed entity embeddings. Experiments demonstrate that LTE-KGE models lead to performance improvements similar to GCN-based KGC methods, while being more computationally efficient. These results suggest that existing GCNs are unnecessary for KGC, and that novel GCN-based KGC models should rely on more ablation studies to validate their effectiveness.
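The LTE idea, as we read it from the abstract, amounts to passing entity embeddings through a learned linear map before the usual KGE score, with no GCN encoder. A minimal sketch with DistMult as the (placeholder) scoring function and hypothetical dimensions:

```python
import torch
import torch.nn as nn

class LTEDistMult(nn.Module):
    """DistMult scoring with Linearly Transformed Entity embeddings: entity
    vectors pass through a linear map instead of a GCN encoder."""
    def __init__(self, n_entities, n_relations, dim=200):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)
        self.rel = nn.Embedding(n_relations, dim)
        self.transform = nn.Linear(dim, dim, bias=False)   # the "LTE" part

    def score(self, heads, relations, tails):
        h = self.transform(self.ent(heads))
        t = self.transform(self.ent(tails))
        r = self.rel(relations)
        return (h * r * t).sum(dim=-1)                     # DistMult score

model = LTEDistMult(n_entities=1000, n_relations=50)
h = torch.randint(0, 1000, (8,))
r = torch.randint(0, 50, (8,))
t = torch.randint(0, 1000, (8,))
print(model.score(h, r, t).shape)   # torch.Size([8])
```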
Swift and Sure: Hardness-aware Contrastive Learning for Low-dimensional Knowledge Graph Embeddings
Knowledge graph embedding (KGE) has drawn great attention due to its potential in automatic knowledge graph (KG) completion and knowledge-driven tasks. However, recent KGE models suffer from high training costs and large storage requirements, limiting their practicality in real-world applications. To address this challenge, based on the latest findings in the field of Contrastive Learning, we propose a novel KGE training framework called Hardness-aware Low-dimensional Embedding (HaLE). Instead of traditional Negative Sampling, we design a new loss function based on query sampling that balances two important training targets, Alignment and Uniformity. Furthermore, we analyze the hardness-aware ability of recent low-dimensional hyperbolic models and propose a lightweight hardness-aware activation mechanism, which helps KGE models focus on hard instances and speeds up convergence. The experimental results show that, within a limited training time, HaLE can effectively improve the performance and training speed of KGE models on five commonly used datasets. The HaLE-trained models can obtain high prediction accuracy after only a few minutes of training and are competitive with state-of-the-art models in both low- and high-dimensional settings.
Divide-and-Conquer: Post-User Interaction Network for Fake News Detection on Social Media
Fake News detection has attracted much attention in recent years. Social context based detection methods attempt to utilize the collective wisdom from users on social media, trying to model the spreading patterns of fake news.
H2-FDetector: A GNN-based Fraud Detector with Homophilic and Heterophilic Connections
In a fraud graph, fraudsters often interact with a large number of benign entities to hide themselves. Hence, there are not only homophilic connections formed by nodes with the same label (similar nodes), but also heterophilic connections formed by nodes with different labels (dissimilar nodes). However, existing GNN-based fraud detection methods simply regard the fraud graph as homophilic and use a low-pass filter to retain the commonality of node features among neighbors, which inevitably ignores the differences among neighbors linked by heterophilic connections. To address this problem, we propose a Graph Neural Network-based Fraud Detector with Homophilic and Heterophilic Interactions (H2-FDetector for short). Firstly, we identify the homophilic and heterophilic connections with the supervision of labeled nodes. Next, we design a new information aggregation strategy that makes homophilic connections propagate similar information and heterophilic connections propagate difference information. Finally, a prototype prior is introduced to guide the identification of fraudsters. Extensive experiments on two real public benchmark fraud detection tasks demonstrate that our method clearly outperforms state-of-the-art baselines.
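One simple, purely illustrative way to realize "propagate similar information along homophilic connections and difference information along heterophilic ones" is a signed aggregation over predicted edge signs; the mean aggregation and the ±1 signs below are assumptions, not the authors' exact operator.

```python
import numpy as np

def signed_aggregate(x, edges, signs):
    """x: (N, d) node features; edges: list of (u, v) pairs;
    signs: +1 for (predicted) homophilic edges, -1 for heterophilic ones.

    Homophilic neighbours contribute their features; heterophilic
    neighbours contribute the *difference* to the centre node."""
    out = x.copy()                        # self contribution
    deg = np.ones(len(x))
    for (u, v), s in zip(edges, signs):
        msg = x[v] if s > 0 else (x[u] - x[v])
        out[u] += msg
        deg[u] += 1
    return out / deg[:, None]             # mean over self + messages
```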
AUC-oriented Graph Neural Network for Fraud Detection
Though Graph Neural Networks (GNNs) have recently achieved success on fraud detection tasks, they still suffer from the imbalanced label distribution over the limited fraud users and massive benign users. This paper attempts to resolve this label-imbalance problem for GNNs by maximizing the AUC (Area Under ROC Curve) metric, since it is unbiased with respect to the label distribution. However, maximizing AUC on GNNs for fraud detection tasks is intractable due to the potentially polluted topological structure caused by intentional noisy edges generated by fraudsters. To alleviate this problem, we propose to decouple the AUC maximization process on GNNs into classifier parameter search and edge pruning policy search, respectively. We propose a model named AO-GNN (short for AUC-oriented GNN) to achieve AUC maximization on GNNs under the aforementioned framework. In the proposed model, an AUC-oriented stochastic gradient is applied for classifier parameter search, and an AUC-oriented reinforcement learning module supervised by a surrogate reward of AUC is devised for edge pruning policy search. Experiments on three real-world datasets demonstrate that the proposed AO-GNN clearly outperforms state-of-the-art baselines not only in AUC but also in other general metrics, e.g., F1-macro and G-means.
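A common surrogate for the AUC objective mentioned above is a pairwise hinge-style loss over fraud/benign score pairs; the sketch below illustrates that generic surrogate, not AO-GNN's specific stochastic formulation or its reinforcement-learning pruning module.

```python
import torch

def auc_surrogate_loss(scores, labels, margin=1.0):
    """Pairwise squared-hinge surrogate of 1 - AUC.

    Encourages every fraud (label 1) score to exceed every benign
    (label 0) score by at least `margin`; it is insensitive to the
    class ratio because it only looks at positive-negative pairs."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos.unsqueeze(1) - neg.unsqueeze(0)        # all pos-neg pairs
    return torch.clamp(margin - diff, min=0).pow(2).mean()
```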
Prohibited Item Detection via Risk Graph Structure Learning
Prohibited item detection is an important problem in e-commerce, where the goal is to detect illegal items sold online in order to avert risks and stem crimes. Traditional solutions usually mine evidence from single instances, while current efforts try to employ advanced Graph Neural Networks (GNNs) to utilize multiple risk-relevant structures of items. However, two essential challenges remain: (1) the structures are noisy and incomplete besides being heterogeneous (i.e., weak structure), and (2) simple binary labels cannot express the variety of items belonging to different risk subcategories (i.e., weak supervision). To handle these challenges, we propose the Risk Graph Structure Learning model (RGSL) for prohibited item detection. Specifically, to overcome the weak structure, RGSL first introduces structure learning into large-scale heterogeneous risk graphs, which reduces noisy connections and adds similar pairs. Then, to overcome the weak supervision, RGSL transforms the detection process into a metric learning task between candidates and their similar prohibited items, and proposes a pairwise training mechanism. Furthermore, RGSL generates risk-aware item representations and searches risk-relevant pairs for structure learning iteratively. We test RGSL on three real-world risk datasets, and the improvements over representative baselines reach up to 21.91% in AP and 18.28% in MAX-F1. Meanwhile, RGSL has been deployed on a real-world e-commerce platform, and the average improvements over traditional industrial solutions in the most recent week reach up to 23.59% in ACC@1000 and 6.52% in ACC@10000.
A Viral Marketing-Based Model For Opinion Dynamics in Online Social Networks
Online social networks provide a medium for citizens to form opinions on different societal issues, and a forum for public discussion. They also expose users to viral content, such as breaking news articles. In this paper, we study the interplay between these two aspects: opinion formation and information cascades in online social networks. We present a new model that allows us to quantify how users change their opinion as they are exposed to viral content. Our model can be viewed as a combination of the popular Friedkin-Johnsen model for opinion dynamics and the independent-cascade model for information propagation. We present algorithms for efficiently simulating our model, and we provide approximation algorithms for optimizing certain network indices, such as the sum of user opinions or the disagreement-controversy index; our approach can be used to obtain insights into how much viral content can increase these measures in online social networks. Finally, we
Meta-Learning Helps Personalized Product Search
Personalized product search, which provides users with customized search services, is an important task for e-commerce platforms and has attracted a lot of research attention. However, this task remains a challenge when inferring users' preferences from few records or even no records, which is also known as the few-shot or zero-shot learning problem. In this work, we focus on this problem and propose a Bayesian Online Meta-Learning Model (BOML), which transfers meta-knowledge, from the inference of other users' preferences, to help infer the current user's interest behind her/his few or even no historical records. To extract meta-knowledge from various inference patterns, our model constructs a mixture of meta-knowledge and transfers the corresponding meta-knowledge to the specific user according to her/his records. Based on the meta-knowledge learned from other similar inferences, our proposed model searches for a ranked list of products to meet users' personalized query intents for those with few search records (i.e., the few-shot learning problem) or even no search records (i.e., the zero-shot learning problem). Under the setting where records arrive sequentially, we propose an online variational inference algorithm to update the meta-knowledge over time. Experimental results demonstrate that our proposed BOML outperforms state-of-the-art algorithms for product search and improves performance in the few-shot and zero-shot scenarios.
LBCF: A Large-Scale Budget-Constrained Causal Forest Algorithm
Offering incentives (e.g., coupons at Amazon, discounts at Uber and video bonuses at TikTok) to users is a common strategy used by online platforms to increase user engagement and platform revenue.
MINDSim: User Simulator for News Recommenders
Recommender systems are playing an increasingly important role in online news platforms nowadays. Recently, there has been a growing demand for applying reinforcement learning (RL) algorithms to news recommendation, aiming to maximize long-term and/or non-differentiable objectives. However, without an interactive simulated environment, it is extremely costly to develop powerful RL agents for news recommendation. In this paper, we build a user simulator, namely MINDSim, for news recommendation.
MetaBalance: Improving Multi-Task Recommendations via Adapting Gradient Magnitudes of Auxiliary Tasks
In many personalized recommendation scenarios, the generalization ability of a target task can be improved by learning with additional auxiliary tasks alongside this target task on a multi-task network. However, this method often suffers from a serious optimization imbalance problem. On the one hand, one or more auxiliary tasks might have a larger influence than the target task and even dominate the network weights, resulting in worse recommendation accuracy for the target task. On the other hand, the influence of one or more auxiliary tasks might be too weak to assist the target task. More challenging is that this imbalance dynamically changes throughout the training process and varies across the parts of the same network. We propose a new method, MetaBalance, to balance auxiliary losses by directly manipulating their gradients w.r.t. the shared parameters in the multi-task network. Specifically, in each training iteration and adaptively for each part of the network, the gradient of an auxiliary loss is carefully reduced or enlarged to have a magnitude closer to that of the gradient of the target loss, preventing auxiliary tasks from being so strong that they dominate the target task or too weak to help it. Moreover, the proximity between the gradient magnitudes can be flexibly adjusted to adapt MetaBalance to different scenarios. The experiments show that our proposed method achieves a significant improvement of 8.34% in terms of NDCG@10 over the strongest baseline on two real-world datasets. We release the code of the experiments here.
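The gradient-magnitude balancing described above can be sketched as follows; the relax factor and the simple (non-moving-average) norm ratio are simplifying assumptions, not MetaBalance's exact bookkeeping.

```python
import torch

def metabalance_step(target_loss, aux_losses, shared_params, relax=0.7):
    """Rescale each auxiliary gradient so its norm moves toward the
    target gradient's norm, then write the summed gradient back.

    relax in [0, 1]: 0 keeps the original auxiliary magnitude,
    1 matches the target magnitude exactly (assumed interpolation)."""
    g_tgt = torch.autograd.grad(target_loss, shared_params, retain_graph=True)
    balanced = [g.clone() for g in g_tgt]
    for aux in aux_losses:
        g_aux = torch.autograd.grad(aux, shared_params, retain_graph=True)
        for i, (gt, ga) in enumerate(zip(g_tgt, g_aux)):
            scale = (1 - relax) + relax * gt.norm() / (ga.norm() + 1e-12)
            balanced[i] = balanced[i] + scale * ga
    for p, g in zip(shared_params, balanced):
        p.grad = g            # an optimizer.step() can now be applied
```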
Contrastive Learning for Knowledge Tracing
Knowledge tracing is the task of understanding students' knowledge acquisition processes by estimating whether they will solve the next question correctly. Most deep learning-based methods tackle this problem by identifying hidden representations of knowledge states from learning histories. However, due to the sparse interactions between students and questions, the hidden representations can easily overfit and often fail to capture students' knowledge states accurately. This paper introduces a contrastive learning framework for knowledge tracing that reveals semantically similar or dissimilar examples of a learning history and stimulates the model to learn their relationships. To deal with the complexity of knowledge acquisition during learning, we carefully design the components of contrastive learning, such as architectures, data augmentation methods, and hard negatives, taking into account pedagogical rationales. Our extensive experiments on six benchmarks show statistically significant improvements over previous methods. Further analysis shows how our methods contribute to improving knowledge tracing performance.
PopNet: Real-Time Population-Level Disease Prediction with Data Latency
Population-level disease prediction estimates the number of potential patients of particular diseases in some location at a future time based on (frequently updated) historical disease statistics. Existing approaches often assume the existing disease statistics are reliable and will not change. However, in practice, data collection is often time-consuming and has time delays, with both historical and current disease statistics being updated continuously. In this work, we propose a real-time population-level disease prediction model which captures data latency (PopNet) and incorporates the updated data for improved predictions. To achieve this goal, PopNet models real-time data and updated data using two separate systems, each capturing spatial and temporal effects using hybrid graph attention networks and recurrent neural networks. PopNet then fuses the two systems using both spatial and temporal latency-aware attentions in an end-to-end manner. We evaluate PopNet on real-world disease datasets and show that PopNet consistently outperforms all baseline disease prediction and general spatial-temporal prediction models, achieving up to 47% lower root mean squared error and 24% lower mean absolute error compared with the best baselines.
BSODA: A Bipartite Scalable Framework for Online Disease Diagnosis
A growing number of people are seeking healthcare advice online. Usually, they diagnose their medical conditions based on the symptoms they are experiencing, which is also known as self-diagnosis. From the machine learning perspective, online disease diagnosis is a sequential feature (symptom) selection and classification problem. Reinforcement learning (RL) methods are the standard approaches to this type of tasks. Generally, they perform well when the feature space is small, but frequently become inefficient in tasks with a large number of features, such as the self-diagnosis. To address the challenge, we propose a non-RL Bipartite Scalable framework for Online Disease diAgnosis, called BSODA. BSODA is composed of two cooperative branches that handle symptom-inquiry and disease-diagnosis, respectively. The inquiry branch determines which symptom to collect next by an information-theoretic reward. We employ a Product-of-Experts encoder to significantly improve the handling of partial observations of a large number of features. Besides, we propose several approximation methods to substantially reduce the computational cost of the reward to a level that is acceptable for online services. Additionally, we leverage the diagnosis model to estimate the reward more precisely. For the diagnosis branch, we use a knowledge-guided self-attention model to perform predictions. In particular, BSODA determines when to stop inquiry and output predictions using both the inquiry and diagnosis models. We demonstrate that BSODA outperforms the state-of-the-art methods on several public datasets. Moreover, we propose a novel evaluation method to test the transferability of symptom checking methods from synthetic to real-world tasks. Compared to existing RL baselines, BSODA is more effectively scalable to large search spaces.
Identification of Disease or Symptom terms in Reddit to Improve Health Mention Classification
In user-generated text such as on social media platforms and online forums, people often use disease or symptom terms in ways other than to describe their health. In data-driven public health surveillance, the health mention classification (HMC) task aims to identify posts where users are discussing health conditions rather than using disease and symptom terms for other reasons. Existing computational research typically studies health mentions only on Twitter, covers a limited set of disease or symptom terms, and ignores user behavior information as well as the other ways people use disease or symptom terms. To advance HMC research, we present the Reddit health mention dataset (RHMD), a new dataset of multi-domain Reddit data for HMC. RHMD consists of 10,015 manually labeled Reddit posts that mention 15 common disease or symptom terms and are annotated with four labels: personal health mentions, non-personal health mentions, figurative health mentions, and hyperbolic health mentions. With RHMD, we propose HMCNET, which hierarchically combines target keyword (disease or symptom term) identification and user behavior to improve HMC. Experimental results demonstrate that the proposed approach outperforms state-of-the-art methods with an F1-Score of 0.75 (an increase of 11% over the state of the art) and show that our new dataset poses a strong challenge to existing HMC methods in terms of their ability to capture other usages of disease or symptom terms. We conclude by discussing the empirical and ethical considerations of our study.
Early Identification of Depression Severity Levels on Reddit Using Ordinal Classification
User-generated text on social media is a promising avenue for public health surveillance and has been actively explored for its feasibility in the early identification of depression. Existing methods for identifying depression have shown promising results; however, these methods all treat the identification as a binary classification problem and disregard the inherent ordinal nature across fine-grained severity levels. To date, there has been little effort towards identifying users' depression severity levels. This paper aims at the early identification of depression severity levels from social media data. To accomplish this, we built a new dataset based on the inherent ordinal nature of depression severity levels, using clinical depression standards on Reddit posts. The posts were classified into 4 depression severity levels covering the clinical depression standards on social media. Accordingly, we reformulate the early identification of depression as an ordinal classification task over clinical depression standards such as Beck's Depression Inventory and the Depressive Disorder Annotation scheme to identify depression severity levels. With these, we propose a hierarchical attention method optimized to factor in the increasing depression severity levels through a soft probability distribution. We experimented with two datasets (a public dataset having more than one post from each user and our own dataset with a single post per user) of real-world Reddit posts that have been classified according to questionnaires built by clinical experts, and demonstrated that our method outperforms state-of-the-art models. Finally, we conclude by analyzing the minimum number of posts required to identify depression severity levels, followed by a discussion of empirical and practical considerations of our study.
Assessing the Causal Impact of COVID-19 Related Policies on Outbreak Dynamics: A Case Study in the US
Analyzing the causal impact of different policies in reducing the spread of COVID-19 is of critical importance. The main challenge here is the existence of unobserved confounders (e.g., vigilance of residents) which influence both the presence of policies and the spread of COVID-19. Besides, as the confounders may be time-varying, it is even more difficult to capture them. Fortunately, the increasing prevalence of web data from various online applications provides an important resource of time-varying observational data, and enhances the opportunity to capture the confounders from them, e.g., the vigilance of residents over time can be reflected by the popularity of Google searches about COVID-19 at different time periods. In this paper, we study the problem of assessing the causal effects of different COVID-19 related policies on the outbreak dynamics in different counties at any given time period. To this end, we integrate COVID-19 related observational data covering different U.S. counties over time, and then develop a neural network based causal effect estimation framework which learns the representations of time-varying (unobserved) confounders from the observational data. Experimental results indicate the effectiveness of our proposed framework in quantifying the causal impact of policies at different granularities, ranging from a category of policies with a certain goal to a specific policy type. Compared with baseline methods, our assessment of policies is more consistent with existing epidemiological studies of COVID-19. Besides, our assessment also provides insights for future policy-making.
OA-Mine: Open-World Attribute Mining for E-Commerce Products with Weak Supervision
Automatic extraction of product attributes from their textual descriptions is essential for the online shopper experience. One inherent challenge of this task is the emerging nature of e-commerce products: we constantly see new types of products with their own unique sets of new attributes. Most prior works on this matter mine new values for a set of known attributes but cannot handle new attributes that arise from constantly changing data. In this work, we study the attribute mining problem in an open-world setting to extract novel attributes and their values. Instead of providing comprehensive training data, the user only needs to provide a few examples for a few known attribute types as weak supervision. We propose a principled framework that first generates attribute value candidates and then groups them into clusters of attributes. The candidate generation step probes a pre-trained language model to extract phrases from product titles. Then, an attribute-aware fine-tuning method optimizes a multitask objective and shapes the language model representation to be attribute-discriminative. Finally, we discover new attributes and values through the self-ensemble of our framework, which handles the open-world challenge. We run extensive experiments on a large distantly annotated development set and a gold standard human-annotated test set that we collected. Our model significantly outperforms strong baselines and can generalize to unseen attributes and product types.
A Deep Markov Model for Clickstream Analytics in Online Shopping
Machine learning is widely used in e-commerce to analyze clickstream sessions and then to allocate marketing resources. Traditional neural learning can model long-term dependencies in clickstream data, yet it ignores the different shopping phases (i.e., goal-directed search vs. browsing) in user behavior as theorized by marketing research. In this paper, we develop a novel, theory-informed machine learning model to account for different shopping phases as defined in marketing theory. Specifically, we formalize a tailored attentive deep Markov model called ClickstreamDMM for predicting the risk of user exits without purchase in e-commerce web sessions. Our ClickstreamDMM combines (1) an attention network to learn long-term dependencies in clickstream data and (2) a latent variable model to capture different shopping phases (i.e., goal-directed search vs. browsing). Due to the interpretable structure, our ClickstreamDMM allows marketers to generate new insights on how shopping phases relate to actual purchase behavior. We evaluate our model using real-world clickstream data from a leading e-commerce platform consisting of 26,279 sessions with 250,287 page clicks. Thereby, we demonstrate that our model is effective in predicting user exits without purchase: improvement by 11.5% in AUROC and 12.7% in AUPRC. Overall, our model enables e-commerce platforms to detect users at the risk of exiting without purchase. Based on it, e-commerce platforms can then intervene with marketing resources to steer users toward purchasing.
Using Survival Models to Estimate User Engagement in Online Experiments
Online controlled experiments, in which different variants of a product are compared based on an Overall Evaluation Criterion (OEC), have emerged as a gold standard for decision-making for online services.
CycleNER: An Unsupervised Training Approach for Named Entity Recognition
Named entity recognition (NER) is a crucial natural language understanding task for many down-stream tasks such as question answering and retrieval. Despite significant progress in developing NER models for multiple languages and domains, scaling to emerging domains or low-resource languages still remains challenging, due to the costly nature of acquiring training data.
Geospatial Entity Resolution
A geospatial database is today at the core of an ever increasing number of services. Building and maintaining it remains challenging due to the need to merge information from multiple providers. Entity Resolution (ER) consists of finding entity mentions from different sources that refer to the same real world entity. In geospatial ER, entities are often represented using different schemes and are subject to incomplete information and inaccurate location, making ER and deduplication daunting tasks. While tremendous advances have been made in traditional entity resolution and natural language processing, geospatial data integration approaches still heavily rely on static similarity measures and human-designed rules. In order to achieve automatic linking of geospatial data, a unified representation of entities with heterogeneous attributes and their geographical context, is needed. To this end we propose Geo-ER, a joint framework that combines Transformer-based language models, that have been successfully applied in ER, with a novel learning-based architecture to represent the geospatial character of the entity. Different from existing solutions, Geo-ER does not rely on pre-defined rules and is able to capture information from surrounding entities in order to make context-based, accurate predictions. Extensive experiments on eight real world datasets demonstrate the effectiveness of our solution over state-of-the-art methods. Moreover, Geo-ER proves to be robust in settings where there is no available training data for a specific city.
A Never-Ending Project for Humanity Called "the Web"
In this paper we summarize the main historical steps in the making of the Web, its foundational principles, and its evolution. First, we mention some of the influences and streams of thought that interacted to bring the Web about. Then we recall that its birthplace, CERN, had a need for a global hypertext system and at the same time was the perfect microcosm to provide a cradle for the Web. We stress how this invention required striking a balance between integrating and departing from the existing and emerging paradigms of the day. We then review the pillars of the Web architecture and the features that made the Web so viral compared to its competitors. Finally, we survey the multiple mutations the Web underwent almost as soon as it was born, evolving in multiple directions. We conclude with the observation that the Web is now an architecture, an artefact, a science object, and a research and development object, and that we have not yet seen its full potential.
Invited Speaker George Metakides (EU)
Invited Speaker Jean-François Abramatic (INRIA)
Through the Lens of the Web Conference Series: A Look Into the History of the Web
During the last three decades, the Web has been growing in terms of the number of available resources, traffic, types of media, usages, and more. In parallel, with 30+ editions, the WebConf series (formerly WWW, soon to be ACM WebConf) has witnessed how academia has been dealing with the Web as an object of research. In this study, we focus on the small story within the great one of the Web. In particular, by analysing the accepted papers and the yearly events, we review how the conference has evolved across these decades and "driven" the evolution of the Web.
From Indymedia to Tahrir Square: The Revolutionary Origins of Status Updates on Twitter
One of the most important developments in the history of the Web was the development of the status update. Although social media has been approached by a number of critical theorists, from Fuchs to Zuboff, as an instrument of control and surveillance, it should be remembered that social media began as a liberatory extension of human social relationships into the Web. In this essay, we trace the origin of the "status update" both to social networking companies like SixApart and to its independent invention for spreading news by protest-driven community networks like Indymedia. In fact, the use of status updates on Indymedia by the marginal anti-globalization movement prefigured their usage in Tahrir Square and in the Black Lives Matter movement in the USA. The link goes through Twitter itself, as the early Twitter engineers were veterans of Indymedia as well as advocates of early IETF standards for status updates such as RSS. As Twitter becomes viewed as a threat to democracy itself, is there hope that Twitter's "Blue Sky" project to decentralize itself can return to open standards and the decentralized sharing of status updates?
Auctions between Regret-Minimizing Agents
We analyze a scenario in which software agents implemented as regret minimizing algorithms engage in a repeated auction on behalf of their users. We study first price and second price auctions, as well as their generalized versions (e.g., as those used for ad auctions). Using both theoretical analysis and simulations, we show that, surprisingly, in second price auctions the players have incentives to mis-report their true valuations to their own learning agents, while in the first price auction it is a dominant strategy for all players to truthfully report their valuations to their agents.
Nash Convergence of Mean-Based Learning Algorithms in First Price Auctions
The outcome of learning dynamics in advertising auctions is an important and fundamental question for online markets on the web. This work focuses on repeated first price auctions where bidders with fixed values learn to bid using mean-based algorithms, a large class of learning algorithms that includes popular no-regret algorithms such as Multiplicative Weights Update and Follow the Perturbed Leader.
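A minimal example of a mean-based learner in this setting is Multiplicative Weights Update over a discretized bid grid; the full-information feedback and the bid grid below are simplifying assumptions made for illustration.

```python
import numpy as np

def mwu_first_price(value, rival_bids, eta=0.1, grid=21):
    """Full-information Multiplicative Weights Update over a discrete
    bid grid for a bidder with a fixed value in repeated first-price
    auctions (a mean-based learner in the abstract's sense).

    rival_bids: the highest competing bid observed in each round."""
    bids = np.linspace(0.0, value, grid)   # candidate bids
    w = np.ones(grid)                      # one weight per candidate bid
    for b_rival in rival_bids:
        # counterfactual utility of each candidate bid in this round:
        # win (value - bid) if the bid beats the rival, else 0
        u = np.where(bids > b_rival, value - bids, 0.0)
        w *= np.exp(eta * u)               # multiplicative update
    return bids[np.argmax(w)]              # currently most-favoured bid
```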
Multi-Granularity Residual Learning with Confidence Estimation for Time Series Prediction
Time-series prediction is of high practical value in a wide range of applications such as econometrics and meteorology, where the data are commonly formed by temporal patterns. Most prior works ignore the diversity of dynamic pattern frequency, i.e., different granularities, suffering from insufficient information exploitation. Thus, multi-granularity learning is still under-explored for time-series prediction.
Alleviating Cold-start Problem in CTR Prediction with A Variational Embedding Learning Framework
We propose a Variational Embedding Learning Framework (VELF) for alleviating the cold-start problem in CTR prediction. Based on Bayesian inference, we replace point estimates with distribution estimates in user and ad embedding learning, which shares statistical strength among users and ads. A variational inference technique is adopted to estimate the distributions because of two advantages. Firstly, it is computationally tractable. Secondly, thanks to variational inference, our proposed probabilistic embedding framework VELF enjoys an end-to-end training process. To further enhance reliable embedding learning for cold-start users and ads, we design regularized, parameterized priors to facilitate knowledge sharing among users and ads. Extensive empirical tests on benchmark datasets demonstrate the advantages of the proposed VELF. Extended experiments show that our regularized, parameterized priors provide more generalization capability than traditional fixed priors.
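Replacing a point-estimate embedding with a distribution estimate plus a learnable, parameterized prior can be sketched as below; the Gaussian form, the single shared prior and the layer names are assumptions for illustration rather than VELF's exact parameterization.

```python
import torch
import torch.nn as nn

class VariationalEmbedding(nn.Module):
    """Each id (user or ad) gets a Gaussian posterior N(mu, sigma^2)
    instead of a point vector, plus a shared learnable prior that pools
    statistical strength across cold-start ids."""
    def __init__(self, n_ids, dim=16):
        super().__init__()
        self.mu = nn.Embedding(n_ids, dim)
        self.logvar = nn.Embedding(n_ids, dim)
        self.prior_mu = nn.Parameter(torch.zeros(dim))      # learnable prior
        self.prior_logvar = nn.Parameter(torch.zeros(dim))

    def forward(self, ids):
        mu, logvar = self.mu(ids), self.logvar(ids)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        # KL(q(z|id) || p(z)) against the shared, parameterized prior
        kl = 0.5 * (self.prior_logvar - logvar
                    + (logvar.exp() + (mu - self.prior_mu) ** 2)
                    / self.prior_logvar.exp() - 1).sum(-1)
        return z, kl.mean()
```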
The Parity Ray Regularizer for Pacing in Auction Markets
Budget-management systems are one of the key components of modern auction markets. Internet advertising platforms typically offer advertisers the possibility to pace the rate at which their budget is depleted, through budget-pacing mechanisms. We focus on multiplicative pacing mechanisms in an online setting in which a bidder is repeatedly confronted with a series of advertising opportunities. After collecting bids, each item is then allocated through a single-item, second-price auction. If there were no budgetary constraints, bidding truthfully would be an optimal choice for the advertiser. However, since their budget is limited, the advertiser may want to shade their bid downwards in order to preserve their budget for future opportunities, and to spread expenditures evenly over time. The literature on online pacing problems mostly focuses on the setting in which the bidder optimizes an additive separable objective, such as the total click-through rate or the revenue of the allocation. In many settings, however, bidders may also care about other objectives which oftentimes are non-separable. We study the frequent case in which the utility of a bidder depends on the reward obtained from the items they are allocated, and on the distance of the realized distribution of impressions from a target distribution. We introduce a novel regularizer which can describe those distributional preferences according to some desired statistical distance measure. We show that this regularizer can be integrated into an existing online mirror descent scheme with minor modifications, attaining the optimal order of sub-linear regret compared to the optimal allocation in hindsight when inputs are drawn independently from an unknown distribution. Moreover, we show that our approach can easily be incorporated into standard existing pacing systems that are not usually built for this objective. The effectiveness of our algorithm in internet advertising applications is confirmed by numerical experiments on real-world data.
A Multi-task Learning Framework for Product Ranking with BERT
Product ranking is a crucial component for many e-commerce services. One of the major challenges in product search is the vocabulary mismatch between query and products, which may be a larger vocabulary gap problem compared to other information retrieval domains. While there is a growing collection of neural learning to match methods aimed specifically at overcoming this issue, they do not leverage the recent advances of large language models for product search. On the other hand, product ranking often deals with multiple types of engagement signals such as clicks, add to cart, and purchases, while most of the existing work is focused on optimizing one single metric such as click-through rate, which may suffer from data sparsity. In this work, we propose a novel end-to-end multi-task learning framework for product ranking with BERT to address the above challenges. The proposed model utilizes domain-specific BERT with fine-tuning to bridge the vocabulary gap and employs multi-task learning to optimize multiple objectives simultaneously, which yields a general end-to-end learning framework for product search. We conduct a set of comprehensive experiments on a real-world E-commerce dataset and demonstrate significant improvement of the proposed approach over the state-of-the-art baseline methods.
Modeling User Behavior with Graph Convolution for Personalized Product Search
User preference modeling is a vital yet challenging problem in personalized product search. In recent years, latent space based models have achieved state-of-the-art performance by jointly learning semantic representations of products, users, and text tokens. However, existing methods are limited in their ability to model user preferences. They typically represent users by the products they visited in a short span of time using attentive models and lack the ability to exploit relational information such as user-product interactions or item co-occurrence relations. In this work, we propose to address the limitations of prior arts by exploring local and global user behavior patterns on a user successive behavior graph, which is constructed by utilizing short-term actions of all users. To capture implicit user preference signals and collaborative patterns, we design an efficient jumping graph convolution to enrich product representations for user preference modeling. Our method can be used as a plug-and-play module and built upon existing latent space based product search models. Extensive experiments show our method consistently outperforms state-of-the-art methods on 8 Amazon benchmark datasets, demonstrating its high effectiveness.
IHGNN: Interactive Hypergraph Neural Network for Personalized Product Search
A good personalized product search (PPS) system should not only focus on retrieving relevant products, but also consider users' personalized preferences. Recent work on PPS mainly adopts the representation learning paradigm, e.g., learning representations for each entity (including user, product and query) from historical user behaviors (a.k.a. user-product-query interactions). However, we argue that existing methods do not sufficiently exploit the crucial collaborative signal, which is latent in historical interactions and reveals the affinity between the entities. The collaborative signal is quite helpful for generating high-quality representations; exploiting it would benefit the learning of one representation from other related nodes.
A Category-aware Multi-interest Model for Personalized Product Search
Product search has been an important way for people to find products on online shopping platforms. Existing approaches to personalized product search mainly embed user preferences into one single vector. However, this simple strategy easily results in sub-optimal representations, failing to model and disentangle a user's multiple preferences. To overcome this problem, we propose a category-aware multi-interest model that encodes users as multiple preference embeddings to represent user-specific interests. Specifically, we also capture a category indication for each preference, i.e., the distribution of categories it focuses on, which is derived from the rich relations between users, products, and attributes. Based on these category indications, we develop a category attention mechanism to aggregate the various preference embeddings with respect to the current query and items into the user's comprehensive representation. By this means, we can use this representation to calculate matching scores of retrieved items to determine whether they meet the user's search intent. Besides, we introduce a homogenization regularization term to avoid redundancy between user interests. Experimental results show that the proposed method significantly outperforms existing approaches.
Enhancing Knowledge Bases with Quantity Facts
Machine knowledge about the world's entities should include quantity properties: heights of buildings, running times of athletes, energy efficiency of car models, energy production of power plants, and more. State-of-the-art knowledge bases, such as Wikidata, cover many relevant entities but often miss the corresponding quantities. Prior work on extracting quantity facts from web contents focused on high precision for top-ranked outputs, but did not tackle the KB coverage issue. This paper presents a recall-oriented approach, which aims to close this gap in knowledge-base coverage. Our method is based on iterative learning for extracting quantity facts, with two novel contributions to boost recall for KB augmentation without sacrificing the quality standards of the knowledge base. The first contribution is a query expansion technique to capture a larger pool of fact candidates. The second contribution is a novel technique for harnessing observations on value distributions for self-consistency. Experiments with extractions from more than 13 million web documents demonstrate the benefits of our method.
Trustworthy Knowledge Graph Completion Based on Multi-sourced Noisy Data
Knowledge graphs (KGs) have become a valuable asset for many AI applications. Although some KGs contain plenty of facts, they are widely acknowledged as incomplete. To address this issue, many KG completion methods are proposed. Among them, open KG completion methods leverage the Web to find missing facts. However, noisy data collected from diverse sources may damage the completion accuracy. In this paper, we propose a new trustworthy method that exploits facts for a KG based on multi-sourced noisy data and existing facts in the KG. Specifically, we introduce a graph neural network with a holistic scoring function to judge the plausibility of facts with various value types. We design value alignment networks to resolve the heterogeneity between values and map them to entities even outside the KG. Furthermore, we present a truth inference model that incorporates data source qualities into the fact scoring function, and design a semi-supervised learning way to infer the truths from heterogeneous values. We conduct extensive experiments to compare our method with the state-of-the-arts. The results show that our method achieves superior accuracy not only in completing missing facts but also in discovering new facts.
Uncertainty-aware Pseudo Label Refinery for Entity Alignment
Entity alignment (EA), which aims to discover equivalent entities in knowledge graphs (KGs), bridges heterogeneous sources of information and facilitates the integration of knowledge. Recently, EA methods based on translational models have achieved impressive performance by utilizing graph structures or by adopting auxiliary information. However, existing entity alignment methods mainly rely on manually labeled entity alignment seeds, which limits their applicability in real scenarios. In this paper, a simple but effective Uncertainty-aware Pseudo Label Refinery (UPLR) framework is proposed that requires no manual labeling and is capable of learning high-quality entity embeddings from pseudo-labeled data sets containing noisy data. The proposed method relies on two key factors:
Federated SPARQL Query Processing over Heterogeneous Linked Data Fragments
Linked Data Fragments (LDFs) refer to Web interfaces that allow for accessing and querying knowledge graphs on the Web. These interfaces, such as SPARQL endpoints or Triple Pattern Fragment servers, differ in the SPARQL expressions they can evaluate and the metadata they provide. To evaluate queries over individual interfaces, client-side query processing approaches tailored to specific interfaces have been proposed. Moreover, federated query processing has focused on federations with a single type of LDF interface, typically SPARQL endpoints.
SelfKG: Self-Supervised Entity Alignment in Knowledge Graphs
Entity alignment, aiming to identify equivalent entities across different knowledge graphs (KGs), is a fundamental problem for constructing Web-scale KGs.
Compact Graph Structure Learning via Mutual Information Compression
Graph Structure Learning (GSL) has recently attracted considerable attention for its capacity to optimize the graph structure while simultaneously learning suitable parameters of Graph Neural Networks (GNNs). Current GSL methods mainly learn an optimal graph structure (final view) from single or multiple information sources (basic views); however, theoretical guidance on what the optimal graph structure is remains unexplored. In essence, an optimal graph structure should contain only the information about the task while compressing redundant noise as much as possible, which we define as the "minimal sufficient structure", so as to maintain accuracy and robustness. How can such a structure be obtained in a principled way? In this paper, we theoretically prove that if we optimize the basic views and the final view based on mutual information, and keep their performance on labels simultaneously, the final view will be a minimal sufficient structure. With this guidance, we propose a Compact GSL architecture based on MI compression, named CoGSL. Specifically, two basic views are extracted from the original graph as two inputs of the model, which are re-estimated and refined by a view estimator. Then, we propose an adaptive technique to fuse the estimated views into the final view. Furthermore, we maintain the performance of the estimated views and the final view while reducing the mutual information between every two views. To comprehensively evaluate the performance of CoGSL, we conduct extensive experiments on several datasets under clean and attacked conditions, which demonstrate the effectiveness and robustness of CoGSL.
Multimodal Continual Graph Learning with Neural Architecture Search
Continual graph learning is rapidly emerging as an important component in a variety of real-world applications such as online product recommendation systems and social media. While achieving great success, existing works on continual graph learning ignore the information from multiple modalities (e.g., visual and textual features) as well as the rich dynamic structural information hidden in the ever-changing graph data and evolving tasks. However, considering multimodal continual graph learning with evolving topological structures poses great challenges: i) it is unclear how to incorporate the multimodal information into continual graph learning, and ii) it is nontrivial to design models that can capture the structure-evolving dynamics in continual graph learning. To tackle these challenges, in this paper we propose a novel Multimodal Structure-evolving Continual Graph Learning (MSCGL) model, which continually learns both the model architectures and the corresponding parameters for multimodal Graph Neural Networks (GNNs). To be concrete, our proposed MSCGL model simultaneously takes social information and multimodal information into account to build the multimodal graphs. In order to continually adapt to new tasks without forgetting the old ones, our MSCGL model explores a new strategy with joint optimization of Neural Architecture Search (NAS) and Group Sparse Regularization (GSR) across different tasks. These two parts interact with each other reciprocally, where NAS is expected to explore more promising architectures and GSR is in charge of preserving important information from the previous tasks. We conduct extensive experiments over two real-world multimodal continual graph scenarios to demonstrate the superiority of the proposed MSCGL model. Empirical experiments indicate that both the architectures and weight sharing across different tasks play important roles in affecting the model performance.
Graph Sanitation with Application to Node Classification
The past decades have witnessed the prosperity of graph mining, with a multitude of sophisticated models and algorithms designed for various mining tasks, such as ranking, classification, clustering and anomaly detection. Generally speaking, the vast majority of the existing works aim to answer the following question, that is, given a graph, what is the best way to mine it?
Model-Agnostic Augmentation for Accurate Graph Classification
Given a graph dataset, how can we augment it for accurate graph classification? Graph augmentation is an essential strategy to improve the performance of graph-based tasks by enlarging the distribution of training data. However, previous works for graph augmentation either a) involve the target model in the process of augmentation, losing the generalizability to other tasks, or b) rely on simple heuristics that lead to unreliable results. In this work, we introduce five desired properties for effective augmentation. Then, we propose NodeSam (Node Split and Merge) and SubMix (Subgraph Mix), two model-agnostic approaches for graph augmentation that satisfy all desired properties with different motivations. NodeSam makes a balanced change of the graph structure to minimize the risk of semantic change, while SubMix mixes random subgraphs of multiple graphs to create rich soft labels combining the evidence for different classes. Our experiments on seven benchmark datasets show that NodeSam and SubMix consistently outperform existing approaches, achieving the highest accuracy in graph classification.
Towards Unsupervised Deep Graph Structure Learning
In recent years, graph neural networks (GNNs) have emerged as a successful tool in a variety of graph-related applications. However, the performance of GNNs can deteriorate when noisy connections occur in the original graph structures; besides, the dependence on explicit structures prevents GNNs from being applied to general unstructured scenarios. To address these issues, recently emerged deep graph structure learning (GSL) methods propose to jointly optimize the graph structure along with the GNN under the supervision of a node classification task. Nonetheless, these methods focus on a supervised learning scenario, which leads to several problems, i.e., the reliance on labels, the bias of edge distribution, and the limitation on application tasks. In this paper, we propose a more practical GSL paradigm, unsupervised graph structure learning, where the learned graph topology is optimized by the data itself without any external guidance (i.e., labels). To solve the unsupervised GSL problem, we propose a novel StrUcture Bootstrapping contrastive LearnIng fraMEwork (SUBLIME) with the aid of self-supervised contrastive learning. Specifically, we generate a learning target from the original data as an "anchor graph", and use a contrastive loss to maximize the agreement between the anchor graph and the learned graph. To provide persistent guidance, we design a novel bootstrapping mechanism that upgrades the anchor graph with learned structures during model learning. We also design a series of graph learners and post-processing schemes to model the structures to learn. Extensive experiments on eight benchmark datasets demonstrate the significant effectiveness of our proposed SUBLIME and the high quality of the optimized graphs.
Ready Player One! Eliciting Diverse Knowledge Using a Configurable Game
Access to commonsense knowledge is receiving renewed interest for developing neuro-symbolic AI systems or debugging deep learning models. Little is currently understood about the types of knowledge that can be gathered using existing knowledge elicitation methods. Moreover, these methods fall short of meeting the evolving requirements of several downstream AI tasks. To this end, collecting broad and tacit knowledge, in addition to negative or discriminative knowledge, can be highly useful. Addressing this research gap, we developed a novel game with a purpose, FindItOut, to elicit different types of knowledge from human players through easily configurable game mechanics. We recruited 125 players from a crowdsourcing platform, who played 2430 rounds, resulting in the creation of more than 150k tuples of knowledge. Through an extensive evaluation of these tuples, we show that FindItOut can successfully result in the creation of plural knowledge with a good player experience. We evaluate the efficiency of the game (over 10× higher than a reference baseline) and the usefulness of the resulting knowledge through the lens of two downstream tasks: commonsense question answering and the identification of discriminative attributes. Finally, we present a rigorous qualitative analysis of the tuples' characteristics, which informs the future use of FindItOut across various researcher and practitioner communities.
Measuring Annotator Agreement Generally across Complex Structured, Multi-object, and Free-text Annotation Tasks
When human annotators label data, a key metric for quality assurance is inter-annotator agreement: to what extent do annotators agree in their labeling decisions? For simple categorical and ordinal labeling tasks, many agreement measures already exist, but what about more complex labeling tasks, such as structured, multi-object, or free-text annotation? Though Krippendorff's agreement measure alpha is best known for use with simpler labeling tasks, its distance-based formulation offers broader applicability. However, little work has studied its general usefulness across various complex annotation tasks,
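The distance-based (pairwise) formulation of Krippendorff's alpha, which is what makes it applicable beyond simple categorical labels, can be computed as follows for any user-supplied distance function; this is a sketch of the textbook formulation rather than a contribution specific to the paper.

```python
from itertools import permutations

def krippendorff_alpha(units, distance):
    """Distance-based (pairwise) Krippendorff's alpha.

    units: list of lists, one inner list of labels per annotated item
           (only items with >= 2 labels are pairable).
    distance: a function d(a, b) >= 0 with d(a, a) == 0; plugging in
              different distances generalizes alpha to nominal, ordinal,
              interval, or more complex structured labels."""
    units = [u for u in units if len(u) >= 2]
    n = sum(len(u) for u in units)
    # observed disagreement: average within-unit pairwise distance
    d_o = sum(sum(distance(a, b) for a, b in permutations(u, 2)) / (len(u) - 1)
              for u in units) / n
    # expected disagreement: average pairwise distance over all pooled values
    pooled = [v for u in units for v in u]
    d_e = sum(distance(a, b) for a, b in permutations(pooled, 2)) / (n * (n - 1))
    return 1.0 - d_o / d_e

# Example with a nominal distance (0 if equal, 1 otherwise):
# krippendorff_alpha([[1, 1], [2, 2, 1]], lambda a, b: float(a != b))
```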
The Influences of Task Design on Crowdsourced Judgement: A Case Study of Recidivism Risk Evaluation
Crowdsourcing is widely used to solicit judgement from people in diverse applications ranging from evaluating information quality to rating gig worker performance. To encourage the crowd to put genuine effort into the judgement tasks, various ways to structure and organize these tasks have been explored, though the understanding of how these task design choices influence the crowd's judgement is still largely lacking. In this paper, using recidivism risk evaluation as an example, we conduct a randomized experiment to examine the effects of two common designs of crowdsourcing judgement tasks, encouraging the crowd to deliberate and providing feedback to the crowd, on the quality, strictness, and fairness of the crowd's recidivism risk judgements. Our results show that different designs of the judgement tasks significantly affect the strictness of the crowd's judgements (i.e., the crowd's tendency to predict that a defendant will recidivate). Moreover, task designs also have the potential to significantly influence how fairly the crowd judges defendants from different racial groups, on those cases where the crowd exhibits substantial in-group bias. Finally, we find that the impacts of task designs on the judgement also vary with the crowd workers' own characteristics, such as their cognitive reflection levels. Together, these results highlight the importance of obtaining a nuanced understanding of the relationship between task designs and the properties of crowdsourced judgements.
Will You Accept the AI Recommendation? Predicting Human Behavior in AI-Assisted Decision Making
Internet users make numerous decisions online on a daily basis. With the recent rapid advances in AI, AI-assisted decision making, in which an AI model provides decision recommendations and confidence while the humans make the final decisions, has emerged as a new paradigm of human-AI collaboration. In this paper, we aim at obtaining a quantitative understanding of human behavior in AI-assisted decision making, particularly with respect to whether and when human decision makers adopt the AI model's recommendations. Through a large-scale randomized experiment, we collect human behavior data from 404 real human subjects on 16,160 AI-assisted loan risk assessment tasks. We further define a space of human behavior models by decomposing the human decision maker's cognitive process in each decision-making task into two components: the utility component (i.e., evaluating the utility of different actions) and the selection component (i.e., selecting an action to take), and we perform a systematic search in the model space to identify the model that fits human decision makers' behavior best. Our results highlight that in AI-assisted decision making, human decision makers' utility evaluation and action selection are influenced by their own judgement and confidence on the decision-making task. Further, human decision makers exhibit a tendency to distort the decision confidence in utility evaluations. Finally, we also analyze the differences in humans' adoption of AI recommendations as the stakes of the decisions vary.
Outlier Detection for Streaming Task Assignment in Crowdsourcing
Crowdsourcing aims to enable the assignment of available resources to the completion of tasks at scale. The continued digitization of societal processes translates into increased opportunities for crowdsourcing. For example, crowdsourcing enables the assignment of computational resources of humans, called workers, to tasks that are notoriously hard for computers. In settings faced with malicious actors, detection of such actors holds the potential to increase the robustness of crowdsourcing platform. We propose a framework called Outlier Detection for Streaming Task Assignment that aims to improve robustness by detecting malicious actors. In particular, we model the arrival of workers and the submission of tasks as evolving time series and provide means of detecting malicious actors by means of outlier detection. We propose a novel socially aware Generative Adversarial Network (GAN) based architecture that is capable of contending with the complex distributions found in time series. The architecture includes two GANs that are designed to adversarially train an autoencoder to learn the patterns of distributions in worker and task time series, thus enabling outlier detection based on reconstruction errors. A GAN structure encompasses a game between a generator and a discriminator, where it is desirable that the two can learn to coordinate towards socially optimal outcomes, while avoiding being exploited by selfish opponents. To this end, we propose a novel training approach that incorporates social awareness into the loss functions of the two GANs. Additionally, to improve task assignment efficiency, we propose an efficient greedy algorithm based on degree reduction that transforms task assignment into a bipartite graph matching. Extensive experiments offer insight into the effectiveness and efficiency of the proposed framework.
Stochastic-Expert Variational Autoencoder for Collaborative Filtering
Motivated by the recent successes of deep generative models used for collaborative filtering,
Mutually-Regularized Dual Collaborative Variational Auto-encoder for Recommendation Systems
Recently, user-oriented auto-encoders (UAEs) have been widely used in recommender systems to learn semantic representations of users based on their historical ratings. However, since latent item variables are not modeled in a UAE, it is difficult to utilize the widely available item content information when ratings are sparse. In addition, whenever new items arrive, we need to wait for the collection of rating data for these items and retrain the UAE from scratch, which is inefficient in practice. Aiming to simultaneously address the above two problems, we propose a mutually-regularized dual collaborative variational auto-encoder (MD-CVAE) for recommendation. First, by replacing the randomly initialized last-layer weights of the vanilla UAE with stacked latent item embeddings, MD-CVAE integrates two heterogeneous sources, i.e., item contents and user ratings, into the same principled variational framework, where the weights of the UAE are regularized by item content such that convergence to sub-optimal solutions due to data sparsity can be avoided. In addition, the regularization is mutual in that the ratings can also help the dual item embedding module learn more recommendation-oriented item content embeddings. Finally, we propose a novel symmetric inference strategy for MD-CVAE where the first-layer weights of the encoder are tied to the latent item embeddings of the decoder. Through this strategy, no retraining is required to recommend newly introduced items. Empirical results show the effectiveness of MD-CVAE in both normal and cold-start scenarios. Codes are anonymously released in this URL.
Fast Variational AutoEncoder with Inverted Multi-Index for Collaborative Filtering
Variational AutoEncoder (VAE) has been extended as a representative nonlinear method for collaborative filtering. However, the bottleneck of VAE lies in the softmax computation over all items, such that computing the loss and gradient for optimization takes time linear in the number of items. This hinders practical use, since real-world scenarios involve millions of items. Importance sampling is an effective approximation method, based on which the sampled softmax has been derived. However, existing methods usually exploit the uniform or popularity sampler as proposal distributions, leading to a large bias of gradient estimation. To this end, we propose to decompose the inner-product-based softmax probability based on the inverted multi-index, leading to sublinear-time and highly accurate sampling. Based on these proposals, we develop a fast Variational AutoEncoder (FastVAE) for collaborative filtering. FastVAE can outperform the state-of-the-art baselines in terms of both sampling quality and efficiency according to the experiments on three real-world datasets.
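The core idea behind sampled softmax is to estimate the softmax normalizer from a small set of items drawn from a proposal distribution. A minimal sketch of that generic estimator, assuming a uniform proposal and hypothetical scores (FastVAE's inverted-multi-index proposal, which yields a sharper, lower-variance proposal, is not reproduced):

```python
# Hedged sketch: importance-sampled estimate of the softmax normalizer,
# the generic mechanism behind sampled softmax. The proposal here is uniform
# and the scores are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
n_items = 100_000
scores = rng.normal(size=n_items)        # inner-product scores for one user
pos = 42                                 # index of the positive item

# Exact softmax loss (linear in the number of items -- the bottleneck)
Z_exact = np.exp(scores).sum()
loss_exact = -scores[pos] + np.log(Z_exact)

# Importance-sampled estimate using S draws from a uniform proposal q = 1/n
S = 1_000
sampled = rng.integers(0, n_items, size=S)
q = 1.0 / n_items
Z_hat = np.mean(np.exp(scores[sampled]) / q)   # unbiased estimate of Z
loss_approx = -scores[pos] + np.log(Z_hat)     # plug-in (slightly biased) loss
print(loss_exact, loss_approx)
```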
Consensus Learning from Heterogeneous Objectives for One-Class Collaborative Filtering
Over the past decades, for One-Class Collaborative Filtering (OCCF), many learning objectives have been researched based on a variety of underlying probabilistic models.
FIRE: Fast Incremental Recommendation with Graph Signal Processing
Real-world recommender systems are incremental in nature, in which new users, items and user-item interactions are observed continuously over time. Recent progress in incremental recommendation relies on capturing the temporal dynamics of users/items from temporal interaction graphs, so that their user/item embeddings can evolve together with the graph structures. However, these methods are faced with two key challenges: 1) model training and/or updating are time-consuming and 2) new users and items cannot be effectively handled. To this end, we propose the fast incremental recommendation (FIRE) method from a graph signal processing perspective. FIRE is non-parametric and does not suffer from the time-consuming back-propagations as in previous learning-based methods, significantly improving the efficiency of model updating. In addition, we encode user/item temporal information and side information by designing new graph filters in the proposed method, which can capture the temporal dynamics of users/items and address the cold-start issue for new users/items, respectively. Experimental studies on four popular datasets demonstrate that FIRE can improve the recommendation accuracy
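As a hedged illustration of the graph-signal-processing view, here is a generic one-step low-pass filter over a toy user-item interaction matrix; FIRE's actual filter designs, which also encode temporal dynamics and side information, are not shown:

```python
# Hedged sketch: a generic linear low-pass graph filter over a user-item
# interaction matrix, in the spirit of GSP-based recommenders. This is NOT
# FIRE's filter design; the interaction matrix is toy data.
import numpy as np

R = np.array([[1, 1, 0, 0],     # users x items, 1 = observed interaction
              [0, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)

# Symmetric degree normalization of the bipartite interaction graph
d_u = R.sum(axis=1, keepdims=True).clip(min=1)
d_i = R.sum(axis=0, keepdims=True).clip(min=1)
R_norm = R / np.sqrt(d_u) / np.sqrt(d_i)

# One-step low-pass filtering: propagate signals over item-item co-occurrence
item_item = R_norm.T @ R_norm
scores = R_norm @ item_item            # smoothed preference scores
scores[R > 0] = -np.inf                # mask already-seen items
print(scores.argmax(axis=1))           # top recommendation per user
```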
Left or Right: A Peek into the Political Biases in Email Spam Filtering Algorithms During US Election 2020
Email services use spam filtering algorithms (SFAs) to filter emails that are unwanted by the user. However, at times, the emails perceived by an SFA as unwanted may be important to the user. Such incorrect decisions can have significant implications if SFAs treat emails of user interest as spam on a large scale. This is particularly important during national elections. To study whether the SFAs of popular email services have any biases in treating the campaign emails, we conducted a large-scale study of the campaign emails of the US elections 2020 by subscribing to a large number of Presidential, Senate, and House candidates using over a hundred email accounts on Gmail, Outlook, and Yahoo. We analyzed the biases in the SFAs towards the left and the right candidates and further studied the impact of the interactions (such as reading or marking emails as spam) of email recipients on these biases. We observed that the SFAs of different email services indeed exhibit biases towards different political affiliations. We present this and several other important observations in this paper.
Controlled Analyses of Social Biases in Wikipedia Bios
Social biases on Wikipedia, a widely-read global platform, could greatly influence public opinion.
What Does Perception Bias on Social Networks Tell Us about Friend Count Satisfaction?
Social network platforms have enabled large-scale measurement of user-to-user networks such as friendships. Less studied is user sentiment about their networks, such as a user's satisfaction with their number of friends. We surveyed over 85,000 Facebook users about how satisfied they were with their number of friends on Facebook, connecting these responses to their on-platform activity. As suggested in prior work, we'd expect users who are not satisfied with their friend count to have a higher probability of experiencing the friendship paradox: "your friends have more friends than you". However, in our sample, among users with more than ~3,500 friends, no user experiences the friendship paradox. Instead, we still observe that those users with more friends would prefer to have even more friends. The friendship paradox also contributes to local perception bias, defined as the difference between the average number of friends among a user's friends and the average friend count in the population. Users with a positive perception bias -- their friends have more friends than others -- are less satisfied with their friend count. We then introduce a weighted perception bias metric that considers the fact that different friends have different effects on an individual's perception. We find this new weighted perception bias is able to better distinguish friend count satisfaction outcomes for users with high friend count when compared to the original perception bias metric. We conclude by modeling behavior interactions via a machine learning model, demonstrating how these interactions differ across users with different perception bias. Altogether, these findings offer more insights on users' friend count satisfaction, which may provide guidelines to improve user experience and promote healthy interactions.
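For concreteness, a minimal sketch of the perception bias metric as defined above (average friend count among a user's friends minus the population average), together with a weighted variant; the toy graph and the per-friend weights are hypothetical:

```python
# Hedged sketch of the perception bias metric described above, on a toy graph.
# The per-friend weights in the weighted variant are hypothetical.
import numpy as np

# adjacency list of an undirected toy friendship graph
friends = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
friend_count = {u: len(fs) for u, fs in friends.items()}
population_avg = np.mean(list(friend_count.values()))

def perception_bias(u):
    """Average friend count among u's friends minus the population average."""
    counts = [friend_count[v] for v in friends[u]]
    return np.mean(counts) - population_avg

def weighted_perception_bias(u, weights):
    """Same, but each friend contributes proportionally to a weight
    (e.g., interaction frequency -- hypothetical here)."""
    w = np.array([weights[(u, v)] for v in friends[u]], dtype=float)
    counts = np.array([friend_count[v] for v in friends[u]], dtype=float)
    return np.average(counts, weights=w) - population_avg

weights = {(u, v): 1.0 + (v % 2) for u in friends for v in friends[u]}
print(perception_bias(3), weighted_perception_bias(3, weights))
```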
Rewiring What-to-Watch-Next Recommendations to Reduce Radicalization Pathways
Given the content that a user consumes on a platform at any moment, recommender systems typically suggest similar additional content for the user. Consequently, if the user happens to be exposed to highly radicalized content, she might subsequently receive consecutive recommendations of similarly radicalized content, thus being trapped in a "radicalization pathway". This phenomenon may lead to adverse societal effects, as repeated exposure to radicalized content may shape similar user opinions. In this paper we try to mitigate radicalization pathways from the perspective of graph topology. Specifically, we model the set of recommendations of a "what-to-watch-next" recommender as a d-regular directed graph where the nodes correspond to content items, the links to recommendations, and user sessions to random walks on the graph. When at a radicalized node, a user will have a high chance to get trapped in a radicalization pathway if the node's "segregation", which is measured by the expected length of a random walk from there to any non-radicalized node, is high. We thus aim to reduce the prevalence of radicalization pathways by choosing a small number of edges to rewire, so as to minimize the maximum segregation among all radicalized nodes while maintaining the high quality of recommendations. We prove that the problem is NP-hard and NP-hard to approximate within any factor. Therefore, we turn our attention to heuristics and design an efficient, yet effective, greedy algorithm based on the absorbing random walk theory to find the edges to rewire. Our experiments on real-world datasets in the context of video and news recommendations confirm the effectiveness of our proposal.
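A node's segregation, as defined above, can be computed with the standard fundamental-matrix formula for absorbing Markov chains. A minimal sketch on a hypothetical toy graph (the hardness results and the rewiring heuristic themselves are not reproduced):

```python
# Hedged sketch: segregation as expected absorption time of a random walk,
# using the fundamental matrix of an absorbing Markov chain. The toy
# recommendation graph is hypothetical; the paper's rewiring greedy is omitted.
import numpy as np

# Row-stochastic transition matrix over items 0..3; items 0 and 1 are radicalized
P = np.array([[0.0, 0.7, 0.3, 0.0],
              [0.6, 0.0, 0.0, 0.4],
              [0.0, 0.0, 1.0, 0.0],   # non-radicalized nodes are absorbing
              [0.0, 0.0, 0.0, 1.0]])
radicalized = [0, 1]

Q = P[np.ix_(radicalized, radicalized)]          # transitions among transient nodes
N = np.linalg.inv(np.eye(len(radicalized)) - Q)  # fundamental matrix
segregation = N @ np.ones(len(radicalized))      # expected steps to absorption
print(dict(zip(radicalized, segregation)))
```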
Jettisoning Junk Messaging in the Era of End-to-End Encryption: A Case Study of WhatsApp
WhatsApp is a popular messaging app used by over a billion users around the globe. Due to this popularity, understanding misbehavior on WhatsApp is an important issue. The sending of unwanted junk messages by unknown contacts via WhatsApp remains understudied by researchers, in part because of the end-to-end encryption offered by the platform. This paper addresses this gap by studying junk messaging on a multilingual dataset of 2.6 million messages sent to 5,051 public WhatsApp groups in India over 300 days. We characterise both junk content and senders. We find that nearly 1 in 10 messages is unwanted content sent by junk senders (jettisons), and that junk senders employ a number of unique strategies that reflect the challenges they face on WhatsApp, e.g., the need to change phone numbers regularly. We then experiment with on-device classification to automate the moderation process, whilst respecting end-to-end encryption.
Topological Transduction for Hybrid Few-shot Learning
Mining informative knowledge and analyzing content from the Internet is a challenging task, as web data may contain new concepts that lack sufficient labeled data and may be multimodal. Few-shot learning (FSL) has attracted significant research attention for dealing with scarcely labeled concepts. However, existing FSL algorithms have assumed a uniform task setting such that all samples in a few-shot task share a common feature space. Yet in real web applications, it is often the case that a task may involve multiple input feature spaces due to the heterogeneity of source data, that is, the few labeled samples in a task may be further divided and belong to different feature spaces, namely hybrid few-shot learning (hFSL). The hFSL setting results in a hybrid number of shots per class in each space and aggravates the data scarcity challenge as the number of training samples per class in each space is reduced. To alleviate these challenges, we propose the Task-adaptive Topological Transduction Network, namely TopoNet, which trains a heterogeneous graph-based transductive meta-learner that can combine information from both labeled and unlabeled data to enrich the knowledge about the task-specific data distribution and multi-space relationships. Specifically, we model the underlying data relationships of the few-shot task in a node-heterogeneous multi-relation graph, and then the meta-learner adapts to each task's multi-space relationships as well as its inter- and intra-class data relationships, through an edge-enhanced heterogeneous graph neural network. Our experiments compared with existing approaches demonstrate the effectiveness of our method.
Is Least-Squares Inaccurate in Fitting Power-Law Distributions? The Criticism is Complete Nonsense
Power-law distributions have been observed to appear in many natural and societal systems. According to the Gauss-Markov theorem, ordinary least-squares estimation is the best linear unbiased estimator. In the last two decades, however, some researchers have criticized least-squares estimation as substantially inaccurate for fitting power-law distributions to data. Such criticism has caused a strong bias in the research community against using least squares to estimate the parameters of power-law models. In this paper, we conduct extensive experiments to show that such criticism is complete nonsense. Specifically, we sample different sizes of discrete and continuous data from power-law models, and the statistics of these sampled data show that the sampling noise does not satisfy the strictly monotonic property of a power-law function even though the data are sampled from power-law models. We define the correct way to bin continuous samples into data points and propose an average strategy for LSE to fit power-law distributions to both simulated and real-world data while excluding sampling noise. Experiments demonstrate that our LSE method fits power-law data perfectly. We uncover a fundamental flaw in the widely known method proposed by Clauset et al. (2009): it tends to discard the majority of data and fit the sampling noise. Our analysis also shows that the reverse cumulative distribution function proposed by Newman (2005) to plot power-law data is a terrible strategy in practice: it hides the true probability distribution of data. We hope that our research can clean up the bias in the research community about using LSE to fit power-law distributions.
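As a hedged sketch of the binning-and-averaging idea described above (logarithmic bins, per-bin averaging, then an ordinary least-squares fit in log-log space), on a synthetic continuous power-law sample; this is a generic illustration, not the paper's exact procedure:

```python
# Hedged sketch: least-squares fit of a power-law exponent on log-binned,
# bin-averaged data. A generic illustration of the binning idea above,
# not the paper's exact procedure; the synthetic sample is hypothetical.
import numpy as np

rng = np.random.default_rng(0)
alpha, x_min, n = 2.5, 1.0, 100_000
# Draw continuous samples from a power law p(x) ~ x^(-alpha) via inverse CDF
x = x_min * (1.0 - rng.random(n)) ** (-1.0 / (alpha - 1.0))

# Logarithmic bins, normalized counts -> empirical density per bin
bins = np.logspace(np.log10(x.min()), np.log10(x.max()), 30)
counts, edges = np.histogram(x, bins=bins)
widths = np.diff(edges)
centers = np.sqrt(edges[:-1] * edges[1:])     # geometric bin centers
density = counts / (n * widths)

# Keep non-empty bins and fit a straight line in log-log space
mask = density > 0
slope, intercept = np.polyfit(np.log(centers[mask]), np.log(density[mask]), 1)
print("estimated exponent:", -slope)          # should be close to alpha
```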
Can Small Heads Help? Understanding and Improving Multi-Task Generalization
Multi-task learning aims to solve multiple machine learning tasks at the same time, with good solutions being both generalizable and Pareto optimal. A multi-task deep learning model consists of a shared representation learned to capture task commonalities, and task-specific sub-networks capturing the specificities of each task. In this work, we offer insights on the under-explored trade-off between minimizing task training conflicts in multi-task learning and improving \emph{multi-task generalization}, i.e. the generalization capability of the shared representation across all tasks. The trade-off can be viewed as the tension between multi-objective optimization and shared representation learning: As a multi-objective optimization problem, sufficient parameterization is needed for mitigating task conflicts in a constrained solution space; However, from a representation learning perspective, over-parameterizing the task-specific sub-networks may give the model too many "degrees of freedom" and impede the generalizability of the shared representation.
Metadata-Induced Contrastive Learning for Zero-Shot Multi-Label Text Classification
Extreme multi-label text classification (XMTC) aims to associate a document with its relevant labels from a large candidate set. Most existing XMTC approaches rely on massive human-annotated training data, which are often costly to obtain and suffer from a long-tailed label distribution (i.e., many labels occur only a few times in the training set). In this paper, we study XMTC under the zero-shot setting, which does not require any annotated documents with labels and only relies on label surface names and descriptions. To train a classifier that calculates the similarity score between a document and a label, we propose a novel metadata-induced contrastive learning (MICoL) method. Different from previous text-based contrastive learning techniques, MICoL exploits document metadata (e.g., authors, venue, and references between documents), which are widely available on the Web, to derive similar document-document pairs. Experimental results on two large-scale datasets show that: (1) MICoL significantly outperforms strong zero-shot text classification and contrastive learning baselines by up to 6.0%; (2) MICoL is on par with the supervised metadata-aware XMTC method trained on 10K-200K labeled documents; and (3) MICoL tends to predict more infrequent labels than supervised methods, thus alleviating the deteriorated performance on long-tailed labels.
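As a hedged sketch of the underlying contrastive objective (an InfoNCE-style loss over document pairs, where row i of the two matrices is assumed to be a metadata-linked pair); the embeddings are hypothetical and this is not the MICoL training pipeline:

```python
# Hedged sketch: an InfoNCE-style contrastive loss over document embeddings,
# where row i of `anchors` and row i of `positives` are assumed to be a
# metadata-linked document pair (hypothetical embeddings, not MICoL itself).
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature            # cosine similarities, scaled
    logits -= logits.max(axis=1, keepdims=True)
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))     # positives sit on the diagonal

rng = np.random.default_rng(0)
anchors, positives = rng.normal(size=(8, 64)), rng.normal(size=(8, 64))
print(info_nce(anchors, positives))
```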
On the Origins of Hypertext in the Disasters of the Short 20th Century
The development of hypertext and the World Wide Web is most frequently explained by reference to changes in underlying technologies - Moore's Law giving rise to faster computers, more ample memory, increased bandwidth, inexpensive color displays. That story is true, but it is not complete: hypertext and the Web are also built on a foundation of ideas. Specifically, I believe the Web we know arose from ideas rooted in the disasters of the short twentieth century, 1914-1989. The experience of these disasters differed in the Americas and in Eurasia, and this distinction helps explain many longstanding tensions in research and practice alike.
"Way back then": A Data-driven View of 25+ years of Web Evolution
Since the inception of the first web page three decades back, the Web has evolved considerably, from static HTML pages in the beginning to the dynamic web pages of today, from text-only content to multimedia, etc. Although much of this is known anecdotally, to our knowledge, there is no quantitative documentation of the extent and timing of these changes. This paper attempts to address this gap in the literature by looking at the top 100 Alexa websites for over 25 years from the Internet Archive or the "Wayback Machine", archive.org. We study the changes in popularity, from Geocities and Yahoo! in the mid-to-late 1990s to the likes of Google, Facebook, and Tiktok of today. We also look at different categories of websites and their popularity over the years, the emergence and relative prevalence of different mime-types (text vs. image vs. videos vs. javascript and json) and study whether the use of text on the Internet is declining.
Invited Speaker Bebo White (SLAC retired)
Invited Speaker Judy Brewer (W3C)
Steve Thompson: From GUI to BMI
Abhishek Kumar: Philosophy of Web in Age of Data Capitalism and Misinformation
Almost (Weighted) Proportional Allocations for Indivisible Chores
In this paper, we study how to fairly allocate $m$ indivisible chores to $n$ (asymmetric) agents. We consider (weighted) {\em proportionality up to any item} (PROPX), and show that a (weighted) PROPX allocation always exists and can be computed efficiently. We argue that PROPX might be a more reliable relaxation for proportionality in practice, given that any PROPX allocation ensures a 2-approximation of maximin share (MMS) fairness [Budish, 2011] for symmetric agents and of anyprice share (APS) fairness [Babaioff et al., 2021] for asymmetric agents. APS allocations for chores have not been studied before the current work, and our result directly implies a 2-approximation algorithm. Another by-product result is that an EFX and a weighted EF1 allocation for indivisible chores exist if all agents have the same ordinal preference, which might be of independent interest. We then study the price of fairness (PoF), i.e., the loss in social welfare by enforcing allocations to be (weighted) PROPX. We prove that the tight ratio for PoF is $\Theta(n)$ for symmetric agents and unbounded for asymmetric agents. Finally, we consider the partial information setting and design algorithms that only use agents' ordinal preferences to compute approximately PROPX allocations. Our algorithm achieves a 2-approximation for both symmetric and asymmetric agents, and the approximation ratio is optimal.
Truthful Online Scheduling of Cloud Workloads under Uncertainty
Cloud computing customers often submit repeating jobs and computation pipelines on approximately regular schedules, with arrival and running times that exhibit variance. This pattern, typical of training tasks in machine learning, allows customers to predict future job requirements -- but only partially.
Beyond Customer Lifetime Valuation: Measuring the Value of Acquisition and Retention for Subscription Services
Understanding the value of acquiring or retaining subscribers is crucial for subscription-based businesses. While customer lifetime value (LTV) is commonly used to do so, we demonstrate that LTV likely over-states the true value of acquisition or retention. We establish a methodology to estimate the monetary value of acquired or retained subscribers based on estimating both on- and off-service LTV. To overcome the lack of data on off-service households, we use an approach based on Markov chains that recovers off-service LTV from minimal data on non-subscriber transitions. Furthermore, we demonstrate how the methodology can be used to (i) forecast aggregate subscriber numbers that respect both aggregate market constraints and account-level dynamics, (ii) estimate the impact of price changes on revenue and subscription growth and (iii) provide optimal policies, such as price discounting, that maximize expected lifetime revenue.
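The Markov-chain view of lifetime value admits a compact closed form for the infinite-horizon discounted case. A minimal sketch with hypothetical transition probabilities, per-period revenue, and discount factor (the paper's handling of aggregate market constraints is not shown):

```python
# Hedged sketch: expected discounted lifetime revenue from a Markov chain
# over subscription states. Transition probabilities, per-period revenue,
# and the discount factor are all hypothetical.
import numpy as np

states = ["subscribed", "off_service"]
P = np.array([[0.95, 0.05],    # monthly transition probabilities
              [0.10, 0.90]])
r = np.array([15.0, 0.0])      # expected revenue per period in each state
gamma = 0.99                   # monthly discount factor

# v = r + gamma * P v  =>  v = (I - gamma P)^{-1} r
v = np.linalg.solve(np.eye(2) - gamma * P, r)
print(dict(zip(states, v)))
print("value of retention:", v[0] - v[1])
```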
BONUS! Maximizing Surprise
Multi-round competitions often double or triple the points awarded in the final round, calling it a bonus, to maximize spectators' excitement. In a two-player competition with n rounds, we aim to derive the optimal bonus size to maximize the audience's overall expected surprise (as defined in [Ely et al. 2015]). We model the audience's prior belief over the two players' ability levels as a beta distribution. Using a novel analysis that clarifies and simplifies the computation, we find that the optimal bonus depends greatly upon the prior belief and obtain solutions of various forms for both the case of a finite number of rounds and the asymptotic case. In an interesting special case, we show that the optimal bonus approximately and asymptotically equals the "expected lead", the number of points the weaker player is expected to need to come back. Moreover, we observe that priors with a higher skewness lead to a higher optimal bonus size, and in the symmetric case, priors with a higher uncertainty also lead to a higher optimal bonus size. This matches our intuition since a highly asymmetric prior leads to a high "expected lead", and a highly uncertain symmetric prior often leads to a lopsided game, which again benefits from a larger bonus.
Interference, Bias, and Variance in Two-Sided Marketplace Experimentation: Guidance for Platforms
Two-sided marketplace platforms often run experiments (or A/B tests) to test the effect of an intervention before launching it platform-wide. A typical approach is to randomize individuals into the treatment group, which receives the intervention, and the control group, which does not. The platform then compares the performance in the two groups to estimate the effect if the intervention were launched to everyone. We focus on two common experiment types, where the platform randomizes individuals either on the supply side or on the demand side. For these experiments, it is well known that the resulting estimates of the treatment effect are typically biased: because individuals in the market compete with each other, individuals in the treatment group affect individuals in the control group (and vice versa), creating interference and leading to a biased estimate.
Global or Local: Constructing Personalized Click Models for Web Search
Click models are widely used for user simulation, relevance inference, and evaluation in Web search. Most existing click models implicitly assume that users' relevance judgment and behavior patterns are homogeneous. However, previous studies have shown that different users interact with search engines in rather different ways. Therefore, a unified click model can hardly capture the heterogeneity in users' click behavior. To address this issue, we propose a Click Model Personalization framework (CMP) that adaptively selects from global and local models for individual users. Different adaptive strategies are designed to personalize click behavior modeling only for specific users and queries. We also reveal that capturing personalized behavior patterns is more important than modeling personalized relevance assessments in constructing personalized click models. To evaluate the performance of the proposed CMP framework, we build a large-scale practical Personalized Web Search (PWS) dataset, which consists of the search logs of 1,249 users from a commercial search engine over six months. Experimental results show that the proposed CMP framework achieves significant performance improvements over non-personalized click models in click prediction.
Implicit User Awareness Modeling via Candidate Items for CTR Prediction in Search Ads
Click-through rate (CTR) prediction plays a crucial role in sponsored search advertising. User click behavior usually showcases strong comparison patterns among relevant/competing items within the user awareness. Explicit user awareness could be characterized by user behavior sequence modeling, which however suffers from issues such as cold start, behavior noise and hidden channels. Instead, in this paper, we study the problem of modeling implicit user awareness about relevant/competing items. We notice that candidate items of the CTR prediction model could serve as surrogates for relevant/competing items within the user awareness. Motivated by this finding, we propose a novel framework, named CIM (Candidate Item Modeling), to characterize users' awareness on candidate items. CIM introduces an additional module to encode candidate items into a context vector and therefore is plug-and-play for existing neural network-based CTR prediction models. Offline experiments on a ten-billion-scale real production dataset collected from the real traffic of a search advertising system, together with the corresponding online A/B testing, demonstrate CIM's superior performance. Notably, CIM has been deployed in production, serving the main traffic of hundreds of millions of users, leading to an additional $0.1 billion per year in ad revenue, and showing great application value.
ParClick: A Scalable Algorithm for EM-based Click Models
Research on click models usually focuses on developing effective approaches to reduce biases in user clicks. However, one of the major drawbacks of existing click models is the lack of scalability. In this work, we tackle the scalability of Expectation-Maximization (EM)-based click models by introducing ParClick, a new parallel algorithm designed by following the Partitioning-Communication-Aggregation-Mapping (PCAM) method. To this end, we first provide a generic formulation of EM-based click models. Then, we design an efficient parallel version of this generic click model following the PCAM approach: we partition user click logs and model parameters into separate tasks, analyze communication among them, and aggregate these tasks to reduce communication overhead. Finally, we provide a scalable, parallel implementation of the proposed design, which maps well on a multi-core machine. Our experiments on the Yandex relevance prediction dataset show that ParClick scales well when increasing the amount of training data and computational resources. In particular, ParClick is 24.7 times faster to train with 40 million search sessions and 40 threads compared to the standard sequential version of the Click Chain Model (CCM) without any
Asymptotically Unbiased Estimation for Delayed Feedback Modeling via Label Correction
Alleviating the delayed feedback problem is of crucial importance to the conversion rate (CVR) prediction in online advertising. Present delayed feedback modeling approaches introduce an observation window to balance the trade-off between waiting for accurate labels and consuming fresh feedback. Moreover, to estimate CVR upon the freshly observed but biased distribution with fake negatives, importance sampling is widely used to reduce the distribution bias. While effective, we argue that previous approaches falsely treat fake negative samples as real negatives during the importance weighting and have not fully utilized the observed positive samples, leading to suboptimal performance.
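As a hedged sketch of the generic importance-weighting correction the abstract refers to (a per-sample weighted binary cross-entropy; the weights are assumed to come from some bias model and are hypothetical, and the paper's label-correction estimator is not reproduced):

```python
# Hedged sketch: importance-weighted binary cross-entropy for training on a
# biased (fresh, partially-labeled) sample. The per-sample weights w are
# assumed to come from some bias model and are hypothetical; this is the
# generic importance-sampling correction, not the paper's label correction.
import numpy as np

def weighted_bce(p, y, w, eps=1e-7):
    p = np.clip(p, eps, 1 - eps)
    per_sample = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return np.mean(w * per_sample)

p = np.array([0.9, 0.2, 0.6, 0.1])   # predicted conversion probabilities
y = np.array([1.0, 0.0, 0.0, 0.0])   # observed labels (may contain fake negatives)
w = np.array([1.0, 1.0, 1.8, 0.9])   # hypothetical importance weights
print(weighted_bce(p, y, w))
```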
Cross DQN: Cross Deep Q Network for Ads Allocation in Feed
E-commerce platforms usually display a mixed list of ads and organic items in feed. One key problem is to allocate the limited slots in the feed to maximize the overall revenue as well as improve user experience, which requires a good model for user preference. Instead of modeling the influence of individual items on user behaviors, the arrangement signal models the influence of the arrangement of items and may lead to a better allocation strategy. However, most previous strategies fail to model such a signal and therefore result in suboptimal performance. In addition, the percentage of ads exposed (PAE) is an important indicator in ads allocation. Excessive PAE hurts user experience while too low PAE reduces platform revenue. Therefore, how to constrain the PAE within a certain range while keeping recommendations personalized under this constraint remains a challenge.
What's in an Index: Extracting Domain-specific Knowledge Graphs from Textbooks
A typical index at the end of a textbook contains a manually-provided vocabulary of terms related to the content of the textbook. In this paper, we extend our previous work on extraction of knowledge models from digital textbooks. We are taking a more critical look at the content of a textbook index and present a mechanism for classifying index terms according to their domain specificity: a core domain concept, an in-domain concept, a concept from a related domain, and a concept from a foreign domain. We link the extracted models to DBpedia and leverage the aggregated linguistic and structural information from textbooks and DBpedia to construct and prune the domain specificity graph. The evaluation experiments demonstrate (1) the ability of the approach to identify (with high accuracy) different levels of domain specificity for automatically extracted concepts, (2) its cross-domain robustness, and (3) the added value of the domain specificity information. These results clearly indicate the improved quality of the refined domain models and widen their potential applicability for a variety of intelligent web applications, such as resource recommendation, semantic annotation and information extraction.
Conditional Generation Net for Medication Recommendation
Medication recommendation aims to provide a proper set of medicines according to patients' diagnoses, which is a critical task in clinics. Currently, the recommendation is manually conducted by doctors. However, for complicated cases, like patients with multiple diseases at the same time, it is difficult to propose a well-considered recommendation even for experienced doctors. This motivates automatic medication recommendation, which can treat all diagnosed diseases without causing harmful drug-drug interactions. Due to its clinical value, medication recommendation has attracted growing research interest. Existing works mainly formulate medication recommendation as a multi-label classification task to predict the set of medicines. In this paper, we propose the Conditional Generation Net (COGNet) which introduces a novel copy-or-predict mechanism to generate the set of medicines. Given a patient, the proposed model first retrieves their historical diagnoses and medication recommendations and mines their relationship with current diagnoses. Then in predicting each medicine, the proposed model decides whether to copy a medicine from previous recommendations or to predict a new one. This process is quite similar to the decision process of human doctors. We validate the proposed model on the public MIMIC data set, and the experimental results show that the proposed model can outperform state-of-the-art approaches.
What Should You Know? A Human-In-the-Loop Approach to Unknown Unknowns Characterization in Image Recognition
Unknown unknowns represent a major challenge in reliable image recognition. Existing methods mainly focus on unknown unknowns identification, leveraging human intelligence to gather images that are potentially difficult for the machine. To drive a deeper understanding of unknown unknowns and more effective identification and treatment, we focus on unknown unknowns characterization in this paper. We introduce a human-in-the-loop, semantic analysis framework for characterizing unknown unknowns at scale. Humans are engaged in two tasks that specify what a machine \emph{should know} and describe what it \emph{really knows}, respectively, both at the conceptual level, supported by information extraction and machine learning interpretability methods. Data partitioning and sampling techniques are employed to scale up human contributions in handling large data. Through extensive experimentation on scene recognition tasks, we show that our approach provides a rich, descriptive characterization of unknown unknowns and allows for more effective and efficient detection than the state of the art.
Context-Enriched Learning Models for Aligning Biomedical Vocabularies at Scale in the UMLS Metathesaurus
The Unified Medical Language System (UMLS) Metathesaurus construction process mainly relies on lexical algorithms and manual expert curation for integrating over 200 biomedical vocabularies. A lexical-based learning model (LexLM) was developed to predict synonymy among Metathesaurus terms and it largely outperforms a rule-based approach (RBA) that approximates the current construction process. However, the LexLM has the potential to be improved further because it only uses lexical information from the source vocabularies, while the RBA also takes advantage of contextual information. We investigate the role of multiple types of contextual information available to the UMLS editors, namely source synonymy (SS), source semantic group (SG), and source hierarchical relations (HR), for the UMLS vocabulary alignment (UVA) problem.
Capturing Diverse and Precise Reactions to a Comment with User-Generated Labels
Simple up- and down-votes, arguably the most widely used reaction mechanism across social media platforms, allow users to efficiently express their opinions and quickly evaluate others' opinions from aggregated votes. However, from our formative studies, we found that such a design forces users to project their diverse opinions onto dichotomized reactions (up or down) and provides limited information to readers on why a comment was up- or down-voted. We explore user-generated labels (UGLs) as an alternative reaction design to capture the rich context of user reactions to comments. We conducted a between-subjects experiment with 218 participants from Mechanical Turk to understand how people use and are influenced by UGLs compared to up- and down-votes. Specifically, we examine how UGLs affect users' ability to express viewpoints and perceive diverse opinions. Participants generated 234 unique labels (109 positive, 125 negative) regarding the degree of agreement, the strength of the argument, the style of the comment, judgments on the commenter, and feelings or beliefs related to the topic. UGLs enabled participants to better understand the multifacetedness of public evaluation of a comment. Participants reported that the ability to express their opinions about a comment significantly improved with UGLs.
Successful New-entry Prediction for Multi-Party Online Conversations via Latent Topics and Discourse Modeling
With the increasing popularity of social media, online interpersonal communication now plays an essential role in people's everyday information exchange. Whether and how a newcomer can better engage in the community has attracted great interest due to its application in many scenarios. Although some prior works that explore early socialization have obtained salient achievements, they focus on sociological surveys based on small groups. To help individuals get through the early socialization period and engage well in online conversations, we study a novel task to foresee whether a newcomer's message will be responded to by other participants in a multi-party conversation (henceforth Successful New-entry Prediction). The task would be an important part of the research in online assistants and social media. To further investigate the key factors indicating such engagement success, we employ an unsupervised neural network, Variational Auto-Encoder (VAE), to examine the topic content and discourse behavior from the newcomer's chatting history and the conversation's ongoing context. Furthermore, two large-scale datasets, from Reddit and Twitter, are collected to support further research on new entries. Extensive experiments on both Twitter and Reddit datasets show that our model significantly outperforms all the baselines and popular neural models. Additional explainable and visual analyses on new-entry behavior shed light on how to better join in others' discussions.
This Must Be the Place: Predicting Engagement of Online Communities in a Large-scale Distributed Campaign
Understanding collective decision making at a large-scale, and elucidating how community organization and community dynamics shape collective behavior are at the heart of social science research. In this work we study the behavior of thousands of communities with millions of active members. We define a novel task: predicting which community will undertake an unexpected, large-scale, distributed campaign.
Element-guided Temporal Graph Representation Learning for Temporal Sets Prediction
Given a sequence of sets with timestamps, where each set includes an arbitrary number of elements, temporal sets prediction aims to predict the elements of the subsequent set. Indeed, predicting temporal sets is much more complicated than conventional prediction of time series or temporal events. Recent studies on temporal sets prediction only focus on the learning within each user's own sequence, and thus fail to discover the collaborative signals among different users' sequences. In this paper, we propose a novel Element-guided Temporal Graph Neural Network (ETGNN) for temporal sets prediction to address the above issue. Specifically, we first connect sequences of different users via a temporal graph, where nodes contain users and elements, and edges represent user-element interactions with time information. Then, we devise a new message aggregation mechanism to improve the model expressive ability via adaptively learning element-specific representations for users with the guidance of elements. By performing the element-guided message aggregation among multiple hops, collaborative signals latent in high-order user-element interactions could be explicitly encoded. Finally, we present a temporal information utilization module to capture both the semantic and periodic patterns in user sequential behaviors. Experiments on real-world datasets demonstrate that our approach can not only outperform the existing methods by a significant margin but also explore the collaborative signals.
Learn over Past, Evolve for Future: Search-based Time-aware Recommendation with Sequential Behavior Data
Personalized recommendation is an essential part of modern e-commerce, where users' demands are conditioned not only on their profiles but also on their recent browsing behaviors as well as periodic purchases made some time ago. In this paper, we propose a novel framework named STARec, which captures the evolving demands of users over time through a unified search-based time-aware model. More concretely, we first design a search-based module to retrieve a user's related historical behaviors which are then mixed with her recent records to be fed into a time-aware sequential network for capturing her time-sensitive demands. Besides retrieving the message from her personal history, we also propose to search and include similar users' records as an additional reference. All this information is further fused to make the final recommendation. Beyond this framework, we also develop a novel label trick where the labels (i.e., user feedback) are used as the input with a masking technique to address the label leakage issue. Apart from the learning algorithm, we also analyze how to efficiently deploy STARec in the large-scale industrial system. We conduct extensive experiments on three real-world commercial datasets on click-through-rate prediction tasks against state-of-the-art methods. Experimental results demonstrate the superiority and efficiency of our proposed framework and techniques. Furthermore, results of online experiments on a daily item recommendation platform operated by a mainstream bank show that STARec gains average performance improvements of around 6% and 1.5% on the CTR metric in its two main item recommendation scenarios, respectively.
MBCT: Tree-Based Feature-Aware Binning for Individual Uncertainty Calibration
Machine learning applications such as medical diagnosis, meteorological forecasting, and computational advertising often require the predictor (model) to output a calibrated estimate, which is the true probability of an event. Most existing predictors mainly concern classification accuracy, but their predicted probabilities are not calibrated. Thus, researchers have developed various calibration methods to post-process the outputs of a predictor to obtain calibrated values. Existing calibration studies include binning- and scaling-based methods. Compared with scaling, binning methods are shown to have distribution-free theoretical guarantees, which motivates us to prefer binning methods for calibration. However, we notice that existing binning methods have several drawbacks: (a) the binning scheme only considers the original prediction values, thus limiting the performance of calibration; and (b) the binning approach is non-individual, mapping multiple samples in a bin to the same value, and thus is not suitable for order sensitive applications. In this paper, we propose a feature-aware binning framework, called Multiple Boosting Calibration Trees (MBCT), along with a multiview calibration loss to tackle the above issues. Specifically, MBCT optimizes the binning scheme by the tree structures of features, and adopts a linear function in a tree node to achieve individual calibration. Our MBCT is non-monotonic, and has the potential to improve order accuracy, due to its learnable binning scheme and the individual calibration. We conduct comprehensive experiments on three datasets in different fields. Experimental results show that our method surpasses all the competing models in terms of both calibration error and order accuracy. We also conduct simulation experiments, justifying that the proposed multi-view calibration loss is a better metric in modeling calibration error. In addition, our approach is deployed in a real-world online advertising platform; an A/B test over two weeks further demonstrates the effectiveness
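For reference, a minimal sketch of classic equal-frequency histogram binning, the non-feature-aware baseline that MBCT's tree-based, feature-aware binning improves on; the data here are synthetic:

```python
# Hedged sketch: classic equal-frequency histogram binning, the binning-based
# calibration baseline the abstract builds on (not MBCT itself). Data is synthetic.
import numpy as np

def histogram_binning(scores, labels, n_bins=10):
    order = np.argsort(scores)
    bins = np.array_split(order, n_bins)          # equal-frequency bins
    edges = [scores[b].max() for b in bins[:-1]]  # upper boundary of each bin
    values = [labels[b].mean() for b in bins]     # empirical positive rate per bin
    def calibrate(s):
        idx = np.searchsorted(edges, s, side="right")
        return np.asarray(values)[idx]
    return calibrate

rng = np.random.default_rng(0)
scores = rng.random(5000)
labels = (rng.random(5000) < scores ** 2).astype(float)  # miscalibrated scores
cal = histogram_binning(scores, labels)
print(cal(np.array([0.1, 0.5, 0.9])))
```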
Prototype Feature Extraction for Multi-task Learning
Multi-task learning (MTL) has been widely utilized in various industrial scenarios, such as recommender systems and search engines. MTL can improve learning efficiency and prediction accuracy by exploiting commonalities and differences across tasks. However, MTL is sensitive to relationships among tasks and may suffer performance degradation in real-world applications, because existing neural-based MTL models often share the same network structures and original input features. To address this issue, we propose a novel multi-task learning model based on Prototype Feature Extraction (PFE) to balance task-specific objectives and inter-task relationships. PFE is a novel component that disentangles features for multiple tasks. To better extract features from the original inputs before the gating networks, we introduce a new concept, the prototype feature center. The extracted prototype features fuse various features from different tasks to better learn inter-task relationships. PFE updates prototype feature centers and prototype features iteratively. Our model utilizes the learned prototype features and task-specific experts for MTL. We evaluate PFE on two public datasets. Empirical results show that PFE outperforms state-of-the-art MTL models by extracting prototype features. Furthermore, we deploy PFE in a real-world recommender system (one of the world's top-tier short video sharing platforms) to showcase that PFE can be widely applied in industrial scenarios.
AutoField: Automating Feature Selection in Deep Recommender Systems
Feature quality has a significant effect on recommendation performance. Therefore, the selection of essential features is a critical process in developing DNN-based recommender systems. Most existing deep recommendation models, however, spend a lot of effort on designing sophisticated neural networks, while techniques for selecting valuable features remain understudied. Typically, existing models just feed all possible features into their proposed deep architectures, or select important features
Seesaw Counting Filter: An Efficient Guardian for Vulnerable Negative Keys During Dynamic Filtering
The Bloom filter is an efficient data structure for filtering negative keys (keys not in a given set) with a bounded one-sided error probability in very small space. However, in real-world applications, there widely exist vulnerable negative keys, which will bring high costs if not properly filtered. Recently, some works have focused on handling such (vulnerable) negative keys by incorporating learning techniques. However, these techniques incur high overhead for filter construction or query latency. Moreover, in the scenarios where keys are dynamically added and deleted, these learning-based filters fail to work as the learning techniques can hardly handle incremental insertions or deletions. To address the problem, we propose the SeeSaw Counting Filter (SSCF), which encapsulates the vulnerable negative keys into a unified counter array named the seesaw counter array and dynamically modulates (or varies) the applied hash functions during insertion to guard the encapsulated keys from being misidentified. Moreover, to handle the scenarios where the vulnerable negative keys cannot be obtained in advance, we propose ada-SSCF, which can take vulnerable negative keys as input dynamically. We theoretically analyze our proposed approach, and then extensively evaluate it on several representative data sets. Our experiments show that, under the same memory space, SSCF outperforms the cutting-edge filters by 3x on average (up to more than one order of magnitude on skewed data) regarding accuracy while achieving low operation latency similar to the standard Counting Bloom filter. All source codes are available in [1].
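For context, a minimal sketch of the standard Counting Bloom filter that SSCF is compared against (supporting insert, delete, and query); the seesaw counter array and hash modulation of SSCF are not reproduced:

```python
# Hedged sketch: a minimal standard Counting Bloom filter (the baseline the
# abstract compares against), NOT the proposed SSCF with its seesaw counters
# and hash modulation.
import hashlib

class CountingBloomFilter:
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def insert(self, key):
        for pos in self._positions(key):
            self.counters[pos] += 1

    def delete(self, key):
        for pos in self._positions(key):
            if self.counters[pos] > 0:
                self.counters[pos] -= 1

    def query(self, key):
        # No false negatives; false positives occur with bounded probability
        return all(self.counters[pos] > 0 for pos in self._positions(key))

cbf = CountingBloomFilter()
cbf.insert("example.com")
print(cbf.query("example.com"), cbf.query("not-inserted.org"))
cbf.delete("example.com")
print(cbf.query("example.com"))
```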
DUET: A Generic Framework for Finding Special Quadratic Elements in Data Streams
Finding special items, like heavy hitters, top-$k$ items, and persistent items, has always been a hot issue in data stream processing for web analysis. While data streams nowadays are usually high-dimensional, most prior works focus on special items according to a certain primary dimension and yield little insight into the correlations between dimensions. Therefore, we propose to find special quadratic elements in data streams to reveal the close correlations. Based on the special items mentioned above, we extend our problem to three applications related to heavy hitters, top-$k$, and persistent items, and design a generic framework \textbf{DUET} to process them. Besides, we analyze the error bound of our algorithm theoretically and conduct extensive experiments on four publicly available data sets. Our experimental results show that DUET can achieve 3.5 times higher throughput and three orders of magnitude lower average relative error compared with cutting-edge algorithms.
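As a hedged illustration of tracking quadratic (pairwise) elements in a stream, here is a Count-Min sketch keyed on item pairs; this generic structure is not the DUET framework, and the keys are hypothetical:

```python
# Hedged sketch: a Count-Min sketch keyed by item *pairs*, as a generic way to
# track frequent quadratic elements in a stream. This is not the DUET framework.
import hashlib

class PairCountMin:
    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, row, pair):
        digest = hashlib.sha256(f"{row}:{pair[0]}|{pair[1]}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def update(self, a, b, count=1):
        pair = tuple(sorted((a, b)))          # treat (a, b) and (b, a) alike
        for row in range(self.depth):
            self.table[row][self._hash(row, pair)] += count

    def estimate(self, a, b):
        pair = tuple(sorted((a, b)))
        return min(self.table[row][self._hash(row, pair)]
                   for row in range(self.depth))

cms = PairCountMin()
for _ in range(100):
    cms.update("src=10.0.0.1", "dst=example.com")
print(cms.estimate("src=10.0.0.1", "dst=example.com"))   # >= 100 (overestimates only)
```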
GRAND+: Scalable Graph Random Neural Networks
Graph data is commonplace on the Web, where information is naturally interconnected, as in social networks or online academic publications.
Dual-branch Density Ratio Estimation for Signed Network Embedding
Signed network embedding (SNE) has received considerable attention in recent years. A mainstream idea of SNE is to learn node representations by estimating the ratio of sampling densities. Though achieving promising performance, these methods based on density ratio estimation are limited by the issues of confusing samples, expected error, and a fixed prior. To alleviate the above-mentioned issues, in this paper, we propose a novel dual-branch density ratio estimation (DDRE) architecture for SNE. Specifically, DDRE 1) consists of a dual-branch network, dealing with confusing samples; 2) proposes the expected matrix factorization without sampling to avoid the expected error; and 3) devises an adaptive cross noise sampling to alleviate the fixed prior. We perform sign prediction and node classification experiments on four real-world and three artificial datasets, respectively. Extensive empirical results demonstrate that DDRE not only significantly outperforms the methods based on density ratio estimation but also achieves competitive performance compared with other types of methods such as graph likelihood, generative adversarial networks, and graph convolutional networks. Code is publicly available at https://github.com/WWW22-code/DDRE.
Resource-Efficient Training for Large Graph Convolutional Networks with Label-Centric Cumulative Sampling
Graph Convolutional Networks (GCNs) are popular for learning representation of graph data and have a wide range of applications in social networks, recommendation systems, etc. However, training GCN models for large networks is resource-intensive and time-consuming, which hinders their real-world deployment. Existing GCN training methods optimize the sampling of mini-batches for stochastic gradient descent to accelerate the training process, but they do not reduce the problem size and achieve only a limited reduction in computational complexity. In this paper, we argue that a GCN can be trained with a sampled subgraph to produce approximate node representations, which inspires us with a novel perspective to accelerate GCN training via network sampling. To this end, we propose a label-centric cumulative sampling (LCS) framework for training GCNs for large graphs. The proposed method constructs a subgraph cumulatively based on probabilistic sampling, and trains the GCN model iteratively to generate approximate node representations. LCS is theoretically guaranteed to minimize the bias of the node aggregation procedure in GCN training. Extensive experiments based on four real-world network datasets show that the proposed framework accelerates the training for the state-of-the-art GCN models up to 16x without causing a noteworthy model accuracy drop. Our codes are publicly available at GitHub.
Curvature Graph Generative Adversarial Networks
Generative adversarial networks (GANs) are widely used for generalized and robust learning on graph data. However, for non-Euclidean graph data, the existing GAN-based graph representation methods generate negative samples by random walk or traverse in discrete space, leading to the information loss of topological properties (e.g. hierarchy and circularity). Moreover, due to the topological heterogeneity (i.e., different densities across the graph structure) of graph data, they suffer from a serious topological distortion problem. In this paper, we propose a novel Curvature Graph Generative Adversarial Network, named CurvGAN, which is the first GAN-based graph representation method in the Riemannian geometric manifold. To better preserve the topological properties, we approximate the discrete structure as a continuous Riemannian geometric manifold and generate negative samples efficiently from the wrapped normal distribution. To deal with the topological heterogeneity, we leverage the Ricci curvature for local structures with different topological properties, obtaining low-distortion representations. Extensive experiments show that CurvGAN consistently and significantly outperforms the state-of-the-art methods across multiple tasks and shows superior robustness and generalization.
On Size-Oriented Long-Tailed Graph Classification of Graph Neural Networks
The prevalence of graph structures has prompted a surge of investigation on graph data, enabling several downstream tasks such as multi-graph classification. However, in the multi-graph setting, graphs usually follow a long-tailed distribution in terms of their sizes, i.e., the number of nodes. In particular, a large fraction of tail graphs usually have small sizes. Though recent graph neural networks (GNNs) can learn powerful graph-level representations, they treat the graphs uniformly and marginalize the tail graphs, which suffer from a lack of distinguishable structures, resulting in inferior performance on tail graphs. To alleviate this concern, in this paper we propose a novel graph neural network named SOTA-GNN, to close the representational gap between the head and tail graphs from the perspective of knowledge transfer. In particular, SOTA-GNN exploits co-occurring substructures to extract transferable patterns from head graphs. Furthermore, a novel relevance prediction function is proposed to memorize the pattern relevance derived from head graphs, in order to predict the complements for tail graphs to achieve more comprehensive structures for enrichment. We conduct extensive experiments on five benchmark datasets, and demonstrate that our proposed model can outperform all the state-of-the-art baselines.
Exploiting Anomalous Structural Change-based Nodes in Generalized Dynamic Social Networks (journal paper)
Recently, dynamic social network research has attracted a great deal of attention, especially in the area of anomaly analysis that analyzes the anomalous change in the evolution of dynamic social networks. However, most current research has focused on anomaly analysis of the macro representation of dynamic social networks and has failed to analyze nodes that undergo anomalous structural changes at the micro-level. To identify and evaluate anomalous structural change-based nodes in generalized dynamic social networks that only have limited structural information, this research considers undirected and unweighted graphs and develops a multiple-neighbor superposition similarity method (MNSSM), which mainly consists of a multiple-neighbor range algorithm (MNRA) and a superposition similarity fluctuation algorithm (SSFA). MNRA introduces observation nodes, characterizes the structural similarities of nodes within multiple-neighbor ranges, and proposes a new multiple-neighbor similarity index on the basis of extensional
Web mining to inform locations of charging stations for electric vehicles
The availability of charging stations is an important factor for promoting electric vehicles (EVs) as a carbon-friendly way of transportation. Hence, for city planners, the crucial question is where to place charging stations so that they achieve high utilization. Here, we hypothesize that the utilization of EV charging stations is driven by the proximity to points-of-interest (POIs), as EV owners have a certain limited willingness to walk between charging stations and POIs. To address our research question, we propose the use of web mining: we characterize the influence of different POIs from OpenStreetMap on the utilization of charging stations. For this, we present a tailored interpretable model that takes into account the full spatial distributions of both the POIs and the charging stations. This then allows us to estimate the distance and magnitude of the influence of different POI types. We evaluate our model with data from approx. 300 charging stations and 4,000 POIs in Amsterdam, Netherlands. Our mod
Know Your Victim: Tor Browser Setting Identification via Network Traffic Analysis
Network traffic analysis (NTA) is widely researched to fingerprint users' behavior by analyzing network traffic with machine learning algorithms. It has introduced new lines of de-anonymizing attacks in the Tor network, including Website Fingerprinting (WF) and Hidden Service Fingerprinting (HSF). Previous work observed that the Tor browser version may affect network traffic and claimed that having identical browsing settings between the users and adversaries is one of the challenges in WF and HSF. Based on this observation, we propose an NTA method to identify users' browser settings in the Tor network. We confirm that browser settings have notable impacts on network traffic and create a classifier to identify the browser settings. The classifier achieves over 99% accuracy under the closed-world assumption. Under the open-world assumption, the results indicate successful classification except for one security setting option. Last, we provide our observations and insights through feature analysis and changelog inspe
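As a hedged sketch of the classification setup (fitting a standard classifier over extracted flow features to predict a browser-setting label); the features are synthetic and the model choice is illustrative, not the paper's classifier or feature set:

```python
# Hedged sketch: fitting a classifier over extracted traffic features to
# predict a browser-setting label. Features are synthetic (so accuracy is near
# chance here); replace them with real packet-size/timing statistics.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_flows, n_features, n_settings = 2000, 40, 5
X = rng.normal(size=(n_flows, n_features))          # e.g., per-flow statistics
y = rng.integers(0, n_settings, size=n_flows)       # browser-setting labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```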
Knowledge Distillation for Discourse Relation Analysis
Automatically identifying the discourse relations can help many downstream NLP tasks such as reading comprehension. It can be categorized into explicit and implicit discourse relation recognition (EDRR and IDRR). Due to the lack of connectives, IDRR remains a big challenge. In this paper, we take the first step to exploit the knowledge distillation (KD) technique for discourse relation analysis. Our target is to train a focused single-data single-task student with the help of a general multi-data multi-task teacher. Specifically, we first train one teacher for both the top and second level relation classification tasks with explicit and implicit data. We then transfer the feature embeddings and soft labels from the teacher network to the student network. Extensive experimental results on the popular PDTB dataset prove that our model achieves a new state-of-the-art performance. We also show the effectiveness of our proposed KD architecture through detailed analysis.
User Donations in Online Social Game Streaming: The Case of Paid Subscription in Twitch.tv
Online social game streaming has proliferated with the rise of communities like Twitch.tv and YouTube Gaming. Beyond entertainment, these have become vibrant communities where streamers and viewers interact and support each other, and the phenomenon of user donation is rapidly emerging in these communities. In this article, we provide a publicly available (anonymized) dataset and conduct an in-depth analysis of user donations (made through paid user subscriptions) on Twitch, a worldwide popular online social game streaming community. Based on information on over 2.77 million subscription relationships worth in total over 14.1 million US dollars, we propose a subscription graph and reveal the scale and diversity of paid user subscriptions received and made. Among other results, we find that (i) the paid subscriptions received and made are highly skewed, (ii) the majority of streamers are casual streamers who only come online occasionally, and regular streamers who stream in different categories are relatively receiv
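A small sketch of the kind of analysis described: building a directed viewer-to-streamer subscription graph and quantifying the skew of received subscriptions. The Gini coefficient is one assumed way to measure skew; the paper's exact measures are not specified in the abstract.

```python
# Illustrative subscription-graph construction and skew measurement.
import networkx as nx
import numpy as np

def build_subscription_graph(subscriptions):
    """subscriptions: iterable of (viewer_id, streamer_id, amount_usd) tuples."""
    g = nx.DiGraph()
    for viewer, streamer, usd in subscriptions:
        g.add_edge(viewer, streamer, usd=usd)
    return g

def gini(values):
    """Gini coefficient of a non-negative distribution (0 = equal, 1 = highly skewed)."""
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    cum = np.cumsum(x)
    return (n + 1 - 2 * (cum / cum[-1]).sum()) / n

# in_degrees = [d for _, d in g.in_degree()]  ->  gini(in_degrees) measures the skew
# of paid subscriptions received per streamer.
```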
Multi-task GNN for Substitute Identification
Substitute product recommendation is important for improving customer satisfaction in the e-commerce domain. E-commerce naturally provides rich sources of substitute relationships, e.g., customers who view one product also view another, customers purchase a substitute product when the viewed product is sold out, etc. However, existing recommendation systems usually learn the product substitution correlation without jointly considering the various customer behavior sources. In this paper, we propose a unified multi-task heterogeneous graph neural network (M-HetSage), which captures the complementary information across various customer behavior data sources. This allows us to explore synergy across sources with different attributes and quality. Moreover, we introduce a list-aware average precision (LaAP) loss, which exploits correlations among lists of substitutes and non-substitutes by directly optimizing an approximation of the target ranking metric, i.e., mean average precision (mAP). On top of that, LaAP leverages a list-
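To make the "optimizing an approximation of mAP" idea concrete, here is a generic Smooth-AP-style surrogate in which hard rank indicators are replaced by sigmoids of score differences. This stands in for, and is not, the paper's LaAP loss; the temperature and formulation are assumptions.

```python
# Generic smooth average-precision surrogate for one ranked list of candidates.
import torch

def smooth_ap_loss(scores, labels, tau=0.01):
    """scores: (n,) model scores; labels: (n,) 1 = substitute, 0 = non-substitute."""
    labels = labels.float()
    pos = labels.bool()
    diff = scores.unsqueeze(0) - scores.unsqueeze(1)          # diff[i, j] = s_j - s_i
    above = torch.sigmoid(diff / tau)                         # soft indicator "j outranks i"
    not_self = 1.0 - torch.eye(len(scores), device=scores.device)
    rank_all = 1.0 + (above * not_self).sum(dim=1)            # soft rank among all items
    rank_pos = 1.0 + (above * not_self * labels.unsqueeze(0)).sum(dim=1)  # among substitutes
    ap = (rank_pos[pos] / rank_all[pos]).mean()               # smooth average precision
    return 1.0 - ap
```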
Unsupervised Post-Time Fake Social Message Detection with Recommendation-aware Representation Learning
This paper addresses a more realistic scenario of fake message detection on social media, i.e., unsupervised post-time detection. Given a source message, our goal is to determine whether it is fake without using labeled data and without requiring users to have interacted with the given message. We present a novel learning framework, Recommendation-aware Message Representation (RecMR), to achieve this goal. The key idea is to learn user preferences and encode them into the representation of the source message by jointly training the tasks of user recommendation and binary detection. Experiments conducted on two real Twitter datasets exhibit the promising performance of RecMR and show the effectiveness of recommended users in unsupervised detection.
Cross-Language Learning for Product Matching
Transformer-based matching methods have significantly advanced the state of the art for less-structured matching tasks such as matching product offers in e-commerce. In order to excel at these tasks, Transformer-based matching methods require a decent amount of training pairs. Providing enough training data can be challenging, especially if a matcher for non-English product descriptions is to be learned. Using the use case of matching product offers from different e-shops, this poster explores to what extent it is possible to improve the performance of Transformer-based entity matchers by complementing a small set of training pairs in the target language, German in our case, with a larger set of English-language training pairs. Our experiments using different Transformers show that extending the German set with English pairs improves the matching performance in all cases. The impact of adding the English pairs is especially high in low-resource settings in which only a rather small number of non-English pairs i
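A minimal sketch of the setup described above: fine-tuning a multilingual Transformer on the union of a small German pair set and a larger English pair set. The choice of `xlm-roberta-base` and the encoding details are assumptions; the poster evaluates several Transformers.

```python
# Illustrative cross-language training-data augmentation for pair-wise matching.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=2)

def encode_pairs(offer_pairs):
    """offer_pairs: list of (offer_a, offer_b) title/description strings."""
    left, right = zip(*offer_pairs)
    return tokenizer(list(left), list(right), truncation=True, padding=True,
                     return_tensors="pt")

# train_pairs = german_pairs + english_pairs   # the cross-language augmentation step
# -> feed encode_pairs(train_pairs) plus 0/1 match labels to a standard fine-tuning loop.
```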
PREP: Pre-training with Temporal Elapse Inference for Popularity Prediction
Predicting the popularity of online content is a fundamental problem in various applications. One practical challenge is rooted in the varying length of the observation time or prediction horizon, i.e., a good model for popularity prediction should handle prediction tasks with different settings. However, most existing methods adopt the paradigm of training a separate prediction model for each prediction setting, and the model obtained for one setting is difficult to generalize to other settings, causing a great waste of computational resources and a large demand for downstream labels. To solve these issues, we propose a novel pre-training framework for popularity prediction, namely PREP, which aims to pre-train a general deep representation model from readily available unlabeled diffusion data. We design a novel pretext task for pre-training, i.e., temporal elapse inference for two randomly sampled time slices of popularity dynamics, impelling the deep representation model to e
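The pretext task can be illustrated as follows: sample two time slices of an unlabeled popularity sequence, encode each slice with a shared encoder, and train an auxiliary head to predict the elapsed time between them. The encoder, head, and sampling routine below are placeholders, not PREP's architecture.

```python
# Sketch of a temporal-elapse-inference pretext task for pre-training.
import random
import torch
import torch.nn as nn

class ElapseInference(nn.Module):
    def __init__(self, slice_dim, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(slice_dim, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, 1)        # regress the elapsed time

    def forward(self, slice_a, slice_b):
        z = torch.cat([self.encoder(slice_a), self.encoder(slice_b)], dim=-1)
        return self.head(z).squeeze(-1)

def sample_pretext_pair(series, slice_len):
    """series: 1-D tensor of popularity counts; returns two slices and their time gap."""
    t1, t2 = sorted(random.sample(range(len(series) - slice_len), 2))
    return series[t1:t1 + slice_len], series[t2:t2 + slice_len], float(t2 - t1)

# Pre-training (sketch): minimise MSE between the predicted and true gap, then
# fine-tune the shared encoder on the downstream popularity-prediction task.
```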
Supervised Contrastive Learning for Product Matching
Contrastive learning has seen increasing success in the fields of computer vision and information retrieval in recent years. This poster presents the first work that applies contrastive learning to the task of product matching in e-commerce using product offers from different e-shops. More specifically, we employ a supervised contrastive learning technique to pre-train a Transformer encoder, which is afterwards fine-tuned for the matching problem using pair-wise training data. We further propose a source-aware sampling strategy which enables contrastive learning to be applied in use cases where the training data does not contain product identifiers. We show that applying supervised contrastive pre-training in combination with source-aware sampling significantly improves the state-of-the-art performance on several widely used benchmark datasets: for Abt-Buy, we reach an F1 of 94.29 (+3.24 compared to the previous state of the art); for Amazon-Google, 79.28 (+3.7). For the WDC Computers datasets, we reach improvements
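For reference, a standard supervised contrastive (SupCon-style) loss of the kind the pre-training step uses is sketched below: offers of the same product are pulled together, others pushed apart. The temperature and batching details are assumptions, and the source-aware sampling is only indicated in a comment.

```python
# SupCon-style supervised contrastive loss over a batch of offer embeddings.
import torch
import torch.nn.functional as F

def supcon_loss(embeddings, product_ids, temperature=0.07):
    """embeddings: (n, d) encoder outputs; product_ids: (n,) product identity per offer."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = ((product_ids.unsqueeze(0) == product_ids.unsqueeze(1)) & ~self_mask).float()
    sim = sim.masked_fill(self_mask, float("-inf"))          # exclude self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)          # avoid -inf * 0 on the diagonal
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0                                    # anchors that have positives
    per_anchor = (log_prob * pos_mask).sum(dim=1)[valid] / pos_counts[valid]
    return -per_anchor.mean()

# Source-aware sampling (assumed): build each batch so that an anchor's positives come
# from *other* e-shops, encouraging shop-invariant offer representations.
```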
VisPaD: Visualization and Pattern Discovery for Fighting Human Trafficking
Human trafficking analysts investigate related online escort advertisements (called micro-clusters) to detect suspicious activities and identify various modus operandi. This task is complex, as it requires finding patterns and linked meta-data across micro-clusters, such as the geographical spread of ads, cluster sizes, etc. Additionally, drawing insights from the data is challenging without visualizing these micro-clusters. To address this, in close collaboration with domain experts, we built VisPaD, a novel interactive tool for characterizing and visualizing micro-clusters and their associated meta-data, all in one place. VisPaD helps discover underlying patterns in the data by projecting micro-clusters into a lower-dimensional space. It also allows the user to select micro-clusters involved in suspicious patterns and interactively examine them, leading to faster detection and identification of trends in the data. A demo of VisPaD is also released.
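As a minimal illustration of the projection step, the sketch below embeds micro-cluster meta-data features into 2-D for visual inspection. PCA is used here purely as a stand-in; VisPaD's actual projection method and feature set are not specified in the abstract.

```python
# Illustrative 2-D projection of micro-cluster meta-data for visualization.
import numpy as np
from sklearn.decomposition import PCA

def project_microclusters(feature_matrix):
    """feature_matrix: (n_microclusters, n_metadata_features), e.g. size, geo spread."""
    coords = PCA(n_components=2).fit_transform(np.asarray(feature_matrix, dtype=float))
    return coords  # plot coords[:, 0] vs coords[:, 1], coloured by suspected pattern
```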
Towards Preserving Server-Side Privacy of On-Device Models
Machine learning-based predictions are popular in many applications including healthcare, recommender systems and finance. More recently, the development of low-end edge hardware (e.g., Apple's Neural Engine and Intel's Movidius VPU) has provided a path for the proliferation of machine learning on the edge with on-device modeling. Modeling on the device reduces latency and helps maintain the user's privacy. However, on-device modeling can leak private server-side information. In this work, we investigate on-device machine learning models that are used to provide a service and propose novel privacy attacks that can leak sensitive proprietary information of the service provider. We demonstrate that different adversaries can easily exploit such models to maximize their profit and accomplish content theft. Motivated by the need to preserve both client and server privacy, we present preliminary ideas on thwarting such attacks.
Demo: PhishChain: A Decentralized and Transparent System to Blacklist Phishing URLs
Blacklists are a widely used Internet security mechanism that protects Internet users from financial scams, malicious web pages, and other cyber attacks based on blacklisted URLs. In this demo, we introduce PhishChain, a transparent and decentralized system for blacklisting phishing URLs. At present, public/private domain blacklists, such as PhishTank, CryptoScamDB, and APWG, are maintained by a centralized authority but operate in a crowdsourcing fashion to create a manually verified blacklist periodically. In addition to being a single point of failure, the blacklisting process utilized by such systems is not transparent. We utilize blockchain technology to support transparency and decentralization, where no single authority controls the blacklist and all operations are recorded in an immutable distributed ledger. Further, we design a PageRank-based truth discovery algorithm to assign a phishing score to each URL based on crowd-sourced assessment of URLs. As an incentive for voluntary participation,
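To give a feel for crowd-sourced truth discovery, here is a generic iterative scheme: URL scores are updated from reliability-weighted votes, and voter reliabilities are updated from agreement with the current scores. This simple iteration is only an illustration and is not PhishChain's PageRank-based algorithm.

```python
# Generic truth-discovery iteration over crowd-sourced phishing votes.
def truth_discovery(votes, n_iters=20):
    """votes: dict {(voter, url): 1 if voted phishing else 0}."""
    voters = {v for v, _ in votes}
    urls = {u for _, u in votes}
    reliability = {v: 1.0 for v in voters}
    score = {u: 0.5 for u in urls}
    for _ in range(n_iters):
        # URL scores: reliability-weighted average of the votes on each URL.
        for u in urls:
            rel_votes = [(reliability[v], label) for (v, uu), label in votes.items() if uu == u]
            total = sum(r for r, _ in rel_votes)
            score[u] = sum(r * l for r, l in rel_votes) / total if total else 0.5
        # Voter reliability: average agreement with the current URL scores.
        for v in voters:
            agrees = [1 - abs(label - score[u]) for (vv, u), label in votes.items() if vv == v]
            reliability[v] = sum(agrees) / len(agrees)
    return score, reliability
```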
Mikaela Poulymenopoulou, Scientific Officer at ERCEA
Marlon Dumas, former ERC panel member/panel chair and ERC grantee
Aristides Gionis, ERC grantee
Stefano Leonardi, ERC grantee
Q & A