Health on the Web

List of accepted papers:

  • Did You Really Just Have a Heart Attack?: Towards Robust Detection of Personal Health Mentions in Social Media
    Authors: Payam Karisani and Eugene Agichtein

    Keywords: Social Media Classification, Health Tracking in Social Media, Representation learning for text classification

    Millions of users share their experiences on social media sites, such as Twitter, which in turn generates potentially valuable data for public health monitoring, digital epidemiology, and other analyses of population health at global scale. The first, basic, task for any of these applications is classifying whether a personal health event was mentioned. Thus, identifying actual personal health mentions (PHM) is critical. This task is challenging for many reasons, including the typically short length of posts, inventive spelling and lexicons, and figurative language, including hyperbole using diseases like “heart attack” or “cancer” for emphasis and not as a health self-report. This problem is even more challenging for rarely reported, or frequent but ambiguously expressed, conditions, such as “stroke”. To address this problem, we propose a general, robust method for detecting PHMs in social media, which we call WESPAD, that combines lexical, syntactic, word embedding-based, and context-based features. WESPAD is able to generalize from few examples by automatically distorting the word embeddings to most effectively detect the true health mentions. Unlike previously proposed state-of-the-art supervised and deep learning techniques, WESPAD requires relatively little training data, which makes it possible to adapt with minimal effort to each new disease and condition. We evaluate WESPAD on both an established publicly available Flu detection benchmark, and on a new dataset that we have constructed with mentions of multiple health conditions, which we plan to share with the research community. Our experiments show that WESPAD consistently and significantly improves upon baseline methods in a variety of settings, and consistently outperforms all state-of-the-art deep neural network methods. Most importantly, we show that WESPAD outperforms the lexical baseline by a large margin when the number and proportion of true health mentions in the training data are small. These properties make our method particularly valuable for extending online public health methods to an ever expanding set of diseases and conditions.

  • Multi-Task Pharmacovigilance Mining from Social Media Posts
    Authors: Shaika Chowdhury, Chenwei Zhang and Philip S. Yu

    Keywords: Multi-Task Learning, Pharmacovigilance, Adverse Drug Reaction, Attention Mechanism, Coverage, Recurrent Neural Network, Social Media

    Social media has grown to be a crucial information source for pharmacovigilance studies, as an increasing number of people post about previously unreported adverse reactions to medical drugs. Aiming to effectively monitor various aspects of Adverse Drug Reactions (ADRs) from diversely expressed social media posts, we propose a multi-task neural network framework that collectively learns several tasks associated with ADR monitoring under different levels of supervision. Besides correctly classifying ADR posts and accurately extracting ADR mentions from online posts, the proposed framework can also identify the reasons for which the drug is being taken, known as ‘indications’, from the given social media post. A coverage-based attention mechanism is adopted in our framework to help the model properly identify ‘phrasal’ ADRs and Indications that are attentive to multiple words in a post. Our framework is applicable in situations where limited parallel data for different pharmacovigilance tasks are available. We evaluate the proposed framework on real-world Twitter datasets, where the proposed model consistently outperforms the state-of-the-art alternatives for each individual task.

  • Multi-Task Learning Improves Disease Models from Web Search
    Authors: Bin Zou, Vasileios Lampos and Ingemar Cox

    Keywords: Web Search, User-Generated Content, Disease Surveillance, Multi-Task Learning, Regularized Regression, Gaussian Processes

    We investigate the utility of multi-task learning for disease surveillance using Web search data. Our motivation is two-fold. Firstly, we assess whether concurrently training models for various geographies – inside a country or across different countries – can improve accuracy. Secondly, we test the ability of such models to assist health systems that produce only sporadic disease surveillance reports, which reduces the quantity of available training data. We explore both linear and nonlinear models, specifically a multi-task expansion of elastic net and a multi-task Gaussian Process, and compare them to their respective single-task formulations. We use influenza-like illness as a case study and conduct experiments on the United States (US) as well as England, where both health and Google search data were obtained. Our empirical results indicate that multi-task learning improves regional as well as national models for the US. The percentage of improvement on mean absolute error increases up to 14.8% as the historical training data is reduced from 5 years to 1 year, illustrating that accurate models can be obtained even by training on relatively short time intervals. Furthermore, in simulated scenarios, where only a few health reports (training data) are available, we show that multi-task learning helps to maintain a stable performance across all the affected locations. Finally, we present results from a cross-country experiment, where data from the US improves the estimates for England. As the historical training data for England is reduced, the benefits of multi-task learning increase, reducing mean absolute error by up to 40%.
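
    The abstract above reports gains as a percentage improvement in mean absolute error (MAE). As a minimal sketch of how such a figure is commonly computed, the snippet below assumes the improvement is the relative MAE reduction of one model over another; the abstract does not spell out its exact formula. The function names and all numbers are hypothetical and are not taken from the paper.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |truth - estimate| over paired observations."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def percent_improvement(mae_baseline, mae_model):
    """Relative MAE reduction of the model over the baseline, in percent.

    Assumption: this is the usual definition of "percentage improvement";
    the paper may define it differently.
    """
    return 100.0 * (mae_baseline - mae_model) / mae_baseline


# Made-up illness rates and estimates, for illustration only.
observed = [1.0, 2.0, 4.0, 3.0]
single_task = [1.5, 2.5, 3.0, 3.5]     # hypothetical single-task estimates
multi_task = [1.25, 2.25, 3.5, 3.25]   # hypothetical multi-task estimates

mae_single = mean_absolute_error(observed, single_task)
mae_multi = mean_absolute_error(observed, multi_task)
improvement = percent_improvement(mae_single, mae_multi)
print(f"MAE single={mae_single}, multi={mae_multi}, improvement={improvement}%")
```

    Under this definition, a positive value means the second model has the lower error; a value of 40 would mean its MAE is 60% of the baseline's.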

  • A Fast Deep Learning Model for Textual Relevance in Biomedical Information Retrieval
    Authors: Sunil Mohan, Nicolas Fiorini, Sun Kim and Zhiyong Lu

    Keywords: Deep Learning, Biomedical Information Retrieval, Learning to Rank

    Publications in the life sciences are characterized by a large technical vocabulary, with many lexical and semantic variations for expressing the same concept. Towards addressing the problem of relevance in biomedical literature search, we introduce a deep learning model for the relevance of a document’s text to a keyword-style query. Limited by a relatively small amount of training data, the model uses pre-trained word embeddings. With these, the model first computes a variable-length Delta matrix between the query and document, representing a difference between the two texts, which is then passed through a deep convolution stage followed by a deep feed-forward network to compute a relevance score. This results in a fast model suitable for use in an online search engine. The model is robust and outperforms comparable state-of-the-art deep learning approaches.

  • Modeling Individual Cyclic Variation in Human Behavior
    Authors: Emma Pierson, Tim Althoff and Jure Leskovec

    Keywords: cycle modeling, cyclical data, cycles, time series, HMMs, human activity data, user clustering

    Cycles are fundamental to human health and behavior. Examples include mood cycles, circadian rhythms, and the menstrual cycle. However, modeling cycles in time series data is challenging because in most cases the cycles are not labeled or directly observed and need to be inferred from multidimensional measurements taken over time. Here, we present Cyclic Hidden Markov Models (CyHMMs) for detecting and modeling cycles in a collection of multidimensional heterogeneous time series data. In contrast to previous cycle modeling methods, CyHMMs deal with a number of challenges encountered in modeling real-world cycles: they can model multivariate data with both discrete and continuous dimensions; they explicitly model and are robust to missing data; and they can share information across individuals to accommodate variation both within and between individual time series. Experiments on synthetic and real-world health-tracking data demonstrate that CyHMMs infer cycle lengths more accurately than existing methods, with 58% lower error on simulated data and 63% lower error on real-world data compared to the best-performing baseline. CyHMMs can also perform functions which baselines cannot: they can model the progression of individual features/symptoms over the course of the cycle, identify the most variable features, and cluster individual time series into groups with distinct characteristics. Applying CyHMMs to two real-world health-tracking datasets (human menstrual cycle symptoms and physical activity tracking data) yields important insights, including which symptoms to expect at each point during the cycle. We also find that people fall into several groups with distinct cycle patterns, and that these groups differ along dimensions not provided to the model. For example, by modeling missing data in the menstrual cycles dataset, we are able to discover a medically relevant group of birth control users even though information on birth control is not given to the model.

  • Multi-instance Domain Adaptation for Vaccine Adverse Event Detection
    Authors: Junxiang Wang and Liang Zhao

    Keywords: Multi-instance learning, Transfer learning, Vaccine adverse event detection

    Detection of vaccine adverse events is crucial to the discovery and improvement of problematic vaccines. Traditionally, formal reporting systems such as VAERS have supported accurate but delayed surveillance, while more recently social media have been mined for timely but noisy observations. Utilizing the complementary strengths of these two domains to boost detection performance looks very promising, but cannot be effectively achieved by existing methods due to significant differences between their data characteristics, including: 1) formal language vs. informal language, 2) a single message per user vs. multiple messages per user, and 3) one class vs. binary class. In this paper, we propose a novel generic framework named Multi-instance Domain Adaptation (MIDA) to maximize the synergy between these two domains in the vaccine adverse event detection task for social media users. Specifically, we propose a generalized Maximum Mean Discrepancy (MMD) criterion to measure the semantic distances between the heterogeneous messages from these two domains in their shared latent semantic space. These message-level generalized MMD distances are then synthesized into user-level distances by newly proposed mixed instance kernels. Finally, we minimize the distances between the samples of the partially-matched classes from these two domains. To solve the resulting non-convex optimization problem, we develop an efficient algorithm based on the Alternating Direction Method of Multipliers (ADMM), combined with the Convex-Concave Procedure (CCP), to optimize parameters accurately. Extensive experiments demonstrated that our model outperformed the baselines by a large margin under six metrics. Case studies showed that formal reports and the adverse-relevant tweets extracted by MIDA shared similar keyword and description patterns.
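
    The abstract above builds on Maximum Mean Discrepancy (MMD) as a distance between two domains. For orientation, here is a minimal sketch of the standard (not the paper's generalized) squared-MMD estimate with an RBF kernel. The kernel choice, the `gamma` value, and the toy 2-D vectors standing in for message embeddings are all assumptions made for illustration.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples.

    MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)],
    with each expectation replaced by a mean over all sample pairs.
    """
    def mean_kernel(a, b):
        total = sum(rbf_kernel(u, v, gamma) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy 2-D "embeddings" (hypothetical values, not real data).
formal = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
informal_near = [[0.05, 0.05], [0.1, 0.1], [0.0, 0.05]]
informal_far = [[3.0, 3.0], [3.1, 2.9], [2.9, 3.1]]

mmd_same = mmd_squared(formal, formal)
mmd_near = mmd_squared(formal, informal_near)
mmd_far = mmd_squared(formal, informal_far)
print(mmd_same, mmd_near, mmd_far)
```

    A sample compared with itself gives zero, and the estimate grows as the two samples move apart, which is the property a domain adaptation objective exploits when it minimizes such a distance.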

  • Modeling Interdependent and Periodic Real-World Action Sequences
    Authors: Takeshi Kurashima, Tim Althoff and Jure Leskovec
    Presentation moved to track Health on the Web

    Keywords: real-world behavior, human action sequence, real-world actions, periodic behavior, user modeling, activity tracking, activity logging, quantified self, mobile health, point process

    Mobile health applications, including those that track activities such as exercise, sleep, and diet, are becoming widely used. Accurately predicting human actions in the real world is essential for targeted recommendations that could improve our health and for personalization of these applications. However, making such predictions is extremely difficult due to the complexities of human behavior, which consists of a large number of potential actions that vary over time, depend on each other, and are periodic. Previous work has not jointly modeled these dynamics and has largely focused on item consumption patterns instead of broader types of behaviors such as eating, commuting or exercising. In this work, we develop a novel statistical model, called TIPAS, for Time-varying, Interdependent, and Periodic Action Sequences. Our approach is based on personalized, multivariate temporal point processes that model time-varying action propensities through a mixture of Gaussian intensities. Our model captures short-term and long-term periodic interdependencies between actions through Hawkes process-based self-excitations. We evaluate our approach on two activity logging datasets comprising 12 million real-world actions (e.g., eating, sleep, and exercise) taken by 20 thousand users over 17 months. We demonstrate that our approach allows us to make successful predictions of future user actions and their timing. Specifically, TIPAS improves predictions of actions, and their timing, over existing methods across two datasets by up to 156% and 37%, respectively. Performance improvements are particularly large for relatively rare and periodic actions such as walking and biking, improving over baselines by up to 256%. This demonstrates that explicit modeling of dependencies and periodicities in real-world behavior enables successful predictions of future actions, with implications for modeling human behavior, app personalization, and targeting of health interventions.
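
    The abstract above mentions Hawkes process-based self-excitation, in which each past action temporarily raises the rate of future actions. The sketch below shows the textbook univariate Hawkes intensity with an exponential kernel. It is not the TIPAS model, which is multivariate and also includes time-varying Gaussian intensity terms. The parameter names `mu`, `alpha`, `beta` and all values are assumptions chosen for illustration.

```python
import math


def hawkes_intensity(t, history, mu, alpha, beta):
    """Conditional intensity of a univariate Hawkes process at time t.

    lambda(t) = mu + sum over past events s < t of alpha * exp(-beta * (t - s))
    mu: baseline rate; alpha: jump added by each event; beta: decay rate.
    """
    excitation = sum(alpha * math.exp(-beta * (t - s)) for s in history if s < t)
    return mu + excitation


# Hypothetical event times (e.g., hours at which an action was logged).
events = [1.0, 2.0]
params = dict(mu=0.2, alpha=0.8, beta=1.0)

before = hawkes_intensity(0.5, events, **params)      # no past events yet
just_after = hawkes_intensity(2.001, events, **params)  # shortly after an event
later = hawkes_intensity(10.0, events, **params)      # excitation has decayed
print(before, just_after, later)
```

    Before any event the intensity equals the baseline `mu`; it jumps after each event and decays back toward `mu`, which is how a self-exciting term can capture short-term dependence between consecutive actions.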