Text Mining
Wednesday, 4:00–5:30 PM
Chair: Deepayan Chakrabarti
Cross-Domain Sentiment Classification via Spectral Feature Alignment
Sinno Pan, Xiaochuan Ni, Jiantao Sun, Qiang Yang, Zheng Chen
User sentiment data are widespread on the Web. However, when users publish sentiment data (e.g., reviews, blogs), they may not explicitly indicate their sentiment polarity (e.g., positive or negative). Sentiment classification, which automatically predicts the sentiment category, is critical in many applications, particularly when the volume of user text is large. Although traditional classification algorithms can be used to train sentiment classifiers from manually labeled text data, the labeling work can be time-consuming and expensive. Meanwhile, users often use different words when they express sentiment in different domains. In this work, we develop a general solution for sentiment classification when we do not have any labels in a target domain. We solve this problem by leveraging knowledge from a different domain that does have labels, which we call the source domain. In this cross-domain sentiment classification setting, if we directly apply a classification model trained in the source domain to the target domain, performance will be very low due to the differences between the domains. To bridge the gap between the domains, we propose a spectral feature alignment (SFA) algorithm that aligns domain-specific words from different domains into unified clusters, using domain-independent words as a bridge. In this way, the clusters reduce the gap between the domain-specific words of the two domains, so that sentiment classifiers for the target domain can be trained accurately. Compared to previous approaches to cross-domain sentiment classification, SFA is both general and optimized, since it discovers a robust representation for cross-domain data by fully exploiting the relationship between the domain-specific and domain-independent words. We perform extensive experiments on two real-world datasets, and demonstrate that SFA outperforms previous approaches for cross-domain sentiment classification.
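The bridging idea in this abstract can be illustrated with a small sketch. It is not the SFA algorithm itself: the spectral step is omitted, and plain cosine similarity over co-occurrence profiles stands in for it. The hand-picked pivot set, the toy reviews, and the function names `cooccurrence_profiles` and `cosine` are all assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt

def cooccurrence_profiles(docs, pivots):
    """Map each non-pivot word to counts of the pivot words it shares a document with."""
    profiles = defaultdict(lambda: defaultdict(int))
    for doc in docs:
        words = set(doc.lower().split())
        doc_pivots = words & pivots
        for word in words - pivots:
            for pivot in doc_pivots:
                profiles[word][pivot] += 1
    return profiles

def cosine(p, q):
    """Cosine similarity between two sparse count profiles."""
    dot = sum(p[k] * q[k] for k in p if k in q)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# Assumption: domain-independent (pivot) words are chosen by hand here.
pivots = {"excellent", "terrible", "good"}

# Toy reviews from a hypothetical source domain (cameras) and target domain (games).
source_docs = ["compact camera excellent", "blurry lens terrible", "compact lens good"]
target_docs = ["realistic game excellent", "boring game terrible", "realistic story good"]

profiles = cooccurrence_profiles(source_docs + target_docs, pivots)

# "compact" and "realistic" never co-occur, yet they share the same pivot words,
# so their profiles line up; "compact" and "boring" share none.
sim_aligned = cosine(profiles["compact"], profiles["realistic"])
sim_opposed = cosine(profiles["compact"], profiles["boring"])
print(sim_aligned, sim_opposed)
```

In this toy data, words from different domains that carry the same polarity end up with similar profiles because they co-occur with the same domain-independent words, which is the role the abstract assigns to the bridge.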
Highlighting Disputed Claims on the Web
Rob Ennals, Beth Trushkowsky, John Mark Agosta
We describe Dispute Finder, a browser extension that alerts a user when information they read online is disputed by a source that they might trust. Dispute Finder examines the text on the page that the user is browsing and highlights any phrases that appear to entail claims in its database of known disputed claims. If a user clicks on a highlighted phrase, Dispute Finder shows the user a summary of articles that support other points of view. Dispute Finder builds its database of disputed claims by crawling web sites that already maintain lists of disputed claims, and by allowing users to enter claims that they believe are disputed. Dispute Finder identifies instances of disputed claims by running a simple textual entailment algorithm inside the browser extension, referring to a cached local copy of a subset of our claim database. Performing these tasks well is a hard problem, and we do not yet claim to have an implementation that is good enough to be compelling for most users. We do, however, believe that Dispute Finder attacks an interesting problem that, if addressed well, could significantly improve the utility of the web.
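The abstract does not say how the entailment check works, so the following is only a rough sketch of one simple possibility: flag a sentence when most of a claim's content words appear in it. The stopword list, the 0.8 threshold, the example claims, and the names `content_words` and `find_disputed` are assumptions, not details of Dispute Finder.

```python
import re

# Assumption: a tiny hand-written stopword list, enough for the example below.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "that", "in"}

def content_words(text):
    """Lowercase the text and keep the set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOPWORDS}

def find_disputed(sentences, claims, threshold=0.8):
    """Return (sentence, claim) pairs where the sentence covers enough of the claim's words."""
    hits = []
    for sentence in sentences:
        sent_words = content_words(sentence)
        for claim in claims:
            claim_words = content_words(claim)
            if not claim_words:
                continue
            coverage = len(claim_words & sent_words) / len(claim_words)
            if coverage >= threshold:
                hits.append((sentence, claim))
    return hits

# Hypothetical local cache of disputed claims and hypothetical page text.
claims = [
    "the Great Wall of China is visible from space",
    "humans use only 10 percent of their brains",
]
page = [
    "Many people say the Great Wall of China is visible from space.",
    "The weather was nice today.",
]

hits = find_disputed(page, claims)
print(hits)
```

Word overlap of this kind misses paraphrases and ignores negation, which fits the abstract's own caution that doing this well is a hard problem.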
Topic Initiator Detection on the World Wide Web
Xin Jin, Scott Spangler, Rui Ma, Jiawei Han
In this paper we introduce a new Web mining and search technique: Topic Initiator Detection (TID) on the Web. Given a topic query on the Internet and a collection of time-stamped web document results, each of which contains the query keywords, the task of TID is to automatically return the web document (or its author) that initiated the topic or was the first to discuss it. To deal with the TID problem, we design a system framework and propose the InitRank algorithm, which ranks the web documents by their likelihood of being the topic initiator. InitRank is based on features extracted from the web documents, such as time, content, and link information. Experiments show that, compared with baseline methods such as direct time sorting and the well-known link-based ranking algorithms PageRank and HITS, InitRank achieves the best overall performance. In case studies, we successfully detected (1) the origin of a famous rumor about an Australian product and (2) the leak of the IBM and Google Cloud Computing collaboration plan before the official announcement date.
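To make the contrast with the time-sorting baseline concrete, here is a minimal sketch. `time_sort` is the direct time sorting baseline named in the abstract. `toy_initiator_score` is not InitRank: it is a hypothetical weighted sum of one time, one content, and one link feature, with made-up weights, feature definitions, and example documents.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Doc:
    url: str
    published: date
    relevance: float  # hypothetical content feature in [0, 1]
    inlinks: int      # hypothetical link feature: links from other results

def time_sort(docs):
    """Baseline: the earliest document is taken to be the initiator."""
    return sorted(docs, key=lambda d: d.published)

def toy_initiator_score(doc, docs, weights=(0.5, 0.3, 0.2)):
    """Illustrative score mixing earliness, topical relevance, and link share."""
    earliest = min(d.published for d in docs)
    latest = max(d.published for d in docs)
    span_days = (latest - earliest).days or 1
    earliness = 1.0 - (doc.published - earliest).days / span_days
    max_links = max(d.inlinks for d in docs) or 1
    link_share = doc.inlinks / max_links
    w_time, w_content, w_link = weights
    return w_time * earliness + w_content * doc.relevance + w_link * link_share

def toy_rank(docs):
    """Rank documents by the toy score, highest first."""
    return sorted(docs, key=lambda d: toy_initiator_score(d, docs), reverse=True)

# Hypothetical search results for one topic query.
docs = [
    Doc("a.example/early-mention", date(2009, 1, 1), 0.1, 0),  # early, barely on topic
    Doc("b.example/origin", date(2009, 1, 5), 0.9, 8),         # slightly later, widely linked
    Doc("c.example/repost", date(2009, 2, 1), 0.8, 2),
]

print([d.url for d in time_sort(docs)])
print([d.url for d in toy_rank(docs)])
```

In this example the baseline picks the early page that only mentions the keywords in passing, while the combined score prefers the page that is both early and heavily referenced. That is one way pure time sorting can go wrong.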