The Pennsylvania State University, IST Building, University Park, PA 16802
The Pennsylvania State University, Thomas Building, University Park, PA 16802
The Pennsylvania State University, IST Building, University Park, PA 16802
Categories and Subject Descriptors: H.4 [Information Systems Applications]: Miscellaneous; G.3 [Probability and Statistics]: Models; J.4 [Social and Behavioral Sciences]
General Terms: Algorithms, Experimentation, Human Factors
An important characteristic of all SNs is the community graph structure: how social actors gather into groups such that they are intra-group close and inter-group loose [13]. An illustration of a simple two-community SN is sketched in Fig. 1. Here each node represents a social actor in the SN and different node shapes represent different communities. Two nodes share an edge if and only if a relationship exists between them according to social definitions such as their role or participation in the social network. Connections in this case are binary.
Discovering community structures from general networks is of obvious interest. Early methods include graph partitioning [8] and hierarchical clustering [19,25]. Recent algorithms [6,14,2] address several problems related to prior knowledge of community size, the precise definition of inter-vertex similarity measures, and computational efficiency [13]. They have been applied successfully to various areas such as email networks [22] and the Web [5]. Semantic similarity between Web pages can be measured using human-generated topical directories [9]. In general, semantic similarity in SNs is the meaning or reason behind the network connections.
For the extraction of community structures from email corpora [22,3], the social network is usually constructed by measuring the intensity of contacts between email users. In this setting, every email user is a social actor, modeled as a node in the SN. An edge between two nodes indicates that the email communication between them exceeds a certain frequency threshold.
However, discovering a community based purely on communication intensity becomes problematic in some scenarios. (1) Consider a spammer in an email system who sends out a large number of messages: there will be an edge between every user and the spammer, which in principle poses a problem for all topology-based community discovery methods. (2) Aside from the possible bias in network topology due to unwanted communication, existing methods also suffer from a lack of semantic interpretation. Given a group of email users discovered as a community, a natural question is: why do these users form a community? Purely graphical methods based on network topology, without considering semantics, fall short of answering such questions.
Consider other ways a community can be established, e.g. Fig. 2. Based on communication intensity alone, the two persons in question belong to two different communities, denoted by squares and circles, as determined by a simple graph partitioning. However, because the semantics of the documents in their communications are ignored, their common interests (denoted by the dashed line) are not considered in traditional community discovery.
In this paper, we examine the inner community property within SNs by analyzing the semantically rich information, such as emails or documents. We approach the problem of community detection using a generative Bayesian network that models the generation of communication in an SN. As suggested in established social science theory [23], we consider the formation of communities as resulting from the similarity among social actors. The generative models we propose introduce such similarity as a hidden layer in the probabilistic model.
As a study of social networks parallel to sociological approaches, our method advances existing algorithms by not relying exclusively on the intensity of contacts. Our approach provides topic tags for every community and its corresponding users, giving a semantic description of each community. We test our method on a newly disclosed benchmark email corpus, the Enron email dataset, and compare it with an existing method.
The outline of this paper is as follows. In Section 2, we introduce the previous work on which our models are built and explain the community detection problem along the lines of probabilistic modeling. Section 3 describes our Community-User-Topic (CUT) models. In Section 4 we introduce the algorithms of Gibbs sampling and EnF-Gibbs sampling (Gibbs sampling with entropy filtering). Experimental results are presented in Section 5. We conclude and discuss future work in Section 6.
Figure 3: Generative models in plate notation: (a) Topic-Word (LDA), (b) Author-Word, (c) Author-Topic.
Given a set of documents $D$, each consisting of a sequence of words of size $N_d$, the generation of each word for a specific document $d$ can be modeled from the perspective of either author or topic, or the combination of both. Fig. 3 illustrates the three possibilities using plate notation. Let $w$ denote a specific word observed in document $d$; $T$ and $A$ represent the number of topics and authors; $\mathbf{a}_d$ is the observed set of authors for $d$. Note that the latent variables are light-colored while the observed ones are shaded. Fig. 3(a) models documents as generated by a mixture of topics [1]. The prior distributions of topics and words follow Dirichlets parameterized respectively by $\alpha$ and $\beta$. Each topic is a probabilistic multinomial distribution over words. Let $\phi$ denote the topics' distributions over words and $\theta$ the documents' distributions over topics.
In the Topic-Word model, a document is considered a mixture of topics. Each topic corresponds to a multinomial distribution over the vocabulary. An observed word $w$ in document $d$ is considered to be drawn from the word distribution $\phi_t$ specific to topic $t$, while the topic $t$ is itself drawn from the document-specific topic distribution $\theta_d$, usually a row in the matrix $\Theta$.
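In this notation, the per-word likelihood of the Topic-Word model is the familiar LDA mixture (shown here for reference, with $\phi_{t,w}$ and $\theta_{d,t}$ the entries of the matrices just defined):
$$P(w \mid d) \;=\; \sum_{t=1}^{T} P(w \mid t)\,P(t \mid d) \;=\; \sum_{t=1}^{T} \phi_{t,w}\,\theta_{d,t}.$$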
Similar to the Topic-Word model, an Author-Word model prioritizes the authors' interests as the origin of a word [11]. In Fig. 3(b), $\mathbf{a}_d$ is the author set that composes document $d$. Each word in $d$ is chosen from the author-specific distribution over words. Note that in this Author-Word model, the author responsible for a given word is chosen at random from $\mathbf{a}_d$.
An influential work following this line [21] introduces the Author-Topic model, which combines the Topic-Word and Author-Word models and regards the generation of a document as affected by both factors in a hierarchical manner. Fig. 3(c) presents the hierarchical Bayesian structure.
According to the Author-Topic model in Fig. 3(c), for each observed word $w$ in document $d$, an author $x$ is drawn uniformly from the corresponding author group $\mathbf{a}_d$. Then, with $\theta_x$, the probability distribution over topics conditioned on $x$, a topic $t$ is generated. Finally, $\phi_t$ produces the word $w$ as observed in document $d$.
The Author-Topic model has been shown to perform well for document content characterization because it involves two essential factors in producing a general document: the author and the topic. Modeling both factors as variables in the Bayesian network provides the model with capacity to group the words used in a document corpus into semantic topics. Based on the posterior probability obtained after the network is set up, a document can be denoted as a mixture of topic distributions. In addition, each author's preference of using words and involvement in topics can be discovered.
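For comparison, the per-word likelihood implied by the Author-Topic model averages uniformly over the document's authors and then mixes over topics (a standard way to write the model of [21]):
$$P(w \mid \mathbf{a}_d, \Theta, \Phi) \;=\; \frac{1}{|\mathbf{a}_d|} \sum_{x \in \mathbf{a}_d} \sum_{t=1}^{T} \theta_{x,t}\,\phi_{t,w}.$$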
The estimation of the Bayesian networks in the aforementioned models typically relies on the observed pairs of authors and words in documents. Each word is treated as an instance generated following the probabilistic hierarchy in the models. Some layers in the Bayesian hierarchy are observed, such as the authors and words; other layers are hidden and must be estimated, such as the topics.
Communication in SNs usually occurs by exchanging documents, such as emails, instant messages or posts on message boards [15]. Such content-rich documents naturally serve as an indicator of the innate semantics in the communication within an SN. Consider an information scenario in which all communication relies on email. Such email documents usually reflect nearly every aspect of, and reason for, the communication. For example, the recipient list records the social actors associated with an email, and the message body stores the topics they are interested in.
We define such a document carrier of communication as a communication document. Our main contribution is reducing the problem of modeling SN communication to modeling the generation of communication documents, based on whose features the social actors associate with each other.
Modeling communication based on communication documents takes into consideration the semantic information of the documents as well as the interactions among social actors. Many features of the SN can be revealed from the parameterized models, such as the leader-follower relation [12]. Using such models, we can avoid the effect of meaningless communication documents, such as those generated by a network spammer, in producing communities.
Our models accentuate the impact of community on the SN communications by introducing community as a latent variable in the generative models for communication documents. One direct application of the models is semantic community detection from SNs. Rather than studying network topology, we address the problem of community exploration and generation in SNs following the line of aforementioned research in probabilistic modeling.
We study the community structure of an SN by modeling the communication documents among its social actors. The format of communication document we model is email, because emails embody valuable information regarding shared knowledge and the SN infrastructure [22].
Our Community-User-Topic (CUT) model builds on the Author-Topic model. However, modeling a communication document involves more factors than the combination of authors and topics.
Serving as an information carrier for communication, a communication document is usually generated to share some information within a group of individuals. Unlike publication documents such as technical reports or journal papers, communication documents are inaccessible to people who are not on the recipient list. The issuing of a communication document both indicates the activities of, and is conditioned on, the community structure within an SN. Therefore we consider the community as an extra latent variable in the Bayesian network, in addition to the author and topic variables. By doing so, we guarantee that the issuing of a communication document is purposeful in terms of the existing communities. As a result, the communities in an SN can be revealed and are also semantically explainable.
We use generative Bayesian networks to simulate the generation of emails in SNs. Two versions of CUT are proposed, differing in how they weight the impact of a community on users and topics.
We first consider an SN community as no more than a group of users. This notion is similar to that assumed in topology-based methods. For a specific topology-based graph partitioning algorithm such as [13], the connection between two users can simply be weighted by the frequency of their communication. In our first model, CUT$_1$, we treat each community as a multinomial distribution over users. Each user $u$ is associated with a conditional probability $P(u \mid c)$ which measures the degree to which $u$ belongs to community $c$. The goal is therefore to find, for each community, the conditional probability of every user given that community. Users can then be tagged with a set of topics, each of which is a distribution over words. A community discovered by CUT$_1$ typically has the structure shown in Fig. 8.
Fig. 4 presents the hierarchy of the Bayesian network for CUT$_1$. We use the same notation as in the Author-Topic model: $\alpha$ and $\beta$ parameterize the prior Dirichlets for topics and words. For each community $c$, let there be a multinomial distribution over users, with a Dirichlet prior parameterized by $\gamma$. Let the prior probabilities over communities be uniform. Let $C$, $U$ and $T$ denote the numbers of communities, users and topics.
Typically, an email message is generated in four steps: (1) there is a need for a community $c$ to issue an act of communication by sending an email $d$; (2) a user $u$ is chosen from $c$, as observed in the recipient list of $d$; (3) $d$ is presented for $u$ to read because it concerns a topic $t$, which is drawn from $u$'s conditional distribution over topics; (4) given topic $t$, a word $w$ is created in $d$. By iterating the same procedure, an email message is composed word by word.
Note that the user $u$ is not necessarily the composer of the message in our models. This differs from the existing literature, which assumes $u$ is the author of the document. Our assumption is that a user is concerned with any word in a communication document as long as the user is on the recipient list.
To compute $P(c, u, t \mid w)$, the posterior probability of assigning each word $w$ to a certain community $c$, user $u$ and topic $t$, consider the joint distribution of all variables in the model.
Theoretically, the conditional probability $P(c, u, t \mid w)$ can be computed from this joint distribution $P(c, u, t, w)$.
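Following the four generative steps above and assuming a uniform prior over communities, the joint distribution factorizes along the community-user-topic-word chain; the following is a reconstruction sketch of the factorization that Eq. 1 presumably expresses:
$$P(c, u, t, w) \;=\; P(c)\,P(u \mid c)\,P(t \mid u)\,P(w \mid t).$$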
A possible side-effect of CUT$_1$, which considers a community solely as a multinomial distribution over users, is that it relaxes the community's impact on the generated topics. Intrinsically, a community forms because its users communicate frequently and, in addition, share common topics in their discussions. In CUT$_1$, where the community only generates users and the topics are generated conditioned on users, this relaxation is propagated, leading to a loose connection between community and topic. We will see in the experiments that the communities discovered by CUT$_1$ are similar to those found by the topology-based algorithm proposed in [13].
As illustrated in Fig. 5, each word $w$ observed in email $d$ is finally chosen from the multinomial distribution of a user $u$, who is on the recipient list of $d$. Before that, $u$ is sampled from the multinomial distribution associated with a topic $t$, and $t$ is drawn from community $c$'s distribution over topics.
Analogously, the products of CUT$_2$ are a set of conditional probabilities $P(t \mid c)$ that determine which topics are most likely to be discussed in community $c$. Given the group of topics associated with a community, the users who refer to each topic $t$ can be discovered by measuring $P(u \mid t)$.
CUT$_2$ differs from CUT$_1$ in strengthening the relation between community and topic. In CUT$_2$, semantics play a more important role in the discovery of communities. Similar to CUT$_1$, the side-effect of advancing topics in the generative process might lead to loose ties between community and users. An obvious phenomenon with CUT$_2$ is that some users are grouped into the same community when they share common topics even if they correspond rarely. This leads to different scenarios for which the two CUT models are most appropriate: with CUT$_1$, users who communicate frequently tend to be grouped into the same communities, while CUT$_2$ accentuates the topic similarities between users even if their communication is infrequent.
Derived from Fig. 5, we define in CUT$_2$ the joint distribution of community $c$, user $u$, topic $t$ and word $w$, as sketched below.
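Reading Fig. 5 along the same lines (the community generates a topic, the topic a user from the recipient list, and the user the word), a natural form of this joint distribution, again assuming a uniform community prior, is:
$$P(c, u, t, w) \;=\; P(c)\,P(t \mid c)\,P(u \mid t)\,P(w \mid u).$$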
Let us see how these models can be used to discover the communities consisting of users and topics. Consider the conditional probability $P(c, u, t \mid w)$: a word $w$ associates three variables, community, user and topic. Our interpretation of the semantic meaning of $P(c, u, t \mid w)$ is the probability that word $w$ is generated by user $u$ under topic $t$ in community $c$.
Unfortunately, this conditional probability cannot be computed directly. By Bayes' rule we have:
$$P(c, u, t \mid w) \;=\; \frac{P(c, u, t, w)}{\sum_{c}\sum_{u}\sum_{t} P(c, u, t, w)}. \qquad (3)$$
Consider the denominator in Eq. 3: summing over all $c$, $u$ and $t$ makes the computation impractical in terms of efficiency. In addition, as shown in [7], the sum does not factorize, which makes manipulation of the denominator difficult. In the following section, we show how the approximate approach of Gibbs sampling provides a solution to these problems. A faster algorithm, EnF-Gibbs sampling, will also be introduced.
Gibbs sampling was first introduced to estimate the Topic-Word model in [7]. In Gibbs sampling, a Markov chain is formed, the transitions between successive states of which are simulated by repeatedly drawing a topic for each observed word from its conditional probability given all other variables. In the Author-Topic model, the algorithm goes over all documents word by word. For each word $w_i$, the topic $t_i$ and the author $x_i$ responsible for this word are assigned based on the posterior probability conditioned on all other variables: $P(t_i, x_i \mid w_i, \mathbf{t}_{-i}, \mathbf{x}_{-i}, \mathbf{w}_{-i}, \mathbf{a}_d)$. Here $t_i$ and $x_i$ denote the topic and author assigned to $w_i$, while $\mathbf{t}_{-i}$ and $\mathbf{x}_{-i}$ are all other assignments of topics and authors excluding the current instance; $\mathbf{w}_{-i}$ represents the other observed words in the document set, and $\mathbf{a}_d$ is the observed author set of the document.
A key issue in using Gibbs sampling for distribution approximation is the evaluation of the conditional posterior probability. In the Author-Topic model, given $T$ topics and $W$ words, $P(t_i, x_i \mid w_i, \mathbf{t}_{-i}, \mathbf{x}_{-i}, \mathbf{w}_{-i}, \mathbf{a}_d)$ is estimated by a pair of Dirichlet-smoothed count ratios (Eqs. 4 and 5).
The transformation from Eq. 4 to Eq. 5 drops the conditioning variables that the current word does not depend on, because each instance of $w_i$ is assumed independent of the other words in a message.
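The resulting update has the familiar count-ratio form of collapsed Gibbs sampling for the Author-Topic model [21,7]; writing $n^{WT}_{wt}$ for the number of times word $w$ has been assigned to topic $t$ and $n^{AT}_{xt}$ for the number of times topic $t$ has been assigned to author $x$, both excluding the current instance, Eq. 5 is, up to notation:
$$P(t_i = t, x_i = x \mid w_i = w, \mathbf{t}_{-i}, \mathbf{x}_{-i}, \mathbf{w}_{-i}, \mathbf{a}_d) \;\propto\; \frac{n^{WT}_{wt} + \beta}{\sum_{w'} n^{WT}_{w't} + W\beta} \cdot \frac{n^{AT}_{xt} + \alpha}{\sum_{t'} n^{AT}_{xt'} + T\alpha}.$$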
The framework of Gibbs sampling is illustrated in Fig. 6. Given as input the set of users, the set of email documents, the number of desired topics $T$ and the number of desired communities $C$, the algorithm starts by randomly assigning each word to a community, user and topic. A Markov chain is constructed to converge to the target distribution. In each trial of this Monte Carlo simulation, a block $(c, u, t)$ is assigned to the observed word $w$. After a number of states in the chain, the joint distribution approximates the target distribution.
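To make the framework concrete, the following is a minimal sketch of one Gibbs sweep for CUT$_1$, with the bookkeeping the framework requires; the class, field names, flat array layout and hyper-parameter values are illustrative assumptions rather than the authors' implementation:

```java
import java.util.Random;

/**
 * Minimal sketch of one Gibbs sweep for CUT_1 (community -> user -> topic -> word).
 * Illustrative only: data layout, names and hyper-parameters are assumptions.
 * Assumes the block assignments zc/zu/zt were randomly initialized and the count
 * matrices incremented accordingly before the first sweep.
 */
public class CutGibbsSketch {
    final int C, U, T, W;                              // #communities, #users, #topics, vocabulary size
    final double alpha = 0.01, beta = 0.01, gamma = 0.1;
    final int[][] nWT, nTU, nUC;                       // word-topic, topic-user, user-community counts
    final int[] nT, nU, nC;                            // marginal totals per topic, user, community
    final Random rng = new Random();

    CutGibbsSketch(int c, int u, int t, int w) {
        C = c; U = u; T = t; W = w;
        nWT = new int[w][t]; nTU = new int[t][u]; nUC = new int[u][c];
        nT = new int[t]; nU = new int[u]; nC = new int[c];
    }

    /** Unnormalized posterior of assigning word w to the block (c, u, t); cf. the count ratios of Eqs. 7-10. */
    double weight(int c, int u, int t, int w) {
        return (nWT[w][t] + beta)  / (nT[t] + W * beta)
             * (nTU[t][u] + alpha) / (nU[u] + T * alpha)
             * (nUC[u][c] + gamma) / (nC[c] + U * gamma);
    }

    /** One sweep: words[i] is the i-th token, recips[i] the recipient list of the email it occurs in. */
    void sweep(int[] words, int[][] recips, int[] zc, int[] zu, int[] zt) {
        for (int i = 0; i < words.length; i++) {
            int w = words[i];
            update(zc[i], zu[i], zt[i], w, -1);        // remove the current assignment from the counts
            int[] users = recips[i];
            double[] p = new double[C * users.length * T];
            int k = 0;
            for (int c = 0; c < C; c++)                // enumerate candidate blocks (c, u, t)
                for (int u : users)                    // the user must come from the recipient list
                    for (int t = 0; t < T; t++)
                        p[k++] = weight(c, u, t, w);
            int pick = sampleDiscrete(p);              // draw a block proportionally to its weight
            zc[i] = pick / (users.length * T);
            zu[i] = users[(pick / T) % users.length];
            zt[i] = pick % T;
            update(zc[i], zu[i], zt[i], w, +1);        // add the new assignment back to the counts
        }
    }

    void update(int c, int u, int t, int w, int delta) {
        nWT[w][t] += delta; nTU[t][u] += delta; nUC[u][c] += delta;
        nT[t] += delta; nU[u] += delta; nC[c] += delta;
    }

    int sampleDiscrete(double[] p) {
        double sum = 0; for (double v : p) sum += v;
        double r = rng.nextDouble() * sum;
        for (int i = 0; i < p.length; i++) { r -= p[i]; if (r <= 0) return i; }
        return p.length - 1;
    }
}
```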
To adapt Gibbs sampling to the CUT models, the key step is the estimation of $P(c_i, u_i, t_i \mid w_i, \mathbf{c}_{-i}, \mathbf{u}_{-i}, \mathbf{t}_{-i}, \mathbf{w}_{-i})$. We describe the estimation methods for the two CUT models in turn.
Let $P(c_i, u_i, t_i \mid w_i, \mathbf{c}_{-i}, \mathbf{u}_{-i}, \mathbf{t}_{-i}, \mathbf{w}_{-i})$ be the probability that $w_i$ is generated by community $c_i$ and user $u_i$ on topic $t_i$, conditioned on all the assignments of words excluding the current observation of $w_i$; $\mathbf{c}_{-i}$, $\mathbf{u}_{-i}$ and $\mathbf{t}_{-i}$ represent all the assignments of community, user and topic not including the current assignment of word $w_i$.
In CUT$_1$, combining Eq. 1 and Eq. 3 and assuming uniform prior probabilities over communities, $P(c_i, u_i, t_i \mid w_i, \mathbf{c}_{-i}, \mathbf{u}_{-i}, \mathbf{t}_{-i}, \mathbf{w}_{-i})$ is computed as a product of Dirichlet-smoothed count ratios (Eqs. 7-10), sketched after the definitions of the counts below.
In these ratios, $n^{WT}_{wt}$ is the number of times that word $w$ is assigned to topic $t$, not including the current instance; $n^{TU}_{tu}$ is the number of times that topic $t$ is associated with user $u$, and $n^{UC}_{uc}$ is the number of times that user $u$ belongs to community $c$, both not including the current instance. $C$ is the number of communities in the social network, given as an argument.
The computation for Eq. 8 requires keeping a $W \times T$ matrix whose entry $n^{WT}_{wt}$ records the number of times that word $w$ is assigned to topic $t$. Similarly, a $T \times U$ matrix and a $U \times C$ matrix are needed for the computations in Eq. 9 and Eq. 10.
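Spelled out with these counts and the smoothing hyper-parameters, the quantity that Eqs. 7-10 evaluate is, up to notation, the following product of three Dirichlet-smoothed ratios (a reconstruction consistent with the count matrices just described):
$$P(c_i = c, u_i = u, t_i = t \mid w_i = w, \mathbf{c}_{-i}, \mathbf{u}_{-i}, \mathbf{t}_{-i}, \mathbf{w}_{-i}) \;\propto\; \frac{n^{WT}_{wt} + \beta}{\sum_{w'} n^{WT}_{w't} + W\beta} \cdot \frac{n^{TU}_{tu} + \alpha}{\sum_{t'} n^{TU}_{t'u} + T\alpha} \cdot \frac{n^{UC}_{uc} + \gamma}{\sum_{u'} n^{UC}_{u'c} + U\gamma}.$$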
Similarly, $P(c_i, u_i, t_i \mid w_i, \mathbf{c}_{-i}, \mathbf{u}_{-i}, \mathbf{t}_{-i}, \mathbf{w}_{-i})$ is estimated analogously from the Bayesian structure of CUT$_2$ in Fig. 5, with the roles of user and topic exchanged.
Hence the computation for CUT$_2$ likewise demands the storage of three 2-D matrices: a topic-community matrix, a user-topic matrix and a word-user matrix.
With the set of matrices obtained after successive states of the Markov chain, the semantic communities can be discovered and tagged with semantic labels. For example, in CUT$_1$, the users belonging to each community $c$ can be discovered by maximizing $P(u \mid c)$, read off the user-community matrix. The topics these users are concerned with are then obtained similarly from the topic-user matrix, and an explanation for each topic can be retrieved from the word-topic matrix.
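As an illustration of this read-out step, a small helper can rank users within a community by their estimated conditional probability; the helper and its name are ours, and the same routine applies to ranking topics per user or words per topic:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

/** Illustrative post-processing: rank items (users, topics or words) by an estimated conditional probability. */
public class CommunityReader {
    /** Return the indices of the k largest entries of prob, e.g. prob[u] = P(u|c) for a fixed community c. */
    static List<Integer> topK(double[] prob, int k) {
        return IntStream.range(0, prob.length).boxed()
                .sorted((a, b) -> Double.compare(prob[b], prob[a]))
                .limit(Math.min(k, prob.length))
                .collect(Collectors.toList());
    }
}
```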
Consider two problems with the Gibbs sampling illustrated in Fig. 6. (1) Efficiency: Gibbs sampling is known to suffer from high computational complexity. Given a textual corpus with $N$ words, and $U$ users, $C$ communities and $T$ topics, one iteration of Gibbs sampling has a worst-case time complexity of $O(N \cdot C \cdot U \cdot T)$, an enormous number of computations for a corpus the size of Enron. (2) Performance: unless cleaning is performed explicitly before Gibbs sampling, the algorithm may yield poor performance by including many nondescriptive words; common words like 'the', 'you' and 'and' must be removed beforehand. EnF-Gibbs sampling saves such overhead by automatically removing non-informative words based on an entropy measure.
Fig. 7 illustrates the EnF-Gibbs sampling algorithm we propose, which incorporates the idea of entropy filtering into Gibbs sampling. During the iterations of EnF-Gibbs sampling, the algorithm keeps an index of words that are not informative. After a number of initial iterations, we start to ignore the words that are either already in this index or are judged non-informative. In Step 15 of Fig. 7, we quantify the informativeness of a word by the entropy of this word over another variable. For example, in CUT$_1$, where the word-topic matrix keeps the number of times each word is assigned to every topic, we calculate the entropy of the corresponding row of the matrix.
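A minimal sketch of that entropy test, assuming the word's row of the word-topic count matrix and an externally chosen threshold (the method name and the threshold are assumptions):

```java
/**
 * Illustrative entropy filter for EnF-Gibbs sampling (cf. Step 15 of Fig. 7): a word whose
 * assignments are spread almost uniformly over topics carries little information and is
 * added to the ignore list.
 */
public class EntropyFilter {
    static boolean isNonInformative(int[] topicCountsForWord, double threshold) {
        double total = 0;
        for (int c : topicCountsForWord) total += c;
        if (total == 0) return false;                  // not yet assigned anywhere; keep it for now
        double entropy = 0;
        for (int c : topicCountsForWord) {
            if (c == 0) continue;
            double p = c / total;
            entropy -= p * Math.log(p);                // Shannon entropy of the word's topic profile
        }
        return entropy > threshold;                    // near-uniform profile => non-informative
    }
}
```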
In this section, we present examples of discovered semantic communities. Then we compare our communities with those discovered by the topology-based algorithm [2] by comparing groupings of users. Finally we evaluate the computational complexity of Gibbs sampling and EnF-Gibbs sampling for our models.
We implemented all algorithms in Java, and all experiments were executed on Pentium IV 2.6 GHz machines with 1024 MB of DDR main memory running Linux.
In all of our simulations, we fixed the number of communities at 6 and the number of topics at 20. The smoothing hyper-parameters $\alpha$, $\beta$ and $\gamma$ were set to small constants (the latter two at 0.01 and 0.1 respectively). We ran 1000 iterations for both our Gibbs sampling and EnF-Gibbs sampling with MySQL database support. Because the quality of the results produced by Gibbs sampling and our EnF-Gibbs sampling is very close, we simply present the results of EnF-Gibbs sampling hereafter.
The ontologies for both models are illustrated in Fig. 8 and Fig. 11. In both figures, we denote users, topics and communities by squares, hexagons and dots respectively. In the CUT$_1$ result, a community connects a group of users and each user is associated with a set of topics. A probability threshold can be set to tune the number of users and topics desired for the description of a community. In Fig. 8, we present all the users and two topics of one user for a discovered community. By taking the union of the topics of the desired users in a community, we can tag the community with topic labels.
Table 1: Word descriptions of sample topics discovered for the community in Fig. 8.

| Topic 3 | Topic 5 | Topic 12 | Topic 14 |
|---|---|---|---|
| rate | dynegy | budget | contract |
| cash | gas | plan | monitor |
| balance | transmission | chart | litigation |
| number | energy | deal | agreement |
| price | transco | project | trade |
| analysis | calpx | report | cpuc |
| database | power | group | pressure |
| deals | california | meeting | utility |
| letter | reliant | draft | materials |
| fax | electric | discussion | citizen |
Fig. 8 shows that user mike.grigsby is one of the users in community 3. Two of the topics with which mike.grigsby is mostly concerned are topic 5 and topic 12. Table 1 shows explanations for some of the topics discovered for this community. We obtain the word description for a topic by choosing 10 from the top 20 words that maximize $P(w \mid t)$. We choose only 10 words out of 20 because some personal names with large conditional probability on a topic are omitted for privacy concerns.
Table 2: Abbreviations commonly used in Enron emails.

| abbreviation | organization |
|---|---|
| dynegy | An electricity and natural gas provider |
| transco | A gas transportation company |
| calpx | California Power Exchange Corp. |
| cpuc | California Public Utilities Commission |
| ferc | Federal Energy Regulatory Commission |
| epsa | Electric Power Supply Association |
| naruc | National Association of Regulatory Utility Commissioners |
We can see from Table 1 that words with similar semantics are nicely grouped into the same topics. For a better understanding of some abbreviated names commonly used in Enron emails, we list the abbreviations with the corresponding complete names in Table 2.
Figure 9: A single user's probability distributions (a) over communities and (b) over topics.
For a single user, Fig. 9 illustrates the probability distributions over communities and topics as learned from the CUT model. We can see that the multinomial distributions we assumed are nicely recovered in both figures. The distributions over topics for all users are presented in Fig. 10, from which we can see that some Enron employees are highly active in certain topics while others are relatively inactive, as reflected by the varying heights of the peaks over users.
Fig. 11 illustrates a community discovered by CUT$_2$. According to the figure, Topic 8 belongs to this semantic community, and the topic concerns a set of users which includes rick.buy, whose frequently used words are more or less related to business and risk. Surprisingly enough, we found that the words our CUT model learned to describe such users were very appropriate after we checked the original positions of these employees in Enron. For the four users presented in Table 3, d..steffes was the vice president of Enron in charge of government affairs; cara.semperger was a senior analyst; mike.grigsby was a marketing manager; and rick.buy was the chief risk management officer.
Table 3: Frequently used words of four users in the community of Fig. 11.

| d..steffes | cara.semperger | mike.grigsby | rick.buy |
|---|---|---|---|
| power | number | file | corp |
| transmission | cash | trader | loss |
| epsa | ferc | report | risk |
| ferc | database | price | activity |
| generator | peak | customer | validation |
| government | deal | meeting | off |
| california | bilat | market | business |
| cpuc | caps | sources | possible |
| electric | points | position | increase |
| naruc | analysis | project | natural |
Given $n$ data objects, the similarity between two clustering results $C_1$ and $C_2$ is defined as
$$sim(C_1, C_2) \;=\; \frac{a + b}{n(n-1)/2},$$
where $a$ denotes the count of object pairs that are in different clusters under both clusterings and $b$ is the count of pairs that are in the same cluster under both.
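Under this definition the measure is the fraction of object pairs on which the two clusterings agree (i.e., the Rand index); a direct computation looks like the following sketch:

```java
/** Illustrative pairwise-agreement (Rand index) similarity between two cluster labelings. */
public class ClusteringSimilarity {
    /** labelsA[i] and labelsB[i] are the cluster ids of object i under the two clusterings. */
    static double similarity(int[] labelsA, int[] labelsB) {
        int n = labelsA.length;
        long agreements = 0, pairs = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                boolean sameA = labelsA[i] == labelsA[j];
                boolean sameB = labelsB[i] == labelsB[j];
                if (sameA == sameB) agreements++;      // same cluster in both, or different in both
                pairs++;
            }
        }
        return pairs == 0 ? 1.0 : (double) agreements / pairs;
    }
}
```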
The pairwise similarities among the three clustering results, CUT$_1$, CUT$_2$ and the topology-based algorithm [2], are illustrated in Fig. 12. As expected, the similarity between CUT$_1$ and the topology-based result is large, while that between CUT$_2$ and the topology-based result is small. This is because CUT$_1$, by defining a community as no more than a group of users, is more similar to the topology-based method than CUT$_2$ is.
We also test the similarity among the topics (users) for the users (topics) that are discovered as a community by CUT$_1$ (CUT$_2$). Typically, the topics (users) associated with the users (topics) of a community show high similarity. For example, in Fig. 8, Topic 5 and Topic 12, which concern mike.grigsby, are both contained in the topic set of lindy.donoho, who is the community companion of mike.grigsby.
Figure 13: Running time of the sampling algorithms: (a) time vs. sample size; (b) time vs. iterations.
We evaluate the computational complexity of Gibbs sampling and EnF-Gibbs sampling for our models. The two metrics we measure are the total running time and the iteration-wise running time. For the overall running time, we sampled subsets of the Enron email corpus at different scales. For the iteration-wise evaluation, we ran both Gibbs sampling and EnF-Gibbs sampling on the complete dataset.
In Fig. 13(a), the running time of both sampling algorithms on the two models is illustrated. We can see that learning CUT$_1$ is generally more efficient than learning CUT$_2$, a reasonable result considering that the matrices for CUT$_2$ are larger in scale than those for CUT$_1$. Entropy filtering in Gibbs sampling also leads to an overall speedup of 4 to 5 times.
The step-wise running time comparison between Gibbs sampling and EnF-Gibbs sampling is shown in Fig. 13(b). We perform the entropy-filtering removal after 8 iterations of the Markov chain. We can see that EnF-Gibbs sampling clearly outperforms Gibbs sampling in efficiency. Our experimental results also show that the quality of the results from EnF-Gibbs sampling and Gibbs sampling is almost the same.
We present two versions of the Community-User-Topic model for semantic community discovery in social networks. Our models combine generative probabilistic modeling with community detection. To simulate the generative models, we introduce EnF-Gibbs sampling, which extends Gibbs sampling with entropy filtering. Experiments have shown that our method effectively tags communities with topic semantics, with better efficiency than plain Gibbs sampling.
Future work will consider possible expansions of our CUT models, as illustrated in Fig. 14. The two CUT models we proposed emphasize either the relation between community and users or that between community and topics. It would be interesting to see how the community structure changes when both factors are considered simultaneously; one would expect new communities to emerge. The model in Fig. 14 constrains the community as a joint distribution over topics and users. However, such nonlinear generative models require larger computational resources, calling for more efficient approximate solutions. It would also be interesting to explore the predictive performance of these models on new communications between unacquainted social actors in SNs.