WWW10 Poster 1110

The problem of analysing dynamically evolving textual data has recently arisen. An example of such data is the discussion appearing in Internet chat lines. In this paper a recently introduced method, termed complexity pursuit, is used to extract the topics of a dynamical chat line discussion. Experimental results demonstrate that meaningful topics can be found and also suggest the applicability of the method to query-based retrieval from a temporally changing text stream.

Keywords

chat line discussion, dynamical text, topic identification, complexity pursuit, data mining

1. INTRODUCTION

In times of huge information flow in the Internet, there is a strong need for automatic textual data analysis tools. Algorithms developed for text mining from static text collections have been presented e.g. in [1,2,7]. Our emphasis is in the recently arisen issue of analyzing dynamically evolving textual data; investigating appropriate tools for this task is of practical importance. An example of such data is found in the Internet relay chat rooms: the topic of interest changes after participants' contributions. The online text stream can thus be seen as a time series, and methods of time series processing may be used to extract the topics of the discussion.

We present results of applying a recently introduced powerful method, complexity pursuit [3], to topic extraction in a dynamically evolving discussion. Complexity pursuit uses both information-theoretic measures and time-correlations of the data, which makes it more powerful than methods using only one of these -- the latter kind of methods include [5,6,8,9].

2. CHAT LINE DATA

The discussion found in chat lines on the Internet is an ongoing stream of text generated by the chat participants and the chat line moderator. To analyze it using data mining methods a convenient technique is to split the stream into windows that may be overlapping if desired. Each such window can now be viewed as one document. We represent the documents using the vector space model [11]: each document forms one T-dimensional vector where T is the number of distinct terms in the vocabulary. The i-th element of the vector indicates (some function of) the frequency of the i-th vocabulary term in the document. The term by document matrix X contains the document vectors as its columns and is of size T times N where N is the number of documents.

As a preprocessing step we perform the LSI [2] of the data matrix X by computing its singular value decomposition, thus acquiring a lower dimensional projection of the the high-dimensional data. We denote the new data matrix as Z = (z(t)). The topics of the discussion can be found by projecting Z to the directions $\mathbf W = (\mathbf w_1 \cdots \mathbf w_M)$ given by the algorithm described in the following Section. M, the number of estimated minimum complexity projections, may be smaller than K, the dimension of the LSI projection.

3. THE ALGORITHM

Complexity pursuit [3] is a recently developed, computationally simple algorithm for separating "interesting" components from high dimensional time series. These components -- here, the topics of the discussion -- are found by projecting the data in the directions given by the algorithm. In complexity pursuit, an "interesting" component is one whose distribution has a low coding complexity, measured by an approximative Kolmogoroff complexity. We model the topics of the discussion as probability distributions on terms, and the distributions having minimum complexity are assumed to best represent the distinct topics.

We assume that the preprocessed K-dimensional observations z(t) are linear mixtures of some latent, time-correlated topics s. Both the latent topics and the mixing process are unknown. The topic time series are found by projecting the observation time series into directions w as s(t) = w^T z(t) where the projection directions w are to be found. A separate autoregressive (AR) model, here $\hat{s}(t) = \alpha s(t-1)$ , is assumed to model each topic time series s. At every step of the algorithm, the AR constant $\alpha$ is first estimated. Then the gradient update of w that minimizes the approximate Kolmogoroff complexity [3] of the AR residuals $s(t)-\hat{s}(t)$ = w^T(z(t) - \alpha z(t-1)) is the following:
$\begin{align} \mathbf w \leftarrow & \mathbf w - \mu E\{ (\mathbf z(t) - \alph... ...\\ \mathbf w \leftarrow & \mathbf w /\vert\vert\mathbf w\vert\vert \end{align}$
To estimate several projection directions we can proceed in a deflationary manner, making the new projection direction orthogonal to the previously found ones, similarly to Gram-Schmidt orthogonalization. After the estimation of p projections, we run the algorithm for w_p+1 and after every iteration step subtract from w_p+1 the projections of the previously estimated p vectors, and then renormalize w_p+1. Another possibility is to estimate all projections simultaneously in a symmetric manner and use orthogonal decorrelation W := (WW^T)^-1/2 W instead of the normalization (2) in the above set of equations. Details of the approximation of the Kolmogoroff complexity and its minimization can be found in [3].

The computational load of the algorithm can be approximated as follows. The LSI preprocessing of a sparse T times N data matrix is of order O(NTc) where c is the number of nonzero entries in each column of the data matrix. (In our experiments, the data matrix is very sparse and c is about 30 to 40, as each document contains about 30 to 40 distinct vocabulary terms.) The algorithm itself is of order O(NK²M) where K is the dimensionality (K << T,N) into which the data was projected using LSI, and M is the number of topics estimated. Thus the algorithm scales linearly in the number of observations.

4. EXPERIMENTS ON CHAT LINE DATA

The chat line data was collected from the CNN Newsroom chat line http://www.cnn.com/chat/channel/cnn_newsroom. A contiguous stream of almost 24 hours of discussion of 3200 chat participants, contributing 25 000 comment lines, was recorded on January 18th, 2001. The data was cleaned by omitting all user names and non-user generated text. The remaining text stream was split into overlapping windows of about 750 characters. From these windows a term histogram was generated using the Bow toolkit http://www.cs.cmu.edu/~mccallum/bow/, resulting in a term by document matrix X of size T times N, that is, 4743 times 7430. The LSI of order K=50 was computed as a preprocessing step. The choice of the number of estimated topics M is relatively flexible, and in this abstract we present results on M=10, 7 and 4.

Figure 1 shows the topic time series s(t) when 10 topics are assumed. We can see that the topics are autocorrelated in time; different topics are activated at different times. Some topics show short peaks of contribution whereas others are active for longer periods.

**Figure 1:** Topic time series (horizontal axis: chat windows). 10 topics are assumed.

The validity of the identified topics is easy to evaluate using the most representative terms associated with each topic. These are obtained by projecting the term data Z_term (which is the document by term matrix X^T projected into its LSI space) into the minimum complexity directions w_i found earlier. By listing the terms corresponding to the highest peaks in the projection w_i^TZ_term we get a list of keywords for the i-th topic. In Table 1 it is seen that each keyword list indeed characterizes one distinct topic quite clearly.

Table 1: The most representative terms of each topic, assuming 10 topics

Topic 1	Topic 2	Topic 3	Topic 4	Topic 5	Topic 6	Topic 7	Topic 8	Topic 9	Topic 10
monei	room	count	gener	good	polit	gun	power	kennedi	god
tax	chat	won	attornei	make	conserv	violenc	california	ted	religion
jackson	join	vote	agre	bush	religion	school	electr	peopl	peopl
night	discuss	gore	view	time	liber	report	blackout	bush	jesu
cut	cnn	bush	senat	cnn	govern	youth	energi	teddi	bibl
agre	est	stop	read	american	mind	children	plant	work	doesn
bill	peopl	recount	reno	man	free	point	peopl	car	call
pai	todai	court	gui	don	form	home	deregul	kill	read
lot	tonight	florida	bush	presid	opinion	drug	dai	hear	give
america	american	hand	question	black	life	famili	state	put	back
exempt	studio	night	dai	talk	philosophi	major	build	nation	show
hous	elect	win	missouri	senat	establish	parent	crisi	drive	true

Topics 1, 4, 5, 6 and 9 in Table 1 deal with general daily politics in the US, with the emphasis on Jesse Jackson (topic 1), Department of Justice (topic 4), expected success of the country under the guidance of President Bush (topic 5), the values of the politicians (topic 6) and Ted Kennedy's political comments and the car accident he was involved in several years ago (topic 9). Topic 2 corresponds to comments given by the chat line operator. Topic 3 is about the presidential election in the US, especially the vote recounting in Florida. Topic 7 deals with problems of the youth: violence, drug abuse etc. Topic 8 involves the energy shortage in California in mid-January 2001. Topic 10 is a religious discussion. Some of the topics display similarity to other topics, whereas some topics are clearly distinct. This suggest the estimation of a smaller number of topics; results of this will be seen later in this section.

Neglecting the time series nature of the data in the complexity pursuit algorithm yields an Independent Component Analysis -type algorithm [5, 6, 9]. To demonstrate this, we leave out the autoregressive modeling of the topic time series in our complexity pursuit algorithm. The results of this are presented in Table 2. We can see that topics 2, 5, 8, 9 and 10 in Table 2 more or less correspond to topics 2, 3, 7, 8 and 6 in Table 1, respectively, but the keyword lists are not as informative as in the previous case. The rest of the topics in Table 2 are not meaningfully characterized by the keyword lists. Thus the presented method outperforms methods that do not consider the text data dynamical.

Table 2: The most representative terms of each topic, assuming 10 topics, neglecting the time series nature of the data.

Topic 1	Topic 2	Topic 3	Topic 4	Topic 5	Topic 6	Topic 7	Topic 8	Topic 9	Topic 10
american	room	www	put	elect	dai	white	violenc	power	conserv
peopl	chat	http	kill	bush	till	hous	gun	california	polit
conserv	join	html	death	gore	night	black	report	electr	religion
cnn	discuss	index	peopl	vote	countri	don	school	state	liber
bush	state	cnn	innoc	presid	presid	jackson	youth	blackout	mind
tonight	est	presid	back	don	hear	gui	children	plant	govern
fact	thing	god	oil	peopl	elect	point	point	energi	free
black	cnn	world	discuss	florida	join	todai	major	deregul	form
don	todai	stori	monei	room	week	god	drug	home	opinion
show	tonight	januari	join	call	open	dai	countri	life	philosophi
problem	studio	fill	todai	win	inaugur	kennedi	famili	compani	establish
good	american	tech	type	thing	folk	racist	home	cnn	independ

Next, we analyze the results of assuming only 7 topics of discussion in the same data set. The topic time series are seen in Figure 2. The most representative keywords for the seven estimated topics are listed in Table 3. Topics 1, 2, 4, 5 and 7 now correspond clearly to topics 6, 10, 3, 7 and 8 in Table 1. Topic 3 involves the contributions of the chat line operator, now with a different set of keywords. Topic 6 deals with general future expectations regarding President Bush.

Table 3: The keywords of each topic, assuming 7 topics

Topic 1	Topic 2	Topic 3	Topic 4	Topic 5	Topic 6	Topic 7
conserv	god	www	won	violenc	peopl	power
polit	religion	http	vote	gun	kennedi	california
religion	thing	html	count	report	elect	electr
liber	jesu	index	gore	school	cnn	don
govern	bibl	cnn	bush	youth	dai	blackout
mind	doesn	time	stop	children	live	plant
free	white	world	hand	point	call	energi
opinion	don	thing	recount	home	bush	deregul
form	work	stori	court	drug	back	state
life	kill	good	florida	famili	man	problem
philosophi	talk	make	don	major	senat	crisi
establish	good	put	win	gener	presid	build

Estimating less than 7 topics gives a subset of the most clearly formed topics, in addition to a mixture of the other chat contributions. This is seen in Table 4 where 4 topics are assumed. Topics 1, 3 and 4 now correspond to topics 5, 2 and 7 in Table 3, and topic 2 is now a mixture of several discussions.

Table 4: The most representative terms of each topic, assuming 4 topics

Topic 1	Topic 2	Topic 3	Topic 4
violenc	bush	god	power
gun	don	peopl	california
report	good	religion	electr
school	thing	make	energi
youth	kennedi	jesu	blackout
children	hear	life	plant
point	time	call	peopl
major	white	bibl	deregul
peopl	fact	doesn	state
drug	vote	man	dai
home	www	back	build
famili	talk	read	compani

Our experimental results suggest that the topics are not strictly independent of others, but instead some topics share commonalities and some are more distinct from others. This encourages a topographical extension of the complexity pursuit algorithm in the future.

5. CONCLUSIONS AND FUTURE WORK

In this paper we have shown experimental results on how minimum complexity projections of a dynamically evolving textual data identify some underlying latent or hidden topics in the text stream. In contrast to static text collections, methods for analyzing dynamical text have been sparse. As an example of such dynamically evolving data we used chat line discussion data that is found in the Internet relay chat rooms. We modeled the topics as probability distributions on terms; the distributions having minimum complexity were assumed to best represent the distinct topics.

In our experiments we were able to find distinct and meaningful topics of the discussion, outperforming some more traditional methods. The results suggest that our method could serve in queries on temporally changing text stream, of which news-feed data is another example.

As some topics of discussion are more distinct than others, natural extensions of the work include recursive binary partitioning [10] of the data, or finding the topographic structure [4] of the topics.

Finding topics in dynamical text: application to chat line discussions

Ella Bingham
Neural Networks Research Centre
Helsinki University of Technology
P.O.Box 5400, FIN-02015 HUT, Finland
ella.bingham@hut.fi

Ata Kabán
Neural Networks Research Centre
Helsinki University of Technology
P.O.Box 5400, FIN-02015 HUT, Finland
kaba-ci0@paisley.ac.uk

Mark Girolami
Neural Networks Research Centre
Helsinki University of Technology
P.O.Box 5400, FIN-02015 HUT, Finland
giro-ci0@paisley.ac.uk

ABSTRACT

Keywords

1. INTRODUCTION

2. CHAT LINE DATA

3. THE ALGORITHM

4. EXPERIMENTS ON CHAT LINE DATA

5. CONCLUSIONS AND FUTURE WORK

6. REFERENCES

Finding topics in dynamical text: application to chat line discussions

Ella Bingham Neural Networks Research Centre Helsinki University of Technology P.O.Box 5400, FIN-02015 HUT, Finland ella.bingham@hut.fi

Ata Kabán Neural Networks Research Centre Helsinki University of Technology P.O.Box 5400, FIN-02015 HUT, Finland kaba-ci0@paisley.ac.uk

Mark Girolami Neural Networks Research Centre Helsinki University of Technology P.O.Box 5400, FIN-02015 HUT, Finland giro-ci0@paisley.ac.uk

ABSTRACT

Keywords

1. INTRODUCTION

2. CHAT LINE DATA

3. THE ALGORITHM

4. EXPERIMENTS ON CHAT LINE DATA

5. CONCLUSIONS AND FUTURE WORK

6. REFERENCES

Ella Bingham
Neural Networks Research Centre
Helsinki University of Technology
P.O.Box 5400, FIN-02015 HUT, Finland
ella.bingham@hut.fi

Ata Kabán
Neural Networks Research Centre
Helsinki University of Technology
P.O.Box 5400, FIN-02015 HUT, Finland
kaba-ci0@paisley.ac.uk

Mark Girolami
Neural Networks Research Centre
Helsinki University of Technology
P.O.Box 5400, FIN-02015 HUT, Finland
giro-ci0@paisley.ac.uk