Finding topics in dynamical text: application to chat line discussions

Ella Bingham
Neural Networks Research Centre
Helsinki University of Technology
P.O.Box 5400, FIN-02015 HUT, Finland
ella.bingham@hut.fi
Ata Kabán
Department of Computing and Information Systems
University of Paisley
Paisley, Scotland, UK
kaba-ci0@paisley.ac.uk
Mark Girolami
Department of Computing and Information Systems
University of Paisley
Paisley, Scotland, UK
giro-ci0@paisley.ac.uk

ABSTRACT

The problem of analyzing dynamically evolving textual data has recently arisen. An example of such data is the discussion appearing in Internet chat lines. In this paper a recently introduced method, termed complexity pursuit, is used to extract the topics of a dynamical chat line discussion. Experimental results demonstrate that meaningful topics can be found and also suggest the applicability of the method to query-based retrieval from a temporally changing text stream.

Keywords

chat line discussion, dynamical text, topic identification, complexity pursuit, data mining

1. INTRODUCTION

In times of huge information flow on the Internet, there is a strong need for automatic textual data analysis tools. Algorithms developed for text mining from static text collections have been presented e.g. in [1,2,7]. Our emphasis is on the recently arisen problem of analyzing dynamically evolving textual data; investigating appropriate tools for this task is of practical importance. An example of such data is found in Internet relay chat rooms: the topic of interest changes as participants contribute. The online text stream can thus be seen as a time series, and methods of time series processing may be used to extract the topics of the discussion.

We present results of applying a recently introduced, powerful method, complexity pursuit [3], to topic extraction in a dynamically evolving discussion. Complexity pursuit uses both information-theoretic measures and time-correlations of the data, which makes it more powerful than methods that use only one of these; methods of the latter kind include [5,6,8,9].

2. CHAT LINE DATA

The discussion found in chat lines on the Internet is an ongoing stream of text generated by the chat participants and the chat line moderator. To analyze it using data mining methods, a convenient technique is to split the stream into windows, which may overlap if desired. Each such window can then be viewed as one document. We represent the documents using the vector space model [11]: each document forms one T-dimensional vector, where T is the number of distinct terms in the vocabulary. The i-th element of the vector indicates (some function of) the frequency of the i-th vocabulary term in the document. The term-by-document matrix $\mathbf X$ contains the document vectors as its columns and is of size $T \times N$, where N is the number of documents.
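For concreteness, the following minimal Python sketch shows this windowing and vectorization step. The window width, overlap and tokenizer are illustrative choices; the experiments in Section 4 instead use the Bow toolkit, whose preprocessing also stems the terms (visible in Tables 1-4).

    import re
    import numpy as np

    def split_into_windows(stream, width=750, step=375):
        """Split a text stream into overlapping character windows of fixed width."""
        return [stream[i:i + width]
                for i in range(0, max(len(stream) - width + 1, 1), step)]

    def term_document_matrix(docs):
        """Build the T x N term-by-document matrix X of raw term frequencies."""
        tokenize = lambda doc: re.findall(r"[a-z']+", doc.lower())
        vocab = {}
        for doc in docs:
            for term in tokenize(doc):
                vocab.setdefault(term, len(vocab))
        X = np.zeros((len(vocab), len(docs)))
        for j, doc in enumerate(docs):
            for term in tokenize(doc):
                X[vocab[term], j] += 1.0
        terms = sorted(vocab, key=vocab.get)   # vocabulary in row order of X
        return X, terms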

As a preprocessing step we perform the LSI [2] of the data matrix $\mathbf X$ by computing its singular value decomposition, thus acquiring a lower-dimensional projection of the high-dimensional data. We denote the new data matrix as $\mathbf Z = (\mathbf z(t))$. The topics of the discussion can be found by projecting $\mathbf Z$ onto the directions $\mathbf W = (\mathbf w_1 \cdots \mathbf w_M)$ given by the algorithm described in the following section. M, the number of estimated minimum complexity projections, may be smaller than K, the dimension of the LSI projection.
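A minimal sketch of this preprocessing, assuming a dense data matrix for simplicity (for the large sparse matrices of Section 4 a truncated SVD routine such as scipy.sparse.linalg.svds would be used instead):

    import numpy as np

    def lsi_project(X, K=50):
        """Project the T x N matrix X onto its K leading left singular
        directions, giving the K x N LSI representation Z = U_K^T X."""
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        U_K = U[:, :K]        # T x K orthonormal basis of the LSI space
        Z = U_K.T @ X         # columns are the projected documents z(t)
        return Z, U_K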

3. THE ALGORITHM

Complexity pursuit [3] is a recently developed, computationally simple algorithm for separating "interesting" components from high-dimensional time series. These components -- here, the topics of the discussion -- are found by projecting the data in the directions given by the algorithm. In complexity pursuit, an "interesting" component is one whose distribution has a low coding complexity, measured by an approximation of the Kolmogoroff complexity. We model the topics of the discussion as probability distributions on terms, and the distributions having minimum complexity are assumed to best represent the distinct topics.

We assume that the preprocessed K-dimensional observations $\mathbf z(t)$ are linear mixtures of some latent, time-correlated topics $s$. Both the latent topics and the mixing process are unknown. The topic time series are found by projecting the observation time series onto directions $\mathbf w$ as $s(t) = \mathbf w^T \mathbf z(t)$, where the projection directions $\mathbf w$ are to be estimated. A separate autoregressive (AR) model, here $\hat{s}(t) = \alpha s(t-1)$, is assumed for each topic time series $s$. At every step of the algorithm, the AR constant $\alpha$ is first estimated. Then the gradient update of $\mathbf w$ that minimizes the approximate Kolmogoroff complexity [3] of the AR residuals $s(t) - \hat{s}(t) = \mathbf w^T(\mathbf z(t) - \alpha \mathbf z(t-1))$ is the following:
\begin{align}
\mathbf w \leftarrow & \mathbf w - \mu E\{ (\mathbf z(t) - \alpha \mathbf z(t-1)) \, g(\mathbf w^T (\mathbf z(t) - \alpha \mathbf z(t-1))) \} \\
\mathbf w \leftarrow & \mathbf w /\vert\vert\mathbf w\vert\vert
\end{align}
Here $\mu$ is the learning rate and $g$ is a scalar nonlinearity arising from the approximation of the Kolmogoroff complexity [3].
To estimate several projection directions we can proceed in a deflationary manner, making each new projection direction orthogonal to the previously found ones, similarly to Gram-Schmidt orthogonalization. After the estimation of $p$ projections, we run the algorithm for $\mathbf w_{p+1}$ and, after every iteration step, subtract from $\mathbf w_{p+1}$ its projections onto the previously estimated $p$ vectors and then renormalize $\mathbf w_{p+1}$. Another possibility is to estimate all projections simultaneously in a symmetric manner and use the orthogonalizing decorrelation $\mathbf W \leftarrow (\mathbf W \mathbf W^T)^{-1/2} \mathbf W$ instead of the normalization (2) in the above set of equations. Details of the approximation of the Kolmogoroff complexity and its minimization can be found in [3].
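The following minimal Python sketch implements the deflationary variant. It assumes a least-squares estimate of the AR(1) coefficient $\alpha$ and uses tanh as the nonlinearity $g$; the published algorithm [3] derives $g$ from the complexity approximation, so this is an illustrative sketch rather than a faithful reimplementation.

    import numpy as np

    def complexity_pursuit(Z, M=10, mu=0.1, n_iter=200, seed=0):
        """Deflationary complexity pursuit on the K x N LSI matrix Z, whose
        columns z(t) are ordered in time.  Returns the K x M matrix W whose
        columns are the estimated minimum-complexity projection directions."""
        rng = np.random.default_rng(seed)
        K, N = Z.shape
        W = np.zeros((K, M))
        for p in range(M):
            w = rng.standard_normal(K)
            w /= np.linalg.norm(w)
            for _ in range(n_iter):
                s = w @ Z                                      # topic series s(t)
                alpha = (s[1:] @ s[:-1]) / (s[:-1] @ s[:-1])   # AR(1) coefficient
                dZ = Z[:, 1:] - alpha * Z[:, :-1]              # z(t) - alpha z(t-1)
                w = w - mu * (dZ @ np.tanh(w @ dZ)) / (N - 1)  # gradient step (1)
                w -= W[:, :p] @ (W[:, :p].T @ w)               # deflate against earlier w's
                w /= np.linalg.norm(w)                         # normalization (2)
            W[:, p] = w
        return W

With the matrices of the earlier sketches, the topic time series are the rows of W.T @ Z for W = complexity_pursuit(Z).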

The computational load of the algorithm can be approximated as follows. The LSI preprocessing of a sparse $T \times N$ data matrix is of order $O(NTc)$, where c is the number of nonzero entries in each column of the data matrix. (In our experiments the data matrix is very sparse and c is about 30 to 40, as each document contains about 30 to 40 distinct vocabulary terms.) The algorithm itself is of order $O(NK^2M)$, where K is the dimensionality ($K \ll T, N$) into which the data was projected using LSI, and M is the number of topics estimated. Thus the algorithm scales linearly in the number of observations.
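As a rough illustration with the dimensions of the experiments below (constant factors and iteration counts omitted), $T = 4743$, $N = 7430$, $c \approx 35$, $K = 50$ and $M = 10$ give
\begin{displaymath}
NTc \approx 7430 \cdot 4743 \cdot 35 \approx 1.2 \cdot 10^{9},
\qquad
NK^2M = 7430 \cdot 50^2 \cdot 10 \approx 1.9 \cdot 10^{8},
\end{displaymath}
so one pass of the pursuit algorithm is roughly an order of magnitude cheaper than the LSI preprocessing.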

4. EXPERIMENTS ON CHAT LINE DATA

The chat line data was collected from the CNN Newsroom chat line http://www.cnn.com/chat/channel/cnn_newsroom. A contiguous stream of almost 24 hours of discussion among 3200 chat participants, contributing 25 000 comment lines, was recorded on January 18th, 2001. The data was cleaned by omitting all user names and non-user-generated text. The remaining text stream was split into overlapping windows of about 750 characters. From these windows term histograms were generated using the Bow toolkit http://www.cs.cmu.edu/~mccallum/bow/, resulting in a term-by-document matrix $\mathbf X$ of size $T \times N$, that is, $4743 \times 7430$. The LSI of order K=50 was computed as a preprocessing step. The choice of the number of estimated topics M is relatively flexible, and in this paper we present results for M=10, 7 and 4.

Figure 1 shows the topic time series s(t) when 10 topics are assumed. We can see that the topics are autocorrelated in time; different topics are activated at different times. Some topics show short peaks of contribution whereas others are active for longer periods.
Figure 1: Topic time series (horizontal axis: chat windows). 10 topics are assumed.

The validity of the identified topics is easy to evaluate using the most representative terms associated with each topic. These are obtained by projecting the term data $\mathbf Z_{term}$ (which is the document-by-term matrix $\mathbf X^T$ projected into its LSI space) onto the minimum complexity directions $\mathbf w_i$ found earlier. By listing the terms corresponding to the highest peaks in the projection $\mathbf w_i^T \mathbf Z_{term}$ we get a list of keywords for the i-th topic. In Table 1 it is seen that each keyword list indeed characterizes one distinct topic quite clearly.
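A sketch of this keyword-listing step, reusing the SVD of the preprocessing stage (the choice of twelve keywords matches the tables below; the function and variable names are illustrative):

    import numpy as np

    def topic_keywords(X, W, terms, K=50, n_keywords=12):
        """For each minimum-complexity direction w_i (a column of W), list the
        vocabulary terms at the highest peaks of the projection w_i^T Z_term."""
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        Z_term = s[:K, None] * U[:, :K].T   # K x T: X^T projected into its LSI space
        scores = W.T @ Z_term               # row i holds the term loadings of topic i
        return [[terms[j] for j in np.argsort(row)[-n_keywords:][::-1]]
                for row in scores]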

 
Table 1: The most representative terms of each topic, assuming 10 topics
Topic 1    Topic 2    Topic 3    Topic 4    Topic 5    Topic 6    Topic 7    Topic 8    Topic 9    Topic 10
monei      room       count      gener      good       polit      gun        power      kennedi    god
tax        chat       won        attornei   make       conserv    violenc    california ted        religion
jackson    join       vote       agre       bush       religion   school     electr     peopl      peopl
night      discuss    gore       view       time       liber      report     blackout   bush       jesu
cut        cnn        bush       senat      cnn        govern     youth      energi     teddi      bibl
agre       est        stop       read       american   mind       children   plant      work       doesn
bill       peopl      recount    reno       man        free       point      peopl      car        call
pai        todai      court      gui        don        form       home       deregul    kill       read
lot        tonight    florida    bush       presid     opinion    drug       dai        hear       give
america    american   hand       question   black      life       famili     state      put        back
exempt     studio     night      dai        talk       philosophi major      build      nation     show
hous       elect      win        missouri   senat      establish  parent     crisi      drive      true
 

Topics 1, 4, 5, 6 and 9 in Table 1 deal with general daily politics in the US, with the emphasis on Jesse Jackson (topic 1), the Department of Justice (topic 4), the expected success of the country under the guidance of President Bush (topic 5), the values of the politicians (topic 6) and Ted Kennedy's political comments and the car accident he was involved in several years ago (topic 9). Topic 2 corresponds to comments given by the chat line operator. Topic 3 is about the presidential election in the US, especially the vote recounting in Florida. Topic 7 deals with problems of the youth: violence, drug abuse etc. Topic 8 involves the energy shortage in California in mid-January 2001. Topic 10 is a religious discussion. Some of the topics display similarity to other topics, whereas some topics are clearly distinct. This suggests estimating a smaller number of topics; results of this are presented later in this section.

Neglecting the time series nature of the data in the complexity pursuit algorithm yields an Independent Component Analysis type algorithm [5,6,9]. To demonstrate this, we leave out the autoregressive modeling of the topic time series in our complexity pursuit algorithm. The results are presented in Table 2. We can see that topics 2, 5, 8, 9 and 10 in Table 2 more or less correspond to topics 2, 3, 7, 8 and 6 in Table 1, respectively, but the keyword lists are not as informative as in the previous case. The rest of the topics in Table 2 are not meaningfully characterized by their keyword lists. Thus the presented method outperforms methods that do not treat the text data as dynamical.

 
Table 2: The most representative terms of each topic, assuming 10 topics, neglecting the time series nature of the data.
Topic 1    Topic 2    Topic 3    Topic 4    Topic 5    Topic 6    Topic 7    Topic 8    Topic 9    Topic 10
american   room       www        put        elect      dai        white      violenc    power      conserv
peopl      chat       http       kill       bush       till       hous       gun        california polit
conserv    join       html       death      gore       night      black      report     electr     religion
cnn        discuss    index      peopl      vote       countri    don        school     state      liber
bush       state      cnn        innoc      presid     presid     jackson    youth      blackout   mind
tonight    est        presid     back       don        hear       gui        children   plant      govern
fact       thing      god        oil        peopl      elect      point      point      energi     free
black      cnn        world      discuss    florida    join       todai      major      deregul    form
don        todai      stori      monei      room       week       god        drug       home       opinion
show       tonight    januari    join       call       open       dai        countri    life       philosophi
problem    studio     fill       todai      win        inaugur    kennedi    famili     compani    establish
good       american   tech       type       thing      folk       racist     home       cnn        independ
 
Figure 2: Topic time series (horizontal axis: chat windows). 7 topics are assumed.

Next, we analyze the results of assuming only 7 topics of discussion in the same data set. The topic time series are seen in Figure 2. The most representative keywords for the seven estimated topics are listed in Table 3. Topics 1, 2, 4, 5 and 7 now correspond clearly to topics 6, 10, 3, 7 and 8 in Table 1. Topic 3 involves the contributions of the chat line operator, now with a different set of keywords. Topic 6 deals with general future expectations regarding President Bush.

 
Table 3: The keywords of each topic, assuming 7 topics
Topic 1    Topic 2    Topic 3    Topic 4    Topic 5    Topic 6    Topic 7
conserv    god        www        won        violenc    peopl      power
polit      religion   http       vote       gun        kennedi    california
religion   thing      html       count      report     elect      electr
liber      jesu       index      gore       school     cnn        don
govern     bibl       cnn        bush       youth      dai        blackout
mind       doesn      time       stop       children   live       plant
free       white      world      hand       point      call       energi
opinion    don        thing      recount    home       bush       deregul
form       work       stori      court      drug       back       state
life       kill       good       florida    famili     man        problem
philosophi talk       make       don        major      senat      crisi
establish  good       put        win        gener      presid     build
 

Estimating fewer than 7 topics gives a subset of the most clearly formed topics, in addition to a mixture of the other chat contributions. This is seen in Table 4, where 4 topics are assumed. Topics 1, 3 and 4 now correspond to topics 5, 2 and 7 in Table 3, and topic 2 is a mixture of several discussions.

 
Table 4: The most representative terms of each topic, assuming 4 topics
Topic 1    Topic 2    Topic 3    Topic 4
violenc    bush       god        power
gun        don        peopl      california
report     good       religion   electr
school     thing      make       energi
youth      kennedi    jesu       blackout
children   hear       life       plant
point      time       call       peopl
major      white      bibl       deregul
peopl      fact       doesn      state
drug       vote       man        dai
home       www        back       build
famili     talk       read       compani
 

Our experimental results suggest that the topics are not strictly independent of one another; instead, some topics share commonalities while others are clearly distinct. This encourages a topographic extension of the complexity pursuit algorithm in the future.

5. CONCLUSIONS AND FUTURE WORK

In this paper we have shown experimental results on how minimum complexity projections of dynamically evolving textual data identify underlying latent topics in the text stream. In contrast to static text collections, methods for analyzing dynamical text have been scarce. As an example of such dynamically evolving data we used chat line discussion data found in Internet relay chat rooms. We modeled the topics as probability distributions on terms; the distributions having minimum complexity were assumed to best represent the distinct topics.

In our experiments we were able to find distinct and meaningful topics of the discussion, outperforming some more traditional methods. The results suggest that our method could serve in query-based retrieval from a temporally changing text stream, of which news-feed data is another example.

As some topics of discussion are more distinct than others, natural extensions of the work include recursive binary partitioning [10] of the data, or finding the topographic structure [4] of the topics.

6. REFERENCES

[1] R. A. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, New York, 1999.

[2] S. Deerwester et al. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.

[3] A. Hyvärinen. Complexity pursuit: separating interesting components from time-series. Neural Computation. To appear.

[4] A. Hyvärinen, P. O. Hoyer, and M. Inki. Topographic independent component analysis. Neural Computation. To appear.

[5] C. L. Isbell and P. Viola. Restructuring sparse high dimensional data for effective retrieval. In Advances in Neural Information Processing Systems 11, pages 480-486, 1998.

[6] A. Kabán and M. Girolami. Unsupervised topic separation and keyword identification in document collections: a projection approach. Tech. Rep. 10, Dept. of Computing and Information Systems, University of Paisley, August 2000.

[7] T. Kohonen et al. Self organization of a massive document collection. IEEE Transactions on Neural Networks, 11(3):574-585, May 2000. Special Issue on Neural Networks for Data Mining and Knowledge Discovery.

[8] T. Kolenda and L. K. Hansen. Dynamical components of chat. Tech. rep., Technical University of Denmark, 2000.

[9] T. Kolenda, L. K. Hansen, and S. Sigurdsson. Independent components in text. In M. Girolami, editor, Advances in Independent Component Analysis, chapter 13, pages 235-256. Springer-Verlag, 2000.

[10] P. Pajunen and M. Girolami. Implementing decisions in binary decision trees using independent component analysis. In P. Pajunen and E. Oja, editors, Proc. of ICA2000, pages 477-481, 2000.

[11] G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.