The problem of analyzing dynamically evolving textual data has recently arisen. An example of such data is the discussion appearing in Internet chat lines. In this paper a recently introduced method, termed complexity pursuit, is used to extract the topics of a dynamical chat line discussion. Experimental results demonstrate that meaningful topics can be found, and they also suggest the applicability of the method to query-based retrieval from a temporally changing text stream.
Keywords: chat line discussion, dynamical text, topic identification, complexity pursuit, data mining
Given the huge information flow on the Internet, there is a strong need for automatic textual data analysis tools. Algorithms developed for text mining from static text collections have been presented, e.g., in [1,2,7]. Our emphasis is on the recently emerged problem of analyzing dynamically evolving textual data; investigating appropriate tools for this task is of practical importance. An example of such data is found in Internet relay chat rooms: the topic of interest changes as the participants contribute. The online text stream can thus be seen as a time series, and methods of time series processing may be used to extract the topics of the discussion.
We present results of applying a recently introduced, powerful method, complexity pursuit [3], to topic extraction in a dynamically evolving discussion. Complexity pursuit uses both information-theoretic measures and time-correlations of the data, which makes it more powerful than methods using only one of these; methods of the latter kind include [5,6,8,9].
The discussion found in chat lines on the Internet is an ongoing stream of text generated by the chat participants and the chat line moderator. To analyze it using data mining methods, a convenient technique is to split the stream into windows, which may overlap if desired. Each such window can then be viewed as one document. We represent the documents using the vector space model [11]: each document forms one $T$-dimensional vector, where $T$ is the number of distinct terms in the vocabulary. The $i$-th element of the vector indicates (some function of) the frequency of the $i$-th vocabulary term in the document. The term-by-document matrix $X$ contains the document vectors as its columns and is of size $T \times N$, where $N$ is the number of documents.
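As an illustration, the following sketch builds such a matrix from a raw text stream. The file name chat_log.txt, the helper stream_to_documents, and the use of scikit-learn's CountVectorizer are our own choices for this example, not tools used in the paper.

```python
# Illustrative sketch: overlapping windows + vector space model.
from sklearn.feature_extraction.text import CountVectorizer

def stream_to_documents(text, window_chars=750, overlap=0.5):
    """Split a text stream into overlapping fixed-size windows ("documents")."""
    step = int(window_chars * (1 - overlap))
    return [text[i:i + window_chars] for i in range(0, len(text), step)]

text_stream = open("chat_log.txt").read()   # hypothetical cleaned chat stream
docs = stream_to_documents(text_stream)

# Term-by-document matrix X (T x N): rows are vocabulary terms,
# columns are window documents, entries are term frequencies.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs).T        # sparse matrix of size T x N
```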
As a preprocessing step we perform the LSI [2] of the data matrix $X$ by computing its singular value decomposition, thus acquiring a lower-dimensional projection of the high-dimensional data. We denote the new data matrix as $Z = (\mathbf{z}(t))$. The topics of the discussion can be found by projecting $Z$ onto the directions given by the algorithm described in the following section. The number of estimated minimum complexity projections, $M$, may be smaller than $K$, the dimension of the LSI projection.
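Continuing the sketch above, the LSI projection might be computed with a truncated SVD as follows; the choice of scipy.sparse.linalg.svds is ours, and $K = 50$ anticipates the experiments reported below.

```python
# Illustrative sketch: LSI preprocessing via truncated SVD of the sparse X.
import numpy as np
from scipy.sparse.linalg import svds

K = 50                                # LSI dimension, K << T, N
U, S, Vt = svds(X.asfptype(), k=K)    # X ~ U @ diag(S) @ Vt

# Each column z(t) of Z is one document, in time order, in the LSI space.
Z = np.diag(S) @ Vt                   # dense matrix of size K x N
```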
Complexity pursuit [3] is a recently developed, computationally simple algorithm for separating "interesting" components from high-dimensional time series. These components -- here, the topics of the discussion -- are found by projecting the data onto the directions given by the algorithm. In complexity pursuit, an "interesting" component is one whose distribution has a low coding complexity, measured by an approximation of the Kolmogoroff complexity. We model the topics of the discussion as probability distributions over terms, and the distributions having minimum complexity are assumed to best represent the distinct topics.
We assume that the preprocessed $K$-dimensional observations $\mathbf{z}(t)$ are linear mixtures of some latent, time-correlated topics $s$. Both the latent topics and the mixing process are unknown. The topic time series are found by projecting the observation time series onto directions $\mathbf{w}$ as $s(t) = \mathbf{w}^T \mathbf{z}(t)$, where the projection directions $\mathbf{w}$ are to be found.
A separate first-order autoregressive (AR) model, here $\hat{s}(t) = \alpha s(t-1)$, is assumed to model each topic time series $s$. At every step of the algorithm, the AR constant $\alpha$ is first estimated. Then a gradient update of $\mathbf{w}$ is taken that minimizes the approximate Kolmogoroff complexity [3] of the AR residuals $r(t) = \mathbf{w}^T(\mathbf{z}(t) - \alpha \mathbf{z}(t-1))$:
$$\mathbf{w} := \mathbf{w} - \mu \, E\{(\mathbf{z}(t) - \alpha \mathbf{z}(t-1)) \, g(r(t))\} \qquad (1)$$
$$\mathbf{w} := \mathbf{w} / \|\mathbf{w}\| \qquad (2)$$
where $\mu$ is a step size and $g$ is the nonlinearity arising as the derivative of the log-density approximation used in the complexity measure [3].
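A minimal sketch of one such iteration, assuming the LSI matrix Z from the preprocessing sketch, is given below. Here tanh is used only as a stand-in for the nonlinearity $g$, and the step size and iteration count are arbitrary illustrative choices, not the authors' settings.

```python
# Illustrative sketch: one complexity pursuit iteration on Z (K x N).
import numpy as np

def complexity_pursuit_step(w, Z, mu=0.1):
    s = w @ Z                                       # topic time series s(t)
    alpha = (s[1:] @ s[:-1]) / (s[:-1] @ s[:-1])    # AR(1) coefficient estimate
    Y = Z[:, 1:] - alpha * Z[:, :-1]                # z(t) - alpha * z(t-1)
    r = w @ Y                                       # AR residuals r(t)
    w = w - mu * (Y @ np.tanh(r)) / r.size          # gradient step (1), tanh ~ g
    return w / np.linalg.norm(w)                    # normalization (2)

rng = np.random.default_rng(0)
w = rng.standard_normal(Z.shape[0])                 # random unit-norm start
w /= np.linalg.norm(w)
for _ in range(100):                                # iterate until convergence
    w = complexity_pursuit_step(w, Z)
```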
To estimate several projection directions we can proceed in a deflationary manner, making each new projection direction orthogonal to the previously found ones, similarly to Gram-Schmidt orthogonalization. After the estimation of $p$ projections, we run the algorithm for $\mathbf{w}_{p+1}$ and, after every iteration step, subtract from $\mathbf{w}_{p+1}$ its projections onto the $p$ previously estimated vectors, and then renormalize $\mathbf{w}_{p+1}$. Another possibility is to estimate all projections simultaneously in a symmetric manner and use the orthogonalizing decorrelation $W := (WW^T)^{-1/2} W$ instead of the normalization (2) in the above set of equations.
Details of the approximation of the Kolmogoroff complexity and its
minimization can be found in [3].
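Both orthogonalization schemes are simple to state in code. The sketch below assumes the previously found directions are stored as rows of a NumPy array; the function names are ours.

```python
# Illustrative sketch of the two orthogonalization schemes.
import numpy as np

def deflate(w, W_prev):
    """Gram-Schmidt step: remove from w its projections onto the
    previously estimated directions (rows of W_prev), then renormalize."""
    w = w - W_prev.T @ (W_prev @ w)
    return w / np.linalg.norm(w)

def symmetric_decorrelation(W):
    """Symmetric scheme: W := (W W^T)^{-1/2} W for all directions at once."""
    d, E = np.linalg.eigh(W @ W.T)
    return E @ np.diag(d ** -0.5) @ E.T @ W
```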
The computational load of the algorithm can be approximated as follows. The LSI preprocessing of a sparse $T \times N$ data matrix is of order $O(NTc)$, where $c$ is the number of nonzero entries in each column of the data matrix. (In our experiments the data matrix is very sparse and $c$ is about 30 to 40, as each document contains about 30 to 40 distinct vocabulary terms.) The algorithm itself is of order $O(NK^2M)$, where $K$ is the dimensionality ($K \ll T, N$) into which the data was projected using LSI, and $M$ is the number of topics estimated. Thus the algorithm scales linearly in the number of observations.
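For concreteness, with the values used in the experiments below ($N = 7430$, $K = 50$, $M = 10$), the pursuit stage amounts to on the order of $NK^2M \approx 1.9 \times 10^8$ elementary operations.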
The chat line data was collected from the CNN Newsroom chat line http://www.cnn.com/chat/channel/cnn_newsroom. A contiguous stream of almost 24 hours of discussion among 3200 chat participants, contributing 25 000 comment lines, was recorded on January 18th, 2001. The data was cleaned by omitting all user names and non-user-generated text. The remaining text stream was split into overlapping windows of about 750 characters. From these windows a term histogram was generated using the Bow toolkit http://www.cs.cmu.edu/~mccallum/bow/, resulting in a term-by-document matrix $X$ of size $T \times N = 4743 \times 7430$. The LSI of order $K = 50$ was computed as a preprocessing step. The choice of the number of estimated topics $M$ is relatively flexible; here we present results for $M = 10$, $7$ and $4$.
Figure 1 shows the topic time series $s(t)$ when 10 topics are assumed. We can see that the topics are autocorrelated in time; different topics are activated at different times. Some topics show short peaks of contribution whereas others are active for longer periods.
The validity of the identified topics is easy to evaluate using the most representative terms associated with each topic. These are obtained by projecting the term data $Z_{\mathrm{term}}$ (which is the document-by-term matrix $X^T$ projected into its LSI space) onto the minimum complexity directions $\mathbf{w}_i$ found earlier. By listing the terms corresponding to the highest peaks in the projection $\mathbf{w}_i^T Z_{\mathrm{term}}$ we get a list of keywords for the $i$-th topic. In Table 1 it is seen that each keyword list indeed characterizes one distinct topic quite clearly.
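Continuing the earlier sketches, the keyword extraction might look as follows; representing the terms in the LSI space as $Z_{\mathrm{term}} \approx \mathrm{diag}(S)\,U^T$ is our reading of the projection described above, not code from the paper.

```python
# Illustrative sketch: keywords for one topic direction w found earlier.
import numpy as np

Z_term = np.diag(S) @ U.T     # vocabulary terms in the LSI space, K x T
scores = w @ Z_term           # projection w_i^T Z_term: one score per term
terms = np.array(vectorizer.get_feature_names_out())
keywords = terms[np.argsort(scores)[::-1][:10]]   # terms at the 10 highest peaks
print(keywords)
```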
Topics 1, 4, 5, 6 and 9 in Table 1 deal with general daily politics in the US, with the emphasis on Jesse Jackson (topic 1), the Department of Justice (topic 4), the expected success of the country under the guidance of President Bush (topic 5), the values of the politicians (topic 6), and Ted Kennedy's political comments and the car accident he was involved in several years ago (topic 9). Topic 2 corresponds to comments given by the chat line operator. Topic 3 is about the presidential election in the US, especially the vote recounting in Florida. Topic 7 deals with problems of the youth: violence, drug abuse etc. Topic 8 involves the energy shortage in California in mid-January 2001. Topic 10 is a religious discussion. Some of the topics display similarity to other topics, whereas some topics are clearly distinct. This suggests estimating a smaller number of topics; results of this are presented later in this section.
Neglecting the time series nature of the data in the complexity pursuit algorithm yields an Independent Component Analysis type algorithm [5,6,9]. To demonstrate this, we leave out the autoregressive modeling of the topic time series in our complexity pursuit algorithm. The results are presented in Table 2. We can see that topics 2, 5, 8, 9 and 10 in Table 2 more or less correspond to topics 2, 3, 7, 8 and 6 in Table 1, respectively, but the keyword lists are not as informative as in the previous case. The rest of the topics in Table 2 are not meaningfully characterized by their keyword lists. Thus the presented method outperforms methods that do not treat the text data as dynamical.
Next, we analyze the results of assuming only 7 topics of discussion in the same data set. The topic time series are shown in Figure 2.
The most representative keywords for the seven estimated topics are listed in
Table 3.
Topics 1, 2, 4, 5 and 7 now correspond clearly to topics 6, 10, 3, 7 and 8
in Table 1. Topic 3 involves the
contributions of the chat line operator, now with a different set of keywords.
Topic 6 deals with general future expectations regarding President Bush.
Estimating fewer than 7 topics gives a subset of the most clearly formed topics, in addition to a mixture of the other chat contributions. This is seen in Table 4, where 4 topics are assumed. Topics 1, 3 and 4 now correspond to topics 5, 2 and 7 in Table 3, and topic 2 is now a mixture of several discussions.
Our experimental results suggest that the topics are not strictly independent of one another; instead, some topics share commonalities while others are clearly distinct. This encourages a topographic extension of the complexity pursuit algorithm in the future.
In this paper we have shown experimental results on how minimum complexity projections of dynamically evolving textual data identify underlying latent topics in the text stream. In contrast to the case of static text collections, methods for analyzing dynamical text have been scarce. As an example of such dynamically evolving data we used chat line discussion data found in Internet relay chat rooms. We modeled the topics as probability distributions over terms; the distributions having minimum complexity were assumed to best represent the distinct topics.
In our experiments we were able to find distinct and meaningful topics of discussion, outperforming some more traditional methods. The results suggest that our method could serve in query-based retrieval from a temporally changing text stream, of which news-feed data is another example.
As some topics of discussion are more distinct than others, natural extensions of the work include recursive binary partitioning [10] of the data, or finding the topographic structure [4] of the topics.