Steven C. H. Hoi
Rong Jin
Michael R. Lyu
Department of Computer Science and Engineering
Department of Computer Science and Engineering
The Chinese University of Hong Kong
Michigan State University
Shatin, N.T., Hong Kong
East Lansing, MI 48824, U.S.A.
{chhoi, lyu}@cse.cuhk.edu.hk
rongjin@cse.msu.edu
28 February 2006
H.3.3 [Information Systems]: Information Search and Retrieval; I.5.2 [Design Methodology]: Classifier Design and Evaluation
Algorithms, Performance, Experimentation
text categorization, active learning, logistic regression, Fisher information, convex optimization
To reduce the number of labeled documents, a number of studies in the past have applied active learning to text categorization. The main idea is to select only the most informative documents for manual labeling. Most active learning algorithms proceed in an iterative fashion: in each iteration, the example with the largest classification uncertainty is chosen for manual labeling, and the classification model is then retrained with the additional labeled example. The step of training a classification model and the step of soliciting a labeled example are alternated until most of the examples can be classified with reasonably high confidence. One of the main problems with such a scheme is that only a single example is selected for labeling; as a result, the classification model has to be retrained after every labeled example is solicited.
In this paper, we propose a novel active learning scheme that is able to select a batch of unlabeled examples in each iteration. A simple strategy toward batch mode active learning is to select the top most informative examples. The problem with such an approach is that some of the selected examples could be similar, or even identical, and therefore provide no additional information for model updating. In general, the key to batch mode active learning is to ensure small redundancy among the selected examples, so that each example provides unique information for model updating. To this end, we use the Fisher information matrix, which represents the overall uncertainty of a classification model. We choose the set of examples such that the Fisher information of the classification model is effectively maximized.
The rest of this paper is organized as follows. Section 2 reviews related work on text categorization and active learning. Section 3 briefly introduces logistic regression, which is used as the classification model for text categorization in our study. Section 4 presents the batch mode active learning algorithm, together with an efficient solver based on bound optimization. Section 5 presents the results of our empirical study. Section 6 sets out our conclusions.
Text categorization is a long-standing research topic that has been actively studied in the Web data mining, information retrieval, and statistical learning communities [15,35]. Text categorization techniques have been key to the automated categorization of large-scale Web pages and Web sites [18,27], which in turn helps Web search engines find relevant documents and assists users in browsing Web pages or Web sites.
In the past decade, a number of statistical learning techniques have been applied to text categorization [34], including K-Nearest-Neighbor approaches [20], decision trees [2], Bayesian classifiers [32], inductive rule learning [5], neural networks [23], and support vector machines (SVM) [9]. Empirical studies in recent years [9] have shown that SVM is the state-of-the-art technique among the methods mentioned above.
Recently, logistic regression, a traditional statistical tool, has attracted considerable attention for text categorization and high-dimensional data mining [12]. Several recent studies have shown that the logistic regression model can achieve classification accuracy comparable to SVMs in text categorization. Compared to SVMs, the logistic regression model has the advantage that it is usually more efficient in model training, especially when the number of training documents is large [13,36]. This motivates us to choose logistic regression as the base classifier for large-scale text categorization.
The other critical issue for large-scale text categorization is how to reduce the number of labeled documents required for building reliable text classification models. Given a limited amount of labeled documents, the key is to exploit the unlabeled documents. One solution is semi-supervised learning, which tries to learn a classification model from a mixture of labeled and unlabeled examples [30]; comprehensive studies of semi-supervised learning techniques can be found in [25,38]. Another solution is active learning [19,26], which tries to choose the most informative unlabeled examples for manual labeling. Although previous studies have shown promising performance of semi-supervised learning for text categorization [11], its high computational cost has limited its application [38]. In this paper, we focus our discussion on active learning.
Active learning, also called pool-based active learning, has been studied extensively in machine learning for many years and has already been employed for text categorization in the past [16,17,21,22]. Most active learning algorithms proceed in an iterative fashion: in each iteration, the example with the highest classification uncertainty is chosen for manual labeling, and the classification model is then retrained with the additional labeled example. The step of training a classification model and the step of soliciting a labeled example are alternated until most of the examples can be classified with reasonably high confidence. One of the key issues in active learning is how to measure the classification uncertainty of unlabeled examples. In [6,7,8,14,21,26], a number of distinct classification models are first generated; the classification uncertainty of a test example is then measured by the amount of disagreement among this ensemble of classification models in predicting the label of the test example. Another group of approaches measures the classification uncertainty of a test example by how far the example is from the classification boundary (i.e., the classification margin) [4,24,31]. One of the most well-known approaches within this group is support vector machine active learning, developed by Tong and Koller [31]. Due to its popularity and success in previous studies, it is used as the baseline approach in our study.
One of the main problems with most existing active learning algorithms is that only a single example is selected for labeling; as a result, the classification model has to be retrained after each labeled example is solicited. In this paper, we focus on batch mode active learning, which selects a batch of unlabeled examples in each iteration. A simple strategy is to choose the top most uncertain examples. However, some of the most uncertain examples are likely to be strongly correlated, or even identical in extreme cases, and such examples are redundant in the informative clues they provide to the classification model. In general, the challenge in choosing a batch of unlabeled examples is twofold: on one hand, the examples in the selected batch should be informative to the classification model; on the other hand, the examples should be diverse enough that the information provided by different examples does not overlap. To address this challenge, we employ the Fisher information matrix as the measurement of model uncertainty, and choose the set of examples that most efficiently maximizes the Fisher information of the classification model. The Fisher information matrix has been widely used in statistics for measuring model uncertainty [28]. For example, in the Cramér-Rao bound, the Fisher information matrix provides the lower bound on the variance of a statistical estimator. In this study, we choose the set of examples that can well represent the structure of the Fisher information matrix.
In this section, we give a brief background review of logistic regression, a well-known and mature statistical model suitable for probabilistic binary classification. Recently, logistic regression has been actively studied in the statistical machine learning community due to its close relation to SVMs and AdaBoost [33,36]. Compared with many other statistical learning models, such as SVMs, the logistic regression model has the following advantages:

- It can be applied to both real-valued and binary data.
- It outputs posterior probabilities for test examples, which can be conveniently processed by and engaged in other systems.

In theory, given a test example $\mathbf{x}$, logistic regression models the conditional probability of assigning a class label $y \in \{-1, +1\}$ to the example by

$$p(y\,|\,\mathbf{x}) = \frac{1}{1 + \exp\bigl(-y(\mathbf{w}^{\top}\mathbf{x} + b)\bigr)} \qquad (1)$$

where $\mathbf{w}$ is the weight vector and $b$ is the bias term of the model.
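To make Eq. (1) concrete, the following minimal Python sketch (our own illustration; the function and variable names are not from the paper) evaluates the logistic posterior:

```python
import numpy as np

def posterior(x, w, b, y):
    """p(y | x) for a logistic regression model, as in Eq. (1).

    x: feature vector; w: weight vector; b: bias term; y: label in {-1, +1}.
    """
    return 1.0 / (1.0 + np.exp(-y * (np.dot(w, x) + b)))
```

Since the two posteriors sum to one, $p(+1\,|\,\mathbf{x})$ alone determines how uncertain the model is about $\mathbf{x}$, which is convenient for the uncertainty measures used later.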
In this section, we present a batch mode active learning algorithm for large-scale text categorization. In our proposed scheme, logistic regression is used as the base classifier for binary classification. In the following, we first introduce the theoretical foundation of our active learning algorithm. Based on this theoretical framework, we then formulate the active learning problem as a semi-definite programming (SDP) problem [3]. Finally, we present an efficient learning algorithm for the related optimization problem based on an eigenspace simplification and a bound optimization strategy.
Let $p(\mathbf{x})$ be the distribution of all unlabeled examples, and let $q(\mathbf{x})$ be the distribution of the unlabeled examples that are chosen for manual labeling. Let $\alpha$ denote the parameters of the classification model, and let $I_p(\alpha)$ and $I_q(\alpha)$ denote the Fisher information matrices of the classification model for the distributions $p(\mathbf{x})$ and $q(\mathbf{x})$, respectively. Then the set of examples that can most efficiently reduce the uncertainty of the classification model is found by minimizing the ratio of the two Fisher information matrices $I_p(\alpha)$ and $I_q(\alpha)$, i.e.,

$$\min_{q}\ \operatorname{tr}\bigl(I_q(\alpha)^{-1} I_p(\alpha)\bigr) \qquad (2)$$
For the logistic regression model, the Fisher information is attained as:

$$I_q(\alpha) = -\int q(\mathbf{x}) \sum_{y \in \{-1,+1\}} p(y\,|\,\mathbf{x})\, \frac{\partial^2}{\partial \alpha^2} \log p(y\,|\,\mathbf{x})\, d\mathbf{x} \qquad (3)$$

$$\phantom{I_q(\alpha)} = \int q(\mathbf{x})\, \frac{1}{1 + \exp(\alpha^{\top}\mathbf{x})} \cdot \frac{1}{1 + \exp(-\alpha^{\top}\mathbf{x})}\, \mathbf{x}\mathbf{x}^{\top}\, d\mathbf{x} \qquad (4)$$

In practice, both distributions are replaced by their empirical counterparts over the pool of $n$ unlabeled examples $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$. Writing $\pi_i = p(+1\,|\,\mathbf{x}_i)$, the two Fisher information matrices become

$$I_p(\alpha) = \frac{1}{n} \sum_{i=1}^{n} \pi_i (1 - \pi_i)\, \mathbf{x}_i \mathbf{x}_i^{\top} + \delta I_d \qquad (5)$$

$$I_q(\alpha) = \sum_{i=1}^{n} q_i\, \pi_i (1 - \pi_i)\, \mathbf{x}_i \mathbf{x}_i^{\top} + \delta I_d \qquad (6)$$

where $q_i$ is the probability of selecting the $i$-th example, $\delta \ll 1$ is a smoothing constant, and $I_d$ is the identity matrix of size $d$.
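To make these quantities concrete, here is a short Python sketch (our own illustration; names are hypothetical, and the bias term is assumed folded into $\alpha$ via a constant feature) computing the empirical Fisher information matrices of Eqs. (5)-(6) and the selection objective of Eq. (2):

```python
import numpy as np

def fisher_information(X, alpha, q, delta=1e-6):
    """Empirical Fisher information under sampling weights q, Eqs. (5)-(6)."""
    pi = 1.0 / (1.0 + np.exp(-(X @ alpha)))      # pi_i = p(+1 | x_i)
    s = q * pi * (1.0 - pi)                      # q_i * pi_i * (1 - pi_i)
    return (X * s[:, None]).T @ X + delta * np.eye(X.shape[1])

def selection_objective(X, alpha, q, delta=1e-6):
    """Objective of Eq. (2): tr(I_q^{-1} I_p), with p uniform over the pool."""
    n = X.shape[0]
    I_p = fisher_information(X, alpha, np.full(n, 1.0 / n), delta)
    I_q = fisher_information(X, alpha, q, delta)
    return np.trace(np.linalg.solve(I_q, I_p))
```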
In this section, we qualitatively justify the theory of minimizing the Fisher information for batch mode active learning. In particular, we consider two cases: the case of selecting a single unlabeled example, and the case of selecting two unlabeled examples simultaneously. To simplify our discussion, we assume $\|\mathbf{x}\|_2 = 1$ for all unlabeled examples.

Selecting a single unlabeled example. The Fisher information matrix $I_q(\alpha)$ simplifies to the following form when the $i$-th example is selected:

$$I_q(\alpha) = \pi_i (1 - \pi_i)\, \mathbf{x}_i \mathbf{x}_i^{\top} + \delta I_d \qquad (7)$$

In this case, minimizing $\operatorname{tr}(I_q(\alpha)^{-1} I_p(\alpha))$ favors the example with the largest classification uncertainty $\pi_i(1-\pi_i)$, which is consistent with conventional uncertainty-based active learning.
Selecting two unlabeled examples simultaneously. To simplify our discussion, we assume that the three examples $\mathbf{x}_1$, $\mathbf{x}_2$, and $\mathbf{x}_3$ have the largest classification uncertainty, with $\mathbf{x}_1$ and $\mathbf{x}_2$ the two most uncertain. Let us further assume that $\mathbf{x}_1 \approx \mathbf{x}_2$, while $\mathbf{x}_3$ is far away from both $\mathbf{x}_1$ and $\mathbf{x}_2$. If we follow the simple greedy approach, the two examples $\mathbf{x}_1$ and $\mathbf{x}_2$ will be selected, given their largest classification uncertainty. Apparently, this is not an optimal strategy, since both examples provide almost identical information for updating the classification model. Now, if we instead follow the criterion of minimizing the Fisher information, this mistake can be prevented: with $\mathbf{x}_1 \approx \mathbf{x}_2$, the matrix $I_q(\alpha)$ built from the pair $(\mathbf{x}_1, \mathbf{x}_2)$ is close to rank one, so the objective $\operatorname{tr}(I_q(\alpha)^{-1} I_p(\alpha))$ in (2) is much larger for the pair $(\mathbf{x}_1, \mathbf{x}_2)$ than for the better-spread pair $(\mathbf{x}_1, \mathbf{x}_3)$.
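The following toy computation (entirely our own, with made-up vectors, and with the uncertainty factors $\pi_i(1-\pi_i)$ dropped since they are assumed comparable) illustrates the argument numerically:

```python
import numpy as np

# Three hypothetical pool examples: x1 and x2 nearly identical, x3 far away.
X = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]])
delta = 1e-4

def objective(selected):
    """tr(I_q^{-1} I_p) for the chosen pair, uncertainty factors dropped."""
    I_p = X.T @ X / len(X) + delta * np.eye(2)
    S = X[list(selected)]
    I_q = S.T @ S / len(S) + delta * np.eye(2)
    return np.trace(np.linalg.solve(I_q, I_p))

print(objective([0, 1]))  # (x1, x2): I_q is nearly rank one -> huge objective
print(objective([0, 2]))  # (x1, x3): well-spread batch -> small objective
```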
Given the optimization problem in (2), we can rewrite the objective function $\operatorname{tr}(I_q(\alpha)^{-1} I_p(\alpha))$ as $\operatorname{tr}\bigl(I_p(\alpha)^{1/2} I_q(\alpha)^{-1} I_p(\alpha)^{1/2}\bigr)$. We then introduce a slack matrix $M$ such that $M \succeq I_p(\alpha)^{1/2} I_q(\alpha)^{-1} I_p(\alpha)^{1/2}$. The original optimization problem can then be rewritten as follows:

$$\min_{q,\,M}\ \operatorname{tr}(M) \qquad (8)$$

$$\text{s.t.} \quad \begin{pmatrix} M & I_p(\alpha)^{1/2} \\ I_p(\alpha)^{1/2} & I_q(\alpha) \end{pmatrix} \succeq 0, \qquad \sum_{i=1}^{n} q_i = 1, \quad q_i \geq 0,\ i = 1, \ldots, n \qquad (9)$$

where, by the Schur complement, the linear matrix inequality holds if and only if $M \succeq I_p(\alpha)^{1/2} I_q(\alpha)^{-1} I_p(\alpha)^{1/2}$. Since $I_q(\alpha)$ is linear in $q$, this is a standard semi-definite program.
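As a sketch of how the SDP in (8)-(9) can be set up, the snippet below uses the cvxpy modeling package; this is our choice for illustration only, and the paper itself does not solve the SDP directly but resorts to the bound optimization approach described next:

```python
import cvxpy as cp
import numpy as np
from scipy.linalg import sqrtm

def batch_selection_sdp(X, pi, delta=1e-6):
    """Sketch of the SDP in Eqs. (8)-(9) over the sampling weights q."""
    n, d = X.shape
    u = pi * (1.0 - pi)                                   # pi_i (1 - pi_i)
    I_p = (X * (u / n)[:, None]).T @ X + delta * np.eye(d)
    I_p_half = np.real(sqrtm(I_p))                        # I_p^{1/2}

    q = cp.Variable(n, nonneg=True)
    M = cp.Variable((d, d), symmetric=True)
    # I_q(alpha) is linear in q, so the block constraint is a linear matrix inequality.
    I_q = sum(q[i] * u[i] * np.outer(X[i], X[i]) for i in range(n)) + delta * np.eye(d)
    lmi = cp.bmat([[M, I_p_half], [I_p_half, I_q]]) >> 0
    cp.Problem(cp.Minimize(cp.trace(M)), [lmi, cp.sum(q) == 1]).solve()
    return q.value
```

For pools of thousands of documents with tens of thousands of features, such an SDP is far too expensive, which is precisely what motivates the eigenspace simplification and bound optimization below.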
Directly solving the semi-definite program above is computationally expensive for large document collections. To simplify the computation, we approximate $I_p(\alpha)$ by its principal eigenvectors, i.e.,

$$I_p(\alpha) \approx \sum_{k=1}^{K} \lambda_k\, \mathbf{v}_k \mathbf{v}_k^{\top} \qquad (14)$$

where $\lambda_1 \geq \cdots \geq \lambda_K$ are the top $K$ eigenvalues of $I_p(\alpha)$ and $\mathbf{v}_1, \ldots, \mathbf{v}_K$ the corresponding eigenvectors. The optimization problem then reduces to

$$\min_{q}\ \mathcal{L}(q) = \sum_{k=1}^{K} \lambda_k\, \mathbf{v}_k^{\top} I_q(\alpha)^{-1} \mathbf{v}_k \qquad (15)$$

$$\text{s.t.} \quad \sum_{i=1}^{n} q_i = 1, \quad q_i \geq 0,\ i = 1, \ldots, n. \qquad (16)$$

Let $q$ and $q'$ denote the solutions obtained in two consecutive iterations, and let $\mathcal{L}(q)$ be the objective function in (15). Based on the proof given in Appendix A, $\mathcal{L}(q')$ can be bounded from above by a function of $q$ whose minimization yields the following multiplicative updating equation:

$$q_i' \;=\; \frac{q_i \sqrt{\pi_i (1-\pi_i) \sum_{k=1}^{K} \lambda_k \bigl(\mathbf{v}_k^{\top} I_q(\alpha)^{-1} \mathbf{x}_i\bigr)^2}}{\sum_{j=1}^{n} q_j \sqrt{\pi_j (1-\pi_j) \sum_{k=1}^{K} \lambda_k \bigl(\mathbf{v}_k^{\top} I_q(\alpha)^{-1} \mathbf{x}_j\bigr)^2}} \qquad (17)$$
Remark: It is interesting to examine the properties of the solution obtained by the updating equation in (17). First, according to (17), an example with large classification uncertainty will be assigned a large probability. This is because $q_i$ is proportional to $\sqrt{\pi_i(1-\pi_i)}$, the classification uncertainty of the $i$-th unlabeled example. Second, according to (17), an example that is similar to many unlabeled examples is more likely to be selected. This is because the probability $q_i$ is proportional to the term $\sum_{k} \lambda_k (\mathbf{v}_k^{\top} I_q(\alpha)^{-1} \mathbf{x}_i)^2$, which measures the similarity of the $i$-th example to the principal eigenvectors of $I_p(\alpha)$. This is consistent with our intuition that we should select the most informative and representative examples for active learning.
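A compact Python sketch of this bound optimization loop follows; since the update in (17) is itself reconstructed above, this code should be read as our interpretation rather than the paper's exact procedure, and all names are hypothetical:

```python
import numpy as np

def bound_optimization(X, pi, K=50, delta=1e-6, n_iter=100):
    """Iterate the multiplicative update of Eq. (17) to obtain q."""
    n, d = X.shape
    u = pi * (1.0 - pi)                                  # pi_i (1 - pi_i)
    I_p = (X * (u / n)[:, None]).T @ X + delta * np.eye(d)
    lam, V = np.linalg.eigh(I_p)
    lam, V = lam[-K:], V[:, -K:]                         # top-K eigenpairs, Eq. (14)

    q = np.full(n, 1.0 / n)                              # start from uniform
    for _ in range(n_iter):
        I_q = (X * (q * u)[:, None]).T @ X + delta * np.eye(d)
        A = np.linalg.solve(I_q, X.T)                    # columns are I_q^{-1} x_i
        score = ((V.T @ A) ** 2).T @ lam                 # sum_k lam_k (v_k^T I_q^{-1} x_i)^2
        q *= np.sqrt(u * score)                          # multiplicative update, Eq. (17)
        q /= q.sum()                                     # keep q a distribution, Eq. (16)
    return q
```

The batch is then formed by taking the unlabeled examples with the largest $q_i$.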
In this section, we discuss the experimental evaluation of our active learning algorithm in comparison to state-of-the-art approaches. For a consistent evaluation, we conduct our empirical comparisons on three standard datasets for text document categorization. The same pre-processing procedure is applied to all three datasets: stopwords and numbers are removed from the documents, and all words are converted to lower case without stemming.
The first dataset is the Reuters-21578 corpus, which has been widely used as a testbed for evaluating text categorization algorithms. In our experiments, the ModApte split of Reuters-21578 is used. Table 1 lists the most frequent categories contained in the dataset. Since each document in the dataset can be assigned to multiple categories, we treat the text categorization problem as a set of binary classification problems, i.e., a separate binary classification problem for each category. Word features are extracted and used to represent the text documents.
The other two datasets are Web-related: the WebKB data collection and the Newsgroup data collection. The WebKB dataset comprises WWW pages collected from the computer science departments of various universities in January 1997 by the World Wide Knowledge Base (WebKB) project of the CMU text learning group. All the Web pages are classified into seven categories: student, faculty, staff, department, course, project, and other. In this study, we ignore the category ``other'' due to its unclear definition. Table 2 shows the details of this dataset, including the number of data samples and the number of extracted word features. The Newsgroup dataset includes messages from 20 different newsgroups, each containing roughly 1,000 messages. In this study, we randomly select a subset of the 20 newsgroups for evaluation; Table 3 shows the details of the selected data samples and the word features extracted to represent the text documents.
Compared to the Reuters-21578 dataset, the two Web-related data collections differ in that considerably more unique words are found in them. For example, the Reuters-21578 dataset and the Newsgroup dataset contain roughly similar numbers of documents, but the number of unique words in the Newsgroup dataset is about twice the number found in Reuters-21578. It is this feature that makes the text categorization of WWW documents more challenging than the categorization of normal text documents: considerably more feature weights need to be determined for WWW documents. It is also this feature that makes active learning algorithms more valuable for text categorization of WWW documents, since by selecting informative documents for manual labeling, we are able to determine the appropriate weights for more words than with randomly chosen documents.
In order to remove uninformative word features, feature selection is conducted using the Information Gain criterion [35]. In particular, a fixed number of the most informative features is selected for each category in each of the three datasets above.
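For reference, a minimal sketch of the Information Gain criterion for binary term-occurrence features (our own illustration; the exact variant used in [35] may differ in detail):

```python
import numpy as np

def information_gain(X, y):
    """Information gain of each binary term feature w.r.t. binary labels y in {0, 1}."""
    def entropy(p):
        p = np.clip(p, 1e-12, 1.0 - 1e-12)
        return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

    h_y = entropy(y.mean())
    ig = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        present = X[:, j] > 0                    # documents containing term j
        p_t = present.mean()
        if 0.0 < p_t < 1.0:
            h_cond = (p_t * entropy(y[present].mean())
                      + (1.0 - p_t) * entropy(y[~present].mean()))
            ig[j] = h_y - h_cond                 # IG = H(Y) - H(Y | term j)
    return ig

# Rank the features by information gain and keep the top ones per category.
```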
For performance measurement, the $F_1$ metric is adopted as our evaluation metric; it has been shown to be a more reliable metric than other metrics such as classification accuracy [35]. More specifically, $F_1$ is defined as

$$F_1 = \frac{2 \cdot p \cdot r}{p + r} \qquad (20)$$

where $p$ and $r$ denote precision and recall, respectively.
To examine the effectiveness of the proposed active learning algorithm, two reference models are used in our experiments. The first reference model is a logistic regression active learning algorithm that measures the classification uncertainty based on the entropy of the distribution $p(y\,|\,\mathbf{x})$. In particular, for a given test example $\mathbf{x}$ and a logistic regression model with weight vector $\mathbf{w}$ and bias term $b$, the entropy of the distribution $p(y\,|\,\mathbf{x})$ is calculated as:

$$H(\mathbf{x}) = -\sum_{y \in \{-1,+1\}} p(y\,|\,\mathbf{x}) \log p(y\,|\,\mathbf{x})$$

where $p(y\,|\,\mathbf{x})$ is given by Eq. (1); the larger the entropy of $\mathbf{x}$, the more uncertain the classification model is about its label. The second reference model is the SVM active learning algorithm of Tong and Koller [31], which measures uncertainty by the distance of an example to the SVM decision boundary. In the following, we refer to these two reference models as LogReg-AL and SVM-AL, and to the proposed batch mode algorithm as LogReg-BMAL.
To evaluate the performance of the proposed active learning algorithms, we first randomly pick a small number of training samples, containing both positive and negative examples, from the dataset for each category. Both the logistic regression model and the SVM classifier are trained on this labeled data. For the active learning methods, a batch of unlabeled data samples is then chosen for labeling, and performance is evaluated after rebuilding the classifiers with the enlarged training set. Each experiment is repeated multiple times, and the averaged $F_1$ with its variance is calculated and used for the final evaluation.
To deploy efficient implementations of our scheme for large-scale text categorization tasks, all the algorithms used in this study are programmed in the C language. The testing hardware environment is a Linux workstation. To implement the logistic regression algorithm for our text categorization tasks, we employ the logistic regression tool recently developed by Komarek and Moore [13]. To implement our active learning algorithm based on the bound optimization approach, we employ a standard math package, LAPACK [1], to solve the eigendecomposition in our algorithm efficiently. The SVM package of [10] is used in our experiments for the implementation of SVM, which has been considered the state-of-the-art tool for text categorization. Since SVM is not parameter-free and can be very sensitive to the capacity parameter, a separate validation set is used to determine the optimal parameter configuration.
[Figures 1-3: mean $F_1$ versus the number of labeled examples on the nine major Reuters-21578 categories: ``earn'', ``acq'', ``money-fx'', ``grain'', ``crude'', ``trade'', ``interest'', ``wheat'', and ``ship''.]
Table 4 shows the experimental results of $F_1$ performance, averaged over multiple executions, on the major categories in the dataset.
First, as listed in the first and second columns of Table 4, we observe that the performance of the two classifiers, logistic regression and SVM, is comparable when only the initially labeled examples are used for training. For some categories, such as ``trade'' and ``interest'', SVM achieves noticeably better performance than the logistic regression model. Second, we compare the performance of the two classifiers for active learning, i.e., LogReg-AL and SVM-AL, the greedy algorithms that select the most informative examples for manual labeling. The results are listed in the third and fourth columns of Table 4. We find that the performance of these two active learning methods becomes closer than in the case where no actively labeled examples are used for training. For example, for the category ``trade'', SVM performs substantially better than the logistic regression model when only the initial labeled examples are used, but the difference in $F_1$ measurement between LogReg-AL and SVM-AL almost diminishes once both classifiers use the actively labeled examples for training. Finally, we compare the performance of the proposed active learning algorithm, LogReg-BMAL, to the margin-based active learning approaches LogReg-AL and SVM-AL. It is evident that the proposed batch mode active learning algorithm outperforms the margin-based active learning algorithms. For categories such as ``corn'' and ``wheat'', where the two margin-based active learning algorithms achieve similar performance, the proposed algorithm LogReg-BMAL is able to achieve substantially better $F_1$ scores. Even for the categories where SVM performs substantially better than the logistic regression model, the proposed algorithm is able to outperform the SVM-based active learning algorithm noticeably. For example, for the category ``ship'', where SVM performs noticeably better than logistic regression, the proposed active learning method achieves even better performance than the margin-based active learning built on the SVM classifier.
In order to evaluate the performance in more detail, we conduct an evaluation on each category while varying the number of initially labeled instances for each classifier. Fig. 1, Fig. 2, and Fig. 3 show the experimental results of the mean $F_1$ measurement on the major categories. From the experimental results, we can see that our active learning algorithm outperforms the other two active learning algorithms in most cases, while the SVM-AL method is generally better than the LogReg-AL method. We also find that the improvement of our active learning method over the other two approaches is more evident when the number of labeled instances is smaller. This is because the smaller the number of initially labeled examples used for training, the larger the improvement we can expect; when more labeled examples are used for training, the room for further improvement shrinks, and the three methods start to behave similarly. This result also indicates that the proposed active learning algorithm is robust even when the number of labeled examples is small, while the other two active learning approaches may suffer critically when the margin criterion is not very accurate in the small-sample case.
The classification results of the WebKB dataset and the Newsgroup dataset are listed in Table 5 and Table 6, respectively.
First, notice that for the two Web-related datasets, there are a few categories whose $F_1$ measurements are extremely low. For example, for the category ``staff'' of the WebKB dataset, the $F_1$ measurement is very low for all methods. This fact indicates that the text categorization of WWW documents can be more difficult than the categorization of normal documents. Second, we observe that the difference in $F_1$ measurement between the logistic regression model and the SVM is smaller for the WebKB and Newsgroup datasets than for the Reuters-21578 dataset. In fact, there are a few categories in WebKB and Newsgroup for which the logistic regression model performs slightly better than the SVM.
Third, comparing the two margin-based approaches for active learning, LogReg-AL and SVM-AL, we observe that for a number of categories, LogReg-AL achieves substantially better performance than SVM-AL. The most noticeable case is a category of the Newsgroup dataset on which the SVM-AL algorithm is unable to improve the $F_1$ measurement over the plain SVM even with the additional labeled examples, while the LogReg-AL algorithm improves the $F_1$ measurement considerably. Finally, comparing the LogReg-BMAL algorithm with the LogReg-AL algorithm, we observe that the proposed algorithm is able to improve the $F_1$ measurement substantially over the margin-based approach. For example, for one category of the Newsgroup dataset, the LogReg-AL algorithm makes only a slight improvement in the $F_1$ measurement with the additional labeled examples, whereas the improvement for the same category by the proposed batch active learning algorithm is much more significant.
Comparing all the learning algorithms, the proposed algorithm achieves the best, or close to the best, performance for almost all categories. This observation indicates that the proposed active learning algorithm is effective and robust for large-scale text categorization of WWW documents.
This paper presents a novel active learning algorithm that is able to select a batch of informative and diverse examples for manual labeling. This differs from traditional active learning algorithms, which focus on selecting the single most informative example for manual labeling. We use the Fisher information matrix as the measurement of model uncertainty and choose the set of examples that effectively maximizes the Fisher information of the classification model. We conducted extensive experimental evaluations on three standard data collections for text categorization. The promising results demonstrate that our method is more effective than the margin-based active learning approaches, which have been the dominant methods for active learning. We believe our scheme is valuable for performing large-scale categorization of text documents, especially given the rapid growth of Web documents on the World Wide Web.