Document Logistics for Cooperative Research

Hai Zhuge
Knowledge Grid Research Group, Key Lab of Intelligent Information Processing,
Institute of Computing Technology, Chinese Academy of Sciences, 100080, Beijing, China
zhuge@ict.ac.cn

Yanyan Li
Knowledge Grid Research Group, Key Lab of Intelligent Information Processing,
Institute of Computing Technology, Chinese Academy of Sciences, 100080, Beijing, China
yyli75@yeah.net

ABSTRACT

This paper proposes a document logistics approach for cooperative research based on the web and Knowledge Grid. The approach realizes effective research document collection, organization and provision as well as knowledge sharing by incorporating the following functions: construction of semantic profiles representing interests, continuous discovery and collection of potentially relevant documents, synthesis of feedback evaluations, and support of flexible management operations and document recommendation services. The prototype has been implemented and is available for use online.

Keywords

Feedback synthesis, Knowledge Grid, Logistics, Profile, Web application.

1. INTRODUCTION

The rapid expansion of research documents available on the Web has left researchers constantly fighting information overload in their pursuit of knowledge. Although the Scientific Literature Digital Library [3] and other citation indices of scientific literature (such as NCSTRL and LTRS) have alleviated information overload to some extent, researchers still have to expend a great deal of time and effort looking for new documents that may interest them.

Information logistics is an innovative technology that aims to efficiently collect, organize and provide personalized heterogeneous information on demand. Document logistics is a special case of information logistics that aims to enhance the cooperation and efficiency of research groups by effectively organizing and managing their research documents. The Knowledge Grid is a platform that enables distributed heterogeneous resources spread across the Internet to be shared and managed in a uniform way [5,6,7].

Building on web information retrieval and the Knowledge Grid, this paper proposes a document logistics approach that serves research groups across the Internet.

2. CONSTRUCTING SEMANTIC PROFILES

Taking the terms specified by the researchers as core concepts, the system discovers further related concepts by mining and inferring concept associations in a collection of documents. The automatic profile construction process is as follows:

• Concept Extraction. Concepts usually refer to the nouns or noun phrases that characterize a document, so we mainly extract concepts from the specified sections of a document after eliminating noise and useless words.

• Co-occurrence Analysis. The co-occurrence analysis adopts the concept space approach [2] with some revisions, e.g., terms that occur in different locations are assigned different weights.

• Authority Identification. Heuristic and statistical methods based on the association weights are used to automatically identify authorities, i.e., the well-known journals, conferences, experts, documents, communities and websites.
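The co-occurrence step can be illustrated with a minimal sketch. It is not the concept space computation of [2], which uses asymmetric cluster weights; it is a simplified symmetric count, and the location names, the weight values and the function names are all assumptions introduced here for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical location weights (assumed values): a concept found in the
# title counts more than one found only in the body text.
LOCATION_WEIGHTS = {"title": 3.0, "abstract": 2.0, "body": 1.0}


def term_weights(doc):
    """Map each concept in one document to its largest location weight."""
    weights = {}
    for location, terms in doc.items():
        w = LOCATION_WEIGHTS.get(location, 1.0)
        for term in terms:
            weights[term] = max(weights.get(term, 0.0), w)
    return weights


def cooccurrence(docs):
    """Accumulate a symmetric association weight for every concept pair."""
    assoc = defaultdict(float)
    for doc in docs:
        weights = term_weights(doc)
        for a, b in combinations(sorted(weights), 2):
            # A pair is only as strong as its weaker member in this document.
            assoc[(a, b)] += min(weights[a], weights[b])
    return dict(assoc)


def related_concepts(core, assoc, top_n=5):
    """Return the concepts most strongly associated with a core concept."""
    scores = {}
    for (a, b), w in assoc.items():
        if a == core:
            scores[b] = w
        elif b == core:
            scores[a] = w
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


docs = [
    {"title": ["knowledge grid"], "abstract": ["semantic web", "knowledge grid"]},
    {"title": ["knowledge grid"], "body": ["semantic web", "ontology"]},
]
assoc = cooccurrence(docs)
print(related_concepts("knowledge grid", assoc))
```

Starting from a core concept supplied by a researcher, the highest-scoring neighbours would be candidates for inclusion in the profile.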

Profiles can be automatically updated by tracking users’ searching and browsing behaviors. Group members can also contribute to the profiles explicitly by manually editing and tuning them through the system's interface at any time. The representation of resources and profiles in the Knowledge Grid is based on markup languages of the semantic web such as XML and RDF [1]. Based on the profiles, we have developed a tool, GruDexer, that continuously collects relevant research documents using three methods: manual uploading, customized collection, and a meta-search engine.

3. FEEDBACK SYNTHESIS MECHANISM

Peers’ evaluations of documents are a useful reference for researchers. We adopt a dynamic mechanism that lets research groups generate evaluation criteria suited to their own research goals by consulting a set of general evaluation criteria specified by experts and research students. Currently, we simply combine the comments entered by users into one text, and use the following two formulas to compute the overall evaluation score, denoted E_j, for the jth document.

Where m is the total number of evaluation criteria, w_k is the weight assigned to the kth evaluation criterion (the default value is 1), S_k is the evaluation score corresponding to the kth evaluation criterion, n is the number of members who give feedback on the same document, V^i_jk is the value assigned to the kth evaluation criterion of the jth document according to the ith member’s option, and CR_i is the credibility of the ith member in a specified research domain.

Where E_i denotes the number of documents that are evaluated by the ith member, |D| denotes the total number of documents, and P_i (or N_i) denotes the number of documents on which the ith member’s evaluation is confirmed (or negated) by others. W_E, W_P and W_N are the weights assigned to these three cases, respectively.
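The formulas themselves do not appear in this text, so the following is only a sketch of one form that is consistent with the symbol definitions above. It assumes that S_k is a credibility-weighted average of the members' values, that E_j is a weighted average of the S_k, and that CR_i is a linear combination of E_i, P_i and N_i normalized by |D|. The function names and the default values of W_E, W_P and W_N are hypothetical.

```python
def credibility(evaluated, confirmed, negated, total_docs,
                w_e=0.2, w_p=1.0, w_n=1.0):
    """Assumed form of CR_i: reward activity and confirmed evaluations,
    penalize negated ones, and normalize by the collection size |D|."""
    return (w_e * evaluated + w_p * confirmed - w_n * negated) / total_docs


def criterion_score(values, credibilities):
    """Assumed form of S_k: credibility-weighted mean of the members' values."""
    total = sum(credibilities)
    if total <= 0:
        raise ValueError("total credibility must be positive")
    return sum(v * cr for v, cr in zip(values, credibilities)) / total


def overall_score(feedback, credibilities, weights=None):
    """Assumed form of E_j: weighted mean of the per-criterion scores.

    feedback[i][k] is the value member i assigned to criterion k.
    """
    m = len(feedback[0])
    if weights is None:
        weights = [1.0] * m  # the text gives 1 as the default weight w_k
    scores = [
        criterion_score([row[k] for row in feedback], credibilities)
        for k in range(m)
    ]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Two members rate one document on two criteria; the first member is
# assumed to be three times as credible as the second.
feedback = [[4, 2], [2, 4]]
credibilities = [3.0, 1.0]
print(overall_score(feedback, credibilities))
```

Under these assumptions a more credible member pulls each criterion score toward their own value, which matches the stated purpose of CR_i.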

4. SERVICE PROVIDING MECHANISM

The Knowledge Grid provides users with a set of management operations (such as put, get, browse and delete) to cooperatively manage the research documents. Here we mainly illustrate the document retrieval approach through the get operation.

By making use of the keyword-based approach and the PageRank method [4], we propose a profile-based matching approach to estimate document relevance. It considers two factors: semantically associated keywords, and the citation count, which reflects the quality of a paper. Given a query k entered by the user, we first determine the appropriate profile by keyword matching. Second, we calculate the similarity score of each document with the selected profile using the cosine similarity method, and select the documents whose similarity scores exceed the threshold. Third, we re-rank the selected documents by further considering their average number of citations per year. Finally, the documents are displayed to the user in order of relevance score. The following two formulas compute, respectively, the similarity score, denoted S_i, and the relevance score, denoted R_i, for the ith document.

Where X_d = {x^d_1, x^d_2, …, x^d_n} is the feature vector extracted from document d, in which each component indicates the degree of importance of a term in the document, and P_i = {k^i_1, k^i_2, …, k^i_m} denotes the selected ith profile vector, where m denotes the number of general terms in the profile and k^i_j denotes the weight of the jth term in the profile. B_u is the set of documents that cite the ith document, and N_u is the number of citations of the uth document. An adjustment factor, initially set to 0.01, keeps the numerator from being zero when the ith document has not been cited. t and t_i are respectively the current year and the publication year of the ith document, and a further factor is used for normalization.
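The retrieval steps can be sketched as follows. The cosine similarity is the standard definition. The way the citation term is combined with the similarity score is an assumption, since the relevance formula does not appear in this text: here R_i is taken to be S_i multiplied by the adjusted citation sum per year. The dictionary layout, the function names, the threshold value and the guard for documents published in the current year are likewise hypothetical.

```python
import math


def cosine_similarity(doc_vec, profile_vec):
    """Cosine similarity between two sparse term -> weight dictionaries."""
    dot = sum(w * profile_vec.get(term, 0.0) for term, w in doc_vec.items())
    norm_d = math.sqrt(sum(w * w for w in doc_vec.values()))
    norm_p = math.sqrt(sum(w * w for w in profile_vec.values()))
    if norm_d == 0 or norm_p == 0:
        return 0.0
    return dot / (norm_d * norm_p)


def citation_factor(citing_counts, pub_year, current_year, eps=0.01, norm=1.0):
    """Assumed citation term: (sum of N_u + adjustment) per year since publication.

    citing_counts holds N_u for every document that cites this one; eps is
    the 0.01 adjustment factor; norm stands in for the normalization factor.
    """
    years = max(current_year - pub_year, 1)  # guard added here, not in the source
    return (sum(citing_counts) + eps) / (years * norm)


def rank(docs, profile, current_year, threshold=0.3):
    """Filter by similarity, then order by the assumed relevance score."""
    selected = []
    for doc in docs:
        s = cosine_similarity(doc["vector"], profile)
        if s > threshold:
            r = s * citation_factor(doc["citing_counts"], doc["year"], current_year)
            selected.append((doc["id"], s, r))
    return sorted(selected, key=lambda item: item[2], reverse=True)


profile = {"knowledge": 1.0, "grid": 1.0}
docs = [
    {"id": "a", "vector": {"knowledge": 1.0, "grid": 1.0},
     "citing_counts": [2, 3], "year": 2000},
    {"id": "b", "vector": {"grid": 1.0, "web": 1.0},
     "citing_counts": [], "year": 2002},
    {"id": "c", "vector": {"web": 1.0},
     "citing_counts": [7], "year": 2001},
]
ranked = rank(docs, profile, current_year=2003)
print([doc_id for doc_id, _, _ in ranked])
```

In this sketch document c is dropped at the threshold step even though it is cited, because it shares no terms with the profile; citations only reorder documents that already match.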

Additionally, the system can continuously check for new documents that match the user’s profile and notify group members of new recommendations through the interface of the Knowledge Grid or by email. The general framework is illustrated in Figure 1.

Figure 1. The framework of the document logistics platform.

5. CONCLUSIONS

The proposed document logistics approach enhances cooperative research by incorporating the following three characteristics. First, it continuously discovers and collects new potentially relevant documents based on the semantic profiles. Second, it allows distributed group members to collaborate on organizing and evaluating shared documents with the support of Knowledge Grid. Third, it provides flexible management operations and recommendation services for group members to efficiently access relevant documents. We have implemented the prototype of document logistics based on the Knowledge Grid platform VEGA-KG (available at http://kg.ict.ac.cn).

6. ACKNOWLEDGEMENT

The research work was supported by the National Science Foundation of China (NSFC).

7. REFERENCES

[1] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web, Scientific American, May 17, 2001.
[2] H. Chen, et al. A Concept Space Approach to Addressing the Vocabulary Problem in Scientific Information Retrieval: An Experiment on the Worm Community System, Journal of the American Society for Information Science, 48(1), pp. 17-31, 1997.
[3] S. Lawrence, C. L. Giles, and K. Bollacker. Digital Libraries and Autonomous Citation Indexing, IEEE Computer, 32(6), pp. 67-71, 1999.
[4] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web, Technical report, Stanford University, January 1998.
[5] H. Zhuge. A Knowledge Grid Model and Platform for Global Knowledge Sharing, Expert Systems with Applications, 22(4), pp. 313-320, 2002.
[6] H. Zhuge. Distributed Team Knowledge Management by Incorporating Knowledge Flow with Knowledge Grid, In Proceedings of the 2nd International Conference on Knowledge Management, Austria, July 2002, pp. 218-223.
[7] H. Zhuge. VEGA-KG: A Way to the Knowledge Web, The 11th International World Wide Web Conference, Honolulu, Hawaii, USA, May 2002. http://www2002.org/CDROM/poster/53.pdf.