Linguistic Research with the XML/RDF-aware WebCorp Tool

Barry Morley
RDUES
The University of Liverpool
UK
044 151 794 2290
barry@rdues.liv.ac.uk

Antoinette Renouf
RDUES
The University of Liverpool
UK
044 151 794 2286
ant@rdues.liv.ac.uk

Andrew Kehoe
RDUES
The University of Liverpool
UK
044 151 794 2287
andrew@rdues.liv.ac.uk

ABSTRACT

In this paper, we report on our XML/RDF-aware WebCorp application, a specialist search tool designed to treat the Web as a large text corpus. We review the current state of annotation in Web resources and report on our attempts to reconcile this with our application development and linguistic research as a whole.

Keywords

Corpus linguistics; Diachronicity; WebCorp; XML

1. INTRODUCTION

The availability of certain metatextual information in a standardised format, including date (for chronology), source language, domain and topic, is essential for much automated linguistic study of Web text. XML, coupled with a standardised RDF labelling scheme, is perfectly suited to meeting these needs, as illustrated in the following sections with reference to the WebCorp tool. WebCorp is a fast, online Web-text parsing system with an easy-to-use graphical interface, available in basic and linguistically advanced versions.

A few systems have features similar to those of WebCorp. KWiCFinder [3] searches the Web for a user-specified term, which is returned in XML-formatted context; it supports user annotation of the data (and conversion to Web-readable HTML) but does not provide further linguistic analysis or evaluation. Other systems consider only a subset of the Web: Glossanet [1,2] conducts overnight searches of downloaded news corpora for contexts around a regular expression, reporting results by e-mail; CorpusWeb [4] allows sites to be downloaded into an off-line corpus for further processing.

Solutions based on XML and RDF are presented for chronology in Section 2 and for source language in Section 3; those for domain and topic are discussed in Section 4.

2. CHRONOLOGICAL STUDY

An awareness is emerging in the linguistic community of the need to account not just for current language use, but also for language change [7,8]. This requires knowledge of the date of authorship of a language sample, both to inform linguistic intuition as to whether the object observed represents incipient, current or obsolescent usage, and to trace its development.

2.1 Dating Using Traditional Corpora

Our approach to diachronic corpus linguistics has long been to treat the corpus as a flow of chronologically ordered text, monitoring changes and trends, rather than studying discrete synchronic corpora, the usual approach in both historical and modern corpus linguistics. Unfortunately, when using WebCorp, we have encountered and had to accommodate problems relating to the current paucity of Web document dating.

2.2 Document Dating on the Web

Dates may be extracted from Web pages in one of the following ways:

Last-Modified field in HTTP headers – occurs in 53% of the 917 test sites, but does not accurately reflect date of authorship, and there is no “edition history”
Created/last updated date in HTML body – not standard or machine readable
Date in URL – only standard on news sites
Inference from bibliography dates
Web archive sites – as yet insufficiently comprehensive or accessible
Date range from search engine – biased towards last indexed date [6]
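
Two of the routes listed above, the Last-Modified header and a date embedded in the URL, can be illustrated with a minimal sketch. This is not the WebCorp implementation: the function names and the /YYYY/MM/DD/ URL pattern are assumptions made for illustration, and real sites vary widely.

```python
import re
from datetime import date
from email.utils import parsedate_to_datetime
from typing import Optional

# Assumed URL convention: /YYYY/MM/DD/ path segments, as used by some news sites.
URL_DATE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")


def date_from_last_modified(headers: dict) -> Optional[date]:
    """Return the Last-Modified date, or None if the header is absent or malformed."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    value = lowered.get("last-modified")
    if value is None:
        return None
    try:
        return parsedate_to_datetime(value).date()
    except (TypeError, ValueError):
        return None


def date_from_url(url: str) -> Optional[date]:
    """Return a date found in the URL path, or None if no valid date is present."""
    match = URL_DATE.search(url)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # e.g. month 13: digits that look like a date but are not one
        return None
```

Either function returns None when no date can be recovered, and even a successful result records when the page was modified or filed, not necessarily when the text was written.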

We have found that a more reliable method is to use XML/RDF, in which the date can be specified in a standardised, machine-understandable form (ISO 8601). This markup is usefully coupled with the Dublin Core Metadata Schema, whose element qualifiers Created, Valid, Available, Issued and Modified allow the author to specify exactly what the date represents.
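
As a hedged illustration of how such qualified dates might be read by a program, the sketch below parses a small RDF/XML record using the Dublin Core terms namespace. The sample record, its URL and the function name are invented for this example and are not taken from our test corpus; only full ISO 8601 calendar dates (YYYY-MM-DD) are handled.

```python
import xml.etree.ElementTree as ET
from datetime import date

DCTERMS_NS = "http://purl.org/dc/terms/"

# Invented sample record; a real page may structure its metadata differently.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="http://news.example.org/article-1">
    <dcterms:created>1994-03-17</dcterms:created>
    <dcterms:modified>1994-03-18</dcterms:modified>
  </rdf:Description>
</rdf:RDF>"""

# The five date qualifiers named in the text, as lower-case element names.
DATE_QUALIFIERS = ("created", "valid", "available", "issued", "modified")


def qualified_dates(rdf_xml: str) -> dict:
    """Map each date qualifier present in the record to a datetime.date."""
    root = ET.fromstring(rdf_xml)
    found = {}
    for name in DATE_QUALIFIERS:
        element = root.find(f".//{{{DCTERMS_NS}}}{name}")
        if element is not None and element.text:
            # Assumes a full YYYY-MM-DD value; reduced forms such as YYYY are not handled.
            found[name] = date.fromisoformat(element.text.strip())
    return found
```

Calling qualified_dates(SAMPLE) yields one entry per qualifier found, so a tool can choose, for instance, the Created date for authorship and ignore Modified.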

Given the sparseness of uniform XML pages on the Web, we have created a test corpus of news articles in this format, making use of the Dublin Core Element Set. Meta-information can be automatically extracted or inferred from the standard annotation accompanying each article. An example of page structure is shown in Figure 1.

Figure 1: Example of automatically generated XML news article

WebCorp has been made XML-aware by separating the text and metatext and storing them in parallel. This approach supports linguistic processing; e.g. the extraction of contextualised words, such as cleansing in Figure 2.

Figure 2: Concordances showing change in reference for "cleansing" from 1992

The existence of date metatagging in Dublin Core allows WebCorp to extract and label these contexts in chronological order. Secondary analyses are also possible; e.g. collocational profiles, which show the word ethnic, reflecting the meaning ‘genocide’, becoming the major significant collocate of cleansing from 1994 onwards. Results are shown in Figure 3.

Figure 3: Change in significant contiguous (+1 -1) collocates for "cleansing"
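
The chronological ordering of contexts and the contiguous (+1 -1) collocate profile described in this section can be sketched as follows. This is a simplified illustration, not WebCorp's own code: the two sample sentences are invented, the function names are hypothetical, and raw frequency counts stand in for the significance measure used in our analysis.

```python
import re
from collections import Counter, defaultdict
from datetime import date

TOKEN = re.compile(r"[A-Za-z']+")

# Invented dated texts standing in for articles with Dublin Core date metadata.
DOCS = [
    (date(1994, 5, 2), "Reports of ethnic cleansing continued."),
    (date(1992, 3, 1), "A gentle cleansing lotion for dry skin."),
]


def concordances(documents, node, span=4):
    """Return (date, left context, node, right context) tuples in date order."""
    lines = []
    for doc_date, text in sorted(documents, key=lambda doc: doc[0]):
        tokens = TOKEN.findall(text.lower())
        for i, token in enumerate(tokens):
            if token == node:
                left = " ".join(tokens[max(0, i - span):i])
                right = " ".join(tokens[i + 1:i + 1 + span])
                lines.append((doc_date, left, token, right))
    return lines


def adjacent_collocates(documents, node):
    """Count the words at positions -1 and +1 around the node, grouped by year."""
    by_year = defaultdict(Counter)
    for doc_date, text in documents:
        tokens = TOKEN.findall(text.lower())
        for i, token in enumerate(tokens):
            if token != node:
                continue
            if i > 0:
                by_year[doc_date.year][tokens[i - 1]] += 1
            if i + 1 < len(tokens):
                by_year[doc_date.year][tokens[i + 1]] += 1
    return dict(by_year)
```

Because the date travels with each context, the same pass over the corpus supports both the dated concordance and the year-by-year profile.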

3. LANGUAGE IDENTIFICATION

Source language identification is essential for some automated language study. Currently, to extract this from a Web source, one of the following processes is required:

Extraction of Content-Language from HTTP header – marks only one language; found on only 8.5% of pages
Extraction of HTML “lang” attribute – marks sublanguages, but found on only 3.5% of pages. Both values are standardised and machine readable, using the ISO 639 (language) and ISO 3166 (country) codes.
Surface feature analysis on the text content of the page [5,9]

Such analyses are useful, but the Dublin Core Element Set allows the specification of document language in open text, providing more efficient access to language information. A desirable enhancement here would be the specification of primary and subordinate document languages.
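
The first two extraction routes listed in this section can be sketched as below, under stated assumptions: the function and class names are hypothetical, only the lang attribute on the html element is inspected, and a second subtag is treated as a country code only when it has two letters.

```python
from html.parser import HTMLParser


class LangAttributeParser(HTMLParser):
    """Record the lang attribute of the first <html> start tag, if any."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")


def split_language_tag(tag):
    """Split a tag such as 'en-GB' into (language, country); country may be None."""
    parts = tag.strip().split("-")
    language = parts[0].lower()
    # Simplification: only a two-letter second subtag is read as a country code.
    has_country = len(parts) > 1 and len(parts[1]) == 2
    return language, (parts[1].upper() if has_country else None)


def declared_language(headers, html):
    """Prefer the Content-Language header, then the HTML lang attribute."""
    lowered = {name.lower(): value for name, value in headers.items()}
    value = lowered.get("content-language")
    if value:
        # If the header carries a comma-separated list, only the first value is used.
        return split_language_tag(value.split(",")[0])
    parser = LangAttributeParser()
    parser.feed(html)
    if parser.lang:
        return split_language_tag(parser.lang)
    return None
```

A result of None means that neither declaration is present, which on the figures above is the usual case; surface feature analysis of the text is then the only remaining option.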

4. DOMAIN AND TOPIC

For the linguist, the key ontological features of a document are domain and topic. These terms are poorly understood, and applied without reference to internal linguistic content. Loosely defined, ‘domain’ is the field of knowledge to which a document refers; e.g. Sport, Music. ‘Topic’ is what a document is ‘about’; e.g. ‘War in the Balkans’. Clearly, ‘aboutness’ is a multi-level phenomenon, involving statement of location, description of event, exposition of proposition, authorial evaluation, etc. Domain and topic can be accommodated by the Dublin Core elements Subject and Description. To apply these elements more precisely, we are working on the automatic identification of domain and topic information from the internal linguistic features of a document.

5. CONCLUSION

We have reported that a promising convention for our WebCorp application and other linguistic tools is the widely adopted and W3C-cited Dublin Core Metadata Schema, which contains appropriate properties for storing the information required in linguistic study. However, we observe that for the potential of this (or any) metadata schema to be realised, and to be automatically exploitable for linguistic (or other) purposes, standardisation and, crucially, widespread take-up are required. We envisage WebCorp and our linguistic expertise contributing to the debate on knowledge representation and Web linguistic search accuracy over this period of take-up.

6. ACKNOWLEDGEMENTS

We gratefully acknowledge the EPSRC support for this research.

7. REFERENCES

[1] Fairon, C. (1998-1999), "Parsing a Web Site as a Corpus" in Fairon (1999), 450.
[2] Fairon, C., ed. (1998-1999), "Analyse lexicale et syntaxique: Le système INTEX, Lingvisticae Investigationes Tome XXII (Volume Spécial)", Amsterdam/Philadelphia: John Benjamins Publishing.
[3] Fletcher, W. (2001), "Concordancing the Web with KWiCFinder" in Proceedings of The American Association for Applied Corpus Linguistics Third North American Symposium on Corpus Linguistics and Language Teaching.
[4] Kubler, N. & P-Y. Foucou (2000), "A Web-based Environment for Teaching Technical English" in Burnard, L. & T. McEnery (eds.), Rethinking Language Pedagogy: Papers from the Third International Conference on Language and Teaching, Frankfurt am Main: Peter Lang GmbH.
[5] Longuemaux, F., Morandeau, F., Riviere, A., Tadayoni-Rouchon, R. & P. Vaz Martinho (2001), "Reconnaissance de la langue à partir de facteurs interdits", Univ. Paris VII Denis Diderot, unpubl. manuscript.
[6] Price, G. & G. Tyburski (2002), "It’s Tough to Get a Good Date with a Search Engine", SearchDay, June 5 2002 – Number 283, http://www.searchenginewatch.com
[7] Renouf, A., Pacey, M. & A. Collier (1998), "Refining the Automatic Identification of Conceptual Relations in Large-scale Corpora" in Proceedings of the Sixth Workshop on Very Large Corpora, ACL/COLING ’98.
[8] Renouf, A.J. (2002-b), "The Time Dimension in Modern English Corpus Linguistics" in Kettemann, B. & G. Marko (eds.), Teaching and Learning by Doing Corpus Analysis, Amsterdam: Rodopi.
[9] Souter, C., Churcher, G., Hayes, G., Hughes, J. & S. Johnson (1994), "Natural Language Identification using Corpus-Based Models" in Lauridsen, K. & O. (eds.), HERMES Journal of Linguistics Vol. 13, pp. 183-203, Faculty of Modern Languages, Aarhus School of Business.