In this paper, we report on our XML/RDF-aware WebCorp application, a specialist search tool designed to treat the Web as a large text corpus. We review the current state of annotation in Web resources and report on our attempts to reconcile this with our application development and linguistic research as a whole.
Corpus linguistics; Diachronicity; WebCorp; XML
The availability of certain metatextual information, including dating (for chronology), source language, domain and topic in a standardised format, is essential for much automated linguistic study of Web text. XML, coupled with a standardised RDF labelling scheme, is perfectly suited to meeting these needs, as illustrated in the following sections with reference to the WebCorp tool.
WebCorp is a fast, online web-text parsing system with an easy-to-use graphical interface (basic and linguistically advanced versions). A few systems have similar features to WebCorp. KWiCFinder [3] searches the Web for a user-specified term which is returned in XML-formatted context; this tool supports user annotation of the data (and conversion to web-readable HTML) but does not provide further linguistic analysis or evaluation. Other systems consider only a subset of the Web: Glossanet [1,2] conducts overnight searches of downloaded news corpora for contexts around a regular expression, reporting results by e-mail; CorpusWeb [4] allows sites to be downloaded into an off-line corpus for further processing.
Solutions achieved by the use of XML and RDF are presented, for chronology in Section 2 and source language in Section 3; they are discussed for domain and topic in Section 4.
An awareness is emerging in the linguistic community of the need to account not just for current language use, but also for language change [7,8]. This requires knowledge of the date of authorship of a language sample, to inform linguistic intuition as to whether the object observed represents incipient, current or obsolescent usage; and to trace development.
Our approach to diachronic corpus linguistics has long been to treat the corpus as a flow of chronologically ordered text, monitoring changes and trends, rather than studying discrete synchronic corpora, the usual approach in both historical and modern corpus linguistics. Unfortunately, when using WebCorp, we have encountered and had to accommodate problems relating to the current paucity of Web document dating.
Dates may be extracted from Web pages in one of the following ways:
We have found that a more reliable method is to use XML/RDF, in which the date can be specified in a standardised, machine-understandable form (ISO 8601). This markup is usefully coupled with the Dublin Core Metadata Schema, whose element qualifiers Created, Valid, Available, Issued and Modified allow the author to specify exactly what the date represents.
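As a minimal sketch of how such qualified dates might be read by a program, the following Python fragment looks up each of the five qualifiers in an RDF/XML record. It assumes that the qualifiers are expressed in the current DCMI Terms namespace and that each value is a complete calendar date (YYYY-MM-DD); the sample record, its URL and the function name extract_dates are illustrative and are not taken from WebCorp.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace prefixes used in the XPath-style lookups below.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
}

# Hypothetical record, invented for illustration only.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="http://example.org/news/article-1">
    <dcterms:created>1994-03-17</dcterms:created>
    <dcterms:modified>1994-03-18</dcterms:modified>
  </rdf:Description>
</rdf:RDF>"""

# The five date qualifiers named in the text.
QUALIFIERS = ("created", "valid", "available", "issued", "modified")


def extract_dates(xml_text):
    """Return a dict mapping each qualifier present to a datetime.date."""
    root = ET.fromstring(xml_text)
    found = {}
    for name in QUALIFIERS:
        node = root.find(f".//dcterms:{name}", NS)
        if node is not None and node.text:
            # Assumption: a full calendar date. Reduced forms such as
            # "1994" or "1994-03" would need separate handling.
            found[name] = date.fromisoformat(node.text.strip())
    return found


dates = extract_dates(SAMPLE)
print(dates)
```

Because each qualifier is returned under its own key, a tool can choose, for example, to prefer the creation date over the modification date when placing a text on a timeline.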
Given the sparseness of uniform XML pages on the Web, we have created a test-corpus of news articles in this format, making use of the Dublin Core Element Set. Meta-information can be automatically extracted or inferred from the standard annotation accompanying each article. An example of page structure is shown in Figure 1.
Figure 1: Example of automatically generated XML news article
WebCorp has been made XML-aware by separating the text and metatext and storing them in parallel. This approach supports linguistic processing; e.g. the extraction of contextualised words, such as "cleansing" in Figure 2.
Figure 2: Concordances showing change in reference for "cleansing" from 1992
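The idea of holding text and metatext in parallel and drawing dated concordance lines from them can be sketched as follows. This is an illustration under stated assumptions rather than WebCorp's implementation: the Article record, the concordance function, the fixed character window and the two sample sentences are all invented for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Article:
    date: str  # ISO 8601 date taken from the metatext
    text: str  # running text, held separately from the metatext


def concordance(articles, node, width=30):
    """Return (date, left context, node, right context) tuples, oldest first."""
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", re.IGNORECASE)
    lines = []
    # Full ISO 8601 dates sort chronologically when compared as strings.
    for art in sorted(articles, key=lambda a: a.date):
        for m in pattern.finditer(art.text):
            left = art.text[max(0, m.start() - width):m.start()]
            right = art.text[m.end():m.end() + width]
            lines.append((art.date, left, m.group(0), right))
    return lines


# Invented sample texts, for illustration only.
ARTICLES = [
    Article("1995-07-12", "Reports of ethnic cleansing continued."),
    Article("1992-01-05", "The cleansing lotion is applied daily."),
]

for when, left, node, right in concordance(ARTICLES, "cleansing"):
    print(f"{when}  {left:>30} [{node}] {right}")
```

Keeping the date outside the text means that the search pattern never matches metadata, while every concordance line can still be labelled with its date.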
The existence of date metatagging in Dublin Core allows WebCorp to extract and label those contexts in chronological order. Secondary analyses are also possible; e.g. collocational profiles, which show the word "ethnic", reflecting the meaning 'genocide', as becoming the major significant collocate of "cleansing" from 1994 onwards. Results are shown in Figure 3.
Figure 3: Change in significant contiguous (+1 -1) collocates for "cleansing"
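A contiguous (+1 -1) collocational profile of this kind can be approximated by counting, for each year, the words immediately to the left and right of the node word. The sketch below uses raw frequency as a stand-in for a significance measure, since the measure used for Figure 3 is not specified here; the function name collocates_by_year, the simple tokeniser and the sample texts are assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict


def collocates_by_year(articles, node):
    """articles: iterable of (iso_date, text) pairs.

    Returns a mapping from year to a Counter of the words found
    directly before (-1) and directly after (+1) the node word.
    """
    profile = defaultdict(Counter)
    node = node.lower()
    for iso_date, text in articles:
        year = iso_date[:4]  # ISO 8601 dates begin with the year
        tokens = re.findall(r"[a-z']+", text.lower())
        for i, tok in enumerate(tokens):
            if tok != node:
                continue
            if i > 0:
                profile[year][tokens[i - 1]] += 1
            if i + 1 < len(tokens):
                profile[year][tokens[i + 1]] += 1
    return profile


# Invented sample texts, for illustration only.
SAMPLE = [
    ("1992-01-05", "The cleansing lotion is applied daily."),
    ("1994-06-01", "Ethnic cleansing was reported. Ethnic cleansing continued."),
]

profile = collocates_by_year(SAMPLE, "cleansing")
for year in sorted(profile):
    print(year, profile[year].most_common(3))
```

In practice the raw counts would be replaced by a statistical test of association, and function words such as "the" would be filtered or down-weighted before collocates are ranked.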
Source language identification is essential for some automated language study. Currently, to extract this from a Web source, one of the following processes is required:
Such analyses are useful, but the Dublin Core Element Set allows the specification of document language in open text, providing more efficient access to language information. A desirable enhancement here would be the specification of primary and subordinate document languages.
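Reading a declared document language is correspondingly simple. The sketch below collects every Language element from a record; it assumes the Dublin Core Element Set namespace and a repeated element for a multilingual document, and the wrapper element, sample values and function name document_languages are hypothetical. Dublin Core itself does not mark which language is primary, so no such ranking is inferred here.

```python
import xml.etree.ElementTree as ET

# Dublin Core Element Set namespace.
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical record, invented for illustration only.
SAMPLE = """<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example article</dc:title>
  <dc:language>en</dc:language>
  <dc:language>fr</dc:language>
</metadata>"""


def document_languages(xml_text):
    """Return all declared languages, in document order."""
    root = ET.fromstring(xml_text)
    return [
        el.text.strip()
        for el in root.iter(f"{{{DC}}}language")
        if el.text
    ]


print(document_languages(SAMPLE))
```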
For the linguist, the key ontological features of a document are domain and topic. These terms are poorly understood, and applied without reference to internal linguistic content. Loosely defined, 'Domain' is the field of knowledge to which a document refers; e.g. Sport, Music. 'Topic' is what a document is 'about'; e.g. 'War in the Balkans'. Clearly, 'aboutness' is a multi-level phenomenon, involving statement of location, description of event, exposition of proposition, authorial evaluation, etc. Domain and topic can be accommodated by the Dublin Core elements Subject and Description. To apply these elements more precisely, we are working on the automatic identification of domain and topic information from the internal linguistic features of a document.
We have reported that a promising convention for our WebCorp application and other linguistic tools is the widely adopted and W3C-cited Dublin Core Metadata Schema, which contains appropriate properties for storing information required in linguistic study. However, we observe that for the potential of this (or any) metadata schema to be realised, and to be automatically exploitable for linguistic (or other) purposes, standardisation and, crucially, widespread takeup are required. We envisage WebCorp and our linguistic expertise contributing to the debate on knowledge representation and Web linguistic search accuracy over this period of takeup.
We gratefully acknowledge the EPSRC support for this research.