1. INTRODUCTION
The main purpose of this paper is to describe the CLAWS4 general-purpose grammatical tagger, used for the tagging of the 100-million-word British National Corpus, a task completed in July 1994 [Footnote 1]. We will emphasise the goals of (a) general-purpose adaptability, (b) incorporation of linguistic knowledge to improve quality and consistency, and (c) accuracy, measured consistently and in a linguistically informed way.
The British National Corpus (BNC) consists of c.100 million words of English written texts and spoken transcriptions, sampled from a comprehensive range of text types. The BNC includes 10 million words of spoken language, c.45% of which is impromptu conversation (see Crowdy, forthcoming). It also includes an immense variety of written texts, including unpublished materials. The grammatical tagging of the corpus has therefore required the "super-robustness" of a tagger which can adapt well to virtually all kinds of text. The tagger has also had to be versatile in dealing with different tagsets (sets of grammatical category labels - see 3 below) and in accepting text in varied input formats. For the purposes of the BNC, the tagger has been required both to accept and to output text in a corpus-oriented TEI-conformant mark-up format known as CDIF (Corpus Document Interchange Format), but within this format many variant formats (affecting, for example, segmentation into words and sentences) can be readily accepted. In addition, CLAWS allows variable output formats: for the current tagger, these include (a) a vertically-presented format suitable for manual editing, and (b) a more compact horizontally-presented format often more suitable for end-users. Alternative output formats are also allowed (c) with so-called "portmanteau tags", i.e. combinations of two alternative tags, used where the tagger calculates that there is insufficient evidence for safe disambiguation, and (d) with simplified "plain text" mark-up for the human reader.
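The output options (a) to (c) above can be illustrated with a minimal sketch. It is not the CLAWS4 implementation: the threshold value, the function names, the candidate probabilities, the ordering of the two tags within a portmanteau, and the exact layouts (tab-separated for the vertical format, word_TAG for the horizontal format) are all assumptions made for illustration, and the tag labels are used only as examples.

```python
from typing import Dict, List, Tuple

# Hypothetical cutoff: if the likeliest tag falls below this probability,
# the evidence is treated as insufficient for safe disambiguation.
SAFE_THRESHOLD = 0.85


def choose_tag(candidates: Dict[str, float],
               threshold: float = SAFE_THRESHOLD) -> str:
    """Return one tag, or a portmanteau of the two likeliest tags."""
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    best_tag, best_prob = ranked[0]
    if len(ranked) == 1 or best_prob >= threshold:
        return best_tag
    # Assumption: the likelier tag is written first in the portmanteau.
    return f"{best_tag}-{ranked[1][0]}"


def vertical(tagged: List[Tuple[str, str]]) -> str:
    """One word per line, convenient for manual editing (assumed layout)."""
    return "\n".join(f"{word}\t{tag}" for word, tag in tagged)


def horizontal(tagged: List[Tuple[str, str]]) -> str:
    """Compact running text with the tag attached to each word (assumed layout)."""
    return " ".join(f"{word}_{tag}" for word, tag in tagged)


# Invented candidate probabilities for three words.
sentence = [
    ("the", {"AT0": 1.0}),
    ("report", {"NN1": 0.9, "VVB": 0.1}),
    ("published", {"VVN": 0.6, "VVD": 0.4}),
]
tagged = [(word, choose_tag(cands)) for word, cands in sentence]

print(vertical(tagged))
print(horizontal(tagged))  # the_AT0 report_NN1 published_VVN-VVD
```

In this sketch only "published" receives a portmanteau tag, because neither of its candidate tags reaches the assumed threshold; raising or lowering the threshold trades the number of unresolved words against the risk of a wrong single tag.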
CLAWS4, the BNC tagger [Footnote 2], incorporates many features of adaptability such as the above. It also incorporates many refinements of linguistic analysis which have built up over 14 years, particularly in the construction and content of the idiom-tagging component (see 2 below). At the same time, there are still many improvements to be made: the claim that "you can put together a tagger from scratch in a couple of months" (recently heard at a research conference) is, in our view, absurdly optimistic.