Many words are ambiguous in their part of speech. For example, "tag" can be a noun or a verb. However, when a word appears in the context of other words, the ambiguity is often reduced: in "a tag is a part-of-speech label," the word "tag" can only be a noun. A part-of-speech tagger is a system that uses context to assign parts of speech to words. Automatic text tagging is an important first step in discovering the linguistic structure of large text corpora. Part-of-speech information facilitates higher-level analysis, such as recognizing noun phrases and other patterns in text.

For a tagger to function as a practical component in a language processing system, we believe that a tagger must be:

Robust: Text corpora contain ungrammatical constructions, isolated phrases (such as titles), and nonlinguistic data (such as tables). Corpora are also likely to contain words that are unknown to the tagger. It is desirable that a tagger deal gracefully with these situations.

Efficient: If a tagger is to be used to analyze arbitrarily large corpora, it must be efficient, performing in time linear in the number of words tagged. Any training required should also be fast, enabling rapid turnaround with new corpora and new text genres.

Accurate: A tagger should attempt to assign the correct part-of-speech tag to every word encountered.

Tunable: A tagger should be able to take advantage of linguistic insights. One should be able to correct systematic errors by supplying appropriate a priori "hints," and it should be possible to give different hints for different corpora.

Reusable: The effort required to retarget a tagger to new corpora, new tagsets, and new languages should be minimal.

Several different approaches have been used for building text taggers. TAGGIT disambiguated 77% of the corpus; the rest was done manually over a period of several years. By using the fact that words are typically associated with only a few part-of-speech categories, and by carefully ordering the computation, the algorithms have linear complexity (Section 3.3). Probabilities corresponding to category sequences that never occurred in the training data are assigned small, non-zero values, ensuring that the model will accept any sequence of tokens while still providing the most likely tagging.

We have used the tagger in a number of applications; we describe three here: phrase recognition, word sense disambiguation, and grammatical function assignment. If a noun phrase is labeled, it is also annotated as to whether the governing verb is the closest verb group to the right or to the left. The algorithm has an accuracy of approximately 80% in assigning grammatical functions.
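The linear-time and smoothing remarks above can be made concrete with a small sketch. The Python example below is not the paper's algorithm; it is a minimal illustration, under assumed data, of two of the ideas mentioned: each word is restricted to its small ambiguity class of candidate tags, so a Viterbi-style pass does bounded work per word, and transition probabilities are smoothed so that tag sequences unseen in training still receive small, non-zero probability. The lexicon, tag set, and counts are hypothetical placeholders.

```python
import math

# Hypothetical lexicon mapping each word to its ambiguity class
# (the few part-of-speech tags it can take).
LEXICON = {
    "a": ["DET"],
    "tag": ["NOUN", "VERB"],
    "is": ["VERB"],
    "label": ["NOUN", "VERB"],
}
TAGS = ["DET", "NOUN", "VERB"]

# Hypothetical bigram tag counts, as if gathered from training data.
TRANSITION_COUNTS = {
    ("DET", "NOUN"): 50,
    ("NOUN", "VERB"): 30,
    ("VERB", "DET"): 20,
}

EPSILON = 1e-6  # small non-zero mass reserved for unseen tag bigrams


def transition_logprob(prev_tag, tag):
    """Smoothed log P(tag | prev_tag): bigrams never seen in training
    get a small, non-zero probability instead of zero."""
    total = sum(c for (p, _), c in TRANSITION_COUNTS.items() if p == prev_tag)
    count = TRANSITION_COUNTS.get((prev_tag, tag), 0)
    return math.log((count + EPSILON) / (total + EPSILON * len(TAGS)))


def viterbi_tag(words):
    """Return (word, tag) pairs for the most likely tag sequence.

    Only the candidate tags in each word's ambiguity class are scored,
    so the work per word is bounded by a small constant and the whole
    pass is linear in the number of words."""
    # best maps each candidate tag of the current word to
    # (log-probability of the best path ending in that tag, the path).
    best = {t: (0.0, [t]) for t in LEXICON.get(words[0], TAGS)}
    for word in words[1:]:
        new_best = {}
        for tag in LEXICON.get(word, TAGS):  # unknown words: all tags
            score, path = max(
                (prev_score + transition_logprob(prev_tag, tag), prev_path)
                for prev_tag, (prev_score, prev_path) in best.items()
            )
            new_best[tag] = (score, path + [tag])
        best = new_best
    _, path = max(best.values())
    return list(zip(words, path))


if __name__ == "__main__":
    # "tag" is ambiguous in isolation, but context resolves it to a noun.
    print(viterbi_tag(["a", "tag", "is", "a", "label"]))
```

Because each word contributes only a handful of candidate-tag comparisons, total work grows linearly with the number of words, and the epsilon term keeps the model from rejecting any token sequence outright.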