the work reported in this paper aims at providing syntactically annotated corpora ('treebanks') for stochastic grammar induction. in particular, we focus on several methodological issues concerning the annotation of non-configurational languages. in section 2, we examine the appropriateness of existing annotation schemes. on the basis of these considerations, we formulate several additional requirements. a formalism complying with these requirements is described in section 3. section 4 deals with the treatment of selected phenomena. for a description of the annotation tool see section 5. its extension is subject to further investigations. as the annotation scheme described in this paper focusses on annotating argument structure rather than constituent trees, it differs from existing treebanks in several aspects. these differences can be illustrated by a comparison with the penn treebank annotation scheme. a uniform representation of local and non-local dependencies makes the structure more transparent. partial automation included in the current version significantly reduces the manual effort. the development of linguistically interpreted corpora presents a laborious and time-consuming task. owing to the partial automation, the average annotation efficiency improves by 25% (from around 4 minutes to 3 minutes per sentence). combining raw language data with linguistic information offers a promising basis for the development of new efficient and robust nlp methods. such a word-order-independent representation has the advantage of all structural information being encoded in a single data structure. in order to make the annotation process more efficient, extra effort has been put into the development of an annotation tool.
real-world texts annotated with different strata of linguistic information can be used for grammar induction.
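
the idea of a word-order-independent representation that keeps local and non-local dependencies in one data structure can be illustrated with a minimal sketch. everything in it is an assumption made for illustration and is not taken from the paper: the class name, the edge labels ('SUBJ', 'HEAD', 'MOD', 'SPEC', 'COMP'), the category labels and the english toy sentence are all hypothetical. each node stores labelled edges to its dependents, surface positions are kept only on the words, and a phrase whose words are not adjacent is simply a node whose positions have a gap.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A word (terminal) or a phrase (nonterminal) in the annotation graph.

    Hypothetical structure for illustration; not the paper's actual format.
    """
    label: str                       # part-of-speech tag or phrase category
    position: Optional[int] = None   # surface position, set only for words
    children: list = field(default_factory=list)  # (function label, Node) pairs

    def add(self, function: str, child: Node) -> Node:
        """Attach `child` under an edge labelled with a grammatical function."""
        self.children.append((function, child))
        return self

    def positions(self) -> list:
        """Sorted surface positions of all words dominated by this node."""
        if self.position is not None:
            return [self.position]
        found = []
        for _, child in self.children:
            found.extend(child.positions())
        return sorted(found)

    def is_discontinuous(self) -> bool:
        """True if the dominated words do not form one contiguous span."""
        pos = self.positions()
        return bool(pos) and pos[-1] - pos[0] + 1 != len(pos)


# Toy sentence (assumed example): "a book appeared about parsing"
# The phrase "about parsing" modifies "book" but follows the verb.
a = Node("DET", 0)
book = Node("N", 1)
appeared = Node("V", 2)
about = Node("P", 3)
parsing = Node("N", 4)

pp = Node("PP").add("HEAD", about).add("COMP", parsing)
subject_np = Node("NP").add("SPEC", a).add("HEAD", book).add("MOD", pp)
sentence = Node("S").add("SUBJ", subject_np).add("HEAD", appeared)

print(subject_np.positions(), subject_np.is_discontinuous())
print(pp.positions(), pp.is_discontinuous())
```

in this sketch the modifier attaches to its noun in the same way as any local dependent, so no separate mechanism is needed for the non-local case; the discontinuity is recoverable from the word positions alone.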