Unlike English and other Western languages, many Asian languages, such as Chinese, Japanese, and Thai, do not delimit words by white-space. Word segmentation is therefore a key precursor for language processing tasks in these languages. For Chinese, there has been significant research on finding word boundaries in unsegmented sequences (see Sproat and Shih (2002) for a review). Unfortunately, building a Chinese word segmentation system is complicated by the fact that there is no standard definition of word boundaries in Chinese.

Approaches to Chinese segmentation fall roughly into two categories: heuristic dictionary-based methods and statistical machine learning methods. In dictionary-based methods, a predefined dictionary is used along with hand-generated rules for segmenting the input sequence (Wu, 1999). However, these approaches have been limited by the impossibility of creating a lexicon that includes all possible Chinese words, and by the lack of robust statistical inference in the rules. Machine learning approaches are more desirable and have been successful in both unsupervised learning (Peng and Schuurmans, 2001) and supervised learning (Teahan et al., 2000).

Many current approaches suffer from either a lack of exact inference over sequences or difficulty in incorporating domain knowledge effectively into segmentation. Domain knowledge is either not used, used in a limited way, or used in a complicated way spread across different components. For example, the n-gram generative language modeling approach of Teahan et al. (2000) does not use domain knowledge. Gao et al. (2003) use class-based language models for word segmentation, where some word category information can be incorporated. Zhang et al. (2003) use a hierarchical hidden Markov model to incorporate lexical knowledge.
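The dictionary-based family mentioned above can be illustrated with greedy forward maximum matching. The following is a minimal sketch, not any specific published system; the toy lexicon and the maximum word length are illustrative assumptions.

```python
# Forward maximum matching: a minimal sketch of heuristic dictionary-based
# segmentation. At each position, the longest lexicon entry starting there
# is taken; an unknown character falls back to a single-character word.
def max_match(chars, lexicon, max_len=4):
    """Greedily segment a character sequence using the longest lexicon match."""
    words, i = [], 0
    while i < len(chars):
        # Try the longest candidate first, then progressively shorter ones.
        for j in range(min(len(chars), i + max_len), i, -1):
            if chars[i:j] in lexicon or j == i + 1:
                words.append(chars[i:j])
                i = j
                break
    return words

lexicon = {"ab", "abc", "cd"}
print(max_match("abcd", lexicon))  # the longer match "abc" wins over "ab"
```

The sketch makes the limitation noted above concrete: any word absent from the lexicon is shattered into single characters, which is why out-of-vocabulary words dominate the errors of such systems.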
A recent advance in this area is Xue (2003), in which the author uses a sliding-window maximum entropy classifier to tag Chinese characters with one of four position tags, and then converts these tags into a segmentation using rules. Maximum entropy models give tremendous flexibility to incorporate arbitrary features. However, a traditional maximum entropy tagger, as used in Xue (2003), labels characters without considering the dependencies among predicted segmentation labels that are inherent in the state transitions of finite-state sequence models.

Linear-chain conditional random fields (CRFs) (Lafferty et al., 2001) are models that address both issues above. Unlike heuristic methods, they are principled probabilistic finite-state models on which exact inference over sequences can be efficiently performed. Unlike generative n-gram or hidden Markov models, they have the ability to straightforwardly combine rich domain knowledge, in this paper in the form of multiple readily-available lexicons. Furthermore, they are discriminatively trained, and are often more accurate than generative models, even with the same features. In their most general form, CRFs are arbitrary undirected graphical models trained to maximize the conditional probability of the desired outputs given the corresponding inputs. In the linear-chain special case we use here, they can be roughly understood as discriminatively trained hidden Markov models with next-state transition functions represented by exponential models (as in maximum entropy classifiers), and with great flexibility to view the observation sequence in terms of arbitrary, overlapping features, with long-range dependencies, and at multiple levels of granularity.
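The tag-then-convert step used in character-tagging approaches can be sketched as follows. We use the common B/M/E/S convention (begin, middle, end, single) purely as an illustration; Xue (2003) uses a different but equivalent four-tag inventory, and the ASCII example string stands in for Chinese characters.

```python
# Converting per-character position tags into a word segmentation.
# Tags: B = word-begin, M = word-middle, E = word-end, S = single-char word.
def tags_to_words(chars, tags):
    """Group characters into words according to their position tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word boundary falls after this character
            words.append(current)
            current = ""
    if current:                # tolerate an ill-formed trailing tag sequence
        words.append(current)
    return words

print(tags_to_words("ABCDE", ["B", "E", "S", "B", "E"]))  # ['AB', 'C', 'DE']
```

The weakness discussed above is visible here: whether "E" may legally follow "S" is a constraint on adjacent labels, which a per-character maximum entropy classifier cannot enforce but a finite-state sequence model like a linear-chain CRF captures in its transitions.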
These beneficial properties suggest that CRFs are a promising approach for Chinese word segmentation.

New word detection is one of the most important problems in Chinese information processing. Many machine learning approaches have been proposed (Chen and Bai, 1998; Wu and Jiang, 2000; Nie et al., 1995). New word detection is normally considered a separate process from segmentation. However, integrating the two would benefit both segmentation and new word detection, and CRFs provide a convenient framework for doing so. They can produce not only a segmentation, but also confidence scores for local segmentation decisions, which can be used to find new, unfamiliar character sequences surrounded by high-confidence segmentations. Thus, our new word detection is not a stand-alone process, but an integral part of segmentation. Newly detected words are re-incorporated into our word lexicon and used to improve segmentation; improved segmentation can then be further used to improve new word detection.

Comparing Chinese word segmentation accuracy across systems can be difficult because many research papers use different data sets and different ground rules. Some published results claim 98% or 99% segmentation precision and recall, but these either count only the words that occur in the lexicon, or use unrealistically simple data: lexicons with extremely small (or artificially non-existent) out-of-vocabulary rates, short sentences, or many numbers. A recent Chinese word segmentation competition (Sproat and Emerson, 2003) has made comparisons easier. The competition provided four datasets with significantly different segmentation guidelines and consistent train-test splits. The performance of participating systems varies significantly across the different datasets. Our system achieves top performance on two of the runs, and state-of-the-art performance on average.
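The integrated detection loop described above can be sketched in outline. This is a hedged illustration of the control flow only: `segment_with_confidence`, the threshold value, and the flanking-confidence rule are all hypothetical stand-ins here, since the paper's actual confidence estimates come from the trained CRF.

```python
# Sketch of the segmentation / new-word-detection loop: low-confidence spans
# flanked by high-confidence words become new-word candidates and are added
# back to the lexicon for the next segmentation pass.
def segment_with_confidence(sentence, lexicon):
    # Stub standing in for a CRF decoder that returns (word, confidence)
    # pairs; a real system would derive confidences from the model.
    return [("AB", 0.95), ("XY", 0.40), ("CD", 0.90)]

def detect_new_words(sentences, lexicon, threshold=0.5):
    """Add low-confidence spans surrounded by confident words to the lexicon."""
    for sent in sentences:
        scored = segment_with_confidence(sent, lexicon)
        for i, (word, conf) in enumerate(scored):
            left_ok = i == 0 or scored[i - 1][1] >= threshold
            right_ok = i == len(scored) - 1 or scored[i + 1][1] >= threshold
            if conf < threshold and left_ok and right_ok:
                lexicon.add(word)  # candidate new word, reused on the next pass
    return lexicon

print(detect_new_words(["ABXYCD"], {"AB", "CD"}))  # 'XY' joins the lexicon
```

Iterating this loop gives the mutual reinforcement described above: an enlarged lexicon improves segmentation, and better segmentation exposes further low-confidence candidates.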
Conditional random fields (CRFs) are undirected graphical models trained to maximize a conditional probability (Lafferty et al., 2001). Although training can be expensive, it is a one-time process, and testing time is linear in the length of the input. The contribution of this paper is three-fold. To make a comprehensive evaluation, we use all four of the datasets from a recent Chinese word segmentation bake-off competition (Sproat and Emerson, 2003). Our system achieves top performance on two of the runs, and state-of-the-art performance on average. This indicates that CRFs are a viable model for robust Chinese word segmentation.

Acknowledgments

This work was supported in part by the Center for Intelligent Information Retrieval, in part by the Central Intelligence Agency, the National Security Agency and National Science Foundation under NSF grant #IIS-0326249, and in part by SPAWARSYSCEN-SD grant number N66001-02-1-8903.