We address here the problem of base NP translation, in which for a given base noun phrase in a source language (e.g., "information age" in English), we are to find out its possible translation(s) in a target language (e.g., the equivalent phrase in Chinese). We define a base NP as a simple and non-recursive noun phrase. In many cases, base NPs represent holistic and non-divisible concepts, and thus accurate translation of them from one language to another is extremely important in applications like machine translation, cross-language information retrieval, and foreign language writing assistance. In this paper, we propose a new method for base NP translation, which contains two steps: (1) translation candidate collection, and (2) translation selection. In translation candidate collection, for a given base NP in the source language, we look for its translation candidates in the target language. To do so, we use a word-to-word translation dictionary and corpus data in the target language on the web. In translation selection, we determine the possible translation(s) from among the candidates. We use non-parallel corpus data in the two languages on the web and employ one of the two methods which we have developed. In the first method, we view the problem as that of classification and employ an ensemble of naïve Bayesian classifiers constructed with the EM algorithm. We will use "EM-NBC-Ensemble" to denote this method hereafter. In the second method, we view the problem as that of calculating similarities between context vectors and use TF-IDF vectors also constructed with the EM algorithm. We will use "EM-TF-IDF" to denote this method. Experimental results indicate that our method is very effective: the coverage and top-3 accuracy of translation at the final stage are 91.4% and 79.8%, respectively. The results are significantly better than those of the baseline methods relying on existing technologies.
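The two steps can be illustrated with a minimal Python sketch. It is not the implementation evaluated in this paper: the function names, the toy dictionary, the hit counts, and the context vectors are all hypothetical, and the EM-based construction of the vectors is omitted. The sketch assumes that the context vector of the source base NP has already been mapped into the target-language vocabulary, and it ranks candidates by plain cosine similarity.

```python
import math
from itertools import product


def collect_candidates(base_np, dictionary, target_counts, min_count=1):
    """Step 1 (sketch): compose word-level translations of the base NP and
    keep only the compositions that occur in the target-language data."""
    options = [dictionary.get(word, []) for word in base_np.split()]
    if not all(options):
        return []  # at least one word has no dictionary entry
    composed = ["".join(parts) for parts in product(*options)]
    return [c for c in composed if target_counts.get(c, 0) >= min_count]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_translations(source_vec, candidate_vecs, top_k=3):
    """Step 2 (sketch): rank candidates by similarity of their context
    vectors to the (already mapped) context vector of the source base NP."""
    ranked = sorted(
        candidate_vecs,
        key=lambda cand: cosine(source_vec, candidate_vecs[cand]),
        reverse=True,
    )
    return ranked[:top_k]


# Toy data, invented for illustration only.
TOY_DICT = {"information": ["信息", "情报"], "age": ["时代", "年龄"]}
TOY_COUNTS = {"信息时代": 120, "情报时代": 2, "信息年龄": 0}
TOY_SOURCE_VEC = {"网络": 0.8, "技术": 0.6}
TOY_CANDIDATE_VECS = {
    "信息时代": {"网络": 0.7, "技术": 0.5, "社会": 0.2},
    "情报时代": {"军事": 0.9, "技术": 0.1},
}

candidates = collect_candidates("information age", TOY_DICT, TOY_COUNTS)
best = select_translations(
    TOY_SOURCE_VEC, {c: TOY_CANDIDATE_VECS[c] for c in candidates}
)
```

With this toy data, compositions that never occur in the target-language counts are discarded in the first step, and the remaining candidates are ordered by context similarity in the second.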
The higher performance of our method can be attributed to the enormity of the web data used and the employment of the EM algorithm. This paper has proposed a new and effective method for base NP translation by using web data and the EM algorithm. We also acknowledge Shenjie Li for help with program coding.

2.1 Translation with non-parallel

We conducted experiments on translation of the base NPs from English to Chinese. We extracted base NPs (noun-noun pairs) from the Encarta English corpus using the tool developed by Xun et al. (2000).

For Nagata et al.'s method, we found that it was almost impossible to find partial-parallel corpora in the non-web data. They observed that there are many partial-parallel corpora between English and Japanese on the web; most typically, English translations of Japanese terms (words or phrases) are parenthesized and inserted immediately after the Japanese terms in documents written in Japanese.