To automatically evaluate machine translations, the machine translation community recently adopted an n-gram co-occurrence scoring procedure, BLEU (Papineni et al. 2001). A similar metric, NIST, used by NIST (NIST 2002) in a couple of machine translation evaluations in the past two years, is based on BLEU. The main idea of BLEU is to measure the closeness between a candidate translation and a set of reference translations with a numerical metric. Although the idea of using objective functions to automatically evaluate machine translation quality is not new (Su et al. 1992), the success of BLEU has prompted a great deal of interest in developing better automatic evaluation metrics. For example, Akiba et al. (2001) proposed a metric called RED based on edit distances over a set of multiple references. Nießen et al. (2000) calculated the length-normalized edit distance, called word error rate (WER), between a candidate and multiple reference translations. Leusch et al. (2003) proposed a related measure called position-independent word error rate (PER) that does not consider word position, i.e. it uses a bag-of-words representation instead. Turian et al. (2003) introduced General Text Matcher (GTM), based on accuracy measures such as recall, precision, and F-measure.

With so many different automatic metrics available, it is necessary to have a common and objective way to evaluate them. Comparisons of automatic evaluation metrics are usually conducted at the corpus level using correlation analysis between human scores and automatic scores such as BLEU, NIST, WER, and PER. However, the performance of automatic metrics in terms of human vs. system correlation analysis is not stable across different evaluation settings. For example, Table 1 shows the Pearson's linear correlation coefficient analysis of 8 machine translation systems from the 2003 NIST Chinese-English machine translation evaluation. The Pearson's correlation coefficients are computed for different automatic evaluation methods vs. human-assigned adequacy and fluency. BLEU1, 4, and 12 are BLEU with maximum n-gram lengths of 1, 4, and 12 respectively. GTM10, 20, and 30 are GTM with exponents of 1.0, 2.0, and 3.0 respectively. 95% confidence intervals are estimated using bootstrap resampling (Davison and Hinkley 1997).

From the BLEU group, we found that shorter BLEU has better adequacy correlation while longer BLEU has better fluency correlation. GTM with a smaller exponent has better adequacy correlation and GTM with a larger exponent has better fluency correlation. NIST is very good in adequacy correlation but not as good as GTM30 in fluency correlation. Based on these observations, we are not able to conclude which metric is best, because it depends on the manual evaluation criteria. These results also indicate that high correlation between human and automatic scores in both adequacy and fluency cannot always be achieved at the same time. The best-performing metrics in fluency according to Table 1 are BLEU12 and GTM30 (dark/green cells). However, many metrics are statistically equivalent to them (gray cells) when we factor in the 95% confidence intervals. For example, even PER is as good as BLEU12 in adequacy. One reason for this might be data sparseness, since only 8 systems are available. Another potential problem with the human vs. automatic correlation analysis framework is that high corpus-level correlation might not translate into high sentence-level correlation.
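For concreteness, the sketch below illustrates the kind of corpus-level analysis summarized in Table 1: Pearson's linear correlation between human scores and an automatic metric over a set of systems, with a 95% confidence interval estimated by bootstrap resampling. This is not the code used in our experiments; the scores shown are invented for illustration and the helper names are ours.

```python
import random

def pearson(xs, ys):
    """Pearson's linear correlation coefficient between two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def bootstrap_ci(xs, ys, n_samples=1000, alpha=0.05, seed=0):
    """Confidence interval for the correlation via bootstrap resampling of systems."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        rx, ry = [xs[i] for i in idx], [ys[i] for i in idx]
        if len(set(rx)) > 1 and len(set(ry)) > 1:  # skip degenerate resamples
            stats.append(pearson(rx, ry))
    stats.sort()
    lo = stats[int(alpha / 2 * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi

# Hypothetical scores for 8 systems: human adequacy vs. an automatic metric.
human  = [3.2, 2.8, 3.5, 2.1, 3.9, 2.6, 3.0, 3.3]
metric = [0.28, 0.24, 0.31, 0.18, 0.34, 0.22, 0.27, 0.30]
print("Pearson r =", round(pearson(human, metric), 3),
      "95% CI =", bootstrap_ci(human, metric))
```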
However, high sentence-level correlation is often an important property that machine translation researchers look for. For example, candidate translations shorter than 12 words would have a zero BLEU12 score, yet BLEU12 has the best correlation with human judgment in fluency, as shown in Table 1. In order to evaluate the ever-increasing number of automatic evaluation metrics for machine translation objectively, efficiently, and reliably, we introduce a new evaluation method: ORANGE. We describe ORANGE in detail in Section 2 and briefly introduce three new automatic metrics that will be used in the comparisons in Section 3. The results of comparing several existing automatic metrics and the three new automatic metrics using ORANGE are presented in Section 4; the comparisons include ROUGE-S variants with skip distances ranging from 0 to 9 (ROUGE-S0 to S9) and without any skip distance limit (ROUGE-S*). We conclude this paper and discuss future directions in Section 5.

In brief, ORANGE works as follows. For each source sentence, we compute the average automatic score of the references and then rank the candidate translations and the references according to these automatic scores. The ORANGE score for each metric is calculated as the average rank of the average reference (oracle) score over the whole corpus (872 sentences), divided by the length of the n-best list plus 1. Assuming the length of the n-best list is N and the size of the corpus is S (in number of sentences), we compute ORANGE as follows:

$$\mathrm{ORANGE} = \frac{\sum_{i=1}^{S} \mathrm{Rank}(\mathrm{Oracle}_i)}{S\,(N+1)}$$

where Rank(Oracle_i) is the rank of the oracle score for source sentence i. This procedure assumes that references tend to rank high under a good metric; we conjecture that this is the case for the currently available machine translation systems, but we plan to conduct a sampling procedure to verify that this is indeed the case, i.e. to estimate the portion of instances where it does not hold. If that portion is small, then the ORANGE method can be confidently applied.
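A minimal sketch of this computation is given below, assuming a generic metric(candidate, references) scoring function. The leave-one-out scoring of each reference against the remaining references is our assumption about how the average reference (oracle) score is obtained; function and variable names are illustrative rather than taken from the paper.

```python
from typing import Callable, List, Sequence

def unigram_precision(candidate: str, references: Sequence[str]) -> float:
    """Toy metric for illustration: fraction of candidate tokens found in any reference."""
    cand = candidate.split()
    ref_tokens = {tok for r in references for tok in r.split()}
    return sum(tok in ref_tokens for tok in cand) / max(len(cand), 1)

def orange_score(
    nbest_lists: List[List[str]],       # one N-best candidate list per source sentence
    reference_sets: List[List[str]],    # the reference translations per source sentence
    metric: Callable[[str, Sequence[str]], float],
) -> float:
    """Average rank of the oracle over the corpus, normalized by N + 1."""
    total_rank = 0
    n = len(nbest_lists[0])             # length N of each n-best list
    for candidates, refs in zip(nbest_lists, reference_sets):
        # Oracle: average score of the references, each scored against the
        # remaining references (leave-one-out; an assumption, see text).
        oracle = sum(
            metric(refs[i], refs[:i] + refs[i + 1:]) for i in range(len(refs))
        ) / len(refs)
        # Rank of the oracle within the list of N candidates plus the oracle
        # itself (rank 1 = highest score).
        cand_scores = [metric(c, refs) for c in candidates]
        rank = 1 + sum(1 for s in cand_scores if s > oracle)
        total_rank += rank
    # Average rank over the S sentences, divided by N + 1.
    return total_rank / (len(nbest_lists) * (n + 1))

# Usage with the toy metric on a tiny made-up corpus:
nbest = [["the cat sat on the mat", "a cat is on a mat"]]
refs = [["the cat is sitting on the mat", "a cat sits on the mat"]]
print(orange_score(nbest, refs, unigram_precision))
```

Lower ORANGE scores indicate that the oracle tends to be ranked near the top of the n-best list, i.e. the metric in question separates reference-quality translations from machine translations well.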