This tool performs automated evaluation of machine-translated documents.


TBSJ's Sanbi uses several quality evaluation metrics to calculate scores on both document and sentence levels.

The name Sanbi comes from the words “san” (算), meaning calculation, and “bi” (比), meaning comparison. It is pronounced like a portmanteau of “sun” and “be” in English.

Metrics used

Different evaluation metrics reflect different aspects of machine translation quality, and we plan to integrate more metrics into Sanbi.

Please subscribe to our newsletter and stay updated on the latest developments.

If you have feedback, questions, or requests, please send us an email.


BLEU: Bilingual Evaluation Understudy, ACL 2002.

BLEU scores can range from 0 to 1. The closer to 1, the stronger the indication that a hypothesis translation is closer to the reference translation.

BLEU has shown a high correlation with human judgments and is the de facto standard metric for automatic evaluation.

Hyper-parameters used: Four weights (0.45, 0.35, 0.1, 0.1) and the method2 smoothing function.
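To make the role of the four weights and the smoothing concrete, here is a minimal pure-Python sketch of a sentence-level BLEU score. It is not Sanbi's actual code: the function names are hypothetical, and it assumes that "method2" means add-one smoothing applied to the n-gram precisions for n greater than 1, with the four weights applied to 1-grams through 4-grams in order.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sentence_bleu_sketch(reference, hypothesis, weights=(0.45, 0.35, 0.1, 0.1)):
    """Sentence-level BLEU against a single reference (illustrative only)."""
    log_score = 0.0
    for n, weight in enumerate(weights, start=1):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
        clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if n > 1:
            # Assumed smoothing: add one to numerator and denominator for n > 1.
            clipped += 1
            total += 1
        if clipped == 0 or total == 0:
            return 0.0  # no unigram overlap, or an empty hypothesis
        log_score += weight * math.log(clipped / total)
    ref_len, hyp_len = len(reference), len(hypothesis)
    # Brevity penalty: hypotheses shorter than the reference are penalized.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_score)


reference = "the cat sat on the mat".split()
hypothesis = "the cat is on the mat".split()
score = sentence_bleu_sketch(reference, hypothesis)
```

With these weights, unigram and bigram matches dominate the score, which makes it less harsh on short sentences than uniform weights would be.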


RIBES: Rank-based Intuitive Bilingual Evaluation Score, EMNLP 2010.

The RIBES metric is particularly well suited to assessing translation quality for distant language pairs (like English and Japanese).

The score can range from 0 (worst) to 1 (best).

Hyper-parameters used: alpha=0.25, beta=0.10.
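The sketch below illustrates how alpha and beta enter the RIBES score, assuming the commonly described form: normalized Kendall's tau of the word order, multiplied by unigram precision raised to alpha and a brevity penalty raised to beta. The word-alignment step is omitted. The hypothetical `worder` argument is assumed to hold, for each aligned hypothesis word, its position in the reference. This is not Sanbi's actual implementation.

```python
import math


def normalized_kendall_tau(worder):
    """Fraction of word pairs that appear in the same order as in the reference."""
    n = len(worder)
    if n < 2:
        return 0.0
    increasing = sum(
        1 for i in range(n) for j in range(i + 1, n) if worder[i] < worder[j]
    )
    return increasing / (n * (n - 1) / 2)


def ribes_from_alignment(worder, hyp_len, ref_len, alpha=0.25, beta=0.10):
    """Combine word-order correlation, precision, and brevity penalty (illustrative)."""
    if hyp_len == 0 or not worder:
        return 0.0
    nkt = normalized_kendall_tau(worder)
    precision = len(worder) / hyp_len  # share of hypothesis words that were aligned
    brevity_penalty = min(1.0, math.exp(1 - ref_len / hyp_len))
    return nkt * precision ** alpha * brevity_penalty ** beta


# Hypothetical alignment: four hypothesis words, two of them swapped.
score = ribes_from_alignment([0, 2, 1, 3], hyp_len=4, ref_len=4)
```

Because alpha and beta are small, word order dominates the score, which is why the metric suits language pairs with very different word order.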


METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments, ACL 2007.

METEOR is similar to BLEU, but it also considers synonyms and compares the stems of words (so that “running” matches “runs”).

In addition, it is specifically designed to compare sentences, whereas BLEU is best suited to comparing entire corpora.

Hyper-parameters used: lowercasing, PorterStemmer, WordNet corpus, alpha=0.9, beta=3, gamma=0.5.
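As a rough illustration of what alpha, beta, and gamma control, the sketch below computes a METEOR-style score from match statistics that are assumed to be already available. The function name and its inputs are hypothetical, and the matching stage (exact, stem, and synonym matching) is left out. The formula is the common formulation: a recall-weighted harmonic mean of precision and recall, reduced by a fragmentation penalty.

```python
def meteor_from_counts(matches, hyp_len, ref_len, chunks,
                       alpha=0.9, beta=3.0, gamma=0.5):
    """METEOR-style score from precomputed match statistics (illustrative).

    matches: number of hypothesis words matched to reference words
    chunks:  number of contiguous, same-order runs of matched words
    """
    if matches == 0:
        return 0.0
    precision = matches / hyp_len
    recall = matches / ref_len
    # alpha close to 1 weights recall much more heavily than precision.
    fmean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # Fragmentation penalty: more chunks means the matches are more scattered.
    penalty = gamma * (chunks / matches) ** beta
    return (1 - penalty) * fmean


# Six matched words in one contiguous chunk: only a small penalty applies.
score = meteor_from_counts(matches=6, hyp_len=6, ref_len=6, chunks=1)
```

With gamma=0.5, the penalty can remove at most half of the score, and only when every matched word forms its own chunk.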


hLEPOR: Language-independent Model for Machine Translation Evaluation with Reinforced Factors, XIV MTSummit 2013.

hLEPOR is a language-independent metric that combines enhanced factors: a length penalty, precision, recall, an n-gram position difference penalty, and optional linguistic information.

Scores range from 0 to 1. The higher the score, the closer a translation is to the reference.

Hyper-parameters used: alpha=9.0, beta=1.0, n=2, weight_elp=2.0, weight_pos=1.0, weight_pr=7.0. See Table 1 for optimal parameters for some language pairs.
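The following sketch shows one way the listed weights could combine the hLEPOR factors. It is not Sanbi's actual code. It assumes that alpha weights recall and beta weights precision inside a harmonic mean, and that the three `weight_*` values weight the length penalty, the position penalty, and that precision/recall term in an outer weighted harmonic mean. The function names are hypothetical, and the n-gram position difference penalty is taken as a precomputed input.

```python
import math


def weighted_harmonic_mean(values, weights):
    """Weighted harmonic mean; returns 0.0 if any value is not positive."""
    if any(value <= 0 for value in values):
        return 0.0
    return sum(weights) / sum(w / v for w, v in zip(weights, values))


def length_penalty(hyp_len, ref_len):
    """Penalize hypotheses that are shorter or longer than the reference."""
    if hyp_len == 0 or ref_len == 0:
        return 0.0
    if hyp_len < ref_len:
        return math.exp(1 - ref_len / hyp_len)
    return math.exp(1 - hyp_len / ref_len)


def hlepor_from_components(elp, pos_penalty, precision, recall,
                           alpha=9.0, beta=1.0,
                           weight_elp=2.0, weight_pos=1.0, weight_pr=7.0):
    """Combine the factors with the listed weights (illustrative)."""
    # Assumption: alpha weights recall and beta weights precision.
    hpr = weighted_harmonic_mean([recall, precision], [alpha, beta])
    return weighted_harmonic_mean(
        [elp, pos_penalty, hpr], [weight_elp, weight_pos, weight_pr]
    )


# Same length as the reference, perfect word order, half of the words matched.
score = hlepor_from_components(
    elp=length_penalty(6, 6), pos_penalty=1.0, precision=0.5, recall=0.5
)
```

Under these assumptions the precision/recall term carries the most weight (7 out of 10), so lexical overlap influences the score more than length or word position.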