Summary: This report details a comprehensive evaluation of a new translation workflow (machine translation with human post-editing and quality assurance) against our traditional workflow of human translation, editing and quality assurance. Our primary objective was to determine whether cost savings could be achieved without negatively impacting the quality perceived by our clients or their end audiences. Through a rigorous methodology combining objective analysis using the BLEU metric with subjective evaluations by professional linguists, we found that despite some variation in word choice, overall translation quality and effective expression of the message remained consistently high across all eight languages assessed. The subjective ratings indicate that both workflows delivered positive results, with the new workflow performing at a level comparable to the original. This suggests that clients transitioning to the new workflow are unlikely to perceive any change in translation quality for the languages tested.
---
At Responsive Translation, we are committed to delivering the highest quality translations possible while continually seeking innovative solutions to optimize efficiency and cost-effectiveness for our clients. To validate a workflow change for one of our long-time clients aimed at cost reduction without compromising quality, we conducted a comprehensive evaluation of two distinct translation workflows. This report details our methodology, findings and the implications for your own organization.
Our evaluation rigorously assessed two workflows:

- **Original workflow:** human translation, followed by human editing and quality assurance.
- **New workflow:** machine translation, followed by human post-editing and quality assurance.
To gain a holistic understanding of translation quality, we employed both objective and subjective evaluation methods:
The BLEU (Bilingual Evaluation Understudy) metric provides a quantitative measure of the similarity between a machine-translated text and a set of high-quality reference translations. It works by comparing n-grams (contiguous sequences of words) in the machine-generated translation to those in the reference translations, calculating precision and applying a brevity penalty. A higher BLEU score indicates a closer match in word choices between the translated output and the reference.
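To make the mechanics concrete, here is a minimal sketch of a single-reference, sentence-level BLEU computation in Python. It is illustrative only: the function names are our own, and production implementations (such as those behind the scores in this report) additionally handle multiple references, smoothing and corpus-level aggregation.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: candidate n-gram counts are capped
    at their counts in the reference before dividing."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

def bleu(candidate, reference, max_n=3):
    """Geometric mean of 1..max_n clipped precisions, times a
    brevity penalty that punishes candidates shorter than the reference."""
    if not candidate:
        return 0.0
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # geometric mean collapses to zero
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * geo_mean

if __name__ == "__main__":
    reference = "the cat sat on the mat".split()
    candidate = "the cat sat on a mat".split()
    print(f"BLEU: {bleu(candidate, reference):.3f}")  # → BLEU: 0.630
```

One design point worth noting: the precision is "clipped", so a candidate cannot inflate its score by repeating a word that appears only once in the reference, and the brevity penalty prevents gaming the metric with very short outputs.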
To complement the objective data, we engaged professional linguists proficient in the source and target languages for a subjective evaluation. These expert reviewers assessed translations for aspects such as fluency, grammar, style, coherence and accuracy. Crucially, the evaluators were unaware of the specific workflow used for each translation, ensuring unbiased opinions.
Our evaluation involved a selection of documents of approximately 1,000 words each, translated into eight languages using both workflows.
The BLEU scores offered valuable insights into word choice variations across different languages. While these scores indicate the degree of similarity in word choices, they serve as complementary data rather than a direct measure of overall translation quality.
| Language | Unigram | Bigram | Trigram |
|---|---|---|---|
| Brazilian Portuguese | 0.53 | 0.32 | 0.24 |
| Polish | 0.61 | 0.40 | 0.29 |
| Hungarian | 0.76 | 0.61 | 0.53 |
| Spanish | 0.76 | 0.60 | 0.49 |
| Russian | 0.93 | 0.88 | 0.86 |
| Chinese | 0.76 | 0.76 | 0.75 |
| German | 0.80 | 0.69 | 0.60 |
| French | 0.94 | 0.90 | 0.88 |
The subjective analysis provided crucial insights into the perceived quality of translations from both workflows.
| Language | Original Workflow | New Workflow |
|---|---|---|
| Brazilian Portuguese | 8.86 | 9.64 |
| Polish | 9.04 | 8.93 |
| Hungarian | 9.93 | 9.28 |
| Spanish | 9.21 | 8.62 |
| Russian | 9.65 | 9.67 |
| Chinese | 9.75 | 9.94 |
| German | 8.43 | 8.99 |
| French | 9.71 | 9.61 |
Overall, both workflows received consistently positive ratings, representing a high level of translation quality. Across the evaluated languages, average scores ranged from 8.43 to 9.94. Importantly, the new workflow generally performed at a similar level to the original workflow, with minimal discrepancies in scores. This implies that the workflow change is unlikely to result in a noticeable decline in translation quality from the perspective of the client or their target audience.
Our comprehensive evaluation demonstrates that, while there were variations in word choice between the two workflows, the overall quality of translation and message expression remained consistently high across the eight languages assessed. The subjective ratings confirm that translations produced by both workflows meet Responsive Translation’s stringent quality standards.
Therefore, based on this evaluation, we confidently conclude that adopting the new workflow (Machine Translation + Human Post-Editing + QA) does not substantially impact the quality of the final output, ensuring that your intended message is effectively conveyed to your target audience. This finding supports the implementation of cost-saving measures for our interested clients without compromising their target audience’s experience.
At the same time, it’s important to note that these results may not generalize to every language. While our findings are robust for the languages included in this report, the new workflow may perform less well in certain less common languages. We can help determine which workflow best suits your specific needs.
At Responsive Translation, we are fully committed to meeting all your quality expectations with efficient and cost-effective translation solutions and helping you make informed decisions. We invite you to request a custom proposal or book a consultation call here.