
English Please: Evaluating Machine Translation with Large Language Models for Multilingual Bug Reports

Avinash Patil, Siru Tao, Aryan Jadon
Abstract

Accurate translation of bug reports is critical for efficient collaboration in global software development. In this study, we conduct the first comprehensive evaluation of machine translation (MT) performance on bug reports, analyzing the capabilities of DeepL, AWS Translate, and large language models such as ChatGPT, Claude, Gemini, LLaMA, and Mistral using data from the Visual Studio Code GitHub repository, specifically focusing on reports labeled with the english-please tag. To assess both translation quality and source language identification accuracy, we employ a range of MT evaluation metrics, including BLEU, BERTScore, COMET, METEOR, and ROUGE, alongside classification metrics such as accuracy, precision, recall, and F1-score. Our findings reveal that while ChatGPT (gpt-4o) excels in semantic and lexical translation quality, it does not lead in source language identification. Claude and Mistral achieve the highest F1-scores (0.7182 and 0.7142, respectively), and Gemini records the best precision (0.7414). AWS Translate shows the highest accuracy (0.4717) in identifying source languages. These results highlight that no single system dominates across all tasks, reinforcing the importance of task-specific evaluations. This study underscores the need for domain adaptation when translating technical content and provides actionable insights for integrating MT into bug-triaging workflows. The code and dataset for this paper are available at GitHub: https://github.com/av9ash/English-Please
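To make the two evaluation tracks concrete, the sketch below shows how a translation-quality metric (BLEU) and the source-language classification metrics (accuracy, precision, recall, F1) could be computed on a toy example. The library choices (sacrebleu, scikit-learn), the sample sentences, and the label lists are illustrative assumptions, not the paper's actual pipeline; BERTScore, COMET, METEOR, and ROUGE are omitted for brevity.

```python
# Minimal sketch of the two evaluation tracks described in the abstract.
# Assumptions: sacrebleu for BLEU and scikit-learn for classification
# metrics; all data below is hypothetical toy input.
import sacrebleu
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Track 1 -- translation quality: corpus-level BLEU over aligned
# hypothesis/reference pairs (one reference stream here).
hypotheses = ["The editor crashes when opening a large file."]
references = [["The editor crashes when a large file is opened."]]
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")

# Track 2 -- source language identification: compare the language labels
# a system returns against gold labels for each bug report.
y_true = ["zh", "pt", "ru", "es"]   # gold source languages (hypothetical)
y_pred = ["zh", "es", "ru", "es"]   # labels predicted by an MT/LLM system
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"accuracy={accuracy_score(y_true, y_pred):.4f}, "
      f"precision={precision:.4f}, recall={recall:.4f}, F1={f1:.4f}")
```

Macro averaging is one plausible choice here because the english-please reports span many source languages with uneven frequencies, and it weights rare languages equally with common ones; the paper does not state which averaging scheme it uses.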