Abstract
We collect data from open sources on the Internet and classify them into categories, each labeled with a specific language style. In total, the dataset contains 3.3 million pairs of English and Vietnamese texts, ranging from single sentences to paragraphs. A model trained on our dataset outperforms Google Translate on a selected set of diverse text sources. On IWSLT'15, we achieve a BLEU score of 37.84.