Machine Translation

Designing, training, and productionizing dedicated systems for translation that enable cross-border trade.
Info

Machine translation at eBay is developed with the primary goal of facilitating cross-border trade. We design, train, and productionize dedicated systems for translating user search queries, item and product titles on search result pages, and item descriptions. Our MT systems leverage state-of-the-art technology, including statistical models for phrase-to-phrase translation and machine learning algorithms, and are hybridized with hand-crafted rules for translating or preserving named entities such as numbers and product brands.
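One common way to combine a translation engine with hand-crafted rules for named entities is to mask numbers and brand names with placeholders before translation and restore them afterwards. The sketch below illustrates that general idea only; it is not eBay's implementation. The brand list, the placeholder format, and the function names are assumptions made for illustration.

```python
import re

# Hypothetical brand lexicon; a production system would use a far larger one.
BRANDS = ["Apple", "Samsung", "Nike"]


def mask_entities(text, brands=BRANDS):
    """Replace known brands and numbers with placeholders.

    Returns the masked text and a mapping from placeholder to original string.
    """
    brand_pattern = "|".join(re.escape(b) for b in brands)
    pattern = re.compile(r"\b(?:" + brand_pattern + r")\b|\d+(?:[.,]\d+)*")
    mapping = {}

    def repl(match):
        key = f"__ENT{len(mapping)}__"  # assumed placeholder format
        mapping[key] = match.group(0)
        return key

    return pattern.sub(repl, text), mapping


def unmask_entities(text, mapping):
    """Put the original entity strings back in place of the placeholders."""
    for key, value in mapping.items():
        text = text.replace(key, value)
    return text


def translate_preserving_entities(text, translate):
    """Wrap any translate(str) -> str callable with entity masking."""
    masked, mapping = mask_entities(text)
    return unmask_entities(translate(masked), mapping)


if __name__ == "__main__":
    # Toy stand-in for an MT engine: it only knows one word.
    toy_translate = lambda s: s.replace("new", "neu")
    print(translate_preserving_entities("new Apple iPhone 7", toy_translate))
```

The approach relies on the engine copying placeholders through unchanged, which a real system would have to enforce or verify.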

We build customized, domain-adapted systems trained on parallel corpora that contain both eBay-relevant data and automatically crawled web data from the e-commerce domain. The systems can use both sentence-level and topic-level context to disambiguate between different word meanings and translation alternatives. At the same time, the MT systems we build are optimized to work in real time, yielding high-quality translations within milliseconds. This is especially important for the translation of user queries.

eBay's MT group has developed dedicated e-commerce domain systems for translating Russian, Portuguese, Spanish, Italian, French, and German into English and vice versa, as well as systems for translating between German and other main European languages. We also have high-quality MT systems for translating Arabic and Chinese. Besides expanding language coverage, the current work of the MT group focuses on topic or product category adaptation, as well as on incorporating explicit and implicit user feedback to improve the user experience with our translations. We also actively explore deep learning methods, including neural-network-based machine translation.

Publications
MT Summit, Nagoya, Japan, September 2017

Harvesting Polysemous Terms from e-commerce Data to Enhance QA

Silvio Picinini

Polysemous words can be difficult to translate and can affect the quality of Machine Translation (MT) output. Once MT quality is affected, there is a direct impact on post-editing and on human-assisted machine translation. The presence of these terms increases the risk of errors. We think that these important words can be used to improve and to measure the quality of translations. We present three methods for finding these words in e-commerce data, based on Named Entity Recognition, Part-of-Speech, and Search Queries.

MT Summit, Nagoya, Japan, September 2017

Neural and Statistical Methods for Leveraging Meta-information in Machine Translation

Shahram Khadivi, Patrick Wilken, Leonard Dahlmann, Evgeny Matusov

In this paper, we discuss different methods which use meta-information and richer context that may accompany source language input to improve machine translation quality. We focus on category information of input text as meta-information, but the proposed methods can be extended to all textual and non-textual meta-information that might be available for the input text or automatically predicted using the text content. The main novelty of this work is to use state-of-the-art neural network methods to tackle this problem within a statistical machine translation (SMT) framework. We observe translation quality improvements of up to 3% in terms of BLEU score in some text categories.

International Conference on Natural Language Generation, Santiago de Compostela, Spain, September 2017

Generating titles for millions of browse pages on an e-Commerce site

We present three approaches to generate titles for browse pages in five different languages, namely English, German, French, Italian and Spanish. These browse pages are structured search pages in an e-commerce domain. We first present a rule-based approach to generate these browse page titles. In addition, we also present a hybrid approach which uses a phrase-based statistical machine translation engine on top of the rule-based system to assemble the best title. For the two languages English and German, we have access to a large amount of rule-based generated and human-curated titles. For these languages, we present an automatic post-editing approach which learns how to post-edit the rule-based titles into curated titles.

MT Summit, Nagoya, Japan, September 2017

A detailed investigation of Bias Errors in Post-editing of MT output

Silvio Picinini, Nicola Ueffing

The use of post-editing of machine translation output is increasing throughout the language technology community. In this work, we investigate whether the MT system influences the human translator, thereby introducing "bias" and potentially leading to errors in the post-editing. We analyze how often a translator accepts an incorrect suggestion from the MT system and determine different types of bias errors. We carry out quantitative analysis on translations of e-commerce data from English into Portuguese, consisting of 713 segments with about 15k words. We observed a higher-than-expected number of bias errors, about 18 bias errors per 1,000 words. The most frequent types of bias error we observed were ambiguous modifiers, terminology errors, polysemy, and omissions. The goal of this work is to provide quantitative data about bias errors in post-editing that help indicate the existence of bias. We explore some ideas on how to automate the finding of these error patterns and facilitate the quality assurance of post-editing.

EMNLP, Copenhagen, Denmark, September 2017

Neural Machine Translation Leveraging Phrase-based Models in a Hybrid Search

Leonard Dahlmann, Evgeny Matusov, Pavel Petrushkov, Shahram Khadivi

In this paper, we introduce a hybrid search for attention-based neural machine translation (NMT). A target phrase learned with statistical MT models extends a hypothesis in the NMT beam search when the attention of the NMT model focuses on the source words translated by this phrase. Phrases added in this way are scored with the NMT model, but also with SMT features including phrase-level translation probabilities and a target language model. Experimental results on German→English news domain and English→Russian e-commerce domain translation tasks show that using phrase-based models in NMT search improves MT quality by up to 2.3% BLEU absolute as compared to a strong NMT baseline.
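The abstract says that phrase candidates are scored by the NMT model together with SMT features. A log-linear combination is one conventional way to merge such scores, and the sketch below assumes that form. The feature names, weights, and probabilities are made up for illustration and are not taken from the paper.

```python
import math


def hybrid_score(nmt_logprob, smt_features, weights):
    """Log-linear score: NMT log-probability plus weighted SMT feature scores."""
    return nmt_logprob + sum(
        weights[name] * value for name, value in smt_features.items()
    )


def best_phrase_extension(candidates, weights):
    """Pick the candidate phrase with the highest combined score."""
    return max(
        candidates,
        key=lambda c: hybrid_score(c["nmt_logprob"], c["smt_features"], weights),
    )


if __name__ == "__main__":
    # Hypothetical feature weights; in practice these would be tuned.
    weights = {"phrase_logprob": 0.5, "lm_logprob": 0.3}
    candidates = [
        {
            "phrase": "candidate A",
            "nmt_logprob": -1.2,
            "smt_features": {
                "phrase_logprob": math.log(0.6),
                "lm_logprob": math.log(0.2),
            },
        },
        {
            "phrase": "candidate B",
            "nmt_logprob": -1.0,
            "smt_features": {
                "phrase_logprob": math.log(0.1),
                "lm_logprob": math.log(0.05),
            },
        },
    ]
    print(best_phrase_extension(candidates, weights)["phrase"])
```

In this toy setting, candidate B has the better NMT score, but candidate A wins once the phrase translation and language model scores are included.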

Association for Machine Translation in the Americas (AMTA), Oct. 2016

Guided Alignment Training for Topic-Aware Neural Machine Translation

Wenhu Chen, Evgeny Matusov, Shahram Khadivi, Jan-Thorsten Peter

In this paper, we propose an effective way of biasing the attention mechanism of a sequence-to-sequence neural machine translation (NMT) model towards the well-studied statistical word alignment models. We show that our novel guided alignment training approach improves translation quality on real-life e-commerce texts consisting of product titles and descriptions, overcoming the problems posed by many unknown words and a large type/token ratio. We also show that meta-data associated with input texts, such as topic or category information, can significantly improve translation quality when used as an additional signal to the decoder part of the network. With both novel features, the BLEU score of the NMT system on a product title set improves from 18.6% to 21.3%. Even larger MT quality gains are obtained through domain adaptation of a general domain NMT system to e-commerce data. The developed NMT system also performs well on the IWSLT speech translation task, where an ensemble of four variant systems outperforms the phrase-based baseline by 2.1% BLEU absolute.
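Biasing attention towards word alignments can be expressed as an extra term in the training loss that penalizes attention weights which diverge from a reference alignment. The sketch below assumes a cross-entropy penalty and a weighted sum of the two losses. The function names, default weights, and matrix layout (one row per target position, one column per source position) are illustrative assumptions, not the paper's exact formulation.

```python
import math


def alignment_cross_entropy(attention, alignment, eps=1e-9):
    """Mean cross-entropy between reference alignment rows and attention rows.

    Each row is a distribution over source positions for one target position.
    """
    total = 0.0
    for att_row, ref_row in zip(attention, alignment):
        total -= sum(r * math.log(a + eps) for a, r in zip(att_row, ref_row))
    return total / len(attention)


def guided_loss(translation_loss, attention, alignment,
                w_translation=1.0, w_alignment=1.0):
    """Weighted sum of the usual translation loss and the alignment penalty."""
    penalty = alignment_cross_entropy(attention, alignment)
    return w_translation * translation_loss + w_alignment * penalty


if __name__ == "__main__":
    reference = [[1.0, 0.0], [0.0, 1.0]]   # statistical word alignment
    sharp = [[1.0, 0.0], [0.0, 1.0]]       # attention that agrees with it
    diffuse = [[0.5, 0.5], [0.5, 0.5]]     # attention that is uninformative
    print(guided_loss(2.0, sharp, reference))
    print(guided_loss(2.0, diffuse, reference))
```

Attention that matches the reference alignment adds almost nothing to the loss, while uniform attention over two source words adds about log 2 per target position.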

Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Correcting Keyboard Layout Errors and Homoglyphs in Queries

Keyboard layout errors and homoglyphs in cross-language queries impact our ability to correctly interpret user information needs and offer relevant results. We present a machine learning approach to correcting these errors, based largely on character-level n-gram features. We demonstrate superior performance over rule-based methods, as well as a significant reduction in the number of queries that yield null search results.
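A keyboard layout error occurs when a query is typed with the wrong layout active, so each intended character comes out as the one that shares its key. The sketch below shows two building blocks such a system might use: remapping a query between layouts and extracting character-level n-grams as features. The choice of the standard QWERTY and Russian ЙЦУКЕН layouts is an assumption for illustration, and the classifier that would consume the features is not shown.

```python
# Keys in the same physical positions on the QWERTY and Russian layouts.
LATIN = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
CYRILLIC = "йцукенгшщзхъфывапролджэячсмитьбю"
LATIN_TO_CYRILLIC = str.maketrans(LATIN, CYRILLIC)


def remap_layout(query):
    """Reinterpret Latin keystrokes as if the Russian layout had been active."""
    return query.lower().translate(LATIN_TO_CYRILLIC)


def char_ngrams(text, n=3):
    """Character n-grams with boundary markers, usable as classifier features."""
    padded = f"^{text}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


if __name__ == "__main__":
    typed = "ntktajy"                # looks like noise in the Latin alphabet
    candidate = remap_layout(typed)  # "телефон" ("telephone")
    print(candidate)
    print(char_ngrams(candidate))
```

A model trained on n-grams like these could then decide whether the typed query or the remapped candidate is the more plausible reading.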

Proceedings of the 6th International Joint Conference on Natural Language Processing

Selective Combination of Pivot and Direct Statistical Machine Translation Models

In this paper, we propose a selective combination approach of pivot and direct statistical machine translation (SMT) models to improve translation quality. We work with Persian-Arabic SMT as a case study. We show positive results (from 0.4 to 3.1 BLEU on different direct training corpus sizes) in addition to a large reduction of pivot translation model size.

Association for Computational Linguistics, Sofia, Bulgaria, 4–9 August 2013

Language Independent Connectivity Strength Features for Phrase Pivot Statistical Machine Translation

An important challenge to statistical machine translation (SMT) is the lack of parallel data for many language pairs. One common solution is to pivot through a third language for which there exist parallel corpora with the source and target languages. Although pivoting is a robust technique, it introduces some low-quality translations. In this paper, we present two language-independent features to improve the quality of phrase-pivot-based SMT. The features, source connectivity strength and target connectivity strength, reflect the quality of projected alignments between the source and target phrases in the pivot phrase table. We show positive results (0.6 BLEU points) on Persian-Arabic SMT as a case study.
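Phrase pivoting is commonly implemented by joining a source-to-pivot phrase table with a pivot-to-target phrase table and marginalizing over the shared pivot phrases. The sketch below shows that standard triangulation step only; the connectivity strength features from the paper are not implemented. The table layout and phrase names are hypothetical.

```python
from collections import defaultdict


def pivot_phrase_table(src_to_pivot, pivot_to_tgt):
    """Build a source-to-target table via p(t|s) = sum over p of p(p|s) * p(t|p)."""
    table = defaultdict(lambda: defaultdict(float))
    for src, pivots in src_to_pivot.items():
        for pivot, p_pivot_given_src in pivots.items():
            # Pivot phrases missing from the second table contribute nothing.
            for tgt, p_tgt_given_pivot in pivot_to_tgt.get(pivot, {}).items():
                table[src][tgt] += p_pivot_given_src * p_tgt_given_pivot
    return {src: dict(tgts) for src, tgts in table.items()}


if __name__ == "__main__":
    # Toy tables with placeholder phrases and made-up probabilities.
    src_to_pivot = {"s1": {"p1": 0.7, "p2": 0.3}}
    pivot_to_tgt = {
        "p1": {"t1": 0.9, "t2": 0.1},
        "p2": {"t1": 0.5, "t3": 0.5},
    }
    for tgt, prob in pivot_phrase_table(src_to_pivot, pivot_to_tgt)["s1"].items():
        print(tgt, round(prob, 2))
```

Every source and target pair that shares any pivot phrase ends up in the joined table, which is why pivot tables grow large and include weak pairs that features such as connectivity strength are meant to down-weight.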

Proceedings of KDD’12, Beijing, China. August 2012

Bootstrapped Language Identification For Multi-Site Internet Domains

We present an algorithm for language identification, in particular of short documents, for the case of an Internet domain with sites in multiple countries with differing languages.

The algorithm is significantly faster than standard language identification methods, while providing state-of-the-art identification. We bootstrap the algorithm from language identification based on the site alone, a methodology suitable for any supervised language identification algorithm.

We demonstrate the bootstrapping and algorithm on eBay email data and on Twitter status updates data. The algorithm is deployed at eBay as part of the back-office development data repository.
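The bootstrapping idea described above can be illustrated by treating the language of the originating site as a noisy label for each document, training a supervised classifier on those labels, and optionally relabeling and retraining. The sketch below uses a character trigram naive Bayes model as a stand-in classifier; this is an assumption, not the algorithm from the paper, and the example documents are invented.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Lowercased character n-grams with space padding at both ends."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(labeled_docs, n=3):
    """Count n-grams per language from (text, language) pairs."""
    counts = defaultdict(Counter)
    for text, lang in labeled_docs:
        counts[lang].update(char_ngrams(text, n))
    return counts


def classify(text, counts, n=3):
    """Naive Bayes with add-one smoothing and a uniform language prior."""
    vocab_size = len(set().union(*counts.values()))
    best_lang, best_score = None, -math.inf
    for lang, grams in counts.items():
        total = sum(grams.values())
        score = sum(
            math.log((grams[g] + 1) / (total + vocab_size))
            for g in char_ngrams(text, n)
        )
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


def bootstrap(docs_with_site_language, rounds=1, n=3):
    """Start from site-based labels, then relabel with the model and retrain."""
    counts = train(docs_with_site_language, n)
    for _ in range(rounds):
        relabeled = [(text, classify(text, counts, n))
                     for text, _ in docs_with_site_language]
        counts = train(relabeled, n)
    return counts


if __name__ == "__main__":
    docs = [
        ("free shipping on this new phone", "en"),     # from an English site
        ("brand new leather shoes for sale", "en"),
        ("kostenloser versand für dieses neue telefon", "de"),  # German site
        ("neue schuhe aus leder zu verkaufen", "de"),
    ]
    model = bootstrap(docs)
    print(classify("new shoes", model), classify("neue schuhe", model))
```

Because the site label only initializes the process, any supervised classifier could replace the naive Bayes model used here.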

People