Optimal Beam Search for Machine Translation

Alexander M. Rush, Yin-Wen Chang
MIT CSAIL, Cambridge, MA 02139, USA
{srush, yinwen}@csail.mit.edu

Michael Collins
Department of Computer Science, Columbia University, New York, NY 10027, USA
mcollins@cs.columbia.edu

Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 210–221, Seattle, Washington, USA, 18-21 October 2013. © 2013 Association for Computational Linguistics

Abstract

Beam search is a fast and empirically effective method for translation decoding, but it lacks formal guarantees about search error. We develop a new decoding algorithm that combines the speed of beam search with the optimal certificate property of Lagrangian relaxation, and apply it to phrase- and syntax-based translation decoding. The new method is efficient, utilizes standard MT algorithms, and returns an exact solution on the majority of translation examples in our test data. The algorithm is 3.5 times faster than an optimized incremental constraint-based decoder for phrase-based translation and 4 times faster for syntax-based translation.

1 Introduction

Beam search (Koehn et al., 2003) and cube pruning (Chiang, 2007) have become the de facto decoding algorithms for phrase- and syntax-based translation. The algorithms are central to large-scale machine translation systems due to their efficiency and tendency to produce high-quality translations (Koehn, 2004; Koehn et al., 2007; Dyer et al., 2010). However despite practical effectiveness, neither algorithm provides any bound on possible decoding error.

In this work we present a variant of beam search decoding for phrase- and syntax-based translation. The motivation is to exploit the effectiveness and efficiency of beam search, but still maintain formal guarantees. The algorithm has the following benefits:

• In theory, it can provide a certificate of optimality; in practice, we show that it produces optimal hypotheses, with certificates of optimality, on the vast majority of examples.

• It utilizes well-studied algorithms and extends off-the-shelf beam search decoders.

• Empirically it is very fast, results show that it is 3.5 times faster than an optimized incremental constraint-based solver.

While our focus is on fast decoding for machine translation, the algorithm we present can be applied to a variety of dynamic programming-based decoding problems. The method only relies on having a constrained beam search algorithm and a fast unconstrained search algorithm. Similar algorithms exist for many NLP tasks.

We begin in Section 2 by describing constrained hypergraph search and showing how it generalizes translation decoding. Section 3 introduces a variant of beam search that is, in theory, able to produce a certificate of optimality. Section 4 shows how to improve the effectiveness of beam search by using weights derived from Lagrangian relaxation. Section 5 puts everything together to derive a fast beam search algorithm that is often optimal in practice.

Experiments compare the new algorithm with several variants of beam search, cube pruning, A* search, and relaxation-based decoders on two translation tasks. The optimal beam search algorithm is able to find exact solutions with certificates of optimality on 99% of translation examples, significantly more than other baselines. Additionally the optimal beam search algorithm is much faster than other exact methods.

2 Background

The focus of this work is decoding for statistical machine translation. Given a source sentence, the goal is to find the target sentence that maximizes a combination of translation model and language model scores. In order to analyze this decoding problem, we first abstract away from the specifics of translation into a general form, known as a hypergraph. In this section, we describe the hypergraph formalism and its relation to machine translation.

2.1 Notation

Throughout the paper, scalars and vectors are written in lowercase, matrices are written in uppercase, and sets are written in script-case, e.g. $\mathcal{X}$. All vectors are assumed to be column vectors. The function $\delta(j)$ yields an indicator vector with $\delta(j)_j = 1$ and $\delta(j)_i = 0$ for all $i \neq j$.

2.2 Hypergraphs and Search

A directed hypergraph is a pair $(\mathcal{V}, \mathcal{E})$ where $\mathcal{V} = \{1 \ldots |\mathcal{V}|\}$ is a set of vertices, and $\mathcal{E}$ is a set of directed hyperedges. Each hyperedge $e \in \mathcal{E}$ is a tuple $\langle \langle v_2, \ldots, v_{|v|} \rangle, v_1 \rangle$ where $v_i \in \mathcal{V}$ for $i \in \{1 \ldots |v|\}$. The head of the hyperedge is $h(e) = v_1$. The tail of the hyperedge is the ordered sequence $t(e) = \langle v_2, \ldots, v_{|v|} \rangle$. The size of the tail $|t(e)|$ may vary across different hyperedges, but $|t(e)| \geq 1$ for all edges and is bounded by a constant. A directed graph is a directed hypergraph with $|t(e)| = 1$ for all edges $e \in \mathcal{E}$.

Each vertex $v \in \mathcal{V}$ is either a non-terminal or a terminal in the hypergraph. The set of non-terminals is $\mathcal{N} = \{v \in \mathcal{V} : h(e) = v \text{ for some } e \in \mathcal{E}\}$. Conversely, the set of terminals is defined as $\mathcal{T} = \mathcal{V} \setminus \mathcal{N}$.

All directed hypergraphs used in this work are acyclic: informally this implies that no hyperpath (as defined below) contains the same vertex more than once (see Martin et al. (1990) for a full definition). Acyclicity implies a partial topological ordering of the vertices. We also assume there is a distinguished root vertex 1 where for all $e \in \mathcal{E}$, $1 \notin t(e)$.

Next we define a hyperpath as $x \in \{0, 1\}^{|\mathcal{E}|}$ where $x(e) = 1$ if hyperedge $e$ is used in the hyperpath, $x(e) = 0$ otherwise. The set of valid hyperpaths is defined as

$$\mathcal{X} = \left\{ x : \sum_{e \in \mathcal{E} : h(e) = 1} x(e) = 1, \quad \sum_{e : h(e) = v} x(e) = \sum_{e : v \in t(e)} x(e) \;\; \forall v \in \mathcal{N} \setminus \{1\} \right\}$$

The first problem we consider is unconstrained hypergraph search. Let $\theta \in \mathbb{R}^{|\mathcal{E}|}$ be the weight vector for the hypergraph and let $\tau \in \mathbb{R}$ be a weight offset.¹ The unconstrained search problem is to find

$$\max_{x \in \mathcal{X}} \sum_{e \in \mathcal{E}} \theta(e) x(e) + \tau = \max_{x \in \mathcal{X}} \theta^\top x + \tau$$

This maximization can be computed for any weights and directed acyclic hypergraph in time $O(|\mathcal{E}|)$ using dynamic programming. Figure 1 shows this algorithm which is simply a version of the CKY algorithm.

    procedure BESTPATHSCORE($\theta$, $\tau$)
        $\pi[v] \leftarrow 0$ for all $v \in \mathcal{T}$
        for $e \in \mathcal{E}$ in topological order do
            $\langle \langle v_2, \ldots, v_{|v|} \rangle, v_1 \rangle \leftarrow e$
            $s \leftarrow \theta(e) + \sum_{i=2}^{|v|} \pi[v_i]$
            if $s > \pi[v_1]$ then $\pi[v_1] \leftarrow s$
        return $\pi[1] + \tau$

Figure 1: Dynamic programming algorithm for unconstrained hypergraph search. Note that this version only returns the highest score: $\max_{x \in \mathcal{X}} \theta^\top x + \tau$. The optimal hyperpath can be found by including back-pointers.

Next consider a variant of this problem: constrained hypergraph search. Constraints will be necessary for both phrase- and syntax-based decoding. In phrase-based models, the constraints will ensure that each source word is translated exactly once. In syntax-based models, the constraints will be used to intersect a translation forest with a language model.

In the constrained hypergraph problem, hyperpaths must fulfill additional linear hyperedge constraints. Define the set of constrained hyperpaths as

$$\mathcal{X}' = \{ x \in \mathcal{X} : Ax = b \}$$

¹ The purpose of the offset will be clear in later sections. For this section, the value of $\tau$ can be taken as 0.
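The dynamic program of Figure 1 and the constraint set $\mathcal{X}' = \{x \in \mathcal{X} : Ax = b\}$ can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `(tail, head)` edge encoding, and the toy hypergraph are all hypothetical. Two details are also assumptions that Figure 1 leaves implicit: scores of non-terminals start at negative infinity, and the edge list is already sorted so that every edge producing a vertex precedes any edge consuming it.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical encoding: a hyperedge is (tail, head), i.e. (<v_2..v_|v|>, v_1).
Hyperedge = Tuple[Tuple[int, ...], int]


def best_path_score(edges: Sequence[Hyperedge], theta: Sequence[float],
                    terminals: Sequence[int], tau: float = 0.0,
                    root: int = 1) -> Tuple[float, Dict[int, int]]:
    """Return max over hyperpaths of theta^T x + tau, plus back-pointers.

    Assumes `edges` is topologically ordered: all edges with head v appear
    before any edge that has v in its tail.
    """
    pi: Dict[int, float] = {v: 0.0 for v in terminals}  # pi[v] = 0 for terminals
    back: Dict[int, int] = {}                           # best incoming edge per vertex
    for idx, (tail, head) in enumerate(edges):
        s = theta[idx] + sum(pi[v] for v in tail)
        # Assumption: non-terminals implicitly start at -infinity.
        if s > pi.get(head, -math.inf):
            pi[head] = s
            back[head] = idx
    return pi[root] + tau, back


def recover_hyperpath(edges: Sequence[Hyperedge], back: Dict[int, int],
                      root: int = 1) -> List[int]:
    """Follow back-pointers from the root to build the indicator vector x."""
    x = [0] * len(edges)
    stack = [root]
    while stack:
        v = stack.pop()
        if v in back:                 # terminals have no back-pointer
            e = back[v]
            x[e] = 1
            stack.extend(edges[e][0])
    return x


def satisfies_constraints(A: Sequence[Sequence[int]], b: Sequence[int],
                          x: Sequence[int]) -> bool:
    """Check the linear hyperedge constraints Ax = b row by row."""
    return all(sum(a * xe for a, xe in zip(row, x)) == rhs
               for row, rhs in zip(A, b))


# Toy hypergraph: root 1, non-terminals 2 and 3, terminals 4 and 5.
TERMINALS = [4, 5]
EDGES: List[Hyperedge] = [((4,), 2), ((5,), 2), ((4, 5), 3), ((2,), 1), ((3,), 1)]
THETA = [1.0, 2.0, 0.5, 1.0, 3.0]
# Hypothetical constraint: exactly one of the first two hyperedges is used.
A = [[1, 1, 0, 0, 0]]
B = [1]

if __name__ == "__main__":
    score, back = best_path_score(EDGES, THETA, TERMINALS)
    x = recover_hyperpath(EDGES, back)
    print(score, x, satisfies_constraints(A, B, x))
```

In this toy instance the unconstrained optimum uses the last hyperedge and violates the constraint, while a lower-scoring hyperpath through vertex 2 satisfies it. This is the gap that constrained search has to close.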

