Machine Translation 3: Linguistics in SMT and NMT
Ondřej Bojar, bojar@ufal.mff.cuni.cz, Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague
January 2019 MT3: Linguistics in SMT and NMT
Translation options for "I saw two green striped cats ."
I: já
saw: pila, pily, ..., viděl, viděla, ..., uviděl, uviděla, ..., viděl jsem, viděla jsem
two: dva, dvě, dvou, dvěma, dvěmi
green: zelený, zelená, zelené, zelení, zeleného, zelených, zelenému, zeleným, zelenou, zelenými, ...
striped: pruhovaný, pruhovaná, pruhované, pruhovaní, pruhovaného, pruhovaných, pruhovanému, pruhovaným, pruhovanou, pruhovanými, ...
cats: kočky, koček, kočkám, kočkách, kočkami
. : .
[Figure: morphological tags attached to the candidate word forms, e.g. the mixed sequence fem-loc, neut-acc, masc-nom-sg, fem-loc and the uniform sequences of masc-nom, fem-nom, fem-acc and fem-dat.]
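The candidate forms listed for "two green striped cats" multiply quickly, while only a few combinations agree in case. The sketch below counts both. It is an illustration under stated assumptions: `LEXICON` is a hand-written, simplified table restricted to feminine plural readings of a few of the listed forms, and `agreeing_phrases` is a hypothetical helper, not part of any MT toolkit.

```python
from itertools import product

# Assumption: hand-assigned case readings, feminine plural only, simplified.
LEXICON = {
    "two": {"dvě": {"nom", "acc"}, "dvou": {"gen", "loc"}, "dvěma": {"dat", "ins"}},
    "green": {"zelené": {"nom", "acc"}, "zelených": {"gen", "loc"},
              "zeleným": {"dat"}, "zelenými": {"ins"}},
    "striped": {"pruhované": {"nom", "acc"}, "pruhovaných": {"gen", "loc"},
                "pruhovaným": {"dat"}, "pruhovanými": {"ins"}},
    "cats": {"kočky": {"nom", "acc"}, "koček": {"gen"}, "kočkám": {"dat"},
             "kočkách": {"loc"}, "kočkami": {"ins"}},
}
ORDER = ["two", "green", "striped", "cats"]


def agreeing_phrases(lexicon, order):
    """Return (phrase, shared cases) for combinations whose words share a case."""
    results = []
    for forms in product(*(lexicon[word].items() for word in order)):
        shared = set.intersection(*(cases for _, cases in forms))
        if shared:
            results.append((" ".join(form for form, _ in forms), sorted(shared)))
    return results


# Number of raw combinations if every word is translated independently.
total = 1
for word in ORDER:
    total *= len(LEXICON[word])

phrases = agreeing_phrases(LEXICON, ORDER)
print(total, len(phrases))
for phrase, cases in phrases:
    print(phrase, cases)
```

Even in this reduced table, independent word choices give 240 candidate phrases, of which only 5 are internally consistent.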
(Koehn and Hoang, 2007)
Examples by Zdeněk Žabokrtský, Karel Oliva and others.
The grass around your house should be cut soon
Czech manual trees: 50% of edges link neighbours, 80% of edges fit in a 4-gram.
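The two percentages can be read as statistics over edge lengths in a dependency tree. The sketch below computes them for the English example sentence. The head indices in `HEADS` are a hand-made analysis assumed for illustration only (not taken from a treebank), and `edge_length_stats` is a hypothetical helper name.

```python
def edge_length_stats(heads, window=4):
    """Share of edges linking neighbours and share fitting in a `window`-gram.

    heads[i] is the 1-based position of the head of token i+1, 0 marks the root.
    An edge fits in an n-gram when its two ends are at most n-1 positions apart.
    """
    lengths = [abs(dep - head)
               for dep, head in enumerate(heads, start=1) if head != 0]
    neighbours = sum(1 for d in lengths if d == 1) / len(lengths)
    in_window = sum(1 for d in lengths if d < window) / len(lengths)
    return neighbours, in_window


TOKENS = "The grass around your house should be cut soon".split()
# Assumption: "cut" is the root; "grass", "should", "be", "soon" attach to it.
HEADS = [2, 8, 2, 5, 3, 8, 8, 0, 8]

neighbours, in_window = edge_length_stats(HEADS)
print(neighbours, in_window)  # 0.625 0.875
```

Under this analysis only the edge between "grass" and "cut" is too long for a 4-gram, which is the kind of long-distance dependency that an n-gram model cannot see.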
Despite this shortcoming, CFGs are popular and are “the” formal grammar for many, possibly due to the charm of the father of linguistics or to the abundance of dependency formalisms with no clear winner (Nivre, 2005).
See Kuhlmann and M¨
Czech: Proti odmítnutí se zítra Petr v práci rozhodl protestovat
Gloss: Against dismissal aux-refl tomorrow Peter at work decided to object
English: Peter decided to object against the dismissal at work tomorrow.
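Assuming the Czech sentence is meant to show dependencies that cross each other, the check can be sketched as follows. The head indices in `HEADS` follow the English translation ("against the dismissal" and "tomorrow" attached to "protestovat", the rest to "rozhodl") and are my own assumed analysis; `nonprojective_edges` is a hypothetical helper.

```python
def nonprojective_edges(heads):
    """Return (head, dependent) edges that span a word not dominated by the head.

    heads[i] is the 1-based position of the head of token i+1, 0 marks the root.
    """
    def dominated_by(node, ancestor):
        # Walk up the tree from `node` until the root is reached.
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node - 1]
        return False

    bad = []
    for dep, head in enumerate(heads, start=1):
        if head == 0:
            continue
        lo, hi = sorted((head, dep))
        if any(not dominated_by(k, head) for k in range(lo + 1, hi)):
            bad.append((head, dep))
    return bad


TOKENS = ["Proti", "odmítnutí", "se", "zítra", "Petr", "v", "práci",
          "rozhodl", "protestovat"]
# Assumed analysis: rozhodl (8) is the root, protestovat (9) depends on it.
HEADS = [9, 1, 8, 9, 8, 8, 6, 0, 8]

for head, dep in nonprojective_edges(HEADS):
    print(TOKENS[head - 1], "->", TOKENS[dep - 1])
```

With these heads, the edges from "protestovat" to "Proti" and to "zítra" both span words that belong to "rozhodl", so no context-free bracketing of the surface order keeps the phrase around "protestovat" contiguous.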
Background: Prague Linguistic Circle (since 1926). Theory: Sgall (1967), Panevová (1980), Sgall et al. (1986).
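[Figure: two dependency trees of one sentence, the first with edge labels AUXK, AUXR, OBJ, AUXV, SB, PRED, the second with labels PAT, ACT, PRED.]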
add nodes for “deleted” participants
e.g. active/passive voice, analytical verbs etc.
topic-focus articulation or anaphora
[Figure: two pairs of trees. Edge labels in the first pair: AUXK, AUXR, OBJ, AUXV, SB, PRED versus SB, AUXV, AUXV, PRED, AUXK. Edge labels in the second pair: PAT, ACT, PRED in both trees.]
The complications:
“Not necessary” once you have a t-tree, but useful to understand or to blame the right people.
$p(e_1^I \mid f_1^J) = p(e_1 \mid f_1^J) \cdot p(e_2 \mid e_1, f_1^J) \cdot p(e_3 \mid e_2, e_1, f_1^J) \cdots$
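The factorisation above scores a target sentence word by word, each word conditioned on the source and on the target words produced so far. A minimal sketch, assuming a conditional model is available as a plain function: `toy_cond_prob` is a hypothetical stand-in (uniform over a four-word vocabulary), not a real decoder.

```python
import math


def sentence_log_prob(target, source, cond_prob):
    """Sum log p(e_i | e_1..e_{i-1}, f_1..f_J) over all target positions."""
    total = 0.0
    for i, word in enumerate(target):
        history = tuple(target[:i])
        total += math.log(cond_prob(word, history, tuple(source)))
    return total


def toy_cond_prob(word, history, source):
    # Assumption: a uniform distribution, used only to make the sketch runnable.
    vocabulary = ("vidím", "dvě", "kočky", ".")
    return 1.0 / len(vocabulary)


source = "I see two cats .".split()
target = ["vidím", "dvě", "kočky"]
log_prob = sentence_log_prob(target, source, toy_cond_prob)
print(math.exp(log_prob))
```

Working in log space avoids underflow for long sentences; the product in the formula becomes a sum.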
Src: there are a million different kinds of pizza .
Baseline (BPE): existují miliony druhů piz@@ zy .
Interleave: VB3P existovat NNIP1 milion NNIP2 druh NNFS2 pizza Z: .
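The two target representations above can be produced and undone with a few lines of string handling. The sketch below assumes a strictly alternating tag/lemma sequence, as in the Interleave row, and the `@@` continuation marker of the Baseline row; the function names are hypothetical, and turning tag plus lemma back into an inflected word form is a separate generation step not shown here.

```python
def interleave(tags, lemmas):
    """Build the alternating sequence tag_1 lemma_1 tag_2 lemma_2 ..."""
    if len(tags) != len(lemmas):
        raise ValueError("need exactly one tag per lemma")
    sequence = []
    for tag, lemma in zip(tags, lemmas):
        sequence.extend([tag, lemma])
    return sequence


def deinterleave(sequence):
    """Split an alternating sequence back into (tags, lemmas)."""
    if len(sequence) % 2:
        raise ValueError("sequence must have even length")
    return sequence[0::2], sequence[1::2]


def merge_bpe(tokens, marker="@@"):
    """Join subword units: a token ending in `marker` continues in the next one."""
    words, buffer = [], ""
    for token in tokens:
        if token.endswith(marker):
            buffer += token[: -len(marker)]
        else:
            words.append(buffer + token)
            buffer = ""
    if buffer:  # dangling continuation at the end of the sequence
        words.append(buffer)
    return words


tags = ["VB3P", "NNIP1", "NNIP2", "NNFS2", "Z:"]
lemmas = ["existovat", "milion", "druh", "pizza", "."]
print(" ".join(interleave(tags, lemmas)))
print(" ".join(merge_bpe("existují miliony druhů piz@@ zy .".split())))
```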
Src (BPE): Obama receives Net+ an+ yahu in the capital of USA
Tgt: NP Obama ((S[dcl]\NP)/PP)/NP receives NP Net+ an+ yahu PP/NP in N
(The sequence of CCG tags may not match the translated sentence.)
[Plot legend: Reversed curriculum by target length; Baseline; Curriculum by target length; Sorted by length]
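The legend names several ways of ordering training examples. The sketch below shows three of them under a deliberately simple assumption: "curriculum" means shorter targets first, "reversed" means longer targets first, and "baseline" means a random shuffle. This is an illustration of the idea only, not the procedure of Kocmi and Bojar (2017); "sorted by length" is left out because the legend does not say which length is meant. `order_examples` is a hypothetical helper.

```python
import random


def order_examples(pairs, strategy, seed=0):
    """Order (source, target) pairs for training.

    Assumption: length is the number of whitespace-separated target tokens.
    """
    def target_length(pair):
        return len(pair[1].split())

    if strategy == "baseline":
        shuffled = list(pairs)
        random.Random(seed).shuffle(shuffled)
        return shuffled
    if strategy == "curriculum":
        return sorted(pairs, key=target_length)
    if strategy == "reversed":
        return sorted(pairs, key=target_length, reverse=True)
    raise ValueError(f"unknown strategy: {strategy}")


pairs = [
    ("I saw two green striped cats .", "viděl jsem dvě zelené pruhované kočky ."),
    ("cats", "kočky"),
    ("two cats", "dvě kočky"),
]
for source, target in order_examples(pairs, "curriculum"):
    print(len(target.split()), target)
```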
Nicola Bertoldi and Marcello Federico. 2009. Domain adaptation for statistical machine translation with monolingual resources. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 182–189, Athens, Greece, March. Association for Computational Linguistics.
Ondřej Bojar and Jan Hajič [...] Ohio, June. Association for Computational Linguistics.
Ondřej Bojar and Aleš Tamchyna. 2011. Improving Translation Model by Monolingual Data. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 330–336, Edinburgh, Scotland, July. Association for Computational Linguistics.
Ondřej Bojar. 2007. English-to-Czech Factored Machine Translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 232–239, Prague, Czech Republic, June. Association for Computational Linguistics.
Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Cho. 2017. Learning to parse and translate improves neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 72–78, Vancouver, Canada, July. Association for Computational Linguistics.
Thanh-Le Ha, Jan Niehues, and Alexander H. Waibel. 2017. Effective strategies in zero-shot neural machine [...]
Jan Hajič and Barbora Hladká [...] Rich, Structured Tagset. In Proceedings of COLING-ACL Conference, pages 483–490, Montreal, Canada.
Tomáš Holan, Vladislav Kuboň, Karel Oliva, and Martin Pl[...] Dependency-Based Grammars, Montreal. University of Montreal.
Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s multilingual neural machine translation system: Enabling zero-shot translation. CoRR, abs/1611.04558.
Václav Klimeš [...] UFAL, MFF UK, Prague, Czech Republic.
Tom Kocmi and Ondřej Bojar. 2017. Curriculum Learning and Minibatch Bucketing in Neural Machine Translation. In Proceedings of Recent Advances in NLP (RANLP 2017).
Philipp Koehn and Hieu Hoang. 2007. Factored Translation Models. In Proc. of EMNLP.
Marco Kuhlmann and Mathias M[...] 45th Annual Meeting of the Association of Computational Linguistics, pages 160–167, Prague, Czech Republic, [...]
Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič [...] Spanning Tree Algorithms. In Proceedings of HLT/EMNLP 2005, October.
Guido Minnen, John Carroll, and Darren Pearce. 2001. Applied morphological processing of English. Natural Language Engineering, 7(3):207–223.
Maria Nadejde, Siva Reddy, Rico Sennrich, Tomasz Dwojak, Marcin Junczys-Dowmunt, Philipp Koehn, and Alexandra Birch. 2017. Predicting target language CCG supertags improves neural machine translation. In Proceedings of the Second Conference on Machine Translation, Volume 1: Research Paper, pages 68–79, Copenhagen, Denmark, September. Association for Computational Linguistics.
Joakim Nivre. 2005. Dependency Grammar and Dependency Parsing. Technical Report MSI report 05133, Växjö [...]
Jarmila Panevová [...]e české věty [Forms and functions in the structure of the Czech se[...] Academia, Prague, Czech Republic.
Jan Ptáček and Zdeněk Žabokrtský [...]
Adwait Ratnaparkhi. 1996. A Maximum Entropy Part-Of-Speech Tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference, University of Pennsylvania, May.
Rico Sennrich and Barry Haddow. 2016. Linguistic input features improve neural machine translation. In Proceedings of the First Conference on Machine Translation, pages 83–91, Berlin, Germany, August. Association for Computational Linguistics.
Petr Sgall, Eva Hajičová, and Jarmila Panevová [...] Academia/Reidel Publishing Company, Prague, Czech Republic/Dordrecht, Netherlands.
Petr Sgall. 1967. Generativní popis jazyka a česká deklinace. Academia, Prague, Czech Republic.
Aleš Tamchyna, Marion Weller-Di Marco, and Alexander Fraser. 2017. Modeling target-side inflection in neural machine translation. In Proceedings of the Second Conference on Machine Translation, Volume 1: Research Paper, pages 32–42, Copenhagen, Denmark, September. Association for Computational Linguistics.
Nicola Ueffing, Gholamreza Haffari, and Anoop Sarkar. 2007. Semi-supervised model adaptation for statistical machine translation. Machine Translation, 21(2):77–94.