Learning attention for historical text normalization by learning to pronounce

SLIDE 1


Learning attention for historical text normalization by learning to pronounce

Marcel Bollmann¹, Joachim Bingel², Anders Søgaard²

¹Ruhr-Universität Bochum, Germany
²University of Copenhagen, Denmark

ACL 2017, Vancouver, Canada
July 31, 2017

SLIDE 2

Motivation

[Image: Sample of a manuscript from Early New High German]

SLIDE 3

A corpus of Early New High German

◮ Medieval religious treatise “Interrogatio Sancti Anselmi de Passione Domini”
◮ > 50 manuscripts and prints (in German)
◮ 14th–16th century
◮ Various dialects
  ◮ Bavarian
  ◮ Middle German
  ◮ Low German
  ◮ ...

[Image: Sample from an Anselm manuscript]
http://www.linguistics.rub.de/anselm/

SLIDE 4

Examples of historical spellings

Frau (woman):     fraw, frawe, fräwe, frauwe, fraüwe, frow, frouw, vraw, vrow, vorwe, vrauwe, vrouwe
Kind (child):     chind, chinde, chindt, chint, kind, kinde, kindi, kindt, kint, kinth, kynde, kynt
Mutter (mother):  moder, moeder, mueter, müeter, muoter, muotter, muter, mutter, mvoter, mvter, mweter

SLIDE 5

Examples of historical spellings

Frau (woman):     fraw, frawe, fräwe, frauwe, fraüwe, frow, frouw, vraw, vrow, vorwe, vrauwe, vrouwe
Kind (child):     chind, chinde, chindt, chint, kind, kinde, kindi, kindt, kint, kinth, kynde, kynt
Mutter (mother):  moder, moeder, mueter, müeter, muoter, muotter, muter, mutter, mvoter, mvter, mweter

Normalization: the mapping of historical spellings to their modern-day equivalents.

SLIDE 6

Previous work

◮ Hand-crafted algorithms
  ◮ VARD (Baron & Rayson, 2008)
  ◮ Norma (Bollmann, 2012)
◮ Character-based statistical machine translation (CSMT)
  ◮ Scherrer and Erjavec (2013), Pettersson et al. (2013), ...
◮ Sequence labelling with neural networks
  ◮ Bollmann and Søgaard (2016)

SLIDE 7

Previous work

◮ Hand-crafted algorithms
  ◮ VARD (Baron & Rayson, 2008)
  ◮ Norma (Bollmann, 2012)
◮ Character-based statistical machine translation (CSMT)
  ◮ Scherrer and Erjavec (2013), Pettersson et al. (2013), ...
◮ Sequence labelling with neural networks
  ◮ Bollmann and Søgaard (2016)

◮ Now: “Character-based neural machine translation”

SLIDE 8

An encoder/decoder model

[Diagram: encoder (character embeddings + LSTM) reads the historical spelling "c h i n t"; decoder (embeddings + LSTM + prediction layer) generates the modern form "k i n d" between ‹S› and ‹E› markers]
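To make the architecture concrete, here is a minimal character-level encoder/decoder sketch in PyTorch. It is illustrative only, not the authors' implementation; the vocabulary size, layer dimensions, and tensor shapes are assumptions.

    import torch
    import torch.nn as nn

    class Seq2Seq(nn.Module):
        def __init__(self, vocab_size, emb_dim=32, hid_dim=64):
            super().__init__()
            self.src_emb = nn.Embedding(vocab_size, emb_dim)
            self.tgt_emb = nn.Embedding(vocab_size, emb_dim)
            self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
            self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
            self.out = nn.Linear(hid_dim, vocab_size)    # prediction layer

        def forward(self, src, tgt_in):
            # src: historical spelling, e.g. "c h i n t"; tgt_in: "‹S› k i n d"
            _, state = self.encoder(self.src_emb(src))   # final state summarizes the input
            dec_out, _ = self.decoder(self.tgt_emb(tgt_in), state)
            return self.out(dec_out)                     # per-step logits over characters

    # Toy usage with a hypothetical 30-symbol character vocabulary.
    model = Seq2Seq(vocab_size=30)
    src = torch.randint(0, 30, (1, 5))      # "c h i n t"
    tgt_in = torch.randint(0, 30, (1, 5))   # "‹S› k i n d"
    logits = model(src, tgt_in)             # shape: (1, 5, 30)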

SLIDE 9

An encoder/decoder model

  Model                                         Avg. accuracy
  Bi-LSTM tagger (Bollmann & Søgaard, 2016)         79.91%
  Base model  Greedy                                78.91%

Evaluation on 43 texts from the Anselm corpus (≈ 4,000–13,000 tokens each)

SLIDE 10

An encoder/decoder model

  Model                                         Avg. accuracy
  Bi-LSTM tagger (Bollmann & Søgaard, 2016)         79.91%
  Base model  Greedy                                78.91%
              Beam                                  79.27%

Evaluation on 43 texts from the Anselm corpus (≈ 4,000–13,000 tokens each)

SLIDE 11

An encoder/decoder model

  Model                                         Avg. accuracy
  Bi-LSTM tagger (Bollmann & Søgaard, 2016)         79.91%
  Base model  Greedy                                78.91%
              Beam                                  79.27%
              Beam + Filter                         80.46%

Evaluation on 43 texts from the Anselm corpus (≈ 4,000–13,000 tokens each)
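The "Beam" and "Filter" rows suggest that the decoding strategy matters alongside the model itself. The sketch below shows one way such decoding could look; the step_fn interface and the reading of "Filter" as restricting finished hypotheses to known modern word forms are assumptions made for illustration, not details taken from the slides.

    # Beam-search decoding with an optional lexicon filter (illustrative sketch).
    def beam_search(step_fn, beam_size=5, max_len=20, lexicon=None, eos="‹E›"):
        # step_fn(prefix) -> {character: log-probability} is an assumed model interface.
        beams = [("", 0.0)]                  # (prefix, cumulative log-probability)
        finished = []
        for _ in range(max_len):
            candidates = []
            for prefix, score in beams:
                for ch, logp in step_fn(prefix).items():
                    if ch == eos:
                        finished.append((prefix, score + logp))
                    else:
                        candidates.append((prefix + ch, score + logp))
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
            if not beams:
                break
        if lexicon:                          # "Filter": keep only known modern word forms
            known = [h for h in finished if h[0] in lexicon]
            finished = known or finished     # fall back if no hypothesis survives the filter
        return max(finished, key=lambda c: c[1])[0] if finished else ""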

SLIDE 12

Attentional model

[Diagram: attentional model — bidirectional LSTM encoder over "c h i n t"; an attention model feeds an attentional LSTM decoder whose prediction layer generates "k i n d" between ‹S› and ‹E› markers]

SLIDE 13

Attentional model

  Model                                         Avg. accuracy
  Bi-LSTM tagger (Bollmann & Søgaard, 2016)         79.91%
  Base model  Greedy                                78.91%
              Beam                                  79.27%
              Beam + Filter                         80.46%
              Beam + Filter + Attention             82.72%

Evaluation on 43 texts from the Anselm corpus (≈ 4,000–13,000 tokens each)

SLIDE 14

Learning to pronounce

Can we improve results with multi-task learning?

SLIDE 15

Learning to pronounce

◮ Idea: grapheme-to-phoneme mapping as auxiliary task
◮ CELEX 2 lexical database (Baayen et al., 1995)
◮ Sample mappings for German:

  Jungfrau → jUN-frB
  Abend → ab@nt
  nicht → nIxt
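One way to picture the auxiliary task: the grapheme-to-phoneme pairs can be cast in the same character-transduction format as the normalization data, so the same encoder/decoder can train on both. The lowercasing and list format below are illustrative assumptions, not the paper's preprocessing.

    # Hypothetical preprocessing of CELEX-style pairs into auxiliary training examples.
    celex_pairs = [("Jungfrau", "jUN-frB"), ("Abend", "ab@nt"), ("nicht", "nIxt")]
    aux_examples = [(list(graph.lower()), list(phon)) for graph, phon in celex_pairs]
    # e.g. (['n','i','c','h','t'], ['n','I','x','t']) mirrors (['c','h','i','n','t'], ['k','i','n','d'])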

SLIDE 16

Multi-task learning

[Diagram: multi-task setup — shared encoder LSTM and decoder LSTM with separate prediction layers for the CELEX task and the historical task; shown here on the historical example "c h i n t" → "k i n d"]

SLIDE 17

Multi-task learning

[Diagram: the same shared encoder/decoder, here processing the CELEX example "n i c h t" → "n I x t" through the CELEX prediction layer]
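A minimal sketch of this hard parameter sharing in PyTorch: one shared embedding, encoder, and decoder, with one prediction layer per task. The shared character/phoneme vocabulary and the layer sizes are assumptions for illustration, not the paper's configuration.

    import torch
    import torch.nn as nn

    class MultiTaskSeq2Seq(nn.Module):
        def __init__(self, vocab_size, emb_dim=32, hid_dim=64):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)        # shared over both tasks
            self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
            self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
            self.heads = nn.ModuleDict({
                "historical": nn.Linear(hid_dim, vocab_size),   # chint -> kind
                "celex": nn.Linear(hid_dim, vocab_size),        # nicht -> nIxt
            })

        def forward(self, src, tgt_in, task):
            _, state = self.encoder(self.emb(src))
            dec_out, _ = self.decoder(self.emb(tgt_in), state)
            return self.heads[task](dec_out)                    # task-specific prediction layer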

SLIDE 18

Multi-task learning

  Model                                         Avg. accuracy
  Bi-LSTM tagger (Bollmann & Søgaard, 2016)         79.91%
  Base model  Greedy                                78.91%
              Beam                                  79.27%
              Beam + Filter                         80.46%
              Beam + Filter + Attention             82.72%
  MTL model   Greedy                                80.64%
              Beam                                  81.13%
              Beam + Filter                         82.76%
              Beam + Filter + Attention             82.02%

SLIDE 19

Why does MTL not improve with attention?

Hypothesis: Attention and MTL learn similar functions of the input data.

“MTL can be used to coerce the learner to attend to patterns in the input it would otherwise ignore. This is done by forcing it to learn internal representations to support related tasks that depend on such patterns.”
– Caruana (1998), p. 112 f.

SLIDE 20

Comparing the model outputs

  Historical             gewarnet     uberhübe      scholt
  Base model  G          prandet      überbroch     sollt
              B          prandert     überbräche    sollt
              B+F        pranget      über          sollt
              B+F+A      gewarnt      übergebe      sollte
  MTL model   G          gewarntet    überbeh       sollte
              B          gewarntet    übereube      sollte
              B+F        gewarnt      übergebe      sollte
              B+F+A      gewand       über          sollte
  Target                 gewarnt      überhob       sollte

SLIDE 21

Saliency plots

Saliency plots computed following Li, Chen, Hovy, and Jurafsky (2016).

[Figure: saliency plots for the Base, Attention, and MTL models]

→ For words ≥ 7 characters, Attention and MTL correlate most.
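As a rough illustration of how such plots can be produced, the sketch below computes first-derivative saliency in the spirit of Li et al. (2016): the gradient norm of a target score with respect to each input character embedding. It reuses the attribute names from the encoder/decoder sketch earlier and is an assumption about the setup, not the authors' analysis code.

    import torch

    def saliency(model, src, tgt_in, tgt_out):
        emb = model.src_emb(src)                     # (1, src_len, emb_dim)
        emb.retain_grad()                            # keep gradients on this non-leaf tensor
        _, state = model.encoder(emb)
        dec_out, _ = model.decoder(model.tgt_emb(tgt_in), state)
        logits = model.out(dec_out)
        score = logits.gather(-1, tgt_out.unsqueeze(-1)).sum()   # score of the gold characters
        score.backward()
        return emb.grad.norm(dim=-1).squeeze(0)      # one saliency value per input character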

SLIDE 22

Conclusion

◮ Encoder/decoder models for historical text normalization are competitive
  ◮ Despite small datasets (≈ 4,200–13,200 tokens per text)
  ◮ Beam search & attention improve results further
◮ MTL with grapheme-to-phoneme task helps
◮ Attention and MTL have a similar effect
  ◮ Can this be reproduced on other tasks?
  ◮ What factors affect this (choice of attention mechanism/auxiliary task/...)?

SLIDE 23

Thank you for listening!

Code: https://bitbucket.org/mbollmann/acl2017
Further questions? bollmann@linguistics.rub.de

@mmbollmann

SLIDE 24

References I

Baayen, R. H., Piepenbrock, R., & Gulikers, L. (1995). The CELEX lexical database (Release 2) [CD-ROM]. Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA.

Baron, A., & Rayson, P. (2008). VARD 2: A tool for dealing with spelling variation in historical corpora. In Proceedings of the Postgraduate Conference in Corpus Linguistics.

Bollmann, M. (2012). (Semi-)automatic normalization of historical texts using distance measures and the Norma tool. In Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities (ACRH-2). Lisbon, Portugal.

Bollmann, M., & Søgaard, A. (2016). Improving historical spelling normalization with bi-directional LSTMs and multi-task learning. In Proceedings of COLING 2016 (pp. 131–139). Osaka, Japan.

Caruana, R. (1998). Multitask learning. In Learning to Learn (pp. 95–133). Springer. Retrieved from http://dl.acm.org/citation.cfm?id=296635.296645

SLIDE 25

References II

Li, J., Chen, X., Hovy, E., & Jurafsky, D. (2016). Visualizing and understanding neural models in NLP. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 681–691). Association for Computational Linguistics. Retrieved from http://aclweb.org/anthology/N16-1082. doi: 10.18653/v1/N16-1082

Pettersson, E., Megyesi, B., & Tiedemann, J. (2013). An SMT approach to automatic annotation of historical text. In Proceedings of the NoDaLiDa Workshop on Computational Historical Linguistics. Oslo, Norway.

Scherrer, Y., & Erjavec, T. (2013). Modernizing historical Slovene words with character-based SMT. In Proceedings of the 4th Biennial Workshop on Balto-Slavic Natural Language Processing. Sofia, Bulgaria.

SLIDE 26

References III

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., ... Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning (JMLR Workshop and Conference Proceedings, Vol. 37, pp. 2048–2057). Lille, France. Retrieved from http://proceedings.mlr.press/v37/xuc15.pdf

SLIDE 27

Dealing with spelling variation

The problems...
◮ Difficult to annotate with tools aimed at modern data
◮ High variance in spelling
◮ None/very little training data

Normalization...
◮ Removes variance
◮ Enables re-use of existing tools
◮ Useful annotation layer (e.g. for corpus query)

Normalization: the mapping of historical spellings to their modern-day equivalents.

SLIDE 28

Attention mechanism: details

◮ Attention mechanism follows Xu et al. (2015)

\hat{z}_t = \sum_{i=1}^{n} \alpha_i a_i \qquad (1)

\alpha = \mathrm{softmax}(f_{\mathrm{att}}(a, h_{t-1})) \qquad (2)

i_t = \sigma(W_i [h_{t-1}, y_{t-1}, \hat{z}_t] + b_i)
f_t = \sigma(W_f [h_{t-1}, y_{t-1}, \hat{z}_t] + b_f)
o_t = \sigma(W_o [h_{t-1}, y_{t-1}, \hat{z}_t] + b_o)
g_t = \tanh(W_g [h_{t-1}, y_{t-1}, \hat{z}_t] + b_g)
c_t = f_t \odot c_{t-1} + i_t \odot g_t
h_t = o_t \odot \tanh(c_t) \qquad (3)
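As a sketch of how these equations turn into code, the function below performs one decoder step with soft attention over the encoder annotations. The dictionary-of-weights layout and the additive form of f_att are assumptions made for readability, not the paper's exact parameterization.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def attention_lstm_step(a, h_prev, y_prev, c_prev, W, b, W_att, v_att):
        # a: encoder annotations (n, d_a); h_prev, c_prev: previous decoder state;
        # y_prev: embedding of the previously generated character.
        scores = np.tanh(a @ W_att["a"] + h_prev @ W_att["h"]) @ v_att   # f_att, eq. (2)
        alpha = softmax(scores)
        z = alpha @ a                                                    # context vector, eq. (1)
        x = np.concatenate([h_prev, y_prev, z])
        i = sigmoid(W["i"] @ x + b["i"])                                 # LSTM gates, eq. (3)
        f = sigmoid(W["f"] @ x + b["f"])
        o = sigmoid(W["o"] @ x + b["o"])
        g = np.tanh(W["g"] @ x + b["g"])
        c = f * c_prev + i * g
        h = o * np.tanh(c)
        return h, c, alpha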

SLIDE 29

Differences of learned parameters
