Using unsupervised corpus-based methods to build rule-based machine translation systems
Felipe Sánchez Martínez (Univ. d’Alacant)
fsanchez@dlsi.ua.es
Ph.D. thesis supervised by Mikel L. Forcada and Juan Antonio Pérez Ortiz
30th June 2008

Outline
1 Motivation & goal
2 Part-of-speech taggers for machine translation
    Part-of-speech tagging
    MT-oriented hidden Markov model training
3 Pruning of disambiguation paths
    Disadvantages of the MT-oriented method
    Pruning method
4 Part-of-speech tag clustering
    Best HMM topology for taggers used in MT
    Bottom-up agglomerative clustering
5 Automatic inference of transfer rules
    Alignment templates for shallow-transfer machine translation
    Generation of Apertium transfer rules
6 Concluding remarks

Motivation & goal

Motivation

Experience in the development of shallow-transfer MT systems:
• interNOSTRUM: Spanish ↔ Catalan
• Traductor Universia: Spanish ↔ Portuguese
• Apertium: several language pairs available
Huge human effort to code all the linguistic resources

Resources usually needed by shallow-transfer MT systems:
• Monolingual dictionaries
• Part-of-speech (PoS) taggers
• Bilingual dictionaries
• Structural transfer rules

Goal

Goal: to reduce the human effort
• using corpus-based methods
• in an unsupervised way

Focus on:
• the PoS taggers used in the analysis phase
• the set of shallow structural transfer rules used in translation
⇒ Benefiting from the rest of resources ⇐

Apertium pipeline (http://apertium.org):
SL text → morph. analyzer → PoS tagger → struct. transfer (with lexical transfer) → morph. generator → post-generator → TL text

Part-of-speech taggers for machine translation

Part-of-speech tagging /1

Problem: selecting the correct PoS tag for those words with more than one (ambiguous words)
⇒ Hidden Markov models (HMM) are one of the standard statistical solutions
• Each HMM state corresponds to a different PoS tag
• Each input word is replaced by its corresponding ambiguity class

[Figure: fragment of an HMM whose states are PoS tags (verb, noun, ...) and whose observable outputs are ambiguity classes such as {verb}, {noun}, {verb, noun}, {noun, prep} and {verb, noun, adj}, annotated with example probabilities.]
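As an illustration of how an HMM whose states are PoS tags can disambiguate a sequence of ambiguity classes, here is a minimal Viterbi sketch. All probabilities, the tag names and the function name `viterbi` are made up for this example and are not taken from the thesis; only the table entries needed by the toy sentence are listed, so the rows do not sum to one.

```python
import math

def viterbi(observations, start_p, trans_p, emit_p):
    """Most likely tag sequence for a list of ambiguity classes (frozensets of tags)."""
    # best[i][tag] = (log-probability of the best path ending in tag, previous tag)
    best = [{
        tag: (math.log(start_p[tag] * emit_p[tag][observations[0]]), None)
        for tag in observations[0]  # only tags in the ambiguity class are candidates
    }]
    for obs in observations[1:]:
        column = {}
        for tag in obs:
            score, prev = max(
                (best[-1][p][0] + math.log(trans_p[p][tag]), p) for p in best[-1]
            )
            column[tag] = (score + math.log(emit_p[tag][obs]), prev)
        best.append(column)
    # Backtrace from the best final tag
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]

# Toy ambiguity classes and (hypothetical) parameters
PRN = frozenset({"prn"})
ART = frozenset({"art"})
NV = frozenset({"noun", "verb"})

start_p = {"prn": 0.4, "noun": 0.3, "verb": 0.1, "art": 0.2}
trans_p = {
    "prn": {"noun": 0.1, "verb": 0.7},
    "noun": {"art": 0.1},
    "verb": {"art": 0.6},
    "art": {"noun": 0.8, "verb": 0.05},
}
emit_p = {
    "prn": {PRN: 1.0},
    "art": {ART: 1.0},
    "noun": {NV: 0.5},
    "verb": {NV: 0.5},
}

# Ambiguity classes for a four-word sentence: pronoun, noun|verb, article, noun|verb
tags = viterbi([PRN, NV, ART, NV], start_p, trans_p, emit_p)
print(tags)
```

Restricting the candidates at each position to the tags in the observed ambiguity class keeps the search small: unambiguous words contribute a single state.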

Part-of-speech tagging /2

In MT, PoS tagging becomes crucial:

Translation may differ from one PoS tag to another:

    English   PoS    Spanish
    book      noun   libro
    book      verb   reservar

Structural transformations may be applied (or not) for some PoS tags:

    English           PoS          Spanish
    the green house   green-adj    la casa verde       ← reordering rule applied
    the green house   green-noun   * el césped casa

General-purpose HMM training methods

• Supervised (hand-tagged corpora available): maximum-likelihood estimate (MLE)
• Unsupervised (only untagged corpora available): Baum-Welch (expectation-maximization, EM)

Main features:
• Only use information from the language being tagged
• Independent of the natural language processing application
• To get high tagging accuracy, supervised resources (hand-tagged corpora) must be built
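To make the supervised case concrete, the following sketch estimates tag-transition probabilities by relative frequency from a hand-tagged corpus. The two-sentence corpus and the function name `mle_transitions` are invented for illustration; a real tagger would also estimate emission probabilities and apply smoothing.

```python
from collections import Counter, defaultdict

def mle_transitions(tagged_sentences):
    """Relative-frequency (MLE) estimate of P(tag | previous tag)."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        tags = [tag for _, tag in sentence]
        for prev, cur in zip(tags, tags[1:]):
            counts[prev][cur] += 1
    return {
        prev: {cur: n / sum(following.values()) for cur, n in following.items()}
        for prev, following in counts.items()
    }

# Hypothetical hand-tagged corpus: lists of (word, tag) pairs
corpus = [
    [("the", "art"), ("book", "noun")],
    [("the", "art"), ("green", "adj"), ("house", "noun")],
]
probs = mle_transitions(corpus)
print(probs["art"])  # distribution over the tags seen after an article
```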

MT-oriented HMM training method

• PoS tagging is just an intermediate task for the whole translation procedure
• Good translation performance, rather than PoS tagging accuracy, becomes the real objective

Idea: as the goal is to get good translations into the TL, let a TL model decide whether a given “construction” in the TL is good or not

MT-oriented HMM training method: overview /1

SL text → morph. analyzer → PoS tagger → struct. transfer (with lexical transfer) → morph. generator → post-generator → TL text

Unsupervised training

Resources required:
• an SL untagged text automatically obtained from an SL raw corpus
• the other modules of the MT system following the PoS tagger
• a TL model trained from a raw TL corpus

MT-oriented HMM training method: overview /2

Procedure:
1 SL corpus is segmented
2 All possible disambiguations of each segment are translated into TL
3 A TL model is used to score each translation
4 HMM parameters are computed according to the likelihood of the corresponding translations into TL

[Diagram: a segment s has disambiguation paths g_1, g_2, ..., g_m; the MT system translates each path into τ(g_i, s); the TL model scores each translation with P_TL(τ(g_i, s)); the scores yield the counts ñ(·).]

⇒ The resulting tagger is tuned to the translation fluency ⇐
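Step 2 of the procedure requires enumerating every disambiguation path of a segment. A minimal sketch, assuming a segment is represented as a list of (word, possible tags) pairs; this representation and the name `disambiguation_paths` are illustrative choices, not the thesis implementation.

```python
from itertools import product

def disambiguation_paths(segment):
    """Yield every way of assigning one tag to each word of the segment."""
    words = [word for word, _ in segment]
    for tags in product(*(possible for _, possible in segment)):
        yield list(zip(words, tags))

# Toy English segment with two ambiguous words
segment = [
    ("He", ["prn"]),
    ("rocks", ["noun", "verb"]),
    ("the", ["art"]),
    ("table", ["noun", "verb"]),
]
paths = list(disambiguation_paths(segment))
print(len(paths))  # number of paths = product of the ambiguity-class sizes
```

Because the number of paths is the product of the ambiguity-class sizes, it grows exponentially with the number of ambiguous words in a segment, and every path must be translated and scored.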

Example: English → Spanish

SL segment (English):
    He-prn rocks-noun|verb the-art table-noun|verb

Possible translations (Spanish) according to each disambiguation and their normalized likelihoods according to a TL model:

    • Él-prn mece-verb la-art mesa-noun         0.75
    • Él-prn mece-verb la-art presenta-verb     0.15
    • Él-prn rocas-noun la-art mesa-noun        0.06
    • Él-prn rocas-noun la-art presenta-verb  + 0.04
                                                1.00

The HMM parameters involved in these 4 disambiguations are updated according to their likelihoods in the TL
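One simple way to read this update is as fractional counting: each disambiguation contributes to the tag-transition counts in proportion to the normalized likelihood of its translation. The sketch below applies that idea to the four disambiguations above; it is a simplification under that assumption, and the exact estimation formulas used in the thesis are not shown on these slides.

```python
from collections import Counter

def fractional_counts(paths_with_scores):
    """Accumulate tag-transition counts weighted by normalized TL likelihood."""
    total = sum(score for _, score in paths_with_scores)
    transitions = Counter()
    for tags, score in paths_with_scores:
        weight = score / total  # normalize in case the scores do not sum to one
        for prev, cur in zip(tags, tags[1:]):
            transitions[(prev, cur)] += weight
    return transitions

# Tag sequences of the four disambiguations with their TL likelihoods
paths = [
    (("prn", "verb", "art", "noun"), 0.75),
    (("prn", "verb", "art", "verb"), 0.15),
    (("prn", "noun", "art", "noun"), 0.06),
    (("prn", "noun", "art", "verb"), 0.04),
]
counts = fractional_counts(paths)
for transition, value in sorted(counts.items()):
    print(transition, round(value, 2))
```

With these numbers the transition prn → verb collects about 0.90 of a count against 0.10 for prn → noun, so the TL model pushes the tagger towards reading "rocks" as a verb after a pronoun.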

Experiments /1

Task: training PoS taggers for Spanish, French and Occitan to be used in MT into Catalan

TL model: trigram language model trained from a Catalan corpus with ≈ 2 · 10^6 words

Experiments conducted with:
• 5 disjoint corpora with 0.5 · 10^6 words for Spanish
• 5 disjoint corpora with 0.5 · 10^6 words for French
• Only one corpus with 0.3 · 10^6 words for Occitan

Experiments /2

Reference results:
• Baum-Welch: expectation maximization on corpora of 10 · 10^6 words
• Supervised: MLE from a hand-tagged corpus of ≈ 21.5 · 10^3 words (only for Spanish)
• TLM-best: a TL model is used at translation time to always select the most likely translation; an approximate indication of the best results the MT-oriented method could achieve
