Obtaining SMT dictionaries for related languages Miguel Rios, Serge - PowerPoint PPT Presentation

Obtaining SMT dictionaries for related languages
Miguel Rios, Serge Sharoff
Centre for Translation Studies, University of Leeds
30 July 2015

Outline
1. Introduction
   - Motivation
2. Methodology
   - Cognate detection
   - Cognate ranking
3. Results
   - Data
   - Results ranking
   - Results comparable corpora
   - Results Machine Translation
4. Conclusions

Motivation
- Extracting cognates for related languages in the Romance and Slavonic language families
- Reducing the number of unknown words in SMT training data
- Learning regular differences in word roots/endings shared across related languages

Method
- Produce n-best lists of cognates from comparable corpora using a family of distance measures
- Prune the n-best lists with a ranking machine learning (ML) algorithm trained on parallel corpora
- Motivation: an n-best list allows for surface variation in the possible cognate translations

Similarity metrics
- Compare words between frequency lists built over comparable corpora
- Produce n-best lists
  - L: matching between the languages using Levenshtein distance, e.g. maladie → malattia
  - L-R: Levenshtein distance computed separately for the roots and for the endings, e.g. aceit o (pt) vs acept o (es); rejeit o (pt) vs rechaz o (es)
  - L-C: Levenshtein distance over words with a similar number of starting characters (i.e. prefix), e.g. introdu ção (pt) vs introdu cción (es); introdu ziu (pt) vs introdu jo (es)
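The three metrics can be illustrated with a minimal Python sketch. It is not the authors' implementation: the fixed-length ending split in `l_r_distance` stands in for whatever root/ending segmentation the slides rely on, and the names `ending_len`, `prefix_len` and `n_best` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (metric L) with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def split_word(word: str, ending_len: int):
    """Assumption: the ending is simply the last `ending_len` characters."""
    cut = max(0, len(word) - ending_len)
    return word[:cut], word[cut:]


def l_r_distance(src: str, tgt: str, ending_len: int = 1) -> int:
    """L-R style: distance over roots plus distance over endings."""
    src_root, src_end = split_word(src, ending_len)
    tgt_root, tgt_end = split_word(tgt, ending_len)
    return levenshtein(src_root, tgt_root) + levenshtein(src_end, tgt_end)


def l_c_distance(src: str, tgt: str, prefix_len: int = 3):
    """L-C style: only score pairs that share the first `prefix_len` characters."""
    if src[:prefix_len] != tgt[:prefix_len]:
        return None  # pair is outside the search space
    return levenshtein(src, tgt)


def n_best(src: str, targets, n: int = 10, distance=levenshtein):
    """Return the n closest target words under the chosen distance."""
    scored = [(distance(src, t), t) for t in targets]
    scored = [(d, t) for d, t in scored if d is not None]
    return [t for _, t in sorted(scored)[:n]]
```

For example, `n_best("maladie", fr_es_targets, n=10)` would return the ten closest entries of a hypothetical target frequency list `fr_es_targets`; passing `distance=l_c_distance` restricts the candidates to words with a shared prefix.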

Search space constraints
- Motivation: the exhaustive method compares all the combinations of source and target words
- Order the target-side frequency list into bins of similar frequency
- Compare each source word with the target bins of similar frequency within a window
- The L-C metric only compares words that share a prefix of a given length n (in characters)
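The frequency-window constraint could be sketched as follows. This is an illustration under assumptions: bins are formed by rank in the frequency list rather than by raw counts, and `bin_size` and `window` are hypothetical parameters, since the slides do not give the actual values.

```python
def make_bins(freq_list, bin_size):
    """Split a word list sorted by descending frequency into consecutive bins."""
    return [freq_list[i:i + bin_size] for i in range(0, len(freq_list), bin_size)]


def candidates(src_rank, target_bins, bin_size, window=1):
    """Target words from the bin matching the source rank, plus `window` bins on each side."""
    centre = min(src_rank // bin_size, len(target_bins) - 1)
    lo = max(0, centre - window)
    hi = min(len(target_bins), centre + window + 1)
    return [word for bin_ in target_bins[lo:hi] for word in bin_]
```

With bins of two words and a window of one, a source word at rank 5 is compared only against the target words at ranks 2 to 7 instead of the whole list.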

Ranking
- Motivation: prune the n-best lists with a ranking ML algorithm
- Training data come from aligned parallel corpora, where the rank is given by the alignment probability from GIZA++
- Simulate cognate training data by pruning pairs of words below a Levenshtein threshold
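One way to read the last step is sketched below. It assumes the word alignments are available as `(source, target, probability)` triples, that the threshold applies to a length-normalised Levenshtein similarity, and that 0.5 is a reasonable cut-off; none of these details are stated in the slides.

```python
from collections import defaultdict


def levenshtein(a, b):
    """Unit-cost Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed normalisation: 1 means identical strings, 0 means nothing in common."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def simulate_cognate_training(lexicon, threshold=0.5):
    """Keep cognate-like pairs and rank each source word's targets by alignment probability."""
    kept = defaultdict(list)
    for src, tgt, prob in lexicon:
        if similarity(src, tgt) >= threshold:
            kept[src].append((prob, tgt))
    return {src: [tgt for _, tgt in sorted(pairs, reverse=True)]
            for src, pairs in kept.items()}
```

The resulting ranked lists play the role of gold rankings for the learner: translations that are not cognate-like are dropped, and the remaining ones are ordered by how strongly the aligner supports them.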

Features
- Similarity metric L
- Number of times each edit operation occurs; the model assigns a different weight to each operation
- Cosine between the distributional vectors of the source and target words; the word2vec vectors are mapped to the same space via a learned transformation matrix
- SVM ranking with the default configuration (RBF kernel)
- Easy-adapt features for the different domains (Wikipedia, subtitles)
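A rough sketch of how such features might be computed is given below, in plain Python. It assumes that "Easy-adapt" refers to the feature augmentation scheme commonly known by that name (a shared copy of the features plus one copy per domain), and that the transformation matrix has already been learned; the function names and the toy vectors are hypothetical.

```python
import math


def edit_operation_counts(a, b):
    """Count insertions, deletions and substitutions along one minimal edit path."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1, dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + cost)
    counts = {"insert": 0, "delete": 0, "substitute": 0}
    i, j = len(a), len(b)
    while i > 0 or j > 0:  # backtrace from the bottom-right cell
        cost = 1 if i > 0 and j > 0 and a[i - 1] != b[j - 1] else 0
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + cost:
            counts["substitute"] += cost
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["delete"] += 1
            i -= 1
        else:
            counts["insert"] += 1
            j -= 1
    return counts


def map_vector(matrix, vec):
    """Apply a (pre-learned) linear map to a source-language vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def easy_adapt(features, domain, domains=("wikipedia", "subtitles")):
    """Shared copy of the features, then one copy per domain (zeros for other domains)."""
    out = list(features)
    for d in domains:
        out.extend(features if d == domain else [0.0] * len(features))
    return out
```

A feature vector for a candidate pair would then concatenate the L distance, the three operation counts and the cosine, and pass the result through `easy_adapt` before it reaches the ranker.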

Data description
- n-best lists: Wikipedia dumps (frequency lists)
- ML training
  - Wiki-titles: parallel data from the inter-language links between the titles of Wikipedia articles, 500K aligned links (i.e. 'sentences')
  - Opensubs: 90K training instances
  - Zoo: proprietary corpus of subtitles produced by professional translators, 20K training instances
- Ranking test: heldout data from training
- Manual cognate test: Wikipedia most frequent words
- SMT test: Zoo data

Language pairs
- Romance: source Portuguese, French, Italian; target Spanish
- Slavonic: source Ukrainian, Bulgarian; target Russian

Results on heldout data
Error score (%) on heldout data. Model E: edit distance features. Model EC: edit distance plus distributed vector features.

Family    Pair    Zoo E   Zoo EC   Opensubs E   Opensubs EC   Wiki-titles E   Wiki-titles EC
Romance   pt-es   53.31   53.72    54.81        48.31         12.22           9.87
Romance   it-es   56.00   42.86    63.95        63.03         8.44            11.23
Romance   fr-es   59.05   53.00    43.00        41.19         10.75           10.09
Slavonic  uk-ru   47.90   40.84    37.06        30.19         10.71           10.72
Slavonic  bg-ru   54.17   43.98    49.12        57.89         18.72           17.13

Manual evaluation
- Results on a sample of 100 words
- Accuracy at 1 and at 10 for the n-best lists L, L-R and L-C, with ranking model E

Pair    L acc@1   L acc@10   L-R acc@1   L-R acc@10   L-C acc@1   L-C acc@10
pt-es   20        60         22          59           32          70
it-es   16        53         18          45           44          66
fr-es   10        48         12          51           29          59
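Accuracy at k can be computed as in the short sketch below. It assumes a single gold translation per source word and n-best lists already sorted best-first; the slides do not say how the manual judgements were recorded, so the data layout here is hypothetical.

```python
def accuracy_at_k(n_best_lists, gold, k):
    """Percentage of source words whose gold translation is in the top k candidates."""
    if not n_best_lists:
        return 0.0
    hits = sum(1 for word, cands in n_best_lists.items()
               if word in gold and gold[word] in cands[:k])
    return 100.0 * hits / len(n_best_lists)
```

On a sample of 100 words, the returned value equals the number of words with a correct candidate among the first k, which matches the integer scores in the table above.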

Addition of lists to SMT
- Moses phrase-based SMT
- 1-best lists with L-C and E ranking
- pt-es: 80K training sentences, 100K cognate pairs; BLEU score baseline 20.68, augmented 20.86 (+0.18, not significant)
- uk-ru: 140K training sentences, 100K cognate pairs; BLEU score baseline 28.72, augmented 29.56 (+0.93, not significant)

Out-of-vocabulary reduction
- pt-es (OOV): 623 types (21.1%) to 337 types (11.4%)
- uk-ru (OOV): 756 types (21.6%) to 545 types (15.6%)
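The OOV figures are rates over word types. A minimal sketch of that computation follows, assuming the rate is the share of test-set types absent from the known vocabulary (training data, optionally extended with the cognate dictionary); the variable names are hypothetical.

```python
def oov_types(test_tokens, known_vocab):
    """Return the set of unseen test types and their percentage of all test types."""
    test_types = set(test_tokens)
    unseen = test_types - set(known_vocab)
    rate = 100.0 * len(unseen) / len(test_types) if test_types else 0.0
    return unseen, rate
```

Calling it once with the training vocabulary and once with the training vocabulary plus the source side of the cognate list gives the before and after figures. The reported pt-es numbers are consistent with a test set of roughly 2,950 types (623 / 0.211), for which 337 unseen types corresponds to about 11.4%.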

Conclusions
- MT dictionaries extracted from comparable resources for related languages
- Positive results on the n-best lists with L-C
- The frequency window heuristic shows poor results
- ML models are able to rank similar words at the top of the list
- Preliminary results on an SMT system show modest improvements compared to the baseline
- The OOV rate shows improvements, with around a 10% reduction on word types

Future work
- Morphology features for the n-best list (unsupervised), instead of the prefix heuristic (L-C) and the stemmer (L-R)
- Contribution of all the produced cognate lists to SMT
- Using a char-based transliteration model trained on Zoo plus n-best lists
- Motivation: alignment learns useful transformations, e.g. introdu ção (pt) vs introdu cción (es)
