Czechizator – Čechizátor

  1. Rudolf Rosa rosa@ufal.mff.cuni.cz Czechizator – Čechizátor Charles University Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics SloNLP, Tatranské Matliare, 18 September 2016

  2. Czechizator  lexicon-less “translation” from English to Czech Rudolf Rosa: Czechizator - Čechizátor 2/32

  4. Czechizator
     - lexicon-less “translation” from English to Czech
     - usual approach: use a bilingual lexicon
       [Diagram: Czech-English texts, training, statistical translation model, translation system; input “presentation”, output “prezentace”]
     - Czechizator approach: use a set of rules instead
       - rules: -ise → -iza, -tion → -ce, ...
       [Diagram: rules, translation system; input “presentation”, output “presentace”]
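The rule-based idea on this slide can be sketched as a first-match ending rewrite. This is only an illustration under stated assumptions, not the actual Czechizator code: it contains just the two rules shown on the slide, and the name `apply_ending_rules` is made up here.

```python
# Only the two rules shown on the slide; the real rule set is larger.
ENDING_RULES = [
    ("tion", "ce"),   # -tion -> -ce
    ("ise", "iza"),   # -ise  -> -iza
]


def apply_ending_rules(word: str) -> str:
    """Rewrite the first matching ending; leave the word alone otherwise."""
    for ending, replacement in ENDING_RULES:
        if word.endswith(ending):
            return word[: -len(ending)] + replacement
    return word


print(apply_ending_rules("presentation"))  # presentace
```

With these two rules alone, "presentation" comes out as "presentace", the form shown as the output of the rule-based diagram.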

  10. Example: Czechizating ITAT titles
      - Statistical modelling in climate science
        Statistické modelování v klimat scienci
      - 12 years of Unsupervised Dependency Parsing
        12 jírů nesupervizované parsování dependence
      - Multivariable Approximation by Convolutional Kernel Networks
        Multivariabilní aproximace Konvolucional Kernel netvorksu

  11. Implementation
      - lexical translation: a set of Czechization rules
        - 43 ending-based transformation rules (see later)
        - 33 transliteration rules: th → t, ti → ci, ck → k, ph → f, sh → š, igh → aj, dg → dž, w → v, c → k…
        - 36 hard-coded translations of semi-auxiliaries: be, have, do, and, or, all, this, many, only, main…
      - grammar and function words: TectoMT
        - English-Czech machine translation system
        - Czechizator implemented as a TectoMT lexical translation model
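A minimal sketch of how the listed transliteration rules might be applied, assuming a single left-to-right pass that prefers the longest matching pattern. The names `TRANSLIT` and `transliterate` are hypothetical, and only the nine rules named on the slide are included. The real rules are evidently more selective: the deck shows "science" becoming "scienci", whereas this sketch would turn every "c" into "k".

```python
import re

# Nine of the transliteration rules listed on the slide.
TRANSLIT = {
    "igh": "aj", "th": "t", "ti": "ci", "ck": "k", "ph": "f",
    "sh": "š", "dg": "dž", "w": "v", "c": "k",
}

# Longest patterns first so that "ck" wins over plain "c".
_PATTERN = re.compile(
    "|".join(re.escape(src) for src in sorted(TRANSLIT, key=len, reverse=True))
)


def transliterate(word: str) -> str:
    """Single left-to-right pass; replaced text is never rewritten again."""
    return _PATTERN.sub(lambda m: TRANSLIT[m.group(0)], word.lower())


for w in ("network", "insight", "phantom"):
    print(w, "->", transliterate(w))
```

Doing the replacement in one pass avoids rule interference, for example the "c" produced by ti → ci being rewritten again by c → k.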

  21. Implementation
      - input: I preferred the presentation of David.
      - TectoMT analysis
        - prefer: verb, 1st person, past
        - presentation: noun, definite, object
        - David: noun+of, named ent.
      - transfer
        - Czechization of lemmas: preferovat, prezentace, David
        - TectoMT transfer of attributes
          - preferovat: verb, 1st person, past
          - prezentace: noun, accusative
          - David: noun, genitive, n.e.
      - TectoMT synthesis
      - output: Preferoval jsem prezentaci Davida.
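The division of labour in this pipeline can be sketched as follows: Czechization replaces only the lemma of each node, and the grammatical attributes stay untouched for the other stages. This is a toy model, not TectoMT's API. The `Node` class is hypothetical, the lemma table is a stand-in filled with the three pairs from the slide, and attribute transfer and synthesis are not modelled at all.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Node:
    lemma: str
    attrs: tuple  # grammatical attributes, copied from the slide


# Stand-in for the rule-based Czechizer: lemma pairs copied from the slide.
LEMMA_STUB = {"prefer": "preferovat", "presentation": "prezentace", "David": "David"}


def czechize_lemmas(nodes):
    """Swap only the lemma; attributes are left for the other stages."""
    return [replace(n, lemma=LEMMA_STUB.get(n.lemma, n.lemma)) for n in nodes]


analysed = [
    Node("prefer", ("verb", "1st person", "past")),
    Node("presentation", ("noun", "definite", "object")),
    Node("David", ("noun+of", "named ent.")),
]
for node in czechize_lemmas(analysed):
    print(node.lemma, node.attrs)
```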

  22. Transformation rules for adjectives
      - partial → parciální
      - native → nativní
      - stable → stabilní
      - regular → regulární
      - tolerant → tolerantní
      - fatal → fatální
      - tolerated → tolerovaný
      - nervous → nervózní
      - turkic → turkický
      - parsed → parsovaný
      - practical → praktický
      - parsing → parsující
      - park → parkový
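The adjective examples above suggest a table of endings with a default for words that match none of them. The sketch below infers such a table from the example pairs only. It is an assumption, not the actual set of 43 rules, and the names `ADJ_ENDINGS`, `DEFAULT_ENDING` and `czechize_adjective` are made up for illustration.

```python
# Endings inferred from the example pairs on the slide (an assumption,
# not the actual rule set). Longer endings are listed before shorter ones.
ADJ_ENDINGS = [
    ("ated", "ovaný"), ("able", "abilní"), ("ical", "ický"),
    ("ous", "ózní"), ("ive", "ivní"), ("ing", "ující"), ("ant", "antní"),
    ("ed", "ovaný"), ("ar", "ární"), ("al", "ální"), ("ic", "ický"),
]
DEFAULT_ENDING = "ový"  # fallback suggested by park -> parkový


def czechize_adjective(word: str) -> str:
    """Rewrite the first matching ending, else append the default ending."""
    for ending, replacement in ADJ_ENDINGS:
        if word.endswith(ending):
            return word[: -len(ending)] + replacement
    return word + DEFAULT_ENDING


for w in ("native", "stable", "tolerated", "nervous", "park"):
    print(w, "->", czechize_adjective(w))
```

Ending rules alone do not reproduce every pair: "partial" and "practical" come out as "partiální" and "practický" here, so the forms on the slide also rely on transliteration (ti → ci, c → k).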

  24. What is it good for?
      - translations sometimes “reasonable”
        - scientific titles and abstracts, marketing texts:
          - Accenture Operations combines technology that digitizes and automates business processes, unlocks actionable insights, and delivers everything-as-a-service with our team's deep industry, functional and technical expertise.
          - Operacions acenturu kombinuje technologii, která digitizuje a automuje procesy businosti, unlokuje akcionabilní insajty a deliveruje everyting-as-a-servicová s funkcionální a technickou expertizou dípové industrie našeho tímu.

  26. What is it good for?
      - translations sometimes “reasonable”
        - scientific titles and abstracts, marketing texts
      - still, only a proof of concept & a fun application
        - not really useful as a standalone tool
        - maybe as a starting point for later post-editing
      - potential: combine with TectoMT lexical models
        - frequent words: translation model trained from data
        - infrequent words: insufficient training data, Czechize!
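The proposed combination amounts to a frequency-based fallback, which might look like the sketch below. Everything here is hypothetical: the function name, the `min_count` threshold, the toy lexicon and counts, and the one-line stand-in Czechizer are all made up for illustration and say nothing about how TectoMT's lexical models are actually queried.

```python
from typing import Callable, Mapping


def translate_lemma(
    lemma: str,
    lexicon: Mapping[str, str],      # hypothetical learned translations
    counts: Mapping[str, int],       # hypothetical training-data frequencies
    czechize: Callable[[str], str],  # any rule-based Czechizer
    min_count: int = 5,              # arbitrary threshold, an assumption
) -> str:
    """Trust the lexicon for frequent words, fall back to rules otherwise."""
    if lemma in lexicon and counts.get(lemma, 0) >= min_count:
        return lexicon[lemma]
    return czechize(lemma)


# Made-up data: one frequent word and a trivial stand-in Czechizer.
lexicon = {"presentation": "prezentace"}
counts = {"presentation": 120}


def toy_czechize(word: str) -> str:
    return word.replace("ph", "f")


print(translate_lemma("presentation", lexicon, counts, toy_czechize))
print(translate_lemma("anaphora", lexicon, counts, toy_czechize))
```

The first call is answered from the lexicon, while the unseen word "anaphora" falls through to the rule-based path.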

  28. Complementing TectoMT
      - rare/unseen words not well handled by TectoMT
        - unreliable translation for rare words, none for unseen
        - e.g. scientific terms
          - large number and growing, rare in data
          - often rather regular translations → can be Czechized
          - anaphora → anafora, hypotactical → hypotaktický, circumfixal → cirkumfixální
      - current issues: named entities get Czechized
        - usually should be avoided, but detection insufficient

  29. Conclusion
      - lexicon-less lexical “translation” module
        - transformation (endings) and transliteration rules
        - grammar and aux words handled by TectoMT
        - Czechization of lemmas on t-layer
      - Czechization of scientific titles sometimes “good”
        - but still not really useful
      - work in progress: integrate into TectoMT
        - complement existing lexical models
        - Czechize rare and unseen words, e.g. science terms
