SLIDE 1

Open Source Tools for Statistical Machine Translation

Philipp Koehn, University of Edinburgh 28 February 2008

Philipp Koehn, U of Edinburgh Open Source SMT 28 February 2008

SLIDE 2

Research Process

new ideas → prototype → experiments → research paper → dissemination → rebuild prototype → new ideas

  • SMT is increasingly a big systems field
  • building prototypes requires huge efforts

SLIDE 4

Requirements for Building MT Systems

  • Data resources
      – parallel corpora (translated texts)
      – monolingual corpora, especially for output language
  • Support tools
      – basic corpus preparation: tokenization, sentence alignment
      – linguistic tools: taggers, parsers, morphology, semantic processing
  • MT tools
      – word alignment, training
      – decoding (translation engine)
      – tuning (optimization)
      – re-ranking, incl. posterior methods

SLIDE 5

Who will do MT Research?

  • If MT research requires the development of many resources
      – who will be able to do relevant research?
      – who will be able to deploy the technology?
  • A few big labs?
  • ... or a broad network of academic and commercial institutions?

SLIDE 6

MT is diverse

  • Many different stakeholders
      – academic researchers
      – commercial developers
      – multi-lingual or trans-lingual content providers
      – end users of online translation services
      – human translation service providers
  • Many different language pairs
      – few languages with rich resources: English, Spanish, German, Chinese, ...
      – many second tier languages: Czech, Danish, Greek, ...
      – many under-resourced languages: Gaelic, Basque, ...

SLIDE 7

Open Research

new ideas → prototype → experiments → research paper → dissemination → re-use prototype → new ideas

  • SMT is increasingly a big systems field
  • building prototypes requires huge efforts
  • sharing of resources reduces duplication of efforts

SLIDE 8

Making Open Research Work

  • Non-restrictive licensing
  • Active development
      – working high-quality prototype
      – ongoing development
      – open to contributions
  • Support and dissemination
      – support by email, web sites, documentation
      – offering tutorials and courses

SLIDE 9

EuroMatrix: Open Research

  • Open source statistical MT system
  • Open source rule-based system
  • Parallel corpora
  • Dissemination activities
      – MT Marathon
      – Evaluation campaign and workshops
      – Online platform

SLIDE 10

Moses: Open Source SMT

  • Open source statistical machine translation system
      – state-of-the-art phrase-based approach
      – full SMT system: training, tuning, decoding
      – incorporates research on factored translation models
  • Additional features
      – confusion network decoding
      – support for very large models through memory-efficient data structures
      – multiple language models, translation tables for domain adaptation
      – minimum Bayes risk decoding
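
To make the last feature concrete, here is a minimal sketch of the general idea behind minimum Bayes risk decoding over an n-best list. It is not the Moses implementation: the similarity function (a unigram F1 standing in for a BLEU-style gain), the function names, and the example hypotheses and scores are all assumptions made for illustration.

```python
from collections import Counter


def unigram_f1(hyp: str, ref: str) -> float:
    """Similarity of two sentences: F1 over their bag-of-words overlap.
    (Illustrative stand-in for a sentence-level BLEU-style gain.)"""
    hyp_counts, ref_counts = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(nbest):
    """Pick the hypothesis with the highest expected similarity to all
    hypotheses in the list, weighted by their normalized model scores."""
    total = sum(score for _, score in nbest)

    def expected_gain(candidate):
        return sum(score / total * unigram_f1(candidate, other)
                   for other, score in nbest)

    return max((hyp for hyp, _ in nbest), key=expected_gain)


# Hypothetical n-best list: (translation, model probability).
nbest = [
    ("a tiny home", 0.40),
    ("the house is small", 0.35),
    ("the house is little", 0.25),
]
best = mbr_select(nbest)
```

In this toy list the single most probable hypothesis is an outlier, so the sketch selects "the house is small", which agrees most with the rest of the probability mass.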

SLIDE 11

Collaboration Beyond EuroMatrix

  • Active development centered at U Edinburgh, but also
      – Charles University
      – ITC-irst, Italy
      – University of Maryland, USA
  • Development also supported by
      – EC-funded TC-STAR project
      – Johns Hopkins Summer Workshop 2006
      – US funding agencies DARPA, NSF

SLIDE 12

Web Site

  • URL: http://www.statmt.org/moses/
  • Download
      – compiled binaries for Unix and Windows
      – current source code from SVN repository
  • Documentation
      – introduction to statistical MT methods
      – step-by-step tutorial on training, decoding, factored models
      – step-by-step instructions on how to build a baseline system
      – descriptions of all features
      – automatically generated code documentation
      – mailing lists for users and developers

SLIDE 13

Widely Used

  • Web site gets 3000 visits per month
  • Mailing list distributes 100 emails per month
  • Academic uses
      – de-facto benchmark for new MT methods
      – starting point for most new research groups
      – half of IWSLT submissions used Moses
  • Commercial uses
      – explored by many machine translation developers (incl. Systran)
      – systems built for second tier languages (e.g. Swedish, Danish)

SLIDE 14

Online Demos

  • English to Czech
      – provided by Charles University
      – hosted at https://blackbird.ms.mff.cuni.cz/cgi-bin/bojar/mt cgi.pl
  • German, Spanish, French to English and back
      – provided by Edinburgh University
      – hosted at http://demo.statmt.org/webtrans/
  • Outside parties have also created demos
      – Finnish to English, Swedish and back
      – English to Russian

SLIDE 15

Online Demos

SLIDE 16

MT Marathon

  • First MT Marathon: April 2007, Edinburgh
      – one-week intense class with hands-on experience
      – research showcase with talks from leading researchers
  • Second MT Marathon: 12-20 May 2008, Berlin
      – one-week intense class with hands-on experience
      – research showcase with talks from leading researchers
      – open source convention
      – evaluation workshop
      – Translingual Europe conference

SLIDE 17

The Matrix

  • http://matrix.statmt.org/
  • Listing of available resources
      – parallel and monolingual corpora
      – tools and systems
      – can be augmented and edited by users
  • Online evaluation campaign
      – developers can upload their translations of standard test sets
      – reference performance for all language pairs of official EU languages
  • Note: currently functional, but some features are still being worked on

SLIDE 18

Parallel Data: the Bottleneck?

  • More data, better performance with statistical systems

[Figure: performance against training data size (x-axis 10k to 320k, y-axis 0.15 to 0.30) for Swedish, Finnish, German, French; from Koehn, 2003: Europarl]

  • Where do we get more translated texts from?

SLIDE 19

Parallel Corpora

  • Europarl: proceedings of the European Parliament
      – Release of v3 in September 2007
      – 30-40 million words per language, all 11 official languages of EU-15
  • News Commentary: from http://www.project-syndicate.com/
      – used in ACL WMT 2007 Shared Task
      – 1-2 million words in English, French, Spanish, German, Czech, Arabic, ...
  • Other corpus projects
      – Acquis Communautaire: includes all 23 languages of EU-25 (JRC)
      – CzEng corpus built by Charles University
      – Hungarian-English corpus extended by Morphologic
      – more data from European Union / European Commission

→ good translation quality possible with this data

SLIDE 20

Data from Commercial Sources?

  • All large corpora are from governments, international institutions
  • Commercial sources are hard to come by
      – ownership between original author, translator
      – intellectual property rights and privacy concerns
      – data is seen as competitive advantage
  • What could be done:
      – randomizing the order of sentences
      – anonymizing named entities
  • User generated data?
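
The two measures listed under "What could be done" can be sketched roughly as follows. This is an illustration only: the function names, the placeholder token, the example sentences, and the assumption that the entities to mask are already known (here a hand-written set rather than the output of a named-entity recognizer) are all hypothetical.

```python
import random
import re


def shuffle_pairs(pairs, seed=0):
    """Shuffle aligned (source, target) pairs together, so each pair stays
    aligned while the original document order is lost."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def mask_entities(sentence, entities, placeholder="__ENTITY__"):
    """Replace every listed entity string with a placeholder token."""
    if not entities:
        return sentence
    # Longest names first, so a longer name wins over a name it contains.
    ordered = sorted(entities, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in ordered))
    return pattern.sub(placeholder, sentence)


# Hypothetical parallel corpus and entity list.
corpus = [
    ("Acme Corp signed the contract .", "Acme Corp unterzeichnete den Vertrag ."),
    ("The delivery is due in May .", "Die Lieferung ist im Mai fällig ."),
    ("Please contact Jane Doe .", "Bitte kontaktieren Sie Jane Doe ."),
]
entities = {"Acme Corp", "Jane Doe"}

released = [
    (mask_entities(src, entities), mask_entities(tgt, entities))
    for src, tgt in shuffle_pairs(corpus, seed=42)
]
```

Shuffling at the level of sentence pairs keeps the data usable for training sentence-level translation models while making it harder to reconstruct the original documents.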

SLIDE 21

Open $ource

Diagram labels: Academic Research, Companies

SLIDE 22

Open $ource

Diagram labels: Academic Research, Open Source, Companies

SLIDE 23

Open $ource

Diagram labels: Academic Research, Open Source, Users, Companies

SLIDE 25

The Tipping Point for MT

Machine Translation Quality Money

SLIDE 27

Thank you

Questions?
