
Overview of Morpho Challenge task at CLEF 2009
Mikko Kurimo, Sami Virpioja, Ville Turunen
Helsinki University of Technology (TKK), Department of Information and Computer Science, Adaptive Informatics Research Centre

Goals of the project
• Design statistical machine learning algorithms that discover which morphemes words consist of
• Find morphemes that are useful as vocabulary units for statistical language modeling in speech recognition, machine translation and information retrieval
• Discover approaches suitable for a wide range of languages and tasks

Morpho Challenge summary
• Part of the EU Network of Excellence PASCAL
• Organized in collaboration with CLEF
• Participation is open to all and free of charge
• Data provided in Finnish, English, German, Turkish and Arabic
• Task: Implement an unsupervised algorithm that discovers morpheme analysis of words in each language!
• Results: Evaluations in IR and SMT
• Workshop: Corfu, Greece, September 30, 2009

The vocabulary problem
• Speech recognition (ASR), information retrieval (IR) and statistical machine translation (SMT) require a large vocabulary
• Morphologically rich languages suffer from a severe vocabulary explosion
• More efficient representation units are needed

Agglutinative morphology
• Finnish words typically consist of lengthy sequences of morphemes (stems, suffixes and sometimes prefixes):
– kahvi + n + juo + ja + lle + kin (coffee + of + drink + -er + for + also = 'also for [the] coffee drinker')
– nyky + ratkaisu + i + sta + mme (current + solution + -s + from + our = 'from our current solutions')
– tietä + isi + mme + kö + hän (know + would + we + INTERR + indeed = 'would we really know?')
– tietä + vä + mmä + lle (know + -ing + COMP + for = 'for the more knowing' = 'for the one who knows more')

Morfessor algorithm at TKK 2002
● Automatic segmentation of words into morphemes
● A fully data-driven unsupervised machine learning algorithm
● Discovers a compact representation of the input text corpus
● MAP (maximum a posteriori) optimization where the result resembles linguistic morphemes: left + hand + ed, hand + ful
● Language independent, no morphological rules or annotated data needed
● Toolkit available at http://www.cis.hut.fi/projects/morpho/ [PhD thesis of M. Creutz (2006)]

Morpho Challenge since 2005
• Evaluation languages:
– 2005: Finnish, Turkish, English
– 2007: + German
– 2008-2009: + Arabic
• Evaluation tasks:
– 2005: linguistic & speech recognition (ASR)
– 2007-2008: linguistic & information retrieval (IR)
– 2009: + machine translation (SMT)

History of Morpho Challenge
• Participating groups:
– 2005: 6 (+ 5 student groups)
– 2007: 6
– 2008: 6
– 2009: 10
• Type of submission:
– 2005: words split into smaller units
– 2007-2009: full morpheme analysis of words

Plan of 2009 Challenge
• The participants submit their morpheme analyses
• The organizers evaluate them in various ways:
1. Comparison to a linguistic morpheme "gold standard"
2. Information retrieval experiments, where the indexing is based on morphemes instead of entire words
3. Machine translation experiments, where the translation is based on morphemes



Information Retrieval evaluation 2009
• English, German and Finnish tasks
• Words in the documents and queries were replaced by the suggested segmentations
• If no segmentation was provided, the word was left unsegmented
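The replacement step described above can be sketched in a few lines of Python. This is a minimal illustration, not the organizers' actual code: the function name `segment_text`, the dictionary format of the analyses, and the lowercasing and whitespace tokenization are all assumptions made for the example.

```python
def segment_text(text: str, analyses: dict[str, list[str]]) -> list[str]:
    """Replace each word by its submitted segmentation (hypothetical format).

    `analyses` maps a lowercased word to its list of morphemes.
    Words without an entry are kept whole, as the slide describes.
    """
    terms: list[str] = []
    for word in text.lower().split():
        # Fall back to the unsegmented word when no analysis was provided.
        terms.extend(analyses.get(word, [word]))
    return terms


# Illustrative analyses only; a real submission covers the whole word list.
analyses = {
    "französische": ["französisch", "+e"],
    "atomtests": ["atom", "test", "+s"],
}

print(segment_text("Französische Atomtests", analyses))
print(segment_text("Ein Heim", analyses))  # no analyses given: words stay whole
```

The same function would be applied to both queries and documents, so that index terms and query terms come from the same segmentation.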

Example
• Query: Französische Atomtests
• Doc 1: Ein zweiter französischer Atomtest fand mit 15-20 kt Sprengkraft...
• Doc 2: Heim ist nicht automatisch ein gutes Heim...

Example: Method A
• Query: französisch +e atom test +s
• Doc 1: ein zwei +t +er französisch +er atom test fand mit 15-20 kt spreng kraft...
• Doc 2: h eim ist nicht automat isch ein gut +es heim...

Example: Method B
• Query: fran zö sische a tom tes ts
• Doc 1: ein z weiter fran zö sischer a tom test fand mit 15–20 kt spr eng kraf t...
• Doc 2: h eim ist nicht au tom a tisch ein gu tes heim...
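To make the contrast between the two example segmentations concrete, the sketch below counts which query morphemes also occur in each document. It is only an illustration: term overlap stands in for a real ranking function, the helper name `shared_morphemes` is made up, and Doc 2 is read as "Heim ist nicht automatisch ein gutes Heim...", with the segmented strings copied from the slides minus the trailing ellipses.

```python
def shared_morphemes(query: str, doc: str) -> set[str]:
    """Return the set of whitespace-separated units present in both strings."""
    return set(query.split()) & set(doc.split())


method_a = {
    "query": "französisch +e atom test +s",
    "doc1": "ein zwei +t +er französisch +er atom test fand mit 15-20 kt spreng kraft",
    "doc2": "h eim ist nicht automat isch ein gut +es heim",
}
method_b = {
    "query": "fran zö sische a tom tes ts",
    "doc1": "ein z weiter fran zö sischer a tom test fand mit 15-20 kt spr eng kraf t",
    "doc2": "h eim ist nicht au tom a tisch ein gu tes heim",
}

for name, seg in (("A", method_a), ("B", method_b)):
    for doc in ("doc1", "doc2"):
        overlap = shared_morphemes(seg["query"], seg[doc])
        print(f"Method {name}, {doc}: {sorted(overlap)}")
```

With Method A the query shares units only with Doc 1, the document about the nuclear test. With Method B the unrelated Doc 2 also shares units such as "tom" and "tes" with the query, which shows how a poor segmentation can create spurious matches.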

Setup
• Lemur toolkit: http://www.lemurproject.org/
• Okapi BM25 ranking
• Stoplist for the most common morphemes
– a fixed threshold for corpus frequency
• Evaluation metric is Mean Average Precision (MAP)
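For readers unfamiliar with the metric, here is a minimal sketch of Mean Average Precision under the usual definition with binary relevance judgments. The function names, the dictionary layout of rankings and judgments, and the toy document identifiers are assumptions for illustration; the actual evaluation was run with the Lemur toolkit.

```python
def average_precision(ranked: list[str], relevant: set[str]) -> float:
    """Average of the precision values at the rank of each relevant document."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(
    rankings: dict[str, list[str]], judgments: dict[str, set[str]]
) -> float:
    """Mean of the per-query average precision over all judged queries."""
    scores = [
        average_precision(rankings.get(query, []), relevant)
        for query, relevant in judgments.items()
    ]
    return sum(scores) / len(scores)


# Toy data: two queries with hypothetical document identifiers.
rankings = {"q1": ["d1", "d2", "d3"], "q2": ["d5", "d4"]}
judgments = {"q1": {"d1", "d3"}, "q2": {"d4"}}
print(mean_average_precision(rankings, judgments))
```

For the first query the relevant documents sit at ranks 1 and 3, giving (1/1 + 2/3) / 2; for the second the single relevant document is at rank 2, giving 1/2. The MAP is the mean of those two values.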

IR data sets (same as in 2007-2008)
• Finnish (CLEF 2004)
– 55K documents from articles in Aamulehti 1994-95
– 50 test queries, 23K binary relevance assessments
• English (CLEF 2005)
– 107K documents from articles in Los Angeles Times 1994 and Glasgow Herald 1995
– 50 test queries, 20K binary relevance assessments
• German (CLEF 2003)
– 300K documents from short articles in Frankfurter Rundschau 1994, Der Spiegel 1994-95 and SDA German 1994-95
– 60 test queries, 23K binary relevance assessments

Reference methods
• Morfessor Baseline: our public code since 2002
• Morfessor Categories-MAP: improved, public 2006
• dummy: no segmentation, all words unsplit
• grammatical: full gold standard segmentation
– all: all alternative segmentations included
– first: only the first alternative chosen
• TWOL: word normalization by a commercial rule-based morphological analyzer (all & first)
• Snowball: language-specific stemming
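The difference between the "all" and "first" variants can be illustrated with a short sketch. The data layout (a list of alternative analyses per word, each a list of morphemes), the function name `select_analyses`, and the example word are assumptions for illustration, not the format used in the challenge files.

```python
def select_analyses(alternatives: list[list[str]], mode: str) -> list[str]:
    """Pick index terms for one word from its alternative analyses.

    mode "first": use only the first alternative.
    mode "all":   concatenate the morphemes of every alternative.
    """
    if mode == "first":
        return list(alternatives[0])
    if mode == "all":
        return [morpheme for analysis in alternatives for morpheme in analysis]
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical word with two alternative analyses.
alternatives = [["hand", "+s"], ["hands"]]
print(select_analyses(alternatives, "first"))
print(select_analyses(alternatives, "all"))

# The "dummy" reference corresponds to a single analysis holding the whole word.
print(select_analyses([["hands"]], "first"))
```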

English results
[Figure: bar chart of results in the English IR task, axis from 0.2 to 0.4. Annotation on the chart: "No significant difference to the best above this line".]
• Submitted methods, in the order listed on the slide: [Lignos et al.]*, [Virpioja & Kohonen] Allomorfessor, [Monson et al.] ParaMor Mimic, [Monson et al.] ParaMor-Morfessor Union, [Lavallée & Langlais] RALI-ANA*, [Monson et al.] ParaMor-Morfessor Mimic, [Tchoukalov et al.] MetaMorph*, [Lavallée & Langlais] RALI-COF*, [Bernhard] MorphoNet, [Golénia et al.] UNGRADE*, [Can & Manandhar]*, [Spiegler et al.] PROMODES*, [Spiegler et al.] PROMODES 2*, [Spiegler et al.] PROMODES committee*
• Reference methods and Best2008, in the order listed on the slide: snowball porter, Best2008 (Monson Paramor+Morfessor), TWOL first, TWOL all, Morfessor Baseline, grammatical first, Morfessor CatMAP, grammatical all, dummy
