[PPT] - Machine Translation and Type Theory Aarne Ranta Types 2010, Warsaw PowerPoint Presentation

SLIDE 1

Machine Translation and Type Theory

Aarne Ranta Types 2010, Warsaw 14 October 2010

Download this Android application!

SLIDE 2

Important research problems

(From Hamming, "You and your research")
What are the important problems in your field?
Are you working on one of them?
If not, why?
http://www.paulgraham.com/hamming.html

SLIDE 7

The important problems in computational linguistics

type-theoretical semantics

anaphora resolution

multilingual syntax editing

machine translation

SLIDE 8

A history of machine translation

SLIDE 9

Beginnings of machine translation

Weaver 1947, encouraged by cryptography in WW II
Word lookup → n-gram models (Shannon’s "noisy channel")

$$\hat{e} = \operatorname*{argmax}_{e} \; P(f \mid e)\, P(e)$$

$P(w_1 \ldots w_n)$ approximated by e.g. $P(w_1 w_2)\, P(w_2 w_3) \cdots P(w_{n-1} w_n)$ (2-grams)
Modern version: Google translate, translate.google.com
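The noisy-channel formula above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the corpus, the candidate translations and the channel probabilities are made-up toy values, and the function names (`bigram_counts`, `bigram_score`, `best_translation`) are hypothetical, not part of any system mentioned in the talk.

```python
from collections import Counter

def bigram_counts(corpus):
    """Count adjacent word pairs in a list of tokenized sentences."""
    counts = Counter()
    for sentence in corpus:
        counts.update(zip(sentence, sentence[1:]))
    return counts

def bigram_score(sentence, counts):
    """Approximate P(w1 ... wn) by a product of relative 2-gram frequencies."""
    total = sum(counts.values())
    score = 1.0
    for pair in zip(sentence, sentence[1:]):
        score *= counts[pair] / total  # unseen pairs count as 0
    return score

def best_translation(candidates, channel, counts):
    """argmax over e of P(f|e) * P(e); channel maps each candidate e to P(f|e)."""
    return max(candidates, key=lambda e: channel[e] * bigram_score(e, counts))

# Toy data, invented for this sketch.
corpus = [
    ("seven", "is", "not", "even"),
    ("not", "even", "close"),
    ("an", "even", "number"),
]
counts = bigram_counts(corpus)
candidates = [("not", "even"), ("even", "not")]
channel = {candidates[0]: 0.5, candidates[1]: 0.5}

print(best_translation(candidates, channel, counts))
```

With equal channel probabilities the language model alone decides, so the word order seen in the corpus wins. A real system would smooth the counts instead of giving unseen 2-grams probability zero.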

SLIDE 10

Word sense disambiguation

Eng. even → Fre. égal, équitable, pair, plat; même, ...

Eng. even number → Fre. nombre pair

Eng. not even → Fre. même pas

Eng. 7 is not even → Fre. 7 n’est pas pair

SLIDE 11

Long-distance dependencies

Ger. er bringt mich um → Eng. he kills me

Ger. er bringt seinen besten Freund um → Eng. he kills his best friend

SLIDE 12

Type theory and machine translation

Bar-Hillel (1953): MT should aim at rendering meaning, not words.
Method: Ajdukiewicz syntactic calculus (1935) for syntax and semantics.
Directional types (prefix and postfix functions):

loves : (n\s)/n    Mary : n

John : n    loves Mary : n\s

John loves Mary : s

Categorial grammar, developed further by Lambek (1958), Curry (1961)
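The derivation of "John loves Mary" can be replayed mechanically. The sketch below is a minimal Python encoding of directional types with the two application rules; the tuple representation and the helper names (`under`, `over`, `combine`) are assumptions made for illustration, not notation from the talk.

```python
N, S = "n", "s"

def under(arg, res):
    """The type arg\\res: expects an arg on its left and yields res."""
    return ("\\", arg, res)

def over(res, arg):
    """The type res/arg: expects an arg on its right and yields res."""
    return ("/", res, arg)

LEXICON = {
    "John": N,
    "Mary": N,
    "loves": over(under(N, S), N),  # (n\s)/n
}

def combine(left, right):
    """Apply one of the two directional rules, or return None if neither fits."""
    if isinstance(left, tuple) and left[0] == "/" and left[2] == right:
        return left[1]   # forward application:  res/arg, arg  =>  res
    if isinstance(right, tuple) and right[0] == "\\" and right[1] == left:
        return right[2]  # backward application: arg, arg\res  =>  res
    return None

vp = combine(LEXICON["loves"], LEXICON["Mary"])  # loves Mary : n\s
sentence = combine(LEXICON["John"], vp)          # John loves Mary : s
print(vp, sentence)
```

Two noun phrases next to each other do not combine, which is how the calculus rejects strings such as "Mary John".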

SLIDE 13

Bar-Hillel’s criticism

1963: FAHQT (Fully Automatic High-Quality Translation) is impossible - not only in the foreseeable future but in principle.
Example: word sense disambiguation for pen:
the pen is in the box vs. the box is in the pen
Requires unlimited intelligence, universal encyclopedia.

SLIDE 14

The ALPAC report

Automatic Language Processing Advisory Committee, 1966
Conclusion: MT funding had been wasted money
Outcome: MT changed to more modest goals of computational linguistics: to describe language
Main criticisms:

MT was too expensive
too much postprocessing needed
only small needs for translation - well covered by humans

SLIDE 15

1970’s and 1980’s

Trade-off: coverage vs. precision
Precision-oriented systems: Curry → Montague → Rosetta
Interactive systems (Kay 1979/1996)

ask for disambiguation if necessary
text editor + translation memory

SLIDE 16

Present day

IBM system (Brown, Jelinek, & al. 1990): back to Shannon’s model
Google translate 2007- (Och, Ney, Koehn, ...)

57 languages
models built automatically from text data

Browsing quality rather than publication quality
(Systran/Babelfish: rule-based, since 1960’s)

SLIDE 17

The MOLTO project

SLIDE 18

Multilingual On-Line Translation
FP7-ICT-247914
Mission: to develop a set of tools for translating texts between multiple languages in real time with high quality.
www.molto-project.eu

SLIDE 19

Consumer vs. producer quality

Tool        Google, Babelfish    MOLTO
target      consumers            producers
input       unpredictable        predictable
coverage    unlimited            limited
quality     browsing             publishing

SLIDE 20

Producer’s quality

Cannot afford translating

prix 99 euros

to

pris 99 kronor

SLIDE 21

Producer’s quality

Cannot afford translating

I miss her

to

je m’ennuie d’elle

("I’m bored of her")

SLIDE 22

The translation directions

Statistical methods (e.g. Google translate) work best when translating into English

rigid word order
simple morphology
focus of DARPA-funded research

Grammar-based methods work equally well for different languages

Finnish cases, German word order

SLIDE 23

MOLTO languages

SLIDE 24

Domain-specific interlinguas

The abstract syntax must be formally specified, well-understood

semantic model for translation
fixed word senses
proper idioms

SLIDE 25

Examples of domain semantics

Expressed in various formal languages

mathematics, in predicate logic
software functionality, in UML/OCL
dialogue system actions, in SISR
museum object descriptions, in OWL

Type theory can be used for any of these!

SLIDE 26

Two things we do better than before

No universal interlingua:

The Rosetta stone is not a monolith, but a boulder field.

Yes universal concrete syntax:

no hand-crafted ad hoc grammars
but a general-purpose resource grammar library

SLIDE 27

Challenge: grammar tools

Scale up production of domain interpreters

from 100’s to 1000’s of words
from GF experts to domain experts and translators
from months to days
writing a grammar ≈ translating a set of examples

SLIDE 28

Challenge: translator’s tools

Transparent use:

text input + prediction
syntax editor for modification
disambiguation
on the fly extension
normal workflows: API for plug-ins in standard tools, web, mobile

phones...

SLIDE 29

Scientific challenge: robustness and statistics

1. Statistical Machine Translation (SMT) as fall-back
2. Hybrid systems
3. Learning of GF grammars by statistics
4. Improving SMT by grammars

SLIDE 30

Demo: MOLTO phrasebook

Touristic phrases in 14 languages.
Incremental parsing
Disambiguation
Test of example-based with humans and Google translate
grammaticalframework.org/demos/phrasebook/
Android application via embedded GF interpreter in Java

SLIDE 31

Grammatical Framework (GF): a crash course

SLIDE 32

History

Background: type theory, logical frameworks (LF)
GF = LF + concrete syntax
Started at Xerox (XRCE Grenoble) in 1998 for multilingual document authoring
Functional language with dependent types, parametrized modules, optimizing compiler

SLIDE 33

Factoring out functionalities

GF grammars are declarative programs that define

parsing
generation
translation
editing

Some of this can also be found in BNF/Yacc, HPSG/LKB, LFG/XLE ...

SLIDE 34

Multilingual grammars in compilers

Source and target language related by abstract syntax

                                                   iconst_2
                                                   iload_0
2 * x + 1  <----->  plus (times 2 x) 1  <----->    imul
                                                   iconst_1
                                                   iadd

SLIDE 35

A GF grammar for expressions

abstract Expr = {
  cat Exp ;
  fun plus : Exp -> Exp -> Exp ;
  fun times : Exp -> Exp -> Exp ;
  fun one, two : Exp ;
}

concrete ExprJava of Expr = {
  lincat Exp = Str ;
  lin plus x y = x ++ "+" ++ y ;
  lin times x y = x ++ "*" ++ y ;
  lin one = "1" ;
  lin two = "2" ;
}

concrete ExprJVM of Expr = {
  lincat Exp = Str ;
  lin plus x y = x ++ y ++ "iadd" ;
  lin times x y = x ++ y ++ "imul" ;
  lin one = "iconst_1" ;
  lin two = "iconst_2" ;
}
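To make the idea of one abstract syntax with two concrete syntaxes tangible outside GF, here is a small Python sketch. The `Tree` class and the `JAVA`/`JVM` dictionaries are illustrative stand-ins chosen for this example; they mimic the linearization rules of the Expr grammar but are not how GF is implemented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tree:
    """An abstract syntax tree: a function name applied to subtrees."""
    fun: str
    args: tuple = ()

# One linearization rule per abstract function, per target language.
JAVA = {
    "plus": lambda x, y: f"{x} + {y}",
    "times": lambda x, y: f"{x} * {y}",
    "one": lambda: "1",
    "two": lambda: "2",
}
JVM = {
    "plus": lambda x, y: f"{x} {y} iadd",
    "times": lambda x, y: f"{x} {y} imul",
    "one": lambda: "iconst_1",
    "two": lambda: "iconst_2",
}

def linearize(tree, concrete):
    """Linearize the subtrees first, then apply the rule of the root function."""
    return concrete[tree.fun](*(linearize(arg, concrete) for arg in tree.args))

# plus (times two one) one
tree = Tree("plus", (Tree("times", (Tree("two"), Tree("one"))), Tree("one")))

print(linearize(tree, JAVA))  # 2 * 1 + 1
print(linearize(tree, JVM))   # iconst_2 iconst_1 imul iconst_1 iadd
```

Like the slide grammar, the infix linearization inserts no parentheses, so a tree such as `times (plus ...) ...` would print ambiguously; handling precedence is outside the scope of this sketch.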

SLIDE 36

Multilingual grammars in natural language

SLIDE 37

Natural language structures

Predication: John + loves Mary
Complementation: love + Mary
Noun phrases: John
Verb phrases: love Mary
2-place verbs: love

SLIDE 38

Abstract syntax of sentence formation

abstract Zero = {
  cat
    S ; NP ; VP ; V2 ;
  fun
    Pred : NP -> VP -> S ;
    Compl : V2 -> NP -> VP ;
    John, Mary : NP ;
    Love : V2 ;
}

SLIDE 39

Concrete syntax, English

concrete ZeroEng of Zero = {
  lincat
    S, NP, VP, V2 = Str ;
  lin
    Pred np vp = np ++ vp ;
    Compl v2 np = v2 ++ np ;
    John = "John" ;
    Mary = "Mary" ;
    Love = "loves" ;
}

SLIDE 40

Multilingual grammar

The same system of trees can be given

different words
different word orders
different linearization types

SLIDE 41

Concrete syntax, French

concrete ZeroFre of Zero = {
  lincat
    S, NP, VP, V2 = Str ;
  lin
    Pred np vp = np ++ vp ;
    Compl v2 np = v2 ++ np ;
    John = "Jean" ;
    Mary = "Marie" ;
    Love = "aime" ;
}

Just use different words

SLIDE 42

Translation and multilingual generation in GF

Import many grammars with the same abstract syntax

> i ZeroEng.gf ZeroFre.gf
Languages: ZeroEng ZeroFre

Translation: pipe parsing to linearization

> p -lang=ZeroEng "John loves Mary" | l -lang=ZeroFre
Jean aime Marie

Multilingual random generation: linearize into all languages

> gr | l
Pred Mary (Compl Love Mary)
Mary loves Mary
Marie aime Marie
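The "parse, then linearize" pipeline can be imitated in Python for the Zero grammar. This is a sketch under a strong simplifying assumption: because the toy grammar generates only four sentences, parsing is done by enumerating all trees and comparing their linearizations, which is not how GF parses. All function names here are hypothetical.

```python
from itertools import product

NPS = ["John", "Mary"]
V2S = ["Love"]

def all_trees():
    """Enumerate every S tree of the Zero grammar: Pred np (Compl v2 np)."""
    for subj, verb, obj in product(NPS, V2S, NPS):
        yield ("Pred", subj, ("Compl", verb, obj))

# The two concrete syntaxes differ only in their words.
ENG = {"John": "John", "Mary": "Mary", "Love": "loves"}
FRE = {"John": "Jean", "Mary": "Marie", "Love": "aime"}

def linearize(tree, words):
    """Pred np vp = np ++ vp ; Compl v2 np = v2 ++ np."""
    _, subj, (_, verb, obj) = tree
    return " ".join([words[subj], words[verb], words[obj]])

def parse(sentence, words):
    """Generate-and-test parsing; only viable because the grammar is finite."""
    return [t for t in all_trees() if linearize(t, words) == sentence]

def translate(sentence, source, target):
    """Translation = parse in the source language, linearize in the target."""
    return [linearize(t, target) for t in parse(sentence, source)]

print(translate("John loves Mary", ENG, FRE))  # ['Jean aime Marie']
```

The result is a list because a sentence may in general have several parses; here every sentence in the language has exactly one.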

SLIDE 43

Parameters in linearization

Latin has cases: nominative for subject, accusative for object.

Ioannes Mariam amat "John-Nom loves Mary-Acc"
Maria Ioannem amat "Mary-Nom loves John-Acc"

Parameter type for case (just 2 of Latin’s 6 cases):

param Case = Nom | Acc

SLIDE 44

Concrete syntax, Latin

concrete ZeroLat of Zero = {
  lincat
    S, VP, V2 = Str ;
    NP = Case => Str ;
  lin
    Pred np vp = np ! Nom ++ vp ;
    Compl v2 np = np ! Acc ++ v2 ;
    John = table {Nom => "Ioannes" ; Acc => "Ioannem"} ;
    Mary = table {Nom => "Maria" ; Acc => "Mariam"} ;
    Love = "amat" ;
  param
    Case = Nom | Acc ;
}

Different word order (SOV), different linearization type, parameters.

SLIDE 45

Table types and tables

The linearization type of NP is a table type: from Case to Str,

lincat NP = Case => Str

The linearization of John is an inflection table,

lin John = table {Nom => "Ioannes" ; Acc => "Ioannem"}

When using an NP, select (!) the appropriate case from the table,

Pred np vp = np ! Nom ++ vp
Compl v2 np = np ! Acc ++ v2
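For readers more used to mainstream languages, a table type behaves much like a finite map from parameter values to strings, and selection (!) like a lookup. The following Python sketch of the Latin rules rests on that analogy; the `Case` enum and the function names are choices made for this example only.

```python
from enum import Enum

class Case(Enum):
    NOM = "Nom"
    ACC = "Acc"

# lincat NP = Case => Str: each noun phrase is a table indexed by case.
JOHN = {Case.NOM: "Ioannes", Case.ACC: "Ioannem"}
MARY = {Case.NOM: "Maria", Case.ACC: "Mariam"}
LOVE = "amat"

def compl(v2, np):
    """Compl v2 np = np ! Acc ++ v2 (the object precedes the verb)."""
    return f"{np[Case.ACC]} {v2}"

def pred(np, vp):
    """Pred np vp = np ! Nom ++ vp."""
    return f"{np[Case.NOM]} {vp}"

print(pred(JOHN, compl(LOVE, MARY)))  # Ioannes Mariam amat
print(pred(MARY, compl(LOVE, JOHN)))  # Maria Ioannem amat
```

One difference from GF is worth keeping in mind: a GF table must cover every value of its parameter type, which the compiler checks, whereas a Python dict with a missing key fails only at run time.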

SLIDE 46

Concrete syntax, Dutch

concrete ZeroDut of Zero = {
  lincat
    S, NP, VP = Str ;
    V2 = {v : Str ; p : Str} ;
  lin
    Pred np vp = np ++ vp ;
    Compl v2 np = v2.v ++ np ++ v2.p ;
    John = "Jan" ;
    Mary = "Marie" ;
    Love = {v = "heeft" ; p = "lief"} ;
}

The verb heeft lief is a discontinuous constituent.

SLIDE 47

Record types and records

The linearization type of V2 is a record type

lincat V2 = {v : Str ; p : Str}

The linearization of Love is a record

lin Love = {v = "heeft" ; p = "lief"}

The values of fields are picked by projection (.)

lin Compl v2 np = v2.v ++ np ++ v2.p
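The record mechanism can be mirrored with a named tuple in Python. The sketch below is an analogy only: `V2`, `compl` and `pred` are names picked for this example, and the strings come from the Dutch grammar on the slide.

```python
from typing import NamedTuple

class V2(NamedTuple):
    """lincat V2 = {v : Str ; p : Str}: verb part and particle part."""
    v: str
    p: str

LOVE = V2(v="heeft", p="lief")

def compl(v2, np):
    """Compl v2 np = v2.v ++ np ++ v2.p: the object goes between the two parts."""
    return f"{v2.v} {np} {v2.p}"

def pred(np, vp):
    """Pred np vp = np ++ vp."""
    return f"{np} {vp}"

print(pred("Jan", compl(LOVE, "Marie")))  # Jan heeft Marie lief
```

Keeping the two parts of the verb in separate fields is what lets a single abstract function `Love` surface as a discontinuous constituent.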

SLIDE 48

Concrete syntax, Hebrew

The verb agrees with the gender of the subject.

SLIDE 49

Abstract trees and parse trees

SLIDE 50

Word alignment via trees

SLIDE 51

A more involved word alignment

SLIDE 52

The GF Resource Grammar Library

Morphology and basic syntax
Common API for different languages
Currently (September 2010) 16 languages: Bulgarian, Catalan, Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Polish, Romanian, Russian, Spanish, Swedish, Urdu.
Under construction for more languages: Amharic, Arabic, Farsi, Hebrew, Icelandic, Japanese, Latin, Latvian, Maltese, Mongol, Portuguese, Swahili, Thai, Tswana, Turkish. (Summer School 2009)
Contributions welcome!

SLIDE 53

Implementing a smart paradigm

SLIDE 54

Inflectional morphology

Goal: a complete system of inflection paradigms
Paradigm: a function from "basic form" to full inflection table
GF morphology is inspired by

Zen (Huet 2005): typeful functional programming
XFST (Beesley and Karttunen 2003): regular expressions

SLIDE 55

Example: English verb inflection

Or: how to avoid giving all forms of all verbs.
Start by defining parameter types and parts of speech.

param VForm = VInf | VPres | VPast | VPastPart | VPresPart ;

oper Verb : Type = {s : VForm => Str} ;

Judgement form oper: auxiliary operation.

SLIDE 56

Start: worst-case function

To save writing and to abstract over the Verb type

oper mkVerb : (_,_,_,_,_ : Str) -> Verb = \go,goes,went,gone,going -> {
  s = table {
    VInf => go ;
    VPres => goes ;
    VPast => went ;
    VPastPart => gone ;
    VPresPart => going
    }
  } ;

SLIDE 57

Defining paradigms

A paradigm is an operation of type

Str -> Verb

which takes a string and returns an inflection table.
E.g. regular verbs:

regVerb : Str -> Verb = \walk ->
  mkVerb walk (walk + "s") (walk + "ed") (walk + "ed") (walk + "ing") ;

This will work for walk, interest, play.
It will not work for sing, kiss, use, cry, fly, stop.

SLIDE 58

More paradigms

For verbs ending with s, x, z, ch

s_regVerb : Str -> Verb = \kiss ->
  mkVerb kiss (kiss + "es") (kiss + "ed") (kiss + "ed") (kiss + "ing") ;

For verbs ending with e

e_regVerb : Str -> Verb = \use ->
  let us = init use
  in mkVerb use (use + "s") (us + "ed") (us + "ed") (us + "ing") ;

Notice:

the local definition let c = d in ...
the operation init from Prelude, dropping the last character

SLIDE 59

More paradigms still

For verbs ending with y

y_regVerb : Str -> Verb = \cry ->
  let cr = init cry
  in mkVerb cry (cr + "ies") (cr + "ied") (cr + "ied") (cry + "ing") ;

For verbs ending with ie

ie_regVerb : Str -> Verb = \die ->
  let dy = Predef.tk 2 die + "y"
  in mkVerb die (die + "s") (die + "d") (die + "d") (dy + "ing") ;

SLIDE 60

What paradigm to choose

If the infinitive ends with s, x, z, ch, choose s_regVerb: munch, munches

If the infinitive ends with y, choose y_regVerb: cry, cries, cried

except if a vowel comes before: play, plays, played

If the infinitive ends with e, choose e_regVerb: use, used, using

except if an i precedes: die, dying
or if an e precedes: free, freeing

SLIDE 61

A smart paradigm

Let GF choose the paradigm by pattern matching on strings

smartVerb : Str -> Verb = \v -> case v of {
  _ + ("s"|"z"|"x"|"ch") => s_regVerb v ;
  _ + "ie" => ie_regVerb v ;
  _ + "ee" => ee_regVerb v ;
  _ + "e" => e_regVerb v ;
  _ + ("a"|"e"|"o"|"u") + "y" => regVerb v ;
  _ + "y" => y_regVerb v ;
  _ => regVerb v
  } ;

SLIDE 62

Testing the smart paradigm in GF

> cc -all smartVerb "munch"
munch munches munched munched munching

> cc -all smartVerb "die"
die dies died died dying

> cc -all smartVerb "agree"
agree agrees agreed agreed agreeing

> cc -all smartVerb "deploy"
deploy deploys deployed deployed deploying

> cc -all smartVerb "classify"
classify classifies classified classified classifying

SLIDE 63

The smart paradigm is not perfect

Irregular verbs are obviously not covered

> cc -all smartVerb "sing"
sing sings singed singed singing

Neither are regular verbs with consonant duplication

> cc -all smartVerb "stop"
stop stops stoped stoped stoping

SLIDE 64

The final consonant duplication paradigm

Use the Prelude function last

dupRegVerb : Str -> Verb = \stop ->
  let stopp = stop + last stop
  in mkVerb stop (stop + "s") (stopp + "ed") (stopp + "ed") (stopp + "ing") ;

String pattern: relevant consonant preceded by a vowel

_ + ("a"|"e"|"i"|"o"|"u") + ("b"|"d"|"g"|"m"|"n"|"p"|"r"|"s"|"t") => dupRegVerb v ;

SLIDE 65

Testing consonant duplication

Now it works

> cc -all smartVerb "stop"
stop stops stopped stopped stopping

But what about

> cc -all smartVerb "coat"
coat coats coatted coatted coatting

Solution: a prior case for diphthongs before the last char (? matches one char)

_ + ("ea"|"ee"|"ie"|"oa"|"oo"|"ou") + ? => regVerb v ;

SLIDE 66

There is no waterproof solution

Duplication depends on stress, which is not marked in English:

omit [o’mit]: omitted, omitting
vomit [’vomit]: vomited, vomiting

This means that we occasionally have to give more forms than one. We knew this already for irregular verbs. And we cannot write patterns for each of them either, because e.g. lie can be both lie, lied, lied or lie, lay, lain.
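The whole smart paradigm, including consonant duplication and the diphthong exception, can be approximated in Python with an ordered list of regular expressions. Several points are assumptions of this sketch rather than facts from the slides: the slides do not say where the two extra cases go in the case expression (here they are placed just before the default), `ee_reg_verb` is never defined on the slides and is reconstructed from the test output for "agree", and all Python names are hypothetical.

```python
import re

FORMS = ("VInf", "VPres", "VPast", "VPastPart", "VPresPart")

def mk_verb(*forms):
    """Worst-case function: all five forms are given explicitly."""
    return dict(zip(FORMS, forms))

def reg_verb(walk):
    return mk_verb(walk, walk + "s", walk + "ed", walk + "ed", walk + "ing")

def s_reg_verb(kiss):
    return mk_verb(kiss, kiss + "es", kiss + "ed", kiss + "ed", kiss + "ing")

def e_reg_verb(use):
    us = use[:-1]  # like init: drop the last character
    return mk_verb(use, use + "s", us + "ed", us + "ed", us + "ing")

def ee_reg_verb(free):  # assumed definition, see the lead-in
    return mk_verb(free, free + "s", free + "d", free + "d", free + "ing")

def y_reg_verb(cry):
    cr = cry[:-1]
    return mk_verb(cry, cr + "ies", cr + "ied", cr + "ied", cry + "ing")

def ie_reg_verb(die):
    dy = die[:-2] + "y"  # like Predef.tk 2 die + "y"
    return mk_verb(die, die + "s", die + "d", die + "d", dy + "ing")

def dup_reg_verb(stop):
    stopp = stop + stop[-1]  # duplicate the final consonant
    return mk_verb(stop, stop + "s", stopp + "ed", stopp + "ed", stopp + "ing")

# Ordered (pattern, paradigm) pairs; the first full match wins.
RULES = [
    (r".*(s|z|x|ch)", s_reg_verb),
    (r".*ie", ie_reg_verb),
    (r".*ee", ee_reg_verb),
    (r".*e", e_reg_verb),
    (r".*[aeou]y", reg_verb),
    (r".*y", y_reg_verb),
    (r".*(ea|ee|ie|oa|oo|ou).", reg_verb),      # diphthong before the last char
    (r".*[aeiou][bdgmnprst]", dup_reg_verb),    # vowel + duplicating consonant
    (r".*", reg_verb),
]

def smart_verb(v):
    """Pick the first paradigm whose pattern matches the whole infinitive."""
    for pattern, paradigm in RULES:
        if re.fullmatch(pattern, v):
            return paradigm(v)

for verb in ["munch", "die", "stop", "coat", "omit", "vomit"]:
    print(" ".join(smart_verb(verb).values()))
```

The last two verbs illustrate the point about stress: the sketch duplicates the consonant in both, which is right for "omit" and wrong for "vomit", so such verbs still need their forms given explicitly.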

SLIDE 67

Grammar engineering

SLIDE 68

Complexity of grammar writing

To implement a translation system, we need

domain expertise: technical and idiomatic expression
linguistic expertise: how to inflect words and build phrases

SLIDE 69

Example: an email program

Task: generate phrases saying "you have n message(s)"
Domain expertise: choose correct words (in Swedish, not budskap but meddelande)
Linguistic expertise: avoid "you have one messages"

SLIDE 70

Correct number in Arabic

(From "Implementation of the Arabic Numerals and their Syntax in GF" by Ali El Dada, ACL workshop on Arabic, Prague 2007)

SLIDE 71

Resource grammar API

Smart paradigms for morphology

mkN : (talo : Str) -> N

Abstract syntax functions for syntax

mkCl : NP -> V2 -> NP -> Cl    -- John loves Mary
mkNP : Numeral -> CN -> NP     -- five houses

SLIDE 72

Using the library in English

mkCl youSg_NP have_V2 (mkNP n2_Numeral (mkN "message"))
===> you have two messages

mkCl youSg_NP have_V2 (mkNP n1_Numeral (mkN "message"))
===> you have one message

SLIDE 73

Localization

Adapt the email program to Italian, Swedish, Finnish...

mkCl youSg_NP have_V2 (mkNP n2_Numeral (mkN "messaggio"))
===> hai due messaggi

mkCl youSg_NP have_V2 (mkNP n2_Numeral (mkN "meddelande"))
===> du har två meddelanden

mkCl youSg_NP have_V2 (mkNP n2_Numeral (mkN "viesti"))
===> sinulla on kaksi viestiä

The new languages are more complex than English - but only internally, not on the API level!

SLIDE 74

Functor implementation

abstract Email = {
  cat Info ; Number ;
  fun HaveMessages : Number -> Info ;
}

incomplete concrete EmailI of Email = open Syntax, LexEmail in {
  lincat Info = Cl ;
  lincat Number = Numeral ;
  lin HaveMessages n = mkCl youSg_NP have_V2 (mkNP n message) ;
}

interface LexEmail = open Syntax in {
  oper message : N ;
}

instance LexEmailEng of LexEmail = open SyntaxEng, ParadigmsEng in {
  oper message = mkN "message" ;
}

concrete EmailEng of Email = EmailI with
  (Syntax = SyntaxEng),
  (LexEmail = LexEmailEng) ;

SLIDE 75

Porting to new languages

instance LexEmailIta of LexEmail = open SyntaxIta, ParadigmsIta in {
  oper message = mkN "messaggio" ;
}

concrete EmailIta of Email = EmailI with
  (Syntax = SyntaxIta),
  (LexEmail = LexEmailIta) ;

instance LexEmailFin of LexEmail = open SyntaxFin, ParadigmsFin in {
  oper message = mkN "viesti" ;
}

concrete EmailFin of Email = EmailI with
  (Syntax = SyntaxFin),
  (LexEmail = LexEmailFin) ;
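The functor pattern, one shared implementation instantiated with a syntax module and a lexicon per language, has a rough analogue in a Python function that takes both as arguments. Everything in this sketch is a toy stand-in: the `Syntax` class covers only the numbers 1 and 2 and hard-codes "you have", whereas the real resource grammar library derives such forms from general rules, and Finnish is left out because its partitive object would need more machinery.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Noun:
    sg: str
    pl: str

@dataclass(frozen=True)
class Syntax:
    """Toy stand-in for a resource grammar: only what the email example needs."""
    you_have: str
    numerals: dict  # number -> numeral word

    def mk_np(self, n, noun):
        """Numeral + noun, with the noun agreeing in number."""
        form = noun.sg if n == 1 else noun.pl
        return f"{self.numerals[n]} {form}"

    def mk_cl(self, np):
        return f"{self.you_have} {np}"

def email_functor(syntax, message):
    """Analogue of EmailI: written once, against the interfaces only."""
    def have_messages(n):
        return syntax.mk_cl(syntax.mk_np(n, message))
    return have_messages

# Analogue of 'EmailI with (Syntax = ...), (LexEmail = ...)'.
ENG = Syntax("you have", {1: "one", 2: "two"})
ITA = Syntax("hai", {1: "un", 2: "due"})

email_eng = email_functor(ENG, Noun("message", "messages"))
email_ita = email_functor(ITA, Noun("messaggio", "messaggi"))

print(email_eng(1))  # you have one message
print(email_eng(2))  # you have two messages
print(email_ita(2))  # hai due messaggi
```

Porting to a further language means supplying one more `Syntax` value and one more `Noun`; `email_functor` itself does not change, which is the point of the functor design.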

SLIDE 76

Meaning-preserving translation

Translation must preserve meaning. It need not preserve syntactic structure. Sometimes this is even impossible:

John likes Mary in Italian is Maria piace a Giovanni

The abstract syntax in the semantic grammar is a logical predicate:

fun Like : Person -> Person -> Fact

lin Like x y = x ++ "likes" ++ y           -- English
lin Like x y = y ++ "piace" ++ "a" ++ x    -- Italian
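A short Python sketch shows how one semantic tree yields sentences with different syntactic structure. The tuple encoding of the tree and the two small name lexicons are assumptions made for this example; the linearization rules follow the string-based rules on the slide.

```python
# Proper names per language (illustrative lexicon).
NAMES_ENG = {"John": "John", "Mary": "Mary"}
NAMES_ITA = {"John": "Giovanni", "Mary": "Maria"}

def like_eng(x, y):
    """lin Like x y = x ++ "likes" ++ y"""
    return f"{x} likes {y}"

def like_ita(x, y):
    """lin Like x y = y ++ "piace" ++ "a" ++ x (the liked person becomes the subject)."""
    return f"{y} piace a {x}"

def linearize(tree, names, like):
    """Linearize a tree of the form ("Like", liker, liked)."""
    _, liker, liked = tree
    return like(names[liker], names[liked])

fact = ("Like", "John", "Mary")
print(linearize(fact, NAMES_ENG, like_eng))  # John likes Mary
print(linearize(fact, NAMES_ITA, like_ita))  # Maria piace a Giovanni
```

The argument order of `Like` is fixed by the meaning (who likes whom), so the swap between subject and object lives entirely in the Italian linearization rule.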

SLIDE 77

Translation and resource grammar

To get all grammatical details right, we use resource grammar and not strings

lincat Person = NP ; Fact = Cl ;

lin Like x y = mkCl x like_V2 y       -- English
lin Like x y = mkCl y piacere_V2 x    -- Italian

From a syntactic point of view, we perform transfer, i.e. structure change.
GF has compile-time transfer, and uses interlingua (semantic abstract syntax) at run time.
Compile-time transfer -> exceptions to functors.

SLIDE 78

Example-based grammar writing

Abstract syntax

Nat : Set
Even : Exp -> Prop
Odd : Exp -> Prop
Gt : Exp -> Exp -> Prop
Sum : Exp -> Exp

SLIDE 79

English concrete syntax, by examples

Nat = "number"
Even x = "x is even"
Odd x = "x is odd"
Gt x y = "x is greater than y"
Sum x = "the sum of x"

SLIDE 80

German concrete syntax, by examples

Nat = "Zahl"
Even x = "x ist gerade"
Odd x = "x ist ungerade"
Gt x y = "x ist größer als y"
Sum x = "die Summe von x"

SLIDE 81

Resulting generalization

every even number that is greater than 0 is the sum of two odd numbers

jede gerade Zahl, die größer als 0 ist, ist die Summe von zwei ungeraden Zahlen

SLIDE 82

Using SMT to build grammars

SMT is good for short and common sentences.
→ feed examples to Google translate to bootstrap a grammar!

SLIDE 83

To learn more

GF homepage: grammaticalframework.org

A. Ranta, Grammatical Framework, A Programming Language for Multilingual Grammars and Their Applications, CSLI Publications, Stanford, 2010, to appear.

2nd GF Summer School, Barcelona, 15-26 August 2011.
