slide-1
SLIDE 1

Finite-State Technology in Natural Language Processing

Andreas Maletti

Institute for Natural Language Processing Universität Stuttgart, Germany maletti@ims.uni-stuttgart.de

Umeå — August 18, 2015

FST in NLP

  • A. Maletti

· 1

slide-2
SLIDE 2

Roadmap

1. Linguistic Basics and Weighted Automata
2. Part-of-Speech Tagging
3. Parsing
4. Machine Translation

Always ask questions right away!

slide-3
SLIDE 3

Linguistic Basics

Units

Sentence (syntactic unit expressing complete thought)

◮ Alla sätt är bra utom de dåliga. (Swedish: "All ways are good except the bad ones.")
◮ "Are you serious?" she asked.

slide-4
SLIDE 4

Linguistic Basics

Units

Sentence (syntactic unit expressing complete thought)
Clause (grammatically complete syntactic unit)

◮ Vännens örfil är ärligt menad, fiendens kyssar vill bedra. (2 main clauses; Swedish: "The friend's slap is honestly meant, the enemy's kisses want to deceive.")
◮ People who live in glass houses should not throw stones. (main clause + relative clause)

slide-5
SLIDE 5

Linguistic Basics

Units

Sentence (syntactic unit expressing complete thought)
Clause (grammatically complete syntactic unit)
Phrases (smaller syntactic units)

◮ the green car (noun phrase = noun and its modifiers)
◮ killed the snake (verb phrase = verb and its objects)

slide-6
SLIDE 6

Linguistic Basics

Units

Sentence (syntactic unit expressing complete thought)
Clause (grammatically complete syntactic unit)
Phrases (smaller syntactic units)
Token (smallest unit = "word"; often derived from lexicon entry)

◮ house, car, lived, smallest, 45th, STACS, Knuth
◮ but tricky: Knuth's vs. Knuth 's; well-known vs. well - known


slide-11
SLIDE 11

Linguistic Basics

Tokenization

Splitting text into sentences and tokens:
relatively simple for English, Swedish, German, etc. (actually often via a complicated regular expression)
hard for other languages:

◮ Chinese: 小洞不补,大洞吃苦。

(A small hole not plugged will make you suffer a big hole.)

◮ Turkish: Çekoslovakyalılaştıramadıklarımızdanmışsınız
(You are said to be one of those that we couldn't manage to convert to a Czechoslovak)

◮ Hungarian: legeslegmegszentségteleníttethetetlenebbjeitekként

(like the most of most undesecratable ones of you)

Example (English sentence-ending full stop)

The regular expression ". WhiteSpace+ [A–Z]" covers most cases (in English)
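The full-stop rule can be sketched as a regular expression in Python; this is an illustrative toy splitter written for this text, not the tokenizer used in the experiments:

```python
import re

# Boundary rule from the slide: a full stop, then whitespace, then a capital
# letter. Lookbehind/lookahead split on the whitespace only, so both sides
# keep their surrounding characters.
SENT_BOUNDARY = re.compile(r'(?<=\.)\s+(?=[A-Z])')

def split_sentences(text):
    """Split `text` at '. WhiteSpace+ [A-Z]' boundaries (no abbreviation list)."""
    return SENT_BOUNDARY.split(text)
```

As the slide says, this only covers most cases: abbreviations such as "Mr." still trigger a false boundary.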



slide-13
SLIDE 13

Linguistic Basics

Example (English)

tokens usually separated by whitespace
sentence end marker "." highly ambiguous:

◮ common abbreviations
◮ dates, ordinals, and phone numbers

STANFORD tokenizer

◮ implemented in JAVA (based on JFlex)
◮ compiles the RegEx into a DFA and runs the DFA
◮ can process 1,000,000 tokens per second


slide-15
SLIDE 15

Weighted Automata

Definition

A weighted automaton is a system (Q, Σ, I, ∆, F, wt):
finite set Q of states
input alphabet Σ
initial states I ⊆ Q
transitions ∆ ⊆ Q × Σ × Q
final states F ⊆ Q
transition weights wt: ∆ → [0, 1]

Example

[Diagram] States q0, q1, q2, q3 with transitions:
q0 → q1 on "." (weight 0.5)
q1 → q2 on WhiteSpace (weight 0.8)
q2 → q2 on WhiteSpace (weight 1)
q2 → q3 on {A, . . . , Z} (weight 0.3)



slide-20
SLIDE 20

Weighted Automata

Example

[Diagram] States q0, q1, q2, q3 with transitions:
q0 → q1 on "." (weight 0.5)
q1 → q2 on WhiteSpace (weight 0.8)
q2 → q2 on WhiteSpace (weight 1)
q2 → q3 on {A, . . . , Z} (weight 0.3)

Definition (Semantics)

Weight of a run: product of the transition weights
(the run (q0, ., q1)(q1, _, q2)(q2, F, q3) has weight 0.5 · 0.8 · 0.3 = 0.12)
Weight of an input: sum of the weights of all successful runs
(the input "._F" has weight 0.12)
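Both semantics can be written down directly. The transition table below re-encodes the example automaton; the state names, the WhiteSpace self-loop on q2, and the symbols WS/UC (standing for WhiteSpace and {A, . . . , Z}) are assumptions read off the flattened diagram:

```python
from itertools import product

def run_weight(wt, run):
    """Weight of a run = product of its transition weights."""
    w = 1.0
    for transition in run:
        w *= wt[transition]
    return w

def input_weight(states, initial, final, wt, word):
    """Weight of an input = sum of the weights of all successful runs.
    Enumerates every state sequence, so exponential; fine for a toy example."""
    total = 0.0
    for seq in product(sorted(states), repeat=len(word) + 1):
        if seq[0] not in initial or seq[-1] not in final:
            continue
        run = [(seq[i], word[i], seq[i + 1]) for i in range(len(word))]
        if all(t in wt for t in run):
            total += run_weight(wt, run)
    return total

# Assumed encoding of the slide's automaton (WS = WhiteSpace, UC = {A, ..., Z}):
wt = {
    ("q0", ".", "q1"): 0.5,
    ("q1", "WS", "q2"): 0.8,
    ("q2", "WS", "q2"): 1.0,
    ("q2", "UC", "q3"): 0.3,
}
```

Evaluating `input_weight({"q0", "q1", "q2", "q3"}, {"q0"}, {"q3"}, wt, [".", "WS", "UC"])` recovers the weight 0.5 · 0.8 · 0.3 = 0.12 from the slide.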


slide-21
SLIDE 21

Part-of-Speech Tagging



slide-24
SLIDE 24

Part-of-Speech Tagging

Motivation

Indexing (GOOGLE): Which (meaning-carrying) tokens to index in
Alla sätt är bra utom de dåliga.
Usually nouns, adjectives, and verbs are meaning-carrying.
Part-of-speech tagging
= task of assigning a grammatical function to each token
= task of determining the word class for each token (of a sentence)

Example

Alla/det. sätt/noun är/verb bra/adj. utom/subord. conj. de/det. dåliga/adj.



slide-28
SLIDE 28

Part-of-Speech Tagging

Tags (from the PENN tree bank — English)

DT = determiner
NN = noun (singular or mass)
JJ = adjective
MD = modal
VB = verb (base form)
VBD = verb (past tense)
VBG = verb (gerund or present participle)

Example (Tagging exercise)

show VB



slide-30
SLIDE 30

Part-of-Speech Tagging

Tags (from the PENN tree bank — English)

DT = determiner
NN = noun (singular or mass)
JJ = adjective
MD = modal
VB = verb (base form)
VBD = verb (past tense)
VBG = verb (gerund or present participle)

Example (Tagging exercise)

the/DT show/NN


slide-33
SLIDE 33

Part-of-Speech Tagging

History

1960s
◮ manually tagged BROWN corpus (1,000,000 words)
◮ tag lists with frequency for each token, e.g., {VB, MD, NN} for can
◮ excluding linguistically implausible sequences (e.g. DT VB)
◮ "most common tag" yields 90% accuracy [CHARNIAK, 97]

1980s
◮ hidden MARKOV models (HMM)
◮ dynamic programming and VITERBI algorithms (wA algorithms)

2000s
◮ British National Corpus (100,000,000 words)
◮ parsers are better taggers (wTA algorithms)

slide-34
SLIDE 34

Markov Model

Statistical approach

Given a sequence w = w1 · · · wk of tokens, determine the most likely sequence t1 · · · tk of part-of-speech tags (ti is the tag of wi):

(t̂1, . . . , t̂k) = argmax_{(t1,...,tk)} p(t1, . . . , tk | w)
= argmax_{(t1,...,tk)} p(t1, . . . , tk, w) / p(w)
= argmax_{(t1,...,tk)} p(t1, . . . , tk, w1, . . . , wk)
= argmax_{(t1,...,tk)} p(t1, w1) · ∏_{i=2}^{k} p(ti, wi | t1, . . . , ti−1, w1, . . . , wi−1)

slide-35
SLIDE 35

Markov Model

Modelling as a stochastic process

Introduce the event Ei = (wi, ti). Then

p(t1, w1) · ∏_{i=2}^{k} p(ti, wi | t1, . . . , ti−1, w1, . . . , wi−1) = p(E1) · ∏_{i=2}^{k} p(Ei | E1, . . . , Ei−1)

Assume the MARKOV property (together with time-invariance):

p(Ei | E1, . . . , Ei−1) = p(Ei | Ei−1) = p(E2 | E1)

slide-36
SLIDE 36

Markov Model

Summary

initial weights p(E) (not indicated below)
transition weights p(E | E′)

[Diagram: the four events (the, DT), (fun, NN), (a, DT), (car, NN) with transition weights p(E | E′) between them: 0.2, 0.4, 0.1, 0.1, 0.4, 0.1, 0.3, 0.1, 0.2, 0.1]


slide-38
SLIDE 38

Markov Model

Maximum likelihood estimation (MLE)

Assume that likelihood = relative frequency in the corpus
initial weights p(E): How often does E start a tagged sentence?
transition weights p(E | E′): How often does E follow E′?
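The relative-frequency estimates amount to simple counting. A sketch, assuming a corpus format of one list of (token, tag) events per sentence:

```python
from collections import Counter

def mle_estimate(tagged_sentences):
    """Relative-frequency (MLE) estimates for a tagged corpus.
    Each sentence is a list of (token, tag) events E = (w, t)."""
    init = Counter()     # how often E starts a tagged sentence
    trans = Counter()    # how often E follows E'
    ctx = Counter()      # how often E' occurs as a predecessor
    for sent in tagged_sentences:
        if not sent:
            continue
        init[sent[0]] += 1
        for prev, cur in zip(sent, sent[1:]):
            trans[(prev, cur)] += 1
            ctx[prev] += 1
    n = sum(init.values())
    p_init = {e: c / n for e, c in init.items()}
    p_trans = {(prev, cur): c / ctx[prev] for (prev, cur), c in trans.items()}
    return p_init, p_trans
```

The sparsity problem discussed next is visible immediately: any event pair absent from the corpus simply gets no entry, i.e. probability 0.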

Problems

Vocabulary: ≈ 350,000 English tokens, but only 50,000 tokens (14%) in the BROWN corpus
Sparsity: (car, NN) (fun, NN) not attested in the corpus, but plausible (frequency estimates might be wrong)


slide-41
SLIDE 41

Transformation into Weighted Automaton

[Diagram: weighted automaton with states q0, q1, q2, q3, q4; transitions labelled (token, tag)/weight, e.g. (fun, NN) 0.2, (car, NN) 0.4, (the, DT) 0.1, (a, DT) 0.1, (car, NN) 0.4, (fun, NN) 0.1, (car, NN) 0.3, (the, DT) 0.1, (fun, NN) 0.2, (a, DT) 0.1; initial weights (the, DT) 0.5, (fun, NN) 0.2, (a, DT) 0.3, (car, NN) 0.5]


slide-43
SLIDE 43

Part-of-Speech Tagging

Typical questions

Decoding: (or language model evaluation) Given model M and sentence w, determine the probability M1(w)

◮ project labels to their first components
◮ evaluate w in the obtained wA M1
◮ efficient: initial-algebra semantics (forward algorithm)
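The initial-algebra evaluation keeps, per state, the total weight of all runs over the prefix read so far, so the sum over exponentially many runs is computed in time linear in the sentence length. A sketch over the transition-table encoding used above (the table format is an assumption):

```python
def forward(states, initial, final, wt, word):
    """Total weight of `word` (sum over all successful runs), computed by
    dynamic programming in O(len(word) * |transitions|)."""
    # alpha[q] = total weight of all runs reading the prefix so far, ending in q
    alpha = {q: (1.0 if q in initial else 0.0) for q in states}
    for symbol in word:
        new = {q: 0.0 for q in states}
        for (p, a, q), w in wt.items():
            if a == symbol:
                new[q] += alpha[p] * w
        alpha = new
    return sum(alpha[q] for q in final)
```

The only change compared with the brute-force sum over runs is that runs sharing a prefix and a current state are merged into one chart entry.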

Tagging: Given model M and sentence w, determine the best tag sequence t1 · · · tk

◮ intersect M with the DFA for w and any tag sequence
◮ determine the best run in the obtained wA
◮ efficient: VITERBI algorithm
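A sketch of the VITERBI computation, written for the simplified model mentioned later in which transition weights depend only on tags and tokens are emitted per tag; the probability-table format is an assumption of this sketch:

```python
def viterbi(tokens, tags, p_init, p_trans, p_emit):
    """Best tag sequence for `tokens` under a first-order HMM.
    p_init[t], p_trans[(t_prev, t)] and p_emit[(token, t)] hold probabilities;
    missing entries count as 0. Returns (best weight, best tag sequence)."""
    # chart[t] = (weight of the best run ending in tag t, that run's tags)
    chart = {t: (p_init.get(t, 0.0) * p_emit.get((tokens[0], t), 0.0), [t])
             for t in tags}
    for tok in tokens[1:]:
        new = {}
        for t in tags:
            e = p_emit.get((tok, t), 0.0)
            w, back = max(
                ((chart[tp][0] * p_trans.get((tp, t), 0.0) * e, chart[tp][1])
                 for tp in tags),
                key=lambda x: x[0],
            )
            new[t] = (w, back + [t])
        chart = new
    return max(chart.values(), key=lambda x: x[0])
```

It is the same dynamic program as the forward algorithm with the sum replaced by a maximum (plus backpointers), which is why both are called wA algorithms on the history slide.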



slide-45
SLIDE 45

Part-of-Speech Tagging

Typical questions

(Weight) Induction: (or MLE training) Given an NFA (Q, Σ, I, ∆, F) and a sequence w1, . . . , wk of tagged sentences wi ∈ Σ∗, determine transition weights wt: ∆ → [0, 1] such that ∏_{i=1}^{k} Mwt(wi) is maximal, where Mwt = (Q, Σ, I, ∆, F, wt)

◮ no closed-form solution (in general), but many approximations
◮ efficient: hill-climbing methods (EM, simulated annealing, etc.)

Learning: (or HMM induction) Given an NFA (Q, Σ, I, ∆, F) and a sequence w1, . . . , wk of untagged sentences wi, determine transition weights wt: ∆ → [0, 1] such that ∏_{i=1}^{k} (Mwt)1(wi) is maximal, where Mwt = (Q, Σ, I, ∆, F, wt)

◮ no exact solution (in general), but many approximations
◮ efficient: hill-climbing methods (EM, simulated annealing, etc.)


slide-47
SLIDE 47

Part-of-Speech Tagging

Issues

WA too big (in comparison to the training data)

◮ cannot reliably estimate that many probabilities p(E | E′)
◮ simplify the model, e.g., assume the transition probability only depends on the tags:
p((w, t) | (w′, t′)) = p(t | t′)

unknown words

◮ no statistics on words that do not occur in the corpus
◮ allow only the assignment of open tags
(open tag = potentially unbounded number of elements, e.g. NNP)
(closed tag = fixed finite number of elements, e.g. DT or PRP)
◮ use morphological clues (capitalization, affixes, etc.)
◮ use context to disambiguate
◮ use "global" statistics

slide-48
SLIDE 48

Part-of-Speech Tagging

TCS contributions

efficient evaluation and complexity considerations (initial-algebra semantics, best runs, best strings, etc.)
model simplifications (trimming, determinization, minimization, etc.)
model transformations (projection, intersection, RegEx-to-DFA, etc.)
model induction (grammar induction, weight training, etc.)

slide-49
SLIDE 49

Parsing


slide-50
SLIDE 50

Parsing

Motivation

(syntactic) parsing = determining the syntactic structure of a sentence
important in several applications:

◮ co-reference resolution (determining which noun phrases refer to the same object/concept)
◮ comprehension (determining the meaning)
◮ speech repair and sentence-like unit detection in speech (speech offers no punctuation; it needs to be predicted)

slide-51
SLIDE 51

Parsing

We must bear in mind the Community as a whole

(S (NP (PRP We)) (VP (MD must) (VP (VB bear) (PP (IN in) (NP (NN mind))) (NP (NP (DT the) (NN Community)) (PP (IN as) (NP (DT a) (NN whole)))))))



slide-54
SLIDE 54

Trees

Finite sets Σ and W

Definition

Set TΣ(W) of Σ-trees indexed by W is the smallest set T such that
w ∈ T for all w ∈ W
σ(t1, . . . , tk) ∈ T for all k ∈ ℕ, σ ∈ Σ, and t1, . . . , tk ∈ T

Notes

Obvious recursion & induction principle
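The definition translates directly into a recursive data type; a minimal sketch:

```python
class Tree:
    """A Sigma-tree indexed by W: a leaf w (no children) or sigma(t1, ..., tk)."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

    def is_leaf(self):
        return not self.children

    def yield_(self):
        """Frontier of the tree: the leaves read from left to right."""
        if self.is_leaf():
            return [self.label]
        return [w for child in self.children for w in child.yield_()]
```

The recursion principle shows up as the recursive call in `yield_`: any function on trees is defined by its value on leaves and on σ(t1, . . . , tk) given the values on the ti.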



slide-57
SLIDE 57

Parsing

Problem

assume a hidden g : W∗ → TΣ(W) (reference parser)
given a finite set T ⊆ TΣ(W) (training set) generated by g
develop a system representing f : W∗ → TΣ(W) (parser) approximating g

Clarification

T generated by g ⟺ T = g(L) for some finite L ⊆ W∗
for approximation we could use |{w ∈ W∗ | f(w) = g(w)}|

slide-58
SLIDE 58

Parsing

Short history

before 1990
◮ hand-crafted rules based on POS tags (unlexicalized parsing)
◮ corrections and selection by human annotators

1990s
◮ PENN tree bank (1,000,000 words)
◮ weighted local tree grammars (weighted CFG) as parsers (often still unlexicalized)
◮ WALL STREET JOURNAL tree bank (30,000,000 words)

since 2000
◮ weighted tree automata (weighted CFG with latent variables)
◮ lexicalized parsers


slide-60
SLIDE 60

Weighted Local Tree Grammars

(S (NP (PRP$ My) (NN dog)) (VP (VBZ sleeps)))   (S (NP (PRP I)) (VP (VBD scored) (ADVP (RB well))))

LTG production extraction

simply read off the CFG productions:
S → NP VP
NP → PRP$ NN
PRP$ → My
NN → dog
VP → VBZ
VBZ → sleeps
NP → PRP
PRP → I
VP → VBD ADVP
VBD → scored
ADVP → RB
RB → well
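Reading off the productions is a single recursion over the tree. A sketch using nested (label, children) pairs with plain strings as terminal leaves; this encoding is an assumption of the sketch:

```python
def extract_productions(tree):
    """Read off the CFG productions of a parse tree, root production first.
    A tree is a (label, children) pair; a terminal leaf is a plain string."""
    label, children = tree
    rhs = []
    prods = []
    for child in children:
        if isinstance(child, str):   # terminal leaf
            rhs.append(child)
        else:                        # nonterminal subtree
            rhs.append(child[0])
            prods.extend(extract_productions(child))
    return [(label, tuple(rhs))] + prods
```

On (S (NP (PRP I)) (VP (VBZ sleeps))) it yields S → NP VP, NP → PRP, PRP → I, VP → VBZ, VBZ → sleeps.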



slide-62
SLIDE 62

Weighted Local Tree Grammars

Observations

LTG offer a unique explanation on the tree level (rules observable in training data; as for POS tagging)
but ambiguity on the string level (i.e., on unannotated data; as for POS tagging)
→ weighted productions

Illustration

(S (NP (PRP We)) (VP (VBD saw) (NP (PRP$ her) (NN duck))))   (S (NP (PRP We)) (VP (VBD saw) (S-BAR (S (NP (PRP her)) (VP (VBP duck))))))

slide-63
SLIDE 63

Weighted Local Tree Grammars

Definition

A weighted local tree grammar (wLTG) is a weighted CFG G = (N, W, S, P, wt):
finite set N (nonterminals)
finite set W (terminals)
S ⊆ N (start nonterminals)
finite set P ⊆ N × (N ∪ W)∗ (productions)
mapping wt: P → [0, 1] (weight assignment)
It computes the weighted derivation trees of the wCFG.

slide-64
SLIDE 64

Weighted Local Tree Grammars

(S (NP (PRP$ My) (NN dog)) (VP (VBZ sleeps)))   (S (NP (PRP I)) (VP (VBD scored) (ADVP (RB well))))

wLTG production extraction

simply read off the CFG productions and keep counts:
S → NP VP (2)
NP → PRP$ NN (1)
PRP$ → My (1)
NN → dog (1)
VP → VBZ (1)
VBZ → sleeps (1)
NP → PRP (1)
PRP → I (1)
VP → VBD ADVP (1)
VBD → scored (1)
ADVP → RB (1)
RB → well (1)

slide-65
SLIDE 65

Weighted Local Tree Grammars

wLTG production extraction

normalize the counts: (here by left-hand side)
S → NP VP (2)
NP → PRP$ NN (1)
NP → PRP (1)
PRP$ → My (1)
NN → dog (1)
VP → VBZ (1)
VP → VBD ADVP (1)
VBZ → sleeps (1)
PRP → I (1)
VBD → scored (1)
ADVP → RB (1)
RB → well (1)
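Normalizing per left-hand side turns the counts into production probabilities; a sketch:

```python
from collections import Counter

def normalize_by_lhs(counts):
    """Turn production counts into probabilities so that, for every
    left-hand side, the weights of its productions sum to 1."""
    totals = Counter()
    for (lhs, rhs), c in counts.items():
        totals[lhs] += c
    return {(lhs, rhs): c / totals[lhs] for (lhs, rhs), c in counts.items()}
```

On the counts above this gives exactly the weights of the next slide, e.g. weight 0.5 for each of the two NP productions and weight 1 for S → NP VP.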


slide-66
SLIDE 66

Weighted Local Tree Grammars

wLTG production extraction

normalize the counts: (here by left-hand side)
S → NP VP (weight 1)
NP → PRP$ NN (weight 0.5)
NP → PRP (weight 0.5)
PRP$ → My (weight 1)
NN → dog (weight 1)
VP → VBZ (weight 0.5)
VP → VBD ADVP (weight 0.5)
VBZ → sleeps (weight 1)
PRP → I (weight 1)
VBD → scored (weight 1)
ADVP → RB (weight 1)
RB → well (weight 1)

slide-67
SLIDE 67

Weighted Local Tree Grammars

Weighted parses

(S (NP (PRP$ My) (NN dog)) (VP (VBZ sleeps)))   weight: 0.25
(S (NP (PRP I)) (VP (VBD scored) (ADVP (RB well))))   weight: 0.25

Weighted LTG productions

(only the productions with weight ≠ 1 are shown)
NP → PRP$ NN (weight 0.5)
NP → PRP (weight 0.5)
VP → VBZ (weight 0.5)
VP → VBD ADVP (weight 0.5)

slide-68
SLIDE 68

Parser Evaluation

BERKELEY parser [Reference]:

(S (NP (PRP We)) (VP (MD must) (VP (VB bear) (PP (IN in) (NP (NN mind))) (NP (NP (DT the) (NN Community)) (PP (IN as) (NP (DT a) (NN whole)))))))

CHARNIAK-JOHNSON parser:

(S (NP (PRP We)) (VP (MD must) (VP (VB bear) (PP (IN in) (NP (NN mind))) (NP (DT the) (NNP Community) (PP (IN as) (NP (DT a) (NN whole)))))))


slide-71
SLIDE 71

Parser Evaluation

Definition (ParseEval measure)

precision = number of correct constituents (heading the same phrase as in the reference) divided by the number of all constituents in the parse
recall = number of correct constituents divided by the number of all constituents in the reference
(weighted) harmonic mean: Fα = ((1 + α²) · precision · recall) / (α² · precision + recall)
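With constituents represented as (label, start, end) spans, a common encoding assumed here, the measures take a few lines; this sketch computes F1, i.e. Fα with α = 1:

```python
def parseval(reference, parse):
    """PARSEVAL precision, recall and F1 from two constituent sets.
    A constituent counts as correct iff it occurs in both sets."""
    correct = len(set(reference) & set(parse))
    precision = correct / len(parse)
    recall = correct / len(reference)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1
```

With 9 of the parser's 9 constituents correct against a 10-constituent reference this yields precision 100%, recall 90% and F1 ≈ 95%, matching the worked example later in the deck.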



slide-74
SLIDE 74

Parser Evaluation

Reference:
(S (NP (PRP We)) (VP (MD must) (VP (VB bear) (PP (IN in) (NP (NN mind))) (NP (NP (DT the) (NN Community)) (PP (IN as) (NP (DT a) (NN whole)))))))
Parser output:
(S (NP (PRP We)) (VP (MD must) (VP (VB bear) (PP (IN in) (NP (NN mind))) (NP (DT the) (NNP Community) (PP (IN as) (NP (DT a) (NN whole)))))))

precision = 9/9 = 100%
recall = 9/10 = 90%
F1 = (2 · 1 · 0.9) / (1 + 0.9) ≈ 95%


slide-76
SLIDE 76

Parser Evaluation

Standardized Setup

training data: PENN treebank Sections 2–21 (articles from the WALL STREET JOURNAL)
development test data: PENN treebank Section 22
evaluation data: PENN treebank Section 23

Experiment [POST, GILDEA, ’09]

grammar model   precision   recall   F1
wLTG            75.37       70.05    72.61

These are bad compared to the state of the art!


slide-78
SLIDE 78

Parser Evaluation

State-of-the-art models

context-free grammars with latent variables (CFGlv) [COLLINS, ’99], [KLEIN, MANNING, ’03], [PETROV, KLEIN, ’07]
tree substitution grammars with latent variables (TSGlv) [SHINDO et al., ’12]
(both as expressive as weighted tree automata)

other models

Experiment [SHINDO et al., ’12]

grammar model                   F1
wLTG = wCFG                     72.6
wTSG [COHN et al., 2010]        84.7
wCFGlv [PETROV, 2010]           91.8
wTSGlv [SHINDO et al., 2012]    92.4


slide-80
SLIDE 80

Grammars with Latent Variables

Definition

A grammar with latent variables (grammar with relabeling) consists of
a grammar G generating L(G) ⊆ TΣ(W)
a (total) mapping ρ: Σ → ∆ (functional relabeling)

Definition (Semantics)

L(G, ρ) = ρ(L(G)) = {ρ(t) | t ∈ L(G)}
Language class: REL(L) for a language class L

slide-81
SLIDE 81

Weighted Tree Automata

Definition

A weighted tree automaton (wTA) is a system G = (Q, N, W, S, P, wt):
finite set Q (states)
finite set N (nonterminals)
finite set W (terminals)
S ⊆ Q (start states)
finite set P ⊆ (Q × N × (Q ∪ W)⁺) ∪ (Q × W) (productions)
mapping wt: P → [0, 1] (weight assignment)
A production (q, n, w1, . . . , wk) is often written q → n(w1, . . . , wk).


slide-84
SLIDE 84

Grammars with Latent Variables

Theorem

REL(wLTL) = REL(wTSL) = wRTL

[Hierarchy diagram of the language classes: wRTL [wTA], REL(wTSL) [wTSGlv], wTSL [wTSG], REL(wCFL) [wCFGlv], wCFL [wCFG]]

here: latent variables ≈ finite-state



slide-86
SLIDE 86

Parsing

Typical questions

Decoding: (or language model evaluation) Given model M and sentence w, determine the probability M(w)

◮ intersect M with the DTA for w and any parse
◮ evaluate w in the obtained wTA
◮ efficient: initial-algebra semantics (forward algorithm)

Parsing: Given model M and sentence w, determine the best parse t for w

◮ intersect M with the DTA for w and any parse
◮ determine the best tree in the obtained wTA
◮ efficient: none (NP-hard even for wLTG)


slide-88
SLIDE 88

Parsing

Statistical parsing approach

Given wLTG M and sentence w, return highest-scoring parse for w

Consequence

The first parse should be preferred ("duck" is more frequently a noun, etc.)

(S (NP (PRP We)) (VP (VBD saw) (NP (PRP$ her) (NN duck))))   (S (NP (PRP We)) (VP (VBD saw) (S-BAR (S (NP (PRP her)) (VP (VBP duck))))))

slide-89
SLIDE 89

Parsing

TCS contributions

efficient evaluation and complexity considerations (initial-algebra semantics, best runs, best trees, etc.)
model simplifications (trimming, determinization, minimization, etc.)
model transformations (intersection, normalization, lexicalization, etc.)
model induction (grammar induction, weight training, spectral learning, etc.)

slide-90
SLIDE 90

Parsing

NLP contribution to TCS

good source of (relevant) problems
good source of practical techniques (e.g., fine-to-coarse decoding)
good source of (relevant) large wTA:

language   states   non-lexical productions
English    1,132    1,842,218
Chinese    994      1,109,500
German     981      616,776

slide-91
SLIDE 91

Machine Translation



slide-94
SLIDE 94

Machine Translation

Applications

Technical manuals

Example (An mp3 player)

The synchronous manifestation of lyrics is a procedure for can broadcasting the music, waiting the mp3 file at the same time showing the lyrics. With the this kind method that the equipments that synchronous function of support up broadcast to make use of document create setup, you can pass the LCD window way the check at the document contents that broadcast. That procedure returns offerings to have to modify, and delete, and stick top, keep etc. edit function.

FST in NLP

  • A. Maletti

· 46

slide-95
SLIDE 95

Machine Translation

Applications

Technical manuals
US military

Example (Speech-to-text [JONES et al., ’09])

E: Okay, what is your name?
A: Abdul.
E: And your last name?
A: Al Farran.

FST in NLP

  • A. Maletti

· 46

slide-96
SLIDE 96

Machine Translation

Applications

Technical manuals
US military

Example (Speech-to-text [JONES et al., ’09])

E: Okay, what is your name?
A: Abdul.
E: And your last name?
A: Al Farran.
E: Okay, what’s your name?
A: milk a mechanic and I am here I mean yes

FST in NLP

  • A. Maletti

· 46

slide-97
SLIDE 97

Machine Translation

Applications

Technical manuals
US military

Example (Speech-to-text [JONES et al., ’09])

E: Okay, what is your name?
A: Abdul.
E: And your last name?
A: Al Farran.
E: Okay, what’s your name?
A: milk a mechanic and I am here I mean yes
E: What is your last name?
A: every two weeks my son’s name is ismail

FST in NLP

  • A. Maletti

· 46

slide-98
SLIDE 98

Machine Translation

[VAUQUOIS triangle: phrase, syntax, and semantics levels between the foreign language and German, rendered graphically]

Translation model:

FST in NLP

  • A. Maletti

· 47

slide-99
SLIDE 99

Machine Translation

[VAUQUOIS triangle: phrase, syntax, and semantics levels between the foreign language and German, rendered graphically]

Translation model: string-to-tree

FST in NLP

  • A. Maletti

· 47

slide-100
SLIDE 100

Machine Translation

[VAUQUOIS triangle: phrase, syntax, and semantics levels between the foreign language and German, rendered graphically]

Translation model: tree-to-tree

FST in NLP

  • A. Maletti

· 47

slide-101
SLIDE 101

Machine Translation

Training data

parallel corpus
word alignments
parse trees for the target sentences

FST in NLP

  • A. Maletti

· 48

slide-102
SLIDE 102

Machine Translation

Training data

parallel corpus
word alignments
parse trees for the target sentences

Parallel Corpus

linguistic resource containing example translations (sentence level)

FST in NLP

  • A. Maletti

· 48

slide-103
SLIDE 103

Machine Translation

parallel corpus, word alignments, parse tree

I would like your advice about Rule 143 concerning inadmissibility
Könnten Sie mir eine Auskunft zu Artikel 143 im Zusammenhang mit der Unzulässigkeit geben

[word-aligned sentence pair; German parse tree rendered graphically with POS tags KOUS PPER PPER ART NN APPR NN CD AART NN APPR ART NN VV and phrase labels PP PP PP NP S]

FST in NLP

  • A. Maletti

· 49

slide-104
SLIDE 104

Machine Translation

parallel corpus, word alignments, parse tree

I would like your advice about Rule 143 concerning inadmissibility
Könnten Sie mir eine Auskunft zu Artikel 143 im Zusammenhang mit der Unzulässigkeit geben

[word-aligned sentence pair; German parse tree rendered graphically with POS tags KOUS PPER PPER ART NN APPR NN CD AART NN APPR ART NN VV and phrase labels PP PP PP NP S]

via GIZA++ [OCH, NEY, ’03]

FST in NLP

  • A. Maletti

· 49

slide-105
SLIDE 105

Machine Translation

parallel corpus, word alignments, parse tree

I would like your advice about Rule 143 concerning inadmissibility
Könnten Sie mir eine Auskunft zu Artikel 143 im Zusammenhang mit der Unzulässigkeit geben

[word-aligned sentence pair; German parse tree rendered graphically with POS tags KOUS PPER PPER ART NN APPR NN CD AART NN APPR ART NN VV and phrase labels PP PP PP NP S]

via BERKELEY parser [PETROV et al., ’06]

FST in NLP

  • A. Maletti

· 49

slide-106
SLIDE 106

Extended Tree Transducer

Extended top-down tree transducer (STSG)

variant of [M., GRAEHL, HOPKINS, KNIGHT, ’09]
rules of the form NT → (r, r1) for a nonterminal NT
◮ right-hand side r of a context-free grammar rule
◮ right-hand side r1 of a regular tree grammar rule
FST in NLP

  • A. Maletti

· 50

slide-107
SLIDE 107

Extended Tree Transducer

Extended top-down tree transducer (STSG)

variant of [M., GRAEHL, HOPKINS, KNIGHT, ’09]
rules of the form NT → (r, r1) for a nonterminal NT
◮ right-hand side r of a context-free grammar rule
◮ right-hand side r1 of a regular tree grammar rule

S → (PPER would like KOUS PPER advice PP, Könnten eine Auskunft geben)
[rule rendered graphically as a string/tree pair; German tree node labels KOUS PPER PPER ART NN PP VV NP S]

FST in NLP

  • A. Maletti

· 50

slide-111
SLIDE 111

Extended Tree Transducer

Extended top-down tree transducer (STSG)

variant of [M., GRAEHL, HOPKINS, KNIGHT, ’09]
rules of the form NT → (r, r1) for a nonterminal NT
◮ right-hand side r of a context-free grammar rule
◮ right-hand side r1 of a regular tree grammar rule

(bijective) synchronization of nonterminals

S → (PPER would like KOUS PPER advice PP, Könnten eine Auskunft geben)
[rule rendered graphically as a string/tree pair; German tree node labels KOUS PPER PPER ART NN PP VV NP S]

FST in NLP

  • A. Maletti

· 50

slide-112
SLIDE 112

Extended Tree Transducer

S → (PPER would like KOUS PPER advice PP, Könnten eine Auskunft geben)
[rendered graphically as a string/tree pair; German tree node labels KOUS PPER PPER ART NN PP VV NP S]

Rule application

1

Selection of synchronous nonterminals

FST in NLP

  • A. Maletti

· 51

slide-114
SLIDE 114

Extended Tree Transducer

S → (PPER would like KOUS PPER advice PP, Könnten eine Auskunft geben)
[rendered graphically as a string/tree pair; German tree node labels KOUS PPER PPER ART NN PP VV NP S]

Rule application

1

Selection of synchronous nonterminals

2

Selection of suitable rule

KOUS → (would like, KOUS(Könnten))
FST in NLP

  • A. Maletti

· 51

slide-115
SLIDE 115

Extended Tree Transducer

S → (PPER KOUS would like PPER advice PP, Könnten eine Auskunft geben)
[rendered graphically as a string/tree pair; German tree node labels KOUS PPER PPER ART NN PP VV NP S]

Rule application

1

Selection of synchronous nonterminals

2

Selection of suitable rule

3

Replacement on both sides

KOUS → (would like, KOUS(Könnten))
FST in NLP

  • A. Maletti

· 51
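The three rule-application steps above can be sketched as simultaneous substitution on a pair of sentential forms. This is a simplified string-to-string rendering (the deck's German side is really a tree), with hypothetical rule contents, and it assumes linked occurrences line up by position:

```python
# Each synchronous rule maps a nonterminal to a (source side, target side)
# pair of token sequences; uppercase tokens stand for nonterminals.
# Rule contents here are illustrative, not the deck's exact rules.
RULES = {
    "S": (["PPER", "KOUS", "PPER", "advice", "PP"],
          ["KOUS", "PPER", "PPER", "eine", "Auskunft", "PP", "geben"]),
    "KOUS": (["would", "like"], ["Könnten"]),
}

def apply_rule(src, tgt, nt):
    """Step 1: select linked occurrences of nt; step 2: select a rule;
    step 3: replace on both sides simultaneously."""
    rhs_src, rhs_tgt = RULES[nt]
    i, j = src.index(nt), tgt.index(nt)
    return src[:i] + rhs_src + src[i + 1:], tgt[:j] + rhs_tgt + tgt[j + 1:]

src, tgt = ["S"], ["S"]
src, tgt = apply_rule(src, tgt, "S")
src, tgt = apply_rule(src, tgt, "KOUS")
# src: PPER would like PPER advice PP
# tgt: Könnten PPER PPER eine Auskunft PP geben
```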

slide-116
SLIDE 116

Extended Tree Transducer

S → (PPER would like PPER advice APPR NN CD PP PP, Könnten eine Auskunft geben)
[rendered graphically as a string/tree pair; German tree node labels KOUS PPER PPER ART NN APPR NN CD PP VV PP NP S]

Rule application

1

synchronous nonterminals

FST in NLP

  • A. Maletti

· 52

slide-118
SLIDE 118

Extended Tree Transducer

S → (PPER would like PPER advice APPR NN CD PP PP, Könnten eine Auskunft geben)
[rendered graphically as a string/tree pair; German tree node labels KOUS PPER PPER ART NN APPR NN CD PP VV PP NP S]

Rule application

1

synchronous nonterminals

2

suitable rule

PP → (APPR NN CD PP, PP(APPR NN CD PP))
FST in NLP

  • A. Maletti

· 52

slide-119
SLIDE 119

Extended Tree Transducer

S → (PPER would like PPER advice PP APPR NN CD PP, Könnten eine Auskunft geben)
[rendered graphically as a string/tree pair; German tree node labels KOUS PPER PPER ART NN APPR NN CD PP VV PP NP S]

Rule application

1

synchronous nonterminals

2

suitable rule

3

replacement

PP → (APPR NN CD PP, PP(APPR NN CD PP))
FST in NLP

  • A. Maletti

· 52

slide-120
SLIDE 120

Rule extraction

following [GALLEY, HOPKINS, KNIGHT, MARCU, ’04]

I would like your advice about Rule 143 concerning inadmissibility
Könnten Sie mir eine Auskunft zu Artikel 143 im Zusammenhang mit der Unzulässigkeit geben

[word-aligned sentence pair; German parse tree rendered graphically with POS tags KOUS PPER PPER ART NN APPR NN CD AART NN APPR ART NN VV and phrase labels PP PP PP NP S]

FST in NLP

  • A. Maletti

· 53

slide-121
SLIDE 121

Rule extraction

following [GALLEY, HOPKINS, KNIGHT, MARCU, ’04]

I would like your advice about Rule 143 concerning inadmissibility
Könnten Sie mir eine Auskunft zu Artikel 143 im Zusammenhang mit der Unzulässigkeit geben

[word-aligned sentence pair; German parse tree rendered graphically with POS tags KOUS PPER PPER ART NN APPR NN CD AART NN APPR ART NN VV and phrase labels PP PP PP NP S]

extractable rules marked in red

FST in NLP

  • A. Maletti

· 53
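Which rules are extractable hinges on finding tree nodes whose aligned source words form a consistent block. A minimal sketch of that check, simplified from the idea in [GALLEY, HOPKINS, KNIGHT, MARCU, ’04]; the alignments below are invented toy data, not the deck's sentence pair:

```python
# A node covering target span [lo, hi) is an extraction point if the
# source positions aligned into that span are aligned back only inside
# it (the block-consistency condition behind frontier nodes).
def is_frontier(span, alignment):
    lo, hi = span                          # target span, half-open
    src = {s for (s, t) in alignment if lo <= t < hi}
    if not src:
        return False                       # unaligned span: not extractable
    return all(lo <= t < hi
               for (s, t) in alignment if min(src) <= s <= max(src))

# Toy alignment: source positions 0,1,2 aligned to target 1,0,2.
a = {(0, 1), (1, 0), (2, 2)}
print(is_frontier((0, 3), a))              # whole sentence is consistent
# A crossing alignment makes a span inconsistent:
b = {(0, 0), (0, 1), (1, 1)}
print(is_frontier((0, 1), b))              # source 0 also aligns outside
```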

slide-125
SLIDE 125

Rule extraction

Removal of extractable rule:

I would like your advice about Rule 143 concerning inadmissibility
Könnten Sie mir eine Auskunft zu Artikel 143 im Zusammenhang mit der Unzulässigkeit geben

[word-aligned sentence pair; German parse tree rendered graphically with POS tags KOUS PPER PPER ART NN APPR NN CD AART NN APPR ART NN VV and phrase labels PP PP PP NP S]

FST in NLP

  • A. Maletti

· 54

slide-126
SLIDE 126

Rule extraction

Removal of extractable rule:

PPER would like your advice about Rule 143 PP
Könnten Sie eine Auskunft zu Artikel 143 geben

[remaining word-aligned pair; German parse tree rendered graphically with node labels KOUS PPER PPER ART NN APPR NN CD VV PP PP NP S]

FST in NLP

  • A. Maletti

· 54

slide-127
SLIDE 127

Rule extraction

Repeated rule extraction:

PPER would like your advice about Rule 143 PP
Könnten Sie eine Auskunft zu Artikel 143 geben

[remaining word-aligned pair; German parse tree rendered graphically with node labels KOUS PPER PPER ART NN APPR NN CD VV PP PP NP S]

FST in NLP

  • A. Maletti

· 55

slide-129
SLIDE 129

Rule extraction

Repeated rule extraction:

PPER would like your advice about Rule 143 PP
Könnten Sie eine Auskunft zu Artikel 143 geben

[remaining word-aligned pair; German parse tree rendered graphically with node labels KOUS PPER PPER ART NN APPR NN CD VV PP PP NP S]

extractable rules marked in red

FST in NLP

  • A. Maletti

· 55

slide-133
SLIDE 133

Extended Tree Transducer

Advantages

very simple
implemented in MOSES [KOEHN et al., ’07]
“context-free”

FST in NLP

  • A. Maletti

· 56

slide-134
SLIDE 134

Extended Tree Transducer

Advantages

very simple
implemented in MOSES [KOEHN et al., ’07]
“context-free”

Disadvantages

problems with discontinuities
composition and binarization not possible [M. et al., ’09] and [ZHANG et al., ’06]
“context-free”

FST in NLP

  • A. Maletti

· 56

slide-135
SLIDE 135

Extended Tree Transducer

Remarks

synchronization breaks almost all existing constructions (e.g., the normalization construction)
→ the choice of the basic grammar model is very important

FST in NLP

  • A. Maletti

· 57

slide-136
SLIDE 136

Extended Tree Transducer

Remarks

synchronization breaks almost all existing constructions (e.g., the normalization construction)
→ the choice of the basic grammar model is very important
tree-to-tree models use trees on both sides

FST in NLP

  • A. Maletti

· 57

slide-137
SLIDE 137

Extended Tree Transducer

Major (tree-to-tree) models

1

linear top-down tree transducer (with look-ahead)

◮ input-side: tree automaton
◮ output-side: regular tree grammar
◮ synchronization: mapping output NT to input NT
FST in NLP

  • A. Maletti

· 58

slide-138
SLIDE 138

Extended Tree Transducer

Major (tree-to-tree) models

1

linear top-down tree transducer (with look-ahead)

◮ input-side: tree automaton
◮ output-side: regular tree grammar
◮ synchronization: mapping output NT to input NT

2

linear extended top-down tree transducer (w. look-ahead)

◮ input-side: regular tree grammar
◮ output-side: regular tree grammar
◮ synchronization: mapping output NT to input NT
FST in NLP

  • A. Maletti

· 58

slide-139
SLIDE 139

Extended Tree Transducer

Synchronous grammar rule (rendered graphically):
q: VP(q1, q2, q3) — VP(q2, VP(q1, q3))

“Classical” top-down tree transducer rule:
q(VP(x1, x2, x3)) → VP(q2(x2), VP(q1(x1), q3(x3)))

FST in NLP

  • A. Maletti

· 59
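The classical rule can be made concrete. A small sketch (a toy tree encoding of this one rule, not the deck's implementation) that rewrites q applied to a ternary VP node into the output tree, with states recursively attached to the matched subtrees:

```python
# Trees are (label, children) pairs; a "state applied to a subtree" is
# encoded as a (state, subtree) pair. The single rule implemented is
#   q(VP(x1, x2, x3)) -> VP(q2(x2), VP(q1(x1), q3(x3)))
def apply_q(tree):
    label, children = tree
    if label == "VP" and len(children) == 3:
        x1, x2, x3 = children
        return ("VP", [("q2", x2), ("VP", [("q1", x1), ("q3", x3)])])
    raise ValueError("no rule for state q at label " + label)

vp = ("VP", [("NP", []), ("V", []), ("PP", [])])
out = apply_q(vp)
# out == ("VP", [("q2", ("V", [])),
#                ("VP", [("q1", ("NP", [])), ("q3", ("PP", []))])])
```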

slide-140
SLIDE 140

Extended Tree Transducer

Syntactic restrictions

nondeleting if synchronization is bijective (in all rules)
strict if r1 is not a nonterminal (for all rules q → (r, r1))
ε-free if r is not a nonterminal (for all rules q → (r, r1))

Composition (COMP)

executing transformations τ ⊆ TΣ × T∆ and τ′ ⊆ T∆ × TΓ one after the other:

τ ; τ ′ = {(s, u) | ∃t ∈ T∆ : (s, t) ∈ τ, (t, u) ∈ τ ′}

FST in NLP

  • A. Maletti

· 60
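For finite relations, the composition τ ; τ′ defined above can be sketched directly as set comprehension (a toy enumeration; actual transducer composition constructs a new transducer rather than listing pairs):

```python
def compose(tau1, tau2):
    """tau1 ; tau2 = {(s, u) | there is a t with (s, t) in tau1 and (t, u) in tau2}."""
    return {(s, u) for (s, t) in tau1 for (t2, u) in tau2 if t == t2}

# Toy transformations over placeholder trees s*, t*, u*:
tau1 = {("s1", "t1"), ("s1", "t2")}
tau2 = {("t1", "u1"), ("t2", "u2"), ("t3", "u3")}
# compose(tau1, tau2) == {("s1", "u1"), ("s1", "u2")}
```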

slide-141
SLIDE 141

Top-down Tree Transducer

[Hasse diagram of composition hierarchies: TOP^R, TOP∞, l-TOP^R_1, l-TOP_2, ls-TOP^R_1, ls-TOP_2, ln-TOP_1, lns-TOP_1]

composition closure indicated in subscript

FST in NLP

  • A. Maletti

· 61

slide-142
SLIDE 142

Extended Tree Transducer

[Hasse diagram of composition hierarchies: XTOP∞, XTOP^R, l-XTOP∞, l-XTOP^R, ln-XTOP∞, ε-XTOP∞, TOP^R, lns-XTOP∞, lε-XTOP_4, lε-XTOP^R_3, lnε-XTOP∞, lsε-XTOP^R_2, lnsε-XTOP_2, lsε-XTOP_2, l-TOP^F_2, l-TOP^R_1, TOP∞, ls-TOP_2, l-TOP_2, ln-TOP_1, lns-TOP_1]

composition closure indicated in subscript

FST in NLP

  • A. Maletti

· 62

slide-143
SLIDE 143

Machine Translation

TCS contributions

efficient evaluation and complexity considerations (exact decoding, best runs, best translations, etc.)
evaluation of expressive power (which linguistic phenomena can be captured? relationship to other models)
model transformations (intersection, language model integration, parse forest decoding, etc.)
very little on model induction so far (mostly local models so far; power of finite-state not yet explored)

FST in NLP

  • A. Maletti

· 63

slide-144
SLIDE 144

Evaluation

Task                System        BLEU
English → German    STSG          15.22
                    MBOT          15.90
                    phrase-based  16.73
                    hierarchical  16.95
                    GHKM          17.10
English → Arabic    STSG          48.32
                    MBOT          49.10
                    phrase-based  50.27
                    hierarchical  51.71
                    GHKM          46.66
English → Chinese   STSG          17.69
                    MBOT          18.35
                    phrase-based  18.09
                    hierarchical  18.49
                    GHKM          18.12

from [SEEMANN et al., ’15]

FST in NLP

  • A. Maletti

· 64

slide-145
SLIDE 145

Selected Literature

ENGELFRIET: Bottom-up and Top-down Tree Transformations — A Comparison. Math. Systems Theory 9, ’75
KLEIN, MANNING: Accurate Unlexicalized Parsing. Proc. ACL ’03
KOEHN: Statistical Machine Translation. Cambridge University Press ’10
MANNING, SCHÜTZE: Foundations of Statistical Natural Language Processing. MIT Press ’99
PETROV, BARRETT, THIBAUX AND KLEIN: Learning Accurate, Compact, and Interpretable Tree Annotation. Proc. ACL ’06
SHINDO, MIYAO, FUJINO AND NAGATA: Bayesian Symbol-Refined Tree Substitution Grammars for Syntactic Parsing. Proc. ACL ’12

FST in NLP

  • A. Maletti

· 65