Finite-State Technology in Natural Language Processing
Andreas Maletti
Institute for Natural Language Processing, Universität Stuttgart, Germany
maletti@ims.uni-stuttgart.de
Umeå — August 18, 2015
FST in NLP · A. Maletti
Roadmap
1 Tokenization
2 Part-of-speech tagging
3 Parsing
4 Machine translation
◮ Alla sätt är bra utom de dåliga. (Swedish: “All ways are good except the bad ones.”)
◮ “Are you serious?” she asked.
◮ Vännens örfil är ärligt menad, (Swedish: “A friend’s slap is honestly meant,”)
◮ People
◮ the green car (noun phrase = noun and its modifiers)
◮ killed the snake (verb phrase = verb and its objects)
◮ house, car, lived, smallest, 45th, STACS, Knuth
◮ but tricky: Knuth’s vs. Knuth ’s
◮ Chinese: 小洞不补,大洞吃苦。 (“A small hole left unmended becomes a big hole that causes suffering.”)
◮ Turkish: Çekoslovakyalılaştıramadıklarımızdanmışsınız
◮ Hungarian: legeslegmegszentségteleníttethetetlenebbjeitekként
◮ common abbreviations
◮ dates, ordinals, and phone numbers
◮ implemented in Java
◮ compiles the regular expressions into a DFA and runs the DFA
◮ can process 1,000,000 tokens per second
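A rule-based tokenizer of this kind can be illustrated in a few lines. This is a minimal sketch in Python (hypothetical patterns, not the Java implementation from the slides); Python’s `re` engine compiles the alternation once, and alternatives are tried in order at each position:

```python
import re

# Hypothetical token patterns, more specific alternatives first.
TOKEN_RE = re.compile(r"""
    (?:Mr|Mrs|Dr|Prof|etc)\.   # common abbreviations
  | \d{1,2}/\d{1,2}/\d{2,4}    # dates such as 18/08/2015
  | \d+(?:st|nd|rd|th)         # ordinals such as 45th
  | \w+(?:'\w+)?               # words, including clitics like Knuth's
  | [^\w\s]                    # any other single symbol
""", re.VERBOSE)

def tokenize(text):
    """Return the list of tokens found by a left-to-right scan."""
    return TOKEN_RE.findall(text)
```

Note that the order of alternatives matters: the abbreviation and date patterns must precede the generic word pattern, otherwise “Dr” and “18” would be split off as plain words.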
◮ manually tagged Brown corpus
◮ tag lists with frequency for each token
◮ excluding linguistically implausible sequences
◮ “most common tag” yields 90% accuracy
◮ hidden Markov models (HMM)
◮ dynamic programming and the Viterbi algorithm
◮ British National Corpus
◮ parsers are better taggers
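The “most common tag” baseline is easy to state as code. A minimal sketch over a toy stand-in for the tagged corpus (the data below is made up, not the Brown corpus):

```python
from collections import Counter, defaultdict

# Toy stand-in for a manually tagged corpus (hypothetical data).
tagged_corpus = [("the", "DT"), ("duck", "NN"), ("ducks", "NNS"),
                 ("duck", "VB"), ("duck", "NN"), ("saw", "VBD")]

def train_baseline(corpus):
    """For each token, remember only its most frequent tag."""
    counts = defaultdict(Counter)
    for word, tag in corpus:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(tokens, model, default="NN"):
    """Assign each token its most common tag; unseen tokens get a default."""
    return [model.get(t, default) for t in tokens]

model = train_baseline(tagged_corpus)
```

Despite ignoring all context, this baseline already reaches about 90% accuracy on real corpora, which is why it is the standard point of comparison.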
[Diagram: weighted automaton over (word, tag) pairs with weights such as (the, DT) 0.1, (a, DT) 0.1, (fun, NN) 0.2, (fun, NN) 0.1, (car, NN) 0.4, (car, NN) 0.3, (the, DT) 0.5, (a, DT) 0.3, (car, NN) 0.5]
◮ project the labels to their first components
◮ evaluate w in the obtained WA M1
◮ efficient: initial-algebra semantics

◮ intersect M with the DFA for w paired with any tag sequence
◮ determine the best run in the obtained WA
◮ efficient: Viterbi algorithm
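The Viterbi step for the best run can be sketched as follows (toy tags and probabilities, not the model from the slides):

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Best tag sequence for `words` under a bigram HMM (probabilities, not logs)."""
    # best[t] = weight of the best run ending in tag t after the first word
    best = {t: start_p.get(t, 0.0) * emit_p[t].get(words[0], 0.0) for t in tags}
    back = []
    for w in words[1:]:
        prev, best, step = best, {}, {}
        for t in tags:
            # dynamic programming: extend the best run into each predecessor tag
            p, arg = max((prev[s] * trans_p[s].get(t, 0.0), s) for s in tags)
            best[t] = p * emit_p[t].get(w, 0.0)
            step[t] = arg
        back.append(step)
    # reconstruct the best path by walking the backpointers
    last = max(best, key=best.get)
    path = [last]
    for step in reversed(back):
        path.append(step[path[-1]])
    return list(reversed(path))

# Toy model (made-up numbers).
tags = ["DT", "NN"]
start_p = {"DT": 0.9, "NN": 0.1}
trans_p = {"DT": {"DT": 0.1, "NN": 0.9}, "NN": {"DT": 0.5, "NN": 0.5}}
emit_p = {"DT": {"the": 0.9, "a": 0.1}, "NN": {"car": 0.5, "fun": 0.4}}
```

The table `best` is exactly the initial-algebra evaluation of the prefix read so far; only the backpointers are extra.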
◮ no closed solution (in general), but many approximations
◮ efficient: hill-climbing methods (EM, simulated annealing, etc.)

◮ no exact solution (in general), but many approximations
◮ efficient: hill-climbing methods (EM, simulated annealing, etc.)
◮ cannot reliably estimate that many probabilities p(E | E′)
◮ simplify the model

◮ no statistics for words that do not occur in the corpus
◮ allow only the assignment of open-class tags
◮ use morphological clues (capitalization, affixes, etc.)
◮ use context to disambiguate
◮ use “global” statistics
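The morphological clues for unseen words can be illustrated with a simple suffix-and-shape guesser (the heuristics below are hypothetical examples, not the ones from the slides):

```python
def guess_tag(word):
    """Guess an open-class tag for a word unseen in the corpus."""
    if word[0].isupper():
        return "NNP"   # capitalization suggests a proper noun
    if word[0].isdigit():
        return "CD"    # leading digit suggests a number
    if word.endswith("ing"):
        return "VBG"   # gerund / present participle
    if word.endswith("ly"):
        return "RB"    # adverb
    if word.endswith("est"):
        return "JJS"   # superlative adjective
    return "NN"        # open-class default
```

Restricting guesses to open-class tags is safe because closed-class words (determiners, prepositions, etc.) are few and will all occur in any sizable corpus.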
◮ co-reference resolution
◮ comprehension
◮ speech repair and sentence-like unit detection in speech
◮ hand-crafted rules based on POS tags
◮ corrections and selection by human annotators
◮ Penn treebank
◮ weighted local tree grammars (weighted CFGs) as parsers
◮ Wall Street Journal treebank
◮ weighted tree automata
◮ lexicalized parsers
(S (NP (PRP We)) (VP (VBD saw) (NP (PRP$ her) (NN duck))))
(S (NP (PRP We)) (VP (VBD saw) (S-BAR (S (NP (PRP her)) (VP (VBP duck))))))
(S (NP (PRP We)) (VP (MD must) (VP (VB bear) (PP (IN in) (NP (NN mind))) (NP (NP (DT the) (NN Community)) (PP (IN as) (NP (DT a) (NN whole)))))))
(S (NP (PRP We)) (VP (MD must) (VP (VB bear) (PP (IN in) (NP (NN mind))) (NP (DT the) (NNP Community) (PP (IN as) (NP (DT a) (NN whole)))))))
◮ intersect M with the DTA for w and any parse
◮ evaluate w in the obtained WTA
◮ efficient: initial-algebra semantics

◮ intersect M with the DTA for w and any parse
◮ determine the best tree in the obtained WTA
◮ efficient: none
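Although the best *tree* is hard to find in general, the best *derivation* of a weighted CFG in Chomsky normal form is tractable via a Viterbi-style CKY chart. A toy sketch (made-up rules and weights, not the grammar from the slides):

```python
from collections import defaultdict

# Toy weighted CFG in Chomsky normal form (hypothetical rules and weights).
unary  = {("DT", "the"): 1.0, ("NN", "duck"): 0.6, ("VB", "duck"): 0.4,
          ("PRP", "we"): 1.0, ("VBD", "saw"): 1.0}
binary = {("NP", "DT", "NN"): 1.0, ("VP", "VBD", "NP"): 1.0,
          ("S", "PRP", "VP"): 1.0, ("S", "NP", "VP"): 0.5}

def cky(words):
    """Return (weight, backpointer) of the best S-derivation, or None."""
    n = len(words)
    chart = defaultdict(dict)       # chart[(i, j)][A] = (weight, backpointer)
    for i, w in enumerate(words):   # seed the chart with lexical rules
        for (a, word), p in unary.items():
            if word == w:
                chart[(i, i + 1)][a] = (p, w)
    for span in range(2, n + 1):    # combine adjacent spans bottom-up
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (a, b, c), p in binary.items():
                    if b in chart[(i, k)] and c in chart[(k, j)]:
                        w_ = p * chart[(i, k)][b][0] * chart[(k, j)][c][0]
                        if w_ > chart[(i, j)].get(a, (0.0,))[0]:
                            chart[(i, j)][a] = (w_, (k, b, c))
    return chart[(0, n)].get("S")
```

The chart entries are exactly the states of the intersected WTA restricted to w; the “efficient: none” above refers to the best tree (summing over derivations), not the best single derivation computed here.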
(S (NP (PRP We)) (VP (VBD saw) (NP (PRP$ her) (NN duck))))
(S (NP (PRP We)) (VP (VBD saw) (S-BAR (S (NP (PRP her)) (VP (VBP duck))))))
I would like your advice about Rule 143 concerning inadmissibility
Könnten Sie mir eine Auskunft zu Artikel 143 im Zusammenhang mit der Unzulässigkeit geben
(German: “Could you give me some information on Rule 143 in connection with the inadmissibility”)
POS tags: KOUS PPER PPER ART NN APPR NN CD AART NN APPR ART NN VV; phrase structure: PP PP PP NP S
◮ right-hand side r of a context-free grammar rule
◮ right-hand side r1 of a regular tree grammar rule

Example rule:
S → ⟨PPER would like PPER advice PP ; (S (KOUS Könnten) PPER PPER (NP (ART eine) (NN Auskunft)) PP (VV geben))⟩
Derivation: starting from S, linked pairs of nonterminals are rewritten step by step, e.g. with the rules
◮ KOUS → ⟨would like ; (KOUS Könnten)⟩
◮ PP → ⟨APPR NN CD PP ; (PP APPR NN CD PP)⟩
yielding the sentential form
⟨PPER would like PPER advice APPR NN CD PP ; Könnten PPER PPER eine Auskunft APPR NN CD PP geben⟩
I would like your advice about Rule 143 concerning inadmissibility
Könnten Sie mir eine Auskunft zu Artikel 143 im Zusammenhang mit der Unzulässigkeit geben
(POS tags: KOUS PPER PPER ART NN APPR NN CD AART NN APPR ART NN VV; phrase structure: PP PP PP NP S)

Extracted fragment:
PPER would like your advice about Rule 143 PP
Könnten Sie eine Auskunft zu Artikel 143 geben
(POS tags: KOUS PPER PPER ART NN APPR NN CD VV; phrase structure: PP PP NP S)
1 ◮ input side: tree automaton
  ◮ output side: regular tree grammar
  ◮ synchronization: mapping output nonterminals to input nonterminals
2 ◮ input side: regular tree grammar
  ◮ output side: regular tree grammar
  ◮ synchronization: mapping output nonterminals to input nonterminals
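A top-down tree transducer of this kind can be sketched directly: trees as (label, children) tuples, and each rule mapping a state and an input label to an output template whose (state, i) leaves recurse into the i-th input subtree. The rules below are toy examples, not from the slides:

```python
# Toy rule set: swap the two children of S while relabeling the leaves.
rules = {
    ("q", "S"): ("S", [("q", 1), ("q", 0)]),  # reorder subtrees 1 and 0
    ("q", "a"): ("b", []),                    # relabel leaf a -> b
    ("q", "b"): ("a", []),                    # relabel leaf b -> a
}

def run(state, tree):
    """Apply the transducer top-down, starting in `state` at the root."""
    label, children = tree
    out_label, template = rules[(state, label)]
    return (out_label, [run(s, children[i]) for s, i in template])
```

Reordering plus relabeling in a single rule is exactly what string transducers cannot do, and why tree transducers are used for syntax-based translation.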
[Hasse diagram of tree transducer classes with composition closure bounds as subscripts: XTOP∞, XTOP^R∞, l-XTOP∞, l-XTOP^R∞, ln-XTOP∞, ε-XTOP∞, TOP^R∞, lns-XTOP∞, lε-XTOP4, lε-XTOP^R3, lnε-XTOP∞, lsε-XTOP^R2, lnsε-XTOP2, lsε-XTOP2, l-TOP^F2, l-TOP^R1, TOP∞, ls-TOP2, l-TOP2, ln-TOP1, lns-TOP1]