Overview Introduction Lexicalized TAG, Advantages of parsing with - - PowerPoint PPT Presentation


Parsing with Lexicalized TAG (1) Extracting and comparing LTAG (2) Presentation by Philip John Gorinski Seminar Recent Advances in Parsing Technology Saarland University, Winter Term 2011/12 (1) Yves Schabes, Aravind K. Joshi, 1990 (2)


SLIDE 1

Parsing with Lexicalized TAG (1)
Extracting and comparing LTAG (2)

Presentation by Philip John Gorinski
Seminar “Recent Advances in Parsing Technology”
Saarland University, Winter Term 2011/12

(1) Yves Schabes, Aravind K. Joshi, 1990
(2) Fei Xia, Chung-hye Han, Martha Palmer, and Aravind Joshi, 2001

SLIDE 2

Overview

  • Introduction
  • Lexicalized TAG, Advantages of parsing with LTAG
  • Parsing LTAGs
  • bottom-up
  • top-down
  • bottom-up + dynamic top-down
  • Extracting and Comparing LTAG
  • Data
  • Extraction
  • Language comparison using LTAGs
  • Conclusion
SLIDE 4

Introduction: Lexicalized TAG

  • like regular Tree Adjoining Grammar
  • initial trees (α-trees) / auxiliary trees (β-trees)
  • substitution (↓) / adjunction (*) of trees
  • additional properties
  • lexical “anchor” for each tree, i.e., all trees are associated with the lexicon
  • here also: separation of lexicon and tree families
SLIDE 5

Introduction: Lexicalized TAG

Substitution:

[Figure: the initial tree [NP [D the] [N girl]] is substituted at the node NP1↓ of [S [NP0 [D the] [N boy]] [VP [V saw] NP1↓]], yielding [S [NP0 [D the] [N boy]] [VP [V saw] [NP [D the] [N girl]]]]]
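
Substitution can be illustrated with a few lines of Python. This is only a minimal sketch under assumptions: the `Node` class, the `substitute` and `words` helpers, and the way the site `NP1` is matched against the root category `NP` (by stripping the index digit) are all hypothetical and not taken from the slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str                                   # category ("NP") or word ("girl")
    children: List["Node"] = field(default_factory=list)
    subst: bool = False                          # True for a substitution site (marked ↓)


def substitute(tree: Node, site: str, initial: Node) -> Node:
    """Return a copy of `tree` in which the first substitution site labelled
    `site` is replaced by the initial tree `initial`."""
    # Assumption: "NP1" and "NP0" are indexed occurrences of category "NP".
    if initial.label != site.rstrip("0123456789"):
        raise ValueError("root of the initial tree must match the site category")
    done = False

    def walk(node: Node) -> Node:
        nonlocal done
        if node.subst and node.label == site and not done:
            done = True
            return initial
        return Node(node.label, [walk(c) for c in node.children], node.subst)

    return walk(tree)


def words(node: Node) -> List[str]:
    """Read the words off the frontier, skipping open substitution sites."""
    if not node.children:
        return [] if node.subst else [node.label]
    return [w for child in node.children for w in words(child)]


the_girl = Node("NP", [Node("D", [Node("the")]), Node("N", [Node("girl")])])
saw = Node("S", [
    Node("NP0", [Node("D", [Node("the")]), Node("N", [Node("boy")])]),
    Node("VP", [Node("V", [Node("saw")]), Node("NP1", subst=True)]),
])

result = substitute(saw, "NP1", the_girl)
print(" ".join(words(result)))   # the boy saw the girl
```

The original tree is left untouched, so the same elementary tree could be reused in another derivation.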

SLIDE 6

Introduction: Lexicalized TAG

Adjunction:

[Figure: the auxiliary tree [N [A pretty] N*] is adjoined at the N node above “girl” in [S [NP [D the] [N boy]] [VP [V saw] [NP [D the] [N girl]]]], yielding [S [NP [D the] [N boy]] [VP [V saw] [NP [D the] [N [A pretty] [N girl]]]]]]
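
Adjunction can be sketched in the same spirit. The following is a hedged illustration, not the presenter's code: `Node`, `plug_foot`, `adjoin` and the convention of addressing the adjunction node by a path of child indices are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    foot: bool = False            # True for the foot node of an auxiliary tree (marked *)


def plug_foot(aux: Node, subtree: Node) -> Node:
    """Copy the auxiliary tree, hanging `subtree` where the foot node was."""
    if aux.foot:
        return subtree
    return Node(aux.label, [plug_foot(c, subtree) for c in aux.children])


def adjoin(tree: Node, path: List[int], aux: Node) -> Node:
    """Adjoin `aux` at the node reached by following the child indices in `path`."""
    if not path:
        if tree.label != aux.label:
            raise ValueError("root of the auxiliary tree must match the adjunction node")
        return plug_foot(aux, tree)
    kids = list(tree.children)
    kids[path[0]] = adjoin(kids[path[0]], path[1:], aux)
    return Node(tree.label, kids)


def words(node: Node) -> List[str]:
    """Read the words off the frontier, skipping an unfilled foot node."""
    if not node.children:
        return [] if node.foot else [node.label]
    return [w for child in node.children for w in words(child)]


sentence = Node("S", [
    Node("NP", [Node("D", [Node("the")]), Node("N", [Node("boy")])]),
    Node("VP", [
        Node("V", [Node("saw")]),
        Node("NP", [Node("D", [Node("the")]), Node("N", [Node("girl")])]),
    ]),
])
pretty = Node("N", [Node("A", [Node("pretty")]), Node("N", foot=True)])

# Path [1, 1, 1]: VP, then the object NP, then its N node.
result = adjoin(sentence, [1, 1, 1], pretty)
print(" ".join(words(result)))   # the boy saw the pretty girl
```

The excised subtree (N over “girl”) reappears under the foot node, which is what distinguishes adjunction from plain substitution.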

SLIDE 7

Introduction: Lexicalized TAG

  • Tree families
  • essentially LTAG trees, but abstracted anchor
  • e.g., family of verbs taking one object (np0Vnp1)

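[Figure: trees of the np0Vnp1 family, each with anchor V◊ and substitution nodes NP0↓ and NP1↓, among them the declarative tree [S NP0↓ [VP V◊ NP1↓]] and a wh-variant with a fronted NPi↓ (+wh) and a trace εi, ...]

  • Lexicon: associates verbs with tree families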
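
The separation of lexicon and tree families can be pictured as two lookup tables. This is a rough sketch under assumptions: the bracketed string encoding, the `np0V` family, and the lexicon entries are hypothetical; only the family name `np0Vnp1` and the anchor marker `V◊` come from the slide.

```python
from typing import Dict, List

# Tree families: unanchored templates, with V◊ marking the anchor position.
FAMILIES: Dict[str, List[str]] = {
    "np0Vnp1": ["(S NP0↓ (VP V◊ NP1↓))"],   # verbs taking one object (declarative tree only)
    "np0V": ["(S NP0↓ (VP V◊))"],           # hypothetical intransitive family
}

# Lexicon: associates each verb with the families it can anchor (toy entries).
LEXICON: Dict[str, List[str]] = {
    "saw": ["np0Vnp1"],
    "slept": ["np0V"],
}


def elementary_trees(word: str) -> List[str]:
    """Instantiate every template of every family listed for `word`."""
    return [
        template.replace("V◊", f"(V {word})")
        for family in LEXICON.get(word, [])
        for template in FAMILIES[family]
    ]


print(elementary_trees("saw"))   # ['(S NP0↓ (VP (V saw) NP1↓))']
```

Adding a verb then only requires a new lexicon entry; the trees themselves are shared through the family.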

SLIDE 8

Introduction: Advantages

  • TAG provides extended domain of locality
  • capture non-local features in a localized fashion
  • 'production-like'
  • LTAG preserves this feature
  • LTAG provides linking to lexical information
  • very useful for actual parsing
  • limited search space, prevention of recursion [...]

SLIDE 10

Parsing LTAGs

  • General two-step strategy for lexicalized grammars
  • 1. select elementary structures for lexical input items
  • 2. parse sentence with respect to the resulting set of structures
  • first step 'filters' the grammar
  • may drastically reduce search space

➔ LTAGs are finitely ambiguous!

  • may guide top-down parser by using bottom-up information, e.g., item's position in input string
  • second step suitable for any parsing algorithm
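
The first (filtering) step of this strategy can be sketched as a lexicon lookup. The toy lexicon, the tree names, and the `select_trees` helper below are assumptions for illustration; the second step (the actual parser) is deliberately left out.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: word -> names of the elementary trees it anchors.
LEXICON: Dict[str, List[str]] = {
    "the": ["alpha_D"],
    "boy": ["alpha_NP"],
    "girl": ["alpha_NP"],
    "saw": ["alpha_np0Vnp1", "alpha_NP"],   # verb reading and noun reading
    "pretty": ["beta_A"],
    "slept": ["alpha_np0V"],
}


def select_trees(tokens: List[str]) -> List[Tuple[str, str, int]]:
    """Step 1: keep only trees anchored by an input token and record the
    token's position, so that each selected tree is tied to one place
    in the string."""
    return [
        (tree, word, pos)
        for pos, word in enumerate(tokens)
        for tree in LEXICON.get(word, [])
    ]


tokens = "the boy saw the girl".split()
selected = select_trees(tokens)
for entry in selected:
    print(entry)

# Step 2 would hand `selected` to any TAG parser (CKY-style, top-down, Earley-style).
```

Trees anchored by words that do not occur in the input (here "pretty" and "slept") never reach the parser, and because every selected tree instance is bound to a position, only finitely many derivations can be built from them.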

SLIDE 11

Parsing LTAGs: bottom-up

  • CKY-type parser for TAG (Vijay-Shanker and Joshi, 1985)
  • data driven
  • bottom-up information of first stage has no effect on algorithm itself
  • grammar filtering reduces number of nodes in the recognition matrix

SLIDE 12

Parsing LTAGs: top-down

  • like push-down automata for CFG parsing (Lang, 1990)
  • indices for subtrees spanning the input
  • CFG: 2 indices; (L)TAG: 4 indices for positions left/right of anchor in auxiliary trees

[Figure: auxiliary tree with root X and foot node X*, with four input positions labeled i, j, k, l]
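
The difference between two and four indices can be made concrete with two small record types. This is a sketch under an assumption that is common in TAG chart parsers but not spelled out on the slide: the outer indices `i` and `l` delimit the whole tree, and the inner indices `j` and `k` delimit the gap below the foot node. The class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CFGItem:
    symbol: str
    i: int            # the constituent covers tokens[i:j]
    j: int


@dataclass(frozen=True)
class TAGItem:
    node: str
    i: int            # left edge of the tree
    j: Optional[int]  # left edge of the foot gap (None if there is no foot)
    k: Optional[int]  # right edge of the foot gap
    l: int            # right edge of the tree

    def covered(self, tokens: List[str]) -> List[str]:
        """Tokens contributed by the tree itself, excluding the foot gap."""
        if self.j is None or self.k is None:
            return tokens[self.i:self.l]
        return tokens[self.i:self.j] + tokens[self.k:self.l]

    def gap(self, tokens: List[str]) -> List[str]:
        """Tokens that lie below the foot node."""
        if self.j is None or self.k is None:
            return []
        return tokens[self.j:self.k]


tokens = ["the", "pretty", "girl"]
np_item = CFGItem("NP", 0, 3)                 # a CFG constituent needs only (i, j)
pretty_item = TAGItem("N", 1, 2, 3, 3)        # auxiliary tree for "pretty", foot over "girl"
print(pretty_item.covered(tokens), pretty_item.gap(tokens))
```

The two extra indices are what allows an auxiliary tree to wrap material on both sides of the subtree it is adjoined around.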

SLIDE 13

Parsing LTAGs: top-down

  • problem for top-down: left-recursion
  • A → A B
  • infinite search space
  • quite frequent phenomenon in TAG
  • solved by grammar filtering for LTAG
  • parser considers only elementary trees selected by first stage
  • can be distinguished by typology and position in input string

➔ each tree only used once

  • finite search space even for top-down parser!

SLIDE 14

Parsing: bottom-up + dynamic top-down

  • Earley-type TAG parser (Schabes and Joshi, 1988)
  • scan / predict / complete
  • use bottom-up prediction to guide top-down parsing
  • straightforward parsing for LTAGs
  • lexicalization simplifies certain steps of the algorithm

SLIDE 15

Parsing: bottom-up + dynamic top-down

  • 1. first pass selects subset of grammar

➔ limits search space

  • 2. each tree is anchored

➔ same state set cannot predict that a tree can be substituted and be completed

➔ same state set cannot predict an auxiliary tree for left adjunction and right completion

  • 3. information of anchor position can be used to filter top-down prediction / completions for adjunction and substitution

SLIDE 16

Parsing: bottom-up + dynamic top-down

  • with normal TAG, “men” could be predicted for substitution in “hate/smoke” structure
  • would lead to backtracking in later analysis
  • lexicalization prevents prediction!
  • anchor position does not match the string

the 1 men 2 who 3 hate 4 women 5 that 6 smoke 7 cigarettes 8 are 9 intolerant 10
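
The anchor-position check can be sketched as a simple comparison. This is a simplified illustration, assuming a left-to-right parser and 0-based token indices; the function `may_predict` and the set of NP-anchoring nouns are hypothetical, and the test is only a necessary condition, not the full filtering of the Earley-type algorithm.

```python
from typing import List

tokens: List[str] = (
    "the men who hate women that smoke cigarettes are intolerant".split()
)


def may_predict(anchor_pos: int, current_pos: int) -> bool:
    """A tree predicted at `current_pos` must still find its anchor in the
    unread part of the input, i.e. at or to the right of `current_pos`."""
    return anchor_pos >= current_pos


# Positions of the nouns that anchor NP trees in this sentence.
noun_positions = [i for i, w in enumerate(tokens) if w in {"men", "women", "cigarettes"}]

# The parser has read "the men who hate" and predicts the object NP of "hate".
current_pos = 4
candidates = [tokens[p] for p in noun_positions if may_predict(p, current_pos)]
print(candidates)   # ['women', 'cigarettes']
```

The NP tree anchored by "men" is ruled out immediately because its anchor lies behind the current position, so the parser never has to backtrack out of that prediction.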

SLIDE 18

Motivation

  • Automatic extraction of grammars has motivations in both theoretical linguistics and NLP engineering
  • Theoretical motivation
  • quantitative testing of Universal Grammar
  • explore similarities and differences of languages
  • Engineering motivation
  • links between structures of different grammars
  • valuable for parsing, lexicon development, machine translation ...

SLIDE 19

Data

  • 3 Languages for comparison
  • English, Chinese, Korean
  • Germanic, Sino-Tibetan, Altaic
  • Different word order
  • SVO (En, Ch) vs. SOV (Ko)
  • permutable argument NPs (Ko)
  • Subject/Object deletion
  • freely (Ch, Ko) vs. none (En)
  • Inflectional morphology
  • rich (Ko) vs. little (En) vs. none (Ch)

SLIDE 20

Data

  • English Penn Treebank II (Marcus et al., 1993)
  • 1,174K words, ~23.85 words/sentence, 94 tags
  • Chinese Penn Treebank (Xia et al., 2000)
  • 100K words, ~23.81 words/sentence, 92 tags
  • Korean Penn Treebank (Han et al., 2001)
  • 54K words, ~10.71 words/sentence, 61 tags
  • All provide phrase structure annotation
  • Use similar annotation scheme

SLIDE 21

Data

  • Example of English Penn Treebank sentence

SLIDE 23

Extraction

  • Tool: LexTract
  • recognizes 3 types of initial/auxiliary LTAG trees
  • Spine: predicate-argument relations
  • Mod: modification rules
  • Conj: coordination relations
  • each extracted tree should fall into exactly one category

SLIDE 24

Extraction

  • Spine-trees
  • X^0: anchor, head of X^m
  • tree is formed by
  • a spine X^m → X^(m-1) → ... → X^0
  • the arguments of X^0

SLIDE 25

Extraction

  • Mod-trees
  • W^q: root with two children
  • W^q*: adjunction node with same label as W^q
  • X^m: modifier of W^q*, spine-tree with

SLIDE 26

Extraction

  • Conj-trees
  • root with 3 children
  • Conjunct: adjunction node X^m*
  • Conjunction
  • Conjunct: spine tree X^m → ... → X^0

SLIDE 27

Extraction

“(at) underwriters still draft policies using fountain pens and blotting paper”

[Figure: elementary trees extracted from the sentence, grouped into spine-trees, mod-trees and a conj-tree]

SLIDE 28

Extraction: Results

           template types   etree types   word types   context-free rules
English         6,926         131,397       49,206          1,524
Chinese         1,140          21,125       10,772            515
Korean            632          13,941       10,035            152

  • Templates: etrees with lexical items removed
  • CFG extracted by reading rules off the templates
  • small subsets of frequent templates cover majority of tokens
  • English: the top 100 (500, 1,000, 1,500) templates cover 87.1% (96.6%, 98.4%, 99.0%) of tokens
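
The counts in the table and the top-k coverage figures can be computed from a list of extracted tree tokens. The sketch below is only an illustration on invented toy data: the `(template, word)` pair encoding and the helper `top_k_coverage` are assumptions, not part of LexTract.

```python
from collections import Counter
from typing import List, Tuple

# Each extracted etree token is modelled as (template, anchoring word).
etree_tokens: List[Tuple[str, str]] = (
    [("NP", "boy")] * 3
    + [("NP", "girl")] * 3
    + [("np0Vnp1", "saw")] * 3
    + [("np0V", "slept")]
)

# Removing the lexical item from an etree leaves its template.
templates = [template for template, _word in etree_tokens]

etree_types = len(set(etree_tokens))
template_types = len(set(templates))


def top_k_coverage(template_tokens: List[str], k: int) -> float:
    """Share of all tokens covered by the k most frequent templates."""
    counts = Counter(template_tokens)
    covered = sum(n for _template, n in counts.most_common(k))
    return covered / sum(counts.values())


print(etree_types, template_types)        # 4 3
print(top_k_coverage(templates, 1))       # 0.6
print(top_k_coverage(templates, 2))       # 0.9
```

Even in this tiny sample the most frequent template accounts for most tokens, which is the same skew the English figures above show on a much larger scale.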

SLIDE 30

Language Comparison

  • Make LTAGs comparable
  • create new shared tagset
  • merge original tags into new tags
  • replace original treebank tags
  • re-run LexTract
  • Compare LTAGs for English, Chinese, Korean
  • templates
  • context-free rules
  • sub-templates
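
The effect of merging tags can be illustrated on a handful of templates. This is a simplified sketch: the mapping `TAG_MAP` to merged tags `N`, `V`, `A` and the bracketed template strings are assumptions, and the tags are rewritten directly in the templates here, whereas the procedure above replaces them in the treebank and re-runs LexTract.

```python
import re
from typing import Dict, List

# Hypothetical mapping from original part-of-speech tags to shared tags.
TAG_MAP: Dict[str, str] = {
    "NN": "N", "NNS": "N", "NNP": "N",
    "VBD": "V", "VBZ": "V",
    "JJ": "A",
}


def retag(template: str) -> str:
    """Replace every original tag in a bracketed template by its merged tag."""
    return re.sub(r"[A-Z]+", lambda m: TAG_MAP.get(m.group(), m.group()), template)


original: List[str] = [
    "(S NP↓ (VP VBD◊ NP↓))",
    "(S NP↓ (VP VBZ◊ NP↓))",
    "(NP NN◊)",
    "(NP NNS◊)",
]
merged = sorted({retag(t) for t in original})

print(len(set(original)), "->", len(merged))
print(merged)
```

Templates that differed only in fine-grained tags collapse into one, which is why a shared tagset both shrinks the template inventory and makes templates from different treebanks directly comparable.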

SLIDE 31

Language Comparison

  • new tagsets reduce templates by ~50%
  • a few shared, high-frequency templates account for a large portion of observed data across languages

SLIDE 32

Language Comparison

  • Annotation errors may have an impact on cross-language evaluation
  • valid pattern t of language A may cover 10% of A's tokens
  • may be invalid for language B, but still be observed due to a single erroneous annotation
  • if G_B covers 50% of G_A without t, it will cover 60% when using t

➔ use thresholds to filter out low-frequency patterns

  • also evens out different sizes of treebanks
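
The 50% versus 60% example can be replayed with a small coverage function. The template names and counts below are invented to match the numbers in the bullets, and applying the threshold only to language B's templates is a simplifying assumption of this sketch.

```python
from typing import Dict


def coverage(counts_a: Dict[str, int], counts_b: Dict[str, int], threshold: int = 1) -> float:
    """Share of A's template tokens whose template occurs at least
    `threshold` times in B's treebank."""
    total = sum(counts_a.values())
    shared = sum(n for t, n in counts_a.items() if counts_b.get(t, 0) >= threshold)
    return shared / total


# Language A: pattern "t" is valid and covers 10% of A's tokens.
counts_a = {"t": 10, "u": 50, "v": 40}
# Language B: "t" shows up once, through a single erroneous annotation.
counts_b = {"t": 1, "u": 30, "w": 20}

print(coverage(counts_a, counts_b, threshold=1))   # 0.6
print(coverage(counts_a, counts_b, threshold=2))   # 0.5
```

A threshold of just two occurrences already removes the spurious ten percentage points contributed by the single erroneous annotation.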

SLIDE 33

Language Comparison

  • thresholds drastically decrease number of patterns per language
  • thresholds lower cross-language coverage
  • effect wears off after a certain threshold is reached

SLIDE 34

Language Comparison

  • effect flattens out once a threshold of 6 occurrences is reached
  • most templates due to annotation errors occur fewer than 6 times in treebanks

SLIDE 36

Conclusion

  • Parsing of LTAG is straightforward with TAG parsers
  • additional lexicalization improves performance
  • limited search space
  • resolved recursion problems
  • extracted, inter-mappable structures for different languages support UG hypothesis
  • ~50% - 80% inter-language coverage of templates
  • possible immediate extensions
  • transfer lexicon construction, mapping trees
  • machine translation, mapping derivations for parallel sentences
