Using Treebanks tgrep2 Lecture 2: 07/12/2011 Using Corpora For - - PowerPoint PPT Presentation


slide-1
SLIDE 1

Using Treebanks

tgrep2 Lecture 2: 07/12/2011

slide-2
SLIDE 2

Using Corpora

  • For discovery
  • For evaluation of theories
  • For identifying tendencies

– distribution of a class of words
– distribution of structural configurations
– frequency of a certain distribution

slide-3
SLIDE 3

Why Treebanks

  • Raw corpora are not enough for most linguistic purposes.
  • Let’s start with the rawest of them all: the web, which I’ll call `Google’

  • Convenient
  • Potentially inexhaustible
  • Varied and free-form
slide-4
SLIDE 4

Problems with `Google’

  • Quality control

– hard to identify the author, which makes it difficult to keep track of variation
– the text could be computer generated

  • What does Google count?

– Google counts are notoriously unreliable and change from minute to minute
– problem of repeated elements
– no clear estimate of sample size, so it is difficult to go beyond order-of-magnitude estimates of frequency

slide-5
SLIDE 5

Sentences

  • Sentences are an important unit of linguistic organization.
  • They are not an important unit of organization for most search engines. Consequently, it is not straightforward to restrict a search to a single sentence, a task that is crucial for linguistic purposes.
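
To make the point concrete, here is a minimal Python sketch of a search restricted to sentence boundaries. The text is a toy example made up for illustration, and the sentence splitter is deliberately naive (it breaks after `.`, `!` or `?`), so it is an assumption rather than a robust tokenizer.

```python
import re

text = "Ann likes Bill. Tim likes Nina."   # toy text, made up for illustration

def sentences(raw):
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", raw.strip()) if s]

def cooccur(unit, w1, w2):
    # True if both words occur somewhere in the given stretch of text.
    words = re.findall(r"\w+", unit)
    return w1 in words and w2 in words

# Searching the whole document finds both words...
in_document = cooccur(text, "Ann", "Nina")
# ...but no single sentence contains both of them.
in_sentence = [s for s in sentences(text) if cooccur(s, "Ann", "Nina")]

print(in_document, in_sentence)
```

A document-level search reports a hit even though the two words never share a sentence, which is the behaviour a linguist usually wants to rule out.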

slide-6
SLIDE 6

Selecting Texts

  • Using the web/search engine directly is inadequate for any but the most basic linguistic purposes.
  • The next step forward is to judiciously assemble a set of texts (possibly from the web) and use an appropriate search language

– Regular-expression-based systems are fast, easily available, and easy to use.

slide-7
SLIDE 7

The need for annotation

Even after the creation of a corpus (a set of texts), there are still many basic linguistic investigations that cannot be conducted.

  • generalizing searches – e.g. when we want to examine all sequences of Det and N
  • identifying a subset of cases when there is ambiguity in part of speech, e.g. `to’

slide-8
SLIDE 8

Step 1: POS tags

  • Part-of-speech (POS) tagging is the most basic kind of annotation.

– POS-tagging makes corpora much more linguistically useful.
– POS-tagging can often be done automatically with high reliability, allowing us to use large texts for linguistic purposes.
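
A small Python sketch of what POS tags buy us. The sentence is made up and hand-tagged, and the tag names are an assumption: they follow the Universal POS tagset, where infinitival `to` is PART and prepositional `to` is ADP (a real corpus may use a different tagset).

```python
# Hand-tagged toy sentence (made up); tags follow the Universal POS tagset:
# PART = infinitival "to", ADP = prepositional "to".
tagged = [
    ("Ann", "PROPN"), ("wants", "VERB"), ("to", "PART"), ("go", "VERB"),
    ("to", "ADP"), ("the", "DET"), ("store", "NOUN"),
]

# Generalized search: every determiner immediately followed by a noun.
det_noun = [
    (w1, w2)
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
    if t1 == "DET" and t2 == "NOUN"
]

# Disambiguated search: positions of the prepositional uses of "to" only.
prepositional_to = [
    i for i, (word, tag) in enumerate(tagged) if word == "to" and tag == "ADP"
]

print(det_noun)
print(prepositional_to)
```

Neither search can be stated over the raw words alone: the first needs the word classes, the second needs to tell two uses of the same string apart.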

slide-9
SLIDE 9

Step 2: Beyond POS tags

(1) Ann likes Bill and Tim likes Nina.
(2) Ann likes Bill and Tim, who are her mentors.

  • Assume you want to search for coordinated noun phrases. You want to get (2) but not (1).
  • But a search for the POS sequence `Noun Conj Noun’ will catch both.

  • We need structural information.
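
The overgeneration can be checked with a short Python sketch. The tags below are hand-assigned with simplified, illustrative tag names (NOUN, VERB, CONJ and so on are assumptions, not the tags of any particular corpus); the search looks for a noun, a conjunction and a noun in a row.

```python
def has_sequence(tags, pattern):
    """True if `pattern` occurs as a contiguous run inside `tags`."""
    n = len(pattern)
    return any(tags[i:i + n] == pattern for i in range(len(tags) - n + 1))

# Sentences (1) and (2), hand-tagged with simplified tags.
s1 = [("Ann", "NOUN"), ("likes", "VERB"), ("Bill", "NOUN"), ("and", "CONJ"),
      ("Tim", "NOUN"), ("likes", "VERB"), ("Nina", "NOUN")]
s2 = [("Ann", "NOUN"), ("likes", "VERB"), ("Bill", "NOUN"), ("and", "CONJ"),
      ("Tim", "NOUN"), (",", "PUNCT"), ("who", "PRON"), ("are", "VERB"),
      ("her", "DET"), ("mentors", "NOUN")]

pattern = ["NOUN", "CONJ", "NOUN"]
hits = {
    name: has_sequence([tag for _, tag in sent], pattern)
    for name, sent in (("(1)", s1), ("(2)", s2))
}
print(hits)
```

Both sentences contain the tag sequence, although only in (2) do `Bill and Tim` form a constituent. The flat tag string cannot tell the two apart.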
slide-10
SLIDE 10

Step 3: Structural Information

  • We need structural information for a corpus to be fully linguistically useful.
  • We also need structural information if we want to train parsers off the corpus.
  • The nature of this structural information is quite underdetermined.

slide-11
SLIDE 11

Structural and Other Information

  • One could include structural information, leading to a set of syntactic trees of the familiar sort: hence the term `treebank’
  • But the information can be quite different and there is no commitment that the formal objects involved are `trees’
  • Other alternatives: theta-roles, semantic argument information (PropBank) etc.

slide-12
SLIDE 12

Searching a Treebank

  • For linguistic purposes, we need a way to extract linguistically interesting patterns.
  • What counts as a `linguistically interesting pattern’ will vary greatly depending upon your theoretical interests and the nature of the treebank.
  • Here we will assume that the formal objects are trees and discuss a general and powerful way to search for trees with certain properties.

slide-13
SLIDE 13

Verbs Accounts

  • Class material is in:

/data/home/verbs/shared/LSA7800_076

  • To reduce typing, set up a link:

ln -s /data/home/verbs/shared/LSA7800_076 lsa

slide-14
SLIDE 14

The Unix you’ll need

  • ls -l

lists files in current directory (long format)

  • cd DIRECTORYNAME

go to designated directory

  • cat FILENAME

read file and display on screen

  • > FILENAME

redirect output into designated file instead of the screen

  • more FILENAME

scroll through designated file

  • wc

count number of lines/words/characters in input

  • |

pipe output of one program into another, e.g. cat FILENAME | wc

slide-15
SLIDE 15

Regular Expressions and grep

  • Regular Expressions are a powerful and fast way to express search patterns
  • grep

– program for using Regular Expression search patterns
– Syntax: grep RegExPattern FILENAME

slide-16
SLIDE 16

Very Basic RegEx

  • Words are RegExs that match themselves as well as superstrings of themselves
  • If A and B are RegExs, then so are:
  • 1. AB (concatenation)
  • 2. A|B (disjunction)
  • 3. A* (closure; also A? and A+)
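
The three ways of building larger expressions can be tried out with Python's `re` module, used here only as a convenient stand-in for grep. The words `tree` and `bank` are arbitrary toy expressions; `(?:...)` is Python's non-capturing group, needed so that `*`, `?` and `+` apply to the whole word rather than to its last letter.

```python
import re

A, B = "tree", "bank"   # two toy regexes (plain words)

# A word matches itself and any superstring of itself.
matches_superstring = bool(re.search(A, "treebank"))

concatenation = re.compile(A + B)          # AB
disjunction = re.compile(f"{A}|{B}")       # A|B
closure = re.compile(f"(?:{A})*")          # A*
one_or_more = re.compile(f"(?:{A})+")      # A+
optional = re.compile(f"(?:{A})?{B}")      # A?B

results = {
    "AB": bool(concatenation.fullmatch("treebank")),
    "A|B": bool(disjunction.fullmatch("bank")),
    "A* on empty": bool(closure.fullmatch("")),
    "A* on treetree": bool(closure.fullmatch("treetree")),
    "A+ on empty": bool(one_or_more.fullmatch("")),
    "A?B on bank": bool(optional.fullmatch("bank")),
}
print(matches_superstring, results)
```

The only difference between `A*` and `A+` shows up on the empty string, which the first accepts and the second rejects.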

slide-17
SLIDE 17

Regular expression summary

.          Matches any character (aka wildcard)
^abc       Matches some pattern abc at the start of a string
abc$       Matches some pattern abc at the end of a string
[abc]      Matches one of a set of characters
[A-Z0-9]   Matches one of a range of characters
ed|ing|s   Matches one of the specified strings (aka disjunction)
*          Zero or more of previous item (aka closure)
+          One or more of previous item (aka closure)
?          Zero or one of the previous item (aka optionality)
{n}        Exactly n repeats
{n,}       At least n repeats
{,n}       At most n repeats
{m,n}      At least m and at most n repeats
a(b|c)+    Parentheses indicate the scope of the operators
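
A few of these operators in action, again using Python's `re` module as a stand-in for grep (grep itself may need the `-E` option for `|`, `+`, `?` and `{}`). The word list is made up for illustration.

```python
import re

words = ["walked", "walking", "walks", "walk", "talked"]   # toy word list

# Disjunction plus end-of-string anchor: words ending in -ed, -ing or -s.
ends_in_suffix = [w for w in words if re.search(r"(ed|ing|s)$", w)]

# Start-of-string anchor: words beginning with w.
starts_with_w = [w for w in words if re.search(r"^w", w)]

checks = {
    # Parentheses give + scope over the whole disjunction (b|c).
    "a(b|c)+ on abcb": bool(re.fullmatch(r"a(b|c)+", "abcb")),
    "a(b|c)+ on a": bool(re.fullmatch(r"a(b|c)+", "a")),
    # Character range with a counted repeat.
    "[A-Z0-9]{2,3} on A1": bool(re.fullmatch(r"[A-Z0-9]{2,3}", "A1")),
    "[A-Z0-9]{2,3} on A1B2": bool(re.fullmatch(r"[A-Z0-9]{2,3}", "A1B2")),
    # Wildcard: any single character between "wa" and "k".
    "wa.k on walk": bool(re.fullmatch(r"wa.k", "walk")),
}
print(ends_in_suffix, starts_with_w, checks)
```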

slide-18
SLIDE 18

TGrep2

  • A general way of writing Regular Expressions over trees
  • tgrep2, grep for trees, written by Douglas Rohde, builds upon tgrep
  • To use TGrep2, a special TGrep2 corpus file needs to be created

– wsj.t2c, in the lsa directory, has 49,209 sentences.

slide-19
SLIDE 19

TGrep2 Syntax 1

  • Basic Pattern Syntax

– Regular expression syntax can be used to select words or node labels
– Ex: /^NP/ matches any node label that begins with NP, such as NP-SBJ

  • Command syntax:

lsa/tgrep -c lsa/wsj.t2c PATTERN

slide-20
SLIDE 20

At the beginning

  • >lsa/tgrep -c lsa/wsj.t2c 'S << Vinken' | more

(S (NP-SBJ (NP (NNP Pierre) (NNP Vinken)) (, ,) (ADJP (NP (CD 61) (NNS years)) (JJ old)) (, ,)) (VP (MD will) (VP (VB join) (NP (DT the) (NN board)) (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director))) (NP-TMP (NNP Nov.) (CD 29)))) (. .))

(S (NP-SBJ (NNP Mr.) (NNP Vinken)) (VP (VBZ is) (NP-PRD (NP (NN chairman)) (PP (IN of) (NP (NP (NNP Elsevier) (NNP N.V.)) (, ,) (NP (DT the) (NNP Dutch) (VBG publishing) (NN group)))))) (. .))
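
To see what a query like `S << Vinken` has to check, here is a minimal Python sketch that reads one bracketed tree and looks for S nodes dominating the word `Vinken`. It is not how TGrep2 is implemented; `parse`, `label`, `descendants` and `search` are hypothetical helper names introduced only for this illustration.

```python
import re

def parse(text):
    """Parse one bracketed tree into nested lists: [label, child, child, ...].
    Words (terminal nodes) are plain strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token              # a word
        node = [tokens[pos]]          # the node label follows "("
        pos += 1
        while tokens[pos] != ")":
            node.append(read())
        pos += 1                      # consume ")"
        return node

    return read()

def label(node):
    return node if isinstance(node, str) else node[0]

def descendants(node):
    """Yield every node properly dominated by `node`, in left-to-right order."""
    if isinstance(node, str):
        return
    for child in node[1:]:
        yield child
        yield from descendants(child)

def search(tree, top, below):
    """Rough analogue of 'top << below': nodes labelled `top` that dominate `below`."""
    candidates = [tree, *descendants(tree)]
    return [n for n in candidates
            if label(n) == top and any(label(d) == below for d in descendants(n))]

tree = parse(
    "(S (NP-SBJ (NNP Mr.) (NNP Vinken)) (VP (VBZ is) (NP-PRD (NP (NN chairman)) "
    "(PP (IN of) (NP (NP (NNP Elsevier) (NNP N.V.)) (, ,) "
    "(NP (DT the) (NNP Dutch) (VBG publishing) (NN group)))))) (. .))"
)
matches = search(tree, "S", "Vinken")
words = [d for d in descendants(tree) if isinstance(d, str)]
print(len(matches), " ".join(words))
```

The sketch treats words as nodes in their own right, which is why a bare word such as `Vinken` can appear on the right-hand side of a domination link.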

slide-21
SLIDE 21

Tree Relationships

  • 1. Immediate domination (<)
  • 2. Domination (<<)
  • 3. Sisterhood ($)
  • 4. Immediate Precedence (.)
  • 5. Precedence (..)
  • 6. First/nth/last child
  • 7. First/Last descendant
slide-22
SLIDE 22

Immediate Domination

A < B    A is the parent of B; retrieves the subtree rooted at A.
A > B    A is the daughter of B; retrieves the subtree rooted at A.
NP < PP  Matches any NP that immediately dominates a PP.

slide-23
SLIDE 23

Domination

A << B    A dominates B; retrieves the subtree rooted at A.
A >> B    A is dominated by B; retrieves the subtree rooted at A.
NP << PP  Matches any NP that dominates a PP.
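
The difference between immediate domination and domination can be sketched in Python on a hand-built tree. The tree (for the made-up sentence `Ann praised the students with good grades`) and the helper names are assumptions for illustration; this is not TGrep2 code.

```python
# Trees as nested tuples: (label, child, ...); words are plain strings.
subject = ("NP", ("NNP", "Ann"))
pp = ("PP", ("IN", "with"), ("NP", ("JJ", "good"), ("NNS", "grades")))
obj = ("NP", ("NP", ("DT", "the"), ("NNS", "students")), pp)
tree = ("S", subject, ("VP", ("VBD", "praised"), obj))

def label(node):
    return node if isinstance(node, str) else node[0]

def children(node):
    return () if isinstance(node, str) else node[1:]

def descendants(node):
    for child in children(node):
        yield child
        yield from descendants(child)

def immediately_dominates(a, b_label):      # A < B
    return any(label(c) == b_label for c in children(a))

def dominates(a, b_label):                  # A << B
    return any(label(d) == b_label for d in descendants(a))

def find(root, a_label, relation, b_label):
    """All nodes labelled `a_label` that stand in `relation` to some `b_label` node."""
    return [n for n in [root, *descendants(root)]
            if label(n) == a_label and relation(n, b_label)]

np_imm_pp = find(tree, "NP", immediately_dominates, "PP")   # NP < PP
vp_imm_pp = find(tree, "VP", immediately_dominates, "PP")   # VP < PP
vp_dom_pp = find(tree, "VP", dominates, "PP")               # VP << PP
print(len(np_imm_pp), len(vp_imm_pp), len(vp_dom_pp))
```

The VP dominates the PP but does not immediately dominate it, because the object NP intervenes.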

slide-24
SLIDE 24

Combining Search Patterns 1

  • Default interpretation is &

VP < NP < PP

  • a VP that immediately dominates an NP and a PP
  • Ex: (You can [see comets with a telescope])

VP < (NP < PP)

  • a VP that immediately dominates [an NP that immediately dominates a PP]

  • Ex: (I [praised [the students [with good grades]]])
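
The two bracketings pick out different structures, which a short Python sketch can confirm on hand-built trees for the two example verb phrases. The trees and the helper `has_child` are assumptions for illustration, and only the root VP is tested, whereas tgrep2 would look for matches at every node.

```python
def label(node):
    return node if isinstance(node, str) else node[0]

def children(node):
    return () if isinstance(node, str) else node[1:]

def has_child(node, child_label, test=lambda child: True):
    """True if `node` has a child with this label that also passes `test`."""
    return any(label(c) == child_label and test(c) for c in children(node))

# [see comets with a telescope]: the PP is a sister of the NP.
vp_attach = ("VP", ("VB", "see"), ("NP", ("NNS", "comets")),
             ("PP", ("IN", "with"), ("NP", ("DT", "a"), ("NN", "telescope"))))

# [praised [the students [with good grades]]]: the PP is inside the NP.
np_attach = ("VP", ("VBD", "praised"),
             ("NP", ("NP", ("DT", "the"), ("NNS", "students")),
              ("PP", ("IN", "with"), ("NP", ("JJ", "good"), ("NNS", "grades")))))

def flat(vp):        # VP < NP < PP : both links hang off the VP
    return label(vp) == "VP" and has_child(vp, "NP") and has_child(vp, "PP")

def nested(vp):      # VP < (NP < PP) : the second link hangs off the NP
    return label(vp) == "VP" and has_child(vp, "NP", lambda np: has_child(np, "PP"))

results = {
    "see comets": (flat(vp_attach), nested(vp_attach)),
    "praised the students": (flat(np_attach), nested(np_attach)),
}
print(results)
```

Each tree satisfies exactly one of the two patterns, so the parentheses really do decide where the PP attaches.
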
slide-25
SLIDE 25

Combining Search Patterns 2

  • For disjunction, use |

NP < PP | << AP

  • an NP that immediately dominates a PP OR dominates an AP

slide-26
SLIDE 26

Sisterhood and Precedence 1

  • A $ B
  • A is a sister of B
  • A . B
  • A immediately precedes B
  • A .. B
  • A precedes B
  • A , B
  • A immediately follows B
  • A ,, B
  • A follows B
slide-27
SLIDE 27

Sisterhood and Precedence 2

Combining the two:

  • A $. B
  • A is a sister of B and immediately precedes B
  • A $.. B
  • A is a sister of B and precedes B
  • A $, B
  • A is a sister of B and immediately follows B
  • A $,, B
  • A is a sister of B and follows B
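
One way to think about these relations is in terms of the stretch of words each node covers. The Python sketch below records, for every node, its word span and its parent, and then defines precedence and sisterhood over those records. This is an assumed reading of the relations (A precedes B when A's last word comes before B's first word), not TGrep2's actual implementation, and the tree and helper names are made up for illustration.

```python
def index_nodes(tree):
    """One record per node (in left-to-right, top-down order):
    label, word span [start, end), and the index of the parent record."""
    records = []

    def visit(node, parent, start):
        me = len(records)
        records.append(None)                 # reserve this node's slot
        if isinstance(node, str):            # a word covers exactly one position
            lab, end = node, start + 1
        else:
            lab, end = node[0], start
            for child in node[1:]:
                end = visit(child, me, end)
        records[me] = {"label": lab, "start": start, "end": end, "parent": parent}
        return end

    visit(tree, None, 0)
    return records

def sisters(a, b):                  # A $ B
    return a is not b and a["parent"] is not None and a["parent"] == b["parent"]

def precedes(a, b):                 # A .. B
    return a["end"] <= b["start"]

def immediately_precedes(a, b):     # A . B
    return a["end"] == b["start"]

def first(records, lab):
    return next(r for r in records if r["label"] == lab)

vp = ("VP", ("VB", "see"), ("NP", ("NNS", "comets")),
      ("PP", ("IN", "with"), ("NP", ("DT", "a"), ("NN", "telescope"))))
recs = index_nodes(vp)
vb, np, pp, prep = (first(recs, x) for x in ("VB", "NP", "PP", "IN"))

np_sister_imm_pp = sisters(np, pp) and immediately_precedes(np, pp)        # NP $. PP
vb_precedes_pp = precedes(vb, pp) and not immediately_precedes(vb, pp)     # VB .. PP only
np_imm_prep = immediately_precedes(np, prep) and not sisters(np, prep)     # NP . IN, no $
print(np_sister_imm_pp, vb_precedes_pp, np_imm_prep)
```

Following and immediately following are the same tests with the arguments swapped.
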
slide-28
SLIDE 28

Which child?

We can pick out a particular child: from left to right (start at 1):

  • A <N B
  • B is the nth child of A
  • A >N B
  • A is the nth child of B
  • A <1 B

(also used: A <, B)

  • B is the first child of A
  • A >1 B

(also used: A >, B)

  • A is the first child of B
slide-29
SLIDE 29

Which child?

We can pick out a particular child: from right to left (start at -1):

  • A <-N B
  • B is the nth child of A from the right
  • A >-N B
  • A is the nth child of B from the right
  • A <-1 B

(also used: A <` B)

  • B is the last child of A
  • A >-1 B

(also used: A >` B)

  • A is the last child of B
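
Counting children from either end maps naturally onto list indexing. A minimal Python sketch, with a made-up tree and a hypothetical helper `nth_child`:

```python
def nth_child(node, n):
    """Return the nth child of `node`: 1 is the first child, -1 the last.
    Returns None when there is no such child."""
    kids = node[1:]
    if n == 0 or abs(n) > len(kids):
        return None
    return kids[n - 1] if n > 0 else kids[n]

vp = ("VP", ("VB", "see"), ("NP", ("NNS", "comets")),
      ("PP", ("IN", "with"), ("NP", ("DT", "a"), ("NN", "telescope"))))

def child_is(node, n, wanted):
    child = nth_child(node, n)
    return child is not None and child[0] == wanted

checks = {
    "VP <1 VB": child_is(vp, 1, "VB"),
    "VP <2 NP": child_is(vp, 2, "NP"),
    "VP <-1 PP": child_is(vp, -1, "PP"),
    "VP <-1 NP": child_is(vp, -1, "NP"),
    "VP <4 PP": child_is(vp, 4, "PP"),
}
print(checks)
```
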
slide-30
SLIDE 30

Which descendant?

  • A <<, B
  • B is *a* left-most descendant of an A
  • A >>, B
  • A is *a* left-most descendant of a B
  • A <<` B
  • B is *a* right-most descendant of an A
  • A >>` B
  • A is *a* right-most descendant of a B
slide-31
SLIDE 31

Uniqueness

  • A <: B
  • B is the only child of A
  • A >: B
  • A is the only child of B
  • A <<: B
  • there is a single path of descent from A and B is on it
  • A >>: B
  • there is a single path of descent from B and A is on it
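
A Python sketch of the two uniqueness relations on small hand-built trees. The single-path relation is read here as an unbroken chain of only children leading from A down to B; that reading is an assumption, and the helper names are made up for illustration.

```python
def label(node):
    return node if isinstance(node, str) else node[0]

def children(node):
    return () if isinstance(node, str) else node[1:]

def only_child(a, b_label):                 # A <: B
    kids = children(a)
    return len(kids) == 1 and label(kids[0]) == b_label

def on_single_path(a, b_label):             # A <<: B (assumed reading)
    node = a
    while len(children(node)) == 1:         # follow the chain of only children
        node = children(node)[0]
        if label(node) == b_label:
            return True
    return False

np_chairman = ("NP", ("NN", "chairman"))
np_board = ("NP", ("DT", "the"), ("NN", "board"))

checks = {
    "chairman: NP <: NN": only_child(np_chairman, "NN"),
    "chairman: NP <<: chairman": on_single_path(np_chairman, "chairman"),
    "board: NP <: NN": only_child(np_board, "NN"),
    "board: NP <<: board": on_single_path(np_board, "board"),
}
print(checks)
```
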
slide-32
SLIDE 32

Combining Links

NP << (PP . VP)
NP <` (PP <, (IN < on))
S < (A < B) < C
S < ((A < B) < C)
S < (A < B < C)

slide-33
SLIDE 33

! Negating Links !

  • ! before any link relationship negates it

A !.. B

  • A does not precede B

A [< B | . C] [< D | . E]
A [< B | ![. C !, F]] | ![< D !.. E]

slide-34
SLIDE 34

= Naming Node labels =

  • Any node label can be given a name using =

S=foo << (VP .. (PP >> =foo))
NP < (PP=pp < (IN < on)) | < (NP < =pp)

slide-35
SLIDE 35

: Segmented Patterns :

Sometimes it is useful to break patterns up into segments

S < (NP=n1 .. (VP=v < PP)) < (NP=n2 !.. VP)
S < NP=n1 < NP=n2 : =n1 .. VP=v : =v < PP : =n2 !.. VP

S << (VP=v < NP) : =v < /^PP/
S << (VP=v < NP < /^PP/)
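
The idea behind naming a node and testing it again in a later segment can be sketched in Python: bind the VP to a variable first, then apply the remaining condition in a separate step. The tree (including the label `PP-LOC`) is made up for illustration, and the code is an analogy, not TGrep2's matching algorithm.

```python
import re

def label(node):
    return node if isinstance(node, str) else node[0]

def children(node):
    return () if isinstance(node, str) else node[1:]

def nodes(node):
    yield node
    for child in children(node):
        yield from nodes(child)

def has_child(node, pattern):
    """True if some child's label matches the regular expression `pattern`."""
    return any(re.search(pattern, label(c)) for c in children(node))

tree = ("S", ("NP", ("PRP", "I")),
        ("VP", ("VBD", "put"), ("NP", ("DT", "the"), ("NN", "book")),
         ("PP-LOC", ("IN", "on"), ("NP", ("DT", "the"), ("NN", "shelf")))))

def vps_under_s(root):
    """(s, v) pairs where s is an S node and v is a VP it properly dominates."""
    return [(s, v) for s in nodes(root) if label(s) == "S"
            for v in nodes(s) if v is not s and label(v) == "VP"]

# One piece:  S << (VP < NP < /^PP/)
one_piece = [(s, v) for s, v in vps_under_s(tree)
             if has_child(v, r"^NP$") and has_child(v, r"^PP")]

# Segmented:  S << (VP=v < NP) : =v < /^PP/
bound = [(s, v) for s, v in vps_under_s(tree) if has_child(v, r"^NP$")]
segmented = [(s, v) for s, v in bound if has_child(v, r"^PP")]

print(len(one_piece), one_piece == segmented)
```

Both formulations select the same S and VP; the segmented version simply states the second condition on the already-named VP.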

slide-36
SLIDE 36

@ Macros @

  • Patterns that are likely to be re-used can be constructed using macros

@ NP /^NP/;
@ NN /^NN/;
@ CNP @NP=cnp [!< @NP | < @NN] !$.. @NN;

#CNP – core NP

  • Macros make subsequent modification easy!

slide-37
SLIDE 37

Heavy NP Shift

>tgrep -c wsj.t2c 'VP <-1 (NP <: *)' | more
>tgrep -c wsj.t2c 'VP <-1 (NP <: *) < PP'
>tgrep -c wsj.t2c 'VP <2 (PP $. NP=foo) <-1 (=foo <: *)'
>tgrep -c wsj.t2c 'VP <-1 (NP <: *) < PP'
>tgrep -c wsj.t2c 'VP <-1 (NP <: PRP)'
>tgrep -c wsj.t2c 'VP <1 /^V*/ <2 PP <-1 (NP <: PRP)'

slide-38
SLIDE 38

Adjective Ordering

>tgrep -c wsj.t2c '/^NP/ <2 /^JJ/ <3 /^JJ/'
>tgrep -c wsj.t2c '/^NP/ <2 /^JJ/ <3 /^JJ/ <4 /^JJ/ <5 /^JJ/'

(NP (DT the) (JJ first) (JJ negative) (JJ compound) (JJ annual) (NN growth) (NN rate))
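
As a rough check of what the two patterns test, the Python sketch below applies the same positional conditions to the example NP and then lists its adjectives in order. The helper `child_matches` is a hypothetical name, and the code only mirrors the patterns on this one tree; it does not query the corpus.

```python
import re

np = ("NP", ("DT", "the"), ("JJ", "first"), ("JJ", "negative"),
      ("JJ", "compound"), ("JJ", "annual"), ("NN", "growth"), ("NN", "rate"))

def child_matches(node, n, pattern):
    """True if the nth child (counting from 1) has a label matching `pattern`."""
    kids = node[1:]
    return n <= len(kids) and re.search(pattern, kids[n - 1][0]) is not None

is_np = re.search(r"^NP", np[0]) is not None

# /^NP/ <2 /^JJ/ <3 /^JJ/
two_adjectives = is_np and all(child_matches(np, n, r"^JJ") for n in (2, 3))

# /^NP/ <2 /^JJ/ <3 /^JJ/ <4 /^JJ/ <5 /^JJ/
four_adjectives = is_np and all(child_matches(np, n, r"^JJ") for n in (2, 3, 4, 5))

# The adjectives in their surface order, ready for an ordering study.
adjectives = [kid[1] for kid in np[1:] if re.search(r"^JJ", kid[0])]
print(two_adjectives, four_adjectives, adjectives)
```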