Using Treebanks tgrep2 Lecture 2: 07/12/2011 Using Corpora For - - PowerPoint PPT Presentation
Using Treebanks tgrep2 Lecture 2: 07/12/2011 Using Corpora For - - PowerPoint PPT Presentation
Using Treebanks tgrep2 Lecture 2: 07/12/2011 Using Corpora For discovery For evaluation of theories For identifying tendencies distribution of a class of words distribution of structural configurations frequency of a
Using Corpora
- For discovery
- For evaluation of theories
- For identifying tendencies
– distribution of a class of words – distribution of structural configurations – frequency of a certain distribution
Why Treebanks
- Raw corpora are not enough for most
linguistic purposes.
- Let’s start with the rawest of them all: the
web, which I’ll call `Google’
- Convenient
- Potentially inexhaustible
- Varied and free-form
Problems with `Google’
- Quality control
– hard to identify the identity of the author, making it difficult to keep track of variation the text could be computer generated
- What does Google count?
– google counts are notoriously unreliable and change from minute to minute – problem of repeated elements – no clear estimate of sample size, so difficult to go beyond order of magnitude estimations of frequency
Sentences
- Sentences are an important unit of linguistic
- rganization.
- They are not an important unit of organization
for most search engines. Consequently not straightforward to restrict searches to remain within a sentence, a task that is crucial for linguistic purposes.
Selecting Texts
- Using the web/search engine directly is inadequate for
any but the most basic linguistic purposes.
- The next step forward is to judiciously assemble a set of
texts (possibly from the web) and use an appropriate search language – Regular Expressions based systems are fast, easily available, and easy to use.
The need for annotation
Even after the creation of a corpus (a set of texts), there are still many basic linguistic investigations that cannot be conducted.
- generalizing searches – e.g. when we want to
examine all sequences of Det and N
- identifying a subset of cases when there is
ambiguity in part-of-speech e.g. `to’
Step 1: POS tags
- Part of Speech Tagging is the most basic kind
- f annotation.
– POS-tagging makes corpora much more linguistically useful. – POS-tagging can often be done automatically with high reliability, allowing us to use large texts for linguistic purposes.
Step 2: Beyond POS tags
(1) Ann likes Bill and Tim likes Nina. (2) Ann likes Bill and Tim, who are her mentors. Assume you want to search for coordinated noun
- phrases. You want to get (2) but not (1).
- But a search for the POS sequence `Noun Det
Noun’ will catch both.
- We need structural information.
Step 3: Structural Information
- We need structural information for a corpus to
be fully linguistically useful.
- We also need structural information if we
want to train parsers off the corpus.
- The nature of this structural information is
quite underdetermined.
Structural and Other Information
- One could include structural information, leading
to a set of syntactic trees of the familiar sort: hence the term `treebank’
- But the information can be quite different and
there is no commitment that the formal objects involved are `trees’
- Other alternatives: theta-roles, semantic
argument information (PropBank) etc.
Searching a Treebank
- For linguistic purposes, we need a way to extract
linguistically interesting patterns.
- What counts as a `linguistically interesting
pattern’ will vary greatly depending upon your theoretical interests and the nature of the treebank.
- Here we will assume that the formal objects are
trees and discuss a general and powerful way to search for trees with certain properties.
Verbs Accounts
- Class material is in:
/data/home/verbs/shared/LSA7800_076
- To reduce typing, set up a link:
ln –s /data/home/verbs/shared/LSA7800_076 lsa
The Unix you’ll need
- ls –l
lists files in current directory
- cd DIRECTORYNAME
go to designated directory
- cat FILENAME
read file and display on screen, or
- > FILENAME
direct output into designated file
- more FILENAME
scroll designated file
- wc
count number of words/characters in input, provided by
- |
pipe output of one program into another cat FILENAME | wc
Regular Expressions and grep
- Regular Expressions are a powerful and fast
way to express search patterns
- grep
– program for using Regular Expression search patterns – Syntax: grep RegExPattern FILENAME
Very Basic RegEx
- words are RegExs that match themselves as
well as superstrings of themselves
- If A and B are RegEx, then
- 1. AB
- 2. A|B
- 3. A* (also A? and A+)
are also RegEx
Regular expression summary
. Matches any character (aka wildcard) ^abc Matches some pattern abc at the start of a string abc$ Matches some pattern abc at the end of a string [abc] Matches one of a range of characters [A-Z0-9] Matches one of a range of characters ed|ing|s Matches one of the specified strings (aka disjunction) * Zero or more of previous item (aka closure) + One or more of previous item (aka closure) ? Zero or one of the previous item (aka optionality) {n} Exactly n repeats {n,} At least n repeats {,n} At most n repeats {m,n} At least m and at most n repeats a(b|c)+ Parentheses indicate the scope of the operators
TGrep2
- A general way of writing Regular Expressions
- ver trees
- tgrep2, grep for trees, written by Douglas
Rohde, builds upon tgrep
- To use TGrep2, a special TGrep2 corpus file
needs to be created – wsj.t2c in lsa directory, has 49,209 sentences.
TGrep2 Syntax 1
- Basic Pattern Syntax
– Regular expression syntax can be used to select words or node labels: – Ex: /ˆNP/ matches any node label that begins with NP such as NP-SBJ
- Command syntax:
lsa/tgrep –c lsa/wsj.t2c PATTERN
At the beginning
- >lsa/tgrep -c lsa/wsj.t2c 'S << Vinken' | more
(S (NP-SBJ (NP (NNP Pierre) (NNP Vinken)) (, ,) (ADJP (NP (CD 61) (NNS years)) (JJ old)) (, ,)) (VP (MD will) (VP (VB join) (NP (DT the) (NN board)) (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director))) (NP-TMP (NNP Nov.) (CD 29)))) (. .)) (S (NP-SBJ (NNP Mr.) (NNP Vinken)) (VP (VBZ is) (NP-PRD (NP (NN chairman)) (PP (IN of) (NP (NP (NNP Elsevier) (NNP N.V.)) (, ,) (NP (DT the) (NNP Dutch) (VBG publishing) (NN group)))))) (. .))
Tree Relationships
- 1. Immediate domination (<)
- 2. Domination (<<)
- 3. Sisterhood ($)
- 4. Immediate Precedence (.)
- 5. Precedence (..)
- 6. First/nth/last child
- 7. First/Last descendant
Immediate Domination
A < B A is the parent of B, retrieves subtree rooted at A. A > B A is the daughter of B, retrieves subtree rooted at A. NP < PP Matches any NP that immediately dominates a PP.
Domination
A << B A dominates B, retrieves subtree rooted at A. A >> B A is dominated by B, retrieves subtree rooted at A. NP << PP Matches any NP that dominates a PP.
Combining Search Patterns 1
- Default interpretation is &
VP < NP < PP
- a VP that immediately dominates an NP and a PP
- Ex: (You can [see comets with a telescope])
VP < (NP < PP)
- a VP that immediately dominates [an NP that
dominates a PP]
- Ex: (I [praised [the students [with good grades]]])
Combining Search Patterns 2
- For optionality, use |
NP < PP | << AP
- an NP that
immediately dominates a PP OR dominates an AP
Sisterhood and Precedence 1
- A $ B
- A is a sister of B
- A . B
- A immediately precedes B
- A .. B
- A precedes B
- A , B
- A immediately follows B
- A ,, B
- A follows B
Sisterhood and Precedence 2
Combining the two:
- A $. B
– A is a sister of B and immediately precedes B
- A $.. B
- - A is a sister of B and precedes B
- A $, B
- - A is a sister of B and immediately follows B
- A $,, B
- - A is a sister of B and follows B
Which child?
We can pick out a particular child: from left to right (start at 1):
- A <N B
- B is the nth child of A
- A >N B
- A is the nth child of B
- A <1 B
(also used: A <, B)
- B is the first child of A
- A >1 B
(also used: A >, B)
- A is the first child of B
Which child?
We can pick out a particular child: from right to left (start at -1):
- A <-N B
- B is the nth child of A from the right
- A >-N B
- A is the nth child of B from the right
- A <-1 B
(also used: A <‘ B)
- B is the last child of A
- A >-1 B
(also used: A >‘ B)
- A is the last child of B
Which descendant?
- A <<, B
- B is *a* left-most descendant of an A
- A >>, B
- A is *a* left-most descendant of a B
- A <<‘ B
- B is *a* right-most descendant of an A
- A >>’ B
- A is *a* right-most descendant of a B
Uniqueness
- A <: B
- B is the only child of A
- A >: B
- A is the only child of B
- A <<: B
- there is a single path of descent from A and B is on it
- A >>: B
- there is a single path of descent from B and A is on it
Combining Links
NP << (PP . VP) NP <‘ (PP <, (IN < on)) S < (A < B) < C S < ((A < B) < C) S < (A < B < C)
! Negating Links !
- ! before any link relationship negates it
A !.. B
- A does not precede B
A [< B | . C] [< D | . E] A [< B | ![. C !, F]] | ![< D !.. E]
= Naming Node labels =
- Any node label can be given a name using =
S=foo << (VP .. (PP >> =foo)) NP < (PP=pp < (IN < on)) | < (NP < =pp)
: Segmented Patterns :
Sometimes it is useful to break patterns up into segments S < (NP=n1 .. (VP=v < PP)) < (NP=n2 !.. VP) S < NP=n1 < NP=n2 : =n1 .. VP=v : =v < PP : =n2 !.. VP S << (VP=v < NP) : =v < /ˆPP/ S << (VP=v < NP < /ˆPP/)
@ Macros @
- Patterns that are likely to re-used can be