[PPT] - Corpus Linguistics Seminar Resources for Computational Linguists SS PowerPoint Presentation

SLIDE 1

Corpus Linguistics

Seminar "Resources for Computational Linguists", summer semester (SS) 2007
Magdalena Wolska & Michaela Regneri

SLIDE 2

Armchair Linguists vs. Corpus Linguists

Competence vs. Performance

Resources for Comp‘ Linguists 07

Corpus Linguistics - Michaela Regneri & Magdalena Wolska

SLIDE 3

Motivation (for Corpus Linguistics)

SLIDE 4

Outline

Corpora
Annotation
Data Analysis
The Web as Corpus

SLIDE 5

Outline

Corpora
Annotation
Data Analysis
The Web as Corpus

SLIDE 6

Corpus - definition

in principle: every collection of text
(desired or necessary) properties of corpora for linguistic processing:
representativeness
finite size (mostly)
machine-readability
standard reference

SLIDE 7

Corpus - properties

language mode (speech vs. text)
languages and alignment: mono-/bilingual, comparable/parallel
text types (newspapers, novels, phone calls...)
text domains (business, finance, love stories...)
balance: homo-/heterogeneous, balanced/unbalanced
annotation: plain/annotated, annotation type and depth
date / time span (of texts used)
size

SLIDE 8

Outline

Corpora
Annotation
Data Analysis
The Web as Corpus

SLIDE 9

Annotation - principles

linguistic information in a corpus
maxims of annotation (Leech 1993):
removable and extractable annotation
guidelines available to end user
awareness of fallibility (but potential usefulness)
scheme should be based on widely-agreed principles which are theory-neutral

SLIDE 10

Annotation - Data Format

Often variants of XML:

The dog barks.

stand-off:

<sentence>
  <phrase type="NP">
    <word ind="1" pos="det"/>
    <word ind="2" pos="N"/>
  </phrase>
  <phrase type="VP">
    <word ind="3" pos="VI"/>
  </phrase>
</sentence>

inline:

<sentence>
  <phrase type="NP">
    <word pos="det">the</word>
    <word pos="N">dog</word>
  </phrase>
  <phrase type="VP">
    <word pos="VI">barks</word>
  </phrase>
</sentence>
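
A minimal Python sketch of how the two variants above could be read back into the same (phrase type, word, POS) triples. It assumes that the "ind" attribute is a 1-based index into a separately stored token list, which the slide does not spell out; the names TOKENS, read_inline and read_stand_off are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the raw text is stored separately as a token list and
# "ind" is a 1-based index into it.
TOKENS = ["the", "dog", "barks"]

STAND_OFF = """
<sentence>
  <phrase type="NP">
    <word ind="1" pos="det"/>
    <word ind="2" pos="N"/>
  </phrase>
  <phrase type="VP">
    <word ind="3" pos="VI"/>
  </phrase>
</sentence>
"""

INLINE = """
<sentence>
  <phrase type="NP">
    <word pos="det">the</word>
    <word pos="N">dog</word>
  </phrase>
  <phrase type="VP">
    <word pos="VI">barks</word>
  </phrase>
</sentence>
"""


def read_inline(xml_text):
    """Return (phrase type, word, pos) triples; the words sit inside the markup."""
    root = ET.fromstring(xml_text.strip())
    return [
        (phrase.get("type"), word.text, word.get("pos"))
        for phrase in root.iter("phrase")
        for word in phrase.iter("word")
    ]


def read_stand_off(xml_text, tokens):
    """Return the same triples; the words are looked up in the token list."""
    root = ET.fromstring(xml_text.strip())
    return [
        (phrase.get("type"), tokens[int(word.get("ind")) - 1], word.get("pos"))
        for phrase in root.iter("phrase")
        for word in phrase.iter("word")
    ]


inline_triples = read_inline(INLINE)
stand_off_triples = read_stand_off(STAND_OFF, TOKENS)
```

Both readers yield the same triples; the difference is only where the text lives. With stand-off annotation the original text stays untouched, which fits the maxim that annotation should be removable.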

SLIDE 11

Annotation - examples: Treebanks (syntax)

SLIDE 12

Annotation - examples: semantic roles (SALSA)

SLIDE 13

Annotation - examples: discourse structure

SLIDE 14

Annotation - Tools

Graphical UIs for "drawing" annotations, displayed much like the resulting output
Example: RSTTool

SLIDE 15

Outline

Corpora
Annotation
Data Analysis
The Web as Corpus

SLIDE 16

Data Analysis

Word counts (word frequency, "tokens per type")
concordance: same word in different contexts

La Streisand sounded just like the student activist she played in the film T
s pilot's wings, he was judged top student. After his weapons training, he w
with Der Bettelstudent (The Beggar Student) and Gasparone in the fairly rece
S.LOWRY: THE MAN AND HIS ART: As a student and long-time resident of Salford
   Antony Fleat, a second-year law student at Oxford Brookes University; and t
        ung life. This second-year student at Robert Gordon's university in Ab
  erdeen, having matriculated as a student at Robert Gordon's university. In M
, 78, from Harrow, an anthropology student at the University of the Third Ag
  he had enough of London as a law student at University College and the Colle
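
A concordance like the one above can be sketched in a few lines of Python. This is only an illustration, not the tool used for the slide: the function name kwic and the width parameter are assumptions, and matching is done on whole words, case-insensitively.

```python
import re


def kwic(text, keyword, width=30):
    """Return keyword-in-context lines with the keyword aligned in one column."""
    text = " ".join(text.split())  # collapse line breaks and repeated spaces
    pattern = re.compile(r"\b%s\b" % re.escape(keyword), re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so every keyword starts at column `width`.
        lines.append(f"{left:>{width}}{match.group(0)}{right}")
    return lines


sample = "He was judged top student. As a student and resident of Salford."
for line in kwic(sample, "student", width=10):
    print(line)
```

Cutting the context at a fixed number of characters is what produces the truncated words at the edges of each concordance line.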

n-grams: count the frequencies of word combinations of n words

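frequency  bigram (German lemmas)  English gloss
3868       vergehen Jahr           past year
2385       neu Land                new country
2378       letzt Jahr              last year
2296       nah Jahr                next year
1398       erst Mal                first time
1184       kommen Jahr             coming year
1181       jung Mann               young man
1107       groß Teil               large part
 997       lang Zeit               long time
 986       nah Woche               next week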
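
Word counts, tokens per type and n-gram counts can be sketched with the Python standard library. This is a toy illustration on a made-up sentence: tokenization is plain whitespace splitting, and the lemmatization behind the bigram table above is left out.

```python
from collections import Counter


def tokenize(text):
    """Very rough tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return all sequences of n adjacent tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = tokenize("the dog barks and the dog sleeps")

word_freq = Counter(tokens)               # frequency of each type
bigram_freq = Counter(ngrams(tokens, 2))  # frequency of each word pair
tokens_per_type = len(tokens) / len(word_freq)

print(word_freq.most_common(2))
print(bigram_freq.most_common(1))
print(tokens_per_type)
```

On a real corpus the same counters would be fed lemmas instead of word forms to get a table like the one above.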

SLIDE 17

Data Analysis - Information Access

pattern matching with query languages like CQP:

Query:
[lemma="dog"] [pos!="\$.*"]* [pos="NN"] within s;

Examples:
dog for her daughter
dogs on the street
dogs and their leashes
dog with a cruel owner
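
What such a query does can be imitated in plain Python: find a token with lemma "dog", skip over non-punctuation tokens, and stop at the next noun, all within one sentence. This is a rough sketch, not how CQP works internally. The function name, the (word, lemma, pos) triples and the toy tags are assumptions, and the sketch returns only the shortest match per starting token.

```python
import re

# Assumption: punctuation tags start with "$", as in the query's tag pattern.
PUNCTUATION_TAG = re.compile(r"\$.*")


def match_dog_noun(sentence):
    """Find 'dog ... noun' spans in one sentence of (word, lemma, pos) triples."""
    hits = []
    for start, (_, lemma, _) in enumerate(sentence):
        if lemma != "dog":
            continue
        for end in range(start + 1, len(sentence)):
            pos = sentence[end][2]
            if pos == "NN":
                # Shortest span from the dog token to the next noun.
                hits.append(" ".join(word for word, _, _ in sentence[start:end + 1]))
                break
            if PUNCTUATION_TAG.fullmatch(pos):
                # Punctuation is not allowed between dog and the noun.
                break
    return hits


sentence = [
    ("She", "she", "PPER"), ("bought", "buy", "VVFIN"), ("a", "a", "ART"),
    ("dog", "dog", "NN"), ("for", "for", "APPR"), ("her", "her", "PPOSAT"),
    ("daughter", "daughter", "NN"), (".", ".", "$."),
]
print(match_dog_noun(sentence))
```

Passing one sentence at a time plays the role of "within s": a match can never cross a sentence boundary.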

SLIDE 18

Outline

Corpora
Annotation
Data Analysis
The Web as Corpus

SLIDE 19

The web as corpus

the web is a collection of text, thus it is a corpus
the largest available corpus: more than 7.2 * 10^11 words (10 times bigger than the English Gigaword Corpus [Liu and Curran 2006])

nearly all kinds of text and lots of languages present
not preprocessed, lots of ungrammatical (and linguistically useless) text
how to access it?

SLIDE 20

The web as corpus

Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help - but...
lots of repetitions of the same text (not representative)
very limited query precision (no upper/lower case, no punctuation...)
only estimated counts, often hard to reproduce exactly
how to access Google? :) (Google API, scripts)
Alexa: "buy" (parts of) the web and process it on their machines

SLIDE 21

The web as corpus - examples

Extracting and filtering web documents to create linguistically annotated corpora (Baroni and Kilgarriff 2006)
gather documents for different topics (balance!)
exclude documents which cannot be preprocessed with available tools (here taggers and lemmatizers)
exclude documents which seem irrelevant for a corpus (too short or too long, word lists,...)
do this for several languages and make the corpora available
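
The "too short, too long, word lists" filter could look roughly like the sketch below. It is an illustration under stated assumptions, not the procedure from the cited paper: the thresholds, the tiny function-word list and the idea of spotting word lists by their low share of function words are all chosen here for the example. The step that drops documents the tagger or lemmatizer cannot handle is not covered.

```python
# Hypothetical mini list of function words; a real list would be much longer
# and language-specific.
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "it"}


def keep_document(text, min_tokens=5, max_tokens=1000, min_function_ratio=0.1):
    """Return True if the text looks like connected prose of reasonable length."""
    tokens = text.lower().split()
    if not tokens or not (min_tokens <= len(tokens) <= max_tokens):
        return False  # empty, too short or too long
    function_share = sum(token in FUNCTION_WORDS for token in tokens) / len(tokens)
    # Word lists contain hardly any function words, connected text many.
    return function_share >= min_function_ratio


documents = [
    "the dog barks at the cat in the garden",
    "dog cat horse cow sheep goat pig hen",
    "too short",
]
kept = [doc for doc in documents if keep_document(doc)]
print(kept)
```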

SLIDE 22

The web as corpus - examples

Directly using web counts (instead of corpus counts), e.g. VerbOcean (Chklovski & Pantel 2004, see http://semantics.isi.edu/ocean/ )
gather verb pairs which are semantically related but the relation is unknown -> DIRT (Lin and Pantel 2001)
example pair: "love -- marry"
pick a semantic relation (e.g. "happens-before") and design typical patterns for this relation (e.g. "to X and then Y")
instantiate the patterns ("to love and then marry") and count Google hits (here: 6)
estimate whether or not the number of hits indicates a significant correlation, then assign the relation (or not)
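
The pattern steps above can be sketched as follows. This is a simplified illustration, not the VerbOcean implementation: the hit counts come from a stub lookup instead of a search engine, holding only the count of 6 mentioned on the slide, and the fixed threshold is a hypothetical stand-in for the statistical significance estimate.

```python
# One pattern for the "happens-before" relation, written as a format string.
HAPPENS_BEFORE = ["to {x} and then {y}"]


def relation_hits(patterns, verb_x, verb_y, hit_count):
    """Instantiate every pattern with the verb pair and sum the hit counts."""
    return sum(hit_count(pattern.format(x=verb_x, y=verb_y)) for pattern in patterns)


def assign_relation(patterns, verb_x, verb_y, hit_count, threshold):
    """Assign the relation if the summed hits reach a (hypothetical) threshold."""
    return relation_hits(patterns, verb_x, verb_y, hit_count) >= threshold


# Stub in place of a search engine: only the count from the slide is known.
KNOWN_HITS = {"to love and then marry": 6}


def stub_hit_count(query):
    return KNOWN_HITS.get(query, 0)


print(relation_hits(HAPPENS_BEFORE, "love", "marry", stub_hit_count))
print(assign_relation(HAPPENS_BEFORE, "love", "marry", stub_hit_count, threshold=5))
```

Passing the counting function in as a parameter keeps the sketch independent of any particular search engine interface.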

SLIDE 23

References

Thanks to Sabine Schulte im Walde & Magdalena Wolska for some slides
Literature:
McEnery & Wilson (1996): Corpus Linguistics. Edinburgh University Press. (See http://bowland-files.lancs.ac.uk/monkey/ihe/linguistics/contents.htm)
Chklovski & Pantel (2004): VerbOcean: Mining the Web for Fine-Grained Semantic Verb Relations. In Proceedings of EMNLP-04.
Keller (2003): Using the Web to Obtain Frequencies for Unseen Bigrams. Computational Linguistics 29(3): 459-484.
Baroni and Kilgarriff (2006): Large linguistically-processed Web Corpora for multiple languages. In Proceedings of EACL-2006.
Leech (1993): Corpus annotation schemes. Literary and Linguistic Computing 8(4): 275-81.
Lin and Pantel (2001): DIRT - Discovery of Inference Rules from Text. In Proceedings of KDD-01.
Liu and Curran (2006): Web Text Corpus for Natural Language Processing. In Proceedings of EACL-2006.

SLIDE 24

References

Some Corpora:
Brown: http://khnt.hit.uib.no/icame/manuals/brown/INDEX.HTM
LOB: http://khnt.hit.uib.no/icame/manuals/lob/INDEX.HTM
BNC: http://www.natcorp.ox.ac.uk/ (online search: http://thetis.bl.uk/lookup.html)
TIGER: http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/
Penn Treebank: http://www.cis.upenn.edu/~treebank/
Penn Discourse Treebank: http://www.cis.upenn.edu/~pdtb/
Prague Dependency Treebank: http://ufal.mff.cuni.cz/pcedt/
