Corpus Linguistics - Seminar "Resources for Computational Linguists", SS 2007



SLIDE 1

Corpus Linguistics

Seminar "Resources for Computational Linguists", SS 2007, Magdalena Wolska & Michaela Regneri

SLIDE 2

Armchair Linguists vs. Corpus Linguists

Competence vs. Performance

Resources for Comp‘ Linguists 07

Corpus Linguistics - Michaela Regneri & Magdalena Wolska

SLIDE 3

Motivation (for Corpus Linguistics)

SLIDE 4

Outline

  • Corpora
  • Annotation
  • Data Analysis
  • The Web as Corpus

SLIDE 6

Corpus - definition

  • in principle: every collection of text
  • (desired or necessary) properties of corpora for linguistic processing:
      • representativeness
      • finite size (mostly)
      • machine-readability
      • standard reference

SLIDE 7

Corpus - properties

  • language mode (speech vs. text)
  • languages and alignment: mono-/bilingual, comparable/parallel
  • text types (newspapers, novels, phone calls...)
  • text domains (business, finance, love stories...)
  • balance: homo-/heterogeneous, balanced/unbalanced
  • annotation: plain/annotated, annotation type and depth
  • date / time span (of texts used)
  • size

SLIDE 9

Annotation - principles

  • linguistic information in a corpus
  • maxims of annotation (Leech 1993):
      • removable and extractable annotation
      • guidelines available to end user
      • awareness of fallibility (but potential usefulness)
      • scheme should be based on widely-agreed principles which are theory-neutral

SLIDE 10

Annotation - Data Format

  • Often variants of XML:

Example sentence: The dog barks.

stand-off:

    <sentence>
      <phrase type="NP">
        <word ind="1" pos="det"/>
        <word ind="2" pos="N"/>
      </phrase>
      <phrase type="VP">
        <word ind="3" pos="VI"/>
      </phrase>
    </sentence>

inline:

    <sentence>
      <phrase type="NP">
        <word pos="det">the</word>
        <word pos="N">dog</word>
      </phrase>
      <phrase type="VP">
        <word pos="VI">barks</word>
      </phrase>
    </sentence>
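
The relation between the two formats can be sketched in a few lines of Python: the inline variant is parsed, the word forms are moved into a separate token list, and each `<word>` element keeps only a 1-based index (`ind`) pointing into that list. This is a minimal illustration using the standard-library `xml.etree.ElementTree` module; the function name `to_standoff` is made up for this sketch and is not part of any corpus tool.

```python
import xml.etree.ElementTree as ET

# Inline annotation of "the dog barks", as on the slide (straight quotes).
INLINE = """
<sentence>
  <phrase type="NP">
    <word pos="det">the</word>
    <word pos="N">dog</word>
  </phrase>
  <phrase type="VP">
    <word pos="VI">barks</word>
  </phrase>
</sentence>
"""

def to_standoff(inline_xml):
    """Split inline annotation into a token list and a stand-off annotation tree."""
    root = ET.fromstring(inline_xml)
    tokens = []
    for word in root.iter("word"):          # document order = token order
        tokens.append(word.text)
        word.set("ind", str(len(tokens)))   # 1-based index into the token list
        word.text = None                    # the text now lives outside the markup
    return tokens, root

tokens, standoff = to_standoff(INLINE)
print(" ".join(tokens))
print(ET.tostring(standoff, encoding="unicode"))
```

Because the text and the markup end up in separate objects, the annotation stays "removable and extractable" in the sense of Leech's maxims.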

SLIDE 11

Annotation - examples: Treebanks (syntax)

SLIDE 12

Annotation - examples: semantic roles (SALSA)

SLIDE 13

Annotation - examples: discourse structure

SLIDE 14

Annotation - Tools

  • Graphical UIs, similar to output, for "drawing" annotations
  • Example: RSTTool

SLIDE 16

Data Analysis

  • Word counts (word frequency, "tokens per type")
  • concordance: same word in different contexts

    La Streisand sounded just like the student activist she played in the film T
    s pilot's wings, he was judged top student. After his weapons training, he w
    with Der Bettelstudent (The Beggar Student) and Gasparone in the fairly rece
    S.LOWRY: THE MAN AND HIS ART: As a student and long-time resident of Salford
       Antony Fleat, a second-year law student at Oxford Brookes University; and t
            ung life. This second-year student at Robert Gordon's university in Ab
      erdeen, having matriculated as a student at Robert Gordon's university. In M
    , 78, from Harrow, an anthropology student at the University of the Third Ag
      he had enough of London as a law student at University College and the Colle

  • n-grams: count the frequencies of word combinations of n words

    frequency  bigram (German lemmas)
    3868       vergehen Jahr
    2385       neu Land
    2378       letzt Jahr
    2296       nah Jahr
    1398       erst Mal
    1184       kommen Jahr
    1181       jung Mann
    1107       groß Teil
     997       lang Zeit
     986       nah Woche
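
The three analyses on this slide can be sketched with the Python standard library. The sample text and all function names below are invented for illustration, and the tokenizer is deliberately naive (lowercasing plus a regular expression); a real study would run over a tokenized, possibly lemmatized corpus.

```python
import re
from collections import Counter

# Made-up sample text, only for demonstration.
TEXT = (
    "The student read the book. A law student at the university "
    "read the same book, and the student liked the book."
)

def tokenize(text):
    """Lowercase the text and split it into word tokens (naive tokenizer)."""
    return re.findall(r"\w+", text.lower())

def tokens_per_type(tokens):
    """Average number of tokens per distinct word form (type)."""
    return len(tokens) / len(set(tokens))

def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines with `width` tokens on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left:>25} [{tok}] {right}")
    return lines

def ngram_counts(tokens, n):
    """Count every sequence of n adjacent tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

tokens = tokenize(TEXT)
print(Counter(tokens).most_common(3))      # word frequencies
print(round(tokens_per_type(tokens), 2))   # tokens per type
for line in concordance(tokens, "student"):
    print(line)
print(ngram_counts(tokens, 2).most_common(3))
```

Setting `n=2` gives bigram counts like those in the table above; the German counts there were computed over lemmas rather than surface forms.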

SLIDE 17

Data Analysis - Information Access

  • pattern matching with query languages like CQP:

Query:

    [lemma="dog"] [pos!="\$.*"]* [pos="NN"] within s;

Examples:

    dog for her daughter
    dogs on the street
    dogs and their leashes
    dog with a cruel owner
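
To make the semantics of the query concrete, here is a rough Python emulation over one tagged sentence: a token with lemma "dog", then any number of non-punctuation tokens, then a token tagged `NN`, all inside one sentence. This is only a sketch and not how CQP is implemented. The `Token` type, the function name, and the tags in the usage example are assumptions made for illustration, and the sketch always returns the shortest match per start position, which may differ from CQP's configured matching strategy.

```python
from typing import NamedTuple

class Token(NamedTuple):
    word: str
    lemma: str
    pos: str

def match_lemma_to_noun(sentence, lemma="dog", noun_tag="NN", punct_prefix="$"):
    """Find spans: a token with `lemma`, non-punctuation tokens, then a noun.

    `sentence` is a list of Token; working on one sentence at a time plays
    the role of the `within s` restriction.
    """
    hits = []
    for start, tok in enumerate(sentence):
        if tok.lemma != lemma:
            continue
        for end in range(start + 1, len(sentence)):
            nxt = sentence[end]
            if nxt.pos == noun_tag:
                hits.append(" ".join(t.word for t in sentence[start:end + 1]))
                break  # shortest match for this start position
            if nxt.pos.startswith(punct_prefix):
                break  # punctuation (tags starting with "$") blocks the match
    return hits

# Illustrative tags only; a real corpus defines its own tagset.
sentence = [
    Token("dogs", "dog", "NNS"),
    Token("on", "on", "PREP"),
    Token("the", "the", "DET"),
    Token("street", "street", "NN"),
    Token(".", ".", "$."),
]
print(match_lemma_to_noun(sentence))
```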

SLIDE 19

The web as corpus

  • the web is a collection of text, thus it is a corpus
  • the largest available corpus: more than 7.2 × 10^11 words (10 times bigger than the English Gigaword Corpus [Liu and Curran 2006])
  • nearly all kinds of text and lots of languages present
  • not preprocessed, lots of ungrammatical (and linguistically useless) text
  • how to access it?

SLIDE 20

The web as corpus

  • Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help - but...
      • lots of repetitions of the same text (not representative)
      • very limited query precision (no upper/lower case, no punctuation...)
      • only estimated counts, often hard to reproduce exactly
  • how to access Google? :) (Google API, Scripts)
  • Alexa: "buy" (parts of) the web, and process it on their machines

SLIDE 21

The web as corpus - examples

  • Extracting and filtering web documents to create linguistically annotated corpora (Baroni and Kilgarriff 2006)
      • gather documents for different topics (balance!)
      • exclude documents which cannot be preprocessed with available tools (here taggers and lemmatizers)
      • exclude documents which seem irrelevant for a corpus (too short or too long, word lists,...)
      • do this for several languages and make the corpora available
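
The "seems irrelevant" filter could look roughly like the following sketch. The thresholds and the word-list heuristic (very few tokens per non-empty line) are assumptions chosen for illustration; they are not the criteria reported in the cited paper, and the function names are hypothetical.

```python
def looks_like_word_list(text, max_avg_tokens_per_line=2.0):
    """Heuristic: a word list has very few tokens per non-empty line."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return True  # empty documents are treated as useless too
    avg = sum(len(ln.split()) for ln in lines) / len(lines)
    return avg <= max_avg_tokens_per_line

def keep_document(text, min_tokens=50, max_tokens=50_000):
    """Keep a document only if its length is plausible and it is running text.

    min_tokens and max_tokens are illustrative thresholds, not published values.
    """
    n_tokens = len(text.split())
    if n_tokens < min_tokens or n_tokens > max_tokens:
        return False
    return not looks_like_word_list(text)

# Made-up documents, for demonstration only.
docs = {
    "short": "Too short.",
    "wordlist": "\n".join(["apple"] * 100),
    "prose": "The committee met on Monday to discuss the budget for next year . " * 6,
}
kept = [name for name, text in docs.items() if keep_document(text)]
print(kept)
```

In a real pipeline this step would sit between downloading the pages and running the tagger and lemmatizer, so that the expensive tools only see documents worth annotating.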

SLIDE 22

The web as corpus - examples

  • Directly using web counts (instead of corpus counts), e.g. VerbOcean (Chklovski & Pantel 2004, see http://semantics.isi.edu/ocean/ )
      • gather verb pairs which are semantically related but the relation is unknown -> DIRT (Lin and Pantel 2001); example pair: "love -- marry"
      • pick a semantic relation (e.g. "happens-before") and design typical patterns for this relation (e.g. "to X and then Y")
      • instantiate the patterns ("to love and then marry") and count Google hits (here: 6)
      • estimate whether or not the number of hits indicates a significant correlation, then assign the relation (or not)
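
The steps above can be sketched as follows. Everything here is an assumption made for illustration: `count_hits` is a caller-supplied function (no real search-engine API is used), the second pattern template is hypothetical, and the PMI-style score with a fixed threshold merely stands in for whatever significance test is actually applied; it is not claimed to be VerbOcean's exact measure.

```python
import math

# Pattern templates for "happens-before". Only the first appears on the
# slide; the second is a hypothetical addition for this sketch.
HAPPENS_BEFORE = ["to {x} and then {y}", "to {x} and later {y}"]

def instantiate(patterns, x, y):
    """Fill a verb pair into every pattern template."""
    return [p.format(x=x, y=y) for p in patterns]

def association_score(pattern_hits, x_hits, y_hits, total):
    """PMI-style score: observed pattern frequency vs. expectation under independence."""
    if min(pattern_hits, x_hits, y_hits) == 0:
        return float("-inf")
    observed = pattern_hits / total
    expected = (x_hits / total) * (y_hits / total)
    return math.log2(observed / expected)

def assign_relation(x, y, count_hits, total, threshold=0.0, patterns=HAPPENS_BEFORE):
    """Return True if the pattern hits suggest that the relation holds for (x, y).

    count_hits: callable mapping a query string to a hit count (hypothetical).
    total: assumed number of indexed documents or words (hypothetical).
    """
    hits = sum(count_hits(f'"{q}"') for q in instantiate(patterns, x, y))
    return association_score(hits, count_hits(x), count_hits(y), total) > threshold

print(instantiate(HAPPENS_BEFORE, "love", "marry"))
```

Normalizing by the individual verb frequencies matters: six hits for a pattern mean little if both verbs are extremely frequent, and a lot if they are rare.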

SLIDE 23
References

  • Thanks to Sabine Schulte im Walde & Magdalena Wolska for some slides
  • Literature:
      • McEnery & Wilson (1996): Corpus Linguistics. Edinburgh University Press. (See http://bowland-files.lancs.ac.uk/monkey/ihe/linguistics/contents.htm)
      • Chklovski & Pantel (2004): VerbOcean: Mining the Web for Fine-Grained Semantic Verb Relations. In Proceedings of EMNLP-04.
      • Keller (2003): Using the Web to Obtain Frequencies for Unseen Bigrams. Computational Linguistics 29(3): 459–484.
      • Baroni and Kilgarriff (2006): Large linguistically-processed Web Corpora for multiple languages. In Proceedings of EACL-2006.
      • Leech (1993): Corpus annotation schemes. Literary and Linguistic Computing 8(4): 275-81.
      • Lin and Pantel (2001): DIRT – Discovery of Inference Rules from Text. In Proceedings of KDD-01.
      • Liu and Curran (2006): Web Text Corpus for Natural Language Processing. In Proceedings of EACL-2006.

SLIDE 24

References

  • Some Corpora:
  • Brown: http://khnt.hit.uib.no/icame/manuals/brown/INDEX.HTM
  • LOB: http://khnt.hit.uib.no/icame/manuals/lob/INDEX.HTM
  • BNC: http://www.natcorp.ox.ac.uk/ (online search: http://thetis.bl.uk/lookup.html)
  • TIGER: http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/
  • Penn Treebank: http://www.cis.upenn.edu/~treebank/
  • Penn Discourse Treebank: http://www.cis.upenn.edu/~pdtb/
  • Prague Dependency Treebank: http://ufal.mff.cuni.cz/pcedt/
