Corpus Linguistics
Seminar Resources for Computational Linguists, SS 2007
Magdalena Wolska & Michaela Regneri
Armchair Linguists vs. Corpus Linguists
Competence vs. Performance
Motivation (for Corpus Linguistics)
Outline
- Corpora
- Annotation
- Data Analysis
- The Web as Corpus
Corpus - definition
- in principle: any collection of text
- (desired or necessary) properties of corpora for linguistic processing:
  - representativeness
  - finite size (mostly)
  - machine-readability
  - standard reference
Corpus - properties
- language mode (speech vs. text)
- languages and alignment: mono-/bilingual, comparable/parallel
- text types (newspapers, novels, phone calls...)
- text domains (business, finance, love stories...)
- balance: homo-/heterogeneous, balanced/unbalanced
- annotation: plain/annotated, annotation type and depth
- date / time span (of texts used)
- size
Annotation - principles
- annotation: linguistic information added to a corpus
- maxims of annotation (Leech 1993):
  - annotation should be removable and extractable
  - guidelines should be available to the end user
  - awareness of fallibility (but potential usefulness)
  - the scheme should be based on widely agreed, theory-neutral principles
Annotation - Data Format
- Often variants of XML, e.g. for the sentence "The dog barks.":
stand-off:
    <sentence>
      <phrase type="NP">
        <word ind="1" pos="det"/>
        <word ind="2" pos="N"/>
      </phrase>
      <phrase type="VP">
        <word ind="3" pos="VI"/>
      </phrase>
    </sentence>
inline:
    <sentence>
      <phrase type="NP">
        <word pos="det">the</word>
        <word pos="N">dog</word>
      </phrase>
      <phrase type="VP">
        <word pos="VI">barks</word>
      </phrase>
    </sentence>
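The relation between the two formats can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the function name inline_to_standoff is hypothetical, and using the ind attribute as a 1-based pointer into a separate token list is an assumption modelled on the example above. Stand-off schemes in practice often point to character offsets or token IDs in a separate file instead.

```python
import xml.etree.ElementTree as ET

# Inline annotation of "The dog barks." as shown above.
INLINE = """
<sentence>
  <phrase type="NP">
    <word pos="det">the</word>
    <word pos="N">dog</word>
  </phrase>
  <phrase type="VP">
    <word pos="VI">barks</word>
  </phrase>
</sentence>
"""

def inline_to_standoff(inline_xml):
    """Split inline annotation into a plain token list and stand-off markup.

    Hypothetical helper: each <word> loses its text and receives an
    'ind' attribute that points (1-based) into the returned token list.
    """
    root = ET.fromstring(inline_xml.strip())
    tokens = []
    for word in root.iter("word"):
        tokens.append(word.text)
        word.text = None                     # remove the token from the markup
        word.set("ind", str(len(tokens)))    # pointer to the token's position
    return tokens, ET.tostring(root, encoding="unicode")

tokens, standoff = inline_to_standoff(INLINE)
print(tokens)      # the raw text survives independently of the annotation
print(standoff)
```

Keeping the tokens apart from the markup is one way to meet the maxim that annotation should be removable and extractable: the token list is the unannotated text, and the markup can be replaced without touching it.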
Annotation - examples: Treebanks (syntax)
Annotation - examples: semantic roles (SALSA)
Annotation - examples: discourse structure
Annotation - Tools
- graphical user interfaces for "drawing" annotations, with a display similar to the final output
- Example: RSTTool
Data Analysis
- word counts (word frequency, "tokens per type")
- concordance: the same word in different contexts
  La Streisand sounded just like the student activist she played in the film T
  s pilot's wings, he was judged top student. After his weapons training, he w
  with Der Bettelstudent (The Beggar Student) and Gasparone in the fairly rece
  S.LOWRY: THE MAN AND HIS ART: As a student and long-time resident of Salford
     Antony Fleat, a second-year law student at Oxford Brookes University; and t
          ung life. This second-year student at Robert Gordon's university in Ab
    erdeen, having matriculated as a student at Robert Gordon's university. In M
  , 78, from Harrow, an anthropology student at the University of the Third Ag
    he had enough of London as a law student at University College and the Colle
- n-grams: count the frequencies of word combinations of n words
  (example: lemmatized German bigrams with their frequencies; English glosses added)
  3868  vergehen Jahr  (past year)
  2385  neu Land       (new state)
  2378  letzt Jahr     (last year)
  2296  nah Jahr       (next year)
  1398  erst Mal       (first time)
  1184  kommen Jahr    (coming year)
  1181  jung Mann      (young man)
  1107  groß Teil      (large part)
   997  lang Zeit      (long time)
   986  nah Woche      (next week)
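The three analyses above can be sketched in plain Python. This is a minimal illustration, not the tooling used for the examples on the slide: the function names, the crude regular-expression tokenizer, and the toy text are all assumptions, and sentence boundaries are ignored when counting n-grams.

```python
import re
from collections import Counter

def tokenize(text):
    """Very rough tokenizer (assumption): lowercase alphanumeric runs."""
    return re.findall(r"\w+", text.lower())

def tokens_per_type(tokens):
    """Average number of occurrences (tokens) per distinct word form (type)."""
    return len(tokens) / len(set(tokens))

def concordance(tokens, keyword, width=3):
    """Keyword-in-context lines: (left context, keyword, right context)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

def ngram_counts(tokens, n=2):
    """Frequencies of all sequences of n adjacent tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

text = "The student met a law student. The law student left."
tokens = tokenize(text)

print(Counter(tokens).most_common(3))           # word frequencies
print(tokens_per_type(tokens))                  # 10 tokens / 6 types
for left, kw, right in concordance(tokens, "student", width=2):
    print(f"{left:>10} {kw} {right}")           # aligned on the keyword
print(ngram_counts(tokens, 2).most_common(1))   # most frequent bigram
```

For real corpora, the n-gram counts would normally be computed per sentence and on lemmas rather than word forms, as in the German example.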
Data Analysis - Information Access
- pattern matching with query languages like CQP:
  Query: [lemma="dog"] [pos!="\$.*"]* [pos="NN"] within s;
  Examples:
    dog for her daughter
    dogs on the street
    dogs and their leashes
    dog with a cruel owner
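To show what the query asks for, here is a small Python sketch that mimics it on a list of annotated tokens. It is not CQP and does not reproduce CQP's matching strategy: the function find_matches, the token dictionaries, and the tag names other than NN are hypothetical, and the sketch simply returns the shortest span per starting token, which is what the examples above look like. The only assumptions taken from the query are that punctuation tags start with "$" and that a match may not cross a sentence boundary.

```python
def is_punct(token):
    # Assumption taken from the query: punctuation tags match "\$.*",
    # i.e. they start with a dollar sign.
    return token["pos"].startswith("$")

def find_matches(sentence, lemma="dog", end_pos="NN"):
    """Shortest spans inside one sentence: a token with the given lemma,
    any number of non-punctuation tokens, then a token tagged end_pos."""
    matches = []
    for start, tok in enumerate(sentence):
        if tok["lemma"] != lemma:
            continue
        for end in range(start + 1, len(sentence)):
            if sentence[end]["pos"] == end_pos:
                matches.append([t["word"] for t in sentence[start:end + 1]])
                break
            if is_punct(sentence[end]):
                break   # punctuation in between: no match from this start
    return matches

def tok(word, lemma, pos):
    return {"word": word, "lemma": lemma, "pos": pos}

# Two hand-annotated toy sentences (tags are illustrative only).
s1 = [tok("She", "she", "PRON"), tok("bought", "buy", "VERB"),
      tok("a", "a", "DET"), tok("dog", "dog", "NN"),
      tok("for", "for", "PREP"), tok("her", "her", "POSS"),
      tok("daughter", "daughter", "NN"), tok(".", ".", "$.")]
s2 = [tok("The", "the", "DET"), tok("dog", "dog", "NN"),
      tok(",", ",", "$,"), tok("however", "however", "ADV"),
      tok(",", ",", "$,"), tok("slept", "sleep", "VERB"),
      tok(".", ".", "$.")]

print(find_matches(s1))   # [['dog', 'for', 'her', 'daughter']]
print(find_matches(s2))   # []
```

Passing one sentence at a time plays the role of "within s" in the query.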
The web as corpus
- the web is a collection of text, thus it is a corpus
- the largest available corpus: more than 7.2 * 10^11 words (10 times bigger than the English Gigaword Corpus [Liu and Curran 2006])
- nearly all kinds of text and many languages are present
- not preprocessed; lots of ungrammatical (and linguistically useless) text
- how to access it?
The web as corpus
- Document counts have been shown to correlate with "real" frequencies (Keller and Lapata 2003), so search engines can help - but...
  - lots of repetitions of the same text (not representative)
  - very limited query precision (no upper/lower case, no punctuation...)
  - only estimated counts, often hard to reproduce exactly
- how to access Google? :) (Google API, scripts)
- Alexa: "buy" (parts of) the web and process it on their machines
The web as corpus - examples
- Extracting and filtering web documents to create linguistically annotated corpora (Baroni and Kilgarriff 2006)
  - gather documents for different topics (balance!)
  - exclude documents which cannot be preprocessed with available tools (here: taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short or too long, word lists,...)
  - do this for several languages and make the corpora available
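The filtering step can be illustrated with a short Python sketch. It is a rough approximation under stated assumptions, not the procedure from the cited paper: the function keep_document, the length thresholds, the tiny function-word list, and the idea of spotting word lists by their low share of function words are all chosen here for illustration. The check whether a document can be tagged and lemmatized is left out.

```python
# Small English function-word list (illustrative, far from complete).
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "it"}

def keep_document(text, min_tokens=5, max_tokens=1000, min_function_ratio=0.2):
    """Return True if the document looks like usable running text.

    All thresholds are hypothetical and would need tuning on real data.
    """
    tokens = text.lower().split()
    if not (min_tokens <= len(tokens) <= max_tokens):
        return False                       # too short or too long
    function_ratio = sum(t in FUNCTION_WORDS for t in tokens) / len(tokens)
    return function_ratio >= min_function_ratio   # word lists score near 0

docs = {
    "article": "the dog barks at the postman and the postman runs to the gate",
    "stub": "dog barks",
    "word_list": "aardvark abacus abandon abbey abbot abdomen abduct",
}
kept = [name for name, text in docs.items() if keep_document(text)]
print(kept)   # ['article']
```

Running the same filter per language would require a function-word list for each language.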
The web as corpus - examples
- Directly using web counts (instead of corpus counts), e.g. VerbOcean (Chklovski & Pantel 2004, see http://semantics.isi.edu/ocean/ )
  - gather verb pairs which are semantically related but whose relation is unknown
    -> DIRT (Lin and Pantel 2001); example pair: "love -- marry"
  - pick a semantic relation (e.g. "happens-before") and design typical patterns for this relation (e.g. "to X and then Y")
  - instantiate the patterns ("to love and then marry") and count Google hits (here: 6)
  - estimate whether or not the number of hits indicates a significant correlation, then assign the relation (or not)
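The pattern-instantiation steps can be sketched as follows. This is a toy illustration, not the VerbOcean implementation: the search engine is replaced by a stub dictionary that contains only the one hit count given above, the names PATTERNS, hit_count, relation_hits, and assign_relation are hypothetical, and a fixed threshold stands in for the significance estimate, which the slide leaves open.

```python
# Patterns per relation; only the example pattern from the slide is listed.
PATTERNS = {
    "happens-before": ["to {x} and then {y}"],
}

# Stub for a search engine: phrase -> number of hits.
# Only the count from the slide is known; everything else counts as 0.
HIT_COUNTS = {"to love and then marry": 6}

def hit_count(phrase):
    """Placeholder for a web query (assumption: exact-phrase search)."""
    return HIT_COUNTS.get(phrase, 0)

def relation_hits(x, y, relation):
    """Instantiate all patterns of a relation for the pair and sum the hits."""
    return sum(hit_count(p.format(x=x, y=y)) for p in PATTERNS[relation])

def assign_relation(x, y, relation, threshold=5):
    """Hypothetical decision rule: a fixed hit threshold instead of a
    proper significance test."""
    return relation_hits(x, y, relation) >= threshold

print(relation_hits("love", "marry", "happens-before"))    # 6
print(assign_relation("love", "marry", "happens-before"))  # True
print(assign_relation("marry", "love", "happens-before"))  # False
```

A realistic version would also have to normalize the hit counts by how frequent each verb and each pattern is on its own, since raw counts favor common verbs.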
References
- Thanks to Sabine Schulte im Walde & Magdalena Wolska for some slides
- Literature:
  - McEnery & Wilson (1996): Corpus Linguistics. Edinburgh University Press. (See http://bowland-files.lancs.ac.uk/monkey/ihe/linguistics/contents.htm)
  - Chklovski & Pantel (2004): VerbOcean: Mining the Web for Fine-Grained Semantic Verb Relations. In Proceedings of EMNLP-04.
  - Keller and Lapata (2003): Using the Web to Obtain Frequencies for Unseen Bigrams. Computational Linguistics 29(3): 459–484.
  - Baroni and Kilgarriff (2006): Large linguistically-processed Web Corpora for multiple languages. In Proceedings of EACL-2006.
  - Leech (1993): Corpus annotation schemes. Literary and Linguistic Computing 8(4): 275-81.
  - Lin and Pantel (2001): DIRT – Discovery of Inference Rules from Text. In Proceedings of KDD-01.
  - Liu and Curran (2006): Web Text Corpus for Natural Language Processing. In Proceedings of EACL-2006.
References
- Some Corpora:
  - Brown: http://khnt.hit.uib.no/icame/manuals/brown/INDEX.HTM
  - LOB: http://khnt.hit.uib.no/icame/manuals/lob/INDEX.HTM
  - BNC: http://www.natcorp.ox.ac.uk/ (online search: http://thetis.bl.uk/lookup.html)
  - TIGER: http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/
  - Penn Treebank: http://www.cis.upenn.edu/~treebank/
  - Penn Discourse Treebank: http://www.cis.upenn.edu/~pdtb/
  - Prague Dependency Treebank: http://ufal.mff.cuni.cz/pcedt/