ACCESSING TEXT CORPORA AND LEXICAL RESOURCES

Natural Language Processing with Python
CS372: Spring, 2013, Lecture 3 (2013-03-12)
Jong C. Park
Department of Computer Science
Korea Advanced Institute of Science and Technology



○ Accessing Text Corpora
○ Conditional Frequency Distributions

Introduction

○ Questions
  • What are some useful text corpora and lexical resources, and how can we access them with Python?
  • Which Python constructs are most helpful for this work?
  • How do we avoid repeating ourselves when writing Python code?

Accessing Text Corpora

○ Gutenberg Corpus
○ Web and Chat Text
○ Brown Corpus
○ Reuters Corpus
○ Inaugural Address Corpus
○ Annotated Text Corpora
○ Corpora in Other Languages
○ Text Corpus Structure
○ Loading Your Own Corpus

Gutenberg Corpus

○ The Project Gutenberg electronic text archive contains some 25,000 free electronic books. http://www.gutenberg.org/

Web and Chat Text

○ NLTK’s small collection of web text includes content from a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal advertisements, and wine reviews.
○ A corpus of instant messaging chat sessions, originally collected by the Naval Postgraduate School for research on automatic detection of Internet predators.

Brown Corpus

○ The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University.

Reuters Corpus

○ The Reuters Corpus contains 10,788 news documents totaling 1.3 million words.
  • The documents have been classified into 90 topics, and grouped into two sets, called “training” and “test”.
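The training/test grouping can be illustrated with a small sketch. It assumes that file identifiers carry the set name as a prefix (NLTK's Reuters identifiers look like 'test/14826'); the helper `split_sets` and the four-item sample list are hypothetical stand-ins, not part of the lecture.

```python
# Hypothetical sample of file identifiers; the assumption is that each
# identifier is prefixed with the set it belongs to, e.g. 'test/14826'.
fileids = ["test/14826", "test/14828", "training/9865", "training/9880"]


def split_sets(fileids):
    """Group file identifiers by their 'training' or 'test' prefix."""
    sets = {"training": [], "test": []}
    for fileid in fileids:
        prefix = fileid.split("/", 1)[0]
        sets[prefix].append(fileid)
    return sets


sets = split_sets(fileids)
print(len(sets["training"]), len(sets["test"]))  # 2 2
```

Keeping the split in the identifier itself means a classifier can be trained and evaluated without any extra metadata file.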


Inaugural Address Corpus

Annotated Text Corpora

○ Many text corpora contain linguistic annotations, representing part-of-speech tags, named entities, syntactic structures, semantic roles, and so forth.
○ Consult http://www.nltk.org/data for information about downloading them.


Corpora in Other Languages

Text Corpus Structure

○ Common structures
  • Isolated, Categorized, Overlapping, Temporal
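As a rough illustration of the "categorized" and "overlapping" structures, the sketch below maps files to categories and back using plain dictionaries. The file names, category labels, and the helpers `categories` and `fileids` are made up for this example; they only mimic the kind of lookups a categorized corpus supports.

```python
from collections import defaultdict

# Hypothetical file-to-category assignments. A file may carry several
# categories, which is what makes the structure "overlapping".
file_categories = {
    "doc1.txt": ["grain", "wheat"],
    "doc2.txt": ["wheat"],
    "doc3.txt": ["trade"],
}

# Invert the mapping so that files can also be looked up by category.
category_files = defaultdict(list)
for fileid, cats in file_categories.items():
    for cat in cats:
        category_files[cat].append(fileid)


def categories(fileid):
    """Return the categories assigned to one file."""
    return sorted(file_categories[fileid])


def fileids(category):
    """Return the files that belong to one category."""
    return sorted(category_files[category])


print(categories("doc1.txt"))  # ['grain', 'wheat']
print(fileids("wheat"))        # ['doc1.txt', 'doc2.txt']
```

In an isolated structure the category mapping would be absent, and in a strictly categorized (non-overlapping) one every file would have exactly one category.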

Loading Your Own Corpus

○ Load your own collection of text files.

>>> from nltk.corpus import PlaintextCorpusReader
>>> # Directory holding the text files; '.*' matches every file in it.
>>> corpus_root = '/usr/share/dict'
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*')
>>> wordlists.fileids()
['Readme', 'connectives', 'propernames', 'web2', 'web2a', 'words']
>>> wordlists.words('connectives')
['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]

○ Another example

>>> from nltk.corpus import BracketParseCorpusReader
>>> # Local copy of the Penn Treebank; the pattern selects the parsed .mrg files.
>>> corpus_root = r"C:\corpora\penntreebank\parsed\mrg\wsj"
>>> file_pattern = r".*/wsj_.*\.mrg"
>>> ptb = BracketParseCorpusReader(corpus_root, file_pattern)
>>> ptb.fileids()
['00/wsj_0001.mrg', '00/wsj_0002.mrg', '00/wsj_0003.mrg', ...]
>>> len(ptb.sents())
49208
>>> ptb.sents(fileids='20/wsj_2013.mrg')[19]
['The', '55-year-old', 'Mr.', 'Noriega', 'is', "n't", 'as', 'smooth', 'as', 'the', 'shah', 'of', 'Iran', ',', 'as', 'well-born', 'as', 'Nicaragua', "'s", ...]

Conditional Frequency Distributions

○ Conditions and Events
○ Counting Words by Genre
○ Plotting and Tabulating Distributions
○ Generating Random Text with Bigrams

Conditions and Events

○ While a frequency distribution counts observable events, a conditional frequency distribution needs to pair each event with a condition.

>>> text = ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> pairs = [('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand'), ('news', 'Jury'), ('news', 'said'), ...]
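The pairing of conditions with events can be sketched with the standard library alone. NLTK offers `ConditionalFreqDist` for this; the version below swaps in a `defaultdict` of `Counter` objects so that it runs without NLTK. The 'romance' pairs and the repeated 'The' are invented sample data.

```python
from collections import Counter, defaultdict

# (condition, event) pairs in the shape described above; the 'romance'
# entries and the repeated 'The' are made-up illustrative data.
pairs = [
    ("news", "The"), ("news", "Fulton"), ("news", "County"),
    ("news", "The"),
    ("romance", "They"), ("romance", "neither"), ("romance", "They"),
]

# One frequency distribution (Counter) per condition.
cfd = defaultdict(Counter)
for condition, event in pairs:
    cfd[condition][event] += 1

print(sorted(cfd))         # ['news', 'romance']
print(cfd["news"]["The"])  # 2
```

Each condition owns an ordinary frequency distribution, so everything that works on a single distribution (counts, most common items) is available per condition.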


Counting Words by Genre
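A minimal sketch of counting words by genre, again using only the standard library. The two tiny "genres", their tokens, and the helper `tabulate` are assumptions made for illustration; with a real categorized corpus the genre would come from the corpus categories and the words from its files.

```python
from collections import Counter, defaultdict

# Hypothetical tokenized texts keyed by genre (stand-ins for corpus categories).
genre_words = {
    "news": ["He", "will", "say", "they", "can", "and", "will", "go"],
    "romance": ["She", "could", "not", "say", "if", "he", "could", "or", "would"],
}

# The genre is the condition and the (lowercased) word is the event.
pairs = [(genre, word.lower())
         for genre, words in genre_words.items()
         for word in words]

cfd = defaultdict(Counter)
for genre, word in pairs:
    cfd[genre][word] += 1


def tabulate(cfd, conditions, samples):
    """Format counts as a plain-text table with one row per condition."""
    width = max(len(s) for s in samples) + 1
    label_width = max(len(c) for c in conditions)
    lines = [" " * label_width + "".join(s.rjust(width) for s in samples)]
    for c in conditions:
        row = "".join(str(cfd[c][s]).rjust(width) for s in samples)
        lines.append(c.ljust(label_width) + row)
    return "\n".join(lines)


modals = ["can", "could", "will", "would"]
print(tabulate(cfd, ["news", "romance"], modals))
```

Restricting the table to a short list of samples (here, modal verbs) keeps the comparison across genres readable.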

Plotting and Tabulating Distributions

○ See the Inaugural Address Corpus slides (pages 17 and 18 of this lecture note).
○ See the Corpora in Other Languages slides (pages 23 and 24 of this lecture note).

Generating Random Text with Bigrams

○ Create a table of bigrams using a conditional frequency distribution.
○ Example 2-1. Generating random text
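The listing for Example 2-1 is not reproduced in this note, so the following is only a sketch of the idea under stated assumptions: the bigram table is a conditional frequency distribution whose condition is the first word and whose event is the word that follows it, and generation always picks the most frequent follower. The function names, the toy sentence, and the standard-library data structures are illustrative choices, not the lecture's code.

```python
from collections import Counter, defaultdict


def bigram_table(words):
    """Map each word to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        table[first][second] += 1
    return table


def generate(table, word, num=8):
    """Emit up to num words, always moving to the most frequent follower."""
    out = []
    for _ in range(num):
        out.append(word)
        followers = table.get(word)
        if not followers:
            break  # the word never appears with a successor
        word = followers.most_common(1)[0][0]
    return out


# Toy token list (made up for this sketch).
words = "the cat sat on the mat and the cat sat on the hat".split()
table = bigram_table(words)
print(" ".join(generate(table, "the")))  # the cat sat on the cat sat on
```

Because the most frequent follower is chosen every time, the output is deterministic and quickly falls into a loop; sampling followers in proportion to their counts would give more varied text.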

Summary

○ Accessing Text Corpora
  • Gutenberg Corpus
  • Web and Chat Text
  • Brown Corpus
  • Reuters Corpus
  • Inaugural Address Corpus
  • Annotated Text Corpora
  • Corpora in Other Languages
  • Text Corpus Structure
  • Loading Your Own Corpus
○ Conditional Frequency Distributions
  • Conditions and Events
  • Counting Words by Genre
  • Plotting and Tabulating Distributions
  • Generating Random Text with Bigrams

Homework #1

○ Due: 22 March, 2013 (midnight)
○ Problems
  • Exercises: 1.4, 1.19, 1.20, 1.22, 1.24, 1.26
  • Your Turn: Pages 6, 8, 24
○ Submission
  • Send a message to cs372@nlp.kaist.ac.kr with a Word file that includes answers and/or Python code, with Subject: [CS372] HW#1, Your Name.