Introduction to Natural Language Processing Steven Bird Ewan Klein - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Introduction to Natural Language Processing

Steven Bird Ewan Klein Edward Loper

University of Melbourne, AUSTRALIA University of Edinburgh, UK University of Pennsylvania, USA

August 27, 2008

slide-2
SLIDE 2

Knowledge and Communication in Language

  • human knowledge, human communication, expressed in language
  • language technologies: process human language automatically
      • handheld devices: predictive text, handwriting recognition
      • web search engines: access to information locked up in text
  • two facets of the multilingual information society:
      • natural human-machine interfaces
      • access to stored information
slide-9
SLIDE 9

Problem

  • awash with language data
  • inadequate tools (will this ever change?)
  • overheads: Perl, Prolog, Java
  • Natural Language Toolkit (NLTK) as a solution
slide-13
SLIDE 13

NLTK: What you get...

  • Book
  • Documentation
  • FAQ
  • Installation instructions for Python, NLTK, data
  • Distributions: Windows, Mac OS X, Unix, data, documentation
  • CD-ROM: Python, NLTK, documentation, third-party libraries for numerical processing and visualization, instructions
  • Mailing lists: nltk-announce, nltk-devel, nltk-users, nltk-portuguese

slide-20
SLIDE 20

NLTK: Who it is for...

  • people who want to learn how to write programs to analyze written language
  • does not presume programming abilities:
      • working examples
      • graded exercises
  • experienced programmers:
      • quickly learn Python (if necessary)
      • Python features for NLP
      • NLP algorithms and data structures
slide-30
SLIDE 30

NLTK: What you will learn...

1 how to analyze language data
2 key concepts from linguistic description and analysis
3 how linguistic knowledge is used in NLP components
4 data structures and algorithms used in NLP and linguistic data management
5 standard corpora and their use in formal evaluation
6 organization of the field of NLP
7 skills in Python programming for NLP

slide-37
SLIDE 37

NLTK: Your likely goals...

Goals, by background:

Language Analysis
  • Arts and Humanities: programming to manage language data, explore linguistic models, and test empirical claims
  • Science and Engineering: language as a source of interesting problems in data modeling, data mining, and knowledge discovery

Language Technology
  • Arts and Humanities: learning to program, with applications to familiar problems, to work in language technology or other technical field
  • Science and Engineering: knowledge of linguistic algorithms and data structures for high quality, maintainable language processing software

slide-38
SLIDE 38

Philosophy

  • practical
  • programming
  • principled
  • pragmatic
  • pleasurable
  • portal
slide-44
SLIDE 44

Structure

  • Three parts:
      1 Basics: text processing, tokenization, tagging, lexicons, language engineering, text classification
      2 Parsing: phrase structure, trees, grammars, chunking, parsing
      3 Advanced Topics: selected topics in greater depth: feature-based grammar, unification, semantics, linguistic data management
  • each part: chapter on programming; three chapters on NLP
  • each chapter: motivation, sections, graded exercises, summary, further reading

slide-50
SLIDE 50

Python: Key Features

  • simple yet powerful, shallow learning curve
  • object-oriented: encapsulation, re-use
  • scripting language, facilitates interactive exploration
  • excellent functionality for processing linguistic data
  • extensive standard library, including graphics, web, numerical processing
  • can be downloaded for free from http://www.python.org/
slide-56
SLIDE 56

Python Example

import sys
for line in sys.stdin.readlines():      # read every line from standard input
    for word in line.split():           # split each line on whitespace
        if word.endswith('ing'):
            print(word)

1 whitespace: nesting lines of code; scope
2 object-oriented: attributes, methods (e.g. line)
3 readable
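
As a minimal sketch, the same loop can be wrapped in a function so that it works on any sequence of lines rather than only on standard input. The function name `words_ending_with`, its `suffix` parameter, and the sample sentences are assumptions made for illustration; they are not part of the original slide. The sketch is written for Python 3.

```python
def words_ending_with(lines, suffix="ing"):
    """Return every whitespace-separated word in `lines` that ends with `suffix`."""
    matches = []
    for line in lines:                 # outer block: one line of text at a time
        for word in line.split():      # nested block: indentation marks the scope
            if word.endswith(suffix):  # endswith() is a method of the string object
                matches.append(word)
    return matches


# Invented sample text, used only to exercise the function.
sample = ["She was singing and dancing", "all evening long"]
print(words_ending_with(sample))
```

Returning a list instead of printing inside the loop keeps the search separate from the output, which makes the function easier to reuse and to check.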

slide-57
SLIDE 57

Comparison with Perl

while (<>) {
    foreach my $word (split) {
        if ($word =~ /ing$/) {
            print "$word\n";
        }
    }
}

1 syntax is obscure: what are: <> $ my split ?
2 “it is quite easy in Perl to write programs that simply look like raving gibberish, even to experienced Perl programmers” (Hammond, Perl Programming for Linguists, 2003:47)
3 large programs difficult to maintain, reuse
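
For a closer side-by-side comparison, the Perl pattern match `/ing$/` can also be written with Python's standard `re` module. This is a sketch only: the names `ING_AT_END` and `ing_words` and the sample sentence are invented for illustration.

```python
import re

# Same pattern as the Perl /ing$/: the letters "ing" at the end of the string.
ING_AT_END = re.compile(r"ing$")


def ing_words(text):
    """Return the whitespace-separated words in `text` that match the pattern."""
    return [word for word in text.split() if ING_AT_END.search(word)]


print(ing_words("sing a song while walking"))
```

For a fixed suffix, `word.endswith('ing')` is the simpler choice; a regular expression is worth using once the pattern is more than a literal string.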

slide-58
SLIDE 58

What NLTK adds to Python

NLTK defines a basic infrastructure that can be used to build NLP programs in Python. It provides:

  • Basic classes for representing data relevant to natural language processing
  • Standard interfaces for performing tasks, such as tokenization, tagging, and parsing
  • Standard implementations for each task, which can be combined to solve complex problems
  • Demonstrations (parsers, chunkers, chatbots)
  • Extensive documentation, including tutorials and reference documentation
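
The idea of standard interfaces with interchangeable implementations can be sketched in plain Python. The classes below are hypothetical and are not NLTK's own classes or API; the tag labels and the suffix rules are deliberately crude and serve only to show how components that share a method signature can be swapped and combined.

```python
import re


class WhitespaceSplitter:
    """Hypothetical tokenizer: split on runs of whitespace."""

    def tokenize(self, text):
        return text.split()


class PunctuationSplitter:
    """Hypothetical tokenizer: separate words from punctuation marks."""

    def tokenize(self, text):
        return re.findall(r"\w+|[^\w\s]", text)


class SuffixTagger:
    """Hypothetical tagger: guess an illustrative tag from the word ending."""

    def tag(self, tokens):
        tagged = []
        for token in tokens:
            if not any(ch.isalnum() for ch in token):
                tagged.append((token, "PUNCT"))   # pure punctuation
            elif token.endswith("ing"):
                tagged.append((token, "VBG"))
            elif token.endswith("s"):
                tagged.append((token, "NNS"))
            else:
                tagged.append((token, "NN"))      # default guess
        return tagged


def pipeline(text, tokenizer, tagger):
    """Combine any tokenizer with any tagger that follow the interfaces above."""
    return tagger.tag(tokenizer.tokenize(text))


text = "dogs barking, loudly."
print(pipeline(text, WhitespaceSplitter(), SuffixTagger()))
print(pipeline(text, PunctuationSplitter(), SuffixTagger()))
```

Because both tokenizers expose the same `tokenize` method and the tagger always returns a list of (word, tag) pairs, either tokenizer can be dropped into the pipeline without changing the tagger. Comparing the two printed results shows how the choice of tokenizer affects the tags downstream.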

slide-63
SLIDE 63

NLTK Design: Requirements

1 simplicity: intuitive framework with substantial building blocks
2 consistency: uniform data structures, interfaces; predictability
3 extensibility: accommodates new components (replicate vs extend existing functionality)
4 modularity: interaction between components
5 well-documented: substantial documentation

slide-68
SLIDE 68

NLTK Design: Non-requirements

1 encyclopedic: has many gaps; opportunity for students to extend it
2 efficiency: not highly optimised for runtime performance
3 programming tricks: avoided in favor of clear implementations (replicate vs extend existing functionality)

slide-71
SLIDE 71

Corpora Distributed with NLTK

  • Australian ABC News, 2 genres, 660k words, sentence-segmented
  • Brown Corpus, 15 genres, 1.15M words, tagged
  • CMU Pronouncing Dictionary, 127k entries
  • CoNLL 2000 Chunking Data, 270k words, tagged and chunked
  • CoNLL 2002 Named Entity, 700k words, pos- and named-entity-tagged (Dutch, Spanish)
  • Floresta Treebank, 9k sentences (Portuguese)
  • Genesis Corpus, 6 texts, 200k words, 6 languages
  • Gutenberg (sel), 14 texts, 1.7M words
  • Indian POS-Tagged Corpus, 60k words pos-tagged (Bangla, Hindi, Marathi, Telugu)
  • NIST 1999 Info Extr (sel), 63k words, newswire and named-entity SGML markup
  • Names Corpus, 8k male and female names
  • PP Attachment Corpus, 28k prepositional phrases, tagged as noun or verb modifiers
  • Presidential Addresses, 485k words, formatted text
  • Roget’s Thesaurus, 200k words, formatted text
  • SEMCOR, 880k words, part-of-speech and sense tagged
  • SENSEVAL 2, 600k words, part-of-speech and sense tagged
  • Shakespeare XML Corpus (sel), 8 books
  • Stopwords Corpus, 2,400 stopwords for 11 languages
  • Switchboard Corpus (sel), 36 phonecalls, transcribed, parsed
  • Univ Decl Human Rights, 480k words, 300+ languages
  • US Pres Addr Corpus, 480k words
  • Penn Treebank (sel), 40k words, tagged and parsed
  • TIMIT Corpus (sel), audio files and transcripts for 16 speakers
  • Wordlist Corpus, 960k words and 20k affixes for 8 languages
  • WordNet, 145k synonym sets
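
Several of the corpora above are tagged, meaning each word is paired with a label such as a part of speech. The sketch below shows one way such data might be held and counted in plain Python. It assumes a simple "word/TAG" text layout, which is an assumption made for illustration and not a description of how any corpus listed here is stored. The sample sentence is invented, and `parse_tagged` is a hypothetical helper, not an NLTK function.

```python
from collections import Counter

# Invented sample in "word/TAG" form; not taken from any corpus listed above.
SAMPLE = "the/DT dog/NN chased/VBD the/DT cat/NN ./."


def parse_tagged(text):
    """Turn 'word/TAG' items into a list of (word, tag) pairs."""
    pairs = []
    for item in text.split():
        # Split on the last slash so that a word containing "/" stays intact.
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs


pairs = parse_tagged(SAMPLE)
tag_counts = Counter(tag for _, tag in pairs)   # how often each tag occurs
print(pairs)
print(tag_counts)
```

A list of (word, tag) pairs is a uniform structure that tokenizers, taggers, and evaluation code can all share, and a frequency count over the tags is a typical first look at a tagged corpus.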