Corpus Bootstrapping with NLTK by Jacob Perkins



SLIDE 1

Corpus Bootstrapping with NLTK

by Jacob Perkins

SLIDE 2

Jacob Perkins

http://www.weotta.com
http://streamhacker.com
http://text-processing.com
https://github.com/japerk/nltk-trainer
@japerk

SLIDE 3

Problem

  • you want to do NLProc
  • many proven supervised training algorithms
  • but you don’t have a training corpus

SLIDE 4

Solution

make a custom training corpus

SLIDE 5

Problems with Manual Annotation

  • takes time
  • requires expertise
  • expert time costs $$$

SLIDE 6

Solution: Bootstrap

  • less time
  • less expertise
  • costs less
  • requires thinking & creativity

SLIDE 7

Corpus Bootstrapping at Weotta

  • review sentiment
  • keyword classification
  • phrase extraction & classification

SLIDE 8

Bootstrapping Examples

  • english -> spanish sentiment
  • phrase extraction

SLIDE 9

Translating Sentiment

  • start with english sentiment corpus & classifier
  • english -> spanish -> spanish

SLIDE 10

English -> Spanish -> Spanish

  • 1. translate english examples to spanish
  • 2. train classifier
  • 3. classify spanish text into new corpus
  • 4. correct new corpus
  • 5. retrain classifier
  • 6. add to corpus & goto 4 until done
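
The loop above can be sketched in plain Python. The classifier here is a toy stand-in (any callable returning a (label, probability) pair), not nltk-trainer's actual interface:

```python
# Sketch of one bootstrap round: keep only confident predictions,
# which are then manually corrected before joining the corpus.
# `classify` is a hypothetical callable returning (label, probability).

def bootstrap_round(classify, unlabeled, threshold=0.9):
    accepted = []
    for text in unlabeled:
        label, prob = classify(text)
        if prob >= threshold:
            accepted.append((text, label))
    return accepted

# toy stand-in classifier, for illustration only
def toy_classify(text):
    return ('pos', 0.95) if 'bueno' in text else ('neg', 0.6)

print(bootstrap_round(toy_classify, ['muy bueno', 'no me gusta']))
# [('muy bueno', 'pos')]
```

Each retraining round repeats this filter with a (carefully) lowered threshold, as slide 16 describes.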
SLIDE 11

Translate Corpus

$ translate_corpus.py movie_reviews --source english --target spanish
SLIDE 12

Train Initial Classifier

$ train_classifier.py spanish_movie_reviews

SLIDE 13

Create New Corpus

$ classify_to_corpus.py spanish_sentiment --input spanish_examples.txt --classifier spanish_movie_reviews_NaiveBayes.pickle

SLIDE 14

Manual Correction

  • 1. scan each file
  • 2. move incorrect examples to correct file
SLIDE 15

Train New Classifier

$ train_classifier.py spanish_sentiment

SLIDE 16

Adding to the Corpus

  • start with >90% probability
  • retrain
  • carefully decrease probability threshold

SLIDE 17

Add more at a Lower Threshold

$ classify_to_corpus.py categorized_corpus --classifier categorized_corpus_NaiveBayes.pickle --threshold 0.8 --input new_examples.txt

SLIDE 18

When are you done?

  • what level of accuracy do you need?
  • does your corpus reflect real text?
  • how much time do you have?

SLIDE 19

Tips

  • garbage in, garbage out
  • correct bad data
  • clean & scrub text
  • experiment with train_classifier.py options
  • create custom features
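
For the "create custom features" tip: NLTK classifiers consume feature dicts, so a custom feature extractor is just a function from words to a dict. A minimal sketch (the function names here are illustrative, not nltk-trainer's API):

```python
# Minimal bag-of-words feature extractor: NLTK classifiers take
# feature dicts, so custom features are just extra dict entries.

def bag_of_words(words):
    return {word: True for word in words}

def bag_of_words_with_bigrams(words):
    # add word-pair features on top of the unigram features
    feats = bag_of_words(words)
    for w1, w2 in zip(words, words[1:]):
        feats['bigram(%s,%s)' % (w1, w2)] = True
    return feats

print(bag_of_words_with_bigrams(['not', 'good']))
# {'not': True, 'good': True, 'bigram(not,good)': True}
```

Bigram features like this are one way to capture negations ("not good") that plain bag-of-words misses.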

SLIDE 20

Bootstrapping a Phrase Extractor

  • 1. find a pos tagged corpus
  • 2. annotate raw text
  • 3. train pos tagger
  • 4. create pos tagged & chunked corpus
  • 5. tag unknown words
  • 6. train pos tagger & chunker
  • 7. correct errors
  • 8. add to corpus, goto 5 until done
SLIDE 21

NLTK Tagged Corpora

  • English: brown, conll2000, treebank
  • Portuguese: mac_morpho, floresta
  • Spanish: cess_esp, conll2002
  • Catalan: cess_cat
  • Dutch: alpino, conll2002
  • Indian Languages: indian
  • Chinese: sinica_treebank
  • see http://text-processing.com/demo/tag/

SLIDE 22

Train Tagger

$ train_tagger.py treebank --simplify_tags

SLIDE 23

Phrase Annotation

Hello world, [this is an important phrase].
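
Annotations in this bracket format can be pulled out with a few lines of Python (a sketch of the format only, not tag_phrases.py's actual parser):

```python
import re

def extract_bracketed(line):
    """Return the [bracketed] phrase spans from an annotated line."""
    return re.findall(r'\[([^\]]+)\]', line)

print(extract_bracketed('Hello world, [this is an important phrase].'))
# ['this is an important phrase']
```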

SLIDE 24

Tag Phrases

$ tag_phrases.py my_corpus --tagger treebank_simplify_tags.pickle --input my_phrases.txt

SLIDE 25

Chunked & Tagged Phrase

Hello/N world/N ,/, [ this/DET is/V an/DET important/ADJ phrase/N ] ./.
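
A line in this format splits back into (word, tag) pairs inside and outside the brackets. This is a sketch of the format itself, not NLTK's ChunkedCorpusReader:

```python
def parse_chunked_line(line):
    """Split a word/TAG line with one [ ... ] chunk into
    (word, tag) pairs outside and inside the brackets."""
    outside, chunk = [], []
    in_chunk = False
    for tok in line.split():
        if tok == '[':
            in_chunk = True
        elif tok == ']':
            in_chunk = False
        else:
            # rpartition handles tokens like ,/, and ./.
            word, _, tag = tok.rpartition('/')
            (chunk if in_chunk else outside).append((word, tag))
    return outside, chunk

line = 'Hello/N world/N ,/, [ this/DET is/V an/DET important/ADJ phrase/N ] ./.'
outside, chunk = parse_chunked_line(line)
print(chunk)
# [('this', 'DET'), ('is', 'V'), ('an', 'DET'), ('important', 'ADJ'), ('phrase', 'N')]
```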

SLIDE 26

Correct Unknown Words

  • 1. find -NONE- tagged words
  • 2. fix tags
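
Finding the -NONE- tagged words is a simple scan over tagged sentences (a sketch; the sample data is made up for illustration):

```python
def find_unknown_words(tagged_sents):
    """Collect words the tagger couldn't tag (left as -NONE-)."""
    return [word
            for sent in tagged_sents
            for word, tag in sent
            if tag == '-NONE-']

sents = [[('this', 'DET'), ('xkcdwort', '-NONE-'), ('phrase', 'N')]]
print(find_unknown_words(sents))
# ['xkcdwort']
```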
SLIDE 27

Train New Tagger

$ train_tagger.py my_corpus --reader nltk.corpus.reader.ChunkedCorpusReader

SLIDE 28

Train Chunker

$ train_chunker.py my_corpus --reader nltk.corpus.reader.ChunkedCorpusReader

SLIDE 29

Extracting Phrases

import collections, nltk.data
from nltk import tokenize
from nltk.tag import untag

tagger = nltk.data.load('taggers/my_corpus_tagger.pickle')
chunker = nltk.data.load('chunkers/my_corpus_chunker.pickle')

def extract_phrases(t):
    d = collections.defaultdict(list)
    for sub in t.subtrees(lambda s: s.node != 'S'):
        d[sub.node].append(' '.join(untag(sub.leaves())))
    return d

sents = tokenize.sent_tokenize(text)
words = tokenize.word_tokenize(sents[0])
d = extract_phrases(chunker.parse(tagger.tag(words)))
# defaultdict(<type 'list'>, {'PHRASE_TAG': ['phrase']})

SLIDE 30

Final Tips

  • error correction is faster than manual annotation
  • find close enough corpora
  • use nltk-trainer to experiment
  • iterate -> quality
  • no substitute for human judgement

SLIDE 31

Links

http://www.nltk.org
https://github.com/japerk/nltk-trainer
http://text-processing.com