Natural Language Processing, STOR 390, 4/18/17 - PowerPoint PPT Presentation



SLIDE 1

Natural Language Processing

STOR 390 4/18/17

SLIDE 2

Kurt Vonnegut on the Shapes of Stories

https://www.youtube.com/watch?v=oP3c1h8v2ZQ

SLIDE 3

We know how to work with tidy data

SLIDE 4

We know how to work with tidy data

Regression: linear model, polynomial terms
Classification: K-nearest-neighbors, SVM
Clustering: K-means

SLIDE 5

Unstructured data: not all data is tidy

Networks
Text
Images

SLIDE 6

Network data

SLIDE 7

http://dogtime.com/puppies/255-puppies
http://www.dailytarheel.com/article/2017/04/a-title-to-remember-north-carolina-wins-its-sixth-ncaa-championship

Image data

SLIDE 8

https://emeraldcitybookreview.com/2014/06/beautiful-books-picturing-jane-austen_20.html

Text data

SLIDE 9

Unstructured ≠ no structure

SLIDE 10

Two strategies

Invent new tools: PageRank
Turn it into tidy data

SLIDE 11

https://medium.com/@ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721

Images are numbers
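The point of "images are numbers" is that a picture is already a grid of pixel intensities, so tidying it just means flattening that grid into a numeric feature vector. A minimal sketch with a made-up 2x2 grayscale "image":

```python
# A grayscale image is a grid of pixel intensities (0 = black, 255 = white).
image = [
    [0, 255],
    [255, 0],
]  # a tiny 2x2 checkerboard, purely illustrative

# Flatten row by row into one feature vector, as image models typically expect.
features = [pixel for row in image for pixel in row]
print(features)  # [0, 255, 255, 0]
```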

SLIDE 12

https://research.googleblog.com/2016/09/show-and-tell-image-captioning-open.html

SLIDE 13

Text data

One document = string of words
Corpus = collection of documents

SLIDE 14

“A token is a meaningful unit of text, most often a word, that we are interested in using for further analysis, and tokenization is the process of splitting text into tokens.”

—Text Mining with R
SLIDE 15

Tokenization turns text into tidy format

Word
Sentence
Paragraph
Chapter
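The slides use R's tidytext (its `unnest_tokens` produces lowercase word tokens by default). An analogous sketch in Python, with a hypothetical `tokenize` helper, showing word-level tokenization plus the lowercasing from the next slides:

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens, one token per list entry,
    mirroring tidy-text word tokenization."""
    return re.findall(r"[a-z']+", text.lower())

print(tokenize("The Door opened; the door closed."))
# ['the', 'door', 'opened', 'the', 'door', 'closed']
```

Note that punctuation is dropped and "The"/"the" and "Door"/"door" collapse to the same tokens.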

SLIDE 16

Jane Austen’s books tokenized by word

SLIDE 17

Make text lower case

Make words more comparable: Door -> door

SLIDE 18

Tokenization loses information

Ignores word order

SLIDE 19

Most frequently appearing words
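Once text is in one-token-per-entry form, the most frequent words fall out of a simple count. A sketch with Python's standard-library `Counter` on toy tokens (not the Austen corpus):

```python
from collections import Counter

# Toy token list standing in for a tokenized document.
tokens = ["the", "door", "opened", "and", "the", "door", "closed"]

counts = Counter(tokens)
print(counts.most_common(2))  # [('the', 2), ('door', 2)]
```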

SLIDE 20

Remove stop words

Commonly occurring words: the, to, and
Hand code a list of words
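Removing stop words is a set-membership filter over the token list. A sketch with a tiny hand-coded stop list, as the slide suggests (real stop lists, like the one shipped with tidytext, are much longer):

```python
# Tiny hand-coded stop list; illustrative only.
stop_words = {"the", "to", "and"}

tokens = ["the", "door", "opened", "and", "the", "door", "closed"]
content = [t for t in tokens if t not in stop_words]
print(content)  # ['door', 'opened', 'door', 'closed']
```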

SLIDE 21

Most frequently occurring words (no stop words)

SLIDE 22

Sentiment analysis attempts to quantify emotional content

Assign each word an emotional value:
positive/negative
trust, fear, sadness, anger, surprise, disgust, joy, anticipation
-5, -4, …, 4, 5
SLIDE 23

There are precompiled lexicons

Hand coded
Crowdsourced: Amazon Mechanical Turk
Online reviews: Yelp

SLIDE 24

Assign each word a sentiment
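Word-level sentiment is then a lexicon lookup summed over the document. A sketch with a made-up mini-lexicon in the AFINN style (integer values in -5..5); the words and scores here are illustrative, not drawn from a real lexicon:

```python
# Hypothetical mini-lexicon; real lexicons (AFINN, bing, nrc) have thousands of entries.
lexicon = {"fun": 4, "noisy": -1, "miserable": -3}

tokens = ["statistics", "is", "so", "much", "fun"]
# Words missing from the lexicon contribute 0.
score = sum(lexicon.get(t, 0) for t in tokens)
print(score)  # 4
```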

SLIDE 25

Sentiment analysis is noisy

SLIDE 26

Sentiment analysis is noisy

Lexicons may not generalize
Unigrams miss context

SLIDE 27

Sentiment analysis is noisy

“Statistics is so much fun” vs. “Statistics is so much fun”
Same words, opposite tone; unigram sentiment cannot tell sincerity from sarcasm.

SLIDE 28

Jane Austen novels are fairly balanced

SLIDE 29

Different ways to quantify “time”

chapter
paragraph
line
sentence

SLIDE 30

Different ways to quantify “time”

chapter
paragraph
line
sentence

We choose one unit of time = 80 lines
SLIDE 31
SLIDE 32

index = linenumber %/% 80
sentiment = (# positive words) - (# negative words)
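The R expression above integer-divides the line number by 80 to bucket lines into time chunks, then nets positive against negative words per chunk. A sketch of the same bookkeeping in Python (toy word lists and text, not a real lexicon or novel); Python's `//` plays the role of R's `%/%`:

```python
from collections import defaultdict

# Toy stand-ins for a sentiment lexicon.
positive = {"happy", "love"}
negative = {"sad", "fear"}

# 200 toy lines of "text" alternating positive and negative.
lines = ["happy love", "sad day"] * 100

sentiment = defaultdict(int)
for linenumber, line in enumerate(lines):
    index = linenumber // 80  # bucket: one unit of time = 80 lines
    for word in line.split():
        if word in positive:
            sentiment[index] += 1
        elif word in negative:
            sentiment[index] -= 1

print(dict(sentiment))  # {0: 40, 1: 40, 2: 20}
```

Each full 80-line bucket nets +40 here (40 lines at +2, 40 at -1); the last, half-size bucket nets +20.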

SLIDE 33

Smooth the time series with a low-pass filter

http://www.matthewjockers.net/2015/02/02/syuzhet/
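Jockers' Syuzhet package does its smoothing with a Fourier-based low-pass filter; the simplest low-pass smoother is a moving average, which illustrates the same idea of suppressing high-frequency noise in the sentiment series. A sketch (not the package's method):

```python
def moving_average(series, window=3):
    """Low-pass smoothing sketch: replace each point with the mean of
    its window (truncated at the ends of the series)."""
    half = window // 2
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - half):i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

raw = [2, -5, 4, -1, 3, -4, 5]  # noisy per-chunk sentiment scores
print(moving_average(raw))
```

The output keeps the series length but damps the swings between adjacent chunks.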

SLIDE 34
SLIDE 35

References

Revealing Sentiment and Plot Arcs with the Syuzhet Package: http://www.matthewjockers.net/2015/02/02/syuzhet/
Text Mining with R: http://tidytextmining.com/