CSE217 INTRODUCTION TO DATA SCIENCE LECTURE 3: SENTIMENT ANALYSIS - - PowerPoint PPT Presentation



slide-1
SLIDE 1

CSE217 INTRODUCTION TO DATA SCIENCE

Spring 2019 Marion Neumann

LECTURE 3: SENTIMENT ANALYSIS

slide-2
SLIDE 2


SENTIMENT ANALYSIS

…discover people’s opinions, emotions, feelings about a subject, topic, product, or service from text

Step 1: Get the data
Step 2: Process text into features
Step 3: Infer sentiment

slide-3
SLIDE 3

SENTIMENT ANALYSIS

Recap: Data Science Workflow

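Workflow (diagram): scientific, social, or business problem → data problem → collect & understand data → clean & format data → use data to create solution

Applied to sentiment analysis:

  • scientific, social, or business problem: improve movie recommender, gauging brand perception
  • data problem: sentiment analysis
  • collect & understand data: scrape web/twitter
  • clean & format data: working with text data
  • use data to create solution: rule-based predictor or machine learning classifier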

slide-4
SLIDE 4

SENTIMENT ANALYSIS WORKFLOW

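Workflow (diagram), shown on the example "bad ping pong excluded rio 2016":

  • Stemming & Negation Handling: "bad ping pong excluded rio 2016" → "bad ping pong exclude rio 2016"
  • Feature Extraction (words shown: excluded, bad)

→ rule-based prediction
→ machine learning classifier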
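The preprocessing steps in this workflow can be sketched roughly as follows. This is a minimal illustration, not the course's implementation: the function names, the suffix list, and the negation word list are all assumptions. The crude suffix stripping turns "excluded" into "exclud" rather than the slide's "exclude", which is typical of simple stemmers.

```python
import re

# Assumed, illustrative list of negation words (not from the slides).
NEGATION_WORDS = {"not", "no", "never"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def crude_stem(token):
    """Strip a few common suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def handle_negation(tokens):
    """Drop each negation word and prefix the following token with 'not_'."""
    result = []
    negate = False
    for tok in tokens:
        if tok in NEGATION_WORDS:
            negate = True
            continue
        result.append("not_" + tok if negate else tok)
        negate = False
    return result

def preprocess(text):
    """Tokenize, stem, then apply negation handling."""
    return handle_negation([crude_stem(t) for t in tokenize(text)])

tokens = preprocess("bad ping pong excluded rio 2016")
```

The resulting token list is what a rule-based predictor or a classifier's feature extraction step would consume.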

slide-5
SLIDE 5

RULE-BASED APPROACH

→ Lab 3

DSFS p25 Control Flow
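A rule-based predictor needs little more than control flow. The sketch below assumes fixed word lists in which each word scores +1, -1, or 0. The lists themselves and the names `POSITIVE`, `NEGATIVE`, and `rule_based_sentiment` are hypothetical and chosen only for illustration.

```python
# Hypothetical word lists; a real lexicon would be much larger.
POSITIVE = {"great", "friendly", "good"}
NEGATIVE = {"bad", "horrible", "noise"}

def word_score(word):
    """Return the fixed score of a single word: +1, -1, or 0."""
    if word in POSITIVE:
        return 1
    if word in NEGATIVE:
        return -1
    return 0

def rule_based_sentiment(tokens):
    """Sum the word scores and map the total to a sentiment label."""
    total = sum(word_score(t) for t in tokens)
    if total > 0:
        return "positive"
    elif total < 0:
        return "negative"
    else:
        return "neutral"

label = rule_based_sentiment(["great", "friendly", "noise"])
```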

slide-6
SLIDE 6

TEXT DATA

  • Data representation → strings
  • four kinds of string data

    1) categorical data
    2) free strings (that can be semantically mapped to categories)
    3) structured string data
    4) free-form text data

→ What makes text different?

slide-7
SLIDE 7

TEXT DATA

…is Big Data!


slide-8
SLIDE 8

MACHINE LEARNING APPROACH

  • Classification


slide-9
SLIDE 9

FEATURES FOR TEXT DATA

  • bag of words
    → does word occur in document yes/no → binary feature
  • word counts
    → how often does word occur? → count feature
  • more advanced: n-grams, TF-IDF

Example review: "Same great flavor and friendly service as in the S 18th street location. This location is not as small but it's hard to talk to friends. Thankfully there is great outdoor seating to escape the noise."

Example feature words: great, small, location, friends, …
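For the example review on this slide, the two basic feature types can be computed as in the sketch below. The vocabulary uses the four words shown on the slide plus "horrible", which is added here as an assumption so that one feature is zero. The function names are illustrative.

```python
import re
from collections import Counter

review = ("Same great flavor and friendly service as in the S 18th street "
          "location. This location is not as small but it's hard to talk "
          "to friends. Thankfully there is great outdoor seating to escape "
          "the noise.")

# Words from the slide, plus "horrible" as an assumed extra entry.
vocabulary = ["great", "small", "location", "friends", "horrible"]

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def word_counts(text, vocab):
    """Count feature: how often each vocabulary word occurs."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]

def bag_of_words(text, vocab):
    """Binary feature: 1 if the vocabulary word occurs, else 0."""
    return [1 if c > 0 else 0 for c in word_counts(text, vocab)]

count_features = word_counts(review, vocabulary)    # [2, 1, 2, 1, 0]
binary_features = bag_of_words(review, vocabulary)  # [1, 1, 1, 1, 0]
```

Note that "friendly" does not count as an occurrence of "friends" because matching is done on whole tokens.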

slide-10
SLIDE 10

FEATURE REPRESENTATION

  • bag-of-words and word counts are vectors

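PDSH p38 Arrays

[Handwritten annotations, partly illegible: each review is represented by a feature vector (counts or binary) with one entry per word of the vocabulary (dictionary), e.g. "great" or "horrible". With a vocabulary of about 170,000 words, the vector dimension D is correspondingly large.]

→ extremely sparse features: many zeros since most words do not appear in a review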
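To get a feel for how sparse such vectors are, the sketch below builds a dense count vector for a vocabulary of 170,000 words and stores only its nonzero entries. The indices and counts are made up for illustration, and `to_sparse` is a hypothetical helper, not a library function.

```python
def to_sparse(dense):
    """Keep only the nonzero entries as {index: value}."""
    return {i: v for i, v in enumerate(dense) if v != 0}

D = 170_000            # vocabulary size mentioned on the slide
dense = [0] * D        # one entry per vocabulary word
dense[12] = 2          # hypothetical index of a word occurring twice
dense[40_123] = 1      # hypothetical index of a word occurring once

sparse = to_sparse(dense)
# Fraction of entries that are zero.
sparsity = 1 - len(sparse) / D
```

Storing two index/value pairs instead of 170,000 numbers is the reason sparse representations are commonly used for text features.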

slide-11
SLIDE 11

WHAT IS A CLASSIFIER?

  • Rule-based
    • list of positive and negative words results in fixed score (+1, -1, or 0) for each word
  • Classifier
    • no fixed lists of positive/negative words
    • each word gets a weight parameter $w_i$ assigned
    • classifier = parameterized model of the relationship between input and output/label
    • e.g. label = $\mathrm{sign}(w^\top x + b)$ using a linear relationship
    • classifier learns the weights from labeled training data

$w^\top x$ is referred to as

  • dot product,
  • inner product, or
  • scalar product
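The linear decision rule can be written out in a few lines. In this sketch the weights, the bias `b`, and the feature vector are made-up values for a five-word vocabulary; a real classifier would learn the weights from data. Returning +1 when the score is exactly zero is an arbitrary tie-breaking choice.

```python
def dot(w, x):
    """Dot (inner, scalar) product of two equal-length vectors."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x))

def predict(w, x, b=0.0):
    """Linear classifier: +1 if w.x + b >= 0, otherwise -1."""
    return 1 if dot(w, x) + b >= 0 else -1

# Hypothetical weights for the words: great, small, location, friends, horrible
w = [1.5, -0.5, 0.0, 0.2, -2.0]
x = [2, 1, 2, 1, 0]          # word counts of one review

score = dot(w, x)            # 3.0 - 0.5 + 0.0 + 0.2 + 0.0
label = predict(w, x)
```

Positive weights push the prediction towards +1 and negative weights towards -1, and a weight of zero means the word has no influence.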
slide-12
SLIDE 12

CLASSIFIER

  • output (sentiment) is a binary class


Is this new review positive or negative?
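One simple way to learn weights for a binary classifier from labeled reviews is the perceptron update rule. The slides do not say which learning algorithm is used, so this is an assumed stand-in for illustration. The tiny training set uses counts of two hypothetical feature words ("great", "horrible").

```python
def train_perceptron(X, y, epochs=10):
    """Learn weights w and bias b with the perceptron update rule."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            pred = 1 if score >= 0 else -1
            if pred != label:
                # Move the weights towards the correct label.
                w = [wi + label * xi for wi, xi in zip(w, x)]
                b += label
    return w, b

def classify(w, b, x):
    """Predict +1 (positive) or -1 (negative) for a feature vector."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Features: [count of "great", count of "horrible"]; labels: +1 / -1
X_train = [[2, 0], [1, 0], [0, 1], [0, 2]]
y_train = [1, 1, -1, -1]

w, b = train_perceptron(X_train, y_train)
new_review_label = classify(w, b, [3, 0])   # a new review with "great" three times
```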

slide-13
SLIDE 13

EVALUATION

  • Which approach (rule-based or machine learning)

performs better?

  • Measures:
    • error rate (or misclassification rate) = $\frac{\#\,\text{misclassified test points}}{\#\,\text{test points}}$
    • average accuracy (= $1 - \text{error rate}$)

→ How can we measure this?
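Both measures are straightforward to compute once predictions for a held-out test set are available. The labels and predictions below are invented for illustration.

```python
def error_rate(y_true, y_pred):
    """Fraction of test points whose prediction differs from the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

def accuracy(y_true, y_pred):
    """Accuracy is one minus the error rate."""
    return 1 - error_rate(y_true, y_pred)

# Hypothetical test labels and predictions (+1 positive, -1 negative)
y_test = [1, -1, 1, 1, -1]
y_hat = [1, 1, 1, -1, -1]

err = error_rate(y_test, y_hat)   # 2 of 5 test points are misclassified
acc = accuracy(y_test, y_hat)
```

Running the same function on the predictions of the rule-based predictor and of the classifier, over the same test set, gives a direct comparison of the two approaches.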

slide-14
SLIDE 14

SUMMARY & READING

  • Sentiment Analysis automatically identifies, extracts, and analyzes emotions in text data.
  • Text data needs to be preprocessed to get features that can be used for prediction and learning.
  • Linear classification is used to predict binary or categorical targets.

  • DSFS
    • Ch4: Linear Algebra → Vectors (p49-53)
    • Ch9: Getting Data (p105-108, p114-120)
    • Ch20: Natural Language Processing (p239-244)

Do not use the implementations introduced in this chapter → use NumPy Arrays! PDSH p38
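As a rough illustration of that advice, the vector operations from this lecture map onto NumPy arrays as sketched below. The feature values, weights, and bias are made up, and NumPy is assumed to be installed.

```python
import numpy as np

# Hypothetical word counts for: great, small, location, friends, horrible
counts = np.array([2, 1, 2, 1, 0])

# Binary bag-of-words features derived from the counts.
binary = (counts > 0).astype(int)

# Hypothetical weights and bias of a linear classifier.
w = np.array([1.5, -0.5, 0.0, 0.2, -2.0])
b = 0.0

score = np.dot(w, counts) + b   # dot product plus bias
label = int(np.sign(score))     # +1 positive, -1 negative, 0 if score is exactly 0
```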