CSE217 INTRODUCTION TO DATA SCIENCE
LECTURE 3: SENTIMENT ANALYSIS
Spring 2019, Marion Neumann
SENTIMENT ANALYSIS
…discover people’s opinions, emotions, feelings about a subject, topic, product, or service from text
Step 1: Get the data
Step 2: Process text into features
Step 3: Infer sentiment
SENTIMENT ANALYSIS
Recap: Data Science Workflow
- scientific, social, or business problem: improve movie recommender, gauging brand perception
- data problem: sentiment analysis
- collect & understand data: scrape web/twitter
- clean & format data: working with text data
- use data to create solution: rule-based predictor or machine learning classifier
SENTIMENT ANALYSIS WORKFLOW
- Example text: bad ping pong excluded rio 2016
- Stemming: bad ping pong exclude rio 2016
- Feature Extraction & Negation Handling: excluded, bad
→ rule-based prediction
→ machine learning classifier
RULE-BASED APPROACH
→ Lab 3
DSFS p25 Control Flow
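A rule-based predictor of the kind used in this approach can be sketched with plain control flow. This is only an illustrative sketch: the word lists `POSITIVE_WORDS` and `NEGATIVE_WORDS`, the function names, and the "neutral" fallback are assumptions made for this example, not the lists or interface used in Lab 3.

```python
import re

# Hypothetical word lists, chosen only for illustration.
POSITIVE_WORDS = {"great", "friendly", "good"}
NEGATIVE_WORDS = {"bad", "horrible", "noise"}


def word_score(word):
    """Return the fixed score of a word: +1, -1, or 0."""
    if word in POSITIVE_WORDS:
        return 1
    if word in NEGATIVE_WORDS:
        return -1
    return 0


def predict_sentiment(text):
    """Sum the fixed word scores and map the total to a label."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = sum(word_score(token) for token in tokens)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


print(predict_sentiment("Same great flavor and friendly service"))
print(predict_sentiment("bad ping pong"))
```

The scores are fixed in advance, so nothing is learned from data; changing the behaviour means editing the word lists by hand.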
TEXT DATA
- Data representation → strings
- four kinds of string data
1) categorical data
2) free strings (that can be semantically mapped to categories)
3) structured string data
4) free-form text data
→ What makes text different?
TEXT DATA
…is Big Data!
MACHINE LEARNING APPROACH
- Classification
great small location friends …
FEATURES FOR TEXT DATA
- bag of words
→ does the word occur in the document (yes/no)? → binary feature
- word counts
→ how often does the word occur? → count feature
- more advanced: n-grams, TF-IDF
Same great flavor and friendly service as in the S 18th street location. This location is not as small but it's hard to talk to friends. Thankfully there is great outdoor seating to escape the noise.
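The two basic representations can be sketched in Python on the example review, using the four vocabulary words shown on the slide (great, small, location, friends). The tokenizer here (lowercasing plus a simple regular expression) and the function names are assumptions made for this sketch, not the preprocessing prescribed by the course.

```python
import re
from collections import Counter

review = (
    "Same great flavor and friendly service as in the S 18th street "
    "location. This location is not as small but it's hard to talk to "
    "friends. Thankfully there is great outdoor seating to escape the noise."
)

# Vocabulary words taken from the slide; a real vocabulary is far larger.
vocabulary = ["great", "small", "location", "friends"]


def tokenize(text):
    """Lowercase the text and split it into word tokens (simplified)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_counts(text, vocabulary):
    """Count feature: how often does each vocabulary word occur?"""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


def bag_of_words(text, vocabulary):
    """Binary feature: does each vocabulary word occur at all?"""
    return [1 if count > 0 else 0 for count in word_counts(text, vocabulary)]


print(word_counts(review, vocabulary))   # [2, 1, 2, 1]
print(bag_of_words(review, vocabulary))  # [1, 1, 1, 1]
```

Note that "friendly" is not counted as "friends" because no stemming is applied in this sketch.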
FEATURE REPRESENTATION
- bag-of-words and word counts are vectors
PDSH p38 Arrays
- features of a review: word counts or binary indicators
- one entry per word in the vocabulary (dictionary), e.g. great, horrible
- vocabulary size: about 170,000 words
- extremely sparse features: many zeros since most words do not appear in a review
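To make the sparsity point concrete, the sketch below builds a count vector over a tiny, made-up vocabulary and then keeps only its non-zero entries. The vocabulary, the token list, and the dictionary-based sparse format are assumptions for illustration; plain Python is used to keep the example dependency-free.

```python
# Tiny hypothetical vocabulary; a real one may have on the order of 170,000 words.
vocabulary = ["great", "horrible", "small", "location", "friends", "noise", "service"]
review_tokens = ["great", "location", "great", "service"]

# Map each vocabulary word to its position in the feature vector.
word_index = {word: i for i, word in enumerate(vocabulary)}

# Dense count vector: one entry per vocabulary word, mostly zeros.
dense = [0] * len(vocabulary)
for token in review_tokens:
    if token in word_index:
        dense[word_index[token]] += 1

# Sparse view: store only the positions whose count is non-zero.
sparse = {i: count for i, count in enumerate(dense) if count != 0}

# Fraction of entries that are zero.
sparsity = 1 - len(sparse) / len(vocabulary)

print(dense)   # [2, 0, 0, 1, 0, 0, 1]
print(sparse)  # {0: 2, 3: 1, 6: 1}
```

With a realistic vocabulary the fraction of zeros is far higher, which is why sparse storage pays off for text features.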
WHAT IS A CLASSIFIER?
- Rule-based
- list of positive and negative words results in fixed score
(+1, -1, or 0) for each word
- Classifier
- no fixed lists of positive/negative words
- each word gets a weight parameter $w$ assigned
- classifier = parameterized model of the relationship between input and output/label
- e.g. $\text{label} = \operatorname{sign}(w^\top x + b)$ using a linear relationship
- classifier learns the weights from labeled training data
$w^\top x$ is referred to as
- dot product,
- inner product, or
- scalar product
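The linear prediction rule can be sketched as a dot product between a weight vector and a feature vector, plus an offset term (called `b` here as an assumption), followed by a sign. The weights below are made up for illustration; a real classifier would learn them from labeled training data. The sketch uses plain Python lists to stay self-contained, although NumPy arrays are the recommended tool for this in practice.

```python
def dot(w, x):
    """Dot (inner, scalar) product of two equal-length vectors."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x))


def sign(z):
    """Return +1 for positive z, -1 for negative z, and 0 for zero."""
    if z > 0:
        return 1
    if z < 0:
        return -1
    return 0


def predict(w, x, b=0.0):
    """Linear prediction rule: sign of the weighted sum plus offset."""
    return sign(dot(w, x) + b)


vocabulary = ["great", "small", "location", "friends", "horrible"]
w = [1.5, -0.5, 0.0, 0.2, -2.0]  # hypothetical weights, one per word
b = 0.0                          # hypothetical offset

x_positive = [2, 1, 2, 1, 0]     # word counts of a review mentioning "great" twice
x_negative = [0, 0, 1, 0, 1]     # word counts of a review mentioning "horrible"

print(predict(w, x_positive, b))  # 1
print(predict(w, x_negative, b))  # -1
```

Unlike the fixed +1/-1/0 scores of the rule-based approach, each weight can take any real value, so some words can count more than others.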
CLASSIFIER
- output (sentiment) is a binary class
Is this new review positive or negative?
EVALUATION
- Which approach (rule-based or machine learning) performs better?
→ How can we measure this?
- Measures:
- error rate (or misclassification rate) = $\frac{\#\,\text{misclassified test points}}{\#\,\text{test points}}$
- average accuracy (= 1 − error rate)
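Both measures can be computed by comparing predicted labels with the true labels of held-out test points. The labels below are invented purely to illustrate the arithmetic; they are not results from the lecture or the lab.

```python
def error_rate(predicted, actual):
    """Fraction of test points whose predicted label differs from the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one test point")
    misclassified = sum(1 for p, a in zip(predicted, actual) if p != a)
    return misclassified / len(actual)


def accuracy(predicted, actual):
    """Average accuracy, defined as 1 minus the error rate."""
    return 1 - error_rate(predicted, actual)


# Made-up labels for five test reviews (+1 positive, -1 negative).
actual = [1, -1, 1, 1, -1]
rule_based = [1, 1, 1, -1, -1]    # hypothetical predictions, 2 mistakes
classifier = [1, -1, 1, -1, -1]   # hypothetical predictions, 1 mistake

print(error_rate(rule_based, actual))  # 0.4
print(error_rate(classifier, actual))  # 0.2
```

Evaluating both approaches on the same test points is what makes their error rates comparable.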
SUMMARY & READING
- Sentiment Analysis automatically identifies, extracts, and analyzes emotions in text data.
- Text data needs to be preprocessed to get features that can be used for prediction and learning.
- Linear classification is used to predict binary or categorical targets.
- DSFS
- Ch4: Linear Algebra → Vectors (p49-53)
Do not use the implementations introduced in this chapter → use NumPy Arrays! (PDSH p38)
- Ch9: Getting Data (p105-108, p114-120)
- Ch20: Natural Language Processing (p239-244)