
CSE217 INTRODUCTION TO DATA SCIENCE LECTURE 3: SENTIMENT ANALYSIS - PowerPoint PPT Presentation

CSE217 Introduction to Data Science, Lecture 3: Sentiment Analysis. Spring 2019, Marion Neumann. Sentiment analysis: discover people's opinions, emotions, and feelings about a subject, topic, product, or service from text.


  1. CSE217 INTRODUCTION TO DATA SCIENCE LECTURE 3: SENTIMENT ANALYSIS Spring 2019 Marion Neumann

  2. SENTIMENT ANALYSIS …discover people’s opinions, emotions, feelings about a subject, topic, product, or service from text
     • Step 1: Get the data
     • Step 2: Process text into features
     • Step 3: Infer sentiment

  3. SENTIMENT ANALYSIS Recap: Data Science Workflow
     • workflow: scientific, social, or business problem → data problem → collect & understand data → clean & format data → use data to create solution
     • for sentiment analysis: improve movie recommender or gauging brand perception → sentiment analysis → scrape web/twitter → working with text data → rule-based predictor or machine learning classifier

  4. SENTIMENT ANALYSIS WORKFLOW
     • text processing steps shown: Negation Handling, Stemming, Feature Extraction
     • → rule-based prediction
     • → machine learning classifier
     • stemming example: "bad ping pong excluded rio 2016" → "bad ping pong exclude rio 2016"

  5. RULE-BASED APPROACH → Lab 3 (DSFS p25: Control Flow)
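
A rule-based predictor can be sketched with plain control flow: each word gets a fixed score of +1, -1, or 0, and the sign of the total decides the label. This is only a minimal illustration, not the Lab 3 solution; the word lists, the function names `word_score` and `predict_sentiment`, and the "neutral" label for a total of zero are assumptions made for the example.

```python
# Hypothetical word lists for illustration; a real lexicon would be much larger.
POSITIVE_WORDS = {"great", "friendly", "good"}
NEGATIVE_WORDS = {"bad", "horrible", "noise"}


def word_score(word):
    """Return +1 for a positive word, -1 for a negative word, 0 otherwise."""
    if word in POSITIVE_WORDS:
        return 1
    elif word in NEGATIVE_WORDS:
        return -1
    else:
        return 0


def predict_sentiment(text):
    """Sum the fixed per-word scores and map the total to a label."""
    total = 0
    for word in text.lower().split():
        # Drop punctuation attached to the word before looking it up.
        total += word_score(word.strip(".,!?"))
    if total > 0:
        return "positive"
    elif total < 0:
        return "negative"
    return "neutral"  # assumption: a total of zero is reported as neutral


print(predict_sentiment("Great flavor and friendly service!"))
```

Because the scores are fixed in advance, nothing is learned from data; improving the predictor means editing the word lists by hand.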

  6. TEXT DATA
     • Data representation → strings
     • four kinds of string data
       1) categorical data
       2) free strings (that can be semantically mapped to categories)
       3) structured string data
       4) free-form text data
     → What makes text different?

  7. TEXT DATA …is Big Data!

  8. MACHINE LEARNING APPROACH • Classification

  9. FEATURES FOR TEXT DATA
     • bag of words → does word occur in document yes / no → binary feature
     • word counts → how often does word occur? → count feature
     • more advanced: n-grams, TF-IDF
     • example review (feature words: location, great, friends, small): "Same great flavor and friendly service as in the S 18th street location. This location is not as small but it's hard to talk to friends. Thankfully there is great outdoor seating to escape the noise. …"
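
The two basic feature types can be computed in a few lines of plain Python. The sketch below uses the review text from the slide; the tokenizer (lowercasing, whitespace split, stripping punctuation), the function names, and the extra vocabulary word "horrible" are assumptions chosen for illustration.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text, split on whitespace, strip surrounding punctuation."""
    words = (w.strip(".,!?…\"'") for w in text.lower().split())
    return [w for w in words if w]


def word_counts(text, vocabulary):
    """Count feature: how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


def bag_of_words(text, vocabulary):
    """Binary feature: 1 if the vocabulary word occurs in the text, else 0."""
    return [1 if c > 0 else 0 for c in word_counts(text, vocabulary)]


review = (
    "Same great flavor and friendly service as in the S 18th street location. "
    "This location is not as small but it's hard to talk to friends. "
    "Thankfully there is great outdoor seating to escape the noise."
)
# Hypothetical vocabulary: the four slide words plus one word absent from the review.
vocabulary = ["location", "great", "friends", "small", "horrible"]

print(word_counts(review, vocabulary))   # [2, 2, 1, 1, 0]
print(bag_of_words(review, vocabulary))  # [1, 1, 1, 1, 0]
```

The count feature keeps how often a word appears, while the binary feature only records whether it appears at all.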

  10. FEATURE REPRESENTATION
     • bag-of-words and word counts are vectors of binary review features or review counts
     • one vector entry per word in the vocabulary (dictionary), e.g. great, horrible; D = 170,000 words
     • extremely sparse features: many zeros since most words do not appear in a review
     → PDSH p38: Arrays
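
As a rough sketch of this representation with NumPy arrays: every review becomes a vector with one entry per vocabulary word, and most entries stay zero. The six-word vocabulary, the token list, and the helper name `count_vector` are made up for the example; a realistic vocabulary is far larger.

```python
import numpy as np

# Hypothetical tiny vocabulary; a real one has one entry per dictionary word.
vocabulary = ["great", "horrible", "location", "small", "friends", "noise"]
word_index = {word: i for i, word in enumerate(vocabulary)}


def count_vector(tokens, word_index):
    """Build a count vector with one entry per vocabulary word."""
    x = np.zeros(len(word_index), dtype=int)
    for token in tokens:
        if token in word_index:          # words outside the vocabulary are ignored
            x[word_index[token]] += 1
    return x


tokens = ["great", "flavor", "great", "seating"]
x_counts = count_vector(tokens, word_index)
x_binary = (x_counts > 0).astype(int)    # bag-of-words version of the same review

# Fraction of zero entries; it approaches 1 as the vocabulary grows.
sparsity = np.mean(x_counts == 0)

print(x_counts, x_binary, sparsity)
```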

  11. WHAT IS A CLASSIFIER?
     • Rule-based
       • list of positive and negative words results in fixed score (+1, -1, or 0) for each word
     • Classifier
       • no fixed lists of positive/negative words
       • each word gets a weight parameter $w_i$ assigned
       • classifier = parameterized model of the relationship between input and output/label
       • e.g. $\text{label} = \operatorname{sign}(w \cdot x + b)$ using a linear relationship ($w \cdot x$ is referred to as the dot product, inner product, or scalar product)
       • classifier learns the weights from labeled training data
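
The prediction step of a linear classifier is a dot product followed by a sign. The sketch below only shows prediction, with hand-picked weights standing in for weights that would normally be learned from labeled training data; the vocabulary, the weight values, the bias name `b`, and the choice to map a score of exactly zero to +1 are all assumptions.

```python
import numpy as np


def predict_label(w, x, b=0.0):
    """Return +1 (positive) or -1 (negative) from the sign of w . x + b."""
    score = np.dot(w, x) + b
    # Assumption: a score of exactly zero is treated as positive.
    return 1 if score >= 0 else -1


# Made-up weights for a made-up vocabulary: ["great", "horrible", "location"].
w = np.array([1.5, -2.0, 0.0])
b = 0.0

x_good = np.array([2, 0, 1])   # "great" twice, "location" once
x_bad = np.array([0, 1, 1])    # "horrible" once, "location" once

print(predict_label(w, x_good, b))  # 1
print(predict_label(w, x_bad, b))   # -1
```

Unlike the fixed ±1 scores of the rule-based approach, each weight can take any value, so a word such as "location" can end up with a weight near zero if it carries no sentiment.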

  12. CLASSIFIER • output (sentiment) is a binary class: Is this new review positive or negative?

  13. EVALUATION
     • Which approach (rule-based or machine learning) performs better? → How can we measure this?
     • Measures:
       • error rate (or misclassification rate) $= \dfrac{\#\,\text{misclassified test points}}{\#\,\text{test points}}$
       • average accuracy $= 1 - \text{error rate}$
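
Both measures follow directly from comparing predicted labels with true labels on a test set. A minimal sketch, assuming labels are stored in two equal-length lists; the function names and the example labels are made up.

```python
def error_rate(true_labels, predicted_labels):
    """Fraction of test points whose predicted label differs from the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("need at least one test point")
    misclassified = sum(1 for t, p in zip(true_labels, predicted_labels) if t != p)
    return misclassified / len(true_labels)


def accuracy(true_labels, predicted_labels):
    """Accuracy is one minus the error rate."""
    return 1 - error_rate(true_labels, predicted_labels)


# Hypothetical test set with four reviews: +1 = positive, -1 = negative.
y_true = [1, -1, 1, 1]
y_pred = [1, 1, 1, -1]

print(error_rate(y_true, y_pred))  # 0.5
print(accuracy(y_true, y_pred))    # 0.5
```

Running the same function on the predictions of the rule-based predictor and of the classifier, using the same test points, gives a direct comparison of the two approaches.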

  14. SUMMARY & READING
     • Sentiment Analysis automatically identifies, extracts, and analyzes emotions in text data.
     • Text data needs to be preprocessed to get features that can be used for prediction and learning.
     • Linear classification is used to predict binary or categorical targets.
     • DSFS
       • Ch4: Linear Algebra → Vectors (p49-53). Do not use the implementations introduced in this chapter → use NumPy Arrays! (PDSH p38)
       • Ch9: Getting Data (p105-108, p114-120)
       • Ch20: Natural Language Processing (p239-244)
