Bag-of-words: Sentiment Analysis in Python (Violeta Misheva)

Bag-of-words
SENTIMENT ANALYSIS IN PYTHON
Violeta Misheva, Data Scientist

What is a bag-of-words (BOW)?
- Describes the occurrence of words within a document or a collection of documents (corpus)
- Builds a vocabulary of the words and a measure of their presence

Amazon product reviews

Sentiment analysis with BOW: Example
This is the best book ever. I loved the book and highly recommend it!!!
{'This': 1, 'is': 1, 'the': 2, 'best': 1, 'book': 2, 'ever': 1, 'I': 1, 'loved': 1, 'and': 1, 'highly': 1, 'recommend': 1, 'it': 1}
Word order and grammar rules are lost!
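The counting step in this example can be sketched with the standard library alone. This is a minimal illustration, assuming a simple regex tokenizer that keeps case and drops punctuation; the variable names are illustrative, and this is not the tokenizer that CountVectorizer uses.

```python
import re
from collections import Counter

review = "This is the best book ever. I loved the book and highly recommend it!!!"

# Assumption: a token is any run of word characters; punctuation is discarded
tokens = re.findall(r"\w+", review)

# The bag-of-words is the count of each token, with word order thrown away
bow = Counter(tokens)

print(bow["the"], bow["book"], bow["best"])
```

Only "the" and "book" occur twice; every other word in the sentence occurs once.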

BOW end result
The output will look something like this: [figure not preserved]

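CountVectorizer function
    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    # data: a DataFrame of reviews with a 'review' text column (loaded elsewhere)
    # Keep only the 1000 most frequent tokens
    vect = CountVectorizer(max_features=1000)
    # Learn the vocabulary from the review column
    vect.fit(data.review)
    # Encode each review as a row of token counts
    X = vect.transform(data.review)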

CountVectorizer output
    X
    <10000x1000 sparse matrix of type '<class 'numpy.int64'>'
        with 406668 stored elements in Compressed Sparse Row format>

Transforming the vectorizer
    # Transform to an array
    my_array = X.toarray()
    # Transform back to a dataframe, assign column names
    # (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
    X_df = pd.DataFrame(my_array, columns=vect.get_feature_names_out())

Let's practice!

Getting granular with n-grams

Context matters
I am happy, not sad.
I am sad, not happy.
Putting 'not' in front of a word (negation) is one example of how context matters.

Capturing context with a BOW
- Unigrams: single tokens
- Bigrams: pairs of tokens
- Trigrams: triples of tokens
- n-grams: sequences of n tokens

Capturing context with BOW
The weather today is wonderful.
- Unigrams: { The, weather, today, is, wonderful }
- Bigrams: { The weather, weather today, today is, is wonderful }
- Trigrams: { The weather today, weather today is, today is wonderful }
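The unigram, bigram and trigram sets for this sentence can be reproduced with a short sliding-window helper. This is a sketch, assuming whitespace tokenization; the helper name ngrams is illustrative and not part of any library used in the course.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Assumption: whitespace splitting is enough for this punctuation-free example
tokens = "The weather today is wonderful".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
```

A sentence of k tokens yields k - n + 1 n-grams, which is why the feature count grows quickly once longer sequences are added across a whole corpus.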

n-grams with the CountVectorizer
    from sklearn.feature_extraction.text import CountVectorizer
    # General form: min_n and max_n are placeholders for the shortest and longest n-gram to extract
    vect = CountVectorizer(ngram_range=(min_n, max_n))
    # Only unigrams
    ngram_range=(1, 1)
    # Uni- and bigrams
    ngram_range=(1, 2)

What is the best n?
Longer sequences of tokens:
- Result in more features
- Higher precision of machine learning models
- Risk of overfitting

Specifying vocabulary size
    CountVectorizer(max_features, max_df, min_df)
- max_features: if specified, only the most frequent words (up to that number) are included in the vocabulary
  - If max_features=None, all words are included
- max_df: ignore terms with a document frequency higher than the specified value (the number or share of documents containing the term)
  - If set to an integer, it is an absolute count; if a float, it is a proportion
  - Default is 1.0, which means it does not ignore any terms
- min_df: ignore terms with a document frequency lower than the specified value
  - If set to an integer, it is an absolute count; if a float, it is a proportion
  - Default is 1, which means it does not ignore any terms
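The effect of these three parameters is easiest to see on a tiny corpus. The sketch below assumes scikit-learn is installed and uses a made-up four-document corpus; the helper name vocabulary is illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "the" and "is" occur in all 4 documents,
# "book" and "great" in 2, every other word in 1
docs = [
    "the book is great",
    "the book is boring",
    "the movie is great",
    "the plot is thin",
]

def vocabulary(**kwargs):
    """Fit a CountVectorizer with the given settings and return its sorted vocabulary."""
    vect = CountVectorizer(**kwargs)
    vect.fit(docs)
    return sorted(vect.vocabulary_)

full = vocabulary()                            # defaults: nothing is ignored
top_two = vocabulary(max_features=2)           # the two most frequent terms
at_least_two_docs = vocabulary(min_df=2)       # integer: absolute document count
filtered = vocabulary(min_df=2, max_df=0.75)   # float: proportion of documents

print(filtered)
```

With max_df=0.75, terms found in more than 3 of the 4 documents ("the", "is") are dropped, and with min_df=2 the words seen in a single document are dropped, so only "book" and "great" remain.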

Build new features from text

Goal of the video
Goal: Enrich the existing dataset with features related to the text column (capturing the sentiment)

Product reviews data
    reviews.head()

Features from the review column
- How long is each review?
- How many sentences does it contain?
- What parts of speech are involved?
- How many punctuation marks?

Tokenizing a string
    from nltk import word_tokenize
    anna_k = 'Happy families are all alike, every unhappy family is unhappy in its own way.'
    word_tokenize(anna_k)
    ['Happy', 'families', 'are', 'all', 'alike', ',', 'every', 'unhappy', 'family',
     'is', 'unhappy', 'in', 'its', 'own', 'way', '.']

Tokens from a column
    # General form of list comprehension
    [expression for item in iterable]
    # Tokenize every review in the column
    word_tokens = [word_tokenize(review) for review in reviews.review]
    type(word_tokens)
    list
    type(word_tokens[0])
    list

Tokens from a column
    len_tokens = []
    # Iterate over the word_tokens list
    for i in range(len(word_tokens)):
        len_tokens.append(len(word_tokens[i]))
    # Create a new feature for the length of each review
    reviews['n_tokens'] = len_tokens

Dealing with punctuation
- We did not address it, but you can exclude it
- Or create a feature that measures the number of punctuation signs
- A review with many punctuation signs could signal a very emotionally charged opinion
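A punctuation-count feature of this kind can be sketched with the standard library. The example assumes that "punctuation" means the ASCII characters in string.punctuation and uses made-up review texts; the names reviews_text and count_punctuation are illustrative.

```python
import string

# Hypothetical reviews standing in for the review column
reviews_text = [
    "I loved the book and highly recommend it!!!",
    "It was fine.",
    "Terrible... Why?! Never again!",
]

def count_punctuation(text):
    """Count characters that appear in string.punctuation (ASCII punctuation only)."""
    return sum(ch in string.punctuation for ch in text)

# One count per review, ready to be assigned as a new column
n_punct = [count_punctuation(text) for text in reviews_text]

print(n_punct)
```

The resulting list has one entry per review, so it could be assigned to the DataFrame in the same way as the n_tokens column.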

Reviews with a feature for the length
    reviews.head()

Can you guess the language?

Language of a string in Python
    from langdetect import detect_langs
    foreign = 'Este libro ha sido uno de los mejores libros que he leido.'
    detect_langs(foreign)
    [es:0.9999945352697024]

Language of a column
Problem: Detect the language of each of the strings and capture the most likely language in a new column
    import pandas as pd
    from langdetect import detect_langs
    reviews = pd.read_csv('product_reviews.csv')
    reviews.head()

Building a feature for the language
    languages = []
    # Detect the language of the text in the second column of each row
    for row in range(len(reviews)):
        languages.append(detect_langs(reviews.iloc[row, 1]))
    languages
    [it:0.9999982541301151],
    [es:0.9999954153640488],
    [es:0.7142833997345875, en:0.2857160465706441],
    [es:0.9999942365605781],
    [es:0.999997956049055]
    ...

Building a feature for the language
    # Transform one detection result (here the second) to a string and split on a colon
    str(languages[1]).split(':')
    ['[es', '0.9999954153640488]']
    str(languages[1]).split(':')[0]
    '[es'
    str(languages[1]).split(':')[0][1:]
    'es'

Building a feature for the language
    # Keep only the code of the first (most likely) language for each review
    languages = [str(lang).split(':')[0][1:] for lang in languages]
    reviews['language'] = languages
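To see why the split-and-slice expression returns the language code, it can be applied to plain strings without running langdetect. The sketch below assumes the string forms shown in the output earlier and relies on the first entry being the most likely candidate, as it is in that output; the helper name most_likely_language is illustrative.

```python
# String forms of two detection results, copied from the output shown earlier
results = [
    "[it:0.9999982541301151]",
    "[es:0.7142833997345875, en:0.2857160465706441]",
]

def most_likely_language(result_str):
    """Return the code before the first colon, without the leading '['."""
    return result_str.split(":")[0][1:]

codes = [most_likely_language(r) for r in results]

print(codes)
```

When several candidates are listed, everything after the first colon is discarded, so only the first language code survives.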
