Stop words | SENTIMENT ANALYSIS IN PYTHON | Violeta Misheva - PowerPoint PPT Presentation



SLIDE 1

Stop words

SENTIMENT ANALYSIS IN PYTHON

Violeta Misheva

Data Scientist

SLIDE 2

SENTIMENT ANALYSIS IN PYTHON

What are stop words and how to find them?

Stop words: words that occur too frequently and are not considered informative. Lists of stop words are available for most languages.

{'the', 'a', 'an', 'and', 'but', 'for', 'on', 'in', 'at' ...}

Context matters

{'movie', 'movies', 'film', 'films', 'cinema'}
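A quick way to see this in practice is scikit-learn's built-in English stop word list; as a sketch, note that generic words are covered but domain-specific ones (like the movie-related set above) must be added manually:

```python
# Inspect sklearn's built-in English stop word list (illustrative sketch)
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

print('the' in ENGLISH_STOP_WORDS)    # generic words are included
print('movie' in ENGLISH_STOP_WORDS)  # domain words are not: add them yourself
```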

SLIDE 3

SENTIMENT ANALYSIS IN PYTHON

Stop words with word clouds

[Two word clouds: one generated without removing stop words, one with stop words removed]

SLIDE 4

SENTIMENT ANALYSIS IN PYTHON

Remove stop words from word clouds

# Import the libraries
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# Define the stopwords list
my_stopwords = set(STOPWORDS)
my_stopwords.update(["movie", "movies", "film", "films", "watch", "br"])

# Generate and show the word cloud
my_cloud = WordCloud(background_color='white', stopwords=my_stopwords).generate(name_string)
plt.imshow(my_cloud, interpolation='bilinear')
plt.axis('off')
plt.show()

SLIDE 5

SENTIMENT ANALYSIS IN PYTHON

Stop words with BOW

from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Define the set of stop words
my_stop_words = ENGLISH_STOP_WORDS.union(['film', 'movie', 'cinema', 'theatre'])

# Build the vectorizer and transform the reviews
vect = CountVectorizer(stop_words=my_stop_words)
vect.fit(movies.review)
X = vect.transform(movies.review)
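To see the effect on a small scale, here is a minimal sketch on a two-sentence toy corpus (the sentences are assumed for illustration): words in the stop word set never enter the vocabulary.

```python
# Minimal sketch: stop words are excluded from the CountVectorizer vocabulary
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

docs = ["This movie was a great movie", "The film was a dull film"]
my_stop_words = ENGLISH_STOP_WORDS.union(['film', 'movie'])

vect = CountVectorizer(stop_words=list(my_stop_words))
X = vect.fit_transform(docs)
print(sorted(vect.vocabulary_))  # ['dull', 'great']: everything else was filtered out
```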

SLIDE 6

Let's practice!

SENTIMENT ANALYSIS IN PYTHON

SLIDE 7

Capturing a token pattern

SENTIMENT ANALYSIS IN PYTHON

Violeta Misheva

Data Scientist

SLIDE 8

SENTIMENT ANALYSIS IN PYTHON

String operators and comparisons

# Checks if a string is composed only of letters
my_string.isalpha()

# Checks if a string is composed only of digits
my_string.isdigit()

# Checks if a string is composed only of alphanumeric characters
my_string.isalnum()
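Applied to a few sample tokens (the token list is made up for illustration), the three checks behave as follows:

```python
# The three string checks applied to sample tokens
tokens = ['forest8', 'forest', '8', '!']
print([t for t in tokens if t.isalpha()])  # ['forest']
print([t for t in tokens if t.isdigit()])  # ['8']
print([t for t in tokens if t.isalnum()])  # ['forest8', 'forest', '8']
```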

SLIDE 9

SENTIMENT ANALYSIS IN PYTHON

String operators with list comprehension

from nltk import word_tokenize

# Original word tokenization
word_tokens = [word_tokenize(review) for review in reviews.review]

# Keep only tokens composed of letters
cleaned_tokens = [[word for word in item if word.isalpha()]
                  for item in word_tokens]

len(word_tokens[0])
87

len(cleaned_tokens[0])
78
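A self-contained sketch of the same filtering, using `str.split()` as a stand-in for nltk's `word_tokenize` and a made-up review, shows how punctuation and number tokens are dropped:

```python
# Same nested list comprehension, with str.split() as a stand-in tokenizer
reviews = ["A 10/10 movie , truly great !", "Saw it in 2019 ; loved it"]
word_tokens = [review.split() for review in reviews]
cleaned_tokens = [[word for word in item if word.isalpha()]
                  for item in word_tokens]
print(word_tokens[0])     # ['A', '10/10', 'movie', ',', 'truly', 'great', '!']
print(cleaned_tokens[0])  # ['A', 'movie', 'truly', 'great']
```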

SLIDE 10

SENTIMENT ANALYSIS IN PYTHON

Regular expressions

import re

my_string = '#Wonderfulday'

# Extract '#' followed by one letter, lowercase or uppercase
x = re.search('#[A-Za-z]', my_string)
x
<re.Match object; span=(0, 2), match='#W'>
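Adding `+` after the character class matches one or more letters, so the whole hashtag is captured rather than a single letter (the example string is extended for illustration):

```python
import re

my_string = '#Wonderfulday for #NLP'
# '+' repeats the character class: match the full run of letters after '#'
print(re.search('#[A-Za-z]+', my_string).group())  # '#Wonderfulday'
print(re.findall('#[A-Za-z]+', my_string))         # ['#Wonderfulday', '#NLP']
```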

SLIDE 11

SENTIMENT ANALYSIS IN PYTHON

Token pattern with a BOW

# Default token pattern in CountVectorizer
r'(?u)\b\w\w+\b'

# Specify a particular token pattern
CountVectorizer(token_pattern=r'\b[^\d\W][^\d\W]+\b')
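Running the two patterns through `re.findall` on a sample sentence (assumed for illustration) shows the difference: the default keeps any token of two or more word characters, including numbers, while the custom pattern excludes digits.

```python
import re

text = 'I saw 2 movies in 2019'
# Default: 2+ word characters, so numbers survive
print(re.findall(r'\b\w\w+\b', text))            # ['saw', 'movies', 'in', '2019']
# Custom: 2+ characters that are word chars but not digits
print(re.findall(r'\b[^\d\W][^\d\W]+\b', text))  # ['saw', 'movies', 'in']
```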

SLIDE 12

Let's practice!

SENTIMENT ANALYSIS IN PYTHON

SLIDE 13

Stemming and lemmatization

SENTIMENT ANALYSIS IN PYTHON

Violeta Misheva

Data Scientist

SLIDE 14

SENTIMENT ANALYSIS IN PYTHON

What is stemming?

Stemming is the process of transforming words to their root forms, even if the stem itself is not a valid word in the language.

staying, stays, stayed  ---->  stay
house, houses, housing  ---->  hous

SLIDE 15

SENTIMENT ANALYSIS IN PYTHON

What is lemmatization?

Lemmatization is quite similar to stemming but unlike stemming, it reduces the words to roots that are valid words in the language.

stay, stays, staying, stayed  ---->  stay
house, houses, housing  ---->  house

SLIDE 16

SENTIMENT ANALYSIS IN PYTHON

Stemming vs. lemmatization

Stemming
- Produces roots of words
- Fast and efficient to compute

Lemmatization
- Produces actual words
- Slower than stemming and can depend on the part of speech

SLIDE 17

SENTIMENT ANALYSIS IN PYTHON

Stemming of strings

from nltk.stem import PorterStemmer

porter = PorterStemmer()
porter.stem('wonderful')
'wonder'

SLIDE 18

SENTIMENT ANALYSIS IN PYTHON

Non-English stemmers

Snowball Stemmer: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish

from nltk.stem.snowball import SnowballStemmer

DutchStemmer = SnowballStemmer("dutch")
DutchStemmer.stem("beginnen")
'begin'

SLIDE 19

SENTIMENT ANALYSIS IN PYTHON

How to stem a sentence?

porter.stem('Today is a wonderful day!')
'today is a wonderful day!'

tokens = word_tokenize('Today is a wonderful day!')
stemmed_tokens = [porter.stem(token) for token in tokens]
stemmed_tokens
['today', 'is', 'a', 'wonder', 'day', '!']

SLIDE 20

SENTIMENT ANALYSIS IN PYTHON

Lemmatization of a string

from nltk.stem import WordNetLemmatizer

WNlemmatizer = WordNetLemmatizer()
WNlemmatizer.lemmatize('wonderful', pos='a')
'wonderful'

SLIDE 21

Let's practice!

SENTIMENT ANALYSIS IN PYTHON

SLIDE 22

TfIdf: More ways to transform text

SENTIMENT ANALYSIS IN PYTHON

Violeta Misheva

Data Scientist

SLIDE 23

SENTIMENT ANALYSIS IN PYTHON

What are the components of TfIdf?

TF (term frequency): how often a given word appears within a document in the corpus.
IDF (inverse document frequency): log-ratio between the total number of documents and the number of documents that contain a specific word. It boosts the weight of words that do not occur frequently across documents.

SLIDE 24

SENTIMENT ANALYSIS IN PYTHON

TfIdf score of a word

TfIdf score:

TfIdf = term frequency * inverse document frequency

BOW does not account for the length of a document; TfIdf does. TfIdf is likely to capture words that are common within a document but not across documents.
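The formula above can be worked through by hand. A minimal sketch on a made-up three-document corpus, using tf = count / document length and idf = ln(N / df) (one common variant; scikit-learn applies a smoothed version of the formula):

```python
import math

# Toy corpus, assumed for illustration
docs = [['great', 'great', 'movie'],
        ['the', 'plot', 'was', 'dull'],
        ['the', 'acting', 'was', 'fine']]

def tfidf(word, doc, docs):
    tf = doc.count(word) / len(doc)          # term frequency within the document
    df = sum(word in d for d in docs)        # number of documents containing the word
    return tf * math.log(len(docs) / df)     # tf * idf

# 'the' appears in 2 of 3 documents -> low weight;
# 'great' is concentrated in one short document -> high weight
print(round(tfidf('the', docs[1], docs), 3))    # 0.101
print(round(tfidf('great', docs[0], docs), 3))  # 0.732
```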

SLIDE 25

SENTIMENT ANALYSIS IN PYTHON

How is TfIdf useful?

Twitter airline sentiment

Low TfIdf scores: United, Virgin America High TfIdf scores: check-in process (if rare across documents)

More on TfIdf

Since it penalizes frequent words, there is less need to deal with stop words explicitly. It is quite useful in search queries and information retrieval to rank the relevance of returned results.

SLIDE 26

SENTIMENT ANALYSIS IN PYTHON

TfIdf in Python

# Import the TfidfVectorizer from sklearn.feature_extraction.text import TfidfVectorizer

Arguments of TfidfVectorizer: max_features, ngram_range, stop_words, token_pattern, max_df, min_df

vect = TfidfVectorizer(max_features=100).fit(tweets.text) X = vect.transform(tweets.text)
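As a sketch on a toy corpus (the tweets are made up for illustration), `max_features` caps the vocabulary at the most frequent terms, which fixes the number of columns in the output matrix:

```python
# Toy demonstration of max_features limiting the vocabulary size
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["great flight great crew", "delayed flight", "lost luggage again"]
vect = TfidfVectorizer(max_features=3).fit(texts)
X = vect.transform(texts)
print(X.shape)                  # (3, 3): 3 documents, 3 retained features
print(sorted(vect.vocabulary_))
```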

SLIDE 27

SENTIMENT ANALYSIS IN PYTHON

TfidfVectorizer

X
<14640x100 sparse matrix of type '<class 'numpy.float64'>'
    with 119182 stored elements in Compressed Sparse Row format>

X_df = pd.DataFrame(X.toarray(), columns=vect.get_feature_names())
X_df.head()

SLIDE 28

Let's practice!

SENTIMENT ANALYSIS IN PYTHON