Stop words

Stop words: Sentiment Analysis in Python (Violeta Misheva)


  1. Stop words. Sentiment Analysis in Python. Violeta Misheva, Data Scientist

  2. What are stop words and how to find them?

     - Stop words: words that occur too frequently and are not considered informative
     - Lists of stop words exist for most languages: {'the', 'a', 'an', 'and', 'but', 'for', 'on', 'in', 'at' ...}
     - Context matters: in a corpus of movie reviews, words such as {'movie', 'movies', 'film', 'films', 'cinema'} can also act as stop words

  3. Stop words with word clouds

     [Figure: two word clouds, one without removing stop words and one with stop words removed]

  4. Remove stop words from word clouds

     # Import libraries
     import matplotlib.pyplot as plt
     from wordcloud import WordCloud, STOPWORDS

     # Define the stopwords list
     my_stopwords = set(STOPWORDS)
     my_stopwords.update(["movie", "movies", "film", "films", "watch", "br"])

     # Generate and show the word cloud (name_string: the text to plot, assumed already defined)
     my_cloud = WordCloud(background_color='white', stopwords=my_stopwords).generate(name_string)
     plt.imshow(my_cloud, interpolation='bilinear')
     plt.show()

  5. Stop words with BOW

     from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

     # Define the stop words; recent scikit-learn versions expect a list here, not a set
     my_stop_words = list(ENGLISH_STOP_WORDS.union(['film', 'movie', 'cinema', 'theatre']))

     vect = CountVectorizer(stop_words=my_stop_words)
     vect.fit(movies.review)
     X = vect.transform(movies.review)
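To make the effect of a custom stop-word list concrete, here is a minimal pure-Python sketch of a bag-of-words count that skips stop words. It is not how `CountVectorizer` is implemented; the tiny base list, the `bag_of_words` helper, and the sample review are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Tiny illustrative base list (an assumption; real stop-word lists are much longer)
BASE_STOP_WORDS = {"the", "a", "an", "and", "but", "for", "on", "in", "at", "is", "was", "this"}

# Domain-specific additions, mirroring the movie-review example on the slide
my_stop_words = BASE_STOP_WORDS | {"film", "movie", "cinema", "theatre"}


def bag_of_words(text, stop_words):
    """Count lower-cased tokens of two or more word characters, skipping stop words."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(token for token in tokens if token not in stop_words)


# Hypothetical review text
review = "This movie was a wonderful film and the cast was wonderful"
counts = bag_of_words(review, my_stop_words)
print(counts)
```

Only the words that carry some signal about the review ("wonderful", "cast") remain in the counts; the generic and domain-specific stop words are dropped.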

  6. Let's practice!

  7. Capturing a token pattern

  8. String operators and comparisons

     # Checks if a string is composed only of letters
     my_string.isalpha()

     # Checks if a string is composed only of digits
     my_string.isdigit()

     # Checks if a string is composed only of alphanumeric characters
     my_string.isalnum()

  9. String operators with list comprehension

     from nltk.tokenize import word_tokenize

     # Original word tokenization
     word_tokens = [word_tokenize(review) for review in reviews.review]

     # Keeping only tokens composed of letters
     cleaned_tokens = [[word for word in item if word.isalpha()] for item in word_tokens]

     len(word_tokens[0])
     87
     len(cleaned_tokens[0])
     78
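The three string methods differ in ways that are easy to mix up, so a small comparison may help. The sketch below uses made-up sample strings and plain `str.split()` instead of `word_tokenize`, purely so that it runs without NLTK data.

```python
# Hypothetical sample strings, chosen to show where the three checks disagree
samples = ["movie", "2019", "b2b", "great!", ""]

# For each string: (isalpha, isdigit, isalnum)
checks = {s: (s.isalpha(), s.isdigit(), s.isalnum()) for s in samples}
for s, result in checks.items():
    print(repr(s), result)

# Keep only purely alphabetic tokens (whitespace tokenization is a simplification)
tokens = "I rated it 10 out of 10 !".split()
letters_only = [t for t in tokens if t.isalpha()]
print(letters_only)
```

Note that a mixed token such as "b2b" passes `isalnum()` but neither `isalpha()` nor `isdigit()`, and that the empty string fails all three.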

  10. Regular expressions

      import re

      my_string = '#Wonderfulday'

      # Extract #, followed by any letter, small or capital
      x = re.search('#[A-Za-z]', my_string)
      x
      <re.Match object; span=(0, 2), match='#W'>

  11. Token pattern with a BOW

      # Default token pattern in CountVectorizer: tokens of two or more word characters
      r'(?u)\b\w\w+\b'

      # Specify a particular token pattern: two or more word characters, excluding digits
      CountVectorizer(token_pattern=r'\b[^\d\W][^\d\W]+\b')
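The difference between the two patterns is easiest to see by applying them directly with `re.findall`. This is only a sketch of what the patterns match on a made-up sentence; `CountVectorizer` additionally lower-cases the text by default, which is not reproduced here.

```python
import re

# Two or more word characters (letters, digits, underscore)
DEFAULT_PATTERN = r"\b\w\w+\b"
# Two or more word characters that are not digits
LETTERS_ONLY_PATTERN = r"\b[^\d\W][^\d\W]+\b"

# Hypothetical review sentence
text = "I rated it 10 out of 10 in 1999"

default_tokens = re.findall(DEFAULT_PATTERN, text)
letters_tokens = re.findall(LETTERS_ONLY_PATTERN, text)

print(default_tokens)
print(letters_tokens)
```

Both patterns drop the one-character token "I"; only the second also drops the numeric tokens.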

  12. Let's practice!

  13. Stemming and lemmatization

  14. What is stemming?

      Stemming is the process of transforming words to their root forms, even if the stem itself is not a valid word in the language.

      staying, stays, stayed ----> stay
      house, houses, housing ----> hous

  15. What is lemmatization?

      Lemmatization is quite similar to stemming but, unlike stemming, it reduces words to roots that are valid words in the language.

      stay, stays, staying, stayed ----> stay
      house, houses, housing ----> house

  16. Stemming vs. lemmatization

      Stemming                      | Lemmatization
      Produces roots of words       | Produces actual words
      Fast and efficient to compute | Slower than stemming and can depend on the part-of-speech

  17. Stemming of strings

      from nltk.stem import PorterStemmer

      porter = PorterStemmer()
      porter.stem('wonderful')
      'wonder'

  18. Non-English stemmers

      Snowball Stemmer: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish

      from nltk.stem.snowball import SnowballStemmer

      DutchStemmer = SnowballStemmer("dutch")
      DutchStemmer.stem("beginen")
      'begin'

  19. How to stem a sentence?

      from nltk.tokenize import word_tokenize

      # Stemming the whole sentence at once only lower-cases it
      porter.stem('Today is a wonderful day!')
      'today is a wonderful day!'

      # Tokenize first, then stem each token
      tokens = word_tokenize('Today is a wonderful day!')
      stemmed_tokens = [porter.stem(token) for token in tokens]
      stemmed_tokens
      ['today', 'is', 'a', 'wonder', 'day', '!']

  20. Lemmatization of a string

      from nltk.stem import WordNetLemmatizer

      WNlemmatizer = WordNetLemmatizer()
      WNlemmatizer.lemmatize('wonderful', pos='a')
      'wonderful'
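The contrast between the two approaches can be sketched without NLTK. The code below is a deliberately naive illustration, not the Porter algorithm and not WordNet: `naive_stem` strips a fixed list of suffixes, and `naive_lemma` looks words up in a tiny hand-written dictionary. Both helpers, the suffix list, and the dictionary are assumptions made up for this example.

```python
# Suffixes are tried in this order; the first match is stripped (toy rule set, not Porter)
SUFFIXES = ("ing", "es", "ed", "e", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hand-written lookup table standing in for a real lexical database
LEMMAS = {
    "stays": "stay", "staying": "stay", "stayed": "stay",
    "houses": "house", "housing": "house",
}


def naive_lemma(word):
    """Return the dictionary form if known, otherwise the word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)


words = ["stay", "stays", "staying", "stayed", "house", "houses", "housing"]
stems = [naive_stem(w) for w in words]
lemmas = [naive_lemma(w) for w in words]
print(stems)
print(lemmas)
```

The rule-based stemmer is cheap but yields "hous", which is not a word, while the lookup-based lemmatizer returns "house" at the cost of needing a dictionary.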

  21. Let's practice!

  22. TfIdf: More ways to transform text

  23. What are the components of TfIdf?

      - TF (term frequency): how often a given word appears within a document in the corpus
      - IDF (inverse document frequency): log-ratio between the total number of documents and the number of documents that contain a specific word
      - IDF is used to calculate the weight of words that do not occur frequently across documents

  24. TfIdf score of a word

      TfIdf score:

      TfIdf = term frequency * inverse document frequency

      - BOW does not account for the length of a document; TfIdf does.
      - TfIdf is likely to capture words that are common within a document but not across documents.

  25. How is TfIdf useful?

      Twitter airline sentiment:
      - Low TfIdf scores: United, Virgin America
      - High TfIdf scores: check-in process (if rare across documents)

      More on TfIdf:
      - Since it penalizes frequent words, there is less need to deal with stop words explicitly.
      - Quite useful in search queries and information retrieval to rank the relevance of returned results.
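The airline example can be worked through by hand. The sketch below applies the textbook formula from the slides (term frequency normalized by document length, times the log-ratio IDF) to three made-up tweets. It is not what `TfidfVectorizer` computes by default: scikit-learn smooths the IDF and L2-normalizes each row, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical mini-corpus of airline tweets
docs = [
    "united delayed my flight",
    "united lost my bag",
    "united check in process was slow",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each word
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def tfidf(word, tokens):
    """Textbook TfIdf: (count / document length) * log(N / document frequency)."""
    tf = tokens.count(word) / len(tokens)
    idf = math.log(n_docs / doc_freq[word])
    return tf * idf


# Score two words within the third tweet
score_united = tfidf("united", tokenized[2])
score_process = tfidf("process", tokenized[2])
print(score_united, score_process)
```

Because "united" appears in every tweet, its IDF is log(1) = 0 and its score vanishes, whereas "process" appears in only one tweet and keeps a positive score.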

  26. TfIdf in Python

      # Import the TfidfVectorizer
      from sklearn.feature_extraction.text import TfidfVectorizer

      Arguments of TfidfVectorizer: max_features, ngram_range, stop_words, token_pattern, max_df, min_df

      vect = TfidfVectorizer(max_features=100).fit(tweets.text)
      X = vect.transform(tweets.text)

  27. TfidfVectorizer

      X
      <14640x100 sparse matrix of type '<class 'numpy.float64'>'
          with 119182 stored elements in Compressed Sparse Row format>

      import pandas as pd

      # get_feature_names_out() replaces get_feature_names() in scikit-learn 1.0 and later
      X_df = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())
      X_df.head()

  28. Let's practice!
