Bag of words

Bag-of-words, Sentiment Analysis in Python, Violeta Misheva - PowerPoint PPT Presentation



  1. Bag-of-words. Sentiment Analysis in Python. Violeta Misheva, Data Scientist

  2. What is a bag-of-words (BOW)?
     - Describes the occurrence of words within a document or a collection of documents (corpus)
     - Builds a vocabulary of the words and a measure of their presence

  3. Amazon product reviews

  4. Sentiment analysis with BOW: Example
     This is the best book ever. I loved the book and highly recommend it!!!
     {'This': 1, 'is': 1, 'the': 2, 'best': 1, 'book': 2, 'ever': 1, 'I': 1, 'loved': 1, 'and': 1, 'highly': 1, 'recommend': 1, 'it': 1}
     Word order and grammar rules are lost!
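
The counts in this example can be reproduced with the standard library alone. This is a minimal sketch, not how `CountVectorizer` works internally: it assumes a simple regular-expression tokenizer that drops punctuation and keeps the original case, as the dictionary on the slide does.

```python
import re
from collections import Counter

review = "This is the best book ever. I loved the book and highly recommend it!!!"

# Assumption: a "word" is any run of word characters; punctuation is dropped
# and case is preserved, matching the dictionary shown on the slide.
tokens = re.findall(r"\w+", review)

# Counter maps each distinct token to the number of times it occurs
bow = Counter(tokens)

print(bow["the"], bow["best"], bow["book"])  # prints: 2 1 2
```

The order of the words plays no role in `bow`, which is exactly the information a bag-of-words throws away.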

  5. BOW end result
     The output will look something like this: [figure not preserved]

  6. CountVectorizer function
     import pandas as pd
     from sklearn.feature_extraction.text import CountVectorizer
     # Keep only the 1000 most frequent words in the vocabulary
     vect = CountVectorizer(max_features=1000)
     # Learn the vocabulary from the review column
     vect.fit(data.review)
     # Turn each review into a row of word counts
     X = vect.transform(data.review)

  7. CountVectorizer output
     X
     <10000x1000 sparse matrix of type '<class 'numpy.int64'>'
         with 406668 stored elements in Compressed Sparse Row format>

  8. Transforming the vectorizer
     # Transform to an array
     my_array = X.toarray()
     # Transform back to a dataframe, assign column names
     # (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
     X_df = pd.DataFrame(my_array, columns=vect.get_feature_names_out())
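
To make the shape of the resulting table concrete, here is a standard-library sketch of a document-term matrix: one row per document, one column per vocabulary word. The three-document corpus is made up for illustration, and the tokenizer is a simplification. Unlike `CountVectorizer` with default settings, it keeps one-letter words such as "i".

```python
import re
from collections import Counter

# Hypothetical toy corpus, not the course data
docs = [
    "I loved the book",
    "the book was boring",
    "loved it",
]

# Lowercase and tokenize each document, then count its words
counts = [Counter(re.findall(r"\w+", doc.lower())) for doc in docs]

# Vocabulary: every distinct word, sorted so the column order is stable
vocabulary = sorted(set().union(*counts))

# One row per document, one column per vocabulary word
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Most entries are zero even in this tiny example, which is why the vectorizer returns a sparse matrix rather than a dense array.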

  9. Let's practice!

  10. Getting granular with n-grams

  11. Context matters
      I am happy, not sad.
      I am sad, not happy.
      Putting 'not' in front of a word (negation) is one example of how context matters.
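
A short standard-library check shows why this matters for a bag-of-words: the two sentences contain exactly the same words, so their unigram counts are identical, and only pairs of adjacent words tell them apart. The `tokens` helper is a hypothetical regex tokenizer used only for this sketch.

```python
import re
from collections import Counter

def tokens(text):
    # Hypothetical helper: lowercase the text and keep runs of word characters
    return re.findall(r"\w+", text.lower())

s1 = tokens("I am happy, not sad.")
s2 = tokens("I am sad, not happy.")

# Unigram bags are identical: same words, same counts
same_unigrams = Counter(s1) == Counter(s2)

# Bigrams pair each token with its successor, so word order is partly kept
bigrams1 = set(zip(s1, s1[1:]))
bigrams2 = set(zip(s2, s2[1:]))

print(same_unigrams)          # prints: True
print(bigrams1 == bigrams2)   # prints: False
```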

  12. Capturing context with a BOW
      - Unigrams: single tokens
      - Bigrams: pairs of tokens
      - Trigrams: triples of tokens
      - n-grams: sequence of n tokens

  13. Capturing context with BOW
      The weather today is wonderful.
      - Unigrams: { The, weather, today, is, wonderful }
      - Bigrams: { The weather, weather today, today is, is wonderful }
      - Trigrams: { The weather today, weather today is, today is wonderful }
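
The sets above can be generated with a sliding window over the token list. The `ngrams` function below is a hypothetical helper written for this sketch, not part of scikit-learn or NLTK.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with spaces."""
    # The window starts at positions 0 .. len(tokens) - n
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "The weather today is wonderful".split()

for n in (1, 2, 3):
    print(n, ngrams(tokens, n))
```

A sentence of k tokens yields k - n + 1 n-grams, so this five-word sentence gives 5 unigrams, 4 bigrams, and 3 trigrams.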

  14. n-grams with the CountVectorizer
      from sklearn.feature_extraction.text import CountVectorizer
      vect = CountVectorizer(ngram_range=(min_n, max_n))
      # Only unigrams
      ngram_range=(1, 1)
      # Uni- and bigrams
      ngram_range=(1, 2)

  15. What is the best n?
      A longer sequence of tokens:
      - Results in more features
      - Higher precision of machine learning models
      - Risk of overfitting

  16. Specifying vocabulary size
      CountVectorizer(max_features, max_df, min_df)
      - max_features: if specified, only the most frequent words (up to that number) are included in the vocabulary
        - If max_features=None, all words are included
      - max_df: ignore terms that appear in more documents than the specified threshold
        - An integer is an absolute count; a float is a proportion of documents
        - Default is 1.0, which means no terms are ignored
      - min_df: ignore terms that appear in fewer documents than the specified threshold
        - An integer is an absolute count; a float is a proportion of documents
        - Default is 1, which means no terms are ignored
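
The effect of these three parameters can be sketched in plain Python. This is an illustration of the semantics described on the slide, not scikit-learn's implementation; `build_vocabulary` and the four-sentence corpus are hypothetical.

```python
import re
from collections import Counter

def build_vocabulary(docs, max_features=None, max_df=1.0, min_df=1):
    """Hypothetical helper mimicking the vocabulary filters described above."""
    tokenized = [re.findall(r"\w+", d.lower()) for d in docs]
    n_docs = len(tokenized)

    # Total occurrences of each term across the corpus
    term_counts = Counter(t for doc in tokenized for t in doc)
    # Number of documents each term appears in (document frequency)
    doc_freq = Counter(t for doc in tokenized for t in set(doc))

    # Integers are absolute document counts; floats are proportions of documents
    max_count = max_df if isinstance(max_df, int) else max_df * n_docs
    min_count = min_df if isinstance(min_df, int) else min_df * n_docs

    kept = [t for t in doc_freq if min_count <= doc_freq[t] <= max_count]

    # Most frequent terms first; ties broken alphabetically (an assumption)
    kept.sort(key=lambda t: (-term_counts[t], t))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)

docs = [
    "the book is great",
    "the book is boring",
    "the plot is great",
    "the ending surprised me",
]

print(build_vocabulary(docs, min_df=2))               # terms in at least 2 documents
print(build_vocabulary(docs, min_df=2, max_df=0.75))  # also drop terms in more than 75% of documents
print(build_vocabulary(docs, max_features=2))         # the two most frequent terms
```

Raising `min_df` removes rare words, lowering `max_df` removes words that appear almost everywhere, and `max_features` caps whatever remains.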

  17. Let's practice!

  18. Build new features from text

  19. Goal of the video
      Goal: Enrich the existing dataset with features related to the text column (capturing the sentiment)

  20. Product reviews data
      reviews.head()

  21. Features from the review column
      - How long is each review?
      - How many sentences does it contain?
      - What parts of speech are involved?
      - How many punctuation marks?

  22. Tokenizing a string
      from nltk import word_tokenize
      anna_k = 'Happy families are all alike, every unhappy family is unhappy in its own way.'
      word_tokenize(anna_k)
      ['Happy', 'families', 'are', 'all', 'alike', ',', 'every', 'unhappy', 'family', 'is', 'unhappy', 'in', 'its', 'own', 'way', '.']

  23. Tokens from a column
      # General form of a list comprehension
      [expression for item in iterable]
      word_tokens = [word_tokenize(review) for review in reviews.review]
      type(word_tokens)
      list
      type(word_tokens[0])
      list

  24. Tokens from a column
      len_tokens = []
      # Iterate over the word_tokens list
      for i in range(len(word_tokens)):
          len_tokens.append(len(word_tokens[i]))
      # Create a new feature for the length of each review
      reviews['n_tokens'] = len_tokens
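
The same length feature can be written as a single list comprehension. The sketch below is self-contained, so it swaps in a hypothetical regex tokenizer (`simple_tokenize`) for `word_tokenize` and a plain list of made-up strings for the `reviews` dataframe. The regex splits words and punctuation marks into separate tokens but will not match NLTK's output in every case (contractions, for example).

```python
import re

# Hypothetical reviews standing in for the reviews.review column
reviews = [
    "Great book, loved it!",
    "Not my favorite.",
]

def simple_tokenize(text):
    # Assumption: a token is a run of word characters or one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

# One list of tokens per review
word_tokens = [simple_tokenize(review) for review in reviews]

# Length of each token list, without an explicit loop or index
n_tokens = [len(tokens) for tokens in word_tokens]

print(n_tokens)  # prints: [6, 4]
```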

  25. Dealing with punctuation
      - We did not handle punctuation here, but you can exclude it from the tokens
      - You can also build a feature that measures the number of punctuation marks
      - A review with many punctuation marks could signal a very emotionally charged opinion
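
One way to build such a punctuation feature, sketched with the standard library. It assumes that "punctuation" means the ASCII characters in `string.punctuation`, and the three reviews are made up for illustration.

```python
import string

# Hypothetical reviews, not the course data
reviews = [
    "Best book ever!!!",
    "It was fine.",
    "Why, oh why?!",
]

# Count the characters of each review that are ASCII punctuation marks
n_punct = [sum(ch in string.punctuation for ch in review) for review in reviews]

print(n_punct)  # prints: [3, 1, 3]
```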

  26. Reviews with a feature for the length
      reviews.head()

  27. Let's practice!

  28. Can you guess the language?

  29. Language of a string in Python
      from langdetect import detect_langs
      foreign = 'Este libro ha sido uno de los mejores libros que he leido.'
      detect_langs(foreign)
      [es:0.9999945352697024]

  30. Language of a column
      Problem: Detect the language of each of the strings and capture the most likely language in a new column
      import pandas as pd
      from langdetect import detect_langs
      reviews = pd.read_csv('product_reviews.csv')
      reviews.head()

  31. Building a feature for the language
      languages = []
      # The review text is in the second column (index 1)
      for row in range(len(reviews)):
          languages.append(detect_langs(reviews.iloc[row, 1]))
      languages
      [it:0.9999982541301151],
      [es:0.9999954153640488],
      [es:0.7142833997345875, en:0.2857160465706441],
      [es:0.9999942365605781],
      [es:0.999997956049055] ...

  32. Building a feature for the language
      # Transform the first list to a string and split on a colon
      str(languages[0]).split(':')
      ['[it', '0.9999982541301151]']
      str(languages[0]).split(':')[0]
      '[it'
      str(languages[0]).split(':')[0][1:]
      'it'

  33. Building a feature for the language
      languages = [str(lang).split(':')[0][1:] for lang in languages]
      reviews['language'] = languages
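
The string handling in these steps can be tried without installing `langdetect`. The sketch below works on hard-coded strings copied from the output shown earlier in the deck, so no detection is actually run. `top_language` is a hypothetical helper that also keeps the probability of the first entry, which the slides treat as the most likely language.

```python
# Strings as they appear when a detect_langs result is converted with str()
raw = [
    "[it:0.9999982541301151]",
    "[es:0.7142833997345875, en:0.2857160465706441]",
]

def top_language(text):
    # Drop the surrounding brackets and keep the first 'code:probability' entry
    first = text.strip("[]").split(",")[0]
    # partition splits on the first colon only
    code, _, prob = first.partition(":")
    return code.strip(), float(prob)

parsed = [top_language(r) for r in raw]
languages = [code for code, _ in parsed]

print(languages)  # prints: ['it', 'es']
```

When working with `langdetect` directly, the objects returned by `detect_langs` should also expose the code and probability as `lang` and `prob` attributes, which would avoid string parsing altogether.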

  34. Let's practice!
