  1. Building tf-idf document vectors
     FEATURE ENGINEERING FOR NLP IN PYTHON
     Rounak Banik, Data Scientist

  2. n-gram modeling
     Weight of a dimension is dependent on the frequency of the word corresponding to that dimension.
     A document contains the word human in five places.
     The dimension corresponding to human has weight 5.
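As a minimal sketch of this count-based weighting (the toy corpus is invented), scikit-learn's CountVectorizer assigns each dimension the raw frequency of its word:

```python
from sklearn.feature_extraction.text import CountVectorizer

# A toy document mentioning "human" in five places
corpus = ["human human human human human survey of computer systems"]

vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(corpus)

# The dimension corresponding to "human" gets weight 5
human_weight = bow_matrix.toarray()[0][vectorizer.vocabulary_["human"]]
print(human_weight)  # 5
```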

  3. Motivation
     Some words occur very commonly across all documents.
     Consider a corpus of documents on the universe:
       One document has jupiter and universe occurring 20 times each.
       jupiter rarely occurs in the other documents; universe is common.
     Give more weight to jupiter on account of its exclusivity.

  4. Applications
     Automatically detect stopwords
     Search
     Recommender systems
     Better performance in predictive modeling for some cases

  5. Term frequency-inverse document frequency
     Proportional to term frequency
     Inverse function of the number of documents in which the term occurs

  6. Mathematical formula
     w_(i,j) = tf_(i,j) · log(N / df_i)
     w_(i,j) → weight of term i in document j

  7. Mathematical formula
     w_(i,j) = tf_(i,j) · log(N / df_i)
     w_(i,j) → weight of term i in document j
     tf_(i,j) → term frequency of term i in document j

  8. Mathematical formula
     w_(i,j) = tf_(i,j) · log(N / df_i)
     w_(i,j) → weight of term i in document j
     tf_(i,j) → term frequency of term i in document j
     N → number of documents in the corpus
     df_i → number of documents containing term i

  9. Mathematical formula
     w_(i,j) = tf_(i,j) · log(N / df_i)
     w_(i,j) → weight of term i in document j
     tf_(i,j) → term frequency of term i in document j
     N → number of documents in the corpus
     df_i → number of documents containing term i
     Example: w_(library, document) = 5 · log(20 / 8) ≈ 2
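The worked example above can be checked numerically (using the base-10 logarithm, which matches the result of roughly 2):

```python
import math

tf = 5    # "library" appears 5 times in the document
N = 20    # 20 documents in the corpus
df = 8    # 8 documents contain "library"

weight = tf * math.log10(N / df)
print(round(weight, 2))  # 1.99
```

Note that scikit-learn's TfidfVectorizer uses a smoothed, natural-log variant of idf by default, so its weights will not match this textbook formula exactly.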

  10. tf-idf using scikit-learn

      # Import TfidfVectorizer
      from sklearn.feature_extraction.text import TfidfVectorizer

      # Create TfidfVectorizer object
      vectorizer = TfidfVectorizer()

      # Generate matrix of word vectors
      tfidf_matrix = vectorizer.fit_transform(corpus)
      print(tfidf_matrix.toarray())

      [[0.         0.         0.         0.         0.25434658 0.33443519
        0.33443519 0.         0.25434658 0.         0.25434658 0.
        0.76303975]
       [0.         0.46735098 0.         0.46735098 0.         0.
        0.         0.46735098 0.         0.46735098 0.35543247 0.
        0.        ]
       ...

  11. Let's practice!

  12. Cosine similarity

  13. [Image courtesy techninpink.com]

  14. The dot product
      Consider two vectors:
        V = (v_1, v_2, ⋯, v_n), W = (w_1, w_2, ⋯, w_n)
      Then the dot product of V and W is:
        V · W = (v_1 × w_1) + (v_2 × w_2) + ⋯ + (v_n × w_n)
      Example: A = (4, 7, 1), B = (5, 2, 3)
        A · B = (4 × 5) + (7 × 2) + (1 × 3) = 20 + 14 + 3 = 37
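The dot product above can be reproduced in a couple of lines with NumPy:

```python
import numpy as np

A = np.array([4, 7, 1])
B = np.array([5, 2, 3])

# Dot product: sum of element-wise products
print(np.dot(A, B))  # 37
```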

  15. Magnitude of a vector
      For any vector V = (v_1, v_2, ⋯, v_n), the magnitude is defined as:
        ||V|| = √(v_1² + v_2² + ⋯ + v_n²)
      Example: A = (4, 7, 1)
        ||A|| = √(4² + 7² + 1²) = √(16 + 49 + 1) = √66

  16. The cosine score
      A = (4, 7, 1), B = (5, 2, 3)
      The cosine score:
        cos(A, B) = (A · B) / (||A|| · ||B||)
                  = 37 / (√66 × √38)
                  ≈ 0.7388
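The same score, computed directly from the definition:

```python
import numpy as np

A = np.array([4, 7, 1])
B = np.array([5, 2, 3])

# cos(A, B) = (A · B) / (||A|| · ||B||)
cosine = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(round(cosine, 4))  # 0.7388
```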

  17. Cosine score: points to remember
      Value between -1 and 1.
      In NLP, value between 0 and 1 (term frequencies cannot be negative).
      Robust to document length.
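To illustrate the robustness to document length: scaling a vector, as when a document's text is simply repeated, leaves the cosine score unchanged:

```python
import numpy as np

v = np.array([4, 7, 1])
w = 3 * v  # same "content", three times the length

cosine = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
print(round(cosine, 4))  # 1.0
```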

  18. Implementation using scikit-learn

      # Import cosine_similarity
      from sklearn.metrics.pairwise import cosine_similarity

      # Define two 3-dimensional vectors A and B
      A = (4, 7, 1)
      B = (5, 2, 3)

      # Compute the cosine score of A and B
      score = cosine_similarity([A], [B])

      # Print the cosine score
      print(score)

      array([[ 0.73881883]])

  19. Let's practice!

  20. Building a plot line based recommender

  21. Movie recommender

      Title                      Overview
      Shanghai Triad             A provincial boy related to a Shanghai crime family is recruited by his uncle into cosmopolitan Shanghai in the 1930s to be a servant to a ganglord's mistress.
      Cry, the Beloved Country   A South African preacher goes to search for his wayward son who has committed a crime in the big city.

  22. Movie recommender

      get_recommendations("The Godfather")

      1178               The Godfather: Part II
      44030    The Godfather Trilogy: 1972-1990
      1914              The Godfather: Part III
      23126                          Blood Ties
      11297                    Household Saints
      34717                   Start Liquidation
      10821                            Election
      38030                          Goodfellas
      17729                   Short Sharp Shock
      26293                   Beck 28 - Familjen
      Name: title, dtype: object

  23. Steps
      1. Text preprocessing
      2. Generate tf-idf vectors
      3. Generate cosine similarity matrix

  24. The recommender function
      1. Take a movie title, cosine similarity matrix and indices series as arguments.
      2. Extract pairwise cosine similarity scores for the movie.
      3. Sort the scores in descending order.
      4. Output titles corresponding to the highest scores.
      5. Ignore the highest similarity score (of 1).
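The steps above can be sketched as follows. This is an illustrative implementation, not the course's exact code: movie_titles is assumed to be a pandas Series of titles, cosine_sim the precomputed similarity matrix, and indices a title-to-row-index mapping.

```python
import numpy as np
import pandas as pd

def get_recommendations(title, cosine_sim, indices, movie_titles, n=10):
    # Steps 1-2: locate the movie's row, extract its pairwise scores
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Step 3: sort the scores in descending order
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Step 5: skip the first entry (the movie itself, score 1)
    sim_scores = sim_scores[1:n + 1]
    # Step 4: return the titles with the highest scores
    movie_indices = [i for i, _ in sim_scores]
    return movie_titles.iloc[movie_indices]

# Toy usage with a 3-movie corpus
titles = pd.Series(['A', 'B', 'C'])
indices = pd.Series(range(3), index=titles)
sim = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.3],
                [0.1, 0.3, 1.0]])
print(get_recommendations('A', sim, indices, titles).tolist())  # ['B', 'C']
```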

  25. Generating tf-idf vectors

      # Import TfidfVectorizer
      from sklearn.feature_extraction.text import TfidfVectorizer

      # Create TfidfVectorizer object
      vectorizer = TfidfVectorizer()

      # Generate matrix of tf-idf vectors
      tfidf_matrix = vectorizer.fit_transform(movie_plots)

  26. Generating cosine similarity matrix

      # Import cosine_similarity
      from sklearn.metrics.pairwise import cosine_similarity

      # Generate cosine similarity matrix
      cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

      array([[1.        , 0.27435345, 0.23092036, ..., 0.        , 0.        , 0.00758112],
             [0.27435345, 1.        , 0.1246955 , ..., 0.        , 0.        , 0.00740494],
             ...,
             [0.00758112, 0.00740494, 0.        , ..., 0.        , 0.        , 1.        ]])

  27. The linear_kernel function
      The magnitude of every tf-idf vector is 1 (TfidfVectorizer L2-normalizes its output by default).
      The cosine score between two tf-idf vectors is therefore just their dot product.
      Can significantly improve computation time.
      Use linear_kernel instead of cosine_similarity.
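A quick sketch (the corpus is invented) confirming that tf-idf rows have unit norm, so linear_kernel and cosine_similarity agree:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

corpus = ['The sun is a star',
          'Jupiter is a planet',
          'The sun rises in the east']

tfidf_matrix = TfidfVectorizer().fit_transform(corpus)

# Every row has unit L2 norm...
print(np.linalg.norm(tfidf_matrix.toarray(), axis=1))  # [1. 1. 1.]

# ...so the dot product (linear_kernel) equals the cosine score
print(np.allclose(cosine_similarity(tfidf_matrix),
                  linear_kernel(tfidf_matrix)))  # True
```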

  28. Generating cosine similarity matrix

      # Import linear_kernel
      from sklearn.metrics.pairwise import linear_kernel

      # Generate cosine similarity matrix
      cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

      array([[1.        , 0.27435345, 0.23092036, ..., 0.        , 0.        , 0.00758112],
             [0.27435345, 1.        , 0.1246955 , ..., 0.        , 0.        , 0.00740494],
             ...,
             [0.00758112, 0.00740494, 0.        , ..., 0.        , 0.        , 1.        ]])

  29. The get_recommendations function

      get_recommendations('The Lion King', cosine_sim, indices)

      7782                       African Cats
      5877    The Lion King 2: Simba's Pride
      4524                          Born Free
      2719                           The Bear
      4770    Once Upon a Time in China III
      7070                         Crows Zero
      739                   The Wizard of Oz
      8926                    The Jungle Book
      1749                 Shadow of a Doubt
      7993                       October Baby
      Name: title, dtype: object

  30. Let's practice!

  31. Beyond n-grams: word embeddings

  32. The problem with BoW and tf-idf
      'I am happy'
      'I am joyous'
      'I am sad'
      BoW and tf-idf treat words as independent dimensions, so they cannot tell that happy and joyous are closer in meaning than happy and sad.
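A small sketch of the problem: with count vectors, each pair of these sentences gets the same cosine score, even though two of them are near-synonyms.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ['I am happy', 'I am joyous', 'I am sad']

bow = CountVectorizer().fit_transform(corpus)
sim = cosine_similarity(bow)

# happy/joyous and happy/sad look equally (dis)similar
print(round(sim[0, 1], 2), round(sim[0, 2], 2))  # 0.5 0.5
```

(CountVectorizer's default tokenizer drops single-character tokens, so 'I' does not appear in the vocabulary.)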

  33. Word embeddings
      Mapping words into an n-dimensional vector space
      Produced using deep learning and huge amounts of data
      Discern how similar two words are to each other
      Used to detect synonyms and antonyms
      Captures complex relationships:
        King - Queen → Man - Woman
        France - Paris → Russia - Moscow
      Dependent on the spaCy model; independent of the dataset you use

  34. Word embeddings using spaCy

      import spacy

      # Load model and create Doc object
      nlp = spacy.load('en_core_web_lg')
      doc = nlp('I am happy')

      # Generate word vectors for each token
      for token in doc:
          print(token.vector)

      [-1.0747459e+00  4.8677087e-02  5.6630421e+00  1.6680446e+00
       -1.3194644e+00 -1.5142369e+00  1.1940931e+00 -3.0168812e+00
      ...
