Building tf-idf document vectors
FEATURE ENGINEERING FOR NLP IN PYTHON
Rounak Banik
Data Scientist

n-gram modeling
The weight of a dimension depends on the frequency of the word corresponding to that dimension.
A document contains the word human in five places.
The dimension corresponding to human has weight 5.
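As a minimal sketch of the count-based weighting described above, the snippet below counts word occurrences in one document with the standard library. The sample sentence and the regular-expression tokenizer are assumptions made for illustration; they are not part of the course material.

```python
import re
from collections import Counter

# Hypothetical document in which "human" appears five times
document = "the human body and the human mind make a human being human, only human"

# Naive tokenizer (an assumption): lowercase, then keep runs of word characters
tokens = re.findall(r"\w+", document.lower())

# Each distinct word is one dimension; its weight is its raw count
weights = Counter(tokens)
human_weight = weights["human"]
print(human_weight)
```

Here the dimension for human gets weight 5, matching the slide's example.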

Motivation
Some words occur very commonly across all documents.
Consider a corpus of documents on the universe.
One document has jupiter and universe occurring 20 times each.
jupiter rarely occurs in the other documents; universe is common.
Give more weight to jupiter on account of its exclusivity.

Applications
Automatically detect stop words
Search
Recommender systems
Better performance in predictive modeling for some cases

Term frequency-inverse document frequency
The weight of a term is proportional to its term frequency.
It is an inverse function of the number of documents in which the term occurs.

Mathematical formula
$w_{i,j} = tf_{i,j} \cdot \log\left(\frac{N}{df_i}\right)$
$w_{i,j}$ → weight of term i in document j
$tf_{i,j}$ → term frequency of term i in document j
$N$ → number of documents in the corpus
$df_i$ → number of documents containing term i
Example:
$w_{library, document} = 5 \cdot \log\left(\frac{20}{8}\right) \approx 2$
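The formula can be checked numerically with a short sketch. The helper name `tfidf_weight` is hypothetical. A base-10 logarithm is assumed because it is the base that reproduces the slide's result of about 2. The second example revisits the jupiter/universe motivation with assumed corpus numbers: 10 documents, universe in all of them, jupiter in only one.

```python
import math

def tfidf_weight(tf, n_docs, df):
    """Weight of a term: tf * log10(N / df). Base 10 is an assumption."""
    return tf * math.log10(n_docs / df)

# Slide example: tf = 5, N = 20, df = 8
w_library = tfidf_weight(5, 20, 8)

# Hypothetical numbers for the motivation slide: both terms occur 20 times
w_universe = tfidf_weight(20, 10, 10)  # term present in every document
w_jupiter = tfidf_weight(20, 10, 1)    # term present in a single document

print(round(w_library, 2), w_universe, w_jupiter)
```

Under these assumptions a term that appears in every document gets weight 0, while the exclusive term keeps a large weight.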

tf-idf using scikit-learn
    # Import TfidfVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    # Create TfidfVectorizer object
    vectorizer = TfidfVectorizer()
    # Generate matrix of word vectors
    tfidf_matrix = vectorizer.fit_transform(corpus)
    print(tfidf_matrix.toarray())

    [[0. 0. 0. 0. 0.25434658 0.33443519 0.33443519 0. 0.25434658 0. 0.25434658 0. 0.76303975]
     [0. 0.46735098 0. 0.46735098 0. 0. 0. 0.46735098 0. 0.46735098 0.35543247 0. 0. ]
     ...
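The slide leaves `corpus` undefined, so here is a self-contained sketch with a small hypothetical corpus. With its default settings, `TfidfVectorizer` uses a smoothed idf and L2-normalizes each row, so its values will not equal the plain tf · log(N/df) formula shown earlier.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus (an assumption for illustration)
corpus = [
    "The lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

# One row per document, one column per vocabulary term
n_docs, n_terms = tfidf_matrix.shape

# With the default norm='l2', every document vector has unit length
row_norms = np.linalg.norm(tfidf_matrix.toarray(), axis=1)
print(n_docs, n_terms, row_norms)
```

The mapping from terms to column indices is available in `vectorizer.vocabulary_`.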

Let's practice!

Cosine similarity

[Image courtesy of techninpink.com]

The dot product
Consider two vectors,
$V = (v_1, v_2, \cdots, v_n)$, $W = (w_1, w_2, \cdots, w_n)$
Then the dot product of V and W is
$V \cdot W = (v_1 \times w_1) + (v_2 \times w_2) + \cdots + (v_n \times w_n)$
Example:
$A = (4, 7, 1)$, $B = (5, 2, 3)$
$A \cdot B = (4 \times 5) + (7 \times 2) + (1 \times 3) = 20 + 14 + 3 = 37$
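The same arithmetic can be written as a short pure-Python sketch. The helper name `dot` is hypothetical.

```python
def dot(v, w):
    """Sum of element-wise products of two equal-length vectors."""
    if len(v) != len(w):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(v, w))

A = (4, 7, 1)
B = (5, 2, 3)
a_dot_b = dot(A, B)
print(a_dot_b)
```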

Magnitude of a vector
For any vector,
$V = (v_1, v_2, \cdots, v_n)$
the magnitude is defined as
$\lVert V \rVert = \sqrt{(v_1)^2 + (v_2)^2 + \cdots + (v_n)^2}$
Example:
$A = (4, 7, 1)$, $B = (5, 2, 3)$
$\lVert A \rVert = \sqrt{(4)^2 + (7)^2 + (1)^2} = \sqrt{16 + 49 + 1} = \sqrt{66}$
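A minimal sketch of the magnitude computation, using only the standard library. The helper name `magnitude` is hypothetical.

```python
import math

def magnitude(v):
    """Euclidean length: square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in v))

A = (4, 7, 1)
B = (5, 2, 3)
mag_a = magnitude(A)  # sqrt(66)
mag_b = magnitude(B)  # sqrt(38)
print(mag_a, mag_b)
```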

The cosine score
$A = (4, 7, 1)$, $B = (5, 2, 3)$
The cosine score,
$\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \cdot \lVert B \rVert} = \frac{37}{\sqrt{66} \times \sqrt{38}} = 0.7388$
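Putting the dot product and the magnitudes together gives the cosine score. This is a from-scratch sketch with a hypothetical helper name, not the scikit-learn implementation.

```python
import math

def cosine_score(v, w):
    """Dot product of v and w divided by the product of their magnitudes."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_v * norm_w)

A = (4, 7, 1)
B = (5, 2, 3)
score = cosine_score(A, B)
print(round(score, 4))
```

The sketch does not guard against zero vectors, for which the score is undefined.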

Cosine score: points to remember
In general, the value lies between -1 and 1.
In NLP, the value lies between 0 and 1, since word-count and tf-idf vectors have no negative components.
Robust to document length.
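One way to see the robustness to document length is to compare a document with a copy of itself repeated several times. This is a sketch with hypothetical text: the word counts triple, but the direction of the count vector is unchanged, so the cosine score stays at 1.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical documents: the long one is the short one repeated three times
short_doc = "the cat sat on the mat"
long_doc = " ".join([short_doc] * 3)

counts = CountVectorizer().fit_transform([short_doc, long_doc])

# Raw counts differ by a factor of three
count_rows = counts.toarray()

# The cosine score depends only on direction, not on length
score = cosine_similarity(counts[0], counts[1])[0, 0]
print(count_rows, score)
```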

Implementation using scikit-learn
    # Import cosine_similarity
    from sklearn.metrics.pairwise import cosine_similarity
    # Define two 3-dimensional vectors A and B
    A = (4,7,1)
    B = (5,2,3)
    # Compute the cosine score of A and B (inputs must be 2-D, hence the lists)
    score = cosine_similarity([A], [B])
    # Print the cosine score
    print(score)

    [[0.73881883]]

Building a plot line based recommender

Movie recommender
Title | Overview
Shanghai Triad | A provincial boy related to a Shanghai crime family is recruited by his uncle into cosmopolitan Shanghai in the 1930s to be a servant to a ganglord's mistress.
Cry, the Beloved Country | A South-African preacher goes to search for his wayward son who has committed a crime in the big city.

Movie recommender
    get_recommendations("The Godfather")

    1178     The Godfather: Part II
    44030    The Godfather Trilogy: 1972-1990
    1914     The Godfather: Part III
    23126    Blood Ties
    11297    Household Saints
    34717    Start Liquidation
    10821    Election
    38030    Goodfellas
    17729    Short Sharp Shock
    26293    Beck 28 - Familjen
    Name: title, dtype: object

Steps
1. Text preprocessing
2. Generate tf-idf vectors
3. Generate cosine similarity matrix

The recommender function
1. Take a movie title, cosine similarity matrix and indices series as arguments.
2. Extract pairwise cosine similarity scores for the movie.
3. Sort the scores in descending order.
4. Output titles corresponding to the highest scores.
5. Ignore the highest similarity score (of 1).
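The five steps above can be sketched as follows. This is one possible implementation under stated assumptions, not necessarily the course's own function. The toy titles and overviews are hypothetical, the `top_n` parameter is an addition, and text preprocessing is skipped. The sketch drops the query movie by its index instead of assuming it is sorted first, which also covers the case where another movie ties at a score of 1.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical toy data standing in for the real movie metadata
metadata = pd.DataFrame({
    "title": ["Space Pirates", "Star Treasure", "Garden Party", "Tea Time"],
    "overview": [
        "A crew of pirates sails through space hunting treasure",
        "Pirates hunt for treasure hidden among the stars in space",
        "A quiet gardener hosts a tea party for the neighbours",
        "Neighbours gather for tea and gossip in a quiet garden",
    ],
})

tfidf_matrix = TfidfVectorizer().fit_transform(metadata["overview"])
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Series mapping each title to its row position in the similarity matrix
indices = pd.Series(range(len(metadata)), index=metadata["title"])

def get_recommendations(title, cosine_sim, indices, top_n=10):
    # Steps 1-2: look up the movie and pull its row of pairwise scores
    idx = indices[title]
    # Step 5: leave out the movie itself (its self-similarity is 1)
    scores = [(i, s) for i, s in enumerate(cosine_sim[idx]) if i != idx]
    # Step 3: sort by score, highest first
    scores.sort(key=lambda pair: pair[1], reverse=True)
    # Step 4: return the titles behind the highest scores
    top = [i for i, _ in scores[:top_n]]
    return list(indices.index[top])

recommendations = get_recommendations("Space Pirates", cosine_sim, indices, top_n=2)
print(recommendations)
```

On this toy data the closest match to "Space Pirates" is "Star Treasure", the only other overview that shares its distinctive words.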

Generating tf-idf vectors
    # Import TfidfVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    # Create TfidfVectorizer object
    vectorizer = TfidfVectorizer()
    # Generate matrix of tf-idf vectors
    tfidf_matrix = vectorizer.fit_transform(movie_plots)

Generating cosine similarity matrix
    # Import cosine_similarity
    from sklearn.metrics.pairwise import cosine_similarity
    # Generate cosine similarity matrix
    cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

    array([[1.        , 0.27435345, 0.23092036, ..., 0.        , 0.        , 0.00758112],
           [0.27435345, 1.        , 0.1246955 , ..., 0.        , 0.        , 0.00740494],
           ...,
           [0.00758112, 0.00740494, 0.        , ..., 0.        , 0.        , 1.        ]])

The linear_kernel function
The magnitude of a tf-idf vector is 1 (TfidfVectorizer L2-normalizes each document vector by default).
The cosine score between two tf-idf vectors is therefore just their dot product.
Skipping the magnitude computation can significantly improve computation time.
Use linear_kernel instead of cosine_similarity.

Generating cosine similarity matrix
    # Import linear_kernel
    from sklearn.metrics.pairwise import linear_kernel
    # Generate cosine similarity matrix
    cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

    array([[1.        , 0.27435345, 0.23092036, ..., 0.        , 0.        , 0.00758112],
           [0.27435345, 1.        , 0.1246955 , ..., 0.        , 0.        , 0.00740494],
           ...,
           [0.00758112, 0.00740494, 0.        , ..., 0.        , 0.        , 1.        ]])
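A quick sketch can confirm that the two functions agree on tf-idf output. The plot strings are hypothetical, and the check relies on the default `norm='l2'` setting of `TfidfVectorizer`. With normalization turned off, the two matrices would generally differ.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical plot summaries (an assumption for illustration)
plots = [
    "A young lion prince flees his kingdom",
    "A lion cub grows up on the savannah",
    "A detective hunts a killer in the city",
]

tfidf_matrix = TfidfVectorizer().fit_transform(plots)

# Every tf-idf row has unit length under the default settings
row_norms = np.linalg.norm(tfidf_matrix.toarray(), axis=1)

# So the plain dot product equals the cosine score
sim_cosine = cosine_similarity(tfidf_matrix, tfidf_matrix)
sim_linear = linear_kernel(tfidf_matrix, tfidf_matrix)
matrices_match = np.allclose(sim_cosine, sim_linear)
print(row_norms, matrices_match)
```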

The get_recommendations function
    get_recommendations('The Lion King', cosine_sim, indices)

    7782    African Cats
    5877    The Lion King 2: Simba's Pride
    4524    Born Free
    2719    The Bear
    4770    Once Upon a Time in China III
    7070    Crows Zero
    739     The Wizard of Oz
    8926    The Jungle Book
    1749    Shadow of a Doubt
    7993    October Baby
    Name: title, dtype: object

Beyond n-grams: word embeddings

The problem with BoW and tf-idf
    'I am happy'
    'I am joyous'
    'I am sad'
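The slide only lists the three sentences. A plausible reading is that bag-of-words vectors cannot tell that "happy" is closer to "joyous" than to "sad". The sketch below illustrates that reading with `CountVectorizer`: every pair of sentences gets the same cosine score. Note that the default tokenizer drops the single-character token "I", so only "am" is shared.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["I am happy", "I am joyous", "I am sad"]

# Bag-of-words counts; vocabulary is 'am', 'happy', 'joyous', 'sad'
bow = CountVectorizer().fit_transform(sentences)
sim = cosine_similarity(bow)

happy_joyous = sim[0, 1]
happy_sad = sim[0, 2]
joyous_sad = sim[1, 2]
print(happy_joyous, happy_sad, joyous_sad)
```

All three scores come out equal, so the representation carries no information about which words are close in meaning.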

Word embeddings
Mapping words into an n-dimensional vector space
Produced using deep learning and huge amounts of data
Discern how similar two words are to each other
Used to detect synonyms and antonyms
Captures complex relationships
    King - Queen → Man - Woman
    France - Paris → Russia - Moscow
Dependent on the spaCy model; independent of the dataset you use

Word embeddings using spaCy
    import spacy
    # Load model and create Doc object
    nlp = spacy.load('en_core_web_lg')
    doc = nlp('I am happy')
    # Generate word vectors for each token
    for token in doc:
        print(token.vector)

    [-1.0747459e+00 4.8677087e-02 5.6630421e+00 1.6680446e+00
     -1.3194644e+00 -1.5142369e+00 1.1940931e+00 -3.0168812e+00
     ...
