

  1. CSE 6240: Web Search and Text Mining, Spring 2020. Word Embeddings. Prof. Srijan Kumar, with Arindum Roy and Roshan Pati

  2. Administrivia
  • Homework: will be released today after class
  • Project reminder: teams due Monday, Jan 20
  • A fun exercise at the end of the class!

  3. Homework Policy
  • Late day policy: 3 late days (3 x 24-hour chunks) – use as needed
  • Collaboration:
    – It is OK to talk, discuss the questions, and discuss potential directions for solving them. However, you need to write your own solutions and code separately, NOT as a group activity.
    – Please list the students you collaborated with.
  • Zero tolerance on plagiarism – follow the GT academic honesty rules

  4. Recap So Far
  1. IR and text processing
  2. Evaluation of IR systems

  5. Today’s Lecture
  • Representing words and phrases
    – Neural network basics
    – Word2vec
    – Continuous bag of words
    – Skip-gram model
  Some slides in this lecture are inspired by slides from Prof. Leonid Sigal, UBC

  6. Representing a Word: One-Hot Encoding
  • Given a vocabulary: dog, cat, person, holding, tree, computer, using

  7. Representing a Word: One-Hot Encoding
  • Given a vocabulary, assign each word an index:
    dog 1, cat 2, person 3, holding 4, tree 5, computer 6, using 7

  8. Representing a Word: One-Hot Encoding
  • Given a vocabulary, convert each word to a one-hot vector:
    dog      1  [ 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
    cat      2  [ 0, 1, 0, 0, 0, 0, 0, 0, 0, 0 ]
    person   3  [ 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 ]
    holding  4  [ 0, 0, 0, 1, 0, 0, 0, 0, 0, 0 ]
    tree     5  [ 0, 0, 0, 0, 1, 0, 0, 0, 0, 0 ]
    computer 6  [ 0, 0, 0, 0, 0, 1, 0, 0, 0, 0 ]
    using    7  [ 0, 0, 0, 0, 0, 0, 1, 0, 0, 0 ]
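
To make this concrete, here is a minimal Python sketch of the same construction (the vocabulary and the 10-dimensional vector length mirror the slide; the helper name one_hot is illustrative, not course code):

```python
# One-hot encoding: each word maps to a vector with a single 1
# at its vocabulary index. Vector length is 10 as on the slide,
# leaving trailing positions free for additional words.
vocab = ["dog", "cat", "person", "holding", "tree", "computer", "using"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word, dim=10):
    vec = [0] * dim
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("person"))  # [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
```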

  9. Recap: Bag of Words Model
  • Represent a document as a collection of words (after cleaning the document)
    – The order of words is irrelevant
    – The document “John is quicker than Mary” is indistinguishable from the document “Mary is quicker than John”
  • Rank documents according to the overlap between query words and document words (a minimal sketch follows)
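
A minimal sketch of overlap-based ranking, assuming the simplest possible score (the size of the intersection between query words and document words; the slide does not pin down the exact scoring):

```python
# Rank documents by how many query words they contain.
# Word order is ignored, so the two example sentences from the
# slide produce identical bags and identical scores.
def bag(text):
    return set(text.lower().split())

def rank(query, docs):
    q = bag(query)
    return sorted(docs, key=lambda d: len(q & bag(d)), reverse=True)

docs = ["John is quicker than Mary", "Mary is quicker than John"]
print(rank("quicker than John", docs))  # both docs score 3: indistinguishable
```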

  10. Representing Phrases: Bag of Words
  • Vocabulary: dog, cat, person, holding, tree, computer, using

  11. Representing Phrases: Bag of Words
  • person holding dog → {3, 4, 1} → [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]

  12. Representing Phrases: Bag of Words
  • person holding dog → {3, 4, 1} → [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]
  • person holding cat → {3, 4, 2} → [ 0, 1, 1, 1, 0, 0, 0, 0, 0, 0 ]

  13. Representing Phrases: Bag of Words
  • person holding dog → {3, 4, 1} → [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]
  • person holding cat → {3, 4, 2} → [ 0, 1, 1, 1, 0, 0, 0, 0, 0, 0 ]
  • person using computer → {3, 7, 6} → [ 0, 0, 1, 0, 0, 1, 1, 0, 0, 0 ]

  14. Representing Phrases: Bag of Words
  • person holding dog → {3, 4, 1} → [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]
  • person holding cat → {3, 4, 2} → [ 0, 1, 1, 1, 0, 0, 0, 0, 0, 0 ]
  • person using computer → {3, 7, 6} → [ 0, 0, 1, 0, 0, 1, 1, 0, 0, 0 ]
  • person using computer person holding cat → {3, 7, 6, 3, 4, 2} → [ 0, 1, 2, 1, 0, 1, 1, 0, 0, 0 ]
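
The combined phrase above can be checked with a short sketch (vocabulary indices as on slide 7; bow_vector is an illustrative name):

```python
from collections import Counter

# Bag-of-words count vector: entry i counts how often vocabulary
# word i occurs in the phrase; word order is discarded.
vocab = ["dog", "cat", "person", "holding", "tree", "computer", "using"]

def bow_vector(phrase, dim=10):
    counts = Counter(phrase.split())
    return [counts[w] for w in vocab] + [0] * (dim - len(vocab))

print(bow_vector("person using computer person holding cat"))
# [0, 1, 2, 1, 0, 1, 1, 0, 0, 0]
```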

  15. Distributional Hypothesis [Lenci, 2008]
  • The degree of semantic similarity between two linguistic expressions is a function of the similarity of their linguistic contexts
  • Similarity in meaning ∝ similarity of context
  • Simple definition: context = surrounding words

  16. What Is The Meaning Of “Bardiwac”?
  • He handed her a glass of bardiwac.
  • Beef dishes are made to complement the bardiwac.
  • Nigel staggered to his feet, face flushed from too much bardiwac.
  • Malbec, one of the lesser-known bardiwac grapes, responds well to Australia’s sunshine.
  • I dined off bread and cheese and this excellent bardiwac.
  • The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish.

  17. What Is The Meaning Of “Bardiwac”?
  • He handed her a glass of bardiwac.
  • Beef dishes are made to complement the bardiwac.
  • Nigel staggered to his feet, face flushed from too much bardiwac.
  • Malbec, one of the lesser-known bardiwac grapes, responds well to Australia’s sunshine.
  • I dined off bread and cheese and this excellent bardiwac.
  • The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish.
  Inference: bardiwac is an alcoholic beverage made from grapes

  18. Geometric Interpretation: Co-occurrence As Feature
  • Recall the term-document matrix
    – Rows are terms, columns are documents; each cell holds the number of times a term appears in a document
  • Here we create a word-word co-occurrence matrix
    – Rows and columns are words
    – Cell (R, C) counts how many times word C appears in the neighborhood of word R
  • Neighborhood = a window of fixed size around the word (see the sketch below)
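
A small sketch of building such a matrix, assuming a symmetric window of fixed size (the tokenization and window size are illustrative choices):

```python
from collections import defaultdict

# Word-word co-occurrence counts: counts[r][c] is the number of
# times word c appears within `window` positions of word r.
def cooccurrence(tokens, window=2):
    counts = defaultdict(lambda: defaultdict(int))
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[center][tokens[j]] += 1
    return counts

tokens = "we get a dog and use a computer".split()
print(dict(cooccurrence(tokens)["dog"]))  # neighbors of 'dog'
```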

  19. Row Vectors in Co-occurrence Matrix
  • A row vector describes the usage of its word in the corpus/document
  • Row vectors can be seen as coordinates of points in an n-dimensional Euclidean space
  • Example: n = 2, with dimensions ‘get’ and ‘use’
  [Figure: co-occurrence matrix]

  20. Distance And Similarity
  • Select two dimensions, ‘get’ and ‘use’
  • Similarity between words = spatial proximity in this space
  • Measured by the Euclidean distance

  21. Distance And Similarity
  • The exact position in the space depends on the frequency of the word
  • More frequent words will appear farther from the origin
    – E.g., say ‘dog’ is more frequent than ‘cat’
    – That does not mean it is more important
  • Solution: ignore the length and look only at the direction

  22. Angle And Similarity
  • The angle ignores the exact location of the point
  • Method: normalize by the length of the vectors, or use only the angle as a distance measure
  • Standard metric: cosine similarity between vectors (contrasted with Euclidean distance in the sketch below)
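
The following toy computation illustrates the point: scaling a word's count vector (as higher frequency does) changes its Euclidean distance to other words but not its cosine similarity (the numbers are invented, using the ‘get’/‘use’ dimensions from slide 19):

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# 'dog' occurs 10x more often than 'cat' but in the same contexts:
cat = [2.0, 3.0]      # counts along the 'get' and 'use' dimensions
dog = [20.0, 30.0]
print(euclidean(cat, dog))  # ~32.4: frequency pushes the points apart
print(cosine(cat, dog))     # 1.0: same direction, maximal similarity
```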

  23. Issues with Co-occurrence Matrix
  • Problems with using the co-occurrence matrix directly:
    – The resulting vectors are very high dimensional
      • Dimension size = number of words in the corpus – billions!
    – Down-sampling dimensions is not straightforward
      • How many columns to select?
      • Which columns to select?
  • Solution: compression or dimensionality reduction techniques

  24. SVD for Dimensionality Reduction
  • SVD = Singular Value Decomposition
  • For an input matrix X, SVD factorizes X = U S V^T
    – U = left-singular vectors of X, V = right-singular vectors of X
    – S is a diagonal matrix; its diagonal values are called singular values
  • Keeping the top r singular values, matrix U gives an r-dimensional vector for every row of X (sketched below)
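
A small numpy sketch of using SVD this way (the 3 x 3 co-occurrence matrix is a toy example; real matrices are vocabulary-sized):

```python
import numpy as np

# SVD factorizes X = U S V^T; truncating to the top r singular
# values gives an r-dimensional embedding for every row of X.
X = np.array([[4., 0., 1.],
              [3., 1., 0.],
              [0., 5., 4.]])  # toy word-word co-occurrence counts

U, S, Vt = np.linalg.svd(X, full_matrices=False)
r = 2
word_vectors = U[:, :r] * S[:r]  # one r-dim vector per word (row)
print(word_vectors.shape)        # (3, 2)
```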

  25. Word Visualization via Dimensionality Reduction

  26. Issues with SVD
  • Computational cost of SVD on an N x M matrix is O(NM^2), where N < M
    – Impossible for large word vocabularies or document collections
    – Impractical for a real corpus
  • It is hard to incorporate out-of-sample (new) words/documents
    – The entire row in the matrix would be 0

  27. Word2Vec: Representing Word Meanings
  • Key idea: predict the surrounding words of every word (see the sketch below)
  • Benefits:
    – Faster
    – Easier to incorporate new words and documents
  Main paper: Distributed Representations of Words and Phrases and their Compositionality. Mikolov et al., NeurIPS, 2013.
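
Word2vec trains a shallow neural network on exactly this prediction task; the training data it consumes can be sketched as (center, context) pairs (the window size and example sentence are assumptions for illustration, not from the paper):

```python
# Skip-gram training pairs: for every center word, the model is
# trained to predict each surrounding word inside the window.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs("person holding a cat".split()))
# [('person', 'holding'), ('person', 'a'), ('holding', 'person'), ...]
```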
