CSE 6240: Web Search and Text Mining, Spring 2020
Word Embeddings
Prof. Srijan Kumar, with Arindum Roy and Roshan Pati


SLIDE 1

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining


CSE 6240: Web Search and Text Mining. Spring 2020

  • Prof. Srijan Kumar

with Arindum Roy and Roshan Pati

Word Embeddings​

SLIDE 2

Administrivia

  • Homework: Will be released today after class
  • Project Reminder: Teams due Monday Jan 20.
  • A fun exercise at the end of the class!
SLIDE 3

Homework Policy

  • Late day policy: 3 late days (3 x 24 hour chunks)

– Use as needed

  • Collaboration:

– It is OK to talk about the questions and to discuss potential directions for solving them. However, you must write your own solutions and code separately, NOT as a group activity.
– Please list the students you collaborated with.

  • Zero tolerance on plagiarism

– Follow the GT academic honesty rules

SLIDE 4

Recap So Far

  1. IR and text processing
  2. Evaluation of IR systems
SLIDE 5

Today’s Lecture

  • Representing words and phrases

– Neural network basics
– Word2vec
– Continuous bag of words
– Skip-gram model

Some slides in this lecture are adapted from slides by Prof. Leonid Sigal, UBC

SLIDE 6

Representing a Word: One Hot Encoding​

  • Given a vocabulary

dog cat person holding tree computer using

SLIDE 7

Representing a Word: One Hot Encoding​

  • Given a vocabulary

dog → 1, cat → 2, person → 3, holding → 4, tree → 5, computer → 6, using → 7

SLIDE 8

Representing a Word: One Hot Encoding​

  • Given a vocabulary, convert to One Hot Encoding

dog      1  [ 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
cat      2  [ 0, 1, 0, 0, 0, 0, 0, 0, 0, 0 ]
person   3  [ 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 ]
holding  4  [ 0, 0, 0, 1, 0, 0, 0, 0, 0, 0 ]
tree     5  [ 0, 0, 0, 0, 1, 0, 0, 0, 0, 0 ]
computer 6  [ 0, 0, 0, 0, 0, 1, 0, 0, 0, 0 ]
using    7  [ 0, 0, 0, 0, 0, 0, 1, 0, 0, 0 ]
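The mapping above can be sketched in a few lines of Python (a minimal sketch; the 10-element vector length and the 1-based word indices follow the slide):

```python
def one_hot(index, size=10):
    """Return a one-hot vector with a 1 at the (1-based) `index`."""
    vec = [0] * size
    vec[index - 1] = 1
    return vec

# Vocabulary from the slide: word -> 1-based index
vocab = {"dog": 1, "cat": 2, "person": 3, "holding": 4,
         "tree": 5, "computer": 6, "using": 7}
encodings = {word: one_hot(i) for word, i in vocab.items()}
```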

SLIDE 9

Recap: Bag of Words Model

  • Represent a document as a collection of words (after cleaning the document)

– The order of words is irrelevant
– The document “John is quicker than Mary” is indistinguishable from the document “Mary is quicker than John”

  • Rank documents according to the overlap between query words and document words

SLIDE 10

Representing Phrases: Bag of Words​

Bag-of-words representation over the vocabulary: dog, cat, person, holding, tree, computer, using

SLIDE 11

Representing Phrases: Bag of Words​

Bag-of-words representation (vocabulary: dog, cat, person, holding, tree, computer, using):

person holding dog → {3, 4, 1} → [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]

SLIDE 12

Representing Phrases: Bag of Words​

Bag-of-words representation (vocabulary: dog, cat, person, holding, tree, computer, using):

person holding dog → {3, 4, 1} → [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]
person holding cat → {3, 4, 2} → [ 0, 1, 1, 1, 0, 0, 0, 0, 0, 0 ]

SLIDE 13

Representing Phrases: Bag of Words​

Bag-of-words representation (vocabulary: dog, cat, person, holding, tree, computer, using):

person holding dog → {3, 4, 1} → [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]
person holding cat → {3, 4, 2} → [ 0, 1, 1, 1, 0, 0, 0, 0, 0, 0 ]
person using computer → {3, 7, 6} → [ 0, 0, 1, 0, 0, 1, 1, 0, 0, 0 ]

SLIDE 14

Representing Phrases: Bag of Words​

Bag-of-words representation (vocabulary: dog, cat, person, holding, tree, computer, using):

person holding dog → {3, 4, 1} → [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]
person holding cat → {3, 4, 2} → [ 0, 1, 1, 1, 0, 0, 0, 0, 0, 0 ]
person using computer → {3, 7, 6} → [ 0, 0, 1, 0, 0, 1, 1, 0, 0, 0 ]
person using computer person holding cat → {3, 7, 6, 3, 4, 2} → [ 0, 1, 2, 1, 0, 1, 1, 0, 0, 0 ]
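The count-vector construction above can be sketched as follows (a minimal sketch; the `vocab` mapping and the 10-element vector length follow the slide):

```python
from collections import Counter

# Vocabulary from the slide: word -> 1-based index
vocab = {"dog": 1, "cat": 2, "person": 3, "holding": 4,
         "tree": 5, "computer": 6, "using": 7}

def bag_of_words(phrase, size=10):
    """Count how often each vocabulary word occurs in the phrase."""
    counts = Counter(vocab[w] for w in phrase.lower().split())
    return [counts.get(i, 0) for i in range(1, size + 1)]
```

Repeated words accumulate counts rather than stay binary, which is why `person` gets a 2 in the last example.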

SLIDE 15

Distributional Hypothesis [Lenci, 2008]

  • The degree of semantic similarity between two linguistic expressions is a function of the similarity of their linguistic contexts

  • Similarity in meaning ∝ Similarity of context
  • Simple definition: context = surrounding words
SLIDE 16

What Is The Meaning Of “Bardiwac”?

  • He handed her her glass of bardiwac.
  • Beef dishes are made to complement the bardiwac.
  • Nigel staggered to his feet, face flushed from too much bardiwac.
  • Malbec, one of the lesser-known bardiwac grapes, responds well to Australia’s sunshine.
  • I dined off bread and cheese and this excellent bardiwac.
  • The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish.

SLIDE 17

What Is The Meaning Of “Bardiwac”?

  • He handed her her glass of bardiwac.
  • Beef dishes are made to complement the bardiwac.
  • Nigel staggered to his feet, face flushed from too much bardiwac.
  • Malbec, one of the lesser-known bardiwac grapes, responds well to Australia’s sunshine.
  • I dined off bread and cheese and this excellent bardiwac.
  • The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish.

Inference: bardiwac is an alcoholic beverage made from grapes

SLIDE 18

Geometric Interpretation: Co-occurrence As Feature

  • Recall the term-document matrix

– Rows are terms, columns are documents; each cell holds the number of times a term appears in a document

  • Here we create a word-word co-occurrence matrix

– Rows and columns are words
– Cell (R, C) counts how many times word C appears in the neighborhood of word R

  • Neighborhood = a window of fixed size around the word
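The word-word matrix described above can be built as a dictionary of pair counts (a minimal sketch; the toy sentence and the window size of 2 are chosen here for illustration):

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    """Count, for each word R, how often word C appears within
    `window` positions of R (the neighborhood)."""
    counts = defaultdict(int)
    for i, row_word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                      # skip the word itself
                counts[(row_word, tokens[j])] += 1
    return counts

counts = cooccurrence_counts("the cat sat on the mat".split())
```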
SLIDE 19

Row Vectors in Co-occurrence Matrix

  • A row vector describes the usage of the word in the corpus/document
  • Row vectors can be seen as coordinates of points in an n-dimensional Euclidean space
  • Example: n = 2
  • Dimensions = ‘get’ and ‘use’

[Figure: co-occurrence matrix]

SLIDE 20

Distance And Similarity

  • Selected two dimensions: ‘get’ and ‘use’
  • Similarity between words = spatial proximity in the dimension space
  • Measured by the Euclidean distance
SLIDE 21

Distance And Similarity

  • The exact position in the space depends on the frequency of the word
  • More frequent words appear farther from the origin
  • E.g., if ‘dog’ is more frequent than ‘cat’, it lies farther from the origin
  • That does not mean it is more important
  • Solution: ignore the length and look only at the direction
SLIDE 22

Angle And Similarity

  • The angle ignores the exact location of the point
  • Method: normalize by the length of the vectors, or use only the angle as a distance measure
  • Standard metric: cosine similarity between vectors
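Cosine similarity can be sketched without any libraries; note that scaling a vector (e.g., a word occurring twice as often along the same direction) leaves the similarity unchanged, which is exactly the frequency-invariance argued for above. The example vectors are invented for illustration:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Same direction, different frequency -> similarity 1.0
same_direction = cosine_similarity([2.0, 4.0], [1.0, 2.0])
```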

SLIDE 23

Issues with Co-occurrence Matrix

  • Problems with using the co-occurrence matrix directly:

– The resulting vectors are very high dimensional
– Dimension size = number of words in the corpus

  • Billions!

– Down-sampling dimensions is not straightforward

  • How many columns to select?
  • Which columns to select?
  • Solution: compression or dimensionality reduction techniques

SLIDE 24

SVD for Dimensionality Reduction

  • SVD = Singular Value Decomposition
  • For an input matrix X, factorize X = U S Vᵀ

– U holds the left-singular vectors of X, and V the right-singular vectors of X
– S is a diagonal matrix

  • The diagonal values of S are called singular values
  • Keeping the top r components, the rows of U give an r-dimensional vector for every row of X
SLIDE 25

Word Visualization via Dimensionality Reduction

SLIDE 26

Issues with SVD

  • The computational cost of SVD on an N x M matrix is O(NM²), where N < M

– Infeasible for large vocabularies or large document collections
– Impractical for a real corpus

  • It is hard to incorporate out-of-sample or new words/documents

– The entire row in the matrix will be 0
SLIDE 27

Word2Vec: Representing Word Meanings

Key idea: predict the surrounding words of every word

Benefits:
  • Faster
  • Easier to incorporate new words and documents

Main paper: Distributed Representations of Words and Phrases and their Compositionality. Mikolov et al., NeurIPS, 2013.

SLIDE 28

Two Styles of Learning Word2Vec

  • Continuous Bag of Words (CBOW): uses the context words in a window to predict the middle word
  • Skip-gram: uses the middle word to predict the context words in a window
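The two styles differ only in which side of a (context, middle-word) pair is the input. A minimal sketch of extracting such pairs (the sentence and window size are chosen for illustration):

```python
def context_center_pairs(tokens, window=2):
    """Return (context_words, center_word) pairs. CBOW predicts the
    center from the context; skip-gram predicts the context from the
    center."""
    pairs = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        pairs.append((context, center))
    return pairs

pairs = context_center_pairs("the cat sat on floor".split(), window=2)
```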

SLIDE 29

Neural Network Basics: Neuron

  • The basic building block of neural networks
  • Input is a vector: x = [x1, …, xm]
  • Weights and bias:

– The neuron has weights w = [w1, w2, …, wm]
– Bias term = b (or w0)

  • Activation function:

– Transforms the aggregate
– e.g., sigmoid, ReLU

  • Output computation: y = f(w · x + b)
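The output computation can be sketched directly (a minimal sketch; the sigmoid is chosen here as the activation, and the example inputs are invented):

```python
import math

def neuron_output(x, w, b):
    """Weighted sum of inputs plus bias, passed through a sigmoid."""
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return 1.0 / (1.0 + math.exp(-z))
```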
SLIDE 30

Neural Network Basics: Fully Connected Layer

  • A layer whose neurons are connected to all the neurons in the previous layer

– Each neuron takes as input all the outputs from the previous layer

  • Multiple layers can be stacked together
  • Example: 3 fully connected layers
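Stacking three fully connected layers, as in the example above, can be sketched with NumPy (the layer sizes, ReLU activation, and random weights are invented for illustration):

```python
import numpy as np

def dense(x, W, b):
    """One fully connected layer with a ReLU activation."""
    return np.maximum(W @ x + b, 0.0)

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]               # input dim, then three layer widths

h = rng.normal(size=sizes[0])      # a random input vector
for n_in, n_out in zip(sizes[:-1], sizes[1:]):
    W = rng.normal(size=(n_out, n_in))
    b = np.zeros(n_out)
    h = dense(h, W, b)             # each layer's output feeds the next
```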
SLIDE 31

Neural Network Basics: More About Layers

  • Input layer: input vectors are given as inputs here
  • Hidden layer: intermediate representation of the inputs

– Multiple hidden layers can be stacked together

  • Output layer: final output

– Can have one or more neurons in the output layer

  • Note that information flows in one direction
SLIDE 32

CBOW: Continuous Bag of Words

Example: “the cat sat on floor” (window size 2)

Input: context words
Output: middle word

SLIDE 33

The Architecture

Architecture: input layer, hidden layer, and output layer

  • Fully connected layers

Input: one-hot vectors of the context words
Desired output: one-hot vector of the middle word

SLIDE 34

The Architecture

Input size: ℝ^|V|
Hidden layer size: ℝ^N
Output size: ℝ^|V|
Input-to-hidden weight matrix: W, of size |V| x N

  • All inputs share the W matrix

Hidden-to-output weight matrix: W’, of size N x |V|

  • All weight matrices are shared across all examples

SLIDE 35

Parameters To Be Learned

  • Size of the input and output word vectors = |V|
  • All weights are to be learned during the training process

SLIDE 36

Input to Hidden Layer

  • Matrix multiplication generates the hidden vector

– Multiplication of the input one-hot vector with the input-to-hidden weight matrix

  • One multiplication per input
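Because the input is one-hot, this multiplication simply selects one row of W. A quick sketch (|V| = 7 and a small N follow the running example; the weight values themselves are invented):

```python
import numpy as np

V, N = 7, 3
W = np.arange(V * N, dtype=float).reshape(V, N)  # input-to-hidden weights

x = np.zeros(V)
x[1] = 1.0        # one-hot vector for the word with index 2 ('cat')
h = x @ W         # identical to simply reading out row 1 of W
```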

SLIDE 37

Input to Hidden Layer

Multiplication for ‘cat’

SLIDE 38

Input to Hidden Layer

Multiplication for ‘on’

SLIDE 39

Hidden Layer

  • Aggregation is done at the hidden layer

– Example: simple averaging
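Putting the pieces together: select a row of W per context word, average them at the hidden layer, and score the vocabulary. A minimal CBOW forward-pass sketch (|V| = 7, N = 3, and random weights are invented; a softmax at the output turns scores into a probability over the vocabulary):

```python
import numpy as np

V, N = 7, 3
rng = np.random.default_rng(1)
W = rng.normal(size=(V, N))        # input-to-hidden weights, shared by all inputs
W_out = rng.normal(size=(N, V))    # hidden-to-output weights

def cbow_forward(context_ids):
    """Average the context-word rows of W, then softmax over the vocabulary."""
    h = W[context_ids].mean(axis=0)      # simple averaging at the hidden layer
    scores = h @ W_out
    e = np.exp(scores - scores.max())    # numerically stable softmax
    return e / e.sum()

probs = cbow_forward([0, 1, 3, 4])       # indices of the context words
```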