CSE 6240: Web Search and Text Mining, Spring 2020
Word Embeddings
Prof. Srijan Kumar, with Arindum Roy and Roshan Pati


SLIDE 1

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining


CSE 6240: Web Search and Text Mining. Spring 2020

  • Prof. Srijan Kumar

with Arindum Roy and Roshan Pati

Word Embeddings​

SLIDE 2

Administrivia

  • Homework: Will be released today after class
  • Project Reminder: Teams due Monday Jan 20.
  • A fun exercise at the end of the class!
SLIDE 3

Homework Policy

  • Late day policy: 3 late days (3 x 24 hour chunks)

– Use as needed

  • Collaboration:

– It is OK to talk about the questions and to discuss potential directions for solving them. However, you must write your own solutions and code separately, NOT as a group activity.
– Please list the students you collaborated with.

  • Zero tolerance on plagiarism

– Follow the GT academic honesty rules

SLIDE 4

Recap So Far

  1. IR and text processing
  2. Evaluation of IR systems
SLIDE 5

Today’s Lecture

  • Representing words and phrases

– Neural network basics
– Word2vec
– Continuous bag of words
– Skip-gram model

Some slides in this lecture are adapted from slides by Prof. Leonid Sigal, UBC

SLIDE 6

Representing a Word: One Hot Encoding​

  • Given a vocabulary

dog cat person holding tree computer using

SLIDE 7

Representing a Word: One Hot Encoding​

  • Given a vocabulary

dog → 1, cat → 2, person → 3, holding → 4, tree → 5, computer → 6, using → 7

SLIDE 8

Representing a Word: One Hot Encoding​

  • Given a vocabulary, convert to One Hot Encoding

dog      1  [ 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
cat      2  [ 0, 1, 0, 0, 0, 0, 0, 0, 0, 0 ]
person   3  [ 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 ]
holding  4  [ 0, 0, 0, 1, 0, 0, 0, 0, 0, 0 ]
tree     5  [ 0, 0, 0, 0, 1, 0, 0, 0, 0, 0 ]
computer 6  [ 0, 0, 0, 0, 0, 1, 0, 0, 0, 0 ]
using    7  [ 0, 0, 0, 0, 0, 0, 1, 0, 0, 0 ]
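The mapping above can be sketched in a few lines of Python (a minimal sketch; the 10-element vector length and the 1-based word indices follow the slide):

```python
def one_hot(index, size=10):
    """Return a one-hot vector with a 1 at the (1-based) `index`."""
    vec = [0] * size
    vec[index - 1] = 1
    return vec

# Vocabulary from the slide: word -> 1-based index
vocab = {"dog": 1, "cat": 2, "person": 3, "holding": 4,
         "tree": 5, "computer": 6, "using": 7}
encodings = {word: one_hot(i) for word, i in vocab.items()}
```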

SLIDE 9

Recap: Bag of Words Model

  • Represent a document as a collection of words (after cleaning the document)

– The order of words is irrelevant
– The document “John is quicker than Mary” is indistinguishable from the document “Mary is quicker than John”

  • Rank documents according to the overlap between query words and document words

SLIDE 10

Representing Phrases: Bag of Words​

Bag-of-words representation over the vocabulary: dog, cat, person, holding, tree, computer, using

SLIDE 11

Representing Phrases: Bag of Words​

Bag-of-words representation (vocabulary: dog, cat, person, holding, tree, computer, using):

person holding dog → {3, 4, 1} → [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]

SLIDE 12

Representing Phrases: Bag of Words​

Bag-of-words representation (vocabulary: dog, cat, person, holding, tree, computer, using):

person holding dog → {3, 4, 1} → [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]
person holding cat → {3, 4, 2} → [ 0, 1, 1, 1, 0, 0, 0, 0, 0, 0 ]

SLIDE 13

Representing Phrases: Bag of Words​

Bag-of-words representation (vocabulary: dog, cat, person, holding, tree, computer, using):

person holding dog → {3, 4, 1} → [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]
person holding cat → {3, 4, 2} → [ 0, 1, 1, 1, 0, 0, 0, 0, 0, 0 ]
person using computer → {3, 7, 6} → [ 0, 0, 1, 0, 0, 1, 1, 0, 0, 0 ]

SLIDE 14

Representing Phrases: Bag of Words​

Bag-of-words representation (vocabulary: dog, cat, person, holding, tree, computer, using):

person holding dog → {3, 4, 1} → [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]
person holding cat → {3, 4, 2} → [ 0, 1, 1, 1, 0, 0, 0, 0, 0, 0 ]
person using computer → {3, 7, 6} → [ 0, 0, 1, 0, 0, 1, 1, 0, 0, 0 ]
person using computer person holding cat → {3, 7, 6, 3, 4, 2} → [ 0, 1, 2, 1, 0, 1, 1, 0, 0, 0 ]
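The count-vector construction above can be sketched as follows (a minimal sketch; the `vocab` mapping and the 10-element vector length follow the slide):

```python
from collections import Counter

# Vocabulary from the slide: word -> 1-based index
vocab = {"dog": 1, "cat": 2, "person": 3, "holding": 4,
         "tree": 5, "computer": 6, "using": 7}

def bag_of_words(phrase, size=10):
    """Count how often each vocabulary word occurs in the phrase."""
    counts = Counter(vocab[w] for w in phrase.lower().split())
    return [counts.get(i, 0) for i in range(1, size + 1)]
```

Repeated words accumulate counts rather than stay binary, which is why `person` gets a 2 in the last example.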

SLIDE 15

Distributional Hypothesis [Lenci, 2008]

  • The degree of semantic similarity between two linguistic expressions is a function of the similarity of their linguistic contexts

  • Similarity in meaning ∝ Similarity of context
  • Simple definition: context = surrounding words
SLIDE 16

What Is The Meaning Of “Bardiwac”?

  • He handed her her glass of bardiwac.
  • Beef dishes are made to complement the bardiwac.
  • Nigel staggered to his feet, face flushed from too much bardiwac.
  • Malbec, one of the lesser-known bardiwac grapes, responds well to Australia’s sunshine.
  • I dined off bread and cheese and this excellent bardiwac.
  • The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish.

SLIDE 17

What Is The Meaning Of “Bardiwac”?

  • He handed her her glass of bardiwac.
  • Beef dishes are made to complement the bardiwac.
  • Nigel staggered to his feet, face flushed from too much bardiwac.
  • Malbec, one of the lesser-known bardiwac grapes, responds well to Australia’s sunshine.
  • I dined off bread and cheese and this excellent bardiwac.
  • The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish.

Inference: bardiwac is an alcoholic beverage made from grapes

SLIDE 18

Geometric Interpretation: Co-occurrence As Feature

  • Recall the term-document matrix

– Rows are terms, columns are documents; each cell holds the number of times a term appears in a document

  • Here we create a word-word co-occurrence matrix

– Rows and columns are words
– Cell (R, C) counts how many times word C appears in the neighborhood of word R

  • Neighborhood = a window of fixed size around the word
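The word-word matrix described above can be built as a dictionary of pair counts (a minimal sketch; the toy sentence and the window size of 2 are chosen here for illustration):

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    """Count, for each word R, how often word C appears within
    `window` positions of R (the neighborhood)."""
    counts = defaultdict(int)
    for i, row_word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                      # skip the word itself
                counts[(row_word, tokens[j])] += 1
    return counts

counts = cooccurrence_counts("the cat sat on the mat".split())
```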
SLIDE 19

Row Vectors in Co-occurrence Matrix

  • A row vector describes the usage of the word in the corpus/document
  • Row vectors can be seen as coordinates of points in an n-dimensional Euclidean space
  • Example: n = 2
  • Dimensions = ‘get’ and ‘use’

[Figure: co-occurrence matrix]

SLIDE 20

Distance And Similarity

  • Selected two dimensions: ‘get’ and ‘use’
  • Similarity between words = spatial proximity in the dimension space
  • Measured by the Euclidean distance
SLIDE 21

Distance And Similarity

  • The exact position in the space depends on the frequency of the word
  • More frequent words appear farther from the origin
  • E.g., if ‘dog’ is more frequent than ‘cat’, it lies farther from the origin
  • That does not mean it is more important
  • Solution: ignore the length and look only at the direction
SLIDE 22

Angle And Similarity

  • The angle ignores the exact location of the point
  • Method: normalize by the length of the vectors, or use only the angle as a distance measure
  • Standard metric: cosine similarity between vectors
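Cosine similarity can be sketched without any libraries; note that scaling a vector (e.g., a word occurring twice as often along the same direction) leaves the similarity unchanged, which is exactly the frequency-invariance argued for above. The example vectors are invented for illustration:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Same direction, different frequency -> similarity 1.0
same_direction = cosine_similarity([2.0, 4.0], [1.0, 2.0])
```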

SLIDE 23

Issues with Co-occurrence Matrix

  • Problems with using the co-occurrence matrix directly:

– The resulting vectors are very high dimensional
– Dimension size = number of words in the corpus

  • Billions!

– Down-sampling dimensions is not straightforward

  • How many columns to select?
  • Which columns to select?
  • Solution: compression or dimensionality reduction techniques

SLIDE 24

SVD for Dimensionality Reduction

  • SVD = Singular Value Decomposition
  • For an input matrix X, factorize X = U S Vᵀ

– U holds the left-singular vectors of X, and V the right-singular vectors of X
– S is a diagonal matrix

  • The diagonal values of S are called singular values
  • Keeping the top r components, the rows of U give an r-dimensional vector for every row of X
SLIDE 25

Word Visualization via Dimensionality Reduction

SLIDE 26

Issues with SVD

  • The computational cost of SVD on an N x M matrix is O(NM²), where N < M

– Infeasible for large vocabularies or large document collections
– Impractical for a real corpus

  • It is hard to incorporate out-of-sample or new words/documents

– The entire row in the matrix will be 0
SLIDE 27

Word2Vec: Representing Word Meanings

Key idea: predict the surrounding words of every word

Benefits:
  • Faster
  • Easier to incorporate new words and documents

Main paper: Distributed Representations of Words and Phrases and their Compositionality. Mikolov et al., NeurIPS, 2013.

SLIDE 28

Two Styles of Learning Word2Vec

  • Continuous Bag of Words (CBOW): uses the context words in a window to predict the middle word
  • Skip-gram: uses the middle word to predict the context words in a window
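The two styles differ only in which side of a (context, middle-word) pair is the input. A minimal sketch of extracting such pairs (the sentence and window size are chosen for illustration):

```python
def context_center_pairs(tokens, window=2):
    """Return (context_words, center_word) pairs. CBOW predicts the
    center from the context; skip-gram predicts the context from the
    center."""
    pairs = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        pairs.append((context, center))
    return pairs

pairs = context_center_pairs("the cat sat on floor".split(), window=2)
```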

SLIDE 29

Neural Network Basics: Neuron

  • The basic building block of neural networks
  • Input is a vector: x = [x1, …, xm]
  • Weights and bias:

– The neuron has weights w = [w1, w2, …, wm]
– Bias term = b (or w0)

  • Activation function:

– Transforms the aggregate
– e.g., sigmoid, ReLU

  • Output computation: y = f(w · x + b)
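The output computation can be sketched directly (a minimal sketch; the sigmoid is chosen here as the activation, and the example inputs are invented):

```python
import math

def neuron_output(x, w, b):
    """Weighted sum of inputs plus bias, passed through a sigmoid."""
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return 1.0 / (1.0 + math.exp(-z))
```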
SLIDE 30

Neural Network Basics: Fully Connected Layer

  • A layer whose neurons are connected to all the neurons in the previous layer

– Each neuron takes as input all the outputs from the previous layer

  • Multiple layers can be stacked together
  • Example: 3 fully connected layers
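Stacking three fully connected layers, as in the example above, can be sketched with NumPy (the layer sizes, ReLU activation, and random weights are invented for illustration):

```python
import numpy as np

def dense(x, W, b):
    """One fully connected layer with a ReLU activation."""
    return np.maximum(W @ x + b, 0.0)

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]               # input dim, then three layer widths

h = rng.normal(size=sizes[0])      # a random input vector
for n_in, n_out in zip(sizes[:-1], sizes[1:]):
    W = rng.normal(size=(n_out, n_in))
    b = np.zeros(n_out)
    h = dense(h, W, b)             # each layer's output feeds the next
```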
SLIDE 31

Neural Network Basics: More About Layers

  • Input layer: input vectors are given as inputs here
  • Hidden layer: intermediate representation of the inputs

– Multiple hidden layers can be stacked together

  • Output layer: final output

– Can have one or more neurons in the output layer

  • Note that information flows in one direction
SLIDE 32

CBOW: Continuous Bag of Words

Example: “the cat sat on floor” (window size 2)

Input: context words
Output: middle word

SLIDE 33

The Architecture

Architecture: input layer, hidden layer, and output layer

  • Fully connected layers

Input: one-hot vectors of the context words
Desired output: one-hot vector of the middle word

SLIDE 34

The Architecture

Input size: ℝ^|V|
Hidden layer size: ℝ^N
Output size: ℝ^|V|
Input-to-hidden weight matrix: W, of size |V| x N

  • All inputs share the W matrix

Hidden-to-output weight matrix: W’, of size N x |V|

  • All weight matrices are shared across all examples

SLIDE 35

Parameters To Be Learned

  • Size of the input and output word vectors = |V|
  • All weights are to be learned during the training process

SLIDE 36

Input to Hidden Layer

  • Matrix multiplication generates the hidden vector

– Multiplication of the input one-hot vector with the input-to-hidden weight matrix

  • One multiplication per input
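Because the input is one-hot, this multiplication simply selects one row of W. A quick sketch (|V| = 7 and a small N follow the running example; the weight values themselves are invented):

```python
import numpy as np

V, N = 7, 3
W = np.arange(V * N, dtype=float).reshape(V, N)  # input-to-hidden weights

x = np.zeros(V)
x[1] = 1.0        # one-hot vector for the word with index 2 ('cat')
h = x @ W         # identical to simply reading out row 1 of W
```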

SLIDE 37

Input to Hidden Layer

Multiplication for ‘cat’

SLIDE 38

Input to Hidden Layer

Multiplication for ‘on’

SLIDE 39

Hidden Layer

  • Aggregation is done at the hidden layer

– Example: simple averaging
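Putting the pieces together: select a row of W per context word, average them at the hidden layer, and score the vocabulary. A minimal CBOW forward-pass sketch (|V| = 7, N = 3, and random weights are invented; a softmax at the output turns scores into a probability over the vocabulary):

```python
import numpy as np

V, N = 7, 3
rng = np.random.default_rng(1)
W = rng.normal(size=(V, N))        # input-to-hidden weights, shared by all inputs
W_out = rng.normal(size=(N, V))    # hidden-to-output weights

def cbow_forward(context_ids):
    """Average the context-word rows of W, then softmax over the vocabulary."""
    h = W[context_ids].mean(axis=0)      # simple averaging at the hidden layer
    scores = h @ W_out
    e = np.exp(scores - scores.max())    # numerically stable softmax
    return e / e.sum()

probs = cbow_forward([0, 1, 3, 4])       # indices of the context words
```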