
SLIDE 1

Word Embeddings

CSE 6240: Web Search and Text Mining, Spring 2020
Prof. Srijan Kumar, with Arindum Roy and Roshan Pati

SLIDE 2

Administrivia

  • Homework: Will be released today after class
  • Project Reminder: Teams due Monday Jan 20.
  • A fun exercise at the end of the class!
SLIDE 3

Homework Policy

  • Late day policy: 3 late days (3 x 24 hour chunks)

– Use as needed

  • Collaboration:

– OK to talk, discuss the questions, and potential directions for solving them. However, you need to write your own solutions and code separately, and NOT as a group activity.
– Please list the students you collaborated with.

  • Zero tolerance on plagiarism

– Follow the GT academic honesty rules

SLIDE 4

Recap So Far

  • 1. IR and text processing
  • 2. Evaluation of IR systems
SLIDE 5

Today’s Lecture

  • Representing words and phrases

– Neural network basics
– Word2vec
– Continuous bag of words
– Skip-gram model

Some slides in this lecture are inspired by the slides of Prof. Leonid Sigal, UBC

SLIDE 6

Representing a Word: One Hot Encoding​

  • Given a vocabulary

Vocabulary: dog, cat, person, holding, tree, computer, using

SLIDE 7

Representing a Word: One Hot Encoding​

  • Given a vocabulary

dog → 1, cat → 2, person → 3, holding → 4, tree → 5, computer → 6, using → 7

SLIDE 8

Representing a Word: One Hot Encoding​

  • Given a vocabulary, convert to One Hot Encoding

dog      (1) → [ 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
cat      (2) → [ 0, 1, 0, 0, 0, 0, 0, 0, 0, 0 ]
person   (3) → [ 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 ]
holding  (4) → [ 0, 0, 0, 1, 0, 0, 0, 0, 0, 0 ]
tree     (5) → [ 0, 0, 0, 0, 1, 0, 0, 0, 0, 0 ]
computer (6) → [ 0, 0, 0, 0, 0, 1, 0, 0, 0, 0 ]
using    (7) → [ 0, 0, 0, 0, 0, 0, 1, 0, 0, 0 ]
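A minimal Python sketch of this mapping (the vocabulary is from the slide; the 10-dimensional vectors leave trailing zeros for the rest of the larger vocabulary, as on the slide):

```python
import numpy as np

# Vocabulary from the slide, indexed from 0 internally.
vocab = ["dog", "cat", "person", "holding", "tree", "computer", "using"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word: str, dim: int = 10) -> np.ndarray:
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(dim)
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("person"))  # [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
```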

SLIDE 9

Recap: Bag of Words Model

  • Represent a document as a collection of words (after cleaning the document)
    – The order of words is irrelevant
    – The document “John is quicker than Mary” is indistinguishable from the document “Mary is quicker than John”
  • Rank documents according to the overlap between query words and document words

SLIDE 10

Representing Phrases: Bag of Words​

Bag-of-words representation over the vocabulary: dog, cat, person, holding, tree, computer, using

SLIDE 11

Representing Phrases: Bag of Words​

Bag-of-words representation (dimensions: dog, cat, person, holding, tree, computer, using):

person holding dog → {3, 4, 1} → [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]

SLIDE 12

Representing Phrases: Bag of Words​

Bag-of-words representation (dimensions: dog, cat, person, holding, tree, computer, using):

person holding dog → {3, 4, 1} → [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]
person holding cat → {3, 4, 2} → [ 0, 1, 1, 1, 0, 0, 0, 0, 0, 0 ]

SLIDE 13

Representing Phrases: Bag of Words​

Bag-of-words representation (dimensions: dog, cat, person, holding, tree, computer, using):

person holding dog → {3, 4, 1} → [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]
person holding cat → {3, 4, 2} → [ 0, 1, 1, 1, 0, 0, 0, 0, 0, 0 ]
person using computer → {3, 7, 6} → [ 0, 0, 1, 0, 0, 1, 1, 0, 0, 0 ]

SLIDE 14

Representing Phrases: Bag of Words​

Bag-of-words representation (dimensions: dog, cat, person, holding, tree, computer, using):

person holding dog → {3, 4, 1} → [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]
person holding cat → {3, 4, 2} → [ 0, 1, 1, 1, 0, 0, 0, 0, 0, 0 ]
person using computer → {3, 7, 6} → [ 0, 0, 1, 0, 0, 1, 1, 0, 0, 0 ]
person using computer, person holding cat → {3, 7, 6, 3, 4, 2} → [ 0, 1, 2, 1, 0, 1, 1, 0, 0, 0 ]
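A small sketch of the count-vector construction above (same toy vocabulary; the last example reproduces the corrected count vector with person appearing twice):

```python
import numpy as np

vocab = ["dog", "cat", "person", "holding", "tree", "computer", "using"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def bag_of_words(phrase: str, dim: int = 10) -> np.ndarray:
    """Count vector: entry i counts occurrences of vocabulary word i."""
    vec = np.zeros(dim)
    for word in phrase.split():
        vec[word_to_index[word]] += 1.0
    return vec

print(bag_of_words("person using computer person holding cat"))
# [0. 1. 2. 1. 0. 1. 1. 0. 0. 0.]
```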

SLIDE 15

Distributional Hypothesis [Lenci, 2008]

  • The degree of semantic similarity between two linguistic expressions is a function of the similarity of their linguistic contexts
  • Similarity in meaning ∝ similarity of context
  • Simple definition: context = surrounding words
SLIDE 16

What Is The Meaning Of “Bardiwac”?

  • He handed her a glass of bardiwac.
  • Beef dishes are made to complement the bardiwac.
  • Nigel staggered to his feet, face flushed from too much bardiwac.
  • Malbec, one of the lesser-known bardiwac grapes, responds well to Australia’s sunshine.
  • I dined off bread and cheese and this excellent bardiwac.
  • The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish.

SLIDE 17

What Is The Meaning Of “Bardiwac”?

  • He handed her a glass of bardiwac.
  • Beef dishes are made to complement the bardiwac.
  • Nigel staggered to his feet, face flushed from too much bardiwac.
  • Malbec, one of the lesser-known bardiwac grapes, responds well to Australia’s sunshine.
  • I dined off bread and cheese and this excellent bardiwac.
  • The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish.

Inference: bardiwac is an alcoholic beverage made from grapes

SLIDE 18

Geometric Interpretation: Co-occurrence As Feature

  • Recall the term-document matrix
    – Rows are terms, columns are documents; each cell counts the number of times a term appears in a document
  • Here we create a word-word co-occurrence matrix
    – Rows and columns are words
    – Cell (R, C) counts how many times word C appears in the neighborhood of word R
  • Neighborhood = a window of fixed size around the word (see the sketch below)
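A minimal sketch of building such a matrix, assuming a toy tokenized corpus and a symmetric window:

```python
import numpy as np

def cooccurrence_matrix(corpus, window=2):
    """Word-word co-occurrence matrix: cell (R, C) counts how many
    times word C appears within `window` positions of word R."""
    vocab = sorted({w for sent in corpus for w in sent})
    idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for sent in corpus:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    M[idx[w], idx[sent[j]]] += 1
    return vocab, M

corpus = [["person", "holding", "dog"], ["person", "using", "computer"]]
vocab, M = cooccurrence_matrix(corpus)
print(vocab)
print(M)
```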
SLIDE 19

Row Vectors in Co-occurrence Matrix

  • A row vector describes the usage of the word in the corpus/document
  • Row vectors can be seen as coordinates of points in an n-dimensional Euclidean space
  • Example: n = 2
  • Dimensions = ‘get’ and ‘use’

[Figure: co-occurrence matrix]

SLIDE 20

Distance And Similarity

  • Selected two dimensions: ‘get’ and ‘use’
  • Similarity between words = spatial proximity in the dimension space
  • Measured by the Euclidean distance
SLIDE 21

Distance And Similarity

  • The exact position in the space depends on the frequency of the word
  • More frequent words will appear farther from the origin
    – E.g., say ‘dog’ is more frequent than ‘cat’
    – This does not mean it is more important
  • Solution: Ignore the length and look only at the direction
SLIDE 22

Angle And Similarity

  • The angle ignores the exact location of the point
  • Method: Normalize by the length of the vectors, or use only the angle as a distance measure
  • Standard metric: Cosine similarity between vectors
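A minimal sketch of cosine similarity on row vectors, with hypothetical ‘get’/‘use’ counts chosen so that ‘cat’ points in the same direction as ‘dog’ at half the length:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """cos(theta) = (u . v) / (||u|| ||v||); invariant to vector length."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Rows of a co-occurrence matrix as 2-D points: counts with 'get', 'use'.
dog = np.array([10.0, 2.0])
cat = np.array([5.0, 1.0])       # same direction as dog, half the length
computer = np.array([1.0, 9.0])

print(cosine_similarity(dog, cat))       # 1.0: identical direction
print(cosine_similarity(dog, computer))  # ~0.30: very different direction
```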

SLIDE 23

Issues with Co-occurrence Matrix

  • Problems with using the co-occurrence matrix directly:
    – The resulting vectors are very high dimensional
    – Dimension size = number of words in the corpus: billions!
    – Down-sampling dimensions is not straightforward
      • How many columns to select?
      • Which columns to select?
  • Solution: compression or dimensionality reduction techniques

SLIDE 24

SVD for Dimensionality Reduction

  • SVD = Singular Value Decomposition
  • For an input matrix X, factorize X = U S Vᵀ
    – U = left-singular vectors of X, and V = right-singular vectors of X
    – S is a diagonal matrix
  • The diagonal values of S are called singular values
  • Keeping the first r columns of U gives an r-dimensional vector for every row of X
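A minimal numpy sketch of truncated SVD on a toy co-occurrence matrix (the counts are illustrative assumptions):

```python
import numpy as np

# Toy co-occurrence matrix X; rows could be dog, cat, computer.
X = np.array([[10., 2., 0.],
              [ 5., 1., 0.],
              [ 0., 1., 9.]])

U, S, Vt = np.linalg.svd(X, full_matrices=False)
r = 2
word_vectors = U[:, :r] * S[:r]   # r-dimensional vector per row of X
print(word_vectors.shape)          # (3, 2)
```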
SLIDE 25

Word Visualization via Dimensionality Reduction

SLIDE 26

Issues with SVD

  • Computational cost of SVD on an N × M matrix is O(MN²), where N < M
  • Infeasible for large vocabularies or document collections; impractical for a real corpus
  • It is hard to incorporate out-of-sample or new words/documents
    – The entire row in the matrix would be 0
SLIDE 27

Word2Vec: Representing Word Meanings

Key idea: Predict the surrounding words of every word

Benefits:
  • Faster
  • Easier to incorporate new words and documents

Main paper: Distributed Representations of Words and Phrases and their Compositionality. Mikolov et al., NeurIPS, 2013.

SLIDE 28

Two Styles of Learning Word2Vec

  • Continuous Bag of Words (CBOW): uses the context words in a window to predict the middle word
  • Skip-gram: uses the middle word to predict the context words in a window

SLIDE 29

Neural Network Basics: Neuron

  • The basic building block of neural networks
  • Input is a vector: x = [x1, …, xm]
  • Weights and bias:
    – The neuron has weights w = [w1, w2, …, wm]
    – Bias term = b (or w0)
  • Activation function:
    – Transforms the aggregate
    – E.g., sigmoid, ReLU
  • Output computation: y = f(w · x + b), where f is the activation function
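A minimal sketch of a single neuron; the sigmoid activation and the weight values are illustrative assumptions:

```python
import numpy as np

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x: np.ndarray, w: np.ndarray, b: float) -> float:
    """One neuron: weighted sum of inputs plus bias, then activation."""
    return sigmoid(np.dot(w, x) + b)

x = np.array([1.0, 0.5, -0.2])   # input vector
w = np.array([0.4, -0.1, 0.7])   # hypothetical weights
print(neuron(x, w, b=0.1))
```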
SLIDE 30

Neural Network Basics: Fully Connected Layer

  • A layer whose neurons are connected to all the neurons in the previous layer
    – Each neuron takes as input all the outputs from the previous layer
  • Multiple layers can be stacked together
  • Example: 3 fully connected layers
SLIDE 31

Neural Network Basics: More About Layers

  • Input layer: input vectors are given as inputs here
  • Hidden layer: intermediate representation of the inputs
    – Multiple hidden layers can be stacked together
  • Output layer: final output
    – Can have one or more neurons in the output layer
  • Note that information flows in one direction
SLIDE 32

CBOW: Continuous Bag of Words

Example: “the cat sat on floor” (window size 2)

Input: context words (the, cat, on, floor)
Output: middle word (sat)

SLIDE 33

The Architecture

Architecture: input layer, hidden layer, and output layer
  • Fully connected layers

Input: one-hot vectors of the context words
Desired output: one-hot vector of the middle word

SLIDE 34

The Architecture

Input size: ℝ^|V|
Hidden layer size: ℝ^N
Output size: ℝ^|V|

Input-to-hidden layer weight matrix W, of size |V| × N
  • All inputs share the W matrix

Hidden-to-output layer weight matrix W′, of size N × |V|
  • All weight matrices are shared across all examples
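A minimal numpy sketch of this architecture's forward pass, under toy assumptions (tiny sizes, random initialization; the softmax at the output is defined a few slides ahead):

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 10, 4                                 # vocabulary size, hidden size
W = rng.normal(scale=0.1, size=(V, N))       # input-to-hidden, |V| x N
W_out = rng.normal(scale=0.1, size=(N, V))   # hidden-to-output, N x |V|

def one_hot(i, dim=V):
    v = np.zeros(dim); v[i] = 1.0
    return v

def cbow_forward(context_ids):
    """CBOW: average the context-word projections (one multiplication
    per input), then map the hidden vector to vocabulary scores."""
    h = np.mean([one_hot(i) @ W for i in context_ids], axis=0)
    z = h @ W_out
    return np.exp(z) / np.exp(z).sum()       # softmax over the vocabulary

probs = cbow_forward([0, 1, 3, 4])           # e.g., the, cat, on, floor
print(probs.argmax())                        # predicted middle-word index
```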

SLIDE 35

Parameters To Be Learned

  • Size of the input and output word vectors = |V|
  • All weights are to be learned during the training process

SLIDE 36

Input to Hidden Layer

  • Matrix multiplication generates the hidden vector
    – Multiplication of the input one-hot vector with the input-to-hidden layer matrix
  • One multiplication per input

SLIDE 37

Input to Hidden Layer

Multiplication for ‘cat’

SLIDE 38

Input to Hidden Layer

Multiplication for ‘on’

SLIDE 39

Hidden Layer

  • Aggregation is done at the hidden layer
    – Example: simple averaging

SLIDE 40

Hidden to Output Layer

  • The hidden vector is converted to the output vector
    – Multiplication of the hidden vector with the hidden-to-output matrix W′
SLIDE 41

Output Layer

  • The final output is generated as a softmax of the output vector
    – Softmax is for normalization
  • Softmax function: softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)
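A numerically stable sketch of the softmax (subtracting max(z) avoids overflow without changing the result):

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Normalize raw scores into a probability distribution."""
    e = np.exp(z - z.max())
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # entries sum to 1
```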
SLIDE 42

Output Layer Optimization

  • The loss is calculated as the difference between the generated vector ẑ and the desired one-hot vector
    – Loss = ‖ẑ − z_sat‖, where z_sat is the one-hot vector for ‘sat’
  • Optimization is done to get close to the one-hot encoding
SLIDE 43

Generate Final Embeddings

  • After optimization, the learned input-to-hidden layer weight matrix is used to generate the word2vec embeddings

[Figure: multiplication for ‘on’]

SLIDE 44

Model Learning Method

  • 1. Generate several training samples
    – N pairs of <word, context words>
  • 2. Initialize the neural network architecture (all the weight matrices)
    – One way is random initialization
  • 3. Iterate till convergence:
    a) Calculate the prediction for each training sample
    b) Calculate the loss for each training sample and aggregate
    c) Backpropagate the loss to update the weights
  • 4. Generate the final word2vec embeddings (a sketch of this loop follows below)
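A compact numpy sketch of steps 1–4 for CBOW, under toy assumptions (tiny vocabulary, a single training sample, and cross-entropy loss standing in for the vector distance on the previous slide):

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, lr = 10, 4, 0.1
W = rng.normal(scale=0.1, size=(V, N))       # input-to-hidden
W_out = rng.normal(scale=0.1, size=(N, V))   # hidden-to-output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# 1. training samples: (context word ids, middle word id) -- toy data
samples = [([0, 1, 3, 4], 2)]

# 2. weights were randomly initialized above; 3. iterate till convergence
for epoch in range(100):
    for context, target in samples:
        h = W[context].mean(axis=0)          # a) forward: hidden vector
        y = softmax(h @ W_out)               #    prediction
        e = y.copy(); e[target] -= 1.0       # b) gradient of the loss
        W_out -= lr * np.outer(h, e)         # c) backpropagate
        W[context] -= lr * (W_out @ e) / len(context)

# 4. rows of W are the learned word2vec embeddings
embedding = W[2]
```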
SLIDE 45

Recap: Word Embeddings

  • Representing words and phrases

– Neural network basics
– Word2vec
– Continuous bag of words
– Skip-gram model
– GloVe embeddings

Some slides in this lecture are inspired by the slides of Prof. Leonid Sigal, UBC

SLIDE 46

Skip-Gram Model
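The architecture diagram on this slide is an image; as a stand-in, here is a minimal numpy sketch of the skip-gram forward pass (toy sizes and random initialization are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 10, 4
W = rng.normal(scale=0.1, size=(V, N))       # center-word vectors
W_out = rng.normal(scale=0.1, size=(N, V))   # context-word scores

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def skipgram_forward(center_id):
    """Skip-gram: the middle word predicts each surrounding word.
    One distribution over V, reused for every context position."""
    h = W[center_id]             # embedding of the middle word
    return softmax(h @ W_out)    # p(context word | middle word)

p = skipgram_forward(2)          # e.g., 'sat' predicts the, cat, on, floor
```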

SLIDE 47

Model Learning Method

  • Input: training corpus (w1, w2, …, wN) from vocabulary V
  • Goal: maximize the average log probability of context words given each word (first equation below)
  • Context probability p(c|w): a softmax over the vocabulary (second equation below)
  • Issue: calculating the score in the denominator and backpropagating the loss is expensive
    – The vocabulary size |V| is huge, so ∇ log p(c|w) takes O(|V|) time to compute
  • Solution: sample rather than summing over the whole vocabulary (negative sampling, next slide)
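The two equations referenced above were images on the original slide; these are the standard skip-gram formulas from Mikolov et al. (2013), with v_w a center-word vector, u_c a context-word vector, and c the window size:

```latex
\max \;\; \frac{1}{N} \sum_{t=1}^{N} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t),
\qquad
p(c \mid w) = \frac{\exp(u_c^\top v_w)}{\sum_{c' \in V} \exp(u_{c'}^\top v_w)}
```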
SLIDE 48

Solution: Negative Sampling

  • Key idea: do not take all pairs in the denominator
  • Method: create training pairs (w, c̃), where the c̃ are negative samples
  • Given a pair (w, c), can we determine whether this came from our corpus or not?
    – Maximize p(D=1 | w, c) for pairs (w, c) that occur in the data
    – Also maximize p(D=0 | w, cN) for pairs (w, cN) where cN is sampled randomly from the empirical unigram distribution

SLIDE 49

Skip-Gram Negative Sampling Objective

  • Model probabilistically as p(D=1 | w, c) = σ(u_c · v_w)
  • For each word pair, the loss is given below
  • k: the number of negative samples to take (a hyperparameter)
  • E is an expectation taken with respect to the negative-sample distribution
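The per-pair loss, reconstructed from the standard skip-gram negative sampling objective of Mikolov et al. (2013), with P_n(w) the noise (empirical unigram) distribution:

```latex
\ell(w, c) = -\log \sigma(u_c^\top v_w)
\;-\; \sum_{i=1}^{k} \mathbb{E}_{\tilde c_i \sim P_n(w)} \left[ \log \sigma(-u_{\tilde c_i}^\top v_w) \right]
```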
SLIDE 50

Comparing CBOW and Skip-Gram

  • CBOW is not great for rare words, but typically needs less data to train
  • Skip-gram is better for rare words, but needs more data to train the model

SLIDE 51

Insight: Words Have Regularity

Semantic Regularity

  • woman is to sister as man is to ______
  • summer is to rain as winter is to ______
  • man is to king as woman is to ______

Syntactic Regularity

  • fell is to fallen as ate is to ______
  • running is to ran as crying is to ______
SLIDE 52

Insight: Words Have Regularity

Semantic Regularity

  • woman is to sister as man is to brother
  • summer is to rain as winter is to snow
  • man is to king as woman is to queen

Syntactic Regularity

  • fell is to fallen as ate is to eaten
  • running is to ran as crying is to cried
SLIDE 53

Insight: Words Have Regularity

Words have regularity:

  • Semantic Regularity: woman is to sister as man is to brother
  • Syntactic Regularity: running is to ran as crying is to cried

Insight: The differences between each pair of words are similar. Can our word representations capture analogies?

YES!

SLIDE 54

Word Embeddings Capture Analogies

  • To solve analogy problems of the form “wa is to wb as wc is to what?”, we can compute a query vector q = wb − wa + wc
  • Find the word vector wx most similar to q
    – Similarity is computed by cosine similarity
    – Recall: cosine similarity normalizes the vector length and focuses on the vector direction
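A small sketch of this query, using the toy 2-D vectors from the next slide:

```python
import numpy as np

# Toy 2-D embeddings, values as shown on the next slide.
E = {"man": np.array([0.20, 0.20]),
     "woman": np.array([0.60, 0.30]),
     "king": np.array([0.30, 0.70]),
     "queen": np.array([0.70, 0.80])}

def analogy(a, b, c):
    """a : b :: c : ?  ->  query q = E[b] - E[a] + E[c];
    answer is the word (excluding a, b, c) closest to q by cosine."""
    q = E[b] - E[a] + E[c]
    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = {w: v for w, v in E.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cos(candidates[w], q))

print(analogy("man", "woman", "king"))   # queen
```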

SLIDE 55

Word Embeddings Capture Analogies

Test for linear relationships: man : woman :: king : ?

  king  [ 0.30, 0.70 ]
− man   [ 0.20, 0.20 ]
+ woman [ 0.60, 0.30 ]
= [ 0.70, 0.80 ] ≈ queen [ 0.70, 0.80 ]

SLIDE 56

Analogies: Example

SLIDE 57

Recap: Word Embeddings

  • Representing words and phrases

– Neural network basics
– Word2vec
– Continuous bag of words
– Skip-gram model
– GloVe embeddings

Some slides in this lecture are inspired by the slides of Prof. Leonid Sigal, UBC

SLIDE 58

GloVe: Global Vectors for Word Representation

How can we capture complex word relations?

  • For example: which words are related to ice, but not to steam?
  • Method: create a word co-occurrence matrix
  • Calculate the probability ratio of “trigrams” (word triplets)

Credit: https://medium.com/@jonathan_hui/nlp-word-embedding-glove-5e7f523999f6

SLIDE 59

Trigram Probability Ratio: Interpretation

What is the interpretation of the probability ratio Pik / Pjk?

  • Very high ratio: Pik >> Pjk ⇒ word k is closer to i than to j
  • Very low ratio: Pik << Pjk ⇒ word k is closer to j than to i
  • Ratio close to 1: Pik ≈ Pjk ⇒ k is equally similar (or dissimilar) to both i and j

SLIDE 60

Trigram Probability Ratio: Example

Trigram probability ratios are informative:

  • solid is related to ice, but not to steam
  • gas is related to steam, but not to ice
  • water is similar to both ice and steam
  • fashion is dissimilar to both ice and steam
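For concreteness, a sketch of the ice/steam ratios behind these bullets. The probabilities are cited from memory of Table 1 in the GloVe paper (Pennington et al., 2014), so treat the exact values as approximate:

```python
# Co-occurrence probabilities P(k | ice) and P(k | steam); approximate
# values recalled from the GloVe paper's Table 1, not authoritative.
P_ice   = {"solid": 1.9e-4, "gas": 6.6e-5, "water": 3.0e-3, "fashion": 1.7e-5}
P_steam = {"solid": 2.2e-5, "gas": 7.8e-4, "water": 2.2e-3, "fashion": 1.8e-5}

for k in P_ice:
    print(k, round(P_ice[k] / P_steam[k], 2))
# solid 8.64, gas 0.08, water 1.36, fashion 0.94 -> matches the bullets above
```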
SLIDE 61

GloVe Formulation: Mathematics

  • The vectors wi, wj, and wk should capture the information in the probability ratio
  • Solution: train the word vectors so that dot products match log co-occurrence probabilities (below)
  • Then the information in the co-occurrence probability ratio is expressed in the vector space
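The elided relations, as in Pennington et al. (2014), with w̃_k a context-word vector:

```latex
w_i^\top \tilde w_k = \log P_{ik}
\quad\Longrightarrow\quad
(w_i - w_j)^\top \tilde w_k = \log P_{ik} - \log P_{jk} = \log \frac{P_{ik}}{P_{jk}}
```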

SLIDE 62

GloVe Training Objective

  • Input: two sets of vectors: “target” vectors w and “context” vectors w̃
  • Training objective is to ensure wiᵀ w̃j + bi + b̃j ≈ log Xij
    – Xij = N(wi, wj), the number of times wj appears in the context of wi
    – bi and b̃j are bias terms
  • The training objective is to minimize the squared (L2) loss between the two sides (below)
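The loss, as in Pennington et al. (2014), with the weighting f(Xij) defined on the next slide:

```latex
J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^\top \tilde w_j + b_i + \tilde b_j - \log X_{ij} \right)^2
```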
SLIDE 63

GloVe Training Objective

  • Problem: all co-occurrences are weighted equally
  • Solution: insert an importance score f(Xij) for word pairs
    – f(Xij) serves as a “dampener”, lessening the weight of rare co-occurrences
    – f(x) = (x / Xmax)^α for x < Xmax, and 1 otherwise, where (default) α = 3/4, Xmax = 100
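A one-function sketch of this weighting, using the default α and Xmax from the slide:

```python
def glove_weight(x: float, x_max: float = 100.0, alpha: float = 0.75) -> float:
    """GloVe co-occurrence weighting: down-weights rare pairs and
    caps the influence of very frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

print(glove_weight(1))    # ~0.032: rare pair, tiny weight
print(glove_weight(50))   # ~0.59
print(glove_weight(500))  # 1.0: capped
```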

SLIDE 64

Comparing GloVe With Other Models

  • Dataset: 19,544 analogy questions from the word-analogy benchmark of Mikolov et al.; NER results use the CoNLL-2003 shared-task dataset
  • Training datasets:
    – Wiki2010: 1B tokens
    – Wiki2014: 1.6B tokens
    – Gigaword5: 4.3B tokens
    – Gigaword5 + Wiki2014: 6B tokens

SLIDE 65

GloVe Accuracy vs Training Corpus Choice

SLIDE 66

Word2vec (Skip-Gram) vs GloVe: Accuracy

  • Trained 300-dimensional vectors on a 6B-token corpus (400K vocabulary)
SLIDE 67

GloVe Accuracy vs Vector Size

  • Semantic analogies: woman is to sister as man is to brother
  • Syntactic analogies: running is to ran as crying is to cried
  • Training window size = 10
SLIDE 68

GloVe Accuracy vs Window Size

  • Vector size = 100
SLIDE 69

Wrapping Up

  • Representing words and phrases

– Neural network basics
– Word2vec
– Continuous bag of words
– Skip-gram model
– GloVe embeddings