
SLIDE 1

Word Embeddings

CSE 6240: Web Search and Text Mining, Spring 2020
Prof. Srijan Kumar, with Arindum Roy and Roshan Pati

SLIDE 2

Administrivia

  • Homework: Will be released today after class
  • Project Reminder: Teams due Monday Jan 20.
  • A fun exercise at the end of the class!
SLIDE 3

Homework Policy

  • Late day policy: 3 late days (3 x 24 hour chunks)

– Use as needed

  • Collaboration:

– OK to talk, discuss the questions, and potential directions for solving them. However, you need to write your own solutions and code separately, and NOT as a group activity.
– Please list the students you collaborated with.

  • Zero tolerance on plagiarism

– Follow the GT academic honesty rules

SLIDE 4

Recap So Far

  • 1. IR and text processing
  • 2. Evaluation of IR systems
SLIDE 5

Today’s Lecture

  • Representing words and phrases

– Neural network basics
– Word2vec
– Continuous bag of words
– Skip-gram model

Some slides in this lecture are inspired by the slides of Prof. Leonid Sigal, UBC

SLIDE 6

Representing a Word: One Hot Encoding​

  • Given a vocabulary

Vocabulary: dog, cat, person, holding, tree, computer, using

SLIDE 7

Representing a Word: One Hot Encoding​

  • Given a vocabulary

dog → 1, cat → 2, person → 3, holding → 4, tree → 5, computer → 6, using → 7

SLIDE 8

Representing a Word: One Hot Encoding​

  • Given a vocabulary, convert to One Hot Encoding

dog      (1) → [ 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
cat      (2) → [ 0, 1, 0, 0, 0, 0, 0, 0, 0, 0 ]
person   (3) → [ 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 ]
holding  (4) → [ 0, 0, 0, 1, 0, 0, 0, 0, 0, 0 ]
tree     (5) → [ 0, 0, 0, 0, 1, 0, 0, 0, 0, 0 ]
computer (6) → [ 0, 0, 0, 0, 0, 1, 0, 0, 0, 0 ]
using    (7) → [ 0, 0, 0, 0, 0, 0, 1, 0, 0, 0 ]
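A minimal Python sketch of this mapping (the vocabulary is from the slide; the 10-dimensional vectors leave trailing zeros for the rest of the larger vocabulary, as on the slide):

```python
import numpy as np

# Vocabulary from the slide, indexed from 0 internally.
vocab = ["dog", "cat", "person", "holding", "tree", "computer", "using"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word: str, dim: int = 10) -> np.ndarray:
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(dim)
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("person"))  # [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
```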

SLIDE 9

Recap: Bag of Words Model

  • Represent a document as a collection of words (after cleaning the document)
    – The order of words is irrelevant
    – The document “John is quicker than Mary” is indistinguishable from the document “Mary is quicker than John”
  • Rank documents according to the overlap between query words and document words

SLIDE 10

Representing Phrases: Bag of Words​

Bag-of-words representation over the vocabulary: dog, cat, person, holding, tree, computer, using

SLIDE 11

Representing Phrases: Bag of Words​

Bag-of-words representation (dimensions: dog, cat, person, holding, tree, computer, using):

person holding dog → {3, 4, 1} → [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]

SLIDE 12

Representing Phrases: Bag of Words​

Bag-of-words representation (dimensions: dog, cat, person, holding, tree, computer, using):

person holding dog → {3, 4, 1} → [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]
person holding cat → {3, 4, 2} → [ 0, 1, 1, 1, 0, 0, 0, 0, 0, 0 ]

SLIDE 13

Representing Phrases: Bag of Words​

Bag-of-words representation (dimensions: dog, cat, person, holding, tree, computer, using):

person holding dog → {3, 4, 1} → [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]
person holding cat → {3, 4, 2} → [ 0, 1, 1, 1, 0, 0, 0, 0, 0, 0 ]
person using computer → {3, 7, 6} → [ 0, 0, 1, 0, 0, 1, 1, 0, 0, 0 ]

SLIDE 14

Representing Phrases: Bag of Words​

Bag-of-words representation (dimensions: dog, cat, person, holding, tree, computer, using):

person holding dog → {3, 4, 1} → [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]
person holding cat → {3, 4, 2} → [ 0, 1, 1, 1, 0, 0, 0, 0, 0, 0 ]
person using computer → {3, 7, 6} → [ 0, 0, 1, 0, 0, 1, 1, 0, 0, 0 ]
person using computer, person holding cat → {3, 7, 6, 3, 4, 2} → [ 0, 1, 2, 1, 0, 1, 1, 0, 0, 0 ]
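A small sketch of the count-vector construction above (same toy vocabulary; the last example reproduces the corrected count vector with person appearing twice):

```python
import numpy as np

vocab = ["dog", "cat", "person", "holding", "tree", "computer", "using"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def bag_of_words(phrase: str, dim: int = 10) -> np.ndarray:
    """Count vector: entry i counts occurrences of vocabulary word i."""
    vec = np.zeros(dim)
    for word in phrase.split():
        vec[word_to_index[word]] += 1.0
    return vec

print(bag_of_words("person using computer person holding cat"))
# [0. 1. 2. 1. 0. 1. 1. 0. 0. 0.]
```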

SLIDE 15

Distributional Hypothesis [Lenci, 2008]

  • The degree of semantic similarity between two linguistic expressions is a function of the similarity of their linguistic contexts
  • Similarity in meaning ∝ similarity of context
  • Simple definition: context = surrounding words
SLIDE 16

What Is The Meaning Of “Bardiwac”?

  • He handed her a glass of bardiwac.
  • Beef dishes are made to complement the bardiwac.
  • Nigel staggered to his feet, face flushed from too much bardiwac.
  • Malbec, one of the lesser-known bardiwac grapes, responds well to Australia’s sunshine.
  • I dined off bread and cheese and this excellent bardiwac.
  • The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish.

SLIDE 17

What Is The Meaning Of “Bardiwac”?

  • He handed her a glass of bardiwac.
  • Beef dishes are made to complement the bardiwac.
  • Nigel staggered to his feet, face flushed from too much bardiwac.
  • Malbec, one of the lesser-known bardiwac grapes, responds well to Australia’s sunshine.
  • I dined off bread and cheese and this excellent bardiwac.
  • The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish.

Inference: bardiwac is an alcoholic beverage made from grapes

SLIDE 18

Geometric Interpretation: Co-occurrence As Feature

  • Recall the term-document matrix
    – Rows are terms, columns are documents; each cell counts the number of times a term appears in a document
  • Here we create a word-word co-occurrence matrix
    – Rows and columns are words
    – Cell (R, C) counts how many times word C appears in the neighborhood of word R
  • Neighborhood = a window of fixed size around the word (see the sketch below)
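A minimal sketch of building such a matrix, assuming a toy tokenized corpus and a symmetric window:

```python
import numpy as np

def cooccurrence_matrix(corpus, window=2):
    """Word-word co-occurrence matrix: cell (R, C) counts how many
    times word C appears within `window` positions of word R."""
    vocab = sorted({w for sent in corpus for w in sent})
    idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for sent in corpus:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    M[idx[w], idx[sent[j]]] += 1
    return vocab, M

corpus = [["person", "holding", "dog"], ["person", "using", "computer"]]
vocab, M = cooccurrence_matrix(corpus)
print(vocab)
print(M)
```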
SLIDE 19

Row Vectors in Co-occurrence Matrix

  • A row vector describes the usage of the word in the corpus/document
  • Row vectors can be seen as coordinates of points in an n-dimensional Euclidean space
  • Example: n = 2
  • Dimensions = ‘get’ and ‘use’

[Figure: co-occurrence matrix]

SLIDE 20

Distance And Similarity

  • Selected two dimensions: ‘get’ and ‘use’
  • Similarity between words = spatial proximity in the dimension space
  • Measured by the Euclidean distance
SLIDE 21

Distance And Similarity

  • The exact position in the space depends on the frequency of the word
  • More frequent words will appear farther from the origin
    – E.g., say ‘dog’ is more frequent than ‘cat’
    – This does not mean it is more important
  • Solution: Ignore the length and look only at the direction
SLIDE 22

Angle And Similarity

  • The angle ignores the exact location of the point
  • Method: Normalize by the length of the vectors, or use only the angle as a distance measure
  • Standard metric: Cosine similarity between vectors
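A minimal sketch of cosine similarity on row vectors, with hypothetical ‘get’/‘use’ counts chosen so that ‘cat’ points in the same direction as ‘dog’ at half the length:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """cos(theta) = (u . v) / (||u|| ||v||); invariant to vector length."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Rows of a co-occurrence matrix as 2-D points: counts with 'get', 'use'.
dog = np.array([10.0, 2.0])
cat = np.array([5.0, 1.0])       # same direction as dog, half the length
computer = np.array([1.0, 9.0])

print(cosine_similarity(dog, cat))       # 1.0: identical direction
print(cosine_similarity(dog, computer))  # ~0.30: very different direction
```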

SLIDE 23

Issues with Co-occurrence Matrix

  • Problems with using the co-occurrence matrix directly:
    – The resulting vectors are very high dimensional
    – Dimension size = number of words in the corpus: billions!
    – Down-sampling dimensions is not straightforward
      • How many columns to select?
      • Which columns to select?
  • Solution: compression or dimensionality reduction techniques

SLIDE 24

SVD for Dimensionality Reduction

  • SVD = Singular Value Decomposition
  • For an input matrix X, factorize X = U S Vᵀ
    – U = left-singular vectors of X, and V = right-singular vectors of X
    – S is a diagonal matrix
  • The diagonal values of S are called singular values
  • Keeping the first r columns of U gives an r-dimensional vector for every row of X
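A minimal numpy sketch of truncated SVD on a toy co-occurrence matrix (the counts are illustrative assumptions):

```python
import numpy as np

# Toy co-occurrence matrix X; rows could be dog, cat, computer.
X = np.array([[10., 2., 0.],
              [ 5., 1., 0.],
              [ 0., 1., 9.]])

U, S, Vt = np.linalg.svd(X, full_matrices=False)
r = 2
word_vectors = U[:, :r] * S[:r]   # r-dimensional vector per row of X
print(word_vectors.shape)          # (3, 2)
```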
SLIDE 25

Word Visualization via Dimensionality Reduction

SLIDE 26

Issues with SVD

  • Computational cost of SVD on an N × M matrix is O(MN²), where N < M
  • Infeasible for large vocabularies or document collections; impractical for a real corpus
  • It is hard to incorporate out-of-sample or new words/documents
    – The entire row in the matrix would be 0
SLIDE 27

Word2Vec: Representing Word Meanings

Key idea: Predict the surrounding words of every word

Benefits:
  • Faster
  • Easier to incorporate new words and documents

Main paper: Distributed Representations of Words and Phrases and their Compositionality. Mikolov et al., NeurIPS, 2013.

SLIDE 28

Two Styles of Learning Word2Vec

  • Continuous Bag of Words (CBOW): uses the context words in a window to predict the middle word
  • Skip-gram: uses the middle word to predict the context words in a window

SLIDE 29

Neural Network Basics: Neuron

  • The basic building block of neural networks
  • Input is a vector: x = [x1, …, xm]
  • Weights and bias:
    – The neuron has weights w = [w1, w2, …, wm]
    – Bias term = b (or w0)
  • Activation function:
    – Transforms the aggregate
    – E.g., sigmoid, ReLU
  • Output computation: y = f(w · x + b), where f is the activation function
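A minimal sketch of a single neuron; the sigmoid activation and the weight values are illustrative assumptions:

```python
import numpy as np

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x: np.ndarray, w: np.ndarray, b: float) -> float:
    """One neuron: weighted sum of inputs plus bias, then activation."""
    return sigmoid(np.dot(w, x) + b)

x = np.array([1.0, 0.5, -0.2])   # input vector
w = np.array([0.4, -0.1, 0.7])   # hypothetical weights
print(neuron(x, w, b=0.1))
```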
SLIDE 30

Neural Network Basics: Fully Connected Layer

  • A layer whose neurons are connected to all the neurons in the previous layer
    – Each neuron takes as input all the outputs from the previous layer
  • Multiple layers can be stacked together
  • Example: 3 fully connected layers
SLIDE 31

Neural Network Basics: More About Layers

  • Input layer: input vectors are given as inputs here
  • Hidden layer: intermediate representation of the inputs
    – Multiple hidden layers can be stacked together
  • Output layer: final output
    – Can have one or more neurons in the output layer
  • Note that information flows in one direction
SLIDE 32

CBOW: Continuous Bag of Words

Example: “the cat sat on floor” (window size 2)

Input: context words (the, cat, on, floor)
Output: middle word (sat)

SLIDE 33

The Architecture

Architecture: input layer, hidden layer, and output layer
  • Fully connected layers

Input: one-hot vectors of the context words
Desired output: one-hot vector of the middle word

SLIDE 34

The Architecture

Input size: ℝ^|V|
Hidden layer size: ℝ^N
Output size: ℝ^|V|

Input-to-hidden layer weight matrix W, of size |V| × N
  • All inputs share the W matrix

Hidden-to-output layer weight matrix W′, of size N × |V|
  • All weight matrices are shared across all examples
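A minimal numpy sketch of this architecture's forward pass, under toy assumptions (tiny sizes, random initialization; the softmax at the output is defined a few slides ahead):

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 10, 4                                 # vocabulary size, hidden size
W = rng.normal(scale=0.1, size=(V, N))       # input-to-hidden, |V| x N
W_out = rng.normal(scale=0.1, size=(N, V))   # hidden-to-output, N x |V|

def one_hot(i, dim=V):
    v = np.zeros(dim); v[i] = 1.0
    return v

def cbow_forward(context_ids):
    """CBOW: average the context-word projections (one multiplication
    per input), then map the hidden vector to vocabulary scores."""
    h = np.mean([one_hot(i) @ W for i in context_ids], axis=0)
    z = h @ W_out
    return np.exp(z) / np.exp(z).sum()       # softmax over the vocabulary

probs = cbow_forward([0, 1, 3, 4])           # e.g., the, cat, on, floor
print(probs.argmax())                        # predicted middle-word index
```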

SLIDE 35

Parameters To Be Learned

  • Size of the input and output word vectors = |V|
  • All weights are to be learned during the training process

SLIDE 36

Input to Hidden Layer

  • Matrix multiplication generates the hidden vector
    – Multiplication of the input one-hot vector with the input-to-hidden layer matrix
  • One multiplication per input

SLIDE 37

Input to Hidden Layer

Multiplication for ‘cat’

SLIDE 38

Input to Hidden Layer

Multiplication for ‘on’

SLIDE 39

Hidden Layer

  • Aggregation is done at the hidden layer
    – Example: simple averaging

SLIDE 40

Hidden to Output Layer

  • The hidden vector is converted to the output vector
    – Multiplication of the hidden vector with the hidden-to-output matrix W′
SLIDE 41

Output Layer

  • The final output is generated as a softmax of the output vector
    – Softmax is for normalization
  • Softmax function: softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)
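A numerically stable sketch of the softmax (subtracting max(z) avoids overflow without changing the result):

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Normalize raw scores into a probability distribution."""
    e = np.exp(z - z.max())
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # entries sum to 1
```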
SLIDE 42

Output Layer Optimization

  • The loss is calculated as the difference between the generated vector ẑ and the desired one-hot vector
    – Loss = ‖ẑ − z_sat‖, where z_sat is the one-hot vector for ‘sat’
  • Optimization is done to get close to the one-hot encoding
SLIDE 43

Generate Final Embeddings

  • After optimization, the learned input-to-hidden layer weight matrix is used to generate the word2vec embeddings

[Figure: multiplication for ‘on’]

SLIDE 44

Model Learning Method

  • 1. Generate several training samples
    – N pairs of <word, context words>
  • 2. Initialize the neural network architecture (all the weight matrices)
    – One way is random initialization
  • 3. Iterate till convergence:
    a) Calculate the prediction for each training sample
    b) Calculate the loss for each training sample and aggregate
    c) Backpropagate the loss to update the weights
  • 4. Generate the final word2vec embeddings (a sketch of this loop follows below)
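A compact numpy sketch of steps 1–4 for CBOW, under toy assumptions (tiny vocabulary, a single training sample, and cross-entropy loss standing in for the vector distance on the previous slide):

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, lr = 10, 4, 0.1
W = rng.normal(scale=0.1, size=(V, N))       # input-to-hidden
W_out = rng.normal(scale=0.1, size=(N, V))   # hidden-to-output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# 1. training samples: (context word ids, middle word id) -- toy data
samples = [([0, 1, 3, 4], 2)]

# 2. weights were randomly initialized above; 3. iterate till convergence
for epoch in range(100):
    for context, target in samples:
        h = W[context].mean(axis=0)          # a) forward: hidden vector
        y = softmax(h @ W_out)               #    prediction
        e = y.copy(); e[target] -= 1.0       # b) gradient of the loss
        W_out -= lr * np.outer(h, e)         # c) backpropagate
        W[context] -= lr * (W_out @ e) / len(context)

# 4. rows of W are the learned word2vec embeddings
embedding = W[2]
```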
SLIDE 45

Recap: Word Embeddings

  • Representing words and phrases

– Neural network basics
– Word2vec
– Continuous bag of words
– Skip-gram model
– GloVe embeddings

Some slides in this lecture are inspired by the slides of Prof. Leonid Sigal, UBC

SLIDE 46

Skip-Gram Model
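The architecture diagram on this slide is an image; as a stand-in, here is a minimal numpy sketch of the skip-gram forward pass (toy sizes and random initialization are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 10, 4
W = rng.normal(scale=0.1, size=(V, N))       # center-word vectors
W_out = rng.normal(scale=0.1, size=(N, V))   # context-word scores

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def skipgram_forward(center_id):
    """Skip-gram: the middle word predicts each surrounding word.
    One distribution over V, reused for every context position."""
    h = W[center_id]             # embedding of the middle word
    return softmax(h @ W_out)    # p(context word | middle word)

p = skipgram_forward(2)          # e.g., 'sat' predicts the, cat, on, floor
```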

SLIDE 47

Model Learning Method

  • Input: training corpus (w1, w2, …, wN) from vocabulary V
  • Goal: maximize the average log probability of context words given each word (first equation below)
  • Context probability p(c|w): a softmax over the vocabulary (second equation below)
  • Issue: calculating the score in the denominator and backpropagating the loss is expensive
    – The vocabulary size |V| is huge, so ∇ log p(c|w) takes O(|V|) time to compute
  • Solution: sample rather than summing over the whole vocabulary (negative sampling, next slide)
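The two equations referenced above were images on the original slide; these are the standard skip-gram formulas from Mikolov et al. (2013), with v_w a center-word vector, u_c a context-word vector, and c the window size:

```latex
\max \;\; \frac{1}{N} \sum_{t=1}^{N} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t),
\qquad
p(c \mid w) = \frac{\exp(u_c^\top v_w)}{\sum_{c' \in V} \exp(u_{c'}^\top v_w)}
```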
SLIDE 48

Solution: Negative Sampling

  • Key idea: do not take all pairs in the denominator
  • Method: create training pairs (w, c̃), where the c̃ are negative samples
  • Given a pair (w, c), can we determine whether this came from our corpus or not?
    – Maximize p(D=1 | w, c) for pairs (w, c) that occur in the data
    – Also maximize p(D=0 | w, cN) for pairs (w, cN) where cN is sampled randomly from the empirical unigram distribution

SLIDE 49

Skip-Gram Negative Sampling Objective

  • Model probabilistically as p(D=1 | w, c) = σ(u_c · v_w)
  • For each word pair, the loss is given below
  • k: the number of negative samples to take (a hyperparameter)
  • E is an expectation taken with respect to the negative-sample distribution
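The per-pair loss, reconstructed from the standard skip-gram negative sampling objective of Mikolov et al. (2013), with P_n(w) the noise (empirical unigram) distribution:

```latex
\ell(w, c) = -\log \sigma(u_c^\top v_w)
\;-\; \sum_{i=1}^{k} \mathbb{E}_{\tilde c_i \sim P_n(w)} \left[ \log \sigma(-u_{\tilde c_i}^\top v_w) \right]
```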
SLIDE 50

Comparing CBOW and Skip-Gram

  • CBOW is not great for rare words, but typically needs less data to train
  • Skip-gram is better for rare words, but needs more data to train the model

SLIDE 51

Insight: Words Have Regularity

Semantic Regularity

  • woman is to sister as man is to ______
  • summer is to rain as winter is to ______
  • man is to king as woman is to ______

Syntactic Regularity

  • fell is to fallen as ate is to ______
  • running is to ran as crying is to ______
SLIDE 52

Insight: Words Have Regularity

Semantic Regularity

  • woman is to sister as man is to brother
  • summer is to rain as winter is to snow
  • man is to king as woman is to queen

Syntactic Regularity

  • fell is to fallen as ate is to eaten
  • running is to ran as crying is to cried
SLIDE 53

Insight: Words Have Regularity

Words have regularity:

  • Semantic Regularity: woman is to sister as man is to brother
  • Syntactic Regularity: running is to ran as crying is to cried

Insight: The differences between each pair of words are similar. Can our word representations capture analogies?

YES!

SLIDE 54

Word Embeddings Capture Analogies

  • To solve analogy problems of the form “wa is to wb as wc is to what?”, we can compute a query vector q = wb − wa + wc
  • Find the word vector wx most similar to q
    – Similarity is computed by cosine similarity
    – Recall: cosine similarity normalizes the vector length and focuses on the vector direction
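A small sketch of this query, using the toy 2-D vectors from the next slide:

```python
import numpy as np

# Toy 2-D embeddings, values as shown on the next slide.
E = {"man": np.array([0.20, 0.20]),
     "woman": np.array([0.60, 0.30]),
     "king": np.array([0.30, 0.70]),
     "queen": np.array([0.70, 0.80])}

def analogy(a, b, c):
    """a : b :: c : ?  ->  query q = E[b] - E[a] + E[c];
    answer is the word (excluding a, b, c) closest to q by cosine."""
    q = E[b] - E[a] + E[c]
    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = {w: v for w, v in E.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cos(candidates[w], q))

print(analogy("man", "woman", "king"))   # queen
```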

SLIDE 55

Word Embeddings Capture Analogies

Test for linear relationships: man : woman :: king : ?

  king  [ 0.30, 0.70 ]
− man   [ 0.20, 0.20 ]
+ woman [ 0.60, 0.30 ]
= [ 0.70, 0.80 ] ≈ queen [ 0.70, 0.80 ]

SLIDE 56

Analogies: Example

SLIDE 57

Recap: Word Embeddings

  • Representing words and phrases

– Neural network basics
– Word2vec
– Continuous bag of words
– Skip-gram model
– GloVe embeddings

Some slides in this lecture are inspired by the slides of Prof. Leonid Sigal, UBC

SLIDE 58

GloVe: Global Vectors for Word Representation

How can we capture complex word relations?

  • For example: which words are related to ice, but not to steam?
  • Method: create a word co-occurrence matrix
  • Calculate the probability ratio of “trigrams” (word triplets)

Credit: https://medium.com/@jonathan_hui/nlp-word-embedding-glove-5e7f523999f6

SLIDE 59

Trigram Probability Ratio: Interpretation

What is the interpretation of the probability ratio Pik / Pjk?

  • Very high ratio: Pik >> Pjk ⇒ word k is closer to i than to j
  • Very low ratio: Pik << Pjk ⇒ word k is closer to j than to i
  • Ratio close to 1: Pik ≈ Pjk ⇒ k is equally similar (or dissimilar) to both i and j

SLIDE 60

Trigram Probability Ratio: Example

Trigram probability ratios are informative:

  • solid is related to ice, but not to steam
  • gas is related to steam, but not to ice
  • water is similar to both ice and steam
  • fashion is dissimilar to both ice and steam
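For concreteness, a sketch of the ice/steam ratios behind these bullets. The probabilities are cited from memory of Table 1 in the GloVe paper (Pennington et al., 2014), so treat the exact values as approximate:

```python
# Co-occurrence probabilities P(k | ice) and P(k | steam); approximate
# values recalled from the GloVe paper's Table 1, not authoritative.
P_ice   = {"solid": 1.9e-4, "gas": 6.6e-5, "water": 3.0e-3, "fashion": 1.7e-5}
P_steam = {"solid": 2.2e-5, "gas": 7.8e-4, "water": 2.2e-3, "fashion": 1.8e-5}

for k in P_ice:
    print(k, round(P_ice[k] / P_steam[k], 2))
# solid 8.64, gas 0.08, water 1.36, fashion 0.94 -> matches the bullets above
```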
SLIDE 61

GloVe Formulation: Mathematics

  • The vectors wi, wj, and wk should capture the information in the probability ratio
  • Solution: train the word vectors so that dot products match log co-occurrence probabilities (below)
  • Then the information in the co-occurrence probability ratio is expressed in the vector space
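The elided relations, as in Pennington et al. (2014), with w̃_k a context-word vector:

```latex
w_i^\top \tilde w_k = \log P_{ik}
\quad\Longrightarrow\quad
(w_i - w_j)^\top \tilde w_k = \log P_{ik} - \log P_{jk} = \log \frac{P_{ik}}{P_{jk}}
```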

SLIDE 62

GloVe Training Objective

  • Input: two sets of vectors: “target” vectors w and “context” vectors w̃
  • Training objective is to ensure wiᵀ w̃j + bi + b̃j ≈ log Xij
    – Xij = N(wi, wj), the number of times wj appears in the context of wi
    – bi and b̃j are bias terms
  • The training objective is to minimize the squared (L2) loss between the two sides (below)
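The loss, as in Pennington et al. (2014), with the weighting f(Xij) defined on the next slide:

```latex
J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^\top \tilde w_j + b_i + \tilde b_j - \log X_{ij} \right)^2
```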
SLIDE 63

GloVe Training Objective

  • Problem: all co-occurrences are weighted equally
  • Solution: insert an importance score f(Xij) for word pairs
    – f(Xij) serves as a “dampener”, lessening the weight of rare co-occurrences
    – f(x) = (x / Xmax)^α for x < Xmax, and 1 otherwise, where (default) α = 3/4, Xmax = 100
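A one-function sketch of this weighting, using the default α and Xmax from the slide:

```python
def glove_weight(x: float, x_max: float = 100.0, alpha: float = 0.75) -> float:
    """GloVe co-occurrence weighting: down-weights rare pairs and
    caps the influence of very frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

print(glove_weight(1))    # ~0.032: rare pair, tiny weight
print(glove_weight(50))   # ~0.59
print(glove_weight(500))  # 1.0: capped
```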

SLIDE 64

Comparing GloVe With Other Models

  • Dataset: 19,544 analogy questions from the word-analogy benchmark of Mikolov et al.; NER results use the CoNLL-2003 shared-task dataset
  • Training datasets:
    – Wiki2010: 1B tokens
    – Wiki2014: 1.6B tokens
    – Gigaword5: 4.3B tokens
    – Gigaword5 + Wiki2014: 6B tokens

SLIDE 65

GloVe Accuracy vs Training Corpus Choice

SLIDE 66

Word2vec (Skip-Gram) vs GloVe: Accuracy

  • Trained 300-dimensional vectors on a 6B-token corpus (400K vocabulary)
SLIDE 67

GloVe Accuracy vs Vector Size

  • Semantic analogies: woman is to sister as man is to brother
  • Syntactic analogies: running is to ran as crying is to cried
  • Training window size = 10
SLIDE 68

GloVe Accuracy vs Window Size

  • Vector size = 100
SLIDE 69

Wrapping Up

  • Representing words and phrases

– Neural network basics
– Word2vec
– Continuous bag of words
– Skip-gram model
– GloVe embeddings