Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining
CSE 6240: Web Search and Text Mining. Spring 2020
- Prof. Srijan Kumar
Word Embeddings
Prof. Srijan Kumar with Arindum Roy and Roshan Pati
Administrivia
– Homework: Will be released
– Use as needed
– OK to talk and to discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and NOT as a group activity.
– Please list the students you collaborated with.
– Follow the GT academic honesty rules
– Neural network basics
– Word2vec
– Continuous bag of words
– Skip-gram model
dog cat person holding tree computer using
dog       1
cat       2
person    3
holding   4
tree      5
computer  6
using     7
dog       1  [ 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
cat       2  [ 0, 1, 0, 0, 0, 0, 0, 0, 0, 0 ]
person    3  [ 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 ]
holding   4  [ 0, 0, 0, 1, 0, 0, 0, 0, 0, 0 ]
tree      5  [ 0, 0, 0, 0, 1, 0, 0, 0, 0, 0 ]
computer  6  [ 0, 0, 0, 0, 0, 1, 0, 0, 0, 0 ]
using     7  [ 0, 0, 0, 0, 0, 0, 1, 0, 0, 0 ]
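The word-to-index-to-vector mapping can be sketched in a few lines of Python. This is only an illustration: it assumes the seven-word vocabulary shown here and a vector length of 10 (matching the vectors on the slide, which leave three positions unused). The names `VOCAB`, `WORD_TO_INDEX`, and `one_hot` are hypothetical.

```python
# Vocabulary from the slide; indices are 1-based, as on the slide.
VOCAB = ["dog", "cat", "person", "holding", "tree", "computer", "using"]
WORD_TO_INDEX = {word: i + 1 for i, word in enumerate(VOCAB)}
VECTOR_SIZE = 10  # assumption: vectors are padded to length 10, as on the slide


def one_hot(word, size=VECTOR_SIZE):
    """Return a list that is all zeros except for a 1 at the word's index."""
    vec = [0] * size
    vec[WORD_TO_INDEX[word] - 1] = 1  # convert the 1-based index to a list position
    return vec


print(one_hot("dog"))    # [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(one_hot("using"))  # [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
```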
– The order of words is irrelevant
– The document “John is quicker than Mary” is indistinguishable from the doc “Mary is quicker than John”
bag of words representation
Dog Cat Person Holding Tree Computer Using
person holding dog                        {3, 4, 1}           [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]
person holding cat                        {3, 4, 2}           [ 0, 1, 1, 1, 0, 0, 0, 0, 0, 0 ]
person using computer                     {3, 7, 6}           [ 0, 0, 1, 0, 0, 1, 1, 0, 0, 0 ]
person using computer person holding cat  {3, 7, 6, 3, 4, 2}  [ 0, 1, 2, 1, 0, 1, 1, 0, 0, 0 ]
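A bag-of-words vector is the sum of the one-hot vectors of the words in a phrase, so each position holds a word count. The sketch below assumes the same seven-word vocabulary and length-10 vectors as the slide; `bag_of_words` is a hypothetical helper, and skipping out-of-vocabulary words is a choice made only for this example.

```python
from collections import Counter

VOCAB = ["dog", "cat", "person", "holding", "tree", "computer", "using"]
WORD_TO_INDEX = {word: i + 1 for i, word in enumerate(VOCAB)}  # 1-based, as on the slide
VECTOR_SIZE = 10  # assumption: same padded length as the slide's vectors


def bag_of_words(text, size=VECTOR_SIZE):
    """Count how often each vocabulary word occurs; word order is discarded."""
    vec = [0] * size
    for word, count in Counter(text.lower().split()).items():
        if word in WORD_TO_INDEX:  # assumption: unknown words are ignored
            vec[WORD_TO_INDEX[word] - 1] = count
    return vec


print(bag_of_words("person holding dog"))     # [1, 0, 1, 1, 0, 0, 0, 0, 0, 0]
print(bag_of_words("person using computer"))  # [0, 0, 1, 0, 0, 1, 1, 0, 0, 0]
# Reordering the words does not change the vector.
print(bag_of_words("dog holding person") == bag_of_words("person holding dog"))  # True
```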
– Term-document matrix: rows are terms, columns are documents, and each cell holds the number of times a term appears in a document
– Word-word co-occurrence matrix: rows and columns are words
– Cell (R,C) means “how many times does word C appear in the neighborhood of word R”
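As a rough illustration of the word-by-word counts, the sketch below treats the "neighborhood" of a word as a symmetric window of `window` tokens on each side within a sentence. That definition, the toy corpus, and the function name `cooccurrence_counts` are all assumptions made for this example.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=1):
    """Map (row_word, col_word) to how often col_word appears within
    `window` tokens of row_word."""
    counts = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, row_word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:  # a word is not its own neighbor
                    counts[(row_word, tokens[j])] += 1
    return counts


corpus = ["person holding dog", "person holding cat", "person using computer"]
counts = cooccurrence_counts(corpus, window=1)
print(counts[("person", "holding")])  # 2
print(counts[("holding", "cat")])     # 1
print(counts[("person", "dog")])      # 0 (two tokens apart, outside the window)
```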
– The resulting vectors are very high dimensional
– Dimension size = number of distinct words in the corpus (the vocabulary size)
– Down-sampling dimensions is not straightforward
– The columns of U are the left-singular vectors of X, and the columns of V are the right-singular vectors of X
– S is a diagonal matrix of singular values
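The decomposition can be tried out with NumPy. This is a sketch under assumptions: the matrix `X` holds made-up co-occurrence counts, and keeping the top `k` columns of U scaled by the singular values is one common way to obtain low-dimensional word vectors, not necessarily the exact recipe used in the lecture.

```python
import numpy as np

# Toy word-word co-occurrence matrix (hypothetical counts; one row per word).
X = np.array([
    [0, 2, 1, 0],
    [2, 0, 0, 1],
    [1, 0, 0, 3],
    [0, 1, 3, 0],
], dtype=float)

# numpy returns the singular values as a 1-D array, sorted in descending order,
# and the right-singular vectors already transposed (Vt).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
S = np.diag(s)

# The three factors reproduce X up to floating-point error.
assert np.allclose(U @ S @ Vt, X)

# Keep only the k largest singular values to get k-dimensional word vectors.
k = 2
word_vectors = U[:, :k] * s[:k]
print(word_vectors.shape)  # (4, 2)
```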
– Neuron has weights w = [w1, w2, …, wm]
– Bias term = b (or w0)
– The activation function transforms the aggregate, e.g., sigmoid, ReLU
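A single neuron can be written out directly: aggregate the inputs as a weighted sum plus the bias, then pass the result through an activation function. The function names and the input values below are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    return max(0.0, z)


def neuron(x, w, b, activation=sigmoid):
    """Weighted sum of the inputs plus bias, followed by the activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # the aggregate
    return activation(z)


x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.0]
print(neuron(x, w, b=0.0))                   # aggregate is 0.0, sigmoid gives 0.5
print(neuron(x, w, b=1.5, activation=relu))  # aggregate is 1.5, ReLU gives 1.5
```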
– Each neuron takes as input all the
– Multiple hidden layers can be stacked together
– Can have one or more neurons in the output layer
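Stacking layers amounts to repeating "multiply by a weight matrix, add a bias, apply an activation". The forward pass below is a minimal sketch: the layer sizes, the random weights, the use of ReLU in the hidden layers, and the absence of an activation on the output layer are all assumptions made for illustration.

```python
import numpy as np


def relu(z):
    return np.maximum(0.0, z)


def forward(x, layers):
    """Run x through a list of (W, b) layers; ReLU on hidden layers only."""
    h = x
    for i, (W, b) in enumerate(layers):
        z = W @ h + b
        is_output_layer = i == len(layers) - 1
        h = z if is_output_layer else relu(z)
    return h


rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]  # input size, two hidden layers, two output neurons
layers = [
    (rng.normal(size=(n_out, n_in)), np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

y = forward(np.ones(4), layers)
print(y.shape)  # (2,)
```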
– Multiplication of the input vector with the input-to-hidden layer matrix
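Assuming the input is a one-hot vector like the ones shown earlier, multiplying it by the input-to-hidden matrix simply selects one row of that matrix, and that row is the word's embedding. The sizes and the name `W_in` below are illustrative assumptions.

```python
import numpy as np

vocab_size, hidden_size = 7, 3
rng = np.random.default_rng(0)
W_in = rng.normal(size=(vocab_size, hidden_size))  # input-to-hidden matrix

# One-hot input for the word at (0-based) position 2.
x = np.zeros(vocab_size)
x[2] = 1.0

h = x @ W_in  # hidden vector

# The product equals row 2 of W_in, so in practice a row lookup replaces
# the full matrix multiplication.
assert np.allclose(h, W_in[2])
print(h.shape)  # (3,)
```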
– Example: simple averaging
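A small sketch of simple averaging: assuming each context word has a row in an embedding matrix, the hidden vector is the element-wise mean of the context words' rows. The matrix values and word indices are made up for this example.

```python
import numpy as np

# Hypothetical embedding matrix: one 2-dimensional row per vocabulary word.
W_in = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [2.0, 3.0],
    [3.0, 0.0],
])

context_ids = [0, 2, 3]  # indices of the context words (illustrative)

# Average the context words' rows to get a single hidden vector.
h = W_in[context_ids].mean(axis=0)
print(h)  # [2. 1.]
```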
– Multiplication of hidden vector with the hidden-to-
– Softmax is for normalization
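Softmax turns a vector of raw scores into non-negative values that sum to one, so they can be read as a probability distribution over the vocabulary. The version below subtracts the maximum score before exponentiating, a standard trick to avoid overflow; the input scores are arbitrary.

```python
import math


def softmax(scores):
    """Normalize raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtracting the max leaves the result unchanged
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs)       # the largest score gets the largest probability
print(sum(probs))  # 1 up to floating-point rounding
```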
– ||z – ẑ_sat||
– N pairs of <word, context words>
– One way is random initialization
a) Calculate the prediction for each training sample
b) Calculate the loss for each training sample and aggregate
c) Backpropagate the loss to update the weights
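The training steps can be sketched as a tiny continuous-bag-of-words loop in NumPy. Several things here are assumptions and not taken from the slides: the toy training pairs, the sizes and learning rate, averaging of context vectors, and the use of softmax cross-entropy as the loss (swapped in because its gradient is compact; the lecture may use a different loss).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size, lr = 6, 4, 0.5

# N training pairs of (context word ids, target word id); toy data.
pairs = [([0, 2], 1), ([1, 3], 2), ([2, 4], 3), ([3, 5], 4)]

# Random initialization of both weight matrices.
W_in = rng.normal(scale=0.1, size=(vocab_size, hidden_size))
W_out = rng.normal(scale=0.1, size=(hidden_size, vocab_size))


def train_epoch():
    """One pass over all pairs; updates the weights, returns the mean loss."""
    global W_in, W_out
    grad_in, grad_out = np.zeros_like(W_in), np.zeros_like(W_out)
    total_loss = 0.0
    for context, target in pairs:
        # a) prediction: average context vectors, score every word, softmax
        h = W_in[context].mean(axis=0)
        scores = h @ W_out
        p = np.exp(scores - scores.max())
        p /= p.sum()
        # b) loss: cross-entropy of the true target word, aggregated
        total_loss += -np.log(p[target])
        # c) backpropagation: gradient of the loss w.r.t. both matrices
        d_scores = p.copy()
        d_scores[target] -= 1.0
        grad_out += np.outer(h, d_scores)
        d_h = W_out @ d_scores
        for c in context:
            grad_in[c] += d_h / len(context)
    W_in = W_in - lr * grad_in / len(pairs)
    W_out = W_out - lr * grad_out / len(pairs)
    return total_loss / len(pairs)


losses = [train_epoch() for _ in range(300)]
print(losses[0], losses[-1])  # the mean loss should shrink over training
```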
– Neural network basics
– Word2vec
– Continuous bag of words
– Skip-gram model
– GloVe embeddings
– Vocabulary size |V| is huge, so ∇log p(c|w) takes O(|V|) time to compute
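To see where the O(|V|) cost comes from, it helps to write out the softmax and its gradient. The notation is an assumption made for this note: v_w is the vector of word w and u_c is the vector of context word c. The normalizing sum, and therefore the gradient, runs over every word in the vocabulary.

```latex
p(c \mid w) = \frac{\exp(u_c^\top v_w)}{\sum_{c' \in V} \exp(u_{c'}^\top v_w)}
\qquad\Longrightarrow\qquad
\nabla_{v_w} \log p(c \mid w) = u_c - \sum_{c' \in V} p(c' \mid w)\, u_{c'}
```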
– Maximize p(D=1 | w, c) for pairs (w, c) that occur in the data
– Also maximize p(D=0 | w, cN) for (w, cN) pairs where cN is sampled randomly from the empirical unigram distribution.
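The two objectives can be combined into one loss per observed pair. The sketch below assumes p(D=1 | w, c) is modeled as the sigmoid of the dot product of the word and context vectors, which is the usual word2vec formulation but is not stated on the slide. The word counts, sizes, and function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def negative_sampling_loss(w_vec, c_vec, neg_vecs):
    """-log p(D=1 | w, c) minus the sum over negatives of log p(D=0 | w, cN)."""
    positive = -np.log(sigmoid(w_vec @ c_vec))
    # p(D=0 | w, cN) = 1 - sigmoid(w . cN) = sigmoid(-w . cN)
    negative = -np.sum(np.log(sigmoid(-(neg_vecs @ w_vec))))
    return positive + negative


rng = np.random.default_rng(0)
vocab_size, dim, k = 8, 5, 3  # k negatives per observed pair (assumption)
word_vecs = rng.normal(scale=0.1, size=(vocab_size, dim))
ctx_vecs = rng.normal(scale=0.1, size=(vocab_size, dim))

# Empirical unigram distribution from toy word counts.
word_counts = np.array([50, 30, 10, 5, 2, 1, 1, 1], dtype=float)
unigram = word_counts / word_counts.sum()

w, c = 0, 1  # an observed (word, context) pair
neg_ids = rng.choice(vocab_size, size=k, p=unigram)  # sampled cN
loss = negative_sampling_loss(word_vecs[w], ctx_vecs[c], ctx_vecs[neg_ids])
print(loss)
```

Each evaluation touches only k + 1 context vectors, which is what avoids the O(|V|) cost of the full softmax.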
– Similarity computed by cosine similarity
vector direction
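Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction and not on magnitude. A minimal pure-Python sketch with illustrative inputs:

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([3.0, 4.0], [6.0, 8.0]))    # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # 0.0  (orthogonal)
print(cosine_similarity([3.0, 4.0], [-3.0, -4.0]))  # -1.0 (opposite direction)
```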
man : woman :: king : ?
? ≈ king − man + woman
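An analogy query of this kind can be answered with vector arithmetic followed by a nearest-neighbor search under cosine similarity. The 2-dimensional vectors below are hand-built for illustration (first coordinate "royalty", second "gender") and are not learned embeddings; `EMB` and `analogy` are hypothetical names.

```python
import numpy as np

# Hand-made toy embeddings, chosen so the arithmetic works out exactly.
EMB = {
    "man": np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}


def analogy(a, b, c, emb=EMB):
    """Answer 'a : b :: c : ?' with the word closest to c - a + b."""
    target = emb[c] - emb[a] + emb[b]
    best_word, best_sim = None, -np.inf
    for word, vec in emb.items():
        if word in (a, b, c):  # the query words themselves are excluded
            continue
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word


print(analogy("man", "woman", "king"))  # queen
```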
– Neural network basics
– Word2vec
– Continuous bag of words
– Skip-gram model
– GloVe embeddings
Credit: https://medium.com/@jonathan_hui/nlp-word-embedding-glove-5e7f523999f6
– Xij = N(wi, wj), the number of times wj appears in the context of wi
– bi and c̃k are bias terms
– f(Xij) serves as a “dampener”: lessening the weight of the rare co-occurrences
– where (default) α = 3/4, Xmax = 100
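The weighting function itself did not survive in the text above. The sketch below uses the form published in the GloVe paper, f(x) = (x / Xmax)^α for x < Xmax and 1 otherwise, together with the paper's weighted least-squares objective; it is assumed that the slides used the same definitions. The function and variable names, and the toy count matrix, are illustrative.

```python
import numpy as np


def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x) = (x / x_max) ** alpha below x_max, and 1 at or above it."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)


def glove_loss(X, W, W_ctx, b, b_ctx):
    """Sum over nonzero X_ij of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    i, j = np.nonzero(X)  # pairs that never co-occur are skipped (log 0 is undefined)
    pred = np.sum(W[i] * W_ctx[j], axis=1) + b[i] + b_ctx[j]
    err = pred - np.log(X[i, j])
    return np.sum(glove_weight(X[i, j]) * err ** 2)


# Rare pairs get a small weight; frequent pairs are capped at weight 1.
print(glove_weight([1, 10, 100, 1000]))

# Toy co-occurrence counts with randomly initialized parameters.
rng = np.random.default_rng(0)
X = np.array([[0.0, 4.0, 1.0], [4.0, 0.0, 150.0], [1.0, 150.0, 0.0]])
W, W_ctx = rng.normal(size=(3, 2)), rng.normal(size=(3, 2))
b, b_ctx = np.zeros(3), np.zeros(3)
print(glove_loss(X, W, W_ctx, b, b_ctx))
```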
– Wiki2010: 1B tokens
– Wiki2014: 1.6B tokens
– Gigaword5: 4.3B tokens
– Gigaword5 + Wiki2014: 6B tokens
woman is to sister as man is to brother
running is to ran as crying is to cried
– Neural network basics
– Word2vec
– Continuous bag of words
– Skip-gram model
– GloVe embeddings