Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining
CSE 6240: Web Search and Text Mining. Spring 2020
Language Models
- Prof. Srijan Kumar, with Roshan Pati and Arindum Roy
What are language models?
Word     Probability
the      0.2
a        0.1
man      0.01
woman    0.01
said     0.03
likes    0.02
Word     Probability (model 1)   Probability (model 2)
the      0.2                     0.1
a        0.1                     0.02
man      0.01                    0.1
woman    0.01                    0.1
said     0.03                    0.02
likes    0.02                    0.01
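To make the comparison concrete, the sketch below scores a word sequence under the two word-probability tables above by multiplying per-word probabilities, as a unigram model would. The example sentence and the names `MODEL_1`, `MODEL_2`, and `sequence_probability` are illustrative assumptions, not part of the slides.

```python
import math

# Word probabilities copied from the two tables above.
MODEL_1 = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}
MODEL_2 = {"the": 0.1, "a": 0.02, "man": 0.1, "woman": 0.1, "said": 0.02, "likes": 0.01}


def sequence_probability(words, model):
    """Unigram probability of a word sequence: the product of per-word probabilities.

    Words missing from the table get probability 0 (an assumption of this sketch).
    """
    return math.prod(model.get(w, 0.0) for w in words)


# Hypothetical example sentence, chosen only because every word is in both tables.
sentence = "the man likes the woman".split()
p1 = sequence_probability(sentence, MODEL_1)
p2 = sequence_probability(sentence, MODEL_2)
more_likely = "model 1" if p1 > p2 else "model 2"
```

For this sentence the second table assigns the higher probability, because it gives more weight to "man" and "woman".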
1-hot encoding
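As a reminder of what a 1-hot encoding looks like, here is a minimal sketch. The toy vocabulary and the helper name `one_hot` are assumptions made for illustration.

```python
def one_hot(word, vocabulary):
    """Return a list with a 1 at the word's vocabulary index and 0 elsewhere."""
    index = vocabulary.index(word)  # raises ValueError for out-of-vocabulary words
    return [1 if i == index else 0 for i in range(len(vocabulary))]


# Hypothetical toy vocabulary; a real one would hold every word type in the corpus.
vocabulary = ["the", "a", "man", "woman", "said", "likes"]
encoded = [one_hot(w, vocabulary) for w in ["the", "man", "said"]]
```

Each vector has the length of the vocabulary, so the representation grows with the number of word types.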
Learned matrix weights
Input at time 1, Input at time 2, Input at time 3, Initial hidden state, Final hidden state
Shared weight matrix
Output at time 1, Output at time 2, Output at time 3, Final output
Loss at time 1, Loss at time 2, Loss at time 3, Final loss
Total loss
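The figure labels (inputs per time step, a hidden state carried forward, a shared weight matrix, per-step outputs and losses, and a total loss) suggest an unrolled recurrent network. The sketch below assumes a plain recurrent cell with tanh activation, softmax outputs, and cross-entropy loss; the slides do not confirm these choices. The weights here are random rather than learned, and all names (`W_xh`, `W_hh`, `W_hy`, `rnn_forward`) are hypothetical.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random weights; training would replace these with learned values."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def rnn_forward(inputs, targets, W_xh, W_hh, W_hy, h0):
    """Run the unrolled network; return the final hidden state, outputs, and losses."""
    hidden, outputs, losses = h0, [], []
    for x, target in zip(inputs, targets):
        # The same three matrices are reused at every time step (shared weights).
        pre = [a + b for a, b in zip(matvec(W_xh, x), matvec(W_hh, hidden))]
        hidden = [math.tanh(p) for p in pre]
        y = softmax(matvec(W_hy, hidden))    # output at this time step
        outputs.append(y)
        losses.append(-math.log(y[target]))  # cross-entropy loss at this time step
    return hidden, outputs, losses


rng = random.Random(0)
vocab_size, hidden_size = 4, 3
W_xh = random_matrix(hidden_size, vocab_size, rng)
W_hh = random_matrix(hidden_size, hidden_size, rng)
W_hy = random_matrix(vocab_size, hidden_size, rng)

token_ids = [0, 2, 1, 3]  # hypothetical word indices
# 1-hot inputs for times 1..3; each step is asked to predict the next word.
inputs = [[1.0 if i == t else 0.0 for i in range(vocab_size)] for t in token_ids[:-1]]
targets = token_ids[1:]
h0 = [0.0] * hidden_size  # initial hidden state

final_hidden, outputs, losses = rnn_forward(inputs, targets, W_xh, W_hh, W_hy, h0)
total_loss = sum(losses)  # total loss = sum of the per-step losses
```

In training, the total loss would be differentiated with respect to the shared matrices; that step is left out of this sketch.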
$$O(R=1 \mid Q, D) = \frac{P(R=1 \mid Q, D)}{P(R=0 \mid Q, D)} = \frac{P(Q, D \mid R=1)}{P(Q, D \mid R=0)} \cdot \frac{P(R=1)}{P(R=0)}$$
The factor $P(R=1)/P(R=0)$ does not depend on D and is ignored for ranking D.
$$O(R=1 \mid Q, D) \propto \frac{P(Q, D \mid R=1)}{P(Q, D \mid R=0)} = \frac{P(Q \mid D, R=1)\, P(D \mid R=1)}{P(Q \mid D, R=0)\, P(D \mid R=0)} \propto P(Q \mid D, R=1)\, \frac{P(D \mid R=1)}{P(D \mid R=0)}$$
(Assume $P(Q \mid D, R=0) \approx P(Q \mid R=0)$)
Query likelihood: $P(Q \mid D, R=1)$. Document prior: $P(D \mid R=1) / P(D \mid R=0)$.
Assuming a uniform document prior:
$$O(R=1 \mid Q, D) \propto P(Q \mid D, R=1)$$
Query likelihood
Step 1: Given a document, generate a language model
Step 2: Compute query likelihood from the document language model
Doc LM: p(q|θ_d1), p(q|θ_d2), ..., p(q|θ_dN)
Document                 Language Model
Text mining paper        … text ?, mining ?, association ?, clustering ?, … food ? …
Food nutrition paper     … food ?, nutrition ?, healthy ?, diet ? …

Query = “data mining algorithms”
Which model would most likely have generated this query?
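The question above can be answered mechanically. The sketch below estimates a unigram model per document and then scores the query under each model. It assumes a maximum-likelihood estimate, p(w|d) = c(w, d) / |d|, and the two document texts are hypothetical stand-ins for the papers in the figure.

```python
from collections import Counter


def document_language_model(text):
    """Step 1: maximum-likelihood unigram model, p(w|d) = c(w, d) / |d|."""
    counts = Counter(text.lower().split())
    length = sum(counts.values())
    return {word: count / length for word, count in counts.items()}


def query_likelihood(query, model):
    """Step 2: p(q|d) as the product of p(w|d) over the query words."""
    likelihood = 1.0
    for word in query.lower().split():
        likelihood *= model.get(word, 0.0)  # unseen words get probability 0
    return likelihood


# Hypothetical stand-ins for the two papers in the figure.
documents = {
    "text mining paper": "text mining algorithms for data clustering and data mining",
    "food nutrition paper": "food nutrition and healthy diet data",
}
query = "data mining algorithms"
scores = {name: query_likelihood(query, document_language_model(text))
          for name, text in documents.items()}
best = max(scores, key=scores.get)
```

With these toy texts the text mining paper wins. The food nutrition paper scores exactly zero because "mining" and "algorithms" never occur in it, which is the weakness of an unsmoothed maximum-likelihood model.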
$$p(q = (x_1, \ldots, x_{|V|}) \mid d) = \prod_{i=1}^{|V|} p(w_i = x_i \mid d) = \prod_{i=1,\, x_i=1}^{|V|} p(w_i = 1 \mid d) \prod_{i=1,\, x_i=0}^{|V|} p(w_i = 0 \mid d)$$
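The formula above treats the query as a vector of word presence indicators over the vocabulary. A minimal sketch follows; the presence probabilities and the function name `multi_bernoulli_likelihood` are hypothetical.

```python
def multi_bernoulli_likelihood(query_words, presence_prob):
    """p(q|d) when each vocabulary word is independently present or absent.

    presence_prob maps every vocabulary word w_i to p(w_i = 1 | d).
    """
    present = set(query_words)
    likelihood = 1.0
    for word, p_present in presence_prob.items():
        # x_i = 1 contributes p(w_i = 1 | d); x_i = 0 contributes p(w_i = 0 | d).
        likelihood *= p_present if word in present else 1.0 - p_present
    return likelihood


# Hypothetical presence probabilities for a four-word vocabulary.
presence_prob = {"data": 0.8, "mining": 0.5, "food": 0.1, "diet": 0.2}
p = multi_bernoulli_likelihood(["data", "mining"], presence_prob)
```

Note that words absent from the query still affect the score, and repeated query words count only once.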
$$p(q = q_1 \ldots q_m \mid d) = \prod_{j=1}^{m} p(q_j \mid d) = \prod_{i=1}^{|V|} p(w_i \mid d)^{c(w_i, q)}$$
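The two products above are the same quantity written two ways: once over query positions and once over vocabulary words raised to their query counts. The sketch below checks this on a hypothetical four-word document model; the function names are illustrative.

```python
import math
from collections import Counter


def likelihood_by_position(query_words, model):
    """Product over query positions j = 1..m of p(q_j | d)."""
    return math.prod(model[w] for w in query_words)


def likelihood_by_count(query_words, model):
    """Product over vocabulary words w_i of p(w_i | d) ** c(w_i, q)."""
    counts = Counter(query_words)  # c(w, q); words absent from the query count 0
    return math.prod(p ** counts[w] for w, p in model.items())


# Hypothetical document model over a four-word vocabulary (probabilities sum to 1).
model = {"data": 0.4, "mining": 0.3, "text": 0.2, "food": 0.1}
query = ["data", "mining", "data"]
by_position = likelihood_by_position(query, model)
by_count = likelihood_by_count(query, model)
```

Unlike the presence-based model, a repeated query word lowers the likelihood again each time it occurs, and words outside the query contribute a factor of 1.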
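$$p(w \mid d) = \begin{cases} p_{Seen}(w \mid d) & \text{if } w \text{ is seen in } d \\ \alpha_d\, p(w \mid C) & \text{otherwise} \end{cases}$$
$p_{Seen}(w \mid d)$: discounted ML estimate. $p(w \mid C)$: collection language model.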
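One way to realize a discounted estimate backed by a collection language model is Jelinek-Mercer (linear interpolation) smoothing. That choice, the weight `lam`, and the toy collection below are assumptions of this sketch; the slides do not name a specific method here.

```python
from collections import Counter


def collection_model(documents):
    """p(w|C): relative frequency of each word across all documents."""
    counts = Counter(word for doc in documents for word in doc)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def smoothed_probability(word, doc, p_collection, lam=0.5):
    """p(w|d) with Jelinek-Mercer smoothing (an assumed choice; lam is hypothetical).

    Seen words get a discounted ML estimate plus a share of the collection model;
    unseen words get alpha_d * p(w|C), where alpha_d = lam.
    """
    count = doc.count(word)
    if count > 0:
        return (1 - lam) * count / len(doc) + lam * p_collection[word]
    return lam * p_collection[word]


# Hypothetical two-document collection.
docs = [["text", "mining", "data", "mining"], ["food", "diet", "data"]]
p_c = collection_model(docs)
p_seen = smoothed_probability("mining", docs[0], p_c)
p_unseen = smoothed_probability("food", docs[0], p_c)
total = sum(smoothed_probability(w, docs[0], p_c) for w in p_c)
```

Words that never occur in the document now receive a small nonzero probability, and the smoothed probabilities still sum to 1 over the collection vocabulary.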
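$$
\begin{aligned}
\log p(q \mid d) &= \sum_{w \in V} c(w, q) \log p(w \mid d) \\
&= \sum_{w \in V,\, c(w,d) > 0} c(w, q) \log p_{Seen}(w \mid d) + \sum_{w \in V,\, c(w,d) = 0} c(w, q) \log \left[\alpha_d\, p(w \mid C)\right] \\
&= \sum_{w \in V,\, c(w,d) > 0} c(w, q) \log p_{Seen}(w \mid d) + \sum_{w \in V} c(w, q) \log \left[\alpha_d\, p(w \mid C)\right] - \sum_{w \in V,\, c(w,d) > 0} c(w, q) \log \left[\alpha_d\, p(w \mid C)\right] \\
&= \sum_{w \in V,\, c(w,d) > 0} c(w, q) \log \frac{p_{Seen}(w \mid d)}{\alpha_d\, p(w \mid C)} + |q| \log \alpha_d + \sum_{w \in V} c(w, q) \log p(w \mid C)
\end{aligned}
$$
In the second line, the first sum covers query words matched in d and the second covers query words not matched in d. In the third line, the middle sum covers all query words and the subtracted sum covers query words matched in d.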
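$$\log p(q \mid d) = \sum_{w_i \in d,\, w_i \in q} c(w_i, q) \left[\log \frac{p_{Seen}(w_i \mid d)}{\alpha_d\, p(w_i \mid C)}\right] + n \log \alpha_d + \sum_{i=1}^{n} \log p(w_i \mid C)$$
Here $n = |q|$ is the query length, and the last sum runs over the n query word occurrences.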
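The decomposition above can be checked numerically: scoring only the matched query words, then adding the length term and the collection term, should equal the direct sum of log p(w|d) over the query. The sketch assumes Jelinek-Mercer smoothing with alpha_d = `lam`, which is one possible instance and not stated in the slides. The collection model, document, and query are hypothetical.

```python
import math
from collections import Counter


def direct_log_likelihood(query, doc, p_c, lam):
    """log p(q|d) = sum over query words of c(w, q) * log p(w|d)."""
    d_counts, total = Counter(doc), 0.0
    for word, c_wq in Counter(query).items():
        p = lam * p_c[word]
        if d_counts[word] > 0:
            p += (1 - lam) * d_counts[word] / len(doc)
        total += c_wq * math.log(p)
    return total


def rewritten_log_likelihood(query, doc, p_c, lam):
    """Same value via matched-word sum + n log alpha_d + sum of log p(w_i|C)."""
    d_counts, alpha_d = Counter(doc), lam
    matched = 0.0
    for word, c_wq in Counter(query).items():
        if d_counts[word] > 0:  # only query words that occur in d
            p_seen = (1 - lam) * d_counts[word] / len(doc) + lam * p_c[word]
            matched += c_wq * math.log(p_seen / (alpha_d * p_c[word]))
    length_term = len(query) * math.log(alpha_d)
    collection_term = sum(math.log(p_c[w]) for w in query)  # same for every document
    return matched + length_term + collection_term


# Hypothetical collection model, document, and query.
p_c = {"data": 0.3, "mining": 0.3, "text": 0.2, "food": 0.2}
doc = ["text", "mining", "data", "mining"]
query = ["data", "mining", "food"]
direct = direct_log_likelihood(query, doc, p_c, lam=0.5)
rewritten = rewritten_log_likelihood(query, doc, p_c, lam=0.5)
```

The practical benefit of the rewritten form is that the loop touches only query words that occur in the document, and the collection term can be dropped when ranking because it is identical for all documents.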