NLP!!!
April 7, 2020 Data Science CSCI 1951A Brown University Instructor: Ellie Pavlick HTAs: Josh Levin, Diane Mutako, Sol Zitter
Announcements: S/NC Option, Special Topics, Questions/Concerns?
Today 1990s
things
articles/ideas across the country
cause articles to be shared
documents similar to each other, etc…
said…”)
“meaning of the whole is a function of the meanings of the parts and of the way in which they are combined”
Words
Sentences = f(Words, Syntax)
Documents = f(Sentences, Discourse)
Very difficult… (impossible?) …to achieve
horse shoes ≈ alligator shoes?
(We often treat sentences just like short documents, though)
set of words
Changes I make to the nations.js file do not affect any of the html in after I load the nations.html file
When I try to display dots from part 2 on my mac (tried chrome, firefox, and safari), nothing is displayed (and the elements do not appear in the html).
Is it ok to copy and paste the data into javascript, or is there a filereader that can open a local file?
Is it ok to copy and paste the data into javascript, or is there a filereader that can open a local file?
vocabulary:          is  it  a  and  copy  …  markets  below  paste  remorse
“one hot”:           1   1   1  1    1     …  0        0      1      0
counts/frequencies:  2   1   2  1    1     …  0        0      1      0
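The two encodings above can be sketched in a few lines of Python. This is a minimal illustration, not the course's code: the helper names are made up here, and the vocabulary is just the words visible on the slide (the part hidden behind "…" is left out).

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters/digits (punctuation is dropped)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def one_hot(text, vocab):
    """1 if the vocabulary word occurs in the text at all, else 0."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocab]

def counts(text, vocab):
    """Number of times each vocabulary word occurs in the text."""
    freq = Counter(tokenize(text))
    return [freq[word] for word in vocab]

# Assumed vocabulary: only the words shown on the slide.
vocab = ["is", "it", "a", "and", "copy", "markets", "below", "paste", "remorse"]
post = ("Is it ok to copy and paste the data into javascript, "
        "or is there a filereader that can open a local file?")

print(one_hot(post, vocab))
print(counts(post, vocab))
```

The only difference between the two vectors is whether repeated words ("is", "a") are counted more than once.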
“Term Document Matrix”
[Table: rows doc 1, doc 2, doc 3; columns are the vocabulary (is, it, a, and, copy, …, markets, below, paste, remorse); each cell holds the count of that word in that document]
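A term-document matrix stacks one count vector per document. The sketch below builds one from scratch; the three example documents and the function name are hypothetical, chosen only to keep the output small.

```python
import re
from collections import Counter

def term_document_matrix(docs):
    """Return (vocab, rows): one row of word counts per document."""
    tokenized = [re.findall(r"[a-z0-9]+", d.lower()) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    rows = []
    for toks in tokenized:
        freq = Counter(toks)
        rows.append([freq[w] for w in vocab])
    return vocab, rows

# Hypothetical documents, not taken from the slides.
docs = [
    "html does not work",
    "html does work",
    "webdev is awesome, html is awesome",
]
vocab, rows = term_document_matrix(docs)
print(vocab)
for row in rows:
    print(row)
```

Here documents are rows and words are columns, matching the layout on the slide; the transposed layout is equally common.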
Edit distance: number of edits (inserts, deletes, substitutions) needed to transform string 1 into string 2.
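A common way to compute this number is the Levenshtein dynamic program. The sketch below assumes every insert, delete, and substitution costs 1; the slides do not say which costs were intended.

```python
def edit_distance(s1, s2):
    """Minimum number of inserts, deletes, and substitutions turning s1 into s2."""
    # prev[j] = distance between the processed prefix of s1 and s2[:j]
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into the empty string takes i deletes
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(
                prev[j] + 1,         # delete c1
                curr[j - 1] + 1,     # insert c2
                prev[j - 1] + cost,  # substitute c1 -> c2 (free if equal)
            ))
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))
```

The same function works on lists of words instead of characters, which is often the more useful unit when comparing documents.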
Which document is more relevant to the query, according to Jaccard?
assume one-hot (frequency doesn’t matter), ignore case/punctuation
Query: html does not work
doc 1: Changes I make do not affect any of the html in after I load the nations html file
doc 2: When I try to display dots from part 2 the elements do not appear in the html.
Jaccard = |query ∩ doc| / |query ∪ doc|
doc 1: 2/17 ≈ 0.118 (shared words: html, not; union: 4 query words + 15 distinct doc words − 2 shared = 17)
doc 2: 2/18 ≈ 0.111 (shared words: html, not; union: 4 + 16 − 2 = 18)
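The Jaccard comparison can be checked mechanically. This sketch takes Jaccard to be the size of the intersection divided by the size of the union of the two word sets, with case and punctuation ignored as the slide asks; the function names are illustrative.

```python
import re

def word_set(text):
    """One-hot view of a text: its set of lowercased word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over the word sets of two texts."""
    set_a, set_b = word_set(a), word_set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

query = "html does not work"
doc_1 = "Changes I make do not affect any of the html in after I load the nations html file"
doc_2 = "When I try to display dots from part 2 the elements do not appear in the html."

for name, doc in [("doc 1", doc_1), ("doc 2", doc_2)]:
    print(name, round(jaccard(query, doc), 3))
```

Both documents share the same two words with the query, so the shorter document wins only because its union is smaller.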
Changes I make do not affect any of the html in after I load the nations html file
When I try to display dots from part 2 …the elements do not appear in the html.
[Figure: the two documents plotted as vectors on the axes “do” and “html” (word counts); θ is the angle between the two vectors]
Which document is more relevant to the query, according to cosine?
[Table: rows query, doc 1, doc 2; columns html, does, not, work, at, all, is, awesome, webdev; one-hot entries]
doc 1: 3/(√6√6) = 0.5
doc 2: 3/(√6√4) ≈ 0.6
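Cosine similarity is the dot product of two vectors divided by the product of their lengths. The cell values of the slide's table did not survive, so the one-hot word sets below are an assumed reconstruction, chosen to be consistent with the slide's arithmetic (6, 6, and 4 words, with 3 shared each time).

```python
import math

def cosine(u, v):
    """cos(theta) between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

VOCAB = ["html", "does", "not", "work", "at", "all", "is", "awesome", "webdev"]

def one_hot(words):
    return [1 if w in words else 0 for w in VOCAB]

# Assumed word sets (not recoverable from the slide text itself).
query = one_hot({"html", "does", "not", "work", "at", "all"})
doc_1 = one_hot({"html", "does", "all", "is", "awesome", "webdev"})
doc_2 = one_hot({"html", "does", "work", "webdev"})

print(round(cosine(query, doc_1), 2))
print(round(cosine(query, doc_2), 2))
```

Unlike Jaccard, cosine also works with count or weighted vectors, since it never requires the entries to be 0 or 1.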
They freaked out when they found the bug in their apartment.
They’ve always been terrified of anything crawly. They freaked out when they found the bug in their apartment.
They ran back to the CIT right away to tell everyone they’d finally figured it out. They freaked out when they found the bug in their apartment.
They ran back to the CIT right away to tell everyone they’d finally figured it out. They freaked out when they found the problem in their apartment.
Constant Tradeoff
Collapse! Try to treat more words as though they are the same (normalization, stemming)
Differentiate! Try to preserve as many differences and as much nuance as possible (tagging, collocations)
I am trying to display dots from Part 2 on my mac (tried Chrome, Firefox, and Safari), but nothing is displayed (and the elements do not appear in the html).
Analysis?)
I am trying to display dots from Part 2 on my mac ( tried Chrome , Firefox , and Safari ) , but nothing is displayed ( and the elements do not appear in the html ) .
日文章魚怎麼說? “How to say octopus in Japanese?”
日文 章魚 怎麼 說 ? Japanese octopus how say ?
I am trying to display dots from Part 2 on my mac tried Chrome Firefox and Safari but nothing is displayed and the elements do not appear in the html
i am trying to display dots from part 2 on my mac tried chrome firefox and safari but nothing is displayed and the elements do not appear in the html
i be try to display dot from part 2 on my mac try chrome firefox and safari but nothing be display and the element do not appear in the html
i be try to display dot from part <NUM> on my mac try chrome firefox and safari but nothing be display and the element do not appear in the html
try display dot part <NUM> mac try chrome firefox safari nothing display element not appear html
try_VB display_VB dot_NN part_NN <NUM>_NUM mac_NNP try_VB chrome_NNP firefox_NNP safari_NNP nothing_DT display_VB element_NNP not_RB appear_VB html_NN
try_VB display_VB dot_NN part_NN <NUM>_NUM mac_NNP try_VB chrome_NNP <OOV> <OOV> nothing_DT display_VB element_NNP not_RB appear_VB html_NN
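Most of the steps shown on these slides (tokenize, drop punctuation, lowercase, replace numbers with `<NUM>`, drop stop words, replace out-of-vocabulary words with `<OOV>`) can be chained in plain Python. This is a rough sketch under stated assumptions: the stop-word set is a tiny hand-picked one, and lemmatization and part-of-speech tagging are left out because they need a lexicon or a trained tagger.

```python
import re

# Assumed, hand-picked stop words; a real list would be much longer.
STOP_WORDS = {"i", "am", "to", "from", "on", "my", "and", "but", "is", "the", "do", "in"}

def tokenize(text):
    """Split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def preprocess(text, vocab=None):
    tokens = tokenize(text)
    tokens = [t for t in tokens if re.match(r"\w", t)]        # drop punctuation
    tokens = [t.lower() for t in tokens]                      # lowercase
    tokens = ["<NUM>" if t.isdigit() else t for t in tokens]  # normalize numbers
    tokens = [t for t in tokens if t not in STOP_WORDS]       # drop stop words
    if vocab is not None:                                     # mark unknown words
        tokens = [t if t in vocab else "<OOV>" for t in tokens]
    return tokens

post = ("I am trying to display dots from Part 2 on my mac (tried Chrome, Firefox, "
        "and Safari), but nothing is displayed (and the elements do not appear in the html).")
print(" ".join(preprocess(post)))
```

Each step collapses more surface forms into the same token, which is the "Collapse!" side of the tradeoff; passing a `vocab` set that lacks "firefox" and "safari" reproduces the `<OOV>` replacement.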
pmi?)
(what goes on the columns)
[Figure: Zipf’s law; axes Word Frequency and Word Rank; https://en.wikipedia.org/wiki/Zipf%27s_law]
The most frequent 0.2% of words make up 50% of occurrences.
“stop words”: a, the, of, and, … (or use nltk.corpus.stopwords…)
Usually set some vocab size (around 30K) or some min count (around 3)
seems arbitrary? that’s ’cause it is.
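Both cutoffs mentioned above (a maximum vocabulary size and a minimum count) are easy to apply with a frequency table. A minimal sketch; the function names and the toy corpus are made up for illustration, and the defaults simply echo the slide's ballpark numbers.

```python
from collections import Counter

def build_vocab(token_lists, max_size=30000, min_count=3):
    """Keep at most max_size of the most frequent words that occur at least min_count times."""
    freq = Counter(t for tokens in token_lists for t in tokens)
    return {w for w, c in freq.most_common(max_size) if c >= min_count}

def apply_vocab(tokens, vocab, oov="<OOV>"):
    """Replace every token outside the vocabulary with a single OOV symbol."""
    return [t if t in vocab else oov for t in tokens]

# Toy corpus: "html" occurs 4 times, "work" 3 times, "firefox" once.
corpus = [["html", "html", "html", "firefox"], ["html", "work", "work", "work"]]
vocab = build_vocab(corpus, max_size=10, min_count=3)
print(sorted(vocab))
print(apply_vocab(["html", "firefox", "work"], vocab))
```

Because of the long Zipfian tail, a small min count usually removes a large share of the distinct words while touching very few of the total occurrences.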
this document from other documents
(# of times word appears across all documents)
the term-document matrix accordingly
webdev: html does work
html does work. all webdev is awesome.
html does not work
[Table: rows doc1, doc 2, doc 3; columns html, does, not, work, at, all, is, awesome, webdev; one-hot entries]
What is the tf-idf vector for doc1?
[Three candidate vectors a), b), c), with entries drawn from 1/3, 1/2, and 1]
df html: 3 does: 3 not: 1 work: 2 at: 1 all: 2 webdev: 2 is: 1 awesome: 1
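The weighting in this exercise can be sketched as tf divided by df, which is what the candidate entries (1/3, 1/2, 1) suggest; many libraries use a log-scaled idf instead, so treat this as one simple variant. The one-hot rows below are assumptions chosen only so that the column sums match the df values listed above, and the labels A, B, C are deliberately not tied to doc1, doc 2, doc 3.

```python
from fractions import Fraction

VOCAB = ["html", "does", "not", "work", "at", "all", "is", "awesome", "webdev"]

# Assumed word sets; only their column sums are checked against the slide's df line.
DOCS = {
    "A": {"html", "does", "work", "webdev"},
    "B": {"html", "does", "all", "is", "awesome", "webdev"},
    "C": {"html", "does", "not", "work", "at", "all"},
}

def document_frequency(docs):
    """Number of documents that contain each vocabulary word."""
    return {w: sum(w in words for words in docs.values()) for w in VOCAB}

def tfidf(words, df):
    """One-hot term frequency divided by document frequency, as exact fractions."""
    return [Fraction(int(w in words), df[w]) if df[w] else Fraction(0) for w in VOCAB]

df = document_frequency(DOCS)
print(df)
print([str(x) for x in tfidf(DOCS["A"], df)])
```

Words that appear in every document ("html", "does") end up with the smallest weights, and words unique to one document get weight 1.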
differentiate this document from other documents
word-word collocations (more info in two seconds)
trigrams, 4-grams, …)
from “hot dog”)
html does work . all webdev is awesome.
1gms: [‘html’, ‘does’, ‘work’, ‘.’, ‘all’, …]
2gms: [‘html does’, ‘does work’, ‘work .’, ‘. all’, …]
3gms: [‘html does work’, ‘does work .’, ‘work . all’, …]
skip-gms: [‘html does’, ‘html work’, ‘does html’, ‘does work’, ‘does .’, …]
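The n-gram lists above can be generated with two short functions. In this sketch the skip-grams pair each word with every other word at most two positions away on either side; that window size is an assumption, picked because it reproduces the order of the pairs shown on the slide.

```python
def ngrams(tokens, n):
    """All contiguous runs of n tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skipgrams(tokens, window=2):
    """Pairs 'w_i w_j' for every j within `window` positions of i (j != i)."""
    pairs = []
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append(f"{word} {tokens[j]}")
    return pairs

tokens = "html does work . all webdev is awesome .".split()
print(ngrams(tokens, 1)[:5])
print(ngrams(tokens, 2)[:4])
print(ngrams(tokens, 3)[:3])
print(skipgrams(tokens)[:5])
```

Any of these lists can replace single words as the columns of the term-document matrix.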
by finding words that occur together above chance
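One standard way to score "occurs together above chance" is pointwise mutual information, PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ). The sketch below assumes that is the measure meant here, uses adjacent word pairs as the co-occurrence events, and runs on a made-up toy corpus.

```python
import math
from collections import Counter

def pmi_scores(sentences):
    """PMI for every adjacent word pair, with probabilities estimated by relative frequency."""
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        word_counts.update(tokens)
        pair_counts.update(zip(tokens, tokens[1:]))
    n_words = sum(word_counts.values())
    n_pairs = sum(pair_counts.values())
    scores = {}
    for (x, y), count in pair_counts.items():
        p_xy = count / n_pairs
        p_x = word_counts[x] / n_words
        p_y = word_counts[y] / n_words
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Toy corpus, invented for illustration.
corpus = [["the", "hot", "dog"], ["the", "hot", "dog"], ["the", "cat"], ["the", "bird"]]
scores = pmi_scores(corpus)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

Pairs involving a very frequent word such as "the" score lower than equally frequent pairs of rarer words, because the frequent word co-occurs with everything. On real data PMI is unreliable for rare pairs, so a minimum pair count is usually applied first.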
Changes I make to the nations.js file do not affect any of the html in after I load the nations.html file
When I try to display dots from part 2 on my mac (tried chrome, firefox, and safari), the elements do not appear in the html.
Can you elaborate on exactly what the directions are in part 2 step 3, the stencil code does not quite imply what we are supposed to do…
instructions: stencil, instructions, part, step, rubric, handin…
UI: html, javascript, debug, display, elements…
systems: mac, windows, linux, chrome, firefox, os…
fillers: I, you, when, the, and, a
Where do documents come from? “The generative story”
[Animation: a document is generated one word at a time, each word drawn from one of the topics (instructions, UI, systems, fillers): You, javascript, handin]
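The generative story can be simulated directly: pick a topic according to the document's topic mixture, pick a word from that topic, and repeat. In this sketch the topic word lists come from the slide, while the mixture weights and the uniform choice of words within a topic are assumptions.

```python
import random

TOPICS = {
    "instructions": ["stencil", "instructions", "part", "step", "rubric", "handin"],
    "UI": ["html", "javascript", "debug", "display", "elements"],
    "systems": ["mac", "windows", "linux", "chrome", "firefox", "os"],
    "fillers": ["I", "you", "when", "the", "and", "a"],
}

def generate_document(topic_weights, length, rng):
    """Draw `length` words: first sample a topic, then sample a word from it."""
    names = list(topic_weights)
    weights = [topic_weights[name] for name in names]
    words = []
    for _ in range(length):
        topic = rng.choices(names, weights=weights, k=1)[0]
        words.append(rng.choice(TOPICS[topic]))  # assumed uniform within a topic
    return words

rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical mixture for one document.
mixture = {"instructions": 0.2, "UI": 0.3, "systems": 0.1, "fillers": 0.4}
doc = generate_document(mixture, 8, rng)
print(" ".join(doc))
```

Fitting a topic model runs this story in reverse: given only the words, estimate the topics and mixtures that most plausibly produced them.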
“Latent Semantic Analysis” (LSA)
“latent” variable (not observed)
words are determined by topic (and are conditionally independent of each other)
documents are a distribution over topics
set parameters to maximize probability of observations
part 2 html does not work
[Figure: bar chart of the document’s distribution over Topic1, Topic2, Topic3, Topic4; bar charts of two topics’ distributions over the words html, javascript, work, handin, part, stencil]
Which is the best parameter setting for the observed data?
Observed document: part <NUM> html does not work
[Figure: bar charts of word probabilities under Topic 1 and Topic 2 for the words part, <NUM>, html, does, not, work; two candidate topic mixtures for the document: (a) 50/50, (b) 67/33]
a: (0.3+0.2+0+0.1+0.1+0.2)×0.5 + (0+0.3+0.4+0.1+0.2)×0.5 = 0.45 + 0.5 = 0.95
b: (0.3+0.2+0+0.1+0.1+0.2)×0.33 + (0+0.3+0.4+0.1+0.2)×0.67 = 0.297 + 0.67 = 0.967
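The arithmetic on this slide can be reproduced in a few lines. The sketch follows the slide's simplified score (sum each topic's word probabilities, then weight by the topic's share of the document), with the numbers copied from the sums shown; a full likelihood would instead multiply per-word mixture probabilities.

```python
# Per-word probabilities as they appear in the slide's sums.
TOPIC_1 = [0.3, 0.2, 0.0, 0.1, 0.1, 0.2]
TOPIC_2 = [0.0, 0.3, 0.4, 0.1, 0.2]

def slide_score(weight_1, weight_2):
    """Topic-weighted sum of word probabilities, as computed on the slide."""
    return sum(TOPIC_1) * weight_1 + sum(TOPIC_2) * weight_2

score_a = slide_score(0.5, 0.5)
score_b = slide_score(0.33, 0.67)
print(round(score_a, 3), round(score_b, 3))  # 0.95 0.967
```

Setting (b) scores higher because it puts more weight on the topic that assigns more probability to the observed words.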
[Table: term-document matrix; rows doc1 to doc4; columns the, congress, parliament, US, UK; one-hot entries (four 1s in doc1, three 1s in each of doc2, doc3, doc4)]

U (rows d1 to d4):
d1  -0.60  -0.39   0.70   0.00
d2  -0.48   0.50  -0.12  -0.71
d3  -0.43  -0.58  -0.69   0.00
d4  -0.48   0.50  -0.12   0.71

D:
3.06  0.00  0.00  0.00  0.00
0.00  1.81  0.00  0.00  0.00
0.00  0.00  0.57  0.00  0.00
0.00  0.00  0.00  0.00  0.00

[V: matrix over the words the, congress, parliament, US, UK]

component = “topic” = distribution over words
document = distribution
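The decomposition on these slides is a singular value decomposition of the term-document matrix. The cell positions of the original matrix did not survive, so the matrix below is an assumption: one placement of the 1s that is consistent with the row totals and with the singular values shown in D. The sketch uses NumPy's `numpy.linalg.svd`.

```python
import numpy as np

WORDS = ["the", "congress", "parliament", "US", "UK"]

# Assumed one-hot term-document matrix (rows doc1..doc4).
A = np.array([
    [1, 1, 1, 1, 0],
    [1, 0, 1, 0, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
], dtype=float)

# A = U @ diag(singular_values) @ Vt
U, singular_values, Vt = np.linalg.svd(A, full_matrices=False)
print(np.round(singular_values, 2))

# Keep the k strongest components ("topics").
k = 2
doc_topics = U[:, :k] * singular_values[:k]  # each document as a mix of k components
topic_words = Vt[:k, :]                      # each component as weights over WORDS
A_k = doc_topics @ topic_words               # rank-k approximation of A
print(np.round(A_k, 2))
```

The signs of individual components are arbitrary and can differ between SVD implementations, so only the product is comparable across runs. Documents with identical rows (doc2 and doc4 here) get identical coordinates on every component with a nonzero singular value.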
k bye