NLP!!! (Part 2)
April 9, 2020 Data Science CSCI 1951A Brown University Instructor: Ellie Pavlick HTAs: Josh Levin, Diane Mutako, Sol Zitter
Announcements
Viz Lab tomorrow afternoon (4pm? Check Piazza)
Project
trigrams, 4-grams, …)
from “hot dog”)
apply to ngrams too
html does work . all webdev is awesome.
1gms: [‘html’, ‘does’, ‘work’, ‘.’, ‘all’, …]
2gms: [‘html does’, ‘does work’, ‘work .’, ‘. all’, …]
3gms: [‘html does work’, ‘does work .’, ‘work . all’, …]
skip-1gms: [‘html does’, ‘html work’, ‘does html’, ‘does work’, …]
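A minimal sketch of how the lists above could be produced. It assumes the text is already tokenized on whitespace with punctuation split off, and the helper names `ngrams` and `skip_bigrams` are made up for this example. The skip-gram helper only keeps left-to-right pairs, so it would not produce a reversed pair like ‘does html’.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skip_bigrams(tokens, k):
    """Return ordered word pairs with at most k tokens skipped between them."""
    pairs = []
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + k + 2, len(tokens))):
            pairs.append(f"{left} {tokens[j]}")
    return pairs


# Assumption: punctuation is already separated from words by spaces.
tokens = "html does work . all webdev is awesome .".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
skips = skip_bigrams(tokens, 1)

print(bigrams[:4])
print(trigrams[:3])
print(skips[:4])
```

Setting `k=0` in `skip_bigrams` gives back the ordinary bigrams, which is a quick sanity check on the skip logic.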
airplane” or “fly” as in “go fast”?
place or “Washington” the person
today, despite the lockdown, i will get groceries
https://explosion.ai/demos/displacy
“Dependency Parsing”
all webdev is awesome.
https://demo.allennlp.org/constituency-parsing
“Constituency Parsing”
Changes I make to the nations.js file do not affect any of the html after I load the nations.html file
When I try to display dots from part 2 on my mac (tried chrome, firefox, and safari), the elements do not appear in the html.
Can you elaborate on exactly what the directions are in part 2 step 3, the stencil code does not quite imply what we are supposed to do…
instructions: stencil, instructions, part, step, rubric, handin…
UI: html, javascript, debug, display, elements…
systems: mac, windows, linux, chrome, firefox, os…
fillers: I, you, when, the, and, a
Where do documents come from? “The generative story”
You
You javascript
You javascript handin
“latent” variable (not observed)
words are determined by topic (and are conditionally independent of each other)
documents are a distribution over topics
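The generative story can be sketched in a few lines: a document is a distribution over topics, and each word is drawn independently by first picking a topic and then picking a word from that topic. The topic word lists come from the slides. The uniform word probabilities within a topic, the topic mixture `doc_topics`, and the function name `generate_document` are assumptions made only for this illustration.

```python
import random

# Topic -> word list, taken from the slides. Treating every word in a topic
# as equally likely is an assumption made for this sketch.
TOPICS = {
    "instructions": ["stencil", "instructions", "part", "step", "rubric", "handin"],
    "UI": ["html", "javascript", "debug", "display", "elements"],
    "systems": ["mac", "windows", "linux", "chrome", "firefox", "os"],
    "fillers": ["I", "you", "when", "the", "and", "a"],
}


def generate_document(topic_weights, length, rng):
    """Sample each word independently: pick a topic, then a word from it."""
    names = list(topic_weights)
    weights = [topic_weights[name] for name in names]
    words = []
    for _ in range(length):
        topic = rng.choices(names, weights=weights, k=1)[0]
        words.append(rng.choice(TOPICS[topic]))
    return words


# Hypothetical topic mixture for one document.
doc_topics = {"instructions": 0.3, "UI": 0.4, "systems": 0.1, "fillers": 0.2}
document = generate_document(doc_topics, length=8, rng=random.Random(0))
print(" ".join(document))
```

The output is word salad, not a sentence, because the model ignores word order entirely. That is the conditional independence assumption at work.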
set parameters to maximize probability of observations
part 2 html does not work
[Figure: bar chart of the document's weights over Topic1, Topic2, Topic3, Topic4; bar charts of topic weights over the words html, javascript, work, handin, part, stencil]
Which is the best parameter setting for the observed data?
part <NUM> html does not work
[Figure: word distributions for Topic 1 and Topic 2; document topic proportions (Topic1, Topic2) for setting (a): 50, 50 and for setting (b): 67, 33]
a: (0.3+0.2+0+0.1+0.1+0.2)x0.5 + (0+0.3+0.4+0.1+0.2)x0.5 = 0.45 + 0.5 = 0.95
b: (0.3+0.2+0+0.1+0.1+0.2)x0.33 + (0+0.3+0.4+0.1+0.2)x0.67 = 0.297 + 0.67 = 0.967
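The comparison of settings (a) and (b) can be reproduced with a short sketch. The per-word values are copied from the slide arithmetic in the order they appear there. Which value belongs to which word is not recoverable from the text, so the lists are left unlabeled, and `slide_score` is a name made up for this example. It mirrors the slide's simplified weighted-sum score and is not a full likelihood, which would multiply per-word probabilities.

```python
# Per-word values for the observed words under each topic, copied from the
# slide's arithmetic (the word each value belongs to is not shown there).
topic1_word_probs = [0.3, 0.2, 0.0, 0.1, 0.1, 0.2]
topic2_word_probs = [0.0, 0.3, 0.4, 0.1, 0.2]


def slide_score(weight_topic1, weight_topic2):
    """Weighted sum of word probabilities, as computed on the slide."""
    return (sum(topic1_word_probs) * weight_topic1
            + sum(topic2_word_probs) * weight_topic2)


score_a = slide_score(0.5, 0.5)     # setting (a)
score_b = slide_score(0.33, 0.67)   # setting (b), weights as used on the slide
best = "a" if score_a > score_b else "b"

print(f"a: {score_a:.3f}")
print(f"b: {score_b:.3f}")
print("better setting:", best)
```

Under this score, setting (b) comes out slightly ahead, matching the slide's 0.967 versus 0.95.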
LDA: Latent Dirichlet Allocation
Generative Model
(latent = not directly observed; Dirichlet = prior follows a Dirichlet distribution)
Set parameters using EM
LSA: Latent Semantic Analysis
Discriminative Model
Set parameters by factorizing the term-document matrix
[Table: binary term-document matrix with rows doc1, doc2, doc3, doc4 and columns the, congress, parliament, US, UK, factorized as U D V]
U (one row per document):
d1 -0.60 -0.39 0.70 0.00
d2 -0.48 0.50 -0.12 -0.71
d3 -0.43 -0.58 -0.69 0.00
d4 -0.48 0.50 -0.12 0.71
D (diagonal): 3.06, 1.81, 0.57, 0.00
V (one column per term: the, congress, parliament, US, UK)
component = “topic” = distribution over words
document = distribution
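A minimal sketch of the LSA factorization using NumPy's `numpy.linalg.svd`. The matrix `X` below is a hypothetical term-document matrix that reuses the slide's column labels. Its entries are made up for illustration and are not the values from the slide, so the numbers will not match the U and D shown above.

```python
import numpy as np

terms = ["the", "congress", "parliament", "US", "UK"]

# Hypothetical binary term-document matrix (rows = documents). These entries
# are an illustration only, not the matrix from the slide.
X = np.array([
    [1, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
], dtype=float)

# Thin SVD: X = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of components ("topics") to keep
doc_vectors = U[:, :k] * s[:k]   # each document described by k components
topic_term = Vt[:k, :]           # each component as weights over the terms
X_approx = doc_vectors @ topic_term

print(np.round(s, 2))
print(np.round(X_approx, 2))
```

Because this toy matrix only contains two distinct kinds of document, two components are enough to reconstruct it. With real data, truncating to `k` components gives an approximation instead.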
[Table: count matrix with rows doc 1, doc 2, doc 3 and columns is, it, a, and, copy, …, markets, below, paste, remorse]
[Table: the same count layout with rows markets, Washington, stimulus]
“Distributional Hypothesis”: the meaning of a word is determined by the contexts in which it is used
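One way to make the distributional hypothesis concrete is to count, for each word, which words appear near it, and then compare those count vectors. The corpus, the window size of 2, and the helper names below are all made up for this sketch.

```python
import math
from collections import Counter, defaultdict

# Tiny made-up corpus, purely for illustration.
corpus = [
    "washington passed the stimulus",
    "brussels passed the stimulus",
    "markets rose after the stimulus",
    "markets fell after the vote",
]


def context_counts(sentences, window=2):
    """Count, for every word, the words appearing within `window` positions."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


counts = context_counts(corpus)
sim_capitals = cosine(counts["washington"], counts["brussels"])
sim_mixed = cosine(counts["washington"], counts["fell"])
print(round(sim_capitals, 3), round(sim_mixed, 3))
```

In this toy corpus "washington" and "brussels" occur in identical contexts, so their vectors point in the same direction, while "washington" and "fell" share only one context word.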
Word-Context Matrix
[Table: binary word-context matrix with rows market, Washington, stimulus, Brussels and columns the, congress, parliament, US, UK, factorized as U D V]
U (one row per word):
market -0.60 -0.39 0.70 0.00
stimulus -0.43 -0.58 -0.69 0.00
D (diagonal): 3.06, 1.81, 0.57, 0.00
V (one column per context: the, congress, parliament, US, UK)
Word Embeddings
Label  Review
1  Lovely mushroomy nose and good length.
0  Good if not dramatic fizz.
0  Rubbery - rather oxidised.
1  Gamy, succulent tannins. Lovely.
0  Quite raw finish. A bit rubbery.
1  Provence herbs, creamy, lovely.
[Table: y = Label; X = one column per word (lovely, good, raw, rubbery, rather, mushroomy, gamy, …) with a 1 where the word occurs in the review]
[Table: y = Label; X = the D-dimensional vector for a word in the review (lovely, good, lovely, rubbery, …)]
No longer treated as entirely different words (often just add up vectors when more than one word)
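The two feature representations can be compared side by side in a short sketch. The reviews, labels, and vocabulary words come from the slides. The 2-dimensional vectors in `embeddings` are invented for illustration (real embeddings would be learned and much larger), and the helpers `tokenize` and `embed` are hypothetical names.

```python
import re

reviews = [
    ("Lovely mushroomy nose and good length.", 1),
    ("Good if not dramatic fizz.", 0),
    ("Rubbery - rather oxidised.", 0),
    ("Gamy, succulent tannins. Lovely.", 1),
    ("Quite raw finish. A bit rubbery.", 0),
    ("Provence herbs, creamy, lovely.", 1),
]


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


vocab = ["lovely", "good", "raw", "rubbery", "rather", "mushroomy", "gamy"]

# Bag of words: one 0/1 column per vocabulary word.
X_bow = [[int(word in tokenize(text)) for word in vocab] for text, _ in reviews]
y = [label for _, label in reviews]

# Hypothetical 2-dimensional embeddings, invented for this example.
embeddings = {
    "lovely": [0.9, 0.1], "good": [0.8, 0.2], "raw": [-0.6, 0.3],
    "rubbery": [-0.8, 0.1], "rather": [0.0, 0.5],
    "mushroomy": [0.3, 0.7], "gamy": [0.2, 0.8],
}


def embed(text, dim=2):
    """Add up the vectors of all known words in the text."""
    total = [0.0] * dim
    for token in tokenize(text):
        for d, value in enumerate(embeddings.get(token, [0.0] * dim)):
            total[d] += value
    return total


X_emb = [embed(text) for text, _ in reviews]
print(X_bow[0], X_emb[0])
```

In the bag-of-words rows, "lovely" and "good" occupy unrelated columns. In the embedding version they contribute similar vectors, so reviews using either word end up close together.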
Word Embeddings
https://towardsdatascience.com/word2vec-skip-gram-model-part-1-intuition-78614e4d6e0b
Word Embeddings More in the DL lecture!
k bye