NLP!!! (Part 2), April 9, 2020. Data Science CSCI 1951A, Brown University. (PowerPoint presentation transcript)

SLIDE 1

NLP!!! (Part 2)

April 9, 2020
Data Science CSCI 1951A, Brown University
Instructor: Ellie Pavlick
HTAs: Josh Levin, Diane Mutako, Sol Zitter

SLIDE 2

Announcements

  • Viz Lab tomorrow afternoon (4pm? Check Piazza)
  • Project Grades/Pitches/Presentations

SLIDE 3

Today

  • More NLP!
  • Ngrams
  • Topic Models
  • Word Embeddings

SLIDE 5

N-Grams

  • N-length sequence of words (unigrams, bigrams, trigrams, 4-grams, …)
  • Provides some context (differentiating “cute dog” from “hot dog”)
  • Blows up size of vocabulary, increases sparsity
  • Usually vocab size cutoffs/min count thresholds apply to ngrams too

SLIDE 6

N-Grams

html does work . all webdev is awesome.

1gms: [‘html’, ‘does’, ‘work’, ‘.’, ‘all’, …]
2gms: [‘html does’, ‘does work’, ‘work .’, ‘. all’, …]
3gms: [‘html does work’, ‘does work .’, ‘work . all’, …]

SLIDE 7

N-Grams

html does work . all webdev is awesome.

1gms: [‘html’, ‘does’, ‘work’, ‘.’, ‘all’, …]
2gms: [‘html does’, ‘does work’, ‘work .’, ‘. all’, …]
3gms: [‘html does work’, ‘does work .’, ‘work . all’, …]
skip-1gms: [‘html does’, ‘html work’, ‘does html’, ‘does work’, …]
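The n-gram and skip-gram lists above can be produced with a couple of short helpers; a minimal sketch (function names are illustrative, tokenization assumed done):

```python
def ngrams(tokens, n):
    """Return all contiguous n-length word sequences."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skip_bigrams(tokens, k):
    """Return ordered word pairs at most k tokens apart (k-skip bigrams)."""
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(max(0, i - k - 1), min(len(tokens), i + k + 2)):
            if j != i:
                pairs.append(f"{w} {tokens[j]}")
    return pairs

tokens = "html does work . all webdev is awesome .".split()
print(ngrams(tokens, 2)[:3])        # ['html does', 'does work', 'work .']
print(skip_bigrams(tokens, 1)[:4])  # ['html does', 'html work', 'does html', 'does work']
```

Note how every skip-bigram list contains the plain bigrams plus pairs that jump over one word, which is exactly why vocabulary size (and sparsity) blows up.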

SLIDE 8

Tagging

  • Parts of Speech — “fly” the noun or “fly” the verb?
  • Word Sense Disambiguation — “fly” as in “take an airplane” or “fly” as in “go fast”?
  • Named Entity Recognition — “Washington” the place or “Washington” the person?
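The ambiguity above can be made concrete with a toy experiment: a context-free (unigram) tagger assigns each word its single most frequent tag, so it must give both senses of “fly” the same label. This sketch uses a tiny hand-made training set (all names and counts are illustrative):

```python
from collections import Counter, defaultdict

# Tiny hand-labeled training data: "fly" appears more often as a verb.
train = [("fly", "VERB"), ("fly", "NOUN"), ("fly", "VERB"),
         ("the", "DET"), ("a", "DET"), ("swat", "VERB")]

counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1
most_frequent = {w: c.most_common(1)[0][0] for w, c in counts.items()}

sentence = "swat the fly".split()   # here "fly" is clearly a noun
tags = [most_frequent.get(w, "NOUN") for w in sentence]
print(tags)  # ['VERB', 'DET', 'VERB'] -- the unigram tagger gets "fly" wrong
```

Real POS taggers condition on context (neighboring words and tags), which is what lets them separate the noun and verb readings.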

SLIDE 9

Syntactic Relations: “Dependency Parsing”

today, despite the lockdown, i will get groceries

https://explosion.ai/demos/displacy


SLIDE 11

Syntactic Relations: “Constituency Parsing”

all webdev is awesome.

https://demo.allennlp.org/constituency-parsing


SLIDE 13

Today

  • More NLP!
  • Ngrams
  • Topic Models
  • Word Embeddings

slide-14
SLIDE 14

Changes I make to the nations.js file do not affect any of the html in after I load the nations.html file When I try to display dots from part 2 on my mac (tried chrome, firefox, and safari), the elements do not appear in the html. Can you elaborate on exactly what the directions are in part 2 step 3, the stencil code does not quite imply what we are supposed to do…

Topic Models

14

SLIDE 15

Topic Models

  Changes I make to the nations.js file do not affect any of the html after I load the nations.html file
  When I try to display dots from part 2 on my mac (tried chrome, firefox, and safari), the elements do not appear in the html.
  Can you elaborate on exactly what the directions are in part 2 step 3, the stencil code does not quite imply what we are supposed to do…

instructions: stencil, instructions, part, step, rubric, handin…
UI: html, javascript, debug, display, elements…
systems: mac, windows, linux, chrome, firefox, os…
fillers: I, you, when, the, and, a

SLIDE 16

Topic Models

Where do documents come from? “The generative story”

instructions: stencil, instructions, part, step, rubric, handin…
UI: html, javascript, debug, display, elements…
systems: mac, windows, linux, chrome, firefox, os…
fillers: I, you, when, the, and, a

SLIDE 17

Topic Models

Where do documents come from? “The generative story”

  • 1. Sample a topic

SLIDE 18

Topic Models

Where do documents come from? “The generative story”

  • 2. Sample a word from that topic

Document so far: “You”

SLIDE 19

Topic Models

Where do documents come from? “The generative story”

  • 1. Sample a topic

Document so far: “You”

SLIDE 20

Topic Models

Where do documents come from? “The generative story”

  • 2. Sample a word from that topic

Document so far: “You javascript”

SLIDE 21

Topic Models

Where do documents come from? “The generative story”

  • 1. Sample a topic

Document so far: “You javascript”

SLIDE 22

Topic Models

Where do documents come from? “The generative story”

  • 2. Sample a word from that topic

Document so far: “You javascript handin”
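The two-step generative story above can be run directly: repeatedly sample a topic from the document's topic distribution, then sample a word from that topic. The topic names and word lists follow the slides; the topic weights here are made-up illustrative values, and words are drawn uniformly within a topic for simplicity:

```python
import random

topic_weights = {"instructions": 0.3, "UI": 0.3, "systems": 0.2, "fillers": 0.2}
topic_words = {
    "instructions": ["stencil", "instructions", "part", "step", "rubric", "handin"],
    "UI": ["html", "javascript", "debug", "display", "elements"],
    "systems": ["mac", "windows", "linux", "chrome", "firefox", "os"],
    "fillers": ["I", "you", "when", "the", "and", "a"],
}

def generate_document(n_words, rng):
    words = []
    for _ in range(n_words):
        # 1. Sample a topic, 2. sample a word from that topic -- repeat.
        topic = rng.choices(list(topic_weights), weights=list(topic_weights.values()))[0]
        words.append(rng.choice(topic_words[topic]))
    return " ".join(words)

rng = random.Random(0)
print(generate_document(5, rng))
```

Fitting a topic model is this process run in reverse: given only the documents, infer topic weights and topic-word distributions that make the observed text likely.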

SLIDE 23

Topic Models

SLIDE 24

Topic Models

“latent” variable (not observed)

SLIDE 25

Topic Models

words are determined by topic (and are conditionally independent of each other)

SLIDE 26

Topic Models

documents are a distribution over topics

SLIDE 27

Topic Models

set parameters to maximize probability of observations

SLIDE 28

Topic Models

part 2 html does not work

SLIDE 29

Topic Models

part 2 html does not work

[Bar chart: the document’s inferred weights over Topic1–Topic4]

SLIDE 30

Topic Models

part 2 html does not work

[Bar charts: the document’s weights over Topic1–Topic4, plus per-word weights (html, javascript, work, handin, part, stencil) for two of the topics]

SLIDE 31

Clicker Question!

SLIDE 32

Clicker Question!

Which is the best parameter setting for the observed data?

Observed document: “part <NUM> html does not work”

[Table: per-word probabilities (values 0.0–0.4) under Topic 1 and Topic 2]

(a) Topic1 = 50%, Topic2 = 50%
(b) Topic1 = 67%, Topic2 = 33%

SLIDE 33

Clicker Question!

Which is the best parameter setting for the observed data?

Observed document: “part <NUM> html does not work”

[Table: per-word probabilities under Topic 1 and Topic 2; option (a) mixes the topics 50/50, option (b) 67/33]

a: (0.3 + 0.2 + 0 + 0.1 + 0.1 + 0.2) × 0.5 + (0 + 0.3 + 0.4 + 0.1 + 0.2) × 0.5 = 0.45 + 0.5 = 0.95


SLIDE 36

Clicker Question!

Which is the best parameter setting for the observed data?

Observed document: “part <NUM> html does not work”

[Table: per-word probabilities under Topic 1 and Topic 2; option (a) mixes the topics 50/50, option (b) 67/33]

b: (0.3 + 0.2 + 0 + 0.1 + 0.1 + 0.2) × 0.33 + (0 + 0.3 + 0.4 + 0.1 + 0.2) × 0.67 = 0.297 + 0.67 = 0.967
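The two clicker computations reduce to a couple of lines. Following the slides' arithmetic, each option scores as (sum of per-word probabilities under each topic) weighted by that topic's mixing proportion (note the slides add word probabilities for illustration rather than multiplying them, so this is a teaching score, not a true likelihood):

```python
# Per-word probability sums as shown on the slides.
topic1_sum = 0.3 + 0.2 + 0 + 0.1 + 0.1 + 0.2   # = 0.9
topic2_sum = 0 + 0.3 + 0.4 + 0.1 + 0.2         # = 1.0

score_a = topic1_sum * 0.5 + topic2_sum * 0.5    # option (a): 50/50 mix
score_b = topic1_sum * 0.33 + topic2_sum * 0.67  # option (b): 33/67 mix

print(round(score_a, 3), round(score_b, 3))  # 0.95 0.967
```

Option (b) scores higher because it puts more weight on the topic under which the observed words are collectively more probable.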

SLIDE 37

Topic Models

SLIDE 38

Topic Models

LDA: Latent Dirichlet Allocation (Generative Model)
(latent = not directly observed; Dirichlet = prior follows a Dirichlet distribution)
Set parameters using EM or MCMC
SLIDE 39

Topic Models

LDA: Latent Dirichlet Allocation (Generative Model)
(latent = not directly observed; Dirichlet = prior follows a Dirichlet distribution)
Set parameters using EM or MCMC

LSA: Latent Semantic Analysis (Discriminative Model)
Set parameters by factorizing the term-document matrix
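Both approaches are a few lines with scikit-learn (assumed available); the toy Piazza-like documents below are made up. LDA here is fit by variational EM, while TruncatedSVD applied to the term-document counts is the LSA-style factorization:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD

docs = [
    "html javascript debug display elements",
    "stencil instructions part step rubric handin",
    "mac chrome firefox os windows linux",
    "html elements debug stencil part",
]

X = CountVectorizer().fit_transform(docs)   # term-document count matrix

lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(X)           # rows: docs, cols: topic weights

lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vecs = lsa.fit_transform(X)             # dense document vectors

print(doc_topics.shape, doc_vecs.shape)     # (4, 3) (4, 2)
```

`lda.components_` holds the topic-word weights (the “distribution over words” per topic), and each row of `doc_topics` is a document's distribution over topics.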

SLIDE 40

Topic Models

[Term-document matrix (columns: the, congress, parliament, US, UK; rows: doc1–doc4, with 0/1 entries) factorized by SVD into matrices U, D, V]

SLIDE 41

Topic Models

[Same SVD factorization] component = “topic”

SLIDE 42

Topic Models

[Same SVD factorization] component = “topic” = distribution over words

SLIDE 43

Topic Models

[Same SVD factorization] document = distribution over topics
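The factorization on these slides can be reproduced with NumPy. The exact 0/1 pattern of the original table is only partially recoverable from the deck, so the matrix below is an illustrative stand-in with the same terms:

```python
import numpy as np

terms = ["the", "congress", "parliament", "US", "UK"]
X = np.array([            # rows: doc1..doc4, cols: terms (illustrative)
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
    [1, 1, 0, 0, 1],
])

U, d, Vt = np.linalg.svd(X, full_matrices=False)
# U: documents x components, d: singular values (descending),
# Vt: components x terms -- each component is a "topic".
X_hat = U @ np.diag(d) @ Vt
print(np.allclose(X, X_hat))   # True: U D V^T reconstructs X
```

Rows of `U` give each document's weights over components (“distribution over topics”), and rows of `Vt` give each component's weights over terms (“distribution over words”).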


SLIDE 45

Today

  • More NLP!
  • Ngrams
  • Topic Models
  • Word Embeddings

SLIDE 46

Term-Document Matrix: “Bag of Words” (BOW)

[Matrix of counts: rows doc1–doc3, columns are vocabulary words (is, it, a, and, copy, …, markets, below, paste, remorse)]

SLIDE 47

Word-Context Matrix (Term-Term* Matrix): “Bag of Words” (BOW)

[Matrix of counts: rows are target words (markets, Washington, stimulus), columns are context words (is, it, a, and, copy, …)]

SLIDE 48

“Distributional Hypothesis”: the meaning of a word is determined by the contexts in which it is used

[Word-context count matrix as on the previous slide]
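A word-context matrix like the one above is just co-occurrence counts within a window. A minimal sketch (the two sentences and the window size are illustrative):

```python
from collections import Counter, defaultdict

corpus = [
    "markets fell after the stimulus news".split(),
    "Washington passed the stimulus bill".split(),
]

window = 2
cooc = defaultdict(Counter)   # cooc[word][context_word] = count
for sent in corpus:
    for i, w in enumerate(sent):
        # Count every word within `window` positions, excluding the word itself.
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                cooc[w][sent[j]] += 1

print(cooc["stimulus"])   # context counts for "stimulus"
```

Each row of `cooc` is a (sparse, high-dimensional) vector representation of a word; the distributional hypothesis says words with similar rows have similar meanings.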

SLIDE 49

Word-Context Matrix

[Word-context matrix (rows: market, Washington, stimulus, Brussels; columns: the, congress, parliament, US, UK) factorized by SVD into U, D, V]

SLIDE 50

Word Embeddings

[Same factorization: each row of U is a dense vector for a word]

SLIDE 51

Word Embeddings

SLIDE 52

Supervised Classification

Lovely mushroomy nose and good length. → 1
Good if not dramatic fizz. → 0
Rubbery - rather oxidised. → 0
Gamy, succulent tannins. Lovely. → 1
Quite raw finish. A bit rubbery. → 0
Provence herbs, creamy, lovely. → 1

y: the labels; X: binary bag-of-words features (columns: lovely, good, raw, rubbery, rather, mushroomy, gamy, …)

SLIDE 53

Supervised Classification

Same reviews and labels (y), but X now holds a D-dimensional vector for each word (“lovely”, “good”, “rubbery”, …) instead of a one-hot column.

SLIDE 54

Supervised Classification

Same setup: with vectors, similar words are no longer treated as entirely different words.

SLIDE 55

Supervised Classification

Same setup: with vectors, similar words are no longer treated as entirely different words (often just add up vectors when more than one word).
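Turning a multi-word review into one feature vector by adding up word vectors, as the slide suggests, is a one-liner with NumPy. The tiny 3-dimensional embeddings below are made up for illustration:

```python
import numpy as np

embeddings = {   # hypothetical D=3 word vectors
    "lovely":  np.array([0.9, 0.1, 0.0]),
    "good":    np.array([0.8, 0.2, 0.1]),
    "rubbery": np.array([-0.7, 0.3, 0.5]),
}

def featurize(text):
    """Sum the embeddings of known words (out-of-vocabulary words are skipped)."""
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    return np.sum(vecs, axis=0)

x = featurize("lovely good")
print(x)
```

Averaging instead of summing (dividing by the word count) is also common, since it keeps long and short documents on a comparable scale.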

SLIDE 56

Word Embeddings

[Word-context matrix (rows: market, Washington, stimulus, Brussels; columns: the, congress, parliament, US, UK) and its SVD factors U, D, V, as on Slide 49]

SLIDE 57

Word Embeddings

https://towardsdatascience.com/word2vec-skip-gram-model-part-1-intuition-78614e4d6e0b
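The skip-gram setup behind word2vec (the linked article's topic) starts from the same windowed co-occurrence idea: each word is trained to predict its neighbors, so the training data is just (target, context) pairs. A minimal sketch of that data preparation step (the sentence is illustrative):

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) training pairs within a symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the quick brown fox jumps".split()
print(skipgram_pairs(tokens)[:4])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
```

word2vec then learns dense vectors such that a target word's vector scores its true context words highly, rather than storing the raw counts directly.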

SLIDE 58

Word Embeddings (more in the DL lecture!)

https://towardsdatascience.com/word2vec-skip-gram-model-part-1-intuition-78614e4d6e0b

SLIDE 59

k bye