NLP!!! April 7, 2020 Data Science CSCI 1951A Brown University



slide-1
SLIDE 1

NLP!!!

April 7, 2020 Data Science CSCI 1951A Brown University Instructor: Ellie Pavlick HTAs: Josh Levin, Diane Mutako, Sol Zitter

1

slide-2
SLIDE 2

Announcements

  • S/NC Option
  • “Special Topics”
  • Questions/Concerns?

2

slide-3
SLIDE 3

Today

  • “1990s NLP”…i.e. counting words :)
  • Bags-of-words, Preprocessing
  • “Tools for working with text”
  • No Machine Learning today
  • More on Thursday…

3

slide-4
SLIDE 4

Resources

  • Tokenization, Tagging, Parsing, all sorts of fancy things
  • NLTK: https://www.nltk.org/
  • Spacy: https://spacy.io/

4

slide-5
SLIDE 5

Ways you might use NLP

  • You want to use text as a feature for some prediction task
  • Classify sentiment in twitter, predict popularity of posts, track spread of articles/ideas across the country
  • You want to make predictions/test hypotheses about language itself
  • Model changes in word use over time/across locations, find words that cause articles to be shared
  • Clustering of text data
  • In either of the above use cases
  • Are these words similar, is this document similar to this query, are these documents similar to each other, etc…

5


slide-9
SLIDE 9

Unit of analysis

  • Characters (“s” “w” “i” “m” “m” “i” “n” “g” “l” “y”)
  • Morphemes (“swim” “ing” “ly”)
  • Words (“swimmingly”)
  • Sentences (“remote instruction is going swimmingly”)
  • Documents (“Remote instruction is going swimmingly. Yesterday, for example, a student said…”)

9

slide-10
SLIDE 10

Compositionality

“meaning of the whole is a function of the meaning of the parts and the way in which they are combined”

10

slide-11
SLIDE 11

Compositionality

Words

11

slide-12
SLIDE 12

Compositionality

Words Sentences

12

slide-13
SLIDE 13

Compositionality

Words Sentences = f(Words, Syntax)

13

slide-14
SLIDE 14

Compositionality

Words Sentences = f(Words, Syntax) Documents = f(Sentences, Discourse)

14

slide-15
SLIDE 15

Compositionality

Words Sentences = f(Words, Syntax) Documents = f(Sentences, Discourse) Very difficult… (impossible?) …to achieve

15

slide-16
SLIDE 16

Compositionality

Words Sentences = f(Words, Syntax) Documents = f(Sentences, Discourse) Very difficult… (impossible?) …to achieve horse shoes ≈ alligator shoes?

16

slide-17
SLIDE 17

Unit of analysis

  • Characters
  • Morphemes
  • Words
  • Sentences
  • Documents

17

slide-18
SLIDE 18

Today

  • Characters
  • Morphemes
  • Words
  • Sentences
  • Documents

18

slide-19
SLIDE 19

Today

  • Characters
  • Morphemes
  • Words
  • Sentences
  • Documents

(We often treat sentences just like short documents, though)

19

slide-20
SLIDE 20

“Bag of Words” (BOW)

  • Represent sentences/documents as just an unordered set of words

  • Basis of most of modern NLP
  • Information Retrieval/Search
  • Clustering/Recommendation
  • As input to most ML models
  • Changing a bit for sentences, but not for documents (yet)

20
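In code, a bag of words is just word counting; a minimal sketch in plain Python (no NLP library, and with only whitespace tokenization):

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace; word order is discarded, only counts survive.
    # A real pipeline would tokenize properly (see NLTK/spaCy later).
    return Counter(text.lower().split())

bow = bag_of_words("is it ok to copy and paste the data or is there a filereader")
print(bow["is"])  # -> 2
```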


slide-24
SLIDE 24

“Bag of Words” (BOW)

Changes I make to the nations.js file do not affect any of the html in after I load the nations.html file When I try to display dots from part 2 on my mac (tried chrome, firefox, and safari), nothing is displayed (and the elements do not appear in the html). Is it ok to copy and paste the data into javascript, or is there a filereader that can open a local file?

24

slide-25
SLIDE 25

“Bag of Words” (BOW)

Is it ok to copy and paste the data into javascript, or is there a filereader that can open a local file?

[one-hot row vector over the vocabulary columns: is, it, a, and, copy, …, markets, below, paste, remorse]

25

slide-26
SLIDE 26

“Bag of Words” (BOW)

Is it ok to copy and paste the data into javascript, or is there a filereader that can open a local file?

[one-hot row vector over the vocabulary columns: is, it, a, and, copy, …, markets, below, paste, remorse] “one hot”

26

slide-27
SLIDE 27

“Bag of Words” (BOW)

Is it ok to copy and paste the data into javascript, or is there a filereader that can open a local file?

[count row vector over the vocabulary columns: is, it, a, and, copy, …, markets, below, paste, remorse] counts/frequencies

27

slide-28
SLIDE 28

[term-document matrix: rows = doc 1, doc 2, doc 3; columns = is, it, a, and, copy, …, markets, below, paste, remorse; entries = word counts]

“Bag of Words” (BOW)

28

slide-29
SLIDE 29

[term-document matrix: rows = doc 1, doc 2, doc 3; columns = is, it, a, and, copy, …, markets, below, paste, remorse; entries = word counts]

“Bag of Words” (BOW)

“Term Document Matrix”

29

slide-30
SLIDE 30

[term-document matrix: rows = doc 1, doc 2, doc 3; columns = is, it, a, and, copy, …, markets, below, paste, remorse; entries = word counts]

“Bag of Words” (BOW)

How similar are document 1 and document 2?

30

slide-31
SLIDE 31

31

Similarity Metrics

slide-32
SLIDE 32
  • Edit Distance: Minimal number of edits (inserts, deletes, substitutions) needed to transform string 1 into string 2.
  • Jaccard Similarity: words in common / total words
  • Cosine Similarity: by far the most popular metric

Similarity Metrics

32

slide-33
SLIDE 33
  • Edit Distance: Minimal number of edits (inserts, deletes, substitutions) needed to transform string 1 into string 2.
  • Jaccard Similarity: words in common / total words
  • Cosine Similarity: by far the most popular metric

Similarity Metrics

Thoughts?

33

slide-34
SLIDE 34
  • Edit Distance: Minimal number of edits (inserts, deletes, substitutions) needed to transform string 1 into string 2.

  • Jaccard Similarity: words in common / total words
  • Cosine Similarity: by far the most popular metric

Similarity Metrics

34
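The first two metrics are easy to implement directly; a small sketch (standard dynamic-programming edit distance, set-based Jaccard):

```python
def edit_distance(s1, s2):
    # Classic dynamic program: minimal inserts/deletes/substitutions.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                 # delete
                           cur[j - 1] + 1,              # insert
                           prev[j - 1] + (c1 != c2)))   # substitute (free if equal)
        prev = cur
    return prev[-1]

def jaccard(words1, words2):
    # Words in common / total distinct words.
    a, b = set(words1), set(words2)
    return len(a & b) / len(a | b)
```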

slide-35
SLIDE 35

Clicker Question!

35

slide-36
SLIDE 36

Changes I make do not affect any of the html in after I load the nations html file When I try to display dots from part 2 the elements do not appear in the html.

Clicker Question! a) The first one b) The second one c) Yes

Which document is more relevant to the query, according to Jaccard?

html does not work

Query doc 1 doc 2

36

slide-37
SLIDE 37

Changes I make do not affect any of the html in after I load the nations html file When I try to display dots from part 2 the elements do not appear in the html.

Clicker Question! a) The first one b) The second one c) Yes

Which document is more relevant to the query, according to Jaccard?

html does not work

Query doc 1 doc 2

assume one-hot (frequency doesn’t matter), ignore case/punctuation

37


slide-39
SLIDE 39

Changes I make do not affect any of the html in after I load the nations html file When I try to display dots from part 2 the elements do not appear in the html.

Clicker Question! a) The first one b) The second one c) Yes

Which document is more relevant to the query, according to Jaccard?

html does not work

Query doc 1 doc 2 2/(4 + 17) = 0.095 2/(4+18) = 0.091

39
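The two scores above are just intersection over union; checking the arithmetic:

```python
# Query shares 2 words with each doc; the unions have 21 and 22 distinct words.
doc1_score = 2 / (4 + 17)  # = 2 / 21
doc2_score = 2 / (4 + 18)  # = 2 / 22
print(round(doc1_score, 3), round(doc2_score, 3))  # -> 0.095 0.091
```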


slide-42
SLIDE 42
  • Edit Distance: Minimal number of edits (inserts, deletes, substitutions) needed to transform string 1 into string 2.

  • Jaccard Similarity: words in common / total words
  • Cosine Similarity: by far the most popular metric

Similarity Metrics

42

slide-43
SLIDE 43

Cosine Similarity

Changes I make do not affect any of the html in after I load the nations html file

[2-D plot: the document as a vector of “do” and “html” counts]

43

slide-44
SLIDE 44

Cosine Similarity

Changes I make do not affect any of the html in after I load the nations html file

[2-D plot: the document as a vector of “do” and “html” counts]

When I try to display dots from part 2 …the elements do not appear in the html.

44

slide-45
SLIDE 45

Cosine Similarity

Changes I make do not affect any of the html in after I load the nations html file

When I try to display dots from part 2 …the elements do not appear in the html.

[2-D plot: the two documents as vectors of “do” and “html” counts, with angle θ between them]

45

slide-46
SLIDE 46

Clicker Question!

46

slide-47
SLIDE 47

Clicker Question! a) doc1 b) doc2 c) Yes

Which document is more relevant to the query, according to cosine?

[one-hot term-document matrix: rows = query, doc 1, doc 2; columns = html, does, not, work, at, all, is, awesome, webdev]

47

slide-48
SLIDE 48

Clicker Question! a) doc1 b) doc2 c) Yes

Which document is more relevant to the query, according to cosine?

[one-hot term-document matrix: rows = query, doc 1, doc 2; columns = html, does, not, work, at, all, is, awesome, webdev]

doc 1: 3/(√6√6) = 0.5   doc 2: 3/(√6√4) = 0.6

48
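Cosine similarity on bag-of-words vectors is the dot product over shared words divided by the vector norms; a small sketch using count dictionaries (the query/doc strings here are toy stand-ins, not the slide's exact vectors):

```python
import math
from collections import Counter

def cosine(bow1, bow2):
    # Dot product over shared words, divided by the two vector norms.
    dot = sum(bow1[w] * bow2[w] for w in bow1 if w in bow2)
    norm1 = math.sqrt(sum(v * v for v in bow1.values()))
    norm2 = math.sqrt(sum(v * v for v in bow2.values()))
    return dot / (norm1 * norm2)

q = Counter("html does not work".split())
d = Counter("html does work".split())
# cosine(q, d) = 3 / (sqrt(4) * sqrt(3))
```

Unlike Jaccard, the counts matter: repeating a word stretches the vector along that axis, but the angle (and so the cosine) is unchanged by document length.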


slide-51
SLIDE 51

Linguistic Preprocessing

51

slide-52
SLIDE 52

Linguistic Preprocessing

Language is ambiguous but also redundant

52

slide-53
SLIDE 53

Linguistic Preprocessing

Language is ambiguous but also redundant

They freaked out when they found the bug in their apartment.

53


slide-55
SLIDE 55

Linguistic Preprocessing

Language is ambiguous but also redundant

They’ve always been terrified of anything crawly. They freaked out when they found the bug in their apartment.

55

slide-56
SLIDE 56

Linguistic Preprocessing

They ran back the CIT right away to tell everyone they’d finally figured it out. They freaked out when they found the bug in their apartment.

Language is ambiguous but also redundant

56

slide-57
SLIDE 57

Linguistic Preprocessing

They ran back the CIT right away to tell everyone they’d finally figured it out. They freaked out when they found the problem in their apartment.

Language is ambiguous but also redundant

57

slide-58
SLIDE 58

Linguistic Preprocessing

Constant Tradeoff

58

slide-59
SLIDE 59

Linguistic Preprocessing

Constant Tradeoff Collapse! Try to treat more words as though they are the same

59

slide-60
SLIDE 60

Linguistic Preprocessing

Constant Tradeoff Collapse! Try to treat more words as though they are the same Differentiate! Try to preserve as much differences/ nuance as possible

60

slide-61
SLIDE 61

Linguistic Preprocessing

Constant Tradeoff Collapse! Try to treat more words as though they are the same Differentiate! Try to preserve as much differences/ nuance as possible normalization, stemming tagging, collocations

61

slide-62
SLIDE 62

Linguistic Preprocessing

62

slide-63
SLIDE 63

Linguistic Preprocessing

I am trying to display dots from Part 2 on my mac (tried Chrome, Firefox , and Safari), but nothing is displayed (and the elements do not appear in the html).

63

slide-64
SLIDE 64
  • Tokenization (Phrasal Collocations/Morphological Analysis?)

  • Punctuation — “okay…” vs. “okay!”
  • Normalization — “Trump” vs. “trump”
  • Stop words — “pb and jelly” vs. “pb or jelly”
  • Tagging — “fish fish fish fish fish”
  • Remove out-of-vocabulary (OOV)

Linguistic Preprocessing

I am trying to display dots from Part 2 on my mac (tried Chrome, Firefox , and Safari), but nothing is displayed (and the elements do not appear in the html).

64

slide-65
SLIDE 65

Linguistic Preprocessing

I am trying to display dots from Part 2 on my mac ( tried Chrome , Firefox , and Safari ) , but nothing is displayed ( and the elements do not appear in the html ) .

65

slide-66
SLIDE 66

Linguistic Preprocessing

I am trying to display dots from Part 2 on my mac ( tried Chrome , Firefox , and Safari ) , but nothing is displayed ( and the elements do not appear in the html ) .

日文章魚怎麼說? “How to say octopus in Japanese?”

66

slide-67
SLIDE 67

Linguistic Preprocessing

I am trying to display dots from Part 2 on my mac ( tried Chrome , Firefox , and Safari ) , but nothing is displayed ( and the elements do not appear in the html ) .

日文章魚怎麼說? “How to say octopus in Japanese?” 日文 章魚 怎麼 說 ? Japanese octopus how say ?

67

slide-68
SLIDE 68

Linguistic Preprocessing

I am trying to display dots from Part 2 on my mac tried Chrome Firefox and Safari but nothing is displayed and the elements do not appear in the html

68

slide-69
SLIDE 69

Linguistic Preprocessing

i am trying to display dots from part 2 on my mac tried chrome firefox and safari but nothing is displayed and the elements do not appear in the html

69

slide-70
SLIDE 70

Linguistic Preprocessing

i be try to display dot from part 2 on my mac try chrome firefox and safari but nothing be display and the element do not appear in the html

70

slide-71
SLIDE 71

Linguistic Preprocessing

i be try to display dot from part <NUM> on my mac try chrome firefox and safari but nothing be display and the element do not appear in the html

71

slide-72
SLIDE 72

Linguistic Preprocessing

try display dot part <NUM> mac try chrome firefox safari nothing display element not appear html

72

slide-73
SLIDE 73

Linguistic Preprocessing

try_VB display_VB dot_NN part_NN <NUM>_NUM mac_NNP try_VB chrome_NNP firefox_NNP safari_NNP nothing_DT display_VB element_NNP not_RB appear_VB html_NN

73

slide-74
SLIDE 74

Linguistic Preprocessing

try_VB display_VB dot_NN part_NN <NUM>_NUM mac_NNP try_VB chrome_NNP <OOV> <OOV> nothing_DT display_VB element_NNP not_RB appear_VB html_NN

74
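The preprocessing steps walked through on these slides (lowercasing, punctuation stripping, number collapsing, stop-word removal, OOV marking) can be sketched in plain Python. The stop-word list and vocabulary below are toy stand-ins; lemmatization and POS tagging, which the slides also apply, would come from a real library like NLTK or spaCy:

```python
import re

# Toy stand-ins: a real pipeline would use a curated stop list
# (e.g. nltk.corpus.stopwords) and a corpus-derived vocabulary.
STOP_WORDS = {"i", "am", "to", "from", "on", "my", "and", "but", "is", "the", "do", "in"}
VOCAB = {"try", "display", "dot", "part", "<NUM>", "mac", "chrome",
         "nothing", "element", "not", "appear", "html"}

def preprocess(text):
    tokens = re.findall(r"[a-z]+|[0-9]+", text.lower())        # tokenize, lowercase, drop punctuation
    tokens = ["<NUM>" if t.isdigit() else t for t in tokens]   # collapse numbers
    tokens = [t for t in tokens if t not in STOP_WORDS]        # remove stop words
    return ["<OOV>" if t not in VOCAB else t for t in tokens]  # mark out-of-vocabulary
```

Each step collapses distinctions: after this pipeline, “Part 2” and “part <NUM>” land on the same tokens, which is exactly the collapse-vs-differentiate tradeoff from earlier.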


slide-76
SLIDE 76
  • Remove frequent words? (“stop words”)
  • Remove rare words? (unlikely to appear in test)
  • Remove uninteresting words? (tf-idf? pmi?)
  • Try to add a little syntax? (POS tags? ngrams? pmi?)

Choosing a vocabulary

(what goes on the columns)

76


slide-78
SLIDE 78

Zipf’s Law

[plot: word frequency vs. word rank]

https://en.wikipedia.org/wiki/Zipf%27s_law

78

slide-79
SLIDE 79

Zipf’s Law

[plot: word frequency vs. word rank] The most frequent 0.2% of words make up 50% of occurrences.

79

slide-80
SLIDE 80

Zipf’s Law

[plot: word frequency vs. word rank]

80

“stop words”: a, the, of, and, …

slide-81
SLIDE 81

Zipf’s Law

[plot: word frequency vs. word rank]

81

“stop words”: a, the, of, and, … (or use nltk.corpus.stopwords…)

slide-82
SLIDE 82
  • Remove frequent words? (“stop words”)
  • Remove rare words? (unlikely to appear in test)
  • Remove uninteresting words? (tf-idf? pmi?)
  • Try to add a little syntax? (POS tags? ngrams? pmi?)

Choosing a vocabulary

(what goes on the columns)

82

slide-83
SLIDE 83

Zipf’s Law

[plot: word frequency vs. word rank]

83

Usually set some vocab size (around 30K) or some min count (around 3)

slide-84
SLIDE 84

Zipf’s Law

[plot: word frequency vs. word rank]

84

Usually set some vocab size (around 30K) or some min count (around 3). Seems arbitrary? That’s because it is.
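A min-count vocabulary cutoff is a one-liner over the corpus counts; a sketch:

```python
from collections import Counter

def build_vocab(tokens, min_count=3):
    # Keep words seen at least min_count times; rarer words
    # would later be mapped to <OOV>.
    counts = Counter(tokens)
    return {w for w, c in counts.items() if c >= min_count}
```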

slide-85
SLIDE 85
  • Remove frequent words? (“stop words”)
  • Remove rare words? (unlikely to appear in test)
  • Remove uninteresting words? (tf-idf? pmi?)
  • Try to add a little syntax? (POS tags? ngrams? pmi?)

Choosing a vocabulary

(what goes on the columns)

85

slide-86
SLIDE 86
  • Term-Frequency Inverse-Document-Frequency
  • Assigns higher weights to words that differentiate this document from other documents
  • tf-idf(word,doc) = (# times word appears in doc) / (# of times word appears across all documents)
  • Can filter out low tf-idf words or else just reweight the term-document matrix accordingly

Tf-Idf

86
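A sketch of the slide's simplified tf-idf ratio. Note this is the deck's own variant (count in this doc over the word's total count across documents); the more common textbook form multiplies tf by log(N/df) instead:

```python
from collections import Counter

def tf_idf(word, doc, docs):
    # Slide's simplified form: frequent-everywhere words get weights near
    # tf/total, while words concentrated in this doc get weights near 1.
    tf = Counter(doc)[word]
    total = sum(Counter(d)[word] for d in docs)
    return tf / total if total else 0.0

docs = [["html", "does", "not", "work"], ["html", "does", "work"]]
# tf_idf("not", docs[0], docs) -> 1.0 ("not" only occurs here)
# tf_idf("html", docs[0], docs) -> 0.5 ("html" occurs everywhere)
```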

slide-87
SLIDE 87

Clicker Question!

87

slide-88
SLIDE 88

webdev: html does work html does work. all webdev is awesome.

Clicker Question!

html does not work

doc1 doc 2 doc 3

[one-hot term-document matrix: rows = doc 1, doc 2, doc 3; columns = html, does, not, work, at, all, is, awesome, webdev]

88

slide-89
SLIDE 89

webdev: html does work html does work. all webdev is awesome.

Clicker Question!

html does not work

doc1 doc 2 doc 3

[one-hot term-document matrix: rows = doc 1, doc 2, doc 3; columns = html, does, not, work, at, all, is, awesome, webdev]

What is the tf-idf vector for doc1?

[answer choices a), b), c): candidate weight vectors with entries drawn from 1/3, 1/2, 1]

89

slide-90
SLIDE 90

webdev: html does work html does work. all webdev is awesome. html does not work

Clicker Question!

[one-hot term-document matrix: rows = doc 1, doc 2, doc 3; columns = html, does, not, work, at, all, is, awesome, webdev]

[answer choices a), b), c): candidate weight vectors with entries drawn from 1/3, 1/2, 1]

What is the tf-idf vector for doc1?

Document frequencies (df): html 3, does 3, not 1, work 2, at 1, all 2, webdev 2, is 1, awesome 1

90


slide-92
SLIDE 92
  • Pointwise Mutual Information
  • Again: assigns higher weights to words that differentiate this document from other documents
  • PMI(word,doc) = log P(word|doc)/P(word)
  • Used more for finding word-label relationships or word-word collocations (more info in two seconds)

PMI

92
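PMI as defined above, sketched for tokenized documents (maximum-likelihood probabilities straight from counts, no smoothing):

```python
import math
from collections import Counter

def pmi(word, doc, docs):
    # PMI(word, doc) = log P(word | doc) / P(word)
    p_word_given_doc = Counter(doc)[word] / len(doc)
    corpus = [t for d in docs for t in d]          # pool all documents
    p_word = Counter(corpus)[word] / len(corpus)
    return math.log(p_word_given_doc / p_word)

docs = [["a", "b"], ["c", "d"]]
# "a" is twice as likely in docs[0] as in the corpus overall: PMI = log 2
```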

slide-93
SLIDE 93
  • Remove frequent words? (“stop words”)
  • Remove rare words? (unlikely to appear in test)
  • Remove uninteresting words? (tf-idf? pmi?)
  • Try to add a little syntax? (POS tags? ngrams? pmi?)

Choosing a vocabulary

(what goes on the columns)

93

slide-94
SLIDE 94
  • N-length sequence of words (unigrams, bigrams, trigrams, 4-grams, …)
  • Provides some context (differentiating “cute dog” from “hot dog”)
  • Blows up size of vocabulary, increases sparsity

N-Grams

94

slide-95
SLIDE 95

html does work . all webdev is awesome.

N-Grams

1gms: [‘html’, ‘does’, ‘work’, ‘.’, ‘all’, …] 2gms: [‘html does’, ‘does work’, ‘work .’, ‘. all’, …] 3gms: [‘html does work’, ‘does work .’, ‘work . all’, …] skip-gms: [‘html does’, ‘html work’, ‘does html’, ‘does work’, ‘does .’, …]

95
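Extracting n-grams is just a sliding window over the token list; a sketch:

```python
def ngrams(tokens, n):
    # All contiguous length-n windows over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

toks = "html does work . all webdev is awesome .".split()
bigrams = ngrams(toks, 2)   # ('html', 'does'), ('does', 'work'), …
trigrams = ngrams(toks, 3)  # ('html', 'does', 'work'), …
```

Each extra n multiplies the number of possible features, which is the vocabulary blow-up the slide warns about.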

slide-96
SLIDE 96
  • Try to find just the interesting phrases (e.g. hot dog) by finding words that occur together above chance
  • Often use PMI for this

Collocations

96

slide-97
SLIDE 97

97

slide-98
SLIDE 98

Changes I make to the nations.js file do not affect any of the html in after I load the nations.html file When I try to display dots from part 2 on my mac (tried chrome, firefox, and safari), the elements do not appear in the html. Can you elaborate on exactly what the directions are in part 2 step 3, the stencil code does not quite imply what we are supposed to do…

Topic Models

98

slide-99
SLIDE 99

Topic Models

Changes I make to the nations.js file do not affect any of the html in after I load the nations.html file When I try to display dots from part 2 on my mac (tried chrome, firefox, and safari), the elements do not appear in the html. Can you elaborate on exactly what the directions are in part 2 step 3, the stencil code does not quite imply what we are supposed to do…

instructions: stencil, instructions, part, step, rubric, handin… UI: html, javascript, debug, display, elements… systems: mac, windows, linux, chrome, firefox, os… fillers: I, you, when, the, and, a

99

slide-100
SLIDE 100

Topic Models

Where do documents come from? “The generative story”


100

slide-101
SLIDE 101

Topic Models

Where do documents come from? “The generative story”

  • 1. Sample a topic


101

slide-102
SLIDE 102

Topic Models

You

Where do documents come from? “The generative story”

  • 2. Sample a word from that topic


102

slide-103
SLIDE 103

Topic Models

You

Where do documents come from? “The generative story”


  • 1. Sample a topic

103

slide-104
SLIDE 104

Topic Models

You javascript

Where do documents come from? “The generative story”


  • 2. Sample a word from that topic

104

slide-105
SLIDE 105

Topic Models

Where do documents come from? “The generative story”


  • 1. Sample a topic

You javascript

105

slide-106
SLIDE 106

Topic Models

Where do documents come from? “The generative story”


You javascript handin

  • 2. Sample a word from that topic

106
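The two-step generative story (sample a topic, then sample a word from that topic) can be sketched with Python's random module; the topics, words, and probabilities below are made-up toy values, not fitted parameters:

```python
import random

# Toy topic model (made-up probabilities).
TOPICS = {
    "UI": {"html": 0.5, "javascript": 0.3, "display": 0.2},
    "instructions": {"stencil": 0.6, "part": 0.4},
}
TOPIC_MIX = {"UI": 0.7, "instructions": 0.3}  # the document's topic mixture

def generate_word(rng):
    # Step 1: sample a topic from the mixture.
    topic = rng.choices(list(TOPIC_MIX), weights=list(TOPIC_MIX.values()))[0]
    # Step 2: sample a word from that topic's distribution.
    words = TOPICS[topic]
    return rng.choices(list(words), weights=list(words.values()))[0]

rng = random.Random(0)
doc = [generate_word(rng) for _ in range(10)]
```

Repeating the two steps word by word yields a whole document; fitting a topic model runs this story in reverse, choosing the distributions that make the observed documents most probable.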

slide-107
SLIDE 107

Topic Models

“Latent Semantic Analysis” (LSA)

107

slide-108
SLIDE 108

Topic Models

“Latent Semantic Analysis” (LSA) “latent” variable (not observed)

108

slide-109
SLIDE 109

Topic Models

“Latent Semantic Analysis” (LSA) words are determined by topic (and are conditionally independent of each other)

109

slide-110
SLIDE 110

Topic Models

“Latent Semantic Analysis” (LSA) documents are a distribution over topics

110

slide-111
SLIDE 111

Topic Models

“Latent Semantic Analysis” (LSA) set parameters to maximize probability of observations

111

slide-112
SLIDE 112

Topic Models

part 2 html does not work

112

slide-113
SLIDE 113

Topic Models

part 2 html does not work

[bar chart: the document’s mixture over Topic1–Topic4]

113

slide-114
SLIDE 114

Topic Models

part 2 html does not work

[bar charts: the document’s mixture over Topic1–Topic4, plus per-topic word distributions over html, javascript, work, handin, part, stencil]

114

slide-115
SLIDE 115

Clicker Question!

115

slide-116
SLIDE 116

Clicker Question!

Which is the best parameter setting for the observed data?

part <NUM> html does not work

[bar charts: Topic 1 and Topic 2 as distributions over the words part, <NUM>, html, does, not, work, with probabilities drawn from 0.1–0.4]

(a) topic mixture: Topic1 50, Topic2 50
(b) topic mixture: Topic1 67, Topic2 33

116

slide-117
SLIDE 117

Clicker Question!

Which is the best parameter setting for the observed data?

part <NUM> html does not work

[bar charts: Topic 1 and Topic 2 word distributions; (a) mixture 50/50, (b) mixture 67/33]

a: (0.3+0.2+0+0.1+0.1+0.2)x0.5 + (0+0.3+0.4+0.1+0.2)x0.5 = 0.45 + 0.5 = 0.95

117
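The slide's arithmetic in code: sum each topic's probabilities for the observed words, then weight by the topic mixture (the 0.x values are read off the slide's bar charts):

```python
# Per-topic probability mass of the observed words (values from the slide).
topic1_mass = 0.3 + 0.2 + 0.0 + 0.1 + 0.1 + 0.2   # = 0.9
topic2_mass = 0.0 + 0.3 + 0.4 + 0.1 + 0.2          # = 1.0

score_a = topic1_mass * 0.5 + topic2_mass * 0.5    # (a): 50/50 mixture -> 0.95
score_b = topic1_mass * 0.33 + topic2_mass * 0.67  # (b): 67/33 mixture -> 0.967
```

Since (b) puts more weight on the topic that explains the observed words better, it scores higher, which is the point of the clicker question.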


slide-120
SLIDE 120

Clicker Question!

Which is the best parameter setting for the observed data?

part <NUM> html does not work

[bar charts: Topic 1 and Topic 2 word distributions; (a) mixture 50/50, (b) mixture 67/33]

b: (0.3+0.2+0+0.1+0.1+0.2)x0.33 + (0+0.3+0.4+0.1+0.2)x0.67 = 0.297 + 0.67 = 0.967

120

slide-121
SLIDE 121

Topic Models

121

slide-122
SLIDE 122

Topic Models

[term-document matrix: rows doc1–doc4, columns the, congress, parliament, US, UK; factored by SVD into U (documents × components), D (diagonal of singular values 3.06, 1.81, 0.57, 0.00), and V (words × components)]

122

slide-123
SLIDE 123

Topic Models

[term-document matrix and its SVD factorization U, D, V, as on the previous slide]

component = “topic”

123

slide-124
SLIDE 124

Topic Models

[term-document matrix and its SVD factorization U, D, V, as on the previous slide]

component = “topic” = distribution over words

124

slide-125
SLIDE 125

Topic Models

[term-document matrix and its SVD factorization U, D, V, as on the previous slide]

document = distribution over topics

125

slide-126
SLIDE 126

k bye

126