Vector Space Models, Prof. Sameer Singh, CS 295: STATISTICAL NLP

SLIDE 1

Vector Space Models

  • Prof. Sameer Singh

CS 295: STATISTICAL NLP WINTER 2017

January 19, 2017

Based on slides from Jacob Eisenstein, Noah Smith, Mohit Bansal, Richard Socher, and everyone else they copied from.

SLIDE 2

Outline

  • Latent Semantic Analysis
  • Vector Models for Words
  • Direct Embeddings
  • Reducing the Dimensions

SLIDE 3

Outline

  • Latent Semantic Analysis
  • Vector Models for Words
  • Direct Embeddings
  • Reducing the Dimensions

SLIDE 4

Example: Documents

c1: Human machine interface for ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user perceived response time to error measurement
m1: The generation of random, binary, ordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey

From http://lsa.colorado.edu/papers/dp1.LSAintro.pdf

SLIDE 5

Example: Term-Doc Matrix

[Table: term-document count matrix. Rows: human, interface, computer, user, system, response, time, EPS, survey, trees, graph, minors. Columns: c1, c2, c3, c4, c5, m1, m2, m3, m4.]
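The counts in this matrix can be rebuilt from the nine example titles. A minimal Python sketch, assuming simple lowercase word tokenization; the names `DOCS`, `TERMS`, and `term_doc_matrix` are illustrative and not from the lecture:

```python
import re
from collections import Counter

# The nine example titles from the LSA introduction cited in the slides.
DOCS = {
    "c1": "Human machine interface for ABC computer applications",
    "c2": "A survey of user opinion of computer system response time",
    "c3": "The EPS user interface management system",
    "c4": "System and human system engineering testing of EPS",
    "c5": "Relation of user perceived response time to error measurement",
    "m1": "The generation of random, binary, ordered trees",
    "m2": "The intersection graph of paths in trees",
    "m3": "Graph minors IV: Widths of trees and well-quasi-ordering",
    "m4": "Graph minors: A survey",
}
TERMS = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]

def term_doc_matrix(docs, terms):
    """Return raw counts: one row per term, one column per document."""
    counts = {name: Counter(re.findall(r"[a-z]+", text.lower()))
              for name, text in docs.items()}
    return [[counts[name][term.lower()] for name in docs] for term in terms]

matrix = term_doc_matrix(DOCS, TERMS)

# Print the matrix with document names as the header row.
print(" " * 10 + " ".join(f"{name:>2}" for name in DOCS))
for term, row in zip(TERMS, matrix):
    print(f"{term:<10}" + " ".join(f"{v:>2}" for v in row))
```

Most cells are zero, and "system" is the only term that occurs twice in a single title (c4).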

SLIDE 6

Problems with Sparse Vectors

c1: Human machine interface for ABC computer applications
c2: A survey of user opinion of computer system response time
m4: Graph minors: A survey
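One way to read this slide is that similarity between sparse count vectors depends entirely on exact word overlap. The sketch below is an illustration under that assumption: it restricts each title to the twelve index terms of the term-document matrix and compares the three titles by cosine similarity. The helper names `bag` and `cosine` are made up for the example.

```python
import math
from collections import Counter

# The twelve index terms from the term-document matrix, lowercased.
TERMS = {"human", "interface", "computer", "user", "system", "response",
         "time", "eps", "survey", "trees", "graph", "minors"}

def bag(text):
    """Count only the index terms; every other word is ignored."""
    words = text.lower().replace(":", " ").split()
    return Counter(w for w in words if w in TERMS)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

c1 = bag("Human machine interface for ABC computer applications")
c2 = bag("A survey of user opinion of computer system response time")
m4 = bag("Graph minors: A survey")

print("cos(c1, c2) =", round(cosine(c1, c2), 3))
print("cos(c2, m4) =", round(cosine(c2, m4), 3))
print("cos(c1, m4) =", round(cosine(c1, m4), 3))
```

Under these assumptions c2 scores the same against c1 (shared term: computer) as against m4 (shared term: survey), even though c2 and m4 come from different topic groups, and c1 and m4 share no terms at all.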

SLIDE 7

Example: Distance Matrix

[Figure: document-by-document distance matrix. Rows and columns: c1, c2, c3, c4, c5, m1, m2, m3, m4.]

SLIDE 8

Option 2: SVD

SLIDE 9

Latent Semantic Analysis (LSA)

SLIDE 10

Example: Decomposition

[Figure: SVD of the term-document matrix. Rows: human, interface, computer, user, system, response, time, EPS, survey, trees, graph, minors. Columns: c1, c2, c3, c4, c5, m1, m2, m3, m4.]
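The decomposition can be sketched with NumPy's `np.linalg.svd`. The counts below are derived from the nine example titles. The choice of k = 2 latent dimensions and the scaling of document vectors by the singular values are assumptions for illustration, not necessarily the settings used in the lecture.

```python
import numpy as np

DOCS = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]
# Rows follow the term order: human, interface, computer, user, system,
# response, time, EPS, survey, trees, graph, minors.
A = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 2, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
], dtype=float)

# Thin SVD: A = U @ diag(s) @ Vt, singular values sorted in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                      # assumed number of latent dimensions
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T     # one k-dimensional row per document
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]   # rank-k reconstruction of A

def cos(i, j):
    """Cosine similarity between documents i and j in the latent space."""
    a, b = doc_vecs[i], doc_vecs[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

c1, c2, m4 = DOCS.index("c1"), DOCS.index("c2"), DOCS.index("m4")
print("singular values:", np.round(s, 2))
print("cos(c1, c2):", round(cos(c1, c2), 2))
print("cos(c2, m4):", round(cos(c2, m4), 2))
```

In the two-dimensional space c2 ends up closer to c1 than to m4, which the raw counts could not show. `A_k` is no longer sparse: terms receive non-zero weight in documents where they never occurred.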

SLIDE 11

New Document Vectors

SLIDE 12

Example: Reconstruction

[Table: reconstructed term-document matrix. Rows: human, interface, computer, user, system, response, time, EPS, survey, trees, graph, minors.]

SLIDE 13

Example: Distance Matrix

SLIDE 14

Outline

  • Latent Semantic Analysis
  • Vector Models for Words
  • Direct Embeddings
  • Reducing the Dimensions

SLIDE 15

Let’s look at words

A bottle of tezguino is on the table.
Everybody likes tezguino.
Tezguino makes you drunk.
We make tezguino out of corn.

What does tezguino mean?
Loud, motor oil, tortillas, choices, wine

“You shall know a word by the company it keeps.” (Firth, 1957)

SLIDE 16

Term-Context Matrix

C1: A bottle of ______ is on the table.
C2: Everybody likes ______.
C3: ______ makes you drunk.
C4: We make ______ out of corn.

[Table: term-context matrix. Rows: tezguino, loud, motor oil, tortillas, choices, wine. Columns: C1, C2, C3, C4.]

SLIDE 17

What is a “Context”?

Can be anything you want!

  • Entire contents of the sentence
  • One word before and after
  • Words in the same sentence
  • Document it appears in
  • Many other variations…

A bottle of tezguino is on the table.
Tezguino makes you drunk.
…
I had a fancy bottle of wine and got drunk last night!
The terrible wine is on the table.

SLIDE 18

What is a “Context”?

Can be anything you want!

  • Entire contents of the sentence
      • Unlikely to occur again!
  • One word before and after
  • Words in the same sentence
  • Document ID it appears in
  • Many other variations…

A bottle of tezguino is on the table.
Tezguino makes you drunk.
…
I had a fancy bottle of wine and got drunk last night!
The terrible wine is on the table.

[Table: rows tezguino, wine; columns C1, C2, C3, C4.]

SLIDE 19

What is a “Context”?

Can be anything you want!

  • Entire contents of the sentence
  • One word before and after
      • Or n-words
  • Words in the same sentence
  • Document it appears in
  • Many other variations…

A bottle of tezguino is on the table.
Tezguino makes you drunk.
…
I had a fancy bottle of wine and got drunk last night!
The terrible wine is on the table.

[Table: rows tezguino, wine; columns bottle-of, is-of, makes-you, and-got, the-terrible, is-on.]

SLIDE 20

What is a “Context”?

Can be anything you want!

  • Entire contents of the sentence
  • One word before and after
  • Words in the same sentence
      • Filter: nouns and verbs?
      • Bag of words in a window
  • Document it appears in
  • Many other variations…

A bottle of tezguino is on the table.
Tezguino makes you drunk.
…
I had a fancy bottle of wine and got drunk last night!
The terrible wine is on the table.

[Table: rows tezguino, wine; columns bottle, table, you, drunk, fancy, night, terrible.]
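The "bag of words in a window" context can be sketched as follows. This is a minimal Python illustration, assuming a symmetric window of two words on each side. The window size and the `window_counts` helper are assumptions, and the noun/verb filter is left out because it would need a part-of-speech tagger.

```python
import re
from collections import Counter, defaultdict

SENTENCES = [
    "A bottle of tezguino is on the table.",
    "Tezguino makes you drunk.",
    "I had a fancy bottle of wine and got drunk last night!",
    "The terrible wine is on the table.",
]

def window_counts(sentences, window=2):
    """Map each word to a Counter of words seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:                      # a word is not its own context
                    counts[word][tokens[j]] += 1
    return counts

counts = window_counts(SENTENCES, window=2)
for target in ("tezguino", "wine"):
    print(target, dict(counts[target]))
```

Each row of the resulting table is a sparse word vector. With this window, "tezguino" and "wine" overlap on the contexts bottle, of, is, and on.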

SLIDE 21

What is a “Context”?

Can be anything you want!

  • Entire contents of the sentence
  • One word before and after
  • Words in the same sentence
  • Document it appears in
      • Term-document matrix!
      • Latent Semantic Analysis
  • Many other variations…

A bottle of tezguino is on the table.
Tezguino makes you drunk.
…
I had a fancy bottle of wine and got drunk last night!
The terrible wine is on the table.

[Table: rows tezguino, table, bottle, drunk, wine; columns D1, D2, D3, D4.]

SLIDE 22

Pointwise Mutual Information

Raw counts are not good

  • Skewed towards common words/contexts
  • Many of them are not informative
      • is, the, it, they, …

PMI(w,c)

  • How much more likely is w to occur in c than it would by chance?
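The formula on this slide did not survive extraction. The sketch below assumes the standard definition, PMI(w,c) = log2 [ P(w,c) / (P(w) P(c)) ], with probabilities estimated from co-occurrence counts. The toy counts and the `pmi_table` helper are hypothetical.

```python
import math

def pmi_table(counts):
    """counts[w][c] is a co-occurrence count; returns PMI in bits for observed pairs."""
    total = sum(n for row in counts.values() for n in row.values())
    word_total = {w: sum(row.values()) for w, row in counts.items()}
    ctx_total = {}
    for row in counts.values():
        for c, n in row.items():
            ctx_total[c] = ctx_total.get(c, 0) + n
    return {
        w: {
            c: math.log2((n / total) / ((word_total[w] / total) * (ctx_total[c] / total)))
            for c, n in row.items() if n > 0
        }
        for w, row in counts.items()
    }

# Hypothetical toy counts, not taken from the lecture.
counts = {
    "tezguino": {"drunk": 1, "the": 1},
    "table": {"the": 2},
}
pmi = pmi_table(counts)
for w, row in pmi.items():
    for c, value in row.items():
        print(f"PMI({w}, {c}) = {value:+.3f}")
```

In these toy counts "tezguino" co-occurs with "drunk" and "the" equally often, yet only the pair with "drunk" gets a positive score, because "the" appears with everything.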
SLIDE 23

Outline

  • Latent Semantic Analysis
  • Vector Models for Words
  • Direct Embeddings
  • Reducing the Dimensions

SLIDE 24

Option 1: Revisiting Clustering

SLIDE 25

Hierarchical Clustering

SLIDE 26

Example

SLIDE 27

Brown Clusters for Twitter

http://www.cs.cmu.edu/~ark/TweetNLP/cluster_viewer.html

SLIDE 28

Option 2: SVD

SLIDE 29

Example Word Projection

SLIDE 30

Problem with SVD & Clustering

Computational Complexity

  • SVD: O(mn^2)
  • Clustering: O(knm) per iteration, or O(n^3)
  • But n can be 100,000!

“One shot”

  • Difficult to add new documents or words
  • Cannot work with streaming data
SLIDE 31

Outline

  • Latent Semantic Analysis
  • Vector Models for Words
  • Direct Embeddings
  • Reducing the Dimensions

SLIDE 32

Predict surrounding words

A bottle of tezguino is on the table.

[Figure labels: u, v]
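The slide gives only the sentence and the labels u and v. One common way to "predict surrounding words" is a skip-gram-style model, sketched here as an assumption rather than as the lecture's exact formulation: each word w has a vector v_w, each context word c has a vector u_c, and p(c | w) is a softmax over the scores u_c · v_w. The window size, dimensionality, and random initialization are illustrative choices.

```python
import math
import random

def skipgram_pairs(tokens, window=2):
    """All (center, context) pairs within `window` positions of each other."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def context_probs(word, u, v):
    """Softmax over the vocabulary: p(c | word) is proportional to exp(u[c] . v[word])."""
    scores = {c: sum(a * b for a, b in zip(u[c], v[word])) for c in u}
    top = max(scores.values())                  # subtract max for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

tokens = "a bottle of tezguino is on the table".split()
vocab = sorted(set(tokens))
rng = random.Random(0)
dim = 5                                         # assumed embedding size
v = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}  # word vectors
u = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}  # context vectors

pairs = skipgram_pairs(tokens)
probs = context_probs("tezguino", u, v)
# Average negative log-likelihood of the observed pairs: the quantity training would reduce.
loss = -sum(math.log(context_probs(w, u, v)[c]) for w, c in pairs) / len(pairs)
print(len(pairs), "training pairs; initial loss:", round(loss, 3))
```

Training would adjust u and v by gradient descent to lower this loss, so that words seen in similar contexts receive similar vectors.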

SLIDE 33

Estimating the Word Vectors

SLIDE 34

Similar Meaning = Close

SLIDE 35

Similar Meaning = Close

https://siddhant7.github.io/Vector-Representation-of-Words/

SLIDE 36

Vectors “know” Gender

https://siddhant7.github.io/Vector-Representation-of-Words/

King - male + female ≈ queen
male : female :: King : queen
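The analogy arithmetic can be sketched with a nearest-neighbour search by cosine similarity. The vectors below are hand-made toy values chosen so that the offsets line up exactly. They are hypothetical and not learned from data, and real embeddings only satisfy such relations approximately.

```python
import math

# Hand-made toy vectors (hypothetical, not learned from data).
# Dimensions: royalty, gender, swimming-vs-walking, past-vs-present.
VECS = {
    "king":     [1.0,  1.0,  0.0,  0.0],
    "queen":    [1.0, -1.0,  0.0,  0.0],
    "male":     [0.0,  1.0,  0.0,  0.0],
    "female":   [0.0, -1.0,  0.0,  0.0],
    "walking":  [0.0,  0.0, -1.0, -1.0],
    "walked":   [0.0,  0.0, -1.0,  1.0],
    "swimming": [0.0,  0.0,  1.0, -1.0],
    "swam":     [0.0,  0.0,  1.0,  1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Solve a : b :: c : ? via the word nearest to vec(c) - vec(a) + vec(b)."""
    target = [vc - va + vb for va, vb, vc in zip(VECS[a], VECS[b], VECS[c])]
    candidates = (w for w in VECS if w not in {a, b, c})   # exclude the query words
    return max(candidates, key=lambda w: cosine(VECS[w], target))

print("male : female :: king :", analogy("male", "female", "king"))
print("walking : walked :: swimming :", analogy("walking", "walked", "swimming"))
```

Excluding the three query words from the candidates is the usual convention. Without it the nearest neighbour is often one of the inputs.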

SLIDE 37

They “know” Tenses!

https://siddhant7.github.io/Vector-Representation-of-Words/

swimming – walking + walked ≈ swam
walking : walked :: swimming : swam

SLIDE 38

They “know” Facts!

https://siddhant7.github.io/Vector-Representation-of-Words/

Capital – Country + Spain ≈ Madrid

SLIDE 39

Upcoming…

Homework

  • Homework 1 is up!
  • No more material will be covered
  • Due: January 26, 2017

Project

  • Project pitch is due January 23, 2017!
  • Start assembling teams now
  • Tons of datasets on the “projects” page on website