

SLIDE 1

Lecture 6: Vector Space Model

Kai-Wei Chang, CS @ University of Virginia, kw@kwchang.net. Course webpage: http://kwchang.net/teaching/NLP16

6501 Natural Language Processing 1

SLIDE 2

This lecture

v How to represent a word, a sentence, or a document?
v How to infer the relationship among words?
v We focus on “semantics”: distributional semantics
v What is the meaning of “life”?

6501 Natural Language Processing 2

SLIDE 3

6501 Natural Language Processing 3

SLIDE 4

How to represent a word

v Naïve way: represent words as atomic symbols: student, talk, university

v N-gram language model, logical analysis

v Represent a word as a “one-hot” vector: [ 0 0 0 1 0 … 0 ]
v How large is this vector?

v PTB data: ~50k, Google 1T data: 13M

v 𝑤 ⋅ 𝑣 =?

6501 Natural Language Processing 4

(dimensions of the one-hot vector: egg, student, talk, university, happy, buy, …)

SLIDE 5

Issues?

v Dimensionality is large; the vector is sparse
v No notion of similarity
v Cannot represent new words
v Any idea?

6501 Natural Language Processing 5

𝑤_happy = [0 0 0 1 0 … 0]
𝑤_buy   = [0 0 1 0 0 … 0]
𝑤_talk  = [1 0 0 0 0 … 0]
𝑤_happy ⋅ 𝑤_buy = 𝑤_happy ⋅ 𝑤_talk = 0
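A minimal sketch of one-hot vectors over the toy vocabulary from the previous slide (the vocabulary and the words compared here are chosen only for illustration; not from the slides):

import numpy as np

vocab = ["egg", "student", "talk", "university", "happy", "buy"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # a |V|-dimensional vector with a single 1 at the word's index
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

# any two distinct one-hot vectors are orthogonal, so their dot product is 0
print(one_hot("happy") @ one_hot("buy"))   # 0.0
print(one_hot("happy") @ one_hot("talk"))  # 0.0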

SLIDE 6

Idea 1: Taxonomy (Word category)

6501 Natural Language Processing 6

SLIDE 7

What is “car”?

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('motorcar')
[Synset('car.n.01')]
>>> motorcar = wn.synset('car.n.01')
>>> motorcar.hypernyms()
[Synset('motor_vehicle.n.01')]
>>> paths = motorcar.hypernym_paths()
>>> [synset.name() for synset in paths[0]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'container.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']
>>> [synset.name() for synset in paths[1]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'conveyance.n.03', 'vehicle.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']

6501 Natural Language Processing 7

SLIDE 8

Word similarity?

6501 Natural Language Processing 8

>>> right = wn.synset('right_whale.n.01')
>>> minke = wn.synset('minke_whale.n.01')
>>> orca = wn.synset('orca.n.01')
>>> tortoise = wn.synset('tortoise.n.01')
>>> novel = wn.synset('novel.n.01')
>>> right.lowest_common_hypernyms(minke)
[Synset('baleen_whale.n.01')]
>>> right.lowest_common_hypernyms(orca)
[Synset('whale.n.02')]
>>> right.lowest_common_hypernyms(tortoise)
[Synset('vertebrate.n.01')]
>>> right.lowest_common_hypernyms(novel)
[Synset('entity.n.01')]
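WordNet can also give a graded similarity score from the hypernym hierarchy; a minimal continuation of the snippet above using NLTK's path_similarity (the values noted in the comments are approximate and depend on the WordNet version):

>>> right.path_similarity(minke)      # both baleen whales: about 0.25
>>> right.path_similarity(tortoise)   # both vertebrates: about 0.08
>>> right.path_similarity(novel)      # related only through 'entity': about 0.04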

Requires human labor

SLIDE 9

Taxonomy (Word category)

v Synonym, hypernym (Is-A), hyponym

6501 Natural Language Processing 9

SLIDE 10

Idea 2: Similarity = Clustering

6501 Natural Language Processing 10

SLIDE 11

Cluster n-gram model

v Can be generated from unlabeled corpora

v Based on statistics, e.g., mutual information (see the PMI sketch after this slide)

6501 Natural Language Processing 11

Implementation of the Brown hierarchical word clustering algorithm. Percy Liang
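Brown clustering greedily merges clusters so as to maximize the mutual information between adjacent clusters. As a rough flavor of the underlying statistic, here is a minimal PMI sketch over word bigrams with NLTK (a simplification for illustration, not the Brown algorithm itself):

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# requires nltk.download('brown')
words = nltk.corpus.brown.words(categories='news')
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(5)                          # drop rare bigrams
print(finder.nbest(BigramAssocMeasures().pmi, 10))   # top bigrams by PMI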

SLIDE 12

Idea 3: Distributional representation

v Linguistic items with similar distributions have similar meanings

v i.e., words occurring in the same contexts ⇒ similar meaning

6501 Natural Language Processing 12

"a word is characterized by the company it keeps”

  • John R. Firth, 1957
SLIDE 13

Vector representation (word embeddings)

v Discrete ⇒ distributed representations
v Word meanings are vectors of “basic concepts”
v What are the “basic concepts”?
v How to assign weights?
v How to define the similarity/distance?

6501 Natural Language Processing 13

𝑤_king  = [0.8 0.9 0.1 …]
𝑤_queen = [0.8 0.1 0.8 …]
𝑤_apple = [0.1 0.2 0.1 0.8 …]

(basic concepts / dimensions: royalty, masculinity, femininity, eatable)

SLIDE 14

An illustration of vector space model

6501 Natural Language Processing 14

[Illustration: words w1, …, w5 plotted in a space with axes such as Royalty, Masculine, and Eatable; distances such as |D2 - D4| measure dissimilarity]

SLIDE 15

Semantic similarity in 2D

v Home Depot products

6501 Natural Language Processing 15

SLIDE 16

Capture the structure of words

v Example from GloVe

6501 Natural Language Processing 16

SLIDE 17

How to use word vectors?

6501 Natural Language Processing 17

SLIDE 18

Pre-trained word vectors

v Google News (word2vec)

https://code.google.com/archive/p/word2vec

v 100 billion tokens, 300 dimensions, 3M words

v GloVe project: http://nlp.stanford.edu/projects/glove/

v Pre-trained word vectors from Wikipedia (6B tokens), web crawl data (840B), and Twitter (27B); a loading sketch follows this slide

6501 Natural Language Processing 18
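A minimal loading sketch (an assumption, not from the slides: it presumes gensim is installed and the GoogleNews-vectors-negative300.bin file from the word2vec archive above has been downloaded):

from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)
print(wv.most_similar("university", topn=5))   # nearest neighbors by cosine
print(wv.similarity("happy", "glad"))          # cosine similarity of two words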

SLIDE 19

Distance/similarity

v Vector similarity measure ⇒ similarity in meaning
v Cosine similarity:

  cos(𝑣, 𝑤) = (𝑣 ⋅ 𝑤) / (||𝑣|| ⋅ ||𝑤||)

v Word vectors are normalized by length
v Euclidean distance: ||𝑣 − 𝑤||₂
v Inner product: 𝑣 ⋅ 𝑤
v Same as cosine similarity if vectors are normalized

6501 Natural Language Processing 19

𝑣 / ||𝑣|| is a unit vector
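A minimal numeric sketch of the three measures, using the toy king/queen vectors from the earlier slide (padded to four dimensions for illustration):

import numpy as np

def cosine(v, w):
    return v @ w / (np.linalg.norm(v) * np.linalg.norm(w))

v_king  = np.array([0.8, 0.9, 0.1, 0.0])
v_queen = np.array([0.8, 0.1, 0.8, 0.0])
print(cosine(v_king, v_queen))           # cosine similarity
print(np.linalg.norm(v_king - v_queen))  # Euclidean distance
print(v_king @ v_queen)                  # inner product

# after length-normalization, the inner product equals the cosine similarity
u_k = v_king / np.linalg.norm(v_king)
u_q = v_queen / np.linalg.norm(v_queen)
print(u_k @ u_q)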

SLIDE 20

Distance/similarity

v Vector similarity measure ⇒ similarity in meaning
v Cosine similarity:

  cos(𝑣, 𝑤) = (𝑣 ⋅ 𝑤) / (||𝑣|| ⋅ ||𝑤||)

v Word vectors are normalized by length
v Euclidean distance: ||𝑣 − 𝑤||₂
v Inner product: 𝑣 ⋅ 𝑤
v Same as cosine similarity if vectors are normalized

6501 Natural Language Processing 20

Choosing the right similarity metric is important (Linguistic Regularities in Sparse and Explicit Word Representations, Levy & Goldberg, CoNLL 2014)

SLIDE 21

Word similarity DEMO

v http://msrcstr.cloudapp.net/

6501 Natural Language Processing 21

SLIDE 22

Word analogy

v 𝑤_uncle − 𝑤_man + 𝑤_woman ∼ 𝑤_aunt   (man : woman :: uncle : aunt)

6501 Natural Language Processing 22
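A minimal sketch of this vector arithmetic (an assumption, not from the slides: it uses the pre-trained word2vec vectors loaded via gensim, as on the earlier pre-trained-vectors slide):

from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)
# uncle - man + woman: "aunt" is expected near the top of the list
print(wv.most_similar(positive=["uncle", "woman"], negative=["man"], topn=3))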

SLIDE 23

From words to phrases

6501 Natural Language Processing 23

SLIDE 24

Neural Language Models

6501 Natural Language Processing 24

SLIDE 25

How to “learn” word vectors?

6501 Natural Language Processing 25

What are the “basic concepts”?
How to assign weights?
How to define the similarity/distance? Cosine similarity

SLIDE 26

Back to distributional representation

v Encode relational data in a matrix

v Co-occurrence (e.g., from a general corpus)

v Bag-of-words model: documents (clusters) as the basis for the vector space (a small sketch follows this slide)

6501 Natural Language Processing 26
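A minimal bag-of-words sketch over toy documents (the documents here are made up for illustration): each word is represented by its row of counts over the documents it occurs in.

from collections import Counter
import numpy as np

docs = ["the student went to the university",
        "the student likes to talk",
        "buy an egg and be happy"]
vocab = sorted({w for d in docs for w in d.split()})

# term-document count matrix: rows = words, columns = documents
D = np.array([[Counter(d.split())[w] for d in docs] for w in vocab])
print(vocab)
print(D)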

SLIDE 27

Back to distributional representation

v Encode relational data in a matrix

v Co-occurrence (e.g., from a general corpus)
v Skip-grams

6501 Natural Language Processing 27

SLIDE 28

Back to distributional representation

v Encode relational data in a matrix

v Co-occurrence (e.g., from a general corpus)
v Skip-grams
v From taxonomy (e.g., WordNet, thesaurus)

6501 Natural Language Processing 28

Input: synonyms from a thesaurus
  Joyfulness: joy, gladden
  Sad: sorrow, sadden

                        joy  gladden  sorrow  sadden  goodwill
Group 1: "joyfulness"    1      1
Group 2: "sad"                           1       1
Group 3: "affection"                                     1

SLIDE 29

Back to distributional representation

v Encode relational data in a matrix

v Co-occurrence (e.g., from a general corpus)
v Skip-grams
v From taxonomy (e.g., WordNet, thesaurus)

6501 Natural Language Processing 29

Input: synonyms from a thesaurus
  Joyfulness: joy, gladden
  Sad: sorrow, sadden

                        joy  gladden  sorrow  sadden  goodwill
Group 1: "joyfulness"    1      1
Group 2: "sad"                           1       1
Group 3: "affection"                                     1

Cosine similarity? (a small check follows this slide)

Pros and cons?
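A minimal check of the thesaurus matrix above: each column is a word vector over the three synonym groups, so words sharing a group get cosine similarity 1 and words in different groups get 0.

import numpy as np

#                joy gladden sorrow sadden goodwill
M = np.array([[1, 1, 0, 0, 0],   # group 1: "joyfulness"
              [0, 0, 1, 1, 0],   # group 2: "sad"
              [0, 0, 0, 0, 1]])  # group 3: "affection"
joy, gladden, sorrow = M[:, 0], M[:, 1], M[:, 2]

def cosine(v, w):
    return v @ w / (np.linalg.norm(v) * np.linalg.norm(w))

print(cosine(joy, gladden))  # 1.0 -- same group
print(cosine(joy, sorrow))   # 0.0 -- no shared group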

SLIDE 30

Problems?

v Number of basic concepts is large
v Basis is not orthogonal (i.e., not linearly independent)
v Some function words are too frequent (e.g., "the")
v Syntax has too much impact
v E.g., TF-IDF can be applied (see the sketch after this slide)
v E.g., skip-gram: scaling by distance to the target

6501 Natural Language Processing 30
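A minimal TF-IDF sketch on the toy term-document matrix from the earlier bag-of-words sketch (toy data assumed): words that occur in many documents, such as "the", get down-weighted.

import numpy as np

docs = ["the student went to the university",
        "the student likes to talk",
        "buy an egg and be happy"]
vocab = sorted({w for d in docs for w in d.split()})
D = np.array([[d.split().count(w) for d in docs] for w in vocab])  # rows = words

df = (D > 0).sum(axis=1)          # document frequency of each word
idf = np.log(D.shape[1] / df)     # inverse document frequency
D_tfidf = D * idf[:, None]        # re-weight each word's row by its IDF
print(D_tfidf)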

SLIDE 31

Latent Semantic Analysis (LSA)

v Data representation

v Encode single-relational data in a matrix

v Co-occurrence (e.g., document-term matrix, skip-gram)
v Synonyms (e.g., from a thesaurus)

v Factorization

v Apply SVD to the matrix to find latent components

6501 Natural Language Processing 31

SLIDE 32

Principal Component Analysis (PCA)

v Decompose the similarity space into a set of orthonormal basis vectors

6501 Natural Language Processing 32

SLIDE 33

Principal Component Analysis (PCA)

v Decompose the similarity space into a set of orthonormal basis vectors
v For an 𝑛×𝑚 matrix 𝐵, there exists a factorization (SVD) such that 𝐵 = 𝑉Σ𝑊ᵀ
v 𝑉, 𝑊 are orthogonal matrices

6501 Natural Language Processing 33

SLIDE 34

Low-rank Approximation

v Idea: store the most important information in a small number of dimensions (e.g., 100-1000)
v SVD can be used to compute the optimal low-rank approximation (see the sketch after this slide)
v Set the smallest n − r singular values to zero
v Similar words map to similar locations in the low-dimensional space

6501 Natural Language Processing 34
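A minimal numpy sketch of truncated SVD (the matrix here is random, standing in for a word-document matrix; keeping the top r singular values gives the best rank-r approximation in the least-squares sense):

import numpy as np

D = np.random.rand(6, 5)        # stand-in for a word-document matrix (rows = words)
V, s, Wt = np.linalg.svd(D, full_matrices=False)   # D = V @ diag(s) @ Wt

r = 2
D_r = V[:, :r] @ np.diag(s[:r]) @ Wt[:r, :]   # rank-r reconstruction
word_vectors = V[:, :r] * s[:r]               # rows of V*Sigma: low-dimensional word vectors
print(np.linalg.norm(D - D_r))                # reconstruction error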

SLIDE 35

Latent Semantic Analysis (LSA)

v Factorization

v Apply SVD to the matrix to find latent components

6501 Natural Language Processing 35

SLIDE 36

LSA example

v Original matrix C

6501 Natural Language Processing 36

Example from Christopher Manning and Pandu Nayak, Introduction to Information Retrieval

SLIDE 37

LSA example

v SVD: 𝐷 = 𝑉Σ𝑊ᵀ

6501 Natural Language Processing 37

SLIDE 38

LSA example

v Original matrix C
v Dimension reduction: 𝐷 ∼ 𝑉Σ𝑊ᵀ

6501 Natural Language Processing 38

SLIDE 39

LSA example

v Original matrix 𝐷 vs. reconstructed matrix 𝐷₂
v What is the similarity between ship and boat?

6501 Natural Language Processing 39

SLIDE 40

Word vectors

𝐷 ∼ 𝑉Σ𝑊ᵀ
𝐷𝐷ᵀ ∼ (𝑉Σ𝑊ᵀ)(𝑉Σ𝑊ᵀ)ᵀ = 𝑉Σ𝑊ᵀ𝑊Σᵀ𝑉ᵀ = 𝑉ΣΣᵀ𝑉ᵀ   (why?)   = (𝑉Σ)(𝑉Σ)ᵀ

v 𝐷_ship ⋅ 𝐷_boat ∼ (𝑉Σ)_ship ⋅ (𝑉Σ)_boat

6501 Natural Language Processing 40

SLIDE 41

Why do we need low-rank approximation?

v Knowledge bases (e.g., thesauri) are never complete
v Noise reduction by dimension reduction
v Intuitively, LSA brings together “related” axes (concepts) in the vector space
v A compact model

6501 Natural Language Processing 41

SLIDE 42

All problem solved?

6501 Natural Language Processing 42

SLIDE 43

An analogy game

“Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings” (Bolukbasi et al., NIPS 2016)

6501 Natural Language Processing 43

SLIDE 44

Continuous Semantic Representations

[Illustration: related words cluster together in the vector space, e.g., {sunny, rainy, windy, cloudy}, {car, wheel, cab}, {sad, joy, emotion, feeling}]

6501 Natural Language Processing 44

SLIDE 45

Semantics Needs More Than Similarity

Tomorrow will be rainy. Tomorrow will be sunny.

similar(rainy, sunny)?   antonym(rainy, sunny)?

6501 Natural Language Processing 45

SLIDE 46

Continuous representations for entities

6501 Natural Language Processing 46

[Illustration: entity embeddings, e.g., George W. Bush is linked to Laura Bush and the Republican Party; which entity relates to Michelle Obama and the Democratic Party?]

SLIDE 47

Continuous representations for entities

6501 Natural Language Processing 47

  • Useful resources for NLP applications
  • Semantic Parsing & Question Answering
  • Information Extraction
SLIDE 48

Next lecture: a more flexible framework

v Directly learn word vectors using a neural network (NN) model

v More flexible
v Easier to learn new words
v Incorporate other information
v Optimize a specific task loss

v Review calculus!

6501 Natural Language Processing 48