

slide-1
SLIDE 1

Vector Semantics and Embeddings

CSE354 - Spring 2020 Natural Language Processing

slide-2
SLIDE 2

Tasks

  • Vectors which represent words or sequences (and how?)
  • Dimensionality Reduction
  • Recurrent Neural Networks and Sequence Models

slide-3
SLIDE 3

Objective

To embed: convert a token (or sequence) to a vector that represents meaning.

slide-4
SLIDE 4

Objective

To embed: convert a token (or sequence) to a vector that represents meaning, or is useful for performing a downstream NLP application.

slide-5
SLIDE 5

Objective

[diagram: “port” → embed → vector]

slide-6
SLIDE 6

Objective

[diagram: “port” → embed → vector]

slide-7
SLIDE 7

Objective

[diagram: “port” → embed → vector]

Prefer dense vectors

  • Fewer parameters (weights) for the machine learning model.
  • May generalize better implicitly.
  • May capture synonymy.

For deep learning, in practice, they work better. Why? Roughly, having fewer parameters becomes increasingly important when you are learning multiple layers of weights rather than just a single layer.

One-hot is a sparse vector (contrast with the dense sketch below).
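To make the contrast concrete, here is a minimal Python sketch (the toy vocabulary and the embedding values are made up for illustration) of a sparse one-hot vector versus a dense embedding for the same word:

```python
import numpy as np

vocab = ["the", "nail", "hit", "beam", "wall"]      # toy vocabulary (made up)
word_to_idx = {w: i for i, w in enumerate(vocab)}

# One-hot: sparse, |V|-dimensional, exactly one non-zero entry.
one_hot_beam = np.zeros(len(vocab))
one_hot_beam[word_to_idx["beam"]] = 1.0

# Dense embedding: low-dimensional, real-valued (these values are arbitrary).
dense_beam = np.array([0.53, 1.5, 3.21, 2.3, 0.76])

print(one_hot_beam)   # [0. 0. 0. 1. 0.] -- one downstream weight per vocabulary word
print(dense_beam)     # a handful of learned parameters instead of |V|
```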
slide-8
SLIDE 8

Objective

[diagram: “port” → embed → vector]

Prefer dense vectors

  • Fewer parameters (weights) for the machine learning model.
  • May generalize better implicitly.
  • May capture synonymy.

For deep learning, in practice, they work better. Why? Roughly, having fewer parameters becomes increasingly important when you are learning multiple layers of weights rather than just a single layer.

One-hot is a sparse vector.

(Jurafsky, 2012)

slide-9
SLIDE 9

Objective

[diagram: “port” → embed → vector]

Prefer dense vectors

  • Fewer parameters (weights) for the machine learning model.
  • May generalize better implicitly.
  • May capture synonymy.

For deep learning, in practice, they work better. Why? Roughly, having fewer parameters becomes increasingly important when you are learning multiple layers of weights rather than just a single layer.

One-hot is a sparse vector.

(Jurafsky, 2012)


slide-10
SLIDE 10

Objective

To embed: convert a token (or sequence) to a vector that represents meaning.

slide-11
SLIDE 11

Objective

To embed: convert a token (or sequence) to a vector that represents meaning.

Wittgenstein, 1945: “The meaning of a word is its use in the language”

Distributional hypothesis: a word’s meaning is defined by all the different contexts it appears in (i.e. how it is “distributed” in natural language).

Firth, 1957: “You shall know a word by the company it keeps”

slide-12
SLIDE 12

Objective

To embed: convert a token (or sequence) to a vector that represents meaning.

Wittgenstein, 1945: “The meaning of a word is its use in the language”

Distributional hypothesis: a word’s meaning is defined by all the different contexts it appears in (i.e. how it is “distributed” in natural language).

Firth, 1957: “You shall know a word by the company it keeps”

The nail hit the beam behind the wall.

slide-13
SLIDE 13

Distributional Hypothesis

The nail hit the beam behind the wall.

slide-14
SLIDE 14

Objective

[diagram: “port” → embed → dense vector (0.53, 1.5, 3.21, 2.3, .76)]

slide-15
SLIDE 15

Objective

[diagram: “port” → embed → dense vector (0.53, 1.5, 3.21, 2.3, .76)]

port.n.1 (a place (seaport or airport) where people and merchandise can enter or leave a country)
port.n.2, port wine (sweet dark-red dessert wine originally from Portugal)
port.n.3, embrasure, porthole (an opening (in a wall or ship or armored vehicle) for firing through)
larboard, port.n.4 (the left side of a ship or aircraft to someone who is aboard and facing the bow or nose)
interface, port.n.5 ((computer science) computer circuit consisting of the hardware and associated circuitry that links one device with another (especially a computer and a hard disk drive or other peripherals))

slide-16
SLIDE 16

How?

1. One-hot representation
2. Selectors (represent context by a “multi-hot” representation)
3. From PCA/Singular Value Decomposition (known as “Latent Semantic Analysis” in some circumstances)

TF-IDF (Term Frequency, Inverse Document Frequency), PMI (Pointwise Mutual Information), etc.
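As a rough illustration of the PMI-style weighting mentioned above, here is a small NumPy sketch that turns a target-by-context co-occurrence matrix into PPMI (positive pointwise mutual information) scores; the counts and the matrix shape are made up for the example:

```python
import numpy as np

# Toy target-by-context co-occurrence counts (values are made up).
counts = np.array([[100., 80.,  5.],
                   [  3.,  2., 90.]])

p_tc = counts / counts.sum()              # joint probabilities
p_t = p_tc.sum(axis=1, keepdims=True)     # target-word marginals
p_c = p_tc.sum(axis=0, keepdims=True)     # context-word marginals

# Pointwise mutual information, clipped at 0 (PPMI).
with np.errstate(divide="ignore"):
    pmi = np.log2(p_tc / (p_t * p_c))
ppmi = np.maximum(pmi, 0.0)
print(ppmi)
```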

slide-17
SLIDE 17

How?

1. One-hot representation
2. Selectors (represent context by a “multi-hot” representation)
3. From PCA/Singular Value Decomposition (known as “Latent Semantic Analysis” in some circumstances)

“Neural Embeddings”:
4. Word2Vec
5. fastText
6. GloVe
7. BERT

slide-18
SLIDE 18

How?

1. One-hot representation
2. Selectors (represent context by a “multi-hot” representation)
3. From PCA/Singular Value Decomposition (known as “Latent Semantic Analysis” in some circumstances)

“Neural Embeddings”:
4. Word2Vec
5. fastText
6. GloVe
7. BERT

…, word1, word2, bill, word3, word4, ...
[diagram: a vector with 1s marking the selected word position(s)]
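A minimal sketch of items 1 and 2 (a one-hot vector for a single word and a multi-hot “selector” vector for a set of context words), using a made-up toy vocabulary and the example sentence used elsewhere in these slides:

```python
import numpy as np

vocab = ["the", "nail", "hit", "beam", "behind", "wall"]   # toy vocabulary (made up)
idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """A 1 in the single position of the given word."""
    v = np.zeros(len(vocab))
    v[idx[word]] = 1.0
    return v

def multi_hot(context_words):
    """Selector-style representation: a 1 in the position of every context word."""
    v = np.zeros(len(vocab))
    for w in context_words:
        v[idx[w]] = 1.0
    return v

print(one_hot("beam"))                       # [0. 0. 0. 1. 0. 0.]
print(multi_hot(["the", "hit", "behind"]))   # context of "beam" in the example sentence
```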

slide-19
SLIDE 19

How?

1. One-hot representation
2. Selectors (represent context by a “multi-hot” representation)
3. From PCA/Singular Value Decomposition (known as “Latent Semantic Analysis” in some circumstances)

“Neural Embeddings”:
4. Word2Vec
5. fastText
6. GloVe
7. BERT

slide-20
SLIDE 20

How?

1. One-hot representation
2. Selectors (represent context by a “multi-hot” representation)
3. From PCA/Singular Value Decomposition (known as “Latent Semantic Analysis” in some circumstances)

“Neural Embeddings”:
4. Word2Vec
5. fastText
6. GloVe
7. BERT

slide-21
SLIDE 21

How?

1. One-hot representation
2. Selectors (represent context by a “multi-hot” representation)
3. From PCA/Singular Value Decomposition (known as “Latent Semantic Analysis” in some circumstances)

“Neural Embeddings”:
4. Word2Vec
5. fastText
6. GloVe
7. BERT

slide-22
SLIDE 22

SVD-Based Embeddings

Singular Value Decomposition...

slide-23
SLIDE 23

Concept, In Matrix Form:

[matrix schematic: rows are observations 1 … n; columns are features f1, f2, f3, f4, … fp]

columns: p features; rows: n observations

slide-24
SLIDE 24

[matrix schematic: n observations (rows) × p features f1, f2, f3, f4, … fp (columns)]

SVD-Based Embeddings

slide-25
SLIDE 25

[matrix schematic: n observations × p features (f1 … fp), reduced to n observations × p’ dimensions (c1 … cp’)]

Dimensionality reduction: try to represent with only p’ dimensions.

SVD-Based Embeddings

slide-26
SLIDE 26

Concept: Dimensionality Reduction in 3-D, 2-D, and 1-D

Data (or, at least, what we want from the data) may be accurately represented with fewer dimensions.

p = 2 → p’ = 1

slide-27
SLIDE 27

Concept: Dimensionality Reduction in 3-D, 2-D, and 1-D

Data (or, at least, what we want from the data) may be accurately represented with fewer dimensions.

p = 2 → p’ = 1; p = 3 → p’ = 2

slide-28
SLIDE 28

Rank: the number of linearly independent columns of A (i.e. columns that can’t be derived from the other columns as a linear combination). Q: What is the rank of this matrix?

[3×3 example matrix shown on the slide]

Concept: Dimensionality Reduction

slide-29
SLIDE 29

Rank: the number of linearly independent columns of A (i.e. columns that can’t be derived from the other columns as a linear combination). Q: What is the rank of this matrix? A: 2. The first column is just the sum of the other two columns, so we can represent the matrix as a linear combination of 2 vectors (see the sketch below).

[3×3 example matrix and its two independent columns shown on the slide]

Concept: Dimensionality Reduction
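To check a rank claim like this numerically, NumPy can be used directly; the 3×3 matrix below is a hypothetical stand-in (its exact entries are mine, not the slide's), constructed so that the first column is the sum of the other two:

```python
import numpy as np

# Hypothetical matrix: column 1 equals column 2 + column 3 (entries are made up).
A = np.array([[3., 1., 2.],
              [5., 2., 3.],
              [2., 1., 1.]])

# Only two columns are linearly independent, so the rank is 2.
print(np.linalg.matrix_rank(A))   # 2
```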

slide-30
SLIDE 30

[matrix schematic: n target words (rows) × p context-word features (columns), reduced to p’ dimensions c1, c2, c3, c4, … cp’]

Context words are features; target words are observations; co-occurrence counts are the cells.

Dimensionality reduction: try to represent with only p’ dimensions.

SVD-Based Embeddings

slide-31
SLIDE 31

Dimensionality Reduction - PCA

Linear approximation of the data in r dimensions, found via Singular Value Decomposition:

X[n×p] = U[n×r] D[r×r] V[p×r]^T

X: original matrix; U: “left singular vectors”; D: “singular values” (diagonal); V: “right singular vectors”
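A minimal NumPy sketch of the factorization; the 4×3 word-by-context count matrix is made up for illustration:

```python
import numpy as np

# Toy word-by-context co-occurrence counts: n = 4 target words, p = 3 contexts (made up).
X = np.array([[100., 80.,  2.],
              [ 90., 75.,  3.],
              [  4.,  6., 70.],
              [  2.,  5., 65.]])

# SVD: X = U D V^T
U, d, Vt = np.linalg.svd(X, full_matrices=False)
D = np.diag(d)

# Sanity check: the product reproduces X (up to floating-point error).
assert np.allclose(U @ D @ Vt, X)
```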

slide-32
SLIDE 32

Dimensionality Reduction - PCA

Linear approximation of the data in r dimensions, found via Singular Value Decomposition:

X[n×p] = U[n×r] D[r×r] V[p×r]^T

X: original matrix; U: “left singular vectors”; D: “singular values” (diagonal); V: “right singular vectors”

[diagram: the n×p matrix X factored into U, D, and V^T]

slide-33
SLIDE 33

Dimensionality Reduction - PCA - Example

X[n×p] = U[n×r] D[r×r] V[p×r]^T

Word co-occurrence counts:

slide-34
SLIDE 34

Dimensionality Reduction - PCA - Example

X[n×p] ≅ U[n×r] D[r×r] V[p×r]^T

Axes: target co-occurrence count with “hit” (horizontal); target co-occurrence count with “nail” (vertical).

Observation: “beam.”
count(beam, hit) = 100 -- horizontal dimension
count(beam, nail) = 80 -- vertical dimension

slide-35
SLIDE 35

Dimensionality Reduction - PCA

Linear approximation of the data in r dimensions, found via Singular Value Decomposition:

X[n×p] ≅ U[n×r] D[r×r] V[p×r]^T

X: original matrix; U: “left singular vectors”; D: “singular values” (diagonal); V: “right singular vectors”

Projection (dimensionality-reduced space) in 3 dimensions: U[n×3] D[3×3] V[p×3]^T

To reduce features in a new dataset A: A[m×p] V D = Asmall[m×3]

slide-36
SLIDE 36

Dimensionality Reduction - PCA

Linear approximation of the data in r dimensions, found via Singular Value Decomposition:

X[n×p] ≅ U[n×r] D[r×r] V[p×r]^T

X: original matrix; U: “left singular vectors”; D: “singular values” (diagonal); V: “right singular vectors”

To check how well the original matrix can be reproduced: Z[n×p] = U D V^T. How does Z compare to the original X?
To reduce features in a new dataset: A[m×p] V D = Asmall[m×3] (see the sketch below)
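A sketch of both steps with a made-up count matrix: truncate the SVD to r components, check how well Z = U D V^T reproduces X, and reduce a new matrix A using the slide's A V D convention (projecting with V alone is also a common choice):

```python
import numpy as np

# Toy co-occurrence matrix (counts are made up), as in the earlier sketch.
X = np.array([[100., 80.,  2.],
              [ 90., 75.,  3.],
              [  4.,  6., 70.],
              [  2.,  5., 65.]])
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the top r singular values / vectors.
r = 2
U_r, D_r, V_r = U[:, :r], np.diag(d[:r]), Vt[:r, :].T

# How well does the rank-r product reproduce the original matrix?
Z = U_r @ D_r @ V_r.T
print(np.linalg.norm(X - Z))       # reconstruction error of the rank-r approximation

# Reduce features of a new m x p matrix A (rows are made-up new observations).
A = np.array([[95., 70., 4.]])
A_small = A @ V_r @ D_r            # shape (m, r), per the slide's A V D form
print(A_small)
```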

slide-37
SLIDE 37

Dimensionality Reduction - PCA

Linear approximation of the data in r dimensions, found via Singular Value Decomposition:

X[n×p] ≅ U[n×r] D[r×r] V[p×r]^T

X: original matrix; U: “left singular vectors”; D: “singular values” (diagonal); V: “right singular vectors”

To check how well the original matrix can be reproduced: Z[n×p] = U D V^T. How does Z compare to the original X?
To reduce features in a new dataset: A[m×p] V D = Asmall[m×3]

slide-38
SLIDE 38

Dimensionality Reduction - PCA

Linear approximation of the data in r dimensions, found via Singular Value Decomposition:

X[n×p] ≅ U[n×r] D[r×r] V[p×r]^T

X: original matrix; U: “left singular vectors”; D: “singular values” (diagonal); V: “right singular vectors”

To check how well the original matrix can be reproduced: Z[n×p] = U D V^T. How does Z compare to the original X?
To reduce features in a new dataset: A[m×p] V D = Asmall[m×3]

This is the objective that SVD solves.

slide-39
SLIDE 39

Dimensionality Reduction - PCA

Linear approximation of the data in r dimensions, found via Singular Value Decomposition:

X[n×p] ≅ U[n×r] D[r×r] V[p×r]^T

U, D, and V are unique; D is always positive.

slide-40
SLIDE 40

(TechnoWiki)

slide-41
SLIDE 41

How?

1. One-hot representation
2. Selectors (represent context by a “multi-hot” representation)
3. From PCA/Singular Value Decomposition (known as “Latent Semantic Analysis” in some circumstances)

“Neural Embeddings”:
4. Word2Vec
5. fastText
6. GloVe
7. BERT

slide-42
SLIDE 42

How?

1. One-hot representation
2. Selectors (represent context by a “multi-hot” representation)
3. From PCA/Singular Value Decomposition (known as “Latent Semantic Analysis” in some circumstances)

“Neural Embeddings”:
4. Word2Vec
5. fastText
6. GloVe
7. BERT

slide-43
SLIDE 43

Word2Vec

Principle: Predict the missing word. Similar to language modeling, but predicting context rather than the next word.

p(context | word)

slide-44
SLIDE 44

Word2Vec

Principle: Predict the missing word. Similar to language modeling, but predicting context rather than the next word.

p(context | word)

To learn, maximize p(context | word).

slide-45
SLIDE 45

Word2Vec

Principle: Predict the missing word. Similar to language modeling, but predicting context rather than the next word.

p(context | word)

To learn, maximize p(context | word). In practice, minimize J = 1 - p(context | word).

slide-46
SLIDE 46

Word2Vec: Context

p(context | word)

2 Versions of Context:
1. Continuous bag of words (CBOW): Predict word from context
2. Skip-Grams (SG): predict context words from target

slide-47
SLIDE 47

Word2Vec: Context

p(context | word)

2 Versions of Context:
1. Continuous bag of words (CBOW): Predict word from context
2. Skip-Grams (SG): predict context words from target

(Jurafsky, 2017)

slide-48
SLIDE 48

Word2Vec: Context

p(context | word) 2 Versions of Context:

  • Continuous bag of words (CBOW): Predict word from context
  • Skip-Grams (SG): predict context words from target


(Jurafsky, 2017)

The nail hit the beam behind the wall.

c1 c2 c3 c4


slide-49
SLIDE 49

Word2Vec: Context

p(context | word) 2 Versions of Context:

  • Continuous bag of words (CBOW): Predict word from context
  • Skip-Grams (SG): predict context words from target


(Jurafsky, 2017)

The nail hit the beam behind the wall.

c1 c2 c3 c4

x = (hit, beam), y = 1
x = (the, beam), y = 1
x = (behind, beam), y = 1
...

slide-50
SLIDE 50

Word2Vec: Context

p(context | word) 2 Versions of Context:

  • Continuous bag of words (CBOW): Predict word from context
  • Skip-Grams (SG): predict context words from target


(Jurafsky, 2017)

The nail hit the beam behind the wall.

c1 c2 c3 c4

x = (hit, beam), y = 1
x = (the, beam), y = 1
x = (behind, beam), y = 1
…
x = (happy, beam), y = 0
x = (think, beam), y = 0
...

slide-51
SLIDE 51

Word2Vec: Context

p(context | word) 2 Versions of Context:

  • Continuous bag of words (CBOW): Predict word from context
  • Skip-Grams (SG): predict context words from target


(Jurafsky, 2017)

The nail hit the beam behind the wall.

c1 c2 c3 c4

k negative examples (y = 0) for every positive. How?

x = (hit, beam), y = 1
x = (the, beam), y = 1
x = (behind, beam), y = 1
…
x = (happy, beam), y = 0
x = (think, beam), y = 0
...

slide-52
SLIDE 52

Word2Vec: Context

p(context | word) 2 Versions of Context:

  • Continuous bag of words (CBOW): Predict word from context
  • Skip-Grams (SG): predict context words from target


(Jurafsky, 2017)

The nail hit the beam behind the wall.

c1 c2 c3 c4

k negative examples (y = 0) for every positive. How? Randomly draw from the unigram distribution.

x = (hit, beam), y = 1
x = (the, beam), y = 1
x = (behind, beam), y = 1
…
x = (happy, beam), y = 0
x = (think, beam), y = 0
...

slide-53
SLIDE 53

Word2Vec: Context

p(context | word) 2 Versions of Context:

  • Continuous bag of words (CBOW): Predict word from context
  • Skip-Grams (SG): predict context words from target


(Jurafsky, 2017)

The nail hit the beam behind the wall.

c1 c2 c3 c4

k negative examples (y = 0) for every positive. How? Randomly draw from the unigram distribution, adjusted with α = 0.75 (see the sketch below).

x = (hit, beam), y = 1
x = (the, beam), y = 1
x = (behind, beam), y = 1
…
x = (happy, beam), y = 0
x = (think, beam), y = 0
...
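A sketch of the negative-sampling draw; the vocabulary and its counts are made up, and the key step is raising the unigram counts to the power α = 0.75 before normalizing, which up-weights rarer words relative to the raw unigram distribution:

```python
import numpy as np

vocab  = ["the", "nail", "hit", "beam", "behind", "wall", "happy", "think"]
counts = np.array([5000., 200., 300., 150., 250., 180., 400., 350.])   # made-up counts

alpha = 0.75
probs = counts ** alpha        # adjusted unigram distribution
probs /= probs.sum()

# Draw k negative context words (y = 0) for one positive (context, target) pair.
k = 2
rng = np.random.default_rng(0)
negatives = rng.choice(vocab, size=k, p=probs)
print(negatives)               # e.g. words to pair with "beam" as y = 0 examples
```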

slide-54
SLIDE 54

Word2Vec: Context

p(context | word) 2 Versions of Context:

  • Continuous bag of words (CBOW): Predict word from context
  • Skip-Grams (SG): predict context words from target


(Jurafsky, 2017)

The nail hit the beam behind the wall.

c1 c2 c3 c4

k negative examples (y = 0) for every positive. How? Randomly draw from the unigram distribution, adjusted with α = 0.75.

x = (hit, beam), y = 1
x = (the, beam), y = 1
x = (behind, beam), y = 1
…
x = (happy, beam), y = 0
x = (think, beam), y = 0
...

slide-55
SLIDE 55

Word2Vec: Context

2 Versions of Context:

  • Continuous bag of words (CBOW): Predict word from context
  • Skip-Grams (SG): predict context words from target


(Jurafsky, 2017)

The nail hit the beam behind the wall.

c1 c2 c3 c4

x = (hit, beam), y = 1
x = (the, beam), y = 1
x = (behind, beam), y = 1
…
x = (happy, beam), y = 0
x = (think, beam), y = 0
...

single context: P(y = 1 | c, t) = σ(c·t)

slide-56
SLIDE 56

Word2Vec: Context

2 Versions of Context:

  • Continuous bag of words (CBOW): Predict word from context
  • Skip-Grams (SG): predict context words from target


(Jurafsky, 2017)

The nail hit the beam behind the wall.

c1 c2 c3 c4

x = (hit, beam), y = 1
x = (the, beam), y = 1
x = (behind, beam), y = 1
…
x = (happy, beam), y = 0
x = (think, beam), y = 0
...

single context: P(y = 1 | c, t) = σ(c·t)
all contexts: P(y = 1 | c1…cL, t) = Π_i σ(c_i·t)

Logistic: σ(z) = 1 / (1 + e^-z)

slide-57
SLIDE 57

Word2Vec: Context

2 Versions of Context:

  • Continuous bag of words (CBOW): Predict word from context
  • Skip-Grams (SG): predict context words from target


(Jurafsky, 2017)

The nail hit the beam behind the wall.

c1 c2 c3 c4

x = (hit, beam), y = 1
x = (the, beam), y = 1
x = (behind, beam), y = 1
…
x = (happy, beam), y = 0
x = (think, beam), y = 0
...

single context: P(y = 1 | c, t) = σ(c·t)

Intuition: t·c is a measure of similarity. But it is not a probability! To make it one, apply the logistic activation: σ(z) = 1 / (1 + e^-z)

slide-58
SLIDE 58

Word2Vec: Context

2 Versions of Context:

  • Continuous bag of words (CBOW): Predict word from context
  • Skip-Grams (SG): predict context words from target


(Jurafsky, 2017)

The nail hit the beam behind the wall.

c1 c2 c3 c4

x = (hit, beam), y = 1
x = (the, beam), y = 1
x = (behind, beam), y = 1
…
x = (happy, beam), y = 0
x = (think, beam), y = 0
...

single context: P(y = 1 | c, t) = σ(c·t)
all contexts: P(y = 1 | c1…cL, t) = Π_i σ(c_i·t)

Intuition: t·c is a measure of similarity. But it is not a probability! To make it one, apply the logistic activation: σ(z) = 1 / (1 + e^-z)
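A small sketch of these two probabilities, using randomly initialized toy vectors (the dimensionality and values are arbitrary); the window probability is modeled as the product of the per-context-word sigmoids, the usual skip-gram-with-negative-sampling form:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
dim = 4                                  # toy embedding dimension
t = rng.normal(size=dim)                 # target-word vector (random, for illustration)
contexts = rng.normal(size=(4, dim))     # vectors for context words c1..c4

# Single context word: P(y = 1 | c, t) = sigmoid(c . t)
p_single = sigmoid(contexts[0] @ t)

# All context words in the window: product of the individual probabilities.
p_all = np.prod(sigmoid(contexts @ t))
print(p_single, p_all)
```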

slide-59
SLIDE 59

Word2Vec: How to Learn?

2 Versions of Context:

  • Continuous bag of words (CBOW): Predict word from context
  • Skip-Grams (SG): predict context words from target


(Jurafsky, 2017)

The nail hit the beam behind the wall.

c1 c2 c3 c4

P(y=1| c, t)

slide-60
SLIDE 60

Word2Vec: How to Learn?

2 Versions of Context:

  • Continuous bag of words (CBOW): Predict word from context
  • Skip-Grams (SG): predict context words from target


(Jurafsky, 2017)

The nail hit the beam behind the wall.

c1 c2 c3 c4

P(y = 1 | c, t)

Assume 300 * |vocab| weights (parameters) for each of c and t.

slide-61
SLIDE 61

Word2Vec: How to Learn?

2 Versions of Context:

  • Continuous bag of words (CBOW): Predict word from context
  • Skip-Grams (SG): predict context words from target


(Jurafsky, 2017)

The nail hit the beam behind the wall.

c1 c2 c3 c4

P(y = 1 | c, t)

Assume 300 * |vocab| weights (parameters) for each of c and t.
Start with random vectors (or all 0s).

slide-62
SLIDE 62

Word2Vec: How to Learn?

(Jurafsky, 2017)

The nail hit the beam behind the wall.

c1 c2 c3 c4

P(y = 1 | c, t)

Assume 300 * |vocab| weights (parameters) for each of c and t.
Start with random vectors (or all 0s).
Goal: Maximize similarity of (c, t) in positive data (y = 1); minimize similarity of (c, t) in negative data (y = 0).

slide-63
SLIDE 63

Word2Vec: How to Learn?

P(y = 1 | c, t)

Assume 300 * |vocab| weights (parameters) for each of c and t.
Start with random vectors (or all 0s).
Goal: Maximize similarity of (c, t) in positive data (y = 1); minimize similarity of (c, t) in negative data (y = 0).

slide-64
SLIDE 64

Word2Vec: How to Learn?

P(y = 1 | c, t)

Assume 300 * |vocab| weights (parameters) for each of c and t.
Start with random vectors (or all 0s).
Goal: Maximize similarity of (c, t) in positive data (y = 1); minimize similarity of (c, t) in negative data (y = 0).

slide-65
SLIDE 65

Word2Vec: How to Learn?

P(y = 1 | c, t)

Assume 300 * |vocab| weights (parameters) for each of c and t.
Start with random vectors (or all 0s).
Goal: Maximize similarity of (c, t) in positive data (y = 1); minimize similarity of (c, t) in negative data (y = 0).

Optimized using gradient descent type methods.
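A minimal sketch of one such gradient step for skip-gram with negative sampling; the vocabulary size, dimensionality, learning rate, and the (context, target) indices are all made up, and a real implementation would loop over an entire corpus:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

V, dim, lr = 8, 4, 0.1                      # toy vocabulary size, dimension, learning rate
rng = np.random.default_rng(0)
T = rng.normal(scale=0.1, size=(V, dim))    # target-word embeddings (random start)
C = rng.normal(scale=0.1, size=(V, dim))    # context-word embeddings (random start)

def sgd_step(t_idx, c_idx, y):
    """One step on the logistic loss for a single (context, target) pair with label y."""
    p = sigmoid(C[c_idx] @ T[t_idx])
    err = p - y                             # gradient of the loss w.r.t. the dot product
    C[c_idx], T[t_idx] = C[c_idx] - lr * err * T[t_idx], T[t_idx] - lr * err * C[c_idx]

# One positive pair (y = 1) and two sampled negative pairs (y = 0) for target word 3.
sgd_step(3, 2, 1)
sgd_step(3, 6, 0)
sgd_step(3, 7, 0)
```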

slide-66
SLIDE 66

Word2Vec

(Jurafsky, 2017)

slide-67
SLIDE 67

Word2Vec captures analogies (kind of)

(Jurafsky, 2017)
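One way to try such analogy queries is vector arithmetic over pretrained embeddings. The sketch below assumes the gensim package is installed and downloads a small pretrained GloVe model via gensim's downloader; any pretrained word vectors would work similarly:

```python
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")      # small pretrained vectors (downloaded once)

# king - man + woman ~= queen (works for many, but not all, analogies)
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```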

slide-68
SLIDE 68

(Jurafsky, 2017)

slide-69
SLIDE 69

(Jurafsky, 2017)

slide-70
SLIDE 70

Word2Vec: Quantitative Evaluations

  • Compare to manually annotated pairs of words: WordSim-353 (Finkelstein et al., 2002)
  • Compare to words in context (Huang et al., 2012)
  • Answer TOEFL synonym questions.
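A sketch of the first kind of evaluation: rank-correlate model similarities with human similarity ratings. It assumes gensim and SciPy are available; the three word pairs and their ratings are only illustrative stand-ins for a WordSim-353-style dataset:

```python
from scipy.stats import spearmanr
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")

pairs = [("tiger", "cat"), ("car", "automobile"), ("noon", "string")]
human = [7.35, 8.94, 0.54]                         # illustrative human ratings (0-10 scale)
model = [wv.similarity(a, b) for a, b in pairs]

print(spearmanr(human, model).correlation)         # higher = better agreement with humans
```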

slide-71
SLIDE 71

Current Trends in Embeddings

1. Contextual word embeddings (a different embedding depending on context):
   The nail hit the beam behind the wall.
   They reflected a beam off the moon.

slide-72
SLIDE 72

Current Trends in Embeddings

1. Contextual word embeddings (a different embedding depending on context):
   The nail hit the beam behind the wall.
   They reflected a beam off the moon.
2. Embeddings can capture changes in word meaning.

(Kulkarni et al., 2015)

slide-73
SLIDE 73

Current Trends in Embeddings

1. Contextual word embeddings (a different embedding depending on context):
   The nail hit the beam behind the wall.
   They reflected a beam off the moon.
2. Embeddings can capture changes in word meaning.
3. Embeddings capture demographic biases in data.

(Garg et al., 2018)

slide-74
SLIDE 74

Current Trends in Embeddings

1. Contextual word embeddings (a different embedding depending on context):
   The nail hit the beam behind the wall.
   They reflected a beam off the moon.
2. Embeddings can capture changes in word meaning.
3. Embeddings capture demographic biases in data.
   a. Efforts to debias
   b. Useful for tracking bias over time.

(Garg et al., 2018)

slide-75
SLIDE 75

Current Trends in Embeddings

1. Contextual word embeddings (a different embedding depending on context):
   The nail hit the beam behind the wall.
   They reflected a beam off the moon.
2. Embeddings can capture changes in word meaning.
3. Embeddings capture demographic biases in data.
   a. Efforts to debias
   b. Useful for tracking bias over time.

(Garg et al., 2018)

slide-76
SLIDE 76

Vector Semantics and Embeddings

Take-Aways

  • Dense representation of meaning is desirable.
  • Approach 1: Dimensionality reduction techniques
  • Approach 2: Learning representations by trying to predict held-out words.
  • The Word2Vec skip-gram model attempts to solve this by predicting context words from the target word: maximize similarity between true pairs; minimize similarity between random pairs.

  • Embeddings do in fact seem to capture meaning in applications.
  • Dimensionality reduction techniques are just as good by some evaluations.
  • Current trends: integrating context, tracking changes in meaning.