The Role of Dimensionality Reduction in Distributional Semantics


SLIDE 1

The Role of Dimensionality Reduction in Distributional Semantics
or: having fun with matrix algebra

Stefan Evert
Technische Universität Darmstadt, Germany
evert@linglit.tu-darmstadt.de

Leuven Statistics Days, 8 June 2012

SLIDE 2

Outline

  • Introduction
      » Definitions and notation
      » Sparse high-dimensional models
  • Dimensionality reduction
      » Singular value decomposition (SVD)
      » Interpretations of SVD
      » Alternatives to SVD
      » A case study
  • Outlook and discussion

SLIDE 3

Outline: Introduction » Definitions and notation

SLIDE 4

General definition of DSMs

A distributional semantic model (DSM) is a scaled and/or transformed co-occurrence matrix M, such that each row m represents the distribution of a target term across contexts.

                get     see     use     hear    eat     kill
    knife      0.027   0.024   0.206   0.022   0.044   0.042
    cat        0.031   0.143   0.243   0.015   0.009   0.131
    dog        0.026   0.021   0.212   0.064   0.013   0.014
    boat       0.022   0.009   0.044   0.040   0.074   0.042
    cup        0.014   0.173   0.249   0.099   0.119   0.042
    pig        0.069   0.094   0.158   0.000   0.094   0.265
    banana     0.047   0.139   0.104   0.022   0.267   0.042

Term = word form, lemma, phrase, morpheme, word pair, . . .
Targets = rows (terms whose distribution is represented)
Features = columns (individual contexts or collocates)

SLIDE 5

Notation: term-context matrix

Frequency matrix F ∈ R^{k×n} with term-context row vectors f_i ∈ R^n, i.e. F is the k × n matrix whose rows are f_1^T, . . . , f_k^T:

               Felidae  Pet  Feral  Bloat  Philosophy  Kant  Back pain
    cat           10     10     7      –        –        –       –
    dog            –     10     4     11        –        –       –
    animal         2     15    10      2        –        –       –
    time           1      –     –      –        2        1       –
    reason         –      1     –      –        1        4       1
    cause          –      –     –      2        1        2       6
    effect         –      –     –      1        –        1       –

Interpretation as a collection of row vectors: F = (f_ij), where f_ij = (f_i)_j = frequency count of target term t_i in context c_j (wrt. context tokens, here: Wikipedia articles).

SLIDE 6

Notation: term-term matrix

Co-occurrence matrix M ∈ R^{k×n} with term-term row vectors m_i ∈ R^n, i.e. M is the k × n matrix whose rows are m_1^T, . . . , m_k^T:

               breed  tail  feed  kill  important  explain  likely
    cat          83    17     7    37       –         1        –
    dog         561    13    30    60       1         2        4
    animal       42    10   109   134      13         5        5
    time         19     9    29   117      81        34      109
    reason        1     –     2    14      68       140       47
    cause         –     1     –     4      55        34       55
    effect        –     –     1     6      60        35       17

Interpretation as a collection of row vectors: M = (m_ij), where m_ij = (m_i)_j = co-occurrence frequency of target term t_i with feature term τ_j (a collocate of t_i).

SLIDES 7–15

DSM parameters

Corpus with linguistic annotation
  » Term-context vs. term-term matrix
  » Type & size of context
  » Feature scaling
  » Similarity/distance measure & normalisation
  » Dimensionality reduction
  » Semantic distance, nearest neighbours, semantic maps, . . .

SLIDES 16–20

Geometric interpretation and semantic distance

  • row vector m_dog describes the usage of the word dog in the corpus
  • can be seen as the coordinates of a point in n-dimensional Euclidean space R^n
  • illustrated for two dimensions: get and use
  • m_dog = (115, 10)
  • similarity = spatial proximity (Euclidean metric)
  • location depends on the frequency of the noun (f_dog ≈ 2.7 · f_cat)
  • direction more important than location
  • normalise the "length" ‖m_dog‖ of the vector
  • or use the angle α as a distance measure

[Plot: two dimensions (get, use) of the English V-Obj DSM, with the points cat, dog, knife, boat; Euclidean distances d = 63.3 and d = 57.5 are shown between neighbouring points, and the angle α = 54.3° illustrates the angular distance measure]
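These two-dimensional notions are easy to check numerically. A minimal NumPy sketch: m_dog = (115, 10) comes from the slide, while m_cat is a hypothetical stand-in for the plotted cat vector.

    import numpy as np

    # Row vectors in the two illustrated dimensions (get, use).
    # m_dog is taken from the slide; m_cat is an invented stand-in.
    m_dog = np.array([115.0, 10.0])
    m_cat = np.array([42.0, 4.0])     # hypothetical

    # Euclidean distance between the raw frequency vectors
    d = np.linalg.norm(m_dog - m_cat)

    # Normalise the "length" of each vector ...
    u_dog = m_dog / np.linalg.norm(m_dog)
    u_cat = m_cat / np.linalg.norm(m_cat)

    # ... or use the angle between them as a distance measure
    cos_alpha = u_dog @ u_cat
    alpha = np.degrees(np.arccos(np.clip(cos_alpha, -1.0, 1.0)))
    print(d, alpha)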

SLIDES 21–23

Euclidean norm & inner product

  • the Euclidean norm ‖x‖₂ = √⟨x, x⟩ is special because it is induced by an inner product: ⟨x, y⟩ := x^T y = x₁y₁ + · · · + xₙyₙ
  • angle φ between vectors x, y ∈ R^n:

        cos φ := ⟨x, y⟩ / (‖x‖ · ‖y‖)

    ⇒ cosine similarity is a popular "distance" measure for DSMs
  • x and y are orthogonal iff ⟨x, y⟩ = 0
  • the shortest connection between a point x and a subspace A is orthogonal to all vectors y ∈ A

SLIDES 24–25

An exercise in matrix algebra

Task: compute the distances (or similarities) between all target terms t_i in the row-normalised matrix M as quickly as possible.

    cos φ_ij = ⟨m_i, m_j⟩ = m_i^T m_j    for i, j ∈ {1, . . . , k}

Arranging all pairwise cosines cos φ_ij in a k × k matrix, each entry is the product of row vector m_i^T with column vector m_j, i.e. a single matrix product:

    cos φ = M · M^T
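A sketch of this trick in NumPy, assuming a small dense toy matrix (a real DSM matrix would be sparse, where scipy.sparse keeps the same product feasible): after row normalisation, one matrix product yields all pairwise cosine similarities.

    import numpy as np

    rng = np.random.default_rng(0)
    M = rng.random((7, 6))          # toy matrix: k=7 targets, n=6 features

    # Row-normalise so that ||m_i|| = 1 for every target term
    M = M / np.linalg.norm(M, axis=1, keepdims=True)

    # All k x k cosine similarities in one matrix product: cos(phi) = M M^T
    cos_phi = M @ M.T
    assert np.allclose(np.diag(cos_phi), 1.0)   # each vector vs. itself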

SLIDE 26

Outline: Introduction » Sparse high-dimensional models

SLIDES 27–29

Distributional Memory (Baroni and Lenci 2010)

  • Tensor of (word, link, word) triples, e.g. (book, obj, read)
  • also (sharp, as adj as, knife); (geek, use, computer); . . .
  • TypeDM: feature scores = local MI (Evert 2004) based on the number of distinct surface realisations of the link pattern
  • 30,686 target terms × 25,336 link types × 30,686 collocates
  • W1 × LW2 matricization yields a state-of-the-art DSM
  • very high-dimensional: 30,686 × 3,127,436 matrix
  • extremely sparse: 131 million nonzero cells = 0.137%
  ⇒ Dimensionality reduction to make the data set manageable
  • e.g. 1.25 M uninformative features with a single nonzero entry

SLIDES 30–32

Goals of dimensionality reduction

  • Numerical convenience
  • Noise reduction (Landauer and Dumais 1997)
  • Latent meaning dimensions (Schütze 1992, 1998)

A simple approach: feature selection
  • drop the least frequent / least variable / least informative . . . features
  • convenient, but provides neither noise reduction nor latent dimensions

General form: map data points into a low-dimensional subspace
  • exploit correlations between features ⇒ less information loss

SLIDE 33

Approaches to dimensionality reduction

Excerpt from a verb-object DSM based on the British National Corpus. Columns: buy, purchase, sell, write, read, draft; empty cells were lost in extraction, so each row lists only its filled entries in column order:

    company:     81 17 50 1 2 2
    ticket:      178 9 98 7
    coffee:      21 9
    electricity: 2 1 15 1
    chocolate:   19 2
    letter:      4 3 950 223 25
    note:        1 2 167 70 4
    statement:   1 18 58 7
    agreement:   45 3 2 13


SLIDE 35

Approaches to dimensionality reduction

Same verb-object excerpt as on SLIDE 33 » feature selection (2 dimensions)

SLIDE 36

Approaches to dimensionality reduction

Same verb-object excerpt as on SLIDE 33 » aggregate meaningful feature combinations

SLIDE 37

Approaches to dimensionality reduction

The excerpt approximated by regression into a 2-dimensional subspace (columns buy, purchase, sell, write, read, draft; empty cells lost in extraction):

    company:     84 7 47 2
    ticket:      177 14 99 7 1 1
    coffee:      20 2 11
    electricity: 8 1 4 1
    chocolate:   15 1 9
    letter:      4 3 948 230 25
    note:        1 1 174 42 5
    statement:   30 7 1
    agreement:   3 2 4 1

SLIDE 38

Approaches to dimensionality reduction

The same rank-2 approximation, now including its small negative entries (minus signs restored from the extraction):

    company:     84 7 47 2 −1 1
    ticket:      177 14 99 8 −1 2 −1 1
    coffee:      20 2 11
    electricity: 8 1 4 1
    chocolate:   15 1 9
    letter:      6 −2 4 −1 948 230 25
    note:        1 1 174 42 5
    statement:   30 7 1
    agreement:   3 2 4 1

by regression into a 2-dimensional subspace

SLIDE 39

Approaches to dimensionality reduction

The same rank-2 approximation with the two latent coordinates (dim 1, dim 2) appended after "|":

    company:     84 7 47 2 −1 1 | 2 96
    ticket:      177 14 99 8 −1 2 −1 1 | 8 203
    coffee:      20 2 11 | 23
    electricity: 8 1 4 1 | 1 9
    chocolate:   15 1 9 | 18
    letter:      6 −2 4 −1 948 230 25 | 976 −2
    note:        1 1 174 42 5 | 179
    statement:   30 7 1 | 31
    agreement:   3 2 4 1 | 4 3

by regression into a 2-dimensional subspace
first dimension: written material · second dimension: commodities
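One way to read "regression into a 2-dimensional subspace" as code: fix two basis vectors of the feature space and replace every row by its least-squares reconstruction from them. A sketch with a random, purely hypothetical basis; in the talk the basis is derived from the data itself, via the SVD introduced in the next section.

    import numpy as np

    rng = np.random.default_rng(1)
    M = rng.poisson(5.0, size=(9, 6)).astype(float)  # toy 9x6 verb-object counts

    # Two basis vectors spanning a 2-dimensional subspace of the feature
    # space; random placeholders here, data-derived in the talk.
    B = rng.random((6, 2))

    # Least-squares coordinates of every row in the subspace spanned by B ...
    coords, *_ = np.linalg.lstsq(B, M.T, rcond=None)   # shape (2, 9)

    # ... and the rank-2 reconstruction (cf. the dim 1 / dim 2 columns)
    M2 = (B @ coords).T
    print(np.round(M2, 1))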

SLIDE 40

Outline: Dimensionality reduction » Singular value decomposition (SVD)

SLIDES 41–42

Dimensionality reduction by orthogonal projection

  • Approach: map data points into a linear subspace with d ≪ n dimensions, shifting their positions as little as possible
  • same intuition as for linear regression: residuals = "noise"
  • i.e. minimise the displacement x̃ − x between the original data point x and the mapped point x̃ in the low-dimensional subspace
  • For each data point x, the best possible mapping is the orthogonal projection x̃ = P_A x into a given subspace A
  • ‖x‖² = ‖P_A x‖² + ‖x − P_A x‖², where the second term is the displacement
  • Based on Euclidean distance

[Diagram: orthogonal projection of a point x onto the line spanned by a unit vector v, with P_v x = ⟨x, v⟩ v and angle φ between x and v]

SLIDE 43

Dimensionality reduction by orthogonal projection

  • d-dimensional subspace A spanned by basis vectors b_1, . . . , b_d with ⟨b_i, b_j⟩ = δ_ij, forming an orthogonal n × d matrix Q with columns b_1, . . . , b_d:

        P_A x = ∑_{i=1}^{d} b_i (b_i^T x) = Q Q^T x

  • P_A x = Q Q^T x = projection into the subspace A ⊆ R^n
  • Q^T x = projection into the internal Cartesian coordinates of A
  • ‖Q^T x‖ = ‖P_A x‖ (Q is an isometric embedding)
  • Q^T Q = I_d (identity matrix)
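A quick numerical sanity check of these identities (a sketch; the subspace is random, its orthonormal basis obtained from a QR decomposition):

    import numpy as np

    rng = np.random.default_rng(2)
    n, d = 6, 2
    # Orthonormal basis of a random d-dimensional subspace of R^n
    Q, _ = np.linalg.qr(rng.standard_normal((n, d)))   # columns b_1..b_d

    x = rng.standard_normal(n)
    proj = Q @ (Q.T @ x)      # P_A x = Q Q^T x, projection into A
    coords = Q.T @ x          # internal Cartesian coordinates of A

    assert np.allclose(Q.T @ Q, np.eye(d))             # Q^T Q = I_d
    assert np.allclose(np.linalg.norm(coords),
                       np.linalg.norm(proj))           # isometry
    # Pythagoras: ||x||^2 = ||P_A x||^2 + ||x - P_A x||^2
    assert np.isclose(x @ x, proj @ proj + (x - proj) @ (x - proj))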

SLIDE 44

Dimensionality reduction by orthogonal projection

  • Project the row vectors m of the co-occurrence matrix M by matrix multiplication ⇒ row vectors m̃ of the matrix M̃:

        M̃ = M (P_A)^T = M Q Q^T

  • Total displacement is given by the Frobenius norm:

        ‖M̃ − M‖² = ∑_{i=1}^{k} ‖P_A m_i − m_i‖² = ‖M (P_A)^T − M‖² = ‖M‖² − ‖M (P_A)^T‖²

  ⇒ Goal: find the subspace A that maximises ‖M (P_A)^T‖² = ‖M Q Q^T‖² = ‖M Q‖²

SLIDES 45–54

Dimensionality reduction by orthogonal projection

  • For a one-dimensional subspace: P_A = b b^T, so maximise

        ‖Mb‖² = ⟨Mb, Mb⟩ = (b^T M^T)(M b) = b^T (M^T M) b

[Animation: data points in the (buy, sell) plane of the verb-object DSM, projected onto candidate directions b; the variance captured by the projection changes with the direction (shown values: 1.26, 0.36, 0.72, 0.9). The labelled points are: book, bottle, good, house, packet, part, stock, system, advertising, arm, asset, car, clothe, collection, copy, dress, food, insurance, land, liquor, number, one, pair, pound, product, property, share, suit, ticket, time, year]

SLIDES 55–56

Dimensionality reduction by orthogonal projection

  • For a one-dimensional subspace: P_A = b b^T, so maximise

        ‖Mb‖² = ⟨Mb, Mb⟩ = (b^T M^T)(M b) = b^T (M^T M) b

  • Solution: b = eigenvector for the largest eigenvalue of the symmetric, positive semi-definite covariance matrix M^T M
  • The best d-dimensional subspace is given by orthogonal eigenvectors b_1, . . . , b_d corresponding to the d largest eigenvalues s_1 ≥ s_2 ≥ . . . ≥ s_d ≥ 0 of M^T M
  • Quality of the approximation:
      » ‖M Q_d‖² = s_1 + · · · + s_d  vs.  ‖M‖² = ∑_{i=1}^{n} s_i
      » relative "importance" of dimension b_i given by s_i / ‖M‖²

SLIDE 57

Eigenvalue decomposition

  • The symmetric, positive semi-definite matrix M^T M has the eigenvalue decomposition

        M^T M = V · S · V^T

    where V is an orthogonal matrix of eigenvectors (columns v_1, v_2, . . . , v_n) and S = Diag(s_1, . . . , s_n) is a diagonal matrix of eigenvalues
  • Best d-dimensional subspace: b_1 = v_1, . . . , b_d = v_d
  • Dimensionality reduction: M_d = M V_d
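A sketch of this recipe in NumPy; numpy.linalg.eigh returns the eigenvalues of the symmetric matrix M^T M in ascending order, so they are re-sorted to obtain s_1 ≥ s_2 ≥ . . .:

    import numpy as np

    rng = np.random.default_rng(3)
    M = rng.random((9, 6))

    # Eigenvalue decomposition of the symmetric, p.s.d. matrix M^T M
    s, V = np.linalg.eigh(M.T @ M)    # ascending eigenvalues
    order = np.argsort(s)[::-1]       # re-sort: s_1 >= s_2 >= ...
    s, V = s[order], V[:, order]

    d = 2
    V_d = V[:, :d]                    # best d-dimensional subspace
    M_d = M @ V_d                     # reduced matrix (internal coordinates)

    # Quality of the approximation: ||M V_d||^2 = s_1 + ... + s_d
    assert np.isclose(np.linalg.norm(M_d)**2, s[:d].sum())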

SLIDE 58

Singular value decomposition (SVD)

  • The idea of eigenvalue decomposition can be generalised to an arbitrary (non-symmetric, non-square) matrix M
    ⇒ such a matrix need not have any eigenvalues
  • Singular value decomposition (SVD) factorises M into

        M = U · Σ · V^T

    where U and V are orthogonal coordinate transformations and Σ is a rectangular-diagonal matrix of singular values (with the customary ordering σ_1 ≥ σ_2 ≥ · · · ≥ σ_n ≥ 0)
  • Truncated SVD only computes the first d nonzero singular values
    ⇒ Σ becomes a square d × d matrix

SLIDE 59

Truncated SVD illustration

    M ≈ M̃_d = U_d · Σ_d · V_d^T

  • M̃_d: k × n (the rank-d approximation of M)
  • U_d: k × d
  • Σ_d = Diag(σ_1, . . . , σ_d): d × d
  • V_d^T: d × n

SLIDE 60

Dimensionality reduction by SVD

    M^T M = (U Σ V^T)^T (U Σ V^T) = V Σ (U^T U) Σ V^T = V Σ² V^T     (U^T U = I_d)

  • Eigenvectors of M^T M = right singular vectors of M (columns of V), with eigenvalues s_i = σ_i², i.e. S = Σ²
  • Dimensionality reduction by SVD:

        M_d = M V_d = U_d Σ_d                 (in R^d)
        M̃_d = M V_d V_d^T = U_d Σ_d V_d^T     (in the original space)

  ⇒ "importance" of dimension v_i given by σ_i²
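The same reduction via SVD, sketched with numpy.linalg.svd on a dense toy matrix (for a large sparse matrix one would use a truncated solver such as scipy.sparse.linalg.svds instead):

    import numpy as np

    rng = np.random.default_rng(4)
    M = rng.random((9, 6))

    U, sigma, Vt = np.linalg.svd(M, full_matrices=False)

    d = 2
    U_d, S_d, Vt_d = U[:, :d], np.diag(sigma[:d]), Vt[:d, :]

    M_red = U_d @ S_d            # = M V_d, reduced coordinates in R^d
    M_tilde = U_d @ S_d @ Vt_d   # best rank-d approximation, original space

    assert np.allclose(M_red, M @ Vt_d.T)     # M V_d = U_d Sigma_d
    # eigenvalues of M^T M are the squared singular values: s_i = sigma_i^2
    s = np.linalg.eigvalsh(M.T @ M)[::-1]
    assert np.allclose(s[:d], sigma[:d]**2)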

SLIDE 61

SVD dimensionality reduction example

The rank-2 example matrix with latent coordinates dim 1 / dim 2 shown on SLIDE 39 is exactly the truncated SVD:

    M̃_2 = U_2 Σ_2 V_2^T = u_1 σ_1 v_1^T + u_2 σ_2 v_2^T
    M V_2 = U_2 Σ_2

SLIDE 62

Outline: Dimensionality reduction » Interpretations of SVD

SLIDES 63–66

Interpretations of SVD

  • "Noise reduction": projection into a d-dimensional subspace
    ⇒ minimise cost = displacement of points (Euclidean distance)
  • Matrix approximation: M̃_d is the best rank-d approximation of M
    ⇒ minimise the Frobenius norm ‖M̃_d − M‖² = ∑_{i=1}^{k} ‖m̃_i − m_i‖²
  • Distance-preserving embedding into d-dimensional space
    ⇒ minimise ∑_{i=1}^{k} ∑_{j=1}^{k} (‖m_i − m_j‖ − ‖m̃_i − m̃_j‖)²
    ⇒ principal component analysis (PCA) is the best distance-preserving projection = SVD for column-centered M (i.e. ∑_i m_i = 0)
  • Latent class model (⇒ latent meaning dimensions)
    ⇒ M̃_d = ∑_{i=1}^{d} u_i σ_i v_i^T (conditional independence given class i)

SLIDE 67

SVD as a topic model

  • Truncated SVD decomposition of a term-document matrix:

        F ≈ F̃ = ∑_{i=1}^{d} u_i σ_i v_i^T

  • σ_i = prior frequency of topic i
  • u_i = word frequency distribution for topic i
  • v_i = contribution of topic i to each document
    ⇒ assumes unscaled frequency counts F
  • This topic model is known as latent semantic indexing (LSI)
  • Latent semantic analysis (LSA; Landauer and Dumais 1997) interprets topics as meaning components
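The LSI reading can be sketched on a tiny, invented term-document matrix; each retained singular triple (u_i, σ_i, v_i) is printed as one "topic" (note that SVD components may contain negative values, so "distribution" is to be taken loosely):

    import numpy as np

    # Toy term-document frequency matrix F (terms x documents), raw counts
    F = np.array([
        [10., 8., 0., 1.],
        [ 7., 9., 1., 0.],
        [ 0., 1., 9., 7.],
        [ 1., 0., 8., 10.],
    ])

    U, sigma, Vt = np.linalg.svd(F, full_matrices=False)

    d = 2   # number of "topics"
    for i in range(d):
        print(f"topic {i}: weight sigma = {sigma[i]:.2f}")
        print("  word weights u_i:        ", np.round(U[:, i], 2))
        print("  document weights v_i:    ", np.round(Vt[i], 2))

    # rank-d reconstruction: F ~ sum_i u_i sigma_i v_i^T
    F_tilde = sum(sigma[i] * np.outer(U[:, i], Vt[i]) for i in range(d))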


SLIDE 69

Interpretations of SVD (continued)

  • Matrix factorization M̃ = U Σ V^T with Σ = Diag(σ_1, . . . , σ_d)
    ⇒ SVD: Frobenius cost ‖M̃ − M‖², U and V orthogonal, Σ ≥ 0
    ⇒ always implies a latent class model
    ⇒ Σ can be absorbed into U, V under relaxed constraints

SLIDES 70–72

Is SVD really a distance-preserving embedding?

  • SVD is equivalent to PCA only for a column-centered matrix
      » centering destroys the sparseness and non-negativity of M
      » does not seem appropriate for highly skewed frequency data
      » PCA preserves Euclidean distance, but DSMs often use cosine
  ⇒ SVD preserves inner products = cosine for row-normalised M
      » recall that cos φ = M M^T if ‖m_i‖ = 1 ∀i

        M M^T = U Σ (V^T V) Σ U^T = U Σ² U^T     (V^T V = I)

  • since U is isometric, the best rank-d approximation to M M^T is given by the first singular values: U_d Σ_d² U_d^T = (U_d Σ_d)(U_d Σ_d)^T
  ⇒ M̃_d = U_d Σ_d preserves the inner products ⟨m̃_i, m̃_j⟩ (and hence cosines computed without renormalisation of M̃_d)
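A numerical check of this claim (a sketch): for a row-normalised toy matrix, the full factor U Σ reproduces M M^T exactly, and its truncation U_d Σ_d approximates it.

    import numpy as np

    rng = np.random.default_rng(5)
    M = rng.random((7, 5))
    M = M / np.linalg.norm(M, axis=1, keepdims=True)  # rows: ||m_i|| = 1

    U, sigma, Vt = np.linalg.svd(M, full_matrices=False)

    # Full-rank check: M M^T = (U Sigma)(U Sigma)^T
    X = U * sigma                 # = U Sigma, row i is m_i in new coordinates
    assert np.allclose(X @ X.T, M @ M.T)

    # Truncation: U_d Sigma_d approximates the inner products (hence cosines)
    d = 2
    X_d = U[:, :d] * sigma[:d]
    approx_cos = X_d @ X_d.T      # no renormalisation of the reduced vectors
    print(np.abs(approx_cos - M @ M.T).max())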

SLIDE 73

Outline: Dimensionality reduction » Alternatives to SVD

SLIDE 74

Alternative dimensionality reduction techniques

Different methods are available depending on the interpretation of SVD:

  • SVD as orthogonal projection
      » random indexing (RI) projects into a random subspace
      » randomly generated unit basis vectors b_i (sparse or Gaussian) are approximately orthogonal, i.e. ⟨b_i, b_j⟩ ≈ δ_ij
      » Johnson-Lindenstrauss lemma: distances are preserved well if d is sufficiently high (cf. Papadimitriou et al. 1998)
    ⇒ no "noise reduction" effect (correlations are not exploited)
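A minimal random-projection sketch with Gaussian directions (the sparse ternary variant typically used in random indexing behaves analogously); all dimensions here are invented:

    import numpy as np

    rng = np.random.default_rng(6)
    k, n, d = 100, 5000, 500
    M = rng.random((k, n)) * (rng.random((k, n)) < 0.01)  # sparse-ish toy data

    # Gaussian random directions, scaled by 1/sqrt(d) so that squared
    # distances are preserved in expectation (Johnson-Lindenstrauss)
    R = rng.standard_normal((n, d)) / np.sqrt(d)
    M_red = M @ R

    # Distances are approximately preserved if d is large enough
    i, j = 0, 1
    print(np.linalg.norm(M[i] - M[j]), np.linalg.norm(M_red[i] - M_red[j]))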

SLIDES 75–76

Alternative dimensionality reduction techniques (continued)

  • SVD as rank-d matrix approximation
      » wrt. other cost functions, e.g. ‖M̃ − M‖₁
      » I am not aware of any standard algorithm / implementation
  • SVD as decorrelation
      » independent component analysis (ICA) has been applied to the separation of word senses (Rapp 2003)
    ⇒ does not seem useful for dimensionality reduction

SLIDES 77–78

Alternative dimensionality reduction techniques (continued)

  • SVD as distance-preserving embedding
      » non-linear and non-metric embeddings: kernel PCA, (non-metric) multidimensional scaling (MDS), . . .
  • SVD as matrix factorization
      » non-negative matrix factorization (NMF; Lee and Seung 2001)
      » M ≈ WH with W, H ≥ 0
      » cost function: Frobenius ‖M − WH‖², cross-entropy, . . .
    ⇒ expensive iterative algorithm, non-unique solution
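The multiplicative-update algorithm of Lee and Seung (2001) for the Frobenius cost fits in a few lines; a sketch on a random non-negative toy matrix (the small constant guards against division by zero; in practice one might reach for sklearn.decomposition.NMF instead):

    import numpy as np

    rng = np.random.default_rng(7)
    M = rng.random((9, 6))         # non-negative data matrix

    d = 2
    W = rng.random((9, d)) + 0.1   # random non-negative initialisation
    H = rng.random((d, 6)) + 0.1

    # Multiplicative updates for the Frobenius cost ||M - WH||^2
    for _ in range(500):
        H *= (W.T @ M) / (W.T @ W @ H + 1e-12)
        W *= (M @ H.T) / (W @ H @ H.T + 1e-12)

    print(np.linalg.norm(M - W @ H)**2)   # cost decreases monotonically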

SLIDE 79

Alternative dimensionality reduction techniques (continued)

  • SVD as a latent class (topic) model
      » probabilistic topic models are more plausible for frequency data, e.g. PLSA (Hofmann 1999)
      » PLSA is equivalent to NMF with the cross-entropy cost function
      » latent Dirichlet allocation (LDA) and other Bayesian models

SLIDE 80

Outline: Dimensionality reduction » A case study

SLIDES 81–83

A case study on the usefulness of dimensionality reduction

  • Distributional Memory with W1 × LW2 matricization
      » k = 30,686 target terms
      » n = 3,127,436 feature dimensions
  • Two standard evaluation tasks
      » TOEFL synonym test (Landauer and Dumais 1997)
      » WordSim-353 semantic similarity ratings for 353 noun pairs (Finkelstein et al. 2002), evaluated with Spearman rank correlation ρ
  • Dimensionality reduction techniques (see the sketch below)
      » feature selection (based on the number of nonzero entries)
      » random indexing (RI) with sparse random vectors
      » RI + singular value decomposition (using randomized SVD)
      » aggregation: collapse the DM tensor into a W1 × W2 matrix (yields a 30,686 × 30,686 matrix with 6.41% nonzero cells)
  • Caveat: no parameter optimization
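A loose sketch of the three reduction steps on an invented toy matrix; the shapes, sparsity and the plain numpy.linalg.svd call are placeholders for the actual experiment (a 30,686 × 3,127,436 sparse matrix reduced with randomized SVD):

    import numpy as np

    rng = np.random.default_rng(8)
    k, n = 200, 10000
    M = rng.random((k, n)) * (rng.random((k, n)) < 0.002)  # toy sparse DSM

    # 1) Feature selection: keep the columns with the most nonzero entries
    nnz = (M != 0).sum(axis=0)
    top = np.argsort(nnz)[::-1][:1000]
    M_sel = M[:, top]

    # 2) Random indexing: project onto d sparse random unit basis vectors
    d = 300
    R = rng.choice([-1.0, 0.0, 1.0], size=(n, d), p=[0.05, 0.9, 0.05])
    R /= np.linalg.norm(R, axis=0, keepdims=True)
    M_ri = M @ R

    # 3) RI followed by SVD of the already-reduced matrix
    U, sigma, Vt = np.linalg.svd(M_ri, full_matrices=False)
    M_svd = U[:, :100] * sigma[:100]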

SLIDES 84–92

A case study on the usefulness of dimensionality reduction

Results (TOEFL accuracy, WordSim-353 Spearman ρ, and Pearson r from the model-vs-human scatterplot):

                           TOEFL    WordSim ρ      r
    full 3.1M              76.3%      .430       0.438
    top 1M                 76.3%      .430       0.439
    top 100k               77.5%      .430       0.442
    top 5k                 71.3%      .400       0.408
    RI 5k                  76.3%      .439       0.442
    RI 1k                  78.8%      .400       0.419
    RI 6k + SVD 300        67.5%      .426       0.433
    W1 × W2 full 30k       76.3%      .461       0.458
    W1 × W2 SVD 300        70.0%      .489       0.482

[Scatterplots per configuration: model similarity vs. human WordSim-353 ratings, titled "DM (w,lw) normalized | <configuration>" (last two rows: "DM (w,w) normalized"); 15 word pairs are missing in each case]

SLIDE 93

Outline: Outlook and discussion

SLIDE 94

Things I love to talk about . . .

  • Analysis of PLSA as matrix factorization
  • Term-document vs. term-term matrix, higher-order models
    ⇒ can be illustrated nicely for sentence context
  • Composition and dimensionality reduction
    ⇒ is vector multiplication etc. compatible with SVD?
  • Sentence and document vectors
    ⇒ centroid? compositional?
  • Non-linear dimensionality reduction techniques
    ⇒ useful for sparse high-dimensional vectors?
  • Broad-scale evaluation and parameter optimization of DSMs
    ⇒ single evaluation tasks give a skewed picture
  • Extension to tensor factorization
    ⇒ Tucker decomposition, non-negative tensor factorization

SLIDE 95

References I

Baroni, Marco and Lenci, Alessandro (2010). Distributional Memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4), 673–712.

Evert, Stefan (2004). The Statistics of Word Cooccurrences: Word Pairs and Collocations. Dissertation, Institut für maschinelle Sprachverarbeitung, University of Stuttgart. Published in 2005, URN urn:nbn:de:bsz:93-opus-23714. Available from http://www.collocations.de/phd.html.

Finkelstein, Lev; Gabrilovich, Evgeniy; Matias, Yossi; Rivlin, Ehud; Solan, Zach; Wolfman, Gadi; Ruppin, Eytan (2002). Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1), 116–131.

Hofmann, Thomas (1999). Probabilistic latent semantic analysis. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI'99).

Landauer, Thomas K. and Dumais, Susan T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104(2), 211–240.

Lee, Daniel D. and Seung, H. Sebastian (2001). Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems 13: Proceedings of the NIPS 2000 Conference, pages 556–562. MIT Press.

SLIDE 96

References II

Papadimitriou, Christos H.; Raghavan, Prabhakar; Tamaki, Hisao; Vempala, Santosh (1998). Latent semantic indexing: A probabilistic analysis. In Proceedings of the 17th ACM Symposium on the Principles of Database Systems, pages 159–168.

Rapp, Reinhard (2003). Die Erkennung semantischer Mehrdeutigkeiten mittels Unabhängigkeitsanalyse [Recognising semantic ambiguities by means of independence analysis]. In Proceedings of the GLDV-Frühjahrstagung 2003, Köthen, Germany.

Schütze, Hinrich (1992). Dimensions of meaning. In Proceedings of Supercomputing '92, pages 787–796, Minneapolis, MN.

Schütze, Hinrich (1998). Automatic word sense discrimination. Computational Linguistics, 24(1), 97–123.