SLIDE 1

Dimensionality Reduction & Embedding (part 2/2)

2

Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2019s/

Many ideas/slides attributable to: Emily Fox (UW), Erik Sudderth (UCI)

  • Prof. Mike Hughes
slide-2
SLIDE 2

3

Mike Hughes - Tufts COMP 135 - Spring 2019

What will we learn?

[Diagram: course overview. Data examples {x_n}_{n=1}^N are fed to a task (supervised, unsupervised, or reinforcement learning), which produces a summary f(x) that is judged by a performance measure.]

slide-3
SLIDE 3

4

Mike Hughes - Tufts COMP 135 - Spring 2019

Task: Embedding

Supervised Learning Unsupervised Learning Reinforcement Learning

[Diagram: embedding shown as an unsupervised task. Each high-dim. example is mapped to low-dim. coordinates (x1, x2).]

slide-4
SLIDE 4
  • Dim. Reduction/Embedding

Unit Objectives

  • Goals of dimensionality reduction
    • Reduce feature vector size (keep signal, discard noise)
    • “Interpret” features: visualize/explore/understand
  • Common approaches
    • Principal Component Analysis (PCA) + Factor Analysis
    • t-SNE (“tee-snee”)
    • word2vec and other neural embeddings
  • Evaluation Metrics
    • Storage size
    • Reconstruction error
    • “Interpretability”
    • Prediction error

5

Mike Hughes - Tufts COMP 135 - Spring 2019

slide-5
SLIDE 5

6

Mike Hughes - Tufts COMP 135 - Spring 2019

Example: Genes vs. geography

Nature, 2008

slide-6
SLIDE 6

7

Mike Hughes - Tufts COMP 135 - Spring 2019

Example: Genes vs. geography

Where possible, we based the geographic origin on the observed country data for grandparents. We used a ‘strict consensus’ approach: if all observed grandparents originated from a single country, we used that country as the origin. If an individual’s observed grandparents originated from different countries, we excluded the individual. Where grandparental data were unavailable, we used the individual’s country of birth.

Total sample size after exclusion: 1,387 subjects
Features: over half a million variable DNA sites in the human genome

Nature, 2008

slide-7
SLIDE 7

8

Mike Hughes - Tufts COMP 135 - Spring 2019

Eigenvectors and Eigenvalues

slide-8
SLIDE 8

9

Mike Hughes - Tufts COMP 135 - Spring 2019

Source: https://textbooks.math.gatech.edu/ila/eigenvectors.html

slide-9
SLIDE 9

Demo: What is an Eigenvector?

  • http://setosa.io/ev/eigenvectors-and-eigenvalues/
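For a quick numerical check of the same idea, here is a minimal NumPy sketch (the matrix values are made up for illustration): an eigenvector of A is a direction that A only stretches by the matching eigenvalue.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])            # example 2x2 matrix

eigvals, eigvecs = np.linalg.eig(A)   # columns of eigvecs are the eigenvectors

v = eigvecs[:, 0]                     # first eigenvector
lam = eigvals[0]                      # its eigenvalue
print(np.allclose(A @ v, lam * v))    # True: A only scales v, it does not rotate it
```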

10

Mike Hughes - Tufts COMP 135 - Spring 2019

slide-10
SLIDE 10

Centering the Data

Goal: each feature’s mean = 0.0

11

Mike Hughes - Tufts COMP 135 - Spring 2019

slide-11
SLIDE 11

Why center?

  • Think of mean vector as simplest possible “reconstruction” of a dataset
  • No example-specific parameters, just one F-dim vector

12

Mike Hughes - Tufts COMP 135 - Spring 2019

min_{m ∈ R^F}  Σ_{n=1}^N (x_n − m)^T (x_n − m)

m* = mean(x_1, …, x_N)
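A minimal NumPy sketch of this centering step (array names and sizes are illustrative): subtract the per-feature mean so every column of X ends up with mean zero.

```python
import numpy as np

X = np.random.randn(100, 5) + 7.0                 # toy data: N=100 examples, F=5 features
m = X.mean(axis=0)                                # m*: the per-feature mean, shape (F,)
X_centered = X - m                                # each feature now has mean ~0

print(np.allclose(X_centered.mean(axis=0), 0.0))  # True
```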

slide-12
SLIDE 12

Principal Component Analysis

13

Mike Hughes - Tufts COMP 135 - Spring 2019

slide-13
SLIDE 13

Reconstruction with PCA

14

Mike Hughes - Tufts COMP 135 - Spring 2019

x_i = W z_i + m

(x_i: high-dim. data, F-vector; W: basis, F x K; z_i: low-dim. vector, K-vector; m: mean, F-vector)

slide-14
SLIDE 14

Principal Component Analysis

  • Input:
  • X : training data, N x F
  • N high-dim. example vectors
  • K : int, number of components
  • Satisfies 1 <= K <= F
  • Output:
  • m : mean vector, size F
  • W : learned basis of eigenvectors, F x K
  • One F-dim. vector (magnitude 1) for each component
  • Each of the K vectors is orthogonal to every other

15

Mike Hughes - Tufts COMP 135 - Spring 2019

Training step: .fit()

slide-15
SLIDE 15

Principal Component Analysis

  • Input:
  • X : training data, N x F
  • N high-dim. example vectors
  • Trained PCA “model”
  • m : mean vector, size F
  • W : learned basis of eigenvectors, F x K
  • One F-dim. vector (magnitude 1) for each component
  • Each of the K vectors is orthogonal to every other
  • Output:
  • Z : projected data, N x K

16

Mike Hughes - Tufts COMP 135 - Spring 2019

Transformation step: .transform()
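A minimal scikit-learn sketch of the two steps above (the dataset is a toy stand-in). Note that sklearn stores the basis as `components_` with shape (K, F), the transpose of the F x K matrix W on these slides, and that `inverse_transform` gives the reconstruction W z + m.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(500, 20)            # toy training data: N=500, F=20

pca = PCA(n_components=5)               # K = 5
pca.fit(X)                              # training step: learns m and W

m = pca.mean_                           # mean vector, shape (F,)
W = pca.components_.T                   # learned basis, shape (F, K)

Z = pca.transform(X)                    # transformation step: projected data, shape (N, K)
X_hat = pca.inverse_transform(Z)        # reconstruction W z_n + m for every example
print(np.allclose(X_hat, Z @ pca.components_ + m))   # True
```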

slide-16
SLIDE 16

PCA Demo

  • http://setosa.io/ev/principal-component-analysis/

17

Mike Hughes - Tufts COMP 135 - Spring 2019

slide-17
SLIDE 17

Example: EigenFaces

18

Mike Hughes - Tufts COMP 135 - Spring 2019 Credit: Erik Sudderth

slide-18
SLIDE 18

PCA Principles

  • Minimize reconstruction error
    • Should be able to recreate x from z
  • Equivalent to maximizing variance
    • Want reconstructions to retain maximum information

19

Mike Hughes - Tufts COMP 135 - Spring 2019

slide-19
SLIDE 19

PCA: How to Select K?

  • 1) Use downstream supervised task metric
    • Regression error
  • 2) Use memory constraints of task
    • Can’t store more than 50 dims for 1M examples? Take K=50
  • 3) Plot cumulative “variance explained”
    • Take K that seems to capture most or all variance

20

Mike Hughes - Tufts COMP 135 - Spring 2019

slide-20
SLIDE 20

Empirical Variance of Data X

  • (Assumes each feature is centered)

21

Mike Hughes - Tufts COMP 135 - Spring 2019

Var(X) = (1/N) Σ_{n=1}^N x_n^T x_n = (1/N) Σ_{n=1}^N Σ_{f=1}^F x_{nf}^2

slide-21
SLIDE 21

Variance of reconstructions

22

Mike Hughes - Tufts COMP 135 - Spring 2019

Var(X̂) = (1/N) Σ_{n=1}^N x̂_n^T x̂_n
        = (1/N) Σ_{n=1}^N (z_{n1} w_1 + … + z_{nK} w_K)^T (z_{n1} w_1 + … + z_{nK} w_K)
        = (1/N) Σ_{n=1}^N Σ_{k=1}^K z_{nk}^2
        = Σ_{k=1}^K λ_k

Just sum up the top K eigenvalues!

slide-22
SLIDE 22

Proportion of Variance Explained by first K components

23

Mike Hughes - Tufts COMP 135 - Spring 2019

PVE(K) = ( Σ_{k=1}^K λ_k ) / ( Σ_{f=1}^F λ_f )
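A minimal scikit-learn sketch of the matching computation (toy data): `explained_variance_ratio_` holds each λ_k divided by the total variance, so its cumulative sum is PVE(K) and can be used to pick K.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(500, 20)                    # toy data: N=500, F=20

pca = PCA().fit(X)                              # keep all F components
pve = np.cumsum(pca.explained_variance_ratio_)  # PVE(1), PVE(2), ..., PVE(F)

K = int(np.searchsorted(pve, 0.95)) + 1         # smallest K with PVE(K) >= 0.95
print(K, pve[:5])
```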

slide-23
SLIDE 23

Variance explained curve

24

Mike Hughes - Tufts COMP 135 - Spring 2019

slide-24
SLIDE 24

PCA Summary

PRO

  • Usually, fast to train, fast to test
  • Slowest step: finding K eigenvectors of an F x F matrix
  • Nested model
  • PCA with K=5 overlaps with PCA with K=4

CON

  • Sensitive to rescaling of input data features
  • Learned basis known only up to +/- scaling
  • Not often best for supervised tasks

25

Mike Hughes - Tufts COMP 135 - Spring 2019

slide-25
SLIDE 25

PCA: Best Practices

  • If features all have different units
    • Try rescaling to all be within (-1, +1) or have variance 1
  • If features have same units, may not need to do this

26

Mike Hughes - Tufts COMP 135 - Spring 2019

slide-26
SLIDE 26

Beyond PCA: Factor Analysis

27

Mike Hughes - Tufts COMP 135 - Spring 2019

slide-27
SLIDE 27

A Probabilistic Model

28

Mike Hughes - Tufts COMP 135 - Spring 2019

x_i = W z_i + m + ε_i,    ε_i ∼ N(0, I_F)

(x_i: high-dim. data, F-vector; W: basis, F x K; z_i: low-dim. vector, K-vector; m: mean, F-vector; ε_i: noise, F-vector)

slide-28
SLIDE 28

A Probabilistic Model

29

Mike Hughes - Tufts COMP 135 - Spring 2019

x_i = W z_i + m + ε_i

In terms of matrix math:

X = WZ + M + E

slide-29
SLIDE 29

A Probabilistic Model

30

Mike Hughes - Tufts COMP 135 - Spring 2019

x_i = W z_i + m + ε_i,    ε_i ∼ N(0, σ² I_F)

(x_i: high-dim. data, F-vector; W: basis, F x K; z_i: low-dim. vector, K-vector; m: mean, F-vector; ε_i: noise, F-vector, with the same variance σ² for every feature)

slide-30
SLIDE 30

Face Dataset

31

Mike Hughes - Tufts COMP 135 - Spring 2019

ε_i ∼ N(0, σ² I_F)

Is this noise model realistic?

slide-31
SLIDE 31

Each pixel might need own variance!

32

Mike Hughes - Tufts COMP 135 - Spring 2019

ε_i ∼ N(0, diag(σ_1², σ_2², σ_3², …, σ_F²))

slide-32
SLIDE 32

Factor Analysis

  • Finds a linear basis like PCA, but allows per-feature estimation of variance
  • Small detail: columns of estimated basis may not be orthogonal

33

Mike Hughes - Tufts COMP 135 - Spring 2019

ε_i ∼ N(0, diag(σ_1², σ_2², σ_3², …, σ_F²))

slide-33
SLIDE 33

PCA vs Factor Analysis

34

Mike Hughes - Tufts COMP 135 - Spring 2019

slide-34
SLIDE 34

35

Mike Hughes - Tufts COMP 135 - Spring 2019

Matrix Factorization and Singular Value Decomposition

slide-35
SLIDE 35

36

Mike Hughes - Tufts COMP 135 - Spring 2019

Matrix Factorization (MF)

  • User u is represented by a vector w_u ∈ R^K
  • Item i is represented by a vector v_i ∈ R^K
  • The inner product w_u^T v_i approximates the utility y_ui
  • Intuition:
    • Two items with similar vectors get similar utility scores from the same user
    • Two users with similar vectors give similar utility scores to the same item
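A minimal NumPy sketch of this scoring rule (all arrays and sizes are made up): each predicted utility is just the inner product of one user vector and one item vector.

```python
import numpy as np

K = 4                                          # embedding dimension
U = np.random.randn(50, K)                     # one K-dim vector per user  (50 users)
V = np.random.randn(200, K)                    # one K-dim vector per item  (200 items)

Y_hat = U @ V.T                                # Y_hat[u, i] = U[u] . V[i], predicted utility
print(Y_hat.shape)                             # (50, 200)
print(np.isclose(Y_hat[3, 7], U[3] @ V[7]))    # True
```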

slide-36
SLIDE 36

37

Mike Hughes - Tufts COMP 135 - Spring 2019

slide-37
SLIDE 37

General Matrix Factorization

38

Mike Hughes - Tufts COMP 135 - Spring 2019

X = ZW

slide-38
SLIDE 38

SVD: Singular Value Decomposition

39

Mike Hughes - Tufts COMP 135 - Spring 2019 Credit: Wikipedia

slide-39
SLIDE 39

Truncated SVD

40

Mike Hughes - Tufts COMP 135 - Spring 2019

X ≈ U D V^T   (truncated: keep only the top K singular values in D and the corresponding first K columns of U and V)

slide-40
SLIDE 40

Recall: Eigen Decomposition

41

Mike Hughes - Tufts COMP 135 - Spring 2019

Eigenvalues λ_1, λ_2, …, λ_K and eigenvectors w_1, w_2, …, w_K

slide-41
SLIDE 41

Two ways to “fit” PCA

  • First, apply “centering” to X
  • Then, do one of these two options:
  • 1) Compute SVD of X
  • Eigenvalues are rescaled entries of the diagonal D
  • Basis = first K columns of V
  • 2) Compute covariance Cov(X)
  • Eigenvalues = largest eigenvalues of Cov(X)
  • Basis = corresponding eigenvectors of Cov(X)

42

Mike Hughes - Tufts COMP 135 - Spring 2019
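A minimal NumPy sketch of both options (toy data, variable names illustrative), checking that they recover the same top-K basis up to +/- sign flips:

```python
import numpy as np

X = np.random.randn(200, 6)                 # toy data: N=200, F=6
Xc = X - X.mean(axis=0)                     # step 1: centering
N, F = Xc.shape
K = 3

# Option 1: SVD of the centered data
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
W_svd = Vt[:K].T                            # basis: first K rows of V^T, stored as F x K
eigvals_svd = d[:K] ** 2 / N                # eigenvalues: rescaled squared singular values

# Option 2: eigendecomposition of the covariance matrix
cov = Xc.T @ Xc / N
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]           # sort largest-first
W_cov = eigvecs[:, order[:K]]
eigvals_cov = eigvals[order[:K]]

print(np.allclose(eigvals_svd, eigvals_cov))        # same eigenvalues
print(np.allclose(np.abs(W_svd), np.abs(W_cov)))    # same basis, up to sign
```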

slide-42
SLIDE 42

Visualization with t-SNE

43

Mike Hughes - Tufts COMP 135 - Spring 2019

slide-43
SLIDE 43

Reducing Dimensionality of Digit Images

44

Mike Hughes - Tufts COMP 135 - Spring 2019

INPUT: Each image represented by a 784-dimensional vector
Apply PCA transformation with K=2
OUTPUT: Each image is a 2-dimensional vector

slide-44
SLIDE 44

45

Mike Hughes - Tufts COMP 135 - Spring 2019 Credit: Luuk Derksen (https://medium.com/@luckylwk/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b)

slide-45
SLIDE 45

46

Mike Hughes - Tufts COMP 135 - Spring 2019 Credit: Luuk Derksen (https://medium.com/@luckylwk/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b)

slide-46
SLIDE 46

47

Mike Hughes - Tufts COMP 135 - Spring 2019

slide-47
SLIDE 47

Practical Tips for t-SNE

48

Mike Hughes - Tufts COMP 135 - Spring 2019

https://distill.pub/2016/misread-tsne/

  • If dim is very high, preprocess with PCA to ~30 dims, then apply t-SNE
  • Beware: Non-convex cost function
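A minimal scikit-learn sketch of that recipe (toy data; parameter values are common defaults rather than anything from the lecture): PCA down to ~30 dims, then t-SNE down to 2 dims. Because the cost is non-convex, different `random_state` values can give visibly different maps.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.random.randn(1000, 784)                  # e.g. flattened 28x28 images (toy stand-in)

X30 = PCA(n_components=30).fit_transform(X)     # step 1: PCA down to ~30 dims
X2 = TSNE(n_components=2, perplexity=30,
          random_state=0).fit_transform(X30)    # step 2: t-SNE down to 2 dims

print(X2.shape)                                 # (1000, 2), ready to scatter-plot
```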
slide-48
SLIDE 48

49

Mike Hughes - Tufts COMP 135 - Spring 2019

Word Embeddings

slide-49
SLIDE 49

Word Embeddings (word2vec)

50

Goal: map each word in vocabulary to an embedding vector

  • Preserve semantic meaning in this new vector space

vec(swimming) – vec(swim) + vec(walk) = vec(walking)
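A minimal sketch of training and querying such embeddings with the gensim library (gensim ≥ 4 API; the two-sentence corpus is a stand-in, so the resulting vectors are meaningless until trained on real text):

```python
from gensim.models import Word2Vec

sentences = [["i", "swim", "because", "i", "like", "swimming"],
             ["i", "walk", "because", "i", "like", "walking"]]   # stand-in corpus

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

vec = model.wv["swimming"]                         # embedding vector for one word
print(model.wv.most_similar(positive=["swimming", "walk"],
                            negative=["swim"], topn=3))          # analogy-style query
```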

slide-50
SLIDE 50

51

Word Embeddings (word2vec)

Goal: map each word in vocabulary to an embedding vector

  • Preserve semantic meaning in this new vector space
slide-51
SLIDE 51

How to embed?

Training

52

Reward embeddings that predict nearby words in the sentence.

Goal: learn weights W

[Diagram: an embedding table W with one row per word in a fixed vocabulary (typically 1,000-100k words) and an embedding vector per row (typically 100-1000 dimensions).]

Credit: https://www.tensorflow.org/tutorials/representation/word2vec

slide-52
SLIDE 52

Embeddings Everywhere

  • seq2vec
  • med2vec
  • graph2vec
  • https://arxiv.org/abs/1707.05005
  • https://arxiv.org/abs/1805.11921

53

Mike Hughes - Tufts COMP 135 - Spring 2019 Credit: Ivanov & Burnaev ICML 2018 Credit: Choi et al. KDD 2016