SLIDE 1

Dimensionality Reduction & Embedding

Prof. Mike Hughes

Tufts COMP 135: Introduction to Machine Learning
https://www.cs.tufts.edu/comp/135/2019s/

Many ideas/slides attributable to: Liping Liu (Tufts), Emily Fox (UW), Matt Gormley (CMU)
SLIDE 2

What will we learn?

Data examples $\{x_n\}_{n=1}^N$ are the starting point for three paradigms: Supervised Learning, Unsupervised Learning, and Reinforcement Learning. Each paradigm is summarized by its task, the function $f(x)$ it learns, and a performance measure.

SLIDE 3

Task: Embedding

Of the three paradigms (Supervised, Unsupervised, Reinforcement Learning), embedding is an unsupervised learning task.

[Figure: 2D scatter of examples with axes x1 and x2, illustrating an embedding.]

SLIDE 4
Unit Objectives: Dim. Reduction / Embedding

  • Goals of dimensionality reduction
    • Reduce feature vector size (keep signal, discard noise)
    • “Interpret” features: visualize/explore/understand
  • Common approaches
    • Principal Component Analysis (PCA)
    • t-SNE (“tee-snee”)
    • word2vec and other neural embeddings
  • Evaluation metrics
    • Storage size
    • Reconstruction error
    • “Interpretability”
    • Prediction error

SLIDE 5

Example: 2D viz. of movies

SLIDE 6

Example: Genes vs. geography

SLIDE 7

Example: Eigen Clothing

SLIDE 8

SLIDE 9

Principal Component Analysis

SLIDE 10

Linear Projection to 1D

SLIDE 11

Reconstruction from 1D to 2D
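A minimal restatement of what these two slides depict, assuming a unit-length direction vector v and the data mean m (the same m and V notation used on the PCA input/output slides below):

```latex
% Linear projection of each example x_n down to a scalar code z_n,
% and reconstruction of a full-dimensional vector from that code:
z_n = v^\top (x_n - m), \qquad \hat{x}_n = m + z_n \, v
```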

SLIDE 12

2D Orthogonal Basis

SLIDE 13

Which 1D projection is best?

SLIDE 14

PCA Principles

  • Minimize reconstruction error
    • Should be able to recreate x from z
  • Equivalent to maximizing variance
    • Want z to retain maximum information (see the sketch below)
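Stated a bit more explicitly (a minimal sketch using the z_n and reconstruction defined earlier; not verbatim from the slides):

```latex
\text{reconstruction view:}\quad \min_{\|v\|_2 = 1} \; \sum_{n=1}^{N} \| x_n - \hat{x}_n \|_2^2
\qquad\Longleftrightarrow\qquad
\text{variance view:}\quad \max_{\|v\|_2 = 1} \; \frac{1}{N} \sum_{n=1}^{N} \big( v^\top (x_n - m) \big)^2
```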

SLIDE 15

Best Direction related to Eigenvalues of Data Covariance
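A small numpy sketch of that connection (my code, not from the slides): the best 1D direction is the eigenvector of the data covariance matrix with the largest eigenvalue.

```python
import numpy as np

def top_principal_direction(X):
    """Return (m, v): the data mean and the unit direction of maximum variance.

    X : array of shape (N, F), one example per row.
    """
    m = X.mean(axis=0)
    S = np.cov(X - m, rowvar=False)        # F x F covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues in ascending order
    return m, eigvecs[:, -1]               # eigenvector of the largest eigenvalue
```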

SLIDE 16

Principal Component Analysis: training step, .fit() (see the sketch below)

  • Input:
    • X : training data, N x F
      • N high-dim. example vectors
    • K : int, number of dimensions to discover
      • Satisfies 1 <= K <= F
  • Output:
    • m : mean vector, size F
    • V : learned eigenvector basis, K x F
      • One F-dimensional vector for each component
      • Each of the K vectors is orthogonal to every other
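A minimal numpy sketch of this training step (in scikit-learn the equivalent call would be something like sklearn.decomposition.PCA(n_components=K).fit(X)):

```python
import numpy as np

def pca_fit(X, K):
    """Fit PCA. Returns m (mean vector, size F) and V (eigenvector basis, K x F).

    X : (N, F) training data;  K : number of components, with 1 <= K <= F.
    """
    m = X.mean(axis=0)
    S = np.cov(X - m, rowvar=False)              # F x F covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)         # ascending eigenvalues
    topK = np.argsort(eigvals)[::-1][:K]         # indices of the K largest
    V = eigvecs[:, topK].T                       # K x F, rows are orthonormal
    return m, V
```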

SLIDE 17

Principal Component Analysis: transformation step, .transform() (see the sketch below)

  • Input:
    • X : training data, N x F
      • N high-dim. example vectors
    • Trained PCA “model”
      • m : mean vector, size F
      • V : learned eigenvector basis, K x F
        • One F-dimensional vector for each component
        • Each of the K vectors is orthogonal to every other
  • Output:
    • Z : projected data, N x K
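And a matching sketch of the transformation step (plus the reconstruction used when measuring reconstruction error), assuming the m and V returned by pca_fit above:

```python
def pca_transform(X, m, V):
    """Project data onto the learned basis: Z = (X - m) V^T, shape (N, K)."""
    return (X - m) @ V.T

def pca_reconstruct(Z, m, V):
    """Map K-dim codes back to F-dim vectors: X_hat = Z V + m."""
    return Z @ V + m
```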

SLIDE 18

PCA Demo

  • http://setosa.io/ev/principal-component-analysis/

SLIDE 19

Example: EigenFaces

SLIDE 20

PCA: How to Select K?

  • 1) Use downstream supervised task metric
    • e.g., regression error
  • 2) Use memory constraints of task
    • Can’t store more than 50 dims for 1M examples? Take K=50.
  • 3) Plot cumulative “variance explained” (see the sketch below)
    • Take the K that captures ~90% of all variance
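A small sketch of option 3 using scikit-learn (explained_variance_ratio_ is the fitted attribute holding per-component variance fractions):

```python
import numpy as np
from sklearn.decomposition import PCA

def choose_K(X, target=0.90):
    """Smallest K whose components explain at least `target` fraction of variance."""
    pca = PCA().fit(X)                               # fit with all components
    cum_var = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cum_var, target)) + 1
```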

SLIDE 21

PCA Summary

PRO

  • Usually fast to train, fast to test
    • Slow only if finding K eigenvectors of an F x F matrix is slow
  • Nested model
    • PCA with K=5 contains the parameters of PCA with K=4 as a subset

CON

  • Learned basis known only up to a +/- sign flip
  • Not often best for supervised tasks

SLIDE 22

Visualization with t-SNE

SLIDE 23

[Figure] Credit: Luuk Derksen (https://medium.com/@luckylwk/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b)

SLIDE 24

[Figure] Credit: Luuk Derksen (https://medium.com/@luckylwk/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b)

SLIDE 25

SLIDE 26

Practical Tips for t-SNE

https://distill.pub/2016/misread-tsne/

  • If dim is very high, preprocess with PCA to ~30 dims, then apply t-SNE (see the sketch below)
  • Beware: Non-convex cost function
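A minimal sketch of that recipe with scikit-learn (the specific parameter values here are illustrative choices, not from the slides):

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def embed_2d(X, pca_dims=30, perplexity=30.0, random_state=0):
    """PCA down to ~30 dims first, then t-SNE down to 2 dims for plotting."""
    X_reduced = PCA(n_components=pca_dims).fit_transform(X)
    # t-SNE's cost is non-convex: different random seeds can give different maps
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=random_state)
    return tsne.fit_transform(X_reduced)
```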
SLIDE 27


Matrix Factorization as Learned “Embedding”

SLIDE 28

Matrix Factorization (MF)

  • User $u$ represented by vector $w_u \in \mathbb{R}^K$
  • Item $i$ represented by vector $h_i \in \mathbb{R}^K$
  • Inner product $w_u^\top h_i$ approximates the utility $y_{ui}$ (see the sketch below)
  • Intuition:
    • Two items with similar vectors get similar utility scores from the same user
    • Two users with similar vectors give similar utility scores to the same item
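A tiny numpy sketch of the prediction rule and the squared-error objective MF is typically trained to minimize (variable names are mine, not from the slides):

```python
import numpy as np

def predict_utility(W, H, u, i):
    """Predicted utility of item i for user u: inner product of their K-dim vectors.

    W : (num_users, K) user vectors;  H : (num_items, K) item vectors.
    """
    return W[u] @ H[i]

def mf_loss(W, H, observed):
    """Sum of squared errors over observed (user, item, utility) triples."""
    return sum((y - predict_utility(W, H, u, i)) ** 2 for u, i, y in observed)
```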

SLIDE 29

SLIDE 30


Word Embeddings

SLIDE 31

Word Embeddings (word2vec)


Goal: map each word in vocabulary to an embedding vector

  • Preserve semantic meaning in this new vector space

vec(swimming) – vec(swim) + vec(walk) ≈ vec(walking)
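A minimal numpy sketch of how such an analogy is checked in practice, assuming a hypothetical dict `vecs` that maps each vocabulary word to its embedding vector:

```python
import numpy as np

def analogy(vecs, a, b, c):
    """Return the vocabulary word closest (by cosine) to vec(a) - vec(b) + vec(c)."""
    target = vecs[a] - vecs[b] + vecs[c]
    target = target / np.linalg.norm(target)
    best_word, best_sim = None, -np.inf
    for word, v in vecs.items():
        if word in (a, b, c):
            continue                        # skip the query words themselves
        sim = (v / np.linalg.norm(v)) @ target
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# Expected behavior: analogy(vecs, "swimming", "swim", "walk") -> "walking"
```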

SLIDE 32

SLIDE 33

How to embed?

Training

Reward embeddings that predict nearby words in the sentence.

Goal: learn the weights of an embedding matrix W, with one row per word in a fixed vocabulary (typically 1000-100k words) and one column per embedding dimension (typically 100-1000). (A sketch follows below.)

[Figure: example rows of W for words such as “tacos”, “staff”, “dinosaur”, “hammer”, with entries like 3.2, 4.1, 7.1.]

Credit: https://www.tensorflow.org/tutorials/representation/word2vec
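A minimal sketch of where that training signal comes from, loosely following the skip-gram setup in the linked tutorial (simplified; function and variable names are mine):

```python
def skipgram_pairs(token_ids, window=2):
    """Yield (center, context) word-id pairs from one sentence of token ids."""
    for pos, center in enumerate(token_ids):
        for offset in range(-window, window + 1):
            ctx = pos + offset
            if offset != 0 and 0 <= ctx < len(token_ids):
                yield center, token_ids[ctx]

def score(W_in, W_out, center, context):
    """Dot-product score: high when the center word's embedding predicts the context word.

    W_in, W_out : (vocab_size, embed_dim) input and output embedding matrices.
    """
    return W_in[center] @ W_out[context]
```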