Kernel Methods For Regression and Classification - Mike Hughes, Tufts COMP 135, Fall 2020



SLIDE 1


Summary of Unit 5: Kernel Methods For Regression and Classification

Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2020f/

SLIDE 2


SVM vs. Logistic Regression

  • Loss: hinge (SVM) vs. cross entropy / log loss (Logistic Regression)
  • Sensitive to outliers: less sensitive (SVM) vs. more sensitive (Logistic Regression)
  • Probabilistic? No (SVM) vs. Yes (Logistic Regression)
  • Multi-class? Only via a separate model for each class, one-vs-all (SVM) vs. easy, using softmax (Logistic Regression)
  • Kernelizable? (covered next class) Yes, with speed benefits from sparsity (SVM) vs. Yes (Logistic Regression)
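For intuition, a minimal sketch (my addition, not from the slides) of the two loss functions, written as a function of the margin y * f(x) under the usual +/-1 label convention:

```python
import numpy as np

def hinge_loss(margin):
    # SVM hinge loss: zero once an example is correctly classified beyond the margin
    return np.maximum(0.0, 1.0 - margin)

def log_loss(margin):
    # logistic regression cross entropy (log loss), written in terms of the margin
    return np.log1p(np.exp(-margin))

margins = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(hinge_loss(margins))  # [3. 2. 1. 0. 0.]
print(log_loss(margins))    # strictly positive everywhere, even for well-classified points
```

The hinge loss is exactly zero for margins of at least 1 (which is what produces sparse support vectors), while the log loss keeps penalizing, and rewarding, every example.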

SLIDE 3

Multi-class SVMs

  • How do we extend the idea of margin to more than 2 classes? Not so elegant. Two options (see the sketch after this list):
    • One vs rest: need to fit C separate models; pick the class with the largest f(x)
    • One vs one: need to fit C(C-1)/2 models; pick the class with the most f(x) “wins”
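A minimal sketch of both strategies using scikit-learn's multi-class wrappers around a linear SVM (my example; the slides only describe the two options):

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

# toy problem with C = 4 classes
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           n_classes=4, random_state=0)

# One vs rest: fits C separate models; predicts the class with the largest f(x)
ovr = OneVsRestClassifier(LinearSVC(max_iter=5000)).fit(X, y)

# One vs one: fits C(C-1)/2 models; predicts the class with the most "wins"
ovo = OneVsOneClassifier(LinearSVC(max_iter=5000)).fit(X, y)

print(len(ovr.estimators_))  # 4
print(len(ovo.estimators_))  # 4 * 3 / 2 = 6
```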


SLIDE 4

Multi-class Logistic Regression

  • How do we extend LR to more than 2 classes?
  • Elegant: we can train the weights using the same prediction function we’ll use at test time (the softmax below)


$$\hat{p}(x) = \mathrm{softmax}\big(w_1^T x,\; w_2^T x,\; \ldots,\; w_C^T x\big)$$
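A minimal NumPy sketch of this prediction function (my illustration; here W stacks the per-class weight vectors w_1, ..., w_C as rows):

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max()        # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum()

def predict_proba(x, W):
    # W has shape (C, F): one weight vector per class; returns C probabilities
    return softmax(W @ x)

W = np.array([[ 1.0, -0.5],
              [ 0.2,  0.3],
              [-1.0,  0.8]])               # C = 3 classes, F = 2 features
x = np.array([0.5, 2.0])
p = predict_proba(x, W)
print(p, p.sum())                          # probabilities sum to 1
```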
SLIDE 5

Kernel methods

  • Use kernel functions (similarity functions with special properties) to obtain flexible high-dimensional feature transformations without computing explicit features
  • Solve the “dual” problem (for parameters alpha), not the “primal” problem (for weights w)
  • Can use the “kernel trick” for:
    • regression
    • classification (Logistic Regression or SVM)
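As one concrete instance of solving the dual problem, here is a minimal kernel ridge regression sketch (my illustration; the slides do not fix a specific estimator). The dual parameter alpha has one entry per training example, and prediction needs only kernel evaluations, never explicit high-dimensional features:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # k(a, b) = exp(-gamma * ||a - b||^2), a positive semidefinite similarity function
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

def fit_dual(X_train, y_train, lam=0.1):
    # dual solution: alpha = (K + lam * I)^{-1} y, one coefficient per training example
    K = rbf_kernel(X_train, X_train)
    return np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)

def predict(X_new, X_train, alpha):
    # prediction uses only kernel values between new and training points
    return rbf_kernel(X_new, X_train) @ alpha

X = np.linspace(0.0, 6.0, 30)[:, None]
y = np.sin(X[:, 0])
alpha = fit_dual(X, y)
print(predict(np.array([[1.5]]), X, alpha))  # roughly sin(1.5) ~ 0.997
```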


SLIDE 6

Kernel Methods for Regression

Kernels exist for:

  • Periodic regression (see the sketch after this list)
  • Histograms
  • Strings
  • Graphs
  • And more!
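For instance, a sketch of one common periodic kernel (the exp-sine-squared form; an illustration, since the slides only note that such kernels exist):

```python
import numpy as np

def periodic_kernel(a, b, length_scale=1.0, period=2.0 * np.pi):
    # similarity depends on the distance modulo the period, so inputs one full
    # period apart look identical to a kernel regressor
    d = np.abs(a - b)
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / length_scale ** 2)

print(periodic_kernel(0.0, 2.0 * np.pi))  # ~1.0: a full period apart
print(periodic_kernel(0.0, np.pi))        # smaller: half a period apart
```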
SLIDE 7

Review: Key concepts in supervised learning

  • Parametric vs nonparametric methods
  • Bias vs variance


SLIDE 8

Parametric vs Nonparametric

  • Parametric methods
    • Complexity of the decision function is fixed in advance and specified by a finite, fixed number of parameters, regardless of training data size
  • Nonparametric methods
    • Complexity of the decision function can grow as more training data is observed


Examples: linear regression, logistic regression, decision trees, ensembles of trees, nearest neighbor methods, neural networks.

SLIDE 9


Credit: Scott Fortmann-Roe http://scott.fortmann-roe.com/docs/BiasVariance.html

Bias & Variance

$y$: known “true” response. $\hat{y}$: estimate (a random variable).

SLIDE 10


Decompose into Bias & Variance

$y$ is the known “true” response value at a given heldout input $x$. $\hat{y}$ is a random variable, obtained by fitting the estimator to a random sample of N training data examples and then predicting at $x$. Define the average prediction $\bar{y} \triangleq \mathbb{E}[\hat{y}]$.

Bias: error from the average model to the truth. How far the average prediction of our model (averaged over all possible training sets of size N) is from the true response:

$$\mathrm{Bias}^2 = (\bar{y} - y)^2$$

Variance: deviation over model samples. How far predictions based on a single training set are from the average prediction:

$$\mathrm{Var}(\hat{y}) = \mathbb{E}\big[(\hat{y} - \bar{y})^2\big] = \mathbb{E}\big[\hat{y}^2\big] - \bar{y}^2$$

SLIDE 11


Total Error = Bias^2 + Variance. The expected value is over samples of the observed training set:

$$\begin{aligned}
\mathbb{E}\big[(\hat{y}(x_{tr}, y_{tr}) - y)^2\big]
&= \mathbb{E}\big[(\hat{y} - y)^2\big] \\
&= \mathbb{E}\big[\hat{y}^2 - 2\hat{y}y + y^2\big] \\
&= \mathbb{E}\big[\hat{y}^2\big] - 2\bar{y}y + y^2 \\
&= \mathbb{E}\big[\hat{y}^2\big] - \bar{y}^2 + \bar{y}^2 - 2\bar{y}y + y^2 \\
&= \mathrm{Var}(\hat{y}) + (\bar{y} - y)^2
\end{aligned}$$
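A quick simulation can check this decomposition numerically (my sketch, not from the slides): repeatedly draw training sets, fit an estimator, predict at one fixed heldout x, and compare the average squared error against bias^2 + variance.

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(3.0 * x)
x_test = 0.5
y_test = true_f(x_test)                 # known "true" response at heldout x
N, trials, degree = 20, 2000, 3

preds = []
for _ in range(trials):
    # a fresh random size-N training set each trial
    x_tr = rng.uniform(0.0, 1.0, N)
    y_tr = true_f(x_tr) + rng.normal(0.0, 0.3, N)
    coef = np.polyfit(x_tr, y_tr, degree)      # fit a small polynomial model
    preds.append(np.polyval(coef, x_test))     # predict at the heldout input
preds = np.array(preds)

bias_sq = (preds.mean() - y_test) ** 2
variance = preds.var()
total = ((preds - y_test) ** 2).mean()
print(total, bias_sq + variance)               # the two quantities agree
```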
SLIDE 12


[Toy example: ISL Fig. 6.5, curves of bias, variance, and total error as a function of model flexibility: less flexible models underfit, more flexible models overfit.]

Bias: error due to the inability of the typical fit (averaged over training sets) to capture the true predictive relationship.
Variance: error due to estimating from a single finite-size training set.

All supervised learning methods must manage the bias/variance tradeoff. Hyperparameter search is key.

SLIDE 13

Dimensionality Reduction & Embedding

Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2020f/

Many ideas/slides attributable to: Liping Liu (Tufts), Emily Fox (UW), Matt Gormley (CMU)

Prof. Mike Hughes
SLIDE 14


What will we learn?

[Course overview diagram: data examples {x_n}_{n=1}^N feed into three paradigms (Supervised Learning, Unsupervised Learning, Reinforcement Learning), each defined by a task summary and a performance measure.]

SLIDE 15


Task: Embedding

[Diagram: among Supervised, Unsupervised, and Reinforcement Learning, embedding is an unsupervised learning task; data with features x1, x2 are mapped to a lower-dimensional embedding.]

SLIDE 16
Dim. Reduction/Embedding: Unit Objectives

  • Goals of dimensionality reduction
    • Reduce feature vector size (keep signal, discard noise)
    • “Interpret” features: visualize/explore/understand
  • Common approaches
    • Principal Component Analysis (PCA)
    • word2vec and other neural embeddings
  • Evaluation Metrics
    • Storage size
    • Reconstruction error
    • “Interpretability”


SLIDE 17

Example: 2D viz. of movies


SLIDE 18


Example: Genes vs. geography

Where possible, we based the geographic origin on the observed country data for grandparents. We used a ‘strict consensus’ approach: if all observed grandparents originated from a single country, we used that country as the origin. If an individual’s observed grandparents originated from different countries, we excluded the individual. Where grandparental data were unavailable, we used the individual’s country of birth.

Total sample size after exclusion: 1,387 subjects. Features: over half a million variable DNA sites in the human genome. (Nature, 2008)

SLIDE 19


Example: Genes vs. geography

Nature, 2008

SLIDE 20

Example: Eigen Clothing


SLIDE 21


SLIDE 22

Centering the Data

Goal: each feature’s mean = 0.0
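A minimal NumPy sketch of centering (assuming the data is arranged as an N x F array):

```python
import numpy as np

X = np.array([[2.0, 10.0],
              [4.0, 14.0],
              [6.0, 18.0]])          # N = 3 examples, F = 2 features

m = X.mean(axis=0)                   # per-feature mean vector, size F
X_centered = X - m                   # subtract the mean from every example

print(m)                             # [ 4. 14.]
print(X_centered.mean(axis=0))       # [0. 0.]  each feature's mean is now 0.0
```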


SLIDE 23

Why center?

  • Think of the mean vector as the simplest possible “reconstruction” of a dataset
  • No example-specific parameters, just one F-dim vector


$$\min_{m \in \mathbb{R}^F} \sum_{n=1}^{N} (x_n - m)^T (x_n - m), \qquad m^* = \mathrm{mean}(x_1, \ldots, x_N)$$
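A quick numeric check of this claim (my sketch): the per-feature mean achieves a lower total squared reconstruction error than any other single vector m.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))               # N = 100 examples, F = 5 features
m_star = X.mean(axis=0)

def total_sq_error(m):
    # sum over n of (x_n - m)^T (x_n - m)
    return ((X - m) ** 2).sum()

print(total_sq_error(m_star))               # the minimum
print(total_sq_error(m_star + 0.1))         # any perturbation does worse
print(total_sq_error(np.zeros(5)))          # so does the zero vector
```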

SLIDE 24

Mean reconstruction


[Figure: original examples vs. their reconstructions using only the mean vector.]

SLIDE 25

Principal Component Analysis


SLIDE 26

Linear Projection to 1D


SLIDE 27

Reconstruction from 1D to 2D


SLIDE 28

2D Orthogonal Basis


SLIDE 29

Which 1D projection is best?


Idea: Minimize reconstruction error
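A sketch of that idea (my illustration): score candidate unit-length directions by the squared error of projecting to 1D and reconstructing back, then keep the best direction.

```python
import numpy as np

rng = np.random.default_rng(0)
# 2D data with most of its spread along the first axis
X = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])
X = X - X.mean(axis=0)

def reconstruction_error(w):
    w = w / np.linalg.norm(w)        # unit-length projection direction
    z = X @ w                        # project each example down to 1D
    X_hat = np.outer(z, w)           # reconstruct back up to 2D
    return ((X - X_hat) ** 2).sum()

angles = np.linspace(0.0, np.pi, 180, endpoint=False)
candidates = np.stack([np.cos(angles), np.sin(angles)], axis=1)
errors = [reconstruction_error(w) for w in candidates]
print(candidates[int(np.argmin(errors))])   # close to [1, 0], the direction of largest spread
```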

SLIDE 30

K-dim Reconstruction with PCA


$$x_i = W z_i + m$$

Here $x_i$ is an F-dim high-dimensional data vector, $z_i$ is a K-dim low-dimensional vector, $W$ is an F x K weight matrix, and $m$ is the F-dim “mean” vector.

Problem: this is over-parameterized, with too many possible solutions. If we scale $z$ by 2 and scale $W$ by 1/2, we get an equivalent reconstruction. We need to constrain the magnitude of the weights: make each of the K weight vectors a unit vector, $\|w_k\|_2 = 1$.

SLIDE 31

Principal Component Analysis

  • Input:
    • X : training data, N x F
      • N high-dim. example vectors
    • K : int, number of components
      • Satisfies 1 <= K <= F
  • Output: Trained parameters for PCA
    • m : mean vector, size F
    • W : learned basis of weight vectors, F x K
      • One F-dim. vector (magnitude 1) for each component
      • Each of the K vectors is orthogonal to every other


Training step: .fit()
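A minimal sketch of this training step using an SVD of the centered data (my outline of the idea; in practice sklearn.decomposition.PCA provides .fit()):

```python
import numpy as np

def pca_fit(X, K):
    """Return (m, W): mean vector of size F and an orthonormal basis W of shape (F, K)."""
    m = X.mean(axis=0)                          # F-dim mean vector
    Xc = X - m                                  # center the data
    # right singular vectors of the centered data are the principal components
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:K].T                                # F x K; columns are unit length and orthogonal
    return m, W

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
m, W = pca_fit(X, K=2)
print(np.round(W.T @ W, 6))                     # identity matrix: orthonormal components
```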

SLIDE 32

Principal Component Analysis

  • Input:
    • X : training data, N x F
      • N high-dim. example vectors
    • Trained PCA “model”:
      • m : mean vector, size F
      • W : learned basis of eigenvectors, F x K
        • One F-dim. vector (magnitude 1) for each component
        • Each of the K vectors is orthogonal to every other
  • Output:
    • Z : projected data, N x K


Transformation step: .transform()
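The matching transformation step, as a self-contained sketch (a hand-made orthonormal basis stands in for a trained W):

```python
import numpy as np

def pca_transform(X, m, W):
    # project N x F data onto the K learned components; Z has shape (N, K)
    return (X - m) @ W

m = np.array([1.0, 2.0, 3.0])                   # "trained" mean vector, F = 3
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])                      # F x K basis, columns orthonormal, K = 2

X = np.array([[2.0, 2.0, 3.0],
              [1.0, 4.0, 5.0]])
Z = pca_transform(X, m, W)                      # low-dimensional codes
X_hat = Z @ W.T + m                             # reconstruction in the original space
print(Z)
print(X_hat)                                    # second example loses its third-axis detail
```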

SLIDE 33

Example: EigenFaces


SLIDE 34


Word Embeddings

SLIDE 35

Word Embeddings (word2vec)


Goal: map each word in vocabulary to an embedding vector

  • Preserve semantic meaning in this new vector space

vec(swimming) – vec(swim) + vec(walk) = vec(walking)
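A sketch of how such an analogy query is typically answered (toy hand-made vectors of my own; real word2vec embeddings are learned from text): form vec(swimming) - vec(swim) + vec(walk), then return the nearest vocabulary word by cosine similarity.

```python
import numpy as np

# toy 3-dimensional embeddings, illustrative values only
emb = {
    "swim":     np.array([1.0, 0.0, 0.0]),
    "swimming": np.array([1.0, 1.0, 0.0]),
    "walk":     np.array([0.0, 0.0, 1.0]),
    "walking":  np.array([0.0, 1.0, 1.0]),
    "taco":     np.array([5.0, 0.0, 5.0]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query = emb["swimming"] - emb["swim"] + emb["walk"]
best = max((w for w in emb if w not in {"swimming", "swim", "walk"}),
           key=lambda w: cosine(emb[w], query))
print(best)   # "walking"
```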

SLIDE 36


Word Embeddings (word2vec)

Goal: map each word in vocabulary to an embedding vector

  • Preserve semantic meaning in this new vector space
SLIDE 37

How to embed?

Training: reward embeddings that predict nearby words in the sentence.

Goal: learn the embedding weights W.

[Diagram: an embedding lookup table with one row per word in a fixed vocabulary (typically 1,000-100,000 words; e.g. tacos, staff, dinosaur, hammer); each row is an embedding vector with typically 100-1,000 dimensions.]

Credit: https://www.tensorflow.org/tutorials/representation/word2vec
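A sketch of the lookup table those numbers describe (illustrative sizes; the slides give only the typical ranges): the embedding layer is just a V x D weight matrix W, and embedding a word means selecting its row.

```python
import numpy as np

vocab = ["tacos", "staff", "dinosaur", "hammer"]   # fixed vocabulary (typically 1k-100k words)
word_to_id = {w: i for i, w in enumerate(vocab)}

D = 8                                              # embedding dimension (typically 100-1000)
rng = np.random.default_rng(0)
W = rng.normal(size=(len(vocab), D))               # the weights learned during training

def embed(word):
    # embedding lookup: select the row of W for this word's id
    return W[word_to_id[word]]

print(embed("tacos").shape)                        # (8,)
```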

SLIDE 38
Dim. Reduction/Embedding: Unit Objectives

  • Goals of dimensionality reduction
    • Reduce feature vector size (keep signal, discard noise)
    • “Interpret” features: visualize/explore/understand
  • Common approaches
    • Principal Component Analysis (PCA)
    • word2vec and other neural embeddings
  • Evaluation Metrics
    • Storage size
    • Reconstruction error
    • “Interpretability”
