Summary of Unit 5:
Kernel Methods
For Regression and Classification
Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2020f/
SVM vs. Logistic Regression

                          SVM                              Logistic Regression
Loss                      hinge                            cross entropy (log loss)
Sensitive to outliers?    Less sensitive                   More sensitive
Probabilistic?            No                               Yes
Multi-class?              Only via separate model for      Easy, using softmax
                          each class (one-vs-all)
Kernelizable?             Yes, with speed benefits         Yes
(cover next class)        from sparsity

More than 2 classes? Not so elegant for SVMs. Two options:
* One-vs-rest: fit C separate models; pick the class with the largest f(x)
* One-vs-one: fit C(C-1)/2 models; pick the class with the most f(x) "wins"
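A minimal sketch of the one-vs-rest recipe, using scikit-learn's LogisticRegression as the per-class binary model; the toy data and all constants here are illustrative assumptions, not from the slides:

```python
# One-vs-rest sketch: fit C binary models, then pick the class whose
# model assigns the largest score f(x).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
C = 3
X = rng.normal(size=(150, 4))        # toy features (illustrative)
y = rng.integers(0, C, size=150)     # toy labels in {0, ..., C-1}

models = []
for c in range(C):
    # Binary problem: class c vs. all other classes
    models.append(LogisticRegression().fit(X, (y == c).astype(int)))

scores = np.column_stack([m.decision_function(X) for m in models])  # N x C
yhat = np.argmax(scores, axis=1)     # class with the largest f(x) wins
```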
The prediction function we'll use at test time scores every class:
f(x) = (w_1^T x, w_2^T x, ..., w_C^T x), and we predict the class c with the largest w_c^T x.
Use kernel functions (similarity functions with special properties) to obtain flexible high-dimensional feature transformations without explicit features.
Solve the "dual" problem (for parameters alpha), not the "primal" problem (for weights w).
Can use the "kernel trick" for:
* regression
* classification (Logistic Regr. or SVM)
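As a concrete sketch of solving the dual problem for alpha with a kernel, here is kernel ridge regression with an RBF kernel; the toy data, lambda, and gamma values are illustrative assumptions:

```python
# Kernel trick sketch: learn one alpha per training example (dual),
# never constructing explicit high-dimensional features.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # k(a, b) = exp(-gamma * ||a - b||^2)
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X_train = rng.uniform(-2, 2, size=(40, 1))                  # toy 1-D inputs
y_train = np.sin(X_train[:, 0]) + rng.normal(0, 0.1, 40)    # noisy targets

lam = 0.1                                                   # ridge penalty
K = rbf_kernel(X_train, X_train)                            # N x N Gram matrix
alpha = np.linalg.solve(K + lam * np.eye(40), y_train)      # dual solution

X_test = np.linspace(-2, 2, 100)[:, None]
yhat = rbf_kernel(X_test, X_train) @ alpha                  # sum_n alpha_n k(x_n, x)
```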
Kernels exist for many kinds of structured data (e.g., strings, graphs), not just fixed-length feature vectors.
Parametric methods: specified by a finite, fixed number of parameters, regardless of training data size.
Nonparametric methods: the number of parameters can grow as more training data is observed.
Examples to classify as parametric or nonparametric:
* Linear regression
* Logistic regression
* Decision trees
* Ensembles of trees
* Nearest neighbor methods
* Neural networks
Credit: Scott Fortmann-Roe http://scott.fortmann-roe.com/docs/BiasVariance.html
y: the known "true" response value at a given known heldout input x.
ŷ: the estimate, a random variable obtained by fitting the estimator to a random sample of N training data examples, then predicting at x.
Bias: error from the average model to the truth. How far the average prediction of our model (averaged over all possible training sets of size N) is from the true response.
Variance: deviation over model samples. How far predictions based on a single training set are from the average prediction.
ȳ ≜ E[ŷ]
Var(ŷ) = E[(ŷ − ȳ)²] = E[ŷ²] − ȳ²
E[(ŷ(x_tr, y_tr) − y)²] = E[(ŷ − y)²]
  = E[ŷ² − 2 ŷ y + y²]
  = E[ŷ²] − 2 ȳ y + y²
  = E[ŷ²] − ȳ² + ȳ² − 2 ȳ y + y²
  = Var(ŷ) + (ȳ − y)²
  = variance + bias²
The expected value is over random samples of the training set (x_tr, y_tr).
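That expectation over training sets can be approximated by simulation. A minimal sketch, where the true function sin(x), the noise level, the linear estimator, and all constants are illustrative assumptions:

```python
# Estimate bias^2 and variance of an estimator at one heldout input x
# by redrawing many training sets of size N.
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(x)       # assumed "true" response function
x_test, N, n_trials = 1.5, 20, 2000

preds = np.empty(n_trials)
for t in range(n_trials):
    x_tr = rng.uniform(0, np.pi, size=N)
    y_tr = true_f(x_tr) + rng.normal(0, 0.3, size=N)  # noisy training labels
    w = np.polyfit(x_tr, y_tr, deg=1)                 # the estimator: fit a line
    preds[t] = np.polyval(w, x_test)                  # yhat at heldout x

y_bar = preds.mean()                        # average prediction over training sets
bias_sq = (y_bar - true_f(x_test)) ** 2     # (ybar - y)^2
variance = preds.var()                      # E[(yhat - ybar)^2]
```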
[Figure: toy example, ISL Fig. 6.5, showing total error, bias, and variance as model flexibility ranges from less flexible (underfitting) to more flexible.]
Bias: error due to the inability of the typical fit (averaged over training sets) to capture the true predictive relationship.
Variance: error due to estimating from a single finite-size training set.
All supervised learning methods must manage the bias/variance tradeoff. Hyperparameter search is key.
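A minimal sketch of that hyperparameter search via cross-validation; the Ridge estimator, toy data, and alpha grid are illustrative assumptions:

```python
# Pick model flexibility by heldout error, not training error.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                       # toy data (illustrative)
y = X @ rng.normal(size=5) + rng.normal(0, 0.1, 100)

grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}     # larger alpha = less flexible
search = GridSearchCV(Ridge(), grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_)                          # flexibility chosen by CV
```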
Many ideas/slides attributable to: Liping Liu (Tufts), Emily Fox (UW), Matt Gormley (CMU)
Data: examples x_n, for n = 1, ..., N
Paradigms: Supervised Learning, Unsupervised Learning, Reinforcement Learning
Each paradigm is defined by a task summary and a performance measure.
Of Supervised Learning, Unsupervised Learning, and Reinforcement Learning, we now turn to an unsupervised task: embedding.
Where possible, we based the geographic origin on the observed country data for the individual's grandparents. Where grandparental data were unavailable, we used the individual's country of birth.
Total sample size after exclusion: 1,387 subjects.
Features: over half a million variable DNA sites in the human genome.
(Nature, 2008)
Nature, 2008
Goal: a low-dimensional "reconstruction" of a dataset, summarizing each high-dim. example with a low-dim. vector.
Dataset: {x_n}_{n=1}^N with each x_n ∈ R^F; mean vector m ∈ R^F.
Idea: Minimize reconstruction error
x ≈ W z + m
x: high-dim. data (F-vector); z: low-dim. vector (K-vector); W: F x K weights; m: F-vector "mean".
Problem: over-parameterized, so there are too many possible solutions! If we scale z by 2, we can scale W by 1/2 and get an equivalent reconstruction. We need to constrain the magnitude of the weights. Let's make all the weight vectors unit vectors: ||w_k||_2 = 1 for each of the K columns of W.
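A minimal numpy sketch of fitting this model via the SVD, which yields unit-norm weight vectors that minimize the reconstruction error; the toy data and K are illustrative assumptions:

```python
# PCA-style reconstruction x ~= W z + m with unit-norm weight columns.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # toy data: N=200 examples, F=10 features

def fit(X, K):
    m = X.mean(axis=0)                             # F-dim "mean" vector
    U, s, Vt = np.linalg.svd(X - m, full_matrices=False)
    return m, Vt[:K].T                             # W: F x K, unit-norm columns

m, W = fit(X, K=2)
Z = (X - m) @ W                       # low-dim codes z_n (N x K)
X_rec = m + Z @ W.T                   # reconstructions (N x F)
err = np.sum((X - X_rec) ** 2)        # the reconstruction error being minimized
```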
Training step: .fit()
Transformation step: .transform()
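In scikit-learn, the two steps look like the following; the toy data and K = 2 are illustrative assumptions:

```python
# Two-step workflow: fit on training data, then transform to low-dim codes.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(200, 10))  # toy data (illustrative)

pca = PCA(n_components=2)         # K = 2 components
pca.fit(X)                        # training step: learn mean and weights
Z = pca.transform(X)              # transformation step: N x 2 codes
X_rec = pca.inverse_transform(Z)  # optional: map codes back to reconstructions
```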
Goal: map each word in vocabulary to an embedding vector
vec(swimming) – vec(swim) + vec(walk) = vec(walking)
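A minimal sketch of how such an analogy is answered mechanically: vector arithmetic plus a nearest-neighbor lookup by cosine similarity. The tiny random `emb` table here is purely an illustrative stand-in for learned embeddings:

```python
# Analogy query "swim : swimming :: walk : ?" over an embedding lookup table.
import numpy as np

rng = np.random.default_rng(0)
words = ["swim", "swimming", "walk", "walking", "taco"]
emb = {w: rng.normal(size=50) for w in words}   # stand-in for learned vectors

def analogy(emb, a, b, c):
    query = emb[b] - emb[a] + emb[c]            # vec(b) - vec(a) + vec(c)
    cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    # Nearest neighbor by cosine similarity, excluding the query words
    return max((w for w in emb if w not in (a, b, c)),
               key=lambda w: cos(emb[w], query))

print(analogy(emb, "swim", "swimming", "walk"))  # "walking" with real embeddings
```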
Training
Reward embeddings that predict nearby words in the sentence.
[Diagram: a lookup table W mapping each word in a fixed vocabulary (typically 1,000-100k words; e.g., tacos, staff, dinosaur, hammer) to an embedding vector (typically 100-1000 dimensions).]
Goal: learn weights
Credit: https://www.tensorflow.org/tutorials/representation/word2vec