10-601 Introduction to Machine Learning
Machine Learning Department
School of Computer Science
Carnegie Mellon University

Midterm Review + Ensemble Methods + Recommender Systems

Matt Gormley
Lecture 21
Nov. 4, 2019

Reminders
– Homework: out Fri, Oct. 25; due Fri, Nov. 8 at 11:59pm
– Midterm Exam: Thu, Nov. 14, 6:30pm – 8:00pm (more details announced on Piazza)
– Homework: out Fri, Nov. 8; due Sun, Nov. 24 at 11:59pm
– http://p21.mlcourse.org
Midterm Exam
– Time: evening exam, Thu, Nov. 14, 6:30pm – 8:00pm
– Room: we will contact each student individually with your room assignment
– Seats: there will be assigned seats; please arrive early
– Please watch Piazza carefully for announcements regarding room / seat assignments
– Covered material: Lecture 9 – Lecture 19 (95%), Lecture 1 – Lecture 8 (5%)
– Format of questions:
– No electronic devices
– You are allowed to bring one 8½ x 11 sheet of notes (front and back)
3.2 Logistic regression

Given a training set $\{(x_i, y_i)\}$, $i = 1, \ldots, n$, where $x_i \in \mathbb{R}^d$ is a feature vector and $y_i \in \{0, 1\}$ is a binary label, we want to find the parameters $\hat{w}$ that maximize the likelihood of the training set, assuming a parametric model of the form

$p(y = 1 \mid x; w) = \frac{1}{1 + \exp(-w^T x)}.$

The conditional log likelihood of the training set is

$\ell(w) = \sum_{i=1}^n y_i \log p(y_i \mid x_i; w) + (1 - y_i) \log(1 - p(y_i \mid x_i; w)),$

and the gradient is

$\nabla \ell(w) = \sum_{i=1}^n (y_i - p(y = 1 \mid x_i; w))\, x_i.$

(b) [5 pts.] What is the form of the classifier output by logistic regression?

(c) [2 pts.] Extra Credit: Consider the case with binary features, i.e., $x \in \{0, 1\}^d \subset \mathbb{R}^d$, where feature $x_1$ is rare and happens to appear in the training set with only label 1. What is $\hat{w}_1$? Is the gradient ever zero for any finite $w$? Why is it important to include a regularization term to control the norm of $\hat{w}$?
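As a sanity check of these formulas, here is a minimal NumPy sketch of the log likelihood and its gradient (the toy data, step size, and iteration count are illustrative, not from the exam):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    # l(w) = sum_i y_i log p_i + (1 - y_i) log(1 - p_i)
    p = sigmoid(X @ w)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(w, X, y):
    # grad l(w) = sum_i (y_i - p_i) x_i
    p = sigmoid(X @ w)
    return X.T @ (y - p)

# Toy data: n = 4 examples, d = 2 features.
X = np.array([[0., 1.], [1., 1.], [1., 0.], [0., 0.]])
y = np.array([1., 1., 0., 0.])

# Gradient ascent on the conditional log likelihood.
w = np.zeros(2)
for _ in range(200):
    w += 0.1 * gradient(w, X, y)
print(w, log_likelihood(w, X, y))
```

Note that this toy set is linearly separable, so the norm of w keeps growing with more iterations and the gradient never reaches zero at any finite w, which is exactly the phenomenon part (c) asks about.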
2.1 Train and test errors
In this problem, we will see how you can debug a classifier by looking at its train and test errors. Consider a classifier trained to convergence on some training data Dtrain and tested on a separate test set Dtest. You look at the test error and find that it is very high. You then compute the training error and find that it is close to 0. Which of the following is a reasonable next step?

(a) Increase the training data size.
(b) Decrease the training data size.
(c) Increase model complexity (for example, if your classifier is an SVM, use a more complex kernel; or if it is a decision tree, increase the depth).
(d) Decrease model complexity.
(e) Train on a combination of Dtrain and Dtest and test on Dtest.
(f) Conclude that Machine Learning does not work.
Neural Networks

(a) The dataset with groups S1, S2, and S3.
(b) The neural network architecture: inputs x1 and x2, hidden units h1 and h2, and output y, with weights w11, w21, w12, w22 from the inputs to the hidden units and w31, w32 from the hidden units to the output.

Can the neural network in Figure (b) correctly classify the dataset given in Figure (a)?
Apply the backpropagation algorithm to obtain the partial derivative of the mean-squared error with respect to the weight w22, assuming a sigmoid nonlinear activation function for the hidden layer.
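A sketch of the chain-rule computation, under the assumptions that $w_{ij}$ connects input $x_i$ to hidden unit $h_j$, the output $\hat{y}$ is a linear combination of the hidden units, and the error on a single example with target $y^*$ is squared error (using $\sigma'(a) = \sigma(a)(1 - \sigma(a))$):

\begin{align*}
a_2 &= w_{12} x_1 + w_{22} x_2, \qquad h_2 = \sigma(a_2) \\
\hat{y} &= w_{31} h_1 + w_{32} h_2, \qquad E = (\hat{y} - y^*)^2 \\
\frac{\partial E}{\partial w_{22}}
  &= \frac{\partial E}{\partial \hat{y}} \cdot
     \frac{\partial \hat{y}}{\partial h_2} \cdot
     \frac{\partial h_2}{\partial a_2} \cdot
     \frac{\partial a_2}{\partial w_{22}}
   = 2 (\hat{y} - y^*)\, w_{32}\, h_2 (1 - h_2)\, x_2
\end{align*}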
7.1 Reinforcement Learning
… exploitation compared with policy iteration. True / False
Compare reinforcement learning and supervised learning: Reinforcement learning is a kind of supervised learning problem, because you can treat the reward and next state as the label and each (state, action) pair as the training data. Reinforcement learning differs from supervised learning because it has a temporal structure in the learning process, whereas in supervised learning the prediction on a data point does not affect the data you would see in the future.
7.1 Reinforcement Learning
(Figure: a grid world whose transitions carry rewards 2, 2, 4, 4, 8, 4, and 8.)

What is the corresponding optimal policy? Assume the discount factor is 0.1.
What are the corresponding V*(s) values? Assume the discount factor is 0.1.
What are the corresponding Q*(s,a) values? Assume the discount factor is 0.1.
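The original grid is not recoverable from this transcript, but the quantities the question asks for come from standard value iteration. A minimal sketch on a made-up two-state, two-action MDP (the states, actions, rewards, and transitions below are hypothetical), with the question's discount factor of 0.1:

```python
# Value iteration on a tiny, hypothetical deterministic MDP.
gamma = 0.1                                        # discount factor from the question
states, actions = [0, 1], [0, 1]
next_state = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 1}}    # s' = next_state[s][a]
reward = {0: {0: 2, 1: 4}, 1: {0: 4, 1: 8}}        # R(s, a)

V = {s: 0.0 for s in states}
for _ in range(100):                               # Bellman backups to convergence
    V = {s: max(reward[s][a] + gamma * V[next_state[s][a]] for a in actions)
         for s in states}

# Q*(s, a) and the greedy optimal policy follow directly from V*.
Q = {(s, a): reward[s][a] + gamma * V[next_state[s][a]]
     for s in states for a in actions}
policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
print(V, Q, policy)
```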
Figure from Tom Mitchell
Learning Paradigms: What data is available and when? What form of prediction?

Problem Formulation: What is the structure of our output prediction?
– boolean → Binary Classification
– categorical → Multiclass Classification
– ordinal → Ordinal Classification
– real → Regression
– ordering → Ranking
– multiple discrete → Structured Prediction
– multiple continuous → (e.g. dynamical systems)
– both discrete & continuous → (e.g. mixed graphical models)

Theoretical Foundations: What principles guide learning?
– probabilistic
– information theoretic
– evolutionary search
– ML as optimization

Facets of Building ML Systems: How to build systems that are robust, efficient, adaptive, effective?
1. Data prep
2. Model selection
3. Training (optimization / search)
4. Hyperparameter tuning on validation data
5. (Blind) assessment on test data

Big Ideas in ML: Which are the ideas driving development of the field?

Application Areas: What are the key challenges? NLP, Speech, Computer Vision, Robotics, Medicine, Search
The Netflix Prize: $1 million to the first team to predict user ratings 10% better than Netflix's existing system on 3 million held-out ratings.
Top performing systems were ensembles
The Weighted Majority Algorithm:
– Initially, weight all classifiers equally.
– On receiving a training example, predict the (weighted) majority vote of the classifiers in the pool.
– Down-weight classifiers that contribute to a mistake by a factor of β.
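A minimal sketch of that update rule (the classifier-pool interface and β = 0.5 are illustrative; this follows the slide's variant, which only down-weights when the ensemble itself errs):

```python
def weighted_majority(pool, stream, beta=0.5):
    # pool: list of classifiers h(x) -> {-1, +1}; stream: iterable of (x, y) pairs.
    w = [1.0] * len(pool)                  # initially weight all classifiers equally
    for x, y in stream:
        preds = [h(x) for h in pool]
        vote = sum(wi * p for wi, p in zip(w, preds))
        y_hat = 1 if vote >= 0 else -1     # (weighted) majority vote
        if y_hat != y:                     # the ensemble made a mistake:
            # down-weight every classifier that contributed to it by a factor of beta
            w = [wi * beta if p != y else wi for wi, p in zip(w, preds)]
    return w
```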
(Littlestone & Warmuth, 1994)
Weighted Majority Algorithm: what is stored?
– the classifiers themselves
– a (majority vote) weight for each classifier
Toy example: the initial distribution D1 is uniform over the training points; the weak classifiers are vertical or horizontal half-planes.
Slide from Schapire NIPS Tutorial
Round 1: weak hypothesis h1, error ε1 = 0.30, weight α1 = 0.42; updated distribution D2.
Slide from Schapire NIPS Tutorial
Round 2: weak hypothesis h2, error ε2 = 0.21, weight α2 = 0.65; updated distribution D3.
Slide from Schapire NIPS Tutorial
Round 3: weak hypothesis h3, error ε3 = 0.14, weight α3 = 0.92.
Slide from Schapire NIPS Tutorial
H_final = sign(0.42 h1 + 0.65 h2 + 0.92 h3)
Slide from Schapire NIPS Tutorial
AdaBoost:
Given: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in X$, $y_i \in \{-1, +1\}$.
Initialize $D_1(i) = 1/m$.
For $t = 1, \ldots, T$:
– Train weak learner using distribution $D_t$.
– Get weak hypothesis $h_t : X \to \{-1, +1\}$ with error $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$.
– Choose $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$.
– Update:
$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_i) = y_i \\ e^{\alpha_t} & \text{if } h_t(x_i) \neq y_i \end{cases}$
where $Z_t$ is a normalization factor (chosen so that $D_{t+1}$ will be a distribution).
Output the final hypothesis: $H(x) = \text{sign}\left(\sum_{t=1}^T \alpha_t h_t(x)\right)$.
Algorithm from (Freund & Schapire, 1999)
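A compact NumPy sketch of these updates. The decision-stump weak learner is a stand-in consistent with the half-plane classifiers in the toy example; the small floor on ε_t is an implementation detail, not part of the pseudocode above:

```python
import numpy as np

def train_stump(X, y, D):
    # Exhaustively pick the axis-aligned threshold stump with lowest weighted error.
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = np.where(X[:, j] <= t, s, -s)
                err = D[pred != y].sum()
                if err < best_err:
                    best_err, best = err, (j, t, s)
    j, t, s = best
    return lambda Z: np.where(Z[:, j] <= t, s, -s)

def adaboost(X, y, T=10):
    # X: (m, d) features; y: (m,) labels in {-1, +1}.
    m = len(y)
    D = np.full(m, 1.0 / m)                    # D_1(i) = 1/m
    ensemble = []
    for _ in range(T):
        h = train_stump(X, y, D)               # weak learner trained on D_t
        pred = h(X)
        eps = max(D[pred != y].sum(), 1e-12)   # weighted error epsilon_t
        alpha = 0.5 * np.log((1 - eps) / eps)  # alpha_t
        D *= np.exp(-alpha * y * pred)         # e^{-alpha} if correct, e^{+alpha} if not
        D /= D.sum()                           # renormalize (divide by Z_t)
        ensemble.append((alpha, h))
    return lambda Z: np.sign(sum(a * h(Z) for a, h in ensemble))

# Usage: H = adaboost(X_train, y_train, T=3); y_pred = H(X_test)
```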
Figure from (Freund & Schapire, 1999)
Figure 2: Error curves and the margin distribution graph for boosting C4.5 on the letter dataset, as reported by Schapire et al. [41]. Left: the training and test error curves (lower and upper curves, respectively) of the combined classifier as a function of the number of rounds of boosting; the horizontal lines indicate the test error rate of the base classifier and the test error of the final combined classifier. Right: the cumulative distribution of margins of the training examples after 5, 100, and 1000 iterations, indicated by short-dashed, long-dashed (mostly hidden), and solid curves, respectively.
Recall the Netflix Prize: $1 million to the first team to predict user ratings 10% better than Netflix's existing system on 3 million held-out ratings.
Recommender systems involve:
– Items: movies, songs, products, etc. (often many thousands)
– Users: watchers, listeners, purchasers, etc. (often many millions)
– Feedback: 5-star ratings, not clicking 'next', purchases, etc.

Key observations:
– We can represent ratings numerically as a user/item matrix.
– Users rate only a small number of items, so the matrix is sparse.
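For example, a minimal sketch of holding such feedback as a sparse user/item matrix (the rating triples are illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix

# (user, item, rating) triples -- only the observed cells are stored.
users   = np.array([0, 0, 1, 1, 2, 2, 2])
items   = np.array([0, 2, 0, 1, 0, 1, 2])
ratings = np.array([1, 5, 3, 4, 3, 5, 2])

R = csr_matrix((ratings, (users, items)), shape=(3, 3))
print(R.toarray())   # zeros stand for unobserved entries, not 0-star ratings
```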
          Doctor Strange   Star Trek: Beyond   Zootopia
Alice           1                  ?               5
Bob             3                  4               ?
Charlie         3                  5               2
Slide from William Cohen
Figures from Koren et al. (2009)
Matrix Factorization: movies and users live in some low-dimensional space describing their properties. We recommend a movie based on its proximity to the user in the latent space.
Figures from Aggarwal (2016)
(a) Example of a rank-2 matrix factorization: the 7-user × 6-movie ratings matrix R (movies: Nero, Julius Caesar, Cleopatra, Sleepless in Seattle, Pretty Woman, Casablanca) is factored as R ≈ U Vᵀ, with two latent factors, History and Romance, describing both the users (rows of U) and the movies (rows of V). (b) The corresponding residual matrix E.
Figure: classification vs. matrix completion. In classification there is a clear demarcation between training rows and test rows, and between the independent variables and the dependent variable. In matrix completion there is no demarcation between training and test rows, and no demarcation between dependent and independent variables.
Figures from Aggarwal (2016)
Matrix factorization seeks factors that minimize the squared error on the observed ratings:

$\min_{P, Q} \sum_{(u,i) \in \mathcal{Z}} (r_{ui} - q_i^T p_u)^2 + \lambda \left( \|q_i\|^2 + \|p_u\|^2 \right)$

where $\mathcal{Z}$ is the set of observed (user, item) pairs and the $\lambda$ term is optional regularization. SGD visits one observed rating at a time, computes the error $e_{ui} = r_{ui} - q_i^T p_u$, and updates both factors with step size $\eta$:

$q_i \leftarrow q_i + \eta (e_{ui} p_u - \lambda q_i), \qquad p_u \leftarrow p_u + \eta (e_{ui} q_i - \lambda p_u)$

Figures from Koren et al. (2009) and Gemulla et al. (2011)
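A minimal sketch of those SGD updates (the rank k, step size η, regularization λ, and epoch count are illustrative choices):

```python
import numpy as np

def sgd_mf(ratings, n_users, n_items, k=2, eta=0.01, lam=0.1, epochs=100):
    # ratings: list of (u, i, r_ui) triples for the observed cells only.
    rng = np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((n_users, k))   # user factors p_u
    Q = 0.1 * rng.standard_normal((n_items, k))   # item factors q_i
    for _ in range(epochs):
        for u, i, r in ratings:
            p_u = P[u].copy()                     # snapshot so both updates use old values
            e = r - Q[i] @ p_u                    # e_ui = r_ui - q_i^T p_u
            P[u] += eta * (e * Q[i] - lam * p_u)
            Q[i] += eta * (e * p_u - lam * Q[i])
    return P, Q

# Predicted rating for user u on item i: P[u] @ Q[i]
```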
Figure 3 from Koren et al. (2009): the first two factor vectors from a matrix decomposition of the Netflix Prize data. Selected movies (e.g. Freddy Got Fingered, The Sound of Music, Citizen Kane, Kill Bill: Vol. 1, Lost in Translation, Being John Malkovich) are placed according to their factor vectors in two dimensions. The plot reveals distinct genres, including clusters of movies with strong female leads, fraternity humor, and quirky independent films.
ALS = alternating least squares
Figure from Gemulla et al. (2011)
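A dense-matrix sketch of ALS, assuming for simplicity that R is fully observed (real recommender implementations restrict each least-squares problem to the observed cells of the corresponding row or column):

```python
import numpy as np

def als(R, k=2, lam=0.1, iters=20):
    # Alternating least squares for R (n x m) ~ P Q^T with ridge penalty lam.
    n, m = R.shape
    rng = np.random.default_rng(0)
    P = rng.standard_normal((n, k))
    Q = rng.standard_normal((m, k))
    I = np.eye(k)
    for _ in range(iters):
        # Fix Q; each user's factors solve a k x k ridge-regression system.
        P = np.linalg.solve(Q.T @ Q + lam * I, Q.T @ R.T).T
        # Fix P; same for each item's factors.
        Q = np.linalg.solve(P.T @ P + lam * I, P.T @ R).T
    return P, Q
```

Each half-step is a closed-form least-squares solve, which is why ALS is easy to parallelize across users (or items).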
Theorem: If R is fully observed and we use no regularization, then the solution given by SVD equals the solution of Unconstrained Matrix Factorization.
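A quick numerical check of this claim on a small fully observed matrix: gradient descent on the unregularized MF objective reaches the same reconstruction error as the rank-2 truncated SVD (the matrix size, rank, step size, and iteration count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.standard_normal((6, 5))

# Rank-2 truncated SVD reconstruction.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_svd = U[:, :2] @ np.diag(s[:2]) @ Vt[:2]

# Unconstrained rank-2 MF via gradient descent, no regularization.
P = 0.1 * rng.standard_normal((6, 2))
Q = 0.1 * rng.standard_normal((5, 2))
for _ in range(20000):
    E = R - P @ Q.T
    P, Q = P + 0.01 * E @ Q, Q + 0.01 * E.T @ P

print(np.linalg.norm(R - P @ Q.T))   # approximately equal to the SVD error below
print(np.linalg.norm(R - R_svd))
```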
Implicit feedback:
– In many settings, users don't have a way of expressing dislike for an item (e.g. they can't provide negative ratings).
– The only mechanism for feedback is to "like" something.

Examples:
– Facebook has a "Like" button, but no "Dislike" button.
– Google's "+1" button.
– Pinterest pins.
– Purchasing an item on Amazon indicates a preference for it, but there are many reasons you might not purchase an item (besides dislike).
– Search engines collect click data but have no clear mechanism for observing dislike of a webpage.
Examples from Aggarwal (2016)