CSC 2515: Machine Learning
Lecture 1 - Introduction and Nearest Neighbours Roger Grosse
University of Toronto
This course is a broad introduction to machine learning.
◮ First half: algorithms and principles for supervised learning
  ◮ nearest neighbours, decision trees, ensembles, linear regression, neural nets!
◮ Unsupervised learning: PCA, K-means, mixture models
◮ Basics of reinforcement learning
◮ The structure and difficulty will be similar to past years.
◮ The majority of students are from outside Computer Science.
Homework assignments:
◮ Combination of pencil & paper derivations and short programming exercises
◮ Each counts for 10%, except that the lowest mark counts for 5%.
◮ Worth 45% in total.
A further component is worth 5%, on the honor system.
Midterm:
◮ Oct. 30, 4-6pm
◮ Worth 15% of course mark
Final exam:
◮ Dec. 17, 3-6pm
◮ Worth 35% of course mark
Tasks that are hard to program by hand:
◮ recognizing people and objects
◮ understanding human speech
Reasons to use a learning algorithm:
◮ hard to code up a solution by hand (e.g. vision, speech)
◮ the system needs to adapt to a changing environment (e.g. spam detection)
◮ we want the system to perform better than the human programmers
◮ privacy/fairness concerns (e.g. ranking search results)
Machine learning vs. statistics:
◮ Both fields try to uncover patterns in data.
◮ Both fields draw heavily on calculus, probability, and linear algebra.
◮ Stats is more concerned with helping scientists and policymakers draw sound conclusions from data.
◮ Stats puts more emphasis on interpretability and mathematical rigor.
◮ Supervised learning: have labeled examples of the correct behavior
◮ Reinforcement learning: the learning system receives a reward signal and tries to learn to maximize it
◮ Unsupervised learning: no labeled examples – instead, looking for interesting patterns in the data
◮ Connectionist psychologists explored neural models of cognition
◮ 1984 — Leslie Valiant formalized the problem of learning as PAC learning
◮ 1988 — Backpropagation (re-)discovered by Geoffrey Hinton and colleagues
◮ 1988 — Judea Pearl’s book Probabilistic Reasoning in Intelligent Systems introduced Bayesian networks
1990s–2000s:
◮ Markov chain Monte Carlo
◮ variational inference
◮ kernels and support vector machines
◮ boosting
◮ convolutional networks
2010s:
◮ 2010–2012 — neural nets smashed previous records in speech and object recognition
◮ increasing adoption by the tech industry
◮ 2016 — AlphaGo defeated the human Go champion
◮ Is there a pattern to detect?
◮ Can I solve it analytically?
◮ Do I have data?
◮ vectorize computations (express them in terms of matrix/vector operations), as in the sketch below
◮ This also makes your code cleaner and more readable!
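As an illustration, here is a minimal sketch, assuming NumPy, of the same computation written with explicit loops and in vectorized form; the function names are illustrative, not from the course.

```python
# A minimal sketch (assuming NumPy) of vectorizing a computation:
# pairwise squared Euclidean distances between rows of X and rows of Z.
import numpy as np

def pairwise_sq_dists_loop(X, Z):
    """Naive version: explicit Python loops (slow for large N, M)."""
    N, M = X.shape[0], Z.shape[0]
    D = np.zeros((N, M))
    for i in range(N):
        for j in range(M):
            diff = X[i] - Z[j]
            D[i, j] = diff @ diff
    return D

def pairwise_sq_dists_vec(X, Z):
    """Vectorized version: ||x - z||^2 = ||x||^2 - 2 x.z + ||z||^2."""
    x_sq = np.sum(X**2, axis=1)[:, None]   # shape (N, 1)
    z_sq = np.sum(Z**2, axis=1)[None, :]   # shape (1, M)
    return x_sq - 2 * X @ Z.T + z_sq       # broadcasting gives (N, M)

X = np.random.randn(5, 3)
Z = np.random.randn(4, 3)
assert np.allclose(pairwise_sq_dists_loop(X, Z), pairwise_sq_dists_vec(X, Z))
```

The vectorized version replaces the double loop with one matrix product plus broadcasting, which is both faster and easier to read.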
Machine learning frameworks provide:
◮ automatic differentiation
◮ compiling computation graphs
◮ libraries of algorithms and network primitives
◮ support for graphics processing units (GPUs)
Why learn the underlying math anyway?
◮ So you know what to do if something goes wrong!
◮ Debugging learning algorithms requires sophisticated detective work.
◮ That’s why we derive things by hand in this class!
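For concreteness, a minimal sketch of the first feature above, automatic differentiation. It assumes PyTorch is available; the choice of framework is an assumption, not from the slides.

```python
# A minimal sketch (assuming PyTorch) of automatic differentiation.
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x**2 + 2 * x          # y = x^2 + 2x
y.backward()              # autodiff computes dy/dx
print(x.grad)             # tensor(8.) since dy/dx = 2x + 2 = 8 at x = 3
```

Deriving the same gradient by hand, as this class practices, is how you check that code like this is doing what you think it is.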
◮ Representation = mapping to another space that’s easy to manipulate
◮ Vectors are a great representation since we can do linear algebra! (See the sketch below.)
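A minimal sketch, assuming NumPy, of the point above: once data (here a synthetic grayscale image) is represented as a vector, linear-algebra operations apply directly.

```python
# A minimal sketch (assuming NumPy): representing a grayscale image as a
# vector so linear-algebra operations apply. The 'image' here is synthetic.
import numpy as np

image = np.random.rand(28, 28)      # e.g. a 28x28 grayscale image
x = image.reshape(-1)               # flatten to a 784-dimensional vector

other = np.random.rand(28, 28).reshape(-1)
# Linear algebra on the vector representation:
print(np.linalg.norm(x - other))    # Euclidean distance between images
print(x @ other)                    # inner product (similarity measure)
```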
In supervised learning, the training set consists of input-target pairs (x^(i), t^(i)), i = 1, . . . , N.
◮ Regression: t is a real number (e.g. stock price)
◮ Classification: t is an element of a discrete set {1, . . . , C}
◮ These days, t is often a highly structured object (e.g. image)
◮ Note: the superscripts in x^(i) and t^(i) index training examples; they have nothing to do with exponentiation! (See the sketch below.)
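A minimal sketch, assuming NumPy, of how this notation maps onto arrays: x^(i) is row i of a matrix X and t^(i) is entry i of a vector t. The numbers are made up for illustration.

```python
# Training set as arrays: X has one row per example, t one target per example.
import numpy as np

X = np.array([[1.0, 2.0],     # x^(1)
              [3.0, 4.0],     # x^(2)
              [5.0, 6.0]])    # x^(3)
t = np.array([1.5, 3.5, 5.5]) # t^(1), t^(2), t^(3) (regression targets)

i = 1
print(X[i], t[i])  # the pair (x^(2), t^(2)); superscripts index examples
```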
Nearest neighbour (1-NN): find the training example closest to the query x and copy its label. With the Euclidean distance
\|x^{(a)} - x^{(b)}\|_2 = \sqrt{\sum_{j=1}^{d} \big(x_j^{(a)} - x_j^{(b)}\big)^2},
the prediction is y = t^*, where
x^* = \arg\min_{x^{(i)} \in \text{train. set}} \|x^{(i)} - x\|_2.
[Pic by Olga Veksler]
Nearest neighbours are sensitive to noise or mis-labeled data (“class noise”). Solution: smooth the predictions by having the k nearest neighbours vote. Algorithm (kNN):
1. Find the k training examples (x^{(r)}, t^{(r)}) closest to the test instance x.
2. Output the class that receives the most votes among the neighbours:
y = \arg\max_{t^{(z)}} \sum_{r=1}^{k} \delta\big(t^{(z)}, t^{(r)}\big),
where \delta(a, b) is 1 if a = b and 0 otherwise.
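A minimal sketch, assuming NumPy, of the kNN classifier just defined; the names (knn_predict, X_train, and so on) are illustrative, not from the slides.

```python
# A minimal sketch (assuming NumPy) of the kNN classifier defined above.
import numpy as np

def knn_predict(X_train, t_train, x, k):
    """Predict the label of query x by majority vote among the k nearest
    training points under Euclidean distance."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to each example
    nearest = np.argsort(dists)[:k]               # indices of the k closest
    labels = t_train[nearest]
    # Majority vote: the arg max over classes of the neighbour label counts.
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]

# Tiny usage example with two well-separated clusters.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
t_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, t_train, np.array([0.2, 0.1]), k=3))  # prints 0
```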
Small k:
◮ Good at capturing fine-grained patterns
◮ May overfit, i.e. be sensitive to random idiosyncrasies in the training data
Large k:
◮ Makes stable predictions by averaging over lots of examples
◮ May underfit, i.e. fail to capture important regularities
◮ By smoothness, p(t | x) is approximately constant for nearby x.
◮ Hence, the labels of the neighbours can be seen as independent samples from p(t | x_q).
◮ y (the nearest neighbour prediction) and t (the true label at x_q) are independent samples from p(t | x_q).
◮ Apply the union bound: with t^* the Bayes-optimal prediction and \varepsilon^* the Bayes error rate,
P(y \neq t) \leq P(y \neq t^*) + P(t \neq t^*) = 2\varepsilon^*.
◮ I.e., the asymptotic error of 1-NN is at most twice the Bayes error.
◮ The counts of the neighbours’ labels (approximately) follow a multinomial distribution with probabilities p(t | x_q).
◮ For large k, the arg max will agree with the Bayes classifier with high probability.
◮ Hence, kNN approaches the Bayes error as k grows, i.e. kNN is Bayes consistent. (A simulation sketch follows below.)
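A minimal sketch, assuming NumPy, that checks this claim empirically on synthetic data with a known Bayes error; all names and numbers are illustrative.

```python
# Two overlapping 1-D Gaussian classes: means 0 and 2, unit variance.
# The Bayes classifier thresholds at x = 1; its error is P(N(0,1) > 1).
import numpy as np

rng = np.random.default_rng(0)
n = 5000
t = rng.integers(0, 2, n)                       # labels
x = rng.normal(loc=t * 2.0, scale=1.0, size=n)  # class-conditional samples

bayes_err = 1 - 0.8413  # approx., standard normal CDF at 1

def knn_error(k, n_test=500):
    errs = 0
    for _ in range(n_test):
        tq = rng.integers(0, 2)
        xq = rng.normal(loc=tq * 2.0, scale=1.0)
        nearest = np.argsort(np.abs(x - xq))[:k]
        vote = np.round(t[nearest].mean())      # majority vote (odd k)
        errs += (vote != tq)
    return errs / n_test

for k in [1, 5, 51]:
    print(f"k={k:3d}: test error approx {knn_error(k):.3f} "
          f"(Bayes approx {bayes_err:.3f})")
```

With k = 1 the error sits near twice the Bayes error, and it drops toward the Bayes error as k grows, matching the analysis above.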
◮ Especially in high dimensions, kNN suffers from the Curse of Dimensionality.
◮ Covering the unit hypercube [0, 1]^d with balls of radius \epsilon requires O((1/\epsilon)^d) balls, so the amount of data needed for a query to have a genuinely near neighbour grows exponentially with d. (See the sketch below.)
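A minimal sketch, assuming NumPy, illustrating the effect: with a fixed number of random points in [0, 1]^d, the distance to the nearest neighbour grows quickly as d increases.

```python
# Curse of dimensionality: nearest-neighbour distances grow with dimension.
import numpy as np

rng = np.random.default_rng(0)
n = 1000                                   # number of training points
for d in [1, 2, 10, 100]:
    X = rng.random((n, d))                 # points uniform in [0, 1]^d
    q = rng.random(d)                      # a random query point
    dists = np.linalg.norm(X - q, axis=1)
    print(f"d={d:4d}: nearest-neighbour distance = {dists.min():.3f}")
```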
◮ Calculate D-dimensional Euclidean distances to all N data points: O(ND)
◮ Sort the distances: O(N log N)
◮ Distance measure: average distance between corresponding points